In the name of Allah, the Most Gracious, the Most Merciful
Our Agenda
Today, we’re going to cover the following points:
MapReduce general view
Why do we need MapReduce?
How does MapReduce work?
Is MapReduce outdated?
References
So, what is MapReduce?
• MapReduce is the main batch processing framework from the Apache Hadoop project.
• Originally developed at Google and described in its 2004 research paper.
• Later implemented in Apache Hadoop by Doug Cutting (2006) with the first release in
2007.
• It is a distributed framework that processes large datasets across multiple machines
(commodity hardware).
• It has three main phases (a worked example follows this list):
o Map: Filters or transforms the input data.
o Shuffle: Organizes data to send related pieces together.
o Reduce: Aggregates or processes the grouped data into results.
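For example, a word-count job over the tiny, hypothetical input “to be or not to be” would flow through the three phases like this:

    Input:    to be or not to be
    Map:      (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    Shuffle:  (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
    Reduce:   (be,2) (not,1) (or,1) (to,2)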
Why do we need MapReduce?
• Handles massive data: Can process datasets ranging from megabytes to petabytes with
the same code.
• Runs on normal hardware: No need for expensive specialized servers.
• Foundation for Big Data evolution: Enabled the rise of Big Data technologies and
adoption by major companies like Yahoo, Facebook, IBM, LinkedIn, etc.
• Inspired easier frameworks: Tools like Apache Hive (developed at Facebook) made it easier for people with SQL knowledge to harness the power of MapReduce.
Now let’s see how MapReduce works.
Map Side (Input → Intermediate Output)
1. Input Splitting
o The input file is split into blocks (e.g., 64 MB in Hadoop 1.x, 128 MB in later versions), and each block is assigned to a Map task.
2. Map Execution
o Each Map task reads its input and applies the Map function, producing key-value pairs (e.g., <word, 1>); see the Mapper sketch after this list.
3. Buffering & Spilling
o Intermediate outputs are stored in memory (a 100 MB buffer by default).
o When ~80% full, data is sorted, optionally combined, and spilled to disk.
4. Merging Spill Files
o All spill files are merged into a single sorted and partitioned output file (one partition per reducer).
o Optional compression can reduce disk usage and network transfer.
5. Map Completion
o The Map task notifies the system that output is ready for reducers.
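As a concrete sketch of what a single Map task runs, here is a minimal word-count Mapper written against the standard Hadoop API (org.apache.hadoop.mapreduce); the class and field names are our own illustration, not part of Hadoop:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Runs once per input split: emits an intermediate <word, 1> pair
    // for every whitespace-separated token it reads.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // buffered, sorted, and spilled as in steps 3-4
                }
            }
        }
    }

The buffer size and spill threshold from step 3 are configurable; in Hadoop 2.x and later the relevant properties are mapreduce.task.io.sort.mb (default 100) and mapreduce.map.sort.spill.percent (default 0.80), and the optional compression from step 4 is controlled by mapreduce.map.output.compress.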
Shuffle Phase (Data Movement)
• Reducers fetch and copy intermediate data from all relevant Map tasks.
• This includes sorting, merging, and optional combiner steps.
• Data is grouped by key and routed to the appropriate reducer.
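The key-to-reducer routing in the last bullet is done by a partitioner. By default Hadoop uses hash partitioning, whose logic is essentially the following simplified sketch (the class name here is illustrative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Simplified version of Hadoop's default hash partitioning:
    // every occurrence of a key goes to the same reducer, so all
    // values for that key end up grouped together.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the partition index is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }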
Reduce Side (Aggregation → Final Output)
1. Fetching & Buffering
o Reducers collect outputs from multiple Mappers (stored in memory or on disk, depending on size).
2. Merging
o Intermediate files are merged, maintaining sorted order by key.
3. Reduce Function Execution
o The reduce function runs once per unique key, aggregating all values (e.g.,
summing counts).
4. Final Output
o Results are written to HDFS or another output destination.
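Continuing the word-count sketch from the Map side, a matching Reducer might look like this (again, the names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per unique word with an iterator over all of its 1s;
    // writes the final <word, total> record.
    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get(); // aggregate all values for this key
            }
            total.set(sum);
            context.write(word, total); // final output record (step 4)
        }
    }

A small driver ties the pieces together and submits the job. Note that the same Reducer can also serve as the optional combiner mentioned on the Map side, because summing partial groups gives the same result as summing everything at once:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // optional partial aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input path in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }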
Now let’s ask: Is MapReduce outdated?
1. Performance:
• It is slower than modern technologies like Apache Spark.
• MapReduce writes intermediate data to disk, which slows down processing, while Spark processes data in memory, making it much faster.
2. Use Cases:
• Best suited for:
o Large batch processing (e.g., massive log analysis, ETL jobs).
• Not ideal for:
o Real-time analytics.
o Applications requiring fast, near-instant responses or interactive machine
learning.
3. Popularity:
• Its popularity has significantly declined with the rise of Apache Spark, Apache Flink, and
other modern Big Data technologies.
• Most new projects favor faster, more flexible frameworks.
4. Flexibility:
• Limited flexibility:
o Difficult to handle iterative or real-time data processing.
o Strict processing flow (Map ➔ Shuffle ➔ Reduce) that is hard to adapt or
optimize dynamically.
5. Current Use:
• Still actively used in:
o Older Hadoop-based infrastructures.
o Cost-sensitive environments where upgrading to faster solutions is not yet
necessary.
So we can say:
➔ MapReduce is not completely outdated, but it is no longer the first choice for modern Big Data processing needs.