MapReduce Paradigm
MapReduce is a programming paradigm that was designed to allow parallel distributed
processing of large sets of data, converting them to sets of tuples, and then combining and
reducing those tuples into smaller sets of tuples.
MapReduce was designed to take big data and, by means of parallel distributed computing, turn it into regular, manageable-sized data.
Parallel distributed processing refers to a powerful framework where mass volumes of data
are processed very quickly by distributing processing tasks across clusters of commodity
servers. With respect to MapReduce, tuples refer to key-value pairs by which data is
grouped, sorted, and processed.
In the map task, you divide your data into key-value pairs, transform it, and filter it. Then you
assign the data to nodes for processing, as in the sketch below.
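To make this concrete, here is a minimal sketch of what a user-defined map function might look like, assuming a classic word-count job in Python; the function name, record format, and cleanup rules are illustrative, not part of any particular MapReduce library:

```python
def map_wordcount(record_id, line):
    """Map task: turn one input record into intermediate key-value pairs.

    record_id -- the input key (for example, a line offset); unused here
    line      -- the input value (a line of text)
    Yields (word, 1) pairs: the word is the key, 1 is the value.
    """
    for word in line.lower().split():
        word = word.strip(".,;:!?")  # simple filtering step
        if word:                     # drop tokens that were all punctuation
            yield (word, 1)
```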
Map the data.
The incoming data must first be converted into key-value pairs and divided into fragments, which
are then assigned to map tasks. Each computing cluster (a group of nodes that are connected to
one another and perform a shared computing task) is assigned a number of map tasks, which are
subsequently distributed among its nodes.
As the key-value pairs are processed, intermediate key-value pairs are generated. The
intermediate key-value pairs are sorted by key, and the sorted list is divided into a new set
of fragments. The number of these new fragments always equals the number of reduce tasks.
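A common convention for producing exactly one fragment per reduce task is to hash each intermediate key modulo the number of reducers. The sketch below assumes that convention; hash partitioning is what Hadoop does by default, but this function is illustrative, not its actual code:

```python
def partition(intermediate_pairs, num_reduce_tasks):
    """Split sorted intermediate (key, value) pairs into one fragment
    per reduce task, using hash(key) mod R as the partitioning rule."""
    fragments = [[] for _ in range(num_reduce_tasks)]
    for key, value in sorted(intermediate_pairs):  # sort by key first
        fragments[hash(key) % num_reduce_tasks].append((key, value))
    return fragments  # len(fragments) == num_reduce_tasks
```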
In the reduce task, you aggregate that data into smaller datasets. Data from the reduce
step is emitted in a standard key-value format, where the key acts as the record identifier
and the value is the data identified by that key. The cluster's computing nodes
process the map and reduce tasks that are defined by the user.
Reduce the data.
Every reduce task has a fragment assigned to it. The reduce task processes its fragment
and produces an output, which is also a key-value pair. Reduce tasks are likewise distributed
among the different nodes of the cluster. After the task is completed, the final output is
written to a file system.
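Continuing the hypothetical word-count example from the map sketch above, a reduce function receives one key together with all of its intermediate values and aggregates them into a single output pair:

```python
def reduce_wordcount(word, counts):
    """Reduce task: aggregate all intermediate values for one key.

    word   -- the intermediate key (a word)
    counts -- an iterable of the 1s emitted by the map tasks
    Returns a single output pair: (word, total occurrences).
    """
    return (word, sum(counts))
```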
In short, you can quickly and efficiently boil down and begin to make sense of a huge volume,
velocity, and variety of data by using map and reduce tasks to tag your data by (key, value) pairs,
and then reduce those pairs into smaller sets of data through aggregation operations —
operations that combine multiple values from a dataset into a single value.
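To tie the pieces together, here is a hedged, single-process sketch of the whole map-shuffle-reduce cycle on a toy input; a real framework performs the same steps across many nodes. It reuses the hypothetical map_wordcount and reduce_wordcount functions sketched earlier:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """A toy, single-process MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record_id, value in records:
        for key, val in map_fn(record_id, value):  # map phase
            groups[key].append(val)                # shuffle: group by key
    # reduce phase: one call per key, output sorted by key
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

lines = enumerate(["the quick brown fox", "the lazy dog", "the fox"])
print(run_mapreduce(lines, map_wordcount, reduce_wordcount))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```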
[Figure: A diagram of the MapReduce architecture]
As a concrete application of the paradigm, consider a MapReduce-based k-nearest-neighbor
(k-NN) search over a large collection of indexed descriptors. Its search phase is geared
toward throughput, since it very efficiently processes large batches of queries, typically
10^4 to 10^7 query descriptors.
The search also requires a preliminary step, the creation of a lookup table, in which all
query descriptors of a batch are grouped according to their closest representative, found
by traversing the index tree.
This lookup table is written to the local disk of every node that will perform the search.
Each map task receives (i) a block of data from one of the previously created index files
and (ii) the file containing the lookup table.
The mapper processes only those descriptors in its assigned chunk of data that are relevant
to the queries.
Distance calculations are performed for the descriptors and queries assigned to the same
cluster identifier.
k-NN results are eventually emitted by the mappers and then aggregated by reducers to
create the final result for the query batch.
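The sources do not include code for this search phase, but a rough Python sketch of the idea might look like the following; the names, the data layouts, and the use of plain Euclidean distance are all assumptions made for illustration:

```python
import heapq
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_mapper(index_block, lookup_table):
    """index_block  -- list of (cluster_id, descriptor_id, vector) records
                       read from one block of an index file
    lookup_table    -- {cluster_id: [(query_id, query_vector), ...]},
                       the precomputed grouping of the query batch
    Emits (query_id, (distance, descriptor_id)) pairs, computing distances
    only where descriptor and query share a cluster identifier."""
    for cluster_id, desc_id, vec in index_block:
        for query_id, qvec in lookup_table.get(cluster_id, []):
            yield (query_id, (euclidean(qvec, vec), desc_id))

def search_reducer(query_id, candidates, k):
    """Merge every candidate (distance, descriptor_id) pair emitted for
    one query into its final k nearest neighbors."""
    return (query_id, heapq.nsmallest(k, candidates))
```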
If your data doesn't lend itself to being tagged and processed via keys, values, and
aggregation, then MapReduce generally isn't a good fit for your needs.
If you're using MapReduce as part of a Hadoop solution, the final output is written to
the Hadoop Distributed File System (HDFS). HDFS is a file system that spans clusters of
commodity servers used to store big data. HDFS makes big data handling and storage
financially feasible by distributing storage tasks across clusters of cheap commodity servers.
The map and reduce functions themselves do not address the parallelization and execution of
MapReduce jobs. That is the responsibility of the MapReduce framework, which automatically
takes care of distributing the input data as well as scheduling and managing the map and
reduce tasks.
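To make that division of labor concrete, here is a minimal sketch of the kind of work the framework automates, using Python's standard process pool to run map tasks over input fragments in parallel; the fragment layout and task function are illustrative, not Hadoop's actual machinery:

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def run_map_task(fragment):
    """One map task: runs on a worker process over one input fragment."""
    pairs = []
    for record_id, line in fragment:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

if __name__ == "__main__":
    fragments = [
        [(0, "the quick brown fox")],
        [(1, "the lazy dog"), (2, "the fox")],
    ]
    # The "framework" part: distribute fragments to workers, collect
    # results, then group by key, which is what MapReduce automates
    # at cluster scale.
    with ProcessPoolExecutor() as pool:
        groups = defaultdict(list)
        for pairs in pool.map(run_map_task, fragments):
            for key, value in pairs:
                groups[key].append(value)
    print({k: sum(v) for k, v in sorted(groups.items())})
    # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

At cluster scale, the framework additionally handles data locality, retries, and fault tolerance, none of which appear in this toy version.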