Chapter 5: NoSQL Data Management and MongoDB: Unit-2
Applications:
Document databases are general purpose, useful for a wide variety of applications due to the
flexibility of the data model, the ability to query on any field and the natural mapping of the document
data model to objects in modern programming languages.
Characteristic features:
1. Extension of key-value model, where value is a structured document
2. Documents can be highly complex, hierarchical data structures without requiring pre-defined
"schema"
3. Supports queries on structured documents
4. Search platforms are also document-oriented
5. Applications that need to manage a large variety of objects that differ in structure
6. Large product catalogs in e-commerce, customer profiles, content management applications
Weaknesses:
1. No standard query syntax
2. Query performance not linearly scalable
3. Join queries across collections not efficient
Applications
Some popular use cases for document based data stores are:
Nested information: Document-based data stores allow you to work with deeply nested,
complex data structures.
JavaScript friendly: One of the most important capabilities of document-based data stores is
the way they interface with applications: using JavaScript-friendly JSON.
Examples:
1. MongoDB: an extremely popular and highly functional document database.
2. CouchDB: a ground-breaking document-based data store.
3. Couchbase: a JSON-based, Memcached-compatible document-based data store.
4. Apache Solr
5. Elasticsearch
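Since MongoDB is revisited later in this chapter, a short hedged sketch with the PyMongo driver may help make the "query on any field" and nested-document points concrete. It assumes a MongoDB server on localhost with pymongo installed; the database, collection and field names are purely illustrative.

    # Minimal sketch: schema-less documents and querying on an arbitrary (nested) field.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    catalog = client["shop"]["products"]   # hypothetical database and collection names

    # Two documents in the same collection may have completely different structures.
    catalog.insert_one({"name": "Laptop", "specs": {"ram_gb": 16, "cpu": "i7"},
                        "tags": ["electronics"]})
    catalog.insert_one({"name": "T-shirt", "size": "M", "tags": ["apparel"]})

    # Any field, including a nested one, can be used in a query.
    for doc in catalog.find({"specs.ram_gb": {"$gte": 8}}):
        print(doc["name"])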
5.2.2 Key/Value
The next popular data system is the key-value scheme. If you are used to programming languages, the
key-value system will be familiar to you. The key-value stores a key, which is the identifier for the value.
You look up the data (the value) using its key. Again, there are no constraints on the data, because
you simply have key-value pairs for each element.
We will begin our NoSQL modeling journey with key / value based database management simply
because they can be considered the most basic and backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary; there is no further
structure or relation. After connecting to the database server (e.g. Redis), an application can specify a key (e.g.
the_answer_to_life) and provide a matching value (e.g. 42) which can later be retrieved the same way by
supplying the key.
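As a minimal sketch of this round trip, assuming a local Redis server and the redis-py client (pip install redis):

    # Store a value under a key and read it back.
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)
    r.set("the_answer_to_life", 42)          # the key identifies the value
    print(r.get("the_answer_to_life"))       # b'42' -- values come back as bytes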
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic
ones after performing, for example, a CPU and memory intensive computation. They are extremely
performant, efficient and usually easily scalable.
These systems provide the ability to retrieve and update data based only on a single or a limited range of
primary keys. For querying on other values, users are encouraged to build and maintain their own
indexes. Some products provide limited support for secondary indexes, but with several caveats. To
perform an update in these systems, multiple round trips may be necessary: first find the record, then
update it, then update the index. In these systems, the update may be implemented as a complete rewrite
of the entire record irrespective of whether a single attribute has changed, or the entire record.
Characteristic Features
The biggest difference among non-relational databases lies in their ability to query data
efficiently.
Document databases provide the richest query functionality, which allows them to address a
wide variety of operational and real-time analytics applications.
Key-value stores and wide column stores provide a single means of accessing data: by primary
key. This can be fast, but they offer very limited query functionality and may impose additional
development costs and application-level requirements to support anything more than basic query
patterns.
The simplest model where each object is retrieved with a unique key, with values having no
inherent model
Utilize in-memory storage to provide fast access with optional persistence
Other data models built on top of this model to provide more complex objects
Applications requiring fast access to a large number of objects, such as caches or queues
Applications that require fast-changing data environments like mobile, gaming, online ads
Weaknesses
Cannot update subset of a value
Does not provide querying
As number of objects becomes large, generating unique keys could become complex
Applications:
Key value stores and wide column stores are useful for a narrow set of applications that only
query data by a single key value. The appeal of these systems is their performance and
scalability, which can be highly optimized due to the simplicity of the data access patterns and
opacity of the data itself.
Some popular use cases for key / value based data stores are:
1. Caching: quickly storing data for (sometimes frequent) future use.
2. Queueing: some K/V stores (e.g. Redis) support lists, sets, queues and more (see the sketch after this list).
3. Distributing information / tasks: they can be used to implement Pub/Sub.
4. Keeping live information: applications which need to keep state can use K/V stores easily.
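A sketch of the queueing use case, again assuming redis-py and a local Redis server; the queue name and job payloads are made up for illustration:

    # Using a Redis list as a simple work queue.
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)
    r.rpush("jobs", "resize-image-17", "send-email-42")   # producers append jobs to the tail
    job = r.lpop("jobs")                                  # a worker pops the oldest job from the head
    print(job)                                            # b'resize-image-17'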
Examples: Riak : Riak KV is a distributed NoSQL database that is highly available, scalable and
easy to operate. It automatically distributes data across the cluster to ensure fast performance and
fault-tolerance. Riak KV Enterprise includes multi-cluster replication ensuring low-latency and
robust business continuity. Other examples include Redis, MemcacheDB, BerkeleyDB, DynamoDB,
etc.
Characteristic Features
The wide column model provides more granular access to data than the key value model, but less
flexibility than the document data model.
Extension of key-value model, where the value is a set of columns (column-family)
A column can have multiple time-stamped versions.
Columns can be generated at run-time and not all rows need to have all columns
Storing a large number of time-stamped data like event logs, sensor data
Analytics that involve querying entire columns of data such as trends or time series analytics.
Weakness:
No join queries or sub-queries
Limited support for aggregation
Ordering is done per partition, specified at table creation time
Applications:
Some popular use cases for column-based data stores are:
Keeping unstructured, non-volatile information: If a large collection of attributes and values
needs to be kept for long periods of time, column-based data stores come in extremely handy.
Scaling: Column-based data stores are highly scalable by nature. They can handle a huge
amount of information.
1. HBase: data store for Apache Hadoop based on ideas from BigTable
2. Cassandra: column-based data store based on ideas from BigTable and DynamoDB
3. Apache Accumulo
Applications:
Graph databases are useful in cases where traversing relationships is core to the application,
like navigating social network connections, network topologies or supply chains.
Some popular use cases for graph based data stores are:
Handling complex relational information: As explained in the introduction, graph databases
make it extremely efficient and easy to deal with complex, highly relational information, such
as the connections between two entities and various degrees of other entities indirectly related to
them.
Modelling and handling classifications: Graph databases excel in any situation where
relationships are involved. Modelling data and classifying various information in a relational way
can be handled very well using these types of data stores.
Some popular graph-based data stores are:
1. OrientDB: a very fast graph and document hybrid NoSQL data store written in Java that
comes with different operational modes.
2. Neo4j: a schema-free, extremely popular and powerful Java-based graph data store.
3. Apache Giraph
4. AllegroGraph
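As a hedged illustration of relationship traversal, here is a sketch using the official Neo4j Python driver; the connection details, labels, relationship type and property names are all assumptions made for the example.

    # Sketch: create and traverse a FRIEND relationship (assumes `pip install neo4j`
    # and a local Neo4j instance; credentials are placeholders).
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        session.run("CREATE (a:Person {name: $a})-[:FRIEND]->(b:Person {name: $b})",
                    a="Alice", b="Bob")
        result = session.run("MATCH (:Person {name: $a})-[:FRIEND]->(f) RETURN f.name AS friend",
                             a="Alice")
        for record in result:
            print(record["friend"])
    driver.close()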
5.2.5 Relational
5.4.1 Sharding
NoSQL databases also offer a concept called “sharding.” With relational databases, you need to
scale vertically. This means that when you host a database, you typically host it on one server;
when you need more resources, you add more resources to that server or replace it with a bigger one.
With NoSQL, you can "shard" your database files. This basically means that you can
spread database files across multiple servers. Sharding is usually done on very fast storage hardware
such as a SAN or a NAS. Using several database servers at once speeds up your queries,
especially when you have millions of rows in your data sets.
You can implement cloud computing along with sharding. This makes your queries especially
fast when you need reports available for several different locations. Using old techniques, your
users would log in to the network and pull reports from miles away, sometimes in other
countries. The result was that the user sometimes needed to wait several minutes for a report to
render. With cloud computing, reporting data is served up from the closest data center to your
users. The result is faster data processing and rendering.
Not only does NoSQL support manual sharding, but NoSQL servers will also do automatic
sharding. The database server will automatically spread a dynamic amount of data across several
servers, so the load on each server individually is reduced.
MongoDB
6.1 What is MongoDB?
MongoDB is:
1. Cross-platform
2. Open source
3. Non-relational
4. Distributed
5. NoSQL
6. A document-oriented data store
Let us trace the journey from .csv to XML to JSON. First, let us look at how data is stored in a .csv file.
Assume that this data is about the employees of an organization named “XYZ”. As can be seen below,
the column values are separated using commas and the rows are separated by a carriage return.
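A hypothetical extract of such a file might look like this (all names and numbers are made up for illustration):

    EmpId,Name,ContactNo
    E101,Anita Rao,+91-9811111111
    E102,Ravi Kumar,+91-9822222222
    E103,Meena Shah,+91-9833333333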
Now assume that a few employees have more than one ContactNo. It can be neatly classified as
OfficeContactNo and HomeContactNo. But what if a few employees have more than one OfficeContactNo
and more than one HomeContactNo? OK, so this is the first issue we need to address.
Let us look at just another piece of data that you wish to store about the employees. You need to store
their email addresses as well. Here again we have the same issue: a few employees have two email
addresses, a few have three, and a few have more than three email addresses as well.
As we come across these fields or columns, we realize that it gets messy with .csv. CSV files are known to
store data well if it is flat and does not have repeating values.
The problem becomes even more complex when different departments maintain the details of their
employees. The formats of .csv (columns, etc.) could vastly differ, and it will call for some effort before
we can merge the files from the various departments to make a single file.
This problem can be solved by XML. But as the name suggests XML is highly extensible. It does not just
call for defining a data format, rather it defines how you define a data format. You may be prepared to
undertake this cumbersome task for highly complex and structured data; however, for simple data
exchange it might just be too much work.
Enter JSON! Let us look at how it reacts to the problem at hand.
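For instance, the same hypothetical employee record, now with several contact numbers and email addresses, could be written in JSON roughly as follows (field names and values are illustrative only):

    {
      "EmpId": "E101",
      "Name": "Anita Rao",
      "ContactNo": ["+91-9811111111", "+91-8022222222"],
      "Email": ["anita@xyz.com", "anita.rao@xyz.com", "arao@example.com"]
    }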
As you can see, it is quite easy to read JSON. There is absolutely no confusion now. One can have a list
of n contact numbers, and they can be stored with ease.
JSON is very expressive. It provides the much needed ease to store and retrieve documents in their real
form. The binary form of JSON is BSON. BSON is an open standard. In most cases it consumes less
space as compared to the text-based JSON. There is yet another advantage with BSON. It is much easier
and quicker to convert BSON to a programming language’s native data format. There are MongoDB
drivers available for a number of programming languages such as C, C++, Ruby, PHP, Python, C#, etc.,
and each works slightly differently. Using the basic binary format enables the native data structures to be
built quickly for each language without going through the hassle of first processing JSON.
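As a brief sketch of this driver-level conversion (assuming PyMongo and a local server; the names are hypothetical), a native Python dict is stored as BSON and comes back as a dict without any manual JSON handling:

    # The driver converts native dicts to BSON on write and back to dicts on read.
    from pymongo import MongoClient

    employees = MongoClient()["xyz"]["employees"]        # hypothetical database/collection
    employees.insert_one({"Name": "Anita Rao",
                          "ContactNo": ["+91-9811111111", "+91-8022222222"]})
    print(employees.find_one({"Name": "Anita Rao"}))     # returned as a plain Python dict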
6.2.6 Sharding
Sharding is akin to horizontal scaling. It means that the large dataset is divided and distributed over
multiple servers or shards. Each shard is an independent database and collectively they would constitute a
logical database.
The prime advantages of sharding are as follows:
1. Sharding reduces the amount of data that each shard needs to store and manage. For example, if the
dataset was 1 TB in size and we were to distribute this over four shards, each shard would house just 256
GB data. Refer Figure 6.4. As the cluster grows, the amount of data that each shard will store and
manage will decrease.
2. Sharding reduces the number of operations that each shard handles. For example, if we were to insert
data, the application needs to access only that shard which houses that data.
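As a hedged sketch of how a collection might be sharded (this assumes a running sharded cluster reached through a mongos router; the database, collection and shard-key choices are illustrative):

    # Enable sharding for a database and shard one collection on a hashed key.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")   # assumed mongos address
    client.admin.command("enableSharding", "xyz")        # allow the "xyz" database to be sharded
    client.admin.command("shardCollection", "xyz.employees",
                         key={"_id": "hashed"})          # hashed shard key spreads documents evenly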
When talking of data stores, we first think of Relational Databases with structured data storage
and a sophisticated query engine. However, a Relational Database incurs a big penalty to
maintain performance as the data size increases. HBase, on the other hand, is designed from the
ground up to provide scalability and partitioning to enable efficient data structure serialization,
storage and retrieval. Broadly, the differences between a Relational Database and HBase are:
6.2.4 HBase :
1. Is Schema-less
2. Is a Column-oriented datastore
3. Is designed to store Denormalized Data
4. Contains wide and sparsely populated tables
5. Supports Automatic Partitioning
HDFS :
1. Is suited for high-latency, batch-processing operations
2. Data is primarily accessed through MapReduce
3. Is designed for batch processing and hence doesn’t have a concept of random reads/writes
HBase :
1. Is built for Low Latency operations
2. Provides access to single rows from billions of records
3. Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift
Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored
in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These
Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the
same number of Regions.
The HMaster in HBase is responsible for:
1. Performing Administration
2. Managing and Monitoring the Cluster
3. Assigning Regions to the Region Servers
4. Controlling the Load Balancing and Failover
Rows: A row is one instance of data in a table and is identified by a rowkey. Rowkeys are
unique in a Table and are always treated as a byte[].
Column Families: Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and these Columns in a family are stored together in a low-level
storage file known as an HFile. Column Families form the basic unit of physical storage to which
certain HBase features like compression are applied. Hence it is important that proper care be
taken when designing the Column Families in a table. The table above shows Customer and Sales
Column Families. The Customer Column Family is made up of 2 columns (Name and City),
whereas the Sales Column Family is made up of 2 columns (Product and Amount).
Cell: A Cell stores data and is essentially a unique combination of rowkey, Column Family and
the Column (Column Qualifier). The data stored in a Cell is called its value and the data type is
always treated as byte[].
Version: The data stored in a cell is versioned and versions of data are identified by the
timestamp. The number of versions of data retained in a column family is configurable and this
value by default is 3.
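A minimal sketch of these concepts using the happybase Python client; it assumes an HBase Thrift server on localhost and a hypothetical "customer" table with "Customer" and "Sales" column families:

    # Rowkey + "columnfamily:qualifier" access with happybase (`pip install happybase`).
    import happybase

    connection = happybase.Connection("localhost")
    table = connection.table("customer")

    # Write one row: values are grouped by column family (Customer, Sales) and stored as bytes.
    table.put(b"row-001", {b"Customer:Name": b"John", b"Customer:City": b"Pune",
                           b"Sales:Product": b"Chair", b"Sales:Amount": b"1200"})

    # Read the whole row back, or the last few timestamped versions of one cell.
    print(table.row(b"row-001"))
    print(table.cells(b"row-001", b"Sales:Amount", versions=3))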
In 2006, the Google Labs team published a paper entitled BigTable: A Distributed Storage
System for Structured Data. It describes a distributed index designed to manage very large
datasets ("petabytes of data") in a cluster of data servers. BigTable supports key search, range
search and high-throughput file scans, and also provides a flexible storage for structured data.
HBase is an open-source clone of BigTable, and closely mimics its design. At Internet Memory,
we use HBase as a large-scale repository for our collections, holding terabytes of web documents
in a distributed cluster. HBase is often likened to a large, distributed relational database. It
actually presents many aspects common to "NoSQL" systems: distribution, fault tolerance,
flexible modeling, absence of some features deemed essential in centralized DBMS (e.g.,
concurrency), etc. This article presents the data model of HBase, and explains how it stands
between relational DBs and the "No Schema" approach. It will be completed by an introduction
to both the Java and REST APIs, and a final article on system aspects.
One obtains what is commonly called an associative array, a dictionary, or a map. Given a
context (the object/document), the structure associates values to keys. We can represent such
data as a graph, as shown by the figure below. The key information is captured by edges, whereas
data values reside at leaves.
There exist many possible representations of a map. We showed a JSON example above, but
XML is of course an appropriate choice. At first glance, a map can also be represented by a
table: the above example can equivalently be viewed as a two-column table mapping each key to its value.
However, this often introduces some confusion. It is worth understanding several important
differences that make a map much more flexible than the strict (relational) table structure. In
particular,
there is no schema that constrains the list of keys (unlike a relational table, where the
schema is fixed and uniform for all rows),
the value may itself be some complex structure.
HBase, following BigTable, builds on this flexibility. First, we can add a new key-value pair to
describe an object, if needed. This does not require any pre-declaration at the 'schema' level, and
the new key remains local. Other objects stored in the same HBase instance remain unaffected by
the change. Second, a value can be another map, yielding a multi-map structure which is
exemplified below.
(2) An HBase "table" is a multi-map structure
Instead of keeping one value for each property of an object, HBase allows the storage of
several versions. Each version is identified by a timestamp. How can we represent such a
multi-version, key-value structure? HBase simply replaces atomic values by a map where the key is
the timestamp. The extended representation for our example is shown below. It helps to figure
out the power and flexibility of the data representation. Now, our document is built from two
nested maps,
a first one, called "column" in HBase terminology (an unfortunate choice, since this is
hardly related to the column relational concept),
and a second one, called "timestamp" (each map is named after its key).
Our document is globally viewed as a column map. If we choose a column key, say, type, we
obtain a value which is itself a second map featuring as many keys as there are timestamps for
this specific column. In our example, there is only one timestamp for url (well, we can assume
that the URL of the document does not change much). Looking at, respectively, type and
content, we find the former has two versions and the latter three. Moreover, they only have one
timestamp (t1) in common. Actually, the "timestamp" maps are completely independent from
one another.
Note that we can add as many timestamps (hence, as many keys in one of the second-level maps)
as we wish. And, in fact, this is true for the first-level map as well: we can add as many columns
as we wish, at the document level, without having to impose such a change on all other
documents in the same HBase instance. In essence, each object is just a self-described piece of
information (think again of the flexible representation of semi-structured data formats like XML
or JSON). In this respect, HBase is in the wake of other 'NoSQL' systems, and its data model
shares many aspects that characterize this trend: no schema and self-description of objects. We
are not completely done with the multi-map levels of HBase. Columns are grouped in column
families, and a family is actually a key in a new map level, referring to a group of columns. In
the Figure below, we define two families: meta, grouping url and type, and data representing the
content of a document.
Finally, it is worth noting that the "table" map is a sorted map: rows are grouped on the key
value, and two rows close to one another (with respect to the key order) are stored in the
same physical area. This makes range queries on keys possible (and efficient). We further explore
this feature in the article devoted to the system aspects of HBase. The following figures
summarize our structure for a hypothetical webdoc HBase table storing a large collection of
web documents. Each document is indexed by its url (which is therefore the key of the highest
level map). A row is itself a local map featuring a fixed number of keys defined by the family
names (f1, f2, etc.), associated with values which are themselves maps indexed by columns.
Finally, column values are versioned, and represented by a timestamp-index map. Columns and
timestamps do not obey a global schema: they are defined on a row basis. The columns may
vary arbitrarily from one row to another, and so do the timestamps for columns.
The multi-map structure of an HBase table can thus be summarized as
key -> family -> column -> timestamp -> value
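In Python terms, one row of such a table can be pictured simply as nested dictionaries; the URL, families, columns, timestamps and values below are all made up for illustration:

    # One "row" of the hypothetical webdoc table as nested maps:
    # rowkey -> family -> column -> timestamp -> value
    row = {
        "http://example.com/page": {                   # rowkey (the document URL)
            "meta": {                                  # column family "meta"
                "url":  {1700000001: "http://example.com/page"},
                "type": {1700000001: "text/html", 1700000300: "application/xhtml+xml"},
            },
            "data": {                                  # column family "data"
                "content": {1700000001: "<html>...</html>"},
            },
        }
    }

    # Reading a cell means walking the maps: rowkey, family, column, then the wanted timestamp.
    latest_ts = max(row["http://example.com/page"]["meta"]["type"])
    print(row["http://example.com/page"]["meta"]["type"][latest_ts])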
It should be clear that the intuitive meaning of common concepts such as "table", "row", and
"column" must be revisited when dealing with HBase data. In particular, considering HBase as a
kind of large relational database is clearly a misleading option. HBase is essentially a key-value
store with efficient indexing on key access, a semi-structured data model for value
representation, and range search capabilities supported by key ordering.
We are used to querying databases when it comes to random access for structured data.
RDBMSes are the most prominent, but there are also quite a few specialized variations and
implementations, like object-oriented databases. Most RDBMSes strive to implement Codd’s 12
rules, which forces them to comply with very rigid requirements. The architecture used underneath
is well researched and has not changed significantly in quite some time. The recent advent of
different approaches, like column oriented or massively parallel processing (MPP) databases, has
shown that we can rethink the technology to fit specific workloads, but most solutions still
implement all or the majority of Codd’s 12 rules in an attempt to not break with tradition.
The reason to store values on a per-column basis instead is based on the assumption that, for
specific queries, not all of the values are needed. This is often the case in analytical databases in
particular, and therefore they are good candidates for this different storage schema.
Reduced I/O is one of the primary reasons for this new layout, but it offers additional advantages
playing into the same category: since the values of one column are often very similar in nature or
even vary only slightly between logical rows, they are often much better suited for compression
than the heterogeneous values of a row-oriented record structure; most compression algorithms
only look at a finite window.
Facebook, for example, is adding more than 15 TB of data into its Hadoop cluster every day and
is subsequently processing it all. One source of this data is click-stream logging, saving every
step a user performs on its website, or on sites that use the social plug-ins offered by Facebook.
This is an ideal case in which batch processing to build machine learning models for predictions
and recommendations is appropriate.
Facebook also has a real-time component, which is its messaging system, including chat, wall
posts, and email. This amounts to 135+ billion messages per month, and storing this data over a
certain number of months creates a huge tail that needs to be handled efficiently. Even though the
larger parts of emails (for example, attachments) are stored in a secondary system, the amount
of data generated by all these messages is mind-boggling. If we were to take 140 bytes per
message, as used by Twitter, it would total more than 17 TB every month. Even before the
transition to HBase, the existing system had to handle more than 25 TB a month.
In addition, less web-oriented companies from across all major industries are collecting an ever
increasing amount of data. For example:
1. Financial : Such as data generated by stock tickers
2. Bioinformatics : Such as the Global Biodiversity Information Facility or Genomes of various
species
3. Smart grid :Such as the OpenPDC project
4. Sales : Such as the data generated by point-of-sale (POS) or stock/inventory systems
5. Cellular services, military, environmental: all of which collect a tremendous amount of data as
well
Storing petabytes of data efficiently so that updates and retrieval are still performed well is no
easy feat. We will now look deeper into some of the challenges.
NoSQL Database
A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational databases.
These databases are schema-free, support easy replication, have simple APIs, are eventually
consistent, and can handle huge amounts of data.
The primary objective of a NoSQL database is to have
simplicity of design
horizontal scaling, and
finer control over availability.
NoSQL databases use different data structures compared to relational databases, which makes
some operations faster in NoSQL. The suitability of a given NoSQL database depends on the
problem it must solve.
Besides Cassandra, we have the following NoSQL databases that are quite popular:
Apache HBase - HBase is an open source, non-relational, distributed database
modeled after Google’s BigTable and is written in Java. It is developed as a part of
Apache Hadoop project and runs on top of HDFS, providing BigTable-like
capabilities for Hadoop.
MongoDB - MongoDB is a cross-platform document-oriented database system that
avoids using the traditional table-based relational database structure in favor of
JSON-like documents with dynamic schemas, making the integration of data in certain
types of applications easier and faster.
6.8.1 What is Apache Cassandra?
The design goal of Cassandra is to handle big data workloads across multiple nodes without any
single point of failure. Cassandra has peer-to-peer distributed system across its nodes, and data is
distributed among all the nodes in a cluster.
All the nodes in a cluster play the same role. Each node is independent and at the same time
interconnected to other nodes. Each node in a cluster can accept read and write requests,
regardless of where the data is actually located in the cluster. When a node goes down, read/write
requests can be served from other nodes in the network.
Cassandra has become so popular because of its outstanding technical features, such as the
peer-to-peer architecture with no single point of failure described above. (As an aside on the name: in
Greek mythology, Cassandra was the daughter of King Priam (Priamos) and Queen Hecuba (Hekabe), the
fraternal twin sister of Helenus, and a princess of Troy.) Cassandra is in use at Constant Contact,
CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather
Channel, and over 1500 more companies that have large, active data sets.
One of the largest production deployments is Apple's, with over 75,000 nodes storing over 10
PB of data. Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1
trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million
requests per day), and eBay (over 100 nodes, 250 TB).
The following figure shows a schematic view of how Cassandra uses data replication among the
nodes in a cluster to ensure there is no single point of failure.
SuperColumn
A super column is a special column, therefore, it is also a key-value pair. But a super column
stores a map of sub-columns.
Generally column families are stored on disk in individual files. Therefore, to optimize
performance, it is important to keep columns that you are likely to query together in the same
column family, and a super column can be helpful here. Given below is the structure of a super
column.
Refer Table 7.3 for built-in data types for columns in CQL.
7.4 CQLSH
7.4.1 Logging into cqlsh
The screenshot below depicts the cqlsh command prompt after logging in successfully with the cqlsh command.
7.5 Keyspaces
What is a keyspace?
A keyspace is a container to hold application data. It is comparable to a relational database. It is
used to group column families together. Typically, a cluster has one keyspace per application.
Replication is controlled on a per keyspace basis. Therefore, data that has different replication
requirements should reside on different keyspaces.
When one creates a keyspace, it is required to specify a strategy class. There are two choices
available: either the "SimpleStrategy" class or the "NetworkTopologyStrategy" class. While using
Cassandra for evaluation purposes, go with the "SimpleStrategy" class; for production usage, work
with the "NetworkTopologyStrategy" class.
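A hedged sketch of creating such a keyspace from Python with the DataStax cassandra-driver; it assumes a single local Cassandra node, and the keyspace name and replication factor are illustrative:

    # Create a keyspace with SimpleStrategy (`pip install cassandra-driver`).
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app_data
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    # For production, NetworkTopologyStrategy with per-datacenter replication factors is preferred.
    cluster.shutdown()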
MAP REDUCE
7.1 INTRODUCTION TO MAPREDUCE
7.1.1 Overview
MapReduce is a framework for processing parallelizable problems across huge datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous hardware).
Processing can occur on data stored either in a filesystem (unstructured) or in a database
(structured). MapReduce can take advantage of locality of data, processing it on or near the
storage assets in order to reduce the distance over which it must be transmitted.
"Map" step: Each worker node applies the "map()" function to the local data, and writes
the output to a temporary storage. A master node orchestrates that for redundant copies of
input data, only one is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker
node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that
each mapping operation is independent of the others, all maps can be performed in parallel –
though in practice this is limited by the number of independent data sources and/or the number of
CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided
that all outputs of the map operation that share the same key are presented to the same reducer at
the same time, or that the reduction function is associative. While this process can often appear
inefficient compared to algorithms that are more sequential, MapReduce can be applied to
significantly larger datasets than "commodity" servers can handle – a large server farm can use
MapReduce to sort a petabyte of data in only a few hours.[12] The parallelism also offers some
possibility of recovering from partial failure of servers or storage during the operation: if one
mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
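To make the map, shuffle and reduce steps concrete, here is a tiny, purely in-memory word-count sketch in plain Python; no Hadoop is involved, and the input data and names are made up:

    # A toy word count that mimics the map -> shuffle -> reduce flow in a single process.
    from collections import defaultdict

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # "Map" step: emit (key, value) pairs from each input record.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # "Shuffle" step: group all values belonging to the same key together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # "Reduce" step: combine each key's values into a single result.
    counts = {key: sum(values) for key, values in groups.items()}
    print(counts)   # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}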
Let us try to understand the two tasks Map & Reduce with the help of a small diagram –
7.2.1 MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is nearly 3000 tweets per second. The
following illustration shows how Twitter manages its tweets with the help of
MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions –
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
7.2.3 JobTracker
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes
in the cluster, ideally the nodes that have the data, or at least are in the same rack.
1. Client applications submit jobs to the Job tracker.
2. The JobTracker talks to the NameNode to determine the location of the data
3. The JobTracker locates TaskTracker nodes with available slots at or near the data
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough,
they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to
do then: it may resubmit the job elsewhere, it may mark that specific record as something to
avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all
running jobs are halted.
As the name MapReduce implies, the reduce task is always completed after the map task. A
MapReduce job splits the input data set into independent chunks that are processed by map tasks,
which run in parallel. These bits, known as tuples, are key/value pairs. The reduce task takes the
output from the map task as input, and combines the tuples into a smaller set of tuples.
A set of programs that run continuously, known as TaskTracker agents, monitor the status of
each task. If a task fails to complete, the status of that failure is reported to the JobTracker
program, which reschedules the task on another node in the cluster.
This distribution of work enables map tasks and reduce tasks to run on smaller subsets of larger
data sets, which ultimately provides maximum scalability. The MapReduce framework also
maximizes parallelism by manipulating data stored across multiple clusters. MapReduce
applications do not have to be written in Java™, though most MapReduce programs that run
natively under Hadoop are written in Java.
Hadoop MapReduce runs on the JobTracker and TaskTracker framework from Hadoop version
1.1.1, and is integrated with the new Common and HDFS features from Hadoop version 2.2.0.
The MapReduce APIs used in InfoSphere® BigInsights™ are compatible with Hadoop version
2.2.0.
7.3 MAP OPERATIONS
7.3.1 What is a Map operation?
Doing something to every element in an array is a common operation:
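For instance, a minimal illustration in plain Python:

    # Map: apply a function to every element of an array, producing a new array.
    numbers = [1, 2, 3, 4]
    doubled = [n * 2 for n in numbers]   # equivalent to list(map(lambda n: n * 2, numbers))
    print(doubled)                       # [2, 4, 6, 8]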
7.3.2 What is a Reduce Operation?
A MapReduce operation has the following characteristic features:
Submitting a MapReduce job
Distributed Mergesort Engine
Two fundamental data types
Fault tolerance
Scheduling
Task execution
Another common operation on arrays is to combine all their values:
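For instance, again in plain Python:

    # Reduce: combine all the values of an array into a single result.
    from functools import reduce

    numbers = [1, 2, 3, 4]
    total = reduce(lambda acc, n: acc + n, numbers, 0)   # same result as sum(numbers)
    print(total)                                         # 10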
7.3.3 Submitting a MapReduce Job
Review Questions:
NoSQL:
1. Explain in detail the categories of NoSQL databases. -12M
2. What is NoSQL? Explain the benefits of NoSQL. -5M
3. Write the differences between NoSQL and RDBMS. -5M
MongoDB:
1. What is MongoDB? Why is it required? -8M
2. Explain the process of replication and sharding in MongoDB. -6M
3. Write the following MongoDB commands with examples: 2M*3=6M
i) create database
ii) drop database
iii) show
4. Differentiate, with examples, the CRUD operations in RDBMS and MongoDB. -8M
MapReduce:
1. What is MapReduce? Explain the 5 steps in a parallel processing computation. -6M
2. Explain the working of MapReduce. -8M
3. Explain with a diagram the MapReduce job submission task. -8M