
UNIT-2

Chapter 5: NoSQL Data Management and MongoDB

5.1 NOSQL DATA MANAGEMENT


For over a decade, the de facto standard for database design was the relational model. Relational databases
use primary and foreign keys and enforce strict constraints when you manipulate the tables' data. These
databases are good for smaller data storage requirements, but you need "big data" capabilities to manage
very large data sets and queries. This is the goal of NoSQL.
The term "NoSQL" was coined over a decade ago, amusingly enough as the name of yet another relational
database. That database, however, was built around a different idea: eliminating the use of standardised
SQL. In the years that followed, others picked up and continued to grow this thought, referring to
various other non-relational databases as NoSQL databases.
By design, NoSQL databases and management systems are relation-less (or schema-less). They are not
based on a single model (e.g. the relational model of RDBMSs); each database adopts a different model
depending on its target functionality.
NoSQL databases work entirely differently from relational databases, so you need to learn how to work with
NoSQL to manage big data queries properly. If you do not implement NoSQL properly, you can actually
slow down the websites or applications that use the NoSQL database system.
If you believe you need to gather large sets of data, it is probably worth looking into a NoSQL database
solution. You will also be up on the latest database languages and technology; to be a good programmer,
you need to know the latest technology and languages. NoSQL is a trending class of database that is
taking over workloads from relational database systems.
But organizations are increasingly considering alternatives to legacy relational infrastructure. In some
cases the motivation is technical — such as a need to handle new, multi-structured data types or scale
beyond the capacity constraints of existing systems — while in other cases the motivation is driven by the
desire to identify viable alternatives to expensive proprietary database software and hardware. A third
motivation is agility or speed of development, as companies look to adapt to the market more quickly and
embrace agile development methodologies.
These motivations apply both to analytical and operational applications. Companies are shifting
workloads to Hadoop for their bulk analytical workloads, and they are building online, operational
applications with a new class of so-called “NoSQL” or non-relational databases. “NoSQL” is often used
as an umbrella category for all non-relational databases.

5.2 TYPES OF NOSQL DATABASES


There are several types of NoSQL database systems, although a couple of them are more popular than the
others. The primary way in which non-relational databases differ from relational databases is the data
model. Although there are dozens of non-relational databases, they primarily fall into one of the following
categories:
1. Document models
2. Key-value stores
3. Column or wide column models
4. Graph stores
5.2.1 Document Models
The first type is the most common. Document type NoSQL databases use structures called
documents to store data. The document is given an ID and you can also link these documents together
using IDs just like relational databases. This type of structure is used with MongoDB, which is a common
open source database system. You can think of the document system in the same way you think of a
regular document. You “file” the document in the database. The document can contain any amount of
data and any data types. The next document can also contain any data and any data types. There are no
constraints for the data these documents contain.
Whereas relational databases store data in rows and columns, document databases store data in
documents. These documents typically use a structure that is like JSON (JavaScript Object Notation), a
format popular among developers. Documents provide an intuitive and natural way to model data that is
closely aligned with object-oriented programming – each document is effectively an object. Documents
contain one or more fields, where each field contains a typed value, such as a string, date, binary, or array.
Rather than spreading out a record across multiple columns and tables connected with foreign keys, each
record and its associated (i.e., related) data are typically stored together in a single document.
This simplifies data access and, in many cases, eliminates the need for expensive JOIN
operations and complex, multi-record transactions. Document databases provide the ability to query on
any field within a document. Some products, such as MongoDB, provide a rich set of indexing options to
optimize a wide variety of queries, including text indexes, geospatial indexes, compound indexes, sparse
indexes, time to live (TTL) indexes, unique indexes, and others. Furthermore, some of these products
provide the ability to analyze data in place, without it needing to be replicated to dedicated analytics or
search engines. MongoDB, for instance, provides both the Aggregation Framework for providing real-
time analytics (along the lines of the SQL GROUP BY functionality), and a native MapReduce
implementation for other types of sophisticated analyses. To update data, MongoDB provides a find and
modify method so that values in documents can be updated in a single statement to the database, rather
than making multiple round trips.
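As a hedged illustration using the Python driver (the database, collection and field names below are made up, not from the original text), querying on arbitrary fields, adding a secondary index, running a small aggregation and performing an atomic find-and-modify can look like this:

from pymongo import MongoClient, ASCENDING  # pip install pymongo

orders = MongoClient()["shop"]["orders"]        # illustrative names

orders.insert_many([
    {"customer": "asha", "city": "Pune",   "amount": 120},
    {"customer": "ravi", "city": "Mumbai", "amount": 80},
    {"customer": "asha", "city": "Pune",   "amount": 40},
])

# Query on any field, not only the primary key.
print(list(orders.find({"city": "Pune"})))

# Secondary index to speed up that query.
orders.create_index([("city", ASCENDING)])

# Aggregation Framework: total amount per customer (like SQL GROUP BY).
pipeline = [{"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}]
print(list(orders.aggregate(pipeline)))

# Find-and-modify style update in a single round trip.
orders.find_one_and_update({"customer": "ravi"}, {"$inc": {"amount": 5}})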
In a document database, the notion of a schema is dynamic: each document can contain different
fields. This flexibility can be particularly helpful for modeling unstructured and polymorphic data. It also
makes it easier to evolve an application during development, such as adding new fields. Additionally,
document databases generally provide the query robustness that developers have come to expect from
relational databases. In particular, data can be queried based on any combination of fields in a document.

Applications:
Document databases are general purpose, useful for a wide variety of applications due to the
flexibility of the data model, the ability to query on any field and the natural mapping of the document
data model to objects in modern programming languages.
1. Extension of key-value model, where value is a structured document
2. Documents can be highly complex, hierarchical data structures without requiring pre-defined
“schema”
3. Supports queries on structured documents
4. Search platforms are also document-oriented
5. Applications that need to manage a large variety of objects that differ in structure
6. Large product catalogs in e-commerce, customer profiles, content management applications
7. No standard query syntax
8. Query performance not linearly scalable
9. Join queries across collections not efficient
Use cases
Some popular use cases for document based data stores are:
 Nested information: Document-based data stores allow you to work with deeply nested,
complex data structures.
 JavaScript friendly: One of the most important characteristics of document-based data stores is
the way they interface with applications: using JS-friendly JSON.

Examples:
1. MongoDB: An extremely popular and highly functional document database.
2. CouchDB: A ground-breaking document-based data store.
3. Couchbase: A JSON-based, Memcached-compatible document-based data store.
4. Apache Solr
5. Elasticsearch

5.2.1.1 JSON (JavaScript Object Notation) format.


JSON's benefits and challenges: Agility, structure, and interactivity. JSON is the data structure of the
Web. It's a simple data format that allows programmers to store and communicate sets of values, lists, and
key-value mappings across systems.
As JSON adoption has grown, database vendors have sprung up offering JSON-centric document
databases. Now, increasingly, traditional relational-style databases are integrating JSON features,
resulting in a best-of-both worlds benefit to developers and database administrators.
JSON started as the "serialized object notation" for JavaScript, the programming language of the Web.
With a nice combination of simplicity (the essentials of the format are specified in five grammar diagrams
at JSON.org) and "just enough structure," JSON has quickly expanded beyond the Web into applications
and services. For example, JSON is displacing the more complex XML format as the serialization format
for exchanging data between applications.
Application developers must manage both code and data. Code is verbs, or the instructions and
descriptions of how to perform a task. Code is the recipe. Data are the nouns: users, locations, recordings,
amounts -- the ingredients the recipe combines. As more application developers adopt JSON as their
preferred data format, the need is growing for JSON-friendly databases. Few coding tasks are more
annoying or more bug prone than mechanically translating information between formats.
As a consequence, several NoSQL document store database vendors have chosen JSON as their primary
data representation in a stark break from traditional relational schemes used by MySQL, PostgreSQL,
VoltDB, MS SQL Server, Oracle and others. This is a natural fit for developers who use JSON as the data
interchange format in their applications.
Additionally, the relative agility of JSON (JSON records are well structured but easily extended) has
attracted developers looking to avoid painful database schema migrations in agile environments. Data and
schema, in volume, can be hard to change. Rewriting a large dataset stored on disk while keeping the
associated applications online can be time consuming. It can take days of background processing, in
moderate to large examples, to upgrade data.
In contrast, JSON's lack of a predefined schema makes upgrades easy. Developers can store and update
documents without restrictions.

5.2.2 Key/Value
The next popular data system is the key-value scheme. If you are used to programming languages, the
key-value system will be familiar to you. The key-value stores a key, which is the identifier for the value.
You can look up data using the key or the value. Again, there are no constraints with the data, because
you simply have key-value pairs for each element.
We will begin our NoSQL modeling journey with key/value based database management systems, simply
because they can be considered the most basic, backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is no structure
or relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g.
the_answer_to_life) and provide a matching value (e.g. 42) which can later be retrieved the same way by
supplying the key.
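A minimal sketch of that exchange using the Python Redis client, assuming a Redis server running locally (the key and value follow the example above):

import redis  # third-party client: pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a value under a key, then retrieve it by supplying the same key.
r.set("the_answer_to_life", 42)
print(r.get("the_answer_to_life"))  # -> "42"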
Key/value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic
information, for example the result of a CPU- and memory-intensive computation. They are extremely
performant, efficient and usually easily scalable.
These systems provide the ability to retrieve and update data based only on a single or a limited range of
primary keys. For querying on other values, users are encouraged to build and maintain their own
indexes. Some products provide limited support for secondary indexes, but with several caveats. To
perform an update in these systems, multiple round trips may be necessary: first find the record, then
update it, then update the index. In these systems, the update may be implemented as a complete rewrite
of the entire record irrespective of whether a single attribute has changed, or the entire record.
Characteristic Features
 The biggest difference between non-relational databases lies in the ability to query data
efficiently.
 Document databases provide the richest query functionality, which allows them to address a
wide variety of operational and real-time analytics applications.
 Key-value stores and wide column stores provide a single means of accessing data: by primary
key. This can be fast, but they offer very limited query functionality and may impose additional
development costs and application-level requirements to support anything more than basic query
patterns.
 The simplest model where each object is retrieved with a unique key, with values having no
inherent model
 Utilize in-memory storage to provide fast access with optional persistence
 Other data models built on top of this model to provide more complex objects
 Applications requiring fast access to a large number of objects, such as caches or queues
 Applications that require fast-changing data environments like mobile, gaming, online ads

Weaknesses
 Cannot update subset of a value
 Does not provide querying
 As number of objects becomes large, generating unique keys could become complex

Applications:
 Key value stores and wide column stores are useful for a narrow set of applications that only
query data by a single key value. The appeal of these systems is their performance and
scalability, which can be highly optimized due to the simplicity of the data access patterns and
opacity of the data itself.

Some popular use cases for key / value based data stores are:
1. Caching: Quickly storing data for (sometimes frequent) future use.
2. Queueing: Some K/V stores (e.g. Redis) support lists, sets, queues and more, as shown in the sketch below.
3. Distributing information / tasks: They can be used to implement Pub/Sub.
4. Keeping live information: Applications which need to keep a state can use K/V stores easily.
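For the queueing case, a hedged sketch with the Python Redis client (the queue name and job strings are invented for illustration):

import redis  # pip install redis

r = redis.Redis(decode_responses=True)

# Producer: push jobs onto one end of a list used as a queue.
r.lpush("jobs", "resize:image-1.png", "resize:image-2.png")

# Consumer: pop jobs from the other end until the queue is empty.
while (job := r.rpop("jobs")) is not None:
    print("processing", job)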
Examples: Riak KV is a distributed NoSQL database that is highly available, scalable and
easy to operate. It automatically distributes data across the cluster to ensure fast performance and
fault-tolerance, and Riak KV Enterprise includes multi-cluster replication for low latency and
robust business continuity. Other examples include Redis, MemcacheDB, BerkeleyDB, DynamoDB,
etc.

5.2.3 Column or Wide Column Based Models


Column based NoSQL database management systems work by advancing the simple nature of
key / value based ones. Despite their complicated-to-understand image on the internet, these
databases work very simply by creating collections of one or more key / value pairs that match a
record.
Unlike the traditionally defined schemas of relational databases, column-based NoSQL solutions
do not require a pre-structured table to work with the data. Each record comes with one or more
columns containing the information, and each column of each record can be different.
Basically, column-based NoSQL databases are two-dimensional arrays whereby each key (i.e.
row / record) has one or more key / value pairs attached to it, and these management systems
allow very large and unstructured data to be kept and used (e.g. a record with a great deal of
information).
These databases are commonly used when simple key / value pairs are not enough and storing
very large numbers of records, each with a very large number of columns, is a must. DBMSs
implementing column-based, schema-less models can scale extremely well. Wide column stores,
or column family stores, use a sparse, distributed multi-dimensional sorted map to store data.
Each record can vary in the number of columns that are stored. Columns can be grouped together
for access in column families, or columns can be spread across multiple column families. Data is
retrieved by primary key per column family.
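A purely conceptual sketch of that layout in plain Python (this is not the API of any particular product; the row keys, families and columns are invented):

# Each row key maps to column families, and each family holds its own
# column/value pairs; rows do not need to share the same columns.
table = {
    "user:1001": {
        "profile":  {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2015-03-01"},
    },
    "user:1002": {
        "profile":  {"name": "Ravi"},  # no "city" column for this row
        "activity": {"last_login": "2015-03-04", "clicks": "17"},
    },
}

# Data is retrieved by primary key, per column family.
print(table["user:1002"]["activity"]["clicks"])  # -> "17"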

Characteristic Features
 The wide column model provides more granular access to data than the key value model, but less
flexibility than the document data model.
 Extension of key-value model, where the value is a set of columns (column-family)
 A column can have multiple time-stamped versions.
 Columns can be generated at run-time and not all rows need to have all columns
 Storing a large number of time-stamped data like event logs, sensor data
 Analytics that involve querying entire columns of data such as trends or time series analytics.

Weakness:
 No join queries or sub-queries
 Limited support for aggregation
 Ordering is done per partition, specified at table creation time

Applications: Some popular use cases for column based data stores are:
 Keeping unstructured, non-volatile information: If a large collection of attributes and values
needs to be kept for long periods of time, column-based data stores come in extremely handy.
 Scaling: Column based data stores are highly scalable by nature. They can handle enormous
amounts of information.
Some popular column based data stores are:
1. HBase: A data store for Apache Hadoop based on ideas from BigTable.
2. Cassandra: A column based data store based on BigTable and Dynamo.
3. DynamoDB
4. Apache Accumulo

5.2.4 Graph Store Model


The remaining types of NoSQL are graph stores and wide column stores. A graph store keeps data about
networked systems, which can grow to several terabytes of data. Wide column stores, covered above, are
used for big data stored across several data sets; these systems are mainly used for reporting.
Graph databases use graph structures with nodes, edges and properties to represent data. In
essence, data is modeled as a network of relationships between specific elements. While the
graph model may be counter-intuitive and takes some time to understand, it can be useful for a
specific class of queries. Its main appeal is that it makes it easier to model and navigate
relationships between entities in an application. These systems tend to provide rich query models
where simple and complex relationships can be interrogated to make direct and indirect
inferences about the data in the system. Relationship type analysis tends to be very efficient in
these systems, whereas other types of analysis may be less optimal. As a result, graph databases
are rarely used for more general purpose operational applications.
One common NoSQL database often grouped with graph stores is the XML database. Most developers are
familiar with XML because it is an older format that supports data queries and storage. You can easily
store XML in NoSQL databases, but XML is also a bit out of style. Most developers are moving
towards JSON as a data model schema, so it might be more difficult to fit an XML
database into your current systems.
Characteristic Features
1. Models graphs consisting of nodes and edges with properties (metadata) describing them
2. Implements very fast graph traversal operations
3. Also supports indexing of metadata to enable graph traversal combined with search queries
4. Applications that deal with objects with a large number of inter-relations
5. Applications like social networking friends-networks, hierarchical role based permissions,
complex decision trees, maps, network topologies
Weakness
1. Difficult to scale for large data sets for generic graphs
2. Giraph uses the Bulk Synchronous Parallel model to overcome some of the scalability
limitations

Applications:
Graph databases are useful in cases where traversing relationships are core to the application,
like navigating social network connections, network topologies or supply chains.
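As a purely conceptual sketch (plain Python, not the query language of any particular graph database; the names are invented), a friends-of-friends traversal over a small social graph could look like this:

from collections import deque

# A tiny social graph as an adjacency map: person -> set of direct friends.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice"},
    "dave":  {"bob", "erin"},
    "erin":  {"dave"},
}

def within_degrees(start, max_depth):
    # Breadth-first traversal: everyone reachable within max_depth hops.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbour in friends[person] - seen:
            seen.add(neighbour)
            frontier.append((neighbour, depth + 1))
    return seen - {start}

print(within_degrees("alice", 2))  # direct friends plus friends-of-friends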
Some popular use cases for graph based data stores are:
 Handling complex relational information: As explained in the introduction, graph databases
make it extremely efficient and easy to deal with complex but relational information, such
as the connections between two entities and various degrees of other entities indirectly related to
them.
 Modelling and handling classifications: Graph databases excel in any situation where
relationships are involved. Modelling data and classifying various information in a relational way
can be handled very well using these types of data stores.
Some popular graph based data stores are:
1. OrientDB: A very fast graph and document based hybrid NoSQL data store written in
Java that comes with different operational modes.
2. Neo4J: A schema-free, extremely popular and powerful Java graph based data store.
3. Apache Giraph
4. AllegroGraph

5.2.5 Relational

1. Conventional RDBMS structure consisting of fixed schema with ACID properties
2. Provides well documented and widely supported SQL syntax
3. Capable of complex queries including subqueries and joins
4. Transactional data applications like ERP, CRM, Banking etc.
5. Applications where data volume is limited and schemas are by and large fixed
6. Lacks horizontal scalability and hence limited in handling "big data"
7. Not efficient at handling complex multi-level nested data
8. Cannot handle "unstructured" data where structure is not known at design time
Examples: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server

5.4 BENEFITS OF NOSQL


There are several reasons to use big data databases. The main and most popular reason is the
amount of data you can handle with your queries and reports. Relational databases can store and
handle tables with millions of records. This was a large data amount a decade ago. However,
these database systems are not so good when your tables grow to billions or trillions of rows.
With a relational database, your only option is to store data using the same setup as your tables,
and any manipulated data must adhere to any strict constraints. This doesn’t allow much room
for dynamic information. NoSQL databases are typically able to retrieve large data sets more
efficiently than relational databases.
NoSQL databases are also able to allow for quicker code releases and work better with object
oriented programming. For instance, when you create an object oriented program in a language
such as C#, you have more freedom with your database creation. NoSQL lets you create database
tables using your OO classes. When these classes store data, they represent a document and store
in the database dynamically. If you change the class models, NoSQL databases will let you store
the new data without needing to change your entire database model. Programmers working with older
systems know that to change your code, you needed to change your tables and constraints and then refresh
your database and your code. With NoSQL databases, the tables update automatically.
With older relational databases, developers remember first changing the database then
refactoring code to work with the new changes. Large changes were sometimes not scalable with
old relational databases. NoSQL is dynamic, so changes to the code are reflected in the database.
You don’t need to change the database several times during the development process.

5.4.1 Sharding
NoSQL databases also offer a concept called “sharding.” With relational databases, you need to
scale vertically. This means that when you host a database, you typically host it on one server.
When you need more resources, you add another database server or more resources to the current
server. With NoSQL, you can "shard" your database files. This basically means that you can
spread database files across multiple servers. Sharding is usually done on very fast storage hardware
such as a SAN or a NAS. Using several database servers at once speeds up your queries,
especially when you have millions of rows in your data sets.
You can implement cloud computing along with sharding. This makes your queries especially
fast when you need reports available for several different locations. Using old techniques, your
users would log in to the network and pull reports from miles away, sometimes in other
countries. The result was that the user sometimes needed to wait several minutes for a report to
render. With cloud computing, reporting data is served up from the closest data center to your
users. The result is faster data processing and rendering.
Not only does NoSQL support manual sharding, but NoSQL servers will also do automatic
sharding. The database server will automatically spread a dynamic amount of data across several
servers, so the load on each server individually is reduced.
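A conceptual sketch of key-based shard routing in plain Python (the hash scheme and host names are invented for illustration and are not how any specific product routes data):

import hashlib

# Four hypothetical shard servers.
SHARDS = ["shard-0.example.internal", "shard-1.example.internal",
          "shard-2.example.internal", "shard-3.example.internal"]

def shard_for(key: str) -> str:
    # Hash the record key and map it onto one of the shards,
    # so every lookup for that key goes to the same server.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:1001"))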

5.4.2 Cloud computing


In addition to using NoSQL in a cloud computing environment for performance, it’s also
beneficial when it comes to cost. Cloud computing hosts only charge you for the resources you
use, so the costs scale with the growth of your business.
Replication with NoSQL is an option. When you gather data, you probably need to replicate that
data to other servers. For instance, you probably want to replicate data to a reporting server for
performance reasons on your main production server. NoSQL supports automatic replication, so
you can send data to several servers as your database collects data real-time.
You can also use NoSQL with more current languages. For instance, NoSQL works with Node.js
for your real-time network communication web applications. Just give the table name and
schema name and you can pull data directly from your NoSQL database. With the weak typing in
the newer languages and the dynamic way NoSQL stores and implements data, you can create
very powerful dynamic apps with your web applications.

5.4.3 CRUD Queries


NoSQL still supports common CRUD queries. CRUD is the name given to the "create, read, update
and delete" query procedures. These procedures are the four major ways you work with data in
any database. The create statement creates a new record, the read (select) statement retrieves the
data for your application, the update statement changes and edits the data already stored in your
database, and the delete statement removes a record from the database.
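As a minimal sketch with MongoDB's Python driver (the server location, collection and field names are assumed for illustration), the four operations map onto driver calls like these:

from pymongo import MongoClient  # pip install pymongo

products = MongoClient("mongodb://localhost:27017")["shop"]["products"]

# Create
products.insert_one({"sku": "P-100", "name": "Kettle", "price": 25})

# Read
print(products.find_one({"sku": "P-100"}))

# Update
products.update_one({"sku": "P-100"}, {"$set": {"price": 22}})

# Delete
products.delete_one({"sku": "P-100"})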
NoSQL statements are dependent on the type of database structure you incorporate. In most
cases, you’ll use a programming language to create apps that interface with your NoSQL server.
Statements are much different than the statements you’re used to with relational databases. Even
if you had different SQL servers, the statements were still similar between the platforms (Oracle,
MySQL or SQL Server). Because NoSQL is dynamic and works more with documents or entities
completely different than relational tables, you have a different structure for your queries.
However, when you use different programming languages, there are plugins that you can use to
retrieve data from your databases. These plugins help minimize the need to learn raw query
information.

5.4.4 NoSQL DBMSs In Comparison To Relational DBMSs


In order to draw a clear picture of how NoSQL solutions differ from relational database
management systems, let's create a quick comparison list:
When To Use NoSQL Databases
 Size matters: If you will be working with very large sets of data, scaling consistently is easier to
achieve with many of the DBMSs from the NoSQL family.
 Speed: NoSQL databases are usually faster - and sometimes extremely so - when it comes
to writes. Reads can also be very fast depending on the type of NoSQL database and the data being
queried.
 Schema-free design: Relational DBMSs require structure from the beginning. NoSQL solutions
offer a large amount of flexibility.
 Automated (or easy) replications/scaling: NoSQL databases are growing rapidly and they are
being actively built today - vendors are trying to tackle common issues and one of them clearly is
replication and scaling. Unlike RDBMSs, NoSQL solutions can easily scale and work with(in)
clusters.
 Multiple choices: When it comes to choosing a NoSQL data store, there are a variety of models,
as we have discussed, that you can choose from to get the most out of the database management
system - depending on your data type.

MongoDB
6.1 What is MongoDB?
MongoDB is a:
1. Cross-platform
2. Open source
3. Non-relational
4. Distributed
5. NoSQL
6. Document-oriented data store

6.2 Why MongoDB?


Few of the major challenges with traditional RDBMS are dealing with large volumes of data, rich variety
of data – particularly unstructured data, and meeting up to the scale needs of enterprise data. The need is
for a database that can scale out or scale horizontally to meet the scale requirements, has flexibility with
respect to schema, is fault tolerant, is consistent and partition tolerant, and can be easily distributed over
a multitude of nodes in a cluster. Refer Figure 6.1.
6.2.1 Using Java Script Object Notation (JSON)
JSON is extremely expressive. MongoDB actually does not use JSON but BSON (pronounced
Bee Son) − it is Binary JSON. It is an open standard. It is used to store complex data structures.

Let us trace the journey from .csv to XML to JSON: Let us look at how data is stored in .csv file.
Assume that this data is about the employees of an organization named “XYZ”. As can be seen below,
the column values are separated using commas and the rows are separated by a carriage return.
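An illustrative fragment of such a file (the field names and values are assumed) would look something like:

EmpId,Name,Designation,ContactNo
E101,Asha Rao,Engineer,020-5550101
E102,Ravi Kumar,Analyst,020-5550102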

Now assume that few employees have more than one ContactNo. It can be neatly classified as
OfficeContactNo and HomeContactNo. But what if few employees have more than one OfficeContactNo
and more than one HomeContactNo? Ok, so this is the first issue we need to address.
Let us look at just another piece of data that you wish to store about the employees. You need to store
their email addresses as well. Here again we have the same issues, few employees have two email
addresses, few have three and there are a few employees with more than three email addresses as well.
As we come across these fields or columns, we realize that it gets messy with .csv. CSV files are known to
store data well only if it is flat and does not have repeating values.
The problem becomes even more complex when different departments maintain the details of their
employees. The formats of .csv (columns, etc.) could vastly differ and it will call for some efforts before
we can merge the files from the various departments to make a single file.
This problem can be solved by XML. But as the name suggests XML is highly extensible. It does not just
call for defining a data format, rather it defines how you define a data format. You may be prepared to
undertake this cumbersome task for highly complex and structured data; however, for simple data
exchange it might just be too much work.
Enter JSON! Let us look at how it reacts to the problem at hand.
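A representative JSON document for one employee (the field names and values are assumed for illustration) could be:

{
  "EmpId": "E101",
  "Name": "Asha Rao",
  "OfficeContactNo": ["020-5550101", "020-5550102"],
  "HomeContactNo": ["020-5550999"],
  "EmailIds": ["asha@xyz.example", "asha.rao@xyz.example"]
}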

As you can see it is quite easy to read a JSON. There is absolutely no confusion now. One can have a list
of n contact numbers, and they can be stored with ease.
JSON is very expressive. It provides the much needed ease to store and retrieve documents in their real
form. The binary form of JSON is BSON. BSON is an open standard. In most cases it consumes less
space as compared to the text-based JSON. There is yet another advantage with BSON. It is much easier
and quicker to convert BSON to a programming language’s native data format. There are MongoDB
drivers available for a number of programming languages such as C, C++, Ruby, PHP, Python, C#, etc.,
and each works slightly differently. Using the basic binary format enables the native data structures to be
built quickly for each language without going through the hassle of first processing JSON.

6.2.2 Creating or Generating a Unique Key


Each JSON document should have a unique identifier. It is the _id key. It is similar to the primary key in
relational databases. This facilitates search for documents based on the unique identifier. An index is
automatically built on the unique identifier. It is your choice to either provide unique values yourself or
have the mongo shell generate the same.
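A short sketch using the Python driver instead of the mongo shell (the collection and field values are illustrative): if you omit _id, a unique ObjectId is generated for you; otherwise the value you supply is used.

from pymongo import MongoClient

coll = MongoClient()["test"]["employees"]

# No _id supplied: a unique ObjectId is generated automatically.
result = coll.insert_one({"Name": "Asha"})
print(result.inserted_id)              # e.g. ObjectId('...')

# _id supplied by the application: your own unique value is used.
coll.insert_one({"_id": "E101", "Name": "Ravi"})
print(coll.find_one({"_id": "E101"}))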

6.2.2.1 Database It is a collection of collections.


In other words, it is like a container for collections. It gets created the first time that your collection
makes a reference to it. This can also be created on demand. Each database gets its own set of files on the
file system. A single MongoDB server can house several databases.
6.2.2.2 Collection
A collection is analogous to a table of RDBMS. A collection is created on demand. It gets created
the first time that you attempt to save a document that references it. A collection exists within a single
database. A collection holds several MongoDB documents. A collection does not enforce a schema. This
implies that documents within a collection can have different fields. Even if the documents within a
collection have same fields, the order of the fields can be different.
6.2.2.3 Document
A document is analogous to a row/record/tuple in an RDBMS table. A document has a dynamic
schema. This implies that a document in a collection need not necessarily have the same set of
fields/key–value pairs. Shown in Figure 6.2 is a collection by the name “students” containing three
documents.
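Three documents of such a "students" collection might look like the following (the values are illustrative), each carrying its own set of fields:

{ "_id": 1, "StudName": "Asha",  "Grade": "A", "Hobbies": ["chess", "music"] }
{ "_id": 2, "StudName": "Ravi",  "Grade": "B" }
{ "_id": 3, "StudName": "Meera", "Grade": "A", "DOJ": "2015-06-01" }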

6.2.3 Support for Dynamic Queries


MongoDB has extensive support for dynamic queries. This is in keeping with traditional RDBMS
wherein we have static data and dynamic queries. CouchDB, another document-oriented, schema-less
NoSQL database and MongoDB’s biggest competitor, works on quite the reverse philosophy. It has
support for dynamic data and static queries.

6.2.4 Storing Binary Data


MongoDB can store binary data directly in a document up to the document size limit (4 MB in early
releases). This usually suffices for photographs (such as a profile picture) or small audio clips. However, if
one wishes to store larger content such as movie clips, MongoDB provides GridFS. GridFS stores the
metadata (data about data, along with the context information) in a "files" collection, then breaks the data
into small pieces called chunks and stores them in the "chunks" collection. This process takes care of the
need for easy scalability.
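A minimal sketch of storing and reading back a file with GridFS through the Python driver (the database name, file name and payload are made up):

import gridfs
from pymongo import MongoClient

db = MongoClient()["media"]
fs = gridfs.GridFS(db)          # backed by the fs.files and fs.chunks collections

# Store binary data; GridFS splits it into chunks behind the scenes.
file_id = fs.put(b"...binary video bytes...", filename="clip.mp4")

# Read it back as a single stream.
data = fs.get(file_id).read()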

6.2.5 Replication : Why replication?


It provides data redundancy and high availability. It helps to recover from hardware failure and service
interruptions. In MongoDB, the replica set has a single primary and several secondaries. Each write
request from the client is directed to the primary. The primary logs all write requests into its Oplog
(operations log). The Oplog is then used by the secondary replica members to synchronize their data.
This way there is strict adherence to consistency. Refer Figure 6.3. The clients usually read from the
primary. However, the client can also specify a read preference that will then direct the read operations to
the secondary.
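For illustration, a client using the Python driver can express such a read preference in the connection string; the host names and the replica set name below are assumptions:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://db1.example.internal:27017,db2.example.internal:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)

# Writes always go to the primary; reads may now be served by a secondary.
client["app"]["events"].insert_one({"type": "login"})
print(client["app"]["events"].find_one())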

6.2.6 Sharding

Sharding is akin to horizontal scaling. It means that the large dataset is divided and distributed over
multiple servers or shards. Each shard is an independent database and collectively they would constitute a
logical database.
The prime advantages of sharding are as follows:
1. Sharding reduces the amount of data that each shard needs to store and manage. For example, if the
dataset was 1 TB in size and we were to distribute this over four shards, each shard would house just 256
GB data. Refer Figure 6.4. As the cluster grows, the amount of data that each shard will store and
manage will decrease.
2. Sharding reduces the number of operations that each shard handles. For example, if we were to insert
data, the application needs to access only that shard which houses that data.

6.2.7 Updating Information In-Place


MongoDB updates the information in-place. This implies that it updates the data wherever it is available.
It does not allocate separate space and the indexes remain unaltered. MongoDB is all for lazy-writes. It
writes to the disk once every second. Reading and writing to disk is a slow operation as compared to
reading and writing from memory. The fewer the reads and writes that we perform to the disk, the better
is the performance. This makes MongoDB faster than its other competitors who write almost
immediately to the disk. However, there is a tradeoff. MongoDB makes no guarantee that data will be
stored safely on the disk.

6.3 Terms Used in RDBMS and MongoDB

6.3.1 Create Database


The syntax for creating a database is as follows: use DATABASE_Name
6.3.2 Drop Database

The syntax to drop a database is as follows: db.dropDatabase();
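The equivalents through the Python driver, as a small sketch (the database and collection names are illustrative); note that MongoDB creates a database lazily, on the first write:

from pymongo import MongoClient

client = MongoClient()

# "use mydb": simply reference the database; nothing is created yet.
db = client["mydb"]

# The database (and collection) materialize on the first insert.
db["notes"].insert_one({"msg": "hello"})
print(client.list_database_names())   # now includes "mydb"

# Equivalent of db.dropDatabase() in the shell.
client.drop_database("mydb")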

6.4 Data Types in MongoDB


The following are various data types in MongoDB: String, Integer, Double, Boolean, Date, Timestamp,
ObjectId, Arrays, Object (embedded document), Null, Binary data and Regular expression.
A few commands worth looking at are as follows (try them!!!).
Consider a table “Students” with the following columns:
1. StudRollNo
2. StudName
3. Grade
4. Hobbies
5. DOJ
Before we get into the details of CRUD operations in MongoDB, let us look at how the statements are
written in RDBMS and MongoDB.
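As a hedged sketch, the statements for the "Students" table above might compare as follows, with the RDBMS form shown in comments and the MongoDB form written with the Python driver (the specific values are assumed):

from pymongo import MongoClient  # pip install pymongo

students = MongoClient()["test"]["Students"]

# RDBMS:   CREATE TABLE Students (...)
# MongoDB: no explicit creation; the collection appears on first insert.

# RDBMS:   INSERT INTO Students (StudRollNo, StudName, Grade, Hobbies, DOJ)
#          VALUES (1, 'Asha', 'A', 'chess', '2015-06-01')
students.insert_one({"StudRollNo": 1, "StudName": "Asha", "Grade": "A",
                     "Hobbies": ["chess"], "DOJ": "2015-06-01"})

# RDBMS:   SELECT * FROM Students WHERE Grade = 'A'
for doc in students.find({"Grade": "A"}):
    print(doc)

# RDBMS:   UPDATE Students SET Grade = 'A+' WHERE StudRollNo = 1
students.update_one({"StudRollNo": 1}, {"$set": {"Grade": "A+"}})

# RDBMS:   DELETE FROM Students WHERE StudRollNo = 1
students.delete_one({"StudRollNo": 1})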
5.6 ADVANTAGES OF MONGODB OVER RDBMS
1. Schema less: MongoDB is a document database in which one collection holds different documents.
The number of fields, content and size of the documents can differ from one document to another.
2. Structure of a single object is clear
3. No complex joins
4. Deep query-ability. MongoDB supports dynamic queries on documents using a document-based query
language that's nearly as powerful as SQL
5. Tuning
6. Ease of scale-out: MongoDB is easy to scale
7. Conversion / mapping of application objects to database objects not needed
8. Uses internal memory for storing the (windowed) working set, enabling faster access of data
HBASE and CASSANDRA
6.1 INTRODUCTION TO HBASE
HBase is a column-oriented database that is an open-source implementation of Google's BigTable
storage architecture. It can manage structured and semi-structured data and has some built-
in features such as scalability, versioning, compression and garbage collection. Since it uses
write-ahead logging and distributed configuration, it can provide fault tolerance and quick
recovery from individual server failures. HBase is built on top of Hadoop / HDFS, and the data
stored in HBase can be manipulated using Hadoop's MapReduce capabilities.
HBase is an open source, non-relational, distributed database modeled after Google's BigTable
and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop
project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like
capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of
sparse data (small amounts of information caught within a large collection of empty or
unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding
the non-zero items representing less than 0.1% of a huge collection).
A sparse file is a type of computer file that attempts to use file system space more efficiently
when the file itself is mostly empty. This is achieved by writing brief information (metadata)
representing the empty blocks to disk instead of the actual "empty" space which makes up the
block, using less disk space. The full block size is written to disk as the actual size only when the
block contains "real" (non empty) data.
When reading sparse files, the file system transparently converts metadata representing empty
blocks into "real" blocks filled with zero bytes at runtime. The application is unaware of this
conversion.
Let's now take a look at how HBase (a column-oriented database) is different from some other
data structures and concepts that we are familiar with. As shown in Fig. 6.2, in a row-oriented
data store, a row is a unit of data that is read or written together. In a column-oriented data store,
the data in a column is stored together and hence quickly retrieved.
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed
by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of
bytes. HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled
tables. A data row has a sortable key and an arbitrary number of columns. The table is stored
sparsely, so that rows in the same table can have crazily-varying columns, if the user likes. A
map is "an abstract data type composed of a collection of keys and a collection of values, where
each key is associated with one value."

6.2 ROW-ORIENTED VS. COLUMN-ORIENTED DATA STORES

6.2.1 Row-oriented data stores :


1. Data is stored and retrieved one row at a time and hence could read unnecessary data if only
some of the data in a row is required.
2. Easy to read and write records
3. Well suited for OLTP systems
4. Not efficient in performing operations applicable to the entire dataset and hence aggregation is
an expensive operation
5. Typical compression mechanisms provide less effective results than those on column-oriented
data stores

6.2.2 Column-oriented data stores :


1. Data is stored and retrieved in columns and hence can read only relevant data if only some
data is required
2. Read and Write are typically slower operations
3. Well suited for OLAP systems
4. Can efficiently perform operations applicable to the entire dataset and hence enables
aggregation over many rows and columns
5. Permits high compression rates due to few distinct values in columns
Relational Databases vs. HBase

When talking of data stores, we first think of Relational Databases with structured data storage
and a sophisticated query engine. However, a Relational Database incurs a big penalty to
improve performance as the data size increases. HBase, on the other hand, is designed from the
ground up to provide scalability and partitioning to enable efficient data structure serialization,
storage and retrieval. Broadly, the differences between a Relational Database and HBase are:

6.2.3 Relational Database :


1. Is Based on a Fixed Schema
2. Is a Row-oriented datastore
3. Is designed to store Normalized Data
4. Contains thin tables
5. Has no built-in support for partitioning.

6.2.4 HBase :
1. Is Schema-less
2. Is a Column-oriented datastore
3. Is designed to store Denormalized Data
4. Contains wide and sparsely populated tables
5. Supports Automatic Partitioning

6.3 HDFS VS. HBASE


HDFS is a distributed file system that is well suited for storing large files. It’s designed to
support batch processing of data but doesn’t provide fast individual record lookups. HBase is
built on top of HDFS and is designed to provide access to single rows of data in large tables.
Overall, the differences between HDFS and HBase are

HDFS :
1. Is suited for High Latency operations batch processing
2. Data is primarily accessed through MapReduce
3. Is designed for batch processing and hence doesn’t have a concept of random reads/writes

HBase :
1. Is built for Low Latency operations
2. Provides access to single rows from billions of records
3. Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift
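As the last item notes, HBase can be reached through several client interfaces. The sketch below is one illustrative option only: it assumes an HBase Thrift server is running and uses the third-party Python happybase client; the table, row key and column names are invented.

import happybase  # pip install happybase; talks to the HBase Thrift server

connection = happybase.Connection("localhost", port=9090)
table = connection.table("webdoc")

# Write one row; column qualifiers use the family:column form.
table.put(b"http://example.org/page.html",
          {b"meta:type": b"text/html", b"data:content": b"<html>...</html>"})

# Read the row back by its row key.
print(table.row(b"http://example.org/page.html"))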

6.4 HBASE ARCHITECTURE


The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown in
Fig. 6.3. Typically, the HBase cluster has one Master node, called HMaster and multiple Region
Servers called HRegionServer. Each Region Server contains multiple Regions – HRegions.

Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored
in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These
Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the
same number of Regions.
The HMaster in HBase is responsible for
1. Performing Administration
2. Managing and Monitoring the Cluster
3. Assigning Regions to the Region Servers
4. Controlling the Load Balancing and Failover

On the other hand, the HRegionServer performs the following work


1. Hosting and managing Regions
2. Splitting the Regions automatically
3. Handling the read/write requests
4. Communicating with the Clients directly
Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each
Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these
StoreFiles in the form of Column Families (explained below). The MemStore holds in-memory
modifications to the Store (data).
The mapping of Regions to Region Server is kept in a system table called .META. When trying
to read or write data from HBase, the clients read the required Region information from
the .META table and directly communicate with the appropriate Region Server. Each Region is
identified by the start key (inclusive) and the end key (exclusive)

6.5 HBASE DATA MODEL


The Data Model in HBase is designed to accommodate semi-structured data that could vary in
field size, data type and columns. Additionally, the layout of the data model makes it easier to
partition the data and distribute it across the cluster. The Data Model in HBase is made of
different logical components such as Tables, Rows, Column Families, Columns, Cells and
Versions.
Tables: The HBase Tables are more like logical collection of rows stored in separate partitions
called Regions. As shown above, every Region is then served by exactly one Region Server. The
figure above shows a representation of a Table.

Rows: A row is one instance of data in a table and is identified by a rowkey. Rowkeys are
unique in a Table and are always treated as a byte[].

Column Families: Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and these Columns in a family are stored together in a low-level
storage file known as an HFile. Column Families form the basic unit of physical storage to which
certain HBase features like compression are applied. Hence it is important that proper care be
taken when designing Column Families in a table. The table above shows Customer and Sales
Column Families. The Customer Column Family is made up of 2 columns – Name and City,
whereas the Sales Column Family is made up of 2 columns – Product and Amount.

Columns: A Column Family is made of one or more columns. A Column is identified by a
Column Qualifier that consists of the Column Family name concatenated with the Column name
using a colon – example: columnfamily:columnname. There can be multiple Columns within a
Column Family, and Rows within a table can have a varied number of Columns.

Cell: A Cell stores data and is essentially a unique combination of rowkey, Column Family and
the Column (Column Qualifier). The data stored in a Cell is called its value and the data type is
always treated as byte[].

Version: The data stored in a cell is versioned and versions of data are identified by the
timestamp. The number of versions of data retained in a column family is configurable and this
value by default is 3.

6.6 UNDERSTANDING HBASE DATA MODEL



In 2006, the Google Labs team published a paper entitled BigTable: A Distributed Storage
System for Structured Data. It describes a distributed index designed to manage very large
datasets (``petabytes of data") in a cluster of data servers. BigTable supports key search, range
search and high-throughput file scans, and also provides a flexible storage for structured data.
HBase is an open-source clone of BigTable, and closely mimics its design. At Internet Memory,
we use HBase as a large-scale repository for our collections, holding terabytes of web documents
in a distributed cluster. HBase is often likened to a large, distributed relational database. It
actually presents many aspects common to "NoSQL" systems: distribution, fault tolerance,
flexible modeling, absence of some features deemed essential in centralized DBMSs (e.g.,
concurrency), etc. This article presents the data model of HBase, and explains how it stands
between relational DBs and the "No Schema" approach. It will be completed by an introduction
to both the Java and REST APIs, and a final article on system aspects.

(1) The map structure: representing data with key/value pairs


We start with an idea familiar to Lisp programmers of association lists, which are nothing more
than key-value pairs. They constitute a simple and convenient way of representing the properties
of an object. We use as a running example the description of a Web document. For instance,
using the JSON notation:
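An illustrative version of such a description (the exact values are assumed; the keys follow the discussion below) is:

{
  "url": "http://example.org/page.html",
  "type": "text/html",
  "content": "<html> ... </html>"
}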

One obtains what is commonly called an associative array, a dictionary, or a map. Given a
context (the object/document), the structure associates values to keys. We can represent such
data as a graph, as shown by the figure below. The key information is captured by edges,whereas
data values reside at leaves.

There exists many possible representations for a map. We showed a JSON example above, but
XML is of course an appropriate choice. At first glance, a map can also be represented by a
table. The above example is equivalently viewed as
However, this often introduces some confusion. It is worth understanding several important
differences that make a map much more flexible than the strict (relational) table structure. In
particular,
 there is no schema that constrains the list of keys (unlike relational table where the
schema is fixed and uniform for all rows),
 the value may itself be some complex structure.
HBase, following BigTable, builds on this flexibility. First, we can add a new key-value pair to
describe an object, if needed. This does not require any pre-declaration at the 'schema' level, and
the new key remains local. Other objects stored in the same HBase instance remain unaffected by
the change. Second, a value can be another map, yielding a multi-map structure which is
exemplified below.
(2) An HBase "table" is a multi-map structure
Instead of keeping one value for each property of an object, HBase allows the storage of
several versions. Each version is identified by a timestamp. How can we represent such a multi-
version, key-value structure? HBase simply replaces atomic values by a map where the key is
the timestamp. The extended representation for our example is shown below. It helps to figure
out the power and flexibility of the data representation. Now, our document is built from two
nested maps,
 a first one, called "column" in HBase terminology (an unfortunate choice, since this is
hardly related to the column relational concept),
 a second "timestamp" (each map is named after its key).
Our document is globally viewed as a column map. If we choose a column key, say, type, we
obtain a value which is itself a second map featuring as many keys as there are timestamps for
this specific column. In our example, there is only one timestamp for url (well, we can assume
that the URL of the document does not change much). Looking at, respectively, type and
content, we find the former has two versions and the latter three. Moreover, they only have one
timestamp (t1) in common. Actually, the "timestamp" maps are completely independent from
one another.

Note that we can add as many timestamps (hence, as many keys in one of the second-level maps)
as we wish. And, in fact, this is true for the first-level map as well: we can add as many columns
as we wish, at the document level, without having to impose such a change to all other
documents in a same HBase instance. In essence, each object is just a self-described piece of
information (think again to the flexible representation of semi-structured data formats like XML
or JSON). In this respect, HBase is in the wake of other 'NoSQL' systems, and its data model
shares many aspects that characterize this trend: no schema and self-description of objects. We
are not completely done with the multi-map levels of HBase. Columns are grouped in column
families, and a family is actually a key in a new map level, referring to a group of columns. In
the Figure below, we define two families: meta, grouping url and type, and data representing the
content of a document.

6.7 HBASE: AN ON-DISK COLUMN STORAGE FORMAT


Google and Amazon are prominent examples of companies that realized the value of data and
started developing solutions to fit their needs. For instance, in a series of technical publications,
Google described a scalable storage and processing system based on commodity hardware.
These ideas were then implemented outside of Google as part of the open source Hadoop project:
HDFS and MapReduce. Hadoop excels at storing data of arbitrary, semi-, or even unstructured
formats, since it lets you decide how to interpret the data at analysis time, allowing you to
change the way you classify the data at any time: once you have updated the algorithms, you
simply run the analysis again. Hadoop also complements existing database systems of almost any
kind. It offers a limitless pool into which one can sink data and still pull out what is needed when
the time is right. It is optimized for large file storage and batch-oriented, streaming access. This
makes analysis easy and fast, but users also need access to the final data, not in batch mode but
using random access—this is akin to a full table scan versus using indexes in a database system.
We are used to querying databases when it comes to random access for structured data.
RDBMSes are the most prominent, but there are also quite a few specialized variations and
implementations, like object-oriented databases. Most RDBMSes strive to implement Codd’s 12
rules, which forces them to comply to very rigid requirements. The architecture used underneath
is well researched and has not changed significantly in quite some time. The recent advent of
different approaches, like column oriented or massively parallel processing (MPP) databases, has
shown that we can rethink the technology to fit specific workloads, but most solutions still
implement all or the majority of Codd's 12 rules in an attempt to not break with tradition.
6.7.1 Column-Oriented Database
Column-oriented databases save their data grouped by columns.
Subsequent column values are stored contiguously on disk. This differs from the usual row-
oriented approach of traditional databases, which store entire rows contiguously—see Fig. 6.4
for a visualization of the different physical layouts. The reason to store values on a per-column
basis instead is based on the assumption that, for specific queries, not all of the values are
needed. This is often the case in analytical databases in particular, and therefore they are good
candidates for this different storage schema. Reduced I/O is one of the primary reasons for this
new layout, but it offers additional advantages playing into the same category: since the values of
one column are often very similar in nature or even vary only slightly between logical rows, they
are often much better suited for compression than the heterogeneous values of a row-oriented
record structure; most compression algorithms only look at a finite window.
Important: Unlike the column and timestamp maps, the set of keys of a family map is fixed. We
cannot add new families to a table once it is created. The family level therefore constitutes the
equivalent of a relational schema, although, as we saw, the content of a family value may be
quite a complex structure.

The full picture: rows and tables


So, now, we know how to represent our objects with the HBase data model. It remains to
describe how we can put many objects (potentially millions or even billions of objects) in HBase.
This is where HBase borrows some terminology from relational databases: objects are called
"rows", and rows are stored in a "table". Although one could find some superficial similarities,
this comparison is a likely source of confusion. Let us try to list the differences:
1. a "table" is actually a map where each row is a value, and the key is chosen by the table
designer.
2. we already explained that the structure of a "row" has little to do with the flat representation of
relational row.
3. regarding data manipulation, the nature of a "table" implies that two basic operations are
available: put(key, row) and get(key): row. Nothing comparable to SQL here!
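To make these two operations concrete, here is a minimal sketch using the standard HBase Java client API (HBase 1.x/2.x). The webdoc table, the family f1 and a title column follow the hypothetical example discussed below; the connection details and the sample URL are assumptions for illustration, not part of the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WebdocPutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webdoc"))) {
            // put(key, row): the row key is chosen by the table designer (here, the URL).
            Put put = new Put(Bytes.toBytes("http://example.org/page1"));
            put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("title"), Bytes.toBytes("Example page"));
            table.put(put);
            // get(key): retrieves the stored row for that key.
            Result row = table.get(new Get(Bytes.toBytes("http://example.org/page1")));
            System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("f1"), Bytes.toBytes("title"))));
        }
    }
}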

Finally, it is worth noting that the "table" map is a sorted map: rows are ordered on the key value, and two rows close to one another (with respect to the key order) are stored in the same physical area. This makes range queries on keys possible and efficient. We further explore this feature in the article devoted to the system aspects of HBase. The following figures summarize our structure for a hypothetical webdoc HBase table storing a large collection of web documents. Each document is indexed by its URL (which is therefore the key of the highest-level map). A row is itself a local map featuring a fixed number of keys defined by the family names (f1, f2, etc.), associated with values which are themselves maps indexed by columns. Finally, column values are versioned, and represented by a timestamp-indexed map. Columns and timestamps do not obey a global schema: they are defined on a per-row basis. The columns may vary arbitrarily from one row to another, and so do the timestamps for columns.
The multi-map structure of an HBase table can thus be summarized as
key -> family -> column -> timestamp -> value
It should be clear that the intuitive meaning of common concepts such as "table", "row", and
"column" must be revisited when dealing with HBase data. In particular, considering HBase as a
kind of large relational database is clearly a misleading option. HBase is essentially a key-value
store with efficient indexing on key access, a semi-structured data model for value
representation, and range search capabilities supported by key ordering.
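The nested structure key -> family -> column -> timestamp -> value can be mimicked with ordinary sorted maps. The sketch below is plain Java, only a mental model and not how HBase actually stores data; it also shows why sorting on the row key makes range queries over keys cheap.

import java.util.TreeMap;

public class HBaseRowModel {
    // row key -> (family -> (column -> (timestamp -> value)))
    static final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> TABLE = new TreeMap<>();

    static void put(String key, String family, String column, long ts, String value) {
        TABLE.computeIfAbsent(key, k -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(ts, value);
    }

    public static void main(String[] args) {
        put("http://example.org/page1", "f1", "title", 1L, "Example page");
        put("http://example.org/page1", "f1", "title", 2L, "Example page (updated)");
        // The outer map is sorted on the row key, which is what makes range
        // queries over keys possible and efficient (e.g., all pages of one site).
        System.out.println(TABLE.subMap("http://example.org/", "http://example.org0"));
    }
}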

6.7 HBASE : AN ON-DISK COLUMN STORAGE FORMAT


Google and Amazon are prominent examples of companies that realized the value of data and
started developing solutions to fit their needs. For instance, in a series of technical publications,
Google described a scalable storage and processing system based on commodity hardware.
These ideas were then implemented outside of Google as part of the open source Hadoop project:
HDFS and MapReduce.
Hadoop excels at storing data of arbitrary, semi-, or even unstructured formats, since it lets you
decide how to interpret the data at analysis time, allowing you to change the way you classify the
data at any time: once you have updated the algorithms, you simply run the analysis again.
Hadoop also complements existing database systems of almost any kind. It offers a limitless pool
into which one can sink data and still pull out what is needed when the time is right. It is
optimized for large file storage and batch-oriented, streaming access. This makes analysis easy
and fast, but users also need access to the final data, not in batch mode but using random access
—this is akin to a full table scan versus using indexes in a database system.

We are used to querying databases when it comes to random access for structured data.
RDBMSes are the most prominent, but there are also quite a few specialized variations and
implementations, like object-oriented databases. Most RDBMSes strive to implement Codd’s 12 rules, which forces them to comply with very rigid requirements. The architecture used underneath is well researched and has not changed significantly in quite some time. The recent advent of different approaches, like column-oriented or massively parallel processing (MPP) databases, has shown that we can rethink the technology to fit specific workloads, but most solutions still
implement all or the majority of Codd’s 12 rules in an attempt to not break with tradition.

6.7.1 Column-Oriented Database


Column-oriented databases save their data grouped by columns. Subsequent column values are
stored contiguously on disk. This differs from the usual row-oriented approach of traditional
databases, which store entire rows contiguously—see Fig. 6.4 for a visualization of the different
physical layouts.

The reason to store values on a per-column basis instead is based on the assumption that, for
specific queries, not all of the values are needed. This is often the case in analytical databases in
particular, and therefore they are good candidates for this different storage schema.

Reduced I/O is one of the primary reasons for this new layout, but it offers additional advantages
playing into the same category: since the values of one column are often very similar in nature or
even vary only slightly between logical rows, they are often much better suited for compression
than the heterogeneous values of a row-oriented record structure; most compression algorithms
only look at a finite window.

Specialized algorithms—for example, delta and/or prefix compression—selected based on the type of the column (i.e., on the data stored) can yield huge improvements in compression ratios. Better ratios result in more efficient bandwidth usage.
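As a toy illustration of why per-column storage compresses well, the sketch below delta-encodes a column of timestamps. The numbers and the encoding are assumptions chosen for illustration, not the codec any particular system uses.

import java.util.Arrays;

public class DeltaEncodeColumn {
    // Store the difference to the previous value instead of the absolute value:
    // neighbouring column values are similar, so the deltas are small and repetitive,
    // which a general-purpose compressor can exploit far better.
    static long[] deltaEncode(long[] column) {
        long[] deltas = new long[column.length];
        long previous = 0;
        for (int i = 0; i < column.length; i++) {
            deltas[i] = column[i] - previous;
            previous = column[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        long[] timestamps = {1700000000L, 1700000003L, 1700000007L, 1700000010L};
        System.out.println(Arrays.toString(deltaEncode(timestamps))); // [1700000000, 3, 4, 3]
    }
}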
The speed at which data is created today is already far greater than it was only a few years back. We can take for granted that this will only increase further, and with the rapid pace of globalization the problem is only exacerbated. Websites like Google, Amazon, eBay, and Facebook now reach the majority of people on this planet. The term planet-size web application comes to mind, and in this case it is fitting.

Facebook, for example, is adding more than 15 TB of data into its Hadoop cluster every day and is subsequently processing it all. One source of this data is click-stream logging, saving every step a user performs on its website, or on sites that use the social plug-ins offered by Facebook. This is an ideal case in which batch processing to build machine learning models for predictions and recommendations is appropriate.
Facebook also has a real-time component, which is its messaging system, including chat, wall posts, and email. This amounts to 135+ billion messages per month, and storing this data over a certain number of months creates a huge tail that needs to be handled efficiently. Even though larger parts of emails—for example, attachments—are stored in a secondary system, the amount of data generated by all these messages is mind-boggling. If we were to take 140 bytes per message, as used by Twitter, it would total more than 17 TB every month. Even before the transition to HBase, the existing system had to handle more than 25 TB a month.
In addition, less web-oriented companies from across all major industries are collecting an ever-increasing amount of data. For example:
1. Financial: such as data generated by stock tickers
2. Bioinformatics: such as the Global Biodiversity Information Facility or the genomes of various species
3. Smart grid: such as the OpenPDC project
4. Sales: such as the data generated by point-of-sale (POS) or stock/inventory systems
5. Cellular services, military, environmental: all of which collect a tremendous amount of data as well
Storing petabytes of data efficiently so that updates and retrieval are still performed well is no
easy feat. We will now look deeper into some of the challenges.

6.8 CASSANDRA : INTRODUCTION


Apache Cassandra is a highly scalable, high-performance distributed database designed to handle
large amounts of data across many commodity servers, providing high availability with no single
point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database
does.

NoSQL Database
A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
The primary objectives of a NoSQL database are
• simplicity of design,
• horizontal scaling, and
• finer control over availability.
NoSQL databases use different data structures compared to relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.

Besides Cassandra, we have the following NoSQL databases that are quite popular:
 Apache HBase - HBase is an open source, non-relational, distributed database
modeled after Google’s BigTable and is written in Java. It is developed as a part of
Apache Hadoop project and runs on top of HDFS, providing BigTable-like
capabilities for Hadoop.
 MongoDB - MongoDB is a cross-platform document-oriented database system that
avoids using the traditional table-based relational database structure in favor of
JSON-like documents with dynamic schemas making the integration of data in certain
types of applications easier and faster.
6.8.1 What is Apache Cassandra?

Apache Cassandra is an open source, distributed and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra:
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
• Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model. Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.

The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.

All the nodes in a cluster play the same role. Each node is independent and at the same time
interconnected to other nodes. Each node in a cluster can accept read and write requests,
regardless of where the data is actually located in the cluster. When a node goes down, read/write
requests can be served from other nodes in the network.

6.9 FEATURES OF CASSANDRA

Cassandra has become so popular because of its outstanding technical features. Given below are
some of the features of Cassandra:

1. Elastic scalability - Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.
2. Always-on architecture - Cassandra has no single point of failure, and it is continuously available for business-critical applications that cannot afford a failure.
3. Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
4. Flexible data storage - Cassandra accommodates all possible data formats including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
5. Easy data distribution - Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.
6. Transaction support - Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
7. Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing read efficiency.

The database is named after Cassandra of Greek mythology, the daughter of King Priam (Priamos) and Queen Hecuba (Hekabe), fraternal twin sister of Helenus and a princess of Troy. Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.

One of the largest production deployments is Apple's, with over 75,000 nodes storing over 10 PB of data. Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), the Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).

6.11 DATA REPLICATION IN CASSANDRA


In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it
is detected that some of the nodes responded with an out-of-date value, Cassandra will return the
most recent value to the client. After returning the most recent value, Cassandra performs a read
repair in the background to update the stale values.

The following figure shows a schematic view of how Cassandra uses data replication among the nodes in a cluster to ensure there is no single point of failure.

6.12 COMPONENTS OF CASSANDRA


The key components of Cassandra are as follows –
1. Node − It is the place where data is stored.
2. Data center − It is a collection of related nodes.
3. Cluster − A cluster is a component that contains one or more data centers.
4. Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
5. Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
6. SSTable − It is a disk file to which the data is flushed from the mem-table when its contents
reach a threshold value.
7. Bloom filter − These are nothing but quick, nondeterministic algorithms for testing whether an element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query (see the sketch below).
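To make the idea behind Bloom filters concrete, here is a minimal, generic sketch in Java (a bit array plus a few derived hash positions). It is not Cassandra's actual implementation, only an illustration of the principle; the sizes and hash derivation are assumptions.

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive several bit positions for a key from two base hash values.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(index(key, i));
    }

    // May return a false positive, but never a false negative.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(key, i))) return false; // definitely not present
        return true;                                    // possibly present
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("row-42");
        System.out.println(filter.mightContain("row-42")); // true
        System.out.println(filter.mightContain("row-99")); // false (almost certainly)
    }
}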

6.14 CASSANDRA DATA MODEL


The data model of Cassandra is significantly different from what we normally see in an RDBMS.
This chapter provides an overview of how Cassandra stores its data.
Cluster
Cassandra database is distributed over several machines that operate together. The outermost
container is known as the Cluster. For failure handling, every node contains a replica, and in case
of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster, in a ring format,
and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a keyspace in Cassandra are –
• Replication factor: It is the number of machines in the cluster that will receive copies of the same data.
• Replica placement strategy: It is nothing but the strategy to place replicas in the ring. We have strategies such as simple strategy (rack-unaware strategy), old network topology strategy (rack-aware strategy), and network topology strategy (datacenter-shard strategy).
• Column families: A keyspace is a container for a list of one or more column families. A column family, in turn, is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
The syntax of creating a Keyspace is as follows
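The figure with the exact syntax is not reproduced here. As a rough sketch (assuming the DataStax Java driver 3.x and a node running on localhost; the keyspace and data-center names are illustrative, and the same CQL statements can also be typed directly in cqlsh), a keyspace can be created like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateKeyspaceExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Evaluation-style keyspace: SimpleStrategy with a single replica.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS demo "
              + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};");
            // Production-style keyspace: NetworkTopologyStrategy with per-data-center
            // replication factors (data center names here are assumptions).
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS prod_app "
              + "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};");
        }
    }
}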

The following illustration shows a schematic view of a Keyspace.


Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns. The following table lists the points that differentiate a column family from a table of relational databases.

A Cassandra column family has the following attributes –


1. keys_cached − It represents the number of locations to keep cached per SSTable.
2. rows_cached − It represents the number of rows whose entire contents will be cached in
memory.
3. preload_row_cache − It specifies whether you want to pre-populate the row cache.

The following figure shows an example of a Cassandra column family.


Column
A column is the basic data structure of Cassandra with three values, namely key or column
name, value, and a time stamp. Given below is the structure of a column.
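Since the original figure is not reproduced here, the three parts of a column can be sketched as a simple record (the field names are illustrative; internally Cassandra works with byte buffers rather than strings):

public class ColumnSketch {
    String name;    // key / column name
    byte[] value;   // column value
    long timestamp; // write time, used to pick the most recent version

    ColumnSketch(String name, byte[] value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }
}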

SuperColumn
A super column is a special column; therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. Given below is the structure of a super column.

6.15 DATA MODELS OF CASSANDRA AND RDBMS


The following table lists the points that differentiate the data model of Cassandra from that of an RDBMS.
Cassandra can be accessed using cqlsh as well as drivers for different languages. This chapter explains how to set up both the cqlsh and Java environments to work with Cassandra.

7.3 CQL Data Types

Refer to Table 7.3 for the built-in data types for columns in CQL.
7.4 CQLSH
7.4.1 Logging into cqlsh
The screenshot below depicts the cqlsh command prompt after a successful login using cqlsh.
7.5 Keyspaces
What is a keyspace?
A keyspace is a container to hold application data. It is comparable to a relational database. It is
used to group column families together. Typically, a cluster has one keyspace per application.
Replication is controlled on a per-keyspace basis. Therefore, data that has different replication requirements should reside in different keyspaces.

When one creates a keyspace, it is required to specify a strategy class. There are two choices available: either the “SimpleStrategy” or the “NetworkTopologyStrategy” class. When using Cassandra for evaluation purposes, go with the “SimpleStrategy” class; for production usage, work with the “NetworkTopologyStrategy” class.
MAP REDUCE
7.1 INTRODUCTION TO MAPREDUCE

MapReduce is a programming model and an associated implementation for processing and


generating large data sets with a parallel, distributed algorithm on a cluster.[1][2] Conceptually
similar approaches have been very well known since 1995 with the Message Passing Interface
[3] standard having reduce [4] and scatter operations.
A MapReduce program is composed of a Map() procedure (method) that performs filtering and
sorting (such as sorting students by first name into queues, one queue for each name) and a
Reduce() method that performs a summary operation (such as counting the number of students in
each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure"
or "framework") orchestrates the processing by marshalling the distributed servers, running the
various tasks in parallel, managing all communications and data transfers between the various
parts of the system, and providing for redundancy and fault tolerance.
The model is inspired by the map and reduce functions commonly used in functional
programming, although their purpose in the MapReduce framework is not the same as in their
original forms. The key contributions of the MapReduce framework are not the actual map and
reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by
optimizing the execution engine once. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only
when the optimized distributed shuffle operation (which reduces network communication cost)
and fault tolerance features of the MapReduce framework come into play. Optimizing the
communication cost is essential to a good MapReduce algorithm.
MapReduce libraries have been written in many programming languages, with different levels
of optimization. A popular open-source implementation that has support for distributed shuffles
is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google
technology, but has since been genericized. By 2014, Google was no longer using MapReduce as its big data processing model, and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.

7.1.1 Overview
MapReduce is a framework for processing parallelizable problems across huge datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous hardware).
Processing can occur on data stored either in a filesystem (unstructured) or in a database
(structured). MapReduce can take advantage of locality of data, processing it on or near the
storage assets in order to reduce the distance over which it must be transmitted.
 "Map" step: Each worker node applies the "map()" function to the local data, and writes
the output to a temporary storage. A master node orchestrates that for redundant copies of
input data, only one is processed.
 "Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker
node.
 "Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that
each mapping operation is independent of the others, all maps can be performed in parallel –
though in practice this is limited by the number of independent data sources and/or the number of
CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided
that all outputs of the map operation that share the same key are presented to the same reducer at
the same time, or that the reduction function is associative. While this process can often appear
inefficient compared to algorithms that are more sequential, MapReduce can be applied to
significantly larger datasets than "commodity" servers can handle – a large server farm can use
MapReduce to sort a petabyte of data in only a few hours.[12] The parallelism also offers some
possibility of recovering from partial failure of servers or storage during the operation: if one
mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.

Another way to look at MapReduce is as a 5-step parallel and distributed computation:


1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the
input key value K1 that each processor would work on, and provides that processor with all the
input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value,
generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates
Reduce processors, assigns the K2 key value each processor should work on, and provides that
processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value
produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output, and sorts it
by K2 to produce the final outcome. These five steps can be logically thought of as running in
sequence – each step starts only after the previous step is completed – although in practice they
can be interleaved as long as the final result is not affected. In many situations, the input data
might already be distributed ("sharded") among many different servers, in which case step 1
could sometimes be greatly simplified by assigning Map servers that would process the locally
present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors
that are as close as possible to the Map-generated data they need to process.
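The data flow of these five steps can be traced with a small, single-process word-count sketch. This is plain Java and purely illustrative; a real MapReduce system runs the map and reduce calls on many machines in parallel, and the input documents here are arbitrary examples.

import java.util.*;

public class MiniMapReduceTrace {
    public static void main(String[] args) {
        // Step 1: prepare the Map() input; here K1 is a document id.
        Map<Integer, String> input = Map.of(1, "big data", 2, "big tables");

        // Step 2: run Map() once per K1, emitting (K2 = word, V2 = 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String doc : input.values())
            for (String word : doc.split(" "))
                mapped.add(Map.entry(word, 1));

        // Step 3: shuffle; group every value that shares the same K2.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : mapped)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());

        // Step 4: run Reduce() once per K2 key value.
        Map<String, Integer> reduced = new HashMap<>();
        grouped.forEach((word, ones) -> reduced.put(word, ones.size()));

        // Step 5: produce the final output, sorted by K2.
        System.out.println(new TreeMap<>(reduced)); // {big=2, data=1, tables=1}
    }
}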

7.2 HOW MAPREDUCE WORKS?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.
 Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as input
and applies a user defined code to aggregate the values in a small scope of one mapper. It
is not a part of the main MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the final
key value pairs from the Reducer function and writes them onto a file using a record
writer.

Let us try to understand the two tasks Map & Reduce with the help of a small diagram –
7.2.1 MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 6000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions –

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.

7.2.2 Hadoop’s MapReduce


MapReduce applications can process large data sets in parallel by using a large number of
computers, known as clusters.
In this programming paradigm, applications are divided into self-contained units of work. Each
of these units of work can be run on any node in the cluster. In a Hadoop cluster, a MapReduce
program is known as a job. A job is run by being broken down into pieces, known as tasks.
These tasks are scheduled to run on the nodes in the cluster where the data exists.
Applications submit jobs to a specific node in a Hadoop cluster, which is running a program
known as the JobTracker. The JobTracker program communicates with the NameNode to
determine where all of the data required for the job exists across the cluster. The job is then
broken into map tasks and reduce tasks for each node in the cluster to work on. The JobTracker
program attempts to schedule tasks on the cluster where the data is stored, rather than sending
data across the network to complete a task. The MapReduce framework and the Hadoop
Distributed File System (HDFS) typically exist on the same set of nodes, which enables the
JobTracker program to schedule tasks on nodes where the data is stored.
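A minimal sketch of such a job, written against the stock org.apache.hadoop.mapreduce API (the class names and the word-count logic are illustrative, not taken from the original text), looks like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: tokenize each line and emit (word, 1) pairs.
    public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizeMapper.class);
        job.setCombinerClass(SumReducer.class); // optional local reducer
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit the job and wait
    }
}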

7.2.3 JobTracker
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes
in the cluster, ideally the nodes that have the data, or at least are in the same rack.
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, or it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all
running jobs are halted.
As the name MapReduce implies, the reduce task is always completed after the map task. A
MapReduce job splits the input data set into independent chunks that are processed by map tasks,
which run in parallel. These bits, known as tuples, are key/value pairs. The reduce task takes the
output from the map task as input, and combines the tuples into a smaller set of tuples.
A set of programs that run continuously, known as TaskTracker agents, monitor the status of
each task. If a task fails to complete, the status of that failure is reported to the JobTracker
program, which reschedules the task on another node in the cluster.
This distribution of work enables map tasks and reduce tasks to run on smaller subsets of larger
data sets, which ultimately provides maximum scalability. The MapReduce framework also
maximizes parallelism by manipulating data stored across multiple clusters. MapReduce
applications do not have to be written in Java™, though most MapReduce programs that run
natively under Hadoop are written in Java.
Hadoop MapReduce runs on the JobTracker and TaskTracker framework from Hadoop version
1.1.1, and is integrated with the new Common and HDFS features from Hadoop version 2.2.0.
The MapReduce APIs used in InfoSphere® BigInsights™ are compatible with Hadoop version
2.2.0.
7.3 MAP OPERATIONS
7.3.1 What is a Map operation?
Doing something to every element in an array is a common operation:
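The code that originally followed this sentence is not reproduced here; a minimal Java sketch of the idea (squaring every element, chosen arbitrarily for illustration) is:

import java.util.Arrays;

public class ArrayMapExample {
    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        // "Map": apply the same function to every element, producing a new array.
        int[] squares = Arrays.stream(values).map(v -> v * v).toArray();
        System.out.println(Arrays.toString(squares)); // [1, 4, 9, 16]
    }
}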
7.3.2 What is a Reduce Operation?
A MapReduce operation has the following characteristic features:
 Submitting a MapReduce job
 Distributed Mergesort Engine
 Two fundamental data types
 Fault tolerance
 Scheduling
 Task execution
Another common operation on arrays is to combine all their values:
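Again, the original snippet is not reproduced here; a minimal sketch (summing the elements, as one possible way of combining them) is:

import java.util.Arrays;

public class ArrayReduceExample {
    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        // "Reduce": repeatedly apply a binary operation to fold all values into one result.
        int sum = Arrays.stream(values).reduce(0, (a, b) -> a + b);
        System.out.println(sum); // 10
    }
}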
7.3.3 Submitting a MapReduce Job

Review Questions:
NoSQL:
1. Explain in detail the categories of NoSQL databases -12M
2. What is NoSQL? Explain the benefits of NoSQL -5M
3. Write the differences between NoSQL and RDBMS -5M

MongoDB:
1. What is MongoDB? Why is it required? -8M
2. Explain the process of replication and sharding in MongoDB -6M
3. Write the following MongoDB commands with example: 2M*3=6M
i) create database
ii) drop database
iii) show
4. Differentiate with example CRUD operations in RDBMS and MongoDB -8M

HBase and Cassandra:


1. What is HBase? Explain the architecture of HBase -8M
2. Explain in detail how HBase can be represented in key-value and multi-map structures -8M
3. Differentiate between: i) column-oriented data stores and row-oriented data stores
ii) RDBMS and HBase
iii) HBase and HDFS -5M+3M+3M
4. What is Apache Cassandra? Explain its features? -6M
5. How Data replicated in Cassandra? Explain its components -3M+6M
6. What is key space and mention its attributes? -4M
7. How to define keyspace? Write a note on relational table and Cassandra column family - 6M
8. Differentiate between HBase and Cassandra -6M
9. Describe CQL data types-6M
10. Perform following operations with reference to CQLSH -2M*6=12M
i) Create Keyspace for Student
ii)To describe existing Keyspace Student
iii)To get all the information on the Student keyspace
iv)To use keyspace Student
v)To create column family student_info for Student keyspace
vi)To describe column family student_info

MapReduce:
1. What is MapReduce? Explain the 5 steps of parallel and distributed computation -6M
2. Explain the working of MapReduce -8M
3. Explain with a diagram the MapReduce job submission task -8M

CRUD Operations in Cassandra - Extra Topic
