NoSQL Big Data Management,
MongoDB and Cassandra
Manjunath G.S.
Asst. Professor
Dept. of ISE, BNMIT
Learning Objectives
After studying this chapter, you will be able to:
Understand NoSQL data stores, Big Data solutions, and schema-less models.
Describe NoSQL data architecture patterns.
Explain NoSQL data store management, applications, and the handling of problems.
Solve Big Data analytics problems using shared-nothing architecture.
Apply MongoDB databases and query commands.
Use Cassandra databases, the data model, and clients, and integrate them with
Hadoop.
Introduction
Big Data uses distributed systems.
The features of distributed computing are listed below:
Increased reliability and fault tolerance
Flexibility
Sharding
Speed
Resource sharing
Scalability
Open system
Performance
The demerits of distributed computing are listed below:
Issues in troubleshooting in a larger networking infrastructure
Additional software requirements
Security risks for data and resources
Key terms used in Database systems
Class
Object
Tuple
Transaction
Database transaction model
MySQL
Oracle
DB2
Sybase
MySQL server
NOSQL Data Store
SQL is a programming language based on relational algebra.
SQL is used to create and manage databases in RDBMSs.
SQL databases provide triggers, views, schedules, and joins.
A NoSQL data store stores and handles Big Data and provides high
availability, which means serving many concurrent users.
NoSQL data stores are also non-relational, meaning that they do not guarantee
ACID properties and do not adhere to a fixed schema.
NoSQL
• NoSQL is a new approach of thinking about databases, such as:
flexibility
dynamic schemas
auto sharding
simple relationships
integrated caching
semi-structured data and flexibility in approach
replication
• Issues with NoSQL data stores are the lack of standardization, processing
difficulties for large queries, and maintaining consistency in all states.
Big Data NoSQL
NoSQL data store characteristics are as follows:
It is a class of non-relational data storage system with flexible data model.
A NoSQL data store does not necessarily have a fixed schema.
Features in NoSQL transactions
Relax one or more of the ACID properties.
Are characterized by two of the three properties of the CAP theorem
(consistency, availability, partition tolerance).
Can be characterized by BASE properties.
CAP Theorem for Big Data Solutions
Schema-less Models
A schema-less database, like MongoDB, does not impose up-front schema
constraints, mapping to a more ‘natural’ database.
Even when sitting on top of a data lake, each document is created with a
partial schema to aid retrieval.
Any formal schema is applied in the code of your applications; this layer of
abstraction protects the raw data in the NoSQL database and allows for rapid
transformation as your needs change.
Any data, formatted or not, can be stored in a non-tabular NoSQL type of
database.
At the same time, using the right tools in the form of a schema-less database
can unlock the value of all your structured and unstructured data types.
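The idea that the formal schema lives in application code can be sketched as follows. This is a minimal illustration in plain Python, not MongoDB's API: the documents, the CUSTOMER_SCHEMA mapping, and the apply_schema helper are all hypothetical names introduced for this example.

```python
# Raw documents in one collection need not share the same fields.
documents = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]},
]

# Hypothetical application-level schema: field name -> expected type.
CUSTOMER_SCHEMA = {"name": str, "email": str, "tags": list}


def apply_schema(doc, schema):
    """Project a raw document onto the schema; missing fields become None."""
    shaped = {}
    for field, expected_type in schema.items():
        value = doc.get(field)
        if value is not None and not isinstance(value, expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
        shaped[field] = value
    return shaped


# The raw documents stay untouched; the application sees a uniform shape.
customers = [apply_schema(doc, CUSTOMER_SCHEMA) for doc in documents]
```

If the application later needs a different shape, only CUSTOMER_SCHEMA changes; the stored documents do not have to be migrated.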
Benefits
Greater flexibility over data types
No pre-defined database schemas
No data truncation
Suitable for real-time analytics functions
Enhanced scalability and flexibility
Increasing Flexibility for Data Manipulation
NoSQL data stores offer increased flexibility for data manipulation.
The provisions of BASE increase flexibility:
BA – Basically available
S – Soft state
E – Eventual consistency
NoSQL Data Architecture Patterns
1. Key-Value Store
The simplest way to implement a schema-less data store is to use key-value
pairs.
The data store characteristics are high performance, scalability, and
flexibility.
A key-value store uses a primary key to access the values.
The concept is similar to a hash table, where a unique key points to a
particular item or items of data.
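The hash-table analogy can be made concrete with a small sketch. The KeyValueStore class below is a hypothetical in-memory illustration built on a Python dict (itself a hash table); it is not the interface of any particular NoSQL product.

```python
class KeyValueStore:
    """Toy key-value store: a unique key points to an opaque value."""

    def __init__(self):
        self._data = {}  # a Python dict is a hash table

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        existed = key in self._data
        self._data.pop(key, None)
        return existed


store = KeyValueStore()
store.put("user:1", {"name": "Asha"})      # the value can be a structure
store.put("image:7", b"\x89PNG")           # or raw bytes
profile = store.get("user:1")
```

The store never inspects the value, which is why any data type can be placed in the value field.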
1. Key-Value Store Advantages
Data store can store any data type in a value field.
A query just requests the values and returns the values as a single item.
A key-value store is eventually consistent.
A key-value data store may be hierarchical or may be an ordered key-value store.
Values returned by queries can be used to convert from one form to another.
It is scalable, reliable, and portable, and has a low operational cost.
The key can be synthetic or auto-generated.
2. Document Store
It stores unstructured data
Storage is similar to an object store
Data is stored in nested hierarchies
Querying is easy
Transactions on a document store exhibit ACID properties
Typical uses of a document store are office documents, inventory stores, forms
data, document exchange, and document search
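A nested document and a query on one of its inner fields might look like the sketch below. The get_path and find helpers and the sample orders are hypothetical; real document stores provide their own query languages for this.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'customer.address.city' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, value):
    """Return the documents whose field at `path` equals `value`."""
    return [doc for doc in collection if get_path(doc, path) == value]


orders = [
    {"_id": 1,
     "customer": {"name": "Asha", "address": {"city": "Bengaluru"}},
     "items": [{"sku": "A1", "qty": 2}]},
    {"_id": 2,
     "customer": {"name": "Ravi", "address": {"city": "Mysuru"}},
     "items": []},
]

matches = find(orders, "customer.address.city", "Mysuru")
```

Each order carries its customer and items inside one document, so the query needs no join.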
3. Tabular Data
Tabular data store uses rows and columns.
Oracle DBs provide both options: columnar and row format storages.
A relational data store uses in-memory row-based data (OLTP).
In-memory column-based data has the keys in the first column of each row at
successive memory addresses (OLAP).
An in-memory column-based DB stores a column as consecutive memory or disk
entries.
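The difference between the two layouts can be sketched with the same small table held both ways. The sample rows and the to_columns helper are assumptions made for illustration only.

```python
# Row-based layout: each record is stored together (suits OLTP lookups).
rows = [
    {"id": 1, "region": "south", "amount": 120.0},
    {"id": 2, "region": "north", "amount": 80.0},
    {"id": 3, "region": "south", "amount": 50.0},
]


def to_columns(rows):
    """Pivot row records into one consecutive list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-based layout: each column is stored together (suits OLAP scans).
columns = to_columns(rows)

record = rows[1]                        # OLTP-style: fetch one whole record
total_amount = sum(columns["amount"])   # OLAP-style: scan a single column
```

The aggregate touches only the amount column, which is the reason columnar storage favours analytical queries.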
3.1 Column Family Store
Column Data Store: one way to implement a schema is to divide the data into
columns.
Column-Family Data Store: a group of columns forms a column family.
Sparse Column Fields: a row may be associated with a large number of columns
but contain values in only a few column fields.
Grouping of Column Families: two or more column families in a data store
form a super group.
Grouping into Rows: when the number of rows is very large, the rows are
partitioned, and each partition forms one row group.
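Column families and sparse columns can be pictured as nested maps. The ColumnFamilyStore class below is a hypothetical sketch, not the storage layout of any specific product: each row key maps to column families, and each family holds only the columns that actually have values.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy store: row key -> column family -> column name -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        return self._rows.get(row_key, {}).get(family, {}).get(column, default)

    def get_family(self, row_key, family):
        """Return a copy of all columns stored for one family of one row."""
        return dict(self._rows.get(row_key, {}).get(family, {}))


store = ColumnFamilyStore()
store.put("user1", "profile", "name", "Asha")
store.put("user1", "profile", "city", "Bengaluru")
store.put("user2", "profile", "name", "Ravi")          # no city stored
store.put("user2", "activity", "last_login", "2024-01-05")
```

Row user2 simply has no city column; nothing is stored for the empty field, which is what makes sparse rows cheap.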
3.1 Characteristics of Columnar DS
Scalability
Partitionability
Availability
Tree-like columnar structure
Adding new data at ease
Querying all the field values
Replication of columns
No optimization for join
3.2 BigTable Data Store
Bigtable is ideal for the applications that need very high throughput and
scalability for key/value data.
Bigtable is used to store and query the following types of the data:
Time-series data
Marketing data
Financial data
IoT data
Graph data
3.3 RC, ORC, and Parquet File Formats
RC – The Record Columnar file format is suitable for intermediate tables in a
fast column-family store in HDFS with Hive.
ORC – An Optimized Row Columnar file consists of row-group data called stripes,
which enables concurrent reads of the same file using separate RecordReaders.
Parquet is a nested, hierarchical columnar storage format.
This format is designed to separate the metadata from the data. It also allows
splitting columns into multiple files while having a single metadata file.
4. Object Data Store
An object store is a repository that stores:
Objects
System metadata
Custom metadata
Metadata finds the relationships among the objects, maps the object relations
and trends.
A single file domain may contain multiple Object Stores.
Supporting APIs
An object data store consists of functions supporting APIs for:
Scalability
Indexing
Large collections
Query language, processing and optimization
Transactions
Data replication for high availability, data distribution model, data integration
Schema evolution
Persistency
Persistent object life cycle
Adding modules
Locking and caching strategy
Object relational Mapping
5. Graph Database
One way to implement a data store is to use a graph database.
Any number of nodes and any number of edges can be added to expand a
graph.
Data store focuses on modeling interconnected structure of data.
Graph databases enable fast network searches.
Data store uses graphs with nodes and edges connecting each other through
relations, associations, and properties.
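A graph of nodes, labelled edges, and properties, together with a simple network search, can be sketched as below. The Graph class and the sample data are hypothetical; a breadth-first search stands in for the much more capable traversals of real graph databases.

```python
from collections import deque


class Graph:
    """Toy directed property graph."""

    def __init__(self):
        self.node_properties = {}
        self.edges = {}  # node -> list of (relation, neighbour)

    def add_node(self, node, **properties):
        self.node_properties.setdefault(node, {}).update(properties)
        self.edges.setdefault(node, [])

    def add_edge(self, source, relation, target):
        self.add_node(source)
        self.add_node(target)
        self.edges[source].append((relation, target))

    def shortest_path(self, start, goal):
        """Breadth-first search; returns a list of nodes or None."""
        if start not in self.edges:
            return None
        queue = deque([[start]])
        visited = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for _, neighbour in self.edges[path[-1]]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(path + [neighbour])
        return None


graph = Graph()
graph.add_node("asha", kind="person")
graph.add_edge("asha", "follows", "ravi")
graph.add_edge("ravi", "follows", "meena")
graph.add_edge("asha", "likes", "post1")
path = graph.shortest_path("asha", "meena")
```

New nodes and edges are added without changing any existing structure, which matches the point that a graph can be expanded freely.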
Characteristics of graph databases are:
Use specialized query languages; for example, RDF data is queried using SPARQL.
Create a database system which models the data in a completely different
way.
Can have hyper-edges.
Consist of a collection of small data size records.
Have poor scalability across multiple servers.
6. Variations of NoSQL
Six Data architectures are
SQL-table
Key-value pairs
In-memory
Column-family
Document
Graph and object
NoSQL to manage Big Data
By improving our ability to extract knowledge and insights from large and
complex collections of digital data, the initiative promises to help solve some
of the Nation’s most pressing challenges.
Have you ever wanted to analyze a large amount of data gathered
from log files or files you’ve found on the web?
NoSQL data store management, applications and handling problems in Big
Data.
Using NoSQL to manage Big Data
NoSQL
Limits the support for join queries and supports sparse column families.
Is characterized by easy creation, high processing speed, scalability, and the
ability to store a much higher magnitude of data.
Supports CAP and BASE properties.
Scales horizontally as well as vertically.
NoSQL Solutions for Big Data
The characteristics of Big Data NoSQL solution are:
High and easy scalability
Support for replication
Distributable
Use of NoSQL servers
Use of open-source tools
Support for schema-less data model
Support for integrated caching
No inflexibility
Big Data Use cases
Some typical big data use cases are:
Bulk image processing
Public web page data
Remote sensor data
Event log data
Cell phone data
Social media data
Game data
Types of Big Data Problems
Big Data problems arise due to the limitations of NoSQL and other DBs:
Big Data needs scalable storage and the use of distributed servers together as
a cluster.
NoSQL database is open source (it is positive as well as negative)
No stored procedures in MongoDB
GUI-mode tools for accessing the data store are limited
Lack of standardization
NoSQL data stores sacrifice ACID compliance for flexibility and processing speed.
Comparison of NoSQL with SQL/RDBMS
Features                    NoSQL Data Store                        SQL/RDBMS
Model                       Schema-less                             Relational
Schema                      Dynamic schema                          Predefined
Data architecture patterns  Key-value based, column-family based    Table based
Scalable                    Horizontally                            Vertically
Use of SQL                  No                                      Yes
Dataset size preference     Prefers large datasets                  Large datasets not preferred
Consistency                 Variable                                Strong
Vendor support              Open source                             Strong
ACID properties             No                                      Strictly follows
Shared Nothing Architecture
In an RDBMS, the columns of two tables are related by a relationship.
Keys are shared between two or more SQL tables.
Shared nothing (SN) is a cluster architecture in which a node does not share
data with any other node.
A Big Data store uses the SN architecture.
Each node independently processes the queries of the different users on its
own partition of the data.
A coordination protocol controls the processing at all SN nodes.
The data of the different data stores is partitioned among the nodes.
Examples of systems that use this partitioning and processing are Hadoop,
Flink, and Spark.
The features of SN architecture are as follows:
Independence
Self-Healing
Each node functioning as a shard
No network contention
Choosing the Distribution Models
Data needs to be distributed across multiple data nodes in a cluster.
Distributed software components give the advantage of parallel processing
(horizontal scalability).
Distribution provides:
Ability to handle large-sized data
Processing of many read and write operations simultaneously in an application
A resource manager manages, allocates, and schedules the resources of each
processor, memory, and network connection.
Distribution increases availability when a network slows or a link fails.
1. Single Server Model
The simplest distribution option for NoSQL data storage and access is Single
Server Distribution (SSD) of an application.
A graph database processes the relationships between nodes at a server.
An application processes the data sequentially on a single server.
The process and datasets are distributed to a single server, which runs the
application.
2. Sharding Very Large Databases
The figure below shows sharding of a very large dataset into four divisions,
each running the application on one of four different servers in the cluster.
This model runs as per the SN architecture: the application process runs on
multiple shards in parallel.
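Sharding into four divisions with parallel processing can be sketched as follows. This is an illustration under stated assumptions: a CRC32 hash of the key chooses the shard, worker threads stand in for the four servers, and the names shard_for, build_shards, and shard_total are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4


def shard_for(key):
    """Pick a shard from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def build_shards(records):
    """Split (key, value) records into NUM_SHARDS disjoint lists."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for key, value in records:
        shards[shard_for(key)].append((key, value))
    return shards


def shard_total(shard):
    """Work done locally on one shard, with no access to the others."""
    return sum(value for _, value in shard)


records = [(f"user{i}", i) for i in range(100)]
shards = build_shards(records)

# Each shard is processed independently; only the partial results are combined.
with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    partial_totals = list(pool.map(shard_total, shards))
grand_total = sum(partial_totals)
```

No shard reads another shard's records, which is the shared-nothing property; the final sum is the only coordination step.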
3. Master-Slave Distribution Model
One node serves as the master (primary) node, and the other nodes are slave
nodes.
In the Master-Slave Distribution (MSD) model, data is replicated on multiple
slave servers.
Whenever a process updates the master, the update is also applied to the slaves.
Master-Slave replication: processing performance decreases due to
replication in the MSD model.
Complexity: cluster-based processing has greater complexity than the other
architectures, and consistency can be affected.
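The update path of the MSD model can be sketched as below. The Node and MasterSlaveCluster classes are hypothetical; writes go to the master and are copied synchronously to every slave, while reads are served by the slaves in round-robin order (an assumption made for this example).

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    """Toy master-slave replication with synchronous copies."""

    def __init__(self, num_slaves=2):
        self.master = Node("master")
        self.slaves = [Node(f"slave{i}") for i in range(num_slaves)]
        self._next_reader = 0

    def write(self, key, value):
        # Every write costs one master update plus one update per slave.
        self.master.data[key] = value
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are spread over the slaves, which scales read capacity.
        slave = self.slaves[self._next_reader % len(self.slaves)]
        self._next_reader += 1
        return slave.name, slave.data.get(key)


cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("order:1", "paid")
first_read = cluster.read("order:1")
second_read = cluster.read("order:1")
```

The loop inside write shows why replication slows writes, and the rotating reader shows why the model still scales well for reads.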
4. Peer-to-Peer Distribution Model
PPD model and replication show the following characteristics:
All replica nodes accept read requests and send the responses.
All replicas function equally.
Node failures do not cause loss of write capability.
Cassandra adopts the PPD model; the data is distributed among all the nodes in
a cluster.
Performance can be further enhanced by adding nodes.
Because each replica node also holds the updated data, consistency is achieved.
Choosing MSD Vs PPD
MSD replication provides greater scalability for read operations.
PPD replication provides greater scalability for both read & write operations.
Sharding Combined with Replication
Master-slave replication combined with sharding creates multiple masters.
For each data item, a single master exists.
Peer-to-peer replication combined with sharding uses the same strategy for the
column-family data stores.
Ways to Handle Big Data Problems
The below figure shows four ways of handling Big Data Problems.
MongoDB Database
It is an open-source DBMS.
It manages collection and document data stores.
Its functions query and access the required information.
The functions include viewing, querying, changing, visualizing, and running
the transactions.
Changing includes updating, inserting, appending or deleting.
MongoDB is:
Non-relational
NoSQL
Distributed
Open-source
Document based
Cross-platform
Scalable
Flexible data model
Indexed
Multi-master
Fault-tolerant
Features of MongoDB Database
MongoDB data store
Collection
Document model
Storing of data
Storing of documents
Querying, indexing, and real-time aggregation
Deep query-ability
No complex joins
Distributed DB
Indexes on any fields
Atomic operations on a single document
Fast in-place updates
No configurable cache
Conversion / mapping
Dynamic Schema
Dynamic schema implies that documents in the same collection do not need
to have the same set of fields or structures.
The same field may hold different types of data in different documents.
RDBMS                     MongoDB
Database                  Data store
Table                     Collection
Column                    Key
Value                     Value
Records / Rows / Tuples   Document / Object
Joins                     Embedded documents
Index                     Index
Primary key               Primary key (_id) is the default key provided by MongoDB itself
Replication
Replication ensures high availability in Big Data.
Replication makes DBs fault-tolerant against any database server failure.
MongoDB replicates with the help of a replica set.
A replica set usually has a minimum of three nodes.
Data gets replicated from primary to secondary nodes.
Commands Description
[Link] ( ) To initiate a new replica set
[Link] ( ) To check the replica set configuration
[Link] ( ) To check the status of a replica set
[Link] ( ) To add members to a replica set
Replicated set on creating secondary members
Auto-sharding
Sharding is a method of distributing data across multiple machines in a
distributed application environment.
A single machine may not be adequate to store the data.
Sharding automatically balances the data and load across various servers.
Basically, it splits the dataset and distributes the parts across multiple
DBs, called shards, on different servers.
Each shard stores less data than the full dataset and handles fewer
operations than a single instance holding all the data.
Data types that MongoDB supports:
Double
String
Object
Array
Binary data
Object id
Boolean
Date
Null
Regular expression
Timestamp
Min key
Max key
Rich Queries and Other DB Functionalities
MongoDB offers a rich set of features and functionality compared to those
offered in simple key-value stores.
MongoDB has a complete query language, highly functional secondary
indexes, and a powerful aggregation framework for data analysis.
MongoDB provides functionalities and features for more diverse data types
than a relational DB, and at scale.
Its document-based data model is also a distinct advantage of MongoDB.
Data is stored in the form of BSON (Binary JSON).
Defining the Modern Database
A database must meet three requirements.
Relational databases can manage some of these requirements, and newer
so-called “NoSQL” key-value or wide-column data stores meet others, but only
MongoDB meets all three requirements.
The database MUST scale.
The database MUST adapt to change.
The database MUST unleash your data.
Today's businesses increasingly need a modern database.
Comparison between MongoDB and RDBMS
Features               RDBMS   MongoDB
Rich Data Model        NO      YES
Dynamic Schema         NO      YES
Typed Data             YES     YES
Data Locality          NO      YES
Field Updates          YES     YES
Complex Transactions   YES     NO
Auditing               YES     YES
Horizontal Scaling     NO      YES
MongoDB Query Language and DB Commands
Command                          Functionality
mongo                            Starts the MongoDB shell. The default database in MongoDB is test.
[Link]( )                        Runs help. This displays the list of all the commands.
[Link]( )                        Gets statistics about the MongoDB server.
use <database name>              Switches to the database; it is created when data is first stored in it
db                               Outputs the name of the database currently in use
show dbs                         Gets the list of all the databases
[Link]( )                        Drops a database
[Link] [Link]( )                 Creates a collection using insert( )
db.<collection name>.find( )     Views all documents in a collection
db.<collection name>.update( )   Updates a document
db.<collection name>.remove( )   Deletes a document
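The behaviour of insert, find, update, and remove on a collection can be imitated with a small in-memory sketch. ToyCollection below is a hypothetical stand-in written in plain Python; it is not MongoDB, its shell, or a driver, and its update simply merges the given fields into each matching document, which is a simplification.

```python
import itertools


class ToyCollection:
    """In-memory imitation of a document collection."""

    def __init__(self):
        self._docs = []
        self._ids = itertools.count(1)

    def insert(self, doc):
        stored = dict(doc)
        stored.setdefault("_id", next(self._ids))  # default primary key
        self._docs.append(stored)
        return stored["_id"]

    @staticmethod
    def _matches(doc, query):
        return all(doc.get(key) == value for key, value in query.items())

    def find(self, query=None):
        query = query or {}
        return [dict(doc) for doc in self._docs if self._matches(doc, query)]

    def update(self, query, changes):
        matched = [doc for doc in self._docs if self._matches(doc, query)]
        for doc in matched:
            doc.update(changes)
        return len(matched)

    def remove(self, query):
        before = len(self._docs)
        self._docs = [doc for doc in self._docs if not self._matches(doc, query)]
        return before - len(self._docs)


students = ToyCollection()
students.insert({"name": "Asha", "dept": "ISE"})
students.insert({"name": "Ravi", "dept": "CSE", "backlog": True})
students.update({"name": "Ravi"}, {"dept": "ISE"})
ise_students = students.find({"dept": "ISE"})
removed = students.remove({"backlog": True})
```

The two inserted documents have different fields, and the collection comes into existence with the first insert, mirroring the table rows above.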
Cassandra Databases
Cassandra was developed by Facebook and released by Apache.
IBM also released an enhancement of Cassandra as an open-source version.
The open source version includes an IBM Data Engine which processes NoSQL
data store.
The engine has improved throughput when workload of read-operations is
intensive.
Cassandra is basically a column family database that stores and handles
massive data of any format including structured, semi-structured, and
unstructured data.
Features of Cassandra Databases
Maximizes the number of writes; writes are not very costly.
Maximizes data duplication.
Does not support joins, GROUP BY, the OR clause, or aggregations.
Uses classes consisting of ordered keys and semi-structured data storage
systems.
Is fast and easily scalable, with write operations spread across the cluster.
Multiple cloud servers.
Uses PPD Model.
Data replication: replication follows the CAP theorem
Components at Cassandra
Component      Description
Node           Place where data is stored for processing
Data center    Collection of many related nodes
Cluster        Collection of many data centers
Commit log     Used for crash recovery
Mem-table      Memory-resident data structure
SSTable        When the mem-table reaches its threshold, data is flushed into an SSTable
Bloom filter   Fast and memory-efficient; it is accessed after every query
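How the commit log, mem-table, SSTable, and Bloom filter cooperate can be sketched with a toy storage engine. Everything here is a simplified assumption: ToyStorageEngine is a hypothetical class, the flush threshold counts keys instead of bytes, the commit log is never trimmed, and an exact Python set stands in for the Bloom filter (a real Bloom filter is probabilistic and may report false positives).

```python
class ToyStorageEngine:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record used for crash recovery
        self.memtable = {}        # memory-resident structure for recent writes
        self.sstables = []        # flushed tables: (key_filter, sorted rows)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # log first, then memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Move the mem-table contents into a new sorted, immutable table."""
        rows = dict(sorted(self.memtable.items()))
        self.sstables.append((set(rows), rows))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest table first; the filter lets us skip tables without the key.
        for key_filter, rows in reversed(self.sstables):
            if key in key_filter:
                return rows[key]
        return None


engine = ToyStorageEngine(memtable_limit=2)
engine.write("a", 1)
engine.write("b", 2)     # threshold reached: first flush
engine.write("c", 3)
engine.write("a", 10)    # threshold reached: second flush
latest_a = engine.read("a")
```

A write only appends to the log and updates memory, which is why writes are cheap; reads may have to consult several flushed tables, which is where the filter helps.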
Scalability
Transaction support
Replication option
Simple Strategy
Network Topology Strategy
Data types
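The Simple Strategy replication option can be illustrated with a token ring: the first replica goes to the node that owns the row key's position on the ring, and the remaining replicas go to the next nodes clockwise. The sketch below is an assumption-laden illustration: the Ring class and node names are hypothetical, CRC32 stands in for the real partitioner's hash, and Network Topology Strategy (which also considers data centers and racks) is not covered.

```python
import bisect
import zlib


def token(value):
    """Map a string to a position on the ring (CRC32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8"))


class Ring:
    def __init__(self, nodes):
        # Nodes sorted by token form the ring.
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key, replication_factor):
        """First owner by token, then the next nodes clockwise."""
        size = len(self.entries)
        start = bisect.bisect_left(self.tokens, token(row_key)) % size
        count = min(replication_factor, size)
        return [self.entries[(start + i) % size][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3", "node4"])
owners = ring.replicas("user42", replication_factor=3)
```

Because every node can compute the same placement from the key alone, no master is needed to decide where a row lives, which fits the peer-to-peer model.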
Cassandra Data Model
Cassandra Data Model is based on Google’s BigTable.
Each value maps to two strings (row key, column key).
The Cassandra data model consists of 4 main components:
Cluster
Keyspace
Column
Column family
Cassandra Query Language
Cassandra Query Language Commands
CQLSH
HELP
CONSISTENCY
EXIT
SHOW HOST
SHOW VERSION
CREATE KEYSPACE <Keyspace Name>
DESCRIBE KEYSPACE <Keyspace Name>
ALTER KEYSPACE <Keyspace Name>
DROP KEYSPACE <Keyspace Name>
CREATE (TABLE | COLUMNFAMILY)
COLLECTIONS