
NoSQL Big Data Management Overview

This document provides an overview of NoSQL Big Data Management, focusing on databases like MongoDB and Cassandra. It covers key concepts such as NoSQL data architecture patterns, schema-less models, and the benefits of using NoSQL for handling large datasets. Additionally, it discusses various distribution models, challenges associated with Big Data, and the comparison of NoSQL with traditional SQL databases.

Uploaded by

yashbnv

NoSQL Big Data Management, MongoDB and Cassandra


Manjunath G.S.
Asst. Professor
Dept. of ISE, BNMIT
Learning Objectives
After studying this chapter, you will be able to:
 Understand NoSQL data stores, Big Data solutions, and schema-less models.
 Describe NoSQL data architecture patterns.
 Manage NoSQL data stores and applications, and handle the associated problems.
 Solve Big Data analytics problems using shared-nothing architecture.
 Apply MongoDB databases and query commands.
 Use Cassandra databases, data model, and clients, and integrate them with Hadoop.
Introduction
 Big Data uses distributed systems.
 The features of distributed computing are listed below:
 Increased reliability and fault tolerance
 Flexibility
 Sharding
 Speed
 Scalability
 Open system
 Performance
 Resource sharing
Contd,…
 The demerits of distributed computing are listed below:
 Issues in troubleshooting in a larger networking infrastructure
 Additional software requirements
 Security risks for data and resources
Key terms used in Database systems
 Class
 Object
 Tuple
 Transaction
 Database transaction model
 MySQL
 Oracle
 DB2
 Sybase
 MySQL server


NOSQL Data Store
 SQL is a language based on relational algebra.
 SQL is used to create and query databases in RDBMSs.
 A NoSQL data store stores and handles big data and provides high availability, which means it can serve many concurrent users.
 NoSQL data stores are also non-relational, meaning they do not guarantee ACID properties and do not adhere to a fixed schema.
 SQL databases provide triggers, views, schedules, and joins.
NoSQL
• NoSQL is a new approach to thinking about databases, with features such as:
 flexibility
 simple relationships
 dynamic schemas
 auto sharding
 replication
 integrated caching
 semi-structured data and flexibility in approach
• Issues with NoSQL data stores are lack of standardization, processing difficulties for large queries, and difficulty maintaining consistency in all states.
Big Data NoSQL
NoSQL data store characteristics are as follows:
 It is a class of non-relational data storage systems with a flexible data model.
 NoSQL does not necessarily have a fixed schema.

Features in NoSQL transactions
 Relax one or more of the ACID properties.
 Are characterized by two of the three CAP theorem properties (consistency, availability, partition tolerance).
 Can be characterized by the BASE properties.
CAP Theorem for Big Data Solutions
Schema-less Models
 A schema-less database, like MongoDB, does not have the up-front schema constraints of a relational database, and so maps to a more ‘natural’ database.
 Even when sitting on top of a data lake, each document is created with a
partial schema to aid retrieval.
 Any formal schema is applied in the code of your applications; this layer of
abstraction protects the raw data in the NoSQL database and allows for rapid
transformation as your needs change.
 Any data, formatted or not, can be stored in a non-tabular NoSQL type of
database.
 At the same time, using the right tools in the form of a schema-less database
can unlock the value of all your structured and unstructured data types.
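The idea that the formal schema lives in application code rather than in the data store can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents, the field names, and the `conforms` helper are hypothetical and do not come from MongoDB or any other product.

```python
# Hypothetical documents in one collection; field names are illustrative only.
raw_documents = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
    {"_id": 3, "title": "Untitled note"},  # no "name" field at all
]

# The "schema" lives in application code, not in the data store.
REQUIRED_FIELDS = {"_id": int, "name": str}


def conforms(document, required=REQUIRED_FIELDS):
    """Return True if the document has every required field with the expected type."""
    return all(
        field in document and isinstance(document[field], expected)
        for field, expected in required.items()
    )


# The store accepted all three documents; the application decides which to use.
valid = [doc for doc in raw_documents if conforms(doc)]
rejected = [doc["_id"] for doc in raw_documents if not conforms(doc)]
print(len(valid), rejected)
```

Because the check is ordinary code, changing `REQUIRED_FIELDS` changes the effective schema without touching the stored data.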
Benefits
Greater flexibility over data types

No pre-defined database schemas

No data truncation

Suitable for real-time analytics functions

Enhanced scalability and flexibility


Increasing Flexibility for Data Manipulation
 NoSQL data stores offer increased flexibility for data manipulation.
 The BASE properties provide this flexibility:
 BA – Basic availability
 S – Soft state
 E – Eventual consistency
NoSQL Data Architecture Patterns
1. Key-Value Store
 The simplest way to implement a schema-less data store is to use key-value pairs.
 The data store characteristics are high performance, scalability, and flexibility.
 A key-value store uses a primary key to access the values.
 The concept is similar to a hash table, where a unique key points to a particular item (or items) of data.
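The hash-table analogy can be made concrete with a minimal sketch. The `KeyValueStore` class and the key names are assumptions made for illustration; a Python `dict` stands in for the hash table, and nothing here models a real product.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # The value is opaque to the store: any data type is accepted.
        self._items[key] = value

    def get(self, key, default=None):
        # Access is by key only; there is no query over the value contents.
        return self._items.get(key, default)

    def delete(self, key):
        self._items.pop(key, None)


store = KeyValueStore()
store.put("user:42", {"name": "Asha", "langs": ["en", "kn"]})
store.put("page:/home", "<html>...</html>")
profile = store.get("user:42")
missing = store.get("user:99")
```

The whole value comes back as a single item, which is why lookups are fast but searching inside values is left to the application.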
1. Key-Value Store Advantages
 The data store can hold any data type in a value field.
 A query just requests the values and returns them as a single item.
 A key-value store is eventually consistent.
 A key-value data store may be hierarchical or may be an ordered key-value store.
 Values returned by queries can be converted from one form to another.
 It is scalable, reliable, and portable, and has a low operational cost.
 The key can be synthetic or auto-generated.
2. Document Store
 It stores unstructured data.
 Storage is similar to that of an object store.
 Data is stored in nested hierarchies.
 Querying is easy.
 Transactions on the document store exhibit ACID properties.
 Typical uses of a document store are office documents, inventory store, forms data, document exchange, and document search.
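Nested hierarchies and easy querying can be illustrated with a short sketch. The `get_path` helper and the order documents are hypothetical; the dotted-path convention is an assumption of this example, not a claim about any particular document store.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'customer.address.city' through nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


orders = [
    {"order_id": 1, "customer": {"name": "Asha", "address": {"city": "Bengaluru"}}},
    {"order_id": 2, "customer": {"name": "Ravi", "address": {"city": "Mysuru"}}},
    {"order_id": 3, "customer": {"name": "Meera"}},  # no address stored
]

# Query on a field that sits two levels deep in the hierarchy.
in_bengaluru = [
    order["order_id"]
    for order in orders
    if get_path(order, "customer.address.city") == "Bengaluru"
]
```

A document that lacks the nested field simply does not match, which is how a missing branch of the hierarchy is tolerated.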
3. Tabular Data
 A tabular data store uses rows and columns.
 Oracle DBs provide both options: columnar and row-format storage.
 A relational data store uses in-memory row-based data (OLTP).
 In-memory column-based data has the keys in the first column of each row at successive memory addresses (OLAP).
 An in-memory column-based DB stores a column as consecutive memory or disk entries.
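The difference between the row-based and column-based layouts can be sketched with plain Python containers. The sample records are invented for illustration, and Python lists only approximate "consecutive memory"; the point is which values sit next to each other.

```python
# Row-based layout: each record is stored together (suits OLTP-style access).
rows = [
    (1, "pen", 10.0),
    (2, "book", 250.0),
    (3, "bag", 900.0),
]

# Column-based layout: each column is stored as one consecutive sequence
# (suits OLAP-style scans over a single field).
columns = {
    "id": [r[0] for r in rows],
    "item": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# Fetching one whole record is natural in the row layout...
record = rows[1]

# ...while aggregating one field touches only that column in the column layout.
total_price = sum(columns["price"])
```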
3.1 Column Family Store
 Column Data Store: one way to implement a schema is to divide the data into columns.
 Column-Family Data Store: a group of columns forms a column family.
 Sparse Column Fields: a row may be associated with a large number of columns but contain values in only a few column fields.
 Grouping of Column Families: two or more column families in a data store form a super group.
 Grouping into Rows: when the number of rows is very large, the table is partitioned and each partition forms one row group.
3.1 Characteristics of Columnar DS
 Scalability
 Partitionability
 Availability
 Tree-like columnar structure
 Adding new data at ease
 Querying all the field values
 Replication of columns
 No optimization for join
3.2 BigTable Data Store
 Bigtable is ideal for applications that need very high throughput and scalability for key/value data.
 Bigtable is used to store and query the following types of data:
 Time-series data
 Marketing data
 Financial data
 IoT data
 Graph data
3.3 RC, ORC, and Parquet File Formats
 RC – The Record Columnar file format is suitable for intermediate tables, giving a fast column-family store in HDFS with Hive.
 ORC – An Optimized Row Columnar file consists of row-group data called stripes, and enables concurrent reads of the same file using separate RecordReaders.
 Parquet is a nested, hierarchical columnar-storage format.
 This format is designed to separate the metadata from the data. It also allows splitting columns into multiple files while keeping a single metadata file.
4. Object Data Store
 An object store refers to a repository which stores the
 Objects
 System metadata
 Custom metadata
 Metadata finds the relationships among the objects, maps the object relations
and trends.
 A single file domain may contain multiple Object Stores.
Supporting APIs
An object data store consists of functions supporting APIs for:
 Scalability
 Indexing
 Large collections
 Query language, processing and optimization
 Transactions
 Data replication for high availability, data distribution model, data integration
 Schema evolution
 Persistency
 Persistent object life cycle
 Adding modules
 Locking and caching strategy
Object relational Mapping
5. Graph Database
 One way to implement a data store is to use a graph database.
 Any number of nodes and any number of edges can be added to expand a graph.
 The data store focuses on modeling the interconnected structure of data.
 Graph databases enable fast network searches.
 The data store uses graphs with nodes and edges connected to each other through relations, associations, and properties.
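A network search over nodes and edges can be sketched with an adjacency list and a breadth-first search. The node names, the `FOLLOWS` relation, and the `hops` function are assumptions made for this example; real graph databases use their own storage and query languages.

```python
from collections import deque

# Edges carry a relation label; node names are illustrative.
edges = [
    ("asha", "FOLLOWS", "ravi"),
    ("ravi", "FOLLOWS", "meera"),
    ("meera", "FOLLOWS", "john"),
    ("asha", "FOLLOWS", "kiran"),
]

# Build a directed adjacency list; every node gets an entry, even with no out-edges.
adjacency = {}
for source, _relation, target in edges:
    adjacency.setdefault(source, []).append(target)
    adjacency.setdefault(target, [])


def hops(start, goal):
    """Breadth-first search; returns the number of edges on a shortest path, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None
```

For example, `hops("asha", "john")` walks asha, ravi, meera, john. Adding a new node or edge only appends to the adjacency list, which mirrors how a graph can be expanded freely.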
Contd,…
Characteristics of graph databases are:
 They use specialized query languages, such as SPARQL for RDF.
 They create a database system which models the data in a completely different way.
 They can have hyper-edges.
 They consist of a collection of small data-size records.
 They scale poorly across multiple servers.
6. Variations of NoSQL
The six data architectures are:
 SQL-table
 Key-value pairs
 In-memory
 Column-family
 Document
 Graph and object
NoSQL to manage Big Data
 By improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some of the Nation’s most pressing challenges.
 Have you ever wanted to analyze a large amount of data gathered from log files or files you’ve found on the web?
 NoSQL data store management, applications, and handling problems in Big Data.
Using NoSQL to manage Big Data
NoSQL
 Limits the support for join queries and supports sparse column families.
 Offers easy creation, high processing speed, scalability, and the ability to store a much higher magnitude of data.
 Supports the CAP and BASE properties.
 Scales horizontally as well as vertically.
NoSQL Solutions for Big Data
The characteristics of a Big Data NoSQL solution are:
 High and easy scalability
 Support for replication
 Distributable
 Usage of NoSQL servers
 Usage of open-source tools
 Support for a schema-less data model
 Support for integrated caching
 No inflexibility
Big Data Use cases
Some typical big data use cases are:
 Bulk image processing
 Public web page data
 Remote sensor data
 Event log data
 Cell phone data
 Social media data
 Game data
Types of Big Data Problems
Big Data problems arise due to the limitations of NoSQL and other DBs:
 Big Data needs scalable storage and the use of distributed servers together as a cluster.
 NoSQL databases are open source (this is a positive as well as a negative).
 There are no stored procedures in MongoDB.
 GUI-mode tools to access the data store are not available in the market.
 Lack of standardization.
 NoSQL data stores sacrifice ACID compliance for flexibility and processing speed.
Comparison of NoSQL with SQL/RDBMS
Features                     NoSQL Data Store                        SQL/RDBMS
Model                        Schema-less                             Relational
Schema                       Dynamic schema                          Predefined
Data Architecture Patterns   Key-value based, column-family based    Table based
Scalable                     Horizontally                            Vertically
Use of SQL                   No                                      Yes
Dataset size preference      Prefers large datasets                  Large dataset not preferred
Consistency                  Variable                                Strong
Vendor support               Open source                             Strong
ACID properties              No                                      Strictly follows
Shared Nothing Architecture
 In an RDBMS, the columns of two tables are related by a relationship.
 Keys are shared between two or more SQL tables in an RDBMS.
 Shared nothing (SN) is a cluster architecture in which a node does not share data with any other node.
 A Big Data store uses the SN architecture.
 Each node independently processes the queries of different users on the data in its own partition.
 A coordination protocol controls the processing at all SN nodes.
 The data of different data stores is partitioned among a number of nodes.
Contd,…
 Examples that use this partitioning and processing are Hadoop, Flink, and Spark.

The features of SN architecture are as follows:


 Independence
 Self-Healing
 Each node functioning as a shard
 No network contention
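The shared-nothing idea (each node owns one partition, works on it independently, and a coordinator combines partial results) can be sketched as follows. The `Node` class, the modulo partitioning rule, and the field names are assumptions for illustration only; frameworks such as Hadoop or Spark implement this with far more machinery.

```python
class Node:
    """One shared-nothing node: it owns its partition and shares no data."""

    def __init__(self, name):
        self.name = name
        self.partition = []  # local data only

    def local_sum(self, field):
        # Computed from this node's own records, with no access to other nodes.
        return sum(record[field] for record in self.partition)


def distribute(records, nodes, key):
    """Assign each record to exactly one node (integer key modulo node count)."""
    for record in records:
        nodes[record[key] % len(nodes)].partition.append(record)


def coordinated_sum(nodes, field):
    """Coordinator: ask every node for a partial result, then combine them."""
    return sum(node.local_sum(field) for node in nodes)


nodes = [Node(f"node-{i}") for i in range(3)]
records = [{"user_id": i, "amount": 10 * i} for i in range(1, 7)]
distribute(records, nodes, key="user_id")
total = coordinated_sum(nodes, "amount")
```

Only the small partial sums cross node boundaries, which is one way to read the "no network contention" feature listed above.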
Choosing the Distribution Models
 Data needs to be distributed across multiple data nodes in clusters.
 Distributed software components give the advantage of parallel processing (horizontal scalability).
 Distribution provides:
 Ability to handle large-sized data
 Processing of many read and write operations simultaneously in an application
 A resource manager manages, allocates, and schedules the resources of each processor, memory, and network connection.
 Distribution increases availability when a network slows or a link fails.
1. Single Server Model
 The simplest distribution option for NoSQL data store and access is Single Server Distribution (SSD) of an application.
 A graph database processes the relationships between nodes at a server.
 An application processes the data sequentially on a single server.
 Processes and datasets are distributed to a single server, which runs the application.
2. Sharding Very Large Databases
 The accompanying figure shows the sharding of a very large dataset into four divisions, each running the application on one of four different servers in the cluster.
 This model runs as per the SN architecture: the application process runs on multiple shards in parallel.
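Splitting a dataset into four shards can be sketched with a stable hash of the key. The choice of CRC32, the key format, and the `put`/`get` helpers are assumptions of this example; production systems may shard by range or by other hash schemes.

```python
import zlib

NUM_SHARDS = 4  # four divisions, as in the model described above


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index with a stable hash (CRC32, chosen for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Each dict stands in for the data held by one server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    # The same hash routes the read to the shard that received the write.
    return shards[shard_for(key)].get(key)


for i in range(1000):
    put(f"customer-{i}", {"id": i})

sizes = [len(shard) for shard in shards]
```

Every key lives on exactly one shard, so the four servers can process their own keys in parallel without consulting each other.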
3. Master-Slave Distribution Model
 One node serves as the master (primary) node and the other nodes are slave nodes.
 Data is replicated on multiple slave servers in the Master-Slave Distribution (MSD) model.
 Whenever a process updates the master, it updates the slaves as well.
 Master-Slave replication: processing performance decreases due to replication in the MSD model.
 Complexity: cluster-based processing has greater complexity than the other architectures, and consistency can be affected.
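The update flow of the master-slave model can be sketched as a toy cluster in which every write goes to the master and is copied to each slave. The class names, the synchronous copy, and the round-robin read policy are assumptions made for illustration, not a description of any specific product.

```python
class Replica:
    """One server holding a full copy of the data."""

    def __init__(self):
        self.data = {}


class MasterSlaveCluster:
    """Toy model: all writes go to the master, which copies them to every slave."""

    def __init__(self, num_slaves):
        self.master = Replica()
        self.slaves = [Replica() for _ in range(num_slaves)]
        self._next_reader = 0

    def write(self, key, value):
        self.master.data[key] = value
        # Synchronous replication: each extra slave adds work to every write.
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads are spread over the slaves in round-robin order.
        slave = self.slaves[self._next_reader % len(self.slaves)]
        self._next_reader += 1
        return slave.data.get(key)


cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("stock:pen", 120)
first = cluster.read("stock:pen")
second = cluster.read("stock:pen")
```

The loop inside `write` is where the replication cost mentioned above shows up, while `read` shows why adding slaves helps read scalability.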
4. Peer-to-Peer Distribution Model
 The Peer-to-Peer Distribution (PPD) model and replication show the following characteristics:
 All replication nodes accept read requests and send the responses.
 All replicas function equally.
 Node failures do not cause loss of write capability.
 Cassandra adopts the PPD model; the data is distributed among all the nodes in a cluster.
 Performance can be further enhanced by adding nodes.
 Because each replicated node also holds the updated data, consistency is achieved.
Choosing MSD Vs PPD
 MSD replication provides greater scalability for read operations.
 PPD replication provides greater scalability for both read and write operations.

Combining Sharding with Replication
 Combining Master-Slave replication with sharding creates multiple masters.
 For each data item, a single master exists.
 Peer-to-Peer replication combined with sharding uses the same strategy for column-family data stores.
Ways to Handle Big Data Problems
 The below figure shows four ways of handling Big Data Problems.
MongoDB Database
 It is an open-source DBMS.
 It manages collection and document data stores.
 Its functions query and access the required information.
 The functions include viewing, querying, changing, visualizing, and running transactions.
 Changing includes updating, inserting, appending, or deleting.
MongoDB is:
 Non-relational
 NoSQL
 Distributed
 Open-source
 Document based
 Cross-platform
 Scalable
 Based on a flexible data model
 Indexed
 Multi-master
 Fault-tolerant
Features of MongoDB Database
 MongoDB data store model
 Collection
 Document
 Storing of data
 Storing of documents
 Querying, indexing, and real-time aggregation
 Deep query-ability
 No complex joins
 Indexes on any fields
 Distributed DB
 Atomic operations on a single document
 Fast in-place updates
 No configurable cache
 Conversion / mapping
Dynamic Schema
 Dynamic schema implies that documents in the same collection do not need to have the same set of fields or structures.
 Similar fields in different documents may contain different types of data.

RDBMS                     MongoDB
Database                  Data store
Table                     Collection
Column                    Key
Value                     Value
Records / Rows / Tuple    Document / Object
Joins                     Embedded Documents
Index                     Index
Primary key               Primary key (_id) is the default key provided by MongoDB itself
Replication
 Replication ensures high availability in Big Data.
 Replication makes DBs fault-tolerant against any database server failure.
 MongoDB replicates with the help of a replica set.
 A replica set usually has at least three nodes.
 Data is replicated from the primary node to the secondary nodes.

Commands      Description
[Link] ( )    To initiate a new replica set
[Link] ( )    To check the replica set configuration
[Link] ( )    To check the status of a replica set
[Link] ( )    To add members to a replica set
Figure: Replica set on creating secondary members
Auto-sharding
 Sharding is a method of distributing data across multiple machines in a distributed application environment.
 A single machine may not be adequate to store the data.
 Sharding automatically balances the data and load across various servers.
 Basically, it splits the dataset and distributes the parts across multiple DBs, called shards, on different servers.
 Each shard stores less data than the full dataset and handles fewer operations than a single instance would.
Data types which MongoDB supports
 Double
 String
 Object
 Array
 Binary data
 Object id
 Boolean
 Date
 Null
 Regular expression
 Timestamp
 Min key
 Max key


Rich Queries and Other DB Functionalities
 MongoDB offers a rich set of features and functionality compared to those offered by simple key-value stores.
 MongoDB has a complete query language, highly functional secondary indexes, and a powerful aggregation framework for data analysis.
 MongoDB provides functionalities and features for more diverse data types than a relational DB, and at scale.
 Its document-based data model is also a distinct advantage of MongoDB.
 Data is stored in the form of BSON (Binary JSON).
Defining the Modern Database
 A modern database must meet three requirements:
 The database MUST scale.
 The database MUST adapt to change.
 The database MUST unleash your data.
 Relational databases can manage some of these requirements, and newer so-called “NoSQL” key-value or wide-column data stores meet others; only MongoDB meets all three requirements.
 A modern database is very much needed for today's business.
Comparison between MongoDB and RDBMS
Features                RDBMS    MongoDB
Rich Data Model         NO       YES
Dynamic Schema          NO       YES
Typed Data              YES      YES
Data Locality           NO       YES
Field Updates           YES      YES
Complex Transactions    YES      NO
Auditing                YES      YES
Horizontal Scaling      NO       YES
MongoDB Query Language and DB Commands
Command                           Functionality
mongo                             Starts the MongoDB shell. The default database in MongoDB is test.
[Link]( )                         Runs help. This displays the list of all the commands.
[Link]( )                         Gets statistics about the MongoDB server.
use <database name>               Switches to the named database; it is created when data is first stored in it.
db                                Outputs the name of the current database.
show dbs                          Gets the list of all the databases.
[Link]( )                         Drops a database.
[Link] [Link]( )                  Creates a collection using insert( ).
db.<collection name>.find( )      Views all documents in a collection.
db.<collection name>.update( )    Updates a document.
db.<collection name>.remove( )    Deletes a document.
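The effect of the insert, find, update, and remove operations on a collection can be imitated with a small in-memory stand-in. To be clear about the assumptions: `ToyCollection` is a hypothetical class written for this illustration and is not the MongoDB API or driver; it only mimics exact-match queries and an auto-assigned `_id`.

```python
import itertools


class ToyCollection:
    """In-memory stand-in for a collection; NOT the MongoDB API."""

    def __init__(self):
        self._documents = []
        self._ids = itertools.count(1)

    def insert(self, document):
        stored = dict(document)
        stored.setdefault("_id", next(self._ids))  # default primary key
        self._documents.append(stored)
        return stored["_id"]

    def _matches(self, document, query):
        # Exact-match query: every queried field must equal the given value.
        return all(document.get(k) == v for k, v in query.items())

    def find(self, query=None):
        query = query or {}
        return [dict(d) for d in self._documents if self._matches(d, query)]

    def update(self, query, changes):
        count = 0
        for document in self._documents:
            if self._matches(document, query):
                document.update(changes)
                count += 1
        return count

    def remove(self, query):
        before = len(self._documents)
        self._documents = [d for d in self._documents if not self._matches(d, query)]
        return before - len(self._documents)


students = ToyCollection()
students.insert({"name": "Asha", "dept": "ISE"})
students.insert({"name": "Ravi", "dept": "CSE", "year": 3})  # extra field is fine
students.update({"name": "Asha"}, {"year": 2})
students.remove({"dept": "CSE"})
remaining = students.find()
```

The two inserted documents have different fields, which is the dynamic-schema behaviour described earlier in the deck.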
Cassandra Databases
 Cassandra was developed by Facebook and released by Apache.
 IBM also released an enhancement of Cassandra as an open-source version.
 The open-source version includes an IBM Data Engine which processes the NoSQL data store.
 The engine has improved throughput when the workload of read operations is intensive.
 Cassandra is basically a column family database that stores and handles
massive data of any format including structured, semi-structured, and
unstructured data.
Features of Cassandra Databases
 Maximizes the number of writes; writes are not very costly.
 Maximizes data duplication.
 Does not support joins, GROUP BY, the OR clause, or aggregations.
 Uses classes consisting of ordered keys and semi-structured data storage systems.
 Is fast and easily scalable, with write operations spread across the cluster.
 Runs on multiple cloud servers.
 Uses the PPD model.
Cassandra Databases Contd,…
 Data replication: a requirement is CAP theorem
 Components at Cassandra
Component      Description
Node           Place where data is stored for processing
Data Center    Collection of many related nodes
Cluster        Collection of many data centers
Commit log     Used for crash recovery
Mem-table      Memory-resident data structure
SSTable        When the Mem-table reaches its threshold, data is flushed into an SSTable
Bloom filter   Fast and memory efficient; it is accessed after every query
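How the commit log, Mem-table, SSTable, and Bloom filter fit together can be sketched with a toy node. Everything here is an assumption made for illustration: the class names, the threshold of three rows, the SHA-256-based Bloom filter, and the read order are simplifications and do not reproduce Cassandra's actual implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, size=256, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


class ToyNode:
    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of writes, for crash recovery
        self.memtable = {}     # memory-resident structure
        self.sstables = []     # (bloom filter, immutable sorted rows) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, then memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Threshold reached: write the mem-table out as a sorted, immutable table.
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):  # newest table first
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None


node = ToyNode(memtable_limit=3)
for i in range(4):
    node.write(f"row-{i}", i * i)
```

After the four writes, the first three rows sit in one flushed table and the fourth is still in memory, so a read may be served from either place.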
Cassandra Databases Contd,…
 Scalability
 Transaction support
 Replication option
 Simple Strategy
 Network Topology Strategy
 Data types
Cassandra Data Model
 The Cassandra data model is based on Google’s BigTable.
 Each value is mapped by two strings (row key, column key).
 The Cassandra data model consists of 4 main components:
 Cluster
 Keyspace
 Column
 Column-family
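The nesting of the four components, and the (row key, column key) lookup, can be pictured as nested mappings. The keyspace, column-family, and key names below are hypothetical, and plain dictionaries are only a mental model, not how Cassandra stores data.

```python
# cluster -> keyspace -> column family -> row key -> column key -> value
cluster = {
    "university": {                       # keyspace (hypothetical name)
        "students": {                     # column family
            "usn-001": {"name": "Asha", "dept": "ISE"},
            "usn-002": {"name": "Ravi"},  # sparse row: no "dept" column
        }
    }
}


def get_value(keyspace, column_family, row_key, column_key):
    """Look up one value by its (row key, column key) pair."""
    row = cluster[keyspace][column_family].get(row_key, {})
    return row.get(column_key)


dept = get_value("university", "students", "usn-001", "dept")
```

Rows in the same column family need not carry the same columns, which is why the second row can omit `dept`.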
Cassandra Query Language
Cassandra Query Language Commands
 CQLSH
 HELP
 CONSISTENCY
 EXIT
 SHOW HOST
 SHOW VERSION
 CREATE KEYSPACE <Keyspace Name>
 DESCRIBE KEYSPACE <Keyspace Name>
 ALTER KEYSPACE <Keyspace Name>
 DROP KEYSPACE <Keyspace Name>
 CREATE (TABLE | COLUMNFAMILY)
 COLLECTIONS
