0 ratings0% found this document useful (0 votes) 146 views30 pagesBig Data Analytics Unit-2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
r
UNIT II
| 2 NoSQL Data Management
Syllabus
Introduction 10 NoSQL - aggregate data models - key-value and document data models -
relationships - graph databases - schemaless databases - materialized views - distribution models -
master-slave replication - consistency - Cassandra - Cassandra data model ~ Cassandra examples
= Cassandra clients
Contents
2.1. Introduction to NoSQL
2.2 Aggregate Data Models
2.3. Schemaless Databases
2.4 Materialized Views
2.5 Distribution Models
26 Consistency
27 Cassandra
2.8 Two Marks Questions with Answers
(2-1)Big Date Analytics
El Introduction t
“re
ing huge vol
lem of handling huge volume g
s o 7 ‘SQL databases are schema fre
NoSOL databases are open source,
, which means that information jg
\d_ database, te or local. This ensures
1 NoSQL
NoSQL means Not Only
data that rel
and are non-relationa
ibute
NoSQL is also type of distribute ach can ena
copied and stored 0” various sever he data re Seine, the resto th
availability and reliability of data. If 80
database can continue tO run.
NoSQL encompasses structured data, semi-structured data, unstructured data ang
polymorphic data. ;
No SQL database provides @ mechanism for storage and crs i - that
employs less constrained consistency models than traditional relational databases.
NoSQL is a response to nowadays business data related factors ?
referring to the ability to handle large datasets that arrive
1. Volume and velocity,
quickly;
2, Variability, referring to how diverse data types don't fit into structured tables;
3. Agility, referring to how fast an organization
NoSQL databases are very often referred to as data stores rather thai
NoSQL systems work on multiple processors and can run on low-cost separate
computer systems - No need for expensive nodes to get high-speed performance.
It supports linear scalability. Every time we add more processors, we get a
consistent increase in performance.
responds to business changes.
mn data-bases.
Histoty ‘of NoSQL :
= The acronym NoSQL was first use
his lightweight, open-source “relational”
name came up again in 2009 when Eric Evans and Johan Oskarsson use
to describe non-relational databases.
Relations databases are often referred to as SQL systems. The term NoSQL
eat a aaa * - systems" or the more commonly accepted
y SQL," t : :
support SQL-like query ae SS es
NoSQL developed at least i ‘
t in the beginni:
Beste : ginning as a response te , the
processing unstructured data and the oy for Cae peocesing
The NoSQL model us
- : es a distrib
ia aie ae uted database system, meaning a syst¢™
d in 1998 by Carlo Strozzi while naming
database that did not use SQL. The
rd it
TECHNI
CAL PUBLICATIONS® - an up-thrust f
for knowhednaig Data Analytics
» Not only can NosoL
but they SQL systems hy
organizations such as Faceboon eet
NoSQL systems, aioe
These organi
‘ganizations
uunstructured data, coordinating it to find eens, Temendous amounts of
Big data became ai ind p:
atterns and gain business insights.
why NoSQL ?
in official term in 2005.
mn handle tre
| e It cat large volumes of Structured, semi-structured and unstructured dat:
3 .. . % ae
™ Agile sprints, quick iteration and frequent code pushes
Ee occu that is easy to use and flexible.
types of NoSQL Stores :
1. Column Oriented (Accumulo, Cassandra, Hbase)
2, Document Oriented (MongoDB, Couchbase, Clusterpoint)
3. Key-value (Dynamo, MemcacheDB, Riak)
4, Graph (Allegro, Neo4j, OrientDB)
¢ NoSQL databases are guaranteed to adhere to two of the CAP properties. Such
databases are of several types.
1. Key-value store : Stores in the form of a hash table (Example - Riak, Amazon
$3 (Dynamo), Redis)
2. Document-based store : Stores objects, mostly JSON, which is web friendly or
supports ODM (Object Document Mappings). (Example - CouchDB, MongoDB)
3, Column-based store : Each storage block contains data from only one column
{Example - HBase, Cassandra}
4. Graph-based : Graph representation of relationships, mostly used by social
networks. {Example - Neo4])
Br The Definition of Four Types of NoSQL Databases
les a mechanism for storage and retrieval, of data that is
than the tabular relations used in relational databases.
ted as Not-only-SQL to emphasize that they may also
Most NoSQL databases are designed to store
* NoSQL database provid
modeled in means other
NoSQL is often interpre
support SQL-like query languages.
iti in a fault-tolerant way:
co uae oo at is used to describe a family of databases that are
fae i a
SBNSSOL ie siroply peter ies, data types and use cases vary Wi lely
Jonal, While the technologies
= ee me cay agreed that there are four tyPes ‘of NoSQL. databases
amount them, 1
®
TECHNICAL PUBLICATIONS” - &”
up-thrust for knowledgezon
5 any of four primary days
sig Data Anaiytics
Lased and graph based.
information usi
can manage
panes .4, column
datal
oN store, document-base
models : Key-value
Be Example and Advantages
GOL databases ;
3 a cena fan open source, JSON document-based database that uses
a) b
JavaScript as its query Tangues®
b) Elasticsearch,
©) Couchbase, a key-val
build responsive and flexible PP
computing.
Advantages :
a) NoSQL databases have a simple and flexible
b) NoSQL databases are pased on key-value pairs.
include column store, document store,
c) Some store types of NoSQL databases
key. value store, graph store, object, store, XML store and other data store
modes.
database that inelucdes a full-text search engine,
se that empowers developers to
mobile and edge
a document-based
te and document databa:
ications for cloud,
structure. They aré schema-free,
abase stores also allow
base has a key. Some NoSQL dat
not just simple string
d) Each value in the. datal
developers to store serialized objects into the database,
values.
e) Open-source NoSQL databa
run on inexpensive hardware,
«Disadvantages :
a) Most NoSQL databases do not suppo!
supported by relational database systems.
b) In order to support reliability. and consistency features, developers must
implement their own proprietary code, which adds more complexity to the
system.
EZE cap Theorem
* Fig. 2.1.1 shows the three properties of the CAP theorem.
° Th istril
ea aaa ote data systems will offer a trade-off between
Z lity and partition t
alas areca hones Pi sia olerance. And, that any database can only
ses do not require expensive licensing fees and can
rendering their deployment cost-effective.
rt reliability features that are natively
© Consistency : E i
Cent oe nee pei in. cluster responds with the most recent data, eve?
the request until all replicas update. If you query *
TECHNIC
ICAL PUBLICATIONS® - an up-thrust for knowledgegel Analytics 2-6
NoSQL Data Management
Fig. 2.1.1 CAP theorem
you will wait for that
*consistent system” for an item that is currently updating,
will receive the most
response until all replicas successfully update, However, you
current data.
ity : Every node returns an immediate response,
"available system” for an item th
ervice can provide at that
Availabilit even if that response is
not the most recent data. If you query an at is
updating, you will get the best possible answer the s
moment.
stem continues to operate ever: if a
: Guarantees the sy’
with other replicated data nodes.
Partition tolerance +
replicated data node fails or loses connectivity
‘on of SQL and Nost
QL. Databases
srctused query Nos databases Tave dynamic schemas
ed schema. for ‘unstructured data. é oS
§ SQL databases use Sue
e a predefin
Janguage and bev
NosQL databases are document
SQL databases are table-based. Key-value, graph, oF wide-column stores _
ee spultirow while NoGOl better [oh unstructured
sh daabacs ae beter or MITT | lik dosumgns 8 SON
_fransadtons
Ba Aggregate Data Models
collection
gate is
tg that are treated as a unit In NoSQL
of data that interact as a unit Moreover,
of object
rm the boundaries for the ACID
© Aggregate means @
Databases, an aggre!
these units of data or 96
operations.
‘an up-thrust for knowledge
@
TECHNICAL PUBLICATIONS« Aggregate data ™
a er the clusters a5 dl ‘
js retrieve’
“Is in NoSQL-
D transaction:
ut s and Sactitg
aa
the databases tO manage g
a
ow reside on a
Il the data a
5
IL
ate data
aggreea vom the database
C)
SQL do not support A
new e help of aggregate data models in No
atabase. We can achieve hj.”
NoSQL database if the Pi
a
e aggregate.
Aggregate data a
iD properties.
can ‘ ee LAP operations on the d
in the
one easly perform OLAT, A mmodels
ici legate data
Miiciency of the aB8r°8" i a
transactions and interactions
tring of characters and the
ea Key-value Store
the key is usually a simple ©
to the database. Key-value
structure,
that are opaque
Jn the key-value
value is a series of uninterrupted bytes
store is like @ relational database with only two columns : The key or attribute
name and the value.
Fig. 22.1 shows key-value store:
value
"Name
Key
F Mobile Number
[Aadhat
ja=p[ rie]
Fig, 2.2.1 Key-value store
oe data as a group of key value pairs, which are made up of two data items
= are linked. The link between the items is a "key” which acts as an identifier
for an item within the data and the ‘value that is the data that hy b
identified. oc
The data itself i : imiti
1 itself oe some — data type (string, integer, array) oF 4
{tan application needs to persist and access directly:
flexible data mode!
structures as their
it
more complex obj
Thi ae
ie ig sei of relational schemas with a more
; levelopers to easil} ify fi
oe ly modify fields and object st
Key value systems tre:
: : at the dat i
ee tbe a a as a single opaque collection which may ha
In each key value pair, ‘
a) The key is represented by an arbitrary strin,
b) The value can be any kind of data i 7
ai
an image, file, text or document.
TECHNICAL PUBLICATIONS®
"ONS® - an up-thrust for kn
r knowledgeNoSQL Data Management
5
9g
8
é
-
a
=
z
2
g
a
i simply provid
Fi le a wa
ple GET, PUT and DELETE
The simplicity of this model
: 1 tak;
portable and flexible, akes a key-value store fast, easy to use, scalabl
, , scalable,
Advantages of key value stores ;
a) The secret to its speed lies in
: ts simplici
request to the object in mer plicity. The path to retrieve data is a direct
ry or on disk
b) The relationship betw :
ee b between data does not have to b
inguage, there is no optimization performed. e calculated by a query
¢) They can exist on distributed 5
store indexes.
Disadvantages of key value stores :
a) No complex query filters
ystems and do not need to worry about where to
b) Alll joins must be done in code
©) No foreign key constraints
d) No trigger.
Document-based —
A document is an object and keys (strings) that have values of recognizable types,
Booleans and strings, as well as nested arrays and dictionaries.
so there is no need for cross-referencing and
tion in a table, it is stored ina document.
including numbers,
All data is stored in one table,
instead of storing informal
nt based data model.
Fig. 2.2.2 shows docume!
Documents
ument based data model
flexibility. They are
modif}
Fig. 2.2.2 Doe!
are designed for
jatabases
are therefore easy to
not typically forced to
* Document d
have a schema and
‘an up-thrust for knowleds
TECHNICAL PUBLICATIONS2-8 NoSQk. Data Manageman,
Big Data Analytics
ires the ability to store varying attributes along with larg,
mt databases are @ good option.
te formats including XML and JSON. This
1a without an impedance match.
« If an application requ
amounts of data, docume!
Document stores work with multip!
allows for storage and retrieval of dat
t data store are as follows :
Terminologies in document
a) A table is called a collection
b) A row is called a document
‘A column in called a field.
syeiod ‘use eases for document stores include the storage and retrieval of catalogs,
biog posts, news articles and data analysis.
MongoDB and Apache CouchDB are examples of popular document-based
databases.
Do not use document databases for transactions across multiple documents
(records) and Ad hoc cross-document queries.
Advantages of document based model :
a) Faster retrieval of data.
b) Dynamic architecture for unstructured data and storage options
¢) Sharing for horizontal scalability
nally, so chances of accidental loss of data is
) Replication is managed inte
negligible.
Disadvantages of document data model :
2) No views, triggers, scripts or stored procedure.
b) Relationship not well defined.
©) No support for transactions, which could lead to data corruption.
2% Column-based
* Column-based is also called ‘wi
; ide column’ models enabli :
using a row key, column name and cell timestamp. ling very quick data access
* The flexible !
aaa deci - these types of databases means that the columns do not
actoss records and you can add a column to specific rows
As data is organized 4
into colum:
key-value stores, wumns, we have better indexin
- Also, when it comes to updates, multiple 5 aaa to =
‘ umn upda
can be aggregated,
ee
CHNICAL PUBLICATION: M1 Up-thrust for knowlacinaNoSQL Data Management
Google oj
ogle open sourced its i
-ynown Google e-mail : cd Bij its implementatic
ell-no gle e-mail service, Gmail ig Table. Apparently, the data f Te
NoSQL Database. 1 is stored in the Google Bi : e
le Big Table
e wide, columnar stores data mod
ae el, li
derived from Google's BigTable paper. ike that found in Apache Cassandra, are
nizations mostly u:
Cee ae ay use Column data stores for data 7
proces: Br evident in services such as Amazon R warehousing and data
zon i
advantages of column data stores : ae
a) Column stores are very efficient at data compression and/or partitioning.
b) Columnar stores can be loaded extremely fast.»
¢) Columnar databases are very scalable.
d) Due to ‘thelr structure, columnar databases perform particularly well ‘with
aggregation queries.
Disadvantages of column data store : :
a) Updates can be inefficient. The fact that columnar families group attributes, as
opposed to rows of tuples, works against it
b) If multiple attributes are touched by 2 join or query, this may also lead to
column storage experiencing slower performance.
«) It is also slower when deleting rows from columnar systems,
to be deleted from each of the record files.
as a record needs
EEX] Graph-based
* The modem graph database js a data storage and processing engine that makes
the persistence and exploration of data and relationships more efficient.
data in nodes that are connected by edges. These
Guaphbased dat stor OL are widely used for storing the huge volumes
Ay in NoSQ] : 7 ‘
ggregate Data Models i mn dimensional data having many interconnections
of complex aggregates
between’ them.
In graph theory, structures are composed of
later be called "data relationshiP® es ae oes
je think, alee
Graphs behave similarly 1° how people thinks Or uarly useful for visualizing
discrete units of data. i
e type
This database OF gwen different pieces of data.
analyzing, of helping to :
vertices and edges, oF What would
find connection
——~ puBLiCATION:Big Data Analytics 2-10 NoSQL Data Managemen, l
nologies for recommendation engine,
° As ult, businesses leverage graph tec
s a resi Examples of graph-based NoSQL database,
fraud analytics and network analysis.
include Neodj and JanusGraph.
* Graph databases can be used to analyze customer interactions, social media ang
scientific applications where it is crucial to traverse long relationship graphs ¢g
better understand data.
© Advantages of graph data +
a) More descriptive queries
b) Greater flexibility in adapting your model
©) Greater performance when traversing data
yh data stores :
relationships.
* Disadvantages of grap
a) Difficult to scale
}) No standard language.
NoSQL Key/Value Database : MongoDB
+ MongoDB is an open-source document database that provides high performance,
high availability and automatic scaling. MongoDB is one of the most popular
open-source NoSQL databases written in C++. As of February 2015, MongoDB is
the fourth most popular database management system. It was developed by a
ry 10gen which is now known as MongoDB Inc.
* Why use MongoDB ?
a) Simple queries
b) Functionality provided applicable to most web applications
©) Easy and fast integration of data
d) No ERD diagram
: = = for heavy and complex transactions systems
not provides any comm: " :
eed ee fe and to create a "database". Actually, you do
ease ae ae cause, MangoDB will create it on the fly, during
database, lue into the defined collection (or table in SQL) and
* MongoDB is a docu:
iment-oriented di rt
phen ea teary fatabase which stores data i
lata -li
oan ae schema, It means you can store = ities ie
e data ‘structure @ your records without
is 2 vies, Monge on as the number of fields. or types of fields
a mi imil
*. MongoDB stores data record: en oe
ea Is as BSON docum ; i
JSONdocuments, though it contains a fat pa
lata types than JSON.
TECHNICAL PUBLICATIONS® . an up-thrust for knowledaepig D2t8 Analytics
NoSQL Data Management
ability of MongoDB ting existing ones. This
crrays and other mer YOU to represent hierarchical selationshe, :
ys er More complex structures easily. lips, to store
MongoDB
° Bi uses Mongo server and Mongo shell commands to fetch records or the
informatic i
fo. ion from the database (ie. collections). Few areas where MongoDB is ideal
are big data, user data mana; i
gement, mobile and social i
management and delivery, data hub. a
+ A MongoDB instance may have zero or more databases, A database may have
zero or more ‘collections’. A collection may have zero or more ‘documents’. A
document may have one or more ‘fields’. MongoDB ‘Indexes' function much like
their RDBMS counterparts. :
* Database is a physical container for collections. Collection is a group of documents
and is similar to an RDBMS table. A document is a set of key-value pairs.
Documents have dynamic schema.
MongoDB documents are composed of field-and-value pairs and have the
following structure
ieee
Tere
Tae
|
oun . DB supports many data types
d arrays of documents. Mongol ;
done a nd a Nem Se
such as : 7
date, code and binary data.
+ Fig, 2.2.3 shows relation between
‘SQL terms and MangoBD terms.
figctions
T Gol a]
|_| | peaTeTESON |
:
‘an up-thrust for knowledge
pa
TECHNICAL PUBLICATIONS2-12 NoSQL Data Managemen,
Big Date Analytics
guage and supports Ad hoc queries
lany
uses MongoDB query 1anBuB Toe MongoDB that helps it 4,
and sharding. Sharding i
ted data system.
jongoDB
replication
operate as a distribul
Shatding is used by MongoDB to
izontal scaling to add more ma
ae to the growth of load and demand.
Sharding arrangement in MongoDB has mainly three components :
.
store data across multiple machines. It uses
chines to distribute data and operation with
Fig. 2.2.4 Sharding by MongoDB
a) Shards or replica sets : Each shard serves as a separate replica set. They store
all the data. They target to increase the consistency and availability of the data.
b) Configuration servers : They are like the managers of the clusters. These
servers contain the cluster's metadata. They actually have the mapping of the
cluster’s data to the shards. When a query comes, query routers use thes?
mappings from the config servers to target the required shard.
©) Query router : The query router is mongo instances which serve as interface
for user applications, The : i ions and
. They take in the user queries from the applications @™
serve the applications with the required results. a
Advantages of MangoDB :
* MongoDB is a schema - less document type database.
* MongoDB supports field, 1: ing
data from the stored cag, nS® based query, regular expression for searching .
SIC a
aData Analytics
Boost Date Management
MongoDB is very easy to scale up or down.
« It uses internal memory for storin ‘
much faster.
ig the working temporary datasets for which it is
+ MongoDB support primary ard secondary indexes on any field.
* MongoDB supports replication of databases,
MongoDB can be used as a file storage system which is known as a GridFS.
| [ERI schemaless Databases
«Since Nest does not require a schema, there is no blueprint on how data should
be stored and therefore varies between databases. Generally, there are two ways
that NoSQL data storage functions :
1, On-the-disk using B-Trees, with the top of it being permanently in RAM.
2, In-memory where it is all on RAM using RB-Trees and anything stored on the
disc is just an append.
Schemaless databases are a type of NoSQL databases that do not have a
predefined schema or structure for data. This means that data can be inserted and
retrieved without adhering to a specific structure and the database can adapt to
changes in data over time without requiring schema migrations or changes.
* Schemaless database manages information without the need for a blueprint. The
onset of building a schemaless database does not rely on conforming to certain
fields, tables, or data model structures:
* There is no Relational Database Management System (RDBMS) to enforce any
specific kind of structure. In other words, it is a non-relational database that can
handle any database type, whether ‘that be a key-value store, document store,
in-memory, column-oriented, or graph data model.
© In actuality, there is no such thing as schema-less dataset :
1 In a relational database, the schema is explicit and created separately in
advance.
2. In column-based databases, we ¢
we often reuse schema fragment
same is true for document databases. .
in document databases, users directly query data
eate a fresh schema for each row and in fact,
ts from rows that are’ grouped together. The
3. In column-based and also
based on the schema.
4, In graph-based databases,
the data.
we are in essence building the schema as we build
TECHNICAL PUBLICATIONS® - an up-thnust for knowledgeott NoSQL Date Menegemon,
Big Data Analytics
a jg stored in JSON-style documents which ca,
« In schemaless databases, informatio! Ippes for each field. So, a collecten
have varying sets of fields with different data
could look like this :
name : "Joe", age : 30, interests + 'gootball’ }-
{
"name: "Kate', age! 25
the data itself normally has a fairly consistent structure,
there is some additional structure, the
ollections and indexes. Collections
© In the above condition,
With the schemaless MongoDB database,
i icit list of ¢
system namespace contains an explicit nd
may be implicitly or explicitly created, indexes must be explicitly declared.
Benefits of using schemaless databases :
1. Flexibility : Schemaless databases allow fo
x greater flexibility in data modeling.
2. Scalability : | Schemaless databases are designed for scalability, as
they can handle large amounts of unstructured data with ease.
3, Reduced complexity : Schemaless databases can reduce the complexity of data
modeling and development.
4. Good support for non-uniform data.
© Disadvantages :
1. Potentially inconsistent names and data types for a single value.
2. Management of the implicit schema migrates into the application layer.
Materialized Views
© Materialized views solve the problem of views. The views provide a mechanism to
hide from the client whether data is detived data or base data. Views are used
when data is to be accessed infrequently and the data in a table gets updated on a
frequent basis.
* A materialized view is a replica of a target master from a single point in time. The
master can be either a master table at a master site or a master materialized view
at a materialized view site. A materialized view is like a cache, a copy of the data
that can be accessed quickly. :
If a regular view is a saved ialized vi :
query, a materialized i is
results stored as a table, o Meee ate ”
Nese ae do not have views, they may have precomputed and cached
queries and they reuse the term "materialized view" to describe them NoSql.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgean 2-15
We can use materialized y; iews t
. Bas
1. Ease network loads achieve ong °F more of th,
e
NoSQL Data Management
followii :
2, Create a mass deployment pee ‘owing goals :
ito
3, Enable data subsetting a
4, Enable disconnected computing
Two methods are used for build: ;
1. Eager approach : user eee materialized view ;
: © materialized yj
the base data for it. In. uy wed view at the same time updat
Ne case, adding an order would also ou the
: aay
purchase history aggregates for ea i
Pee wade re en ch product, This method is used when more
; lized view than writes,
2, The application database approach i
is valuabk i it ji
ensure that any updates to SES dela also le here as it makes it easier to
dest : ipdate materialized views,
Materialized views can be built outside o} y ie data,
; f the database by i
: latabase by reading the data,
ving it back to the database,
EZ Distribution Models
* Ability of NoSql is to run a database on a large cluster. As data volumes increase,
it becomes more difficult and expensive to scale up, so it is necessary to buy a
bigger server to run the database on.
EERI Single Server
* Database is run on a single machine which handles all the reads and writes to the
data store. Organizations prefer a single server because it eliminates all the
complexities that the. other options introduce.
Single server is easy to manage for application developers. Lot of NoSQL
databases are designed around the idea of running on a cluster, it can make sense
to use NoSQL with a single-server distribution model if the data model of the
NoSQL store is more suited to the application.
is sui h-database.
i iguration is suitable for grap!
Feel eait rat processing aggregates, then a single-server document
If data usage is mostly abo
or key-value store may be useful.
Baa Sharding single dataset across multiple databases,
eae
* Sharding is a method for oe machines, This allows for larger ae
Which can then be ue om id stored in multiple data nodes, increasing the to
be split into smaller
storage capacity of the system.
© - an up-thrust for knowledge
IBLICATIONS. ~ 8”
TECHNICAL PUEpaniarentics 2-16 NoSQL Data Menagemon,
Big Data
ing i ing known as horizontal scaling or scale-out, ,,
: nae ae reales to share the load, Horizontal scaling allows jo.
near-limitless scalability to handle big data and intense workloads.
Sharding is also known as data partitioning. Many NoSQL databases offe
auto-sharding, Fig, 2.5.1 shows Sharding.
aay
Fig. 2.5.1 Sharding
Sharding is the process of splitting a large dataset into many small partitions
which are placed on different machines. Each partition is known as a "shard",
Each shard has the same database schema as the original database. Most data is
distributed such that each row appears in exactly one shard. The combined data
from all shards is the same as the data from the original database. The load is
balanced out nicely between servers, for example, if we have five servers, each one
only has to handle 20 % of the load.
The NoSQL framework is natively designed to support automatic distribution of
the data across multiple servers including the query load. Both data and query
replacements are automatically distributed across multiple servers located in the
different geographic regions and this facilitates, rapid, automatic and transparent
replacement of the data or query instances without any disruption.
° Sharding is particularly valuable for performance because it can improve both read
and write performance. Using replication, Particularly with caching, can greatly
improve read performance but does little for applications that have a lot of writes.
Advantages of Sharding
4) Faster performance : There are more servers available to handle input/output.
») Horizontal scaling : We can quickly add additional servers to a cluster.
© Costs : Horizontal scaling can often be less expensive than vertical scaling.
4) Distribution/uptime : A horizontally scaled distributed database can achieve
better uptime than a traditional single server,
.
TECHNICAL PUBLICATIONS® « an upthrust for knowiedge”
sig Data Analytics 7
al 2-1
NoSQL Data Management
« Disadvantages of Sharding
Complexity : i
a) iplexity : Depending on the database system, sharding complexity can vary.
| .-p) Rebalancing : When addin;
R g additional machi i
likely need to be rebalanced to distribute eee me
c) Increased infrastructure costs.
[EGEI master-slave Replication
« We replicate data across multi
5 tipl is i i
Se ultiple nodes. One node is designed as primary
ce dary (slaves). Master is responsible for processing any
pdates to that data. A replication process synchronizes the slaves with the
master.
© Master is the authoritative source for the data. It is responsible for processing any
updates to that data, Masters can be appointed manually or automatically.
+ Slaves is a replication process that synchronizes the slaves with the master. After
a tail of the master, a slave can be appointed as new master very quickly.
Fig, 2.5.2 shows master-slave replication.
All updates are,
made to master
Fig. 2.5.2 Master-slave replication
«. Masterslave replication is most helpful for scaling when we have a read-intensive
dataset. It will scale horizontally to handle more reads.
This design offers read resilience. Even if one or more of the servers fails, the
remaining servers can keep offering read access. This can help a lot with
read-heavy applications, but will offer little benefit to write-intensive applications.
splicas of the master server, one of them can assume the
role of the master in case the master fails. In fact most of the time you can simply
create a set of nodes and have them “atomatically decide who would be the
Tceues that occur due to the delay in updating
« As the slaves are exact re
master. There are some consistency
between master and slaves.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics 2-18 NoSQL Data Management
+ Masters can be appointed manually or automatically. In manual appointing
performed when we configure our cluster and we configure one node as the
snaster, With automatic appointment, we create a cluster of nodes and they elect
cone of themselves to be the master.
© Problems of master-slave replication :
1. Does not help with scalability of writes
resilience against failure of a slave, but not of a master
2. Provides
3. The master is still a bottleneck.
EEE] Peer-to-Peer Replication
+ Ina peer-to-peer replication setup the various nodes are all "equals". Any node
J] as writes and they communicate these writes to each
can accept reads as wel i
other. In peer-to-peer replication updates on any one server are replicated to all
other associated servers.
«Fig, 2.5.3 shows peer-to-peer replications.
Requests:
Fig. 2.5.3 Peer-to-peer replications
* The advantage of this setup is its read and write resilience. One node's failure
does not cause problems, as the remaining nodes can continue their work without
losing a beat.
* The problem that arises is that of consistency. For example we may have
conflicting write requests that come to different nodes and then those nodes
attempt to communicate those requests to the rest of the nodes. This could lead to
considerable inconsistencies.
* There are various ways to resolve this problem. The most standard approach
would be to have the replicas communicate their writes first before they “accept”
them. Once a majority of the replicas has confirmed a write, it can now be
considered as having been successfully performed and a response sent to the
client. This requires a certain amount of network traffic in coordinating these
writes.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgetli ll
Big Date Analytic
nalytics 2-19 NoSQL Data Management
. There is a problem of write-write conflict. Two users can update different copies
Of the same record stored on different nodes at the same time is called a
write-write conflict.
Combining Sharding and Replication
¢ Sharding and replication can be combined to get a better response. If we use both
master-slave replication With Sharding and Peer-to-peer replication with Sharding.
1, Master-slave replication and Sharding :
* We have multiple masters, but each data item only has a single master.
* Anode can be a master for some data and a slave for others.
2, Peer-to-peer replication and Sharding :
+ A common strategy for column-family databases.
* A good starting point for peer-to-peer replication is to have a replication factor of
3, so each shard is present on three nodes.
EEG Difference between Replication and Sharding
. Replication
The primary server node copies data
onto secondary server nodes. This can
help increase data availability and act as
a backup, in case the primary server
fails. ; :
Replication copies data across multiple
servers. = oe
Each bit of data can be found in multiple
es si: ae
Replicated servers contain identical | Sharded database servers each contain a
copies of the entire database. part of the overall data, ie. they store |
: | __. different data on separate nod
_ It can improve bot
More read requests, i
i [sequests
12.6 | Consistency
ortant when considering a distributed database, since we
* The CAP theorem is imp
must make a decision about what we are willing to give up. The database we
choose will lose either availability or consistency. Reading about NoSQL databases
uum is the minimal number of nodes
e the concept of quorum. A quort i
1 write operation to be considered complete.
1s® - an up-thrust for knowledge
we can fac
that must respond to a read 0
‘TECHNICAL’ PUBLICATION:2-20 NOSQL Data Managemen
Big Date Analytics:
Of course having a maximum quorum and querying all servers is the way we cs
0}
i It.
determine the correct resul
Consistency can be simply defined by how the copies from the same data may vary wigyi,
si y
epi se system.
the same replicated database sys aa _
vadays systems need to scale. The "traditional! seer database
aera er w power server, does fot gunrantee the high ava
a a etwork pattition required by today’s web-scale systems, as demonstrateq
th CaP theorem. To achieve such requirements, systems cannot impose strong
e i
consistency.
* In the past, almost all architectures used in database aa strongly
consistent, In these cases, most architectures would have a single datal ase instance
only responding to a few hundred clients. Nowadays, many systems are ‘accessed,
by hundreds of thousands of clients, so there was a mandatory requirement to
system's architectures that scale. However, considering the CAP theorem,
high-availability and consistency do conflict on distributed systems when subject
to a network partition event. :
Update Consistency 5
* Two users updating the same data item at the same time is called write-write
conflict.
* When the writes reach the server of the two users, the server will serialize them
and decide to apply one, then the other. First user's update would be applied and
immediately overwritten by the second user,
* In this case first user's is a lost update. Here the lost update is not a big problem.
We see this as a failure of consistency because second user's update was based on
the state before first user's update, yet was applied after it,
Approaches for maintaining consiste:
described as pessimistic or optimistic,
* A pessimistic approach works by preventing conflicts from occurring; an optimistic
approach lets conflicts occur, but detects them and takes action to sort them out.
* For update conflicts, the most
so that in order to change a
ensures that only one client cai
ney in the case of concurrency are often
value we need to acquire a lock and the system
n get a lock at a time,
: bore ieers would attempt to acquire the write lock, but only the first use
Succeed. Second user would then see the i 's. write
oa result of the first user's:
before deciding whether to make his own update.mecca 4-21
, Acommon optimistic ap,
an update tests the value just bef
efore
ona NoSQL Data Management
ditional
Updat
on opt Updating rime Where any client that does
Proach is a
NE it to see if it j
« Both, the pessimistic and 6 Wit is changed since his
P
timisti
the updates and it is possible fox “PPI
le for a
Two general solutions for wie Single 5
ar ite-wrj
4. Pessimistic approach ; Prevent: conf!
locks before update, ing conf
2, Optimistic approach : Lets «
resolve them.
« Jf there are more than one sity
: er Le
might apply the updates j , * Peer-to-peer replicati
Soe fate) ie es in a different order, resulting peer ere ig
n each peer. Sequential consi: Peictatd value for the
systems. istency is used in distributed
hes 1
ely on a i izat
a co ializati
aber nsistent serialization of
fa ate as follows :
icts ‘
from Occurring. Also acquire write
‘onflicts 0,
cur, but detects them and takes actions to
timistic way t i i
oe they se = ae a poe conflict is to save both updates and record
~ Replication makes it much more lik i
; y ‘ : ely to run int
aa HE If different nodes have different copies of nee data ‘which
lependently updated, then we will get conflicts unless we take specific
measures SS aoe them. Using a single node as the target for all writes for some
data makes it much easier to maintain update consistency.
ra Read Consistency
« Problem : One user reads in the middle of another user's writing.
© It is called read-write conflict, inconsistent reading. This leads to logical
inconsistency.
* In NoSQL databases, read consistency refers to the level of consistency between
multiple read operations on the same data. In a distributed database, where data
can be replicated across multiple nodes, ensuring read consistency can be
oo updates, but only within a single
support atomic aly
consistency within an aggregate
* Aggregate-oriented databases do
a ll have logical
aggregate. This means that we W!
but not between aggregates.
* The length of time an inconsistency is ae
A NoSQL system may have a quite short i
There are different levels of 74
Tanging from eventual consistency to
called the inconsistency window.
tency window.
ailable in NoSQL databases,
present is
d consistency Vv
strong consistency:
thrust for knowledge
TECHNICAL PUBLICATIONSth Write (key, A) Write (key, B) i nN
2-22 NoSQL Data Manageneny
Big Data Analytics
i inconsistency to occur betwe,
ee ae ieee oe es gee en
ee bite Secale to all nodes, but it makes no guarantees about how long
a will take a about the order in which updates will be applied.
* Read-your-writes consistency means that once we have iat a record, all of
our subsequent reads of that record will return the update value.
* Session consistency means read-your-writes consistency but at een level,
Session can be identified with a conversation between a client an : Server. As
Jong as the conversation continues, we will read everything we have Written
during this conversation. If the session ends and we start another session with the
same server, there is no guarantee that we can read values we have written
during previous conversation.
* Session consistency is of two types : Sticky session and version stamps.
EZ Quorums
* Quorum consistency is used in systems where consistency is more important than
availability (CAP theorem) for write and read.
* In systems with multiple replicas there is a possibility that the user read:
inconsistent data. This happens say when there are 2 replicas, N1 and N2 in ¢
cluster and a user writes value v1 to node N1 and then another user reads fron
node N2 which is still behind N1 and thus will not have the value v1, so th
second user will not get the consistent state of data.
* In order to achieve a state where at least one node has consistent data we us
quorum consistency.
* Fig. 2.6.1 shows write and read quorums.
Write (key, A) Write (key, B)
(a) Write quorums
ee
ce
(b) Read quorums2-23
: Quorum is achieved when nodes fotlory NOSOL Date Menegement
where 11= Nodes in the the below protocol : wt r>n
Minimum write nog
1 = Minimum read nodes
‘ Here w is our write quorum and +;
. 4s our read quorum.
[ED Relaxing Durability
. en write is committed, the ch is
« In some cases, strict durability =
(write performance). 35 not essential and it can be traded for scalability
« A simple way to relax durability is
to store i : E
regularly, If the system shuts oe L we lose nantes eee
GA Cassandra
+ Cassandra fs a column NoSQL database, It was initially developed by Facebook to
fulfill the needs of the company's Inbox Search services. In 2009, it became an
Apache Project,
Apache Cassandra is an open source, distributed, NoSQL database, Apache
Cassandra in a distributed database system using a shared nothing architecture.
* Apache Cassandra war initially designed at Facebook using a Staged Event-Driven
Architecture (SEDA) to implement a combination of Amazon's Dynamo distributed
tion techniques and Google's Bigtable data and storage engine
storage and replicat
model.
A columnar database, also called @
store, is a database that stores the
storing the values of each row together.
Columnar databases are well suited for
users can determine the consistency
(B}) and analytics. :
. ides tunable consistency © 1°"
paeanca provides epread apd wae operations. Cassandra enables users to
vel by tuning i of replicas in 2 cluster that must acknowledge a read or
configure the num! = .tion suc
\eiite operation before considering © OP eS all nodes in a cluster.
pers 3 tocol to discover node state for ¢ ies in :
* Cassandra uses a EOSSIP PIOHO sara workloads bY distributing data, reads
Cound is see handil tiple nodes with no single point of failure.,
writes (eventually os,
column-oriented database or a wide-column
values of each column together, rather than
big data processing, Business Intelligence
rans? - on uptinst for nowledee2-24 2M
Big Data Analytics
© Features of Cassandra + : able; it allows to
js highly scala?’ add
1) Elastic scalability : Cassandra 38 HEY Ts more data as per rein
i,
custome!
te more
hardware to accommoda oe no single point of failure,
dra
+ Cassandra is linearly scalable, ie, it ince
wr of nodes in the cluster. U
3) Fast linear-scale pe!
throughput as we increase the numb :
4) Flexible data storage Cassandra accommodates all possible data foray
including : Structured, semi-structured and unstructured.
port : Cassandra supports properties like ACID.
vides the flexibility to distribute gg,
tiple datacenters.
2) Always on architecture CAS
formance
5) Transaction sup}
6) Easy data distribution : Cassandra Pr?
where you need by replicating data across mul
Cassandra Architecture
«Fig. 2.7.1 shows Cassandra architecture.
Fig. 2.7.1 Cassandra architecture
« Components of Cassandra architecture
are node, it
memtable, SSTable, Bloom Filters and Cassandra ee 2 7
re.
« Node : A Cassandra node is a place where data is stored.
* Data center ; Data center is a collection of related nod
\odes,
« Cluster : A cluster it i
t is a component which contains one or more data centets-
© Commit log : In Cassandr;
e : ‘a, the commit is ver
Mem-table : A mem-table is
: a memory-resi, it
the data will be writ ryeresident dat es
written to the mem-table, § a structure, Aes CT Aol
there will be multiple mem-tables,
jometimes, for a single-colum”
TECHNICA
M PUBLICATIONS® . an upshmust for kn
mr knowledgeee
| sotable : It is a disk gy
contents reach © to Which NoSQL Data M
o a threshold val fanagement
4 ue,
« Bloom filter : Bloom §; :
whether an elem filters are ve
: ent is a memb, TY fast, nondetermi
filters are accessed after ee er Of a g leterminis|
2-25
the data j
‘a is flus|
shed from the mem-table when its
tic algori
et. Tt j ee igorithms for testing
TY query, 18 a special kind of cache. Bloom
an in-memo! integrity. :
ry structure called a Secacl ese writes are indexed and written to
« A memtable can be thoy
ught of as ,
to cache with its completion i @ write-back cache whi ‘i
advantage of low ee immediately confirmed His oa oe ——
va heaj high throughput. Thi . This has the
Ja ip Memory by default. e memtable structure is kept in
« SSTables : When the commit
it log gets foes
the memtable are written to aa full, a flush is triggered and the contents of
his Genaees ema into an SSTables data file. At the completion of
lea nemtal le is cleared and the commit log is recycled. Cassandra
ally partitions these writes and replicates them throughout the cluster.
Cassandra Data Model
* Some of the features of Cassandra data model are as follows :
1) Data in Cassandra is stored as a set of rows that are organized into tables.
2) Tables are also called column families.
3) Each row is identified by a primary key value.
4) Data is partitioned by the primary key- ’
ae driven approach, in which specific
* Dat ling in Cassandra uses @ query: ?
agen ti key to organizing the data. The main goal of Cassandra data
maaan ig to develop and design @ high-performance and well-organized Cluster.
Apache Ca: dra data model components include keyspaces, tables and
. ssandr
pache
or column families
2 ; ta as a set of rows organized into tables
a) Cassandra stores dat
b) A primary key ¥!
©) The primary key
data
alue identifies each FO"
pased on the primary Key:
partitions 4a¥@
its entirely ‘The components of
in part OF in ism for data storage.1. Keyspaces :
)
2-26 NoSQL Data Management
Big Data Analytics
Fig. 2.7.2 shows Cassandra data model
Cluster
[ KeySpacet KeySpace2
Column family Column family Column family2
Row | Row Row Row
{em Column2}] Column} Column2} | Column ‘Column3||Column1}|Column2}
Value [see Value | Value [Eve | Value Value Value Value
[ee ess
Fig. 2.7.2 Cassandra data model
SQL data model consists of data containers
* At a high level, the Cassandra No!
Jar to the schema in a relational database,
called keyspaces. Keyspaces are simil
Typically, there are many tables in a keyspace.
* Features of keyspaces are :
a) A keyspace needs to be defined before creating tables, as there is no default
keyspace. :
b) A keyspace can contain any number of tables an
keyspace. This represents a one-to-many relationship.
©) Replication is specified at the keyspace level. For example, replication of three
implies that each data row in the keyspace will have three copies.
da table belongs only to one
2. Tables :
are defined
© Tables, alsé called column families in earlier iterations of Cassandra,
within the keyspaces. Tables store data in a set of rows and contain a primary key
and a set of columns.
Cassandra tables are used to hold the actual data in the form of rows and
columns. A table in Cassandra must be created with the primary key during table
creation time, post that it can not be altered.
To alter the table new tables should be created with existing data. The primar!
key would be used to locate and order the data.2-27
NoSQL Dat
me of the features of tables are ; Sa Nenavenent
* |) Tables have multiple rows
and coh
called column family in the SMUMNS. AS mentio
earlier vers
is still referred to
p) it is 8 as column fami
dolienents of Cassandra, ily in some of the error messages and
ned earlier, a table is also
, si
Ons of Cassandra,
) It is important to define a Primary key for a table.
4, columns :
Columns define data structure within a tal
such as Boolean, double, integer and text,
: ble. There are various types of columns,
+ Cassandra column’ is used to store a sin,
L gle piece of data. Th i's
of various types of data such as big int ie column a conta
eget, double, text, float and Boolean.
Fach column value has a timestamp associated with it that shows the time of
update. Cassandra provides’ the collection type of columns such as list, set and
map.
Some of its features are :
a) Columns consist of various types, such as integer, big integer, text, float, double
and Boolean. ;
b) Cassandra also provides collection types such as set, list and map.
¢) Further, column values have an associated time stamp representing the time of
update,
4) This timestamp can be retrieved using the function write time.
Ba Cassandra Clients
* Thrift is the driver-level interface;
a wide variety of languages. Thrift was
. i 7 Juster, allowing it to be queried. Each
A Client holds connections to a ieee ree to the cluster nodes, provides
Client i maintains multip] ies for failed
poe which node to use for each query and handles retries for
query etc... hy single instance is
lived and usually a sing!
* Client § designed to be long: “logged” into one keyspace
coe) instances are peti given Client can only be logs Snopes
Sugh per application. Am Mee to create one client pet ee
: s time, it can oop multiple Keyspaces since it is always p
ever not necessary
ified table name in queries. oe
{se a single session with fully qualifie et ete
das a ring.
it provides the API for client implementations in
developed at Facebook.
* The Cassandra cluster is denote
‘0 show token
.n up-thrust for knowledge
1. PUBLICATIONS
TECHNICA!a th NoSQL Data Man
Big Data Analytics Se
1, Write in action :
the Cassandra nodes and seng
To write, clients need to connect t0 any of ites Wtedsiante °
request. This node is galled the coordinator : a8
and
cluster receives a write request, it delegates it to a service called StorageP roxy hg
the right place to write the data to. The tu
11 the replicas) that are responsible to hoy id
tilizes a replication strategy to do that,
This node may or may not be
StorageProxy is to get the nodes (a
data that is going to be written. It ul
of
the
Once the replica nodes are identified, it sends the RowMutation message to
the node waits for replies from these nodes, but it does not wait for all the repli
to come.
It only waits for as many responses as are enough to satisfy the client's minimuy,
number of successful writes defined by: consistency level.
Write operations at a node level :
« Each node processes the request individually. Every node first writes the
mutation to the commit log and then writes the mutation to the memtable.
Writing to the commit log ensures durability of the write as the memtzble is
an in-memory structure and is only written to disk when the memtable is
flushed to disk. ,
« A memtable is flushed to disk when :
1. It reaches its maximum allocated size in memory
2. The number of minutes a memtable can stay in memory elapses.
3. Manually flushed by a user.
« A memtable is flushed to an immutable structure called as SSTable (Sorted
String Table). The commit log is used for playback purposes in case dat?
from the memtable is lost due to node failure.
Two Marks Questions with Answers
Qi
What Is consistency In a distributed system ?
Ans, : In a distributed system, consistency will be defined as one that responds Wi"
the same output for the same request at the same time across all the replicas.
az
What is database Sharding 7
Ans. : Sharding is a method for distributing a single dataset across multiple datab |
which can then be stored on multiple machines. This al
split into smaller chunks and stored in multiple dat: increasi
Bay of he ryeum. ple data nodes, increasing the t
lows for larger datasets ©
otal storag' |
TECHNICAL PUBLICATIONS® . . ...— | as How Is Sharding different from partitioning ?
Ans.: All partitions of a table reside on the same server whereas Sharding involves
multiple servers. Therefore, Sharding implies a distributed architecture whereas
- pattitioning does not. Partitions can be horizontal (split by tows) or vertical (by
Columns), Shards are usually only horizontal. In other words, all shards share the same
schema but contain different records of the original table.
Q.6 What aro write-writo and read-write conflicts ?
big Date Anaiytics
2-29
NoSQL Data Management
Q3 Why are N.
ae aah losis databases known as schemaless databases 7
iy eee pais levee designed to store and query unstructured data,
Recia can be apelica ae = rigid schemas used by relational databases. Although a
ee mi ed at the application level, NoSQL databases retain all of your
Sheebseapiaatin is original raw format. This means that complete granularity is
, ‘if you later change your application schema - Something that is simpl
not possible with a traditional SQL database. a
a4
‘|
What is the difference between Sharding and replication 7
Ans. : Sharded database servers each contain a part of the overall data, ie. they store
different data on separate nodes. Replicated servers contain identical copies of the
entire database.
Ans. : Write-write conflicts occur when two clients try to write the same data at the
same time, Read-write conflicts occur when one client reads inconsistent data in the
middle of another client's write.
Q.7 Define Cassandra.
‘Ans. ; Cassandra is a distributed, fault tolerant, scalable, column oriented data store.
Cassandra is a peer-to-peer distributed system made up of a cluster of nodes in which
any node can accept read or write request.
Q.8 What Is the use of Bloom filters In Cassandra 7
Ans. : Bloom filters are used as a performance booster. Bloom filters are very fast,
nondeterministic algorithms for testing whether an element is a member of a set. They
are nondeterministic because it is possible to get a false-positive read from a Bloom
filter, but not a false-negative. Bloom filters work by mapping the values in a data set
into a bit array and condensing 2 larger data set into a digest string. Bloom filter is a
special kind of cache.
Q9 Explain sorted strings table. ;
ich i the statics
: trings table which is a file format used by Cassandra to store
aM sere the memtables. The Cassandra SSTables are immutable hence any
ia a ble creates a new SSTables file. The data structure format used by
eeTubled is Log Structured Merge which is qualified for writing intense heavy data sets
compared to the traditional B tree structure.
TECHNICAL ‘PUBLICATIONS® - an up-thrust for knowledge2-30 NoS@L. Data Managemony
Q.10 Explain Cassandra data center. —]
‘Ans. : Cassandra data center is the collection of related nodes which are configured in
the cluster to perform the replication, The data centers can be physical data centers or
depending upon the workload a separate data center can be
By Data Anadtics
logical data center and
used,
Qt
‘Ans. : ¢ Advantages of graph data +
a) More descriptive queries
b) Greater flexibility in adapting your model
©) Greater performance when traversing data relationships.
¢ Disadvantages of graph data stores :
a) Difficult to scale,
b) No standard language.
Q.12 Describe session consistency.
‘Ans. Session consistency means read-your-writes consistency but. at session level.
Session can be identified with a conversation between a client and a server. As long as
the conversation continues, we will read everything. we have written during this
conversation. If the session ends and we start another session with the same server,
there is no guarantee that we can read values we have written during previous
Explain advantages and disadvantages of graph data.
conversation.
Q.13 What are schemaless databases 7
‘Ans. : Schemaless databases are a type of NoSQL databases that do not have a
predefined schema or structure for data. This means that data can be inserted and
retrieved without adhering to'a specific structure and the database can adapt to
changes in data over time without requiring schema migrations or changes.
Qaa