Big Data and NoSQL
Modified from slides by Perry
Hoekstra (Perficient, Inc) and
Database System Concepts, 7th
Ed.
Motivation
Very large volumes of data being collected
– Driven by growth of web, social media, and more
recently internet-of-things
– Web logs were an early source of data
• Analytics on web logs has great value for
advertisements, web site structuring, what posts to
show to a user, etc.
Big Data: differentiated from data handled by
earlier generation databases
– Volume: much larger amounts of data stored
– Velocity: much higher rates of insertions
– Variety: many types of data, beyond relational
data
2
Querying Big Data
Transaction processing systems that need very
high scalability
– Many applications willing to sacrifice ACID
properties and other database features, if they can
get very high scalability
Query processing systems that
– Need very high scalability, and
– Need to support non-relational data
3
History of the World
Relational Databases – mainstay of business
Web-based applications caused spikes
– Especially true for public-facing e-Commerce sites
Developers began to front the RDBMS with memcache or
integrate other caching mechanisms within the
application (e.g., Ehcache)
4
Scaling Up
Issues with scaling up when the dataset is just too
big
RDBMS were not designed to be distributed
Began to look at multi-node database solutions
Known as ‘scaling out’ or ‘horizontal scaling’
Different approaches include:
– Master-slave
– Sharding
5
Scaling RDBMS – Master/Slave
Master-Slave
– All writes are written to the master. All reads
performed against the replicated slave databases
– Critical reads may be incorrect as writes may not have
been propagated down
– Large data sets can pose problems as master needs to
duplicate data to slaves
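
The stale-read problem above can be sketched in a few lines. This is a toy in-memory model, not any real database's replication API: the `Master` and `Replica` classes and the explicit `replicate` step are assumptions made for illustration, with replication lag modeled as a queue of writes that has not been shipped yet.

```python
class Master:
    """Accepts all writes and queues them for asynchronous replication."""

    def __init__(self):
        self.data = {}
        self.pending = []  # writes not yet shipped to the replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self, replicas):
        """Ship every pending write to every replica, then clear the queue."""
        for key, value in self.pending:
            for replica in replicas:
                replica.data[key] = value
        self.pending.clear()


class Replica:
    """Serves reads only; its data may lag behind the master."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


master = Master()
replicas = [Replica(), Replica()]

master.write("balance", 100)
stale = replicas[0].read("balance")  # replication has not run yet
master.replicate(replicas)
fresh = replicas[0].read("balance")  # the write has now propagated

print(stale, fresh)  # None 100
```

A critical read issued between the write and the replication step sees the old state, which is why such reads are often sent to the master instead.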
6
Scaling RDBMS - Sharding
Partitioning or sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-
aware
– Can no longer have relationships/joins across
partitions
– Loss of referential integrity across shards
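
A minimal sketch of what "partition-aware" means for the application, assuming simple hash-based sharding over in-memory dictionaries. The shard count, the key format, and the helper names (`shard_for`, `put`, `get`, `scan_all`) are hypothetical; real deployments may use range-based or directory-based schemes instead.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each dict stands in for a separate database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key, row):
    shards[shard_for(key)][key] = row


def get(key):
    return shards[shard_for(key)].get(key)


def scan_all(predicate):
    """A cross-shard query: the application must visit every shard itself."""
    return [row for shard in shards for row in shard.values() if predicate(row)]


put("user:1", {"name": "Ada", "city": "London"})
put("user:2", {"name": "Linus", "city": "Helsinki"})
put("user:3", {"name": "Grace", "city": "London"})

londoners = scan_all(lambda row: row["city"] == "London")
```

Single-key reads and writes touch exactly one shard, which is why they scale well. Anything that spans keys, such as the `scan_all` query or a join, has to be assembled in application code.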
7
Other ways to scale RDBMS
Multi-Master replication
INSERT only, not UPDATES/DELETES
No JOINs, thereby reducing query time
– This involves de-normalizing data
In-memory databases
8
What is NoSQL?
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do
they use the concept of joins
All NoSQL offerings relax one or more of the ACID
properties (will talk about the CAP theorem)
9
Why NoSQL?
For data storage, an RDBMS cannot be the
be-all/end-all
Just as there are different programming languages,
need to have other data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now
than 5 years ago
10
How did we get here?
Explosion of social media sites (Facebook,
Twitter) with large data needs
Rise of cloud-based solutions such as Amazon
S3 (Simple Storage Service)
Just as developers moved to dynamically-typed
languages (Ruby/Groovy), there was a shift to
dynamically-typed data with frequent schema
changes
Open-source community
11
Dynamo and BigTable
Three major papers were the seeds of the NoSQL
movement
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
– CAP Theorem (discuss in a sec ..)
12
The Perfect Storm
Large datasets, acceptance of alternatives, and
dynamically-typed data have come together in a
perfect storm
Not a backlash/rebellion against RDBMS
SQL is a rich query language that cannot be rivaled
by the current list of NoSQL offerings
13
CAP Theorem
Three properties of a system: consistency,
availability and partition tolerance
You can have at most two of these three properties
for any shared-data system
To scale out, you have to partition. That leaves
either consistency or availability to choose from
– In almost all cases, you would choose availability over
consistency
14
15
Availability
Traditionally, thought of as the server/process
available five 9’s (99.999 %).
However, for a system with many nodes, at almost any point
in time there’s a good chance that a node is either
down or there is a network disruption among the
nodes.
– Want a system that is resilient in the face of network
disruption
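
A back-of-the-envelope calculation shows why this happens. The sketch assumes each node fails independently and has the same availability; both are simplifying assumptions, and the 99.9% figure is illustrative rather than taken from any measurement.

```python
def prob_any_node_down(node_availability: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes independent nodes is down."""
    return 1.0 - node_availability ** num_nodes


# Assumed per-node availability of 99.9%, for illustration only.
for n in (1, 100, 1000, 10000):
    print(n, round(prob_any_node_down(0.999, n), 4))
```

Under these assumptions, the chance that at least one of 1000 nodes is down at a given moment is roughly 63%, even though each individual node is up 99.9% of the time.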
16
Consistency Model
A consistency model determines rules for visibility
and apparent order of updates.
For example:
– Row X is replicated on nodes M and N
– Client A writes row X to node N
– Some period of time t elapses.
– Client B reads row X from node M
– Does client B see the write from client A?
– Consistency is a continuum with tradeoffs
– For NoSQL, the answer would be: maybe
– CAP Theorem states: Strict Consistency can't be
achieved at the same time as availability and partition-
tolerance.
17
Eventual Consistency
When no updates occur for a long period of time,
eventually all updates will propagate through the
system and all the nodes will be consistent
For a given accepted update and a given node,
eventually either the update reaches the node or the
node is removed from service
Known as BASE (Basically Available, Soft state,
Eventual consistency), as opposed to ACID
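
A small simulation of this convergence, as a sketch only. It assumes each replica stores a (timestamp, value) pair per key and resolves conflicts by keeping the newest timestamp ("last write wins"). That rule is one common choice, not the only one, and the `merge` helper is hypothetical.

```python
def merge(local: dict, remote: dict) -> None:
    """Anti-entropy step: per key, keep the entry with the newest timestamp."""
    for key, (ts, value) in remote.items():
        if key not in local or local[key][0] < ts:
            local[key] = (ts, value)


# Three replicas; each maps key -> (timestamp, value).
replicas = [{}, {}, {}]

# Updates are accepted by different replicas at different (logical) times.
replicas[0]["x"] = (1, "a")
replicas[1]["x"] = (2, "b")
replicas[2]["y"] = (1, "c")

diverged = replicas[0] != replicas[1]  # replicas disagree right after the writes

# Updates stop; replicas exchange state pairwise until nothing changes.
changed = True
while changed:
    before = [dict(r) for r in replicas]
    for i in range(len(replicas)):
        for j in range(len(replicas)):
            if i != j:
                merge(replicas[i], replicas[j])
    changed = before != replicas

print(replicas[0])
```

While updates are still arriving, a read can return different answers depending on which replica serves it. Once updates stop and the exchange has run, every replica holds the same state.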
18
What kinds of NoSQL
NoSQL solutions fall into two major areas:
– Key/Value or ‘the big hash table’.
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
– Schema-less which comes in multiple flavors,
column-based, document-based or graph-
based.
• Cassandra (column-based)
• CouchDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
19
Key/Value
Pros:
– very fast
– very scalable
– simple model
– able to distribute horizontally
Cons:
- many data structures (objects) can't be easily modeled
as key value pairs
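
To make the modeling limitation concrete, here is a sketch in which a plain Python dict stands in for the key/value store and nested objects are stored as JSON strings. The key format and the `put`/`get` helpers are assumptions for illustration, not the API of any particular product.

```python
import json

store = {}  # a plain dict stands in for the key/value store


def put(key, obj):
    store[key] = json.dumps(obj)  # the store sees only an opaque string


def get(key):
    raw = store.get(key)
    return None if raw is None else json.loads(raw)


put("order:1001", {"customer": "c42", "items": [{"sku": "A1", "qty": 2}]})
put("order:1002", {"customer": "c7", "items": [{"sku": "B9", "qty": 1}]})

# Lookup by key is a single hash access.
order = get("order:1001")

# Lookup by anything else means deserializing and scanning every value.
orders_for_c7 = [k for k in store if json.loads(store[k])["customer"] == "c7"]
```

The store is fast and simple precisely because it knows nothing about the value's structure. The cost is that any query other than "get by key" falls back to the application.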
20
Schema-Less
Pros:
- Schema-less data model is richer than key/value pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability
Cons:
- typically no ACID transactions or joins
21
Common Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical
and fault-tolerant) and can be partitioned
– Down nodes easily replaced
– No single point of failure
Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
22
What am I giving up?
joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful
query language
easy integration with other applications that support
SQL
23
Cassandra
Originally developed at Facebook
Follows the BigTable data model: column-oriented
Uses the Dynamo Eventual Consistency model
Written in Java
Open-sourced and exists within the Apache family
Uses Apache Thrift as its API
24
Cassandra and Consistency
Talked previously about eventual consistency
Cassandra has programmable read/write
consistency
Read consistency levels:
– One: Return from the first node that responds
– Quorum: Query all nodes and respond with the
one that has the latest timestamp once a majority of
nodes have responded
– All: Query all nodes and respond with the one
that has the latest timestamp once all nodes have responded.
An unresponsive node will fail the read
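
The three read levels can be sketched as a single function. This is a simplified model of the behavior described in the list, not Cassandra's client API or its internal read path: the `read` function, the level strings, and the representation of responses as (timestamp, value) pairs are all assumptions for illustration.

```python
def read(replica_responses, level):
    """Return the value handed back for a read at the given consistency level.

    replica_responses: list of (timestamp, value) in arrival order, with None
    standing for a replica that did not respond.
    """
    n = len(replica_responses)
    answered = [r for r in replica_responses if r is not None]

    if level == "ONE":
        needed = 1
    elif level == "QUORUM":
        needed = n // 2 + 1
    elif level == "ALL":
        needed = n
    else:
        raise ValueError(f"unknown level: {level}")

    if len(answered) < needed:
        raise TimeoutError(f"{level} needs {needed} responses, got {len(answered)}")

    considered = answered[:needed]                 # the first `needed` responders
    return max(considered, key=lambda r: r[0])[1]  # latest timestamp wins


responses = [(5, "old"), (9, "new"), None]  # the third replica is unresponsive

one = read(responses, "ONE")
quorum = read(responses, "QUORUM")
try:
    read(responses, "ALL")
    all_failed = False
except TimeoutError:
    all_failed = True

print(one, quorum, all_failed)  # old new True
```

In this example the first responder happens to hold a stale value, so a read at level One returns it, Quorum picks up the newer value, and All fails because one replica never answers.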
27
Cassandra and Consistency
Write consistency levels:
– Zero: Ensure nothing. Asynchronous write done in
background
– Any: Ensure that the write is written to at least 1 node
– One: Ensure that the write is written to at least 1
node’s commit log and memory table before receipt to
client
– Quorum: Ensure that the write goes to a majority of nodes (N/2 + 1)
– All: Ensure that writes go to all nodes. An
unresponsive node would fail the write
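
The arithmetic behind these levels can be sketched as follows. The function names are hypothetical, and the overlap rule R + W > N is the standard quorum argument rather than something stated on the slide: if the read set and the write set together exceed the number of replicas, they must share at least one node.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def acks_required(level: str, replication_factor: int) -> int:
    """Acknowledgements a write waits for at each level in the list above.

    ANY and ONE both wait for one node; per the list, ONE additionally
    requires that node's commit log and memory table to hold the write.
    """
    table = {
        "ZERO": 0,
        "ANY": 1,
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def read_sees_latest_write(read_acks: int, write_acks: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > replication_factor


N = 3  # assumed number of replicas, for illustration
print(quorum(N))                                       # 2
print(read_sees_latest_write(quorum(N), quorum(N), N))  # True
print(read_sees_latest_write(1, 1, N))                  # False
```

With three replicas, quorum reads combined with quorum writes always overlap on at least one node, while reading and writing at level One gives no such guarantee.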
28
Some Statistics
Facebook Search
MySQL > 50 GB Data
– Writes Average : ~300 ms
– Reads Average : ~350 ms
Rewritten with Cassandra > 50 GB Data
– Writes Average : 0.12 ms
– Reads Average : 15 ms
29
Don’t forget about the DBA
It does not matter if the data is deployed on a
NoSQL platform instead of an RDBMS.
Still need to address:
– Backups & recovery
– Capacity planning
– Performance monitoring
– Data integration
– Tuning & optimization
What happens when things don’t work as
expected and nodes are out of sync or you
have data corruption occurring at 2 am?
Who you gonna call?
– DBA and SysAdmin need to be on board
30
Where would I use it?
Where would I use a NoSQL database?
Do you have a large set of uncontrolled,
unstructured data somewhere that you are trying to fit into an
RDBMS?
– Log Analysis
– Social Networking Feeds (many firms hooked in
through Facebook or Twitter)
– External feeds from partners (EAI)
– Data that is not easily analyzed in an RDBMS such as
time-based data
– Large data feeds that need to be massaged before
entry into an RDBMS
31
Summary
Leading users of NoSQL datastores are social
networking sites such as Twitter, Facebook, LinkedIn,
and Digg.
To implement a single feature in Cassandra, Digg
has a dataset that is 3 terabytes and 76 billion
columns.
Not every problem is a nail and not every solution is
a hammer.
32