BigData and NoSQL DBs
Piyush Gupta, July 2013
Background
Up until the internet age, most data was generated within the enterprise.
Emphasis was on schema design; the enterprise was in control. What data do I need to store?
A huge amount of data is now coming at the enterprise.
The new paradigm requires upfront query design. What business intelligence do I want to extract from everything that's coming at me?
Monetization from Increased Market Share, Increased Enterprise Ops Efficiency, New Consumer Products (Internet of Things).
Internet of Things: We Need Things - Our Things Need Us Too!
Things are smart, loaded with sensors, CPUs, and connectivity (Wi-Fi, cellular). Hotspots are ubiquitous: home, pole-tops, stores. Things are talking to each other.
Cars will talk to each other. (www.waze.com)
Our Things are talking to us! (Cars, appliances, home, parking meters - www.sfpark.org)
Next up? Cars bid for parking with the meter!
Hyundai BlueLink emails the owner to schedule maintenance!
Calm down Dave, I have found parking!!!
BigData Play Areas
Data Repository
Data Acquisition (e.g. web crawlers, server logs, ETL: Extract-Transform-Load)
Data Storage (various DB solutions)
R/W Latency, throughput, storage technology, capacity. (4/40/400/4000 TB)
Monitoring of the DB (DevOps)
Data Analytics.
Time to Generate Results: months, days, hours, minutes, seconds, sub-second. Mathematical packages for analytics (statistical, sentiment, text search).
Results Presentation
User-specific dynamic web content generated in real time, driven by analytics. Data visualization.
BigData Companies
Big guys (~1% of revenue is from BigData plays): IBM, Intel, Oracle, HP, Teradata, Fujitsu, Amazon (18%).
About 120 BigData startups, many VC funded.
A platform built for everyone is a platform built for no one!
Good resources:
www.BigDataLandscape.com
www.451Research.com
The BigData Landscape!
Source: www.BigDataLandscape.com (Dave Feinleib)
NoSQL Databases
Common theme: distributed key-value store.
Data is replicated. Some DBs require it, others make it optional. Most use cases will replicate.
Throughput, capacity, storage technology, and latency are closely coupled. 4/40/400 TB rule of thumb.
RAM memory: lowest latency, highest cost => limited capacity. ns - µs / ~4 TB.
SSD - Solid State Disk memory (vendors: FusionIO, Violin, OCZ, HP, Intel): lower latency (~50 µs read / ~500 µs write), $2-$4/GB (was $20), 0.x - x ms / 40 - 100 TB clusters.
HDD - rotational disk: highest latency, lowest cost, xx - xxx ms / 400 TB+.
CAP Theorem for Distributed Databases (E. Brewer, UC Berkeley, 2000)
C - Consistency (across replicas), A - Availability (latency, guaranteed response), P - Partition Tolerance (tolerance to a node going down). You can only have two of the three. (Challenged by some!)
CA: High consistency and availability; accept no response if a node goes down. (Challenge: then how is this A?)
AP: High availability on a cluster of nodes; accept eventual consistency.
CP: High consistency and cluster scalability; accept some unavailability during node failure/addition.
Professor Eric Brewer, UC Berkeley, explains the CAP theorem.
=> Understand a DB's design goal. It is unfair to compare CP vs AP on latency. Most NoSQL DBs are either AP-dominant or CP-dominant.
Comparing Performance
It is hard to create a level playing field when comparing performance across DBs. What matters to your application? Seek the best performance for your use case.
NoSQL DB designs are targeted.
Latency (~1 ms, 10 ms, 50 ms), read-heavy vs write-heavy, consistency
Throughput (operations/sec)
Storage medium: RAM vs SSD vs HDD
Storage capacity: 4/40/400/4000 TB
Storage format: HDFS / NFS / log-structured file system
Content type: K-V store, document store, graph. (Up to 1 MB / 1-10 MB / >10 MB object size)
Basic Operations
Any distributed database offers these basic operations:
WRITE (Update, Delete)
READ (Indexing to speed reads)
Map-Reduce (optional) or an Aggregation Pipeline: bring computation to the data.
Use network bandwidth to move results instead of data.
Incurs communication overhead.
Map-Reduce Concept
Data: a, x, z, d, b, x, e, d, a, b, c. Need: count of x.
Map: Count(x) => result. Reduce: Sum(result) => sum.
Application Server: needs the sum.
Client Server: sends the map request to each node, runs the reduce.
dB Node 0 holds a, x, z, d; dB Node 1 holds b, x, e, d; dB Node 2 holds a, b, c.
Map-Reduce Concept
Storage network: data is stored distributed on servers; bring the data to the client server for computation.
Distributed DB: shard (split / spread) the data across DB servers.
MAP phase - the client server asks the DB server nodes to compute on their share of the data and return the results. (DB-server-side user defined function)
REDUCE phase - the client assimilates the results from the nodes and computes the net result. (Client-server-side user defined function)
Only a partial map-reduce is possible if the computation on one record subsequently depends on data in another record.
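A minimal, single-process Python sketch of the counting example above; the node shards and helper names are illustrative stand-ins, not any particular database's API:

# Each "node" holds one shard of the data from the example above.
node_shards = {
    "node0": ["a", "x", "z", "d"],
    "node1": ["b", "x", "e", "d"],
    "node2": ["a", "b", "c"],
}

def map_count(shard, value):
    # MAP: runs on each DB node against only its own shard of the data.
    return sum(1 for item in shard if item == value)

def reduce_sum(partial_results):
    # REDUCE: runs on the client server over the per-node results.
    return sum(partial_results)

# The client server "sends" the map request to every node, then reduces.
partials = [map_count(shard, "x") for shard in node_shards.values()]
print(reduce_sum(partials))  # => 2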
Evaluating DB Performance
Understand the WRITE path of a database architecture. The READ path trails the WRITE path.
How is the data indexed?
Indexing options in a database: Primary Key index, Secondary Indexes.
Indexing speeds up READs.
Indexing slows down WRITEs.
Indexing slows down REPLICATION.
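A minimal sketch (plain Python, not any specific DB) of why a secondary index speeds up reads but adds work to every write:

from collections import defaultdict

records = {}                   # primary key -> record
age_index = defaultdict(set)   # secondary index: age -> set of primary keys

def write(pk, record):
    old = records.get(pk)
    if old is not None:
        age_index[old["age"]].discard(pk)   # extra index maintenance on update
    records[pk] = record
    age_index[record["age"]].add(pk)        # every write also touches the index

def read_by_age(age):
    return [records[pk] for pk in age_index[age]]  # index lookup, no full scan

write("u1", {"name": "Ann", "age": 30})
write("u2", {"name": "Bob", "age": 30})
print(read_by_age(30))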
What Data Model is the db optimized for?
Data Models for NoSQL Databases
It can be argued that all distributed databases are key-value stores. Record-based organization. (May support TTL - Time To Live.)
All columns are reasonably populated; values are ints, strings, blobs. Aerospike (128 KB - 1 MB, C), Riak (1 - 10 MB, Erlang).
Columnar databases: sparsely populated row-column matrix.
Cassandra, HBase
Document Store databases: Key:BLOB store (BLOB - Binary Large Object).
Store JSON objects as BSON BLOBs.
MongoDB (16 MB, C++), CouchbaseDB (1 MB mem / 20 MB HDD, Erlang).
Graph databases (Neo4j): who is related to whom (Facebook, LinkedIn).
Store nodes/vertices and edges.
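For illustration only, the same user shaped for three of the data models above (the field names are made up, not any vendor's schema):

# Key-value / record store: a key plus a handful of typed bins/columns.
kv_record = ("users", "u42", {"name": "Ann", "age": 30, "city": "SF"})

# Document store: one key maps to an entire nested JSON/BSON document.
document = {
    "_id": "u42",
    "name": "Ann",
    "address": {"city": "SF", "zip": "94107"},
    "follows": ["u7", "u13"],
}

# Graph store: the same relationships as explicit vertices and edges.
vertices = {"u42": {"name": "Ann"}, "u7": {"name": "Bob"}}
edges = [("u42", "follows", "u7")]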
Master-Slave Distribution Model (e.g. MongoDB)
Diagram: the Application Server talks to a Config Server (with two failover config servers) and to three shards. Shard Node 0 holds a, x, z, d; Shard Node 1 holds b, x, e, d; Shard Node 2 holds a, b, c. Each shard node has two replicas (Replica 1 and Replica 2).
Master-Slave architecture vs Shared-Nothing architecture: Partition Tolerance is the major difference. Both node failure and network failure to the master affect performance.
Peer-to-Peer Distribution Model - Shared Nothing (e.g. Aerospike)
put ns, set, {pk: rec1, bin1: 1, bin2: test}
Hash(set, pk) => 20-byte digest (160 bits), e.g. 0110011...001101010111...001010.
12 specific bits of the digest => 4096 partitions.
Partition Table: maps each partition # (0 - 4095) to its master node (M) and replica nodes (R1, R2).
Application Server: issues the put.
Client Server: hashes (set, pk), looks up the partition table, sends the record to the master node for that partition.
Each node (Node 0, 1, 2): is aware of its peers (multicast), knows the partition table, has a write buffer in memory and a persistent store on SSD.
Master: gets the data, writes to its memory buffer, updates the index table in memory, sends the write to R1 and R2, and waits for R1 and R2.
Writes to SSD are asynchronous; replicas also update their in-memory index tables.
For reads, look up the index table. An index table entry is 20 bytes of digest + 64 bytes of metadata.
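A minimal Python sketch of the hashed routing described above; the partition-table contents and node numbering are illustrative, and SHA-1 merely stands in for the DB's 160-bit digest function:

import hashlib

NUM_PARTITIONS = 4096
# partition id -> (master node, replica 1, replica 2); assignments are made up
partition_table = {p: (p % 3, (p + 1) % 3, (p + 2) % 3)
                   for p in range(NUM_PARTITIONS)}

def partition_for(set_name, pk):
    digest = hashlib.sha1(f"{set_name}:{pk}".encode()).digest()  # 20-byte digest
    return ((digest[0] << 8) | digest[1]) & 0x0FFF               # 12 bits -> 0..4095

def route_put(set_name, pk, bins):
    p = partition_for(set_name, pk)
    master, r1, r2 = partition_table[p]
    # the client sends the record to the master; the master forwards to R1 and R2
    return {"partition": p, "master": master, "replicas": (r1, r2), "bins": bins}

print(route_put("set", "rec1", {"bin1": 1, "bin2": "test"}))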
Data Distribution (Sharding)
Hashing the record's Primary Key prevents hot-spotting. Incoming data is auto-sharded.
MongoDB and RethinkDB do not hash the primary key, which allows range queries on the Primary Key.
In the hashed-key case, the user can store a copy of the Primary Key (if it is an integer) in a bin, build a secondary index on it, and do range queries (sketched below).
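A hedged sketch of that workaround using the Aerospike Python client; the host, namespace, set, and bin names are assumptions, so treat it as illustrative rather than a recipe:

import aerospike
from aerospike import predicates as p

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Store a copy of the integer primary key in a bin of its own.
client.put(("test", "users", "user:42"), {"user_id": 42, "name": "Ann"})

# Build a secondary index on that bin (one-time operation).
client.index_integer_create("test", "users", "user_id", "user_id_idx")

# Range query on the stored key copy - not possible on the hashed primary key.
query = client.query("test", "users")
query.where(p.between("user_id", 10, 100))
for _key, _meta, bins in query.results():
    print(bins)

client.close()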
Optimistic Locking vs Pessimistic Locking
Consider:
read (x)
Increment x by 1
write (x)
Example timeline: C1 reads X=5; C2 reads X=5; C2 writes X=6; C3 reads X=6; C3 writes X=7; finally C1 writes X=6, clobbering C3's update.
Multiple clients are reading and trying to modify x.
Pessimistic Locking: lock before the read for a client, release the lock after the client has finished writing. What if the client never returns?
Optimistic Locking (also called CAS - Check and Set): give the client x and its metadata on read; lock only at commit. The client provides the new x and the metadata at write time. If the metadata mismatches, the client retries.
=> Pro/Con: x stays open for reads while one client is about to modify it.
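A minimal sketch of the optimistic (CAS) path; the generation counter stands in for whatever version metadata the DB returns with a read:

import threading

class Record:
    def __init__(self, value):
        self.value = value
        self.generation = 0            # version metadata returned with every read
        self._lock = threading.Lock()  # guards only the brief commit, not the read

    def read(self):
        return self.value, self.generation

    def cas_write(self, new_value, expected_generation):
        with self._lock:
            if self.generation != expected_generation:
                return False           # someone else wrote first - caller retries
            self.value = new_value
            self.generation += 1
            return True

def increment(record):
    while True:                        # optimistic retry loop
        value, gen = record.read()
        if record.cas_write(value + 1, gen):
            return

x = Record(5)
increment(x)
increment(x)
print(x.read())  # (7, 2)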
Performance Data Throughput (SSD synch) (Thumbtack)
Charts: throughput (ops/sec, balanced and read-heavy workloads) and balanced-workload READ latency (average latency in ms vs throughput in ops/sec) for Aerospike, Cassandra, and MongoDB.
Performance Data Partition Tolerance (SSD)
Node goes down, then comes back up: ~500 sec of downtime. A node joining the cluster causes automatic data rebalancing on Aerospike.
Performance Data Throughput (Altoros)
Data on MongoDB, Riak, HBase, and Cassandra (low throughput range).
Consistent with Thumbtack's data. HBase data is consistent with Hstack's data.
Chart: balanced-workload READ latency (average latency in ms vs throughput in ops/sec) for Aerospike, Cassandra, and MongoDB.
Evaluating Performance - Things to Look For
Do the test conditions apply to your use case? A well-designed DB will give you a throughput of ~80% of the network bandwidth. CPU usage should not max out; queue size shouldn't bloom. DB design should not result in hot-spotting, i.e. excessive load on a particular node.
Pessimistic locking vs optimistic locking. Excessive locking causes performance deterioration.
A faulty node, mismatched nodes, or non-identical h/w configuration across nodes affects performance => have a uniform cluster.
Take Away
Engineers are the smartest people on earth! (OK, I am biased)
All designs are good!
Database Design => Pros & Cons (CAP Theorem rules!)
=> Match your use case to what the DB is designed for <=
Use Cases
Aerospike: Ideally suited as a K-V store, ~2-3 ms latency, 500K ops/sec, 1 - 100 TB SSD store. Efficiently using SSDs is their secret sauce!
Potential Use Cases:
Real-time ad bidding market
Real-time sports analytics (check out www.SportsVision.com)
Real-time interaction with Things
MongoDB, Couchbase 2.0: Ideally suited for a document store, 16 MB / 20 MB per document, low throughput requirements, relaxed latency requirements. Replication and scalability are not without their challenges. (Web server developers like the APIs, server-side JavaScript, and JSON data pipeline.)
Cassandra: Ideally suited for relaxed consistency requirements, large capacity (400 TB), sparsely populated data matrix, columnar database, K-V store. (Netflix and eBay are major adopters.)
HBase: Ideally suited as a K-V store in the ETL pipeline on a Hadoop cluster - best capacity, higher latency and lower throughput (~35 ms at 20K ops/sec). Facebook switched from Cassandra to HBase in 2010.
Performance Comparisons - References
Cassandra, HBase, MongoDB, Riak (www.altoros.com)
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html?page=1
Cassandra, CouchDB, MongoDB, Aerospike (www.thumbtack.net)
http://www.slideshare.net/bengber/no-sql-presentation
HBase performance tests (www.hstack.org)
http://hstack.org/hbase-performance-testing - concise, good info.
Read the details and choose the correct DB for your application!
Nomenclature
Equivalents....
RDBMS                 | Aerospike  | MongoDB, CouchBaseDB* | Cassandra, HBase
database / tablespace | namespace  | db                    | keyspace
table                 | set        | collection            | column-family
primary key           | key        | _id                   | row-id
row                   | record     | document              | row
column                | bin        | n/a (json)            | column-name
Obligatory Plug for my Employer!
Aerospike Community Edition available for free download: 2-node cluster / 200 GB store.
www.aerospike.com
Look me up on LinkedIn! Piyush Gupta, Aerospike - py@aerospike.com