Chapter 14
Big Data and NoSQL
Learning Objectives
●
Explain the role of Big Data in modern business
●
Describe the primary characteristics of Big Data and
how these go beyond the traditional “3 Vs”
●
Explain how the core components of the Hadoop
framework operate
●
Identify the major components of the Hadoop
ecosystem
Learning Objectives
●
Summarize the four major approaches of the NoSQL
data model and how they differ from the relational model
●
Describe the characteristics of NewSQL databases
●
Understand how to work with document databases
using MongoDB
●
Understand how to work with graph databases using
Neo4j
Big Data: Definitions
●
Volume: quantity of data to be stored
●
Scaling up: keeping the same number of systems
but migrating each one to a larger system
●
Scaling out: when the workload exceeds server
capacity, it is spread out across a number of
servers
Big Data: Definitions
●
Velocity: speed at which data is entered into system
and must be processed
●
Stream processing: focuses on input processing and requires analysis of the data stream as it enters the system
●
Feedback loop processing: analysis of data to
produce actionable results
Feedback Loop Processing
[Figure: feedback loop processing]
Big Data: Definitions
●
Variety: variations in the structure of data to be
stored
●
Structured data: fits into a predefined data
model
●
Unstructured data: does not fit into a predefined
model
Big Data
●
Big Data generally refers to a set of data that
displays the characteristics of volume, velocity,
and variety (the 3 Vs) to an extent that makes
the data unsuitable for management by a
relational database management system.
Other Definitions
●
Variability: changes in meaning of data based on context
●
Sentiment analysis: attempts to determine if a statement conveys a positive, negative, or neutral attitude about a topic
●
Veracity: trustworthiness of data
●
Value: degree to which data can be analyzed to provide meaningful insights
●
Visualization: ability to graphically present data in ways that make it understandable
Big Data: What to Do?
●
Use Hadoop
●
De facto standard for most Big Data storage
and processing
●
Java-based framework for distributing and
processing very large data sets across clusters
of computers
Hadoop Components
●
Hadoop Distributed File System (HDFS): low-
level distributed file processing system that can
be used directly for data storage
●
MapReduce: programming model that supports
processing large data sets
HDFS Characteristics
●
High volume: default block size is 64 MB and can be configured to even larger values
●
Write-once, read-many: model simplifies concurrency issues and
improves data throughput
●
Streaming access: optimized for batch processing of entire files as a
continuous stream of data
●
Fault tolerance: designed to replicate data across many different
devices so that when one fails, data is still available from another
device
HDFS
[Figure: HDFS architecture]
HDFS
●
Client Node: writes or accesses data
●
Name Node: holds metadata
– Which blocks are associated with which files
– Where the blocks are stored
●
Data Node: holds data
– Data is replicated over multiple data nodes
Adding a New File
●
Client node tells name node it wants to add a file
●
The name node...
– Adds the new file name to the metadata
– Determines new block numbers for the file
– Determines the list of data nodes where the blocks will be stored
– Passes that information back to the client node
●
Client node sends blocks to data nodes
●
Data nodes write the data
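A minimal sketch of this write path using the Hadoop Java client; the cluster address and file path are placeholders, and the name node negotiation happens behind the create() call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);
        // create() contacts the name node for block assignments, then the
        // stream sends the blocks to the chosen data nodes
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("hello HDFS");
        }
        fs.close();
    }
}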
Reading Data
●
Client node tells name node it wants to read a file
●
Name node returns the list of blocks and the data nodes where they are stored
●
Client node contacts closest data nodes on the
network for the data
●
Data nodes send data to client node
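A matching read sketch with the same placeholder address; open() fetches the block list from the name node, and the reads are then served by nearby data nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);
        // open() asks the name node where the blocks live; the stream then
        // pulls the data from the closest data node holding each block
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}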
MapReduce
●
Framework used to process large data sets across clusters
●
Breaks down complex tasks into smaller subtasks, performs the subtasks, and produces a final result
●
Map function takes a collection of data and sorts and filters it into a
set of key-value pairs
– Mapper program performs the map function
●
Reduce function summarizes the results of the map function to produce a single result
– Reducer program performs the reduce function
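The canonical example is counting words. A minimal mapper and reducer sketch using the Hadoop Java API; job setup, input format, and cluster configuration are omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map function: sorts/filters each input line into key-value pairs (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

// Reduce function: summarizes the mapper output into one count per word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // emit (word, total)
    }
}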
More Than Just Hadoop
[Figure: the Hadoop ecosystem]
More Than Just Hadoop
●
Hive
– Data warehousing system that sits on top of HDFS
and supports its own SQL-like language
●
Pig
– Tool that compiles a high-level scripting language, named Pig Latin, into MapReduce jobs for execution in Hadoop
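As a sketch of Hive's SQL-like interface, here is a query against HiveServer2 from Java over JDBC; the host, database, table, and credentials are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the Hive driver
        // Placeholder HiveServer2 endpoint and credentials
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
        try (Statement stmt = conn.createStatement();
             // The SQL-like query is compiled into jobs that run on the cluster
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM inventory GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        } finally {
            conn.close();
        }
    }
}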
More Than Just Hadoop
●
Flume
– Component for ingesting data into Hadoop
●
Sqoop
– Tool for converting data back and forth between a
relational database and the HDFS
More Than Just Hadoop
●
HBase
– Column-oriented NoSQL database, designed to sit on top of HDFS, that quickly processes sparse data sets
●
Impala
– The first SQL-on-Hadoop application
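A minimal sketch of writing and reading one cell with the HBase Java client, assuming a table named inventory with a column family named stock already exists and that the HBase configuration on the classpath points at the cluster:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (assumed to point at the cluster)
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("inventory"))) {
            // Write one cell: row key, column family, column qualifier, value
            Put put = new Put(Bytes.toBytes("journal"));
            put.addColumn(Bytes.toBytes("stock"), Bytes.toBytes("qty"), Bytes.toBytes("25"));
            table.put(put);
            // Read the cell back by row key
            Result result = table.get(new Get(Bytes.toBytes("journal")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("stock"), Bytes.toBytes("qty"))));
        }
    }
}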
NoSQL
●
Unfortunate name
– Does not mean "no SQL"
– "Not Only" SQL
●
A new generation of database management
systems that is not based on the traditional
relational database model
NoSQL Examples
[Figure: examples of NoSQL database products]
Key Value Databases
[Figure: key-value database structure]
Document Databases
[Figure: document database structure]
MongoDB
●
Popular document database
– Among the NoSQL databases currently available, MongoDB has been
one of the most successful in penetrating the database market
●
The name MongoDB comes from the word "humongous," as its developers intended their new product to support extremely large data sets
– High availability
– High scalability
– High performance
MongoDB Uses JSON Documents
[Figure: a sample JSON document]
Mongo Commands
db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);
db.inventory.find( {} )    // equivalent SQL: SELECT * FROM inventory
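The same collection can be queried from application code. A minimal sketch using the MongoDB Java driver, assuming a local server and that the documents above live in a database named test; the filtered find() below corresponds to SELECT * FROM inventory WHERE status = 'A':

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class MongoQuery {
    public static void main(String[] args) {
        // Placeholder connection string and database name
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> inventory =
                    client.getDatabase("test").getCollection("inventory");
            // Find only the documents whose status field is "A"
            for (Document doc : inventory.find(Filters.eq("status", "A"))) {
                System.out.println(doc.toJson());
            }
        }
    }
}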
Column/Row Oriented Databases
[Figure: column-oriented versus row-oriented storage]
Graph Databases
[Figure: graph database structure]
Neo4j
●
Even though Neo4j is not yet as widely adopted as MongoDB, it has been one of the fastest-growing NoSQL databases
●
Graph databases still work with concepts similar to entities and relationships
– Focus is on the relationships
●
Graph databases are used in environments with complex relationships among
entities
– Heavily reliant on interdependence among their data
●
Neo4j provides several interface options
– Designed with Java programming in mind
Neo4j Commands
CREATE (rob:Person{name:'Roberto'}), (isidro:Person{name:'Isidro'}),
(tony:Person{name:'Antonio'}), (nora:Person{name:'Nora'}),
(lily:Person{name:'Lilian'}), (freddy:Person{name:'Alfredo'}),
(lucas:Person{name:'Lucas'}), (mau:Person{name:'Mauricio'}),
(alb:Person{name:'Albina'}), (reg:Person{name:'Regina'}),
(j:Person{name:'Joaquín'}), (julian:Person{name:'Julián'})
CREATE
(rob)-[:FriendsWith]->(isidro), (rob)-[:FriendsWith]->(tony), (rob)-[:FriendsWith]->(reg),
(rob)-[:FriendsWith]->(mau), (rob)-[:FriendsWith]->(julian),
(tony)-[:FriendsWith]->(reg), (tony)-[:FriendsWith]->(j),
(alb)-[:FriendsWith]->(reg), (lily)-[:FriendsWith]->(isidro), (lily)-[:FriendsWith]->(j),
(mau)-[:FriendsWith]->(lucas), (lucas)-[:FriendsWith]->(nora), (freddy)-[:FriendsWith]->(nora);
Neo4j Commands
MATCH friendships=()-[:FriendsWith]-()
RETURN friendships

MATCH friends=(a:Person{name:'Lucas'})-[:FriendsWith]-(friend)
RETURN friends
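These queries can also be issued from application code, in keeping with Neo4j's Java focus. A minimal sketch using the Neo4j Java driver; the Bolt URI and credentials are placeholders:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class Neo4jQuery {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password")); // placeholder credentials
             Session session = driver.session()) {
            // List the names of everyone Lucas has a FriendsWith relationship with
            Result result = session.run(
                    "MATCH (:Person{name:'Lucas'})-[:FriendsWith]-(friend) " +
                    "RETURN friend.name AS name");
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("name").asString());
            }
        }
    }
}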
NewSQL
●
Database model that attempts to provide ACID-
compliant transactions across a highly distributed
infrastructure
– Latest technologies to appear in the data
management area to address Big Data problems
– No proven track record
– Have been adopted by relatively few organizations
NewSQL
●
NewSQL databases support:
– SQL as the primary interface
– ACID-compliant transactions
●
Similar to NoSQL, NewSQL databases also support:
– Highly distributed clusters
– Key-value or column-oriented data stores