
Chapter 14

The document discusses Big Data and NoSQL, focusing on their significance in modern business and the characteristics that differentiate them from traditional databases. It covers the Hadoop framework, its components, and various NoSQL database types, including document and graph databases like MongoDB and Neo4j. Additionally, it introduces NewSQL as a database model that combines the benefits of SQL with the scalability of NoSQL.

COMP255

Chapter 14
Big Data and NoSQL

1
Learning Objectives

Explain the role of Big Data in modern business

Describe the primary characteristics of Big Data and
how these go beyond the traditional “3 Vs”

Explain how the core components of the Hadoop
framework operate

Identify the major components of the Hadoop
ecosystem

2
Learning Objectives

Summarize the four major approaches of the NoSQL
data model and how they differ from the relational model

Describe the characteristics of NewSQL databases

Understand how to work with document databases
using MongoDB

Understand how to work with graph databases using
Neo4j

3
Big Data: Definitions

Volume: quantity of data to be stored

Scaling up: keeping the same number of systems
but migrating each one to a larger system

Scaling out: when the workload exceeds server
capacity, it is spread out across a number of
servers
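
Scaling out can be sketched as spreading record keys across several servers. This is a minimal illustration, assuming a simple hash-modulo placement scheme; the server names and the `pick_server` and `distribute` helpers are hypothetical and not part of any specific product.

```python
import hashlib

def pick_server(key, servers):
    """Map a record key to one server using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def distribute(keys, servers):
    """Group keys by the server each one is assigned to."""
    placement = {server: [] for server in servers}
    for key in keys:
        placement[pick_server(key, servers)].append(key)
    return placement

# Hypothetical cluster of three servers sharing 1,000 records.
servers = ["server-a", "server-b", "server-c"]
placement = distribute([f"order-{i}" for i in range(1000)], servers)
```

Each server holds only part of the workload, so capacity can grow by adding servers rather than by replacing one machine with a larger one.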

4
Big Data: Definitions

Velocity: speed at which data enters the system
and must be processed

Stream processing: focuses on input processing
and requires analysis of the data stream as it enters the
system

Feedback loop processing: analysis of data to
produce actionable results
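
Stream processing can be sketched as a function that examines each value as it arrives instead of storing the whole stream first. This is a minimal sketch, assuming a made-up rule that a reading is a spike when it exceeds twice the running mean; the `flag_spikes` name and the `factor` parameter are hypothetical.

```python
def flag_spikes(readings, factor=2.0):
    """Yield (reading, is_spike) as each reading arrives; keeps only a running mean."""
    count = 0
    mean = 0.0
    for value in readings:
        # Compare against the mean of the readings seen so far.
        is_spike = count > 0 and value > factor * mean
        yield value, is_spike
        count += 1
        mean += (value - mean) / count  # incremental update of the running mean

results = list(flag_spikes([10.0, 12.0, 11.0, 40.0, 9.0]))
spikes = [value for value, is_spike in results if is_spike]
```

Only the count and the mean are kept in memory, which is what lets this style of analysis keep up with data arriving at high velocity.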

5
Feedback Loop Processing

6
Big Data: Definitions

Variety: variations in the structure of data to be
stored

Structured data: fits into a predefined data
model

Unstructured data: does not fit into a predefined
model

7
Big Data

Big Data generally refers to a set of data that
displays the characteristics of volume, velocity,
and variety (the 3 Vs) to an extent that makes
the data unsuitable for management by a
relational database management system.

8
Other Definitions

Variability: changes in meaning of data based on context

Sentiment analysis: attempts to determine whether a statement
conveys a positive, negative, or neutral attitude about a topic

Veracity: trustworthiness of data

Value: degree to which data can be analyzed for meaningful insight

Visualization: ability to graphically present data to make it
understandable
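
The idea behind sentiment analysis can be illustrated with a toy word-list scorer. This is only a sketch under strong assumptions: the word lists are invented for the example, and real systems use far richer language models.

```python
# Hypothetical word lists, invented for this example.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(statement):
    """Classify a statement as positive, negative, or neutral by counting listed words."""
    words = [word.strip(".,!?").lower() for word in statement.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [
    sentiment("I love this great phone!"),
    sentiment("Terrible battery, poor screen."),
    sentiment("The phone arrived on Tuesday."),
]
```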

9
Big Data: What to Do?

Use Hadoop

De facto standard for most Big Data storage
and processing

Java-based framework for distributing and
processing very large data sets across clusters
of computers

10
Hadoop Components

Hadoop Distributed File System (HDFS): low-level
distributed file processing system that can
be used directly for data storage

MapReduce: programming model that supports
processing large data sets

11
HDFS Characteristics

High volume: default block size is 64 MB (128 MB in Hadoop 2.x and later) and
can be configured to even larger values

Write-once, read-many: model simplifies concurrency issues and
improves data throughput

Streaming access: optimized for batch processing of entire files as a
continuous stream of data

Fault tolerance: designed to replicate data across many different
devices so that when one fails, data is still available from another
device
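
The block size and replication ideas above can be turned into simple arithmetic. This is a rough sketch, assuming the 64 MB block size quoted above and a replication factor of 3 (the usual HDFS default, which is configurable); the `block_plan` helper is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted above
REPLICATION = 3      # assumption: the usual HDFS default replication factor

def block_plan(file_size_mb):
    """Estimate how many blocks and block copies one file needs."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_copies": blocks * REPLICATION,
        "raw_storage_mb": file_size_mb * REPLICATION,
    }

plan = block_plan(1000)  # a hypothetical 1,000 MB file
```

For the 1,000 MB file, the last block is only partly filled, so the file needs 16 blocks, 48 block copies, and about 3,000 MB of raw storage across the cluster.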

12
HDFS

13
HDFS

Client Node: writes or accesses data

Name Node: holds metadata
– Which blocks are associated with which files
– Where the blocks are stored

Data Node: holds data
– Data is replicated over multiple data nodes

14
Adding a New File

Client node tells name node it wants to add a file

The name node...
– Adds the new file name to the metadata
– Determines new block numbers for the file
– Determines the list of data nodes on which the blocks will be stored
– Passes that information back to the client node

Client node sends blocks to data nodes

Data nodes write the data
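
The write steps above can be simulated in memory. This is a toy sketch, not Hadoop code: the `NameNode` class, the `write_file` function, the round-robin placement rule, and the tiny block size are all assumptions made for illustration.

```python
import itertools

class NameNode:
    """Toy name node: holds only metadata, never file contents."""

    def __init__(self, data_node_ids, replication=2):
        self.data_node_ids = list(data_node_ids)
        self.replication = replication
        self.files = {}             # file name -> list of block numbers
        self.block_locations = {}   # block number -> list of data node ids
        self._next_block = itertools.count(1)
        self._cursor = 0            # round-robin position (assumed placement rule)

    def add_file(self, name, block_count):
        """Record the file, assign block numbers, and choose data nodes for each block."""
        plan = []
        total = len(self.data_node_ids)
        for _ in range(block_count):
            block_id = next(self._next_block)
            nodes = [self.data_node_ids[(self._cursor + i) % total]
                     for i in range(self.replication)]
            self._cursor = (self._cursor + 1) % total
            self.block_locations[block_id] = nodes
            plan.append((block_id, nodes))
        self.files[name] = [block_id for block_id, _ in plan]
        return plan  # passed back to the client

def write_file(name, data, block_size, name_node, data_nodes):
    """Client side: split the data, ask the name node for a plan, send the blocks."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    plan = name_node.add_file(name, len(chunks))
    for chunk, (block_id, nodes) in zip(chunks, plan):
        for node_id in nodes:
            data_nodes[node_id][block_id] = chunk  # each data node writes its copy

data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}
name_node = NameNode(data_nodes, replication=2)
write_file("report.txt", b"abcdefghij", 4, name_node, data_nodes)
```

The name node ends up knowing which blocks make up `report.txt` and where each copy lives, while the bytes themselves sit only on the data nodes.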

15
Reading Data

Client node tells name node it wants to read a file

Name node returns blocks and data nodes where
the file is stored

Client node contacts closest data nodes on the
network for the data

Data nodes send data to client node
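
The read steps above can be sketched the same way. All values here are hypothetical: the block list and locations stand in for what a name node might return, and `DISTANCE` is a made-up measure of how far each data node is from the client.

```python
# Hypothetical metadata returned by the name node for one file.
BLOCKS = [1, 2]
BLOCK_LOCATIONS = {1: ["dn1", "dn2"], 2: ["dn2", "dn3"]}

# Hypothetical data node contents: node id -> {block number: bytes}.
DATA_NODES = {
    "dn1": {1: b"Big "},
    "dn2": {1: b"Big ", 2: b"Data"},
    "dn3": {2: b"Data"},
}

# Made-up network distance from the client to each data node.
DISTANCE = {"dn1": 4, "dn2": 2, "dn3": 0}

def read_file(blocks, locations, data_nodes, distance):
    """Fetch each block from the closest data node that stores it, in block order."""
    parts = []
    for block_id in blocks:
        closest = min(locations[block_id], key=lambda node: distance[node])
        parts.append(data_nodes[closest][block_id])
    return b"".join(parts)

content = read_file(BLOCKS, BLOCK_LOCATIONS, DATA_NODES, DISTANCE)
```

Because every block has more than one replica, the client can still assemble the file if its preferred data node for a block is unavailable.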

16
MapReduce

Framework used to process large data sets across clusters

Breaks down complex tasks into smaller subtasks, performs the
subtasks, and produces a final result

Map function takes a collection of data and sorts and filters it into a
set of key-value pairs
– Mapper program performs the map function

Reduce function summarizes the results of the map function to produce a single result
– Reducer program performs the reduce function
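
The map and reduce functions can be illustrated with a word count that runs on one machine. This is a sketch of the programming model only, assuming plain Python functions in place of Hadoop's Java API; the grouping step between map and reduce (here called `shuffle`) is handled by the framework in a real cluster.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce: summarize all values for one key into a single result."""
    return key, sum(values)

def map_reduce(lines):
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

counts = map_reduce(["big data big clusters", "data nodes store data"])
```

In a cluster, each mapper would run on the node that already stores its block of input, so the subtasks move to the data rather than the data moving to the program.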

17
More Than Just Hadoop

18
More Than Just Hadoop

Hive
– Data warehousing system that sits on top of HDFS
and supports its own SQL-like language

Pig
– Tool that compiles a high-level scripting language,
named Pig Latin, into MapReduce jobs for executing
in Hadoop

19
More Than Just Hadoop

Flume
– Component for ingesting data in Hadoop

Sqoop
– Tool for converting data back and forth between a
relational database and the HDFS

20
More Than Just Hadoop

HBase
– Column-oriented NoSQL database designed to sit
on top of the HDFS that quickly processes sparse
datasets

Impala
– The first SQL-on-Hadoop application

21
NoSQL

Unfortunate name
– Does not mean “No SQL”
– “Not Only” SQL

A new generation of database management
systems that is not based on the traditional
relational database model

22
NoSQL Examples

23
Key Value Databases

24
Document Databases

25
MongoDB

Popular document database
– Among the NoSQL databases currently available, MongoDB has been
one of the most successful in penetrating the database market

The name MongoDB comes from the word humongous, as its developers
intended their new product to support extremely large data sets
– High availability
– High scalability
– High performance

26
MongoDB Uses JSON Documents

27
Mongo Commands
db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);
db.inventory.find( {} )                 // SQL: SELECT * FROM inventory

db.inventory.find( { status: "D" } )    // SQL: SELECT * FROM inventory WHERE status = 'D'
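
How a query document selects matching documents can be sketched without a database server. This is a plain-Python imitation, not MongoDB's implementation: the `find` function below is hypothetical, handles equality on top-level fields only, and the `size` subdocuments are omitted for brevity.

```python
# The inventory documents from above, without the nested size field.
inventory = [
    {"item": "journal", "qty": 25, "status": "A"},
    {"item": "notebook", "qty": 50, "status": "A"},
    {"item": "paper", "qty": 100, "status": "D"},
    {"item": "planner", "qty": 75, "status": "D"},
    {"item": "postcard", "qty": 45, "status": "A"},
]

def find(collection, query):
    """Return documents whose fields equal every key/value in the query (equality only)."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in query.items())]

everything = find(inventory, {})                 # like SELECT * FROM inventory
discontinued = find(inventory, {"status": "D"})  # like ... WHERE status = 'D'
items = [doc["item"] for doc in discontinued]
```

An empty query document places no conditions on the fields, which is why it returns the whole collection.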

28
Column/Row Oriented Databases

29
Graph Databases

30
Neo4j

Even though Neo4j is not yet as widely adopted as MongoDB, it has been one
of the fastest growing NoSQL databases

Graph databases still work with concepts similar to entities and relationships
– Focus is on the relationships

Graph databases are used in environments with complex relationships among
entities
– Heavily reliant on interdependence among their data

Neo4j provides several interface options
– Designed with Java programming in mind

31
Neo4j Commands
CREATE (rob:Person{name:'Roberto'}), (isidro:Person{name:'Isidro'}),
(tony:Person{name:'Antonio'}), (nora:Person{name:'Nora'}),
(lily:Person{name:'Lilian'}), (freddy:Person{name:'Alfredo'}),
(lucas:Person{name:'Lucas'}), (mau:Person{name:'Mauricio'}),
(alb:Person{name:'Albina'}), (reg:Person{name:'Regina'}),
(j:Person{name:'Joaquín'}), (julian:Person{name:'Julián'})

CREATE
(rob)-[:FriendsWith]->(isidro), (rob)-[:FriendsWith]->(tony), (rob)-[:FriendsWith]->(reg),
(rob)-[:FriendsWith]->(mau), (rob)-[:FriendsWith]->(julian),
(tony)-[:FriendsWith]->(reg), (tony)-[:FriendsWith]->(j),
(alb)-[:FriendsWith]->(reg), (lily)-[:FriendsWith]->(isidro), (lily)-[:FriendsWith]->(j),
(mau)-[:FriendsWith]->(lucas), (lucas)-[:FriendsWith]->(nora), (freddy)-[:FriendsWith]->(nora);

32
Neo4j Commands

MATCH friendships=()-[:FriendsWith]-()
RETURN friendships

MATCH friends=(a:Person{name:'Lucas'})-[:FriendsWith]-(friend)
RETURN friends
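
The second query matches the relationship in either direction, which can be imitated with a list of edges. This is a plain-Python sketch, not how Neo4j stores or traverses graphs; the `friends_of` helper is hypothetical, and the edge list restates the FriendsWith relationships created earlier using each person's name property.

```python
# (from, to) pairs mirroring the FriendsWith relationships created earlier.
FRIENDSHIPS = [
    ("Roberto", "Isidro"), ("Roberto", "Antonio"), ("Roberto", "Regina"),
    ("Roberto", "Mauricio"), ("Roberto", "Julián"),
    ("Antonio", "Regina"), ("Antonio", "Joaquín"),
    ("Albina", "Regina"), ("Lilian", "Isidro"), ("Lilian", "Joaquín"),
    ("Mauricio", "Lucas"), ("Lucas", "Nora"), ("Alfredo", "Nora"),
]

def friends_of(name, edges):
    """Neighbors in either direction, like the undirected -[:FriendsWith]- pattern."""
    return sorted({b if a == name else a for a, b in edges if name in (a, b)})

lucas_friends = friends_of("Lucas", FRIENDSHIPS)
```

Lucas appears once as the target of a relationship (from Mauricio) and once as the source (to Nora), so ignoring direction finds both friends.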

33
NewSQL

Database model that attempts to provide ACID-compliant
transactions across a highly distributed
infrastructure
– Latest technologies to appear in the data
management area to address Big Data problems
– No proven track record
– Have been adopted by relatively few organizations

34
NewSQL

NewSQL databases support:
– SQL as the primary interface
– ACID-compliant transactions

Similar to NoSQL, NewSQL databases also support:
– Highly distributed clusters
– Key-value or column-oriented data stores
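
What an ACID-compliant transaction through a SQL interface looks like can be sketched with Python's built-in `sqlite3` module. SQLite is a single-node database and not a NewSQL system; it is used here only as a stand-in to show atomicity, and the `accounts` table and `transfer` function are hypothetical.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move an amount between accounts atomically; undo everything on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed together
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer credits one account before the debit fails, yet neither change survives. NewSQL systems aim to keep that all-or-nothing guarantee even when the rows involved live on different servers.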

35
