Unit III
Big Data Processing
✔ Big Data technologies
✔ Introduction to Google File System
✔ Hadoop Architecture
✔ Hadoop Storage: HDFS
✔ Common Hadoop Shell commands
✔ Anatomy of File Write and Read
✔ NameNode, Secondary NameNode, and DataNode
✔ Hadoop MapReduce paradigm
✔ MapReduce tasks, Job and Task Trackers
✔ Cluster Setup – SSH & Hadoop Configuration
✔ Introduction to NoSQL
✔ Textual ETL processing
Big Data Technologies
? Big data technologies are essential for more precise analysis, better operational efficiency, cost reduction, and reduced risk
? They are useful for processing huge volumes of data while preserving privacy and security
▪ Types of classes to handle Big Data:
1. Operational Big Data (NoSQL + Operational + Velocity)
2. Analytical Big Data (Hadoop + Analytical + Volume)
▪ Operational Big Data:
? Data that is produced by your organization's day-to-day operations
? It gives the most up-to-date information
? Operational data systems support high-volume Online Transaction Processing (OLTP) workloads, where you create, read, update, or delete one piece of data at a time
▪ Analytical Big Data:
? A little more complex, and will look different for different types of organizations
? It is used to make business decisions
? It includes business, market and customer data
Introduction To Google File System
? Google File System (GFS) is a scalable distributed file system (DFS) created by
Google
? GFS holds Google's huge data without placing extra load on applications
? Files are stored in hierarchical directories identified by path names
GFS features include:
1. Fault tolerance
2. Critical data replication
3. Automatic and efficient data recovery
4. High aggregate throughput
5. High availability
▪ Google File System Architecture:
? GFS is structured into clusters of computers
? Each cluster may contain a hundred to over a thousand machines
? Components of the architecture:
1. Client
2. Master Server
3. Chunk Server
1. Client
• Clients can be other computers or computer applications that make file requests
• Requests can be retrieving or manipulating existing files, or creating new files
2. Master Server
• Maintains an operation log that keeps track of the activities of the cluster
• Also keeps track of the metadata, which describes the chunks
3. Chunk Server
• Stores 64-MB file chunks and sends requested chunks directly to the client
• GFS copies every chunk multiple times and stores the copies on different chunk servers; each copy is called a replica
• Advantages
? It reduces the clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk-location information
? It reduces network overhead: a single chunk can serve many operations
• Disadvantages
? Lazy space allocation
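The chunk-read path described under the advantages above (one metadata request to the master, then data transferred directly from a chunk-server replica) can be sketched roughly as below. GFS itself is proprietary, so every interface and name here is hypothetical; this is only an illustration of the protocol, written in Java.

    import java.util.List;

    // Hypothetical sketch of a GFS-style read (illustration only, not Google's API).
    interface ChunkServer {
        byte[] read(String chunkHandle, long offset, int length);
    }

    // The master answers metadata queries: which chunk handle and which replicas
    // hold a given chunk index of a file.
    interface MasterServer {
        ChunkLocation lookup(String path, long chunkIndex);
    }

    record ChunkLocation(String chunkHandle, List<ChunkServer> replicas) {}

    class GfsClient {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64-MB chunks

        private final MasterServer master;

        GfsClient(MasterServer master) { this.master = master; }

        byte[] read(String path, long fileOffset, int length) {
            long chunkIndex  = fileOffset / CHUNK_SIZE;      // which chunk holds the bytes
            long chunkOffset = fileOffset % CHUNK_SIZE;
            // One initial request to the master for chunk-location information...
            ChunkLocation loc = master.lookup(path, chunkIndex);
            // ...then the data is read directly from a chunk-server replica;
            // the master is not involved in the data transfer.
            return loc.replicas().get(0).read(loc.chunkHandle(), chunkOffset, length);
        }
    }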
Hadoop
Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models.
? Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
? It was originally developed to support distribution for the Nutch search engine project.
? In 2008 Yahoo released Hadoop as an open-source project; it is now managed by the Apache Software Foundation (ASF)
? It can process thousands of terabytes of data in parallel
? It uses a distributed file system (DFS) that provides rapid data transfer among nodes
? The system keeps working even if one or more nodes fail
▪ Hadoop Architecture:
? Hadoop Architecture contains 4 components:
1. Hadoop Common
2. MapReduce
3. YARN framework
4. HDFS
Hadoop Storage : HDFS
? Hadoop storage holds huge data in a distributed computing environment
? HDFS is the distributed file system provided by Hadoop for analyzing and transforming huge data sets using the MapReduce framework
? HDFS supports the rapid transfer of data between compute nodes
? HDFS breaks information down into separate blocks and distributes them to different nodes in a cluster
? HDFS uses a master/slave architecture; this master-node "data chunking" design takes elements of the Google File System (GFS) as its guide
HDFS Architecture
• NameNode: manages the file system metadata
• DataNode: stores the actual data
• File contents are divided into blocks and replicated across the DataNodes
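As a small, hedged illustration of this block placement, the sketch below uses Hadoop's Java FileSystem API to list the blocks of a file and the DataNodes holding each replica (the file path and cluster configuration are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists the blocks of an HDFS file and the DataNodes that hold each replica.
    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/input.txt");   // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }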
Hadoop Storage: HDFS – Advantages and Disadvantages
Advantages:
• High scalability
• Low limitations
• Open source
• Low cost
Disadvantages:
• Restrictive programming model
• Cluster management
Common Hadoop Shell Commands
ls <path>                          List the contents of a directory
mv <src> <dest>                    Move a file or directory
cp <src> <dest>                    Copy a file or directory
rm <path>                          Remove a file or directory
put <localsrc> <dest>              Copy a file from the local file system to HDFS
copyFromLocal <localsrc> <dest>    Copy a file from the local file system to HDFS (similar to put)
chown [-R] [owner][:group] <path>  Change the owner of a file or directory
cat <filename>                     Print the contents of a file
mkdir <path>                       Create a directory
chmod [-R] <mode> <path>           Change file permissions
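The same operations are also available programmatically through Hadoop's Java FileSystem API. A minimal sketch, assuming a configured cluster and hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Programmatic equivalents of a few of the shell commands above.
    public class HdfsShellEquivalents {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/demo/data"));                               // mkdir
            fs.copyFromLocalFile(new Path("/tmp/report.csv"),
                                 new Path("/user/demo/data/report.csv"));         // put / copyFromLocal
            fs.rename(new Path("/user/demo/data/report.csv"),
                      new Path("/user/demo/data/report-old.csv"));                // mv
            for (FileStatus st : fs.listStatus(new Path("/user/demo/data"))) {    // ls
                System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
            }
            fs.delete(new Path("/user/demo/data/report-old.csv"), false);         // rm
            fs.close();
        }
    }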
Anatomy of File Write and Read
? Three types of nodes work in an HDFS master/slave cluster:
1. NameNode
2. Secondary NameNode
3. DataNode
1. NameNode
? It is the centerpiece of HDFS; it manages information about the file system tree, which contains the metadata about all the files and directories
? Metadata stored: file name, file path, number of blocks, block IDs, replication level
? It uses two files for storing this metadata information: 1) FsImage 2) EditLog
? It keeps the locations of the DataNodes that store the blocks in memory
2. Secondary Namenode
? It is not a backup NameNode server
? It gets the latest FsImage and EditLog files from the primary NameNode
? It applies each transaction from the EditLog file to the FsImage to create a new merged FsImage file
? The merged FsImage file is transferred back to the primary NameNode
3. DataNode
? Data blocks of the files are stored in
a set of DataNodes
? DataNodes are responsible for
serving read and write requests
from the file system’s clients.
? The DataNodes store blocks, delete
blocks and replicate those blocks
upon instructions from the
NameNode.
Anatomy of File Write and Read
? Write: the client asks the NameNode to create the file, then writes the data block by block through a pipeline of DataNodes chosen by the NameNode; when the client closes the file, the NameNode commits it
? Read: the client asks the NameNode for the block locations of the file, then reads each block directly from the nearest DataNode that holds a replica
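A minimal sketch of this write and read anatomy using Hadoop's Java FileSystem API (the path and file contents are hypothetical); the comments mark where the NameNode and DataNodes take part:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteAndRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/notes.txt");   // hypothetical path

            // WRITE: create() asks the NameNode to add the file to the namespace;
            // the returned stream pipelines each block to a chain of DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }   // close() flushes the last block and the NameNode commits the file

            // READ: open() fetches the block locations from the NameNode;
            // the stream then reads each block directly from a DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
                in.readFully(buf);
                System.out.println(new String(buf, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }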
Hadoop MapReduce Paradigm
Hadoop MapReduce is a software framework for distributed processing
of large data sets on computing clusters.
Daemon services of Hadoop:
1. NameNode
2. Secondary NameNode
3. JobTracker
4. DataNode
5. TaskTracker
Job Tracker
? The JobTracker is the service that receives client requests and tries to assign the tasks to TaskTrackers
? Job requests from the client are received by the JobTracker, which uses the NameNode to determine the location of the required data
? The JobTracker updates its status when the job completes
Task Tracker
? The TaskTracker performs its tasks while being closely monitored by the JobTracker
? A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a JobTracker
? The TaskTracker monitors these spawned processes, capturing their output and exit codes
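The classic WordCount program below shows the map and reduce tasks that the JobTracker schedules and the TaskTrackers run. It follows the standard example from the Hadoop documentation and uses the current org.apache.hadoop.mapreduce API; the input and output paths are passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce task: sum the counts for each word after the shuffle.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }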
Introduction to NoSQL
? NoSQL originally refers to "Not only SQL" or non-relational databases
? In 1998 Carlo Strozzi first introduced a lightweight, open-source relational database and named it NoSQL; the name was later reused for databases that are "not only relational"
When should NoSQL be used:
? When a huge amount of data needs to be stored and retrieved
? When the relationships between the data you store are not that important
? When the data changes over time and is not structured
? When support for constraints and joins is not required at the database level
? When the data is growing continuously and you need to scale the database regularly to handle it
? NoSQL is used to handle big data and real-time web applications
? NoSQL concentrates on availability, partition tolerance and speed
? It is horizontally scalable because data is stored as key-value pairs
? It also stores data as documents and graphs
? NoSQL market leaders include MongoDB, DataStax and MarkLogic
? It is schema-less, open source, and runs well on clusters
? NoSQL has data distribution and auto-repair capabilities and simplified data models, so less hands-on management is required
❖ Types of NoSQL databases
• Key-value store: Memcached, Redis, Coherence
• Tabular: HBase, BigTable, Accumulo
• Document based: MongoDB, CouchDB, Cloudant
❖ Companies using NoSQL
• Google
• Facebook
• LinkedIn
• Mozilla
❖ Advantages of NoSQL
1. Data storage
2. Support for unstructured data
3. Handles change over time
4. Support for multiple data structures
5. Big data applications
6. Ability to scale horizontally
7. Less database administration
8. Low cost
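As a hedged illustration of the document model, the sketch below stores and retrieves one schema-less document using the MongoDB Java driver; the connection string, database and collection names are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    // Stores and retrieves a schema-less JSON-like document in MongoDB.
    public class NoSqlExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");

                // No table definition is needed; each document carries its own structure.
                users.insertOne(new Document("name", "Asha")
                        .append("city", "Pune")
                        .append("interests", java.util.List.of("hadoop", "nosql")));

                Document found = users.find(Filters.eq("name", "Asha")).first();
                System.out.println(found.toJson());
            }
        }
    }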
❑ Difference between SQL and NoSQL
• SQL databases are relational, store data in tables with a fixed schema, scale vertically, and provide ACID transactions through a standard query language (SQL)
• NoSQL databases are non-relational, store data as key-value pairs, documents, columns or graphs with a dynamic schema, scale horizontally, and typically trade strict consistency for availability and partition tolerance
Textual ETL Processing
ETL is defined as a process that extracts the data from different RDBMS source systems, then
transforms the data (like applying calculations, concatenations, etc.) and finally loads the data
into the Data Warehouse system.
Why do you need ETL?
? It helps companies to analyze their business data
? Transactional databases cannot answer complex business questions
? ETL moves the data from various sources into a data warehouse
? As data sources change, the data warehouse will automatically update
? A well-designed and documented ETL system is almost essential to the success of a data warehouse project
? It allows verification of data transformation, aggregation and calculation rules
? ETL allows data comparison between the source and target systems
? The ETL process can perform complex transformations, and requires an extra (staging) area to store the data
? ETL helps to migrate data into a data warehouse and convert it to various formats
1. Structured ETL
? Used to convert data from corporate and legacy applications into a uniform, corporate structure
? It is responsible for formatting, data integration, transformation, encoding and so on
• An example of ETL processing is as follows:
? Data representing gender is encoded in the input data in the form of
(male/female), (m/f), (x/y), and (1/0) from different applications across the
enterprise. Once processed, the output for gender is converted and specified
simply as (m/f).
? Dimensions may include lengths measured in inches, centimeters, or feet. As output of ETL, the data is converted so that length is measured uniformly (for example, in centimeters).
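A minimal sketch of these two transformation rules in Java; the class and method names are hypothetical, and the mapping chosen for the (x/y) and (1/0) source encodings is an assumption:

    import java.util.Map;

    // Normalizes the gender encodings and length units described in the example above.
    public class EtlNormalizer {
        private static final Map<String, String> GENDER_MAP = Map.of(
            "male", "m", "female", "f",
            "m", "m", "f", "f",
            "x", "m", "y", "f",   // assumed mapping for the (x/y) source system
            "1", "m", "0", "f"    // assumed mapping for the (1/0) source system
        );

        // Gender rule: every source encoding becomes simply (m/f).
        public static String normalizeGender(String raw) {
            return GENDER_MAP.getOrDefault(raw.trim().toLowerCase(), "unknown");
        }

        // Unit rule: every length is converted uniformly to centimeters.
        public static double toCentimeters(double value, String unit) {
            switch (unit) {
                case "in": return value * 2.54;
                case "ft": return value * 30.48;
                case "cm": return value;
                default:   throw new IllegalArgumentException("unknown unit: " + unit);
            }
        }
    }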
2. Unstructured Data
? Textual data comes in many forms and from many places.
? Forms of textual data include emails of different types; corporate contracts with multiple vendors, employees, customers and more; human resource files; medical records; financial reports; and corporate memos.
? Textual ETL is a multi-step process that guides a business user to define the rules for processing any form of unstructured data.
? It uses technologies such as Hadoop, MapReduce, Ruby and NoSQL.
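As a small illustration of such a rule, the hypothetical sketch below pulls email addresses out of free text, the kind of extraction rule a textual ETL tool lets a business user define; the regex and class name are simplified assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // A tiny textual-ETL rule: extract email addresses from unstructured text.
    public class EmailRule {
        private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

        public static List<String> apply(String rawText) {
            List<String> hits = new ArrayList<>();
            Matcher m = EMAIL.matcher(rawText);
            while (m.find()) {
                hits.add(m.group());    // each match becomes a structured value
            }
            return hits;
        }
    }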