Unit III
Big Data Processing
✔ Big Data technologies
✔ Introduction to Google File System
✔ Hadoop Architecture
✔ Hadoop Storage: HDFS
✔ Common Hadoop Shell commands
✔ Anatomy of File Write and Read
✔ NameNode, Secondary NameNode, and DataNode
✔ Hadoop MapReduce paradigm
✔ MapReduce tasks, Job and Task Trackers
✔ Cluster Setup – SSH & Hadoop Configuration
✔ Introduction to NoSQL
✔ Textual ETL processing
Big Data Technologies
? Big data technologies are essential for more precise analysis, better operational efficiency, cost reduction, and reduced risk
? They are useful for processing huge volumes of data while preserving privacy and security
▪ Types of classes to handle Big Data:
1. Operational Big Data (NoSQL + Operational + Velocity)
2. Analytical Big Data (Hadoop + Analytical + Volume)
▪ Operational Big Data:
? Data that is produced by your organization's day-to-day operations
? It gives the most up-to-date information
? Operational data systems support high-volume Online Transaction Processing (OLTP) workloads, where you create, read, update, or delete one piece of data at a time
▪ Analytical Big Data:
? A little more complex, and will look different for different types of organizations
? It is used to make business decisions
? It includes business, market and customer data
Introduction To Google File System
? Google File System (GFS) is a scalable distributed file system (DFS) created by
Google
? GFS holds Google's huge data without placing extra load on applications
? Files are stored in hierarchical directories identified by path names
GFS features include:
1. Fault tolerance
2. Critical data replication
3. Automatic and efficient data recovery
4. High aggregate throughput
5. High availability
▪ Google File System Architecture:
? GFS is structured into clusters of computers
? Each cluster may contain a hundred to over a thousand machines
? Components of the architecture:
1. Client
2. Master Server
3. Chunk Server
1. Client
• Clients can be other computers or computer applications that make file requests
• Requests can be retrieving or manipulating existing files, or creating new files
2. Master Server
• Maintains an operation log that keeps track of the activities of the cluster
• Also keeps track of the metadata, which describes the chunks
3. Chunk Server
• Stores 64-MB file chunks and sends requested chunks directly to the client
• GFS copies every chunk multiple times and stores the copies on different chunk servers; each copy is called a replica
• Advantages
? It reduces the clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk-location information
? It reduces network overhead: a single chunk can serve many operations
• Disadvantages
? Lazy space allocation
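The chunk-read path described under the advantages above (one metadata request to the master, then data transferred directly from a chunk-server replica) can be sketched roughly as below. GFS itself is proprietary, so every interface and name here is hypothetical; this is only an illustration of the protocol, written in Java.

    import java.util.List;

    // Hypothetical sketch of a GFS-style read (illustration only, not Google's API).
    interface ChunkServer {
        byte[] read(String chunkHandle, long offset, int length);
    }

    // The master answers metadata queries: which chunk handle and which replicas
    // hold a given chunk index of a file.
    interface MasterServer {
        ChunkLocation lookup(String path, long chunkIndex);
    }

    record ChunkLocation(String chunkHandle, List<ChunkServer> replicas) {}

    class GfsClient {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64-MB chunks

        private final MasterServer master;

        GfsClient(MasterServer master) { this.master = master; }

        byte[] read(String path, long fileOffset, int length) {
            long chunkIndex  = fileOffset / CHUNK_SIZE;      // which chunk holds the bytes
            long chunkOffset = fileOffset % CHUNK_SIZE;
            // One initial request to the master for chunk-location information...
            ChunkLocation loc = master.lookup(path, chunkIndex);
            // ...then the data is read directly from a chunk-server replica;
            // the master is not involved in the data transfer.
            return loc.replicas().get(0).read(loc.chunkHandle(), chunkOffset, length);
        }
    }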
Hadoop
Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models.
? Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
? It was originally developed to support distribution for the Nutch search engine project.
? In 2008 Yahoo released Hadoop as an open-source project; it is now managed by the Apache Software Foundation (ASF)
? It can process thousands of terabytes of data in parallel
? It uses a distributed file system (DFS) that provides rapid data transfer among nodes
? The system keeps working even if one or more nodes fail
▪ Hadoop Architecture:
? Hadoop Architecture contains 4 components:
1. Hadoop Common
2. MapReduce
3. YARN framework
4. HDFS
Hadoop Storage : HDFS
? Hadoop storage holds huge data in a distributed computing environment
? HDFS is the distributed file system provided by Hadoop for analyzing and transforming huge data sets using the MapReduce framework
? HDFS supports the rapid transfer of data between compute nodes
? HDFS breaks information down into separate blocks and distributes them to different nodes in a cluster
? HDFS uses a master/slave architecture; this master-node "data chunking" design takes elements of the Google File System (GFS) as its guide
HDFS Architecture
• NameNode: manages the file system metadata
• DataNode: stores the actual data
• File contents are divided into blocks and replicated across the DataNodes
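As a small, hedged illustration of this block placement, the sketch below uses Hadoop's Java FileSystem API to list the blocks of a file and the DataNodes holding each replica (the file path and cluster configuration are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists the blocks of an HDFS file and the DataNodes that hold each replica.
    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/input.txt");   // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }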
Hadoop Storage: HDFS – Advantages and Disadvantages
Advantages:
• High scalability
• Low limitations
• Open source
• Low cost
Disadvantages:
• Restrictive programming model
• Cluster management
Common Hadoop Shell Commands
ls <path>                          List the contents of a directory
mv <src> <dest>                    Move a file or directory
cp <src> <dest>                    Copy a file or directory
rm <path>                          Remove a file or directory
put <localsrc> <dest>              Copy a file from the local file system to HDFS
copyFromLocal <localsrc> <dest>    Copy a file from the local file system to HDFS (similar to put)
chown [-R] [owner][:group] <path>  Change the owner of a file or directory
cat <filename>                     Print the contents of a file
mkdir <path>                       Create a directory
chmod [-R] <mode> <path>           Change file permissions
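The same operations are also available programmatically through Hadoop's Java FileSystem API. A minimal sketch, assuming a configured cluster and hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Programmatic equivalents of a few of the shell commands above.
    public class HdfsShellEquivalents {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/demo/data"));                               // mkdir
            fs.copyFromLocalFile(new Path("/tmp/report.csv"),
                                 new Path("/user/demo/data/report.csv"));         // put / copyFromLocal
            fs.rename(new Path("/user/demo/data/report.csv"),
                      new Path("/user/demo/data/report-old.csv"));                // mv
            for (FileStatus st : fs.listStatus(new Path("/user/demo/data"))) {    // ls
                System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
            }
            fs.delete(new Path("/user/demo/data/report-old.csv"), false);         // rm
            fs.close();
        }
    }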
Anatomy of File Write and Read
? Three types of nodes work in an HDFS master/slave cluster:
1. NameNode
2. Secondary NameNode
3. DataNode
1. NameNode
? It is the centerpiece of HDFS; it manages information about the file system tree, which contains the metadata about all the files and directories
? Metadata stored: file name, file path, number of blocks, block IDs, replication level
? It uses two files for storing this metadata information: 1) FsImage 2) EditLog
? It keeps the locations of the DataNodes that store the blocks in memory
2. Secondary Namenode
? It is not a backup NameNode server
? It gets the latest FsImage and EditLog files from the primary NameNode
? It applies each transaction from the EditLog file to the FsImage to create a new merged FsImage file
? The merged FsImage file is transferred back to the primary NameNode
3. DataNode
? Data blocks of the files are stored in
a set of DataNodes
? DataNodes are responsible for
serving read and write requests
from the file system’s clients.
? The DataNodes store blocks, delete
blocks and replicate those blocks
upon instructions from the
NameNode.
Anatomy of File Write and Read
? Write: the client asks the NameNode to create the file, then writes the data block by block through a pipeline of DataNodes chosen by the NameNode; when the client closes the file, the NameNode commits it
? Read: the client asks the NameNode for the block locations of the file, then reads each block directly from the nearest DataNode that holds a replica
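A minimal sketch of this write and read anatomy using Hadoop's Java FileSystem API (the path and file contents are hypothetical); the comments mark where the NameNode and DataNodes take part:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteAndRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/notes.txt");   // hypothetical path

            // WRITE: create() asks the NameNode to add the file to the namespace;
            // the returned stream pipelines each block to a chain of DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }   // close() flushes the last block and the NameNode commits the file

            // READ: open() fetches the block locations from the NameNode;
            // the stream then reads each block directly from a DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
                in.readFully(buf);
                System.out.println(new String(buf, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }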
Hadoop MapReduce Paradigm
Hadoop MapReduce is a software framework for distributed processing
of large data sets on computing clusters.
Daemon services of Hadoop:
1. NameNode
2. Secondary NameNode
3. JobTracker
4. DataNode
5. TaskTracker
Job Tracker
? The JobTracker is the service that receives client requests and tries to assign the tasks to TaskTrackers
? Job requests from the client are received by the JobTracker, which uses the NameNode to determine the location of the required data
? The JobTracker updates its status when the job completes
Task Tracker
? The TaskTracker performs its tasks while being closely monitored by the JobTracker
? A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a JobTracker
? The TaskTracker monitors these spawned processes, capturing their output and exit codes
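The classic WordCount program below shows the map and reduce tasks that the JobTracker schedules and the TaskTrackers run. It follows the standard example from the Hadoop documentation and uses the current org.apache.hadoop.mapreduce API; the input and output paths are passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce task: sum the counts for each word after the shuffle.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }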
Introduction to NoSQL
? NoSQL originally refers to "Not only SQL" or non-relational databases
? In 1998 Carlo Strozzi first introduced a lightweight, open-source relational database and named it NoSQL; the name was later reused for databases that are "not only relational"
When should NoSQL be used:
? When a huge amount of data needs to be stored and retrieved
? When the relationships between the data you store are not that important
? When the data changes over time and is not structured
? When support for constraints and joins is not required at the database level
? When the data is growing continuously and you need to scale the database regularly to handle it
? NoSQL is used to handle big data and real-time web applications
? NoSQL concentrates on availability, partition tolerance and speed
? It is horizontally scalable because data is stored as key-value pairs
? It also stores data as documents and graphs
? NoSQL market leaders include MongoDB, DataStax and MarkLogic
? It is schema-less, open source, and runs well on clusters
? NoSQL has data distribution and auto-repair capabilities and simplified data models, so less hands-on management is required
❖ Types of NoSQL databases
• Key-value store: Memcached, Redis, Coherence
• Tabular: HBase, BigTable, Accumulo
• Document based: MongoDB, CouchDB, Cloudant
❖ Companies using NoSQL
• Google
• Facebook
• LinkedIn
• Mozilla
❖ Advantages of NoSQL
1. Data storage
2. Support for unstructured data
3. Handles change over time
4. Support for multiple data structures
5. Big data applications
6. Ability to scale horizontally
7. Less database administration
8. Low cost
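As a hedged illustration of the document model, the sketch below stores and retrieves one schema-less document using the MongoDB Java driver; the connection string, database and collection names are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    // Stores and retrieves a schema-less JSON-like document in MongoDB.
    public class NoSqlExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");

                // No table definition is needed; each document carries its own structure.
                users.insertOne(new Document("name", "Asha")
                        .append("city", "Pune")
                        .append("interests", java.util.List.of("hadoop", "nosql")));

                Document found = users.find(Filters.eq("name", "Asha")).first();
                System.out.println(found.toJson());
            }
        }
    }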
❑ Difference between SQL and NoSQL
• SQL databases are relational, store data in tables with a fixed schema, scale vertically, and provide ACID transactions through a standard query language (SQL)
• NoSQL databases are non-relational, store data as key-value pairs, documents, columns or graphs with a dynamic schema, scale horizontally, and typically trade strict consistency for availability and partition tolerance
Textual ETL Processing
ETL is defined as a process that extracts the data from different RDBMS source systems, then
transforms the data (like applying calculations, concatenations, etc.) and finally loads the data
into the Data Warehouse system.
Why do you need ETL?
? It helps companies to analyze their business data
? Transactional databases cannot answer complex business questions
? ETL moves the data from various sources into a data warehouse
? As data sources change, the data warehouse will automatically update
? A well-designed and documented ETL system is almost essential to the success of a data warehouse project
? It allows verification of data transformation, aggregation and calculation rules
? ETL allows data comparison between the source and target systems
? The ETL process can perform complex transformations, and requires an extra (staging) area to store the data
? ETL helps to migrate data into a data warehouse and convert it to various formats
1. Structured ETL
? Used to convert data from corporate and legacy applications into a uniform, corporate structure
? It is responsible for formatting, data integration, transformation, encoding and so on
• An example of ETL processing is as follows:
? Data representing gender is encoded in the input data in the form of
(male/female), (m/f), (x/y), and (1/0) from different applications across the
enterprise. Once processed, the output for gender is converted and specified
simply as (m/f).
? Dimensions may include lengths measured in inches, centimeters, or feet. As output of ETL, the data is converted so that length is measured uniformly (for example, in centimeters).
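A minimal sketch of these two transformation rules in Java; the class and method names are hypothetical, and the mapping chosen for the (x/y) and (1/0) source encodings is an assumption:

    import java.util.Map;

    // Normalizes the gender encodings and length units described in the example above.
    public class EtlNormalizer {
        private static final Map<String, String> GENDER_MAP = Map.of(
            "male", "m", "female", "f",
            "m", "m", "f", "f",
            "x", "m", "y", "f",   // assumed mapping for the (x/y) source system
            "1", "m", "0", "f"    // assumed mapping for the (1/0) source system
        );

        // Gender rule: every source encoding becomes simply (m/f).
        public static String normalizeGender(String raw) {
            return GENDER_MAP.getOrDefault(raw.trim().toLowerCase(), "unknown");
        }

        // Unit rule: every length is converted uniformly to centimeters.
        public static double toCentimeters(double value, String unit) {
            switch (unit) {
                case "in": return value * 2.54;
                case "ft": return value * 30.48;
                case "cm": return value;
                default:   throw new IllegalArgumentException("unknown unit: " + unit);
            }
        }
    }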
2. Unstructured Data
? Textual data comes in many forms and from many places.
? Forms of textual data include emails of different types; corporate contracts with multiple vendors, employees, customers and more; human resource files; medical records; financial reports; and corporate memos.
? Textual ETL is a multi-step process that guides a business user to define the rules for processing any form of unstructured data.
? It uses technologies such as Hadoop, MapReduce, Ruby and NoSQL.
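As a small illustration of such a rule, the hypothetical sketch below pulls email addresses out of free text, the kind of extraction rule a textual ETL tool lets a business user define; the regex and class name are simplified assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // A tiny textual-ETL rule: extract email addresses from unstructured text.
    public class EmailRule {
        private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

        public static List<String> apply(String rawText) {
            List<String> hits = new ArrayList<>();
            Matcher m = EMAIL.matcher(rawText);
            while (m.find()) {
                hits.add(m.group());    // each match becomes a structured value
            }
            return hits;
        }
    }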