Distributed Storage & Horizontal Scalability: Increasing the Number of Systems Operating in Parallel

This document provides an overview of distributed storage and processing using Hadoop. It describes key components: HDFS for distributed storage, MapReduce for parallel processing, Hive and Pig for analytics, Sqoop and Flume for data import/export, and HBase as the ecosystem's native NoSQL database. It also covers Oozie, the workflow scheduler for Hadoop processes.

Distributed storage & horizontal scalability

Scaling out: increasing the number of systems and operating them in parallel.

Vertical scalability

Scaling up: increasing the disk size and RAM of a single system.


HDFS -> used for storage -> a distributed file system -> built from one NameNode [master] and multiple DataNodes [slaves], scaled to the data size -> HDFS fault tolerance depends on two factors: the replication factor and the block size -> the block size defaults to 64 MB (in classic Hadoop 1.x) -> total blocks required = ceil(file size / block size) -> each block is then distributed according to the replication factor, i.e. the same block is replicated across N DataNodes, where N is the replication factor. A worked example follows below.
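As a worked example with hypothetical numbers, take a 200 MB file stored with the default 64 MB block size and a replication factor of 3. A minimal Java sketch of the arithmetic:

public class HdfsBlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 200;     // hypothetical file size
        long blockSizeMb = 64;     // classic default HDFS block size
        int replication = 3;       // common HDFS replication factor

        // Blocks are whole units, so round up: ceil(200 / 64) = 4 blocks
        // (three full 64 MB blocks plus one 8 MB block).
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;

        // Each block is copied to N DataNodes: 4 * 3 = 12 stored block copies.
        long storedCopies = blocks * replication;

        System.out.println("blocks = " + blocks + ", stored copies = " + storedCopies);
    }
}

Note the ceiling: plain integer division of 200/64 would report 3 blocks and silently drop the 8 MB remainder.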
MAP REDUCE -> native support for Java -> the framework used for processing data -> a Mapper & Reducer combination -> the Mapper performs parallel processing of the instructions input to the MapReduce framework -> the framework distributes the instruction set among the DataNodes for parallel processing -> the Reducer collects the results of the parallel processing from the different DataNodes and aggregates (merges) them. A minimal sketch follows below.
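A minimal word-count sketch in Java (the canonical MapReduce example) showing the Mapper/Reducer split; input and output paths are supplied as command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs in parallel on each DataNode's split of the input,
    // emitting (word, 1) for every token it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: receives all counts for one word, merged from every DataNode,
    // and aggregates them into a single total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The combiner reuses the reducer class to pre-aggregate counts on each DataNode before the shuffle, which cuts the volume of data sent across the network to the reducers.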

HIVE -> SQL-style query support for analytics (HiveQL queries are compiled into MapReduce jobs); see the JDBC sketch below
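A minimal sketch of issuing a Hive query from Java over JDBC, assuming a HiveServer2 instance on localhost:10000 and a hypothetical words table (host, port, credentials, and table name are all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (ships with Hive).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical connection details; adjust to your cluster.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();

        // Plain SQL: Hive compiles this into a distributed job behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
        while (rs.next()) {
            System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
        }
        conn.close();
    }
}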

PIG -> a dataflow scripting language for analytics with support for user-defined functions (UDFs); see the UDF sketch below
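A minimal sketch of a Pig user-defined function in Java using the standard EvalFunc API; the class name UpperCase and its use case are hypothetical. In a Pig script it would be registered with REGISTER and then invoked like a built-in function:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases its first argument.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // Pig treats null as missing data
        }
        return input.get(0).toString().toUpperCase();
    }
}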

SQOOP -> for importing/exporting data between DBMS/RDBMS systems and HDFS

FLUME -> for importing streaming data (e.g. log events) into HDFS

HBASE -> a NoSQL database -> column-oriented storage -> the database that runs natively on top of HDFS in the Hadoop stack; see the client sketch below
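A minimal sketch of the HBase Java client API writing and reading one cell; the users table, its info column family, and the row key are hypothetical, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Asha"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}

Rows are addressed by a byte[] row key and every cell lives under a column family, which matches the column-oriented storage model noted above.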

APACHE OOZIE -> a workflow scheduler that controls and chains Hadoop processes (e.g. MapReduce, Hive, Pig, and Sqoop jobs)

Overview of the Hadoop ecosystem
