Hadoop Ecosystem

The document provides an introduction to the Hadoop ecosystem, which comprises the core Hadoop projects of HDFS, MapReduce, YARN as well as several related Apache projects that build upon Hadoop's capabilities. It describes the architecture and purpose of HDFS, MapReduce, YARN and highlights some additional projects like Avro, BigTop, Chukwa, Drill, and Flume. The ecosystem is complex with many interrelated projects, but they aim to leverage Hadoop's scalability through distributed storage and processing.

Introduction to the Hadoop Software Ecosystem

Via A. Griffins
When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.
Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.
The following sections are meant to give the reader a brief introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible to lay users, but the ones here were chosen because they provide core functionality and speed in Hadoop.

The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parcelled into neat categories. Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.

Additional Links
Cloudstory.com: 3-part series on the Hadoop ecosystem
Part 1
Part 2
Part 3

HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files. HDFS is designed to be fault-tolerant due to data replication and distribution of data. When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data, which are stored across the cluster nodes designated for storage, a.k.a. DataNodes.

Via Paul Krzyzanowski

At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be processing data. When data is loaded into HDFS, the data is replicated and split into blocks that are distributed across the DataNodes. The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.
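As a rough illustration of the block mechanics described above, the following sketch splits a file into fixed-size blocks and assigns replicas to DataNodes. The 64 MB block size and replication factor of 3 reflect Hadoop 1.x defaults; the round-robin placement and the node names ("dn1", etc.) are simplifications for illustration, not HDFS's actual rack-aware placement policy.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # Hadoop 1.x default block size (64 MB)
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, datanodes):
    """Split a file into blocks and assign each block's replicas to
    DataNodes round-robin (a toy stand-in for HDFS's real placement)."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

# A 200 MB file on a 4-node cluster -> 4 blocks, 3 replicas each
plan = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

The NameNode's metadata is essentially this `plan` mapping: which blocks make up each file, and which DataNodes hold each block's replicas.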

HDFS Architecture
Via Computer Technology Review
One significant drawback to HDFS is that it has a single point of failure (SPOF), which lies in the NameNode service. If the NameNode or the server hosting it goes down, HDFS is down for the entire cluster. The Secondary NameNode, which takes periodic snapshots of the NameNode's metadata and updates it, is not itself a backup NameNode.
Currently the most comprehensive solution to this problem comes from MapR, one of the major Hadoop distributors. MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "containers," which are tracked by the Container Location Database (CLDB).
Regular NameNode architecture vs. MapR's distributed NameNode architecture
Via MapR
The Apache community is also working to address this NameNode SPOF: Hadoop 2.0.2 will include an update to HDFS called HDFS High Availability (HA), which provides the user with "the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance." The active NameNode logs all changes to a directory that is also accessible by the standby NameNode, which then uses the log information to update itself.

Architecture of the HDFS High Availability framework
Via Cloudera

MapReduce
The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs, and the desired function is executed over each key/value pair in order to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined together according to the reduce code provided by the user, for example, summing. It is also possible that no reduce phase is required, given the type of operation coded by the user.

Via Artificial Intelligence in Motion
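The canonical illustration of the two phases is word count. The sketch below simulates both phases in ordinary Python; on a real cluster each phase would run in parallel across nodes (e.g. via Hadoop Streaming), but the data flow is the same:

```python
from itertools import groupby

def map_phase(records):
    """Map: emit an intermediate (word, 1) pair for every word seen."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group intermediate pairs by key and sum the values."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

counts = dict(reduce_phase(map_phase(["the cat", "the dog"])))
# counts == {"the": 2, "cat": 1, "dog": 1}
```

The `sorted`/`groupby` step here plays the role of the shuffle-and-sort stage that Hadoop performs between the map and reduce phases.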
At the cluster level, the MapReduce processes are divided between two applications, JobTracker and TaskTracker. JobTracker runs on only one node of the cluster, while TaskTracker runs on every slave node in the cluster. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers depending on which data is stored on that node. JobTracker is responsible for scheduling job runs and managing computational resources across the cluster, and oversees the progress of each TaskTracker as they complete their individual tasks.

MapReduce Architecture
Via Computer Technology Review

YARN
As Hadoop became more widely adopted and used on clusters with up to tens of thousands of nodes, it became obvious that MapReduce 1.0 had issues with scalability, memory usage, and synchronization, and had its own SPOF issues. In response, YARN (Yet Another Resource Negotiator) was begun as a subproject in the Apache Hadoop project, on par with other subprojects like HDFS, MapReduce, and Hadoop Common. YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service. Essentially, YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN distributes this responsibility across the cluster.

YARN Architecture
Via Apache
MapReduce 2.0
MapReduce 2.0, or MR2, contains the same execution framework as MapReduce 1.0, but it is built on the scheduling/resource-management framework of YARN.
YARN, contrary to widespread misconceptions, is not the same as MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can support multiple instances of distributed processing applications, of which MapReduce 2.0 is one.

Additional Links

Cloudera blog: MR2 and YARN Briefly Explained
Hortonworks blog: Apache Hadoop YARN, Background and an Overview
Interview with Arun Murthy, co-founder of Hortonworks, about YARN

Hadoop-Related Projects at Apache
With the exception of Chukwa, Drill, and HCatalog (incubator-level projects), all other Apache projects mentioned here are top-level projects.
This list is not meant to be all-inclusive, but it serves as an introduction to some of the most commonly used projects, and also illustrates the range of capabilities being developed around Hadoop. To name just a couple, Whirr and Crunch are other Hadoop-related Apache projects not described here.

Avro

Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution," while still preserving access to the data. For example, different schemas could be used in serialization and deserialization of a given dataset.
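The schema-evolution behavior can be approximated in a few lines of Python. This is a simulation of Avro-style schema resolution, not the Avro library itself, and the field names and reader schema are invented for illustration: fields the writer never produced fall back to the reader schema's defaults.

```python
def read_with_schema(record, reader_schema):
    """Simulate Avro-style schema resolution: missing fields take the
    reader schema's defaults; unknown fields are dropped."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        out[name] = record.get(name, field.get("default"))
    return out

# The writer serialized only {"name": ...}; the reader's schema has
# since evolved to add an "email" field with a default.
reader_schema = {"fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]}
old_record = {"name": "Ada"}
resolved = read_with_schema(old_record, reader_schema)
# resolved == {"name": "Ada", "email": ""}
```

Because every Avro file carries the writer's schema alongside the data, this resolution can be performed automatically at read time.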

Additional Links
Avro in 3 minutes

BigTop

BigTop is a project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but it has since become its own project at Apache.
The current BigTop release (0.5.0) supports a number of Linux distributions and packages Hadoop together with the following projects: ZooKeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, SolrCloud, Crunch, DataFu, and Hue.

Additional Links
Apache blog post: What is BigTop?

Chukwa

Chukwa, currently in incubation, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing, and storage in Hadoop. It is included in the Apache Hadoop distribution, but as an independent module.

Drill

Drill is an incubation-level project at Apache and is an open-source version of Google's Dremel. Drill is a distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g. Avro) or without (e.g. JSON) schemas. Its primary language is an SQL-like language, DrQL, though the Mongo Query Language can also be used.

Flume

Flume is a tool for harvesting, aggregating, and moving large amounts of log data in and out of Hadoop. Flume "channels" data between "sources" and "sinks," and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. Flume itself has a query processing engine, so that there is the option to transform each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).
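The source/channel/sink flow can be sketched in-process as follows. `MemoryChannel`, `syslog_source`, and `hdfs_sink` are illustrative stand-ins (a Python list plays the role of the HDFS file), not Flume's actual Java components:

```python
from collections import deque

class MemoryChannel:
    """Stand-in for a Flume channel: buffers events between a source
    and a sink."""
    def __init__(self):
        self.queue = deque()
    def put(self, event):
        self.queue.append(event)
    def take(self):
        return self.queue.popleft() if self.queue else None

def syslog_source(lines, channel):
    """Source: harvest raw log lines into the channel as events."""
    for line in lines:
        channel.put({"body": line})

def hdfs_sink(channel, store):
    """Sink: drain the channel into the destination store."""
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event["body"])

channel, store = MemoryChannel(), []
syslog_source(["jan 1 boot", "jan 1 login"], channel)
hdfs_sink(channel, store)
# store == ["jan 1 boot", "jan 1 login"]
```

The channel decouples the two ends, which is what lets Flume's harvesting be scheduled or event-driven independently of when the sink writes.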

Additional Links
Flume in 3 minutes

HBase

Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive datasets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on ZooKeeper and runs a ZooKeeper instance by default.
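The cell model implied by "versioned, column-oriented store" can be sketched as follows. The table contents, row key, and column names here are hypothetical, and real HBase adds column families, regions, and much more; the point is only that cells are addressed by (row key, column) and keep timestamped versions:

```python
# Minimal sketch of an HBase-style table: each cell is addressed by
# (row key, "family:qualifier") and holds a list of timestamped versions.
table = {}

def put(row, column, value, ts):
    table.setdefault((row, column), []).append((ts, value))

def get(row, column):
    """Return the newest version of a cell, like a default HBase read."""
    versions = table.get((row, column), [])
    return max(versions)[1] if versions else None

put("user1", "info:city", "Austin", ts=1)
put("user1", "info:city", "Boston", ts=2)
current = get("user1", "info:city")
# current == "Boston"
```

Because values for the same column are stored together rather than spread across full rows, a scan touching one column avoids reading the rest of each row.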

Additional Links
HBase in 3 minutes

HCatalog

An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig, with plans to expand to HBase, using a common data model. HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.

Additional Links
HCatalog in 3 minutes

Hive

Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw HDFS to most users. It is based primarily on three related data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets.
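The table/partition/bucket layout can be sketched as a path computation. The warehouse path, column names, and the use of Python's built-in `hash` below are illustrative assumptions, not Hive's exact physical layout or hash function; the structure (directory per table, subdirectory per partition value, file per bucket) is the part that mirrors Hive:

```python
def hive_layout(record, partition_col, n_buckets, bucket_col):
    """Sketch of Hive's physical layout: table -> HDFS directory,
    partition -> subdirectory named by the partition column's value,
    bucket -> file chosen by hashing the bucketing column."""
    partition = "%s=%s" % (partition_col, record[partition_col])
    bucket = hash(record[bucket_col]) % n_buckets
    return "/warehouse/logs/%s/bucket_%05d" % (partition, bucket)

path = hive_layout({"dt": "2013-01-01", "user_id": 42},
                   partition_col="dt", n_buckets=4, bucket_col="user_id")
```

Partition pruning falls out of this layout: a query filtered on `dt` only needs to read the matching subdirectories.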

Additional Links
Hive in 3 minutes

Mahout

Mahout is a scalable machine learning and data mining library. There are currently four main groups of algorithms in Mahout:
recommendations, a.k.a. collaborative filtering
classification, a.k.a. categorization
clustering
frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable, that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion, and have been written to be executable in MapReduce.
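To give a feel for the first algorithm group, here is a toy item-based recommender built on co-occurrence counts, the flavor of collaborative filtering that Mahout distributes over MapReduce. The baskets and item names are invented, and real Mahout uses far more sophisticated similarity measures:

```python
from collections import Counter
from itertools import combinations

def recommend(user_items, all_baskets, top_n=2):
    """Recommend items that co-occur most often with the items the
    user already has (a toy item-based collaborative filter)."""
    cooc = Counter()
    for basket in all_baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    scores = Counter()
    for item in user_items:
        for (a, b), n in cooc.items():
            if a == item and b not in user_items:
                scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
picks = recommend(["milk"], baskets)
# picks == ["bread", "eggs"]
```

The co-occurrence counting step is an example of an operation that parallelizes naturally: each basket's pairs can be counted on a different node and the partial counts summed, which is exactly the shape MapReduce handles well.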

Additional Links
Mahout and Machine Learning in 3 minutes

Oozie

Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. It is integrated with the rest of the Apache Hadoop stack and, according to the Oozie site, it "support[s] several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), which is a common model for tasks that must be performed in a sequence and are subject to certain constraints.
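The DAG ordering requirement can be sketched with a topological sort (Kahn's algorithm). The action names below are hypothetical and this is not Oozie's actual XML workflow syntax; it only shows the scheduling property a DAG gives you, namely that every action runs after its dependencies:

```python
def topo_order(actions, deps):
    """Order workflow actions so each runs after its dependencies
    (Kahn's algorithm over a DAG)."""
    indegree = {a: 0 for a in actions}
    for a, needed in deps.items():
        indegree[a] = len(needed)
    order = []
    ready = [a for a, d in indegree.items() if d == 0]
    while ready:
        a = ready.pop(0)
        order.append(a)
        for b, needed in deps.items():
            if a in needed:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return order

# A hypothetical workflow: import data, aggregate it, then report on it.
order = topo_order(["sqoop-import", "mr-aggregate", "hive-report"],
                   {"mr-aggregate": ["sqoop-import"],
                    "hive-report": ["mr-aggregate"]})
```

Acyclicity is what makes such an ordering always exist; a workflow with a cycle could never be fully scheduled.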

Additional Links
Oozie in 3 minutes

Pig

Pig is a framework consisting of a high-level scripting language (Pig Latin) and a runtime environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats, due to its data model. Via the Pig wiki: "Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a 'map' data type, which is useful in representing semi-structured data, e.g. JSON or XML."
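The nested data model quoted above translates naturally into Python terms; the field values here are invented for illustration:

```python
# A Pig bag is a collection of tuples, and any tuple field may itself
# hold a nested bag (or a map). Here the third field is a nested bag.
bag = [
    ("alice", 3, [("page1",), ("page2",)]),
    ("bob",   1, [("page3",)]),
]

# A Pig 'map' field suits semi-structured data such as JSON:
record = ("alice", {"browser": "firefox", "lang": "en"})

# A FLATTEN-style expansion of the nested bag into flat tuples:
flat = [(name, page) for name, _, pages in bag for (page,) in pages]
# flat == [("alice", "page1"), ("alice", "page2"), ("bob", "page3")]
```

A purely relational model would force the nested bag into a separate table joined by key; Pig lets it live inside the tuple until you choose to flatten it.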

Additional Links
Pig in 3 minutes

Sqoop

Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.

According to the Sqoop blog, "You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses."

ZooKeeper

ZooKeeper is a service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (we call these registers znodes), much like a file system." ZooKeeper itself is a distributed service with "master" and "slave" nodes, and stores configuration information, etc. in memory on ZooKeeper servers.
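The "shared hierarchical namespace of data registers" can be sketched as a small tree of paths. This toy `ZNodeTree` class ignores ZooKeeper's replication, watches, and ephemeral nodes, and the paths and data below are hypothetical; it only shows the filesystem-like addressing of znodes:

```python
class ZNodeTree:
    """Minimal sketch of ZooKeeper's namespace: data registers
    (znodes) addressed by filesystem-like paths."""
    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        # As in ZooKeeper, a znode can only be created under an
        # existing parent znode.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("parent znode %s does not exist" % parent)
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

zk = ZNodeTree()
zk.create("/config")
zk.create("/config/db-host", b"10.0.0.5")
value = zk.get("/config/db-host")
# value == b"10.0.0.5"
```

Distributed processes coordinate by reading and writing such paths; in real ZooKeeper, writes go through a replicated quorum so every client sees a consistent tree.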

Additional Links
ZooKeeper in 3 minutes

Hadoop-Related Projects Outside Apache
There are also projects outside of Apache that build on or parallel the major Hadoop projects at Apache. Several of interest are described here.

Spark (UC Berkeley)

Spark is a parallel computing program which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMPLab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides parallel processing over HDFS and other Hadoop input sources, Spark differs in two key ways:
Spark holds intermediate results in memory, rather than writing them to disk; this drastically reduces query return time
Spark supports more than just map and reduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data
The first feature is the key to doing iterative algorithms on Hadoop: rather than reading from HDFS, performing MapReduce, writing the results back to HDFS (i.e. to disk), and repeating for each cycle, Spark reads data from HDFS, performs the computation, and stores the intermediate results in memory as Resilient Distributed Datasets. Spark can then run the next set of computations on the results cached in memory, thereby skipping the time-consuming steps of writing the nth-round results to HDFS and reading them back out for the (n+1)th round.
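The contrast drawn above can be sketched in a few lines. The functions below simulate the two execution styles with a plain list standing in for an RDD and a dict standing in for HDFS; this is an illustration of the execution pattern, not Spark's API:

```python
def iterate_via_disk(data, rounds, disk):
    """MapReduce style: write each round's results out, then read
    them back in for the next round."""
    for i in range(rounds):
        data = [x * 2 for x in data]          # the per-round computation
        disk["round-%d" % i] = list(data)     # write results to "HDFS"...
        data = disk["round-%d" % i]           # ...and read them back
    return data

def iterate_in_memory(data, rounds):
    """Spark style: keep the working set cached between iterations."""
    cached = data
    for _ in range(rounds):
        cached = [x * 2 for x in cached]
    return cached

result = iterate_in_memory([1, 2, 3], rounds=3)
# result == [8, 16, 24]
```

Both styles compute the same answer; the difference is that the in-memory version never pays the per-round disk round trip, which is what makes iterative algorithms practical on Spark.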

Additional Links
http://www.youtube.com/watch?v=N3ITxQcf6uQ

Shark (UC Berkeley)

Shark is essentially "Hive running on Spark." It utilizes the Apache Hive infrastructure, including the Hive metastore and HDFS, but it gives users the benefits of Spark (increased processing speed, additional functions besides map and reduce). This way, Shark users can execute queries in HiveQL over the same HDFS datasets, but receive results in near-real-time fashion.

Impala (Cloudera)

Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (Source: Cloudera)
