
Big Data Analytics

Isara Anantavrasilp

Lecture 6: Installing Hadoop Distribution

Hadoop Distribution
• Hadoop distribution: Cloudera QuickStart
• Platform: VirtualBox
• System requirements
– 64-bit host OS and a virtualization platform that supports 64-bit guest OSes
– RAM for the VM: 4 GB
– HDD: 20 GB of free space

Installing Cloudera QuickStart
• Download size: ~5.5 GB
• Download links
– https://www.virtualbox.org/wiki/Downloads
(select the package corresponding to your host system)
– https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip

VirtualBox Download

[Screenshot: VirtualBox download page]
Installing Cloudera QuickStart
• Install VirtualBox
• Unzip the Cloudera VM archive
• Start VirtualBox
• Import the appliance (virtual machine); a command-line alternative is sketched below
• Launch the Cloudera VM
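
If you prefer the command line, VirtualBox's VBoxManage tool can import the appliance directly. A minimal sketch, assuming the unzipped archive contains an OVF descriptor with the name below (check the actual file name after unzipping):

# Import the appliance from the unzipped Cloudera archive (file name assumed)
VBoxManage import cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf
# Confirm that the VM is now registered
VBoxManage list vms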

Start VirtualBox

[Screenshot: VirtualBox main window]
Import Appliance

[Screenshot: importing the Cloudera appliance in VirtualBox]
Setting Up the VM

Select Bidirectional to share the clipboard between host and guest.
Setting Up the VM

8 GB of RAM is recommended.
Setting Up the VM

At least 2 CPUs are recommended.
Launch Cloudera VM

[Screenshot: starting the VM in VirtualBox]
Launch Cloudera VM

Login: cloudera
Password: cloudera
Launch Cloudera VM

[Screenshot: the running Cloudera VM]
Troubleshooting
• The VM does not start, showing the error:

AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED).

Make sure that your BIOS allows virtualization (VT-x/AMD-V)

• The VM seems to freeze when starting:

It does not actually freeze; just wait until it finishes loading

Let’s check if we can run Hadoop
• Open a terminal
• Type in the following command:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

• It should list the available example programs (sample output below)
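
With no program name given, the JAR prints its usage. The output should look roughly like this (abridged; the exact list depends on the Hadoop version):

An example program must be given as the first argument.
Valid program names are:
  grep: A map/reduce program that counts the matches of a regex in the input.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  wordcount: A map/reduce program that counts the words in the input files.
  ...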


Word Count
• Now let’s try:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount

• Result:

Usage: wordcount <in> [<in>...] <out>
[cloudera@quickstart ~]$

• This is the word-count example program
• Let’s count some words

Word Files
• The Complete Works of William Shakespeare
https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

• The Project Gutenberg EBook of The Adventures of Sherlock Holmes
http://norvig.com/big.txt

Download and Save
• Open a web browser
• Type in or paste the URL

Download and Save
• After the page has loaded, save the file
• The default destination is ~/Downloads (a terminal alternative is sketched below)
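
Alternatively, the files can be fetched from the terminal. A minimal sketch, assuming wget is available in the VM (it usually is on CentOS-based images):

cd ~/Downloads
# Download both sample texts (same URLs as on the previous slide)
wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
wget http://norvig.com/big.txt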

Let’s count the words
• Open a terminal and type:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount big.txt out

• It will fail:

InvalidInputException: Input path does not exist:

• This is because the file is not yet in HDFS!

Local File System and HDFS
• Hadoop does not store everything in HDFS
• Map results are normally stored in the nodes’ local file systems
– Map results are intermediate results which will be sent to reduce tasks later
– They do not need the redundancy provided by HDFS
– If a map node fails, the Hadoop task manager simply resends the task to another node
• HDFS stores:
– Input data: we must put our data into HDFS first
– Reduce output data: the result of the entire process

Copy the data into HDFS
• Open the terminal and go to the Downloads directory:

cd Downloads/

• List the files with ls or ls -al
• You should see your downloaded files:

[cloudera@quickstart Downloads]$ ls
big.txt t8.shakespeare.txt

Copy the data into HDFS
• Copy the file from the local file system to HDFS:

hadoop fs -copyFromLocal big.txt

– hadoop fs invokes the file system commands; the -copyFromLocal option copies a file from the local FS to HDFS

• Check whether the file was copied correctly:

hadoop fs -ls

• Now, let’s try to copy big.txt to HDFS again: it fails this time, because big.txt already exists in HDFS

Other HDFS Command Options
• List the files in the current directory:

hadoop fs -ls

• Copy files within HDFS:

hadoop fs -cp big.txt big2.txt

• Copy files back to the local file system:

hadoop fs -copyToLocal big2.txt

• Remove files in HDFS:

hadoop fs -rm big2.txt

• Show all command options (per-command help below):

hadoop fs
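
For detailed help on a single command, hadoop fs also accepts -help followed by the command name, for example:

hadoop fs -help rm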
Let’s count the words (again)
• Open a terminal and type:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount big.txt out

• This time it should run
• While it is running, Hadoop will show its progress, including completed map and reduce tasks

Copy the result to local FS
• The output is stored in the directory out in HDFS
• You can list the contents of the directory with:

hadoop fs -ls out

• Then copy the result file back with:

hadoop fs -copyToLocal out/part-r-00000

• Now see the contents of the result (format sketched below):

more part-r-00000
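
Each line of part-r-00000 holds one word and its count, separated by a tab and sorted by word. Illustrative lines (the counts here are made up):

A       1234
Aaron   56
abandon 7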
What have we done so far?
• We have copied files to and from HDFS
• We have run some HDFS file commands
• We have executed a MapReduce program
– The data being operated on is in HDFS
– But the program itself is on the local file system
– WordCount is written in Java, but a MapReduce program can be written in any language (see the streaming sketch below)
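
As an example of this language independence, Hadoop Streaming lets any program that reads stdin and writes stdout act as a mapper or reducer. A minimal sketch based on the standard streaming demo, assuming the streaming JAR is at the path below in the QuickStart VM (verify with ls):

# Trivial streaming job: cat as mapper, wc as reducer
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input big.txt \
  -output out-streaming \
  -mapper /bin/cat \
  -reducer /usr/bin/wc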

Prepare Compiling Environment
• Most of the environment parameters are already set in Cloudera QuickStart; to check, type:

printenv

• The following environment variables should be there:

JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
PATH=/usr/java/jdk1.7.0_67-cloudera/bin

• What we still have to set is (consolidated sketch below):

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
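
If any of these are missing in your shell, a sketch of setting all three by hand (paths as shown on this slide; adjust if your JDK directory differs):

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar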

Compiling Word Count
• To compile:

hadoop com.sun.tools.javac.Main WordCount.java

• The result will be multiple class files
• We have to pack them into one JAR file:

jar cf wc.jar WordCount*.class

• The result will be a JAR file: wc.jar (the WordCount.java source is sketched below)
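
For reference, the WordCount.java compiled here typically follows the standard Apache Hadoop MapReduce tutorial example; the version used in class may differ in details. A sketch of that tutorial code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. big.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. out2
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}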

Running Word Count
• Counting the words in the big.txt file:

hadoop jar wc.jar WordCount big.txt out2

• You should get the same result as in the previous example
• The result is stored in the out2 directory
• Let’s copy it to the local file system:

hadoop fs -copyToLocal out2

Hadoop Jobs
• A Hadoop MapReduce process is organized as a job
• A job consists of tasks:
– Map tasks
– Reduce tasks
– Tasks are scheduled by YARN
– If a task fails, it is automatically re-scheduled on another node

Input Splits
• MapReduce separates the entire input into smaller chunks, or splits, and feeds them into map tasks (and later to reduce tasks)
• Splits allow the tasks to be distributed among nodes
• The best size for a split is the size of an HDFS block (see the sketch below)
– Too small, and there is too much scheduling overhead
– Too large, and a single split spans multiple blocks, which may reside on different nodes
• Hadoop tries to assign each map task to the node where the data already resides
– locality optimization
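
The split size can also be bounded from the job driver. A minimal sketch using FileInputFormat's static helpers (values are in bytes; 128 MB is a common HDFS block size, but your cluster may differ):

// In the driver (e.g. WordCount's main()), after creating the Job:
// pin splits to roughly one HDFS block
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);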
Distributed and Combining Tasks
• A job is split into tasks, and the tasks are distributed to map nodes
– Tasks are processed in parallel
• When the map tasks are done, the results are sent to the reducer(s)
– There can be more than one reducer
– There can also be zero reducers, if the work is simple enough to be done entirely in map tasks
• If there is more than one reducer, the map tasks must partition their outputs (see the sketch below)
– Partition (divide) the outputs by key
– Send different keys to different reducers
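
By default, Hadoop's HashPartitioner sends each key to hash(key) mod numReducers, so every occurrence of a word reaches the same reducer. A custom partitioner is a hypothetical sketch (the class name and the letter-based rule are invented for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys whose first character sorts at or
// before 'm' go to reducer 0, everything else to reducer 1
// (wrapped by the actual number of reducers)
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String w = key.toString().toLowerCase();
    int bucket = (!w.isEmpty() && w.charAt(0) <= 'm') ? 0 : 1;
    return bucket % numPartitions;
  }
}

// In the job driver:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
//   job.setNumReduceTasks(2);  // output: part-r-00000 and part-r-00001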

