Streaming Data Via Flume
We know that Hadoop is a framework for storing and processing huge datasets, and that the Sqoop
component is used to transfer structured data between traditional RDBMS databases and HDFS, in
both directions.
But what if we want to load semi-structured or unstructured data into the HDFS cluster, or capture
live streaming data generated by sources such as Twitter and weblogs? Which component of the
Hadoop ecosystem is useful for this kind of job? The solution is Flume.
Learning Flume will help users collect large amounts of data from different sources and store it in
the Hadoop cluster.
Apache Flume is a Hadoop ecosystem component used to collect, aggregate and move large amounts
of log data from different sources to a centralized data store.
It is an open-source component designed to work in a distributed environment, collecting and
storing data according to the sources, channels and sinks specified in its configuration.
Flume Architecture
Before moving on to how the Flume tool works, it is essential to understand the Flume
architecture first.
Flume Event: The basic unit of data transported inside Flume (typically a single log entry). It
contains a payload of bytes to be transported from the source to the destination, optionally
accompanied by headers.
Flume Agent: An independent Java virtual machine (JVM) daemon process that receives data (events)
from clients and transports it to the subsequent destination (a sink or another agent).
Source: The component of a Flume agent that receives data from data generators, for example Twitter,
Facebook or weblogs from different sites, and transfers this data to one or more channels in the form
of Flume events.
The external source sends data to Flume in a format that is recognized by the target Flume source.
For example, an Avro Flume source can be used to receive Avro data from Avro clients, or from other
Flume agents in the flow that send data from an Avro sink. Similarly, a Thrift Flume source can
receive data from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any
language generated from the Flume Thrift protocol.
Channel: Once the Flume source receives an event, it stores the data in one or more channels, which
buffer the events until they are consumed by sinks. A channel acts as a bridge between sources and
sinks, and channels can serve any number of sources and sinks.
Sink: It stores the data in centralized stores such as HDFS and HBase.
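To make the source-channel-sink wiring concrete, here is a minimal single-agent configuration sketch
in Flume's standard properties format (the agent and component names a1, r1, c1 and k1 are
illustrative, not taken from this tutorial); it connects a netcat source to a logger sink through an
in-memory channel:

# Name the agent's components (names are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines of text from a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: write events to the agent's log
a1.sinks.k1.type = logger

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1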
To stream data from Twitter into our HDFS cluster, we need the following prerequisites:
• Twitter account
• Hadoop cluster
If both prerequisites are available, we can move on to the next step.
Step 1:
Go to the following link and click the ‘create new app’ button.
https://apps.twitter.com/app
Step 3:
Accept the developer agreement and select the ‘create your Twitter application’ button.
Step 7:
Scroll down further and select the ‘create my access token’ button.
You will then receive a message stating that you have successfully generated your application access
token.
Step 8:
Download the Flume tar file from the following link:
https://drive.google.com/drive/u/0/folders/0B1QaXx7tpw3SWkMwVFBkc3djNFk
Right-click on the downloaded Flume tar file and select the 'Extract Here' option to untar the Flume
directory, then update the path of the extracted Flume directory in the .bashrc file (shown as an
image in the original post).
NOTE: Keep the path the same as the location of the extracted directory.
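Since the .bashrc entries appear only as an image in the original post, the lines below are a typical
sketch, assuming the tarball was extracted to /home/<user>/flume (adjust this to the actual location
of your extracted directory):

# Assumed location of the extracted Flume directory
export FLUME_HOME=/home/<user>/flume
export PATH=$PATH:$FLUME_HOME/bin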
After setting the Flume directory path, save and close the .bashrc file. Then, in the terminal, run
the command below to reload the .bashrc file.
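The reload command is typically:

source ~/.bashrc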
Step 10:
Create a new file inside the conf directory of the extracted Flume directory. The following twitter4j
JAR files are also required:
1. twitter4j-core-X.XX.jar
2. twitter4j-stream-X.X.X.jar
3. twitter4j-media-support-X.X.X.jar
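The original post does not spell out where these JARs go; a common approach (an assumption here, not
something confirmed by the source) is to put them on Flume's classpath by copying them into the lib
directory of the extracted Flume installation:

# Assumption: copy the downloaded twitter4j JARs onto Flume's classpath
cp twitter4j-core-*.jar twitter4j-stream-*.jar twitter4j-media-support-*.jar $FLUME_HOME/lib/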
Step 11:
Copy the Flume configuration code from the link below and paste it into the newly created file.
https://drive.google.com/open?id=0B1QaXx7tpw3Sb3U4LW9SWlNidkk
Step 12:
Replace the Twitter API keys with the keys generated as shown in step 6 and step 8.
We also have to decide which keywords' tweets should be collected from the Twitter application, so
you can change the keywords in the TwitterAgent.sources.Twitter.keywords property.
In our example, we are fetching tweet data related to Hadoop, election, sports, cricket and Big Data.
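For reference, the Twitter-related part of the agent configuration typically looks like the sketch
below (the property names follow the widely used TwitterAgent example; the placeholder values must be
replaced with the consumer keys and access tokens generated in the earlier steps):

# Twitter API credentials (placeholders - replace with your own keys)
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>

# Keywords whose tweets should be collected
TwitterAgent.sources.Twitter.keywords = hadoop, election, sports, cricket, big data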
Step 14:
Open a new terminal and start all the Hadoop daemons before running the Flume command to fetch the
Twitter data.
Then create a new directory in HDFS where the Twitter tweet data should be stored.
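The exact commands were shown as images in the original post; a typical sequence, assuming the tweets
are to be stored under /user/flume/tweets (the path used later when reading the data), is:

# Start the HDFS and YARN daemons
start-dfs.sh
start-yarn.sh

# Verify that the daemons are running
jps

# Create the HDFS directory where the tweet data will be stored
hdfs dfs -mkdir -p /user/flume/tweets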
Step 16:
For fetching data from Twitter, use the command below to pull the tweet data into the HDFS cluster
path.
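The command itself appears as an image in the original post; a typical invocation, assuming the
configuration file is saved as twitter.conf inside Flume's conf directory and the agent is named
TwitterAgent (as in the configuration properties above), looks like this:

# Start the Flume agent that streams tweets into HDFS
flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console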
Once the tweet data starts streaming into the given HDFS path, we can press Ctrl+C to stop the
streaming process.
Step 17:
To check the contents of the tweet data, we can use the 'cat' command to display the tweet data under
the /user/flume/tweets/FlumeData.145* path.
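The listing and display commands were shown as images in the original post; a typical sketch is:

# List the files Flume has written to the target HDFS directory
hdfs dfs -ls /user/flume/tweets

# Display the contents of one of the generated FlumeData files
hdfs dfs -cat /user/flume/tweets/FlumeData.145*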
From <https://acadgild.com/blog/streaming-twitter-data-using-flume/>