APACHE NIFI

Apache NiFi is an open source project that enables the automation of data flow between systems, sometimes described as "data logistics". The project is built around flow-based programming and provides a web-based user interface to manage data flows in real time.
The project was created by the United States National Security Agency (NSA) and was originally named Niagarafiles.
As a single ingestion platform, NiFi's real differentiator is that it gives you out-of-the-box tools to ingest data from many sources in a secure and governed manner.

NiFi is implemented in the Java programming language and allows extensions (processors, controller services, and reporting tasks) to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython, and several other popular scripting languages.
NiFi provides many processors out of the box (293 in NiFi 1.9.2), so you are standing on the shoulders of a giant: the standard processors handle the vast majority of use cases you may encounter.

Apache NiFi Features:


 Web-based UI
 Highly Configurable
 Data Provenance
 Extensible

NiFi Architecture:

NiFi executes within a JVM on a host operating system, either standalone or as a node in a cluster.
The amount of data that NiFi's repositories can handle or maintain therefore depends on the capacity of the machine (and of the JVM).
The primary components of NiFi on the JVM are:
Web Server:
The purpose of the web server is to host NiFi’s HTTP-based
command and control API.
Flow Controller:
The flow controller is the brains of the operation. It allocates execution threads on a node to the processors and, in a clustered NiFi, also coordinates with the other nodes.
Extensions (such as processors) are the components you connect together on the UI canvas; they operate and execute within the JVM.
Flow File:
A FlowFile has two parts: attributes and content.
Attributes hold metadata such as the filename, file path, and uuid by default; any attributes created with the UpdateAttribute processor are added to the FlowFile as well.
Content is the data that the FlowFile carries.
In reality, the content holds only a reference to a stream of data that sits in the Content Repository, and that is one of the reasons throughput is high.
When a processor modifies the content of a FlowFile, the previous data is kept. NiFi uses copy-on-write: it modifies the content while copying it to a new location. The original information is left intact in the Content Repository, and the FlowFile's content reference is pointed to the new (modified) data.
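The copy-on-write behaviour described above can be sketched in a few lines. This is a simplified conceptual model, not NiFi's actual implementation; the class and method names are invented for illustration:

```python
import uuid

class ContentRepository:
    """Stores immutable content claims keyed by id."""
    def __init__(self):
        self._claims = {}

    def write(self, data: bytes) -> str:
        claim_id = str(uuid.uuid4())
        self._claims[claim_id] = data   # a claim is never overwritten
        return claim_id

    def read(self, claim_id: str) -> bytes:
        return self._claims[claim_id]

class FlowFile:
    """Holds attributes plus a *reference* to content, not the bytes themselves."""
    def __init__(self, repo, data: bytes, attributes=None):
        self.repo = repo
        self.attributes = attributes or {}
        self.claim_id = repo.write(data)

    def modify_content(self, transform):
        # Copy-on-write: read the old claim, write a new one,
        # and repoint the reference. The old claim stays intact.
        old = self.repo.read(self.claim_id)
        self.claim_id = self.repo.write(transform(old))

repo = ContentRepository()
ff = FlowFile(repo, b"hello", {"filename": "greeting.txt"})
old_claim = ff.claim_id
ff.modify_content(lambda b: b.upper())
# The original bytes are still available in the repository:
assert repo.read(old_claim) == b"hello"
assert repo.read(ff.claim_id) == b"HELLO"
```

Because the old claim survives, features like provenance replay (described below in the source material) become cheap: the original data was never destroyed.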

Repositories in NiFi:
FlowFile Repository:
The FlowFile repository stores each FlowFile's attributes, a pointer to its content, and the FlowFile's state.
The FlowFile Repository acts as NiFi's Write-Ahead Log: as FlowFiles flow through the system, each change is logged in the FlowFile Repository before it happens, as a transactional unit of work. This allows the system to know exactly what step the node is on when processing a piece of data. If the node goes down while processing the data, it can easily resume from where it left off upon restart.
 A snapshot is automatically taken periodically by the system,
which creates a new snapshot for each FlowFile. The system
computes a new base checkpoint by serializing each FlowFile in the
hash map and writing it to disk with the filename ".partial". As the
checkpointing proceeds, the new FlowFile baselines are written to
the ".partial" file. Once the checkpointing is done the old
"snapshot" file is deleted and the ".partial" file is renamed
"snapshot".
Content Repository:
The Content Repository is where the actual content bytes of a
given FlowFile live.
Provenance Repository:
The Provenance Repository is where all provenance event data is stored. Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at that point; this snapshot is called a Provenance Event.
Provenance enables us to retrace the lineage of the data and build the full chain of custody for every piece of information processed in NiFi. On top of offering the complete lineage of the data, the Provenance Repository also offers the ability to replay the data from any point in time.
The main difference between the FlowFile repository and the provenance repository is that the FlowFile repository holds the latest state of the FlowFiles currently in use, whereas the provenance repository holds the complete life cycle of every FlowFile that has ever been in the flow.
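That distinction (latest state vs. complete history) can be modelled with an append-only event store. This is an illustrative sketch only; the class names and event-type strings are invented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProvenanceEvent:
    flowfile_id: str
    event_type: str      # e.g. "CREATE", "ATTRIBUTES_MODIFIED", "DROP"
    attributes: dict     # snapshot of the FlowFile at this point in time

class ProvenanceRepository:
    """Append-only: keeps every event for every FlowFile ever seen."""
    def __init__(self):
        self.events: List[ProvenanceEvent] = []

    def record(self, flowfile_id: str, event_type: str, attributes: dict):
        # Each modification appends a snapshot; nothing is ever updated in place.
        self.events.append(ProvenanceEvent(flowfile_id, event_type, dict(attributes)))

    def lineage(self, flowfile_id: str):
        # The full chain of custody for one piece of data.
        return [e for e in self.events if e.flowfile_id == flowfile_id]

repo = ProvenanceRepository()
repo.record("ff-1", "CREATE", {"filename": "a.csv"})
repo.record("ff-1", "ATTRIBUTES_MODIFIED", {"filename": "a_renamed.csv"})
# Lineage survives even after the FlowFile has left the flow:
assert [e.event_type for e in repo.lineage("ff-1")] == ["CREATE", "ATTRIBUTES_MODIFIED"]
```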
flow.xml.gz:
Everything the DFM (DataFlow Manager) puts onto the NiFi User Interface canvas is written, in real time, to a single file called flow.xml.gz. This file is located in the nifi/conf directory by default, and any change made on the canvas is automatically saved to it. NiFi automatically creates a backup copy of this file in the archive directory whenever it is updated. You can use these archived files to roll back the flow configuration: stop NiFi, replace flow.xml.gz with the desired backup copy, then restart NiFi.
NiFi is also able to operate within a cluster.

Starting with the NiFi 1.0 release, a Zero-Master Clustering paradigm is employed. Each node in a NiFi cluster performs the same tasks on the data, but each operates on a different set of data. Apache ZooKeeper elects a single node as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator, which is responsible for disconnecting and connecting nodes. Additionally, every cluster has one Primary Node, also elected by ZooKeeper. As a DataFlow Manager, you can interact with the NiFi cluster through the user interface (UI) of any node; any change you make is replicated to all nodes in the cluster, allowing for multiple entry points.
Processor:
NiFi comes with many processors when you install it. If you don't find the perfect one for your use case, it is still possible to build your own.
A processor is a component of NiFi that listens for incoming data, pulls data from or publishes data to external sources, or transforms data, depending on the type of processor it is. A group of processors put together with their connections forms a process group.

Types of processors:
 data ingestion
 data transform
 data egress/sending data
 routing and mediation
 database access
 attribute extraction
 system interaction
 splitting and aggregation
 HTTP and UDP
 AWS processors

Steps to create a custom processor:

https://medium.com/hashmapinc/creating-custom-processors-and-controllers-in-apache-nifi-e14148740ea
1) Install Maven
2) Run $ mvn archetype:generate
3) Apply the filter "nifi"
4) Select option 1 to create a new processor
5) Choose the version number

Scaling in NiFi:
For each processor, you can specify the number of concurrent tasks you want to run simultaneously. This way, the Flow Controller allocates more resources to that processor, increasing its throughput. Processors share threads, so if one processor requests more threads, other processors have fewer threads available to execute.
Another way to scale is to increase the number of nodes in your NiFi cluster; this is called horizontal scaling.
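The shared-thread model can be illustrated with a fixed pool. This is a conceptual sketch, not how NiFi's scheduler is actually implemented; the processor names are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool shared by all processors, like the Flow Controller's thread pool.
pool = ThreadPoolExecutor(max_workers=4)

def run_processor(name, concurrent_tasks, work):
    """Submit `concurrent_tasks` copies of `work` for one processor.

    Every task competes for the same 4 workers, so configuring one
    processor with more concurrent tasks leaves fewer free threads
    for the others.
    """
    return [pool.submit(work, name, i) for i in range(concurrent_tasks)]

results = []
def work(name, i):
    results.append((name, i))

# "GetFile" is configured with 3 concurrent tasks, "PutHDFS" with 1:
futures = run_processor("GetFile", 3, work) + run_processor("PutHDFS", 1, work)
for f in futures:
    f.result()   # wait for all tasks to finish
assert len(results) == 4
```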

Backpressure in NiFi:
Backpressure is a mechanism whereby, if the quantity of FlowFiles in a queue goes beyond a threshold, the Flow Controller will not schedule the previous (upstream) processor to run again until there is room in the queue. The threshold limits are set on the queue that connects the two processors, and are based on either the count of FlowFiles or the amount of space the files occupy.
For example, if the Object Threshold is set to 100 and initially 80 files flow from processor-1 to processor-2, the flow continues because the count is below the threshold. If 30 more FlowFiles then flow from p-1 to p-2, the queue shows a warning but still accepts the FlowFiles; however, the Flow Controller will not schedule p-1 again until files move from p-2 to its downstream processors and the number of files in the queue drops back below the Object Threshold.
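The scheduling behaviour in that example can be sketched like this (a simplified model; the names are invented for illustration):

```python
class Queue:
    """A connection between two processors with an Object Threshold."""
    def __init__(self, object_threshold: int):
        self.object_threshold = object_threshold
        self.flowfiles = []

    def upstream_schedulable(self) -> bool:
        # The upstream processor is only scheduled while the queue
        # is below its threshold; the queue itself never rejects files.
        return len(self.flowfiles) < self.object_threshold

q = Queue(object_threshold=100)

q.flowfiles.extend(range(80))      # 80 files arrive: below threshold
assert q.upstream_schedulable()

q.flowfiles.extend(range(30))      # 30 more are still accepted (110 total)...
assert len(q.flowfiles) == 110
assert not q.upstream_schedulable()  # ...but upstream is no longer scheduled

del q.flowfiles[:50]               # downstream drains the queue
assert q.upstream_schedulable()    # upstream can be scheduled once more
```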
The configuration settings for a queue offer the following options to set the threshold limits and balance the load.
Load Balancing Strategy (used for cluster setups):
Round Robin:
Distributes the FlowFiles among all available nodes in round-robin fashion. If a particular node is disconnected or dropped, the FlowFiles queued for that node are redistributed among the available nodes.
Single Node:
All the FlowFiles are routed to a single node. If that particular node goes down, the FlowFiles wait in the queue for the node to come back up.
Partition by Attribute:
The FlowFiles are distributed based on the value of the attribute you specify. If the load balancing strategy is set to Partition by Attribute and a FlowFile does not have that attribute, the value is assumed to be NULL, so all FlowFiles missing the attribute are sent to the same node.
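The three strategies can be sketched as simple routing functions. This is an illustrative model only, not NiFi's implementation; the node names and attribute key are made up:

```python
import itertools

NODES = ["node-1", "node-2", "node-3"]

_rr = itertools.cycle(NODES)
def round_robin(flowfile: dict) -> str:
    # Each FlowFile goes to the next node in the rotation.
    return next(_rr)

def single_node(flowfile: dict) -> str:
    # Everything is routed to one node.
    return NODES[0]

def partition_by_attribute(flowfile: dict, attr: str = "customer_id") -> str:
    # A missing attribute is treated as NULL, so all FlowFiles
    # without the attribute hash to the same node.
    value = flowfile.get(attr)            # None when absent
    return NODES[hash(value) % len(NODES)]

assert [round_robin({}) for _ in range(4)] == ["node-1", "node-2", "node-3", "node-1"]
assert single_node({}) == "node-1"
# Two FlowFiles without the attribute always land on the same node:
assert partition_by_attribute({}) == partition_by_attribute({"other": 1})
```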
Scheduling strategy:
 Timer driven – run at a fixed interval (e.g. 5 sec, 2 min, 1 day)
 Event driven – experimental (still in the implementation stage as of now)
 CRON driven – expressions of the form "* * * * * ? *" (sec min hrs DOM mon DOW year)
* - matches any value
? - no specific value (used in only one of DOM or DOW)
L - the last occurrence of the day in a month;
1L in the day-of-week field means the last Sunday of the month (1-7 = Sun-Sat)

Penalization:  
During the normal course of processing a piece of data (a FlowFile),
an event may occur that indicates that the data cannot be
processed but the data may be processable at a later time. When
this occurs, the Processor may choose to Penalize the FlowFile. This
will prevent the FlowFile from being Processed for some period of
time.

Yield:
A Processor may determine that some situation exists such that the Processor can no longer make any progress, regardless of the data that it is processing. In such cases the Processor should yield for some time. By doing this, the Processor is telling the framework that it should not waste resources triggering this Processor to run, because there's nothing it can do; it's better to use those resources to allow other Processors to run.
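The difference between the two can be summed up in a small model: penalization holds back one FlowFile, while yield pauses the whole processor. This is an illustrative sketch only (in NiFi's Java API these correspond roughly to penalizing a FlowFile via the ProcessSession and calling yield on the ProcessContext); the durations below match NiFi's defaults:

```python
import time

PENALTY_SECS = 30.0   # NiFi's default Penalty Duration
YIELD_SECS = 1.0      # NiFi's default Yield Duration

class FlowFile:
    def __init__(self):
        self.penalized_until = 0.0

class Processor:
    def __init__(self):
        self.yielded_until = 0.0

    def penalize(self, flowfile: FlowFile, now: float):
        # Penalization targets ONE FlowFile: only that piece of data is
        # held back; the processor keeps running on other FlowFiles.
        flowfile.penalized_until = now + PENALTY_SECS

    def do_yield(self, now: float):
        # Yield targets the PROCESSOR: the framework stops triggering
        # it at all until the yield period expires.
        self.yielded_until = now + YIELD_SECS

now = time.time()
p, ff, other = Processor(), FlowFile(), FlowFile()
p.penalize(ff, now)
assert ff.penalized_until > now and other.penalized_until == 0.0
p.do_yield(now)
assert p.yielded_until > now
```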
