Apache NiFi
NiFi Architecture:
Repositories in NiFi:
FlowFile Repository: Also known as the Write-Ahead Log.
The FlowFile Repository stores each FlowFile's attributes, a pointer to its
content in the Content Repository, and the current state of the FlowFile.
The FlowFile Repository acts as NiFi’s Write-Ahead Log, so as the
FlowFiles are flowing through the system, each change is logged in
the FlowFile Repository before it happens as a transactional unit of
work. This allows the system to know exactly what step the node is
on when processing a piece of data. If the node goes down while
processing the data, it can easily resume from where it left off upon
restart.
A snapshot is automatically taken periodically by the system,
which creates a new snapshot for each FlowFile. The system
computes a new base checkpoint by serializing each FlowFile in the
hash map and writing it to disk with the filename ".partial". As the
checkpointing proceeds, the new FlowFile baselines are written to
the ".partial" file. Once the checkpointing is done the old
"snapshot" file is deleted and the ".partial" file is renamed
"snapshot".
Content Repository:
The Content Repository is where the actual content bytes of a
given FlowFile live.
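This split is visible in the processor API: updating an attribute only touches FlowFile metadata (tracked by the FlowFile Repository), while writing content produces new bytes in the Content Repository. Below is a minimal sketch of a custom processor's onTrigger method, assuming the processor extends AbstractProcessor and defines a REL_SUCCESS relationship; the attribute name and the text written are placeholders:

import java.nio.charset.StandardCharsets;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return; // nothing queued for this processor right now
    }

    // Attribute update: metadata only, recorded in the FlowFile Repository
    flowFile = session.putAttribute(flowFile, "processed.by", "example-processor");

    // Content write: the actual bytes are stored in the Content Repository
    flowFile = session.write(flowFile,
            out -> out.write("hello from NiFi".getBytes(StandardCharsets.UTF_8)));

    session.transfer(flowFile, REL_SUCCESS);
    // When the session commits, all of these changes are logged to the
    // FlowFile Repository's write-ahead log as one transactional unit of work.
}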
Provenance Repository: The Provenance Repository is where all
provenance event data is stored. Every time a FlowFile is modified,
NiFi takes a snapshot of the FlowFile and its context at that point in
time. This snapshot is called a Provenance Event.
Provenance enables us to retrace the lineage of the data and
build the full chain of custody for every piece of information
processed in NiFi. On top of offering the complete lineage of the
data, the Provenance Repository also makes it possible to replay the
content of a FlowFile from any point in its lineage.
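Send and receive events (data crossing NiFi's boundary) are typically reported explicitly by the processor, while most other event types (attribute changes, content modification, routing) are emitted automatically when the session commits. A small sketch of explicit reporting inside a processor's onTrigger, where the transit URI is only an illustrative placeholder:

// After the FlowFile has been delivered to an external system:
session.getProvenanceReporter().send(flowFile, "https://example.com/api/ingest");
session.transfer(flowFile, REL_SUCCESS);
// The resulting SEND event is stored in the Provenance Repository, which is
// what makes lineage queries and later content replay possible.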
The main difference between the FlowFile Repository and the
Provenance Repository is that the FlowFile Repository holds only the
latest state of the FlowFiles currently in the flow, whereas the
Provenance Repository records the complete life cycle of every
FlowFile that has ever been in the flow.
flow.xml.gz:
Everything the DFM puts onto the NiFi User Interface canvas is
written, in real time, to one file called the flow.xml.gz. This file is
located in the nifi/conf directory by default. Any change made on
the canvas is automatically saved to this file. NiFi automatically
creates a backup copy of this file in the archive directory when it
is updated. You can use these archived files to roll back the flow
configuration. To do so, stop NiFi, replace flow.xml.gz with the
desired backup copy, and then restart NiFi.
NiFi is also able to operate within a cluster.
Types of processors:
data ingestion
data transform
data egress/sending data
routing and mediation
database access
attribute extraction
system interaction
splitting and aggregation
HTTP and UDP
AWS processors
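For example, GetFile and ConsumeKafka are data ingestion processors, PutFile and PublishKafka handle data egress, RouteOnAttribute covers routing and mediation, ExecuteSQL provides database access, and ExtractText performs attribute extraction.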
Scaling in NiFi:
For each processor, you can specify the number of concurrent
tasks that should run simultaneously. This way, the Flow Controller
allocates more resources to that processor, increasing its
throughput. Processors share a common pool of threads, so if one
processor uses more threads, fewer threads are available for other
processors to execute.
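For example, a processor configured with 3 concurrent tasks can work on three FlowFiles at the same time, each task drawing a thread from the shared timer-driven thread pool (10 threads by default, adjustable in the controller settings).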
Another way to scale is to increase the number of nodes in your
NiFi cluster. This is called horizontal scaling.
Backpressure in NiFi:
Backpressure is a mechanism whereby, if the quantity of FlowFiles in
a queue goes beyond a threshold, the Flow Controller will not
schedule the upstream processor to run again until there is room in
the queue. The threshold limits are set on the connection (queue)
between the two processors and are based on either the number of
FlowFiles or the total size of the data they contain.
For example, if the Object Threshold is set to 100 and 80 FlowFiles
initially flow from processor 1 to processor 2, the flow continues
because the count is below the threshold. If 30 more FlowFiles then
flow from p-1 to p-2, the queue shows a warning but still accepts the
FlowFiles; however, the Flow Controller will not schedule p-1 again
until enough files have moved from p-2 to the downstream processors
that the number of files in the queue drops back below the Object
Threshold.
The configuration settings for a queue provide two options for
setting the threshold limits: Back Pressure Object Threshold (number
of FlowFiles) and Back Pressure Data Size Threshold (total size of
the queued data).
Load Balancing Strategy (used in a cluster setup):
Round Robin:
It distributes the FlowFiles among all available nodes in a
round-robin fashion. If a particular node is disconnected or
dropped, the FlowFiles queued for that node are redistributed among
the available nodes.
Single Node:
All the FlowFiles are routed to a single node; in case that
particular node goes down, the FlowFiles remain in the queue waiting
for the node to come back up.
Partition by Attribute:
The FlowFiles are distributed based on the value of the attribute
that you specify: FlowFiles with the same value go to the same node.
If the load balancing strategy is set to Partition by Attribute and a
FlowFile does not have that attribute, its value is treated as NULL,
and all FlowFiles missing that attribute are sent to the same node.
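For example, partitioning on a hypothetical customer_id attribute ensures that all FlowFiles carrying the same customer_id value are routed to the same node, which is useful when downstream processing needs to see all of a customer's data together.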
Scheduling strategy:
Timer driven – (e.g., 5 sec, 2 min, 1 day)
Event driven – (experimental feature at this time)
CRON driven – (* * * * * ?) fields: sec min hrs DOM mon DOW [year]
* – matches any value
? – no specific value (used in only one of DOM or DOW)
L – specifies the last occurrence within the month; in the DOW field,
1L means the last Sunday of the month (days are numbered 1-7, Sun-Sat).
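For example, the CRON expression 0 0 13 * * ? triggers the processor at 1:00 PM every day, and 0 20 14 ? * MON-FRI triggers it at 2:20 PM Monday through Friday.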
Penalization:
During the normal course of processing a piece of data (a FlowFile),
an event may occur that indicates that the data cannot be
processed but the data may be processable at a later time. When
this occurs, the Processor may choose to Penalize the FlowFile. This
will prevent the FlowFile from being Processed for some period of
time.
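As a rough sketch of how this looks inside a custom processor's onTrigger (the REL_RETRY relationship, the deliverToRemoteService call, and the exception type are hypothetical placeholders):

FlowFile flowFile = session.get();
if (flowFile == null) {
    return;
}
try {
    // Hypothetical call to an external service that may be temporarily down
    deliverToRemoteService(flowFile);
    session.transfer(flowFile, REL_SUCCESS);
} catch (ServiceTemporarilyUnavailableException e) {
    // The data itself is fine, the destination just is not ready yet:
    // penalize the FlowFile so it is not retried for the processor's
    // configured Penalty Duration (30 seconds by default).
    flowFile = session.penalize(flowFile);
    session.transfer(flowFile, REL_RETRY);
}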
Yield:
A Processor may determine that some situation exists such that the
Processor can no longer make any progress, regardless of the data
that it is processing. In such cases, the Processor should yield for
some period of time. By doing this, the Processor is telling the
framework that it should not waste resources triggering this
Processor to run, because there is nothing it can do at the moment;
it is better to use those resources to allow other Processors to run.
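A small sketch of the corresponding call, assuming the reachability check below stands in for whatever condition prevents any progress:

@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    // Hypothetical check: if the remote system is unreachable, no FlowFile
    // can be processed right now, no matter which one is picked up.
    if (!isRemoteSystemReachable()) {
        // Tell the framework to stop scheduling this processor for the
        // configured Yield Duration instead of busy-spinning.
        context.yield();
        return;
    }
    // ... normal processing ...
}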