Streaming Ecosystem
Reza Farivar
Capital One
reza.farivar@capitalone.com
Components of a streaming ecosystem
• Gather the data
• Funnel
• Distributed Queue
• Real-Time Processing
• Semi-Real-Time Processing
• Real-time OLAP
Step 1: Gather the Data
• Apache NiFi is a good distributed funnel
• Originally developed at the NSA
• Over 8 years of development
• Open sourced in 2014 and picked up by Hortonworks
• Great visual UI to design a data flow
• Ships with many processor types out of the box
• But not well suited to heavyweight distributed processing
• The same flow graph is executed on all the nodes
NiFi Components
• FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
• Processor
• Performs the work; can access FlowFiles (see the sketch after this list)
• Connection
• Links between processors
• Queues that can be dynamically prioritized
• Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports
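• As a concrete sketch of the Processor concept, here is a minimal custom processor using NiFi's Java extension API (AbstractProcessor); the class name TagProcessor, the attribute key ingest.source, and the single "success" relationship are illustrative assumptions, not part of NiFi itself:

import java.util.Collections;
import java.util.Set;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

// Hypothetical processor: tags each FlowFile with an attribute and passes it on
public class TagProcessor extends AbstractProcessor {

    static final Relationship SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Tagged FlowFiles")
            .build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();   // take the next FlowFile off an incoming connection
        if (flowFile == null) {
            return;                          // nothing queued right now
        }
        // add a key/value attribute without touching the content
        flowFile = session.putAttribute(flowFile, "ingest.source", "nifi-demo");
        session.transfer(flowFile, SUCCESS); // route it to the "success" connection
    }

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(SUCCESS);
    }
}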
NiFi GUI
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
NiFi Site-to-Site
• Site-to-Site allows very easy pushing of data from one data center to another
• That makes NiFi a great choice for a distributed funnel
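• A rough sketch of that push using NiFi's site-to-site client library (nifi-site-to-site-client); the remote URL and input-port name below are placeholders:

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.nifi.remote.Transaction;
import org.apache.nifi.remote.TransferDirection;
import org.apache.nifi.remote.client.SiteToSiteClient;

public class SiteToSiteDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and input-port name of the receiving NiFi instance
        SiteToSiteClient client = new SiteToSiteClient.Builder()
                .url("http://remote-nifi:8080/nifi")
                .portName("From Remote DC")
                .build();

        Transaction transaction = client.createTransaction(TransferDirection.SEND);
        // send one FlowFile: content plus key/value attributes
        transaction.send("hello from DC 1".getBytes(StandardCharsets.UTF_8),
                Collections.singletonMap("source.dc", "dc-1"));
        transaction.confirm();   // verify checksums with the remote side
        transaction.complete();  // commit the transfer
        client.close();
    }
}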
Step 2: Distributed Queue
• Pub-sub model
• Producers call publish(topic, msg); consumers subscribe to topics
• Kafka is a very popular example
[Diagram: producers publish messages to topics; the publish-subscribe system delivers each message to the consumers subscribed to that topic]
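• A minimal producer sketch in Java; the broker address localhost:9092 and the topic name "events" are assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // publish(topic, msg): send one record to the "events" topic
            producer.send(new ProducerRecord<>("events", "key-1", "hello stream"));
        }
    }
}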
Kafka Architecture
• Distributed, high-throughput pub-sub messaging system
• Fast, scalable, durable
• Main use cases: log aggregation, real-time processing, monitoring, queueing
• Originally developed by LinkedIn
• Implemented in Scala/Java
[Diagram: producers send messages to a cluster of brokers coordinated by ZooKeeper; consumers read from the brokers]
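• The consumer side under the same assumptions (local broker, topic "events"): it joins a consumer group and polls the brokers in a loop; the Duration-based poll comes from the newer Kafka client API:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("group.id", "demo-group");               // consumer group for offset tracking
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // poll the brokers for new records, waiting up to 100 ms
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}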
Kafka Manager
• There are some CLI tools:
kafka-console-producer
kafka-console-consumer
kafka-topics
kafka-consumer-offset-checker
• Some very new open-source projects for monitoring Kafka
• kafka-manager by Yahoo
• https://github.com/yahoo/kafka-manager
Step 3: Distributed Processing
• Once data is in the Kafka message broker, we need to process it
• Filter
• Join
• Windowing
• Business logic
• Real-time requirements
• Sub-millisecond to 10 ms latency
Storm
• Apache Storm
• Built at BackType, which was acquired by Twitter
• Written in Clojure
Storm Architecture
Storm programming
• Topology
• Spouts
• Bolts
• Tuples
• Streams
• topologyBuilder API
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 10);
builder.setBolt("exclaim1", ne ExclamationBolt(), 3)
.shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
.shuffleGrouping("exclaim1");
Example topology
• Storm is great for non-trivial large-scale processing
• Mature enterprise-level features, including multitenancy and security
• Ongoing work on resource-aware scheduling
Step 5: Micro-batch Processing / SQL / ML
• Instead of real-time event-by-event processing, we can do micro-batches
• Reduced overheads
• Fault tolerance (Kappa architecture)
• Higher latency
Spark
• Spark was a project out of Berkeley starting in 2010
• Has become very popular
• Among the most contributed-to open-source projects in the big data domain
• RDD: Resilient Distributed Dataset
Spark Streaming
• Window a bit of data
• Run a batch
• Repeat
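• A minimal sketch of that window/batch/repeat loop with Spark Streaming's Java API; the socket source on localhost:9999 and the 1-second batch interval are illustrative assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchDemo {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchDemo");
        // "window a bit of data": each micro-batch covers 1 second of input
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // assumed source: a text stream on a local socket
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // "run a batch": count the records in each micro-batch
        JavaDStream<Long> counts = lines.count();
        counts.print();   // "repeat": Spark schedules this job every batch interval

        jssc.start();            // start receiving and processing
        jssc.awaitTermination(); // run until stopped
    }
}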
Spark ML, Graph, etc.
• Advantage of Spark Streaming:
• Rich ecosystem of big data tools
• Spark SQL
• Spark ML
• Spark GraphX
• SparkR
• Disadvantage:
• Not really streaming
Benchmark: ETL pipeline
Three-way Comparison
• Flink and Storm have similar linear performance profiles
• These two systems process an incoming event as it becomes available
• Spark Streaming has much higher latency, but is expected to handle higher throughputs
• The system behaves as a step function, a direct result of its micro-batching nature
Side note: in-memory key-value store
• Redis
• Cassandra
Step 6: OLAP (Online Analytical Processing)
• Business Intelligence
• Multidimensional data analytics
• Analyze multidimensional data interactively
• Basic Operations
• Consolidation (roll-up, aggregation in dimensions)
• Drill-down (filter)
• Slicing and dicing (Look at the data from different viewpoints)
Druid
• Developed at Metamarkets in 2011
• RDBMSs: too slow
• NoSQL key-value stores: fast, but precomputation is very slow and memory use grows exponentially
• Gaining in popularity
• Open sourced (Apache license) in late 2012
• OLAP queries
• Column-oriented
• Sub-second query time (average query time 0.5 seconds)
• Real-time streaming ingestion
• Scalable
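• To show what those OLAP queries look like on the wire, here is a sketch that POSTs a native Druid timeseries query to a broker; the datasource "events", the metric field "count", and the broker address localhost:8082 are assumptions for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidQueryDemo {
    public static void main(String[] args) throws Exception {
        // Native Druid queries are JSON documents; this one sums a "count"
        // metric per hour over one day (datasource and field names assumed)
        String query = """
            { "queryType":   "timeseries",
              "dataSource":  "events",
              "granularity": "hour",
              "intervals":   ["2016-01-01/2016-01-02"],
              "aggregations": [ { "type": "longSum", "name": "total", "fieldName": "count" } ] }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/druid/v2"))  // assumed broker endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // JSON array of {timestamp, result} rows
    }
}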
Druid
• Arbitrary slicing and dicing of data
Druid Architecture
Druid Bitmap Index
• This is one of the reasons Druid is so fast
• Dictionary encoding
• Bitmap index
• Compression ratio: 1 bit per record
• A query becomes a logical AND/OR over a few thousand numbers, giving lightning-fast queries
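• A toy sketch of the idea (not Druid's actual implementation, which uses compressed bitmaps): dictionary-encode a column, keep one bitmap per distinct value with one bit per record, and answer filters with bitwise AND/OR:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BitmapIndexDemo {
    public static void main(String[] args) {
        // A column of 6 records; Druid would dictionary-encode the strings
        String[] city = {"SF", "NY", "SF", "LA", "NY", "SF"};

        // One bitmap per distinct value, 1 bit per record
        Map<String, BitSet> index = new HashMap<>();
        for (int row = 0; row < city.length; row++) {
            index.computeIfAbsent(city[row], v -> new BitSet(city.length)).set(row);
        }

        // WHERE city = 'SF' OR city = 'LA' becomes a single bitwise OR
        BitSet match = (BitSet) index.get("SF").clone();
        match.or(index.get("LA"));
        System.out.println("matching rows: " + match);  // prints {0, 2, 3, 5}
    }
}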
Step 7: BI
• Pivot
• web-based exploratory visualization UI for Druid
• Easily filter, split, visualize, etc.
• Tableau and SQL not natively supported
• But wait!
Pivot
Druid and Spark
• Druid’s native API is JSON
• No Tableau or SQL support
• But there is hope!
https://github.com/SparklineData/spark-druid-olap
• Connect Druid to Tableau through Spark
Why Druid and Spark together?
• Spark is great as a general engine
• Everything and the kitchen sink
• Queries can take a long time
• Still much faster than Hive on YARN
• Druid is optimized for column-based time-series queries
Questions?
Email: reza.farivar@capitalone.com