In the name of ALLAH, the Beneficent, the Merciful
9 Apache Spark Streaming
An Introduction
Compiled by
Dr. Muhammad Sajid Qureshi
Contents*
❖ Apache Spark Streaming
▪ Introduction, and applications of spark streaming
▪ Structured Spark Streaming, how it works?
▪ Major concepts
• Discretized Streams, Streaming Sink, Triggers, Watermarking
• Windowed Stream Operations
MSQ
• Accumulator and Broadcast variables
• Data Persistence and Caching
• Data Stream Checkpointing
▪ Spark Streaming applications, and limitations
* Most of the contents are extracted from:
+ “Apache Spark Docs” available on apache.spark.org.
Apache Spark Streaming 2
Spark Streaming – What?
❖ What is Spark Streaming?
▪ Apache Spark Streaming provides a framework to process large scale data streams
• It can provide high throughput by scaling up to 100s of data nodes
• Being an in-memory processing engine, it can achieve near-real-time output
▪ It can continuously ingest new data from live data streams from KaCa, Flume, ZeroMQ, to compute a
result
• The input data is unbounded and has no predetermined beginning or end.
MSQ
• Series of events that arrive at the stream processing system
▪ The framework provides a simple batch-like API for implementing complex algorithms
▪ Integrates with Spark’s batch and interactive processing
Apache Spark Streaming 3
Spark Streaming
MSQ
Apache Spark Streaming 4
Features of Spark Streaming
MSQ
Apache Spark Streaming 5
Spark Streaming
MSQ
Apache Spark Streaming 6
Structured Spark Streaming
❖ Structured Spark Streaming
▪ Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL
engine.
• It uses the existing structured APIs in Spark (Data Frames, Datasets, and SQL)
• Structured Streaming ensures end-to-end, exactly-once processing as well as fault-tolerance
through checkpointing and write-ahead logs.
▪ A cornerstone of the API is that you should not have to change your query’s code when doing batch or
MSQ
stream processing—you should have to specify only whether to run that query in a batch or streaming
fashion.
Apache Spark Streaming 7
How Spark Streaming Works
❖ How Spark Streaming Works
▪ Internally, Structured Streaming queries are processed using a micro-batch processing engine, which
processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as
100 milliseconds and exactly-once fault-tolerance guarantees.
• Structured Streaming treats a live data stream as a table that is being continuously appended.
• A query on the input will generate the "Result Table".
MSQ
• Every trigger interval (say, every 1second), new rows get appended to the Input Table, which
eventually updates the Result Table.
Apache Spark Streaming 8
How Spark Streaming Works
MSQ
Apache Spark Streaming 9
How Spark Streaming Works
MSQ
Apache Spark Streaming 10
How Spark Streaming Works
MSQ
Apache Spark Streaming 11
Discretized Streams (DStreams)
❖ Discretized Stream
▪ Discretized Stream or DStream is the basic abstraction provided by Spark Streaming.
▪ It represents a continuous stream of data, either the input data stream received from source, or the
processed data stream generated by transforming the input stream.
▪ Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an
immutable, distributed dataset
MSQ
Apache Spark Streaming 12
Discretized Streams (DStreams)
MSQ
Apache Spark Streaming 13
Transformation on DStreams
MSQ
Apache Spark Streaming 14
Streaming Sources
❖ Spark Streaming provides two categories of built-in streaming sources.
▪ Basic sources: Sources directly available in the Streaming Context API.
• Examples: file systems, and socket connections.
▪ Advanced sources: Sources like Kafka, Kinesis, etc.
• They are available through extra utility classes.
MSQ
• These require linking against extra dependencies.
Apache Spark Streaming 15
Streaming Sink
❖ Streaming Sink
▪ Sinks specify the destination for the result set of that stream
• Almost any file format
▪ A foreach sink for running arbitrary computation on the output records
• A console sink for testing
• A memory sink for debugging
MSQ
• Apache Kafka 0.10
Apache Spark Streaming 16
Streaming Output
❖ Streaming output modes
▪ The "Output" is what gets written out to the external storage. The output can be defined in different
modes:
▪ Complete Mode
• The entire updated Result Table will be written to the external storage. It is up to the storage
connector to decide how to handle writing of the entire table.
▪ Append Mode
MSQ
• Only the new rows appended in the Result Table since the last trigger will be written to the
external storage. This is applicable only on the queries where existing rows in the Result Table are
not expected to change.
▪ Update Mode
• Only the rows that were updated in the Result Table since the last trigger will be written to the
external storage (available since Spark 2.1.1).If the query doesn't contain aggregations; it will be
equivalent to Append mode.
Apache Spark Streaming 17
Triggers
❖ Triggers
▪ Like output modes define how data is output, triggers define when data is output and when Structured
Streaming should check for new input data and update its result.
▪ By default, Structured Streaming will look for new input records as soon as it has finished processing
the last group of input data, giving the lowest latency possible for new results.
▪ This behavior can lead to writing many small output files when the sink is a set of files.
MSQ
• So, Spark also supports triggers based on processing time (only look for new data at a fixed
interval).
Apache Spark Streaming 18
Watermarks
❖ Watermarks
▪ Watermarks are a feature of streaming systems that allow us to specify how late they can expect to see
data in event time.
• For example, in an application that processes logs from mobile devices, one might expect logs to
be up to 30 minutes late due to upload delays.
• Systems that support event time, including Structured Streaming, usually allow setting watermarks
MSQ
to limit how long they need to remember old data.
• Watermarks can also be used to control when to output a result for a particular event time
window.
Apache Spark Streaming 19
Windowed Stream Operation
❖ Windowed Stream Operation
▪ Spark Streaming also provides windowed computations, which allows transformations over a sliding
window of data stream.
▪ Every time, the window slides over a source DStream, the source RDDs, that fall under the window are
combined and the desired operation is performed to get the resultant RDDs.
MSQ
Apache Spark Streaming 20
Windowed Stream Operation
MSQ
Apache Spark Streaming 21
Data Persistence and Caching
MSQ
Apache Spark Streaming 22
Accumulator Variables
MSQ
Apache Spark Streaming 23
Broadcast Variables
MSQ
Apache Spark Streaming 24
Data Stream Checkpointing
MSQ
Apache Spark Streaming 25
Spark Streaming – Applications
❖ Many important applications must process large streams of live data and provide results in
near-real time:
▪ Social network trends
▪ Website statistics
▪ Intrusion detection systems
❖ Spark Streaming is suitable for processing:
MSQ
▪ Live notification and alerts
▪ Real time reporting
▪ Incremental ETL
▪ Update data to server in real time
▪ Real time decision making
Apache Spark Streaming 26
Spark Streaming – Applications
MSQ
Apache Spark Streaming 27
Spark Streaming – Applications
MSQ
Apache Spark Streaming 28
Spark Streaming – Limitations
❖ Spark Streaming is commonly used in processing:
▪ Processing out-of-order data based on application timestamps (also called event time)
▪ Maintaining large amounts of state
▪ Supporting high-data throughput
▪ Processing each event exactly once despite machine failures
▪ Handling load imbalance and stragglers
MSQ
▪ Responding to events at low latency
▪ Joining with external data in other storage systems
▪ Determining how to update output sinks as new events arrive
▪ Writing data transactionally to output systems
Apache Spark Streaming 29
Related Resources
❖ Apache Spark Streaming Tutorials
▪ https://spark.apache.org/docs/latest/streaming-programming-guide.html
▪ https://www.youtube.com/watch?v=qlJmjkgHZ88
▪ https://www.youtube.com/watch?v=UuRhEmqqhRM&t=2s
▪ https://www.youtube.com/watch?v=sSkAuTqfBA8
MSQ
Apache Spark Streaming 30
Contents’ Review
❖ Apache Spark Streaming
▪ Introduction, and applications of spark streaming
▪ Structured Spark Streaming, how it works?
▪ Major concepts
• Discretized Streams, Streaming Sink, Triggers, Watermarking
Windowed Stream Operations
MSQ
• Accumulator and Broadcast variables
• Data Persistence and Caching
• Data Stream Checkpointing
You are Welcome !
▪ Spark Streaming applications, and limitations Questions ?
Comments !
Suggestions !!
Apache Spark Streaming 31