[go: up one dir, main page]

0% found this document useful (0 votes)
17 views31 pages

09 - Apache Spark Streaming

The document provides an introduction to Apache Spark Streaming, detailing its framework for processing large-scale data streams in near-real-time. It covers key concepts such as Structured Spark Streaming, Discretized Streams, and various streaming operations including triggers and watermarks. Additionally, it discusses applications, limitations, and resources for further learning about Spark Streaming.

Uploaded by

i237822
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views31 pages

09 - Apache Spark Streaming

The document provides an introduction to Apache Spark Streaming, detailing its framework for processing large-scale data streams in near-real-time. It covers key concepts such as Structured Spark Streaming, Discretized Streams, and various streaming operations including triggers and watermarks. Additionally, it discusses applications, limitations, and resources for further learning about Spark Streaming.

Uploaded by

i237822
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 31

In the name of ALLAH, the Beneficent, the Merciful

9 Apache Spark Streaming


An Introduction

Compiled by
Dr. Muhammad Sajid Qureshi
Contents*

❖ Apache Spark Streaming

▪ Introduction, and applications of spark streaming

▪ Structured Spark Streaming, how it works?

▪ Major concepts

• Discretized Streams, Streaming Sink, Triggers, Watermarking


• Windowed Stream Operations
MSQ

• Accumulator and Broadcast variables


• Data Persistence and Caching
• Data Stream Checkpointing

▪ Spark Streaming applications, and limitations

* Most of the contents are extracted from:


+ “Apache Spark Docs” available on apache.spark.org.

Apache Spark Streaming 2


Spark Streaming – What?
❖ What is Spark Streaming?
▪ Apache Spark Streaming provides a framework to process large scale data streams
• It can provide high throughput by scaling up to 100s of data nodes
• Being an in-memory processing engine, it can achieve near-real-time output
▪ It can continuously ingest new data from live data streams from KaCa, Flume, ZeroMQ, to compute a
result

• The input data is unbounded and has no predetermined beginning or end.


MSQ

• Series of events that arrive at the stream processing system

▪ The framework provides a simple batch-like API for implementing complex algorithms

▪ Integrates with Spark’s batch and interactive processing

Apache Spark Streaming 3


Spark Streaming
MSQ

Apache Spark Streaming 4


Features of Spark Streaming
MSQ

Apache Spark Streaming 5


Spark Streaming
MSQ

Apache Spark Streaming 6


Structured Spark Streaming
❖ Structured Spark Streaming
▪ Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL
engine.

• It uses the existing structured APIs in Spark (Data Frames, Datasets, and SQL)

• Structured Streaming ensures end-to-end, exactly-once processing as well as fault-tolerance


through checkpointing and write-ahead logs.

▪ A cornerstone of the API is that you should not have to change your query’s code when doing batch or
MSQ

stream processing—you should have to specify only whether to run that query in a batch or streaming
fashion.

Apache Spark Streaming 7


How Spark Streaming Works
❖ How Spark Streaming Works

▪ Internally, Structured Streaming queries are processed using a micro-batch processing engine, which
processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as
100 milliseconds and exactly-once fault-tolerance guarantees.

• Structured Streaming treats a live data stream as a table that is being continuously appended.

• A query on the input will generate the "Result Table".


MSQ

• Every trigger interval (say, every 1second), new rows get appended to the Input Table, which
eventually updates the Result Table.

Apache Spark Streaming 8


How Spark Streaming Works
MSQ

Apache Spark Streaming 9


How Spark Streaming Works
MSQ

Apache Spark Streaming 10


How Spark Streaming Works
MSQ

Apache Spark Streaming 11


Discretized Streams (DStreams)
❖ Discretized Stream

▪ Discretized Stream or DStream is the basic abstraction provided by Spark Streaming.

▪ It represents a continuous stream of data, either the input data stream received from source, or the
processed data stream generated by transforming the input stream.

▪ Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an


immutable, distributed dataset
MSQ

Apache Spark Streaming 12


Discretized Streams (DStreams)
MSQ

Apache Spark Streaming 13


Transformation on DStreams
MSQ

Apache Spark Streaming 14


Streaming Sources

❖ Spark Streaming provides two categories of built-in streaming sources.

▪ Basic sources: Sources directly available in the Streaming Context API.

• Examples: file systems, and socket connections.

▪ Advanced sources: Sources like Kafka, Kinesis, etc.

• They are available through extra utility classes.


MSQ

• These require linking against extra dependencies.

Apache Spark Streaming 15


Streaming Sink
❖ Streaming Sink
▪ Sinks specify the destination for the result set of that stream

• Almost any file format

▪ A foreach sink for running arbitrary computation on the output records

• A console sink for testing

• A memory sink for debugging


MSQ

• Apache Kafka 0.10

Apache Spark Streaming 16


Streaming Output
❖ Streaming output modes
▪ The "Output" is what gets written out to the external storage. The output can be defined in different
modes:

▪ Complete Mode

• The entire updated Result Table will be written to the external storage. It is up to the storage
connector to decide how to handle writing of the entire table.

▪ Append Mode
MSQ

• Only the new rows appended in the Result Table since the last trigger will be written to the
external storage. This is applicable only on the queries where existing rows in the Result Table are
not expected to change.

▪ Update Mode

• Only the rows that were updated in the Result Table since the last trigger will be written to the
external storage (available since Spark 2.1.1).If the query doesn't contain aggregations; it will be
equivalent to Append mode.

Apache Spark Streaming 17


Triggers
❖ Triggers

▪ Like output modes define how data is output, triggers define when data is output and when Structured
Streaming should check for new input data and update its result.

▪ By default, Structured Streaming will look for new input records as soon as it has finished processing
the last group of input data, giving the lowest latency possible for new results.

▪ This behavior can lead to writing many small output files when the sink is a set of files.
MSQ

• So, Spark also supports triggers based on processing time (only look for new data at a fixed
interval).

Apache Spark Streaming 18


Watermarks
❖ Watermarks

▪ Watermarks are a feature of streaming systems that allow us to specify how late they can expect to see
data in event time.

• For example, in an application that processes logs from mobile devices, one might expect logs to
be up to 30 minutes late due to upload delays.

• Systems that support event time, including Structured Streaming, usually allow setting watermarks
MSQ

to limit how long they need to remember old data.

• Watermarks can also be used to control when to output a result for a particular event time
window.

Apache Spark Streaming 19


Windowed Stream Operation
❖ Windowed Stream Operation

▪ Spark Streaming also provides windowed computations, which allows transformations over a sliding
window of data stream.

▪ Every time, the window slides over a source DStream, the source RDDs, that fall under the window are
combined and the desired operation is performed to get the resultant RDDs.
MSQ

Apache Spark Streaming 20


Windowed Stream Operation
MSQ

Apache Spark Streaming 21


Data Persistence and Caching
MSQ

Apache Spark Streaming 22


Accumulator Variables
MSQ

Apache Spark Streaming 23


Broadcast Variables
MSQ

Apache Spark Streaming 24


Data Stream Checkpointing
MSQ

Apache Spark Streaming 25


Spark Streaming – Applications
❖ Many important applications must process large streams of live data and provide results in
near-real time:
▪ Social network trends

▪ Website statistics

▪ Intrusion detection systems

❖ Spark Streaming is suitable for processing:


MSQ

▪ Live notification and alerts


▪ Real time reporting
▪ Incremental ETL
▪ Update data to server in real time
▪ Real time decision making

Apache Spark Streaming 26


Spark Streaming – Applications
MSQ

Apache Spark Streaming 27


Spark Streaming – Applications
MSQ

Apache Spark Streaming 28


Spark Streaming – Limitations
❖ Spark Streaming is commonly used in processing:
▪ Processing out-of-order data based on application timestamps (also called event time)

▪ Maintaining large amounts of state

▪ Supporting high-data throughput

▪ Processing each event exactly once despite machine failures

▪ Handling load imbalance and stragglers


MSQ

▪ Responding to events at low latency

▪ Joining with external data in other storage systems

▪ Determining how to update output sinks as new events arrive

▪ Writing data transactionally to output systems

Apache Spark Streaming 29


Related Resources
❖ Apache Spark Streaming Tutorials

▪ https://spark.apache.org/docs/latest/streaming-programming-guide.html

▪ https://www.youtube.com/watch?v=qlJmjkgHZ88

▪ https://www.youtube.com/watch?v=UuRhEmqqhRM&t=2s

▪ https://www.youtube.com/watch?v=sSkAuTqfBA8
MSQ

Apache Spark Streaming 30


Contents’ Review

❖ Apache Spark Streaming

▪ Introduction, and applications of spark streaming

▪ Structured Spark Streaming, how it works?

▪ Major concepts

• Discretized Streams, Streaming Sink, Triggers, Watermarking


Windowed Stream Operations
MSQ

• Accumulator and Broadcast variables


• Data Persistence and Caching
• Data Stream Checkpointing
You are Welcome !
▪ Spark Streaming applications, and limitations Questions ?
Comments !
Suggestions !!

Apache Spark Streaming 31

You might also like