09 - Apache Spark Streaming

The document provides an introduction to Apache Spark Streaming, detailing its framework for processing large-scale data streams in near-real-time. It covers key concepts such as Structured Spark Streaming, Discretized Streams, and various streaming operations including triggers and watermarks. Additionally, it discusses applications, limitations, and resources for further learning about Spark Streaming.

Uploaded by

i237822

In the name of ALLAH, the Beneficent, the Merciful

9 Apache Spark Streaming

An Introduction

Compiled by
Dr. Muhammad Sajid Qureshi
Contents*

❖ Apache Spark Streaming

▪ Introduction, and applications of spark streaming

▪ Structured Spark Streaming, how it works?

▪ Major concepts

• Discretized Streams, Streaming Sink, Triggers, Watermarking

• Windowed Stream Operations

• Accumulator and Broadcast variables

• Data Persistence and Caching
• Data Stream Checkpointing

▪ Spark Streaming applications, and limitations

* Most of the contents are extracted from:

+ “Apache Spark Docs” available on spark.apache.org.

Apache Spark Streaming 2

Spark Streaming – What?
❖ What is Spark Streaming?
▪ Apache Spark Streaming provides a framework for processing large-scale data streams
• It can provide high throughput by scaling to hundreds of nodes
• Being an in-memory processing engine, it can achieve near-real-time output
▪ It can continuously ingest new data from live sources such as Kafka, Flume, and ZeroMQ to compute a result

• The input data is unbounded and has no predetermined beginning or end.

• The input is a series of events that arrive at the stream processing system.

▪ The framework provides a simple batch-like API for implementing complex algorithms

▪ Integrates with Spark’s batch and interactive processing


Spark Streaming

Features of Spark Streaming

Spark Streaming

Structured Spark Streaming
❖ Structured Spark Streaming
▪ Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

• It uses the existing structured APIs in Spark (DataFrames, Datasets, and SQL)

• Structured Streaming ensures end-to-end, exactly-once processing as well as fault tolerance through checkpointing and write-ahead logs.

▪ A cornerstone of the API is that you should not have to change your query’s code when switching between batch and stream processing; you should only have to specify whether to run that query in a batch or streaming fashion.


How Spark Streaming Works
❖ How Spark Streaming Works

▪ Internally, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.

• Structured Streaming treats a live data stream as a table that is continuously being appended to.

• A query on the input generates the "Result Table".

• At every trigger interval (say, every 1 second), new rows are appended to the Input Table, which eventually updates the Result Table.
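
The table model above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the Structured Streaming API: the function name `run_micro_batches`, the sample batches, and the choice of a running word count as the query are all assumptions made for illustration.

```python
from collections import Counter

def run_micro_batches(batches):
    """Toy model: each batch is a list of new rows (text lines) that
    arrive at one trigger. Returns the final input table and a snapshot
    of the result table after every trigger."""
    input_table = []          # stands in for the unbounded Input Table
    result_table = Counter()  # stands in for the Result Table (word counts)
    snapshots = []
    for new_rows in batches:
        input_table.extend(new_rows)           # new rows are appended
        for line in new_rows:                  # only the new rows are scanned
            result_table.update(line.split())  # the running result is updated
        snapshots.append(dict(result_table))
    return input_table, snapshots

# Two triggers: the first delivers two lines, the second delivers one.
batches = [["cat dog", "dog"], ["owl cat"]]
input_table, snapshots = run_micro_batches(batches)
print(snapshots[0])  # {'cat': 1, 'dog': 2}
print(snapshots[1])  # {'cat': 2, 'dog': 2, 'owl': 1}
```

The sketch updates the result incrementally from the new rows only, which is the intuition behind running the same query as a series of small batch jobs.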


Discretized Streams (DStreams)
❖ Discretized Stream

▪ A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming.

▪ It represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream.

▪ Internally, a DStream is represented by a continuous series of RDDs, the RDD being Spark’s abstraction of an immutable, distributed dataset.
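
As a rough illustration of "a continuous series of RDDs", the plain-Python sketch below cuts a list of timestamped events into fixed-length batches. It does not use Spark: the function name `discretize`, the `batch_interval` parameter, and the sample events are hypothetical, and each inner list merely stands in for one RDD.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of
    length batch_interval. Each inner list stands in for one RDD."""
    if not events:
        return []
    last_ts = max(ts for ts, _ in events)
    n_batches = int(last_ts // batch_interval) + 1
    batches = [[] for _ in range(n_batches)]
    for ts, value in events:
        batches[int(ts // batch_interval)].append(value)
    return batches

# Timestamps are in seconds; the batch interval is 1 second.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
dstream = discretize(events, batch_interval=1.0)
print(dstream)  # [['a', 'b'], ['c'], [], ['d']]

# A transformation on the stream is applied batch by batch,
# producing a new series of batches (a "processed" stream).
upper = [[v.upper() for v in rdd] for rdd in dstream]
print(upper)    # [['A', 'B'], ['C'], [], ['D']]
```

Note that an interval with no events still yields an (empty) batch, so the series stays continuous in time.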

Transformation on DStreams

Streaming Sources

❖ Spark Streaming provides two categories of built-in streaming sources.

▪ Basic sources: sources directly available in the StreamingContext API.

• Examples: file systems and socket connections.

▪ Advanced sources: sources such as Kafka and Kinesis.

• They are available through extra utility classes.

• They require linking against extra dependencies.


Streaming Sink
❖ Streaming Sink
▪ A sink specifies the destination for the result set of a stream. Supported sinks include:

• A file sink, for almost any file format

• A foreach sink for running arbitrary computation on the output records

• A console sink for testing

• A memory sink for debugging

• Apache Kafka 0.10


Streaming Output
❖ Streaming output modes
▪ The "Output" is what gets written out to the external storage. The output can be defined in different modes:

▪ Complete Mode

• The entire updated Result Table is written to the external storage. It is up to the storage connector to decide how to handle writing the entire table.

▪ Append Mode

• Only the new rows appended to the Result Table since the last trigger are written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.

▪ Update Mode

• Only the rows that were updated in the Result Table since the last trigger are written to the external storage (available since Spark 2.1.1). If the query doesn't contain aggregations, it is equivalent to Append mode.
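
The difference between the three modes can be sketched in plain Python by comparing the Result Table before and after a trigger. This is an illustrative toy, not Spark code: the function `emit`, the dictionary representation of the Result Table, and the sample values are assumptions. It also does not enforce the restriction that Append mode applies only when existing rows never change.

```python
def emit(previous, current, mode):
    """Return the rows a sink would receive for one trigger.

    previous, current: dicts mapping key -> value, i.e. the Result
    Table before and after the trigger."""
    if mode == "complete":
        # The whole updated Result Table.
        return dict(current)
    if mode == "append":
        # Only rows whose key did not exist before this trigger.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # Rows that are new or whose value changed since the last trigger.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    raise ValueError(f"unknown output mode: {mode}")

before = {"cat": 1, "dog": 2}
after = {"cat": 2, "dog": 2, "owl": 1}
print(emit(before, after, "complete"))  # {'cat': 2, 'dog': 2, 'owl': 1}
print(emit(before, after, "append"))    # {'owl': 1}
print(emit(before, after, "update"))    # {'cat': 2, 'owl': 1}
```

With a table in which no existing row changes, the `append` and `update` branches return the same rows, which matches the statement that Update mode is equivalent to Append mode for queries without aggregations.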


Triggers
❖ Triggers

▪ Whereas output modes define how data is output, triggers define when data is output, that is, when Structured Streaming should check for new input data and update its result.

▪ By default, Structured Streaming looks for new input records as soon as it has finished processing the last group of input data, giving the lowest latency possible for new results.

▪ This behavior can lead to writing many small output files when the sink is a set of files.

• So, Spark also supports triggers based on processing time (only look for new data at a fixed interval).


Watermarks
❖ Watermarks

▪ Watermarks are a feature of streaming systems that let us specify how late we expect to see data in event time.

• For example, in an application that processes logs from mobile devices, one might expect logs to be up to 30 minutes late due to upload delays.

• Systems that support event time, including Structured Streaming, usually allow setting watermarks to limit how long they need to remember old data.

• Watermarks can also be used to control when to output a result for a particular event-time window.
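
The idea of a lateness threshold can be sketched in plain Python. The example below is a simplified, record-at-a-time model and not how Spark implements watermarks (Spark advances the watermark between micro-batches and applies it to event-time windows and state). The function `filter_late`, the `max_delay` parameter, and the sample events are hypothetical.

```python
def filter_late(events, max_delay):
    """Split events (in arrival order) into accepted and dropped lists.

    The watermark is modeled as:
        largest event time seen so far - max_delay
    An event whose event time is older than the watermark is dropped."""
    max_event_time = None
    accepted, dropped = [], []
    for event_time, payload in events:
        if max_event_time is not None and event_time < max_event_time - max_delay:
            dropped.append((event_time, payload))   # too late, state was discarded
        else:
            accepted.append((event_time, payload))
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time             # advance the watermark base
    return accepted, dropped

# Event times in minutes; arrival order differs from event-time order.
events = [(100, "a"), (140, "b"), (115, "c"), (105, "d")]
accepted, dropped = filter_late(events, max_delay=30)
print(accepted)  # [(100, 'a'), (140, 'b'), (115, 'c')]
print(dropped)   # [(105, 'd')]
```

After the event at minute 140 arrives, the watermark sits at minute 110, so the event from minute 115 is still accepted while the one from minute 105 is considered too late.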


Windowed Stream Operation
❖ Windowed Stream Operation

▪ Spark Streaming also provides windowed computations, which allow transformations to be applied over a sliding window of the data stream.

▪ Every time the window slides over a source DStream, the source RDDs that fall within the window are combined, and the desired operation is performed on them to produce the RDDs of the windowed DStream.
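
A sliding window over a series of batches can be sketched in plain Python as follows. This is not the DStream API: the function name `window`, its parameters (window length and slide interval, both counted in batches), and the sample data are assumptions for illustration, with each inner list standing in for one source RDD.

```python
def window(batches, window_length, slide_interval, operation):
    """Apply `operation` to the combined contents of every window.

    batches: list of lists, one inner list per batch interval.
    window_length, slide_interval: measured in number of batches."""
    results = []
    # The window is evaluated every `slide_interval` batches.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        # Combine the batches that fall within the window.
        combined = [x for rdd in batches[start:end] for x in rdd]
        results.append(operation(combined))
    return results

batches = [[3], [1, 4], [1], [5, 9], [2], [6]]
# Window of 3 batches, sliding by 2 batches, summing the values.
print(window(batches, window_length=3, slide_interval=2, operation=sum))
# [8, 20, 22]
```

Because the window length (3) exceeds the slide interval (2), consecutive windows overlap by one batch, so some records contribute to two results.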

Data Persistence and Caching

Accumulator Variables

Broadcast Variables

Data Stream Checkpointing

Spark Streaming – Applications
❖ Many important applications must process large streams of live data and provide results in near-real time:
▪ Social network trends

▪ Website statistics

▪ Intrusion detection systems

❖ Spark Streaming is suitable for:

▪ Live notifications and alerts

▪ Real-time reporting
▪ Incremental ETL
▪ Updating data to serve in real time
▪ Real-time decision making


Spark Streaming – Limitations
❖ Stream processing with Spark Streaming commonly involves challenges such as:
▪ Processing out-of-order data based on application timestamps (also called event time)

▪ Maintaining large amounts of state

▪ Supporting high-data throughput

▪ Processing each event exactly once despite machine failures

▪ Handling load imbalance and stragglers


▪ Responding to events at low latency

▪ Joining with external data in other storage systems

▪ Determining how to update output sinks as new events arrive

▪ Writing data transactionally to output systems


Related Resources
❖ Apache Spark Streaming Tutorials

▪ https://spark.apache.org/docs/latest/streaming-programming-guide.html

▪ https://www.youtube.com/watch?v=qlJmjkgHZ88

▪ https://www.youtube.com/watch?v=UuRhEmqqhRM&t=2s

▪ https://www.youtube.com/watch?v=sSkAuTqfBA8

Contents’ Review

❖ Apache Spark Streaming

▪ Introduction, and applications of spark streaming

▪ Structured Spark Streaming, how it works?

▪ Major concepts

• Discretized Streams, Streaming Sink, Triggers, Watermarking

• Windowed Stream Operations

• Accumulator and Broadcast variables

• Data Persistence and Caching
• Data Stream Checkpointing

▪ Spark Streaming applications, and limitations

You are Welcome!
Questions? Comments! Suggestions!!
