In the name of Allah, the Beneficent, the Merciful
7 Apache Spark
An Introduction
Compiled by
Dr. Muhammad Sajid Qureshi
Contents*
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications
* Most of the contents are extracted from:
+ “Hadoop: The Definitive Guide” (Chapter 19) by Tom White, O’Reilly Media Inc., 4th edition.
What is Spark?
❖ Spark – an alternative to MapReduce
▪ Spark is an efficient, open-source in-memory cluster computing framework.
• It stores large datasets in a distributed fashion and applies parallel processing to the data.
• Being an in-memory data processing engine, it can also process real-time data streams.
• Spark can run on YARN and works with Hadoop file formats and storage backends like HDFS.
▪ Data analysts use Spark to process, analyze, transform, and visualize data at very large scale.
• It efficiently performs iterative and interactive operations on data.
• Spark provides a user-friendly interface for programming a cluster with implicit parallel data processing
and fault tolerance, as sketched below.
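A minimal PySpark sketch of this programming model (an added illustration, not from the original slides); it assumes PySpark is installed and reads a hypothetical local file named data.txt:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The classic word count: Spark parallelizes these steps across the cluster implicitly.
counts = (sc.textFile("data.txt")                  # hypothetical input file, read as an RDD of lines
            .flatMap(lambda line: line.split())    # split each line into words
            .map(lambda word: (word, 1))           # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts for each word in parallel

print(counts.take(10))                             # action: triggers the computation
sc.stop()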
What is Spark?
❖ Spark – an alternative to MapReduce
▪ Spark supports multiple programming languages
• It provides APIs in Scala, Python, R, and Java.
▪ Spark was started as a project in the AMPLab at the University of California, Berkeley, in 2009.
▪ In 2014, Spark set a world record for large-scale sorting when Databricks used it to win the Daytona
GraySort benchmark, sorting 100 TB of data.
Spark’s features
❖ Fast data processing
▪ Spark uses the Resilient Distributed Dataset (RDD) for quick and reliable data processing.
❖ In-memory computing
▪ Because data resides in RAM, read/write operations and processing are very fast.
❖ Support for multiple programming languages
▪ Scala, Python, R, and Java
❖ Fault tolerance
▪ The RDD mechanism makes Spark a reliable data processing engine, as lost partitions can be recomputed if a
node fails.
❖ Rich libraries for data processing
▪ Spark offers rich libraries to process, analyze, transform, and visualize data.
Hadoop Versus Spark
Hadoop
o The MapReduce framework is slower than Spark because it loads data from storage devices before
processing it.
o MapReduce is designed for batch processing of large datasets.
o Data nodes store intermediate results in their local storage, which slows down compilation of the final
results.
o Hadoop uses Kerberos for authentication, which is complicated to configure.
o Writing a program (usually in Java) for the Hadoop framework requires more effort.
Spark
o Being an in-memory processing engine, Spark can do parallel data processing much faster than MapReduce.
o Spark can do batch processing as well as process real-time (streaming) data.
o Spark uses a shared secret for easier authentication; additionally, it can run on YARN and use Kerberos.
o Spark supports Scala (along with Python, R, and Java), which simplifies programming.
Components of Spark
❖ Spark has the following major components
▪ Spark Core
• It manages the RDDs in Spark to enable efficient and reliable data processing.
• It is responsible for memory management, job scheduling, and fault tolerance.
• It also coordinates with storage systems like HDFS, HBase, and DBMSs.
▪ Spark SQL
• This component allows fast processing of structured and semi-structured data (a short sketch follows
this list).
▪ Spark Streaming
• It offers a lightweight API to process real-time data (data streams).
▪ Spark MLlib
• A simple, scalable library containing various machine learning algorithms for big data analytics.
▪ Spark GraphX
• Specially designed for storing and processing graph data (as used by LinkedIn, DBpedia, Meta, etc.).
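A minimal sketch of the Spark SQL component (an added illustration, not from the original slides); it assumes PySpark is installed and reads a hypothetical semi-structured file people.json:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

df = spark.read.json("people.json")          # load semi-structured JSON into a DataFrame
df.createOrReplaceTempView("people")         # expose the DataFrame as a SQL view

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")   # query it with plain SQL
adults.show()

spark.stop()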
Spark Ecosystem
Spark Components – SQL
Spark Components – Streaming
Spark Components – MLlib
Spark Components – GraphX
The RDD in Spark
❖ Role of the Resilient Distributed Dataset (RDD) in Spark
▪ Spark uses RDDs for quick and reliable distributed in-memory data processing.
▪ An RDD is a read-only collection of objects that is partitioned across multiple nodes in a cluster.
• In a Spark program, one or more RDDs are initially loaded as input.
• Then, through a series of transformations, they are turned into a set of target RDDs on which an
action is performed.
▪ RDDs are resilient because Spark can automatically reconstruct a lost partition by recomputing it from the
RDDs that it was computed from.
▪ An RDD can be created in three ways (as sketched below):
• From an in-memory collection of objects (known as parallelizing a collection)
• Using a dataset from external storage (such as HDFS)
• Transforming an existing RDD
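A brief PySpark sketch of the three creation routes (an added illustration; it assumes an existing SparkContext named sc, and the HDFS path is hypothetical):

rdd1 = sc.parallelize([1, 2, 3, 4, 5])              # 1) parallelizing an in-memory collection
rdd2 = sc.textFile("hdfs:///data/input/logs.txt")   # 2) loading a dataset from external storage (HDFS)
rdd3 = rdd1.map(lambda x: x * x)                    # 3) transforming an existing RDD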
Operations on RDDs
Transformation and Actions on RDDs
❖ Spark provides two categories of operations on RDDs: transformations and actions.
▪ A transformation generates a new RDD from an existing one.
• If the return type of an operation is an RDD, it is a transformation; otherwise, it is an action.
▪ An action triggers a computation on an RDD and does something with the results—either returning them to
the user, or saving them to external storage.
• Actions have an immediate effect, but transformations do not—they are lazy.
▪ Spark’s library contains a rich set of operators (a short sketch follows this list), including transformations for:
• Mapping
• Grouping, aggregating, and repartitioning
• Sampling and joining RDDs
• Treating RDDs as sets
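A short PySpark sketch of the lazy/eager distinction (an added illustration; it assumes an existing SparkContext named sc, and the output path is hypothetical):

nums = sc.parallelize(range(1, 101))

evens   = nums.filter(lambda x: x % 2 == 0)    # transformation: returns a new RDD, nothing runs yet
squares = evens.map(lambda x: x * x)           # transformation: still lazy

total = squares.reduce(lambda a, b: a + b)     # action: returns a value to the user, so a job runs
squares.saveAsTextFile("out/squares")          # action: saves results to external storage
print(total)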
Operations on RDDs
Spark Applications, Jobs, Stages, and Tasks
❖ The Job in Spark
▪ A Spark job is made up of an arbitrary directed acyclic graph (DAG) of stages, each of which is
roughly equivalent to a map or reduce phase in MapReduce.
▪ Stages are split into tasks by the Spark runtime and are run in parallel on partitions of an RDD spread across
the cluster—just like tasks in MapReduce.
▪ A job always runs in the context of an application (represented by a SparkContext instance) that serves to
group RDDs and shared variables.
▪ An application can run more than one job, in series or in parallel, and provides the mechanism for a job to
access an RDD that was cached by a previous job in the same application (as sketched below).
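A small PySpark sketch of one application running two jobs that share a cached RDD (an added illustration; it assumes an existing SparkContext named sc, and the input path is hypothetical):

lines  = sc.textFile("hdfs:///data/input/events.txt")
errors = lines.filter(lambda l: "ERROR" in l).cache()    # cache so later jobs in this application reuse it

job1 = errors.count()                                    # first action -> first job
job2 = errors.filter(lambda l: "timeout" in l).count()   # second action -> second job, reads the cached RDD
print(job1, job2)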
How Spark runs a job?
❖ Anatomy of a Spark Job Run
▪ Job submission
▪ DAG creation
▪ Task scheduling
▪ Task execution
The stages and RDDs in a Spark job
Spark Job Executors
❖ Spark uses executors to run the tasks that make up a job.
▪ First, the executor keeps a local cache of all the dependencies that previous tasks have used.
▪ Second, it deserializes the task code from the serialized bytes that were sent as part of the launch-task
message.
▪ Third, it executes the task code; the task runs in the same JVM as the executor, so no separate process is
launched for it.
• Tasks can return a result to the driver. The result is serialized and sent to the executor
backend, and then back to the driver as a status update message.
Cluster Managers for Spark
❖ Cluster Managers for Spark
▪ Spark requires a cluster manager to manage the lifecycle of executors that run the jobs.
▪ Spark is compatible with a variety of cluster managers with different characteristics (a configuration sketch follows this list):
• Local cluster
✓ In local mode there is a single executor running in the same JVM as the driver.
✓ This mode is useful for testing or running small jobs.
• Standalone
✓ It is a simple distributed implementation that runs a single master and multiple workers.
Cluster Managers for Spark
❖ Cluster Managers for Spark
• Apache Mesos
✓ Mesos is a general-purpose cluster resource manager that allows fine-grained sharing of
resources across different applications.
• Hadoop YARN
✓ When YARN is used as a cluster manager for Spark, each Spark application corresponds to an
instance of a YARN application, and each executor runs in its own YARN container.
✓ The Mesos and YARN cluster managers are superior to the standalone manager because they take into
account the resource needs of other applications running on the cluster and enforce a scheduling
policy across all of them.
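A hedged configuration sketch showing how the cluster manager is selected through the master URL (an added illustration; the host names below are hypothetical placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("cluster-manager-sketch")

conf.setMaster("local[*]")                        # local mode: one executor in the driver's JVM
# conf.setMaster("spark://master-host:7077")      # standalone cluster manager
# conf.setMaster("mesos://mesos-host:5050")       # Apache Mesos
# conf.setMaster("yarn")                          # Hadoop YARN (often set via spark-submit --master yarn)

sc = SparkContext(conf=conf)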
Spark Cluster Managers
Spark on YARN
❖ Spark deployment on YARN
▪ Running Spark on YARN provides better integration with other Hadoop components.
▪ Spark can be deployed on YARN in two modes:
• Client mode
✓ In client mode, the driver program runs in the client process.
✓ The client mode is required for programs having an interactive component, such as spark-shell or
pyspark.
• Cluster mode
✓ In this mode, the driver program runs on the cluster in the YARN Application Master.
✓ YARN cluster mode is appropriate for production jobs (a submission sketch follows).
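A hedged sketch of submitting a small PySpark application to YARN in each mode (an added illustration; app.py is a hypothetical script name, and the flags shown are standard spark-submit options):

# app.py -- a minimal PySpark application intended to be submitted to YARN.
# Typical submissions:
#   client mode:  spark-submit --master yarn --deploy-mode client  app.py
#   cluster mode: spark-submit --master yarn --deploy-mode cluster app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yarn-mode-sketch").getOrCreate()
print(spark.sparkContext.parallelize(range(10)).sum())   # a tiny job so the run does some work
spark.stop()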
How Spark executors are started in YARN client mode
How Spark executors are started in YARN cluster mode
Applications of Spark
Spark Use Case
Related Resources
❖ Apache Spark Tutorials
▪ Apache Spark
• https://www.youtube.com/watch?v=QaoJNXW6SQo&t=3s
▪ Understanding Apache Spark
• https://www.youtube.com/watch?v=znBa13Earms
▪ How Apache Spark runs a job
• https://www.youtube.com/watch?v=jDkLiqlyQaY
Contents’ Review
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications
You are Welcome!
Questions?
Comments!
Suggestions!!