In the name of Allah, the Beneficent, the Merciful
7 Apache Spark
An Introduction
Compiled by
Dr. Muhammad Sajid Qureshi
Contents*
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications
* Most of the contents are extracted from:
+ “Hadoop: The Definitive Guide” (Chapter 19) by Tom White, O’Reilly Media Inc., 4th edition.
What is Spark?
❖ Spark – an alternative to MapReduce
▪ Spark is an efficient, open-source in-memory cluster computing framework.
• It stores large datasets in a distributed fashion and applies parallel processing to the data.
• Being an in-memory data processing engine, it can also process real-time data streams.
• Spark can run on YARN and works with Hadoop file formats and storage backends like HDFS.
▪ Data analysts use Spark to process, analyze, transform, and visualize data at very large scale.
• It efficiently performs iterative and interactive operations on data.
• Spark provides a user-friendly interface for programming a cluster with implicit parallel data processing
and fault tolerance, as sketched below.
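A minimal PySpark sketch of this programming model (an added illustration, not from the original slides); it assumes PySpark is installed and reads a hypothetical local file named data.txt:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The classic word count: Spark parallelizes these steps across the cluster implicitly.
counts = (sc.textFile("data.txt")                  # hypothetical input file, read as an RDD of lines
            .flatMap(lambda line: line.split())    # split each line into words
            .map(lambda word: (word, 1))           # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts for each word in parallel

print(counts.take(10))                             # action: triggers the computation
sc.stop()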
What is Spark?
❖ Spark – an alternative to MapReduce
▪ Spark supports multiple programming languages
• It provides APIs in Scala, Python, R, and Java.
▪ Spark was started as a project in the AMPLab at the University of California, Berkeley, in 2009.
▪ In 2014, Spark set a world record for large-scale sorting when Databricks used it to win the Daytona
GraySort benchmark, sorting 100 TB of data.
Spark’s features
❖ Fast data processing
▪ Spark uses the Resilient Distributed Dataset (RDD) for quick and reliable data processing.
❖ In-memory computing
▪ Because data resides in RAM, read/write operations and processing are very fast.
❖ Support for multiple programming languages
▪ Scala, Python, R, and Java
❖ Fault tolerance
▪ The RDD mechanism makes Spark a reliable data processing engine, as lost partitions can be recomputed if a
node fails.
❖ Rich libraries for data processing
▪ Spark offers rich libraries to process, analyze, transform, and visualize data.
Hadoop Versus Spark
Hadoop
o The MapReduce framework is slower than Spark because it loads data from storage devices before
processing it.
o MapReduce is designed for batch processing of large datasets.
o Data nodes store intermediate results in their local storage, which slows down compilation of the final
results.
o Hadoop uses Kerberos for authentication, which is complicated to configure.
o Writing a program (usually in Java) for the Hadoop framework requires more effort.
Spark
o Being an in-memory processing engine, Spark can do parallel data processing much faster than MapReduce.
o Spark can do batch processing as well as process real-time (streaming) data.
o Spark uses a shared secret for easier authentication; additionally, it can run on YARN and use Kerberos.
o Spark supports Scala (along with Python, R, and Java), which simplifies programming.
Components of Spark
❖ Spark has the following major components
▪ Spark Core
• It manages the RDDs in Spark to enable efficient and reliable data processing.
• It is responsible for memory management, job scheduling, and fault tolerance.
• It also coordinates with storage systems like HDFS, HBase, and DBMSs.
▪ Spark SQL
• This component allows fast processing of structured and semi-structured data (a short sketch follows
this list).
▪ Spark Streaming
• It offers a lightweight API to process real-time data (data streams).
▪ Spark MLlib
• A simple, scalable library containing various machine learning algorithms for big data analytics.
▪ Spark GraphX
• Specially designed for storing and processing graph data (as used by LinkedIn, DBpedia, Meta, etc.).
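A minimal sketch of the Spark SQL component (an added illustration, not from the original slides); it assumes PySpark is installed and reads a hypothetical semi-structured file people.json:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

df = spark.read.json("people.json")          # load semi-structured JSON into a DataFrame
df.createOrReplaceTempView("people")         # expose the DataFrame as a SQL view

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")   # query it with plain SQL
adults.show()

spark.stop()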
Spark Ecosystem
Spark Components – SQL
Spark Components – Streaming
Spark Components – MLlib
Spark Components – GraphX
The RDD in Spark
❖ Role of the Resilient Distributed Dataset (RDD) in Spark
▪ Spark uses RDDs for quick and reliable distributed in-memory data processing.
▪ An RDD is a read-only collection of objects that is partitioned across multiple nodes in a cluster.
• In a Spark program, one or more RDDs are initially loaded as input.
• Then, through a series of transformations, they are turned into a set of target RDDs on which an
action is performed.
▪ RDDs are resilient because Spark can automatically reconstruct a lost partition by recomputing it from the
RDDs that it was computed from.
▪ An RDD can be created in three ways (as sketched below):
• From an in-memory collection of objects (known as parallelizing a collection)
• Using a dataset from external storage (such as HDFS)
• Transforming an existing RDD
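A brief PySpark sketch of the three creation routes (an added illustration; it assumes an existing SparkContext named sc, and the HDFS path is hypothetical):

rdd1 = sc.parallelize([1, 2, 3, 4, 5])              # 1) parallelizing an in-memory collection
rdd2 = sc.textFile("hdfs:///data/input/logs.txt")   # 2) loading a dataset from external storage (HDFS)
rdd3 = rdd1.map(lambda x: x * x)                    # 3) transforming an existing RDD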
Operations on RDDs
Transformation and Actions on RDDs
❖ Spark provides two categories of operations on RDDs: transformations and actions.
▪ A transformation generates a new RDD from an existing one.
• If the return type of an operation is an RDD, it is a transformation; otherwise, it is an action.
▪ An action triggers a computation on an RDD and does something with the results—either returning them to
the user, or saving them to external storage.
• Actions have an immediate effect, but transformations do not—they are lazy.
▪ Spark’s library contains a rich set of operators (a short sketch follows this list), including transformations for:
• Mapping
• Grouping, aggregating, and repartitioning
• Sampling and joining RDDs
• Treating RDDs as sets
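A short PySpark sketch of the lazy/eager distinction (an added illustration; it assumes an existing SparkContext named sc, and the output path is hypothetical):

nums = sc.parallelize(range(1, 101))

evens   = nums.filter(lambda x: x % 2 == 0)    # transformation: returns a new RDD, nothing runs yet
squares = evens.map(lambda x: x * x)           # transformation: still lazy

total = squares.reduce(lambda a, b: a + b)     # action: returns a value to the user, so a job runs
squares.saveAsTextFile("out/squares")          # action: saves results to external storage
print(total)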
Operations on RDDs
Spark Applications, Jobs, Stages, and Tasks
❖ The Job in Spark
▪ A Spark job is made up of an arbitrary directed acyclic graph (DAG) of stages, each of which is
roughly equivalent to a map or reduce phase in MapReduce.
▪ Stages are split into tasks by the Spark runtime and are run in parallel on partitions of an RDD spread across
the cluster—just like tasks in MapReduce.
▪ A job always runs in the context of an application (represented by a SparkContext instance) that serves to
group RDDs and shared variables.
▪ An application can run more than one job, in series or in parallel, and provides the mechanism for a job to
access an RDD that was cached by a previous job in the same application (as sketched below).
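A small PySpark sketch of one application running two jobs that share a cached RDD (an added illustration; it assumes an existing SparkContext named sc, and the input path is hypothetical):

lines  = sc.textFile("hdfs:///data/input/events.txt")
errors = lines.filter(lambda l: "ERROR" in l).cache()    # cache so later jobs in this application reuse it

job1 = errors.count()                                    # first action -> first job
job2 = errors.filter(lambda l: "timeout" in l).count()   # second action -> second job, reads the cached RDD
print(job1, job2)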
How Spark runs a job?
❖ Anatomy of a Spark Job Run
▪ Job submission
▪ DAG creation
▪ Task scheduling
▪ Task execution
The stages and RDDs in a Spark job
Spark Job Executors
❖ Spark uses executors to run the tasks that make up a job.
▪ First, the executor keeps a local cache of all the dependencies that previous tasks have used.
▪ Second, it deserializes the task code from the serialized bytes that were sent as part of the launch-task
message.
▪ Third, it executes the task code; the task runs in the same JVM as the executor, so no separate process is
launched for it.
• Tasks can return a result to the driver. The result is serialized and sent to the executor
backend, and then back to the driver as a status update message.
Cluster Managers for Spark
❖ Cluster Managers for Spark
▪ Spark requires a cluster manager to manage the lifecycle of executors that run the jobs.
▪ Spark is compatible with a variety of cluster managers with different characteristics (a configuration sketch follows this list):
• Local cluster
✓ In local mode there is a single executor running in the same JVM as the driver.
✓ This mode is useful for testing or running small jobs.
• Standalone
✓ It is a simple distributed implementation that runs a single master and multiple workers.
Cluster Managers for Spark
❖ Cluster Managers for Spark
• Apache Mesos
✓ Mesos is a general-purpose cluster resource manager that allows fine-grained sharing of
resources across different applications.
• Hadoop YARN
✓ When YARN is used as a cluster manager for Spark, each Spark application corresponds to an
instance of a YARN application, and each executor runs in its own YARN container.
✓ The Mesos and YARN cluster managers are superior to the standalone manager because they take into
account the resource needs of other applications running on the cluster and enforce a scheduling
policy across all of them.
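A hedged configuration sketch showing how the cluster manager is selected through the master URL (an added illustration; the host names below are hypothetical placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("cluster-manager-sketch")

conf.setMaster("local[*]")                        # local mode: one executor in the driver's JVM
# conf.setMaster("spark://master-host:7077")      # standalone cluster manager
# conf.setMaster("mesos://mesos-host:5050")       # Apache Mesos
# conf.setMaster("yarn")                          # Hadoop YARN (often set via spark-submit --master yarn)

sc = SparkContext(conf=conf)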
Spark Cluster Managers
Spark on YARN
❖ Spark deployment on YARN
▪ Running Spark on YARN provides better integration with other Hadoop components.
▪ Spark can be deployed on YARN in two modes:
• Client mode
✓ In client mode, the driver program runs in the client process.
✓ The client mode is required for programs having an interactive component, such as spark-shell or
pyspark.
• Cluster mode
✓ In this mode, the driver program runs on the cluster in the YARN Application Master.
✓ YARN cluster mode is appropriate for production jobs (a submission sketch follows).
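A hedged sketch of submitting a small PySpark application to YARN in each mode (an added illustration; app.py is a hypothetical script name, and the flags shown are standard spark-submit options):

# app.py -- a minimal PySpark application intended to be submitted to YARN.
# Typical submissions:
#   client mode:  spark-submit --master yarn --deploy-mode client  app.py
#   cluster mode: spark-submit --master yarn --deploy-mode cluster app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yarn-mode-sketch").getOrCreate()
print(spark.sparkContext.parallelize(range(10)).sum())   # a tiny job so the run does some work
spark.stop()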
How Spark executors are started in YARN client mode
How Spark executors are started in YARN cluster mode
Applications of Spark
Spark Use Case
Related Resources
❖ Apache Spark Tutorials
▪ Apache Spark
• https://www.youtube.com/watch?v=QaoJNXW6SQo&t=3s
▪ Understanding Apache Spark
• https://www.youtube.com/watch?v=znBa13Earms
▪ How Apache Spark runs a job
• https://www.youtube.com/watch?v=jDkLiqlyQaY
Contents’ Review
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications
You are Welcome!
Questions?
Comments!
Suggestions!!