Lecture 4 - Spark Introduction

The document provides an overview of Apache Spark, detailing its history, components, and functionality. It discusses the evolution from MapReduce to Spark, highlighting the advantages of using Resilient Distributed Datasets (RDDs) for efficient data processing. Key features of Spark include its speed, ease of use, generality, and ability to run across various platforms and data sources.

Agenda
• History of Spark
• Introduction
• Components of Stack
• Resilient Distributed Dataset – RDD

History of Spark

Timeline:
• 2002 – MapReduce @ Google
• 2004 – MapReduce paper
• 2006 – Hadoop @ Yahoo!
• 2008 – Hadoop Summit
• 2010 – Spark paper
• 2014 – Apache Spark becomes an Apache top-level project

History of Spark

circa 1979 – Stanford, MIT, CMU, etc.
• set/list operations in LISP, Prolog, etc., for parallel processing
• www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google
• MapReduce: Simplified Data Processing on Large Clusters
  Jeffrey Dean and Sanjay Ghemawat
• research.google.com/archive/mapreduce.html

circa 2006 – Apache
• Hadoop, originating from the Nutch Project
  Doug Cutting
• research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo
• web-scale search indexing
• Hadoop Summit, HUG, etc.
• developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS
• Elastic MapReduce
• Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
• aws.amazon.com/elasticmapreduce/

MapReduce
• Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Diagram: Input → Map tasks → Reduce tasks → Output]

MapReduce
• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  • Iterative algorithms (machine learning, graphs)
  • Interactive data mining tools (R, Excel, Python)

Data Processing Improvement Goals
• Low latency (interactive) queries on historical data: enable faster decisions
  • E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming): enable decisions on real-time data
  • E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 seconds)
• Sophisticated data processing: enable "better" decisions
  • E.g., anomaly detection, trend analysis

Therefore, people built specialized systems as workarounds…

Specialized Systems
• General batch processing: MapReduce
• Specialized systems for iterative, interactive, streaming, graph, etc. workloads: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4

The State of Spark, and Where We're Going Next
Matei Zaharia
Spark Summit (2013)
youtu.be/nU6vO2EJAb4

Storage vs Processing Wars

NoSQL battles:
• Relational vs NoSQL
• HBase vs Cassandra
• Redis vs Memcached
• Riak vs Couchbase
• CouchDB vs MongoDB
• Neo4j vs Titan vs Giraph vs OrientDB
• Solr vs Elasticsearch

Compute battles:
• MapReduce vs Spark
• Spark Streaming vs Storm
• Hive vs Spark SQL vs Impala
• Mahout vs MLlib vs H2O


Specialized Systems
• General batch processing (2004 – 2013): MapReduce
• Specialized systems (2007 – 2015?): Pregel, Giraph, Dremel, Drill, Tez, Mahout, Storm, S4, Impala, GraphLab
  (iterative, interactive, ML, streaming, graph, SQL, etc.)
• General unified engine (2014 – ?): Spark

[Diagram: competing stacks ("vs" slide) – Spark libraries (SQL, MLlib, Streaming) over Tachyon and cluster managers (YARN, Mesos)]

10x – 100x

Support Interactive and Streaming Computations
• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     • Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     • e.g., 1TB = 1 billion records @ 1KB each
  3. Memory density (still) grows with Moore's law
     • RAM/SSD hybrid memories at horizon

[Figure: high-end datacenter node – 16 cores, 10Gbps network, 128-512GB RAM (40-60GB/s), 1-4TB SSD (1-4GB/s, x4 disks), 10-30TB HDD (0.2-1GB/s, x10 disks)]

Support Interactive and Streaming Computations
• Increase parallelism
• Why?
  • Reduce work per node → improve latency
• Techniques:
  • Low-latency parallel scheduler that achieves high locality
  • Optimized parallel communication patterns (e.g., shuffle, broadcast)
  • Efficient recovery from failures and straggler mitigation

[Figure: increasing parallelism reduces query response time from T to Tnew (< T)]

Berkeley AMPLab
• "Launched" January 2011: 6-year plan
  • 8 CS faculty
  • ~40 students
  • 3 software engineers
• Organized for collaboration

Berkeley AMPLab
• Funding:
  • XData, CISE Expedition Grant
  • Industrial, founding sponsors
  • 18 other sponsors
• Goal: next-generation analytics data stack for industry & research:
  • Berkeley Data Analytics Stack (BDAS)
  • Released as open source

Databricks
making big data simple

• Founded in late 2013
  • by the creators of Apache Spark
  • Original team from UC Berkeley AMPLab
• Raised $47 million in 2 rounds

• Databricks Cloud:
  "A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products."

• The Databricks team contributed more than 75% of the code added to Spark in 2014.

History of Spark

Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
USENIX HotCloud (2010)
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
NSDI (2012)
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

History of Spark

"We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

In both cases, keeping data in memory can improve performance by an order of magnitude."

April 2012

History of Spark

The State of Spark, and Where We're Going Next
Matei Zaharia
Spark Summit (2013)
youtu.be/nU6vO2EJAb4

History of Spark

Spark Streaming: analyze real-time streams of data in ½-second intervals

TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))

History of Spark

Spark SQL: seamlessly mix SQL queries with Spark programs

sqlCtx = new HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)

History of Spark

GraphX: analyze networks of nodes and edges using graph processing

graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}

https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf

History of Spark

BlinkDB: SQL queries with bounded errors and bounded response times

https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf

History of Spark
• Unlike the various specialized systems, Spark's goal was to generalize MapReduce to support new apps within the same engine
• Two reasonably small additions are enough to express the previous models:
  • fast data sharing
  • general DAGs
• This allows for an approach which is more efficient for the engine, and much simpler for the end users

Directed Acyclic Graph - DAG
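To make the DAG idea concrete, here is a minimal PySpark sketch, assuming an existing SparkContext named sc; the log paths are hypothetical. Each transformation only adds a node to the graph, and the whole graph is executed as one job when an action is called.

# A minimal sketch of a DAG of RDD transformations (paths are hypothetical).
errors = sc.textFile("hdfs://logs/app1.log").filter(lambda l: "ERROR" in l)
warnings = sc.textFile("hdfs://logs/app2.log").filter(lambda l: "WARN" in l)

# union() merges the two branches into one DAG; still no job has run.
problems = errors.union(warnings)

# The count() action triggers Spark to schedule the whole graph as one job.
print(problems.count())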

What is Apache Spark

• Spark is a unified analytics engine for large-scale data processing
• Speed: run workloads 100x faster
  • High performance for both batch and streaming data
  • Computations run in memory

[Figure: Logistic regression in Hadoop and Spark]

What is Apache Spark

• Ease of Use: write applications quickly in Java, Scala, Python, R, SQL
  • Offers over 80 high-level operators
  • Use them interactively from Scala, Python, R, and SQL

df = spark.read.json("logs.json")
df.where("age > 21") \
  .select("name.first").show()

Spark's Python DataFrame API: read JSON files with automatic schema inference

What is Apache Spark

• Generality: combine SQL, streaming, and complex analytics
  • Provides libraries including SQL and DataFrames, Spark Streaming, MLlib, GraphX
  • Wide range of workloads, e.g., batch applications, iterative algorithms, interactive queries, streaming

What is Apache Spark

• Run Everywhere:
  • runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud
  • accesses data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, etc.

Comparison between Hadoop and Spark

100TB Daytona Sort Competition

• Spark sorted the same data 3x faster using 10x fewer machines than Hadoop MR in 2013.
• All the sorting took place on disk (HDFS), without using Spark's in-memory cache!

Components of Stack

The Spark stack

[Figure: the Spark stack]

The Spark stack

• Spark Core:
  • contains the basic functionality of Spark, including task scheduling, memory management, fault recovery, etc.
  • provides APIs for building and manipulating RDDs
• Spark SQL:
  • allows querying structured data via SQL and the Hive Query Language
  • allows combining SQL queries and data manipulations in Python, Java, Scala (see the sketch below)
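As a concrete illustration of mixing SQL with programmatic data manipulation, here is a minimal PySpark sketch, assuming a Spark 2.x SparkSession named spark; the people.json file and the view name are hypothetical.

# Hedged sketch: register a DataFrame as a temporary view and query it with SQL.
people = spark.read.json("people.json")            # hypothetical input file
people.createOrReplaceTempView("people")           # expose it to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter(adults.age < 65).show()              # continue with DataFrame operations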

The Spark stack

• Spark Streaming: enables processing of live streams of data via APIs
• MLlib:
  • contains common machine learning functionality
  • provides multiple types of algorithms: classification, regression, clustering, etc.
• GraphX:
  • library for manipulating graphs and performing graph-parallel computations
  • extends the Spark RDD API

The Spark stack

• Cluster Managers:
  • Hadoop YARN
  • Apache Mesos
  • Standalone Scheduler (simple manager included in Spark)

Resilient Distributed Dataset – RDD
• RDD Basics
• Creating RDDs
• RDD Operations
• Common Transformations and Actions
• Persistence (Caching)

RDD Basics
• RDD:
  • Immutable distributed collection of objects
  • Split into multiple partitions => can be computed on different nodes
• All work in Spark is expressed as
  • creating new RDDs
  • transforming existing RDDs
  • calling actions on RDDs

Example
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block (Block 1-3) and keeps results in memory (Cache 1-3), returning results to the driver]

RDD Basics
• Two types of operations: transformations and actions
• Transformations: construct a new RDD from a previous one, e.g., filter data
• Actions: compute a result based on an RDD, e.g., count elements, get the first element

Transformations
• Create new RDDs from existing RDDs
• Lazy evaluation
  • See the whole chain of transformations
  • Compute just the data needed
• Persist contents:
  • persist an RDD in memory to reuse it in the future
  • persisting RDDs on disk is also possible

Typical workflow of a Spark program

1. Create some input RDDs from external data
2. Transform them to define new RDDs using transformations like filter()
3. Ask Spark to persist() any intermediate RDDs that will need to be reused
4. Launch actions such as count() and first() to kick off a parallel computation (see the sketch below)
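A minimal PySpark sketch of the four steps above, assuming an existing SparkContext named sc; the log file path is hypothetical.

from pyspark import StorageLevel

# 1. Create an input RDD from external data (path is hypothetical).
lines = sc.textFile("hdfs:///logs/app.log")

# 2. Transform it to define new RDDs.
errors = lines.filter(lambda line: "ERROR" in line)

# 3. Persist an intermediate RDD that will be reused.
errors.persist(StorageLevel.MEMORY_ONLY)

# 4. Launch actions to kick off the parallel computation.
print(errors.count())
print(errors.first())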

Resilient Distributed Dataset – RDD
• RDD Basics
• Creating RDDs
• RDD Operations
• Common Transformations and Actions
• Persistence (Caching)

Two ways to create RDDs

1. Parallelizing a collection: uses parallelize()
• Python
lines = sc.parallelize(["pandas", "i like pandas"])
• Scala
val lines = sc.parallelize(List("pandas", "i like pandas"))
• Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));

Two ways to create RDDs

2. Loading data from external storage
• Python
lines = sc.textFile("/path/to/README.md")
• Scala
val lines = sc.textFile("/path/to/README.md")
• Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");

Resilient Distributed Dataset – RDD
• RDD Basics
• Creating RDDs
• RDD Operations
• Common Transformations and Actions
• Persistence (Caching)

RDD Operations
• Two types of operations
  • Transformations: operations that return new RDDs, e.g., map(), filter()
  • Actions: operations that return a result to the driver program or write it to storage, e.g., count(), first()
• Treated differently by Spark
  • Transformations: lazy evaluation
  • Actions: eager execution, a job runs as soon as the action is called (see the sketch below)
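A minimal sketch of the difference, assuming an existing SparkContext sc and a hypothetical log.txt file: the transformation returns instantly without touching the data, while the action launches the actual job.

# Transformations are recorded but not executed yet (lazy evaluation).
inputRDD = sc.textFile("log.txt")                       # hypothetical file
errorsRDD = inputRDD.filter(lambda x: "error" in x)     # returns immediately

# The action forces Spark to read the file and run the filter.
numErrors = errorsRDD.count()                           # job runs here
print(numErrors)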

Transformation
• Example 1. Use filter()
• Python
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)

• Scala
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))

• Java
JavaRDD<String> inputRDD = sc.textFile("log.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(
  new Function<String, Boolean>() {
    public Boolean call(String x) {
      return x.contains("error");
    }
  });

Transformation
• filter()
  • does not change the existing inputRDD
  • returns a pointer to an entirely new RDD
  • inputRDD can still be reused
• union()

errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)

• transformations can operate on any number of input RDDs

Transformation
• Spark keeps track of dependencies between RDDs, called the lineage graph (see the sketch below)
• Allows recovering lost data
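A minimal sketch of viewing the lineage graph, reusing the badLinesRDD built on the previous slide; toDebugString() is Spark's standard way to print the recorded chain of dependencies.

# Inspect the lineage (dependency graph) Spark has recorded for this RDD.
# In PySpark, toDebugString() returns bytes, so decode it before printing.
print(badLinesRDD.toDebugString().decode("utf-8"))
# The output lists the union, the two filters, and the original textFile source.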

Actions
• Example. Count the number of errors
• Python
print("Input had " + str(badLinesRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)

• Scala
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)

• Java
System.out.println("Input had " + badLinesRDD.count() + " concerning lines");
System.out.println("Here are 10 examples:");
for (String line: badLinesRDD.take(10)) {
  System.out.println(line);
}

Resilient Distributed Dataset – RDD
• RDD Basics
• Creating RDDs
• RDD Operations
• Common Transformations and Actions
• Persistence (Caching)

RDD Basics

Transformations: map, flatMap, filter, sample, union, groupByKey, reduceByKey, join, cache
Actions: reduce, collect, count, save, lookupKey, … (see the word-count sketch below)
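To see several of these operations together, here is a minimal PySpark word-count sketch, assuming an existing SparkContext sc; the input path is hypothetical.

# flatMap / map / reduceByKey are transformations; take is an action.
lines = sc.textFile("hdfs:///data/input.txt")             # hypothetical path
words = lines.flatMap(lambda line: line.split())          # one element per word
pairs = words.map(lambda word: (word, 1))                 # (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)            # sum counts per word
print(counts.take(10))                                    # action: bring a sample to the driver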

Transformations
[Table: common transformations]

Actions
[Table: common actions]

Resilient Distributed Dataset – RDD
• RDD Basics
• Creating RDDs
• RDD Operations
• Common Transformations and Actions
• Persistence (Caching)

Persistence levels
[Table: RDD persistence (storage) levels]
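As a minimal PySpark sketch of choosing a persistence level (assuming result is an existing RDD, mirroring the Scala example on the next slide):

from pyspark import StorageLevel

# Keep the RDD in memory, spilling to disk if it does not fit.
result.persist(StorageLevel.MEMORY_AND_DISK)
print(result.count())        # the first action computes and persists the partitions
result.unpersist()           # drop it from the cache when no longer needed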

Persistence
• Example (Scala)

val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)
println(result.count())
println(result.collect().mkString(","))

DataFrame
• A primary abstraction in Spark 2.0
• Immutable once constructed
• Tracks lineage information to efficiently re-compute lost data
• Enables operations on a collection of elements in parallel
• To construct a DataFrame:
  • by parallelizing existing Python collections (lists)
  • by transforming an existing Spark or pandas DataFrame
  • from files in HDFS or other storage systems
Using DataFrame
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
[Row(name=u'Alice', age=1),
 Row(name=u'Bob', age=2),
 Row(name=u'Bob', age=2)]

Transformations
• Create new DataFrames from an existing one
• Use lazy evaluation
  • Nothing executes
  • Spark saves the recipe for transforming the source

Transformation       Description
select(*cols)        Selects columns from this DataFrame
drop(col)            Returns a new DataFrame that drops the specified column
filter(func)         Returns a new DataFrame formed by selecting those rows of the source on which func returns true
where(func)          where is an alias for filter
distinct()           Returns a new DataFrame that contains the distinct rows of the source DataFrame
sort(*cols, **kw)    Returns a new DataFrame sorted by the specified columns and in the sort order specified by kw

Using Transformations
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df2 = df1.distinct()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df3 = df2.sort("age", ascending=False)
[Row(name=u'Bob', age=2), Row(name=u'Alice', age=1)]

Actions
• Cause Spark to execute the recipe to transform the source
• Mechanisms for getting results out of Spark

Action               Description
show(n, truncate)    Prints the first n rows of this DataFrame
take(n)              Returns the first n rows as a list of Row
collect()            Returns all the records as a list of Row (*)
count()              Returns the number of rows in this DataFrame
describe(*cols)      Exploratory data analysis function that computes statistics (count, mean, stddev, min, max) for numeric columns

Using Actions
>>> data = [('Alice', 1), ('Bob', 2)]
>>> df = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df.count()
2
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+

Caching
>>> linesDF = sqlContext.read.text('…')
>>> linesDF.cache()
>>> commentsDF = linesDF.filter(isComment)
>>> print linesDF.count(), commentsDF.count()
>>> commentsDF.cache()

Spark Programming Routine

• Create DataFrames from external data, or with createDataFrame from a collection in the driver program
• Lazily transform them into new DataFrames
• cache() some DataFrames for reuse
• Perform actions to execute parallel computation and produce results (see the sketch below)
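A minimal PySpark sketch of the routine above, assuming a Spark 2.x SparkSession named spark; the file path and column names are hypothetical.

# 1. Create a DataFrame from external data (hypothetical CSV with a header).
events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# 2. Lazily transform it into new DataFrames.
errors = events.filter(events.level == "ERROR").select("service", "level")

# 3. Cache a DataFrame that will be reused.
errors.cache()

# 4. Actions trigger the parallel computation.
print(errors.count())
errors.groupBy("service").count().show()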

DataFrames versus RDDs

• For new users familiar with data frames in other programming languages, this API should make them feel at home
• For existing Spark users, the API will make Spark easier to program than using RDDs
• For both sets of users, DataFrames will improve performance through intelligent optimizations and code generation

Write Less Code: Input & Output

Unified interface for reading/writing data in a variety of formats.

val df = sqlContext.
  read.
  format("json").
  option("samplingRatio", "0.1").
  load("/Users/spark/data/stuff.json")

df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")

Write Less Code: Input & Output (annotated)
• read and write functions create new builders for doing I/O

• Builder methods specify:
  • format
  • partitioning
  • handling of existing data

• load(…), save(…), or saveAsTable(…) finish the I/O specification

Data Sources supported by DataFrames

[Figure: logos of built-in sources (JDBC, { JSON }, …) and external sources, "and more …"]

Write Less Code: High-Level Operations

• Solve common problems concisely with DataFrame functions (see the sketch below):
  • selecting columns and filtering
  • joining different data sources
  • aggregation (count, sum, average, etc.)
  • plotting results (e.g., with Pandas)
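A minimal PySpark sketch of these high-level operations, assuming a SparkSession named spark and two hypothetical input files with hypothetical column names; toPandas() hands the small aggregated result to pandas for plotting.

from pyspark.sql import functions as F

# Hypothetical inputs: user profiles and purchase events.
users = spark.read.parquet("hdfs:///data/users.parquet")
purchases = spark.read.json("hdfs:///data/purchases.json")

# Select, filter, join, and aggregate in a few lines.
summary = (users.select("user_id", "country")
                .join(purchases, "user_id")
                .groupBy("country")
                .agg(F.count("*").alias("orders"), F.avg("amount").alias("avg_amount")))

# Small results can be plotted via pandas.
summary.toPandas().plot(x="country", y="avg_amount", kind="bar")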

Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Scala):

rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) => (num1 + num2, count1 + count2) }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Spark (Python):

rdd = sc.textFile(...).map(lambda s: s.split())
rdd.map(lambda x: (x[0], (float(x[1]), 1))).\
  reduceByKey(lambda t1, t2: (t1[0] + t2[0], t1[1] + t2[1])).\
  map(lambda t: (t[0], t[1][0] / t[1][1])).\
  collect()

Write Less Code: Compute an Average

Using RDDs:

rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) => (num1 + num2, count1 + count2) }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Using DataFrames:

import org.apache.spark.sql.functions._

val df = rdd.map(a => (a(0), a(1))).toDF("key", "value")
df.groupBy("key")
  .agg(avg("value"))
  .collect()

Full API docs: Scala, Java, Python, R

Architecture
• A master-worker type architecture
  • A driver or master node
  • Worker nodes
• The master sends work to the workers and either instructs them to pull data from memory or from hard disk (or from another source like S3 or HDFS)

Architecture (2)
• A Spark program first creates a SparkContext object
• SparkContext tells Spark how and where to access a cluster
• The master parameter for a SparkContext determines which type and size of cluster to use (see the sketch after the table)

Master parameter     Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster
mesos://HOST:PORT    Connect to a Mesos cluster
yarn                 Connect to a YARN cluster
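A minimal PySpark sketch of creating a SparkContext with an explicit master setting; the app name and the "local[4]" value are just illustrative.

from pyspark import SparkConf, SparkContext

# Run locally with 4 worker threads; swap in spark://..., mesos://..., or yarn as needed.
conf = SparkConf().setMaster("local[4]").setAppName("ArchitectureDemo")
sc = SparkContext(conf=conf)

print(sc.master)     # e.g., local[4]
sc.stop()            # release the cluster resources when done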

Lifetime of a Job in Spark
Acknowledgement and References

Books:
• Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. Learning Spark. O'Reilly
• TutorialsPoint. Spark Core Programming

Slides:
• Paco Nathan. Intro to Apache Spark
• Harold Liu. Berkeley Data Analytics Stack
• DataBricks. Intro to Spark Development

Q&A

Thank you for your attention!