Big Data with Apache Spark Basics

This document provides an introduction to Apache Spark, including:
•  Spark uses Resilient Distributed Datasets (RDDs) as its core abstraction, which allow parallel operations on large datasets distributed across a cluster.
•  RDDs are created from data sources or by transforming existing RDDs; transformations are lazily evaluated, while actions cause computation.
•  The Spark programming model involves creating RDDs, applying transformations, and using actions to return results to the driver program.

Uploaded by

Karthigai Selvan

Introduction to Big Data with Apache Spark

UC BERKELEY
This Lecture
Programming Spark
Resilient Distributed Datasets (RDDs)
Creating an RDD
Spark Transformations and Actions
Spark Programming Model
Python Spark (pySpark)
•  We are using the Python programming interface to Spark (pySpark)
•  pySpark provides an easy-to-use programming abstraction and parallel runtime:
» “Here’s an operation, run it on all of the data”
•  RDDs are the key concept

Spark Driver and Workers
•  A Spark program is two programs:
»  A driver program and a workers program
•  Worker programs run on cluster nodes or in local threads
•  RDDs are distributed across workers
[Diagram: your application (the driver program) holds a SparkContext, which connects through a cluster manager or local threads to the workers; each worker runs a Spark executor and reads from Amazon S3, HDFS, or other storage]

Spark Context
•  A Spark program first creates a SparkContext object
»  Tells Spark how and where to access a cluster
»  pySpark shell and Databricks Cloud automatically create the sc variable
»  IPython and programs must use a constructor to create a new SparkContext

•  Use SparkContext to create RDDs

In the labs, we create the SparkContext for you

Spark Essentials: Master
•  The master parameter for a SparkContext determines which type and size of cluster to use

Master Parameter     Description
local                run Spark locally with one worker thread (no parallelism)
local[K]             run Spark locally with K worker threads (ideally set to number of cores)
spark://HOST:PORT    connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT    connect to a Mesos cluster; PORT depends on config (5050 by default)

In the labs, we set the master parameter for you

Resilient Distributed Datasets
•  The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
•  You construct RDDs
» by parallelizing existing Python collections (lists)
» by transforming existing RDDs
» from files in HDFS or any other storage system

RDDs
•  Programmer specifies number of partitions for an RDD
(Default value used if unspecified)
•  More partitions = more parallelism

RDD split into 5 partitions:
Partition 1   Partition 2   Partition 3   Partition 4   Partition 5
item-1        item-6        item-11       item-16       item-21
item-2        item-7        item-12       item-17       item-22
item-3        item-8        item-13       item-18       item-23
item-4        item-9        item-14       item-19       item-24
item-5        item-10       item-15       item-20       item-25

[Diagram: the 5 partitions are spread across three workers, each running a Spark executor]
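The partition picture can be imitated without Spark. The sketch below is plain Python, not the Spark API: the helper name split_into_partitions is made up for illustration, and it assumes contiguous, near-equal slices, which matches the 25-item, 5-partition layout shown on the slide.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


# Same 25 items as on the slide, split into 5 partitions.
items = ["item-%d" % i for i in range(1, 26)]
partitions = split_into_partitions(items, 5)

for index, part in enumerate(partitions):
    print(index, part)
```

Each slice could then be handed to a different worker, which is why more partitions allow more parallelism.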
RDDs
•  Two types of operations: transformations and actions
•  Transformations are lazy (not computed immediately)
•  Transformed RDD is executed when action runs on it
•  Persist (cache) RDDs in memory or disk
Working with RDDs
•  Create an RDD from a data source: <list>
•  Apply transformations to an RDD: map filter
•  Apply actions to an RDD: collect count

<list> → parallelize → RDD → filter → filteredRDD → map → mappedRDD → collect → Result

The collect action causes the parallelize, filter, and map transforms to be executed
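Lazy evaluation can be seen in miniature with Python generators. This is only an analogy, not how Spark is implemented: the names traced_filter, traced_map, and log are invented here, and list() plays the role of the collect action.

```python
log = []  # records when each step actually runs


def traced_filter(pred, source):
    """Lazy filter that logs every element it inspects."""
    for x in source:
        log.append("filter %d" % x)
        if pred(x):
            yield x


def traced_map(func, source):
    """Lazy map that logs every element it transforms."""
    for x in source:
        log.append("map %d" % x)
        yield func(x)


data = [1, 2, 3, 4]
filtered = traced_filter(lambda x: x % 2 == 0, data)   # nothing runs yet
mapped = traced_map(lambda x: x * 10, filtered)        # still nothing runs

steps_before = len(log)   # no step has executed so far
result = list(mapped)     # the "action": now the whole pipeline runs
print(steps_before, result, len(log))
```

Building the pipeline only records what to do; the work happens when the result is requested.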
Spark References
•  http://spark.apache.org/docs/latest/programming-guide.html

•  http://spark.apache.org/docs/latest/api/python/index.html
Creating an RDD
•  Create RDDs from Python collections (lists)
>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]

>>> rDD = sc.parallelize(data, 4)

>>> rDD

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229

No computation occurs with sc.parallelize()
•  Spark only records how to create the RDD with four partitions
Creating RDDs
•  From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other Hadoop InputFormat, and directory or glob wildcard: /data/201404*

>>> distFile = sc.textFile("README.md", 4)

>>> distFile

MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Creating an RDD from a File
distFile = sc.textFile("...", 4)

•  RDD distributed in 4 partitions
•  Elements are lines of input
•  Lazy evaluation means no execution happens now
Spark Transformations
•  Create new datasets from an existing one
•  Use lazy evaluation: results are not computed right away; instead Spark remembers the set of transformations applied to the base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers
•  Think of this as a recipe for creating the result

Some Transformations

Transformation          Description
map(func)               return a new distributed dataset formed by passing each element of the source through a function func
filter(func)            return a new dataset formed by selecting those elements of the source on which func returns true
distinct([numTasks])    return a new dataset that contains the distinct elements of the source dataset
flatMap(func)           similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq, in Python a list or other iterable, rather than a single item)
Review: Python lambda Functions
•  Small anonymous functions (not bound to a name)
lambda a, b: a + b
» returns the sum of its two arguments
•  Can use lambda functions wherever function objects are required
•  Restricted to a single expression
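A short plain-Python illustration of the points above. The variable names are chosen here for the example only.

```python
# A lambda is an expression that evaluates to a function object.
add = lambda a, b: a + b


def add_named(a, b):
    """The equivalent function written with def."""
    return a + b


# Lambdas can be passed wherever a function object is required.
doubled = list(map(lambda x: x * 2, [1, 2, 3, 4]))
by_length = sorted(["ccc", "a", "bb"], key=lambda s: len(s))

print(add(2, 3), add_named(2, 3), doubled, by_length)
```

Because a lambda is limited to a single expression, anything needing statements (loops, assignments) has to be written with def instead.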
Transformations
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> rdd.map(lambda x: x * 2)
RDD: [1, 2, 3, 4] → [2, 4, 6, 8]

>>> rdd.filter(lambda x: x % 2 == 0)
RDD: [1, 2, 3, 4] → [2, 4]

>>> rdd2 = sc.parallelize([1, 4, 2, 2, 3])
>>> rdd2.distinct()
RDD: [1, 4, 2, 2, 3] → [1, 4, 2, 3]

Function literals (the lambda expressions, shown in green on the slide) are closures automatically passed to workers

Transformations
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(lambda x: [x, x+5])
RDD: [1, 2, 3] → [[1, 6], [2, 7], [3, 8]]

>>> rdd.flatMap(lambda x: [x, x+5])
RDD: [1, 2, 3] → [1, 6, 2, 7, 3, 8]

Function literals (the lambda expressions, shown in green on the slide) are closures automatically passed to workers
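The difference between map and flatMap can be reproduced on an ordinary list. This is a plain-Python analogy using itertools, not a Spark call; it reuses the slide's data and function.

```python
from itertools import chain

data = [1, 2, 3]
func = lambda x: [x, x + 5]

# map-like: exactly one output (here a list) per input element
mapped = [func(x) for x in data]

# flatMap-like: the per-element lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(func(x) for x in data))

print(mapped)       # [[1, 6], [2, 7], [3, 8]]
print(flat_mapped)  # [1, 6, 2, 7, 3, 8]
```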
Transforming an RDD
lines = sc.textFile("...", 4)

comments = lines.filter(isComment)

[Diagram: lines → comments]
Lazy evaluation means nothing executes: Spark saves the recipe for transforming the source
Spark Actions
•  Cause Spark to execute recipe to transform source
•  Mechanism for getting results out of Spark
Some Actions

Action                      Description
reduce(func)                aggregate dataset’s elements using function func. func takes two arguments and returns one, and is commutative and associative so that it can be computed correctly in parallel
take(n)                     return an array with the first n elements
collect()                   return all the elements as an array. WARNING: make sure the result will fit in the driver program
takeOrdered(n, key=func)    return n elements ordered in ascending order or as specified by the optional key function
Getting Data Out of RDDs
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a * b)
Value: 6

>>> rdd.take(2)
Value: [1,2] # as list

>>> rdd.collect()
Value: [1,2,3] # as list
Getting Data Out of RDDs

>>> rdd = sc.parallelize([5,3,1,2])
>>> rdd.takeOrdered(3, lambda s: -1 * s)
Value: [5,3,2] # as list
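The same results can be checked locally with the standard library. This sketch is an analogy only: functools.reduce and heapq.nsmallest stand in for the reduce and takeOrdered actions, and the two-way split in parts is a hypothetical partitioning used to show why func must be associative and commutative.

```python
from functools import reduce
import heapq

data = [5, 3, 1, 2]
multiply = lambda a, b: a * b

product = reduce(multiply, data)                                 # like reduce(func)
first_two = data[:2]                                             # like take(2)
smallest_three = heapq.nsmallest(3, data)                        # like takeOrdered(3)
largest_three = heapq.nsmallest(3, data, key=lambda s: -1 * s)   # like takeOrdered(3, key)

# Reduce each partition separately, then combine the partial results.
parts = [[5, 3], [1, 2]]
combined = reduce(multiply, [reduce(multiply, p) for p in parts])

print(product, first_two, smallest_three, largest_three, combined)
```

Reducing per partition and then combining gives the same answer as reducing the whole list, which is what lets the work be spread over workers.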

Spark Programming Model
lines = sc.textFile("...", 4)

print(lines.count())

count() causes Spark to:
•  read data
•  sum within partitions
•  combine sums in driver
Spark Programming Model
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)
print(lines.count(), comments.count())

Spark recomputes lines:
•  read data (again)
•  sum within partitions
•  combine sums in driver
Caching RDDs
lines = sc.textFile("...", 4)
lines.cache() # save, don't recompute!
comments = lines.filter(isComment)
print(lines.count(), comments.count())
[Diagram: the partitions of lines are kept in worker RAM, so comments is computed from the cached copy]
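The effect of cache() can be mimicked with a toy class that counts how often its source is read. Everything here is hypothetical scaffolding (LazyLines, count_where, is_comment, and the sample text); it imitates the recompute-versus-reuse behaviour and is not Spark's caching mechanism.

```python
class LazyLines:
    """Toy stand-in for an RDD of lines: re-reads its source on every action unless cached."""

    def __init__(self, source):
        self.source = source
        self.reads = 0            # how many times the source was read
        self._cached = None
        self._use_cache = False

    def cache(self):
        """Ask for the data to be kept after the first computation."""
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.reads += 1           # simulates reading the file again
        data = list(self.source)
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._compute())

    def count_where(self, pred):
        return sum(1 for line in self._compute() if pred(line))


is_comment = lambda line: line.startswith("#")   # hypothetical stand-in for isComment
text = ["# a comment", "x = 1", "# another", "y = 2"]

uncached = LazyLines(text)
uncached_counts = (uncached.count(), uncached.count_where(is_comment))

cached = LazyLines(text).cache()
cached_counts = (cached.count(), cached.count_where(is_comment))

print(uncached.reads, cached.reads)
```

Both versions return the same counts; only the number of times the source is read differs.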
Spark Program Lifecycle
1.  Create RDDs from external data or parallelize a collection in your driver program
2.  Lazily transform them into new RDDs
3.  cache() some RDDs for reuse
4.  Perform actions to execute parallel computation and produce results
Spark Key-Value RDDs
•  Similar to MapReduce, Spark supports Key-Value pairs
•  Each element of a Pair RDD is a pair tuple

>>> rdd = sc.parallelize([(1, 2), (3, 4)])
RDD: [(1, 2), (3, 4)]
Some Key-Value Transformations

Key-Value Transformation    Description
reduceByKey(func)           return a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) → V
sortByKey()                 return a new dataset of (K, V) pairs sorted by keys in ascending order
groupByKey()                return a new dataset of (K, Iterable<V>) pairs
Key-Value Transformations
>>> rdd = sc.parallelize([(1,2), (3,4), (3,6)])
>>> rdd.reduceByKey(lambda a, b: a + b)
RDD: [(1,2), (3,4), (3,6)] → [(1,2), (3,10)]

>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
>>> rdd2.sortByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] →
             [(1,'a'), (1,'b'), (2,'c')]

Key-Value Transformations
>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
>>> rdd2.groupByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] →
             [(1,['a','b']), (2,['c'])]

Be careful using groupByKey() as it can cause a lot of data movement across the network and create large Iterables at workers
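The three key-value transformations can be approximated on local lists of tuples. The helpers reduce_by_key and group_by_key are made-up names for this sketch, and the results are sorted by key only to make the output predictable; Spark itself makes no such ordering promise for reduceByKey or groupByKey.

```python
from collections import defaultdict


def reduce_by_key(pairs, func):
    """Merge the values of each key with func, a (V, V) -> V function."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


def group_by_key(pairs):
    """Collect all values of each key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


summed = reduce_by_key([(1, 2), (3, 4), (3, 6)], lambda a, b: a + b)
grouped = group_by_key([(1, 'a'), (2, 'c'), (1, 'b')])
sorted_pairs = sorted([(1, 'a'), (2, 'c'), (1, 'b')], key=lambda kv: kv[0])

print(summed, grouped, sorted_pairs)
```

Note that reduce_by_key keeps a single running value per key, while group_by_key has to hold every value, which hints at why groupByKey can be the more expensive choice.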
pySpark Closures
•  Spark automatically creates closures for:
» Functions that run on RDDs at workers
» Any global variables used by those workers
•  One closure per worker
» Sent for every task
» No communication between workers
» Changes to global variables at workers are not sent to driver
[Diagram: the driver sends functions and globals to each of four workers]
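Why worker-side changes never reach the driver can be illustrated with a serialized copy. This is a plain-Python sketch under the assumption that shipping a closure behaves like pickling its globals; run_task_on_worker and counter are hypothetical names, and no real worker process is involved.

```python
import pickle

counter = {"blank_lines": 0}   # a "global" that lives on the driver


def run_task_on_worker(serialized_globals, lines):
    """Simulate a task: deserialize a private copy of the globals and update it."""
    worker_globals = pickle.loads(serialized_globals)
    for line in lines:
        if line == "":
            worker_globals["blank_lines"] += 1
    return worker_globals["blank_lines"]


shipped = pickle.dumps(counter)   # what would be sent along with the task
seen_by_worker = run_task_on_worker(shipped, ["a", "", "b", ""])

print(seen_by_worker, counter["blank_lines"])
```

The worker's copy counts two blank lines, but the driver's original stays at zero.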
Consider These Use Cases
•  Iterative or single jobs with large global variables
» Sending large read-only lookup table to workers
» Sending large feature vector in a ML algorithm to workers

•  Counting events that occur during job execution

» How many input lines were blank?
» How many input records were corrupt?

Problems:
•  Closures are (re-)sent with every job
•  Inefficient to send large data to each worker
•  Closures are one way: driver → worker
pySpark Shared Variables
•  Broadcast Variables
» Efficiently send large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes
•  Accumulators
» Aggregate values from workers back to driver
» Only driver can access value of accumulator
» For tasks, accumulators are write-only
» Use to count errors seen in RDD across workers
Broadcast Variables
•  Keep read-only variable cached on workers
» Ship to each worker only once instead of with each task
•  Example: efficiently give every worker a large dataset
•  Usually distributed using efficient broadcast algorithms
At the driver:
>>> broadcastVar = sc.broadcast([1, 2, 3])

At a worker (in code passed via a closure)
>>> broadcastVar.value
[1, 2, 3]
Broadcast Variables Example
•  Country code lookup for HAM radio call signs
# Lookup the locations of the call signs on the
# RDD contactCounts. We load a list of call sign
# prefixes to country code to support this lookup
# Expensive to send large table: (re-)sent for every processed file
signPrefixes = loadCallSignTable()

def processSignCount(sign_count, signPrefixes):
    country = lookupCountry(sign_count[0], signPrefixes)
    count = sign_count[1]
    return (country, count)

# map() passes one argument, so wrap the two-argument function in a lambda
countryContactCounts = (contactCounts
                        .map(lambda sign_count: processSignCount(sign_count, signPrefixes))
                        .reduceByKey((lambda x, y: x + y)))

From: http://shop.oreilly.com/product/0636920028512.do
Broadcast Variables Example
•  Country code lookup for HAM radio call signs
# Lookup the locations of the call signs on the
# RDD contactCounts. We load a list of call sign
# prefixes to country code to support this lookup
# Efficiently sent once to workers
signPrefixes = sc.broadcast(loadCallSignTable())

def processSignCount(sign_count, signPrefixes):
    country = lookupCountry(sign_count[0], signPrefixes.value)
    count = sign_count[1]
    return (country, count)

# map() passes one argument, so wrap the two-argument function in a lambda
countryContactCounts = (contactCounts
                        .map(lambda sign_count: processSignCount(sign_count, signPrefixes))
                        .reduceByKey((lambda x, y: x + y)))

From: http://shop.oreilly.com/product/0636920028512.do
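A rough back-of-the-envelope comparison of the two versions, in plain Python. All figures are assumptions for illustration: the lookup table contents, the task count, and the worker count are made up, and the arithmetic ignores the broadcast algorithms Spark may use. It only contrasts "once per task" with "once per worker".

```python
import pickle

# Made-up lookup table standing in for the call sign prefix table.
lookup_table = {"prefix-%d" % i: "country-%d" % (i % 50) for i in range(10000)}

num_tasks = 100     # hypothetical number of tasks in the job
num_workers = 4     # hypothetical number of workers

table_bytes = len(pickle.dumps(lookup_table))

bytes_with_closures = table_bytes * num_tasks      # table travels with every task
bytes_with_broadcast = table_bytes * num_workers   # table shipped once per worker

print(table_bytes, bytes_with_closures, bytes_with_broadcast)
```

With these assumed numbers the closure version moves 25 times as many bytes, and the gap grows with the number of tasks per worker.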
Accumulators
•  Variables that can only be “added” to by associative op
•  Used to efficiently implement parallel counters and sums
•  Only driver can read an accumulator’s value, not tasks
>>> accum = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(x):
...     global accum
...     accum += x
...
>>> rdd.foreach(f)
>>> accum.value
Value: 10
Accumulators Example
•  Counting empty lines
file = sc.textFile(inputFile)
# Create Accumulator[Int] initialized to 0
blankLines = sc.accumulator(0)

def extractCallSigns(line):
    global blankLines # Make the global variable accessible
    if (line == ""):
        blankLines += 1
    return line.split(" ")

callSigns = file.flatMap(extractCallSigns)
callSigns.count()  # flatMap is lazy: run an action so the accumulator is updated
print("Blank lines: %d" % blankLines.value)
Accumulators
•  Tasks at workers cannot access accumulator’s values
•  Tasks see accumulators as write-only variables
•  Accumulators can be used in actions or transformations:
» Actions: each task’s update to accumulator is applied only once
» Transformations: no guarantees (use only for debugging)
•  Types: integers, double, long, float
» See lab for example of custom type
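The write-only view from tasks and the read-only view from the driver can be sketched with a small class. This is a toy model, not pySpark's Accumulator: AddOnlyAccumulator, merge, and run_task are hypothetical names, and the "tasks" run one after another in a single process.

```python
class AddOnlyAccumulator:
    """Toy accumulator: tasks contribute totals; only the driver reads the value."""

    def __init__(self, initial=0):
        self.value = initial          # read by the driver only

    def merge(self, task_total):
        """Fold one task's contribution into the driver-side value."""
        self.value += task_total


def run_task(partition):
    """Count blank lines in one partition and return the task-local total."""
    local_total = 0
    for line in partition:
        if line == "":
            local_total += 1
    return local_total


partitions = [["a", "", "b"], ["", ""], ["c"]]
blank_lines = AddOnlyAccumulator(0)

for partition in partitions:          # in Spark these tasks would run on workers
    blank_lines.merge(run_task(partition))

print(blank_lines.value)
```

Because addition is associative, the order in which task totals arrive does not change the final value.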
Summary
•  Programmer specifies number of partitions
•  Spark automatically pushes closures to workers
•  Master parameter specifies number of workers
[Diagram: the driver program holds an RDD whose partitions are spread over three workers, each holding code and part of the RDD]
