Lecture 4 - Spark Introduction
Spark Introduction
Agenda
• History of Spark
• Introduction
• Components of the Stack
• Resilient Distributed Dataset – RDD
History of Spark
2004 – MapReduce paper
2006 – Hadoop @Yahoo!
2010 – Spark paper
History of Spark
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm
MapReduce
Most current cluster programming models are
based on acyclic data flow from stable storage to
stable storage
MapReduce
• Acyclic data flow is inefficient for applications that
repeatedly reuse a working set of data:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining tools (R, Excel, Python)
Specialized Systems
Pregel, Giraph, MapReduce, Impala, GraphLab, Storm, S4
Couchbase vs CouchDB, MongoDB vs Neo4j vs Titan vs Giraph vs OrientDB,
Mahout vs MLlib vs H2O, Solr vs Elasticsearch, …
Specialized Systems
General Batch Processing (2004 – 2013): MapReduce
Specialized Systems (2007 – 2015?): Pregel, Giraph, Tez, Drill, Dremel, Mahout, Storm, S4, Impala, GraphLab
General Unified Engine (2014 – ?): Spark (iterative, interactive, ML, streaming, graph, SQL, etc.)
(Figure: Spark stack with SQL, MLlib, Streaming; cluster managers: YARN vs Mesos; in-memory storage: Tachyon)
(Figure: 10x – 100x speedup)
(Figure: reduced running time Tnew, where Tnew < T)
Berkeley AMPLab
• “Launched” January 2011: 6-year plan
• 8 CS faculty
• ~40 students
• 3 software engineers
• Organized for collaboration
Berkeley AMPLab
• Funding:
• XData, CISE Expedition Grant
Databricks
Databricks Cloud:
“A unified platform for building Big Data pipelines – from
ETL to Exploration and Dashboards, to Advanced Analytics
and Data Products.”
History of Spark
2004 – MapReduce paper
2010 – Spark paper
History of Spark
“We present Resilient Distributed Datasets
(RDDs), a distributed memory abstraction that
lets programmers perform in-memory
computations on large clusters in a fault-
tolerant manner.”
April 2012
History of Spark
History of Spark
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
History of Spark
History of Spark
https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
History of Spark
https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
History of Spark
• Unlike the various specialized systems, Spark’s goal
was to generalize MapReduce to support new apps
within the same engine
• Two reasonably small additions are enough to
express the previous models:
• fast data sharing
• general DAGs
• This allows an approach that is more efficient for the
engine and much simpler for end users
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
Spark's Python DataFrame API
Read JSON files with automatic schema inference
Components of the Stack
RDD Basics
• RDD:
• Immutable distributed collection of objects
• Split into multiple partitions => can be computed on
different nodes
• All work in Spark is expressed as
• creating new RDDs
• transforming existing RDDs
• calling actions on RDDs
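A minimal PySpark sketch of these three kinds of work; the context, data, and variable names here are illustrative, not from the slides:

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-basics")   # local context with two worker threads
nums = sc.parallelize([1, 2, 3, 4])           # create a new RDD
squares = nums.map(lambda x: x * x)           # transform it into another RDD
print(squares.collect())                      # action: [1, 4, 9, 16]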
Example
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")            // Base RDD
errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

(Diagram: the driver ships tasks to workers; each worker reads its block of the file, caches the filtered messages in memory (Cache 1–3, Block 1–3), and sends results back to the driver.)
RDD Basics
• Two types of operations: transformations and
actions
• Transformations: construct a new RDD from a
previous one, e.g., filter data
• Actions: compute a result based on an RDD, e.g.,
count elements, get first element
Transformations
• Create new RDDs from existing RDDs
• Lazy evaluation
• See the whole chain of transformations
• Compute just the data needed
• Persist contents:
• persist an RDD in memory to reuse it in the future
• persisting RDDs on disk is also possible
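A short PySpark sketch of lazy evaluation plus persistence; the file name and variables are illustrative:

lines = sc.textFile("log.txt")                 # nothing is read yet (lazy)
errors = lines.filter(lambda l: "ERROR" in l)  # still lazy: only the recipe is recorded
errors.persist()                               # keep the filtered RDD in memory once computed
print(errors.count())                          # first action: reads the file and filters it
print(errors.first())                          # reuses the persisted RDD, no re-read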
RDD Operations
• Two types of operations
• Transformations: operations that return a new RDD, e.g.,
map(), filter()
• Actions: operations that return a result to the driver
program or write it to storage, such as count(), first()
• Treated differently by Spark
• Transformations: lazy evaluation
• Actions: eager execution, run as soon as they are called
Transformation
• Example 1. Use filter()
• Python
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
• Scala
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line =>
line.contains("error"))
• Java
JavaRDD<String> inputRDD = sc.textFile("log.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(
  new Function<String, Boolean>() {
    public Boolean call(String x) {
      return x.contains("error");
    }
  });
Transformation
• filter()
• does not change the existing inputRDD
• returns a pointer to an entirely new RDD
• inputRDD can still be reused
• union()
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD=inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
• transformations can operate on any number of input
RDDs
Transformation
• Spark keeps track of dependencies between RDDs,
called the lineage graph
• This allows lost data to be recovered
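One way to inspect the lineage Spark records is toDebugString(), shown here on the RDDs from the union example above (a sketch, not from the slides):

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
print(badLinesRDD.toDebugString().decode())   # the lineage graph; toDebugString() returns bytes in PySpark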
Actions
• Example. count the number of errors
• Python
print "Input had " + badLinesRDD.count() + " concerning lines"
print "Here are 10 examples:"
for line in badLinesRDD.take(10):
print line
• Scala
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
• Java
System.out.println("Input had " + badLinesRDD.count() + " concerning
lines")
System.out.println("Here are 10 examples:")
for (String line: badLinesRDD.take(10)) {
System.out.println(line);
}
RDD Basics
Transformations: map, flatMap, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Actions: reduce, collect, count, save, lookupKey, …
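A small word-count sketch that combines a few of the operations above (the input file name is a hypothetical placeholder):

words = sc.textFile("input.txt") \
          .flatMap(lambda line: line.split())    # transformation: one record per word
counts = words.map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)   # transformation: sum the counts per word
print(counts.collect()[:5])                      # action: results come back to the driver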
Persistence levels
Persistence
• Example
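A minimal PySpark sketch of persisting an RDD at an explicit storage level (file and variable names are illustrative):

from pyspark import StorageLevel

errors = sc.textFile("log.txt").filter(lambda l: "ERROR" in l)
errors.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk if they do not fit in memory
print(errors.count())                          # first action materializes and caches the RDD
print(errors.first())                          # served from the persisted copy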
DataFrame
• A primary abstraction in Spark 2.0
• Immutable once constructed
• Tracks lineage information to efficiently re-compute lost data
• Enables operations on collections of elements in parallel
• Ways to construct a DataFrame:
• By parallelizing existing Python collections (lists)
• By transforming an existing Spark or pandas DataFrame
• From files in HDFS or other storage system
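Hedged sketches of the three construction routes, reusing the sqlContext and logs.json that appear elsewhere in these slides:

import pandas as pd

# 1. From an existing Python collection (a list of tuples)
df1 = sqlContext.createDataFrame([('Alice', 1), ('Bob', 2)], ['name', 'age'])

# 2. From an existing pandas DataFrame
df2 = sqlContext.createDataFrame(pd.DataFrame({'name': ['Carol'], 'age': [3]}))

# 3. From files in HDFS or another storage system
df3 = sqlContext.read.json("logs.json")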
Using DataFrame
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
[Row(name=u'Alice', age=1),
 Row(name=u'Bob', age=2),
 Row(name=u'Bob', age=2)]
Transformations
• Create new DataFrame from an existing one
• Use lazy evaluation
• Nothing executes; Spark saves a recipe for
transforming the source
Transformation: Description
select(*cols): selects columns from this DataFrame
drop(col): returns a new DataFrame that drops the specified column
sort(*cols, **kw): returns a new DataFrame sorted by the specified columns, in the sort order specified by kw
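A brief doctest-style illustration of these transformations on the df1 created on the next slide; nothing executes, so only the resulting schema is shown:

>>> df1.select('name')                 # lazy: selects the name column
DataFrame[name: string]
>>> df1.drop('age')                    # new DataFrame without the age column
DataFrame[name: string]
>>> df1.sort('age', ascending=False)   # new DataFrame sorted by age, descending
DataFrame[name: string, age: bigint]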
Using Transformations
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df2 = df1.distinct()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df3 = df2.sort('age', ascending=False)
[Row(name=u'Bob', age=2), Row(name=u'Alice', age=1)]
Actions
• Cause Spark to execute the recipe that transforms the source
• Mechanisms for getting results out of Spark
Action: Description
show(n, truncate): prints the first n rows of this DataFrame
take(n): returns the first n rows as a list of Row
collect(): returns all the records as a list of Row (*)
count(): returns the number of rows in this DataFrame
Using Actions
>>> data = [('Alice', 1), ('Bob', 2)]
>>> df = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df.count()
2
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
Caching
>>> linesDF = sqlContext.read.text('…')
>>> linesDF.cache()
>>> commentsDF = linesDF.filter(isComment)
>>> print(linesDF.count(), commentsDF.count())
>>> commentsDF.cache()
df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")
read and write functions create new builders for doing I/O
Builder methods specify:
• format
• partitioning
• handling of existing data

val df = sqlContext.
  read.
  format("json").
  option("samplingRatio", "0.1").
  load("/Users/spark/data/stuff.json")

df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")
Data sources, built-in and external: JDBC, JSON, and more …
Architecture
• A master-worker type architecture
• A driver or master node
• Worker nodes
Architecture (2)
• A Spark program first creates a SparkContext object
• SparkContext tells Spark how and where to access a
cluster
• The master parameter for a SparkContext determines
which type and size of cluster to use

Master parameter: Description
local: run Spark locally with one worker thread (no parallelism)
local[K]: run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT: connect to a Spark standalone cluster
mesos://HOST:PORT: connect to a Mesos cluster
yarn: connect to a YARN cluster
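A minimal sketch of choosing a master when creating a SparkContext; the app name and thread count are arbitrary:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("lecture4-demo")
sc = SparkContext(conf=conf)   # on a cluster: spark://HOST:PORT, mesos://HOST:PORT, or yarn
print(sc.master)               # local[4]
sc.stop()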
Books:
Q&A
Thank you for your attention!