Introduction to Apache Spark
Outline
• The Genesis of Spark
• What is Apache Spark?
• Getting Started with Spark
Reference:
• Chapter 1, “Learning Spark”, 2nd Edition. Authors: Jules S. Damji, Brooke Wenig,
Tathagata Das, Denny Lee. Publisher(s): O'Reilly Media, Inc. ISBN: 9781492050049
The Genesis of Spark
• Big Data and Distributed Computing at Google
o Google created the Google File System (GFS), MapReduce (MR), and Bigtable to handle
the massive amounts of data on the Internet
• Hadoop at Yahoo!
o The open-source community, and Yahoo! in particular, took an interest in Google's approach
o GFS provided a blueprint for the Hadoop Distributed File System (HDFS)
o Hadoop was donated to the Apache Software Foundation
o Shortcomings: hard to administer and manage, operationally complex, brittle fault
tolerance in MapReduce, slow MR jobs
• Spark was developed to address these shortcomings of Hadoop
Intermittent iteration of reads and writes between map and reduce computations
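To make the caption above concrete, here is a minimal single-machine sketch in plain Python (not Hadoop code) of a word count whose map output is written to disk and then read back by the reduce step. The function names map_phase and reduce_phase and the file name map_output.jsonl are hypothetical, chosen only for illustration. In a real MapReduce job this write/read round trip happens between every map and reduce stage, which is one reason chained MR jobs are slow.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def map_phase(lines, out_path):
    """Emit (word, 1) pairs and write them to disk, one JSON record per line."""
    with open(out_path, "w") as f:
        for line in lines:
            for word in line.split():
                f.write(json.dumps([word.lower(), 1]) + "\n")


def reduce_phase(in_path):
    """Read the intermediate pairs back from disk and sum the counts per word."""
    counts = Counter()
    with open(in_path) as f:
        for record in f:
            word, n = json.loads(record)
            counts[word] += n
    return dict(counts)


with tempfile.TemporaryDirectory() as tmp:
    # The intermediate file stands in for the disk I/O between map and reduce.
    intermediate = Path(tmp) / "map_output.jsonl"
    map_phase(["spark is fast", "hadoop is slow"], intermediate)
    word_counts = reduce_phase(intermediate)

print(word_counts)
```

Spark, by contrast, can keep intermediate results in memory between stages, which avoids much of this disk traffic.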
What Is Apache Spark?
● Apache Spark is a unified engine
designed for large-scale distributed
data processing, on premises in data
centers or in the cloud.
● Design philosophy:
○ Speed
○ Ease of use
○ Modularity
○ Extensibility
Apache Spark’s ecosystem of connectors
• Structured data (e.g., CSV, text, JSON, Avro, ORC, Parquet)
• Real-time processing of a continually growing table
• Common machine learning algorithms
• Analyzing graphs and topologies using algorithms, e.g., PageRank
Apache Spark components and API stack
Spark SQL
• Read from a JSON file stored on Amazon S3
• Create a temporary table, and
• Issue a SQL-like query on the results read into memory as a Spark DataFrame
Who Uses Spark, and for What?
Data Science, Data Engineering, Machine Learning
Some use cases:
• Processing large data sets in parallel across a cluster
• Performing ad hoc or interactive queries to explore and visualize data sets
• Building, training, and evaluating ML models using MLlib
• Implementing end-to-end data pipelines from myriad streams of data
• Analyzing graph data sets and social networks
Basic Operations a Data Scientist May Perform
Spark Ecosystem
Spark’s Distributed Execution
Spark Installation
Spark – Databricks Community Edition
1. Create a free Databricks account using this link:
https://databricks.com/try-databricks
2. When asked to select a cloud provider, click "Get
started with Community Edition" towards the bottom
(see screenshot)
3. Verify your email account by clicking the link sent to
your email. Then log in here:
https://community.cloud.databricks.com/login.html