Introduction To Apache Spark

The document provides an introduction to Apache Spark, detailing its genesis as a solution to the shortcomings of Hadoop in handling big data and distributed computing. It describes Spark as a unified engine for large-scale data processing, emphasizing its speed, ease of use, and modularity. Additionally, it outlines various use cases for Spark, including data science, machine learning, and real-time data processing.

Uploaded by

azamsyed811

Introduction to Apache Spark

Outline
• The Genesis of Spark
• What is Apache Spark?
• Getting Started with Spark

Reference:
• Chapter 1, “Learning Spark”, 2nd Edition. Authors: Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee. Publisher(s): O'Reilly Media, Inc. ISBN: 9781492050049

The Genesis of Spark
• Big Data and Distributed Computing at Google
o Creation of the Google File System (GFS), MapReduce (MR), and Bigtable to handle massive amounts of data on the Internet

• Hadoop at Yahoo!
o The open-source community, Yahoo! in particular, took an interest in these ideas
o GFS provided a blueprint for the Hadoop Distributed File System (HDFS)
o Hadoop was donated to the Apache Software Foundation
o Shortcomings: difficult administration and management, operational complexity, brittle fault tolerance in MapReduce, and slow MR jobs

• Spark was developed to address the issues Hadoop had

The Genesis of Spark
• Spark was developed to address the issues Hadoop had

Intermittent iteration of reads and writes between map and reduce computations
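
As a rough illustration of that cost, the sketch below compares a two-job pipeline that persists its intermediate result to disk with one that hands it over in memory. It is a toy, single-process Python example, not Hadoop or Spark code; the function names and the JSON scratch file are assumptions made for this example.

```python
import json
import os
import tempfile
from collections import Counter


def count_words(lines):
    """Job 1: count how often each word occurs across all lines."""
    return Counter(word for line in lines for word in line.split())


def keep_frequent(counts, min_count):
    """Job 2: keep only the words seen at least min_count times."""
    return {word: n for word, n in counts.items() if n >= min_count}


def disk_between_jobs(lines, min_count, workdir):
    """MapReduce-like flow: job 1 writes its result, job 2 reads it back."""
    path = os.path.join(workdir, "job1_output.json")
    with open(path, "w") as f:
        json.dump(count_words(lines), f)   # intermediate write to disk
    with open(path) as f:
        counts = json.load(f)              # intermediate read from disk
    return keep_frequent(counts, min_count)


def memory_between_jobs(lines, min_count):
    """In-memory flow: the intermediate result never leaves memory."""
    return keep_frequent(count_words(lines), min_count)


lines = ["spark is fast", "hadoop is slow", "spark is simple"]
with tempfile.TemporaryDirectory() as workdir:
    on_disk = disk_between_jobs(lines, 2, workdir)
in_memory = memory_between_jobs(lines, 2)
print(on_disk == in_memory, in_memory)
```

Both flows produce the same answer; the difference is the extra write and read between jobs, which is repeated for every pair of stages in a long chain of MR jobs.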

What Is Apache Spark?
● Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
● Design philosophy:
○ Speed
○ Ease of use
○ Modularity
○ Extensibility

Apache Spark’s ecosystem of connectors

What Is Apache Spark?
• Spark SQL: structured data (e.g., CSV, text, JSON, Avro, ORC, Parquet)
• Structured Streaming: real-time processing of a continually growing table
• MLlib: common machine learning algorithms
• GraphX: analysis of graphs and topologies using algorithms such as PageRank

Apache Spark components and API stack

Spark SQL
• Read from a JSON file stored on Amazon S3
• Create a temporary table, and
• Issue a SQL-like query on the results read into memory as a Spark DataFrame
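
The three steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the slide's original example: the bucket path, the view name `people`, and the columns `name` and `age` are hypothetical, and reading an `s3a://` path assumes the cluster has the Hadoop S3A connector and credentials configured.

```python
# Hypothetical location and schema; replace with your own data.
JSON_PATH = "s3a://example-bucket/data/people.json"
VIEW_NAME = "people"
QUERY = f"SELECT name, age FROM {VIEW_NAME} WHERE age > 21"


def query_json(spark, path=JSON_PATH):
    """Read JSON into a DataFrame, register a temp view, run a SQL query."""
    df = spark.read.json(path)             # step 1: read the JSON file
    df.createOrReplaceTempView(VIEW_NAME)  # step 2: create a temporary view
    return spark.sql(QUERY)                # step 3: query it; returns a DataFrame


def main():
    # Imported here so the helper above can be read and tested
    # on a machine without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()
    query_json(spark).show()
    spark.stop()
```

Calling `main()` on a machine with PySpark and access to the bucket would print the matching rows. In a Databricks notebook a `spark` session already exists, so `query_json(spark).show()` alone should be enough.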

Who Uses Spark, and for What?
Data Science, Data Engineering, Machine Learning

Some use cases:

• Processing large data sets in parallel, distributed across a cluster

• Performing ad hoc or interactive queries to explore and visualize data sets

• Building, training, and evaluating ML models using MLlib

• Implementing end-to-end data pipelines from myriad streams of data

• Analyzing graph data sets and social networks

Basic Operations a Data Scientist May Perform

Spark Ecosystem

Spark’s Distributed Execution

Spark Installation

Spark – Databricks Community Edition
1. Create a free Databricks account using this link: https://databricks.com/try-databricks

2. When asked to select a cloud provider, click "Get started with Community Edition" towards the bottom (see screenshot)

3. Verify your email account by clicking the link sent to your email. Then log in here: https://community.cloud.databricks.com/login.html
