1. Difference between RDD, DataFrame, and Dataset.
Performance-wise, RDDs are mainly suitable for processing large amounts of unstructured data with custom function logic.
A DataFrame is suitable for structured and semi-structured data; it is like a tabular format.
RDDs lack the Catalyst optimizer and the Tungsten execution engine, so Spark cannot optimize their execution plans.
DataFrames use the Catalyst optimizer for query optimization, which improves performance, but they have no compile-time type safety:
type errors are only detected at runtime, during execution.
Datasets combine the advantages of both RDDs and DataFrames: they can apply custom logic to large unstructured data and still benefit from the same optimization techniques,
and they additionally have compile-time type safety, so errors are caught during compilation.
With RDDs we have to define the structure/schema ourselves, whereas DataFrames and Datasets derive the schema automatically (schema inference).
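For illustration, a minimal sketch of the three APIs (assuming a SparkSession named spark; Person is a hypothetical case class used only for this example):
import spark.implicits._
case class Person(name: String, age: Int)
// RDD: we work with raw objects and define any structure ourselves
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
// DataFrame: tabular, schema inferred automatically, but rows are untyped (errors surface at runtime)
val df = rdd.toDF()
// Dataset: same Catalyst/Tungsten optimizations as a DataFrame, plus compile-time type safety
val ds = rdd.toDS()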
2. Narrow vs Wide transformations.
Narrow transformations do not require shuffling data across the nodes of the network.
Wide transformations involve shuffling data across the nodes of the network.
The partitions have a parent-child relationship: in a narrow transformation, each parent (input) partition contributes to exactly one child (output) partition.
Examples of narrow transformations include map, filter, and union.
These transformations do not require data shuffling across the network and can be
executed more efficiently.
For instance, the map transformation applies a function to each element of the RDD,
creating a new RDD without the need to redistribute the data across partitions.
Wide transformations, on the other hand, involve shuffling data across the network.
In these transformations, multiple output partitions may depend on multiple input
partitions.
Examples include groupByKey, reduceByKey, and join. These operations require data
to be exchanged between nodes,
leading to a more complex execution plan, making them more expensive and time-
consuming.
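As a small sketch (assuming an existing SparkContext sc):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Narrow: map and filter operate within each partition, no shuffle needed
val doubled = pairs.map { case (k, v) => (k, v * 2) }
val large = doubled.filter { case (_, v) => v > 2 }
// Wide: reduceByKey must gather all values of a key together, triggering a shuffle
val summed = large.reduceByKey(_ + _)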
3. Spark Architecture.
Apache Spark is an open-source distributed computing framework used for big data processing and analytics.
Azure Databricks is a cloud-based data analytics service built on top of Apache Spark.
Distributed nature of Spark:
There will be multiple worker nodes and the data will be distributed among them.
Each worker node can perform individually.
Every worker node runs one or more executors.
When we run a job, the driver program (on the master node) first communicates with the cluster manager,
and the cluster manager checks the available resources on the worker nodes
(for example, which nodes have free slots to perform tasks) and reports that information back to the driver program.
The driver then assigns the tasks directly to the worker nodes to perform the job.
Each worker node can have 'N' number of executors.
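As an illustration of how executor resources are requested, a minimal sketch (the values are placeholders and the exact settings depend on the cluster manager):
import org.apache.spark.sql.SparkSession
// The master URL is supplied by the environment (e.g. spark-submit or Databricks)
val spark = SparkSession.builder()
  .appName("ArchitectureDemo")
  .config("spark.executor.instances", "4")   // number of executors
  .config("spark.executor.cores", "2")       // cores per executor
  .config("spark.executor.memory", "4g")     // memory per executor
  .getOrCreate()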
4. Lazy evaluation.
Definition: Lazy evaluation is a computational strategy where the execution of
operations is delayed until their results are actually needed.
This allows for optimization of the overall computation by combining and reordering
operations for more efficient execution.
Query execution:
When working with large datasets, transformations like filter, map, and groupBy are
not immediately executed.
Instead, Spark builds an execution plan. This plan is optimized and executed only
when an action
like count, collect, or saveAsTextFile is called, reducing unnecessary computation.
Benefits: lazy evaluation optimizes query execution, reduces data shuffling, enhances fault tolerance, allows efficient caching,
and improves the performance of DataFrame operations.
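A minimal sketch of this behaviour (assuming an existing SparkContext sc; the input path is a placeholder):
// Transformations only build the execution plan; no data is read yet
val lines = sc.textFile("data.txt")
val errors = lines.filter(_.contains("ERROR"))
val pairs = errors.map(line => (line, 1))
// The action triggers the optimized plan and actually processes the data
val numErrors = pairs.count()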
5. Transformations and actions
Transformations are lazily evaluated and define a new dataset from an existing one.
They include operations like map, filter, flatMap, groupByKey, reduceByKey, and
join.
Actions trigger the execution of transformations and return results or save data.
They include operations like collect, count, first, take, reduce, saveAsTextFile,
and foreach.
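For example (assuming an existing SparkContext sc):
val nums = sc.parallelize(1 to 10)
// Transformations: lazily define new RDDs
val evens = nums.filter(_ % 2 == 0)
val squares = evens.map(x => x * x)
// Actions: trigger execution and return results (or write output)
val total = squares.reduce(_ + _)    // 220
val firstTwo = squares.take(2)       // Array(4, 16)
squares.saveAsTextFile("output_dir") // path is a placeholder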
6. Memory variables: Broadcast variables, accumulators, cache() vs persist()
Broadcast Variables
Definition: Broadcast variables are read-only variables that are cached on each machine rather than shipped as a copy with every task.
Use Case: They are used to efficiently distribute large read-only data across
worker nodes, such as a lookup table.
// Broadcast a small read-only array to every executor (assumes an existing SparkContext sc)
val broadcastVar = sc.broadcast(Array(1, 2, 3))
// Tasks read the broadcast value locally instead of receiving their own copy
val result = rdd.map(x => x * broadcastVar.value(0)).collect()
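To illustrate the lookup-table use case, a small sketch (the codes and names are made up for this example):
val countryLookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
val codes = sc.parallelize(Seq("IN", "US", "IN"))
val countries = codes.map(code => countryLookup.value.getOrElse(code, "Unknown")).collect()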
Accumulators:
Definition: Accumulators are variables that are only “added” to, such as counters
and sums. They are used to perform global aggregations across the cluster.
Use Case: They are typically used for counting events or accumulating values across
tasks.
val accum = sc.longAccumulator("SumAccumulator")
// foreach is an action, so the accumulator is updated as each element is processed
rdd.foreach(x => accum.add(x))
// The accumulated value is read back on the driver
println(accum.value)
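Note: accumulator updates made inside transformations may be applied more than once if a task is re-executed; Spark only guarantees exactly-once updates for accumulators used inside actions (such as foreach above).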
cache():
Definition: It is a shorthand for persist() with the default storage level set to
MEMORY_ONLY.
Use Case: Useful when the data fits entirely in memory and is going to be reused
multiple times.
Example:
val cachedRdd = rdd.cache()
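Note that cache() itself is lazy; the data is materialized only when the first action runs, and later actions reuse it:
cachedRdd.count() // first action computes and caches the RDD
cachedRdd.count() // subsequent actions read from the cache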
persist():
Definition: Allows more control over the storage level used. It can store data in
memory, on disk, or a combination of both.
Use Case: Useful for fine-grained control over storage options, especially when
data is too large to fit in memory.
Storage Levels:
MEMORY_ONLY: Stores RDD as deserialized objects in memory.
MEMORY_AND_DISK: Stores RDD as deserialized objects in memory and spills to disk if
memory is insufficient.
MEMORY_ONLY_SER: Stores RDD as serialized objects in memory.
MEMORY_AND_DISK_SER: Stores RDD as serialized objects in memory and spills to disk
if memory is insufficient.
DISK_ONLY: Stores RDD as serialized objects on disk.
Example:
import org.apache.spark.storage.StorageLevel
val persistedRdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)
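When the data is no longer needed, the storage can be released explicitly:
persistedRdd.unpersist()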