1. Difference between RDD, DataFrame, and Dataset.
Performance-wise, RDDs are mainly suitable for processing large amounts of unstructured data with custom function logic.
A DataFrame is suitable for structured and semi-structured data; it is like a tabular format.
RDDs lack the Catalyst optimizer and the Tungsten execution engine, so Spark cannot optimize their execution plans.
DataFrames use the Catalyst optimizer for query optimization, which improves performance, but they have no compile-time type safety:
type errors are only detected at runtime, during execution.
Datasets combine the advantages of both RDDs and DataFrames: they can apply custom logic to large unstructured data and still benefit from the same optimization techniques,
and they additionally have compile-time type safety, so errors are caught during compilation.
With RDDs we have to define the structure/schema ourselves, whereas DataFrames and Datasets derive the schema automatically (schema inference).
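For illustration, a minimal sketch of the three APIs (assuming a SparkSession named spark; Person is a hypothetical case class used only for this example):
import spark.implicits._
case class Person(name: String, age: Int)
// RDD: we work with raw objects and define any structure ourselves
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
// DataFrame: tabular, schema inferred automatically, but rows are untyped (errors surface at runtime)
val df = rdd.toDF()
// Dataset: same Catalyst/Tungsten optimizations as a DataFrame, plus compile-time type safety
val ds = rdd.toDS()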
2. Narrow vs Wide transformations.
Narrow transformations do not require shuffling data across the nodes of the network.
Wide transformations involve shuffling data across the nodes of the network.
The partitions have a parent-child relationship: in a narrow transformation, each parent (input) partition contributes to exactly one child (output) partition.
Examples of narrow transformations include map, filter, and union.
These transformations do not require data shuffling across the network and can be
executed more efficiently.
For instance, the map transformation applies a function to each element of the RDD,
creating a new RDD without the need to redistribute the data across partitions.
Wide transformations, on the other hand, involve shuffling data across the network.
In these transformations, multiple output partitions may depend on multiple input
partitions.
Examples include groupByKey, reduceByKey, and join. These operations require data
to be exchanged between nodes,
leading to a more complex execution plan, making them more expensive and time-
consuming.
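As a small sketch (assuming an existing SparkContext sc):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Narrow: map and filter operate within each partition, no shuffle needed
val doubled = pairs.map { case (k, v) => (k, v * 2) }
val large = doubled.filter { case (_, v) => v > 2 }
// Wide: reduceByKey must gather all values of a key together, triggering a shuffle
val summed = large.reduceByKey(_ + _)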
3. Spark Architecture.
Apache Spark is an open-source distributed computing framework used for big data processing and analytics.
Azure Databricks is a cloud-based data analytics service built on top of Apache Spark.
Distributed nature of Spark:
There will be multiple worker nodes and the data will be distributed among them.
Each worker node can perform individually.
Every worker node runs one or more executors.
When we run a job, the driver program (on the master node) first communicates with the cluster manager,
and the cluster manager checks the available resources on the worker nodes
(for example, which nodes have free slots to perform tasks) and reports that information back to the driver program.
The driver then assigns the tasks directly to the worker nodes to perform the job.
Each worker node can have 'N' number of executors.
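As an illustration of how executor resources are requested, a minimal sketch (the values are placeholders and the exact settings depend on the cluster manager):
import org.apache.spark.sql.SparkSession
// The master URL is supplied by the environment (e.g. spark-submit or Databricks)
val spark = SparkSession.builder()
  .appName("ArchitectureDemo")
  .config("spark.executor.instances", "4")   // number of executors
  .config("spark.executor.cores", "2")       // cores per executor
  .config("spark.executor.memory", "4g")     // memory per executor
  .getOrCreate()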
4. Lazy evaluation.
Definition: Lazy evaluation is a computational strategy where the execution of
operations is delayed until their results are actually needed.
This allows for optimization of the overall computation by combining and reordering
operations for more efficient execution.
Query execution:
When working with large datasets, transformations like filter, map, and groupBy are
not immediately executed.
Instead, Spark builds an execution plan. This plan is optimized and executed only
when an action
like count, collect, or saveAsTextFile is called, reducing unnecessary computation.
Benefits: lazy evaluation optimizes query execution, reduces data shuffling, enhances fault tolerance, allows efficient caching,
and improves the performance of DataFrame operations.
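A minimal sketch of this behaviour (assuming an existing SparkContext sc; the input path is a placeholder):
// Transformations only build the execution plan; no data is read yet
val lines = sc.textFile("data.txt")
val errors = lines.filter(_.contains("ERROR"))
val pairs = errors.map(line => (line, 1))
// The action triggers the optimized plan and actually processes the data
val numErrors = pairs.count()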
5. Transformations and actions
Transformations are lazily evaluated and define a new dataset from an existing one.
They include operations like map, filter, flatMap, groupByKey, reduceByKey, and
join.
Actions trigger the execution of transformations and return results or save data.
They include operations like collect, count, first, take, reduce, saveAsTextFile,
and foreach.
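For example (assuming an existing SparkContext sc):
val nums = sc.parallelize(1 to 10)
// Transformations: lazily define new RDDs
val evens = nums.filter(_ % 2 == 0)
val squares = evens.map(x => x * x)
// Actions: trigger execution and return results (or write output)
val total = squares.reduce(_ + _)    // 220
val firstTwo = squares.take(2)       // Array(4, 16)
squares.saveAsTextFile("output_dir") // path is a placeholder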
6. Memory variables: Broadcast variables, accumulators, cache() vs persist()
Broadcast Variables
Definition: Broadcast variables are read-only variables that are cached on each machine rather than shipped as a copy with every task.
Use Case: They are used to efficiently distribute large read-only data across
worker nodes, such as a lookup table.
// Broadcast a small read-only array to every executor (assumes an existing SparkContext sc)
val broadcastVar = sc.broadcast(Array(1, 2, 3))
// Tasks read the broadcast value locally instead of receiving their own copy
val result = rdd.map(x => x * broadcastVar.value(0)).collect()
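To illustrate the lookup-table use case, a small sketch (the codes and names are made up for this example):
val countryLookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
val codes = sc.parallelize(Seq("IN", "US", "IN"))
val countries = codes.map(code => countryLookup.value.getOrElse(code, "Unknown")).collect()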
Accumulators:
Definition: Accumulators are variables that are only “added” to, such as counters
and sums. They are used to perform global aggregations across the cluster.
Use Case: They are typically used for counting events or accumulating values across
tasks.
val accum = sc.longAccumulator("SumAccumulator")
// foreach is an action, so the accumulator is updated as each element is processed
rdd.foreach(x => accum.add(x))
// The accumulated value is read back on the driver
println(accum.value)
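Note: accumulator updates made inside transformations may be applied more than once if a task is re-executed; Spark only guarantees exactly-once updates for accumulators used inside actions (such as foreach above).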
cache():
Definition: It is a shorthand for persist() with the default storage level set to
MEMORY_ONLY.
Use Case: Useful when the data fits entirely in memory and is going to be reused
multiple times.
Example:
val cachedRdd = rdd.cache()
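Note that cache() itself is lazy; the data is materialized only when the first action runs, and later actions reuse it:
cachedRdd.count() // first action computes and caches the RDD
cachedRdd.count() // subsequent actions read from the cache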
persist():
Definition: Allows more control over the storage level used. It can store data in
memory, on disk, or a combination of both.
Use Case: Useful for fine-grained control over storage options, especially when
data is too large to fit in memory.
Storage Levels:
MEMORY_ONLY: Stores RDD as deserialized objects in memory.
MEMORY_AND_DISK: Stores RDD as deserialized objects in memory and spills to disk if
memory is insufficient.
MEMORY_ONLY_SER: Stores RDD as serialized objects in memory.
MEMORY_AND_DISK_SER: Stores RDD as serialized objects in memory and spills to disk
if memory is insufficient.
DISK_ONLY: Stores RDD as serialized objects on disk.
Example:
import org.apache.spark.storage.StorageLevel
val persistedRdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)
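When the data is no longer needed, the storage can be released explicitly:
persistedRdd.unpersist()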