Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing.

Spark provides an interface for programming clusters with implicit data parallelism and fault
tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark
codebase was later donated to the Apache Software Foundation, which has maintained it
since.

Overview

Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a
read-only multiset of data items distributed over a cluster of machines, which is maintained in
a fault-tolerant way.[2] The DataFrame API was released as an abstraction on top of the RDD,
followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming
interface (API), but as of Spark 2.x use of the Dataset API is encouraged,[3] even though the
RDD API is not deprecated.[4][5] The RDD technology still underlies the Dataset API.[6][7]
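As a minimal sketch of the relationship between the two APIs (assuming a standard Spark
installation; the application name, sample values, and local master URL are illustrative),
the same computation can be expressed against both the RDD API and the typed Dataset API:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("api-comparison") // illustrative name
      .master("local[*]")        // run locally for demonstration
      .getOrCreate()
    import spark.implicits._

    // RDD API: low-level functional transformations over a distributed collection.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
    val rddSum = rdd.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)

    // Dataset API: the same computation through the typed, optimizer-backed
    // abstraction encouraged in Spark 2.x; it still executes over RDDs underneath.
    val ds = Seq(1, 2, 3, 4, 5).toDS()
    val dsSum = ds.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)

    println(s"RDD sum: $rddSum, Dataset sum: $dsSum") // both print 60
    spark.stop()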

Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce
cluster computing paradigm, which forces a particular linear dataflow structure on
distributed programs: MapReduce programs read input data from disk, map a function across
the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs
function as a working set for distributed programs, offering a (deliberately) restricted form
of distributed shared memory.[8]
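The working-set idea can be illustrated with a short sketch (the log path and the
SparkSession value `spark` are assumptions for illustration): an RDD is computed once,
cached in cluster memory, and then reused by several actions without re-reading the input
from disk, which a linear MapReduce pipeline cannot do.

    // Assumes an existing SparkSession named `spark`; the path is hypothetical.
    val lines  = spark.sparkContext.textFile("hdfs:///logs/app.log")
    val errors = lines.filter(_.contains("ERROR")).cache() // mark as a working set

    val total    = errors.count()                               // first action: computes and caches
    val timeouts = errors.filter(_.contains("timeout")).count() // reuses the in-memory RDD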

Inside Apache Spark, the workflow is managed as a directed acyclic graph (DAG): nodes
represent RDDs, while edges represent the operations on the RDDs.
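This lineage graph is observable from the API. As a small sketch (again assuming a
SparkSession named `spark`), each transformation adds to the DAG without executing
anything, and `toDebugString` prints the graph recorded so far:

    val nums     = spark.sparkContext.parallelize(1 to 1000)
    val pipeline = nums.map(_ * 2).filter(_ % 3 == 0) // adds nodes to the DAG; nothing runs yet
    println(pipeline.toDebugString) // prints the lineage, e.g. ParallelCollectionRDD -> MapPartitionsRDD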

Spark facilitates the implementation of both iterative algorithms, which visit their data set
multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated
database-style querying of data. The latency of such applications may be reduced by several
orders of magnitude compared to an Apache Hadoop MapReduce implementation.[2][9] Among
this class of iterative algorithms are the training algorithms for machine learning systems,
which formed the initial impetus for developing Apache Spark.[10]
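A hedged sketch of such an iterative algorithm (the toy data, learning rate, and iteration
count are illustrative assumptions): a one-variable least-squares fit by gradient descent,
where caching lets every pass after the first read the points from memory rather than disk.

    // Assumes an existing SparkSession named `spark`. Toy (x, y) pairs, roughly y = 2x.
    val points = spark.sparkContext
      .parallelize(Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2)))
      .cache() // the loop below revisits this data set on every pass

    var w = 0.0 // weight for the model y ≈ w * x
    for (_ <- 1 to 50) {
      // mean gradient of the squared error (w*x - y)^2 / 2 with respect to w
      val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
      w -= 0.1 * gradient // illustrative learning rate
    }
    println(s"fitted weight: $w") // converges to roughly 2.0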

Apache Spark requires a cluster manager and a distributed storage system. For cluster
management, Spark supports standalone native Spark clusters, Hadoop YARN, Apache Mesos,
and Kubernetes.[11] A standalone native Spark cluster can be launched manually or by the
launch scripts provided with the install package. It is also possible to run the daemons on a
single machine for testing. For distributed storage, Spark can interface with a wide variety of
distributed systems, including Alluxio, Hadoop Distributed File System (HDFS),[12] MapR File
System (MapR-FS),[13] Cassandra,[14] OpenStack Swift, Amazon S3, Kudu, and the Lustre file
system;[15] a custom solution can also be implemented. Spark also supports a
pseudo-distributed local mode, usually used only for development or testing purposes, where
distributed storage is not required and the local file system can be used instead; in such a
scenario, Spark runs on a single machine with one executor per CPU core.
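In code, the choice of cluster manager is visible mainly in the master URL passed at startup.
The following sketch lists the standard URL forms; host names and ports are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("deployment-demo") // illustrative name
      // Exactly one master is chosen per application:
      .master("local[*]")                        // pseudo-distributed local mode, one worker thread per core
      // .master("spark://master-host:7077")     // standalone native Spark cluster
      // .master("yarn")                         // Hadoop YARN (cluster located via the Hadoop configuration)
      // .master("mesos://master-host:5050")     // Apache Mesos
      // .master("k8s://https://api-host:6443")  // Kubernetes
      .getOrCreate()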
