Apache Spark Reviews

Name: Apache Spark
Brand: Apache
Rating: 4.2 out of 5 (68 reviews)
90% willing to recommend

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items that is distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
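
The linear dataflow described above can be sketched in plain Python, without Spark or Hadoop. This is only an illustration of the read, map, reduce, store sequence on a single machine; the function names (`map_phase`, `reduce_phase`, `run_job`, `demo`) and the word-count task are assumptions chosen for the example, not part of any MapReduce or Spark API.

```python
import json
import tempfile
from collections import Counter
from functools import reduce
from pathlib import Path


def map_phase(line):
    """Map step: turn one input line into per-word counts."""
    return Counter(line.split())


def reduce_phase(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


def run_job(input_path, output_path):
    """One MapReduce-style pass: read from disk, map, reduce, write to disk."""
    lines = Path(input_path).read_text().splitlines()
    mapped = [map_phase(line) for line in lines]
    totals = reduce(reduce_phase, mapped, Counter())
    Path(output_path).write_text(json.dumps(totals))
    return dict(totals)


def demo():
    """Run one job, then reload its output the way a follow-up job would."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "input.txt"
        result = Path(tmp) / "counts.json"
        source.write_text("spark rdd spark\nrdd cluster\n")
        counts = run_job(source, result)
        # A second job in this model starts from what the first wrote to disk.
        reloaded = json.loads(result.read_text())
        return counts, reloaded


if __name__ == "__main__":
    print(demo())
```

In this model every additional pass over the data repeats the disk round trip seen in `demo`, which is the cost that an in-memory working set such as an RDD is meant to avoid.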

Apache Spark is the #2 ranked solution in top Hadoop solutions, the #2 ranked solution in top Java Frameworks, and the #4 ranked solution in top Compute Service solutions. PeerSpot users give Apache Spark an average rating of 8.4 out of 10. Apache Spark is most commonly compared to Spring Boot. Apache Spark is popular among the large enterprise segment, which accounts for 71% of users researching this solution on PeerSpot. The top industry researching this solution is financial services, accounting for 26% of all views.

Featured Apache Spark reviews

Devindra Weerasooriya

Data Architect at Devtech

The in-memory computation feature is certainly helpful for my processing tasks. When a structure can be held in memory rather than stored during the computation, I choose the in-memory option. There are limitations to holding data in memory that need to be addressed, but in-memory computation is my preference. The solution also gives me a stable, long-held base-level understanding of the framework that does not vary from day to day, which is very helpful in my prototyping work as an architect assessing Apache Spark, Great Expectations, and Vault-based solutions against those proposed by clients, such as TIBCO or Informatica.

Omar Khaled

Data Engineer at a tech company with 10,001+ employees

I can improve the organization's functions by taking less time to make decisions. To make the right decision, you need the right data, and a solution can provide this by hiring talent and employees who can consolidate data from different sources and organize it. Not all solutions can make this data fast enough to be used, except for solutions such as Apache Spark Structured Streaming. To make the right decision, you should have both accurate and fast data. Apache Spark itself is similar to the Python programming language. Python is a language with many libraries for mathematics and machine learning. Apache Spark is the solution, and within it, you have PySpark, which is the API for Apache Spark to write and run Python code. Within it, there are many APIs, including SQL APIs, allowing you to write SQL code within a Python function in Apache Spark. You can also use Apache Spark Structured Streaming and machine learning APIs.

Bharghava Raghavendra Beesa

Senior Developer at Infosys

The Spark solution could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks, so it requires an external tool such as the Airflow scheduler, or scripts. For instance, one task should trigger another when it completes, but Spark can't manage these dependent loads. We focus on specific compute tasks that we can deliver.
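
The kind of dependency handling the reviewer describes can be sketched with Python's standard `graphlib` module. This is a toy illustration of the idea that a task runs only after the tasks it depends on have completed. It is not how Airflow is implemented, and the task names, `PIPELINE`, and `run_in_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}


def run_in_order(pipeline, run_task):
    """Run each task only after all of its dependencies have completed."""
    completed = []
    for task in TopologicalSorter(pipeline).static_order():
        run_task(task)  # in practice this step might submit a Spark job
        completed.append(task)
    return completed


def demo():
    """Record the order in which the hypothetical tasks would run."""
    log = []
    run_in_order(PIPELINE, log.append)
    return log


if __name__ == "__main__":
    print(demo())
```

A real orchestrator adds retries, scheduling, and failure handling on top of this ordering, which is why teams typically pair Spark with a dedicated tool rather than hand-written scripts.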

Hadoop Market Share Distribution
Product	Market Share (%)
Apache Spark	17.1%
Cloudera Distribution for Hadoop	19.1%
HPE Data Fabric	14.6%
Other	49.2%

Suggested products

Title	Rating	Mindshare	Recommending	Interviews
Spring Boot	4.2	N/A	95%	41
Jakarta EE	3.7	N/A	66%	3

Key learnings from peers

Last updated Nov 21, 2025

Valuable Features

Apache Spark is recognized for its speed, scalability, ease of use, and ability to handle large datasets. Key features include Spark Streaming, Spark SQL, machine learning with MLlib, in-memory processing, and distributed computing. Users appreciate its fast performance, fault tolerance, and real-time processing capabilities. Compatibility with languages like Python, Scala, and Java enhances its usability. The ability to execute SQL-like queries and the flexibility in managing data pipelines significantly improve data processing efficiency.

"Apache Spark, specifically PySpark and the tools available there, have been quite helpful in my event analysis work."
"Apache Spark's ability to handle both batch and streaming data is the most valuable feature for me as it offers solid real-time processing capability, making it more efficient in managing data analytics."
"Apache Spark resolves many problems in the MapReduce solution and Hadoop, such as the inability to run effective Python or machine learning algorithms."

Room for Improvement

Apache Spark could enhance its scalability and stability. Its real-time query capabilities are lacking, and the tool can be difficult for users without technical backgrounds. More intuitive user interfaces and user-friendly documentation are needed. The integration with BI tools could be expanded. Memory management and garbage collection need optimization, and adding support for additional machine learning algorithms would be beneficial. Users desire improved task scheduling, dependency management, and easier setup for non-technical individuals.

"Very often in many of my experiments, the data set has had to be partitioned, and there have been issues in handling very large data sets, with most of my work done using Python machine learning libraries, requiring chunking, and speed of prediction has been an issue of concern in some experiments where we have had to shut down processes due to CPU requirements, then restart with different Apache configurations, and resourcing support is a major determinant if I were to name a constraint in terms of running machine learning experiments."
"The basic improvement would be to have integration with these solutions."
"The Spark solution could improve in scheduling tasks and managing dependencies."

ROI

Apache Spark has enabled significant cost savings, with one reviewer reporting a 50 percent reduction in operational costs. By optimizing startup time for customers and leveraging specific skill sets, users report better performance at lower cost. Spark's open-source licensing helps keep costs down, though its higher memory and infrastructure requirements can raise the cost of achieving good performance. Specific ROI percentages vary with the cluster, but the overall impact is described as financially beneficial, including substantial savings in billion-dollar operations.

Pricing

Apache Spark is primarily an open-source solution, leading many organizations to utilize it free of licensing fees, especially for on-premises deployments. However, costs arise from infrastructure requirements, particularly with cloud implementations involving substantial hardware, memory, and management expenses. While some companies opt for bundled services like Cloudera or Databricks for enhanced support and features, these involve additional charges. Enterprise buyers should carefully assess their infrastructure needs and potential cloud costs associated with Apache Spark deployments.

"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"Considering the product version used in my company, I feel that the tool is not costly since the product is available for free."
"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."

Popular Use Cases

Apache Spark is primarily used for processing large data sets, enabling data processing and transformation in real-time and batch scenarios. Users employ it for ETL processes, data lakes, and analytics, leveraging its in-memory computing power for fast processing. It assists in building data pipelines, handling streaming data, and powering machine learning applications. Apache Spark's scalability and flexibility allow seamless integration with cloud platforms and various data sources, proving vital for organizations handling extensive data operations.

Service and Support

Apache Spark's open-source nature means official support isn't widespread. Users frequently rely on vibrant community forums for help, though responses can vary. Some prefer commercial support through vendors like Cloudera, providing stronger assistance. Feedback about customer service and technical support tends to be mixed; while community forums and documentation are helpful, dedicated vendor support may enhance response times, especially for complex issues. Some emphasize the value of internal expertise for problem-solving.

Deployment

Apache Spark's initial setup varies in complexity based on deployment type and expertise level. Cloud setups are often quick and straightforward, while on-premise environments require more time and configuration, especially with security measures. Experienced teams find it manageable, but beginners may face challenges due to complex dependencies and configurations. Support from specialized professionals can ease the process. Self-managed clusters demand significant effort in resource allocation and integration with services.

Scalability

Users find Apache Spark highly scalable, with no major issues. Its performance is superior to Python and R, efficiently supporting many users and large data volumes. While setting up requires technical skills, effective scaling depends on cluster size and infrastructure management. Companies using it emphasize its processing capabilities, allowing multiple nodes and instances, supporting vast data processing tasks. It requires monitoring to optimize scaling but is rated highly for scalability.

Stability

Apache Spark is recognized for its stability, with many users reporting no significant issues. Some encountered difficulties with standalone deployment, Spark Streaming, and large datasets, especially in earlier versions. Version updates have improved reliability. Stability ratings often reach nine or ten out of ten. Memory errors and schema changes can occur under high data loads but are manageable. Many prefer Spark for its ease compared to alternatives like Flink and its effective handling of Python and machine learning algorithms.

These insights are based on the in-depth reviews provided by peers to help you make a better buying decision.

Review data by company size

By reviewers
Company Size	Count
Small Business	25
Midsize Enterprise	13
Large Enterprise	25

By visitors reading reviews
Company Size	Count
Small Business	115
Midsize Enterprise	32
Large Enterprise	353

Top industries

By visitors reading reviews
Industry	Share
Financial Services Firm	26%
Computer Software Company	11%

Other industries represented (percentages not shown): Manufacturing Company, Comms Service Provider, University, Government, Retailer, Insurance Company, Educational Organization, Healthcare Company, Construction Company, Real Estate/Law Firm, Outsourcing Company, Non Profit, Performing Arts, Media Company, Legal Firm, Recreational Facilities/Services Company, Hospitality Company, Pharma/Biotech Company, Consumer Goods Company, Transportation Company, Renewables & Environment Company, Energy/Utilities Company

Apache Spark customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Author info	Rating	Review Summary
Data Architect at Devtech	4.5	I’ve used Apache Spark for four years, mainly for data integration and access. Its in-memory processing and open-source flexibility suit my needs, despite some stability issues. I prefer it over commercial tools like Informatica due to cost and adaptability.
Data Engineer at a tech company with 10,001+ employees	5.0	I use Apache Spark for real-time data processing and transformation across multiple sources like CRM and Siebel. It's reliable, fast, and improves our decision-making, though I see future needs for better integration with emerging cloud solutions.
Senior Developer at Infosys	3.5	No summary available
Senior Software Architect at USEReady	4.0	No summary available
Sr Manager at a transportation company with 10,001+ employees	4.5	I use Apache Spark for real-time data processing and ETL tasks. It offers unparalleled features but faces limitations due to its in-memory implementation. Despite improvements in version 3.0, reducing costs and addressing memory issues would enhance it further.
Data Scientist at a financial services firm with 10,001+ employees	4.5	I primarily use Apache Spark for data processing tasks involving large datasets, appreciating its ease of use and portability. While it's efficient for both small and large datasets, the lack of support for geospatial data is a limitation.
Data engineer at Cocos pt	4.5	We use Apache Spark primarily for Spark SQL and occasionally Spark Streaming, processing data from sources like SAP and Azure Data Warehouse. Its in-memory processing significantly outperforms Hadoop, offering faster data handling and enhanced query optimization.
Head of Data at an energy/utilities company with 51-200 employees	4.0	Apache Spark reduced operational costs by 50%. Although it supports parallel processing, it needs improvements in scalability and user-friendliness. Working with datasets isn't as straightforward as with Pandas, though it's flexible and functional.