A Practitioner's Guide to Databricks vs Snowflake
Summary
When comparing Databricks and Snowflake across features and capabilities, Databricks holds a
competitive edge for TCO-sensitive organisations seeking a unified analytics platform that supports
all their data, all their users and all their use cases. Databricks offers a comprehensive solution
for data-driven organisations, with superior performance in:
• ETL workloads;
• processing different data types;
• cataloguing and lineage;
• AI/ML ecosystem integration; and
• real-time data processing.
Databricks' innovative Delta Sharing feature, available through Unity Catalog, enables seamless and
secure data sharing without relying on traditional connectivity patterns. While Snowflake excels in
specific areas such as data warehousing and ease of deployment, Databricks emerges as the better
platform for organisations looking to unlock the full potential of their data and drive impactful
business decisions.
Introduction
Fujitsu Data & AI, a specialist division of Fujitsu within the APAC region, works with enterprise
organisations and governments to find, interrogate, and help solve the most complex data problems
across Australia, New Zealand, and Asia. Our purpose is to accelerate the growth of our customers'
Data Analytics and Artificial Intelligence capabilities to unlock the value within their data. We have
one of the largest data engineering capabilities in Australia and are backed by Fujitsu, the third
largest ICT service provider in the world. Using industry-leading specialists, we offer full-breadth,
end-to-end Advanced Analytics, Business Intelligence and AI capabilities. We are a Premium
Databricks delivery partner and have global partnerships with Microsoft, AWS, and Google. Our
strong Databricks partnership is reflected in our continued success with clients and in winning
Databricks Regional Systems Integration Partner, Asia Pacific and Japan, in 2021 and 2022.
Drawing on this specialist experience, Fujitsu Data & AI APAC, in collaboration with Databricks,
has created this article to provide a practical perspective on the differences between Databricks
and Snowflake.
Snowflake, founded in 2012 by three data warehousing experts, was created with a vision to
revolutionise the world of data storage and management. The founders recognised the limitations of
traditional data warehouse solutions and sought to build a cloud-native, fully managed, and highly
scalable data platform. Snowflake's architecture, a hybrid of a shared-nothing MPP query cluster
(in which each node holds a portion of the data) and shared-disk data storage, allows for
seamless scalability, improved performance, and cost-effective solutions tailored to the needs of
each organisation. Over the years, Snowflake has gained widespread recognition as a leading cloud
data warehouse, serving a multitude of industries and customers around the world. With a strong
commitment to innovation and customer success, Snowflake continues to break new ground in data
warehousing, empowering organisations to make data-driven decisions and achieve better business
outcomes.
Databricks and Snowflake are both popular technologies used in the field of data analytics and
processing, but they have some key differences in their features and functionalities.
1. Data warehouse vs Lakehouse: Snowflake is a cloud-based data warehouse that provides a
fully managed, scalable, and SQL-based data warehousing solution. It is optimised for fast
query performance and allows users to store and analyse structured and semi-structured
data.
Databricks, on the other hand, is a cloud-native data platform that brings the best of the data
warehouse and the data lake together in a new category, the Lakehouse, and provides a unified
analytics workspace for data engineering, AI, and machine learning tasks. It is optimised for
large-scale data processing and analysis, and supports a wide variety of data formats, including
structured, semi-structured, and unstructured data.
2. Architecture: Snowflake provides a SaaS-based platform and uses a hybrid architecture that
combines a shared-nothing MPP query engine (virtual warehouses) with shared-disk central
data storage. It provides automatic scaling, caching, and storage optimisation, making it easy
to scale up or down based on the workload, and is available on all major public clouds: Azure,
AWS and GCP.
Databricks, a PaaS-based platform, is built on top of Apache Spark, an open-source distributed
data processing framework, and runs on cloud platforms such as AWS, Azure, and Google
Cloud. Snowflake is a managed service whose overall architecture is known to end users.
However, the node types behind its compute are not disclosed, so customers cannot modify or
cost-optimise them, while Databricks allows complete control over compute.
Additionally, Databricks has introduced fully serverless options to further enhance the customer
experience.
5. Security and compliance: Both Snowflake and Databricks provide robust security features for
protecting data and ensuring compliance with data regulations. Snowflake provides features
such as data encryption at rest and in transit, role-based access control (RBAC), and auditing.
It also supports features such as virtual private cloud (VPC) peering for enhanced network
security.
Databricks provides similar security features, along with functions such as data lake firewall,
data lake encryption, and integration with Azure Active Directory for authentication and
authorisation.
In terms of data ownership, Snowflake has decoupled storage and processing with ownership
over both layers. However, storage is proprietary, controlled by Snowflake (and partners that
Snowflake permits). Access to this storage comes at a cost to the customer. Databricks has
fully decoupled storage layers and allows users to store data anywhere in any format,
focusing on open standards and the freedom to choose the processing engine while
integrating with third-party solutions.
6. Pricing: Pricing can be complex to compare, as the two providers offer different pricing models
and pricing tiers. Databricks pricing is considered more cost-effective than Snowflake's
because of its flexible and scalable cost structure, which better accommodates the needs of
organisations of various sizes and budgets. Databricks offers a pay-as-you-go model in which
customers pay only for the resources they consume, so expenses track their workloads.
Additionally, the platform provides features such as auto-scaling and auto-termination, which
help control costs by automatically adjusting resources based on usage and terminating idle
clusters. Databricks also offers better pricing for ETL/ELT workloads, where running standard
Spark-based clusters allows for a very flexible pricing model.
Snowflake uses a more rigid pricing model based on pre-allocated compute resources, which
can result in overprovisioning and underutilisation of resources, ultimately leading to higher
costs. Databricks' pricing flexibility and resource optimisation therefore make it the more
cost-effective solution of the two.
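To make the pricing argument concrete, the following is a minimal Python sketch of how billed compute hours can differ between a cluster that terminates after an idle timeout and compute that stays allocated for a fixed window. The rate, job schedule, idle timeout and function names are all hypothetical and chosen only for illustration; they are not vendor prices, and real bills depend on each platform's actual billing rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    start_hour: float  # hours since the start of the day
    duration: float    # hours of actual compute


def billed_hours_fixed(window_hours: float) -> float:
    """Compute is allocated for the whole window, whether busy or idle."""
    return window_hours


def billed_hours_auto_terminate(jobs: List[Job], idle_timeout: float) -> float:
    """Cluster runs for each job, then idles for at most idle_timeout hours.

    Assumes jobs are sorted by start_hour and do not overlap.
    """
    total = 0.0
    for i, job in enumerate(jobs):
        total += job.duration
        end = job.start_hour + job.duration
        if i + 1 < len(jobs):
            # Idle time is billed until the next job starts or the timeout fires.
            gap = jobs[i + 1].start_hour - end
            total += min(gap, idle_timeout)
        else:
            # After the last job the cluster idles until the timeout fires.
            total += idle_timeout
    return total


# Hypothetical inputs, for illustration only.
RATE_PER_HOUR = 4.0
jobs = [Job(1.0, 2.0), Job(3.5, 1.0), Job(10.0, 0.5)]

fixed_hours = billed_hours_fixed(window_hours=12.0)
auto_hours = billed_hours_auto_terminate(jobs, idle_timeout=0.25)

print("fixed allocation cost:", fixed_hours * RATE_PER_HOUR)
print("auto-terminate cost:  ", auto_hours * RATE_PER_HOUR)
```

The gap between the two figures comes entirely from idle time, so the comparison is only as good as the assumed job schedule and timeout.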
Recent innovations
As we dive into the world of Databricks and Snowflake, it's crucial to examine the innovative features
that set these platforms apart. Both Databricks and Snowflake have consistently pushed the
boundaries of data engineering and analytics, introducing cutting-edge solutions to address evolving
business needs. In this section, we will explore some of the most recent advancements in both
platforms, showcasing how their commitment to innovation empowers organisations to harness the
full potential of their data and make data-driven decisions with confidence.
Databricks
1. Delta Lake: An open-source storage layer that brings ACID transactions, scalability, and
reliability to data lakes. It enables organisations to manage the challenges of data reliability,
quality, and performance for big data and AI workloads.
2. Delta Live Tables: These declarative SQL pipelines are based on a truly streaming
architecture, making it easy for customers to develop and maintain extremely fast workflows
that enable operational decision-making and take advantage of advancing IoT
technology.
3. Delta Sharing: The world's first open protocol for securely sharing data across organisations in
real-time, without the need for the other organisation to have Databricks. This innovation
simplifies data sharing and collaboration, helping organisations unlock new insights and
opportunities.
4. Unity Catalog: A unified data catalog that enables organisations to manage and discover
datasets, as well as track data lineage across their Databricks workspaces. It streamlines data
governance and provides greater visibility into data assets and their usage.
5. Databricks Machine Learning: A comprehensive solution that integrates popular machine
learning frameworks, distributed ML libraries, and a collaborative UI. This platform aims to
make it easier for data scientists and engineers to develop, train, and deploy machine learning
models at scale.
6. Databricks SQL Warehouse: Designed to provide a fast, easy-to-use, and cost-effective way
for data analysts to work with massive datasets using SQL. Databricks SQL integrates with
popular business intelligence tools and offers features like auto-scaling and optimised query
performance for a seamless analytics experience.
7. Dolly: A completely open-source Large Language Model (LLM) that exhibits high-quality
instruction-following behaviour and can be fine-tuned on customers' own datasets to meet their
specific needs without the overhead.
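The ACID guarantee described for Delta Lake in item 1 rests on an ordered transaction log kept alongside the data files. The following is a simplified sketch of that idea in plain Python, not Delta Lake's actual implementation: each commit is a numbered JSON file, and a writer claims a version only if it can create that file exclusively. The class `TinyLog`, its methods, the `_log` directory name and the file names are hypothetical illustrations.

```python
import json
import os
import tempfile
from typing import List


class TinyLog:
    """Toy append-only commit log: one JSON file per table version."""

    def __init__(self, root: str):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version: int) -> str:
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self) -> int:
        versions = [
            int(name.split(".")[0])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        ]
        return max(versions, default=-1)

    def commit(self, read_version: int, added_files: List[str]) -> bool:
        """Try to commit as read_version + 1; return False on conflict."""
        try:
            # Mode "x" fails if the file already exists, so only one writer
            # can claim a given version (optimistic concurrency control).
            with open(self._path(read_version + 1), "x") as f:
                json.dump({"add": added_files}, f)
            return True
        except FileExistsError:
            return False

    def snapshot(self) -> List[str]:
        """Replay the log in order to list the files in the current table."""
        files: List[str] = []
        for version in range(self.latest_version() + 1):
            with open(self._path(version)) as f:
                files.extend(json.load(f)["add"])
        return files


log = TinyLog(tempfile.mkdtemp())
seen = log.latest_version()                       # both writers read version -1
first = log.commit(seen, ["part-0.parquet"])      # claims version 0
second = log.commit(seen, ["part-1.parquet"])     # conflict: version 0 is taken
retry = log.commit(log.latest_version(), ["part-1.parquet"])  # claims version 1
print(first, second, retry, log.snapshot())
```

A production system also has to make the write of each commit file atomic and handle deletes and schema changes; the sketch only shows why readers always see a consistent, ordered sequence of commits.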
Snowflake
1. Snowflake Data Cloud: A global data network that facilitates secure and governed access to a
wide range of data sets, enabling organisations to share and collaborate on data more
effectively. It allows businesses to break down data silos and accelerate their data-driven
initiatives.
2. Snowflake Data Marketplace: A platform that provides access to a vast array of data from
various providers, making it easy for organisations to discover, access, and utilise third-party
data sets in real-time. It simplifies data acquisition and integration, helping businesses unlock
new insights and opportunities.
3. Snowpark: A new developer experience intended to let users write code in familiar
programming languages to perform complex data transformations and processing within
Snowflake. At this stage it supports SQL translation and portions of Python for basic ML. This
innovation begins to extend Snowflake's capabilities to cater to a broader range of data
engineering tasks. At this stage there is no MLOps capability, and it does not yet support the
pandas API or other statistical languages (Scala, Java, R).
4. Snowflake Data Exchange: A feature that enables secure and real-time sharing of data
between Snowflake accounts, simplifying data sharing between organisations and facilitating
seamless collaboration on data-driven projects.
5. Dynamic Data Masking: A security feature that allows organisations to define masking policies
for sensitive data, ensuring that users only see the information they are authorised to access.
This innovation enhances data security and helps businesses comply with data protection
regulations.
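As an illustration of the masking behaviour described in item 5, here is a minimal Python sketch of a role-based masking policy applied at read time. It is not Snowflake's implementation or syntax; the role names, column names, policy table and the `apply_masking` helper are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple


def mask_email(value: str) -> str:
    """Hide the local part of an email address but keep the domain."""
    _, sep, domain = value.partition("@")
    if not sep:
        return mask_full(value)
    return "*****@" + domain


def mask_full(value: str) -> str:
    """Replace every character with an asterisk."""
    return "*" * len(value)


# Hypothetical policy table: column -> (masking function, roles allowed to
# see the unmasked value).
POLICIES: Dict[str, Tuple[Callable[[str], str], Set[str]]] = {
    "email": (mask_email, {"PII_ADMIN"}),
    "tax_id": (mask_full, {"PII_ADMIN"}),
}


def apply_masking(row: Dict[str, str], role: str) -> Dict[str, str]:
    """Return a copy of the row with protected columns masked for this role."""
    masked = {}
    for column, value in row.items():
        policy = POLICIES.get(column)
        if policy is None or role in policy[1]:
            masked[column] = value             # no policy, or role is authorised
        else:
            masked[column] = policy[0](value)  # apply the column's mask
    return masked


row = {"name": "Ada", "email": "ada@example.com", "tax_id": "123456789"}
analyst_view = apply_masking(row, "ANALYST")
admin_view = apply_masking(row, "PII_ADMIN")
print(analyst_view)
print(admin_view)
```

The stored data is never changed; the policy decides per query and per role what each reader sees, which is what makes the masking "dynamic".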
Considering the features outlined in the comparison table, Databricks stands out in ecosystem
integration, real-time data processing and cost-effectiveness, and handles far more than
data warehousing alone. It demonstrates superior performance in areas such as ETL workloads,
handling various data types, cataloguing and lineage, and AI/ML. Snowflake, on the other hand,
excels in traditional SQL-based data warehouse functions, where an organisation's strategy has
no requirement for semi-structured or unstructured data analysis or for ML/AI.
For an organisation wishing to maintain complete control of its data within its own network
environment, the Databricks managed VNET allows data to remain in place and never be moved
outside the company's own security controls and monitoring.
If your company needs help getting the results your business requires from its data, please
contact a Fujitsu Data & AI specialist now.
Contact
Fujitsu Data & AI
+61 3 9924 3000
© Fujitsu 2022. All rights reserved. Fujitsu and Fujitsu logo are trademarks of Fujitsu
Limited registered in many jurisdictions worldwide. Other product, service and company
names mentioned herein may be trademarks of Fujitsu or other companies. This
document is current as of the initial date of publication and subject to be changed by
Fujitsu without notice. This material is provided for information purposes only and Fujitsu
assumes no liability related to its use.