WP Dremio Simplifying Data Mesh
By Mike Ferguson
Intelligent Business Strategies
June 2023
Table of Contents
Introduction
Building a Data Mesh for Self-Service Analytics on Dremio’s Open Data Lakehouse
  Dremio Sonar
    What is a Data Lakehouse?
    Creating a Universal Semantic Layer with Dremio Sonar
  Dremio Arctic
    What is Apache Iceberg?
    What is Dremio Arctic?
  How can Dremio be used to build a data mesh?
    Phase 1 – Unify Data Access
    Phase 2 – Deliver a Data Lakehouse
    Phase 3 – Enterprise Data Mesh
Conclusions
INTRODUCTION
There has been a frenzy of activity in data and analytics over the last decade

Also, the number of new data sources has increased rapidly

In the last decade we have seen a frenzy of activity when it comes to data and analytics. Many new technologies have emerged over that time, including new data science and data engineering tools, new database products, and data management platforms. Also, the arrival of self-service data preparation, augmented business intelligence tools, and machine-learning (ML) automation has helped lower the skills bar needed to use these technologies. In addition, the number of new data sources that companies want to access has exploded as people look to add to what they already know to produce richer insights for better decision making.
Many companies have overspent because of stand-alone data and analytical initiatives

With so much going on, it’s not surprising that different departments in many companies have embraced all these technologies in their determination to deliver new value. However, different stand-alone data and analytical initiatives across large and medium-sized enterprises have resulted in piecemeal adoption of technologies and a fractured approach to data and analytics development. It is common to see multiple different overlapping technologies in different parts of the enterprise. Also, re-invention and data redundancy have occurred. As a result, many organizations now have multiple stand-alone analytical systems, such as multiple data warehouses, multiple data lakes in use in data science, graph databases, and streaming analytics initiatives. There are also multiple overlapping and competing toolsets. All of this has led to platform complexity and inadvertent overspend.
Also, considerable data redundancy and multiple overlapping technologies exist across many analytical systems

The problem with this is that stand-alone departmental and line-of-business data and analytical development initiatives have meant that different teams are making use of different data engineering tools to clean and integrate data for different analytical use cases. In some cases, they are even sourcing and engineering the same data from the same data stores for these different analytical use cases. Also, stand-alone self-service analytics initiatives have led to many different self-service data preparation jobs, BI reports, dashboards, and machine-learning models being produced using different data preparation, BI, and data science tools across the enterprise. All of this has led to managing tools and data platforms that do not integrate with each other, and to overspending. Also, very little of what has been created is published anywhere, so people who could benefit are often unaware of valuable data and insights that have already been created across the enterprise.
Organizations have made progress, but progress is slower than desired, integration is poor, and costs are running higher than they should be

In summary, while many organizations have been busy and progress has been made, what is being created under the banner of data and analytics is not joined up. The pace of development is slower than desired, and the way in which artifacts like ML models, BI reports, and dashboards are produced is somewhat untidy, inefficient, and complex to understand. This leads to cost that is higher than it should be.
As the pace of business quickens, executives are demanding a strategy-aligned, industrialized development approach to shorten time to value

However, we are now in an era where the pace of business is quickening, and executives are demanding rapid development to compete in a data-driven digital economy. They want to move away from fractured development initiatives to an industrialized development approach that accelerates data engineering and enables the sharing of data, ML models, and business intelligence, all created in alignment with business strategy. To do that means more people are needed, but we also need to organize them to build data and analytical products that can be shared, reused, and assembled more rapidly, in a similar way to a manufacturing production line. A good example of this is the emergence of a Data Mesh1 and reusable data products.
Data Mesh is a divide and conquer approach to data engineering aimed at accelerating development

The general idea behind data mesh is to move to a decentralized, ‘divide and conquer’ approach to data engineering to speed up development. That means upskilling more people in different business domains around the enterprise to enable them to produce high-quality, reusable, compliant datasets known as data products. Data products are defined2 as being discoverable, addressable, trustworthy, self-describing, interoperable and secure. They include the data itself, the pipeline (rules to clean and integrate data), the runtime specification to execute the pipeline, and APIs. They can be physically stored or represented as virtual views of data in multiple sources that integrate data on demand to produce the required data product.
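To illustrate the virtual form of a data product, the sketch below defines a hypothetical claims data product as a SQL view that integrates and standardizes data from two assumed source tables (policy_admin.claims and crm.customers). All object and column names are illustrative, not a prescribed layout.

    -- Hypothetical virtual data product: integrates data on demand from two sources
    CREATE VIEW insurance.claims_data_product AS
    SELECT
        c.claim_id,
        c.policy_id,
        cu.customer_id,
        cu.customer_name,
        CAST(c.claim_amount AS DECIMAL(12,2)) AS claim_amount,  -- standardize the data type
        UPPER(c.claim_status)                 AS claim_status,  -- standardize status coding
        c.claim_date
    FROM policy_admin.claims AS c
    JOIN crm.customers       AS cu ON cu.customer_id = c.customer_id
    WHERE c.claim_date IS NOT NULL;                             -- basic data quality rule

Because the view integrates data on demand, consumers always see the latest source data without any physical copy being made.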
A data concept model can be used to identify data products to produce

An easy way to think about data products is to think about data concepts in a data concept model. For example, data concepts in an insurance company would include customers, insurance brokers, claims assessors, insurance policy applications, quotes, policy agreements, premium payments, claims, etc.
There are several requirements that need to be met to make this possible

In this fast-moving digital economy, the question is, how do you implement this? How can you move away from the fractured approaches of the last decade and industrialize the development of data and analytics to rapidly build a data and AI-driven enterprise? What are the requirements that need to be met to make this possible? How do data mesh and data products fit in, and what else is needed?
1 Data Mesh – Delivering Data Value at Scale, Zhamak Dehghani, O’Reilly, ISBN 9781492092391
2 How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, Zhamak Dehghani, May 2019
Development of data products should all be aligned with strategic business priorities to achieve high-value business outcomes

Figure 1

All domain-oriented teams of data producers should use a common data platform technology

A common data platform enables teams to build business-ready data products in isolation and easily share data and metadata with others

A standard setup is needed for all teams

Figure 3
In this way data products can be produced by connecting directly to data sources
and transforming and integrating data.
When using a data lake, data ingestion needs to be managed and source data checked before it is made available for use

However, if an organization wishes to ingest data into a central data lake (e.g., cloud object storage), it should also be possible to manage data ingestion in a common way so that new data is not made available to teams of data producers until it has been quality checked. Managed data ingestion would also enable data ingestion to be tracked and prevent the same data from being ingested more than once. This is particularly important when it comes to external data that has been purchased.

Data producers need to have their own working environment to build data products

Once ingested and made available, all teams could then access a common ingestion zone and have their own workspaces (working environments) to create new data products, or new versions of data products, without impacting any other data producers or data consumers.
By using common business data names and global identifiers, data products become the building blocks to incrementally build a universal semantic layer for self-service analytics

Figure 4
All BI and data science tools can access commonly understood data via a common semantic layer

• Enable consumers to define different flattened views of data in star schemas to simplify access in the semantic layer (see the sketch after this list)

• Enable access to data via the common semantic layer from BI and data science tools for use in self-service analytics and ML model development
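As a hedged illustration of the first point above, a flattened star-schema style view can be layered over more granular semantic-layer views so that BI tools see one simple, consistently named dataset. The view and column names below are hypothetical.

    -- Illustrative flattened view for BI consumption, built from semantic-layer views
    CREATE VIEW semantic.sales_star_flat AS
    SELECT
        f.order_id,
        d.calendar_date,
        d.calendar_month,
        c.customer_name,
        c.customer_segment,
        p.product_name,
        f.quantity,
        f.net_sales_amount
    FROM semantic.fact_sales   AS f
    JOIN semantic.dim_date     AS d ON d.date_key     = f.date_key
    JOIN semantic.dim_customer AS c ON c.customer_key = f.customer_key
    JOIN semantic.dim_product  AS p ON p.product_key  = f.product_key;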
DREMIO SONAR
Dremio Sonar is a data lakehouse query engine that supports self-service
analytics with data warehouse functionality and data lake flexibility. Before
looking at Dremio Sonar, let’s first define what a data lakehouse is.
What is a Data Lakehouse?
A data lakehouse is a single platform that provides the scale and flexibility of a data lake together with the data integrity, query capability, and fine-grained governance of a data warehouse

For many years data warehouses have been used to organize and store data in relational tables in a proprietary format tied to a specific data warehouse database management system (DBMS) for query using SQL. This means the data is only accessible by that specific DBMS engine. Data lakes, on the other hand, have stored data in files that can be accessed by multiple engines. Historically, this data was primarily analyzed by data scientists using programs written in popular languages such as Python and R. Then the data lakehouse emerged to provide the scale and flexibility of a data lake with columnar file formats, ACID transactions, open table formats (e.g., Iceberg), fine-grained governance, Spark and SQL data transformation capability, and SQL query performance that is becoming competitive with data warehouses. With a data lakehouse, multiple engines can access and manipulate data directly in data lake storage without compromising data integrity and without the need to copy data into a proprietary format specific to a single data warehouse DBMS.
Creating a Universal Semantic Layer with Dremio Sonar
Dremio Sonar is a lakehouse query engine for self-service analytics

Dremio Sonar is an Apache Arrow-based columnar lakehouse query engine that supports BI reporting, dashboards, and interactive analytics over data in data lake storage, as well as unified access to data in cloud object storage and other data sources. Using Dremio Sonar, self-service interactive BI tools can access data via unified views that use common business data names to provide a common semantic layer across all self-service analytics tools. It enables organizations to run high-performance queries on data lakes, simplify access to data in multiple underlying data stores, modernize their data architecture, and facilitate self-service creation of data products anywhere, be it on-premises, hybrid, or cloud.

Teams can create a logical view of the data for data consumers, with the physical data residing in cloud data lakes (S3, ADLS), on-premises data lakes (HDFS), metastores (Nessie, Hive, AWS Glue), relational databases (Snowflake, Redshift, etc.), and NoSQL databases, or in another Dremio instance accessed via a Dremio-to-Dremio connector.
It supports virtual views that use SQL to transform and integrate data

Dremio Sonar’s key features include:

• Virtual views: Data consumers can transform and integrate data via SQL-based virtual views (see the sketch below)
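A minimal sketch of the kind of SQL-based virtual view described above, joining data across two different connected sources. The source names (an object-storage source s3_lake and a relational source sales_db) and table names are hypothetical and depend entirely on how sources have been configured.

    -- Illustrative virtual view joining data across two different connected sources
    CREATE VIEW analytics.customer_orders AS
    SELECT
        o.order_id,
        o.order_date,
        o.order_total,
        c.customer_id,
        c.customer_name
    FROM s3_lake.sales.orders      AS o   -- assumed: Parquet folder promoted to a dataset
    JOIN sales_db.public.customers AS c   -- assumed: table in a relational source
      ON c.customer_id = o.customer_id;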
Note that Dremio maintains full lineage showing the transformation and integration mappings between sources and virtual views. With respect to data products in a data mesh, data consumers can see the lineage showing how the virtual views representing data products are constructed. Figure 7 shows a lineage example in Dremio Sonar.
DREMIO ARCTIC
Before proceeding with Dremio Arctic, let’s first explain what Apache Iceberg is.
What is Apache Iceberg?
Data lakes were originally created for use in data science

Enterprises are increasingly relying on data lakes for data science, and over the last several years companies have amassed considerable amounts of data in data lakes, typically held in columnar-oriented Parquet files. First-generation data lakes were often on-premises, storing data in files in Hadoop HDFS. Today, those files are more likely to be stored on cloud object storage. However, while cloud object storage data lakes gave companies scalability, they have often been lacking in governance, which is something almost always found in auditable systems like data warehouses. Lack of governance often meant that it wasn’t long before data in these data lakes started to become inconsistent. Data scientists would also often copy data for their own use, resulting in significant data redundancy. There was little understanding of what was in these data lakes, and so they began to deteriorate and were dubbed ‘data swamps’.

Apache Iceberg is an open table format for files in object storage
Iceberg ensures the data is transactionally consistent even when data is maintained using multiple query engines

• Schema Evolution: This enables Iceberg to track changes to a table over time to adapt to and accommodate evolving data requirements

• Time Travel: With Iceberg, you can query historical data and verify changes between updates using the time travel feature. This provides insights into the data's evolution and facilitates auditing and analysis

• Data Partitioning and Evolution: Iceberg facilitates updates to partition schemes as queries and data volumes change. This ensures flexibility and efficiency in managing large datasets
These features make Iceberg a robust and flexible table format for managing
data in data lakes, providing improved data consistency, schema management,
historical analysis, partitioning flexibility, data rollback, and enhanced query
performance.
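To make these features concrete, here is a hedged sketch using Spark SQL with the Iceberg SQL extensions enabled. The table name lake.db.orders is hypothetical, and the exact syntax differs slightly between engines and versions.

    -- Schema evolution: add a column without rewriting existing data files
    ALTER TABLE lake.db.orders ADD COLUMNS (discount_pct DECIMAL(5,2));

    -- Partition evolution: change the partition scheme as data volumes grow
    ALTER TABLE lake.db.orders ADD PARTITION FIELD days(order_ts);

    -- Time travel: query the table as it looked at an earlier point in time
    SELECT * FROM lake.db.orders TIMESTAMP AS OF '2023-05-01 00:00:00';

In each case Iceberg records the change as table metadata, so existing data files do not need to be rewritten and earlier versions of the table remain queryable.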
Metadata can be tracked through point-in-time snapshots, and table changes are tracked within each snapshot

Iceberg achieves these capabilities for a table by utilizing metadata files, known as manifests, which are tracked through point-in-time snapshots. As the table is updated over time, Iceberg maintains all the deltas, allowing for a comprehensive view of the table's schema, partitioning, and file information within each snapshot. This approach ensures full isolation and consistency and enables scalable parallel query performance on large volumes of data stored in data lakes.
It supports multiple engines, which means data can be updated by one engine and queried by another

Dremio Arctic contains a list of all Iceberg tables and views that are in the data lake, together with their locations. It also contains pointers to metadata files that include metadata on Iceberg tables, schema, and partitions. The data, metadata, and indexes are stored in AWS S3 cloud object storage. Dremio Arctic runs independently from Dremio Sonar, but Dremio Sonar can connect to Dremio Arctic to gain access to data in Iceberg tables and views as a data source. Figure 8 shows the Dremio Arctic architecture.
Data is engineered by combining the SQL queries in different virtual views

Figure 9
Consumers can find, request access to, and consume data in virtual data products from a shared space

Figure 10 shows the consumption side: how virtual data products can be searched for and assembled into virtual star schemas and aggregate views created for use in self-service analytics, all using common business terms.

Figure 10
Virtual data marts can inherit the common data names. Query execution from BI
tools can be accelerated using columnar cloud caching, data reflections and
pushdown optimization. This breeds consistency and confidence.
Phase 2 – Deliver a Data Lakehouse
Migrate legacy data lakes and ingest source data into an Iceberg lakehouse managed by Dremio Arctic to guarantee consistent data, and allow teams of data engineers to build virtual data products leveraging data as code

The shift to using SaaS transaction applications and the adoption of the Internet of Things (IoT) has seen a huge amount of data being captured outside the firewall. This data is increasingly being brought into cloud storage to stage and integrate it with other data for input into cloud-based data warehouses and for use in data science. The second phase is to convert this data to Parquet files to ready it for use in a lakehouse. In addition, data in legacy on-premises data lakes such as HDFS should also be migrated to columnar Parquet files in object storage. As on-premises data warehouses are migrated to the cloud, data in data warehouse staging tables should also be added to object storage if that data is not already being ingested from its sources. This decouples staged source data from single analytical systems.
Turn a data lake into a lakehouse and reduce data copying

At this point, Dremio Arctic with Iceberg should be deployed to enable Iceberg tables to be created and overlaid on top of the files in object storage. This turns the data lake into a lakehouse, centralizes the data needed, and makes it possible to create data products once for use in multiple analytical use cases such as self-service analytics and data science. This allows:
Decentralized teams can build and manage data products in isolation and publish them for consumption in a shared searchable space without impacting data consumers

• Data to be streamed into Iceberg tables in columnar format by one engine (e.g., Spark or Flink) with full ACID support, and transformed and integrated to create up-to-date data products by Dremio Sonar

• Decentralized teams to engineer lakehouse data using Dremio Sonar in complete isolation and check it before making new virtual data products, or new versions of them, available to others (see the sketch after this list). Dremio Sonar and Arctic therefore enable virtual data product development

Consumer queries run in parallel on compressed columnar storage

• Iceberg tables to be overlaid on columnar files, which means data compression, automatic partitioning, data skipping, indexing, statistics, and parallel query processing are all possible on object storage-based data lakes, both during data product development and when querying data in virtual data products in self-service analytics

• All updates to data to be audited and tracked

Federated ownership and governance of data products

• Virtual data products to be consumed into cloud-based data warehouses, with virtual data marts built on Dremio Sonar

• Data scientists to consume virtual data products to provide features for building ML models, and to do it all in isolation without copying data

Self-service access across decentralized teams

• Business analysts to run reports on virtual data marts built using virtual data products, and also run BI reports, knowing they are working with the most consistent and up-to-date version of their data

• Automatic data optimization for Iceberg tables, including compaction, which rewrites smaller files into larger files, and garbage collection, which removes unused files, to optimize performance and storage utilization
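As a hedged sketch of the isolated, ‘data as code’ style of working described in the list above: Dremio Arctic (built on the Nessie catalog) supports Git-like branches, so a team might work roughly as follows. The branch, catalog, and table names are hypothetical, and the exact SQL syntax may differ by Dremio/Nessie version.

    -- Create an isolated branch of the Arctic catalog for this piece of work
    CREATE BRANCH claims_q2_update IN arctic_catalog;

    -- Engineer data on the branch without affecting anything consumers can see
    INSERT INTO arctic_catalog.insurance.claims AT BRANCH claims_q2_update
    SELECT * FROM raw_zone.claims_landing WHERE load_date = '2023-06-01';

    -- Validate the result on the branch, e.g., row counts and quality checks
    SELECT COUNT(*) FROM arctic_catalog.insurance.claims AT BRANCH claims_q2_update;

    -- Once checked, publish by merging the branch back into main
    MERGE BRANCH claims_q2_update INTO main IN arctic_catalog;

Until the merge, data consumers continue to query the main branch and see only the last published, consistent version of the data.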
Figure 11 is a variation on Figure 9 and shows how Dremio Sonar and Dremio Arctic can enable decentralized teams of domain-oriented data engineers to create their own workspaces to build virtual data products in a data mesh on consistent data in an Iceberg lakehouse. It is Dremio’s intention to extend Arctic in the near future to offer it as a source for on-premises object stores, which would allow Dremio to use Sonar to create a universal semantic layer for self-service analytics across data in a hybrid open lakehouse.
Figure 11
Dremio Sonar allows virtual data products to be tagged and organized to make them easier to find, understand and use

Note that the data products in the shared space data marketplace could be organized. For example, they could be organized into:

• Master data products (these are cross-domain – e.g., customers)

• Transaction data products (e.g., sales, shipments, returns)

• Aggregate data products (domain specific)
Dremio Sonar also provides full data lineage to explain how data products have been created

Once data products have been created, data owners can be assigned, and policies created to govern access to this data before data products are made available to consumers. Federated data governance can be implemented by enforcing enterprise-wide data privacy policies to protect personally identifiable information (PII) data everywhere, and by implementing domain-specific data product access security policies (set by decentralized domain data product owners) to govern access to specific data products. Dremio Sonar can enforce both data privacy and data access security at run time to implement federated computational data governance in a data mesh.
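As a hedged illustration of what this kind of computational governance can look like in SQL, the sketch below applies a PII-masking rule and a domain-owned access rule inside a data product view. All object names are hypothetical, the is_member()-style group check is shown purely as an illustration, and the exact policy mechanism (for example, dedicated masking and row-access policies) differs by platform.

    -- Illustrative governed view: PII is masked unless the user is in the right group
    CREATE VIEW marketplace.customer_data_product AS
    SELECT
        customer_id,
        customer_segment,
        CASE WHEN is_member('customer_domain_stewards')  -- hypothetical group check
             THEN email
             ELSE 'REDACTED'
        END AS email,
        country
    FROM insurance.customers_prepared
    WHERE country IN ('UK', 'IE')                        -- domain-owned access rule
       OR is_member('global_analytics');

Because the rules live in the view definition, every tool that queries the data product through the semantic layer gets the same privacy and access behavior at run time.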
Common data names make it possible to create a universal semantic layer to drive consistent understanding across all tools

Finally, as shown in Figure 4, by using common data names for all data products, all virtual data marts, and all virtual aggregate views, it is possible to establish a universal semantic layer in Dremio Sonar to guarantee that all BI tools and data science notebooks see consistent data names and definitions.
CONCLUSIONS
Insatiable demand to analyze more and more data has resulted in IT becoming a data engineering bottleneck

For many companies today, the demand to analyze more data from more data sources by more business users has become so great that IT can’t keep pace with requests to engineer data. This has led to IT becoming a bottleneck to producing richer business insights. To overcome this bottleneck, businesses are looking to upskill people around the enterprise to increase the number of data engineers.
Companies want to democratize and coordinate decentralized data engineering activity to produce reusable data products for use in self-service analytics and data science

At the same time, they want to avoid the mistakes of the past, including the chaos of ‘every person for themselves’ self-service data preparation using many different tools, and siloed analytical systems with overlapping subsets of the same data being repeatedly re-engineered for different data warehouses, data marts, graph databases, and machine-learning models. Instead, they are looking to move towards a decentralized but coordinated, incremental development approach to data engineering where different business domain-oriented teams of data engineers around the enterprise integrate data to produce reusable data products and make them available for sharing. This is the data mesh approach, which aims to build data products once and reuse them for many different analytical use cases, including data warehouses and data science. The data mesh approach separates the data producers, who engineer data to create data products, from the data consumers, who select the data products they need and consume and assemble them for use in self-service analytics and data science.

They also want to provision data without the need to copy it

The challenge is to avoid every consumer wanting copies of data products. Therefore, a zero-data-copy approach to provisioning data products is needed.
Data Mesh enables incremental development of data products and should help to progressively shorten time to value

In addition, consumers can produce new data products (e.g., aggregate data) and new analytical products (e.g., ML models, BI reports) and also publish these. The benefit of data mesh is that, as more data (and analytical) products are incrementally created, consumers should be able to deliver value faster because more and more of what they need is already available for reuse.
A common self-service data platform is needed to coordinate decentralized data product development and facilitate data sharing in a data mesh

With so much demand, companies need a common platform to build data products quickly and easily, where projects can be coordinated, where decentralized teams can engineer data in isolation using DataOps (DevOps applied to data) practices, and where metadata can be shared. In addition, they need to connect to a number of data sources and either leave the data where it is and engineer it from there, or move it to a centralized data lake where data consistency must be upheld. If it is the latter, cloud storage is not enough.

An open table format and catalog are needed if firms want to integrate data warehouse and data science workloads on a data lakehouse

A lakehouse is needed that supports ACID transactions, that can cater for schema evolution, that supports managed data ingestion, and where teams can engineer data in their own workspaces in isolation until it is ready for use. Furthermore, once produced, data products need to be published in a shared data marketplace where consumers can request access, where access security and data privacy are governed, and where data can be provisioned without the need to copy it for use in self-service analytics.

A universal semantic layer is also needed to drive common understanding of data across multiple tools

Also, common business data names need to be enforced to create a universal semantic layer so that all tools accessing this data see consistent and meaningful business data names. Dremio provides all these things and can be used to industrialize the development of data products in a data mesh. For companies looking for software to help them do this, it should be a contender on anyone’s shortlist.
Author
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited.
As an independent IT industry analyst and consultant, he specializes in BI /
analytics and data management. With over 40 years of IT experience, Mike has
consulted for dozens of companies on BI/Analytics, data strategy, technology
selection, enterprise architecture, and data management. Mike is also
conference chairman of Big Data LDN, the largest data and analytics conference
in Europe and a member of the EDM Council CDMC Executive Advisory Board.
He has spoken at events all over the world and written numerous papers and
articles. Formerly he was a principal and co-founder of Codd and Date – the
inventors of the Relational Model, a Chief Architect at Teradata on the Teradata
DBMS, and European Managing Director of Database Associates. He teaches
popular master classes in Data Warehouse Modernization, Big Data,
Centralized Data Governance of a Distributed Data Landscape, Practical
Guidelines for Implementing a Data Mesh, Machine-learning and Advanced
Analytics, and Embedded Analytics, Intelligent Apps and AI Automation.
All diagrams sourced from Dremio and used in this paper remain
the copyright and intellectual property of Dremio