
WHITE PAPER

Simplifying Data Mesh for Self-Service Analytics on an Open Data Lakehouse

By Mike Ferguson
Intelligent Business Strategies
June 2023

Research sponsored by: Dremio



Table of Contents
Introduction
Requirements to Accelerate the Development of Data Products
    What is data mesh?
    Organizational requirements to accelerate development
    Standardization requirements to accelerate development
    Data product development requirements
    Data product version control requirements
    Self-service analytics semantic layer requirements
Building a Data Mesh for Self-Service Analytics on Dremio's Open Data Lakehouse
    Dremio Sonar
        What is a Data Lakehouse?
        Creating a Universal Semantic Layer with Dremio Sonar
    Dremio Arctic
        What is Apache Iceberg?
        What is Dremio Arctic?
    How can Dremio be used to build a data mesh?
        Phase 1 – Unify Data Access
        Phase 2 – Deliver a Data Lakehouse
        Phase 3 – Enterprise Data Mesh
Conclusions


INTRODUCTION
There has been a frenzy of activity in data and analytics over the last decade. Also, the number of new data sources has increased rapidly.

In the last decade we have seen a frenzy of activity when it comes to data and analytics. Many new technologies have emerged over that time, including new data science and data engineering tools, new database products, and data management platforms. Also, the arrival of self-service data preparation, augmented business intelligence tools, and machine-learning (ML) automation has helped lower the skills bar needed to use these technologies. In addition, the number of new data sources that companies want to access has exploded as people look to add to what they already know to produce richer insights for better decision making.

Many companies have overspent because of stand-alone data and analytical initiatives.

With so much going on, it's not surprising that different departments in many companies have embraced all these technologies in their determination to deliver new value. However, different stand-alone data and analytical initiatives across large and medium-sized enterprises have resulted in piecemeal adoption of technologies and a fractured approach to data and analytics development. It is common to see multiple overlapping technologies in different parts of the enterprise. Re-invention and data redundancy have also occurred. As a result, many organizations now have multiple stand-alone analytical systems, such as multiple data warehouses, multiple data lakes used in data science, graph databases, and streaming analytics initiatives. There are also multiple overlapping and competing toolsets. All of this has led to platform complexity and inadvertent overspend.

Also, considerable data redundancy and multiple overlapping technologies exist across many analytical systems.

The problem with this is that stand-alone departmental and line-of-business data and analytical development initiatives have meant that different teams are making use of different data engineering tools to clean and integrate data for different analytical use cases. In some cases, they are even sourcing and engineering the same data from the same data stores for these different analytical use cases. Also, stand-alone self-service analytics initiatives have led to many different self-service data preparation jobs, BI reports, dashboards, and machine-learning models being produced using different data preparation, BI and data science tools across the enterprise. All of this has led to managing tools and data platforms that do not integrate with each other, and to overspending. Also, very little of what has been created is published anywhere, and so people who could benefit are often unaware of valuable data and insights that have already been created across the enterprise.

Organizations have made progress, but progress is slower than desired, integration is poor, and costs are running higher than they should be.

In summary, while many organizations have been busy and progress has been made, what is being created under the banner of data and analytics is not joined up. The pace of development is slower than desired, and the way in which artifacts like ML models, BI reports and dashboards are produced is somewhat untidy, inefficient, and complex to understand. This leads to cost that is higher than it should be.

As the pace of business quickens, executives are demanding a strategy-aligned, industrialized development approach to shorten time to value.

However, we are now in an era where the pace of business is quickening, and executives are demanding rapid development to compete in a data-driven digital economy. They want to move away from fractured development initiatives to an industrialized development approach that accelerates data engineering and enables the sharing of data, ML models, and business intelligence that are all being created to align with business strategy. To do that means more people are needed, but we also need to organize them to build data and analytical products that can be shared, reused, and assembled more rapidly, in a similar way to a manufacturing production line. A good example of this is the emergence of a Data Mesh [1] and reusable data products.
Data Mesh is a divide and conquer approach to data engineering aimed at accelerating development.

The general idea behind data mesh is to move to a decentralized, 'divide and conquer' approach to data engineering to speed up development. That means upskilling more people in different business domains around the enterprise to enable them to produce high-quality, reusable, compliant datasets known as data products. Data products are defined [2] as being discoverable, addressable, trustworthy, self-describing, interoperable and secure. They include the data itself, the pipeline (rules to clean and integrate data), the runtime specification to execute the pipeline, and APIs. They can be physically stored or represented as virtual views of data in multiple sources that integrate data on-demand to produce the required data product.
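To make the anatomy of a data product concrete, the sketch below models such a contract as a simple Python data structure. It is purely illustrative: the field names, the registry idea and the example values are assumptions of ours, not part of the data mesh definition or any vendor API.

    # Hypothetical sketch of a data product "contract"; field names and values
    # are illustrative only, not a standard or a product-specific API.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataProduct:
        name: str                  # business data name, e.g. "claims"
        domain: str                # owning business domain
        owner: str                 # accountable data product owner
        identifier: str            # enterprise-wide identifier attribute
        schema: List[str]          # common business data names for all attributes
        pipeline_sql: str          # rules to clean and integrate the source data
        runtime: str               # where the pipeline executes, e.g. "spark"
        api_endpoint: str          # how consumers address the product
        virtual: bool = True       # virtual view vs. persisted dataset
        tags: List[str] = field(default_factory=list)

    claims = DataProduct(
        name="claims",
        domain="claims_management",
        owner="claims.data.owner@insurer.example",
        identifier="claim_id",
        schema=["claim_id", "policy_id", "customer_id", "claim_date", "claim_amount"],
        pipeline_sql="SELECT ... FROM source_a JOIN source_b ...",  # elided for brevity
        runtime="spark",
        api_endpoint="https://lakehouse.example/products/claims",
        tags=["insurance", "transaction-data"],
    )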

A data concept model can be used to identify data products to produce.

An easy way to think about data products is to think about data concepts in a data concept model. For example, data concepts in an insurance company would include customers, insurance brokers, claims assessors, insurance policy applications, quotes, policy agreements, premium payments, claims, etc.

Multiple teams of decentralized data producers create pipelines to produce reusable data products.

The objective of a data mesh is to enable different teams of data producers to clean, transform and integrate data to create a data product for each of these. So, in this insurance example, you would have a data product holding all claims data, another data product holding all premium payments, and another holding all customers, etc. These data products are 'building blocks' that are made available for sharing and reuse in multiple analytical systems.

Data products can be consumed and used in multiple analytical workloads.

This is a very different approach to building pipelines to populate a monolithic data model in a single analytical system like a data warehouse. In a data mesh approach, work is divided up among different teams of business domain-oriented data engineers, who are experts in specific kinds of data (e.g., claims data), and who use self-service common data management software to build pipelines that transform and integrate data to produce specific data products. These data products can then be assembled and used in multiple analytical workloads to drive business outcomes.

There are several requirements that need to be met to make this possible.

In this fast-moving digital economy, the question is, how do you implement this? How can you move away from the fractured approaches of the last decade and industrialize the development of data and analytics to rapidly build a data- and AI-driven enterprise? What are the requirements that need to be met to make this possible? How do data mesh and data products fit in, and what else is needed?

This paper seeks to answer these questions by defining the requirements needed to make it possible. It then looks at how one vendor, Dremio, can help organizations reduce the cost of operating, accelerate data engineering, and shorten time to value.

[1] Data Mesh – Delivering Data Value at Scale, Zhamak Dehghani, O'Reilly, ISBN 9781492092391
[2] How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, Zhamak Dehghani, May 2019


REQUIREMENTS TO ACCELERATE THE DEVELOPMENT OF DATA PRODUCTS
Before getting into requirements, let’s first look at what Data Mesh is.

WHAT IS DATA MESH?


Data Mesh is a decentralized approach to self-service data engineering to produce reusable data products available to share across the enterprise.

Data Mesh is a decentralized data engineering approach created by Zhamak Dehghani in 2019. There are four principles of Data Mesh:
1. Domain-oriented decentralized data ownership and architecture
2. Data as a product
3. Self-serve data infrastructure as a platform (common platform)
4. Federated computational data governance

In this approach, decentralized teams of domain-oriented subject matter experts most familiar with data used in their business domain build pipelines to produce reusable data products (datasets) that they own and make available for others to use across the enterprise. To simplify things, self-service development is done on common data infrastructure software. Data governance is federated since enterprise-wide data privacy policies must be applied to sensitive data in all data products while data owners control access to data products they create. Some data governance policies must therefore apply enterprise-wide, and some are controlled locally in the domains.

The following requirements need to be met if companies are to implement a data mesh and accelerate data product development in a data-driven enterprise.

ORGANIZATIONAL REQUIREMENTS TO ACCELERATE DEVELOPMENT


Getting the right organizational setup in place is critical to success.

The first major requirement to move away from siloed data engineering and analytics development initiatives is to address the organization problem. While it is accepted that different parts of the business need to deliver value in their respective areas, it has to be part of a bigger plan. We have to consider this as a jigsaw puzzle where different teams are building different pieces that align and come together to achieve common business goals. Therefore, development needs to be coordinated as opposed to everyone working on isolated stand-alone projects. That requires a federated organizational setup with a central program office that acts as a 'base camp' for all data and analytical projects. This is shown in Figure 1 and is typically run by the office of the Chief Data Officer.

A federated setup is needed, with a CDO-run central program office coordinating development activity across multiple decentralized teams and aligning it to business strategy. Development of data products should all be aligned with strategic business priorities to achieve high-value business outcomes.

Figure 1


A centralized program office supporting decentralized data engineering teams offers several benefits:
• It provides a focal point for the company on data and analytics. It ensures that each data and analytical project is aligned to one or more business objectives in the company business strategy.
• Prioritization of data product, BI and ML model development can be dictated by and aligned with strategic business priorities to help achieve specific high-priority business outcomes.
• It enables collaborative planning and coordination of all data product, BI (reports, dashboards), and ML model development projects across multiple domains while avoiding reinvention and siloed engineering.
• It provides teams with an understanding of the 'big picture' on who is building what right now, what is still outstanding, and how everything fits together and is progressing towards helping achieve specific business outcomes.

A center of enablement embeds data engineering experts in decentralized domain-oriented teams. Their job is to upskill citizen data engineers while ensuring adoption of best practices.

A centralized program office can also function as a center of enablement from which professional IT data engineering experts can be embedded in business domain-oriented teams of data producers to help upskill citizen data engineers and guide them to adopt best practices. This federated organizational setup is critical to ensuring progressive incremental development of a data-driven enterprise.

STANDARDIZATION REQUIREMENTS TO ACCELERATE DEVELOPMENT


Standardization is needed to industrialize data product development and avoid unnecessary complexity.

The second major set of requirements in accelerating data product development is to lay the groundwork for rapid and consistent development of data products and to avoid unnecessary complexity across teams. Standardization (see Figure 2) should improve development productivity and enable data and metadata sharing across teams of domain-oriented data producers without getting in their way. It should also make everything easier to maintain over the longer term.

All domain-oriented teams of data producers should use a common data platform technology. Build data products once and share them everywhere.

Figure 2
In terms of standardization requirements, it should be possible to:
• Have all domain-oriented teams of data product producers make use of a common data platform to build and manage data products
• Engineer data to build both virtual and physical data products
• Produce common master data products (e.g., customers, products, employees, suppliers, materials, assets, etc.) once and share them everywhere to avoid them being reinvented unnecessarily
• Leave source data where it is and define rules and mappings to transform and integrate that data to create virtual data products (virtual views of integrated data) and physical data products (persisted datasets)
• Provide the option to ingest source data into a managed lakehouse (preferably one that offers open table formats, ACID [3] support, schema evolution, columnar file formats, data partitioning, indexing, fine-grained access control and support for deploying data as code) to ensure consistent data, and then enable different teams of domain-oriented data producers to engineer that data for self-service analytics
• Use a common approach to managing and governing source data ingestion, engineering data and publishing of all data products, so that data is processed and managed in a consistent way no matter which teams are developing data products
• Enable the sharing of standard templates and common services across multiple teams to simplify and expedite development while also ensuring adoption of best practices
• Adopt a common federated approach to governing access to published data products made available for sharing
• Enable the sharing of data products in a standard way

In short, data products representing master data should be used enterprise-wide; you can connect directly to multiple data sources and integrate data to create virtual or physical data products, or alternatively ingest source data into a managed lakehouse first before building data products; a common approach is needed to produce data products; templates and common services help speed up development; and data products should be shared and governed in a standard way.

The use of a common data platform enables each team of data producers to create their own workspace where they can find and connect to raw data, and then transform and integrate it in isolation to create data products. In addition, they would also be able to share business-ready data and metadata across teams in a collaborative environment. Figure 3 shows how the above requirements can help standardize the setup to establish a common approach across multiple domain-oriented teams of data producers.

A common data platform enables teams to build business-ready data products in isolation and easily share data and metadata with others. A standard setup is needed for all teams.

Figure 3

[3] Atomicity, Consistency, Isolation and Durability


In this way data products can be produced by connecting directly to data sources
and transforming and integrating data.
When using a data lake, data ingestion needs to be managed and source data checked before it is made available for use. Data producers need to have their own working environment to build data products.

However, if an organization wishes to ingest data into a central data lake (e.g., cloud object storage), it should also be possible to manage data ingestion in a common way so that new data is not made available to teams of data producers until it is quality checked. Managed data ingestion would also enable data ingestion to be tracked and prevent the same data from being ingested more than once. This is particularly important when it comes to external data that has been purchased. Once ingested and made available, all teams could then access a common ingestion zone and have their own workspaces (working environment) to create new data products or new versions of data products without impacting on any other data producers or data consumers.

DATA PRODUCT DEVELOPMENT REQUIREMENTS


Incremental development of semantically linked master, transaction and aggregate data products is needed.

Each team of domain-oriented data producers should be able to connect to, query, transform and integrate data from one or more sources to create a data product that can be easily shared across the enterprise. Examples of master data products might be customers, products, and suppliers. Examples of transaction data products would be orders, shipments, payments, returns, etc. In this way, multiple decentralized teams of data engineers can use a common data platform to incrementally create a set of semantically linked, reusable data products that can be published and made available to use in multiple analytical workloads. To create a data product, it should be possible for each team to use the common data platform to:
• Define common business data names that describe all attributes in the data product to be created. It is preferable that this should be documented as business metadata in a business glossary
• Ensure that the business data names defined for the data product include an enterprise-wide identifier
• Define and create a data product schema using the common business data names as column names in virtual or physical table structures
• Connect to and automatically discover data in domain data sources to identify the data needed to build a data product
• Map the physical data names of the required data in the data sources to the business data names used in the data product to be created
• Access identified source data where it resides or ingest the source data into a central or domain-oriented data lake using a managed service. If ingestion occurs, then the ingested data will be the source data used to produce the desired data product
• Engineer the identified source data in domain-oriented workspaces to produce the required virtual or physical data product
• Ensure any personally identifiable data contained in the data product is masked during data integration so the data product created is compliant with enterprise-wide data privacy legislation. This ensures data governance is designed into the data engineering process.
• Appoint and tag the data product with an owner
• Grant the owner privileges to enable them to approve/reject access requests to the data product
• Publish domain-oriented data products in a shared workspace or marketplace for others to request access to and use

In short, all data products should use common business data names and enterprise-wide identifiers; source data should be automatically discovered, cataloged and mapped to the data product's business data names; data should be engineered with any sensitive data protected; and each data product should be assigned an owner and published in a data marketplace to make it available for sharing. A minimal sketch of these steps follows.
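The sketch below walks through these steps for a hypothetical claims data product using PySpark and an Iceberg table. It is illustrative only: the source path, column names, catalog name and the way the owner is recorded are all assumptions, not a prescribed design.

    # Illustrative sketch of building a physical data product with PySpark.
    # Paths, column names and the "lakehouse" catalog are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "lakehouse" is configured

    # 1. Discover and read the physical source data (hypothetical path).
    raw_claims = spark.read.parquet("s3://ingest-zone/claims_system/claims/")

    # 2. Map physical data names to common business data names,
    #    including the enterprise-wide identifier (claim_id).
    claims = (raw_claims
        .withColumnRenamed("clm_no", "claim_id")
        .withColumnRenamed("pol_no", "policy_id")
        .withColumnRenamed("cust_no", "customer_id")
        .withColumnRenamed("clm_amt", "claim_amount"))

    # 3. Mask personally identifiable data during integration.
    claims = claims.withColumn("claimant_name", F.sha2(F.col("claimant_name"), 256))

    # 4. Persist the physical data product (a virtual view is the alternative).
    claims.writeTo("lakehouse.claims_domain.claims").createOrReplace()

    # 5. Record an owner; a table property is used here purely as an illustration.
    spark.sql("""
      ALTER TABLE lakehouse.claims_domain.claims
      SET TBLPROPERTIES ('owner' = 'claims.data.owner@insurer.example')
    """)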

DATA PRODUCT VERSION CONTROL REQUIREMENTS


With so many new data sources, the chances of new data emerging that would enrich what you already have are very high. Therefore, the demand for change is inevitable, which means data product version control is needed. Also, firms have long struggled with the impact of change in traditional data warehouse architectures and the domino effect it causes, resulting in long change times. To address this, it should also be possible to support:
• Data product lifecycle management. This would enable new data sources to be added, schema changes tracked, and new enriched versions of data products created that can offer more value.
• Building data integration pipelines to produce new versions of data products in isolation, without the need to copy data and without impacting data consumers in production while new versions are being developed.

SELF-SERVICE ANALYTICS SEMANTIC LAYER REQUIREMENTS


It should be possible to use the common data platform to:
• Set policies to govern data sharing using role- and attribute-based access control
• Enable consumers to search for, find and request access to relevant data products, subject to approval from data owners and acceptance of data sharing terms and conditions
• Provision data products without the need to copy data, as copies introduce security risks, are hard to govern, and lead data consumers to question the validity of the data
• Enable consumers to select and combine relevant semantically linked data products, without the need to copy data, to build:
o Star schemas and a semantic layer for use in self-service analytics
o Curated data for data scientists to develop machine-learning models

In short, access to data products needs to be governed; they need to be easy to find; they should be provisioned without the need to copy data; and consumers should be able to select the data products they need for different analytical use cases.

For example, to analyze sales in a retailer by store, product, and customer over time it should be possible to access:
o The channels data product (e.g., stores, website)
o The customers data product
o The products data product
o The sales data product
o The time data product

and assemble them into a star schema as shown in Figure 4. Note that because the data products have enterprise-wide identifiers, they are semantically linked and so should snap together easily to enable self-service analytics. Using common data names also ensures a common semantic layer is created, and so all tools see the same data names (a minimal sketch of such an assembly follows).
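To make the assembly step concrete, the sketch below joins hypothetical data products on their enterprise-wide identifiers to form a star-schema view. It assumes the data products are already visible as tables or views in a catalog named lakehouse and uses Spark SQL purely as an example engine; in practice the view would be created as a shared view in whichever platform hosts the semantic layer.

    # A minimal sketch of assembling semantically linked data products into a
    # star schema; catalog, schema and column names are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Enterprise-wide identifiers are the join keys that let products "snap together".
    spark.sql("""
      CREATE OR REPLACE TEMP VIEW sales_star AS
      SELECT s.sale_id, s.sale_amount,
             c.customer_name, p.product_name, ch.channel_name, t.calendar_date
      FROM   lakehouse.products.sales     s
      JOIN   lakehouse.products.customers c  ON s.customer_id = c.customer_id
      JOIN   lakehouse.products.products  p  ON s.product_id  = p.product_id
      JOIN   lakehouse.products.channels  ch ON s.channel_id  = ch.channel_id
      JOIN   lakehouse.products.time      t  ON s.date_key    = t.date_key
    """)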


By using common business data names and global identifiers, data products become the building blocks to incrementally build a universal semantic layer for self-service analytics.

Figure 4

All BI and data science tools can access commonly understood data via a common semantic layer.

• Enable consumers to define different flattened views of data in star schemas to simplify access in the semantic layer
• Enable access to data via the common semantic layer from BI and data science tools for use in self-service analytics and ML model development


BUILDING A DATA MESH FOR SELF-SERVICE ANALYTICS ON DREMIO'S OPEN DATA LAKEHOUSE
Having understood the requirements to implement a data mesh, this section of the paper looks at how one vendor is stepping up to meet these requirements. That vendor is Dremio.

Dremio has two main products. These are:
• Dremio Sonar
• Dremio Arctic

DREMIO SONAR
Dremio Sonar is a data lakehouse query engine that supports self-service
analytics with data warehouse functionality and data lake flexibility. Before
looking at Dremio Sonar, let’s first define what a data lakehouse is.
What is a Data Lakehouse?
A data lakehouse is a single platform that provides the scale and flexibility of a data lake together with the data integrity, query capability and fine-grained governance of a data warehouse.

For many years data warehouses have been used to organize and store data in relational tables in a proprietary format tied to a specific data warehouse database management system (DBMS) for query using SQL. This means data is only accessible by that specific DBMS engine. Data lakes, on the other hand, have stored data in files that could be accessed by multiple engines. Historically, this data was primarily analyzed by data scientists using programs written in popular languages such as Python and R. Then the data lakehouse emerged to provide the scale and flexibility of a data lake with columnar file formats, ACID, open table formats (e.g., Iceberg) with fine-grained governance, Spark and SQL data transformation capability, and SQL query performance that is becoming competitive with data warehouses. With a data lakehouse, multiple engines can access and manipulate data directly from data lake storage without compromising data integrity and without the need to copy data into a proprietary format specific to a single data warehouse DBMS.
Creating a Universal Semantic Layer with Dremio Sonar
Dremio Sonar is a lakehouse query engine for self-service analytics.

Dremio Sonar is an Apache Arrow-based columnar lakehouse query engine that supports BI reporting, dashboards and interactive analytics over data in data lake storage, as well as unified access to data in cloud object storage and other data sources. Using Dremio Sonar, self-service interactive BI tools can access data via unified views that use common business data names to provide a common semantic layer across all self-service analytics tools. It enables organizations to run high-performance queries on data lakes, simplify access to data in multiple underlying data stores, modernize their data architecture and facilitate self-service creation of data products anywhere, be it on-premises, hybrid, or cloud.

Teams can create a logical view of the data for data consumers, with the physical data residing in cloud data lakes (S3, ADLS), on-prem data lakes (HDFS), metastores (Nessie, Hive, AWS Glue), relational databases (Snowflake, Redshift, etc.) and NoSQL databases, or in other Dremio instances accessed via a Dremio-to-Dremio connector.

It supports virtual views that use SQL to transform and integrate data.

Dremio Sonar's key features include:
• Virtual views: Data consumers can transform and integrate data via SQL-based virtual views. Users can take advantage of an integrated SQL editor and low/no-code interface that auto-generates federated SQL queries to filter, extract, transform and join data in multiple underlying data sources.
• Spaces: Data teams can create a single, consistent view of data for data consumers. They can organize views of data products in a convenient hierarchy of folders to make it easy to browse and find data. Privileges can be set on a folder anywhere in the hierarchy to ensure data products are shared securely within the organization.
• Reflections: Dremio transparently accelerates queries using reflections so users can work with data logically without needing to worry about physical tables/copies to make queries run quickly. Even as additional views are added, Dremio's optimizer can intelligently utilize existing reflections to accelerate queries on those new views.
• Integrated catalog: An integrated catalog enables users to discover data products, understand their meaning and access built-in lineage to visualize how they were constructed. Users can also add descriptions and tags to datasets for others to leverage. The catalog information is exposed through a UI that simplifies data discovery and SQL editing.
• Collaboration: Users can collaborate by securely sharing data products with other users/teams with the click of a button (or a SQL command). In addition, users can collaboratively author metadata such as descriptions and tags.

Views can be organized into spaces to make data easy to find and to provide secure access. Metadata is available to show the meaning of data and how data products were constructed, and data can be described and tagged to make it easier to collaborate and share securely. A sketch of querying such a view programmatically follows.
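As a hedged illustration of how a BI or data science client might run SQL against such views programmatically, the sketch below uses an Arrow Flight endpoint from Python with pyarrow. The host, port, credentials and the view name are hypothetical placeholders, and we assume Arrow Flight access is enabled on the deployment; this is a sketch, not prescribed client code.

    # Hedged sketch: run a query against a published virtual view over Arrow Flight.
    # Endpoint, credentials and object names below are hypothetical placeholders.
    from pyarrow import flight

    client = flight.FlightClient("grpc+tls://dremio.example.com:32010")
    bearer = client.authenticate_basic_token("analyst", "secret")   # (header, token) pair
    options = flight.FlightCallOptions(headers=[bearer])

    sql = 'SELECT customer_id, claim_amount FROM marketplace."claims"'
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    table = client.do_get(info.endpoints[0].ticket, options).read_all()
    print(table.num_rows, "rows returned as an Arrow table")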

Figure 5: Dremio Sonar (Source: Dremio, Copyright © Dremio Corporation 2023)


Dremio Sonar includes support for:
• Connectors to access data in the following data sources:
o Cloud data lakes
§ AWS S3, Azure Blob Storage, Azure Data Lake Storage (ADLS), and Google Cloud Storage
o On-premises data lakes
§ Hadoop distributed file system (HDFS) and S3-compatible object storage (e.g., MinIO, Dell)
o Relational databases
§ Examples include Amazon Redshift, IBM Db2, Microsoft Azure Synapse Analytics, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, Snowflake, Teradata
o NoSQL databases
§ MongoDB
o Open table formats
§ Apache Iceberg tables, Delta tables
o Metastores
§ Nessie
§ Hive tables accessing datasets on HDFS, S3 and ADLS
§ AWS Glue Catalog
o Elasticsearch
o NAS
o Other Dremio Sonar clusters (nested architecture)
• Dremio Sonar virtual tables and views
• A cost-based query optimizer
• Pushdown optimization to make back-end databases and lakehouses do the work to retrieve and filter the required data for queries
• Query acceleration to boost query performance using:
o Dremio Reflections - these are optimized materializations of source data or queries that are held in a reflections data store. Both raw and aggregated reflections can be created to improve query performance
o Columnar Cloud Cache (C3) based on Apache Arrow

Both mechanisms are used by the Dremio Sonar optimizer to accelerate popular BI tool self-service queries and are transparent to users. Dremio Sonar thus supports a range of cloud and on-premises data sources, and there is also a Dremio-to-Dremio connector to query data managed by multiple Sonar instances. Columnar cloud cache, Dremio reflections and pushdown optimization all help improve query performance.

Figure 6 shows how domain-oriented data producers can use Dremio Sonar to integrate data from multiple data sources to create virtual data products in a data mesh.

Figure 6: Teams of data producers use Dremio Sonar to integrate data from multiple data sources to create virtual data products in a data mesh (Source: Dremio)

Note that Dremio maintains full lineage showing transformation and integration
mappings between sources and virtual views. With respect to data products in a
data mesh, data consumers can see the lineage on how virtual views
representing data products are constructed. Figure 7 shows a lineage example
in Dremio Sonar.


Figure 7: A lineage example in Dremio Sonar (Source: Dremio)

DREMIO ARCTIC
Before proceeding with Dremio Arctic, let’s first explain what Apache Iceberg is.
What is Apache Iceberg?
Data lakes were originally created for use in data science.

Enterprises are increasingly relying on data lakes for data science, and over the last several years companies have amassed considerable amounts of data in data lakes, typically held in columnar-oriented Parquet files. First-generation data lakes were often on-premises, storing data in files in Hadoop HDFS. Today, those files are more likely to be stored on cloud object storage. However, while cloud object storage data lakes gave companies scalability, they have often been lacking in governance, which is something almost always found in auditable systems like data warehouses. Lack of governance often meant that it wasn't long before data in these data lakes started to become inconsistent. Data scientists would also often copy data for their own use, resulting in significant data redundancy. There was little understanding of what was in these data lakes and so they began to deteriorate and were dubbed 'data swamps'.

Apache Iceberg is an open table format for files in object storage. It overlays tables and metadata on top of columnar compressed data files in cloud object storage.

Apache Iceberg is an open table format first developed by Netflix. It is an open-source project with contributors from companies such as Apple, Netflix, AWS, Snowflake, and Dremio. Apache Iceberg offers important capabilities that enable multiple applications to collaboratively work on data while ensuring transactional consistency. Additionally, Iceberg provides insights into dataset evolution and changes over time.

The Iceberg table format is similar to tables found in traditional relational databases. The difference is that it is open, which means that multiple engines like Dremio and Spark can query, process, and maintain data in the same underlying dataset via Iceberg tables without compromising data integrity.

Iceberg offers a wide range of features, including:
• Transactional Consistency: Iceberg ensures transactional consistency across multiple applications. Files can be added, removed, or modified atomically, providing full read isolation and allowing for multiple concurrent writes.
• Schema Evolution: This enables Iceberg to track changes to a table over time to adapt to and accommodate evolving data requirements.
• Time Travel: With Iceberg, you can query historical data and verify changes between updates using the time travel feature. This provides insights into the data's evolution and facilitates auditing and analysis.
• Data Partitioning and Evolution: Iceberg facilitates updates to partition schemes as queries and data volumes change. This ensures flexibility and efficiency in managing large datasets.
• Rollback: Iceberg allows you to quickly correct issues by returning tables to a known good state by rolling back to prior versions. This offers guaranteed data consistency and data integrity and simplifies data correction.
• Parallel Query Processing and Filtering: Iceberg open tables can be overlaid on compressed, columnar file formats such as Parquet and ORC. It also supports data skipping, statistics and automated hidden partitioning. All of this enables massively parallel query execution, performance and filtering on very large data volumes.

Iceberg ensures the data is transactionally consistent even when data is maintained using multiple query engines. It supports schema evolution, time travel, rollback, data partitioning and evolution, as well as parallel query processing and filtering.

These features make Iceberg a robust and flexible table format for managing data in data lakes, providing improved data consistency, schema management, historical analysis, partitioning flexibility, data rollback, and enhanced query performance.
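A short sketch of what several of these capabilities look like in practice is given below, using Spark SQL with the Iceberg extensions. The catalog name, warehouse path, table and snapshot id are illustrative assumptions.

    # Illustrative only: Iceberg DDL, schema evolution, time travel and rollback
    # via Spark SQL. Catalog name, paths, table names and snapshot ids are made up.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lakehouse.type", "hadoop")
        .config("spark.sql.catalog.lakehouse.warehouse", "s3://lake/warehouse")
        .getOrCreate())

    # Create a table with hidden partitioning on the claim timestamp.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS lakehouse.claims_domain.claims (
        claim_id BIGINT, claim_ts TIMESTAMP, claim_amount DECIMAL(12,2))
      USING iceberg
      PARTITIONED BY (days(claim_ts))
    """)

    # Schema evolution: add a column without rewriting data files.
    spark.sql("ALTER TABLE lakehouse.claims_domain.claims ADD COLUMN assessor_id BIGINT")

    # Time travel: inspect snapshots, then query the table as of an earlier snapshot.
    spark.sql("SELECT snapshot_id, committed_at FROM lakehouse.claims_domain.claims.snapshots").show()
    spark.sql("SELECT * FROM lakehouse.claims_domain.claims VERSION AS OF 1234567890").show()

    # Rollback: return the table to a known good snapshot.
    spark.sql("CALL lakehouse.system.rollback_to_snapshot('claims_domain.claims', 1234567890)")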

Metadata can be tracked through point-in-time snapshots, and table changes tracked within each snapshot.

Iceberg achieves these capabilities for a table by utilizing metadata files, known as manifests, which are tracked through point-in-time snapshots. As the table is updated over time, Iceberg maintains all the deltas, allowing for a comprehensive view of the table's schema, partitioning, and file information within each snapshot. This approach ensures full isolation and consistency and enables scalable parallel query performance on large volumes of data stored in data lakes.

Iceberg's open table format enables multiple different engines to query, process and update data in underlying data files while ensuring transaction validity and data integrity.

Iceberg also employs a hierarchical structure to organize snapshot metadata efficiently, which enables rapid changes to tables without the need to redefine all dataset files. In addition, it maintains the complete history within the table format itself, independent of any specific storage system. This design choice enables an open table format that provides the flexibility to change and accommodate multiple query engines on top of your data in object store without disrupting users or existing workloads.

The immutability of the historical state and the tracking of history in Iceberg enable users to query prior states at any snapshot or historical point in time. This allows for consistent results, facilitates comparisons, and offers the ability to roll back to previous versions to address any issues or corrections needed.

In summary, Iceberg's utilization of metadata files and point-in-time snapshots, along with its hierarchical organization and open architecture, ensures efficient management of table changes, parallel query performance at scale from multiple query engines, and the ability to query and roll back to historical states.


What is Dremio Arctic?


Dremio Arctic is a lakehouse management service for managing Apache Iceberg tables that overlay files in cloud storage data lakes.

Dremio Arctic is a data lakehouse management service for Apache Iceberg tables that overlay files held in cloud object storage (e.g., AWS S3). It is a metadata service powered by Project Nessie, a cloud-native metastore that provides automatic optimization for Iceberg tables and 'data as code' capabilities that enable data teams to build, manage, and deliver data products the same way software developers manage code. It supports a range of execution engines including Dremio Sonar, Hive, Spark, and Flink, so data consumers can utilize the engine best suited to each analytic workload. This is important because data can be streamed by one engine and updated or queried by another.

It supports multiple engines, which means data can be updated by one engine and queried by another.

Dremio Arctic contains a list of all Iceberg tables and views that are in the data lake together with their locations. It also contains pointers to metadata files that include metadata on Iceberg tables, schema, and partitions. The data, metadata and indexes are stored in AWS S3 cloud object storage. Dremio Arctic runs independently from Dremio Sonar, but Dremio Sonar can connect to Dremio Arctic to gain access to data in Iceberg tables and views as a data source. Figure 8 shows the Dremio Arctic architecture.

Figure 8: The Dremio Arctic architecture, showing how Dremio Sonar can connect to Dremio Arctic to access Iceberg tables and views as a data source (Source: Dremio, Copyright © Dremio Corporation 2023)


Dremio Arctic supports 'data as code' so that data producers can engineer data without impacting data consumers.

Dremio Arctic uses branches, commits and tags in a similar fashion to Git repositories. This allows data engineers to create a branch where they can transform and integrate data in Iceberg tables. Updates (including changing table schema, changing view definitions, and ingesting data) can be made to tables via SQL or Spark in isolation. Any changes to the table can be thoroughly tested before they are merged into the main branch.
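The sketch below illustrates this Git-like workflow using the Nessie SQL extensions for Spark. It is a rough sketch only: the catalog, branch and table names are assumptions, and the exact extension class names, configuration keys and SQL syntax vary by Spark and Nessie version, so it should be checked against the versions in use rather than read as a prescribed Arctic workflow.

    # Rough sketch of branch-based "data as code" with an Iceberg + Nessie catalog.
    # Catalog, branch and table names are hypothetical; verify class names and
    # config keys against your Spark/Nessie versions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
                "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
        .config("spark.sql.catalog.arctic", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.arctic.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .config("spark.sql.catalog.arctic.uri", "https://nessie.example.com/api/v1")
        .config("spark.sql.catalog.arctic.ref", "main")
        .getOrCreate())

    # Create a branch and work on it in isolation from consumers reading main.
    spark.sql("CREATE BRANCH IF NOT EXISTS etl_claims_v2 IN arctic FROM main")
    spark.sql("USE REFERENCE etl_claims_v2 IN arctic")
    spark.sql("INSERT INTO arctic.claims_domain.claims SELECT * FROM arctic.staging.new_claims")

    # After testing on the branch, publish by merging back into main.
    spark.sql("MERGE BRANCH etl_claims_v2 INTO main IN arctic")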

HOW CAN DREMIO BE USED TO BUILD A DATA MESH?


Dremio uses a phased approach to data mesh implementation.

Now that we understand Dremio Sonar and Dremio Arctic, the next question is how they can be used to build data products in a data mesh. This can be done in a three-phased approach.


Phase 1 – Unify Data Access


Start by leaving source data where it is and building virtual data products using Dremio Sonar. Teams of data producers each have their own workspace and publish data products in a shared workspace.

The first phase is unifying data access by using Dremio Sonar to connect to and federate queries across multiple data sources. This enables each team of domain-oriented data producers to build data products using Dremio Sonar only. They can do this by creating virtual views with common business data names as column names for each data product they want to create. These views use federated SQL queries to access, transform and integrate data in one or more data sources at runtime to build virtual data products. Dremio Sonar uses reflections to provide faster access to this data without impacting source production systems. Virtual data products can be tagged by data producers and published in a shared space within Dremio Sonar, which acts as a data marketplace for consumers to find ready-made data products that are available for consumption and business use. Figure 9 shows both master data products (Customers, Products, Channels, Distribution Centers) and transaction data products (Sales, Shipments, Returns). The use of common business data names ensures that all data is commonly defined in a universal semantic layer.

Dremio Sonar can govern access to virtual data products and protect sensitive data.

All data products published in the data marketplace (a shared space) are searchable, have supporting metadata (wiki, tagging and data lineage) and are available for consumption without the need to copy data. The wiki can contain a business glossary to help consumers understand the meaning of data in data products, who the owners are, data freshness etc., while lineage shows how a data product was created. Dremio Sonar offers role-based access control (RBAC) to govern data product security. User-defined functions can also be created to mask personal data to ensure compliance with any data privacy legislation if the data product contains sensitive data.
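To give a flavor of what such a virtual data product might look like, the sketch below composes the SQL for a view that exposes common business data names and masks personal data. The space, view, source and column names, and the MASK function, are hypothetical assumptions; the statement could be submitted through the Arrow Flight client shown earlier or an interactive SQL editor.

    # Hypothetical SQL for a published virtual data product; a masking UDF named
    # MASK is assumed to exist. This is an illustration, not vendor-specific syntax.
    create_customers_product = """
    CREATE OR REPLACE VIEW marketplace.customers AS
    SELECT crm.cust_no           AS customer_id,    -- enterprise-wide identifier
           crm.cust_nm           AS customer_name,
           MASK(crm.natl_id_no)  AS national_id,    -- personal data masked in the view
           bill.email_addr       AS email
    FROM   crm_db.customers    crm
    JOIN   billing_db.accounts bill ON bill.cust_no = crm.cust_no
    """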

Data is engineered by combining the SQL queries in different virtual views. Virtual data products are published in a shared space that acts as a data marketplace for consumers to find business-ready data.

Figure 9

Consumers can find, request access to and consume data in virtual data products from a shared space.

Figure 10 shows the consumption side: how virtual data products can be searched for and assembled into virtual star schemas and aggregate views created for use in self-service analytics, all using common business terms.


Virtual data products can be used to create virtual data marts for use in self-service analytics, all without copying data. Common data names ensure incremental creation of a universal semantic layer to guarantee consistent data meaning across all tools.

Figure 10

Virtual data marts can inherit the common data names. Query execution from BI
tools can be accelerated using columnar cloud caching, data reflections and
pushdown optimization. This breeds consistency and confidence.
Phase 2 – Deliver a Data Lakehouse
Migrate legacy data lakes and ingest source data into an Iceberg lakehouse managed by Dremio Arctic to guarantee consistent data, and allow teams of data engineers to build virtual data products leveraging data as code.

The shift to using SaaS transaction applications and the adoption of Internet of Things (IoT) has seen a huge amount of data being captured outside the firewall. This data is increasingly being brought into cloud storage to stage and integrate it with other data for input into cloud-based data warehouses and for use in data science. The second phase is to convert this data to Parquet files to ready it for use in a lakehouse. In addition, data in legacy on-premises data lakes such as HDFS should also be migrated to columnar Parquet files in object storage. As on-premises data warehouses are migrated to the cloud, data in data warehouse staging tables should also be added into object storage if that data is from data sources not being ingested already. This decouples staged source data from single analytical systems.

Turn a data lake into a lakehouse and reduce data copying.

At this point, Dremio Arctic with Iceberg should be deployed to enable Iceberg tables to be created and overlaid on top of the files in object storage. This turns the data lake into a lakehouse, centralizes the data needed, and makes it possible to create data products once for use in multiple analytical use cases such as self-service analytics and data science.
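As a minimal sketch of this conversion step, the statement below creates an Iceberg table directly from staged Parquet files using Spark SQL; the path, catalog and table names are illustrative assumptions.

    # Minimal sketch: convert staged Parquet files into an Iceberg table.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "arctic" is configured

    spark.sql("""
      CREATE TABLE IF NOT EXISTS arctic.staging.new_claims
      USING iceberg
      AS SELECT * FROM parquet.`s3://ingest-zone/claims_system/claims/`
    """)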

Phase 3 – Enterprise Data Mesh


Phase 3 enables data ingestion to be managed and the adoption of the Iceberg open table format while maintaining data consistency.

After migrating your workloads to object storage, you are now on a path to an enterprise data mesh. In phase three, you can start building a data lakehouse centered around Apache Iceberg. This allows you to simplify and manage your lakehouse and also control decentralized data product development using Dremio Arctic. With Dremio Arctic and Dremio Sonar on top of your data in Iceberg tables, virtual data products can be produced, published, discovered, understood, assembled and accessed via a common semantic layer. This ensures redundancy is minimized when you are building data products and that data is consistent.

Dremio Arctic enables:
• Data ingestion to be managed - new data can be ingested into your cloud data lakehouse and checked for quality before making it available to data engineers to integrate to build data products.
• Data to be streamed into Iceberg tables in columnar format by one engine (e.g., Spark or Flink) with full ACID support, and transformed and integrated to create up-to-date data products by Dremio Sonar.
• Decentralized teams to engineer lakehouse data using Dremio Sonar in complete isolation and check it before making new data products, or new versions of virtual data products, available to others. Dremio Sonar and Arctic therefore enable virtual data product development.
• Iceberg tables to be overlaid on columnar files, which means data compression, automatic partitioning, data skipping, indexing, statistics and parallel query processing are all possible on object storage-based data lakes during data product development and when querying data in virtual data products in self-service analytics.
• All updates to data to be audited and tracked.
• Virtual data products to be consumed into cloud-based data warehouses, with virtual data marts built on Dremio Sonar.
• Data scientists to consume virtual data products to provide features for building ML models, and do it all in isolation without copying data.
• Business analysts to run reports on virtual data marts built using virtual data products, and also run BI reports knowing they are working with the most consistent and up-to-date version of their data.
• Automatic data optimization for Iceberg tables, including compaction to write smaller files into larger files, and garbage collection, which removes unused files, to optimize performance and storage utilization.

In this way Dremio Arctic provides lakehouse management: decentralized teams can build and manage data products in isolation and publish them for consumption in a shared searchable space without impacting data consumers; consumer queries run in parallel on compressed columnar storage; and there is federated ownership and governance of data products with self-service access across decentralized teams.
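For context on what such optimization involves under the covers, the sketch below shows the equivalent manual Iceberg maintenance procedures in Spark SQL (the paper describes Arctic as automating this); the catalog and table names are again assumptions.

    # Illustrative only: the manual Iceberg maintenance that a managed service
    # would automate. Catalog and table names are hypothetical.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "arctic"

    # Compaction: rewrite many small data files into fewer, larger ones.
    spark.sql("CALL arctic.system.rewrite_data_files(table => 'claims_domain.claims')")

    # Garbage collection: expire old snapshots and remove files no longer referenced.
    spark.sql("CALL arctic.system.expire_snapshots(table => 'claims_domain.claims', retain_last => 10)")
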
Figure 11 is a variation on Figure 9 and shows how Dremio Sonar and Dremio
Arctic can enable decentralized teams of domain-oriented data engineers to
create their own workspaces to build virtual data products in a data mesh on
consistent data in an Iceberg lakehouse. It is Dremio’s intention to extend the
use of Arctic to offer it as a source for on-premises object stores in the near
future which would allow Dremio to use Sonar to create a universal semantic
layer for self-service analytics across data in a hybrid open lakehouse.

Virtual data products can be consumed and used in data science and combined to create virtual star-schema data marts for use in self-service analytics, all without copying data.

Figure 11


Dremio Sonar allows virtual data products to be tagged and organized to make them easier to find, understand and use. Dremio Sonar also provides full data lineage to explain how data products have been created.

Note that the data products in the shared space data marketplace could be organized. For example, they could be organized into:
• Master data products (these are cross domain – e.g., customers)
• Transaction data products (e.g., sales, shipments, returns)
• Aggregate data products (domain specific)

Once data products have been created, data owners can be assigned, and policies created to govern access to this data before they are made available to consumers. Federated data governance can be implemented by enforcing enterprise-wide data privacy policies to protect personally identifiable information (PII) data everywhere and implementing domain-specific data product access security policies (set by decentralized domain data product owners) to govern access to specific data products. Dremio Sonar can enforce both data privacy and data access security at run time to implement federated computational data governance in a data mesh.

Common data names make it possible to create a universal semantic layer to drive consistent understanding across all tools.

Finally, as shown in Figure 4, by using common data names for all data products, all virtual data marts and all virtual aggregate views, it is possible to establish a universal semantic layer in Dremio Sonar to guarantee that all BI tools and data science notebooks see consistent data names and definitions.


CONCLUSIONS
Insatiable demand to analyze more and more data has resulted in IT becoming a data engineering bottleneck.

For many companies today, the demand to analyze more data from more data sources by more business users has become so great that IT can't keep pace with requests to engineer data. This has led to IT becoming a bottleneck to producing richer business insights. To overcome this bottleneck, businesses are looking to upskill people around the enterprise to increase the number of data engineers.

Companies want to democratize and coordinate decentralized data engineering activity to produce reusable data products for use in self-service analytics and data science. They also want to provision data without the need to copy it.

At the same time, they want to avoid mistakes of the past, including the chaos of 'every person for themselves' self-service data preparation using many different tools, and siloed analytical systems with overlapping subsets of the same data being repeatedly re-engineered for different data warehouses, data marts, graph databases and machine-learning models. Instead, they are looking to move towards a decentralized but coordinated, incremental development approach to data engineering where different business domain-oriented teams of data engineers around the enterprise integrate data to produce reusable data products and make them available for sharing. This is the data mesh approach, which aims to build data products once and reuse them for many different analytical use cases including data warehouses and data science. The data mesh approach separates the data producers who engineer data to create data products from the data consumers who select the data products they need and consume and assemble them for use in self-service analytics and data science. The challenge is to avoid every consumer wanting copies of data products. Therefore, a zero data copy approach to provisioning data products is needed. In addition, consumers can produce new data products (e.g., aggregate data) and new analytical products (e.g., ML models, BI reports) and also publish these.

Data Mesh enables incremental development of data products and should help to progressively shorten time to value.

The benefit of data mesh is that as more data (and analytical) products are incrementally created, consumers should be able to get faster at delivering value because more and more of what they need is already available for reuse.

A common self-service data platform is needed to coordinate decentralized data product development and facilitate data sharing in a data mesh. An open table format and catalog are needed if firms want to integrate data warehouse and data science workloads on a data lakehouse. A universal semantic layer is also needed to drive common understanding of data across multiple tools.

With so much demand, companies need a common platform to build data products quickly and easily, where projects can be coordinated, where decentralized teams can engineer data in isolation using DataOps (DevOps applied to data) practices, and where metadata can be shared. In addition, they need to connect to a number of data sources and either leave the data where it is and engineer it from there, or move it to a centralized data lake where data consistency must be upheld. If it is the latter, cloud storage is not enough. A lakehouse is needed that supports ACID transactions, that can cater for schema evolution, that supports managed data ingestion, and where teams can engineer in their own workspaces in isolation until the data is ready for use. Furthermore, once produced, data products need to be published in a shared data marketplace where consumers can request access, where access security and data privacy are governed, and where data can be provisioned without the need to copy it for use in self-service analytics. Also, common business data names need to be enforced to create a universal semantic layer so that all tools accessing this data see consistent and meaningful business data names. Dremio provides all these things and can be used to industrialize development of data products in a data mesh. For companies looking for software to help them to do this, it should be a contender on anyone's shortlist.


About Intelligent Business Strategies


Intelligent Business Strategies is an independent research, education, and
consulting company whose goal is to help companies understand and exploit
new developments in business intelligence, machine-learning, advanced
analytics, data management, big data, and enterprise business integration.
Together, these technologies help an organization become an intelligent
business.

Author
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited.
As an independent IT industry analyst and consultant, he specializes in BI /
analytics and data management. With over 40 years of IT experience, Mike has
consulted for dozens of companies on BI/Analytics, data strategy, technology
selection, enterprise architecture, and data management. Mike is also
conference chairman of Big Data LDN, the largest data and analytics conference
in Europe and a member of the EDM Council CDMC Executive Advisory Board.
He has spoken at events all over the world and written numerous papers and
articles. Formerly he was a principal and co-founder of Codd and Date – the
inventors of the Relational Model, a Chief Architect at Teradata on the Teradata
DBMS, and European Managing Director of Database Associates. He teaches
popular master classes in Data Warehouse Modernization, Big Data,
Centralized Data Governance of a Distributed Data Landscape, Practical
Guidelines for Implementing a Data Mesh, Machine-learning and Advanced
Analytics, and Embedded Analytics, Intelligent Apps and AI Automation.

Telephone: (+44)1625 520700


Internet URL: www.intelligentbusiness.biz
E-Mail: info@intelligentbusiness.biz

Simplifying Data Mesh for Self-Service Analytics on an Open Data Lakehouse

Copyright © 2023, Intelligent Business Strategies


All rights reserved

All diagrams sourced from Dremio and used in this paper remain
the copyright and intellectual property of Dremio
