The DataOps Cookbook
Methodologies and Tools That Reduce Analytics Cycle Time While Improving Quality
Second Edition
by Christopher Bergh, Gil Benghiat, and Eran Strod
The DataOps Cookbook
© 2019 DataKitchen, Inc. All Rights Reserved.
DataKitchen Headquarters:
101 Main Street, 14th Floor
Cambridge, MA 02142
INTRODUCTION
Welcome to the Second Edition of the “DataOps Cookbook.” With over 5,000 copies distrib-
uted, the first edition of the book far exceeded our expectations. Managers have asked us
for boxes of books to distribute to their entire organization. Data professionals are forming
study groups around the “DataOps Cookbook.” DataOps is a methodology truly coming of age.
The name “DataOps” has always been somewhat problematic, misleading people to believe
that we are simply talking about DevOps for data. This misconception started to gain traction
in the technical press in mid-2018, shortly after Gartner placed DataOps on the fastest rising
part of their Hype Cycle curve for Data Management.
In response, we wrote the post “DataOps is NOT Just DevOps for Data” (see the chapter
“What is DataOps” below). On Medium, the post has received over 28,000 views (and
counting), making it one of 2019’s most widely read and referenced thought pieces on data
analytics. The DataOps view that analytics is a combination of software development and
manufacturing operations seems to have struck a chord within the data industry.
The remarkable interest in DataOps has opened the door to many conversations with data
professionals, both in individual contributor and management roles. These discussions
spurred further thinking about DataOps, and we are now pleased to expand upon the
original book with several new additions. We hope these further advance the industry-wide
dialogue about data organization productivity and quality. In this latest edition of the DataOps Cookbook, you’ll find the following new sections:
• “Warring Tribes into Winning Teams: Improving Teamwork in Your Data Organization”
on inter-team teamwork
• “Improving Teamwork in Data Analytics with DataOps” on intra-team collaboration
• “Eliminate Your Analytics Development Bottlenecks”
• “A Great Model is Not Enough. Deploying AI Without Technical Debt.”
• “The ‘Right To Repair’ DataOps Data Architecture”
• “Enabling Design Thinking in Data Analytics with DataOps”
• “Tomorrow’s Forecast: Cloudy with a Chance of Data Errors — Key Findings of the 2019 DataOps Survey”
DataOps is a foundational topic that requires data teams to fundamentally rethink the
ways that they perform their duties. Despite the inherent challenges, we are confident you
will find this to be a fruitful and worthwhile endeavor. We look forward to continuing the
conversation.
In the early 2000s, Chris and Gil worked at a company that specialized in analytics for the
pharmaceutical industry. It was a small company that offered a full suite of services related
to analytics — data engineering, data integration, visualization and what is now called “data
science.” Their customers were marketing and sales executives who tend to be challenging
because they are busy, need fast answers and don’t understand or care about the underlying
mechanics of analytics. They are business people, not technologists.
When a request from a customer came in, Chris and Gil would gather their team of engi-
neers, data scientists and consultants to plan out how to get the project done. After days
of planning, they would propose their project plan to the customer. “It will take two weeks.”
The customer would shoot back, “I need it in two hours!”
Walking back to their office, tail between their legs, they would pick up the phone. It was a
customer boiling over with anger. There was a data error. If it wasn’t fixed immediately the
customer would find a different vendor.
The company had hired a bunch of smart people to deliver these services. “I want to innovate — can I try out this new open source tool?” the team members would ask. “No,” the
managers would have to answer. “We can’t afford to introduce technical risk.”
They lived this life for many years. How do you create innovative data analytics? How do you
not have embarrassing errors? How do you let your team easily try new ideas? There had to
be a better way.
They found their answer by studying the software and manufacturing industries which had
been struggling with these same issues for decades. They discovered that data-analytics
cycle time and quality can be optimized with a combination of tools and methodologies that
they now call DataOps. They decided to start a new company. The new organization adopted
the kitchen metaphor for data analytics. After all, cooking up charts and graphs requires the
right ingredients and recipes.
Having experienced this transformation, the DataKitchen founders sought a way to help
other data professionals. There are so many talented people stuck in no-win situations. This
book is for data professionals who are living the nightmare of slow, buggy analytics and
frustrated users. It will explain why working weekends isn’t the answer. It provides you with
practical steps that you can take tomorrow to improve your analytics cycle time.
DataKitchen markets a DataOps Platform that will help analytics organizations implement
DataOps. However, this book isn’t really about us and our product. It is about you, your
challenges, your potential and getting your analytics team back on track.
The values and principles that are central to DataOps are listed in the DataOps Manifesto
which you can read below. If you agree with it, please join the thousands of others who
share these beliefs by signing the manifesto. There may be aspects of the manifesto that
require further explanation. Please read on. By the end of this book, it should all make sense.
You’ll also notice that we’ve included some real recipes in this book. These are some of our
favorites. We hope you enjoy them!
Background
Through firsthand experience working with data across organizations, tools, and industries
we have uncovered a better way to develop and deliver analytics that we call DataOps.
Whether referred to as data science, data engineering, data management, big data, business
intelligence, or the like, through our work we have come to value in analytics:
• Individuals and interactions over processes and tools
• Working analytics over comprehensive documentation
• Customer collaboration over contract negotiation
• Experimentation, iteration, and feedback over extensive upfront design
• Cross-functional ownership of operations over siloed responsibilities
DataOps Principles
5. DAILY INTERACTIONS
Customers, analytic teams, and operations must work together daily throughout the project.
6. SELF-ORGANIZE
We believe that the best analytic insight, algorithms, architectures, requirements, and de-
signs emerge from self-organizing teams.
7. REDUCE HEROISM
As the pace and breadth of need for analytic insights ever increases, we believe analytic
teams should strive to reduce heroism and create sustainable and scalable data analytic
teams and processes.
8. REFLECT
Analytic teams should fine-tune their operational performance by self-reflecting, at regular
intervals, on feedback provided by their customers, themselves, and operational statistics.
9. ANALYTICS IS CODE
Analytic teams use a variety of individual tools to access, integrate, model, and visualize data.
Fundamentally, each of these tools generates code and configuration which describes the
actions taken upon data to deliver insight.
10. ORCHESTRATE
The beginning-to-end orchestration of data, tools, code, environments, and the analytic
team’s work is a key driver of analytic success.
17. REUSE
We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the
repetition of previous work by the individual or team.
Join the Thousands of People Who Have Already Signed The Manifesto
Companies increasingly look to analytics to drive growth strategies. As the leader of the
data-analytics team, you manage a group responsible for supplying business partners with
the analytic insights that can create a competitive edge. Customer and market opportunities
evolve quickly and drive a relentless series of questions. Analytics, by contrast, move slowly,
constrained by development cycles, limited resources and brittle IT systems. The gap be-
tween what users need and what IT can provide can be a source of conflict and frustration.
Inevitably this mismatch between expectations and capabilities can cause dissatisfaction,
leaving the data-analytics team in an unfortunate position and preventing a company from
fully realizing the strategic benefit of its data.
As a manager overseeing analytics, it’s your job to understand and address the factors that
prevent the data-analytics team from achieving peak levels of performance. If you talk to
your team, they will tell you exactly what is slowing them down. You’ll likely hear variations
of the following eight challenges:
They don’t know what they want. Users are not data experts. They don’t know what insights
are possible until someone from your team shows them. Sometimes they don’t know what
they want until after they see it in production (and maybe not even then). Often, business
stakeholders do not know what they will need next week, let alone next quarter or next year.
It’s not their fault. It’s the nature of pursuing opportunities in a fast-paced marketplace.
They need everything ASAP. Business is a competitive endeavor. When an opportunity opens,
the company needs to move on it faster than the competition. When users bring a question
to the data-analytics team, they expect an immediate response. They can’t wait weeks or
months — the opportunity will close as the market seeks alternative solutions.
The questions never end. Sometimes providing business stakeholders with analytics generates
more questions than answers. Analytic insights enable users to understand the business
in new ways. This spurs creativity, which leads to requests for more analytics. A healthy relationship between the analytics team and its users will foster a continuous series of questions that
drive demand for new analytics. However, this relationship can sour quickly if the delivery of
new analytics can’t meet the required time frames.
Business stakeholders want fast answers. Meanwhile, the data-analytics team has to work
with IT to gain access to operational systems, plan and implement architectural changes, and
develop/test/deploy new analytics. This process is complex, lengthy and subject to numer-
ous bottlenecks and blockages.
A database optimized for data analytics is structured to optimize reads and aggregations.
It’s also important for the schema of an analytics database to be easily understood by
humans. For example, the field names would be descriptive of their contents and data tables
would be linked in ways that make intuitive sense.
4 – Data Errors
Whether your data sources are internal or from external third parties, data will eventually
contain errors. Data errors can prevent your data pipeline from flowing correctly. Errors may
also be subtle, such as duplicate records or individual fields that contain erroneous data.
Data errors could be caused by a new algorithm that doesn’t work as expected, a database
schema change that broke one of your feeds, an IT failure or one of many other possibilities.
Data errors can be difficult to trace and resolve quickly.
Manual processes can also lead to high employee turnover. Many managers have
watched high-performing data-analytics team members burn out due to having to repeat-
edly execute manual data procedures. Manual processes strain the productivity of the data
team in numerous ways.
According to the research firm Gartner, Inc., half of all chief data officers (CDO) in large
organizations will not be deemed a success in their role. Per Forrester Research, 60% of the
data and analytics decision-makers surveyed said they are not very confident in their analytics
insights. Only 10% responded that their organizations sufficiently manage the quality of data and analytics. Just 16% believe they perform well in producing accurate
models.
Heroism - Data-analytics teams work long hours to compensate for the gap between perfor-
mance and expectations. When a deliverable is met, the data-analytics team is considered
heroes. However, yesterday’s heroes are quickly forgotten when there is a new deliverable
to meet. Also, this strategy is difficult to sustain over a long period of time, and it, ultimately,
just resets expectations at a higher level without providing additional resources. The heroism
approach is also difficult to scale up as an organization grows.
Hope - When a deadline must be met, it is tempting to just quickly produce a solution with
minimal testing, push it out to the users and hope it does not break. This approach has inher-
ent risks. Eventually, a deliverable will contain data errors, upsetting the users and harming
the hard-won credibility of the data-analytics team.
Caution - The team decides to give each data-analytics project a longer development and
test schedule. Effectively, this is a decision to deliver higher quality, but fewer features to
users. One difficulty with this approach is that users often don’t know what they want until
they see it, so a detailed specification might change considerably by the end of a project. The
slow and methodical approach might also make the users unhappy because the analytics are
delivered more slowly than their stated delivery requirements and as requests pile up, the
data-analytics team risks being viewed as bureaucratic and inefficient.
None of these approaches adequately serve the needs of both users and data-analytics
professionals, but there is a way out of this bind. The challenges above are not unique to
analytics, and in fact, are shared by other organizations.
Overcoming the Challenges
Some say that an analytics team can overcome these challenges by buying a new tool. While it is true that new tools are helpful, they are not enough by themselves. You cannot truly transform your staff into a high-performance team without an overhaul of the methodologies and processes that guide your workflows. In this book, we will discuss how to combine tools and new processes in a way that improves the productivity of your data analytics team by orders of magnitude.
INSTRUCTIONS
1. Crock Pot: Combine all ingredients and cook on high for 5-8 hours.
Stir occasionally.
2. Stove Top: Combine ground beef, onion, and pepper. Cook on medium high
until beef is cooked through. Add the remaining ingredients and cook on
low-simmer for 1-2 hours. Stir occasionally.
3. For vegan chili: Substitute 5 tablespoons of canola oil for the ground beef.
4. Serve with rice.
You can view DataOps in the context of a century-long evolution of ideas that improve
how people manage complex systems. It started with pioneers like W. Edwards Deming and
statistical process control - gradually these ideas crossed into the technology space in the
form of Agile, DevOps and now, DataOps. In the next section we will examine how these
methodologies impact productivity, quality and reliability in data analytics.
The world changed in February 2005 when Amazon Prime brought flat-rate, unlimited, two-
day shipping into a world where people expected to pay extra to receive packages in four
to six business days. Since its launch, Amazon Prime has completely transformed the retail
market, making low-cost, predictable shipping an integral part of consumer expectations.
This business model, which some have called the “on-demand economy,” is popping up in
many industries and markets across the globe.
For example, some may remember video stores where movies were rented for later viewing.
Today, 65 percent of global respondents to a recent Nielsen survey watch video on demand
(VOD), many of them daily. With VOD, a person’s desire to watch a movie is fulfilled within
seconds. Amazon participates in the VOD market with their Amazon Prime Video service.
Instant fulfillment of customer orders seems to be part of Amazon’s business model. They
have even brought that capability to IT. About 10 years ago, Amazon Web Services (AWS)
began offering computing, storage, and other IT infrastructure on an as-needed basis.
Whether the need is for one server or thousands and whether for hours, days, or months,
you only pay for what you use, and the resources are available in just a few minutes.
A typical example: the VP of sales enters the office of the chief data officer (CDO). She’d like
to cross-reference the customer database with some third-party consumer data. The CDO
asks for time to study the problem and, days later, has planned the project. Resources will be
allocated and configured, schemas will be updated, reports will be elegantly designed, and
the delivery pipeline will be thoroughly tested. The changes will take several weeks. “Not
acceptable,” the VP of sales fires back. The new analytics are needed for a meeting with the
board later in the week. “The competition is ahead of us; we can’t wait weeks.” This scenario
is playing out in one form or another in corporations around the globe.
In order to deliver value consistently, quickly and accurately, data-analytics teams must learn
to create and publish analytics in a new way. We call this new approach DataOps. DataOps
is a combination of tools and methods, which streamline the development of new analytics
while ensuring impeccable data quality. DataOps helps shorten the cycle time for producing
analytic value and innovation, while avoiding the trap of “hope, heroism and caution.”
Figure 1: In Agile development, a burndown chart shows work remaining over time.
The waterfall model is better suited to situations where the requirements are fixed and well
understood up front. This is nothing like the technology industry where the competitive
environment evolves rapidly. In the 1980s, a typical software project required about 12 calendar months. In technology-driven businesses (i.e., nearly everyone these days), customers demand new features and services, and competitive pressures change priorities on a
seemingly daily basis. The waterfall model has no mechanism to respond to these changes.
In waterfall, changes trigger a seemingly endless cycle of replanning causing delays and
resulting in project budget overruns.
In the early 2000’s, the software industry embraced a new approach to code production
called Agile Development. Agile is an umbrella term for several different iterative and incre-
mental software development methodologies.
In Agile Software Development, the team and its processes and tools are organized
around the goal of publishing releases to the users every few weeks (or at most every
few months). A development cycle is called a sprint or an iteration. At the beginning of an
iteration, the team commits to completing working and (the most) valuable changes to the code base. Features are associated with user stories, which help the development team understand each feature from the user’s perspective.
Agile is widely credited with boosting software productivity. One study sponsored by the
Central Ohio Agile Association and Columbus Executive Agile Special Interest Group found
that Agile projects were completed 31 percent faster and with a 75 percent lower defect
rate than the industry norm. The vast majority of companies are getting on board. In a survey of 400 IT professionals by TechBeacon, two-thirds described their company as either “pure agile” or “leaning towards agile.” Among the remaining one-third of companies, most use a hybrid approach, leaving only nine percent using a pure waterfall approach.
If, for example, the customer reported a problem, it might not be replicable in the support,
test or development groups due to differences in the hardware and software environments
being run. This lack of alignment fostered misunderstandings and delays and often led to a
lack of trust and communication between the various stakeholders.
About a decade ago, Amazon Web Services (AWS) and other cloud providers, began offering
computing, storage and other IT resources as an on-demand service. No more waiting weeks
or months for the IT department to fulfill a request for servers. Cloud providers now allow
you to order computing services, paying only for what you use, whether that is one proces-
sor for an hour or thousands of processors for months. These on-demand cloud services
have enabled developers to write code that provisions processing resources with strictly specified environments, on-demand, in just a few minutes. This capability has been called Infrastructure as Code (IaC). IaC has made it possible for everyone in the software development pipeline, all the different groups mentioned above, to use an identical environment tailored to application requirements. With IaC, design, test, QA and support can all work from the same well-defined environments.
With IT infrastructure being defined by code, the hard divisions between IT operations and
software development are able to blur. The merger of development and operations is how
the term DevOps originated.
With the automated provisioning of resources, DevOps paved the way for a fully automated
test and release process. Deploying code, which once took weeks, can now be completed in minutes. Major organizations including Amazon, Facebook and Netflix are
now operating this way. At a recent conference, Amazon disclosed that their AWS team per-
forms 50,000,000 code releases per year. This is more than one per second! This methodology of rapid releases is called continuous delivery or, alternatively, continuous deployment when new features (and fixes) are not only delivered internally but fully deployed to customers.
DevOps starts with continuous delivery and Agile development and adds automated provi-
sioning of resources (infrastructure as code) and cloud services (platform as a service) to en-
sure that the same environment is being utilized at every stage of the software development
pipeline. The cloud provides a natural platform that allows individuals to create and define
identical run-time environments. DevOps is beginning to achieve critical mass in terms of its
adoption within the world of software development.
DevOps improves collaboration between employees from the planning through the deploy-
ment phases of software. It seeks to reduce time to deployment, decrease time to market,
minimize defects, and shorten the time required to fix problems.
The impact of DevOps on development organizations was shown in a 2014 survey, “The
2014 State of DevOps Report” by Puppet Labs, IT Revolution Press and ThoughtWorks,
based on 9,200 survey responses from technical professionals. The survey found that IT or-
ganizations implementing DevOps were deploying code 30 times more frequently and with
50 percent fewer failures. Further, companies with these higher performing IT organizations
tended to have stronger business performance, greater productivity, higher profitability and
larger market share. In other words, DevOps is not just something that engineers are doing
off in a dark corner. It is a core competency that helps good companies become better.
The data analytics team transforms raw data into actionable information that improves
decision making and provides market insight. Imagine an organization with the best data
analytics in the industry. That organization would have a tremendous advantage over com-
petitors. That could be you.
In data analytics, tests should verify that the results of each intermediate step in the
production of analytics matches expectations. Even very simple tests can be useful. For
example, a simple row-count test could catch an error in a join that inadvertently produces a
Cartesian product. Tests can also detect unexpected trends in data, which might be flagged
as warnings. Imagine that the number of customer transactions exceeds its historical average
by 50%. Perhaps that is an anomaly that upon investigation would lead to insight about
business seasonality.
Tests in data analytics can be applied to data or models either at the input or output of a
phase in the analytics pipeline. Tests can also verify business logic.
Input tests check data prior to each stage in the analytics pipeline. For example:
• Count Verification – Check that row counts are in the right range, ...
• Conformity – US Zip5 codes are five digits, US phone numbers are 10 digits, ...
• History – The number of prospects always increases, ...
• Balance – Week over week, sales should not vary by more than 10%, ...
• Temporal Consistency – Transaction dates are in the past, end dates are later than start
dates, ...
• Application Consistency – Body temperature is within a range around 98.6F/37C, ...
• Field Validation – All required fields are present, correctly entered, ...
Output tests check the results of an operation, like a Cartesian join. For example:
• Completeness – Number of customer prospects should increase with time
• Range Verification – Number of physicians in the US is less than 1.5 million
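To make these checks concrete, here is a minimal sketch in Python of how a few of the tests above might be implemented; the field names and thresholds are illustrative, not prescriptive:

```python
import re
from datetime import date

def check_row_count(rows, min_rows, max_rows):
    """Count Verification: the row count must fall within the expected range."""
    if not min_rows <= len(rows) <= max_rows:
        raise ValueError(f"Row count {len(rows)} outside [{min_rows}, {max_rows}]")

def check_zip5(rows, field="zip"):
    """Conformity: US Zip5 codes are exactly five digits."""
    bad = [r for r in rows if not re.fullmatch(r"\d{5}", str(r.get(field, "")))]
    if bad:
        raise ValueError(f"{len(bad)} rows failed the {field} conformity test")

def check_dates_in_past(rows, field="transaction_date"):
    """Temporal Consistency: transaction dates must be in the past."""
    bad = [r for r in rows if r[field] > date.today()]
    if bad:
        raise ValueError(f"{len(bad)} rows have future {field} values")

# Run the input tests before the next stage of the pipeline touches the data.
rows = [{"zip": "02142", "transaction_date": date(2019, 5, 1)}]
check_row_count(rows, min_rows=1, max_rows=1_000_000)
check_zip5(rows)
check_dates_in_past(rows)
```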
The data analytics pipeline is a complex process with steps often too numerous to be moni-
tored manually. SPC allows the data analytics team to monitor the pipeline end-to-end from
a big-picture perspective, ensuring that everything is operating as expected. As an automat-
ed test suite grows and matures, the quality of the analytics is assured without adding cost.
This makes it possible for the data analytics team to move quickly — enhancing analytics to
address new challenges and queries — without sacrificing quality.
When DataOps is implemented correctly, it addresses many of the issues discussed earlier
that have plagued data-analytics teams.
DataOps views the data-analytics pipeline as a process and as such focuses on how to make
the entire process run more rapidly and with higher quality, rather than optimizing the pro-
ductivity of any single individual or tool by itself.
DataKitchen markets an automated DataOps platform that helps companies accelerate their
DataOps implementation, but this book is about DataOps not us. This book is not trying to
sell you anything. You can implement DataOps all by yourself, using your existing tools, by
implementing the seven steps described in the next section. If you desire assistance, there is
an ecosystem of DataOps vendors who offer a variety of innovative solutions and services.
Data analytics has become business critical, but requirements quickly evolve and data-an-
alytics teams that respond to these challenges in the traditional ways often end up facing
disappointed users. DataOps offers a more effective approach that optimizes the productivi-
ty of the data analytics pipeline by an order of magnitude.
Imagine the next time that the Vice President of Marketing requests a new customer
segmentation, by tomorrow. With DataOps, the data-analytics team can respond ‘yes’ with
complete confidence that the changes can be accomplished quickly, efficiently and robustly.
How then does an organization implement DataOps? You may be surprised to learn that an
analytics team can migrate to DataOps in seven simple steps.
Adding tests in data analytics is analogous to the statistical process control that is imple-
mented in a manufacturing operations flow. Tests ensure the integrity of the final output by
verifying that work-in-progress (the results of intermediate steps in the pipeline) matches
expectations. Testing can be applied to data, models and logic. The figure below shows
examples of tests in the data-analytics pipeline.
For every step in the data-analytics pipeline, there should be at least one test. The philoso-
phy is to start with simple tests and grow over time. Even a simple test will eventually catch
an error before it is released out to the users. For example, just making sure that row counts
are consistent throughout the process can be a very powerful test. One could easily make a mistake on a join and inadvertently produce a cross product, exploding the row count downstream. A simple row-count test would quickly catch that.
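As a sketch of the idea, assuming a pandas-based pipeline (the table and key names are hypothetical), a join can be wrapped with a row-count guard:

```python
import pandas as pd

def safe_join(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Join two tables and verify the result is not an accidental cross product."""
    result = left.merge(right, on=key)
    # For a many-to-one join, the output should never have more rows than the
    # left table; a blow-up here usually signals duplicate join keys.
    if len(result) > len(left):
        raise AssertionError(
            f"Join on '{key}' grew {len(left)} rows into {len(result)}; "
            "check for duplicate keys (possible cross product)."
        )
    return result
```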
Tests can detect warnings in addition to errors. A warning might be triggered if data exceeds
certain boundaries. For example, the number of customer transactions in a week may be
OK if it is within 90% of its historical average. If the transaction level exceeds that, then a
warning could be flagged. This might not be an error. It could be a seasonal occurrence for
example, but the reason would require investigation. Once recognized and understood, the
users of the data could be alerted.
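One way to express the distinction between warnings and errors in code, reading “within 90% of its historical average” as a band of ten percent around that average (an interpretation we assume for illustration):

```python
def check_weekly_transactions(count: int, historical_avg: float) -> str:
    """Flag weekly transaction counts that stray from the historical average.

    Deviations produce warnings rather than errors, because they may reflect
    legitimate effects, such as seasonality, that merit investigation.
    """
    deviation = abs(count - historical_avg) / historical_avg
    if deviation > 0.10:
        return (f"WARNING: weekly count {count} deviates {deviation:.0%} "
                f"from the historical average of {historical_avg:.0f}")
    return "OK"

print(check_weekly_transactions(1650, historical_avg=1000.0))
```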
DataOps is not about being perfect. In fact, it acknowledges that code is imperfect. It’s
natural that a data-analytics team will make a best effort, yet still miss something. If so, they
can determine the cause of the issue and add a test so that it never happens again. In a rapid
release environment, a fix can quickly propagate out to the users.
Automated tests continuously monitor the data pipeline for errors and anomalies. They work
nights, weekends and holidays without taking a break. If you build a DataOps dashboard, you
can view the high-level state of your data operations at any time. If warning and failure alerts
are automated, you don’t have to constantly check your dashboard. Automated testing frees
the data-analytics team from the drudgery of manual testing, so they can focus on higher
value-add activities.
Figure 4: Tests enable the data professional to apply statistical process controls
to the data pipeline
The artifacts (files) that make this reproducibility possible are usually subject to continuous
improvement. Like other software projects, the source files associated with the data pipeline
should be maintained in a version control (source control) system such as Git. A version con-
trol tool helps teams of individuals organize and manage the changes and revisions to code.
It also keeps code in a known repository and facilitates disaster recovery. However, the most
important benefit of version control relates to a process change that it facilitates. It allows
data-analytics team members to branch and merge.
Branching and merging can be a major productivity boost for data analytics because it allows
teams to make changes to the same source code files in parallel without slowing each other
down. Each individual team member has control of his or her work environment. They can
run their own tests, make changes, take risks and experiment. If they wish, they can discard
their changes and start over. Another key to allowing team members to work well in parallel
relates to providing them with an isolated machine environment.
When many team members work on the production database, it can lead to conflicts. A
database engineer changing a schema may break reports. A data scientist developing a new
model might get confused as new data flows in. Giving team members their own environment shields the rest of the organization from being impacted by their work in progress.
Some steps in the data-analytics pipeline are messy and complicated. For example, one
operation might call a custom tool, run a Python script, use FTP, and apply other specialized
logic. This operation might be hard to set up (because it requires a specific set of tools)
and difficult to create (because it requires a specific skill set). This scenario is another
common use case for creating a container. Once the code is placed in a container, it is much easier for other programmers to use; they don’t need to be familiar with the custom tools inside the container, only with its external interfaces. It is also easier to
deploy that code to each environment.
For example, imagine a pharmaceutical company that obtains prescription data from a
3rd party company. The data is incomplete, so the data producer uses algorithms to fill in
those gaps. In the course of improving their product, the data producer develops a different algorithm to fill in the gaps. The data has the same shape (rows and columns), but certain
fields are modified using the new algorithm. With the correct built-in parameters, an engi-
neer or analyst can easily build a parallel data mart with the new algorithm and have both
the old and new versions accessible through a parameter change.
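A minimal sketch of what such a built-in parameter could look like; the algorithm versions and mart path here are hypothetical placeholders:

```python
def fill_gaps_v1(record: dict) -> dict:
    """Placeholder for the data producer's original gap-filling algorithm."""
    return record

def fill_gaps_v2(record: dict) -> dict:
    """Placeholder for the revised gap-filling algorithm."""
    return record

def build_data_mart(records, gap_fill_version: str = "v1") -> str:
    """Build a data mart using the selected gap-filling algorithm.

    Changing the parameter builds a parallel mart, keeping the old and new
    versions of the data accessible side by side.
    """
    algorithms = {"v1": fill_gaps_v1, "v2": fill_gaps_v2}
    transform = algorithms[gap_fill_version]
    filled = [transform(r) for r in records]
    mart_path = f"analytics/prescriptions_mart_{gap_fill_version}"
    # ... load `filled` into the mart at `mart_path` ...
    return mart_path
```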
Data engineers, scientists and analysts spend an excessive amount of time and energy
working to avoid these disastrous scenarios. They attempt “heroism” — working weekends.
They do a lot of hoping and praying. They devise creative ways to avoid overcommitting. The
problem is that heroic efforts are eventually overcome by circumstances. Without the right
controls in place, a problem will slip through and bring the company’s critical analytics to a halt.
The DataOps enterprise puts the right set of tools and processes in place to enable data and
new analytics to be deployed with a high level of quality. When an organization implements
DataOps, engineers, scientists and analysts can relax because quality is assured. They can
Work Without Fear or Heroism. DataOps accomplishes this by optimizing two key workflows.
As mentioned above, the worst possible outcome is for poor quality data to enter the Value
Pipeline. DataOps prevents this by implementing data tests (step 1). Inspired by the statisti-
cal process control in a manufacturing workflow, data tests ensure that data values lie within
an acceptable statistical range. Data tests validate data values at the inputs and outputs of
each processing stage in the pipeline. For example, a US phone number should be ten digits.
Any other value is incorrect or requires normalization.
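For instance, a phone-number data test might normalize where it can and reject what it cannot; a rough sketch:

```python
import re

def normalize_us_phone(raw: str) -> str:
    """Return a normalized 10-digit US phone number, or raise on bad input."""
    digits = re.sub(r"\D", "", raw)  # strip punctuation, spaces, etc.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading country code
    if len(digits) != 10:
        raise ValueError(f"Invalid US phone number: {raw!r}")
    return digits

assert normalize_us_phone("(617) 555-0123") == "6175550123"
```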
Once data tests are in place, they work 24x7 to guarantee the integrity of the Value Pipeline.
Quality becomes literally built in. If anomalous data flows through the pipeline, the data tests
catch it and take action — in most cases this means firing off an alert to the data analytics
team who can then investigate. The tests can even, in the spirit of auto manufacturing,
“stop the line.” Statistical process control eliminates the need to worry about what might
happen. With the right data tests in place, the data analytics team can Work Without Fear
or Heroism. This frees DataOps engineers to focus on their other major responsibility —
the Innovation Pipeline.
DataOps implements continuous deployment of new ideas by automating the workflow for
building and deploying new analytics. It reduces the overall cycle time of turning ideas into
innovation. While doing this, the development team must avoid introducing new analytics
that break production. The DataOps enterprise uses logic tests (step 1) to validate new code
before it is deployed. Logic tests ensure that data matches business assumptions. For exam-
ple, a field that identifies a customer should match an existing entry in a customer dimension
table. A mismatch should trigger some type of follow-up.
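A logic test of this kind might be sketched as a referential-integrity check; the field and table names are illustrative:

```python
def check_customer_keys(fact_rows, customer_dimension_ids):
    """Logic test: every customer key in the fact data must already exist in
    the customer dimension table; any mismatch triggers follow-up."""
    known = set(customer_dimension_ids)
    orphans = {row["customer_id"] for row in fact_rows} - known
    if orphans:
        raise AssertionError(
            f"{len(orphans)} customer keys are missing from the dimension "
            f"table, e.g. {sorted(orphans)[:5]}"
        )

check_customer_keys(
    fact_rows=[{"customer_id": 42}, {"customer_id": 7}],
    customer_dimension_ids=[7, 42, 99],
)
```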
With logic tests in place, the development pipeline can be automated for continuous deploy-
ment, simplifying the release of new enhancements and enabling the data analytics team to
focus on the next valuable feature. With DataOps the dev team can deploy without worrying
about breaking the production systems — they can Work Without Fear or Heroism. This is a
key characteristic of a fulfilled, productive team.
Figure 8: The Value and Innovation Pipelines illustrate how new analytics are
introduced into data operations.
One common misconception about DataOps is that it is just DevOps applied to data analyt-
ics. While a little semantically misleading, the name “DataOps” has one positive attribute.
It communicates that data analytics can achieve what software development attained with
DevOps. That is to say, DataOps can yield an order of magnitude improvement in quality
and cycle time when data teams utilize new tools and methodologies. The specific ways that
DataOps achieves these gains reflect the unique people, processes and tools characteristic
of data teams (versus software development teams using DevOps). Here’s our in-depth take
on both the pronounced and subtle differences between DataOps and DevOps.
Using DevOps, leading companies have been able to reduce their software release cycle time
from months to (literally) seconds. This has enabled them to grow and lead in fast-paced,
emerging markets. Companies like Google, Amazon and many others now release software
many times per day. By improving the quality and cycle time of code releases, DevOps de-
serves a lot of credit for these companies’ success.
Optimizing code builds and delivery is only one piece of the larger puzzle for data analyt-
ics. DataOps seeks to reduce the end-to-end cycle time of data analytics, from the origin
of ideas to the literal creation of charts, graphs and models that create value. The data
lifecycle relies upon people in addition to tools. For DataOps to be effective, it must manage
collaboration and innovation. To this end, DataOps introduces Agile Development into data
analytics so that data teams and users work together more efficiently and effectively.
Studies show that software development projects complete faster and with fewer defects when Agile Development replaces the traditional Waterfall sequential methodology. The Agile methodology is particularly effective in environments where requirements
are quickly evolving — a situation well known to data analytics professionals. In a DataOps
setting, Agile methods enable organizations to respond quickly to customer requirements
and accelerate time to value.
Agile development and DevOps add significant value to data analytics, but there is one more
major component to DataOps. Whereas Agile and DevOps relate to analytics development
and deployment, data analytics also manages and orchestrates a data pipeline. Data con-
tinuously enters on one side of the pipeline, progresses through a series of steps and exits
in the form of reports, models and views. The data pipeline is the “operations” side of data
analytics. It is helpful to conceptualize the data pipeline as a manufacturing line where quali-
ty, efficiency, constraints and uptime must be managed. To fully embrace this manufacturing
mindset, we call this pipeline the “data factory.”
In DataOps, the flow of data through operations is an important area of focus. DataOps
orchestrates, monitors and manages the data factory. One particularly powerful lean-manufacturing tool is statistical process control (SPC). SPC measures and monitors the data and operational characteristics of the pipeline, ensuring that statistics remain within acceptable ranges.
While the name “DataOps” implies that it borrows most heavily from DevOps, it is all three
of these methodologies — Agile, DevOps and statistical process control — that comprise the intellectual heritage of DataOps. Agile governs analytics development, DevOps optimizes code verification, builds and delivery of new analytics, and SPC orchestrates and monitors
the data factory. Figure 10 illustrates how Agile, DevOps and statistical process control flow
into DataOps.
You can view DataOps in the context of a century-long evolution of ideas that improve how
people manage complex systems. It started with pioneers like Deming and statistical process
control — gradually these ideas crossed into the technology space in the form of Agile,
DevOps and now, DataOps.
DevOps was created to serve the needs of software developers. Dev engineers love coding
and embrace technology. The requirement to learn a new language or deploy a new tool
is an opportunity, not a hassle. They take a professional interest in all the minute details of
code creation, integration and deployment. DevOps embraces complexity.
DataOps users are often the opposite of that. They are data scientists or analysts who are
focused on building and deploying models and visualizations. Scientists and analysts are
typically not as technically savvy as engineers. They focus on domain expertise. They are
interested in getting models to be more predictive or deciding how to best visually render
data. The technology used to create these models and visualizations is just a means to an
end. Data professionals are happiest using one or two tools — anything beyond that adds un-
welcome complexity. In extreme cases, the complexity grows beyond their ability to manage
it. DataOps accepts that data professionals live in a multi-tool, heterogeneous world and it
seeks to make that world more manageable for them.
The data factory takes raw data sources as input and through a series of orchestrated steps
produces analytic insights that create “value” for the organization. We call this the “Value
Pipeline.” DataOps automates orchestration and, using SPC, monitors the quality of data
flowing through the Value Pipeline.
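In skeleton form (this is an illustration of the pattern, not DataKitchen’s implementation), orchestrating the Value Pipeline with a test after every stage might look like:

```python
def ingest(data):
    """Stage 1: pull raw data sources into the pipeline (stubbed)."""
    return data

def transform(data):
    """Stage 2: integrate and model the data (stubbed)."""
    return data

def publish(data):
    """Stage 3: refresh the reports and visualizations (stubbed)."""
    return data

def spc_check(stage_name, data):
    """SPC-style test of a stage's output; raise (and alert) when statistics
    fall outside their acceptable ranges."""
    if data is None:
        raise AssertionError(f"{stage_name} produced no output")

def run_value_pipeline(raw_data):
    """Orchestrate the pipeline end to end, testing every intermediate step."""
    data = raw_data
    for stage in (ingest, transform, publish):
        data = stage(data)
        spc_check(stage.__name__, data)
    return data
```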
The “Innovation Pipeline” is the process by which new analytic ideas are introduced into the
Value Pipeline. The Innovation Pipeline conceptually resembles a DevOps development pro-
cess, but upon closer examination, several factors make the DataOps development process more
challenging than DevOps. Figure 13 shows a simplified view of the Value and Innovation Pipelines.
Figure 13: The DataOps lifecycle — the Value and Innovation Pipelines
DevOps introduces two foundational concepts: Continuous Integration (CI) and Continuous
Deployment (CD). CI continuously builds, integrates and tests new code in a development
environment. Build and test are automated so they can occur rapidly and repeatedly. This
allows issues to be identified and resolved quickly. Figure 14 illustrates how CI encompasses
the build and test process stages of DevOps.
As noted above, the Innovation Pipeline has a representative copy of the data pipeline
which is used to test and verify new analytics before deployment into production. This is
the orchestration that occurs in conjunction with “testing” and prior to “deployment” of new
analytics — as shown in Figure 16.
Orchestration occurs in both the Value and Innovation Pipelines. Similarly, testing fulfills a
dual role in DataOps.
Figure 16: DataOps orchestration controls the numerous tools that access, transform,
model, visualize and report data
In the Innovation Pipeline code is variable and data is fixed. The analytics are revised and
updated until complete. Once the sandbox (analytics development environment) is set up,
the data doesn’t usually change. In the Innovation Pipeline, tests target the code (analytics),
not the data. All tests must pass before promoting (merging) new code into production. A
good test suite serves as an automated form of impact analysis that runs on any and every
code change before deployment.
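A sketch of such a code test, written with pytest against a small fixed dataset and a hypothetical segmentation transform:

```python
import pandas as pd

def segment_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Transform under test: label each customer by spend level."""
    out = df.copy()
    out["segment"] = out["spend"].apply(lambda s: "high" if s >= 100 else "low")
    return out

def test_segment_customers():
    """Code test: the data is fixed; assertions target the code's behavior."""
    fixed = pd.DataFrame({"customer_id": [1, 2], "spend": [250, 40]})
    result = segment_customers(fixed)
    assert list(result["segment"]) == ["high", "low"]
    assert len(result) == len(fixed)  # the transform must not add or drop rows
```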
Some tests are aimed at both data and code. For example, a test that makes sure that a
database has the right number of rows helps your data and code work together. Ultimately
both data tests and code tests need to come together in an integrated pipeline as shown
in Figure 13. DataOps enables code and data tests to work together so that overall quality remains high.
Figure 17: In DataOps, analytics quality is a function of data and code testing
Figure 19: The concept of test data management is a first order problem in DataOps.
The concept of test data management is a first order problem in DataOps whereas in most
DevOps environments, it is an afterthought. To accelerate analytics development, DataOps
has to automate the creation of development environments with the needed data, software,
hardware and libraries so innovation keeps pace with Agile iterations.
In data analytics, the operations team supports and monitors the data pipeline. This can be
IT, but it also includes customers — the users who create and consume analytics. DataOps brings these groups together so they can collaborate more closely.
Figure 20: DataOps combines data analytics development and data operations
Centralizing analytics development under the control of one group, such as IT, enables the
organization to standardize metrics, control data quality, enforce security and governance,
and eliminate islands of data. The issue is that too much centralization chokes creativity.
DataOps brings three cycles of innovation between core groups in the organization: centralized production teams; centralized data engineering/analytics/science/governance development teams; and groups using self-service tools distributed into the lines of business closest to the customer. Figure 23 shows the interlocking cycles of innovation.
The challenge of pushing analytics into production across these four quite different envi-
ronments is daunting without DataOps. It requires a patchwork of manual operations and
scripts that are in themselves complex to manage. Human processes are error-prone so data
professionals compensate by working long hours, mistakenly relying on hope and heroism for
success. All of this results in unnecessary complexity, confusion and a great deal of wasted
time and energy. Slow progression through the lifecycle shown in Figure 24 coupled with
high-severity errors finding their way into production can leave a data analytics team little
time for innovation.
Implementing DataOps
DataOps simplifies the complexity of data analytics creation and operations. It aligns data
analytics development with user priorities. It streamlines and automates the analytics
development lifecycle — from the creation of sandboxes to deployment. DataOps controls
and monitors the data factory so data quality remains high, keeping the data team focused
on adding value.
A DataOps Platform automates the steps and processes that comprise DataOps: sandbox
management, orchestration, monitoring, testing, deployment, the data factory, dashboards,
Agile, and more. A DataOps Platform is built for data professionals with the goal of simpli-
fying all of the tools, steps and processes that they need into an easy-to-use, configurable,
end-to-end system. This high degree of automation eliminates a great deal of manual work,
freeing up the team to create new and innovative analytics that maximize the value of an
organization’s data.
Freedom and employee empowerment are essential to innovation, but a lack of top-down
control leads to chaos. Self-service tools enable data analysts to create new analytics very
quickly, but they can drift in different directions. Imagine a team of analysts building reports that tally sales figures, each arriving at a different result. One approach includes drop
shipments and sales from distributors/subsidiaries. Another report might consist of product
sales, but not services. These different approaches each have their use case, but from a man-
ager’s perspective inconsistency creates the appearance of inaccuracy. You can’t establish a
shared reality when everyone has different numbers.
Some managers respond to this challenge by centralizing analytics. With data and analytics
under the control of one group, such as IT, you can standardize metrics, control data quality,
enforce security and governance, and eliminate islands of data. These are all worthy endeavors; however, forcing analytic updates through a heavyweight IT development process is a sure way to
stifle innovation. It is one of the reasons that some companies take three months to deploy
ten lines of SQL into production. Analytics have to be able to evolve and iterate quickly to
keep up with user demands and fast-paced markets. Managers instinctively understand that
data analytics teams must be free to innovate. The fast-growing self-service tools market (Tableau, Looker, etc.) addresses this need.
Centralizing analytics brings it under control, but granting analysts free rein is necessary to
stay competitive. How do you balance the need for centralization and freedom? How do you
empower your analysts to be innovative without drowning in the chaos and inconsistency
that a lack of centralized control inevitably produces? Visit any modern enterprise, and you
will find this challenge playing out repeatedly in budget discussions and hiring decisions. You
might say, it is a struggle between centralization and freedom.
Figure 26: Data suppliers, engineers and analysts use different cycle times driven
mainly by their tools, methods and proximity to demanding users.
Analysts choose tools and processes oriented toward this business context. They use
powerful, self-service tools, such as Tableau, Alteryx, and Excel, to quickly create or iterate
on charts, graphs, and dashboards. They organize their work into daily sprints (figure 26), so
they can deliver value regularly and receive feedback from users immediately. Agile tools like
Jira are an excellent way to manage the productivity of analyst daily sprints.
The data analyst is the tip of the innovation spear. Organizations must give data analysts
maximum freedom to experiment. There is far more data in the world than companies
can analyze. Not everything can be placed in data warehouses. Not all data should be opera-
tionalized. Companies need data analysts to play around with different data sets to establish
what is predictive and relevant.
Some companies mistakenly ask data engineering to create data sets for every idea. It is best
to let analysts lead on implementing new analytic ideas and proving them out before consid-
ering how data engineering can help. For example, consider the following:
By this standard, the organization focuses its data engineering resources on those items
that give the most bang for the buck. Keep in mind that when analytics are moved into a data
warehouse, some of the benefits of centralization come at the expense of reduced freedom
— it is slower to update a data warehouse than a Tableau worksheet. It’s important to wait
until analytics have earned the right to make this transition. The value created by centralizing
must outweigh the restriction of freedom.
Data engineers utilize programmable platforms such as AWS (S3, EC2 and Redshift). These
tools require programming in a high-level language and offer greater potential functionality
than the tools used by analysts. The relative complexity of the tools and scope of projects
in data engineering fit best in weekly Agile iterations (figure 26). DataOps platforms like
DataKitchen enable the data engineer to streamline the quality control, orchestration and
data operations aspects of their duties. With automated support for agile development,
impact analysis, and data quality, the data engineer can stay focused on creating and improv-
ing data sets for analysts.
After data sets have proven their value, it’s worth considering whether the benefits of fur-
ther centralization outweigh the cost of a further reduction in freedom. Data suppliers fulfill
the function of greater centralization by providing data sources or data extracts for data
engineering.
There are several reasons that a project may have earned the right to transition to data
suppliers. Analytics may provide functionality that executives wish to make available to the
entire corporation, not just one business unit. It could also be a case of standardization — for
example, the company wants to standardize on an algorithm for calculating market share. In
another example, perhaps data engineering has implemented quality control on a data set
and wishes to achieve efficiencies by pushing this functionality upstream to the data suppli-
er. A data supplier may be an external third party or an internal group, such as an IT master
data management (MDM) team.
After the usefulness of the mastered data is established, the company might decide that the
data has broader uses. They may want the customer or partner list to be available for a portal
or tied into a billing system. This use case requires a higher standard of accuracy for the mas-
tered data than was necessary for the analytic data warehouse. It’s appropriate at this point
to consider moving the MDM to a data supplier, such as a corporate IT team, which is adept at tackling more extensive development initiatives. Put another way, initial data mastery
may have been good enough for analytic insights, but data must be perfect when it is being
used in a billing system. The data supplier takes the MDM to the next level.
Data Suppliers
Projects transitioned to data suppliers tend to incorporate more process and tool complexity
than those in data engineering, leading to a more extended iteration period of one or more
months (figure 26). These projects use tools such as RDBMSs, MDM, Salesforce, Excel, SFTP,
etc., and rely upon waterfall project management and MS Project tracking. Table 2 summariz-
es tools and processes preferred by data suppliers as contrasted with engineers and analysts.
Figure 27: Data Suppliers, Data Engineers and Data Analysts sit on a spectrum of
centralization and innovation/freedom.
Figure 28: Tests verify that data rows, facts and dimensions match business logic
throughout the data pipeline
For example, Figure 28 shows how the DataOps platform orchestrates, tests and monitors
every step of the data operations pipeline, freeing up the team from significant manual
effort. The test verifies that the quantity of data matches business logic at each stage of the
data pipeline. If a problem occurs at any point in the pipeline, the analytics team is alerted and can
resolve the issue before it develops into an emergency. With 24x7 monitoring of the data pipeline,
the team can rest easy and focus on customer requirements for new/updated analytics.
INSTRUCTIONS
1. Place chicken drums and wings in a large zip-lock bag, add marinade, seal
zip-lock bag, mix contents of the bag around gently (you don’t want to acci-
dentally open the bag and marinate your kitchen floor or counter), make sure
your chicken is well coated inside the bag.
2. Refrigerate your chicken in the marinade for 8-24 hours (You can also just
cook them right away if you don’t have the time)
3. Best slow cooked for 5-6 hours in a crockpot or at 225 degrees in a conventional oven — use all the contents of the bag. (If you don’t have that kind of time,
bake at 400 degrees Fahrenheit.) 3.5 lbs. of chicken should bake for 55-60
minutes; 4.5 lbs. of chicken requires 60-65 minutes.
If the groups in your data-analytics organization don’t work together, it can impact analyt-
ics-cycle time, data quality, governance, employee retention and more. A variety of factors
contribute to poor teamwork. Sometimes geographical, cultural and language barriers hinder
communication and trust. Technology-driven companies face additional barriers related to
tools, technology integrations and workflows which tend to drive people into isolated silos.
Figure 30: Delivery of analytics (the value chain) to customers requires contribu-
tions from several groups in the data organization
Let’s explore some of the factors that isolate the tribes from one another. For starters, the
groups are often set apart from each other by the tools that they use. Figure 31 is the same
value chain as above but reconstructed from the perspective of tools.
To be more specific, each of the roles mentioned above (figure 30) view the world through a
preferred set of tools (figure 31):
The day-to-day existence of a data engineer working on a master data management (MDM) platform is quite different from that of a data analyst working in Tableau. Tools influence their optimal iteration cycle time, e.g., months/weeks/days. Tools determine their approach to solving
problems. Tools affect their risk tolerance. In short, they view the world through the lens of
the tools that they use.
The division of each function into a tools silo creates a sense of isolation which prevents the
tribes from contemplating their role in the end-to-end data pipeline. The less they under-
stand about each other, the less compelling the need to communicate about actions taken
which impact others. Communication between teams (people in roles) is critical to the orga-
nization’s success. Most analytics requires contributions from all the teams. The work output
of one team may be an input to another team. In the figure below, the data (and metadata)
build as the work products compound through the value chain.
In many enterprises, there is a natural tendency for the groups to retreat into the complexity
of their local workflow. In figure 33, we represent the local workflow of each tribe with a directed acyclic graph (DAG).
Figure 33: Work groups tend to focus on the complexity of their local workflow
It is too easy to overlook the fact that the shared purpose of these local workflows is to
work together to publish analytics for end-customers.
The two groups managing the two halves of the solution have difficulty maintaining quality,
coordinating their processes and maintaining independence (modularity). Group one tests
part one of the system (figure 35). Group two validates part two. Do the part one and two
tests deliver a unified set of results (and alerts) to all stakeholders? Can tests one and two
evolve independently without breaking each other? These issues repeatedly surface in data
organizations.
In another example, assume that two groups are required to work together to deliver analyt-
ics to the VP of marketing. The home office in Boston handles data engineering and creates
data marts. Their iteration period is weekly. The local team in New Jersey uses the data
marts to create analytics for the VP of Marketing. Their iteration is daily (or hourly).
One day, the VP of Marketing requests new analytics (deadline ASAP) from the data analysts
for a meeting later that day. The analysts jump into action, but face obstacles when they try
to add a new data set. They contact data engineering in Boston. Boston has its own pres-
sures and priorities and their workflow, organized around a weekly cadence, can’t respond to
these requests on an “ASAP” basis.
The home office team in Boston finally makes the needed changes, but they inadvertent-
ly break other critical reports (figure 37). Meanwhile, out of desperation, the New Jersey
team adds the required data sets and updates their analytics. The new data sets are only
available to New Jersey, so other sites are now a revision behind. New Jersey’s reports are
inconsistent with everyone else’s. Misunderstandings ensue. It’s not hard to imagine why the
relationship between these groups could be strained.
These challenges may seem specific to data organizations, but at a high level, everything that
we have discussed boils down to poor communication and lack of coordination between
individuals and groups. As such, we can turn to management science to better understand
the problem and explore solutions.
For those who don’t remember, the airline business in the 1980s and 1990s was brutally
competitive, but during this same period, Southwest Airlines revolutionized air travel. By
the early 2000s, they had experienced 31 straight years of profitability and had a market
capitalization greater than all the other major US airlines combined. Brandeis management
professor Jody Hoffer Gittell investigated the factors in Southwest Airlines’ performance
and, back in 2003, published a quantitative, data-driven analysis shedding light on South-
west’s success.
Dr. Gittell surveyed the major players in the airline industry and found a correlation between
key performance parameters (KPP) and something that she termed Relational Coordination
(RC), the way that relationships influence task coordination, for better or worse. “Relational
coordination is communicating and relating for the purpose of task integration — a powerful
driver of performance when work is interdependent, uncertain and time constrained.”
In her study, higher RC levels correlated with better performance on KPPs, even when comparing two sites within the same company. Since that time, RC has been applied in industries ranging from healthcare to manufacturing across 22 countries.
Members of the “Low-RC” organization express their goals solely in terms of their own
function. They keep knowledge to themselves and there may be a tendency for one group
to look down upon another group. Inter-group communication is inadequate, inaccurate and
might be more concerned with finding blame than finding solutions. As expected, the “High-
RC” organization embodies the exact opposite end of this spectrum.
At this point you may be thinking: “OK fine, this is all touchy-feely stuff. I’ll try to smile
more and I’ll organize a pizza party so everyone can get to know each other.” Maybe you
should (smiling will make you feel good and parties are fun after all), but our experience
is that the good feeling wears off once the last cupcake is gone and the mission-critical
analytics are offline.
How do you keep people working independently and efficiently when their work product is
a dependency for another team? How can one team reuse the data or artifacts or code that
another team produces?
For most enterprises, improving RC requires foundational change. You need to examine
your end-to-end data operations and analytics-creation workflow. Is it building up or tearing
down the communication and relationships that are critical to your mission? Instead of al-
lowing technology to be a barrier to Relational Coordination, how about utilizing automation
and designing processes to improve and facilitate communication and coordination between
the groups? In other words, you need to restructure your data analytics pipelines as services
(or microservices) that create a robust, transparent, efficient, repeatable analytics process
that unifies all your workflows.
Robust – Statistical process control, borrowed from lean manufacturing, calls for tests at the inputs and outputs of each stage of the data operations pipeline. Tests also vet analytics deployments, like an impact review board, so new analytics don’t disrupt critical operations.
Transparent – Dashboards display the status of new analytics development and the
operational status of the data operations pipeline. Automated alerts communicate issues
immediately to appropriate response teams. Team members can see a birds-eye-view of the
end-to-end workflow as well as local workflows.
Efficient – Automated orchestration of the end-to-end data pipeline (from data sources
to published analytics) minimizes manual steps that tie up resources and introduce human
error. Balance is maintained between centralization and decentralization; the need for
fast-moving innovation, while supporting standardization of metrics, quality and governance.
Repeatable – Revision control with built-in error detection and fault resilience is applied to
the data operations pipeline.
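To make the “Robust” property concrete, here is a minimal sketch, in Python, of a statistical-process-control style test that could run between two pipeline stages. The metric and control limits are hypothetical; a real implementation would derive its limits from your own historical process data.

    import statistics

    def check_row_count(current_count, historical_counts, sigmas=3.0):
        """Flag a pipeline stage whose output row count drifts outside
        control limits derived from historical runs."""
        mean = statistics.mean(historical_counts)
        stdev = statistics.stdev(historical_counts)
        lower = mean - sigmas * stdev
        upper = mean + sigmas * stdev
        if not (lower <= current_count <= upper):
            # In a real pipeline this would alert the response team.
            return f"ALERT: row count {current_count} outside [{lower:.0f}, {upper:.0f}]"
        return "OK"

    # Example: ten historical runs of a (hypothetical) customer-orders load
    history = [10230, 10510, 10390, 10450, 10280, 10600, 10340, 10420, 10370, 10490]
    print(check_row_count(10404, history))   # OK
    print(check_row_count(4200, history))    # ALERT: likely a truncated feed

The same pattern applies to null rates, value ranges, or any other metric worth keeping within statistical limits.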
It may help to provide further concrete examples of a DataOps implementation and how
it impacts productivity. Some of these points are further explained in our blog DataOps in
Seven Steps:
• Data Sharing – data sources flow into a data lake which is used to create data ware-
houses and data marts. Bringing data under the control of the data organization decou-
ples it from IT operations and enables it to be shared more easily.
• Environment startup, shutdown – With computing and storage on-demand from cloud
services (infrastructure as code), large data sets and applications (test environments)
can be quickly and inexpensively copied or provisioned to reduce conflicts and depen-
dencies.
• Testing of data and other artifacts – Tests of inputs, outputs, and business logic are applied at each stage of the data analytics pipeline. Tests catch potential errors and warnings before they are released so the quality remains high. Test alerts immediately notify the appropriate team members when something needs attention.
• Reuse of a set of steps across multiple pipelines – Analytics reuse is a vast topic, but the basic idea is to componentize functionalities as services in ways that can be shared. Complex functions, with lots of individual parts, can be containerized using a container technology (like Docker), as sketched below.
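As a sketch of that last point, the Python snippet below shows how a containerized function can be invoked as one step of an orchestrated pipeline. It assumes the Docker CLI is installed; the image name and mount paths are hypothetical.

    import subprocess

    def run_containerized_step(image, input_path, output_path):
        """Run one pipeline step packaged as a Docker image.
        The image name and mount paths here are hypothetical."""
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{input_path}:/data/in",
             "-v", f"{output_path}:/data/out",
             image],
            capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"Step {image} failed: {result.stderr}")
        return result.stdout

    # The same containerized step can be reused by any pipeline that needs it.
    run_containerized_step("profitability-model:1.2", "/lake/orders", "/marts/profit")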
We have seen marked improvements in analytics cycle time and quality with DataOps. It
unlocks an organization’s creativity by forging trust and close working relationships between
data engineers, scientists, analysts and most importantly, users. DataOps is a task coor-
dination and communication framework that uses technology to break down the barriers
between the groups in the data organization. Let’s look at the DataOps enterprise from the
perspective of Relational Coordination.
CONCLUSION
Technology companies face unique challenges in fostering positive interaction and commu-
nication due to tools and workflows which tend to promote isolation. This natural distance
and differentiation can lead the groups in a data organization to act more like warring tribes
than partners. These challenges can be understood through the lens of Relational Coor-
dination; a management theory that has helped explain how some organizations achieve
extraordinary levels of performance as measured by KPPs. DataOps is a tools and method-
ological approach to data analytics which raises the Relational Coordination between teams.
It breaks down the barriers between the warring tribes of data organizations. With faster
cycle time, automated orchestration, higher quality and better end-to-end data pipeline
visibility, DataOps enables data analytics groups to better communicate and coordinate their
activities, transforming warring tribes into winning teams.
Previously, we wrote about how members of large data organizations sometimes behave more like
warring tribes than members of the same team. We discussed how DataOps facilitates commu-
nication and task coordination between groups. Today we move from the macro to the micro-lev-
el. We look at how DataOps operates, within a team, to ease the flow of work from one team
member to the next.
Whether celebrating a team’s success or contemplating its failure, people tend to focus on
team leadership as the most crucial factor in team performance. Richard Hackman, a pioneer
in the field of organizational behavior who studied teams for more than 40 years, called
this the “leader attribution error.” People generally pay more attention to factors that they
can see (like leaders) than to the background structural and contextual factors that actually
determine team performance. Hackman’s groundbreaking insight was to look beyond per-
sonalities, attitudes, or behavioral styles. (Put down your Myers-Briggs assessments.) What
matters most to high-performance teams is the presence of “enabling conditions.”
As W. Edwards Deming famously said, “A bad system will beat a good person every time.”
DataOps applies this point of view to data analytics by taking a process-oriented approach
to improving analytics quality and reducing cycle-time. It seeks to uncover the specific fac-
tors that best contribute to team success.
We live in a world where the average tenure of a CDO or CAO is about 2.5 years. A couple
of years ago, Gartner predicted that 85 percent of AI projects would not deliver for CIOs.
Forrester affirmed this unacceptable situation by stating that 75% of AI projects under-
whelm. Clearly, data-analytics teams need a “tune-up.”
After conducting 300 interviews and 4,200 surveys over 15 years, Haas and Mortensen
(HBR, June 2016) built upon Richard Hackman’s work by identifying four specific conditions
most critical for team success:
1. Compelling direction – explicit and consequential goals that the team is working
toward
2. Strong structure – includes optimally designed tasks and processes, and norms that
promote positive dynamics.
3. Supportive context – includes an information system that provides access to the data needed for the work, and the material resources required to do the job
4. Shared mindset – a common identity and shared understanding among team members that counters “us versus them” thinking
Per Haas and Mortensen, teams are more diverse, dispersed, digital, and dynamic than ever
before. Modern organizations suffer from two corrosive problems — “us versus them” thinking
and incomplete information. The four critical enabling conditions above help teams overcome
these two pervasive problems and can raise overall team productivity while improving the
quality of their work product.
Imagine a typical enterprise. We’ll call them… “Insights Unlimited.” Let’s peek inside the data team’s weekly staff meeting:
Manager: Good morning, everyone. As you know, our new Chief Data Officer has been asking
questions about the large and growing list of work items on Jira. The backlog has grown
steadily and…
Eric (Production Engineer): You’re kidding me, right? I lost most of last week chasing down data errors that originated upstream from one of our data sources. And the new reports that the
development team gave me last week took 20 hours to install and then broke the weekly sales
report. I thought Bill’s (VP of Sales) head was going to explode.
Padma (Data Engineer): Hey, if you had let me test the changes in the “real” environment, I
could have caught those problems upfront.
Eric (Production Engineer): As I have said before, the operational systems are not a sandbox.
Plus, we have to control access to private HIPAA data.
A typical data analytics team has many key players, with distinct skill sets and tool preferenc-
es. There may be production engineers, data engineers, data analysts, data scientists, BI ana-
lysts, QA engineers, test engineers, ETL engineers, DBAs, governance and more. In our little
anecdote, we could have filled a room full of grumpy and frustrated data professionals and
business colleagues. For the sake of simplicity, we pared the team down to two members,
Eric and Padma, who could each represent many people. To further explore the teamwork
issues at Insights Unlimited, let’s get to know Eric and Padma a little better. Note that we’ll be meeting a third key player on our data team as the exploration of Insights Unlimited continues.
Eric (Production Engineer) – Goals: Protect and perfect the daily grind of delivering data; minimize errors and chaos
Padma is the star that turns ideas into analytics that serve the
business. She’s an expert in analytics and machine learning tools.
What motivates Padma is creating exciting new analytics.
Whereas Eric wants to control change to reduce errors, Padma
values a flexible data architecture that can be adapted quickly to
new requirements. She wants to add new data sources and
update schemas easily. Padma is a thought leader in AI and data
science. She’s less interested in the IT infrastructure that powers
the data pipeline. When a new project is assigned, Padma
sometimes has to wait months for the IT department to order and configure a new develop-
ment system or give her access to new data. She also waits weeks or months for the
production team to deploy her new analytics. With the company’s inefficient processes,
Padma has trouble keeping up with user demands for new analytics, and colleagues
sometimes think “she” is the bottleneck. Padma puts a lot of effort into quality, but because
her test environment is different from production, there are always issues that surface
during the production integration. Sometimes erroneous data flows into Padma’s analytics,
distorting results. That isn’t something she can easily anticipate or address because
operations lie outside her domain.
Imagine that the Vice President of Marketing makes an urgent request to the data analytics
team: “I need new data on profitability ASAP.” At Insights Unlimited, the process for creating
and deploying these new analytics would go something like this:
2. Padma requests access to new data. The request goes on the IT backlog. IT grants
access after several weeks.
3. Padma writes a functional specification and submits the proposed change to the Impact
Review Board (IRB), which meets monthly. A key-person is on vacation, so the proposed
feature waits another month.
4. Padma begins implementation. The change that she is making is similar to another re-
cently developed report. Not knowing that, she writes the new analytics from scratch.
Padma’s test environment does not match “production,” so her testing misses some corner cases.
5. Testing on the target environment begins. High-severity errors pull Eric into an “all-
hands-on-deck” situation, putting testing temporarily on hold.
6. Once the fires are extinguished, Eric returns to testing on the target and uncovers some
issues in the analytics. Eric feeds error reports back to Padma. She can’t easily repro-
duce the issues because the code doesn’t fail in the “dev” environment. She spends
significant effort replicating the errors so she can address them. The cycle is repeated a
few times until the analytics are debugged.
7. Analytics are finally ready for deployment. Production schedules the update. The next
deployment window available is in three weeks.
8. After several months have elapsed (total cycle time), the VP of Marketing receives the
new analytics, wondering why it took so long. This information could have boosted
sales for the current quarter if it had been delivered when she had initially asked.
Every organization faces unique challenges, but the issues above are ubiquitous. The situ-
ation we described is not meeting anyone’s needs. Data engineers went to school to learn
how to create analytic insights. They didn’t expect that it would take six months to deploy
twenty lines of SQL. The process is a complete hassle for IT. They have to worry about gov-
ernance and access control and their backlog is entirely unmanageable. Users are frustrated
because they wait far too long for new analytics. We could go on and on. No one here is
enjoying themselves.
The frustration sometimes expresses itself as conflict and stress. From the outside, it looks
like a teamwork problem. No one gets along. People are rowing the boat in different direc-
tions. If managers want to blame someone, they will point at the team leader.
At this point, a manager might try beer, donuts and trust exercises (hopefully not in that
order) to solve the “teamwork issues” in the group. Another common mistake is to coach the
group to work more slowly and carefully. This thinking stems from the fallacy that you have
to choose between quality and cycle time. In reality, you can have both.
DataOps provides production and development with dedicated system environments. Some
enterprises take this step but fail to align these environments. Development uses cloud
platforms while production uses on-prem. Development uses clean data while production
uses raw data. The list of opportunities for misalignment is endless. DataOps requires that system environments be aligned, meaning as close to identical as possible. The more similar the environments, the easier it is to migrate code and replicate errors. Some divergence is necessary; for example, data given to developers may have to be sampled or masked for practical or governance reasons.
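For instance, here is a minimal sketch, in Python, of the kind of sampling and masking step that might prepare production data for a development environment. The field names are hypothetical; real masking policies would come from your governance team.

    import hashlib
    import random

    def mask_and_sample(rows, sample_rate=0.10, seed=42):
        """Return a deterministic sample of rows with direct identifiers masked."""
        rng = random.Random(seed)
        sampled = [r for r in rows if rng.random() < sample_rate]
        for row in sampled:
            # A one-way hash preserves joinability without exposing the raw value.
            row["email"] = hashlib.sha256(row["email"].encode()).hexdigest()[:12]
            row["name"] = "MASKED"
        return sampled

    production = [{"name": "Ann Lee", "email": "ann@example.com", "revenue": 1200}]
    dev_data = mask_and_sample(production, sample_rate=1.0)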
Figure 39 below shows a simplified production environment. The system transfers files
securely using SFTP. It stores files in S3 and utilizes a Redshift cluster. It also uses Docker
containers and runs some Python. Production alerts are forwarded to a Slack channel in
real-time. Note that we chose an example based on Amazon Web Services, but we could
have selected any other tools. Our example applies whether the technology is Azure, GCP,
on-prem or anything else.
DataOps segments production and development into separate release environments — see
Figure 40. In our parlance, a release environment includes a set of hardware resources, a software toolchain, data, and a security Vault that stores encrypted, sensitive access-control information such as the usernames and passwords for tools. Our production engineer, Eric,
manages the production release environment. Production has dedicated hardware and soft-
ware resources so Eric can control performance, quality, governance and manage change.
The production release environment is secure — the developers do not have access to it.
The development team receives its own separate but equivalent release environment,
managed by the third important member of our team: Chris, Insights Unlimited’s DataOps Engineer. Chris also implements the infrastructure that abstracts the release environments so that analytics move easily between dev and production. We’ll describe this further below. Any existing team member with DataOps skills can perform the DataOps engineering function, but in our simplified case study, adding a person will better illustrate how the roles fit together.
Figure 40: Production and development maintain separate but equivalent envi-
ronments. The production engineer manages the production release environment
and the DataOps Engineer manages the development release environment.
Figure 41 below illustrates the separate but equivalent production and development release
environments. If you aren’t familiar with “environments,” think of these as discrete software
and hardware systems with equivalent configuration, tools and data.
Before we continue any further, let’s formally add Chris to the team.
Chris (DataOps Engineer) – Goals: Set up and maintain development environments; accelerate and ease deployment
The processing pipelines for analytics consist of a series of steps that operate on data and
produce a result. We use the term “Pipeline” to encompass all of these tasks. A DataOps
Pipeline encapsulates all the complexity of these sequences, performs the orchestration
work, and tests the results. The idea is that any analytic tool that is invokable under software control can be orchestrated by a DataOps Pipeline. Kitchens, the self-contained workspaces described below, enable team members to access, modify and execute workflow Pipelines. A simple Pipeline is shown in Figure 42.
Pipelines, and the components that comprise them, are made visible within a Kitchen. This
encourages reuse of previously developed analytics or services. Code reuse can be a signifi-
cant factor in reducing cycle time.
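To sketch the idea in code: a Pipeline can be modeled as an ordered list of invokable steps, each followed by its tests. This is a simplified illustration under our own assumptions, not DataKitchen’s implementation; the step names are hypothetical.

    def extract():    return {"rows": 10000}
    def transform(d): return {"rows": d["rows"], "avg_margin": 0.23}
    def publish(d):   print(f"Published analytics on {d['rows']} rows")

    def test_extract(d):   assert d["rows"] > 0, "empty extract"
    def test_transform(d): assert 0.0 < d["avg_margin"] < 1.0, "margin out of range"

    PIPELINE = [(extract, test_extract), (transform, test_transform), (publish, None)]

    def run_pipeline(pipeline):
        data = None
        for step, test in pipeline:
            data = step(data) if data is not None else step()
            if test:
                test(data)  # fail fast, before bad data reaches the next stage
        return data

    run_pipeline(PIPELINE)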
Kitchens should also tightly couple to version control (Insights Unlimited uses Git). When the
development team wants to start work on a new feature, they instantiate a new child Kitch-
en which creates a corresponding Git branch. When the feature is complete, the Kitchen is
merged back into its parent Kitchen, initiating a Git merge. The Kitchen hierarchy aligns with
the source control branch tree. Figure 43 shows how Kitchen creation/deletion corresponds
to a version control branch and merge.
Kitchens may be persistent or temporary; they may be private or shared, depending on the
needs of a project. Access to a Kitchen is limited to a designated set of users or “Kitchen
staff.” The Vault in a release environment supplies a Kitchen with the set of usernames and
passwords needed to access the environment toolchain.
Figure 44: Eric, Chris and Padma each have personal Kitchens, organized in a
hierarchy that aligns with their workflow.
Chris’ workspace is a Kitchen called “demo_dev.” The “demo_dev” Kitchen is the baseline
development workspace, and it points to the development release environment introduced
above, at the bottom of Figure 41. In our example, Chris’ Kitchen serves as a pre-release
staging area where merges from numerous child development Kitchens consolidate and
integrate before being deployed to production. With release environments aligned, Kitchens
don’t have to do anything different or special for merges across release environments versus
merges within a release environment.
Every developer needs a workspace so they may work productively without impacting or be-
ing impacted by others. A Kitchen can be persistent, like a personal workspace, or temporary,
tied to a specific project. Once Kitchen creation is set up, team members create workspaces
as needed. This “self-service” aspect of DataOps eliminates the time that developers used to
wait for systems, data, or approvals. DataOps empowers developers to hit the ground running.
In Figure 44, Padma has created the Kitchen “dev_kitchen.” Padma’s Kitchen can leverage
Pipelines and other services created by the dev team.
In our Figure 41 example, Slack messages are similarly segregated by Kitchen. Note how production alerts are directed to the Slack channels “#imp_errors” and “#imp_alerts,” while dev alerts are sent to a Kitchen-specific Slack channel. This prevents production from seeing any dev-related Slack messages. It also prevents the developers from receiving each other’s Slack messages. Alerts could easily be managed on a much finer-grained level if required.
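A minimal sketch of this kind of routing, assuming Slack incoming webhooks and hypothetical webhook URLs, might look like the following.

    import requests  # third-party HTTP library

    # Hypothetical incoming-webhook URLs, one per channel.
    WEBHOOKS = {
        "production": "https://hooks.slack.com/services/T000/B000/prod",
        "dev_kitchen": "https://hooks.slack.com/services/T000/B000/padma",
    }

    def send_alert(environment, message):
        """Route an alert to the Slack channel belonging to the environment
        (production channels vs. Kitchen-specific channels)."""
        url = WEBHOOKS[environment]
        requests.post(url, json={"text": message}, timeout=10)

    send_alert("dev_kitchen", "Test warning: null rate above threshold in stage 2")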
Think back to the earlier request by the VP of Marketing for “new analytics.” DataOps coordinates this multi-step, multi-person and multi-environment workflow and manages it from end to end.
STEP 3 — IMPLEMENTATION
Padma’s Kitchen provides her with Pipelines that serve as a significant head start on the new
profitability analytics. Padma procures the test data she needs (de-identified) and configures
toolchain access (SFTP, S3, Redshift, …) for her Kitchen. Padma implements the new analyt-
ics by modifying an existing Pipeline. She adds additional tests to the existing suite, checking
that incoming data is clean and valid. She writes tests for each stage of ETL/processing to
ensure that the analytics are working from end to end. The tests verify her work and will
also run as part of the production flow. Her new Pipelines include orchestration of the data
and analytics as well as all tests. The tests direct messages and alerts to her Kitchen-specific
Slack channel. With the extensive testing, Padma knows that her work will migrate seam-
lessly into production with minimal effort on Eric’s part. Now that release environments have
been aligned, she’s confident that her analytics work in the target environment.
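The tests Padma writes can be as simple as assertions over each stage’s output. A minimal, pytest-style sketch (with hypothetical column names) follows; the same checks run in her Kitchen and, later, in production.

    # test_profitability.py — run with pytest
    rows = [
        {"customer_id": "C001", "revenue": 1200.0, "cost": 800.0},
        {"customer_id": "C002", "revenue": 950.0, "cost": 990.0},
    ]

    def test_no_missing_keys():
        assert all(r["customer_id"] for r in rows), "null customer_id found"

    def test_values_plausible():
        for r in rows:
            assert r["revenue"] >= 0 and r["cost"] >= 0, "negative amount"

    def test_margin_within_limits():
        margins = [(r["revenue"] - r["cost"]) / r["revenue"] for r in rows]
        assert all(-1.0 <= m <= 1.0 for m in margins), "margin out of range"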
Before she hands off her code for pre-production staging, Padma first has to merge down
from “demo_dev” Kitchen so that she can integrate any relevant changes her coworkers have
made since her branch. She reruns all her tests to ensure a clean merge. If there is a conflict in the code merge, the DataOps Platform will pop up a three-panel UI to enable further investigation and resolution. When Padma is ready, she updates and reassigns the JIRA ticket.
If the data team were larger, the new analytics could be handed off from person to person, in
a line, with each person adding their piece or performing their step in the process.
STEP 4 — PRE-RELEASE
In our simple example, Chris serves as the pre-release engineer. With a few clicks, Chris
merges Padma’s Kitchen “dev_kitchen” back into the main development Kitchen “demo_dev,”
initiating a Git merge. After the merge, the Pipelines that Padma updated are visible in Chris’
Kitchen. If Chris is hands-on, he can review Padma’s work, check artifacts, rerun her tests, or
even add a few tests of his own, providing one last step of QA or governance. Chris creates a
schedule that, once enabled, will automatically run the new Pipeline every Monday at 6 am.
When Chris is satisfied, he updates and reassigns the JIRA ticket, letting Eric know that the
feature is ready for deployment.
DATAOPS BENEFITS
As our short example demonstrated, the DataOps Teamwork Process delivers these benefits:
• Ease movement of work between team members across many tools and environments – Kitch-
ens align the production and development environment(s) and abstract the machine,
tools, security and networking resources underlying analytics. Analytics easily migrate
from one team member to another or from dev to production. Kitchens also bind
changes to source control.
• Collaborate and coordinate work – DataOps provides teams with the compelling
direction, strong structure, supportive context and shared mindset that are necessary
for effective teamwork.
• Automate work and reduce errors – Automated orchestration reduces process vari-
ability and errors resulting from manual steps. Input, output and business logic tests at
each stage of the workflow ensure that analytics are working correctly, and that data is
within statistical limits. DataOps runs tests both in development and production, con-
tinuously monitoring quality. Warnings and errors are forwarded to the right person/
channel for follow up.
• Maintain security – Kitchens are secured with access control. Kitchens then access a
release environment toolchain using a security Vault which stores unique usernames/
passwords.
• Leverage best practices and re-use – Kitchens include Pipelines and other reusable
components which data engineers can leverage when developing new features.
• Self-service – Data professionals can move forward without waiting for resources or
committee approval.
• Transparency – Pipeline status and statistics are available in messages, reports and
dashboards.
Manager: Good morning, everyone. I’m pleased to report that the VP of Marketing called the
CDO thanking him for a great job on the analytics last week.
Padma (Data Engineer): Fortunately, I was able to leverage a Pipeline developed a few months
ago by the MDM team. We were even able to reuse most of their tests.
Chris (DataOps Engineer): Once I set up Kitchen creation, Padma was able to start being
productive immediately. With matching release environments, we quickly migrated the new
analytics from dev to production.
Eric (Production Engineer): The tests are showing that all data remains within statistical limits.
The dashboard indicators are all green.
DataOps helps our band of frustrated and squabbling data professionals achieve a much
higher level of overall team productivity by establishing processes and providing resources
that support teamwork. With DataOps, two key performance parameters improve dramati-
cally — the development cycle time of new analytics and quality of data and analytics code.
We’ve seen it happen time and time again.
What’s even more exciting is the business impact of DataOps. When users request new
analytics and receive them in a timely fashion, it initiates new ideas and uncharted areas
of exploration. This tight feedback loop can help analytics achieve its true aim, stimulating
creative solutions to an enterprise’s greatest challenges. Now that’s teamwork!
Analytics teams need to move faster, but cutting corners invites problems in quality and
governance. How can you reduce cycle time to create and deploy new data analytics (data,
models, transformation, visualizations, etc.) without introducing errors? The answer relates
to finding and eliminating the bottlenecks that slow down analytics development.
Figure 45: The creation of analytics in a large data organization requires the
contribution of many groups.
Tasks in development organizations are often tracked using Kanban boards, tickets or project tracking tools. Figure 46 is a Kanban board, representing a project, with a yellow sticky note for each task. As tasks progress through milestones, they move from left to right until they reach the “Done” column.
Data professionals are smart and talented. They work hard. Why does it take so long to
move work tickets to the right? Why does the system become overloaded with so many unfinished work items, forcing the team to waste cycles context switching?
To address these questions, we need to think about the creation and deployment of ana-
lytics like a manufacturing process. The collective workflows of all of the data teams are
a linked sequence of steps, not unlike what you would see in a manufacturing operation.
When we conceptualize the development of new analytics in this way, it offers the possi-
bility of applying manufacturing management tools that uncover and implement process
improvements.
THE BOTTLENECK
In Eliyahu Goldratt’s business novel “The Goal,” plant manager Alex Rogo’s complex manufacturing process, with its long sequence of interdependent stages, was throughput-limited by one particular operation — a certain machine with limited capacity. This machine was the “constraint” or bottleneck. The Theory of Constraints views every process as a series of linked activities, one of which acts as a constraint on the overall throughput of the entire system. The constraint could be a human resource, a process, or a tool/technology.
In “The Goal,” Alex learned that “an improvement at any point in the system, not at the constraint, is an illusion.” An improvement made at a stage that feeds work to the bottleneck just increases the queue in front of the bottleneck (Figure 48).
Even though Alex’s robots improved efficiency at one stage of his manufacturing process,
they didn’t alleviate the true system constraint. When Alex’s team focused improvement
efforts on raising the throughput of the bottleneck, they were finally able to increase the
throughput of the overall manufacturing process. True, some of their metrics looked worse
(the robot station efficiency declined), but they were able to reduce cycle time, ship product
on time and make a lot more money for the company. That is, after all, the real “goal” of a
manufacturing facility.
How do you find the constraint in a data organization? Several signals point to it:
• Expedite – Look for areas where you are regularly being asked to divert resources to
ensure that critical analytics reach users. In data analytics, data errors are a common
source of unplanned work.
• Cycle Time – Pay attention to the steps in your process with the longest cycle time.
For example, some organizations take 6 months to shepherd 20 lines of SQL through
the impact review board. Naturally, if a step is starved or blocked by a dependency, the
bottleneck is the external factor.
• Demand – Note steps in your pipeline or process that are simply not keeping up with
demand. For example, often less time is required to create new analytics than to test
and validate them in preparation for deployment.
When managers talk to data analysts, scientists and engineers, they can quickly discover
the issues that slow them down. Figure 49 shows some common constraints. For example,
data errors in analytics cause unplanned work that upsets a carefully crafted Kanban board.
Work-in-progress (WIP) is placed on hold, and key personnel context switch to address the errors.
A related problem, also shown in figure 49, occurs when deployment of new analytics breaks
something unexpectedly. Unsuccessful deployments can be another cause of unplanned
work which can lead to excessive caution, and burdensome manual operations and testing.
Another common constraint is team coordination. The teams may all be furiously rowing
the boat, but perhaps not in the same direction. In a large organization, the teams’ work is usually interdependent, and the result can be a serialized pipeline. Tasks could be parallelized if the teams collaborated better. With proper coordination between and among teams, new analytics wouldn’t break existing data operations.
A wide variety of constraints potentially slow down analytics development cycle time. In
development organizations, there are sometimes multiple constraints in effect. There is also
variation in the way that constraints impact different projects. The following are some poten-
tial rate-limiting bottlenecks to rapidly deploying analytics:
• Manual orchestration
The Theory of Constraints prescribes a series of focusing steps for addressing a bottleneck:
1. Identify the constraint – Determine which step limits the throughput of the overall system
2. Exploit the constraint – Make improvements to the throughput of the constraint using
existing resources
3. Subordinate everything to the constraint – Review all activities and make sure that they
benefit (or do not negatively impact) the constraint. Remember, any loss in productivity
at the constraint is a loss in throughput for the entire system.
The data organization creates analytics for its consumers (users, colleagues, business units,
managers, …). Think of analytics as your product and data consumers as your customers. As in any product or service organization, perhaps you should simply ask your customers what they want.
Management consultant Anthony Ulwick contends (Harvard Business Review) that you
should not expect your customers to recommend solutions to their problems. They aren’t
expert enough for that. Instead, ask about desired outcomes. What do they want analytics to
do for them? The customers might say that they want changes to analytics to be completed
very fast so they can play with ideas. They won’t tell you to implement automated orchestra-
tion or a data warehouse which can both contribute to that outcome.
The outcome-based methodology for gathering customer input breaks down into five steps.
The opportunity algorithm makes use of a simple mathematical formula to estimate the
potential opportunity associated with a particular outcome:
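Ulwick’s formula is commonly stated as: opportunity = importance + max(importance − satisfaction, 0), with both factors rated by customers on a 1-to-10 scale. The sketch below shows how a data team might use it to rank candidate improvements; the outcomes and ratings are hypothetical.

    def opportunity(importance, satisfaction):
        """Ulwick's opportunity score: important, under-served outcomes score highest."""
        return importance + max(importance - satisfaction, 0)

    outcomes = {
        "Changes to analytics delivered within a day": (9, 3),
        "Analytics are error-free": (10, 6),
        "Self-service access to new data sets": (7, 5),
    }

    ranked = sorted(outcomes.items(),
                    key=lambda kv: opportunity(*kv[1]), reverse=True)
    for name, (imp, sat) in ranked:
        print(f"{opportunity(imp, sat):>4}  {name}")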
If you have multiple bottlenecks, you can’t address them all at once. The opportunity
algorithm enables the data organization to prioritize process improvements that produce
outcomes that are recognized as valued by users. It avoids the requirement for users to
understand the technology, tools, and processes behind the data analytics pipeline. For
DataOps proponents, it can provide a clear path forward for analytics projects that are both
important and appreciated by users.
Do you deserve a promotion? You may think to yourself that your work is exceptional. Could
you prove it?
As a Chief Data Officer (CDO) or Chief Analytics Officer (CAO), you serve as an advocate for
the benefits of data-driven decision making. Yet many CDOs are surprisingly unanalytical
about the activities relating to their own department. Why not use analytics to shine a light
on yourself?
Internal analytics could help you pinpoint areas of concern or provide a big-picture assess-
ment of the state of the analytics team. We call this set of analytics the CDO Dashboard. If
you are as good as you think you are, the CDO Dashboard will show how simply awesome
you are at what you do. You might find it helpful to share this information with your boss
when discussing the data analytics department and your plans to take it to the next level.
Below are some reports that you might consider including in your CDO dashboard:
VELOCITY CHART
The velocity chart shows the amount of work completed during each sprint — it displays how
much work the team is doing week in and week out. This chart can illustrate how improved
processes and indirect investments (training, tools, process improvements, …) increase veloc-
ity over time.
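If your team tracks work in a tool like Jira, the velocity chart reduces to a simple aggregation. A minimal sketch, using hypothetical ticket records, follows.

    from collections import defaultdict

    # Hypothetical export of completed tickets: (sprint, story points)
    completed = [("2019-S1", 5), ("2019-S1", 3), ("2019-S2", 8),
                 ("2019-S2", 5), ("2019-S2", 2), ("2019-S3", 13)]

    velocity = defaultdict(int)
    for sprint, points in completed:
        velocity[sprint] += points

    # Print a crude per-sprint velocity bar chart.
    for sprint in sorted(velocity):
        print(sprint, "#" * velocity[sprint], velocity[sprint])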
As the Chief Data Officer (or Chief Analytics Officer) of your company, you manage a team,
oversee a budget and hold a mandate to set priorities and lead organizational change. The
bad news is that everything that could possibly go wrong from a security, governance and
risk perspective is your responsibility. If you do a perfect job, then no one on the manage-
ment team ever hears your name.
The average tenure of a CDO or CAO is about 2.5 years. In our conversations with data and
analytics executives, we find that CDOs and CAOs often fall short of expectations because
they fail to add sufficient value in an acceptable time frame. If you are a CDO looking to sur-
vive well beyond year two, we recommend avoiding three common traps that we have seen
ensnare even the best and brightest.
Data offense expands top-line revenue, builds the brand, grows the company and in general
puts points on the board. Using data analytics to help marketing and sales is data offense.
Companies may acknowledge the importance of defense, but they care passionately about
offense and focus on it daily. Data offense provides the organization with direct value and it
is what gets CDOs and CAOs promoted.
The challenge for a CDO is that data defense is hard. A company’s shortcomings in gover-
nance, security, privacy, or compliance may be glaringly obvious. In some cases, new regu-
lations like GDPR (General Data Protection Regulation, EU 2016/679) demand immediate attention.
In a fast-paced, competitive environment, an 18-month integration project can seem like the
remote future. Also, success is uncertain until you deliver. Your C-level peers know that big
software integration projects fail half the time. Projects frequently turn out to be more com-
plex than anticipated, and they often miss the mark. For example, you may have thought you
needed ten new capabilities, but your internal customers only really require seven, and two
of them were not on your original list. The issue is that you won’t know which seven features
are critical until around the time of your second annual performance review and by then it
might be too late to right the ship.
Figure 60: CDOs often make the dual mistake of (1) focusing too much on delivering indirect value (governance, security, privacy, or compliance, …) and (2) using a waterfall project methodology which defers the delivery of value to the end of a long project cycle. In the case shown, it takes several months to deliver direct value.
A data valuation project can take months of effort and consume the attention of the CDO and her staff on what is essentially an internally focused, intellectual exercise. In the end,
you have a beautiful PowerPoint presentation with detailed spreadsheets to back it up. Your
data has tremendous value that can and should be carried on the balance sheet. You tell every-
one all about it — why don’t they care?
Don’t confuse data valuation with data offense. Knowing the theoretical value of data is not
data offense. While data valuation may be useful and important in certain cases, it is often a
distraction. All of the time and resources devoted to creating and populating the valuation
model could have been spent on higher value-add activities.
Figure 61: DataOps uses an iterative product management methodology (Agile develop-
ment) that enables the CDO to rapidly deliver direct value (growing the top line).
One of the greatest challenges in analytics is earning the trust of your organization’s CEO
and management team. A lot of people in the business world make decisions with their gut.
They rely on experience and intuition, but many companies would prefer to depend upon
data. You cannot walk through an airport these days without seeing some version of an advertisement saying “we are the company that will help you become more data-driven.”
People do not always trust data. Imagine you are an executive and an employee walks into
your office and shows you charts and graphs that contradict strongly held assumptions
about your business. A lot of managers in this situation favor their own instincts. Data-ana-
lytics professionals, who tend to be doers, not talkers, are sometimes unable to convince an
organization to trust its data.
DataOps relies upon the data lake design pattern, which enables data analytics teams to up-
date schemas and transforms quickly to create new data warehouses that promptly address
pressing business questions. DataOps incorporates the continuous-deployment methodolo-
gy that is characteristic of DevOps. This reduces the cycle time of new analytics by an order
of magnitude. When users get used to quick answers, it builds trust in the data-analytics
team, and stimulates the type of creativity and teamwork that leads to breakthroughs.
A company that trusts its data develops a unified view of reality and can formulate a
shared vision of how to achieve its goals. Data-driven companies deliver higher growth and
ultimately higher valuations than their peers. As a CAO or CDO, leading the organization
to become more data-driven is your mission. DataOps makes that easier by helping the da-
ta-analytics team deliver quickly and robustly, creating value that is recognized and trusted
by the organization.
First, we walk, then we run. The same is true in data analytics. In our many discussions, we
have encountered companies that are just starting out with data analytics and others with
substantial organizations handling petabytes of data. Everyone that we meet is somewhere
along this spectrum of maturity. We’ve found that just because an enterprise’s data analytics
organization is large does not mean that it is excellent. In fact, the flaws in a process or meth-
odology become particularly noticeable when a team grows beyond the initial stages.
This situation could have implications for the company’s future. What if competitors have
devised a way to use data analytics to garner a competitive advantage? Without a compre-
hensive data strategy, a company risks missing the market.
Boutique Analytics tends to be ad hoc: one-off reports that answer questions posed by a manager. For example, a global enterprise may wish to know how much of its revenue it derives from one customer. Data is exported from CRMs or operations systems and
pulled into a spreadsheet for analysis. The term Boutique Analytics may make it sound small
in scale, but some large enterprises are known to rely solely upon this approach. A large
enterprise might run weekly reports exporting sales data into a flat file. The global sales and
marketing team can then easily manipulate the data in a spreadsheet. The sharing of data
using flat files can be used to complement an enterprise’s operational analytics.
There is nothing inherently wrong with Boutique Analytics. It is a great way to explore the
best ways to deliver value based on data. The eventual goal should be to operationalize the
data and deliver that value on a regular basis. This can be time-consuming and error-prone if
executed manually.
In the Waterfall world, development cycles are long and rigidly controlled. Projects pass
through a set of sequential phases: architecture, design, test, deployment, and maintenance.
Changes in the project plan at any stage cause modifications to the scope, schedule or
budget of the project. As a result, Waterfall projects are resistant to change. This is wholly
appropriate when you are building a bridge or bringing a new drug to market, but in the field
of data analytics, changes in requirements occur on a continuous basis. Teams that use Wa-
terfall analytics often struggle with development cycle times that are much longer than their
users expect and demand. Waterfall analytics also tends to be labor intensive, which makes
every aspect of the process slow and susceptible to error. Most data-analytics teams today
are in the Waterfall analytics stage and are often unaware that there is a better way.
INSTRUCTIONS
Preheat oven to 350 degrees Fahrenheit. Combine dry ingredients (flour through nutmeg) in a small
bowl. In a separate bowl, mix together yogurt, vanilla, brown sugar and honey. Add
egg. Add mashed up bananas. Slowly fold dry ingredients into wet. Stir in cran-
berries and 3/4 cup walnuts gently. Pour mixture into buttered loaf pan. Sprinkle
remaining walnuts on top of loaf. Bake about 45 minutes, or until lightly browned
and knife comes out clean.
A couple of years ago, Gartner predicted that 85 percent of AI projects would not deliver for
CIOs. Forrester affirmed this unacceptable situation by stating that 75% of AI projects under-
whelm. We can’t claim that AI projects fail only for the reasons we listed. We can say, from
our experience working with data scientists on a daily basis, that these issues are real and
pervasive. Fortunately, data science teams can address these challenges by applying lessons
learned in the software industry.
In machine learning, the code can learn. The ML application trains the model using data and
target results. An ML model developer feeds training data into the ML application, along with
correct or expected answers. Errors are then fed back into the learning algorithm to boost the model’s accuracy.
Figure 62: Traditional model programming versus machine learning model development
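In code, the contrast is stark: rather than hand-writing rules, the developer feeds examples and answers to a learning algorithm. A minimal sketch using scikit-learn (assuming it is installed, and using synthetic data as a stand-in) follows.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for training data plus correct answers.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)            # errors drive the weight updates
    print("held-out accuracy:", model.score(X_test, y_test))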
Below, Figure 63 further elaborates on the complex set of steps that are involved in model
building. Naturally, AI projects begin with a business objective. Data is often imperfect so the
team has to clean, prepare, mask, normalize and otherwise manipulate data so that it can be
used effectively. Feature extraction identifies metrics (measured values) that are informative
and facilitate training. After the building and evaluation phases (see Figure 63), the model
is deployed, and its performance is monitored. When business conditions or requirements
change, the team heads back to the lab for additional training and improvements. This pro-
cess continues for as long as the model is in use.
Figure 64: Machine Learning Code is only a small part of the overall system.
Model creation and deployment commonly use the tools shown in Figure 65. Note that
this is only a portion of what is required by the system in Figure 64. If the responsibility for
these processes and toolchains falls on the data science team, they can end up spending the
majority of their time on data cleaning and data engineering. Unfortunately, this is all too
common in contemporary enterprises. Addressing this situation requires us to take a holistic
view of the value pipeline and analytics creation.
The second pipeline is the process for new model creation — see Figure 62 and Figure 63. In this development pipeline, new AI and ML models are designed, tested and deployed into the Value Pipeline, the operational pipeline that delivers data and analytics to users. We call the development pipeline the “Innovation Pipeline.” Figure 66 depicts the Value and Innovation Pipelines intersecting in production.
Conceptually, each pipeline is a set of stages implemented using a range of tools. The stages
may be executed serially, parallelized or contain feedback loops. In terms of artifacts, the
pipeline stages are defined by files: scripts, source code, algorithms, html, configuration files,
parameter files, containers and other files. From a process perspective, all of these artifacts
are essentially just source code. Code controls the entire data-analytics pipeline from end to
end: ideation, design, training, deployment, operations, and maintenance.
When discussing code and coding, data scientists who create AI and ML models often think: “This has nothing to do with me. I am a data analyst/scientist, not a coder. I am an ML tool expert. In process terms, what I do is just a sophisticated form of configuration.” This is a common misconception, and it leads to technical debt. When it is time for that debt to be
paid, the speed of new analytics development (cycle time) will slow to a crawl.
Don’t get us wrong. We love our tools, but don’t buy into this falsehood. The $50+ billion AI
market is divided into two segments: “tools that create code” and “tools that run code.” The
point is — AI is code. The data scientist creates code and must own, embrace and manage
the complexity that comes along with it.
• Studies show that software development projects complete significantly faster and with
far fewer defects when Agile Development, an iterative project management method-
ology, replaces the traditional Waterfall sequential methodology. The Agile methodolo-
gy is particularly effective in environments where requirements are quickly evolving — a
situation well known to data science professionals. Some enterprises understand that
they need to be more Agile and that’s great. (Here’s your chance to learn from the mis-
takes of many others.) You won’t receive much benefit from Agile if your quality is poor
or your deployment and monitoring processes involve laborious manual steps. “Agile
development” alone will not make your team more “agile.”
• DevOps, which inspired the name DataOps, focuses on continuous delivery by leverag-
ing on-demand IT resources and by automating test and deployment of code. Imagine
clicking a button in order to test and publish new ML analytics into production. This merging
of software development and IT operations reduces time to deployment, decreases
time to market, minimizes defects, and shortens the time required to resolve issues.
Borrowing methods from DevOps, DataOps brings these same improvements to data
science.
• Like lean manufacturing, DataOps utilizes statistical process control (SPC) to mon-
itor and control the Value Pipeline. When SPC is applied to data science, it leads to
remarkable improvements in efficiency and quality. With SPC in place, the data flowing
through the operational system is verified to be working. If an anomaly occurs, the data
team will be the first to know, through an automated alert. Dashboards make the state
of the pipeline transparent from end to end.
• DataOps eliminates technical debt and improves quality by orchestrating the Value and
Innovation Pipelines. It catches problems early in the data life cycle by implementing
tests at each pipeline stage. Further, it greatly accelerates the development of new AI,
enabling the data science team to respond much more flexibly to changing business
conditions.
You can implement DataOps for your AI and ML project yourself by following these seven
steps:
In so many cases, the files associated with analytics are distributed in various places within
an organization without any governing control. A revision control tool, such as Git, helps
to store and manage all of the changes to code. It also keeps code organized, in a known
repository and provides for disaster recovery. Revision control also helps software teams
parallelize their efforts by allowing them to branch and merge.
Figure 67: With “Branch and Merge,” team members can work on features independently without impacting each other or the value pipeline.
Branching and merging allow the data science team to run their own tests, make changes,
take risks and experiment. If a set of changes proves to be unfruitful, the branch can be
discarded, and the team member can start over.
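The underlying operations are plain Git. A sketch of the branch-and-merge cycle, driven from Python and assuming it runs inside an existing repository, might look like this; the branch name and commit message are hypothetical.

    import subprocess

    def git(*args):
        """Thin wrapper that fails loudly if a Git command errors."""
        subprocess.run(["git", *args], check=True)

    # Branch: experiment in isolation from the value pipeline.
    git("checkout", "-b", "feature/profitability-model")
    # ... edit model code, data tests, configuration ...
    git("add", "-A")
    git("commit", "-m", "Add profitability model and stage tests")

    # Merge: fold the validated work back into the main line.
    git("checkout", "master")
    git("merge", "feature/profitability-model")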
It’s worth reemphasizing that data scientists need a copy of the data. In the past, creating
copies of databases was expensive. With storage on-demand from cloud services, a terabyte
data set can be quickly and inexpensively copied to reduce conflicts and dependencies. If the
data is still too large to copy, it can be sampled.
CONCLUSION
While AI and data science tools improve the productivity of model development, the actual
ML code is a small part of the overall system solution. Data science teams that don’t apply
modern software development principles to the data lifecycle can end up with poor quality
and technical debt that causes unplanned work.
DataOps offers a new approach to creating and operationalizing AI that minimizes techni-
cal debt, reduces cycle time and improves code and data quality. It is a methodology that
enables data science teams to thrive despite increasing levels of complexity required to
deploy and maintain AI in the field. It decouples operations from new analytics creation and
rejoins them under the automated framework of continuous delivery. The orchestration of
the development, deployment, operations and monitoring toolchains dramatically simplifies
the daily workflows of the data science team. Without the burden of technical debt and
unplanned work, they can focus on their area of expertise: creating new models that help
the enterprise realize its mission.
We’ve been attending data conferences for over 20 years. It has been common to see pre-
senters display a data architecture diagram like the (simplified) one below (figure 69). A data
architecture diagram shows how raw data turns into insights. As the Eckerson Group writes,
“a data architecture defines the processes to capture, transform, and deliver usable data to
business users.”
In our canonical data architecture diagram, data sources flow in from the left and pass
through transformations to generate reports and analytics for users or customers on the
right. In the middle, live all of the tools of the trade: raw data, refined data, data lakes/
warehouses/marts, data engineering, data science, models, visualization, governance and
more. Tools and platforms can exist in the cloud or on premises. Most large enterprise data
architectures have evolved to use a mix of both.
When data professionals define data architectures, the focus is usually on production
requirements: performance, latency, load, etc. Engineers and data professionals do a great
job executing on these requirements. The problem is that the specifications don’t include
architecting for rapid change.
They only think about production, not the process of making changes to production. A DataOps Data Architecture makes the steps to change what is in production a “central idea.” Changes over time to your code, your servers, and your tools, along with monitoring for errors, are first-class citizens in the design.
Building a data architecture without planning for change is much worse than building a mo-
bile phone with a fixed battery. While mobile phone batteries are swapped every few years,
your analytics users are going to want changes every day or sometimes every hour. That may
be impossible with your existing data architecture, but you can meet this requirement if you
architect for it. If data architectures are designed with these goals in mind, they can be more
flexible, responsive, and robust. Legacy data pipelines can be upgraded to achieve these aims
by enhancing the architecture with modern tools and processes.
1. Update and publish changes to analytics within an hour without disrupting operations
If you are a data architect yourself (or perhaps you play one), you may already have creative
ideas about how you might address these types of requirements. You would have to maintain
separate but identical development, test, and production environments. You would have to
orchestrate and automate test, monitoring, and deployment of new analytics to production.
When you architect for flexibility, quality, rapid deployment, and real-time monitoring of
data (in addition to your production requirements), you are moving towards a DataOps data
architecture as shown in figure 70.
The DataOps data architecture expands the traditional operations-oriented data architecture
by including support for Agile iterative development, DevOps, and statistical process control.
We call these tools and processes collectively a DataOps Platform. The DataOps elements in figure 70 are the additions that make rapid, safe change possible.
You have the “Right to Repair” your data architecture — design for it!
Design Thinking has grown beyond its physical design roots to guide innovation in educa-
tion, business and computer science. As you would expect, data professionals are now apply-
ing design thinking to data science. Design thinking can serve as a major boost to corporate
innovation. Unfortunately, most data organizations are not set up for a rapid feedback loop
of Ideation and Experimentation, so creativity never shifts into high gear.
Figure 73 shows the stages of Empathy, Ideation and Experimentation in series. As any
experienced “Design Thinker” will tell you, the stages do not necessarily happen in sequence.
Experimentation can lead to deeper Empathy, which fuels Ideation. The process can loop
through these stages repeatedly and in any order.
Figure 74: Cycle time is the period of time required to turn a new idea into de-
ployed analytics. In many organizations, cycle time is unacceptably long.
Productive Design Thinking depends on a quick turn-around between ideation and ex-
perimentation. The problem is that the Experimentation phase can be unacceptably slow.
Lengthy analytics cycle time occurs for a variety of reasons:
• Poor teamwork within the data team
• Lack of collaboration between groups within the data organization
• Waiting for IT to provision or configure system resources
• Waiting for access to data
• Moving slowly and cautiously to avoid poor quality
• Requiring approvals, such as from an Impact Review Board
• Inflexible data architectures
• Process bottlenecks
• Technical debt from previous deployments
• Poor quality creating unplanned work
Figure 75: Factors that derail the dev team and lengthen analytics cycle time
Quality is an important aspect of cycle time. A team can’t reduce cycle time if they are
being constantly interrupted by quality problems and high-severity alerts. DataOps applies
automated testing to the data operations pipeline as well as the release pipeline for new
analytics.
Data analytics professionals get used to being in no-win situations. Internal customers make
a simple request, such as adding a new file to the database. Users expect requests like
these to take days, yet, in many large organizations, they require months to complete. At
DataKitchen, we repeatedly hear from companies that they need to improve their cycle time
for new analytics. One approach, Agile Data Warehousing, applies Agile principles to data
warehouse projects in an attempt to speed innovation. However, many companies quickly
discover that simply implementing Scrum is not sufficient to attain results.
Imagine that you oversee a fifty-person team managing numerous large integrated databases
(DB) for a big insurance or financial services company. You have 300 terabytes (TB) of data
which you manage using a proprietary database. Between software, licensing, maintenance,
support and associated hardware, you pay $10M in annual fees. Even putting an-
other single CPU into production could cost hundreds of thousands of dollars.
The machine environments are different and have to be managed and maintained separately.
New analytics are tested on each machine in turn — first in dev, then QA and finally produc-
tion. You may not catch every problem in dev and QA since they aren’t using the same data
and environment as production.
In our hypothetical company, the organization of the workforce is also a factor in slowing the
team’s velocity. Everyone is assigned a fixed role. Adding a table to a database involves sev-
eral discrete functions: a Data Quality person who analyzes the problem, a Schema/Architect
who designs the schema, an ETL engineer who writes the ETL, a Test Engineer who writes
tests and a Release Engineer who handles deployment. Each of these functions is performed
sequentially and requires considerable documentation and committee review before any
action is taken. Hand-off meetings mark the transition from one stage to the next.
The team wants to move faster but is prevented from doing so due to heavyweight process-
es, serialization of tasks, overhead, difficulty in coordination and lack of automation. They
need a way to increase collaboration and streamline the many inefficiencies of their current
process without having to abandon their existing tools.
Shared Workspace – DataOps creates a shared workspace so team members have visibility
into each other’s work. This enables the team to work more collaboratively and seamlessly
outside the formal structure of the hand-off meeting. DataOps also streamlines documenta-
tion and reduces the need for formal meetings as a communication forum.
Orchestration – DataOps deploys code updates to each machine instantiation and auto-
mates the execution of tests along each stage of the data analytics pipeline. This includes
data and logic tests that validate both the production and feature deployment pipelines.
Tests are parameterized so that they run equally well against the subset database of each
particular machine environment. As the test suite improves, it grows to reflect the full breadth of
the production environment. Automated tests are run repeatedly so you can be confident
that new features have not broken old ones.
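As a sketch of how parameterized tests might look in practice (the environment names, database files, table name, and thresholds below are our own illustration, not taken from any particular production system), the same row-count check can run against the subset database of each environment:

import sqlite3

# Hypothetical sketch: one test, parameterized per machine environment.
ENVIRONMENTS = {
    "dev":  {"db": "dev_subset.db",  "min_rows": 1_000},
    "qa":   {"db": "qa_subset.db",   "min_rows": 10_000},
    "prod": {"db": "production.db",  "min_rows": 1_000_000},
}

def test_fact_table_row_count(env_name: str) -> bool:
    """Verify the sales fact table meets this environment's minimum row count."""
    env = ENVIRONMENTS[env_name]
    conn = sqlite3.connect(env["db"])
    try:
        (rows,) = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()
    finally:
        conn.close()
    passed = rows >= env["min_rows"]
    print(f"[{env_name}] fact_sales rows={rows}: {'PASS' if passed else 'FAIL'}")
    return passed

for name in ENVIRONMENTS:
    test_fact_table_row_count(name)

Because the threshold travels with the environment configuration, the identical test logic runs unchanged in dev, QA and production.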
These tools and process changes together break down the organizational and technology
barriers that prevent the team from implementing Agile methods in data analytics. DataOps
unburdens the team from non-value-add tasks and empowers them to self-organize around
new creative initiatives. When the team is free to innovate, the continuous improvement
culture built into DataOps will begin working to reduce the cycle time of new analytics from
months to days (and less). This ultimately puts the Agility back into Agile Data Warehousing
by delivering high-quality analytics to users in a timely fashion.
People often speak of the data lake as a repository for raw data, but it can also be helpful to
move processed data into the data lake. There are several important advantages to using
a data lake. First and foremost, the data analytics team controls access to it. Nothing can
frustrate progress more than having to wait for access to an operational system (ERP, CRM,
MRP, …). Additionally, a data lake brings data together in one place. This makes it much eas-
ier to process. Imagine buying items at garage sales all over town and placing them in your
backyard. When you need the items, it is much easier to retrieve them from the backyard
rather than visiting each of the garage sale sites. A data lake serves as a common store for all
of the organization’s critical data. Easy, unrestricted access to data removes a major
roadblock that slows down the development of new analytics.
The structure of a data lake is designed to support efficient data access. This relates to how
data is organized and how software accesses it. A database schema establishes the relation-
ship between the entities of data.
UNDERSTANDING SCHEMAS
A database schema is a collection of tables. It dictates how the database is structured and
organized and how the various data relate to each other. Below is a schema that might be
used in a pharmaceutical-sales analytics use case. There are tables for products, payers,
period, prescribers and patients with an integer ID number for each row in each table. Each
sale recorded has been entered in the fact table with the corresponding IDs that identify the
product, payer, period, and prescriber respectively. Conceptually, the IDs are pointers into
the other tables.
The schema establishes the basic relationships between the data tables. A schema for an
operational system is optimized for inserts and updates. The schema for an analytics system,
like the star schema shown here, is optimized for reads, aggregations, and is easily under-
stood by people.
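To make the structure concrete, here is a minimal sketch of the star schema described above, built in an in-memory SQLite database (the table and column names are our own illustration):

import sqlite3

# Dimension tables hold the descriptive entities; the fact table records each
# sale as a row of integer IDs that point into the dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product    (product_id    INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE dim_payer      (payer_id      INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE dim_period     (period_id     INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_prescriber (prescriber_id INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE fact_sales (
    product_id    INTEGER REFERENCES dim_product(product_id),
    payer_id      INTEGER REFERENCES dim_payer(payer_id),
    period_id     INTEGER REFERENCES dim_period(period_id),
    prescriber_id INTEGER REFERENCES dim_prescriber(prescriber_id),
    units_sold    INTEGER
);
""")

# A read-optimized star-schema query: join facts to a dimension and aggregate.
print(conn.execute("""
    SELECT p.name, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name
""").fetchall())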
You might hear the term data mart in relation to data analytics. Data marts are a streamlined
form of data warehouses. The two are conceptually very similar.
Data transforms (scripts, source code, algorithms, …) create data warehouses from data
lakes. In DataOps this process is optimized by keeping transform code in source control and
by automating the deployment of data warehouses. An automated deployment process is
significantly faster, more robust and more productive than a manual deployment process.
DataOps moves the enterprise beyond slow, inflexible, disorganized and error-prone manual
processes. The DataOps pipeline leverages data lakes and transforms them into well-crafted
data warehouses using continuous deployment techniques. This speeds the creation and
deployment of new analytics by an order of magnitude. Additionally, the DataOps pipeline is
constantly monitored using statistical process control so the analytics team can be confident
of the quality of data flowing through the pipeline. Work Without Fear or Heroism. With
these tools and process improvements, DataOps compresses the cycle time of innovation
while ensuring the robustness of the analytic pipeline. Faster and higher quality analytics ul-
timately lead to better insights that enable an enterprise to thrive in a dynamic environment.
In DataOps, the data analytics team moves at lightning speed using highly optimized tools and pro-
cesses. One of the most important productivity tools is the ability to reuse and containerize code.
When we talk about reusing code, we mean reusing data analytics components. All of the
files that comprise the data analytics pipeline — scripts, source code, algorithms, html, con-
figuration files, parameter files — we think of these as code. Like other software develop-
ment, code reuse can significantly boost coding velocity.
Code reuse saves time and resources by leveraging existing tools, libraries or other code
in the extension or development of new code. If a software component has taken several
months to develop, it effectively saves the organization several months of development time
when another project reuses that component. This practice can be used to decrease projects
budgets. In other cases, code reuse makes it possible to complete projects that would have
been impossible if the team were forced to start from scratch.
Containers make code reuse much simpler. A container packages everything needed to run
a piece of software — code, runtimes, tools, libraries, configuration files — into a stand-alone
executable. Containers are somewhat like virtual machines but use fewer resources because
they do not include full operating systems. A given hardware server can run many more
containers than virtual machines.
A container eliminates the problem in which code runs on one machine, but not on another,
because of slight differences in the set-up and configuration of the two servers or software
environments. A container enables code to run the same way on every machine by auto-
mating the task of setting up and configuring a machine environment. This is one DataOps
technique that facilitates moving code from development to production: the run-time
environment is the same for both. One popular open-source container technology is Docker.
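As a sketch of what this looks like in practice (using the docker-py library; the image name, command, and environment variable are hypothetical placeholders), a pipeline step packaged as a container runs the same way on any machine that has Docker installed:

import docker  # pip install docker; requires a running Docker Engine

client = docker.from_env()

# Everything the transform needs (runtime, libraries, configuration) is baked
# into the image, so this call behaves identically in dev, QA and production.
logs = client.containers.run(
    image="example/sales-transform:1.0",  # hypothetical image
    command="python run_transform.py",
    environment={"TARGET_ENV": "dev"},
    remove=True,  # clean up the container after it exits
)
print(logs.decode())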
Each step in the data-analytics pipeline consumes the output of the prior stage and feeds the
next stage. It is cumbersome to work with an entire data-analytics pipeline as one mono-
lith, so it is common to break it down into smaller components. On a practical level, smaller
components are much easier to reuse by other team members.
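A minimal sketch of the idea (all names are illustrative): each component is a small function whose output feeds the next, so any stage can be reused in another pipeline:

# Each stage consumes the output of the prior stage and feeds the next.
def extract(source: str) -> list[dict]:
    # A real component would pull rows from the data lake.
    return [{"customer": "A", "sales": 100}, {"customer": "B", "sales": 250}]

def transform(rows: list[dict]) -> list[dict]:
    # Derive a new column without mutating the input rows.
    return [{**row, "sales_k": row["sales"] / 1000} for row in rows]

def load(rows: list[dict]) -> None:
    for row in rows:
        print(row)

# Compose the stages: the pipeline is just data flowing between components.
load(transform(extract("lake://sales")))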
Kurt Cagle’s perceptive analysis of data science titled “Why You Don’t Need Data Scientists”
explains the many reasons that data science falls short of the high expectations usually
placed on it:
• Fancy dashboards are pretty but are only as valuable as the data behind them.
Data quality often… stinks.
• Data sets are quirky and difficult to work with.
• Users/stakeholders know their business domain but little about what data can do for them.
• A multimillion-dollar initiative to rebuild the data pipeline from the ground up is gener-
ally off the table.
• The people who own the databases won’t give data scientists access.
• Everyone agrees that integrating data from disparate databases is really, really hard, but
in reality, it’s much harder than people think.
These are all excellent points and often the conversation ends here — in exasperation. We
can tell you that we have been there and have the PTSD to prove it. Fortunately, a few years
ago, we found a way out of what may seem at times like a no-win situation. We believe that the
secret to successful data science is a little about tools and a lot about people and processes.
INSTRUCTIONS
1. Combine flour, yeast and salt in a large bowl and stir with your DataKitchen
spoon. Add water and stir until blended; dough will be shaggy. You may need
an extra ¼ cup of water to get all the flour to blend in. Cover bowl with plas-
tic wrap. Let dough rest at least 4 hours (12-18 hours is good too) at warm
room temperature, about 70 degrees.
2. Lightly oil a work surface and place dough on it; fold it over on itself once or
twice. Cover loosely with plastic wrap and let rest 30 minutes more. This is a
good time to turn the oven on to 425°F.
3. Put a 6-to-8-quart heavy covered pot (cast iron, enamel, Pyrex or ceramic)
in the oven as it heats. When dough is ready, carefully remove pot from
oven. Slide your hand under dough and put it into pot, seam side up. Shake
pan once or twice if dough is unevenly distributed; it will straighten out as it
bakes.
4. Cover with lid and bake 30 minutes, then remove lid and bake another 15 to
30 minutes, until loaf is beautifully browned. Cool on a rack.
NOTES
In a convection oven, cook 23 minutes with the lid on, and then 5 minutes with
the lid off.
You don’t need to pre-heat the pot. You can put the dough on a cookie sheet. The
only difference is the crust will not be as crunchy or as beautifully browned. You
can experiment with a round shape or Italian or French loaf shapes. The longer
shapes will take less time to cook.
You can also cook at a lower temperature (e.g. 350°F). In all cases, take the bread
out when the internal temperature reaches 190°F - 200°F. Use a meat thermom-
eter to check.
Some companies take six months to write 20 lines of SQL and move it into production.
The last thing that an analytics professional wants to do is introduce a change that breaks
the system. Nobody wants to be the object of scorn, the butt of jokes, or a cautionary tale.
If that 20-line SQL change is misapplied, it can be a “career-limiting move” for an analytics
professional.
Analytics systems grow so large and complex that no single person in the company under-
stands them from end to end. A large company often institutes slow, bureaucratic proce-
dures for introducing new analytics in order to reduce fear and uncertainty. They create
a waterfall process with specific milestones. There is a lot of documentation, checks and
balances, and meetings — lots of meetings.
Imagine you are building technical systems that integrate data and produce models and
visualizations. How does a change in one area affect other areas? In a traditional established
company, that information is locked in various people’s heads. The company may think it has
no choice but to gather these experts together in one room to discuss and analyze proposed
changes. This is called an “impact analysis meeting.” The process includes the company’s
most senior technical contributors: the backbone of data operations. Naturally, these individ-
uals are extremely busy and subject to high-priority interruptions. Sometimes it takes weeks
to gather them in one room. It can take additional weeks or months for them to approve a
change.
The impact analysis team is a critical bottleneck that slows down updates to analytics. A
DataOps approach to improving analytics cycle time adopts process optimization techniques
from the manufacturing field. In a factory environment, a small number of bottlenecks
often limit throughput. This is called the Theory of Constraints. Optimize the throughput of
bottlenecks and your end-to-end cycle time improves (check out “The Goal” by Eliyahu M.
Goldratt).
DataOps automates testing. Environments are spun up under machine control and test
scripts, written in advance, are executed in batch. Automated testing is much more cost-ef-
fective and reliable than manual testing, but the effectiveness of automated testing depends
on the quality and breadth of the tests. In a DataOps enterprise, members of the analytics
team spend 20% of their time writing tests. Whenever a problem is encountered, a new test
is added. New tests accompany every analytics update. The breadth and depth of the test
suite continuously grow.
These concepts are new to many data teams, but they are well established in the software
industry. As figure 78 shows, the cycle time of software development releases has been
(and continues to be) reduced by orders of magnitude through automation and process
improvements. The automation of impact analysis can have a similar positive effect on your
organization’s analytics cycle time.
Figure 78: Software developers have reduced the cycle time for new releases
by orders of magnitude using automation and process improvements
ANALYTICS IS CODE
At this point some of you are thinking this has nothing to do with me. I am a data analyst/sci-
entist, not a coder. I am a tool expert. What I do is just a sophisticated form of configuration. This
is a common point of view in data analytics. However, it leads to a mindset that slows down
analytics cycle time.
Tools vendors have a business interest in perpetuating the myth that if you stay within the
well-defined boundaries of their tool, you are protected from the complexity of software
development. This is ill-considered.
Don’t get us wrong. We love our tools, but don’t buy into this falsehood.
The $100B analytics market is divided into two segments: tools that create code and tools
that run code. The point is — data analytics is code. The data professional creates code and
must own, embrace and manage the complexity that comes along with it.
Figure 79 shows a data operations pipeline with code at every stage of the pipeline. Python,
SQL, R — these are all code. The tools of the trade (Informatica, Tableau, Excel, …) are code
too. If you open an Informatica or Tableau file, it’s XML. It contains conditional branches
(if-then-else constructs) and loops, and you can embed Python or R in it.
Figure 80: Tableau files are stored as XML, and can contain conditional
branches, loops and embedded code.
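You can see this for yourself. Here is a hedged sketch that scans a Tableau workbook for its embedded calculation formulas (the element and attribute names follow the common .twb layout and may vary between Tableau versions; the file name is hypothetical):

import xml.etree.ElementTree as ET

# A .twb workbook is XML; calculated fields typically appear as
# <calculation ... formula="IF ... THEN ... END"/> elements.
tree = ET.parse("sales_dashboard.twb")  # hypothetical workbook file
for calc in tree.iter("calculation"):
    formula = calc.get("formula")
    if formula:
        print(formula)  # if-then-else logic lives here, just like source code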
Remember our 20-line SQL change that took six months to implement? The problem is that
analytics systems become so complex that they can easily break if someone makes one mis-
begotten change. The average data-analytics pipeline encompasses many tools (code genera-
tors) and runs lots of code. Between all of the code and people involved, data operations
becomes a combinatorially complex hairball of systems that could come crashing down with
one little mistake.
For example, imagine that you have analytics that sorts customers into five bins based on
some conditional criterion. Deep inside your tool’s XML file is an if-then-else construct that
is responsible for sorting the customers correctly. You have numerous reports based off of
a template that contains this logic. They provide information to your business stakeholders:
top customers, middle customers, gainers, decliners, whales, profitable customers,…
There’s a team of IT engineers, database developers, data engineers, analysts and data
scientists that manage the end to end system that supports these analytics. One of these
individuals makes a change. They convert the sales volume field from an integer into a
decimal. Perhaps they convert a field that was US dollars into a different currency. Maybe
they rename a column. Everything in the analytics pipeline is so interdependent that the
change breaks all of the reports that contain the if-then-else logic upon which the original five
categories are built. All of a sudden, your five customer categories become one category, or
the wrong customers are sorted into the wrong bins. None of the dependent analytics are
correct, reports are showing incorrect data, and the VP of Sales is calling you hourly.
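Here is a hedged sketch of a business-logic test that would catch this kind of breakage before the reports publish (the bin boundaries and field name are invented for illustration):

# The five-bin customer sort as a testable function.
BINS = [(0, 10_000), (10_000, 50_000), (50_000, 250_000),
        (250_000, 1_000_000), (1_000_000, float("inf"))]

def assign_bin(sales_volume: float) -> int:
    for i, (low, high) in enumerate(BINS):
        if low <= sales_volume < high:
            return i
    raise ValueError(f"sales volume {sales_volume!r} fits no bin")

def test_customers_spread_across_bins(customers: list[dict]) -> None:
    counts = [0] * len(BINS)
    for customer in customers:
        counts[assign_bin(customer["sales_volume"])] += 1
    # If every customer collapses into one bin, something upstream changed.
    assert sum(1 for n in counts if n > 0) > 1, f"suspicious bin counts: {counts}"

test_customers_spread_across_bins([
    {"sales_volume": 5_000}, {"sales_volume": 75_000}, {"sales_volume": 2_000_000},
])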
Whether you use an analytics tool like Informatica or Tableau, an Integrated Development
Environment (IDE) like Microsoft Visual Studio (figure 83) or even a text editor like Notepad,
you are creating code. The code that you create interacts with all of the other code that
populates the DAG that represents your data pipeline.
To automate impact analysis, think of the end-to-end data pipeline holistically. Your test
suite should verify software entities on a stand-alone basis as well as how they interact.
Figure 83: Developers write SQL, Python and other code using an integrated
development environment or sometimes a simple editor like Notepad.
The development of new analytics follows a different path, which is shown in figure 85 as
the Innovation Pipeline. The Innovation Pipeline delivers new insights to the data operations
pipeline, regulated by the release process. To safely develop new code, the analyst needs an
isolated development environment. When creating new analytics, the developer creates an
environment analogous to the overall system. If the database is terabytes in size, the
developer may work with a representative subset of the data. If the data is subject to
privacy or other regulations, then sensitive information is removed. Once the environment is
set up, the data typically remains stable.
Table 5: In the Value Pipeline code is fixed and data is variable. In the Innovation
Pipeline, data is fixed, and code is variable.
In the Innovation Pipeline code is variable, but data is fixed. Tests target the code, not the
data. The unit, integration, functional, performance and regression tests that were men-
tioned above are aimed at vetting new code. All tests are run before promoting (merging)
new code to production. Code changes should be managed using a version control system
such as Git. A good test suite serves as an automated form of impact analysis that can
be run on any and every code change before deployment.
Some tests are aimed at both data and code. For example, a test that makes sure that a
database has the right number of rows helps your data and code work together. Ultimately
both data tests and code tests need to come together in an integrated pipeline as shown
in figure 86. DataOps enables code and data tests to work together so that overall quality
remains high.
A unified, automated test suite that tests/monitors both production data and analytic code
is the linchpin that makes DataOps work. Robust and thorough testing removes or minimizes
the need to perform manual impact analysis, which avoids a bottleneck that slows innova-
tion. Removing constraints helps speed innovation and improve quality by minimizing analyt-
ics cycle time. With a highly optimized test process you’ll be able to expedite new analytics
into production with a high level of confidence.
We recently talked to a data team in a financial services company that lost the trust of their
users. They lacked the resources to implement quality controls so bad data sometimes
leaked into user analytics. After several high-profile episodes, department heads hired their
own people to create reports. For a data-analytics team, this is the nightmare scenario, and it
could have been avoided.
Organizations trust their data when they believe it is accurate. A data team can struggle to
produce high-quality analytics when resources are limited, business logic keeps changing
and data sources have less-than-perfect quality themselves. Accurate data analytics are the
product of quality controls and sound processes.
The data team can’t spend 100% of its time checking data, but if data analysts or scientists
spend 10-20% of their time on quality, they can produce an automated testing and monitor-
ing system that does the work for them. Automated testing can work 24x7 to ensure that
bad data never reaches users, and when a mishap does occur, it helps to be able to assure
users that new tests can be written to make certain that an error never happens again. Auto-
mated testing and monitoring greatly multiplies the effort that a data team invests in quality.
Figure 87 depicts the data-analytics pipeline. In this diagram, databases are accessed and
then data is transformed in preparation for being input into models. Models output visualiza-
tions and reports that provide critical information to users.
Along the way, tests ask important questions. Are data inputs free from issues? Is business
logic correct? Are outputs consistent? As in lean manufacturing, tests are performed at every
step in the pipeline. For example, data input tests are analogous to manufacturing incoming
quality control. Figure 88 shows examples of data input, output and business logic tests.
Data input tests strive to prevent any bad data from being fed into subsequent pipeline
stages. Allowing bad data to progress through the pipeline wastes processing resources and
increases the risk of never catching an issue. It also focuses attention on the quality of data
sources, which must be actively managed — manufacturers call this supply chain management.
Data output tests verify that a pipeline stage executed correctly. Business logic tests validate
data against tried and true assumptions about the business. For example, perhaps all Europe-
an customers are assigned to a member of the Europe sales team.
Test results saved over time provide a way to check and monitor quality versus historical
levels.
FAILURE MODES
A disciplined data production process classifies failures according to severity level. Some
errors are fatal and require the data analytics pipeline to be stopped. In a manufacturing
setting, the most severe errors “stop the line.”
Some test failures are warnings. They require further investigation by a member of the data
analytics team. Was there a change in a data source? Or a redefinition that affects how data
is reported? A warning gives the data-analytics team time to review the changes, talk to
domain experts, and find the root cause of the anomaly.
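A minimal sketch of severity classification (the severity levels and handling policy here are illustrative, not prescriptive):

from enum import Enum

class Severity(Enum):
    FATAL = "stop the pipeline"  # the manufacturing "stop the line"
    WARNING = "notify the data team for investigation"
    INFO = "log only"

def handle_test_result(test_name: str, severity: Severity) -> None:
    print(f"{test_name}: {severity.value}")
    if severity is Severity.FATAL:
        raise SystemExit(f"pipeline stopped by {test_name}")

handle_test_result("top-50 customer drift", Severity.WARNING)
handle_test_result("input row count is zero", Severity.FATAL)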
Finding issues before your internal customers do is critically important for the data team.
There are three basic types of tests that will help you find issues before anyone else: location
balance, historical balance and statistical process control.
Figure 89: Location Balance Tests verify 1M rows in raw source data, and the
corresponding 1M rows / 300K facts / 700K dimension members in the database
schema, and 300K facts / 700K dimension members in a Tableau report
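A sketch of a Location Balance test using the counts from figure 89 (the count-collection functions are hypothetical stand-ins for queries against each stage of the pipeline):

# Row counts at each location in the pipeline (stand-ins for real queries).
def count_raw_rows() -> int:      return 1_000_000
def count_schema_facts() -> int:  return 300_000
def count_schema_dims() -> int:   return 700_000
def count_report_facts() -> int:  return 300_000
def count_report_dims() -> int:   return 700_000

def location_balance_test() -> None:
    facts, dims = count_schema_facts(), count_schema_dims()
    assert facts + dims == count_raw_rows(), "schema does not reconcile with source"
    assert count_report_facts() == facts, "report fact count drifted from schema"
    assert count_report_dims() == dims, "report dimensions drifted from schema"
    print("location balance: PASS")

location_balance_test()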
HISTORICAL BALANCE
Historical Balance tests compare current data to previous or expected values. These tests
rely upon historical values as a reference to determine whether data values are reasonable
(or within the range of reasonable). For example, a test can check the top fifty customers or
suppliers. Did their values unexpectedly or unreasonably go up or down relative to historical
values?
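In code, a Historical Balance test might look like the following hedged sketch (the customer values and the 25% tolerance are invented for illustration):

TOLERANCE = 0.25  # allow +/- 25% movement before raising a warning

historical = {"Acme": 1_200_000, "Globex": 950_000, "Initech": 400_000}
current    = {"Acme": 1_150_000, "Globex": 310_000, "Initech": 415_000}

def historical_balance_test() -> list[str]:
    warnings = []
    for customer, baseline in historical.items():
        change = abs(current[customer] - baseline) / baseline
        if change > TOLERANCE:
            warnings.append(f"{customer}: moved {change:.0%} vs. history")
    return warnings

for warning in historical_balance_test():
    print("WARNING:", warning)  # Globex dropped ~67%, investigate before users ask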
It’s not enough for analytics to be correct. Accurate analytics that “look wrong” to users raise
credibility questions. Figure 90 shows how a change in allocations of SKUs, moving from
pre-production to production, affects the sales volumes for product groups G1 and G2. You
can bet that the VP of sales will notice this change immediately and will report back that
the analytics look wrong. This is a common issue for analytics — the report is correct, but it
reflects poorly on the data team because it looks wrong to users. What has changed? When
confronted, the data-analytics team has no ready explanation. Guess who is in the hot seat.
Historical Balance tests could have alerted the data team ahead of time that product group
sales volumes had shifted unexpectedly. This would give the data-analytics team a chance to
investigate and communicate the change to users in advance. Instead of hurting credibility,
this episode could help build it by showing users that the reporting is under control and that
the data team is on top of changes that affect analytics. “Dear sales department, you may no-
tice a change in the sales volumes for G1 and G2. This is driven by a reassignment of SKUs within
the product groups.”
Automated tests and alerts enforce quality and greatly lessen the day-to-day burden of
monitoring the pipeline. The organization’s trust in data is built and maintained by producing
consistent, high-quality analytics that help users understand their operational environment.
That trust is critical to the success of an analytics initiative. After all, trust in the data is really
trust in the data team.
Figure 93: Tests verify the results for each intermediate step in the analytics pipeline.
In data analytics, tests should verify that the results of each intermediate step in the
production of analytics match expectations. Even very simple tests can be useful. For
example, a simple row-count test could catch an error in a join that inadvertently produces a
Cartesian product. Tests can also detect unexpected trends in data, which might be flagged
as warnings. Imagine that the number of customer transactions exceeds its historical average
by 50%. Perhaps that is an anomaly that upon investigation would lead to insight about
business seasonality.
Tests in data analytics can be applied to data or models either at the input or output of a
phase in the analytics pipeline. Tests can also verify business logic.
The data analytics pipeline is a complex process with steps often too numerous to be moni-
tored manually. SPC allows the data analytics team to monitor the pipeline end-to-end from
a big-picture perspective, ensuring that everything is operating as expected. As an automat-
ed test suite grows and matures, the quality of the analytics is assured without adding cost.
This makes it possible for the data analytics team to move quickly — enhancing analytics to
address new challenges and queries — without sacrificing quality.
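As a simple sketch of statistical process control applied to one pipeline metric (the daily row counts below are invented), compute control limits from history and alert when today's value falls outside them:

from statistics import mean, stdev

# Seven days of history for one pipeline metric, e.g., rows loaded per day.
history = [10_120, 9_980, 10_310, 10_050, 9_870, 10_240, 10_160]
today = 14_900

center = mean(history)
sigma = stdev(history)
lower, upper = center - 3 * sigma, center + 3 * sigma  # 3-sigma control limits

if lower <= today <= upper:
    print("within control limits")
else:
    print(f"ALERT: {today} outside control limits [{lower:.0f}, {upper:.0f}]")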
Instructions
1. Preheat oven to 325°F. Line a baking sheet with parchment paper.
2. In a large bowl, cream together butter and sugar until light and fluffy.
3. Beat in honey, vanilla, and both eggs, adding the eggs in one at a time.
4. In a medium bowl, whisk together flour, baking soda, cinnamon, and salt.
5. Working by hand or at a low speed, gradually incorporate flour mixture into
honey mixture.
6. Stir in trail mix.
7. Shape cookie dough into 1-inch balls and place onto prepared baking sheet,
leaving about 2 inches between each cookie to allow for the dough to
spread.
8. Bake for 12-15 minutes, until cookies are golden brown.
9. Cool for 3-4 minutes on the baking sheet, then transfer to a wire rack to cool
completely.
Years ago, prior to the advent of Agile Development, a friend of mine worked as a release en-
gineer. His job was to ensure a seamless build and release process for the software develop-
ment team. He designed and developed builds, scripts, installation procedures and managed
the version control and issue tracking systems. He played a mean mandolin at company
parties too.
The role of release engineer was (and still is) critical to completing a successful software
release and deployment, but as these things go, my friend was valued less than the software
developers who worked beside him. The thinking went something like this — developers
could make or break schedules and that directly contributed to the bottom line. Release
engineers, on the other hand, were never noticed, unless something went wrong. As you
might guess, in those days the job of release engineer was compensated less generously
than development engineer. Often, the best people vied for positions in development where
compensation was better.
Whereas a release engineer used to work off in a corner tying up loose ends, the DevOps
engineer is a high-visibility role coordinating the development, test, IT and operations
functions. If a DevOps engineer is successful, the wall between development and operations
melts away and the dev team becomes more agile, efficient and responsive to the market.
This has a huge impact on the organization’s culture and ability to innovate. With so much
at stake, it makes sense to get the best person possible to fulfill the DevOps engineer role
and compensate them accordingly. When DevOps came along, the release engineer went
from fulfilling a secondary supporting role to occupying the most sought-after position in
the department. Many release engineers have successfully rebranded themselves as DevOps
engineers and significantly upgraded their careers.
Data engineers, data analysts, data scientists — these are all important roles, but they will be
valued even more under DataOps. Too often, data analytics professionals are trapped into
relying upon non-scalable methods: heroism, hope or caution. DataOps offers a way out of
this no-win situation.
The capabilities unlocked by DataOps impact everyone who uses data analytics — all the
way to the top levels of the organization. DataOps breaks down the barriers between data
analytics and operations. It makes data more easily accessible to users by redesigning the
data analytics pipeline to be more flexible and responsive. It will completely change what
people think of as possible in data analytics.
In many organizations, the DataOps engineer will be a separate role. In others, it will be a
shared function. In any case, the opportunity to have a high-visibility impact on the organi-
zation will make DataOps engineering one of the most desirable and highly compensated
functions. Like the release engineer whose career was transformed by DevOps, DataOps will
boost the fortunes of data analytics professionals. DataOps will offer select members of the
analytics team a chance to reposition their roles in a way that significantly advances their
career. If you are looking for an opportunity for growth as a DBA, ETL Engineer, BI Analyst,
or another role, look into DataOps as the next step.
And watch out, Data Scientist: the real sexiest job of the 21st century is DataOps Engineer.
Picture what you could accomplish if your organization had accurate and detailed informa-
tion about products, processes, customers and the market. If your company does not have
a data analytics function, you need to start one. Better yet, if data analytics is not serving as
a competitive advantage in your organization, you need to step up your game and establish a
DataOps team.
Data analytics analyzes internal and external data to create value and actionable insights.
Analytics is a positive force that is transforming organizations around the globe. It helps cure
diseases, grow businesses, serve customers better and improve operational efficiency.
In analytics there is mediocre and there is better. A typical data analytics team works slowly,
all the while living in fear of a high-visibility data quality issue. A high-performance data
analytics team rapidly produces new analytics and flexibly responds to marketplace demands
while maintaining impeccable quality. We call this a DataOps team. A DataOps team can
Work Without Fear or Heroism because they have automated controls in place to enforce
a high level of quality even as they shorten the cycle time of new analytics by an order of
magnitude. Want to upgrade your data analytics team to a DataOps team? It comes down to
roles, tools and processes.
DATA ENGINEER
The data engineer is a software or computer engineer who lays the groundwork for other
members of the team to perform analytics. The data engineer moves data from operation-
al systems (ERP, CRM, MRP, …) into a data lake and writes the transforms that populate
schemas in data warehouses and data marts. The data engineer also implements data tests
for quality.
DATA SCIENTIST
Data scientists perform research and tackle open-ended questions. A data scientist has
domain expertise, which helps him or her create new algorithms and models that address
questions or solve problems.
For example, consider the inventory management system of a large retailer. The company
has a limited inventory of snow shovels, which have to be allocated among a large number of
stores. The data scientist could create an algorithm that uses weather models to predict buy-
ing patterns. When snow is forecasted for a particular region it could trigger the inventory
management system to move more snow shovels to the stores in that area.
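As a toy sketch of the idea (the regions, forecasts, threshold, and weighting scheme are all invented for illustration), the allocation could weight each region by its forecast snowfall:

FORECAST_INCHES = {"Northeast": 14.0, "Midwest": 3.0, "South": 0.0}
SNOW_THRESHOLD = 6.0  # inches of forecast snow that triggers extra allocation
TOTAL_SHOVELS = 9_000

# Weight each region by forecast snowfall above the threshold, plus a base share.
weights = {region: 1.0 + max(0.0, inches - SNOW_THRESHOLD)
           for region, inches in FORECAST_INCHES.items()}
total_weight = sum(weights.values())

allocation = {region: round(TOTAL_SHOVELS * w / total_weight)
              for region, w in weights.items()}
print(allocation)  # most shovels flow to the snowy Northeast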
The process and tools enhancements described above can be implemented by anyone on
the analytics team or a new role may be created. We call this role the DataOps Engineer.
DATAOPS ENGINEER
The DataOps Engineer applies Agile Development, DevOps and statistical process controls
to data analytics. He or she orchestrates and automates the data analytics pipeline to make
it more flexible while maintaining a high level of quality. The DataOps Engineer uses tools
to break down the barriers between operations and data analytics, unlocking a high level of
productivity from the entire team.
As DataOps breaks down the barriers between data and operations, it makes data more
easily accessible to users by redesigning the data analytics pipeline to be more responsive,
efficient and robust. This new function will completely change what people think of as possi-
ble in data analytics. The opportunity to have a high-visibility impact on the organization will
make DataOps engineering one of the most desirable and highly compensated functions on
the data-analytics team.
INSTRUCTIONS
1. Preheat oven to 350°F
2. Soften butter to room temperature
3. Line a baking sheet with parchment paper
4. In a large bowl, cream together softened butter, brown sugar and
white sugar
5. Add vanilla extract, chocolate peanut butter and eggs and mix well
6. Stir in flour, baking soda and cocoa powder and combine until blended
7. Fold chocolate chips and peanut butter chips into batter
8. Scoop batter onto prepared baking sheet using a cookie or ice-cream scoop,
leaving enough space in-between for cookies to expand
9. Bake for 14-16 minutes
10. Transfer cookies to a wire rack to cool
Data analytics can help drive corporate growth by providing customer analytics and ulti-
mately actionable insights to the sales and marketing teams. Unfortunately, the fast-paced,
dynamic nature of sales makes it difficult for the customer-facing teams to tolerate the slow
and deliberate manner in which analytics is typically produced. In an earlier chapter, we
identified eight major challenges of data analytics:
• The Goalposts Keep Moving – Sales and marketing requirements change constantly
and the requests for new analytics never cease.
• Data Lives in Silos – Data is collected in separate operational systems and typically,
none of these systems talk to each other.
• Data Formats are not Optimized – Data in operational systems is usually not structured
in a way that lends itself to the efficient creation of analytics.
• Data Errors – Data will eventually contain errors, which can be difficult to resolve quickly.
• Bad Data Ruins Good Reports – When data errors work their way through the data
pipeline into published analytics, internal stakeholders can become dissatisfied. These
errors also harm the hard-won trust in the analytics team.
• Data Pipeline Maintenance Never Ends – Every new or updated data source, schema
enhancement, analytics improvement or other change triggers an update to the data
pipeline. These updates may be consuming 80% of your team’s time.
• Manual Process Fatigue – Manual procedures for data integration, cleansing, transfor-
mation, quality assurance and deployment of new analytics are error-prone, time-con-
suming and tedious.
• The Trap of “Hope and Heroism” – To cope with the above challenges, data profession-
als work long hours, make changes (without proper testing) and “hope” for the best, or
retreat into a posture of over-caution in which projects simply execute more slowly.
Rapid-Response Analytics – The sales and marketing team will continue to demand a
never-ending stream of new and changing requirements, but the data-analytics team will
delight your sales and marketing colleagues with rapid responses to their requests. New
analytics will inspire new questions that will, in turn, drive new requirements for analytics.
The feedback loop between analytics and sales/marketing will iterate so quickly that it will
infuse excitement and creativity throughout the organization. This will lead to breakthroughs
that vault the company to a leadership position in its markets.
Data Under Your Control – Data from all of the various internal and external sources will be
integrated into a consolidated database that is under the control of the data-analytics team.
Your team will have complete access to it at all times, and they will manage it independently
of IT, using their preferred tools. With data under its control, the data-analytics team can
modify the format and architecture of data to meet its own operational requirements.
Impeccable Data Quality – As data flows through the data-analytics pipeline, it will pass
through tests and filters that ensure that it meets quality guidelines. Data will be monitored
for anomalies 24x7, preventing bad data from ever reaching sales and marketing analytics.
You’ll have a dashboard providing visibility into your data pipeline with metrics that delineate
problematic data sources or other issues. When an issue occurs, the system alerts the appro-
priate member of your team, who can then fix the problem before it ever becomes visible.
As the manager of the data-analytics team, you’ll spend far less time in uncomfortable meet-
ings discussing issues and anomalies related to analytics.
The processes, methodologies and tools required to realize these efficiencies combine two
powerful ideas: The Customer Data Platform (CDP) and a revolutionary new approach to
analytics called DataOps. Below we’ll explain how you can implement your own Data-
Ops-powered CDP that improves both your analytics cycle time and data-pipeline quality
by 10X or more.
Figure 95: The Customer Data Platform consolidates data from operational
systems to provide a unified customer view for sales and marketing.
DATAOPS
A CDP is a step in the right direction, but it won’t provide much improvement in team
productivity if the team relies on cumbersome processes and procedures to create analytics.
DataOps is a set of methodologies and tools that will help you optimize the processes by
which analytics are developed and deployed. When implemented in concert, Agile, DevOps
and SPC take the productivity of data-analytics professionals to a whole new level. DataOps
will help you get the most out of your data,
human resources and integrated CDP database.
Every resource, technology and tool in the data-analytics organization exists to support the
data analyst’s ability to serve Sales and Marketing. This applies equally to Data Scientists,
who deliver insights directly to Sales and Marketing colleagues.
The engineer writes transforms that operate on the data lake, creating data warehouses and
data marts used by data analysts and scientists. The data engineer also implements tests that
monitor data at every point along the data-analytics pipeline assuring a high level of quality.
The data engineer lays the groundwork for other members of the team to perform analytics
without having to be operations experts. With a dedicated data engineering function,
DataOps provides a high level of service and responsiveness to the data-analytics team.
With tests monitoring each stage of the automated data pipeline, DataOps can produce a
dashboard showing the status of the pipeline. The DataOps dashboard provides a high-level
overview of the end-to-end data pipeline. Is any data failing quality tests? What are the error
rates? Which are the troublesome data sources? With this information at his or her finger-
tips, the Data Engineer can proactively improve the data pipeline to increase robustness. In
the event of a high-severity data anomaly, an alert is sent to the Data Engineer who can take
steps to protect production analytics and work to resolve the error. If the anomaly relates to
a data supplier, data engineering can work with the vendor to drive the issue to resolution.
Workarounds and data patches can be implemented as needed with information in release
notes for users. In many cases, errors are resolved without the users (or the organization’s
management) ever being aware of any problem.
DATAOPS PLATFORM
The various methodologies, processes, people (and their tools) and the CDP analytics data-
base are tied together cohesively using a technical environment called a DataOps Platform.
The DataOps Platform includes support for:
• Agile project management
• Deployment of new analytics
• Execution of the data pipeline (orchestration)
• Integration of all tools and platforms
• Management of development and production environments
• Source-code version control
• Testing and monitoring of data quality
• Data Operations reporting and dashboards
The high degree of automation offered by DataOps eliminates a great deal of work that has
traditionally been done manually. This frees up the team to create new analytics requested
by stakeholder partners.
The enterprise can also outsource the functions shown initially but insource them at a later
date. Once set up, the DataOps Platform can be easily and seamlessly transitioned to an
internal team.
Customer Data Platforms promise to drive sales and improve the customer experience by
unifying customer data from numerous disjointed operational systems. As a leader of the
analytics team, you can take control of sales and marketing data by implementing efficient
analytics-creation and deployment processes using a DataOps-powered CDP. A DataOps
platform makes analytics responsive and robust. This enables your data analysts and scien-
tists to rise above the bits and bytes of data operations and focus on new analytics that help
the organization achieve its goals.
Mixed Martial Arts (MMA) combines striking, wrestling and other fighting techniques into a
unified sport. Every martial art and fighting technique has its strengths and strategic advan-
tages. Boxing is known for punching but also provides footwork, guard position and head
movement. Wrestling relies upon takedowns. Karate features striking techniques such as
kicking. MMA is a hybrid of all of these (and many more) drawing upon each mode of combat
as needed for a given competitive situation. If an MMA athlete competed against a boxer or
karate expert, the mixed martial artist would clearly have an unfair advantage. MMA’s real
strength is its versatility and its ability to absorb new methods.
DataOps is the mixed martial arts of data analytics. It is a hybrid of Agile Development,
DevOps and the statistical process controls drawn from lean manufacturing. Like MMA, the
strength of DataOps is its readiness to evolve and incorporate new techniques that improve
the quality, reliability, and flexibility of the data analytics pipeline. DataOps gives data ana-
lytics professionals an unfair advantage over those who are doing things the old way — using
hope, heroism or just going slowly in order to cope with the rapidly changing requirements
of the competitive marketplace.
Agile development has revolutionized the speed of software development over the past
twenty years. Before Agile, development teams spent long periods of time developing
specifications that would be obsolete long before deployment. Agile breaks down software
development into small increments, which are defined and implemented quickly. This allows
a development team to become much more responsive to customer requirements and ulti-
mately accelerates time to market.
The difficulty of procuring and provisioning physical IT resources has often hampered data
analytics. In the software development domain, leading-edge companies are turning to
DevOps, which utilizes cloud resources instead of on-site servers and storage. This allows
developers to procure and provision IT resources nearly instantly and with much greater
control over the run-time environment. This improves flexibility and yields another order of
magnitude improvement in the speed of deploying features to the user base.
DataOps also incorporates lean manufacturing techniques into data analytics through the
use of statistical process controls. In manufacturing, tests are used to monitor and improve
the quality of factory-floor processes. In DataOps, tests are used to verify the inputs,
business logic, and outputs at each stage of the data analytics pipeline. The data analytics
professional adds a test each time a change is made. The suite of tests grows over time
until it eventually becomes quite substantial. The tests validate the quality and integrity of
a new release when a feature set is released to the user base. Tests allow the data analytics
professional to quickly verify a release, substantially reducing the amount of time spent on
deploying updates.
Statistical process control also monitors data, alerting the data team to an unexpected vari-
ance. This may require updates to the business logic built into the tests, or it might lead data
scientists down new paths of inquiry or experimentation. The test alerts can be a starting
point for creative discovery.
The combination of Agile development, DevOps, and statistical process controls gives Data-
Ops the strategic tools to reduce time to insight, improve the quality of analytics, promote
reuse and refactoring and lower the marginal cost of asking the next business question.
Like mixed martial arts, DataOps draws its effectiveness from an eclectic mix of tools and
techniques drawn from other fields and domains. Individually, each of these techniques is
valuable, but together they form an effective new approach, which can take your data analyt-
ics to the next level.
A global pharmaceutical giant sought to drive top-line growth by modernizing its marketing
operations. The project included a migration to Salesforce Marketing Cloud, integrations
with numerous internal and third-party data sources, and a continuous flow of data. The
plan initially required eighteen months for implementation. Using the DataKitchen DataOps
Platform, which automates deployment, controls quality and supports Agile development
of analytics, the company was able to start delivering value in six weeks and completed the
migration in about one third the time.
Figure 98: With DataKitchen, marketing automation data flows continuously from
numerous sources through the analytics pipeline with efficiency and quality.
BUSINESS IMPACT
With the DataKitchen Platform, the company was able to break the long 18-month project
into sprints and began to deliver value in six weeks. The agility of the DataKitchen DataOps
approach enabled the analytics team to rapidly respond to changing user requirements
with a continuous series of enhancements. Users no longer waited months to add new data
sources or make other changes. The team can now deploy new data sources, update sche-
mas and produce new analytics quickly and efficiently without fear of disrupting the existing
data pipelines.
DataKitchen’s lean manufacturing controls helped the team be more proactive in addressing
data quality issues. With monitoring and alerts, the team is now able to provide immediate
feedback to data suppliers about issues and can prevent bad data from reaching user analytics.
All this has led to improved insight into customers and markets and higher impact marketing
campaigns that drive revenue growth.
DataKitchen’s DataOps Platform helped this pharmaceutical company achieve its strategic
goals by improving analytics quality, responsiveness, and efficiency. DataKitchen software
provides support for improved processes, automation of tools, and agile development of new
analytics. With DataKitchen, the analytics team was able to deliver value to users in 1/10th
the time, accelerating and magnifying their impact on top line growth.
It costs between $2B and $3B to bring a new pharmaceutical to market. When a new drug is
introduced, it is already halfway through its patent life. This makes the first 6-12 months of a
pharmaceutical launch critical to a product’s lifetime revenue. The vendor needs up-to-date
information to allocate samples, plan marketing events, and monitor progress vs. goals. With
so much at stake, pharmaceutical companies like Celgene make strategic investments to maxi-
mize product adoption and adherence during the initial phase of a drug product’s life cycle.
INSTRUCTIONS
1. Preheat oven to 400°F. Microwave milk at HIGH for 1 ½ minutes. Melt butter
in a large skillet or Dutch oven over medium-low heat; whisk in flour until
smooth. Cook, whisking constantly for 1 minute.
2. Gradually whisk in warm milk and cook, whisking constantly 5 minutes or
until thickened.
3. Whisk in salt, black pepper, 1 cup shredded cheese, and if desired, red
pepper until smooth; stir in pasta. Spoon pasta mixture into a lightly greased
2-qt. baking dish; top with remaining cheese. Bake at 400°F for 20 minutes or
until golden and bubbly.
NOTES
For this recipe, it is recommended that you grate the block(s) of cheese. I combine
Sharp Cheddar and Swiss cheeses — my favorite. Pre-shredded varieties won’t
give you the same sharp bite or melt into creamy goodness over your macaroni
as smoothly as block cheese that you grate yourself. You can go reduced-fat (but
then it’s even more important to prep your own). Grating won’t take long, and the
rest of this recipe is super simple. Use a pasta that has plenty of nooks to capture
the cheese—like elbows, shells, or cavatappi. Try it just once, and I guarantee that
Classic Baked Macaroni and Cheese will become your go-to comfort food.
Chances are good that you’ll be interrupted by data errors. That is the clear message com-
municated by the respondents to a joint DataOps survey conducted by DataKitchen and
Eckerson Research.
The survey contains feedback from 300 data-analytics professionals who work for medium
to large-sized companies across multiple industries in the US and Europe. The full report will
be available in June 2019, but here are the key results. The survey sheds light upon three
issues that embody the analytics industry’s challenges. From a recent NewVantage Partners
Report, we know that despite the hype and investment, the number of companies that iden-
tify as data-driven is declining. Gartner estimated that 60% of big data projects fail. We see
results in our survey that may help explain this.
The DataOps enterprises that DataKitchen works with have fewer than one data error per year.
Only 3% of the companies surveyed approached that level of quality. Another 18% reported
1-2 errors per month. Would W. Edwards Deming have considered that an acceptable failure
rate? Would Toyota? In a manufacturing setting, each one of these errors would be the
equivalent of shipping a defective product to a customer.
A short cycle time enables an analytics team to respond quickly to requests for new ana-
lytics. When analytics are produced quickly, the data team can keep pace with the endless
stream of requests from the business unit. A short cycle time fosters close collaboration with
business users and in our experience this unlocks an organization’s creativity. However, the
cycle time in most data organizations is plagued with inefficient manual processes, bu-
reaucracy, lack of task coordination and dependencies on bottlenecks. For example, survey
respondents provided interesting feedback about the time it takes to create an analytics
development environment.
78% of respondents indicated that it takes days, weeks or months to create a development
environment. 38% of users surveyed report that it took weeks or months. That wait time
prevents the data analytics team from even beginning to work on the critical analytics that
the organization has requested. This means that their time-to-value is much slower than it
should be.
On average, how long does it take your team to create a new development
environment with the appropriate test data, servers, and tools?
DataOps offers a way to reduce errors, shorten the time it takes to set up a development
environment, and minimize analytics development cycle time. Nothing could state more
clearly why analytics organizations need a DataOps initiative now.
Cashew Cream
• 1 cup cashews soaked in water for at least 2 hours
• 2 cups veg stock
• 4 teaspoons cornstarch (can sub tapioca starch if desired)
• Drain the cashews. In a blender, combine all the ingredients and work for 2
to 5 minutes or until smooth, scraping down the sides with a rubber spatula
several times. Set aside.
Soup
• 1 Tablespoon olive oil
• 1 large onion coarsely chopped
• 2 celery ribs, chopped
• 3 cups veg broth
• 1 large carrot chopped
• 1 red pepper, diced (could sub 1 bag of frozen mixed vegetables, thawed, in a pinch)
• 1 potato, diced
• 3 ears of fresh corn (cut the kernels off and scrape the corn cobs for corn
milk to add to the soup)
• Can of corn
INSTRUCTIONS
1. Heat the oil in a 4-quart pot.
2. When hot, add the onion and celery with a pinch of salt and cook until they
start to soften.
3. Add the carrots and potatoes.
4. Add the corn and red pepper and stir-fry for 10 minutes.
5. Add 3 cups veg stock and the corn milk.
6. Bring to a boil, then lower the heat and cover; simmer 10 minutes or until the
vegetables are tender but not overcooked.
7. Stir in the Cashew Cream and stir gently for 7 minutes, until nicely thickened.
8. Blend up to half the soup to make it more liquid and add it back in.
9. Add salt and pepper to taste, depending on the type of veg stock you used.
My own adaptation of the vegan New England clam chowder recipe from Isa Does It
by Isa Chandra Moskowitz, as printed in the Boston Globe.
DataOps Energy Bytes
by Eric Estabrooks
INSTRUCTIONS
1. Add rolled oats, coconut flakes, nut butter, flax seed, honey, and vanilla to a
mixing bowl.
2. Mix well so that you can form the balls easily.
3. Add chocolate chips, if using, or other desired mix-ins.
4. Chill the mixture in the fridge for an hour so that the balls will bind together.
5. Roll the mixture into balls about 1 inch in diameter.
Statistical Process Control: https://en.wikipedia.org/wiki/Statistical_process_control
W. Edwards Deming: https://en.wikipedia.org/wiki/W._Edwards_Deming
Christopher Bergh is a Founder and Head Chef at DataKitchen where, among other activi-
ties, he is leading DataKitchen’s DataOps initiative. Chris has more than 25 years of research,
engineering, analytics, and executive management experience.
Previously, Chris was Regional Vice President in the Revenue Management Intelligence
group at Model N. Before Model N, Chris was COO of LeapFrogRx, a descriptive and pre-
dictive analytics software and service provider. Chris led the acquisition of LeapFrogRx by
Model N in January 2012. Prior to LeapFrogRx, Chris was CTO and VP of Product Manage-
ment at MarketSoft (now part of IBM), an innovative Enterprise Marketing Management
software company. Prior to that, Chris developed Microsoft Passport, the predecessor to Windows
Live ID, a distributed authentication system used by hundreds of millions of users today. He
was awarded a US Patent for his work on that project. Before joining Microsoft, he led the
technical architecture and implementation of Firefly Passport, an early leader in Internet Per-
sonalization and Privacy. Microsoft subsequently acquired Firefly. Chris led the development
of the first travel-related e-commerce web site at NetMarket. Chris began his career at the
Massachusetts Institute of Technology’s (MIT) Lincoln Laboratory and NASA Ames Research
Center. There he created software and algorithms that provided aircraft arrival optimization
assistance to Air Traffic Controllers at several major airports in the United States.
Chris served as a Peace Corps Volunteer Math Teacher in Botswana, Africa. Chris has an M.S.
from Columbia University and a B.S. from the University of Wisconsin-Madison. He is an
avid cyclist, hiker, reader, and father of two college-age children.
Gil has held various technical and leadership roles at Solid Oak Consulting, HealthEdge,
Phreesia, LeapFrogRx (purchased by Model N), Relicore (purchased by Symantec), Phase For-
ward (IPO and then purchased by Oracle), Netcentric, Sybase (purchased by SAP), and AT&T
Bell Laboratories (now Nokia Bell Labs).
Gil holds an M.S. in Computer Science from Stanford University and a Sc.B. in Applied Math-
ematics/Biology from Brown University. He has hiked all 48 of New Hampshire's 4,000-foot
peaks and is now working on the New England 67. He is the father of one high-school-age
and two college-age boys.
Eran Strod works in marketing at DataKitchen where he writes white papers, case studies
and the DataOps blog. Eran was previously Director of Marketing for Atrenne Integrated
Solutions (now Celestica) and has held product marketing and systems engineering roles at
Curtiss-Wright, Black Duck Software (now Synopsys), Mercury Systems, Motorola Computer
Group (now Artesyn), and Freescale Semiconductor (now NXP), where he was a contributing
author to the book “Network Processor Design, Issues and Practices.”
Eran began his career as a software developer at CSPi, working in the field of embedded
computing.
Eran holds a B.A. in Computer Science and Psychology from the University of California at
Santa Cruz and an M.B.A. from Northeastern University. He is the father of two children and
enjoys hiking, travel, and watching the New England Patriots.