Infrastructure as Code
THIRD EDITION
by Kief Morris
The views expressed in this work are those of the author(s) and do not represent
the publisher’s views. While the publisher and the author(s) have used good faith
efforts to ensure that the information and instructions contained in this work are
accurate, the publisher and the author(s) disclaim all responsibility for errors or
omissions, including without limitation responsibility for damages resulting
from the use of or reliance on this work. Use of the information and instructions
contained in this work is at your own risk. If any code samples or other
technology this work contains or describes is subject to open source licenses or
the intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-098-15035-8
Preface
Ten years ago, a CIO at a global bank scoffed when I suggested they look into
private cloud technologies and infrastructure automation tooling: “That kind of
thing might be fine for startups, but we’re too large and our requirements are too
complex.” Even a few years ago, many enterprises considered using public
clouds to be out of the question.
These days cloud technology is pervasive. Even the largest, most hidebound
organizations are rapidly adopting a “cloud-first” strategy. Those organizations
that find themselves unable to consider public clouds are adopting dynamically
provisioned infrastructure platforms in their data centers.1 The capabilities that
these platforms offer are evolving and improving so quickly that it’s hard to
ignore them without risking obsolescence.
In Chapter 1 I use the terms “Iron Age” and “Cloud Age” (“From the Iron Age
to the Cloud Age”) to describe the different philosophies that apply to managing
physical infrastructure, where mistakes are slow and costly to correct, and
managing virtual infrastructure, where mistakes can be quickly detected and
fixed.
Infrastructure as Code tools create the opportunity to work in ways that help you
to deliver changes more frequently, quickly, and reliably, improving the overall
quality of your systems. But the benefits don’t come from the tools themselves.
They come from how you use them. The trick is to leverage the technology to
embed quality, reliability, and compliance into the process of making changes.
The experience of writing the first edition was amazing. It gave me the
opportunity to travel and to talk with people around the world about their own
experiences. These conversations gave me new insights and exposed me to new
challenges. I learned that the value of writing a book, speaking at conferences,
and consulting with clients is that it fosters conversations. As an industry, we are
still gathering, sharing, and evolving our ideas for managing Infrastructure as
Code.
What’s New and Different in This Edition
Things have moved along since the first edition came out in June 2016. That
edition was subtitled “Managing Servers in the Cloud,” which reflected the fact
that most infrastructure automation until that point had been focused on
configuring servers. Since then, containers and clusters have become a much
bigger deal, and the infrastructure action has moved to managing collections of
infrastructure resources provisioned from cloud platforms—what I call stacks in
this book.
As a result, this edition involves more coverage of building stacks, which is the
remit of tools like CloudFormation and Terraform. The view I’ve taken is that
we use stack management tools to assemble collections of infrastructure that
provide application runtime environments. Those runtime environments may
include servers, clusters, and serverless execution environments.
I’ve changed quite a bit based on what I’ve learned about the evolving
challenges and needs of teams building infrastructure. As I’ve already touched
on in this preface, I see making it safe and easy to change infrastructure as the
key benefit of Infrastructure as Code. I believe people underestimate the
importance of this by thinking that infrastructure is something you build and
forget.
But too many teams I work with struggle to meet the needs of their organizations;
they are not able to expand and scale quickly enough, support the pace of
software delivery, or provide the reliability and security expected. And when we
dig into the details of their challenges, we find that they are overwhelmed by the
need to update, fix, and improve their systems. So I've doubled down on this as
the core theme of this book.
This edition introduces three core practices for using Infrastructure as Code to
make changes safely and easily:
* Define everything as code
* Continuously test and deliver all work in progress
* Build small, simple pieces that you can change independently
These three practices are mutually reinforcing. Code is easy to track, version,
and deliver across the stages of a change management process. It’s easier to
continuously test smaller pieces. Continuously testing each piece on its own
forces you to keep a loosely coupled design.
These practices and the details of how to apply them are familiar from the world
of software development. I drew on Agile software engineering and delivery
practices for the first edition of the book. For this edition, I’ve also drawn on
rules and practices for effective design.
In the past few years, I’ve seen teams struggle with larger and more complicated
infrastructure systems, and I’ve seen the benefits of applying lessons learned in
software design patterns and principles, so I’ve included several chapters in this
book on how to do this.
I’ve also seen that organizing and working with infrastructure code is difficult
for many teams, so I’ve addressed various pain points. I describe how to keep
codebases well organized, how to provide development and test instances for
infrastructure, and how to manage the collaboration of multiple people,
including those responsible for governance.
What’s Next
I don’t believe we’ve matured as an industry in how we manage infrastructure.
I’m hoping this book gives a decent view of what teams are finding effective
these days. And a bit of aspiration of what we can do better.
I fully expect that in another five years the toolchains and approaches will
evolve. We could see more general-purpose languages used to build libraries,
and we could be dynamically generating infrastructure rather than defining the
static details of environments at a low level. We certainly need to get better at
managing changes to live infrastructure. Most teams I know are scared when
applying code to live infrastructure. (One team referred to Terraform as
“Terrorform,” but users of other tools all feel this way.)
What This Book Is and Isn’t
The thesis of this book is that exploring different ways of using tools to
implement infrastructure can help us to improve the quality of services we
provide. We aim to use speed and frequency of delivery to improve the
reliability and quality of what we deliver.
So the focus of this book is less on specific tools, and more on how to use them.
You won’t find code examples for real-world tools or clouds. Tools change too
quickly in this field to keep code examples accurate, but the advice in this book
should age more slowly, and be applicable across tools. Instead, I write
pseudocode examples for fictional tools to illustrate concepts. See the book’s
companion website for references to example projects and code.
This book won’t guide you on how to use the Linux operating system,
Kubernetes cluster configuration, or network routing. The scope of this book
does include ways to provision infrastructure resources to create these things,
and how to use code to deliver them. I share different cluster topology patterns
and approaches for defining and managing clusters as code. I describe patterns
for provisioning, configuring, and changing server instances using code.
You should supplement the practices in this book with resources on the specific
operating systems, clustering technologies, and cloud platforms. Again, this
book explains approaches for using these tools and technologies that are relevant
regardless of the particular tool.
This book is also light on operability topics like monitoring and observability,
log aggregation, identity management, and other concerns that you need to
support services in a cloud environment. What’s in here should help you to
manage the infrastructure needed for these services as code, but the details of the
specific services are, again, something you’ll find in more specific resources.
Infrastructure as Code has grown along with the DevOps movement. Andrew
Clay-Shafer and Patrick Debois triggered the DevOps movement with a talk at
the Agile 2008 conference. The first uses I’ve found for the term “Infrastructure
as Code” are from a talk called “Agile Infrastructure” that Clay-Shafer gave at
the Velocity conference in 2009, and an article John Willis wrote summarizing
the talk. Adam Jacob, who cofounded Chef, and Luke Kanies, founder of
Puppet, were also using the phrase around this time.
Who This Book Is For
This book is for people who are involved in providing and using infrastructure to
deliver and run software. You may have a background in systems and
infrastructure, or in software development and delivery. Your role may be
engineering, testing, architecture, or management. I’m assuming you have some
exposure to cloud or virtualized infrastructure and tools for automating
infrastructure using code.
Readers new to Infrastructure as Code should find this book a good introduction
to the topic, although you will get the most out of it if you are familiar with how
infrastructure cloud platforms work, and the basics of at least one infrastructure
coding tool.
Those who have more experience working with these tools should find a mixture
of familiar and new concepts and approaches. The content should create a
common language and articulate challenges and solutions in ways that
experienced practitioners and teams find useful.
Principle
A principle is a rule that helps you to choose between potential solutions.
Practice
A practice is a way of implementing something. A given practice is not
always the only way to do something, and may not even be the best way to do
it for a particular situation. You should use principles to guide you in
choosing the most appropriate practice for a given situation.
Pattern
A pattern is a potential solution to a problem. It’s very similar to a practice in
that different patterns may be more effective in different contexts. Each
pattern is described in a format that should help you to evaluate how relevant
it is for your problem.
Antipattern
An antipattern is a potential solution that you should avoid in most situations.
Usually, it’s either something that seems like a good idea or else it’s
something that you fall into doing without realizing it.
Folks in our industry love to talk about “best practices.” The problem with this
term is that it often leads people to think there is only one solution to a problem,
no matter what the context.
I prefer to describe practices and patterns, and note when they are useful and
what their limitations are. I do describe some of these as being more effective or
more appropriate, but I try to be open to alternatives. For practices that I believe
are less effective, I hope I explain why I think this.
ShopSpinner runs on FCS, the Fictional Cloud Service, a public IaaS provider
with services that include FSI (Fictional Server Images) and FKS (Fictional
Kubernetes Service). It uses the Stackmaker tool—an analog of Terraform,
CloudFormation, and Pulumi—to define and manage infrastructure on its cloud.
It configures servers with the Servermaker tool, which is much like Ansible,
Chef, or Puppet.
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width bold
Shows commands or other text that should be typed literally by the user.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and
insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and
expertise through books, articles, and our online learning platform. O’Reilly’s
online learning platform gives you on-demand access to live training courses,
in-depth learning paths, interactive coding environments, and a vast collection of
text and video from O’Reilly and 200+ other publishers. For more information,
visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
Sebastopol, CA 95472
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any
additional information. You can access this page at https://oreil.ly/infra-as-code-2e.
For news and information about our books and courses, visit http://oreilly.com.
2 The research published by DORA in the State of DevOps Report finds that
heavyweight change-management processes correlate to poor performance on
change failure rates and other measures of software delivery effectiveness.
With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 1st chapter of the final book. Please note that the GitHub repo
will be made active later on.
If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.
If you work in a team that builds and runs IT infrastructure, then cloud and
infrastructure automation tools should help you deliver more value in less time,
and to do it more reliably. In practice, however, they drive ever-increasing size,
complexity, and diversity of things to manage.
These technologies have become especially relevant over the past decade as
organizations brought digital technology deeper into the core of what they do.
Previously, many leaders had treated the IT function as an unfortunate
distraction that should be outsourced and ignored. But digitally sophisticated
competitors, users, and staff drove more processes and products online and
created entirely new categories of services like streaming media, social media,
and machine learning.
Cloud and automation have helped by making it far easier for organizations to
add and change digital services. But many teams have struggled to manage the
proliferation of cloud-hosted products, applications, services, and platforms. As
one of my clients told me, “Moving from the data center, where we were limited
to the capacity of our hardware, to the cloud, where capacity is effectively
unlimited, knocked down the walls that kept our tire fire contained.”1
Using code to define and build infrastructure creates the opportunity to bring a
wide set of tools, practices, and patterns to bear on the problem of how to design
and implement systems. This book explores different practices and patterns for
Infrastructure as Code. I’ll describe the problems that Infrastructure as Code can
help with, challenges that come from different approaches to using infrastructure
code, and patterns and practices that have proven to be useful.
Infrastructure as Code
A literal definition of Infrastructure as Code is the practice of provisioning and
managing infrastructure using code, as opposed to doing it interactively, or with
non-code automation tools. By “interactively”, I mean using a command-line
tool or GUI interface to carry out tasks. The alternative is writing code that can
then be distributed and applied by automated systems.
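For example, instead of clicking through a web console or running ad hoc commands to create a storage bucket, you can capture the definition in a file that a tool applies for you. Here is a minimal sketch using AWS CDK in TypeScript (the stack and bucket names are illustrative, not taken from any real project):

    import { App, Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import { Construct } from 'constructs';

    // Declares what should exist; the tooling works out how to create or update it.
    class AssetStorageStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        new s3.Bucket(this, 'ProductImages', {
          versioned: true,                     // keep a history of every object
          removalPolicy: RemovalPolicy.RETAIN, // don't delete the data when the stack is destroyed
        });
      }
    }

    const app = new App();
    new AssetStorageStack(app, 'asset-storage');

Because the definition is a file, it can be versioned, reviewed, and applied by an automated system rather than typed by hand.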
The way I define Infrastructure as Code is about more than the mechanics of
how infrastructure is defined and provisioned. Infrastructure as Code is about
applying the principles, practices, and tools of software engineering to
infrastructure.
However, these models were designed for the Iron Age when changes were
slow, which made mistakes difficult to correct. It seemed reasonable to add a
week to a task that would take a week to implement when it would take a week
or more to correct a mistake later. Adding weeks to a task that takes less than an
hour to implement, and a few minutes to correct, destroys the benefits of cloud
age technology.
What’s more, research2 suggests that these heavyweight processes were never
very effective in preventing errors in the first place. In fact, they can make things
worse by dividing knowledge and accountability across silos and long time
periods.
Fortunately, the emergence of cloud age technologies has coincided with the
growth of what I’d call cloud age approaches to work, including lean, agile, and
DevOps. These approaches encourage close collaboration, short feedback loops,
and a minimalist approach to technical implementation. Automation is leveraged
to fundamentally shift thinking about change and risk, which results not only in
faster delivery but also higher quality (Table 1-2).
Table 1-2. Ways of working in the Iron Age and the Cloud Age
This ability to leverage speed of change to improve quality starts with cloud
technology, which creates the capability to provision and change infrastructure
on demand. We need automation to use this capability. So another definition of
Infrastructure as Code is a Cloud Age approach to automating cloud
infrastructure in a way that embraces continuous change to achieve high
reliability and quality.
People define DevOps in different ways. The fundamental idea of DevOps is collaboration across all of the
people involved in building and running software. This includes not only developers and operations people,
but also testers, security specialists, architects, and even managers. There is no one way to implement
DevOps.
Many people look at DevOps and only notice the technology that people use to collaborate across software
delivery. All too often this leads to reducing “DevOps” to tooling. I’ve seen “DevOps” defined as running
an application deployment tool (usually Jenkins), often in a way that increases barriers across the software
delivery path.
DevOps is first and foremost about people, culture, and ways of working. Tools and practices like
Infrastructure as Code are valuable to the extent that they’re used to bridge gaps and improve collaboration.
The mid-2010s could be considered the “Shadow Age” of IT. Cloud, DevOps,
Continuous Delivery, and Infrastructure as Code were mostly used either by
startups or by separate digital departments of larger organizations. These
departments were usually set up outside the remit of the existing organization,
partly to protect them from the cultural norms and formal policies of the main
organization, which people sometimes call “antibodies”. In some cases they
were created quietly within existing departments, as “shadow IT”.3
The mantra of the shadow age was “move fast and break things.”4 Casting aside
the shackles of iron age governance was seen as the key to explosive growth. In
the view of digital hipsters, it was time to leave the crusty old-timers to their
CAB5 meetings, mainframes, and bankruptcies (“Say hello to Blockbuster and
Kodak!”).
As the decade wore on, and digital businesses overtook slower businesses in
more and more markets, digital technologies and approaches were pulled closer
to the center of even older businesses. Digital departments were assimilated, and
boards asked to see strategies to migrate core business systems into the cloud.
This trend accelerated when the Covid pandemic led to a dramatic rise in
consumers and workers moving to online services. Many organizations found
that their digital services were not ready for the unexpected level of demand they
were faced with. As a result, they increased their investment and efforts in cloud
technologies.
I call this period where cloud technology has been shifting from the periphery of
business to the center the “Age of Sprawl”. Although breaking things had gone
out of fashion, moving fast was still the priority. As a result of the haste to adopt
new technologies and practices, larger organizations have seen a proliferation of
initiatives. A larger organization typically has multiple, disconnected teams
building “platforms” using various technologies, multiple cloud vendors, and
varying levels of maturity and quality.
The tidy linear narrative I describe as “the path to the cloud age” is, as with any tidy linear narrative,
simplistic. Many people and organizations have experienced the trends it describes. But none of its “ages”
have completely ended, and many of the drivers of different ways of thinking and working are still valid.
It’s important to recognize that contexts differ. A Silicon Valley startup has different needs and constraints
than a transnational financial institution, and new technologies and methodologies create opportunities to
handle old risks and new opportunities in different ways. The path to the cloud age is uneven and far from
over, but understanding how it has unfolded so far can help us navigate what comes next.
The gap is not one-sided. Engineering folks tend to focus on implementing the
solutions that seem obvious to them, sometimes assuming that it doesn’t make
much difference what will run on it. One example of how this turns out is a
company whose engineers built a multi-region cloud hosting solution with iron-
clad separation between regions. The team wanted to make sure that user data
would be segregated to avoid conflicts with different privacy regulations, so this
requirement was baked deep into the architecture of their systems.
However, because neither the product nor engineering teams believed they
needed close communication during development, the service was nearly ready
for production rollout when it surfaced that the commercial strategy assumed
that users would be able to use the service while traveling and working in
different countries. It took considerable effort, expense, and delay to rearchitect
the system to ensure that privacy laws could be respected between regions while
giving users international roaming access.
So although infrastructure can seem distant from strategic goals discussed in the
boardroom, it’s essential to make sure everyone from strategic leaders to
engineering teams understands how they are related. Table 1-3 describes a few
common organizational concerns where infrastructure architecture can make a
considerable difference in either enabling success or creating drag.
Facilitate growth
Why
* Deliver value to users quickly and reliably
* Deliver new products and features
Outcomes
* High performance on the four key metrics (“The Four Key Metrics”)
* Low effort and dependency on central teams
* Can expand products to new regions and for new customers quickly and easily, with costs that scale less than linearly
* Systems are upgraded continuously, with low effort. The number of versions of any given system is minimized. Redundant systems are retired quickly.
How
* Reduce time and effort to provision existing products into new regions and for new customers
* Mechanisms in place to maintain and update multiple product instances quickly, easily, and with minimal effort
* Automated systems for testing and delivering patches, fixes, and minor upgrades across the estate
* Capability to add new versions and systems to delivery systems
Throughout this book, I’ll use the example of a fictitious company called
“ClotheSpin”, an online fashion retailer, to illustrate the concepts I discuss.
“Introduction to ClotheSpin” gives a high-level view of the company’s strategy.
INTRODUCTION TO CLOTHESPIN
ClotheSpin is an online fashion retailer that was founded in the dot-com days. It
is well-established in the UK and Germany and has recently expanded to
multiple countries in Europe, the Americas, and Asia. They have just launched a
new storefront called Hipsteroo in the UK and the US, to reach a younger
market, and want to expand it globally as well. The company has also
determined that they need to be able to add new services like clothing rental to
remain competitive. Last year they acquired a company called BrainZ, which has
a machine-learning system for retail product merchandising and
recommendations, which they want to integrate with their online stores.
The Technology Situation
The main ClotheSpin online storefront was originally built on data center
infrastructure, running J2EE on Solaris servers; most of its systems were
migrated to Linux a few years later. In the mid-2010s the company began
migrating ClotheSpin onto AWS, initially with CloudFormation, later using
Terraform. A separate initiative re-platformed the software to a containerized
architecture. Much of the front-end experience is backed by containerized
software running on AWS ECS, although some services run on J2EE servers
deployed on virtual machine instances. There are also backend systems for
logistics and billing that still run in the data center, so ClotheSpin is a hybrid
cloud architecture.
When the Hipsteroo storefront was launched, the company decided to build it as
a greenfield project separate from the ClotheSpin systems, because this was the
fastest path to launching it. Hipsteroo is a purely cloud-native architecture
including EKS and Lambda, on infrastructure built with AWS CDK in a mix of
JavaScript and TypeScript.
The BrainZ machine learning systems run on Google Cloud, mostly built using
Terraform.
Until recently, growth was the ClotheSpin board’s primary goal. Their strategy
was to spend to grow and worry about efficiency later. Later has come. The
economic situation has changed, and the cost to run and develop ClotheSpin’s
existing systems is not sustainable. However, the company can’t afford to miss
opportunities to grow market share and enter new markets. So they need to find
efficient ways to continue to grow their footprint. An added factor is that some
of the systems in place now have issues with performance and reliability, and
these need to be addressed to rebuild the confidence of customers and partners.
The board has set out the following goals:
* Grow our customer base, revenue, and profits by bringing new storefronts and services to market
* Grow our customer base, revenue, and profits by expanding our storefronts to new regions
* Retain and grow our customer base by continuously improving our existing storefronts and services
* Improve our profitability and service quality by rationalizing our systems
Figure 1-2 shows an example of how organizational goals, such as the ones
described in “Introduction to ClotheSpin”, drive goals for an engineering
organization, which in turn drive goals for the infrastructure architecture.
Infrastructure as Code can be used to ensure environments are consistent across
the path to production as well as across multiple production instances (I’ll talk
about different types of environments in Chapter 15).
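As a rough sketch of what this looks like in practice (using AWS CDK in TypeScript, with illustrative names and parameters), a single stack definition can be instantiated for each environment, so that test and production differ only in the values you deliberately pass in:

    import { App, Stack, StackProps } from 'aws-cdk-lib';
    import { Construct } from 'constructs';

    interface StorefrontProps extends StackProps {
      instanceCount: number; // an example of a value that varies per environment
    }

    // One definition of the environment, used to build every instance of it.
    class StorefrontStack extends Stack {
      constructor(scope: Construct, id: string, props: StorefrontProps) {
        super(scope, id, props);
        // ... networking, cluster, and services would be declared here ...
      }
    }

    const app = new App();
    new StorefrontStack(app, 'storefront-test', { instanceCount: 1 });
    new StorefrontStack(app, 'storefront-production', { instanceCount: 4 });

Any change to the definition flows through the test instance before it reaches production, which is what makes the environments trustworthy copies of each other.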
Operations teams know that the biggest risk to a production system is making a
change to it.8 The Iron Age approach to managing this risk (as I mentioned
earlier in “From the Iron Age to the Cloud Age”) is to add heavyweight
processes to make changes more slowly and carefully. However, adding barriers
to making changes adds barriers to fixing and improving the quality of a system.
Research from the Accelerate State of DevOps Report backs this up. Making
changes frequently and reliably is correlated to organizational success.9
We want to think that we build an environment, and then it’s “done.” In this
view, we don’t make many changes, so automating changes, especially testing,
is a waste of time.
In reality, very few systems stop changing, at least not before they are retired.
Some people assume that a heavy pace of change is temporary. Others create
heavyweight change request processes to discourage people from asking for
changes. These people are in denial. Most teams that are supporting actively
used systems handle a continuous stream of changes.
A fundamental truth of the Cloud Age is: Stability comes from making changes.
Unpatched systems are not stable; they are vulnerable. If you can’t fix issues as
soon as you discover them, your system is not stable. If you can’t recover from
failure quickly, your system is not stable. If the changes you do make involve
considerable downtime, your system is not stable. If changes frequently fail,
your system is not stable.
Getting started with Infrastructure as Code is a steep curve. Setting up the tools,
services, and working practices to automate infrastructure delivery is loads of
work, especially if you’re also adopting a new infrastructure platform. The value
of this work is hard to demonstrate before you start building and deploying
services with it. Even then, the value may not be apparent to people who don’t
work directly with the infrastructure.
Automation should enable faster delivery for new systems as well as existing
systems. Implementing automation after most of the work has been done
sacrifices many of the benefits.
Automation makes it easier to write automated tests for what you build. And
it makes it easier to quickly fix and rebuild when you find problems. Doing
this as a part of the build process helps you to build a more robust
infrastructure.
Automating an existing system is very hard. Automation is part of a system’s
design and implementation. To add automation to a system built without it,
you need to change the design and implementation of that system
significantly. This is also true for automated testing and deployment.
The same is true when you build a system as an experiment. Once you have a
proof of concept up and running, there is pressure to move on to the next thing,
rather than to go back and build it right. And in truth, automation should be a
part of the experiment. If you intend to use automation to manage your
infrastructure, you need to understand how this will work, so it should be part of
your proof of concept.
The solution is to build your system incrementally, automating as you go. Ensure
you deliver a steady stream of value, while also building the capability to do so
continuously.
It’s natural to think that you can only move fast by skimping on quality and that
you can only get quality by moving slowly. You might see this as a continuum,
as shown in Figure 1-3.
Figure 1-3. The idea that speed and quality are opposite ends of a spectrum is a false dichotomy
In short, organizations can’t choose between being good at change or being good
at stability. They tend to either be good at both or bad at both.
I prefer to see quality and speed as a quadrant rather than a continuum, as shown
in Figure 1-4.
Figure 1-4. Speed and quality map to quadrants
This quadrant model shows how trying to choose between speed and quality
leads to doing poorly at both:
The upper-right quadrant is the goal of modern approaches like Lean, Agile, and
DevOps. Being able to move quickly while also maintaining a high level of
quality may seem like a fantasy. However, the Accelerate research proves that
many teams do achieve this. So this quadrant is where you find “high
performers.”
Deployment frequency
How often changes are deployed to production systems
Lead time for changes
How long it takes for a change to go from being committed to running in production
Change failure rate
The proportion of changes deployed to production that cause a failure
Time to restore service
How long it takes to restore service when a change or failure causes an outage
I’ll summarize each of these now, to set the context for further discussion. Later,
I’ll devote a chapter to the principles for implementing each of these practices.
Defining all your stuff “as code” is a core practice for making changes rapidly
and reliably. There are a few reasons why this helps:
Reusability
If you define a thing as code, you can create many instances of it. You can
repair and rebuild your things quickly, and other people can build identical
instances of the thing.
Consistency
Things built from code are built the same way every time. This makes system
behavior predictable, makes testing more reliable, and enables continuous
testing and delivery.
Visibility
Everyone can see how the thing is built by looking at the code. People can
review the code and suggest improvements. They can learn things to use in
other code, gain insight to use when troubleshooting, and review and audit for
compliance.
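A small sketch shows what this looks like in code. The construct below (written with AWS CDK in TypeScript; the name and settings are illustrative) can be reused to create any number of identical networks, and anyone can read it to see exactly how those networks are built:

    import { Construct } from 'constructs';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';

    // A reusable piece of infrastructure: every instance is built the same way,
    // and the code makes the design decisions visible and reviewable.
    export class StandardNetwork extends Construct {
      readonly vpc: ec2.Vpc;

      constructor(scope: Construct, id: string) {
        super(scope, id);
        this.vpc = new ec2.Vpc(this, 'Network', {
          maxAzs: 2,      // spread across two availability zones
          natGateways: 1, // a deliberate, visible cost/availability trade-off
        });
      }
    }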
Effective infrastructure teams are rigorous about testing. They use automation to
deploy and test each component of their system and integrate all the work
everyone has in progress. They test as they work, rather than waiting until
they’ve finished.
The idea is to build quality in rather than trying to test quality in.
One part of this that people often overlook is that it involves integrating and
testing all work in progress. On many teams, people work on code in separate
branches and only integrate when they finish. According to the Accelerate
research, however, teams get better results when everyone integrates their work
at least daily. CI involves merging and testing everyone’s code throughout
development. CD takes this further, keeping the merged code always production-
ready.
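To make this concrete, here is a sketch of the kind of check a team might run on every commit, using the CDK assertions library with Jest and the StandardNetwork construct sketched earlier (the module path is assumed for this sketch). It synthesizes the code and inspects the result in seconds, without touching a cloud account:

    import { Stack } from 'aws-cdk-lib';
    import { Template } from 'aws-cdk-lib/assertions';
    import { StandardNetwork } from '../lib/standard-network'; // path assumed for this sketch

    test('standard network spans two availability zones', () => {
      const stack = new Stack();
      new StandardNetwork(stack, 'Network');

      // Synthesize the stack and inspect the generated template.
      const template = Template.fromStack(stack);

      // Two AZs, each with a public and a private subnet, means four subnets.
      template.resourceCountIs('AWS::EC2::Subnet', 4);
    });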
I’ll go into more detail on how to continuously test and deliver infrastructure
code in Chapter 7.
Teams struggle when their systems are large and tightly coupled. The larger a
system is, the harder it is to change, and the easier it is to break.
When you look at the codebase of a high-performing team, you see the
difference. The system is composed of small, simple pieces. Each piece is easy
to understand and has clearly defined interfaces. The team can easily change
each component on its own and can deploy and test each component in isolation.
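As a simple illustration of what loose coupling looks like in infrastructure code (again sketched with AWS CDK in TypeScript, with illustrative names), the network and the application environment below are separate pieces. The application stack depends only on the network's published interface, its VPC, so each piece can be changed, deployed, and tested on its own:

    import { App, Stack, StackProps } from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import { Construct } from 'constructs';

    // One small piece: the network, exposing an explicit interface (the vpc property).
    class NetworkStack extends Stack {
      readonly vpc: ec2.Vpc;

      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);
        this.vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
      }
    }

    interface AppProps extends StackProps {
      vpc: ec2.IVpc; // the only thing the application stack knows about the network
    }

    // Another small piece: it consumes the network through that interface,
    // not through knowledge of how the network is implemented.
    class AppStack extends Stack {
      constructor(scope: Construct, id: string, props: AppProps) {
        super(scope, id, props);
        new ec2.SecurityGroup(this, 'AppSecurityGroup', { vpc: props.vpc });
      }
    }

    const app = new App();
    const network = new NetworkStack(app, 'network');
    new AppStack(app, 'app', { vpc: network.vpc });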
I dig more deeply into implementation principles for this core practice in
Chapter 5.
Conclusion
Traditional, Iron Age approaches to software and system design were based on
the belief that, if you are sufficiently skilled, knowledgeable, and diligent, you
can come up with the correct design for the system’s needs. In reality, you won’t
know what the correct design is until your system is being used. Worse, changes
to your organization’s situation, environment, and opportunities mean the
system’s needs are a moving target. So even if you do find and implement the
correct design, it won’t remain correct for very long.
The only thing you know for sure when designing a system is that you will need
to change it when it is in use, not once, but continuously until the system is no
longer needed. The essence of Cloud Age, Lean, Agile, DevOps, and similar
philosophies is designing and implementing systems so that you can
continuously learn and evolve your systems.
With infrastructure, this means exploiting speed to improve quality and building
quality in to gain speed. Automating your infrastructure takes work, especially
when you’re learning how to do it. But doing that work helps to ensure you can
keep your system relevant and useful throughout its lifespan. The next chapter
will discuss more specific principles for designing and building cloud
infrastructure using code.
4 Facebook CEO Mark Zuckerberg said, “Unless you are breaking stuff, you are
not moving fast enough.” https://www.businessinsider.com/mark-zuckerberg-2010-10
7 The Cloud Native Landscape diagram is a popular one for illustrating how
many products, tools, and projects are available for building platforms. One of
my favorite memes extends this into a CNCF conspiracy chart
8 According to Gene Kim, George Spafford, and Kevin Behr in The Visible
Ops Handbook (IT Process Institute), changes cause 80% of unplanned outages.
9 Reports from the Accelerate research are available in the annual State of
DevOps Report, and in the book Accelerate, by Dr. Nicole Forsgren, Jez Humble,
and Gene Kim (IT Revolution Press).
10 Accelerate, by Dr. Nicole Forsgren, Jez Humble, and Gene Kim (IT Revolution
Press).
13 DORA, now part of Google, is the team behind the Accelerate State of
DevOps Report.
Chapter 2. Principles of Cloud
Infrastructure
With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 2nd chapter of the final book. Please note that the GitHub repo
will be made active later on.
If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.
The rise of cloud and automation has forced us to change how we think about,
design, and use computing resources.
Cloud Native takes this decoupling further, moving away from modeling
resources based on hardware concepts like servers, hard drives, and firewalls.
Instead, infrastructure is defined around concepts driven by application
architecture. Containers strip down the concept of a virtual server to only those
things that are specific to an application process. Serverless removes even that,
providing the bare minimum that an application needs from its environment to
run. A service mesh can abstract various aspects of interaction and integration
between application processes, including routing, authentication, and service
discovery.
You’ll need to take parts of your system offline for reasons other than unplanned
failures. You’ll need to patch and upgrade the system software. You’ll resize,
redistribute the load, and troubleshoot problems.
With static infrastructure, doing these things means taking systems offline. But
in many modern organizations, taking systems offline means taking the business
offline.
So you can’t treat the infrastructure your system runs on as a stable foundation.
Instead, you must design for uninterrupted service when underlying resources
change.2
Effortlessly means that there is no need to make any decisions about how to
rebuild things. You should define things such as configuration settings, software
versions, and dependencies as code. Rebuilding is then a simple “yes/no”
decision.
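For instance, a server definition might pin the image version, instance size, and network placement in code (sketched here with AWS CDK in TypeScript; the image ID is a placeholder), so that rebuilding it never depends on anyone remembering how it was set up:

    import { App, Stack, StackProps } from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import { Construct } from 'constructs';

    class WebServerStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

        new ec2.Instance(this, 'WebServer', {
          vpc,
          // Size and server image version are fixed in code, so rebuilding
          // reproduces exactly the same server.
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.SMALL),
          machineImage: ec2.MachineImage.genericLinux({
            'eu-west-1': 'ami-0123456789abcdef0', // placeholder ID for a specific image version
          }),
        });
      }
    }

    const app = new App();
    new WebServerStack(app, 'web-server', { env: { region: 'eu-west-1' } });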
Not only does reproducibility make it easy to recover a failed system, but it also
helps you to:
Of course, a running system generates data, content, and logs, which you can’t
define ahead of time. You need to identify these and find ways to keep them as a
part of your replication strategy. Doing this might be as simple as automatically
copying or streaming data to a backup and then restoring it when rebuilding. I’ll
describe options for doing this in Chapter 19.
The ability to effortlessly build and rebuild any part of the infrastructure is
powerful. It reduces the risk and fear of making changes, and you can handle
failures with confidence. You can rapidly provision new services and
environments.
Pitfall: Snowflake Systems
A snowflake is an instance of a system or part of a system that is difficult to
rebuild. It may also be an environment that should be similar to other
environments—such as a staging environment—but is different in ways that its
team doesn’t fully understand.
People don’t set out to build snowflake systems; they are a natural occurrence.
The first time you build something with a new tool you learn lessons along the
way, which involves making mistakes. But if people are relying on the thing
you’ve built, you may not have time to go back and rebuild or improve it using
what you learned. Improving what you’ve built is especially hard if you don’t
have the mechanisms and practices that make it easy and safe to change.
You know a system is a snowflake when you’re not confident you can safely
change or upgrade it. Worse, if the system does break, it’s hard to fix it. So
people avoid making changes to the system, leaving it out of date, unpatched,
and maybe even partly broken.
Snowflake systems create risk and waste the time of the teams that manage
them. It is almost always worth the effort to replace them with reproducible
systems. If a snowflake system isn’t worth improving, then it may not be worth
keeping at all.
The best way to replace a snowflake system is to write code that can replicate
the system, running the new system in parallel until it’s ready. Use automated
tests and pipelines to prove that it is correct and reproducible and that you can
change it easily.
Note that it’s possible to create snowflake systems using infrastructure code, as
I’ll explain in Chapter 15.
“Treat your servers like cattle, not pets,” is a popular expression about
disposability.3 I miss giving fun names to each new server I create. But I don’t
miss having to tweak and coddle every server in our estate by hand.
If your systems are dynamic, then you need to use tools that can cope with this.
For example, your monitoring should not raise an alert every time you rebuild
part of your system. However, it should raise a warning if something gets into a
loop rebuilding itself.
People can take a while to get used to ephemeral infrastructure. One team I
worked with automated its infrastructure with VMware and Chef. The team
deleted and replaced virtual machines as needed.
A new developer on the team needed a web server to host files to share with
teammates, so he manually installed an HTTP server on a development server
and put the files there. A few days later, I rebuilt the VM, and his web server
disappeared.
After some confusion, the developer understood why this had happened. He
added his web server to the Chef code and persisted his files to the SAN. The
team now had a reliable file-sharing service.
To make this work, you must apply any change you make to all instances of the
component. Otherwise, you create configuration drift.
LIGHTWEIGHT GOVERNANCE
Modern, digital organizations are learning the value of Lightweight Governance in IT to balance autonomy
and centralized control. This is a key element of the EDGE model for agile organizations. For more on this,
see the book, EDGE: Value-Driven Digital Transformation by Jim Highsmith, Linda Luu, and David
Robinson (Addison-Wesley Professional), or Jonny LeRoy’s talk, “The Goldilocks Zone of Lightweight
Architectural Governance”. Andrew Harmel-Law describes how to do this in “Scaling the Practice of
Architecture, Conversationally”.
Configuration Drift
Configuration drift is variation that happens over time across once identical
systems. Figure 2-1 shows this. Making changes manually is a common cause of
inconsistencies. It can also happen if you use automation tools to make ad hoc
changes to only some of the instances, or if you create separate branches or
copies of the infrastructure code for different instances. Configuration drift
makes it harder to maintain consistent automation.
Figure 2-1. Configuration drift is when instances of the same thing become different over time
As an example of how infrastructure can diverge over time, consider the journey
of our example company, ClotheSpin (as introduced in “Introduction to
ClotheSpin”). ClotheSpin runs a separate instance of its storefront in each
region, as a set of microservices deployed on an AWS EKS cluster, along with
an API gateway, database instances, and message queues.
Over time, the Terraform code has become increasingly different between
regions. When a change is needed, such as fixing a configuration issue or adding
a new feature, the team needs to manually edit the code for each region. They
test the change in a separate staging instance for each region because the
differences mean that it might work correctly in one region, but break the system
in another one.
It can take a few weeks to apply even a minor change to all of the region’s
infrastructure. In some cases, the team doesn’t bother to make a change to all of
the regions, if they think it may not be relevant. This increases the differences
between the regions, making it even more likely that a later change can’t be
easily applied everywhere.
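The way out of this trap is to have a single definition of the regional infrastructure and create every region from it, with the differences limited to explicit parameters. A sketch of the idea (using AWS CDK in TypeScript rather than the team's Terraform, with made-up parameters) might look like this:

    import { App, Stack, StackProps } from 'aws-cdk-lib';
    import { Construct } from 'constructs';

    // The only things that may differ between regions are declared here, explicitly.
    const regions: Record<string, { apiInstances: number }> = {
      'eu-west-1': { apiInstances: 6 },
      'us-east-1': { apiInstances: 4 },
      'ap-southeast-1': { apiInstances: 2 },
    };

    interface RegionalStorefrontProps extends StackProps {
      apiInstances: number;
    }

    class RegionalStorefrontStack extends Stack {
      constructor(scope: Construct, id: string, props: RegionalStorefrontProps) {
        super(scope, id, props);
        // ... cluster, API gateway, databases, and message queues declared once, here ...
      }
    }

    const app = new App();

    // Every region is built from the same code, so a fix or new feature is made
    // in one place and rolled out to all regions through the same pipeline.
    for (const [region, params] of Object.entries(regions)) {
      new RegionalStorefrontStack(app, `storefront-${region}`, {
        env: { region },
        apiInstances: params.apiInstances,
      });
    }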
The automation fear spiral describes how many teams fall into configuration
drift and technical debt.
Many people have the same problem I had in my early days of using automation
tools. I used automation selectively—for example, to help build new servers, or
to make a specific configuration change. I tweaked the configuration each time I
ran it to suit the particular task I was doing.
This is the automation fear spiral, as shown in Figure 2-2. Infrastructure teams
must break this spiral to use automation successfully. The most effective way to
break the spiral is to face your fears. Start with one set of servers. Make sure you
can apply, and then reapply, your infrastructure code to these servers. Then
schedule an hourly process that continuously applies the code to those servers.
Then pick another set of servers and repeat the process. Do this until every
server is continuously updated.
For example, let’s say I have to partition a hard drive as a one-off task. Writing
and testing a script is much more work than just logging in and running the
command. So I do it by hand.
The problem comes later on, when someone else on my team, Priya, needs to
partition another disk. She comes to the same conclusion I did and does the work
by hand rather than writing a script. However, she makes slightly different
decisions about how to partition the disk. I made an 80 GB ext3 partition
on my server, but Priya made a 100 GB XFS partition on hers. We’re
creating configuration drift, which will erode our ability to automate with
confidence.
Effective infrastructure teams have a strong scripting culture. If you can script a
task, then script it.4 If it’s hard to script, dig deeper. Maybe there’s a technique
or tool that can help, or maybe you can simplify the task or handle it differently.
Breaking work down into scriptable tasks usually makes it simpler, cleaner, and
more reliable.
There are many resources for learning about software design, architecture, and
engineering, which I draw on throughout the book.
Conclusion
The Principles of Cloud Infrastructure embody the differences between
traditional, static infrastructure and modern, dynamic infrastructure:
* Assume systems are unreliable
* Make everything reproducible
* Create disposable things
* Minimize variation
* Ensure that you can repeat any process
These principles are the key to exploiting the nature of cloud platforms. Rather
than resisting the ability to make changes with minimal effort, exploit that ability
to gain quality and reliability.
4 My colleague Florian Sellmayr says, “If it’s worth documenting, it’s worth
automating.”
Chapter 3. Platforms and Toolchains
With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 3rd chapter of the final book. Please note that the GitHub repo
will be made active later on.
If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.
The organization’s strategy and goals are delivered by various capabilities that
are described by the organization’s enterprise architecture. These capabilities are
implemented as services, which are often grouped into platforms that provide
cohesive sets of services.
Platform design and implementation is too broad a topic for this book to discuss
comprehensively. But we typically use Infrastructure as Code to implement
platforms and platform services.
Platform is one of those words (like “system” and “service”) which is used in so
many different ways that the word is nearly meaningless without a more specific
qualifier, such as “business”, “developer”, or “data”. And even then terms like
“business platform” still feel wooly. To make things even more difficult,
different people define and use various platform-related words in different ways;
there are no industry-standard definitions to rely on.
The patterns and practices for Infrastructure as Code described in this book
apply to any IaaS platform, whether provided by a public cloud vendor or an
internal infrastructure platform. I’ll give a summary of the common types of
resources and services IaaS platforms provide, to establish vendor-independent
terminology used throughout the book.
The parts of the enterprise architecture as shown in the diagram could use some
explanation:
Technology capabilities
Roughly speaking, these are capabilities that an IT organization uses to
enable building and running business products and capabilities. Technology
capabilities are not usually visible outside of the organization. These
capabilities are typically what we use infrastructure code to provide, and I’ll
describe their different types next.
Infrastructure resources
Raw compute, storage, and networking resources may be provided by cloud
vendors or data centers. These resources are the raw ingredients that we work
with using Infrastructure as Code. I’ll elaborate on them later in this chapter
in a section on Infrastructure as a Service.
Delivery capabilities
These services and systems are used to develop, test, and deliver software.
Examples include source code and artifact repositories, CI (Continuous
Integration) services, CD (Continuous Delivery) pipelines, and automated
testing tools. The scope of these capabilities isn’t purely software used for
business products and capabilities but also includes code used for other
technology capabilities. Chapter 7 explains how to use delivery capabilities
for your infrastructure code.3
Operational capabilities
Some systems and services may not be strictly required for the software to
run, but are needed to ensure it runs well. Examples include monitoring,
observability,4 security management, disaster recovery, and capacity
management. Chapter 19 explores some key operational capabilities.
Engineering platforms
An engineering platform provides technology capabilities to users inside an
organization, who use them to create products for users inside and outside the
organization. Figure 3-3 wraps the technology capabilities described in “Types
of Technology Capabilities” into an engineering platform.
These services are defined by the capability they provide to the software that
runs on them, rather than the details of the technology. “Public Traffic,” for
example, may include DNS entries, CDN (Content Distribution Network)
services, and network routing. But the service is defined by the fact that it
provides connectivity from users on the public Internet to an application.
Providing Platform Service Functionality
There are different ways that the functionality of a platform service may be
provided, and infrastructure is used differently in each of these ways. For
example, a monitoring service might use functionality from a software package
deployed onto the organization’s infrastructure, a service provided by the IaaS
cloud vendor, or a SaaS monitoring solution. Figure 3-5 shows each of these
options.
Packaged Software
Teams provide the platform functionality by deploying a software package
onto their infrastructure. A few examples include open-source monitoring
software like Prometheus, a secrets management service like Hashicorp
Vault, or a packaged container cluster service like kops or Rancher.
Infrastructure code provides the infrastructure to run the software as well as
integration with other infrastructure and services, such as networking and
authorization.
Externally-Hosted Service
Many organizations use services hosted by a SaaS vendor. Examples include
Datadog monitoring, Akamai Edge DNS, and Okta identity management.
Many SaaS providers have APIs supported by Infrastructure as Code tools, so
you can write code to provision, configure, and integrate their services.
EXAMPLE TECHNOLOGY CAPABILITY IMPLEMENTATIONS BY CLOTHESPIN
The ClotheSpin teams have built their systems over nearly twenty years, and
have used a variety of different ways of providing platform service functionality.
When ClotheSpin introduced an API layer for mobile applications, and later
opened it up to third-party developers, they deployed and ran the Kong API
gateway (https://konghq.com/products/kong-gateway) on an AWS EKS cluster
and a PostgreSQL RDS instance. Their infrastructure code pulled Docker images
with Kong pre-installed. This is an example of pull-based packaged software for
a platform service.
Later, the team decided to migrate to the AWS API Gateway service
(https://aws.amazon.com/api-gateway/). Most of the implementation for
ClotheSpin’s folks involved writing Terraform code to configure the service,
with some work by the application developers to migrate their code. This is an
example of functionality provided by the cloud platform.
The new Hipsteroo brand launched a few years ago with an architectural design
principle to keep the amount of packaged software to a minimum, so the team
would have fewer moving parts to manage. They defaulted to using services
provided by their cloud vendor as much as possible. However, they decided they
preferred the advanced features and developer experience of a third-party hosted
monitoring provider, Datadog (https://www.datadoghq.com/). The team deploys
a stack written with AWS CDK that other infrastructure uses to integrate with
the endpoints of the monitoring service. This is an example of an externally
hosted platform service.
IaaS Platforms
So far, I’ve described infrastructure resources vaguely as the stuff assembled by
infrastructure code. These resources, and the IaaS platforms that provide them,
are the medium in which we infrastructure coders work. They are the materials
that we mold, using our craft to turn characters in a file into the digital
foundations that sustain the organizations for which we work.
That may be a flowery way to describe IaaS platforms. But they are important to
what we do.
Figure 3-6 shows the relationship between infrastructure code, an IaaS platform,
and the infrastructure resources provisioned for our use.
Figure 3-6. Infrastructure code interacts with Infrastructure as a Service
An infrastructure tool like Terraform or CDK reads the infrastructure code and
uses it to interact with the API of the IaaS platform to provision or change
infrastructure resources.
There are different types of IaaS platforms, from full-blown public clouds to
private clouds; from commercial vendors to open source platforms. Table 3-1
lists examples of vendors, products, and tools for each type of cloud IaaS
platform.
Type of
Providers or products
platform
At the basic level, an IaaS platform provides compute, storage, and networking
resources. The platform can provide these resources in different ways. For
instance, you may run compute as virtual servers, container runtimes, and
serverless code execution.
Different vendors may package and offer the same resources in different ways,
or at least with different names. For example, AWS object storage, Azure blob
storage, and GCP cloud storage are all pretty much the same thing. This book
tends to use generic names that apply to different platforms. Rather than VPC
and Subnet, I use network address block and VLAN.
Compute Resources
Compute resources execute code. At its most elemental, compute is execution
time on a physical server CPU core. But most platforms provide compute in
different ways. Common compute resources include:
* Virtual machines (server instances)
* Containers and container clusters
* Serverless code execution (FaaS)
The variety of options for provisioning and using compute resources creates
useful options for designing and implementing applications to use them
effectively and efficiently.
Storage Resources
Most IaaS platforms provide several kinds of storage resources:
* Block storage: virtual disk volumes that can be mounted to virtual servers or other compute instances. Examples include AWS EBS, Azure Page Blobs, OpenStack Cinder, and GCE Persistent Disk.
* Object storage: provides access to files from multiple locations, rather than being attached to a specific compute instance. Amazon’s S3, Azure Block Blobs, Google Cloud Storage, and OpenStack Swift are all examples. Object storage is usually cheaper and more reliable than block storage, but with higher latency.
* Networked filesystems: shared network volumes, usually volumes that can be mounted on multiple compute instances using standard protocols, such as NFS, AFS, or SMB/CIFS.7
* Structured data storage: often managed Database as a Service (DBaaS) offerings. They can be a relational database (RDBMS), key-value store, or formatted document stores for JSON or XML content.
* Secrets management: essentially structured data storage with additional features for secrets management such as rotation and fine-grained access management. See Chapter 11 for techniques for managing secrets and infrastructure code.
As with compute resources, storage options range from simple services that
provide raw storage space to more sophisticated services tailored to narrower
use cases.
Network Resources
Part of the safety of managing networking as code comes from the ability to
quickly and accurately test a networking configuration change before applying it
to a critical environment.
Beyond this, Software Defined Networking (SDN) makes it possible to create
finer-grained network security constructs than you can do manually. This is
especially true with systems where you create and destroy elements dynamically.
The details of networking are outside the scope of this book, so check the
documentation for your platform provider, and perhaps a reference such as Craig
Hunt's TCP/IP Network Administration (O'Reilly).
Table 3-1 lists some relevant open source and commercial products now
available for building private IaaS platforms in a data center. The bare-metal
cloud tools in that table can automate the provisioning of physical servers, either
to use directly or as a first step for installing virtualization or IaaS software.
Many of these tools are used for installing IaaS products, automating the process
of installing hypervisors onto physical servers.
The major IaaS cloud vendors also offer products for deploying services in a
data center that are compatible with their public cloud offerings. These solutions
are not designed as complete offerings like the private IaaS products listed in the
table but are intended as stepping stones or complements for running hybrid
clouds with their public cloud services.
Although some people argue that private hosting is more economical than public
cloud8, it takes considerable investment and expertise to implement a private
IaaS. Most private IaaS implementations are much less sophisticated, mature,
and flexible than public offerings. Most organizations I’ve worked with who run
in-house IaaS or PaaS clouds struggle to find and retain staff with the skills to
manage them, and so are reliant on third-party vendors.
There are use cases where at least some services need to run in the data center.
However, rather than trying to build and maintain a full-fledged internal IaaS
cloud, it’s generally more useful to build just enough infrastructure to deliver
specific workloads. You can do this by automating the processes to provision,
update, and manage that infrastructure with the simplest set of tooling necessary.
Multicloud
Many organizations end up hosting across multiple platforms. A few terms crop
up to describe variations of this:
Hybrid cloud
Hosting applications and services for a system across both private
infrastructure and a public cloud service. People often do this because of
legacy systems that they can’t easily migrate to a public cloud service (such
as services running on mainframes). In other cases, organizations have
requirements that public cloud vendors can’t currently meet, such as legal
requirements to host data in a country where the vendor doesn’t have a
presence.
Cloud agnostic
Building systems so that they can run on multiple public cloud platforms.
People often do this hoping to avoid lock-in to one vendor. In practice, this
results in lock-in to software that promises to hide differences between
clouds, or involves building and maintaining vast amounts of customized
code, or both.
Polycloud
Running different applications, services, and systems on more than one public
cloud platform. This is usually to exploit different strengths of different
platforms.
Many organizations worry that using certain cloud-provided capabilities might reduce their options for
moving to alternative vendors in the future, a risk known as vendor lock-in. I've seen the obsession with this risk lead to
policies banning the use of common, easily ported services like DBaaS. Other organizations invest in
building or buying products that promise to create an abstraction layer over the cloud vendor. Doing this
adds complexity and cost that is rarely justified by a sensible evaluation of risks and tradeoffs.
Simpler systems might implement these concerns in a single toolchain, for
example using Terraform to define and provision the infrastructure for the
platform services in an environment and to deploy applications into it. As a
system grows,
particularly in terms of the number of people and teams working on it, a single
toolchain can become messy and difficult to maintain. Many teams find it useful
to split the tools into different sets based on their responsibilities.
The landscape of automation tooling, both open source and commercial, covers
many of these concerns and more. In general, most solutions are aimed at a
subset of concerns, but it can be tempting to stretch them more broadly. Using
Terraform code to deploy applications is one example. Implementing
infrastructure provisioning commands in the configuration of a job in a build
server like Jenkins is another.
The boundaries and dependencies between the concerns of these three different
types of toolchains are a recurring topic for Infrastructure as Code, so I’ll
describe them in more detail, noting which parts of this book are most relevant.
Most teams write at least some amount of custom scripts to orchestrate their
infrastructure code. Chapter 8 digs into the tools and scripts that teams may use
for this.
On the other side of a platform service are the tools, services, and other solutions
used to provision and configure it. As with the infrastructure delivery toolchain,
different organizations may use different names for these things, or may not
even clearly define them as a group.
A few examples of solutions people use for managing platform services include:
Platform-building frameworks
Like Kratix and Humanitec. As opposed to a PaaS, these solutions provide
tooling for teams to build and manage their own platform services, rather than
providing pre-built services.
Developer portals
Along the lines of Backstage, which people can use to provision platform
services (among other things).
In some cases, a team can use a self-service solution such as a developer portal
to manually trigger the provisioning of a platform service instance to use. In
other cases, deploying an application automatically triggers the provisioning of a
service the application requires. The latter situation requires integration with the
application delivery toolchain.
There is a wide and sprawling landscape of tools and services to automate the
build, testing, delivery, and deployment of application software. These include
build and pipeline services like those described in Chapter 8 and application
deployment services like Flux and ArgoCD.
Conclusion
This chapter moves the conversation along the journey from the conceptual stuff
to the more concrete. In the previous chapter, we set out a view of organizational
goals leading down through engineering goals to give us goals for our
infrastructure architecture. This chapter positioned enterprise architecture as the
way we implement those goals, with platforms at different layers to implement
each layer of goals.
Although engineering platforms are a larger topic than this book can cover, it's
essential to understand the role that infrastructure plays in enabling
them. This leads to the topic of IaaS platforms, cloud or otherwise, that provide
the foundations for everything else in the stack. The infrastructure toolchain is
the mechanism for harnessing IaaS resources to provide platform services.
1 For another definition of platform, see What I Talk About When I Talk
About Platforms from my former colleague Evan Bottcher.
4 As I explain in Chapter 19, observability and monitoring are not the same
thing.
7 Network File System, Andrew File System, and Server Message Block,
respectively.
There are simpler ways to provision infrastructure than writing code and then
feeding it into a tool. You could follow the “ClickOps” approach by opening the
platform’s web-based user interface in a browser, then poking and clicking an
application server cluster into being. Or you could embrace “CLI-Ops” by
opening a prompt and using your command-line prowess to wield the IaaS
platform’s CLI (command-line interface) tool to forge an unbreakable network
boundary.
The “as code” paradigm works for many different parts of infrastructure as well
as things that are infrastructure-adjacent. A partial list of things to consider
defining as code includes:
IaaS resources
Collections of infrastructure resources provisioned on an IaaS platform are
defined and managed as stacks, as described in Chapter 10.
Servers
Managing the configuration of operating systems and other elements of
servers was the focus of the first generation of Infrastructure as Code tools, as
discussed in Chapter 17.
Hardware devices
Even most physical devices can be provisioned and configured using code.
Chapter 17 describes automating bare-metal server provisioning. Software
Defined Networking (SDN) can be used to automate the configuration of
networking devices like routers and firewalls.
Application deployments
Application deployment has moved decisively away from procedural
deployment scripts in favor of immutable containers and declarative
descriptors. See Chapter 16.
Delivery pipelines
Continuous Delivery pipeline stages for building, deploying, and testing
software can and should be defined as code (Chapter 8).
Platform services
Other services such as monitoring, log aggregation, and identity management
should ideally be configured as code. Services provided by an IaaS
platform (see “Providing Platform Service Functionality”) are easy to
configure using the same tools you use to define infrastructure stacks on the
same platform. But most other tools, whether packaged or SaaS, should be
configurable using APIs and code as well. Many infrastructure stack tools
like Terraform and Pulumi support plugins and other extensions that can
configure third-party software as well as IaaS resources. Configuring
platform services as code has the added benefit of making it easier to
integrate infrastructure with other resources such as monitoring and DNS.
Tests
Tests, monitoring checks, and other validations should all be defined as code.
See Chapter 7 for more.
Keeping these definitions as code in externalized files allows you to:
Use your preferred IDE or text editor, most of which have advanced
functionality and conveniences,
Use a full-featured, off-the-shelf version control system,
Apply specifications and configurations to multiple instances, which is
particularly useful for developing and testing changes safely before applying
them to business-critical systems,
Break specifications and configurations into separate components, so they can
be separately developed and tested, avoiding issues with working on a shared
instance of a closed-box system,
Automatically trigger tests and other activities when a specification is
changed, deployed, or promoted between instances,
Integrate workflows, such as integration testing, across different tools and
systems. For example, you can test integration between an application and its
deployment infrastructure when either side is changed,
Track and record changes across different tools and systems.
The externalized specification model mirrors the way most software source code works. Some visual
development environments, like Visual Basic, store and manage source code behind the scenes. But for
nontrivial systems, developers find that keeping their source code in external files is more powerful.
It is challenging to use Agile engineering practices such as TDD, CI, and CD
with closed-box infrastructure management tools. A tool that uses external code
for its specifications doesn’t constrain you to use a specific workflow. You can
use an industry-standard source control system, text editor, CI server, and
automated testing framework. You can build delivery pipelines using the tools
that work best for you, and integrate testing and delivery workflows with
software and other system elements.
Code and configuration for infrastructure and other system elements should be
stored in a source code repository, also called a Version Control System (VCS).
These systems provide loads of useful features, including tracking changes,
comparing versions, and recovering and using old revisions when new ones have
issues. They can also trigger actions automatically when changes are committed,
which is the enabler for CI jobs and CD pipelines, as discussed in Chapter 8.
One thing that you should not put into source control is unencrypted secrets, such as passwords and keys.
Even if your source code repository is private, its history and revisions of code are too easily leaked. Secrets
leaked from source code are one of the most common causes of security breaches. See Chapter 11 for better
ways to manage secrets.
More recently, a new generation of tools for working with IaaS infrastructure has
reinvigorated interest in using general-purpose programming languages to define
infrastructure. Pulumi and the AWS CDK (Cloud Development Kit) support
languages like TypeScript, Python, and Java.
Idempotent Code
Many task-focused scripts are written to be run only when an action needs to be
carried out or a specific change made, but can’t be safely run multiple times. For
example, this script creates a virtual server using a fictional infrastructure tool
called stack-tool:
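A minimal sketch of what such a script might look like, with the command syntax and arguments invented for the fictional stack-tool:

    #!/bin/sh
    # Create a new virtual server named my-server with 4GB of RAM.
    # Every run creates another server, so the script is not safe to repeat.
    stack-tool create server --name my-server --ram 4GB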
If you run this script once, you get one new server. If you run it three times, you
get three new servers2. This makes perfect sense, assuming the person running it
knows how many servers are already running, and how many need to be
running. In other words, the script doesn’t include all of the knowledge needed
to make decisions, and so leaves decision-making to whoever runs it. A script
that isn’t idempotent doesn’t support the hands-off approach we get with
Infrastructure as Code.
We can change the script to check whether the server name exists and refuse to
create a new one if so. The following snippet runs a fictional command that
checks whether the server already exists, and only creates a new server if the
check exits with a value of “1”:
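A sketch of the idempotent version, again with invented syntax for the fictional tool:

    #!/bin/sh
    # The fictional check command exits with "1" if no server with this name exists.
    stack-tool check server --name my-server
    if [ $? -eq 1 ]; then
      # No existing server was found, so create one.
      stack-tool create server --name my-server --ram 4GB
    fi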
The script is now idempotent. No matter how many times we run the script, the
result is the same: a single server named “my-server”. If we configure an
automated process to run this script continuously, we can be sure the server
exists. If the server doesn’t already exist, the process will create it. If someone
destroys the server, or if it crashes, the process will restore it.
But what happens if we decide the server needs more memory? We can edit the
script to change the memory argument to 8GB. But if the server already exists
with 4GB of RAM, the script won't change it. We could add a new check, taking
advantage of some convenient options available in the imaginary stack-tool:
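The extended script might look something like the following sketch; the option for writing the matching server's ID to a file is an invented convenience of the fictional tool:

    #!/bin/sh
    # Look for an existing server named my-server, writing its ID to a file if found.
    stack-tool check server --name my-server --write-id-to existing_server.id
    if [ $? -eq 1 ]; then
      # No existing server, so create one with the desired memory allocation.
      stack-tool create server --name my-server --ram 8GB
    else
      # The server exists, so change its memory allocation to 8GB.
      stack-tool change server --id $(cat existing_server.id) --ram 8GB
    fi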
The modified script now captures the ID of the existing virtual server, if found,
into a file. It then passes the ID to stack-tool with a command to change
the existing server’s memory allocation to 8GB. I’ve also changed the script so
that if the server doesn’t already exist, the command to create a new one creates
it with the right memory setting.
Now we have an idempotent script that ensures the server exists with the right
amount of memory. However, scripts like this become messy over time as needs
change and more conditionals are added. A declarative language makes
infrastructure definitions easier to maintain and understand.
This example creates the same virtual server instance as the earlier examples:
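A sketch of what the declarative definition might look like, using a made-up declarative syntax rather than any real tool's language:

    # Declare the desired state of the server. The tool compares this with
    # what actually exists and works out what changes, if any, to make.
    virtual_server:
      name: my-server
      ram: 4GB
      disk: 50GB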
This code doesn’t include any logic to check whether the server already exists
or, if it does exist, how much memory or disk space it currently has. The tool
that you run to apply the code takes care of that. The tool checks the current
attributes of infrastructure against the code and works out what changes to make
to bring the infrastructure in line. So, in this example, to increase the RAM of
the application server you would edit the file and rerun the tool.
Declarative infrastructure tools like Terraform and Chef separate what you want
from how to create it. As a result, your code is cleaner and more direct.
Declarative code is inherently idempotent as well. The tool can apply the code
repeatedly, as often as needed, with no harm. Defining infrastructure declaratively
removes the need for whoever runs the tool to have the knowledge to decide
where and when to apply the code, which means we can push our code into
automated systems to deliver.
Some people dismiss declarative code as being mere configuration rather than “real” code. “Real” code is,
in their thinking, imperative code, which means either procedural (like C) or object-oriented (like Java)3.
I use the word code to refer to both declarative and imperative languages. I don’t find the debate about
whether a coding language must be Turing-complete useful. I even find regular expressions useful for some
purposes, and they aren’t Turing-complete either. So, my devotion to the purity of “real” programming may
be lacking.
Declarative code is fine when you always want the same outcome. However,
there are situations where you want different results that depend on the
circumstances. For example, the following code creates a set of VLANs. The
ClotheSpin team’s cloud provider has a different number of data centers in each
country, and the team wants its code to create one VLAN in each data center. So
the code needs to dynamically discover how many data centers there are, and
create a VLAN in each one:
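A sketch of how this might be written, extending the made-up declarative syntax with a loop; the variable names, the country, and the address-splitting method are all hypothetical:

    # The address space for the whole environment, declared elsewhere in practice.
    address_space: 10.2.0.0/16

    # Discover how many data centers the provider has in this country,
    # then declare one VLAN in each, with its own slice of the address space.
    data_centers = discover_data_centers(country: "UK")

    for i in 0 .. data_centers.count - 1:
      vlan:
        name: "vlan-${data_centers[i].name}"
        data_center: data_centers[i].id
        ip_range: divide_ip_range(address_space, data_centers.count, i)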
The code also assigns an IP range for each VLAN, using a fictional but useful
address-splitting method. This method takes a declared address space, divides it
into several smaller address spaces based on the number of data centers, and
returns the address space indexed by the loop variable.
This type of logic can’t be expressed using declarative code, so most declarative
infrastructure tools extend their languages to add imperative programming
capability. For example, Ansible adds loops and conditionals to YAML.
Terraform’s HCL configuration language is often described as declarative, but it
combines three sublanguages, one of which is expressions, which includes
conditionals and loops.
Newer tools, such as Pulumi and AWS CDK, return to using programmatic
languages for infrastructure. Much of their appeal is their support for general-
purpose programming languages (as discussed in “General-Purpose Languages
Versus DSLs for Infrastructure”). But they are also valuable for implementing
more dynamic infrastructure code.
Imperative code is a set of instructions that specifies how to make a thing happen. Declarative code
specifies what you want, without specifying how to make it happen.
Too much infrastructure code today suffers from mixing declarative and imperative code, which makes
code messy and difficult to understand. I believe this type of mixing is a result of trying to apply a single
language and single language paradigm across code that would be better separated.
An infrastructure codebase involves many different concerns, from defining infrastructure resources, to
configuring different instances of otherwise similar resources, to orchestrating the provisioning of multiple
interdependent pieces of a system. Some of these concerns can be expressed most simply with a declarative
language. Some concerns are more complex and better handled with an imperative language.
As practitioners of the still-maturing field of infrastructure code, we are learning where to draw boundaries
between these concerns. Mixing concerns can lead to code that mixes language paradigms. One failure
mode is extending a declarative syntax like YAML to add conditionals and loops. The second failure mode
is embedding simple configuration data (“2GB RAM”) into procedural code, mixing what you want with
how to implement it.
In relevant parts of this book, I point out where I believe some of the different concerns may be, and where
I think one or another language paradigm may be most appropriate. But our field is still evolving. Much of
my advice will be wrong or incomplete. So I encourage you, the reader, to think about these
questions and help us all to discover what works best.
Declarative code is useful for defining the desired state of a system, particularly
when there isn’t much variation in the outcomes you want. It’s common to
define the shape of some infrastructure that you would like to replicate with a
high level of consistency.
For example, you normally want all of the environments supporting a release
process to be nearly identical (see Chapter 15). So declarative code is good for
defining reusable environments, or parts of environments (per the reusable stack
pattern discussed in Chapter 15). You can even support limited variations
between instances of infrastructure defined with declarative code using instance
configuration parameters, as described in Chapter 11.
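For instance, a reusable declarative definition might expose a couple of parameters for the variation that is allowed, sketched here in the same made-up syntax:

    # Per-instance parameters supply the limited, allowed variation;
    # everything else is identical for every environment.
    parameters:
      environment_name:
      server_ram:

    virtual_server:
      name: "appserver-${environment_name}"
      ram: ${server_ram}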
However, sometimes you want to write reusable, sharable code that can produce
different outcomes depending on the situation. For example, the ClotheSpin
team writes code that can build infrastructure for different application servers.
Some of these servers are public-facing, so they need appropriate gateways,
firewall rules, routes, and logging. Other servers are internally facing, so they
have different connectivity and security requirements. The infrastructure might
also differ for applications that use messaging, data storage, and other optional
elements.
INFRASTRUCTURE AS DATA
Infrastructure as Data is a subgenre of declarative infrastructure that leverages Kubernetes as a platform for
orchestrating processes.4 For example, ACK (AWS Controllers for Kubernetes) exposes AWS resources as
Custom Resources (CRs) in a Kubernetes cluster. This makes them available to standard services and tools
in the cluster, such as the kubectl command-line tool, to provision and manage resources on the IaaS
platform.
In addition to convenience, a benefit of integrating IaaS resource provisioning into the Kubernetes
ecosystem is the ability to use capabilities like the control loop of the operator model5. Once infrastructure
resources are defined in the cluster and provisioned on the IaaS platform, a controller loop ensures the
provisioned resources remain synchronized with the definition.
Crossplane is an infrastructure as data product that adds the capability to define and provision
Compositions, which are collections of resources managed as a unit: in other words, a stack.
Using Kubernetes to manage the process of applying infrastructure code can make the process less visible.
Be sure to implement effective monitoring and logging so you can troubleshoot effectively.
In addition to being declarative, many infrastructure tools use their own DSL, or
Domain-Specific Language.6
For example, Ansible, Chef, and Puppet each have a DSL for configuring
servers. Their languages provide constructs for concepts like packages, files,
services, and user accounts. A pseudocode example of a server configuration
DSL is:
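The following sketch shows the general shape of such a DSL; the package names, service details, and template file are hypothetical:

    # Make sure the packages the service needs are installed.
    package: java-runtime
    package: monitoring-agent

    # Define a service that should be running, and how it should run.
    service: shopping-service
      port: 8443
      user: shopping
      group: shopping

    # Create a configuration file from a template.
    file: /etc/shopping-service/config.yml
      template: shopping-service-config.template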
This code ensures that two software packages are installed. It defines a service
that should be running, including the port it listens to and the user and group it
should run as. Finally, the code specifies that a server configuration file should
be created from a template file.
The example code is pretty easy for someone with systems administration
knowledge to understand, even if they don’t know the specific tool or language.
Chapter 17 discusses how to use server configuration languages.
Many stack management tools also use DSLs, including Terraform and
CloudFormation. These DSLs model the IaaS platform resources, so that you
can write code that refers to virtual servers, disk volumes, and network routes.
See Chapter 10 for more on using these languages and tools.
Other DSLs model application runtime platform concepts, such as application
clusters, service meshes, or applications. Examples include Helm charts and
Cloud Foundry app manifests.
The rise in interest in moving away from declarative languages is driven in part by use cases where
procedural languages are more appropriate. Another reason people like tools like AWS CDK and Pulumi is
that they support coding in general-purpose programming languages like Python and JavaScript rather than
DSLs. Many people, especially those with a background in software development, are more comfortable
using a familiar language rather than learning a new one.
Beyond using existing language skills, popular general-purpose languages have broad ecosystems of tools
for working with their code. These languages are very well supported by IDEs (Integrated Development
Environments) with productivity features like error highlighting and code refactoring. Using these
languages also gives access to a much richer selection of tools and frameworks for activities like static code
analysis and unit testing.
A general-purpose language can be useful for building lower-level abstraction layers, libraries, and
frameworks for infrastructure. However, they are often more verbose than needed for higher-level
infrastructure definitions, obscuring “what” is being defined within the boilerplate code and logic of “how”
it's implemented.
So again, it’s important to avoid choosing one tool, such as a general-purpose programming language, for
all jobs, and instead focus on designing systems with a clear separation of concerns, and using the
appropriate tool for each.
Most Infrastructure as Code DSLs directly model the resources they configure.
The languages used with tools like Terraform and CloudFormation are
essentially thin wrappers over IaaS APIs. For example, the Terraform
aws_instance resource directly maps to the AWS run_instances
API method.7
The IaaS vendor SDKs expose these APIs for general-purpose programming
languages. The advantage of a DSL is that it provides a unifying model to
simplify working with the APIs and the resources that they create. For example,
an infrastructure DSL hides the logic needed to make your code idempotent.
This is the advantage of using a tool like Pulumi or the AWS CDK to write
infrastructure code in JavaScript, for example, over directly using the AWS
JavaScript SDK.
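As a rough illustration, in pseudocode rather than any real SDK or DSL, the difference looks something like this:

    # Using an SDK directly: your code handles existence checks and updates.
    existing = api.find_servers(name: "my-server")
    if existing is empty:
      api.create_server(name: "my-server", ram: "8GB")
    else if existing[0].ram != "8GB":
      api.modify_server(id: existing[0].id, ram: "8GB")

    # Using a declarative DSL: declare the result, and the tool
    # makes whatever API calls are needed to converge on it.
    virtual_server:
      name: my-server
      ram: 8GB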
Many teams use infrastructure code languages to build abstraction layers over
the infrastructure resources provided by the IaaS platform. Doing this can help
people use infrastructure without having to implement the gritty details of, for
example, wiring up network routes. Tools or languages that expose infrastructure
at this abstracted layer tend to focus on application deployment and
configuration, as discussed in Chapter 9.
Writing code to define infrastructure can create some confusion. The code we
write for an application is compiled, deployed, and then executed at run-time, as
shown in Figure 4-1.
Figure 4-1. Application code executes in the runtime context
Some infrastructure code is compiled as well. But whether it’s compiled or not,
infrastructure code doesn’t execute in the runtime environment like application
code. Rather, it executes in the delivery context, as shown in Figure 4-2.
The infrastructure code you write defines what happens when the infrastructure
is deployed. It causes infrastructure resources to be provisioned in the IaaS
platform, but your infrastructure code affects the way that infrastructure behaves
at runtime only indirectly.
This difference may seem obvious, but it has implications for things like testing.
If we write a unit test for infrastructure code, does it tell us about the
infrastructure our code creates, or does it only tell us about what happens when
the infrastructure code is executed? For example, if our code creates networking
structures, can we write unit tests that prove that those structures will route
traffic the way we expect? Chapter 7 discusses approaches for automated
infrastructure testing that consider different layers of testing.
At each stage, additional elements such as code modules and the infrastructure
tool itself come into the mix, adding to the distance between the code and reality.
For example, when there is a problem executing application code, a developer
can trace the progress of execution through the source code, perhaps using a
debugger. But a debugger won't trace the execution of our infrastructure code;
instead, it traces the execution of the infrastructure tool's code.
NOTE
One team struggled with running out of memory while running their infrastructure code. Their first instinct
was to analyze the code to understand which parts of the infrastructure were using the most memory, so
they could optimize that code. However, infrastructure code doesn’t correlate to infrastructure tool memory
usage in the same way that application code does. In the end, we realized that the true issue was that the
infrastructure project, although divided into modules, was simply too large. So the solution was to break the
infrastructure into smaller stacks, as discussed in Chapter 10.
More confusion comes with tools that compile infrastructure code we write in
one language into another language. For example, AWS CDK allows developers
to write infrastructure code in application development languages like Python
and JavaScript, and then compile it to CloudFormation templates. Developers
can then use various tools and other support for the programming language, such
as IDE refactoring features and unit test frameworks. However, it’s important to
keep in mind the differences not only between the code and the resources it
creates but also differences between the code as developed and tested and the
code that is generated to be applied to the instances. The fact that this code
transitions from an imperative (procedural or object-oriented) language to a
declarative language (JSON in the case of the CDK) adds to the gap between
code and reality.
SERVERLESS CODE
Chapter 16 discusses serverless as an application runtime platform. Serverless redraws boundaries between
application deployment and infrastructure provisioning. Serverless application code is arguably deployed
every time it’s executed, together with at least part of its infrastructure, the runtime environment. Efforts at
optimizing serverless applications include deploying some infrastructure resources ahead of time, pre-
packaging some in containers, and perhaps caching others.
Infrastructure Code and Resource Instances
Another peculiarity with infrastructure code is the gap between the code and the
actual resources allocated on the IaaS platform. These two things are consistent
at the point in time when the code is applied. At any other time, there is no
guarantee they are the same. The actual infrastructure may change if someone
makes a change outside of the code using the cloud UI or a command-line tool.
It’s also possible that different versions of the same code can be applied to a
single instance, creating a gap.
CROSSED CODE
Recently, an infrastructure engineer on a team I was working with was confused. He was editing some
Terraform code and applying it to the test environment, but a few minutes later he found that the resources
in the environment didn’t match his code. He applied again and it seemed fine. But when he made another
change and applied it, his change failed because the environment was still out of whack. He went to post on
the team’s Slack and saw a teammate reporting the same issue: weird stuff was happening to the test
environment infrastructure. Then the penny dropped. Both engineers were editing and applying their own
local copy of the code to the same test environment, reverting each other’s changes.9
As discussed later in this book (Chapter 20), teams should ensure that, for any
shared instance of infrastructure, the code is only ever applied from a centralized
service. A centralized service can ensure that the correct version of the code is
applied, avoiding situations where individual engineers run different local copies
or branches.
Infrastructure State
Tools from third-party vendors, like Terraform and Pulumi, need their own data
structures to manage these mappings of code to instances. They store these data
structures in a state file for each instance. Early versions of these tools required
users to handle storage of the state files, but more recent versions add support for
hosted services like Terraform Cloud and Pulumi Cloud.
Although many people prefer having their instance state handled transparently
by the platform, it can be useful to view and even edit state data structures to
debug and fix issues10.
Many infrastructure codebases evolve from configuration files and utility scripts into unmanageable messes.
Too often, people don’t consider infrastructure code to be “real” code. They don’t give it the same level of
engineering discipline as application code. To keep an infrastructure codebase maintainable, you need to
treat it as a first-class concern.
Design and manage your infrastructure code so that it is easy to understand and maintain. Follow code
quality practices, such as code reviews, pair programming, and automated testing. Your team should be
aware of technical debt and strive to minimize it.
Chapter 5 describes how to apply various software design principles to infrastructure, such as improving
cohesion and reducing coupling. Chapter 9 explains ways to organize and manage infrastructure codebases
to make them easier to work with.
Next-Generation Infrastructure
I’ve alluded to some of the limitations of Infrastructure as Code as a model for
managing infrastructure, such as the gap between code and reality. As I write
this, some companies are exploring ways to evolve beyond these limitations. I
can’t predict which of their ideas will take off, which will fade away, and what
other ideas may emerge over the next few years, or even before this book is
published.
However, there are at least two interesting directions suggested by the current
efforts. One is bridging the gap between applications and infrastructure. The
other is bridging the gap between infrastructure code and provisioned resources.
While most of these tools are not mature enough for most teams to consider
using for business-critical systems, they are worth watching.
However, the intent of these languages is not to intermix code at the level
normally written in Terraform with business logic. Instead, business logic code
can specify the relevant attributes of infrastructure at the right level of detail for
the context. The user registration code can specify that the data should be saved
to a database that is configured for handling personal customer information. The
code calls a separately-written library that handles the details of provisioning
and configuring the database appropriately.
So the system can be designed to separate the concerns of business logic and
detailed infrastructure configuration. However, the concerns that are relevant
across the boundaries can be managed explicitly. The current paradigm of
defining and deploying infrastructure separately relies on out-of-band knowledge
to know that the database needs to be configured for personal data. It also
involves brittle integration points that we need to configure explicitly on both
sides, such as connectivity and authentication.
Infrastructure as model
Earlier, I pointed out the challenges of the gap between code and resources
provisioned on IaaS. A given version of code may be different from what was
last applied to provision or change infrastructure, and the infrastructure
resources provisioned may have changed from both of those points.
Infrastructure as data aims to eliminate the gap between the code that was
applied and the provisioned resources, by continuously re-synchronizing the
code. But there are still gaps, especially when it comes to helping operators to
understand the current state of their infrastructure.
The team at System Initiative has shared a demo and details of their work on a
new tool that builds an interactive model of the current state of infrastructure.12
At first glance, System Initiative's tool looks similar to the ClickOps
approach that I disparaged at the start of this chapter. However, their extensible
implementation has the potential to handle many of the limitations of ClickOps.
For example, users can use the interactive interface to prepare a change set and
carry out checks and approvals before applying it to the provisioned resources.
This suggests we would be able to implement tests and other validations, support
multiple people working on a system concurrently, and potentially replicate
changes consistently across environments.
Having an interactive model that can be updated from the real infrastructure in
real-time would shrink the feedback loop for working on changes. While
working on a potential change, an engineer can refresh the model to see new
changes to the live system and how they will impact their work.
As with the tools for integrating application and infrastructure code, this is an
early iteration of the concept, essentially an experiment. But it’s heartening to
see people exploring ways to advance our ways of working. It’s important to be
aware that the current state of infrastructure management approaches and tools is
only a step on a journey.
Conclusion
The topics in this chapter could be considered “meta” concepts for Infrastructure
as Code. However, it's important to keep in mind that the goal of making routine
tasks hands-off for team members is what makes Infrastructure as Code more
powerful than writing scripts to automate tasks within a hands-on workflow.
Considering how different language attributes like idempotency affect the
maintainability of infrastructure code helps to select more useful tools for
different jobs in our system. Keeping the differences between infrastructure code
and application code in mind can help avoid traps in the analogy of
infrastructure as software.
This chapter closes out the Foundational chapters of the book (Part I). The
following chapters discuss the more concrete topic of infrastructure stacks, the
core architectural unit of Infrastructure as Code (Part III).
5 See https://kubernetes.io/docs/concepts/architecture/controller/
7 You can see this in the documentation for the Terraform aws_instance
resource and the AWS run_instances API method.
9 See Chapter 20 for techniques for avoiding code clashes between people
working on infrastructure code.
11 See Gregor Hohpe’s article, IxC: Infrastructure as Code, from Code, with
Code, for an exploration of various combinations of infrastructure, architecture,
and code and their implications.
12 For more, see the System Initiative website. As of this writing, there is a
downloadable demo and videos. The company’s founder, Adam Jacob, has also
said they intend to make the code available as open source.
Part II. Core Topics
Chapter 5. Infrastructure Components
One view of system design is that it’s about grouping elements and defining
relationships between those groups. We define these groupings as architectural
units, or components, that we can use in diagrams and discussions. Some
common components for software design include applications, microservices,
libraries, and classes. This chapter will describe a few components that we can
use for Infrastructure as Code.
Our industry does not have widely agreed definitions of what components are
relevant for Infrastructure as Code, or what to call them. So for the purposes of
this book, I’ve defined a set of four components. These components are IaaS
resources, code libraries, stacks, and infrastructure products. Each of these
covers a different level of scope, usually by aggregating lower-level structures.
Code libraries and stacks aggregate one or more IaaS resources. A stack may
optionally aggregate code libraries. An infrastructure product aggregates stacks.
NOTE
Your infrastructure probably doesn’t use this set of components and terms. However, they are helpful for
framing patterns and approaches for defining systems with Infrastructure as Code. You should be able to
adapt and apply these patterns and approaches in this book to your system. It would be a useful exercise,
once you’ve read through this chapter, to consider your infrastructure system’s design and how to map it to
the terms I’ve used here.
Both code libraries and stacks are defined by the infrastructure tool, even if the
tool uses different names for them. For example, Terraform calls code libraries
modules, and CDK calls them level 3 constructs. Although Terraform doesn’t
have a term for stacks, it has projects (for source code), and its state files
correlate with a provisioned stack. Pulumi and CDK both use the term stack,
while Crossplane has compositions. Most tools don’t have an inherent concept of
infrastructure products, so this component is least likely to be familiar.
While the level of scope is the most obvious characteristic of the different
components, the context is particularly important for infrastructure components.
So let’s explore this in more detail.
The idea of examining different components based on the context they’re used in
may be less familiar than considering the levels of scope (high-level components
versus lower-level components). So let’s consider how these three contexts
relate to infrastructure design.
The Runtime Context represents the infrastructure once it has been provisioned
on the IaaS platform. This is the context where the resources are used to run
workloads. The purpose of Infrastructure as Code is to shape the runtime
context, so most design activities start by defining the resources as they will
appear there.
Design for the Source Code Context focuses on source code repositories, folder
structures, and organization of the files that contain infrastructure code. Chapter
9 discusses these topics in detail. Infrastructure code libraries also live in this
context, which is a topic covered in depth in Chapter 13.
The Delivery Context sits between the source code and runtime contexts, being
concerned with turning infrastructure code into usable provisioned resources on
an IaaS platform. Design for this context involves questions of how to organize
infrastructure to apply it with your infrastructure code tool. Do you provision
everything as a single group, or do you break it into multiple groups that you can
deliver and provision separately?
Now that we’ve established these two dimensions for considering components
for Infrastructure as Code, scope and context, we can define some of the
components used in this book.
The Infrastructure Components
The four components used most in this book are summarized in Table 5-1.
You may have picked up hints in the descriptions of how each of these
components’ relevance varies with the infrastructure code context. We work
with IaaS resources across all contexts. Code libraries, such as Terraform
modules, are mainly relevant in the source code context, as they are a
mechanism for sharing and reusing code across projects.
Stacks are the most relevant component in the delivery context because they
define how resources are grouped for provisioning. Infrastructure products are
relevant to how infrastructure is allocated to and consumed by workloads, so can
be relevant across all contexts. They are often used to organize source code,
orchestrate provisioning across stacks, and for managing groups of resources at
runtime.
Generally, IaaS resources are the only components that are visible in the runtime
system. One exception to this is where stack structures are managed by the IaaS
platform, as with AWS CloudFormation. Otherwise, teams can implement
tracking and management of higher-level components through tagging and
permissions. For example, permission to manage different infrastructure
products can be restricted by runtime authorization policies.
Figure 5-2 shows a part of the application architecture for the ClotheSpin online
store.
ClotheSpin has a website front end and a set of mobile applications. These share
services for product browsing, searching, shopping carts, and checkout, among
others. The mobile applications communicate with a “BFF” (Backend For
Frontend) service that connects with the shared services. For our examples, we’ll
focus on the website, and the two shared services used to browse products and
add them to a shopping basket.
The diagram shows infrastructure capabilities that are needed specifically for
each software service, such as static website content storage for the website, and
separate database instances for the product browsing service and the shopping
cart service. It also shows that some infrastructure capabilities are shared by
more than one service, such as container hosting.
Infrastructure Products
For smaller systems this approach keeps the implementation simple, avoiding an
unneeded layer of abstraction. For larger systems, however, these stacks can be
large and messy. We’ll look at approaches to sizing stacks in Chapter 10.
Infrastructure Stacks
A Stack Project includes the source code that specifies the resources in the stack,
possibly referencing infrastructure code libraries. It aligns with the source code
context.
A Stack Tool reads the code in the stack project and any libraries, then calls the
IaaS platform’s API to provision the IaaS resources defined in the code. It’s used
in the delivery context.
A Stack Instance is a set of IaaS resources provisioned on the IaaS platform from
a stack project, available for use by workloads. It is the stack in the runtime
context.
Figure 5-6 shows where these terms fit in the different contexts.
Figure 5-6. An infrastructure stack is a collection of infrastructure elements managed as a group
Examples of stack management tools include:
AWS CloudFormation
Azure Resource Manager
Bosh
Crossplane
Google Cloud Deployment Manager
Terraform
OpenStack Heat
Pulumi
Note that a single stack project may be reused to provision multiple stack
instances, often taking parameters to configure each specific instance. We’ll
cover this in Chapter 11.
“STACK” AS A TERM
Most stack management tools don’t call themselves stack management tools. Each tool has its own
terminology to describe the unit of infrastructure that it manages. CloudFormation and Pulumi both use the
term stack, but Terraform tends to talk about projects.
In this book, I’m describing patterns and practices that should be relevant to any of these tools, so I’ve
chosen to use the word stack as a generic term. I’ve been told there is a better term to describe the concept,
but nobody seems to agree on what that term is. So stack it is.
Chapter 13 will discuss patterns and antipatterns for using code libraries to build
stacks. However, there is a common pattern for using code libraries that is worth
mentioning as long as we’re on the subject of different levels of infrastructure
components. This pattern involves using a code library to implement a stack.
Tools such as Terragrunt1 are designed to support this pattern, which this book
calls a Wrapper Stack. Terraform Cloud’s no-code provisioning feature2
dynamically generates a wrapper stack project to provision a module as a stack
instance. I’ll describe the wrapper stack pattern in more detail later in the context
of other patterns and antipatterns for configuring stack instances (Chapter 11).
For now, it’s useful to understand the difference between infrastructure code
libraries used to share code across multiple stacks (which is how you would
expect to use a library) and those used to define an independently deployable
unit of infrastructure (which is, conceptually, a stack).
The DRY (Don’t Repeat Yourself) principle says, “Every piece of knowledge
must have a single, unambiguous, authoritative representation within a system.”3
If you copy the same code to use in multiple places, then discover the need to
make a change, it can be difficult to track down all of the places to make that
change.
Code libraries are a common solution for reducing duplicated code across
infrastructure stacks. However, sharing a library across multiple stack projects is
a tradeoff between reuse and coupling. Making a change to a library impacts all
of the projects that use it. A simple change requires all of the users to retest their
components and systems to make sure it doesn’t break something. A larger
change needed for the library to support a new stack might create a breaking
change for stacks that also use it.
The DRY principle is best viewed as applying not to specific code, but to higher-
level abstractions. For example, I worked with a team that had created a module
to replace all references to AWS EC2 instances. They saw that the code to define
an EC2 instance, used in multiple projects, all looked pretty much the same, so
decided it needed to be made DRY. However, once they had implemented a
module to replace the uses of the raw EC2 resource
declarations, they noticed that the references to their modules didn’t look any
more DRY than the original code. Their new module was a thin wrapper that
passed parameters through to the raw IaaS resource; it didn't add any value, but it
did add complexity to their codebase.
After a rethink, the team realized there was value in replacing some definitions
of virtual servers. They had multiple stacks that provisioned application servers
for deploying different Java microservices (this was before Kubernetes). Each
use of the EC2 resource code set many parameters, mostly the same other than a
few parameters specific to the Java artifact to deploy. So they created an
application server module, which really did add some value, capturing
the requirements for running a Java application in one place and simplifying
declarations in dozens of stack projects.
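As a sketch, in made-up syntax with hypothetical names, the difference between repeating the raw declarations and using the module might look like this:

    # Before: each stack repeated the full low-level server declaration,
    # with most parameters identical across stacks.
    virtual_server:
      name: product-browse-server
      ram: 4GB
      image: base-java-image
      subnet: application-subnet
      # ...many more parameters, mostly the same in every stack

    # After: a shared module captures the common requirements for running
    # a Java application, so each stack specifies only what differs.
    java_application_server:
      name: product-browse-server
      artifact: product-browse-service.jar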
Most of the other servers the team’s code provisioned were varied enough that it
wasn’t useful to wrap them in a module, so they reverted to using the raw EC2
instance resource code. This turned out to be simpler to understand and maintain
than their custom module had been.4
For example, Figure 5-7 shows one infrastructure product definition for a
container cluster being used to create two different cluster instances.
Figure 5-7. A shared infrastructure product deployable
Their first attempt to improve the situation was to create a module for the
database and reuse it at the code level. But reusing the module 60 times in the
same project was not faster (in fact it ran a bit slower) and the blast radius was
just as wide.
The team later moved their database module code into a separate Terraform
project. For each environment, they used this project to provision 60 separate
stack instances, each with its own state file.
The shared network product instance example in Figure 5-8 could be changed to
a shared-nothing implementation by provisioning a separate instance of the
common infrastructure product for each service. Taking this approach would
allow many services to be rapidly created and managed without friction from
coordination and contention.
Even a system that doesn’t need to scale and change at these levels can often
benefit from removing sharing. Many infrastructure design practices are based
on Iron Age constraints. Duplicating hardware networking and storage devices
was expensive and would usually lead to underutilization. But IaaS and
Infrastructure as Code make it simple, fast, and cheap to duplicate virtual
infrastructure and to automatically resize it to match usage.
Making a configuration change to the database instance for the checkout service
requires editing the database infrastructure product that is shared with
the other software services. So the scope of risk for a change to one instance is
all of the instances defined in the product, which adds overhead to the change. If
different teams are responsible for configuring databases within the shared
product, more overhead is needed to coordinate changes.
As with the previous example, the infrastructure stacks may be provisioned from
shared deployables. So the code to define database instances, for example, is not
duplicated. The teams that own the services don’t necessarily need to write the
infrastructure code for their databases themselves, they can use a shared
deployable stack in their infrastructure products. But, owning the infrastructure
product empowers the development teams to manage the lifecycle and
configuration of their infrastructure.
Some infrastructure resources need to be shared across workloads at runtime,
such as a container cluster or shared networking structures. Figure 5-11 shows
the services from previous examples, with some workload-specific
infrastructure, and some shared.
Conclusion
The last few chapters have explored how infrastructure code works, guidance for
designing infrastructure, and, in this chapter, infrastructure components. The
components described in this chapter will be used throughout the rest of the
book to explain patterns for testing, delivering, and managing infrastructure
using code. The terms used here (infrastructure products, infrastructure stacks,
and infrastructure code libraries) are not used universally across tool vendors.
However, the concepts apply to whatever tools you may use, so we need
consistent terminology to describe them in this book.
The next two chapters dive into the delivery context of infrastructure code. The
main goal of a delivery process for infrastructure code is testing that it works
correctly and safely, so that will be the focus of the next chapter.
1 https://github.com/gruntwork-io/terragrunt
2 https://developer.hashicorp.com/terraform/tutorials/cloud/no-code-
provisioning
5 Blast radius is the scope of the potential negative impact from a change or
event. I first saw this term used for software by Charity Majors in her post
Scrapbag of Useful Terraform Tips. Charity's recommendation to use separate
Terraform state files for every environment is an example of sharing
infrastructure deliverables across environments, in her case using the Wrapper
Stack pattern (Chapter 11).
6 Most infrastructure teams build and maintain custom scripting to run their
infrastructure tools.
Part III. Infrastructure Stacks
Chapter 6. Designing Environments
Software
The workloads that run in the environment. This can be applications, services,
or other system elements. In “Capabilities in an Enterprise Architecture”
these were described as business products and capabilities.
Platform services
Described in “Capabilities in an Enterprise Architecture” as technology
capabilities, these are services that enable the software to run. Instances of
platform services may be dedicated to a specific application, such as a
database instance, or shared across multiple applications, such as a container
cluster.
Infrastructure resources
Typically provided by an IaaS platform or physical infrastructure, these host
platform services and software.
More complex systems often find the need to split environments across more
than one of these architecture categories. For example, if different product
groups each have their own production environments, they probably also have
separate delivery environments to test their software releases.
There are multiple design forces at play with each of these multi-environment
architectures. Keep in mind that, as with any up-front system design, the only
thing you know for sure about the design decisions you make for your
environments is that they will be wrong. Even if you get the design right at first,
needs will change. Don’t assume that you will design your environments and be
done with it. Instead, be sure to consider how you will make changes once the
environments are in use. Changes may include splitting environments, merging
them, and moving systems between them.
Three key concerns for designing and implementing delivery environments are
segregation, consistency, and variation. These concerns are in tension with each
other and need to be balanced.
Segregation
Resources, services, and software in one environment should not affect other
delivery environments. For example, when testing a change to an application
in an upstream environment, it should not be able to change data in a
downstream environment, which could cause problems in production.
Allowing a change in a downstream environment to affect an upstream
environment can affect the outcomes of testing activities, making the delivery
process unreliable.
Consistency
Differences between resources or services across environments can also affect
the validity of testing. Testing in one environment may not uncover problems
that appear in another. Differences can also increase the effort and time
needed to deploy and configure software in each environment and complicate
troubleshooting. Lack of consistency across delivery environments is a major
contributor to poor performance on the four key metrics, and resolving it is
one of the leading drivers for the use of Infrastructure as Code.
Variation
Some variation between delivery environments is usually necessary. For
example, it may not be practical to provision test environments with the same
capacity as the production environment. Different individuals may have
different access privileges in different environments. At the very least, names
and IDs may be different (appserver-test, appserver-stage, appserver-prod).
As mentioned earlier, other multi-environment architectures may involve
multiple production environments. In situations where each production
environment hosts different software or different parts of a system, there is
usually a need to have a separate set of delivery environments for each
production environment, as shown in Figure 6-2.
In other cases, a single set of delivery environments may be used to test and
deliver the same software to multiple production environments, as in Figure 6-3.
This is a fan-out delivery pattern.
Figure 6-3. Single path to production fanned out to multiple production environments
The fan-out pattern works well when each production environment hosts an
instance of essentially the same software, perhaps with some context-specific
configuration.
ClotheSpin runs two different online stores, one for the core ClotheSpin brand,
and another for the new Hipsteroo brand, which was built as a separate greenfield
project. Each of the stores has its own set of business capabilities and platform
services, so there is no need to manage them both in a single environment.
Figure 6-5 adds backend systems that handle data analysis and fulfillment for
both the ClotheSpin and Hipsteroo online stores. These are hosted in a separate
environment.
Figure 6-5. Example of separate environments for integrated systems
Systems running in each of the storefront environments integrate with systems
running in the backend services environment. Decisions about where to draw the
boundaries between environments and which environment should host each
service are driven, at least partly, by the system architecture.
Both of the storefronts in our example are coupled with the backend services.
Hopefully, they are loosely coupled, meaning changes to a system on one side of
the integration can usually be made without changing the other system.
Environments are often defined by who owns them, either in terms of who uses
them or who provides and manages the infrastructure in them. For example, the
ClotheSpin and Hipsteroo online stores are developed by separate groups in the
ClotheSpin company. When the project to create the new Hipsteroo store was
launched, they were given a new environment to host their storefront software.
The intention was to develop the new store quickly without disrupting the
existing business. The environments are aligned to the groups that use them, so
that each can deploy their software without interfering with each other, and they
can customize the platform services to their needs.
Environments may also be aligned with the teams that manage them. The
storefront platform team manages the environment for the ClotheSpin storefront,
and a new team was created to build and manage the new Hipsteroo
environment. So the environments for the two storefronts align with the teams
that own their infrastructure.
The architectural concerns described in the previous section led to the creation of
a third environment to manage backend services, as shown in Figure 6-5. This
led to the discussion of whether to split out a third platform team to manage the
new environment, which underlines how Conway’s Law applies to infrastructure
environments.
Some organizations separate the responsibility for maintaining development and test environments from
maintaining production environments. In more extreme cases, each environment in a multi-stage path to
production may be managed by a separate team. Having a separate team managing separate environments
inevitably leads to different configurations and optimizations in each environment, which leads to
inconsistent behavior and wasted effort.
The worst case of this I’ve seen involved seven environments2 for a path to production, each managed by a
separate team. Deploying, configuring, and troubleshooting the software took nearly a week for each
environment, which added nearly seven weeks to every software release.
Aligning Environments To Governance Concerns
Strong segregation between environments helps with assurance, and can also
reduce the scope (or blast radius) of attacks. An attacker who gets access
to an environment that hosts a user-facing application may have little access to
back-end systems with sensitive data. Hosting security operational services, such
as monitoring, scanning, and log analysis, in a separate environment can prevent
attackers from covering their tracks. Separately hosting systems with a broad
scope of control, such as administrative and delivery pipeline services, can also
limit the scope of damage an attacker can do.
Environment replicas can be used for scalability scenarios using approaches very
similar to those used for availability, and may even use the same
implementation. When the active workload nears the maximum capacity for an
environment, one of the potential strategies is to provision an additional replica
environment and redirect a portion of traffic or work to it.
Availability and capacity can be handled at other levels of the system than the
environment. Compute capacity may be automatically scaled, data replicated,
and workloads shifted across different parts of a system within one environment.
Environment-level replication can be useful as one part of a multi-tiered
approach to these scenarios.
One concern when deciding on the right approach is latency. Even when running
on a public cloud service, your system will be hosted in a physical data center
which may be far enough from some of your users, in terms of network
connectivity, that latency may affect their experience. Replicating or distributing
some parts of your system so they are hosted closer to your users can address
this.
Figure 6-7 illustrates an option for ClotheSpin that uses a single environment to
centrally serve customers in the UK, Germany, and Korea.
ClotheSpin can use a CDN (Content Distribution Network) to cache static assets
like web pages, client-side scripts (JavaScript), and images closer to users. Some
executable code could also be distributed using edge computing offered by a
CDN provider. And even if a system’s implementation doesn’t lend itself to
easily using these types of services, parts of the system could be explicitly
deployed to data centers or cloud hosting locations closer to users. There is a
natural tendency to think of an environment as being located in a single region,
but you can choose to draw its boundaries across regions if it’s useful.
But many organizations prefer to define a separate environment for each region
to support customizations for local markets or businesses. For example,
ClotheSpin may have a logistics partner in South Korea which means they don’t
need to deploy the part of their storefront system that they use for logistics in the
UK and Germany. Other parts of the storefront software may need to be
customized to integrate with the local partner. So ClotheSpin needs to deploy
different builds of their software in different regions, leading them to run a
separate environment for each region, as shown in Figure 6-8.
Customizing the software for each region complicates the testing needed, which
in turn complicates the path to production for the software. If the customization
is implemented so a single build of the software can be used, and simply
configured separately for each environment, the teams may be able to use a fan-
out path to production, as shown earlier in Figure 6-3. The fan-out pattern
minimizes the effort and expense of maintaining multiple regions.
If the software is heavily customized, however, the teams will need separate
paths to production for each environment, each with a separate set of delivery
environments, as in Figure 6-2.
Often, different regions fall under different legal regulations. ClotheSpin may
need to meet different requirements in the UK, Germany (as part of the EU), and
South Korea, which could lead to differences in how infrastructure is
implemented. It’s often feasible to implement systems so that a single build of
the software and even the infrastructure meets the regulations of each region it’s
used in. However, the regulations may still require separate hosting. For
example, data residency laws control where personal data for users can be
transferred or stored. This leads back to governance concerns as a driver for
designing environment boundaries.
For example, ClotheSpin could offer to host online clothing stores for other
businesses. A small fashion label might want to sell its products online, so
ClotheSpin can host an instance of its storefront, customizing the look and
branding so end users see it as the fashion label’s website. The label may not
need all of the features that the full-fledged ClotheSpin storefront offers, so the
software would be customized for their needs.
The simplest way to use the same software to implement separate brands is to
deploy a separate instance of the software for each. An alternative is to
implement a multi-tenant system where a single hosted instance serves multiple
brands. Customized branding, features, and user data separation are implemented
in the software.
Multi-tenancy makes more efficient use of infrastructure resources and takes less
work to maintain. However, it requires more sophisticated software
development. Also, some governance concerns, such as data residency, may
require separate single-tenant systems where each instance serves only one
brand. Some organizations choose to host each single-tenant system in a separate
environment. The implications of this approach are similar to those described for
aligning environments to geography. The cost and effort to update, run, and
maintain multiple system instances can be difficult to control.
The level of abstraction you can use to design your environments is constrained
by the type of application. Normally, only cloud-native applications
implemented as serverless code or containers can be hosted in an environment
defined purely by deployment configuration, unless you implement custom
mechanisms for co-hosting instances for different environments on shared
virtual servers.
A highly regulated system may require that test code that hasn’t been
approved for production be prevented from accessing production data by
stronger segregation than a deployment configuration setting.
A runtime system may need to be optimized for different types of workloads.
For example, one workload may need to handle high volumes of transactions
with low latency, while another may load and analyze large data sets. These
conflicting requirements may need separate environments defined at a lower
level of abstraction than namespaces in a container cluster.
Upgrading or even patching a runtime system may require downtime, or at
the least risk of disruption for workloads running on it. The more
environments running on the system, the broader the blast radius for planned
or unplanned disruption. The number of teams and stakeholders who need to
be involved in scheduling upgrades can add friction to the process. This
friction can lead to upgrading and patching less often, which then leads to
running outdated software, perhaps even versions with exploitable
vulnerabilities.
Even when a runtime system has automated recovery features, some
availability scenarios can only be managed by running multiple instances that
share less of the underlying infrastructure. A failover between two zones of a
container cluster doesn’t help when cluster management services fail.
If you are managing shared infrastructure for multiple environments, you need
similarly strong change management processes. An application build that
includes untested changes may be deployed to a test environment. But untested
infrastructure changes should not be deployed to that same environment.
Application development and test environments are business-critical systems for
software delivery. Infrastructure changes should be tested in a separate
environment before applying them to an environment that users rely on, as
shown in Figure 6-10.
The diagram shows the three delivery environments used to test an application.
These are virtualized environments, each one running its own application cluster
instance. The team that manages the application cluster tests changes to it in an
environment that is upstream from any of the application testing environments.
This “app cluster testing” environment would be used to test changes to
infrastructure code or to any system software deployed as part of the application
cluster, such as a new Kubernetes release. Chapter 12 describes using automated
testing and pipelines for managing changes to the infrastructure code itself.
Although it’s not something that most teams will be aware of, the IaaS vendor
will test changes that it makes to the underlying systems and services before
applying them to customer systems. From their point of view, the shared
hardware that underpins customer hosting is a production environment, with
upstream delivery environments unseen by customers.
An IaaS resource group is the default level for defining permissions, allocating
costs, and other fundamental configuration that applies to all of the resources
allocated within it. A key question for an environment architecture when using
IaaS is how to align resource structures and environments.
However, it’s easier and more reliable to segregate resources and configurations
between IaaS resource groups than within them. So another approach is to create
a separate resource group for each environment. This approach can be taken
further, splitting parts of a single environment across more than one resource
group. These structures may be divided following similar design forces as those
described for environments.
Figure 6-11 shows how ClotheSpin uses three AWS accounts to create a single
environment.
Figure 6-11. One environment composed of multiple IaaS resource groups
One account runs the application software that directly serves users. A separate
management account runs services with administration permissions to make
changes to the applications account. A third account runs monitoring services,
which receive logs from the applications account and can make read-only
requests into it. The applications account has no access to the other two
accounts. These three accounts have clearly separated and limited permissions
according to the needs of the workloads within each.
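As a hedged illustration of how this kind of separation can be expressed in code, the applications account might grant the monitoring account read-only access through a cross-account IAM role, along these lines (the account ID and role name are placeholders, not ClotheSpin’s actual configuration):

    # Defined in the applications account: a role the monitoring account can assume
    resource "aws_iam_role" "monitoring_readonly" {
      name = "monitoring-readonly"

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { AWS = "arn:aws:iam::111111111111:root" }  # placeholder monitoring account ID
          Action    = "sts:AssumeRole"
        }]
      })
    }

    # Attach AWS's managed read-only policy, so the role can read but not change anything
    resource "aws_iam_role_policy_attachment" "monitoring_readonly" {
      role       = aws_iam_role.monitoring_readonly.name
      policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
    }

The applications account defines nothing equivalent for itself in the other accounts, which is what keeps its permissions limited.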
Going one step further, maintaining a separate IaaS resource group for each
application or service not only aligns the permissions and configuration with
current team ownership but also simplifies realignment when ownership changes.
As we know, system designs will change, including ownership of parts of a
system by different teams. It’s easier to reassign the permissions for a resource
group from one team to another than to move the infrastructure and
configuration from one team’s resource group to another’s. The lowest level of
granularity for ownership of the contents of infrastructure is typically the
application or service, so aligning IaaS resource groups at this level creates the
most flexibility for managing changes.
I’ll describe two antipatterns and one pattern for implementing environments
using infrastructure stacks. Each of these patterns describes a way to define
multiple environments using infrastructure stacks. Some systems are composed
of multiple stacks, as I described in Chapter 10. I’ll explain what this looks like
for multiple environments in “Building Environments with Multiple Stacks”.
For example, if there are three environments for testing and running an
application, a single stack project includes the code for all three of the
environments (Figure 6-12).
Figure 6-12. A multiple-environment stack manages the infrastructure for multiple environments as a single
stack instance
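To make the shape of this antipattern concrete, here is a rough sketch of what such a project might contain, using Terraform-style code with invented resource names and values:

    # One project, and one state, holding all three environments
    resource "aws_instance" "appserver_test" {
      ami           = "ami-0abcd1234example"   # placeholder AMI ID
      instance_type = "t3.small"
      tags          = { Name = "appserver-test" }
    }

    resource "aws_instance" "appserver_staging" {
      ami           = "ami-0abcd1234example"
      instance_type = "t3.small"
      tags          = { Name = "appserver-staging" }
    }

    resource "aws_instance" "appserver_prod" {
      ami           = "ami-0abcd1234example"
      instance_type = "t3.large"
      tags          = { Name = "appserver-prod" }
    }

Every plan and apply run against a project like this evaluates all three environments together, so a mistake intended for the test resources can just as easily change production.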
Motivations
Many people create this type of structure when they’re learning a new stack tool
because it seems natural to add new environments into an existing project.
Consequences
When running a tool to update a stack instance, the scope of a potential change is
everything in the stack. If you have a mistake or conflict in your code,
everything in the instance is vulnerable.4
Related patterns
You can limit the blast radius of changes by dividing environments into separate
stacks. One obvious way to do this is the snowflake as code (see “Antipattern:
Snowflakes As Code”), where each environment is a separate stack project,
although this is considered an antipattern.
A better approach is the reusable stack pattern (see “Pattern: Reusable Stack”).
A single project is used to define the generic structure for an environment and is
then used to manage a separate stack instance for each environment. Although
this involves using a single project, the project is only applied to one
environment instance at a time. So the blast radius for changes is limited to that
one environment.
In our example of three environments named test, staging, and production, there
is a separate infrastructure stack project for each of these environments
(Figure 6-13). Changes are made by editing the code for one environment and
then copying the changes into the projects for each of the other environments in
turn.
Figure 6-13. Snowflakes as code use a separate copy of the stack project code for each instance
Motivation
ClotheSpin has a group that runs white-label stores for different fashion brands.
They created an AWS CDK project to create an environment to host the store
instance for their first customer. When they signed their second customer, they
copied the code to a new CDK project and customized it for that customer’s
needs. They followed this pattern (or antipattern) for each new customer.
The ClotheSpin white-label team took this approach because it was the simplest
way to create an environment for each new customer.
Applicability
Consequences
A snowflake as code might be consistent when it’s first set up, but variations
creep in over time.
After a year the ClotheSpin white-label team was running nine different
production environment instances for their customers and multiple delivery
instances to test changes. A new version of Kubernetes was released that
included essential fixes. When the team upgraded a test instance, they discovered
that they needed to change its infrastructure code to get the new version to work.
Upgrading, testing, and fixing each customer instance took a week. Two months
later, as they were finishing the last upgrades, a Kubernetes bugfix was released
that addressed a newly discovered security vulnerability, and the team began the
process again.
Implementation
You create snowflakes by copying the project code from one stack instance into
a new project. You then edit the code to customize it for the new instance. When
you make a change to one stack, you need to copy and paste it across all of the
other stack projects, while keeping the customizations in each one.
Related patterns
The wrapper stack pattern is also similar to snowflakes as code. A wrapper stack
uses a separate stack project for each environment to set configuration
parameters. But the code for the stack is implemented in stack components, such
as reusable module code. That code itself is not copied and pasted to each
environment, but promoted as an artifact, much like a reusable stack. However,
if people add more than basic stack instance parameters to the wrapper stack
projects, it can devolve into the snowflake as code antipattern.
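As a rough sketch, a wrapper stack project for one environment might contain little more than a module call, along these lines (the module source and parameter names are invented for the example):

    # staging/main.tf: the wrapper stack for the staging environment.
    # The real infrastructure code lives in the shared module, which is
    # promoted between environments rather than copied.
    module "appserver_stack" {
      source = "../modules/appserver-stack"

      environment   = "staging"
      instance_type = "t3.small"
      min_capacity  = 2
    }

The risk described above appears when teams start adding resources or logic directly into these per-environment projects, at which point they drift toward snowflakes as code.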
In cases where stack instances are meant to represent the same stack, the
reusable stack pattern is usually more appropriate.
Motivation
However, the team found that making a change across all of their customers was
still time-consuming. They would change and test the relevant library in a
common test environment. Then they deployed the library into a separate test
environment for each customer, customizing for the customer’s configuration,
and testing and fixing customer-specific issues.
So the ClotheSpin team decided to create a single CDK project for their white-
label customers. They made the project configurable to meet the needs of
different customers. Each time they changed their CDK project, they deployed
and tested it in a single environment. Their tests needed to prove that different
configuration options worked, but the overhead of testing a single build was less
than deploying and testing a build for each customer.
Applicability
You can use a reusable stack for multiple environments which are essentially
replicas of the same infrastructure. Reusable stacks are essential for delivery
environments. Operability scenarios such as availability can be implemented by
deploying a reusable stack to create a failover environment, possibly
automatically when failures are detected. They are also useful for deploying
instances of a common service in different geographical regions, and for white-
label situations as described in the ClotheSpin example.
Consequences
The ability to provision and update multiple stacks from the same project
enhances scalability, reliability, and throughput. You can manage more instances
with less effort, make changes with a lower risk of failure, and roll changes out
to more systems more rapidly.
You typically need to configure some aspects of the stack differently for
different instances, even if it’s just what you name things. I’ll spend a whole
chapter talking about this (Chapter 11).
You should test your stack project code before you apply changes to business-
critical infrastructure. I’ll spend multiple chapters on this, including Chapters 8
and 9.
Implementation
You create a reusable stack as an infrastructure stack project and then run the
stack management tool each time you want to create or update an instance of the
stack. Use the syntax of the stack tool command to tell it which instance you
want to create or update. With Terraform, for example, you would specify a
different state file or workspace for each instance. With CloudFormation, you
pass a unique stack ID for each instance.
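For instance, the commands might look something like the following, with illustrative instance names. Terraform keeps a separate state for each workspace, while CloudFormation keeps a separate stack for each stack name:

    # Terraform: one workspace (and therefore one state) per environment instance
    terraform workspace new test        # only needed the first time
    terraform workspace select test
    terraform apply -var-file=test.tfvars

    # CloudFormation: one stack name per environment instance
    aws cloudformation deploy --template-file stack.yml --stack-name appserver-test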
The following example command provisions two stack instances from a single
project using a fictional stack management command. The command takes an
argument that identifies unique instances:
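(The command name and argument syntax here are placeholders rather than any real tool’s interface.)

    stack up instance=test
    stack up instance=staging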
As a rule, you should use simple parameters to define differences between stack
instances—strings, numbers, or in some cases, lists. Additionally, the
infrastructure created by a reusable stack should not vary much across instances.
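A sketch of what this might look like as stack parameters, using Terraform-style variable declarations with invented names:

    variable "environment" {
      description = "Name of the environment instance, such as test, staging, or prod"
      type        = string
    }

    variable "instance_type" {
      type    = string
      default = "t3.small"
    }

    variable "allowed_cidrs" {
      type    = list(string)    # a simple list is fine; avoid deeply nested structures
      default = []
    }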
Related patterns
The wrapper stack pattern uses stack components to define a reusable stack, but
uses a different stack project to set parameter values for each instance.
Conclusion
Environment architecture is a topic that is often taken for granted. Many IT
organizations suffer from design decisions made not through conscious thought,
but through habits and assumptions about what is “industry best practice”.5
This chapter describes different aspects of designing and implementing a
conscious architecture for environments. Infrastructure as Code creates an
opportunity to move beyond heavyweight, static environments. Environments
should be evolvable, as with every part of a system, so they can be continuously
adapted and improved with changing needs and better understanding.
The reusable stack is a workhorse pattern for teams who need to manage large
infrastructures, helping to easily create and maintain multiple environments with
a high level of consistency and good governance. Chapter 7 will discuss ways to
reliably make and deliver changes to stacks across environments. However, a
key challenge that reusable stacks introduce is managing necessary differences
between stack instances. The next chapter, Chapter 11, focuses on ways of
managing instance-specific stack configuration.
1 The two storefronts could share services. The ClotheSpin team has had
many debates about whether and how to consolidate their services, but the
business priority was to accelerate the development of Hipsteroo without
disrupting the existing ClotheSpin business, which led to separate
implementations.
3 In practice, your cloud vendor may give you options to override the
abstraction, for example by specifying that particular resources should not share
physical hardware with each other.
5 I mentioned in the preface why I’m not a fan of the term “best practice”.
About the Author
Kief Morris (he/him) is Global Director of Cloud Engineering at
ThoughtWorks. He drives conversations across roles, regions, and industries at
companies ranging from global enterprises to early stage startups. He enjoys
working and talking with people to explore better engineering practices,
architecture design principles, and delivery practices for building systems on the
cloud.
Kief ran his first online system, a bulletin board system (BBS) in Florida in the
early 1990s. He later enrolled in an MSc program in computer science at the
University of Tennessee because it seemed like the easiest way to get a real
internet connection. Joining the CS department’s system administration team
gave him exposure to managing hundreds of machines running a variety of Unix
flavors.
When the dot-com bubble began to inflate, Kief moved to London, drawn by the
multicultural mixture of industries and people. He’s still there, living with his
wife, son, and cat.
Most of the companies Kief worked for before ThoughtWorks were post-
startups, looking to build and scale. The titles he’s been given or self-applied
include Software Developer, Systems Administrator, Deputy Technical Director,
R&D Manager, Hosting Manager, Technical Lead, Technical Architect,
Consultant, and Director of Cloud Engineering.