
Infrastructure as Code

THIRD EDITION

Building foundations for leveraging the cloud

Kief Morris
Infrastructure as Code
by Kief Morris

Copyright © FILL IN YEAR O’Reilly Media. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: John Devins and Jill Leonard

Production Editor: FILL IN PRODUCTION EDITOR

Copyeditor: FILL IN COPYEDITOR

Proofreader: FILL IN PROOFREADER

Indexer: FILL IN INDEXER

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Kate Dullea


October 2024: Third Edition

Revision History for the Third Edition


YYYY-MM-DD: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098150358 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Infrastructure as Code, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author(s) and do not represent
the publisher’s views. While the publisher and the author(s) have used good faith
efforts to ensure that the information and instructions contained in this work are
accurate, the publisher and the author(s) disclaim all responsibility for errors or
omissions, including without limitation responsibility for damages resulting
from the use of or reliance on this work. Use of the information and instructions
contained in this work is at your own risk. If any code samples or other
technology this work contains or describes is subject to open source licenses or
the intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.

978-1-098-15035-8

[FILL IN]
Preface
Ten years ago, a CIO at a global bank scoffed when I suggested they look into
private cloud technologies and infrastructure automation tooling: “That kind of
thing might be fine for startups, but we’re too large and our requirements are too
complex.” Even a few years ago, many enterprises considered using public
clouds to be out of the question.

These days cloud technology is pervasive. Even the largest, most hidebound
organizations are rapidly adopting a “cloud-first” strategy. Those organizations
that find themselves unable to consider public clouds are adopting dynamically
provisioned infrastructure platforms in their data centers.1 The capabilities that
these platforms offer are evolving and improving so quickly that it’s hard to
ignore them without risking obsolescence.

Cloud and automation technologies remove barriers to making changes to production systems, and this creates new challenges. While most organizations
want to speed up their pace of change, they can’t afford to ignore risks and the
need for governance. Traditional processes and techniques for changing
infrastructure safely are not designed to cope with a rapid pace of change. These
ways of working tend to throttle the benefits of modern, Cloud Age technologies—slowing work down and harming stability.2

In Chapter 1 I use the terms “Iron Age” and “Cloud Age” (“From the Iron Age
to the Cloud Age”) to describe the different philosophies that apply to managing
physical infrastructure, where mistakes are slow and costly to correct, and
managing virtual infrastructure, where mistakes can be quickly detected and
fixed.

Infrastructure as Code tools create the opportunity to work in ways that help you
to deliver changes more frequently, quickly, and reliably, improving the overall
quality of your systems. But the benefits don’t come from the tools themselves.
They come from how you use them. The trick is to leverage the technology to
embed quality, reliability, and compliance into the process of making changes.

Why I Wrote This Book


I wrote the first edition of this book because I didn’t see a cohesive collection of
guidance on how to manage Infrastructure as Code. There was plenty of advice
scattered across blog posts, conference talks, and documentation for products
and projects. But a practitioner needed to sift through everything and piece a
strategy together for themselves, and most people simply didn’t have time.

The experience of writing the first edition was amazing. It gave me the
opportunity to travel and to talk with people around the world about their own
experiences. These conversations gave me new insights and exposed me to new
challenges. I learned that the value of writing a book, speaking at conferences,
and consulting with clients is that it fosters conversations. As an industry, we are
still gathering, sharing, and evolving our ideas for managing Infrastructure as
Code.
What’s New and Different in This Edition
Things have moved along since the first edition came out in June 2016. That
edition was subtitled “Managing Servers in the Cloud,” which reflected the fact
that most infrastructure automation until that point had been focused on
configuring servers. Since then, containers and clusters have become a much
bigger deal, and the infrastructure action has moved to managing collections of
infrastructure resources provisioned from cloud platforms—what I call stacks in
this book.

As a result, this edition involves more coverage of building stacks, which is the
remit of tools like CloudFormation and Terraform. The view I’ve taken is that
we use stack management tools to assemble collections of infrastructure that
provide application runtime environments. Those runtime environments may
include servers, clusters, and serverless execution environments.

I’ve changed quite a bit based on what I’ve learned about the evolving
challenges and needs of teams building infrastructure. As I’ve already touched
on in this preface, I see making it safe and easy to change infrastructure as the
key benefit of Infrastructure as Code. I believe people underestimate the
importance of this by thinking that infrastructure is something you build and
forget.

But too many teams I meet struggle to meet the needs of their organizations;
they are not able to expand and scale quickly enough, support the pace of
software delivery, or provide the reliability and security expected. And when we
dig into the details of their challenges, it’s that they are overwhelmed by the
need to update, fix, and improve their systems. So I’ve doubled down on this as
the core theme of this book.

This edition introduces three core practices for using Infrastructure as Code to
make changes safely and easily:

Define everything as code
This one is obvious from the name, and creates repeatability and consistency.

Continuously test and deliver all work in progress
Each change enhances safety. It also makes it possible to move faster and with more confidence.

Build small, simple pieces that you can change independently
These are easier and safer to change than larger pieces.

These three practices are mutually reinforcing. Code is easy to track, version,
and deliver across the stages of a change management process. It’s easier to
continuously test smaller pieces. Continuously testing each piece on its own
forces you to keep a loosely coupled design.

These practices and the details of how to apply them are familiar from the world
of software development. I drew on Agile software engineering and delivery
practices for the first edition of the book. For this edition, I’ve also drawn on
rules and practices for effective design.
In the past few years, I’ve seen teams struggle with larger and more complicated
infrastructure systems, and I’ve seen the benefits of applying lessons learned in
software design patterns and principles, so I’ve included several chapters in this
book on how to do this.

I’ve also seen that organizing and working with infrastructure code is difficult
for many teams, so I’ve addressed various pain points. I describe how to keep
codebases well organized, how to provide development and test instances for
infrastructure, and how to manage the collaboration of multiple people,
including those responsible for governance.

What’s Next
I don’t believe we’ve matured as an industry in how we manage infrastructure.
I’m hoping this book gives a decent view of what teams are finding effective
these days. And a bit of aspiration of what we can do better.

I fully expect that in another five years the toolchains and approaches will
evolve. We could see more general-purpose languages used to build libraries,
and we could be dynamically generating infrastructure rather than defining the
static details of environments at a low level. We certainly need to get better at
managing changes to live infrastructure. Most teams I know are scared when
applying code to live infrastructure. (One team referred to Terraform as
“Terrorform,” but users of other tools all feel this way.)
What This Book Is and Isn’t
The thesis of this book is that exploring different ways of using tools to
implement infrastructure can help us to improve the quality of services we
provide. We aim to use speed and frequency of delivery to improve the
reliability and quality of what we deliver.

So the focus of this book is less on specific tools, and more on how to use them.

Although I mention examples of tools for particular functions like configuring servers and provisioning stacks, you won’t find details of how to use a particular tool or cloud platform. You will find patterns, practices, and techniques that should be relevant to whatever tools and platforms you use.

You won’t find code examples for real-world tools or clouds. Tools change too
quickly in this field to keep code examples accurate, but the advice in this book
should age more slowly, and be applicable across tools. Instead, I write
pseudocode examples for fictional tools to illustrate concepts. See the book’s
companion website for references to example projects and code.

This book won’t guide you on how to use the Linux operating system,
Kubernetes cluster configuration, or network routing. The scope of this book
does include ways to provision infrastructure resources to create these things,
and how to use code to deliver them. I share different cluster topology patterns
and approaches for defining and managing clusters as code. I describe patterns
for provisioning, configuring, and changing server instances using code.
You should supplement the practices in this book with resources on the specific
operating systems, clustering technologies, and cloud platforms. Again, this
book explains approaches for using these tools and technologies that are relevant
regardless of the particular tool.

This book is also light on operability topics like monitoring and observability,
log aggregation, identity management, and other concerns that you need to
support services in a cloud environment. What’s in here should help you to
manage the infrastructure needed for these services as code, but the details of the
specific services are, again, something you’ll find in more specific resources.

Some History of Infrastructure as Code


Infrastructure as Code tools and practices emerged well before the term. Systems
administrators have been using scripts to help them manage systems since the
beginning. Mark Burgess created the pioneering CFEngine system in 1993. I
first learned practices for using code to fully automate provisioning and updates
of servers from the Infrastructures.org website in the early 2000s.3

Infrastructure as Code has grown along with the DevOps movement. Andrew
Clay-Shafer and Patrick Debois triggered the DevOps movement with a talk at
the Agile 2008 conference. The first uses I’ve found for the term “Infrastructure
as Code” are from a talk called “Agile Infrastructure” that Clay-Shafer gave at
the Velocity conference in 2009, and an article John Willis wrote summarizing
the talk. Adam Jacob, who cofounded Chef, and Luke Kanies, founder of
Puppet, were also using the phrase around this time.
Who This Book Is For
This book is for people who are involved in providing and using infrastructure to
deliver and run software. You may have a background in systems and
infrastructure, or in software development and delivery. Your role may be
engineering, testing, architecture, or management. I’m assuming you have some
exposure to cloud or virtualized infrastructure and tools for automating
infrastructure using code.

Readers new to Infrastructure as Code should find this book a good introduction
to the topic, although you will get the most out of it if you are familiar with how
infrastructure cloud platforms work, and the basics of at least one infrastructure
coding tool.

Those who have more experience working with these tools should find a mixture
of familiar and new concepts and approaches. The content should create a
common language and articulate challenges and solutions in ways that
experienced practitioners and teams find useful.

Principles, Practices, and Patterns


I use the terms principles, practices, and patterns (and antipatterns) to describe
essential concepts. Here are the ways I use each of these terms:

Principle
A principle is a rule that helps you to choose between potential solutions.

Practice
A practice is a way of implementing something. A given practice is not
always the only way to do something, and may not even be the best way to do
it for a particular situation. You should use principles to guide you in
choosing the most appropriate practice for a given situation.

Pattern
A pattern is a potential solution to a problem. It’s very similar to a practice in
that different patterns may be more effective in different contexts. Each
pattern is described in a format that should help you to evaluate how relevant
it is for your problem.

Antipattern
An antipattern is a potential solution that you should avoid in most situations.
Usually, it’s either something that seems like a good idea or else it’s
something that you fall into doing without realizing it.

WHY I DON’T USE THE TERM “BEST PRACTICE”

Folks in our industry love to talk about “best practices.” The problem with this
term is that it often leads people to think there is only one solution to a problem,
no matter what the context.

I prefer to describe practices and patterns, and note when they are useful and
what their limitations are. I do describe some of these as being more effective or
more appropriate, but I try to be open to alternatives. For practices that I believe
are less effective, I hope I explain why I think this.

The ShopSpinner Examples


I use a fictional company called ShopSpinner to illustrate concepts throughout
this book. ShopSpinner builds and runs online stores for its customers.

ShopSpinner runs on FCS, the Fictional Cloud Service, a public IaaS provider
with services that include FSI (Fictional Server Images) and FKS (Fictional
Kubernetes Service). It uses the Stackmaker tool—an analog of Terraform,
CloudFormation, and Pulumi—to define and manage infrastructure on its cloud.
It configures servers with the Servermaker tool, which is much like Ansible,
Chef, or Puppet.

ShopSpinner’s infrastructure and system design may vary depending on the point I’m using it to make, as will the syntax of the code and command-line arguments for its fictional tools.
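
To give a flavor of what these examples look like, here is a hypothetical Servermaker definition for a ShopSpinner application server. It is pseudocode rather than any real tool’s syntax, and every name in it (the image, role, and settings) is invented for illustration:

    # An application server defined with the fictional Servermaker tool
    server:
      source_image: fsi-base-linux-2023   # a Fictional Server Image (FSI)
      memory: 8GB
      roles:
        - application_server              # a role defined elsewhere in the codebase
      configuration:
        app_name: shop_store
        environment: test

Applying a definition like this builds the server; applying it again builds an identical one.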

Conventions Used in This Book


The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP

This element signifies a tip or suggestion.

NOTE

This element signifies a general note.

WARNING

This element indicates a warning or caution.

O’Reilly Online Learning


NOTE

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and
insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and
expertise through books, articles, and our online learning platform. O’Reilly’s
online learning platform gives you on-demand access to live training courses, in-
depth learning paths, interactive coding environments, and a vast collection of
text and video from O’Reilly and 200+ other publishers. For more information,
visit http://oreilly.com.

How to Contact Us
Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/infra-as-code-2e.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

1 For example, many government and financial organizations in countries without a cloud presence are prevented by law from hosting data or transactions abroad.

2 The research published by DORA in the State of DevOps Report finds that
heavyweight change-management processes correlate to poor performance on
change failure rates and other measures of software delivery effectiveness.

3 The original content remains on this site as of summer 2020, although it hadn’t been updated since 2007.
Part I. Foundations
Chapter 1. What Is Infrastructure as Code?

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

This will be the 1st chapter of the final book. Please note that the GitHub repo
will be made active later on.

If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.

If you work in a team that builds and runs IT infrastructure, then cloud and
infrastructure automation tools should help you deliver more value in less time,
and to do it more reliably. In practice, however, they drive ever-increasing size,
complexity, and diversity of things to manage.

These technologies have become especially relevant over the past decade as
organizations brought digital technology deeper into the core of what they do.
Previously, many leaders had treated the IT function as an unfortunate
distraction that should be outsourced and ignored. But digitally sophisticated
competitors, users, and staff drove more processes and products online and
created entirely new categories of services like streaming media, social media,
and machine learning.

Cloud and automation have helped by making it far easier for organizations to
add and change digital services. But many teams have struggled to manage the
proliferation of cloud-hosted products, applications, services, and platforms. As
one of my clients told me, “Moving from the data center, where we were limited
to the capacity of our hardware, to the cloud, where capacity is effectively
unlimited, knocked down the walls that kept our tire fire contained.”1

Using code to define and build infrastructure creates the opportunity to bring a
wide set of tools, practices, and patterns to bear on the problem of how to design
and implement systems. This book explores different practices and patterns for
Infrastructure as Code. I’ll describe the problems that Infrastructure as Code can
help with, challenges that come from different approaches to using infrastructure
code, and patterns and practices that have proven to be useful.

I start this chapter, unsurprisingly, by defining Infrastructure as Code. That done, I put infrastructure code in the context of the shift from working with physical systems (the Iron Age) to working with automated technology and tools (the Cloud Age). As with any technology shift, the changes go beyond the technology itself to alter how we think about and approach our work.

Next, I discuss the goals of Infrastructure as Code, and what it is that an organization stands to gain from adopting the approaches I describe throughout the book. I frequently refer back to these points in later chapters to give context for considering various practices and patterns.

Infrastructure as Code
A literal definition of Infrastructure as Code is the practice of provisioning and
managing infrastructure using code, as opposed to doing it interactively, or with
non-code automation tools. By “interactively”, I mean using a command-line
tool or GUI interface to carry out tasks. The alternative is writing code that can
then be distributed and applied by automated systems.
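
As a minimal sketch of the difference, an infrastructure definition written as code is simply text in a file, which a tool (here, the book’s fictional Stackmaker) can apply over and over, whether run by a person or by an automated delivery system. The resource types and names below are invented for illustration:

    # store_network.infra - declares what should exist; the tool works out
    # what to create, change, or destroy so reality matches the file
    stack: store_network
      vlan:
        name: store_vlan
        address_range: 10.2.0.0/24
      gateway:
        name: store_gateway
        vlan: store_vlan

Because the definition lives in a file, anyone, or any automated system, that applies it gets the same result, which is what makes changes consistent and repeatable.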

Interactive infrastructure management doesn’t help in doing things consistently or repeatably, since you decide how to implement each change as you work.
This leads to inconsistent implementations and mistakes. In Chapter 2 I’ll talk
about some of the principles and goals that using code helps to achieve, most of
which are nearly impossible with interactive infrastructure management.

Non-code automation tools typically provide a GUI interface to define the infrastructure you want, for example by choosing options from drop-down
menus. These tools usually have ways to save configurations that you build, for
example creating templates, which means you can build multiple instances
consistently. They may also have ways to update existing infrastructure to keep
consistency over time. However, they store the definitions for infrastructure in
closed systems, rather than in open files. With these systems you aren’t able to
exploit the vast ecosystem of tools for working with code, such as source control
repositories, code scanning tools, automated testing, and automated delivery, to
name just a few. I’ll go into more detail about the useful things you can do with
infrastructure code that you can’t do with non-code automation in Chapter 4, and
also in, well, pretty much this entire book.
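
As a hedged sketch of what that ecosystem looks like in practice, suppose an infrastructure definition like the store_network example lives in a file in a Git repository. The git commands here are real; the file name and the assumption that a delivery pipeline reacts to the push are illustrative:

    git checkout -b increase-gateway-capacity    # start the change on a branch
    # ... edit store_network.infra in any text editor ...
    git diff                                     # review exactly what will change
    git commit -am "Increase gateway capacity for seasonal traffic"
    git push origin increase-gateway-capacity    # a pipeline can now scan, test,
                                                 # and deliver the change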

The way I define Infrastructure as Code is about more than the mechanics of
how infrastructure is defined and provisioned. Infrastructure as Code is about
applying the principles, practices, and tools of software engineering to
infrastructure.

Throughout this book, I explain how to use modern software development practices such as Test Driven Development (TDD), Continuous Integration (CI),
and Continuous Delivery (CD) to make changing infrastructure fast and safe. I
also describe how principles of software design help create resilient, well-
maintained infrastructure. These practices and design approaches reinforce each
other. Well-designed infrastructure is easier to test and deliver. Automated
testing and delivery drive simpler and cleaner designs.

From the Iron Age to the Cloud Age


We can use modern technologies like cloud, virtualization, and automated
infrastructure, deployment, and testing to carry out tasks much more quickly
than we can by managing physical hardware and manually typing commands or
clicking on GUIs (Table 1-1). But as many organizations have discovered,
simply adopting these tools doesn’t necessarily bring visible benefits.

Table 1-1. Technology changes in the Cloud Age

                     Iron Age              Cloud Age
Types of resources   Physical hardware     Virtualized resources
Provisioning         Takes days or weeks   Takes minutes or seconds
Processes            Manual (runbooks)     Automated (as code)

The ability to provision new infrastructure in moments, and even to create it automatically, can lead to uncontrolled sprawl. If you don’t have good processes for ensuring systems are well-managed and maintained, then the unbounded nature of cloud technology leads to spiraling technical debt.

Cloud Age Approaches To Change Management

Many organizations attempt to control the potential chaos by using well-known, traditional IT governance models. These models focus on throttling the speed of change, requiring implementation details to be decided before work begins, high-effort process gates, and strictly siloed responsibilities between teams.

However, these models were designed for the Iron Age when changes were
slow, which made mistakes difficult to correct. It seemed reasonable to add a
week to a task that would take a week to implement when it would take a week
or more to correct a mistake later. Adding weeks to a task that takes less than an
hour to implement, and a few minutes to correct, destroys the benefits of cloud
age technology.
What’s more, research2 suggests that these heavyweight processes were never
very effective in preventing errors in the first place. In fact, they can make things
worse by dividing knowledge and accountability across silos and long time
periods.

Fortunately, the emergence of cloud age technologies has coincided with the
growth of what I’d call cloud age approaches to work, including lean, agile, and
DevOps. These approaches encourage close collaboration, short feedback loops,
and a minimalist approach to technical implementation. Automation is leveraged
to fundamentally shift thinking about change and risk, which results not only in
faster delivery but also higher quality (Table 1-2).

Table 1-2. Ways of working in the Iron Age and the Cloud Age

                         Iron Age                                  Cloud Age
Cost of change           High                                      Low
Changes are              Risks to be minimized                     Essential to improve quality
A change of plan means   Failure of planning                       Success in learning and improving
Optimize to              Reduce opportunities to fail              Maximize speed of improvement
Delivery approach        Large batches, test at the end            Small changes, test continuously
Architectures            Monolithic (fewer, larger moving parts)   Microservices architectures (more, smaller parts)

This ability to leverage speed of change to improve quality starts with cloud
technology, which creates the capability to provision and change infrastructure
on demand. We need automation to use this capability. So another definition of
Infrastructure as Code is a Cloud Age approach to automating cloud
infrastructure in a way that embraces continuous change to achieve high
reliability and quality.

DEVOPS AND INFRASTRUCTURE AS CODE

People define DevOps in different ways. The fundamental idea of DevOps is collaboration across all of the
people involved in building and running software. This includes not only developers and operations people,
but also testers, security specialists, architects, and even managers. There is no one way to implement
DevOps.

Many people look at DevOps and only notice the technology that people use to collaborate across software
delivery. All too often this leads to reducing “DevOps” to tooling. I’ve seen “DevOps” defined as running
an application deployment tool (usually Jenkins), often in a way that increases barriers across the software
delivery path.

DevOps is first and foremost about people, culture, and ways of working. Tools and practices like
Infrastructure as Code are valuable to the extent that they’re used to bridge gaps and improve collaboration.

The Path To The Cloud Age


DevOps, Infrastructure as Code (the name, at least), and Cloud all emerged
between 2005-2010. In the early years, these were largely experimental,
dismissed by larger organizations that considered themselves too serious to need
to change how they approached IT. In the first edition of this book, published in
2016, I included arguments for why readers really should consider using cloud
even for critical domains like finance.

The mid-2010s could be considered the “Shadow Age” of IT. Cloud, DevOps,
Continuous Delivery, and Infrastructure as Code were mostly used either by
startups or by separate digital departments of larger organizations. These
departments were usually set up outside the remit of the existing organization,
partly to protect them from the cultural norms and formal policies of the main
organization, which people sometimes call “antibodies”. In some cases they
were created quietly within existing departments, as “shadow IT”.3

The mantra of the shadow age was “move fast and break things.”4 Casting aside the shackles of iron age governance was seen as the key to explosive growth. In the view of digital hipsters, it was time to leave the crusty old-timers to their CAB5 meetings, mainframes, and bankruptcies (“Say hello to Blockbuster and Kodak!”).

Cavalier attitudes towards governance made it easier for traditionalists to dismiss the newer technologies and related ideas as irresponsible and doomed to failure.
At the same time, new technology enthusiasts have often ignored the very real
concerns and risks that underpin what may seem like legacy mindsets. We need
to learn how to leverage newer technologies and ways of working to address
fundamental issues, rather than either rejecting the new ways or dismissing the
issues as legacy.6

As the decade wore on, and digital businesses overtook slower businesses in
more and more markets, digital technologies and approaches were pulled closer
to the center of even older businesses. Digital departments were assimilated, and
boards asked to see strategies to migrate core business systems into the cloud.
This trend accelerated when the Covid pandemic led to a dramatic rise in
consumers and workers moving to online services. Many organizations found
that their digital services were not ready for the unexpected level of demand they
were faced with. As a result, they increased their investment and efforts in cloud
technologies.

I call this period where cloud technology has been shifting from the periphery of
business to the center the “Age of Sprawl”. Although breaking things had gone
out of fashion, moving fast was still the priority. As a result of the haste to adopt
new technologies and practices, larger organizations have seen a proliferation of
initiatives. A larger organization typically has multiple, disconnected teams
building “platforms” using various technologies, multiple cloud vendors, and
varying levels of maturity and quality.

The variety of options available for building digital infrastructure and platforms7, and the rapid pace of change within them, has made it difficult to keep up to date. Platforms built on the latest technology two years ago may already be legacy.
The drivers that led to this sprawl are real. Organizations have needed to rapidly
evolve to survive and prosper in the modern, digital economy. However, as I
write this in 2023, the economic landscape has changed in a way that means
most organizations need to be more careful in how they invest. Not only do we
need to be choosy about what new systems and initiatives to invest in, but we
also need to consider how to manage the cost of running and evolving what we
already have in place. The need to grow, improve, and even exploit emerging
technologies has not gone away, so the next age is not simply one of cutting
back and staying in place. Instead, organizations need to find sustainable ways to
grow and evolve. Call it the Age of Sustainable Growth.

What does this have to do with Infrastructure as Code? Those of us involved in designing and building the foundational layers of our organizations’ business systems need to be aware of the strategic drivers those foundations must support.

THE FUTURE IS NOT EVENLY DISTRIBUTED

The tidy linear narrative I describe as “the path to the cloud age” is, as with any tidy linear narrative,
simplistic. Many people and organizations have experienced the trends it describes. But none of its “ages”
have completely ended, and many of the drivers of different ways of thinking and working are still valid.
It’s important to recognize that contexts differ. A Silicon Valley startup has different needs and constraints
than a transnational financial institution, and new technologies and methodologies create opportunities to
handle old risks and new opportunities in different ways. The path to the cloud age is uneven and far from over; understanding how it has unfolded so far can help us navigate what comes next.

Strategic Goals and Infrastructure as Code


Figure 1-1 shows the gap between organizational strategy and infrastructure
strategy. Customer value should drive the organization’s strategy, which drives
strategy to infrastructure via product and technology strategy. Each strategic
layer supports the layers above it.

Figure 1-1. Customer value driving strategy down to infrastructure


When talking to organizations about their strategy for cloud infrastructure, I’m
often struck by the gap between people interested in that topic and those
interested in organizational strategy. Engineering people are puzzled when I ask
questions about the product and commercial strategy. Organizational leaders are
dismissive of the need for infrastructure capability, assuming that selecting a
cloud vendor is the end of that story. Even when their infrastructure architecture
creates problems with growth, stability, or security, the instinct is to demand a
quick fix and move on.

The gap is not one-sided. Engineering folks tend to focus on implementing the
solutions that seem obvious to them, sometimes assuming that it doesn’t make
much difference what will run on it. One example of how this turns out is a
company whose engineers built a multi-region cloud hosting solution with iron-
clad separation between regions. The team wanted to make sure that user data
would be segregated to avoid conflicts with different privacy regulations, so this
requirement was baked deep into the architecture of their systems.

However, because neither the product nor engineering teams believed they
needed close communication during development, the service was nearly ready
for production rollout when it surfaced that the commercial strategy assumed
that users would be able to use the service while traveling and working in
different countries. It took considerable effort, expense, and delay to rearchitect
the system to ensure that privacy laws could be respected between regions while
giving users international roaming access.

So although infrastructure can seem distant from strategic goals discussed in the
boardroom, it’s essential to make sure everyone from strategic leaders to
engineering teams understands how they are related. Table 1-3 describes a few
common organizational concerns where infrastructure architecture can make a
considerable difference in either enabling success or creating drag.

Table 1-3. How Infrastructure as Code is relevant to an organization’s strategic goals

Enable effective software delivery
Why: Deliver value to users quickly and reliably. Deliver new products and features.
Outcomes: High performance on the four key metrics (“The Four Key Metrics”). Low effort and dependency on central teams.
How: Align infrastructure architecture with application architecture. Reduce dependency on central teams in value flows by empowering product teams.

Facilitate growth
Why: Grow value by adding markets, products, and customers.
Outcomes: Can expand products to new regions and for new customers quickly and easily, with costs that scale less than linearly.
How: Reduce time and effort to provision existing products into new regions and for new customers. Put mechanisms in place to maintain and update multiple product instances quickly, easily, and with minimal effort.

Ensure operational quality
Why: Sustain value by managing cost, performance, reliability, scalability, and security.
Outcomes: High visibility of and ability to improve operational metrics such as cost and performance. Performance, security, compliance, availability, and other concerns are continuously evaluated, tracked, and meet expectations.
How: Give those closest to implementation the capability and responsibility to measure, validate, and improve operational metrics. Make measurement, validation, tracking, and management of operational qualities inherent in the delivery process and tooling.

Support continuous modernization
Why: Sustain value by avoiding accumulating legacy technology, and by adopting new technology at a timely and sensible pace.
Outcomes: Systems are upgraded continuously, with low effort. The number of versions of any given system is minimized. Redundant systems are retired quickly.
How: Automated systems for testing and delivering patches, fixes, and minor upgrades across the estate. The capability to add new versions and systems to delivery systems.

Throughout this book, I’ll use the example of a fictitious company called
“ClotheSpin”, an online fashion retailer, to illustrate the concepts I discuss.
“Introduction to ClotheSpin” gives a high-level view of the company’s strategy.

INTRODUCTION TO CLOTHESPIN

ClotheSpin is an online fashion retailer that was founded in the dot-com days. It
is well-established in the UK and Germany and has recently expanded to
multiple countries in Europe, the Americas, and Asia. They have just launched a
new storefront called Hipsteroo in the UK and the US, to reach a younger
market, and want to expand it globally as well. The company has also
determined that they need to be able to add new services like clothing rental to
remain competitive. Last year they acquired a company called BrainZ, which has
a machine-learning system for retail product merchandising and
recommendations, which they want to integrate with their online stores.
The Technology Situation

The main ClotheSpin online storefront was originally built on data center
infrastructure, running J2EE on Solaris servers, and then migrated most of the
systems to Linux a few years later. In the mid-2010s the company began
migrating ClotheSpin onto AWS, initially with CloudFormation, later using
Terraform. A separate initiative re-platformed the software to a containerized
architecture. Much of the front-end experience is backed by containerized
software running on AWS ECS, although some services run on J2EE servers
deployed on virtual machine instances. There are also backend systems for
logistics and billing that still run in the data center, so ClotheSpin is a hybrid
cloud architecture.

When the Hipsteroo storefront was launched, the company decided to build it as
a greenfield project separate from the ClotheSpin systems, because this was the
fastest path to launching it. Hipsteroo is a purely cloud-native architecture
including EKS and Lambda, on infrastructure built with AWS CDK in a mix of
JavaScript and TypeScript.

The BrainZ machine learning systems run on Google Cloud, mostly built using
Terraform.

The Business Situation

Until recently, growth was the ClotheSpin board’s primary goal. Their strategy
was to spend to grow and worry about efficiency later. Later has come. The
economic situation has changed, and the cost to run and develop ClotheSpin’s
existing systems is not sustainable. However, the company can’t afford to miss
opportunities to grow market share and enter new markets. So they need to find
efficient ways to continue to grow their footprint. An added factor is that some
of the systems in place now have issues with performance and reliability, and
these need to be addressed to rebuild the confidence of customers and partners.

Key organizational goals for ClotheSpin include:

Grow our customer base, revenue, and profits by bringing new storefronts and
services to market
Grow our customer base, revenue, and profits by expanding our storefronts to
new regions
Retain and grow our customer base by continuously improving our existing
storefronts and services
Improve our profitability and service quality by rationalizing our systems

System Architecture Goals and Infrastructure as Code
An organization’s strategic goals typically filter down into goals for systems in
general, which may cross teams such as product development, software
engineering, platform teams, and IT operations. These groups will have their
own goals, objectives, and initiatives that infrastructure architecture needs to
support.

Figure 1-2 shows an example of how organizational goals, such as the ones
described in “Introduction to ClotheSpin”, drive goals for an engineering
organization, which in turn drive goals for the infrastructure architecture.
Infrastructure as Code can be used to ensure environments are consistent across
the path to production as well as across multiple production instances (I’ll talk
about different types of environments in Chapter 15).

Consistency across environments supports the engineering goal of improving software delivery effectiveness by making sure that test environments accurately reflect production environments. Consistency also reduces the amount of customization needed to provision new environments for adding products or expanding into new regions.

It’s easier to automate operational capabilities like security, compliance, and recovery when infrastructure is built consistently. And having less variation between environments makes it easier to consolidate and simplify overall system architecture. So this one goal for infrastructure architecture can support multiple higher-level goals for the organization.
Figure 1-2. Example of infrastructure goals driven by organizational goals

Use Infrastructure as Code to Optimize for Change
In Chapter 2 I discuss principles and practices for Infrastructure as Code and
explain how they can align with goals for system architecture and business
processes. But one of the most fundamental reasons for adopting Infrastructure
as Code, and one that is not universally understood in our industry, is to optimize
the process for making changes to IT systems. This theme underpins the
concepts I discuss throughout this book. When an organization finds that they’re
failing to see value from cloud and infrastructure automation, it is commonly
because they have not approached their use of these technologies as an enabler
for change.

Operations teams know that the biggest risk to a production system is making a
change to it8. The Iron Age approach to managing this risk (as I mentioned
earlier in “From the Iron Age to the Cloud Age”) is to add heavyweight
processes to make changes more slowly and carefully. However, adding barriers
to making changes adds barriers to fixing and improving the quality of a system.

Research from the Accelerate State of DevOps Report backs this up. Making
changes frequently and reliably is correlated to organizational success.9

Rather than resisting commercial pressures to make changes frequently and quickly, modern methods of change management, from lean to agile, lean into the idea that this is a good thing. Having the ability to deliver changes both rapidly and reliably is the secret sauce for high-quality systems in the digital age.

Common Myths About Infrastructure Automation and Change
There are several objections I hear when I recommend an infrastructure team
implement automation to optimize for change. I believe these come from
misunderstandings of how you can and should use automation.

Myth: Infrastructure Doesn’t Change Very Often

We want to think that we build an environment, and then it’s “done.” In this
view, we don’t make many changes, so automating changes, especially testing,
is a waste of time.

In reality, very few systems stop changing, at least not before they are retired.
Some people assume that a heavy pace of change is temporary. Others create
heavyweight change request processes to discourage people from asking for
changes. These people are in denial. Most teams that are supporting actively
used systems handle a continuous stream of changes.

Consider these common examples of infrastructure changes (a sketch of one of them, expressed as a code change, follows the list):

An essential new application feature requires you to add a new data processing tool.
A new application feature needs you to upgrade to a newer version of your
messaging service.
Performance profiling shows that the current application deployment
architecture is limiting performance. You need to redeploy the applications
across multiple clusters globally. Doing this requires changes to your cloud
accounts and network architecture.
There is a newly announced security vulnerability in system packages for
your container cluster system. You need to patch clusters across multiple
regions, as well as development and testing systems.
Your API gateway experiences intermittent failures. You need to make a
series of configuration changes to diagnose and resolve the problem.
You find a configuration change that improves the performance of your
database.
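
Many of these changes amount to small edits to existing definitions. For example, patching the vulnerable cluster package might be a one-line change, shown here as a diff against a hypothetical cluster definition (the setting name and version numbers are invented); the delivery process then rolls the same change out to every cluster in every region:

     cluster:
       name: store_cluster
    -  node_package_version: 4.17.1   # the version with the announced vulnerability
    +  node_package_version: 4.17.2   # the patched version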

A fundamental truth of the Cloud Age is: Stability comes from making changes.

Unpatched systems are not stable; they are vulnerable. If you can’t fix issues as
soon as you discover them, your system is not stable. If you can’t recover from
failure quickly, your system is not stable. If the changes you do make involve
considerable downtime, your system is not stable. If changes frequently fail,
your system is not stable.

Myth: We Can Build the Infrastructure First and Automate It Later

Getting started with Infrastructure as Code involves a steep learning curve. Setting up the tools, services, and working practices to automate infrastructure delivery is loads of work, especially if you’re also adopting a new infrastructure platform. The value
of this work is hard to demonstrate before you start building and deploying
services with it. Even then, the value may not be apparent to people who don’t
work directly with the infrastructure.

Stakeholders often pressure infrastructure teams to build new cloud-hosted systems quickly, by hand, and worry about automating it later.

There are several reasons why automating afterward is a bad idea:

Automation should enable faster delivery for new systems as well as existing
systems. Implementing automation after most of the work has been done
sacrifices many of the benefits.
Automation makes it easier to write automated tests for what you build. And
it makes it easier to quickly fix and rebuild when you find problems. Doing
this as a part of the build process helps you to build a more robust
infrastructure.
Automating an existing system is very hard. Automation is part of a system’s
design and implementation. To add automation to a system built without it,
you need to change the design and implementation of that system
significantly. This is also true for automated testing and deployment.

Cloud infrastructure built without automation becomes a write-off sooner than you expect. The cost of manually maintaining and fixing the system can escalate
quickly. If the service it runs is successful, stakeholders will pressure you to
expand and add features rather than stop to rearchitect it.

The same is true when you build a system as an experiment. Once you have a
proof of concept up and running, there is pressure to move on to the next thing,
rather than to go back and build it right. And in truth, automation should be a
part of the experiment. If you intend to use automation to manage your
infrastructure, you need to understand how this will work, so it should be part of
your proof of concept.

The solution is to build your system incrementally, automating as you go. Ensure
you deliver a steady stream of value, while also building the capability to do so
continuously.

Myth: Speed And Quality Must Be Traded Off Against Each Other

It’s natural to think that you can only move fast by skimping on quality and that
you can only get quality by moving slowly. You might see this as a continuum,
as shown in Figure 1-3.

Figure 1-3. The idea that speed and quality are opposite ends of a spectrum is a false dichotomy

However, research shows otherwise:

These results demonstrate that there is no tradeoff between improving performance and achieving higher levels of stability and quality. Rather, high performers do better at all of these measures. This is precisely what the Agile and Lean movements predict, but much dogma in our industry still rests on the false assumption that moving faster means trading off against other performance goals, rather than enabling and reinforcing them.10

—Dr. Nicole Forsgren, Accelerate

In short, organizations can’t choose between being good at change or being good
at stability. They tend to either be good at both or bad at both.

I prefer to see quality and speed as a quadrant rather than a continuum, as shown
in Figure 1-4.
Figure 1-4. Speed and quality map to quadrants

This quadrant model shows how trying to choose between speed and quality
leads to doing poorly at both:

Lower-right quadrant: Prioritize speed over quality
This is the “move fast and break things” philosophy. Teams that optimize for
speed and sacrifice quality build messy, fragile systems. They slide into the
lower-left quadrant because their shoddy systems slow them down. A
common pattern for startups is seeing development slow down after a year or
two, leading founders to despair that their team has lost their “mojo.” Simple
changes that they would have whipped out quickly in the old days now take
days or weeks because the system is a tangled mess. This is a consequence of
a system built in a rush, without considering quality a priority.

Upper-left quadrant: Prioritize quality over speed
Also known as “We’re doing serious and important things, so we have to do
things properly.” Then deadline pressures drive “workarounds.” Heavyweight
processes create barriers to improvement, so technical debt grows along with
lists of “known issues.” These teams slump into the lower-left quadrant. They
end up with low-quality systems because it’s too hard to improve them. They
add more processes in response to failures. These processes make it harder to
make improvements and increase fragility and risk. This leads to more
failures and more process. Many people working in organizations that work
this way assume this is normal,11 especially those who work in risk-sensitive
industries.12

The upper-right quadrant is the goal of modern approaches like Lean, Agile, and
DevOps. Being able to move quickly while also maintaining a high level of
quality may seem like a fantasy. However, the Accelerate research proves that
many teams do achieve this. So this quadrant is where you find “high
performers.”

The Four Key Metrics


Navigating your way into the high-performing quadrant is challenging. DORA’s
Accelerate research team identifies four key metrics for software delivery and
operational performance that can help keep you on track.13 Its research surveys
various measures, and has found that these four have the strongest correlation to
how well an organization meets its goals:

Delivery lead time
The elapsed time it takes to implement, test, and deliver changes to the
production system

Deployment frequency
How often changes are deployed to production systems

Change fail percentage
What percentage of changes either cause an impaired service or need
immediate correction, such as a rollback or emergency fix

Mean Time to Restore (MTTR)
How long it takes to restore service when there is an unplanned outage or
impairment

Organizations that perform well against their goals—whether that’s revenue, share price, or other criteria—also perform well against these four metrics. The
ideas in this book aim to help your team, and your organization, perform well on
these metrics. Three core practices for Infrastructure as Code can help you to
achieve this.
Core Practices for Infrastructure as Code
You can build and maintain highly effective systems by using Infrastructure as
Code to deliver changes continuously, quickly, and reliably. This book describes
various principles, practices, and patterns for achieving this. Underlying all of
this are a few core practices:

Define everything as code
Continuously test and deliver all work in progress
Build small, simple pieces that you can change independently

I’ll summarize each of these now, to set the context for further discussion. Later,
I’ll devote a chapter to the principles for implementing each of these practices.

Core Practice: Define Everything as Code

Defining all your stuff “as code” is a core practice for making changes rapidly
and reliably. There are a few reasons why this helps:

Reusability
If you define a thing as code, you can create many instances of it. You can
repair and rebuild your things quickly, and other people can build identical
instances of the thing.

Consistency
Things built from code are built the same way every time. This makes system
behavior predictable, makes testing more reliable, and enables continuous
testing and delivery.

Visibility
Everyone can see how the thing is built by looking at the code. People can
review the code and suggest improvements. They can learn things to use in
other code, gain insight to use when troubleshooting, and review and audit for
compliance.
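
To make this concrete, here is a minimal sketch of a “thing” defined as code, written in Terraform. The resource and its names are illustrative rather than taken from a real system, but they show how the definition captures everything needed to build identical instances:

    # An object storage bucket whose configuration is captured in code.
    # Anyone on the team can read, review, and rebuild it from this file.
    resource "aws_s3_bucket" "shared_files" {
      bucket = "clothespin-shared-files"   # illustrative name

      tags = {
        Owner = "infrastructure-team"      # illustrative tag
      }
    }

Anyone can recreate this bucket by applying the code, and a reviewer can see exactly how it is configured without logging in to the platform.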

I’ll expand on concepts and implementation principles for defining things as code in Chapter 4.

Core Practice: Continuously Test and Deliver All Work in Progress

Effective infrastructure teams are rigorous about testing. They use automation to
deploy and test each component of their system and integrate all the work
everyone has in progress. They test as they work, rather than waiting until
they’ve finished.

The idea is to build quality in rather than trying to test quality in.

One part of this that people often overlook is that it involves integrating and
testing all work in progress. On many teams, people work on code in separate
branches and only integrate when they finish. According to the Accelerate
research, however, teams get better results when everyone integrates their work
at least daily. Continuous Integration (CI) involves merging and testing everyone’s code throughout development. Continuous Delivery (CD) takes this further, keeping the merged code always production-
ready.

I’ll go into more detail on how to continuously test and deliver infrastructure
code in Chapter 7.

Core Practice: Build Small, Simple Pieces That You Can Change Independently

Teams struggle when their systems are large and tightly coupled. The larger a
system is, the harder it is to change, and the easier it is to break.

When you look at the codebase of a high-performing team, you see the
difference. The system is composed of small, simple pieces. Each piece is easy
to understand and has clearly defined interfaces. The team can easily change
each component on its own and can deploy and test each component in isolation.
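
As a hedged sketch of what this looks like in infrastructure code, the Terraform fragment below composes two small, hypothetical modules. Each module hides its internal details and exposes a small interface, so each can be changed, deployed, and tested on its own:

    # Hypothetical module paths and names, for illustration only.
    module "network" {
      source     = "./modules/network"
      cidr_block = "10.1.0.0/16"
    }

    module "cluster" {
      source     = "./modules/cluster"
      # The only coupling is the network module's declared output.
      subnet_ids = module.network.subnet_ids
    }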

I dig more deeply into implementation principles for this core practice in
Chapter 5.

Conclusion
Traditional, Iron Age approaches to software and system design were based on
the belief that, if you are sufficiently skilled, knowledgeable, and diligent, you
can come up with the correct design for the system’s needs. In reality, you won’t
know what the correct design is until your system is being used. Worse, changes
to your organization’s situation, environment, and opportunities mean the
system’s needs are a moving target. So even if you do find and implement the
correct design, it won’t remain correct for very long.

The only thing you know for sure when designing a system is that you will need
to change it when it is in use, not once, but continuously until the system is no
longer needed. The essence of Cloud Age, Lean, Agile, DevOps, and similar
philosophies is designing and implementing systems so that you can
continuously learn and evolve your systems.

With infrastructure, this means exploiting speed to improve quality and building
quality in to gain speed. Automating your infrastructure takes work, especially
when you’re learning how to do it. But doing that work helps to ensure you can
keep your system relevant and useful throughout its lifespan. The next chapter
will discuss more specific principles for designing and building cloud
infrastructure using code.

1 According to Wikipedia, a tire fire has two forms: “Fast-burning events, leading to almost immediate loss of control, and slow-burning pyrolysis which can continue for over a decade.”

2 The Accelerate State of DevOps Report, 2019 specifically researched the effectiveness of governance approaches, and includes a discussion of their findings on pages 48-52.
3 https://en.wikipedia.org/wiki/Shadow_IT

4 Facebook CEO Mark Zuckerberg said, “Unless you are breaking stuff, you are not moving fast enough.” https://www.businessinsider.com/mark-zuckerberg-2010-10

5 CAB: Change Advisory Board

6 Chapter 19 looks at modern approaches to governance and compliance.

7 The Cloud Native Landscape diagram is a popular one for illustrating how
many products, tools, and projects are available for building platforms. One of
my favorite memes extends this into a CNCF conspiracy chart.

8 According to Gene Kim, George Spafford, and Kevin Behr in The Visible
Ops Handbook (IT Process Institute), changes cause 80% of unplanned outages.

9 Reports from the Accelerate research are available in the annual State of
DevOps Report, and in the book, Accelerate, by Dr. Nicole Forsgren, Jez
Humble, and Gene Kim (IT Revolution Press).

10 Accelerate, by Dr. Nicole Forsgren, Jez Humble, and Gene Kim (IT Revolution Press).

11 This is an example of “Normalization of Deviance,” which means people get used to working in ways that increase risk. Diane Vaughan defined this term in The Challenger Launch Decision (University of Chicago Press).
12 It’s ironic (and scary) that so many people in industries like finance,
government, and health care consider fragile IT systems—and processes that
obstruct improving them—to be normal, and even desirable.

13 DORA, now part of Google, is the team behind the Accelerate State of
DevOps Report.
Chapter 2. Principles of Cloud
Infrastructure

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

This will be the 2nd chapter of the final book. Please note that the GitHub repo
will be made active later on.

If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.

The rise of cloud and automation has forced us to change how we think about,
design, and use computing resources.

Computing resources in the Iron Age of IT were tightly coupled to physical hardware. We assembled CPUs, memory, and hard drives in a case, mounted the
case into a rack, and cabled it to switches and routers. We installed and
configured an operating system and application software. We could describe
where an application server was in the data center: which floor, which row,
which rack, which slot.
Cloud decouples the computing resources from the physical hardware they run
on. The hardware still exists, of course, but servers, hard drives, and routers have
transformed into virtual constructs that we create, duplicate, change, move, and
destroy at will.

Cloud Native takes this decoupling further, moving away from modeling
resources based on hardware concepts like servers, hard drives, and firewalls.
Instead, infrastructure is defined around concepts driven by application
architecture. Containers strip down the concept of a virtual server to only those
things that are specific to an application process. Serverless removes even that,
providing the bare minimum that an application needs from its environment to
run. A service mesh can abstract various aspects of interaction and integration
between application processes, including routing, authentication, and service
discovery.

We can no longer rely on the physical attributes of our infrastructure to be constant. We must be able to add and remove instances of our systems and
components of it without ceremony, and we need to be able to easily maintain
the consistency and quality of our systems even as we rapidly expand their scale.

There are several principles for designing and implementing infrastructure on cloud platforms. These principles articulate the reasoning for using the core
practices I described in Chapter 1 (define everything as code, continuously test
and deliver, and build small pieces). I also list several common pitfalls that
teams fall into with dynamic infrastructure.
These principles and pitfalls underlie more specific advice on implementing
Infrastructure as Code practices throughout this book.

Principle: Assume Systems Are Unreliable


In the Iron Age, we assumed our systems were running on reliable hardware. In
the Cloud Age, you need to assume your system runs on unreliable hardware.1

Cloud-scale infrastructure involves hundreds of thousands of devices, if not more. At this scale, failures happen even when using reliable hardware—and
most cloud vendors use cheap, less reliable hardware, detecting and replacing it
when it breaks.

You’ll need to take parts of your system offline for reasons other than unplanned
failures. You’ll need to patch and upgrade the system software. You’ll resize,
redistribute the load, and troubleshoot problems.

With static infrastructure, doing these things means taking systems offline. But
in many modern organizations, taking systems offline means taking the business
offline.

So you can’t treat the infrastructure your system runs on as a stable foundation.
Instead, you must design for uninterrupted service when underlying resources
change.2

Principle: Make Everything Reproducible


One way to make a system resilient is to make sure you can rebuild its parts
effortlessly and reliably.

Effortlessly means that there is no need to make any decisions about how to
rebuild things. You should define things such as configuration settings, software
versions, and dependencies as code. Rebuilding is then a simple “yes/no”
decision.
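
As an illustrative sketch, the Terraform fragment below captures the decisions needed to rebuild a database instance, such as engine version and size, as code. The resource name and values are hypothetical, and required details like credentials are omitted for brevity:

    # Every decision needed to rebuild this instance is recorded here,
    # so rebuilding it requires no further decision-making.
    resource "aws_db_instance" "orders" {
      identifier        = "orders-db"
      engine            = "postgres"
      engine_version    = "15.4"
      instance_class    = "db.t3.medium"
      allocated_storage = 50
      # Credentials would come from a secrets manager, omitted here.
    }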

Not only does reproducibility make it easy to recover a failed system, but it also
helps you to:

Make test environments consistent with production
Replicate systems across regions for availability
Add instances on demand to cope with high load
Replicate systems to give each customer a dedicated instance

Of course, a running system generates data, content, and logs, which you can’t
define ahead of time. You need to identify these and find ways to keep them as a
part of your replication strategy. Doing this might be as simple as automatically
copying or streaming data to a backup and then restoring it when rebuilding. I’ll
describe options for doing this in Chapter 19.

The ability to effortlessly build and rebuild any part of the infrastructure is
powerful. It reduces the risk and fear of making changes, and you can handle
failures with confidence. You can rapidly provision new services and
environments.
Pitfall: Snowflake Systems
A snowflake is an instance of a system or part of a system that is difficult to
rebuild. It may also be an environment that should be similar to other
environments—such as a staging environment—but is different in ways that its
team doesn’t fully understand.

People don’t set out to build snowflake systems; they are a natural occurrence.
The first time you build something with a new tool you learn lessons along the
way, which involves making mistakes. But if people are relying on the thing
you’ve built, you may not have time to go back and rebuild or improve it using
what you learned. Improving what you’ve built is especially hard if you don’t
have the mechanisms and practices that make it easy and safe to change.

Another cause of snowflakes is when people make changes to one instance of a system that they don’t make to others. They may be under pressure to fix a
problem that only appears in one system, or they may start a major upgrade in a
test environment, but run out of time to roll it out to others.

You know a system is a snowflake when you’re not confident you can safely
change or upgrade it. Worse, if the system does break, it’s hard to fix it. So
people avoid making changes to the system, leaving it out of date, unpatched,
and maybe even partly broken.

Snowflake systems create risk and waste the time of the teams that manage
them. It is almost always worth the effort to replace them with reproducible
systems. If a snowflake system isn’t worth improving, then it may not be worth
keeping at all.

The best way to replace a snowflake system is to write code that can replicate
the system, running the new system in parallel until it’s ready. Use automated
tests and pipelines to prove that it is correct and reproducible and that you can
change it easily.

Note that it’s possible to create snowflake systems using infrastructure code, as
I’ll explain in Chapter 15.

Principle: Create Disposable Things


Building a system that can cope with dynamic infrastructure is one level. The
next level is building a system that is itself dynamic. You should be able to
gracefully add, remove, start, stop, change, and move the parts of your system.
Doing this creates operational flexibility, availability, and scalability. It also
simplifies and de-risks changes.

“Treat your servers like cattle, not pets,” is a popular expression about
disposability.3 I miss giving fun names to each new server I create. But I don’t
miss having to tweak and coddle every server in our estate by hand.

If your systems are dynamic, then you need to use tools that can cope with this.
For example, your monitoring should not raise an alert every time you rebuild
part of your system. However, it should raise a warning if something gets into a
loop rebuilding itself.

THE CASE OF THE DISAPPEARING FILE SERVER

People can take a while to get used to ephemeral infrastructure. One team I
worked with automated its infrastructure with VMware and Chef. The team
deleted and replaced virtual machines as needed.

A new developer on the team needed a web server to host files to share with
teammates, so he manually installed an HTTP server on a development server
and put the files there. A few days later, I rebuilt the VM, and his web server
disappeared.

After some confusion, the developer understood why this had happened. He
added his web server to the Chef code and persisted his files to the SAN. The
team now had a reliable file-sharing service.

Principle: Minimize Variation


As a system grows, it becomes harder to understand, harder to change, and
harder to fix. The work involved grows with the number of pieces, and also with
the number of different types of pieces. So a useful way to keep a system
manageable is to have fewer types of pieces—to keep variation low. It’s easier to
manage one hundred identical servers than five completely different servers.

The reproducibility principle (see “Principle: Make Everything Reproducible”) complements this idea. If you define a simple component and create many
identical instances of it, then you can easily understand, change, and fix it.

To make this work, you must apply any change you make to all instances of the
component. Otherwise, you create configuration drift.
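
A minimal sketch of this idea in Terraform: one definition producing many identical instances, so a change to the definition applies to every instance. The variable and values are illustrative:

    # One hundred identical servers from a single definition. Editing
    # this block changes all of them, rather than some of them.
    resource "aws_instance" "web" {
      count         = 100
      ami           = var.base_server_image   # hypothetical variable
      instance_type = "t3.small"
    }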

Here are some variations you may have in your system:

Multiple operating systems, Kubernetes distributions, databases, and other technologies. Each one of these needs people on your team to keep up skills
and knowledge.
Multiple versions of software such as a container cluster or database. Even if
you only use one type of container cluster, different versions may need
different configurations and tooling.
Different versions of a package. When some systems have a newer version of
a package, utility, or library than others, you have risk. Commands may not
run consistently across them, or older versions may have vulnerabilities or
bugs.

Organizations face a tension between allowing each team to choose technologies and solutions that are appropriate to their needs, and keeping the amount of variation in the organization to a manageable level.

LIGHTWEIGHT GOVERNANCE

Modern, digital organizations are learning the value of Lightweight Governance in IT to balance autonomy
and centralized control. This is a key element of the EDGE model for agile organizations. For more on this,
see the book, EDGE: Value-Driven Digital Transformation by Jim Highsmith, Linda Luu, and David
Robinson (Addison-Wesley Professional), or Jonny LeRoy’s talk, “The Goldilocks Zone of Lightweight
Architectural Governance”. Andrew Harmel-Law describes this approach in “Scaling the Practice of Architecture, Conversationally”.

Configuration Drift

Configuration drift is variation that happens over time across once identical
systems. Figure 2-1 shows this. Making changes manually is a common cause of
inconsistencies. It can also happen if you use automation tools to make ad hoc
changes to only some of the instances, or if you create separate branches or
copies of the infrastructure code for different instances. Configuration drift
makes it harder to maintain consistent automation.

Figure 2-1. Configuration drift is when instances of the same thing become different over time
As an example of how infrastructure can diverge over time, consider the journey
of our example company, ClotheSpin (as introduced in “Introduction to
ClotheSpin”). ClotheSpin runs a separate instance of its storefront in each
region, as a set of microservices deployed on an AWS EKS cluster, along with
an API gateway, database instances, and message queues.

The ClotheSpin infrastructure team maintains a separate Terraform project for each region. For each new region, they copy the code from an existing region
and edit the code and configuration as needed.

Over time, the Terraform code has become increasingly different between
regions. When a change is needed, such as fixing a configuration issue or adding
a new feature, the team needs to manually edit the code for each region. They
test the change in a separate staging instance for each region because the
differences mean that it might work correctly in one region, but break the system
in another one.

It can take a few weeks to apply even a minor change to all of the region’s
infrastructure. In some cases, the team doesn’t bother to make a change to all of
the regions, if they think it may not be relevant. This increases the differences
between the regions, making it even more likely that a later change can’t be
easily applied everywhere.

The ClotheSpin team is exploring ways to make the infrastructure more consistent across all of their regions. Chapter 11 will be especially helpful for
this issue.
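
One direction the team could take, sketched below with hypothetical module names, is to replace the copied per-region projects with a single shared module, instantiated once per region with only the values that genuinely differ:

    # Each region reuses the same module, so a fix or feature added to
    # the module applies everywhere, instead of being edited per region.
    module "storefront_eu" {
      source = "./modules/storefront"
      region = "eu-west-1"
    }

    module "storefront_us" {
      source = "./modules/storefront"
      region = "us-east-1"
    }
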
THE AUTOMATION FEAR SPIRAL

The automation fear spiral describes how many teams fall into configuration
drift and technical debt.

At an Open Space session on configuration automation at a DevOpsDays conference, I asked the group how many of them were using automation tools
like Ansible, Chef, or Puppet. The majority of hands went up. I asked how many
were running these tools unattended, on an automatic schedule. Most of the
hands went down.

Many people have the same problem I had in my early days of using automation
tools. I used automation selectively—for example, to help build new servers, or
to make a specific configuration change. I tweaked the configuration each time I
ran it to suit the particular task I was doing.

I was afraid to turn my back on my automation tools because I lacked confidence in what they would do.

I lacked confidence in my automation because my servers were not consistent.

My servers were not consistent because I wasn’t running automation frequently and consistently.

This is the automation fear spiral, as shown in Figure 2-2. Infrastructure teams
must break this spiral to use automation successfully. The most effective way to
break the spiral is to face your fears. Start with one set of servers. Make sure you
can apply, and then reapply, your infrastructure code to these servers. Then
schedule an hourly process that continuously applies the code to those servers.
Then pick another set of servers and repeat the process. Do this until every
server is continuously updated.

Good monitoring and automated testing build confidence to continuously synchronize your code. This exposes configuration drift as it happens, so you
can fix it immediately.

Figure 2-2. The automation fear spiral

GitOps, described in Chapter 20, involves automatically and continuously applying configuration to systems. In itself, this doesn’t prevent configuration
drift between multiple instances of a system. However, the “hands-off”
mechanism of keeping the system and code synchronized helps to avoid a type
of automation fear that comes from people making changes to a system outside
of code.

Principle: Ensure That You Can Repeat Any Process
Building on the reproducibility principle, you should be able to repeat anything
you do to your infrastructure. It’s easier to repeat actions using scripts and
configuration management tools than to do them by hand. But automation can be
a lot of work, especially if you’re not used to it.

For example, let’s say I have to partition a hard drive as a one-off task. Writing
and testing a script is much more work than just logging in and running the
command. So I do it by hand.

The problem comes later on, when someone else on my team, Priya, needs to
partition another disk. She comes to the same conclusion I did and does the work
by hand rather than writing a script. However, she makes slightly different
decisions about how to partition the disk. I made an 80 GB ext3 partition
on my server, but Priya made a 100 GB XFS partition on hers. We’re
creating configuration drift, which will erode our ability to automate with
confidence.
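
Translated into a cloud setting as an illustrative sketch, the decisions from this story could be captured as code, so the next person repeats the same process rather than inventing a variant:

    # The size and filesystem decisions are now explicit and shared.
    # Names and values are hypothetical.
    resource "aws_ebs_volume" "data" {
      availability_zone = "eu-west-1a"
      size              = 80      # GB, agreed once, reused by everyone
      type              = "gp3"

      tags = {
        Filesystem = "ext4"       # recorded so provisioning agrees
      }
    }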

Effective infrastructure teams have a strong scripting culture. If you can script a
task, then script it.4 If it’s hard to script it, dig deeper. Maybe there’s a technique
or tool that can help, or maybe you can simplify the task or handle it differently.
Breaking work down into scriptable tasks usually makes it simpler, cleaner, and
more reliable.

Principle: Apply Software Design Principles to Infrastructure Code
At the start of Chapter 1, I defined Infrastructure as Code as “applying the
principles, practices, and tools of software engineering to infrastructure”. The nature of infrastructure is quite different from that of software, as I’ll explore in Chapter 4, so this principle needs to be treated with some care. However, many
software design and engineering concepts are useful for infrastructure.

There are many resources for learning about software design, architecture, and
engineering, which I draw on throughout the book.

Conclusion
The Principles of Cloud Infrastructure embody the differences between
traditional, static infrastructure, and modern, dynamic infrastructure:

Assume Systems Are Unreliable
Make Everything Reproducible
Avoid Snowflake Systems
Create Disposable Things
Minimize Variation
Ensure That You Can Repeat Any Process
Apply Software Design Principles to Infrastructure Code

These principles are the key to exploiting the nature of cloud platforms. Rather
than resisting the ability to make changes with minimal effort, exploit that ability
to gain quality and reliability.

1 I learned this idea from Sam Johnson’s article, “Simplifying Cloud: Reliability”.

2 The principle of assuming systems are unreliable drives chaos engineering, which injects failures in controlled circumstances to test and improve the reliability of your services. I talk about this more in Chapter 19.

3 I first heard this expression in Gavin McCance’s presentation “CERN Data Centre Evolution”. Randy Bias credits Bill Baker’s presentation “Architectures for Open and Scalable Clouds”. Both of these presentations are an excellent introduction to these principles.

4 My colleague Florian Sellmayr says, “If it’s worth documenting, it’s worth
automating.”
Chapter 3. Platforms and Toolchains

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

This will be the 3rd chapter of the final book. Please note that the GitHub repo
will be made active later on.

If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.

Infrastructure code’s role in a wider system is to assemble infrastructure resources to support organizational strategy. Figure 1-2 in Chapter 1 showed
how an organization’s strategic goals drive an engineering strategy, which in
turn drives the infrastructure strategy.

This chapter describes a model of the implementation of these different layers of strategy and goals, which places Infrastructure as Code in the context of
strategy and goals, which places Infrastructure as Code in the context of
platforms and toolchains. In this model, infrastructure code assembles
infrastructure resources and services provided by an IaaS (Infrastructure as a
Service) platform to enable platform services that support the engineering
strategy.

The organization’s strategy and goals are delivered by various capabilities that
are described by the organization’s enterprise architecture. These capabilities are
implemented as services, which are often grouped into platforms that provide
cohesive sets of services.

Platform design and implementation is too broad a topic for this book to discuss
comprehensively. But we typically use Infrastructure as Code to implement
platforms and platform services.

Platform is one of those words (like “system” and “service”) which is used in so
many different ways that the word is nearly meaningless without a more specific
qualifier, such as “business”, “developer”, or “data”. And even then terms like
“business platform” still feel wooly. To make things even more difficult,
different people define and use various platform-related words in different ways; there are no industry-standard definitions to rely on.

I’ll define a platform as a collection of system elements that are presented to users, who use them to create products for other users.1

The model I describe in this chapter can be used to plan an infrastructure architecture that enables people to create products for users. The model starts with a simple view for grouping capabilities as part of an enterprise architecture. A core grouping of these is technology capabilities, which may be implemented as one or more engineering platforms that people use to build and run software. An engineering platform is composed of platform services, and I’ll describe different ways infrastructure code can be involved in provisioning and configuring them.

The patterns and practices for Infrastructure as Code described in this book
apply to any IaaS platform, whether provided by a public cloud vendor or an
internal infrastructure platform. I’ll give a summary of the common types of
resources and services IaaS platforms provide, to establish vendor-independent
terminology used throughout the book.

The chapter closes with a discussion of toolchains involved in managing Infrastructure as Code and related concerns.

Capabilities in an Enterprise Architecture


Enterprise architecture is a huge topic,2 with many different viewpoints and
models to describe it. Figure 3-1 is a simplistic view that focuses on the areas
that infrastructure most directly connects with.
Figure 3-1. Technology capabilities in the context of enterprise architecture

The parts of the enterprise architecture as shown in the diagram could use some
explanation:

Business products and capabilities


This vast area includes customer-facing software applications and services, as
well as business capabilities shared across applications. Examples of shared
business capabilities in ClotheSpin’s domain of online retail include order
management, customer data, and product catalog. Cloud vendors call it
“workloads.” I sometimes wave my hands and call it “stuff that software
developers do.” Lumping it all into a narrow box at the top is excusable only because the details of what lives there vary widely for different types of organizations, and are out of scope for this book. However, anyone
designing and building infrastructure must understand this landscape for their
organization, to ensure that they deliver what is needed.

Technology capabilities
Roughly speaking, these are capabilities that an IT organization uses to
enable building and running business products and capabilities. Technology
capabilities are not usually visible outside of the organization. These
capabilities are typically what we use infrastructure code to provide, and I’ll
describe their different types next.

Infrastructure resources
Raw compute, storage, and networking resources may be provided by cloud
vendors or data centers. These resources are the raw ingredients that we work
with using Infrastructure as Code. I’ll elaborate on them later in this chapter
in a section on Infrastructure as a Service.

Infrastructure as Code bridges the infrastructure resources with the technology capabilities.

Types of Technology Capabilities

Technology capabilities are often treated with a brief hand-wave in enterprise architecture diagrams, but it’s worth breaking them out into more detail for our
purposes:

Delivery capabilities
These services and systems are used to develop, test, and deliver software.
Examples include source code and artifact repositories, CI (Continuous
Integration) services, CD (Continuous Delivery) pipelines, and automated
testing tools. The scope of these capabilities isn’t purely software used for
business products and capabilities but also includes code used for other
technology capabilities. Chapter 7 explains how to use delivery capabilities
for your infrastructure code.3

Application runtime capabilities


The services and systems that software runs on may include physical or
virtual servers, container management systems, and serverless runtimes. Data
storage and processing, messaging, and traffic management are also
application runtime services. “Middleware” is another common term for
software in this area. Application runtime capabilities are at the heart of what
we provide through infrastructure code, so they get multiple chapters in this book (Part IV).

Operational capabilities
Some systems and services may not be strictly required for the software to
run, but are needed to ensure they run well. Examples include monitoring,
observability,4 security management, disaster recovery, and capacity
management. Chapter 19 explores some key operational capabilities.

EXAMPLE - TECHNOLOGY CAPABILITIES IN CLOTHESPIN

ClotheSpin’s online fashion store includes software product development teams organized around different aspects of their business, as shown in Figure 3-2. The
product browse and search team owns the product catalog, administration
interfaces and business logic for managing products and pricing, and storefront
activities like browsing and searching for products. The Order Management
team is responsible for the systems for placing and tracking orders, including
transactions.
Figure 3-2. Some of the capabilities for the ClotheSpin online store

The company’s IT department supports the product development teams by providing the infrastructure for building and running their software. The capabilities their infrastructure provides include software delivery systems like source code repositories and delivery pipelines, as well as application runtime services such as container clusters.

Technology capabilities can be provided without much complexity for smaller and simpler systems. But as a system grows, it’s important to have a clear strategy. Platform engineering is the domain of providing technology capabilities within an organization.

Engineering Platforms
An engineering platform provides technology capabilities to users inside an
organization, who use them to create products for users inside and outside the
organization. Figure 3-3 wraps the technology capabilities described in “Types
of Technology Capabilities” into an engineering platform.

Figure 3-3. Engineering platform provides technology capabilities

As I mentioned earlier, platform engineering is too large a topic to cover thoroughly in this book. However, the role infrastructure code plays in providing a platform’s
capabilities is an important lens for us. So we need to consider how platform
engineering relates to infrastructure.

People often see an engineering platform as a single, unified solution, and many vendors aim to sell them this way. However, a platform is best seen as a collection of services. Different services may be provided by different teams, and in some cases hosted by external vendors as a Software as a Service (SaaS) solution.
Platform Services

A platform service is an implementation of a technology capability as a cohesive offering within an engineering platform. Figure 3-4 shows a few of the platform
services used by the ClotheSpin storefront.

Figure 3-4. Example platform services

These services are defined by the capability they provide to the software that
runs on them, rather than the details of the technology. “Public Traffic,” for
example, may include DNS entries, CDN (Content Distribution Network)
services, and network routing. But the service is defined by the fact that it
provides connectivity from users on the public Internet to an application.
Providing Platform Service Functionality

There are different ways that the functionality of a platform service may be
provided, and infrastructure is used differently in each of these ways. For
example, a monitoring service might use functionality from a software package
deployed onto the organization’s infrastructure, a service provided by the IaaS
cloud vendor, or a SaaS monitoring solution. Figure 3-5 shows each of these
options.

Figure 3-5. Ways to provide platform service functionality


Infrastructure is used in different ways for each of these three options.

Packaged Software
Teams provide the platform functionality by deploying a software package
onto their infrastructure. A few examples include open-source monitoring
software like Prometheus, a secrets management service like Hashicorp
Vault, or a packaged container cluster service like kops or Rancher.
Infrastructure code provides the infrastructure to run the software as well as
integration with other infrastructure and services, such as networking and
authorization.

Cloud Platform-Provided Service


Most cloud platform vendors not only provide basic infrastructure resources
like virtual servers and network structures but also offer platform service
functionality. Examples include Azure Monitor, AWS Secrets Manager, and
Google Kubernetes Engine. Infrastructure code defines, configures, and
provisions the services needed directly from the IaaS platform. The code also
defines integration with other resources like networking and authorization,
which is often simplified by coming from the same platform.

Externally-Hosted Service
Many organizations use services hosted by a SaaS vendor. Examples include
Datadog monitoring, Akamai Edge DNS, and Okta identity management.
Many SaaS providers have APIs supported by Infrastructure as Code tools, so
you can write code to provision, configure, and integrate their services.
EXAMPLE TECHNOLOGY CAPABILITY IMPLEMENTATIONS BY CLOTHESPIN

The ClotheSpin teams have built their systems over nearly twenty years, and
have used a variety of different ways of providing platform service functionality.

When ClotheSpin introduced an API layer for mobile applications, and later opened it up to third-party developers, they deployed and ran the Kong API gateway (https://konghq.com/products/kong-gateway) on an AWS EKS cluster and a PostgreSQL RDS instance. Their infrastructure code pulled Docker images with Kong pre-installed. This is an example of packaged software providing a platform service.

Later, the team decided to migrate to the AWS API Gateway service
(https://aws.amazon.com/api-gateway/). Most of the implementation for
ClotheSpin’s folks involved writing Terraform code to configure the service,
with some work by the application developers to migrate their code. This is an
example of functionality provided by the cloud platform.

The new Hipsteroo brand launched a few years ago with an architectural design
principle to keep the amount of packaged software to a minimum, so the team
would have fewer moving parts to manage. They defaulted to using services
provided by their cloud vendor as much as possible. However, they decided they
preferred the advanced features and developer experience of a third-party hosted
monitoring provider, Datadog (https://www.datadoghq.com/). The team deploys
a stack written with AWS CDK that other infrastructure uses to integrate with
the endpoints of the monitoring service. This is an example of an externally
hosted platform service.
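
As a simplified sketch of the cloud platform-provided option, ClotheSpin’s migration to AWS API Gateway mostly meant declaring and configuring the vendor’s service in Terraform code. The names and settings here are illustrative:

    # Declaring a managed API gateway from the cloud vendor. There is
    # no software to install; the code configures the vendor's service.
    resource "aws_apigatewayv2_api" "storefront" {
      name          = "storefront-api"
      protocol_type = "HTTP"
    }
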
IaaS Platforms
So far, I’ve described infrastructure resources vaguely as the stuff assembled by
infrastructure code. These resources, and the IaaS platforms that provide them,
are the medium in which we infrastructure coders work. They are the materials
that we mold, using our craft to turn characters in a file into the digital
foundations that sustain the organizations for which we work.

That may be a flowery way to describe IaaS platforms. But they are important to
what we do.

Figure 3-6 shows the relationship between infrastructure code, an IaaS platform,
and the infrastructure resources provisioned for our use.
Figure 3-6. Infrastructure code interacts with Infrastructure as a Service

An infrastructure tool like Terraform or CDK reads the infrastructure code and
uses it to interact with the API of the IaaS platform to provision or change
infrastructure resources.

The essential characteristics of an IaaS platform for Infrastructure as Code are that it provides infrastructure resources on demand and that it provides them through a programmable interface. Most IaaS clouds expose a REST5 API, often with SDKs (Software Development Kits) for different programming languages.6
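
For example, here is a minimal, illustrative Terraform fragment. The tool reads this code and makes the corresponding calls to the IaaS platform’s API, in this case to create a network address block:

    # Terraform translates this declaration into API calls to the
    # platform. The region and address range are illustrative.
    provider "aws" {
      region = "eu-west-1"
    }

    resource "aws_vpc" "main" {
      cidr_block = "10.0.0.0/16"
    }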

There are different types of IaaS platforms, from full-blown public clouds to
private clouds; from commercial vendors to open source platforms. Table 3-1
lists examples of vendors, products, and tools for each type of cloud IaaS
platform.

Table 3-1. Examples of Infrastructure as a Service Solutions

Public IaaS cloud services: Alibaba Cloud, AWS, Azure, Digital Ocean, Google Cloud Platform, Linode (Akamai), Oracle Cloud, OVHCloud, Scaleway, and Vultr

Private IaaS cloud products: CloudStack, OpenStack, and VMware vCloud

Bare-metal server provisioning tools: Cobbler, FAI, and Foreman (see Chapter 17)

Public cloud data center offerings: AWS Outposts, Azure Stack, and Google Anthos

At the basic level, an IaaS platform provides compute, storage, and networking
resources. The platform can provide these resources in different ways. For
instance, you may run compute as virtual servers, container runtimes, and
serverless code execution.

Different vendors may package and offer the same resources in different ways,
or at least with different names. For example, AWS object storage, Azure blob
storage, and GCP cloud storage are all pretty much the same thing. This book
tends to use generic names that apply to different platforms. Rather than VPC
and Subnet, I use network address block and VLAN.

Types of Infrastructure Resources Provided by an IaaS Platform

There are three essential types of resources provided by an IaaS platform: compute, storage, and networking. Different platforms combine and package these resources in different ways. For example, you may be able to provision a
database instance, which combines compute, storage, and networking. Even
something as seemingly simple as block storage involves not only storage but
also networking to allow connections over HTTP and compute to carry out
encryption.

The fundamental forms of infrastructure are primitive resources, such as servers and block storage. Cloud platforms combine infrastructure primitives into
composite resources, such as:

Database as a Service (DBaaS)
Cluster as a Service (CaaS)
Load balancing
DNS
Identity management
Secrets management

Figure 3-7 shows examples of composite resources and their relationships to primitive resources:

Figure 3-7. IaaS resource types

The line between a primitive resource and a composite resource is arbitrary, as is the line between a composite infrastructure resource and a platform service such
as an API gateway. But it’s a useful distinction. There are three broad groups of
primitive infrastructure resources: compute, storage, and networking.

Compute Resources
Compute resources execute code. At its most elemental, compute is execution
time on a physical server CPU core. But most platforms provide compute in
different ways. Common compute resources include:

Virtual machine instances (VMs)
Physical servers, also called Bare Metal as a Service (BMaaS)
Server clusters, such as AWS Auto Scaling Group (ASG), Azure virtual
machine scale set, and Google Managed Instance Groups (MIGs)
Container instances, Containers as a Service (CaaS)
Container clusters (CCaaS), although sometimes also called CaaS. Examples
include Amazon ECS, Amazon Elastic Container Service for Kubernetes
(EKS), Azure Kubernetes Service (AKS), and Google Kubernetes Engine
(GKE)
FaaS serverless code runtimes, such as AWS Lambda

The variety of ways to provision and use compute resources creates useful options for designing and implementing applications that use them effectively and efficiently.

Storage Resources

Infrastructure platforms provide storage in different ways. Typical storage resources are:

Block storage, virtual disk volumes that can be mounted to virtual services or
other compute instances. Examples include AWS EBS, Azure Page Blobs,
OpenStack Cinder, and GCE Persistent Disk.
Object storage, which provides access to files from multiple locations, rather
than attached to a specific compute instance. Amazon’s S3, Azure Block
Blobs, Google Cloud Storage, and OpenStack Swift are all examples. Object
storage is usually cheaper and more reliable than block storage, but with
higher latency.
Networked filesystems, shared network volumes. These are usually volumes
that can be mounted on multiple compute instances using standard protocols,
such as NFS, AFS, or SMB/CIFS.7
Structured data storage. These are often managed Database as a Service
(DBaaS) offerings. They can be a relational database (RDBMS), key-value
store, or formatted document stores for JSON or XML content.
Secrets management, which is essentially structured data storage with
additional features for secrets management such as rotation and fine-grained
access management. See Chapter 11 for techniques for managing secrets and
infrastructure code.

As with compute resources, the different storage options vary from simple
options that provide raw storage space, to more sophisticated options tailored for
more narrow use cases.

Network Resources

Typical networking constructs and services an IaaS platform provides include:

Network address blocks, such as VPCs, Virtual Networks, Subnets, and VLANs.
DNS service
Traffic routing, gateways (low-level and API level), and proxies
Load balancing
VPNs (virtual private networks)
Firewall rules
Asynchronous message queues
Caching
Service mesh

The capability of dynamic platforms to provision and change networking on demand, from code, creates great opportunities. These opportunities go beyond
changing networking more quickly; they also include much safer use of
networking.

Part of the safety comes from the ability to quickly and accurately test a
networking configuration change before applying it to a critical environment.
Beyond this, Software Defined Networking (SDN) makes it possible to create
finer-grained network security constructs than you can do manually. This is
especially true with systems where you create and destroy elements dynamically.
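
As a hedged sketch of what finer-grained network security can look like in code, the rule below (with hypothetical resource names) allows traffic to a service only from one other component’s security group, rather than from a whole address range:

    # Allow the orders service to receive traffic only from the API
    # gateway's security group. Both groups are assumed to be defined
    # elsewhere in the codebase.
    resource "aws_security_group_rule" "orders_ingress" {
      type                     = "ingress"
      from_port                = 8443
      to_port                  = 8443
      protocol                 = "tcp"
      security_group_id        = aws_security_group.orders.id
      source_security_group_id = aws_security_group.api_gateway.id
    }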

The details of networking are outside the scope of this book, so check the
documentation for your platform provider, and perhaps a reference such as Craig
Hunt’s TCP/IP Network Administration (O’Reilly).

IaaS in the Data Center


People used Infrastructure as Code in data centers long before public IaaS clouds
brought it into mainstream IT. Early Infrastructure as Code focused on
configuring servers and didn’t involve provisioning networks and storage. Public
cloud IaaS made it possible to use infrastructure code for broader infrastructure,
which drove interest in ways to offer IaaS on-premise.

Table 3-1 lists some relevant open source and commercial products now
available for building private IaaS platforms in a data center. The bare-metal
cloud tools in that table can automate the provisioning of physical servers, either
to use directly or as a first step for installing virtualization or IaaS software.
Many of these tools are used for installing IaaS products, automating the process
of installing hypervisors onto physical servers.

The major IaaS cloud vendors also offer products for deploying services in a
data center that are compatible with their public cloud offerings. These solutions
are not designed as complete offerings like the private IaaS products listed in the
table but are intended as stepping stones or complements for running hybrid
clouds with their public cloud services.

Although some people argue that private hosting is more economical than public
cloud,8 it takes considerable investment and expertise to implement a private
IaaS. Most private IaaS implementations are much less sophisticated, mature,
and flexible than public offerings. Most organizations I’ve worked with who run
in-house IaaS or PaaS clouds struggle to find and retain staff with the skills to
manage them, and so are reliant on third-party vendors.
There are use cases where at least some services need to run in the data center.
However, rather than trying to build and maintain a full-fledged internal IaaS
cloud, it’s generally more useful to build just enough infrastructure to deliver
specific workloads. You can do this by automating the processes to provision,
update, and manage that infrastructure with the simplest set of tooling necessary.

Multicloud

Many organizations end up hosting across multiple platforms. A few terms crop
up to describe variations of this:

Hybrid cloud
Hosting applications and services for a system across both private
infrastructure and a public cloud service. People often do this because of
legacy systems that they can’t easily migrate to a public cloud service (such
as services running on mainframes). In other cases, organizations have
requirements that public cloud vendors can’t currently meet, such as legal
requirements to host data in a country where the vendor doesn’t have a
presence.

Cloud agnostic
Building systems so that they can run on multiple public cloud platforms.
People often do this hoping to avoid lock-in to one vendor. In practice, this
results in lock-in to software that promises to hide differences between
clouds, or involves building and maintaining vast amounts of customized
code, or both.
Polycloud
Running different applications, services, and systems on more than one public
cloud platform. This is usually to exploit different strengths of different
platforms.

CLOUD VENDOR LOCK-IN

Many organizations worry that using certain cloud-provided capabilities might reduce their options for
moving to alternative vendors in the future: vendor lock-in. I’ve seen the obsession with this risk lead to
policies banning the use of common, easily ported services like DBaaS. Other organizations invest in
building or buying products that promise to create an abstraction layer over the cloud vendor. Doing this
adds complexity and cost that is rarely justified by a sensible evaluation of risks and tradeoffs.

Toolchains for Infrastructure and Platforms


Given an IaaS platform with an API that can be used to provision and configure
infrastructure resources, we need tools that help us to use that API easily. There
are different types of tools we can use. Some tools differ in the way we can use
them, for example, command-line tools, GUI tools, and code-based tools. Other
tools differ in the level at which they work, with some working directly with
low-level resources like network routes, and others operating at higher levels of
abstractions like a database. Some tools focus directly on managing
infrastructure resources, while others deal with infrastructure-adjacent concerns
like deploying software.

Figure 3-8 shows three major groupings of toolchains.


Figure 3-8. An overview of toolchains

An Infrastructure Management Toolchain includes the tools and services directly involved in provisioning and managing the resources of infrastructure instances on an IaaS platform.
A Platform Management Toolchain manages platform services, for example,
provisioning and configuring environments or parts of environments.
An Application Delivery Toolchain deploys and configures applications and
other software.

Simpler systems might implement these in a single toolchain, for example using
Terraform to define and provision the infrastructure for the platform services in
an environment and deploying applications into it. As a system grows,
particularly in terms of the number of people and teams working on it, a single
toolchain can become messy and difficult to maintain. Many teams find it useful
to split the tools into different sets based on their responsibilities.

The landscape of automation tooling, both open source and commercial, covers
many of these concerns and more. In general, most solutions are aimed at a
subset of concerns, but it can be tempting to stretch them more broadly. Using
Terraform code to deploy applications is one example. Implementing
infrastructure provisioning commands in the configuration of a job in a build
server like Jenkins is another.

Infrastructure management toolchains are essential to the topic of this book, so they will be covered in more depth than the others. We’ll touch on platform
management and application delivery as well, but mainly in how they relate to
infrastructure architecture. Those tools are often configured using code, so many
of the principles and practices of Infrastructure as Code described here are
relevant.

The boundaries and dependencies between the concerns of these three different
types of toolchains are a recurring topic for Infrastructure as Code, so I’ll
describe them in more detail, noting which parts of this book are most relevant.

Infrastructure Management Toolchain

The infrastructure management toolchain is whatever collection of tools, services, and scripts you use to build, deliver, and manage infrastructure code. It can be viewed as a layer between the IaaS resources and a platform service.
Your organization may not use the term infrastructure management toolchain. But
every organization that uses infrastructure code has it. Things you find there
include the obvious infrastructure stack tools, like Terraform and AWS CDK;
server configuration tools like Chef and Puppet; testing tools like Terratest and
InSpec; and infrastructure-specific delivery tools such as Spacelift and Env0.

Most teams write at least some amount of custom scripts to orchestrate their
infrastructure code. Chapter 8 digs into the tools and scripts that teams may use
for this.

Platform Management Toolchain

On the other side of a platform service are the tools, services, and other solutions
used to provision and configure it. As with the infrastructure delivery toolchain,
different organizations may use different names for these things, or may not
even clearly define them as a group.

A few examples of solutions people use for managing platform services include:

PaaS (Platform as a Service) solutions


Such as OpenShift or Tanzu. A PaaS provides a collection of platform service
implementations along with the tooling to provision, configure, and manage
them.

Platform-building frameworks
Like Kratix and Humanitec. As opposed to a PaaS, these solutions provide
tooling for teams to build and manage their own platform services, rather than
providing pre-built services.

Platform descriptor languages


For example, the Open Application Model (OAM) that people can use to
configure platform services for an environment or application. These
languages may be used by infrastructure delivery tools like Crossplane or
Pulumi Deployments to provision IaaS resources. Many teams build in-house
configuration languages and supporting tools.

Developer portals
Along the lines of Backstage, which people can use to provision platform
services (among other things).

A central platform team may use a platform management toolchain to provision environments and services. However, there is a strong movement towards
implementing solutions that empower other teams to provision, configure, and
manage instances of platform services for themselves.

In some cases, a team can use a self-service solution such as a developer portal
to manually trigger the provisioning of a platform service instance to use. In
other cases, deploying an application automatically triggers the provisioning of a
service the application requires. The latter situation requires integration with the
application delivery toolchain.

Application Delivery Toolchain

There is a wide and sprawling landscape of tools and services to automate the
build, testing, delivery, and deployment of application software. These include
build and pipelines services like those described in Chapter 8 and application
deployment services like Flux and ArgoCD.

Application runtime services and products typically provide tooling for deploying applications. Deployment tools for modern, cloud-native and
serverless runtimes are typically code-based, such as Helm for Kubernetes, and
the Serverless Framework and AWS SAM. Chapter 16 discusses this topic in
more depth.

Conclusion
This chapter moves the conversation along the journey from the conceptual stuff
to the more concrete. In the previous chapter, we set out a view of organizational
goals leading down through engineering goals to give us goals for our
infrastructure architecture. This chapter positioned enterprise architecture as the
way we implement those goals, with platforms at different layers to implement
each layer of goals.

Although engineering platforms are a larger topic than this book can cover, it’s
essential to understand the relationship that infrastructure plays in enabling
them. This leads to the topic of IaaS platforms, cloud or otherwise, that provide
the foundations for everything else in the stack. The infrastructure toolchain is
the mechanism for harnessing IaaS resources to provide platform services.

Figure 3-9 wraps up this view of the layers of platforms.


Figure 3-9. Key platform concepts for infrastructure

Having touched on the topic of infrastructure toolchains, the next chapter, Chapter 4, discusses the fundamental concepts of defining Infrastructure as
Code.

1 For another definition of platform, see What I Talk About When I Talk
About Platforms from my former colleague Evan Bottcher.

2 When I went to look for resources to share about enterprise architecture, I struggled to find something that was both useful and that specifically addressed
architecture at the enterprise level, rather than software architecture in general.
Gregor Hohpe’s talk at the 2017 YOW! conference, Enterprise Architecture =
Architecting the Enterprise, gives a particularly good view of enterprise
architecture as connecting the business strategy with the IT architecture.

3 For more on software delivery capabilities, see Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, by
David Farley and Jez Humble (July 2010, Addison-Wesley), and Continuous
Delivery Pipelines: How To Build Better Software Faster, by David Farley (Feb.
2021)

4 As I explain in Chapter 19, observability and monitoring are not the same thing.

5 REST is REpresentational State Transfer, an architectural style used for web service APIs.

6 The US National Institute of Standards and Technology (NIST) has an excellent definition of cloud computing: “The capability provided to the
consumer is to provision processing, storage, networks, and other fundamental
computing resources where the consumer can deploy and run arbitrary software,
which can include operating systems and applications. The consumer does not
manage or control the underlying cloud infrastructure but has control over
operating systems, storage, and deployed applications; and possibly limited
control of select networking components (e.g., host firewalls).”

7 Network File System, Andrew File System, and Server Message Block,
respectively.

8 37 Signals CEO David Heinemeier Hansson sparked a popular tech media meme about “cloud repatriation” in his post We stand to save $7m over five
years from our cloud exit. Charles Fitzgerald, an analyst, regularly writes about
his doubts that this is a meaningful industry trend, such as his post
Platformonomics Repatriation Index – Q4 2022: The Search Continues.
Chapter 4. Defining Infrastructure as Code

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

This will be the 4th chapter of the final book. Please note that the GitHub repo
will be made active later on.

If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.

In Chapter 1, I identified three core practices that help you to manage infrastructure effectively: define everything as code, continuously test and deliver all work in progress, and build small, simple pieces. This chapter focuses on the first of these core practices.

Understanding how and why current paradigms for defining infrastructure as code evolved helps to make sense of the current landscape of approaches. Understanding the different approaches supported by current tools helps you decide how to implement your infrastructure, choosing between alternative styles of languages and tools. This chapter lays the conceptual groundwork that influences how to design and deliver infrastructure codebases.

There are simpler ways to provision infrastructure than writing code and then
feeding it into a tool. You could follow the “ClickOps” approach by opening the
platform’s web-based user interface in a browser, then poking and clicking an
application server cluster into being. Or you could embrace “CLI-Ops” by
opening a prompt and using your command-line prowess to wield the IaaS
platform’s CLI (command-line interface) tool to forge an unbreakable network
boundary.

As you may recall, Chapter 1 listed several advantages of defining Infrastructure as Code over alternative approaches, including repeatability, consistency, and visibility. Building infrastructure by hand, whether with a GUI or on the command line, creates systems that are unmaintainable, inconsistent, and confusing.

Infrastructure as Code emerged from task-focused scripting and embraced idempotent, declarative coding models. As the field has grown to encompass wider and more varied parts of complex IT systems, it has run into limits that have led to the re-introduction of imperative (procedural and object-oriented) languages. So we'll examine how different language models apply to different situations, and how to manage the boundaries between them.

The Basics of Defining Infrastructure as Code
Systems administration has always involved automating tasks by writing scripts,
and often developing more complex in-house tools using languages like Perl,
Python, and Ruby. But Infrastructure as Code is a different approach. Task-
focused scripting focuses on activities within an overall process, such as “create
a server” or “add nodes to a container cluster.” A symptom of task-focused
scripting is wiki pages and other documentation that list steps to carry out the
overall process, including guidance on how to prepare things before running the
script, and perhaps arguments to pass to the script.

Task-focused scripting usually needs someone to make decisions to carry out a complete process. Infrastructure as Code, on the other hand, pulls decision-making knowledge into the act of writing definitions, specifications, code, and tests. Provisioning, configuring, and deploying systems then become hands-off processes that can be left to the machines. Getting this right creates capabilities like automated scaling, recovery, and self-service platform services.

Capturing decision-making in code and specifications also creates the opportunity for more effective knowledge sharing and governance. Anyone can understand how the system is implemented in practice directly from the code, and can use the code and its history to troubleshoot and fix issues. Compliance can be assured and proven by reviewing the history of code changes.

What You Can Define as Code

The “as code” paradigm works for many different parts of infrastructure as well
as things that are infrastructure-adjacent. A partial list of things to consider
defining as code includes:

IaaS resources
Collections of infrastructure resources provisioned on an IaaS platform are
defined and managed as stacks, as described in Chapter 10.

Servers
Managing the configuration of operating systems and other elements of
servers was the focus of the first generation of Infrastructure as Code tools, as
discussed in Chapter 17.

Hardware devices
Even most physical devices can be provisioned and configured using code.
Chapter 17 describes automating bare-metal server provisioning. Software
Defined Networking (SDN) can be used to automate the configuration of
networking devices like routers and firewalls.

Application deployments
Application deployment has moved decisively away from procedural
deployment scripts in favor of immutable containers and declarative
descriptors. See Chapter 16.

Delivery pipelines
Continuous Delivery pipeline stages for building, deploying, and testing
software can and should be defined as code (Chapter 8).

Platform services
Other services such as monitoring, log aggregation, and identity management
should ideally be configured as code. Services provided by an IaaS platform (see "Providing Platform Service Functionality") are easy to configure using the same tools you use to define infrastructure stacks on the same platform. But most other tools, whether packaged or SaaS, should be
configurable using APIs and code as well. Many infrastructure stack tools
like Terraform and Pulumi support plugins and other extensions that can
configure third-party software as well as IaaS resources. Configuring
platform services as code has the added benefit of making it easier to
integrate infrastructure with other resources such as monitoring and DNS.

Tests
Tests, monitoring checks, and other validations should all be defined as code.
See Chapter 7 for more.

Choose Tools With Externalized Configuration

Infrastructure as Code, by definition, involves specifying your infrastructure in files that you store and manage separately from the tools that apply them: externalized configuration. However, some infrastructure automation tools use a "closed-box" model. With this approach, users edit and manage specifications with a UI or perhaps an API, but the tool manages the specification data. It may seem convenient to have the tool take care of this for you, rather than forcing you to shuffle files around. But in practice, a closed-box tool restricts you to whatever functionality is provided by the tool itself.
Using a tool where the configuration is stored separately from the tool itself
creates options for using the massive ecosystem of tools, services, and
techniques available for managing code and other text files. Some examples of
the benefits of externalized specifications and configurations include:

Use your preferred IDE or text editor, most of which have advanced
functionality and conveniences,
Use a full-featured, off-the-shelf version control system,
Apply specifications and configurations to multiple instances, which is
particularly useful for developing and testing changes safely before applying
them to business-critical systems,
Break specifications and configurations into separate components, so they can
be separately developed and tested, avoiding issues with working on a shared
instance of a closed-box system,
Automatically trigger tests and other activities when a specification is
changed, deployed, or promoted between instances,
Integrate workflows, such as integration testing, across different tools and
systems. For example, you can test integration between an application and its
deployment infrastructure when either side is changed,
Track and record changes across different tools and systems.

LESSONS FROM SOFTWARE SOURCE CODE

The externalized specification model mirrors the way most software source code works. Some visual
development environments, like Visual Basic, store and manage source code behind the scenes. But for
nontrivial systems, developers find that keeping their source code in external files is more powerful.
It is challenging to use Agile engineering practices such as TDD, CI, and CD
with closed-box infrastructure management tools. A tool that uses external code
for its specifications doesn’t constrain you to use a specific workflow. You can
use an industry-standard source control system, text editor, CI server, and
automated testing framework. You can build delivery pipelines using the tools
that work best for you, and integrate testing and delivery workflows with
software and other system elements.

Manage Your Code in a Source Code Repository

Code and configuration for infrastructure and other system elements should be
stored in a source code repository, also called a Version Control System (VCS).
These systems provide loads of useful features, including tracking changes,
comparing versions, and recovering and using old revisions when new ones have
issues. They can also trigger actions automatically when changes are committed,
which is the enabler for CI jobs and CD pipelines, as discussed in Chapter 8.

SECRETS IN SOURCE CONTROL

One thing that you should not put into source control is unencrypted secrets, such as passwords and keys.
Even if your source code repository is private, its history and revisions of code are too easily leaked. Secrets
leaked from source code are one of the most common causes of security breaches. See Chapter 11 for better
ways to manage secrets.

Languages for Coding Infrastructure


The practice of using scripts, configuration files, and programming languages to
manage infrastructure has evolved over the past few decades and is still evolving
to this day. Infrastructure teams are confronted with the question not only of
which tools to select for infrastructure automation and which languages to prefer
but also what type of language to use. The different models of infrastructure
coding languages can be understood by looking at how they have evolved.

As mentioned in the previous section, system administrators have long used scripts and general-purpose programming languages to automate infrastructure tasks. CFEngine pioneered the use of declarative, idempotent, domain-specific languages (DSLs) for infrastructure management in the 1990s. Puppet and then Chef emerged alongside mainstream server virtualization and IaaS cloud in the 2000s. Ansible, SaltStack, and others followed.

As IaaS cloud platforms emerged, pioneered by AWS, many people reverted to writing procedural code in languages like Ruby and Python to manage the resources they provided. We used SDKs to interact with the IaaS platform API, perhaps using a library like boto or fog.

Stack-oriented tools like Terraform and CloudFormation emerged in the 2010s, using a declarative DSL model similar to that used by server-oriented tools like Puppet and Chef.

More recently a new generation of tools for working with IaaS infrastructure has
reinvigorated interest in using general-purpose programming languages to define
infrastructure. Pulumi and the AWS CDK (Cloud Development Kit) support
languages like Typescript, Python, and Java.

Additionally, the growing popularity of Kubernetes as a platform for orchestration has led to the emergence of frameworks that leverage it to manage infrastructure. This approach is sometimes called Infrastructure as Data, which I'll discuss in a bit more detail later in this chapter.

The key attributes of infrastructure coding languages are idempotency, domain-specific languages, and declarative or imperative programming.

Idempotent Code

Many task-focused scripts are written to be run only when an action needs to be carried out or a specific change made, but can't be safely run multiple times. For example, this script creates a virtual server using a fictional infrastructure tool called stack-tool1:
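A sketch of what such a script might look like; since the tool is imaginary, the subcommands and options shown here are illustrative:

    #!/bin/sh
    # Create a new virtual server named "my-server" with 4GB of RAM.
    # Nothing stops this command from creating another server with the
    # same name each time the script runs.
    stack-tool server create --name my-server --ram 4GB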

If you run this script once, you get one new server. If you run it three times, you
get three new servers2. This makes perfect sense, assuming the person running it
knows how many servers are already running, and how many need to be
running. In other words, the script doesn’t include all of the knowledge needed
to make decisions, and so leaves decision-making to whoever runs it. A script
that isn’t idempotent doesn’t support the hands-off approach we get with
Infrastructure as Code.

We can change the script to check whether the server name exists and refuse to create a new one if so. The following snippet runs a fictional command to check whether the server exists, and only creates a new server if the check exits with a value of "1":
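A sketch of the idempotent version, assuming the imaginary tool has a subcommand for checking whether a server exists:

    #!/bin/sh
    # "server exists" is assumed to exit with "1" when no server with
    # the given name is found.
    stack-tool server exists --name my-server
    if [ $? -eq 1 ]; then
      stack-tool server create --name my-server --ram 4GB
    fi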

The script is now idempotent. No matter how many times we run the script, the
result is the same: a single server named “my-server”. If we configure an
automated process to run this script continuously, we can be sure the server
exists. If the server doesn’t already exist, the process will create it. If someone
destroys the server, or if it crashes, the process will restore it.
But what happens if we decide the server needs more memory? We can edit the script to change the memory argument to 8GB. But if the server already exists with 4GB of RAM, the script won't change it. We could add a new check, taking advantage of some convenient options available in the imaginary stack-tool:
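One way the script might look, assuming the imaginary tool can list matching server IDs and change an existing server's memory allocation:

    #!/bin/sh
    # Capture the ID of the existing server, if there is one, into a file.
    stack-tool server list --name my-server --output-ids > /tmp/server_id

    if [ -s /tmp/server_id ]; then
      # The server exists, so make sure it has 8GB of RAM.
      stack-tool server change --id "$(cat /tmp/server_id)" --ram 8GB
    else
      # No server found, so create one with the right memory setting.
      stack-tool server create --name my-server --ram 8GB
    fi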

The modified script now captures the ID of the existing virtual server, if found, into a file. It then passes the ID to stack-tool with a command to change the existing server's memory allocation to 8GB. I've also changed the script so that if the server doesn't already exist, the command to create a new one creates it with the right memory setting.

Now we have an idempotent script that ensures the server exists with the right
amount of memory. However, scripts like this become messy over time as needs
change and more conditionals are added. A declarative language makes
infrastructure definitions easier to maintain and understand.

Declarative Infrastructure Languages

Many infrastructure code tools, including Ansible, Chef, CloudFormation, Puppet, and Terraform, use declarative languages. Your code defines what you want your infrastructure to look like, such as how much memory and disk space you want your server to have, and what operating system you want it to run. The tool handles the logic of how to make that desired state come about.

This example creates the same virtual server instance as the earlier examples:
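A sketch in a fictional declarative language, defining the same server as the scripts above:

    virtual_server:
      name: my-server
      ram: 4GB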

This code doesn’t include any logic to check whether the server already exists
or, if it does exist, how much memory or disk space it currently has. The tool
that you run to apply the code takes care of that. The tool checks the current
attributes of infrastructure against the code and works out what changes to make
to bring the infrastructure in line. So, in this example, to increase the RAM of
the application server you would edit the file and rerun the tool.

Declarative infrastructure tools like Terraform and Chef separate what you want from how to create it. As a result, your code is cleaner and more direct. Declarative code is inherently idempotent as well: the tool can apply the code repeatedly, and as often as you like, without harm. Defining infrastructure declaratively removes the need to have the right knowledge to make decisions about where and when to apply the code, which means we can push our code into automated delivery systems.

IS DECLARATIVE CODE REAL CODE?

Some people dismiss declarative code as being mere configuration rather than “real” code. “Real” code is,
in their thinking, imperative code, which means either procedural (like C) or object-oriented (like Java)3.

I use the word code to refer to both declarative and imperative languages. I don’t find the debate about
whether a coding language must be Turing-complete useful. I even find regular expressions useful for some
purposes, and they aren’t Turing-complete either. So, my devotion to the purity of “real” programming may
be lacking.

Programmable Infrastructure Languages

Declarative code is fine when you always want the same outcome. However,
there are situations where you want different results that depend on the
circumstances. For example, the following code creates a set of VLANs. The
ClotheSpin team’s cloud provider has a different number of data centers in each
country, and the team wants its code to create one VLAN in each data center. So
the code needs to dynamically discover how many data centers there are, and
create a VLAN in each one:
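A sketch in a fictional imperative language; the names get_data_centers, get_address_segment, address_space, and data_center_index are inventions for this example:

    address_space = "10.2.0.0/16"
    data_centers = get_data_centers()

    data_centers.each_with_index do |data_center, data_center_index|
      vlans.add(
        name: "vlan-#{data_center.name}",
        data_center: data_center.id,
        ip_range: get_address_segment(
          address_space,
          data_centers.count,
          data_center_index
        )
      )
    end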

The code also assigns an IP range for each VLAN, using a fictional but useful method called get_address_segment. This method takes the address space declared in the address_space variable, divides it into several smaller address spaces based on the value of data_centers.count, and returns one of those address spaces, the one indexed by the data_center_index variable.

This type of logic can’t be expressed using declarative code, so most declarative
infrastructure tools extend their languages to add imperative programming
capability. For example, Ansible adds loops and conditionals to YAML.
Terraform’s HCL configuration language is often described as declarative, but it
combines three sublanguages, one of which is expressions, which includes
conditionals and loops.

Newer tools, such as Pulumi and AWS CDK, return to using programmatic
languages for infrastructure. Much of their appeal is their support for general-
purpose programming languages (as discussed in “General-Purpose Languages
Versus DSLs for Infrastructure”). But they are also valuable for implementing
more dynamic infrastructure code.

Rather than asking whether a declarative or imperative infrastructure language is the right one to use for infrastructure, we should ask which concerns within our system each one is more suited to. We then need to design our systems to separate those concerns so we can use the right language for each job.

AVOID MIXING DECLARATIVE AND IMPERATIVE CODE

Imperative code is a set of instructions that specifies how to make a thing happen. Declarative code
specifies what you want, without specifying how to make it happen.

Too much infrastructure code today suffers from mixing declarative and imperative code, which makes
code messy and difficult to understand. I believe this type of mixing is a result of trying to apply a single
language and single language paradigm across code that would be better separated.
An infrastructure codebase involves many different concerns, from defining infrastructure resources, to
configuring different instances of otherwise similar resources, to orchestrating the provisioning of multiple
interdependent pieces of a system. Some of these concerns can be expressed most simply with a declarative
language. Some concerns are more complex and better handled with an imperative language.

As practitioners of the still-maturing field of infrastructure code, we are learning where to draw boundaries
between these concerns. Mixing concerns can lead to code that mixes language paradigms. One failure
mode is extending a declarative syntax like YAML to add conditionals and loops. The second failure mode
is embedding simple configuration data (“2GB RAM”) into procedural code, mixing what you want with
how to implement it.

In relevant parts of this book, I point out where I believe some of the different concerns may be, and where I think one or another language paradigm may be most appropriate. But our field is still evolving. Much of my advice will be wrong or incomplete. So, I encourage you, the reader, to think about these questions and help us all to discover what works best.

Deciding Between Declarative and Imperative Languages

Declarative code is useful for defining the desired state of a system, particularly
when there isn’t much variation in the outcomes you want. It’s common to
define the shape of some infrastructure that you would like to replicate with a
high level of consistency.

For example, you normally want all of the environments supporting a release
process to be nearly identical (see Chapter 15). So declarative code is good for
defining reusable environments, or parts of environments (per the reusable stack
pattern discussed in Chapter 15). You can even support limited variations
between instances of infrastructure defined with declarative code using instance
configuration parameters, as described in Chapter 11.

However, sometimes you want to write reusable, sharable code that can produce different outcomes depending on the situation. For example, the ClotheSpin team writes code that can build infrastructure for different application servers. Some of these servers are public-facing, so they need appropriate gateways, firewall rules, routes, and logging. Other servers are internally facing, so they have different connectivity and security requirements. The infrastructure might also differ for applications that use messaging, data storage, and other optional elements.

As declarative code supports more complex variations, it needs increasing amounts of logic. At some point, you should question why you are writing logic in YAML, JSON, XML, or some other declarative language.

Programmable, imperative languages are more appropriate for building libraries and abstraction layers, which I'll cover in more detail in Chapter 13. The support these languages offer for writing, testing, and managing code libraries makes them especially useful for these purposes.

INFRASTRUCTURE AS DATA

Infrastructure as Data is a subgenre of declarative infrastructure that leverages Kubernetes as a platform for orchestrating processes.4 For example, ACK (AWS Controllers for Kubernetes) exposes AWS resources as Custom Resources (CRs) in a Kubernetes cluster. This makes them available to standard services and tools in the cluster, such as the kubectl command-line tool, to provision and manage resources on the IaaS platform.

In addition to convenience, a benefit of integrating IaaS resource provisioning into the Kubernetes
ecosystem is the ability to use capabilities like the control loop of the operator model5. Once infrastructure
resources are defined in the cluster and provisioned on the IaaS platform, a controller loop ensures the
provisioned resources remain synchronized with the definition.

Although some people consider infrastructure as data to be an alternative to Infrastructure as Code, in practice it's simply another implementation. A Kubernetes cluster with infrastructure resource CRDs embeds the functionality of an infrastructure tool like Terraform or AWS CDK. Infrastructure code is written and applied by loading it with kubectl or another tool.
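As a rough illustration (the exact API group and fields depend on the ACK controller version), an S3 bucket defined as a Kubernetes Custom Resource and applied with kubectl might look like this; the bucket name is a placeholder:

    apiVersion: s3.services.k8s.aws/v1alpha1
    kind: Bucket
    metadata:
      name: clothespin-assets
    spec:
      name: clothespin-assets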

Crossplane is an infrastructure as data product that adds the capability to define and provision
Compositions, which are collections of resources managed as a unit: in other words, a stack.

Using Kubernetes to manage the process of applying infrastructure code can make the process less visible.
Be sure to implement effective monitoring and logging so you can troubleshoot effectively.

Domain-Specific Infrastructure Languages

In addition to being declarative, many infrastructure tools use their own DSL, or
Domain-Specific Language.6

A DSL is a language designed to model a specific domain, in our case infrastructure. This makes it easier to write code, and makes the code easier to understand, because it closely maps to the things you're defining.

For example, Ansible, Chef, and Puppet each have a DSL for configuring
servers. Their languages provide constructs for concepts like packages, files,
services, and user accounts. A pseudocode example of a server configuration
DSL is:
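The following sketch is a made-up DSL; the package names, service attributes, and file paths are purely illustrative:

    package: jdk
    package: tomcat

    service: tomcat
      port: 8443
      user: tomcat
      group: tomcat

    file: /etc/tomcat/server.conf
      contents: template(server.conf.template)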
This code ensures that two software packages are installed, jdk and tomcat. It defines a service that should be running, including the port it listens to and the user and group it should run as. Finally, the code specifies that a server configuration file should be created using the template file server.conf.template.

The example code is pretty easy for someone with systems administration
knowledge to understand, even if they don’t know the specific tool or language.
Chapter 17 discusses how to use server configuration languages.

Many stack management tools also use DSLs, including Terraform and
CloudFormation. These DSLs model the IaaS platform resources, so that you
can write code that refers to virtual servers, disk volumes, and network routes.
See Chapter 10 for more on using these languages and tools.
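To give a concrete flavor, a minimal Terraform snippet defining a virtual server on AWS looks something like this (the AMI ID is a placeholder):

    resource "aws_instance" "app_server" {
      ami           = "ami-0123456789abcdef0"
      instance_type = "t3.micro"
    }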

Other DSLs model application runtime platform concepts. These model systems
like application clusters, service meshes, or applications. Examples include
Helm charts and CloudFoundry app manifests.

Many infrastructure DSLs are built as extensions of existing markup languages such as YAML (Ansible, CloudFormation, anything related to Kubernetes) and JSON (Packer, CloudFormation). Some are internal DSLs written as a subset (or superset) of a general-purpose programming language. Chef is an example of an internal DSL written as Ruby code. Others are external DSLs, which are interpreted by code written in a different language. Terraform HCL is an external DSL not related to the Go language its interpreter is written in.

GENERAL-PURPOSE LANGUAGES VERSUS DSLS FOR INFRASTRUCTURE

The rise in interest in moving away from declarative languages is driven in part by use cases where
procedural languages are more appropriate. Another reason people like tools like AWS CDK and Pulumi is
that they support coding in general-purpose programming languages like Python and JavaScript rather than
DSLs. Many people, especially those with a background in software development, are more comfortable
using a familiar language rather than learning a new one.

Beyond using existing language skills, popular general-purpose languages have broad ecosystems of tools
for working with their code. These languages are very well supported by IDEs (Integrated Development
Environments) with productivity features like error highlighting and code refactoring. Using these
languages also gives access to a much richer selection of tools and frameworks for activities like static code
analysis and unit testing.

A general-purpose language can be useful for building lower-level abstraction layers, libraries, and frameworks for infrastructure. However, these languages are often more verbose than needed for higher-level infrastructure definitions, obscuring "what" is being defined within the boilerplate code and logic of "how" it's implemented.

So again, it’s important to avoid choosing one tool, such as a general-purpose programming language, for
all jobs, and instead focus on designing systems with a clear separation of concerns, and using the
appropriate tool for each.

Levels of Abstraction In Infrastructure Code

Most Infrastructure as Code DSLs directly model the resources they configure. The languages used with tools like Terraform and CloudFormation are essentially thin wrappers over IaaS APIs. For example, the Terraform AWS provider's aws_instance resource directly maps to the AWS API's run_instances method.7

The IaaS vendor SDKs expose these APIs for general-purpose programming languages. The advantage of a DSL is that it provides a unifying model to simplify working with the APIs and the resources that they create. For example, an infrastructure DSL hides the logic needed to make your code idempotent. This is the advantage of using a tool like Pulumi or the AWS CDK to write infrastructure code in JavaScript, for example, over directly using the AWS JavaScript SDK.

Many teams use infrastructure code languages to build abstraction layers over
the infrastructure resources provided by the IaaS platform. Doing this can help
people use infrastructure without having to implement the gritty details of, for
example, wiring up network routes. Tools or languages that expose infrastructure
at this abstracted layer tend to focus on application deployment and
configuration, as discussed in Chapter 9.

Code Is Not Infrastructure


The ability to treat infrastructure like software opens many possibilities, such as
applying well-proven software design principles and patterns to our
infrastructure architecture. But differences between how infrastructure code and
application code work can confuse things, leading us to force-fit concepts and
techniques even when they’re not appropriate. It’s useful to consider the
differences between how infrastructure code works and how application code
works.

For example, refactoring application code is usually straightforward: edit the code, compile, and deploy. However, refactoring infrastructure code in an editor is disconnected from how the changes will be applied to real infrastructure. An IDE gives you control to carefully update references to a resource when you modify it. Controlling the way the code is applied to IaaS resources is a different matter.

The Context For Executing Code

Writing code to define infrastructure can create some confusion. The code we
write for an application is compiled, deployed, and then executed at run-time, as
shown in Figure 4-1.8.
Figure 4-1. Application code executes in the runtime context

Some infrastructure code is compiled as well. But whether it’s compiled or not,
infrastructure code doesn’t execute in the runtime environment like application
code. Rather, it executes in the delivery context, as shown in Figure 4-2.

Figure 4-2. Infrastructure code executes in the deployment context

The infrastructure code you write defines what happens for the deployment of
the infrastructure. It causes infrastructure resources to be provisioned in the IaaS
platform, but your infrastructure code only affects the way that infrastructure
behaves indirectly.

This difference may seem obvious, but it has implications for things like testing.
If we write a unit test for infrastructure code, does it tell us about the
infrastructure our code creates, or does it only tell us about what happens when
the infrastructure code is executed? For example, if our code creates networking
structures, can we write unit tests that prove that those structures will route
traffic the way we expect? Chapter 7 discusses approaches for automated
infrastructure testing that consider different layers of testing.

At each stage, additional elements such as code modules and the infrastructure
tool itself come into the mix, adding to the distance between the code and reality.
For example, when there is a problem executing application code, a developer can trace the progress of execution through the source code, perhaps using a debugger. But a debugger won't trace the execution of our infrastructure code; instead, it traces the execution of the infrastructure tool's code.

An infrastructure tool might output logs that help to understand what is happening with our infrastructure code at each point in the process. But it may not be feasible to, for instance, analyze memory usage of different parts of our infrastructure code.

NOTE

One team struggled with running out of memory while running their infrastructure code. Their first instinct
was to analyze the code to understand which parts of the infrastructure were using the most memory, so
they could optimize that code. However, infrastructure code doesn’t correlate to infrastructure tool memory
usage in the same way that application code does. In the end, we realized that the true issue was that the
infrastructure project, although divided into modules, was simply too large. So the solution was to break the
infrastructure into smaller stacks, as discussed in Chapter 10.

More confusion comes with tools that compile infrastructure code we write in
one language into another language. For example, AWS CDK allows developers
to write infrastructure code in application development languages like Python
and JavaScript, and then compile it to CloudFormation templates. Developers
can then use various tools and other support for the programming language, such
as IDE refactoring features and unit test frameworks. However, it’s important to
keep in mind the differences not only between the code and the resources it
creates but also differences between the code as developed and tested and the
code that is generated to be applied to the instances. The fact that this code
transitions from an imperative (procedural or object-oriented) language to a
declarative language (JSON in the case of the CDK) adds to the gap between
code and reality.

SERVERLESS CODE

Chapter 16 discusses serverless as an application runtime platform. Serverless redraws boundaries between
application deployment and infrastructure provisioning. Serverless application code is arguably deployed
every time it’s executed, together with at least part of its infrastructure, the runtime environment. Efforts at
optimizing serverless applications include deploying some infrastructure resources ahead of time, pre-
packaging some in containers, and perhaps caching others.
Infrastructure Code and Resource Instances

Another peculiarity with infrastructure code is the gap between the code and the
actual resources allocated on the IaaS platform. These two things are consistent
at the point in time when the code is applied. At any other time, there is no
guarantee they are the same. The actual infrastructure may change if someone
makes a change outside of the code using the cloud UI or a command-line tool.
It’s also possible that different versions of the same code can be applied to a
single instance, creating a gap.

CROSSED CODE

Recently, an infrastructure engineer on a team I was working with was confused. He was editing some
Terraform code and applying it to the test environment, but a few minutes later he found that the resources
in the environment didn’t match his code. He applied again and it seemed fine. But when he made another
change and applied it, his change failed because the environment was still out of whack. He went to post on
the team’s Slack and saw a teammate reporting the same issue: weird stuff was happening to the test
environment infrastructure. Then the penny dropped. Both engineers were editing and applying their own
local copy of the code to the same test environment, reverting each other’s changes.9

The difference between code and instance is sometimes called configuration drift. The same term is also used to describe differences between different instances of the same infrastructure, such as across environments on the path to production for testing and delivering software.

Some infrastructure tools, including Terraform and Pulumi, have a "plan" command to compare code with an instance and identify differences. People run a plan before applying a change to preview what will change before running apply. But plan commands can also be used to identify drift.
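With Terraform, for example:

    terraform plan    # preview differences between the code and the instance
    terraform apply   # apply the changes to the instance

Running plan against an instance that nobody has deliberately changed, and seeing unexpected differences, is a simple way to detect drift.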

As discussed later in this book (Chapter 20), teams should ensure that, for any
shared instance of infrastructure, the code is only ever applied from a centralized
service. A centralized service can ensure that the correct version of the code is
applied, avoiding situations where individual engineers run different local copies
or branches.

Infrastructure State

Infrastructure tools need a way to know which resources defined in code correspond to which resource instances on the IaaS platform. This mapping is used to make sure a change to the code is applied to the correct resource. IaaS platforms can handle this internally for their own infrastructure tools. For instance, when you run AWS CloudFormation, you pass an identifier for the stack instance, which the AWS API uses as a reference to an internal data structure that lists the resources that belong to that stack instance.

Tools from third-party vendors, like Terraform and Pulumi, need their own data structures to manage these mappings of code to instances. They store these data structures in a state file for each instance. Early versions of these tools required users to handle storage of the state files, but more recent versions add support for hosted services like Terraform Cloud and Pulumi Cloud.
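As an illustration, a Terraform project can declare a remote backend for its state file; this minimal sketch uses an S3 bucket (the bucket and key names are placeholders):

    terraform {
      backend "s3" {
        bucket = "clothespin-infra-state"
        key    = "environments/test/terraform.tfstate"
        region = "eu-west-1"
      }
    }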

Although many people prefer having their instance state handled transparently
by the platform, it can be useful to view and even edit state data structures to
debug and fix issues10.

TREAT INFRASTRUCTURE CODE LIKE REAL CODE

Many infrastructure codebases evolve from configuration files and utility scripts into unmanageable messes.
Too often, people don’t consider infrastructure code to be “real” code. They don’t give it the same level of
engineering discipline as application code. To keep an infrastructure codebase maintainable, you need to
treat it as a first-class concern.

Design and manage your infrastructure code so that it is easy to understand and maintain. Follow code
quality practices, such as code reviews, pair programming, and automated testing. Your team should be
aware of technical debt and strive to minimize it.

Chapter 5 describes how to apply various software design principles to infrastructure, such as improving
cohesion and reducing coupling. Chapter 9 explains ways to organize and manage infrastructure codebases
to make them easier to work with.

Next-Generation Infrastructure
I’ve alluded to some of the limitations of Infrastructure as Code as a model for
managing infrastructure, such as the gap between code and reality. As I write
this, some companies are exploring ways to evolve beyond these limitations. I
can’t predict which of their ideas will take off, which will fade away, and what
other ideas may emerge over the next few years, or even before this book is
published.

However, there are at least two interesting directions suggested by the current
efforts. One is bridging the gap between applications and infrastructure. The
other is bridging the gap between infrastructure code and provisioned resources.
While most of these tools are not mature enough for most teams to consider
using for business-critical systems, they are worth watching.

Infrastructure From Code

Several startups are addressing the experience of developing applications and infrastructure together (two examples are Darklang and Wing). Most of those I've seen provide their own IDE and development language for building application software and infrastructure. Developers can directly specify the integration between application code and the infrastructure resources it uses, such as network ports, disk storage, and message queues.

In this book, I give strong guidance to separate concerns between applications and infrastructure, which these tools may appear to contradict. For example, imagine a developer writing code to save a new customer's registration information in a database. Mixing the code that handles the registration information with the detailed infrastructure code that provisions and configures the database wouldn't make sense.

However, the intent of these languages is not to intermix code at the level
normally written in Terraform with business logic. Instead, business logic code
can specify the relevant attributes of infrastructure at the right level of detail for
the context. The user registration code can specify that the data should be saved
to a database that is configured for handling personal customer information. The
code calls a separately-written library that handles the details of provisioning
and configuring the database appropriately.

So the system can be designed to separate the concerns of business logic and
detailed infrastructure configuration. However, the concerns that are relevant
across the boundaries can be managed explicitly. The current paradigm of
defining and deploying infrastructure separately relies on out-of-band knowledge
to know that the database needs to be configured for personal data. It also
involves brittle integration points that we need to configure explicitly on both
sides, such as connectivity and authentication.

Integrating application and infrastructure development means we can redraw the boundaries of applications vertically, aligning infrastructure with the logic it supports, rather than horizontally.11

Infrastructure as Model

Earlier, I pointed out the challenges of the gap between code and resources
provisioned on IaaS. A given version of code may be different from what was
last applied to provision or change infrastructure, and the infrastructure
resources provisioned may have changed from both of those points.
Infrastructure as data aims to eliminate the gap between the code that was
applied and the provisioned resources, by continuously re-synchronizing the
code. But there are still gaps, especially when it comes to helping operators to
understand the current state of their infrastructure.
The team at System Initiative has shared a demo and details of their work on a
new tool that builds an interactive model of the current state of infrastructure.12

At first glance, System Initiative's tool looks similar to the ClickOps approach that I disparaged at the start of this chapter. However, their extensible implementation has the potential to handle many of the limitations of ClickOps.
For example, users can use the interactive interface to prepare a change set and
carry out checks and approvals before applying it to the provisioned resources.
This suggests we would be able to implement tests and other validations, support
multiple people working on a system concurrently, and potentially replicate
changes consistently across environments.

Having an interactive model that can be updated from the real infrastructure in
real-time would shrink the feedback loop for working on changes. While
working on a potential change, an engineer can refresh the model to see new
changes to the live system and how they will impact their work.

As with the tools for integrating application and infrastructure code, this is an
early iteration of the concept, essentially an experiment. But it’s heartening to
see people exploring ways to advance our ways of working. It’s important to be
aware that the current state of infrastructure management approaches and tools is
only a step on a journey.

Conclusion
The topics in this chapter could be considered "meta" concepts for Infrastructure as Code. However, it's important to keep in mind that the goal of making routine tasks hands-off for team members is what makes Infrastructure as Code more powerful than writing scripts to automate tasks within a hands-on workflow. Considering how different language attributes like idempotency affect the maintainability of infrastructure code helps in selecting more useful tools for different jobs in our system. Keeping the differences between infrastructure code and application code in mind can help avoid traps in the analogy of infrastructure as software.

This chapter closes out the Foundational chapters of the book (Part I). The
following chapters discuss the more concrete topic of infrastructure stacks, the
core architectural unit of Infrastructure as Code (Part III).

1 I use fictional tools, languages, and platforms throughout this book. Imaginary tools are nice because their features and syntax work exactly the way I need for any example.

2 You might hope that the server name argument will prevent the tool from creating multiple copies of the server. But most IaaS platforms don't treat user-supplied tags or names as unique identifiers. So this example creates three servers with the same name.

3 Functional programming is a subset of declarative programming, as procedural and object-oriented programming are subsets of imperative programming.
4 See I do declare! Infrastructure automation with Configuration as Data, by
Kelsey Hightower and Mark Balch.

5 See https://kubernetes.io/docs/concepts/architecture/controller/

6 Martin Fowler and Rebecca Parsons define a DSL as a "small language, focused on a particular aspect of a software system" in their book Domain-Specific Languages (Addison-Wesley Professional).

7 You can see this in the documentation for the Terraform aws_instance resource and the AWS run_instances API method.

8 Although code written in interpreted languages like Ruby and Python is deployed first and compiled at run-time, the execution still happens in the runtime context.

9 See Chapter 20 for techniques for avoiding code clashes between people
working on infrastructure code.

10 I strongly recommend avoiding "infrastructure surgery" as an approach for changing infrastructure. It's best left as an emergency measure for when a safer change management process fails. See Chapter 8 for more.

11 See Gregor Hohpe’s article, IxC: Infrastructure as Code, from Code, with
Code, for an exploration of various combinations of infrastructure, architecture,
and code and their implications.

12 For more, see the System Initiative website. As of this writing, there is a downloadable demo and videos. The company's founder, Adam Jacob, has also said they intend to make the code available as open source.
Part II. Core Topics
Chapter 5. Infrastructure Components

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

This will be the 6th chapter of the final book. Please note that the GitHub repo
will be made active later on.

If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.

One view of system design is that it’s about grouping elements and defining
relationships between those groups. We define these groupings as architectural
units, or components, that we can use in diagrams and discussions. Some
common components for software design include applications, microservices,
libraries, and classes. This chapter will describe a few components that we can
use for Infrastructure as Code.

Our industry does not have widely agreed definitions of what components are
relevant for Infrastructure as Code, or what to call them. So for the purposes of
this book, I’ve defined a set of four components. These components are IaaS
resources, code libraries, stacks, and infrastructure products. Each of these
covers a different level of scope, usually by aggregating lower-level structures.
Code libraries and stacks aggregate one or more IaaS resources. A stack may
optionally aggregate code libraries. An infrastructure product aggregates stacks.

NOTE

Your infrastructure probably doesn’t use this set of components and terms. However, they are helpful for
framing patterns and approaches for defining systems with Infrastructure as Code. You should be able to
adapt and apply these patterns and approaches in this book to your system. It would be a useful exercise,
once you’ve read through this chapter, to consider your infrastructure system’s design and how to map it to
the terms I’ve used here.

Both code libraries and stacks are defined by the infrastructure tool, even if the
tool uses different names for them. For example, Terraform calls code libraries
modules, and CDK calls them level 3 constructs. Although Terraform doesn’t
have a term for stacks, it has projects (for source code), and its state files
correlate with a provisioned stack. Pulumi and CDK both use the term stack,
while Crossplane has compositions. Most tools don’t have an inherent concept of
infrastructure products, so this component is least likely to be familiar.

While the level of scope is the most obvious characteristic of the different
components, the context is particularly important for infrastructure components.
So let’s explore this in more detail.

Design Contexts for Infrastructure Components
Designing infrastructure as code adds another dimension, beyond scope, for considering how to use different types of components: the context. Chapter 4 described three different contexts for infrastructure code: source code, delivery, and runtime, as shown in Figure 5-1. When we consider the architecture and design of infrastructure using code, it's important to understand how different components apply in these different contexts.

Figure 5-1. Different contexts for infrastructure code

The idea of examining different components based on the context they’re used in
may be less familiar than considering the levels of scope (high-level components
versus lower-level components). So let’s consider how these three contexts
relate to infrastructure design.
The Runtime Context represents the infrastructure once it has been provisioned
on the IaaS platform. This is the context where the resources are used to run
workloads. The purpose of Infrastructure as Code is to shape the runtime
context, so most design activities start by defining the resources as they will
appear there.

Design for the Source Code Context focuses on source code repositories, folder
structures, and organization of the files that contain infrastructure code. Chapter
9 discusses these topics in detail. Infrastructure code libraries also live in this
context, which is a topic covered in depth in Chapter 13.

The Delivery Context sits between the source code and runtime contexts, being
concerned with turning infrastructure code into usable provisioned resources on
an IaaS platform. Design for this context involves questions of how to organize
infrastructure to apply it with your infrastructure code tool. Do you provision
everything as a single group, or do you break it into multiple groups that you can
deliver and provision separately?

Some of the most difficult challenges of delivering and managing infrastructure as code are found in the delivery context. The capability to easily and safely change infrastructure in the runtime context is almost entirely driven by the design of the infrastructure code in the delivery context.

Now that we’ve established these two dimensions for considering components
for Infrastructure as Code, scope and context, we can define some of the
components used in this book.
The Infrastructure Components
The four components used most in this book are summarized in Table 5-1.

Table 5-1. Infrastructure Components By Scope

High-Level: Infrastructure Product. Infrastructure resources grouped by the capability they provide to applications and other workloads. Examples: OAM, in-house.(a)

Mid-Level: Infrastructure Stack. Infrastructure resources provisioned as a group. Examples: Terraform project or state file, CloudFormation / CDK stack, Pulumi stack, Crossplane composition.

Low-Level: Code Library. Infrastructure resources grouped by how their code is shared and reused across stacks. Examples: Terraform module, CDK Level 3 construct, CloudFormation module.

Primitive: IaaS Resource. The smallest unit of infrastructure that can be independently defined and provisioned. Examples: CDK Level 1 / 2 construct, Terraform resource, ACK resource.

(a) As of this writing, there are few, if any, off-the-shelf implementations that help you to define and configure infrastructure at the capability level. Many teams implement capabilities as stacks, rather than aggregating multiple stacks for this.

You may have picked up hints in the descriptions of how each of these
components’ relevance varies with the infrastructure code context. We work
with IaaS resources across all contexts. Code libraries, such as Terraform
modules, are mainly relevant in the source code context, as they are a
mechanism for sharing and reusing code across projects.

Stacks are the most relevant component in the delivery context because they define how resources are grouped for provisioning. Infrastructure products are relevant to how infrastructure is allocated to and consumed by workloads, so they can be relevant across all contexts. They are often used to organize source code, orchestrate provisioning across stacks, and manage groups of resources at runtime.

Generally, IaaS resources are the only components that are visible in the runtime
system. One exception to this is where stack structures are managed by the IaaS
platform, as with AWS CloudFormation. Otherwise, teams can implement
tracking and management of higher-level components through tagging and
permissions. For example, permission to manage different infrastructure
products can be restricted by runtime authorization policies.

Before using these components in an infrastructure design, it's important to understand the purpose the infrastructure will serve, which is to run workloads. Understanding how to think about workloads sets the scene for how to think about infrastructure components.

Starting Infrastructure Design With Workloads

The starting point for designing infrastructure is understanding the workloads that it will support. A good way to begin an infrastructure design project is by engaging with the people responsible for the software. Collaborate with them to create a picture of the workloads and the infrastructure resources needed to support them.

A Workload is software that runs on infrastructure. A workload could be a user-facing application, a back-end business service, a microservice, or even a platform service such as a monitoring server.

Figure 5-2 shows a part of the application architecture for the ClotheSpin online
store.

Figure 5-2. Example workload for ClotheSpin

ClotheSpin has a website front end and a set of mobile applications. These share
services for product browsing, searching, shopping carts, and checkout, among
others. The mobile applications communicate with a “BFF” (Backend For
Frontend) service that connects with the shared services. For our examples, we’ll
focus on the website, and the two shared services used to browse products and
add them to a shopping basket.

Figure 5-3 adds a high-level view of the infrastructure capabilities needed to support some of the software behind the storefront website.
Figure 5-3. Example infrastructure resource design

The diagram shows infrastructure capabilities that are needed specifically for
each software service, such as static website content storage for the website, and
separate database instances for the product browsing service and the shopping
cart service. It also shows that some infrastructure capabilities are shared by
more than one service, such as container hosting.

An infrastructure capability diagram like this one shows the infrastructure domain at a high level. This gives us the first step in designing the infrastructure to run the software for ClotheSpin's business. The next step is to design the higher-level architecture components to implement the infrastructure capabilities.

Infrastructure Products

An Infrastructure Product is a collection of IaaS resources organized around a workload-relevant concern. The contents of an infrastructure product and the way it is presented for configuration and use should make sense to its users, who are usually the teams responsible for configuring, deploying, and managing the applications and services that use them.

In contrast, infrastructure stacks are grouped around technical considerations, especially how IaaS resources should be grouped for provisioning. Figure 5-4 shows the contrasting concerns of products and stacks.
Figure 5-4. The contrasting concerns for infrastructure products and stacks

The diagram shows an application service as an example workload and one of the infrastructure products it uses, REST networking. The REST networking product is composed of several infrastructure stacks that define resources specific to the service (as opposed to shared networking structures like subnets), including a load balancer rule, routes, and firewall rules. In this example, these resources are split across three stacks. Later in this chapter, we'll cover drivers for splitting resources into stacks.

As I mentioned earlier, the infrastructure product is not a universally used component type for infrastructure codebases. Not many infrastructure tools support it as a concept. But in practice, many teams create custom collections of infrastructure stacks, although they use different names for the concept, like "components" or "services". They also may not design their components around workload concerns, only using them as a way to manage configuration, integration, and provisioning across larger groups of infrastructure.

Other teams don't differentiate between infrastructure products and stacks, designing stacks around workload concerns rather than having two separate layers of components, as shown in Figure 5-5.
Figure 5-5. A single infrastructure stack acting as an infrastructure product

For smaller systems this approach keeps the implementation simple, avoiding an
unneeded layer of abstraction. For larger systems, however, these stacks can be
large and messy. We’ll look at approaches to sizing stacks in Chapter 10.
Infrastructure Stacks

An Infrastructure Stack is a collection of IaaS resources defined, provisioned, and modified as an independently deployable group. Its purpose is focused on
the delivery context of infrastructure code.

A Stack Project includes the source code that specifies the resources in the stack,
possibly referencing infrastructure code libraries. It aligns with the source code
context.

A Stack Tool reads the code in the stack project and any libraries, then calls the
IaaS platform’s API to provision the IaaS resources defined in the code. It’s used
in the delivery context.

A Stack Instance is a set of IaaS resources provisioned on the IaaS platform from
a stack project, available for use by workloads. It is the stack in the runtime
context.

Figure 5-6 shows where these terms fit in the different contexts.
Figure 5-6. An infrastructure stack is a collection of infrastructure elements managed as a group

Examples of stack tools include:

AWS CloudFormation
Azure Resource Manager
Bosh
Crossplane
Google Cloud Deployment Manager
Terraform
OpenStack Heat
Pulumi

Note that a single stack project may be reused to provision multiple stack
instances, often taking parameters to configure each specific instance. We’ll
cover this in Chapter 11.
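
To make these terms concrete, here is a minimal sketch of a stack project, assuming Terraform as the stack tool; the resource and naming are illustrative rather than a prescribed structure:

    # main.tf -- the stack project: source code specifying the stack's resources
    variable "environment" {
      description = "Identifies the stack instance being provisioned"
      type        = string
    }

    # The stack tool reads this code and calls the IaaS platform's API
    # to provision the defined resources as a stack instance
    resource "aws_s3_bucket" "static_content" {
      bucket = "clothespin-website-content-${var.environment}"
    }

Applying this project with a different value for the environment parameter provisions a separate stack instance from the same source code.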

In the terminology defined by the authors of Building Evolutionary Architectures, a stack is an Architectural Quantum, “an independently
deployable component with high functional cohesion.” As with software
deployment architecture, decisions around how large stacks should be and how
to group the elements within them have a big impact on how easy it is to deliver,
update, and manage a system. For this reason, stacks are mentioned throughout
this book. Part III will explore various aspects of stacks in detail across multiple
chapters.

“STACK” AS A TERM

Most stack management tools don’t call themselves stack management tools. Each tool has its own
terminology to describe the unit of infrastructure that it manages. CloudFormation and Pulumi both use the
term stack, but Terraform tends to talk about projects.

In this book, I’m describing patterns and practices that should be relevant to any of these tools, so I’ve
chosen to use the word stack as a generic term. I’ve been told there is a better term to describe the concept,
but nobody seems to agree on what that term is. So stack it is.

Infrastructure Code Libraries


An infrastructure code library is a component that groups infrastructure code so
that it can be shared and reused across stacks. Common implementations include
Terraform modules, CDK Level 3 constructs, and CloudFormation modules.
Infrastructure coding tools that use general-purpose languages like Python and
Typescript may also support using library formats supported by those languages.

Chapter 13 will discuss patterns and antipatterns for using code libraries to build
stacks. However, there is a common pattern for using code libraries that is worth
mentioning while we’re on the subject of different levels of infrastructure
components. This pattern involves using a code library to implement a stack.

As of this writing, most infrastructure tools support packaging and versioning for code libraries, but not for stack-level projects. For example, there is
no standard packaging or artifact format for a Terraform project, but modules
can be versioned and distributed using an artifact repository called a registry.
This leads many teams to implement each reusable stack project as a module,
and then use their infrastructure tool’s functionality to define each stack instance
as a separate project.

Tools such as Terragrunt1 are designed to support this pattern, which this book
calls a Wrapper Stack. Terraform Cloud’s no-code provisioning feature2
dynamically generates a wrapper stack project to provision a module as a stack
instance. I’ll describe the wrapper stack pattern in more detail later in the context
of other patterns and antipatterns for configuring stack instances (Chapter 11).
For now, it’s useful to understand the difference between infrastructure code
libraries used to share code across multiple stacks (which is how you would
expect to use a library) and those used to define an independently deployable
unit of infrastructure (which is, conceptually, a stack).
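
As a sketch of the wrapper stack pattern, the stack project below contains little more than a call to a versioned module that defines the actual infrastructure; the registry address, module name, and inputs are hypothetical:

    # main.tf -- a wrapper stack project
    module "container_cluster" {
      # The module is packaged and versioned in a registry, so it can be
      # shared; this project only defines one deployable instance of it
      source  = "registry.example.com/clothespin/container-cluster/aws"
      version = "1.8.0"

      cluster_name = "clothespin-test"
      node_count   = 3
    }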

Sharing and Reusing Infrastructure


Sharing and reusing infrastructure is an obvious way to reduce the surface area
of code and systems that need to be maintained and updated, and get more value
out of the work we put into building and managing it. There are tradeoffs and pitfalls, of course. Very often an existing component doesn’t quite meet the needs
of a new use case, so we need to exercise judgment on whether we can modify it
to meet the new requirements while still meeting the old, or whether we should
create a new component.

Each of the infrastructure components introduced in this chapter can be shared and reused in different ways. The infrastructure as code lifecycle contexts
(source code, delivery, and runtime) each offer a different approach.

Sharing Infrastructure Code Components

The DRY (Don’t Repeat Yourself) principle says, “Every piece of knowledge
must have a single, unambiguous, authoritative representation within a system.”3
If you copy the same code to use in multiple places, then discover the need to
make a change, it can be difficult to track down all of the places to make that
change.

Code libraries are a common solution for reducing duplicated code across
infrastructure stacks. However, sharing a library across multiple stack projects is
a tradeoff between reuse and coupling. Making a change to a library impacts all
of the projects that use it. A simple change requires all of the users to retest their
components and systems to make sure it doesn’t break something. A larger
change needed for the library to support a new stack might create a breaking
change for stacks that also use it.

The DRY principle is best viewed as applying not to specific code, but to higher-
level abstractions. For example, I worked with a team that had created a module
to replace all references to AWS EC2 instances. They saw that the code to define an EC2 instance, used in multiple projects, looked pretty much the same everywhere, so they decided it needed to be made DRY. However, once they had implemented a module to replace the uses of the raw EC2 resource declarations, they noticed that the references to their module didn’t look any more DRY than the original code. Their new module was a thin wrapper that passed parameters through to the raw IaaS resource; it didn’t add any value, but did add complexity to their codebase.

After a rethink, the team realized there was value in replacing some definitions
of virtual servers. They had multiple stacks that provisioned application servers
for deploying different Java microservices (this was before Kubernetes). Each
use of the EC2 resource code set many parameters, mostly the same other than a
few parameters specific to the Java artifact to deploy. So they created a module for these application servers, which really did add some value, capturing the requirements for running a Java application in one place and simplifying declarations in dozens of stack projects.
Most of the other servers the team’s code provisioned were varied enough that it
wasn’t useful to wrap them in a module, so they reverted to using the raw EC2
instance resource code. This turned out to be simpler to understand and maintain
than their custom module had been.4
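
A sketch of the difference, with hypothetical names and values: the raw resource declarations repeated many arguments in every stack, while the module left only the meaningful variation in each stack project.

    # Before: a raw resource repeated in many stacks (abbreviated)
    resource "aws_instance" "basket_service" {
      ami           = "ami-0abc1234"
      instance_type = "t3.medium"
      # ...many more arguments, nearly identical in every stack project
    }

    # After: a module captures the shared requirements for running a
    # Java microservice, so each stack declares only what varies
    module "basket_service" {
      source        = "../modules/java-application"
      artifact_name = "shopping-basket-service-1.4.2.jar"
    }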

Chapter 13 outlines patterns and antipatterns for building and using


infrastructure code libraries.

Sharing Deployable Infrastructure Components

Infrastructure stacks and products are both deployable components (“architectural quanta”), unlike code libraries, which are only deployed as part of
a larger-scale component. Deployability creates the opportunity to share and
reuse a component by using it to provision multiple instances of the resources it
defines.

For example, Figure 5-7 shows one infrastructure product definition for a
container cluster being used to create two different cluster instances.
Figure 5-7. A shared infrastructure product deployable

Reusing infrastructure components across multiple environments helps maintain essential consistency, reducing variation and configuration drift across them.
Consistency across environments makes application deployment more reliable,
ensures that test environments accurately reflect production environments, and
reduces the time and effort needed to maintain multiple environments.

Deployable components can also be reused within a single environment. One team ran 60 database instances in each of its environments, defining them all in a
single Terraform file and deploying them in a single stack. Applying the
Terraform project code was slow and the impact when it failed (the blast radius5)
was broad.

Their first attempt to improve the situation was to create a module for the
database and reuse it at the code level. But reusing the module 60 times in the
same project was not faster (in fact it ran a bit slower) and the blast radius was
just as wide.

The team later moved their database module code into a separate Terraform
project. For each environment, they used this project to provision 60 stack instances, each with its own state file.

Provisioning so many instances meant the team needed to modify their orchestration scripts6 to manage all of the instances, but applying a change to each instance was simpler. A failure only impacted a single database, which
made it simpler to troubleshoot and fix, with a much smaller disruption to the
environment.
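
A sketch of the reworked structure, assuming Terraform with a remote state backend; the module path, variable, and commands are illustrative:

    # database-stack/main.tf -- one project, provisioned once per database
    variable "database_name" {
      type = string
    }

    module "database" {
      source = "../modules/database"
      name   = var.database_name
    }

    # Each instance is initialized with its own state file, so a failed
    # apply affects only one database:
    #   terraform init -backend-config="key=databases/orders.tfstate"
    #   terraform apply -var="database_name=orders"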

Structuring, configuring, and managing infrastructure stacks for multiple deployments is a foundational practice discussed throughout this book.

Sharing Runtime Instances of Infrastructure Components

We can use infrastructure components to share work across a codebase with code libraries, and we can provision multiple instances of infrastructure by
sharing a deployable component. A third option for sharing infrastructure is to
use a single running instance of a component like a stack or infrastructure
product as a provider for multiple instances of other components.

Figure 5-8 shows a shared infrastructure product instance for common networking.
Figure 5-8. A shared infrastructure product instance

The common networking product defines foundational networking structures like VPCs and subnets. Networking products for each service include service-
specific networking structures like load balancing configuration, firewall rules,
and DNS entries. Only one instance of the networking product is provisioned in
the environment, and each service networking stack uses the subnets and other
structures it creates.
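
One way a service networking stack can use the shared instance is to read the outputs it publishes. A sketch using Terraform’s remote state data source, with the state location and output names assumed for illustration:

    # service-networking/main.tf -- consumes the shared instance's outputs
    data "terraform_remote_state" "common_networking" {
      backend = "s3"
      config = {
        bucket = "clothespin-infra-state"
        key    = "common-networking.tfstate"
        region = "eu-west-1"
      }
    }

    resource "aws_lb_target_group" "browse_products" {
      name     = "browse-products"
      port     = 8080
      protocol = "HTTP"
      # Attach to the VPC created by the shared networking instance
      vpc_id   = data.terraform_remote_state.common_networking.outputs.vpc_id
    }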

SHARED-NOTHING INFRASTRUCTURE ARCHITECTURES

In the distributed computing field, a shared-nothing architecture enables scaling by ensuring that new nodes can be added to a system without adding contention
for any resources outside the node itself.

The typical counter-example is a system architecture where processors share a single disk. Contention for the shared disk limits the scalability of the system
when adding more processors. Removing the shared disk from the design means
the system can scale closer to linearly by adding processors.

A shared-nothing design with Infrastructure as Code avoids sharing provisioned instances of infrastructure components. Sharing a deployable component to
create multiple instances is acceptable since the point is to remove shared
dependencies at runtime.

The shared network product instance example in Figure 5-8 could be changed to
a shared-nothing implementation by provisioning a separate instance of the
common infrastructure product for each service. Taking this approach would allow many services to be rapidly created and managed without friction from coordination and contention.

Shared-nothing architectures are often appropriate for very high-scale systems. Organizations like telecommunications providers may need to run tens of
thousands of nodes and service instances, adding and removing dozens of
instances per minute.

Even a system that doesn’t need to scale and change at these levels can often
benefit from removing sharing. Many infrastructure design practices are based
on Iron Age constraints. Duplicating hardware networking and storage devices
was expensive and would usually lead to underutilization. But IaaS and
Infrastructure as Code make it simple, fast, and cheap to duplicate virtual
infrastructure and to automatically resize it to match usage.

A rule of thumb is that having two consumer infrastructure components share a provider infrastructure component is more likely to be useful when it enables
them to interact. The services in our example may use the common infrastructure
to communicate with each other, for example. But if you are implementing a
shared infrastructure instance to avoid duplication, consider whether you can
share the deployable component to create separate instances efficiently.

Application-Driven Infrastructure Design


Traditionally, architects organized systems functionally. Networking stuff
together, database stuff together, and operating system stuff together. Figure 5-9
shows three infrastructure products organized around horizontal concerns. Each
of the products provides resources used by four different software services: one
for compute resources, one for databases, and the third for networking.
Figure 5-9. Infrastructure grouped into horizontal layers

Note that these infrastructure products don’t necessarily duplicate code or components. For example, the database infrastructure product may use a single
database infrastructure stack project, provisioning three different instances. So
the issue with the horizontal infrastructure architecture isn’t duplication, it’s
scope and ownership of change.

Making a configuration change to the database instance for the checkout service requires editing the database infrastructure product that is shared with
the other software services. So the scope of risk for a change to one instance is
all of the instances defined in the product, which adds overhead to the change. If
different teams are responsible for configuring databases within the shared
product, more overhead is needed to coordinate changes.

A common solution to this problem is to have a central team, such as a database team, own the shared database infrastructure product. This disempowers the
teams that own the services, requiring them to raise a request to the database
team for even a small change, and makes the capacity of the database team a
constraint for all of the services.

An alternative is to organize the infrastructure code to align with workloads, as shown in Figure 5-10. In this example, a single infrastructure product specifies
the infrastructure stacks to be provisioned for each software service.
Figure 5-10. Infrastructure grouped into vertical layers

As with the previous example, the infrastructure stacks may be provisioned from
shared deployables. So the code to define database instances, for example, is not
duplicated. The teams that own the services don’t necessarily need to write the infrastructure code for their databases themselves; they can use a shared deployable stack in their infrastructure products. But owning the infrastructure
product empowers the development teams to manage the lifecycle and
configuration of their infrastructure.
Some infrastructure resources need to be shared across workloads at runtime,
such as a container cluster or shared networking structures. Figure 5-11 shows
the services from previous examples, with some workload-specific
infrastructure, and some shared.

Figure 5-11. Shared infrastructure product

The workload-specific infrastructure products define a database instance and service-specific networking structures, such as firewall and load balancer rules for traffic to that service. Two infrastructure products are shared, one
that defines a container cluster, the other defining shared networking resources
like a VPC and subnets.

Conclusion
The last few chapters have explored how infrastructure code works, guidance for
designing infrastructure, and, in this chapter, infrastructure components. The
components described in this chapter will be used throughout the rest of the
book to explain patterns for testing, delivering, and managing infrastructure
using code. The terms used here (infrastructure products, infrastructure stacks, and infrastructure code libraries) are not used universally across tool vendors.
However, the concepts apply to whatever tools you may use, so we need
consistent terminology to describe them in this book.

The next two chapters dive into the delivery context of infrastructure code. The
main goal of a delivery process for infrastructure code is testing that it works
correctly and safely, so that will be the focus of the next chapter.

1 https://github.com/gruntwork-io/terragrunt

2 https://developer.hashicorp.com/terraform/tutorials/cloud/no-code-
provisioning

3 The DRY principle can be found in The Pragmatic Programmer: From Journeyman to Master by Andrew Hunt and David Thomas (Addison-Wesley).
4 I recommend Sandi Metz’s post, The Wrong Abstraction. Kent C. Dodds’
post AHA Programming builds on Metz’s post and on Cher Scarlett’s
observation that one should “Avoid Hasty Abstractions”.

5 Blast radius is the scope of the potential negative impact from a change or
event. I first saw this term used for software by Charity Majors in her post
Scrapbag of Useful Terraform Tips. Charity’s recommendation to use separate
Terraform state files for every environment is an example of sharing infrastructure deliverables across environments, in her case using the Wrapper Stack pattern (Chapter 11).

6 Most infrastructure teams build and maintain custom scripting to run their
infrastructure tools.
Part III. Infrastructure Stacks
Chapter 6. Designing Environments

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

This will be the 15th chapter of the final book. Please note that the GitHub repo
will be made active later on.

If you have comments about how we might improve the content and/or examples
in this book, or if you notice missing material within this chapter, please reach
out to the editor at jleonard@oreilly.com.

The concept of an environment is so pervasive in the IT industry that, paradoxically, it’s not very well defined. For this book, I define an environment
as a grouping of shared infrastructure resources and platform services that host a
set of interrelated applications. That’s a pretty vague definition. In practice,
every organization structures its environments differently, and may even have
multiple levels or groupings of environments.

Many organizations structure their environments based on habits and traditions, but the decisions about how to group and divide the elements of your system
have implications for how easy it is to make changes, ensure consistency,
replicate systems, and apply good governance. A great deal of the difficulties
that IT organizations face stem from coping with the consequences of their
environment architecture. Much of this chapter discusses different design forces
that apply to environments, which can be used to create and evolve your
organization’s environment architecture.

Changes in technology, such as hardware virtualization and cloud-native approaches to packaging and deploying software, are driving changes in how
people think about environments. Some people believe the concept of
environments is obsolete. The truth is that infrastructure can be abstracted at
different levels of the stack, but as with the other design forces discussed in this
chapter, it’s important to consider the implications and draw the boundaries of
abstraction at the right level for your situation. I will share a model for
environment implementation layers that can help with this.

IaaS platforms provide structures for grouping resources as accounts (AWS), resource groups (Azure), and projects (GCP). These structures have a lot in
common with environments, so an environment architecture needs to consider
what kind of relationship to create between IaaS resource groupings and
environments.

An environment’s infrastructure should of course be implemented as code.


Infrastructure stacks, as defined in Chapter 10, are deployable units of
infrastructure code. So the chapter finishes by bringing environment architecture
together with design patterns for infrastructure stacks to implement
environments.
Multi-Environment Architectures
The core elements of an environment are:

Software
The workloads that run in the environment. This can be applications, services,
or other system elements. In “Capabilities in an Enterprise Architecture”
these were described as business products and capabilities.

Platform services
Described in “Capabilities in an Enterprise Architecture” as technology
capabilities, these are services that enable the software to run. Instances of
platform services may be dedicated to a specific application, such as a
database instance, or shared across multiple applications, such as a container
cluster.

Infrastructure resources
Typically provided by an IaaS platform or physical infrastructure, these host
platform services and software.

An environment provides shared infrastructure resources and platform services to run one or more software deployments. In a simple situation, you would use a
single environment to run all of your software. But, for various reasons, you
usually need to separate software in ways that mean you need more than one
environment.
There are at least three categories of multi-environment architecture:

Providing delivery environments for change management, such as software delivery
Splitting environments for manageability, such as ownership by different groups
Replicating and customizing environments for scale, such as geographical distribution

More complex systems often find the need to split environments across more
than one of these architecture categories. For example, if different product
groups each have their own production environments, they probably also have
separate delivery environments to test their software releases.

There are multiple design forces at play with each of these multi-environment
architectures. Keep in mind that, as with any up-front system design, the only
thing you know for sure about the design decisions you make for your
environments is that they will be wrong. Even if you get the design right at first,
needs will change. Don’t assume that you will design your environments and be
done with it. Instead, be sure to consider how you will make changes once the
environments are in use. Changes may include splitting environments, merging
them, and moving systems between them.

Multiple Delivery Environments


Most processes for managing changes to different parts of a system—whether
it’s software, infrastructure, or configuration—involve deploying a change into a
series of separate environments before applying the change to a production
environment. Developing and testing the change in separate environments
reduces the risk of causing a problem in a business-critical system. The
environments used in the change delivery process are delivery environments.
The process of validating and progressing a change through the stages of the
delivery process is sometimes called the path to production.

Different organizations and teams use various environments in their path to production, with many different names. Figure 6-1 shows a simple set of delivery environments that includes one environment each for test, staging, and production, to use as a reference.
Figure 6-1. ClotheSpin delivery environments

An upstream environment is used earlier in the delivery process. A downstream environment is used later in the process. In the ClotheSpin example, the test
environment is upstream of the staging environment, and the production
environment is downstream of the test and staging environments. The metaphor is of a change as a boat floating down a stream.

Three key concerns for designing and implementing delivery environments are
segregation, consistency, and variation. These concerns are in tension with each
other and need to be balanced.

Segregation
Resources, services, and software in one environment should not affect other
delivery environments. For example, when testing a change to an application
in an upstream environment, it should not be able to change data in a
downstream environment, which could cause problems in production.
Allowing a change in a downstream environment to affect an upstream
environment can affect the outcomes of testing activities, making the delivery
process unreliable.

Consistency
Differences between resources or services across environments can also affect
the validity of testing. Testing in one environment may not uncover problems
that appear in another. Differences can also increase the effort and time
needed to deploy and configure software in each environment and complicate
troubleshooting. Lack of consistency across delivery environments is a major
contributor to poor performance on the four key metrics, and resolving it is
one of the leading drivers for the use of Infrastructure as Code.

Variation
Some variation between delivery environments is usually necessary. For
example, it may not be practical to provision test environments with the same
capacity as the production environment. Different individuals may have
different access privileges in different environments. At the very least, names
and IDs may be different (appserver-test, appserver-stage, appserver-prod).
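
One common way to balance consistency with variation is to keep the stack code generic and supply a small parameter file for each environment. A sketch, assuming a parameterized Terraform stack project with illustrative values:

    # test.tfvars -- smaller capacity, test-specific names
    server_name    = "appserver-test"
    instance_count = 1

    # production.tfvars -- full capacity
    server_name    = "appserver-prod"
    instance_count = 6
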
As mentioned earlier, other multi-environment architectures may involve
multiple production environments. In situations where each production
environment hosts different software or different parts of a system, there is
usually a need to have a separate set of delivery environments for each
production environment, as shown in Figure 6-2.

Figure 6-2. Separate delivery environments for separate software systems

In other cases, a single set of delivery environments may be used to test and
deliver the same software to multiple production environments, as in Figure 6-3.
This is a fan-out delivery pattern.
Figure 6-3. Single path to production fanned out to multiple production environments

The fan-out pattern works well when each production environment hosts an
instance of essentially the same software, perhaps with some context-specific
configuration.

Splitting Environments For Alignment


It may be unrealistic to run a very large system in a single environment,
especially if the workloads are not closely related, if there are many stakeholders
with conflicting requirements, or if the governance requirements vary. In these
cases, dividing the workload across multiple, separately configured
environments can simplify design and maintenance. As with any architectural
decisions about where to draw boundaries between system components, it’s
important to consider the design forces involved for environments, so you can
keep cohesion high and coupling low. Environments should be aligned to
specific concerns.

Aligning Environments To System Architecture

A defining characteristic of an environment is sharing infrastructure resources and platform services across multiple software deployables. But another word
for sharing is coupling. As the number of things running in an environment
grows, barriers to changing, upgrading, patching, or fixing the shared elements
also grow. At some point, it may be useful to create multiple environments to
reduce coupling across workloads.

In some cases, different software systems have no dependencies between them, so the environments can be cleanly separated. In other cases, systems running in
different environments need to integrate. Figure 6-4 shows an example of the
first case.
Figure 6-4. Example of environments for separate systems

ClotheSpin runs two different online stores, one for the core ClotheSpin brand,
and another for the new Hipsteroo brand which was built as a separate greenfield
project. Each of the stores has its own set of business capabilities and platform
services, so there is no need to manage them both in a single environment.

Figure 6-5 adds backend systems that handle data analysis and fulfillment for
both the ClotheSpin and Hipsteroo online stores. These are hosted in a separate
environment.
Figure 6-5. Example of separate environments for integrated systems
Systems running in each of the storefront environments integrate with systems
running in the backend services environment. Decisions about where to draw the
boundaries between environments and which environment should host each
service are driven, at least partly, by the system architecture.

Each of the two storefront environments is fairly cohesive, in that everything in it supports that particular storefront. If one environment hosted both storefronts
it would lack cohesion, since the two storefronts don’t share services.1

Both of the storefronts in our example are coupled with the backend services.
Hopefully, they are loosely coupled, meaning changes to a system on one side of
the integration can usually be made without changing the other system.

Aligning Environments To Organizational Structure

Environments are often defined by who owns them, either in terms of who uses
them or who provides and manages the infrastructure in them. For example, the
ClotheSpin and Hipsteroo online stores are developed by separate groups in the
ClotheSpin company. When the project to create the new Hipsteroo store was
launched, they were given a new environment to host their storefront software.
The intention was to develop the new store quickly without disrupting the
existing business. The environments are aligned to the groups that use them, so
that each can deploy their software without interfering with each other, and they
can customize the platform services to their needs.

Environments may also be aligned with the teams that manage them. The
storefront platform team manages the environment for the ClotheSpin storefront,
and a new team was created to build and manage the new Hipsteroo
environment. So the environments for the two storefronts align with the teams
that own their infrastructure.

Then the company decided to create a purchasing and fulfillment system to manage the delivery and return of goods from both ClotheSpin and Hipsteroo. A
new fulfillment service development team was set up separately from the
ClotheSpin and Hipsteroo development groups. The existing storefront platform
team was asked to provide the hosting infrastructure for the new system, which
they initially did with the existing ClotheSpin storefront environment
infrastructure. The ClotheSpin storefront environment was aligned with the team
that managed it rather than with the teams that used it, as shown in Figure 6-6.
Figure 6-6. Example of an environment aligned with the team that manages it

The architectural concerns described in the previous section led to the creation of
a third environment to manage backend services, as shown in Figure 6-5. This
led to the discussion of whether to split out a third platform team to manage the
new environment, which underlines how Conway’s Law applies to infrastructure
environments.

As mentioned in Chapter 10, Conway’s Law is an observation that organizational design tends to drive system architecture. Splitting out a new
team to manage the infrastructure for Hipsteroo would certainly have led to the
creation of a separate environment, even if it hadn’t been already decided.
Assigning a single team to provide infrastructure for two different software
development groups resulted, at least initially, in using a single infrastructure
environment for both groups’ software.

AVOID SPLITTING OWNERSHIP OF DELIVERY ENVIRONMENTS

Some organizations separate the responsibility for maintaining development and test environments from
maintaining production environments. In more extreme cases, each environment in a multi-stage path to
production may be managed by a separate team. Having a separate team managing separate environments
inevitably leads to different configurations and optimizations in each environment, which leads to
inconsistent behavior and wasted effort.

The worst case of this I’ve seen involved seven environments2 for a path to production, each managed by a
separate team. Deploying, configuring, and troubleshooting the software took nearly a week for each
environment, which added nearly seven weeks to every software release.
Aligning Environments To Governance Concerns

Splitting a system across multiple environments can be useful to address governance concerns like compliance or security. Boundaries and integration
points between environments are usually very clear, making it easy to enforce
and audit what data and transactions cross them. Infrastructure resources should
not be shared across environments at a level that is material for governance (see
“Environment implementation layers”), which reduces the risk of leaking data or
authority.

The strength of segregation by environment helps with assurance, and can also
help to reduce the scope (or blast radius) of attacks. An attacker who gets access
to an environment that hosts a user-facing application may have little access to
back-end systems with sensitive data. Hosting security operational services, such
as monitoring, scanning, and log analysis, in a separate environment can prevent
attackers from covering their tracks. Separately hosting systems with a broad
scope of control, such as administrative and delivery pipeline services, can also
limit the scope of damage an attacker can do.

With a well-designed system, governance and compliance concerns align closely with architectural and organizational drivers.

Multiple Environment Replicas


As an organization scales, there is often a need to deploy multiple production
instances of a system. Each environment is a replica, in that it hosts essentially
the same software, although typically there are requirements to configure the
software and the infrastructure in different ways for different replicas. So, as
with delivery environments, the need for consistency (which helps keep
maintenance costs low), variation (to customize for different use cases), and
segregation are usually in tension with one another.

Three common reasons for deploying replica environments are operability scenarios (such as resilience), geographic distribution, and multiple user bases.
These reasons may be combined. For example, a common resilience scenario is
running replicas of environments in multiple regions. The reasons may also be
combined with other multi-environment architectures. System replicas in
different geographies may be subject to different legal regulations, which can
drive alignment to governance concerns as discussed above.

Designing Environments For Operability Scenarios

Replicating environments can be useful to support operability scenarios, particularly for availability and scaling. Both types of scenarios
involve running multiple environments, each of which can handle any given
workload interchangeably.

Availability scenarios address situations where one environment is impaired. For example, if services in one data center or cloud hosting region have an issue,
traffic can be rerouted to a replica environment hosted in another data center or
region. Availability scenarios can also be handled with a single environment that
spans multiple hosting zones. However, a replica environment is often a
convenient unit for providing redundancy, since it should include all of the
resources and services needed to continue handling workloads independently.
See Chapter 19 for more discussion of continuity scenarios.

Environment replicas can be used for scalability scenarios using approaches very
similar to those used for availability, and may even use the same
implementation. When the active workload nears the maximum capacity for an
environment, one of the potential strategies is to provision an additional replica
environment and redirect a portion of traffic or work to it.

Some organizations run multiple environment replicas continuously, shifting load between them to handle failures or local surges in traffic gracefully.

Availability and capacity can be handled at other levels of the system than the
environment. Compute capacity may be automatically scaled, data replicated,
and workloads shifted across different parts of a system within one environment.
Environment-level replication can be useful as one part of a multi-tiered
approach to these scenarios.

Distributing Environments Geographically

Organizations can use different approaches for structuring their systems, infrastructure, and environments to serve users in different geographical regions.

One concern when deciding on the right approach is latency. Even when running
on a public cloud service, your system will be hosted in a physical data center
which may be far enough from some of your users, in terms of network
connectivity, that latency may affect their experience. Replicating or distributing
some parts of your system so they are hosted closer to your users can address
this.

Figure 6-7 illustrates an option for ClotheSpin that uses a single environment to
centrally serve customers in the UK, Germany, and Korea.

Figure 6-7. A single environment for multiple regions

ClotheSpin can use a CDN (Content Distribution Network) to cache static assets
like web pages, client-side scripts (JavaScript), and images closer to users. Some
executable code could also be distributed using edge computing offered by a
CDN provider. And even if a system’s implementation doesn’t lend itself to
easily using these types of services, parts of the system could be explicitly
deployed to data centers or cloud hosting locations closer to users. There is a
natural tendency to think of an environment as being located in a single region,
but you can choose to draw its boundaries across regions if it’s useful.
But many organizations prefer to define a separate environment for each region
to support customizations for local markets or businesses. For example,
ClotheSpin may have a logistics partner in South Korea which means they don’t
need to deploy the part of their storefront system that they use for logistics in the
UK and Germany. Other parts of the storefront software may need to be
customized to integrate with the local partner. So ClotheSpin needs to deploy
different builds of their software in different regions, leading them to run a
separate environment for each region, as shown in Figure 6-8.

Figure 6-8. A separate environment for each region

In this example, ClotheSpin also runs a separate environment for centralized services. They may use these services for analytics and reporting, as well as
operational services like monitoring.

Customizing the software for each region complicates the testing needed, which
in turn complicates the path to production for the software. If the customization
is implemented so a single build of the software can be used, and simply
configured separately for each environment, the teams may be able to use a fan-
out path to production, as shown earlier in Figure 6-3. The fan-out pattern
minimizes the effort and expense of maintaining multiple regions.

If the software is heavily customized, however, the teams will need separate
paths to production for each environment, each with a separate set of delivery
environments, as in Figure 6-2.

Often, different regions fall under different legal regulations. ClotheSpin may
need to meet different requirements in the UK, Germany (as part of the EU), and
South Korea, which could lead to differences in how infrastructure is
implemented. It’s often feasible to implement systems so that a single build of
the software and even the infrastructure meets the regulations of each region it’s
used in. However, the regulations may still require separate hosting. For
example, data residency laws control where personal data for users can be
transferred or stored. This leads back to governance concerns as a driver for
designing environment boundaries.

It’s often theoretically possible to adhere to local regulations, including data


residency, within a single environment with careful system design. But
environments offer clear boundaries, reducing the risk of a misconfiguration or
other mistake that breaks regulations. The boundaries of an environment can also
simplify auditing.

Replicating Environments for User Bases

A common business model involves providing a service customized for different markets, customers, or partners. Some are white-label services, where the
provider hosts the service for different customer companies, customizing the
system for each customer’s brand so the customer can market it as their own.
Online service white labeling is used in telecommunications, finance, retail, and
many other domains.

For example, ClotheSpin could offer to host online clothing stores for other
businesses. A small fashion label might want to sell its products online, so
ClotheSpin can host an instance of its storefront, customizing the look and
branding so end users see it as the fashion label’s website. The label may not
need all of the features that the full-fledged ClotheSpin storefront offers, so the
software would be customized for their needs.

Another scenario is partnering. A company may provide an instance of its service co-branded with another company, perhaps in a region or market where
the partner has a presence. For example, ClotheSpin could partner with an
established offline clothing retailer in India, running an instance of their online
storefront as a joint venture.

Each white-labeled or partnered system may have customized branding or features. They will usually have separately managed data, such as products and
users. A person who shops at both the ClotheSpin storefront and the fashion
label’s white-labeled online store will use a separate login for each, and will not
expect to see their shopping history from one store appear on the other store.

The simplest way to use the same software to implement separate brands is to
deploy a separate instance of the software for each. An alternative is to
implement a multi-tenant system where a single hosted instance serves multiple
brands. Customized branding, features, and user data separation is implemented
in the software.

Multi-tenancy makes more efficient use of infrastructure resources and takes less
work to maintain. However, it requires more sophisticated software
development. Also, some governance concerns, such as data residency, may
require separate single-tenant systems where each instance serves only one
brand. Some organizations choose to host each single-tenant system in a separate
environment. The implications of this approach are similar to those described for
aligning environments to geography. The cost and effort to update, run, and
maintain multiple system instances can be difficult to control.

Environment implementation layers


Virtualization and cloud technologies have progressively raised the level of abstraction between application software and the hardware it runs on. IaaS uses
virtualization to separate concepts like servers, disks, and load balancers from
their hardware implementations. PaaS takes things another step, discarding
abstractions that imitate hardware devices, instead using abstractions aligned
with application architecture. A container wraps a compute process and the
smallest subset of the operating system it needs to run. Serverless strips the
abstraction even further to the execution of a single operation.

The implementation of an environment is evolving with the layers of infrastructure abstraction. Figure 6-9 shows the progression from physical
environments to environments running on shared runtimes such as container
clusters.

Figure 6-9. Increasing levels of abstraction layers for environments

A physical environment has dedicated hardware and only shares infrastructure at the level of the data center, such as networking and power.
Virtual environments share hardware using an IaaS cloud. You provision
dedicated application runtime resources for each environment, such as a
container cluster or virtual servers. But the underlying physical hardware is a
black box, with no distinction between hardware for different environments.3

A configuration environment is an application-level construct. Applications are deployed onto a shared application runtime service, such as a Kubernetes cluster.
Environments are defined by configuration, for example using namespaces.
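
For example, delivery environments on a shared Kubernetes cluster might be nothing more than namespaces. A minimal sketch, using Terraform’s Kubernetes provider as one way to define them:

    # Each environment is only a namespace on the shared cluster
    resource "kubernetes_namespace" "delivery_environment" {
      for_each = toset(["test", "staging", "production"])

      metadata {
        name = each.key
      }
    }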

Design forces for choosing the environment implementation layer

The level of abstraction you can use to design your environments is constrained
by the type of application. Normally, only cloud-native applications
implemented as serverless code or containers can be hosted in a configuration
environment, unless you implement custom mechanisms for co-hosting instances
for different environments on shared virtual servers.

However, even if you are technically able to deploy application instances on shared infrastructure, it’s important to consider the level of isolation you need
between environments, and what you can achieve with the technology. Many of
the design forces discussed earlier in this chapter that lead to separating
environments may not be satisfied when the environments are implemented on a
shared runtime service. Some examples:

A highly regulated system may require that test code that hasn’t been
approved for production must be prevented from accessing production data
by stronger segregation than a deployment configuration setting.
A runtime system may need to be optimized for different types of workloads.
For example, one workload may need to handle high volumes of transactions
with low latency, while another may load and analyze large data sets. These
conflicting requirements may need separate environments defined at a lower
level of abstraction than namespaces in a container cluster.
Upgrading or even patching a runtime system may require downtime, or at
the least risk of disruption for workloads running on it. The more
environments running on the system, the broader the blast radius for planned
or unplanned disruption. The number of teams and stakeholders who need to
be involved in scheduling upgrades can add friction to the process. This
friction can lead to upgrading and patching less often, which then leads to
running outdated software, perhaps even versions with exploitable
vulnerabilities.
Even when a runtime system has automated recovery features, some
availability scenarios can only be managed by running multiple instances that
share less of the underlying infrastructure. A failover between two zones of a
container cluster doesn’t help when cluster management services fail.

Testing and delivering changes to environment infrastructure

Your environments are defined at a certain level of abstraction on top of a shared layer of infrastructure. Someone needs to manage that shared layer of
infrastructure. If that shared layer is an IaaS service managed by a cloud vendor,
then it’s the vendor’s problem. Most cloud vendors have robust processes for
testing and applying changes to the infrastructure you use without disrupting
how you use it.

If you are managing shared infrastructure for multiple environments, you need
similarly strong change management processes. An application build that
includes untested changes may be deployed to a test environment. But untested
infrastructure changes should not be deployed to that same environment.
Application development and test environments are business-critical systems for
software delivery. Infrastructure changes should be tested in a separate
environment before applying them to an environment that users rely on, as
shown in Figure 6-10.

Figure 6-10. Testing the lower environment layers

The diagram shows the three delivery environments used to test an application.
These are virtualized environments, each one running its own application cluster
instance. The team that manages the application cluster tests changes to it in an
environment that is upstream from any of the application testing environments.
This “app cluster testing” environment would be used to test changes to
infrastructure code or to any system software deployed as part of the application
cluster, such as a new Kubernetes release. Chapter 12 describes using automated
testing and pipelines for managing changes to the infrastructure code itself.

Although it’s not something that most teams will be aware of, the IaaS vendor
will test changes that it makes to the underlying systems and services before
applying them to customer systems. From their point of view, the shared
hardware that underpins customer hosting is a production environment, with
upstream delivery environments unseen by customers.

IaaS Resource Groups and Environments


Each IaaS platform provides a base-level organizational structure for resources. With AWS this is the Account, Azure has Resource Groups, and GCP has Projects. For lack of a common term, I’ll call these IaaS Resource Groups. The
vendors provide other groupings for managing hierarchies of resource groups,
such as Organizations in AWS and GCP, and Management Groups in Azure.

An IaaS resource group is the default level for defining permissions, allocating
costs, and other fundamental configuration that applies to all of the resources
allocated within it. A key question for an environment architecture when using
IaaS is how to align resource structures and environments.

A common, basic approach is to create multiple environments in a single IaaS resource group. Resource groups can be difficult to create and configure. Many
organizations put heavyweight governance around the creation of AWS
accounts, GCP projects, and the like, so teams find it easier to create a new
environment inside an existing group than to have a new group created.

The drawback of having multiple environments in one resource group is that it results in sharing at least some configuration, access policies, and resources
across those environments. It’s usually possible to segregate resources and
configuration within a resource group, to at least some extent, for example by
adding a filter to apply a policy to subsets of resources based on identifiers or
tags.

However, it’s easier and more reliable to segregate resources and configurations
between IaaS resource groups than within them. So another approach is to create
a separate resource group for each environment. This approach can be taken
further, splitting parts of a single environment across more than one resource
group. These structures may be divided following similar design forces as those
described for environments.

Figure 6-11 shows how ClotheSpin uses three AWS accounts to create a single
environment.
Figure 6-11. One environment composed of multiple IaaS resource groups

One account runs the application software that directly serves users. A separate
management account runs services with administration permissions to make
changes to the applications account. A third account runs monitoring services, which receive logs from the applications account and can make read-only requests into it. The applications account has no access to the other two
accounts. These three accounts have clearly separated and limited permissions
according to the needs of the workloads within each.

IaaS resource groups are designed as a boundary for configuration, permissions, and accounting. So using one or more resource groups for each environment
makes for more natural alignments of governance, operability, and other design
forces for environments. Many organizations find it useful to align resource
groups to ownership by team. Permissions for each AWS Account, Azure
Resource Group, or GCP Project are assigned to one group of people, rather than
giving each team permissions for resources owned by other teams in a shared
group.

Going one step further, maintaining a separate IaaS resource group for each application or service not only aligns the permissions and configuration with current team ownership, but also simplifies realignment when team ownership changes. As we know, system designs will change, including ownership of parts of a system by different teams. It’s easier to reassign the permissions for a resource group from one team to another than to move the infrastructure and configuration from one team’s resource group to another’s. The lowest level of granularity for ownership of the contents of infrastructure is typically the application or service, so aligning IaaS resource groups at this level creates the most flexibility for managing changes.

As I mentioned earlier, people commonly put multiple environments in a single AWS Account, Azure Resource Group, GCP Project, or equivalent because it’s
easy. Put another way, it’s typically difficult to create and manage an IaaS
resource group. An architecture that aligns resource groups at a lower level of
granularity, even per-environment, means you will be creating and managing
many more of them. So a prerequisite of fine-grained resource groups is defining
and managing them as code, so they can be easily created, updated, and kept
consistent. Managing IaaS resource groups as code gives all of the additional
benefits of defining anything as code, including auditability and hooks for
automated governance.
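
A sketch of resource groups defined as code, using Azure resource groups as an example with illustrative names and location:

    # One IaaS resource group per environment, created consistently as code
    resource "azurerm_resource_group" "environment" {
      for_each = toset(["test", "staging", "production"])

      name     = "clothespin-storefront-${each.key}"
      location = "westeurope"

      tags = {
        owner = "storefront-platform-team"
      }
    }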

Patterns for Using Stacks to Build Environments
However you organize your environments, their foundational elements are the
infrastructure resources allocated in them, and these resources should of course
be implemented using code. Chapter 10 described the infrastructure stack as the
core unit of infrastructure architecture, because it is the smallest unit of code that
can be independently deployed to infrastructure. So the key question for
designing the implementation of your environment architecture is how to use
stacks to define and build environments.

I’ll describe two antipatterns and one pattern for implementing environments
using infrastructure stacks. Each of these patterns describes a way to define
multiple environments using infrastructure stacks. Some systems are composed
of multiple stacks, as I described in Chapter 10. I’ll explain what this looks like
for multiple environments in “Building Environments with Multiple Stacks”.

Antipattern: Multiple-Environment Stack


A multiple-environment stack defines and manages the infrastructure for
multiple environments as a single stack instance.

For example, if there are three environments for testing and running an
application, a single stack project includes the code for all three of the
environments (Figure 6-12).

Figure 6-12. A multiple-environment stack manages the infrastructure for multiple environments as a single
stack instance
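
In Terraform, for example, the antipattern might look like a single project that declares every environment side by side; the module and its inputs are hypothetical:

    # main.tf -- one stack project, so one stack instance, for everything
    module "test_environment" {
      source   = "../modules/environment"
      env_name = "test"
    }

    module "staging_environment" {
      source   = "../modules/environment"
      env_name = "staging"
    }

    module "production_environment" {
      source   = "../modules/environment"
      env_name = "production"
    }

    # A mistake anywhere in this code puts all three environments,
    # including production, in the blast radius of a single apply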

Motivations

Many people create this type of structure when they’re learning a new stack tool
because it seems natural to add new environments into an existing project.
Consequences

When running a tool to update a stack instance, the scope of a potential change is
everything in the stack. If you have a mistake or conflict in your code,
everything in the instance is vulnerable.4

When your production environment is in the same stack instance as another environment, changing the other environment risks causing a production issue. A
coding error, unexpected dependency, or even a bug in your tool can break
production when you only meant to change a test environment.

Related patterns

You can limit the blast radius of changes by dividing environments into separate
stacks. One obvious way to do this is the snowflake as code (see “Antipattern:
Snowflakes As Code”), where each environment is a separate stack project,
although this is considered an antipattern.

A better approach is the reusable stack pattern (see “Pattern: Reusable Stack”).
A single project is used to define the generic structure for an environment and is
then used to manage a separate stack instance for each environment. Although
this involves using a single project, the project is only applied to one
environment instance at a time. So the blast radius for changes is limited to that
one environment.

Antipattern: Snowflakes As Code


The snowflakes as code antipattern uses a separate stack source code project for
each instance of infrastructure, even when the instances are intended to act as
replicas of the same resources.

In our example of three environments named test, staging, and production, there
is a separate infrastructure stack project for each of these environments
(Figure 6-13). Changes are made by editing the code for one environment and
then copying the changes into the projects for each of the other environments in
turn.

Figure 6-13. Snowflakes as code use a separate copy of the stack project code for each instance

Motivation

Snowflake code environments are a simple way to maintain multiple
environments. They avoid the blast radius problem of the multiple-environment
stack antipattern. You can also easily customize each stack instance.

ClotheSpin has a group that runs white-label stores for different fashion brands.
They created an AWS CDK project to create an environment to host the store
instance for their first customer. When they signed their second customer, they
copied the code to a new CDK project and customized it for that customer’s
needs. They followed this pattern (or antipattern) for each new customer.

The ClotheSpin white-label team took this approach because it was the simplest
way to set up each new customer’s environment.

Applicability

Snowflake code might be appropriate if you want to maintain and change
different instances of infrastructure that don’t need to be consistent. Arguably
this case isn’t snowflakes as code, but simply separate infrastructure projects.

Consequences

It can be challenging to maintain multiple snowflakes. When you want to make a
code change, you need to copy it to every project. You probably need to test
each instance separately, as a change may work in one but not another.

Snowflake environments often suffer from configuration drift (see
“Configuration Drift”). Using snowflakes as code for delivery environments
reduces the reliability of the deployment process and the validity of testing, due
to inconsistencies from one environment to the next.

A snowflake as code might be consistent when it’s first set up, but variations
creep in over time.

After a year the ClotheSpin white-label team was running nine different
production environment instances for their customers and multiple delivery
instances to test changes. A new version of Kubernetes was released that
included essential fixes. When the team upgraded a test instance they discovered
that they needed to change its infrastructure code to get the new version to work.
Upgrading, testing, and fixing each customer instance took a week. Two months
later, as they were finishing the last upgrades, a Kubernetes bugfix was released
that addressed a newly discovered security vulnerability, and the team began the
process again.

Implementation

You create snowflakes by copying the project code from one stack instance into
a new project. You then edit the code to customize it for the new instance. When
you make a change to one stack, you need to copy and paste it across all of the
other stack projects, while keeping the customizations in each one.
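
To make the mechanics concrete, here is a sketch using hypothetical project
names; the copy itself is trivial, which is why the pattern is so tempting:

    # Start a new snowflake by copying an existing stack project
    cp -r customer-a-stack/ customer-b-stack/
    # Hand-edit the copy to customize it for the new instance, then deploy.
    # From now on, every change must be manually copied across all projects.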

Related patterns

Environment branches may be considered a form of snowflakes as code. Each
branch has a copy of the code, and people copy code between branches by
merging. Continuously applying code may avoid the pitfalls of snowflakes,
because it guarantees the code isn’t modified from one environment to the next.
Editing the code as a part of merging it to an environment branch creates the
hazards of the snowflake antipattern.

The wrapper stack pattern is also similar to snowflakes as code. A wrapper stack
uses a separate stack project for each environment to set configuration
parameters. But the code for the stack is implemented in stack components, such
as reusable module code. That code itself is not copied and pasted to each
environment, but promoted as an artifact, much like a reusable stack. However,
if people add more than basic stack instance parameters to the wrapper stack
projects, it can devolve into the snowflake as code antipattern.

In cases where stack instances are meant to represent the same stack, the
reusable stack pattern is usually more appropriate.

Pattern: Reusable Stack

A reusable stack is an infrastructure source code project that is used to create
multiple instances of a stack (Figure 6-14).
Figure 6-14. Multiple stack instances created from a single reusable stack project

Motivation

You create a reusable stack to maintain multiple consistent instances of
infrastructure. When you make changes to the stack code, you can apply and test
it in one instance, and then use the same code version to create or update
multiple additional instances. You aim to provision new instances of the stack
with minimal ceremony, maybe even automatically.

The ClotheSpin white-label team decided to move from the snowflake
environments they were using to reusable code. In their first attempt, they
created reusable libraries for the common parts of their white-label
environments. They kept a separate project for each customer environment, but
this project only imported the libraries to create the Kubernetes cluster, database
instances, and message queues.

However, the team found that making a change across all of their customers was
still time-consuming. They would change and test the relevant library in a
common test environment. Then they would deploy the library into a separate
test environment for each customer, customize it for the customer’s
configuration, and test and fix customer-specific issues.

So the ClotheSpin team decided to create a single CDK project for their white-
label customers. They made the project configurable to meet the needs of
different customers. Each time they changed their CDK project, they deployed
and tested it in a single environment. Their tests needed to prove that different
configuration options worked, but the overhead of testing a single build was less
than deploying and testing a build for each customer.

Reusable stacks support many of the principles of cloud infrastructure described
in Chapter 2, including making everything reproducible, avoiding snowflake
systems, minimizing variation, and creating disposable things.

Applicability

You can use a reusable stack for multiple environments that are essentially
replicas of the same infrastructure. Reusable stacks are essential for delivery
environments. Operability scenarios such as availability can be implemented by
deploying a reusable stack to create a failover environment, possibly
automatically when failures are detected. They are also useful for deploying
instances of a common service in different geographical regions, and for white-
label situations as described in the ClotheSpin example.

The reusable stack pattern is less applicable when environments need to be
heavily customized. Many of the drivers for breaking up a system across
architectural, organizational, and governance boundaries are best addressed
using the stack structuring patterns discussed in Chapter 10.

Consequences

The ability to provision and update multiple stacks from the same project
enhances scalability, reliability, and throughput. You can manage more instances
with less effort, make changes with a lower risk of failure, and roll changes out
to more systems more rapidly.

You typically need to configure some aspects of the stack differently for
different instances, even if it’s just what you name things. I’ll spend a whole
chapter talking about this (Chapter 11).

You should test your stack project code before you apply changes to business-
critical infrastructure. I’ll spend multiple chapters on this, including Chapters 8
and 9.

Implementation

You create a reusable stack as an infrastructure stack project and then run the
stack management tool each time you want to create or update an instance of the
stack. Use the syntax of the stack tool command to tell it which instance you
want to create or update. With Terraform, for example, you would specify a
different state file or workspace for each instance. With CloudFormation, you
pass a unique stack ID for each instance.
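
As a hedged illustration with real tools (the workspace and stack names below
are invented), the same project code can be applied to separate instances like
this:

    # Terraform: one workspace, and therefore one state file, per instance
    terraform workspace new staging
    terraform apply
    terraform workspace new production
    terraform apply

    # CloudFormation: a unique stack name identifies each instance
    aws cloudformation deploy --template-file stack.yaml --stack-name store-staging
    aws cloudformation deploy --template-file stack.yaml --stack-name store-production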

The following example provisions two stack instances from a single project
using a fictional command (I’ll call it stack here; the name and syntax are
placeholders). The command takes an argument that identifies the unique
instance to create or update:
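
    # The command name and its argument syntax are placeholders
    stack up instance=staging
    stack up instance=production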

As a rule, you should use simple parameters to define differences between stack
instances—strings, numbers, or in some cases, lists. Additionally, the
infrastructure created by a reusable stack should not vary much across instances.
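
For example, an instance-specific value might be passed as a simple parameter
on the command line (continuing the fictional stack command; the parameter
name is invented):

    stack up instance=staging cluster_minimum=1
    stack up instance=production cluster_minimum=3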
Related patterns

The reusable stack is an improvement on the snowflake as code antipattern (see
“Antipattern: Snowflakes As Code”), making it easier to keep multiple instances
consistent.

The wrapper stack pattern uses stack components to define a reusable stack, but
uses a different stack project to set parameter values for each instance.

Building Environments with Multiple Stacks
The reusable stack pattern describes an approach for implementing multiple
replica environments. This pattern is straightforward to implement when your
environment is defined as a single stack, as shown in the diagram for the
reusable stack pattern (Figure 6-14).

However, Chapter 10 described patterns for breaking a system across multiple
stacks. As a system grows, using a single stack for each environment becomes
unwieldy, as described by the Monolithic Stack antipattern in that chapter.

When a system that is replicated in multiple environments is broken into
multiple stacks, each of the stacks will need to be replicated in each
environment. Figure 6-15 shows the example delivery environments, each of
which hosts the set of stacks described in the example for the Single Service
Stack pattern (Chapter 10).
Figure 6-15. Using multiple stacks to build each environment

Each of the three service stacks (browse, search, and administration) is defined
as a stack code project. To create an environment, the stack tool is run to deploy
each of the three stacks. To create a new environment (here named staging for
illustration) using the fictional stack command, the ClotheSpin team would run
it three times:
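
    # Deploy each of the three stack projects into the new environment.
    # The command and its --source flag are illustrative placeholders.
    stack up instance=staging --source ./browse
    stack up instance=staging --source ./search
    stack up instance=staging --source ./administration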
When the code in one of the stack projects changes, the updated version of the
stack project needs to be deployed in each environment, but the other stacks
aren’t touched. Chapter 5 describes strategies for splitting systems into multiple
components, and Chapter 14 discusses how to integrate infrastructure across
stacks.

Conclusion
Environment architecture is a topic that is often taken for granted. Many IT
organizations suffer from design decisions made not through conscious thought,
but from habits and assumptions about what is “industry best practice.”5
This chapter describes different aspects of designing and implementing a
conscious architecture for environments. Infrastructure as Code creates an
opportunity to move beyond heavyweight, static environments. Environments,
like every part of a system, should be evolvable, so they can be continuously
adapted and improved with changing needs and better understanding.

The reusable stack is a workhorse pattern for teams who need to manage large
infrastructures, helping them easily create and maintain multiple environments
with a high level of consistency and good governance. Chapter 7 will discuss
ways to reliably make and deliver changes to stacks across environments.
However, a key challenge that reusable stacks introduce is managing necessary
differences between stack instances; Chapter 11 focuses on ways of managing
instance-specific stack configuration.
1 The two storefronts could share services. The ClotheSpin team has had
many debates about whether and how to consolidate their services, but the
business priority was to accelerate the development of Hipsteroo without
disrupting the existing ClotheSpin business, which led to separate
implementations.

2 The environments were Development, QA, SIT (Systems Integration
Testing), UAT (User Acceptance Testing), OAT (Operations Acceptance
Testing), Pre-Prod, and Production.

3 In practice, your cloud vendor may give you options to override the
abstraction, for example by specifying that particular resources should not share
physical hardware with each other.

4 Charity Majors shared her painful experiences of working with a
multiple-environment stack in a blog post.

5 I mentioned in the preface why I’m not a fan of the term “best practice”.
About the Author
Kief Morris (he/him) is Global Director of Cloud Engineering at
ThoughtWorks. He drives conversations across roles, regions, and industries at
companies ranging from global enterprises to early stage startups. He enjoys
working and talking with people to explore better engineering practices,
architecture design principles, and delivery practices for building systems on the
cloud.

Kief ran his first online system, a bulletin board system (BBS), in Florida in the
early 1990s. He later enrolled in an MSc program in computer science at the
University of Tennessee because it seemed like the easiest way to get a real
internet connection. Joining the CS department’s system administration team
gave him exposure to managing hundreds of machines running a variety of Unix
flavors.

When the dot-com bubble began to inflate, Kief moved to London, drawn by the
multicultural mixture of industries and people. He’s still there, living with his
wife, son, and cat.

Most of the companies Kief worked for before ThoughtWorks were post-
startups, looking to build and scale. The titles he’s been given or self-applied
include Software Developer, Systems Administrator, Deputy Technical Director,
R&D Manager, Hosting Manager, Technical Lead, Technical Architect,
Consultant, and Director of Cloud Engineering.
