[go: up one dir, main page]

0% found this document useful (0 votes)
148 views30 pages

TLA+ in Software Engineering

This document summarizes the experience of software engineers at Amazon Web Services using formal specification tools TLA+, PlusCal and TLC. It describes how AWS services face challenges in specifying complex interacting algorithms at large scale. The document discusses initial difficulties engineers had with formal methods terminology and concepts. It also provides examples of how formal specification of systems like Chord revealed hidden defects. The document suggests ways to increase adoption of formal specification, such as syntax highlighting and example specifications. It briefly describes several systems specified by AWS engineers using these tools.

Uploaded by

Chuka Okoye
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
148 views30 pages

TLA+ in Software Engineering

This document summarizes the experience of software engineers at Amazon Web Services using formal specification tools TLA+, PlusCal and TLC. It describes how AWS services face challenges in specifying complex interacting algorithms at large scale. The document discusses initial difficulties engineers had with formal methods terminology and concepts. It also provides examples of how formal specification of systems like Chord revealed hidden defects. The document suggests ways to increase adoption of formal specification, such as syntax highlighting and example specifications. It briefly describes several systems specified by AWS engineers using these tools.

Uploaded by

Chuka Okoye
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

Experience

of so-ware engineers
using TLA+, PlusCal and TLC
Chris Newcombe
Principal Engineer, Amazon Web Services

Amazon Web Services


30+ services [1]; compute, storage, db, queuing,
High Scale:
S3 (Simple Storage Service) launched in 2006
now holds over 1 trillion objects
growing by 3.5 billion per day [2]

SophisVcated features
DynamoDB: strongly-consistent, highly available,
scalable, predictable performance, mulV-tenant

NetFlix is now 100% cloud (on AWS)[3]


2

Most services face the same problem


Interac(ng subtle algorithms
complex business logic, rules-engines, evolving
schemas,
interacVons with other systems
dynamic group-membership; live re-sharding / auto-
scaling
concurrency-control
consistency, workow
replicaVon, fail-over
acVons by human operators
3

Why dont engineers specify their


systems?
Deep ignorance/confusion about benet, terminology,
method, tools
Deep skepVcism of snake-oil
The weight of evidence for an extraordinary claim must be
propor(oned to its strangeness. - Laplace

No Vme to look at anything that is not an immediate


producVvity gain
Failing to disVnguish design aw vs. implementaVon bug
Skills complacence : I only need java/python/
ResignaVon : Bugs are inevitable anyway
Blog-comment culture - more fun to broadcast/defend
an immediate personal opinion than to consider changing it
4

Moving from intuiVve correctness to precision


From Java Concurrency In Prac(ce, page 17


Correctness means a class conforms to its specica(on. A good
specicaVon denes invariants constraining an objects state, and
postcondi(ons describing the eects of its operaVons.

Since we o-en dont write adequate specicaVons for our classes, how can
we possibly know they are correct? We cant, but that doesnt stop us from
using them anyway, once weve convinced ourselves that the code works.

This code condence is about as close as many of us get to correctness,
so lets just assume that single-threaded correctness is
something that we know it when we see it. Having

opVmisVcally dened correctness as something that can be


recognized, we can now dene thread safety

Terminology diculVes
Allergic to the word formal
Overloaded terminology:

So, specicaVon means inputs and outputs, right?

Current best soluVon: call it


ExhausVvely-testable pseudo-code
6

The most persuasive example so far

A famous system design (2001)

Ring of nodes

Three features that dis(nguish Chord from many other peer-to-peer


lookup protocols are its simplicity, provable correctness, and
provable performance.
9

Original pseudo-code for Chord

10

Nine years later, another paper

11

Found 8 major defects in the


Chord ring-membership protocol

12

13

IniVal conceptual diculVes


when engineers are learning TLA+
Single global state -- But where is the concurrency?
Use of non-determinism
Interleaving vs. non-interleaving specicaVon
Can either or allow 2 or more clauses to happen?

Unfamiliar kinds of expressions

programming via recursive operators,


instead of using set comprehensions

Idioms? E.g. for abstracVng network


What is the dierence between [ -> ] and [ |-> ] ?
AcVons that inadvertently access the state of other
processes (e.g. to get data that should have been sent as
part of a message payload)
14

More advanced conceptual diculVes


Liveness is a property of an innite behavior,
so what does TLCs liveness checking
actually tell you?
How can we easily tell if it safe to use a
symmetry set for model-checking?

15

SuggesVons to help increase adopVon

Syntax highlighVng for PlusCal


A wiki-style cookbook of specicaVon idioms
A wiki-style cookbook of tricks for using TLC
More real-world example specicaVons
and, ideally, their inducVve invariants.



16

Use at Amazon so far


1. Tim Fault-tolerant replicaVon

(interacVng distributed algorithms)
2. Fan Fault-tolerant network protocol
3. Mike Lock-free data structure
4. Chris Concurrent algorithm,

involving third-party components
5. Sam Fault-tolerant replicaVon

(interacVng distributed algorithms)
6. Cahill IsolaVon of database transacVons
17

Tim
interacVng distributed algorithms
Fault-tolerant replicaVon, group-membership
Principal Engineer, one of the best at Amazon
Algorithm was already designed, and coded in java, with months of
thorough design analysis (30+ pages of informal prose proofs)
Approx. 6 weeks, while doing other work
Pure TLA+. We might have considered PlusCal, but hadnt tried it
yet. Not sure how PlusCal works with mulVple modules.
939 non-comment, non-blank lines split across 7 modules
529 pure comment lines
11 invariants. No liveness properVes.
Distributed TLC, on cluster of 10 cc1.4xl nodes in EC2
TLC found 3 important bugs. Counter-examples had 35+ steps.

18

Fan
Fault-tolerant protocol

Junior engineer. Very keen to learn, a-er a presentaVon on TLA.


PhD but not in computer science (might be signicant)
6 weeks. Independent a-er 3 weeks of approx. 1hr/day help
First wrote ~700 line TLA+ spec.
93 lines were UNCHANGED statements, very tedious to maintain

Converted to PlusCal, conVnued adding to specicaVon


995 non-blank, non-comment lines (1728 lines of translated TLA+)
4 kinds of processes
No procedures, just macros
TLC on single cc2.8xl EC2 instance, and distributed on 10 x cc1.4xl
Views help, symmetry does not
So far has found 2 invariant failures, one of which they expected
and intend to live with (low probability, and the system fails safe).
19

Mike
Lock-free algorithm
OpVmizaVon for criVcal in-memory data-structure
Mid-level engineer
Learned & used PlusCal on his own, a-er a presentaVon on
a paper by Lampson[1]
Reportedly missed a liveness bug.
No-one at Amazon has checked liveness properVes yet.
Liveness is less important than safety, but sVll important.
223 non-blank, non-comment lines (excluding translaVon)
9 invariants, as asserts
5 macros, 4 procedures, 2 types of process (each 1..2
disjuncVons)
5..6 labels per procedure
20

Chris: Concurrent algorithm,


using third-party components

Using PlusCal as an analysis/design sketchpad
3400 lines, 80% prose comments
Wrote prose, then rened secVons into TLA+
deniVons and PlusCal, to make it precise
Never got as far as model-checking
SVll major benets
Helped me understand a messy problem
Helped me to hugely simplify the problem

Speculate that I want Literate specicaVons


21

Literate SpecicaVons?
Donald Knuth wrote[1]


Some of my major programs could not have
been wriJen with any other methodology that Ive
ever heard of. The complexity was simply too
daun(ng for my limited brain to handle; without
literate programming, the whole enterprise would
have opped miserably.

Literate programming is what you need to rise
above the ordinary level of achievement.
22

Sam
interacVng distributed algorithms
Another fault-tolerant replicaVon and group-
membership problem
Got a brief design sketch from a senior
engineer, in somewhat hand-wavy prose
Turning it into a PlusCal shell, to
systemaVcally uncover ambiguity
649 lines including prose (but barely started)
Again, literate specicaVons might help
23

Cahill
IsolaVon of database transacVons
Spec. for a published algorithm, Serializable
Snapshot IsolaVon
PracVcal algorithm; now used in postgresql
Not used at Amazon; spec was wriwen as an
example for a conference talk. But I was doing
Amazon work in that area at the Vme.
1600 lines of TLA+, inc lots of comments.
Our rst use of Distributed TLC
Spec available at hwp://hpts.ws/agenda.html

(the source bundle for the talk called Debugging Designs)


24

Future challenges

Renement
Model-checking real-Vme specicaVons.
Proofs perhaps one day
AdopVon by other divisions

25

TLC performance
Disk-IO bound for all pracVcal specs weve seen,
including distributed TLC.
Huge improvement using new EC2 hi1.4xlarge
16 cores
60.5 GB RAM
120,000 x 4KB random read IOPS
+ 85,000 x 4KB random write IOPS
across 2 * 1TB SSD volumes
10 Gigabit Ethernet, with cluster placement groups

May help further to have an opVon to disable


checkpoints

26

27

Support Material

28

Specs for Cahills algorithm


Download link: hwp://hpts.ws/sessions/amazonbundle.tar.gz

TLA+

Alloy

Amazon CondenVal

29

An interesVng execuVon

The example from the paper:



R2(X0,0) R2(Y0,0) R1(Y0,0) W1(Y1,20) C1 R3(X0,0) R3(Y1,20) C3 W2(X2,-11) C2

The simplest example found by TLC (note: begin can be a separate operaVon)

R1(X0) W1(Y1) W2(X2) C2 B3 C1 R3(X2) R3(Y0) C3
Amazon CondenVal

30

You might also like