Experience
of
so-ware
engineers
using
TLA+,
PlusCal
and
TLC
Chris
Newcombe
Principal
Engineer,
Amazon
Web
Services
Amazon
Web
Services
30+
services
[1];
compute,
storage,
db,
queuing,
High
Scale:
S3
(Simple
Storage
Service)
launched
in
2006
now
holds
over
1
trillion
objects
growing
by
3.5
billion
per
day
[2]
SophisVcated
features
DynamoDB:
strongly-consistent,
highly
available,
scalable,
predictable
performance,
mulV-tenant
NetFlix
is
now
100%
cloud
(on
AWS)[3]
2
Most
services
face
the
same
problem
Interac(ng
subtle
algorithms
complex
business
logic,
rules-engines,
evolving
schemas,
interacVons
with
other
systems
dynamic
group-membership;
live
re-sharding
/
auto-
scaling
concurrency-control
consistency,
workow
replicaVon,
fail-over
acVons
by
human
operators
3
Why
dont
engineers
specify
their
systems?
Deep
ignorance/confusion
about
benet,
terminology,
method,
tools
Deep
skepVcism
of
snake-oil
The
weight
of
evidence
for
an
extraordinary
claim
must
be
propor(oned
to
its
strangeness.
-
Laplace
No
Vme
to
look
at
anything
that
is
not
an
immediate
producVvity
gain
Failing
to
disVnguish
design
aw
vs.
implementaVon
bug
Skills
complacence
:
I
only
need
java/python/
ResignaVon
:
Bugs
are
inevitable
anyway
Blog-comment
culture
-
more
fun
to
broadcast/defend
an
immediate
personal
opinion
than
to
consider
changing
it
4
Moving
from
intuiVve
correctness
to
precision
From
Java
Concurrency
In
Prac(ce,
page
17
Correctness
means
a
class
conforms
to
its
specica(on.
A
good
specicaVon
denes
invariants
constraining
an
objects
state,
and
postcondi(ons
describing
the
eects
of
its
operaVons.
Since
we
o-en
dont
write
adequate
specicaVons
for
our
classes,
how
can
we
possibly
know
they
are
correct?
We
cant,
but
that
doesnt
stop
us
from
using
them
anyway,
once
weve
convinced
ourselves
that
the
code
works.
This
code
condence
is
about
as
close
as
many
of
us
get
to
correctness,
so
lets
just
assume
that
single-threaded
correctness
is
something
that
we
know
it
when
we
see
it.
Having
opVmisVcally
dened
correctness
as
something
that
can
be
recognized,
we
can
now
dene
thread
safety
Terminology
diculVes
Allergic
to
the
word
formal
Overloaded
terminology:
So,
specicaVon
means
inputs
and
outputs,
right?
Current
best
soluVon:
call
it
ExhausVvely-testable
pseudo-code
6
The
most
persuasive
example
so
far
A
famous
system
design
(2001)
Ring
of
nodes
Three
features
that
dis(nguish
Chord
from
many
other
peer-to-peer
lookup
protocols
are
its
simplicity,
provable
correctness,
and
provable
performance.
9
Original
pseudo-code
for
Chord
10
Nine
years
later,
another
paper
11
Found
8
major
defects
in
the
Chord
ring-membership
protocol
12
13
IniVal
conceptual
diculVes
when
engineers
are
learning
TLA+
Single
global
state
--
But
where
is
the
concurrency?
Use
of
non-determinism
Interleaving
vs.
non-interleaving
specicaVon
Can
either
or
allow
2
or
more
clauses
to
happen?
Unfamiliar
kinds
of
expressions
programming
via
recursive
operators,
instead
of
using
set
comprehensions
Idioms?
E.g.
for
abstracVng
network
What
is
the
dierence
between
[
->
]
and
[
|->
]
?
AcVons
that
inadvertently
access
the
state
of
other
processes
(e.g.
to
get
data
that
should
have
been
sent
as
part
of
a
message
payload)
14
More
advanced
conceptual
diculVes
Liveness
is
a
property
of
an
innite
behavior,
so
what
does
TLCs
liveness
checking
actually
tell
you?
How
can
we
easily
tell
if
it
safe
to
use
a
symmetry
set
for
model-checking?
15
SuggesVons
to
help
increase
adopVon
Syntax
highlighVng
for
PlusCal
A
wiki-style
cookbook
of
specicaVon
idioms
A
wiki-style
cookbook
of
tricks
for
using
TLC
More
real-world
example
specicaVons
and,
ideally,
their
inducVve
invariants.
16
Use
at
Amazon
so
far
1. Tim
Fault-tolerant
replicaVon
(interacVng
distributed
algorithms)
2. Fan
Fault-tolerant
network
protocol
3. Mike
Lock-free
data
structure
4. Chris
Concurrent
algorithm,
involving
third-party
components
5. Sam
Fault-tolerant
replicaVon
(interacVng
distributed
algorithms)
6. Cahill
IsolaVon
of
database
transacVons
17
Tim
interacVng
distributed
algorithms
Fault-tolerant
replicaVon,
group-membership
Principal
Engineer,
one
of
the
best
at
Amazon
Algorithm
was
already
designed,
and
coded
in
java,
with
months
of
thorough
design
analysis
(30+
pages
of
informal
prose
proofs)
Approx.
6
weeks,
while
doing
other
work
Pure
TLA+.
We
might
have
considered
PlusCal,
but
hadnt
tried
it
yet.
Not
sure
how
PlusCal
works
with
mulVple
modules.
939
non-comment,
non-blank
lines
split
across
7
modules
529
pure
comment
lines
11
invariants.
No
liveness
properVes.
Distributed
TLC,
on
cluster
of
10
cc1.4xl
nodes
in
EC2
TLC
found
3
important
bugs.
Counter-examples
had
35+
steps.
18
Fan
Fault-tolerant
protocol
Junior
engineer.
Very
keen
to
learn,
a-er
a
presentaVon
on
TLA.
PhD
but
not
in
computer
science
(might
be
signicant)
6
weeks.
Independent
a-er
3
weeks
of
approx.
1hr/day
help
First
wrote
~700
line
TLA+
spec.
93
lines
were
UNCHANGED
statements,
very
tedious
to
maintain
Converted
to
PlusCal,
conVnued
adding
to
specicaVon
995
non-blank,
non-comment
lines
(1728
lines
of
translated
TLA+)
4
kinds
of
processes
No
procedures,
just
macros
TLC
on
single
cc2.8xl
EC2
instance,
and
distributed
on
10
x
cc1.4xl
Views
help,
symmetry
does
not
So
far
has
found
2
invariant
failures,
one
of
which
they
expected
and
intend
to
live
with
(low
probability,
and
the
system
fails
safe).
19
Mike
Lock-free
algorithm
OpVmizaVon
for
criVcal
in-memory
data-structure
Mid-level
engineer
Learned
&
used
PlusCal
on
his
own,
a-er
a
presentaVon
on
a
paper
by
Lampson[1]
Reportedly
missed
a
liveness
bug.
No-one
at
Amazon
has
checked
liveness
properVes
yet.
Liveness
is
less
important
than
safety,
but
sVll
important.
223
non-blank,
non-comment
lines
(excluding
translaVon)
9
invariants,
as
asserts
5
macros,
4
procedures,
2
types
of
process
(each
1..2
disjuncVons)
5..6
labels
per
procedure
20
Chris:
Concurrent
algorithm,
using
third-party
components
Using
PlusCal
as
an
analysis/design
sketchpad
3400
lines,
80%
prose
comments
Wrote
prose,
then
rened
secVons
into
TLA+
deniVons
and
PlusCal,
to
make
it
precise
Never
got
as
far
as
model-checking
SVll
major
benets
Helped
me
understand
a
messy
problem
Helped
me
to
hugely
simplify
the
problem
Speculate
that
I
want
Literate
specicaVons
21
Literate
SpecicaVons?
Donald
Knuth
wrote[1]
Some
of
my
major
programs
could
not
have
been
wriJen
with
any
other
methodology
that
Ive
ever
heard
of.
The
complexity
was
simply
too
daun(ng
for
my
limited
brain
to
handle;
without
literate
programming,
the
whole
enterprise
would
have
opped
miserably.
Literate
programming
is
what
you
need
to
rise
above
the
ordinary
level
of
achievement.
22
Sam
interacVng
distributed
algorithms
Another
fault-tolerant
replicaVon
and
group-
membership
problem
Got
a
brief
design
sketch
from
a
senior
engineer,
in
somewhat
hand-wavy
prose
Turning
it
into
a
PlusCal
shell,
to
systemaVcally
uncover
ambiguity
649
lines
including
prose
(but
barely
started)
Again,
literate
specicaVons
might
help
23
Cahill
IsolaVon
of
database
transacVons
Spec.
for
a
published
algorithm,
Serializable
Snapshot
IsolaVon
PracVcal
algorithm;
now
used
in
postgresql
Not
used
at
Amazon;
spec
was
wriwen
as
an
example
for
a
conference
talk.
But
I
was
doing
Amazon
work
in
that
area
at
the
Vme.
1600
lines
of
TLA+,
inc
lots
of
comments.
Our
rst
use
of
Distributed
TLC
Spec
available
at
hwp://hpts.ws/agenda.html
(the
source
bundle
for
the
talk
called
Debugging
Designs)
24
Future
challenges
Renement
Model-checking
real-Vme
specicaVons.
Proofs
perhaps
one
day
AdopVon
by
other
divisions
25
TLC
performance
Disk-IO
bound
for
all
pracVcal
specs
weve
seen,
including
distributed
TLC.
Huge
improvement
using
new
EC2
hi1.4xlarge
16
cores
60.5
GB
RAM
120,000
x
4KB
random
read
IOPS
+
85,000
x
4KB
random
write
IOPS
across
2
*
1TB
SSD
volumes
10
Gigabit
Ethernet,
with
cluster
placement
groups
May
help
further
to
have
an
opVon
to
disable
checkpoints
26
27
Support
Material
28
Specs
for
Cahills
algorithm
Download
link:
hwp://hpts.ws/sessions/amazonbundle.tar.gz
TLA+
Alloy
Amazon
CondenVal
29
An
interesVng
execuVon
The
example
from
the
paper:
R2(X0,0)
R2(Y0,0)
R1(Y0,0)
W1(Y1,20)
C1
R3(X0,0)
R3(Y1,20)
C3
W2(X2,-11)
C2
The
simplest
example
found
by
TLC
(note:
begin
can
be
a
separate
operaVon)
R1(X0)
W1(Y1)
W2(X2)
C2
B3
C1
R3(X2)
R3(Y0)
C3
Amazon
CondenVal
30