[go: up one dir, main page]

0% found this document useful (1 vote)
83 views87 pages

Zookeeper

This document provides an overview of a ZooKeeper tutorial presented at Eurosys 2011. The tutorial covers fundamentals of coordination services and discusses how ZooKeeper works. It introduces ZooKeeper as a coordination kernel that provides recipes to implement common coordination primitives. ZooKeeper uses a file system based API to manipulate small data nodes called znodes. The tutorial covers how ZooKeeper maintains consistency through an ensemble of servers that replicate updates with an elected leader. It also provides examples of how systems like Bigtable, web crawlers, and Google File System use ZooKeeper for tasks like master election, metadata management, and crash detection.

Uploaded by

Arnav Vats
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (1 vote)
83 views87 pages

Zookeeper

This document provides an overview of a ZooKeeper tutorial presented at Eurosys 2011. The tutorial covers fundamentals of coordination services and discusses how ZooKeeper works. It introduces ZooKeeper as a coordination kernel that provides recipes to implement common coordination primitives. ZooKeeper uses a file system based API to manipulate small data nodes called znodes. The tutorial covers how ZooKeeper maintains consistency through an ensemble of servers that replicate updates with an elected leader. It also provides examples of how systems like Bigtable, web crawlers, and Google File System use ZooKeeper for tasks like master election, metadata management, and crash detection.

Uploaded by

Arnav Vats
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 87

ZooKeeper Tutorial

Flavio Junqueira
Benjamin Reed

Yahoo! Research

hCps://cwiki.apache.org/confluence/display/ZOOKEEPER/EurosysTutorial
Eurosys 2011 ‐ Tutorial 1

Used for 14-848 Discussion, 11/6/2017 by Gregory Kesden


Plan for today
• First half
– Part 1
• Motivation and background
– Part 2
• How ZooKeeper works on paper
• Second half
– Part 3
• Share some practical experience
• Programming exercises
– Part 4
• Some caveats
• Wrap up

Eurosys 2011 ‐ Tutorial 2


ZooKeeper Tutorial

Part 1
Fundamentals
Yahoo! Portal
Search

E‐mail

Finance

Weather

News

Eurosys 2011 ‐ Tutorial 4


Yahoo!: Workload generated
• Home page
– 38 million users a day (USA)
– 2.5 billion users a month (USA)
• Web search
– 3 billion queries a month
• E‐mail
– 90 million actual users
– 10 min/visit

Eurosys 2011 ‐ Tutorial 5


Yahoo! Infrastructure
• Lots of servers
• Lots of processes
• High volumes of data
• Highly complex
soaware systems
• … and developers are
mere mortals
Yahoo! Lockport Data Center

Eurosys 2011 ‐ Tutorial 6


Coordination is important

Eurosys 2011 ‐ Tutorial 7


Coordination primitives
• Semaphores
• Queues
• Leader election
• Group membership
• Barriers
• Configuration

Eurosys 2011 ‐ Tutorial 8


Even small is hard…

Eurosys 2011 ‐ Tutorial 9


A simple model
• Work assignment
– Master assigns work
– Workers execute tasks assigned by master

Master

Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 10


Master crashes
• Single point of failure
• No work is assigned
• Need to select a new master

Master

Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 11


Worker crashes
• Not as bad… Overall system still works
– Does not work if there are dependencies
• Some tasks will never be executed
• Need to detect crashed workers
Master

Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 12


Worker does not receive assignment
• Same problem as before
• Some tasks may not be executed
• Need to guarantee that worker receives
assignment
Master

Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 13


Fault‐tolerant distributed system

Master

Coordination
Service
Master

Worker Worker Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 14


Fault‐tolerant distributed system

Master
Coordination
Service

Master

Worker Worker Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 15


Fully distributed

Coordination
Service

Worker Worker Worker Worker Worker Worker

Eurosys 2011 ‐ Tutorial 16


Fallacies of distributed computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Peter Deutsch, http://blogs.sun.com/jag/resource/Fallacies.html

Eurosys 2011 ‐ Tutorial 17


One more fallacy

• You know who is alive

Eurosys 2011 ‐ Tutorial 18


Why is it difficult?
• FLP impossibility result
– Asynchronous systems
– Consensus is impossible if a single process can crash
Fischer, Lynch, Paterson, ACM PODS, 1983

• According to Herlihy, we do need consensus


– Wait‐free synchronization
– Wait‐free: completion in a finite number of steps
– Universal object: equivalent to solving consensus for n
processes
Herlihy, ACM TOPLAS, 1991
Eurosys 2011 ‐ Tutorial 19
Why is it difficult?
• CAP principle
– Can’t obtain availability, consistency, and parOtion
tolerance simultaneously

Availability

ParOtion
Consistency
tolerance

Gilbert, Lynch, ACM SIGACT NEWS, 2002

Eurosys 2011 ‐ Tutorial 20


The case for a coordination service
• Many impossibility results
• Many fallacies to stumble upon
• Several common requirements across
applications
– Duplicating is bad
– Duplicating poorly is even worse
• Coordination service
– Implement it once and well
– Share by a number of applications
Eurosys 2011 ‐ Tutorial 21
Current systems
• Chubby, Google
– Lock service
Burrows, USENIX OSDI, 2006

• Centrifuge, Microsoft
– Lease service
Adya et al., USENIX NSDI, 2010

• ZooKeeper, Yahoo!
– Coordination kernel
– On Apache since 2008
Hunt et al., USENIX ATC, 2010
Eurosys 2011 ‐ Tutorial 22
Example – Bigtable, HBase
• Sparse column‐oriented data storage
– Tablet: range of rows
– Unit of distribution
• Architecture
– Master
– Tablet servers

Eurosys 2011 ‐ Tutorial 23


Example – Bigtable, HBase
• Master election
– Tolerate master crashes
• Metadata management
– ACLs, Tablet metadata
• Rendezvous
– Find tablet server
• Crash detection
– Live tablet servers

Eurosys 2011 ‐ Tutorial 24


Example – Web crawling
• Fetching service
Master
– Fetch Web pages for search
engine
• Master election
– Assign work Fetcher Fetcher Fetcher

• Metadata management
– Politeness constraints
– Shards
• Crash detection
– Live workers

Eurosys 2011 ‐ Tutorial 25


And more examples…
• GFS – Google File System
– Master election
– File system metadata
• KaCa ‐ Document indexing system
– Shard information
– Index version coordination
• Hedwig – Pub‐Sub system
– Topic metadata
– Topic assignment
Eurosys 2011 ‐ Tutorial 26
Summary of Part 1
• Large infrastructures require coordination
• Fallacies of distributed compuOng
• Theory results: FLP, CAP
• Coordination services
• Examples
– Web search
– Storage systems

Eurosys 2011 ‐ Tutorial 27


ZooKeeper
Tutorial

Part 2
The service
ZooKeeper
Introduction
• Coordination kernel
– Does not export concrete primitives
– Recipes to implement primitives
• File system based API
– Manipulate small data nodes:
znodes

Eurosys 2011 ‐ Tutorial 2


9
ZooKeeper: Overview
Ensemble

ZooKeeper
Client App Follower
Client Lib Session
Leader
atomically
Leader
broadcast
updates
ZooKeeper
Client App Follower
Client Lib Session

Follower
Replicated
ZooKeeper system
Client App Follower
Client Lib
Session

Eurosys 2011 ‐ Tutorial 3


0
ZooKeeper: Read operations
Ensemble
Read
ZooKeeper Read “x” X = 10 operations
Client App Follower
Client Lib processed
locally
Leader

ZooKeeper
Client App Follower
Client Lib

Follower

ZooKeeper
Client App Follower
Client Lib

Eurosys 2011 ‐ Tutorial 3


1
ZooKeeper: Write operations
Ensemble

ZooKeeper Write “x”,11 X = 11


Client App Follower
Client Lib
X = 11
Leader

X = 11
ZooKeeper
Client App Follower
Client Lib

Follower

ZooKeeper
Client App Follower
Client Lib

Replicates across a quorum


Eurosys 2011 ‐ Tutorial 3
2
ZooKeeper: Semantics of Sessions
• A prefix of operations submitted through a
session are executed
• Upon disconnection
– Client lib tries to contact another server
– Before session expires: connect to new server
– Server must have seen a transaction id at least as
large as the session

Eurosys 2011 ‐ Tutorial 3


3
ZooKeeper: API
• Create znodes: create
– Persistent, sequential, ephemeral
• Read and modify data: setData, getData
• Read the children of znode: getChildren
• Check if znode exists: exists
• Delete a znode: delete

Eurosys 2011 ‐ Tutorial 3


4
ZooKeeper: API
• Order
– Updates: Totally ordered, linearizable
– FIFO order for client operations
– Read: sequentially ordered
write(x, 10)
Client 1:
write(x, 11)
Client 2:

write(x, 10) write(x, 11)


Sequen7al:
Eurosys 2011 ‐ Tutorial 3
5
ZooKeeper: API
• Order
– Updates: Totally ordered, linearizable
– FIFO order for client operations
– Read: sequentially ordered
write(x, 10) read(x)
Client 1:
read(x)
Client 2:

read(x) write(x, 10) read(x)


Sequen7al:

Eurosys 2011 ‐ Tutorial 9


ZooKeeper: Example
1‐ create “/C‐”, “Ci”, sequential, ephemeral
2‐ getChildren “/”
3‐ If not leader, getData “first node”

Client 1
(C1) /

C1 /C‐1
Client 2
(C2)
C3 /C‐2

Client 3 C2 /C‐3
(C3)

Eurosys 2011 ‐ Tutorial 10


ZooKeeper: Znode changes
• Znode changes
– Data is set
– Node is created or deleted
– Etc…
• To learn of znode changes
– Set a watch
– Upon change, client receives a notification
– Notification ordered before new updates

Eurosys 2011 ‐ Tutorial 38


ZooKeeper: Watches

getData “/foo”, true


Client
/
return 10

10 /foo

Eurosys 2011 ‐ Tutorial 39


ZooKeeper: Watches

Client
/

11 /foo

setData “/foo”, 11
Client

return ok

Eurosys 2011 ‐ Tutorial 40


ZooKeeper: Watches

Client
/
notification

11 /foo

Client

Eurosys 2011 ‐ Tutorial 41


Watches, Locks, and the herd effect
• Herd effect
– Large number of clients wake up simultaneously

• Load spikes
– Undesirable

Eurosys 2011 ‐ Tutorial 42


Watches, Locks, and the herd effect

Client 1
(C1) /

C1 /C‐1
Client 2
(C2)
C2 /C‐2

Client 3
(Cn) Cn /C‐m

Eurosys 2011 ‐ Tutorial 43


Watches, Locks, and the herd effect

Client 1
(C1) /

Client 2
(C2)
C2 /C‐2

Client 3
(Cn) Cn /C‐m

Eurosys 2011 ‐ Tutorial 44


Watches, Locks, and the herd effect

Client 1
(C1) /

Client 2
(C2) notification C2 /C‐2

Client 3
(Cn) Cn /C‐m
notification

Eurosys 2011 ‐ Tutorial 45


Watches, Locks, and the herd effect
• A solution
– Use order of clients
– Each client
• Determines the znode z preceding its own znode in the
sequential order
• Watch z
– A single notification is generated upon a crash
• Disadvantage for leader election
– One client is notified of a leader change
Eurosys 2011 ‐ Tutorial 46
Linearizability
• Correctness condition
• Informal definition
– Order of operations is equivalent to a sequential
execution
– Equivalent order satisfies real time precedence order
write(x, 10) read(x)
Client 1:
read(x)
Client 2:

read(x) write(x, 10) read(x)


Sequential:
Eurosys 2011 ‐ Tutorial 47
Linearizability
• Correctness condition
• Informal definition
– Order of operations is equivalent to a sequential
execution
– Equivalent order satisfies real time precedence order
write(x, 10) read(x)
Client 1:
read(x)
Client 2:

read(x) write(x, 10) read(x)


Sequen7al:
Eurosys 2011 ‐ Tutorial 48
Linearizability
• Correctness condition
• Informal definition
– Order of operations is equivalent to a sequential
execution
– Equivalent order satisfies real time precedence order
write(x, 10) read(x)
Client 1:
read(x)
Client 2:

write(x, 10) read(x) read(x)


Sequen7al:
Eurosys 2011 ‐ Tutorial 49
Linearizability
• Is it important? It depends…
• Implements universal object
– Herlihy’s result
– Implement consensus for n processes

Eurosys 2011 ‐ Tutorial 50


Implementing consensus
• Each process p proposes then decides
• Propose(v)
– setData “/c/proposal-”, “v”, sequential
• Decide()
– getChildren “/c”
– Select znode z with smallest sequence number
– v’ = getData “/c/z”
– Decide upon v’

Eurosys 2011 ‐ Tutorial 51


Linearizability
• Is it important? It depends…
• Implements universal object
– Herlihy’s result
– Implement consensus for n processes
– … but it is affected by hidden channels

Eurosys 2011 ‐ Tutorial 52


Hidden channels

/
Client 1 ZK1
C1 /config

/
ZK2
C1 /config

/
Client 2
ZK3
C1 /config

Eurosys 2011 ‐ Tutorial 53


Hidden channels

/
Client 1 ZK1
C1 /config

/
ZK2
C2 /config

setData “/config”, C2
/
Client 2
ZK3
return OK
C2 /config

Eurosys 2011 ‐ Tutorial 54


Hidden channels

/
Client 1 ZK1
C1 /config

/
I have changed the config, ZK2
please read it!
C2 /config

/
Client 2
ZK3
C2 /config

Eurosys 2011 ‐ Tutorial 55


Hidden channels

getData “/config” /
Client 1 ZK1
return C1 C1 /config

/
ZK2
C2 /config

/
Client 2
ZK3
C2 /config

Eurosys 2011 ‐ Tutorial 56


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
getData “/foo”
– Before read operations
sync
– Flushes the channel
between follower and Follower

leader
/foo = C1
– Makes operations
linearizable
Leader
setData

Eurosys 2011 ‐ Tutorial 30


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
– Before read operations
– Flushes the channel
between follower and Follower
getData
leader sync
/foo = C1
– Makes operations
linearizable
Leader
setData

Eurosys 2011 ‐ Tutorial 58


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
– Before read operations
– Flushes the channel
between follower and Follower
getData
leader sync
/foo = C1
– Makes operations
linearizable sync

Leader
setData
sync

Eurosys 2011 ‐ Tutorial 59


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
– Before read operations
– Flushes the channel
between follower and Follower
getData
leader sync
/foo = C2
– Makes operations
linearizable setData “/foo”, C2

Leader
sync

Eurosys 2011 ‐ Tutorial 60


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
– Before read operations
– Flushes the channel
between follower and Follower
getData
leader sync
/foo = C2
– Makes operations
linearizable sync

Leader

Eurosys 2011 ‐ Tutorial 61


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
– Before read operations return sync
– Flushes the channel
between follower and Follower

leader getData
/foo = C2
– Makes operations
linearizable
Leader

Eurosys 2011 ‐ Tutorial 62


A hat trick…
• sync Client 1
(C1)
– Asynchronous operation
– Before read operations return “/foo”, C2
– Flushes the channel
between follower and Follower

leader
/foo = C2
– Makes operations
linearizable
Leader

Eurosys 2011 ‐ Tutorial 63


Summary of Part 2
• ZooKeeper
– Replicated service
– Propagate updates with a broadcast protocol
• Updates use consensus
• Reads served locally
• Workload not linearizable because of reads
• sync() makes it linearizable

Eurosys 2011 ‐ Tutorial 64


ZooKeeper Tutorial

Part 3
How it really works
Master/Worker System
 Clients Master Coordination
 Monitor the tasks Service
Master
 Queue tasks to be executed
 Masters Worker Worker Worker Worker

 Assign tasks to workers


 Workers
 Get tasks from the master

 Execute tasks

Eurosys 2011 ‐ Tutorial 66


Task Queue

client1

tasks
create(“/tasks/client1-”,
client1‐1 cmds,
SEQUENTIAL)
client3‐4
cmds is an array of String
client1‐6

Eurosys 2011 ‐ Tutorial 67


Group Membership
worker1

assign

worker1 create(“/assign/worker-”,
“”,
worker2 EPHEMERAL SEQUENTIAL)

worker3

listChildren(“/assign”,
true)

Master

Eurosys 2011 ‐ Tutorial 68


Leader ElecKon

master

create(“/master”,
getdata(“/master”,
hostinfo,
true);
EPHEMERAL)

Master Backup

Eurosys 2011 ‐ Tutorial 69


Configuration

worker2

assign

worker1 getdata(“/assign/worker2”, true)

worker2

setdata(“/assign/worker2”, znode_of_task)

Master

Eurosys 2011 ‐ Tutorial 70


Connecting to ZooKeeper
 Everyone has their own ZooKeeper address
and auth info.
Try connecting to ZooKeeper with the CLI.

java ‐jar zookeeper‐3.3.2‐fatjar.jar client zkaddr


 Use addAuth command to authenKcate

 Try out some commands

 Create znodes for /servers, /tasks, /assign

Eurosys 2011 ‐ Tutorial 71


Worker Processing
 Create a session
 Create the “worker” ephemeral znode
 Watch for the assign znode
 Deal with the watches
 Processing the assignment
- Update status in the task
- Delete assignment znode when finished
 What do to with SessionExpired

Eurosys 2011 ‐ Tutorial 72


Code on your own or
follow together

Eurosys 2011 ‐ Tutorial 9


Client Processing
 Create a session
 Create a task as a child of the /tasks znode
 Watch the status child of the /tasks znode

Eurosys 2011 ‐ Tutorial 10


Code on your own or
follow together

Eurosys 2011 ‐ Tutorial 11


Master Processing
 Create a session
 Do leader election using master znode
 Watch the worker list
 Watch the task queue
 Watch the assignment queue
 Deal with the watches
 Deal with workers coming and going
 Assign new tasks
 Watch for compleKons

Eurosys 2011 ‐ Tutorial 12


Code on your own or
follow together

Eurosys 2011 ‐ Tutorial 13


Give it a try…
 Start up the master
 Start up a worker
 Try submitting a command
 Queue up a bunch of sleep 100
 Add more workers
 Try killing a worker
 Try killing the master. Did take over work?

Eurosys 2011 ‐ Tutorial 14


ZooKeeper Tutorial

Part 4
Caveat Emptor
Revisit FLP and CAP
 What should a master do when
disconnected?
 What is the consequence of acAng as a master
while disconnected?

P1 reconnects
P1 elected P1 disconnected P1 expires
gets expired event
Ame

P2 elected

Eurosys 2011 ‐ Tutorial 80


Revisit FLP and CAP
 What happens if master elecAon gets a
“ConnectionLossException” aLer the
create?
 How do you fix it?
 How do you test it?

Eurosys 2011 ‐ Tutorial 81


Guidelines to ConnectionLoss
 A process will not see state changes while
disconnected
 Masters should act very conservaAvely, they
should not assume that they sAll have
mastership
 Don't treat as if it's the end of the world. The
client library will try to recover the session

Eurosys 2011 ‐ Tutorial 82


Other issues
 Watch out for SEQUENTIAL | EPHEMERAL!
 Problems resetting the ZooKeeper state
 What happens when you clear server state while
clients are running?
 What happens when you clear some servers but
not others?

Eurosys 2011 ‐ Tutorial 83


WriAng a test
 Use JUnit
 Use QuorumBase
 In setup call QuorumBase.setup()
 In tearDown call QuorumBase.tearDown()
 Write a simple test
 Use QuorumBase.hostPort to iniAalize the
ZooKeeper object in the tests
 Startup a master and a backup.
 Kill the master and make sure backup takes over

Eurosys 2011 ‐ Tutorial 84


Guidelines for SessionExpiration
 It is the end of the world!
 Should be rare.
 The session handle is dead, so you need a
new one.
 It is dangerous to try to transparently recover
by creating a new session. Usually there is
some cleanup and setup that needs to be
done

Eurosys 2011 ‐ Tutorial 85


Code on your own or follow
together

Eurosys 2011 ‐ Tutorial 8


Summary
 When used properly ZooKeeper can make it easy to build
distributed applicaAons.
 ZooKeeper is a tool to help you deal with the chaos of
distributed systems. It isn't magic.
 Don't try to shortcut the API

 Think about the consequences of ConnectionLoss

and SessionExpiration
 Make sure you test

 Checkout the developer resources


http://zookeeper.apache.org

Eurosys 2011 ‐ Tutorial 9

You might also like