Distributed k-ary System
Algorithms for Distributed Hash Tables
Ali Ghodsi
[Link]
PhD Defense, 7th December 2006, KTH/Royal Institute of Technology
Presentation Overview
Gentle introduction to DHTs
Contributions
The future
What's a Distributed Hash Table (DHT)?
An ordinary hash table, which is distributed:
Key → Value: Alexander → Berlin, Ali → Stockholm, Marina → Gothenburg, Peter → Louvain-la-Neuve, Seif → Stockholm, Stefan → Stockholm
Every node provides a lookup operation
Provide the value associated with a key
Nodes keep routing pointers
If item not found, route to another node
4
So what?
Characteristic properties:
Scalability
Number of nodes can be huge
Number of items can be huge
Self-management in presence of joins/leaves/failures
Routing information
Data items
Efficiency:
Time to find data is logarithmic
Size of routing tables is logarithmic
Example: log2(1,000,000) ≈ 20
EFFICIENT!
Storage:
Store a number of items proportional to the number of nodes
Typically, with D items and n nodes:
Store D/n items per node
Move D/n items when nodes join/leave/fail
EFFICIENT!
Self-management of routing info:
Ensure routing information is up-to-date
Self-management of items:
Ensure that data is always replicated and available
Presentation Overview
What's been the general motivation for DHTs?
Traditional Motivation (1/2)
Peer-to-Peer file sharing very popular
Napster:
Completely centralized
Central server knows who has what (central index)
Judicial problems
Gnutella:
Completely decentralized
Ask everyone you know to find data (decentralized index)
Very inefficient
10
Traditional Motivation (2/2)
Grand vision of DHTs
Provide efficient file sharing
Quote from Chord: "In particular, [Chord] can help avoid single points of failure or control that systems like Napster possess, and the lack of scalability that systems like Gnutella display because of their widespread use of broadcasts." [Stoica et al. 2001]
Hidden assumptions:
Millions of unreliable nodes
User can switch off the computer at any time (leave = failure)
Extreme dynamism (nodes joining/leaving/failing)
Heterogeneity of computers and latencies
Untrusted nodes
11
Our philosophy
DHT is a useful data structure Assumptions might not be true
Moderate amount of dynamism
A leave is not the same thing as a failure
Dedicated servers
Nodes can be trusted Less heterogeneity
Our goal is to achieve more given stronger assumptions
12
Presentation Overview
How to construct a DHT?
13
How to construct a DHT (Chord)?
Use a logical name space, called the identifier space, consisting of identifiers {0, 1, 2, ..., N-1}
The identifier space is a logical ring modulo N
Every node picks a random identifier
Example:
Space N = 16: {0, ..., 15}
Five nodes a, b, c, d, e
a picks 6, b picks 5, c picks 0, d picks 11, e picks 2
(Ring diagram: identifiers 0-15 with nodes at 0, 2, 5, 6, 11)
14
Definition of Successor
The successor of an identifier is the
first node met going in clockwise direction starting at the identifier
15
Example
succ(12) = 14
succ(15) = 2
succ(6) = 6
(Ring diagram: identifiers 0-15 with nodes at 2, 5, 6, 11, 14)
15
Where to store data (Chord)?
Use a globally known hash function H
Each item <key, value> gets identifier H(key)
Key → Value: Alexander → Berlin, Marina → Gothenburg, Peter → Louvain-la-Neuve, Seif → Stockholm, Stefan → Stockholm
Store each item at its successor:
Node succ(H(k)) is responsible for item k
Example:
H(Marina) = 12, H(Peter) = 2, H(Seif) = 9, H(Stefan) = 14
(Ring diagram: each item stored at the successor of its identifier)
16
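As a sketch of this placement rule (with a hypothetical hash `h` standing in for the globally known function H, so the example identifiers differ from the slide's), each item goes to the successor of its key's identifier:

```python
import hashlib

N = 16  # identifier space {0, ..., N-1}

def h(key: str) -> int:
    # Stand-in for the globally known hash function H (hypothetical choice).
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % N

def successor(ident: int, nodes: list) -> int:
    # First node met going in clockwise direction starting at ident.
    candidates = [n for n in sorted(nodes) if n >= ident]
    return candidates[0] if candidates else min(nodes)

nodes = [0, 2, 5, 6, 11, 14]
# Store each item <key, value> at succ(H(key)).
store_at = {key: successor(h(key), nodes) for key in ["Seif", "Marina", "Peter"]}
```

The `successor` helper reproduces the slide's examples: succ(12) = 14 and succ(6) = 6 on this node set.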
Where to point (Chord)?
Each node points to its successor
The successor of a node n is succ(n+1)
Known as a node's succ pointer
Each node points to its predecessor
First node met in anti-clockwise direction starting at n-1
Known as a node's pred pointer
Example
0's successor is succ(1) = 2
2's successor is succ(3) = 5
5's successor is succ(6) = 6
6's successor is succ(7) = 11
11's successor is succ(12) = 0
(Ring diagram: nodes 0, 2, 5, 6, 11 with succ and pred pointers)
18
DHT Lookup
To lookup a key k:
Calculate H(k)
Follow succ pointers until item k is found
Key → Value: Alexander → Berlin, Marina → Gothenburg, Peter → Louvain-la-Neuve, Seif → Stockholm, Stefan → Stockholm
Example
Lookup Seif at node 2
H(Seif) = 9
Traverse nodes: 2, 5, 6, 11 (BINGO)
Return Stockholm to initiator
19
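A minimal sketch of this linear lookup, with the succ pointers of the example ring and an `in_interval` helper I'm assuming for the ring arithmetic:

```python
N = 16
succ = {0: 2, 2: 5, 5: 6, 6: 11, 11: 0}  # succ pointers on the example ring

def in_interval(x, a, b):
    # x in (a, b] on the ring modulo N
    if a == b:
        return x != a
    return 0 < (x - a) % N <= (b - a) % N

def lookup(start, ident):
    # Follow succ pointers; once ident is in (n, succ(n)],
    # the node succ(n) is responsible for ident.
    n, hops = start, [start]
    while not in_interval(ident, n, succ[n]):
        n = succ[n]
        hops.append(n)
    hops.append(succ[n])
    return hops

# Lookup H(Seif) = 9 starting at node 2: traverse 2, 5, 6 and deliver at 11.
```

Running `lookup(2, 9)` visits the nodes 2, 5, 6, 11, matching the slide's traversal.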
Speeding up lookups
If only the pointer to succ(n+1) is used:
Worst-case lookup time is N, for N nodes
Improving lookup time, point to:
succ(n+1), succ(n+2), succ(n+4), succ(n+8), ...
In general, point to succ(n+2^M)
Distance to the destination is always halved
20
Speeding up lookups
With these pointers, time to find data is logarithmic and the size of routing tables is logarithmic
Example: log2(1,000,000) ≈ 20
EFFICIENT!
21
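A sketch of the resulting finger table, pointers to succ(n + 2^i), reusing the example node set; `successor` is the clockwise first-node helper:

```python
N, M = 16, 4                    # N = 2**M identifiers
nodes = [0, 2, 5, 6, 11]

def successor(ident):
    # First node met going clockwise from ident (mod N).
    candidates = [n for n in sorted(nodes) if n >= ident % N]
    return candidates[0] if candidates else min(nodes)

def fingers(n):
    # Pointers to succ(n + 2**i) for i = 0..M-1; each finger roughly
    # halves the remaining distance to any destination.
    return [successor(n + 2 ** i) for i in range(M)]
```

For example, node 0's fingers on this ring are the nodes 2, 2, 5, 11.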
Dealing with failures
Each node keeps a successor-list:
Pointers to the f closest successors:
succ(n+1)
succ(succ(n+1)+1)
succ(succ(succ(n+1)+1)+1)
...
If the successor fails:
Replace it with the closest alive successor
If the predecessor fails:
Set pred to nil
22
Handling Dynamism
Periodic stabilization used to make pointers eventually correct
Try pointing succ to closest alive successor Try pointing pred to closest alive predecessor
23
Handling joins
When n joins:
Find n's successor with lookup(n)
Set succ to n's successor
Stabilization fixes the rest

Periodically at n:
1. set v := succ.pred
2. if v ≠ nil and v is in (n, succ]
3.     set succ := v
4. send a notify(n) to succ

When receiving notify(p) at n:
1. if pred = nil or p is in (pred, n]
2.     set pred := p
24
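The periodic stabilization and notify rules above can be sketched like this (a toy single-process model, not the distributed implementation):

```python
N = 16

class Node:
    def __init__(self, ident):
        self.id, self.succ, self.pred = ident, None, None

def in_interval(x, a, b):
    # x in (a, b] on the ring modulo N
    if a == b:
        return x != a
    return 0 < (x - a) % N <= (b - a) % N

def stabilize(n):
    # Periodically at n: adopt succ.pred if it lies in (n, succ], then notify.
    v = n.succ.pred
    if v is not None and in_interval(v.id, n.id, n.succ.id):
        n.succ = v
    notify(n.succ, n)

def notify(n, p):
    # When receiving notify(p) at n.
    if n.pred is None or in_interval(p.id, n.pred.id, n.id):
        n.pred = p

a, b, c = Node(0), Node(2), Node(5)
a.succ, b.succ, c.succ = b, c, a
a.pred, b.pred, c.pred = c, a, b
d = Node(3)
d.succ = c            # join: succ set to the result of lookup(3)
stabilize(d)          # c learns about d via notify
stabilize(b)          # b adopts d as its new successor
```

After two stabilization rounds the new node 3 is fully linked between 2 and 5.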
Handling leaves
When n leaves:
Just disappear (like a failure)
When pred is detected as failed:
Set pred to nil
When succ is detected as failed:
Set succ to the closest alive node in the successor-list
The same periodic stabilization and notify rules as for joins repair the ring
25
Presentation Overview
Gentle introduction to DHTs
Contributions
The future
26
Outline
Lookup consistency
27
Problems with periodic stabilization
Joins and leaves can result in inconsistent lookup results:
At node 12, lookup(14) = 14
At node 10, lookup(14) = 15
(Diagram: nodes 10, 12, 14, 15)
28
Problems with periodic stabilization
Leaves can result in routing failures
(Diagram: nodes 10, 13, 16)
29
Problems with periodic stabilization
Too many leaves destroy the system; the ring survives only while
#leaves + #failures per round < |successor-list|
(Diagram: nodes 10, 11, 12, 14, 15)
30
Outline
Atomic Ring Maintenance
31
Atomic Ring Maintenance
Differentiate leaves from failures
Leave is a synchronized departure Failure is a crash-stop
Initially assume no failures Build a ring initially
32
Atomic Ring Maintenance
Separate parts of the problem
Concurrency control
Serialize neighboring joins/leaves
Lookup consistency
33
Naïve Approach
Each node i hosts a lock called L_i
For p to join or leave:
First acquire the predecessor's lock
Second acquire L_p
Third acquire the successor's lock
Thereafter update relevant pointers
Can lead to deadlocks
34
Our Approach to Concurrency Control
Each node i hosts a lock called L_i
For p to join or leave:
First acquire L_p
Thereafter acquire the successor's lock
Thereafter update relevant pointers
Each lock has a lock queue:
Nodes waiting to acquire the lock
35
Safety
Non-interference theorem:
When node p acquires both locks:
Node ps successor cannot leave Node ps predecessor cannot leave Other joins cannot affect relevant pointers
36
Dining Philosophers
Problem similar to the Dining Philosophers problem: five philosophers around a table
One fork between each pair of philosophers (5 forks)
Philosophers eat and think
To eat: grab the left fork, then grab the right fork
37
Deadlocks
Can result in a deadlock:
If all nodes acquire their first lock
Every node waits indefinitely for its second lock
Solution from the Dining Philosophers:
Introduce asymmetry
One node acquires its locks in reverse order
The node with the highest identifier reverses
If n > n.succ (its successor wraps around the ring), then n has the highest identity
38
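A sketch of the asymmetric acquisition order: the reversal makes every node grab its two locks in increasing identifier order, the classic dining-philosophers fix for circular wait:

```python
def acquisition_order(p, succ):
    # p acquires L_p then L_succ, except the highest node (whose successor
    # wraps around the ring), which reverses the order.
    if p > succ:
        return [succ, p]
    return [p, succ]

ring = [0, 2, 5, 6, 11]
orders = [acquisition_order(p, ring[(i + 1) % len(ring)])
          for i, p in enumerate(ring)]
# Every acquisition order is increasing, so no cyclic wait can form.
```

Here node 11 is the highest and acquires L_0 before L_11, while all other nodes acquire their own lock first.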
Pitfalls
Join adds node/philosopher
Solution: some requests in the lock queue forwarded to new node
(Diagram: a node joins between 12 and 14; lock-queue entries 14, 12 are forwarded to it)
39
Pitfalls
Leave removes a node/philosopher
Problem: if the leaving node gives its lock queue to its successor, nodes can get a worse position in the queue: starvation
Use forwarding to avoid starvation
Lock queue empty after local leave request
40
Correctness
Liveness Theorem:
Algorithm is starvation free
Also free from deadlocks and livelocks
Every joining/leaving node will eventually succeed in getting both locks
41
Performance drawbacks
If many neighboring nodes are leaving:
All grab their local lock
Progress is only sequential
(Diagram: nodes 10, 12, 14, 15)
Solution:
Randomized locking
Release locks and retry
Liveness with high probability
42
Lookup consistency: leaves
So far we dealt with concurrent joins/leaves
Now look at concurrent joins/leaves/lookups
Lookup consistency (informally):
At any time, only one node is responsible for any key
Joins/leaves should not affect the functionality of lookups
43
Lookup consistency
Goal is to make joins and leaves appear as if they happened instantaneously Every leave has a leave point
A point in global time, where the whole system behaves as if the node instantaneously left
Implemented with a LeaveForward flag
The leaving node forwards messages to successor if LeaveForward is true
44
Leave Algorithm
Node p — Node q (leaving) — Node r
At the leave point, q sets LeaveForward := true
<LeavePoint, pred=p> to r: r sets pred := p
<UpdateSucc, succ=r> to p: p sets succ := r
<StopForwarding> to q: q sets LeaveForward := false
45
Lookup consistency: joins
Every join has a join point
A point in global time, where the whole system behaves as if the node instantaneously joined
Implemented with a JoinForward flag
The successor of a joining node forwards messages to new node if JoinForward is true
46
Join Algorithm
Node p — Node q (joining) — Node r
<UpdatePred, pred=q> from q to r
At the join point, r sets JoinForward := true, oldpred := pred, pred := q
<JoinPoint, pred=p> to q: q sets pred := p, succ := r
<UpdateSucc, succ=q> to p: p sets succ := q
<StopForwarding>, then <Finish>: r sets JoinForward := false
47
Outline
What about failures?
48
Dealing with Failures
We prove it is impossible to provide lookup consistency on the Internet
Assumptions:
Availability (always eventually answer)
Lookup consistency
Partition tolerance
Failure detectors can behave as if the network partitioned
49
Dealing with Failures
We provide a fault-tolerant atomic ring
Locks are leased: guarantees locks are always eventually released
Periodic stabilization ensures:
An eventually correct ring
Eventual lookup consistency
50
Contributions
Lookup consistency in presence of joins/leaves
System not affected by joins/leaves Inserts do not disappear
No routing failures when nodes leave Number of leaves not bounded
51
Related Work
Li, Misra, Plaxton (04, 06) have a similar solution Advantages
Assertional reasoning Almost machine verifiable proofs
Disadvantages
Starvation possible Not used for lookup consistency Failure-free environment assumed
52
Related Work
Lynch, Malkhi, Ratajczak (02), position paper with pseudo code in appendix Advantages
First to propose atomic lookup consistency
Disadvantages
No proofs Message might be sent to a node that left Does not work for both joins and leaves together Failures not dealt with
53
Outline
Additional Pointers on the Ring
54
Routing
Generalization of Chord to provide arbitrary arity
Provides logk(n) hops per lookup
k is a configurable parameter
n is the number of nodes
Instead of only log2(n)
55
Achieving logk(n) lookup
Each node has logk(N) levels, N = k^L
Each level contains k intervals
Example: k = 4, N = 64 (4^3), node 0:
Level 1 intervals I0-I3: 0-15, 16-31, 32-47, 48-63
Level 2 intervals I0-I3: 0-3, 4-7, 8-11, 12-15
Level 3 intervals I0-I3: 0, 1, 2, 3
(Ring diagram: the identifier ring 0-63 divided into the intervals of each level)
58
Arity important
The maximum number of hops can be configured:
choose k = N^(1/r); then
logk(N) = log_{N^(1/r)}(N) = r
Example, a 2-hop system:
k = sqrt(N), since log_{sqrt(N)}(N) = 2
59
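For instance, a small numeric check of this relation:

```python
import math

def arity_for_hops(N, r):
    # Choosing k = N**(1/r) yields log_k(N) = r lookup hops.
    return N ** (1.0 / r)

N = 2 ** 20
k = arity_for_hops(N, 2)              # a 2-hop system: k = sqrt(N) = 1024
hops = math.log(N) / math.log(k)      # = log_k(N)
```

With N = 2^20 nodes, a 2-hop system needs arity k = 1024.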
Placing pointers
Each node has (k-1) * logk(N) pointers
Node p's pointers point at
f(i) = p + (1 + ((i-1) mod (k-1))) * k^((i-1) div (k-1))  (mod N)
Node 0's pointers:
f(1)=1 f(2)=2 f(3)=3 f(4)=4 f(5)=8 f(6)=12 f(7)=16 f(8)=32 f(9)=48
(Ring diagram: node 0's pointers on the ring 0-63)
60
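A sketch of the pointer-placement formula, checked against node 0's pointers listed above:

```python
def pointer(p, i, k, N):
    # i-th pointer of node p, for i = 1 .. (k-1)*log_k(N):
    # f(i) = p + (1 + ((i-1) mod (k-1))) * k**((i-1) div (k-1))  (mod N)
    return (p + (1 + (i - 1) % (k - 1)) * k ** ((i - 1) // (k - 1))) % N

k, N = 4, 64
node0_pointers = [pointer(0, i, k, N) for i in range(1, 10)]
```

For k = 4 and N = 64 this reproduces the list on the slide: 1, 2, 3, 4, 8, 12, 16, 32, 48.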
Greedy Routing
lookup(i) algorithm
Use the pointer closest to i, without overshooting i
If no such pointer exists, succ is responsible for i
61
Routing with Atomic Ring Maintenance
Invariant of lookup
Last hop is always predecessor of responsible node
Last step in lookup
If JoinForward is true, forward to pred If LeaveForward is true, forward to succ
62
Avoiding Routing Failures
If nodes leave, routing failures can occur
Accounting algorithm:
Simple algorithm — no routing failures for ordinary messages
Fault-free algorithm — no routing failures at all
Many cases and interleavings:
Concurrent joins and leaves, pointers in both directions
63
General Routing
Three lookup styles
Recursive Iterative Transitive
64
Reliable Routing
Reliable lookup for each style
If the initiator doesn't crash, the responsible node is reached
No redundant delivery of messages
General strategy
Repeat operation until success Filter duplicates using unique identifiers
Iterative lookup
Reliability easy to achieve
Recursive lookup
Several algorithms possible
Transitive lookup
Efficient reliability hard to achieve
65
Outline
One-to-many Communication
66
Group Communication on an Overlay
Use existing routing pointers for group communication
A DHT only provides key lookup
Complex queries by searching the overlay:
Limited-horizon broadcast
Iterative deepening
More efficient than Gnutella-like systems:
No unintended graph partitioning
Cheaper topology maintenance [castro04]
67
Group Communication on an Overlay
DHT builds a graph
Why not use general graph algorithms?
Can use the specific structure of DHTs
More efficient Avoids redundant messages
69
Broadcast Algorithms
Correctness conditions:
Termination
Algorithm should eventually terminate
Coverage
All nodes should receive the broadcast message
Non-redundancy
Each node receives the message at most once
Initially assume no failures
70
Naïve Broadcast
Naïve Broadcast Algorithm:
Send the message to succ until the initiator is reached or overshot
(Ring diagram: the message travels node by node around the ring)
71
Naïve Broadcast
Naïve Broadcast Algorithm:
Send the message to succ until the initiator is reached or overshot
Improvement:
The initiator delegates half the space to a neighbor
Idea applied recursively:
log(n) time and n messages
(Ring diagram: recursive delegation of ring intervals)
72
Simple Broadcast in the Overlay
The dissertation assumes a general DHT model:

event n.SimpleBcast(m, limit)      % initially limit = n
  for i := M downto 1 do
    if u(i) is in (n, limit) then
      sendto u(i) : SimpleBcast(m, limit)
      limit := u(i)
73
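A toy sequential sketch of SimpleBcast on the small example ring. The pointers u(i) are Chord-style succ(n + 2^i) fingers here, and recursion stands in for message passing:

```python
N, M = 16, 4
nodes = [0, 2, 5, 6, 11]

def successor(ident):
    candidates = [n for n in sorted(nodes) if n >= ident % N]
    return candidates[0] if candidates else min(nodes)

def in_open_interval(x, a, b):
    # x in (a, b) on the ring modulo N; (a, a) is the whole ring except a.
    if a == b:
        return x != a
    return 0 < (x - a) % N < (b - a) % N

delivered = []

def simple_bcast(n, m, limit):
    delivered.append(n)
    pointers = [successor(n + 2 ** i) for i in range(M)]   # u(1)..u(M)
    for u in reversed(pointers):                           # for i := M downto 1
        if in_open_interval(u, n, limit):
            simple_bcast(u, m, limit)                      # delegate (u, limit)
            limit = u                                      # shrink own interval

simple_bcast(0, "hello", 0)   # initiator covers the whole ring
```

On this ring every node is delivered exactly once: coverage and non-redundancy hold.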
Advanced Broadcast
Old algorithm on k-ary trees
74
Getting responses
Getting a reply
Nodes send directly back to initiator Not scalable
Simple Broadcast with Feedback
Collect responses back to initiator Broadcast induces a tree, feedback in reverse direction
Similar to simple broadcast algorithm
Keeps track of parent (par) Keeps track of children (Ack) Accumulate feedback from children, send to parent
Atomic ring maintenance
Acquire local lock to ensure nodes do not leave
75
Outline
Advanced One-to-many Communication
76
Motivation for Bulk Operation
Building MyriadStore in 2005
Distributed backup using the DKS DHT
Restoring a 4 MB file:
Each block (4 KB) indexed in the DHT
Requires 1000 items in the DHT
Expensive:
One node making 1000 lookups
Marshaling/unmarshaling 1000 requests
77
Bulk Operation
Define a bulk set: I
A set of identifiers
bulk_operation(m, I)
Send message m to every node i ∈ I
Similar correctness to broadcast
Coverage: all nodes with identifier in I Termination Non-redundancy
78
Bulk Owner Operation with Feedback
Define a bulk set: I
A set of identifiers
bulk_own(m, I)
Send m to every node responsible for an identifier i ∈ I
Example
Bulk set I={4} Node 4 might not exist Some node is responsible for identifier 4
79
Bulk Operation with Feedback
Define a bulk set: I
A set of identifiers
bulk_feed(m, I)
Send message m to every node i ∈ I
Accumulate responses back to the initiator
bulk_own_feed(m, I)
Send message m to every node responsible for some i ∈ I
Accumulate responses back to the initiator
80
Bulk Properties (1/2)
No redundant messages Maximum log(n) messages per node
81
Bulk Properties (2/2)
Two extreme cases Case 1
Bulk set is all identifiers Identical to simple broadcast Message complexity is n Time complexity is log(n)
Case 2
Bulk set is a singleton with one identifier
Identical to an ordinary lookup
Message complexity is log(n)
Time complexity is log(n)
82
Pseudo Reliable Broadcast
Pseudo-reliable broadcast to deal with crash failures Coverage property
If initiator is correct, every node gets the message
Similar to broadcast with feedback Use failure detectors on children
If child with responsibility to cover I fails Use bulk to retry covering interval I
Filter redundant messages using unique identifiers Eventually perfect failure detector for termination
Inaccuracy results in redundant messages
83
Applications of bulk operation
Bulk operation
Topology maintenance: update nodes in bulk set Pseudo-reliable broadcast: re-covering intervals
Bulk owner
Multiple inserts into a DHT
Bulk owner with feedback
Multiple lookups in a DHT Range queries
84
Outline
Replication
85
Successor-list replication
Replicate a node's items on its f successors
Used by DKS, Chord, Pastry, Koorde, etcetera
Abandoned in favor of symmetric replication because:
86
Motivation: successor-lists
If a node joins or leaves
f replicas need to be updated
Color represents data item
Replication degree 3 Every color replicated three times
87
Motivation: successor-lists
If a node joins or leaves
f replicas need to be updated
Color represents data item
Node leaves Yellow, green, red, blue need to be re-distributed
88
Multiple hashing
Rehashing
Store each item <k,v> at:
succ( H(k) ), succ( H(H(k)) ), succ( H(H(H(k))) ), ...
Multiple hash functions
Store each item <k,v> at:
succ( H1(k) ), succ( H2(k) ), succ( H3(k) ), ...
Advocated by CAN and Tapestry
89
Motivation: multiple hashing
Example
Item <Seif, Stockholm>
H(Seif) = 7, succ(7) = 9
Node 9 crashes
Node 12 should get the item from a replica
Needs the hash inverse H^-1(7) = Seif (impossible)
Items are dispersed all over the nodes (inefficient)
(Ring diagram: <Seif, Stockholm> stored at node 9; nodes 5, 9, 12)
90
Symmetric Replication
Basic idea:
Replicate identifiers, not nodes
Associate each identifier i with f other identifiers:
r_k(i) = i + k * (N/f) (mod N), for 0 <= k < f
The identifier space is partitioned into m equivalence classes
The cardinality of each class is f, with m = N/f
Each node replicates the equivalence class of every identifier it is responsible for
91
Symmetric replication
Replication degree f = 4, space = {0, ..., 15}
Congruence classes modulo 4:
{0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}
(Ring diagram: each node stores the identifiers it is responsible for, e.g.
Data: 15, 0 — Data: 1, 2, 3 — Data: 4, 5 — Data: 6, 7, 8, 9, 10 — Data: 11, 12, 13, 14)
92
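A sketch of the identifier association, matching the congruence classes above:

```python
N, f = 16, 4   # identifier space size and replication degree

def replicas(i):
    # r_k(i) = i + k*(N/f) (mod N), for 0 <= k < f:
    # the equivalence class of identifier i.
    return sorted((i + k * (N // f)) % N for k in range(f))
```

For example, identifier 1's class is {1, 5, 9, 13} and identifier 12's is {0, 4, 8, 12}.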
Ordinary Chord
Replication degree f = 4, space = {0, ..., 15}
Congruence classes modulo 4:
{0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}
(Ring diagram: with symmetric replication, each node stores its own identifiers plus the
other members of their congruence classes, e.g. the node with Data: 15, 0 also stores
Data: 3, 4 — Data: 7, 8 — Data: 11, 12)
93
Cheap join/leave
Replication degree f = 4, space = {0, ..., 15}
Congruence classes modulo 4:
{0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}
(Ring diagram: on a join or leave, only the affected node and its successor exchange data,
e.g. the successor hands over Data: 11, 12, 7, 8, 3, 4, 0, 15 to the new node)
94
Contributions
Message complexity for join/leave is O(1)
Bit complexity remains unchanged
Handling failures more complex
Bulk operation to fetch data On average log(n) complexity
Can do parallel lookups
Decreasing latencies Increasing robustness Distributed voting Erasure codes
95
Presentation Overview
Summary
96
Summary (1/3)
Atomic ring maintenance
Lookup consistency for j/l No routing failures as nodes j/l No bound on number of leaves Eventual consistency with failures
Additional routing pointers
k-ary lookup Reliable lookup No routing failures with additional pointers
97
Summary (2/3)
Efficient Broadcast
log(n) time and n message complexity Used in overlay multicast
Bulk operations
Efficient parallel lookups Efficient range queries
98
Summary (3/3)
Symmetric Replication
Simple, O(1) message complexity for j/l
O(log f) for failures
Enables parallel lookups
Decreasing latencies Increasing robustness Distributed voting
99
Presentation Overview
Gentle introduction to DHTs
Contributions
The future
100
Future Work (1/2)
Periodic stabilization
Prove it is self-stabilizing
101
Future Work (2/2)
Replication Consistency
Atomic consistency impossible in asynchronous systems Assume partial synchrony Weaker consistency models? Using virtual synchrony
102
Speculative long-term agenda
Overlay today provides
Dynamic membership Identities (max/min avail) Only know subset of nodes Shared memory registers
Revisit distributed computing
Assuming an overlay as basic primitive Leader election Consensus Shared memory consistency (started) Transactions Wave algorithms (started)
Implement middleware providing these
103
Acknowledgments
Seif Haridi Luc Onana Alima Cosmin Arad Per Brand Sameh El-Ansary Roland Yap
104
THANK YOU
105