[go: up one dir, main page]

0% found this document useful (0 votes)
18 views10 pages

Understanding Skip Graphs in P2P Networks

Uploaded by

zabith
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views10 pages

Understanding Skip Graphs in P2P Networks

Uploaded by

zabith
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

384

Skip Graphs
James Aspnes* Gauri Shah t

Abstract guarantee on the search latency and permits accessi-


Skip graphs are a novel distributed data structure, ble data to be missed.
based on skip lists, that provide the full functional- Recent systems like CAN [RFH+01],
ity of a balanced tree in a distributed system where Chord [SMK+01], Pastry [RD01], Tapestry [JKZ01]
elements are stored in separate nodes that may fall and Viceroy [MNR02] use a d i s t r i b u t e d h a s h
at any time. They are designed for use in searching t a b l e (DHT) approach to overcome scalability
peer-to-peer networks, and by providing the ability to problems. To ensure scalability, they hash the key
perform queries based on key ordering, they improve of a resource to determine which node it will be
on existing search tools that provide only hash ta- stored at and balance out the load on the nodes in
ble functionality. Unlike skip lists or other tree data the network. The main operation in these systems
structures, skip graphs are highly resilient, tolerating is to retrieve the identity of the node which stores
a large fraction of failed nodes without losing con- the resource, from any other node in the system.
nectivity. In addition, constructing, inserting new el- To this end, there is an overlay graph in which the
ements into, searching a skip graph and detecting and location of the nodes and resources is determined
repairing errors in the data structure introduced by by the hashed values of their identities and keys
node failures can be done using simple and straight- respectively. Resource location using the overlay
forward algorithms. graph is done in these various systems by using
different routing algorithms. Pastry and Tapestry
1 Introduction uses Plaxton's algorithm [PRR97], which is based
on hypercube routing: the message is forwarded
Peer-to-peer networks are distributed systems with-
deterministically to a neighbor whose identifier is one
out any central authority that are used for efficient
digit closer to the target identifier. CAN partitions
location of shared resources. Such systems have be-
a d-dimensional coordinate space into zones that
come very popular for Internet applications in a short
are owned by nodes which store keys mapped to
period of time. A survey of recent peer-to-peer re-
their zone. Routing is done by greedily forwarding
search yields a slew of desirable features for a peer-
messages to the neighbor closest to the target zone.
to-peer network such as decentralization, scalabil-
Chord maps nodes and resources to identities of
ity, fault-tolerance, self-stabilization, data availabil-
m bits placed around a modulo 2 TM identifier circle
ity, load balancing, dynamic addition and deletion of
and does greedy routing to the farthest possible
peer nodes, efficient and complex query searching, in-
node stored in the routing table. Most of these
corporating geography in searches and exploiting spa-
systems use O(log n) space and time for routing and
tial as well as temporal locality in searches. The ini-
O(log2 n) time for node insertion. Because hashing
tial systems, such as Napster [NAP], Gnutella [GNU]
destroys the ordering on keys, DHT systems do not
and Freenet [FRE], did not support most of these
support queries that seek near matches to a key or
features and were clearly unscalable either due to the
keys within a given range.
use of a central server (Napster) or due to high mes-
Some of these systems try to optimize perfor-
sage complexity from performing searches by flooding
mance by taking locality into account. Pastry [RD01,
the network (Gnutella). The performance of Freenet
CDHR02] and Tapestry [JKZ01, ZJK02] exploit geo-
is difficult to evaluate, but it provides no provable
graphical proximity by choosing the physically closest
node out of all the possible nodes with an appropriate
identifier prefix. In CAN [RFH+01], each node mea-
"IYepartment of Computer Science, Yale University, New sures its round-trip delay to a set of landmark nodes
Haven, CT 06520-8285, USA. Emaih aspnes@c~.[Link].
Supported by NSF grants CCR-9820888 and CCR-0098078. and accordingly places itself in the co-ordinate space
?Department of Computer Science, Yale University, New to facilitate routing with respect to network prox-
Haven, CT 06520-8285, USA. Email: shah¢[Link]. Sup- imity. This last method is not fully self-organizing
ported by NSF grants CCR-9820888 and CCR-0098078.
385

and may cause imbalance in the distribution of nodes of the nodes chosen by an adversary still leaves most
leading to hotspots. Some methods to solve the near- of the nodes in the largest surviving component. Skip
est neighbor problem for overlay networks can be seen graphs can also be constructed without knowledge of
in [HKRZ02] and [KR02]. the total number of nodes in advance. In contrast,
Some of these systems are partly resilient to DHT systems such as Pastry and Chord require a
random node failures, but their performance may be priori knowledge about the size of the system or its
badly impaired by adversarial deletion of nodes. Fiat keyspace.
and Saia [FS02] present a system which is resilient The rest of the paper is organized as follows:
to adversarial deletion of a constant fraction of the we describe skip graphs and algorithms for them
nodes; some extensions of this result can be seen in detail in Section 2. Sections 3 and 4 describe
in [Dat02]. However, they do not give efficient the repair mechanism and fault-tolerance properties
methods to dynamically maintain such a system. for a skip graph. Contention analysis and load
TerraDir [SBK02] is a recent system that pro- balancing results are described in Section 5. Finally,
vides locality and maintains a hierarchical data struc- we conclude in Section 6.
ture using caching and replication. There are as yet
no provable guarantees on load balancing and fault 1.2 M o d e l We briefly describe the model for our
tolerance for this system. algorithms. We assume a m e s s a g e p a s s i n g envi-
ronment in which all processes communicate with
1.1 O u r a p p r o a c h The underlying structure of each other by sending messages over a communication
Chord, CAN, and similar DHTs resembles a balanced channel. The system is p a r t i a l l y s y n c h r o n o u s , i.e.,
tree in which balancing depends on the near-uniform there is a fixed upper bound (time-out) on the trans-
distribution of the output of the hash function. So mission delay of a message. Processes can c r a s h , i.e.,
the costs of constructing, maintaining, and searching halt prematurely, and crashes are permanent. We use
these data structures is closer to the O(logn) costs the term node to represent a process that is running
of tree operations than the O(1) costs of traditional on a particular machine. We assume that each mes-
hash tables. But because keys are hashed, DHTs can sage takes at most unit time to be delivered and any
provide only hash table functionality. Our approach internal processing at a machine takes no time.
is to exploit the underlying tree structure to give
tree functionality, while applying a simple distributed 2 Skip graphs
balancing scheme to preserve balance and distribute A skip list [Pug90] is a randomized balanced tree
load. data structure organized as a tower of increasingly
We describe a new model for a peer-to-peer net- sparse linked lists. Level 0 of a skip list is a linked
work based on a distributed data structure that we list of all nodes in increasing order by key. For each
call a skip g r a p h . This distributed data structure i greater than 0, each node in level i - 1 appears
has several benefits. Resource location and dynamic in level i independently with some fixed probability
node addition and deletion can be done in logarith- p. In a doubly-linked skip list, each node stores
mic time, and each node in a skip graph requires a predecessor pointer and a successor pointer for
only logarithmic space to store information about each list in which it appears, for an average of l i p
its neighbors. More importantly, there is no hashing pointers per node. The lists at the higher level
of the resource keys so related resources are present act as "express lanes" that allow the sequence of
near each other in a skip graph. This may be use- nodes to be traversed quickly. Searching for a node
ful for certain applications such as prefetching of with a particular key involves searching first in the
web pages, enhanced browsing and efficient search- highest level, and repeatedly dropping down a level
ing. Skip graphs also support c o m p l e x q u e r i e s such whenever it becomes clear that the node is not in the
as range queries, i.e. locating resources whose keys current level. Considering the search path in reverse
lie within a certain specified range. There has been shows that no more than 1-~ nodes are searched on
some interest in supporting complex queries in peer- average per level, giving an average search time of
to-peer-systems [HHH+02], and designing a system
that supports range queries has been posed as an O ( logn(l_p~log ~ ) . Skip lists have been extensively
open question. Skip graphs are resilient to node fail- studied [Pug90, PMP90, Dev92, KP94, KMP95] and
ures: a skip graph tolerates removal of a large fraction because they require no global balancing operations
of its nodes chosen at random without becoming dis- are particularly useful in parallel systems [GMM93,
connected, and even the loss of an O ( 1 / l o g n) fraction GMM96, GM97].
386

HEAD TAIL
appears to be necessary for distributed data struc-
tures based on nodes in a one-dimensional space
i "( ";-~)" i l LEVEL 2
linked by random connections whose distribution sat-
isfies certain symmetry properties lADS02]. While
this lower bound requires some independence as-
sumptions that are not satisfied by skip graphs, there
is enough similarity between skip graphs and the class
Figure 1: A skip list. of models considered in lADS02] that an f~(log n) av-
erage degree is not surprising.
We now give a formal definition of a skip graph.
Precisely which lists an element x belongs to is
We would like to use a data structure similar to controlled by a m e m b e r s h i p v e c t o r re(x). We
a skip list to support typical binary tree operations think of rn(x) as an infinite random word over some
on a sequence whose elements are stored at separate fixed alphabet, although in practice, only an O(log n)
locations in a highly distributed system subject to length prefix of m ( x ) needs to be generated on
unpredictable failures. A skip list alone is not enough average. The idea of the membership vector is that
for our purposes, because it lacks redundancy and every doubly-linked list in the skip graph is labeled
is thus vulnerable to both failures and contention. by some finite word w, and an element x is in the list
Since only a few nodes appear in the highest-level labeled by w if and only if w is a prefix of re(x).
list, each such node acts as a single point of failure
whose removal partitions the list, and forms a hot
d('2i)~ i~ 33~!
spot that must process a constant fraction of all
search operations. Skip lists also offer few guarantees 13)..... i
that individual nodes are not separated from their -~~o~ ', ! 65 -il "il
i •
fellows even with occasional random failures. Since MEMBERSHIP .... . )----~ i -~ LEVEL 1
each node is connected on average to only O(1) other VECTOR ! ~-,~'<~ T6 " -" :--~, ' ii "il
". !H 13)----- "~33)-- ~48 2
nodes, even a constant probability of node failures SKIP LIST --!i " .06. . . . . . . . ol
. . - " - :~
will isolate a large fraction of the surviving nodes. \" .... i ~ 13 ~ ( 21 )----( 33 '- -'~ 48 )'---( 75 ?--'i 99 p i I LEVEL 0
Our solution is to define a generalization of a skip ; "06 ]b ~L ~0 Ti qi ij
list that we call a skip g r a p h . As in a skip list,
each node in a skip graph is a member of multiple
linked lists. The level 0 list consists of all nodes in Figure 2: A skip graph with [log N] = 3 levels.
sequence. Where a skip graph is distinguished from
a skip list is that there may be many lists at level To reason about this structure formally, we will
i, and every node participates in one of these lists, need some notation. Let Z be a finite alphabet, let Z*
until the nodes are splintered into singletons after be the set of all finite words consisting of characters
O(logn) levels on average. A skip graph supports in E, and let Z oo consist of all infinite words. We
s e a r c h , i n s e r t , and d e l e t e operations analogous use subscripts to refer to individual characters of a
to the corresponding operations for skip lists; indeed, word, starting with subscript 0; a word w is equal to
we show in Lemma 2.1 that algorithms for skip lists wowlw~ .... Let Iwl be the length of w, with Iwl = oo
can be applied directly to skip graphs, as a skip graph if w E E ~. If Iw I _> i, write w [ i for the prefix of w
is equivalent to a collection of up to n skip lists that of length i. Write e for the empty word.
happen to share some of their lower levels. Returning to skip graphs, the b o t t o m level is
Because there are many lists at each level, the always a doubly-linked list S~ consisting of all the
chances that any individual node participates in elements in order. In general, for each w in ~*,
some search is small, eliminating both single points the doublyqinked list Sw contains all x for which
of failure and hot spots. Furthermore, each node w is a prefix of re(x), in increasing order. We
has O(logn) neighbors on average, and with high say that a particular list Svo is part of level i if
probability no node is isolated. In Section 4 we Iwl = i. This gives an infinite family of doubly-
observe that skip graphs are resilient to node failures linked lists; in an actual implementation, only those
and have an expansion ratio of ~ ( 1 / l o g n ) with n Sw with at least two elements are represented. A
nodes in the graph. skip graph is precisely a family {Sw} of doubly-linked
In addition to providing fanlt-tolerance, having lists generated in this fashion. Note that because the
an f~(logn) degree to support O(logn) search time membership vectors are random variables, each Sw is
387

also a random variable. A l g o r i t h m 1: s e a r c h for node n


We can also think of a skip graph as a random upon receiving (searchOp, startNode, searchKey,
graph, where there is an edge between x and y level):
whenever x and y are adjacent in some Sw. Define if [Link] = searchKey then
x's left and right neighbors at level i as its immediate send (search0p, n) to startNode
predecessor and successor, respectively, in Sin(t)ri, or if [Link] < searchKey then
_l_ if no such nodes exist. We will write xLi for x's w h i l e level > 0 d o
left neighbor at level i and xRi for its right neighbor, if (nRt~vel ).key < searehKey then
and in general will think of Li and Ri as composable s e n d (search0p, startNode, searchKey,
operators, to allow writing expressions like xRiR~_ 1 level) to nRtevet
etc. break
An alternative view of a skip graph is a else leveN--level-1
trie [dlB59, Fre60, Knu73] of skip lists that share else
their lower levels. If we think of a skip list formally as w h i l e level > 0 d o
a sequence of random variables So, $ 1 , 8 2 , . . . , where i f (nLtewt).key > searchKey then
the value of Si is the level i list, then we have: s e n d (search0p, startNode, searchKey,
level) to nLlevei
LEMMA 2.1. Let {S~} be a skip graph with alphabet break
~. For any z ~ ~ , the sequence So, $1, $ 2 , . . . , where else level+-level- 1
each Si = Szri, is a skip list with p = IZ1-1 . if level < 0 then
Proof: By induction on i. The list So equals S~, s e n d (search0p, n) to startNode
which is just the base list of all elements. An element
x appears in Si if re(x) ~ i = z I i; conditioned on this
event occurring, the probability that x also appears
LEMMA 2.2. The search operation in a skip graph
in Si+l is just the probability that m(x)i+i = zi+l.
S with n nodes takes expected O(logn) time and
This event occurs with probability p = [E[ -1, and it is
O(log n) messages.
easy to see that it is independent of the corresponding
event for any other x t in Si. Thus each element in Si
Skip graphs can support range queries in which
appears in Si+l with independent probability p, and
one is asked to find a key _> x, a key < x, the
So, S 1 , . . . form a skip list. I
largest key < x, the least key > x, some key in the
For a peer-to-peer system, each resource will be interval Ix, y], all keys in I x . . . , y], and so forth. For
a node in a skip graph and the nodes are sorted most of these queries, the procedure is an obvious
according to the resource key. Each node stores the modification of Algorithm I and runs in O(log n) time
addresses and the keys of two neighbors at each of with O(log n) messages. For finding all nodes in an
the O(log n) levels. In addition, each node also needs interval, we can use a modified Algorithm 1 to find a
O(log n) bits of space for its membership vector. single element of the interval (which takes O(log n)
time and O(logn) messages), and then broadcast
2.1 A l g o r i t h m s f o r a skip graph We describe the query through the m nodes in the interval by
the s e a r c h and i n s e r t operations for a skip graph flooding (which takes O(log m) time and O(m log n)
but omit the description of d e l e t e , which is fairly messages). If the originator of the query is capable
straightforward, to save space. of processing m simultaneous responses, the entire
operation still takes O(log n) time.
2.1.1 T h e s e a r c h o p e r a t i o n The s e a r c h opera-
tion (Algorithm 1) is exactly the same as in the case 2.1.2 T h e i n s e r t o p e r a t i o n A new node n' in-
of a skip list with only minor adaptations to run in a serts itself in some list at each level till it finds itself
distributed system. The search is started at the top- alone in a list at any level (Algorithms 2 and 3). At
most level of the node seeking a key and it proceeds level 0, n' will link to a node with a key closest to its
along the same level without overshooting the key, own key. At each level i, i > 1, n' will try to find the
continuing at a lower level if required, until it reaches closest node x in level i - 1 with re(x) r i = m(n') r i
level 0. Either the address of the node storing the and link to x at level i. Each existing node can delay
search key, if it exists, or the address of the node determining m(x)i until a new node shows up ask-
storing the key closest to the search key is returned. ing for its value; thus at any given time only a finite
prefix of any membership vector has to be generated.
388

Inserts can be trickier when we have to deal with A l g o r i t h m 3: I n s e r t for existing node n
concurrent node joins. Before n' links to any neigh- upon receiving (link0p, n', side, level):
bors, it verifies that its join will not violate the skip if side = R t h e n crop = <
graph properties. So if any new nodes have joined the else cmp = >
skip graph between n' and its predetermined neigh- if (n sidelevel).key crop n'.key t h e n
bor, n' will advance over the new nodes if required s e n d (link0p, n', side, level) to n sideleve 1
before linking in the correct place. else
adjust links to add n' as side neighbor
A l g o r i t h m 2: i n s e r t for new node n' s e n d (link0p, n', otherS±de, level) to n sidetevel
if introducer = n' t h e n upon receiving (buddy0p, n', level, val) from side L(R):
nlLo = ±
if m(n)level = ± t h e n
n'Ro = ±
else m(n)level = getCoin 0
nnleve t = ±
if [Link] < n'.key t h e n side = R
else side = L n R l e v e l ---- ±
s e n d (searchOp, n', n'.key, 0 7 to introducer if m(n)level = val t h e n
upon receiving (searchOp, neighbor): s e n d (buddy0p, n, level) to n'
s e n d (linkOp, n', side, O) to neighbor else
level+- 1 if nRlevel(Llevel) # ± t h e n
w h i l e true d o s e n d (buddy0p, n', val, level) to nRleve 1
if n'Llevel_ 1 # ± t h e n (nLlevel)
s e n d (buddy0p, n', level, m(n')level ) to else
s e n d (buddyflp, l , level) to n'
n'Llevel_l
upon receiving (buddy0p, newBuddy, level):
if newBuddy ¢ ± t h e n
send (link0p, n', R, level) to newBuddy 3. If xLi # ±, xLiR~ = x.
else if (n'Rlevel_ 1 # ± ) A (newBuddy = L )
then 4. If xRi # L, x R i L i = x.
s e n d (buddy0p, n', level, m(n')level ) to
n'Rlevel_l 5. If i > O, re(x) [ i = m(xR~_l) [ i and ~tk, k <
upon receiving (buddy0p, newBuddy, level): i , m ( x ) [ i = m ( x R i k l ) I i, then xRi = xR~_ 1.
if newBuddy # I t h e n
6. If i > O, re(x) [ i = m(xL~_l) r i and ~ k , k <
send (linkOp, n', L, level) to newBuddy
1,re(x) [ i = m(xLi~_l) [ i, then x i i = x i ~ _ 1.
else b r e a k
else b r e a k
THEOREM 3.1. Every connected component of the
level6-1evel+ 1
data structure is a skip graph if and only if conditions
nJLlevet = ±
1 - 6 are satisfied.
nIRlevel = ±
3.1 M a i n t a i n i n g t h e i n v a r i a n t Define _l_Li =
± R i = ±. We define conditions 1 - 4 as an i n v a r i a n t
LEMMA 2.3. The i n s e r t operation in a skip graph
for a skip graph as they hold in all states with no
S with n nodes takes expected O(logn) time and
undelivered messages, even in the presence of failures.
O(log n) messages.
Conditions 5 - 6 may fail to hold with failures, but
they can be restored by the repair mechanism. We
3 Repair Mechanism
shall call conditions 5 and 6 as the L and R successor
In this section, we describe a self-stabilization mecha- conditions respectively.
nism that repairs the skip graph in the event of node
and link failures. We first characterize the constraints THEOREM 3.2. With no undelivered messages, the
for an ideal skip graph. Let x be any node in the skip invariant is maintained for a skip graph with node
graph; then for any level i: insertions, deletions and node failures.
1. If x R i # ±, x R i > x.
3.2 R e s t o r i n g skip g r a p h c o n s t r a i n t s The
2. If xLi # ±, xLi < x. successor conditions get violated during insert and
389

delete operations as well as when a node or a link fails. • Send ( z i p p e r 0 p F , x R i , i) to a.

• Send (z±pper0pB, x, i) to a.
A l t h o u g h the skip g r a p h constraints m a y get
violated during an insert or a delete operation,
once no messages are pending and provided no C a s e 2: x R i ~ xRik_l, for any k. There are
additional inserts, deletes or failures occur, the three ways to repair this violation depending on w h a t
successor conditions are satisfied. Thus we see t h a t other nodes are present at level i - 1.
the repair mechanism is required to restore the
Case 2a: 3a = xR~_ 1 > xP~ and ~b = xRi+_x < a
successor conditions only in case of node or link such t h a t re(b) r i = re(x) r i.
failures. We consider the possible cases in which the
successor conditions can be violated and provide a x xRi LEVEL i
repair mechanism for the each of those cases. We
7
will concentrate on the repair mechanism for the R
" } zipperOpF
links arid fixing the L links is symmetric. It m a y be : aLi- I _
possible to combine the two mechanisms to improve
z~.pperOpB / ._~__~#___(~__ LEVEL i 1
the performance but we will treat t h e m separately -

for simplicity. L R

There are two cases when the R successor condi- T h e nodes connected to a and x R i at level i - 1
tion is violated: have to be merged together into one ring by sending
the following messages:
1. x R i = xR~_x b u t Sa = x R ik'_ l , k' < k, re(x) I
i = re(a) I i. This case occurs when two nodes s P r o b e level i - 1 to find largest x R i R i k'_ 1 = R < a.
are connected to each other at levels i - 1 and
i, and a new node in inserted between t h e m at • Send ( z i p p e r 0 p F , a, i - 1) to R.
level i - 1 but is pending to be inserted between ktl
• P r o b e level i - 1 to find smallest x R i L i _ 1 = L >
t h e m at level i. If the left neighbor of the new
aLi-1.
node checks its R successor condition at level i
before the insert of the new node at level i is • Send ( z i p p e r 0 p B , aLi-1, i - 1) to L.
completed, it will detect a discrepancy.
Case 2b: ~a = xR~_ 1 < x P ~ , m ( a ) ~ i = m(x) I i
2. xP~ ~ xR~_l, for any k. This case occurs with and xRi+_l ~ xP~.
the failure of any node or link in an ideal skip
graph.
-"" ' ""- LEVEL i
We consider each case in detail and propose a •~ .-4"" ! ")..... xR~
repair mechanism for each violation. ---CZ::- / ! L "'C ~--
: /) :
zippe~OpB d aR/-1 i zipperOpF
C a s e 1: x R i = xR~_l, b u t 3a = x ,~~ri _, kl ,' , t., <
k, m ( x ) r i = re(a) r i. ~ ,,,.fT---- LEVELi-1
M

........ I ':=;" .... LEVEL i


T h e nodes connected to a and xP~ have to be
merged at levels i and i - 1 respectively by sending
zipperOpB - -
_A_) : the following messages:
i !- zipperOpF

• P r o b e level i - 1 to find smallest x R i L i k'


_ 1= M >
---/~-- ( ) kt L_~-- LEVEL i - 1 a.
X a = XRi_ 1 xR i

• Send (z±pper0pB, a, i - 1) to M.
Node a should be inserted into level i and this is
done by sending the following messages1: • Send ( z i p p e r 0 p F , aRi-1, i - 1) to M.

• Send ( z i p p e r 0 p B , x, i) to a.
TD-etails of the zipperOp a l g o r i t h m are given in Algo-
rithms 4 a n d 5. • Send ( z i p p e r 0 p F , xRi, i) to a.
390

Case 2c: 3a < xRi, aRi = 3_. THEOREM 3.3. In the absence of new failures, the
repair mechanism described in Section 3.2 will even-
---~ "~ ~)- - LEVEL i tually restore the violated constraints of a skip graph,
without losing existing connectivity.
6
4 Fault Tolerance
-/'~ -----~---q, L E V E L i -- 1
z:[Link] C~--- (~_')--~2~-----~ ;----~.~--)-- In this section, we describe some of the fault tolerance
R properties of a skip graph. Fault tolerance of related
data structures, such as augmented versions of linked
lists and binary trees, has been well-studied and some
The nodes connected to a and xRi at level i - 1 results can be seen in [MP84, AB96]. The main
have to be merged by sending the following messages: question is how many nodes can be separated from
the primary component by the failure of other nodes,
• Probe level i - 1 to find smallest xRiLik_l = R >
as this determines the size of the surviving skip graph
a.
after the repair mechanism finishes.
• Send (zipper0pB, a, i - 1) to R. We show first that even a worst-case choice
of failures by an adversary that can observe the
A l g o r i t h m 4: z i p p e r 0 p B for node n structure of the skip graph can do only limited
damage. With high probability, a skip graph with
upon receiving <zipper0pB, x, g):
n nodes has an tQ(1/logn) expansion ratio, implying
if nLt > [Link] t h e n
that at most O ( f log n) nodes can be separated from
s e n d (zipper0pB, x, ~) to nL~
else the primary component by ff failures. These results
are described in Section 4.1
trap = nLt
For random failures, the situation appears even
nL~ = x
more promising; our experimental results, presented
XRl ~ n
in Section 4.2, show that for a reasonably large skip
if trap ~ 3_ t h e n
graph nearly all nodes remain in the primary com-
s e n d (zipper0pB, tmp, ~) to x
ponent until about two-thirds of the nodes fail, and
that it is possible to make searches highly resilient to
failure even without using the repair mechanism by
A l g o r i t h m 5: z i p p e r 0 p F for node n use of redundant links.
upon receiving (zipper0pF, x, t):
4.1 A d v e r s a r i a l f a i l u r e s Given a subset A of the
i f nRt <[Link] t h e n
nodes of a skip graph, define 5A as the set of all nodes
s e n d <zipper0pF, x, g) to nRe
that are not in A but that are adjacent to A. Further
else
define ~hA as the set of all nodes that are not in A
tmp = nRt
but are joined to a node in A by an edge at level h.
xL~ = n
Clearly 6A = Uh 5hA and 16AI > maxh 15hAl.
n R g --~ x
The expansion ratio of a set A is ISAI/[AI. The
if trap ~ 3- t h e n
expansion ratio of a graph is the minimum expansion
s e n d (zipper0pF, tmp, g/ to x
ratio of any set A for which 1 < IA[ .< n/2. The
expansion ratio determines the resilience of a skip
graph in the presence of adversarial failures, because
Original Links Unchanged Links separating a set A from the primary component
requires all nodes in 5A to fail. We will show that
skip graphs have f~(1/logn) expansion ratios with
high probability, implying that only O ( f log n) nodes
can be separated by f failures, even if the failures are
New Links carefully targeted.
( z i p p e r f l p messages)
Our strategy for showing a lower bound on the
expansion ratio of a skip graph will be to show that
Figure 3: z i p p e r 0 p operation to merge nodes on the
with high probability, all sets A either have large ~0A
same level.
(i.e., many neighbors at the b o t t o m level of the skip
391

graph) or have large 5hA for some particular h chosen i , , , , ! , , ,

based on the size of A. We begin by counting the I1)


number of sets A of a given size that have small g0A. e-t 0.8 . . . . . . . . . . . . . . . . . .
\

LEMMA 4.1. In an n-node skip graph, the number of 0.6


sets A, where IAi = m < n and I~0AI < s, is less 0

than
s--1 [m+l~ [n--m--l~
Z r ~ I ~ r #~ r--I 1" ¢1
I.-I
q-I
0.4!
............ii.i I
Sketch of proof: Represent each A as a bit-vector m 0.2
¢1
where 1 indicates a member of the set and 0 a N 0
non-member. Then I~0AI is at least the number of 0.1 0,2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.1
Probability of node failure
intervals of zeroes in this bit-vector. The bound in
the lemma is then obtained by bounding the number Figure 4: Size of the largest connected component as
of length n bit-vectors with m ones and at most s a fraction of the surviving nodes with 131072 nodes.
intervals of zeroes. |

LEMMA 4.2. Let A be a subset of m < nl 2 nodes l 0.25 r , i ~ i ,


..el
of an n-node skip graph S. Then for any h, U
Iq

Pr [l<f,,AI _< 2 h] <


0.2

~ 0.15
Sketch of proof: The key observation is that for each
b in {0, 1} h, each skip list Sb that contains a member ~ 0.1
of both A and its complement contributes at least one
distinct element to 5hA. We then show that at least .0 0,05

a third of the Sb are likely to do so by bounding the


probability that either A or S - A are represented in ~ 0
0.1 0.2 0.3 0.4 0.5 0.5 0.7
less than two-thirds of the Sb. | Probability of node failure

THEOREM 4.1. Let c >_ 6. Then a skip graph with n Figure 5: Failed Searches with 131072 nodes and
nodes has an expansion ratio of at least _c l_o gLa /_2 _n with 10000 messages.
probability 1-0(n5-C), where the constant factor does
not depend on c.
most searches succeed as long as the proportion of
Sketch of proof: The probability bound is obtained failed nodes is substantially less that O(logn). By
by summing the probability of having 5hA too small detecting failures locally and using additional redun-
over all A for which 5oA is too small. For each dant edges, we can make searches highly tolerant to
set A of size m, h is chosen so that the 1 . 2 h small numbers of random faults; some experimental
bound of Lemma 4.2 exceeds m times the expansion results are shown in Figure 5. In these experiments,
ratio. The probabilities derived from Lemma 4.2 are each node x has extra links to its five nearest neigh-
then summed over all sets A of a fixed size m using bors on each side, at every level that it is a member
Lemma 4.1, and the result of this process is summed of. In general, we cannot make as strong guaran-
over all m > clog3/2 n to obtain the final bound, i tees as those provided by data structures based on
explicit use of expanders [FS02, Dat02], but we be-
4.2 R a n d o m f a i l u r e s In our experiments, skip lieve that this is compensated for by the simplicity
graphs appear to be highly resilient against random of skip graphs and the existence of good distributed
failures. As shown in Figure 4, nearly all nodes re- mechanisms for constructing and repairing them.
main in the primary component even as the probabil-
ity of individual node failure exceeds 0.6, and we sus- 5 Load balancing
pect that most of the lost nodes at this stage become In addition to fault-tolerance, a skip graph provides
isolated only because all of their immediate neighbors a limited form of load balancing, by smoothing out
die. hot spots caused by popular search targets. The
For searches, the fact that the average search in- guarantees that a skip graph makes in this case are
volves only O(logn) nodes establishes trivially that similar to the guarantees made for survivability. Just
392

as an element stored at a particular node will not I.1i~ T

survive the loss of that node or its neighbors in the


graph, many searches directed at a particular element 0.9
will lead to high load on the node that stores it and on 0.8
nodes likely to be on a search path. However, we can
0.7
show that this effect drops off rapidly with distance;
0.6
0.5
elements that are far away from a popular target in 0.4
the bottom-level list produce little additional load on 0.3
average.
We give two characterizations of this result. The 0.1

first shows that the probability that a particular 0


76400 76450 7~soo 7Gsso 76600 ~s5o
search uses a node between the source and target Nodes
drops off inversely with the distance from the node to
the target. This fact is not necessarily reassuring to Figure 6: Actual and expected load in a skip graph
heavily-loaded nodes. Since the probability averages with 131072 nodes with the target=76539. Messages
over all choices of membership vectors, it may be were delivered from each node to the target and
that some particularly unlucky node finds itself with the actual load on each node was measured. The
a membership vector that puts it on nearly every expected load is computed using Theorem 5.1.
search path to some very popular target. Our second
characterization addresses this issue by showing that
most of the load-spreading effects are the result of
nodes into and searching in a skip graph can be done
assuming a random membership vector for the source
in logarithmic time. Using the repair mechanism,
of the search.
disruptions to the data structure can be repaired
THEOREM 5.1. Let S be a skip graph with alphabet in the absence of additional faults. Skip graphs
{0, 1}, and consider a search ~rom s to t in S. Let u also support range queries which allows, for example,
be node with s < u < t in the key ordering and let d searching for a copy of a resource near a particular
be the distance from u to t, defined as the number of location by using the location as low-order field in the
nodes v with u < v < t. Then the probability that a key and clustering of nodes with similar keys.
search from s to t passes through u is less than 2 This data structure gives rise to a class of ran-
dom graphs whose properties we have only begun
Theorem 5.1 is of small consolation to some node to examine: some open problems remain regarding
that draws a ~ straw and participates in every the reliability of these graphs. Also, skip graphs
search. Fortunately, such things do not happen often. do not exploit geographical proximity in location of
Define the a v e r a g e l o a d Lt~ imposed by a search resources and it would be interesting to study per-
for t on a node u in a given skip graph S as the formance benefits in that direction, perhaps by us-
probability that an s - t search hits u conditioned on ing multi-dimensional skip graphs. Finally, while the
the membership vectors of all nodes in the interval theoretical properties and relative simplicity of skip
[u,t], where s < u < t. This approximates the graphs make them a good candidate for implemen-
situation in a fixed skip graph where a particular tation, the ultimate test of their usefulness will be
target t is used for many searches that may hit u, their performance in practice. This is an issue that
but the sources of these searches are chosen randomly we hope to study soon.
from the other nodes in the graph.
References
THEOREM 5.2. Let S be a skip graph with alphabet
{0,1}. Fix nodes t and u, where u < t and I{v : u <
v < t}l = d. T h e n f o r a n y a > 0, Pr[Lu~ > a] < [AB96] Yonatan Aumann and Michael A. Bender. Fault
2e--cxd/2 . -- _ tolerant data structures. In Thirty-Seventh Annual
Symposium on Foundations off Computer Science,
pages 580-589, Burlington, VT, USA, October 1996.
6 Conclusion lADS02] James Aspnes, Zo~ Diamadi, and Gauri Shah.
Fault-tolerant routing in peer-to-peer systems. In
We have defined a new data structure, the skip Twenty-First ACM Symposium on Principles of
graph, for distributed data stores that has several Distributed Computing, pages 223-232, Monterey,
desirable properties. Constructing, inserting new MA, USA, July 2002.
393

[CDHR02] Miguel Castro, Peter Druschel, Y. Charlie Programming: Sorting and Searching, volume 3.
Hu, and Anthony Rowstron. Exploiting network Addison-Wesley Publishing Company Inc., Reading,
proximity in peer-to-peer overlay networks. In Massachusetts, 1973.
International Workshop on Future Directions in [KP94] P. Kirschenhofer and H. Prodinger. The path
Distributed Computing, Bertinoro, Italy, June 2002. length of random skip lists. Acta Informatica,
[Longer version submitted for publication]. 31(8):775-792, 1994.
[Dat02] Mayur Datar. Butterflies and peer-to-peer net- [KR02] David Karger and Matthias Ruhl. Finding
works. In Proceedings of the lOth European Sympo- nearest neighbors in growth-restricted metrics. In
sium on Algorithms, Rome, Italy, September 2002. Thirty-Fourth ACM Symposium on Theory of Com-
[Dev92] L. Devroye. A limit theory for random skip puting, pages 741-750, Montreal, Canada, May
lists. The Annals of Applied Probability, 2(3):597- 2002.
609, 1992. [MNR02] Dahlia Malkhi, Moni Naor, and David Rata-
[dlB59] Rene de la Briandals. File searching using vari- jczak. Viceroy: A scalabale and dynamic emulation
able length keys. In Western Joint Computer Con- of the butterfly. In Twenty-First ACM Symposium
ference, volume 15, pages 295-298, Montvale, N J, on Principles of Distributed Computing, pages 183-
USA, 1959. AFIPS Press. 192, Monterey, CA, USA, July 2002.
[FRE] FREENET. [Link] IMP84] J. Ian Munro and Patricio V. Poblete. Fault
[Fre60] Edward Fredkin. Trie memory. Communications tolerance and storage reduction in binary search
of the ACM, 3(9):490-499, September 1960. trees. Information and Control, 62(2/3):210-218,
[FS02] Amos Fiat and Jared Saia. Censorship resistant August 1984.
peer-to-peer content addressable networks. In Pro- [NAP] NAPSTER. Formerly, [Link]
ceedings of the Thirteenth Annual ACM-SIAM Sym- [PMP90] T. Papadakis, J.I. Munro, and P.V. Poblete.
posium on Discrete Algorithms, San Francisco, CA, Analysis of the expected search cost in skip lists. In
USA, January 2002. J. R. Gilbert and R. G. Karlsson, editors, SWAT 90,
[GM97] J. Gabarr6 and X. Messeguer. A unified ap- 2nd Scandinavian Workshop on Algorithm Theory,
proach to concurrent and parallel algorithms on bal- volume 447 of Lecture Notes in Computer Science,
anced data structures. In X V I I International Con- pages 160-172, Bergen, Norway, 11-14 July 1990.
ference of the Chilean Computer Society, 1997. Springer.
[GMM93] J. Gabarr6,~C. Mart/nez, and X. Messeguer. [PRR97] C. Plaxton, R. Rajaram, and A. W. Richa. Ac-
Parallel update and search in skip lists. In 13th cessing nearby copies of replicated objects in a dis-
International Conference of the Chilean Computer tributed environment. In Proceedings of the Ninth
Society, 1993. Annual ACM Symposium on Parallel Algorithms
[GMM96] J. Gabarr6, C. Martinez, and X. Messeguer. A and Architectures (SPAA), June 1997.
top-down design of a parallel dictionary using skip [Pugg0] William Pugh. Skip lists: A probabilistic alter-
lists. Theoretical Computer Science, 158(1-2):1-33, native to balanced trees. Communications of the
May 1996. ACM, 33(6):668-676, June 1990.
[GNU] GNUTELLA. [Link] [RD01] Antony Rowstron and Peter Druschel. Pastry:
[HHH+02] Matthew Harren, Joseph M. Hellerstein, Ryan Scalable, distributed object location and routing for
Huebsch, Boon Thau Loo, Scott Shenker, and Ion large-scale peer-to-peer systems. In Proceedings of
Stoica. Complex queries in DHT-based peer-to-peer the 18th IFIP/ACM International Conference on
networks. In 1st International Workshop on Peer- Distributed Systems Platforms (Middlewarc 2001),
to-Peer Systems (IPTPS), Cambridge, MA, USA, Heidelberg, Germany, November 2001.
March 2002. [RFH+01] Sylvia Ratnasamy, Paul Francis, Mark Hand-
[HKRZ02] Kirsten Hildrum, John D. Kubiatowicz, Satish ley, Richard Karp, and Scott Shenker. A scalable
Rao, and Ben Y. Zhao. Distributed object location content-addressable network. In Proceedings of the
in a dynamic network. In Fourteenth A CM Sympo- ACM SIGCOMM, pages 161-170, 2001.
sium on Parallel Algorithms and Architectures, Win- [SBK02] Bujor Silaghi, Bobby Bhattachaxjee, and Pete
nipeg, Manitoba, Canada, August 2002. Keleher. Query routing in the terradir distributed
[JKZ01] Anthony D. Joseph, John Kubiatowicz, and directory. In SPIE ITCOM 2002, August 2002.
Ben Y. Zhao. Tapestry: An infrastructure for fault- [SMK+01] Ion Stoica, Robert Morris, David Karger,
tolerant wide-area location and routing. Technical Frans Ka~shoek, and Hari Balakrishna. Chord:
Report UCB/CSD-01-1141, University of California, A scalable peer-to-peer lookup service for internet
Berkeley, Apr 2001. applications. In Proceedings of SIGCOMM 2001,
[KMP95] P. Kirschenhofer, C. Martlnez, and pages 149-160, 2001.
H. Prodinger. Analysis of an optimized search [ZJK02] Ben Y. Zhao, Anthony D. Joseph, and John D.
algorithm for skip lists. Theoretical Computer Kubiatowicz. Locality-aware mechanisms for large-
Science, 144(1-2):119-220, 26 June 1995. scale networks. In Workshop on Future Directions in
[Knu73] Donald E. Knuth. The Art of Computer Distributed Computing, Bertinoro, Italy, June 2002.

You might also like