IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-11, NO. 1, JANUARY 1985

A Priority Based Distributed Deadlock Detection Algorithm

MUKUL K. SINHA AND N. NATARAJAN

Abstract-Deadlock handling is an important component of transaction management in a database system. In this paper, we contribute to the development of techniques for transaction management by presenting an algorithm for detecting deadlocks in a distributed database system. The algorithm uses priorities for transactions to minimize the number of messages initiated for detecting deadlocks. It does not construct any wait-for graph but detects cycles by an edge-chasing method. It does not detect any phantom deadlock (in the absence of failures), and for the resolution of deadlocks it does not need any extra computation. The algorithm also incorporates a post-resolution computation that leaves information characterizing dependence relations of the remaining transactions of the deadlock cycle in the system, and this will help in detecting and resolving deadlocks which may arise in the future. An interesting aspect of this algorithm is that it is possible to compute the exact number of messages generated for a given deadlock configuration. The complexity is comparable to the best algorithm reported. We first present a basic algorithm and then extend it to take into account shared and exclusive lock modes, simultaneous acquisition of multiple locks, and nested transactions.

Index Terms-Deadlock, deadlock detection, distributed database, nested transaction, priority, timestamp, transaction.

Manuscript received February 25, 1984; revised August 28, 1984. The authors are with the National Centre for Software Development and Computing Techniques, Tata Institute of Fundamental Research, Bombay 400 005, India.

0098-5589/85/0100-0067$01.00 © 1985 IEEE

I. INTRODUCTION

In a database system, accesses to data items by concurrent transactions must be synchronized to preserve the consistency of the database. Locking is the most common mechanism used for access synchronization. When locking is used, a group of transactions (two or more) may sometimes get involved in a deadlock [5]: this is a situation in which each member of the group waits (indefinitely) for a data item locked by some member transaction of the group. Deadlocks can be resolved by aborting at least one of the transactions involved. A simple scheme that can be used to break a deadlock is to use timeouts and abort transactions when they have waited for more than a specified time interval after issuing a lock request. Alternatively, a deadlock can be detected using a specific algorithm for this purpose and resolved by aborting at least one of the transactions involved in the deadlock.

Using timeouts to handle deadlocks is only a brute force technique. Since, in practice, it is very difficult to choose a proper timeout interval, this technique may result in unnecessary transaction aborts. Another major drawback of this scheme is that it cannot avoid cyclic restarts [16]; i.e., a transaction may repeatedly be aborted and restarted. In contrast to the timeout technique, a deadlock detection scheme aborts a transaction only when the transaction is involved in a deadlock. Most deadlock detection schemes [8], [9], [12], [15] detect deadlocks by finding cycles in a transaction wait-for graph, in which each node represents a transaction, and a directed edge from one transaction to another indicates that the former is waiting for a data item locked by the latter transaction. In a distributed database system, the problem is, in essence, one of finding cycles in a distributed graph where no single site knows the entire graph. The deadlock detection scheme presented in this paper does not construct any transaction wait-for graph, but follows the edges of the graph to search for a cycle (called an edge-chasing algorithm by Moss [13]).
It is assumed that each transaction is assigned a priority in such a way that the priorities of all transactions are totally ordered. When a transaction waits for a data item locked by a lower priority transaction, we say that an antagonistic conflict has occurred. When an antagonistic conflict occurs for a data item, the waiting transaction initiates a message to find cycles of transactions in which each transaction is waiting for a data item locked by the next. If the message comes back to the initiating transaction, a deadlock cycle is detected. Our algorithm presumes a point-to-point network with a reliable message communication facility, and it is not applicable for detecting communication deadlocks [4], [14].

The distinguishing features of the proposed deadlock detection scheme are as follows.

1) For a given deadlock cycle, it is possible to compute the exact number of messages that have been generated for the purpose of deadlock detection. If the number of messages generated is used as a complexity measure, the proposed algorithm is not inferior to any of the other algorithms reported in the literature.

2) When a deadlock is detected, the detector has information about the highest and the lowest priority transactions of the cycle, and this can be used for deadlock resolution. Thus, resolution does not need any new computation.

3) In the absence of failures (site failures or explicit abort of a waiting transaction by the user), it does not detect any phantom deadlock.

4) Even after a transaction is aborted to resolve a deadlock, other member transactions of the cycle continue to retain information about the remaining transactions. This, in turn, helps to detect, with fewer messages, deadlocks in which the remaining transactions (or any subset of them) may get involved in the future.

5) The resolution scheme adopted guarantees progress of computation, and avoids the problem of cyclic restart.

6) The basic algorithm can be easily extended to a locking scheme that provides both share locks and exclusive locks, and to a scheme in which a transaction can acquire several locks simultaneously.

7) It can also be extended to detect and resolve deadlocks which may occur in an environment where transactions can be nested within other transactions.

In the literature, several authors have proposed algorithms for deadlock detection in which a wait-for graph is not constructed explicitly [3], [13], [14]. In comparison to the algorithm of Chandy and Misra [3], our algorithm has the following advantages. 1) In our scheme, a deadlock computation is initiated only when an antagonistic conflict occurs. In contrast, in their scheme, a computation is initiated whenever a transaction begins to wait for another. Hence, our algorithm generates fewer messages to detect a deadlock. 2) In our scheme, there is no separate phase for deadlock resolution.

Our scheme has some similarities (e.g., initiation of a deadlock computation only when an antagonistic conflict occurs) with the algorithm proposed by Moss [13]. However, in comparison to his scheme, our algorithm has the following advantages. 1) In Moss' scheme, a transaction does not maintain any information regarding transactions that wait for it, directly or indirectly. Hence, his scheme requires transactions to initiate deadlock detection computations periodically. Thus, his scheme would, in general, require more messages, and it is not possible to compute the exact number of messages generated before a deadlock is detected. 2) In our scheme, a transaction continues to retain the above information even after the resolution of a deadlock, and this in turn speeds up detection and resolution of future deadlocks. 3) Our algorithm is less prone to detect phantom deadlocks that may involve nested transactions than Moss' scheme. In our scheme, a detected deadlock is made phantom only when a waiting transaction aborts, either explicitly or implicitly. In contrast, in Moss' scheme, sometimes a detected deadlock is made phantom even when an active transaction aborts, say due to some application considerations. We discuss this further in Section VI-C. 4) In our scheme all messages have an identical short length, whereas Moss' scheme has messages of varying lengths.

In the following section, we introduce a distributed database model in order to set the context, and in Section III we describe the basic distributed deadlock detection algorithm. We analyze the cost of the algorithm in Section IV. The basic algorithm is applicable when only exclusive locks are used. However, it has been reported in the literature [9] that 80 percent of accesses are only for reading data. Taking this into account, we show in Section V how the basic algorithm can be modified to include share locks as well as simultaneous acquisition of multiple locks. In Section VI, we describe a nested transaction model and extend the algorithm to detect and resolve deadlocks taking nested transactions into account. We conclude the paper with suggestions for further improving the algorithm.

II. THE DISTRIBUTED DATABASE MODEL

A database is a structured collection of information. In a distributed database system, the information is spread across a collection of nodes (or sites) interconnected through a communication network. Each node has a system-wide unique identifier, called the site-identification-number (site_id, for short), and nodes communicate through messages. All messages sent arrive at their destinations in finite time, and the network filters duplicate messages and guarantees that messages are error-free. The site-to-site communication is pipelined, i.e., the receiving site gets messages in the same order that the sending site has transmitted them.

Within a node, there are several processes and data items (or objects).
A process is an autonomous active entity that is scheduled for execution. Every process has a system-wide unique name, called process-id, and processes communicate with each other through messages. To access one or more data items, which may be distributed over several nodes, a user creates a transaction process at the local node. The transaction process coordinates actions on all data items participating in the transaction and preserves the consistency of the database. Henceforth, we use the term transaction to denote the corresponding transaction process.

Data items are passive entities that represent some independently accessible piece of information. Each data item is maintained by a data manager which has the exclusive right to operate on the data item. If a transaction wants to operate on a data item, it must send a request to the data manager that manages the data item. A data manager can maintain several data items simultaneously. However, to simplify the exposition, we shall assume that a data manager maintains only one data item.

In addition to data manipulation operations, a data manager provides two primitives to control access to the data item that it maintains: Lock(data_item) and Un_Lock(data_item). A transaction must lock a data item before accessing it, and it must unlock the data item when it no longer needs to access it. A data item can be in one of two lock modes: null or free (N, i.e., absence of a lock), and exclusive (X, i.e., presence of a lock). A data manager honors the lock request of a transaction if the data item is free; otherwise it keeps the lock request pending in a queue, called the request_Q. A transaction which has locked the data item is called the holder of the data item, whereas a transaction which is waiting in the request_Q is called a requester of the data item. When a holder unlocks the data item, the data manager chooses a lock request from the request_Q, and grants the lock to that requester.
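The lock discipline of a single data manager, as described above, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's code: the class and method names are ours, messages are replaced by direct calls, and requests are scheduled in arrival order.

```python
from collections import deque

class DataManager:
    """Sketch of a data manager guarding one data item (exclusive locks only)."""

    def __init__(self):
        self.holder = None        # transaction currently holding the lock
        self.request_q = deque()  # pending lock requests, in arrival order

    def lock(self, transaction):
        """Grant the lock if the item is free; otherwise queue the requester."""
        if self.holder is None:
            self.holder = transaction
            return True           # lock granted immediately
        self.request_q.append(transaction)
        return False              # requester now waits in request_Q

    def un_lock(self, transaction):
        """Release the lock and schedule a waiting request, if any."""
        assert transaction == self.holder
        self.holder = None
        if self.request_q:
            self.holder = self.request_q.popleft()
            return self.holder    # the chosen requester becomes the new holder
        return None
```

With this sketch, `lock` returning False corresponds to the requester entering the wait state, and `un_lock` returning a transaction corresponds to the data manager informing that requester of its change to the active state.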
The scheduling scheme followed by the data manager does not guarantee avoidance of deadlocks [5]; e.g., it may follow an arrival order scheduling scheme.

Transactions can be in one of two states: active or wait. If a transaction waits in the request_Q of a data manager, it is in the wait state; otherwise it is in the active state. An active transaction process may or may not be running on a processor. The state of a transaction changes from active to wait when its lock request for a data item is queued by the data manager in its request_Q. The state of the transaction changes from wait to active when the data manager schedules its pending lock request. In either case, the data manager informs the transaction of its change of state. We assume that a transaction acquires locks one after another (i.e., at any time it has only one outstanding lock request), and it follows the two-phase lock protocol [7].

Each transaction is assigned a priority in such a way that the priorities of all transactions are totally ordered. To assign priorities to transactions, we use the timestamp mechanism. When a transaction is initiated, it is assigned a unique timestamp. Timestamps induce priorities in the following manner: a transaction is of higher priority than another if the timestamp of the former is less than that of the latter. Unlike the timestamp synchronization scheme [2], which uses timestamps to schedule lock requests of transactions (and in turn prevents deadlocks), here timestamps are used only to assign priorities to transactions. For generating timestamps, we assume that every node has a logical clock (or counter) that is monotonically increasing, and the various clocks are loosely synchronized [11]. A timestamp generated by a node i is a pair (C, i), where C is the current value of the local clock and i is the site_id of the node i. Greater than (>) and less than (<) relations for timestamps are defined as follows. Let t1 = (C1, i1) and t2 = (C2, i2) be two timestamps.
Then t1 > t2 iff C1 > C2 or (C1 = C2 and i1 > i2); t1 < t2 iff C1 < C2 or (C1 = C2 and i1 < i2).

Each transaction is denoted by an ordered pair of the form (p, t), where p is the process-id of the corresponding transaction process, and t is the timestamp of the transaction. The process-id is used for communication purposes. If two transactions T1 and T2 are denoted by the pairs (p1, t1) and (p2, t2), respectively, we say that T1 > T2, i.e., the priority of T1 is higher than that of T2, if t1 < t2. Further, we say that there is an antagonistic conflict at a data item if the item is locked, and there is a requester of higher priority than the holder. In such a case, we also say that the requester faces the antagonistic conflict.

III. DISTRIBUTED DEADLOCK DETECTION AND RESOLUTION

In this algorithm, a deadlock is detected by circulating a message, called a probe, through the deadlock cycle. The occurrence of an antagonistic conflict at a data site triggers the initiation of a probe. A probe is an ordered pair (initiator, junior), where initiator denotes the requester which faced the antagonistic conflict, triggering the deadlock detection computation, and initiated this probe. The element junior denotes the transaction whose priority is the least among the transactions through which the probe has traversed. A data manager sends a probe only to the holder of its data item, while a transaction process sends a probe only to the data manager from which it is waiting to receive the lock grant. Transaction processes (or data managers) never communicate among themselves for purposes of deadlock detection.

A. The Basic Deadlock Detection Algorithm

The basic deadlock detection algorithm has three steps.

1) A data manager initiates a probe in the following two situations.

a) When the data item is locked by a transaction, if a lock request arrives from another transaction, and requester > holder, the data manager initiates a probe and sends it to the holder.
b) When a holder releases the data item, the data manager schedules a waiting lock request. If there are more lock requests still in the request_Q, then for each lock request for which requester > new holder, the data manager initiates a probe and sends it to the new holder.

When a data manager initiates a probe, it sets

    initiator := requester; junior := holder;

We shall presently assume that a data manager sends a probe as soon as the above situations occur. However, as we shall elaborate in Section VII, in order to improve performance, a data manager can wait for a while before sending a probe.

2) Each transaction maintains a queue, called probe_Q, where it stores all probes received by it. The probe_Q of a transaction contains information about the transactions which wait for it, directly or transitively. Since we have assumed that a transaction follows the two-phase lock protocol, the information contained in the probe_Q of a transaction remains valid until it aborts or commits. After a transaction enters the second phase of the two-phase lock protocol, it can never get involved in a deadlock. Hence, when it enters the second phase, it discards the probe_Q. During the second phase, any probe or clean message (discussed later in this section) received is ignored. A transaction sends a probe to the data manager where it is waiting in the following two cases.

a) When a transaction T receives probe(initiator, junior), it performs the following.

    if junior > T then junior := T;
    save the probe in the probe_Q;
    if T is in wait state then
        transmit a copy of the saved probe to the data manager where it is waiting;

b) When a transaction issues a lock request to a data manager and waits for the lock to be granted (i.e., it goes from active to wait state), it transmits a copy of each probe stored in its probe_Q to that data manager.
3) When a data manager receives probe(initiator, junior) from one of its requesters, it performs the following.

    if holder > initiator then discard the probe
    else if holder < initiator then propagate the probe to the holder
    else declare deadlock and initiate deadlock resolution;

When a deadlock is detected, the detecting data manager has the identities of two members of the deadlock cycle, initiator and junior, i.e., the highest and the lowest priority transactions, respectively. In order to guarantee progress, we choose to abort junior, i.e., the lowest priority transaction (hereafter called the victim). When the victim restarts, its priority does not change, i.e., it uses the same timestamp that was assigned to it when it was initiated.

B. The Deadlock Resolution and Post-Resolution Computation

This consists of the following three steps.

1) To abort the victim, the data manager that detects the deadlock sends an abort signal to the victim. The identity of the initiator is also sent along with the abort signal: abort(victim, initiator). Since the victim is aborted, it is necessary to discard those probes (from the probe_Qs of various transactions) that have the victim as their junior or initiator. Hence, on receiving an abort signal, the victim does the following.

a) It initiates a message, clean(victim, initiator), sends it to the data manager where it is waiting, and enters the abort phase. Since the initiator is the highest priority transaction of the deadlock cycle, its probe_Q will never contain any probe generated by other members of the cycle. Consequently, the probe_Qs of transactions from initiator to victim, in the direction of probe traversal, will not contain a probe having the victim either as junior or as initiator. Hence, the clean message carries the identity of the initiator, beyond which it need not traverse.

b) In the abort phase, the victim releases all locks it held, withdraws its pending lock request, and aborts.
During this phase, it discards any probe or clean message that it receives.

2) When a data manager receives a clean(victim, initiator) message, it propagates the message to its holder.

3) When a transaction T receives a clean(victim, initiator) message, it acts as follows.

    purge from the probe_Q every probe that has victim as its junior or initiator;
    if T is in wait state then
        if T = initiator then discard the clean message
        else propagate the clean message to the data manager where it is waiting
    else discard the clean message;

A transaction discards a clean message in the following two situations: 1) the transaction is in active state, or 2) the transaction is the same as the initiator of the clean message received. After "cleaning" up its probe_Q as described above, each member transaction of the deadlock cycle continues to retain the remaining probes in its probe_Q. In the future, if the remaining members (or any subset of them) get involved in a deadlock cycle, it will be detected with fewer messages, since probes have already traversed some edges of the cycle.

IV. THE COST OF DEADLOCK DETECTION

To compare our algorithm to other deadlock detection and resolution algorithms, we consider three factors which determine the cost of any deadlock detection algorithm: 1) Communication Cost: the number of messages that must be exchanged to detect a deadlock; 2) Delay: the time needed to detect a deadlock once the deadlock cycle is formed (presuming that every message exchange, whether it is an intersite or an intrasite communication, takes equal time); and 3) Storage Cost: the amount of storage needed by transactions and data managers specifically for purposes of deadlock detection and resolution.

In our scheme, the communication and the delay costs of detecting a deadlock depend on the configuration of the deadlock cycle. The configuration indicates which transaction waits for which other transaction.
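The three steps of the basic detection algorithm (Section III-A) can be sketched as a small message-driven simulation. This is an illustrative reconstruction, not the authors' code: transactions are identified here by integer timestamps (smaller timestamp = higher priority), each transaction has at most one outstanding request, and all wait edges are assumed to exist before detection starts, so step 2b (forwarding stored probes on entering the wait state) is not exercised.

```python
from collections import defaultdict, deque

def detect_deadlocks(holder, waits_on):
    """Return the (initiator, junior) pair of each detected deadlock.

    holder:   data item -> transaction holding its lock
    waits_on: transaction -> data item it is waiting for (at most one)
    """
    probe_q = defaultdict(list)   # per-transaction probe_Q
    msgs = deque()                # simulated message channel
    # Step 1: a data manager initiates a probe for each antagonistic
    # conflict, i.e., where the requester has a smaller timestamp (higher
    # priority) than the holder; the probe is sent to the holder.
    for requester, item in waits_on.items():
        if requester < holder[item]:
            msgs.append(("to_txn", holder[item], (requester, holder[item])))
    detected = []
    while msgs:
        kind, dest, (initiator, junior) = msgs.popleft()
        if kind == "to_txn":
            # Step 2a: transaction side -- record the lowest priority
            # (largest timestamp) seen, save the probe, and forward it to
            # the data manager it is waiting on, if any.
            junior = max(junior, dest)
            probe_q[dest].append((initiator, junior))
            if dest in waits_on:
                msgs.append(("to_dm", waits_on[dest], (initiator, junior)))
        else:
            # Step 3: data manager side -- discard, propagate, or declare.
            h = holder[dest]
            if h < initiator:                 # holder > initiator: discard
                continue
            elif h > initiator:               # holder < initiator: propagate
                msgs.append(("to_txn", h, (initiator, junior)))
            else:                             # holder = initiator: deadlock
                detected.append((initiator, junior))
    return detected
```

For the cycle T1 -> T2 -> T3 -> T1 (T1 waiting for an item held by T2, and so on), only the conflict faced by T1 is antagonistic; the simulation reports one deadlock with initiator T1 and junior T3, i.e., the pair the detecting data manager uses for resolution.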
We describe a configuration using a transaction wait-for graph (TWFG) [10] with the following convention. In a TWFG, nodes and edges are associated with transactions and data items, respectively. The direction of an edge from one transaction to another indicates that the former is waiting for the latter. For example, Fig. 1 indicates a conflict where the data item Obji is locked by a transaction Ti and the transaction Tj is waiting to acquire the lock. We shall call the data manager of Obji as Di. If Tj > Ti, the conflict is antagonistic, and the data manager Di will initiate a deadlock detection computation by initiating probe(Tj, Ti) and sending it to the transaction Ti. A data item can have many requesters but only one holder, and hence, in a TWFG, a node can have several incoming edges but at most one outgoing edge.

Fig. 1. An edge of a TWFG.

A. The Communication Cost

We analyze the communication cost of our algorithm by considering three kinds of configurations of a deadlock cycle. The order of priority among transactions is assumed as follows: Ti > Tj if i < j.

The Best Configuration: For our algorithm, the best deadlock configuration, i.e., the configuration for which the deadlock is detected with the minimum number of messages, is the one in which only one edge of the cycle causes an antagonistic conflict. For example, consider the configuration illustrated in Fig. 2. Except at the site of ObjN, where T1 waits for TN and T1 > TN, there is no antagonistic conflict at any other site. The data manager DN initiates probe(T1, TN) and sends it to the transaction TN. On receiving the probe, TN stores it in its probe_Q and propagates it to DN-1. In two steps, a probe travels from one data manager to the next data manager of the TWFG. On receiving probe(T1, TN), the data manager DN-1 compares its holder TN-1 to the initiator T1 of the probe. Since T1 > TN-1, it propagates probe(T1, TN) to its holder, i.e., TN-1. The transaction TN-1, in turn, stores probe(T1, TN) in its probe_Q, and propagates it to the data manager DN-2, and so on. When the data manager D1 finally receives probe(T1, TN) from the requester T2, it finds that its holder is the same as the initiator of the probe, and hence, it detects the deadlock. In this case, the total number of messages generated is 2 * (N - 1).

Fig. 2. Deadlock cycle: best configuration.

An Intermediate Configuration: Consider the deadlock configuration of Fig. 3. In comparison to the previous configuration, the positions of T2 and T3 are swapped at the Obj2 site. Thus, apart from the data item ObjN, the cycle has one more antagonistic conflict, at data item Obj2. Similar to DN, the data manager D2 also initiates probe(T2, T3) and sends it to the transaction T3. T3 stores it in its probe_Q, and since it has an outstanding lock request for data item Obj1, it propagates the probe to D1. When the data manager D1 receives probe(T2, T3), it discards it since initiator < holder (i.e., T2 < T1). Hence, the probe initiated at the Obj2 site dies after two steps. As in the previous configuration, the deadlock will be detected only when the probe initiated by the data manager DN traverses the entire cycle and eventually reaches D1 after 2 * (N - 1) steps. Hence, in this case, the total number of messages generated is 2 * (N - 1) + 2.

Fig. 3. Deadlock cycle: intermediate configuration.

The Worst Configuration: By induction, we can infer that the worst deadlock configuration, i.e., the one which will generate the maximum number of messages before the deadlock is detected, is the one in which each edge of the cycle except one causes an antagonistic conflict.

Fig. 4. Deadlock cycle: worst configuration.
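The counting rule implicit in these configurations can be sketched in a few lines. This sketch follows our reading of the best and intermediate examples above: the always-present conflict at ObjN contributes the 2 * (N - 1) messages of the detecting probe, and an extra antagonistic conflict at ObjI (2 <= I <= N - 1) contributes a probe that dies at D1 after 2 * (I - 1) messages; the function name and the indexing convention are ours.

```python
def message_count(n, conflict_sites):
    """Messages generated for a deadlock cycle of length n.

    conflict_sites lists the indices I (2 <= I <= n - 1) of data items
    Obj_I at which an extra antagonistic conflict exists; the conflict
    at Obj_N is always present and counted separately.
    """
    total = 2 * (n - 1)          # the probe that traverses the whole cycle
    for i in conflict_sites:
        total += 2 * (i - 1)     # probe initiated at D_i dies at D_1
    return total
```

Here `message_count(n, [])` gives the best case 2 * (n - 1), and `message_count(n, range(2, n))` gives the worst case n * (n - 1).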
For example, consider Fig. 4, in which there are (N - 1) antagonistic conflicts. All data managers, except D1, initiate a probe. All probes traverse up to the data manager D1 and terminate, except the probe initiated by DN which leads to the detection of a deadlock. Hence, the total number of messages generated will be 2 *(N- 1)+2 *(N- 2)+2 *(N- 3)+* =N*(N- 1). +2 In general, for a deadlock cycle of length N there are (N - 1)! possible deadlock configurations. For a specific deadlock configuration, the total number of messages generated will be 2 *(N - 1) + CN_2 * 2 * (N - 2) + CN_3 * 2 * (N - 3) +*'*+C2 *2. where CI is 1, if an antagonistic conflict exists' at data item Obj1, and 0 otherwise. For the above expression, the maximum and minimum values are N * (N - 1) and 2 * (N - 1), 72 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-ll, NO. 1, JANUARY 1985 respectively. For N= 2, the maximum and the minimum are rently in the system is N, then the length of a probe_Q can grow at most up to (N - 1). identical, namely 2. B. The Delay The delay is defined to be the time taken to detect the deadlock after the deadlock cycle is formed. Note that irrespective of the configuration of a deadlock cycle of length N (best, worst, or any intermediate), the maximum amount of delay is the time taken to exchange 2 * (N - 1) messages. The delay is maximum if the highest priority transaction of the cycle is the last transaction to enter the wait state, closing the deadlock cycle. If a transaction other than the highest priority transaction is the last to enter the wait state, the delay is less. This is because the probe initiated by the highest priority transaction would have traversed part of the cycle before the cycle is formed. Suppose, in the configuration shown in Fig. 2 (prior to the formation of a deadlock cycle), all edges except the edge TJ+ 1-TJ (where I < J <N - 1) are formed, i.e., TJ+ I is still active. 
When TJ+1 requests a lock on the data item ObjJ held by TJ, it enters the wait state, closing the deadlock cycle.

Case 1: If probe(T1, TN), initiated due to the antagonistic conflict T1_TN, reached the transaction TJ+1 before it entered the wait state, the delay to detect the deadlock will be equal to the time taken to exchange (2 * J - 1) messages.

Case 2: If probe(T1, TN) is yet to reach the transaction TN, i.e., transactions T1 and TJ+1 entered the wait state in quick succession (closing the deadlock cycle), and the time gap was too small compared to the time taken to exchange one message, then the delay to detect the deadlock will be equal to the time taken to exchange 2 * (N - 1) messages.

Hence, if a deadlock cycle is closed by transaction TJ+1, the time taken to detect the deadlock will be anywhere between (2 * J - 1) and 2 * (N - 1) message exchanges, for J = 1, ..., (N - 1).

For the configuration given in Fig. 2, the delay will be minimum (i.e., the time taken to exchange one message) if 1) the cycle is closed by transaction T2 waiting for T1, the highest priority transaction of the cycle, and 2) the probe initiated due to the antagonistic conflict T1_TN reached T2 before the latter entered the wait phase. From this result we can generalize that, for any configuration, the minimum time taken to detect a deadlock is the time taken to exchange one message, and this can happen only when 1) the cycle is closed by a transaction waiting for the highest priority transaction of the configuration, and 2) the probe initiated by the highest priority transaction reached the cycle-closing transaction before the latter entered the wait state.

C. The Storage Cost

In this algorithm, each transaction requires storage space to maintain its probe_Q, and a probe_Q exists until the transaction enters the second phase of the two-phase lock protocol. The size of a probe_Q depends upon the number of higher priority transactions which wait for it directly or transitively. A probe_Q shrinks only when the transaction receives a clean message, but not otherwise. If the maximum number of transactions that can concur-

D. Costwise Comparison to Other Algorithms

In comparison to the algorithm of Chandy and Misra [3], our algorithm has less communication cost since it initiates a deadlock computation only upon the occurrence of antagonistic conflicts, but not otherwise. Furthermore, the resolution of a deadlock does not involve any extra cost. Unlike Moss' algorithm [13], we have separated the cost of reliable network communication from that of deadlock detection. Incorporating this distinction in our algorithm enables us to compute exact communication and delay costs of deadlock detection for a given configuration.

In the distributed database model considered by Obermarck [15], transactions migrate from one data site to another, and there is a deadlock detector at each site which builds a transaction wait-for graph (TWFG) for that site (by extracting information from lock tables and other resource allocation tables and queues). In computing the communication cost to detect a deadlock cycle (which is N * (N - 1)/2 message exchanges, in the worst case, among deadlock detectors), he does not include the expense, in messages, of transaction migration and of the construction of a TWFG by the deadlock detectors. In contrast, in our model, the transmission of information from a transaction to a data manager, and from a data manager to a transaction, costs one message each. If the above two expenses are also counted in messages, the communication cost of his algorithm becomes equal to that of ours.

V. EXTENSIONS TO THE DEADLOCK DETECTION ALGORITHM

In this section, we extend the algorithm to accommodate two refinements: 1) the availability of a share lock (S_lock) mode as well, and 2) allowing a transaction to acquire locks on more than one data item simultaneously, either in share mode or in exclusive mode.

A. Share and Exclusive Locks

The Distributed Database Model with Share and Exclusive Locks: We extend the basic model, discussed in Section II, by distinguishing a share lock (S_lock) request from an exclusive lock (X_lock) request. Correspondingly, a locked data item can be either in S_mode or in X_mode. The desired lock mode is specified as a parameter of the lock request primitive: Lock(data-item, mode).

SINHA AND NATARAJAN: DISTRIBUTED DEADLOCK DETECTION ALGORITHM

Fig. 5. (a) A TWFG where a probe gets discarded. (b) A deadlock caused by incremental share lock remains undetected by the basic algorithm.

In order to distinguish between the two kinds of lock requests, a data manager splits its request_Q into Srequest_Q and Xrequest_Q, for storing pending S_lock and X_lock requests, respectively. If a data item is free, a transaction can lock it in any mode. When a transaction has locked a data item in X_mode, and become the X_holder, no other transaction can lock the data item in any mode. A transaction can lock a data item in S_mode, and become an S_holder, even if the item is already locked in S_mode. Thus, a data item in S_mode can have several S_holders, whereas it can have only one X_holder. When the X_holder releases the lock, if the data manager decides to honor S_lock requests, we assume that all S_lock requests queued in Srequest_Q are scheduled simultaneously. We note that under this scheduling policy it is possible that an X_requester may starve. Hence, this policy is unfair. We shall discuss this issue later in Section VII. Since the S_holders of a data item can be many, an X_requester may now wait for more than one transaction simultaneously, i.e., in a TWFG, a node can have several incoming as well as outgoing edges.
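The lock-compatibility rules of this extended model can be summarized in a small sketch. The class and method names below (DataManager, lock, release_x) are illustrative, not from the paper; the sketch assumes the unfair policy just described, in which all queued S_lock requests are scheduled simultaneously when the X_holder releases the item.

```python
from collections import deque

class DataManager:
    """Illustrative sketch of one data item's lock state under the
    S_lock/X_lock rules of Section V-A."""

    def __init__(self):
        self.x_holder = None        # at most one X_holder
        self.s_holders = set()      # possibly many S_holders
        self.s_request_q = deque()  # pending S_lock requests (Srequest_Q)
        self.x_request_q = deque()  # pending X_lock requests (Xrequest_Q)

    def lock(self, txn, mode):
        """Grant the lock if the rules allow it, else queue the request."""
        if mode == "X":
            # X_mode requires the item to be completely free.
            if self.x_holder is None and not self.s_holders:
                self.x_holder = txn
                return True
            self.x_request_q.append(txn)
            return False
        # S_mode is compatible with other S_holders but not with an X_holder.
        if self.x_holder is None:
            self.s_holders.add(txn)
            return True
        self.s_request_q.append(txn)
        return False

    def release_x(self):
        """When the X_holder releases the lock, all queued S_lock requests
        are scheduled simultaneously (the unfair policy of Section VII)."""
        self.x_holder = None
        while self.s_request_q:
            self.s_holders.add(self.s_request_q.popleft())
```

Note that an S_lock is granted whenever there is no X_holder, even while higher priority X_lock requests are pending; this is precisely the source of the starvation discussed later.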
Deadlock Detection and Resolution: With the availability of S_locks, it is now possible that the S_holders of a data item increase incrementally. Consequently, antagonistic conflicts for data items may also occur incrementally. To take this into account, a data manager has to initiate a probe in one more situation, apart from those discussed in the basic algorithm [refer to Section III-A, step 1)]. When a data manager grants the lock to an additional S_holder Ts, it performs the following.

    if Xrequest_Q is not empty then
        for each X_requester, Tx, do
            if Tx > Ts then initiate probe(Tx, Ts) and send it to Ts;

However, this modification alone is not enough, since it does not take into account transactions that (now) wait transitively for the additional S_holder. We shall elaborate this through an example. Consider the scenario shown in Fig. 5(a), where Ti > Tj for all i < j. The data item Obj1 is share locked by T1, and the data items Obj4 and Obj2 are exclusively locked by T4 and T2, respectively. Transactions T4 and T2 wait for exclusive locks to be granted on data items Obj1 and Obj4, respectively. Unlike the data manager D1, D4 finds the antagonistic conflict T2_T4, initiates probe(T2, T4), and sends it to its holder T4. T4 saves the probe in its probe_Q and propagates it to D1, where it is an X_requester. On receiving probe(T2, T4), D1 discards it since its holder T1 is of higher priority than T2, the initiator of the probe. Some time later, another transaction T3 requests an S_lock on the data item Obj1. Since Obj1 is in S_mode, D1 grants the S_lock request of T3 immediately. T3 is an additional S_holder of Obj1, and now T4 waits for T3 as well. Since T3 > T4, D1 does not initiate any probe. Later, T3 requests an X_lock on data item Obj2 (held by T2), and waits. As illustrated in Fig. 5(b), this request forms the deadlock cycle T3_T2_T4_T3, which has only one antagonistic conflict, i.e., T2_T4. But probe(T2, T4), initiated due to this conflict, was discarded by D1 before T3 acquired the S_lock on Obj1. Hence, this deadlock will remain undetected.

To handle such cases, a data manager, when it grants an S_lock to an additional S_holder Ts, needs to propagate to Ts copies of the probes (possibly only some of them) received prior to granting the S_lock to Ts. However, in the basic scheme, a data manager does not preserve the probes it receives. There are two possible solutions to this problem.

1) When a data manager schedules an additional S_holder Ts, it asks all X_requesters queued in its Xrequest_Q to retransmit their probe_Q elements, so that the relevant probes can be propagated to Ts.

2) Alternatively, a data manager keeps all probes received in its own probe_Q, and later, when it schedules an additional S_holder Ts, it checks, for each probe in its probe_Q, whether the initiator of the probe is of greater priority than Ts, and if so, propagates that probe to Ts.

The former scheme adds complexity, since a data manager must keep track of its requests for probe retransmission and distinguish an original probe from a retransmitted duplicate. Further, the communication cost for a given configuration cannot be specified exactly. The latter scheme necessitates storage space within each data manager, but the algorithm remains simple, and the communication cost of a deadlock configuration can be specified exactly. Hence, we use the latter scheme and modify the basic algorithm as follows.

1) When a data manager receives probe(initiator, junior) from one of its requesters, it performs the following.

    if the data item is in S_mode then save the probe in the probe_Q;
    for each holder do
        if holder = initiator
        then declare deadlock and initiate deadlock resolution
        else if holder < initiator
             then propagate a copy of the probe to the holder;

2) When a data manager grants the lock to an additional S_holder Ts, it performs the following.

    if Xrequest_Q is not empty then
        for each X_requester, Tx, do
            if Tx > Ts then initiate probe(Tx, Ts) and send it to Ts;
    if the probe_Q is not empty then
        for each probe, P, in its probe_Q do
            if Ts < P.initiator then propagate a copy of P to Ts;

3) When a data manager exits from S_mode, it discards its probe_Q.

Post-Resolution Computation: The provision of S_mode requires only a minor modification in the deadlock resolution and post-resolution computation. Step 2) of Section III-B is modified as follows.

When a data manager receives a clean message:

    if the data item is in X_mode
    then propagate the clean message to the X_holder
    else for each S_holder, Ts, do
             propagate a copy of the clean message to Ts;
    if probe_Q is not empty then
        purge every probe that has victim as junior or initiator;

Storage Cost: This extended algorithm requires extra storage within each data manager for maintaining its own probe_Q. The probe_Q within a data manager exists as long as the data item is in S_mode. As soon as the data item becomes free or enters X_mode, the probe_Q is discarded.

Delay and Communication Cost: In the original database model, if a transaction enters the wait state, it can close at most one deadlock cycle (in a TWFG, a node can have at most one outgoing edge). But in a TWFG for the extended model, a node can have several incoming and outgoing edges, and the formation of a single edge may simultaneously close many cycles. Given an acyclic TWFG of n nodes, the maximum number of cycles (say M) which can be closed simultaneously by the formation of a single edge is expressed by the following equation:

    M = C(n-1, 1) + C(n-1, 2) + ... + C(n-1, n-1)

where C(n-1, 1) is the number of cycles of length 2, C(n-1, 2) is the number of cycles of length 3, and so on.
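Modified steps 1) and 2) above, for the probe-retaining scheme, can be sketched concretely. In the fragment below, transactions are plain integers with a larger number meaning a higher priority, and the names (Probe, SModeItem, sent) are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    """Mirrors the paper's probe(initiator, junior) pair."""
    initiator: int
    junior: int

class SModeItem:
    """Illustrative sketch of a data manager's state for one item that
    is currently share locked (the probe-retaining scheme of Section V-A)."""
    def __init__(self):
        self.holders = set()   # current S_holders
        self.probe_q = []      # probes retained while the item is in S_mode
        self.sent = []         # (probe, destination) pairs, for inspection

    def receive_probe(self, probe):
        """Step 1): save the probe, then check holders against its initiator."""
        self.probe_q.append(probe)              # keep a copy (item in S_mode)
        for holder in self.holders:
            if holder == probe.initiator:
                return "deadlock"               # the cycle closes here
            if holder < probe.initiator:        # holder has lower priority
                self.sent.append((probe, holder))
        return None

    def grant_additional_s_holder(self, ts, x_request_q):
        """Step 2): probe new antagonistic conflicts and replay retained probes."""
        self.holders.add(ts)
        for tx in x_request_q:                  # incremental conflicts
            if tx > ts:
                self.sent.append((Probe(tx, ts), ts))
        for probe in self.probe_q:              # forward relevant saved probes
            if probe.initiator > ts:
                self.sent.append((probe, ts))
```

In the Fig. 5 scenario, the retained copy of probe(T2, T4) is exactly what the second loop of grant_additional_s_holder replays to the new S_holder, so the incremental deadlock no longer goes undetected.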
Depending upon the type of configuration of each cycle, we can calculate the delay and the communication cost based on the formula given in Section IV-A. For example, consider the TWFG given in Fig. 6(a). All data items are locked by S_lock requests, all edges are due to waiting X_lock requests, and Ti > Tj for all i < j. Obj1 is share locked by T1; Obj2 is share locked by T1 and T2; Obj3 by T1, T2, and T3; and Obj4 by T2, T3, and T4. The X_lock requests of T2, T3, and T4 wait for data items Obj1, Obj2, and Obj3, respectively. Until now, there is no deadlock cycle in the TWFG. When T1 issues an X_lock request for the data item Obj4 and waits, as illustrated in Fig. 6(b), it simultaneously closes seven cycles. (The number can be derived from the above equation.) Three cycles of length 2 (viz. T2_T1_T2, T3_T1_T3, and T4_T1_T4), three cycles of length 3 (viz. T3_T2_T1_T3, T4_T2_T1_T4, and T4_T3_T1_T4), and one cycle of length 4 (viz. T4_T3_T2_T1_T4) are formed. Though there are seven cycles in the TWFG, there exist only three antagonistic conflicts: T1_T2, T1_T3, and T1_T4. Hence, only three probes will originate. Since T1 is the highest priority transaction of every cycle, all probes will have T1 as their initiator, and all deadlock cycles will be independently detected by the various data managers for which T1 is an S_holder. Since the algorithm chooses the lowest priority transaction as the victim, all transactions except T1 will be junior in at least one of the three probes, and hence, in the worst case, all transactions except T1 may get aborted. On the contrary, if the initiator is chosen to be the victim, then all cycles can be broken simultaneously by aborting only T1.

Fig. 6. (a) A TWFG with multiple outgoing edges. (b) An X_lock request by T1 simultaneously closes seven cycles.

However, this latter scheme may result in cyclic restart for the transaction T1.
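The seven-cycle count of the Fig. 6 example can be checked directly against the expression for M. The helper name below is illustrative; C(n-1, k) is written with the standard binomial-coefficient function.

```python
from math import comb

def cycles_closed(n):
    """M = C(n-1, 1) + C(n-1, 2) + ... + C(n-1, n-1): the number of
    cycles closed simultaneously when a single new edge completes an
    acyclic TWFG of n nodes (C(n-1, k) counts the cycles of length k + 1)."""
    return sum(comb(n - 1, k) for k in range(1, n))

# Fig. 6 has n = 4 nodes: three cycles of length 2, three of length 3,
# and one of length 4, i.e., seven cycles in all.
assert cycles_closed(4) == 3 + 3 + 1 == 7

# The sum telescopes to 2**(n-1) - 1, so M grows exponentially in n.
assert all(cycles_closed(n) == 2 ** (n - 1) - 1 for n in range(2, 12))
```

The closed form 2**(n-1) - 1 makes explicit how quickly the number of simultaneously closed cycles can grow with the size of the TWFG.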
In the case of multiple cycles, the early abortion of one transaction may resolve many cycles simultaneously. For example, if T2 gets aborted on detection of the T2_T1_T2 cycle, the cycles T3_T2_T1_T3, T4_T2_T1_T4, and T4_T3_T2_T1_T4 will also get resolved simultaneously. This may result in the discarding of many probes and clean messages. Hence, in this case, we can compute only the limits (best and worst) of the delay and communication cost for a specific configuration. The exact cost will depend upon many other factors, such as the scheduling policy of data managers, the characteristics of the communication substrate, etc.

B. Simultaneous Acquisition of Multiple Locks

Let us now consider the refinement which allows a transaction to issue more than one lock request simultaneously. If its requests are not granted immediately, a transaction simultaneously waits for a number of transactions (in a TWFG, a node will have several outgoing edges). The modification needed for step 2) of the basic algorithm of Section III-A is as follows.

When a transaction issues more than one lock request simultaneously, if all lock requests are not granted immediately (i.e., it waits for multiple locks), it sends a copy of each probe stored in its probe_Q to all data managers for which it is a requester.

Now, in the TWFG, a transaction can be the tail of multiple edges. The nature of this wait-for graph is the same as that caused by multiple S_holders, and hence, its characteristics will also be the same. From the above argument, we can deduce that, in a model that provides share as well as exclusive lock requests, and also allows a transaction to issue more than one lock request simultaneously, the characteristics of the graph, as well as the complexity of deadlock detection, will be similar to those described in the previous subsection.
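The modification to step 2) amounts to a broadcast of the stored probes. A minimal sketch follows; the class and attribute names are illustrative, not from the paper.

```python
class WaitingTransaction:
    """Illustrative sketch of the Section V-B rule: a transaction whose
    simultaneous lock requests are not all granted forwards every probe
    in its probe_Q to each data manager it is waiting on."""
    def __init__(self, probe_q):
        self.probe_q = list(probe_q)   # probes saved so far
        self.sent = []                 # (probe, data_manager) pairs

    def begin_waiting(self, pending_data_managers):
        # The node becomes the tail of several outgoing TWFG edges,
        # so every saved probe is copied to every pending data manager.
        for dm in pending_data_managers:
            for probe in self.probe_q:
                self.sent.append((probe, dm))

t = WaitingTransaction(probe_q=[("T9", "T5"), ("T8", "T5")])
t.begin_waiting(["D1", "D2"])
assert len(t.sent) == 4   # two stored probes go to both data managers
```

The quadratic fan-out (probes times pending requests) is the communication price of letting a node acquire several outgoing edges at once.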
VI. HANDLING NESTED TRANSACTIONS

We shall now discuss the applicability of our algorithm to detect deadlocks that may occur in an environment where a transaction can be nested within another transaction. The concept of a nested transaction permits a transaction to decompose its task into several subtasks and initiate a new transaction (called a nested transaction or subtransaction) to perform each of the subtasks. A nested transaction, in turn, may initiate its own set of nested transactions, thus giving rise to a hierarchy (or tree) of transactions. Since the nesting of transactions follows a tree structure, we use the terms root, leaf, parent, child, ancestor, and descendant with the usual connotations. Using nested transactions, it is possible to achieve higher concurrency and a higher degree of resilience against failures [6].

A. A Model for Nested Transactions

During its execution, a transaction can simultaneously create a set of nested transactions, which will be its children. After creating its children, a parent transaction cannot resume execution until all its children commit or abort. However, a (parent) transaction may abort at any time, either explicitly because a child aborted, or implicitly because an ancestor aborted. A transaction, whether nested or not, always has the properties of failure atomicity and concurrency transparency. However, a nested transaction has an additional property: even if a nested transaction commits, this commitment is only conditional, and the commitment of its effects, i.e., the installation of the new states of the objects modified by it, depends on whether its parent transaction commits or not. This commit dependency follows from the property of atomicity. We allow arbitrary nesting of transactions, and hence the commit dependency is transitive. Consider the transaction tree shown in Fig. 7.
If A, B, and F are three transactions such that A created B, which then created F, the effects of F must be committed only when both B and A commit. It should be noted that the commit dependency relation is asymmetric: only children are dependent on their parents, and not vice versa. Thus, a transaction may commit even if some (or all) of its children are aborted. Once all its children commit or abort, a parent transaction can resume execution, and it can create a new set of children. A transaction is in the wait state if either 1) it is waiting for locks to be granted on some data items, or 2) it is waiting for its children to commit or abort. Note that a transaction never runs concurrently with its children.

The commit dependency described above necessitates new locking rules. This is required because it is not the case that, when a transaction commits, its effects become visible to every transaction. The visibility of the effects of a transaction is governed by the following rule [14].

Fig. 7. A transaction tree.

The Visibility Rule: When a transaction A commits, the effects of the transaction tree rooted at A are visible to a transaction X that is external to the transaction tree only if either 1) the root transaction has no parent, or 2) the parent of the root is either X or an ancestor of X.

As an example, consider the transaction tree illustrated in Fig. 7. The effects of D will be invisible to F when D commits. Only when C commits do the cumulative effects of C, D, and E become visible to F. When C aborts, the transaction tree rooted at C has no effect, even if D and E have committed earlier.

In order to implement the above visibility rule through a locking scheme, we introduce the notion of a retainer of a data item, through the following set of rules [13], [14].
1) When an S_holder (X_holder) of a data item commits, it releases the lock it held, and the parent of the holder, if any, becomes an S_retainer (X_retainer) of the data item, unless it is already an S_retainer (X_retainer) of that item.

2) When an S_holder or the X_holder of a data item aborts, it releases the lock it held, and no new retainer is introduced.

3) When an S_retainer (X_retainer) of a data item commits, the parent of that retainer, if any, becomes an S_retainer (X_retainer) of that item, if it is not already one.

4) When an S_retainer (X_retainer) commits or aborts, it ceases to be an S_retainer (X_retainer) of that data item.

As an example, consider the transaction tree of Fig. 7. When E commits, C becomes an S_retainer (X_retainer) for all data items for which E was an S_holder (X_holder). When C commits, A becomes an S_retainer (X_retainer) for all data items for which C was an S_holder (X_holder) or S_retainer (X_retainer). Note that there can be several S_retainers and X_retainers for a data item simultaneously. Even though there can be only one X_holder for a data item at any time, multiple X_retainers arise because a transaction tree grows and shrinks dynamically as nested transactions are created, committed, or aborted. Because of this, it is also possible that a transaction is simultaneously a retainer as well as a holder of a data item.

With the introduction of retainers, we can now restate the rules for granting locks as follows.

1) If a transaction T requests an S_lock on a data item, it can be granted if there is no X_holder for the item, and either a) there is no X_retainer for the item, or b) each X_retainer is either T or an ancestor of T. The presence of an S_holder or an S_retainer does not forbid the grant of an S_lock.

2) If a transaction T requests an X_lock on a data item, it can be granted if there is no S_holder or X_holder for the item, and either a) there is no S_retainer or X_retainer for the item, or b) each S_retainer (X_retainer) is either T or an ancestor of T.

For example, suppose in the transaction tree of Fig. 7, F requests an S_lock on a data item for which E is the X_holder. The S_lock can be granted to F only when either there is no X_retainer or X_holder for the item, or A becomes the only X_retainer for the item, i.e., when C, D, and E commit or abort.

When an S_holder releases the lock, and if it introduces an S_retainer for the data item, this may result in the simultaneous scheduling of a descendant X_requester (if any). Similarly, when the X_holder releases the lock, and if it introduces an X_retainer for the data item, this may result in the simultaneous scheduling of a descendant X_requester (if any), or of one or more descendant S_requesters (if any).

B. Nested Transactions and Deadlock Detection and Resolution

We shall now discuss the scheme for detecting deadlocks that can arise in the nested transaction model described above. The basic detection algorithm needs to be modified in order to take into account the fact that a transaction now also waits for its descendants to commit or abort. As in the basic algorithm, we shall use priorities of transactions to determine when to initiate a deadlock computation, as well as for deadlock resolution. Timestamps induce priorities among transactions as described earlier. However, the scheme for assigning timestamps needs to be modified to take nested transactions into account. When a nonnested transaction (i.e., the root of a tree) is created, a (C, i) pair is generated as described in Section II, and this pair is assigned as the timestamp of the transaction. When a nested transaction is created, a (C, i) pair is generated, and a timestamp is formed for the transaction by concatenating this (C, i) pair with the timestamp of the parent transaction.
Thus, the timestamp of a nested transaction is a sequence of (C, i) pairs, the length of the sequence being determined by the depth of nesting. Based on the ordering of (C, i) pairs described in Section II, the timestamps of transactions are totally ordered in the following way. Given two timestamps X and Y of the form X1 X2 ... Xm and Y1 Y2 ... Yn, respectively, where each Xi or Yi is a (C, i) pair, X is greater than Y if either

1) m > n, and Xi = Yi for all i, 1 <= i <= n, or
2) for some i, 1 <= i <= min(m, n), X1 = Y1, X2 = Y2, ..., X(i-1) = Y(i-1), and Xi > Yi.

Note that in this order, the priority of a transaction is higher than that of its descendants.

Deadlock Detection: We now extend the deadlock detection algorithm described in Section V-A to take nested transactions into account. The probe_Q of a data manager is split into S_probe_Q and X_probe_Q: the former stores the probes received from S_requesters, and the latter stores the probes received from X_requesters. A transaction has only one probe_Q.

1) If a data manager cannot grant a lock requested by a transaction, it acts as follows.

    if the lock request of a transaction, T, cannot be honored then
    begin
        for each X_retainer and the X_holder (if any), Tx, do
            if Tx < T then initiate probe(T, Tx) and send it to Tx;
        if an X_lock is requested then
            for each S_retainer and each S_holder, Ts, do
                if Ts < T then initiate probe(T, Ts) and send it to Ts
    end;

Note that in no case will a transaction send a probe to its ancestor, since an ancestor always has higher priority.

2) When a transaction begins to wait for a data item, or for its children to commit/abort, it transmits each probe in its probe_Q to the data manager, or to its children, respectively.

3) When a transaction T receives a probe P, it performs the following.

    if P.junior > T then P.junior := T;
    save P in the probe_Q;
    if T is waiting for its children to commit/abort
    then transmit a copy of the saved probe to each child
    else if T is waiting for a data item
         then transmit a copy of the saved probe to the data manager;

4) When a data manager receives a probe P from a transaction T, it acts as follows.

    if T is waiting for an S_lock
    then save the probe in S_probe_Q
    else save the probe in X_probe_Q;
    if P.initiator is either a retainer or the holder, or
       P.initiator is a descendant of a retainer or of the holder
    then declare deadlock and initiate deadlock resolution
    else begin
        for each X_retainer and the X_holder (if any), Tx, do
            if P.initiator > Tx then propagate the probe P to Tx;
        if T is waiting for an X_lock then
            for each S_retainer and each S_holder (if any), Ts, do
                if P.initiator > Ts then propagate the probe P to Ts
    end;

5) When a new retainer or holder is introduced for a data item, the data manager acts as follows. (Note that when a new retainer is introduced, the data manager may have simultaneously scheduled a descendant X_requester, or one or more descendant S_requesters, i.e., the introduction of a new retainer may result in the simultaneous introduction of new holders as well.)

    if an S_holder or an S_retainer, Ts, is introduced then
    begin
        for each requester, T, in Xrequest_Q do
            if T > Ts then initiate probe(T, Ts) and send it to Ts;
        for each probe, P, in X_probe_Q do
            if P.initiator > Ts then send a copy of P to Ts
    end
    else  % an X_holder or an X_retainer, Tx, is introduced
    begin
        for each requester, T, in Srequest_Q or Xrequest_Q do
            if T > Tx then initiate probe(T, Tx) and send it to Tx;
        for each probe, P, in S_probe_Q or X_probe_Q do
            if P.initiator > Tx then send a copy of P to Tx
    end;

In this extended algorithm, it is possible that a transaction may receive more than one probe with the same value of initiator.
This may arise because the transaction, as well as some of its ancestors, may simultaneously be retainers or holders of a data item. In such cases, the transaction needs to process only the probe that it receives first, and it may discard the others. In Section VII, we discuss this issue again.

Deadlock Resolution and Post-Resolution Computation: As in the basic algorithm, we abort only the lowest priority transaction to resolve the deadlock. However, the scheme for handling clean messages requires some modifications, as given below.

1) When a transaction T receives a clean message, it acts as follows.

    if T is in wait state then
        if T = initiator
        then discard the clean message
        else if T is waiting for its children
             then propagate a copy of the clean message to every child
             else propagate the clean message to the data manager where it is waiting;

2) When a data manager receives a clean message, it updates its S_probe_Q and X_probe_Q, and propagates the message to all holders and retainers.

Fig. 8. A deadlock cycle with nested transactions.

An Illustrative Example: Let us illustrate the working of this extended algorithm for detecting deadlocks through an example. Consider the scenario shown in Fig. 8. A transaction T1 requests an X_lock on the data item Obj1. The lock cannot be granted since another transaction T2 is the X_holder of Obj1. T2 has created a child T21 and is waiting for T21 to commit. T21 is waiting for an S_lock on another data item Obj2, which has T1 as an X_retainer. (T1 had earlier created a child T11 which held the item Obj2 in X_mode, and it has committed.) In the above situation, a deadlock T1_T2_T21_T1 occurs when T1 begins to wait for Obj1. Let us illustrate how this deadlock is detected. We consider two possible cases.

Case 1: T1 > T2. By definition, it follows that T1 > T21. When the data manager of Obj1, D1, receives the lock request from T1, it originates probe(T1, T2) and sends it to T2.
When T2 receives this probe, it saves the probe in its probe_Q and propagates it to its child T21. When T21 receives probe(T1, T2), it modifies it to probe(T1, T21), saves it in its probe_Q, and propagates it to D2, the data manager of Obj2. When D2 receives probe(T1, T21), it detects a deadlock, since the initiator of the probe, T1, is an X_retainer of the item. The deadlock is resolved by aborting T21.

Case 2: T2 > T1. By definition, it follows that T21 > T1. Before T1 issues its X_lock request for the data item Obj1, its probe_Q contains probe(T21, T1). This is due to the fact that when D2 cannot grant the S_lock to T21, it initiates probe(T21, T1) and sends it to T1. Upon receiving this probe, T1 saves it in its probe_Q. When T1 waits for an X_lock on Obj1, it propagates probe(T21, T1), contained in its probe_Q, to D1. Upon receiving probe(T21, T1), D1 detects a deadlock, since the initiator of the probe, T21, is a descendant of T2, which is the X_holder of Obj1. The deadlock is resolved by aborting T1.

C. Comparison to Related Work

Moss [13] has also proposed an edge-chasing algorithm for detecting deadlocks that takes nested transactions into account. As described earlier, a major difference between his algorithm and ours is that in Moss' scheme, probes are not stored within transactions and data managers, and his scheme relies on the periodic retransmission of probes to ensure the eventual detection of deadlocks. Apart from this, in Moss' scheme, a data manager sends a probe not to the holders of the item, but always to the "potential" retainers. Because of this, his algorithm is prone to detect phantom or false deadlocks. For example, consider the scenario shown in Fig. 9.

Fig. 9. Moss' scheme: phantom deadlock.

There are two transactions T1 and T2, where T1 > T2. T2 has created two children, T21 and T22. T1 waits for an X_lock on an item
Obj1, which has T21 as the X_holder. T22 is waiting for an X_lock on another item Obj2, which has T1 as the X_holder. T21 is active. Given this situation, a deadlock occurs only when T21 commits. If T21 aborts of its own accord, say due to some application consideration, no deadlock results. However, in Moss' scheme, when T1's request arrives, D1 sends a probe to T2 even though T21 is active. T2 propagates this probe to T21 and T22. T21 ignores this probe since it is active. But T22 propagates it to D2, which detects a deadlock. Meanwhile, if T21 aborts, the deadlock detected is a false deadlock. In our scheme, no such false deadlock will be detected, since D1 sends a probe to T2 only when it becomes an X_retainer (i.e., when T21 commits). In general, however, our scheme may also detect phantom deadlocks, but such deadlocks become false only if a waiting transaction aborts, explicitly (on user request) or implicitly (due to a site crash), after the cycle-detecting probe has traversed through it, but not otherwise.

VII. DISCUSSION

A. Delaying the Initiation of a Probe

Currently, in our algorithm, a data manager initiates a probe as soon as it finds an antagonistic conflict at its site. But an antagonistic conflict is a potential deadlock situation only if the holder transaction is in the wait state, but not otherwise. Hence, the initiation and propagation of the probe can be delayed until the holder enters the wait state. We suggest that a data manager, upon the occurrence of an antagonistic conflict, should wait for a specific time period and only then initiate the probe and send it to the holder. Similarly, the propagation of probes received by a data manager can be delayed.

B. Dynamic Assignment of Priorities

Another orthogonal technique that can be incorporated to improve performance is to assign a priority to a transaction only on a demand basis, and not a priori.
As long as a transaction does not get into conflict with a transaction in the wait state, it need not be assigned a priority. Whenever a conflict arises with a waiting transaction, transactions must be assigned priorities, if possible, in such a way that the conflict is nonantagonistic. Otherwise, an antagonistic conflict has occurred and a probe is initiated. Note that a transaction to which a priority has not been assigned never causes an antagonistic conflict. Thus, by employing a scheme for the dynamic assignment of priorities [1], the occurrence of antagonistic conflicts, and consequently the initiation of probes, can be reduced still further.

C. Other Mechanisms for Assigning Priorities

In our algorithm, we have used timestamps for assigning priorities. However, our scheme is applicable even if some other mechanism is used for assigning priorities. The only requirement is that the mechanism must induce a total order on the transactions. For example, the number of resources held by a transaction can be used to assign it a priority. To guarantee uniqueness, we may append the timestamp of the transaction to the number of resources held. Notice that in this scheme, the priority of an active transaction changes dynamically as it acquires resources, but if a transaction is in the wait state, its priority does not change. Because of this, the nature of a conflict (antagonistic or otherwise) does not change dynamically, and hence our algorithm is applicable to this dynamic priority scheme as well.

D. Avoidance of Phantom Deadlocks

In our algorithm, if a waiting transaction which is a component of a deadlock cycle aborts (either due to a site crash, the abort of its parent or a child, or on user request) after the detecting probe has traversed through it, we may find a phantom deadlock. Since a situation of this kind is unpredictable, our algorithm comes about as close as possible to avoiding the detection of phantom deadlocks.
The possibility of phantom deadlock can be reduced even further if the victim transaction does not abort itself until the clean message it initiated returns to it after circulating through the entire deadlock cycle. This requires a clean message to traverse beyond the initiator (note that in the algorithm described in Section III, the clean message does not go beyond the initiator).

E. Discarding Duplicate Probes

In our basic algorithm, there is a possibility that some probes may circulate through a deadlock cycle more than once. Suppose, for example, that a transaction which is not part of a deadlock cycle, but waits (perhaps transitively) for a member transaction of a cycle, inserts a probe into the deadlock cycle. If the outside transaction is of lower priority than the highest priority transaction of the cycle, the inserted probe ceases to propagate at some point in the cycle. On the other hand, if the outside transaction is of higher priority than the highest priority transaction of the cycle, the inserted probe propagates through the entire cycle and keeps circulating until the deadlock cycle is broken. (Note that a probe never propagates through the entire cycle if its initiator is a member of the cycle.)

For example, consider the configuration (an extension of the configuration given in Fig. 2) shown in Fig. 10. Here the transaction T2 has acquired X_locks on data items Obj2 and Objx before it entered the wait state. A transaction Tx, which is not a member of the deadlock cycle (called an external transaction), requests a lock for Objx and waits. For simplicity, we assume that Tx enters the wait state after the deadlock cycle T1-TN-T1 is formed. If Tx > T2 (but not otherwise), the data manager Dx will initiate probe(Tx, T2) and send it to the holder T2.

SINHA AND NATARAJAN: DISTRIBUTED DEADLOCK DETECTION ALGORITHM
Fig. 10. Propagation of an external probe in a deadlock cycle.

Now, a probe initiated by an external transaction (called an external probe) enters the deadlock cycle. T2 will save the probe in its probe_Q, and since it is waiting for Obj1, will propagate probe(Tx, T2) to D1. If T1 > Tx, i.e., the external transaction's priority is lower than that of the highest priority transaction of the cycle, D1 will discard the probe. On the other hand, if Tx > T1, D1 will propagate the probe to T1. Once this probe has crossed over the highest priority transaction of the deadlock cycle, it will cover the entire cycle and will be saved in the probe_Qs of all member transactions (and data managers). This is correct, since the external transaction Tx waits directly or transitively on all member transactions of the deadlock cycle. But since Tx > T1, the probe will keep circulating through the cycle indefinitely (until the cycle is broken), and a member transaction may receive a probe whose initiator is the initiator of some probe already stored in its probe_Q. Such a probe can be considered a duplicate, and it should be discarded. To discard these duplicate probes, the following modification to the basic algorithm is needed.

When a transaction receives a probe from a data manager, it discards the probe if there exists a probe in its probe_Q which has an identical initiator.

F. Fair Scheduling of Exclusive Locks

The policy discussed in Section V, of granting an S_lock request when an X_lock request is already pending, is unfair to X_requesters. A fair scheduling policy would be as follows.

When a transaction T requests an S_lock, it is granted if there is no X_holder and no X_requester of higher priority than T.

Such a scheme ensures that an X_requester will never encounter antagonistic conflicts incrementally. However, even in this case, S_holders are introduced incrementally, and to take into account transitive wait on these additional S_holders, we need to maintain probe_Qs within data managers. Further, an S_requester may now encounter antagonistic conflicts with some S_holders, and in such cases probes must be sent to those S_holders.

We must point out here that this fair scheduling policy is not directly applicable in the case of nested transactions, since we have to take retainers into account also. For example, suppose that for some data item there is a retainer Tr and an X_requester Tx, and let us assume that Tx > Tr. Now, when a descendant of Tr requests an S_lock, it must be granted, even though its priority is less than that of Tx. Otherwise, Tx waits for Tr, which waits for its descendant to commit, and the latter waits for Tx, resulting in a deadlock. Hence, in the case of nested transactions, the above fair scheduling policy can be enforced only when no ancestor of the requesting transaction is a retainer (S_retainer or X_retainer) of the data item. Thus, in this case, an X_requester may encounter antagonistic conflicts incrementally.

G. Computation of Cycle Length

Since we use an edge-chasing algorithm, it is quite simple to compute the length of a deadlock cycle. For this purpose, a probe should have an additional parameter, say length (l), which is set to one to start with. When a transaction receives a probe P, it increments P.l by one before saving it in its probe_Q. If, on receiving a probe P, a data manager detects a deadlock, then the value of P.l gives the length of the deadlock cycle.

H. Voluntary Abort by a Transaction

Though the algorithm is designed for detection and resolution of deadlocks, it can be used by transactions to abort voluntarily rather than wait until a deadlock cycle is formed, detected, and resolved. When a transaction receives a probe P, it can decide to abort voluntarily on either of two conditions: 1) a transaction with very high priority waits for it directly or transitively, or 2) the value of P.l is very high, i.e., a long wait-for chain has already formed.

ACKNOWLEDGMENT

The authors thank the referee for his comments and suggestions. They are also thankful to Prof. K. Mani Chandy and Prof. M. Stonebraker for their helpful discussions.

REFERENCES

[1] R. Bayer, K. Elhardt, J. Heigert, and A. Reiser, "Dynamic timestamp allocation for transactions in database systems," in Distributed Databases, H. J. Schneider, Ed. Amsterdam, The Netherlands: North-Holland, 1982, pp. 9-20.
[2] P. A. Bernstein and N. Goodman, "Concurrency control in distributed database systems," ACM Comput. Surveys, vol. 13, pp. 185-221, June 1981.
[3] K. M. Chandy and J. Misra, "A distributed algorithm for detecting resource deadlocks in distributed systems," in Proc. ACM SIGACT-SIGOPS Symp. Principles of Distributed Computing, Ottawa, Ont., Canada, Aug. 1982.
[4] K. M. Chandy, J. Misra, and L. M. Haas, "Distributed deadlock detection," ACM Trans. Comput. Syst., vol. 1, pp. 144-156, May 1983.
[5] E. G. Coffman, Jr., M. J. Elphick, and A. Shoshani, "System deadlocks," ACM Comput. Surveys, vol. 3, pp. 66-78, June 1971.
[6] C. T. Davies, "Recovery semantics for a DB/DC system," in Proc. ACM Nat. Conf., vol. 28, 1973, pp. 136-141.
[7] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger, "The notion of consistency and predicate locks in a database system," Commun. ACM, vol. 19, pp. 624-633, Nov. 1976.
[8] V. D. Gligor and S. H. Shattuck, "On deadlock detection in distributed systems," IEEE Trans. Software Eng., vol. SE-6, pp. 435-440, Sept. 1980.
[9] J. N. Gray, "Notes on database operating systems," in Operating Systems, An Advanced Course (Lecture Notes in Computer Science 60). Berlin, Germany: Springer-Verlag, 1978, pp. 398-481.
[10] R. C. Holt, "Some deadlock properties of computer systems," ACM Comput. Surveys, vol. 4, pp. 179-195, Dec. 1972.
[11] L.
Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, vol. 21, pp. 558-565, July 1978.
[12] D. A. Menasce and R. R. Muntz, "Locking and deadlock detection in distributed databases," IEEE Trans. Software Eng., vol. SE-5, pp. 195-202, May 1979.
[13] J. E. B. Moss, "Nested transactions: An approach to reliable distributed computing," Lab. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, Tech. Rep. 260, Apr. 1981.
[14] N. Natarajan, "Communication and synchronization in distributed programs," Ph.D. dissertation, National Centre for Software Development and Computing Techniques, Tata Inst. Fundamental Res., Bombay, India, Nov. 1983.
[15] R. Obermarck, "Distributed deadlock detection algorithm," ACM Trans. Database Syst., vol. 7, pp. 187-208, June 1982.
[16] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis, "System level concurrency control for distributed database systems," ACM Trans. Database Syst., vol. 3, pp. 178-198, June 1978.

Mukul K. Sinha was born in Patna, India, on September 27, 1950. He received the B.Sc. (Engineering) degree in electrical engineering from Bihar Institute of Technology, Sindri, India, in 1968, the M.Tech. degree in electrical engineering from Indian Institute of Technology, Kanpur, India, in 1971, and the Ph.D. degree in computer science from the University of Bombay, Bombay, India, in 1983.
He is currently working as a Scientific Officer at the National Centre for Software Development and Computing Techniques, Bombay. From September 1979 to August 1980, he was a Visiting Engineer in the Computer Systems Research Group at Massachusetts Institute of Technology, where he worked on concurrency control problems in distributed systems. He has designed and implemented various systems which include compilers, general purpose graphics systems, multiprocessor operating systems, and a file server for a local area network. His current research interests are operating systems, database concurrency control, and local area networks.

N. Natarajan was born in Madras, India, on June 28, 1950. He received the B.E. (Hons.) degree in electronics and communication engineering from the University of Madras, Madras, in 1972, the M.E. degree in automation from Indian Institute of Science, Bangalore, India, in 1974, and the Ph.D. degree in computer science from the University of Bombay, Bombay, India, in 1983.
He has been working with the National Centre for Software Development and Computing Techniques, Tata Institute of Fundamental Research, Bombay, since 1974, where he has worked on compilers, an operating system for a multiprocessor, and the design of a local area network. He visited the Laboratory for Computer Science, Massachusetts Institute of Technology, during 1979-1980. His research interests include operating systems, programming languages, computer networks, and distributed systems.

Timing Constraints of Real-Time Systems: Constructs for Expressing Them, Methods of Validating Them

B. DASARATHY, MEMBER, IEEE

Abstract-This paper examines timing constraints as features of real-time systems. It investigates the various constructs required in requirements languages to express timing constraints and considers how automatic test systems can validate systems that include timing constraints. Specifically, features needed in test languages to validate timing constraints are discussed. One of the distinguishing aspects of three tools developed at GTE Laboratories for real-time systems specification and testing is their extensive ability to handle timing constraints. Thus, the paper highlights the timing constraint features of these tools.

Index Terms-Real-time systems, requirements specification, test generation, test language, timing constraints, validation.

Manuscript received July 29, 1983.
The author is with GTE Laboratories, Inc., Waltham, MA 02254.

INTRODUCTION

DURING the past decade there has been great progress in the development of requirements languages; that is, formal languages for expressing the requirements of systems [9]. In particular, researchers have shown an interest in languages for expressing the requirements of real-time systems. Examples of such languages are REVS' RSL [1], [2], [7], CCITT's System Description Language (SDL) [5], Zave's PAISLey [13], and GTE Laboratories' Real-Time Requirements Language (RTRL) [10].

SDL, RSL, and RTRL share a common view of real-time systems. They hold that a real-time system (or the ports it serves) can be modeled as finite-state machines (FSM's) in which a response at any instant is completely determined by the system's present state and the stimulus that has arrived. The behavior of the system is captured in transitions made from one state to another state on a stimulus. PAISLey has a more general view of a real-time system in that it allows both the system and its environment to be modeled as interacting
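The shared FSM view described above can be sketched as follows; this is an illustrative toy in Python, not a construct of SDL, RSL, or RTRL, and the port, states, and stimuli are invented for the example.

```python
class PortFSM:
    """A port modeled as a finite-state machine: the response at any
    instant is determined solely by the present state and the stimulus
    that has arrived, as in the SDL/RSL/RTRL view described above."""

    def __init__(self, transitions, start_state):
        # transitions: {(state, stimulus): (next_state, response)}
        self.transitions = transitions
        self.state = start_state

    def stimulate(self, stimulus):
        """Apply a stimulus: move to the next state and return the response."""
        next_state, response = self.transitions[(self.state, stimulus)]
        self.state = next_state
        return response


# A toy telephone port: going off-hook in the idle state elicits dial tone.
phone = PortFSM(
    {("idle", "off_hook"): ("dialing", "dial_tone"),
     ("dialing", "on_hook"): ("idle", "silence")},
    start_state="idle",
)
print(phone.stimulate("off_hook"))  # prints dial_tone
```

The paper's subject is attaching timing constraints to exactly such stimulus-response transitions (e.g., how soon a response must follow a stimulus) and validating them with automatic test systems.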