Published on September 17, 2007
Deadlocks in Distributed Systems: Deadlocks in Distributed Systems Deadlocks in distributed systems are similar to deadlocks in single processor systems, only worse. They are harder to avoid, prevent or even detect. They are hard to cure when tracked down because all relevant information is scattered over many machines. People sometimes might classify deadlock into the following types: Communication deadlocks -- competing with buffers for send/receive Resources deadlocks -- exclusive access on I/O devices, files, locks, and other resources. We treat everything as resources, there we only have resources deadlocks. Four best-known strategies to handle deadlocks: The ostrich algorithm (ignore the problem) Detection (let deadlocks occur, detect them, and try to recover) Prevention (statically make deadlocks structurally impossible) Avoidance (avoid deadlocks by allocating resources carefully) The FOUR Strategies for handling deadlocks: The FOUR Strategies for handling deadlocks The ostrich algorithm No dealing with the problem at all is as good and as popular in distributed systems as it is in single-processor systems. In distributed systems used for programming, office automation, process control, no system-wide deadlock mechanism is present -- distributed databases will implement their own if they need one. Deadlock detection and recovery is popular because prevention and avoidance are so difficult to implement. Deadlock prevention is possible because of the presence of atomic transactions. We will have two algorithms for this. Deadlock avoidance is never used in distributed system, in fact, it is not even used in single processor systems. The problem is that the banker’s algorithm need to know (in advance) how much of each resource every process will eventually need. This information is rarely, if ever, available. Hence, we will just talk about deadlock detection and deadlock prevention. Distributed Deadlock Detection: Distributed Deadlock Detection Since preventing and avoiding deadlocks to happen is difficult, researchers works on detecting the occurrence of deadlocks in distributed system. The presence of atomic transaction in some distributed systems makes a major conceptual difference. When a deadlock is detected in a conventional system, we kill one or more processes to break the deadlock --- one or more unhappy users. When deadlock is detected in a system based on atomic transaction, it is resolved by aborting one or more transactions. But transactions have been designed to withstand being aborted. When a transaction is aborted, the system is first restored to the state it had before the transaction began, at which point the transaction can start again. With a bit of luck, it will succeed the second time. Thus the difference is that the consequences of killing off a process are much less severe when transactions are used. Centralized Deadlock Detection: Centralized Deadlock Detection We use a centralized deadlock detection algorithm and try to imitate the nondistributed algorithm. Each machine maintains the resource graph for its own processes and resources. A centralized coordinator maintain the resource graph for the entire system. When the coordinator detect a cycle, it kills off one process to break the deadlock. In updating the coordinator’s graph, messages have to be passed. Method 1) Whenever an arc is added or deleted from the resource graph, a message have to be sent to the coordinator. Method 2) Periodically, every process can send a list of arcs added and deleted since previous update. Method 3) Coordinator ask for information when it needs it. False Deadlocks: False Deadlocks One possible way to prevent false deadlock is to use the Lamport’s algorithm to provide global timing for the distributed systems. When the coordinator gets a message that leads to a suspect deadlock: It send everybody a message saying 'I just received a message with a timestamp T which leads to deadlock. If anyone has a message for me with an earlier timestamp, please send it immediately' When every machine has replied, positively or negatively, the coordinator will see that the deadlock has really occurred or not. Distributed Deadlock Detection: Distributed Deadlock Detection The Chandy-Misra-Haas algorithm: Processes are allowed to request multiple resources at once -- the growing phase of a transaction can be speeded up. The consequence of this change is a process may now wait on two or more resources at the same time. When a process has to wait for some resources, a probe message is generated and sent to the process holding the resources. The message consists of three numbers: the process being blocked, the process sending the message, and the process receiving the message. When message arrived, the recipient checks to see it it itself is waiting for any processes. If so, the message is updated, keeping the first number unchanged, and replaced the second and third field by the corresponding process number. The message is then send to the process holding the needed resources. If a message goes all the way around and comes back to the original sender -- the process that initiate the probe, a cycle exists and the system is deadlocked. Chandy-Misra-Haas Algorithm: Chandy-Misra-Haas Algorithm There are several ways to break the deadlock: The process that initiates commit suicide -- this is overkilling because several process might initiates a probe and they will all commit suicide in fact only one of them is needed to be killed. Each process append its id onto the probe, when the probe come back, the originator can kill the process which has the highest number by sending him a message. (Hence, even for several probes, they will all choose the same guy) Distributed Deadlock Prevention: Distributed Deadlock Prevention A method that might work is to order the resources and require processes to acquire them in strictly increasing order. This approach means that a process can never hold a high resource and ask for a low one, thus making cycles impossible. With global timing and transactions in distributed systems, two other methods are possible -- both based on the idea of assigning each transaction a global timestamp at the moment it starts. When one process is about to block waiting for a resource that another process is using, a check is made to see which has a larger timestamp. We can then allow the wait only if the waiting process has a lower timestamp. The timestamp is always increasing if we follow any chain of waiting processes, so cycles are impossible --- we can used decreasing order if we like. It is wiser to give priority to old processes because they have run longer so the system have larger investment on these processes. they are likely to hold more resources. A young process that is killed off will eventually age until it is the oldest one in the system, and that eliminates starvation. Wait-die Vs. Wound-wait: Wait-die Vs. Wound-wait As we have pointed out before, killing a transaction is relatively harmless, since by definition it can be restarted safely later. Wait-die: If an old process wants a resource held by a young process, the old one will wait. If a young process wants a resource held by an old process, the young process will be killed. Observation: The young process, after being killed, will then start up again, and be killed again. This cycle may go on many times before the old one release the resource. Once we are assuming the existence of transactions, we can do something that had previously been forbidden: take resources away from running processes. When a conflict arises, instead of killing the process making the request, we can kill the resource owner. Without transactions, killing a process might have severe consequences. With transactions, these effects will vanish magically when the transaction dies. Wound-wait: (we allow preemption andamp; ancestor worship) If an old process wants a resource held by a young process, the old one will preempt the young process -- wounded and killed, restarts and wait. If a young process wants a resource held by an old process, the young process will wait.