Workshop do Projeto ARGO – Junho/2001

Agreement Problems in Asynchronous Distributed Systems

Michel Hurfin

IRISA/INRIA Rennes (France)

hurfin@irisa.fr

In asynchronous distributed systems prone to crash failures, many problems require all the processes to reach agreement on a decision value. Atomic commitment, total order, and membership are among the most famous and cited agreement problems. All these problems suffer from the impossibility result exhibit by Fischer, Lynch, and Paterson who consider the particular case of the consensus problem. Despite its simple definition, the consensus problem is very representative of the difficulties encountered when solving any other agreement problems. For this main reason, this problem has been intensively studied.

In order to overcome this impossibility result, Chandra and Toueg have augmented the asynchronous system model with the notion of Failure Detectors. A Failure Detector is associated with each process of the distributed computation and is responsible for detecting external failures. Suspicions are essentially implemented using time-out mechanisms, which means that (1) the detection of a real failure is usually delayed and (2) a Failure Detector can make mistakes by incorrectly suspecting a process to have crashed. Several classes of Failure Detectors have been defined. All are specified by a completeness property and an accuracy property. A completeness property puts a condition on the detection of crashed processes, while an accuracy property restricts the possible mistakes made by a Failure Detector. The class of Failure Detector called "Diamond S" is very attractive as it has been proved to be the weakest class of failure detectors that allow to solve the consensus problem.

During this talk, we will survey several solutions to the consensus problem in asynchronous distributed systems equipped with "Diamond S" failure detectors. All the described solutions have been designed in the ADP research team during the 4 last years. Part of this research has been realized in the context of the INRIA-CNPq ARGO project which aims to develop a consensus-based group communication protocol. The objective of all these works is to define a more efficient and flexible agreement module which can be used to develop a group communication service and more generally to solve any agreement problem in an asynchronous distributed system. Each of these works suggests a particular improvement to the classical approach proposed by Chandra-Toueg 10 years ago. We classifies these works as follows.

 

All the solutions proposed in the literature ensure the validity property (the decision value is one of the proposed value), the agreement property (no two processes adopt different decision values) and the termination property (all the non-crashed processes eventually adopt a decision value). The two last properties which are respectively a safety property and a liveness property can be replaced by weakest properties.

Here, the goal is not to increase the efficiency of the agreement module itself but to increase the significance of its result. Indirectly, it will sometimes reduce the time and message cost of an upper-layer application. In all the cases, the validity property is affected.

The results is a set of proposed value rather than a single one.

A function is applied to the gathered value to generate an output value. The sets of input and output values can be completely deconnected.

During a classical consensus execution, a process proposes a value once and decides once. In fact, if the integrity property forces a process to decide once per consensus execution, the number of proposed values can be equal to zero or even greater than zero.

Crash failure is a particular type of failure. Protocols can be slightly modified to take into account other types of failures as for example, crash-recovery failures or Byzantine failures.

If the local clocks are synchronized then it is possible to define protocols where each round as a fixed duration. In that case, messages used by the consensus protocol (except of course the decision message) can be lost.

Protocols based on a fixed round duration allows to defines constraints on the duration of the whole computation.

References

[1] F. Greve, M. Hurfin, R. Macedo, and M. Raynal. Consensus based on strong failure detectors: Time and message efficient protocol. In Proc. of the Int. Workshop on Fault-Tolerant Parallel and Distributed Systems, number 1800 in LNCS, pages 1258-1267, Cancun, May 2000. Springer Verlag.

[2] F. Greve, M. Hur_n, R. Macedo, and M. Raynal. Time and message efficient s-based consensus. In Proc. of the 19th ACM SIGACT-SIGOPS Int. Symposium on Principles of Distributed Computing (PODC'00),pages 332{332, Portland, OR, Jul 2000. ACM. Brief announcement.

[3] M. Hurfin, R. Macedo, A. Mostefaoui, and M. Raynal. A sliding round window diamond-s-based consensus protocol. In Proc. of the 20th IEEE Int. Symposium on Reliable Distributed Systems (SRDS'01), New Orleans, USA, Oct 2001. IEEE.