ARGO Project Workshop – June 2001

Implementation of a Failure Detector based on an Artificial Neural Network

Nivea Ferreira, Raimundo Macêdo

Distributed Systems Laboratory – LaSiD/UFBa

{niveacf,macedo}@ufba.br

 

 

Providing fault tolerance is an important issue in the design of today's distributed systems, especially those that handle critical services or information (such as stock markets and air traffic control). Research in fault-tolerant distributed computing aims at making systems more reliable, i.e., able to deliver their specified services despite failures of some of their components. A basic module of such systems is the failure detector, which provides information about processes that have crashed.

In synchronous systems, characterized by bounded message transmission delays and processing times, it is possible to construct perfect failure detectors based simply on local timeouts. These failure detectors are perfect because they suspect only crashed processes and, eventually, every crashed process is detected.
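As an illustration of the timeout idea, the following minimal sketch suspects a process once no heartbeat has been received from it within a fixed bound. The class name, the heartbeat scheme and the value of `bound` are assumptions made for the example; in a synchronous system `bound` would be derived from the known limits on transmission delay and processing time.

```python
class TimeoutDetector:
    """Suspects a process once no heartbeat arrived within `bound` time units."""

    def __init__(self, processes, bound):
        # `bound` is assumed to cover the heartbeat period plus the
        # maximum transmission delay and processing time of the system.
        self.bound = bound
        self.last_heard = {p: 0.0 for p in processes}

    def heartbeat(self, process, now):
        # Record the local time at which a heartbeat from `process` arrived.
        self.last_heard[process] = now

    def suspects(self, now):
        # A process is suspected when it has been silent for longer than `bound`.
        return {p for p, t in self.last_heard.items() if now - t > self.bound}


detector = TimeoutDetector(["p1", "p2"], bound=3.0)
detector.heartbeat("p1", now=1.0)
detector.heartbeat("p2", now=1.0)
detector.heartbeat("p1", now=3.0)    # p2 has crashed and stays silent
suspected = detector.suspects(now=5.0)
```

With these inputs only `p2` exceeds the bound at time 5.0. The same code gives no such guarantee in an asynchronous system, where a slow but correct process can also exceed any fixed bound.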

In asynchronous distributed systems, where there is no bound on message transmission delays or processing times, it is not always possible to distinguish a crashed process from a very slow one. Therefore, timeouts cannot be taken directly as an accurate indication of failure and, as a consequence, perfect failure detection cannot be implemented. Nevertheless, this model is of particular interest to us because it is the most realistic one for distributed computing in today's large-scale wide-area networks.

Chandra and Toueg [CT96] proposed a modular way of extending the asynchronous model with failure detectors. In their theory of unreliable failure detectors, they propose a program module that acts as an unreliable oracle on the functional state of neighboring processes. Failure detectors are defined by two properties: completeness and accuracy. Informally, completeness requires that every process that crashes is eventually suspected by some correct process; accuracy, in its strongest form, states that a failure detector never suspects correct processes of having crashed. Based on these properties, Chandra and Toueg defined eight classes of failure detectors. Among them, the class ◇S, which imposes the weakest conditions on the run-time environment, includes all failure detectors that satisfy strong completeness (eventually, every crashed process is suspected by every correct process) and eventual weak accuracy (there is a time after which some correct process is never suspected by any correct process).
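The two ◇S properties can be made concrete by checking them against a recorded suspicion history. The sketch below is only an approximation: both properties speak about infinite executions, so on a finite trace "eventually" is read as "by the end of the trace". The trace layout (a list of snapshots, each mapping a process to the set of processes it suspects) and the function names are assumptions made for the example.

```python
def strong_completeness(history, crashed, correct):
    """In the last snapshot, every correct process suspects every crashed one."""
    final = history[-1]
    return all(crashed <= final[p] for p in correct)


def eventual_weak_accuracy(history, correct):
    """Some correct process is suspected by no correct process from some
    snapshot until the end of the trace."""
    for candidate in correct:
        for start in range(len(history)):
            if all(candidate not in snap[p]
                   for snap in history[start:] for p in correct):
                return True
    return False


correct = {"p1", "p2"}
crashed = {"p3"}
history = [
    {"p1": {"p2"}, "p2": {"p1"}},          # early mistakes on both sides
    {"p1": {"p3"}, "p2": {"p1", "p3"}},    # p2 still wrongly suspects p1
    {"p1": {"p3"}, "p2": {"p1", "p3"}},
]
complete = strong_completeness(history, crashed, correct)
accurate = eventual_weak_accuracy(history, correct)
```

In this trace `p1` remains wrongly suspected to the end, yet both checks succeed because `p2` is never suspected after the first snapshot. This is the kind of mistake that ◇S tolerates.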

A timeout mechanism is sufficient to implement the strong completeness property. However, eventual weak accuracy is impossible to achieve unless extra assumptions about the system environment are made. [MR00] presents a mechanism, called Connectivity Time Indicator (CTI), that is used to implement the properties required by failure detectors of class ◇S, assuming that a given set of processes can crash and recover (and stay alive long enough between crashes for eventual weak accuracy to be achieved). The connectivity time between two processes, Pi and Pj, is defined as the time a message takes to travel from Pi to Pj (or vice versa) at a given moment in the system's lifetime. The idea is to use the CTI to provide a hint about the current connectivity time between processes by analyzing the current operating system and network loads.

Our aim in this work is to implement the CTI concept using a specific type of artificial neural network. For some classes of complex problems, artificial neural networks are more powerful tools than standard statistical methods. By using neural networks, we intend to predict more realistic (adaptive) timeouts, based on previous connectivity time values and on the current characteristics of the communication channel. Predicting these values requires a neural network with dynamic properties. In this work we present a prototype of such a neural network.
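To make the idea of an adaptive timeout concrete, the sketch below predicts the next connectivity time from a sliding window of previous values. It is not the network developed in this work, whose architecture is not described here: it substitutes a single linear neuron fed by a tapped delay line and trained online with a normalized least-mean-squares rule. The class and function names, the window size, the learning rate, the safety margin and the sample values are all assumptions made for the example.

```python
class DelayLineNeuron:
    """Single linear neuron over the last `window` connectivity times."""

    def __init__(self, window, rate=0.5):
        self.rate = rate
        self.weights = [1.0 / window] * window   # starts as a moving average
        self.bias = 0.0

    def predict(self, recent):
        return self.bias + sum(w * x for w, x in zip(self.weights, recent))

    def train(self, recent, observed):
        # Normalized LMS update; the extra 1.0 accounts for the bias input.
        error = observed - self.predict(recent)
        norm = 1.0 + sum(x * x for x in recent)
        step = self.rate * error / norm
        self.weights = [w + step * x for w, x in zip(self.weights, recent)]
        self.bias += step
        return error


def adaptive_timeouts(samples, window=3, margin=2.0):
    """Timeout for each new sample: prediction plus a margin that grows
    with the smoothed absolute prediction error."""
    neuron = DelayLineNeuron(window)
    deviation = 0.0
    timeouts = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        timeouts.append(neuron.predict(recent) + margin * deviation)
        error = neuron.train(recent, samples[i])
        deviation = 0.75 * deviation + 0.25 * abs(error)
    return timeouts


# Made-up connectivity times (ms) with a sudden increase in network load.
samples = [10.0, 11.0, 10.5, 12.0, 30.0, 31.0, 29.5, 30.5]
timeouts = adaptive_timeouts(samples)
```

The first timeout is simply the mean of the first three samples. After the load increases, the timeouts grow instead of staying at a fixed value, which is the behavior an adaptive timeout is meant to provide. A predictor that also takes operating system and network load as inputs, as the CTI concept suggests, would need additional input terms.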

References

[CT96] Chandra T., Hadzilacos V. and Toueg S., The Weakest Failure Detector for Solving Consensus. Journal of the ACM, 43(4): 685-722, July 1996.

[MR00] Macêdo R., Failure Detection in Asynchronous Distributed Systems. Proc. of II Workshop on Tests and Fault-Tolerance, pp. 76-81, July 2000, Curitiba, Brazil.