Workshop do Projeto ARGO (ARGO Project Workshop), June 2001

Approaches for Mobile Agent Fault Tolerance
Flávio Morais de Assis Silva, Raimundo José de Araújo Macêdo

Distributed Systems Laboratory – LaSiD/UFBa

{fassis, macedo}@ufba.br

A mobile agent (or simply agent) is a self-contained software element that is responsible for executing a programmatic process and is capable of autonomously migrating through a network. An agent migrates in a distributed environment between logical "places", referred to here as agencies. When an agent migrates, its execution is suspended at the original agency, the agent (i.e., its program code, data, execution state and control information) is transported to another agency in the distributed environment, and, after being re-instantiated at the new agency, the agent resumes execution.
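
The migration steps described above (suspend, transport, re-instantiate, resume) can be illustrated with a minimal single-process sketch. The names Agent, Agency and migrate are hypothetical and do not come from any of the systems discussed here. The execution state is modelled as an explicit step counter, and the program code is assumed to be already available at the target agency, so only data, execution state and control information are serialized.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Agent:
    """Hypothetical agent: data plus an explicit execution state."""
    name: str
    data: dict = field(default_factory=dict)
    next_step: int = 0                              # stands in for the execution state
    itinerary: list = field(default_factory=list)   # control information

    def run(self, agency_name: str) -> None:
        # One unit of work per visit: record where this step was executed.
        self.data.setdefault("visited", []).append(agency_name)
        self.next_step += 1


class Agency:
    """Hypothetical logical place that hosts agents."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hosted = {}

    def receive(self, payload: str) -> Agent:
        # Re-instantiate the agent from the transported payload and resume it.
        agent = Agent(**json.loads(payload))
        self.hosted[agent.name] = agent
        agent.run(self.name)
        return agent


def migrate(agent: Agent, source: Agency, target: Agency) -> Agent:
    # 1. Suspend: the agent is no longer hosted at the source agency.
    source.hosted.pop(agent.name, None)
    # 2. Transport: serialize data, execution state and control information.
    payload = json.dumps(asdict(agent))
    # 3. Re-instantiate and resume at the target agency.
    return target.receive(payload)
```

In this sketch a failure of the target agency after step 1 would lose the agent, which is the kind of situation the fault tolerance mechanisms discussed in this work are meant to address.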

The mobile agent concept began to attract stronger attention from the research and industry communities around 1994, and since then many mobile agent systems have been developed, among them Mole, TACOMA, Aglets, D’Agents and Grasshopper. The concept has been proposed to support different types of applications, including electronic commerce, workflow management, network management, distributed information retrieval and active networks. Mobile agents are considered a concept that can be exploited to provide, among others, the following benefits: better use of communication resources; flexible support for disconnected operation; flexibility for the management of software deployment and maintenance; and adequate support for interactions with human users [1].

A fundamental issue in the development of mobile agent systems is how to support the reliability of mobile agent-based applications, especially those that will use mobile agents in open environments such as the Internet. One of the reliability requirements of mobile agent applications is fault tolerance of mobile agent executions, i.e., how to guarantee that mobile agent applications execute correctly despite failures (e.g., failures of the agencies or machines where mobile agents are executing). Mobile agent fault tolerance requires at least mechanisms for making agents persistent, for reactivating agents and recovering the state of the agent activity after a failure, and for reliably transporting agents between agencies. Additionally, the execution of mobile agent-based applications should be able to tolerate long-term failures of agencies. When a mobile agent executing an application moves to an agency, it transfers the control flow of that application to that agency. While executing at an agency, an agent is completely subject to the execution rules and conditions of that agency. If the agency where the agent is running fails, the execution of that agent remains blocked for as long as the agency is faulty. An agency may remain faulty for a long time. Long unavailability periods have the obvious undesirable effect of delaying the execution of an agent-based application.

In this work we concentrate on mechanisms for tolerating long-term failures of agencies. An agent-based application can be made to tolerate this type of failure by replicating agents. An agent execution can be regarded as being performed in a sequence of stages. The first stage begins when the application execution starts. A new stage then begins (and the previous one terminates) when the mobile agent execution reaches a movement operation, to continue at a new agency. To provide agent fault tolerance, when an agent requests a movement, instead of being sent to a single agency, copies of the agent are sent to a non-empty set of agencies. Each stage is thus performed by a set of agent replicas. Mobile agent executions can thus be made fault tolerant, since in the event of a failure of one of the replicas executing the agent task, another replica can recover and resume the agent execution. When a stage terminates, the replicas used in that stage are destroyed and a new set of replicas is created and sent to the agencies where the new stage will be executed.
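
The staged execution described above can be sketched as a small sequential simulation. The functions run_stage and run_agent, the priority order among replicas and the explicit set of failed agencies are all assumptions made for illustration. A real protocol cannot consult a perfect list of failed agencies and must coordinate the replicas, which this sketch deliberately leaves out.

```python
import copy
from typing import Callable, List, Set

# A stage task takes the agent state and the agency name and returns the new state.
Task = Callable[[dict, str], dict]


def run_stage(state: dict, agencies: List[str], failed: Set[str], task: Task) -> dict:
    """Run one stage with one replica per agency; a live replica takes over on failure."""
    if not agencies:
        raise ValueError("a stage needs a non-empty set of agencies")
    for agency in agencies:                 # replicas are tried in priority order
        if agency in failed:
            continue                        # this replica is unavailable
        # Each replica works on its own copy of the agent state.
        return task(copy.deepcopy(state), agency)
    raise RuntimeError("all replicas of this stage have failed")


def run_agent(stages: List[List[str]], failed: Set[str], task: Task) -> dict:
    """Execute the stages in order; each stage gets a fresh set of replicas."""
    state = {"trace": []}
    for agencies in stages:
        # The replicas of the previous stage are discarded once its result is known.
        state = run_stage(state, agencies, failed, task)
    return state


def visit(state: dict, agency: str) -> dict:
    # Example stage task: remember which agency executed the stage.
    state["trace"].append(agency)
    return state
```

For example, run_agent([["a1", "a2"], ["b1", "b2"]], {"a1"}, visit) completes the first stage at a2 even though a1 is faulty, so the execution is not blocked by the failure of a single agency.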

The set of replicas executing a stage must be coordinated so that stage executions are consistent. Some desirable properties that a mobile agent fault tolerance mechanism should provide are: it should be non-blocking (at a minimum, the failure of a single element, whether agent, agency or machine, should not interrupt the execution of the application); it should guarantee exactly-once semantics, i.e., at each stage, the effects of exactly one of the replicas should be applied; and it should allow an application to be executed by multiple fault-tolerant agents.
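
The exactly-once property can be illustrated with a small in-process sketch in which several replicas of a stage try to commit their results and only the first attempt is accepted. The class StageLog and its methods are hypothetical. The single lock stands in for the distributed coordination that an actual protocol would need among replicas running at different agencies.

```python
import threading
from typing import Any, Dict, Tuple


class StageLog:
    """Hypothetical record of which replica's result was accepted for each stage."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._committed: Dict[int, Tuple[str, Any]] = {}

    def try_commit(self, stage: int, replica: str, result: Any) -> bool:
        """Accept the result only if no replica has committed this stage yet."""
        with self._lock:
            if stage in self._committed:
                return False               # another replica's effects already apply
            self._committed[stage] = (replica, result)
            return True

    def outcome(self, stage: int) -> Tuple[str, Any]:
        """Return the (replica, result) pair accepted for the given stage."""
        with self._lock:
            return self._committed[stage]
```

A replica whose call to try_commit returns False must discard or undo its own effects, so that at each stage only the effects of the accepted replica remain.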

Different approaches have been proposed to provide mobile agent fault tolerance (e.g., [2, 3, 5, 6]). In this work we present a general overview of the approaches based on agent replication. Additionally, we provide a general view of how mobile agent fault tolerance can be provided on top of mobile groups [4]. Mobile groups are an extension of the traditional concept of groups of processes that provides message delivery guarantees to members of the group and a form of virtual synchrony despite the mobility of group members. A protocol for mobile agent fault tolerance on top of mobile groups is under development at LaSiD.

References

[1] F.M.Assis Silva. A Transaction Model based on Mobile Agents. PhD Thesis. Technical University Berlin. 1999

[2] F.M.Assis Silva, R.Popescu-Zeletin. An Approach for Providing Mobile Agent Fault Tolerance. Proceedings of the Second International Workshop on Mobile Agents, MA’98, Stuttgart, Germany, September 1998. Lecture Notes in Computer Science (LNCS) 1477, Springer-Verlag. 1998. pp.14-25

[3] D.Johansen, K.Marzullo, F.B.Schneider, K.Jacobsen, D.Zagorodnov. NAP: Practical Fault-Tolerance for Itinerant Computations. Technical Report TR98-1716. Department of Computer Science, Cornell University. USA. November, 1998

[4] R.J.A.Macêdo, F.M.Assis Silva. Mobile Groups. Proceedings of the 19th Brazilian Symposium on Computer Networks. Florianópolis, Brazil. May, 2001

[5] S.Pleisch, A.Schiper. Modeling Fault-Tolerant Mobile Agents as a Sequence of Agreement Problems. Proceedings of the 19th Symposium on Reliable Distributed Systems (SRDS). Nuremberg, Germany. October, 2000. pp.11-20

[6] K.Rothermel, M.Straßer. A Fault-Tolerant Protocol for Providing the Exactly-Once Property of Mobile Agents. Proceedings of the IEEE Symposium on Reliable Distributed Systems (SRDS’98). West Lafayette, USA. October, 1998. pp.100-108