Reliable broadcast for fault-tolerance on local computer networks

Paulo Veríssimo and José A. Marques

in Proceedings of the 9th Symposium on Reliable Distributed Systems, October 1990, Huntsville, Alabama, USA.

Abstract

The importance of fault-tolerance mechanisms in application-independent systems has led to the increased use of techniques based on "macroscopic" replication of components and software-oriented error processing. In distributed environments, managing the replication of components across different sites may benefit from the availability of reliable broadcast or multicast protocols.

This paper discusses the definition and design of a generic reliable communication architecture on a widely used, host-independent platform such as a local area network. Two aspects of relevance are the use of non-replicated LANs and of self-checking components.

The protocol itself is innovative in that, although clock-less and running on a non-replicated network, it displays bounded execution times. The architecture is consequently capable of reliably addressing real-time requirements.

Also available as INESC AR/24-90 (gzipped PostScript).