Introduction to Fault Tolerance

Fault tolerance is closely related to the notion of dependability.

Basic Concepts

In distributed systems, dependability is characterized by four requirements:

  1. Availability - The system is ready to be used immediately.
  2. Reliability - The system can run continuously without failure.
  3. Safety - If a system fails, nothing catastrophic will happen.
  4. Maintainability - When a system fails, it can be repaired easily and quickly.

A system is said to “fail” when it cannot meet its promises.

<aside> <img src="/icons/map-pin_gray.svg" alt="/icons/map-pin_gray.svg" width="40px" /> This closely resembles our discussion in COE891, in When Software Goes Bad, of software faults and software errors.

</aside>

Fault tolerance means that a system can provide its services even in the presence of faults. That is, the system can tolerate faults and continue to operate normally.

Process Resilience

A process can be made fault tolerant by replicating it into a group of identical processes, so that if one member fails, the others can continue to provide the service.
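As a rough illustration of the idea, the sketch below simulates a group of identical processes in a single Python program. The names `Replica`, `handle`, and `send_to_group` are hypothetical and chosen only for this example, and a crash is modelled simply as a flag on the replica. It is a minimal sketch of how a group can mask the failure of one member, not a real group communication protocol.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One member of a process group; every member runs identical logic."""
    name: str
    alive: bool = True

    def handle(self, request: str) -> str:
        # A crashed member cannot answer (assumption: crash faults only).
        if not self.alive:
            raise ConnectionError(f"{self.name} has crashed")
        return request.upper()  # stand-in for the real service logic


def send_to_group(group: list[Replica], request: str) -> str:
    """Deliver the request to every member and return the first reply collected."""
    replies = []
    for replica in group:
        try:
            replies.append(replica.handle(request))
        except ConnectionError:
            continue  # a crashed member is skipped; the others still answer
    if not replies:
        raise RuntimeError("all group members have failed")
    return replies[0]


group = [Replica("p1"), Replica("p2"), Replica("p3")]
group[0].alive = False  # simulate a crash fault in one member
result = send_to_group(group, "ping")
print(result)
```

The caller still gets an answer even though `p1` is down, because the surviving members are identical and any of their replies will do. The service fails only when every member of the group has failed.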

There are two types of group organization, which differ in how their members communicate with one another.