Fault Tolerance in Distributed Systems
In order to implement the techniques for fault tolerance in distributed systems, the design, configuration and relevant applications need to be considered. Below are the phases carried out for fault tolerance in any distributed systems.
1. Fault Detection
Fault Detection is the first phase where the system is monitored continuously. The outcomes are being compared with the expected output. During monitoring if any faults are identified they are being notified. These faults can occur due to various reasons such as hardware failure, network failure, and software issues. The main aim of the first phase is to detect these faults as soon as they occur so that the work being assigned will not be delayed.
2. Fault Diagnosis
Fault diagnosis is the process where the fault that is identified in the first phase will be diagnosed properly in order to get the root cause and possible nature of the faults. Fault diagnosis can be done manually by the administrator or by using automated Techniques in order to solve the fault and perform the given task.
3. Evidence Generation
Evidence generation is defined as the process where the report of the fault is prepared based on the diagnosis done in an earlier phase. This report involves the details of the causes of the fault, the nature of faults, the solutions that can be used for fixing, and other alternatives and preventions that need to be considered.
4. Assessment
Assessment is the process where the damages caused by the faults are analyzed. It can be determined with the help of messages that are being passed from the component that has encountered the fault. Based on the assessment further decisions are made.
5. Recovery
Recovery is the process where the aim is to make the system fault free. It is the step to make the system fault free and restore it to state forward recovery and backup recovery. Some of the common recovery techniques such as reconfiguration and resynchronization can be used.
Fault Tolerance in Distributed System
Distributed systems are defined as a collection of multiple independent systems connected together as a single system. Every independent system has its own memory and resources and some common resources and peripheral devices that are common to devices connected together. The design of Distributed systems is a complex process where all the nodes or devices need to be connected together even if they are located at long distances. Challenges faced by distributed systems are Fault Tolerance, transparency, and communication primitives. Fault Tolerance is one of the major challenges faced by distributed systems.
In distributed systems, there are three types of problems that occur. All these three types of problems are related.
- Fault: Fault is defined as a weakness or shortcoming in the system or any hardware and software component. The presence of fault can lead to error and failure.
- Errors: Errors are incorrect results due to the presence of faults.
- Failure: Failure is the final outcome where the assigned goal is not achieved.
Contact Us