Refad: Real-time and Fault-tolerant Distributed Systems
Research is supported in part by NSF under Grant EPS-0091900 and the University of Nebraska-Lincoln under Grant 26-0511-0019
In real-time and distributed systems, tasks must be guaranteed a priori to meet its timing constraint. Mission critical real-time tasks must tolerate faults. In other words, mission critical tasks must be completed before their deadlines even the real-time system encounter temporary and permanent faults. Scheduling multiple copies of a task on different processors is a promising approach to support fault-tolerance. If the primary copy of a task cannot be completed due to a fault, its backup copy executes and complete task before its deadline. We propose a series of new algorithms for fault-tolerant scheduling on a real-time distributed system. The algorithms guarantee the completion of a scheduled task before its deadline in the presence of a processor's permanent failures.
Firstly, we attempted to investigate two real-time fault-tolerant scheduling algorithms for independent tasks, which have no precedence constraints. The description of the algorithms and the experiment results can be found in Journal of Software, No5. 2000. The novel idea associate with these two algorithms is that real-time tasks without fault-tolerant requirement are also considered. This is because complex real-time systems consist of both fault-tolerant task and non-fault-tolerant tasks.
Heterogeneous systems involve multiple heterogeneous modules that interact with one another via high-speed network, which also can be heterogeneous. For each real-time task, the execution time on different computing modules are various. A fault-tolerant real-time scheduling is studied to provide a fault-tolerant support in the heterogeneous real-time systems. More information about this algorithm, including its description and simulation results, can be found in the Proceedings of 2000 International Conference on Parallel and Distributed Processing Techniques and Applications.
Above three algorithms are offline scheduling algorithms, which cannot directly be applied in dynamic real-time computing environment. Therefore, a heuristic dynamic scheduling scheme for parallel real-time jobs in a heterogeneous system is presented. The parallel real-time jobs studied in this paper are modelled by directed acyclic graphs (DAG). In the scheduling model, parallel real-time jobs arrive at a heterogeneous system following a Poisson process. The scheduling algorithms developed in this paper take the reliability measure into account, in order to enhance the reliability of the heterogeneous system without any additional hardware cost. In addition, scheduling time and dispatch time are both incorporated into our scheduling scheme so as to make the scheduling result more realistic and precise. Admission control is in place so that a parallel real-time job whose deadline cannot be guaranteed is rejected by the system. The performance of the proposed scheme is evaluated via extensive simulations. The simulation results show that the heuristic algorithm performs significantly better than two other algorithms that do not consider reliability cost. Furthermore, results suggest that shortening the scheduling time results in a higher guarantee ratio. Hence, if parallel scheduling algorithm is devised and employed to shorten the scheduling time, the performance of the heterogeneous system will be further enhanced. The detailed discussion about this algorithm is given in the proceedings of the 30th International Conference on Parallel Processing (ICPP 2001).
Publications