The Study and Design of High Availability Monitoring Subsystem for Fault Tolerant Computing Systems
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||High availability; Monitor scheme; TMR|
Fault-tolerant computing systems play an important role in information technology. On the one hand, they are well suited to critical tasks; on the other hand, they offer high availability and provide fast, reliable information processing services. Loss or corruption of information, or an abnormal shutdown of a fault-tolerant computing system, would seriously affect those critical tasks, so such systems are required to operate continuously. This ability is called high availability.

This paper is based on blade server systems and presents the design of a high availability monitoring subsystem. The monitoring subsystem can choose any two blades of a blade server system to form the Leader layer. It uses TMR (triple modular redundancy) technology and makes the Leader layer the core of the high availability system.

Whether the arbitration process succeeds is a main bottleneck for the availability of fault-tolerant computing systems. When both leader blades are healthy, the network services they provide are almost the same as those of a single-module system. The advantage appears only when one server crashes and both arbitration and reconfiguration succeed. If any failure occurs during arbitration, the Leader layer has no advantage over a single-module system.

Based on an analysis of the whole arbitration process, a Markov model is proposed to study how certain parameters influence the availability of the whole system. Considering both active-standby and dual-active systems, we conclude that fault detection and fault diagnosis are critical to system availability.

The research and design work presented in this paper is as follows. Conventional arbitration techniques are studied, and the conflict between these techniques and practical requirements is analyzed.
A high-availability arbitration scheme is proposed to provide hardware support for blade server systems. The hardware design of the high availability monitoring subsystem for fault-tolerant computing systems is presented. Concrete implementation work includes TMR, CPLD, USB switching, and HotSwap, among others.
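
The claim that arbitration success dominates availability can be illustrated with a small Markov sketch. This is not the model from the thesis, whose states and parameters are not given in the abstract. It is a minimal three-state continuous-time chain under assumed rules: each blade fails at rate `failure_rate`, one repair completes at rate `repair_rate`, and `coverage` is the assumed probability that arbitration and reconfiguration succeed after the first failure. All function and parameter names are hypothetical.

```python
def duplex_availability(failure_rate, repair_rate, coverage):
    """Steady-state availability of an assumed two-blade Leader layer.

    States (illustrative assumption, not taken from the thesis):
      2 = both blades up, 1 = one blade up after successful arbitration,
      0 = service down.
    Transitions:
      2 -> 1 at 2*failure_rate*coverage        (failure, arbitration succeeds)
      2 -> 0 at 2*failure_rate*(1 - coverage)  (failure, arbitration fails)
      1 -> 0 at failure_rate                   (remaining blade fails)
      1 -> 2 and 0 -> 1 at repair_rate         (one repair completes)
    """
    if failure_rate <= 0 or repair_rate <= 0:
        raise ValueError("rates must be positive")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")

    r = failure_rate / repair_rate
    # Solve the balance equations relative to p2 = 1, then normalize.
    p2 = 1.0
    p1 = 2.0 * r * p2                             # balance of state 2
    p0 = 2.0 * r * (1.0 - coverage) * p2 + r * p1  # balance of state 0
    return (p2 + p1) / (p2 + p1 + p0)


def single_availability(failure_rate, repair_rate):
    """Steady-state availability of one repairable module."""
    return repair_rate / (failure_rate + repair_rate)


if __name__ == "__main__":
    lam, mu = 1e-3, 1e-1  # assumed rates per hour, for illustration only
    print("single module:", single_availability(lam, mu))
    for c in (0.0, 0.9, 0.99, 1.0):
        print("coverage", c, "->", duplex_availability(lam, mu, c))
```

In this sketch, availability rises monotonically with `coverage`. With `coverage = 0` the duplex configuration is no better than a single module (here slightly worse, since two blades can each trigger an outage), which matches the abstract's observation that the Leader layer only pays off when arbitration and reconfiguration succeed.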