Research on Survivability Enhancing Techniques of Grid Applications
|School||Harbin Institute of Technology|
|Course||Computer System Architecture|
|Keywords||Grid system; Grid application; Survivability; Schedule; Failure detection; Replication protocol|
The emergence and development of the Grid provide large numbers of computing resources for large-scale applications. However, the dynamic and complex nature of Grid systems leads to a higher failure rate of Grid resources than in traditional distributed systems. This poses great challenges for executing Grid applications in a Grid environment: tasks allocated to Grid resources may be halted when those resources fail. Large-scale applications, which require many resources and run for a long time, are especially likely to be prevented from executing normally by resource failures. This dissertation therefore addresses the problem of how to keep applications executing normally in a complex Grid system. It applies survivability theory to the Grid system and proposes the concept of Grid application survivability. This research on survivable Grid applications is significant for the development and application of Grid technologies. The main research topics of this dissertation are as follows.

The first part of this dissertation introduces the research background of Grid systems and Grid applications, analyzes the challenges faced in executing Grid applications, and clarifies the significance of research on Grid application survivability. It then reviews the state of research on Grid security and system survivability. Current research on Grid security adopts traditional security theory, and current research on system survivability focuses on traditional distributed information systems; there is no systematic research on the survivability of Grid applications.

On this basis, the dissertation introduces the system model, comprising the Grid model, the failure model, and the application model, and defines the survivability of Grid applications.
The dissertation then proposes a survivability analysis method for Grid applications and a survivability life-cycle model of Grid applications, and introduces the key technologies that support survivable Grid applications.

To enable Grid applications to guard against failures of Grid resources, the dissertation proposes a survivability scheduling objective and a cost function that considers survivability and makespan together. Scheduling algorithms that consider both objectives are then proposed for Grid independent-task applications and Grid workflow applications, respectively. These scheduling algorithms prevent Grid tasks from being scheduled onto Grid resources with higher failure rates.

To improve failure detection capability and to reduce both the error rate of failure detection and the detection time, the failure detection mechanism in the Grid environment is studied. Current failure detection algorithms use an adaptive mechanism to adapt to variations in transmission delay and thereby reduce the detection errors caused by those variations. However, they do not consider the loss of detection packets, which causes a high error rate. To solve this problem, a PUSH-and-PULL based failure detection algorithm is proposed. The algorithm is based on the semi-synchronous distributed system model and efficiently reduces the high error rate of failure detection.

Finally, the failure response capability is implemented by a transparent replication mechanism. A message agent mechanism at the level of network message flow and a flexible configuration mechanism are proposed, together with an asynchronous active replication protocol and a failure response protocol; a transparent, general-purpose replication agent is then implemented.
This agent synchronizes the state of the replicas in the replica group and provides failure recovery after the primary replica fails.

Based on the above studies, a Grid application scheduling and management system is designed and implemented using off-line defense and on-line reconfiguration techniques. The system makes efficient use of the survivability-enhancing techniques proposed in the previous chapters. Finally, the efficiency of these Grid survivability-enhancing techniques is verified by executing a real application.
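
The idea of a cost function that considers survivability and makespan together can be illustrated for independent tasks with the minimal sketch below. The abstract does not give the dissertation's actual cost function or algorithms, so everything here is an assumption made for illustration: the weighted-sum form, the exponential failure model, the weight `alpha`, the greedy largest-task-first order, and all names (`Resource`, `cost`, `schedule`).

```python
import math
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    speed: float          # work units per second (hypothetical unit)
    failure_rate: float   # failures per second (assumed exponential model)
    ready_time: float = 0.0  # time at which the resource becomes free


def cost(task_size, res, alpha, horizon):
    """Weighted sum of failure risk and normalized finish time (assumed form)."""
    run_time = task_size / res.speed
    finish = res.ready_time + run_time
    # Probability that the resource fails while the task is running.
    p_fail = 1.0 - math.exp(-res.failure_rate * run_time)
    return alpha * p_fail + (1.0 - alpha) * finish / horizon


def schedule(tasks, resources, alpha=0.5):
    """Greedily map each task to the resource with the lowest cost."""
    # Upper bound on any finish time: all tasks run serially on the slowest resource.
    horizon = sum(tasks.values()) / min(r.speed for r in resources)
    plan = {}
    # Place the largest tasks first.
    for name, size in sorted(tasks.items(), key=lambda kv: -kv[1]):
        best = min(resources, key=lambda r: cost(size, r, alpha, horizon))
        best.ready_time += size / best.speed  # the resource is busy until then
        plan[name] = best.name
    return plan
```

With `alpha` close to 1 the sketch avoids resources with higher failure rates even when they are faster; with `alpha` close to 0 it reduces to a makespan-oriented greedy mapping.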
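
One possible reading of a PUSH-and-PULL failure detector is sketched below: the monitored process pushes periodic heartbeats, and when a heartbeat is overdue the monitor sends explicit pull probes before suspecting a failure, so that a single lost packet does not produce a false suspicion. This reading is an assumption, since the abstract only names the algorithm; the class `PushPullDetector`, its timeouts, and the `max_probes` parameter are hypothetical and are not taken from the dissertation.

```python
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    PROBING = "probing"
    SUSPECTED = "suspected"


class PushPullDetector:
    """Monitor-side state machine; time is passed in explicitly for clarity."""

    def __init__(self, push_timeout, pull_timeout, max_probes=2):
        self.push_timeout = push_timeout  # longest tolerated gap between heartbeats
        self.pull_timeout = pull_timeout  # time to wait for each probe reply
        self.max_probes = max_probes      # probes sent before suspecting
        self.state = State.ALIVE
        self.last_heard = 0.0
        self.probes_sent = 0
        self.probe_deadline = 0.0

    def on_message(self, now):
        """A heartbeat (PUSH) or a probe reply (PULL) proves liveness."""
        self.last_heard = now
        self.probes_sent = 0
        self.state = State.ALIVE

    def tick(self, now, send_probe):
        """Advance the detector; send_probe() emits one PULL query."""
        if self.state is State.ALIVE and now - self.last_heard > self.push_timeout:
            # Heartbeat overdue: it may only have been lost, so ask explicitly.
            self.state = State.PROBING
            self._probe(now, send_probe)
        elif self.state is State.PROBING and now >= self.probe_deadline:
            if self.probes_sent < self.max_probes:
                self._probe(now, send_probe)
            else:
                self.state = State.SUSPECTED
        return self.state

    def _probe(self, now, send_probe):
        self.probes_sent += 1
        self.probe_deadline = now + self.pull_timeout
        send_probe()
```

In this sketch a process is suspected only after the heartbeat timeout and every probe timeout have expired, which trades a somewhat longer detection time for fewer false suspictions under packet loss; the fixed timeouts could be replaced by adaptive estimates of the transmission delay.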