The Research of Fault-Tolerant Techniques for Parallel/Distributed Network Simulator PDNS
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||distributed network simulation; fault-tolerance; checkpoint; socket re-establishment|
Network simulation plays an important role in network behavior analysis and protocol evaluation. PDNS is a popular and widely used parallel/distributed network simulator. Like other typical distributed applications, however, it suffers from weak system reliability. Checkpointing with rollback recovery is a very useful fault-tolerance technique. While the program runs normally, its state is saved to a checkpoint file; when an error brings the program down, the process is reconstructed from the state information stored in that file. The program then continues from its most recent checkpoint, which saves considerable time compared with rerunning the simulation from the beginning.

This paper studies how to improve the reliability of PDNS with checkpointing and rollback recovery. Distributed checkpointing algorithms build on single-process checkpointers, so for PDNS the basic issue is checkpointing one member of the simulation federation. Following an analysis of checkpointers at different implementation levels, a transparent user-level checkpoint based on Condor is implemented on a single PDNS node. Its performance is then measured, and the impact of the number of nodes and links in the network topology on checkpoint overhead and space consumption is discussed.

The next problem in PDNS checkpointing is backing up and re-establishing the links between the members of the simulation federation. PDNS nodes communicate over TCP in a LAN. The internal TCP implementation in Linux is examined first, and a tool is then designed as a kernel module to back up and re-establish the TCP links between the simulating nodes in PDNS.

With these two basic functions implemented, the last question in PDNS fault tolerance is the choice of a suitable distributed checkpointing algorithm.
PDNS uses conservative synchronization in distributed simulation and designates one node, labeled number 0 in libSynk, as the master process. Given these characteristics, the Sync-and-Stop coordinated distributed checkpointing algorithm is chosen to build a prototype fault-tolerant model of PDNS. This paper discusses the key issues and main techniques in PDNS fault tolerance, which help improve the reliability of the distributed simulator.
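
The Sync-and-Stop idea can be illustrated with a minimal single-process sketch in Python. It is not PDNS or libSynk code: the `Member` class, the ring-style message pattern, and all function names are hypothetical, and the four phases follow the commonly described form of the algorithm (stop, drain in-flight messages, checkpoint, resume). In a real deployment the master process (rank 0) would presumably broadcast each phase over the network; here a plain function plays that role.

```python
import copy


class Member:
    """Hypothetical stand-in for one member of the simulation federation."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"sent": 0, "received": 0}
        self.inbox = []          # messages in transit to this member
        self.stopped = False
        self.checkpoint = None

    def step(self, members):
        """Do one unit of work: send one message to the next member in a ring."""
        if self.stopped:
            return
        self.state["sent"] += 1
        target = members[(self.rank + 1) % len(members)]
        target.inbox.append((self.rank, self.state["sent"]))

    def drain(self):
        """Consume every in-flight message so the incoming channel is empty."""
        self.state["received"] += len(self.inbox)
        self.inbox.clear()


def sync_and_stop(members):
    """Coordinated checkpoint, driven by the coordinator (assumed to be rank 0)."""
    for m in members:            # phase 1: everyone stops sending
        m.stopped = True
    for m in members:            # phase 2: flush messages still in transit
        m.drain()
    for m in members:            # phase 3: save local state
        m.checkpoint = copy.deepcopy(m.state)
    for m in members:            # phase 4: resume normal execution
        m.stopped = False


def rollback(members):
    """Restore every member to the last coordinated checkpoint."""
    for m in members:
        m.state = copy.deepcopy(m.checkpoint)
        m.inbox.clear()          # post-checkpoint messages are discarded
        m.stopped = False


def demo():
    members = [Member(r) for r in range(3)]
    for _ in range(2):           # work done before the checkpoint
        for m in members:
            m.step(members)
    sync_and_stop(members)
    for _ in range(3):           # work that will be lost by the failure
        for m in members:
            m.step(members)
    rollback(members)            # simulated failure and recovery
    return members
```

Because all members stop and drain their channels before saving state, the checkpoint contains no message that was sent but not received, so the saved states form a consistent global snapshot and no channel contents need to be logged.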