High Productivity OpenMP for Distributed Shared Memory Architecture
|School||National University of Defense Technology|
|Major||Computer Science and Technology|
|Keywords||High-productivity OpenMP; language extensions; two-stage data prefetching; checkpoint/restart; low-power optimization|
The high-end computing development today, has committed to achieve the high performance of the system to improve system performance, programmability, portability and robustness, while reducing system development, operation, and maintenance costs from a single pursuit of high-performance steering . The high-performance computer systems can not be separated from efficient programming environment, especially the next one hundred trillion times petaflop computer system-oriented application is multi-disciplinary and multi-scale and complexity of these applications require various disciplines scientists and software specialists to design, manage and maintain applications. Put forward higher requirements for the participation of experts in various disciplines of the programming environment performance, programmability, portability, and fault tolerance. The OpenMP easy programming to support incremental programming mode, maintainability and portability and high, for a long time in the future will continue to be the mainstream parallel programming language. The paper tightly around high performance for massively parallel system development OpenMP programming environment that theme, large-scale distributed shared memory (Distributed Shared Memory, DSM) system OpenMP implementation of key technologies for the DSM system OpenMP language extensions compile guidance data prefetching, OpenMP check points / renewal of operator technology as well as low-power-oriented OpenMP optimization study made innovative achievements: 1, for large-scale parallel computer architecture, designed and implemented the OpenMP parallel compiler CCRGOpenMP. Compiled and OpenMP shared data link synergistic placement policy, not only to overcome the shortcomings of the need to explicitly allocate shared memory in a distributed operating system, and checkpoint data locality optimization provides a strong support. 
The OpenMP implementation applies a large number of source-level optimizations to improve program performance. On our SCCMP system, the performance of CCRG OpenMP on scientific computing and simulation programs is comparable to that of an SGI Altix using the latest Intel 9.1 compiler. 2. Proposed two new OpenMP directives, BARRIER(thread_id) and ALLREDUCTION, which reduce the overhead of barrier synchronization and global reduction in OpenMP parallel programs, and implemented the algorithms behind the new directives. On a real particle-cloud scientific computing program, they improve performance by 76% with 64 threads. 3. Proposed a two-stage compiler-directed data prefetching algorithm for OpenMP, which overcomes the inaccurate prefetching caused by the differing latencies of remote and local memory accesses on DSM systems, and built a static performance model to evaluate prefetching algorithms. On the SCCMP system, the two-stage prefetching algorithm improves the performance of the SPEC OMP2001 swim benchmark by 14% with 32 threads and by 9% with 64 threads. 4. Established an OpenMP checkpoint/restart mechanism that coordinates the system level and the application level, and designed a blocking checkpoint protocol for OpenMP. Based on this mechanism, implemented a checkpoint/restart system for CCRG OpenMP. The system fully supports the OpenMP 2.0 API and has good scalability and practical value. 5. Studied power optimization techniques for OpenMP. For parallel systems whose nodes support Dynamic Voltage Scaling (DVS), proposed three low-power optimization methods and their implementation algorithms. Based on worst-case execution time analysis of the synchronization segments of an OpenMP program, proposed a DVS power optimization method driven by worst-case execution time.
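The core idea of two-stage prefetching, issuing a long-distance prefetch to hide the large remote-memory latency and a short-distance prefetch to hide the small local latency, so that a single prefetch distance is never wrong for one of the two access classes, can be sketched with compiler prefetch intrinsics. This is an illustrative sketch only: the distances `D_REMOTE` and `D_LOCAL` are made-up placeholder values, not the parameters the thesis derives from its static performance model.

```c
#include <stddef.h>

/* Illustrative two-stage software prefetch.  Stage 1 runs far ahead
 * to cover the long remote-memory latency of a DSM system; stage 2
 * runs a short distance ahead to cover the local-memory latency.
 * D_REMOTE and D_LOCAL are hypothetical values for illustration. */
#define D_REMOTE 64   /* elements ahead for the long-latency stage  */
#define D_LOCAL   8   /* elements ahead for the short-latency stage */

double scaled_sum(const double *a, size_t n, double k) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + D_REMOTE < n)
            __builtin_prefetch(&a[i + D_REMOTE], 0, 0); /* stage 1 */
        if (i + D_LOCAL < n)
            __builtin_prefetch(&a[i + D_LOCAL], 0, 3);  /* stage 2 */
        sum += k * a[i];
    }
    return sum;
}
```

The prefetches are hints only, so the function computes the same result with or without them; in the thesis the distances are chosen by the compiler from the static performance model rather than fixed by hand.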
Using synchronization segments as the unit of analysis and voltage scaling effectively avoids the impact that load imbalance at barrier synchronization has on program execution time and power consumption. An energy consumption model is established, and simulation results show that, for OpenMP parallel applications, these power optimization techniques can effectively reduce the energy consumed by a parallel system while running OpenMP programs.
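The reasoning behind slowing down synchronization segments with slack can be made concrete with the generic DVS energy relation (a sketch of the standard CMOS model, not necessarily the thesis's exact formulation):

```latex
% Dynamic CMOS power at supply voltage V and clock frequency f:
P_{\mathrm{dyn}} = C_{\mathrm{eff}}\, V^{2} f, \qquad f \propto V .
% Running a segment of fixed work at a scaled frequency f' = s f
% (0 < s \le 1), and hence at a proportionally scaled voltage sV:
t' = \frac{t}{s}, \qquad
E' = C_{\mathrm{eff}} (sV)^{2} (s f)\, t' = s^{2} E .
```

Energy falls quadratically in the scaling factor while execution time grows only linearly, so a thread's segment that would otherwise wait idle at a barrier can be run slower at a net energy saving without lengthening the program's critical path.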