Research on Key Techniques of High Productivity GPGPU Architecture
|School||National University of Defense Science and Technology|
|Course||Electronic Science and Technology|
|Keywords||GPGPU, Load Balance, Resource Configuration, Power Model, Fault Tolerant Parallel Algorithm, Cost-Efficient Fault Tolerance|
The continuous evolution of processor architecture is driven by the rapid development of VLSI technology and the demands of emerging applications. TLP (Thread Level Parallelism) and DLP (Data Level Parallelism) are increasingly important in processor architecture design. Amid growing research on multi-core and many-core design, the GPGPU (General Purpose Graphics Processing Unit) has emerged as a throughput-oriented processor that integrates abundant parallel computing resources on chip to exploit TLP and DLP. Large numbers of concurrent threads are organized hierarchically on GPGPU, and a traditional cache hierarchy and distributed scratchpad memory are employed to support different memory access patterns. Hence, applications in high performance computing and scientific computing can take advantage of the massively parallel computing capabilities of GPGPU. However, new challenges arise with the rapid development of GPGPU, such as low utilization of computing resources, high power consumption, and low reliability. Research on these problems is still at a preliminary stage, and a large part of the GPGPU design space remains unexplored.

This thesis studies the GPGPU architecture and its development platform in depth, and then details our research on several key techniques, including mapping and optimization techniques, load balance strategy, architectural power modeling, fault tolerant parallel algorithm design, and cost-efficient fault tolerant memory design. The primary innovative contributions of this thesis are listed as follows:

1. We present a strategy to choose the optimal configuration ratio between computing resources and memory access bandwidth on GPGPU. GPGPU integrates plenty of parallel computing resources on chip, and high memory access bandwidth is needed to satisfy the data demands of these computing resources.
We investigate the configuration ratio between computing resources and memory controllers in the GPGPU architecture, and employ a heuristic search strategy to analyze the impact of this ratio on application performance. Based on the analysis, we further use coarse-grained configuration ratios to test benchmarks with different memory access characteristics. Experimental results show that selecting the configuration ratio according to the compute-to-memory-access characteristics of each application can provide energy-efficient speedup on GPGPU.

2. We propose a system-level task division strategy based on stream computing. GPGPU employs abundant memory resources and a flexible memory hierarchy to support different memory access patterns and to relieve the pressure on the Front Side Bus. First, we use loop unrolling and prefetching to raise the ratio of computing operations to memory accesses and to increase data reuse, avoiding the high latency of off-chip memory access. Then we wrap the related transfer and computing operations into several streams to overlap kernel execution with data transfer between CPU and GPU. Finally, when dividing the whole application into tasks that run concurrently on different processors in the system, we determine the most appropriate division factor according to the measured performance of each computing device. Together, these steps form the system-level task division strategy based on stream computing.

3. We map the HPL (High Performance Linpack) benchmark onto GPGPU and optimize it to obtain a speedup. HPL is the most important evaluation criterion, widely used in performance testing of supercomputers and massively parallel systems. Matrix multiplication and LU decomposition are the key parts of the HPL benchmark, and matrix multiplication accounts for most of the computation in HPL.
We first wrap the matrix multiplication function call in the HPL benchmark, and then divide the task for parallel execution on CPU and GPGPU. Next, we employ loop unrolling, prefetching, and stream computing strategies to hide global memory access latency and to reduce the cost of data transfer between CPU and GPGPU. According to the measured performance of the different computing devices, we adjust the division factor, matrix dimension, and block size to achieve the best performance.

4. We develop an architectural model for GPGPU power estimation using empirical technology data. Although GPGPU is much more energy-efficient than CPU in general purpose computing, its high power consumption causes a series of problems, such as increased chip manufacturing and cooling costs and reduced system stability. We first evaluate various GPGPU power estimation methods. Then we develop an architectural power estimation model for abstract GPGPU micro-architectures based on empirical technology data. Finally, we integrate the power model into the GPGPU performance simulator and validate its accuracy.

5. We explore different design patterns of parallel fault tolerant algorithms on GPGPU. Since graphics applications tolerate transient errors well, reliability has not been a concern for traditional GPUs. However, with the increasing reliability demands of scientific computing applications, soft errors are already causing noticeable problems for GPGPU. Considering the abundant hardware redundancy and the hierarchical organization of executing threads on GPGPU, we propose and implement four fault tolerance schemes: one based on simple redundant computing, one based on parallel error checking, a thread block level scheme based on task partitioning, and one based on stream computing patterns. We take advantage of the on-chip resources of GPGPU to reduce data transfers while meeting the reliability goal.

6.
We propose a cost-efficient fault tolerant technique for memories. AVF (Architectural Vulnerability Factor) is often used to measure processor reliability, and has been shown to vary significantly at run time. AVF-aware dynamic fault tolerant techniques provide selective soft error protection for processor structures according to online AVF values, thus potentially reducing the overhead of fault tolerance while still meeting the reliability goal. We propose a BART (Bayesian Additive Regression Trees) based AVF prediction model for memories, and integrate the model into AVF-aware ECC techniques. According to the predicted online AVF values of memories, we provide ECC protection only at execution points with high AVF. The AVF-aware ECC technique trades off performance, power, and reliability, and is shown to be a good candidate for cost-efficient soft error protection.
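
The selective protection idea in the last contribution can be illustrated with a minimal sketch. It is not the thesis implementation: the fixed threshold policy, the per-interval granularity, the names `plan_ecc` and `EccPlan`, and the `residual_avf` metric are all assumptions made for illustration, and the predicted AVF values are passed in as a plain list rather than produced by a BART model. The sketch also assumes that ECC fully masks soft errors in the intervals where it is enabled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EccPlan:
    protect: List[bool]        # per-interval decision: True means ECC is enabled
    protected_fraction: float  # share of intervals that pay the ECC cost
    residual_avf: float        # mean predicted AVF left unprotected (illustrative metric)


def plan_ecc(predicted_avf: List[float], threshold: float) -> EccPlan:
    """Enable ECC only for intervals whose predicted AVF reaches the threshold.

    Hypothetical policy: the thesis states that ECC is applied at execution
    points of high AVF, but does not give the decision rule used here.
    """
    if not predicted_avf:
        raise ValueError("predicted_avf must not be empty")
    protect = [avf >= threshold for avf in predicted_avf]
    # Sum the vulnerability of the intervals that run without ECC.
    unprotected = sum(avf for avf, on in zip(predicted_avf, protect) if not on)
    return EccPlan(
        protect=protect,
        protected_fraction=sum(protect) / len(protect),
        residual_avf=unprotected / len(predicted_avf),
    )


if __name__ == "__main__":
    plan = plan_ecc([0.1, 0.6, 0.8, 0.2], threshold=0.5)
    print(plan.protect, plan.protected_fraction, plan.residual_avf)
```

Raising the threshold in this sketch lowers `protected_fraction` (less ECC overhead) and raises `residual_avf` (more exposure), which is the performance, power, and reliability trade-off described above.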