Architectural vulnerability factor estimation through fault injections
|Keywords:||architectural vulnerability factor; fault injection; GPU; reliability; soft errors; transient faults|
|Full text PDF:||http://hdl.handle.net/2047/D20213017|
Given their large number of processing cores and impressive parallel processing capabilities, Graphics Processing Units (GPUs) have become the accelerator of choice across multiple domains. GPUs can accelerate a wide range of applications, including scientific computing, bioinformatics, and financial applications. Their presence in the world's fastest supercomputers has grown steadily over the last few years.

With technology scaling, soft-error reliability has become a major issue for hardware designers. A soft error is a non-permanent fault in which a bit flip occurs in a latch or memory cell. A recent study by the Department of Energy identified soft errors as one of the top 10 barriers to exascale computing. The architecture research community needs to pursue solutions to the challenges presented by the growing presence of soft errors. While some soft errors will not cause an error at the output of a program, many will corrupt vulnerable program state. Since GPUs are increasingly used for general-purpose computing rather than just graphics, their reliability has become a concern. An important first step in tackling soft errors in GPUs is therefore to assess the impact of soft errors and the robustness of GPUs in the presence of these faults.

In this thesis, we investigate this question using fault injection on the AMD Evergreen family of GPUs, injecting bit flips with a detailed architectural simulator. Our results indicate that a GPU can be highly resilient to soft errors. We present a study of trends that appear in common GPU programs when soft errors occur in the GPU memory hierarchy. These trends can inform programmers, as well as system designers, when making decisions about how to increase the reliability of GPU software and hardware.
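
The general idea of estimating vulnerability through fault injection can be illustrated with a minimal Python sketch. It is not the simulator or methodology used in the thesis: the workload `toy_kernel`, the helper `flip_bit`, and the function `estimate_vulnerability` are hypothetical names chosen for illustration. The sketch flips one random bit in a small memory image per trial, reruns the workload, and reports the fraction of injections that change the output relative to a fault-free ("golden") run.

```python
import random


def toy_kernel(memory):
    # Hypothetical workload: only even-indexed words affect the output,
    # so odd-indexed words act as dead (non-vulnerable) state.
    return sum(memory[0::2]) & 0xFFFFFFFF


def flip_bit(word, bit):
    # Model a single-bit soft error by XOR-ing one bit position.
    return word ^ (1 << bit)


def estimate_vulnerability(memory, trials, seed=0, word_bits=32):
    # Fraction of single-bit injections that corrupt the program output.
    rng = random.Random(seed)
    golden = toy_kernel(memory)  # fault-free reference output
    corrupted = 0
    for _ in range(trials):
        faulty = list(memory)  # fresh copy for each injection
        idx = rng.randrange(len(faulty))
        bit = rng.randrange(word_bits)
        faulty[idx] = flip_bit(faulty[idx], bit)
        if toy_kernel(faulty) != golden:
            corrupted += 1
    return corrupted / trials


if __name__ == "__main__":
    memory = list(range(16))  # assumed 16-word memory image
    print(estimate_vulnerability(memory, trials=2000))
```

In this toy setting roughly half of the injections land in words the workload never reads, so they are masked and do not reach the output. This is the kind of effect a fault-injection campaign is meant to quantify.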