New Approach to Fault Tolerance Means More Efficient High-Performance Computers

The Science

The hardware in high-performance computing (HPC) systems is incredibly complex. These computers can have millions of cores, or processors. This creates many chances for small system problems, even something as minor as a bad wire, to affect HPC-based simulations and calculations. Computer scientists call the challenge of coping with such problems fault tolerance. Researchers have developed a new approach to fault tolerance that requires less time and less computer power to run than traditional fault tolerance solutions. The new approach is known as coded computing, or algorithm-based fault tolerance. It involves building procedures for detecting faults and correcting errors that are specific to particular numerical algorithms.
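
To make the idea of algorithm-specific fault detection concrete, here is a minimal sketch of the classic checksum scheme for matrix multiplication, a textbook form of algorithm-based fault tolerance. It is not the method developed in this research. The function names (`encode`, `detect_and_correct`) and the tiny integer matrices are illustrative assumptions, and the sketch handles only a single corrupted entry.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encode(a, b):
    """Append a column-sum row to A and a row-sum column to B (hypothetical helper)."""
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    return a_c, b_r


def detect_and_correct(c_full):
    """Find and repair one corrupted data entry in a checksummed product.

    Returns the (row, column) that was repaired, or None if all checks pass.
    Exact comparison is used because the demo is integer-only; floating-point
    data would need a tolerance instead.
    """
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Row checksum minus the other entries in that row gives the true value.
        others = sum(c_full[i][:n_cols]) - c_full[i][j]
        c_full[i][j] = c_full[i][n_cols] - others
        return (i, j)
    raise ValueError("more than one error; this sketch cannot correct it")


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
A_c, B_r = encode(A, B)
C_full = matmul(A_c, B_r)   # product carries its own row and column checksums
C_full[1][0] += 7           # simulate a hardware fault in one entry
location = detect_and_correct(C_full)
C = [row[:2] for row in C_full[:2]]  # strip checksums to get the repaired product
```

The checks work only because matrix multiplication preserves row and column sums, which is what makes this kind of fault tolerance specific to the algorithm rather than general-purpose.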

The Impact

The new approach, called 3D Coded SUMMA, is a novel algorithm for resilient and efficient parallel matrix multiplication in HPC systems. Matrix multiplication is an important tool for research questions with many variables. The algorithm performs parallel matrix multiplication and can recover from node failures. It achieves this fault tolerance through a method called coded computation. The approach requires 50 percent less redundancy and much less computer time than traditional methods. This means 3D Coded SUMMA addresses an important challenge facing HPC systems: the need for fault tolerance that works efficiently with these incredibly complex computer systems. The new approach also applies the latest advances in coded computing to failure-tolerant computing, opening up a new area of research. Finally, this approach could allow for larger and longer time-scale simulations of climate models and clean energy technologies than are possible today.
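
How a coded computation can recover from node failures is easier to see in a toy example. The sketch below illustrates MatDot codes, the code construction that 3D Coded SUMMA builds on, in a single process with simulated workers. It is an illustration under stated assumptions, not the researchers' implementation: it does not model 3D SUMMA, its communication pattern, or the reported redundancy and timing figures. The helper names, the choice of evaluation points 1 through 5, and the 2x2 integer matrices are all hypothetical. With A split into m column blocks and B into m row blocks, each worker multiplies one evaluation of two matrix polynomials, and the product A times B is the coefficient of x^(m-1) in the result, so any 2m - 1 surviving workers are enough.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encode(a, b, m, x):
    """Evaluate pA(x) = sum_i A_i x^i and pB(x) = sum_j B_j x^(m-1-j).

    A_i are column blocks of A and B_j are row blocks of B.
    Assumes m divides the inner dimension.
    """
    w = len(b) // m
    pa = [[sum(row[i * w + c] * x ** i for i in range(m)) for c in range(w)]
          for row in a]
    pb = [[sum(b[j * w + r][c] * x ** (m - 1 - j) for j in range(m))
           for c in range(len(b[0]))] for r in range(w)]
    return pa, pb


def poly_mul_linear(coeffs, root):
    """Multiply a polynomial (lowest degree first) by (x - root)."""
    out = [Fraction(0)] * (len(coeffs) + 1)
    for d, c in enumerate(coeffs):
        out[d + 1] += c
        out[d] -= c * root
    return out


def decode(results, m):
    """Recover A @ B from any 2m - 1 {x: product} pairs by Lagrange interpolation."""
    pts = list(results.items())[: 2 * m - 1]
    if len(pts) < 2 * m - 1:
        raise ValueError("too few surviving workers to decode")
    rows, cols = len(pts[0][1]), len(pts[0][1][0])
    c = [[Fraction(0)] * cols for _ in range(rows)]
    for xk, yk in pts:
        basis, denom = [Fraction(1)], Fraction(1)
        for xj, _ in pts:
            if xj != xk:
                basis = poly_mul_linear(basis, xj)
                denom *= xk - xj
        # Weight of worker k in the coefficient of x^(m-1).
        weight = basis[m - 1] / denom
        c = [[c[i][j] + weight * yk[i][j] for j in range(cols)]
             for i in range(rows)]
    return c


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
m = 2                      # recovery threshold is 2m - 1 = 3 workers
points = [1, 2, 3, 4, 5]   # five simulated workers, so two may fail

results = {}
for x in points:
    pa, pb = encode(A, B, m, x)
    results[x] = matmul(pa, pb)   # the work one worker would do

survivors = {x: y for x, y in results.items() if x not in (2, 4)}
recovered = decode(survivors, m)  # equals matmul(A, B) despite two failures
```

In this toy setup it does not matter which two of the five workers fail, because every worker's result carries information about the whole product rather than a fixed piece of it. Exact rational arithmetic is used here to keep the demo simple; a floating-point implementation would have to manage the numerical conditioning of the interpolation.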


A team of researchers has developed a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded Scalable Universal Matrix Multiplication Algorithm (SUMMA) that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This research bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing. Coded computing shares its fundamental concept with traditional algorithm-based fault tolerance: weaving redundancy into the computation by using error-correcting codes. The algorithm integrates MatDot codes, an innovative code construction for parallel matrix multiplication, into 3D SUMMA in a communication-avoiding manner. To tolerate any two node failures, 3D Coded SUMMA requires 50% less redundancy than replication, while the overhead in execution time is only about 5 to 10%.

Funding

This research was funded by the Department of Energy Office of Science, Advanced Scientific Computing Research program, and partially supported by the NSF.