Author 1
  Name: Teresa Davies
  Org: Colorado School of Mines
  Country: United States
  Email: tdavies@mines.edu

Author 2
  Name: Zizhong Chen
  Org: Colorado School of Mines
  Country: United States
  Email: zchen@mines.edu

In parallel applications, the probability that a failure occurs before the computation finishes grows with the number of processors used. For some applications, fault tolerance is essential to avoid losing all of the work completed before the failure. One of the most effective techniques is diskless checkpointing, in which a periodic snapshot of the process states is saved in another location. Unfortunately, the overhead of diskless checkpointing is proportional to the amount of data changed between checkpoints, and for matrix operations that overhead is high. A lower-overhead way of making matrix operations fault tolerant is therefore desirable.

In this research, we adapt a technique for checking the outcome of matrix operations so that it provides low-overhead fault tolerance for a variety of matrix operations. The technique, called algorithm-based fault tolerance (ABFT), appends a checksum to the matrix; the checksum is maintained by the operation and can be used to verify the result at the end of the calculation. We use this idea to show that, for some algorithms that compute matrix operations, the checksum is also maintained at each step of the algorithm. Although this does not hold for every algorithm, it does hold for some that are widely used. Using the checksum at the end of the calculation is not new; our contribution is using the checksum to recover from failures in the middle of the calculation. The right-looking LU decomposition used in HPL maintains a checksum at each step, so we have added fault tolerance to HPL with this method. Our approach is expected to significantly outperform checkpointing: previous work on this project, which added a checksum to matrix multiplication and Cholesky decomposition, showed that the overhead is significantly lower. Moreover, the overhead as a fraction of the total time decreases as the problem size increases, whereas for checkpointing that fraction remains constant.
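To make the baseline concrete, below is a minimal sketch of diskless checkpointing in its common parity-encoded form, written in NumPy; the process count, state sizes, and XOR encoding are illustrative assumptions, not details taken from this work. A spare copy holds the bitwise XOR of all process states, so any single lost state can be rebuilt from the parity and the survivors; refreshing the parity costs time proportional to the data changed since the last checkpoint, which is the overhead discussed above.

```python
# A minimal sketch of parity-based diskless checkpointing (illustrative,
# not this paper's method): a spare "checkpoint" copy stores the XOR of
# all process states; any one lost state is rebuilt from the rest.
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)
states = [rng.integers(0, 256, size=8, dtype=np.uint8) for _ in range(4)]

# Take a checkpoint: XOR of every process's state.
parity = reduce(np.bitwise_xor, states)

# Process 2 fails; rebuild its state from the parity and the survivors.
lost = states[2]
recovered = reduce(np.bitwise_xor,
                   (s for i, s in enumerate(states) if i != 2),
                   parity)
assert np.array_equal(recovered, lost)
```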
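The end-of-calculation check that ABFT performs can be illustrated with matrix multiplication, the classic Huang and Abraham example. The sketch below, in NumPy with illustrative sizes, encodes a column-checksum row into A and a row-checksum column into B; the product of the encoded matrices then carries checksums of C = A B that verify the final result.

```python
# A minimal sketch of the classic ABFT checksum check for C = A @ B.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.random((n, n))
B = rng.random((n, n))

# Append a column-checksum row to A and a row-checksum column to B.
Ac = np.vstack([A, A.sum(axis=0)])                  # (n+1) x n
Bc = np.hstack([B, B.sum(axis=1, keepdims=True)])   # n x (n+1)

# The product of the encoded matrices is the encoded product: the last
# row and column of Cc are checksums of C = A @ B.
Cc = Ac @ Bc
C = Cc[:n, :n]

# Verify the result at the end of the calculation.
assert np.allclose(Cc[n, :n], C.sum(axis=0))   # column checksums hold
assert np.allclose(Cc[:n, n], C.sum(axis=1))   # row checksums hold
```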
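The claim that the right-looking algorithm maintains the checksum at each step can be checked on a small example. The sketch below runs right-looking Gaussian elimination without pivoting (a simplifying assumption; HPL itself uses partial pivoting) on a checksum-augmented matrix, treating the checksum row exactly like any other row below the pivot, and asserts after every elimination step that the checksum row still equals the sum of the rows that have not yet been factored.

```python
# A minimal sketch: right-looking elimination on a checksum-augmented
# matrix preserves the column-checksum row at every step.
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant, so no pivoting

# Augment A with a checksum row holding the column sums of A.
M = np.vstack([A, A.sum(axis=0)])        # (n+1) x n

for k in range(n - 1):
    # Eliminate below the pivot, updating the checksum row (row n)
    # like any other row below the pivot: this is the key point.
    for i in range(k + 1, n + 1):
        l = M[i, k] / M[k, k]
        M[i, k:] -= l * M[k, k:]
    # Invariant: the checksum row equals the sum of the rows that
    # have not yet been factored (rows k+1 .. n-1).
    assert np.allclose(M[n], M[k + 1:n].sum(axis=0))
```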
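That per-step invariant is what enables recovery in the middle of the calculation. The following self-contained sketch, under the same simplifying assumptions, performs one elimination step, simulates the loss of one trailing row (the failed row index is hypothetical), and rebuilds it from the checksum row and the surviving rows, with no checkpoint involved.

```python
# A minimal sketch of mid-computation recovery from the checksum row.
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.random((n, n)) + n * np.eye(n)
M = np.vstack([A, A.sum(axis=0)])        # checksum-augmented matrix

# Perform the first elimination step (k = 0), checksum row included.
for i in range(1, n + 1):
    M[i, :] -= (M[i, 0] / M[0, 0]) * M[0, :]

# Simulate losing trailing row r, then rebuild it as the checksum row
# minus the sum of the surviving trailing rows.
r = 2                                    # hypothetical failed row
lost = M[r].copy()
M[r] = 0.0
survivors = [i for i in range(1, n) if i != r]
M[r] = M[n] - M[survivors].sum(axis=0)
assert np.allclose(M[r], lost)
```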