Exploiting Redundancy and Asynchrony in Forward Exact Recoveries for Iterative Solvers

This is a thesis for a professional degree at advanced level from KTH/School of Information and Communication Technology (ICT)

Author: Luc Jaulmes; [2014]


Abstract: This report presents a method for recovering from hardware-detected faults in numerical iterative solvers. By exploiting the redundancy inherent to an iterative solver instead of adding redundancy, we can interpolate lost data and thus devise an exact recovery scheme, which does not compromise the mathematical convergence properties of the solver as restart-based methods would. We rely on a task-based programming model to overlap recovery with the continuation of normal computation. Results show a low overhead when no faults are injected, which could be reduced further with better lower-level support for application-level resilience, and exceptional performance when faults are injected, even under extremely high fault injection rates. This is a major improvement over checkpoint-based recovery methods, and progress towards the goal of resilient and asynchronous HPC methods for exascale computing.
