SELSE Workshop

Emerging challenges in high performance computing: Resilience and the science of embracing failure

Resilience is about keeping the application workload running to a correct solution in a timely and efficient manner in spite of system failures. Future extreme-scale supercomputers are likely to suffer more frequent failures than current systems. As devices scale down, they become more susceptible to upsets due to radiation and to errors due to manufacturing variances. The probability of multiple-bit upsets is growing, since a single event is increasingly likely to impact multiple nearby cells. The use of near-threshold voltage to reduce power consumption also increases error rates. Thus, we can expect more frequent hardware failures and a significant rate of undetected soft errors. While failure-free system hardware and software are desirable, this goal may not be achievable at reasonable cost, because both hardened components and the methodologies used to design and test critical software tend to be extremely expensive. The challenge is to construct a system out of less-than-perfectly reliable hardware and software that nevertheless behaves as a reliable system from the perspective of the user.

Sani R. Nassif

John T. Daly is a computer systems researcher for the Advanced Computing Systems (ACS) Program at the Department of Defense / Center for Exceptional Computing (CEC). He is focused on the problem of keeping supercomputer applications running toward a correct solution in a timely and efficient manner in the presence of system degradations and failures. His research interests include mathematical modeling and analysis of failure, reliability, fault tolerance, calculational correctness, and throughput for applications at extreme scale. Before coming to the CEC, John was a researcher and resilience technical leader in the High Performance Computing (HPC) division at Los Alamos National Laboratory and a software engineer and application analyst for Raytheon Intelligence and Information Systems. He is a nationally recognized expert in resilience with 25 years of experience developing, porting, and running applications as an early adopter of many of the world's fastest supercomputers.