SELSE Workshop

Resilience for ExaScale


Large scientific computers and data centers share a need for resilience. Systems with 10^5 nodes and failure rates of 1000 FITs/node experience a failure about every 10 hours but must provide an application mean time to interrupt (MTTI) of weeks. To provide resilience in such systems, we can exploit the properties of different components. Memory and communication channels can be inexpensively checked using codes, while arithmetic operations are relatively inexpensive to duplicate. Exposing the resilience mechanisms to the software enables the application to describe what state needs to be preserved (vs. being easily reconstructed) and what operations need to be checked in hardware (vs. software). This talk will sketch the resilience issues for ExaScale systems and point out some open challenges.
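The 10-hour figure in the abstract follows from the definition of a FIT (one failure per 10^9 device-hours). The short sketch below reproduces that arithmetic and estimates what fraction of failures would have to be masked to reach a given MTTI. It assumes independent node failures with constant rates; the function names and the two-week MTTI target are illustrative assumptions, not values from the talk.

```python
HOURS_PER_FIT_UNIT = 1e9  # one FIT = one failure per 10^9 device-hours


def system_mtbf_hours(nodes: int, fits_per_node: float) -> float:
    """System-wide mean time between failures, in hours.

    Assumes independent nodes with constant failure rates, so the
    per-node FIT rates simply add up.
    """
    system_fits = nodes * fits_per_node
    return HOURS_PER_FIT_UNIT / system_fits


def required_masking_fraction(mtbf_hours: float, target_mtti_hours: float) -> float:
    """Fraction of failures that must be hidden from the application
    so that the remaining, visible failures occur only once per target MTTI.
    """
    return 1.0 - mtbf_hours / target_mtti_hours


# Figures from the abstract: 10^5 nodes at 1000 FITs per node.
mtbf = system_mtbf_hours(nodes=10**5, fits_per_node=1000)

# Assumed target: the abstract says "weeks"; two weeks is used here for illustration.
target_mtti = 14 * 24
masking = required_masking_fraction(mtbf, target_mtti)

print(f"System MTBF: {mtbf:.1f} hours")
print(f"Failures to mask for a {target_mtti}-hour MTTI: {masking:.1%}")
```

Under these assumptions, roughly 97% of raw failures would need to be detected and recovered without interrupting the application, which is the gap the checking and duplication mechanisms are meant to close.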



Bill is Chief Scientist and Senior Vice President of Research at NVIDIA Corporation and a Professor (Research) and former chair of Computer Science at Stanford University. Bill and his group have developed system architecture, network architecture, signaling, routing, and synchronization technology that can be found in most large parallel computers today. While at Bell Labs, Bill contributed to the BELLMAC32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered wormhole routing and virtual-channel flow control. At the Massachusetts Institute of Technology his group built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low-overhead synchronization and communication mechanisms. At Stanford University his group has developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations; the Merrimac supercomputer, which led to GPU computing; and the ELM low-power processor. Bill is a Member of the National Academy of Engineering, a Fellow of the IEEE, a Fellow of the ACM, and a Fellow of the American Academy of Arts and Sciences. He has received the ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, and the ACM Maurice Wilkes Award. He currently leads projects on computer architecture, network architecture, circuit design, and programming systems. He has published over 200 papers in these areas, holds over 90 issued patents, and is an author of the textbooks Digital Design: A Systems Approach, Digital Systems Engineering, and Principles and Practices of Interconnection Networks.