SELSE Workshop


Application oriented fault tolerance system for Tianhe-2


Fault tolerance is one of the major challenges for advanced HPC systems in the post-petascale and exascale era, so innovative, integrated technology designs are needed for new architectures and their associated software stacks. We need to explore the capabilities of the CPU, accelerator, interconnect, and I/O storage system, up to the whole system. Faults cannot be avoided in a system as large as Tianhe-2; what we should do instead is work with the faults as they emerge. This talk analyzes the different types of faults we face and deal with, mainly on Tianhe-2, and demonstrates the mechanisms we use to monitor, isolate, and recover from them. We focus on application-oriented fault tolerance strategies, followed by some large-scale applications that show how they sustain long runs.



Professor Yutong Lu is the Director of the System Software Laboratory, School of Computer Science, National University of Defense Technology (NUDT), Changsha, China. She is also a professor in the State Key Laboratory of High Performance Computing, China. She received her B.S., M.S., and Ph.D. degrees from NUDT. Her extensive research and development experience has spanned several generations of domestic supercomputers in China. During this period, Prof. Lu was the Director Designer for the Tianhe-1A and Tianhe-2 systems, which were internationally recognized as the top-ranked supercomputing system worldwide in November 2010 and June 2013, respectively. Her continuing research interests include parallel operating systems (OS), high-speed interconnect communication, global file systems, and advanced programming environments.