SELSE Workshop

Hardware errors at Cloud-scale: Ignore, Retry, or Abort?

Operating at Cloud-scale, with hundreds of thousands of machines serving web-facing transactions, requires a completely different methodology for dealing with hardware reliability issues. Failures run the gamut from soft and hard errors in silicon components to failures of datacenter components, and they affect service availability. The talk will highlight these challenges and their implications for infrastructure design, and discuss research areas for failure characterization and mitigation.


Mark Shaw is Director of Hardware Engineering for the Cloud Server Infrastructure Engineering team in Microsoft’s Global Foundation Services (GFS) group. He is responsible for defining and designing next-generation server architectures for compute, storage and networking. Prior to 2008, Mark was a Distinguished Technologist at Hewlett-Packard, leading systems architecture and development of HP’s highest-end mainframe platforms. Mark holds over 25 US and international patents in server systems architecture and design. He holds a BS in Electrical and Computer Engineering from the University of Wisconsin.