Last October the gaming company Roblox's online network went down, an outage that lasted three days. The site is used by 50M gamers daily. Finding and fixing the root causes of this disruption took a massive effort by engineers at both Roblox and its main tech supplier, HashiCorp. Roblox eventually posted an amazing analysis in a blog post at the end of January. The company got bitten by a strange coincidence of several events. The process its engineers went through to diagnose and ultimately fix things is instructive for readers doing similar projects, especially if you are running any large-scale IaC installations or are a heavy user of containers and microservices across your infrastructure.
There are a few things to be learned from the Roblox outage, which I discuss in my latest story for InfoWorld.