Last month, software tools vendor Atlassian suffered a major network outage that lasted two weeks and affected more than 400 of their over 200,000 customers. It is rare that a vendor who has been hit with such a massive and public outage takes the effort to thoughtfully piece together what happened and why, and also provide a roadmap that others can learn from as well.
In a post on their blog last week, they describe their existing IT infrastructure in careful detail, point out the deficiencies in their disaster recovery program, how to fix its shortcomings to prevent future outages, and describe timelines, workflows and ways they intend to improve their processes. I wrote an op/ed for Network World that gleans the four takeaways for network and IT managers.