I spent last week visiting a data center tucked into an anonymous office park in Champaign, Ill. The data center is operated by Amdocs, a company that makes its money running managed back-office applications for telecom companies, such as Sprint, Metro PCS, and others. The visit was part of a general press briefing about what Amdocs is doing, but the term “single point of failure” kept coming up.
If you are going to host apps for telecom vendors, you have to know what you are doing in terms of providing uptime. You need redundant everything, from the plug that a router connects to for power to the backup of the backup diesel generator that has to fire up when you lose main AC power from the utility.
Actually, the most impressive part of the tour was the empty “situation rooms” that Amdocs has built. They were empty because there wasn’t any crisis going on – each room is dedicated to a particular customer and is where the account team gathers when they have a problem to work on. Think “24” but with far nerdier people. And that brings up a good point: what is the rest of CTU doing to protect the other 300 million of us who aren’t directly threatened by the current plot? All the action is happening on the main stage. But I digress.
I started thinking about other IT managers I have met over the years who haven’t completely thought through this issue.
There was one manager at a very large financial services firm near Washington, DC, whom I interviewed a few years ago. Gazillions of dollars a day pass through its computer networks, and as you might imagine the firm had three Internet providers – not just two, but three – to provide connectivity. Each provider had a separate path and pole for its line from the firm’s server room. That sounded all well and good until the day a truck collision happened in the Baltimore Harbor Tunnel – a main north-south artery about 50 miles away. Trouble was, all three of the Internet providers’ lines went through that tunnel, and the firm was offline from the Internet until they got things re-routed. Now they have four Internet providers, and they got them to share their route maps (try doing this with yours, and good luck) to make sure there is no single point of failure.
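If you do manage to pry route maps out of your providers, the check itself is simple. Here is a minimal sketch in Python, assuming each route map can be boiled down to a list of named physical segments. The provider names, segment names, and both helper functions are hypothetical, invented purely for illustration; real route maps will be far messier.

```python
from collections import defaultdict


def shared_segments(routes):
    """Map each physical segment to the set of providers whose line crosses it."""
    usage = defaultdict(set)
    for provider, segments in routes.items():
        for segment in segments:
            usage[segment].add(provider)
    return dict(usage)


def single_points_of_failure(routes):
    """Return the segments that every provider depends on.

    Losing any one of these segments takes all of the links down at once.
    """
    providers = set(routes)
    usage = shared_segments(routes)
    return sorted(seg for seg, users in usage.items() if users == providers)


# Hypothetical route maps: every name below is an assumption for illustration.
routes = {
    "provider_a": ["pole_1", "harbor_tunnel", "exchange_north"],
    "provider_b": ["pole_2", "harbor_tunnel", "exchange_east"],
    "provider_c": ["pole_3", "harbor_tunnel", "exchange_north"],
}

print(single_points_of_failure(routes))  # ['harbor_tunnel']
```

Three separate poles look like redundancy, but the one segment common to every route is the thing that takes you offline. The same `shared_segments` table also shows partial overlaps, such as two providers meeting at the same exchange, which are worth knowing about even if they don't bring everything down.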
Another time, back in the late 1990s, I was helping a firm in Florida upgrade one of its high-end network servers. This was a Tricord server, which took an ordinary Intel CPU and wrapped all sorts of redundant things around it: two power supplies, RAID hard drives, two physical processors, separate memory, and so forth. We had to pull and replace the network cards in this $40,000 server. This required powering down the beast and opening it up. Sadly, the one thing that wasn’t redundant was the physical power plug that went from the server into the wall – and the $25 part that the ordinary plug fit into went south when we powered the unit down. It took a few white-knuckle hours to locate a new part and get it over to us before we could bring the Tricord up again. I bet no one thought that what was probably the least sophisticated part in the whole machine would be the one to fail.
These days, you see lots of gear that has two physical power plugs, and at Amdocs’ data center they have two separate power paths just in case one goes out. That means extending each path all the way back to a generator and line-conditioning gear, too.
Here is a story from my own mistakes, lest you think I am just harping on my subjects here. Several years ago, I was running an email list server on a friend’s Linux server that was in his California basement. The friend is one of the original Internet heavyweights; he knows his systems and has plenty of backups. However, the day came when heavy flooding in his area knocked out all of his Internet connections, and I wasn’t able to access my list. Well, I thought I had all sorts of backup procedures in place and had saved copies of the list server configuration, so I could bring it up on someone else’s server. However, I had neglected to do one simple task – make a copy of the names of everyone on my list. Now I do. You would think something this simple would not have eluded me, but you would think wrong.
So eliminating every single point of failure is easier said than done. And when you see what Amdocs had to do to deliver on this maxim, you will be impressed.