I spent last week visiting a data center tucked into an anonymous office park in Champaign, Ill. The data center is operated by Amdocs, a company that makes its money providing managed back-office applications for telecom companies such as Sprint, MetroPCS, and others. The visit was part of a general press briefing about what Amdocs is doing, but the term “single point of failure” kept coming up.
If you are going to host apps for telecom vendors, you have to know what you are doing in terms of providing uptime. You need redundant everything, from the plug that a router connects to for power to the backup of the backup diesel generator that has to fire up when you lose main AC power from the utility.
Actually, the most impressive part of the tour was the empty “situation rooms” that Amdocs has built. They are empty because there wasn’t any crisis going on – each room is dedicated to a particular customer and is where the account team gathers when they have a problem to work on. Think “24” but with far nerdier people. And that brings up a good point: what is the rest of CTU doing to protect the other 300 million of us that aren’t directly threatened by the current plot? All the action is happening on the main stage. But I digress.
I started thinking about other IT managers I have met over the years who haven’t completely thought this issue through.
There was one manager at a very large financial services firm near Washington, D.C., whom I interviewed a few years ago. Gazillions of dollars a day pass through its computer networks, and as you might imagine, the firm had three Internet providers – not just two, but three – to provide connectivity. Each provider had a separate path and pole for its line from the firm’s server room. Well, that sounded all well and good until the day a truck collision happened in the Baltimore Harbor Tunnel – a main north-south artery about 50 miles away. Trouble was, all three providers’ lines went through that tunnel, and the firm was cut off from the Internet until things got re-routed. Now the firm has four Internet providers, and it got them to share their route maps (try doing this with yours, and good luck) to make sure there is no single point of failure.
Another time, back in the late 1990s, I was helping a firm in Florida upgrade one of its high-end network servers. This was a Tricord server, which took ordinary Intel CPUs and wrapped all sorts of redundant components around them: two power supplies, RAID hard drives, two physical processors, separate memory, and so forth. We had to pull and replace the network cards from this $40,000 server, which required powering down the beast and opening it up. Sadly, the one thing that wasn’t redundant was the physical power plug that ran from the server to the wall – and the $25 part that the ordinary plug fit into went south when we powered the unit down. It took a few white-knuckle hours to locate a new part and get it over to us before we could bring the Tricord up again. I bet no one expected that the least sophisticated part in the whole machine would be the one to fail.
These days you see lots of gear that has two physical power plugs, and at Amdocs’ data center there are two separate power paths, just in case one goes out. That means carrying each path all the way back to its own generator and line-conditioning gear, too.
Here is a story from my own mistakes, lest you think I am just harping on my subjects here. Several years ago, I was running this email list server on a friend’s Linux box in his California basement. The friend is one of the original Internet heavyweights; he knows his systems and has plenty of backups. But the day came when flooding in his area knocked out all of his Internet connections, and I wasn’t able to access my list. I thought I had all sorts of backup procedures in place and had saved copies of the server list configuration, so I could bring it up on someone else’s server. However, I had neglected one simple task – making a copy of the names of everyone on my list. Now I do. You would think something this simple would not have eluded me, but you would think wrong.
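The fix is nothing fancier than a small script that copies the membership roster somewhere off the box on a schedule. Here is a minimal sketch of the idea; the file paths and the plain-text roster format are assumptions for illustration, since every list package stores its membership differently:

```python
#!/usr/bin/env python3
"""Minimal sketch: copy a mailing-list roster to a dated backup location.

The paths below are hypothetical placeholders; substitute whatever file
your list software actually uses to store its membership."""

import shutil
from datetime import date
from pathlib import Path

ROSTER = Path("/var/lists/mylist/subscribers.txt")     # hypothetical roster file
BACKUP_DIR = Path("/mnt/offsite-backup/list-rosters")  # hypothetical mount of remote storage


def backup_roster() -> Path:
    """Copy the roster to a date-stamped file so older snapshots are kept."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    dest = BACKUP_DIR / f"subscribers-{date.today().isoformat()}.txt"
    shutil.copy2(ROSTER, dest)
    return dest


if __name__ == "__main__":
    print(f"Roster backed up to {backup_roster()}")
```

Run something like this from cron once a day and the membership list survives even if the whole server disappears under floodwater.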
So, eliminating single points of failure: it is easier said than done. And when you see what Amdocs had to do to deliver on this maxim, you will be impressed.
My good friend Joel Snyder of Opus1 writes:
The key to redundancy, though, is to properly balance risk and cost. It’s easy to go out and say “Oh, I need dual power supplies in all my devices,” but an analysis of the failure rates of power supplies may show that this is an expense that doesn’t match the risk reduction. Remember that there is not only a capital cost to this, but also a continuing cost. The capital cost includes the power supplies, as well as the power infrastructure to properly use them (two rails of power in each rack, going back to separate UPSes, along with the necessary documentation and labeling to make sure you don’t screw it up). The operational cost results from the inefficiencies of the power supplies. Running two power supplies into a single Dell, for example, takes about 115% of the power that using only a single power supply does – real waste.
And, of course, putting two power supplies in a server only solves some of the issues related to single points of failure. The server itself represents an SPF. Maybe it’s better to build your data center so that if the whole server goes down, then you don’t care – your cluster (or whatever the mechanism is) keeps running. Of course, that has other inefficiencies, especially if you’re building two-server clusters in active/passive mode – that takes 200% of power (and space … and cooling … and cabling …).
I guess my point is that I run into a lot of people who have a naive notion about building for uptime; they tend to be like generals (“… always preparing to fight the last battle …” or, in our terms, “always solving the last crisis they failed to prepare for”) instead of like strategists or big-picture thinkers. Of course, it’s a lot easier just to say “let’s build two power rails…” and implement that, and then bitch and moan and blame someone else when the inevitable failure occurs.
Good data centers analyze their true failure rates (these tend to be idiosyncratic; for example, there’s a data center in NY that I work with a lot that has abnormally high levels of device failure – we don’t know why, although we have suspicions, but we plan for it, which is what matters most) and then build to reduce the risk of such failures interrupting business-critical processes. Probably the most interesting example of this is Google, which has shared a few glimpses of how it does things with the rest of the world. The results are not directly applicable to most companies – no one else is going to build their own servers – but if you look beyond the raw numbers, it’s the philosophy, thinking, and planning behind it all that is fascinating.
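To make Joel’s risk-versus-cost point concrete, here is a rough back-of-the-envelope sketch. Every figure in it is an assumption made up for illustration – not a measured failure rate or a real price – except the roughly 115% power draw Joel cites for dual supplies:

```python
# Back-of-the-envelope comparison: single vs. redundant power supply in one server.
# All figures below are illustrative assumptions, not measured data.

annual_psu_failure_rate = 0.03   # assumed chance a given power supply dies in a year
outage_cost = 20_000.0           # assumed cost of one unplanned server outage ($)
extra_psu_capital = 400.0        # assumed price of the second supply plus cabling ($)
base_power_cost = 600.0          # assumed annual electricity cost with one supply ($)
dual_psu_power_factor = 1.15     # ~115% power draw with two supplies, per the quote above

# Expected annual outage cost with one supply: failure rate times outage cost.
single_expected_loss = annual_psu_failure_rate * outage_cost

# With two supplies, an outage needs both to fail (treated as independent here).
dual_expected_loss = (annual_psu_failure_rate ** 2) * outage_cost

# Annualized extra cost of redundancy: capital spread over an assumed four-year
# life, plus the ~15% extra electricity.
extra_annual_cost = extra_psu_capital / 4 + base_power_cost * (dual_psu_power_factor - 1)

print(f"Expected loss, single PSU: ${single_expected_loss:,.0f}/yr")
print(f"Expected loss, dual PSU:   ${dual_expected_loss:,.0f}/yr")
print(f"Extra cost of redundancy:  ${extra_annual_cost:,.0f}/yr")
print("Pays for itself?", single_expected_loss - dual_expected_loss > extra_annual_cost)
```

With these made-up numbers the second supply comes out ahead, but plug in a cheaper outage or a more reliable supply and the answer flips – which is exactly the analysis Joel is arguing for before anyone buys anything.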