N+1 Redundancy – How Not To Do It

This morning, as I got up, I noticed that our server monitoring system (Nagios; if you don’t know it, give it a look) was reporting errors for a set of customer servers. They’d been rebooted a couple of times over the last few weeks, so I decided a visit to the DC was in order to check everything was OK. Having brought all but one of the servers back online (it’s a clustered system), I noticed that we had a few internal fan failures. I swapped a couple of fans from a spare box into a live server, which saved me having to grab some from spares, and made a note to order replacements.

So far, so good. But these servers have 6 or 7 identical fans inside, and it appears that the failure of even one of them will lead to the server shutting down and refusing to boot. It looks like the designer failed to understand the concept of redundancy: if you are putting that many fans in, why not add one or two more, so that in the event of a single failure the server logs an abnormal event and carries on?
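To make the N+1 idea concrete, here is a rough sketch of the decision I would expect the box to make. The function name, the action names, and the fan counts are all made up for illustration; I have no idea what the vendor's firmware actually does, and the assumption that 6 fans are enough to cool the server is mine.

```python
from enum import Enum


class FanAction(Enum):
    OK = "ok"
    LOG_AND_CONTINUE = "log fault and keep running"
    SHUT_DOWN = "shut down"


def fan_policy(installed: int, working: int, required: int) -> FanAction:
    """Decide what to do given how many fans are still spinning.

    `required` is the number of fans needed for adequate cooling (the N
    in N+1); any installed fans beyond that are the redundant spares.
    All three values are hypothetical inputs for this sketch.
    """
    if not 0 <= working <= installed:
        raise ValueError("working must be between 0 and installed")
    if working < required:
        # Not enough airflow left: shutting down is the only safe option.
        return FanAction.SHUT_DOWN
    if working < installed:
        # A spare has absorbed the failure: record it and carry on.
        return FanAction.LOG_AND_CONTINUE
    return FanAction.OK


# No spare: 6 installed, 6 required. One failure forces a shutdown.
print(fan_policy(installed=6, working=5, required=6).value)

# N+1: 7 installed, 6 required. One failure is logged and the box stays up.
print(fan_policy(installed=7, working=6, required=6).value)
```

The only difference between the two calls is one extra fan, which is the whole point: the spare turns a hard outage into a logged fault and a replacement order.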

Now, I am mostly a software guy these days, but, really, if it seems obvious to me, surely it’s obvious to others. We are talking about a major name brand, not some nameless box shifter.
