Let’s start with a story. Way back in the mists of time*, I performed audits for an organisation which sent out cryptographic keys to its members. These member audits involved checking multiple processes and systems, but the core one was this: the keys that were sent out were a really big deal, as they were the basis from which tens of thousands of other keys would be derived. So, the main key that was sent out was really, really important, because if it got leaked, the person who got hold of it would have a chance to do many, many Bad Things[tm].
The main organisation thought that allowing people the possibility to do Bad Things[tm] wasn’t generally a good idea, so they had a rule. You had to follow a procedure, which was this: they would send out this key in two separate parts, to be stored in two different physical safes, to be combined by two different people, reporting to two different managers, in a process split into two separate parts, which ensured that the two different key-holders could never see the other half of the key. The two separate parts were sent out by separate couriers, so that nobody outside the main organisation could ever get to see the two parts. It was a good and carefully thought-out process.
So one of the first things I’d investigate, on arriving at a member company to perform an audit, would be how they managed their part of this process. And, because they were generally fairly clued up, or wouldn’t have been allowed to have the keys in the first place, they’d explain how careful they were with the key components, and who reported to whom, and where the safes were, and backup plans for when the key holders were ill: all good stuff. And then I’d ask: “And what happens when a courier arrives with the key component?” To which they’d reply: “Oh, the mail room accepts the package.” And then I’d ask: “And when the second courier arrives with the second key component?” And nine times out of ten, they’d answer: “Oh, the mail room accepts that package, too.” And then we’d have a big chat.**
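If you like to see this sort of thing written down, here is a minimal sketch of the check I was doing in my head during those audits: list everybody who handles each key component, and flag anyone who handles all of them. The party names, the `custody_log` list and the `parties_with_full_key` function are all made up for illustration; they are not taken from any real audit tooling.

```python
from collections import defaultdict

# Hypothetical custody log: (party, key component that party handled).
custody_log = [
    ("courier_a", "part_1"),
    ("courier_b", "part_2"),
    ("mail_room", "part_1"),
    ("mail_room", "part_2"),
    ("key_holder_1", "part_1"),
    ("key_holder_2", "part_2"),
]


def parties_with_full_key(log, required_parts):
    """Return, sorted by name, every party that handled all required parts."""
    seen = defaultdict(set)
    for party, part in log:
        seen[party].add(part)
    required = set(required_parts)
    # A party is a problem if the parts it saw cover the whole key.
    return sorted(party for party, parts in seen.items() if required <= parts)


print(parties_with_full_key(custody_log, ["part_1", "part_2"]))
# prints ['mail_room']
```

The point of the sketch is that the check has to cover everyone who touches a component, not just the people named in the official procedure.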
This is a classic example of a single point of failure. Nobody designs systems with a single point of failure on purpose****, but they just creep in. I’m using the word systems here in the same way I used it in my post Systems security – why it matters: in the sense of a bunch of different things working together, some of which are likely to be human, some of which are likely to be machine. And it’s hard to work out where single points of failure are. A good way to avoid them – or minimise their likelihood of occurrence – is to layer or overlap systems*****. What is terrible is when two single points of failure are triggered at once, because they overlap. From the little information available to us, this seems to be what happened to British Airways over the past weekend: they had a power failure, and then their backups didn’t work. In other words, they had a cascade failure – one thing went wrong, and then, when another thing went wrong as well, everything fell over. This is terrible, and every IT professional out there ought to be cringing a little bit inside at the thought that it might happen to them.******
How can you stop this happening? It’s hard, really it is, because the really catastrophic failures only happen rarely – pretty much by definition. Here are some thoughts, though:
- look at pinch points, where a single part of the system, human or machine, is doing heavy lifting – what happens when they fail?
- look at complex processes with many interlocking pieces – what happens if one of them produces unexpected results (or none)?
- look at processes with many actors – what happens if one actor fails to do what is expected?
- look at processes with a time element to them – what will happen if an element doesn’t produce results when expected?
- try back-tracking, rather than forward-tracking. We tend to think forwards, from input to output: try the opposite, and see what the key parts to any output are. This may give unexpected realisations about critical inputs and associated components.
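The back-tracking idea can be sketched in a few lines of code. This is only an illustration under some loud assumptions: the `requires` map, the component names and both functions are invented for the example, the dependency graph is assumed to have no cycles, and each component is modelled as needing at least one working alternative from every group listed for it.

```python
# Hypothetical dependency map. Each component lists groups of alternatives;
# it works only if at least one member of EVERY group works.
requires = {
    "service": [["server_a", "server_b"]],
    "server_a": [["mains", "generator"], ["transfer_switch"]],
    "server_b": [["mains", "generator"], ["transfer_switch"]],
}


def works(component, requires, failed):
    """True if the component still works given the set of failed components.

    Assumes the dependency graph is acyclic.
    """
    if component in failed:
        return False
    for group in requires.get(component, []):
        if not any(works(alt, requires, failed) for alt in group):
            return False
    return True


def single_points_of_failure(output, requires):
    """Walk backwards from the output and return every component whose
    failure, on its own, stops the output from working."""
    components = set()
    stack = [output]
    while stack:
        current = stack.pop()
        if current in components:
            continue
        components.add(current)
        for group in requires.get(current, []):
            stack.extend(group)
    return sorted(
        c for c in components
        if c != output and not works(output, requires, {c})
    )


print(single_points_of_failure("service", requires))
# prints ['transfer_switch']
```

In this made-up example, the servers and the power sources are all doubled up, but everything still runs through one switch, which is exactly the sort of thing that starting from the output tends to turn up.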
Last: don’t assume that your systems are safe. Examine, monitor, test, remediate. You might******* also have a good long hard think about managed degradation: it’s really going to help if things do go horribly wrong.
Oh – and good luck.
*around ten years ago. It feels like a long time, anyway.
**because, in case you missed it, that meant that the person in charge of the mail room had access to both parts of the key.***
***which meant that they needed to change their policies, quickly, unless they wanted to fail the audit.
****I’m assuming that we’re all Good Guys and Gals[tm], right, and not the baddies?
*****the principle of defence in depth derives from this idea, though it’s only one way to do it.
******and we/you shouldn’t be basking in the schadenfreude. No, indeed.
*******should. Or even must. Just do it.