Well, not all the time, obviously*. But bear with me: we spend most of our time ensuring that all of our systems are up and secure and working as expected, because that’s what we hope for, but there’s a real argument for not only finding out what happens when they don’t, and not just planning for when they don’t, but also planning for how they shouldn’t. Let’s start by examining some techniques for how we might do that.
Part 1 – planning
There’s a story** that the oil company Shell, in the 1970s, did some scenario planning that examined what were considered, at the time, very unlikely events, and which allowed it to react when OPEC’s strategy surprised most of the rest of the industry a few years later. Sensitivity modelling is another technique that organisations use at the financial level to understand what impact various changes – in order fulfilment, currency exchange or interest rates, for instance – have on the various parts of their business. Yet another is war gaming, which the military use to try to understand what will happen when failures occur: putting real people and their associated systems into situations and watching them react. And Netflix are famous for taking this a step further in the context of the IT world and having a virtual Chaos Monkey (a set of processes and scripts) which they use to bring down parts of their systems in real time to allow them to understand how resilient the wider system is.
So that gives us four approaches that are applicable, with various options for automation:
- scenario planning – trying to understand what impact large scale events might have on your systems;
- sensitivity modelling – modelling the impact on your systems of specific changes to the operating environment;
- war gaming – putting your people and systems through simulated events to see what happens;
- real outages – testing your people and systems with actual events and failures.
Going out of your way to sabotage your own systems might seem like insane behaviour, but it’s actually a work of genius. If you don’t plan for failure, what are you going to do when it happens?
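To make sensitivity modelling a little more concrete, here’s a minimal sketch of the idea: change one input at a time, hold the rest fixed, and see how far the result moves. The model, the parameter names and every figure in it are made up for illustration; a real model would come from the business, not from me.

```python
# Toy sensitivity model. Every name and number here is hypothetical.
BASELINE = {
    "orders_fulfilled": 0.95,  # fraction of orders we manage to fulfil
    "exchange_rate": 1.20,     # local currency per unit of supplier currency
    "interest_rate": 0.06,     # annual rate paid on outstanding debt
}


def monthly_margin(params):
    """Illustrative margin model, not a real financial formula."""
    revenue = 1_000 * params["orders_fulfilled"] * 100   # orders * price
    supplier_costs = 40_000 * params["exchange_rate"]    # paid in foreign currency
    interest = 1_200_000 * params["interest_rate"] / 12  # monthly debt interest
    return revenue - supplier_costs - interest


def sensitivity(name, relative_change):
    """Change in monthly margin when one input moves and the rest stay fixed."""
    changed = dict(BASELINE)
    changed[name] = BASELINE[name] * (1 + relative_change)
    return monthly_margin(changed) - monthly_margin(BASELINE)


def rank_sensitivities(relative_change):
    """Inputs ordered by how hard the same relative change hits the margin."""
    impacts = {name: sensitivity(name, relative_change) for name in BASELINE}
    return sorted(impacts.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    for name, impact in rank_sensitivities(0.10):
        print(f"{name}: {impact:+,.0f}")
```

The ranking is the useful part: it tells you which inputs deserve a scenario plan or a war game of their own, and which ones you can probably live with.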
So let’s say that you’ve adopted all of these practices****: what are you going to do with the information? Well, there are some obvious things you can do, such as:
- removing discovered weaknesses;
- improving resilience;
- getting rid of single points of failure;
- ensuring that you have adequately trained staff;
- making sure that your backups are protected, but available to authorised entities.
I won’t try to compile an exhaustive list, because there are loads of books and articles and training courses about this sort of thing, but there’s another, maybe less obvious, course of action which I believe we must take, and that’s to plan for managed degradation.
Part 2 – managed degradation
What do I mean by that? Well, it’s simple. We***** are trained and indoctrinated to take the view that if something fails, it must always “fail to safe” or “fail to secure”. If something stops working right, it should stop working at all.
There’s value in this approach, of course there is, and we’re paid****** to ensure everything is secure, right? Wrong. We’re actually paid to help keep the business running, and here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running. Crazy, right? “The business” want to keep making money and servicing customers even if things aren’t perfectly secure! Don’t they know the risks?
And the answer to that question is “no”. They don’t know the risks. And that’s our real job: we need to explain the risks and the mitigations, and allow a balancing act to take place. In fact, we’re always making those trade-offs and managing that balance – after all, the only truly secure computer is one with no network connection, no keyboard, no mouse and no power connection*******. But most of the time, we don’t need to explain the decisions we make around risk: we just take them, following best industry practice, regulatory requirements and the rest. Nor are the trade-offs usually so stark. When failure strikes, though – whether through an attack, accident or misfortune – it’s often a pretty simple choice between maintaining a particular security posture and keeping the lights on. So we need to think about and plan for some degradation, and realise that on occasion, we may need to adopt a different security posture to the perfect (or at least preferred) one in which we normally operate.
How would we do that? Well, the approach I’m advocating is best described as “managed degradation”. We allow our systems – including, where necessary, our security systems – to degrade to a managed (and preferably planned) state, where we know that they’re not operating at peak efficiency, but where they are operating. Key, however, is that we know the conditions under which they’re working, so we understand their operational parameters, and can explain and manage the risks associated with this new posture. That posture may change in response to ongoing events, and to the systems’ and our own responses to those events, so we need to plan ahead (using the techniques I discussed above) so that we can be flexible enough to provide real resiliency.
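One way to picture this is as a short, ordered list of postures that have been agreed in advance, each with the components it needs and the risks accepted while it’s in force. What follows is only a sketch under my own assumptions: the component names, posture names and risks are all hypothetical, and a real list would be drawn up with the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posture:
    name: str
    requires: frozenset    # components that must be healthy for this posture
    accepted_risks: tuple  # risks agreed in advance for operating this way


# Hypothetical postures, ordered from preferred to most degraded.
POSTURES = [
    Posture("normal", frozenset({"auth", "audit_log", "fraud_check"}), ()),
    Posture("degraded", frozenset({"auth", "audit_log"}),
            ("orders accepted without fraud check",)),
    Posture("minimal", frozenset({"auth"}),
            ("no fraud check", "audit events held locally until the log returns")),
]

# The classic "fail to secure" state is still there, but only as the last resort.
STOPPED = Posture("stopped", frozenset(), ("service unavailable",))


def select_posture(healthy_components):
    """Return the most preferred posture whose requirements are all met."""
    healthy = set(healthy_components)
    for posture in POSTURES:
        if posture.requires <= healthy:
            return posture
    return STOPPED


if __name__ == "__main__":
    current = select_posture({"auth", "audit_log"})
    print(current.name, current.accepted_risks)
```

The point of writing it down like this is that each degraded state has a name, a known set of working parts and an explicit list of risks, so nobody has to improvise the conversation in the middle of an incident.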
We need to find modes of operation which don’t expose the crown jewels******** of the business, but do allow key business operations to take place. And those key business operations may not be the ones we expect – maybe it’s more important to be able to create new orders than to collect payments for them, for instance, at least in the short term. So we need to discuss the options with the business, and respond to their needs. This planning is not just security resiliency planning: it’s business resiliency planning. We won’t be able to consider all the possible failures – though the techniques I outlined above will help us to identify many of them – but the more we plan for, the better we will be at reacting to the surprises. And, possibly best of all, we’ll be talking to the business, informing them, learning from them, and even, maybe just a bit, helping them understand that the job we do does have some value after all.
*I’m assuming that we’re the Good Guys/Gals**.
**Maybe less story than MBA*** case study.
***There’s no shame in it.
****Well done, by the way.
*****The mythical security community again – see past posts.
******Hopefully…
*******Preferably at the bottom of a well, encased in concrete, with all storage already removed and destroyed.
********Probably not the actual Crown Jewels, unless you work at the Tower of London.