Wouldn’t it be lovely if everything were functioning exactly as it should? All the time. Alas, that is not to be our lot: we in IT security know that there’s always something that needs attention, something that’s not quite right, something that’s limping along and needs to be fixed.
The question I want to address in this article is whether that’s actually OK. I’ve written before about managed degradation (the idea that planning for when things go wrong is a useful and important practice) in “Service degradation: actually a good thing”. The subject of this article, however, is living in a world where everything is running almost normally – or, more specifically, “nominally”. Most definitions of the word “nominal” focus on its meaning of “theoretical” or “token” value. A quick search of online dictionaries provided two definitions which were more relevant to the usage I’m going to be looking at today:
- informal (chiefly in the context of space travel) functioning normally or acceptably.
- being according to plan: satisfactory.
I’d like to offer a slightly different one:
- within acceptable parameters for normal system operation.
I’ve seen “tolerances” used instead of “parameters”, and that works, too, but it’s not a word that I think we use much within IT security, so I lean towards “parameters”.
Why do I think that this is a useful concept? Because, as I noted above, we all know that it’s a rare day when everything works perfectly. But we find ways to muddle through, or we find enough bandwidth to make the backups happen without significant impact on database performance, or we only lose 1% of the credit card details collected that day. All of these (except the last one) are fine, and if we are wise, we might start actually defining what counts as acceptable operation – nominal operation – for all of our users. This is what we should be striving for, and exactly how far off perfect operation we are will give us clues as to how much effort we need to expend to sort things out, and how quickly we need to perform mitigations.
In fact, many organisations which provide services do this already: that’s where SLAs (Service Level Agreements) come from. I remember, at school, doing some maths around food companies ensuring that they were in little danger of being in trouble for under-filling containers by looking at standard deviations to understand what the likely amount in each would be. This is similar, and the likelihood of hardware failures, based on similar calculations, is often factored into uptime planning.
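For anyone who wants to see what that school maths looks like in practice, here is a minimal sketch in Python. It assumes that fill weights follow a normal distribution, and every figure in it is made up for illustration rather than taken from any real filling line.

```python
from statistics import NormalDist

# Hypothetical figures, for illustration only.
LABELLED_WEIGHT_G = 500.0  # what the label promises
MEAN_FILL_G = 506.0        # what the machine delivers on average
STD_DEV_G = 3.0            # spread of the filling process

# Assumption: fill weights are normally distributed.
fill = NormalDist(mu=MEAN_FILL_G, sigma=STD_DEV_G)

# Probability that any one container falls below the labelled weight.
p_underfilled = fill.cdf(LABELLED_WEIGHT_G)

print(f"Chance of an under-filled container: {p_underfilled:.2%}")
```

With these numbers the labelled weight sits two standard deviations below the mean, so a little over 2% of containers would be under-filled. The same shape of calculation, with a failure distribution in place of fill weights, is one way to reason about whether a component is likely to stay within its nominal range.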
So far, much of the above seems to be about resilience: where does security come in? Well, your security components, features and functionality are subject to the same resilience issues as any other part of your system. The problem is that if a security piece fails, it may well be a single point of failure, which means that although the rest of the system is operating at 99% performance, your security just hit zero.
These are the reasons that we perform failure analysis, and why we consider defence in depth. But when we’re looking at defence in depth, do we remember to perform second order analysis? For instance, if my back-up LDAP server for user authentication is running on older hardware, what are the chances that it will fail when put under load?
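To make the LDAP question concrete, here is a rough sketch of the difference between a first order and a second order estimate. All of the probabilities are hypothetical, chosen only to show the effect, and the function name is my own invention rather than anything from a real tool.

```python
# All probabilities are hypothetical, chosen only to illustrate the point.
P_PRIMARY_FAILS = 0.01            # chance the primary LDAP server is down
P_BACKUP_FAILS_IDLE = 0.01        # backup's failure rate while it sits idle
P_BACKUP_FAILS_UNDER_LOAD = 0.20  # backup's failure rate once it takes the full load


def p_no_authentication(p_primary: float, p_backup_given_primary_down: float) -> float:
    """Chance that both servers are down, so nobody can authenticate."""
    return p_primary * p_backup_given_primary_down


# First order: treat the backup as if it fails at its idle rate.
first_order = p_no_authentication(P_PRIMARY_FAILS, P_BACKUP_FAILS_IDLE)

# Second order: the backup only matters once the primary is down,
# which is exactly when it is under load.
second_order = p_no_authentication(P_PRIMARY_FAILS, P_BACKUP_FAILS_UNDER_LOAD)

print(f"Treating the backup as idle:        {first_order:.4%}")
print(f"Allowing for the backup under load: {second_order:.4%}")
```

The point is not the particular numbers but the gap between them: the backup's failure rate that matters is the one it has at the moment you actually need it.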
It should come as no surprise to regular readers that I want to expand the scope of the discussion beyond just hardware and software components in systems to the people who are involved, and beyond that to processes in general.
Train companies are all too aware of the impact on their services if a bad flu epidemic hits their drivers – or if the weather is good, so their staff prefer to enjoy the sunshine with their families, rather than take voluntary overtime. You may have considered the impact of a staff member or two being sick, but have you gone as far as modelling what would happen if four members of your team were sick, and, just as important, how likely that is? Equally vital to consider may be issues of team dynamics, or even terrorist attacks or union disputes. What about external factors, like staff not being able to get into work because of train cancellations? What are the chances of broadband failures occurring at the same time, scuppering your fall-back plan of allowing key staff to work from home?
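As a starting point for the “how likely is that?” question, here is a minimal sketch using the binomial distribution. It assumes a hypothetical team of ten and, importantly, that each person’s absence is independent of everyone else’s, which is precisely the assumption that an epidemic or a train strike breaks. The sickness rates are invented for illustration.

```python
from math import comb


def p_at_least_absent(team_size: int, threshold: int, p_absent: float) -> float:
    """Probability that `threshold` or more of the team are absent on one day.

    Assumes each absence is independent (a simple binomial model).
    """
    p_fewer = sum(
        comb(team_size, k) * p_absent**k * (1 - p_absent) ** (team_size - k)
        for k in range(threshold)
    )
    return 1 - p_fewer


# Hypothetical figures: a team of ten, and two invented daily absence rates.
ordinary_day = p_at_least_absent(team_size=10, threshold=4, p_absent=0.05)
flu_season = p_at_least_absent(team_size=10, threshold=4, p_absent=0.30)

print(f"Four or more absent on an ordinary day: {ordinary_day:.2%}")
print(f"Four or more absent in a bad flu season: {flu_season:.2%}")
```

On these made-up figures, losing four people at once goes from a roughly one-in-a-thousand day to something that happens about a third of the time. Correlated causes would make the real picture worse than this simple model suggests.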
We can go deeper and deeper into this, and at some point it becomes pointless to go any further. But I believe that it’s useful to consider how far to go with it, and also to spend some time considering exactly what you consider “nominal” operation, particularly for your security systems.
1 – I nearly wrote “art”.
2 – Oxford Dictionaries: https://en.oxforddictionaries.com/definition/nominal.
3 – Merriam-Webster: https://www.merriam-webster.com/dictionary/nominal.
4 – this article was inspired by the Public Service Broadcasting song “Go”. Listen to it: it rocks. And they’re great live, too.
5 – Note: this is a joke, and not a very funny one. You’ve probably just committed a GDPR breach, and you need to tell someone about it. Now.
6 – in the UK, and most countries speaking versions of Commonwealth English, we do more than one math. Because “mathematics”.
7 – this happened: I saw it on TV.
8 – so it must be true.