Managing by exception

What I want, as a human, is interesting opportunities to apply my expertise

I’ve visited a family member this week (see why in last week’s article), in a way which is allowed under the UK Covid-19 lockdown rules as an “exceptional circumstance”. When governments and civil authorities are managing what their citizens are allowed to do, many jurisdictions (including the UK, where I live), follow the general principle of “Everything which is not forbidden is allowed“. This becomes complicated when you’re putting in (hopefully short-term) restrictions on civil liberties such as disallowing general movement to visit family and friends, but in the general case, it makes a lot of sense. It allows for the principle of “management by exception”: rather than taking the approach that you check that every journey is allowed, you look out for disallowed journeys (taking an unnecessary trip to a castle in the north of England, for instance) and (hopefully) punish those who have undertaken them.

What astonishes me about the world of IT – and security in particular – is how often we take the opposite approach. We record every single log result and transfer it across the network just in case there’s a problem. We have humans in the chain for every software build, checking that the correct versions of containers have been used. When what we should be doing is involving humans – the expensive parts of the chain – only when they’re needed, and only sending lots of results across the network – the expensive part of that chain – when the system which is generating the logs is under attack, or has registered a fault.

That’s not to say that we shouldn’t be recording information, but that we should be intelligent about how we use it: which means that we should be automating. Automation allows us to manage – that is, apply the expensive operations in a chain – only when it is relevant. Having a list of allowed container images, and then asking the developer why she has chosen a non-standard option, is so, so much cheaper for the organisation, not to mention more interesting for the container expert, than monitoring every single build. Allowing the system generating logs to increase the amount of information sent when it realises its under attack, or to send it a command to up what it sends when a possible problem is noticed remotely – is more efficient than the alternative.

The other thing I’m not saying is that we should just ignore information that’s generated in normal cases, where operation is “nominal“. The growing opportunities to apply AI/ML techniques to this to allow us to realise what is outside normal operation, and become more sensitive to when we need to apply those expensive components in a system where appropriate, makes a lot of sense. Sometimes, statistical sampling is required, where we can’t expect all of the data to be provided to those systems (in the remote logging case, for instance), or designs of distributed systems with remote agents need to be designed.

What I want, as a human, is interesting opportunities to apply my expertise, where I can make a difference, rather than routine problems (if you have routine problems, you have broader, more concerning issues) which don’t test me, and which don’t make a broader difference to how the systems and processes I’m involved with run. That won’t happen unless I can be part of an organisation where management by exception is the norm.

One final thing that I should be clear about is that I’m also not talking about an approach where “everything which isn’t explicitly allowed is disallowed” – that doesn’t sound like a great approach for security (I may not be a huge fan of the term zero-trust, but I’m not that opposed to it). It’s the results of the decisions that we care about, on the whole, and where we can manage it, we just have to automate, given the amount of information that’s becoming available. Even worse than not managing by exception is doing nothing with the data at all!

It doesn’t happen often, but let’s realise that, on this occasion, we have something to learn from our governments, and manage by exception.

Uninterrupted power for the (CI)A

I’d really prefer not to have to restart every time we have a minor power cut.

Regular readers of my blog will know that I love the CIA triad – confidentiality, integrity and availability – as applied to security. Much of the time, I tend to talk about the first two, and spend little time on the third, but today I thought I’d address the balance a little bit by talking about UPSes – Uninterruptible Power Supplies. For anyone who was hoping for a trolling piece on giving extra powers to US Government agencies, it’s time to move along to another blog. And anyone who thinks I’d stoop to a deliberately misleading article title just to draw people into reading the blog … well, you’re here now, so you might as well read on (and you’re welcome to visit others of my articles such as Do I trust this package?, A cybersecurity tip from Hazzard County, and, of course, Defending our homes).

Years ago, when I was young and had more time on my hands, I ran an email server for my own interest (and accounts). This was fairly soon after we moved to ADSL, so I had an “always-on” connection to the Internet for the first time. I kept it on a pretty basic box behind a pretty basic firewall, and it served email pretty well. Except for when it went *thud*. And the reason it went *thud* was usually because of power fluctuations. We live in a village in the East Anglian (UK) countryside where the electricity supply, though usually OK, does go through periods where it just stops from time to time. Usually for under a minute, admittedly, but from the point of view of most computer systems, even a second of interruption is enough to turn them off. Usually, I could reboot the machine and, after thinking to itself for a while, it would come back up – but sometimes it wouldn’t. It was around this time, if I remember correctly, that I started getting into journalling file systems to reduce the chance of unrecoverable file system errors. Today, such file systems are custom-place: in those days, they weren’t.

Even when the box did come back up, if I was out of the house, or in the office for the day, on holiday, or travelling for business, I had a problem, which was that the machine was now down, and somebody needed to go and physically turn it on if I wanted to be able to access my email. What I needed was a way to provide uninterruptible power to the system if the electricity went off, and so I bought a UPS: an Uninterruptible Power System. A UPS is basically a box that sits between your power socket and your computer, and has a big battery in it which will keep your system going for a while in the event of a (short) power failure and the appropriate electronics to provide AC power out from the battery. Most will also have some sort of way to communicate with your system such as a USB port, and software which you can install to talk to it that your system can decide whether or not to shut itself down – when, for instance, the power has been off for long enough that the battery is about to give out. If you’re running a data centre, you’ll typically have lots of UPS boxes keeping your most important servers up while you wait for your back-up generators to kick in, but for my purposes, knowing that my email server would stay up for long enough that it would ride out short power drops, and be able to shut down gracefully if the power was out for longer, was enough: I have no interest in running my own generator.

That old UPS died probably 15 years ago, and I didn’t replace it, as I’d come to my senses and transferred my email accounts to a commercial provider, but over the weekend I bought a new one. I’m running more systems now, some of them are fairly expensive and really don’t like power fluctuations, and there are some which I’d really prefer not to have to restart every time we have a minor power cut. Here’s what I decided I wanted:

  • a product with good open source software support;
  • something which came in under £150;
  • something with enough “juice” (batter power) to tide 2-3 systems over a sub-minute power cut;
  • something with enough juice to keep one low-powered box running for a little longer than that, to allow it to coordinate shutting down the other boxes first, and then take itself down if required;
  • something with enough ports to keep a a couple of network switches up while the previous point happened (I thought ahead!);
  • Lithium Ion rather than Lead battery if possible.

I ended up buying an APC BX950UI, which meets all of my requirements apart from the last one: it turns out that only high-end UPS systems currently seem to have moved to Lithium Ion battery technology. There are two apparently well-maintained open source software suites that support APC UPS systems: apcupsd and nut, both of which are available for my Linux distribution of choice (Fedora). As it happens, they both also have support for Windows and Mac, so you can mix and match if needs be. I chose nut, which doesn’t explicitly list my model of UPS, but which supports most of the product lower priced product line with its usbhid-ups driver – I knew that I could move to apsupsd if this didn’t work out, but nut worked fine.

Set up wasn’t difficult, but required a little research (and borrowing a particular cable/lead from a kind techie friend…), and I plan to provide details of the steps I took in a separate article or articles to make things easier for people wishing do replicate something close to my set-up. However, I’m currently only using it on one system, so haven’t set up the coordinated shutdown configuration. I’ll wait till I’ve got it more set up before trying that.

What has this got to do with security? Well, I’m planning to allow VPN access to at least one of the boxes, and I don’t want it suddenly to disappear, leaving a “hole in the network”. I may well move to a central authentication mechanism for this and other home systems (if you’re interested, check out projects such as yubico-pam): and I want the box that provides that service to stay up. I’m also planning some home automation projects where access to systems from outside the network (to view cameras, for instance) will be a pain if things just go down: IoT devices may well just come back up in the event of a power failure, but the machines which coordinate them are less likely to do so.

What’s a vulnerability?

Often, it’s the terms which seem familiar that are most confusing

I was discussing a document with some colleagues recently, and wanted to point them an article I’d written with some definitions of terms like “vulnerability”, “exploit” and “attack”, only to get somewhat annoyed when I discovered that I’d never written it. So this week’s post attempts to remedy that, though I’m going to address them in reverse order, and I’m going to add an extra one: “mitigation”.

The world of IT security is full of lots of terms, some familiar, some less so. Often, it’s the terms which seem familiar that are most confusing, because they may mean something other that what you think. The three that I think it’s important to define here are the ones I noted above (and mitigation, because it’s often used in the same contexts). Before we do that, let’s just define quickly what we’re talking about when we mean security.

CIA

We often talk about three characteristics of a system that we want to protect or maintain in order to safeguard its security: C, I and A.

  • “C” stands for confidentiality. Unauthorised entities should not be able to access information or processes.
  • “I” stands for integrity. Unauthorised entities should not be able to change information or processes.
  • “A” stands for availability. Unauthorised entities should not be able to impact the ability for authorised entities to access information or processes.

I’ve gone into more detail in my article The Other CIA: Confidentiality, Integrity and Availability, but that should keep us going for now. I should also say that there are various other definitions around relevant characteristics that we might want to protect, but CIA gives us a good start.

Mitigation

A mitigation[3] in this context is a technique to reduce the impact of an attack on whichever of C, I or A is (or are) affected.

Attack

An attack[1] is an action or set of actions which affects the C, the I or the A of a system. It’s the breaking of the protections applied to safeguard confidentiality, integrity or availability. An attack is possible through the use of an exploit.

Exploit

An exploit is a mechanism to affect the C, the I or the A of a system, allowing an attack. It is a technique, or set of techniques, employed in an an attack, and is possible through the exploiting[2] of a vulnerability.

Vulnerability

A vulnerability is a flaw in a system which allows the breaking of protections applied to the C, the I or the A of a system. It allows attacks, which take place via exploits.

One thing that we should note is that a vulnerability is not necessarily a flaw in software: it may be in hardware or firmware, or be exposed as an emergent characteristic of a system, in which case it’s a design problem. Equally, it may expose a protocol design issue, so even if the software (+ hardware, etc.) is a correct implementation of the protocol, a security vulnerability exists. This is, in fact, very common, particularly where cryptography is involved, but cryptography is hard to do.


1 – a successful one, anyway.

2 – hence the name.

3 – I’ve written more about mitigation in Mitigate or remediate?.

“All systems nominal” – borrowing a useful phrase

Wouldn’t it be lovely if everything were functioning exactly as it should?

Wouldn’t it be lovely if everything were functioning exactly as it should? All the time. Alas, that is not to be our lot: we in IT security know that there’s always something that needs attention, something that’s not quite right, something that’s limping along and needs to be fixed.

The question I want to address in this article is whether that’s actually OK. I’ve written before about managed degradation (the idea that planning for when things go wrong is a useful and important practice[1]) in Service degradation: actually a good thing. The subject of this article, however, is living in a world where everything is running almost normally – or, more specifically, “nominally”. Most definitions of the word “nominal” focus on its meaning of “theoretical” or “token” value. A quick search of online dictionaries provided two definitions which were more relevant to the usage I’m going to be looking at today:

  • informal (chiefly in the context of space travel) functioning normally or acceptably[2].
  • being according to plan: satisfactory[3].

I’d like to offer a slightly different one:

  • within acceptable parameters for normal system operation.

I’ve seen “tolerances” used instead of “parameters”, and that works, too, but it’s not a word that I think we use much within IT security, so I lean towards “parameters”[4].

Utility

Why do I think that this is a useful concept? Because, as I noted above, we all know that it’s a rare day when everything works perfectly. But we find ways to muddle through, or we find enough bandwidth to make the backups happen without significant impact on database performance, or we only lose 1% of the credit card details collected that day[5]. All of these (except the last one), are fine, and if we are wise, we might start actually defining what counts as acceptable operation – nominal operation – for all of our users. This is what we should be striving for, and exactly how far off perfect operation we are in will give us clues as to how much effort we need to expend to sort them out, and how quickly we need to perform mitigations.

In fact, many organisations which provide services do this already: that’s where SLAs (Service Level Agreements) come from. I remember, at school, doing some maths[6] around food companies ensuring that they were in little danger of being in trouble for under-filling containers by looking at standard deviations to understand what the likely amount in each would be. This is similar, and the likelihood of hardware failures, based on similar calculations, is often factored into uptime planning.

So far, much of the above seems to be about resilience: where does security come in? Well, your security components, features and functionality are also subject to the same issues of resiliency to which any other part of your system is. The problem is that if a security piece fails, it may well be a single point of failure, which means that although the rest of the system is operating at 99% performance, your security just hit zero.

These are the reasons that we perform failure analysis, and why we consider defence in depth. But when we’re looking at defence in depth, do we remember to perform second order analysis? For instance, if my back-up LDAP server for user authentication is running on older hardware, what are the chances that it will fail when put under load?

Broader usage

It should come as no surprise to regular readers that I want to expand the scope of the discussion beyond just hardware and software components in systems to the people who are involved, and beyond that to processes in general.

Train companies are all too aware of the impact on their services if a bad flu epidemic hits their drivers – or if the weather is good, so their staff prefer to enjoy the sunshine with their families, rather than take voluntary overtime[7]. You may have considered the impact of a staff member or two being sick, but have you gone as far as modelling what would happen if four members of your team were sick, and, just as important, how likely that is? Equally vital to consider may be issues of team dynamics, or even terrorism attacks or union disputes. What about external factors, like staff not being able to get into work because of train cancellations? What are the chances of broadband failures occurring at the same time, scuppering your fall-back plan of allowing key staff to work from home?

We can go deeper and deeper into this, and at some point it becomes pointless to go any further. But I believe that it’s useful to consider how far to go with it, and also to spend some time considering exactly what you consider “nominal” operation, particularly for you security systems.


1 – I nearly wrote “art”.

2 – Oxford Dictionaries: https://en.oxforddictionaries.com/definition/nominal.

3 – Merriam-Webster: https://www.merriam-webster.com/dictionary/nominal.

4 – this article was inspired by the Public Service Broadcasting Song Go. Listen to it: it rocks. And they’re great live, too.

5 – Note: this is a joke, and not a very funny one. You’ve probably just committed a GDPR breach, and you need to tell someone about it. Now.

6 – in the UK, and most countries speaking versions of Commonwealth English, we do more than one math. Because “mathematics“.

7 – this happened: I saw it it on TV[8].

8 – so it must be true.

Why I should have cared more about lifecycle

Every deployment is messy.

I’ve always been on the development and architecture side of the house, rather than on the operations side. In the old days, this distinction was a useful and acceptable one, and wasn’t too difficult to maintain. From time to time, I’d get involved with discussions with people who were actually running the software that I had written, but on the whole, they were a fairly remote bunch.

This changed as I got into more senior architectural roles, and particularly as I moved through some pre-sales roles which involved more conversations with users. These conversations started to throw up[1] an uncomfortable truth: not only were people running the software that I helped to design and write[3], but they didn’t just set it up the way we did in our clean test install rig, run it with well-behaved, well-structured data input by well-meaning, generally accurate users in a clean deployment environment, and then turn it off when they’re done with it.

This should all seem very obvious, and I had, of course, be on the receiving end of requests from support people who exposed that there were odd things that users did to my software, but that’s usually all it felt like: odd things.

The problem is that odd is normal.  There is no perfect deployment, no clean installation, no well-structured data, and certainly very few generally accurate users.  Every deployment is messy, and nobody just turns off the software when they’re done with it.  If it’s become useful, it will be upgraded, patched, left to run with no maintenance, ignored or a combination of all of those.  And at some point, it’s likely to become “legacy” software, and somebody’s going to need to work out how to transition to a new version or a completely different system.  This all has major implications for security.

I was involved in an effort a few years ago to describe the functionality, lifecycle for a proposed new project.  I was on the security team, which, for all the usual reasons[4] didn’t always interact very closely with some of the other groups.  When the group working on error and failure modes came up with their state machine model and presented it at a meeting, we all looked on with interest.  And then with horror.  All the modes were “natural” failures: not one reflected what might happen if somebody intentionally caused a failure.  “Ah,” they responded, when called on it by the first of the security to be able to form a coherent sentence, “those aren’t errors, those are attacks.”  “But,” one of us blurted out, “don’t you need to recover from them?”  “Well, yes,” they conceded, “but you can’t plan for that.  It’ll need to be on a case-by-case basis.”

This is thinking that we need to stamp out.  We need to design our systems so that, wherever possible, we consider not only what attacks might be brought to bear on them, but also how users – real users – can recover from them.

One way of doing this is to consider security as part of your resilience planning, and bake it into your thinking about lifecycle[5].  Failure happens for lots of reasons, and some of those will be because of bad people doing bad things.  It’s likely, however, that as you analyse the sorts of conditions that these attacks can lead to, a number of them will be similar to “natural” errors.  Maybe you could lose network connectivity to your database because of a loose cable, or maybe because somebody is performing a denial of service attack on it.  In both these cases, you may well start off with similar mitigations, though the steps to fix it are likely to be very different.  But considering all of these side by side means that you can help the people who are actually going to be operating those systems plan and be ready to manage their deployments.

So the lesson from today is the same as it so often is: make sure that your security folks are involved from the beginning of a project, in all parts of it.  And an extra one: if you’re a security person, try to think not just about the attackers, but also about all those poor people who will be operating your software.  They’ll thank you for it[6].


1 – not literally, thankfully[2].

2 – though there was that memorable trip to Singapore with food poisoning… I’ll stop there.

3 – a fact of which I actually was aware.

4 – some due entirely to our own navel-gazing, I’m pretty sure.

5 – exactly what we singularly failed to do in the project I’ve just described.

6 – though probably not in person.  Or with an actual gift.  But at least they’ll complain less, and that’s got to be worth something.

Explained: five misused security words

Untangling responsibility, authority, authorisation, authentication and identification.

I took them out of the title, because otherwise it was going to be huge, with lots of polysyllabic words.  You might, therefore, expect a complicated post – but that’s not my intention*.  What I’d like to do it try to explain these five important concepts in security, as they’re often confused or bound up with one another.  They are, however, separate concepts, and it’s important to be able to disentangle what each means, and how they might be applied in a system.  Today’s words are:

  • responsibility
  • authority
  • authorisation
  • authentication
  • identification.

Let’s start with responsibility.

Responsibility

Confused with: function; authority.

If you’re responsible for something, it means that you need to do it, or if something goes wrong.  You can be responsible for a product launching on time, or for the smooth functioning of a team.  If we’re going to ensure we’re really clear about it, I’d suggest using it only for people.  It’s not usually a formal description of a role in a system, though it’s sometimes used as short-hand for describing what a role does.  This short-hand can be confusing.  “The storage module is responsible for ensuring that writes complete transactionally” or “the crypto here is responsible for encrypting this set of bytes” is just a description of the function of the component, and doesn’t truly denote responsibility.

Also, just because you’re responsible for something doesn’t mean that you can make it happen.  One of the most frequent confusions, then, is with authority.  If you can’t ensure that something happens, but it’s your responsibility to make it happen, you have responsibility without authority***.

Authority

Confused with: responsibility, authorisation.

If you have authority over something, then you can make it happen****.  This is another word which is best restricted to use about people.  As noted above, it is possible to have authority but no responsibility*****.

Once we start talking about systems, phrases like “this component has the authority to kill these processes” really means “has sufficient privilege within the system”, and should best be avoided. What we may need to check, however, is whether a component should be given authorisation to hold a particular level of privilege, or to perform certain tasks.

Authorisation

Confused with: authority; authentication.

If a component has authorisation to perform a certain task or set of tasks, then it has been granted power within the system to do those things.  It can be useful to think of roles and personae in this case.  If you are modelling a system on personae, then you will wish to grant a particular role authorisation to perform tasks that, in real life, the person modelled by that role has the authority to do.  Authorisation is an instantiation or realisation of that authority.  A component is granted the authorisation appropriate to the person it represents.  Not all authorisations can be so easily mapped, however, and may be more granular.  You may have a file manager which has authorisation to change a read-only permission to read-write: something you might struggle to map to a specific role or persona.

If authorisation is the granting of power or capability to a component representing a person, the question that precedes it is “how do I know that I should grant that power or capability to this person or component?”.  That process is authentication – authorisation should be the result of a successful authentication.

Authentication

Confused with: authorisation; identification.

If I’ve checked that you’re allowed to perform and action, then I’ve authenticated you: this process is authentication.  A system, then, before granting authorisation to a person or component, must check that they should be allowed the power or capability that comes with that authorisation – that are appropriate to that role.  Successful authentication leads to authorisation.  Unsuccessful authentication leads to blocking of authorisation******.

With the exception of anonymous roles, the core of an authentication process is checking that the person or component is who he, she or it says they are, or claims to be (although anonymous roles can be appropriate for some capabilities within some systems).  This checking of who or what a person or component is authentication, whereas the identification is the claim and the mapping of an identity to a role.

Identification

Confused with: authentication.

I can identify that a particular person exists without being sure that the specific person in front of me is that person.  They may identify themselves to me – this is identification – and the checking that they are who they profess to be is the authentication step.  In systems, we need to map a known identity to the appropriate capabilities, and the presentation of a component with identity allows us to apply the appropriate checks to instantiate that mapping.

Bringing it all together

Just because you know whom I am doesn’t mean that you’re going to let me do something.  I can identify my children over the telephone*******, but that doesn’t mean that I’m going to authorise them to use my credit card********.  Let’s say, however, that I might give my wife my online video account password over the phone, but not my children.  How might the steps in this play out?

First of all, I have responsibility to ensure that my account isn’t abused.  I also have authority to use it, as granted by the Terms and Conditions of the providing company (I’ve decided not to mention a particular service here, mainly in case I misrepresent their Ts&Cs).

“Hi, darling, it’s me, your darling wife*********. I need the video account password.” Identification – she has told me who she claims to be, and I know that such a person exists.

“Is it really you, and not one of the kids?  You’ve got a cold, and sound a bit odd.”  This is my trying to do authentication.

“Don’t be an idiot, of course it’s me.  Give it to me or I’ll pour your best whisky down the drain.”  It’s her.  Definitely her.

“OK, darling, here’s the password: it’s il0v3myw1fe.”  By giving her the password, I’ve  performed authorisation.

It’s important to understand these different concepts, as they’re often conflated or confused, but if you can’t separate them, it’s difficult not only to design systems to function correctly, but also to log and audit the different processes as they occur.


*we’ll have to see how well I manage, however.  I know that I’m prone to long-windedness**

**ask my wife.  Or don’t.

***and a significant problem.

****in a perfect world.  Sometimes people don’t do what they ought to.

*****this is much, much nicer than responsibility without authority.

******and logging.  In both cases.  Lots of logging.  And possibly flashing lights, security guards and sirens on failure, if you’re into that sort of thing.

*******most of the time: sometimes they sound like my wife.  This is confusing.

********neither should you assume that I’m going to let my wife use it, either.*********

*********not to suggest that she can’t use a credit card: it’s just that we have separate ones, mainly for logging purposes.

**********we don’t usually talk like this on the phone.

The Curious Incident of the Patch in the Night-Time

Gregory: “The patch did nothing in the night-time.”
Holmes: “That was the curious incident.”

To misquote Sir Arthur Conan-Doyle:

Gregory (cyber-security auditor) “Is there any other point to which you would wish to draw my attention?”
Holmes: “To the curious incident of the patch in the night-time.”
Gregory: “The patch did nothing in the night-time.”
Holmes: “That was the curious incident.”

I considered a variety of (munged) literary titles to head up this blog, and settled on the one above or “We Need to Talk about Patching”.  Either way round, there’s something rotten in the state of patching*.

Let me start with what I hope is a fairly uncontroversial statement: “we all know that patches are important for security and stability, and that we should really take them as soon as they’re available and patch all of our systems”.

I don’t know about you, but I suspect you’re the same as me: I run ‘sudo dnf –refresh upgrade’** on my home machines and work laptop at least once every day that I turn them on.  I nearly wrote that when an update comes out to patch my phone, I take it pretty much immediately, but actually, I’ve been burned before with dodgy patches, and I’ll often have a check of the patch number to see if anyone has spotted any problems with it before downloading it. This feels like basic due diligence, particularly as I don’t have a “staging phone” which I could use to test pre-production and see if my “production phone” is likely to be impacted***.

But the overwhelming evidence from the industry is that people really don’t apply patches – including security patches – even though they understand that they ought to.  I plan to post another blog entry at some point about similarities – and differences – between patching and vaccinations, but let’s take as read, for now, the assumption that organisations know they should patch, and look at the reasons they don’t, and what we might do to improve that.

Why people don’t patch

Here are the legitimate reasons that I can think of for organisations not patching****.

  1. they don’t know about patches
    • not all patches are advertised well enough
    • organisations don’t check for patches
  2. they don’t know about their systems
    • incomplete knowledge of their IT estate
  3. legacy hardware
    • patches not compatible with legacy hardware
  4. legacy software
    • patches not  compatible with legacy software
  5. known impact with up-to-date hardware & software
  6. possible impact with up-to-date hardware & software

Some of these are down to the organisations, or their operating environment, clearly: 1b, 2, 3 and 4.  The others, however, are down to us as an industry.  What it comes down to is a balance of risk: the IT operations department doesn’t dare to update software with patches because they know that if the systems that they maintain go down, they’re in real trouble.  Sometimes they know there will be a problem (typically because they test patches in a staging environment of some type), and sometimes because they just don’t dare.  This may be because they are in the middle of their own software update process, and the combination of Operating System, middleware or integrated software updates with their ongoing changes just can’t be trusted.

What we can do

Here are some thoughts about what we as an industry can do to try to address this problem – or set of problems.

Staging

Staging – what is a staging environment for?  It’s for testing changes before they go into production, of course.  But what changes?  Changes to your software, or your suppliers’ software?  The answer has to be “both”, I think.  You may need separate estates so that you can look at changes of these two sets of software separately before seeing what combining them does, but in the end, it is the combination of the two that matters.  You may consider using the same estate at different times to test the different options, but that’s not an option for all organisations.

DevOps

DevOps shouldn’t just be about allowing agile development practices to become part of the software lifecycle: it should also be about allowing agile operational practices become a part of the software lifecycle.  DevOps can really help with patching strategy if you think of it this way.  Remember, in DevOps, everybody has responsibility.  So your DevOps pipeline the perfect way to test how changes in your software are affected by changes in the underlying estate.  And because you’re updating regularly, and have unit tests to check all the key functionality*****, any changes can be spotted and addressed quickly.

Dependencies

Patches sometimes have dependencies.  We should be clear when a patch requires other changes, resulting a large patchset, and when a large patchset just happens to be released because multiple patches are available.  Some dependencies may be outside the control of the vendor.  This is easier to test when your patch has dependencies on an underlying Operating System, for instance, but more difficult if the dependency is on the opposite direction.  If you’re the one providing the underlying update and the customer is using software that you don’t explicitly test, then it’s incumbent on you, I’d argue, to use some of the other techniques that I’ve outlined to help your customers understand likely impact.

Visibility of likely impact

One obvious option available to those providing patches is a good description of areas of impact.  You’d hope that everyone did this already, of course, but a brief line something like “this update is for the storage subsystem, and should affect only those systems using EXT3”, for instance, is a great help in deciding the likely impact of a patch.  You can’t always get it right – there may always be unexpected consequences, and vendors can’t test for all configurations.  But they should at least test all supported configurations…

Risk statements

This is tricky, and maybe political, but is it time that we started giving those customers who need it a little more detail about the likely impact of the changes within a patch?  It’s difficult to quantify, of course: a one-character change may affect 95% of the flows through a module, whereas what may seem like a simple functional addition to a customer may actually require thousands of lines of code.  But as vendors, we should have an idea of the impact of a change, and we ought to be considering how we expose that to customers.

Combinations

Beyond that, however, I think there are opportunities for customers to understand what the impact of not having accepted a previous patch is.  Maybe the risk of accepting patch A is low, but the risk of not accepting patch A and patch B is much higher.  Maybe it’s safer to accept patch A and patch C, but wait for a successor to patch B.  I’m not sure quite how to quantify this, or how it might work, but I think there’s grounds for research******.

Conclusion

Businesses have every right not to patch.  There are business reasons to balance the risk of patching against not patching.  But the balance is currently often tipped too far in direction of not patching.  Much too far.  And if we’re going to improve the state of IT security, we, the industry, need to do something about it.  By helping organisations with better information, by encouraging them to adopt better practices, by training them in how to assess risk, and by adopting better practices ourselves.

 


*see what I did there?

**your commands my vary.

***this almost sounds like a very good excuse for a second phone, though I’m not sure that my wife would agree.

****I’d certainly be interested to hear of others: please let me know via comments.

*****you do have these two things, right?  Because if you don’t, you’re really not doing DevOps.  Sorry.

******as soon as I wrote this, I realised that somebody’s bound to have done research on this issue.  Please let me know if you have: or know somebody who has.

 

Embracing fallibility

History repeats itself because no one was listening the first time. (Anonymous)

We’re all fallible.  You’re fallible, he’s fallible, she’s fallible, I’m fallible*.  We all get things wrong from time to time, and the generally accepted “modern” management approach is that it’s OK to fail – “fail early, fail often” – as long as you learn from your mistakes.  In fact, there’s a growing view that if you’d don’t fail, you can’t learn – or that your learning will be slower, and restricted.

The problem with some fields – and IT security is one of them – is that failing can be a very bad thing, with lots of very unexpected consequences.  This is particularly true for operational security, but the same can be the case for application, infrastructure or feature security.  In fact, one of the few expected consequences is that call to visit your boss once things are over, so that you can find out how many days*** you still have left with your organisation.  But if we are to be able to make mistakes**** and learn from them, we need to find ways to allow failure to happen without catastrophic consequences to our organisations (and our careers).

The first thing to be aware of is that we can learn from other people’s mistakes.  There’s a famous aphorism, supposedly first said by George Santayana and often translated as “Those who cannot learn from history are doomed to repeat it.”  I quite like the alternative:  “History repeats itself because no one was listening the first time.”  So, let’s listen, and let’s consider how to learn from other people’s mistakes (and our own).  The classic way of thinking about this is by following “best practices”, but I have a couple of problems with this phrase.  The first is that very rarely can you be certain that the context in which you’re operating is exactly the same as that of those who framed these practices.  The other – possibly more important – is that “best” suggests the summit of possibilities: you can’t do better than best.  But we all know that many practices can indeed be improved on.  For that reason, I rather like the alternative, much-used at Intel Corporation, which is “BKMs”: Best Known Methods.  This suggests that there may well be better approaches waiting to be discovered.  It also talks about methods, which suggests to me more conscious activities than practices, which may become unconscious or uncritical followings of others.

What other opportunities are open to us to fail?  Well, to return to a theme which is dear to my heart, we can – and must – discuss with those within our organisations who run the business what levels of risk are appropriate, and explain that we know that mistakes can occur, so how can we mitigate against them and work around them?  And there’s the word “mitigate” – another approach is to consider managed degradation as one way to protect our organisations***** from the full impact of failure.

Another is to embrace methodologies which have failure as a key part of their philosophy.  The most obvious is Agile Programming, which can be extended to other disciplines, and, when combined with DevOps, allows not only for fast failure but fast correction of failures.  I plan to discuss DevOps – and DevSecOps, the practice of rolling security into DevOps – in more detail in a future post.

One last approach that springs to mind, and which should always be part of our arsenal, is defence in depth.  We should be assured that if one element of a system fails, that’s not the end of the whole kit and caboodle******.  That only works if we’ve thought about single points of failure, of course.

The approaches above are all well and good, but I’m not entirely convinced that any one of them – or a combination of them – gives us a complete enough picture that we can fully embrace “fail fast, fail often”.  There are other pieces, too, including testing, monitoring, and organisational cultural change – an important and often overlooked element – that need to be considered, but it feels to me that we have some way to go, still.  I’d be very interested to hear your thoughts and comments.

 


*my family is very clear on this point**.

**I’m trying to keep it from my manager.

***or if you’re very unlucky, minutes.

****amusingly, I first typed this word as “misteaks”.  You’ve got to love those Freudian slips.

*****and hence ourselves.

******no excuse – I just love the phrase.

 

 

Service degradation: actually a good thing

…here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.

Well, not all the time, obviously*.  But bear with me: we spend most of our time ensuring that all of our systems are up and secure and working as expected, because that’s what we hope for, but there’s a real argument for not only finding out what happens when they don’t, and not just planning for when they don’t, but also planning for how they shouldn’t.  Let’s start by examining some techniques for how we might do that.

Part 1 – planning

There’s a story** that the oil company Shell, in the 1970’s, did some scenario planning that examined what were considered, at the time, very unlikely events, and which allowed it to react when OPEC’s strategy surprised most of the rest of the industry a few years later.  Sensitivity modelling is another technique that organisations use at the financial level to understand what impact various changes – in order fulfilment, currency exchange or interest rates, for instance – make to the various parts of their business.  Yet another is war gaming, which the military use to try to understand what will happen when failures occur: putting real people and their associated systems into situations and watching them react.  And Netflix are famous for taking this a step further in the context of the IT world and having a virtual Chaos Monkey (a set of processes and scripts) which they use to bring down parts of their systems in real time to allow them to understand how resilient they the wider system is.

So that gives us four approaches that are applicable, with various options for automation:

  1. scenario planning – trying to understand what impact large scale events might have on your systems;
  2. sensitivity planning – modelling the impact on your systems of specific changes to the operating environment;
  3. wargaming – putting your people and systems through simulated events to see what happens;
  4. real outages – testing your people and systems with actual events and failures.

Actually going out of your way to sabotage your own systems might seem like insane behaviour, but it’s actually a work of genius.  If you don’t plan for failure, what are you going to do when it happens?

So let’s say that you’ve adopted all of these practices****: what are you going to do with the information?  Well, there are some obvious things you can do, such as:

  • removing discovered weaknesses;
  • improving resilience;
  • getting rid of single points of failure;
  • ensuring that you have adequately trained staff;
  • making sure that your backups are protected, but available to authorised entities.

I won’t try to compile an exhaustive list, because there are loads books and articles and training courses about this sort of thing, but there’s another, maybe less obvious, course of action which I believe we must take, and that’s plan for managed degradation.

Part 2 – managed degradation

What do I mean by that?  Well, it’s simple.  We***** are trained and indoctrinated to take the view that if something fails, it must always “fail to safe” or “fail to secure”.  If something stops working right, it should stop working at all.

There’s value in this approach, of course there is, and we’re paid****** to ensure everything is secure, right?  Wrong.  We’re actually paid to help keep the business running, and here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.  Crazy, right?  “The business” want to keep making money and servicing customers even if things aren’t perfectly secure!  Don’t they know the risks?

And the answer to that question is “no”.  They don’t know the risks.  And that’s our real job: we need to explain the risks and the mitigations, and allow a balancing act to take place.  In fact, we’re always making those trade-offs and managing that balance – after all, the only truly secure computer is one with no network connection, no keyboard, no mouse and no power connection*******.  But most of the time, we don’t need to explain the decisions we make around risk: we just take them, following best industry practice, regulatory requirements and the rest.  Nor are the trade-offs usually so stark, because when failure strikes – whether through an attack, accident or misfortune – it’s often a pretty simple choice between maintaining a particular security posture and keeping the lights on.  So we need to think about and plan for some degradation, and realise that on occasion, we may need to adopt a different security posture to the perfect (or at least preferred) one in which we normally operate.

How would we do that?  Well, the approach I’m advocating is best described as “managed degradation”.  We allow our systems – including, where necessary our security systems – to degrade to a managed (and preferably planned) state, where we know that they’re not operating at peak efficiency, but where they are operating.  Key, however, is that we know the conditions under which they’re working, so we understand their operational parameters, and can explain and manage the risks associated with this new posture.  That posture may change, in response to ongoing events, and the systems and our responses to those events, so we need to plan ahead (using the techniques I discussed above) so that we can be flexible enough to provide real resiliency.

We need to find modes of operation which don’t expose the crown jewels******** of the business, but do allow key business operations to take place.  And those key business operations may not be the ones we expect – maybe it’s more important to be able to create new orders than to collect payments for them, for instance, at least in the short term.  So we need to discuss the options with the business, and respond to their needs.  This planning is not just security resiliency planning: it’s business resiliency planning.  We won’t be able to consider all the possible failures – though the techniques I outlined above will help us to identify many of them – but the more we plan for, the better we will be at reacting to the surprises.  And, possibly best of all, we’ll be talking to the business, informing them, learning from them, and even, maybe just a bit, helping them understand that the job we do does have some value after all.


*I’m assuming that we’re the Good Guys/Gals**.

**Maybe less story than MBA*** case study.

***There’s no shame in it.

****Well done, by the way.

*****The mythical security community again – see past posts.

******Hopefully…

*******Preferably at the bottom of a well, encased in concrete, with all storage already removed and destroyed.

********Probably not the actual Crown Jewels, unless you work at the Tower of London.