A User Advisory Council for the CCC

The CCC is currently working to create a User Advisory Council (UAC)

Disclaimer: the views expressed in this article (and this blog) do not necessarily reflect those of any of the organisations or companies mentioned, including my employer (Red Hat) or the Confidential Computing Consortium.

The Confidential Computing Consortium was officially formed in October 2019, nearly a year and a half ago now. Despite not setting out to be a high membership organisation, nor going out of its way to recruit members, there are, at time of writing, 9 Premier members (of which Red Hat, my employer, is one), 22 General members, and 3 Associate members. You can find a list of each here, and a brief analysis I did of their business interests a few weeks ago in this article: Review of CCC members by business interests.

The CCC has two major committees (beyond the Governing Board):

  • Technical Advisory Council (TAC) – this coordinates all technical areas in which the CCC is involved. It recommends whether software projects should be accepted into the CCC (no hardware projects have been introduced so far, though it’s possible they might be), coordinates activities like special interest groups (we expect one on Attestation to start very soon), encourages work across projects, manages conversations with other technical bodies, and produces material such as the technical white paper listed here.
  • Outreach Committee – when we started the CCC, we decided against going with the title “Marketing Committee”, as we didn’t think it represented the work we hoped this committee would be doing, and this was a good decision. Though there are activities which might fall under this heading, the work of the Outreach Committee is much wider, including analyst and press relations, creation of other materials, community outreach, cross-project discussions, encouraging community discussions, event planning, webinar series and beyond.

These two committees have served the CCC well, but now that it’s fairly well established, and has a fairly broad industry membership of hardware manufacturers, CSPs, service providers and ISVs (see my other article), we decided that there was one set of interested parties who were not well-represented, and which the current organisational structure did not do a sufficient job of encouraging to get involved: end-users.

It’s all very well the industry doing amazing innovation, coming up with astonishingly well-designed, easy to integrate, security-optimised hardware-software systems for confidential computing if nobody wants to use them. Don’t get me wrong: we know from many conversations with organisations across multiple sectors that users absolutely want to be able to make use of TEEs and confidential computing. That is not the same, however, as understanding their use cases in detail and ensuring that we – the members of the CCC, who are focussed mainly on creating services and software – actually provide what users need. These users are across many sectors – finance, government, healthcare, pharmaceutical, Edge, to name but a few – and their use cases and requirements are going to be different.

This is why the CCC is currently working to create a User Advisory Council (UAC). The details are being worked out at the moment, but the idea is that potential and existing users of confidential computing technologies should have a forum in which they can connect with the leaders in the space (which hopefully describes the CCC members), share their use cases, find out more about the projects which are part of the CCC, and even take a close look at those projects most relevant to them and their needs. This sort of engagement isn’t likely, on the whole, to require attendance at lots of meetings, or to have frequent input into the sorts of discussions which the TAC and the Outreach Committee typically consider, and the general feeling is that as we (the CCC) are aiming to service these users, we shouldn’t be asking them to pay for the privilege (!) of talking to us. The intention, then, is to allow a low bar for involvement in the UAC, and for there to be no membership fee required. That’s not to stop UAC members from joining the CCC as members if they wish – it would be a great outcome if some felt that they were so keen to become more involved that membership was appropriate – but there should be no expectation of that level of commitment.

I should be clear that the plans for the UAC are not complete yet, and some of the above may change. Nor should you consider this a formal announcement – I’m writing this article because I think it’s interesting, and because I believe that this is a vital next step in how those involved with confidential computing engage with the broader world, not because I represent the CCC in this context. But there’s always a danger that “cool” new technologies develop into something which fits only the fundamentally imaginary needs of technologists (and I’ll put my hand up and say that I’m one of those), rather than the actual needs of businesses and organisations which are struggling to operate around difficult issues in the real world. The User Advisory Council, if it works as we hope, should allow the techies (me, again) to hear from people and organisations about what they want our technologies to do, and to allow the CCC to steer its efforts in these directions.

Managing by exception

What I want, as a human, is interesting opportunities to apply my expertise

I’ve visited a family member this week (see why in last week’s article), in a way which is allowed under the UK Covid-19 lockdown rules as an “exceptional circumstance”. When governments and civil authorities are managing what their citizens are allowed to do, many jurisdictions (including the UK, where I live) follow the general principle of “Everything which is not forbidden is allowed”. This becomes complicated when you’re putting in (hopefully short-term) restrictions on civil liberties such as disallowing general movement to visit family and friends, but in the general case, it makes a lot of sense. It allows for the principle of “management by exception”: rather than taking the approach that you check that every journey is allowed, you look out for disallowed journeys (taking an unnecessary trip to a castle in the north of England, for instance) and (hopefully) punish those who have undertaken them.

What astonishes me about the world of IT – and security in particular – is how often we take the opposite approach. We record every single log result and transfer it across the network just in case there’s a problem. We have humans in the chain for every software build, checking that the correct versions of containers have been used. When what we should be doing is involving humans – the expensive parts of the chain – only when they’re needed, and only sending lots of results across the network – the expensive part of that chain – when the system which is generating the logs is under attack, or has registered a fault.

That’s not to say that we shouldn’t be recording information, but that we should be intelligent about how we use it: which means that we should be automating. Automation allows us to manage – that is, apply the expensive operations in a chain – only when it is relevant. Having a list of allowed container images, and then asking the developer why she has chosen a non-standard option, is so, so much cheaper for the organisation, not to mention more interesting for the container expert, than monitoring every single build. Allowing the system generating logs to increase the amount of information sent when it realises it’s under attack – or sending it a command to up what it sends when a possible problem is noticed remotely – is more efficient than the alternative.
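
To make that concrete, here’s a minimal sketch of the container image case (the allow-list, the image names and the notify_reviewer() function are all invented for illustration): builds using an approved image sail through with no human involvement, and only the exceptions are flagged for someone to look at.

```python
# Sketch: management by exception for container base images.
# The allow-list, the image names and notify_reviewer() are all
# invented for illustration.

ALLOWED_BASE_IMAGES = {
    "registry.example.com/ubi9:latest",
    "registry.example.com/ubi9-minimal:latest",
}

def notify_reviewer(build_id: str, base_image: str) -> None:
    # Stand-in for raising a ticket, sending a chat message or similar.
    print(f"Build {build_id} uses non-standard image {base_image}: please review")

def check_build(build_id: str, base_image: str) -> bool:
    """Return True if the build can proceed with no human involvement."""
    if base_image in ALLOWED_BASE_IMAGES:
        return True                          # nominal case: no human needed
    notify_reviewer(build_id, base_image)    # the exception: ask the developer why
    return False
```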

The other thing I’m not saying is that we should just ignore information that’s generated in normal cases, where operation is “nominal”. The growing opportunities to apply AI/ML techniques to this, allowing us to realise what is outside normal operation and to become more sensitive to when we need to apply those expensive components in a system, make a lot of sense. Sometimes, statistical sampling is required, where we can’t expect all of the data to be provided to those systems (in the remote logging case, for instance), or distributed systems with remote agents need to be designed.

What I want, as a human, is interesting opportunities to apply my expertise, where I can make a difference, rather than routine problems (if you have routine problems, you have broader, more concerning issues) which don’t test me, and which don’t make a broader difference to how the systems and processes I’m involved with run. That won’t happen unless I can be part of an organisation where management by exception is the norm.

One final thing that I should be clear about is that I’m also not talking about an approach where “everything which isn’t explicitly allowed is disallowed” – that doesn’t sound like a great approach for security (I may not be a huge fan of the term zero-trust, but I’m not that opposed to it). It’s the results of the decisions that we care about, on the whole, and given the amount of information that’s becoming available, we just have to automate where we can manage it. Even worse than not managing by exception is doing nothing with the data at all!

It doesn’t happen often, but let’s realise that, on this occasion, we have something to learn from our governments, and manage by exception.

“All systems nominal” – borrowing a useful phrase

Wouldn’t it be lovely if everything were functioning exactly as it should?

Wouldn’t it be lovely if everything were functioning exactly as it should? All the time. Alas, that is not to be our lot: we in IT security know that there’s always something that needs attention, something that’s not quite right, something that’s limping along and needs to be fixed.

The question I want to address in this article is whether that’s actually OK. I’ve written before about managed degradation (the idea that planning for when things go wrong is a useful and important practice[1]) in Service degradation: actually a good thing. The subject of this article, however, is living in a world where everything is running almost normally – or, more specifically, “nominally”. Most definitions of the word “nominal” focus on its meaning of “theoretical” or “token” value. A quick search of online dictionaries provided two definitions which were more relevant to the usage I’m going to be looking at today:

  • informal (chiefly in the context of space travel) functioning normally or acceptably[2].
  • being according to plan: satisfactory[3].

I’d like to offer a slightly different one:

  • within acceptable parameters for normal system operation.

I’ve seen “tolerances” used instead of “parameters”, and that works, too, but it’s not a word that I think we use much within IT security, so I lean towards “parameters”[4].

Utility

Why do I think that this is a useful concept? Because, as I noted above, we all know that it’s a rare day when everything works perfectly. But we find ways to muddle through, or we find enough bandwidth to make the backups happen without significant impact on database performance, or we only lose 1% of the credit card details collected that day[5]. All of these (except the last one) are fine, and if we are wise, we might start actually defining what counts as acceptable operation – nominal operation – for all of our users. This is what we should be striving for, and knowing exactly how far off perfect operation we are will give us clues as to how much effort we need to expend to close the gap, and how quickly we need to perform mitigations.

In fact, many organisations which provide services do this already: that’s where SLAs (Service Level Agreements) come from. I remember, at school, doing some maths[6] around food companies ensuring that they were in little danger of being in trouble for under-filling containers by looking at standard deviations to understand what the likely amount in each would be. This is similar, and the likelihood of hardware failures, based on similar calculations, is often factored into uptime planning.
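
As a toy version of that arithmetic (every figure below is made up), Python’s statistics module will happily tell you how often a normally distributed measurement should fall outside whatever you’ve declared to be nominal:

```python
# Sketch: how often will a measurement fall outside "nominal"?
# The mean, standard deviation and threshold are all made-up figures.
from statistics import NormalDist

fill_grams = NormalDist(mu=502.0, sigma=1.5)    # filling process: mean and spread
declared_minimum = 500.0                         # the amount printed on the label

p_underfill = fill_grams.cdf(declared_minimum)   # chance a single container is under-filled
print(f"Chance of an under-filled container: {p_underfill:.2%}")

# The same arithmetic applies to response times, error rates or hardware
# failure probabilities when deciding what "nominal" means for an SLA.
```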

So far, much of the above seems to be about resilience: where does security come in? Well, your security components, features and functionality are subject to the same resiliency issues as any other part of your system. The problem is that if a security piece fails, it may well be a single point of failure, which means that although the rest of the system is operating at 99% performance, your security just hit zero.

These are the reasons that we perform failure analysis, and why we consider defence in depth. But when we’re looking at defence in depth, do we remember to perform second order analysis? For instance, if my back-up LDAP server for user authentication is running on older hardware, what are the chances that it will fail when put under load?

Broader usage

It should come as no surprise to regular readers that I want to expand the scope of the discussion beyond just hardware and software components in systems to the people who are involved, and beyond that to processes in general.

Train companies are all too aware of the impact on their services if a bad flu epidemic hits their drivers – or if the weather is good, so their staff prefer to enjoy the sunshine with their families, rather than take voluntary overtime[7]. You may have considered the impact of a staff member or two being sick, but have you gone as far as modelling what would happen if four members of your team were sick, and, just as important, how likely that is? Equally vital to consider may be issues of team dynamics, or even terrorist attacks or union disputes. What about external factors, like staff not being able to get into work because of train cancellations? What are the chances of broadband failures occurring at the same time, scuppering your fall-back plan of allowing key staff to work from home?

We can go deeper and deeper into this, and at some point it becomes pointless to go any further. But I believe that it’s useful to consider how far to go with it, and also to spend some time considering exactly what you consider “nominal” operation, particularly for your security systems.


1 – I nearly wrote “art”.

2 – Oxford Dictionaries: https://en.oxforddictionaries.com/definition/nominal.

3 – Merriam-Webster: https://www.merriam-webster.com/dictionary/nominal.

4 – this article was inspired by the Public Service Broadcasting song “Go”. Listen to it: it rocks. And they’re great live, too.

5 – Note: this is a joke, and not a very funny one. You’ve probably just committed a GDPR breach, and you need to tell someone about it. Now.

6 – in the UK, and most countries speaking versions of Commonwealth English, we do more than one math. Because “mathematics“.

7 – this happened: I saw it on TV[8].

8 – so it must be true.

Mitigate or remediate?

What’s the difference between mitigate and remediate?

I very, very nearly titled this article “The ‘aters gonna ‘ate”, and then thought better of it. This is a rare event, and I put it down to the extreme temperatures that we’re having here in the UK at the moment[1].

What prompted this article was reading something online today where I saw the word mitigate, and thought to myself, “When did I start using that word? It’s not a normal one to drop into conversation. And what about remediate? What’s the difference between mitigate and remediate? In fact, how well could I describe the difference between the two?” Both are quite jargon-y words, so, in the spirit of my recent article Jargon – a force for good or ill? here’s my attempt at describing the difference, but also pointing out how important both are – along with a couple of other “-ate” words.

Let’s work backwards.

Remediate

Remediation is a set of actions to get things back the way they should be. It’s the last step in the process of recovery from an attack or other failure. The re- prefix here is the give-away: like resetting and reconciliation. When you’re remediating, there may be an expectation that you’ll be returning your systems to the same state they were in before – a power failure, for example – but that’s not necessarily the case. What you should be focussing on is the service you’re providing, rather than the system(s) that are providing it. A set of steps for remediation might require you to replace your database with another one from a completely different provider[3], to add a load-balancer and to change all of your hardware, but you’re still remediating the problem that hit you. At the very least, if you’ve suffered an attack, you should make sure that you plug any hole that allowed the attacker in to start with[4].

Mitigate

Mitigation isn’t about making things better – it’s about reducing the impact of an attack or failure. Mitigation is the first set of steps you take when you realise that you’ve got a problem. You could argue that these things are connected, but mitigation isn’t about returning the service to normal (remediation), but about taking steps to reduce the impact of an attack. Those mitigations may be external – adding load-balancers to help deal with a DDoS attack, maybe – or internal – shutting down systems that have been actively compromised.

In fact, I’d argue that some mitigations might quite properly actually have an adverse effect on the service: there may be short term reductions in availability to ensure that long-term remediations can be performed. Planning for this is vitally important, as I’ve discussed previously, in Service degradation: actually a good thing.

The other -ates: update and operate

As I promised above, there are a couple of other words that are part of your response to an attack – or possible attack. The first is update. Updating is one of the key measures that you can put in place to reduce the chance of a successful attack. It’s the step you take before mitigation – because, if you’re lucky, you won’t need mitigation, because you’ll be immune from attack[5].

The second of these is operate. Operation is your normal state: it’s where you want to be. But operation doesn’t mean that you can just sit back and feel secure: you need to be keeping an eye on what’s going on, planning upgrades, considering mitigations and preparing for remediations. We too often think of the operate step as our Happy Place, where all is rosy, and we can sit back and watch the daisies grow. As DevOps (and DevSecOps) is teaching us, this is absolutely not the case: operate is very much a dynamic state, and not a passive one.
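
If it helps, here’s a very rough sketch of how I picture the four “-ates” fitting together as a loop – it’s just an illustration of the flow described above, not a formal model:

```python
# Sketch: the four "-ates" as a very rough response loop.
from enum import Enum, auto

class Phase(Enum):
    UPDATE = auto()      # patch ahead of time, reducing the chance of a successful attack
    OPERATE = auto()     # normal, but watchful, state
    MITIGATE = auto()    # reduce the impact of an attack or failure in progress
    REMEDIATE = auto()   # get the *service* back the way it should be

def next_phase(current: Phase, problem_detected: bool) -> Phase:
    if current is Phase.OPERATE and problem_detected:
        return Phase.MITIGATE
    if current is Phase.MITIGATE:
        return Phase.REMEDIATE
    if current is Phase.REMEDIATE:
        return Phase.UPDATE          # plug the hole that let the attacker in
    return Phase.OPERATE             # updates applied: back to (dynamic) operation
```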


1 – 30C (~86F) is hot for the UK. Oh, and most houses (ours included) don’t have air conditioning[2].

2 – and I work from an office in the garden. Direct sunlight through the mainly glass wall is a blessing in the winter. Less so in the summer.

3 – hopefully open source, of course.

4 – hint: patch, patch, patch.

5 – well, that attack, at least. You’re never immune from all attacks: sorry.

In praise of the CIA

CIA is not sufficient to ensure security within a system.

In the wake of the widespread failure of the Visa processing network on Friday last week [1] (see The Register for more details), I thought it might be time to revisit that useful aide memoire, C.I.A.:

  • Confidentiality
  • Integrity
  • Availability.

This isn’t the first time I’ve written about this trio, and I doubt that it’ll be the last.  However, this particular incident seems like a perfect example to examine the least-regarded of the three – availability – and also to cogitate somewhat on how the CIA is necessary, but not sufficient[3].

Availability

As far as we can tell, the problem with the Visa payment system came down to a hardware failure.  As someone who used to work as a software engineer, I can tell you that this is by far the best type of failure, because there’s very little you can do about it once you’ve diagnosed it[6], which means that it quickly becomes SEP[7].  Be that as it may, the result of this hardware problem was that a large percentage of the network was unable to access Visa processing capabilities correctly.  Though ATMs[8] generally worked, it seems, payment using card readers generally didn’t.

How is this a security problem?  Well, one way to answer that question is to say that if security is about reducing risk to your business, then as this caused significant damage to Visa’s revenue stream – not to mention its reputation – the risk materialised, and there was a security failure.  I would be interested to know, however, how many organisations have their security teams in charge of ensuring up-time and availability of their systems in terms of guarding against vulnerabilities such as hardware failures.  My suspicion is that the availability-safeguarding performed by security teams generally extends only to managing denial of service or other malicious attacks.

I would argue that more organisations should consider this part of the security team’s mandate, to be honest, because the impacts are very similar, and many of the mitigations will be the same.  Of course, if you’re already an integrated Ops team – or even moving to a DevOps or DevSecOps model – then well done you: I’m sure you’re 100% safe from anything similar befalling you[10].

Consistency and correctness

As I mentioned above, there’s a criticism which is often levelled at the CIA triad, which is that confidentiality, integrity and availability are not, on their own, sufficient to design and run a system.

The Visa incident is a perfect example of why this is the case.  It appears that the outage was not complete, as even at card readers, some amount of information was going through when a transaction was attempted.  This meant that for some (attempted) transactions, at least, debits were appearing on accounts even when they were not being recorded as credits at the vendor’s side.  What does this mean?  In simple terms, money was coming out of people’s accounts, but not going to the people they were trying to pay.  I’m not an expert on retail banking, but I believe that this is pretty much the opposite of how a financial transaction is supposed to work.

You can’t really blame this on a lack of confidentiality or a lack of integrity.  Nor is it really to do with a lack of availability – it may have been a side effect of the same cause as the availability failures, but that doesn’t mean that it caused them[11].

These problems can be characterised in two ways: as a lack of consistency and/or a lack of correctness.  In a system, data should be consistent across the system, so when a debit shows up with no corresponding credit, there is a failure of consistency.  This lack of consistency also points to a lack of correctness: in fact, the very point of double-entry book-keeping is to allow these sorts of errors to be spotted.
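
To illustrate (with an invented transaction format), the check that double-entry book-keeping gives you is little more than confirming that every debit has a matching credit:

```python
# Sketch: spotting the Visa-style failure mode, where a debit appears
# with no matching credit. The transaction format is invented.

transactions = [
    {"id": "t1", "debit": ("alice", 20.00), "credit": ("coffee_shop", 20.00)},
    {"id": "t2", "debit": ("bob", 35.50), "credit": None},   # money out, nowhere in
]

def inconsistent(txns):
    """Return the ids of transactions whose debit and credit don't balance."""
    bad = []
    for t in txns:
        if t["credit"] is None or t["debit"][1] != t["credit"][1]:
            bad.append(t["id"])
    return bad

print(inconsistent(transactions))   # ['t2'] - a failure of consistency
```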

What this tells us is not only that CIA is not sufficient to ensure security within a system but also that there exist other mechanisms – some very ancient – that allow us to manage our systems and to mitigate failures.


1 – at time of writing.  If you’re reading this after, say, the 10th or 11th of June 2018, then it was longer ago than that[2].

2 – unless there’s been another outage, in which case it may be time to start taking out cash and stuffing it into your mattress.

3 – in no sense is this a comment on the Central Intelligence Agency.  I am unqualified to discuss that particular, august[4] body.  Nor would I consider it in my best interests to do so[5].

4 – it was actually founded in the month of September, according to the Interwebs.

5 – or the best interests of my readers.

6 – which can, admittedly, take quite a long time, as you’re probably looking in the wrong places if you, like me, generally assume that the bug is in your own code.

7 – Somebody Else’s Problem (hat tip to the late, great Douglas Adams).

8 – “cash points” (US), “holes in the wall” (UK)[9].

9 – yes, we really do call them this: “I need to get some money from the hole in the wall”.  It’s descriptive and accurate: what more do you want?

10 – no, I know you’re not, and you know you’re not, but this will make everybody else feel that little bit more nervous, and you can feel a little bit more smug, which is always nice, isn’t it?

11 – cue a link to one of my most favourite comics of all time: https://xkcd.com/552/.

What’s your availability? DoS attacks and more

In security we talk about intentional degradation of availability

A colleague of mine recently asked me about protection from DoS attacks[1] for a project with which he’s involved – Denial of Service attacks.  The first thing that sprung to mind, of course, was DDoS: Distributed Denial of Service attacks, where hundreds or thousands[2] of hosts are used to send vast amounts of network traffic to – or maybe more accurately “at” – servers in the hopes of bringing the servers to their knees and stopping them providing the service for which they’re designed.  These are the attacks that get into the news, and with good reason.

There are other types of DoS however, and the more I thought about it, the more I wondered whether he – and I – should be worrying about these other DoS attacks and also considering other related types of issue which could cause problems to systems.  And because I realised it was an interesting topic, I decided to write about it[3].

I’m going to return to the classic “C.I.A.” model of computer security: Confidentiality, Integrity and Availability.  The attacks we’re talking about here are those most often overlooked: attempts to degrade the availability of a service.  There’s an overlap with the related discipline of resilience here, but I think that the key differentiator is that in security we’re generally talking about intentional degradation of availability, whereas resilience also covers (and maybe focuses on) unintentional degradation.

So, what types of availability attacks might we want to consider?

Denial of service attacks

I think it’s worth linking to Wikipedia’s pretty awesome entry “Denial of service attack” – not something I often do, but I thought it was excellent.  Although they’re not mutually exclusive at all, here are some of the key types as I’d define them:

  • Distributed DoS – where you have lots of different hosts attacking at the same time, flooding the target with traffic.  These days, this can be easily automated, and it’s possible to rent compromised machines to perform a coordinated attack.
  • Application layer – where the attack is aimed at the service, rather than at the host beneath.  This may seem like an academic distinction, but it’s not: what it really means is that the attack is performed with knowledge of the application layer.  So, for instance, if you’re attacking a web server, you might initiate lots of HTTP sessions, or if you were attacking a Kerberos server, you might request lots of authentication tickets.  These types of attacks may be quite costly to perform, but they’re also difficult to protect against, as each attack looks like a “legal” interaction with the service, and unless you’re on the look-out at the application level – something which is typically not automated – they’re difficult to spot and avoid (there’s a small sketch of one such application-level control after this list).
  • Host level – this is a family of attacks which go for the host and/or associated Operating System, rather than the service itself.  A classic attack would be the SYN flood, which misused the TCP protocol to use up resources on the host, thereby stopping any associated services from being able to respond.  Host attacks may be somewhat simpler to defend against, as it’s easier to invest in logic to detect them at this level (or maybe “set of layers”, if we adopt the OSI model), and to correlate responses across different hosts.  Firewalls and similar defences are also more likely to be able to be configured to help defend hosts which may be targeted.
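
To make the application layer point above a little more concrete, here’s a minimal sketch of a per-client token bucket – the sort of control you might apply at the service itself. The rates are invented, and a real deployment would pair something like this with proper monitoring rather than relying on it alone:

```python
# Sketch: a per-client token bucket - an application-level control against
# application-layer flooding. The rates are illustrative only.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False              # over the limit: reject, queue or challenge the client

buckets: dict[str, TokenBucket] = {}

def handle_request(client_ip: str) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(rate_per_sec=5, burst=10))
    return bucket.allow()
```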

Resource starvation

The term “resource starvation” most accurately refers[4] to situations where a process (or application) is denied sufficient CPU allocation to perform correctly.  How could this occur?  Well, it’s going to be rarer than in the DoS case, because in order to do it, you’re going to need some way to impact the underlying scheduling of the Operating System and/or virtualisation management (think hypervisor, typically).  That would normally mean that you’d need pretty low-level access to the machine, but there is a family of attacks known as “noisy neighbour”[5] where workloads – VMs or containers, typically – use up so many resources that other workloads are starved.

However, partly because of this case, I’d argue that resource starvation can usefully be associated with other types of availability attacks which occur locally to the machine hosting the targeted service, which might be related to CPU, file descriptor, network or other resources.

Generally, noisy neighbour attacks can be fairly easily mitigated by controls in the Operating System or virtualisation manager, though, of course, compromised or malicious components at this layer are very difficult to manage.


Dependency blocking

I’m not sure what the best term for this type of attack is, but what I’m thinking of is attacks which impact a service by reducing or removing access to external services on which they depend – remote components, if you will.  If, for instance, my web application requires access to a database, then an attack on that database – however performed – will impact my service.  As almost any kind of service will have external dependencies these days[6], this can be a very effective attack, as it allows knowledgeable attackers to target the weakest link in the “chain” of components that make up your service.

There are mitigations against some of these attacks – caching and later reconciliation/synching being one – but identifying and defending against these sorts of attacks depends largely on considering your service as a system, and realising the types of impact degradation of the different parts might have.
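
Here’s a small sketch of the caching mitigation – fetch_record() is just a stand-in for whatever remote call your service depends on – where the idea is to serve slightly stale data rather than nothing at all when the dependency disappears:

```python
# Sketch: serve slightly stale cached data when a dependency disappears.
# fetch_record() below is a stand-in for whatever remote call you depend on.
import time

_cache: dict[str, tuple[float, object]] = {}
CACHE_TTL = 60.0     # seconds of "freshness" - an illustrative value

def get_record(key: str, fetch_record) -> object:
    now = time.time()
    cached = _cache.get(key)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]                  # fresh enough: no remote call needed
    try:
        value = fetch_record(key)
        _cache[key] = (now, value)
        return value
    except ConnectionError:
        if cached:
            return cached[1]              # dependency down: serve stale data
        raise                             # nothing cached: degrade visibly
```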


Conclusion – managed degradation

Which leads me to a final point, which is that when considering availability attacks, understanding and planning for service degradation (see Service degradation: actually a good thing) is going to be invaluable – and when you’ve done that, you’re definitely going to need to test it, too (If it isn’t tested, it doesn’t work).



1 – yes, I checked the capitalisation – he wasn’t worried about DRDOS, MS-DOS or any of those lovely 80s era command line Operating Systems.

2 – or millions or more, these days.

3 – here, for the avoidance of doubt.

4 – I believe.

5 – you know my policy on spellings by now.  I’m British, and we’ll keep it that way.

6 – unless you’re still using green-screen standalone machines to run your business, in which case either a) yikes or b) well done.

The Curious Incident of the Patch in the Night-Time

Gregory: “The patch did nothing in the night-time.”
Holmes: “That was the curious incident.”

To misquote Sir Arthur Conan Doyle:

Gregory (cyber-security auditor) “Is there any other point to which you would wish to draw my attention?”
Holmes: “To the curious incident of the patch in the night-time.”
Gregory: “The patch did nothing in the night-time.”
Holmes: “That was the curious incident.”

I considered a variety of (munged) literary titles to head up this blog, and settled on the one above or “We Need to Talk about Patching”.  Either way round, there’s something rotten in the state of patching*.

Let me start with what I hope is a fairly uncontroversial statement: “we all know that patches are important for security and stability, and that we should really take them as soon as they’re available and patch all of our systems”.

I don’t know about you, but I suspect you’re the same as me: I run ‘sudo dnf --refresh upgrade’** on my home machines and work laptop at least once every day that I turn them on.  I nearly wrote that when an update comes out to patch my phone, I take it pretty much immediately, but actually, I’ve been burned before with dodgy patches, and I’ll often have a check of the patch number to see if anyone has spotted any problems with it before downloading it. This feels like basic due diligence, particularly as I don’t have a “staging phone” which I could use to test pre-production and see if my “production phone” is likely to be impacted***.
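
For what it’s worth, the checking step is easy to automate without applying anything: dnf documents its exit codes for check-update (0 for nothing pending, 100 for updates available), so a little sketch like the one below – the reporting side is obviously a placeholder – can tell you when there’s something worth reviewing:

```python
# Sketch: find out whether updates are pending without applying anything.
# Relies on dnf's documented exit codes for check-update:
#   0 = nothing pending, 100 = updates available, 1 = error.
import subprocess

def updates_pending() -> bool:
    result = subprocess.run(["dnf", "check-update", "--refresh"],
                            capture_output=True, text=True)
    if result.returncode == 1:
        raise RuntimeError(result.stderr)
    return result.returncode == 100

if __name__ == "__main__":
    if updates_pending():
        print("Updates available: review and schedule them")   # placeholder reporting
    else:
        print("Nothing pending")
```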

But the overwhelming evidence from the industry is that people really don’t apply patches – including security patches – even though they understand that they ought to.  I plan to post another blog entry at some point about similarities – and differences – between patching and vaccinations, but let’s take as read, for now, the assumption that organisations know they should patch, and look at the reasons they don’t, and what we might do to improve that.

Why people don’t patch

Here are the legitimate reasons that I can think of for organisations not patching****.

  1. they don’t know about patches
    a. not all patches are advertised well enough
    b. organisations don’t check for patches
  2. they don’t know about their systems
    a. incomplete knowledge of their IT estate
  3. legacy hardware
    a. patches not compatible with legacy hardware
  4. legacy software
    a. patches not compatible with legacy software
  5. known impact with up-to-date hardware & software
  6. possible impact with up-to-date hardware & software

Some of these are down to the organisations, or their operating environment, clearly: 1b, 2, 3 and 4.  The others, however, are down to us as an industry.  What it comes down to is a balance of risk: the IT operations department doesn’t dare to update software with patches because they know that if the systems that they maintain go down, they’re in real trouble.  Sometimes they know there will be a problem (typically because they test patches in a staging environment of some type), and sometimes they simply don’t dare take the risk.  This may be because they are in the middle of their own software update process, and the combination of Operating System, middleware or integrated software updates with their ongoing changes just can’t be trusted.

What we can do

Here are some thoughts about what we as an industry can do to try to address this problem – or set of problems.

Staging

Staging – what is a staging environment for?  It’s for testing changes before they go into production, of course.  But what changes?  Changes to your software, or your suppliers’ software?  The answer has to be “both”, I think.  You may need separate estates so that you can look at changes of these two sets of software separately before seeing what combining them does, but in the end, it is the combination of the two that matters.  You may consider using the same estate at different times to test the different options, but that’s not an option for all organisations.

DevOps

DevOps shouldn’t just be about allowing agile development practices to become part of the software lifecycle: it should also be about allowing agile operational practices to become a part of the software lifecycle.  DevOps can really help with patching strategy if you think of it this way.  Remember, in DevOps, everybody has responsibility.  So your DevOps pipeline is the perfect way to test how changes in your software are affected by changes in the underlying estate.  And because you’re updating regularly, and have unit tests to check all the key functionality*****, any changes can be spotted and addressed quickly.

Dependencies

Patches sometimes have dependencies.  We should be clear when a patch requires other changes, resulting in a large patchset, and when a large patchset just happens to be released because multiple patches are available.  Some dependencies may be outside the control of the vendor.  This is easier to test when your patch has dependencies on an underlying Operating System, for instance, but more difficult if the dependency is in the opposite direction.  If you’re the one providing the underlying update and the customer is using software that you don’t explicitly test, then it’s incumbent on you, I’d argue, to use some of the other techniques that I’ve outlined to help your customers understand the likely impact.

Visibility of likely impact

One obvious option available to those providing patches is a good description of areas of impact.  You’d hope that everyone did this already, of course, but a brief line something like “this update is for the storage subsystem, and should affect only those systems using EXT3”, for instance, is a great help in deciding the likely impact of a patch.  You can’t always get it right – there may always be unexpected consequences, and vendors can’t test for all configurations.  But they should at least test all supported configurations…

Risk statements

This is tricky, and maybe political, but is it time that we started giving those customers who need it a little more detail about the likely impact of the changes within a patch?  It’s difficult to quantify, of course: a one-character change may affect 95% of the flows through a module, whereas what may seem like a simple functional addition to a customer may actually require thousands of lines of code.  But as vendors, we should have an idea of the impact of a change, and we ought to be considering how we expose that to customers.

Combinations

Beyond that, however, I think there are opportunities for customers to understand what the impact of not having accepted a previous patch is.  Maybe the risk of accepting patch A is low, but the risk of not accepting patch A and patch B is much higher.  Maybe it’s safer to accept patch A and patch C, but wait for a successor to patch B.  I’m not sure quite how to quantify this, or how it might work, but I think there’s grounds for research******.

Conclusion

Businesses have every right not to patch.  There are business reasons to balance the risk of patching against not patching.  But the balance is currently often tipped too far in the direction of not patching.  Much too far.  And if we’re going to improve the state of IT security, we, the industry, need to do something about it.  By helping organisations with better information, by encouraging them to adopt better practices, by training them in how to assess risk, and by adopting better practices ourselves.



*see what I did there?

**your commands may vary.

***this almost sounds like a very good excuse for a second phone, though I’m not sure that my wife would agree.

****I’d certainly be interested to hear of others: please let me know via comments.

*****you do have these two things, right?  Because if you don’t, you’re really not doing DevOps.  Sorry.

******as soon as I wrote this, I realised that somebody’s bound to have done research on this issue.  Please let me know if you have: or know somebody who has.