Mitigate or remediate?

What’s the difference between mitigate and remediate?

I very, very nearly titled this article “The ‘aters gonna ‘ate”, and then thought better of it. This is a rare event, and I put it down to the extreme temperatures that we’re having here in the UK at the moment[1].

What prompted this article was reading something online today where I saw the word mitigate, and thought to myself, “When did I start using that word? It’s not a normal one to drop into conversation. And what about remediate? What’s the difference between mitigate and remediate? In fact, how well could I describe the difference between the two?” Both are quite jargon-y words, so, in the spirit of my recent article Jargon – a force for good or ill?, here’s my attempt at describing the difference, while also pointing out how important both are – along with a couple of other “-ate” words.

Let’s work backwards.

Remediate

Remediation is a set of actions to get things back the way they should be. It’s the last step in the process of recovery from an attack or other failure. The re- prefix here is the give-away: think of resetting and reconciliation. When you’re remediating, there may be an expectation that you’ll be returning your systems to the same state they were in before, say, a power failure, but that’s not necessarily the case. What you should be focussing on is the service you’re providing, rather than the system(s) providing it. A set of steps for remediation might require you to replace your database with another one from a completely different provider[3], to add a load-balancer and to change all of your hardware, but you’re still remediating the problem that hit you. At the very least, if you’ve suffered an attack, you should make sure that you plug any hole that allowed the attacker in to start with[4].

Mitigate

Mitigation isn’t about making things better – it’s about reducing the impact of an attack or failure. Mitigation is the first set of steps you take when you realise that you’ve got a problem. You could argue that the two are connected, but mitigation isn’t about returning the service to normal (that’s remediation): it’s about containing the damage while the problem is live. Those mitigations may be external – adding load-balancers to help deal with a DDoS attack, maybe – or internal – shutting down systems that have been actively compromised.

In fact, I’d argue that some mitigations might quite properly actually have an adverse effect on the service: there may be short term reductions in availability to ensure that long-term remediations can be performed. Planning for this is vitally important, as I’ve discussed previously, in Service degradation: actually a good thing.

The other -ates: update and operate

As I promised above, there are a couple of other words that are part of your response to an attack – or possible attack. The first is update. Updating is one of the key measures that you can put in place to reduce the chance of a successful attack. It’s the step you take before mitigation – because, if you’re lucky, you won’t need mitigation, because you’ll be immune from attack[5].

The second of these is operate. Operation is your normal state: it’s where you want to be. But operation doesn’t mean that you can just sit back and feel secure: you need to be keeping an eye on what’s going on, planning upgrades, considering mitigations and preparing for remediations. We too often think of the operate step as our Happy Place, where all is rosy, and we can sit back and watch the daisies grow. As DevOps (and DevSecOps) is teaching us, this is absolutely not the case: operate is very much a dynamic state, and not a passive one.


1 – 30C (~86F) is hot for the UK. Oh, and most houses (ours included) don’t have air conditioning[2].

2 – and I work from an office in the garden. Direct sunlight through the mainly glass wall is a blessing in the winter. Less so in the summer.

3 – hopefully open source, of course.

4 – hint: patch, patch, patch.

5 – well, that attack, at least. You’re never immune from all attacks: sorry.

In praise of the CIA

CIA is not sufficient to ensure security within a system.

In the wake of the widespread failure of the Visa processing network on Friday last week[1] (see The Register for more details), I thought it might be time to revisit that useful aide-memoire, C.I.A.:

  • Confidentiality
  • Integrity
  • Availability.

This isn’t the first time I’ve written about this trio, and I doubt that it’ll be the last.  However, this particular incident seems like a perfect example to examine the least-regarded of the three – availability – and also to cogitate somewhat on how the CIA is necessary, but not sufficient[3].

Availability

As far as we can tell, the problem with the Visa payment system came down to a hardware failure.  As someone who used to work as a software engineer, I can tell you that this is by far the best type of failure, because there’s very little you can do about it once you’ve diagnosed it[6], which means that it quickly becomes SEP[7].  Be that as it may, the result of this hardware problem was that a large percentage of the network was unable to access Visa processing capabilities correctly.  Though ATMs[8] generally worked, it seems, payment using card readers didn’t.

How is this a security problem?  Well, one way to answer that question is to say that if security is about reducing risk to your business, then as this caused significant damage to Visa’s revenue stream – not to mention its reputation – the risk materialised, and there was a security failure.  I would be interested to know, however, how many organisations have their security teams in charge of ensuring up-time and availability of their systems in terms of guarding against vulnerabilities such as hardware failures.  My suspicion is that the scope of availability-safeguarding by security teams generally extends only to managing denial of service or other malicious attacks.

I would argue that more organisations should consider this part of the security team’s mandate, to be honest, because the impacts are very similar, and many of the mitigations will be the same.  Of course, if you’re already an integrated Ops team – or even moving to a DevOps or DevSecOps model – then well done you: I’m sure you’re 100% safe from anything similar befalling you[10].

Consistency and correctness

As I mentioned above, there’s a criticism which is often levelled at the CIA triad, which is that confidentiality, integrity and availability are not, on their own, sufficient to design and run a system.

The Visa incident is a perfect example of why this is the case.  It appears that the outage was not complete, as even at card readers, some amount of information was going through when a transaction was attempted.  This meant that for some (attempted) transactions, at least, debits were appearing on accounts even when they were not being recorded as credits at the vendor’s side.  What does this mean?  In simple terms, money was coming out of people’s accounts, but not going to the people they were trying to pay.  I’m not an expert on retail banking, but I believe that this is pretty much the opposite of how a financial transaction is supposed to work.

You can’t really blame this on a lack of confidentiality or a lack of integrity.  Nor is it really to do with a lack of availability – it may have been a side effect of the same cause as the availability failures, but that doesn’t mean that one caused the other[11].

These problems can be characterised in two ways: as a lack of consistency and/or a lack of correctness.  In a system, data should be consistent across the system, so when a debit shows up with no corresponding credit, there is a failure of consistency.  That inconsistency, in turn, points to a lack of correctness: in fact, the very point of double-entry book-keeping is to allow these sorts of errors to be spotted.
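
To make the double-entry point concrete, here’s a minimal sketch in Python of the sort of check it enables – the ledger entries and amounts below are entirely made up for illustration:

    from collections import defaultdict

    # Toy ledger: each transaction should balance (debits == credits).
    # The entries and amounts below are invented for illustration only.
    ledger = [
        {"txn": "t1", "account": "customer", "debit": 25.00, "credit": 0.00},
        {"txn": "t1", "account": "vendor", "debit": 0.00, "credit": 25.00},
        {"txn": "t2", "account": "customer", "debit": 40.00, "credit": 0.00},
        # t2 has no corresponding credit entry: money out, nothing in
    ]

    totals = defaultdict(lambda: [0.0, 0.0])  # txn id -> [debits, credits]
    for entry in ledger:
        totals[entry["txn"]][0] += entry["debit"]
        totals[entry["txn"]][1] += entry["credit"]

    for txn, (debits, credits) in sorted(totals.items()):
        if abs(debits - credits) > 0.005:
            print(f"{txn}: debits {debits:.2f} != credits {credits:.2f} - inconsistent")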

What this tells us is not only that CIA is not sufficient to ensure security within a system but also that there exist other mechanisms – some very ancient – that allow us to manage our systems and to mitigate failures.


1 – at time of writing.  If you’re reading this after, say, the 10th or 11th of June 2018, then it was longer ago than that[2].

2 – unless there’s been another outage, in which case it may be time to start taking out cash and stuffing it into your mattress.

3 – in no sense is this a comment on the Central Intelligence Agency.  I am unqualified to discuss that particular, august[4] body.  Nor would I consider it in my best interests to do so[5].

4 – it was actually founded in the month of September, according to the Interwebs.

5 – or the best interests of my readers.

6 – which can, admittedly, take quite a long time, as you’re probably looking in the wrong places if you, like me, generally assume that the bug is in your own code.

7 – Somebody Else’s Problem (hat tip to the late, great Douglas Adams).

8 – “cash points” (US), “holes in the wall” (UK)[9].

9 – yes, we really do call them this: “I need to get some money from the hole in the wall”.  It’s descriptive and accurate: what more do you want?

10 – no, I know you’re not, and you know you’re not, but this will make everybody else feel that little bit more nervous, and you can feel a little bit more smug, which is always nice, isn’t it?

11 – cue a link to one of my most favourite comics of all time: https://xkcd.com/552/.

What’s your availability? DoS attacks and more

In security we talk about intentional degradation of availability

A colleague of mine recently asked me about protection from DoS attacks[1] for a project with which he’s involved – Denial of Service attacks.  The first thing that sprung to mind, of course, was DDoS: Distributed Denial of Service attacks, where hundreds or thousands[2] of hosts are used to send vast amounts of network traffic to – or maybe more accurately “at” – servers in the hopes of bringing the servers to their knees and stopping them providing the service for which they’re designed.  These are the attacks that get into the news, and with good reason.

There are other types of DoS however, and the more I thought about it, the more I wondered whether he – and I – should be worrying about these other DoS attacks and also considering other related types of issue which could cause problems to systems.  And because I realised it was an interesting topic, I decided to write about it[3].

I’m going to return to the classic “C.I.A.” model of computer security: Confidentiality, Integrity and Availability.  The attacks we’re talking about here are those most often overlooked: attempts to degrade the availability of a service.  There’s an overlap with the related discipline of resilience here, but I think that the key differentiator is that in security we’re generally talking about intentional degradation of availability, whereas resilience also covers (and maybe focuses on) unintentional degradation.

So, what types of availability attacks might we want to consider?

Denial of service attacks

I think it’s worth linking to Wikipedia’s pretty awesome entry “Denial of service attack” – not something I often do, but I thought it was excellent.  Although they’re not mutually exclusive at all, here are some of the key types as I’d define them:

  • Distributed DoS – where you have lots of different hosts attacking at the same time, flooding the target with traffic.  These days, this can be easily automated, and it’s possible to rent compromised machines to perform a coordinated attack.
  • Application layer – where the attack is aimed at the service, rather than at the host beneath.  This may seem like an academic distinction, but it’s not: what it really means is that the attack is performed with knowledge of the application layer.  So, for instance, if you’re attacking a web server, you might initiate lots of HTTP sessions, or if you were attacking a Kerberos server, you might request lots of authentication tickets.  These types of attacks may be quite costly to perform, but they’re also difficult to protect against, as each request looks like a “legal” interaction with the service, and unless you’re on the look-out for unusual patterns – something which is typically not automated at this level – they’re difficult to avoid (there’s a minimal detection sketch just after this list).
  • Host level – this is a family of attacks which go for the host and/or associated Operating System, rather than the service itself.  A classic attack would be the SYN flood, which misused the TCP protocol to use up resources on the host, thereby stopping any associated services from being able to respond.  Host attacks may be somewhat simpler to defend against, as it’s easier to invest in logic to detect them at this level (or maybe “set of layers”, if we adopt the OSI model), and to correlate responses across different hosts.  Firewalls and similar defences are also more likely to be able to be configured to help defend hosts which may be targeted.
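
As promised above, here’s a minimal sketch (in Python, with an invented threshold and a documentation-range IP address) of the sort of per-client rate check you might use to spot an application-layer flood – real detection needs knowledge of what “normal” traffic looks like for your service:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS_PER_WINDOW = 100  # invented threshold: tune to your service

    recent = defaultdict(deque)  # client id -> timestamps of recent requests

    def looks_like_flood(client_id, now=None):
        """Record a request and report whether this client exceeds the window limit."""
        now = time.monotonic() if now is None else now
        q = recent[client_id]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()  # drop requests that have fallen out of the window
        return len(q) > MAX_REQUESTS_PER_WINDOW

    # A client bursting 150 requests trips the check; normal clients don't.
    for _ in range(150):
        flagged = looks_like_flood("203.0.113.7")
    print("flagged" if flagged else "ok")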

Resource starvation

The term “resource starvation” most accurately refers[4] to situations where a process (or application) is denied sufficient CPU allocation to perform correctly.  How could this occur?  Well, it’s going to be rarer than in the DoS case, because in order to do it, you’re going to need some way to impact the underlying scheduling of the Operating System and/or virtualisation management (think hypervisor, typically).  That would normally mean that you’d need pretty low-level access to the machine, but there is a family of attacks known as “noisy neighbour”[5] where workloads – VMs or containers, typically – use up so many resources that other workloads are starved.

However, partly because of this case, I’d argue that resource starvation can usefully be associated with other types of availability attacks which occur locally to the machine hosting the targeted service, which might be related to CPU, file descriptor, network or other resources.

Generally, noisy neighbour attacks can be fairly easily mitigated by controls in the Operating System or virtualisation manager, though, of course, compromised or malicious components at this layer are very difficult to manage.
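
By way of illustration, here’s a minimal sketch of the sort of Operating System control I mean, using Linux cgroup v2 to cap a noisy workload’s CPU share.  It assumes cgroup v2 is mounted at /sys/fs/cgroup, that the cpu controller is enabled for the parent group and that you have sufficient privileges; the group name and PID are placeholders:

    from pathlib import Path

    CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumes cgroup v2 mounted here

    def cap_cpu(group_name, pid, quota_us=20000, period_us=100000):
        """Put a process into a new cgroup limited to ~20% of one CPU."""
        group = CGROUP_ROOT / group_name
        group.mkdir(exist_ok=True)
        # "20000 100000" means at most 20ms of CPU time per 100ms period;
        # requires the cpu controller to be enabled for the parent group.
        (group / "cpu.max").write_text(f"{quota_us} {period_us}\n")
        # Move the offending process into the capped group.
        (group / "cgroup.procs").write_text(f"{pid}\n")

    # cap_cpu("noisy-neighbour", 12345)  # hypothetical group name and PID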

 

Dependency blocking

I’m not sure what the best term for this type of attack is, but what I’m thinking of is attacks which impact a service by reducing or removing access to external services on which they depend – remote components, if you will.  If, for instance, my web application requires access to a database, then an attack on that database – however performed – will impact my service.  As almost any kind of service will have external dependencies these days[6], this can be a very effective attack, as it allows knowledgeable attackers to target the weakest link in the “chain” of components that make up your service.

There are mitigations against some of these attacks – caching and later reconciliation/synching being one – but identifying and defending against these sorts of attacks depends largely on considering your service as a system, and realising the types of impact degradation of the different parts might have.
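
Here’s a minimal sketch of the caching-and-later-reconciliation mitigation.  The fetch_from_db and write_to_db callables are hypothetical stand-ins for whatever client library you use, and I’m assuming it raises ConnectionError when the dependency is unreachable – adjust for your own stack:

    import time

    cache = {}           # key -> (value, timestamp of last good read)
    pending_writes = []  # writes to replay when the dependency recovers

    def read(key, fetch_from_db):
        """Prefer fresh data; fall back to (possibly stale) cached data."""
        try:
            value = fetch_from_db(key)
            cache[key] = (value, time.time())
            return value
        except ConnectionError:
            if key in cache:
                value, _ = cache[key]
                return value  # stale, but the service stays up
            raise  # nothing cached: some degradation is unavoidable

    def write(key, value, write_to_db):
        """Queue writes we can't deliver, for later reconciliation/synching."""
        try:
            write_to_db(key, value)
        except ConnectionError:
            pending_writes.append((key, value))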

 

Conclusion – managed degradation

Which leads me to a final point: when considering availability attacks, understanding and planning for managed degradation (see Service degradation: actually a good thing) is going to be invaluable – and when you’ve done that, you’re definitely going to need to test it, too (If it isn’t tested, it doesn’t work).

 


1 – yes, I checked the capitalisation – he wasn’t worried about DRDOS, MS-DOS or any of those lovely 80s era command line Operating Systems.

2 – or millions or more, these days.

3 – here, for the avoidance of doubt.

4 – I believe.

5 – you know my policy on spellings by now.  I’m British, and we’ll keep it that way.

6 – unless you’re still using green-screen standalone machines to run your business, in which case either a) yikes or b) well done.

The Curious Incident of the Patch in the Night-Time

Gregory: “The patch did nothing in the night-time.”
Holmes: “That was the curious incident.”

To misquote Sir Arthur Conan Doyle:

Gregory (cyber-security auditor): “Is there any other point to which you would wish to draw my attention?”
Holmes: “To the curious incident of the patch in the night-time.”
Gregory: “The patch did nothing in the night-time.”
Holmes: “That was the curious incident.”

I considered a variety of (munged) literary titles to head up this blog, and narrowed it down to the one above or “We Need to Talk about Patching”.  Either way round, there’s something rotten in the state of patching*.

Let me start with what I hope is a fairly uncontroversial statement: “we all know that patches are important for security and stability, and that we should really take them as soon as they’re available and patch all of our systems”.

I don’t know about you, but I suspect you’re the same as me: I run ‘sudo dnf --refresh upgrade’** on my home machines and work laptop at least once every day that I turn them on.  I nearly wrote that when an update comes out to patch my phone, I take it pretty much immediately, but actually, I’ve been burned before with dodgy patches, and I’ll often have a check of the patch number to see if anyone has spotted any problems with it before downloading it. This feels like basic due diligence, particularly as I don’t have a “staging phone” which I could use to test pre-production and see if my “production phone” is likely to be impacted***.

But the overwhelming evidence from the industry is that people really don’t apply patches – including security patches – even though they understand that they ought to.  I plan to post another blog entry at some point about similarities – and differences – between patching and vaccinations, but let’s take as read, for now, the assumption that organisations know they should patch, and look at the reasons they don’t, and what we might do to improve that.

Why people don’t patch

Here are the legitimate reasons that I can think of for organisations not patching****.

  1. they don’t know about patches
    a. not all patches are advertised well enough
    b. organisations don’t check for patches
  2. they don’t know about their systems
    a. incomplete knowledge of their IT estate
  3. legacy hardware
    a. patches not compatible with legacy hardware
  4. legacy software
    a. patches not compatible with legacy software
  5. known impact with up-to-date hardware & software
  6. possible impact with up-to-date hardware & software

Some of these are down to the organisations, or their operating environment, clearly: 1b, 2, 3 and 4.  The others, however, are down to us as an industry.  What it comes down to is a balance of risk: the IT operations department doesn’t dare to update software with patches because they know that if the systems that they maintain go down, they’re in real trouble.  Sometimes they know there will be a problem (typically because they test patches in a staging environment of some type), and sometimes because they just don’t dare.  This may be because they are in the middle of their own software update process, and the combination of Operating System, middleware or integrated software updates with their ongoing changes just can’t be trusted.

What we can do

Here are some thoughts about what we as an industry can do to try to address this problem – or set of problems.

Staging

Staging – what is a staging environment for?  It’s for testing changes before they go into production, of course.  But what changes?  Changes to your software, or your suppliers’ software?  The answer has to be “both”, I think.  You may need separate estates so that you can look at changes of these two sets of software separately before seeing what combining them does, but in the end, it is the combination of the two that matters.  You may consider using the same estate at different times to test the different options, but that’s not an option for all organisations.

DevOps

DevOps shouldn’t just be about allowing agile development practices to become part of the software lifecycle: it should also be about allowing agile operational practices to become part of the software lifecycle.  DevOps can really help with patching strategy if you think of it this way.  Remember, in DevOps, everybody has responsibility.  So your DevOps pipeline is the perfect way to test how changes in your software are affected by changes in the underlying estate.  And because you’re updating regularly, and have unit tests to check all the key functionality*****, any changes can be spotted and addressed quickly.
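
As a sketch of what I mean, here’s a hypothetical pipeline stage that refreshes the underlying packages in a staging environment and then re-runs the unit tests.  It assumes a dnf-based system (to match the command earlier), passwordless sudo in the pipeline environment and a pytest test suite – substitute your own equivalents:

    import subprocess
    import sys

    def run(cmd):
        print("+", " ".join(cmd))
        return subprocess.run(cmd).returncode

    # Pull in the latest packages for the staging estate...
    if run(["sudo", "dnf", "upgrade", "--refresh", "-y"]) != 0:
        sys.exit("package update failed: investigate before promoting anything")

    # ...then re-run the unit tests against the updated estate.
    if run(["pytest", "-q"]) != 0:
        sys.exit("tests failed after update: hold the patch and raise an issue")

    print("updates applied and tests pass: candidate for promotion")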

Dependencies

Patches sometimes have dependencies.  We should be clear when a patch requires other changes, resulting in a large patchset, and when a large patchset just happens to be released because multiple patches are available.  Some dependencies may be outside the control of the vendor.  This is easier to test when your patch has dependencies on an underlying Operating System, for instance, but more difficult if the dependency is in the opposite direction.  If you’re the one providing the underlying update and the customer is using software that you don’t explicitly test, then it’s incumbent on you, I’d argue, to use some of the other techniques that I’ve outlined to help your customers understand likely impact.

Visibility of likely impact

One obvious option available to those providing patches is a good description of areas of impact.  You’d hope that everyone did this already, of course, but a brief line something like “this update is for the storage subsystem, and should affect only those systems using EXT3”, for instance, is a great help in deciding the likely impact of a patch.  You can’t always get it right – there may always be unexpected consequences, and vendors can’t test for all configurations.  But they should at least test all supported configurations…
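
One hypothetical way of making that sort of impact statement machine-readable – the fields and values below are invented to illustrate the idea, not any vendor’s actual format – might look something like this:

    from dataclasses import dataclass, field

    @dataclass
    class PatchImpact:
        patch_id: str
        subsystems: list = field(default_factory=list)       # e.g. ["storage"]
        affected_configs: list = field(default_factory=list)
        summary: str = ""

    example = PatchImpact(
        patch_id="EXAMPLE-2018-001",
        subsystems=["storage"],
        affected_configs=["systems using EXT3"],
        summary="Update to the storage subsystem; should only affect systems using EXT3.",
    )
    print(example.summary)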

Risk statements

This is tricky, and maybe political, but is it time that we started giving those customers who need it a little more detail about the likely impact of the changes within a patch?  It’s difficult to quantify, of course: a one-character change may affect 95% of the flows through a module, whereas what may seem like a simple functional addition to a customer may actually require thousands of lines of code.  But as vendors, we should have an idea of the impact of a change, and we ought to be considering how we expose that to customers.

Combinations

Beyond that, however, I think there are opportunities for customers to understand what the impact of not having accepted a previous patch is.  Maybe the risk of accepting patch A is low, but the risk of not accepting patch A and patch B is much higher.  Maybe it’s safer to accept patch A and patch C, but wait for a successor to patch B.  I’m not sure quite how to quantify this, or how it might work, but I think there’s grounds for research******.
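
Purely as a toy illustration of the combination idea – emphatically not a real risk model, and the probabilities are invented – here’s how exposure might compound as patches are deferred, assuming (unrealistically) that the individual risks are independent:

    def combined_exposure(unpatched_risks):
        """Chance that at least one unapplied patch's hole gets exploited."""
        safe = 1.0
        for risk in unpatched_risks:
            safe *= (1.0 - risk)
        return 1.0 - safe

    print(combined_exposure([0.05]))        # skip patch A only:      0.05
    print(combined_exposure([0.05, 0.10]))  # skip patches A and B:  ~0.145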

Conclusion

Businesses have every right not to patch.  There are business reasons to balance the risk of patching against not patching.  But the balance is currently often tipped too far in the direction of not patching.  Much too far.  And if we’re going to improve the state of IT security, we, the industry, need to do something about it: by helping organisations with better information, by encouraging them to adopt better practices, by training them in how to assess risk, and by adopting better practices ourselves.

 


*see what I did there?

**your commands may vary.

***this almost sounds like a very good excuse for a second phone, though I’m not sure that my wife would agree.

****I’d certainly be interested to hear of others: please let me know via comments.

*****you do have these two things, right?  Because if you don’t, you’re really not doing DevOps.  Sorry.

******as soon as I wrote this, I realised that somebody’s bound to have done research on this issue.  Please let me know if you have: or know somebody who has.