Mitigate or remediate?

What’s the difference between mitigate and remediate?

I very, very nearly titled this article “The ‘aters gonna ‘ate”, and then thought better of it. This is a rare event, and I put it down to the extreme temperatures that we’re having here in the UK at the moment[1].

What prompted this article was reading something online today where I saw the word mitigate, and thought to myself, “When did I start using that word? It’s not a normal one to drop into conversation. And what about remediate? What’s the difference between mitigate and remediate? In fact, how well could I describe the difference between the two?” Both are quite jargon-y words, so, in the spirit of my recent article Jargon – a force for good or ill?, here’s my attempt at describing the difference, but also pointing out how important both are – along with a couple of other “-ate” words.

Let’s work backwards.

Remediate

Remediation is a set of actions to get things back the way they should be. It’s the last step in the process of recovery from an attack or other failure. The “re-” prefix here is the give-away: like resetting and reconciliation. When you’re remediating, there may be an expectation that you’ll be returning your systems to the same state they were in before, for example, a power failure, but that’s not necessarily the case. What you should be focussing on is the service you’re providing, rather than the system(s) providing it. A set of steps for remediation might require you to replace your database with another one from a completely different provider[3], to add a load-balancer and to change all of your hardware, but you’re still remediating the problem that hit you. At the very least, if you’ve suffered an attack, you should make sure that you plug any hole that allowed the attacker in to start with[4].

Mitigate

Mitigation isn’t about making things better – it’s about reducing the impact of an attack or failure. Mitigation is the first set of steps you take when you realise that you’ve got a problem. You could argue that these things are connected, but mitigation isn’t about returning the service to normal (remediation), but about taking steps to reduce the impact of an attack. Those mitigations may be external – adding load-balancers to help deal with a DDoS attack, maybe – or internal – shutting down systems that have been actively compromised.

In fact, I’d argue that some mitigations might, quite properly, have an adverse effect on the service: there may be short-term reductions in availability to ensure that long-term remediations can be performed. Planning for this is vitally important, as I’ve discussed previously, in Service degradation: actually a good thing.

The other -ates: update and operate

As I promised above, there are a couple of other words that are part of your response to an attack – or possible attack. The first is update. Updating is one of the key measures that you can put in place to reduce the chance of a successful attack. It’s the step you take before mitigation – because, if you’re lucky, you won’t need mitigation at all: you’ll be immune from attack[5].

The second of these is operate. Operation is your normal state: it’s where you want to be. But operation doesn’t mean that you can just sit back and feel secure: you need to be keeping an eye on what’s going on, planning upgrades, considering mitigations and preparing for remediations. We too often think of the operate step as our Happy Place, where all is rosy, and we can sit back and watch the daisies grow. As DevOps (and DevSecOps) is teaching us, this is absolutely not the case: operate is very much a dynamic state, and not a passive one.


1 – 30C (~86F) is hot for the UK. Oh, and most houses (ours included) don’t have air conditioning[2].

2 – and I work from an office in the garden. Direct sunlight through the mainly glass wall is a blessing in the winter. Less so in the summer.

3 – hopefully open source, of course.

4 – hint: patch, patch, patch.

5 – well, that attack, at least. You’re never immune from all attacks: sorry.

Who do you trust: your data, or your enemy’s?

Our security logs define our organisational memory.

Imagine, just imagine, that you’re head of an organisation, and you suspect that there’s been some sort of attack on your systems[1].  You don’t know for sure, but a bunch of your cleverest folks are pretty sure that there are sufficient signs to merit an investigation.  So, you agree to allow that investigation, and it’s pretty clear from the outset – and becomes yet more clear as the investigation unfolds – that there was an attack on your organisation.  And as this investigation draws close to its completion, you happen to meet the head of another organisation, which happens to be not only your competitor, but also the party that your investigators are certain was behind the attack.  That person – the leader of your competitor – tells you that they absolutely didn’t perform an attack: no, sirree.  Who do you believe?  Your people, who have been laboring[2] away for months, or your competitor?

What a ridiculous question: of course you’d believe your own people over your competitor, right?

So, having set up such an absurd scenario[4], let’s look at a scenario which is actually much more likely.  Your systems have been acting strangely, and there seems to be something going on.  Based on the available information, you believe that you’ve been attacked, but you’re not sure.  Your experts think it’s pretty likely, so you approve an investigation.  And then one of your investigatory team comes to you to tell you some really bad news about the data.  “What’s the problem?” you ask.  “Is there no data?”

“No,” they reply, “it’s worse than that.  We’ve got loads of data, but we don’t know which is real.”

Our logs are our memory

There is a literary trope – one of my favourite examples is Margery Allingham’s The Traitor’s Purse – where a character realises that he or she is somebody other than who they think they are when they start to question their memories.  We are defined by our memories, and the same goes for organisational security.  Our security logs define our organisational memory.  If you cannot prove that the data you are looking at is[5] correct, then you cannot be sure what led to the state you are in now.

Let’s give a quick example.  In my organisation, I am careful to log every time I upgrade a piece of software, but I begin to wonder whether a particular application is behaving as expected.  Maybe I see some traffic to an external IP address which I’ve never seen before, for instance.  Well, the first thing I’m going to do is to check the logs to see whether somebody has updated the software with a malicious version.  Assuming that somebody has installed a malicious version, there are three likely outcomes at this point:

  1. you check the logs, and it’s clear that a non-authorised version was installed.  This isn’t good, because you now know that somebody has unauthorised access to your system, but at least you can do something about it.
  2. at least some of the logs are missing.  This is less good, because you really can’t be sure what’s gone on, but you now have a pretty strong suspicion that somebody with unauthorised access has deleted some logs to cover up their tracks, which means that you have a starting point for remediation.
  3. there’s nothing wrong.  All the logs look fine.  You’re now really concerned, as you can’t be sure of your own data – your organisation’s memories.  You may be looking at correct data, or you may be looking at incorrect data: data which has been written by your enemy.  Attackers can – and do – go into log files and change the data in them to obscure the fact that they have performed particular actions.  It’s actually one of the first steps that a competent attacker will perform.

In the scenario as defined, things probably aren’t too bad: you can check the checksum or hash of the installed software, and check that it’s the version you expect.  But what if the attacker has also changed the version of your checksum- or hash-checker so that, for packages associated with this particular application, it always returns what you expect to see?  This is not a theoretical attack, and nor is it the only approach that attackers have to try to muddy the waters as to what they’ve done. And if it has happened, then you have no way of confirming that you have the correct version of the application.  You can try updating the checksum- or hash-checker, but what if the attacker has meddled with the software installer to ensure that it always installs their version…?
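To make that first, happier case concrete, here’s a minimal sketch in Python of the kind of independent check you might run: comparing the hashes of installed files against values recorded in a manifest.  The paths and the manifest are hypothetical, and the hard part – as I’ve just described – is keeping that manifest (and the checker itself) somewhere the attacker can’t reach, such as write-protected offline media.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical manifest of known-good SHA-256 hashes.  Ideally this lives
# off-system (e.g. on write-protected media), not on the host being checked.
MANIFEST_PATH = Path("/mnt/readonly/known_good_hashes.json")


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest_path: Path) -> list:
    """Return the files whose current hash differs from the recorded one."""
    manifest = json.loads(manifest_path.read_text())
    return [
        file_name
        for file_name, expected in manifest.items()
        if sha256_of(Path(file_name)) != expected
    ]


if __name__ == "__main__":
    for suspect in verify(MANIFEST_PATH):
        print(f"WARNING: {suspect} does not match its recorded hash")
```

Of course, this only helps if the checker and the manifest haven’t themselves been tampered with – which is exactly the trap described above.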

It’s a slippery slope, and bar wiping the entire system and reinstalling[6], there’s little you can do to be certain that you’ve cleared things up properly.  And in some cases, it may be too late: what if the attacker’s main aim was to exfiltrate some of your proprietary data, for example?

Lessons to learn, actions to take

Well, the key lesson is that your logs are important, and they need to be protected.  If you can’t trust your logs, then it can be very, very difficult not only to identify the extent of an attack, but also to remediate it or, in the worst case, even to know that you’ve been attacked at all.

And what can you do?  Well, there are many techniques that you can employ, and the best combination will depend on a number of questions, including your regulatory regime, your security posture, and what attackers you decide to defend against.  I’d start with a fairly simple combination:

  • move your most important logs off-system.  Where possible, host logs on different systems to the ones that are doing the reporting.  If you’re an attacker, it’s more difficult to jump from one system to another than it is to make changes to logs on a system which you’ve already compromised;
  • make your logs write-only.  This sounds crazy – how are you supposed to check logs if they can’t be read?  What this really means is that you separate read and write privileges on your logs so that only those with a need to read them can do so.  Writing is less worrisome – though there are attacks here, including filesystem starvation – because if you can’t see what you need to change, then it’s almost impossible to do so.  If you’re an attacker, you might be able to wipe some logs – see our case 2 above – but obscuring what you’ve actually done is more difficult.  There’s a minimal sketch of both of these ideas just after this list.
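As promised, here’s a small Python sketch of both suggestions – the log host, port and file locations are assumptions for illustration, not recommendations for any particular product.  Security-relevant events are forwarded to a separate collector over syslog, and the local copy is kept readable only by the owner and a dedicated group.

```python
import logging
import logging.handlers
import os
import stat

LOCAL_LOG = "/var/log/myservice/security.log"   # hypothetical local path
LOG_HOST = ("loghost.example.internal", 514)    # hypothetical remote collector

logger = logging.getLogger("myservice.security")
logger.setLevel(logging.INFO)

# 1. Off-system copy: forward events to a separate machine, so an attacker
#    who compromises this host still has to compromise the collector as well.
logger.addHandler(logging.handlers.SysLogHandler(address=LOG_HOST))

# 2. Local copy with separated privileges: owner read/write, group read only
#    (0o640) - readers get a group membership, writers don't need read access.
local_handler = logging.FileHandler(LOCAL_LOG)
logger.addHandler(local_handler)
os.chmod(LOCAL_LOG, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)

logger.info("software upgrade: package=%s version=%s", "examplepkg", "1.2.3")
```

This doesn’t make your logs tamper-proof on its own, but it does raise the bar: changing your organisation’s memory now means compromising more than one system.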

Exactly what steps you take are up to you, but remember: if you can’t trust your logs, you can’t trust your data, and if you can’t trust your data, you don’t know what has happened.  That’s what your enemy wants: confusion.


1 – like that’s ever going to happen.

2 – on this, very rare, occasion, I’m going to countenance[3] a US spelling.  I think you can guess why.

3 – contenance?

4 – I know, I know.

5 – are?

6 – you checked the firmware, right?  Hmm – maybe safer just to buy completely new hardware.

 

In praise of the CIA

CIA is not sufficient to ensure security within a system.

In the wake of the widespread failure of the Visa processing network on Friday last week[1] (see The Register for more details), I thought it might be time to revisit that useful aide-memoire, C.I.A.:

  • Confidentiality
  • Integrity
  • Availability.

This isn’t the first time I’ve written about this trio, and I doubt that it’ll be the last.  However, this particular incident seems like a perfect example to examine the least-regarded of the three – availability – and also to cogitate somewhat on how the CIA is necessary, but not sufficient[3].

Availability

As far as we can tell, the problem with the Visa payment system came down to a hardware failure.  As someone who used to work as a software engineer, I can tell you that this is by far the best type of failure, because there’s very little you can do about it once you’ve diagnosed it[6], which means that it quickly becomes SEP[7].  Be that as it may, the result of this hardware problem was that a large percentage of the network was unable to access Visa processing capabilities correctly.  Though ATMs[8] generally worked, it seems, payment using card readers generally didn’t.

How is this a security problem?  Well, one way to answer that question is to say that if security is about reducing risk to your business, then, as this incident caused significant damage to Visa’s revenue stream – not to mention its reputation – the risk materialised, and there was a security failure.  I would be interested to know, however, how many organisations have their security teams in charge of ensuring the up-time and availability of their systems when it comes to guarding against failures such as hardware faults.  My suspicion is that the scope of availability-safeguarding by security teams generally extends only to managing denial of service and other malicious attacks.

I would argue that more organisations should consider this part of the security team’s mandate, to be honest, because the impacts are very similar, and many of the mitigations will be the same.  Of course, if you’re already an integrated Ops team – or even moving to a DevOps or DevSecOps model – then well done you: I’m sure you’re 100% safe from anything similar befalling you[10].

Consistency and correctness

As I mentioned above, there’s a criticism which is often levelled at the CIA triad, which is that confidentiality, integrity and availability are not, on their own, sufficient to design and run a system.

The Visa incident is a perfect example of why this is the case.  It appears that the outage was not complete, as even at card readers, some amount of information was going through when a transaction was attempted.  This meant that for some (attempted) transactions, at least, debits were appearing on accounts even when they were not being recorded as credits at the vendor’s side.  What does this mean?  In simple terms, money was coming out of people’s accounts, but not going to the people they were trying to pay.  I’m not an expert on retail banking, but I believe that this is pretty much the opposite of how a financial transaction is supposed to work.

You can’t really blame this on a lack of confidentiality or a lack of integrity.  Nor is it really to do with a lack of availability – it may have been a side effect of the same cause as the availability failures, but that doesn’t mean that the lack of availability caused it[11].

These problems can be characterised in two ways: as a lack of consistency and/or a lack of correctness.  In a system, data should be consistent across the system, so when a debit shows up with no corresponding credit, there is a failure of consistency.  This lack of consistency points to a lack of correctness: in fact, the very point of double-entry book-keeping is to allow these sorts of errors to be spotted.
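To make the consistency point concrete, here’s a minimal sketch (with made-up transaction records) of the sort of reconciliation that a double-entry approach enables: every debit must be matched by a corresponding credit, and anything unmatched gets flagged.

```python
from collections import Counter

# Made-up transaction records: (transaction_id, amount_in_pence)
debits = [("tx-001", 1200), ("tx-002", 450), ("tx-003", 999)]
credits = [("tx-001", 1200), ("tx-003", 999)]


def unmatched_debits(debits, credits):
    """Return debits with no corresponding credit: a consistency failure."""
    remaining = Counter(credits)
    missing = []
    for entry in debits:
        if remaining[entry] > 0:
            remaining[entry] -= 1
        else:
            missing.append(entry)
    return missing


for tx_id, amount in unmatched_debits(debits, credits):
    print(f"Inconsistent: debit {tx_id} for {amount}p has no matching credit")
```

This is exactly the class of error that reconciliation processes then have to unpick – which, for the people whose money went missing, is rather more than an academic exercise.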

What this tells us is not only that CIA is not sufficient to ensure security within a system but also that there exist other mechanisms – some very ancient – that allow us to manage our systems and to mitigate failures.


1 – at time of writing.  If you’re reading this after, say, the 10th or 11th of June 2018, then it was longer ago than that[2].

2 – unless there’s been another outage, in which case it may be time to start taking out cash and stuffing it into your mattress.

3 – in no sense is this a comment on the Central Intelligence Agency.  I am unqualified to discuss that particular, august[4] body.  Nor would I consider it in my best interests to do so[5].

4 – it was actually founded in the month of September, according to the Interwebs.

5 – or the best interests of my readers.

6 – which can, admittedly, take quite a long time, as you’re probably looking in the wrong places if you, like me, generally assume that the bug is in your own code.

7 – Somebody Else’s Problem (hat tip to the late, great Douglas Adams).

8 – “cash points” (US), “holes in the wall” (UK)[9].

9 – yes, we really do call them this: “I need to get some money from the hole in the wall”.  It’s descriptive and accurate: what more do you want?

10 – no, I know you’re not, and you know you’re not, but this will make everybody else feel that little bit more nervous, and you can feel a little bit more smug, which is always nice, isn’t it?

11 – cue a link to one of my most favourite comics of all time: https://xkcd.com/552/.

What’s an attack surface?

“Reduce your attack surface,” they say. But what is it?

“Reduce your attack surface,” they[1] say.  But what is it?  The instruction to reduce your attack surface is one of the principles of IT security, so it must be a Good Thing[tm].  The problem is that it’s not always clear what an attack surface actually is.

I’m going to go for the broadest possible description I can think of, or nearly, because I’m pretty paranoid, and because I’m not convinced that the Wikipedia definition[2] is sufficient[3].  Although I’ll throw in a few examples of how to reduce attack surfaces, the purpose of this post is really to explain what one is, rather than to help protect you – but a good understanding really is required before you start with anything else, so hopefully this will be useful.

So, here’s my start at a definition:

  • The attack surface of a system is the sum of areas where attacks could be launched against it.

That feels a little bit circular – let’s define some terms.  First of all, what’s an “area” in this definition?  Well, I’d say that any particular component of a system may have many points of possible vulnerability – and therefore attack.  The sum of those points is an area – and the sum of the areas of the different components of a system gives us our system’s attack surface.

To understand better, we’re going to have to talk about systems – one of my favourite topics[4] – because I think it’s important to clarify a key difference between the attack surface of a component considered alone, and the area that a component adds when part of a system.  They will not generally be the same.

Here’s an example: you’re deploying an Operating System.  Let’s look at two options for deployment, and compare the attack surfaces.  In both cases, I’m going to take a fairly restricted look at points of vulnerability, excluding, for instance, human factors, as I don’t want to get bogged down in the details.

Deployment one – bare metal

You install your Operating System onto a physical machine, and plug it into the network.  What are some of the attack points?

  • your network connection
  • the physical hardware
  • services which are listening on the network connection
  • connections via USB – keyboard and mouse, for example.

There are more, but this should give us enough to do some comparisons.  I’d generally think of the attack surface as being associated with the physical bounds of the hardware, with the addition of the network port and USB connections.

How can we reduce the attack surface?  Well, we could unplug the network connection – though that might significantly reduce the efficacy of the system! – or we might take steps to reduce the number of services listening on the connection, to reduce the privilege level at which they run, or increase the authentication requirements for connecting to them.  We could reduce our surface area by using a utility such as “usbguard” to restrict USB connections, and, if we’re worried about physical access to the machine, we could put it in a locked cabinet somewhere.  These are all useful and appropriate ways to reduce our system’s attack surface.
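If you want a feel for one slice of that surface, here’s a small sketch that lists the services currently listening on the network – a reasonable starting point for deciding what can be switched off or locked down.  It uses the third-party psutil library, which is my assumption for illustration (you could equally parse the output of a native tool), and it needs sufficient privileges to see other processes’ sockets.

```python
import psutil  # third-party: pip install psutil


def listening_services():
    """Yield (address, port, process name) for each listening TCP socket."""
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status == psutil.CONN_LISTEN:
            name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
            yield conn.laddr.ip, conn.laddr.port, name


if __name__ == "__main__":
    for ip, port, name in sorted(listening_services(), key=lambda entry: entry[1]):
        print(f"{ip}:{port}  {name}")
```

Every entry in that list is a candidate for removal, privilege reduction or stronger authentication requirements, exactly as described above.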

Deployment two – a Virtual Machine

In this deployment scenario, we’re going to install the Operating System onto a Virtual Machine (VM), running on a physical host.  What does my attack surface look like now?  Well, that rather depends on how you define your system.  You could, of course, look at the wider system – the VM and the physical host – but for the purposes of this discussion, I’m going to consider that the operation of the Operating System is what we’re interested in, rather than the broader system[6].  So, what does our attack surface look like this time?  Here’s a quick list.

  • your network connection
  • the hypervisor
  • services which are listening on the network connection
  • connections via USB – keyboard and mouse, for example.

You’ll notice that “the physical hardware” is missing from this list, and that’s because it’s been replaced with “the hypervisor”.  This is a little simplistic, for a few reasons, including that the hypervisor is arguably implemented via a combination of software and hardware controls, but it’s certainly different from the entire physical hardware we were talking about before, and in fact, there’s not much you can do from the point of view of the Virtual Machine to secure it, other than recognise its restrictions, so we might want to remove it from our list at this level.

The other entries are also somewhat different from our first scenario, although you might not realise it at first glance.  First, it’s quite likely (though not certain) that your network connection is in fact a virtual network connection provided by the hosting system, which means that some of the burden of defending it falls to the hosting system.  The same goes for the connections via USB – the hypervisor generally provides “virtual hardware” (via something like qemu, for example), which can be attached to – or removed from – virtual machines.

So, you still have the services which are listening on the network connection, but it’s definitely a different attack surface from the first deployment scenario.

Now, if you take the wider view, then there’s definitely an attack surface at the physical machine level as well, and that needs to be considered – but it’s quite likely that this will be under the control of somebody completely different (such as a Cloud Service Provider – CSP).

Another quick example

When I deploy a webserver (using, for instance, Apache), I’ll need to consider a variety of attack vectors, from authentication to denial of service to storage attacks: these are part of our attack surface.  If I deploy it with a database (e.g. PostgreSQL or MySQL), the attack surface looks different, assuming that I care about the data in the database.  Whereas I might previously have been concerned to ensure that an HTTP “PUT” command didn’t overwrite or scramble a file on my filesystem, a malformed command to my database server could now delete or corrupt multiple tables.  On the other hand, I might be able to lock down some of the functions of my webserver so that I no longer need to worry about filesystem attacks.  The attack surface of my webserver is different when it’s combined in a system with other components[7].

Why do I want to reduce my attack surface?

Well, this is quite an easy one.  By looking back at my earlier definition, you’ll see that the smaller a system’s attack surface, the fewer points of attack there are available to malicious actors.  That’s got to be a piece of good news.

You will, of course, never be able to reduce your attack surface to zero (see There are no absolutes in security), but the more you reduce (and document, always document!), the better position you’ll be in.  It’s always about raising the bar to make it more difficult for malicious actors to affect you.


1 – the mythical IT Security Community, that’s who.

2 – to give one example.

3 – it only talks about data, and only about software: that’s not broad enough for me.

4 – as long-standing[5] readers of this blog will know.

5 – and long-suffering.

6 – yes, I know we can’t ignore that, but we’ll come back to it, honest.

7 – there are considerations around the attack surface of the database as well, of course.

Do you know what’s lurking on your system?

Every utility, every library, every executable … increases your attack surface.

I had a job once which involved designing hardening procedures for systems that we were going to be using for security-related projects.  It was fascinating.  This was probably 15 years ago, and not only were guides a little thin on the ground for the Linux distribution I was using, but what we were doing was quite niche.  At first, I think I’d assumed that I could just write a script to close down a few holes that originated from daemons[1] that had been left running for no reason: httpd, sendmail, stuff like that.  It did involve that, of course, but then I realised there was more to do, and started to dive down the rabbit hole.

So, after those daemons, I looked at users and groups.  And then at file systems, networking, storage.  I left the two scariest pieces to last, for different reasons.  The first was the kernel.  I ended up hand-crafting a kernel, removing anything that I thought it was unlikely we’d need, and then restarting several times when I discovered that the system wouldn’t boot because the things I thought I understood were more … esoteric than I’d realised.  I’m not a kernel developer, and this was a salutary lesson in quite how skilled those folks are.  At least, at the time I was doing it, there was less code, and fewer options, than there are today.  On the other hand, I was having to hack back to a required state, and there are more cut-down kernels and systems to start with than there were back then.

The other piece that I left for last was just pruning the installed Operating System applications and associated utilities.  Again, there are cut-down options that are easier to use now than then, but I also had some odd requirements – I believe that we needed Java, for instance, which has, or had… well, let’s say a lot of dependencies.  Most modern Linux distributions[3] start off by installing lots of pieces so that you can get started quickly, without having to worry about trying to work out dependencies for every piece of external software you want to run.

This is understandable, but we need to realise when we do this that we’re making a usability/security trade-off[5].  Every utility, every library, every executable that you add to a system increases your attack surface, and increases the likelihood of vulnerabilities.

The problem isn’t just that you’re introducing vulnerabilities, but that once they’re there, they tend to stay there.  Not just in code that you need, but, even worse, in code that you don’t need.  It’s a rare, but praiseworthy hacker[6] who spends time going over old code removing dependencies that are no longer required.  It’s a boring, complex task, and it’s usually just easier to leave the cruft[7] where it is and ship a slightly bigger package for the next release.

Sometimes, code is refactored and stripped: most frequently, security-related code. This is a very Good Thing[tm], but it turns out that it’s far from sufficient.  The reason I’m writing this post is because of a recent story in The Register about the “beep” command.  This command used the little speaker that was installed on most PC-compatible motherboards to make a little noise.  It was a useful little utility back in the day, but is pretty irrelevant now that most motherboards don’t ship with the relevant hardware.  The problem[8] is that installing and using the beep command on a system allows information to be leaked to users who lack the relevant permissions.  This can be a very bad thing.  There’s a good, brief overview here.

Now, “beep” isn’t installed by default on the distribution I’m using on the system on which I’m writing this post (Fedora 27), though it’s easily installable from one of the standard repositories that I have enabled.  Something of a relief, though it’s not a likely attack vector for this machine anyway.

What, though, do I have installed on this system that is vulnerable, and which I’d never have thought to check?  Forget all of the kernel parameters which I don’t need turned on, but which have been enabled by the distribution for ease of use across multiple systems.  Forget the apps that I’ve installed and use every day.  Forget, even, the apps that I installed once to try, and then neglected to remove.  What about the apps that I didn’t even know were there, and which I never realised might have an impact on the security posture of my system?  I don’t know, and have little way to find out.
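One partial answer, at least on an RPM-based distribution like the Fedora system I’m writing this on, is simply to enumerate what’s installed and when it arrived, and then ask yourself which entries you actually recognise.  Here’s a minimal sketch – the rpm query itself is standard, but treat the Python glue around it as illustrative rather than robust.

```python
import subprocess


def installed_packages():
    """Return (install date, package) pairs for every installed RPM package."""
    result = subprocess.run(
        ["rpm", "-qa", "--queryformat",
         "%{INSTALLTIME:date}\t%{NAME}-%{VERSION}\n"],
        capture_output=True, text=True, check=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines() if line]


for installed_on, package in installed_packages():
    print(f"{installed_on}\t{package}")
```

It won’t tell you everything that’s lurking, and it certainly won’t tell you what each package might expose, but it’s a decent start.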

This system doesn’t run business-critical operations.  When I first got it and installed the Operating System, I decided to err towards usability, and to be ready to trash[9] it and start again if I had problems with compromise.  But that’s not the case for millions of other systems out there.  I urge you to consider what you’re running on a system, what’s actually installed on it, and what you really need.  Patch what you need, remove what you don’t.  It’s time for a Spring clean[10].


1 – I so want to spell this word dæmons, but I think that might be taking my Middle English obsession too far[2].

2 – I mentioned that I studied Middle English, right?

3 – I’m most interested in (GNU) Linux here, as I’m a strong open source advocate and because it’s the Operating System that I know best[4].

4 – oh, and I should disclose that I work for Red Hat, of course.

5 – as with so many things in our business.

6 – the good type, or I’d have said “cracker”, probably.

7 – there’s even a word for it, see?

8 – beyond a second order problem that a suggested fix seems to have made the initial problem worse…

9 – physically, if needs be.

10 – In the Northern Hemisphere, at least.

What’s your availability? DoS attacks and more

In security we talk about intentional degradation of availability

A colleague of mine recently asked me about protection from DoS attacks[1] for a project with which he’s involved – Denial of Service attacks.  The first thing that sprung to mind, of course, was DDoS: Distributed Denial of Service attacks, where hundreds or thousands[2] of hosts are used to send vast amounts of network traffic to – or maybe more accurately “at” – servers in the hopes of bringing the servers to their knees and stopping them providing the service for which they’re designed.  These are the attacks that get into the news, and with good reason.

There are other types of DoS however, and the more I thought about it, the more I wondered whether he – and I – should be worrying about these other DoS attacks and also considering other related types of issue which could cause problems to systems.  And because I realised it was an interesting topic, I decided to write about it[3].

I’m going to return to the classic “C.I.A.” model of computer security: Confidentiality, Integrity and Availability.  The attacks we’re talking about here are those most often overlooked: attempts to degrade the availability of a service.  There’s an overlap with the related discipline of resilience here, but I think that the key differentiator is that in security we’re generally talking about intentional degradation of availability, whereas resilience also covers (and maybe focuses on) unintentional degradation.

So, what types of availability attacks might we want to consider?

Denial of service attacks

I think it’s worth linking to Wikipedia’s pretty awesome entry “Denial of service attack” – not something I often do, but I thought it was excellent.  Although they’re not mutually exclusive at all, here are some of the key types as I’d define them:

  • Distributed DoS – where you have lots of different hosts attacking at the same time, flooding the target with traffic.  These days, this can be easily automated, and it’s possible to rent compromised machines to perform a coordinated attack.
  • Application layer – where the attack is aimed at the service, rather than at the host beneath.  This may seem like an academic distinction, but it’s not: what it really means is that the attack is performed with knowledge of the application layer.  So, for instance, if you’re attacking a web server, you might initiate lots of HTTP sessions, or if you were attacking a Kerberos server, you might request lots of authentication tickets.  These types of attacks may be quite costly to perform, but they’re also difficult to protect against, as each attack looks like a “legal” interaction with the service, and unless you’re on the look-out – in a way which is typically not automated at this level – they’re difficult to avoid (there’s a simple per-client rate-limiting sketch just after this list).
  • Host level – this is a family of attacks which go for the host and/or associated Operating System, rather than the service itself.  A classic attack would be the SYN flood, which misuses the TCP protocol to use up resources on the host, thereby stopping any associated services from being able to respond.  Host attacks may be somewhat simpler to defend against, as it’s easier to invest in logic to detect them at this level (or maybe “set of layers”, if we adopt the OSI model), and to correlate responses across different hosts.  Firewalls and similar defences can also more readily be configured to help defend hosts which may be targeted.
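As promised above, here’s one common – and partial – defence against application layer attacks: per-client rate limiting.  This is my illustration rather than anything prescribed here; it’s a minimal token-bucket sketch keyed by client IP, so treat the numbers as placeholders.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = defaultdict(lambda: TokenBucket(rate=5, capacity=10))


def handle_request(client_ip: str) -> int:
    """Return an HTTP-style status code for an incoming request."""
    if not buckets[client_ip].allow():
        return 429  # Too Many Requests: shed load rather than fall over
    return 200
```

It won’t stop a well-distributed attacker – that’s rather the point of the “Distributed” in DDoS – but it does stop a handful of abusive clients from starving everybody else.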

Resource starvation

The term “resource starvation” most accurately refers[4] to situations where a process (or application) is denied sufficient CPU allocation to perform correctly.  How could this occur?  Well, it’s going to be rarer than in the DoS case, because in order to do it, you’re going to need some way to impact the underlying scheduling of the Operating System and/or virtualisation management (think hypervisor, typically).  That would normally mean that you’d need pretty low-level access to the machine, but there is a family of attacks known as “noisy neighbour”[5] where workloads – VMs or containers, typically – use up so many resources that other workloads are starved.

However, partly because of this case, I’d argue that resource starvation can usefully be associated with other types of availability attacks which occur locally to the machine hosting the targeted service, which might be related to CPU, file descriptor, network or other resources.

Generally, noisy neighbour attacks can be fairly easily mitigated by controls in the Operating System or virtualisation manager, though, of course, compromised or malicious components at this layer are very difficult to manage.
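At the very simplest end of those controls, even an individual process can protect the host a little from its own misbehaviour.  Here’s a sketch using Python’s standard resource module to cap file descriptors and CPU time for a workload – the numbers are purely illustrative, and in practice you’d be more likely to reach for cgroups or your virtualisation manager’s own quota mechanisms.

```python
import resource

# Cap open file descriptors for this process (and its children), so a
# misbehaving or compromised workload can't exhaust the host's supply.
# Lowering a limit is always allowed; raising the hard limit needs privilege.
resource.setrlimit(resource.RLIMIT_NOFILE, (1024, 1024))

# Cap CPU seconds: the process receives SIGXCPU once the soft limit is hit.
resource.setrlimit(resource.RLIMIT_CPU, (300, 360))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"File descriptor limit is now {soft} (soft) / {hard} (hard)")
```

The point is less the specific numbers than the principle: decide in advance how much of each resource a workload is allowed to consume.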

 

Dependency blocking

I’m not sure what the best term for this type of attack is, but what I’m thinking of is attacks which impact a service by reducing or removing access to external services on which it depends – remote components, if you will.  If, for instance, my web application requires access to a database, then an attack on that database – however performed – will impact my service.  As almost any kind of service will have external dependencies these days[6], this can be a very effective attack, as it allows knowledgeable attackers to target the weakest link in the “chain” of components that make up your service.

There are mitigations against some of these attacks – caching and later reconciliation/synching being one – but identifying and defending against these sorts of attacks depends largely on considering your service as a system, and realising the types of impact degradation of the different parts might have.
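The caching mitigation can be sketched very simply: serve from the remote dependency when it’s healthy, fall back to the last-known-good local copy when it isn’t, and reconcile later.  Everything here – the function names, the cache location, the ConnectionError standing in for “the database is down” – is hypothetical glue for illustration.

```python
import json
import time
from pathlib import Path

CACHE = Path("/var/cache/myservice/last_known_good.json")  # hypothetical path


def fetch_from_database() -> dict:
    """Placeholder for the real remote call; raises if the dependency is down."""
    raise ConnectionError("database unreachable")


def get_data() -> dict:
    """Prefer fresh data; degrade to the cached copy if the dependency is down."""
    try:
        data = fetch_from_database()
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        CACHE.write_text(json.dumps({"fetched_at": time.time(), "data": data}))
        return data
    except ConnectionError:
        if CACHE.exists():
            cached = json.loads(CACHE.read_text())
            # Flag the response as possibly stale so callers can decide what to do.
            return {"stale": True, **cached["data"]}
        raise  # no cache either: the degradation has to surface somewhere
```

This is managed degradation rather than a fix: the service stays partially available, and possibly slightly stale, while the real dependency is brought back.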

 

Conclusion – managed degradation

Which leads me to a final point, which is that when considering availability attacks, understanding and planning for managed degradation (see Service degradation: actually a good thing) is going to be invaluable – and when you’ve done that, you’re definitely going to need to test it, too (If it isn’t tested, it doesn’t work).

 


1 – yes, I checked the capitalisation – he wasn’t worried about DRDOS, MS-DOS or any of those lovely 80s era command line Operating Systems.

2 – or millions or more, these days.

3 – here, for the avoidance of doubt.

4 – I believe.

5 – you know my policy on spellings by now.  I’m British, and we’ll keep it that way.

6 – unless you’re still using green-screen standalone machines to run your business, in which case either a) yikes or b) well done.

Why I should have cared more about lifecycle

Every deployment is messy.

I’ve always been on the development and architecture side of the house, rather than on the operations side. In the old days, this distinction was a useful and acceptable one, and wasn’t too difficult to maintain. From time to time, I’d get involved with discussions with people who were actually running the software that I had written, but on the whole, they were a fairly remote bunch.

This changed as I got into more senior architectural roles, and particularly as I moved through some pre-sales roles which involved more conversations with users. These conversations started to throw up[1] an uncomfortable truth: not only were people running the software that I helped to design and write[3], but they didn’t just set it up the way we did in our clean test install rig, run it with well-behaved, well-structured data input by well-meaning, generally accurate users in a clean deployment environment, and then turn it off when they were done with it.

This should all seem very obvious, and I had, of course, been on the receiving end of requests from support people which revealed that there were odd things that users did to my software, but that’s usually all it felt like: odd things.

The problem is that odd is normal.  There is no perfect deployment, no clean installation, no well-structured data, and certainly very few generally accurate users.  Every deployment is messy, and nobody just turns off the software when they’re done with it.  If it’s become useful, it will be upgraded, patched, left to run with no maintenance, ignored or a combination of all of those.  And at some point, it’s likely to become “legacy” software, and somebody’s going to need to work out how to transition to a new version or a completely different system.  This all has major implications for security.

I was involved in an effort a few years ago to describe the functionality and lifecycle of a proposed new project.  I was on the security team, which, for all the usual reasons[4], didn’t always interact very closely with some of the other groups.  When the group working on error and failure modes came up with their state machine model and presented it at a meeting, we all looked on with interest.  And then with horror.  All the modes were “natural” failures: not one reflected what might happen if somebody intentionally caused a failure.  “Ah,” they responded, when called on it by the first of the security team able to form a coherent sentence, “those aren’t errors, those are attacks.”  “But,” one of us blurted out, “don’t you need to recover from them?”  “Well, yes,” they conceded, “but you can’t plan for that.  It’ll need to be on a case-by-case basis.”

This is thinking that we need to stamp out.  We need to design our systems so that, wherever possible, we consider not only what attacks might be brought to bear on them, but also how users – real users – can recover from them.

One way of doing this is to consider security as part of your resilience planning, and bake it into your thinking about lifecycle[5].  Failure happens for lots of reasons, and some of those will be because of bad people doing bad things.  It’s likely, however, that as you analyse the sorts of conditions that these attacks can lead to, a number of them will be similar to “natural” errors.  Maybe you could lose network connectivity to your database because of a loose cable, or maybe because somebody is performing a denial of service attack on it.  In both these cases, you may well start off with similar mitigations, though the steps to fix it are likely to be very different.  But considering all of these side by side means that you can help the people who are actually going to be operating those systems plan and be ready to manage their deployments.

So the lesson from today is the same as it so often is: make sure that your security folks are involved from the beginning of a project, in all parts of it.  And an extra one: if you’re a security person, try to think not just about the attackers, but also about all those poor people who will be operating your software.  They’ll thank you for it[6].


1 – not literally, thankfully[2].

2 – though there was that memorable trip to Singapore with food poisoning… I’ll stop there.

3 – a fact of which I actually was aware.

4 – some due entirely to our own navel-gazing, I’m pretty sure.

5 – exactly what we singularly failed to do in the project I’ve just described.

6 – though probably not in person.  Or with an actual gift.  But at least they’ll complain less, and that’s got to be worth something.