“All systems nominal” – borrowing a useful phrase

Wouldn’t it be lovely if everything were functioning exactly as it should?

Wouldn’t it be lovely if everything were functioning exactly as it should? All the time. Alas, that is not to be our lot: we in IT security know that there’s always something that needs attention, something that’s not quite right, something that’s limping along and needs to be fixed.

The question I want to address in this article is whether that’s actually OK. I’ve written before about managed degradation (the idea that planning for when things go wrong is a useful and important practice[1]) in Service degradation: actually a good thing. The subject of this article, however, is living in a world where everything is running almost normally – or, more specifically, “nominally”. Most definitions of the word “nominal” focus on its meaning of “theoretical” or “token” value. A quick search of online dictionaries provided two definitions which were more relevant to the usage I’m going to be looking at today:

  • informal (chiefly in the context of space travel) functioning normally or acceptably[2].
  • being according to plan: satisfactory[3].

I’d like to offer a slightly different one:

  • within acceptable parameters for normal system operation.

I’ve seen “tolerances” used instead of “parameters”, and that works, too, but it’s not a word that I think we use much within IT security, so I lean towards “parameters”[4].

Utility

Why do I think that this is a useful concept? Because, as I noted above, we all know that it’s a rare day when everything works perfectly. But we find ways to muddle through, or we find enough bandwidth to make the backups happen without significant impact on database performance, or we only lose 1% of the credit card details collected that day[5]. All of these (except the last one) are fine, and if we are wise, we might start actually defining what counts as acceptable operation – nominal operation – for all of our users. This is what we should be striving for, and exactly how far off perfect operation we are will give us clues as to how much effort we need to expend to sort problems out, and how quickly we need to perform mitigations.

In fact, many organisations which provide services do this already: that’s where SLAs (Service Level Agreements) come from. I remember, at school, doing some maths[6] around how food companies ensure that they are in little danger of getting into trouble for under-filling containers: they look at the standard deviation of the filling process to understand how much is likely to end up in each one. This is similar, and the likelihood of hardware failures, based on similar calculations, is often factored into uptime planning.
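
As a minimal sketch of the sort of calculation I mean – all of the figures below are invented for illustration – here’s how you might estimate the chance of a single container being under-filled, and the expected availability of a service built from several components:

```python
# A sketch, with made-up numbers: putting figures on "nominal" operation.
from statistics import NormalDist

# Food-company example: containers are filled with a mean of 505g and a
# standard deviation of 3g, while the label promises 500g.
fill = NormalDist(mu=505, sigma=3)
p_underfill = fill.cdf(500)  # probability that a single container is under-filled
print(f"Chance of an under-filled container: {p_underfill:.2%}")

# Uptime example: a service depends on three components, each with its own
# (assumed independent) availability; the combined figure tells you whether
# "nominal" operation fits within the SLA you have agreed.
availabilities = [0.999, 0.9995, 0.998]
combined = 1.0
for a in availabilities:
    combined *= a
print(f"Expected service availability: {combined:.4%}")
```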

So far, much of the above seems to be about resilience: where does security come in? Well, your security components, features and functionality are subject to the same resiliency issues as any other part of your system. The problem is that if a security piece fails, it may well be a single point of failure, which means that although the rest of the system is operating at 99% performance, your security just hit zero.

These are the reasons that we perform failure analysis, and why we consider defence in depth. But when we’re looking at defence in depth, do we remember to perform second order analysis? For instance, if my back-up LDAP server for user authentication is running on older hardware, what are the chances that it will fail when put under load?
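
To make that concrete – and these numbers are entirely invented – a rough second-order calculation might look something like this:

```python
# A sketch, with invented figures: what are the chances that the back-up
# fails at exactly the moment it is needed?
p_primary_fails = 0.02            # chance the primary LDAP server fails in a given month
p_backup_fails_under_load = 0.15  # chance the elderly back-up fails once it takes the full load

# First-order view: "we have a back-up, so we're fine."
# Second-order view: the back-up's failure probability is conditional on it
# being under load, which only happens once the primary has already failed.
p_auth_outage = p_primary_fails * p_backup_fails_under_load
print(f"Chance of losing authentication entirely this month: {p_auth_outage:.2%}")
```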

Broader usage

It should come as no surprise to regular readers that I want to expand the scope of the discussion beyond just hardware and software components in systems to the people who are involved, and beyond that to processes in general.

Train companies are all too aware of the impact on their services if a bad flu epidemic hits their drivers – or if the weather is good, so their staff prefer to enjoy the sunshine with their families, rather than take voluntary overtime[7]. You may have considered the impact of a staff member or two being sick, but have you gone as far as modelling what would happen if four members of your team were sick, and, just as important, how likely that is? Equally vital to consider may be issues of team dynamics, or even terrorist attacks or union disputes. What about external factors, like staff not being able to get into work because of train cancellations? What are the chances of broadband failures occurring at the same time, scuppering your fall-back plan of allowing key staff to work from home?
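
A first pass at the “how likely is that?” question can be as simple as a binomial model – though, as the examples above suggest, the assumption of independence is exactly what breaks down in an epidemic or a transport failure. The figures below are invented:

```python
# A sketch, with invented figures: how likely is it that four or more
# members of a ten-person team are off sick on the same day?
from math import comb

team_size = 10
p_absent = 0.05  # assumed chance that any one person is off on a given day

def p_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail: probability that at least k of n people are absent."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"P(at least 4 of {team_size} absent): {p_at_least(4, team_size, p_absent):.4%}")

# Caveat: this assumes absences are independent.  A flu epidemic, a train
# strike or a broadband outage makes them correlated, and the real risk is
# then much higher than the binomial model suggests.
```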

We can go deeper and deeper into this, and at some point it becomes pointless to go any further. But I believe that it’s useful to consider how far to go with it, and also to spend some time considering exactly what you consider “nominal” operation to be, particularly for your security systems.


1 – I nearly wrote “art”.

2 – Oxford Dictionaries: https://en.oxforddictionaries.com/definition/nominal.

3 – Merriam-Webster: https://www.merriam-webster.com/dictionary/nominal.

4 – this article was inspired by the Public Service Broadcasting Song Go. Listen to it: it rocks. And they’re great live, too.

5 – Note: this is a joke, and not a very funny one. You’ve probably just committed a GDPR breach, and you need to tell someone about it. Now.

6 – in the UK, and most countries speaking versions of Commonwealth English, we do more than one math. Because “mathematics”.

7 – this happened: I saw it on TV[8].

8 – so it must be true.

6 types of attack: learning from Supermicro, State Actors and silicon

… it could have happened, and it could be happening now.

Last week, Bloomberg published a story detailing how Chinese state actors had allegedly forced employees of Supermicro (or companies subcontracting to them) to insert a small chip – the silicon in the title – into motherboards destined for Apple and Amazon.  The article talked about how an investigation into these boards had uncovered this chip and the steps that Apple, Amazon and others had taken.  The story was vigorously denied by Supermicro, Apple and Amazon, but that didn’t stop Supermicro’s stock price from tumbling by over 50%.

I have heard strong views expressed by people with expertise in the topic on both sides of the argument: that it probably didn’t happen, and that it probably did.  One side argues that the denials by Apple and Amazon, for instance, might have been impacted by legal “gagging orders” from the US government.  An opposing argument suggests that the Bloomberg reporters might have confused this story with a similar one that occurred a few months ago.  Whether this particular story is correct in every detail, or a fabrication – intentional or unintentional – is not what I’m interested in at this point.  What interests me is the clear message that it could have happened, and it could be happening now.

I’ve written before about State Actors, and whether you should worry about them.  There’s another question which this story brings up, which is possibly even more germane: what can you do about it if you are worried about them?  This breaks down further into two questions:

  • how can I tell if my systems have been compromised?
  • what can I do if I discover that they have?

The first of these is easily enough to keep us occupied for now[1], so let’s spend some time on that.  Let’s first define six types of compromise, think about how they might be carried out, and then consider the questions above for each:

  • supply-chain hardware compromise;
  • supply-chain firmware compromise;
  • supply-chain software compromise;
  • post-provisioning hardware compromise;
  • post-provisioning firmware compromise;
  • post-provisioning software compromise.

There isn’t space in this article to go into detail on each of these types of attack, so I’ll provide an overview of each instead[2].

Terms

  • Supply-chain – all of the steps up to when you start actually running a system.  From manufacture through installation, including vendors of all hardware components and all software, OEMs, integrators and even shipping firms that have physical access to any pieces of the system.  For all supply-chain compromises, the key question is the extent to which you, the owner of a system, can trust every single member of the supply chain[3].
  • Post-provisioning – any point after you have installed the hardware, put all of the software you want on it, and started running it: the time during which you might consider the system to be “under your control”.
  • Hardware – the physical components of a system.
  • Software – software that you have installed on the system and over which you have some control: typically the Operating System and application software.  The amount of control depends on factors such as whether you use proprietary or open source software, and how much of it is produced, compiled or checked by you.
  • Firmware – special software that controls how the hardware interacts with the standard software on the machine, the hardware that comprises the system, and external systems.  It is typically provided by hardware vendors, and its operation is opaque to owners and operators of the system.

Compromise types

See the table at the bottom of this article for a short summary of the points below.

  1. Supply-chain hardware – there are multiple opportunities in the supply chain to compromise hardware, but the harder such compromises are to detect, the more difficult they are to perform.  The attack described in the Bloomberg story would be extremely difficult to detect, but the addition of a keyboard logger to a keyboard just before delivery (for instance) would be correspondingly simpler.
  2. Supply-chain firmware – of all the options, this has the best return on investment for an attacker.  Assuming good access to an appropriate part of the supply chain, inserting firmware that (for instance) impacts network performance or leaks data over a wifi connection is relatively simple.  The difficulty in detection comes from the fact that although it is possible for the owner of the system to measure the firmware and check that it is what they think it is, all that measurement confirms is that the firmware matches what the vendor says they supplied.  So the “medium” rating relates only to firmware that was implanted by members of the supply chain who did not source the original firmware: otherwise, it’s “high”.
  3. Supply-chain software – by this, I mean software that comes installed on a system when it is delivered.  Some organisations will insist on “clean” systems being delivered to them[4], and will install everything from the Operating System upwards themselves.  This means that they basically now have to trust their Operating System vendor[5], which is maybe better than trusting other members of the supply chain to have installed the software correctly.  I’d say that it’s not too simple to mess with this in the supply chain, if only because checking isn’t too hard for the legitimate members of the chain (there’s a minimal sketch of this sort of check after this list).
  4. Post-provisioning hardware – this is where somebody with physical access to your hardware – after it’s been set up and is running – inserts or attaches hardware to it.  I nearly gave this a “high” rating for difficulty below, assuming that we’re talking about servers, rather than laptops or desktop systems, as one would hope that your servers are well-protected, but the ease with which attackers have shown that they can typically get physical access to systems using techniques like social engineering means that I’ve downgraded this to “medium”.  Detection, on the other hand, should be fairly simple given sufficient resources (hence the “medium” rating), and although I don’t believe anybody who says that a system is “tamper-proof”, tamper-evidence is a much simpler property to achieve.
  5. Post-provisioning firmware – when you patch your Operating System, it will often also patch firmware on the rest of your system.  This is generally a good thing to do, as patches may provide security, resilience or performance improvements, but you’re stuck with the same problem as with supply-chain firmware – that you need to trust the vendor: in fact, you now need to trust both your Operating System vendor and their relationship with the firmware vendor.
  6. Post-provisioning software – is it easy to compromise systems via their Operating System and/or application software?  Yes: this we know.  Luckily – though depending on the sophistication of the attack – there are generally good tools and mechanisms for detecting such compromises, including behavioural monitoring.
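
As a minimal sketch of the sort of check mentioned for the supply-chain firmware and software cases above – bearing in mind the caveat that it only confirms a match against what the vendor has published, and that a compromised checking tool undermines the whole exercise – verifying a delivered image against a vendor-published hash might look like this (the file name and reference value are invented placeholders):

```python
# A sketch: verify a delivered firmware or software image against a
# vendor-published SHA-256 value.  This only tells you that the image
# matches what the vendor *says* they shipped: it says nothing about
# whether the vendor's own build was clean in the first place.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Both of these are placeholders for illustration.
delivered_image = "bmc-firmware-1.2.3.bin"
vendor_published = "replace-with-the-value-published-by-the-vendor"

if sha256_of(delivered_image) == vendor_published:
    print("Image matches the vendor's published hash.")
else:
    print("MISMATCH: do not install; investigate the supply chain.")
```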

Table

 

Compromise type | Attacker difficulty | Detection difficulty
Supply-chain hardware | High | High
Supply-chain firmware | Low | Medium
Supply-chain software | Medium | Medium
Post-provisioning hardware | Medium | Medium
Post-provisioning firmware | Medium | Medium
Post-provisioning software | Low | Low

Conclusion

What are your chances of spotting a compromise on your system?  I would argue that they are generally pretty much in line with the difficulty of performing the attack in the first place: with the glaring exception of supply-chain firmware.  We’ve seen attacks of this type, and they’re very difficult to detect.  The good news is that there is some good work going on to help detection of these types of attacks, particularly in the world of Linux[7] and open source.  In the meantime, I would argue that our best forms of defence are currently:

  • for supply-chain: build close relationships, use known and trusted suppliers.  You may want to restrict as much as possible of your supply chain to “friendly” regimes if you’re worried about State Actor attacks, but this is very hard in the global economy.
  • for post-provisioning: lock down your systems as much as possible – both physically and logically – and use behavioural monitoring to try to detect anomalies in what you expect them to be doing (a minimal sketch of this idea follows below).
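
For the behavioural monitoring point, even something as crude as comparing current outbound connections against a learned baseline can surface the sort of anomaly worth investigating.  Real deployments would use proper tooling; the host names and addresses below are invented:

```python
# A sketch: flag outbound destinations that a host has never been seen
# talking to before.  Real behavioural monitoring is far richer than this,
# but the principle - compare observations against an expected baseline -
# is the same.
baseline = {
    "web-frontend-01": {"10.0.1.5", "10.0.1.6"},  # known application servers
    "db-server-01": {"10.0.2.9"},                 # known backup target
}

observed = {
    "web-frontend-01": {"10.0.1.5", "10.0.1.6", "203.0.113.77"},
    "db-server-01": {"10.0.2.9"},
}

for host, destinations in observed.items():
    unexpected = destinations - baseline.get(host, set())
    if unexpected:
        print(f"ANOMALY: {host} talking to unexpected addresses: {unexpected}")
```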

1 – I’ll try to write something on this other topic in a different article.

2 – depending on interest, I’ll also consider a series of articles to go into more detail on each.

3 – how certain are you, for instance, that your delivery company won’t give your own government’s security services access to the boxes containing your equipment before they deliver them to you?

4 – though see above: what about the firmware?

5 – though you can always compile your own Operating System if you use open source software[6].

6 – oh, you didn’t compile your compiler yourself?  All bets off, then…

7 – yes, “GNU Linux”.

Knowing me, knowing you: on Russian spies and identity

Who are you, and who tells me so?

Who are you, and who tells me so?  These are questions which are really important for almost any IT-related system in use today.  I’ve previously discussed the difference between identification and authentication (and three other oft-confused terms) in Explained: five misused security words, and what I want to look at in this post is the shady hinterland between identification and authentication.

There has been a lot in the news recently about the poisoning in the UK of two Russian nationals and two British nationals, leading to the tragic death of Dawn Sturgess.  I’m not going to talk about that, or even about the alleged perpetrators, but about the problem of identity – their identity – and how that relates to IT.  The two men who travelled to Salisbury, and were named by British police as the perpetrators, travelled under Russian passports.  These documents provided their identities, as far as the British authorities – including UK Border Control – were aware, and led to their being allowed into the country.

When we set up a new user in an IT system or allow them physical access to a building, for instance, we often ask for “Government-issued ID” as the basis for authenticating the identity that they have presented, in preparation for deciding whether to authorise them to perform whatever action they have requested.  There are two issues here – one obvious, and one less so.  The first, obvious one, is that it’s not always easy to tell whether a document has actually been issued by the authority by which it appears to have been issued – document forgers have been making a prosperous living for hundreds, if not thousands of years.  The problem, of course, is that the more tell-tale signs of authenticity you reveal to those whose job it is to check a document, the more clues you give to would-be forgers for how to improve the quality of the false versions that they create.

The second, and less obvious problem, is that just because a document has been issued by a government authority doesn’t mean that it is real.  Well, actually, it does, and there’s the issue.  Its issuance by the approved authority makes it real – that is to say “authentic” – but it doesn’t mean that it is correct.  Correctness is a different property to authenticity. Any authority may decide to issue identification documents that may be authentic, but do not truly represent the identity of the person carrying them. When we realise that a claim of identity is backed up by an authority which is issuing documents that we do not believe to be correct, that means that we should change our trust relationship with that authority.  For most entities, IDs which have been authentically issued by a government authority are quite sufficient, but it is quite plausible, for instance, that the UK Border Force (and other equivalent entities around the world) may choose to view passports issued by certain government authorities as suspect in terms of their correctness.
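
In IT terms, the distinction maps roughly onto the difference between verifying that a credential really was issued by a particular authority and deciding whether you believe what that authority asserts.  A minimal sketch – the issuer names, keys and trust policy below are all invented, and a real system would use a proper signature scheme rather than an HMAC stand-in – might look like this:

```python
# A sketch: "authentic" and "correct" are different checks.
import hashlib
import hmac

# Invented issuer keys - stand-ins for whatever signature scheme the real
# documents or tokens would use.
ISSUER_KEYS = {
    "passport-office-A": b"issuer-secret-1",
    "passport-office-B": b"issuer-secret-2",
}

# Invented policy: issuers whose *claims* we are currently prepared to rely on.
CORRECTNESS_TRUSTED = {"passport-office-B"}

def is_authentic(issuer: str, claims: bytes, signature: bytes) -> bool:
    """Was this credential really issued by the named authority?"""
    key = ISSUER_KEYS.get(issuer)
    if key is None:
        return False
    expected = hmac.new(key, claims, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

def accept(issuer: str, claims: bytes, signature: bytes) -> bool:
    # Authenticity is necessary but not sufficient: we separately decide
    # whether we believe the issuer's claims to be correct.
    return is_authentic(issuer, claims, signature) and issuer in CORRECTNESS_TRUSTED
```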

What impact does this have on the wider IT security community?  Well, there are times when we accept government-issued ID where we might want to check with the relevant home nation authorities as to whether we should trust it[1].  More broadly than that, we must remember that every time we authenticate a user, we are making a decision to trust the authority that represented that user’s identity to us.  The level of trust we place in that authority may safely be reduced as we grow to know that user, but it may not – either because our interactions are infrequent, or maybe because we need to consider that they are playing “the long game”, and are acting as so-called “sleepers”.

What does this continuous trust mean?  It means that if we are relying on an external supplier to provide contractors for us, we need to keep remembering that this is a trust relationship, and one which can change.  If one of those contractors turns out to have faked educational qualifications, then we need to doubt the authenticity of the educational qualifications of all of the other contractors – and possibly other aspects of the identity which the external supplier has presented to us.  This is because we have placed transitive trust in the external supplier, which we must now re-evaluate.  What other examples might there be?  The problem is that the particular aspects of identity that we care about are so varied, and differ between the different roles that we perform.  Sometimes we might not care about educational qualifications, but about credit score, or criminal background, or blood type.  In the film Gattaca[2], identity is tied to physical and mental ability to perform a particular task.

There are various techniques available to allow us to tie a particular individual to a set of pieces of information: DNA, iris scans and fingerprints are pretty good at telling us that the person in front of us now is the person who was in front of us a year ago.  But tying that person to other information relies on trust relationships with external entities, and apart from a typically small set of inferences that we can draw from our direct experiences of this person, it’s difficult to know exactly what is truly correct unless we can trust those external entities.


1 – That assumes, of course, that we trust our home nation authorities…

2 – I’m not going to put a spoiler in here, but it’s a great film, and really makes you think about identity: go and watch it!

Single point of failure

Any failure which completely brings down a system for over 12 hours counts as catastrophic.

Yesterday[1], Gatwick Airport suffered a catastrophic failure. It wasn’t Air Traffic Control, it wasn’t security scanners, it wasn’t even check-in desk software, but the flight information boards. Catastrophic? Well, maybe the functioning of the airport wasn’t catastrophically affected, but the system itself was. For my money, any failure which completely brings down a system for over 12 hours (from 0430 to 1700 BST, reportedly) counts as catastrophic.

The failure has been blamed on damage to a fibre optic cable. It turned out that if this particular component of the system was brought down, then the system failed to operate as expected: it was a single point of failure. Now, in this case, it could be argued that the failure did not have a security impact: this was a resilience problem. Setting aside the fact that resilience and security are often bedfellows[2], many single points of failure absolutely are security issues, as they become obvious points of vulnerability for malicious actors to attack.
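
If you model your system as a graph of dependent components, finding single points of failure can be automated, at least to a first approximation. Here’s a minimal sketch – the component names and topology are invented:

```python
# A sketch: find single points of failure by removing each component in
# turn and checking whether the service is still reachable from its users.
from collections import deque

# Invented topology: which components each component talks to.
GRAPH = {
    "passengers": ["display-network"],
    "display-network": ["fibre-link"],       # only one link: suspicious...
    "fibre-link": ["flight-info-service"],
    "flight-info-service": [],
}

def reachable(graph, start, target, removed=frozenset()):
    """Breadth-first search from start to target, skipping removed components."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                queue.append(nxt)
    return False

single_points = [
    node for node in GRAPH
    if node not in ("passengers", "flight-info-service")
    and not reachable(GRAPH, "passengers", "flight-info-service", {node})
]
print("Single points of failure:", single_points)
```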

A key skill that needs to be grown within IT in general, but security in particular, is systems thinking, as I’ve discussed elsewhere, including in my first post on this blog: Systems security – why it matters. We need more systems engineers, and more systems architects. The role of systems architects, specifically, is to look beyond the single components that comprise a system, and to consider instead the behaviour of the system as a whole. This may mean looking past our first area of focus to, for instance, hardware or externally managed systems, and considering what the impact of failure, damage or compromise would be on the system’s overall operation.

Single points of failure are particularly awkward.  They crop up in all sorts of places, and they are a very good example of why diversity is important within IT security, and why you shouldn’t trust a single person – including yourself – to be the only person who looks at the security of a system.  My particular biases are towards crypto and software, for instance, so I’m more likely to miss a hardware or network point of failure than somebody with a different background to me.  Not to say that we shouldn’t try to train ourselves to think outside of whatever little box we come from – that’s part of the challenge and excitement of being a systems architect – but an acknowledgement of our own lack of expertise is in itself a realisation of our expertise: if you realise that you’re not an expert, you’re part way to becoming one.

I wanted to finish with an example of a single point of failure that is relevant to security, and exposes a process vulnerability.  The Register has a good write-up of the Foreshadow attack and its impact on SGX, Intel’s Trusted Execution Environment (TEE) capability.  What’s interesting, if the write-up is correct, is that what seems like a small break to a very specific part of the entire security chain means that you suddenly can’t trust anything.  The trust chain is broken, and you have to distrust everything you think you know.  This is a classic security problem – trust is a very tricky set of concepts – and one of the nasty things about it is that it may be entirely invisible to the user that an attack has taken place at all, particularly as the user, at this point, may have no visibility of the chain of trust that has been established – or not – up to the point that they are involved.  There’s a lot more to write about on this subject, but that’s for another day.  For now, if you’re planning to visit an airport, ensure that you have an app on your phone which will tell you your flight departure time and the correct gate.


1 – at time of writing, obviously.

2 – for non-native readers[3] , what I mean is that they are often closely related and should be considered together.

3 – and/or those unacquainted with my somewhat baroque language and phrasing habits[4].

4 – I prefer to double-dot when singing or playing Purcell, for instance[5].

5 – this is a very, very niche comment, for which slight apologies.

Mitigate or remediate?

What’s the difference between mitigate and remediate?

I very, very nearly titled this article “The ‘aters gonna ‘ate”, and then thought better of it. This is a rare event, and I put it down to the extreme temperatures that we’re having here in the UK at the moment[1].

What prompted this article was reading something online today where I saw the word mitigate, and thought to myself, “When did I start using that word? It’s not a normal one to drop into conversation. And what about remediate? What’s the difference between mitigate and remediate? In fact, how well could I describe the difference between the two?” Both are quite jargon-y words, so, in the spirit of my recent article Jargon – a force for good or ill?, here’s my attempt at describing the difference, but also pointing out how important both are – along with a couple of other “-ate” words.

Let’s work backwards.

Remediate

Remediation is a set of actions to get things back the way they should be. It’s the last step in the process of recovery from an attack or other failure. The re- prefix here is the give-away: like resetting and reconciliation. When you’re remediating, there may be an expectation that you’ll be returning your systems to the same state they were in before, say, a power failure, but that’s not necessarily the case. What you should be focussing on is the service you’re providing, rather than the system(s) that are providing it. A set of steps for remediation might require you to replace your database with another one from a completely different provider[3], to add a load-balancer and to change all of your hardware, but you’re still remediating the problem that hit you. At the very least, if you’ve suffered an attack, you should make sure that you plug any hole that allowed the attacker in to start with[4].

Mitigate

Mitigation isn’t about making things better – it’s about reducing the impact of an attack or failure. Mitigation is the first set of steps you take when you realise that you’ve got a problem. You could argue that the two are connected, but mitigation isn’t about returning the service to normal: that’s remediation. Mitigations may be external – adding load-balancers to help deal with a DDoS attack, maybe – or internal – shutting down systems that have been actively compromised.

In fact, I’d argue that some mitigations might quite properly actually have an adverse effect on the service: there may be short term reductions in availability to ensure that long-term remediations can be performed. Planning for this is vitally important, as I’ve discussed previously, in Service degradation: actually a good thing.

The other -ates: update and operate

As I promised above, there are a couple of other words that are part of your response to an attack – or possible attack. The first is update. Updating is one of the key measures that you can put in place to reduce the chance of a successful attack. It’s the step you take before mitigation – because, if you’re lucky, you won’t need mitigation, because you’ll be immune from attack[5].

The second of these is operate. Operation is your normal state: it’s where you want to be. But operation doesn’t mean that you can just sit back and feel secure: you need to be keeping an eye on what’s going on, planning upgrades, considering mitigations and preparing for remediations. We too often think of the operate step as our Happy Place, where all is rosy, and we can sit back and watch the daisies grow. As DevOps (and DevSecOps) is teaching us, this is absolutely not the case: operate is very much a dynamic state, and not a passive one.
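
Pulling the four “-ates” together, one way to think about them is as states and transitions in an incident lifecycle. This is purely a conceptual sketch, not a prescription:

```python
# A conceptual sketch of the four "-ates" as an incident lifecycle.
from enum import Enum, auto

class State(Enum):
    OPERATE = auto()    # normal (but dynamic) running: monitoring, planning, applying updates
    MITIGATE = auto()   # incident detected: reduce the impact, possibly degrading the service
    REMEDIATE = auto()  # restore the *service*, not necessarily the original system

TRANSITIONS = {
    State.OPERATE: {State.OPERATE,     # routine updates happen here, before any incident
                    State.MITIGATE},   # something has gone wrong
    State.MITIGATE: {State.REMEDIATE}, # impact reduced; now fix things properly
    State.REMEDIATE: {State.OPERATE},  # back to operation - which is never passive
}
```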


1 – 30C (~86F) is hot for the UK. Oh, and most houses (ours included) don’t have air conditioning[2].

2 – and I work from an office in the garden. Direct sunlight through the mainly glass wall is a blessing in the winter. Less so in the summer.

3 – hopefully open source, of course.

4 – hint: patch, patch, patch.

5 – well, that attack, at least. You’re never immune from all attacks: sorry.

Who do you trust: your data, or your enemy’s?

Our security logs define our organisational memory.

Imagine, just imagine, that you’re head of an organisation, and you suspect that there’s been some sort of attack on your systems[1].  You don’t know for sure, but a bunch of your cleverest folks are pretty sure that there are sufficient signs to merit an investigation.  So, you agree to allow that investigation, and it’s pretty clear from the outset – and becomes yet more clear as the investigation unfolds – that there was an attack on your organisation.  And as this investigation draws close to its completion, you happen to meet the head of another organisation, which happens to be not only your competitor, but also the party that your investigators are certain was behind the attack.  That person – the leader of your competitor – tells you that they absolutely didn’t perform an attack: no, sirree.  Who do you believe?  Your people, who have been laboring[2] away for months, or your competitor?

What a ridiculous question: of course you’d believe your own people over your competitor, right?

So, having set up such an absurd scenario[4], let’s look at a scenario which is actually much more likely.  Your systems have been acting strangely, and there seems to be something going on.  Based on the available information, you believe that you’ve been attacked, but you’re not sure.  Your experts think it’s pretty likely, so you approve an investigation.  And then one of your investigatory team comes to you to tell you some really bad news about the data.  “What’s the problem?” you ask.  “Is there no data?”

“No,” they reply, “it’s worse than that.  We’ve got loads of data, but we don’t know which is real.”

Our logs are our memory

There is a literary trope – one of my favourite examples is Margery Allingham’s The Traitor’s Purse – where a character realises that he or she is somebody other than who they think they are when they start to question their memories.  We are defined by our memories, and the same goes for organisational security.  Our security logs define our organisational memory.  If you cannot prove that the data you are looking at is[5] correct, then you cannot be sure what led to the state you are in now.

Let’s give a quick example.  In my organisation, I am careful to log every time I upgrade a piece of software, but I begin to wonder whether a particular application is behaving as expected.  Maybe I see some traffic to an external IP address which I’ve never seen before, for instance.  Well, the first thing I’m going to do is to check the logs to see whether somebody has updated the software with a malicious version.  Assuming that somebody has installed a malicious version, there are three likely outcomes at this point:

  1. you check the logs, and it’s clear that a non-authorised version was installed.  This isn’t good, because you now know that somebody has unauthorised access to your system, but at least you can do something about it.
  2. at least some of the logs are missing.  This is less good, because you really can’t be sure what’s gone on, but you now have a pretty strong suspicion that somebody with unauthorised access has deleted some logs to cover up their tracks, which means that you have a starting point for remediation.
  3. there’s nothing wrong.  All the logs look fine.  You’re now really concerned, as you can’t be sure of your own data – your organisation’s memories.  You may be looking at correct data, or you may be looking at incorrect data: data which has been written by your enemy.  Attackers can – and do – go into log files and change the data in them to obscure the fact that they have performed particular actions.  It’s actually one of the first steps that a competent attacker will perform.

In the scenario as defined, things probably aren’t too bad: you can check the checksum or hash of the installed software, and check that it’s the version you expect.  But what if the attacker has also changed the version of your checksum- or hash-checker so that, for packages associated with this particular application, it always returns what you expect to see?  This is not a theoretical attack, nor is it the only approach that attackers have to try to muddy the waters as to what they’ve done. And if it has happened, then you have no way of confirming that you have the correct version of the application.  You can try updating the checksum- or hash-checker, but what if the attacker has meddled with the software installer to ensure that it always installs their version…?

It’s a slippery slope, and bar wiping the entire system and reinstalling[6], there’s little you can do to be certain that you’ve cleared things up properly.  And in some cases, it may be too late: what if the attacker’s main aim was to exfiltrate some of your proprietary data, for example?

Lessons to learn, actions to take

Well, the key lesson is that your logs are important, and they need to be protected.  If you can’t trust your logs, then it can be very, very difficult not only to identify the extent of an attack, but also to remediate it or, in the worst case, even to know that you’ve been attacked at all.

And what can you do?  Well, there are many techniques that you can employ, and the best combination will depend on a number of questions, including your regulatory regime, your security posture, and what attackers you decide to defend against.  I’d start with a fairly simple combination:

  • move your most important logs off-system.  Where possible, host logs on different systems to the ones that are doing the reporting.  If you’re an attacker, it’s more difficult to jump from one system to another than it is to make changes to logs on a system which you’ve already compromised (there’s a minimal sketch of this after the list);
  • make your logs write-only.  This sounds crazy – how are you supposed to check logs if they can’t be read?  What this really means is that you separate read and write privileges on your logs so that only those with a need to read them can do so.  Writing is less worrisome – though there are attacks here, including filesystem starvation – because if you can’t see what you need to change, then it’s almost impossible to do so.  If you’re an attacker, you might be able to wipe some logs – see our case 2 above – but obscuring what you’ve actually done is more difficult.
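
As a minimal sketch of the first of those points – shipping logs off the system that generates them – Python’s standard logging library can forward to a remote syslog collector.  The host name below is an invented placeholder, and in a real deployment you would want an authenticated, encrypted transport and an append-only store at the far end:

```python
# A sketch: send logs to a remote collector rather than (or as well as)
# keeping them on the local, potentially compromisable, system.
import logging
import logging.handlers
import socket

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)

# "loghost.example.com" is a placeholder; 514 is the traditional syslog
# port.  TCP (SOCK_STREAM) is preferable to UDP so that entries are less
# likely to be silently lost.
remote = logging.handlers.SysLogHandler(
    address=("loghost.example.com", 514),
    socktype=socket.SOCK_STREAM,
)
logger.addHandler(remote)

logger.info("package 'example-app' upgraded to version 1.2.3")
```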

Exactly what steps you take are up to you, but remember: if you can’t trust your logs, you can’t trust your data, and if you can’t trust your data, you don’t know what has happened.  That’s what your enemy wants: confusion.


1 – like that’s ever going to happen.

2 – on this, very rare, occasion, I’m going to countenance[3] a US spelling.  I think you can guess why.

3 – contenance?

4 – I know, I know.

5 – are?

6 – you checked the firmware, right?  Hmm – maybe safer just to buy completely new hardware.

 

In praise of the CIA

CIA is not sufficient to ensure security within a system.

In the wake of the widespread failure of the Visa processing network on Friday last week [1] (see The Register for more details), I thought it might be time to revisit that useful aide memoire, C.I.A.:

  • Confidentiality
  • Integrity
  • Availability.

This isn’t the first time I’ve written about this trio, and I doubt that it’ll be the last.  However, this particular incident seems like a perfect example to examine the least-regarded of the three – availability – and also to cogitate somewhat on how the CIA is necessary, but not sufficient[3].

Availability

As far as we can tell, the problem with the Visa payment system came down to a hardware failure.  As someone who used to work as a software engineer, I can tell you that this is by far the best type of failure, because there’s very little you can do about it once you’ve diagnosed it[6], which means that it quickly becomes SEP[7].  Be that as it may, the result of this hardware problem was that a large percentage of the network was unable to access Visa processing capabilities correctly.  Though ATMs[8] generally worked, it seems, payment using card readers generally didn’t.

How is this a security problem?  Well, one way to answer that question is to say that if security is about reducing risk to your business, then, as this caused significant damage to Visa’s revenue stream – not to mention its reputation – the risk materialised, and there was a security failure.  I would be interested to know, however, how many organisations have their security teams in charge of ensuring up-time and availability of their systems in terms of guarding against vulnerabilities such as hardware failures.  My suspicion is that the scope of availability-safeguarding by security teams is generally limited to managing denial of service and other malicious attacks.

I would argue that more organisations should consider this part of the security team’s mandate, to be honest, because the impacts are very similar, and many of the mitigations will be the same.  Of course, if you’re already an integrated Ops team – or even moving to a DevOps or DevSecOps model – then well done you: I’m sure you’re 100% safe from anything similar befalling you[10].

Consistency and correctness

As I mentioned above, there’s a criticism which is often levelled at the CIA triad, which is that confidentiality, integrity and availability are not, on their own, sufficient to design and run a system.

The Visa incident is a perfect example of why this is the case.  It appears that the outage was not complete, as even at card readers, some amount of information was going through when a transaction was attempted.  This meant that for some (attempted) transactions, at least, debits were appearing on accounts even when they were not being recorded as credits at the vendor’s side.  What does this mean?  In simple terms, money was coming out of people’s accounts, but not going to the people they were trying to pay.  I’m not an expert on retail banking, but I believe that this is pretty much the opposite of how a financial transaction is supposed to work.

You can’t really blame this on a lack of confidentiality or a lack of integrity.  Nor is it really to do with a lack of availability – it may have been a side effect of the same cause as the availability failures, but that doesn’t mean that one caused the other[11].

These problems can be characterised in two ways: as a lack of consistency and/or a lack of correctness.  In a system, data should be consistent across the system, so when a debit shows up with no corresponding credit, there is a failure of consistency.  This lack of consistency points to a lack of correctness: in fact, the very point of double-entry book-keeping is to allow these sorts of errors to be spotted.
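
As a minimal sketch of what the double-entry principle gives you – the transactions below are invented – a consistency check over a ledger is simply a matter of confirming that every transaction’s debits and credits balance:

```python
# A sketch: a double-entry consistency check.  Every transaction's debits
# should be matched by equal credits; an imbalance is exactly the "money
# left the account but never arrived" failure described above.
from collections import defaultdict

ledger = [
    # (transaction id, account, amount): negative = debit, positive = credit
    ("tx-001", "customer", -25.00), ("tx-001", "coffee-shop", 25.00),
    ("tx-002", "customer", -40.00),  # the credit leg was never recorded!
]

totals = defaultdict(float)
for tx_id, _account, amount in ledger:
    totals[tx_id] += amount

for tx_id, balance in totals.items():
    if abs(balance) > 0.005:
        print(f"Inconsistent transaction {tx_id}: out of balance by {balance:.2f}")
```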

What this tells us is not only that CIA is not sufficient to ensure security within a system but also that there exist other mechanisms – some very ancient – that allow us to manage our systems and to mitigate failures.


1 – at time of writing.  If you’re reading this after, say, the 10th or 11th of June 2018, then it was longer ago than that[2].

2 – unless there’s been another outage, in which case it may be time to start taking out cash and stuffing it into your mattress.

3 – in no sense is this a comment on the Central Intelligence Agency.  I am unqualified to discuss that particular, august[4] body.  Nor would I consider it in my best interests to do so[5].

4 – it was actually founded in the month of September, according to the Interwebs.

5 – or the best interests of my readers.

6 – which can, admittedly, take quite a long time, as you’re probably looking in the wrong places if you, like me, generally assume that the bug is in your own code.

7 – Somebody Else’s Problem (hat tip to the late, great Douglas Adams).

8 – “cash points” (US), “holes in the wall” (UK)[9].

9 – yes, we really do call them this: “I need to get some money from the hole in the wall”.  It’s descriptive and accurate: what more do you want?

10 – no, I know you’re not, and you know you’re not, but this will make everybody else feel that little bit more nervous, and you can feel a little bit more smug, which is always nice, isn’t it?

11 – cue a link to one of my most favourite comics of all time: https://xkcd.com/552/.