Single point of failure

Any failure which completely brings down a system for over 12 hours counts as catastrophic.

Yesterday[1], Gatwick Airport suffered a catastrophic failure. It wasn’t Air Traffic Control, it wasn’t security scanners, it wasn’t even check-in desk software, but the flight information boards. Catastrophic? Well, maybe the functioning of the airport wasn’t catastrophically affected, but the system itself was. For my money, any failure which completely brings down a system for over 12 hours (from 0430 to 1700 BST, reportedly) counts as catastrophic.

The failure has been blamed on damage to a fibre optic cable. It turned out that if this particular component of the system was brought down, then the system failed to operate as expected: it was a single point of failure. Now, in this case, it could be argued that the failure did not have a security impact: this was a resilience problem. Setting aside the fact that resilience and security are often bedfellows[2], many single points of failure absolutely are security issues, as they become obvious points of vulnerability for malicious actors to attack.

A key skill that needs to be grown within IT in general, but security in particular, is systems thinking, as I’ve discussed elsewhere, including in my first post on this blog: Systems security – why it matters. We need more systems engineers, and more systems architects. The role of systems architects, specifically, is to look beyond the single components that comprise a system, and to consider instead the behaviour of the system as a whole. This may mean looking past our first focus to, for instance, hardware or externally managed systems, to consider what the impact of failure, damage or compromise would be to the system’s overall operation.

Single points of failure are particularly awkward.  They crop up in all sorts of places, and they are a very good example of why diversity is important within IT security, and why you shouldn’t trust a single person – including yourself – to be the only person who looks at the security of a system.  My particular biases are towards crypto and software, for instance, so I’m more likely to miss a hardware or network point of failure than somebody with a different background to me.  Not to say that we shouldn’t try to train ourselves to think outside of whatever little box we come from – that’s part of the challenge and excitement of being a systems architect – but an acknowledgement of our own lack of expertise is in itself a realisation of our expertise: if you realise that you’re not an expert, you’re part way to becoming one.

I wanted to finish with an example of a single point of failure that is relevant to security, and exposes a process vulnerability.  The Register has a good write-up of the Foreshadow attack and its impact on SGX, Intel’s Trusted Execution Environment (TEE) capability.  What’s interesting, if the write-up is correct, is that what seems like a small break to a very specific part of the entire security chain means that you suddenly can’t trust anything.  The trust chain is broken, and you have to distrust everything you think you know.  This is a classic security problem – trust is a very tricky set of concepts – and one of the nasty things about it is that it may be entirely invisible to the user that an attack has taken place at all, particularly as the user, at this point, may have no visibility of the chain of trust that has been established – or not – up to the point that they are involved.  There’s a lot more to write about on this subject, but that’s for another day.  For now, if you’re planning to visit an airport, ensure that you have an app on your phone which will tell you your flight departure time and the correct gate.


1 – at time of writing, obviously.

2 – for non-native readers[3], what I mean is that they are often closely related and should be considered together.

3 – and/or those unacquainted with my somewhat baroque language and phrasing habits[4].

4 – I prefer to double-dot when singing or playing Purcell, for instance[5].

5 – this is a very, very niche comment, for which slight apologies.

In praise of the CIA

CIA is not sufficient to ensure security within a system.

In the wake of the widespread failure of the Visa processing network on Friday last week [1] (see The Register for more details), I thought it might be time to revisit that useful aide memoire, C.I.A.:

  • Confidentiality
  • Integrity
  • Availability.

This isn’t the first time I’ve written about this trio, and I doubt that it’ll be the last.  However, this particular incident seems like a perfect example to examine the least-regarded of the three – availability – and also to cogitate somewhat on how the CIA is necessary, but not sufficient[3].

Availability

As far as we can tell, the problem with the Visa payment system came down to a hardware failure.  As someone who used to work as a software engineer, I can tell you that this is by far the best type of failure, because there’s very little you can do about it once you’ve diagnosed it[6], which means that it quickly becomes SEP[7].  Be that as it may, the result of this hardware problem was that a large percentage of the network was unable to access Visa processing capabilities correctly.  Though ATMs[8] generally worked, it seems, payment using card readers generally didn’t.

How is this a security problem?  Well, one way to answer that question is to say that if security is about reducing risk to your business, then, as this caused significant damage to Visa’s revenue stream – not to mention its reputation – the risk materialised, and there was a security failure.  I would be interested to know, however, how many organisations have their security teams in charge of ensuring up-time and availability of their systems in terms of guarding against vulnerabilities such as hardware failures.  My suspicion is that the scope of availability-safeguarding by security teams generally extends only to managing denial of service or other malicious attacks.

I would argue that more organisations should consider this part of the security team’s mandate, to be honest, because the impacts are very similar, and many of the mitigations will be the same.  Of course, if you’re already an integrated Ops team – or even moving to a DevOps or DevSecOps model – then well done you: I’m sure you’re 100% safe from anything similar befalling you[10].

Consistency and correctness

As I mentioned above, there’s a criticism which is often levelled at the CIA triad, which is that confidentiality, integrity and availability are not, on their own, sufficient to design and run a system.

The Visa incident is a perfect example of why this is the case.  It appears that the outage was not complete, as even at card readers, some amount of information was going through when a transaction was attempted.  This meant that for some (attempted) transactions, at least, debits were appearing on accounts even when they were not being recorded as credits at the vendor’s side.  What does this mean?  In simple terms, money was coming out of people’s accounts, but not going to the people they were trying to pay.  I’m not an expert on retail banking, but I believe that this is pretty much the opposite of how a financial transaction is supposed to work.

You can’t really blame this on a lack of confidentiality or a lack of integrity.  Nor is it really to do with a lack of availability – it may have been a side effect of the same cause as the availability failures, but that doesn’t mean that it caused them[11].

These problems can be characterised in two ways: as a lack of consistency and/or a lack of correctness.  In a system, data should be consistent across the system, so when a debit shows up with no corresponding credit, there is a failure of consistency.  This lack of consistency highlights a lack of correctness: in fact, the very point of double-entry book-keeping is to allow these sorts of errors to be spotted.
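To make the double-entry point concrete, here is a minimal sketch in Python of the kind of check it enables.  Everything in it is an assumption for illustration: the function name, the (transaction, account, amount) layout and the sample figures are mine, and have nothing to do with how Visa or any real payment network records transactions.

```python
from collections import defaultdict


def unbalanced_transactions(entries):
    """Return the ids of transactions whose debits and credits do not cancel.

    `entries` is an iterable of (transaction_id, account, amount_in_pence)
    tuples: debits are negative, credits are positive.  This layout is a
    hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    for transaction_id, _account, amount_in_pence in entries:
        totals[transaction_id] += amount_in_pence
    # In a consistent ledger every transaction sums to zero.
    return sorted(tid for tid, total in totals.items() if total != 0)


# Hypothetical ledger: t1 is complete, t2 has a debit with no matching credit.
LEDGER = [
    ("t1", "cardholder", -500),
    ("t1", "vendor", 500),
    ("t2", "cardholder", -1200),
]

suspect = unbalanced_transactions(LEDGER)
```

Here `suspect` contains only the transaction where money left one account and arrived nowhere, which is the situation described above.  The check says nothing about confidentiality, integrity or availability: it is purely about consistency.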

What this tells us is not only that CIA is not sufficient to ensure security within a system but also that there exist other mechanisms – some very ancient – that allow us to manage our systems and to mitigate failures.


1 – at time of writing.  If you’re reading this after, say, the 10th or 11th of June 2018, then it was longer ago than that[2].

2 – unless there’s been another outage, in which case it may be time to start taking out cash and stuffing it into your mattress.

3 – in no sense is this a comment on the Central Intelligence Agency.  I am unqualified to discuss that particular, august[4] body.  Nor would I consider it in my best interests to do so[5].

4 – it was actually founded in the month of September, according to the Interwebs.

5 – or the best interests of my readers.

6 – which can, admittedly, take quite a long time, as you’re probably looking in the wrong places if you, like me, generally assume that the bug is in your own code.

7 – Somebody Else’s Problem (hat tip to the late, great Douglas Adams).

8 – “cash points” (US), “holes in the wall” (UK)[9].

9 – yes, we really do call them this: “I need to get some money from the hole in the wall”.  It’s descriptive and accurate: what more do you want?

10 – no, I know you’re not, and you know you’re not, but this will make everybody else feel that little bit more nervous, and you can feel a little bit more smug, which is always nice, isn’t it?

11 – cue a link to one of my most favourite comics of all time: https://xkcd.com/552/.

Why I should have cared more about lifecycle

Every deployment is messy.

I’ve always been on the development and architecture side of the house, rather than on the operations side. In the old days, this distinction was a useful and acceptable one, and wasn’t too difficult to maintain. From time to time, I’d get involved with discussions with people who were actually running the software that I had written, but on the whole, they were a fairly remote bunch.

This changed as I got into more senior architectural roles, and particularly as I moved through some pre-sales roles which involved more conversations with users. These conversations started to throw up[1] an uncomfortable truth: not only were people running the software that I helped to design and write[3], but they didn’t just set it up the way we did in our clean test install rig, run it with well-behaved, well-structured data input by well-meaning, generally accurate users in a clean deployment environment, and then turn it off when they were done with it.

This should all seem very obvious, and I had, of course, been on the receiving end of requests from support people who exposed that there were odd things that users did to my software, but that’s usually all it felt like: odd things.

The problem is that odd is normal.  There is no perfect deployment, no clean installation, no well-structured data, and certainly very few generally accurate users.  Every deployment is messy, and nobody just turns off the software when they’re done with it.  If it’s become useful, it will be upgraded, patched, left to run with no maintenance, ignored or a combination of all of those.  And at some point, it’s likely to become “legacy” software, and somebody’s going to need to work out how to transition to a new version or a completely different system.  This all has major implications for security.

I was involved in an effort a few years ago to describe the functionality and lifecycle for a proposed new project.  I was on the security team, which, for all the usual reasons[4], didn’t always interact very closely with some of the other groups.  When the group working on error and failure modes came up with their state machine model and presented it at a meeting, we all looked on with interest.  And then with horror.  All the modes were “natural” failures: not one reflected what might happen if somebody intentionally caused a failure.  “Ah,” they responded, when called on it by the first of the security team to be able to form a coherent sentence, “those aren’t errors, those are attacks.”  “But,” one of us blurted out, “don’t you need to recover from them?”  “Well, yes,” they conceded, “but you can’t plan for that.  It’ll need to be on a case-by-case basis.”

This is thinking that we need to stamp out.  We need to design our systems so that, wherever possible, we consider not only what attacks might be brought to bear on them, but also how users – real users – can recover from them.

One way of doing this is to consider security as part of your resilience planning, and bake it into your thinking about lifecycle[5].  Failure happens for lots of reasons, and some of those will be because of bad people doing bad things.  It’s likely, however, that as you analyse the sorts of conditions that these attacks can lead to, a number of them will be similar to “natural” errors.  Maybe you could lose network connectivity to your database because of a loose cable, or maybe because somebody is performing a denial of service attack on it.  In both these cases, you may well start off with similar mitigations, though the steps to fix it are likely to be very different.  But considering all of these side by side means that you can help the people who are actually going to be operating those systems plan and be ready to manage their deployments.

So the lesson from today is the same as it so often is: make sure that your security folks are involved from the beginning of a project, in all parts of it.  And an extra one: if you’re a security person, try to think not just about the attackers, but also about all those poor people who will be operating your software.  They’ll thank you for it[6].


1 – not literally, thankfully[2].

2 – though there was that memorable trip to Singapore with food poisoning… I’ll stop there.

3 – a fact of which I actually was aware.

4 – some due entirely to our own navel-gazing, I’m pretty sure.

5 – exactly what we singularly failed to do in the project I’ve just described.

6 – though probably not in person.  Or with an actual gift.  But at least they’ll complain less, and that’s got to be worth something.

Embracing fallibility

History repeats itself because no one was listening the first time. (Anonymous)

We’re all fallible.  You’re fallible, he’s fallible, she’s fallible, I’m fallible*.  We all get things wrong from time to time, and the generally accepted “modern” management approach is that it’s OK to fail – “fail early, fail often” – as long as you learn from your mistakes.  In fact, there’s a growing view that if you don’t fail, you can’t learn – or that your learning will be slower, and restricted.

The problem with some fields – and IT security is one of them – is that failing can be a very bad thing, with lots of very unexpected consequences.  This is particularly true for operational security, but the same can be the case for application, infrastructure or feature security.  In fact, one of the few expected consequences is that call to visit your boss once things are over, so that you can find out how many days*** you still have left with your organisation.  But if we are to be able to make mistakes**** and learn from them, we need to find ways to allow failure to happen without catastrophic consequences to our organisations (and our careers).

The first thing to be aware of is that we can learn from other people’s mistakes.  There’s a famous aphorism, supposedly first said by George Santayana and often translated as “Those who cannot learn from history are doomed to repeat it.”  I quite like the alternative:  “History repeats itself because no one was listening the first time.”  So, let’s listen, and let’s consider how to learn from other people’s mistakes (and our own).  The classic way of thinking about this is by following “best practices”, but I have a couple of problems with this phrase.  The first is that very rarely can you be certain that the context in which you’re operating is exactly the same as that of those who framed these practices.  The other – possibly more important – is that “best” suggests the summit of possibilities: you can’t do better than best.  But we all know that many practices can indeed be improved on.  For that reason, I rather like the alternative, much-used at Intel Corporation, which is “BKMs”: Best Known Methods.  This suggests that there may well be better approaches waiting to be discovered.  It also talks about methods, which suggests to me more conscious activities than practices, which may become unconscious or uncritical followings of others.

What other opportunities are open to us to fail?  Well, to return to a theme which is dear to my heart, we can – and must – discuss with those within our organisations who run the business what levels of risk are appropriate, and explain that we know that mistakes can occur, so how can we mitigate against them and work around them?  And there’s the word “mitigate” – another approach is to consider managed degradation as one way to protect our organisations***** from the full impact of failure.

Another is to embrace methodologies which have failure as a key part of their philosophy.  The most obvious is Agile Programming, which can be extended to other disciplines, and, when combined with DevOps, allows not only for fast failure but fast correction of failures.  I plan to discuss DevOps – and DevSecOps, the practice of rolling security into DevOps – in more detail in a future post.

One last approach that springs to mind, and which should always be part of our arsenal, is defence in depth.  We should be assured that if one element of a system fails, that’s not the end of the whole kit and caboodle******.  That only works if we’ve thought about single points of failure, of course.

The approaches above are all well and good, but I’m not entirely convinced that any one of them – or a combination of them – gives us a complete enough picture that we can fully embrace “fail fast, fail often”.  There are other pieces, too, including testing, monitoring, and organisational cultural change – an important and often overlooked element – that need to be considered, but it feels to me that we have some way to go, still.  I’d be very interested to hear your thoughts and comments.


*my family is very clear on this point**.

**I’m trying to keep it from my manager.

***or if you’re very unlucky, minutes.

****amusingly, I first typed this word as “misteaks”.  You’ve got to love those Freudian slips.

*****and hence ourselves.

******no excuse – I just love the phrase.

Single Point of Failure

Avoiding cascade failures with systems thinking.

Let’s start with a story.  Way back in the mists of time*, I performed audits for an organisation which sent out cryptographic keys to its members.  These member audits involved checking multiple processes and systems, but the core one was this: the keys that were sent out were a really big deal, as they were the basis from which tens of thousands of other keys would be derived.  So, the main key that was sent out was really, really important, because if it got leaked, the person who got hold of it would have a chance to do many, many Bad Things[tm].

The main organisation thought that allowing people the possibility to do Bad Things[tm] wasn’t generally a good idea, so they had a rule.  You had to follow a procedure, which was this: they would send out this key in two separate parts, to be stored in two different physical safes, to be combined by two different people, reporting to two different managers, in a process split into two separate parts, which ensured that the two different key-holders could never see the other half of the key.  The two separate parts were sent out by separate couriers, so that nobody outside the main organisation could ever get to see the two parts.  It was a good and carefully thought-out process.
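For readers who haven’t met split keys before, here is a minimal sketch of one common way to cut a key into two components.  I am assuming a simple XOR split purely for illustration: I’m not describing the scheme that the organisation in this story actually used, and the function names are my own.

```python
import secrets


def split_key(key: bytes) -> tuple[bytes, bytes]:
    """Split `key` into two components (assumed XOR scheme, for illustration).

    The first component is fresh random bytes; the second is the key XORed
    with it.  Either component on its own reveals nothing about the key.
    """
    component_a = secrets.token_bytes(len(key))
    component_b = bytes(k ^ a for k, a in zip(key, component_a))
    return component_a, component_b


def combine_components(component_a: bytes, component_b: bytes) -> bytes:
    """Recombine the two components to recover the original key."""
    if len(component_a) != len(component_b):
        raise ValueError("key components must be the same length")
    return bytes(a ^ b for a, b in zip(component_a, component_b))


# Hypothetical 16-byte key, generated here only so the sketch runs end to end.
main_key = secrets.token_bytes(16)
part_for_safe_one, part_for_safe_two = split_key(main_key)
recovered = combine_components(part_for_safe_one, part_for_safe_two)
```

The important property is that one component alone is useless, which is why the whole procedure depends on no single person or place ever holding both.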

So one of the first things I’d investigate, on arriving at a member company to perform an audit, would be how they managed their part of this process.  And, because they were generally fairly clued up, or wouldn’t have been allowed to have the keys in the first place, they’d explain how careful they were with the key components, and who reported to whom, and where the safes were, and back up plans for when the key holders were ill: all good stuff.  And then I’d ask: “And what happens when a courier arrives with the key component?”  To which they’d reply: “Oh, the mail room accepts the package.”  And then I’d ask “And when the second courier arrives with the second key component?”  And nine times out of ten, they’d answer: “Oh, the mail room accepts that package, too.”  And then we’d have a big chat.**

This is a classic example of a single point of failure.  Nobody designs systems with a single point of failure on purpose****, but they just creep in.  I’m using the word systems here in the same way I used it in my post Systems security – why it matters: in the sense of a bunch of different things working together, some of which are likely to be human, some of which are likely to be machine.  And it’s hard to work out where single points of failure are.  A good way to avoid them – or minimise their likelihood of occurrence – is to layer or overlap systems*****.  What is terrible is when two single points of failure are triggered at once, because they overlap.  From the little information available to us, this seems to be what happened to British Airways over the past weekend: they had a power failure, and then their backups didn’t work.  In other words, they had a cascade failure – one thing went wrong, and then, when another thing went wrong as well, everything fell over. This is terrible, and every IT professional out there ought to be cringing a little bit inside at the thought that it might happen to them.******

How can you stop this happening?  It’s hard, really it is, because the really catastrophic failures only happen rarely – pretty much by definition. Here are some thoughts, though:

  • look at pinch points, where a single part of the system, human or machine, is doing heavy lifting – what happens when they fail?
  • look at complex processes with many interlocking pieces – what happens if one of them produces unexpected results (or none)?
  • look at processes with many actors – what happens if one actor fails to do what is expected?
  • look at processes with a time element to them – what will happen if an element doesn’t produce results when expected?
  • try back-tracking, rather than forward-tracking.  We tend to think forwards, from input to output: try the opposite, and see what the key parts to any output are.  This may give unexpected realisations about critical inputs and associated components.
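The back-tracking suggestion in the list above can be sketched in a few lines of Python.  This is a toy, under stated assumptions: the system is modelled as a simple “feeds into” graph, a component counts as a single point of failure if removing it leaves no route from the input to the output, and the component names in the example are invented.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the `removed` node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, source, output):
    """Back-track from `output`, then test each upstream component in turn."""
    # Build reverse edges so that we can walk from the output backwards.
    feeds = {}
    for node, targets in graph.items():
        for target in targets:
            feeds.setdefault(target, []).append(node)
    upstream, stack = set(), [output]
    while stack:
        for parent in feeds.get(stack.pop(), ()):
            if parent not in upstream:
                upstream.add(parent)
                stack.append(parent)
    candidates = upstream - {source, output}
    return sorted(c for c in candidates
                  if not reachable(graph, source, output, removed=c))


# Hypothetical system: two redundant servers sharing one link to the output.
EXAMPLE = {
    "flight_feed": ["server_a", "server_b"],
    "server_a": ["fibre_link"],
    "server_b": ["fibre_link"],
    "fibre_link": ["display_boards"],
}

weak_points = single_points_of_failure(EXAMPLE, "flight_feed", "display_boards")
```

In this invented example the two servers back each other up, but both rely on the same link, so the link is the only component flagged.  Real systems include people and processes as well as machines, so treat a model like this as a prompt for questions rather than an answer.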

Last: don’t assume that your systems are safe.  Examine, monitor, test, remediate.  You might******* also have a good long hard think about managed degradation: it’s really going to help if things do go horribly wrong.

Oh – and good luck.


*around ten years ago.  It feels like a long time, anyway.

**because, in case you missed it, that meant that the person in charge of the mail room had access to both parts of the key.***

***which meant that they needed to change their policies, quickly, unless they wanted to fail the audit.

****I’m assuming that we’re all Good Guys and Gals[tm], right, and not the baddies?

*****the principle of defence in depth derives from this idea, though it’s only one way to do it.

******and we/you shouldn’t be basking in the schadenfreude.  No, indeed.

*******should.  Or even must.  Just do it.

Service degradation: actually a good thing

…here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.

Well, not all the time, obviously*.  But bear with me: we spend most of our time ensuring that all of our systems are up and secure and working as expected, because that’s what we hope for, but there’s a real argument for not only finding out what happens when they don’t, and not just planning for when they don’t, but also planning for how they shouldn’t.  Let’s start by examining some techniques for how we might do that.

Part 1 – planning

There’s a story** that the oil company Shell, in the 1970s, did some scenario planning that examined what were considered, at the time, very unlikely events, and which allowed it to react when OPEC’s strategy surprised most of the rest of the industry a few years later.  Sensitivity modelling is another technique that organisations use at the financial level to understand what impact various changes – in order fulfilment, currency exchange or interest rates, for instance – make to the various parts of their business.  Yet another is war gaming, which the military use to try to understand what will happen when failures occur: putting real people and their associated systems into situations and watching them react.  And Netflix are famous for taking this a step further in the context of the IT world and having a virtual Chaos Monkey (a set of processes and scripts) which they use to bring down parts of their systems in real time to allow them to understand how resilient the wider system is.

So that gives us four approaches that are applicable, with various options for automation:

  1. scenario planning – trying to understand what impact large scale events might have on your systems;
  2. sensitivity planning – modelling the impact on your systems of specific changes to the operating environment;
  3. wargaming – putting your people and systems through simulated events to see what happens;
  4. real outages – testing your people and systems with actual events and failures.

Actually going out of your way to sabotage your own systems might seem like insane behaviour, but it’s actually a work of genius.  If you don’t plan for failure, what are you going to do when it happens?
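To give a flavour of the fourth approach, here is a toy simulation in Python.  It is emphatically not Netflix’s Chaos Monkey, nor any real tool: the class, the function and the replica names are all hypothetical, and the “outage” is just a flag being flipped.  The idea it illustrates is simply: take something down on purpose, then check whether the wider system still answers.

```python
import random


class Service:
    """A toy service that is healthy while at least one replica is up."""

    def __init__(self, replicas):
        self.up = {name: True for name in replicas}

    def kill(self, name):
        self.up[name] = False

    def restore(self, name):
        self.up[name] = True

    def healthy(self):
        return any(self.up.values())


def chaos_round(service, rng):
    """Take down one randomly chosen replica and report whether we survived."""
    victim = rng.choice(sorted(service.up))
    service.kill(victim)
    survived = service.healthy()
    service.restore(victim)  # put things back before the next experiment
    return victim, survived


rng = random.Random()
resilient = Service(["replica_1", "replica_2"])
fragile = Service(["only_replica"])

_, resilient_survived = chaos_round(resilient, rng)
_, fragile_survived = chaos_round(fragile, rng)
```

The service with two replicas survives losing either one, while the service with a single replica does not: that second result is exactly the sort of finding you would rather discover in an experiment than in an incident.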

So let’s say that you’ve adopted all of these practices****: what are you going to do with the information?  Well, there are some obvious things you can do, such as:

  • removing discovered weaknesses;
  • improving resilience;
  • getting rid of single points of failure;
  • ensuring that you have adequately trained staff;
  • making sure that your backups are protected, but available to authorised entities.

I won’t try to compile an exhaustive list, because there are loads of books and articles and training courses about this sort of thing, but there’s another, maybe less obvious, course of action which I believe we must take, and that’s to plan for managed degradation.

Part 2 – managed degradation

What do I mean by that?  Well, it’s simple.  We***** are trained and indoctrinated to take the view that if something fails, it must always “fail to safe” or “fail to secure”.  If something stops working right, it should stop working at all.

There’s value in this approach, of course there is, and we’re paid****** to ensure everything is secure, right?  Wrong.  We’re actually paid to help keep the business running, and here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.  Crazy, right?  “The business” want to keep making money and servicing customers even if things aren’t perfectly secure!  Don’t they know the risks?

And the answer to that question is “no”.  They don’t know the risks.  And that’s our real job: we need to explain the risks and the mitigations, and allow a balancing act to take place.  In fact, we’re always making those trade-offs and managing that balance – after all, the only truly secure computer is one with no network connection, no keyboard, no mouse and no power connection*******.  But most of the time, we don’t need to explain the decisions we make around risk: we just take them, following best industry practice, regulatory requirements and the rest.  Nor are the trade-offs usually as stark as they become when failure strikes – whether through an attack, accident or misfortune – when it’s often a pretty simple choice between maintaining a particular security posture and keeping the lights on.  So we need to think about and plan for some degradation, and realise that on occasion, we may need to adopt a different security posture to the perfect (or at least preferred) one in which we normally operate.

How would we do that?  Well, the approach I’m advocating is best described as “managed degradation”.  We allow our systems – including, where necessary, our security systems – to degrade to a managed (and preferably planned) state, where we know that they’re not operating at peak efficiency, but where they are operating.  Key, however, is that we know the conditions under which they’re working, so we understand their operational parameters, and can explain and manage the risks associated with this new posture.  That posture may change, in response to ongoing events, and the systems and our responses to those events, so we need to plan ahead (using the techniques I discussed above) so that we can be flexible enough to provide real resiliency.

We need to find modes of operation which don’t expose the crown jewels******** of the business, but do allow key business operations to take place.  And those key business operations may not be the ones we expect – maybe it’s more important to be able to create new orders than to collect payments for them, for instance, at least in the short term.  So we need to discuss the options with the business, and respond to their needs.  This planning is not just security resiliency planning: it’s business resiliency planning.  We won’t be able to consider all the possible failures – though the techniques I outlined above will help us to identify many of them – but the more we plan for, the better we will be at reacting to the surprises.  And, possibly best of all, we’ll be talking to the business, informing them, learning from them, and even, maybe just a bit, helping them understand that the job we do does have some value after all.
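One way to make managed degradation concrete is to write the planned modes down before you need them.  The sketch below is a minimal illustration in Python, assuming a made-up business with an order database and a payment gateway: the mode names, operations and rules are all hypothetical, and a real plan would come out of the conversations with the business described above.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"  # planned, reduced operation
    HALTED = "halted"      # nothing safe to offer


# Which operations each planned mode permits (assumed for illustration).
# Note that the crown jewels are only reachable in NORMAL mode.
ALLOWED = {
    Mode.NORMAL: {"create_order", "collect_payment", "export_customer_data"},
    Mode.DEGRADED: {"create_order"},
    Mode.HALTED: set(),
}


def choose_mode(order_db_up: bool, payment_gateway_up: bool) -> Mode:
    """Pick the planned mode for the current state of the two components."""
    if not order_db_up:
        return Mode.HALTED
    if not payment_gateway_up:
        # Keep taking orders and collect payment later.
        return Mode.DEGRADED
    return Mode.NORMAL


def is_allowed(mode: Mode, operation: str) -> bool:
    return operation in ALLOWED[mode]


current = choose_mode(order_db_up=True, payment_gateway_up=False)
can_take_orders = is_allowed(current, "create_order")
can_take_payment = is_allowed(current, "collect_payment")
```

With the payment gateway down, this hypothetical business keeps creating orders but stops collecting payments and stops exposing customer data.  The value is less in the code than in the table: it records, in advance, which posture you have agreed to fall back to and what that posture permits.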


*I’m assuming that we’re the Good Guys/Gals**.

**Maybe less story than MBA*** case study.

***There’s no shame in it.

****Well done, by the way.

*****The mythical security community again – see past posts.

******Hopefully…

*******Preferably at the bottom of a well, encased in concrete, with all storage already removed and destroyed.

********Probably not the actual Crown Jewels, unless you work at the Tower of London.