Single point of failure

Any failure which completely brings down a system for over 12 hours counts as catastrophic.

Yesterday[1], Gatwick Airport suffered a catastrophic failure. It wasn’t Air Traffic Control, it wasn’t security scanners, it wasn’t even check-in desk software, but the flight information boards. Catastrophic? Well, maybe the impact on the functioning of the airport wasn’t catastrophically affected, but the system itself was. For my money, any failure which completely brings down a system for over 12 hours (from 0430 to 1700 BST, reportedly), counts as catastrophic.

The failure has been blamed on damage to a fibre optic cable. It turned out that if this particular component of the system was brought down, then the system failed to operate as expected: it was a single point of failure. Now, in this case, it could be argued that the failure did not have a security impact: this was a resilience problem. Setting aside the fact that resilience and security are often bedfellows[2], many single points of failure absolutely are security issues, as they become obvious points of vulnerability for malicious actors to attack.

A key skill that needs to be grown with IT in general, but security in particular, is systems thinking, as I’ve discussed elsewhere, including in my first post on this blog: Systems security – why it matters. We need more systems engineers, and more systems architects. The role of systems architects, specifically, is to look beyond the single components that comprise a system, and to consider instead the behaviour of the system as a whole. This may mean looking past our first focus and our to, to for instance, hardware or externally managed systems to consider what the impact of failure, damage or compromise would be to the system’s overall operation.

Single points of failure are particularly awkward.  They crop up in all sorts of places, and they are a very good example of why diversity is important within IT security, and why you shouldn’t trust a single person – including yourself – to be the only person who looks at the security of a system.  My particular biases are towards crypto and software, for instance, so I’m more likely to miss a hardware or network point of failure than somebody with a different background to me.  Not to say that we shouldn’t try to train ourselves to think outside of whatever little box we come from – that’s part of the challenge and excitement of being a systems architect – but an acknowledgement of our own lack of expertise is in itself a realisation of our expertise: if you realise that you’re not an expert, you’re part way to becoming one.

I wanted to finish with an example of a single point of failure that is relevant to security, and exposes a process vulnerability.  The Register has a good write-up of the Foreshadow attack and its impact on SGX, Intel’s Trusted Execution Environment (TEE) capability.  What’s interesting, if the write-up is correct, is that what seems like a small break to a very specific part of the entire security chain means that you suddenly can’t trust anything.  The trust chain is broken, and you have to distrust everything you think you know.  This is a classic security problem – trust is a very tricky set of concepts – and one of the nasty things about it is that it may be entirely invisible to the user that an attack has taken place at all, particularly as the user, at this point, may have no visibility of the chain of trust that has been established – or not – up to the point that they are involved.  There’s a lot more to write about on this subject, but that’s for another day.  For now, if you’re planning to visit an airport, ensure that you have an app on your phone which will tell you your flight departure time and the correct gate.

1 – at time of writing, obviously.

2 – for non-native readers[3] , what I mean is that they are often closely related and should be considered together.

3 – and/or those unaquainted with my somewhat baroque language and phrasing habits[4].

4 – I prefer to double-dot when singing or playing Purcell, for instance[5].

5 – this is a very, very niche comment, for which slight apologies.

Author: Mike Bursell

Long-time Open Source and Linux bod, distributed systems security, etc.. Now employed by Red Hat.

3 thoughts on “Single point of failure”

      1. – from this site:
        Because the PALM database is down, the following systems are currently unavailable:

        Electronic Filing System (EFS-Web)
        Electronic Freedom of Information Act (E-FOIA)
        Electronic Patent Assignment System (ePAS)
        Employee Locator Search
        Global Dossier (GD)
        Office of Enrollment and Discipline Information System (OEDIS)
        Order Entry Management System (OEMS)
        One Portal Dossier (OPD)
        Patent Global Dossier Public Access Dossier (P-GD-PAD)
        Patent Maintenance Fees Storefront
        Private Patent Application Information Retrieval (Private PAIR)
        Public Patent Application Information Retrieval (Public PAIR)
        Patent Review Process System (PRPS)
        Patent Trial and Appeal Board End-to-End (PTAB-E2E)


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s