There was a huge story in the UK last week about how an expired certificate basically brought an entire mobile phone network (O2) to its knees. But what is a certificate, why do they expire, and why would that have such a big impact? In order to understand, let’s step back a bit and look at why you need certificates in the first place.
Let’s assume that two people, Alice and Bob, want to exchange some secret information. Let’s go further, and say that Bob is actually Bobcorp, Alice’s bank, and she wants to be able to send and receive her bank statements in encrypted form. There are well-established ways to do this, and the easiest is for them to agree on a shared key that they both use to encrypt and decrypt each other’s messages. How do they agree on this key? Luckily, there are some clever ways in which they can manage a “handshake” between the two of them, even if they’ve not communicated before, which ends with both of them having a copy of the key, without the chance of anybody else getting hold of it.
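To make the handshake idea a little more concrete, here is a minimal sketch of one well-known key agreement scheme, Diffie-Hellman. This is my choice of example rather than a claim about what any particular bank or browser uses, and the parameters `P` and `G` are toy values picked for readability: real deployments use carefully chosen groups or elliptic curves.

```python
import secrets

# Toy public parameters (assumptions for illustration, not secure choices).
P = 2**127 - 1  # a Mersenne prime, used as the modulus
G = 3           # public base

def make_keypair():
    """Return (private, public) for one side of the handshake."""
    private = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public = pow(G, private, P)
    return private, public

def shared_key(own_private, peer_public):
    """Combine our private value with the other side's public value."""
    return pow(peer_public, own_private, P)

alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

# Only the public values are exchanged; the private values never leave home.
alice_key = shared_key(alice_private, bob_public)
bob_key = shared_key(bob_private, alice_public)

print(alice_key == bob_key)  # prints True
```

Both sides end up with the same number, even though somebody listening in only ever sees the two public values. Note what the sketch does not do: nothing in it tells Alice whose public value she has actually received.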
The problem is that Alice can’t actually be sure that she’s talking to Bobcorp (or vice versa). Bobcorp probably doesn’t mind, at this point, because he can ask Alice to provide her login credentials, which will allow him to authenticate her. But Alice really does care: she certainly shouldn’t be handing her login details to somebody – let’s call her “Eve” – who’s just pretending to be Bob.
The solution to this problem comes in two parts: certificates and Certificate Authorities (CAs). A CA is a well-known and trusted party with whom Bobcorp has already established a relationship, typically by providing company details, website details and the like. Bobcorp also creates and sends the CA a special key and very specific information about itself (maybe including the business name, address and website information). The CA, having established Bobcorp’s bona fides, creates a certificate for Bobcorp, incorporating the information that was requested – in fact, some of the information that Bobcorp sends the CA is usually in the form of a “self-signed certificate”, so pretty much all that the CA needs to do is provide its own signature.
Astute readers will be asking themselves: “How did this help? Alice still needs to trust the CA, right?” The answer is that she does. But there will typically be a very small number of CAs in comparison to Bobcorp-type companies, so all Alice needs to do is ensure that she can trust a few CAs, and she’s now good to go. In a Web-browsing scenario, Alice will usually have downloaded a browser which already has appropriate trust relationships with the main CAs built in. She can now perform safe handshakes with lots of companies, and as long as she (or her browser) checks that they provide certificates signed by a CA that she trusts, she’s relatively safe.
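The sign-then-check idea can be sketched in a few lines. Everything here is hypothetical and simplified: the `Certificate` class, the `ExampleCA` name, the `bobcorp.example` subject and the placeholder key numbers are invented for illustration, and the CA key is a toy textbook RSA key that is not remotely secure. Real certificates use the X.509 format and proper signature schemes. The sketch needs Python 3.8 or later for the modular inverse.

```python
import hashlib
from dataclasses import dataclass

# Toy RSA key for a hypothetical CA. Both factors are well-known Mersenne
# primes, so this key is for illustration only and is NOT secure.
_P, _Q = 2**61 - 1, 2**89 - 1
CA_N = _P * _Q
CA_E = 65537
_CA_D = pow(CA_E, -1, (_P - 1) * (_Q - 1))  # the CA's private exponent

@dataclass(frozen=True)
class Certificate:
    subject: str       # who the certificate is about
    public_key: int    # the subject's public key (a placeholder number here)
    issuer: str        # which CA signed it
    signature: int = 0

def _digest(cert, n):
    """Hash the signed fields and reduce the result into the RSA modulus."""
    data = f"{cert.subject}|{cert.public_key}|{cert.issuer}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

def ca_sign(subject, public_key, issuer="ExampleCA"):
    """What the CA does once it has checked the applicant's bona fides."""
    unsigned = Certificate(subject, public_key, issuer)
    signature = pow(_digest(unsigned, CA_N), _CA_D, CA_N)
    return Certificate(subject, public_key, issuer, signature)

# Alice's trust store: the CA public keys that shipped with her browser.
TRUSTED_CAS = {"ExampleCA": (CA_N, CA_E)}

def is_trusted(cert):
    """Accept a certificate only if a CA that Alice trusts signed it."""
    ca_key = TRUSTED_CAS.get(cert.issuer)
    if ca_key is None:
        return False
    n, e = ca_key
    return pow(cert.signature, e, n) == _digest(cert, n)

bob_cert = ca_sign("bobcorp.example", public_key=123456789)
# Eve reuses Bobcorp's signature but swaps in her own key.
eve_cert = Certificate("bobcorp.example", 987654321, "ExampleCA", bob_cert.signature)

print(is_trusted(bob_cert))  # prints True
print(is_trusted(eve_cert))  # prints False
```

Alice only needs the CA’s public key to run the check, which is why a short list of CAs in her browser is enough to cover a very large number of Bobcorps.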
But there’s a twist. The certificates that the CA issues to Bobcorp (and others) typically have an expiration date on them. This isn’t just to provide the CA with a recurring revenue stream – though I’m sure that’s a nice benefit – but also in case Bobcorp’s situation changes: what if it has gone bankrupt, for instance, or changed its country of business? So after a period of time (typically in the time frame of a year or two, but maybe less or more), Bobcorp needs to reapply to get a new certificate.
What if Bobcorp forgets? Well, when Alice visits Bobcorp’s site and the browser notices an expired certificate, it should warn her not to proceed, and she shouldn’t give them any information until it’s renewed. This sounds like a pain, and it is: Bobcorp and its customers are going to be severely inconvenienced. Somebody within Bobcorp whose job it was to renew the certificate is going to be in trouble.
Life is even worse in the case where no actual people are involved. Suppose that, instead of Alice, we have an automated system, A, and instead of Bob, we have an automated system, B. A still needs to trust that it’s talking to the real B, in case an evil system, E, is pretending to be B, so certificates are still required. In this case, if B’s certificate expires, A should quite rightly refuse to connect to it. This seems to have been what happened to cause the mobile data outage that O2 is blaming on Ericsson, one of its suppliers. There was no easy way to fix the problem, or to tell the many, many A-type systems that may have been trying to communicate with the B system(s) to carry on regardless. And so, for want of a nail, the kingdom was lost.
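The expiry decision that system A makes, and the early warning that system B’s owners would have wanted, can both be sketched with Python’s standard `ssl` module. The function names and the 30-day warning window are my own assumptions; the only library call relied on is `ssl.cert_time_to_seconds`, which parses the `notAfter` text that `SSLSocket.getpeercert()` reports for a certificate.

```python
import ssl
import time

def seconds_until_expiry(not_after, now=None):
    """Seconds left before the certificate expires (negative if it already has).

    not_after uses the 'notAfter' text format, e.g. 'Jan  5 09:34:43 2018 GMT'.
    """
    if now is None:
        now = time.time()
    return ssl.cert_time_to_seconds(not_after) - now

def should_connect(not_after, now=None):
    """System A's rule: refuse to talk to B once B's certificate has expired."""
    return seconds_until_expiry(not_after, now) > 0

def needs_renewal(not_after, now=None, warn_days=30):
    """Operator's rule: raise the alarm while there is still time to act."""
    return seconds_until_expiry(not_after, now) < warn_days * 86400

NOT_AFTER = "Jan  5 09:34:43 2018 GMT"
expiry = ssl.cert_time_to_seconds(NOT_AFTER)

print(should_connect(NOT_AFTER, now=expiry - 1))          # prints True
print(should_connect(NOT_AFTER, now=expiry + 1))          # prints False
print(needs_renewal(NOT_AFTER, now=expiry - 10 * 86400))  # prints True
```

The first function is the one that takes a network down; the last is the one that gives somebody a few weeks’ notice to stop that from happening.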
The lesson? Avoid single points of failure, think about fall-back modes. And be ready to move to remedy unexpected errors. Quickly.
1 – there are also various mechanisms to revoke, or cancel, certificates, but they are typically complex, ill-implemented in many cases, and consequently little-used.