Track and trace failure: a systems issue

The problem was not Excel, but that somebody used the wrong tools in a system.

Like many other IT professionals in the UK – and across the world, having spoken to some other colleagues in other countries – I was first surprised and then horrified as I found out more about the failures of the UK testing and track and trace systems. What was most shocking about the failure is not that it was caused by some alleged problem with Microsoft Excel, but that anyone thought this was a problem due to Excel. The problem was not Excel, but that somebody used the wrong tools in a system which was not designed properly, tested properly, or run properly. I have written many words about systems, and one article in particular seems relevant: If it isn’t tested, it doesn’t work. In it, I assert that a system cannot be said to work properly if it has not been tested, as a fully working system requires testing in order to be “working”.

In many software and hardware projects, in order to complete a piece of work, it has to meet one or more of a set of tests which allow it to be described as “done”. These tests may be actual software tests, or documentation, or just checks done by members of the team (other than the person who did the piece of work!), but the list needs to be considered and made part of the work definition. This “done” definition is as much part of the issue being addressed, functionality added or documentation being written as the actual work done itself.

I find it difficult to believe that there was any such definition for the track and trace system. If there was, then it was not, I’m afraid, defined by someone who is an expert in distributed or large-scale systems. This may not have been the fault of the person who chose Excel for the task of recording information, but it is the fault of the person who was in charge of the system, because Excel is not, and never was, a fit application for what it was being used for. It does not have the scalability characteristics, the integrity characteristics or the versioning characteristics required. This is not the fault of Microsoft, any more than it would be fault of Porsche if a 911T broke down because its owner filled with diesel fuel, rather than petrol[1]. Any competent systems architect or software engineer, qualified to be creating such a system, would have known this: the only thing that seems possible is that whoever put together the system was unqualified to do so.

There seem to be several options here:

  1. the person putting together the system did not know they were unqualified;
  2. the person putting together the system realised that they were unqualified, but did not feel able to tell anyone;
  3. the person putting together the systems realised that they were unqualified, but lied.

In any of the above, where was the oversight? Where was the testing? Where were the requirements? This was a system intended to safeguard the health of millions – millions – of people.

Who can we blame for this? In the end, the government needs to take some large measure of responsibility: they commissioned the system, which means that they should have come up with realistic and appropriate requirements. Requirements of this type may change over the life-cycle of a project, and there are ways to manage this: I recommend a useful book in another article, Building Evolutionary Architectures – for security and for open source. These are not new problems, and they are not unsolved problems: we know how to do this as a profession, as a society.

And then again, should we blame someone? Normally, I’d consider this a question out of scope for this blog, but people may die because of this decision – the decision not to define, design, test and run a system which was fit for purpose. At the very least, there are people who are anxious and worried about whether they have Covid-19, whether they need to self-isolate, whether they may have infected vulnerable friends or family. Blame is a nasty thing, but if it’s about holding people to account, then that’s what should happen.

IT systems are important. Particularly when they involve people’s health, but in many other areas, too: banking, critical infrastructure, defence, energy, even retail and entertainment, where people’s jobs will be at stake. It is appropriate for those of us with a voice to speak out, to remind the IT community that we have a responsibility to society, and to hold those who commission IT systems to account.

1 – or “gasoline” in some geographies.

Author: Mike Bursell

Long-time Open Source and Linux bod, distributed systems security, etc.. Now employed by Red Hat. マイク・バーゼル: オープンソースとLinuxに長く従事。他にも分散セキュリティシステムなども手がける。現在Red Hatのチーフセキュリティアーキテクト

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s