In praise of triage

It’s all too easy to prioritise based on the “golfing test”.

Not all bugs are created equal.

Some bugs need fixing now, some bugs can wait. Some bugs are in your implementation, some are in the underlying design. Some bugs will annoy a few customers, some will destroy your business.

Bugs come in all shapes and sizes, and one of the tasks of a product owner, product manager, chief architect – whoever makes the call about where to assign resources – is to decide which ones to address in which order: to prioritise them. The problem is deciding how to prioritise them. It’s all too easy to prioritise based on the “golfing test”: your CEO meets someone on golf course who mentions that his or her company loves your product, except for one tiny issue. The CEO comes back, and makes it clear that fixing this “major bug” is now your one and only task until it’s done, and your world is turned upside down. You have to fix the bug as quickly as possible, with no thought to the impact it has on the rest of the project, or the immense pile technical debt that’s just been accrued. You don’t want to live in this world. What, then, is the alternative?

The answer – though it’s only the beginning of the answer – is triage. Triage (from the French for “separating out”) comes from the world of battlefield medicine. When deciding which wounded soldiers to treat, rapid (hopefully objective) assessments are carried out, allowing a quick sorting of each soldier, typically into categories such as “not urgent: wait”, “urgent: treat immediately” and “not saveable: do not treat”. We can apply the same to software bugs in order to decide what to treat (fix) and with what priority. The important thing is not so much the categories – which will vary based on your context – but the assessment criteria, and how they are applied. Here are a list of just some of the possible criteria:

  • likely monetary impact per customer
  • number of customers impacted
  • reputational impact on your organisation
  • ease to fix
  • impact on system security
  • impact on system performance
  • impact on system stability
  • annoyance of CEO not to be listened to.

We do not, of course, only need to apply one of these: a number of them can be combined with a weighting system, though the more you add, the less clear your priorities will be, and the more likely it is that someone will “put a finger on the scales” – tweak the numbers to give the outcome they want. Another important point about the categories that you decide to apply is that they should be as measurable as you can make them, to allow as objective scoring as possible. I wrote a review of the book Building Evolutionary Architectures a while ago: the methodology adopted there, where you measure and test in order to meet specific criteria, is exactly the sort of approach you should be choosing when designing your triage system.

This is (ostensibly) a blog about security, and so you might expect me to say that “security always wins”, but that should absolutely not be the approach you take. Security might be the most important category for you (that is, carry the most weight), but you need to understand why that is the case – at this particular time – and what exactly you mean by “security”. The “security of the system” is not an objective measure: in order to mean anything, such a phrase needs to reference measurements that can be made (“resistance to physical tampering”, “resistance to brute force attacks”, “number or PhD students likely to be needed to reverse engineer our ‘secure’ protocol”[1]). More importantly, it may be that at this point in your organisation’s life, the damage done by lack of stability or decreased performance outweighs the impact of a security bug. If that’s the case, then your measurements should encapsulate that information and lead you to prioritise bugs with impact in these categories over security issues[3].

There’s one proviso that I feel I need to put in at this point, and it’s about the power of what, in Agile Methodology terms, is called the Product Owner. This is the person who represents the users of the product/project, and should have final say about the direction of development in terms of features, functionality and, most relevant in this discussion, bug-fixing. As noted above, this may be an architect, product manager or someone enjoying another title, but their role should be clear: they get to call the shots. There are times when this person goes against the evidence provided by the triage, and makes a decision to prioritise a particular bug over others despite the outcome of the measurements. This is typically very painful for the technical team[4], but, when it comes down to it, as the product owner, they get to decide. The technical team – after appropriate warnings and discussion[5] – must be ready to step aside and accept the decision. Such decisions (and related discussions) should be recorded, and the product owner must be ready to stand or fall based on the outcome, but that is their job. Triage is a guide, and there are occasions when there are measurements which cannot be easily made objectively, and which sit outside the expertise or scope of knowledge of the technical team. If this sort of decision keeps being made, and you think you know better, you may have a future in technical product management, where people with a view of both the technical and the business side of technology are much in demand. In the end, though, the product owner will need to justify their decision to management, and if they get it wrong, then they must be ready to take the blame (this is one reason why you should make sure that you’ve recorded the process taken to get to this decision – you don’t want to take the blame for a poor decision which you advised against).

So: go out an design a triage process, be ready to follow it, and be ready to defend it. Oh, and one last point: you might want to buy a set of golf clubs.

—–

1 – this last one is a joke: don’t design your own protocol, or if you do, make it open and have it peer-reviewed[2].

2 – and then throw it away and use an open source implementation of better, more thoroughly-reviewed one.

3 – much as it pains me to say it.

4 – I’ve been on both sides of these decisions: I know.

5 -often rather heated, in my experience.

What’s a hash function?

It should be computationally implausible to work backwards from the output hash to the input.

Note – many thanks to a couple of colleagues who provided excellent suggestions for improvements to this text, which has been updated to reflect them.

Sometimes I like to write articles about basics in security, and this is one of those times. I’m currently a little over a third of the way through writing a book on trust in computing and the the cloud, and ended up creating a section about cryptographic hashes. I thought that it might be a useful (fairly non-technical) article for readers of this blog, so I’ve edited it a little bit and present it here for your delectation, dears readers.

There is a tool in the security practitioner’s repertoire that it is helpful for everyone to understand: cryptographic hash functions. A cryptographic hash function, such as SHA-256 or MD5 (now superseded for cryptographic uses as it’s considered “broken”) takes as input a set of binary data (typically as bytes) and gives as output which is hopefully unique for each set of possible inputs. The length of the output – “the hash” – for any particularly hash function is typically the same for any pattern of inputs (for SHA-256, it is 32 bytes, or 256 bits – the clue’s in the name). It should be computationally implausible (cryptographers hate the word “impossible”) to work backwards from the output hash to the input: this is why they are sometimes referred to as “one-way hash functions”. The phrase “hopefully unique” when describing the output is extremely important: if two inputs are discovered that yield the same output, the hash is said to have “collisions”. The reason that MD5 has become deprecated is that it is now trivially possible to find collisions with commercially-available hardware and software systems. Another important property is that even a tiny change in the message (e.g. changing a single bit) should generate a large change to the output (the “avalanche effect”).

What are hash functions used for, and why is the property of being lacking in collisions so important? The simplest answer to the first question is that hash functions are typically used to ensure that when someone hands you a piece of binary data (and all data in the world of computing can be described in binary format, whether it is text, an executable, a video, an image or complete database of data), it is what you expect. Comparing binary data directly is slow and arduous computationally, but hash functions are designed to be very quick. Given two files of several Megabytes or Gigabytes of data, you can produce hashes of them ahead of time, and defer the comparisons to when you need them[1].

Indeed, given the fact that it is easy to produce hashes of data, there is often no need to have both sets of data. Let us say that you want to run a file, but before you do, you want to check that it really is the file you think you have, and that no malicious actor has tampered with it. You can hash that file very quickly and easily, and as long as you have a copy of what the hash should look like, then you can be fairly certain that you have the file you wanted. This is where the “lack of collisions” (or at least “difficulty in computing collisions”) property of hash functions is important. If the malicious actor can craft a replacement file which shares the same hash as the real file, then the process is essentially useless.

In fact, there are more technical names for the various properties, and what I’ve described above mashes three of the important ones together. More accurately, they are:

  1. pre-image resistance – this says that if you have a hash, it should be difficult to find the message from which it was created, even if you know the hash function used;
  2. second pre-image resistance – this says that if you have a message, it should be difficult to find another message which, when hashed, generates the same hash;
  3. collision resistance – this says that it should be difficult to find any two messages which generate the same hash.

Collision resistance and second pre-image resistance sound like the same property, at first glances, but are subtly (but importantly) different. Pre-image resistance says that if you already have a message, finding another with a matching hash, whereas collision resistance should make it hard for you to find any two messages which will generate the same hash, and is a much harder property to fulfil in a hash function.

Let us go back to our scenario of a malicious actor trying to exchange a file (with a hash which we can check) with another one. Now, to use cryptographic hashes “in the wild” – out there in the real world beyond the perfectly secure, bug-free implementations populated by unicorns and overflowing with fat-free doughnuts – there are some important and difficult provisos that need to be met. More paranoid readers may already have spotted some of them, in particular:

  1. you need to have assurances that the copy of the hash you have has also not been subject to tampering;
  2. you need to have assurances that the entity performing the hash performs and reports it correctly;
  3. you need to have assurances that the entity comparing the two hashes reports the result of that comparison correctly.

Ensuring that you can meet such assurances it not necessarily an easy task, and is one of the reasons that Trusted Platform Modules (TPMs) are part of many computing systems: they act as a hardware root of trust with capabilities to provide such assurances. TPMs are a useful an important tool for real-world systems, and I plan to write an article on them (similar to What’s an HSM?) in the future.


1 – It’s also generally easier to sign hashes of data, rather than large sets of data themselves – this happens to be important as one of the most common uses of hashes is for cryptographic (“digital”) signatures.

Thunderspy – should I care?

Thunderspy is a nasty attack, but easily prevented.

There’s a new attack out there which is getting quite a lot of attention this week. It’s called Thunderspy, and it uses the Thunderbolt port which is on many modern laptops and other computers to suck data from your machine. I thought that it might be a good issue to cover this week, as although it’s a nasty attack, there are easy ways to defend yourself, some of which I’ve already covered in previous articles, as they’re generally good security practice to follow.

What is Thunderspy?

Thunderspy is an attack on your computer which allows an attacker with moderate resources to get at your data under certain circumstances. The attacker needs:

  • physical access to your machine – not for long (maybe five minutes), but they do need it. This type of attack is sometimes called an “evil maid” attack, as it can be carried out by hotel staff with access to your room;
  • the ability to take your computer apart (a bit) – all we’re talking here is a screwdriver;
  • a little bit of hardware – around $400 worth, according to one source;
  • access to some freely available software;
  • access to another computer at the same time.

There’s one more thing that the attacker needs, and that’s for you to leave your computer on, or in suspend mode. I’ve discussed different power modes before (in 3 laptop power mode options), and mentioned, as well, that leaving your machine in suspend mode is generally a bad idea (in 7 security tips for travelling with your laptop). It turns out I was right.

What’s the bad news?

Well, there’s quite a lot of bad news:

  • lots of machines have Thunderbolt ports (you can find pictures of both the port and connectors on Wikipedia’s Thunderbolt page, in case you’re not sure whether your machine is affected);
  • machines are vulnerable even if you have full disk encryption;
  • Windows machines are vulnerable;
  • Linux machines are vulnerable;
  • Macintosh machines are vulnerable;
  • most machines with a Thunderbolt port from 2011 onwards are vulnerable;
  • although protection is available on some newer machines (from around 2019)
    • the extent of its efficacy is unclear;
    • lots of manufacturers don’t implement it;
  • some protections that you can turn on break USB and other functionality;
  • one variant of the attack breaks Thunderbolt security permanently, meaning that the attacker won’t need to take your computer apart at all for subsequent attacks: they just need physical access to the port whilst your machine it turned on (or in suspend mode).

The worst thing to note is that full disk encryption does not help you if your computer is turned on or in suspend mode.

Note – I’ve been unable to find out whether any Chromebooks have Thunderbolt support. Please check your model’s specifications or datasheet to be certain.

What’s the good news?

The good news is short and sweet: if you turn your computer completely off, or ensure that it’s in Hibernate mode, then it’s not vulnerable. Thunderspy is a nasty attack, but it’s easily prevented.

What should I do?

  1. Turn your computer off when you leave it unattended, even for short amounts of time.

That was easy, wasn’t it? This is best practice anyway, and it turns out that hibernate mode is also OK. What the attacker is looking for is a powered-up, logged-on computer with Thunderbolt. If you can stop them finding a computer that meets those criteria, then you’re fine. Putting your computer into hibernate mode is also OK.

3 open/closed Covid-19 contact tracing questions

All projects are not created equal.

One of the cheering things about the pandemic crisis in which we find ourselves is the vast up-swell of volunteering that we are seeing across the world. We are seeing this equally across the IT sector, and one of the areas where work is being done is in apps to help track Covid-19. Specifically, there is an interest in Covid-19 contact tracing, or tracking, apps for our mobile[0] phones. These aren’t apps which keep an eye on whether you’ve observed lock-down procedures, but which attempt to work out who has been in contact with whom, and work out from that, once we know that one person is infected with Covid-19, what the likely spread of the virus will be.

There are lots of contact tracing initiatives out there, from Pep-Pt from the European Union to Singapore’s TraceTogether, from the University of Washington’s PACT to MIT’s PACT[1]. Google and Apple are – unprecedentedly – working on an app together. There are lots of ways of comparing these apps and projects, but in today’s article, I want to suggest three measures which can help you consider them from the point of view of “openness”. As regular readers of this blog will know, I’m a big fan of open source – not just for software, but for data, management and the rest – and I believe that there’s also a strong correlation here with civil or human rights. There are lots of ways to compare these apps, but these three measures are not too technical, and can help us get a grip on the likelihood that some of the apps (and associated projects) may impinge on privacy and other issues about which we care. I don’t want the data generated from apps that I download onto my phone to be used now or in the future to curtail my, or other people’s civil or human rights, for blackmail or even for unapproved commercial gain.

1. Open source

Our first question must be: “is the app open source?” If the answer is “no”, then we have no way to know what is being captured, and therefore how it is being used. If the app is closed source, it could be collecting any data from pretty much any measuring device on our phones, including photo, video, audio, Bluetooth, wifi, temperature, GPS or accelerometer. We can try restricting access to these measurements, but such controls have not always been effective, understanding the impact of turning them off is rarely simple, and people frankly rarely bother to check them anyway. Equally bad is the fact that with closed source, you can’t have any idea of how good the security is, nor any chance to criticise and improve it. This is something about which I’ve written many times, including in my articles Disbelieving the many eyes hypothesis and Trust & choosing open source. Luckily, it seems that the majority of contact tracing apps are open source, but please be careful, and reject any which are not.

2 Centralised or distributed

In order to make sense of all the data that these apps collect, there needs to be a centralised[2] store where it can be processed, right? It’s common sense.

Actually, no. Although managing and processing data in one place can be much easier, there are ways to store data in a distributed manner, and allow the sorts of processing needed for contact tracing to take place. It may be more complex, but it also makes it much, much more difficult for governments, corporations or malicious actors to misuse this information. And we should be clear that this will be what happens if the data is made available. Maybe the best governments and the best corporations will be well-behaved by their standards, but a) those are not necessarily the standards that I or others will endorse and b) what about malicious actors and governments and corporations which are not “the best”?

3 Location or proximity tracking

This might seem like another obvious choice: if you want to be finding out who was in contact with whom, then the way to do it is see who was where, and when. GPS tracking – and associated technologies like wifi access point location tracking – combined with easily available time data, would give the ability to work out who was in a particular place at the same time as other people. This is true, but it also provides enormous opportunities for misuse, particularly when the data is held centrally (see above). An alternative is to use sensors like Bluetooth or NFC[3], to allow phones to collect information about other phones (or devices) with which they have been in contact and when. This is more easily anonymised – or pseudonymised – allowing information to be passed to the owners of those phones, but at the same time more difficult to misuse by governments, corporations and malicious actors.

There are other issues to consider, one of which is that these sensors were not designed for this type of use, and we may be sacrificing accuracy if we choose this option. On the other hand, many interactions between people occur indoors, where GPS is much less effective anyway, and these types of technologies may help.

You could argue that this measurement is not about “openness” in itself, but it is a key indicator to whether the information collected can be used in ways which are far from open.

Conclusion

There are many other questions we can ask about Covid-19 contact tracing apps, some of which are related to openness, and some of which are not. These include:

  • Coverage
    • not all demographics have – or use – phones as much as the rest of the population, including the poor, the elderly, and certain religious groups. How effective will such projects be if they have reduced access to these groups?
    • older devices may have less accurate sensors, or not have some of the capabilities required by the apps. What is more, there may be a correlation between use of these older devices with some of the demographics noted above.
    • some people rarely update the apps on their phones, so even if they load an initial version of an app, newer versions, with functionality or security improvements, are likely to be unequally distributed across the set of devices.
  • Removal – how easy will it be to remove the application fully, what are the consequences of not doing so, and how likely are people to do so anyway[4]?
  • Will use of these apps by mandatory or voluntary? If the former, there are serious concerns about civil or human rights, not to mention the problems noted above about coverage.

All of these questions are important, but not directly related to the question of the “openness” of the apps and projects. However, we have, right now, some great opportunities to work with and influence some really important projects for public health and well-being, and I believe that it is important that we consider the questions I’ve raised about openness before endorsing, installing or using any of the apps that are being created.


0 – or “cell”, if you’re in North America.

1 – yes, they chose the same acronym. Yes, it is confusing.

2 – or, I supposed, “centralized”, depending on your geography.

3 – “Near Field Communication” – the same capability used when you do contactless payment with your phone or credit/debit card.

4 – how many apps do you still have on your phone that you’ve not even opened for 3 months? Yup, me too.

Post-Covid, post-open?

We are inventive, we are used to turning technologies to good.

The world of lockdown to which we’re becoming habituated at the moment has produced some amazing upsides. The number of people volunteering, the resurgence of local community initiatives, the selfless dedication of key workers across the world and the recognition of their sacrifice by the general public are among the most visible. As many regular readers of this blog are likely to be aware, there has also been an outpouring of interest and engagement in software- and hardware-related projects to help, from infection-tracking apps to 3D-printing of PPE[0]. Companies have made training and educational materials available for free, and there are attempts around the world to engage and contribute to the public commonwealth.

Sadly, not all of the news is good. There has been a rise in phishing attacks, and the lack of appropriate or sufficient security in commonly-used apps such as Zoom has become frightenly evident[1]. There’s an article to write here about the balance between security, usability and cost, but I’m going to save that for another day.

Somewhere in the middle, between the obvious positives and obvious negatives, there are some developments which most of us probably accept at necessary, but which aren’t things that we’d normally welcome. Beyond the obvious restrictions on movement and public gatherings, there are a number of actions which governments, in particular, a retaking which have generally negative impacts on human rights and civil liberties, as outlined in this piece by The Guardian. The article lists numerous examples of governments imposing, or considering the imposition of, measures which would normally be quickly attacked by human rights groups, and resisted by most citizens. Despite the headline, which suggests that the article will deal with how difficult these measures will be to remove after the end of the crisis, there is actually little discussion, beyond a note that “[w]hether that surveillance is eventually rolled back will depend on public oversight.”

I think that we need to go beyond just “oversight” and start planning now for public action. In the communities in which I live and work, there is a general expectation that the world – software, management, government, data – is becoming more, not less open. We are in grave danger of losing that openness even once the need for these government measures diminish. Governments – who will see the wider intelligence-gathering and control opportunities of these changes – will espouse the view that “we need these measures in place in order to be able to react quickly if the same thing happens again”, and, if we’re not careful, public sentiment, bruised and bloodied by the pandemic, will quietly acquiesce, and we will see improvements in human and civil rights rolled back decades, and damaged further by the availability of cheap, mobile, networked technology.

If we believe that openness is a public good, then we need to think how to counter the arguments which we will hear from governments, and be ready to be vocal – not just with counter-arguments, but with counter-proposals. This pandemic is unlike either of the World Wars of the 20th Century, when a clear ending was marked, and there was the opportunity (sadly denied to many citizens of the former USSR) to regain civil liberties and roll back the restrictions of the war years. Nor is it even like the aftermath of the 9/11, that event which has impacted the intelligence and security landscape of the past two decades, where there is (was?) at least a set of (posited) human foes to target. In the case of the Covid-19 pandemic, the “enemy” is amorphous and will be around for decades to come. The measures to combat it – and its successors – will only be slowly reduced, and some will not be.

We need to fight against those measures which are unnecessary, and we need to find alternatives – transparent, public alternatives – to measures which may have some positive effects, but whose overall impact on society and human rights is clearly negative. In a era where big data is becoming pervasive, and the tools to mine it tractable, we need to provide international mechanisms to share and use that data in ways which do not benefit any single government, bloc, or section of society. We are inventive, we are used to turning technologies to good. This is the time we need to do it, and do it quickly. We can make a difference by being open, but we need to start now.


0 – Personal Protection Equipment.

1 – although note that the company is reported to be making improvements to at least one area of concern to some – routing of traffic through China.

No security without an architecture

Your diagrams don’t need to be perfect. But they do need to be there.

I attended a virtual demo this week. It didn’t work, but none of us was stressed by that: it was an internal demo, and these things happen. Luckily, the members of the team presenting the demo had lots of information about what it would have shown us, and a particularly good architectural diagram to discuss. We’ve all been in the place where the demo doesn’t work, and all felt for the colleague who was presenting the slidedeck, and on whose screen a message popped up a few slides in, saying “Demo NO GO!” from one of her team members.

After apologies, she asked if we wanted to bail completely, or to discuss the information they had to hand. We opted for the latter – after all, most demos which aren’t foregrounding user experience components don’t show much beyond terminal windows that most of us could fake up in half an hour or so anyway. She answered a couple of questions, and then I piped up with one about security.

This article could have been about the failures in security in a project which was showing an early demo: another example of security being left till late (often too late) in the process, at which point it’s difficult and expensive to integrate. However, it’s not. It clear that thought had been given to specific aspects of security, both on the network (in transit) and in storage (at rest), and though there was probably room for improvement (and when isn’t there?), a team member messaged me more documentation during the call which allowed me to understand of the choices the team had made.

What this article is about is the fact that we were able to have a discussion at all. The slidedeck included an architecture diagram showing all of the main components, with arrows showing the direction of data flows. It was clear, colour-coded to show the provenance of the different components, which were sourced from external projects, which from internal, and which were new to this demo. The people on the call – all technical – were able to see at a glance what was going on, and the team lead, who was providing the description, had a clear explanation for the various flows. Her team members chipped in to answer specific questions or to provide more detail on particular points. This is how technical discussions should work, and there was one thing in particular which pleased me (beyond the fact that the project had thought about security at all!): that there was an architectural diagram to discuss.

There are not enough security experts in the world to go around, which means that not every project will have the opportunity to get every stage of their design pored over by a member of the security community. But when it’s time to share, a diagram is invaluable. I hate to think about the number of times I’ve been asked to look at project in order to give my thoughts about security aspects, only to find that all that’s available is a mix of code and component documentation, with no explanation of how it all fits together and, worse, no architecture diagram.

When you’re building a project, you and your team are often so into the nuts and bolts that you know how it all fits together, and can hold it in your head, or describe the key points to a colleague. The problem comes when someone needs to ask questions of a different type, or review the architecture and design from a different slant. A picture – an architectural diagram – is a great way to educate external parties (or new members of the project) in what’s going on at a technical level. It also has a number of extra benefits:

  • it forces you to think about whether everything can be described in this way;
  • it forces you to consider levels of abstraction, and what should be shown at what levels;
  • it can reveal assumptions about dependencies that weren’t previously clear;
  • it is helpful to show data flows between the various components
  • it allows for simpler conversations with people whose first language is not that of your main documentation.

To be clear, this isn’t just a security problem – the same can go for other non-functional requirements such as high-availability, data consistency, performance or resilience – but I’m a security guy, and this is how I experience the issue. I’m also aware that I have a very visual mind, and this is how I like to get my head around something new, but even for those who aren’t visually inclined, a diagram at least offers the opportunity to orient yourself and work out where you need to dive deeper into code or execution. I also believe that it’s next to impossible for anybody to consider all the security implications (or any of the higher-order emergent characteristics and qualities) of a system of any significant complexity without architectural diagrams. And that includes the people who designed the system, because no system exists on its own (or there’s no point to it), so you can’t hold all of those pieces in your head of any length of time.

I’ve written before about the book Building Evolutionary Architectures, which does a great job in helping projects think about managing requirements which can morph or change their priority, and which, unsurprisingly, makes much use of architectural diagrams. Enarx, a project with which I’m closely involved, has always had lots of diagrams, and I’m aware that there’s an overhead involved here, both in updating diagrams as designs change and in considering which abstractions to provide for different consumers of our documentation, but I truly believe that it’s worth it. Whenever we introduce new people to the project or give a demo, we ensure that we include at least one diagram – often more – and when we get questions at the end of a presentation, they are almost always preceded with a phrase such as, “could you please go back to the diagram on slide x?”.

I nearly published this article without adding another point: this is part of being “open”. I’m a strong open source advocate, but source code isn’t enough to make a successful project, or even, I would add, to be a truly open source project: your documentation should not just be available to everybody, but accessible to everyone. If you want to get people involved, then providing a way in is vital. But beyond that, I think we have a responsibility (and opportunity!) towards diversity within open source. Providing diagrams helps address four types of diversity (at least!):

  • people whose first language is not the same as that of your main documentation (noted above);
  • people who have problems reading lots of text (e.g. those with dyslexia);
  • people who think more visually than textually (like me!);
  • people who want to understand your project from different points of view (e.g. security, management, legal).

If you’ve ever visited a project on github (for instance), with the intention of understanding how it fits into a larger system, you’ll recognise the sigh of relief you experience when you find a diagram or two on (or easily reached from) the initial landing page.

And so I urge you to create diagrams, both for your benefit, and also for anyone who’s going to be looking at your project in the future. They will appreciate it (and so should you). Your diagrams don’t need to be perfect. But they do need to be there.

Not quantum-safe, not tamper-proof, not secure

Let’s make security “marketing-proof”. Or … maybe not.

If there’s one difference that you can use to spot someone who takes security seriously, it’s this: they don’t make absolute statements about security. I’m going to be a bit contentious here, and I’m sorry if it upsets some people who do take security seriously, but I’m of the very strong opinion that we should never, ever say that something is “completely secure”, “hack-proof” or even just “secured”. I wrote a few weeks ago about lazy journalism, but it pains me even more to see or hear people who really should know better using such absolutes. There is no “secure”, and I’d love to think that one day I can stop having to say this, but it comes up again and again.

We, as a community, need to be careful about the words and phrases that we use, because it’s difficult enough to educate the rest of the world about what we do without allowing non-practitioners to believe that we (or they) can take a system or component and make it so safe that it cannot be compromised or go wrong. There are two particular bug-bears that are getting to me at the moment – and that’s before I even start on the one which rules them all, “zero-trust”, which makes my skin crawl and my hackles rise whenever I hear it used[1] – and they are (as you may have already guessed from the title of this article):

  • quantum-proof
  • tamper-proof

I’ll start with the latter, because it’s more clear cut (and easier to explain). Some systems – typically hardware systems – are deployed in environments where bad people might mess with them. This, in the trade, is called “tampering”, and it has a slightly different usage from the normal meaning, in that it tends to imply that the damage done to a system or component was done with the intention that the damage didn’t necessarily stop its normal operation, but did alter it in such a way that the attacker could gain some advantage (often, but not always, snooping on activities being performed). This may have been the intention, but it may be that the damage did actually stop or at least effect normal operation, whether or not the attacker gained the advantage they were attempting. The problem with saying that any system is tamper-proof is that it clearly isn’t, particularly if you accept the second part of the definition, but even, possibly if you don’t. And it’s pretty much impossible to be sure, for the same reason that the adage that “any fool can create a cryptographic protocol that he/she can’t break” is true: you can’t assess the skills and abilities of all future attackers of your system. The best you can do is make it tamper-evident: put such controls in place that it should be clear if someone tries to tamper with the system[3].

“Quantum-safe” is another such phrase. It refers to cryptographic protocols or primitives which are designed to be resistant to attacks by quantum computers. The phrase “quantum-proof” is also used, and the problem with both of these terms is that, since nobody has yet completed a quantum computer of sufficient complexity even to be try, we can’t be sure. Even once they do, we probably won’t be sure, as people will probably come up with new and improved ways of using them to attack the protocols and primitives we’ve been describing. And what’s annoying is that the key to what we should be saying is actually in the description I gave: they are meant to be resistant to such attacks. “Quantum-resistant” is a much more descriptive and accurate phrase[5], so why not use it?

The simple answer to that question, and to the question of why people use phrases like “tamper-proof” and “secure” is that it makes better marketing copy. Ill-informed customers are more likely to buy something which is “safe” or which is “proof” against something, rather than evidencing it, or being resistant to it. Well, our part of our jobs as security professionals is to try to educate those customers, and make them less ill-informed[6]. Let’s make security “marketing-proof”. Or … maybe not.


1 – so much so that I’m actually writing a book at it[2].

2 – not just the concept of “zero-trust”, but about trust in general.

3 – sometimes, the tamper-evidence is actually intentionally destroying the capabilities a system so that you can be pretty sure that the attacker wasn’t able to make it do things it wasn’t supposed to[4].

4 – which is pretty cool, though it does mean that you can’t make it do the things it was supposed to either, of course.

5 – well, I’m assuming that most of such mechanisms are resistant, of course…

6 – I fully accept that “better-informed” would be better choice of phrase here.

Isolationism – not a 4 letter word (in the cloud)

Things are looking up if you’re interested in protecting your workloads.

In the world of international relations, economics and fiscal policy, isolationism doesn’t have a great reputation. I could go on, I suppose, if I did some research, but this is a security blog[1], and international relations, fascinating area of study though it is, isn’t my area of expertise: what I’d like to do is borrow the word and apply it to a different field: computing, and specifically cloud computing.

In computing, isolation is a set of techniques to protect a process, application or component from another (or a set of the former from a set of the latter). This is pretty much always a good thing – you don’t want another process interfering with the correct workings of your one, whether that’s by design (it’s malicious) or in error (because it’s badly designed or implemented). Isolationism, therefore, however unpopular it may be on the world stage, is a policy that you generally want to adopt for your applications, wherever they’re running.

This is particularly important in the “cloud”. Cloud computing is where you run your applications or processes on shared infrastructure. If you own that infrastructure, then you might call that a “private cloud”, and infrastructure owned by other people a “public cloud”, but when people say “cloud” on its own, they generally mean public clouds, such as those operated by Amazon, Microsoft, IBM, Alibaba or others.

There’s a useful adage around cloud computing: “Remember that the cloud is just somebody else’s computer”. In other words, it’s still just hardware and software running somewhere, it’s just not being run by you. Another important thing to remember about cloud computing is that when you run your applications – let’s call them “workloads” from here on in – on somebody else’s cloud (computer), they’re unlikely to be running on their own. They’re likely to be running on the same physical hardware as workloads from other users (or “tenants”) of that provider’s services. These two realisations – that your workload is on somebody else’s computer, and that it’s sharing that computer with workloads from other people – is where isolation comes into the picture.

Workload from workload isolation

Let’s start with the sharing problem. You want to ensure that your workloads run as you expect them to do, which means that you don’t want other workloads impacting on how yours run. You want them to be protected from interference, and that’s where isolation comes in. A workload running in a Linux container or a Virtual Machine (VM) is isolated from other workloads by hardware and/or software controls, which try to ensure (generally very successfully!) that your workload receives the amount of computing time it should have, that it can send and receive network packets, write to storage and the rest without interruption from another workload. Equally important, the confidentiality and integrity of its resources should be protected, so that another workload can’t look into its memory and/or change it.

The means to do this are well known and fairly mature, and the building blocks of containers and VMs, for instance, are augmented by software like KVM or Xen (both open source hypervisors) or like SELinux (an open source capabilities management framework). The cloud service providers are definitely keen to ensure that you get a fair allocation of resources and that they are protected from the workloads of other tenants, so providing workload from workload isolation is in their best interests.

Host from workload isolation

Next is isolating the host from the workload. Cloud service providers absolutely do not want workloads “breaking out” of their isolation and doing bad things – again, whether by accident or design. If one of a cloud service provider’s host machines is compromised by a workload, not only can that workload possibly impact other workloads on that host, but also the host itself, other hosts and the more general infrastructure that allows the cloud service provider to run workloads for their tenants and, in the final analysis, make money.

Luckily, again, there are well-known and mature ways to provide host from workload isolation using many of the same tools noted above. As with workload from workload isolation, cloud service providers absolutely do not want their own infrastructure compromised, so they are, of course, going to make sure that this is well implemented.

Workload from host isolation

Workload from host isolation is more tricky. A lot more tricky. This is protecting your workload from the cloud service provider, who controls the computer – the host – on which your workload is running. The way that workloads run – execute – is such that such isolation is almost impossible with standard techniques (containers, VMs, etc.) on their own, so providing ways to ensure and prove that the cloud service provider – or their sysadmins, or any compromised hosts on their network – cannot interfere with your workload is difficult.

You might expect me to say that providing this sort of isolation is something that cloud service providers don’t care about, as they feel that their tenants should trust them to run their workloads and just get on with it. Until sometime last year, that might have been my view, but it turns out to be wrong. Cloud service providers care about protecting your workloads from the host because it allows them to make more money. Currently, there are lots of workloads which are considered too sensitive to be run on public clouds – think financial, health, government, legal, … – often due to industry regulation. If cloud service providers could provide sufficient isolation of workloads from the host to convince tenants – and industry regulators – that such workloads can be safely run in the public cloud, then they get more business. And they can probably charge more for these protections as well! That doesn’t mean that isolating your workloads from their hosts is easy, though.

There is good news, however, for both cloud service providers and their teants, which is that there’s a new set of hardware techniques called TEEs – Trusted Execution Environments – which can provide exactly this sort of protection[2]. This is rapidly maturing technology, and TEEs are not easy to use – in that it can not only be difficult to run your workload in a TEE, but also to ensure that it’s running in a TEE – but when done right, they do provide the sorts of isolation from the host that a workload wants in order to maintain its integrity and confidentiality[3].

There are a number of projects looking to make using TEEs easier – I’d point to Enarx in particular – and even an industry consortium to promote open TEE adoption, the Confidential Computing Consortium. Things are looking up if you’re interested in protecting your workloads, and the cloud service providers are on board, too.


1 – sorry if you came here expecting something different, but do stick around and have a read: hopefully there’s something of interest.

2 – the best known are Intel’s SGX and AMD’s SEV.

3 – availability – ensuring that it runs fairly – is more difficult, but as this is a property that is also generally in the cloud service provider’s best interest, and something that can can control, it’s not generally too much of a concern[4].

4 – yes, there are definitely times when it is, but that’s a story for another article.

Coming to you in Japanese

We are now multi-lingual.

I have an exciting announcement, which is that starting this week, some of the articles on this blog will also be in Japanese.  My very talented Red Hat colleague Yuki Kubota showed an interest in translating some which she thought might be of interest to Japanese readers, and I jumped at the chance.  I’m very thrilled and humbled.

We’re still ironing out the process, but hopefully (if you already read Japanese), you’ll be able to read the following articles.

We’ll try to add the tag “Japanese” to each of these, as well.

So, a huge thank you to Yuki: we’d love comments – in English or Japanese!

Timely risk or risky times?

Being aware of “the long game”.

On Friday, 29th November 2019, Jack Merritt and Saskia Jones were killed in a terrorist attack.  A number of members of the public (some with with improvised weapons) and of the emergency services acted with great heroism.  I wanted to mention the mention the names of the victims and to praise those involved in stopping him before mentioning the name of the attacker: Usman Khan.  The victims, the attacker were taking part in an offender rehabilitation conference to help offenders released from prison to reintegrate into society: Khan had been convicted to 16 years in prison for terrorist offences.

There’s an important formula that everyone involved in risk – and given that IT security is all about mitigating risk, that’s anyone involved in security – should know. It’s usually expressed thus:

Risk = likelihood x impact

Sometimes likelihood is sometimes expressed as “probability”, impact as “consequence” or “loss”, and I’ve seen some other variants as well, but the version above is generally sufficient for most purposes.

Using the formula

How should you use the formula? Well, it’s most useful for comparing risks and deciding how to mitigate them. Humans are terrible at calculating risk, and any tools that help them[1] is good.  In order to use this formula correctly, you want to compare risks over the same time period.  You could say that almost any eventuality may come to pass over the lifetime of the universe, but comparing the risk of losing broadband access to the risk of your lead developer quitting for another company between the Big Bang and the eventual heat death of the universe is probably not going to give you much actionable information.

Let’s look at the two variables that we need to have in order to calculate risk.  We’ll start with the impact, because I want to devote most of this article to the other part: likelihood.

Impact is what the damage will be if the risk happens.  In a business context, you want to look at the risk of your order system being brought down for a week by malicious attackers.  You might calculate that you would lose £15,000 in orders.  On top of that, there might be a loss of reputation which you might calculate at £30,000.  Fixing the problem might add £10,000.  Add these together, and the impact is £55,000.

What’s the likelihood?  Well, remember that we need to consider a particular time period.  What you choose will depend on what you’re interested in, but a classic use is for budgeting, and so the length of time considered is often a year.  “What is the likelihood of my order system being brought down for a week by malicious attackers over the next twelve months?” is the question you want to ask.  If you decide that it’s 0.005 (or 0.5%), then your risk is calculated thus:

Risk = 0.005 x 55,000

Risk = 275

The units don’t really matter, because what you want to do is compare risks.  If the risk of your order system being brought down through hardware failure is higher (say 500), then you should probably balance the amount of resources you assign to mitigate these risks accordingly.

Time, reputation, trust and risk

What I’m interested in is a set of rather more complicated risks, however: those associated with human behaviour.  I’m very interested in trust, and one of the interesting things about trust is how we decide to trust people.  One way is by their reputation: if someone keeps behaving well over a long period, then we tend to trust them more – or if badly, then to trust them less[2].  If we trust someone more, our calculation of risk is likely to be strongly based on that trust, as our view of the likelihood of a behaviour at odds with the reputation that person holds will be informed by that.

This makes sense: in the absence of perfect information about humans, their motivations and intentions, our view of risk must be based on something, and reputation is actually a fairly good measure for that.  We might say that the likelihood of a customer defaulting on payment terms reduces year by year as we start to think of them as a “trusted customer”.  As the likelihood reduces, we may decide to increase the amount we lend to them – and thereby the impact of defaulting – to keep the risk about the same, year on year.

The risk here is what is sometimes called “playing the long game”.  Humans sometimes manipulate their reputation, or build up a reputation, in order to perform an action once they have gained trust.  Online sellers my make lots of “good” sales in order to get a 5 star rating over time, only to wait and then make a set of “bad” sales, where they don’t ship goods at all, and then just pocket the money.  Or, they may make many small sales in order to build up a good reputation, and then use that reputation to make one big sale which they have no intention of fulfilling.  Online selling sites are wise to some of these tricks, and have algorithms to try to protect buyers (in fact, the same behaviour can be used by sellers in some cases), but these are not perfect.

I’d like to come back to the London Bridge attack.  In this case, it seems likely that the attacker bided his time over many years, behaving well, and raising his “reputation” among those who knew him – the prison staff, parole board, rehabilitation conference organisers, etc. – so that he had the opportunity to perform one major action at odds with that reputation.  The heroism of those around him stopped him being as successful as he may have hoped, but still at the cost of two innocent lives and several serious injuries.

There is no easy way to deal with such issues.  We need reputation, and we need to allow people to show that they have changed and can be integrated into society, but when we make risk calculations based on reputation in any sphere, we should take care to consider whether actors are playing a long game, and what the possible ramifications would be if they were to act at odds with that reputation.

I noted above that humans are bad at calculating risk, and to follow our example of the non-defaulting customer, one mistake might be to increase the credit we give to that customer beyond the balance of the increase of reputation: actually accepting higher risk than we would have done previously, because we consider them trustworthy.  If we do this, we’ve ceased to use the risk formula, and have started to act irrationally.  Don’t do that.

 


1 – OK, then: “us”.

2 – I’m writing this in the lead up to a UK General Election, and it occurs to me that we actually don’t apply this to most of our politicians.