Organisational suppleness

Growing the ability to react to the unexpected is a valuable skill.

“In preparing for battle I have always found that plans are useless but planning is indispensable.”

Dwight D. Eisenhower

Much of this blog is about security – cybersecurity – in one way or another, but on occasion I do try to take a broader view. Cybersecurity is often modelled or described in military terms, talking about “fighting battles”, “wars of attrition” and “arms races” with “attackers”. These can be useful metaphors (and it’s why I started this article with a quote from a general), but there is a broader set of responsibilities that many of us in the sector need to consider, which is the continued (and hopefully healthy) functioning of our businesses and organisations. In particular, I like to talk about risk and how it relates not just to security, but to how businesses work and plan. One theme that I’ve visited before is that known or planned degradation of a service is often significantly better than failure, or even planned closure (see Service degradation: actually a good thing). My argument is that there are many occasions where keeping a service or business function running, albeit at reduced capacity, or with reductions in known capabilities, allows for better continuity than just stopping it.

Keeping a service running requires work. You can’t just hope that everything is installed and will run as you expect: what happens when your administrator is ill, your fibre-optic cable gets severed by a back hoe, or a DDoS attack is directed at you? You need to plan and practice what to do in these situations. What I’d like to explore in this article goes somewhat beyond the expectation of that planning in three directions. Let’s call them scenario coverage, muscle memory and organisational suppleness.

Scenario coverage

The first, and most obvious of the three directions, is about understanding eventualities. The more scenarios that we model and practice, the more we reduce our risk, simply because we have reduced the number of unknown eventualities in the probability space. There is a actually a side benefit to modelling lost of scenarios, which is that the more situations you consider, the more will come to mind. Every situation involves sets of choices or probabilities – “after the door closes, will it lock or not?” or “if the coolant fails, will the system turn off or burst into flames?” – and the more scenarios you consider, the more questions will arise. This can be daunting – and it’s almost impossible to consider every eventuality – but the more options are covered, the better your opportunities to mitigate the various risks they present.

Muscle memory

Muscle memory is what comes with training and practice. Assuming that you are including your teams in the scenario planning

And I’m assuming here that the planning isn’t solely a paper exercise. Theoretical planning, while useful, only goes so far, for a couple of important reasons:

  • systems will always fails in unexpected ways
  • people will do unexpected things.

What the first of these means is that however much you assume that your back-up generator will kick in if there’s a power outage, until you test it, you can’t be sure that it will. The second of these relates to the fact that however much you tell people what to do, when it actually comes to the doing of it, they’re unlikely to as you expect. This is likely to be even worse if there’s been no training, and you’re just assuming that person X will know how to operate a fire extinguisher, or that team Y will, of course, exit the building in an orderly manner via exit Z (rather than find fourteen different exits, or not even leave the building at all).

For both of these reasons, getting people together to work through possible scenarios, and then, where possible, actually practising what to do, means that you have a higher assurance that when one of the situations you’ve considered does arrive, that they will know what to do, and act as you expect.

Organisational suppleness

While you cannot, as we’ve noted, plan for every eventuality or know exactly how an employee or team will react when things go wrong, there is another benefit to involving a broad group of people in your scenario planning and training. This is that their very involvement gives them practice in dealing with uncertainty, working out how they will react, and giving them experience in how those around them will act. While I may not know exactly what to do if the payroll system goes down an hour before it is due to run, if I have worked with colleagues on scenarios where the sales processing system fails, I’ve got a better chance of making some sensible choices about who to contact, initial steps to take and information to collect than if this is the first time I’ve ever seen anything like it. Likewise, we may not have modelled our response to a physical failure of one of our network links, but our shared experience of practising our response to a DDoS attack means that we have an idea of what to do.

And it is not just having an idea of what to do that is important, but also having gathered and practised the cognitive skills associated with investigating failures, collating data, sharing information and working with others to ameliorate the situation which allows a team or an organisation to respond better to new, maybe unexpected situations. We can think of this as suppleness, as it means that rather than just failing, or cracking, an organisation can react as a tree does to strong winds, or a gymnast does to a new exercise. Growing the ability to react to the unexpected is a valuable skill for an organisation, and knowing that it is supple allows its leaders to plan with more certainty and mitigate more risk.

Trust book – chapter index and summary

I thought it might be interesting to provide the chapter index and a brief summary of each chapter addresses.

In a previous article, I presented the publisher’s blurb for my upcoming book with Wiley, Trust in Computer Systems and the Cloud. I thought it might be interesting, this time around, to provide the chapter index of the book and to give a brief summary of what each chapter addresses.

While it’s possible to read many of the chapters on their own, I haved tried to maintain a logical progression of thought through the book, building on earlier concepts to provide a framework that can be used in the real world. It’s worth noting that the book is not about how humans trust – or don’t trust – computers (there’s a wealth of literature around this topic), but about how to consider the issue of trust between computing systems, or what we can say about assurances that computing systems can make, or can be made about them. This may sound complex, and it is – which is pretty much why I decided to write the book in the first place!

  • Introduction
    • Why I think this is important, and how I came to the subject.
  • Chapter 1 – Why Trust?
    • Trust as a concept, and why it’s important to security, organisations and risk management.
  • Chapter 2 – Humans and Trust
    • Though the book is really about computing and trust, and not humans and trust, we need a grounding in how trust is considered, defined and talked about within the human realm if we are to look at it in our context.
  • Chapter 3 – Trust Operations and Alternatives
    • What are the main things you might want to do around trust, how can we think about them, and what tools/operations are available to us?
  • Chapter 4 – Defining Trust in Computing
    • In this chapter, we delve into the factors which are specific to trust in computing, comparing and contrasting them with the concepts in chapter 2 and looking at what we can and can’t take from the human world of trust.
  • Chapter 5 – The Importance of Systems
    • Regular readers of this blog will be unsurprised that I’m interested in systems. This chapter examines why systems are important in computing and why we need to understand them before we can talk in detail about trust.
  • Chapter 6 – Blockchain and Trust
    • This was initially not a separate chapter, but is an important – and often misunderstood or misrepresented – topic. Blockchains don’t exist or operate in a logical or computational vacuum, and this chapter looks at how trust is important to understanding how blockchains work (or don’t) in the real world.
  • Chapter 7 – The Importance of Time
    • One of the important concepts introduced earlier in the book is the consideration of different contexts for trust, and none is more important to understand than time.
  • Chapter 8 – Systems and Trust
    • Having introduced the importance of systems in chapter 5, we move to considering what it means to have establish a trust relationship from or to a system, and how the extent of what is considered part of the system is vital.
  • Chapter 9 – Open Source and Trust
    • Another topc whose inclusion is unlikely to surprise regular readers of this blog, this chapter looks at various aspects of open source and how it relates to trust.
  • Chapter 10 – Trust, the Cloud, and the Edge
    • Definitely a core chapter in the book, this addresses the complexities of trust in the modern computing environments of the public (and private) cloud and Edge networks.
  • Chapter 11 – Hardware, Trust, and Confidential Computing
    • Confidential Computing is a growing and important area within computing, but to understand its strengths and weaknesses, there needs to be a solid theoretical underpinning of how to talk about trust. This chapter also covers areas such as TPMs and HSMs.
  • Chapter 12 – Trust Domains
    • Trust domains are a concept that allow us to apply the lessons and frameworks we have discussed through the book to real-world situations at large scale. They also allow for modelling at the business level and for issues like risk management – introduced at the beginning of the book – to be considered more explicitly.
  • Chapter 13 – A World of Explicit Trust
    • Final musings on what a trust-centric (or at least trust-inclusive) view of the world enables and hopes for future work in the field.
  • References
    • List of works cited within the book.

In praise of … the Community Manager

I am not – and could never be – a community manager

This is my first post in a while. Since Hanging up my Red Hat I’ve been busy doing … stuff. Stuff which I hope to be able to speak about soon. But in the meantime, I wanted to start blogging regularly again. Here’s my first post back, a celebration of an important role associated with open source projects: the community manager.

Open source communities don’t just happen. They require work. Sometimes the technical interest in an open source project is enough to attract a group of people to get involved, but after some time, things are going to get too big for those with a particular bent (documentation, coding, testing) to manage the interactions between the various participants, moderate awkward (or downright aggressive) communications, help encourage new members to contribute, raise the visibility of the project into new areas or market sectors and all the other pieces that go into keeping a project healthy.

Enter the Community Manager. The typical community manager is in that awkward position of having lots of responsibility, but no direct authority. Open source projects being what they are, few of them have empowered “officers”, and even when there are governance structures, they tend to operate by consent of those involved – by negotiated, rather than direct, authority. That said, by the point a community manager is appointed for a community manager, it’s likely that at least one commercial entity is sufficiently deep into the project to fund or part-fund the community manager position. This means that the community manager will hopefully have some support from at least one set of contributors, but will still need to build consensus across the rest of the community. There may also be tricky times, also, when the community manager will need to decide whether their loyalties lie with their employer or with the community. A wise employer should set expectations about how to deal with such situations before they arise!

What does the community manager need to do, then? The answer to this will depend on a number of issues, and there is likely to be a balance between these tasks, but here’s a list of some that come to mind[1].

  • marketing/outreach – this is about raising visibility of the project, either in areas where it is already known, or new markets/sectors, but there are lots of sub-tasks such as a branding, swag ordering (and distribution!), analyst and press relations.
  • event management – setting up meetups, hackathons, booths at larger events or, for really big projects, organising conferences.
  • community growth – spotting areas where the project could use more help (docs, testing, outreach, coding, diverse and inclusive representation, etc.) and finding ways to recruit contributors to help improve the project.
  • community lubrication – this is about finding ways to keep community members talking to each other, celebrate successes, mourn losses and generally keep conversations civil at least and enthusiastically friendly at best.
  • project strategy – there are times in a project when new pastures may beckon (a new piece of functionality might make the project exciting to the healthcare or the academic astronomy community for instance), and the community manager needs to recognise such opportunities, present them to the community, and help the community steer a path.
  • product management – in conjunction with project strategy, situations are likely to occur when a set of features or functionality are presented to the community which require decisions about their priority or the ability of the community to resource them. These may even create tensions between various parts of the community, including involved commercial interests. The community manager needs to help the community reason about how to make choices, and may even be called upon to lead the decision-making process.
  • partner management – as a project grows, partners (open source projects, academic institutions, charities, industry consortia, government departments or commercial organisations) may wish to be associated with the project. Managing expectations, understanding the benefits (or dangers) and relative value can be a complex and time-consuming task, and the community manager is likely to be the first person involved.
  • documentation management – while documentation is only one part of a project, it can often be overlooked by the core code contributors. It is, however, a vital resource when considering many of the tasks associated with the points above. Managing strategy, working with partners, creating press releases: all of these need good documentation, and while it’s unlikely that the community manager will need to write it (well, hopefully not all of it!), making sure that it’s there is likely to be their responsibility.
  • developer enablement – this is providing resources (including, but not restricted to, documentation) to help developers (particularly those new to the project) to get involved in the project. It is often considered a good idea to separate this set of tasks out, rather than expecting a separate role to that of a community manager, partly because it may require a deeper technical focus than is required for many of the other responsibilities associated with the role. This is probably sensible, but the community manager is likely to want to ensure that developer enablement is well-managed, as without new developers, almost any project will eventually calcify and die.
  • cat herding – programmers (who make up the core of any project) are notoriously difficult to manage. Working with them – particularly encouraging them to work to a specific set of goals – has been likened to herding cats. If you can’t herd cats, you’re likely to struggle as a community manager!

Nobody (well almost nobody) is going to be an expert in all of these sets of tasks, and many projects won’t need all of them at the same time. Two of the attributes of a well-established community manager are an awareness of the gaps in their expertise and a network of contacts who they can call on for advice or services to fill out those gaps.

I am not – and could never be – a community manager. I don’t have the skills (or the patience), and one of the joys of gaining experience and expertise in the world is realising when others do have skills that you lack, and being able to recognise and celebrate what they can bring to your world that you can’t. So thank you, community managers!


1 – as always, I welcome comments and suggestions for how to improve or extend this list.

The importance of process (and people and rules)

If there is no process, you can throw technology at it as much as you want, but you are still likely to fail.

Those of us in Europe awoke to the news that the US electoral college have voted for Joe Biden as 46th President of the United States of America. Getting to this point has seemed (at least from the outside) to be a rather tortuous route, but from my understanding of how the US Constitution works[1], this is it: the process is complete and Joe Biden will be sworn in a President of the United States on a (probably very chilly) day next month, at the beginning of 2021. I have no intention of weighing the pros and cons of the candidates, nor even of examining the process (sometime labelled “arcane” by journalists”) by which US presidents are elected, but I do want to spend some time on the fact that there is a process, and thinking about how that works, and what supports it.

This is, first and foremost, a blog about IT security (though I have been known to post on a much wider range of issues from time to time), and so I unsurprisingly spend quite a lot of time discussing technology, but on this occasion I want to avoid doing that, as far as possible. If we look at the process for electing a US president, one of the most striking things about it, we might note, is the lack of technology. Yes, there are electronic voting machines to allow votes to be cast, yes, a myriad computers are deployed by psephologists[2] to forecast the results, but the actual process is lacking in much that we would normally think of as technology.

We often fixate on technology, but if there is no process in place to get from point A to point B, then you can throw technology at it as much as you want, but you are still likely to fail. Those points may be getting from having no president-elect to having a new president, completing a transaction to buy a house or a paperclip, hiring a new CEO or sous-chef, moving from a set of requirements to a working software program, or literally getting from a point A on a map to point B – they all require a process.

What is a process? Google, courtesy of Oxford Languages, offers the definition: a series of actions or steps taken in order to achieve a particular end. This seems like a useful description, but in the contexts we’re describing, it is the fact that the actions or steps are defined which is important. In the world of computing, we might say that there is an algorithm to be followed to complete the process. This algorithm allows a variety of things, all of which are important:

  1. the writing down and codification of the process;
  2. the allocation of different people to different roles in the process;
  3. norms, rules, regulation and/or legislation to be created to ensure the correct following of the process;
  4. the application of technology to simplify, speed up or automate parts of the process.

I don’t want to talk about point 4 particularly – I spend far too much of my time on that in most of my life – and the ways of achieving point 1 are so diverse as to defy consideration in this context, so let’s briefly discuss points 2 and 3.

Allocating people

If you have a process, you can break that process into steps, you can assign roles and responsibilities to those steps. This is useful in a variety of ways, the first of which is that you can start to scale the process by having different people working on different steps – sometimes in parallel. Imagine having one person having to count all of the votes in the US presidential election, or even having multiple people doing it, but having to do so in series: it might work, but it’s going to take way too long. Another benefit is one on which the Industrial Revolution was built: specialisation. Some people will be good at some parts of the process, and others at other parts of the process. You can increase efficiency by putting those with expertise on the right pieces of the process. A third, unrelated to efficiency, is separation of responsibilities. Sometimes, it’s important that certain people, who are experts or certified to perform a particular role, are the ones who do that. Often, it’s even more important that certain people don’t perform those roles. An example of this would be if one of the candidates in an election was the one to perform the final tally of votes and hand the result to the person making the announcement, or if they made the announcement themselves. This is equally true for other types of process: your bank does not want you to be the person who provides the final approval for your loan, and a company does not want a spouse, partner or family member to be providing sign-off for a hiring decision.

Norms, rules, regulation and legislation

In the UK, we have strong social norms around the process of queuing, and you will be subject to social (and sometimes stronger!) censure if you break them. Rules around other processes may be stronger, and sometimes regulation by an industry body or even legislation at the nation level (or multi-national level such as EU or UN) is required to safeguard the appropriate execution of a process. The ability for courts to intervene where vote-rigging may have taken place is a good example in the US election process, but legislation and regulation around anything from wiring a house to what fertilisers are allowed on particular crops provide additional levels of checking and assurance that processes are following correctly (by including censure or punishment for those who have contravened them) or can be remedied when not (through other processes such as legal review or court cases).

Legislation and regulation can be annoying, but without them (or equivalent rules and norms for other types of process), we cannot be sure of what we are getting into, or whether, if we get into it improperly, that we will ever get out of it. People support and are subject to these checks and balances, and without the combination of all of them (not forgetting the technology as well), processes are next to useless.


1 – I am not a lawyer. Nor a constitutional expert. Nor even a US citizen. Basically, do not take my word for any of this.

2 – I love this word. We should use it more often.

Security, cost and usability (pick 2)

If we cannot be explicit that there is a trade-off, it’s always security that loses.

Everybody wants security: why wouldn’t you? Let’s role-play: you’re a software engineer on a project to create a security product. There comes a time in the product life-cycle when it’s nearly due, and, as usual, time is tight. So you’re in the regular project meeting and the product manager’s there, so you ask them what they want you to do: should you prioritise security? The product manager is very clear[1]: they will tell you that they want the product as secure as possible – and they’re right, because that’s what customers want. I’ve never spoken to a customer (and I’ve spoken to lots of customers over the years) who said that they’d prefer a product which wasn’t as secure as possible. But there’s a problem, which is that all customers also want their products tomorrow – in fact, most customers want their products today, if not yesterday.

Luckily, products can generally be produced more quickly if more resources are applied (though Frederick Brooks’ The Mythical Man Month tells us that simple application of more engineers is actually likely to have a negative impact), so the requirement for speed of delivery can be translated to cost. There’s another thing that customers want, however, and that is for products to be easy to use: who wants to get a new product and then, when it arrives, for it to take months to integrate or for it to be almost impossible for their employees to run it as they expect?

So, to clarify, customers want a security product to be be the following:

  1. secure – security is a strong requirement for many enterprises and organisations[3], and although we shouldn’t ever use the word secure on its own, that’s still what customers want;
  2. cheap – nobody wants to pay more than the minimum they can;
  3. usable – everybody likes simple-to-use, easy-to-integrate applications.

There’s a problem, however, which is that out of the three properties above, you can only choose two for any application or project. You say this to your product manager (who’s always right, remember[1]), and they’ll say: “don’t be ridiculous! I want all three”.

But it just doesn’t work like that: why? Here’s my take on the reasons. Security, simply stated, is designed to stop people doing things. Stated from the point of view of a user, security’s view is to reduce usability. “Doing security” is generally around applying controls to actions in a system – whether by users or non-human entities – and the simplest way to apply it is “blanket security” – defaulting to blocking or denying actions. This is sometimes known as fail to safe or fail to closed.

Let’s take an example: you have a simple internal network in your office and you wish to implement a firewall between your network and the Internet, to stop malicious actors from probing your internal machines and to compromised systems on the internal network from communicating out to the Internet. “Easy,” you think, and set up a DENY ALL rule for connections originating outside the firewall, and a DENY ALL rule for connections originating inside the firewall, with the addition of a ALLOW all outgoing port 443 connections to ensure that people can use web browsers to make HTTPS connections. You set up the firewall, and get ready to head home, knowing that your work is done. But then the problems arise:

  • it turns out that some users would like to be able to send email, which requires a different outgoing port number;
  • sending email often goes hand in hand with receiving email, so you need to allow incoming connections to your mail server;
  • one of your printers has been compromised, and is making connections over port 443 to an external botnet;
  • in order to administer the pay system, your accountant – who is not a full-time employee, and works from home, needs to access your network via a VPN, which requires the ability to accept an incoming connection.

Your “easy” just became more difficult – and it’s going to get more difficult still as more users start encountering what they will see as your attempts to make their day-to-day revenue-generating lives more difficult.

This is a very simple scenario, but it’s clear that in order to allow people actually to use a system, you need to spend a lot more time understanding how security will interact with it, and how people’s experience of the measures you put in place will be impacted. Usability and user experience (“UX”) is a complex field on its own, but when you combine it with the extra requirements around security, things become even more tricky.

You need both to manage the requirements of users to whom the security measures should be transparent (“TLS encryption should be on by default”) and those who may need much more control (“developers need to be able to select the TLS cipher suite options when connecting to a vendor’s database”), so you need to understand the different personae[4] you are targeting for your application. You also need to understand the different failures modes, and what the correct behaviour should be: if authentication fails three times in a row, should the medical professional who is trying to get a rush blood test result be locked out of the system, or should the result be provided, and a message sent to an administrator, for example? There will be more decisions to make, based on what your application does, the security policies of your customers, their risk profiles, and more. All of these investigations and decisions take time, and time equates to money. What is more, they also require expertise – both in terms of security but also usability – and that is in itself expensive.

So, you have three options:

  1. choose usability and cost – you can prioritise usability and low cost, but you won’t be able to apply security as you might like;
  2. choose security and cost – in this case, you can apply more security to the system, but you need to be aware that usability – and therefore your customer’s acceptance of the system – will suffer;
  3. choose usability and security – I wish this was the one that we chose every time: you decide that you’re willing to wait longer or pay more for a more secure product, which people can use.

I’m not going to pretend that these are easy decisions, nor that they are always clear cut. And a product manager’s job is sometimes to make difficult choices – hopefully ones which can be re-balanced in a later release, but difficult choices nevertheless. It’s really important, however, that anyone involved in security – as an engineer, as a UX expert, as a product manager, as a customer – understands the trade-off here. If we cannot be explicit that there is a trade-off, then the trade-off will be made silently, and in my experience, it’s always security that loses.


1 – and right: product managers are always right[2].

2 – I know: I used to be a product manager.

3 – and the main subject of this blog, so it shouldn’t be a surprise that I’m writing about it.

4 – or personas if you really, really must. I got an “A” in Latin O level, and I’m not letting this one go.

In praise of triage

It’s all too easy to prioritise based on the “golfing test”.

Not all bugs are created equal.

Some bugs need fixing now, some bugs can wait. Some bugs are in your implementation, some are in the underlying design. Some bugs will annoy a few customers, some will destroy your business.

Bugs come in all shapes and sizes, and one of the tasks of a product owner, product manager, chief architect – whoever makes the call about where to assign resources – is to decide which ones to address in which order: to prioritise them. The problem is deciding how to prioritise them. It’s all too easy to prioritise based on the “golfing test”: your CEO meets someone on golf course who mentions that his or her company loves your product, except for one tiny issue. The CEO comes back, and makes it clear that fixing this “major bug” is now your one and only task until it’s done, and your world is turned upside down. You have to fix the bug as quickly as possible, with no thought to the impact it has on the rest of the project, or the immense pile technical debt that’s just been accrued. You don’t want to live in this world. What, then, is the alternative?

The answer – though it’s only the beginning of the answer – is triage. Triage (from the French for “separating out”) comes from the world of battlefield medicine. When deciding which wounded soldiers to treat, rapid (hopefully objective) assessments are carried out, allowing a quick sorting of each soldier, typically into categories such as “not urgent: wait”, “urgent: treat immediately” and “not saveable: do not treat”. We can apply the same to software bugs in order to decide what to treat (fix) and with what priority. The important thing is not so much the categories – which will vary based on your context – but the assessment criteria, and how they are applied. Here are a list of just some of the possible criteria:

  • likely monetary impact per customer
  • number of customers impacted
  • reputational impact on your organisation
  • ease to fix
  • impact on system security
  • impact on system performance
  • impact on system stability
  • annoyance of CEO not to be listened to.

We do not, of course, only need to apply one of these: a number of them can be combined with a weighting system, though the more you add, the less clear your priorities will be, and the more likely it is that someone will “put a finger on the scales” – tweak the numbers to give the outcome they want. Another important point about the categories that you decide to apply is that they should be as measurable as you can make them, to allow as objective scoring as possible. I wrote a review of the book Building Evolutionary Architectures a while ago: the methodology adopted there, where you measure and test in order to meet specific criteria, is exactly the sort of approach you should be choosing when designing your triage system.

This is (ostensibly) a blog about security, and so you might expect me to say that “security always wins”, but that should absolutely not be the approach you take. Security might be the most important category for you (that is, carry the most weight), but you need to understand why that is the case – at this particular time – and what exactly you mean by “security”. The “security of the system” is not an objective measure: in order to mean anything, such a phrase needs to reference measurements that can be made (“resistance to physical tampering”, “resistance to brute force attacks”, “number or PhD students likely to be needed to reverse engineer our ‘secure’ protocol”[1]). More importantly, it may be that at this point in your organisation’s life, the damage done by lack of stability or decreased performance outweighs the impact of a security bug. If that’s the case, then your measurements should encapsulate that information and lead you to prioritise bugs with impact in these categories over security issues[3].

There’s one proviso that I feel I need to put in at this point, and it’s about the power of what, in Agile Methodology terms, is called the Product Owner. This is the person who represents the users of the product/project, and should have final say about the direction of development in terms of features, functionality and, most relevant in this discussion, bug-fixing. As noted above, this may be an architect, product manager or someone enjoying another title, but their role should be clear: they get to call the shots. There are times when this person goes against the evidence provided by the triage, and makes a decision to prioritise a particular bug over others despite the outcome of the measurements. This is typically very painful for the technical team[4], but, when it comes down to it, as the product owner, they get to decide. The technical team – after appropriate warnings and discussion[5] – must be ready to step aside and accept the decision. Such decisions (and related discussions) should be recorded, and the product owner must be ready to stand or fall based on the outcome, but that is their job. Triage is a guide, and there are occasions when there are measurements which cannot be easily made objectively, and which sit outside the expertise or scope of knowledge of the technical team. If this sort of decision keeps being made, and you think you know better, you may have a future in technical product management, where people with a view of both the technical and the business side of technology are much in demand. In the end, though, the product owner will need to justify their decision to management, and if they get it wrong, then they must be ready to take the blame (this is one reason why you should make sure that you’ve recorded the process taken to get to this decision – you don’t want to take the blame for a poor decision which you advised against).

So: go out an design a triage process, be ready to follow it, and be ready to defend it. Oh, and one last point: you might want to buy a set of golf clubs.

—–

1 – this last one is a joke: don’t design your own protocol, or if you do, make it open and have it peer-reviewed[2].

2 – and then throw it away and use an open source implementation of better, more thoroughly-reviewed one.

3 – much as it pains me to say it.

4 – I’ve been on both sides of these decisions: I know.

5 -often rather heated, in my experience.

Are you positive?

What do pregnancy tests and the Ukrainian aircraft missile strike have in common?

Not everything in life is nicely binary, much as we[1] might like it to be. There are shades of grey[2] in many aspects of life, and though humans can often cope with uncertainty, computer systems are less good at it: they generally want a “yes” or “no” answer. This means that decisions sometimes need to be made on incomplete evidence, and, well, that means that the answers aren’t always correct. There’s a whole area of computer science related to this: fuzzy logic.

Let’s look into what the options are. Assuming that we’re looking two options: “yes” (a positive) and “no” (a negative). That means that there are two ways in which the answer can be incorrect:

  1. a “yes” answer was incorrectly chosen (false positive);
  2. a “no” answer was incorrectly chosen (false negative).

An example to allow us to explore this is pregnancy. It’s generally agreed that you can’t be a little bit pregnant: if you take a test, any result it gives you needs to be either positive or negative. If you are pregnant, and a test result comes back negative, then that’s a false negative. If you are not pregnant, and a test comes back positive, that’s a false positive. The implications of a false positive or a false negative can both be pretty major – as anybody who has received one will tell you. I spent a little time online trying to find expected false positive and false negatives for pregnancy tests, but it turns out that the rates are so dependent on a variety of factors that it was difficult to find a sensible answer[3].

A tragic recent example of a false positive took place on Wednesday, 8th January 2020, when a Ukrainian International Airlines flight was shot down by an Iranian missile, killing all 176 people on board. It appears that an air defence radar system misidentified the aircraft as a cruise missile. As the radar system was looking for a positive identification of a threat, this can be counted as a false positive.

What might have been the alternative in this case? If the aircraft actually had been a cruise missile, but was identified as a civilian aircraft, this would have been a false negative, and the impact might well have been significant damage to an Iranian military installation.

Which is the most damaging? Well, in the case of the aircraft, it would seem pretty clear to most observers that the false positive would be worse, but from a military point of view, that might not be the case. Maybe the impact of a missile strike on a major military installation might be considered worse than the civilian loss of life in the other case. In this case, as in many others, a decision needs to be made as to which is most important to reduce: the chance of a false negative or the chance of a false positive? In a perfect world, of course, there would be no false results, negative or positive. The problem with many systems that take analogue[4] inputs and turn them into digital outputs in this way is that avoiding false results is very costly, and sometimes impossible. Even worse news is that reducing probability of one of the two types of false result tends to increase the probability of the other.

A classic example of this is in the use of biometrics for user identification. Fingerprints, facial recognition, iris scanning and similar techniques have to balance the likelihood of a false positive with a false negative. Which is worse: the chance that the CEO will not be able to update the payroll details, or that a rogue employee will update her details to improve her salary package?[5]

One good piece of news is that AI/ML (Artificial Intelligence/Machine Learning) is improving the performance of biometric systems and, in fact, other areas of computing where “fuzzy logic” is required. In most cases, humans are still better at reducing messy sets of information to yes/no results, but that is changing, and where multiple automated decisions need to be made, then AI/ML is worth considering.

Whenever you are dealing with “messy” data[6] which needs to be reduced to a “yes/no” or “positive/negative” binary result, you need to consider the likelihood of false positives or negatives. Not only do you need to consider the likelihood of each, but also the impact of each. Once you have understood these, you can then decide which you want to try to minimise, and what techniques you should use to do so.

We may be stuck with false results, but we need to understand what our choices are, and how we can get the best outcomes available from messy data.


1 – in talking security, but I’m sure this goes for lots of other people, too.

2. “gray” for our non-Commonwealth readers.

3. good advice seems to be to test several times over several days.

4. “analog”, I suppose – see [2].

5. this is one of the reasons that authentication systems generally use two factors from the three “something you are”, “something you know”, “something you have”.

6. most real-world data, to be honest.

Will quantum computing break security?

Do you want J. Random Hacker to be able to pretend that they’re your bank?

Over the past few years, a new type of computer has arrived on the block: the quantum computer.  It’s arguably the sixth type of computer:

  1. humans – before there were artificial computers, people used, well, people.  And people with this job were called “computers”.
  2. mechanical analogue – devices such as the Antikythera mechanism, astrolabes or slide rules.
  3. mechanical digital – in this category I’d count anything that allowed discrete mathematics, but didn’t use electronics for the actual calculation: the abacus, Babbage’s Difference Engine, etc.
  4. electronic analogue – many of these were invented for military uses such as bomb sights, gun aiming, etc.
  5. electronic digital – I’m going to go out on a limb here, and characterise Colossus as the first electronic digital computer[1]: these are basically what we use today for anything from mobile phones to supercomputers.
  6. quantum computers – these are coming, and are fundamentally different to all of the previous generations.

What is quantum computing?

Quantum computing uses concepts from quantum mechanics to allow very different types of calculations to what we’re used to in “classical computing”.  I’m not even going to try to explain, because I know that I’d do a terrible job, so I suggest you try something like Wikipedia’s definition as a starting point.  What’s important for our purposes is to understand that quantum computers use qubits to do calculations, and for quite a few types of mathematical algorithms – and therefore computing operations – they can solve problems much faster than classical computers.

What’s “much faster”?  Much, much faster: orders of magnitude faster.  A calculation that might take years or decades with a classical computer could, in certain circumstances, take seconds.  Impressive, yes?  And scary.  Because one of the types of problems that quantum computers should be good at solving is decrypting encrypted messages, even without the keys.

This means that someone with a sufficiently powerful quantum computer should be able to read all of your current and past messages, decrypt any stored data, and maybe fake digital signatures.  Is this a big thing?  Yes.  Do you want J. Random Hacker to be able to pretend that they’re your bank[2]?  Do you want that transaction on the blockchain where you were sold a 10 bedroom mansion in Mayfair to be “corrected” to be a bedsit in Weston-super-Mare[3]?

Some good news

This is all scary stuff, but there’s good news, of various types.

The first is that in order to make any of this work at all, you need a quantum computer with a good number of qubits operating, and this is turning out to be hard[4].  The general consensus is that we’ve got a few years before anybody has a “big” enough quantum computer to do serious damage to classical encryption algorithms.

The second is that, even with a sufficient number of qubits to attacks our existing algorithms, you still need even more in order to allow for error correction.

The third is that although there are theoretical models to show how to attack some of our existing algorithms, actually making them work is significantly harder than you or I[5] might expect.  In fact, some of the attacks may turn out to be infeasible, or just take more years to perfect that we’d worried about.

The fourth is that there are clever people out there who are designing quantum-computation resistant algorithms (sometimes referred to as “post-quantum algorithms”) that we can use, at least for new encryption, once they’ve been tested and become widely available.

All-in-all, in fact, there’s a strong body of expert opinion that says that we shouldn’t be overly worried about quantum computing breaking our encryption in the next 5 or even 10 years.

And some bad news

It’s not all rosy, however.  Two issues stick out to me as areas of concern.

  1. People are still designing and rolling out systems which don’t consider the issue.  If you’re coming up with a system which is likely to be in use for ten or more years, or which will be encrypting or signing data which must remain confidential or attributable over those sorts of periods, then you should be considering what the possible impact of quantum computing may have on your system.
  2. some of the new, quantum-computing resistant algorithms are proprietary.  This means that when you and I want to start implementing systems which are designed to be quantum-computing resistant, we’ll have to pay to do so.  I’m a big proponent of open source, and particularly of open source cryptography, and my big worry is that we just won’t be able to open source these things, and worse, that when new protocol standards are created – either de facto or through standards bodies – they will choose proprietary algorithms that exclude the use of open source, whether on purpose, through ignorance, or because few good alternatives are available.

What to do?

Luckily, there are things you can do to address both of the issues above.  The first is to think and plan, when designing a system, about what the impact of quantum computing might be on it.  Often – very often – you won’t actually need to implement anything explicit now (and it could be hard to, given the current state of the art), but you should at least embrace the concept of crypto-agility: designing protocols and systems so that you can swap out algorithms if required[7].

The second is a call to arms: get involved in the open source movement, and encourage everybody you know who has anything to do with cryptography to rally for open standards and for research into non-proprietary, quantum-computing resistant algorithms.  This is something that’s very much on my to-do list, and an area where pressure and lobbying is just as important as the research itself.


1 – I think it’s fair to call it the first electronic, programmable computer.  I know there were earlier non-programmable ones, and that some claim ENIAC, but I don’t have the space or the energy to argue the case here.

2 – no.

3 – see [2].  Don’t get me wrong, by the way – I grew up near Weston-super-Mare, and it’s got things going for it, but it’s not Mayfair.

4 – and if a quantum physicist says that something’s hard, then, to my mind, it’s hard.

5 – and I’m assuming that neither of us is a quantum physicist or mathematician[6].

6 – I’m definitely not.

7 – and not just for quantum-computing reasons: there’s a good chance that some of our existing classical algorithms may just fall to other, non-quantum attacks such as new mathematical approaches.

Mitigate or remediate?

What’s the difference between mitigate and remediate?

I very, very nearly titled this article “The ‘aters gonna ‘ate”, and then thought better of it. This is a rare event, and I put it down to the extreme temperatures that we’re having here in the UK at the moment[1].

What prompted this article was reading something online today where I saw the word mitigate, and thought to myself, “When did I start using that word? It’s not a normal one to drop into conversation. And what about remediate? What’s the difference between mitigate and remediate? In fact, how well could I describe the difference between the two?” Both are quite jargon-y words, so, in the spirit of my recent article Jargon – a force for good or ill? here’s my attempt at describing the difference, but also pointing out how important both are – along with a couple of other “-ate” words.

Let’s work backwards.

Remediate

Remediation is a set of actions to get things back the way they should be. It’s the last step in the process of recovery from an attack or other failure. The reprefix here is the give -away: like resetting and reconciliation. When you’re remediating, there may be an expectation that you’ll be returning your systems to the same state they were before, for example, power failure, but that’s not necessarily the case. What you should be focussing on is the service you’re providing, rather than the system(s) that are providing it. A set of steps for remediation might require you to replace your database with another one from a completely different provider[3], to add a load-balancer and to change all of your hardware, but you’re still remediating the problem that hit you. At the very least, if you’ve suffered an attack, you should make sure that you plug any hole that allowed the attacker in to start with[4].

Mitigate

Mitigation isn’t about making things better – it’s about reducing the impact of an attack or failure. Mitigation is the first set of steps you take when you realise that you’ve got a problem. You could argue that these things are connected, but mitigation isn’t about returning the service to normal (remediation), but about taking steps to reduce the impact of an attack. Those mitigations may be external – adding load-balancers to help deal with a DDoS attack, maybe – or internal – shutting down systems that have been actively compromised.

In fact, I’d argue that some mitigations might quite properly actually have an adverse effect on the service: there may be short term reductions in availability to ensure that long-term remediations can be performed. Planning for this is vitally important, as I’ve discussed previously, in Service degradation: actually a good thing.

The other -ates: update and operate

As I promised above, there are a couple of other words that are part of your response to an attack – or possible attack. The first is update. Updating is one of the key measures that you can put in place to reduce the chance of a successful attack. It’s the step you take before mitigation – because, if you’re lucky, you won’t need mitigation, because you’ll be immune from attack[5].

The second of these is operate. Operation is your normal state: it’s where you want to be. But operation doesn’t mean that you can just sit back and feel secure: you need to be keeping an eye on what’s going on, planning upgrades, considering mitigations and preparing for remediations. We too often think of the operate step as our Happy Place, where all is rosy, and we can sit back and watch the daisies grow. As DevOps (and DevSecOps) is teaching us, this is absolutely not the case: operate is very much a dynamic state, and not a passive one.


1 – 30C (~86F) is hot for the UK. Oh, and most houses (ours included) don’t have air conditioning[2].

2 – and I work from an office in the garden. Direct sunlight through the mainly glass wall is a blessing in the winter. Less so in the summer.

3 – hopefully open source, of course.

4 – hint: patch, patch, patch.

5 – well, that attack, at least. You’re never immune from all attacks: sorry.

7 steps to security policy greatness

… we’ve got a good chance of doing the right thing, and keeping the auditors happy.

Security policies are something that everybody knows they should have, but which are deceptively simple to get wrong.  A set of simple steps to check when you’re implementing them is something that I think it’s important to share.  You know when you come up a set of steps, you suddenly realise that you’ve got a couple of vowels in there and you think, “wow, I can make an acronym!”?[1] This was one of those times.  The problem was that when I looked at the different steps, I decided that DDEAVMM doesn’t actually have much of a ring to it.  I’ve clearly still got work to do before I can name major public sector projects, for instance[2].  However, I still think it’s worth sharing, so let’s go through them in order.  Order, as for many sets of steps, is important here.

I’m going to give an example and walk through the steps for clarity.  Let’s say that our CISO, in his or her infinite wisdom, has decided that they don’t want anybody outside our network to be able to access our corporate website via port 80.  This is the policy that we need to implement.

1. Define

The first thing I need is a useful definition.  We nearly have that from the request[3] noted above, when our CISO said “I don’t want anybody outside our network to be able to access our corporate website via port 80”.  So, let’s make that into a slightly more useful definition.

“Access to IP address mycorporate.network.com on port 80 must be blocked to all hosts outside the 10.0.x.x to 10.3.x.x network range.”

I’m assuming that we already know that our main internal network is within 10.0.x.x to 10.3.x.x.  It’s not exactly a machine readable definition, but actually, that’s the point: we’re looking for a definition which is clear and human understandable.  Let’s assume that we’re happy with this, at least for now.

2. Design

Next, we need to design a way to implement this.  There are lots of ways of doing this – from iptables[4] to a full-blown, commercially supported firewall[6] – and I’m not fluent in any of them these days, so I’m going to assume that somebody who is better equipped than I has created a rule or set of rules to implement the policy defined in step 1.

We also need to define some tests – we’ll come back to these in step 5.

3. Evaluate

But we need to check.  What if they’ve mis-implemented it?    What if they misunderstood the design requirement?  It’s good practice – in fact, it’s vital – to do some evaluation of the design to ensure it’s correct.  For this rather simple example, it should be pretty to check by eye, but we might want to set up a test environment to evaluate that it meets the policy definition or take other steps to evaluate its correctness.  And don’t forget: we’re not checking that the design does what the person/team writing it thinks it should do: we’re checking that it meets the definition.  It’s quite possible that at this point we’ll realise that the definition was incorrect.  Maybe there’s another subnet – 10.5.x.x, for instance – that the security policy designer didn’t know about.  Or maybe the initial request wasn’t sufficiently robust, and our CISO actually wanted to block all access on any port other than 443 (for HTTPS), which should be allowed.  Now is a very good time to find that out.

4. Apply

We’ve ascertained that the design does what it should do – although we may have iterated a couple of times on exactly “what it should do” means – so now we can implement it.  Whether that’s ssh-ing into a system, uploading a script, using some sort of GUI or physically installing a new box, it’s done.

Excellent: we’re finished, right?

No, we’re not.  The problem is that often, that’s exactly what people think.  Let’s move to our next step: arguably the most important, and the most often forgotten or ignored.

5. Validate

I really care about this one: it’s arguably the point of this post.  Once you’ve implemented a security policy, you need to validate that it actually does what it’s supposed to do.  I’ve written before about this, in my post If it isn’t tested, it doesn’t work, and it’s absolutely true of security policy implementations.  You need to check all of the different parts of the design.  This, you might hope, would be really easy, but even in the simple case that we’re working with, there are lots of tests you should be doing.  Let’s say that we took the two changes mentioned in step 3.  Here are some tests that I would want to be doing, with the expected result:

  • FAIL: access port 80 from an external IP address
  • FAIL: access port 8080 from an external IP address
  • FAIL: access port 80 from a 10.4.x.x address
  • PASS: access port 443 from an external IP address
  • UNDEFINED: access port 80 from a 10.5.x.x
  • UNDEFINED: access port 80 from a 10.0.x.x address
  • UNDEFINED: access port 443 from a 10.0.x.x address
  • UNDEFINED: access port 80 from an external IP address but with a VPN connection into 10.0.x.x

Of course, we’d want to be performing these tests on a number of other ports, and from a variety of IP addresses, too.

What’s really interesting about the list is the number of expected results that are “UNDEFINED”.  Unless we have a specific requirement, we just can’t be sure what’s expected.  We can guess, but we can’t be sure.  Maybe we don’t care?  I particularly like the last one, because the result we get may lead us much deeper into our IT deployment than we might expect.

The point, however, is that we need to check that the actual results meet our expectations, and maybe even define some new requirements if we want to remove some of the “UNDEFINED”.  We may be fine to leave some the expected results as “UNDEFINED”, particularly if they’re out of scope for our work or our role.  Obviously, if the results don’t meet our expectations, then we also need to make some changes and apply them and then re-validate.  We also need to record the final results.

When we’ve got more complex security policy – multiple authentication checks, or complex SDN[7] routing rules – then our tests are likely to be much, much more complex.

6. Monitor

We’re still not done.  Remember those results we got in our previous tests?  Well, we need to monitor our system and see if there’s any change.  We should do this on a fairly frequent basis.  I’m not going to say “regular”, because regular practices can lead to sloppiness, and also leave windows of opportunities open to attackers.  We also need to perform checks whenever we make a change to a connected system.  Oh, if I had a dollar for every time I’ve heard “oh, this won’t affect system X at all.”…[8]

One interesting point is that we should also note when results whose expected value remains “UNDEFINED” change.  This may be a sign that something in our system has changed, it may be a sign that a legitimate change has been made in a connected system, or it may be a sign of some sort of attack.  It may not be quite as important as a change in one of our expected “PASS” or “FAIL” results, but it certainly merits further investigation.

7. Mitigate

Things will go wrong.  Some of them will be our fault, some of them will be our colleagues’ fault[10], some of them will be accidental, or due to hardware failure, and some will be due to attacks.  In all of these cases, we need to act to mitigate the failure.  We are in charge of this policy, so even if the failure is out of our control, we want to make sure that mitigating mechanism are within our control.  And once we’ve completed the mitigation, we’re going to have to go back at least to step 2 and redesign our implementation.  We might even need to go back to step 1 and redefine what our definition should be.

Final steps

There are many other points to consider, and one of the most important is the question of responsibility, touched on in step 7 (and which is particularly important during holiday seasons), special circumstances and decommissioning, but if we can keep these steps in mind when we’re implementing – and running – security policies, we’ve got a good chance of doing the right thing, and keeping the auditors happy, which is always worthwhile.


1 – I’m really sure it’s not just me.  Ask your colleagues.

2 – although, back when Father Ted was on TV, a colleague of mine and I came up with a database result which we named a “Fully Evaluated Query”.  It made us happy, at least.

3 – if it’s from the CISO, and he/she is my boss, then it’s not a request, it’s an order, but you get the point.

4 – which would be my first port[5] of call, but might not be the appropriate approach in this context.

5 – sorry, that was unintentional.

6 – just because it’s commercially support doesn’t mean it has to be proprietary: open source is your friend, boys and girls, ladies and gentlemen.

7 – Software-Defined Networking.

8 – I’m going to leave you hanging, other than to say that, at current exchange rates, and assuming it was US dollars I was collecting, then I’d have almost exactly 3/4 of that amount in British pounds[9].

9 – other currencies are available, but please note that I’m not currently accepting bitcoin.

10 – one of the best types.