Zero trust and Confidential Computing

Confidential Computing can provide two properties which are excellent starting points for zero/explicit trust.

I’ve been fairly scathing about “zero trust” before – see, for instance, my articles Thinking beyond “zero-trust” and “Zero-trust”: my love/hate relationship – and my view of how the industry talks about it hasn’t changed much. I still believe, in particular, that:

  1. the original idea, as conceived, has a great deal of merit;
  2. few people really understand what it means;
  3. it’s become an industry bandwagon that is sorely abused by some security companies;
  4. it would be better called “explicit trust”.

The reason for this last is that it’s impossible to have zero trust: any entity or component has to have some level of trust in the other entities/components with which it interacts. More specifically, it has to maintain trust relationships – and what they look like, how they’re established, evaluated, maintained and destroyed is the core point of discussion of my book Trust in Computer Systems and the Cloud. If you’re interested in a more complete and reasoned criticism of zero trust, you’ll find that in Chapter 5: The Importance of Systems.

But, as noted above, I’m actually in favour of the original idea of zero trust, and that’s why I wanted to write this article about how zero trust and Confidential Computing, when combined, can actually provide some real value and improvements over standard distributed architectures (particularly in the Cloud).

An important starting point, however, is to note that I’ll be using this definition of Confidential Computing:

Confidential Computing is the protection of data in use by performing computation in a hardware-based, attested Trusted Execution Environment.

Confidential Computing Consortium, https://confidentialcomputing.io/about/

Confidential Computing, as thus described, can provide two properties which are excellent starting points for components wishing to exercise zero/explicit trust, which we’ll examine individually:

  1. isolation from the host machine/system, particularly in terms of confidentiality of data;
  2. cryptographically verifiable identity.

Isolation

One of the main trust relationships that any executing component must establish and maintain is with the system that is providing the execution capabilities – the machine on which it is running (or virtual machine – but that presents similar issues). When you say that your component has “zero trust”, but it has to trust the host machine on which it is running to maintain the confidentiality of the code and/or data associated with the component, then you have to accept that you do actually have an enormous trust relationship: with the machine and whoever administers/controls it (and that includes anyone who may have compromised it). This can hardly form the basis for a “zero trust” architecture – but what can be done about it?

Where Confidential Computing helps is by allowing isolation from the machine which is doing the execution. The component still needs to trust the CPU/firmware that’s providing the execution context – something needs to run the code, after all! – but we can shrink the number of trust relationships required significantly, and provide cryptographic assurances on which to base those relationships (see Attestation, below).

Knowing that a component is isolated from other components gives it assurances about how it will operate, and also allows those other components to build trust relationships with it in the knowledge that it is acting with its own agency, rather than under that of a malicious actor.

Attestation

Attestation is the mechanism by which an entity can receive assurances that a Confidential Computing component has been correctly set up and can provide the expected properties of data confidentiality and integrity and code integrity (and, in some cases, code confidentiality). These assurances are bound cryptographically to a particular Confidential Computing component (and the Trusted Execution Environment in which it executes), which allows another property to be provided as well: a unique identity. If the attesting service binds this identity cryptographically to the Confidential Computing component by means of, for instance, a standard X.509 certificate, then this can provide one of the bases for trust relationships both to and from the component.
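
To make that a little more concrete, here is a minimal sketch – not any real vendor’s attestation protocol, and with all types, field names and checks invented for illustration – of the kind of decision a relying party makes before trusting the identity bound to a TEE:

```rust
// Illustrative only: a relying party checks that an attestation report's
// measurement matches an expected value before trusting the identity bound
// to it. All names here are hypothetical.

struct AttestationReport {
    measurement: [u8; 32],   // hash of the code/data loaded into the TEE
    tee_public_key: Vec<u8>, // key generated inside the TEE
    signature: Vec<u8>,      // signature by the CPU/firmware vendor's key
}

fn verify_report(report: &AttestationReport, expected_measurement: &[u8; 32]) -> bool {
    // 1. Verify the vendor signature over the report (elided here - a real
    //    implementation would call into a crypto library at this point).
    let vendor_signature_ok = !report.signature.is_empty();

    // 2. Check that the measured workload is the one we expect to talk to.
    let measurement_ok = &report.measurement == expected_measurement;

    // If both checks pass, the relying party can bind report.tee_public_key
    // to this workload - for instance by issuing an X.509 certificate for it.
    vendor_signature_ok && measurement_ok
}
```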

Establishing a “zero trust” relationship

These properties allow zero (or “explicit”) trust relationships to be established with components that are operating within a Confidential Computing environment, and to do so in ways which have previously been impossible. Using classical computing approaches, any component is at the mercy of the environment within which it is executing, meaning that any trust relationship that is established to it is equally with the environment – that is, the system that is providing its execution environment. This is far from a zero trust relationship, and is also very unlikely to be explicit!

In a Confidential Computing environment, components can have a small number of trust relationships which are explicitly noted (typically these include the attestation service, the CPU/firmware provider and the provider of the executing code), allowing for a much better-defined trust architecture. It may not be exactly “zero trust”, but it is, at least, heading towards “minimal trust”.
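
One way to think about this is that the trust relationships become something you can write down, review and audit. The sketch below is purely illustrative – the names, and the idea of listing trust anchors as data, are mine, not part of any particular framework:

```rust
// A hedged sketch of making trust relationships explicit rather than implicit:
// the workload's small set of trust anchors is recorded as data that can be
// reviewed and logged. All names are illustrative.

#[derive(Debug)]
enum TrustAnchor {
    AttestationService { url: &'static str },
    CpuFirmwareVendor { name: &'static str },
    WorkloadProvider { name: &'static str },
}

fn explicit_trust_anchors() -> Vec<TrustAnchor> {
    vec![
        TrustAnchor::AttestationService { url: "https://attestation.example" },
        TrustAnchor::CpuFirmwareVendor { name: "ExampleSilicon" },
        TrustAnchor::WorkloadProvider { name: "ExampleCorp" },
    ]
}

fn main() {
    // Anything not in this list is, by design, untrusted.
    for anchor in explicit_trust_anchors() {
        println!("trusting: {anchor:?}");
    }
}
```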

Functional vs non-functional requirements: a dangerous dichotomy?

Non-functional requirements are at least as important as functional requirements.

Imagine you’re thinking about an application or a system: how would you describe it? How would you explain what you want it to do? Most of us, I think, would start with statements like:

  • it should read JPEGs and output SVG images;
  • it should buy the stocks I tell it to when they reach a particular price;
  • it should take a customer’s credit history and decide whether to approve a loan application;
  • it should ensure that the car maintains a specific speed unless the driver presses the brakes or disengages the system;
  • it should level me up when I hit 10,000 gold bars mined;
  • it should take a prompt and output several hundred words about a security topic that sound as if I wrote them;
  • it should strike out any text which would give away its non-human status.

These are all requirements on the system. Specifically, they are functional requirements: they are things that an application or a system should do based on the state of inputs and outputs to which it is exposed.

Now let’s look at another set of requirements: requirements which are important to the correct operation of the system, but which aren’t core to what it does. These are non-functional requirements, in that they don’t describe the functions it performs, but its broader operation. Here are some examples:

  • it should not leak cryptographic keys if someone performs a side-channel attack on it;
  • it should be able to be deployed on premises or in the Cloud;
  • it should be able to manage 30,000 transactions a second (a requirement made concrete in the sketch after this list);
  • it should not slow or stop a user’s phone from receiving a phone call when it is running;
  • it should not fail catastrophically, but degrade its performance gracefully under high load;
  • it should be allowed to empty the bank accounts of its human masters;
  • it should recover from unexpected failures, such as its operator switching off the power in a panic on seeing unexpected financial transactions.
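
Non-functional requirements like the throughput one above can, and should, be made testable. Here is a hedged sketch – `process_transaction` is a hypothetical stand-in for the real system, and a production benchmark would use a proper load-testing harness under realistic conditions rather than a unit test – of what checking the 30,000-transactions-a-second requirement might look like:

```rust
// A sketch of turning a non-functional (throughput) requirement into an
// automated check. Everything named here is a placeholder.

use std::time::{Duration, Instant};

fn process_transaction(id: u64) -> u64 {
    // Placeholder for real work.
    id.wrapping_mul(31).wrapping_add(7)
}

#[test]
fn meets_throughput_requirement() {
    let window = Duration::from_secs(1);
    let start = Instant::now();
    let mut completed: u64 = 0;

    while start.elapsed() < window {
        // black_box stops the compiler optimising the call away.
        std::hint::black_box(process_transaction(completed));
        completed += 1;
    }

    assert!(
        completed >= 30_000,
        "only {completed} transactions completed in one second"
    );
}
```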

You may notice that some of the non-functional requirements are expressed as negatives – “it should not…”. This is fairly common for non-functional requirements; functional requirements are sometimes expressed in the negative too, but that is rarer.

So now we come to the important question, and the core of this article: which of the above lists is more important? Is it the list with the functional requirements or the non-functional requirements? I think that there’s a fair case to be made for the latter: the non-functional requirements. Even if that’s not always the case, my (far too) many years of requirements gathering (and requirements meeting) lead me to note that while there may be a core set of functional requirements that typically are very important, it’s very easy for a design, architecture or specification to collect more and more functional requirements which pale into insignificance against some of the non-functional requirements that accrue.

But the problem is that non-functional requirements are almost always second-class citizens when compared to functional requirements on an application or system. They are often collected after the functional requirements – if at all – and are often the first to be discarded when things get complicated. They also typically require input from people with skill sets outside the context of the application or system: for instance, it may not be obvious to the designer of a back-end banking application that they need to consider data-in-use protection (such as Confidential Computing) when they are collecting requirements for an application which will initially be run in an internal data centre.

Agile and DevOps methodologies can be relevant in these contexts, as well. On the one hand, involving the people who will be operating an application or system early in its design is likely to focus minds on some of the non-functional requirements which might impact them if they are not considered early enough. On the other hand, however, a model of development where the key performance indicator is having something that runs means that the functional requirements are foregrounded (“yes, you can log in – though we’re not actually checking passwords yet…”).

What’s the take-away from this article? It’s to consider non-functional requirements as at least as important as functional requirements. Alongside that, it’s vital to be aware that the people in charge of designing, architecting and specifying an application or system may not be best placed to collect all of the broader requirements that are, in fact, core to its safe and continuing (business critical) operation.

Defining the Edge

How might we differentiate Edge computing from Cloud computing?

This is an edited excerpt from my forthcoming book on Trust in Computing and the Cloud for Wiley.

There’s been a lot of talk about the Edge, and almost as many definitions as there are articles out there. As usual on this blog, my main interest is around trust and security, so this brief look at the Edge concentrates on those aspects, and particularly on how we might differentiate Edge computing from Cloud computing.

The first difference we might identify is that Edge computing addresses use cases where consolidating compute resource in a centralised location (the typical Cloud computing case) is not necessarily appropriate, and pushes some or all of the computing power out to the edges of the network, where computing resources can process data which is generated at the fringes, rather than having to transfer all the data over what may be low-bandwidth networks for processing. There is no generally accepted single industry definition of Edge computing, but examples might include:

  • placing video processing systems in or near a sports stadium for pre-processing to reduce the amount of raw footage that needs to be transmitted to a centralised data centre or studio
  • providing analysis and safety control systems on an ocean-based oil rig to reduce reliance and contention on an unreliable and potentially low-bandwidth network connection
  • creating an Internet of Things (IoT) gateway to process and analyse data from environmental sensor units (IoT devices)
  • mobile edge computing, or multi-access edge computing (both abbreviated to MEC), where telecommunications services such as location and augmented reality (AR) applications are run on cellular base stations, rather than in the telecommunication provider’s centralised network location.

Unlike Cloud computing, where the hosting model is generally that computing resources are consumed by tenants – customers of a public cloud, for instance – in the Edge case, the consumer of computing resources is often the owner of the systems providing them (though this is not always the case). Another difference is the size of the host providing the computing resources, which may range from very large to very small (in the case of an IoT gateway, for example). One important factor about most modern Edge computing environments is that they employ the same virtualisation and orchestration techniques as the cloud, allowing more flexibility in deployment and lifecycle management than bare-metal deployments.

A table comparing the various properties typically associated with Cloud and Edge computing shows us a number of differences.


                          Public cloud computing   Private cloud computing   Edge computing
Location                  Centralised              Centralised               Distributed
Hosting model             Tenants                  Owner                     Owner or tenant(s)
Application type          Generalised              Generalised               May be specialised
Host system size          Large                    Large                     Large to very small
Network bandwidth         High                     High                      Medium to low
Network availability      High                     High                      High to low
Host physical security    High                     High                      Low

Differences between Edge computing and public/private cloud computing

In the table, I’ve described two different types of cloud computing: public and private. The latter is sometimes characterised as on premises or on-prem computing, but the point here is that rather than deploying applications to dedicated hosts, workloads are deployed using the same virtualisation and orchestration techniques employed in the public cloud, the key difference being that the hosts and software are owned and managed by the owner of the applications. Sometimes these services are actually managed by an external party, but in this case there is a close commercial (and concomitant trust) relationship with this managed services provider and, equally importantly, single tenancy is assured (assuming that security is maintained), as only applications from the owner of the service are hosted[1]. Many organisations will mix and match workloads to different cloud deployments, employing both public and private clouds (a deployment model known as hybrid cloud) and/or different public clouds (a deployment model known as multi-cloud). All these models – public cloud computing, private cloud computing and Edge computing – share an approach in common: in most cases, workloads are not deployed to bare-metal servers, but to virtualisation platforms.

Deployment model differences

What is special about each of the models and their offerings if we are looking at trust and security?

One characteristic that the three approaches share is scale: they all assume that there will be multiple workloads per host – though the number of hosts and the actual size of the host systems is likely to be highest in the public cloud case, and lowest in the Edge case. It is this high workload density that makes public cloud computing in particular economically viable, and it is one of the reasons that it makes sense for organisations to deploy at least some of their workloads to public clouds: Cloud Service Providers can employ economies of scale which allow them to schedule workloads onto their servers from multiple tenants, balancing load and bringing sets of servers in and out of commission (a computation- and time-costly exercise) infrequently. Owners and operators of private clouds, in contrast, need to ensure that they have sufficient resources available for possible maximum load at all times, and do not have the opportunity to balance loads from other tenants unless they open up their on premises deployment to other organisations, transforming themselves into Cloud Service Providers and putting them into direct competition with existing CSPs.

This push for high workload density is one of the reasons for the need for strong workload-from-workload (type 1) isolation: in order to maintain high density, cloud owners need to be able to mix workloads from multiple tenants on the same host. Tenants are mutually untrusting; they are in fact likely to be completely unaware of each other and, if the host is doing its job well, unaware of the presence of other workloads on the same host as them. More important than this property, however, is a strong assurance that their workloads will not be negatively impacted by workloads from other tenants. Although negative impacts can occur in contexts other than computation – such as storage contention or network congestion – the focus here is mainly on the isolation that hosts can provide.

The likelihood of malicious workloads increases with the number of tenants, but reduces significantly when the tenant is the same as the host owner – the case for private cloud deployments and some Edge deployments. Thus, the need for host-from-workload (type 2) isolation is higher for the public cloud – though the possibility of poorly written or compromised workloads means that it should not be neglected for the other types of deployment.

One final difference between the models is that for both public and private cloud deployments the physical vulnerability of hosts is generally considered to be low[2], whereas the opportunities for unauthorised physical access to Edge computing hosts are considered to be much higher. You can read a little more about the importance of hardware as part of the Trusted Compute Base in my article Turtles – and chains of trust, and it is a fundamental principle of computer security that if an attacker has physical access to a system, then the system must be considered compromised, as it is, in almost all cases, possible to compromise the confidentiality, integrity and availability of workloads executing on it.

All of the above are good reasons to apply Confidential Computing techniques not only to cloud computing, but to Edge computing as well: that’s a topic for another article.


1 – this is something of a simplification, but is a useful generalisation.

2 – Though this assumes that people with authorised access to physical machines are not malicious, a proposition which cannot be guaranteed, but for which monitoring can at least be put in place.

Does my TCB look big in this?

The smaller your TCB the less there is to attack, and that’s a good thing.

This isn’t the first article I’ve written about Trusted Compute Bases (TCBs), so if the concept is new to you, I suggest that you have a look at What’s a Trusted Compute Base? to get an idea of what I’ll be talking about here. In that article, I noted the importance of the size of the TCB: “what you want is a small, easily measurable and easily auditable TCB on which you can build the rest of your system – from which you can build a ‘chain of trust’ to the other parts of your system about which you care.” In this article, I want to take some time to discuss the importance of the size of a TCB, how we might measure it, and how difficult it can be to reduce the TCB size. Let’s look at all of those issues in order.

Size does matter

However you measure it – and we’ll get to that below – the size of the TCB matters for two reasons:

  1. the larger the TCB is, the more bugs there are likely to be;
  2. the larger the TCB is, the larger the attack surface.

The first of these is true of any system, and although there may be ways of reducing the number of bugs – proving the correctness of all or, more likely, part of the system, for instance – bugs are both tricky to remove and resilient: if you remove one, you may well be introducing another (or worse, several). The kinds of bugs you have, and the number of them, can be reduced through a multitude of techniques, from language choice (choosing Rust over C/C++ to reduce memory allocation errors, for instance) to better specification and on to improved test coverage and fuzzing. In the end, however, the smaller the TCB, the less code (or hardware – we’re considering the broader system here, don’t forget) you have to trust, and the less space there is for bugs in it.

The concept of an attack surface is important and, like TCBs, one I’ve introduced before (in What’s an attack surface?). Like bugs, there may be no absolute measure of the ratio of danger:attack surface, but the smaller your TCB, well, the less there is to attack, and that’s a good thing. As with bug reduction, there are a number of techniques you may want to apply to reduce your attack surface, but the smaller it is, then, by definition, the fewer opportunities attackers have to try to compromise your system.

Measurement

Measuring the size of your TCB is really, really hard – or, maybe I should say that coming up with an absolute measure that you can compare to other TCBs is really, really hard. The problem is that there are so many measurements that you might take. The ones you care about are probably those that can be related to attack surface – but there are so many different attack vectors that might be relevant to a TCB that there are likely to be multiple attack surfaces. Let’s look at some of the possible measurements:

  • number of API methods
  • amount of data that can be passed across each API method
  • number of parameters that can be passed across each API method
  • number of open network sockets
  • number of open local (e.g. UNIX) sockets
  • number of files read from local storage
  • number of dynamically loaded libraries
  • number of DMA (Direct Memory Access) calls
  • number of lines of code
  • amount of compilation optimisation carried out
  • size of binary
  • size of executing code in memory
  • amount of memory shared with other processes
  • use of various caches (L1, L2, etc.)
  • number of syscalls made
  • number of strings visible using the strings command or similar
  • number of cryptographic operations not subject to constant time checks

This is not meant to be an exhaustive list, but just to show the range of different areas in which vulnerabilities might appear. Designing your application to reduce one measure may increase another – one very simple example being an attempt to reduce the number of API calls exposed by increasing the number of parameters on each call, another being to reduce the size of the binary by using more dynamically linked libraries.
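
As a purely illustrative example of that first trade-off – the interfaces below are invented, not taken from any real project – compare a design with many narrow API methods against one with a single wide method:

```rust
// A hedged sketch of the trade-off: shrinking one measurement (number of API
// methods) can inflate another (parameters and states per method).

// Design A: many narrow methods - a larger method count, but each is simple.
trait StorageNarrow {
    fn read(&self, key: &str) -> Option<Vec<u8>>;
    fn write(&mut self, key: &str, value: &[u8]);
    fn delete(&mut self, key: &str);
}

// Design B: one wide method - a smaller method count, but more parameters and
// more input states for an attacker (or a fuzzer) to explore per call.
enum Op<'a> {
    Read { key: &'a str },
    Write { key: &'a str, value: &'a [u8] },
    Delete { key: &'a str },
}

trait StorageWide {
    fn call(&mut self, op: Op<'_>) -> Option<Vec<u8>>;
}
```

Neither is automatically better: which matters more depends on your threat model, which is exactly the point made below.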

This leads us to an important point which I’m not going to address in detail in this article, but which is fundamental to understanding TCBs: that without a threat model, there’s actually very little point in considering what your TCB is.

Reducing the TCB size

We’ve just seen one of the main reasons that reducing your TCB size is difficult: it’s likely to involve trade-offs between different measures. If all you’re trying to do is produce competitive marketing material where you say “my TCB is smaller than yours”, then you’re likely to miss the point. The point of a TCB is to have a well-defined computing base which can protect against specific threats. This requires you to be clear about exactly what functionality requires that it be trusted, where it sits in the system, and how the other components in the system rely on it: what trust relationships they have. I was speaking to a colleague just yesterday who was relaying a story of a software project whose team said, “we’ve reduced our TCB to this tiny component by designing it very carefully and checking how we implement it”, but who overlooked the fact that the rest of the stack – which contained a complete Linux distribution and applications – could be no more trusted than before. The threat model (if there was one – we didn’t get into details) seemed to assume that only the TCB would be attacked, which missed the point entirely: it just added another “turtle” to the stack, without actually fixing the problem that was presumably at issue: that of improving the security of the system.

Reducing the TCB by artificially defining what the TCB is to suit your capabilities or particular beliefs around what the TCB specifically should be protecting against is not only unhelpful but actively counter-productive. This is because it ignores the fact that a TCB is there to serve the needs of a broader system, and if it is considered in isolation, then it becomes irrelevant: what is it acting as a base for?

In conclusion, it’s all very well saying “we have a tiny TCB”, but you need to know what you’re protecting, from what, and how.

Dependencies and supply chains

A dependency is a component which an application or another component needs in order to work

Supply chain security is really, really hot right now. It’s something which folks in the “real world” of manufactured things have worried about for years – you might be surprised (and worried) by how much attention aircraft operators have to pay to “counterfeit parts”, and I wish I hadn’t just searched online for the words “counterfeit pharmaceutical” – but the world of software had a rude wake-up call recently with the SolarWinds hack (or crack, if you prefer). This isn’t the place to go over that: you’ll be able to find many, many articles if you search on that. In fact, many companies have been worrying about this for a while, but the change is that it’s an issue which is now out in the open, giving more leverage to those who want more transparency around what software they consume in their products or services.

When we in computing (particularly software) think about supply chains, we generally talk about “dependencies”, and I thought it might be useful to write a short article explaining the main dependency types.

What is a dependency?

A dependency is a component which an application or another component needs in order to work. Dependencies are generally considered to come in two types:

  • build-time dependencies
  • run-time dependencies.

Let’s talk about those two types in turn.

Build-time dependencies

These are components which are required in order to build (typically compile and package) your application or library. For example, if I’m writing a program in Rust, I have a dependency on the compiler if I want to create an application. I’m actually likely to have many more build-time dependencies than that, however. How those dependencies are made visible to me will depend on the programming language and the environment that I’m building in.

Some languages, for instance, may have filesystem support built in, but others will require you to “import” one or more libraries in order to read and write files. Importing a library basically tells your build-time environment to look somewhere (local disk, online repository, etc.) for a library, and then bring it into the application, allowing its capabilities to be used. In some cases, you will be taking pre-built libraries, and in others, your build environment may insist on building them itself. Languages like Rust have clever environments which can look for new versions of a library, download it and compile it without your having to worry about it yourself (though you can if you want!).
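
As a hedged sketch of what that looks like in practice in Rust (the crate chosen here, serde_json, is just an example, not a recommendation):

```rust
// A minimal sketch of a build-time dependency in Rust: the serde_json crate
// is declared in Cargo.toml (e.g. `serde_json = "1"`), fetched and compiled
// when the application is built, and then brought in via `use`.

use serde_json::Value;

fn main() {
    let parsed: Value = serde_json::from_str(r#"{"component": "parser", "ok": true}"#)
        .expect("valid JSON");

    // If the crate (or a compatible version of it) isn't available at build
    // time, the application simply won't compile - that's what makes it a
    // build-time dependency.
    println!("component = {}", parsed["component"]);
}
```

The declaration in Cargo.toml and the `use` line are both resolved when the application is built; a missing or incompatible version shows up as a build failure, not a run-time one.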

To get back to our filesystem example: even if the language does come with built-in filesystem support, you may decide to import a different library – maybe you need some fancy distributed, sharded file system, for instance – from a different supplier. Other capabilities may not be provided by the language, or may be higher-level capabilities: JSON serialisation or HTTPS support, for instance. Whether that library is available in open source may have a large impact on your decision as to whether or not to use it.

Build-time dependencies, then, require you to have the pieces you need – either pre-built or in source code form – at the time that you’re building your application or library.

Run-time dependencies

Run-time dependencies, as the name implies, only come into play when you actually want to run your application. We can think of there being two types of run-time dependency:

  1. service dependency – this may not be the official term, but think of an application which needs to write some data to a window on a monitor screen: in most cases, there’s already a display manager and a window manager running on the machine, so all the application needs to do is contact them and communicate the right data over an API. Sometimes the underlying operating system may need to start these managers first, but it’s not the application itself which is having to do that. These are local services, but remote services – accessing a database, for instance – operate in the same sort of way. As long as the application can contact and communicate with the database, it doesn’t need to do much itself. There’s still a dependency, and things can definitely go wrong, but it’s a weak coupling to an external application or service (see the sketch after this list).
  2. dynamic linking – this is where an application needs access to a library at run-time, but rather than having added it at build-time (“static linking”), it relies on the underlying operating system to provide a link to the library when it starts executing. This means that the application doesn’t need to be as large (it’s not “carrying” the functionality with it when it’s installed), but it does require that the version that the operating system provides is compatible with what’s expected, and does the right thing.
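
Here’s a hedged sketch of a service dependency of the remote kind – the address, and the idea of a database listening on localhost:5432, are purely illustrative. The application builds fine without the service; only at run time does it discover whether the dependency is actually there:

```rust
// A sketch of a run-time service dependency: the application only needs to be
// able to contact the service when it runs, and should cope sensibly when it
// can't. The endpoint below is hypothetical.

use std::net::TcpStream;
use std::time::Duration;

fn main() {
    match TcpStream::connect("127.0.0.1:5432") {
        Ok(stream) => {
            // The dependency is available; set a timeout and carry on.
            stream
                .set_read_timeout(Some(Duration::from_secs(5)))
                .expect("setting a read timeout should not fail");
            println!("connected to the database service");
        }
        Err(e) => {
            // The dependency is missing at run time: the build succeeded,
            // but the application still can't do useful work.
            eprintln!("database service unavailable: {e}");
        }
    }
}
```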

Conclusion

I’ve resisted the temptation to go into the impacts of these different types of dependency in terms of their impact on security. That’s a much longer article – or set of articles – but it’s worth considering where in the supply chain we consider these dependencies to live, and who controls them. Do I expect application developers to check every single language dependency, or just imported libraries? To what extent should application developers design in protections from malicious (or just poorly-written) dynamically-linked libraries? Where does the responsibility lie for service dependencies – particularly remote ones?

These are all complex questions: the supply chain is not a simple topic (partly because there is not just one supply chain, but many of them), and organisations need to think hard about how they work.

Arm joins the Confidential Computing party

Arm’s announcement of Realms isn’t just about the Edge

The Confidential Computing Consortium is a Linux Foundation project designed to encourage open source projects around confidential computing. Arm has been part of the consortium for a while – in fact, the company is a Premier Member – but things got interesting on the 30th March, 2021. That’s when Arm announced their latest architecture: Armv9. Armv9 includes a new set of features, called Realms. There’s not a huge amount of information in the announcement about Realms, but Arm is clear that this is their big play into Confidential Computing:

To address the greatest technology challenge today – securing the world’s data – the Armv9 roadmap introduces the Arm Confidential Compute Architecture (CCA).

I happen to live about 30 minutes’ drive from the main Arm campus in Cambridge (UK, of course), and know a number of Arm folks professionally and socially – I think I may even have interviewed for a job with them many moons ago – but I don’t want to write a puff piece about the company or the technology[1]. What I’m interested in, instead, is the impact this announcement is likely to have on the Confidential Computing landscape.

Arm has had an element in their architecture for a while called TrustZone which provides a number of capabilities around security, but TrustZone isn’t a TEE (Trusted Execution Environment) on its own. A TEE is the generally accepted unit of confidential computing – the minimum building block on which you can build. It is arguably possible to construct TEEs using TrustZone, but that’s not what it’s designed for, and Arm’s decision to introduce Realms strongly suggests that they want to address this. This is borne out by the press release.

Why is all this important? I suspect that few of you have laptops or desktops that run on Arm (Raspberry Pi machines apart – see below). Few of the servers in the public cloud run Arm, and Realms are probably not aimed particularly at your mobile phone (for which TrustZone is a better fit). Why, then, is Arm bothering to make a fuss about this and to put such an enormous design effort into this new technology? There are two answers, it seems to me, one of which is probably pretty much a sure thing, and the other of which is more of a competitive gamble.

Answer 1 – the Edge

Despite recent intrusions by both AMD and Intel into the Edge space, the market is dominated by Arm-based[3] devices. And Edge security is huge, partly because we’re just seeing a large increase in the number of Edge devices, and partly because security is really hard at the Edge, where devices are more difficult to defend, both logically (they’re on remote networks, more vulnerable to malicious attack) and physically (many are out of the control of their owners, living on customer premises, up utility poles, on gas pipelines or in sports stadia, just to give a few examples). One of the problems that confidential computing aims to solve is the issue that, traditionally, once an attacker has physical access to a system, it should be considered compromised. TEEs allow some strong mitigations against that problem (at least against most attackers and timeframes), so making it easy to create and use TEEs on the Edge makes a lot of sense. With the addition of Realms to the Armv9 architecture, Arm is signalling its intent to address security on the Edge, and to defend and consolidate its position as leader in the market.

Answer 2 – the Cloud

I mentioned above that few public cloud hosts run Arm – this is true, but it’s likely to change. Arm would certainly like to see it change, and to see its chipsets move into the cloud mainstream. There has been a lot of work to improve support for server-scale Arm within Linux (in fact, open source support for Arm is generally excellent, not least because of the success of Arm-based chips in Raspberry Pi machines). Amazon Web Services (AWS) started offering Arm-based servers to customers as long ago as 2018. This is a market in which Arm would clearly love to be more active and carve out a larger share, and the growing importance of confidential computing in the cloud (both public and private) means that having a strong story in this space was important: Realms are Arm’s answer to this.

What next?

An announcement of an architecture is not the same as availability of hardware or software to run on it. We can expect it to be quite a few months before we see production chips running Armv9, though evaluation hardware should be available to trusted partners well before that, and software emulation for various components of the architecture will probably come even sooner. This means that those interested in working with Realms should be able to get things moving and have something ready pretty much by the time of availability of production hardware. We’ll need to see how easy they are to use, what performance impact they have, etc., but Arm do have an advantage here: as they are not the first into the confidential computing space, they’ve had the opportunity to watch Intel and AMD and see what has worked, and what hasn’t, both technically and in terms of what the market seems to like. I have high hopes for Arm Realms, and Enarx, the open source confidential computing project with which I’m closely involved, has plans to support them when we can: our architecture was designed with multi-platform support from the beginning.


1 – I should also note that I participated in a panel session on Confidential Computing which was put together by Arm for their “Arm Vision Day”, but I was in no way compensated for this[2].

2 – in fact, the still for the video is such a terrible picture of me that I think maybe I have grounds to sue for it to be taken down.

3 – Arm doesn’t manufacture chips itself: it licenses its designs to other companies, who create, manufacture and ship devices themselves.

Track and trace failure: a systems issue

The problem was not Excel, but that somebody used the wrong tools in a system.

Like many other IT professionals in the UK – and across the world, having spoken to some other colleagues in other countries – I was first surprised and then horrified as I found out more about the failures of the UK testing and track and trace systems. What was most shocking about the failure is not that it was caused by some alleged problem with Microsoft Excel, but that anyone thought this was a problem due to Excel. The problem was not Excel, but that somebody used the wrong tools in a system which was not designed properly, tested properly, or run properly. I have written many words about systems, and one article in particular seems relevant: If it isn’t tested, it doesn’t work. In it, I assert that a system cannot be said to work properly if it has not been tested, as a fully working system requires testing in order to be “working”.

In many software and hardware projects, in order to complete a piece of work, it has to meet one or more of a set of tests which allow it to be described as “done”. These tests may be actual software tests, or documentation, or just checks done by members of the team (other than the person who did the piece of work!), but the list needs to be considered and made part of the work definition. This “done” definition is as much part of the issue being addressed, functionality added or documentation being written as the actual work done itself.

I find it difficult to believe that there was any such definition for the track and trace system. If there was, then it was not, I’m afraid, defined by someone who is an expert in distributed or large-scale systems. This may not have been the fault of the person who chose Excel for the task of recording information, but it is the fault of the person who was in charge of the system, because Excel is not, and never was, a fit application for what it was being used for. It does not have the scalability characteristics, the integrity characteristics or the versioning characteristics required. This is not the fault of Microsoft, any more than it would be the fault of Porsche if a 911T broke down because its owner filled it with diesel fuel rather than petrol[1]. Any competent systems architect or software engineer, qualified to be creating such a system, would have known this: the only thing that seems possible is that whoever put together the system was unqualified to do so.

There seem to be several options here:

  1. the person putting together the system did not know they were unqualified;
  2. the person putting together the system realised that they were unqualified, but did not feel able to tell anyone;
  3. the person putting together the system realised that they were unqualified, but lied.

In any of the above, where was the oversight? Where was the testing? Where were the requirements? This was a system intended to safeguard the health of millions – millions – of people.

Who can we blame for this? In the end, the government needs to take some large measure of responsibility: they commissioned the system, which means that they should have come up with realistic and appropriate requirements. Requirements of this type may change over the life-cycle of a project, and there are ways to manage this: I recommend a useful book in another article, Building Evolutionary Architectures – for security and for open source. These are not new problems, and they are not unsolved problems: we know how to do this as a profession, as a society.

And then again, should we blame someone? Normally, I’d consider this a question out of scope for this blog, but people may die because of this decision – the decision not to define, design, test and run a system which was fit for purpose. At the very least, there are people who are anxious and worried about whether they have Covid-19, whether they need to self-isolate, whether they may have infected vulnerable friends or family. Blame is a nasty thing, but if it’s about holding people to account, then that’s what should happen.

IT systems are important. Particularly when they involve people’s health, but in many other areas, too: banking, critical infrastructure, defence, energy, even retail and entertainment, where people’s jobs will be at stake. It is appropriate for those of us with a voice to speak out, to remind the IT community that we have a responsibility to society, and to hold those who commission IT systems to account.


1 – or “gasoline” in some geographies.

They won’t get security right

Save users from themselves: make it difficult to do the wrong thing.

I’m currently writing a book on Trust in computing and the cloud – I’ve mentioned it before – and I confidently expect to reach 50% of my projected word count today, as I’m on holiday, have more time to write it, and got within about 850 words of the goal yesterday. That little boast aside, one of the topics that I’ve been writing about is the need to consider the contexts in which the systems you design and implement will be used.

When we design systems, there’s a temptation – a laudable one, in many cases – to provide all of the features and functionality that anyone could want, to implement all of the requests from customers, to accept every enhancement request that comes in from the community. Why is this? Well, for a variety of reasons, including:

  • we want our project or product to be useful to as many people as possible;
  • we want our project or product to match the capabilities of another competing one;
  • we want to help other people and be seen as responsive;
  • it’s more interesting implementing new features than marking an existing set complete, and settling down to bug fixing and technical debt management.

These are all good – or at least understandable – reasons, but I want to argue that there are times that you absolutely should not give in to this temptation: that, in fact, on every occasion that you consider adding a new feature or functionality, you should step back and think very hard whether your product would be better if you rejected it.

Don’t improve your product

This seems, on the face of it, to be insane advice, but bear with me. One of the reasons is that many techies (myself included) are more interested in getting code out of the door than in weighing up alternative implementation options. Another reason is that every opportunity to add a new feature is also an opportunity to deal with technical debt or improve the documentation and architectural information about your project. But the other reasons are to do with security.

Reason 1 – attack surface

Every time that you add a feature, a new function, a parameter on an interface or an option on the command line, you increase the attack surface of your code. Whether by fuzzing, targeted probing or careful analysis of your design, the larger the attack surface of your code, the more opportunities there are for attackers to find vulnerabilities, create exploits and mount attacks on instances of your code. Strange as it may seem, adding options and features for your customers and users can often be doing them a disservice: making them more vulnerable to attacks than they would have been if you had left well enough alone.

If we do not need an all-powerful administrator account after initial installation, then it makes sense to delete it after it has done its job. If logging of all transactions might yield information of use to an attacker, then it should be disabled in production. If older versions of cryptographic functions might lead to protocol attacks, then it is better to compile them out than just to turn them off in a configuration file. All of these lead to reductions in attack surface which ultimately help safeguard users and customers.
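
The third example – compiling risky functionality out rather than merely switching it off – is easy to illustrate in Rust with a feature gate. This is a hedged sketch: the `legacy-tls` feature name and the functions are invented for illustration.

```rust
// A sketch of compiling legacy cryptographic support out of a binary rather
// than disabling it in a configuration file. The `legacy-tls` feature name is
// hypothetical and would be declared in Cargo.toml.

#[cfg(feature = "legacy-tls")]
fn negotiate_legacy_cipher() {
    // This function only exists in binaries built with `--features legacy-tls`.
    println!("negotiating a legacy cipher suite");
}

fn negotiate_cipher() {
    // When the feature is off, the legacy path is not just skipped - it was
    // never compiled, so it adds nothing to the attack surface.
    #[cfg(feature = "legacy-tls")]
    negotiate_legacy_cipher();

    #[cfg(not(feature = "legacy-tls"))]
    println!("legacy ciphers compiled out; modern suites only");
}

fn main() {
    negotiate_cipher();
}
```

A binary built without the feature contains no legacy code path at all, so there is nothing for a misconfiguration (or an attacker) to re-enable.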

Reason 2 – saving users from themselves

The other reason is also about helping your users and customers: saving them from themselves. There is a dictum – somewhat unfair – within computing that “users are stupid”. While this is overstating the case somewhat, it is fairer to note that Murphy’s Law holds in computing as it does everywhere else: “Anything that can go wrong, will go wrong”. Specific to our point, some user somewhere can be counted upon to use the system that you are designing, implementing or operating in ways which are at odds with your intentions. IT security experts, in particular, know that we cannot stop people doing the wrong thing, but where there are opportunities to make it difficult to do the wrong thing, then we should embrace them.

Not adding features, disabling capabilities and restricting how your product is used might seem counter-intuitive, but if it leads to a safer user experience and fewer vulnerabilities associated with your product or project, then in the end, everyone benefits. And you can use the time to go and write some documentation. Or go to the beach. Enjoy!

Enarx news: WebAssembly + SGX

We can now run a WebAssembly workload in a TEE

Regular readers will know that I’m one of the co-founders of the Enarx project, and write about our progress fairly frequently. You’ll find some more information in these articles (newest first):

I won’t spend time going over what the project is about here, as you can read more in the articles above, and lots more over at our GitHub repository (https://enarx.io), but the aim of the project is to allow you to run sensitive workloads in Trusted Execution Environments (TEEs) from various silicon vendors. The runtime we’ll be providing is WebAssembly (using the WASI interface), and the TEEs that we’re targeting to start with are AMD’s SEV and Intel’s SGX – though you, the client running the workload, shouldn’t care which is being used at any particular point, as we abstract all of that away.
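
For a flavour of what such a workload can look like, here’s a hedged sketch – my own toy example, nothing Enarx-specific – of an ordinary Rust program that does its I/O through WASI and can be compiled with `cargo build --target wasm32-wasi`:

```rust
// A trivial WASI-friendly workload: read from stdin, write a short report to
// stdout. The same source compiles natively or to WebAssembly; the point is
// that the code doesn't need to know which TEE (SEV, SGX, ...) it will
// eventually run in.

use std::io::{self, Read, Write};

fn main() -> io::Result<()> {
    let mut input = String::new();
    io::stdin().read_to_string(&mut input)?; // stdin is provided via WASI
    let report = format!("received {} bytes\n", input.len());
    io::stdout().write_all(report.as_bytes())?; // ...and so is stdout
    Ok(())
}
```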

It’s a complicated architecture, with lots of moving parts and a fairly complex full-stack architecture. Things have been moving really fast recently, partly due to a number of new contributors to the project (which recently reached 150 stars on our main GitHub repository), and one of the important developments is the production of a new component architecture, included below.

I’ve written before about how important it is to have architectural diagrams for projects (and particularly open source projects), and so I’m really pleased that we have this updated – and much more detailed – version of the diagram. I won’t go into it in detail here, but the only trusted components (from the point of view of Enarx and a client using Enarx) are those in green: the Enarx client agent, the Attestation measurement database and the TEE instance containing the Application, Wasm, Shim, etc., which we call the Keep.

It’s only just over a couple of months since we announced our most recent milestone: running binaries within TEEs, including both SEV and SGX. This was a huge deal for us, but then, at the end of last week, one of the team, Daiki Ueno, managed to get a WebAssembly binary loaded using the Wasm/WASI layers and to run it. In other words, we can now run a WebAssembly workload in a TEE. Over the past 10 minutes, I’ve just managed to replicate that on my own machine (for my own interest, not that I doubted the results!). Circled, below, are the components which are involved in this feat.

I should make it clear that none of these components is complete or stable – they are all still very much in development and, in fact, even the running test yielded around 40 lines of error messages! – but the fact that we’ve got a significant part of the stack working together is an enormous deal. The Keep loader creates an SGX TEE instance, loads the Shim and the Wasm/WASI components into the Keep and starts them running. The App Loader (one of the Wasm components) loads the Application, and it runs.

It’s not just an enormous deal due to the amount of work that it’s taken and the great teamwork that’s been evident throughout – though those are true as well – but also because it means we’re getting much closer to having something that people can try out for themselves. The steps to just replicating the test were lengthy and complex, and my next step is to see if I can make some changes and create my own test, which may be quite a marathon, but I’d like to stress that my technical expertise lies closer to the application layer than the low-level pieces, which means that if I can make this work now (with a fair amount of guidance), then we shouldn’t be too far off being in a position where pretty much anyone with a decent knowledge of the various pieces can do the same.

We’re still quite a way off being able to provide a simple file with 5-10 steps to creating and running your own workload in a Keep, but that milestone suddenly feels like it’s in sight. In the meantime, there’s loads of design work, documentation, infrastructure automation, testing and good old development to be done: fancy joining us?

No security without an architecture

Your diagrams don’t need to be perfect. But they do need to be there.

I attended a virtual demo this week. It didn’t work, but none of us was stressed by that: it was an internal demo, and these things happen. Luckily, the members of the team presenting the demo had lots of information about what it would have shown us, and a particularly good architectural diagram to discuss. We’ve all been in the place where the demo doesn’t work, and all felt for the colleague who was presenting the slidedeck, and on whose screen a message popped up a few slides in, saying “Demo NO GO!” from one of her team members.

After apologies, she asked if we wanted to bail completely, or to discuss the information they had to hand. We opted for the latter – after all, most demos which aren’t foregrounding user experience components don’t show much beyond terminal windows that most of us could fake up in half an hour or so anyway. She answered a couple of questions, and then I piped up with one about security.

This article could have been about the failures in security in a project which was showing an early demo: another example of security being left till late (often too late) in the process, at which point it’s difficult and expensive to integrate. However, it’s not. It was clear that thought had been given to specific aspects of security, both on the network (in transit) and in storage (at rest), and though there was probably room for improvement (and when isn’t there?), a team member messaged me more documentation during the call which allowed me to understand the choices the team had made.

What this article is about is the fact that we were able to have a discussion at all. The slidedeck included an architecture diagram showing all of the main components, with arrows showing the direction of data flows. It was clear, colour-coded to show the provenance of the different components, which were sourced from external projects, which from internal, and which were new to this demo. The people on the call – all technical – were able to see at a glance what was going on, and the team lead, who was providing the description, had a clear explanation for the various flows. Her team members chipped in to answer specific questions or to provide more detail on particular points. This is how technical discussions should work, and there was one thing in particular which pleased me (beyond the fact that the project had thought about security at all!): that there was an architectural diagram to discuss.

There are not enough security experts in the world to go around, which means that not every project will have the opportunity to get every stage of their design pored over by a member of the security community. But when it’s time to share, a diagram is invaluable. I hate to think about the number of times I’ve been asked to look at a project in order to give my thoughts about security aspects, only to find that all that’s available is a mix of code and component documentation, with no explanation of how it all fits together and, worse, no architecture diagram.

When you’re building a project, you and your team are often so into the nuts and bolts that you know how it all fits together, and can hold it in your head, or describe the key points to a colleague. The problem comes when someone needs to ask questions of a different type, or review the architecture and design from a different slant. A picture – an architectural diagram – is a great way to educate external parties (or new members of the project) in what’s going on at a technical level. It also has a number of extra benefits:

  • it forces you to think about whether everything can be described in this way;
  • it forces you to consider levels of abstraction, and what should be shown at what levels;
  • it can reveal assumptions about dependencies that weren’t previously clear;
  • it is helpful to show data flows between the various components;
  • it allows for simpler conversations with people whose first language is not that of your main documentation.

To be clear, this isn’t just a security problem – the same can go for other non-functional requirements such as high-availability, data consistency, performance or resilience – but I’m a security guy, and this is how I experience the issue. I’m also aware that I have a very visual mind, and this is how I like to get my head around something new, but even for those who aren’t visually inclined, a diagram at least offers the opportunity to orient yourself and work out where you need to dive deeper into code or execution. I also believe that it’s next to impossible for anybody to consider all the security implications (or any of the higher-order emergent characteristics and qualities) of a system of any significant complexity without architectural diagrams. And that includes the people who designed the system, because no system exists on its own (or there’s no point to it), so you can’t hold all of those pieces in your head for any length of time.

I’ve written before about the book Building Evolutionary Architectures, which does a great job in helping projects think about managing requirements which can morph or change their priority, and which, unsurprisingly, makes much use of architectural diagrams. Enarx, a project with which I’m closely involved, has always had lots of diagrams, and I’m aware that there’s an overhead involved here, both in updating diagrams as designs change and in considering which abstractions to provide for different consumers of our documentation, but I truly believe that it’s worth it. Whenever we introduce new people to the project or give a demo, we ensure that we include at least one diagram – often more – and when we get questions at the end of a presentation, they are almost always preceded with a phrase such as, “could you please go back to the diagram on slide x?”.

I nearly published this article without adding another point: this is part of being “open”. I’m a strong open source advocate, but source code isn’t enough to make a successful project, or even, I would add, to be a truly open source project: your documentation should not just be available to everybody, but accessible to everyone. If you want to get people involved, then providing a way in is vital. But beyond that, I think we have a responsibility (and opportunity!) towards diversity within open source. Providing diagrams helps address four types of diversity (at least!):

  • people whose first language is not the same as that of your main documentation (noted above);
  • people who have problems reading lots of text (e.g. those with dyslexia);
  • people who think more visually than textually (like me!);
  • people who want to understand your project from different points of view (e.g. security, management, legal).

If you’ve ever visited a project on github (for instance), with the intention of understanding how it fits into a larger system, you’ll recognise the sigh of relief you experience when you find a diagram or two on (or easily reached from) the initial landing page.

And so I urge you to create diagrams, both for your benefit, and also for anyone who’s going to be looking at your project in the future. They will appreciate it (and so should you). Your diagrams don’t need to be perfect. But they do need to be there.