architecture – Page 2 – Alice, Eve and Bob

Immutability: my favourite superpower

As a security guy, I approve of defence in depth.

I’m a recent but dedicated convert to Silverblue, which I run on my main home laptop and which I’ll be putting onto my work laptop when I’m due a hardware upgrade in a few months’ time. I wrote an article about Silverblue over at Enable Sysadmin, and over the weekend, I moved the laptop that one of my kids has over to it as well. You can learn more about Silverblue over at the main Silverblue site, but in terms of usability, look and feel, it’s basically a version of Fedora. There’s one key difference, however, which is that the operating system is mounted read-only, meaning that it’s immutable.

What does “immutable” mean? It means that it can’t be changed. To be more accurate, in a software context, it generally means that something can’t be changed during run-time.

Important digression – constant immutability

I realised as I wrote that final sentence that it might be a little misleading. Many programming languages have the concept of “constants”. A constant is a variable (or set, or data structure) which is constant – that is, not variable. You can assign a value to a constant, and generally expect it not to change. But – and this depends on the language you are using – it may be that the constant is not immutable. This seems to go against common sense[1], but that’s just the way that some languages are designed. The bottom line is this: if you have a variable that you intend to be immutable, check the syntax of the programming language you’re using and take any specific steps needed to maintain that immutability if required.

Operating System immutability

In Silverblue’s case, it’s the operating system that’s immutable. You install applications in containers (of which more later), using Flatpak, rather than onto the root filesystem. This means not only that the installation of applications is isolated from the core filesystem, but also that the ability for malicious applications to compromise your system is significantly reduced. It’s not impossible[2], but the risk is significantly lower.

How do you update your system, then? Well, what you do is create a new boot image which includes any updated packages that are needed, and when you’re ready, you boot into that. Silverblue provides simple tools to do this: it’s arguably less hassle than the standard way of upgrading your system. This approach also makes it very easy to maintain different versions of an operating system, or installations with different sets of packages. If you need to test an application in a particular environment, you boot into the image that reflects that environment, and do the testing. Another environment? Another image.

We’re more interested in the security properties that this offers us, however. Not only is it very difficult to compromise the core operating system as a standard user[3], but you are always operating in a known environment, and knowability is very much a desirable property for security, as you can test, monitor and perform forensic analysis from a known configuration. From a security point of view (let alone what other benefits it delivers), immutability is definitely an asset in an operating system.

Container immutability

This isn’t the place to describe containers (also known as “Linux containers” or, less frequently or accurately these days, “Docker containers) in detail, but they are basically collections of software that you create as images and then run workloads on a host server (sometimes known as a “pod”). One of the great things about containers is that they’re generally very fast to spin up (provision and execute) from an image, and another is that the format of that image – the packaging format – is well-defined, so it’s easy to create the images themselves.

From our point of view, however, what’s great about containers is that you can choose to use them immutably. In fact, that’s the way they’re generally used: using mutable containers is generally considered an anti-pattern. The standard (and “correct”) way to use containers is to bundle each application component and required dependencies into a well-defined (and hopefully small) container, and deploy that as required. The way that containers are designed doesn’t mean that you can’t change any of the software within the running container, but the way that they run discourages you from doing that, which is good, as you definitely shouldn’t. Remember: immutable software gives better knowability, and improves your resistance to run-time compromise. Instead, given how lightweight containers are, you should design your application in such a way that if you need to, you can just kill the container instance and replace it with an instance from an updated image.

This brings us to two of the reasons that you should never run containers with root privilege:

there’s a temptation for legitimate users to use that privilege to update software in a running container, reducing knowability, and possibly introducing unexpected behaviour;
there are many more opportunities for compromise if a malicious actor – human or automated – can change the underlying software in the container.

Double immutability with Silverblue

I mentioned above that Silverblue runs applications in containers. This means that you have two levels of security provided as default when you run applications on a Silverblue system:

the operating system immutability;
the container immutability.

As a security guy, I approve of defence in depth, and this is a classic example of that property. I also like the fact that I can control what I’m running – and what versions – with a great deal more ease than if I were on a standard operating system.

1 – though, to be fair, the phrases “programming language” and “common sense” are rarely used positively in the same sentence in my experience.

2 – we generally try to avoid the word “impossible” when describing attacks or vulnerabilities in security.

3 – as with many security issues, once you have sudo or root access, the situation is significantly degraded.

Building Evolutionary Architectures – for security and for open source

Consider the fitness functions, state them upfront, have regular review.

Ford, N., Parsons, R. & Kua, P. (2017) Building Evolution Architectures: Support Constant Change. Sebastapol, CA: O’Reilly Media.

https://www.oreilly.com/library/view/building-evolutionary-architectures/9781491986356/

This is my first book review on this blog, I think, and although I don’t plan to make a habit of it, I really like this book, and the approach it describes, so I wanted to write about it. Initially, this article was simply a review of the book, but as I got into it, I realised that I wanted to talk about how the approach it describes is applicable to a couple of different groups (security folks and open source projects), and so I’ve gone with it.

How, then, did I come across the book? I was attending a conference a few months ago (DeveloperWeek San Diego), and decided to go to one of the sessions because it looked interesting. The speaker was Dr Rebecca Parsons, and I liked what she was talking about so much that I ordered this book, whose subject was the topic of her talk, to arrive at home by the time I would return a couple of days later.

Building Evolutionary Architectures is not a book about security, but it deals with security as one application of its approach, and very convincingly. The central issue that the authors – all employees of Thoughtworks – identifies is, simplified, that although we’re good at creating features for applications, we’re less good at creating, and then maintaining, broader properties of systems. This problem is compounded, they suggest, by the fast and ever-changing nature of modern development practices, where “enterprise architects can no longer rely on static planning”.

The alternative that they propose is to consider “fitness functions”, “objectives you want your architecture to exhibit or move towards”. Crucially, these are properties of the architecture – or system – rather than features or specific functionality. Tests should be created to monitor the specific functions, but they won’t be your standard unit tests, nor will they necessarily be “point in time” tests. Instead, they will measure a variety of issues, possibly over a period of time, to let you know whether your system is meeting the particular fitness functions you are measuring. There’s a lot of discussion of how to measure these fitness functions, but I would have liked even more: from my point of view, it was one of the most valuable topics covered.

Frankly, the above might be enough to recommend the book, but there’s more. They advocate strongly for creating incremental change to meet your requirements (gradual, rather than major changes) and “evolvable architectures”, encouraging you to realise that:

you may not meet all your fitness functions at the beginning;
applications which may have met the fitness functions at one point may cease to meet them later on, for various reasons;
your architecture is likely to change over time;
your requirements, and therefore the priority that you give to each fitness function, will change over time;
that even if your fitness functions remain the same, the ways in which you need to monitor them may change.

All of these are, in my view, extremely useful insights for anybody designing and building a system: combining them with architectural thinking is even more valuable.

As is standard for modern O’Reilly books, there are examples throughout, including a worked fake consultancy journey of a particular company with specific needs, leading you through some of the practices in the book. At times, this felt a little contrived, but the mechanism is generally helpful. There were times when the book seemed to stray from its core approach – which is architectural, as per the title – into explanations through pseudo code, but these support one of the useful aspects of the book, which is giving examples of what architectures are more or less suited to the principles expounded in the more theoretical parts. Some readers may feel more at home with the theoretical, others with the more example-based approach (I lean towards the former), but all in all, it seems like an appropriate balance. Relating these to the impact of “architectural coupling” was particularly helpful, in my view.

There is a useful grounding in some of the advice in Conway’s Law (“Organizations [sic] which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”) which led me to wonder how we could model open source projects – and their architectures – based on this perspective. There are also (as is also standard these days) patterns and anti-patterns: I would generally consider these a useful part of any book on design and architecture.

Why is this a book for security folks?

The most important thing about this book, from my point of view as a security systems architect, is that it isn’t about security. Security is mentioned, but is not considered core enough to the book to merit a mention in the appendix. The point, though, is that the security of a system – an embodiment of an architecture – is a perfect example of a fitness function. Taking this as a starting point for a project will help you do two things:

avoid focussing on features and functionality, and look at the bigger picture;
consider what you really need from security in the system, and how that translates into issues such as the security posture to be adopted, and the measurements you will take to validate it through the lifecycle.

Possibly even more important than those two points is that it will force you to consider the priority of security in relation to other fitness functions (resilience, maybe, or ease of use?) and how the relative priorities will – and should – change over time. A realisation that we don’t live in a bubble, and that our priorities are not always that same as those of other stakeholders in a project, is always useful.

Why is this a book for open source folks?

Very often – and for quite understandable and forgiveable reasons – the architectures of open source projects grow organically at first, needing major overhauls and refactoring at various stages of their lifecycles. This is not to say that this doesn’t happen in proprietary software projects as well, of course, but the sometimes frequent changes in open source projects’ emphasis and requirements, the ebb and flow of contributors and contributions and the sometimes, um, reduced levels of documentation aimed at end users can mean that features are significantly prioritised over what we could think of as the core vision of the project. One way to remedy this would be to consider the appropriate fitness functions of the project, to state them upfront, and to have a regular cadence of review by the community, to ensure that they are:

still relevant;
correctly prioritised at this stage in the project;
actually being met.

If any of the above come into question, it’s a good time to consider a wider review by the community, and maybe a refactoring or partial redesign of the project.

Open source projects have – quite rightly – various different models of use and intended users. One of the happenstances that can negatively affect a project is when it is identified as a possible fit for a use case for which it was not originally intended. Academic software which is designed for accuracy over performance might not be a good fit for corporate research, for instance, in the same way that a project aimed at home users which prioritises minimal computing resources might not be appropriate for a high-availability enterprise roll-out. One of the ways of making this clear is by being very clear up-front about the fitness functions that you expect your project to meet – and, vice versa, about the fitness functions you are looking to fulfil when you are looking to select a project. It is easy to focus on features and functionality, and to overlook the more non-functional aspects of a system, and fitness functions allow us to make some informed choices about how to balance these decisions.

Oh, how I love my TEE (or do I?)

Trusted Execution Environments use chip-level instructions to allow you to create enclaves of higher security

I realised just recently that I’ve not written yet about Trusted Execution Environments (TEEs) on this blog. This is a surprise, honestly, because TEEs are fascinating, and I spend quite a lot of my professional time thinking – and sometimes worrying – about them. So what, you may ask, is a TEE?

Let’s look at one of the key use cases first, and then get to what a Trusted Execution Environment is. A good place to start it the “Cloud”, which, as we all know, is just somebody else’s computer. What this means is that if you’re running an application (let’s call it a “workload”) in the Cloud – AWS, Azure, whatever – then what you’re doing is trusting somebody else to take the constituent parts of that workload – its code and its data – and run them on their computer. “Yay”, you may be thinking, “that means that I don’t have to run it in my computer: it’s all good.” I’m going to take issue with the “all good” bit of that statement. The problem is that the company – or people within that company – who run your workload on their computer (let’s call it a “host”) can, if they so wish, look inside it, change it, and stop it running. In other words, they can break all three classic “CIA” properties of security: confidentiality (by looking inside it); integrity (by changing it); and availability (by stopping it running). This is because the way that workloads run on hosts – whether in hardware-mediated virtual machines, within containers or on bare-metal – all allow somebody with sufficient privilege on that machine to do all of the bad things I’ve just mentioned.

And these are bad things. We don’t tend to care about them too much as individuals – because the amount of value a cloud provider would get from bothering to look at our information is low – but as businesses, we really should be worried.

I’m afraid that the problem doesn’t go away if you run your systems internally. Remember that anybody with sufficient access to hosts can look inside and tamper with your workloads? Well, are you happy that you sysadmins should all have access to your financial results? Merger and acquisition details? Pay roll? Because if you have this kind of data running on your machines on your own premises, then they do have access to all of those.

Now, there are a number of controls that you can put in place to help with this – not least background checks and Acceptable Use Policies – but TEEs aim to solve this problem with technology. Actually, they only really aim to solve the confidentiality and integrity pieces, so we’ll just have to assume for now that you’re going to be in a position to notice if your sales order process fails to run due to malicious activity (for instance). Trusted Execution Environments use chip-level instructions to allow you to create enclaves of higher security where processes can execute (and data can be processed) in ways that mean that even privileged users of the host cannot attack their confidentiality or integrity. To get a little bit technical, these enclaves are memory pages with particular controls on them such that they are always encrypted except when they are actually being processed by the chip.

The two best-known TEE implementations so far are Intel’s SGX and AMD’s SEV (though other silicon vendors are beginning to talk about their alternatives). Both Intel and AMD are aiming to put these into server hardware and create an ecosystem around their version to make it easy for people to run workloads (or components of workloads) within them. And the security community is doing what it normally does (and, to be clear, absolutely should be doing), and looking for vulnerabilities in the implementation. So far, most of the vulnerabilities that have been identified are within Intel’s SGX – though I’m not in a position to say whether that’s because the design and implementation is weaker, or just because the researchers have concentrated on the market leader in terms of server hardware. It looks like we need to go through a cycle or two of the technologies before the industry is convinced that we have a working design and implementation that provides the levels of security that are worth deploying. There’s also work to be done to provide sufficiently high quality open source software and drivers to support TEEs for wide deployment.

Despite the hopes of the silicon vendors, it may be some time before TEEs are in common usage, but people are beginning to sit up and take notice, partly because there’s so much interest in moving workloads to the Cloud, but still serious concerns about the security of your sensitive processes and data when they’re there. This has got to be a good thing, and I think it’s really worth considering how you might start designing and deploying workloads in new ways once TEEs actually do become commonly available.

The “invisible” trade-off? Security.

“For twenty years, people have been leaving security till last.”

Colleague (in a meeting): “For twenty years, people have been leaving security till last.”

Me (in response): “You could have left out those last two words.”

This article will be a short one, and it’s a plea. It’s also not aimed at my regular readership, because if you’re part of my regular readership, then you don’t need telling. Many of the articles on this blog, however, are written with the express intention of meeting two criteria:

they should be technical credible[1].
you should be able to show them to your parents or to your manager[2].

I suspect that it’s your manager, this time round, who I’ll be targeting, but I don’t want to make assumptions about your parents’ roles or influence, so let’s leave it open.

The issue I want to address this week is the impact of not placing security firmly at the beginning, middle and end of any system or application design process. As we all know, security isn’t something that you can bolt onto the end of a project and hope that you’ll be OK. Equally, if you think about it only at the beginning, you’ll find that by the end, your requirements, use cases, infrastructure or personae will have changed[3], and what you planned at the beginning is no longer fit for purpose. After all, if you know that your functional requirements will change (and everybody knows this), then why would your non-functional requirements be subject to the same drift?

The problem is that security, being a non-functional requirement[4], doesn’t get the up-front visibility that it needs. And, because it’s difficult to do well, and it’s often the responsibility of a non-core team member “flown in” as a consultant or expert for a small percentage design meetings, security is the area that it’s easy to decide to let slide a bit. Or a lot. Or completely.

If there’s a trade-off around features, functionality or resource location, it’s likely to be security, and often, nobody even raises the point that there has been a trade-off: it’s completely invisible (this is one of the reasons Why I love technical debt). This is also the reason that whenever I look at a system, I try to think “what were the decisions made about security?”, because, too often, no decisions were made about security at all.

So, if you’re a manager[6], and you’re involved with designing a system or application, don’t let security be the invisible trade-off. I’m not saying that it needs to be the be-all and end-all of the project, but at least ensure that you think about it. Thank you.

1 – they should be accurate, to be honest, but I also try not to dive deeper into technical topics than is absolutely required for context.

2 – to be clear, this isn’t about making them work- and parent-safe, but about presenting the topics in a manner that is approachable by non-experts.

3 – or, equally likely, all of them.

4 – I don’t mean that security doesn’t function correctly[5], but rather that it’s not one of the key functions of the system or application that’s being designed.

5 – though, now you mention it…

6 – or parent – see above.

Single point of failure

Any failure which completely brings down a system for over 12 hours counts as catastrophic.

Yesterday[1], Gatwick Airport suffered a catastrophic failure. It wasn’t Air Traffic Control, it wasn’t security scanners, it wasn’t even check-in desk software, but the flight information boards. Catastrophic? Well, maybe the impact on the functioning of the airport wasn’t catastrophically affected, but the system itself was. For my money, any failure which completely brings down a system for over 12 hours (from 0430 to 1700 BST, reportedly), counts as catastrophic.

The failure has been blamed on damage to a fibre optic cable. It turned out that if this particular component of the system was brought down, then the system failed to operate as expected: it was a single point of failure. Now, in this case, it could be argued that the failure did not have a security impact: this was a resilience problem. Setting aside the fact that resilience and security are often bedfellows[2], many single points of failure absolutely are security issues, as they become obvious points of vulnerability for malicious actors to attack.

A key skill that needs to be grown with IT in general, but security in particular, is systems thinking, as I’ve discussed elsewhere, including in my first post on this blog: Systems security – why it matters. We need more systems engineers, and more systems architects. The role of systems architects, specifically, is to look beyond the single components that comprise a system, and to consider instead the behaviour of the system as a whole. This may mean looking past our first focus and our to, to for instance, hardware or externally managed systems to consider what the impact of failure, damage or compromise would be to the system’s overall operation.

Single points of failure are particularly awkward. They crop up in all sorts of places, and they are a very good example of why diversity is important within IT security, and why you shouldn’t trust a single person – including yourself – to be the only person who looks at the security of a system. My particular biases are towards crypto and software, for instance, so I’m more likely to miss a hardware or network point of failure than somebody with a different background to me. Not to say that we shouldn’t try to train ourselves to think outside of whatever little box we come from – that’s part of the challenge and excitement of being a systems architect – but an acknowledgement of our own lack of expertise is in itself a realisation of our expertise: if you realise that you’re not an expert, you’re part way to becoming one.

I wanted to finish with an example of a single point of failure that is relevant to security, and exposes a process vulnerability. The Register has a good write-up of the Foreshadow attack and its impact on SGX, Intel’s Trusted Execution Environment (TEE) capability. What’s interesting, if the write-up is correct, is that what seems like a small break to a very specific part of the entire security chain means that you suddenly can’t trust anything. The trust chain is broken, and you have to distrust everything you think you know. This is a classic security problem – trust is a very tricky set of concepts – and one of the nasty things about it is that it may be entirely invisible to the user that an attack has taken place at all, particularly as the user, at this point, may have no visibility of the chain of trust that has been established – or not – up to the point that they are involved. There’s a lot more to write about on this subject, but that’s for another day. For now, if you’re planning to visit an airport, ensure that you have an app on your phone which will tell you your flight departure time and the correct gate.

1 – at time of writing, obviously.

2 – for non-native readers[3] , what I mean is that they are often closely related and should be considered together.

3 – and/or those unaquainted with my somewhat baroque language and phrasing habits[4].

4 – I prefer to double-dot when singing or playing Purcell, for instance[5].

5 – this is a very, very niche comment, for which slight apologies.

What’s an attack surface?

“Reduce your attack surface,” they say. But what is it?

“Reduce your attack surface,” they[1] say. But what is it? The instruction to reduce your attack surface is one of the principles of IT security, so it must be a Good Thing[tm]. The problem is that it’s not always clear what an attack surface actually is.

I’m going to go for the broadest possible description I can think of, or nearly, because I’m pretty paranoid, and because I’m not convinced that the Wikipedia definition[2] is sufficient[3]. Although I’ll throw in a few examples of how to reduce attack surfaces, the purpose of this post is really to explain what one is, rather than to help protect you – but a good understanding really is required before you start with anything else, so hopefully this will be useful.

So, here’s my start at a definition:

The attack surface of a system is the sum of areas where attacks could be launched against it.

That feels a little bit circular – let’s define some terms. First of all, what’s an an “area” in this definition? Well, I’d say that any particular component of a system may have many points of possible vulnerability – and therefore attack. The sum of those points is an area – and the sum of the areas of the different components of a system gives us our system’s attack surface.

To understand better, we’re going to have to talk about systems – one of my favourite topics[4] – because I think it’s important to clarify a key difference between the attack surface of a component considered alone, and the area that a component adds when part of a system. They will not generally be the same.

Here’s an example: you’re deploying an Operating System. Let’s look at two options for deployment, and compare the attack surfaces. In both cases, I’m going to take a fairly restricted look at points of vulnerability, excluding, for instance, human factors, as I don’t want to get bogged down in the details.

Deployment one – bare metal

You install your Operating System onto a physical machine, and plug it into the network. What are some of the attack points?

your network connection
the physical hardware
services which are listening on the network connection
connections via USB – keyboard and mouse, for example.

There are more, but this should give us enough to do some comparisons. I’d generally think of the attack surface as being associated with the physical bounds of the hardware, with the addition of the network port and USB connections.

How can we reduce the attack surface? Well, we could unplug the network connection – though that might significantly reduce the efficacy of the system! – or we might take steps to reduce the number of services listening on the connection, to reduce the privilege level at which they run, or increase the authentication requirements for connecting to them. We could reduce our surface area by using a utility such as “usbguard” to restrict USB connections, and, if we’re worried about physical access to the machine, we could put it in a locked cabinet somewhere. These are all useful and appropriate ways to reduce our system’s attack surface.

Deployment two – a Virtual Machine

In this deployment scenario, we’re going to install the Operating System onto a Virtual Machine (VM), running on a physical host. What does my attack surface look like now? Well, that rather depends on how you define your system. You could, of course, look at the wider system – the VM and the physical host – but for the purposes of this discussion, I’m going to consider that the operation of the Operating System is what we’re interested in, rather than the broader system[6]. So, what does our attack surface look like this time? Here’s a quick list.

your network connection
the hypervisor
services which are listening on the network connection
connections via USB – keyboard and mouse, for example.

You’ll notice that “the physical hardware” is missing from this list, and that’s because it’s been replace with “the hypervisor”. This is a little simplistic, for a few reasons, including that the hypervisor is arguably implemented via a combination of software and hardware controls, but it’s certainly different from the entire physical hardware we were talking about before, and in fact, there’s not much you can do from the point of the Virtual Machine to secure it, other than recognise its restrictions, so we might want to remove it from our list at this level.

The other entries are also somewhat different from our first scenario, although you might not realise at first glance. First, it’s quite likely (though not certain) that your network connection may in fact be a virtual network connection provided by the hosting system, which means that some of the burden of defending it goes to the hosting system. The same goes for the connections via USB – the hypervisor generally provides “virtual hardware” (via something like qemu, for example), which can be attached – or removed – from virtual machines.

So, you still have the services which are listening on the network connection, but it’s definitely a different attack surface from the first deployment scenario.

Now, if you take the wider view, then there’s definitely an attack surface at the physical machine level as well, and that needs to be considered – but it’s quite likely that this will be under the control of somebody completely different (such as a Cloud Service Provider – CSP).

Another quick example

When I deploy a webserver (using, for instance, Apache), I’ll need to consider a variety of attack vectors, from authentication to denial of service to storage attacks: these are part of our attack surface. If I deploy it with a database (e.g. PostgreSQL or MySQL), the attack surface looks different, assuming that I care about the data in the database. Whereas I might previously have been concerned to ensure that an HTTP “PUT” command didn’t overwrite or scramble a file on my filesystem, a malformed command to my database server could delete or corrupt multiple tables. On the other hand, I might now be able to lock down some of the functions of my webserver that I no longer need to worry about filesystem attacks. The attack surface of my webserver is different when it’s combined in a system with other components[7].

Why do I want to reduce my attack surface?

Well, this is quite an easy one. By looking back at my earlier definition, you’ll see that the smaller a system’s attack surface, the fewer points of attack there are available to malicious actors. That’s got to be a piece of good news.

You will, of course, never be able to reduce your attack surface to zero (see There are no absolutes in security), but the more you reduce (and document, always document!), the better position you’ll be in. It’s always about raising the bar to make it more difficult for malicious actors to affect you.

1 – the mythical IT Security Community, that’s who.

2 – to give one example.

3 – it only talks about data, and only about software: that’s not broad enough for me.

4 – as long-standing[4] readers of this blog will know.

5 – and long-suffering.

6 – yes, I know we can’t ignore that, but we’ll come back to it, honest.

7 – there are considerations around the attack surface of the database as well, of course.

There are no absolutes in security

There is no “secure”.

Let’s stop using the word “secure”. There is no “secure” in IT.

I know that sounds crazy, but it’s true.

Sometimes, when I speak to colleagues and customers, there will be non-technical or non-security people there, and they ask how to get a secure system. So I explain how I’d make a system secure. It goes a bit like this.

Remove any non-critical USB connections: in particular external or “thumb” drives.
Turn off all bluetooth.
Turn off all wifi.
Remove any network cables.
Remove any other USB connections, including mouse or keyboard.
Disconnect any monitors.
Disconnect any other cables that are connected to the system.
Yes, that includes the power cable.
Now take out any hard drives – SSD, HDD or other.
Destroy them. My preferred method is to gouge tracks in all spinning media, break the heads, bash all pieces with a hammer and then throw them into Mount Doom, but any other volcano[1] will do. Thermite lances are probably acceptable. You should do the same with all other components that you removed in earlier steps.
Destroy the motherboard, including all chips and RAM.
Tip all remaining pieces down a well.
Pour concrete down the well.[2]
You probably now have a secure which is about as secure as you’re going to get.

Yes, it’s a bit extreme, but the point is that all of the components there are possible threat vectors or information leakage channels.

Can we design and operate a system where we manage and mitigate the risks of threats and information leakage? Yes. That’s where we improve the security of a system. Is that a secure system? No, it’s not. What we’ve done is raise the bar, but we’ve not made it absolutely secure.

Part of the problem is that there’s just no way, these days[4], that any single person can be certain of the security of all parts of a system: they are just too many, and too complex. You may understand the application layer, but what about the virtualisation layer, for instance? I presented a simplified layer diagram in my post Isolationism a few months back, in which I listed the host as the bottom layer, but that was, of course, just asking for trouble. Along came Meltdown and Spectre, and now it’s clear (as if we didn’t know it already) that you should never ignore the fact that you can’t even trust the silicon you’re running on to do the thing you think it ought.

None of this, however, stops people and companies telling you that they’ll “secure your perimeter”, or provide you with “secure systems”. And it annoys me[5]. “We’ll help you secure your perimeter” isn’t too bad, but anything that suggests that you can have “secure systems” smacks to me of marketing – bad marketing.

So here you go: please stop using the word “secure” as an unqualified adjective or verb. We’re grown-ups, now, and we know it’s not real. So let’s not pretend.

Now – where was that well-cover? I need to deal with little Tommy.

1 – terrestrial/Middle Earth. I’m not sure about volcano temperatures on other planets or in the Undying Lands across the Western Sea.

2 – it should probably therefore be a disused well. Check there are no animals down there first[3]. In fact, before you throw anything down there.

3 – what’s that, Lassie? Little Tommy’s down the well? Well, I wonder whether little Tommy is waiting for us to throw the components down there so that he can do bad things. Bad Tommy.

4 – I’d like to think that maybe there was, once, in the distant past, but I’m probably kidding myself.

5 – you might be surprised at the number of things that annoy me[6].

6 – unless you’re my wife, in which case you probably won’t be[7].

7 – surprised. Or, in fact, reading this article.

Explained: five misused security words

Untangling responsibility, authority, authorisation, authentication and identification.

I took them out of the title, because otherwise it was going to be huge, with lots of polysyllabic words. You might, therefore, expect a complicated post – but that’s not my intention*. What I’d like to do it try to explain these five important concepts in security, as they’re often confused or bound up with one another. They are, however, separate concepts, and it’s important to be able to disentangle what each means, and how they might be applied in a system. Today’s words are:

responsibility
authority
authorisation
authentication
identification.

Let’s start with responsibility.

Responsibility

Confused with: function; authority.

If you’re responsible for something, it means that you need to do it, or if something goes wrong. You can be responsible for a product launching on time, or for the smooth functioning of a team. If we’re going to ensure we’re really clear about it, I’d suggest using it only for people. It’s not usually a formal description of a role in a system, though it’s sometimes used as short-hand for describing what a role does. This short-hand can be confusing. “The storage module is responsible for ensuring that writes complete transactionally” or “the crypto here is responsible for encrypting this set of bytes” is just a description of the function of the component, and doesn’t truly denote responsibility.

Also, just because you’re responsible for something doesn’t mean that you can make it happen. One of the most frequent confusions, then, is with authority. If you can’t ensure that something happens, but it’s your responsibility to make it happen, you have responsibility without authority***.

Authority

Confused with: responsibility, authorisation.

If you have authority over something, then you can make it happen****. This is another word which is best restricted to use about people. As noted above, it is possible to have authority but no responsibility*****.

Once we start talking about systems, phrases like “this component has the authority to kill these processes” really means “has sufficient privilege within the system”, and should best be avoided. What we may need to check, however, is whether a component should be given authorisation to hold a particular level of privilege, or to perform certain tasks.

Authorisation

Confused with: authority; authentication.

If a component has authorisation to perform a certain task or set of tasks, then it has been granted power within the system to do those things. It can be useful to think of roles and personae in this case. If you are modelling a system on personae, then you will wish to grant a particular role authorisation to perform tasks that, in real life, the person modelled by that role has the authority to do. Authorisation is an instantiation or realisation of that authority. A component is granted the authorisation appropriate to the person it represents. Not all authorisations can be so easily mapped, however, and may be more granular. You may have a file manager which has authorisation to change a read-only permission to read-write: something you might struggle to map to a specific role or persona.

If authorisation is the granting of power or capability to a component representing a person, the question that precedes it is “how do I know that I should grant that power or capability to this person or component?”. That process is authentication – authorisation should be the result of a successful authentication.

Authentication

Confused with: authorisation; identification.

If I’ve checked that you’re allowed to perform and action, then I’ve authenticated you: this process is authentication. A system, then, before granting authorisation to a person or component, must check that they should be allowed the power or capability that comes with that authorisation – that are appropriate to that role. Successful authentication leads to authorisation. Unsuccessful authentication leads to blocking of authorisation******.

With the exception of anonymous roles, the core of an authentication process is checking that the person or component is who he, she or it says they are, or claims to be (although anonymous roles can be appropriate for some capabilities within some systems). This checking of who or what a person or component is authentication, whereas the identification is the claim and the mapping of an identity to a role.

Identification

Confused with: authentication.

I can identify that a particular person exists without being sure that the specific person in front of me is that person. They may identify themselves to me – this is identification – and the checking that they are who they profess to be is the authentication step. In systems, we need to map a known identity to the appropriate capabilities, and the presentation of a component with identity allows us to apply the appropriate checks to instantiate that mapping.

Bringing it all together

Just because you know whom I am doesn’t mean that you’re going to let me do something. I can identify my children over the telephone*******, but that doesn’t mean that I’m going to authorise them to use my credit card********. Let’s say, however, that I might give my wife my online video account password over the phone, but not my children. How might the steps in this play out?

First of all, I have responsibility to ensure that my account isn’t abused. I also have authority to use it, as granted by the Terms and Conditions of the providing company (I’ve decided not to mention a particular service here, mainly in case I misrepresent their Ts&Cs).

“Hi, darling, it’s me, your darling wife*********. I need the video account password.” Identification – she has told me who she claims to be, and I know that such a person exists.

“Is it really you, and not one of the kids? You’ve got a cold, and sound a bit odd.” This is my trying to do authentication.

“Don’t be an idiot, of course it’s me. Give it to me or I’ll pour your best whisky down the drain.” It’s her. Definitely her.

“OK, darling, here’s the password: it’s il0v3myw1fe.” By giving her the password, I’ve performed authorisation.

It’s important to understand these different concepts, as they’re often conflated or confused, but if you can’t separate them, it’s difficult not only to design systems to function correctly, but also to log and audit the different processes as they occur.

*we’ll have to see how well I manage, however. I know that I’m prone to long-windedness**

**ask my wife. Or don’t.

***and a significant problem.

****in a perfect world. Sometimes people don’t do what they ought to.

*****this is much, much nicer than responsibility without authority.

******and logging. In both cases. Lots of logging. And possibly flashing lights, security guards and sirens on failure, if you’re into that sort of thing.

*******most of the time: sometimes they sound like my wife. This is confusing.

********neither should you assume that I’m going to let my wife use it, either.*********

*********not to suggest that she can’t use a credit card: it’s just that we have separate ones, mainly for logging purposes.

**********we don’t usually talk like this on the phone.

Why microservices are a security issue

Should you go about decomposing all of your legacy applications into microservices? Probably not. But given all of the benefits you can accrue, you might consider starting with your security functions.

I struggled with writing the title for this post, and I worry that it comes over as clickbait. If you’ve come to read this because it looked like clickbait, then sorry*. I hope you’ll stay anyway: there are lots of fascinating** posts and many*** footnotes. What I didn’t mean to suggest is that microservices cause security problems – though like any component, of course, they can – but that microservices are appropriate objects of interest to those involved with security. I’d go further than that: I think they are an excellent architectural construct for those concerned with security.

And why is that? Well, for those of us with a systems security bent, the world is an interesting place at the moment. We’re seeing a growth in distributed systems as bandwidth is cheap and latency low. Add to this the ease of deploying to the cloud, and more architects are beginning to realise that they can break up applications not just into multiple layers but also into multiple components within the layer. Load-balancers, of course, help with this when the various components in a layer are performing the same job, but the ability to expose different services as small components has led to a growth in the design, implementation and deployment of microservices.

So, what exactly is a microservice? I quite like the definition provided by Wikipedia, though it’s interesting that security isn’t mentioned there****. One of the points that I like about microservices is that, when well-designed, they conform to the first two points of Peter H. Salus’ description of the Unix philosophy:

Write programs that do one thing and do it well.
Write programs to work together.
Write programs to handle text streams, because that is a universal interface.

The last of the three is slightly less relevant, because the Unix philosophy is generally used to refer to standalone applications, which often have a command instantiation. It does, however, encapsulate one of the basic requirements of microservices: that they must have well-defined interfaces.

By “well-defined”, I don’t just mean a description of any externally-accessible APIs’ methods, but also of the normal operation of the microservice: inputs and outputs – and, if there are any, side-effects. As I’ve described in a previous post, Thinking like a (systems) architect, data and entity descriptions are crucial if you’re going to be able to design a system. Here, in our description of microservices, we get to see why these are so important, because for me the key defining feature of a microservices architecture is decomposability. And if you’re going to decompose***** your architecture, you need to be very, very clear which “bits” (components) are going to do what.

And here’s where security starts to come in. A clear description of what a particular component should be doing allows you to:

check your design;
ensure that your implementation meets the description;
come up with reusable unit tests to check functionality;
track mistakes in implementation and correct them;
test for unexpected outcomes;
monitor for misbehaviour;
audit actual behaviour for future scrutiny.

Now, are all these things possible in a larger architecture? Yes, they are. But they becoming increasingly difficult where entities are chained together – or combined in more complex configurations. Ensuring correct implementation and behaviour is much, much easier when you’ve got smaller pieces to work together. And deriving complex systems behaviours – and misbehaviours – is much more difficult if you can’t be sure that the individual components are doing what they ought to be.

It doesn’t stop here, however. As I’ve mentioned on many previous occasions in this blog, writing good security code is difficult*******. Proving that it does what it should do is even more so. There is every reason, therefore, to restrict code which has particular security requirements – password checking, encryption, cryptographic key management, authorisation, to offer a few examples – to small, well-defined blocks. You can then do all the things that I’ve mentioned above to try to make sure that it’s done correctly.

And yet there’s more. We all know that not everybody is great at writing security-related code. By decomposing your architecture such that all security-sensitive code is restricted to well-defined components, you get the chance to put your best security people on that, and restricting the danger of J. Random Coder******** putting something in which bypasses or downgrades a key security control.

It can also act as an opportunity for learning: it’s always good to be able to point to a design/implementation/test/monitoring tuple and say: “that’s how it should be done. Hear, read, mark, learn and inwardly digest*********.”

*well, a little bit – it’s always nice to have readers.

**I know they are: I wrote them.

***probably less fascinating.

****at the time of writing this article. It’s entirely possible that I – or one of you – may edit the article to change that.

*****this sounds like a gardening term, which is interesting. Not that I really like gardening, but still******.

******amusingly, I first wrote “…if you’re going to decompose your architect…”, which sounds like the strap-line for an IT-themed murder film.

*******regular readers may remember a reference to the excellent film “The Thick of It”.

********other generic personae exist: please take your pick.

*********not a cryptographic digest: I don’t think that’s what the original writers had in mind.

Isolationism

… what’s the fun in having an Internet if you can’t, well, “net” on it?

Sometimes – and I hope this doesn’t come as too much of a surprise to my readers – sometimes, there are bad people, and they do bad things with computers. These bad things are often about stopping the good things that computers are supposed to be doing* from happening properly. This is generally considered not to be what you want to happen**.

For this reason, when we architect and design systems, we often try to enforce isolation between components. I’ve had a couple of very interesting discussions over the past week about how to isolate various processes from each other, using different types of isolation, so I thought it might be interesting to go through some of the different types of isolation that we see out there. For the record, I’m not an expert on all different types of system, so I’m going to talk some history****, and then I’m going to concentrate on Linux*****, because that’s what I know best.

In the beginning

In the beginning, computers didn’t talk to one another. It was relatively difficult, therefore, for the bad people to do their bad things unless they physically had access to the computers themselves, and even if they did the bad things, the repercussions weren’t very widespread because there was no easy way for them to spread to other computers. This was good.

Much of the conversation below will focus on how individual computers act as hosts for a variety of different processes, so I’m going to refer to individual computers as “hosts” for the purposes of this post. Isolation at this level – host isolation – is still arguably the strongest type available to us. We typically talk about “air-gapping”, where there is literally an air gap – no physical network connection – between one host and another, but we also mean no wireless connection either. You might think that this is irrelevant in the modern networking world, but there are classes of usage where it is still very useful, the most obvious being for Certificate Authorities, where the root certificate is so rarely accessed – and so sensitive – that there is good reason not to connect the host on which it is stored to be connected to any other computer, and to use other means, such as smart-cards, a printer, or good old pen and paper to transfer information from it.

And then…

And then came networks. These allow hosts to talk to each other. In fact, by dint of the Internet, pretty much any host can talk to any other host, given a gateway or two. So along came network isolation to try to stop tha. Network isolation is basically trying to re-apply host isolation, after people messed it up by allowing hosts to talk to each other******.

Later, some smart alec came up with the idea of allowing multiple processes to be on the same host at the same time. The OS and kernel were trusted to keep these separate, but sometimes that wasn’t enough, so then virtualisation came along, to try to convince these different processes that they weren’t actually executing alongside anything else, but had their own environment to do their old thing. Sadly, the bad processes realised this wasn’t always true and found ways to get around this, so hardware virtualisation came along, where the actual chips running the hosts were recruited to try to convince the bad processes that they were all alone in the world. This should work, only a) people don’t always program the chips – or the software running on them – properly, and b) people decided that despite wanting to let these processes run as if they were on separate hosts, they also wanted them to be able to talk to processes which really were on other hosts. This meant that networking isolation needed to be applied not just at the host level, but at the virtual host level, as well******.

A step backwards?

Now, in a move which may seem retrograde, it occurred to some people that although hardware virtualisation seemed like a great plan, it was also somewhat of a pain to administer, and introduced inefficiencies that they didn’t like: e.g. using up lots of RAM and lots of compute cycles. These were often the same people who were of the opinion that processes ought to be able to talk to each other – what’s the fun in having an Internet if you can’t, well, “net” on it? Now we, as security folks, realise how foolish this sounds – allowing processes to talk to each other just encourages the bad people, right? – but they won the day, and containers came along. Containers allow lots of processes to be run on a host in a lightweight way, and rely on kernel controls – mainly namespaces – to ensure isolation********. In fact, there’s more you can do: you can use techniques like system call trapping to intercept the things that processes are attempting and stop them if they look like the sort of things they shouldn’t be attempting*********.

And, of course, you can write frameworks at the application layer to try to control what the different components of an application system can do – that’s basically the highest layer, and you’re just layering applications on applications at this point.

Systems thinking

So here’s where I get to the chance to mention one of my favourite topics: systems. As I’ve said before, by “system” here I don’t mean an individual computer (hence my definition of host, above), but a set of components that work together. The thing about isolation is that it works best when applied to a system.

Let me explain. A system, at least as I’d define it for the purposes of this post, is a set of components that work together but don’t have knowledge of external pieces. Most important, they don’t have knowledge of different layers below them. Systems may impose isolation on applications at higher layers, because they provide abstractions which allow higher systems to be able to ignore them, but by virtue of that, systems aren’t – or shouldn’t be – aware of the layers below them.

A simple description of the layers – and it doesn’t always hold, partly because networks are tricky things, and partly because there are various ways to assemble the stack – may look like this.

Application (top layer)
Container
System trapping
Kernel
Hardware virtualisation
Networking
Host (bottom layer)

As I intimated above, this is a (gross) simplification, but the point holds that the basic rule is that you can enforce isolation upwards in the layers of the stack, but you can’t enforce it downwards. Lower layer isolation is therefore generally stronger than higher layer isolation. This shouldn’t come as a huge surprise to anyone who’s used to considering network stacks – the principle is the same – but it’s helpful to lay out and explain the principles from time to time, and the implications for when you’re designing and architecting.

Because if you are considering trust models and are defining trust domains, you need to be very, very careful about defining whether – and how – these domains spread across the layer boundaries. If you miss a boundary out when considering trust domains, you’ve almost certainly messed up, and need to start again. Trust domains are important in this sort of conversation because the boundaries between trust domains are typically where you want to be able to enforce and police isolation.

The conversations I’ve had recently basically ran into problems because what people really wanted to do was apply lower layer isolation from layers above which had no knowledge of the bottom layers, and no way to reach into the control plane for those layers. We had to remodel, and I think that we came up with some sensible approaches. It was as I was discussing these approaches that it occurred to me that it would have been a whole lot easier to discuss them if we’d started out with a discussion of layers: hence this blog post. I hope it’s useful.

*although they may well not be, because, as I’m pretty sure I’ve mentioned before on this blog, the people trying to make the computers do the good things quite often get it wrong.

**unless you’re one of the bad people. But I’m pretty sure they don’t read this blog, so we’re OK***.

***if you are a bad person, and you read this blog, would you please mind pretending, just for now, that you’re a good person? Thank you. It’ll help us all sleep much better in our beds.

****which I’m absolutely going to present in an order that suits me, and generally neglect to check properly. Tough.

*****s/Linux/GNU Linux/g; Natch.

******for some reason, this seemed like a good idea at the time.

*******for those of you who are paying attention, we’ve got to techniques like VXLAN and SR-IOV.

********kernel purists will try to convince you that there’s no mention of containers in the Linux kernel, and that they “don’t really exist” as a concept. Try downloading the kernel source and doing a search for “container” if you want some ammunition to counter such arguments.

*********this is how SELinux works, for instance.