… what’s the fun in having an Internet if you can’t, well, “net” on it?

Sometimes – and I hope this doesn’t come as too much of a surprise to my readers – sometimes, there are bad people, and they do bad things with computers.  These bad things are often about stopping the good things that computers are supposed to be doing* from happening properly.  This is generally considered not to be what you want to happen**.

For this reason, when we architect and design systems, we often try to enforce isolation between components.  I’ve had a couple of very interesting discussions over the past week about how to isolate various processes from each other, using different types of isolation, so I thought it might be interesting to go through some of the different types of isolation that we see out there.  For the record, I’m not an expert on all different types of system, so I’m going to talk some history****, and then I’m going to concentrate on Linux*****, because that’s what I know best.

In the beginning

In the beginning, computers didn’t talk to one another.  It was relatively difficult, therefore, for the bad people to do their bad things unless they physically had access to the computers themselves, and even if they did the bad things, the repercussions weren’t very widespread because there was no easy way for them to spread to other computers.  This was good.

Much of the conversation below will focus on how individual computers act as hosts for a variety of different processes, so I’m going to refer to individual computers as “hosts” for the purposes of this post.  Isolation at this level – host isolation – is still arguably the strongest type available to us.  We typically talk about “air-gapping”, where there is literally an air gap – no physical network connection – between one host and another, but we also mean no wireless connection either.  You might think that this is irrelevant in the modern networking world, but there are classes of usage where it is still very useful, the most obvious being for Certificate Authorities, where the root certificate is so rarely accessed – and so sensitive – that there is good reason not to connect the host on which it is stored to be connected to any other computer, and to use other means, such as smart-cards, a printer, or good old pen and paper to transfer information from it.

And then…

And then came networks.  These allow hosts to talk to each other.  In fact, by dint of the Internet, pretty much any host can talk to any other host, given a gateway or two.  So along came network isolation to try to stop tha.  Network isolation is basically trying to re-apply host isolation, after people messed it up by allowing hosts to talk to each other******.

Later, some smart alec came up with the idea of allowing multiple processes to be on the same host at the same time.  The OS and kernel were trusted to keep these separate, but sometimes that wasn’t enough, so then virtualisation came along, to try to convince these different processes that they weren’t actually executing alongside anything else, but had their own environment to do their old thing.  Sadly, the bad processes realised this wasn’t always true and found ways to get around this, so hardware virtualisation came along, where the actual chips running the hosts were recruited to try to convince the bad processes that they were all alone in the world.  This should work, only a) people don’t always program the chips – or the software running on them – properly, and b) people decided that despite wanting to let these processes run as if they were on separate hosts, they also wanted them to be able to talk to processes which really were on other hosts.  This meant that networking isolation needed to be applied not just at the host level, but at the virtual host level, as well******.

A step backwards?

Now, in a move which may seem retrograde, it occurred to some people that although hardware virtualisation seemed like a great plan, it was also somewhat of a pain to administer, and introduced inefficiencies that they didn’t like: e.g. using up lots of RAM and lots of compute cycles.  These were often the same people who were of the opinion that processes ought to be able to talk to each other – what’s the fun in having an Internet if you can’t, well, “net” on it?  Now we, as security folks, realise how foolish this sounds – allowing processes to talk to each other just encourages the bad people, right? – but they won the day, and containers came along. Containers allow lots of processes to be run on a host in a lightweight way, and rely on kernel controls – mainly namespaces – to ensure isolation********.  In fact, there’s more you can do: you can use techniques like system call trapping to intercept the things that processes are attempting and stop them if they look like the sort of things they shouldn’t be attempting*********.

And, of course, you can write frameworks at the application layer to try to control what the different components of an application system can do – that’s basically the highest layer, and you’re just layering applications on applications at this point.

Systems thinking

So here’s where I get to the chance to mention one of my favourite topics: systems.  As I’ve said before, by “system” here I don’t mean an individual computer (hence my definition of host, above), but a set of components that work together.  The thing about isolation is that it works best when applied to a system.

Let me explain.  A system, at least as I’d define it for the purposes of this post, is a set of components that work together but don’t have knowledge of external pieces.  Most important, they don’t have knowledge of different layers below them.  Systems may impose isolation on applications at higher layers, because they provide abstractions which allow higher systems to be able to ignore them, but by virtue of that, systems aren’t – or shouldn’t be – aware of the layers below them.

A simple description of the layers – and it doesn’t always hold, partly because networks are tricky things, and partly because there are various ways to assemble the stack – may look like this.

Application (top layer)
System trapping
Hardware virtualisation
Host (bottom layer)

As I intimated above, this is a (gross) simplification, but the point holds that the basic rule is that you can enforce isolation upwards in the layers of the stack, but you can’t enforce it downwards.  Lower layer isolation is therefore generally stronger than higher layer isolation.   This shouldn’t come as a huge surprise to anyone who’s used to considering network stacks – the principle is the same – but it’s helpful to lay out and explain the principles from time to time, and the implications for when you’re designing and architecting.

Because if you are considering trust models and are defining trust domains, you need to be very, very careful about defining whether – and how – these domains spread across the layer boundaries.  If you miss a boundary out when considering trust domains, you’ve almost certainly messed up, and need to start again.  Trust domains are important in this sort of conversation because the boundaries between trust domains are typically where you want to be able to enforce and police isolation.

The conversations I’ve had recently basically ran into problems because what people really wanted to do was apply lower layer isolation from layers above which had no knowledge of the bottom layers, and no way to reach into the control plane for those layers.  We had to remodel, and I think that we came up with some sensible approaches.  It was as I was discussing these approaches that it occurred to me that it would have been a whole lot easier to discuss them if we’d started out with a discussion of layers: hence this blog post.  I hope it’s useful.


*although they may well not be, because, as I’m pretty sure I’ve mentioned before on this blog, the people trying to make the computers do the good things quite often get it wrong.

**unless you’re one of the bad people. But I’m pretty sure they don’t read this blog, so we’re OK***.

***if you are a bad person, and you read this blog,  would you please mind pretending, just for now, that you’re a good person?  Thank you.  It’ll help us all sleep much better in our beds.

****which I’m absolutely going to present in an order that suits me, and generally neglect to check properly.  Tough.

*****s/Linux/GNU Linux/g; Natch.

******for some reason, this seemed like a good idea at the time.

*******for those of you who are paying attention, we’ve got to techniques like VXLAN and SR-IOV.

********kernel purists will try to convince you that there’s no mention of containers in the Linux kernel, and that they “don’t really exist” as a concept.  Try downloading the kernel source and doing a search for “container” if you want some ammunition to counter such arguments.

*********this is how SELinux works, for instance.

Thinking like a (systems) architect

“…I know it when I see it, and the motion picture involved in this case is not that.” – Mr Justice Stewart.

My very first post on this blog, some six months ago*, was entitled “Systems security – why it matters“, and it turns out that posts where I talk** about architecture are popular, so I thought I’d go a bit further into what I think being an architect is about.  First of all, I’d like to reference two books which helped me come to some sort of understanding about the art of being an architect.  I read them a long time ago****, but I still dip into them from time to time.  I’m going to link to the publisher’s***** website, rather than to any particular bookseller:

What’s interesting about them is that they both have multiple points of view expressed in them: some contradictory – even within each book.  And this rather reflects the fact that I believe that being a systems architect is an art, or a discipline.  Different practitioners will have different views about it.  You can talk about Computer Science being a hard science, and there are parts of it which are, but much of software engineering (lower case intentional) goes beyond that.  The same, I think, is even more true for systems architecture: you may be able to grok what it is once you know it, but it’s very difficult to point to something – even a set of principles – and say, “that is systems architecture”.  Sometimes, the easiest way to define something is by defining what it’s not: search for “I know it when I see it, and the motion picture involved in this case is not that.”******

Let me, however, try to give some examples of the sort of things you should expect to see when someone (or a group of people) is doing good systems architecture:

  • pictures: if you can’t show the different components of a system in a picture, I don’t believe that you can fully describe what each does, or how they interact.  If you can’t separate them out, you don’t have a properly described system, so you have no architecture.  I know that I’m heavily visually oriented, but for me this feels like a sine qua non********.
  • a data description: if you don’t know what data is in your system, you don’t know what it does.
  • an entity description: components, users, printers, whatever: you need to know what’s doing what so that you can describe what the what is that’s being done to it, and what for*********.
  • an awareness of time: this may sound like a weird one, but all systems (of any use) process data through time.  If you don’t think about what will change, you won’t understand what will do the changing, and you won’t be able to consider what might go wrong if things get changed in ways you don’t expect, or by components that shouldn’t be doing the changing in the first place.
  • some thinking on failure modes: I’ve said it before, and I’ll say it again: “things will go wrong.”  You can’t be expected to imagine all the things that might go wrong, but you have a responsibility to consider what might happen to different components and data – and therefore to the operation of the system of the whole – if********** they fall over.

There are, of course, some very useful tools and methodologies (the use of UML views is a great example) which can help you with all of these.  But you don’t need to be an expert in all of them – or even any one of them – to be a good systems architect.

One last thing I’d add, though, and I’m going to call it the “bus and amnesiac dictum”***********:

  • In six months’ time, you’ll have forgotten the details or been hit by a bus: document it.  All of it.

You know it makes sense.

*this note is for nothing other than to catch those people who go straight to this section, hoping for a summary of the main article.  You know who you are, and you’re busted.

**well, write, I guess, but it feels like I’m chatting at people, so that’s how I think of it***

***yes, I’m going out of my way to make the notes even less info-filled than usual.  Deal with it.

****seven years is a long time, right?

*****almost “of course”, though I believe they’re getting out of the paper book publishing biz, sadly.

******this is a famous comment from a case called from the U.S. called “Jacobellis v. Ohio” which I’m absolutely not going to quote in full here, because although it might generate quite a lot of traffic, it’s not the sort of traffic I want on this blog*******.

*******I did some searching found the word: it’s apophasis.  I love that word.  Discovered it during some study once, forgot it.  Glad to have re-found it.

********I know: Greek and Latin in one post.  Sehr gut, ja?

********.*I realise that this is a complete mess of a sentence.  But it does have a charm to it, yes?  And you know what it means, you really do.

**********when.  I meant “when”.

***********more Latin.

Why I love technical debt

… if technical debt can be named, then it can be documented.

Not necessarily a title you’d expect for a blog post, I guess*, but I’m a fan of technical debt.  There are two reasons for this: a Bad Reason[tm] and a Good Reason[tm].  I’ll be up front about the Bad Reason first, and then explain why even that isn’t really a reason to love it.  I’ll then tackle the Good Reason, and you’ll nod along in agreement.

The Bad Reason I love technical debt

We’ll get this out of the way, then, shall we?  The Bad Reason is that, well, there’s just lots of it, it’s interesting, it keeps me in a job, and it always provides a reason, as a security architect, for me to get involved in** projects that might give me something new to look at.  I suppose those aren’t all bad things, and it can also be a bit depressing, because there’s always so much of it, it’s not always interesting, and sometimes I need to get involved even when I might have better things to do.

And what’s worse is that it almost always seems to be security-related, and it’s always there.  That’s the bad part.

Security, we all know, is the piece that so often gets left out, or tacked on at the end, or done in half the time that it deserves, or done by people who have half an idea, but don’t quite fully grasp it.  I should be clear at this point: I’m not saying that this last reason is those people’s fault.  That people know they need security it fantastic.  If we (the security folks) or we (the organisation) haven’t done a good enough job in making security resources – whether people, training or sufficient visibility – available to those people who need it, then the fact that they’re trying is a great, and something we can work on.  Let’s call that a positive.  Or at least a reason for hope***.

The Good Reason I love technical debt

So let’s get on to the other reason: the legitimate reason.  I love technical debt when it’s named.

What does that mean?

So, we all get that technical debt is a bad thing.  It’s what happens when you make decisions for pragmatic reasons which are likely to come back and bite you later in a project’s lifecycle.  Here are a few classic examples that relate to security:

  • not getting round to applying authentication or authorisation controls on APIs which might at some point be public;
  • lumping capabilities together so that it’s difficult to separate out appropriate roles later on;
  • hard-coding roles in ways which don’t allow for customisation by people who may use your application in different ways to those that you initially considered;
  • hard-coding cipher suites for cryptographic protocols, rather than putting them in a config file where they can be changed or selected later.

There are lots more, of course, but those are just a few which jump out at me, and which I’ve seen over the years.  Technical debt means making decisions which will mean more work later on to fix them.  And that can’t be good, can it?

Well, there are two words in the preceding paragraph or two which should make us happy: they are “decisions” and “pragmatic”.  Because in order for something to be named technical debt, I’d argue, it needs to have been subject to conscious decision-making, and for trade-offs to have been made – hopefully for rational reasons.  Those reasons may be many and various – lack of qualified resources; project deadlines; lack of sufficient requirement definition – but if they’ve been made consciously, then the technical debt can be named, and if technical debt can be named, then it can be documented.

And if they’re documented, then we’re half-way there.  As a security guy, I know that I can’t force everything that goes out of the door to meet all the requirements I’d like – but the same goes for the High Availability gal, the UX team, the performance folks, et al..  What we need – what we all need – is for there to exist documentation about why decisions were made, because then, when we return to the problem later on, we’ll know that it was thought about.  And, what’s more, the recording of that information might even make it into product documentation, too.  “This API is designed to be used in a protected environment, and should not be exposed on the public Internet” is a great piece of documentation.  It may not be what a customer is looking for, but at least they know how to deploy the product now, and, crucially, it’s an opportunity for them to come back to the product manager and say, “we’d really like to deploy that particular API in this way: could you please add this as a feature request?”.  Product managers like that.  Very much****.

The best thing, though, is not just that named technical debt is visible technical debt, but that if you encourage your developers to document the decisions in code*****, then there’s a half-way to decent chance that they’ll record some ideas about how this should be done in the future.  If you’re really lucky, they might even add some hooks in the code to make it easier (an “auth” parameter on the API which is unused in the current version, but will make API compatibility so much simpler in new releases; or cipher entry in the config file which only accepts one option now, but is at least checked by the code).

I’ve been a bit disingenuous, I know, by defining technical debt as named technical debt.  But honestly, if it’s not named, then you can’t know what it is, and until you know what it is, you can’t fix it*******.  My advice is this: when you’re doing a release close-down (or in your weekly stand-up: EVERY weekly stand-up), have an item to record technical debt.  Name it, document it, be proud, sleep at night.


*well, apart from the obvious clickbait reason – for which I’m (a little) sorry.

**I nearly wrote “poke my nose into”.

***work with me here.

****if you’re software engineer/coder/hacker – here’s a piece of advice: learn to talk to product managers like real people, and treat them nicely.  They (the better ones, at least) are invaluable allies when you need to prioritise features or have tricky trade-offs to make.

*****do this.  Just do it.  Documentation which isn’t at least mirrored in code isn’t real documentation******.

******don’t believe me?  Talk to developers.  “Who reads product documentation?”  “Oh, the spec?  I skimmed it.  A few releases back.  I think.”  “I looked in the header file: couldn’t see it there.”

*******or decide not to fix it, which may also be an entirely appropriate decision.

The simple things, sometimes…

I (re-)learned an important lesson this week: if you’re an attacker, start at the front door.

This week I’ve had an interesting conversation with an organisation with which I’m involved*.  My involvement is as a volunteer, and has nothing to do with my day job – in other words, I have nothing to do with the organisation’s security.  However, I got an email from them telling me that in order to perform a particular action, I should now fill in an online form, which would then record the information that they needed.

So this week’s blog entry will be about entering information on an online form.  One of the simplest tasks that you might want to design – and secure – for any website.  I wish I could say that it’s going to be a happy tale.

I had look at this form, and then I looked at the URL they’d given me.  It wasn’t a fully qualified URL, in that it had no protocol component, so I copied and pasted it into a browser to find out what would happen. I had a hope that it might automatically redirect to an https-served page.  It didn’t.  It was an http-served page.

Well, not necessarily so bad, except that … it wanted some personal information.  Ah.

So, I cheated: I changed the http:// … to an https:// and tried again**.  And got an error.  The certificate was invalid.  So even if they changed the URL, it wasn’t going to help.

So what did I do?  I got in touch with my contact at the organisation, advising them that there was a possibility that they might be in breach of their obligations under Data Protection legislation.

I got a phone call a little later.  Not from a technical person – though there was a techie in the background.  They said that they’d spoken with the IT and security departments, and that there wasn’t a problem.  I disagreed, and tried to explain.

The first argument was whether there was any confidential information being entered.  They said that there was no linkage between the information being entered and the confidential information held in a separate system (I’m assuming database). So I stepped back, and asked about the first piece of information requested on the form: my name.  I tried a question: “Could the fact that I’m a member of this organisation be considered confidential in any situation?”

“Yes, it could.”

So, that’s one issue out of the way.

But it turns out that the information is stored encrypted on the organisation’s systems.  “Great,” I said, “but while it’s in transit, while it’s being transmitted to those systems, then somebody could read it.”

And this is where communication stopped.  I tried to explain that unless the information from the form is transmitted over https, then people could read it.  I tried to explain that if I sent it over my phone, then people at my mobile provider could read it.  I tried a simple example: I tried to explain that if I transmitted it from a laptop in a Starbucks, then people who run the Starbucks systems – or even possibly other Starbucks customers – could see it.  But I couldn’t get through.

In the end, I gave up.  It turns out that I can avoid using the form if I want to.  And the organisation is of the firm opinion that it’s not at risk: that all the data that is collected is safe.  It was quite clear that I wasn’t going to have an opportunity to argue this with their IT or security people: although I did try to explain that this is an area in which I have some expertise, they’re not going to let any Tom, Dick or Harry*** bother their IT people****.

There’s no real end to this story, other than to say that sometimes it’s the small stuff we need to worry about.  The issues that, as security professionals, we feel are cut and dried, are sometimes the places where there’s still lots of work to be done.  I wish it weren’t the case, because frankly, I’d like to spend my time educating people on the really tricky things, and explaining complex concepts around cryptographic protocols, trust domains and identity, but I (re-)learned an important lesson this week: if you’re an attacker, start at the front door.  It’s probably not even closed: let alone locked.

*I’m not going to identify the organisation: it wouldn’t be fair or appropriate.  Suffice to say that they should know about this sort of issue.

**I know: all the skillz!

***Or “J. Random User”.  Insert your preferred non-specific identifier here.

****I have some sympathy with this point of view: you don’t want to have all of their time taken up by random “experts”.  The problem is when there really _are_ problems.  And the people calling them maybe do know their thing.

“Zero-trust”: my love/hate relationship

… “explicit-trust networks” really is a much better way of describing what’s going on here.

A few weeks ago, I wrote a post called “What is trust?”, about how we need to be more precise about what we mean when we talk about trust in IT security.  I’m sure it’s case of confirmation bias*, but since then I’ve been noticing more and more references to “zero-trust networks”.  This both gladdens and annoys me, a set of conflicting emotions almost guaranteed to produce a new blog post.

Let’s start with the good things about the term.  “Zero-trust networks” are an attempt to describe an architectural approach which address the disappearance of macro-perimeters within the network.  In other words, people have realised that putting up a firewall or two between one network and another doesn’t have a huge amount of effect when traffic flows across an organisation – or between different organisations – are very complex and don’t just follow one or two easily defined – and easily defended – routes.  This problem is exacerbated when the routes are not only multiple – but also virtual.  I’m aware that all network traffic is virtual, of course, but in the old days**, even if you had multiple routing rules, ingress and egress of traffic all took place through a single physical box, which meant that this was a good place to put controls***.

These days (mythical as they were) have gone.  Not only do we have SDN (Software-Defined Networking) moving packets around via different routes willy-nilly, but networks are overwhelmingly porous.  Think about your “internal network”, and tell me that you don’t have desktops, laptops and mobile phones connected to it which have multiple links to other networks which don’t go through your corporate firewall.  Even if they don’t******, when they leave your network and go home for the night, those laptops and mobile phones – and those USB drives that were connected to the desktop machines – are free to roam the hinterlands of the Internet******* and connect to pretty much any system they want.

And it’s not just end-point devices, but components of the infrastructure which are much more likely to have – and need – multiple connections to different other components, some of which may be on your network, and some of which may not.  To confuse matters yet further, consider the “Rise of the Cloud”, which means that some of these components may start on “your” network, but may migrate – possibly in real time – to a completely different network.  The rise of micro-services (see my recent post describing the basics of containers) further exacerbates the problem, as placement of components seems to become irrelevant, so you have an ever-growing (and, if you’re not careful, exponentially-growing) number of flows around the various components which comprise your application infrastructure.

What the idea of “zero-trust networks” says about this – and rightly – is that a classical, perimeter-based firewall approach becomes pretty much irrelevant in this context.  There are so many flows, in so many directions, between so many components, which are so fluid, that there’s no way that you can place firewalls between all of them.  Instead, it says, each component should be responsible for controlling the data that flows in and out of itself, and should that it has no trust for any other component with which it may be communicating.

I have no problem with the starting point for this – which is as far as some vendors and architects take it: all users should always be authenticated to any system, and auhorised before they access any service provided by that system. In fact, I’m even more in favour of extending this principle to components on the network: it absolutely makes sense that a component should control access its services with API controls.  This way, we can build distributed systems made of micro-services or similar components which can be managed in ways which protect the data and services that they provide.

And there’s where the problem arises.  Two words: “be managed”.

In order to make this work, there needs to be one or more policy-dictating components (let’s call them policy engines) from which other components can derive their policy for enforcing controls.  The client components must have a level of trust in these policy engines so that they can decide what level of trust they should have in the other components with which they communicate.

This exposes a concomitant issue: these components are not, in fact, in charge of making the decisions about who they trust – which is how “zero-trust networks” are often defined.  They may be in charge of enforcing these decisions, but not the policy with regards to the enforcement.  It’s like a series of military camps: sentries may control who enters and exits (enforcement), but those sentries apply orders that they’ve been given (policies) in order to make those decisions.

Here, then, is what I don’t like about “zero-trust networks” in a few nutshells:

  1. although components may start from a position of little trust in other components, that moves to a position of known trust rather than maintaining a level of “zero-trust”
  2. components do not decide what other components to trust – they enforce policies that they have been given
  3. components absolutely do have to trust some other components – the policy engines – or there’s no way to bootstrap the system, nor to enforce policies.

I know it’s not so snappy, but “explicit-trust networks” really is a much better way of describing what’s going on here.  What I do prefer about this description is it’s a great starting point to think about trust domains.  I love trust domains, because they allow you to talk about how to describe shared policy between various components, and that’s what you really want to do in the sort of architecture that’s I’ve talked about above.  Trust domains allow you to talk about issues such as how placement of components is often not irrelevant, about how you bootstrap your distributed systems, about how components are not, in the end, responsible for making decisions about how much they trust other components, or what they trust those other components to do.

So, it looks like I’m going to have to sit down soon and actually write about trust domains.  I’ll keep you posted.


*one of my favourite cognitive failures

**the mythical days that my children believe in, where people have bouffant hairdos, the Internet could fit on a single Winchester disk, and Linux Torvalds still lived in Finland.

***of course, there was no such perfect time – all I should need to say to convince you is one word: “Joshua”****

****yes, this is another filmic***** reference.

*****why, oh why doesn’t my spell-checker recognise this word?

******or you think they don’t – they do.

*******and the “Dark Web”: ooooOOOOoooo.

Service degradation: actually a good thing

…here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.

Well, not all the time, obviously*.  But bear with me: we spend most of our time ensuring that all of our systems are up and secure and working as expected, because that’s what we hope for, but there’s a real argument for not only finding out what happens when they don’t, and not just planning for when they don’t, but also planning for how they shouldn’t.  Let’s start by examining some techniques for how we might do that.

Part 1 – planning

There’s a story** that the oil company Shell, in the 1970’s, did some scenario planning that examined what were considered, at the time, very unlikely events, and which allowed it to react when OPEC’s strategy surprised most of the rest of the industry a few years later.  Sensitivity modelling is another technique that organisations use at the financial level to understand what impact various changes – in order fulfilment, currency exchange or interest rates, for instance – make to the various parts of their business.  Yet another is war gaming, which the military use to try to understand what will happen when failures occur: putting real people and their associated systems into situations and watching them react.  And Netflix are famous for taking this a step further in the context of the IT world and having a virtual Chaos Monkey (a set of processes and scripts) which they use to bring down parts of their systems in real time to allow them to understand how resilient they the wider system is.

So that gives us four approaches that are applicable, with various options for automation:

  1. scenario planning – trying to understand what impact large scale events might have on your systems;
  2. sensitivity planning – modelling the impact on your systems of specific changes to the operating environment;
  3. wargaming – putting your people and systems through simulated events to see what happens;
  4. real outages – testing your people and systems with actual events and failures.

Actually going out of your way to sabotage your own systems might seem like insane behaviour, but it’s actually a work of genius.  If you don’t plan for failure, what are you going to do when it happens?

So let’s say that you’ve adopted all of these practices****: what are you going to do with the information?  Well, there are some obvious things you can do, such as:

  • removing discovered weaknesses;
  • improving resilience;
  • getting rid of single points of failure;
  • ensuring that you have adequately trained staff;
  • making sure that your backups are protected, but available to authorised entities.

I won’t try to compile an exhaustive list, because there are loads books and articles and training courses about this sort of thing, but there’s another, maybe less obvious, course of action which I believe we must take, and that’s plan for managed degradation.

Part 2 – managed degradation

What do I mean by that?  Well, it’s simple.  We***** are trained and indoctrinated to take the view that if something fails, it must always “fail to safe” or “fail to secure”.  If something stops working right, it should stop working at all.

There’s value in this approach, of course there is, and we’re paid****** to ensure everything is secure, right?  Wrong.  We’re actually paid to help keep the business running, and here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.  Crazy, right?  “The business” want to keep making money and servicing customers even if things aren’t perfectly secure!  Don’t they know the risks?

And the answer to that question is “no”.  They don’t know the risks.  And that’s our real job: we need to explain the risks and the mitigations, and allow a balancing act to take place.  In fact, we’re always making those trade-offs and managing that balance – after all, the only truly secure computer is one with no network connection, no keyboard, no mouse and no power connection*******.  But most of the time, we don’t need to explain the decisions we make around risk: we just take them, following best industry practice, regulatory requirements and the rest.  Nor are the trade-offs usually so stark, because when failure strikes – whether through an attack, accident or misfortune – it’s often a pretty simple choice between maintaining a particular security posture and keeping the lights on.  So we need to think about and plan for some degradation, and realise that on occasion, we may need to adopt a different security posture to the perfect (or at least preferred) one in which we normally operate.

How would we do that?  Well, the approach I’m advocating is best described as “managed degradation”.  We allow our systems – including, where necessary our security systems – to degrade to a managed (and preferably planned) state, where we know that they’re not operating at peak efficiency, but where they are operating.  Key, however, is that we know the conditions under which they’re working, so we understand their operational parameters, and can explain and manage the risks associated with this new posture.  That posture may change, in response to ongoing events, and the systems and our responses to those events, so we need to plan ahead (using the techniques I discussed above) so that we can be flexible enough to provide real resiliency.

We need to find modes of operation which don’t expose the crown jewels******** of the business, but do allow key business operations to take place.  And those key business operations may not be the ones we expect – maybe it’s more important to be able to create new orders than to collect payments for them, for instance, at least in the short term.  So we need to discuss the options with the business, and respond to their needs.  This planning is not just security resiliency planning: it’s business resiliency planning.  We won’t be able to consider all the possible failures – though the techniques I outlined above will help us to identify many of them – but the more we plan for, the better we will be at reacting to the surprises.  And, possibly best of all, we’ll be talking to the business, informing them, learning from them, and even, maybe just a bit, helping them understand that the job we do does have some value after all.

*I’m assuming that we’re the Good Guys/Gals**.

**Maybe less story than MBA*** case study.

***There’s no shame in it.

****Well done, by the way.

*****The mythical security community again – see past posts.


*******Preferably at the bottom of a well, encased in concrete, with all storage already removed and destroyed.

********Probably not the actual Crown Jewels, unless you work at the Tower of London.