Why I love technical debt

… if technical debt can be named, then it can be documented.

Not necessarily a title you’d expect for a blog post, I guess*, but I’m a fan of technical debt.  There are two reasons for this: a Bad Reason[tm] and a Good Reason[tm].  I’ll be up front about the Bad Reason first, and then explain why even that isn’t really a reason to love it.  I’ll then tackle the Good Reason, and you’ll nod along in agreement.

The Bad Reason I love technical debt

We’ll get this out of the way, then, shall we?  The Bad Reason is that, well, there’s just lots of it, it’s interesting, it keeps me in a job, and it always provides a reason, as a security architect, for me to get involved in** projects that might give me something new to look at.  I suppose those aren’t all bad things, and it can also be a bit depressing, because there’s always so much of it, it’s not always interesting, and sometimes I need to get involved even when I might have better things to do.

And what’s worse is that it almost always seems to be security-related, and it’s always there.  That’s the bad part.

Security, we all know, is the piece that so often gets left out, or tacked on at the end, or done in half the time that it deserves, or done by people who have half an idea, but don’t quite fully grasp it.  I should be clear at this point: I’m not saying that this last reason is those people’s fault.  That people know they need security it fantastic.  If we (the security folks) or we (the organisation) haven’t done a good enough job in making security resources – whether people, training or sufficient visibility – available to those people who need it, then the fact that they’re trying is a great, and something we can work on.  Let’s call that a positive.  Or at least a reason for hope***.

The Good Reason I love technical debt

So let’s get on to the other reason: the legitimate reason.  I love technical debt when it’s named.

What does that mean?

So, we all get that technical debt is a bad thing.  It’s what happens when you make decisions for pragmatic reasons which are likely to come back and bite you later in a project’s lifecycle.  Here are a few classic examples that relate to security:

  • not getting round to applying authentication or authorisation controls on APIs which might at some point be public;
  • lumping capabilities together so that it’s difficult to separate out appropriate roles later on;
  • hard-coding roles in ways which don’t allow for customisation by people who may use your application in different ways to those that you initially considered;
  • hard-coding cipher suites for cryptographic protocols, rather than putting them in a config file where they can be changed or selected later.

There are lots more, of course, but those are just a few which jump out at me, and which I’ve seen over the years.  Technical debt means making decisions which will mean more work later on to fix them.  And that can’t be good, can it?

Well, there are two words in the preceding paragraph or two which should make us happy: they are “decisions” and “pragmatic”.  Because in order for something to be named technical debt, I’d argue, it needs to have been subject to conscious decision-making, and for trade-offs to have been made – hopefully for rational reasons.  Those reasons may be many and various – lack of qualified resources; project deadlines; lack of sufficient requirement definition – but if they’ve been made consciously, then the technical debt can be named, and if technical debt can be named, then it can be documented.

And if they’re documented, then we’re half-way there.  As a security guy, I know that I can’t force everything that goes out of the door to meet all the requirements I’d like – but the same goes for the High Availability gal, the UX team, the performance folks, et al..  What we need – what we all need – is for there to exist documentation about why decisions were made, because then, when we return to the problem later on, we’ll know that it was thought about.  And, what’s more, the recording of that information might even make it into product documentation, too.  “This API is designed to be used in a protected environment, and should not be exposed on the public Internet” is a great piece of documentation.  It may not be what a customer is looking for, but at least they know how to deploy the product now, and, crucially, it’s an opportunity for them to come back to the product manager and say, “we’d really like to deploy that particular API in this way: could you please add this as a feature request?”.  Product managers like that.  Very much****.

The best thing, though, is not just that named technical debt is visible technical debt, but that if you encourage your developers to document the decisions in code*****, then there’s a half-way to decent chance that they’ll record some ideas about how this should be done in the future.  If you’re really lucky, they might even add some hooks in the code to make it easier (an “auth” parameter on the API which is unused in the current version, but will make API compatibility so much simpler in new releases; or cipher entry in the config file which only accepts one option now, but is at least checked by the code).

I’ve been a bit disingenuous, I know, by defining technical debt as named technical debt.  But honestly, if it’s not named, then you can’t know what it is, and until you know what it is, you can’t fix it*******.  My advice is this: when you’re doing a release close-down (or in your weekly stand-up: EVERY weekly stand-up), have an item to record technical debt.  Name it, document it, be proud, sleep at night.

 


*well, apart from the obvious clickbait reason – for which I’m (a little) sorry.

**I nearly wrote “poke my nose into”.

***work with me here.

****if you’re software engineer/coder/hacker – here’s a piece of advice: learn to talk to product managers like real people, and treat them nicely.  They (the better ones, at least) are invaluable allies when you need to prioritise features or have tricky trade-offs to make.

*****do this.  Just do it.  Documentation which isn’t at least mirrored in code isn’t real documentation******.

******don’t believe me?  Talk to developers.  “Who reads product documentation?”  “Oh, the spec?  I skimmed it.  A few releases back.  I think.”  “I looked in the header file: couldn’t see it there.”

*******or decide not to fix it, which may also be an entirely appropriate decision.

Single Point of Failure

Avoiding cascade failures with systems thinking.

Let’s start with a story.  Way back in the mists of time*, I performed audits for an organisation which sent out cryptographic keys to its members.  These member audits involved checking multiple processes and systems, but the core one was this: the keys that were sent out were are really big deal, as they were the basis from which tens of thousands of other keys would be derived.  So, the main key that was sent out was really, really important, because if it got leaked, the person who got hold of it would have a chance to do many, many Bad Things[tm].

The main organisation thought that allowing people the possibility to do Bad Things[tm] wasn’t generally a good idea, so they had a rule.  You had to follow a procedure, which was this: they would send out this key in two separate parts, to be stored in two different physical safes, to be combined by two different people, reporting to two different managers, in a process split into to separate parts, which ensured that the two different key-holders could never see the other half of the key.  The two separate parts were sent out by separate couriers, so that nobody outside the main organisation, could ever get to see the two parts.  It was a good, and carefully thought out process.

So one of the first things I’d investigate, on arriving at a member company to perform an audit, would be how they managed their part of this process.  And, because they were generally fairly clued up, or wouldn’t have been allowed to have the keys in the first place, they’d explain how careful they were with the key components, and who reported to whom, and where the safes were, and back up plans for when the key holders were ill: all good stuff.  And then I’d ask: “And what happens when a courier arrives with the key component?”  To which they’d reply: “Oh, the mail room accepts the package.”  And then I’d ask “And when the second courier arrives with the second key component?”  And nine times out of ten, they’d answer: “Oh, the mail room accepts that package, too.”  And then we’d have a big chat.**

This is a classic example of a single point of failure.  Nobody designs systems with a single point of failure on purpose****, but they just creep in.  I’m using the word systems here in the same way I used it in my post Systems security – why it matters: in the sense of a bunch of different things working together, some of which are likely to be human, some of which are likely to be machine.  And it’s hard to work out where single points of failure are.  A good way to avoid them – or minimise their likelihood of occurrence – is to layer or overlap systems*****.  What is terrible is when two single points of failure are triggered at once, because they overlap.  From the little information available to us, this seems to be what happened to British Airways over the past weekend: they had a power failure, and then their backups didn’t work.  In other words, they had a cascade failure – one thing went wrong, and then, when another thing went wrong as well, everything fell over. This is terrible, and every IT professional out there ought be cringing a little bit inside at the thought that it might happen to them.******

How can you stop this happening?  It’s hard, really it is, because the really catastrophic failures only happen rarely – pretty much by definition. Here are some thoughts, though:

  • look at pinch points, where a single part of the system, human or machine, is doing heavy lifting – what happens when they fail?
  • look at complex processes with many interlocking pieces – what happens if one of them produces unexpected results (or none)?
  • look at processes with many actors – what happens if one or actor fails to do what is expected?
  • look at processes with a time element to them – what will happen if an element doesn’t produce results when expected?
  • try back-tracking, rather than forward-tracking.  We tend to think forwards, from input to output: try the opposite, and see what the key parts to any output are.  This may give unexpected realisations about critical inputs and associated components.

Last: don’t assume that your systems are safe.  Examine, monitor, test, remediate.  You might******* also have a good long hard think about managed degradation: it’s really going to help if things do go horribly wrong.

Oh – and good luck.


*around ten years ago.  It feels like a long time, anyway.

**because, in case you missed it, that meant that the person in charge of the mail room had access to both parts of the key.***

***which meant that they needed to change their policies, quickly, unless they wanted to fail the audit.

****I’m assuming that we’re all Good Guys and Gals[tm], right, and not the baddies?

*****the principle of defence in depth derives from this idea, though it’s only one way to do it.

******and we/you shouldn’t be basking in the schadenfreude.  No, indeed.

*******should.  Or even must.  Just do it.

 

Service degradation: actually a good thing

…here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.

Well, not all the time, obviously*.  But bear with me: we spend most of our time ensuring that all of our systems are up and secure and working as expected, because that’s what we hope for, but there’s a real argument for not only finding out what happens when they don’t, and not just planning for when they don’t, but also planning for how they shouldn’t.  Let’s start by examining some techniques for how we might do that.

Part 1 – planning

There’s a story** that the oil company Shell, in the 1970’s, did some scenario planning that examined what were considered, at the time, very unlikely events, and which allowed it to react when OPEC’s strategy surprised most of the rest of the industry a few years later.  Sensitivity modelling is another technique that organisations use at the financial level to understand what impact various changes – in order fulfilment, currency exchange or interest rates, for instance – make to the various parts of their business.  Yet another is war gaming, which the military use to try to understand what will happen when failures occur: putting real people and their associated systems into situations and watching them react.  And Netflix are famous for taking this a step further in the context of the IT world and having a virtual Chaos Monkey (a set of processes and scripts) which they use to bring down parts of their systems in real time to allow them to understand how resilient they the wider system is.

So that gives us four approaches that are applicable, with various options for automation:

  1. scenario planning – trying to understand what impact large scale events might have on your systems;
  2. sensitivity planning – modelling the impact on your systems of specific changes to the operating environment;
  3. wargaming – putting your people and systems through simulated events to see what happens;
  4. real outages – testing your people and systems with actual events and failures.

Actually going out of your way to sabotage your own systems might seem like insane behaviour, but it’s actually a work of genius.  If you don’t plan for failure, what are you going to do when it happens?

So let’s say that you’ve adopted all of these practices****: what are you going to do with the information?  Well, there are some obvious things you can do, such as:

  • removing discovered weaknesses;
  • improving resilience;
  • getting rid of single points of failure;
  • ensuring that you have adequately trained staff;
  • making sure that your backups are protected, but available to authorised entities.

I won’t try to compile an exhaustive list, because there are loads books and articles and training courses about this sort of thing, but there’s another, maybe less obvious, course of action which I believe we must take, and that’s plan for managed degradation.

Part 2 – managed degradation

What do I mean by that?  Well, it’s simple.  We***** are trained and indoctrinated to take the view that if something fails, it must always “fail to safe” or “fail to secure”.  If something stops working right, it should stop working at all.

There’s value in this approach, of course there is, and we’re paid****** to ensure everything is secure, right?  Wrong.  We’re actually paid to help keep the business running, and here’s the interesting distinction between the classic IT security mindset and that of “the business”: the business generally want things to keep running.  Crazy, right?  “The business” want to keep making money and servicing customers even if things aren’t perfectly secure!  Don’t they know the risks?

And the answer to that question is “no”.  They don’t know the risks.  And that’s our real job: we need to explain the risks and the mitigations, and allow a balancing act to take place.  In fact, we’re always making those trade-offs and managing that balance – after all, the only truly secure computer is one with no network connection, no keyboard, no mouse and no power connection*******.  But most of the time, we don’t need to explain the decisions we make around risk: we just take them, following best industry practice, regulatory requirements and the rest.  Nor are the trade-offs usually so stark, because when failure strikes – whether through an attack, accident or misfortune – it’s often a pretty simple choice between maintaining a particular security posture and keeping the lights on.  So we need to think about and plan for some degradation, and realise that on occasion, we may need to adopt a different security posture to the perfect (or at least preferred) one in which we normally operate.

How would we do that?  Well, the approach I’m advocating is best described as “managed degradation”.  We allow our systems – including, where necessary our security systems – to degrade to a managed (and preferably planned) state, where we know that they’re not operating at peak efficiency, but where they are operating.  Key, however, is that we know the conditions under which they’re working, so we understand their operational parameters, and can explain and manage the risks associated with this new posture.  That posture may change, in response to ongoing events, and the systems and our responses to those events, so we need to plan ahead (using the techniques I discussed above) so that we can be flexible enough to provide real resiliency.

We need to find modes of operation which don’t expose the crown jewels******** of the business, but do allow key business operations to take place.  And those key business operations may not be the ones we expect – maybe it’s more important to be able to create new orders than to collect payments for them, for instance, at least in the short term.  So we need to discuss the options with the business, and respond to their needs.  This planning is not just security resiliency planning: it’s business resiliency planning.  We won’t be able to consider all the possible failures – though the techniques I outlined above will help us to identify many of them – but the more we plan for, the better we will be at reacting to the surprises.  And, possibly best of all, we’ll be talking to the business, informing them, learning from them, and even, maybe just a bit, helping them understand that the job we do does have some value after all.


*I’m assuming that we’re the Good Guys/Gals**.

**Maybe less story than MBA*** case study.

***There’s no shame in it.

****Well done, by the way.

*****The mythical security community again – see past posts.

******Hopefully…

*******Preferably at the bottom of a well, encased in concrete, with all storage already removed and destroyed.

********Probably not the actual Crown Jewels, unless you work at the Tower of London.