bugs – Alice, Eve and Bob – a security blog

Logs – good or bad for Confidential Computing?

I wrote a simple workload for testing. It didn’t work.

A few weeks ago, we had a conversation on one of the Enarx calls about logging. We’re at the stage now (excitingly!) where people can write applications and run them using Enarx, in an unprotected Keep, or in an SEV or SGX Keep. This is great, and almost as soon as we got to this stage, I wrote a simple workload to test it all.

It didn’t work.

This is to be expected. First, I’m really not that good a software engineer, but also, software is buggy, and this was our very first release. Everyone expects bugs, and it appeared that I’d found one. My problem was tracing where the issue lay, and whether it was in my code, or the Enarx code. I was able to rule out some possibilities by trying the application in an unprotected (“plain KVM”) Keep, and I also discovered that it ran under SEV, but not SGX. It seemed, then, that the problem might be SGX-specific. But what could I do to look any closer? Well, with very little logging available from within a Keep, there was little I could do.

Which is good. And bad.

It’s good because one of the major points about using Confidential Computing (Enarx is a Confidential Computing framework) is that you don’t want to leak information to untrusted parties. Since logs and error messages can leak lots and lots of information, you want to restrict what’s made available, and to whom. Safe operation dictates that you should make as little information available as you possibly can: preferably none.

It’s bad because there are times when (like me) you need to work out what’s gone wrong, and find out whether it’s in your code or the environment that you’re running your application in.

This is where the conversation about logging came in. We’d started talking about it before this issue came up, but this made me realise how important it was. I started writing a short blog post about it, and then stopped when I realised that there are some really complex issues to consider. That’s why this article doesn’t go into them in depth: you can find a much more detailed discussion over on the Enarx blog. But I’m not going to leave you hanging: below, you’ll find the final paragraph of the Enarx blog article. I hope it piques your interest enough to go and find out more.

In a standard cloud deployment, there is little incentive to consider strong security controls around logging and debugging, simply because the host has access not only to all communications to and from a hosted workload, but also to all the code and data associated with the workload at runtime. For Confidential Computing workloads, the situation is very different, and designers and architects of the TEE infrastructure (e.g. the Enarx project) and even, to a lesser extent, of potential workloads themselves, need to consider very carefully the impact of the host gaining access to messages associated with the workload and the infrastructure components. It is, realistically, infeasible to restrict all communication to levels appropriate for deployment, so it is recommended that various profiles are created which can be applied to different stages of a deployment, and whose use is carefully monitored, logged (!) and controlled by process.
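
To make the idea of profiles a little more concrete, here is a minimal sketch in Python. The profile names, the levels attached to them and the logger_for helper are all assumptions made up for illustration: none of this is Enarx configuration or an Enarx API. It only shows the shape of the idea, which is that each stage of a deployment gets its own ceiling on what may be logged, and that an unknown profile fails closed rather than falling back to something chatty.

```python
import logging

# Hypothetical profiles: the names and levels are illustrative assumptions,
# not Enarx configuration.
SILENT = logging.CRITICAL + 1  # above every standard level, so nothing is emitted

PROFILES = {
    "development": logging.DEBUG,   # full detail while chasing bugs
    "staging": logging.WARNING,     # warnings and errors only
    "production": SILENT,           # no messages visible to the host
}


def logger_for(profile: str) -> logging.Logger:
    """Return a logger limited to what the named deployment profile allows."""
    if profile not in PROFILES:
        # Fail closed: an unknown profile must not fall back to verbose output.
        raise ValueError(f"unknown logging profile: {profile!r}")
    logger = logging.getLogger(f"workload.{profile}")
    logger.setLevel(PROFILES[profile])
    return logger


# Example use: the same call site is noisy in development, silent in production.
logger_for("development").debug("wasm module loaded")
logger_for("production").error("this message is dropped")
```

In a real deployment the choice of profile would itself need to be controlled and recorded, as the paragraph above says. This sketch leaves that part out.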

Header image by philm1310 from Pixabay.

Why Enarx is open

It’s not just our coding that we do in the open.

When Nathaniel McCallum and I embarked on the project which is now called Enarx, we made one decision right at the beginning: the code for Enarx would be open source, a stance fully supported by our employer Red Hat (see standard disclaimer). All of it, and for ever. That’s a decision that we’ve not regretted at any point, and it’s something we stand behind. As soon as we had enough code for a demo, and were ready to show it, we created a repository on GitHub and made it public. There’s a very small exception, which is that there are some details of upcoming chip features that are shared with us under NDA[1] where if we write code for them, publishing that code would be a breach of the NDA. But where this applies (which is rarely) we are absolutely clear with the various vendors that we intend to make the code open as soon as possible, and lobby them to release details as early as they can (which may be earlier than they might prefer), so that more experts can look over both their designs and our code.

Auditability and trust

This brings us to possibly the most important reasons for making Enarx open source: auditability and trust. Enarx is a security-related project, and I believe passionately not only that security should be done in the open, but that if anybody is actually going to trust their sensitive data, algorithms and workloads to a piece of software, then they want to be in a position where as many experts as possible have looked at it, scrutinised it, criticised it and improved it: whether that is the people running the software, their employees, contractors or (even better) the wider security community. The more people who check the code, the happier you should be to trust it. This is important for any piece of security software, but vital for software such as Enarx which is designed to protect your most sensitive workloads.

Bug-catching

There are bugs in Enarx. I know: I’m writing some of the code[2] and I found one yesterday (which I’d put in), just as I was about to give a demo[3]. It is very, very difficult to write perfect code, and we know that if we make our source open, then more people can help us fix issues.

Commonwealth

For Nathaniel and me, open source is an ethical issue, and we make no apologies for that. I think it’s the same for most, if not all, of the team working on Enarx. This includes a number of Red Hat employees (see standard disclaimer), so shouldn’t come as a surprise, but we have non-Red Hat contributors from a number of backgrounds, and we feel that Enarx should be a Common Good, and contribute to the commonwealth of intellectual property out there.

More brain power

Making something open source doesn’t just make it easier to fix bugs: it can improve the quality of what you produce in general. The more brain power you have to apply to the problem, the better your chances of making something great – assuming that the brain power is applied efficiently (not always an easy task!). We had a design meeting yesterday where one of the participants said towards the end, “I’m sure I could implement some of this, but don’t know a huge amount about this topic, and I’m worried that I’m not contributing to this discussion.” In fact, they had, by asking questions and clarifying some points, and we assured them that we wanted to include experienced, senior developers for their expertise and knowledge, and to pull out assumptions and to validate the design, and not because we expected everybody to be experts in all parts of the project. Having bright people around, involved in design and coding, spreads expertise and knowledge, and helps keep the work from becoming an insulated, isolated “ivory tower” construction, understood by few, and almost impossible to validate.

Not just code

It’s not just our coding that we do in the open. We manage our architecture in the open, our design meetings, our protocol design, our design methodology[4], our documentation, our bug-tracking, our chat, our CI/CD processes: all of it is open. The one exception is our vulnerability management process, which needs to have the opportunity for confidential exposure for a limited time.

Code – https://github.com/enarx
Wiki – https://github.com/enarx/enarx/wiki
Design – see wiki and https://github.com/enarx/rfcs
Issues & Pull Requests – https://github.com/enarx/enarx/issues & https://github.com/enarx/enarx/pulls
Chat – https://chat.enarx.dev (thanks to Rocket.chat!)
CI/CD resources – thanks to Packet!
Stand-ups – https://github.com/enarx/enarx/wiki/How-to-contribute

We also take diversity seriously, and the project contributors are subject to the Contributor Covenant Code of Conduct.

In short, Enarx is an open project. I’m sure we could do better, and we’ll strive for that, but our underlying principles are that open is good in general, and vital for security. If you agree, please come and visit!

1 – Non-Disclosure Agreement.

2 – to the surprise of many of the team, including myself. At least it’s not in Perl.

3 – I fixed it. Admittedly after the demo.

4 – we’ve just moved to a Sprint pattern – the details of which we designed and agreed in the open.

In praise of triage

It’s all too easy to prioritise based on the “golfing test”.

Not all bugs are created equal.

Some bugs need fixing now, some bugs can wait. Some bugs are in your implementation, some are in the underlying design. Some bugs will annoy a few customers, some will destroy your business.

Bugs come in all shapes and sizes, and one of the tasks of a product owner, product manager, chief architect – whoever makes the call about where to assign resources – is to decide which ones to address in which order: to prioritise them. The problem is deciding how to prioritise them. It’s all too easy to prioritise based on the “golfing test”: your CEO meets someone on a golf course who mentions that his or her company loves your product, except for one tiny issue. The CEO comes back, and makes it clear that fixing this “major bug” is now your one and only task until it’s done, and your world is turned upside down. You have to fix the bug as quickly as possible, with no thought to the impact it has on the rest of the project, or the immense pile of technical debt that’s just been accrued. You don’t want to live in this world. What, then, is the alternative?

The answer – though it’s only the beginning of the answer – is triage. Triage (from the French for “separating out”) comes from the world of battlefield medicine. When deciding which wounded soldiers to treat, rapid (hopefully objective) assessments are carried out, allowing a quick sorting of each soldier, typically into categories such as “not urgent: wait”, “urgent: treat immediately” and “not saveable: do not treat”. We can apply the same to software bugs in order to decide what to treat (fix) and with what priority. The important thing is not so much the categories – which will vary based on your context – but the assessment criteria, and how they are applied. Here is a list of just some of the possible criteria:

likely monetary impact per customer
number of customers impacted
reputational impact on your organisation
ease of fixing
impact on system security
impact on system performance
impact on system stability
annoyance of the CEO at not being listened to.

We do not, of course, only need to apply one of these: a number of them can be combined with a weighting system, though the more you add, the less clear your priorities will be, and the more likely it is that someone will “put a finger on the scales” – tweak the numbers to give the outcome they want. Another important point about the criteria that you decide to apply is that they should be as measurable as you can make them, to allow scoring that is as objective as possible. I wrote a review of the book Building Evolutionary Architectures a while ago: the methodology adopted there, where you measure and test in order to meet specific criteria, is exactly the sort of approach you should be choosing when designing your triage system.
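
As a rough illustration of combining criteria with a weighting system, here is a minimal sketch in Python. Everything in it is an assumption for the sake of the example: the choice of four criteria, the integer weights, the 0 to 10 scoring scale and the three made-up bugs. Your own criteria and weights will differ, and the point is only that once they are written down, the ordering falls out of the numbers rather than out of the last conversation on the golf course.

```python
from dataclasses import dataclass

# Hypothetical weights: which criteria to use and how much each counts
# is a decision for your own organisation.
WEIGHTS = {
    "customers_impacted": 4,
    "security_impact": 3,
    "stability_impact": 2,
    "ease_of_fixing": 1,
}


@dataclass
class Bug:
    name: str
    scores: dict  # criterion name -> score from 0 (none) to 10 (severe/easy)


def priority(bug: Bug) -> int:
    """Weighted sum of the bug's scores; a missing criterion counts as 0."""
    return sum(weight * bug.scores.get(criterion, 0)
               for criterion, weight in WEIGHTS.items())


def triage(bugs: list) -> list:
    """Return the bugs ordered from highest to lowest priority."""
    return sorted(bugs, key=priority, reverse=True)


# Made-up example data.
bugs = [
    Bug("golf-course request", {"customers_impacted": 1, "security_impact": 0,
                                "stability_impact": 1, "ease_of_fixing": 7}),
    Bug("crash on login", {"customers_impacted": 8, "security_impact": 2,
                           "stability_impact": 9, "ease_of_fixing": 5}),
    Bug("weak token check", {"customers_impacted": 3, "security_impact": 9,
                             "stability_impact": 2, "ease_of_fixing": 4}),
]

for bug in triage(bugs):
    print(priority(bug), bug.name)
```

With these particular weights the crash on login comes first, the security bug second and the golf-course request last. Change the weights and the order may change, which is exactly why the weights should be agreed and recorded before anyone starts scoring.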

This is (ostensibly) a blog about security, and so you might expect me to say that “security always wins”, but that should absolutely not be the approach you take. Security might be the most important category for you (that is, carry the most weight), but you need to understand why that is the case – at this particular time – and what exactly you mean by “security”. The “security of the system” is not an objective measure: in order to mean anything, such a phrase needs to reference measurements that can be made (“resistance to physical tampering”, “resistance to brute force attacks”, “number of PhD students likely to be needed to reverse engineer our ‘secure’ protocol”[1]). More importantly, it may be that at this point in your organisation’s life, the damage done by lack of stability or decreased performance outweighs the impact of a security bug. If that’s the case, then your measurements should encapsulate that information and lead you to prioritise bugs with impact in these categories over security issues[3].

There’s one proviso that I feel I need to put in at this point, and it’s about the power of what, in Agile Methodology terms, is called the Product Owner. This is the person who represents the users of the product/project, and should have final say about the direction of development in terms of features, functionality and, most relevant in this discussion, bug-fixing. As noted above, this may be an architect, product manager or someone enjoying another title, but their role should be clear: they get to call the shots. There are times when this person goes against the evidence provided by the triage, and makes a decision to prioritise a particular bug over others despite the outcome of the measurements. This is typically very painful for the technical team[4], but, when it comes down to it, as the product owner, they get to decide. The technical team – after appropriate warnings and discussion[5] – must be ready to step aside and accept the decision. Such decisions (and related discussions) should be recorded, and the product owner must be ready to stand or fall based on the outcome, but that is their job. Triage is a guide, and there are occasions when there are measurements which cannot be easily made objectively, and which sit outside the expertise or scope of knowledge of the technical team. If this sort of decision keeps being made, and you think you know better, you may have a future in technical product management, where people with a view of both the technical and the business side of technology are much in demand. In the end, though, the product owner will need to justify their decision to management, and if they get it wrong, then they must be ready to take the blame (this is one reason why you should make sure that you’ve recorded the process taken to get to this decision – you don’t want to take the blame for a poor decision which you advised against).

So: go out and design a triage process, be ready to follow it, and be ready to defend it. Oh, and one last point: you might want to buy a set of golf clubs.

—–

1 – this last one is a joke: don’t design your own protocol, or if you do, make it open and have it peer-reviewed[2].

2 – and then throw it away and use an open source implementation of a better, more thoroughly reviewed one.

3 – much as it pains me to say it.

4 – I’ve been on both sides of these decisions: I know.

5 – often rather heated, in my experience.