Note – many thanks to a couple of colleagues who provided excellent suggestions for improvements to this text, which has been updated to reflect them.
Sometimes I like to write articles about basics in security, and this is one of those times. I’m currently a little over a third of the way through writing a book on trust in computing and the the cloud, and ended up creating a section about cryptographic hashes. I thought that it might be a useful (fairly non-technical) article for readers of this blog, so I’ve edited it a little bit and present it here for your delectation, dears readers.
There is a tool in the security practitioner’s repertoire that it is helpful for everyone to understand: cryptographic hash functions. A cryptographic hash function, such as SHA-256 or MD5 (now superseded for cryptographic uses as it’s considered “broken”) takes as input a set of binary data (typically as bytes) and gives as output which is hopefully unique for each set of possible inputs. The length of the output – “the hash” – for any particularly hash function is typically the same for any pattern of inputs (for SHA-256, it is 32 bytes, or 256 bits – the clue’s in the name). It should be computationally implausible (cryptographers hate the word “impossible”) to work backwards from the output hash to the input: this is why they are sometimes referred to as “one-way hash functions”. The phrase “hopefully unique” when describing the output is extremely important: if two inputs are discovered that yield the same output, the hash is said to have “collisions”. The reason that MD5 has become deprecated is that it is now trivially possible to find collisions with commercially-available hardware and software systems. Another important property is that even a tiny change in the message (e.g. changing a single bit) should generate a large change to the output (the “avalanche effect”).
What are hash functions used for, and why is the property of being lacking in collisions so important? The simplest answer to the first question is that hash functions are typically used to ensure that when someone hands you a piece of binary data (and all data in the world of computing can be described in binary format, whether it is text, an executable, a video, an image or complete database of data), it is what you expect. Comparing binary data directly is slow and arduous computationally, but hash functions are designed to be very quick. Given two files of several Megabytes or Gigabytes of data, you can produce hashes of them ahead of time, and defer the comparisons to when you need them.
Indeed, given the fact that it is easy to produce hashes of data, there is often no need to have both sets of data. Let us say that you want to run a file, but before you do, you want to check that it really is the file you think you have, and that no malicious actor has tampered with it. You can hash that file very quickly and easily, and as long as you have a copy of what the hash should look like, then you can be fairly certain that you have the file you wanted. This is where the “lack of collisions” (or at least “difficulty in computing collisions”) property of hash functions is important. If the malicious actor can craft a replacement file which shares the same hash as the real file, then the process is essentially useless.
In fact, there are more technical names for the various properties, and what I’ve described above mashes three of the important ones together. More accurately, they are:
- pre-image resistance – this says that if you have a hash, it should be difficult to find the message from which it was created, even if you know the hash function used;
- second pre-image resistance – this says that if you have a message, it should be difficult to find another message which, when hashed, generates the same hash;
- collision resistance – this says that it should be difficult to find any two messages which generate the same hash.
Collision resistance and second pre-image resistance sound like the same property, at first glances, but are subtly (but importantly) different. Pre-image resistance says that if you already have a message, finding another with a matching hash, whereas collision resistance should make it hard for you to find any two messages which will generate the same hash, and is a much harder property to fulfil in a hash function.
Let us go back to our scenario of a malicious actor trying to exchange a file (with a hash which we can check) with another one. Now, to use cryptographic hashes “in the wild” – out there in the real world beyond the perfectly secure, bug-free implementations populated by unicorns and overflowing with fat-free doughnuts – there are some important and difficult provisos that need to be met. More paranoid readers may already have spotted some of them, in particular:
- you need to have assurances that the copy of the hash you have has also not been subject to tampering;
- you need to have assurances that the entity performing the hash performs and reports it correctly;
- you need to have assurances that the entity comparing the two hashes reports the result of that comparison correctly.
Ensuring that you can meet such assurances it not necessarily an easy task, and is one of the reasons that Trusted Platform Modules (TPMs) are part of many computing systems: they act as a hardware root of trust with capabilities to provide such assurances. TPMs are a useful an important tool for real-world systems, and I plan to write an article on them (similar to What’s an HSM?) in the future.
1 – It’s also generally easier to sign hashes of data, rather than large sets of data themselves – this happens to be important as one of the most common uses of hashes is for cryptographic (“digital”) signatures.