Data may be neutral: information is not

We tend to think of data as, well, binary

Bias – particularly unintentional bias – is interesting, as I’ve noted before in my article Trust you? I can’t trust myself. The Guardian recently published an interesting report on bias in the field of digital forensics which is not, as you might expect, a discussion of algorithmic bias in AI/ML (a problem which is a major concern in criminal profiling and even sentencing), but is actually about the people performing digital forensics.

The story is based on a (“soon to be published”) study which basically found that, presented with the same data set (a hard drive, presumably with logging and system information on it), digital forensics researchers were more likely to find incriminating information on it when they were told that the drive’s owner had already confessed or was likely to be guilty than when they were told that the owner was likely to be innocent.

This sounds horrifying and, on first consideration, very surprising, but if the raw data put in front of people were a police interview transcript, I think that many of us would be less surprised by such a result. We know that we read words and sentences differently based on our assumptions and biases (just look at research on candidate success for CVs/resumes where the name of the candidate is clearly male versus female, or suggests a particular ethnic group or religious affiliation), and we are even familiar with scientific evidence being debated by experts in court. So why are we surprised by the same result when considering data on a hard drive?

I think the answer is that we tend to think of data as, well, binary: it’s either one thing or it’s not. That’s true, of course (quantum effects aside), but data still needs to be found, considered and put into context in order to be useful: it needs to be turned into information. Hopefully, an AI/ML system applied to the same data multiple times would extract the same information but, as noted above, we can’t be sure that there’s no bias inherent in the AI/ML model already. The process of informationising the data, if you will, moves from deterministic ones and zeroes to information via processes (“algorithms”, in effect) which are complex and provide ample opportunity for the humans carrying them out (or training the AI/ML models which will carry them out) to introduce bias.

This means that digital forensics experts need training to try to expose their own unconscious bias(es!) in just the same way that experts in other types of evidence do. Humans bring their flaws and imperfections to every field of endeavour, and digital forensics is no different. In fact, in my view, the realisation that digital forensics, as a discipline, is messy and complex is a positive step on the way to its maturing as a field of study and taking its place in the set of tools available to criminal investigators, technical auditors and beyond.

Author: Mike Bursell

Long-time Open Source and Linux bod, distributed systems security, etc. Now employed by Red Hat as Chief Security Architect.
