Uncertain labels are a key challenge in applying machine learning to cybersecurity. We look at some of the ways Sophos overcomes this obstacle.
Computing capacity and deep learning methodology have advanced rapidly in the past decade. We’ve started to see computers use these techniques to perform tasks that were historically human-centric, faster and often better than humans. At Sophos we believe this extends to cybersecurity. Specifically, our data science team specialises in developing deep learning models that detect malware with extremely high detection rates and very low false positive rates. But instead of preaching about the benefits of deep learning, let’s talk about a big challenge we face when applying machine learning to cybersecurity: uncertain labels.
The problem with labels
Supervised machine learning works like this: you give a model (a function) some data (like HTML files) and a set of associated desired output labels (like 0 and 1 to denote benign and malicious). The model looks at the HTML files and their labels, then tries to adjust itself to fit the data so that it can correctly guess the output labels (0, 1) by looking only at the input data (the HTML files).

Long story short: we define the ground truth for the model by telling it “this is the perfectly accurate state of the world, now learn from it so you can accurately guess labels for new data”.

The problem is, sometimes the labels we’re giving our models aren’t correct. Perhaps a new type of malware that our systems have never seen before hasn’t been flagged properly in our training data. Perhaps the entire security community has cumulatively mislabeled a file through a snowball effect of copying each other’s classifications. The concern is that our model will fit to this mislabeled data, and we’ll end up with a model that predicts incorrect labels. To top it off, we won’t be able to estimate our errors properly, because we’ll be evaluating the model against incorrect labels.

How valid this concern is depends on a few factors:

- The amount of incorrect labels in your dataset
- The complexity of your model
- Whether incorrect labels are randomly distributed across the data or highly clustered
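To make the first and third factors concrete, here is a minimal sketch (not Sophos’s actual pipeline) using a synthetic one-dimensional dataset and a deliberately simple threshold “model”. It flips a random fraction of training labels and shows the gap that opens between accuracy measured against the true labels and accuracy measured against the noisy labels we actually have:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: benign (0) and malicious (1) samples drawn
# from two overlapping Gaussians along a single feature.
n = 2000
y_true = rng.integers(0, 2, n)
x = rng.normal(loc=2.0 * y_true, scale=1.0)

def train_threshold(x, y):
    # A deliberately simple "model": pick the decision threshold that
    # best fits whatever training labels it is given.
    candidates = np.linspace(x.min(), x.max(), 200)
    accs = [((x > t).astype(int) == y).mean() for t in candidates]
    return candidates[int(np.argmax(accs))]

def flip_labels(y, fraction, rng):
    # Simulate randomly distributed label noise by corrupting a
    # fixed fraction of labels chosen uniformly at random.
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy

for frac in (0.0, 0.1, 0.3):
    y_noisy = flip_labels(y_true, frac, rng)
    t = train_threshold(x, y_noisy)
    preds = (x > t).astype(int)
    # Evaluate against the clean truth vs. the noisy labels we hold:
    acc_clean = (preds == y_true).mean()
    acc_noisy = (preds == y_noisy).mean()
    print(f"flip={frac:.0%}  acc vs truth={acc_clean:.3f}  "
          f"acc vs noisy labels={acc_noisy:.3f}")
```

Because the flipped labels here are randomly scattered, the learned threshold barely moves, but the accuracy we can *measure* (against the noisy labels) drops roughly in line with the flip rate: exactly the evaluation problem described above. Clustered mislabeling (e.g. an entire malware family labeled benign) would instead pull the model itself in the wrong direction.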