A Reading of the AI Hype Meter

by @blogic

The hoopla around Artificial Intelligence (AI) in the industry continues.

Buzzwords like Machine Learning (ML), prediction, and AI continue to infiltrate the cyber security consciousness, often with a dose of fabrication.  The number of briefing summaries and training descriptions from the big summer hacking conferences containing the key phrases "data science," "machine learning," and "artificial intelligence" rose 112.5 percent from last year.  At DEFCON, there were three mentions of these keywords in 2017 and four in 2018.  Black Hat had a steeper jump in hype, from five in 2017 to 13 in 2018.  The trend is palpable enough that I hope readers like you will want to learn more about it.  Even though vendors and your colleague at the water cooler may think AI will be either the panacea or the Terminator of the hacker, let's start with a realistic understanding of ML for cyber security before we conclude that AI will be taking our jobs.

In its truest form, AI learns on its own and improves its capabilities to make better, human-like decisions.  Today we have pseudo-AI, and its names are Siri and Alexa, yet neither has been able to defend against a zero-day or develop a patch for my Mac firmware that protects against undiscovered vulnerabilities.  ML is being used to drive cars and to tweet the "truth" like Microsoft's "Tay."  Under the hood, it has the ability and agility to learn without being programmed with explicit rules and logic.

Let's take, for example, the daily activity of security engineers.  They create static/pattern-matching rules for Intrusion Detection Systems (IDS) to identify attack signatures and patterns in network flows.  If true AI were to create new IDS rules (without a security engineer's assistance), it would use historical signatures and attack logic to craft brand new signatures against unseen attacks on the fly.  Even though there are valiant attempts to use ML for understanding netflow and learning new attacks, building new signatures with algorithms to block attacks on the fly is not yet feasible.
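To see why static rules can't generalize, here is a minimal sketch of pattern matching in Python (the payload and the regex are invented for illustration; a real IDS rule is more elaborate, but the principle is the same):

```python
import re

# a hand-written "signature": a regex for a classic directory-traversal probe
SIGNATURE = re.compile(rb"\.\./\.\./")

def matches_signature(payload: bytes) -> bool:
    """Static pattern matching: flag only payloads containing the known pattern."""
    return SIGNATURE.search(payload) is not None

# the known attack pattern is caught...
print(matches_signature(b"GET /../../etc/passwd HTTP/1.1"))
# ...but a trivially obfuscated variation (URL-encoded slashes) slips through
print(matches_signature(b"GET /..%2f..%2fetc/passwd HTTP/1.1"))
```

The rule catches exactly what it was written for and nothing else, which is why someone (or, in the AI dream, something) has to keep writing new ones.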

Can ML and Cyber Security Coexist?

As much as we want Siri or Alexa to build IDS signatures for us, there are already worthwhile cyber security use cases for ML in the field today.  With ML we can use supervised and unsupervised models to identify malicious activity.  For a supervised machine learning algorithm, we use labels like a confirmed "threat" to identify a direct hit from a verified threat actor or a malware detection match from an Internet download.  Unsupervised machine learning algorithms don't have labels, but the data can be leveraged by user and entity behavior analytics tools to cluster users' suspicious activity or to differentiate network flows of an attack from those of a misconfigured internal system.
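A rough sketch of the two approaches with scikit-learn follows.  The features and numbers are invented for illustration: two toy netflow features, with labels available in the supervised case and withheld in the unsupervised one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# toy netflow features per host: [KB sent per hour, failed logins per hour]
benign = rng.normal(loc=[5, 1], scale=[1, 1], size=(50, 2))
malicious = rng.normal(loc=[50, 30], scale=[5, 5], size=(50, 2))
X = np.vstack([benign, malicious])

# supervised: analyst-confirmed labels (0 = benign, 1 = threat) teach the model
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)
print(clf.predict([[48, 25]]))   # activity resembling the labeled threats

# unsupervised: no labels -- just cluster the activity and let an analyst
# inspect whichever cluster looks suspicious
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note the asymmetry: the supervised model can only recognize what it was shown, while the clusters still need a human to decide which one, if either, is actually malicious.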

In both unsupervised and supervised machine learning, there needs to be a vast amount of curation of the data set that feeds the algorithms which then derive output.  Data quality is crucial to the success or failure of ML in accurately pinpointing malicious activity among the vast amount of noise analysts see today.  This data validation accounts for the majority of the effort in implementing a machine learning model.
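That curation step is mundane but decisive.  A minimal sketch with pandas (the column names and sensor quirks are hypothetical) shows the kind of scrubbing that happens before any algorithm sees the data:

```python
import pandas as pd

raw = pd.DataFrame({
    'src_ip': ['10.0.0.1', '10.0.0.1', None, '10.0.0.9', '10.0.0.7'],
    'bytes':  [1200, 1200, 300, -1, 56000],   # -1 is a sensor error code
    'label':  ['benign', 'benign', 'threat', 'threat', 'threat'],
})

# drop exact duplicates, rows with missing fields, and impossible values
clean = (raw.drop_duplicates()
            .dropna(subset=['src_ip'])
            .query('bytes >= 0'))

# encode the label as the dependent variable a model would learn from
clean = clean.assign(dependent_variable=(clean['label'] == 'threat').astype(int))
print(clean)
```

Three of the five rows are discarded before training even begins, which is typical: the glamorous model sits on top of a pile of unglamorous filtering.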

Below is a basic Python structure of an ML prediction model:

# prediction code: select features and the dependent variable from the dataframe
X = dataframe.filter(['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
y = dataframe['dependent_variable']

# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# logistic regression is the prediction algorithm of choice
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# determine the accuracy of the model
from sklearn.metrics import accuracy_score
predictions = logreg.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

# scoring the model
print('Accuracy score: ')
print(accuracy)

# confusion matrix for accuracy
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt
conf_mat = confusion_matrix(y_test, predictions)
print(conf_mat)
fig, ax = plot_confusion_matrix(conf_mat=conf_mat)
plt.show()
print('# TRUE NEGATIVE  | FALSE POSITIVE')
print('# FALSE NEGATIVE | TRUE POSITIVE')

Conclusion

In order for supervised and unsupervised models to accurately identify attacks, they first need a domain expert to embed that context in the data and tune the algorithm for better results.  Without a combination of domain expertise and proper implementation of a prediction model, algorithms can be extremely biased or even dangerous.  For example, many systems have machine accounts configured by humans that run scripts.  If that machine account's password expires and a human doesn't change it, it may fail to run a script 139 times in an hour.  This activity may appear as an outlier among other machine activity, but it's not a brute force attack.
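The expired-password scenario can be sketched with a simple z-score test (the counts below are invented).  The stale machine account stands out statistically, but the statistics alone cannot say why:

```python
import statistics

# hourly failed-authentication counts across a fleet of machine accounts;
# the last one is the account whose password silently expired
failures_per_hour = [0, 1, 0, 2, 0, 1, 0, 0, 139]

mean = statistics.mean(failures_per_hour)
stdev = statistics.stdev(failures_per_hour)

for count in failures_per_hour:
    z = (count - mean) / stdev
    if abs(z) > 2:
        # the model can only say "anomalous," not "brute force" --
        # a domain expert has to supply that context
        print(f'outlier: {count} failures/hour (z = {z:.1f})')
```

Only the 139-failure hour is flagged, and without a human who knows about password expiry, an analyst (or an automated response) could easily misread it as an attack.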

Machine learning is not a silver bullet that will solve cyberattacks.  Let's be wary of vendors stating that their products are foolproof or "100 percent predictive and prevent[s] cyberattacks from being successful."  The human element of creativity and ingenuity is essential to both the hacker and the defender.  The human way of solving problems will always reign over ML-driven decisions, and over whatever AI they eventually lead to.

George Box, one of the greatest statisticians of our time, said, "All models are wrong, but some are useful."  In fact, some machine learning prediction models can be useful to red team attackers and blue team defenders alike.  Because of the endless announcements of breaches and leaks of private and personal information we are now accustomed to, our blue teams need help sifting the signal from the noise.  Although the margin for error in these investigations is extremely thin, there is still a large upside to deriving malicious signals with the use of ML.  Along this journey with ML, we must still understand the caveats, be careful with implementation, and accept that our output may be wrong or biased.

More importantly, let's empower our security analysts to overcome the monotonous work that leads to career burnout.

Code: my-predict.py
