The Rise of the Machines - Learning to Detect DGAs

by blodgic

The cat-and-mouse game of blocking attackers by domain or IP address is archaic.  In the ever-growing realm of bots and command and control architecture, domain assignment has shifted from fast flux, a node-shuffling method built on DNS and compromised hosts, to a method called Domain Generation Algorithms (DGA).  Block-and-tackle defense against DGAs is costly, but with a proper implementation, machine learning detection can be fruitful.

The DGA technique produces a list of constantly changing domains from a randomized seed of code.  This code and seed create a rendezvous point between a command and control (C2) server and the infected client.  If a domain is blocked, resiliency is built in: a new domain can be used to restore the bi-directional traffic between the host and the C2 server.

The DGA domain is made of random numbers, letters, and characters (example: qx44utti3xbiz74hw2v50www.net).  In addition to providing a resilient communication channel, it is also a masquerade-and-evade tactic.  Because the domain does not look like a word or group of words, and it's usually newly registered, it can evade proxy categorization.  Your typical email clicker may also be curious about the site behind this strange gathering of characters.  With this lure and evasion, the DGA trap is set and the attack technique has spun another web of deception.

To defend against this attack, blue teams thoroughly investigate and diagnose nefarious network activity, then block connections to the known IP address or domain of the C2 server.  But even after the blue team blocks that traffic, if the malware was not properly removed, the nefarious network connection continues: the malware simply spins up a new domain and re-establishes the link.

This DGA scheme has been around since 2008 with Kraken.  DGAs found huge success later in 2009 with the Conficker worm.  Ransomware has since picked up the DGA technique and, well, ransomware is continuing to do its thing.

So why not fight fire with fire?  Or, in the case of defenders, use an algorithm to defend against another algorithm.  With the availability of labeled data, open-source intel, and feature engineering, the DGA detection use case has a lower barrier to entry for a Machine Learning (ML) application.  Rather than putting together blacklists of known DGAs, an ML algorithm can predict and classify a DGA domain to aid in thwarting the attack.

The deceptive power of DGAs originates from a seed function that creates randomness in domains.  The seeds vary: malware researchers have found them based on today's date, the trending hashtag on Twitter, the temperature in a city, and the exchange rate between two currencies.  This randomness has caused havoc for defenders and forced security researchers to develop defense techniques to counteract it.
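To make the seed idea concrete, here is a minimal sketch of a toy date-seeded generator in Python.  It is not any real malware family's algorithm; the character set, domain length, TLD, and seed construction are illustrative assumptions only.

# toy_dga.py - illustrative sketch only, not a real family's algorithm
import hashlib
from datetime import date

def toy_dga(day, count=5, length=16, tld=".net"):
    """Derive the day's pseudo-random rendezvous domains from a date seed."""
    domains = []
    for i in range(count):
        seed = f"{day.isoformat()}-{i}".encode()    # e.g. b"2024-01-01-0"
        digest = hashlib.md5(seed).hexdigest()      # deterministic "randomness"
        domains.append(digest[:length] + tld)
    return domains

# the malware and the C2 operator can both compute today's list,
# so they rendezvous without any hard-coded domain to block
for d in toy_dga(date.today()):
    print(d)

Because the seed is shared knowledge (here, the date), a defender who blocks today's domain is already behind: tomorrow's list is different.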

To complicate defense further, domain registrars stimulate this random domain creation by allowing automated and anonymous registration.  On the defensive side, security researchers have done a tremendous job reverse engineering DGAs and attributing them to malware families.  This labeling is extremely helpful to data scientists, as it provides classification labels to be used in prediction.

Building a ML DGA Detection Algorithm

Curating a list of domains is the first step in obtaining training data for a supervised machine learning model to combat DGAs.  Each domain is assigned a binary label: we will give a "0" to benign domains and a "1" to DGA domains.

Where to Get Training Data for DGAs:

DGAs:

Benign Domains:

Pulling down the DGA intel sources above should give you a list of more than a million unique DGA domains.  A ratio of 3:1 (three benign domains to one DGA) is an appropriate mix to get a realistic prediction accuracy.  The next step is to break out each domain and build features that indicate whether it is more closely aligned with a DGA.
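As a sketch under those assumptions (the file names benign_domains.txt and dga_domains.txt are hypothetical, one domain per line, with the benign list at least three times as long as the DGA list), the labeled training set could be assembled like this:

import pandas as pd

# hypothetical input files built from the intel sources above
benign = pd.DataFrame({"domain": open("benign_domains.txt").read().split()})
benign["benign_dga"] = 0                      # label: 0 = benign

dga = pd.DataFrame({"domain": open("dga_domains.txt").read().split()})
dga["benign_dga"] = 1                         # label: 1 = DGA

# keep roughly three benign domains for every DGA domain
benign = benign.sample(n=3 * len(dga), random_state=42)

# combine and shuffle into the df used in the feature engineering below
df = pd.concat([benign, dga], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)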

At this point, we have to get creative and itemize how we as humans infer that a domain looks like a random set of letters and numbers.  Feature engineering encodes that human intuition so a machine can interpret it and report back the probability that a domain is a DGA.

There are Pythonic utilities to help us build features on domains, such as a domain parser like TLDextract and a word parser like the WordSegment library.  WordSegment has fascinating capabilities: backed by a trillion-word corpus, it can dissect a full string and break it out into dictionary words.  These packages ease feature engineering that would otherwise require writing a ton of code.
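A minimal sketch of what those two libraries give you (after installing tldextract and wordsegment; the last line, which adds a tld column to the df of domains, is an assumption about how that column gets created):

import tldextract
from wordsegment import load, segment

load()  # load WordSegment's word-frequency corpus once per session

ext = tldextract.extract("qx44utti3xbiz74hw2v50www.net")
print(ext.domain, ext.suffix)       # 'qx44utti3xbiz74hw2v50www' 'net'
print(segment("thequickbrownfox"))  # best-guess dictionary words in a string

# add the TLD as its own column to the domain DataFrame
df["tld"] = df["domain"].apply(lambda d: tldextract.extract(d).suffix)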

Now that we have a domain broken out into words, segments, and its TLD, we still need to conduct further feature engineering to get closer to a machine learning model that predicts DGAs.

One step to determine the randomness of the characters in a domain is to measure its entropy.  Entropy is a mathematical measurement of uncertainty in a random variable; for a string, it is -sum(p(c) * log2(p(c))) over each distinct character c, where p(c) is how often that character appears in the string.  The more random a string is, or the more varied its letters, numbers, and characters, the higher the entropy value.

For example, a DGA of OLKQXMAEUIWYX[.]XXX has an entropy score of 3.40, while google[.]com gets an entropy score of 2.64.  Boiled down into cybersecurity economics: more randomness, and therefore higher entropy, means a higher likelihood of identifying a DGA.

Our feature engineering journey continues in the Python code below.  The code assumes that a Python Pandas DataFrame, or df, was created for the benign and DGA domains.  This data frame of domains also includes a target variable column (in a statistical formula, the y): the binary label (DGA = 1 versus benign = 0).  The data frame will get wider and include more columns as you add more features.

These tactics on their own can be effective in identifying DGAs.  However, it is the combination of multiple features bound together with a prediction function that will yield more success in predicting DGAs.

Other assumptions for the example code below are that the Python packages math, pandas, scikit-learn, and XGBoost are installed and imported.

# python version 3.6
# DGA feature building

import math
import pandas as pd

# entropy
def entropy(string):
    # get probability of chars in string
    prob = [float(string.count(c)) / len(string) for c in dict.fromkeys(list(string))]

    # calculate the entropy
    entropy = -sum([p * math.log(p) / math.log(2.0) for p in prob])
    return entropy

# apply entropy to the domain
df["entropy"] = df["domain"].apply(entropy)

# Additional features

# hyphen count
df["hyphen_count"] = df.domain.str.count("-")

# dot count
df["dot_count"] = df.domain.str.count(r"\.")

# string length of the full domain
df["string_len_domain"] = df.domain.str.len()

# tld length
df["tld_len"] = df.tld.str.len()

# count of vowels and consonants
vowels = set("aeiou")
cons = set("bcdfghjklmnpqrstvwxyz")
df["vowels"] = [sum(1 for c in x if c in vowels) for x in df["domain"]]
df["consonants"] = [sum(1 for c in x if c in cons) for x in df["domain"]]

# vowel to consonant ratio
df["vowel_consonant_ratio"] = (df["vowels"] / df["consonants"]).round(5)

# count the number of syllables in a word
import re
def syllables(word):
    word = word.lower()
    if word.endswith("e"):
        word = word[:-1]
    count = len(re.findall("[aeiou]+", word))
    return count

df["syllables"] = df["domain"].apply(syllables)

# prediction code
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# feature matrix: every engineered column except the raw strings and the label
X = df.drop(columns=["domain", "tld", "benign_dga"])
y = df["benign_dga"]  # the binary target variable: 1 for DGA, 0 for benign, assigned during data collection

# create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit model
model = XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)

# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

A finely tuned model which predicts the likelihood of a domain being a DGA will serve your cybersecurity defenders well.  It will save you time, energy, and costs on detections.  Example ML implementations of DGA detection include:

The prediction accuracy mileage you get from the code above will vary.  As is the case with every cyber defense detection, tuning the model or creating more features to feed it can help increase accuracy.  However, more features add to the processing resources needed to generate a model.  The process of building a machine learning model is a juggling act of trial-and-error.  But so is cybersecurity defense.
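On the tuning side, one approach is a small hyperparameter search with scikit-learn's GridSearchCV.  The parameter grid below is an illustrative assumption, not a recommended setting:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# illustrative grid; the values are assumptions, not recommendations
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_grid,
    scoring="accuracy",
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)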

Happy DGA hunting with machine learning.

Code: DGA-algorithim.py  (GitHub)
