The science and art of detecting zero-day phishing and malware campaigns depend on leveraging our knowledge of previous threats. Establishing digital fingerprints, called “fuzzy hashes,” is one way that security teams can identify similarities between novel files and confirmed threats.

Ssdeep is a software program that creates fuzzy hashes, which can be used to identify similar content in files by finding patterns in code. Even when a file has been modified, much of its code often remains unchanged, and those unchanged portions provide clues for detecting malware.

While the use of ssdeep in detecting malware is well-established, effectively utilizing it to detect novel malware threats requires the use of advanced AI analytics. This blog explores how ssdeep can be effectively used to enhance phishing detection. It will go on to detail how the technique is used in Check Point’s Zero Phishing to actively detect and block phishing and malicious web-based campaigns.

How is Fuzzy Hashing Used to Detect Phishing and Malware Campaigns?

A security team that has the capacity to maintain large databases of webpages can leverage fuzzy hashing to reveal significant correlations between seemingly unrelated domains and known malicious campaigns.

By utilizing the ssdeep fuzzy hashing program, we can create a system that detects phishing campaigns caught in the wild and clusters them by grouping together web pages from various domains with similar HTML source code.
This approach has enabled Check Point to identify thousands of phishing clusters that are used to protect potential victims worldwide.

Why ssdeep Cluster Detection is Required to Detect Novel Threats

We often see large-scale phishing campaigns hosted on different domains that share the same HTML code, with only slight variants. This code may evade signature-based detection engines because some key elements were changed, but the main structure of the code is the same. Robust detection engines are required to recognize key similarities and extrapolate correlations within slightly varying pieces of code.

This simple example of a Facebook (Meta) phishing campaign shows two phishing pages that are different enough to evade a classic signature detection algorithm.

Figure 1 – Screenshots of two Facebook phishing campaigns

The pages in Figure 1 above were hosted on two unrelated domains using popular web hosting services:

  • feedbacdeveloper-case[.]d3nstmqzpmeow6[.]amplifyapp[.]com
  • personal-interests-2437e1[.]netlify[.]app

While the structures appear to be the same, we can see differences in the text.

Comparing the pages’ HTML code, we note that there are minor differences in the <title>, the href of the <link> tags, and other minor elements throughout the code.

Figure 2 – A diff checker tool showing the differences between the two webpages’ source code.

As expected, when we calculate the SHA256 of these two files, they result in completely different hashes:
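As a rough illustration of the same effect, the following Python sketch hashes two made-up HTML snippets. They are hypothetical stand-ins, not the actual source of the pages in Figure 1, and differ only in the <title>. The helper names `sha256_hex` and `matching_positions` are introduced here purely for illustration.

```python
import hashlib

# Hypothetical stand-ins for two near-identical phishing pages.
# Only the <title> text differs between them.
PAGE_A = b"<html><head><title>Help Center</title></head><body>Login</body></html>"
PAGE_B = b"<html><head><title>Support Hub</title></head><body>Login</body></html>"


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def matching_positions(hex_a: str, hex_b: str) -> int:
    """Count the positions at which two hex digests hold the same character."""
    return sum(a == b for a, b in zip(hex_a, hex_b))


if __name__ == "__main__":
    digest_a = sha256_hex(PAGE_A)
    digest_b = sha256_hex(PAGE_B)
    print(digest_a)
    print(digest_b)
    # A small edit changes the whole digest, so any matching characters
    # are coincidental and say nothing about how similar the inputs are.
    print("matching hex positions:", matching_positions(digest_a, digest_b), "of 64")
```

Because a cryptographic hash is designed so that any change to the input scrambles the entire digest, comparing two digests can only answer "identical or not", never "how similar".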

However, when comparing their ssdeep hashes, we find there is a high level of similarity between the files:

The ssdeep program calculates the similarity score for the two source files:

The result is a 97% similarity, indicating that the files contain very similar data.
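To give a feel for why a piecewise hash tolerates small edits, here is a deliberately simplified Python sketch of the general idea behind context-triggered piecewise hashing. It is not the ssdeep algorithm: the real tool uses a different rolling hash, picks its block size based on file length, and scores matches with an edit-distance measure. In this toy version, the rolling hash is a plain byte sum, the score comes from `difflib`, and the sample pages, the function names, and the default `block_size` are all assumptions made for illustration.

```python
import difflib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7                 # number of trailing bytes seen by the rolling hash
FNV_OFFSET = 0x811C9DC5    # FNV-1a 32-bit offset basis
FNV_PRIME = 0x01000193     # FNV-1a 32-bit prime


def toy_fuzzy_hash(data: bytes, block_size: int = 8) -> str:
    """Return a toy signature with one character per content-defined chunk."""
    signature = []
    chunk_hash = FNV_OFFSET
    for i, byte in enumerate(data):
        # FNV-1a style hash of the bytes in the current chunk.
        chunk_hash = ((chunk_hash ^ byte) * FNV_PRIME) & 0xFFFFFFFF
        # Toy rolling hash: the sum of the last WINDOW bytes. Chunk
        # boundaries depend only on nearby content, so they line up
        # again shortly after an edited region.
        rolling = sum(data[max(0, i - WINDOW + 1): i + 1])
        if rolling % block_size == block_size - 1:
            signature.append(ALPHABET[chunk_hash % 64])
            chunk_hash = FNV_OFFSET
    signature.append(ALPHABET[chunk_hash % 64])  # trailing chunk
    return "".join(signature)


def toy_similarity(sig_a: str, sig_b: str) -> int:
    """Score two signatures from 0 to 100 (a stand-in for ssdeep's scoring)."""
    matcher = difflib.SequenceMatcher(None, sig_a, sig_b, autojunk=False)
    return round(100 * matcher.ratio())


# Hypothetical pages: same body, different <title> and stylesheet link.
BODY = "".join(
    f'<div class="row-{i}"><input name="field{i}"></div>\n' for i in range(30)
)


def make_page(title: str, href: str) -> bytes:
    html = (
        f"<html><head><title>{title}</title>"
        f'<link href="{href}"></head><body>{BODY}</body></html>'
    )
    return html.encode()


PAGE_A = make_page("Facebook Help Center", "https://a.example/style.css")
PAGE_B = make_page("Meta Business Support", "https://b.example/main.css")
UNRELATED = (
    b"Quarterly planning notes: the team reviewed shipping delays, agreed on "
    b"a revised budget for travel, and scheduled a follow-up workshop about "
    b"onboarding new vendors before the end of the fiscal year."
)

if __name__ == "__main__":
    sig_a, sig_b = toy_fuzzy_hash(PAGE_A), toy_fuzzy_hash(PAGE_B)
    print("A vs B:", toy_similarity(sig_a, sig_b))
    print("A vs unrelated:", toy_similarity(sig_a, toy_fuzzy_hash(UNRELATED)))
```

Only the chunks that overlap the edited title and link produce different signature characters. The rest of the signature is unchanged, which is why the two pages still score as highly similar.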

Using the cluster methodology, we can identify this new threat because it is so similar to known threats, even though it is not exactly the same. Traditional solutions would miss these similarities, since the code isn’t an exact match.

Figure 3 – Graph database visualization of the campaign correlations, produced using Gephi – https://gephi.github.io/

This technique forms clusters simply by connecting highly correlated nodes.

Scattered, isolated nodes appear on the periphery of this visualization, while strongly correlated clusters of nodes appear in the center. Each cluster represents a different phishing campaign, and each node represents the unique source code hosted on a unique domain.
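The idea of connecting highly correlated nodes can be sketched as a connected-components computation. The following Python example is a minimal illustration under stated assumptions, not Check Point's implementation: the domains and signatures are invented, the threshold of 80 is arbitrary, and `ratio_score` uses `difflib` as a stand-in for an ssdeep comparison.

```python
import difflib
from itertools import combinations
from typing import Callable, Dict, List


def ratio_score(sig_a: str, sig_b: str) -> int:
    """Stand-in similarity score from 0 to 100 (not ssdeep's own scoring)."""
    matcher = difflib.SequenceMatcher(None, sig_a, sig_b, autojunk=False)
    return round(100 * matcher.ratio())


def cluster_by_similarity(
    signatures: Dict[str, str],
    similarity: Callable[[str, str], int] = ratio_score,
    threshold: int = 80,
) -> List[List[str]]:
    """Group domains whose signatures are linked by scores >= threshold."""
    parent = {domain: domain for domain in signatures}

    def find(node: str) -> str:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Draw an edge between every sufficiently similar pair of nodes.
    for a, b in combinations(signatures, 2):
        if similarity(signatures[a], signatures[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for domain in signatures:
        groups.setdefault(find(domain), []).append(domain)
    # Largest clusters first; isolated nodes end up as singletons.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical domains and signatures, invented for this example.
SIGNATURES = {
    "a.example": "ABCDEFGHIJKLMNOP",
    "b.example": "ABCDEFGHIJKLMNOQ",
    "c.example": "ABCDEFGHIJKLMXOP",
    "d.example": "zyxwvutsrqponmlk",
    "e.example": "0123456789+/0123",
}

if __name__ == "__main__":
    for cluster in cluster_by_similarity(SIGNATURES):
        print(cluster)
```

Comparing every pair is quadratic in the number of pages, so a sketch like this only suits small data sets. A large database would need some form of indexing to limit the comparisons.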

Now let’s look at the results not just in data and graphs, but the actual malicious webpages. These pages may look different, but they are in fact correlated.

Opening our sandbox and diving into different domains related to the same cluster unveils the following results:

Figure 4 – Screenshots of completely unrelated domains, found in the same cluster.

Although the logos and colors vary, when presented side-by-side like this, it’s obvious that all the webpages in this cluster were created by the same entity.

A quick look into the source code shows us that most of the code is similar. However, there are key elements that are different:

  • Brand logo
  • Page title
  • Contact information (email and phone number)
  • CSS classes

Figure 5 – Inspecting the differing titles in the source code hints that different brands are being spoofed using the same base code.

Another popular cluster we detected consists of crypto-related webpages. The pages (see Figure 6 below) might seem to be from different companies, but are actually from the same family:

Figure 6 – Different brands on different domains and with different metadata, but with highly similar code structure.

Here the differences stand out a bit more. Every page in this cluster represents a different brand, with different contact details, images and brand colors. The correlation in this cluster was a bit weaker than in the previous example but was still high enough to determine that those webpages are related.

Summary

Comparing previously unseen URLs to our ssdeep-based clusters lets us block a threat based solely on its high similarity and correlation to a known malicious cluster in our database. This method not only enhances our phishing detection capabilities but also helps to preemptively block potential threats. ThreatCloud AI currently protects tens of thousands of organizations from phishing attacks by using pinpointed, accurate methodologies.
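The lookup step described above might be sketched as follows. This is an illustrative Python example only: the cluster names, signatures, and threshold are hypothetical, and `similarity` uses `difflib` in place of a real ssdeep comparison.

```python
import difflib
from typing import Dict, List, Optional, Tuple


def similarity(sig_a: str, sig_b: str) -> int:
    """Stand-in similarity score from 0 to 100 (not ssdeep's own scoring)."""
    matcher = difflib.SequenceMatcher(None, sig_a, sig_b, autojunk=False)
    return round(100 * matcher.ratio())


def best_cluster(
    new_signature: str,
    clusters: Dict[str, List[str]],
    threshold: int = 80,
) -> Optional[Tuple[str, int]]:
    """Return (cluster name, score) for the closest known cluster, or None."""
    best: Optional[Tuple[str, int]] = None
    for name, known_signatures in clusters.items():
        for known in known_signatures:
            score = similarity(new_signature, known)
            # Keep only matches at or above the threshold, preferring the highest.
            if score >= threshold and (best is None or score > best[1]):
                best = (name, score)
    return best


# Hypothetical known-malicious clusters, invented for this example.
KNOWN_CLUSTERS = {
    "login-kit": ["ABCDEFGHIJKLMNOP", "ABCDEFGHIJKLMNOQ"],
    "crypto-kit": ["zyxwvutsrqponmlk"],
}

if __name__ == "__main__":
    print(best_cluster("ABCDEFGHIJKLMXOP", KNOWN_CLUSTERS))  # close to login-kit
    print(best_cluster("0123456789", KNOWN_CLUSTERS))        # no close match
```

A page that returns a match would be treated as belonging to that campaign. A page that returns `None` is not necessarily benign: it simply does not resemble any known cluster.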

Investigating malicious campaigns that belong to the same cluster, or family, significantly improves our understanding of rising trends, evasion techniques, and commonly spoofed brands. It helps us to continuously improve our detection capabilities. This holistic approach ensures we stay ahead of evolving phishing tactics, providing robust protection to our clients.

Check Point’s Zero-Phishing engine, part of ThreatCloud AI, revolutionizes Threat Prevention, providing industry-leading security as part of Check Point’s Quantum, Harmony, and CloudGuard product lines.
