Keywords

1 Introduction

Pheromones in nature are chemical messages that act within a species. They are used widely by insects for communication within a community and within the body by means of hormones. This usage led to these substances also being referred to as “social hormones” [16]. Pheromones are external memories that are physically projected onto the ground or into the air, and are shared within a group. The dynamics of depositing and vaporizing pheromones are very sophisticated processes. These chemical messages have diverse biological effects and differ widely in their modes of action. In practice, the term “pheromone” proves useful in describing behaviors such as trail formation, defensive secretions, and social coherence. For the past several decades, several computational pheromone models have been proposed such as the classic ant colony optimization model [18] and a contemporary online shopping cart product recommendation model patented by Amazon [20]. Pheromone models have been used in visualization of human activities in CCTV footages, social media, and digital forensics [17, 19]. Digital pheromone models not only represent simultaneous localization, navigation, and optimization, but also provide perceptual motion intelligence simulations such as low-pass filters and visual episodic memory. Consequently, digital pheromone models bridge collective intelligence and primitive intuition, enabling dynamic data modeling and motion pattern recognition for humans and machines.

In this study, we explore a novel pheromone model for visualization of topological changes of a large dynamic network. Similar to virus distribution networks in nature, a cyber malware distribution network (MDN) is a connected set of maliciously compromised top-level domains (TLDs) used to facilitate the dissemination of malicious software attempting to victimize computers. It acts like a platform for spreading malwares to other nodes in the network. The challenge here is that MDNs are normally hidden and constantly change over time. The topological structures, such as domains hosting and malware and acting as intermediaries assisting in distribution are not revealed until the network traffic data is collected, attributed from multiple data sources, and systematically plotted.

MDNs have been used in botnets [1, 2], spam campaigns [3] and distributed denial of service attacks (DDoS) [4]. In general, MDNs are the essential back end distribution highway fueling underground economies in monetized schemes generating large revenues for malicious actors [3]. In this study, we provide a biologically-inspired approach for the construction and visualization of an MDN’s topological structure. We collect and visualize an MDN data over a period of 9 month in order to gain insight on its structure, persistence, and evolution over time. Our MDN graphs were based on the Google Safe Browsing (GSB) transparency report [5] and malware attribution from the VirusTotal website. Our research shows the novel approach of leveraging GSB and VirusTotal reports to graph MDNs reveals deep insight into structural changes over time. The main contributions of this paper include the novel pheromone-based visualization model that reveals the existence of persistent sub-networks and individual TLDs critical to the successful operation of MDNs and use of crowdsourcing data from Google Safe Browsing and VirusTotal for constructing MDN networks.

2 Related Work

A corpus of research [6, 7] has proposed various approaches to identify malicious URLs. Research such as [8, 9] describe and analyze the use of MDNs and their various components in monetized schemes such as botnets, pay-per-install affiliate programs and traffic direction systems. In the study [10], Behfarshad describes MDNs as a set of landing pages, redirectors, and malware repositories. The authors suggest detecting the presence of an MDN by identifying drive by download attempts via two methods: top-down and bottom-up. The research of Provos, et al. [11, 12] is part of the early work describing web based malware and the existence and identification of MDNs. Their research provided insight on identifying malicious URLs and explains the critical role of iframes, which are a simple and widely accepted mechanism for redirection in http traffic, as the fundamental link binding multiple URLs together in an MDN. The culmination of their work is the Google Safe Browsing service that is the key data source for our research. We enhance the current literature by defining an MDN and visualizing a graphical topological structure including vertex role based node types indicative of a TLD’s role in malware distribution. The authors of this paper developed an early version of a 3D interactive visualization system that revealed basic distribution patterns from the GSB data collected in 2014 [21]. In this study, we want to advance visual analytics with a pheromone simulation model to explore the attributed malware distribution network data collected in 2017, in order to gain insight on network structure, evolution, and persistence over time.

3 Malware Distribution Network as a Graph

An MDN is a dynamic structure topologically consisting of interconnected TLDs. An MDN changes over time that is captured with topological structure visualization at different points in time. We formally define a graph capturing the temporal topological structure of an MDN. A representation of an MDN at a given point in time is defined as a directed graph M(t) = {V(t), E(t)} such that V(t) is a set of vertices at time t, E(t) is a set of directed edges: \( \{ v_{i} \left( t \right), v_{j} \left( t \right)\} \in V\left( t \right) \) at time t. In a given graph M(t), a vertex \( v \in V\left( t \right) \) represents an MDN node in one or more modes. The full range of a node’s modality in the MDN is unknown. We suggest, based on GSB reports, three possible roles: in an MDN, a node may act as an intermediary (MI) by facilitating malicious traffic, a malicious host (MH) or root malicious host (RMH) if malware files were hosted from that domain. In our visualizations, MI and MH nodes have incoming edges, but RMH nodes do not. This implies malware in RMH nodes came from some source other than those represented by a node in the MDN. Figure 1 illustrates an MDN structure.

Fig. 1.
figure 1

Definition of a malware distribution network

4 Pheromone Model

We first build a dynamic graph of the malware distribution network. Graphs are represented by an augmented adjacency list data structure that is designed to capture both the dependencies of graph links and the mode of linkage type – MI or MH. We describe this data structure as a list of key – value pairs, whose keys are the top level domain of a website, denoted as a source and key values are a pair <mode, destination> where by destination is top-level domain which is reported as being affected by the source. To place all the top-level domains on the visualization, we used a Dynamic Behavioral Graph [22] to incorporate event frequencies, protocol types, packet contents and data flow information into one graph. In contrast to a typical Force-Directed Graph such as D3 [16], our model goes beyond the aesthetic layout of a graph to reveal the dynamic sequential patterns in a three-dimensional virtual space. In the model, the attraction force between a pair of nodes is calculated using formula:

$$ f_{a} = \frac{{\left\| {x_{j} - x_{i} } \right\|^{2}}}{\alpha \cdot T} $$
(1)
$$ f_{r} = \frac{\beta }{{\left\| {x_{j} - x_{i} } \right\|^{2} }} $$
(2)

where: i and j are distinct nodes, α is the value of elasticity where a greater value increases the length of the edge. \( \beta \) is the coefficient for repulsion force. T is equal to the average time between each nodes’ timestamps and \( \left| {\left| {x_{i} - x_{j} } \right|} \right| \) is the distance between two nodes.

We use a gradient arc for displaying the direction of edges. The decrease of alpha value indicates the direction, with 1 at the source and 0 at the end. This novel visual representation also enables us to add the attributes to the edges. See Fig. 2.

Fig. 2.
figure 2

Examples of the dynamic behavior graph of an MDN

Simple pheromone-based movement can produce sophisticated dynamic patterns. Conventional Ant Colony Optimization (ACO) models assume that insects walk along paths that connect various nodes. Pheromones can be overlaid in multiple layers. Furthermore, pheromones decay at a certain rate. If they did not decay, ants would risk repeating the same route and not respond to a rapidly changing environment. Here, we generalize pheromone deposit and decay on paths of a network. The amount of pheromones at a pixel position at time t is:

$${\text{Deposit:}} \quad D\left( t \right) = min\;(\sum\nolimits_{i = 0}^{N} {u_{i} \left( t \right)} ,M) $$
(3)
$$ {\text{Decay:}} \quad D\left( t \right) = max\;\left( { u_{i} \left( t \right) - r \cdot t, L } \right) $$
(4)

where, \( D\left( t \right) \) is the current pheromone level at a particular path i between two nodes. M and L are the upper and lower bound limits to it. \( u_{i} \left( t \right) \) is an individual pheromone deposit at time t and N is the total number of deposits on that particular edge. ‘r’ is the linear decay rate. See Fig. 3.

Fig. 3.
figure 3

Pheromone deposit and decay model for representing the persistency of the malware distribution channels (connected edges in the graph).

5 Graph Display Optimization

Whenever there is a new node entering the visualization space, the algorithm has to measure the distance between the new node and all the rest of the nodes nearby to ensure that they won’t collide to each other, which is computationally expensive. In order to speed up the process, we optimize the repulsion calculation along the Z-axis. Here is the pseudo code of the method:

figure a

This enables to speed-up in orders of magnitude and to handle more nodes in the model. For example, before the optimization, it took 30 min to process 2,000 nodes. After the optimization, it takes less than 5 min to process the same amount of the nodes. See Fig. 4.

Fig. 4.
figure 4

A compression mapping along Z-axis in order to filter out the nodes that are further than a threshold.

6 Data Collection

We have collected the data for creating MDN graphs for a 9-month period from 19 January 2017 to 25 September 2017. We used the Google Safe Browsing (GSB) Transparency Report as our main source of data. We seeded the process by requesting a GSB report for the known websites such as notorious vk.net once every 24 h. The site vk.net was selected as the seed website based on a four month observation of the site reliably appearing on GSB. The report, in JSON format, consisted of various statistics as shown in Appendix A. The statistics of interest to us were labeled: name, sendsToAttackSites, receivesTrafficFrom, sendsToIntermediary-Sites, lastVisitDate, and lastMaliciousDate. If the returned report listed domains for labels 2, 3, or 4, those domains were added to the queue to be requested to GSB. This was an exhaustive recursive process that continued until a report with no more domains listed for these labels was received from GSB. We labeled nodes for domains listed with nonempty statistic 2 and 4 as an Intermediate Site (MI), and domains with empty statistic 3 as a Malicious Host (MH) or Root Malicious Host (RMH). The data provided in the GSB report only allowed us to create outgoing edges form the node for the current domain under analysis. All nodes are initially labeled as a Malicious Node (MN) and could be relabeled to MH or MI if the GSB report for a domain listed in 3 contained the current domain name in 2 or 4. If this was the case, an incoming edge can be created from the domain listed in 3 to the current domain under analysis. An MN with no incoming edges for the current collection was relabeled to a Root Malicious Node (RMN). This node is unique to our MDN graphs as it cannot be determined from the GSB reports alone. Our collection process occurred daily starting at 9:00 am EST and required from 4 to 11 h to complete, thus starting and finishing within the same calendar date. We decided on collecting just once a day after extensive manual analysis of the GSB report’s value for label 6 over several weeks revealed GSB tended to perform diurnal updates of their report details.

7 Malware Attribution

We constructed a filtered data set because we want to retain significant patterns for effective visualizing and the design of such a filter must prevent over-trivializing the data. We therefore used VirusTotal scans sites that perform malware distribution on submitted URLs and files logs a report for each query. When given the list of malicious domains, VirusTotal gives the set of reports which list malware linked to the sites. Our malware-attributed data set is the set of TLDs that appear in at least one report of a known malware from VirusTotal. However, as reports are generated only when a malware scan query is sent to VirusTotal, there may not be a report corresponding precisely to the last malicious time reported by Google Safe Browsing. Therefore, the filtered dataset is the set of TLDs that have a report that marks the site as containing malware within 10 days of when Google Safe Browsing determined the site to be malicious. The visualization images provided are based upon this malware-attributed set. More specifically, the visualized dataset is the set of edges that either receive or send traffic from a site that has a properly timed report denoting it as malicious. See Appendix B and C for a sample of the VirusTotal report and the statistical summary of the collected data from GSB and VirusTotal. See Fig. 5.

Fig. 5.
figure 5

The malware distribution data collection and attribution process

8 Visualization Results

The data we visualized are the malware-attributed data as described in the prior section, containing all edges that have at least one endpoint being a malware attributed site. Each edge is label with corresponding capture date time stamps, and rendered in chronological order, spanning across the entire 9 months of data. The rendering process iteratively adds each edge supplied to the graph, depending on whether both endpoints of the edge already exists in the graph, the visualization process will add nodes to the graph and then render a cold colored edge. Each day, the color of the existing edge becomes warmer until it reaches red, reflecting the age of each malicious link. If an existing edge is present in a later collection, the edge will be reassigned a cold color. When an edge is red, it will disappear when the visualization reaches the next day if the edge is not present in the data set for the next day. Given that each edge is directed, opaqueness of the edge demonstrates which node is sending traffic and which node is receiving traffic in the link. Figure 6 through Fig. 8 show the clusters of MDN evolved on January 19th, 20th, and 21st of 2017. Figure 9 shows a close-up of the MDN on September 25th of 2017 (Fig. 7).

Fig. 6.
figure 6

A cluster of malware distribution network on January 19, 2017. (Color figure online)

Fig. 7.
figure 7

A cluster of malware distribution network on January 20, 2017. (Color figure online)

Fig. 8.
figure 8

A cluster of malware distribution network on January 21, 2017. (Color figure online)

Fig. 9.
figure 9

A close-up of the cluster of malware distribution on September 25, 2017. (Color figure online)

From our visualization of the malware-attributed dataset, we can observe five key dynamic phenomena: First, having a clear distinction between days, we can observe the clustering effect of the malware nodes. Large clusters do not form rapidly. Instead, a cluster of significant size in general requires one to two weeks to form. It is rare for edges to be continuously added to a singular node in succession to form a super node, instead, smaller chains will first form over time before multiple chains gather together to form a large cluster. In particular, for the months of January, March, and June, we are able to observe a large centralized cluster of similar websites. This highlights potential large-scale malware attacks that may be occurring during those particular periods and the similarities in the distribution network used for those attacks.

Second, the visualization process highlights the quantitative differences in number of malicious nodes that exist for any particular month visually. By examining the time stamps presented and the quantity of edges present on screen we can visually determine the change in quantity of data over time. For example, we can observe that the quantity of malicious nodes peaks in our dataset during February. We can also visually compare the outbound and inbound edge frequencies across different time periods as well as examine whether high-edge-count sites are scattered or closely interconnected.

Third, the visualization is the ease of determining and examining the site with traits of particular interest. For example, root malicious nodes (RMN) are easily identified visually as each edge has a gradient effect where the opaque end marks the source and the transparent end marks the destination, so a node with only outgoing edges can be identified as a root malicious host (RMH) while examining the visualization. Furthermore, because the visualization allows us to traverse the graph in 3D as well as zoom in and out, we can examine the changes in the nodes, such as a malicious root node (RMN), is connected to, as well as observe the chains that the root malicious nodes (RMN) are part of as they dissipated and form. For example, in Fig. 9, we can zoom in to a specific node to see the node’s address as well as the address of the adjacent nodes in the visualization.

Fourth, given that the data visualized is labeled with time stamps, we can observe changes per specific quantity of time. As mentioned the prior section, the data is filtered to those with a malware-attributed date within 10 days of the capture date. With this time frame in mind, when combined with the timestamps displayed in the visualization, we can view specific changes in whether a link is malicious or not and the time frame which it is malicious. Using this we can view whether past links are deprecated as new links are established and have approximations for overlapping active times in a malware distribution chain.

Finally, the timestamps are further supplemented by the pheromone effect. As describe at the beginning of this section, edges decay over time, with colors representing the age of the edge. The key benefit of this method is the ability to observe the frequency a link is used in malware distribution, at each time in a new collection, when an existing edge is mentioned again, it will be restored to the color that represents very recent. Therefore, in the duration which an edge remains in existence, the frequency of being reset to the coldest color will approximate the frequency the link was malicious. This pheromone method also highlights the order of deprecation in malware distribution chains by showing which links first becomes unused. By decaying edges gradually, we can more easily see changes in chains as malware distribution links shift over time. Overall, the pheromone effect in the visualization give visual feedback on such pattern in a visually identifiable manner.

9 Conclusions

In this study, we provide a biologically-inspired approach for the construction and visualizing an MDN’s topological structure. We collect and visualize an MDN data over a period of 9-month in order to gain insight on its structure, persistence, and evolution over time. Our MDN graphs were based on Google Safe Browsing (GSB) reports and malware attribution from VirusTotal. Our research shows the novel approach of leveraging GSB and VirusTotal reports to graph MDNs reveals deep insight into structural changes over time. We found that the pheromone-based visualization model reveals the existence of persistent subnet works and individual TLDs’ critical to the successful operation of MDNs.

The pheromone model shows the evolving MDN dynamics on daily basis, which enables analysts to zoom in and out, pause, replay, and fast-forward the animation interactively. Furthermore, we found use of crowdsourcing data from Google Safe Browsing and VirusTotal are critical for constructing MDN networks. We need to explore the open source space for better availability of data and more accuracy of the attribution of malware types, names, and contents.

From the visualization, we are able to notice significant malware distribution clusters. Specifically we found large malware distribution clusters in the months of January, March, and June. The visualization also highlighted how large distribution clusters require approximately two weeks to form and also an approximate life span of two weeks.