Attention-Based Cross-Modal CNN Using
Non-Disassembled Files for Malware
Classification
Dr Prabhakar Marry, Assistant Professor, Dept. of IT, Vignan Institute of Technology & Science (A), Hyderabad (marryprabhakar@gmail.com)
K. Sreelekha, UG Scholar, Dept. of IT, Vignan Institute of Technology & Science (A), Hyderabad (kattekolasreelekha@gmail.com)
K. Shivani, UG Scholar, Dept. of IT, Vignan Institute of Technology & Science (A), Hyderabad (kondashivani1105@gmail.com)
V. Gopal, UG Scholar, Dept. of IT, Vignan Institute of Technology & Science (A), Hyderabad (varthyagopal14@gmail.com)
K. Satya Hemanth Kumar, UG Scholar, Dept. of IT, Vignan Institute of Technology & Science (A), Hyderabad (hemanthkumarkoppisetty@gmail.com)
Abstract: The rapid growth of new malware variants makes it critical to classify malware correctly. By grouping malware into families, security experts can apply the right strategies and tools to deal with different types. Traditionally, this classification uses high-level information like disassembled code, which can give good results. However, this method assumes that every malware sample can be perfectly disassembled, an assumption that does not hold true in practice. Advanced malware often includes techniques to confuse disassemblers, leading to wrong or incomplete code extraction. To address this, our study focuses on malware family classification without relying on disassembly. We introduce a new CNN-based model that works directly with binary files (non-disassembled malware). Our model uses two different types of data extracted from these binaries: malware images and structural entropy values. These two views capture information at different levels (bytes and chunks), complementing each other's strengths and weaknesses. We further use a cross-modal attention mechanism that intelligently blends features from both types of data, overcoming their individual limitations. We tested our model on three well-known datasets: Microsoft Malware Classification (Kaggle), Malimg, and BODMAS. The results show that our method classifies malware families more accurately than earlier techniques, and importantly, it does so without the need for disassembly.

Index terms: Malware classification, structural entropy, malware image, deep learning, convolutional neural network, attention mechanism.

1. INTRODUCTION
The shift toward remote education and working from home during the COVID-19 pandemic has heightened cybersecurity risks. Social engineering attacks exploiting topics like new vaccines, government announcements, and online meeting links have easily tricked employees and students. Collaboration platforms have also been misused for spreading malware. Additionally, malware itself has evolved quickly; the malicious behaviors shown by malware samples rose from an average of 9 to 12 during the pandemic. New business models like malware-as-a-service now offer illegal services where malware can be rented easily. With this rapid advancement in malware, the importance of malware family classification has become crucial. Malware family classifiers group malware samples into specific families by using shared code, behaviors, or attack strategies common to each family. This simplifies the analysis because analysts can apply known methods to study malware within the same family. Thus, accurate malware family classification is critical to quickly respond to the growing number of malware variants.

Deep learning-based malware classification has drawn a lot of attention because of deep learning's success in fields like computer vision and natural language processing. Deep neural network models now outperform traditional machine learning classifiers in this area. Deep learning models can automatically extract multiple levels of features without expert input, whereas conventional models rely on hand-picked features selected based on expert knowledge. Because of this ability to learn features automatically, deep learning models can utilize a wide variety of inputs, mainly categorized into dynamic and static features. Dynamic features refer to malware's behavior during execution. These include API call sequences, network activities, memory usage, registry changes, and execution traces, extracted by running malware in controlled environments like virtual machines (VMs). Dynamic analysis reveals the true malicious nature of a file. However, executing malware is slow and requires maintaining a special environment. Moreover, sophisticated malware often uses anti-VM, anti-debugging, and anti-emulation techniques, meaning it can detect these environments and behave like benign software to evade detection.

Static features, on the other hand, are extracted directly from binary or disassembled files without execution. Static features from binary files include raw byte sequences, byte entropy, histograms, structural entropy, and printable strings. These are based on the raw bytes and need no execution or disassembly. Meanwhile, disassembled files reveal semantic information like opcodes, control flow graphs, and function call graphs. These can provide clues to malicious behavior, but depend heavily on the accuracy of disassemblers like IDA Pro. Disassembly can often be flawed, and static features extracted from incorrectly disassembled code may mislead the classifier, reducing its accuracy. Furthermore, malware often uses anti-disassembly techniques like packing and obfuscation, which complicate static analysis.
Multi-modal learning combines information from different types of data (modalities), such as images and text, producing richer feature representations. It is promising for malware family classification because static and dynamic features can be expressed in multiple modalities. For example, HYDRA used static features like APIs, opcodes, and raw bytes from disassembled and binary files, processed by different components. Orthrus used two CNNs to handle two different types of inputs: byte sequences and opcode sequences. In another approach, EfficientNet processed images visualized from disassembled files, while a 1D-CNN processed raw byte sequences. Different CNNs were used depending on the nature of each input. These multi-modal classification models have been effective, as multi-modal learning allows better exploration of the malware feature space. However, when models rely on features from disassembled files, they inherit the problems caused by incorrect disassembly. Thus, even multi-modal models using disassembled code features can struggle to accurately detect sophisticated malware. In this work, we present a new attention-based cross-modal model for malware family classification. Our aim is to accurately classify malware families using static features extracted from non-disassembled binary files. By focusing on non-disassembled files, we avoid the issues tied to disassembly errors. Our approach uses malware images and structural entropy values extracted from binaries. Malware images are visualizations of byte-level data, while structural entropy is computed based on chunks of bytes (512 bytes, 1024 bytes, etc.). These two modalities have both been shown to be effective individually in malware analysis.

Moreover, these two features complement each other: malware images are sensitive to byte-level obfuscation, while structural entropy is chunk-based and thus more resistant to small changes. However, entropy loses fine-grained byte-level information, which images retain. To fully utilize this complementarity, we designed a new attention-based cross-modal CNN. We use two separate CNNs to independently extract intermediate representations from images and structural entropy. These representations are then fused through a cross-modal attention mechanism, allowing the model to learn how the two modalities correspond and interact. We compared our model to previous classification methods to evaluate its performance and found that it improved results over existing approaches.

Our main contributions are summarized as follows: We observed that malware images and structural entropy have different granularities (bytes vs. chunks) that complement each other. Both are derived directly from binaries without disassembly, making the model robust against obfuscation and packing techniques. We introduced a cross-modal attention fusion mechanism that effectively aligns byte-level and chunk-level information, resulting in rich and comprehensive representations. We validated the effectiveness of our model across three malware datasets, comparing it against both unimodal and multi-modal approaches. Our model outperformed previous classification models that used static features from binaries.

The rest of the paper is structured as follows: Section II discusses prior work and background on deep-learning-based malware classification. Sections III to VII explain the design of our cross-modal attention-based model. Section VIII presents the experimental results on the Microsoft Malware Classification Challenge (Kaggle), Malimg, and BODMAS datasets, including comparisons with baseline models. Finally, Section IX concludes the study.

2. LITERATURE SURVEY WITH BACKGROUND
This section provides previous related works on deep learning-based malware analysis, including malware classification, and the background of our work. The first two subsections present previous studies on the use of static features with and without disassembly. Then, we discuss malware classification models with dynamic features in Section II-C, and Section II-D describes how multi-modal learning over multiple modalities has been used for malware classification.

A. MALWARE CLASSIFICATION USING STATIC FEATURES FROM DISASSEMBLED FILES
Many previous malware classification models depend on features extracted from disassembled binary files. One of the main features observed in the disassembled files is the opcode. By describing the operational meaning of a program, opcodes enable us to predict the maliciousness of disassembled files in many cases. Previous work on word2vec-based long short-term memory (LSTM) used opcode sequences and adopted an LSTM to consider the sequential order of the opcodes appearing in the file. They encoded opcodes using the word2vec embedding method and then fed the resulting representation to the LSTM.

Qiao et al. proposed generating grayscale images from word-vector similarities calculated from opcodes, which were then conveyed to a CNN for malware classification.

Zhang et al. used five machine-learning methods to classify ransomware families and applied statistical measures to opcode sequences to obtain latent information. They extracted n-gram sequences from an opcode sequence and selected meaningful n-gram sequences. Then they calculated the TF-IDF for the selected sequences and used n-gram sequences with high TF-IDF values as features for classification. Rather than the entire opcode sequence, some previous studies were based only on the API call sequences from the disassembled binary code. However, as mentioned earlier, malware developers adopt anti-disassembly techniques (e.g., obfuscation and packing techniques) to prevent malware from being analyzed. Therefore, always extracting high-level static features such as opcodes and API call sequences from malware samples is implausible.

B. MALWARE CLASSIFICATION USING STATIC FEATURES WITHOUT DISASSEMBLED FILES
Several studies have explored malware classification using features that do not require disassembly. Among these, visualizing malware as images has become a common approach. Such images can be directly generated from binary files and are frequently input into CNN models, as CNNs have shown excellent results in image classification tasks. CNN-based malware classifiers were developed using grayscale images, where byte values ranging from 0-255 were mapped to pixel intensities.

Xiao et al. introduced a different visualization technique by embedding the malware's structural information from the PE format into the images. This method aimed to minimize the misclassification between malware families by distinguishing pixel patterns based on different structural layouts. Additionally, they proposed a two-phase process involving CNN-based feature extraction followed by SVM-based classification. Nevertheless, malware images face a fundamental issue of information loss due to resizing operations required by classification models. Moreover, obfuscated or encrypted malware sections can distort critical patterns, affecting recognition accuracy. Structural entropy emerges as another effective feature that can be extracted without disassembly. It consists of an entropy sequence obtained by calculating the entropy values of chunks across a binary file. Gibert et al. leveraged the entropy sequence, transformed via wavelet analysis, and applied it to a CNN-based classifier. Their results showed that entropy-based features are robust against code obfuscation, although they may lack detailed malware patterns. For instance, encrypted sections exhibit high entropy, while dead code inserted for obfuscation tends to have very low entropy. To build a more reliable malware classification method without the need for disassembly, this study proposes a complementary use of malware images and structural entropy, which enhances the analysis of sophisticated malware employing advanced anti-disassembly techniques.
C. MALWARE CLASSIFICATION USING DYNAMIC FEATURES
Dynamic features capture the behavior of a program during its execution. Examples of such features include API calls, network activities, and registry snapshots, all collected at runtime. Among these, API call sequences and opcode sequences are widely used because malicious actions are ultimately executed through APIs and opcodes.

Xue et al. developed two classifiers utilizing CNNs and other machine learning approaches. While the CNN-based classifier processes malware images, the second classifier focuses on variable n-grams of API call sequences obtained during execution. Malware language models have also been introduced to predict the next API call based on previous sequences, treating API calls similarly to how traditional language models handle words.

Pektaş et al. classified malware by mining meaningful subsequences of API calls at runtime rather than using complete sequences for each malware instance. Hansen et al. created a random forest classifier that uses statistical data about API calls as its input. Pascanu et al. proposed a recurrent neural network-based approach that employs API call sequences for classifying malware families.

Unlike static features, collecting dynamic features necessitates running malware samples in a virtualized environment such as a sandbox. Nonetheless, advanced malware can detect such environments and either halt execution or alter its behavior, making it challenging to reliably gather dynamic features.
D. MALWARE CLASSIFICATION BASED ON MULTI-MODAL LEARNING
Multi-modal learning is a deep learning approach that simultaneously processes multiple types of data. Its goal is to maximize the information available by merging different modalities. When a single modality cannot capture all relevant information, other modalities can compensate. There are various methods for combining modalities in multi-modal learning. Early fusion merges data from different modalities at a lower level, and the fused data is passed into a classifier. In late fusion, the individual classifiers' outputs for different modalities are combined. Models that simultaneously handle images and text often employ multi-modal learning. In the context of malware classification, several models have adopted multi-modal strategies.

Gibert et al. used the late fusion technique to combine API calls, raw bytes, and opcodes, leveraging both low-level and high-level features to boost classification accuracy. Han et al. applied early fusion by merging static API sequences, obtained from disassembled files, and dynamic API sequences, recorded during execution. In other work, malware identification was achieved through a multi-modal clustering method using features like PE headers, string hashes, IAT, and byte entropy. Recently, cross-modal learning has gained attention as a novel form of multi-modal learning, emphasizing the adaptive integration of multiple modalities. Cross-modal learning has been actively explored in computer vision and natural language processing to efficiently combine diverse data sources. In our classification approach, we incorporate a cross-modal attention mechanism to fuse data. This enables flexible integration of malware images and structural entropy, allowing each modality to supplement the other effectively.
III. METHODOLOGY OF THE ATTENTION-BASED CROSS-MODAL CNN FOR MALWARE CLASSIFICATION
Our malware classification system is built upon CNNs, which learn and detect patterns from the provided input data. CNNs are widely applied in malware detection and classification tasks due to their ability to extract higher-level features from raw low-level data without requiring additional manual feature engineering. As previously explained, our model accepts malware images and structural entropy as input modalities, both of which do not need disassembly procedures. These inputs are fused using a cross-modal attention mechanism, which is detailed later in the paper. Here, a malware image is viewed as a sequence of fixed-size byte chunks, similar to how structural entropy is framed as a series of entropy values across chunks. This
representation enables the cross-modal attention module to align meaningful parts independently across the two different modalities. By computing cross-modal attention, the model strengthens its feature set by enriching feature dimensions with information drawn from another modality. A one-dimensional CNN (1D-CNN) is employed to handle these sequential chunk-based representations effectively. 1D-CNNs are suitable for processing sequences because their filters move along a single input dimension. Each shallow 1D-CNN processes a specific modality. Outputs from these shallow CNNs are then combined through the cross-modal attention mechanism. While both malware images and structural entropy are widely used for classification, individually they suffer from accuracy limitations due to inherent drawbacks.
The proposed model boosts the strength of each modality by
jointly learning from both, enabling them to cover each
other's weaknesses. Structural entropy features are aligned
to elements within the malware image representations, thus
countering the distortion issues that malware images face
due to obfuscation. Meanwhile, malware image features are
aligned with elements in the structural entropy
representations, addressing the information loss that occurs
in structural entropy. Essentially, the shortcomings of each
modality are compensated for using the strengths of the
other. Consequently, the feature representations’ dimensions
for both modalities are expanded. By integrating chunks
from one modality with enriched data from the other, the
model constructs reinforced features that capture deeper
patterns across modalities. However, fusing these modalities
is challenging because their respective feature
representations often differ in length, making direct
alignment complex. Specifically, malware images and
structural entropy streams typically vary in size, which
complicates the fusion process.
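To make the overall pipeline of this section concrete, the following PyTorch skeleton shows how the pieces could be composed into one end-to-end model. It is a minimal sketch, assuming hypothetical submodules (shallow per-modality 1D-CNNs, a cross-modal attention block, deeper 1D-CNNs, and a classifier head) that are sketched individually in later sections; the class and argument names are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalMalwareNet(nn.Module):
    """Illustrative end-to-end skeleton of the pipeline in Section III."""

    def __init__(self, shallow_img, shallow_ent, cross_attn,
                 deep_img, deep_ent, classifier):
        super().__init__()
        self.shallow_img = shallow_img    # shallow 1D-CNN for the image branch
        self.shallow_ent = shallow_ent    # shallow 1D-CNN for the entropy branch
        self.cross_attn = cross_attn      # cross-modal attention fusion (Section V)
        self.deep_img = deep_img          # deeper 1D-CNN for enriched image features
        self.deep_ent = deep_ent          # deeper 1D-CNN for enriched entropy features
        self.classifier = classifier      # fully connected head (Section VI)

    def forward(self, img_seq, ent_seq):
        # Shallow 1D-CNNs extract intermediate chunk-level representations.
        x = self.shallow_img(img_seq)
        y = self.shallow_ent(ent_seq)
        # Cross-modal attention enriches each modality with the other.
        rx, ry = self.cross_attn(x, y)
        # Deeper 1D-CNNs extract high-level features per modality.
        fx = self.deep_img(rx)
        fy = self.deep_ent(ry)
        # Late fusion by concatenation, then malware family classification.
        return self.classifier(torch.cat([fx, fy], dim=1))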
IV. INPUT DATA AS MULTI-MODAL DATA
A. MALWARE IMAGE
Our proposed model learns from two different modalities
using CNNs, and the first modality is the malware image. A
malware image is generated by transforming each byte from
a binary file into a pixel value ranging between 0 and 255,
without needing to disassemble the file. After conversion,
the image is resized to fit the CNN input dimensions. An
example of a malware image used in this work is shown in the corresponding figure. Malware images are effective in capturing the overall characteristics of malware; however, they are highly susceptible to noise caused by obfuscation and encryption techniques.

While traditional research typically employs two-dimensional CNNs (2D-CNNs) for malware image analysis, where kernels move across two spatial dimensions, our approach utilizes a one-dimensional CNN (1D-CNN), where kernels slide along a single direction. This design choice enables better integration with structural entropy features, which are also aligned along a single dimension, making it easier to fuse data across different modalities. In contrast, the 2D kernel movements complicate feature alignment between diverse modalities. Additionally, our internal experiments demonstrated that a malware family classification model built with a 1D convolutional layer achieves accuracy comparable to that of conventional 2D-CNN-based models.

To summarize, we generated malware images directly from raw binary files by mapping bytes to pixel intensities between 0 and 255. These images were then down-sampled to a final resolution of 64×784 pixels. Using 1D-CNNs, the model extracts low-level features by scanning the image in a single direction.
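As an illustration of the conversion just described, the following Python sketch maps raw bytes to grayscale pixel intensities and down-samples the result to the fixed 64×784 resolution. The function name, the row-major byte layout, and the use of Pillow for resizing are assumptions made for the sketch, not the authors' exact preprocessing code.

import numpy as np
from PIL import Image

def binary_to_malware_image(path, height=64, width=784):
    """Sketch: map raw bytes to 0-255 pixel values, lay them out as a 2D
    grayscale image, and resize to a fixed height x width for the CNN."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    # Lay the byte stream out row by row with a fixed row width.
    rows = int(np.ceil(len(data) / width))
    padded = np.zeros(rows * width, dtype=np.uint8)
    padded[:len(data)] = data
    img = Image.fromarray(padded.reshape(rows, width), mode="L")
    # Down-sample so every binary yields the same input size (64x784 here).
    img = img.resize((width, height), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0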
B. STRUCTURAL ENTROPY
The second modality in our model is structural entropy combined with section information. Structural entropy represents a file as a sequence of entropy values calculated from the frequency of bytes within fixed-size chunks, offering robustness against obfuscation. It is computed using Shannon's entropy formula:

H(x) = −∑_{j=1}^{n} p(x_j) log2 p(x_j)

where n is 255 (the range of byte values) and p(x_j) is the probability of byte value x_j within a chunk. Malware samples from the same family often display similar structural entropy patterns.

Executable files (PE format) are organized into headers and sections like .text or .data, which serve specific execution roles. Malware often alters or adds non-standard sections, making section information valuable for distinguishing malware from normal files. However, traditional structural entropy lacks section information. To address this, we enhance the structural entropy by appending section information as one-hot vectors. Standard section names (e.g., .text, .data, .rdata) are encoded directly, while non-standard sections are labeled as "undefined sections" to handle irregular naming.
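The following Python sketch illustrates how a structural entropy sequence enriched with section one-hot vectors could be computed under the description above. The chunk size, the list of standard section names, and the section_of_offset helper (e.g., backed by a PE parser such as pefile) are hypothetical choices for the sketch, not the paper's exact configuration.

import math
from collections import Counter

# Assumed fixed vocabulary of standard PE section names; anything else
# is mapped to the catch-all "undefined section" slot described above.
STANDARD_SECTIONS = [".text", ".data", ".rdata", ".rsrc", ".reloc"]

def chunk_entropy(chunk):
    """Shannon entropy H(x) = -sum_j p(x_j) * log2 p(x_j) over one chunk."""
    counts = Counter(chunk)
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def structural_entropy(data, section_of_offset, chunk_size=1024):
    """Sketch: one entropy value per fixed-size chunk, concatenated with a
    one-hot vector for the section the chunk belongs to."""
    features = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        one_hot = [0] * (len(STANDARD_SECTIONS) + 1)
        name = section_of_offset(off)  # assumed helper: offset -> section name
        idx = STANDARD_SECTIONS.index(name) if name in STANDARD_SECTIONS \
            else len(STANDARD_SECTIONS)  # "undefined section"
        one_hot[idx] = 1
        features.append([chunk_entropy(chunk)] + one_hot)
    return features  # shape: (num_chunks, 1 + number_of_section_labels)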
V. ATTENTION-BASED CROSS-MODAL FUSION
This section explains how we combine malware images and structural entropy using a cross-modal attention mechanism. Each modality captures different strengths: malware images reveal code patterns, while structural entropy resists obfuscation. Using both together improves malware detection.

Inspired by attention techniques in machine translation, we calculate relevance scores between the two modalities using:

score(S, h) = Softmax(h × W_a × S^T)

where W_a is a learned attention matrix. A context vector is then created by weighting the source data based on these scores.

In our design, cross-modal attention aligns information from malware images with structural entropy and vice versa, even if the sequence lengths differ. Both image and entropy sequences are passed through 1D-CNN layers. After convolution, both sequences have features of size 64. The malware image sequence X and the entropy sequence Y are related through attention:

A = Softmax(Y × W_a × X^T),  C = Y^T × A

The context vectors C are concatenated with the image features X to form R_X = [X, C]. This enriches each window of X with relevant entropy information. The same process is done in the opposite direction, allowing mutual reinforcement between the two data types. Finally, the fused information is fed into deeper CNNs to extract higher-level features, making the system more robust and accurate.
vector is then created by weighting source data based on VII. TRAINING AND TESTING
these scores. The proposed cross-modal CNN is an end-to-end deep
learning model, meaning it trains feature extraction and
In our design, cross-modal attention aligns information from classification together without separate steps.
malware images with structural entropy and vice versa, even Training Phase: Cross-modal attention is used to enhance
if the sequence lengths differ.Both image and entropy feature learning across modalities. The CNN and fully
sequences are passed through 1D-CNN layers. After connected layers are trained together, updating weights
malware image sequence 𝑋,X and the entropy sequence 𝑌,
convolution, both sequences have features of size 64. The based on selected hyperparameters (see Table 1). Testing
Phase: The trained model classifies malware families. 10-
𝐴=Softmax(𝑌×𝑊𝑎×𝑋𝑇),
Y are related through attention: fold cross- validation is used to evaluate performance,
ensuring reliability even with a small dataset.
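The deep feature extractor and classifier of Section VI can be sketched in PyTorch as follows. The padding and the global average pooling used to obtain a fixed-length vector per branch are assumptions made to keep the sketch self-contained; they are not stated in the text.

import torch
import torch.nn as nn

def deep_branch(in_channels=128, filters=70):
    """One deep 1D-CNN branch as described in Section VI: four Conv1d layers
    with 70 filters of size 3, each followed by batch normalization, ReLU,
    and max-pooling with pool size 2 and stride 2 (a sketch only)."""
    layers, ch = [], in_channels
    for _ in range(4):
        layers += [nn.Conv1d(ch, filters, kernel_size=3, padding=1),
                   nn.BatchNorm1d(filters),
                   nn.ReLU(inplace=True),
                   nn.MaxPool1d(kernel_size=2, stride=2)]
        ch = filters
    return nn.Sequential(*layers)

class FusionClassifier(nn.Module):
    """Late fusion and classification: pool each branch, concatenate,
    two fully connected layers (1000 and 300 nodes), then softmax."""

    def __init__(self, num_families, filters=70):
        super().__init__()
        self.img_branch = deep_branch()
        self.ent_branch = deep_branch()
        self.fc = nn.Sequential(
            nn.Linear(2 * filters, 1000), nn.ReLU(inplace=True),
            nn.Linear(1000, 300), nn.ReLU(inplace=True),
            nn.Linear(300, num_families))

    def forward(self, rx, ry):
        # rx, ry: fused sequences from cross-modal attention,
        # shaped (batch, channels=128, length) for Conv1d.
        fx = self.img_branch(rx).mean(dim=2)  # global average pooling (assumed)
        fy = self.ent_branch(ry).mean(dim=2)
        logits = self.fc(torch.cat([fx, fy], dim=1))
        return torch.softmax(logits, dim=1)   # per-family probabilities

For the end-to-end training described in Section VII, one would typically feed the logits to nn.CrossEntropyLoss and evaluate with 10-fold cross-validation (for example, via sklearn's StratifiedKFold to preserve family proportions in each fold).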
VIII. EXPERIMENTATION
A. DATASET RESULTS
We evaluated our proposed model using three datasets: the Kaggle Microsoft Malware Classification Dataset, Malimg, and BODMAS-14. The model was implemented in PyTorch and trained on a GeForce RTX 3090 GPU with 24 GB of memory.
For the Microsoft Malware Classification Dataset, we used 10,868 malware samples from nine families. Binary files were converted into 64×784 pixel images, replacing unidentified bytes ("??") with FF. Structural entropy vectors of size 14 × length were generated using the available section information from the assembly files, and the input size was fixed at 3600 for the CNN.
The Malimg dataset provided 8,250 malware samples from 25 families after filtering missing binaries. Malware binaries were used to create images and structural entropy features, with a maximum input length fixed at 873.
For BODMAS-14, we selected 14 malware families with over 1,000 samples each, resulting in 34,368 binaries. Malware images and structural entropy vectors were generated, treating files with missing headers as unknown sections. The maximum input length was fixed at 4000 for the CNN input.

B. RESULTS FOR MICROSOFT MALWARE CLASSIFICATION CHALLENGE DATASET
1) Performance Comparison of Baseline Models
To evaluate our approach, we compared our proposed cross-modal CNN model against several baseline models that used either malware images or structural entropy separately. We also implemented a late fusion model without cross-modal attention to highlight the effectiveness of our proposed attention mechanism. Specifically, Baseline 1 (B1) used structural entropy combined with section information, Baseline 2 (B2) used malware images only, and Baseline 3 (B3) fused both modalities but without applying cross-modal attention.
As shown in Table 5, our model outperformed all baseline models. Compared to single-modality models, our attention-based cross-modal CNN improved accuracy by up to 1.93% and the F1-score by up to 0.054. Even when compared to the late fusion model (B3), which used both modalities without attention, our model demonstrated superior classification ability, proving that cross-modal attention helps in effectively integrating information from malware images and structural entropy.
The confusion matrices reveal that our model significantly improved classification for minority classes. For instance, for the Simda family, which had relatively few samples, B1 and B2 achieved low accuracy (~0.5), whereas our model achieved a much higher accuracy of 90.4%. Although B3 improved classification for Simda compared to using only one modality, it still lagged behind the proposed model. Furthermore, for the Obfuscator.ACY class, our model recorded the highest accuracy among all models tested. These results clearly demonstrate that cross-modal attention enhances the fusion of multi-modal features, leading to robust performance even on imbalanced datasets.
2) Performance Comparison with Other Research
We further compared our model's performance with previous malware classification studies that used static features either with or without disassembly. Studies like Gibert et al., Mays et al., and HYDRA used features such as API calls, command frequencies, and instruction sequences extracted after disassembly to classify malware families, achieving strong performance. As shown in Table 6, models based on disassembled code generally performed better because disassembly provides high-level semantic information.
However, our model achieved a competitive accuracy of 98.72% without requiring disassembly. Compared to non-disassembly-based models, our approach performed better by effectively combining malware images and structural entropy through attention mechanisms. For example, CoLab and MalCVS improved performance by appending section information to malware images, while Gibert et al. used structural entropy with wavelet transforms. Although these approaches performed well, they relied on single-modal input.
Most importantly, our model showed stronger resilience to data imbalance. For the Simda class, despite its small sample size, our model correctly predicted 38 out of 42 instances (90.4% accuracy), outperforming unimodal approaches. Demonstrating the advantage of multi-modal attention in handling minority classes effectively, our cross-modal attention-based CNN not only delivers high accuracy without the complex step of disassembly but also strengthens robustness against imbalanced malware family distributions, offering a powerful and practical solution for malware classification tasks.
C. RESULTS FOR MALIMG DATASET
1) Baseline Model Performance Comparison
We developed baseline models B4, B5, and B6 to benchmark our model's performance, as mentioned earlier. Table 7 highlights the results. B4 and B5 are CNNs trained separately with structural entropy and malware images, while B6 is a multi-modal CNN using a late fusion of both. Our proposed model consistently outperformed all baselines across every metric, achieving 99.09% accuracy and a 0.976 F1-score. As shown earlier, cross-modal attention proves better at combining features than simple concatenation.
Due to the large number of malware families in Malimg, we omit the confusion matrix. Notably, our model perfectly classified UPX-packed samples from the Yuner.A class with 100% accuracy using only binaries.
2) Comparison With Previous Research
This section compares our model against past Malimg dataset studies. Table 8 summarizes the results. While various approaches such as Nataraj et al. and the pretrained model by Mitsuhashi and Shinagawa exist, a direct comparison was not fully possible: we lacked all corresponding binaries for structural entropy extraction, so the number of samples differs across studies. Despite this, our model displayed the best accuracy and F1-score, proving that structural entropy enriches malware image information and enhances classification performance.

D. RESULTS ON BODMAS-14 DATASET
1) Baseline Model Performance Comparison
We also built baselines B7, B8, and B9 to test on BODMAS-14, as explained previously. Table 9 displays the outcomes, with our model again achieving the highest accuracy and F1-score. Using both malware images and structural entropy together clearly gave better results than using either alone. Similar to the Malimg findings, our method succeeded even on this newer dataset. However, as no other research has used BODMAS-14 in this exact way, we could not perform external performance comparisons.

IX. CONCLUSION
In this study, we introduce a cross-modal convolutional neural network (CNN) for malware classification with impressive results. The proposed model is a multi-modal network trained simultaneously on malware images and structural entropy. By utilizing cross-modal attention, the model strengthens both modalities, compensating for their individual limitations. Our approach achieved remarkable accuracy across three distinct datasets, even with non-disassembled malware samples. Experimental outcomes confirm that using multiple modalities improves performance over single-modality methods, and that cross-modal attention for feature fusion outperforms simple fusion techniques. For future work, we aim to incorporate additional modalities, such as ASCII strings, IP addresses, and import tables, which can be extracted from binaries without disassembly. These modalities carry semantic information not present in malware images or structural entropy, and we plan to integrate them into our model.
REFERENCES
[1] Picus Security, "Red Report 2021: Top ten attack techniques," 2021.
[2] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, "Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey," Inf. Secur. Tech. Rep., vol. 14, no. 1, pp. 16-29, 2009.
[3] A. Abusitta, M. Q. Li, and B. C. M. Fung, "Malware classification and composition analysis: A survey of recent developments," J. Inf. Secur. Appl., vol. 59, Jun. 2021, Art. no. 102828.
[4] W. Han, J. Xue, Y. Wang, L. Huang, Z. Kong, and L. Mao, "MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics," Comput. Secur., vol. 83, pp. 208-233, Jun. 2019.
[5] M. Sikorski and A. Honig, Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software. San Francisco, CA, USA: No Starch Press, 2012.
[6] A. Afianian, S. Niksefat, B. Sadeghiyan, and D. Baptiste, "Malware dynamic analysis evasion techniques: A survey," ACM Comput. Surveys, vol. 52, no. 6, pp. 1-28, Nov. 2020.
[7] M. Hassen, M. M. Carvalho, and P. K. Chan, "Malware classification using static analysis-based features," in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Nov. 2017, pp. 1-7.
[8] H. Zhang, X. Xiao, F. Mercaldo, S. Ni, F. Martinelli, and A. K. Sangaiah, "Classification of ransomware families with machine learning based on N-gram of opcodes," Future Gener. Comput. Syst., vol. 90, pp. 211-221, Jan. 2019.
[9] Hex-Rays, IDA Pro. Accessed: Mar. 7, 2023. [Online]. Available: https://www.hex-rays.com/ida-pro/
[10] D. Gibert, C. Mateu, and J. Planes, "HYDRA: A multimodal deep learning framework for malware classification," Comput. Secur., vol. 95, Aug. 2020, Art. no. 101873.
[11] D. Gibert, C. Mateu, and J. Planes, "Orthrus: A bimodal learning architecture for malware classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1-8.
[12] X. Chong, Y. Gao, R. Zhang, J. Liu, X. Huang, and J. Zhao, "Classification of malware families based on Efficient-Net and 1D-CNN fusion," Electronics, vol. 11, no. 19, p. 3064, Sep. 2022.
[13] D. Gibert, C. Mateu, J. Planes, and R. Vicens, "Using convolutional neural networks for classification of malware represented as images," J. Comput. Virol. Hacking Techn., vol. 15, no. 1, pp. 15-28, Mar. 2019.
[14] M. Xiao, C. Guo, G. Shen, Y. Cui, and C. Jiang, "Image-based malware classification using section distribution information," Comput. Secur., vol. 110, Nov. 2021, Art. no. 102420.
[15] D. Gibert, C. Mateu, J. Planes, and R. Vicens, "Classification of malware by using structural entropy on convolutional neural networks," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 1-6.
[16] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in Proc. Int. Conf. Eng. Technol. (ICET), Aug. 2017, pp. 1-6.
[17] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, "Microsoft malware classification challenge," 2018, arXiv:1802.10135.
[18] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, "Malware images: Visualization and automatic classification," in Proc. 8th Int. Symp. Visualizat. Cyber Secur., New York, NY, USA: Association for Computing Machinery, Jul. 2011, pp. 1-7, doi: 10.1145/2016904.2016908.
[19] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, "BODMAS: An open dataset for learning based temporal analysis of PE malware," in Proc. IEEE Secur. Privacy Workshops (SPW), May 2021, pp. 78-84.
[20] J. Kang, S. Jang, S. Li, Y.-S. Jeong, and Y. Sung, "Long short-term memory-based malware classification method for information security," Comput. Elect. Eng., vol. 77, pp. 366-375, Jul. 2019.
[21] Y. Qiao, W. Zhang, X. Du, and M. Guizani, "Malware classification based on multilayer perception and Word2Vec for IoT security," ACM Trans. Internet Technol., vol. 22, no. 1, pp. 1-22, Sep. 2021, doi: 10.1145/3436751.
[22] A. Bensaoud, N. Abudawaood, and J. Kalita, "Classifying malware images with convolutional neural network models," Int. J. Netw. Secur., vol. 22, no. 6, pp. 1022-1031, Oct. 2020.
[23] D. Xue, J. Li, T. Lv, W. Wu, and J. Wang, "Malware classification using probability scoring and machine learning," IEEE Access, vol. 7, pp. 91641-91656, 2019.
[24] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, "Malware classification with recurrent networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 1916-1920.
[25] B. Athiwaratkun and J. W. Stokes, "Malware classification with LSTM and GRU language models and a character-level CNN," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 2482-2486.
[26] A. Pektaş and T. Acarman, "Malware classification based on API calls and behaviour analysis," IET Inf. Secur., vol. 12, no. 2, pp. 107-117, Mar. 2018.
[27] S. S. Hansen, T. M. T. Larsen, M. Stevanovic, and J. M. Pedersen, "An approach for detection and family classification of malware based on behavioral analysis," in Proc. Int. Conf. Comput., Netw. Commun. (ICNC), Feb. 2016, pp. 1-5.
[28] D. Ramachandram and G. W. Taylor, "Deep multimodal learning: A survey on recent advances and trends," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 96-108, Nov. 2017.
[29] X. Xu, T. Wang, Y. Yang, L. Zuo, F. Shen, and H. T. Shen, "Cross-modal attention with semantic consistence for image-text matching," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 12, pp. 5412-5425, Dec. 2020.
[30] I. J. Cruickshank and K. M. Carley, "Analysis of malware communities using multi-modal features," IEEE Access, vol. 8, pp. 77435-77448, 2020.
[31] P. Velickovic, D. Wang, N. D. Lane, and P. Lio, "X-CNN: Cross-modal convolutional neural networks for sparse datasets," in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Dec. 2016, pp. 1-8.
[32] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2019, pp. 6558-6569.
[33] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379-423, 1948.
[34] J. Kim, E.-S. Cho, and J.-Y. Paik, "Poster: Feature engineering using file layout for malware detection," in Proc. Annu. Comput. Secur. Appl. Conf., Dec. 2020.
[35] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. EMNLP, Aug. 2015, pp. 1412-1421.
[36] J. Yan, G. Yan, and D. Jin, "Classifying malware represented as control flow graphs using deep graph convolutional neural network," in Proc. 49th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2019, pp. 52-63.
[37] M. Mays, N. Drabinsky, and S. Brandle, "Feature selection for malware classification," in Proc. MAICS, Apr. 2017, pp. 165-170.
[38] Y. Zhang, Q. Huang, X. Ma, Z. Yang, and J. Jiang, "Using multi-features and ensemble learning method for imbalanced malware classification," in Proc. IEEE Trustcom/BigDataSE/ISPA, Aug. 2016, pp. 965-973.
[39] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, "Novel feature extraction, selection and fusion for effective malware family classification," in Proc. 6th ACM Conf. Data Appl. Secur. Privacy, Mar. 2016, pp. 183-194.
[40] R. Mitsuhashi and T. Shinagawa, "Deriving optimal deep learning models for image-based malware classification," in Proc. 37th ACM/SIGAPP Symp. Appl. Comput., New York, NY, USA: Association for Computing Machinery, Apr. 2022, pp. 1727-1731.