Deepsign: Deep Learning For Automatic Malware Signature Generation and Classification
• The number of new malware programs is growing exponentially (e.g., on average 160,000 new
malware samples appeared every day in 2013)
• Anti-virus solutions do not detect and analyze malware, or generate signatures for it,
effectively and efficiently.
• Methods for automatic malware signature generation target specific aspects of malware
(e.g., a vulnerability in the Windows operating system)
• Variants of a malware family are not easily recognized by these methods.
• It is difficult to generate signatures that can be used to prevent new attacks by
zero-day malware.
INTRODUCTION
• Autograph records source and destination of connections attempted from outside the
network
• Honeycomb analyzes the traffic on a honeypot and uses longest common substrings (LCS) to
generate signatures and measure similarities in packet payloads
• The PAYL sensor monitors the flow of information in the network and tries to detect
malicious attacks using anomaly detection
• The Nemean architecture is a semantic-aware Network Intrusion Detection System (NIDS)
which normalizes packets from individual sessions in the network and renders semantic
context. A signature generation component clusters similar sessions and generates signatures
for each cluster
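The substring idea behind Honeycomb can be illustrated with a minimal sketch. It uses a simple quadratic dynamic-programming routine as a stand-in, not Honeycomb's own implementation, and the function name and payloads are hypothetical:

```python
def longest_common_substring(a: bytes, b: bytes) -> bytes:
    """Return the longest contiguous byte run shared by two payloads."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # match lengths for the previous row of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Two made-up payloads that share an embedded byte sequence.
candidate_signature = longest_common_substring(b"xxEVILyy", b"aaEVILbb")
```

A substring shared by many suspicious payloads is a candidate signature, and its length relative to the payload size gives a rough similarity measure.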
SOME RELATED WORKS (CONT.)
• Polygraph generates content-based signatures that combine several substring signatures
(tokens) to expand the detection of malware variants
• EarlyBird sifts through the invariant portion of a worm's content that will appear
frequently on the network as it spreads or attempts to spread
• Auto-Sign generates a list of signatures for a malware sample by splitting its executable
into segments of equal size. A signature is generated for each segment, and the list of
signatures is subsequently ranked
LIMITATIONS OF MENTIONED APPROACHES
• Our method converts sandbox files to fixed-size inputs for the neural network in the
following simple steps:
• Extract all unigrams for each sandbox file in the dataset
• Remove the unigrams which appear in all files (they contain no information)
• Count each remaining unigram's frequency
• Select the 20,000 unigrams with the highest frequency
• Convert each sandbox file to a 20,000-bit string by checking whether each of the 20,000
unigrams appears in it
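The conversion steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: tokens are obtained by whitespace splitting, frequency is taken to mean the number of files containing a unigram, and the function names and toy reports are hypothetical.

```python
from collections import Counter


def build_vocabulary(files, top_k=20000):
    """files: one list of unigrams (tokens) per sandbox file."""
    token_sets = [set(tokens) for tokens in files]
    # Assumption: "frequency" is the number of files a unigram appears in.
    file_counts = Counter()
    for tokens in token_sets:
        file_counts.update(tokens)
    n_files = len(token_sets)
    # Drop unigrams present in every file: they carry no information.
    informative = {t: c for t, c in file_counts.items() if c < n_files}
    # Highest count first; ties broken alphabetically for determinism.
    ranked = sorted(informative, key=lambda t: (-informative[t], t))
    return ranked[:top_k]


def to_bit_vector(tokens, vocabulary):
    """1 if the vocabulary unigram occurs in the file, else 0."""
    present = set(tokens)
    return [1 if t in present else 0 for t in vocabulary]


# Toy stand-ins for sandbox reports (hypothetical content).
reports = [
    "CreateFile RegSetValue connect",
    "CreateFile WriteFile connect",
    "CreateFile RegSetValue Sleep",
]
files = [r.split() for r in reports]   # assumed tokenization
vocab = build_vocabulary(files, top_k=3)
vectors = [to_bit_vector(f, vocab) for f in files]
```

Here `CreateFile` is dropped because it occurs in every report, and each report becomes a 3-bit vector over the remaining top unigrams. With `top_k=20000` the same code would produce the 20,000-bit inputs described above.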
TRAINING A DEEP BELIEF NETWORK
• When training is complete, the decoder layer is discarded and the output of the hidden layer is
treated as the input to a new auto-encoder added on top of the previous one.
• The remaining auto-encoders are trained in the same way, making up a total of eight layers.
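The layer-wise procedure can be sketched in plain Python on toy data. Several choices here are assumptions for illustration and not taken from the slides: sigmoid units, a cross-entropy reconstruction loss, masking noise, plain stochastic gradient descent, and all function names and sizes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, b):
    """Activations of one fully connected sigmoid layer for one input vector."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def train_denoising_autoencoder(data, n_hidden, noise=0.2, lr=0.1,
                                epochs=50, seed=0):
    """Train one denoising auto-encoder and return only its encoder (W, b)."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    b, c = [0.0] * n_hidden, [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            # Corrupt the input by zeroing each entry with probability `noise`.
            noisy = [0.0 if rng.random() < noise else xi for xi in x]
            h = encode(noisy, W, b)   # encoder
            y = encode(h, V, c)       # decoder reconstructs the clean input
            # Gradients for sigmoid outputs with cross-entropy loss.
            d_out = [yi - xi for yi, xi in zip(y, x)]
            d_hid = [hj * (1 - hj) * sum(d_out[i] * V[i][j] for i in range(n_in))
                     for j, hj in enumerate(h)]
            for i in range(n_in):
                c[i] -= lr * d_out[i]
                for j in range(n_hidden):
                    V[i][j] -= lr * d_out[i] * h[j]
            for j in range(n_hidden):
                b[j] -= lr * d_hid[j]
                for i in range(n_in):
                    W[j][i] -= lr * d_hid[j] * noisy[i]
    return W, b  # the decoder (V, c) is discarded


def train_stack(data, hidden_sizes, **kwargs):
    """Greedy layer-wise training: each layer learns from the previous codes."""
    layers, current = [], data
    for n_hidden in hidden_sizes:
        W, b = train_denoising_autoencoder(current, n_hidden, **kwargs)
        layers.append((W, b))
        current = [encode(x, W, b) for x in current]
    return layers, current


# Toy 8-bit inputs compressed to 2-dimensional codes via an 8-4-2 stack.
toy = [[1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 0, 1, 0, 1], [0, 0, 1, 1, 0, 1, 1, 1]]
layers, signatures = train_stack(toy, [4, 2])
```

Each call to `train_denoising_autoencoder` keeps only the encoder, so the final `signatures` are the outputs of the topmost hidden layer.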
IMPLEMENTATION AND EXPERIMENTAL RESULTS
• The six malware categories used are Zeus, Carberp, SpyEye, Cidox, Andromeda and
DarkComet
• Each of the 1,800 programs in our dataset is run in the Cuckoo sandbox
• A deep denoising autoencoder consisting of eight layers (20,000-5,000-2,500-1,000-500-250-100-30)
was trained layer-wise
• Dropout was used to regularize the network and prevent overfitting
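As a quick check of the stated architecture, the snippet below pairs up consecutive layer sizes. It assumes the eight layers include the 20,000-unit input layer and that all layers are fully connected, and it counts encoder parameters only:

```python
layer_sizes = [20000, 5000, 2500, 1000, 500, 250, 100, 30]

# Each consecutive (input, output) pair corresponds to one auto-encoder's encoder.
pairs = list(zip(layer_sizes, layer_sizes[1:]))
n_weights = sum(n_in * n_out for n_in, n_out in pairs)
n_biases = sum(layer_sizes[1:])

print(len(pairs))                        # 7
print(n_weights)                         # 115653000
print(n_biases)                          # 9380
print(layer_sizes[0] / layer_sizes[-1])  # input bits per signature dimension
```

Under these assumptions the stack consists of seven auto-encoders with about 115.7 million encoder weights, most of them in the first 20,000-to-5,000 layer.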
SIGNATURE GENERATION PROCESS STEPS
EXPERIMENTAL RESULTS
• All 1,800 vectors of size 20,000 were passed to the DBN and converted to 30-dimensional
representations (signatures)
• The visualization is generated using the t-distributed stochastic neighbor embedding (t-SNE)
algorithm
• Variants of the same malware family are mostly clustered together in the signature space,
demonstrating that the signatures generated by the DBN indeed capture invariant representations of
malware.
• Training this network on the 1,200 training samples (using input noise = 0.2, dropout = 0.5,
and learning rate = 0.001) and predicting on the 600 test samples yields 98.6% accuracy on the test
data, a substantial improvement over other methods.
EXPERIMENTAL RESULT
CONCLUSION
• Current approaches for malware signature generation target specific aspects of malware
• New malware variants easily evade detection by modifying small parts of their code
• Unsupervised deep learning is a powerful method for generating high-level invariant
representations in domains beyond computer vision, language processing, or speech
recognition; and can be applied successfully to challenging domains such as malware
signature generation.