International Journal of Computer Applications (0975 – 8887)
Volume 157 – No 1, January 2017
Efficient AIS based Feature Extraction Techniques
for Spam Detection
Mafaz Mohsin Khalil Al-Anezi, PhD
Computer Sciences Department, College of Computer Sciences & Mathematics,
Mosul University, Mosul, 964, Iraq
ABSTRACT
The development of networks and the increasing volume of unsolicited bulk e-mail, especially indiscriminate advertising, have generated a need for reliable anti-spam filters. Traditional spam filtering methods cannot effectively identify unknown and varying characteristics, so researchers have recently turned to the artificial immune system, which exhibits diversity, immune memory, adaptability and self-learning ability. The proposed spam detection model accomplishes e-mail filtering by extracting the characteristics of spam and ham (legitimate e-mail messages that are generally desired and not considered spam) from a training data set using feature extraction techniques. These techniques select a subset of relevant, non-redundant and most contributing features, which adds a benefit in accuracy and reduces time complexity. The extracted features of spam and ham then form two types of antigen detectors, which enter a series of cloning and mutation immune operations to build an immune memory of spam and ham. The experimental results confirm that the proposed model has a very high detection rate, reaching 1, and a very low false alarm rate, reaching 0, when using a low number of extracted features.
General Terms
Artificial Immune System (AIS), Feature Extraction Techniques and Security.
Keywords
Email, Spam, Ham (legitimate messages), Clonal selection,
Information Gain, LDA, PCA
1. INTRODUCTION
The word “spam” is used to indicate the electronic equivalent of junk mail. Exact definitions vary, but it typically covers a range of unsolicited and undesired advertisements and bulk e-mail messages.
E-mail is the most common form of communication on the Internet. With the vast growth of e-mail and its popularity, unsolicited e-mail (spam) also emerged very quickly, accounting for almost 90% of all e-mail messages, i.e., over 120 billion such messages are sent each day. The cost of sending these e-mails is very close to zero, making it easy to reach a high number of potential consumers. In this context, spam consumes resources: time spent reading unwanted messages, bandwidth, CPU and disk, and it is also used to spread malicious content [1].
The email system design can easily be exploited by spammers
who send inaccurate information. All email on the Internet is
sent via a protocol called Simple Mail Transfer Protocol
(SMTP). SMTP is designed to capture information about the
route that an email message travels from its sender to its
recipient. In actuality, the SMTP protocol provides no security: e-mail is not private, it can be altered en route, and there is no way to validate the identity of the e-mail source. In other words, when a user receives an e-mail message, there is no way to tell who sent it or who has seen it. The lack of security in SMTP, and specifically the lack of reliable information identifying the e-mail source, is regularly exploited by spammers and allows for considerable fraud on the Internet (such as identity theft or “phishing”) [1].
Spam also enables various kinds of attacks by distributing harmful content such as viruses, worms, Trojan horses and other malicious code. Several technical solutions, both commercial and open-source, are available for dealing with these issues [1].
Spam classification draws on different machine learning classification methods. Text classification by supervised learning is very popular: a label is assigned to each text or document, and the data are then classified into predefined categories or classes according to their contents. Nowadays, different types of algorithms exist to deal with automatic text classification [2].
Spam filtering can be supervised or unsupervised, but the most common classifier methods for detecting spam mails are [2, 3]:
I. Based on non-machine learning:
K-means Clustering Method
Black list/White list
Signature
II. Based on machine learning:
Support Vector Machine (SVM)
Artificial Neural Network (ANN)
Negative Selection Algorithm
Naïve Bayesian Classifier
Decision Tree
Nearest Neighbor (NN)
Simple techniques such as white and black list methods fail to categorize messages without user intervention. Even worse, a contact inserted into the black list may send legitimate messages besides spam; e.g., a bank may send a spam message advertising new credit opportunities as well as a legitimate message containing an online banking password. In such cases, smarter methods such as content-based classification are needed.
One solution to the spam problem is machine learning. The ability of a machine to improve its performance based on previous results is known as machine learning. In machine learning, an existing training data set is used to differentiate between spam and non-spam e-mails. Feature extraction is the major concept used here: features are extracted from each e-mail and, with the help of the training and learning phases, the result indicates whether the e-mail is spam or not [3].
2. PREVIOUS WORKS
Ozarkar P. and Patwardhan M. [4] used the spam dataset because it is possible to have a large number of training instances. Based on this fact, they made use of Random Forest and Partial Decision Tree algorithms to classify spam vs. non-spam emails. As a preprocessing step they used feature selection methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, One R and Correlation. Using 70% of the extracted feature set on the spambase data set, the training accuracy is 99.918%, while the computation time is reduced by 20%.
Idris I. and Selamat A. [5] proposed and implemented an improved model that combines the negative selection algorithm (NSA) with particle swarm optimization (PSO), called the swarm negative selection algorithm (SNSA). The implementation of PSO with its fitness function improved the detector generation phase of NSA. The empirical report shows the superiority of the proposed SNSA model over the NSA model: at 8000 generated detectors with a threshold value of 0.4, the accuracy of the negative selection algorithm is 68.863%, while that of the improved swarm negative selection algorithm is 82.69%.
3. ARTIFICIAL IMMUNE SYSTEM
The Artificial Immune System (AIS) is a research area concerned with building intelligent models that take their inspiration from the Biological Immune System (BIS). The BIS has several properties, including distributed detection, noise tolerance and reinforcement learning. Drawing on the immune processes of the BIS, many AIS models have been developed to solve engineering problems; examples are negative selection, clonal selection, the immune network model and the danger theory algorithm, and these models are applied to real-world problems such as pattern recognition, data mining, spam filtering and computer security. The main function of the BIS is to protect the body from molecules known as antigens. A key feature of the BIS is its pattern recognition capability, which can differentiate between foreign cells entering the body (non-self, or antigens) and the body's own cells (self) [3].
AIS is inspired by the human immune system which is a
highly evolved, parallel and distributed adaptive system that
exhibits the following strengths: immune recognition,
reinforcement learning, feature extraction, immune memory,
diversity and robustness. The artificial immune system (AIS)
combines these strengths and has been gaining significant
attention due to its powerful adaptive learning and memory
capabilities.
The main search power in AIS relies on the mutation operator, which is hence the factor that decides the efficiency of this technique.
The steps in AIS are as follows [6]:
1. Initialization of antibodies (potential solutions to the problem). Antigens represent the value of the objective function f(x) to be optimized.
2. Cloning, where the affinity or fitness of each antibody is determined. Based on this fitness the antibodies are cloned, i.e., the best will be cloned the most. The number of clones generated from the n selected antibodies is given by equation (1):

Nc = Σ_{i=1}^{n} round(β · j / i)    (1)

where Nc is the total number of clones, β is a multiplier factor and j is the population size of the antibodies. For example, with β = 1, a population of j = 10 and n = 3 selected antibodies, Nc = round(10/1) + round(10/2) + round(10/3) = 10 + 5 + 3 = 18 clones.
3. Hypermutation: the clones are then subjected to a hypermutation process in which they are mutated in inverse proportion to their affinity; the best antibody's clones are mutated least and the worst antibody's clones are mutated most. The clones are then evaluated along with their original antibodies, out of which the best N antibodies are selected for the next iteration. The mutation can be uniform, Gaussian or exponential.
4. CLONAL SELECTION ALGORITHM (CLONA)
Clonal selection and expansion is the most accepted theory used to explain how the immune system copes with antigens. In brief, the Clonal selection theory states that when antigens invade an organism, a subset of the immune cells capable of recognizing these antigens proliferates and differentiates into active or memory cells. The fittest clones are those which produce antibodies that bind to the antigen best (with the highest affinity). The main steps of the Clonal selection algorithm can be summarized as follows [7]:
Algorithm 1: Clonal selection
Step 1: For each antibody element
Step 2: Determine its affinity with the antigen presented
Step 3: Select a number of high affinity elements and
reproduce (clone) them proportionally to their affinity.
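To make the loop concrete, the following Python sketch implements Algorithm 1 together with the cloning rule of equation (1) and the affinity-inverse hypermutation of Section 3. The Euclidean-distance affinity, the value of β, the population size and the number of generations are illustrative assumptions, since the paper does not specify them; this is a minimal sketch, not the author's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def affinity(antibody, antigen):
    # Illustrative affinity measure: inverse Euclidean distance in feature space.
    return 1.0 / (1.0 + np.linalg.norm(antibody - antigen))

def clonal_selection_step(antibodies, antigen, n_select=5, beta=1.0):
    # Steps 1-2: determine each antibody's affinity with the presented antigen.
    aff = np.array([affinity(ab, antigen) for ab in antibodies])
    order = np.argsort(aff)[::-1][:n_select]      # highest-affinity elements
    j = len(antibodies)                           # population size, as in eq. (1)

    clones, clone_aff = [], []
    for rank, idx in enumerate(order, start=1):
        # Equation (1): the i-th best antibody receives round(beta * j / i) clones.
        n_clones = int(round(beta * j / rank))
        for _ in range(n_clones):
            # Hypermutation: mutation strength inversely proportional to affinity,
            # so the best clones are perturbed least (Gaussian mutation).
            sigma = (1.0 - aff[idx]) * 0.5
            clone = antibodies[idx] + rng.normal(0.0, sigma, size=antibodies[idx].shape)
            clones.append(clone)
            clone_aff.append(affinity(clone, antigen))

    # Survivors: the best of the originals and the mutated clones form the
    # next population of the same size.
    pool = list(antibodies) + clones
    pool_aff = list(aff) + clone_aff
    best = np.argsort(pool_aff)[::-1][:j]
    return [pool[i] for i in best]

# Usage: evolve 10 random detectors toward one antigen for a few generations.
antigen = rng.random(57)                 # a 57-feature pattern, as in spambase
population = [rng.random(57) for _ in range(10)]
for _ in range(20):
    population = clonal_selection_step(population, antigen)
```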
5. E-MAIL SPAM DATASET
The dataset used for our experiments is spambase [8]. It is multivariate and contains 4601 instances; the attribute values are integer or real. The last attribute of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0). Most of the attributes indicate the frequency of occurrence of spam-related terms. The first 48 attributes (1-48) give frequency values for spam-related words, whereas the next 6 attributes (49-54) provide frequency values for spam-related characters. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters: capital_run_length_average, capital_run_length_longest and capital_run_length_total. Thus, our dataset has in total 57 attributes serving as input features for spam detection, and the last attribute represents the class (spam/non-spam).
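For reference, the dataset can be loaded as follows. This short Python sketch assumes the UCI file 'spambase.data' (comma-separated, 58 columns) is available in the working directory.

```python
import numpy as np

# Load the UCI spambase file: 4601 rows, 57 feature columns plus a final
# 0/1 class label (1 = spam, 0 = ham).
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :57], data[:, 57].astype(int)

print(X.shape)          # (4601, 57)
print(np.bincount(y))   # counts of ham (0) and spam (1) instances
```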
6. FEATURE RANKING AND SUBSET SELECTION
Dimensionality reduction and feature selection are important aspects of electroencephalography-based event-related potential detection systems such as brain-computer interfaces [9]. Feature ranking further helps us to:
1. Remove irrelevant features, which might mislead the classifier, decreasing its interpretability and reducing generalization by increasing overfitting.
2. Remove redundant features, which provide no additional information beyond the other features and unnecessarily decrease the efficiency of the classifier.
3. Select high-rank features, which may not improve precision and recall much but reduces time complexity drastically. Selecting such high-rank features reduces the dimensionality of the feature space of the domain; it speeds up the classifier, thereby improving performance and increasing the comprehensibility of the classification result [4].
From the above-defined vector of 58 attributes in total (57 features plus the class label), feature ranking and selection algorithms are used to select a subset of features. The given set of features is ranked using the following distinct approaches.
6.1 Information Gain
Information Gain is the expected reduction in entropy caused
by partitioning the examples according to a given attribute.
Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. The entropy of Y is given by equation (2) [4]:

H(Y) = − Σ_{y∈Y} P(y) log2 P(y)    (2)
If the observed values of Y in the training data are partitioned
according to the values of a second feature X, and the entropy
of Y with respect to the partitions induced by X is less than
the entropy of Y prior to partitioning, then there is a
relationship between features Y and X. Equation (3) gives the entropy of Y after observing X:

H(Y|X) = − Σ_{x∈X} P(x) Σ_{y∈Y} P(y|x) log2 P(y|x)    (3)
The amount by which the entropy of Y decreases reflects
additional information about Y provided by X and is called
the information gain or alternatively, mutual information [4].
Information gain is given by equation (4):

Gain = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(Y) + H(X) − H(X,Y)    (4)
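As an illustration of this ranking step, the Python sketch below computes equations (2)-(4) for each attribute and keeps the highest-ranked ones. The equal-width discretization of the continuous spambase attributes into 10 bins is an assumed preprocessing choice, and X and y are the arrays from the loading sketch in Section 5.

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum_y P(y) log2 P(y), equation (2).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, bins=10):
    # Discretize the continuous attribute before computing H(Y|X); the bin
    # count is an illustrative choice.
    x = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    h_cond = 0.0
    for v in np.unique(x):
        mask = x == v
        h_cond += mask.mean() * entropy(labels[mask])  # sum_x P(x) H(Y|X=x), eq. (3)
    return entropy(labels) - h_cond                    # Gain = H(Y) - H(Y|X), eq. (4)

# Rank all 57 attributes and keep, e.g., the 10 most informative ones.
gains = np.array([information_gain(X[:, i], y) for i in range(X.shape[1])])
top10 = np.argsort(gains)[::-1][:10]
```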
6.2 Principal Component Analysis (PCA)
PCA is perhaps one of the most commonly used dimensionality reduction methods. PCA seeks the linear combinations of the multivariate data that capture a maximum amount of variance. However, the projections that PCA seeks are not necessarily related to class labels, hence they may not be optimal for classification problems [9].
PCA is also used as a feature extraction technique: data used in the intrusion detection problem, for instance, are high dimensional in nature, and it is desirable to reduce the dimensionality of the data for easy exploration and further analysis; PCA is often used for this purpose [10]. The mathematics behind principal component analysis is statistics, hinged on standard deviation, eigenvalues and eigenvectors. The entire subject of statistics is based around the idea that you have a big set of data and want to analyze it in terms of the relationships between the individual points in that data set [11].
PCA is concerned with explaining the variance-covariance structure of a set of variables through a few new variables. If there are M features in each datum and there are N data, the data are represented by x11, x12, ..., x1M; x21, x22, ..., x2M; ...; xN1, xN2, ..., xNM. Centering each datum gives the matrix A = [Φ1, Φ2, ..., ΦM] (an N×M matrix) [10]. The sample covariance matrix C of the data set is defined by equation (5):

C = (1/M) Σ_{i=1}^{M} Φi Φi^T    (5)

The eigenvalues (λ1, λ2, ..., λN) and eigenvectors (u1, u2, ..., uN) of the covariance matrix C are computed, and the K eigenvectors having the largest eigenvalues are selected. The dimensionality of the subspace K can be determined by the following criterion:

(Σ_{i=1}^{K} λi) / (Σ_{i=1}^{N} λi) > threshold (α)

The linear transformation R^N → R^K that performs the dimensionality reduction is given by equation (6):

Zn = U^T (xn − x̄) = U^T Φn    (6)
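A minimal NumPy sketch of this procedure, following equations (5) and (6), might look as follows; the 0.95 retained-variance threshold (α) is an illustrative value.

```python
import numpy as np

def pca_fit(X, threshold=0.95):
    # Center the data: phi_i = x_i - mean (the matrix A of the text).
    mean = X.mean(axis=0)
    phi = X - mean
    # Sample covariance matrix C, equation (5).
    C = phi.T @ phi / phi.shape[0]
    # Eigen-decomposition; eigh returns ascending eigenvalues for symmetric C.
    eigvals, eigvecs = np.linalg.eigh(C)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Choose K so the retained variance ratio exceeds the threshold (alpha).
    ratio = np.cumsum(eigvals) / eigvals.sum()
    K = int(np.searchsorted(ratio, threshold)) + 1
    return mean, eigvecs[:, :K]

def pca_transform(X, mean, U):
    # Equation (6): z_n = U^T (x_n - mean).
    return (X - mean) @ U

mean, U = pca_fit(X, threshold=0.95)   # X from the loading sketch in Section 5
Z = pca_transform(X, mean, U)
```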
6.3 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) searches for those
vectors in the underlying space that best discriminate among
classes (rather than those that best describe the data). More
formally, given a number of independent features relative to
which the data is described, LDA creates a linear combination
of these which yields the largest mean differences between the
desired classes. Mathematically speaking, for all the samples of all classes, two measures are defined: 1) the within-class scatter matrix, given by equation (7):

Sw = Σ_{j=1}^{c} Σ_{i=1}^{Nj} (x_i^j − μj)(x_i^j − μj)^T    (7)
where x_i^j is the ith sample of class j, μj is the mean of class j, c is the number of classes, and Nj is the number of samples in class j; and 2) the between-class scatter matrix, given by equation (8):
Sb = Σ_{j=1}^{c} (μj − μ)(μj − μ)^T    (8)
where μ represents the mean of all classes. The goal is to maximize the between-class measure while minimizing the within-class measure [12].
Standard LDA can be seriously degraded if there is only a limited number of observations N compared to the dimension of the feature space n. In PCA, the shape and location of the original data sets change when transformed to a different space, whereas LDA does not change the location but only tries to provide more class separability and draw a decision region between the given classes [11].
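The following sketch computes Sw and Sb per equations (7) and (8) and projects the data onto the leading discriminant directions. Solving the eigenproblem of pinv(Sw)·Sb is one standard way to maximize the between-class over within-class criterion; it is assumed here since the paper does not specify the solver.

```python
import numpy as np

def lda_fit(X, y):
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((n_features, n_features))   # within-class scatter, eq. (7)
    Sb = np.zeros_like(Sw)                    # between-class scatter, eq. (8)
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sw += (Xc - mu).T @ (Xc - mu)
        d = (mu - overall_mean).reshape(-1, 1)
        Sb += d @ d.T
    # Maximize between-class over within-class scatter: take the leading
    # eigenvectors of pinv(Sw) @ Sb; at most c-1 of them are meaningful.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    k = len(np.unique(y)) - 1
    return eigvecs.real[:, order[:k]]

W = lda_fit(X, y)     # X, y from the loading sketch in Section 5
X_lda = X @ W         # a single discriminant direction for spam vs. ham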
7. PROPOSED SYSTEM
To solve the spam detection and e-mail classification problem using the Artificial Immune System, a new e-mail classification technique based on the Clonal Selection Algorithm and feature extraction techniques is designed and implemented. First, the most important features are extracted from each of the two classes, spam and ham; then spam and ham detectors are generated, after which e-mail classification takes place by utilizing the ham and spam detectors accordingly in order to successfully reduce the false rate. The experiments confirm the reliability and efficiency of the new technique in minimizing false positives and time consumption and maximizing true positives. The dataset used in this research is obtained from the machine learning repository of the Center for Machine Learning and Intelligent Systems.
7.1 Data Preprocessing
The data set is divided into two parts: one for training and the second for testing. After this first division, the training data set is further divided into a self detector set (Ham) and a nonself detector set (Spam).
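A possible implementation of this split is sketched below; the 5% training fraction mirrors experiment 1, and the random permutation is an assumed sampling strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_dataset(X, y, train_fraction=0.05):
    # Hold out a training fraction (5-20% in the experiments) and split the
    # training part into self (ham) and nonself (spam) detector seeds.
    idx = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    train, test = idx[:n_train], idx[n_train:]
    ham_train = X[train][y[train] == 0]    # self patterns
    spam_train = X[train][y[train] == 1]   # nonself patterns
    return ham_train, spam_train, X[test], y[test]

ham_train, spam_train, X_test, y_test = split_dataset(X, y, 0.05)
```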
7.2 Feature Extraction by Information Gain, PCA or LDA
The information gain of each attribute is calculated, and the attributes with low information gain are removed from the data set. The information gain of an attribute indicates the statistical relevance of this attribute regarding the classification.
PCA is an unsupervised feature extraction method: the class labels are not taken into account, so the presence of labels in the data set does not alter the resulting PCA projection.
LDA, in contrast, is a supervised feature extraction method that finds a linear subspace maximizing separability between classes. The dimensionality of the resulting subspace is bounded by the minimum of the number of features, the number of samples and the number of classes; usually, the output dimensionality is determined by the number of classes.
7.3 Clonal Selection
In the clonal selection method, only a small set of the best Artificial LymphoCytes (ALCs), i.e., those with the highest calculated affinity with a presented pattern, is maintained so that the problem can be solved with the minimal available resources. The selected ALCs (i.e., detectors) are then cloned and mutated in an attempt to obtain a higher binding affinity with the presented pattern. The mutated clones compete with the existing set of ALCs for survival, based on the calculated affinity between the mutated clones and the pattern, to be exposed to the next pattern.
The percentage of data taken from the dataset for training is divided into a spam training set and a non-spam training set. The trained detectors are used to classify the rest of the e-mail database: after pre-processing, the feature vector of each e-mail is obtained and its affinity with the detectors is calculated; if the affinity with the spam detectors is greater than the threshold, the e-mail is said to be Spam; otherwise it is Ham.
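A simplified version of this decision rule is sketched below; the nearest-detector, inverse-distance affinity is an assumption standing in for the paper's unspecified affinity measure.

```python
import numpy as np

def affinity(v, detector):
    # Illustrative inverse-distance affinity in the reduced feature space.
    return 1.0 / (1.0 + np.linalg.norm(v - detector))

def classify(email_vec, spam_detectors, ham_detectors, threshold=0.5):
    # Affinity with the closest detector in each immune memory set.
    spam_aff = max(affinity(email_vec, d) for d in spam_detectors)
    ham_aff = max(affinity(email_vec, d) for d in ham_detectors)
    # Affinity with the spam memory above the threshold (and above the ham
    # affinity) labels the message Spam; otherwise it is Ham.
    return "Spam" if spam_aff > threshold and spam_aff >= ham_aff else "Ham"
```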
8. EXPERIMENTAL SETUP AND RESULTS
During e-mail classification, existing anti-spam methods make two kinds of mistakes: either an e-mail is recognized as non-self and deleted, or it is recognized as self and carelessly accepted. These errors are called false positives and false negatives: a false positive occurs when legitimate e-mail (the data needed to create a self detector) is classified as non-self and discarded, while a false negative occurs when e-mail that is supposed to be discarded is recognized as self and accepted. Figure 1 depicts the functional block diagram of the proposed detection model.
Metrics such as true negative rate, true positive rate, weighted accuracy, G-mean, precision, recall and F-measure are used to evaluate the performance of the learning algorithms, assuming a test set with a total of N messages consisting of spam and legitimate messages (ham). These metrics, defined below, evaluate the performance of the mail classification system [13].
Detection Rate and False Alarm Rate are related to the true positive rate (TPR) and true negative rate (TNR). To obtain optimal balanced classification ability, sensitivity and specificity are usually adopted to monitor classification performance on the two classes separately.
TNR = TN / (TN + FP)    (9)

TPR = TP / (TP + FN)    (10)
Precision is the probability that a message judged to be spam is indeed spam; the higher the precision, the fewer legitimate messages are misjudged as spam.
Precision (PPV) = TP / (TP + FP)    (11)
Accuracy is the probability of judging all mail correctly.

Accuracy (ACC) = (TP + TN) / (TP + FN + FP + TN)    (12)
Geometric Mean (G-mean) is used to assess performance based on the two metrics TPR and TNR; it is the geometric mean of the classification accuracy on negative samples and the classification accuracy on positive samples. It is used when the target is to optimize classification performance with balanced positive-class and negative-class accuracy.

G-mean = (TPR × TNR)^(1/2)    (13)
F-measure is used to integrate precision and recall into a
single metric for convenience of modeling.
F-measure = (2 × Precision × Recall) / (Precision + Recall)    (14)
Here TP, FP, TN and FN denote the true positive, false positive, true negative and false negative counts, respectively.
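These definitions translate directly into code; the sketch below computes equations (9)-(14) from predicted and true labels, with spam encoded as the positive class 1.

```python
import numpy as np

def evaluate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Confusion-matrix counts, with spam as the positive class (1).
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tnr = tn / (tn + fp)                                 # equation (9)
    tpr = tp / (tp + fn)                                 # detection rate, eq. (10)
    precision = tp / (tp + fp)                           # equation (11)
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # equation (12)
    g_mean = np.sqrt(tpr * tnr)                          # equation (13)
    f_measure = 2 * precision * tpr / (precision + tpr)  # equation (14)
    return {"TPR": tpr, "TNR": tnr, "Precision": precision,
            "Accuracy": accuracy, "G-mean": g_mean, "F-measure": f_measure}
```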
In each experiment the numbers of generated Spam and Ham detectors differ for each case of using Information Gain, PCA or LDA, and the overall time consumed by each experiment is computed (i.e., for the cases of 10, 15, 21, 30, 40, 50 and 57 features together). The following experiments were performed:
Fig. 1: Proposed Model for Spam/Ham detection. [Block diagram: in the training phase, 5%, 10%, 15% or 20% of the dataset is preprocessed, features are extracted by Info. Gain, PCA or LDA, and CLONA generates the detectors stored in the immune memory; in the testing phase, the remaining dataset is preprocessed, the same features are extracted, the affinity with the immune memory is calculated and compared against the threshold Th to label the message Spam or Ham, and the metrics are calculated.]

Fig. 2: Shows the experimental results of the training phase of experiment 1 (5% of dataset).
Experiment 1: the input dataset is about 5% of the original, which yields 139 Ham and 90 Spam messages, as shown in Table 1 and Figure 2. Figure 3 shows the influence of the number of features on the time consumed: the time increases in direct proportion to the number of features. The overall times consumed in this experiment are 12.23 secs, 12.55 secs and 13.65 secs for Information Gain, LDA and PCA respectively.
Experiment 2: the input dataset is about 10% of the original, which yields 278 Ham and 181 Spam messages, as shown in Table 2 and Figure 4, and in Figure 5 for the other measures. The overall times consumed in this experiment are 58.22 secs, 1.02.15 mins and 1.04.85 mins for Information Gain, LDA and PCA respectively.
Fig. 3: Shows the time consumed in the training and testing phases of experiment 1 (5% of dataset).

Fig. 4: Shows the experimental results of the testing phase of experiment 2 (10% of dataset).
Experiment 3: the input dataset is about 15% of the original, which yields 418 Ham and 271 Spam messages, as shown in Table 3. Figure 6 shows the differences between the accuracy of the algorithms on the train and test datasets from Table 3, which appear very convergent. Figure 7 shows the influence of the number of generated Spam and Ham detectors on the accuracy in the test phase, suggesting maximums of 5600 Ham and 2500 Spam detectors based on the maximum numbers of detectors in Table 3. The overall times consumed in this experiment are 2.36.48 mins, 2.30.30 mins and 2.43.10 mins for Information Gain, LDA and PCA respectively.
Experiment 4: the input dataset is about 20% of the original, which yields 557 Ham and 362 Spam messages. The overall times consumed in this experiment are 5.11.25 mins, 5.33.30 mins and 5.59.90 mins for Information Gain, LDA and PCA respectively.

When a small (or nonrepresentative) training data set is used, there is no guarantee that Information Gain and LDA will outperform PCA.

Fig. 5: Shows the experimental results of the testing phase of experiment 2 (10% of dataset) for G-Mean, Precision and F-Measure.
9. CONCLUSIONS
An efficient e-mail filtering approach has been presented, consisting of two phases: training and testing. The new model tries to improve the accuracy and time consumption of spam filtering by combining several well-known feature extraction techniques with the artificial immune system, using its Clonal selection algorithm. Experimental results showed an improvement in the performance of the new spam filter over using each technique alone, as it always seeks the best and fastest detectors to reduce the false positive rate and obtain the highest accuracy. The experimental results, applied to 4601 instances of e-mail messages, show high efficiency, with a false alarm rate as low as 0 and a detection rate as high as 1, especially when the experiment depends on a low number of the most important attributes. These promising results of the immune-inspired method can be further developed and even integrated with other methods as an appealing future direction, and also as a model that could help us better understand the behavior of the immune system and how it can be very useful in different fields of computer and network security.
Fig. 6: The differences between the accuracy of the algorithms on the train and test datasets from Table 3, with training applied on 15% of the dataset.

Fig. 7: The relation between the number of generated Spam and Ham detectors and the accuracy in the test phase of experiment 3 (15% of dataset).
Table 1: Results of training on 5% of data (139 Ham & 90 Spam) as antigens

| Algorithm | No. of used features | Ham Abs | Spam Abs | TPR Train | TPR Test | FNR Train | FNR Test | Accuracy Train | Accuracy Test | Time (secs) |
|---|---|---|---|---|---|---|---|---|---|---|
| Information Gain | 10 | 246 | 69 | 1 | 0.99 | 0.05 | 0.02 | 0.98 | 0.99 | 1.10 |
| Information Gain | 15 | 110 | 74 | 0.99 | 0.98 | 0.09 | 0.05 | 0.96 | 0.97 | 1.23 |
| Information Gain | 21 | 107 | 21 | 0.98 | 0.95 | 0.09 | 0.08 | 0.95 | 0.94 | 1.32 |
| Information Gain | 30 | 32 | 18 | 0.96 | 0.91 | 0.13 | 0.14 | 0.92 | 0.89 | 1.85 |
| Information Gain | 40 | 31 | 18 | 0.93 | 0.87 | 0.11 | 0.2 | 0.92 | 0.84 | 2.18 |
| Information Gain | 50 | 27 | 18 | 0.88 | 0.78 | 0.2 | 0.32 | 0.85 | 0.74 | 2.60 |
| Information Gain | 57 | 27 | 18 | 0.76 | 0.66 | 0.37 | 0.52 | 0.71 | 0.59 | 3.10 |
| PCA | 10 | 762 | 60 | 1 | 0.98 | 0.02 | 0.04 | 0.99 | 0.97 | 1.25 |
| PCA | 15 | 114 | 28 | 0.99 | 0.95 | 0.06 | 0.14 | 0.97 | 0.91 | 1.42 |
| PCA | 21 | 85 | 18 | 0.98 | 0.91 | 0.04 | 0.18 | 0.97 | 0.88 | 1.82 |
| PCA | 30 | 30 | 18 | 0.93 | 0.86 | 0.15 | 0.29 | 0.9 | 0.79 | 2.10 |
| PCA | 40 | 35 | 18 | 0.86 | 0.78 | 0.25 | 0.39 | 0.81 | 0.71 | 2.34 |
| PCA | 50 | 27 | 18 | 0.78 | 0.67 | 0.28 | 0.5 | 0.76 | 0.61 | 3.13 |
| PCA | 57 | 27 | 18 | 0.77 | 0.65 | 0.36 | 0.55 | 0.72 | 0.57 | 3.32 |
| LDA | 10 | 935 | 103 | 1 | 0.99 | 0.01 | 0.01 | 1 | 0.99 | 1 |
| LDA | 15 | 154 | 75 | 0.99 | 0.98 | 0.05 | 0.05 | 0.97 | 0.97 | 1.22 |
| LDA | 21 | 126 | 19 | 0.95 | 0.95 | 0.12 | 0.09 | 0.92 | 0.93 | 1.51 |
| LDA | 30 | 35 | 18 | 0.92 | 0.9 | 0.18 | 0.18 | 0.88 | 0.87 | 1.95 |
| LDA | 40 | 30 | 18 | 0.91 | 0.81 | 0.2 | 0.32 | 0.86 | 0.76 | 2.41 |
| LDA | 50 | 27 | 18 | 0.82 | 0.72 | 0.29 | 0.46 | 0.78 | 0.64 | 3.07 |
| LDA | 57 | 27 | 18 | 0.78 | 0.65 | 0.33 | 0.53 | 0.74 | 0.58 | 3.33 |
Table 2: Results of training on 10% of data (278 Ham & 181 Spam) as antigens

| Algorithm | No. of used features | Ham Abs | Spam Abs | TPR Train | TPR Test | FNR Train | FNR Test | Accuracy Train | Accuracy Test | G-M Test | Prec. Test | F-M Test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information Gain | 10 | 2550 | 239 | 1 | 1 | 0.01 | 0 | 1 | 1 | 1 | 1 | 1 |
| Information Gain | 15 | 1909 | 1110 | 1 | 1 | 0.02 | 0.03 | 0.99 | 0.99 | 0.97 | 0.98 | 0.99 |
| Information Gain | 21 | 746 | 169 | 0.99 | 0.97 | 0.01 | 0.05 | 0.99 | 0.96 | 0.92 | 0.97 | 0.97 |
| Information Gain | 30 | 934 | 144 | 0.98 | 0.93 | 0.04 | 0.08 | 0.97 | 0.93 | 0.850 | 0.95 | 0.94 |
| Information Gain | 40 | 185 | 44 | 0.95 | 0.89 | 0.1 | 0.16 | 0.93 | 0.87 | 0.74 | 0.9 | 0.89 |
| Information Gain | 50 | 184 | 36 | 0.87 | 0.8 | 0.13 | 0.26 | 0.87 | 0.78 | 0.59 | 0.85 | 0.82 |
| Information Gain | 57 | 183 | 36 | 0.78 | 0.67 | 0.32 | 0.5 | 0.74 | 0.61 | 0.36 | 0.7 | 0.68 |
| PCA | 10 | 1906 | 1083 | 1 | 0.99 | 0.02 | 0.05 | 0.99 | 0.97 | 0.94 | 0.96 | 0.97 |
| PCA | 15 | 2437 | 1106 | 1 | 0.98 | 0.04 | 0.06 | 0.98 | 0.96 | 0.92 | 0.96 | 0.97 |
| PCA | 21 | 1504 | 130 | 0.97 | 0.94 | 0.05 | 0.14 | 0.96 | 0.91 | 0.8 | 0.9 | 0.92 |
| PCA | 30 | 1533 | 92 | 0.94 | 0.88 | 0.12 | 0.16 | 0.92 | 0.87 | 0.74 | 0.9 | 0.89 |
| PCA | 40 | 196 | 36 | 0.87 | 0.83 | 0.22 | 0.27 | 0.83 | 0.79 | 0.6 | 0.82 | 0.82 |
| PCA | 50 | 189 | 36 | 0.84 | 0.74 | 0.23 | 0.27 | 0.81 | 0.79 | 0.47 | 0.79 | 0.76 |
| PCA | 57 | 192 | 36 | 0.78 | 0.71 | 0.35 | 0.44 | 0.73 | 0.65 | 0.40 | 0.73 | 0.72 |
| LDA | 10 | 2548 | 1035 | 1 | 1 | 0.03 | 0.02 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 |
| LDA | 15 | 1876 | 1075 | 0.99 | 0.99 | 0.02 | 0.05 | 0.99 | 0.97 | 0.94 | 0.96 | 0.97 |
| LDA | 21 | 1769 | 157 | 0.96 | 0.96 | 0.03 | 0.06 | 0.97 | 0.95 | 0.9 | 0.96 | 0.96 |
| LDA | 30 | 1475 | 74 | 0.94 | 0.91 | 0.08 | 0.12 | 0.93 | 0.9 | 0.8 | 0.93 | 0.92 |
| LDA | 40 | 203 | 39 | 0.9 | 0.84 | 0.16 | 0.27 | 0.88 | 0.8 | 0.61 | 0.82 | 0.83 |
| LDA | 50 | 187 | 37 | 0.85 | 0.77 | 0.27 | 0.38 | 0.8 | 0.71 | 0.48 | 0.74 | 0.75 |
| LDA | 57 | 173 | 36 | 0.78 | 0.71 | 0.3 | 0.41 | 0.75 | 0.67 | 0.42 | 0.76 | 0.73 |
Table 3: Results of training on 15% of data (418 Ham & 271 Spam) as antigens

| Algorithm | No. of used features | Ham Abs | Spam Abs | TPR Train | TPR Test | FNR Train | FNR Test | Accuracy Train | Accuracy Test | G-M Test | Prec. Test | F-M Test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information Gain | 10 | 5271 | 2393 | 1 | 1 | 0.01 | 0 | 1 | 1 | 1 | 1 | 1 |
| Information Gain | 15 | 4052 | 2063 | 1 | 0.99 | 0.01 | 0.2 | 0.99 | 0.99 | 0.97 | 0.98 | 0.98 |
| Information Gain | 21 | 2178 | 363 | 1 | 0.99 | 0.01 | 0.05 | 0.99 | 0.97 | 0.94 | 0.97 | 0.98 |
| Information Gain | 30 | 2680 | 1632 | 0.99 | 0.96 | 0.03 | 0.07 | 0.98 | 0.95 | 0.89 | 0.95 | 0.95 |
| Information Gain | 40 | 418 | 271 | 0.95 | 0.89 | 0.9 | 0.13 | 0.93 | 0.88 | 0.77 | 0.92 | 0.9 |
| Information Gain | 50 | 418 | 271 | 0.92 | 0.85 | 0.13 | 0.23 | 0.9 | 0.82 | 0.65 | 0.85 | 0.85 |
| Information Gain | 57 | 418 | 271 | 0.84 | 0.73 | 0.27 | 0.38 | 0.8 | 0.69 | 0.45 | 0.77 | 0.75 |
| PCA | 10 | 5131 | 271 | 1 | 0.99 | 0.01 | 0.03 | 1 | 0.98 | 0.96 | 0.98 | 0.98 |
| PCA | 15 | 3633 | 959 | 1 | 0.98 | 0.3 | 0.04 | 0.99 | 0.97 | 0.94 | 0.97 | 0.97 |
| PCA | 21 | 3865 | 1710 | 0.99 | 0.97 | 0.05 | 0.08 | 0.98 | 0.95 | 0.89 | 0.95 | 0.96 |
| PCA | 30 | 2292 | 1740 | 0.96 | 0.93 | 0.1 | 0.15 | 0.93 | 0.9 | 0.79 | 0.9 | 0.91 |
| PCA | 40 | 1064 | 187 | 0.91 | 0.86 | 0.14 | 0.21 | 0.89 | 0.83 | 0.68 | 0.86 | 0.86 |
| PCA | 50 | 260 | 180 | 0.82 | 0.75 | 0.22 | 0.33 | 0.8 | 0.72 | 0.5 | 0.81 | 0.78 |
| PCA | 57 | 274 | 179 | 0.82 | 0.73 | 0.27 | 0.39 | 0.78 | 0.69 | 0.45 | 0.77 | 0.75 |
| LDA | 10 | 5647 | 2225 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| LDA | 15 | 3432 | 1903 | 0.99 | 0.99 | 0.01 | 0.4 | 0.99 | 0.98 | 0.95 | 0.97 | 0.98 |
| LDA | 21 | 1912 | 871 | 0.98 | 0.98 | 0.03 | 0.07 | 0.98 | 0.96 | 0.91 | 0.95 | 0.96 |
| LDA | 30 | 1326 | 1527 | 0.97 | 0.96 | 0.06 | 0.11 | 0.96 | 0.93 | 0.85 | 0.93 | 0.94 |
| LDA | 40 | 662 | 65 | 0.91 | 0.84 | 0.17 | 0.24 | 0.88 | 0.81 | 0.64 | 0.84 | 0.84 |
| LDA | 50 | 239 | 157 | 0.87 | 0.79 | 0.19 | 0.31 | 0.85 | 0.75 | 0.56 | 0.8 | 0.97 |
| LDA | 57 | 284 | 159 | 0.84 | 0.74 | 0.24 | 0.37 | 0.81 | 0.7 | 0.47 | 0.78 | 0.76 |
10. REFERENCES
[1] Geerthik S., 2013, "Survey on Internet Spam: Classification and Analysis", Pages 384-391, Int. J. Computer Technology & Applications, Vol. 4 (3), May-June.
[2] Sao P. and Singh A., 2015, "Survey on Email Spam Classification using Different Classification Method", Pages 680-684, JETIR, Volume 2, Issue 3.
[3] Sharayu S. A. and Irabashetti P., 2014, "Efficient Spam Filtering Based on Artificial Immune System (AIS)", International Journal of Ignited Minds (IJIMIINDS), Volume 01, Issue 12, December.
[4] Ozarkar P. and Patwardhan M., 2013, "Efficient Spam Classification by Appropriate Feature Selection", Pages 48-57, Global Journal of Computer Science and Technology, Software & Data Engineering, Volume 13, Issue 5, Version 1.0.
[5] Idris I. and Selamat A., 2015, "A Swarm Negative Selection Algorithm for Email Spam Detection", Journal of Computer Engineering & Information Technology, March 17.
[6] Kathiravan A. V. and Vasumathi B., 2015, "Artificial Immune System Based Classification Approach for Detecting Phishing Mails", Pages 4308-4315, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 5, May.
[7] Mahmoud T. M., El Nashar A. I., Abd-El-Hafeez T. and Khairy M., 2014, "An Efficient Three-phase Email Spam Filtering Technique", Pages 1184-1201, British Journal of Mathematics & Computer Science.
[8] Hettich S., Blake C. L. and Merz C. J., 1998, "UCI Repository of Machine Learning Databases", Department of Information and Computer Science, University of California, Irvine, CA, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[9] Lan T., Erdogmus D., Black L. and Santen J., 2014, "A Comparison of Different Dimensionality Reduction and Feature Selection Methods for Single Trial ERP Detection", Conf Proc IEEE Eng Med Biol Soc.
[10] Singh S., Silakari S. and Patel R., 2011, "An Efficient Feature Reduction Technique for Intrusion Detection System", International Conference on Machine Learning and Computing, IPCSIT Vol. 3.
[11] Khan A. and Farooq H., 2011, "Principal Component Analysis-Linear Discriminant Analysis Feature Extractor for Pattern Recognition", Pages 267-270, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 6, No. 2, November.
[12] Martinez A. M. and Kak A. C., 2001, "PCA versus LDA", Pages 228-233, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 2, February.
[13] Al-Anezi M. M. and Al-Dabagh N. B., 2012, "Multilayer Artificial Immune Systems for Intrusion and Malware Detection", LAP Lambert Academic Publishing.