Evaluation of data mining techniques for suspicious network activity classification using honeypots data
André Grégio 1,2, Rafael Santos 1, Antonio Montes 2
1 Brazilian Institute for Space Research, Av. dos Astronautas, 1758, São José dos Campos, SP, Brazil, 12227-010;
2 Dept. of Information Systems Security, Renato Archer Research Center, Rod. D. Pedro I, Km 143.6, Campinas, SP, Brazil, 13069-901.
ABSTRACT
As the number and types of remote network services increase, the analysis of their logs has become a very difficult and time-consuming task. There are several ways to filter relevant information and provide a reduced log set for analysis, such as whitelisting and intrusion detection tools, but all of them require extensive fine-tuning work and human expertise. Researchers are currently evaluating data mining approaches for intrusion detection in network logs, using techniques such as genetic algorithms, neural networks and clustering algorithms. Some of those techniques yield good results, yet require a very large number of attributes gathered from network traffic to detect useful information. In this work we apply and evaluate some data mining techniques (K-Nearest Neighbors, Artificial Neural Networks and Decision Trees), using a reduced number of attributes, on log data sets acquired from a real network and a honeypot, in order to classify traffic logs as normal or suspicious. The results obtained allow us to identify unlabeled logs and to describe which attributes were used for the decision. This approach provides a greatly reduced amount of logs to the network administrator, improving the analysis task and aiding in the discovery of new kinds of attacks against their networks.
Keywords: Computer Security, Data Mining, Log Analysis, Intrusion Detection, Artificial Intelligence
1. INTRODUCTION
Since the 1990s, the Internet has provided plenty of remote data access services. The facilities provided by the Internet changed the way society interacts, making life easier. People can communicate, shop and access their banks virtually. On the other hand, new ways to commit crimes appeared, fueled by the ease of Internet access and the low risk for the criminals.
All this network interaction generates data that can (and should) be registered. Recording this traffic information is called logging and consists of storing data related to the network communication between the parties involved. This can be useful for fault monitoring or for identifying events that might be attacks against the services.
The logging activity is apparently simple, but it raises various concerns, as more services lead to ever more logs being generated. These concerns are related to the amount of storage, the utility of the stored logs, the workload of the person who will analyze those logs and the difficulty and time required to do that job. Except for storage requirements, which nowadays are not a real problem, all other concerns are human-centric: the processing, analysis and actions resulting from the analysis must be performed by a human expert, which is inherently slow and costly.
Data Mining, Intrusion Detection, Information Assurance, and Data
Networks Security 2007, edited by Belur V. Dasarathy, Proc. of SPIE
Vol. 6570, 657006, (2007) · 0277-786X/07/$18 · doi: 10.1117/12.719023
Proc. of SPIE Vol. 6570 657006-1
The main problem related to logging is the great amount of data that has to be analyzed by a network administrator. In addition to the sheer volume of logs, which is sometimes impossible to analyze, network administrators are often overwhelmed by other activities. This can decrease the effectiveness of log analysis, prevent the task from being accomplished in a reasonable time and, ultimately, render countermeasures ineffective in minimizing or avoiding attacks.
There are some tools and techniques that help the network administrator in the log analysis task, such as operating system tools to parse text and search for regular expressions, artificial ignorance and intrusion detection systems, briefly described as follows:
Tools like grep 1 can be used in scripts that search for patterns, in order to filter or find useful information in logs, improving the network administrator's efficiency in the analysis job;
Artificial ignorance, or whitelisting 2, is a technique to filter and order system events, eliminating known irrelevant log entries and isolating the relevant ones. This technique is based on the knowledge of the network administrator, who creates rules to generate lists of events, and is used to substantially decrease the amount of false alerts;
Intrusion Detection Systems (IDS) are software that analyze network connections either to match known patterns and raise attack alerts (intrusion detection by misuse), such as SNORT 3, or to verify whether the traffic characteristics stay under a threshold that indicates normality (intrusion detection by anomaly), such as IDES 4. These kinds of software tend to be difficult to configure, require constant tuning and are somewhat limited: a misuse-based IDS can only detect attacks whose signatures exist (the known ones), and an anomaly-based IDS is subject to a large number of false positives, since it is hard to define a normal profile for a computer network's behavior.
Despite all efforts to detect suspicious activities in networks, attacks are very frequent and often have no specific targets. Vulnerability exploits are embedded in worms and malware (malicious software) to attack and invade as many computers as possible, in order to build networks of compromised machines (botnets) that perform other kinds of attacks, nowadays often related to criminal organizations. Attacks coming from botnets can be denial of service against an enterprise or institution, distribution of unsolicited e-mail (spam) or even ways to anonymize the attacker's identity. Data from CERT.br (Brazilian Security Incident Treatment, Response and Study Center) in Figure 1 show that the number of reported incidents is constantly increasing, and those reports are only possible when incidents are detected. The detection and consequent reporting of an incident occurs via log analysis, which makes clear the need for quick and precise log analysis.
Fig. 1. Total number of incidents reported to CERT.br per year (adapted from http://www.cert.br).
In order to improve the performance of the intrusion detection task, researchers have been applying data mining techniques to log analysis. Data mining is one of the steps in the Knowledge Discovery in Databases (KDD) process. A simple description of KDD is "a general discovery process of useful knowledge (previously unknown) from large databases" 5. Data mining techniques are used in some intrusion detection tools such as MINDS 6, ADAM 7, GASSATA 8 and others.
Data mining requires a fairly large quantity of data to be analyzed and characterized. Useful data-gathering resources for discovering suspicious events in networks are honeypots and honeynets. A honeypot 9 is a security resource whose value lies in being probed, attacked or compromised, while collecting data about the probes and attacks. Honeynets 10 are a specific type of honeypot, called high-interaction honeypots, which allow for greater interaction between the attacker and the attacked computers and services, while collecting data about the interactions. This is possible because a honeynet does not merely emulate a service: it runs a working operating system and offers real services remotely. In Brazil, the Brazilian Honeypots Alliance monitors the national Internet space with more than 30 honeypot partners spread over the country*.
The purpose of this paper is to evaluate some data mining algorithms, applying them to the classification of suspicious network activities in order to verify their effectiveness under certain conditions. Our main concern is to reduce the amount of logs that will inevitably require human attention. We also propose the use of honeypot data to help expose the attack trends against a given network and to generate sets of non-legitimate data, since we can assume that honeypot logs are all related to non-legitimate access attempts.
The paper is organized as follows: in section 2 we describe the network topology where the logs are collected and the scenario prepared for log analysis. In section 3 the methodology is explained, as well as the data mining algorithms and techniques that were chosen. Section 4 contains the results obtained with the application of these techniques. The fifth section contains conclusions and directions for future work.
2. SCENARIO
Logs can be generated by several sources, such as operating systems, firewalls, specific software, network devices, etc. In this work, we focus on logs from resources that can represent a risk to an institution's networks. Attacks against networks can be carried out through maliciously crafted network traffic, illegal operating system interactions and/or unauthorized access to resources or applications, but we will focus only on information from network traffic logs.
The collected logs come from the Brazilian Institute for Space Research's demilitarized zone (DMZ). The Institute runs many remotely accessible services, such as Web servers, a File Transfer Protocol (FTP) server and a Secure Shell (SSH) server that provides access to the internal network. These services are vital to the Institute, since they allow researchers to access its resources from everywhere and provide weather reports and other information to the community. The logs generated by network access to these services are stored in dump format in a collector for later analysis. There is also a honeypot in the same network, which belongs to the Brazilian Distributed Honeypots Project and saves its logs in dump format. The topology of the network used in this work is shown in Fig. 2.
*More information about the Brazilian Distributed Honeypots Project can be obtained at http://www.honeypots-alliance.org.br
[Figure 2: network diagram showing the Internet, the FTP, Web and SSH servers, the honeypot, the log collector and the department networks within the DMZ.]
Fig. 2. Topology of the Brazilian Institute for Space Research DMZ, representing the scenario of this work.
The volume of network traffic stored in the collector approaches 1 GB of data on normal days, making human analysis infeasible. A great part of these logs is associated with legitimate traffic, i.e., normal user accesses or automated scripts that perform routine information exchange. Another part of the logs is related to port scanning or to malicious scripts trying to exploit or brute-force some protocol; these are undirected attacks, not aimed at specific targets in the Institute. Thus, most of the logs consist basically of noise (legitimate, inoffensive and/or known network traffic), interfering negatively with the analysis. The problem here is that the security analyst must know how to distinguish between interesting traffic and noise, performing the log analysis task as quickly as possible. This pre-analysis or filtering is very dependent on the analyst's expertise and demands a lot of time, even with automated tools to remove noise, so there is a need for automated, intelligent techniques to aid the analyst, and data mining techniques seem useful for this purpose.
3. METHODOLOGY
The process of testing the algorithms with the available data consists of many steps, involving the choice of algorithms, the attributes to represent the data, the pre-processing of the data to create an input dataset, the application of the algorithm itself and, finally, the evaluation of the results. As this work consists of classifying network logs as normal or suspicious, we chose to apply three techniques used in supervised classification tasks: k-nearest neighbors (KNN), neural networks and decision trees. The aim is to find which of them is the most adequate to correctly classify our data, providing information about how the classification was done and reducing the false-positive rate, all within an acceptable amount of time.
3.1 Data mining techniques
The k-nearest neighbor hypothesis is that records of the same class will be close to each other in the attribute space 11. The KNN classification algorithm is as follows: for each data instance of unknown class we calculate the attribute space distance between it and all instances of known class, selecting as the class for the unknown instance the most frequent class among its K nearest neighbors.
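As an illustrative sketch of the procedure above (not the WEKA implementation actually used in this work), the following Python fragment classifies a toy session vector by majority vote among its K nearest labeled neighbors; the vectors and labels are hypothetical examples:

```python
import math

def knn_classify(unknown, labeled, k=1):
    """Label an unknown instance with the most frequent class among its
    k nearest labeled neighbors (Euclidean distance in attribute space)."""
    dists = sorted(
        (math.dist(unknown, features), label) for features, label in labeled
    )
    nearest = [label for _, label in dists[:k]]
    return max(set(nearest), key=nearest.count)

# Toy session vectors: (duration, server bytes) with known classes.
labeled = [((0.5, 40), "suspicious"), ((12.0, 9000), "normal"),
           ((0.7, 60), "suspicious"), ((8.0, 7000), "normal")]
print(knn_classify((0.6, 50), labeled, k=3))  # -> suspicious
```

Note that every classification requires a distance to every stored instance, which is the source of the running-time problem discussed in section 4.1.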
Artificial neural networks are distributed or parallel computing structures composed of simple processing units called neurons 12. They can be used to solve complex learning tasks, being capable of handling both specialized and generalized problems. There are several forms of neural networks, but in this paper we use Multi-Layer Perceptrons (MLP) 13.
A decision tree 14 is a kind of machine learning algorithm known as a successive partitioning algorithm. These algorithms partition the original data set into successively more homogeneous subgroups, which are split at internal nodes until the desired level of detail is reached. A partition decision must be taken at each node, in order to classify the data according to the label of a leaf node.
These techniques are implemented in the Waikato Environment for Knowledge Analysis (WEKA) software package 15, an open-source software under the GNU GPL. WEKA is a collection of machine learning algorithms for data mining tasks, which contains tools for data pre-processing, classification, regression, clustering, discovery of association rules and visualization.
3.2 Choice of attributes
A preliminary analysis was done to choose which attributes are important to classify the traffic as normal or suspicious. As we are working on network traces and the objective is not to search for attack signatures in the data, i.e., in the packets' payloads, the attributes must be derived from the TCP/IP headers of the packets. Each packet has a TCP/IP header that contains traffic information, such as source and destination IP address, source and destination port, transport protocol, amount of data sent in the packet, sequence and ACK numbers and information flags, in addition to the payload itself. In this work we intend to separate the logs according to the label assigned to the sessions. A session is a communication involving two unique pairs (source IP and port, destination IP and port) which has begin and end timestamps and can be considered complete.
The session analysis provides other information, such as the number of packets and the number of bytes exchanged over the session. We work under the hypothesis that there is a limited set of attributes associated with the TCP/IP session that can be used to differentiate an attack from a normal session. We chose only seven attributes to be evaluated, unlike other approaches that use more than 30 different attributes, such as the one proposed in the KDD Cup 1999 16.
The attributes were chosen in such a way that none is a combination of the others; for example, an attribute "average bytes per packet" would be a combination of two other possible attributes, the number of packets and the quantity of bytes. Thus, we chose the minimum number of attributes that can be taken directly from the TCP/IP headers of a network session. The attributes used in the datasets of this work are:
Session duration: the time between the initial three-way handshake and the end of a session;
Server port: the port of the service offered by a server in our network;
Server packets: the number of packets sent by the server of the session;
Server bytes: the number of bytes sent by the server of the session;
Client packets: the number of packets sent by the client of the session;
Client bytes: the number of bytes sent by the client of the session;
Class: a nominal value that can be either normal or suspicious and is related to the source of the data, i.e., data from the honeypot are labeled as suspicious and data from the DMZ are labeled as normal.
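For illustration, the sketch below renders one session's attributes as an ARFF instance of the kind WEKA consumes; the attribute names and the sample values are hypothetical, chosen only to mirror the seven attributes listed above:

```python
# Hypothetical ARFF declaration mirroring the seven attributes above.
ARFF_HEADER = """@relation sessions
@attribute duration numeric
@attribute server_port numeric
@attribute server_packets numeric
@attribute server_bytes numeric
@attribute client_packets numeric
@attribute client_bytes numeric
@attribute class {normal,suspicious}
@data"""

def arff_row(session, label):
    """Format one session's six numeric attributes plus its class label
    as a single comma-separated ARFF data row."""
    return ",".join(str(v) for v in session) + "," + label

print(arff_row((3.2, 80, 10, 5120, 8, 940), "normal"))
# -> 3.2,80,10,5120,8,940,normal
```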
3.3 Data pre- processing
In order to evaluate the data mining algorithms, two kinds of data sets were separated: a dataset containing network traffic related to the DMZ, labeled as normal, and a dataset with network traffic directed at the honeypot associated with this DMZ. Data obtained from a honeypot is inherently associated with an attack, since there are no real services in it. As an initial evaluation, it is worth emphasizing that the normal-labeled data is unfiltered, i.e., it is quite possible that some attack data is concealed in the data collected from the DMZ.
To test the algorithms, we built several datasets consisting of real data from the DMZ and from the honeypot, separated by day. Some tools were developed to aid the task of taking the log data in tcpdump format and transforming it into arrays of attributes that can be read by the WEKA algorithms. These tools extract basic TCP/IP header information from each session and label it according to its origin (DMZ or honeypot), converting the dumps to the ARFF (Attribute-Relation File Format) files used in WEKA.
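The session-grouping step performed by such tools can be sketched as follows; this toy Python fragment (with made-up packet records, not our actual extraction tools) groups decoded packets into sessions keyed by the unordered endpoint pair:

```python
from collections import defaultdict

def group_sessions(packets):
    """Group decoded packets into sessions keyed by the unordered pair of
    endpoints (src IP:port, dst IP:port), so that both directions of one
    TCP conversation fall into the same session."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = tuple(sorted([(pkt["src"], pkt["sport"]),
                            (pkt["dst"], pkt["dport"])]))
        sessions[key].append(pkt)
    return sessions

# Two packets of one conversation, one in each direction (toy values).
packets = [
    {"src": "10.0.0.5", "sport": 3312, "dst": "192.0.2.1", "dport": 80, "bytes": 120},
    {"src": "192.0.2.1", "sport": 80, "dst": "10.0.0.5", "dport": 3312, "bytes": 4096},
]
print(len(group_sessions(packets)))  # -> 1 (a single session)
```

From each such group, per-session attributes (duration, packet and byte counts per direction) can then be accumulated before labeling.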
With the pre-processing task completed, the chosen algorithms were tested using the generated data sets as input. The next section describes the tests done with each algorithm and presents and comments on the results.
4. RESULTS
The data was collected on selected days in February, April and June 2005. The logs were collected daily and pre-processed as described in section 3. Corrupted logs or incomplete sessions were discarded, so only reliable data was used, and reliable data could be obtained only for some days of the chosen period. Those days and the corresponding numbers of normal and suspicious log entries (sessions) are listed in Table 1.
Table 1. Number of normal and suspicious sessions by day of month.

Month-Day              Number of normal sessions    Number of suspicious sessions
February-01 (02/01)    376136                       26026
February-02 (02/02)    293058                       24008
February-03 (02/03)    299394                       24012
February-04 (02/04)    290343                       16016
February-05 (02/05)    8193                         264
February-06 (02/06)    113535                       12024
February-07 (02/07)    119898                       10836
February-08 (02/08)    120366                       10010
February-09 (02/09)    52074                        2412
April-28 (04/28)       189806                       19038
April-29 (04/29)       220838                       20020
April-30 (04/30)       119774                       12462
June-01 (06/01)        571550                       26026
June-04 (06/04)        122064                       6513
June-06 (06/06)        204112                       13013
June-07 (06/07)        175848                       15015
June-28 (06/28)        10914                        714
One important fact about the attack data must be mentioned. Since attacks correspond to just a tiny fraction of the network activity, they amount to only a small number of entries in the collected data. The algorithms being evaluated tend to yield simple classification schemes for the data; therefore, if the attack data were used as-is, it would be completely ignored (considered normal) in order to obtain a simpler, faster classifier. To avoid this, the attack data was replicated, without the addition of derived or randomly perturbed data, until it amounted to at least 5% of the total data.
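This replication step can be sketched as follows; the Python fragment below (with made-up counts) duplicates the suspicious entries verbatim until they reach the 5% threshold:

```python
def replicate_minority(normal, suspicious, target=0.05):
    """Duplicate the suspicious entries verbatim (no derived or perturbed
    data) until they make up at least `target` of the combined dataset."""
    out = list(suspicious)
    while len(out) / (len(out) + len(normal)) < target:
        out.extend(suspicious)
    return out

# Toy counts: 1000 normal entries, 10 suspicious ones.
normal = list(range(1000))
suspicious = list(range(10))
boosted = replicate_minority(normal, suspicious)
print(len(boosted))  # -> 60 (first count at which 60/1060 >= 5%)
```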
The next subsections show the results of the application of the three chosen algorithms. The tables used to describe the results consist of the following fields:
Model building time, i.e., the time in seconds spent by WEKA to build a model to evaluate the data;
False positives, the percentage of normal data misclassified as suspicious;
False negatives, the percentage of suspicious data misclassified as normal;
Correctness, the percentage of data correctly classified with the current model, relative to all instances evaluated;
Effectiveness, which measures the usability of the model on other data sets. It is similar to "Correctness", but we apply the generated model to the data from the previous available day (or the next available day, in the case of a data set belonging to a different month).
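A minimal sketch of how the first three rates are computed from a labeling (the example labels are made up):

```python
def rates(actual, predicted):
    """Compute false-positive, false-negative, and correctness rates
    for a normal/suspicious labeling task."""
    n = len(actual)
    fp = sum(a == "normal" and p == "suspicious" for a, p in zip(actual, predicted))
    fn = sum(a == "suspicious" and p == "normal" for a, p in zip(actual, predicted))
    normals = actual.count("normal")
    suspects = n - normals
    return {"false_pos": fp / normals,        # share of normal data flagged
            "false_neg": fn / suspects,       # share of suspicious data missed
            "correctness": (n - fp - fn) / n} # share classified correctly

actual    = ["normal"] * 8 + ["suspicious"] * 2
predicted = ["normal"] * 7 + ["suspicious"] * 3
print(rates(actual, predicted))
# -> {'false_pos': 0.125, 'false_neg': 0.0, 'correctness': 0.9}
```

"Effectiveness" is the same correctness computation, applied to a model built on one day and evaluated on an adjacent day's data.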
4.1 KNN Results
The first batch of tests was done using the simplest method possible with the KNN algorithm: the 1-nearest neighbor. This classifier works by calculating and comparing distances in the attribute space, labeling an instance with the class of its nearest neighbor. In our case, the separation of the data is done by creating a set of hyperplanes that divide the instances into normal or suspicious. Depending on the quantity of data and the number of neighbors used as comparison parameter, KNN can be very slow, but it is considered to classify well, independently of the data distribution model.
The first try was taking too much time, so we decided to measure the classification time of KNN with a step-by-step increasing procedure. Thus, we tried to classify the smallest data set (02/05) with the 1-nearest neighbor. This test lasted 70 seconds and resulted in 100% correctness. We then applied the 3-nearest neighbor to the same data set, resulting in the same 100% correctness in 91 seconds. After that, we moved to the (06/28) data set, which finished in 197 seconds with 97.76% correctness. An increase of approximately 30% in the data set size almost tripled the classification time. The next data set (02/09) had a better correctness rate (99.25%), but its classification time was 5163 seconds. The last test was done on the (04/28) data set, which has 189806 normal instances and 19038 suspicious instances. After more than 3 hours of testing (exactly 11676 seconds), the classifier had evaluated only 36900 instances (17.67% of the total).
With more neighbors used in the comparison, the classification time would increase greatly, due to the number of distance calculations required. These results showed that it is impossible to use KNN on our data sets without modifications, because most of our data sets have more than a hundred thousand points and processing time is a major factor in our case. A variation to consider is to store fewer labeled samples to optimize the comparison, in a scheme of summarization or grouping of similar instances. In this way, the processing complexity decreases, consequently reducing the execution time.
4.2 Artificial Neural Network Results
The architecture of the MLP used was 6:2:2, i.e., six input neurons, two neurons in the hidden layer and two output neurons that indicate the classes.
As we cannot visualize the separability of the data in a traditional graph, because our attribute space is 6-dimensional, we have to guess the number of neurons in the hidden layer. The hidden layer neurons are used to create hyperplanes and combinations of them, aiming to classify non-linearly separable distributions. Initial tests with the data showed that more than one hidden layer slows the process down and does not produce better results.
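A forward pass through such a 6:2:2 MLP can be sketched as follows; the weights here are random placeholders, whereas in practice they come from backpropagation training inside WEKA:

```python
import math
import random

random.seed(0)

def layer(inputs, weights):
    """One fully connected layer with logistic (sigmoid) activation."""
    return [1 / (1 + math.exp(-sum(w * x for w, x in zip(ws, inputs))))
            for ws in weights]

# Random placeholder weights for a 6:2:2 network; a trained network
# would use weights learned by backpropagation instead.
w_hidden = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(2)]
w_output = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]

def mlp_6_2_2(session):
    """Forward pass: 6 attribute inputs -> 2 hidden neurons -> 2 class scores."""
    return layer(layer(session, w_hidden), w_output)

scores = mlp_6_2_2([0.5, 0.2, 0.1, 0.9, 0.3, 0.7])
print(len(scores))  # two outputs, one score per class (normal / suspicious)
```

The opacity visible here, where the decision is buried in weight matrices, is exactly why interpreting the MLP results is difficult, as discussed below.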
The time spent in each test varied between 37 and 2216 seconds, which is acceptable. The false positive rate is lower than 1% in all data sets, but the false negative rate is too high in most cases. In the (02/02) data set, no suspicious session was correctly classified at all. Thus, the suspicious sessions are being considered normal and, despite the low false positive rate providing a reduced data set to the analysts, with these models they will miss many attacks or suspicious behaviors in their networks. The results are presented in Table 2.
Considering the objective, which is to reduce the data submitted to human analysis, the neural networks provided good results, but this reduction has the side effect of letting suspicious data pass undetected. Furthermore, the model is easily reusable (the architecture and weights), but the interpretation of the results is difficult because of the complexity of the classification mechanism. Some tests using 10 neurons in the hidden layer were done, to see whether more hyperplanes would separate the data better. These tests only increased the evaluation time and had practically no effect on the results. For example, using the (02/06) data set with this architecture resulted in a model building and evaluation time of 7354 seconds, while the correctness rate increased by only 0.23%.
Table 2. Results obtained using a 6:2:2 MLP neural network.

Date       Model building time (seconds)    False positives    False negatives    Correctness    Effectiveness
(02/01) 1281.65 0.11% 30.76% 97.89% 99.76%
(02/02) 1098.86 0 100% 92.42% 96.37%
(02/03) 950.39 0 0.04% 99.99% 91.71%
(02/04) 974.38 0.49% 62.25% 96.27% 96.92%
(02/05) 49.28 0 7.57% 99.76% 99.80%
(02/06) 450.95 0.13% 36% 96.42% 86.20%
(02/07) 467.82 0.02% 27.77% 97.67% 99.99%
(02/08) 481.89 0.004% 80% 93.85% 99.70%
(02/09) 165.68 0.01% 33.33% 98.51% 99.26%
(04/28) 1047.1 0.16% 52.63% 95.05% 99.85%
(04/29) 741.69 0.47% 50% 95.40% 94.71%
(04/30) 467.07 0.35% 22.52% 97.55% 94.73%
(06/01) 2216.69 0.87% 12.28% 98.62% 99.88%
(06/04) 470.2 0.06% 50.97% 95.02% 98.36%
(06/06) 598.81 0.09% 15.38% 98.99% 96.18%
(06/07) 580.87 0.01% 13.33% 98.94% 95.84%
(06/28) 37.91 0.43% 42.85% 96.95% 98.17%
4.3 Decision Tree Results
Last, we have carried out tests using decision trees, a supervised classification algorithm. Since it is
importan t to understa n d the classification process, it is not convenient to have a tree with a great
depth, in order to facilitate the results interpretation. So we use a pruned tree with confidence factor
of 0.25.
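The split selection underlying such a tree can be sketched in Python; this toy fragment picks, for one numeric attribute, the threshold with maximum information gain (the durations and labels are made-up examples, not our data):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_split(values, labels):
    """Pick the threshold on one numeric attribute that maximizes
    information gain, as a decision-tree node construction would."""
    base = entropy(labels)
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must put data on both sides
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

# Toy session durations: short sessions suspicious, long ones normal.
durations = [0.4, 0.6, 0.5, 9.0, 11.0, 8.5]
labels = ["suspicious", "suspicious", "suspicious", "normal", "normal", "normal"]
print(best_split(durations, labels))  # -> (0.6, 1.0): a perfect split
```

Each chosen (attribute, threshold) pair becomes a readable test at a tree node, which is what makes the resulting classifier easy for an analyst to inspect.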
As all the attributes are tested in order to create nodes, and this creation is based on the information gain provided by the attribute with respect to the class, decision trees do not carry the processing load involved in iterative training or in calculating distances among several instances. Consequently, the results appear quickly. The discretization of the numeric attributes contributes to the comprehensibility of the final results, which are the best obtained in the tests. The false positive rate obtained for each data set is very low, while the false negative rate is much lower than that obtained with the neural networks. The results of these tests are in Table 3.
Table 3. Results obtained using a C4.5 decision tree.

Date       Model building time (seconds)    False positives    False negatives    Correctness    Effectiveness
(02/01) 48.05 0.05% 23.07% 98.45% 99.91%
(02/02) 19.38 1.78% 25% 96.45% 97.96%
(02/03) 51.54 1.62% 0 98.49% 98.21%
(02/04) 29.35 1.47% 0 98.59% 98.33%
(02/05) 0.62 0 0 100% 98.85%
(02/06) 11.01 0.97% 0 99.11% 99.54%
(02/07) 11.82 1.08% 0 99% 98.86%
(02/08) 6.06 0.02% 0 99.97% 100%
(02/09) 3.97 0.01% 16.66% 99.24% 99.97%
(04/28) 24.06 2.19% 0 98% 98.21%
(04/29) 41.45 1.37% 10% 97.57% 97.84%
(04/30) 19.98 0.14% 9.67% 98.95% 99.86%
(06/01) 78.11 0.1% 3.84% 99.73% 96.23%
(06/04) 10.92 0.84% 0 99.19% 97.34%
(06/06) 20.71 0.07% 7.69% 99.46% 99.90%
(06/07) 8.27 0.11% 0 99.89% 99.95%
(06/28) 0.98 0.35% 33.19% 97.62% 98.04%
5. CONCLUSION
In this paper we attemp t ed to solve a network - security related task with the aid of data mining
techniques. The main purpose was to simplify some tasks which must be performed by a human
analyst. In order to avoid analysis of network connection payloads, we focused on a relevant subset
of features which can be extracted from the packets header, which is very cost effective and it is
expected to be useful to classify normal versus suspicious network activity.
From the chosen methods and algorithms, we conclude that decision trees yielded the best results, both numerically and considering the representation of their decision process: decision trees are structurally very similar to expert systems, and therefore can be easily understood by human analysts and modified to include more knowledge. We consider this feature to be very important, since it allows the human analyst to identify not only what can be an attack but also why, i.e., which features and values were used in the classification.
A traditional machine learning algorithm, K-nearest neighbors, also yielded interesting results, but the implementation used is computationally very expensive and should therefore be improved before real application, since one of the requirements for the solution of our problem is that the processing time must be short enough to allow a timely response from the analyst.
Future work on this research will be directed to the improvement of the KNN algorithm, possibly using a hybrid supervised/unsupervised learning scheme with clustering as a pre-processing step, and to the application of visualization and the identification of trends using time-stamped decision trees.
ACKNOWLEDGEMENTS
The authors would like to thank Benicio Pereira de Carvalho Filho for authorizing the use of network dumps from the Brazilian Institute for Space Research (INPE) DMZ.
REFERENCES
1. GNU Project, grep, http://gnu.org/software/grep
2. M. J. Ranum, Artificial Ignorance: How-To Guide, http://www.ranum.com/security/computer_security/papers/ai/index.html, 1997.
3. R. Rehman, Intrusion Detection with SNORT: Advanced IDS Techniques using SNORT, Apache, MySQL, PHP and ACID, Prentice Hall PTR, 2003.
4. H. S. Javitz and A. Valdes, "The SRI IDES statistical anomaly detector", Proc. of the Symposium on Research in Security and Privacy, 316-326, Oakland, CA, May 1991.
5. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
6. L. Ertoz, E. Eilertson, A. Lazarevic, P. Tan, J. Srivastava, V. Kumar and P. Dokas, "The MINDS – Minnesota Intrusion Detection System", Next Generation Data Mining, 2004.
7. D. Barbará, J. Couto, S. Jajodia and N. Wu, "ADAM: a testbed for exploring the use of data mining in intrusion detection", ACM SIGMOD Record, v. 30, n. 4, 15-24, December 2001.
8. L. Mé, "GASSATA, a genetic algorithm as an alternative tool for security audit trail analysis", First International Workshop on the Recent Advances in Intrusion Detection (RAID), September 1998.
9. L. Spitzner, Honeypots – Tracking Hackers, Addison-Wesley, 2003.
10. The Honeynet Project, Know Your Enemy – Learning About Security Threats, Addison-Wesley, 2004.
11. P. Adriaans and D. Zantinge, Data Mining, Addison-Wesley, 1996.
12. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1998.
13. L. V. Fausett, Fundamentals of Neural Networks, Prentice Hall, 1994.
14. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
15. E. Frank, M. Hall and L. Trigg, WEKA: Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/~ml/weka/index.html
16. KDD Cup 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html