Abstract
Data leakage prevention (DLP) is an emerging subject in the field of information security. It deals with tools that work under a central policy and analyze networked environments to detect sensitive data, prevent unauthorized access to it, and block the channels through which data leaks. This requires special data classification capabilities to distinguish sensitive data from normal data. The task needs not only prior knowledge of the sensitive data but also the ability to recognize data that has evolved or was previously unknown. Most current DLPs detect sensitive data through content-based analysis, mainly regular expressions and data fingerprinting. Although these techniques are robust in detecting known, unmodified data, they usually become ineffective when the sensitive data was not previously known or has been heavily modified. In this paper we study the effectiveness of N-gram based statistical analysis, supported by the use of stem words, in classifying documents according to their topics. The results are promising, with an overall classification accuracy of 92%. We also discuss how classification deteriorates when the text is subjected to multiple spins that simulate data modification.
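To make the idea of N-gram based classification over stemmed words concrete, the following is a minimal sketch and not the authors' implementation. It assumes word bigrams, a deliberately crude suffix-stripping function in place of a real stemmer, and cosine similarity between frequency profiles as the matching rule; the function names (`stem`, `ngram_profile`, `cosine`, `classify`) and the two toy categories are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately crude suffix stripper (an assumption for this
# sketch); a real system would use a proper stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def ngram_profile(text, n=2):
    """Count word N-grams over the stemmed, lower-cased tokens of text."""
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(p, q):
    """Cosine similarity between two N-gram frequency profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, category_profiles, n=2):
    """Return the category whose profile is most similar to the text."""
    profile = ngram_profile(text, n)
    return max(category_profiles,
               key=lambda name: cosine(profile, category_profiles[name]))


# Toy category profiles built from one sentence each (illustrative only).
profiles = {
    "finance": ngram_profile(
        "The bank approved the loans and reported quarterly earnings"),
    "sports": ngram_profile(
        "The team scored two goals and won the match"),
}

label = classify("The bank approves the loans", profiles)
```

Because "approved" and "approves" reduce to the same stem, the modified sentence still shares bigrams with the finance profile, which is the kind of tolerance to small rewording that exact fingerprint matching lacks.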
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Alneyadi, S., Sithirasenan, E., Muthukkumarasamy, V. (2014). A Semantics-Aware Classification Approach for Data Leakage Prevention. In: Susilo, W., Mu, Y. (eds) Information Security and Privacy. ACISP 2014. Lecture Notes in Computer Science, vol 8544. Springer, Cham. https://doi.org/10.1007/978-3-319-08344-5_27
DOI: https://doi.org/10.1007/978-3-319-08344-5_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08343-8
Online ISBN: 978-3-319-08344-5
eBook Packages: Computer Science, Computer Science (R0)