Abstract
Support vector machines (SVMs) are a popular class of supervised learning algorithms and are particularly well suited to large, high-dimensional classification problems. Like most machine learning methods for data classification and information retrieval, they require manually labeled data samples in the training stage. However, manual labeling is a time-consuming and error-prone task. One possible solution is to exploit the large number of unlabeled samples that are easily accessible via the internet. This paper presents a novel active learning method for text categorization. The main objective of active learning is to reduce the labeling effort, without compromising classification accuracy, by intelligently selecting which samples should be labeled. The proposed method selects a batch of informative samples using the posterior probabilities provided by a set of multi-class SVM classifiers; these samples are then manually labeled by an expert. Experimental results indicate that the proposed active learning method significantly reduces the labeling effort while also improving classification accuracy.
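The batch selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how posterior probabilities are turned into an informativeness score, so the example below assumes one common choice: the gap between the two largest class posteriors, with a smaller gap taken to mean a more ambiguous sample. The function name `select_batch`, the margin criterion, and the example probabilities are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def select_batch(posteriors: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the `batch_size` most ambiguous unlabeled samples.

    Assumption: ambiguity is scored as the gap between the two largest class
    posteriors (smaller gap = more informative). This is an illustrative
    criterion, not necessarily the one used in the paper.
    """
    def margin(probs: Sequence[float]) -> float:
        top = sorted(probs, reverse=True)
        return top[0] - top[1]

    # Rank pool indices from smallest margin (most ambiguous) to largest.
    ranked = sorted(range(len(posteriors)), key=lambda i: margin(posteriors[i]))
    return ranked[:batch_size]


# Hypothetical posteriors for four unlabeled documents over three classes,
# as a multi-class SVM with probability outputs might produce.
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10],
]
batch = select_batch(pool, batch_size=2)
print(batch)  # indices of the samples to send to the human expert
```

In an active learning loop, the selected samples would be labeled by the expert, moved from the unlabeled pool into the training set, and the classifiers retrained before the next round of selection.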
Author information
Additional information
Recommended by Editor-in-Chief Huo-Sheng Hu
Mohamed Goudjil received the M.Sc. degree in computer engineering from Boumerdes University, Algeria, in 2008. He is currently a Ph.D. candidate in computer engineering at Ecole nationale Supérieure d’Informatique (ESI), Algeria. From 2005 to 2008, he was a researcher at the Advanced Technologies & Research Centre, and he was a lecturer for seven years at different universities.
His research interests include text classification, Arabic language processing and machine learning.
Mouloud Koudil received the Ph.D. degree in computer science from l’Ecole nationale Supérieure d’Informatique (ESI), Algeria, in 2002. He is currently a full-time professor and rector of the same institution.
His research interests include wireless sensor networks, networks on chips, and hardware/software codesign.
Mouldi Bedda received the Ph.D. degree in electrical engineering from the University Nancy 2, France, in 1985. From 1985 to 2006, he worked at the University Badji Mokhtar Annaba, Algeria, where he was the director of the Automatic and Signals Laboratory from 2001 to 2006. Since 2006, he has been a full professor at the College of Engineering of Al Jouf University, KSA. He has supervised several Ph.D. students in speech processing, biomedical signals, handwriting recognition and image processing.
His research interests include speech processing, biomedical signals, handwriting recognition and image processing.
Noureddine Ghoggali received the State Engineer degree in electronics from the University of Batna, Algeria, in 2000, and the Ph.D. degree in information and communication technologies from the Department of Information Engineering and Computer Science, University of Trento, Italy. He is currently an assistant professor at the University of Batna, Algeria.
His research interests include pattern recognition and evolutionary computation methodologies for remote sensing image analysis.
Cite this article
Goudjil, M., Koudil, M., Bedda, M. et al. A Novel Active Learning Method Using SVM for Text Classification. Int. J. Autom. Comput. 15, 290–298 (2018). https://doi.org/10.1007/s11633-015-0912-z
DOI: https://doi.org/10.1007/s11633-015-0912-z