Journal of Information Technology and Computing
https://jitc.sabapub.com
ISSN: 2709-5916
2023, Volume 4, Issue 1: 1–14
DOI: 10.48185/jitc.v4i1.612

Ontology based Feature Selection and Weighting for Text classification using Machine Learning

Djelloul BOUCHIHA 1,*, Abdelghani BOUZIANE 1, Noureddine DOUMI 2
1 Dept. Mathematics and Computer Science, Ctr Univ Naama, EEDIS Lab., UDL-SBA, Algeria
2 Department of Computer Science, Faculty of Technologies, University of Saida, Algeria
* Corresponding Author: djelloul.bouchiha@univ-sba.dz

Received: 28.02.2023 • Accepted: 05.03.2023 • Published: 27.06.2023 • Final Version: 27.06.2023

Abstract: Text classification consists of assigning a text (document) to its corresponding class (category). It can be performed using an artificial intelligence technique called machine learning. However, before training the machine learning model that classifies texts, three main steps are mandatory: (1) Preprocessing, which cleans the text; (2) Feature selection, which chooses the features that significantly represent the text; and (3) Feature weighting, which aims at numerically representing the text through a feature vector. In this paper, we propose two algorithms for feature selection and feature weighting. Unlike most existing works, our algorithms are sense-based, since they use an ontology to represent not the syntax but the sense of a text as a feature vector. Experiments show that our approach gives encouraging results compared to existing works. However, some additional suggested improvements can make these results more impressive.

Keywords: Text Classification, Feature selection, Feature weighting, Machine Learning (ML), Ontology, WordNet

1. Introduction

Text classification, also called text categorization, aims to classify texts (documents) into specific classes (categories) [1]. Thus, text classification allows attributing a text to one predefined class. It finds applications in several fields, notably information retrieval (IR) [2], information filtering (IF) [3], Web filtering [4], email or spam filtering [5], news filtering [6], sentiment analysis [7], knowledge management (KM) [8], text summarization [9], etc.

The text classification problem can be addressed using an artificial intelligence technique called machine learning. Arthur Samuel defines Machine Learning (ML) as the research field that gives machines the capability to learn without explicit programming [10].
Before launching the ML algorithm, the classification process needs: (1) Preprocessing, which cleans the text; (2) Feature selection, which chooses the features (generally words) that significantly represent the text; and (3) Feature weighting, which aims at numerically representing the text through a feature vector.

Generally, preprocessing is a common step for all classifiers. However, several feature selection and weighting methods, and various ML algorithms, can constitute a classifier. In this paper, we opted for a standard preprocessing step, and we used a classical ML algorithm, namely SVM. However, we propose new algorithms for feature selection and feature weighting. Both of them are ontology-based (sense-based). The first one takes all the concepts and relations of the domain ontologies that correspond to the text categories and considers them as features. The second algorithm builds the feature vector of a text by computing the number of terms (words) that are semantically close to each feature according to a similarity measure.

The lack of domain ontologies has obstructed our solution. So, domain ontologies have been replaced by WordNet, which is a large lexical database that provides the senses of English words [11]. Nouns, verbs, adverbs and adjectives are grouped into sets of cognitive synonyms called synsets, each describing a distinct concept. Synsets are connected through lexical and conceptual-semantic relations. Nouns and verbs are organized into is-a (hypernym) hierarchies.
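As a small illustration, assuming NLTK and its WordNet corpus are installed, synsets, their terms (lemmas), hyponyms and hypernym chains can be browsed as follows; the word "sport" is an arbitrary example, not one of the fixed categories of our system:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# All the senses (synsets) of the word "sport"
for synset in wn.synsets("sport"):
    print(synset.name(), "-", synset.definition())

# Terms (lemmas) grouped in the first noun sense, and some of its hyponyms
sport = wn.synset("sport.n.01")
print([lemma.name() for lemma in sport.lemmas()])
print([h.name() for h in sport.hyponyms()][:5])

# The is-a (hypernym) chain of this sense, up to the root of the hierarchy
print([s.name() for s in sport.hypernym_paths()[0]])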
The remainder of the paper is organized as follows: Section 2 reviews some feature selection and weighting methods; Section 3 describes the text classification process; Section 4 presents our proposed text classification system, notably the feature selection and weighting algorithms; Section 5 is devoted to experiments, where the classification tool, its evaluation and a comparative study are discussed; and finally, Section 6 gives some conclusions and perspectives.

2. Related work

A feature in a text can be a simple term (word), a complex linguistic structure (e.g., part of speech (POS)), supporting information (e.g., a word's first position), a statistical structure (e.g., an n-gram), a Named Entity (e.g., a person's name), etc. [12]. Feature selection consists in selecting a subset of the features that describe the texts. In the literature, feature selection is also referred to as Dimensionality Reduction, because it aims to reduce the dimensions of the feature matrix that will be defined later. Feature selection should increase the classification accuracy and decrease the computational complexity (time and space) by deleting noise features. Thus, the feature selection step is important to improve the text classifier's accuracy, efficiency and scalability [13]. Next is a non-exhaustive list of feature selection techniques:

• Information gain (IG) has the same statistical meaning as the Kullback–Leibler divergence [14]. In text classification, IG is frequently used as a goodness criterion of a term (word): it measures the number of bits of information obtained for predicting the class (category) of a document by knowing whether the term occurs in that document [15].

• Chi-square (also called the Chi-squared test or χ2 test): originally, Pearson published a paper on Chi-square in which he investigated a test of goodness of fit [16]. To classify texts, Chi-square is employed to measure the relevance between t (term) and C (class) [17]; a small sketch of this technique follows the list. Galavotti et al. proposed a simplified variant of Chi-square called the GSS Coefficient (GSS) [18]. The authors in [19] proposed another variant of Chi-square, called the Correlation Coefficient (CC) or NGL.

• Mutual information (MI) was defined and analyzed for the first time by Claude Shannon [20]. However, he did not call it Mutual Information; this term appeared later in [21]. MI measures the mutual dependence between two random variables in probability theory and information theory. To select features from a text, MI measures the variations in the distribution of terms and assigns a much higher rank to terms of a positive nature [22].

• Term strength (TS) was initially proposed and evaluated for vocabulary reduction in text retrieval [23]. Later, TS was used in text classification [24, 25]. In this context, TS estimates the importance of a term based on how often the term appears in closely related texts. A training set is used to obtain pairs of texts whose cosine similarity is greater than a threshold. Then, TS is calculated as the estimated conditional probability that a term appears in the second half of a pair of related texts, given that it appears in the first half.

• The odds ratio (OR) is a statistical measure of the association between an exposure and an outcome [26]. In text classification, OR was proposed to select terms with relevance feedback [27]. OR starts from the fact that the distribution of features on the relevant documents is different from the distribution of features on the non-relevant documents. Mladenić, in [27], defines three measures inspired by the original OR formula: FreqOddsRatio, FreqLogP and ExpP.

• The Gini Index (GI) measures the purity of the features with respect to the class [28]. In text classification, purity is the discrimination level of a term to distinguish between possible classes.

• Term Frequency (TF) and Document Frequency (DF): TF is defined as the number of times a term occurs in a text. TF can be used to select features from a text; for example, we keep only the features (terms) whose TF exceeds a threshold. DF is the number of documents (texts) in which a term occurs. DF can also be used to select features; for example, we keep only the features (terms) whose DF exceeds a threshold.
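As a minimal sketch of such a word-based technique, chi-square selection can be run with scikit-learn as follows; the toy corpus, its labels and the value k=5 are illustrative assumptions, not data from our experiments:

# Sketch: word-based feature selection with the chi-square test (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = [
    "the match ended with a late goal",        # sport
    "the striker scored twice in the game",    # sport
    "the bank raised interest rates again",    # business
    "shares fell after the profit warning",    # business
]
labels = ["sport", "sport", "business", "business"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)            # document-term matrix (counts)

selector = SelectKBest(chi2, k=5)              # keep the 5 best-scoring terms
selector.fit(X, labels)

terms = vectorizer.get_feature_names_out()
print("Selected features:", list(terms[selector.get_support()]))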
Most of the feature selection techniques cited above are word-based (term-based). The advantage of word-based techniques is that a large text is reduced to a set of simple, independent terms, which makes the classification efficient. However, relationships between terms are lost [29]. Besides this, semantic ambiguity (polysemy and synonymy issues) occurs when using terms as features [12]. To overcome these problems, an ontology-based (sense-based) feature selection method should be used.

Gruber defines an ontology as an explicit specification of a conceptualization [30]. In computer science, an ontology is a working model of entities and interactions [31]. An ontology consists mainly of concepts, relations, instances and axioms. The concepts correspond to a set of entities or things within a domain. The relations describe the interactions between the concepts. The instances are the things represented by a concept. The axioms are used to constrain values for concepts or instances.

To the best of our knowledge, no existing work actually uses domain ontologies for feature selection and weighting in the text classification process. However, a few attempts use WordNet in this context. In [32], the authors used a dataset (the Brown Corpus semantic concordance) annotated with WordNet to compare word-based and sense-based features in the text classification process. With a small training set (182 texts), they did not significantly improve the classification effectiveness with sense-based features. In [33], the author proposed sequence kernels for words and POS tags, which detect basic syntactic information and basic lexical semantics. Moschitti concluded that his kernels are more effective and efficient than previous models. Peng & Choi in [34] proposed text classification based on the words' senses and the relationships between the senses. Their experiment showed that using the WordNet semantic hierarchy to obtain a sense-based document representation increases classification accuracy.

After the feature selection step, a text will be represented as a feature vector that consists of the weights of the features in the considered text. A feature weight indicates the degree of importance of the feature in the text; it can be, for example, the number of occurrences of the feature in the text. Feature weighting of a set of texts generates feature vectors that constitute the so-called feature matrix. This matrix is of dimensions m*n, with m the number of texts and n the number of features. In the literature, feature weighting is also referred to as feature extraction, indexing or document representation.

Several feature weighting techniques can be found in the literature:

• Bag-of-Words (BoW) appeared early in a linguistic context [35]. Lately, it has been widely applied in text classification [36], where each word's frequency (occurrence count) is used as a feature value for training a classifier. So, in this method, a text is represented as a bag (a multiset) of words, disregarding the grammar and order of the words, but keeping their multiplicities.

• An N-gram is a sequence of n words appearing in a document in that order [37]. The notion was introduced in a mathematical theory of communication [20]. It was later used in Natural Language Processing (NLP), where an N-gram is usually defined as a sequence of N words [38].

• Term Frequency – Inverse Document Frequency (TFIDF) computes the importance of a word within a text (document) [39]. TFIDF is the product of Term Frequency (TF) and Inverse Document Frequency (IDF); a small sketch follows the list. TF was introduced as the first form of term weighting, the weight of a term that occurs in a text [40]. IDF quantifies the specificity of a term as the inverse function of the number of texts in which the term appears [41].

• Word2vec is an NLP technique that uses neural networks to learn relationships between words from a huge dataset [42, 43]. As a result, each word is represented by a list of numbers called a vector, such that the semantic similarity between two words corresponds to the similarity between their vectors. An extension of word2vec, called doc2vec, represents not only each word but also the entire document as a vector [44].

• HashingVectorizer converts a collection of texts into a matrix of token occurrences [45]. This text vectorizer implementation uses the hashing trick [46] to convert a token (string) into a feature (integer).
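As a minimal sketch of the weighting step, the TFIDF scheme above can be applied with scikit-learn as follows; the two-document corpus is an illustrative assumption:

# Sketch: TF-IDF feature weighting (scikit-learn). Each row of the resulting
# matrix is the feature vector of one document; columns are vocabulary terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the team won the championship game",
    "the company reported strong quarterly earnings",
]

vectorizer = TfidfVectorizer(stop_words="english")
feature_matrix = vectorizer.fit_transform(docs)   # shape: (n_docs, n_terms)

print(vectorizer.get_feature_names_out())
print(feature_matrix.toarray().round(2))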
3. Text classification Process

Our text classification process starts with a dataset consisting of a set of English texts that belong to several categories. Since we opted, in this paper, for supervised learning, we used a labeled dataset, i.e., each text must be annotated with its corresponding category. The dataset will be split into two parts: a training set (70% of the dataset) and a test set (30% of the dataset). As shown in Figure 1, the text classification process consists of three main phases: Training, Test and Prediction.

The training phase receives as input the training set, which undergoes four main stages:

1. Preprocessing receives as input English texts, cleans these texts, and generates tokenized cleaned texts [47].

2. Feature selection selects a subset of the features available for describing the texts, in order to reduce the dimensions of the so-called feature matrix, whose rows are the feature vectors of the texts. This step produces a set of selected features. Note that Feature selection is drawn in Figure 1 with dashed lines because it is sometimes not specified in the classification process; in that case, all the tokens (words) from the preprocessing step are considered features and will be weighted in the next step.

3. Feature weighting generates a numerical representation of the cleaned texts. As a result, this step produces a feature matrix.

4. The machine learning algorithm builds a machine learning model that constitutes the kernel of a classifier. The model represents what was learned by the machine learning algorithm.

The test phase consists in running the classifiers generated from the previous phase on the test set and measuring each classifier's performance. As an output of this phase, the best classifier is selected. Finally, the prediction phase receives a new English text introduced by the user as input. Then, the best classifier determines the text's category.

Figure 1. Text classification process

4. Proposed text classification system

In this paper, we have set up a system that covers the four steps of the classification process:

4.1. Preprocessing

Preprocessing includes tokenization and normalization. It also removes stop-words, numbers, punctuation, links, white-spaces and non-English words. Tokenization splits a text into subunits called tokens [48]; generally, a token is a word of the text. Normalization reduces a token to its base form; two normalization techniques are usually used: stemming and lemmatization [49]. Stop-words are high-frequency words in a document, such as "the", "but" and "not", that are filtered out because their presence in a text fails to distinguish it from the other texts [50]. Thus, stop-words do not contribute to the content of the text [36], and consequently, they are deleted from the text. In addition, numbers, punctuation, URI links, multiple white-spaces and non-English words are removed from the text because they have no impact on the classification process.
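A simplified sketch of such a preprocessing step, assuming NLTK for stop-word removal and lemmatization and a regular-expression tokenizer (the example sentence is illustrative), might look as follows:

# Sketch of the preprocessing step: tokenization, lowercasing, removal of
# stop-words / numbers / punctuation / links, and lemmatization with NLTK.
import re
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)             # remove URI links
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize; drops numbers and punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop-words
    return [LEMMATIZER.lemmatize(t) for t in tokens]      # normalize (lemmatize)

print(preprocess("The 2 players were not training at http://example.com today!"))
# expected output similar to: ['player', 'training', 'today']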
4.2. Ontology based feature selection approach

In this paper, we propose an ontology based feature selection algorithm. As shown in Figure 2, our first idea was to take a set of domain ontologies, each corresponding to a category of texts, and then to consider all the concepts and relations of these ontologies as the selected features.

Figure 2. Ontology based feature selection and weighting

The lack of domain ontologies has obstructed this first idea. So, domain ontologies have been replaced by WordNet, as illustrated in Figure 3. In this case, the selected features are all the terms of all the synsets (and their hyponyms) that correspond to the texts' categories.

Figure 3. WordNet based feature selection and weighting

Next is the first proposed algorithm for feature selection:

Algorithm 1: Feature Selection
Inputs: Categories, WordNet
Output: Features_list
Begin
  For each Category
    From WordNet, extract all synsets in which the Category name occurs
    For each extracted synset
      Add all the terms of the synset to Features_list
      From WordNet, extract all hyponyms of the synset
      For each hyponym
        Add all the terms of the hyponym to Features_list
      End_for
    End_for
  End_for
End

The algorithm receives WordNet and a list of categories taken from the dataset as input. It returns a list of features that will be weighted in the next step.

4.3. Ontology based feature weighting approach

For the third step of the classification process, we propose a second algorithm that builds the feature vector of a text by computing the number of terms (in the text) that are semantically close to each feature according to a similarity measure.

Algorithm 2: Feature Weighting
Inputs: Texts, Features_list, WordNet, Similarity, Threshold
Output: Feature_Matrix
Begin
  For each Text
    For each Feature in Features_list
      For each Term in Text
        If Similarity(Term, Feature) >= Threshold
          Feature_Matrix[Text, Feature] = Feature_Matrix[Text, Feature] + 1
        End_if
      End_for
    End_for
  End_for
End

The algorithm receives as input texts, features, WordNet, a similarity measure and a threshold, and returns a feature matrix. A feature matrix is a set of feature vectors, each corresponding to a text. A similarity measure computes how alike two elements are; it returns a value in [0, 1], where 1 means that the two elements are semantically equivalent. The threshold is the value that decides whether a term and a feature are semantically close enough to be considered equivalent.
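A schematic Python rendering of Algorithms 1 and 2, assuming NLTK's WordNet interface, noun synsets only, first-sense lookup of terms and the Wu-Palmer (wup) similarity, could look as follows; it is a sketch under these assumptions, not the exact implementation of the released tool:

# Sketch of Algorithms 1 and 2 with NLTK's WordNet interface.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def select_features(categories):
    """Algorithm 1 (sketch): terms of the categories' synsets and of their hyponyms."""
    features = set()
    for category in categories:
        for synset in wn.synsets(category, pos=wn.NOUN):
            for s in [synset] + synset.hyponyms():
                features.update(lemma.name().lower() for lemma in s.lemmas())
    return sorted(features)

def weight_text(tokens, features, threshold=0.8):
    """Algorithm 2 (sketch): count tokens semantically close to each feature."""
    vector = {}
    for feature in features:
        f_syn = wn.synsets(feature, pos=wn.NOUN)
        count = 0
        for term in tokens:
            t_syn = wn.synsets(term, pos=wn.NOUN)
            if f_syn and t_syn:
                sim = f_syn[0].wup_similarity(t_syn[0]) or 0.0   # similarity in [0, 1]
                if sim >= threshold:
                    count += 1
        vector[feature] = count
    return vector

features = select_features(["sport", "business"])
print(weight_text(["football", "coach", "market"], features[:20]))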
4.4. Machine learning algorithm

Several ML algorithms exist in the literature. In our classification system, we opted for Support Vector Machines (SVM), a nonlinear generalization of the Generalized Portrait algorithm initially used for pattern recognition and computer learning [51, 52]. Afterward, SVM was introduced explicitly as a new machine learning algorithm to solve classification problems [53]. It maps the input vectors into a high-dimensional space and constructs separating hyper-plane(s).

5. Experiments

To perform our experiments, we first chose the BBC dataset [54]. The BBC dataset consists of 2225 texts taken from the BBC news website, distributed over 5 categories: politics, business, entertainment, sport and tech. It was randomly split into a training set (70%) and a test set (30%). We then opted for the Python language to implement our classifier.

5.1. Implementation

Python is a powerful programming language [55]. It is also easy to learn. Python's standard library contains built-in modules that provide standardized solutions for many of the problems encountered when implementing our classifier. A module is a file containing Python definitions and statements, and a collection of modules is called a package. Python also allows us to install external packages, among which we cite:

• For the preprocessing step, the nltk [50], textblob [56] and tashaphyne [57] packages have to be installed.
• Feature selection and weighting need the installation of gensim [58]. The numpy [59] package is also necessary for these two stages.
• To implement the ML algorithm, we used the scikit-learn package [45, 60].

All these Python packages helped us implement our ontology based classifier, which we made open access (https://github.com/khouloud-1/Ontology_Based_Classifier) for the benefit of the NLP and AI communities.

5.2. Evaluation

As mentioned above, we implemented our classification tool covering: preprocessing, ontology based feature selection and weighting, and SVM. Then, we checked our classification tool using the BBC dataset [54]. Our classifier dealt with a 2225×226 feature matrix: 2225 is the number of documents in the BBC dataset, and 226 is the number of terms of the WordNet synsets that are semantically related to the five categories of the BBC dataset.

As an evaluation metric, we opted for Accuracy, which is the fraction of the study population that is decided correctly [61]. For a classification problem, Accuracy = (P + N)/T, where P (true positives) is the number of documents correctly classified, N (true negatives) is the number of documents correctly not classified, and T is the total number of documents.

We recall that our proposed weighting algorithm (Section 4.3) needs two important inputs: a threshold and a similarity measure. The question that arises now is: which similarity measure gives the best classification results, and with which threshold? To answer this question, we implemented six WordNet based similarity measures and, for each one, tested a series of threshold values. As shown in Figure 4, the implemented similarity measures relying on WordNet are: path [62], which is computed by inverting the length of the shortest path between two synsets; lch [63], which scales the shortest path between the two synsets by the maximum path length in the is-a hierarchy in which the synsets appear; wup [64], which computes the path length from the LCS (Least Common Subsumer) of the two synsets to the root node, and scales this value by the sum of the path lengths from the root to the individual synsets; res [65], which returns a score based on the information content of the LCS of the two synsets; and lin [66] and jcn [67], which both relate the LCS information content to the sum of the information content of the two synsets: lin scales the LCS information content by this sum, while jcn subtracts it from this sum and then converts the inverse of the resulting distance into a similarity.

Different combinations (similarity measure, threshold) have been tested. Our ontology based classifier reaches its best result (an Accuracy of 0.82) with the wup similarity and a threshold of 0.8.

Figure 4. Evaluation results with different similarity measures (accuracy as a function of the threshold, for the path, lch, res, wup, lin and jcn measures)
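For illustration, the following sketch shows how a feature matrix of this kind can be fed to an SVM and scored with accuracy using scikit-learn; the random matrix and labels merely stand in for the real 2225×226 WordNet-based matrix and are assumptions for demonstration only:

# Sketch: training an SVM on a feature matrix and measuring accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 226))          # stand-in feature matrix
y = rng.choice(["politics", "business", "entertainment", "sport", "tech"], size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="linear")                       # SVM classifier
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))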
5.3. Comparative study

To show the efficiency of our classifier, we compare it to three other classifiers (also provided in the repository mentioned above), all of which have the same preprocessing step and the same ML algorithm, namely SVM. The three classifiers also have a word-based feature selection step. However, the first classifier uses BoW for feature weighting, the second uses TFIDF, and the third uses Doc2Vec.

Figure 5. Classification accuracy with different feature selection/weighting methods

Figure 5 shows that our approach outperforms two of the three classifiers. However, it needs further improvement to become the best one. This can be explained by the fact that, in its current version, our WordNet based classifier uses only the noun hierarchy. Besides this, WordNet is a lexical ontology dedicated mainly to linguistic applications. We think that using actual domain ontologies instead of the lexical WordNet would improve our classification system's results for several reasons:

• The concepts of an ontology are specific nouns describing the category to which a text belongs.
• Relations between concepts can represent verbs in a text.
• Ontology instances can be Named Entities in the text.
• Ontology axioms can infer hidden information that can contribute to the classification process.

6. Conclusion and perspectives

Feature selection and weighting are essential steps in the text classification process, which aims to attribute a text to its corresponding category. While feature selection extracts the important features from a text, feature weighting represents the text through a feature vector that includes a set of values, each representing the weight (degree of importance) of a feature in the text. The text classification process also needs Preprocessing as the first step and an ML algorithm as the last one. In this context, most existing classifiers use word-based feature selection and weighting techniques. In this paper, we propose two sense-based algorithms for selecting and weighting features. Both consider concepts of domain ontologies as the features that can characterize a text to be classified. Our first intention was to use domain ontologies corresponding to the text categories. However, we substituted WordNet for them due to the lack of such ontologies. A classification tool has been implemented, and experiments have been conducted to show the efficiency of our approach compared to existing works. The experimental results were encouraging; however, some additional suggested improvements can make these results more impressive.

As future work, we plan to use real domain ontologies, or RDF Linked Data, which is data interlinked with other data [68] and widely available on the Web, like DBpedia [69]. In its current version, our classifier uses only two kinds of WordNet similarity measures in the second algorithm: path-based (wup, lch and path) and content-based (jcn, lin and res). The next version should implement relation-based measures to improve the classification results. Relation-based measures are: hso [70], which computes the relatedness between two synsets by looking for a path between them that is not too long and does not change direction too often; lesk [71], which computes relatedness by scoring the overlaps between the synsets' glosses; and vector [72], which computes relatedness by finding the cosine between the gloss vectors of the two synsets.

References

[1] K. Nalini and L. J. Sheela, "Survey on text classification," International Journal of Innovative Research in Advanced Engineering, vol. 1, pp. 412-417, 2014.
[2] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, vol. 520: Addison-Wesley, 2010.
[3] C. Lanquillon, "Enhancing text classification to improve information filtering," Otto-von-Guericke-Universität Magdeburg, Universitätsbibliothek, 2001.
[4] R. Du, R. Safavi-Naini, and W. Susilo, "Web filtering using text classification," in The 11th IEEE International Conference on Networks (ICON 2003), Sydney, NSW, Australia, 2003, pp. 325-330.
[5] A. Bhowmick and S. M. Hazarika, "E-Mail Spam Filtering: A Review of Techniques and Trends," in Advances in Electronics, Communication and Computing, A. Kalam, S. Das, and K. Sharma, Eds., Singapore: Springer Singapore, 2018, pp. 583-590.
[6] K. Lang, "NewsWeeder: Learning to Filter Netnews," in Machine Learning Proceedings 1995, A. Prieditis and S. Russell, Eds., San Francisco, CA: Morgan Kaufmann, 1995, pp. 331-339.
[7] B. Liu and L. Zhang, "A Survey of Opinion Mining and Sentiment Analysis," in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds., Boston, MA: Springer US, 2012, pp. 415-463.
[8] M. Heidarysafa, K. Kowsari, L. Barnes, and D. Brown, "Analysis of Railway Accidents' Narratives Using Deep Learning," in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, Florida, USA, 2018, pp. 1446-1453.
[9] I. Mani and M. T. Maybury, Advances in Automatic Text Summarization, abridged, illustrated, reprint ed.: MIT Press, 1999.
[10] A. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers," IBM Journal of Research and Development, vol. 3, pp. 210-229, 1959.
[11] G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, pp. 39-41, 1995.
[12] X. Zhou, R. Gururajan, Y. Li, R. Venkataraman, X. Tao, G. Bargshady, P. D. Barua, and S. Kondalsamy-Chennakesavan, "A survey on text classification and its applications," Web Intelligence, vol. 18, pp. 205-216, 2020.
[13] J. Chen, H. Huang, S. Tian, and Y. Qu, "Feature selection for text classification with Naïve Bayes," Expert Systems with Applications, vol. 36, pp. 5432-5435, 2009.
[14] S. Kullback and R. A. Leibler, "On Information and Sufficiency," The Annals of Mathematical Statistics, vol. 22, pp. 79-86, 1951.
[15] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[16] K. Pearson, "X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 50, pp. 157-175, 1900.
[17] Y. Zhai, W. Song, X. Liu, L. Liu, and X. Zhao, "A Chi-Square Statistics Based Feature Selection Method in Text Classification," in 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 2018, pp. 160-163.
[18] L. Galavotti, F. Sebastiani, and M. Simi, "Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization," in Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries, Lisbon, Portugal, 2000, pp. 59-68.
[19] H. T. Ng, W. B. Goh, and K. L. Low, "Feature selection, perceptron learning, and a usability case study for text categorization," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, Pennsylvania, USA, 1997, pp. 67-73.
[20] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[21] R. M. Fano, "Transmission of Information: A Statistical Theory of Communications," American Journal of Physics, vol. 29, pp. 793-794, 1961.
[22] D. Agnihotri, K. Verma, and P. Tripathi, "Mutual information using sample variance for text feature selection," in Proceedings of the 3rd International Conference on Communication and Information Processing, Tokyo, Japan, 2017, pp. 39-44.
[23] W. J. Wilbur and K. Sirotkin, "The automatic identification of stop words," Journal of Information Science, vol. 18, pp. 45-55, 1992.
[24] Y. Yang, "Noise reduction in a statistical approach to text categorization," in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, 1995, pp. 256-263.
[25] Y. Yang and J. Wilbur, "Using corpus statistics to remove redundant words in text categorization," Journal of the American Society for Information Science, vol. 47, pp. 357-369, 1996.
[26] M. Szumilas, "Explaining odds ratios," Journal of the Canadian Academy of Child and Adolescent Psychiatry, vol. 19, pp. 227-229, 2010.
[27] D. Mladenić, "Feature subset selection in text-learning," in Machine Learning: ECML-98, Berlin, Heidelberg, 1998, pp. 95-100.
[28] W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, and Z. Wang, "A novel feature selection algorithm for text categorization," Expert Systems with Applications, vol. 33, pp. 1-5, 2007.
[29] D. Shen, J.-T. Sun, Q. Yang, H. Zhao, and Z. Chen, "Text Classification Improved through Automatically Extracted Sequences," in Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, 2006, pp. 121-121.
[30] T. R. Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, vol. 5, pp. 199-220, 1993.
[31] R. Stevens, C. A. Goble, and S. Bechhofer, "Ontology-based knowledge representation for bioinformatics," Briefings in Bioinformatics, vol. 1, pp. 398-414, 2000.
[32] A. Kehagias, V. Petridis, V. G. Kaburlasos, and P. Fragkou, "A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms," Journal of Intelligent Information Systems, vol. 21, pp. 227-247, 2003.
[33] A. Moschitti, "Syntactic and semantic kernels for short text pair categorization," in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 2009, pp. 576-584.
[34] X. Peng and B. Choi, "Document Classifications based on Word Semantic Hierarchies," in Proceedings of the International Conference on Artificial Intelligence and Applications (AIA'05), Innsbruck, Austria, 2005, pp. 362-367.
[35] Z. S. Harris, "Distributional Structure," WORD, vol. 10, pp. 146-162, 1954.
[36] M. F. McTear, Z. Callejas, and D. Griol, The Conversational Interface, 1st ed., vol. 6: Springer Cham, 2016.
[37] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, "Text Classification Algorithms: A Survey," Information, vol. 10, p. 150, 2019.
[38] D. Jurafsky and J. H. Martin, Speech and Language Processing, Third Edition draft, 2021.
[39] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets: Cambridge University Press, 2011.
[40] H. P. Luhn, "A Statistical Approach to Mechanized Encoding and Searching of Literary Information," IBM Journal of Research and Development, vol. 1, pp. 309-317, 1957.
[41] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, pp. 11-21, 1972.
[42] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781, 2013. Available: https://ui.adsabs.harvard.edu/abs/2013arXiv1301.3781M
[43] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, vol. 26, 2013.
[44] Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents," in Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, pp. 1188-1196.
[45] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. Vanderplas, A. Joly, B. Holt, and G. Varoquaux, "API design for machine learning software: experiences from the scikit-learn project," arXiv:1309.0238, 2013. Available: https://ui.adsabs.harvard.edu/abs/2013arXiv1309.0238B
[46] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, "Feature hashing for large scale multitask learning," in Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, Canada, 2009, pp. 1113-1120.
[47] D. Bouchiha, A. Bouziane, and N. Doumi, "Machine Learning for Arabic Text Classification: A Comparative Study," Malaysian Journal of Science and Advanced Technology, vol. 2, pp. 163-173, 2022.
[48] G. Grefenstette, "Tokenization," in Syntactic Wordclass Tagging, H. van Halteren, Ed., Dordrecht: Springer Netherlands, 1999, pp. 117-133.
[49] M. Toman, R. Tesar, and K. Jezek, "Influence of word normalization on text classification," in Proceedings of Multidisciplinary Approaches to Global Information Systems, InSciT 2006, Merida, Spain, 2006, pp. 354-358.
[50] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit: O'Reilly Media, 2009.
[51] V. Vapnik and A. Chervonenkis, "A note on one class of perceptrons," Automation and Remote Control, vol. 25, pp. 821-837, 1964.
[52] V. Vapnik and A. Lerner, "Pattern recognition using generalized portrait method," Automation and Remote Control, vol. 24, pp. 774-780, 1963.
[53] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[54] D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, USA, 2006, pp. 377-384.
[55] Python Software Foundation. (2022, October 2022). Python 3.10.7 documentation. Available: https://docs.python.org/3/
[56] S. Loria. (2020, October 2022). textblob Documentation, Release 0.16.0. Available: https://buildmedia.readthedocs.org/media/pdf/textblob/latest/textblob.pdf
[57] T. Zerrouki. (2019, October 2022). Tashaphyne, Arabic light stemmer. Available: https://pypi.org/project/Tashaphyne/0.3.4.1/
[58] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 2010, pp. 45-50.
[59] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, "Array programming with NumPy," Nature, vol. 585, pp. 357-362, 2020.
[60] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[61] C. E. Metz, "Basic principles of ROC analysis," Seminars in Nuclear Medicine, vol. 8, pp. 283-298, 1978.
[62] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity - Measuring the Relatedness of Concepts," in Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI), San Jose, California, USA, 2004, pp. 25-29.
[63] C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification," in WordNet: An Electronic Lexical Database, vol. 49, 1998, pp. 265-283.
[64] Z. Wu and M. Palmer, "Verbs semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, USA, 1994, pp. 133-138.
[65] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montreal, Quebec, Canada, 1995, pp. 448-453.
[66] D. Lin, "An Information-Theoretic Definition of Similarity," in Proceedings of the Fifteenth International Conference on Machine Learning, 1998, pp. 296-304.
[67] J. J. Jiang and D. W. Conrath, "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy," in Proceedings of the 10th Research on Computational Linguistics International Conference, Taipei, Taiwan, 1997, pp. 19-33.
[68] T. Berners-Lee. (2006, October 2022). Linked Data - Design Issues. Available: https://www.w3.org/DesignIssues/LinkedData.html
[69] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, "DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia," Semantic Web, vol. 6, pp. 167-195, 2015.
[70] G. Hirst and D. St-Onge, "Lexical chains as representations of context for the detection and correction of malapropisms," in WordNet: An Electronic Lexical Database, vol. 305, MIT Press, 1998, pp. 305-332.
[71] S. Banerjee and T. Pedersen, "Extended gloss overlaps as a measure of semantic relatedness," in Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003, pp. 805-810.
[72] S. Patwardhan, "Incorporating dictionary and corpus information into a context vector measure of semantic relatedness," Doctoral dissertation, University of Minnesota, Duluth, 2003.