Abstract
Modelling and understanding user interests are particularly important tasks for designing services and building systems that deliver customised solutions in web personalisation and recommender systems. User-generated content (UGC) constitutes a significant source of information for capturing user interests. This paper suggests an approach to user profiling that analyses the Term Frequency (TF) and the Inverse Document Frequency (IDF) of selected tourism services by utilising the Fuzzy set Qualitative Comparative Analysis (FsQCA). It analyses a sample of customer reviews collected from tourism web sites. The amount of money that customers spent during their hotel stay is taken as the outcome set in the FsQCA analysis. The results produce causal combinations of services that are necessary and sufficient for building customer interest models that best lead to the outcome, and argue for the applicability of the FsQCA in modelling user interests.
1 Introduction
Recommender systems (RC) utilise techniques ranging from statistics to AI and machine learning in order to capture user interests, build user and product/service profiles, and suggest the most appropriate products or services to users. RC draw on several methods for developing user preference models, with user-generated content (UGC) representing a rich source of customer information [1, 2]. Since social media platforms allow users to exchange experiences, feedback, opinions, complaints, etc., they provide significant information for capturing and understanding user interests [3]. Web personalisation is another area where user profiling is necessary, both for developing customised web interfaces and for supporting personalised search [4] that allows users to retrieve search results according to their personal needs.
2 User Profiling in Tourism
Building user interest models has also been the focus of e-tourism research. Drawing on the analysis of behavioural, socio-economic and demographic data, several researchers have shed light on people’s travel behaviour [3]. Indeed, surveys on travellers’ preferences have shown that the travel selection process is complex, depending, among other things, on personality and mood-related factors, service quality issues, Word-of-Mouth (WOM) and electronic WOM (eWOM). Customers often express their experience by publishing reviews. Sentiment analysis of user reviews provides the means for capturing and modelling users’ preferences, emotions and attitudes, thus refining market segmentation by grouping customers with similar needs and incentives and predicting customers’ travel behaviour more precisely [5].
Collantes and Mokhtarian [6] claim that a variety of factors, such as personality traits, travel-related behaviours, lifestyle characteristics and travel trends, determine the subjective assessment of travelling and tourism services. Other researchers have noticed that travel behaviour is influenced by travel experiences and feelings [7, 8]. It is also argued that it is important to analyse human behaviour characteristics in order to understand how customers react to alternative transport policies [9]. Other travel research studies have analysed environmental factors that influence travel and tourism. Stradling and Anable [10] argue that environmental characteristics, such as workplace, shops and site topography, affect travel choices.
Several approaches have been proposed for building user interest models. Kim and Chan [11] proposed a hierarchical model for representing user interests. The user profile is constructed by analysing documents that users have visited on the web. The analysis of the documents yields a list of user interests, which are subsequently grouped by similarity within the hierarchical interest model. It is argued that four classes of information context need to be specified when attempting to understand user interests [12]. The general information class refers to personal characteristics such as the name, contact details and demographics of the user. The event class represents the user’s activities. The preference class refers to the user’s interests. The social network class describes the user’s connections and interactions with other users. The preference class is usually discovered by analysing various sources, such as relevant documents that the user has published [12, 13].
Several approaches have also been proposed for representing user interests. The three most frequent formats are keywords, semantic networks and concept-based representations [14, 15]. Keywords representing domains of interest are associated with weights indicating the strength of the user’s interest in a particular topic. Polysemy and synonymy are problems associated with keywords. Semantic networks address these problems by representing keywords as nodes that are connected with each other by links that include co-occurrences. Concept-based representations resemble semantic networks in structure, but their nodes represent abstract topics rather than keywords [14, 15]. User profiles can be used in various ways: during personalised information retrieval, when a system detects relevant documents and information according to users’ interests; when re-evaluating the relevance of documents, taking into consideration which documents a user has retrieved; and during query processing, when a user query can be modified based on user interests [16].
It is argued that filtering and clustering techniques are very useful in reducing the number of concepts found on the web so that they can be used in formulating user profiles. However, [16] argues that these techniques lack effectiveness, for they produce the same structure of interests for users with different needs. Research shows that while many systems produce and use user profiles, e.g. in web personalisation and recommender systems, there exists no definitive procedure for deriving user interests [16,17,18,19]. This paper addresses the need for investigating alternative ways of developing user interest models and suggests the analysis of TF-IDF with the FsQCA.
3 Methodology
The aim of the paper is to identify the causal combinations that are necessary and sufficient to represent customer interests. To this end, it utilises the FsQCA to analyse the TF and IDF of UGC and to produce the causal combinations that best lead to an outcome. The FsQCA is particularly useful for investigating intertwined relationships between multiple factors that affect a dependent variable or contribute to the realisation of a certain outcome [20]. In FsQCA, variables are modelled as sets, and the method analyses the set relationships among causes. FsQCA models allow a detailed analysis of how alternative conditions of causes combine and contribute to high membership scores of the outcome [21]. FsQCA may detect multiple paths, i.e. alternative causal combinations that lead to high levels of the same outcome [20, 22]. The data in this paper are collected from customer reviews published on hotel web sites. Causal combinations are built from tourism service terms, such as room, view, cleanliness, etc., that appear in the set of selected documents. The outcome set in this paper is the large amount of money spent by the customer; other outcome sets can also be considered. Thus, this paper aims to identify the combinations of customer interests in hotel services that best reflect customer spending. A sample of the collected data is analysed in this paper. The steps of the methodology are shown below:
1. Select documents published by user \( (u_{i}) \).
2. Identify the terms that will constitute the causal combinations and specify the term that will represent the outcome set.
3. Calculate the TF and the IDF for each identified term.
4. Calculate the weight of each term \( (t_{k}) \) using the following formula:
\[ W_{tk} = TF_{tk} \times \log \left( \frac{N_{i}}{d_{tk}} \right) \quad (1) \]
where \( W_{tk} \) represents the weight of term \( (t_{k}) \), \( TF_{tk} \) is the term frequency of term \( (t_{k}) \), \( N_{i} \) is the total number of documents published by user \( (u_{i}) \) and \( d_{tk} \) represents the number of documents that contain term \( (t_{k}) \).
5. Apply the FsQCA and produce the user interests causal combinations.
   a. Produce the truth table of all possible permutations of the terms considered. Each permutation is a possible causal combination.
   b. Calculate the membership degree of each combination, drawing on the theory of fuzzy set operations. Assume two fuzzy sets \( \tilde{A} \) and \( \tilde{B} \); then the fuzzy union is defined as
\[ \mu_{\tilde{A} \cup \tilde{B}}(x) = \max \left( \mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x) \right) \quad (2) \]
the fuzzy intersection is defined as
\[ \mu_{\tilde{A} \cap \tilde{B}}(x) = \min \left( \mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x) \right) \quad (3) \]
and the fuzzy complement is calculated as
\[ \mu_{\tilde{A}^{c}}(x) = 1 - \mu_{\tilde{A}}(x) \quad (4) \]
6. Calculate the consistency and the coverage of the solutions using formulas (5) and (6) respectively:
\[ Consistency(X \le Y) = \frac{\sum \min (X,Y)}{\sum X} \quad (5) \]
\[ Coverage(X \le Y) = \frac{\sum \min (X,Y)}{\sum Y} \quad (6) \]
where \( (X) \) is the membership degree of each causal combination and \( (Y) \) is the membership degree of the outcome set.
7. Identify the best combinations by selecting those that exhibit a consistency rate above a threshold (set at 0.8 in this paper) and the highest possible coverage. Simplify the solutions into the final set of causal combinations.
The final causal combinations indicate the hotel services that customers who spend a large amount of money consider the most important.
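Steps 3 and 4 can be illustrated with a short Python sketch. It assumes that TF is the raw count of a term across a user's reviews, that the IDF part is log10(N_i / d_tk), and that reviews are tokenised on whitespace. The function name `term_weights` and the sample reviews are hypothetical, and the sketch is not the tool chain used in the paper.

```python
import math


def term_weights(documents, terms):
    """Weight each term as TF * log10(N / d) over one user's documents.

    Assumptions (not taken from the paper): TF is the raw count of the term
    across all documents, the logarithm is base 10, and a term that occurs
    in no document gets weight 0.
    """
    n_docs = len(documents)
    tokenised = [doc.lower().split() for doc in documents]
    weights = {}
    for term in terms:
        tf = sum(tokens.count(term) for tokens in tokenised)
        d = sum(1 for tokens in tokenised if term in tokens)
        weights[term] = tf * math.log10(n_docs / d) if d else 0.0
    return weights


# Hypothetical reviews written by a single user.
reviews = [
    "quiet room and friendly staff",
    "great restaurant and friendly staff",
    "the restaurant was busy",
    "room was small",
]
weights = term_weights(reviews, ["staff", "restaurant", "room", "pool"])
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

Note that a term appearing in every document receives weight 0 under this definition, because log(N / d) vanishes when d = N.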
4 Data Analysis: Illustrative Example
This paper analyses reviews collected from five (5) hotel customers. Then, for simplicity, five (5) terms representing hotel services are selected from the total set of terms identified in the reviews. The outcome set, large amount of money spent (LMSp) by each user during his/her hotel stay, is represented with triangular fuzzy numbers (TFN). The membership function \( f_{A} (x) \) of a TFN \( \tilde{A}(a,m,b) \) can be calculated according to the following equation [25]:
\[ f_{A}(x) = \begin{cases} \dfrac{x - a}{m - a}, & a \le x \le m \\ \dfrac{b - x}{b - m}, & m \le x \le b \\ 0, & \text{otherwise} \end{cases} \]
where a, m, b are real numbers with \( a \le m \le b \). The linguistic scales used and their corresponding TFNs adopted in this study are shown in Table 1.
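As a minimal sketch, the membership function of a triangular fuzzy number can be written as below. The function name `tfn_membership` and the example TFN (0.25, 0.5, 0.75) are illustrative assumptions; they are not the linguistic scales of Table 1.

```python
def tfn_membership(x, a, m, b):
    """Membership of x in the triangular fuzzy number (a, m, b), a <= m <= b."""
    if x == m:
        return 1.0  # peak of the triangle (also covers a == m or m == b)
    if a < x < m:
        return (x - a) / (m - a)  # rising edge
    if m < x < b:
        return (b - x) / (b - m)  # falling edge
    return 0.0  # outside the support


# Hypothetical TFN, not taken from Table 1.
for x in (0.2, 0.375, 0.5, 0.7, 0.9):
    print(x, tfn_membership(x, 0.25, 0.5, 0.75))
```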
The linguistic scales indicate to what extent a customer belongs to the set of those who spend a large amount of money during their hotel stay. The TF and IDF scores (step 3) are calculated with the KNIME tool for all documents published by each user \( (u_{i} ) \). Then, the weight of each term is obtained from formula (1). The results are shown in Table 2.
Next, the FsQCA is applied and the truth table is developed. Since there are 5 terms to consider, the number of permutations is \( 2^{5} = 32 \). Table 3 shows part of the truth table.
The cells in the truth table take the value (1) or (0), representing true or false. Thus, permutation number 3 is read (Quietness = false, Sea View = false, Staff Friendliness = false, Cultural Activities = true, Restaurant = false). Next, the membership degrees of all combinations are calculated for each user, drawing on the theory of fuzzy set operations. Table 4 shows the membership degrees for the first 17 combinations.
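The truth table can be enumerated mechanically. The sketch below assumes that rows are listed in binary counting order, with the last term varying fastest. Table 3 may use a different ordering, but under this assumption row 3 matches the reading given above.

```python
from itertools import product

TERMS = (
    "Quietness",
    "Sea View",
    "Staff Friendliness",
    "Cultural Activities",
    "Restaurant",
)

# One row per true/false assignment of the five terms: 2**5 = 32 rows.
truth_table = list(product((0, 1), repeat=len(TERMS)))

# Rows are numbered from 1 in the text, so row 3 is index 2.
row_3 = dict(zip(TERMS, truth_table[2]))
print(len(truth_table))
print(row_3)
```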
The membership degree of combination number 3 \( \mu_{C3} \), for customer-1, see framed cell in Table 4, is calculated as follows by using formulas (3) and (4):
Consider combination number 3 membership degree \( \mu_{C3} \) = \( \mu \)(Quietness = false \( \cap \) Sea View = false \( \cap \) Staff Friendliness = false \( \cap \) Cultural Activities = true \( \cap \) Restaurant = false) = \( \mu \)(not (Quietness), not (Sea View), not (Staff Friendliness), Cultural Activities, not (Restaurant)).
\( \mu \)(Quietness = false) = 1 − \( \mu \)(Quietness) = 1 − 0.3 = 0.7. Similar calculations are performed for all terms; thus, \( \mu_{C3} \) = min(0.7; 0.5; 0.6; 0.3) = 0.3. After all membership degrees are calculated, the consistency and coverage degrees are determined. Table 5 shows the results for the first 17 combinations.
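The calculation above combines the fuzzy complement (1 − μ) for terms marked false with the fuzzy intersection (min) across all terms. A minimal sketch follows; only the Quietness weight of 0.3 is taken from the text, and the remaining weights are hypothetical placeholders because Table 2 is not reproduced here.

```python
def combination_membership(weights, row):
    """Fuzzy membership of one truth-table row for one customer.

    weights: dict mapping term -> membership degree in [0, 1]
    row:     dict mapping term -> 1 (term present) or 0 (term negated)
    """
    degrees = []
    for term, present in row.items():
        mu = weights[term]
        degrees.append(mu if present else 1 - mu)  # complement when negated
    return min(degrees)  # intersection over all terms


# Quietness = 0.3 comes from the text; the other values are hypothetical.
customer_1 = {
    "Quietness": 0.3,
    "Sea View": 0.5,
    "Staff Friendliness": 0.4,
    "Cultural Activities": 0.3,
    "Restaurant": 0.5,
}
combination_3 = {
    "Quietness": 0,
    "Sea View": 0,
    "Staff Friendliness": 0,
    "Cultural Activities": 1,
    "Restaurant": 0,
}
mu_c3 = combination_membership(customer_1, combination_3)
print(round(mu_c3, 3))
```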
The consistency for combination number 3 is calculated by applying formula (5) as follows: consider the outcome column (Y) shown in Table 2 and the membership degrees (X) of combination number 3 for all users, as shown in Table 4. Then, dividing \( \sum \min (X,Y) \) by \( \sum X \) over all users gives a consistency of 0.733 for combination number 3.
Regarding the coverage, by applying formula (6), \( \sum \min (X,Y) = 1.5 \) and \( \sum Y = 2.9 \), thus coverage = 0.37.
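Both measures can be computed with a few lines of Python, assuming the usual FsQCA definitions: consistency = Σ min(X, Y) / Σ X and coverage = Σ min(X, Y) / Σ Y. The membership vectors below are hypothetical and are not the values of Tables 2 and 4.

```python
def consistency(x, y):
    """Degree to which the combination X is a subset of the outcome Y."""
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(x)


def coverage(x, y):
    """Share of the outcome Y that is accounted for by the combination X."""
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(y)


# Hypothetical membership degrees, one entry per customer.
x = [0.2, 0.6, 0.4, 0.8]  # membership in the causal combination
y = [0.5, 0.5, 0.9, 0.7]  # membership in the outcome set

print(f"consistency = {consistency(x, y):.3f}")
print(f"coverage    = {coverage(x, y):.3f}")
```

The two measures share the same numerator, so a combination with low overall membership (small Σ X) can reach high consistency while covering little of the outcome.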
According to FsQCA, the best causal combinations should exhibit consistency and coverage that are as high as possible. In general, however, higher consistency tends to come with lower coverage. Applying first a threshold value of 0.8 for the consistency and then selecting the highest possible coverage, the analysis results in two causal combinations: combinations number 12 and 16, extracted from Table 3, are shown in Table 6.
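The selection rule can be sketched as a filter followed by a sort. In the example, the scores of combination 3 are the ones reported above; the scores of combinations 12, 16 and 20 are hypothetical, since Table 5 is not reproduced here, and `select_combinations` is an illustrative name.

```python
def select_combinations(scores, threshold=0.8):
    """Keep combinations whose consistency meets the threshold,
    ordered by coverage from highest to lowest.

    scores: dict mapping combination id -> (consistency, coverage)
    """
    eligible = [
        (cid, cons, cov)
        for cid, (cons, cov) in scores.items()
        if cons >= threshold
    ]
    eligible.sort(key=lambda item: item[2], reverse=True)
    return [cid for cid, _, _ in eligible]


scores = {
    3: (0.733, 0.37),   # reported in the text; fails the 0.8 threshold
    12: (0.85, 0.40),   # hypothetical
    16: (0.90, 0.35),   # hypothetical
    20: (0.95, 0.10),   # hypothetical
}
print(select_combinations(scores))
```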
A closer look at the combinations reveals that “quietness” is not among the customers’ interests at all; it is not a necessary service. Thus, restructuring the causal combinations, the analysis shows that customers who spend a large amount of money show interest in
- (Sea View) AND (Staff friendliness) AND (Cultural activities) AND (Restaurant) OR
- (Sea View) AND (Cultural activities) AND (Restaurant).
To simplify the causal combinations further, “staff friendliness” could be omitted, since it does not appear in both combinations.
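This simplification corresponds to a single step of Boolean minimisation: two rows that differ in exactly one condition can be merged by dropping that condition. The sketch below assumes that combinations 12 and 16 are the rows (0, 1, 0, 1, 1) and (0, 1, 1, 1, 1), which is what binary counting order would give and which agrees with the two expressions listed above; the helper names are hypothetical.

```python
TERMS = (
    "Quietness",
    "Sea View",
    "Staff Friendliness",
    "Cultural Activities",
    "Restaurant",
)


def merge_if_adjacent(row_a, row_b):
    """Merge two rows that differ in exactly one position.

    The differing position becomes None ("don't care").
    Returns None when the rows cannot be merged.
    """
    diff = [i for i, (a, b) in enumerate(zip(row_a, row_b)) if a != b]
    if len(diff) != 1:
        return None
    merged = list(row_a)
    merged[diff[0]] = None
    return tuple(merged)


def describe(row):
    """Render a row as a readable conjunction, skipping dropped terms."""
    parts = []
    for term, value in zip(TERMS, row):
        if value is None:
            continue
        parts.append(term if value == 1 else f"NOT {term}")
    return " AND ".join(parts)


row_12 = (0, 1, 0, 1, 1)  # assumed encoding of combination 12
row_16 = (0, 1, 1, 1, 1)  # assumed encoding of combination 16
merged = merge_if_adjacent(row_12, row_16)
print(describe(merged))
```

The merged row still records Quietness as absent; the discussion above treats that absence as "not of interest" and leaves it out of the final expression.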
5 Conclusions-Future Research
This study suggests that the FsQCA can be used for modelling users’ interests. Data selected from customer reviews are analysed by utilising the TF and the IDF. The application of the FsQCA yields useful insights that can be used to understand customer priorities and build customer profiles. Future research can focus on examining the applicability of the FsQCA to handling multiple outcome sets and to specifying the priorities of terms. When the FsQCA method is applied to large data sets with a long list of factors, the truth table and the set of possible causal combinations can become cumbersome to analyse. Thus, future research can also focus on combining the FsQCA with other techniques that prune the truth table and reduce the causal combinations to a manageable size.
References
1. Martínez-Garcia, E., Ferrer-Rosell, B., Coenders, G.: Profile of business and leisure travelers on low cost carriers in Europe. J. Air Transp. Manag. 20, 12–14 (2012)
2. Baka, V.: The becoming of user-generated reviews: looking at the past to understand the future of managing reputation in the travel sector. Tour. Manag. 53, 148–162 (2016)
3. Hunecke, M., Haustein, S., Böhler, S., Grischkat, S.: Attitude-based target groups to reduce the ecological impact of daily mobility behavior. Environ. Behav. 42, 3–43 (2010)
4. Zhang, Z., Lin, H., Liu, K., Wu, D., Zhang, G., Lu, J.: A hybrid fuzzy-based personalized recommender system for telecom products/services. Inf. Sci. (Ny) 235, 117–129 (2013). https://doi.org/10.1016/j.ins.2013.01.025
5. Wedel, M., Kamakura, W.A.: Market segmentation: Conceptual and methodological foundations. Springer Science & Business Media (2012)
6. Collantes, G.O., Mokhtarian, P.L.: Subjective assessments of personal mobility: what makes the difference between a little and a lot? Transp. Policy 14, 181–192 (2007)
7. Handy, S., Weston, L., Mokhtarian, P.L.: Driving by choice or necessity? Transp. Res. Part A Policy Pract. 39, 183–203 (2005)
8. Sheller, M., Urry, J.: The new mobilities paradigm. Environ. Plan. A. 38, 207–226 (2006)
9. Schade, J., Schlag, B.: Acceptability of urban transport pricing strategies. Transp. Res. Part F Traffic Psychol. Behav. 6, 45–61 (2003)
10. Stradling, S.G., Anable, J.: Individual transport patterns (2008)
11. Kim, H.R., Chan, P.K.: Learning implicit user interest hierarchy for context in personalization. In: Proceedings of the 8th International Conference on Intelligent User Interfaces, pp. 101–108. ACM (2003)
12. Joung, Y., El Zarki, M., Jain, R.: A user model for personalization services. In: Fourth International Conference on Digital Information Management, ICDIM 2009, pp. 1–6. IEEE (2009)
13. Bakalov, F., König-Ries, B., Nauerz, A., Welsch, M.: A hybrid approach to identifying user interests in web portals. In: IICS, pp. 123–134 (2009)
14. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User Profiles for personalized information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web. LNCS, vol. 4321, pp. 54–89. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72079-9_2
15. Michlmayr, E., Cayzer, S.: Learning user profiles from tagging data and leveraging them for personal(ized) information access (2007)
16. Saleheen, S., Lai, W.: UIWGViz: An architecture of user interest-based web graph vizualization. J. Vis. Lang. Comput. 44, 39–57 (2018)
17. Magnini, B., Strapparava, C.: Improving user modelling with content-based techniques. In: Bauer, M., Gmytrasiewicz, Piotr J., Vassileva, J. (eds.) UM 2001. LNCS (LNAI), vol. 2109, pp. 74–83. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44566-8_8
18. Lehmann, S., Schwanecke, U., Dörner, R.: Interactive visualization for opportunistic exploration of large document collections. Inf. Syst. 35, 260–269 (2010)
19. Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks. In: ICWSM, vol. 8, pp. 361–362 (2009)
20. Chari, S., Tarkiainen, A., Salojärvi, H.: Alternative pathways to utilizing customer knowledge: a fuzzy-set qualitative comparative analysis. J. Bus. Res. 69, 5494–5499 (2016)
21. Rihoux, B., Ragin, C.C.: Configurational Comparative Methods: Qualitative Comparative Analysis (QCA) and Related Techniques. Sage Publications, Los Angeles (2008)
22. Skarmeas, D., Leonidou, C.N., Saridakis, C.: Examining the role of CSR skepticism using fuzzy-set qualitative comparative analysis. J. Bus. Res. 67, 1796–1805 (2014)
23. Chen, K., Zhang, Z., Long, J., Zhang, H.: Turning from TF-IDF to TF-IGM for term weighting in text classification. Expert Syst. Appl. 66, 245–260 (2016)
24. Korjani, M.M., Mendel, J.M.: Fuzzy set qualitative comparative analysis (fsQCA): challenges and applications. In: 2012 Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), pp. 1–6. IEEE (2012)
25. Lin, H.-Y., Hsu, P.-Y., Sheen, G.-J.: A fuzzy-based decision-making procedure for data warehouse system selection. Expert Syst. Appl. 32, 939–953 (2007)
© 2018 IFIP International Federation for Information Processing
Cite this paper
Kardaras, D.K., Kaperonis, S., Barbounaki, S., Petrounias, I., Bithas, K. (2018). An Approach to Modelling User Interests Using TF-IDF and Fuzzy Sets Qualitative Comparative Analysis. In: Iliadis, L., Maglogiannis, I., Plagianakos, V. (eds) Artificial Intelligence Applications and Innovations. AIAI 2018. IFIP Advances in Information and Communication Technology, vol 519. Springer, Cham. https://doi.org/10.1007/978-3-319-92007-8_51
DOI: https://doi.org/10.1007/978-3-319-92007-8_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92006-1
Online ISBN: 978-3-319-92007-8
eBook Packages: Computer Science, Computer Science (R0)