1 Introduction

Recommender systems (RC) utilise techniques ranging from statistics to AI and machine learning in order to capture user interests, build user and product/service profiles and suggest the most appropriate products or services to users. RC draw on several methods for developing user preference models, with user-generated content (UGC) representing a source of rich customer information [1, 2]. Since social media platforms allow users to exchange experiences, feedback, opinions, complaints, etc., they provide significant information for capturing and understanding user interests [3]. Web personalisation is another area where user profiling is necessary for developing customised web interfaces and supporting personalised search [4], which allows users to retrieve search results according to their personal needs.

2 User Profiling in Tourism

Building user interest models has also been the focus of e-tourism research. Drawing on the analysis of behavioural, socio-economic and demographic data, several researchers have shed light on people’s travel behaviour [3]. Indeed, surveys on travellers’ preferences have shown that the travel selection process is complex, depending, among others, on personality and mood-related factors, service quality issues, Word-Of-Mouth (WOM) and electronic WOM (eWOM). Customers often express their experience by publishing reviews. Sentiment analysis of user reviews provides the means for capturing and modelling users’ preferences, emotions and attitudes, thus refining market segmentation by grouping customers with similar needs and incentives and predicting customers’ travel behaviour more precisely [5].

Collantes and Mokhtarian [6] claim that a variety of factors, such as personality traits, travel-related behaviours, lifestyle characteristics and travel trends, determine the subjective assessment of travelling and tourism services. Other researchers have noticed that travel behaviour is influenced by travel experiences and feelings [7, 8]. It is also argued that it is important to analyse human behaviour characteristics in order to understand how customers react to alternative transport policies [9]. Other travel research studies have analysed environmental factors that influence travel and tourism. Stradling and Anable [10] argue that environmental characteristics, such as workplace, shops and site topography, affect travel choices.

Several approaches have been proposed for building user interest models. Kim and Chan [11] have proposed a hierarchical model for representing user interests. The user profile is constructed by analysing documents that users have visited on the web. The analysis of the documents yields a list of user interests, which are subsequently grouped according to their similarity in the hierarchical interests model. It is argued that four classes of information context need to be specified when attempting to understand user interests [12]. The general information class refers to personal characteristics such as the name, contact details and demographics of the user. The event class represents the user’s activities. The preference class refers to the user’s interests. The social network class explains the user’s connections and interactions with other users. The preference class is usually discovered by analysing various sources, such as relevant documents that the user has published [12, 13].

Several approaches have been proposed for representing user interests. The three most frequent formats are keywords, semantic networks and concept-based representations [14, 15]. Keywords representing domains of interest are associated with weights indicating the strength of the user’s interest in a particular topic. Polysemy and synonymy are problems associated with keywords. Semantic networks address these problems by representing keywords as nodes that are connected with each other, with connections that include co-occurrences. Concept-based representations resemble semantic networks in structure, but their nodes represent abstract topics rather than keywords [14, 15]. User profiles can be used in various ways: during personalised information retrieval, when a system detects relevant documents and information according to users’ interests; when re-evaluating the relevance of documents, taking into consideration which documents a user has retrieved; and during query processing, when a user query can be modified based on user interests [16].

It is argued that filtering and clustering techniques are very useful in reducing the number of concepts found on the web that are used in formulating user profiles. However, [16] argues that these techniques lack effectiveness, for they produce the same structure of interests for users with different needs. Research shows that while many systems produce and use user profiles, e.g. in web personalisation and recommender systems, there exists no definite procedure for deriving user interests [16,17,18,19]. This paper addresses the need for investigating alternative ways of developing user interest models and suggests analysing the term frequency-inverse document frequency (TF-IDF) with the fuzzy-set Qualitative Comparative Analysis (FsQCA).

3 Methodology

The aim of the paper is to identify the causal combinations that are necessary and sufficient to represent customer interests. This paper utilises the FsQCA in order to analyse the TF and IDF of UGC and produce the causal combinations that best lead to an outcome. The FsQCA is particularly suited to investigating intertwined relationships between multiple factors that affect a dependent variable or contribute to the realisation of a certain outcome [20]. The FsQCA analyses the set relationships among causes; variables are modelled as sets. FsQCA models allow a detailed analysis of how alternative conditions of causes combine and contribute to high membership scores of the outcome [21]. The FsQCA may detect multiple paths, i.e. alternative causal combinations that can lead to high levels of the same outcome [20, 22]. The data in this paper are collected from customer reviews published on hotel web sites. Causal combinations may be represented by tourism service terms such as room, view, cleanliness, etc., in the set of selected documents. The outcome set in this paper is represented by the large amount of money spent by the customer. Other outcome sets can also be considered. Thus, this paper aims to identify the combinations of customer interests in hotel services that best reflect customers’ spending. A sample of the collected data is analysed in this paper. The steps of the methodology are shown below:

  1. Select documents published by user \( (u_{i} ) \).

  2. Identify the terms that will constitute the causal combinations and specify the term that will represent the outcome set.

  3. Calculate the (TF) and the (IDF) for each identified term.

  4. Calculate the weight of each term \( (t_{k} ) \) using the following formula:

$$ W_{tk} = TF_{tk} *\log \left( {\frac{{N_{i} }}{{d_{tk} }}} \right)\;[23] $$
(1)

where \( W_{tk} \) represents the weight of term \( (t_{k} ) \), \( TF_{tk} \) is the term frequency of term \( (t_{k} ) \), \( N_{i} \) is the total number of documents published by user \( (u_{i} ) \) and \( d_{tk} \) represents the number of documents that contain term \( (t_{k} ) \).

  5. Apply the FsQCA and produce the user interests causal combinations.

    a. Produce the truth table of all possible permutations of the terms considered. Each permutation is a possible causal combination.

    b. Calculate the membership degree of each combination, drawing on fuzzy set operations. Assume two fuzzy sets \( \tilde{A} \) and \( \tilde{B} \); then:

The fuzzy union is defined as

$$ \mu_{(A \cup B)} = \hbox{max} (\mu_{A} ,\mu_{B} ), $$
(2)

The fuzzy intersection is defined as

$$ \mu_{(A \cap B)} = \hbox{min} (\mu_{A} ,\mu_{B} ) $$
(3)

and the fuzzy complement is calculated as

$$ \mu_{\neg A} = 1 - \mu_{A} $$
(4)

  6. Calculate the consistency and the coverage of the solutions using formulas (5) and (6) respectively.

$$ Consistency(X \prec Y) = \frac{{\sum {\hbox{min} (X,Y)} }}{\sum X }\quad [24] $$
(5)
$$ Coverage = \frac{{\sum {\hbox{min} (X,Y)} }}{\sum Y }\quad [24] $$
(6)

where \( (X) \) is the membership degree of each causal combination and \( (Y) \) is the membership degree of the outcome set.

  7. Identify the best combinations by selecting those that exhibit a consistency rate above a threshold (set at 0.8 in this paper) and the highest possible coverage. Simplify the solutions into the final set of causal combinations.

The final causal combinations indicate the hotel services that customers who spend a large amount of money consider the most important.
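The term weighting of step 4, formula (1), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not specify the base of the logarithm or how TF is counted, so the natural logarithm and a raw occurrence count over all of a user's documents are assumed here. The function name `term_weight` and the tokenised `reviews` are hypothetical and are not taken from the collected data.

```python
import math

def term_weight(term, documents):
    """W_tk = TF_tk * log(N_i / d_tk) for the documents of one user.

    documents: list of token lists (one list per review).
    Assumptions: TF is the raw count of the term over all documents,
    and log is the natural logarithm.
    """
    n_docs = len(documents)                                  # N_i
    tf = sum(doc.count(term) for doc in documents)           # TF_tk
    docs_with_term = sum(1 for doc in documents if term in doc)  # d_tk
    if docs_with_term == 0:
        return 0.0  # the term never occurs, so it carries no weight
    return tf * math.log(n_docs / docs_with_term)

# Hypothetical, already tokenised reviews of a single user
reviews = [
    ["quiet", "room", "sea", "view"],
    ["friendly", "staff", "sea", "view"],
    ["restaurant", "was", "excellent"],
]
weights = {t: term_weight(t, reviews) for t in ("sea", "restaurant", "staff")}
```

Note that under formula (1) a term that occurs in every document of a user receives a weight of zero, since the logarithm of 1 is 0.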

4 Data Analysis: Illustrative Example

This paper analyses reviews collected from five (5) hotel customers. For simplicity, five (5) terms representing hotel services are selected from the total set of terms identified in the reviews. The outcome set, large amount of money spent (LMSp) by each user during his/her hotel stay, is represented by triangular fuzzy numbers (TFN). The membership function \( f_{A} (x) \) of TFN \( \tilde{A}(a,m,b) \) can be calculated according to the following equation [25]:

$$ f_{A} \left( x \right) = \left\{ {\begin{array}{*{20}l} {\frac{x - a}{m - a},} \hfill & {a \le x \le m,\,\,m \ne a} \hfill \\ {\frac{x - b}{m - b},} \hfill & {m \le x \le b,\,\,m \ne b} \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(7)

where a, m, b are real numbers. The linguistic scales which are used and their corresponding TFNs adopted in this study are shown in Table 1.
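As an illustration, equation (7) can be written as a small Python function. The TFN (0.25, 0.5, 0.75) used below is a hypothetical example and is not one of the scales of Table 1; the function name `tfn_membership` is likewise only illustrative.

```python
def tfn_membership(x, a, m, b):
    """Membership degree of x in the triangular fuzzy number (a, m, b),
    following the three branches of equation (7)."""
    if a <= x <= m and m != a:
        return (x - a) / (m - a)   # rising edge
    if m <= x <= b and m != b:
        return (x - b) / (m - b)   # falling edge
    return 0.0                     # outside the support

# Hypothetical TFN, not taken from Table 1
a, m, b = 0.25, 0.5, 0.75
peak = tfn_membership(0.5, a, m, b)       # at the modal value m
rising = tfn_membership(0.375, a, m, b)   # halfway between a and m
falling = tfn_membership(0.625, a, m, b)  # halfway between m and b
outside = tfn_membership(1.0, a, m, b)    # beyond b
```

The membership degree reaches 1 at the modal value m and decreases linearly to 0 towards a and b.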

Table 1. Linguistic scales and corresponding TFNs for large amount of money-spent fuzzy sets

The linguistic scales indicate to what extent a customer belongs to the set of those who spend a large amount of money during their hotel stay. The TF and IDF scores (step 3) are calculated with the KNIME tool for all documents published by each user \( (u_{i} ) \). The weight of each term is then obtained from formula (1). The results are shown in Table 2.

Table 2. The term weights and the membership degree for money spent for each customer

Next, the FsQCA is applied and the truth table is developed. Since there are 5 terms to consider, the number of permutations is \( 2^{5} = 32 \). Table 3 shows part of the truth table.

Table 3. The truth table (part of) showing all possible permutations of the terms

The cells in the truth table take the value (1) or (0), representing true or false. Thus, permutation number 3 is read as (Quietness = false, Sea View = false, Staff Friendliness = false, Cultural Activities = true, Restaurant = false). Next, the membership degrees of all combinations are calculated for each user, drawing on fuzzy set operations. Table 4 shows the membership degrees for the first 17 combinations.
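The construction of the truth table can be sketched as follows. The row order is an assumption: rows are generated in binary counting order with Quietness as the leftmost condition, which agrees with the reading of permutation number 3 given above. The names `TERMS` and `truth_table` are illustrative.

```python
from itertools import product

# Order of the conditions as read in the text; the row order is assumed
TERMS = (
    "Quietness",
    "Sea View",
    "Staff Friendliness",
    "Cultural Activities",
    "Restaurant",
)

# Every row assigns 0 (false) or 1 (true) to each term: 2**5 = 32 rows
truth_table = list(product((0, 1), repeat=len(TERMS)))

# Rows are numbered from 1 in the text, so permutation number 3 has index 2
row3 = dict(zip(TERMS, truth_table[2]))
```

In this ordering `row3` marks only Cultural Activities as true, matching the description of permutation number 3.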

Table 4. Membership degrees for combinations for each customer

The membership degree of combination number 3, \( \mu_{C3} \), for customer-1 (see the framed cell in Table 4) is calculated as follows using formulas (3) and (4):

Consider combination number 3 membership degree \( \mu_{C3} \) = \( \mu \)(Quietness = false \( \cap \) Sea View = false \( \cap \) Staff Friendliness = false \( \cap \) Cultural Activities = true \( \cap \) Restaurant = false) = \( \mu \)(not (Quietness), not (Sea View), not (Staff Friendliness), Cultural Activities, not (Restaurant)).

Here \( \mu \)(Quietness = false) = 1 − \( \mu \)(Quietness) = (1 − 0.3) = 0.7. Similar calculations are performed for all terms; thus, \( \mu_{C3} \) = min(0.7; 0.5; 0.6; 0.3) = 0.3. After all membership degrees are calculated, the consistency and coverage degrees are determined. Table 5 shows the results for the first 17 combinations.
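The calculation of a combination's membership degree with formulas (3) and (4) can be sketched as below. Only the membership of Quietness (0.3) is taken from the worked example; the remaining four values are placeholders chosen for illustration and should not be read as the entries of Table 2. The function name `combination_membership` is hypothetical.

```python
def combination_membership(memberships, combination):
    """Fuzzy intersection (min) over all conditions of a truth-table row.

    A condition marked 1 contributes its membership mu, a condition
    marked 0 contributes the complement 1 - mu (formula (4)).
    """
    return min(
        mu if present else 1 - mu
        for mu, present in zip(memberships, combination)
    )

# Order: Quietness, Sea View, Staff Friendliness, Cultural Activities, Restaurant
# 0.3 for Quietness comes from the text; the other values are placeholders
customer = (0.3, 0.5, 0.4, 0.3, 0.2)
combination_3 = (0, 0, 0, 1, 0)

mu_c3 = combination_membership(customer, combination_3)
```

With these placeholder values the negated conditions contribute 0.7, 0.5, 0.6 and 0.8, Cultural Activities contributes 0.3, and the minimum is 0.3.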

Table 5. Causal combinations’ consistency and coverage

The consistency for combination number 3 is calculated by applying formula (5) as follows. Consider the outcome column (Y) shown in Table 2 and the membership degrees (X) of combination number 3 for all users, as shown in Table 4. Then,

$$ \begin{aligned} \sum {\hbox{min} (X,Y)} & = \hbox{min} \left( {0.3;0.5} \right) + \hbox{min} \left( {0.1;0.7} \right) + \hbox{min} \left( {0.5;0.1} \right) + \hbox{min} \left( {0.3;0.7} \right) + \hbox{min} \left( {0.3;0.9} \right) \\ & = 0.3 + 0.1 + 0.1 + 0.3 + 0.3 = 1.1, \\ \sum X & = 0.3 + 0.1 + 0.5 + 0.3 + 0.3 = 1.5. \\ \end{aligned} $$

Therefore, the consistency for combination number 3 is 1.1/1.5 = 0.733.

Regarding the coverage, by applying formula (6), \( \sum {\hbox{min} (X,Y) = 1.1} \) and \( \sum {Y = 2.9} \); thus coverage = 1.1/2.9 ≈ 0.38.
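Formulas (5) and (6) can be checked with a short Python sketch. The membership degrees X of combination number 3 and the outcome degrees Y are the values used in the worked example above; the function names are illustrative.

```python
def consistency(x, y):
    """Formula (5): sum of min(X, Y) divided by the sum of X."""
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(x)

def coverage(x, y):
    """Formula (6): sum of min(X, Y) divided by the sum of Y."""
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(y)

# Membership degrees of combination number 3 for the five customers (X)
x = [0.3, 0.1, 0.5, 0.3, 0.3]
# Membership degrees of the outcome set LMSp for the same customers (Y)
y = [0.5, 0.7, 0.1, 0.7, 0.9]

cons_c3 = consistency(x, y)
cov_c3 = coverage(x, y)
```

Both measures share the same numerator and differ only in the denominator, which is why a combination can be highly consistent while covering only a small part of the outcome.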

According to the FsQCA, the best causal combinations should exhibit consistency and coverage that are as high as possible. However, in general, the higher the consistency, the lower the coverage. Applying a threshold value of 0.8 to the consistency first and then seeking the highest possible coverage, the analysis results in two causal combinations: combinations number 12 and 16, extracted from Table 3, are shown in Table 6.

Table 6. The two necessary and sufficient causal combinations

A closer look at the combinations reveals that “quietness” is not among the customers’ interests at all; it is not a necessary service. Thus, after restructuring the causal combinations, the analysis shows that customers who spend a large amount of money show interest in

  • (Sea View) AND (Staff friendliness) AND (Cultural activities) AND (Restaurant) OR

  • (Sea View) AND (Cultural activities) AND (Restaurant).

In order to simplify the causal combinations, “staff friendliness” could be omitted, for it does not appear in both combinations.
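The simplification can be illustrated with a single Boolean merging step: two rows that differ in exactly one condition are combined, and that condition drops out. This sketch is not the full minimisation procedure of the FsQCA. The row encodings of combinations 12 and 16 are assumed from a binary counting order of the truth table, which agrees with the two expressions listed above; the function name `merge_rows` is hypothetical.

```python
def merge_rows(row_a, row_b):
    """Merge two truth-table rows that differ in exactly one condition.

    Returns the merged row with None in the position that drops out,
    or None when the rows cannot be merged in a single step.
    """
    differing = [i for i, (a, b) in enumerate(zip(row_a, row_b)) if a != b]
    if len(differing) != 1:
        return None
    merged = list(row_a)
    merged[differing[0]] = None   # this condition is no longer needed
    return tuple(merged)

# Order: Quietness, Sea View, Staff Friendliness, Cultural Activities, Restaurant
combination_12 = (0, 1, 0, 1, 1)   # assumed encoding of row 12
combination_16 = (0, 1, 1, 1, 1)   # assumed encoding of row 16

simplified = merge_rows(combination_12, combination_16)
```

Here the merged row keeps Sea View, Cultural Activities and Restaurant as present, while the position of Staff Friendliness becomes `None`, i.e. the condition is omitted.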

5 Conclusions-Future Research

This study suggests that the FsQCA can be used for modelling users’ interests. Data selected from customer reviews are analysed by utilising the TF and the IDF. The application of the FsQCA results in useful insights that can be used to understand customer priorities and build customer profiles. Future research can focus on examining the applicability of the FsQCA to handling multiple outcome sets and to specifying the priorities of terms. When the FsQCA method is applied to large data sets with a long list of factors, the truth table and the set of possible causal combinations can become cumbersome to analyse. Thus, future research can also focus on combining the FsQCA with other techniques that prune the truth table and reduce the causal combinations to a manageable size.