Information Processing and Management 57 (2020) 102318

HCA: Hierarchical Compare Aggregate model for question retrieval in community question answering

Mohammad Sadegh Zahedi, Maseud Rahgozar, Reza Aghaeizadeh Zoroofi
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran

Keywords: Community question answering; Question retrieval; Hierarchical compare-aggregate model; Transfer learning; Deep learning

Abstract: We address the problem of finding similar historical questions that are semantically equivalent or relevant to an input query question in community question-answering (CQA) sites. One of the main challenges for this task is that questions are usually long and often contain peripheral information in addition to the main goal of the question. To address this problem, we propose an end-to-end Hierarchical Compare Aggregate (HCA) model that can handle this problem without using any task-specific features. We first split questions into sentences and compare every sentence pair of the two questions using a proposed Word-Level Compare-Aggregate model called the WLCA-model, and then the comparison results are aggregated with a proposed Sentence-Level Compare-Aggregate model to make the final decision. To handle the insufficient training data problem, we propose a sequential transfer learning approach to pre-train the WLCA-model on a large paraphrase detection dataset. Our experiments on two editions of the SemEval benchmark datasets and the domain-specific AskUbuntu dataset show that our model outperforms the state-of-the-art models.

1. Introduction

With the development of Web 2.0 and the emergence of new technologies, the popularity of user-generated content platforms such as internet forums, blogs, social networking sites, and wikis has increased dramatically. Community question answering (CQA) sites such as Yahoo! Answers, WikiAnswers, Quora, and Stack Overflow are among the most popular user-generated content platforms; they allow users to ask questions or share their knowledge by answering other users' questions (Srba & Bielikova, 2016). One fundamental task for leveraging the knowledge in CQA sites is finding semantically equivalent or relevant questions and then re-ranking these candidate questions according to their relevance to the queried question. This is what we call question retrieval, or re-ranking (QR), in this article; it has attracted a lot of attention in recent years (Z. Chen et al., 2018; P. Wang et al., 2018; M. Zhang & Wu, 2018; Zhou, Zhou, et al., 2016; Zhou et al., 2016; Zhou & Huang, 2017). QR can improve the success and effectiveness of CQA sites by (a) improving access to the CQA archived data by reducing redundancy in the data, (b) saving the time of answerers by enabling them to answer new questions instead of repeated ones, and (c) minimizing the time lag for askers by retrieving answers of relevant questions instead of making them wait for the community to answer, thereby preventing the question starvation problem (Shtok et al., 2012). QR in CQA is a quite challenging task, and several factors make it difficult. One of the main challenges for QR is the question length. Unlike other social media such as Twitter, users in CQA sites are free to submit their questions or answers with no restrictions on length or structure.
For this reason, many of the existing QR methods focus only on the title of the question and cannot handle multi-sentence questions with extended contexts. Training data sparsity is another important QR challenge in CQA. The process of labeling data is difficult, time-consuming, expensive, and usually requires domain expertise. Especially for CQA, the available training examples are often insufficient, which makes it difficult to use well-performing supervised models such as deep neural networks. However, relatively little attention has been given to QR on long questions; most of the work has been reported on the SemEval 2016/2017 benchmark datasets (Nakov et al., 2016, 2017). Generally, most participating teams used learning-to-rank methods that perform a supervised fusion of different similarity features based on lexical, syntactic, and semantic representations of questions. Although most participants utilized SVMs as learning models, some work adopted a deep learning-based approach (Baldwin et al., 2016; Mohtarami et al., 2016; Qi et al., 2017). Indeed, the results of the challenges have shown that the neural network approaches did not win the QR subtask in the SemEval-2016/2017 competitions (Nakov et al., 2016, 2017). Various methods have been proposed to solve the question length problem (Barrón-Cedeño, Martino, et al., 2016; Romeo et al., 2016, 2017). However, they still suffer from the need for rich features and external resources. In recent work, Zhang and Wu (M. Zhang & Wu, 2018) proposed a novel unsupervised framework based on attention autoencoders, named the reduced attentive matching network (RAMN). However, the performance of this model relies on the initial rank produced by the search engine. In summary, the limitations of these works are: (a) they usually rely on significant feature engineering and require language-dependent tools (such as syntactic parsers), semantic resources (such as WordNet), and knowledge graphs that may be challenging to obtain, especially for low-resource languages; (b) it is also difficult to adapt these approaches to a new domain, since separate feature extraction is required for the new domain.

1.1. The purpose of the study

Sentence pair modeling (SPM) is an essential task for natural language understanding that aims to understand the relationship between two given sentences (Lan & Xu, 2018). For example, in a paraphrase identification task, SPM is used to determine whether two sentences are paraphrases, and in a natural language inference task, SPM is used to determine whether a hypothesis sentence can be inferred from a premise sentence. There are two main types of general neural network architectures for these tasks: Siamese and Compare-Aggregate. Generally, Compare-Aggregate architectures tend to perform better than Siamese architectures because they can capture more interactive features between the two sentences (Choi & Lee, 2019). QR in CQA is also an SPM task, but QR must model the relationship between two long questions with noisy extended contexts.
Although several compare-aggregate models have been proposed for different SPM tasks such as natural language inference (Ghaeini et al., 2018), paraphrase identification (Gong et al., 2018), semantic textual similarity (Lopez-Gazpio et al., 2019), and answer sentence selection (Tan et al., 2018), these models are not directly applicable to the QR task because they are designed to model the relationship between two short sentences. As we discuss later in this paper, the state-of-the-art Compare-Aggregate models (S. Wang & Jiang, 2017; Z. Wang et al., 2017) do not perform well on our QR task. We believe that the reason for this could be the question length and the training data sparsity problems. Both problems can be inferred from the statistics in Table 1: the QR task has far fewer training pairs than the other NLSM tasks, while its questions are much longer. In this context, the main purpose of this paper is to propose a new end-to-end Compare-Aggregate model for QR in CQA that can handle these two problems.

Table 1
Comparing the number of training pairs and mean sequence length of the SPM datasets with the QR dataset.

NLSM task                  | Dataset | # of training pairs | Avg sequence length (words)
Natural language inference | SNLI    | 550k                | 11.2
Paraphrase detection       | Quora   | 390k                | 12.68
Answer selection           | TrecQA  | 53k                 | 17.83
Question retrieval         | SemEval | 2670                | 52.06

1.2. Contributions

To handle the question-length problem, we propose an end-to-end Hierarchical Compare Aggregate model called the HCA-model, which consists of two compare-aggregate models: a Sentence-Level Compare-Aggregate model named the SLCA-model that operates at the coarse-grained sentence level, and a Word-Level Compare-Aggregate model called the WLCA-model that operates at the fine-grained word level. Given an input question Q^in and a candidate question Q^c_e, the SLCA-model compares the two encoded questions Q^in and Q^c_e at the sentence level in two directions, Q^in → Q^c_e and Q^in ← Q^c_e. In each direction, we compare each sentence of one question against all sentences of the other question using the proposed WLCA-model (it has a similar architecture to the SLCA-model but operates at the fine-grained word level). After that, the sentence comparison results are aggregated into a vector to make the final decision. The existing Compare-Aggregate models (S. Wang & Jiang, 2017; Z. Wang et al., 2017) operate only at the word level. The main difference from the existing models is that the HCA-model has a hierarchical structure that is applied both at the word and the sentence level. This hierarchical structure allows the HCA-model to handle long questions with multiple noisy sentences. To tackle the insufficient training data problem, we propose using sequential transfer learning (STL). Recently, transfer learning methods and architectures have been a major area of interest within different tasks of Natural Language Processing (NLP), such as natural language inference (Choi & Lee, 2019), argument reasoning comprehension (Choi & Lee, 2018), sequence tagging (Yang et al., 2017), and question answering (Chung et al., 2018; Min et al., 2017). In this study, we show that our proposed STL approach is an effective solution to tackle the limited training data problem for the QR task in CQA.
Our STL approach consists of two steps: pretraining step in which we first pre-train the WLCA-model on a large paraphrase identification dataset, and adaptation step, in which we try to transfer the paraphrase knowledge gained in solving the paraphrase identification task to our QR task. We evaluate our model on three public datasets, two editions of the public-domain Semeval benchmark datasets, Semeval-2016 (Nakov et al., 2016), and Semeval-2017 (Nakov et al., 2017) and the domain-specific AskUbuntu dataset (Lei et al., 2016). Evaluation results show that the HCA model outperforms the best models on the three benchmark datasets. The main contributions of this work can be summarized as follows: • To solve the question-length problem, we propose a novel HCA-model that consists of two compare-aggregate models. The SLCA• • model aggregates the coarse-grained sentence-level semantic similarity features that are captured by the WLCA-model to calculate the final relevance score between the two questions. To tackle the limited training data problem, we propose the sequential transfer learning approach for the QR task in CQA sites. The empirical experiment confirms the effectiveness of our STL approach. Extensive experiments have been conducted to evaluate the performance of our model. Evaluation results show that the HCAmodel achieves the best results in the both public-domain SemEval datasets and the domain-specific AskUbuntu dataset, thus making it more robust than other existing models. The remainder of this paper is organized as follows: In Section 2, we briefly review the related works on the QR task. Section 3 presents our approach in detail. In Section 4, we present the experimental setup. In Section 5, we evaluate our approach and discuss the results. Finally, the conclusion is given in Section 6. 2. Related Works CQA has been a significant area of interest within the different tasks of information science community such as expert finding (Neshati et al., 2017; Nobari et al., 2020), question recommendation (C. Fu, 2019b), user modeling (C. Fu, 2019a), answer quality assessment (H. Fu & Oh, 2019) and answer retrieval(Shao et al., 2019). One of the fundamentally important tasks in CQA services is QR. Currently, there are two main research lines to QR in CQA. The first line of research focuses only on the question title and cannot handle multi-sentence questions with extended context. The other line of research started after 2016, when SemEval organizers proposed Semeval-2016 Task3 Subtask B, in which the question body is also considered and is, therefore, closer to a real application. This paper follows the second line of research. In this section, we briefly review these two lines of research. Short-question retrieval methods: these methods can be classified into four groups. The first group is translation-model-based approaches, which leverage question-answer or question-question pairs in parallel corpora in the same language to learn translation model to match semantically similar questions such as a translation-based language model (Xue et al., 2008), compact translation model (Lee et al., 2008), entity-based translation language model (Singh, 2012), and phrase-based translation model (Zhou et al., 2011). Some works use statistical machine translation to enrich the question representation (W.-N. Zhang et al., 2016; Zhou, Xie, et al., 2016). The second group considers questions metadata such as question categories to improve QR performance, such as (Cao et al., 2012; Zhou & Huang, 2017). 
The third group utilizes topic modeling techniques to bridge the lexical gap between semantically similar questions (L. Chen et al., 2016; Ji et al., 2012; K. Zhang et al., 2014). The fourth group applies neural network-based approaches. Qiu et al. (Qiu & Huang, 2015) proposed a convolutional neural tensor network (CNTN) to model the questions and answers in CQA. Dos Santos et al. (dos et al., 2015) introduced a hybrid neural network model that combined a weighted bag-ofwords (WBOW) representation with a convolutional neural network (CNN). To solve the insufficient training data problem, some works train a model on question-answer pairs instead of question-question pairs (Das, Shrivastava, et al., 2016; Das, Yenala, et al., 2016; P. Wang et al., 2018). Das et al. (Das, Yenala, et al., 2016) proposed a Siamese Convolutional Neural Network for QR in CQA that employs twin convolutional neural networks with shared parameters to learn the semantic similarity. Das et al. (Das, Shrivastava, et al., 2016) proposed a two-step model called “Deep Structured Topic Model (DSTM)” for QR. They first employed the LDA topic model to initially retrieve similar questions and then re-rank them using a deep layered semantic model. Wang et al. (P. Wang et al., 2018) proposed a value-based convolutional attention method that can efficiently learn the key parts of questions and answers by leveraging mutual information. To resolve data sparsity, they utilized the multi-view learning method to train the attention-based convolutional semantic model on question-answer pairs. Long-question retrieval methods: SemEval organizes two subtasks for QR in CQA, Semeval-2016 Task3 Subtask B and Semeval2017 Task3 Subtask B. Several teams participated in these tasks. Most of the participating teams applied learning to rank methods that are supervised fusion of different similarity features based on word embeddings (Franco-Salvador et al., 2016; Galbraith et al., 2017; G. Wu & Lan, 2016), simple match counts on words or n-grams (Mohtarami et al., 2016; G. Wu & Lan, 2016), parse trees (Barrón-Cedeño, Da San Martino, et al., 2016; Filice et al., 2016), translation models (G. Wu & Lan, 2016; Y. Wu & Zhang, 2016), topic models (Goyal, 2017; G. Wu & Lan, 2016), external resources such as knowledge graphs and WordNet (Franco-Salvador et al., 2016; Goyal, 2017) and the neural matching features (Baldwin et al., 2016; Goyal, 2017; Mohtarami et al., 2016; Qi et al., 2017). Some work adopts a deep learning-based approach (Baldwin et al., 2016; Mohtarami et al., 2016; Qi et al., 2017), but these methods did not win the QR subtask in SemEval-2016/2017 competitions(Nakov et al., 2016, 2017). A more detailed description of the task and the participating systems can be found in (Nakov et al., 2016, 2017). Further, to handle the question length problem, several methods have been proposed (Barrón-Cedeño, Martino, et al., 2016; Romeo et al., 2016, 2017) to select the most relevant text chunks 3 Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. in questions to build them better representation. Barron-Cedeno et al. (Barrón-Cedeño, Martino, et al., 2016) suggested supervised and unsupervised text selection models that operate at both sentence and chunk level (using constituency parse trees) and apply them on top of a state-of-the-art tree-kernel-based classification model. Romeo et al. (Romeo et al., 2017) expressed both the sentence selection and question ranking steps as a multiple-instance learning instantiation. 
Similarly, Romeo et al. (Romeo et al., 2016) used the attention weights learned by an LSTM network for both selecting the entire sentences and their subparts, i.e., word/chunk, from shallow syntactic trees. In the most recent work, Zhang and Wu (M. Zhang & Wu, 2018) proposed a novel unsupervised framework, namely reduced attentive matching network (RAMN) for QR in CQA. Under this framework, semantic representations of questions are firstly generated using attention autoencoders, and then surface matching between two questions is calculated using lexical mismatching information. The final matching score is computed based on question representations, lexical mismatching information, and the initial rank produced by a search engine. However, the performance of this model relies on the initial rank produced by a search engine. Joty et al. (Joty et al., 2018) proposed a framework for multitask learning of three CQA subtasks: Question-Comment Similarity, Question-Question Similarity and Question-External Comment Similarity. They proposed a two-step framework based on deep neural networks and structured conditional models, with a feed forward neural network to learn task-specific embeddings, which are then used in a pairwise CRF as part of a multitask model for all three subtasks. Compare-Aggregate architecture: The compare-aggregate model, also called matching-aggregation model,(Tan et al., 2018; Z. Wang et al., 2017), Joint feature models(Gong et al., 2018), sentence pair interaction models(Lan & Xu, 2018), soft alignment model (Choi & Lee, 2019) is one of the most successful architecture for SPM tasks. The compare-aggregate model was first proposed by (Parikh et al., 2016) for the NLI task. Various compare-aggregate models have been proposed for different SPM tasks. All these models can be divided into the following six layers: input representation, context representation (projection), attention (alignment), comparison (interaction), aggregation, and prediction. The input representation layer that aims to construct a vector representation for each word of two input sentences. The context representation layer incorporates word context and sequence order into modeling for better vector representation. In the attention layer, first, an attention matrix is computed that each element of it indicates the similarity between a word pair of two input sentences and then soft-aligned the words of two input sentences using some soft attention techniques. The comparison layer compares each aligned sub-phrase and produces a set of comparison vectors. The aggregation layer aggregate the set of comparison vectors from the previous layer. Finally, the prediction layer is a task-specific classification layer that makes the final decision. Diverse deep learning techniques have been used for each of these layers, which are summarized in Table 2. Our work also lies in this line of research but differs from previous approaches in that the HCA-model has a hierarchical structure that is applied both at the word and sentence level. 3. Hierarchical Compare Aggregate Model Given an input question Qin moreover, multiple candidate questions {Q1c , Q2c ,…, Qec ,…, Qnc } retrieved by a search engine or traditional information retrieval models such as BM25 or LM, where n is the number of candidate questions and each question pair (Qin, Qec ) is associated with a label y, this paper aims to re-rank candidate questions concerning their relevance to Qin. 
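At inference time, the re-ranking setup described above amounts to scoring every candidate against the query and sorting by that score. The following minimal Python sketch illustrates this; `hca_score` is a hypothetical placeholder for the trained scoring model developed in the rest of this section, and the toy word-overlap scorer exists only to make the example runnable.

```python
def rerank(q_in, candidates, hca_score):
    """Re-rank candidate questions by their relevance to the input question.

    q_in       -- the new input question (text)
    candidates -- candidate questions retrieved by a search engine or BM25/LM
    hca_score  -- callable returning a relevance score for a (q_in, candidate) pair;
                  here it stands in for the trained model (hypothetical name)
    """
    scored = [(cand, hca_score(q_in, cand)) for cand in candidates]
    # Higher score = more relevant; Python's sort is stable, so ties keep the initial order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    def toy_score(q1, q2):
        # Jaccard word overlap, used only to make the sketch self-contained.
        a, b = set(q1.lower().split()), set(q2.lower().split())
        return len(a & b) / max(len(a | b), 1)

    query = "How can I upgrade Ubuntu without losing my files?"
    cands = ["Upgrade Ubuntu while keeping my personal files",
             "Best pizza places near the office"]
    for cand, score in rerank(query, cands, toy_score):
        print(f"{score:.2f}  {cand}")
```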
In particular, the main goal of this paper is to develop an end-to-end deep neural model, which independent of the length of the question, can compute the similarity score for each question pair (Qin, Qec ) that can be used to rank all candidate questions for a new input question Qin. We propose our HCA-model for QR in real CQA sites which consists of two compare-aggregate models, the SLCA-model and the WLCA-model that whole architecture depicted in Figs. 1 and 2 respectively. 3.1. The SLCA-model The SLCA-model can be divided into the following four layers: 3.1.1. Sentences Comparison Layer In this layer, we first split each of the questions in the pair (Qin, Qec ) into their constituent sentences S in: {S1in, …, S inin } and S c : {S1c, …, S|cQ c |} . Afterward, we compare the two encoded questions Qin and Qec in two directions Qin e Qec and Qin |Q | Qec . In the Qec , we compare each sentence Skin of Qin against all sentences of Qec using our proposed Attentive-Comparing(AC) direction Qin method described in subsection 3.1.1.1, and obtain a comparison vector CVkin that captures semantic similarity features between the sentence Skin and all sentences of the candidate question Qec . Similarly, in the direction Qin Qec , we compare each sentence Slc of Qec against all sentences of Qin and obtain a comparison vector CVlc . The outputs of this layer are two sequences of comparison vectors where each comparison vector captures semantic similarity features between one sentence of the question against all sentences of the other question. In other words, this layer performs as a feature extraction component. 3.1.1.1. Attentive-Comparing Method. In this section, we describe the AC method in the direction Qin Qec . As shown in Fig. 3, given the sentence Skin of input-question Qin and all sentences of the candidate-question Qec , S c : {S1c, …, S|cQ c |} , the goal of the AC method is e obtained a comparison vector CVkin . We first compare the sentence Skin against all sentences of the Qec individually. For each sentence pair (Skin, Slc ), we obtain a comparison vector Vkl that captures semantic similarity features between the sentence pair (Skin, Slc ) and a semantic similarity score Simkl between the sentence pair. To calculate the comparison vector Vkl and similarity score Simkl, we proposed WLCA-model that is inspired by previous compare-aggregate models for NLSM tasks such as (S. Wang & Jiang, 2017; Z. Wang et al., 2017). In sub-section 3.2, we describe the WLCA-model in more detail. Finally, the comparison vector CVkin is obtained by 4 M.S. Zahedi, et al. Table 2 Summary of Compare-Aggregate models for sentence pair modeling tasks. Where ai is i-th output of context representation layer for sentence 1, bj is j-th output of context representation layer for sentence 2, ⊙ is element-wise multiplication, F(.) is a one-layer feed-forward neural network, hi is a soft-alignment vector for i-th output of context representation layer, In the attention layer we show only the function that used to calculate each element of the attention matrix. NLI, QA, STS, PI, MC are referred to natural language inference, question answering, semantic textual similarity, paraphrase identification, and machine comprehension tasks respectively. 
Compare-Aggregate Layers Task DAM(Parikh et al., 2016) PWIM (He & Lin, 2016) Input Representation Context Representation Attention T Comparison Aggregation 5 GloVe GloVe self-attention Bi-LSTMs F(ai) F(bj) - F(ai; hi) Cosine +Euclidean+dot-product Summation similarity focus layer GloVe input gates of LSTM/GRU F(ai)Tbj F ([ai CNN GloVe+character embedding Bi-LSTM Cosine(ai, bj) multi-perspective matching Bi-LSTM ESIM (Q. Chen et al., 2017) NLI, QA, PI NLI Bi-LSTM+Pooling BIDAF (Seo et al., 2017) DIIN (Gong et al., 2018) MC NLI, PI MwAN (Tan et al., 2018) NLI-QAPI NLI Com-Agg (S. Wang & Jiang, 2017) BiMPM (Z. Wang et al., 2017) DR-BiLSTM (Ghaeini et al., 2018) DAM N-gram(Lopez-Gazpio et al., 2019) WLCA (Our model) NLI, PI, STS QR T hi ; ai hi]) GloVe Bi-LSTM (ai) bj [hi ; ai ; hi GloVe+Character embedding GloVe+character embedding+one-hot (POS) feature+ exact match (EM) feature Glove Bi-LSTMs self-attention+highway F(ai; bj;⊙bj) - [ai; hi; ai⊙hi; ai⊙hj] ai⊙bj two layers Bi-LSTMs DenseNet bi-directional GRU Multiway Matching bi-directional GRU Glove dependent Bi-LSTM Multiway Attention (ai)Tbj F([hi; ai; ai⊙hi; ai GloVe self-attention n-gram attention F(ai; hi) GloVe+character embedding input gates of LSTM/GRU (ai)Tbj [F(hi⊙bj); F(hi bj ; hi bj ; ] hi]) bj )] dependent Bi-LSTM+Pooling Summation Bi-LSTM Information Processing and Management 57 (2020) 102318 NLI QA-STSPI QA, NLI Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. Fig. 1. Sentence-Level-Compare-Aggregate-model (SLCA-model). calculating the weighted sum of the comparison vectors Vkl as follows: Qec CVkin = Vkl . Wkl l=1 Where Wkl is attention-weights between the sentence pair (Skin, Slc ) that calculated as follows: Wkl = exp(Simkl ) Qec z=1 exp(Simkz ) 3.1.2. Sentences Filtering Layer After the sentences comparison layer, we have the set of comparison vectors that capture semantic similarity features between the sentences of two questions. The main problem with these features is that they are extracted independently of the importance of each sentence. The two questions can contain noisy sentences that do not provide meaningful information and express the details that are specific to the user. These noisy sentences are unnecessary to the QR task and can even be destructive. For example, the two 6 Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. Fig. 2. The Word-Level-Compare-Aggregate model (WLCA-model). semantically dissimilar questions can contain the most similar thanks-sentences pair. So, we need another layer to filter out noisy sentences. In other words, this layer performs as a feature selection component whose task is to learn the relative importance of each sentence's features to select a subset of highly informative features. In this layer, each the comparison vector CVk (or CVl) is passed through a highway encoder layer (Srivastava et al., 2015) to learn a new projected comparison vector PCVk (or PCVl). Highway networks are gated nonlinear transform layers which control information flow to subsequent layers. An intuition behind utilizing the highway encoder in this layer is that the gates in the highway encoder adapt to learn the relative importance of each comparison vector CVk (or CVl) to the QR task. Let H(.) and T(.) be single-layered affine transforms with ReLU and sigmoid activation functions and parameterized by WH and WT respectively. 
A single highway network layer is defined as follows:

PCV_k = H(CV_k, W_H) \cdot T(CV_k, W_T) + CV_k \cdot C

where T is the transform gate and C is the carry gate. T and C express how much of the output is produced by transforming the input and by carrying it, respectively. For simplicity, we set C = 1 − T. Thus, depending on the output of the transform gate T, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through. Therefore, by learning W_T and b_T, the network can adaptively pass H(CV_k, W_H) or just pass CV_k to the next layer:

PCV_k = CV_k,            if T(CV_k, W_T) = 0
PCV_k = H(CV_k, W_H),    if T(CV_k, W_T) = 1

Fig. 3. Attentive-Comparing Method.

3.1.3. Aggregation Layer

To aggregate the two sequences of comparison vectors obtained in the previous layer, we utilize an LSTM model that can process the sequence of vectors to learn the long-term dependencies and the positional relations of sentences in the questions. We apply it to the two sequences of vectors individually as follows:

h_k = \mathrm{LSTM}(h_{k-1}, PCV_k), \quad k = 1, 2, \ldots, |Q^{in}|
h_l = \mathrm{LSTM}(h_{l-1}, PCV_l), \quad l = 1, 2, \ldots, |Q^{c}_{e}|

Next, to learn a fixed-dimensional vector, we use two different aggregation methods: a pooling layer and a title-aware attention layer.

3.1.3.2. Pooling Layer. We apply a MeanMax pooling operator across all hidden outputs of the LSTM encoder, which concatenates the results of mean pooling and max pooling:

PV^{in} = \mathrm{MeanMax}(h_1, h_2, \ldots, h_{|Q^{in}|})
PV^{c} = \mathrm{MeanMax}(h_1, h_2, \ldots, h_{|Q^{c}_{e}|})

Finally, to construct a final comparison vector FCV for the question pair (Q^in, Q^c_e), we concatenate the two fixed-length pooling vectors PV^in and PV^c:

FCV = [PV^{in} ; PV^{c}]

3.1.3.3. Title-aware Attention Layer. Usually, the question title denotes the main topic of the question. So, to reduce the impact of unimportant and noisy sentences in the body of the question, the semantic similarity between the question title and each sentence of the question body is used as the attention weight to calculate the weighted sum of the hidden outputs of the LSTM encoder. We first calculate the semantic similarity score between the question title and each sentence of the question body using the WLCA-model. Then, the aggregated comparison vector PV^in is obtained by calculating the weighted sum of the hidden outputs of the LSTM encoder as follows:

PV^{in} = \sum_{i=1}^{|Q^{in}|} W_i \, h_i

where h_i is the i-th hidden output of the LSTM encoder and W_i is the attention weight between the first sentence S^in_1 of the input question (its title) and the i-th sentence of the input question, calculated as:

W_i = \frac{\exp(\mathrm{Sim}_{1i})}{\sum_{z=1}^{|Q^{in}|} \exp(\mathrm{Sim}_{1z})}

where Sim_1i is the semantic similarity score between the title S^in_1 and the i-th sentence of the input question, as calculated by the WLCA-model in the sentences comparison layer. Similarly, PV^c is calculated to aggregate all hidden outputs of the LSTM encoder for the candidate question. Finally, to construct a final comparison vector FCV for the question pair (Q^in, Q^c_e), we concatenate the two fixed-length vectors PV^in and PV^c:

FCV = [PV^{in} ; PV^{c}]
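As an illustration of the two aggregation options just described, the following NumPy sketch computes MeanMax pooling and the title-aware attention-weighted sum. It assumes the LSTM hidden outputs h_i and the WLCA title-sentence similarity scores Sim_1i are already available; it is a schematic example rather than the authors' implementation.

```python
import numpy as np

def meanmax_pool(hidden_states):
    """MeanMax pooling: concatenate the mean-pooled and max-pooled LSTM outputs."""
    return np.concatenate([hidden_states.mean(axis=0), hidden_states.max(axis=0)])

def title_aware_aggregate(hidden_states, title_sims):
    """Weighted sum of LSTM hidden outputs, weighted by title-sentence similarity.

    hidden_states -- shape (num_sentences, hidden_dim): h_1 .. h_|Q|
    title_sims    -- shape (num_sentences,): Sim_{1i} scores produced by the WLCA-model
    Returns PV, the aggregated vector of shape (hidden_dim,).
    """
    shifted = title_sims - title_sims.max()              # shift for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()    # softmax -> attention weights W_i
    return weights @ hidden_states                       # PV = sum_i W_i * h_i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h_in, h_c = rng.normal(size=(5, 8)), rng.normal(size=(4, 8))  # 5 vs. 4 sentences
    sims_in, sims_c = rng.uniform(size=5), rng.uniform(size=4)    # title-sentence scores
    fcv = np.concatenate([title_aware_aggregate(h_in, sims_in),
                          title_aware_aggregate(h_c, sims_c)])    # FCV = [PV_in ; PV_c]
    print(fcv.shape, meanmax_pool(h_in).shape)                    # (16,) (16,)
```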
3.1.4. Ranking Layer and Optimization

Given the fixed-dimensional vector FCV from the previous layer, we pass it into a multilayer perceptron (MLP) network to compute a final relevance score Score(Q^in, Q^c_e) between the input question Q^in and the candidate question Q^c_e. The final relevance score is then used to rank the candidate questions. Since QR is a ranking task, we adopt a listwise learning-to-rank method to train the HCA-model. For each input question Q^in and its list of candidate questions {Q^c_1, Q^c_2, ..., Q^c_e, ..., Q^c_n} with relevance label set Y = {y_1, y_2, ..., y_e, ..., y_n}, we calculate a normalized relevance score vector S and a normalized relevance label vector Y_normal as follows:

\mathrm{Score}(Q^{in}, Q^{c}_{e}) = \mathrm{MLP}(FCV)
S = \mathrm{softmax}([\mathrm{Score}(Q^{in}, Q^{c}_{1}), \ldots, \mathrm{Score}(Q^{in}, Q^{c}_{e}), \ldots, \mathrm{Score}(Q^{in}, Q^{c}_{n})])
Y_{normal} = \frac{Y}{\sum_{i=1}^{n} y_i}

Finally, to train the HCA-model, we formulate the objective as minimizing the KL divergence loss (Bian et al., 2017) between S and Y_normal:

L = \frac{1}{N} \sum_{1}^{N} \mathrm{KL}(S \,\|\, Y_{normal})

3.2. The WLCA-model

In this section, we describe the details of the WLCA-model. The WLCA-model is inspired by previous compare-aggregate models (S. Wang & Jiang, 2017; Z. Wang et al., 2017). Fig. 2 shows a high-level view of the architecture of the WLCA-model. Given the k-th sentence S^in_k from the input question Q^in and the l-th sentence S^c_l from the candidate question Q^c_e, the model can be divided into the following six layers:

3.2.1. Input Representation Layer

The purpose of this layer is to learn a d-dimensional representation for each word. We construct the word representation by concatenating a pre-trained word embedding and a character-level embedding of the word. The character-level embedding of each word is learned by feeding randomly initialized character embeddings into a Long Short-Term Memory network (LSTM) (Hochreiter & Schmidhuber, 1997) and is optimized with the other parameters of the network. The character embeddings supply extra information for out-of-vocabulary and misspelled words, which are common problems in CQA questions. The outputs of this layer are two sequences of word vectors S^in_k = {w^in_1, w^in_2, ..., w^in_{|S^in_k|}} and S^c_l = {w^c_1, w^c_2, ..., w^c_{|S^c_l|}}, in which every word vector w^in_i (or w^c_j) ∈ R^d, with d = d_word + d_char, is composed of a word embedding of dimension d_word and a character-level embedding of dimension d_char.

3.2.2. Projection Layer

The main goal of this layer is to filter out words that are unimportant for the QR task, for example, stop words and words that do not help to predict the semantic similarity of two sentences. Each word representation w^in_i (or w^c_j) is passed through a gated projection layer to learn a new projected m-dimensional representation w̄^in_i (or w̄^c_j). In this paper, we use a modified version of the LSTM/GRU in which we keep only the input gates for remembering important words. The gated projection layer is defined as:

\bar{w}^{in}_{i} = \sigma(W^{g} w^{in}_{i} + b^{g}) \odot \tanh(W^{t} w^{in}_{i} + b^{t}), \quad i = 1, 2, \ldots, |S^{in}_{k}|
\bar{w}^{c}_{j} = \sigma(W^{g} w^{c}_{j} + b^{g}) \odot \tanh(W^{t} w^{c}_{j} + b^{t}), \quad j = 1, 2, \ldots, |S^{c}_{l}|

where ⊙ is element-wise multiplication, σ is the sigmoid function, and W^g, W^t ∈ R^{m×d} and b^g, b^t ∈ R^m are parameters to be learned.
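A small NumPy sketch of this gated projection, using randomly initialized parameters purely for illustration (the sigmoid gate follows the formula above as reconstructed; this is not the trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_projection(words, W_g, b_g, W_t, b_t):
    """Project word vectors while gating down unimportant words.

    words      -- shape (sentence_len, d): input word representations w_i
    W_g, W_t   -- parameter matrices of shape (m, d)
    b_g, b_t   -- bias vectors of shape (m,)
    Returns the projected representations w_bar of shape (sentence_len, m).
    """
    gate = sigmoid(words @ W_g.T + b_g)      # input gate: how much of each word to keep
    content = np.tanh(words @ W_t.T + b_t)   # transformed word content
    return gate * content                    # element-wise product, as in the formula

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, m, sent_len = 6, 4, 3
    words = rng.normal(size=(sent_len, d))
    W_g, W_t = rng.normal(size=(m, d)), rng.normal(size=(m, d))
    b_g, b_t = np.zeros(m), np.zeros(m)
    print(gated_projection(words, W_g, b_g, W_t, b_t).shape)  # (3, 4)
```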
3.2.3. Alignment Layer

For each projected word vector w̄^in_i (or w̄^c_j) in one sentence, we hope to find a soft-alignment vector a^in_i (or a^c_j) in the other sentence. To calculate the alignment vector a^in_i (or a^c_j), we apply a soft attention mechanism (Bahdanau et al., 2014) that computes a weighted combination of all projected word vectors w̄^c_j (or w̄^in_i) in the other sentence. We first compute the word-level similarity matrix E, where each element E_ij indicates the similarity between the projected word vectors w̄^in_i and w̄^c_j. To compute the matrix E, we use the dot product between w̄^in_i and w̄^c_j:

E_{ij} = \bar{w}^{in}_{i} \cdot \bar{w}^{c}_{j}

Then, we calculate the attention weights as follows:

\alpha^{in}_{ij} = \frac{\exp(E_{ij})}{\sum_{z=1}^{|S^{c}_{l}|} \exp(E_{iz})}, \qquad \alpha^{c}_{ij} = \frac{\exp(E_{ij})}{\sum_{z=1}^{|S^{in}_{k}|} \exp(E_{zj})}

Finally, the alignment vector a^in_i (or a^c_j), which represents the parts of the other sentence that best match the i-th (or j-th) word of the first sentence, is obtained by calculating the weighted sum of the projected vectors of S^c_l (or S^in_k):

a^{in}_{i} = \sum_{j=1}^{|S^{c}_{l}|} \alpha^{in}_{ij} \bar{w}^{c}_{j}, \quad i = 1, 2, \ldots, |S^{in}_{k}|
a^{c}_{j} = \sum_{i=1}^{|S^{in}_{k}|} \alpha^{c}_{ij} \bar{w}^{in}_{i}, \quad j = 1, 2, \ldots, |S^{c}_{l}|

3.2.4. Comparison Layer

In this layer, we introduce how the projected word vector w̄^in_i (or w̄^c_j) and its alignment vector a^in_i (or a^c_j) are compared. Let f denote a comparison function that transforms w̄^in_i (or w̄^c_j) and a^in_i (or a^c_j) into a vector r^in_i (or r^c_j) representing the comparison result. We utilize an element-wise multiplication function f_mult and an element-wise subtraction function f_sub. f_sub and f_mult are closely related to Euclidean distance and cosine similarity, respectively, but they also preserve some information about the different corresponding elements of the original two vectors. We pass the output vectors of f_sub and f_mult into a one-layer feed-forward neural network with the ReLU activation function to reduce dimensionality as follows:

f_{sub}(\bar{w}^{in}_{i}, a^{in}_{i}) = \mathrm{ReLU}(W_{s}(\bar{w}^{in}_{i} - a^{in}_{i}) + b_{s})
f_{mult}(\bar{w}^{in}_{i}, a^{in}_{i}) = \mathrm{ReLU}(W_{mu}(\bar{w}^{in}_{i} \odot a^{in}_{i}) + b_{mu})

where ⊙ is element-wise multiplication, and W_s, W_mu ∈ R^{p×m} and b_s, b_mu ∈ R^p are parameters to be learned. Finally, we construct the comparison result vector r^in_i by concatenating the outputs of f_sub and f_mult:

r^{in}_{i} = [f_{sub}(\bar{w}^{in}_{i}, a^{in}_{i}) ; f_{mult}(\bar{w}^{in}_{i}, a^{in}_{i})]

Similarly, we obtain r^c_j for each position in S^c_l in the same way with shared parameters:

r^{c}_{j} = [f_{sub}(\bar{w}^{c}_{j}, a^{c}_{j}) ; f_{mult}(\bar{w}^{c}_{j}, a^{c}_{j})]

The outputs of this layer are two sequences of comparison vectors r^in: [r^in_1, ..., r^in_{|S^in_k|}] and r^c: [r^c_1, ..., r^c_{|S^c_l|}].

3.2.5. Aggregation Layer

To aggregate the two sequences of comparison vectors r^in (or r^c) into a fixed-length aggregation vector v^in_k (or v^c_l), we apply a bidirectional LSTM (BiLSTM) to each sequence of comparison vectors as follows:

\overrightarrow{h}^{in}_{i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}^{in}_{i-1}, r^{in}_{i}), \quad i = 1, 2, \ldots, |S^{in}_{k}|
\overleftarrow{h}^{in}_{i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}^{in}_{i+1}, r^{in}_{i}), \quad i = |S^{in}_{k}|, \ldots, 2, 1

Then, we concatenate the last time-step of each direction to obtain the fixed-length aggregation vector v^in_k:

v^{in}_{k} = [\overrightarrow{h}^{in}_{|S^{in}_{k}|} ; \overleftarrow{h}^{in}_{1}]

Similarly, we obtain v^c_l in the same way with shared parameters:

v^{c}_{l} = [\overrightarrow{h}^{c}_{|S^{c}_{l}|} ; \overleftarrow{h}^{c}_{1}]

Finally, to construct a final comparison vector V_kl for the sentence pair (S^in_k, S^c_l), we concatenate the two fixed-length aggregation vectors v^in_k and v^c_l.

3.2.6. Prediction Layer

The purpose of this prediction layer is to calculate a semantic similarity score Sim_kl between the sentence pair (S^in_k, S^c_l). To this end, we pass the output of the aggregation layer (V_kl) into an MLP layer. In our experiments, the MLP has a hidden layer with tanh activation and a softmax output layer.

Fig. 4. Our proposed sequential transfer learning steps.

3.3. Sequential Transfer Learning

A key concept in STL is the notion of relatedness between source and target tasks.
To maximize the usefulness of sequential transfer learning, we have to choose a source task that will enable learning a representation that will help for a target task. In this paper, we consider a short text paraphrase identification task as the source task that given a pair of sentences, the goal is to determine whether two sentences are semantically identical or not. The reason for choosing this task as the source task is relatedness between this task and sentence comparison subtask in the HCA-model and the availability of a sufficient number of training examples for the short text paraphrase task. Our STL approach consists of two steps, a pretraining step and an adaptation step that is shown in Fig. 4. In the pretraining step of our proposed STL, the source model (the WLCA-model) is pre-trained on the Quora dataset and then, we perform two approaches for adaptation step named frozen and fine-tuning. In both the frozen and the fine-tuning approaches, we first initialize the sentence comparison layer parameters of our SLCA-model with the pre-trained weights of the source model to transfer the paraphrase knowledge of the source model to SLCA-model. Then, in the frozen approach, the sentence comparison layer parameters are frozen, and only train the other parameters of the SLCA-model but in the fine-tuning approach, the sentence comparison layer parameters are unfrozen and fine-tuned with other parameters of the SLCA-model. The key hyper-parameter of the adaptation step is the learning rate. Followed by Ruder (Ruder, 2019) that suggests the learning rate be set to a lower value than the one used during pre-training so as not to distort the learned parameters too much, we set the learning rate as 10 4 . We also set the learning rate of the pretraining step as 2 × 10 3 . 3.4. Computational Complexity In this section, we discuss the asymptotic complexity of the HCA-model at testing time and compare it to the WLCA-model that only operates at the word level. Recall that d denotes embedding dimension, nw means the number of words of one question and ℓs shows the number of words of one sentence in one question. So, nw is equal to nsℓs, where ns is the number of sentences in one question. Also, for simplicity, we assume that all hidden dimensions are d and that the complexity of one-layer feed-forward neural network with hidden dimensions d is O(d2). Computational Complexity of the WLCA-model: In the projection layer, each word vector is passed through a gated projection layer that requires O(d2). Thus, the complexity of this layer is O(nwd2). In the alignment layer, we first calculate the word level similarity matrix E. The complexity of the dot-product between two d-vector is O(d). Thus, the word level similarity matrix E has the complexity O(nw2d). Next, we get attention-weights for each word that requires O(nw). Therefore, the computational complexity of 11 Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. Table 3 Computational Complexity of our proposed model. 
Model WLCA Complexity O (nw d 2) Projection + O (n w2 d) Alignment + O (nw d2) Comparison + O (nw d 2) Aggregation + O (d 2) Prediction = O (nw d 2 + nw 2d) = O (nw 2d) HCA O (ns 2 ( s d 2 + = 2 s d)) O (ns2 s d 2 + Comparison ns2 2s d ) + O (ns d 2) nw = ns s Filtering O (ns nw d2 + + O (ns d2) n w2 d) = Aggregation + O (ns ( s d 2 + 2 s d)) title-aware-attention + O (d 2) Prediction = O (ns 2 ( s d 2 + 2 s d)) O (n w2 d) Complexity (HCA) = Complexity (WLCA) = O (n w2 d ) the alignment layer is equal to O (n w 2d + n w ) = O (n w 2d) . Then, for each question word, we calculate element-wise subtraction and element-wise multiplication that requires O(d) and then pass the output vectors of fsub and fmult into a 1-layer feed-forward neural network with the complexity O(d2). Thus, the overall complexity of the comparison layer is O (n w (d + d 2 )) = O (n w d 2) . The complexity of an LSTM cell is O(d2), resulting in a complexity of O(nwd2) to aggregate nw comparison vectors. Finally, in the prediction layer, we utilize a fully connected layer that requires O(d2). Thus, the total complexity of the WLCA-model is O (n w d 2 + n w 2d ) = O (n w 2d ). Computational Complexity of the HCA-model: In the sentence comparison layer, we compare all sentence pairs of two questions using the WLCA-model. The complexity of WLCA-model for two sentences with ℓs words is O ( s d 2 + s 2d ) , resulting in a complexity of ns 2 (ls d 2 + ls 2d ) to compare all sentence pairs of two questions. Similar to the projection and aggregation layers of the WLCA-model, the complexity of the sentence filtering and aggregation layers of the HCA-model is O(nsd2). In title-aware attention layer, O (ns ( s d 2 + s 2d )) steps are required to compare the question title with all sentences of question body. Finally, the complexity of the prediction layer is O(d2). Thus, the total complexity of the HCA-model is O (ns2 ( s d 2 + s 2d )) . With n w = ns s , it is equal to O (ns n w d 2 + n w 2d ) = O (n w 2d ) , so Complexity(HCA) = Complexity(WLCA) = n w 2d . Moreover, it should be noted that much of the complexity of our model is related to the sentence comparison layer, that is parallelizable over ns sentences of question. So, at testing time, HCA-model has an acceptable time compared to the existing compare-aggregate models that only operate at the word level. It is worth noting that at training time, the parameters of the WLCA-model are shared among all sentence pairs comparison which reduces the number of training parameters (thus, saves lots of computations). Also, as mentioned in section 3.3, the parameters of the WLCAmodel can be frozen during the training process. So, only the parameters of the layers for sentence filtering, aggregation, and prediction should be trained. The sequence length of these layers is equal to the number of sentences in a question which is usually less than 10. Table 3 summarizes the complexity of the proposed WLCA-model and HCA-model. 4. Experimental Setup In this section, we introduce the datasets and experimental design used to evaluate our approach. In section 4.1, we briefly describe the datasets and evaluation metrics used to evaluate our approach. In section 4.2, we describe the implementation details and the training setting of our system. To evaluate and analyze the proposed approach, several experiments are designed, which are introduced in Section 4.3 4.1. 
Data Set and Evaluation Metrics For the evaluation, we use three public datasets which are briefly described in the following: SemEval datasets: We used two publicly available benchmark datasets provided by Semeval-2016 Task3 Subtask B (Nakov et al., 2016) and Semeval-2017 Task3 Subtask B (Nakov et al., 2017), that contain real data from the Qatar Living forum. There are three English subtasks, and we focus on Subtask B, Question-Question Similarity that is defined as follows: given new input question and a set of top ten related questions retrieved by Google search engine, the goal is to re-rank the candidate questions according to their similarity with respect to the new input question. Candidate questions are annotated either as “PerfectMatch”, “Relevant” or “Irrelevant” for the original question. Both “PerfectMatch” and “Relevant” questions are considered “Relevant” without distinction, and they should be ranked above the “Irrelevant” questions. The Semeval-2016 dataset contains 2670 training pairs, 500 development pairs, Table 4 Statistics of the SemEval datasets. # of Original Questions # of Candidates Questions PerfectMatch Relevant Irrelevant Train Dev 2016-Test 2017-Test 267 2670 235 848 1586 50 500 59 155 286 70 700 81 152 467 88 880 24 139 717 12 Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. Table 5 Statistics of the Stack Exchange AskUbuntu dataset (Lei et al., 2016). Corpus Training Dev Test # of unique questions # of unique questions # of user-marked pairs # of query questions # of annotated pairs Avg # of positive pairs per query # of query questions # of annotated pairs Avg # of positive pairs per query 167,765 12,584 16,391 200 200 × 20 5.8 200 200 × 20 5.5 and 700 testing pairs. For the Semeval-2017, the organizers used the same training and development sets from Semeval-2016 datasets but annotated 880 new testing pairs. Table 4 shows the statistics distribution in the training, development, and test partitions of the datasets and class distribution in each partition. AskUbuntu dataset: We use the Stack Exchange AskUbuntu dataset provided by (Lei et al., 2016) that contains 254480 training pairs, 4000 development pairs, and 4000 testing pairs. Each question is consisting of a title and a body, and a set of user-marked similar question pairs. User-marked similar question pairs on QA sites are often known to be incomplete, so for each of the questions for the test and dev sets, (Lei et al., 2016) retrieved the top 20 similar candidates using BM25 and manually annotated the resulting 8K pairs as similar or non-similar. To create the train sets, they use user-marked similar pairs as positive pairs and use random questions from the corpus paired with each query question as negative pairs. Table 5 shows the various statistics for this dataset. Quora dataset: For pre-training the WLCA-model, we use the Quora question paraphrase data set, which consists of over 400,000 question pairs with binary labels. We use the same data and split as (Z. Wang et al., 2017), with 10,000 question pairs for each of the development and test sets and the remaining 380,000 pairs for the training set. Evaluation Metrics: Performance of different models is measured by Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) that are widely used in the literature for evaluation of QR performance in CQA. We use the official evaluation script published by the SemEval organizer to compute the MAP and MRR. 4.2. 
Implementation Details Data preprocessing: We use the Stanford CoreNLP toolkit (Manning et al., 2014) to tokenize questions (question title along with question body) into sentences and then into words. Embedding: We use 300-dimensional GloVe vectors for word embeddings that are pre-trained from the 840B Common Crawl corpus (Pennington et al., 2014), and they would not be updated during training. Out-of-Vocabulary (OOV) words are initialized randomly with Gaussian samples. For the character embeddings, we initialize each character as a 20-dimensional vector and compose each word into a 50- dimensional vector with an LSTM layer. Model implementation and optimization: The proposed model was implemented in TensorFlow. We optimize HCA-model with Adam optimizer with an initial learning rate of 10 4 . We also set the learning rate of the pretraining step as 2 × 10 3 . A dropout rate of 0.2 is applied to all layers except the final linear layer. Default L2 regularization is set to 10 6 . The size of the hidden layers of the projection layer, the comparison layer for fsub and fmul, and the aggregation-LSTM layer of the WLCA-model is set to 150, 30, 70, and 80 respectively. Also, we set the size of the aggregation-LSTM layer and the MLP layer of the SLCA-model as 80 and 100, respectively. For all the experiments, we select the model which works the best on the dev set and then evaluate it on the test set. To have access to implementations of the neural information retrieval baseline models, we utilize the open-source toolkit MatchZoo (Fan et al., 2017), and for compare-aggregate baseline models, we use the original implementation provided by the authors (S. Wang & Jiang, 2017; Z. Wang et al., 2017). 4.3. Experimental Design In this section, we introduce six designed experiments that we performed to analyze the proposed approach in various aspects. Experiment 1: We designed this experiment to compare the performance of the HCA-model with other baseline and state-of-theart approaches in the SemEval competitions. We evaluated our approach with the SemEval datasets described in section 4.1. For the comparison, we used the performances that were originally reported by the authors of the papers. Experiment 2: Since HCA-model is an end-to-end deep neural network model, we designed this experiment to compare it with several deep learning-based models that are used for similar tasks to our QR task. Experiment 3 (Ablation study): This experiment was designed to study the effectiveness of different components of HCA-model. We believe that the good performance of HCA-model can be attributed to (1) the effectiveness of splitting questions into sentences and comparing semantic relevancy of two questions at the sentence level (2) the usefulness of exploiting transfer learning technique for pre-training the WLCA-model of HCA-model. So, to study the effects of these components, we prepare three different versions of the original HCA-model named HCArmST, HCArmT, HCArmS. In HCArmST, we remove both splitting and transfer learning components from the original HCA-model. In other words, in HCArmST, we do not split the questions into their constituent sentences and the whole question is considered as one sequence and pass it as an input to HCA-model. In addition, instead of using the pre-trained weights of 13 Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. the WLCA-model, we randomly initialize all parameters of HCA-model. 
In HCArmS and HCArmT, the splitting component and the transfer learning component are removed respectively from the original HCA-model. Experiment 4: We designed this experiment to analyze our proposed STL approach. We conducted three experiments, one to explore the effect of the two adaptation methods, the frozen and the fine-tuning on the performance of HCA-model, and two other experiments to study the effect of the size of the source and the target datasets for pre-training the WLCA-model and fine-tuning the SLCA-model, respectively. Experiment 5 (Case Studies): To get a better understanding of how our proposed HCA-model actually performs matching between two long questions, we further perform case studies. One of the main layers of the SLCA-model is the sentences comparison layer that captures semantic similarity features between the sentence pairs of two questions. To study the effectiveness of this layer, we visualize the attention-weights that the WLCA-model learned for each sentence pairs of three sample questions from the development sets of SemEval 2016 dataset. Experiment 6: A key feature distinguishing the proposed HCA-model from the existing compare-aggregate models is its good performance for longer questions. This experiment was designed to better understand this functionality of the HCA-model. We evaluated our approach with the AskUbuntu dataset described in section 4.1. The average question length in the AskUbuntu dataset is 155.60 which is 2.88 times the average question length in SemEval datasets. 5. Results and discussion In this section, we discuss the results of the five designed experiments. 5.1. Results of Experiment 1 Competitive Baseline: We take the ordering the related questions in the order provided by the Google search engine as the strong IR baseline. In addition to that, we also consider all the state-of-the-art works on SemEval 2016/2017 datasets. The competitive baselines for this task are SemEval-2016 Best (Franco-Salvador et al., 2016), SemEval-2016 Second (Barrón-Cedeño, Da San Martino, et al., 2016), SemEval-2016 Third (Filice et al., 2016), Text selection (Barrón-Cedeño, Martino, et al., 2016), Attention-based pruning (Romeo et al., 2016), Multitask_Learning (Joty et al., 2018), SemEval-2017 Best (Bahdanau et al., 2014), SemEval-2017 Second (Goyal, 2017), SemEval-2017 Third (Filice et al., 2017), Attention Autoencoders (M. Zhang & Wu, 2018). We have compared the results of two versions of HCA-model with all the above competitive baselines. HCA-model-pooling that use the pooling layer to aggregate all hidden outputs of the LSTM encoder of the SLCA-model and HCA-model-attention that utilizes the title-aware attention layer for aggregation. Results: Table 6 shows the experimental results for SemEval-2016 and SemEval-2017 data sets. We observe the following from the results: (1) The HCA-model-attention performs much better than the HCA-model-pooling on both SemEval datasets, but it is worth noticing that HCA-model-pooling outperforms the state-of-the-art model on both SemEval-2016 and 2017 data sets. We refer to HCAmodel-attention as the HCA-model in the remainder of this paper. (2) We see that HCA-model with transfer learning, outperforms all the competitive baselines above in terms of MAP and MRR metrics on both SemEval-2016 and 2017 data sets. At SemEval-2016, HCA-model outperforms the winner of the SemEval competition with 3.42 MAP points and 4.13 MRR points. 
Comparing the results of the HCA-model with the state-of-the-art text selection model by (Barrón-Cedeño, Martino, et al., 2016), HCA-model achieves 2.03% gain for MRR and 1.56% for MAP. At SemEval-2017, HCA-model gets a promising MAP of 51.15, which outperforms the best participants of the SemEval competition by a significant margin of 3.93% Table 6 Experimental results of our model compared with the state-of-the-art models. 2016 2017 Models MAP MRR IR_Baseline SemEval Best SemEval Second SemEval Third Attention-based pruning Text selection Multitask Learning Attention Autoencoders HCA-model-pooling HCA-model-attention IR_Baseline SemEval Best SemEval Second SemEval Third(Text selection) Attention Autoencoders HCA-model-pooling HCA-model-attention 74.57 76.70 76.02 75.83 77.82 78.56 76.89 77.79 79.63 80.12 41.85 47.22 46.93 46.66 48.53 50.93 51.15 83.79 83.02 84.64 82.71 84.64 85.12 84.19 85.76 86.70 87.15 46.42 50.07 53.01 50.85 52.75 54.59 55.20 14 Information Processing and Management 57 (2020) 102318 M.S. Zahedi, et al. and 5.13% in term of MAP and MRR respectively. Further, HCA-model outperforms the state-of-the-art Attention Autoencoders model proposed by (M. Zhang & Wu, 2018) by 2.62 MAP points and 2.45 MRR points. (3) SemEval-2017 is more difficult than SemEval-2016 Task. Although the upper bound of the MAP score is 88.57 for SemEval2016, it's only 67.05 for SemEval-2017. The reason that the upper bound for MAP is less than 100% is rooted in the fact that both of these data sets contain query questions that do not have any relevant questions in the gold labels; hence these questions always arrive at a precision of 0 in the MAP scoring. HCA-model makes much better at the difficult task and shows to be more robust for the more challenging task. (4) What is striking in Table 6 is that HCA-model achieves the best results in both the SemEval-2016 and 2017 datasets, thus making it more robust than the existing models. The text selection model proposed by (Barrón-Cedeño, Martino, et al., 2016), which has the best performance on the SemEval-2016 dataset does not have good results on the SemEval-2017 dataset, and it achieves 46.66 in MAP score that is 1.87 points lower than the state-of-the-art Attention Autoencoders model's MAP (M. Zhang & Wu, 2018) in SemEval-2017 dataset. Similarly, the Attention Autoencoders model by (M. Zhang & Wu, 2018) is the best performing model on the SemEval-2017 data set but does not have the best results on the SemEval-2016 dataset. It should be noted that HCA-model is in an end-to-end model that can achieve good performance without using any additional features. The feature engineering is labor-intensive, time-consuming, and error-prone process. But our end to end HCA-model automatically learns features from data, instead of adopting handcrafted features, which mainly depends on prior knowledge of designers, language-dependent resources, and external resources. For example, the performance of the state-of-the-art Attention Autoencoder models (M. Zhang & Wu, 2018) is significantly dependent on a ranking feature provided by the Google search engine, which may not be available in many applications. Also, previous state-of-the-art models such as text selection (Barrón-Cedeño, Martino, et al., 2016) and attention-based pruning (Romeo et al., 2016) models need syntactic parse tree representations. 5.2. Results of Experiment 2 Competitive Baseline: For this experiment, we consider several deep learning-based models that are used for similar tasks to our QR task. 
5.2. Results of Experiment 2

Competitive Baseline: For this experiment, we consider several deep learning-based models that are used for tasks similar to our QR task. The first group of baseline models consists of deep information retrieval models (Deep-IR), which can be categorized into two classes: representation-based and interaction-based models. The representation-based models include the DSSM (Huang et al., 2013) and CDSSM (Shen et al., 2014) models, and the interaction-based models include MatchPyramid (Pang et al., 2016), DRMM (Guo et al., 2016), K-NRM (Xiong et al., 2017), and CONV-KNRM (Dai et al., 2018). The second group of baseline models comprises the compare-aggregate models Com-Agg (S. Wang & Jiang, 2017) and BIMPM (Z. Wang et al., 2017), which achieve state-of-the-art performance on different natural language sentence matching (NLSM) tasks such as answer sentence selection, natural language inference, and paraphrase identification. The third group of baseline models consists of the state-of-the-art sentence embedding methods, including BERT (Devlin et al., 2018), the Universal Sentence Encoder (USE) (Cer et al., 2018), and Sentence-BERT (Reimers & Gurevych, 2019). For the BERT and USE models, we use the pre-trained model (BERT or USE) to obtain the representations of the input question and a candidate question and then calculate the semantic similarity score via their cosine similarity. Sentence-BERT adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding and then fine-tunes BERT in a Siamese / triplet network architecture (Schroff et al., 2015). We train Sentence-BERT on our SemEval data sets.

Results: According to the results in Table 7, our observations are summarized as follows:

(1) The HCA-model significantly outperforms the currently known deep learning-based models.

(2) Among the Deep-IR models, the interaction-based models perform much better than the representation-based models; these results are in line with those of previous studies (Guo et al., 2016; P. Wang et al., 2018).

(3) DRMM obtains the best results among all the interaction-based models, although (Xiong et al., 2017) and (Dai et al., 2018) report that K-NRM and CONV-KNRM significantly outperform DRMM. The main reason behind this experimental finding is that K-NRM and CONV-KNRM have millions of parameters and are therefore more difficult to train than DRMM, which has far fewer parameters, when the training data are limited. It has been shown in (Xiong et al., 2017) that the number of parameters is 49,763,110 for K-NRM, but only 167 for DRMM.

(4) Although the compare-aggregate models perform well on many different NLSM tasks such as answer sentence selection and paraphrase identification, they do not achieve good results on our QR task. We believe the reasons are as follows: questions posted on CQA sites are usually long and contain multiple sentences, while NLSM tasks are posed at the sentence level; as a result, the compare-aggregate models do not yield comparable performance. Also, these models require large amounts of annotated data to produce good results, while for QR in CQA the available training examples are often insufficient. The HCA-model handles these problems and outperforms the state-of-the-art models.

(5) The results of the sentence embedding methods show that directly using the output of BERT leads to rather poor performance. Using Sentence-BERT and fine-tuning it on our SemEval datasets substantially improves the performance. Although USE performs much better than BERT and Sentence-BERT on both the SemEval-2016 and 2017 data sets, it does not achieve the best results on either SemEval dataset. For SemEval-2016 and 2017, the MAP score of USE is 3.09 and 2.22 points lower than that of the HCA-model, respectively. This is because questions posted on CQA sites are long and complex in structure, and it is challenging to learn a single vector representation that expresses the entire meaning of a question.

(6) The best deep learning-based model that participated in SemEval-2016 is the Neural-based Approach proposed by (Mohtarami et al., 2016). Although this model used different handcrafted features as augmented features for the deep learning-based approach, it achieved a MAP score of 76.17, which is 3.46 points lower than the HCA-model. Similarly, for SemEval-2017, the MAP score of the best deep learning-based model (Goyal, 2017) is 4 points lower than our HCA-model.

Table 7. Experimental results of our model compared with the deep learning-based models (a dash indicates that no result is reported for that SemEval edition).

                                           2016             2017
Models                                     MAP     MRR      MAP     MRR
Best Model 2017 (Attention Autoencoders)   77.79   85.76    48.53   52.75
Best Model 2016 (Text selection)           78.56   85.12    46.66   50.85
CDSSM                                      70.34   82.01    42.64   46.26
DSSM                                       72.68   82.39    44.78   48.12
CONV-KNRM                                  72.79   82.45    44.90   49.01
MatchPyramid                               74.13   83.07    46.16   49.42
K-NRM                                      74.60   83.46    46.22   50.39
DRMM                                       76.72   83.53    48.51   52.94
BIMPM                                      73.10   82.70    46.05   50.12
Com-Agg                                    74.43   83.80    46.24   51.15
BERT                                       67.17   77.98    38.20   44.71
Sentence-BERT                              72.57   82.68    47.76   52.48
Universal Sentence Encoder                 77.03   85.02    48.93   52.02
Neural-based Approach                      76.17   85.48    -       -
LearningToQuestion                         -       -        46.93   53.01
HCA-model-pooling                          79.63   86.70    50.93   54.59
HCA-model-attention                        80.12   87.15    51.15   55.20
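To make the sentence-embedding baselines of this subsection concrete, the snippet below sketches how a pre-trained encoder can rank candidate questions by the cosine similarity of whole-question embeddings, as described above for the BERT, USE, and Sentence-BERT baselines. It is only an illustrative sketch: the sentence-transformers checkpoint name is an assumed placeholder, not necessarily the model used in our experiments.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Assumed checkpoint for illustration; the experiments may use a different pre-trained encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rank_candidates(query_question, candidate_questions):
    """Rank candidate questions by cosine similarity of whole-question embeddings."""
    vectors = encoder.encode([query_question] + candidate_questions)
    query_vec, cand_vecs = vectors[0], vectors[1:]
    sims = cand_vecs @ query_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    order = np.argsort(-sims)  # highest similarity first
    return [(candidate_questions[i], float(sims[i])) for i in order]

ranked = rank_candidates(
    "How do I dual boot Ubuntu with Windows?",
    ["Installing Ubuntu alongside Windows 10", "How to change my desktop wallpaper?"],
)
print(ranked[0])
```

This single-vector scheme is exactly what observation (5) argues breaks down for long, multi-sentence CQA questions, since one embedding must summarize both the main goal and the peripheral sentences of the question.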
5.3. Results of Experiment 3 (Ablation Study)

Table 8 reports the ablation study results on the SemEval-2016 and SemEval-2017 datasets. We can see that removing any of the HCA-model components significantly drops the performance. Firstly, we observe that removing the transfer learning component (HCArmT) significantly decreases the performance of our model on both SemEval datasets, which shows that the transfer learning component is crucial for our HCA-model. On SemEval-2016, the MAP score drops from 80.12 to 75.77, and we also found a significant decline in the MAP score on SemEval-2017 when removing the transfer learning component. Secondly, we observe that the splitting component is also very important for the HCA-model: when we remove the splitting component from the HCA-model, the performance on the SemEval 2016 and 2017 datasets drops by 2.89 and 2.55 points, respectively. One of the surprising results of our ablation study is that HCArmST performs better than HCArmT on both SemEval datasets. This means that if we do not use the transfer learning technique in the HCA-model, splitting the questions into sentences will hurt the performance of our model. We believe this phenomenon can be attributed to the limited training data in the SemEval datasets: since removing the splitting component from the HCA-model (HCArmST) eliminates the need for an aggregation layer, the overall model becomes simpler and therefore obtains better results on limited training data compared with the model that has more parameters (HCArmT).

Table 8. Ablation study results on SemEval-2016 and SemEval-2017 datasets.

SemEval-2016                 MAP     MRR
Tree Kernel Classifier       78.56   85.12
HCArmST                      76.10   84.52
HCArmT                       75.77   83.90
HCArmS                       77.23   85.56
HCAOriginal                  80.12   87.15

SemEval-2017                 MAP     MRR
Attention Autoencoders       48.53   52.75
HCArmST                      47.14   51.83
HCArmT                       46.45   51.20
HCArmS                       48.60   52.65
HCAOriginal                  51.15   55.20

5.4. Results of Experiment 4

In this section, we discuss the results of the three experiments designed to analyze our proposed STL.

Analysis of the adaptation methods: The results of utilizing the two different adaptation methods for the adaptation step in our STL are shown in Table 9. The results show that the frozen method performs better than the fine-tuning method. This is because, in the fine-tuning adaptation method, the HCA-model tries to train many randomly initialized parameters (θFiltering + θAggregation + θRank of the HCA-model) together with fully pre-trained ones (θComparison of the HCA-model), which complicates the optimization problem, while in the frozen approach θComparison is kept frozen and only θFiltering + θAggregation + θRank of the HCA-model are trained.

Table 9. Analysis of the adaptation methods.

                      2016             2017
Adaptation method     MAP     MRR      MAP     MRR
frozen                80.12   87.15    51.15   55.20
fine-tuning           78.89   86.24    49.30   53.60
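The difference between the two adaptation strategies can be sketched in PyTorch-style code as follows. The module names (comparison, and implicitly the filtering, aggregation, and rank parts) mirror the parameter groups discussed above but are illustrative placeholders rather than the actual implementation, and the learning rate is arbitrary.

```python
import torch

def build_optimizer(hca_model: torch.nn.Module, adaptation: str = "frozen"):
    """Select which parameter groups are updated on the target (SemEval) data.

    'frozen': keep the pre-trained comparison (WLCA) weights fixed and train only
              the remaining (filtering, aggregation, rank) parameters.
    'fine-tuning': update all parameters, pre-trained and randomly initialized alike.
    """
    if adaptation == "frozen":
        for p in hca_model.comparison.parameters():  # assumed sub-module, pre-trained on Quora
            p.requires_grad = False
    trainable = [p for p in hca_model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-3)  # learning rate is illustrative only
```

Under the frozen setting the optimizer never sees the pre-trained comparison parameters, which is why it avoids the optimization difficulty described above.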
Varying the size of the fine-tuning datasets: We first pre-train the source model on the "Quora" dataset and then initialize θComparison of the HCA-model with the pre-trained weights of θWLCA of the source model to transfer the paraphrase knowledge of the source model to the HCA-model. Next, we vary the size of the training data of the SemEval-2016 and SemEval-2017 datasets used to fine-tune θFiltering + θAggregation + θRank of the HCA-model. As shown in Table 10, increasing the size of the training data used for fine-tuning the HCA-model increases the performance of our model as well.

Table 10. Results of varying sizes of the SemEval datasets for fine-tuning the HCA-model.

Percentage of the target dataset     2016             2017
used for fine-tuning                 MAP     MRR      MAP     MRR
25%                                  74.13   82.10    45.10   50.01
50%                                  76.94   84.23    47.80   51.30
75%                                  78.32   85.70    49.70   53.80
100%                                 80.12   87.15    51.15   55.20

Varying the size of the pre-training dataset: The goal of this experiment is to investigate how large the pre-training dataset should be to make transfer learning applicable. We vary the size of the "Quora" dataset used to pre-train the source model. The results of this experiment are shown in Table 11. It can be seen from the results in Table 11 that pre-training the source model improves the performance of the HCA-model even with a small percentage of the pre-training dataset: the MAP score of the HCA-model increases by 1.75 and 2.42 points on SemEval-2016 and SemEval-2017, respectively, when only 25% of the "Quora" dataset is used for pre-training the source model. We also find that the more pre-training data, the better the model's performance.

Table 11. Results of varying sizes of the Quora dataset for pre-training the WLCA-model.

Percentage of the source dataset     2016             2017
used for pre-training                MAP     MRR      MAP     MRR
0                                    75.45   83.70    46.53   51.00
25%                                  77.20   85.60    48.95   52.32
50%                                  78.34   86.10    49.93   53.12
75%                                  78.97   86.80    50.90   54.10
100%                                 80.12   87.15    51.15   55.20
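The weight-transfer step described at the beginning of this subsection (initializing θComparison of the HCA-model from θWLCA of the source model pre-trained on Quora) can be sketched as follows. The sub-module name and checkpoint path are assumptions made only for illustration.

```python
import torch

def transfer_wlca_weights(hca_model: torch.nn.Module,
                          checkpoint_path: str = "wlca_quora_pretrained.pt") -> None:
    """Initialize the comparison sub-module of the HCA-model from a WLCA model
    pre-trained on the Quora paraphrase data (path and module name are hypothetical)."""
    source_state = torch.load(checkpoint_path, map_location="cpu")
    # Only the word-level compare-aggregate parameters are transferred; the filtering,
    # aggregation, and rank parameters keep their random initialization and are then
    # learned (or fine-tuned) on the SemEval training data.
    hca_model.comparison.load_state_dict(source_state)
```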
5.5. Results of Experiment 5 (Case Studies)

Fig. 5 visualizes the attention-weights learned by the WLCA-model for each sentence pair of the three example questions in Table 12. Fig. 5(a) and Fig. 5(b) show the heat maps of the attention-weights for the question pairs (Q1in, Q1c) and (Q1in, Q2c) in Table 12, respectively. Darker blue areas indicate stronger semantic similarity between the sentence pairs. The most striking observation to emerge from the visualization in Fig. 5 is that our model has been able to identify the most semantically similar sentence pairs and has given them significantly higher attention-weights than the unimportant sentence pairs. For example, the sentence S1in (the question title) of the input question Q1in is most similar to sentences S1c and S3c of the candidate question Q1c, and our model has been able to assign the most weight to these sentences. Similarly, the sentence pairs (S3in, S4c) and (S4in, S5c) are the most semantically similar sentence pairs in the question pair (Q1in, Q1c), and our model assigns them attention-weights that are significantly higher than those of their neighboring unimportant pairs. It should be noted that although the noisy sentence pairs (S6in, S2c) and (S6in, S6c) also receive high attention-weights, our proposed model tries to disregard these noisy sentence pairs in the sentence filtering layer and the title-aware attention layer. The question pair (Q1in, Q2c) is a semantically dissimilar pair, and Fig. 5(b) shows that the attention-weights our model learned for the sentence pairs of these questions are close together, with no significant difference between the weights of the different sentence pairs. The results of this case study show that our model can recognize important coarse-grained sentence-level semantic similarity information for better similarity measurement between two long questions.

Fig. 5. Visualization of the attention-weights that the WLCA-model calculated for each sentence pair of the three example questions in Table 12. (a) Heat map of the attention-weights for the question pair (Q1in, Q1c); (b) heat map of the attention-weights for the question pair (Q1in, Q2c).

Table 12. The sentences of two question-pair examples from the development sets of the SemEval 2016 dataset.

Q1in (input question)
  S1: QP Offer
  S2: I had an interview with department head and lead people not HR at QP for the Grade-13 post
  S3: They said they will inform me the next procedure like medical chekups etc
  S4: Does anybody have the idea how much it will take to complete the procedure and when can i get the offer letter
  S5: Please do let me know what will be the approximate salary and other allowances
  S6: If anybody have any idea; please share with me . Please ...

Q1c (candidate question)
  S1: Join QP
  S2: Dear friends; First of all; i would like to thank you all for your amazing help and support
  S3: Actually i got an offer from QP and it was good
  S4: They have requested to undergo a full medical examination and i did that and sent them the report through courier
  S5: I just wan na know; what ' s the next step in the recruitment procedure and how long time it takes from then to issue me work visa ?
  S6: I will appreciate if any can help me . Regards

Q2c (candidate question)
  S1: Anybody here who can give advice about working for QP vs. Aramco .
  S2: What are the main differences; pros/cons between the two ?
  S3: I 'm in final talks with both companies; no offers yet but looks promising .
  S4: Background info: Finance Manager; 40 years; divorced; currently with BP; UK national
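Heat maps such as those in Fig. 5 can be produced directly from the matrix of sentence-pair attention weights. The sketch below is only one way to draw such a plot and assumes that the weight matrix is already available (here a random placeholder); it is not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: attention weights between 6 input-question sentences (rows) and
# 6 candidate-question sentences (columns); in practice these would come from
# the sentence comparison layer of the SLCA-model.
weights = np.random.rand(6, 6)
weights = weights / weights.sum()  # normalize for display only

fig, ax = plt.subplots()
im = ax.imshow(weights, cmap="Blues")  # darker cells indicate stronger similarity
ax.set_xticks(range(6))
ax.set_xticklabels([f"S{j + 1}c" for j in range(6)])
ax.set_yticks(range(6))
ax.set_yticklabels([f"S{i + 1}in" for i in range(6)])
fig.colorbar(im, ax=ax)
plt.show()
```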
5.6. Results of Experiment 6

Competitive Baseline: For this experiment, we compare against two previously published works on this dataset, namely the RCNN model (Lei et al., 2016) and the DNN+RR model (Ghosh et al., 2017). Since the test splits are the same, we report the results directly from (Ghosh et al., 2017). We also compare our model against the existing compare-aggregate models, Com-Agg (S. Wang & Jiang, 2017) and BIMPM (Z. Wang et al., 2017), and against the Universal Sentence Encoder, which performs much better than the other sentence embedding methods on both SemEval datasets.

Results: Table 13 shows the experimental results for the domain-specific AskUbuntu dataset. We observe the following from the results:

(1) The HCA-model with the STL significantly outperforms all the competitive baselines above in terms of the MAP and MRR metrics. Comparing the results of HCA-attention with the state-of-the-art DNN+RR model of (Ghosh et al., 2017), although this model utilizes different lexical, syntactic, and semantic features, the HCA-model achieves a 3.73% gain for MAP and 5.82% for MRR.

(2) Comparing the results of Com-Agg (S. Wang & Jiang, 2017) with HCA-attention, Com-Agg achieved 63.08 and 71.10 in MAP and MRR scores, which are 6.45 and 8.02 points lower than the HCA-model, respectively. This indicates that the previous Com-Agg (S. Wang & Jiang, 2017) model suffers from the question length problem.

(3) We observe that the title-aware attention layer is a very important component of the HCA-model on this dataset. As shown in Table 6, comparing the results of HCA-attention with HCA-pooling, it achieves 0.49 and 0.22 MAP point gains on the SemEval-2016 and SemEval-2017 datasets, respectively. On the AskUbuntu dataset, this improvement is 2.52 and 1.78 points in terms of the MAP and MRR metrics. The average length of questions in the SemEval and AskUbuntu datasets is 54.02 and 155.60, respectively, which indicates that the title-aware attention is a key component for long questions.

(4) Although USE performs much better than the state-of-the-art DNN+RR model of (Ghosh et al., 2017) and the existing compare-aggregate models, it achieved 66.32 and 71.87 in MAP and MRR scores, which are 3.21 and 7.25 points lower than the HCA-model, respectively. USE was trained on various datasets, including news, question-answer pages, and discussion forums, which appears to be suitable for sentence encoding, but our results show that it may not be suitable for long questions with complex structure.

Table 13. Experimental results of our model compared with the state-of-the-art models on the AskUbuntu dataset. The first seven rows are taken from (Ghosh et al., 2017).

Model                          MAP     MRR
BM25                           45.49   57.8
LM                             45.2    58.4
TRLM                           42.6    56.6
STM                            38.9    46.6
RCNN                           62.3    75.6
DNN                            64.4    72.8
DNN+RR                         65.8    73.3
BIMPM                          61.02   70.32
Com-Agg                        63.08   71.10
Universal Sentence Encoder     66.32   71.87
HCA-pooling                    67.15   77.34
HCA-attention                  69.53   79.12

In Table 14, for two example input questions, we compare the HCA-model with USE based on the rank positions assigned to two candidate questions. For the first, semantically relevant, question pair, the candidate question is long and complex in structure; therefore, the USE model has not been able to learn a vector representation that expresses the entire meaning of the question. But by splitting the question into its constituent sentences, it can be seen that a number of sentence pairs in the two questions have the same meaning.
For example, the title of the input question, "run a command after a specific usb was plugged in", as well as its last sentence, "how do you run a command after a specific type of usb is plugged in ?", is semantically identical to the sentence "how to run a script when a specific flash-drive is mounted?" of the candidate question. That is why the HCA-model is able to place the relevant candidate question at rank position 2, while the USE model puts it at rank position 15. For the second input question, we can see that the USE model puts an irrelevant candidate question at rank position 2, while the HCA-model puts that question at rank position 12. The reason for this result is that the candidate question, although it has many related words in common with the input question, has a main intention that is not related to the input question. As a result, given that the USE model tries to learn one vector representation for the entire meaning of the question, it incorrectly recognizes the irrelevant candidate question as semantically relevant and therefore places it at rank position 2.

To illustrate the effectiveness of the title-aware attention layer, the attention-weight heat maps for the sentences of each question are also drawn in Table 14. In the illustrated heat maps, darker colors represent higher weights between each sentence and the associated question title. What can be concluded from these heat maps is that our title-aware attention model has successfully identified the most informative sentences and has given them significantly higher attention-weights than the unimportant and noisy sentences. For example, for the candidate question located in the upper right of Table 14, the sentences S5, S6, and S8 are the most important sentences of the question and are in line with its main intention, and our title-aware attention has been able to assign the most weight to these sentences. Similarly, the sentences S2 and S4 are peripheral and unimportant sentences, and our title-aware attention assigns them the least weight. The title-aware attention is more crucial for long questions because these questions usually contain peripheral sentences in addition to the main sentences of the question, and not all of their sentences are equally important for finding semantically relevant questions.

It should be noted that the HCA-model performed well on the domain-specific AskUbuntu dataset without using any domain-specific features, language-dependent resources such as syntactic parsers, or external resources such as WordNet or FrameNet, while one of the limitations of feature-based models is that they require separate feature extraction and resource development steps for a new domain. For example, one of the features that some works (G. Wu & Lan, 2016; Y. Wu & Zhang, 2016) used for the public-domain SemEval datasets is a translation-model-based feature. This feature had to be re-engineered in the state-of-the-art DNN+RR model of (Ghosh et al., 2017) for the domain-specific AskUbuntu dataset. Also, previous state-of-the-art models (Barrón-Cedeño, Martino, et al., 2016; Ghosh et al., 2017; Romeo et al., 2016) need language-dependent resources such as syntactic parsers, POS taggers, and named-entity recognition tools in addition to the training data for the QR task, which may be challenging to obtain, especially for low-resource languages.

Table 14. Two question-pair examples from the test sets of the AskUbuntu dataset.
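As a rough illustration of the idea behind the title-aware attention discussed above, the following numpy sketch scores each sentence vector of a question against its title vector and aggregates the sentence vectors with the resulting softmax weights. The function and variable names are illustrative assumptions, not the exact layer used in the HCA-model.

```python
import numpy as np

def title_aware_aggregate(title_vec: np.ndarray, sent_vecs: np.ndarray) -> np.ndarray:
    """Weight each sentence vector by its softmax-normalized dot product with the
    title vector, so informative sentences dominate the question representation."""
    scores = sent_vecs @ title_vec            # one relevance score per sentence
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()         # softmax attention weights
    return weights @ sent_vecs                # weighted sum of the sentence vectors

# Toy example: four sentence vectors of dimension 3.
title = np.array([1.0, 0.0, 0.5])
sents = np.array([[0.9, 0.1, 0.4],   # on-topic sentence -> high weight
                  [0.0, 1.0, 0.0],   # peripheral sentence -> low weight
                  [1.0, 0.0, 0.6],
                  [0.1, 0.8, 0.1]])
print(title_aware_aggregate(title, sents))
```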
6. Conclusion and future works

In this paper, we proposed an end-to-end deep neural model for QR in CQA sites, namely the HCA-model, which handles the two challenges of question length and training data sparsity simultaneously. The HCA-model follows the general compare-aggregate framework, but it has a hierarchical structure that is applied at the fine-grained word level and the coarse-grained sentence level. This hierarchical structure allows the HCA-model to handle long questions with multiple noisy sentences. To solve the training data sparsity problem, we use a sequential transfer learning technique to pre-train the WLCA-model on the Quora question paraphrase dataset. The exhaustive experiments conducted on the two public-domain benchmark datasets of SemEval-2016 and SemEval-2017 and on the domain-specific AskUbuntu dataset show that the HCA-model achieves state-of-the-art performance on all three datasets. The results of this research support the idea that not all sentences of a lengthy question are equally important in finding semantically relevant questions. Therefore, splitting questions into their sentences and calculating the semantic similarity between two questions based on the similarity of their sentences can be a suitable approach to the QR task in CQA. As future work, we will study the effectiveness of other SPM data sets, such as natural language inference (NLI) data, for pre-training the WLCA-model. In addition, we will try to employ multi-task learning to train the WLCA-model and the SLCA-model at the same time for QR in CQA. We also plan to apply our HCA-model to other tasks in CQA, such as the answer retrieval task.

CRediT authorship contribution statement

Mohammad Sadegh Zahedi: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Maseud Rahgozar: Conceptualization, Methodology, Validation, Formal analysis, Resources, Writing - review & editing, Supervision, Project administration. Reza Aghaeizadeh Zoroofi: Conceptualization, Methodology, Validation, Formal analysis, Writing - review & editing, Supervision.

References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ArXiv Preprint ArXiv:1409.0473.
Baldwin, T., Liang, H., Salehi, B., Hoogeveen, D., Li, Y., & Duong, L. (2016). UniMelb at SemEval-2016 Task 3: Identifying Similar Questions by combining a CNN with String Similarity Measures. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 851–856. https://doi.org/10.18653/v1/S16-1131.
Barrón-Cedeño, A., Da San Martino, G., Joty, S., Moschitti, A., Al-Obaidli, F., Romeo, S., Tymoshenko, K., & Uva, A. (2016). ConvKN at SemEval-2016 Task 3: Answer and Question Selection for Question Answering on Arabic and English Fora. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 896–903. https://doi.org/10.18653/v1/S16-1138.
Barrón-Cedeño, A., Martino, G. D. S., Romeo, S., & Moschitti, A. (2016). Selecting Sentences versus Selecting Tree Constituents for Automatic Question Ranking. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2515–2525.
Bian, W., Li, S., Yang, Z., Chen, G., & Lin, Z. (2017). A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2, 1987–1990. https://doi.org/10.1145/3132847.3133089.
Cao, X., Cong, G., Cui, B., Jensen, C. S., & Yuan, Q. (2012). Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives. ACM Transactions on Information Systems, 30(2), 1–38. https://doi.org/10.1145/2180868.2180869.
Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., & Tar, C. (2018). Universal sentence encoder for English. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 169–174.
Chen, L., Jose, J. M., Yu, H., Yuan, F., & Zhang, D. (2016). A Semantic Graph based Topic Model for Question Retrieval in Community Question Answering. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 287–296. https://doi.org/10.1145/2835776.2835809.
Chen, Q., Zhu, X., Ling, Z.-H., Wei, S., Jiang, H., & Inkpen, D. (2017). Enhanced LSTM for Natural Language Inference. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1, 1657–1668.
Chen, Z., Zhang, C., Zhao, Z., Yao, C., & Cai, D. (2018). Question retrieval for community-based question answering via heterogeneous social influential network. Neurocomputing, 285, 117–124. https://doi.org/10.1016/j.neucom.2018.01.034.
Choi, H., & Lee, H. (2019). Multitask learning approach for understanding the relationship between two sentences. Information Sciences, 485, 413–426.
Choi, H., & Lee, H. (2018). GIST at SemEval-2018 Task 12: A network transferring inference knowledge to Argument Reasoning Comprehension task. Proceedings of The 12th International Workshop on Semantic Evaluation, 773–777.
Chung, Y., Lee, H., & Glass, J. (2018). Supervised and Unsupervised Transfer Learning for Question Answering. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 1585–1594.
Dai, Z., Xiong, C., Callan, J., & Liu, Z. (2018). Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 126–134. https://doi.org/10.1145/3159652.3159659.
Das, A., Shrivastava, M., & Chinnakotla, M. (2016, April). Mirror on the Wall: Finding Similar Questions with Deep Structured Topic Modeling. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 9652 LNAI, 454–465. https://doi.org/10.1007/978-3-319-31750-2_36.
Das, A., Yenala, H., Chinnakotla, M., & Shrivastava, M. (2016). Together we stand: Siamese networks for similar question retrieval. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1, 378–387.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv Preprint ArXiv:1810.04805.
dos Santos, C., Barbosa, L., Bogdanova, D., & Zadrozny, B. (2015). Learning Hybrid Representations to Retrieve Semantically Equivalent Questions. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2(1), 694–699. https://doi.org/10.3115/v1/P15-2114.
Fan, Y., Pang, L., Hou, J., Guo, J., Lan, Y., & Cheng, X. (2017). MatchZoo: A toolkit for deep text matching. ArXiv Preprint ArXiv:1707.07270, 2–3.
Filice, S., Croce, D., Moschitti, A., & Basili, R. (2016). KeLP at SemEval-2016 Task 3: Learning Semantic Relations between Questions and Answers. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 1116–1123. https://doi.org/10.18653/v1/S16-1172.
Filice, S., Da San Martino, G., & Moschitti, A. (2017). KeLP at SemEval-2017 Task 3: Learning Pairwise Patterns in Community Question Answering. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 326–333. https://doi.org/10.18653/v1/S17-2053.
Franco-Salvador, M., Kar, S., Solorio, T., & Rosso, P. (2016). UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 814–821.
Fu, C. (2019a). Tracking user-role evolution via topic modeling in community question answering. Information Processing & Management, 56(6), 102075. https://doi.org/10.1016/j.ipm.2019.102075.
Fu, C. (2019b). User intimacy model for question recommendation in community question answering. Knowledge-Based Systems. https://doi.org/10.1016/j.knosys.2019.07.015.
Fu, H., & Oh, S. (2019). Quality assessment of answers with user-identified criteria and data-driven features in social Q&A. Information Processing & Management, 56(1), 14–28.
Galbraith, B., Pratap, B., & Shank, D. (2017). Talla at SemEval-2017 Task 3: Identifying Similar Questions Through Paraphrase Detection. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 375–379. https://doi.org/10.18653/v1/S17-2062.
Ghaeini, R., Hasan, S. A., Datla, V., Liu, J., Lee, K., Qadir, A., Ling, Y., Prakash, A., Fern, X., & Farri, O. (2018). DR-BiLSTM: Dependent Reading Bidirectional LSTM for Natural Language Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 1460–1469.
Ghosh, K., Bhowmick, P. K., & Goyal, P. (2017). Using re-ranking to boost deep learning based community question retrieval. Proceedings of the International Conference on Web Intelligence, 807–814.
Gong, Y., Luo, H., & Zhang, J. (2018). Natural Language Inference over Interaction Space. 6th International Conference on Learning Representations, ICLR 2018. https://openreview.net/forum?id=r1dHXnH6-.
Goyal, N. (2017). LearningToQuestion at SemEval 2017 Task 3: Ranking Similar Questions by Learning to Rank Using Rich Features. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 310–314. https://doi.org/10.18653/v1/S17-2050.
Guo, J., Fan, Y., Ai, Q., & Croft, W. B. (2016). A Deep Relevance Matching Model for Ad-hoc Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 55–64. https://doi.org/10.1145/2983323.2983769.
He, H., & Lin, J. (2016). Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 937–948.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., & Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2333–2338.
Ji, Z., Xu, F., Wang, B., & He, B. (2012). Question-answer topic model for question retrieval in community question answering. Proceedings of the 21st ACM International Conference on Information and Knowledge Management - CIKM '12, 2471. https://doi.org/10.1145/2396761.2398669.
Joty, S., Màrquez, L., & Nakov, P. (2018). Joint Multitask Learning for Community Question Answering Using Task-Specific Embeddings. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4196–4207.
Lan, W., & Xu, W. (2018). Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. Proceedings of the 27th International Conference on Computational Linguistics, 3890–3902.
Lee, J., Kim, S.-B., Song, Y.-I., & Rim, H. (2008). Bridging lexical gaps between queries and questions on large online Q&A collections with compact translation models. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 410–418.
Lei, T., Joshi, H., Barzilay, R., Jaakkola, T., Tymoshenko, K., Moschitti, A., & Màrquez, L. (2016). Semi-supervised Question Retrieval with Gated Convolutions. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1279–1289.
Lopez-Gazpio, I., Maritxalar, M., Lapata, M., & Agirre, E. (2019). Word n-gram attention models for sentence similarity and inference. Expert Systems with Applications, 132, 1–11.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.
Min, S., Seo, M., & Hajishirzi, H. (2017). Question Answering through Transfer Learning from Large Fine-grained Supervision Data. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2, 510–517.
Mohtarami, M., Belinkov, Y., Hsu, W., Zhang, Y., Lei, T., Bar, K., Cyphers, S., & Glass, J. (2016). SLS at SemEval-2016 Task 3: Neural-based Approaches for Ranking in Community Question Answering. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 828–835. https://doi.org/10.18653/v1/S16-1128.
Nakov, P., Hoogeveen, D., Màrquez, L., Moschitti, A., Mubarak, H., Baldwin, T., & Verspoor, K. (2017). SemEval-2017 Task 3: Community Question Answering. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 27–48. https://doi.org/10.18653/v1/S17-2003.
Nakov, P., Màrquez, L., Moschitti, A., Magdy, W., Mubarak, H., Freihat, A. A., Glass, J., & Randeree, B. (2016). SemEval-2016 Task 3: Community Question Answering. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 525–545. https://doi.org/10.18653/v1/S16-1083.
Neshati, M., Fallahnejad, Z., & Beigy, H. (2017). On dynamicity of expert finding in community question answering. Information Processing & Management, 53(5), 1026–1042.
Nobari, A. D., Neshati, M., & Gharebagh, S. S. (2020). Quality-aware skill translation models for expert finding on StackOverflow. Information Systems, 87, 101413.
Pang, L., Lan, Y., Guo, J., Xu, J., Wan, S., & Cheng, X. (2016). Text matching as image recognition. Thirtieth AAAI Conference on Artificial Intelligence. https://doi.org/10.1007/s001700170197.
Parikh, A., Täckström, O., Das, D., & Uszkoreit, J. (2016). A Decomposable Attention Model for Natural Language Inference. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2249–2255.
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543.
Qi, L., Zhang, Y., & Liu, T. (2017). SCIR-QA at SemEval-2017 Task 3: CNN Model Based on Similar and Dissimilar Information between Keywords for Question Similarity. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 305–309. https://doi.org/10.18653/v1/S17-2049.
Qiu, X., & Huang, X. (2015). Convolutional neural tensor network architecture for community-based question answering. Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 1305–1311.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3973–3983.
Romeo, S., Da San Martino, G., Barrón-Cedeño, A., & Moschitti, A. (2017). A Multiple-Instance Learning Approach to Sentence Selection for Question Ranking. European Conference on Information Retrieval, 437–449.
Romeo, S., Da San Martino, G., Barrón-Cedeño, A., Moschitti, A., Belinkov, Y., Hsu, W.-N., Zhang, Y., Mohtarami, M., & Glass, J. (2016). Neural attention for learning to rank questions in community question answering. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 1734–1745.
Ruder, S. (2019). Neural Transfer Learning for Natural Language Processing. National University of Ireland, Galway.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 815–823.
Seo, M. J., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2017). Bidirectional Attention Flow for Machine Comprehension. 5th International Conference on Learning Representations, ICLR 2017. https://openreview.net/forum?id=HJ0UKP9ge.
Shao, T., Guo, Y., Chen, H., & Hao, Z. (2019). Transformer-Based Neural Network for Answer Selection in Question Answering. IEEE Access, 7, 26146–26156.
Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). Learning semantic representations using convolutional neural networks for web search. Proceedings of the 23rd International Conference on World Wide Web, 373–374. https://doi.org/10.1145/2567948.2577348.
Shtok, A., Dror, G., Maarek, Y., & Szpektor, I. (2012). Learning from the past: answering new questions with past answers. Proceedings of the 21st International Conference on World Wide Web, 759–768.
Singh, A. (2012). Entity based Q&A retrieval. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1266–1277.
Srba, I., & Bielikova, M. (2016). A Comprehensive Survey and Classification of Approaches for Community Question Answering. ACM Transactions on the Web, 10(3), 1–63. https://doi.org/10.1145/2934687.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. ArXiv Preprint ArXiv:1505.00387.
Tan, C., Wei, F., Wang, W., Lv, W., & Zhou, M. (2018). Multiway attention networks for modeling sentence pairs. Proceedings of the 27th International Joint Conference on Artificial Intelligence, 4411–4417.
Wang, P., Ji, L., Yan, J., Dou, D., Silva, N. De, Zhang, Y., & Jin, L. (2018). Concept and Attention-Based CNN for Question Retrieval in Multi-View Learning. ACM Transactions on Intelligent Systems and Technology, 9(4), 1–24. https://doi.org/10.1145/3151957.
Wang, S., & Jiang, J. (2017). A Compare-Aggregate Model for Matching Text Sequences. 5th International Conference on Learning Representations, ICLR 2017. https://openreview.net/forum?id=HJTzHtqee.
Wang, Z., Hamza, W., & Florian, R. (2017). Bilateral multi-perspective matching for natural language sentences. IJCAI International Joint Conference on Artificial Intelligence, 4144–4150.
Wu, G., & Lan, M. (2016). ECNU at SemEval-2016 Task 3: Exploring Traditional Method and Deep Learning Method for Question Retrieval and Answer Ranking in Community Question Answering. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 872–878. https://doi.org/10.18653/v1/S16-1135.
Wu, Y., & Zhang, M. (2016). ICL00 at SemEval-2016 Task 3: Translation-Based Method for CQA System. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 857–860. https://doi.org/10.18653/v1/S16-1132.
Xiong, C., Dai, Z., Callan, J., Liu, Z., & Power, R. (2017). End-to-End Neural Ad-hoc Ranking with Kernel Pooling. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 55–64. https://doi.org/10.1145/3077136.3080809.
Xue, X., Jeon, J., & Croft, W. B. (2008). Retrieval models for question and answer archives. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '08, 475. https://doi.org/10.1145/1390334.1390416.
Yang, Z., Salakhutdinov, R., & Cohen, W. W. (2017). Transfer learning for sequence tagging with hierarchical recurrent networks. ArXiv Preprint ArXiv:1703.06345.
Zhang, K., Wu, W., Wu, H., Li, Z., & Zhou, M. (2014). Question Retrieval with High Quality Answers in Community Question Answering. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, 371–380. https://doi.org/10.1145/2661829.2661908.
Zhang, M., & Wu, Y. (2018). An unsupervised model with attention autoencoders for question retrieval. Thirty-Second AAAI Conference on Artificial Intelligence.
Zhang, W.-N., Ming, Z.-Y., Zhang, Y., Liu, T., & Chua, T.-S. (2016). Capturing the Semantics of Key Phrases Using Multiple Languages for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering, 28(4), 888–900. https://doi.org/10.1109/TKDE.2015.2502944.
Zhou, G., Cai, L., Zhao, J., & Liu, K. (2011). Phrase-based translation model for question retrieval in community question answer archives. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 653–662.
Zhou, G., & Huang, J. X. (2017). Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering, 29(6), 1226–1239. https://doi.org/10.1109/TKDE.2017.2665625.
Zhou, G., Xie, Z., He, T., Zhao, J., & Hu, X. T. (2016). Learning the Multilingual Translation Representations for Question Retrieval in Community Question Answering via Non-Negative Matrix Factorization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(7), 1305–1314. https://doi.org/10.1109/TASLP.2016.2544661.
Zhou, G., Zhou, Y., He, T., & Wu, W. (2016). Learning semantic representation with neural networks for community question answering retrieval. Knowledge-Based Systems, 93, 75–83. https://doi.org/10.1016/j.knosys.2015.11.002.