A MODULAR HIERARCHICAL MODEL FOR
PAPER QUALITY EVALUATION
Xi Deng¹, Shasha Li¹, Jie Yu¹, Jun Ma¹, Bin Ji¹, Wuhang Lin¹, Shezheng Song¹ and Zibo Yi²
¹College of Computer, National University of Defense Technology, Changsha, China
²Information Research Center of Military Science, PLA Academy of Military Science, Beijing, China
ABSTRACT
Paper quality evaluation is of great significance as it helps to select high-quality papers from the massive volume of academic papers. However, existing models need improvement in how they handle interaction and aggregation within the hierarchical structure of a paper, and they ignore the guiding role of the title and abstract in the paper text. To address these two issues, we propose a well-designed modular hierarchical model (MHM) for paper quality evaluation. Firstly, the input to our model is most of the paper text, and no additional information is needed. Secondly, we fully exploit the inherent hierarchy of the text with three encoders equipped with attention mechanisms: a word-to-sentence (WtoS) encoder, a sentence-to-paragraph (StoP) encoder, and a paper encoder. Specifically, the WtoS encoder uses the pre-trained language model SciBERT to obtain sentence representations from word representations. The StoP encoder lets sentences in the same paragraph interact and aggregates them into paragraph embeddings based on importance scores. The paper encoder models the interaction among the different hierarchical structures of the three modules of a paper text: the paper title, abstract sentences, and body paragraphs. It then aggregates the newly generated representations into a compact vector. In addition, the paper encoder models the guiding role of the title and the abstract, respectively, generating another two compact vectors. We concatenate the above three compact vectors and four additional manual features to obtain the paper representation. This representation is then fed into a classifier to obtain the acceptance decision, which is a proxy for a paper's quality. Experimental results on a large-scale dataset built by ourselves show that our model consistently outperforms the previous strong baselines in four evaluation metrics. Quantitative and qualitative analyses further validate the superiority of our model.
KEYWORDS
Paper quality evaluation, Modular, Hierarchical, Attention mechanisms, Interaction.
1. INTRODUCTION
As thousands of new papers appear, evaluating their quality helps to select high-quality papers from them quickly. Naturally, this task can help readers choose good papers rapidly, assist reviewers in reviewing papers, and allow authors to self-check their papers. However, the research on this novel task still has a long way to go. As we all know, measuring innovativeness and contribution accurately is the key to evaluating a paper's quality. To the best of our knowledge, there is no publicly available expert-rated large-scale dataset on the innovativeness and contribution of papers in this field. More importantly, the models would need to be pre-trained with a sufficiently large
amount of real-world data covering mathematical, computer science, and other subject expertise. Even the state-of-the-art models have far less prior knowledge about a paper than the reviewers do. Therefore, models can hardly truly understand and evaluate the innovation and contribution of a paper from its text alone; we leave fully understanding the innovativeness and contribution of papers to future researchers. Nevertheless, the experimental results in this paper demonstrate that our model works well in evaluating the quality of papers; that is, the model has an excellent ability to evaluate papers after training.
In reality, the quality of a paper is tough to measure accurately with specific values. Thus, most work takes a paper's acceptance as a proxy for quality evaluation. Since conferences at different levels and in different domains have different quality standards, researchers carry out their work on a dataset of conferences at a specific level in a particular field (e.g., top conferences in Artificial Intelligence), or they directly use a single conference as the dataset to unify the quality standard seen by the model. Following this convention, we conduct our experiments on an ICLR conference dataset produced by ourselves. In theory, though, our model will also be able to find papers that meet a given quality standard after being trained on any reasonable dataset with uniform quality standards.
In the literature, early studies [4–6] focus on collecting domain-specific manual features to build deterministic models that predict paper acceptance, which is a proxy for papers' quality. However, feature engineering is time-consuming and labor-intensive, let alone the domain knowledge required. Recently, with the rapid development of deep learning, numerous neural models have been leveraged to extract these features automatically, such as MILAM [7], MHCNN [8], DeepSentiPeer [9], and HabNet [10]. Despite their much higher prediction accuracy, these models rely on extra information such as author backgrounds and review comments. Such extra information is inaccessible in double-blind review, hindering model application. More importantly, these models pay less attention to the quality of the paper itself.
Some studies formulate the paper acceptance prediction task as a text classification task. A paper has multiple modules, such as the title, abstract, and body. Within each module there are multiple hierarchical structures, such as words, sentences, and paragraphs, and both interaction and aggregation take place within this hierarchy. However, previous works [10–15] do not fully consider the position and contextual information of elements during interaction, or the importance of elements during aggregation. Meanwhile, they operate on the overall representation of each module, whereas we show that the model works better when operating at the different levels of the three modules: the title, abstract sentences, and body paragraphs. Also, the textual representations in these models have difficulty capturing the information in figures, tables, formulas, and references. Therefore, we improve on these problems in the above work.
In this paper, we propose a novel modular hierarchical model (MHM) for paper quality evaluation. Our model contains three encoders that work bottom-up to capture the hierarchical structure of paper texts. Specifically, we first divide a paper text into three modules: title, abstract, and body. Then we apply the word-to-sentence (WtoS) encoder, a fine-tuned SciBERT [16] model, to the sentences contained in the three paper modules and obtain their sentence-level representations. Next, we apply the sentence-to-paragraph (StoP) encoder to the sentence-level representations of each body paragraph. Following the DiSAN [17] framework, the StoP encoder lets sentence-level representations interact contextually at the paragraph level and aggregates them into a paragraph-level representation through bi-directional self-attention and multi-dimensional attention. After that, the paper encoder takes the sentence-level representations of the title and abstract sentences and the paragraph-level representations of body paragraphs as inputs, and it outputs a compact vector. To model the guiding role of the title, the paper encoder performs cross-attention between the title and the abstract sentences, generating a compact vector. To model the guiding role of the abstract, the paper encoder performs cross-attention between the abstract sentences and the body paragraphs, generating another compact vector. For a more comprehensive representation, we concatenate the above three compact vectors and four manual features to obtain the paper representation. The four features are the numbers of figures, tables, formulas, and references, respectively. Finally, the paper representation is fed into a classifier to obtain the acceptance decision, a proxy for the paper's quality.
The contributions of our work are summarized as follows:
• We produce a standard large-scale dataset of ICLR conference papers. This dataset expands the research resources in the field and is available to researchers for extensive research.
• We propose a well-designed modular hierarchical model for paper quality evaluation. This model takes the interactions and aggregations in the hierarchical structure of the paper into account more fully than existing models. We are also the first to model the guiding role of the title and abstract.
• Experimental results on the ICLR conference dataset show that our model consistently outperforms the previous strong baselines in four evaluation metrics. Quantitative and qualitative analyses further validate the superiority of our model.
2. RELATED WORK
Paper quality evaluation is a novel task proposed in 2018 [4, 8]. The acceptance of papers at a particular level of conference in a specific field can serve as a proxy for quality assessment. All existing research in this area fails to enable models to truly understand the innovativeness and contribution of a paper, even though both are core to the assessment. Researchers use language models to extract linguistic features such as sentence readability and contextual consistency from paper texts. The models can also determine the innovation and contribution of a paper very roughly from specific words such as "SOTA," "outperform," "the first," "code publicly available," etc. Since models have neither the extensive domain knowledge of reviewers nor the ability to reproduce a paper's experimental results, they can easily be deceived: a model will likely assess a well-written paper that falsifies innovative and contributory results as good quality. Researchers cannot discern this deception at present, leaving it for future work. Current studies assume that the authors are honest and that the paper's content is authentic and trustworthy. Research in this area is limited and still very much in its infancy, with much room for future exploration.
The length of a paper is typically thousands of words. Directly applying CNNs, RNNs, or attention mechanisms to such long texts is limited by memory and computing power, so researchers generally divide long texts to build hierarchical networks according to the paper's inherent structure. From word to full text, aggregation and interaction occur at three levels: word, sentence, and module.
The first level is the hierarchical structure for aggregating words. Researchers generally aggregate the representations of the words in a sentence to obtain an effective sentence representation. Shen et al. [12] propose a joint model combining text content with a visual rendering for document quality assessment. The model averages the word vectors directly to obtain the sentence vectors, following Shen et al. [18]. This model ignores the positions and varying significance of the words in a sentence when capturing the meaning of long sentences.
The second level is the hierarchical structure for aggregating sentences. Wenniger et al. [14] propose to use HANs combined with structure tags. They use a bi-directional LSTM over sentences, which prevents parallelization. Leng et al. [13] let the model learn the semantic, grammatical, and innovation-related features of an article simultaneously through three well-designed components. However, only sentences within a fixed-size receptive field can interact with each other. Lu et al. [15] propose MV-HATrans, a text representation model that fuses multi-viewpoint information, but they ignore the interaction of sentence-level representations at the paragraph level.
The third level is the hierarchical structure for aggregating modules. A paper consists of the title, abstract, and body, and researchers further subdivide the body into introduction, related work, method, experimental results, and conclusion. As mentioned above, words are aggregated into sentences and sentences into modules; researchers then aggregate the modules to obtain a paper representation. Yang et al. [8] propose a novel modularized hierarchical convolutional neural network for automatic academic paper rating. However, they do not consider the interaction among the different hierarchical structures of the modules. Qiao et al. [11] improve on Yang's work by using an LSTM between the modules. Nevertheless, the one-way LSTM only allows a module to fuse the content that precedes it.
In summary, existing work needs improvement regarding the position and contextual information of elements during interaction and the importance of elements during aggregation. Besides, these works do not consider the guiding role of the title and abstract.
To address these two problems, we propose a well-designed modular hierarchical model. Firstly, we take into account as much detail as possible in the interaction and aggregation processes on the hierarchical structure. Inspired by Shen et al. [17], we use our improved version of the bi-directional self-attention mechanism to consider location and contextual information during interaction, and we utilize the multi-dimensional source2token self-attention proposed by Shen et al. [17] to assign distinct weights to elements during aggregation at each level of the hierarchy. Secondly, we are the first to model the guiding role of the title and abstract: we leverage cross-attention mechanisms across hierarchical structures to generate two full-text representations guided by the title and the abstract, respectively.
3. METHODOLOGY
In this section, we first describe the problem setting, next explain the intuition behind the model,
then introduce two pre-defined components, and finally present the details of our proposed model
for paper quality evaluation.
3.1. Problem Setting
Our study takes a paper's acceptance as a proxy for quality evaluation. In our experiments, two assumptions may deviate somewhat from the real world: we assume that the content of the papers in the dataset is authentic and credible, i.e., the authors did not falsify contributions or innovativeness, and we assume that a specific conference has consistent review standards for quality over six years. We examine the problem of predicting papers' acceptance/rejection based on a dataset $D$ containing $k$ papers, i.e., $D = \{(p_1, y_1), \dots, (p_k, y_k)\}$, where $p_i$ is the $i$-th paper's text and $y_i \in \{0, 1\}$ is its corresponding conference-specific true decision: 0 means the paper is rejected, and 1 means it is accepted. Concretely, a paper is $p = \{t, a, b\}$, where $t$, $a$, and $b$ represent the title, abstract, and body of the paper, respectively. Assume that the title has $l$ words, i.e., $w^t = \{w^t_1, \dots, w^t_l\}$, where $w^t_i$ denotes the embedding of the $i$-th word in the title. Assume the abstract has $m$ sentences, $s^a = \{s^a_1, \dots, s^a_m\}$, and each sentence has $n$ words; let $w^a_{i,j}$ with $i \in [1, m]$, $j \in [1, n]$ denote the embedding of the $j$-th word in the $i$-th sentence of the abstract. Assume that the body contains $u$ paragraphs, $pa^b = \{pa^b_1, \dots, pa^b_u\}$, each paragraph contains $v$ sentences, $s^b_i = \{s^b_{i,1}, \dots, s^b_{i,v}\}$, and each sentence contains $w$ words; let $w^b_{i,j,k}$ with $i \in [1, u]$, $j \in [1, v]$, $k \in [1, w]$ denote the embedding of the $k$-th word in the $j$-th sentence of the $i$-th paragraph of the body. Given the text $p$ of a new paper, our goal is to predict the corresponding decision class $y$, which is a proxy for the paper's quality. We treat acceptance/rejection prediction as a binary classification problem, where the class labels are the decisions $y$.
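To make the notation concrete, the following minimal sketch shows the shape of one sample; the field names are ours, purely for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Paper:
    """One sample p = {t, a, b}: title, abstract sentences, body paragraphs."""
    title: str             # t: a single sentence of l words
    abstract: List[str]    # a: m sentences
    body: List[List[str]]  # b: u paragraphs, each with up to v sentences

@dataclass
class Sample:
    paper: Paper
    label: int             # y in {0, 1}: 0 = rejected, 1 = accepted
```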
3.2. The Intuition Behind the Model
By treating the acceptance of a paper as a proxy for its quality, the paper quality evaluation task becomes a binary classification task: predicting whether the paper will be accepted or not. Unlike ordinary classification tasks, a paper has a length of several thousand words or even tens of thousands. CNNs, RNNs, and attention mechanisms are limited by current computing resources and cannot directly take the full text as input; the time complexity of the attention mechanism used extensively in language models is quadratic in the sequence length. The common practice of truncating long documents breaks long-distance dependencies between tokens, resulting in performance degradation. Therefore, we follow Yang's [19] idea of a hierarchical model to construct our model for paper quality evaluation, preserving the long-distance dependencies between tokens as much as possible.
Based on the inherent structure of a paper, we divide it into three modules: title, abstract, and body. Following the hierarchical idea that sentences are composed of words, we generate context-aware sentence representations for the title and abstract; this is what the W to S encoder of our modular hierarchical model does. Following the hierarchical idea that paragraphs are composed of sentences, we generate context-aware paragraph representations for the body of the paper; this is done by the S to P encoder.
The paper encoder is the core of our model and its most complex part. Unlike previous work, we do not let the title, abstract, and body interact at the same level (sentence or paragraph). Analyzing papers carefully, we find that the title and abstract contain much more information than the body at the same level. Specifically, although the title is at the sentence level, it summarizes the main content of the abstract. The abstract as a whole is at the paragraph level and contains the core content of the paper, while each sentence in the abstract is at the sentence level and outlines the information of one or more essential body paragraphs. The BiDiSAN module in the paper encoder accomplishes this interaction and obtains a full-text representation by fusing the title, abstract, and body information.
However, due to the model's complexity and the dataset's limited size, a model relying only on the above full-text representation is inadequate for capturing all the critical information. Therefore, we use two modules, T-CrossAN and A-CrossAN, to enhance the full-text representation. We use cross-attention between the title and the abstract sentences to obtain the core representation of the abstract: the more relevant an abstract sentence is to the title, the more central it is. We use cross-attention between the abstract sentences and the body paragraphs to obtain the core representation of the body: the more relevant a body paragraph is to the abstract sentences, the more central it is. These two core representations, which contain the critical information of the paper, are employed to enhance the full-text representation from the BiDiSAN module.
Our model has two drawbacks. First, the model ignores some features of the papers. The complete features of a paper include the title, abstract, body, and references, and the body consists of text, tables, images, and formulas. A language model with text-only input cannot access the visual information of the figures and tables in the paper, and it also struggles to understand complex formulas and references. Therefore, we extract the numbers of figures, tables, references, and formulas as the simplest way to account for the features the model misses; in subsequent work, we will let the model learn these four features end-to-end. Second, innovation and contribution, which best reflect the quality of a paper, can currently only be judged crudely from the paper's descriptive language. This judgment relies heavily on the integrity and writing ability of the author; we leave it to the future development of artificial intelligence. Nevertheless, the experimental results in this paper show that our model can still assess the quality of papers very well.
The model in this paper is well-designed for paper quality evaluation. It can also be used for other long-document modeling tasks with minor modifications.
3.3. Two Pre-defined Components
Before describing the details of our approach, we briefly introduce the two pre-defined
components that will be used several times in the model.
3.3.1. Improved Version of the Bi-directional Self-Attention
We improve slightly on bi-directional self-attention [17] to meet the needs of our paper quality evaluation task. First, computing separate attention scores for each dimension between tokens of texts as long as papers would take too much memory and computation time. Therefore, we replace the original multi-dimensional token2token self-attention with multi-head self-attention [20], preserving the positional masks. In this way, we reduce memory usage by a factor of the token dimension size when computing the attention mechanism; this also reduces the model complexity and allows faster training. Second, the fusion gate in the original bi-directional self-attention mechanism dynamically controls the ratio between the original representation and the context-aware representation. This ratio is a learnable parameter with the same dimensionality as the token, and such a complex model is hard to train on a dataset of our size. Thus, we replace the original fusion gate with a residual structure [21] to reduce the number of parameters and decrease the model complexity.
We introduce our improved version of the bi-directional self-attention mechanism using the S to P encoder as an example. In this encoder, the mechanism allows the sentences within a paragraph to interact to obtain context-aware sentence representations. Suppose the input is $s^b_i$, the set of all sentence representations of the $i$-th paragraph in the body: $s^b_i = \{s^b_{i,1}, \dots, s^b_{i,v}\}$, $s^b_{i,j} \in \mathbb{R}^{d_e}$, $j \in [1, v]$, where $v$ is the number of sentences contained in the $i$-th paragraph and $s^b_{i,j}$ is the embedding of the $j$-th sentence of the $i$-th paragraph in the body. $Q$, $K$, and $V$ are obtained by applying different learned linear transformations to $s^b_i$. We compute the dot products of $Q$ and $K$ and divide each by $\sqrt{d_k}$, following the Transformer [20]. Here we choose between two positional masks, $M^{fw}$ and $M^{bw}$: the former makes each sentence attend only to its preceding sentences, while the latter makes each attend only to its following ones. Then we apply a softmax function to obtain the attention weights on $V$; $\mathrm{Attention}(Q, K, V, M)$ is the weighted sum over $V$ according to these attention weights.

$$Q = W^{(1)} s^b_i, \quad K = W^{(2)} s^b_i, \quad V = W^{(3)} s^b_i \tag{1}$$

$$\mathrm{Attention}(Q, K, V, M) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V \tag{2}$$

where $W^{(1)} \in \mathbb{R}^{d_e \times d_k}$, $W^{(2)} \in \mathbb{R}^{d_e \times d_k}$, $W^{(3)} \in \mathbb{R}^{d_e \times d_v}$, with $d_k = d_v = d_e = 768$. $M$ is either $M^{fw}$ or $M^{bw}$: the allowed entries of $M^{fw}$ form a strictly lower triangular pattern of order $v$, and those of $M^{bw}$ a strictly upper triangular pattern; allowed entries are 0 and all other entries are $-\infty$, so masked positions vanish after the softmax.

In the multi-head attention mechanism, we linearly project $Q$, $K$, and $V$ to $\tilde{d}_k$, $\tilde{d}_k$, and $\tilde{d}_v$ dimensions, respectively. Then we perform the scaled dot-product attention with masks in Eq. (2) on each of these projected versions in parallel. Each attention head produces $\tilde{d}_v$-dimensional output values, which are concatenated and once again projected to obtain the final values.

$$\mathrm{MultiHead}(Q, K, V, M) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O, \quad \mathrm{head}_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i, M) \tag{3}$$

where the number of heads $h = 2$; $W^Q_i \in \mathbb{R}^{d_k \times \tilde{d}_k}$, $W^K_i \in \mathbb{R}^{d_k \times \tilde{d}_k}$, $W^V_i \in \mathbb{R}^{d_v \times \tilde{d}_v}$, and $W^O \in \mathbb{R}^{h\tilde{d}_v \times d_e}$, with $\tilde{d}_k = \tilde{d}_v = d_e / h$. $M$ is either $M^{fw}$ or $M^{bw}$.

The forward mask $M^{fw}$ and the backward mask $M^{bw}$ are fed into Eq. (3) separately to realize bi-directional attention. We finally apply a residual connection to the result of the concatenation operation in Eq. (4): the weighted context representation produced by the attention mechanism is added to the original token representation to derive a context-aware representation.

$$se^b_i = \left[\mathrm{MultiHead}(Q, K, V, M^{fw}) \,\|\, \mathrm{MultiHead}(Q, K, V, M^{bw})\right] + V \tag{4}$$

where $\|$ denotes the concatenation operation, $se^b_i = \{se^b_{i,1}, \dots, se^b_{i,v}\}$, $se^b_{i,j} \in \mathbb{R}^{2d_e}$, $j \in [1, v]$, and $se^b_{i,j}$ is the context-aware embedding of the $j$-th sentence of the $i$-th paragraph in the body.
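The following PyTorch sketch shows one way to implement Eqs. (1)-(4). Module and variable names are ours, not released code; we keep the diagonal unmasked so that no attention row is entirely masked (a small departure from the strictly triangular masks above), and the residual is added per direction so the shapes line up:

```python
import math
import torch
import torch.nn as nn

class BiDirSelfAttention(nn.Module):
    """Sketch of the improved bi-directional self-attention (Eqs. (1)-(4)):
    multi-head scaled dot-product attention under forward/backward positional
    masks, with a residual connection replacing DiSAN's fusion gate."""

    def __init__(self, d_e: int = 768, h: int = 2):
        super().__init__()
        self.h, self.d_head = h, d_e // h
        self.W_q = nn.Linear(d_e, d_e)     # W^(1) in Eq. (1)
        self.W_k = nn.Linear(d_e, d_e)     # W^(2)
        self.W_v = nn.Linear(d_e, d_e)     # W^(3)
        self.W_o_fw = nn.Linear(d_e, d_e)  # W^O, forward direction
        self.W_o_bw = nn.Linear(d_e, d_e)  # W^O, backward direction

    def attend(self, q, k, v, mask):
        # Eq. (2): softmax(QK^T / sqrt(d_k) + M) V, computed per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head) + mask
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, s):  # s: (v, d_e), the sentence embeddings of one paragraph
        n = s.size(0)

        def heads(x):      # split into h heads of size d_e / h (Eq. (3))
            return x.view(n, self.h, self.d_head).transpose(0, 1)

        v_full = self.W_v(s)
        q, k, v = heads(self.W_q(s)), heads(self.W_k(s)), heads(v_full)
        # Additive masks: 0 where attention is allowed, -inf elsewhere.
        # The diagonal stays unmasked so no row is fully -inf (our deviation).
        zeros = torch.zeros(n, n)
        neg = torch.full((n, n), float("-inf"))
        m_fw = torch.where(torch.ones(n, n).tril().bool(), zeros, neg)
        m_bw = torch.where(torch.ones(n, n).triu().bool(), zeros, neg)

        def concat(x):     # merge heads back to (n, d_e)
            return x.transpose(0, 1).reshape(n, -1)

        h_fw = self.W_o_fw(concat(self.attend(q, k, v, m_fw)))
        h_bw = self.W_o_bw(concat(self.attend(q, k, v, m_bw)))
        # Eq. (4): concatenate both directions and add the residual; output (n, 2*d_e)
        return torch.cat([h_fw + v_full, h_bw + v_full], dim=-1)
```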
3.3.2. Multi-Dimensional Source2token Self-attention
Suppose the input is $se^b_i$ computed by Eq. (4), $se^b_i = \{se^b_{i,1}, \dots, se^b_{i,v}\}$, $se^b_{i,j} \in \mathbb{R}^{2d_e}$, $j \in [1, v]$: the context-aware sentence representations of the $i$-th paragraph in the body. We use the function $f(se^b_{i,j})$ to calculate the dependency between $se^b_{i,j}$ and the entire sequence $se^b_i$. The attention weight $\alpha_{i,j}$, $j \in [1, v]$, of each sentence $se^b_{i,j}$ is obtained by applying a softmax function to $f(se^b_{i,j})$. The output $pa^b_i$ of this module is the weighted sum of the inputs $se^b_i$ according to the attention weights. Formally, we have:

$$f(se^b_{i,j}) = W^T \sigma\!\left(W^{(1)} se^b_{i,j} + b^{(1)}\right) + b \tag{5}$$

$$\alpha_{i,j} = \frac{\exp\!\left(f(se^b_{i,j})\right)}{\sum_{j=1}^{v} \exp\!\left(f(se^b_{i,j})\right)} \tag{6}$$

$$pa^b_i = \sum_j \alpha_{i,j} \, se^b_{i,j} \tag{7}$$

where $f(se^b_{i,j}) \in \mathbb{R}^{2d_e}$ is a vector with the same length as $se^b_{i,j}$, $\sigma(\cdot)$ is an activation function, and the weight matrices $W, W^{(1)} \in \mathbb{R}^{2d_e \times 2d_e}$; $b$ and $b^{(1)}$ are bias terms. $W$, $W^{(1)}$, $b$, and $b^{(1)}$ are all trainable parameters. $pa^b_i \in \mathbb{R}^{2d_e}$ is the embedding of the $i$-th paragraph in the body.
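A corresponding sketch of Eqs. (5)-(7); since the paper leaves $\sigma(\cdot)$ unspecified, the choice of ELU below is our assumption:

```python
import torch
import torch.nn as nn

class MultiDimSource2Token(nn.Module):
    """Multi-dimensional source2token self-attention (Eqs. (5)-(7)): scores each
    element per feature dimension and aggregates the sequence by a weighted sum."""

    def __init__(self, d: int = 2 * 768):
        super().__init__()
        self.f = nn.Sequential(   # f(x) = W^T sigma(W^(1) x + b^(1)) + b, Eq. (5)
            nn.Linear(d, d),      # W^(1) x + b^(1)
            nn.ELU(),             # sigma: activation (ELU is our choice)
            nn.Linear(d, d),      # W^T (.) + b
        )

    def forward(self, se):                        # se: (v, d) context-aware sentences
        alpha = torch.softmax(self.f(se), dim=0)  # Eq. (6): softmax over sentences, per dimension
        return (alpha * se).sum(dim=0)            # Eq. (7): (d,) paragraph embedding
```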
3.4. Our Approach
We divide each paper into three modules: title, abstract, and body. Our proposed model takes the paper text as input and consists of four main components: the WtoS encoder, the StoP encoder, the paper encoder, and the decision predictor, as shown in Fig. 1. The following is a detailed description of the modular hierarchical model.

Figure 1. The architecture of our model. Note that the part inside the rectangular dashed box is the paper encoder. The part below the dashed box is the S to P encoder. The red solid circles represent the multi-dimensional source2token self-attention module.
3.4.1. W to S Encoder
The W to S encoder aims to capture the relationships between the words in a sentence and the importance of each word to the meaning of the sentence. It has been demonstrated that BERT [22] models perform effective knowledge transfer from self-supervised tasks with large-scale training data. SciBERT [16] is a variant of BERT [22]; trained on a large corpus of academic papers, it contains rich prior knowledge highly relevant to our task. We tokenize each sentence using the SciBERT vocabulary with a maximum sequence length of $L = 512$, following SciBERT [16]. We add [CLS] at the beginning and [SEP] at the end of each sentence's token sequence; if the sequence is shorter than $L$, we append [PAD] tokens to the end. Then we feed the token sequence of each sentence contained in the title, abstract, and body into SciBERT, taking the embedding of the [CLS] token as the sentence representation. In summary, the W to S encoder lets words in the same sentence interact and aggregates them into sentence embeddings based on importance scores.
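A minimal sketch of this step using the public allenai/scibert_scivocab_uncased checkpoint from Hugging Face; it illustrates the [CLS]-pooling described above rather than reproducing our exact pipeline (in training, the last three SciBERT layers are fine-tuned rather than frozen; see Section 4.2):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
scibert = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def encode_sentences(sentences, max_len=512):
    """Return one d_e = 768 embedding per sentence: the [CLS] hidden state.
    The tokenizer adds [CLS]/[SEP] and pads shorter sentences in the batch."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():  # inference-only here, for illustration
        out = scibert(**batch)
    return out.last_hidden_state[:, 0]  # (num_sentences, 768)
```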
3.4.2. S to P Encoder
The S to P encoder aims to capture the relationships between sentences in a paragraph and the
importance of each sentence to the meaning of the paragraph. The input is the sentence embedding $s^b_{i,j}$ of the body generated by the WtoS encoder, where $i \in [1, u]$, $j \in [1, v]$. The encoder first generates a context-aware embedding $se^b_{i,j}$ for each sentence using the bi-directional self-attention in Eq. (4). Based on these context-aware sentence embeddings $se^b_{i,j}$, the multi-dimensional source2token self-attention computes attention scores $\alpha_{i,j}$ as in Eq. (6), which indicate the importance of each feature of a sentence to the paragraph. The paragraph embedding is the weighted sum of the context-aware sentence representations according to the attention scores $\alpha_{i,j}$, as in Eq. (7). We obtain $pa^b_i$ after reducing the dimensionality through a fully connected layer. This paragraph representation incorporates all the sentence information, the relationships between sentences, and the importance of each sentence in the paragraph. In summary, the S to P encoder lets sentences in the same paragraph interact and aggregates them into paragraph embeddings based on importance scores.
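Composing the two pre-defined components sketched in Section 3.3, the S to P encoder can be written as follows; reducing back to $d_e$ with the fully connected layer is our assumption about the target dimensionality:

```python
import torch.nn as nn

class StoPEncoder(nn.Module):
    """S to P encoder sketch: sentences of one paragraph -> one paragraph embedding.
    BiDirSelfAttention and MultiDimSource2Token are the sketches from Section 3.3."""

    def __init__(self, d_e: int = 768):
        super().__init__()
        self.interact = BiDirSelfAttention(d_e)         # Eq. (4): (v, d_e) -> (v, 2*d_e)
        self.aggregate = MultiDimSource2Token(2 * d_e)  # Eqs. (5)-(7): (v, 2*d_e) -> (2*d_e,)
        self.reduce = nn.Linear(2 * d_e, d_e)           # FC layer reducing dimensionality

    def forward(self, sentences):  # sentences: (v, d_e) from the W to S encoder
        return self.reduce(self.aggregate(self.interact(sentences)))
```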
3.4.3. Paper Encoder
There are two crucial observations here. First, information interacts among the title, abstract sentences, and body paragraphs. Second, the title guides the abstract's core ideas and core sentences; similarly, the abstract guides the body's core ideas and core paragraphs. Based on these two observations, the paper encoder is divided into three modules from top to bottom: T-CrossAN, A-CrossAN, and BiDiSAN (see Fig. 1).
The T-CrossAN module models the title's guiding role for the abstract. This cross-attention module first calculates the relevance scores $\alpha^1_i$ between the title $s^t$ and the abstract sentences $s^a_i$, $i \in [1, m]$. Then it outputs a weighted sum of the embeddings of the abstract sentences. The output $A$ is the semantic representation of the core abstract guided by the title. Formally, we have:

$$Q^1 = W^{(1)}_1 s^t, \quad K^1_i = W^{(2)}_1 s^a_i, \quad V^1_i = W^{(3)}_1 s^a_i \tag{8}$$

$$\alpha^1_i = \mathrm{softmax}\!\left(\frac{Q^1 (K^1_i)^T}{\sqrt{d_k}}\right), \quad A = \sum_i \alpha^1_i V^1_i \tag{9}$$

where $W^{(1)}_1 \in \mathbb{R}^{d_e \times d_k}$, $W^{(2)}_1 \in \mathbb{R}^{d_e \times d_k}$, $W^{(3)}_1 \in \mathbb{R}^{d_e \times d_v}$, and $d_k = d_v = d_e$.

The A-CrossAN module models the abstract's guiding role for the body. This cross-attention module calculates the relevance score between each abstract sentence $s^a_i$, $i \in [1, m]$, and each body paragraph $pa^b_j$, $j \in [1, u]$. We combine these abstract-body relevance scores with the title-abstract relevance scores to obtain the final scores $\alpha^2_j$. The module outputs a weighted sum of the embeddings of the body paragraphs according to $\alpha^2_j$. This output $B$ is the semantic representation of the core body text guided by the abstract sentences; the more relevant an abstract sentence is to the title, the more guidance it has. Formally, we have:

$$Q^2_i = W^{(1)}_2 s^a_i, \quad K^2_j = W^{(2)}_2 pa^b_j, \quad V^2_j = W^{(3)}_2 pa^b_j \tag{10}$$

$$\alpha^2_j = \sum_i \mathrm{softmax}\!\left(\frac{Q^2_i (K^2_j)^T}{\sqrt{d_k}}\right) \alpha^1_i, \quad B = \sum_j \alpha^2_j V^2_j \tag{11}$$

where $W^{(1)}_2 \in \mathbb{R}^{d_e \times d_k}$, $W^{(2)}_2 \in \mathbb{R}^{d_e \times d_k}$, $W^{(3)}_2 \in \mathbb{R}^{d_e \times d_v}$, and $d_k = d_v = d_e$.
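Both modules can be sketched with one parameterized cross-attention block covering Eqs. (8)-(11); shapes and names are illustrative, not the authors' released code:

```python
import math
import torch
import torch.nn as nn

class GuidedCrossAttention(nn.Module):
    """T-CrossAN / A-CrossAN sketch: queries attend over a key/value sequence,
    optionally reweighted by guidance scores from the level above (Eq. (11))."""

    def __init__(self, d_e: int = 768):
        super().__init__()
        self.W_q = nn.Linear(d_e, d_e)
        self.W_k = nn.Linear(d_e, d_e)
        self.W_v = nn.Linear(d_e, d_e)

    def forward(self, query, keys, guidance=None):
        # query: (q, d_e); keys: (n, d_e); guidance: (q,) weights or None
        q, k, v = self.W_q(query), self.W_k(keys), self.W_v(keys)
        scores = torch.softmax(q @ k.T / math.sqrt(q.size(-1)), dim=-1)  # (q, n)
        if guidance is not None:            # Eq. (11): weight each query's row
            scores = guidance.unsqueeze(-1) * scores
        alpha = scores.sum(dim=0)           # (n,) relevance of each key
        return alpha @ v, alpha             # weighted sum of values, plus scores

# Usage sketch:
#   A, alpha1 = t_cross(title_vec.unsqueeze(0), abstract_sents)    # Eqs. (8)-(9)
#   B, _      = a_cross(abstract_sents, body_paragraphs, alpha1)   # Eqs. (10)-(11)
```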
The BiDiSAN module models the interaction of the title, the abstract sentences, and the body paragraphs. Its structure is the same as that of the StoP encoder, and its inputs are the representations of the title, abstract sentences, and body paragraphs. The output of this module is $C$, a weighted sum of the context-aware full-text representations according to the attention scores. $C$ incorporates the relationships among the title, abstract, and body, as well as their importance in the paper.
The paper encoder outputs the concatenation of $A$, $B$, and $C$ from the above three modules together with four manual features. These four well-designed manual features are the numbers of figures, tables, formulas, and references in the paper. We consider the numbers of figures and tables a rough indicator of the adequacy of the paper's experiments; the number of formulas partially reflects the theoretical grounding of the paper; and the number of references correlates with the adequacy of the authors' survey of related work. These features are currently tricky for deep models to learn from long textual content, so combining them with the full-text representation formed by $A$, $B$, and $C$ adds richer information.
3.4.4. Decision Predictor
We design a decision predictor to predict the acceptance of an academic paper, which is a proxy for the paper's quality. Specifically, we take the compact representation from the paper encoder as its input and pass it through a fully connected layer with an ELU activation, reducing its dimensionality to 128. Then we feed it into a fully connected layer with a sigmoid activation to get the final prediction result $\hat{y}$. Although existing research on this task, including the work in this paper, is still at a preliminary stage of exploration, we believe that paper quality evaluation will be of great value in the academic field.
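A sketch of the predictor head; the input width d_in depends on the concatenation of A, B, C, and the four manual features, and is left as a parameter here:

```python
import torch.nn as nn

class DecisionPredictor(nn.Module):
    """Two fully connected layers: ELU down to 128, then sigmoid for y_hat.
    Note: training with BCEWithLogitsLoss (Section 4.2) would drop the final
    Sigmoid and feed the raw logit to the loss instead."""

    def __init__(self, d_in: int):  # d_in: size of [A; B; C; 4 manual features]
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128), nn.ELU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, paper_repr):  # paper_repr: (batch, d_in)
        return self.net(paper_repr).squeeze(-1)  # acceptance probability
```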
4. EXPERIMENTS AND RESULTS
4.1. Dataset
The dataset currently widely used for paper quality evaluation is arXiv, the largest subset of PeerRead [4], which mixes papers from several top conferences. Papers judged as rejected by the heuristic rules in arXiv may not be truly negative samples, and because arXiv aggregates several top conferences, quality standards that vary between conferences may affect this task. Therefore, we have produced our own large dataset containing only ICLR papers; among the top conferences, only ICLR officially provides complete, accurate data on rejected papers. In theory, our proposed model can be trained on any dataset with a uniform quality standard to obtain the ability to evaluate papers at that standard.
Table 1. Statistics of the ICLR dataset

Year        #Paper   #Acc/#Rej
ICLR2017       450     177/273
ICLR2018       804     308/496
ICLR2019      1227     449/778
ICLR2020      2024     645/1379
ICLR2021      2386     802/1584
ICLR2022      2420    1004/1416
Total         9311    3385/5926
We conduct experiments on this dataset, covering six years of papers from the top conference ICLR, to verify the validity of our model in predicting paper acceptance. Specifically, using the official interface provided by OpenReview, we download the main conference papers of ICLR from 2017 to 2022. Papers with an original category label of Accept (Oral), Accept (Spotlight), Accept (Talk), or Accept (Poster) are uniformly labeled as 1 (accept). Papers with an original category label of Reject or Invite to Workshop Track are uniformly labeled as 0 (reject). Then we utilize the open-source project GROBID to parse the PDF files into the structured text of each paper. After cleaning the data and removing samples with text lengths above 8000, the dataset contains a total of 9311 samples. Table 1 shows the detailed statistics of this dataset. All samples are randomly shuffled before dividing the dataset. We follow a 7:2:1 split, meaning that 6516, 1864, and 931 papers are used as the training, validation, and test sets, respectively.
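For concreteness, the label mapping and the 7:2:1 split described above can be sketched as follows; the decision strings follow the OpenReview venue labels quoted in the text, and the function names are ours:

```python
import random

ACCEPT = {"Accept (Oral)", "Accept (Spotlight)", "Accept (Talk)", "Accept (Poster)"}
REJECT = {"Reject", "Invite to Workshop Track"}

def to_label(decision: str) -> int:
    """Map an OpenReview decision string to a binary acceptance label."""
    if decision in ACCEPT:
        return 1
    if decision in REJECT:
        return 0
    raise ValueError(f"unmapped decision: {decision}")

def split_dataset(samples, seed=42):
    """Shuffle, then split 7:2:1 into train/validation/test."""
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```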
4.2. Experimental Settings
All experiments are carried out on a machine with two RTX 3090 Ti GPUs, each with 24 GB of memory. We use the text of the paper as the input for all models. All models are implemented in the deep learning framework PyTorch, and all deep learning models are trained with the BCEWithLogitsLoss loss function and the AdamW optimizer with default parameters. For all models, we adopt 10-fold cross-validation, using the average of the ten results for evaluation. Other hyperparameters are adjusted based on empirical settings and experimental results. The detailed settings of our model are as follows. Due to memory limitations, we only fine-tune the last three layers of SciBERT, with a learning rate of 1e-5, while the other layers of SciBERT are frozen. The learning rate of the rest of our model is set to 3e-4. We use a linear warm-up strategy for both learning rates, with warm-up steps equal to 0.2 times the total steps. Also due to memory limitations, the batch size is only 1; we therefore use a gradient accumulation strategy, updating the network parameters once every 8 batches. In addition, the number of heads in the multi-head attention is 2, and the number of epochs is set to 20. With data parallelism over the two GPUs, the total running time of our model is 8 hours.
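This optimization setup can be sketched schematically as follows; the model, data loader, and parameter-group arguments are placeholders supplied by the caller:

```python
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def train(model: nn.Module, train_loader, scibert_last3, rest,
          epochs: int = 20, accum: int = 8):
    """Schematic training loop: two learning rates, linear warm-up over
    0.2 * total steps, and gradient accumulation every 8 batches (batch size 1)."""
    criterion = nn.BCEWithLogitsLoss()           # expects the pre-sigmoid logit
    optimizer = AdamW([
        {"params": scibert_last3, "lr": 1e-5},   # last 3 SciBERT layers, fine-tuned
        {"params": rest, "lr": 3e-4},            # the rest of the model
    ])
    total = epochs * len(train_loader) // accum
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=int(0.2 * total), num_training_steps=total)
    for _ in range(epochs):
        for step, (inputs, label) in enumerate(train_loader):
            loss = criterion(model(inputs), label) / accum  # scale for accumulation
            loss.backward()                                 # gradients accumulate
            if (step + 1) % accum == 0:                     # update every 8 batches
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
```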
4.3. Baselines and Evaluation Metrics
There has been some research on the task of predicting the acceptance of academic papers. However, these works do not publish their source code, and their specific implementation details are unclear, so we do not compare our work with them given the complexity of reimplementation.
To verify the validity of our model, we compare it with ten baselines under the experimental settings described above. The baseline models are as follows: (1) five flat baselines: two traditional text classification models, logistic regression (LR) and Bernoulli naive Bayes (BernoulliNB), and three deep learning models for text classification, TextCNN [23], TextRCNN [24], and DPCNN [25]; (2) five pre-trained models: GPT2 [26], SciGPT2 [27], BERT-base [22], BERT-large [22], and SciBERT [16].
We use Area Under the Curve (AUC), Accuracy (ACC), Macro-F1 (Ma-F1), and Micro-F1 (Mi-F1) to evaluate all models on the task of predicting the acceptance of papers. The AUC is the area enclosed by the ROC curve and the coordinate axis; equivalently, it is the probability that the classifier scores a randomly selected positive sample higher than a randomly selected negative one. Because AUC is not affected by the imbalance between positive and negative sample counts, we use it as the primary metric in our experiments. Since the model predicts values from 0 to 1, we need to set a threshold manually: samples above the threshold are treated as positive and those below as negative. As different thresholds produce different ACC, Ma-F1, and Mi-F1, we fix the selection criterion: we traverse thresholds in [0.01, 0.99] in steps of 0.01 to find the threshold that maximizes Mi-F1, and then compute ACC and Ma-F1 at that threshold. Note that in binary classification, ACC and Mi-F1 are equal.
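This threshold-selection criterion amounts to a short sweep, sketched here with scikit-learn's f1_score; y_true and y_prob are assumed to be NumPy arrays of labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob):
    """Traverse thresholds in [0.01, 0.99] in 0.01 steps; pick the one
    maximizing micro-F1 (equal to accuracy in binary classification)."""
    best_t, best_f1 = 0.01, -1.0
    for t in np.arange(0.01, 1.00, 0.01):
        f1 = f1_score(y_true, (y_prob >= t).astype(int), average="micro")
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```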
4.4. Experimental Results
Table 2. Performance results of all models on the OpenReview dataset

Group                  Model       AUC    ACC/Mi-F1  Ma-F1
Flat Baseline          NB          0.628  0.668      0.598
                       LR          0.624  0.654      0.456
                       TextCNN     0.629  0.649      0.405
                       TextRCNN    0.708  0.708      0.601
                       DPCNN       0.705  0.708      0.600
Pre-trained Baseline   GPT2        0.642  0.668      0.536
                       SciGPT2     0.678  0.665      0.611
                       BERT-base   0.768  0.733      0.675
                       BERT-large  0.755  0.732      0.661
                       SciBERT     0.778  0.742      0.692
Ours                   MHM         0.806  0.754      0.701
The experimental results are shown in Table 2. MHM achieves the best results on every metric, demonstrating the effectiveness of our model and its generalizability on the test set. Specifically: (1) The NB and LR models have the worst results on the dataset, perhaps because their structures are too simple to learn deeper features. (2) Among the three deep learning models, the results of TextCNN are close to those of LR. DPCNN performs significantly better than TextCNN, which indicates that CNNs with more layers can learn deeper features and thus perform better. TextRCNN, with fewer network layers and faster training, achieves almost the same performance as DPCNN. This demonstrates the advantage of RNNs in modeling long texts: the RNN model is more suitable for handling long-distance dependencies. (3) Among the five pre-trained models, GPT2 and SciGPT2 perform poorly, perhaps because generative models are not good at long-text classification. Both BERT-base and BERT-large achieve excellent results, reflecting the fact that pre-trained language models trained on a large-scale corpus are of great help to downstream tasks. SciBERT, a variant of BERT, achieves the best performance among the baselines thanks to its rich scientific domain knowledge. (4) Compared with SciBERT, MHM gains 2.8%, 0.9%, and 1.2% on AUC, Ma-F1, and Mi-F1, respectively. This demonstrates the effectiveness of our model on the task of predicting the acceptance of academic papers. The decisions predicted by MHM are not yet entirely correct, but they can still serve as a helpful aid for people.
4.5. Ablation Study
We perform the following ablation study on MHM to evaluate the contribution of each component; the results are shown in Table 3. The full version of MHM outperforms all variants with individual components removed, which indicates that each component is indispensable. Specifically: (1) The variant with the WtoS encoder removed performs worst. This indicates that words, as the most fundamental structure in a paper, play the most critical role in the overall semantics and in the subsequent performance of the model. (2) Removing the StoP encoder has the second-worst impact on the model, illustrating the importance of generating reasonable paragraph representations; the body paragraphs account for most of the words in a paper. (3) The results of removing the paper encoder show that it also plays a vital role: it handles the interaction among the title, abstract sentences, and body paragraphs, and it models the guiding role of the title and abstract to obtain the core semantics of the abstract and body. (4) Although the effect of removing the manual features is less pronounced, they still provide useful auxiliary information that the model currently has difficulty learning from the input text. In conclusion, all three encoders and the manual features are crucial to the model's prediction performance.
Table 3. Ablation results of MHM

Model            AUC             ACC/Mi-F1       Ma-F1
MHM              0.806           0.754           0.701
-WtoS Encoder    0.717 (-0.089)  0.702 (-0.052)  0.623 (-0.078)
-StoP Encoder    0.778 (-0.028)  0.731 (-0.023)  0.681 (-0.020)
-Paper Encoder   0.783 (-0.023)  0.733 (-0.021)  0.682 (-0.019)
-feature         0.789 (-0.017)  0.742 (-0.012)  0.696 (-0.005)
4.6. Case Study
We randomly sample 20 papers and visualize the attention scores, obtaining the following findings: (1) In the StoP encoder, the model generally pays the most attention to the first sentence, the last sentence, or both. (2) In the T-CrossAN module, the model mainly focuses on the last abstract sentence. (3) In the A-CrossAN module, the model focuses more on the conclusion and the paragraphs near section headings. (4) In the BiDiSAN module of the paper encoder, the model gives higher attention scores overall to the title and abstract sentences and lower scores to the body paragraphs. Specifically, within the abstract the model pays the most attention to the final sentences, generally the parts describing the experiments; within the body it pays more attention to the conclusion and the parts closely linked to the title. This is consistent with how humans review papers and suggests that the model does capture essential information. However, such boilerplate-driven analysis may miss excellent papers that do not follow this standard writing style.

We then analyze the top 10 highest-rated true positives. Most of them contain keywords such as "code is available", "acknowledgment", "appendix", and "state-of-the-art", as well as explicit descriptions of the division of labor among the authors. To some extent, this reflects a paper's innovation, contribution, and meticulousness.
We investigate the error cases that our model predicts incorrectly on the ICLR dataset and find that 77.3% of them are false negatives and 22.7% are false positives. Our analysis of the highest-scoring false positive reveals that this paper includes the keywords and standard structure of an excellent paper: it looks outstanding. However, the publicly available human review comments on OpenReview indicate that the paper is too simple and does not contribute enough. This suggests that the model has difficulty learning in-depth features, such as contribution, that require extensive domain knowledge to judge. Our analysis of the lowest-scoring false negative finds that the model misidentifies the core sentences in the abstract, which may be why it does not capture the correct semantics of the paper.
5. CONCLUSIONS
In this paper, we propose a modular hierarchical model (MHM) for paper quality evaluation. Specifically, we first divide the paper into three modules: title, abstract, and body. Then we use three hierarchical encoders in the model: the W to S encoder, the S to P encoder, and the paper encoder. We consider interaction and aggregation in the hierarchy more fully than existing models do, and we are the first to model the guiding role of titles and abstracts to generate core representations. Experimental results on the large-scale dataset we produced show that our model outperforms the strong baseline models by a large margin on all evaluation metrics. However, without manual features our model can hardly learn the features of figures, tables, formulas, and references, and it fails to truly understand the contribution and innovation of a paper. With the development of document image understanding [28, 29], we plan to investigate combining the images and text in papers in the future to better learn all of a paper's features. We also hope that in subsequent work the model will be able to better understand and evaluate a paper's innovation and contribution.
ACKNOWLEDGEMENTS
This work is supported by the Hunan Provincial Natural Science Foundation Project (No. 2022JJ30668) and (No. 2022JJ30046).
REFERENCES
[1] R. E. Page, (2013) "Stories and social media: Identities and interaction", Routledge, New York, USA.
[2] Q. V. Dang and C. L. Ignat, (2016) "Measuring quality of collaboratively edited documents: The case of Wikipedia", in Proc. CIC, Pittsburgh, PA, USA, pp266-275.
[3] A. Shen, B. Salehi, J. Qi and T. Baldwin, (2020) "A multimodal approach to assessing document quality", Journal of Artificial Intelligence Research, Vol. 68, pp607-632.
[4] D. Kang, W. Ammar, B. D. Mishra, M. V. Zuylen, S. Kohlmeier et al., (2018) "A dataset of peer reviews (PeerRead): Collection, insights and NLP applications", in Proc. NAACL-HLT, New Orleans, pp1647-1661.
[5] M. Skorikov and S. Momen, (2020) "Machine learning approach to predicting the acceptance of academic papers", in Proc. IAICT, Bali, Indonesia, pp113-117.
[6] P. Vincent-Lamarre and V. Larivière, (2021) "Textual analysis of artificial intelligence manuscripts reveals features associated with peer review outcome", Quantitative Science Studies, Vol. 2, No. 2, pp662-677.
[7] K. Wang and X. Wan, (2018) "Sentiment analysis of peer review texts for scholarly papers", in Proc. SIGIR, Ann Arbor, MI, USA, pp175-184.
[8] P. Yang, X. Sun, W. Li and S. Ma, (2018) "Automatic academic paper rating based on modularized hierarchical convolutional neural network", in Proc. ACL, Melbourne, Australia, pp496-502.
[9] T. Ghosal, R. Verma, A. Ekbal and P. Bhattacharyya, (2019) "DeepSentiPeer: Harnessing sentiment in review texts to recommend peer review decisions", in Proc. ACL, Florence, Italy, pp1120-1130.
[10] Z. Deng, H. Peng, C. Xia, J. Li, L. He and P. S. Yu, (2020) "Hierarchical bi-directional self-attention networks for paper review rating recommendation", in Proc. COLING, Barcelona, Spain, pp6302-6314.
[11] F. Qiao, L. Xu and X. Han, (2018) "Modularized and attention-based recurrent convolutional neural network for automatic academic paper aspect scoring", in Proc. WISA, Taiyuan, China, pp68-76.
[12] A. Shen, B. Salehi, T. Baldwin and J. Qi, (2019) "A joint model for multimodal document quality assessment", in Proc. JCDL, Champaign, IL, USA, pp107-110.
[13] Y. Leng, L. Yu and J. Xiong, (2019) "DeepReviewer: Collaborative grammar and innovation neural network for automatic paper review", in Proc. ICMI, Suzhou, China, pp395-403.
[14] G. M. de Buy Wenniger, T. van Dongen, E. Aedmaa, H. T. Kruitbosch, E. A. Valentijn et al., (2020) "Structure-tags improve text classification for scholarly document quality prediction", in Proc. First Workshop on Scholarly Document Processing, Online, pp158-167.
[15] Y. Lu, J. Luo, Y. Xiao and H. Zhu, (2021) "Text representation model of scientific papers based on fusing multi-viewpoint information and its quality assessment", Scientometrics, Vol. 126, No. 8, pp6937-6963.
[16] I. Beltagy, K. Lo and A. Cohan, (2019) "SciBERT: A pretrained language model for scientific text", in Proc. EMNLP-IJCNLP, Hong Kong, China, pp3615-3620.
[17] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan and C. Zhang, (2018) "DiSAN: Directional self-attention network for RNN/CNN-free language understanding", in Proc. AAAI, Louisiana, USA, pp5446-5455.
[18] A. Shen, J. Qi and T. Baldwin, (2017) "A hybrid model for quality assessment of Wikipedia articles", in Proc. Australasian Language Technology Association Workshop, Brisbane, Australia, pp43-52.
[19] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola et al., (2016) "Hierarchical attention networks for document classification", in Proc. NAACL-HLT, San Diego, California, pp1480-1489.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., (2017) "Attention is all you need", in Proc. NIPS, Long Beach, CA, USA, pp5998-6008.
[21] K. He, X. Zhang, S. Ren and J. Sun, (2016) "Deep residual learning for image recognition", in Proc. CVPR, Las Vegas, NV, USA, pp770-778.
[22] J. Devlin, M. W. Chang, K. Lee and K. Toutanova, (2019) "BERT: Pre-training of deep bidirectional transformers for language understanding", in Proc. NAACL-HLT, Minneapolis, Minnesota, pp4171-4186.
[23] Y. Kim, (2014) "Convolutional neural networks for sentence classification", in Proc. EMNLP, Doha, Qatar, pp1746-1751.
[24] S. Lai, L. Xu, K. Liu and J. Zhao, (2015) "Recurrent convolutional neural networks for text classification", in Proc. AAAI, Austin, Texas, USA, pp2267-2273.
[25] R. Johnson and T. Zhang, (2017) "Deep pyramid convolutional neural networks for text categorization", in Proc. ACL, Vancouver, Canada, pp562-570.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei et al., (2019) "Language models are unsupervised multitask learners", OpenAI blog, Vol. 1, No. 8, pp9.
[27] K. Luu, X. Wu, R. Koncel-Kedziorski, K. Lo, I. Cachola et al., (2021) "Explaining relationships between scientific documents", in Proc. ACL-IJCNLP, Online, pp2130-2144.
[28] J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang et al., (2022) "DiT: Self-supervised pre-training for document image transformer", in Proc. ACM MM, Lisboa, Portugal, pp3530-3539.
[29] Y. Huang, T. Lv, L. Cui, Y. Lu and F. Wei, (2022) "LayoutLMv3: Pre-training for document AI with unified text and image masking", in Proc. ACM MM, Lisboa, Portugal, pp4083-4091.
© 2023 By AIRCC Publishing Corporation. This article is published under the Creative Commons
Attribution (CC BY) license.