CN101887460A - A Document Quality Evaluation Method and Its Application

Info

Publication number: CN101887460A
Application number: CN2010102263535A
Authority: CN
Inventors: 张铭; 封盛
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2010-07-14
Filing date: 2010-07-14
Publication date: 2010-11-17

Abstract

The invention provides a document quality evaluation algorithm for use on a document sharing platform. The algorithm comprises the following steps: constructing an academic network graph from the relationships among documents, journals/conferences, and authors; modeling the transition relationships among them to obtain a transition probability matrix; building a model from users' collection (bookmarking) of documents and computing a document quality value based on user analysis; and running an iterative random walk with restart on the graph to obtain document quality, journal/conference quality, and author academic reputation. The invention combines user behavior information with document quality evaluation for the first time, and when it gives document quality results it can also give results for authors' academic reputation and for the academic quality of journals and conferences. The ranking performance of this method is significantly better than that of other methods.

Description

A Document Quality Evaluation Method and Its Application

Technical Field

The invention relates to a document quality evaluation method, specifically to a document quality evaluation method on a document sharing platform, and belongs to the technical field of knowledge mining.

Background

In recent years, with the rapid development of scientific research, the rate at which scientific literature is published has increased year by year, and its volume is now very large. For example, CiteSeerX, a digital library covering only computer and information science, holds more than 1.5 million scientific documents. Researchers need to read and consult a large number of scientific documents in the course of their work, and high-quality and low-quality documents differ greatly in value to them. Finding the high-value documents within this large and uneven body of literature has become very difficult. As a result, the problem of how to evaluate the quality of scientific literature automatically and effectively has attracted a growing number of researchers.

On social document-sharing websites for academic research, users can collect (bookmark) scientific documents they consider valuable, tag them, comment on them, and share them with other users. Users' collection behavior should be an important reference when analyzing the quality of scientific literature, yet very few studies currently use user behavior for this purpose. How to apply user behavior effectively in a quality evaluation system for scientific literature in the Web 2.0 environment therefore merits further study.

Existing methods in academia for evaluating the quality of academic papers mainly comprise peer review, citation analysis, and methods based on link analysis. Peer review is usually used for early evaluation, such as a conference or journal reviewing submitted papers; citation analysis is used for later evaluation, such as assessing the academic level of a researcher's published papers.

Peer review is a comprehensive evaluation by experts and scholars in the same research field, covering the significance and novelty of the chosen topic, the research method, the quality of the completed research, and the standard of the writing. Its advantage is that expert evaluation of research quality is detailed and accurate: with their deep academic attainment in the field, experts can discern the level of a piece of research. Its disadvantages are that the current evaluation system is still imperfect, so that lax self-discipline among "peers" easily leads to abuses, and that peer review of a large number of academic papers is time-consuming and laborious, which makes it impractical.

Citation analysis evaluates the quality of papers from the citing and cited relationships among academic papers, using some specific method and evaluation criterion. Researchers in citation analysis have proposed a series of quantitative quality indicators, such as citation count and impact factor. Compared with peer review, citation analysis is simpler and easy to automate by computer. At the same time, its results are coarser, and because it must rely on the citation relationships between papers, newly published documents, which have few citations, tend to be rated too low. This is a significant limitation.

In 1998, Brin and Page proposed the PageRank algorithm, which ranks web pages by importance based on the link relationships between them, and founded the Google search engine on it. Kleinberg proposed another link analysis algorithm, HITS. Later, considering the link structure that citation relationships naturally form among scientific documents, many researchers applied the ideas of these methods to the problem of document quality evaluation.

Summary of the Invention

The purpose of the present invention is to model and analyze the relationships among documents, authors, and journals/conferences, and to use the relationship between user behavior and document quality in the Web 2.0 environment to assist in analyzing document quality. The invention unifies the two analysis approaches, peer review and citation analysis, within the framework of a random walk with restart and gives a final analysis result.

The scheme adopted by the present invention to solve its technical problem is as follows (the flow is shown in Fig. 1):

The present invention proposes a method for evaluating document quality, applied to a scientific literature sharing platform on which users can collect documents, add tags, comment, and share documents with other users, characterized in that the method comprises the following steps:

A. Construct a weighted directed graph, called the academic network graph, from the citation relationships between documents, the relationships of documents to journals/conferences and authors, and the publication time of the documents;

B. Quantify the citation relationships between documents and the relationships of documents to journals/conferences and authors as transition relationships between vertices of the graph, and model them to obtain the transition probability matrix on the academic network graph;

C. Build a model from users' collection of documents, taking the collection time into account, and use the HITS algorithm to compute a document quality value based on user analysis;

D. Using the models established in steps B and C, iterate a random walk with restart until the result converges, obtaining a probability value for every vertex of the academic network graph. These probability values are the document quality, journal/conference quality, and author academic reputation.

The method provided by the present invention can be used not only on scientific literature sharing platforms but also on paper sharing platforms or websites (where the documents are papers), picture sharing platforms or websites (where the documents are pictures), and the like.

Beneficial effects of the present invention:

The graph-based quality evaluation method for scientific literature proposed by the present invention combines user behavior information with document quality evaluation for the first time, and when it gives document quality results it can also give results for authors' academic reputation and for the academic quality of journals and conferences. If the invention is applied to a scientific literature retrieval website, sorting the results of a user's keyword search by quality value helps users find high-quality scientific documents faster and learn more quickly about high-quality journals and conferences and about authors of high academic reputation. Experiments show that the ranking performance of this method is significantly better than that of other methods.

Description of Drawings

Fig. 1 is the overall flowchart of the graph-based method for evaluating the quality of scientific literature according to the present invention;

Fig. 2 is an academic network graph constructed according to the present invention;

Fig. 3 is a diagram of the transition relationships between vertices of the academic network graph constructed according to the present invention;

Fig. 4 is a user-document collection relationship graph constructed according to the present invention.

Detailed Description

The present invention is described in further detail below with reference to the drawings and specific embodiments:

Step 1. Construct a weighted directed graph, called the academic network graph, from the citation relationships between documents, the relationships of documents to journals/conferences and authors, and the publication time of the documents.

The academic network graph designed and constructed by the present invention consists of three parts and models the relationships among three kinds of entity: documents, authors, and journals/conferences. The three parts are:

● Document citation subgraph $G_{dd} = (V_d, E_{dd})$,

$G_{dd}$ is a directed graph representing the citation relationships between documents, where $V_d$ is the set of document vertices, $E_{dd}$ is the edge set, and a directed edge $\langle d_i, d_j \rangle \in E_{dd}$ indicates that document $d_i$ cites document $d_j$;

● Author-document subgraph $G_{ad} = (V_a \cup V_d, E_{ad})$,

$G_{ad}$ is a bipartite graph representing the authorship relationships between authors and documents, where $V_a$ is the set of author vertices, $E_{ad}$ is the edge set, and an undirected edge $(a_i, d_j) \in E_{ad}$ indicates that author $a_i$ wrote document $d_j$;

● Journal/conference-document subgraph $G_{cd} = (V_c \cup V_d, E_{cd})$,

$G_{cd}$ is a bipartite graph representing the publication relationships between journals/conferences and documents, where $V_c$ is the set of journal and conference vertices, $E_{cd}$ is the edge set, and an undirected edge $(c_i, d_j) \in E_{cd}$ indicates that document $d_j$ was published in journal or conference $c_i$;

The combination of these three subgraphs is the academic network graph, as shown in Fig. 2.

Define the academic network graph as a directed graph $G = (V, E)$, where $V$ is the vertex set, $V = V_a \cup V_d \cup V_c$, and $E$ is the edge set, $E = E_{dd} \cup E_{ad} \cup E_{cd}$. Because the random walk must be carried out on a directed graph, every undirected edge of the author-document subgraph and the journal/conference-document subgraph is represented as two directed edges connecting its two vertices, for example: $(c_i, d_j) \rightarrow \langle c_i, d_j \rangle \cup \langle d_j, c_i \rangle$.
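The construction in Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `build_academic_graph`, the tagged-tuple vertex encoding, and the sample identifiers are all hypothetical, and edge weights are left out because they are defined in Step 2.

```python
from collections import defaultdict


def build_academic_graph(citations, authorship, venues):
    """Build the directed academic network graph G = (V, E) as an adjacency map.

    citations:  iterable of (citing_doc, cited_doc) pairs -> E_dd
    authorship: iterable of (author, doc) pairs           -> E_ad
    venues:     iterable of (venue, doc) pairs            -> E_cd

    Vertices are tagged tuples such as ("d", "p1") so that the vertex
    sets V_d, V_a and V_c stay disjoint (a hypothetical encoding).
    """
    edges = defaultdict(set)  # vertex -> set of successor vertices

    # Citation edges are already directed: <d_i, d_j> means d_i cites d_j.
    for citing, cited in citations:
        edges[("d", citing)].add(("d", cited))
        edges[("d", cited)]  # make sure the cited document exists as a vertex

    # Every undirected edge becomes two directed edges.
    for author, doc in authorship:
        edges[("a", author)].add(("d", doc))
        edges[("d", doc)].add(("a", author))
    for venue, doc in venues:
        edges[("c", venue)].add(("d", doc))
        edges[("d", doc)].add(("c", venue))

    return dict(edges)


graph = build_academic_graph(
    citations=[("p1", "p2")],
    authorship=[("alice", "p1"), ("bob", "p2")],
    venues=[("KDD", "p1")],
)
for vertex in sorted(graph):
    print(vertex, "->", sorted(graph[vertex]))
```

Keeping the vertex type in the key makes it straightforward to split the adjacency map back into the document, author, and venue blocks when the transition matrix is built.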

Step 2. Quantify the citation relationships between documents and the relationships of documents to journals/conferences and authors as transition relationships between vertices of the graph, and model them to obtain the transition probability matrix on the academic network graph.

Each vertex of the academic network graph $G$ represents an author, a document, or a journal/conference, so $G$ is a heterogeneous graph containing three different types of entity. The invention defines a different transition probability $\alpha$ for transitions between different types of vertex (entity), as shown in Fig. 3. These transition probability parameters are defined to satisfy:

$\alpha_{ad} = \alpha_{cd} = 1$

$\alpha_{da} + \alpha_{dc} + \alpha_{dd} = 1$

where $\alpha_{ad}$ is the transition probability from an author vertex to a document vertex, $\alpha_{cd}$ is the transition probability from a publication venue vertex to a document vertex, $\alpha_{da}$ is the transition probability from a document vertex to an author vertex, $\alpha_{dc}$ is the transition probability from a document vertex to a publication venue vertex, and $\alpha_{dd}$ is the transition probability from a document vertex to a document vertex.

Define $W(G)$ as the weighted adjacency matrix of $G$, whose entries are the weights of the relationships between vertices of the academic network graph. Following the definition of the academic network graph above, $W(G)$ can be decomposed into a series of submatrices, as shown in the following table. First, the invention assigns initial values to each submatrix to obtain the initial weighted adjacency matrix; then a weight scoring function is applied to the initial values to obtain the final weighted adjacency matrix; finally, the transition probability matrix is computed from the weighted adjacency matrix.

The initial definitions of these submatrices are given below:

● Weighted adjacency matrix from document vertices to document vertices

where $t(d)$ denotes the publication time of document $d$, and $\Gamma_{dd}(d_i)$ denotes the set of documents cited by document $d_i$.

● Weighted adjacency matrix from author vertices to document vertices

where $\Gamma_{ad}(a_i)$ denotes the set of documents published by author $a_i$, and author $a$ is the $k$-th author of document $d$.

● Weighted adjacency matrix from document vertices to author vertices: $W_{da}(j, i) = |\Gamma_{da}(d_j)| - k + 1$

where $\Gamma_{da}(d_j)$ denotes the set of authors of document $d_j$, and $k$ indicates that author $a_i$ is the $k$-th author of document $d_j$.
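As a worked illustration of the document-to-author weight $W_{da}(j, i) = |\Gamma_{da}(d_j)| - k + 1$, the sketch below computes the initial weights for a single document from its ordered author list. The function name `author_weights` and the sample author names are hypothetical; the weight scoring function and the normalization into probabilities come later in Step 2 and are not applied here.

```python
def author_weights(authors_in_order):
    """Initial weights W_da(j, i) = |Gamma_da(d_j)| - k + 1 for one document.

    authors_in_order lists the document's authors in byline order,
    so k is the 1-based position of each author.
    """
    n = len(authors_in_order)
    return {author: n - k + 1
            for k, author in enumerate(authors_in_order, start=1)}


weights = author_weights(["first", "second", "third"])
print(weights)  # {'first': 3, 'second': 2, 'third': 1}
```

With three authors the first author receives weight 3 and the last author weight 1, so earlier positions in the byline draw a larger share of the walk leaving that document.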

● Weighted adjacency matrix from document vertices to publication venue vertices

● Weighted adjacency matrix from publication venue vertices to document vertices

where $c_{ik}$ denotes a particular edition of conference $c_i$ or a particular volume of journal $c_i$, $\Gamma_{cd}(c_{im})$ denotes the set of documents published in $c_{im}$, and $t(c_{im})$ denotes the time (year) corresponding to $c_{im}$.

$\Gamma_{cd}(c_{ik}) = \{ d \mid t(d) = t(c_{ik}) \wedge d \in \Gamma_{cd}(c_i) \}$

Obviously, $\bigcup_{k} \Gamma_{cd}(c_{ik}) = \Gamma_{cd}(c_i)$, and $\forall k, l\ (k \neq l),\ t(c_{ik}) \neq t(c_{il})$.

Next, a weight scoring function $\Phi$ is applied to the initial weights in the matrix:

$W(i, j) = \Phi(W(i, j))$

The criterion for a suitable weight scoring function is that it should be monotonically increasing, with its rate of growth decreasing as the argument grows, that is, $\Phi'(x) > 0$ and $\Phi''(x) < 0$. In this method, $\Phi$ is taken as:

Next, the transition probability matrices of the three subgraphs are defined, and then the transition probability matrix of the whole academic network graph is computed.

● Document citation subgraph $G_{dd}$

The document-to-document transition probability matrix is:

$M_{dd} = (M_{dd}(i, j))_{i, j \in V_d}$

where

$M_{dd}(i, j) = P(d_j \mid d_i) = \frac{W_{dd}(i, j)}{\sum_{k} W_{dd}(i, k)}$

● Author-document subgraph $G_{ad}$

The author-to-document transition probability matrix is:

$M_{ad} = (M_{ad}(i, j))_{i \in V_a,\, j \in V_d}$

where

$M_{ad}(i, j) = P(d_j \mid a_i) = \frac{W_{ad}(i, j)}{\sum_{k} W_{ad}(i, k)}$

The document-to-author transition probability matrix is:

$M_{da} = (M_{da}(j, i))_{i \in V_a,\, j \in V_d}$

where

$M_{da}(j, i) = P(a_i \mid d_j) = \frac{W_{da}(j, i)}{\sum_{k} W_{da}(j, k)}$

● Journal/conference-document subgraph $G_{cd}$

The document-to-journal/conference transition probability matrix is:

$M_{dc} = (M_{dc}(i, j))_{i \in V_d,\, j \in V_c}$

where

$M_{dc}(i, j) = P(c_j \mid d_i) = \frac{W_{dc}(i, j)}{\sum_{k} W_{dc}(i, k)}$

The journal/conference-to-document transition probability matrix is:

$M_{cd} = (M_{cd}(j, i))_{i \in V_d,\, j \in V_c}$

where

$M_{cd}(j, i) = P(d_i \mid c_j) = \frac{W_{cd}(j, i)}{\sum_{k} W_{cd}(j, k)}$

From the transition probability matrices of the subgraphs, the transition probability matrix on the academic network graph is obtained:

$M(G) = (P(j \mid i))_{i, j \in V} = \begin{bmatrix} \alpha_{dd} M_{dd} & \alpha_{da} M_{da} & \alpha_{dc} M_{dc} \\ M_{ad} & 0 & 0 \\ M_{cd} & 0 & 0 \end{bmatrix}$
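The row normalization and block assembly above can be sketched with plain Python lists. This is an illustrative sketch only: the function names `row_normalize` and `assemble_transition_matrix`, the toy weights, and the chosen $\alpha$ values are assumptions, and the weight matrices are taken as already scored by $\Phi$. Vertices are ordered documents first, then authors, then venues, matching the block layout of $M(G)$.

```python
def row_normalize(W):
    """M(i, j) = W(i, j) / sum_k W(i, k); all-zero rows stay zero."""
    M = []
    for row in W:
        total = sum(row)
        M.append([w / total if total else 0.0 for w in row])
    return M


def assemble_transition_matrix(M_dd, M_da, M_dc, M_ad, M_cd, a_dd, a_da, a_dc):
    """Assemble M(G) from its blocks (documents, then authors, then venues)."""
    n_a, n_c = len(M_ad), len(M_cd)
    rows = []
    # Document rows: the three blocks are scaled by the alpha parameters.
    for r_dd, r_da, r_dc in zip(M_dd, M_da, M_dc):
        rows.append([a_dd * x for x in r_dd]
                    + [a_da * x for x in r_da]
                    + [a_dc * x for x in r_dc])
    # Author and venue rows: alpha_ad = alpha_cd = 1, so they only lead to documents.
    for r in list(M_ad) + list(M_cd):
        rows.append(list(r) + [0.0] * (n_a + n_c))
    return rows


# Toy graph: two documents (d0 cites d1), one shared author, one shared venue.
M = assemble_transition_matrix(
    M_dd=row_normalize([[0, 1], [0, 0]]),
    M_da=row_normalize([[1], [1]]),
    M_dc=row_normalize([[1], [1]]),
    M_ad=row_normalize([[1, 1]]),
    M_cd=row_normalize([[1, 1]]),
    a_dd=0.5, a_da=0.25, a_dc=0.25,
)
for row in M:
    print(row)
```

In this toy example the row of d1, which cites nothing, sums to 0.5 rather than 1. How such documents without outgoing citations should be handled is not stated in this text, so the sketch leaves those rows as they are.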

Step 3. Build a model from users' collection of documents, taking the collection time into account, and use the HITS algorithm to compute a document quality value based on user analysis.

The invention links documents and users through collection behavior to construct a user-document collection relationship graph, in which users and documents are the vertices and collection actions are the edges, as shown in Fig. 4. The invention defines the user-document collection system as $B = (U, D, T, R)$, where $U$ is the set of users, $D$ is the set of documents, $T$ is a set of time points, and $R$ is the set of collection relations. $(u, d, t) \in R$ indicates that user $u$ collected document $d$ at time $t$.

Define the quality vector of the document set as $q = (q_1, q_2, \ldots, q_m)$, where $m = |D|$, and the expertise vector of the user set as $e = (e_1, e_2, \ldots, e_n)$, where $n = |U|$. Define the adjacency matrix $A$ of the user-document collection relationship graph:

Computing the document quality values and the user expertise values amounts to repeating the following iteration until the result converges:

$q = e \times A$

$e = q \times A^T$
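The mutual-reinforcement iteration $q = e \times A$, $e = q \times A^T$ can be sketched as follows. Several things here are assumptions rather than the patent's definitions: $A$ is taken as a plain 0/1 users-by-documents matrix because the time-aware definition of $A$ is not reproduced in this text, the vectors are L2-normalized after each update (the usual HITS practice for keeping values bounded), and the names `hits_collection_scores`, `tol`, and `max_iter` are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def hits_collection_scores(A, tol=1e-9, max_iter=1000):
    """Iterate q = e x A and e = q x A^T until q stops changing.

    A[u][d] > 0 means user u collected document d (n users x m documents).
    Returns (q, e): document quality values and user expertise values.
    """
    n, m = len(A), len(A[0])
    e = [1.0] * n
    q = [0.0] * m
    for _ in range(max_iter):
        # A document scores highly if users with high expertise collected it.
        new_q = _normalize([sum(e[u] * A[u][d] for u in range(n)) for d in range(m)])
        # A user scores highly if they collected high-quality documents.
        new_e = _normalize([sum(new_q[d] * A[u][d] for d in range(m)) for u in range(n)])
        delta = max(abs(a - b) for a, b in zip(new_q, q))
        q, e = new_q, new_e
        if delta < tol:
            break
    return q, e


# Three users, three documents; document 1 is collected by every user.
A = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]
q, e = hits_collection_scores(A)
print("document quality:", q)
print("user expertise:  ", e)
```

In the toy matrix the document collected by all three users ends up with the highest quality value, and the users who collected two documents rank above the user who collected only one.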

Step 4. Using the models established in Step 2 and Step 3, iterate a random walk with restart until the result converges, obtaining a probability value for every vertex of the academic network graph. These probability values are the document quality, journal/conference quality, and author academic reputation.

Let $d$ be the document quality vector, $a$ the author academic reputation vector, and $c$ the journal/conference quality vector. The vectors for the three kinds of entity are concatenated into a single vector: $\pi = [d^T, a^T, c^T]^T$. The random walk with restart can be expressed by the following formula:

$\pi_{t+1} = c M^T \pi_t + (1 - c) Q, \quad 0 \le c \le 1$

$Q$ is constructed as follows:

$Q(i)$ is normalized so that $\sum_{i \in V} Q(i) = |V|$.

To test for convergence, the $\pi$ vectors from two consecutive iterations are subtracted; if the difference is less than $10^{-6}$, the iteration is judged to have converged. If the final vector is $\pi_n$, its entries are the document quality values, author academic reputation values, and journal/conference quality values.
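The iteration $\pi_{t+1} = c M^T \pi_t + (1 - c) Q$ can be sketched as below. This is a sketch under stated assumptions: the construction of $Q$ is not reproduced in this text, so $Q$ is simply passed in and rescaled to sum to $|V|$; the default `c=0.85`, the choice to start from $\pi_0 = Q$, and the use of the maximum absolute component difference as the convergence measure are all hypothetical choices, as is the function name `random_walk_with_restart`.

```python
def random_walk_with_restart(M, Q, c=0.85, tol=1e-6, max_iter=10000):
    """Iterate pi_{t+1} = c * M^T * pi_t + (1 - c) * Q until convergence.

    M: transition matrix as a list of rows, M[i][j] = P(j | i)
    Q: restart vector; rescaled here so that sum(Q) == |V|
    """
    n = len(M)
    scale = n / sum(Q)
    Q = [x * scale for x in Q]
    pi = list(Q)  # starting point (an assumption; the source does not fix it)
    for _ in range(max_iter):
        # (M^T pi)[j] = sum_i M[i][j] * pi[i]
        nxt = [c * sum(M[i][j] * pi[i] for i in range(n)) + (1 - c) * Q[j]
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Toy chain: vertices 1 and 2 both lead to vertex 0, which leads back to them.
M = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
pi = random_walk_with_restart(M, Q=[1.0, 1.0, 1.0])
print(pi)
```

When every row of `M` sums to 1 and $Q$ sums to $|V|$, each iteration keeps the total of $\pi$ at $|V|$, so the resulting values can be read as scores relative to an average of 1 per vertex.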

Performance Evaluation

The scientific literature quality evaluation method of the present invention assigns a quality score to documents, journals/conferences, and authors. The rankings produced from these scores are evaluated experimentally.

First, the document quality results are evaluated on documents from three fields: "Opinion Mining", "Topic Model", and "Social Network". The manual evaluation has assessors score the quality ranking results and combines those scores with the DCG (Discounted Cumulative Gain) measure. Assessors assign each document a score according to its quality; the higher a document's score, the nearer the top of the ranking it should appear. The results are then evaluated with DCG: the higher the DCG value, the better the ranking output by the algorithm matches actual needs. The DCG value is computed as:

$DCG_p = score_1 + \sum_{i=2}^{p} \frac{score_i}{\log_2 i}$

where $score_i$ is the score the assessor gave to the $i$-th item in the ranking.
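The DCG formula above translates directly into code. The sketch below is illustrative; the function name `dcg` and the sample scores are hypothetical and are not taken from the experiments.

```python
import math


def dcg(scores, p=None):
    """DCG_p = score_1 + sum_{i=2..p} score_i / log2(i).

    scores holds the assessor scores in ranked order (best-ranked first).
    """
    ranked = list(scores) if p is None else list(scores)[:p]
    if not ranked:
        return 0.0
    return ranked[0] + sum(s / math.log2(i)
                           for i, s in enumerate(ranked[1:], start=2))


print(dcg([3, 2]))                    # 5.0
print(dcg([3, 2, 1]) > dcg([1, 2, 3]))  # True
```

Because items further down the list are divided by a larger logarithm, a ranking that places the highest-scored documents first yields the larger DCG value, which is the property the evaluation relies on.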

For the evaluation of document quality, the following methods are used for comparison:

● The document part of the PageRank results

● The document part of the PopRank results

● The document part of the results of the Random Walk algorithm (RW) on the academic network graph

● Citation Count: the number of times a document is cited within the paper collection used in the experiment.

The evaluation results are as follows (for brevity, the method of the present invention is denoted RW+U):

| Method | Opinion Mining | Topic Model | Social Network |
|---|---|---|---|
| PageRank | 13.66547 | 15.78191 | 12.03448 |
| PopRank | 17.16117 | 17.33343 | 16.13546 |
| CitationCount | 17.40092 | 17.21429 | 14.63348 |
| RW | 17.68033 | 17.90367 | 16.81558 |
| RW+U | 18.16559 | 18.41081 | 17.28261 |

Second, the results of the author academic reputation evaluation are assessed. The procedure is the same as in the document quality experiment, and the methods used for comparison are:

● The author part of the PageRank results

● The author part of the PopRank results

● The author part of the results of the Random Walk algorithm (RW) on the academic network graph

● Publication Count: the total number of documents the author published in the field's document collection used in the experiment

● Citation Count: the total number of citations received by the documents the author published in the field's document collection used in the experiment

The evaluation results are as follows:

| Method | Opinion Mining | Topic Model | Social Network |
|---|---|---|---|
| PubNum | 12.79365 | 13.29129 | 10.64821 |
| CitationCount | 16.86091 | 14.12744 | 11.1679 |
| PageRank | 15.53911 | 14.77489 | 13.77117 |
| PopRank | 17.33779 | 15.87551 | 16.48075 |
| RW | 17.81661 | 16.5786 | 16.87568 |
| RW+U | 17.99627 | 16.61291 | 16.8852 |

Finally, the academic quality results for journals are evaluated. Because the impact factor is a journal quality measure widely used in academia, the reference standard for this evaluation is the result of a modified impact factor analysis. The modified impact factor is computed as:

$mIF_X = \frac{C}{D}$

where $D$ is the total number of documents published in journal $X$, and $C$ is the sum of the numbers of times those documents have been cited.
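As a small worked example of the modified impact factor $mIF_X = C / D$, the sketch below computes it from a list of per-document citation counts. The function name `modified_impact_factor`, the sample counts, and the convention of returning 0 for a journal with no documents are hypothetical.

```python
def modified_impact_factor(citation_counts):
    """mIF_X = C / D for one journal X.

    citation_counts holds the citation count of each document published
    in X, so D is its length and C is its sum.
    """
    d = len(citation_counts)
    if d == 0:
        return 0.0  # assumption: a journal with no documents scores zero
    return sum(citation_counts) / d


print(modified_impact_factor([10, 0, 5]))  # 5.0
```

Here three documents with 10, 0, and 5 citations give $C = 15$ and $D = 3$, so the journal's modified impact factor is 5.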

Journal evaluation is assessed by the precision of the top N results, which is calculated as follows:

The evaluation results are as follows:

| Method | P@50 | P@80 | P@100 |
|---|---|---|---|
| PageRank | 0.24 | 0.3375 | 0.4 |
| PopRank | 0.42 | 0.425 | 0.47 |
| RW | 0.42 | 0.425 | 0.47 |
| RW+U | 0.44 | 0.4375 | 0.48 |

The table above shows the distribution by year of the average document quality value in the results of several algorithms. The averages listed cover 1971 to 2009; each year's average is the sum of the quality values of the documents published that year divided by the number of documents published. As can be seen from the figure, the methods RW and RW+U of the present invention generally give new documents higher quality values than the other two methods, which shows that the method of the invention addresses the problem that traditional methods generally rate new documents too low.

It should be noted that the embodiments are published to aid further understanding of the present invention. Those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. For example, the invention can equally be applied to paper sharing platforms or websites (simply replacing documents with papers), picture sharing platforms or websites (simply replacing documents with pictures), and the like. The invention should therefore not be limited to what is disclosed in the embodiments, and the scope of protection claimed is that defined by the claims.

Claims

1. A method for evaluating the quality of documents, applied to a scientific literature sharing platform on which users can collect documents, add tags, comment, and share documents with other users, characterized by comprising the following steps:

A. constructing a weighted directed graph, called an academic network graph, from the citation relationships between documents, the relationships of documents to journals/conferences and authors, and the publication time of the documents;

B. quantifying the citation relationships between documents and the relationships of documents to journals/conferences and authors as transition relationships between vertices of the graph, and modeling them to obtain a transition probability matrix on the academic network graph;

C. building a model from users' collection of documents, taking the collection time into account, and computing a document quality value based on user analysis using the HITS algorithm;

D. according to the models established in steps B and C, iterating a random walk with restart until the result converges to obtain a probability value for each vertex of the academic network graph, wherein the probability values are the document quality, journal/conference quality, and author academic reputation.

2. The method of claim 1, wherein the academic network graph in step A is composed of three subgraphs:

● Document citation subgraph G_dd = (V_d, E_dd),

G_dd is a directed graph representing the citation relationships between documents, where V_d is the set of document vertices and E_dd is the edge set; a directed edge <d_i, d_j> ∈ E_dd indicates that document d_i cites document d_j;

● Author-document subgraph G_ad = (V_a ∪ V_d, E_ad),

G_ad is a bipartite graph representing the authorship relationship between authors and documents, where V_a is the set of author vertices and E_ad is the edge set; an undirected edge (a_i, d_j) ∈ E_ad indicates that author a_i wrote document d_j;

● Journal/conference-document subgraph G_cd = (V_c ∪ V_d, E_cd),

G_cd is a bipartite graph representing the publication relationship between journals/conferences and documents, where V_c is the set of journal and conference vertices and E_cd is the edge set; an undirected edge (c_i, d_j) ∈ E_cd indicates that document d_j was published in journal or conference c_i; the academic network graph is a directed graph G = (V, E), where the vertex set V = V_a ∪ V_d ∪ V_c and the edge set E = E_dd ∪ E_ad ∪ E_cd; each undirected edge in the author-document subgraph G_ad and the journal/conference-document subgraph G_cd is replaced with two directed edges connecting the two vertices of that edge.
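The construction in claim 2 can be sketched as follows. This is a minimal illustration, assuming vertices are represented as tagged tuples such as `("d", "p1")` so that V_d, V_a, and V_c stay disjoint; the function name `build_academic_graph` and the sample identifiers are hypothetical, and edge weights are left out.

```python
def build_academic_graph(citations, authorships, publications):
    """Return (vertices, directed_edges) for the academic network graph.

    citations:    (citing_doc, cited_doc) pairs -> E_dd, already directed
    authorships:  (author, doc) pairs           -> E_ad, undirected
    publications: (venue, doc) pairs            -> E_cd, undirected
    """
    edges = set()
    for src, dst in citations:
        edges.add((("d", src), ("d", dst)))
    # Each undirected edge becomes two directed edges, one per direction.
    for author, doc in authorships:
        edges.add((("a", author), ("d", doc)))
        edges.add((("d", doc), ("a", author)))
    for venue, doc in publications:
        edges.add((("c", venue), ("d", doc)))
        edges.add((("d", doc), ("c", venue)))
    vertices = {v for edge in edges for v in edge}
    return vertices, edges

vertices, edges = build_academic_graph(
    citations=[("p2", "p1")],
    authorships=[("alice", "p1"), ("alice", "p2")],
    publications=[("conf", "p1"), ("conf", "p2")],
)
```

In this toy input, only the citation edge is one-directional; every authorship and publication relationship appears in both directions.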

3. The method of claim 2, wherein step B is implemented by:

B1. defining different transition probabilities α for transitions between different types of vertices,

α_ad = α_cd = 1

α_da + α_dc + α_dd = 1

where α_ad is the transition probability from an author vertex to a document vertex, α_cd is the transition probability from a publication venue vertex to a document vertex, α_da is the transition probability from a document vertex to an author vertex, α_dc is the transition probability from a document vertex to a publication venue vertex, and α_dd is the transition probability from a document vertex to a document vertex;

B2. defining a weighted adjacency matrix W(G) of the graph G, corresponding to the weights of the relationships between different vertices in the academic network graph, and decomposing W(G), according to the definition of the academic network graph, into a series of sub-matrices W_dd, W_ad, W_da, W_dc, W_cd, where W_dd is the weighted adjacency matrix from document vertices to document vertices, W_ad is the weighted adjacency matrix from author vertices to document vertices, W_da is the weighted adjacency matrix from document vertices to author vertices, W_dc is the weighted adjacency matrix from document vertices to publication venue vertices, and W_cd is the weighted adjacency matrix from publication venue vertices to document vertices;

B3. assigning initial values to each sub-matrix to obtain the initial weighted adjacency matrix:

where t(d) denotes the publication time of document d, and Γ_dd(d_i) denotes the set of documents cited by document d_i;

where Γ_ad(a_i) denotes the set of documents published by author a_i, and author a is the k-th author of document d;

c) W_da(j, i) = |Γ_da(d_j)| - k + 1

where Γ_da(d_j) denotes the set of authors of document d_j, and k indicates that author a_i is the k-th author of document d_j;

where c_im denotes a particular session of conference c_i or a particular volume of journal c_i, Γ_cd(c_im) denotes the set of documents published in c_im, and t(c_im) denotes the time corresponding to c_im;

B4. applying a weight scoring function to the initial values of the matrix to obtain the final weighted adjacency matrix;

B5. computing the transition probability matrix from the weighted adjacency matrix.
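Item c) of step B3 gives each author of a document an initial weight that decreases with byline position: W_da(j, i) = |Γ_da(d_j)| - k + 1. A minimal sketch follows, assuming each document's authors are available as an ordered list; the function name `author_weights` and the sample names are hypothetical, and the weight scoring function of step B4 is not applied here.

```python
def author_weights(author_lists):
    """author_lists: {doc_id: [authors in byline order]}.
    Returns {(doc_id, author): initial weight W_da}, where the k-th author
    (1-based) of a document with n authors gets weight n - k + 1."""
    weights = {}
    for doc, authors in author_lists.items():
        n = len(authors)  # |Gamma_da(d_j)|, the number of authors
        for k, author in enumerate(authors, start=1):
            weights[(doc, author)] = n - k + 1
    return weights

w_da = author_weights({"p1": ["alice", "bob", "carol"]})
```

For a three-author document the first author receives weight 3 and the last author weight 1, so earlier byline positions carry more of the document-to-author transition.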

4. The method of claim 3, wherein the weight scoring function φ adopted in step B4 is a monotonically increasing function whose rate of increase gradually decreases as the value of the independent variable increases, that is: φ'(x) > 0 and φ''(x) < 0, in the method taking

5. The method of claim 4, wherein the step B5 is implemented by:

i. defining the transition probability matrices of the three subgraphs

Document citation subgraph G_dd

Document-to-document transition probability matrix

where

Author-document subgraph G_ad

Author-to-document transition probability matrix

where

Document-to-author transition probability matrix

where

Journal/conference-document subgraph G_cd

Document-to-journal/conference transition probability matrix

where

Journal/conference-to-document transition probability matrix

where

ii. obtaining the transition probability matrix on the academic network graph from the transition probability matrices of the subgraphs:

<math><mrow><mi>M</mi><mrow><mo>(</mo><mi>G</mi><mo>)</mo></mrow><mo>=</mo><msub><mrow><mo>(</mo><mi>P</mi><mrow><mo>(</mo><mi>j</mi><mo>|</mo><mi>i</mi><mo>)</mo></mrow><mo>)</mo></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi><mo>&Element;</mo><mi>V</mi></mrow></msub><mo>=</mo><mfenced open='[' close=']'><mtable><mtr><mtd><msub><mi>α</mi><mi>dd</mi></msub><msub><mi>M</mi><mi>dd</mi></msub></mtd><mtd><msub><mi>α</mi><mi>da</mi></msub><msub><mi>M</mi><mi>da</mi></msub></mtd><mtd><msub><mi>α</mi><mi>dc</mi></msub><msub><mi>M</mi><mi>dc</mi></msub></mtd></mtr><mtr><mtd><msub><mi>M</mi><mi>ad</mi></msub></mtd><mtd><mn>0</mn></mtd><mtd><mn>0</mn></mtd></mtr><mtr><mtd><msub><mi>M</mi><mi>cd</mi></msub></mtd><mtd><mn>0</mn></mtd><mtd><mn>0</mn></mtd></mtr></mtable></mfenced><mo>.</mo></mrow></math>
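The block structure of M(G) above can be sketched in plain Python. This is an illustration only: it assumes each sub-matrix (M_dd, M_da, M_dc, M_ad, M_cd) is already row-stochastic and is given as a list of rows, with vertices ordered documents, then authors, then venues. The helper names and the numeric values are hypothetical, and documents with no outgoing citations are not handled.

```python
def scale(matrix, factor):
    return [[factor * x for x in row] for row in matrix]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def hstack(*blocks):
    # Concatenate blocks with equal row counts side by side.
    return [sum((list(b[r]) for b in blocks), []) for r in range(len(blocks[0]))]

def assemble_transition_matrix(M_dd, M_da, M_dc, M_ad, M_cd, a_dd, a_da, a_dc):
    """Build M(G) with block rows for documents, authors, and venues."""
    n_a, n_c = len(M_ad), len(M_cd)
    doc_rows = hstack(scale(M_dd, a_dd), scale(M_da, a_da), scale(M_dc, a_dc))
    author_rows = hstack(M_ad, zeros(n_a, n_a), zeros(n_a, n_c))
    venue_rows = hstack(M_cd, zeros(n_c, n_a), zeros(n_c, n_c))
    return doc_rows + author_rows + venue_rows

# Toy graph: two documents, one author, one venue (hypothetical values).
M = assemble_transition_matrix(
    M_dd=[[0, 1], [1, 0]],
    M_da=[[1], [1]],
    M_dc=[[1], [1]],
    M_ad=[[0.5, 0.5]],
    M_cd=[[0.5, 0.5]],
    a_dd=0.5, a_da=0.25, a_dc=0.25,
)
row_sums = [sum(row) for row in M]
```

Because α_da + α_dc + α_dd = 1 and α_ad = α_cd = 1, every row of the assembled matrix sums to 1 whenever the sub-matrices are row-stochastic.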

6. The method of claim 5, wherein step C is implemented by:

C1. constructing a user-document collection relationship graph,

whose vertices are users and documents and whose edges are collection behaviors; defining a user-document collection system as B = (U, D, T, R), where U is the set of users, D is the set of documents, T is a set of time points, and R is the set of collection relations, (u, d, t) ∈ R indicating that user u collected document d at time t;

C2. defining an adjacency matrix A of the user-document collection graph,

first, defining a quality value vector for the document set, q = (q_1, q_2, ..., q_m), where m = |D|; defining an expertise vector for the user set, e = (e_1, e_2, ..., e_n), where n = |U|; then the adjacency matrix of the user-document collection graph

C3. calculating the document quality values and the user expertise by repeating the following iterative process until the result converges

q = e × A

e = q × A^T.
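The iteration in step C3 can be sketched as below. Two points are assumptions rather than statements of the claimed method: the vectors are rescaled to unit length after each update (the usual HITS normalization, which the claim text does not spell out), and the sample matrix uses 0/1 entries, whereas the claimed adjacency matrix also takes collection time into account. The name `hits_quality`, the tolerance, and the sample data are hypothetical.

```python
import math

def hits_quality(A, tol=1e-9, max_iter=1000):
    """A[u][d] is the weight of user u's collection of document d (n x m).
    Returns (q, e): document quality values and user expertise values."""
    n, m = len(A), len(A[0])
    e = [1.0] * n
    q = [0.0] * m
    for _ in range(max_iter):
        # q = e x A
        new_q = [sum(e[u] * A[u][d] for u in range(n)) for d in range(m)]
        norm = math.sqrt(sum(x * x for x in new_q)) or 1.0
        new_q = [x / norm for x in new_q]
        # e = q x A^T
        new_e = [sum(new_q[d] * A[u][d] for d in range(m)) for u in range(n)]
        norm = math.sqrt(sum(x * x for x in new_e)) or 1.0
        new_e = [x / norm for x in new_e]
        delta = max(abs(a - b) for a, b in zip(new_q + new_e, q + e))
        q, e = new_q, new_e
        if delta < tol:
            break
    return q, e

# User 0 collected both documents; user 1 collected only document 0.
q, e = hits_quality([[1, 1], [1, 0]])
```

In this toy case document 0, collected by both users, ends up with the higher quality value, and user 0, who collected more of the well-collected documents, ends up with the higher expertise value.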

7. The method of claim 6, wherein the step D is implemented by:

D1. letting d be the document quality value vector, a be the author academic reputation vector, and c be the journal/conference quality value vector, and concatenating the vectors of the three entities into a single vector π = [d^T, a^T, c^T]^T;

D2. using the random walk with restart algorithm, iterating with the formula π_{t+1} = c M^T π_t + (1 - c) Q, 0 ≤ c ≤ 1, where

Q(i) is normalized such that

<math><mrow><munder><mi>Σ</mi><mrow><mi>i</mi><mo>&Element;</mo><mi>V</mi></mrow></munder><mi>Q</mi><mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow><mo>=</mo><mo>|</mo><mi>V</mi><mo>|</mo><mo>;</mo></mrow></math>

D3. subtracting the π vectors obtained in two adjacent iterations and, if the difference is less than 10^-6, judging that the iteration has converged; denoting the resulting vector as π_n, the values therein are the document quality values, the author academic reputation values, and the journal/conference quality values.
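Steps D2 and D3 can be illustrated with the standard random walk with restart update, π_{t+1} = c M^T π_t + (1 - c) Q, which is assumed here. The sketch also assumes that M is row-stochastic, that the "difference" in D3 is measured as the largest absolute component change, and that c = 0.85; the restart vector values and the function name are hypothetical. Here c is the scalar walk probability, not the journal/conference vector of step D1.

```python
def random_walk_with_restart(M, Q, c=0.85, tol=1e-6, max_iter=10000):
    """M: row-stochastic transition matrix (list of rows).
    Q: restart vector, rescaled below so that sum(Q) == |V|."""
    n = len(M)
    total = sum(Q)
    Q = [x * n / total for x in Q]
    pi = list(Q)
    for _ in range(max_iter):
        # Component j of c * M^T * pi + (1 - c) * Q.
        new_pi = [
            c * sum(M[i][j] * pi[i] for i in range(n)) + (1 - c) * Q[j]
            for j in range(n)
        ]
        diff = max(abs(x - y) for x, y in zip(new_pi, pi))
        pi = new_pi
        if diff < tol:  # convergence test of step D3
            break
    return pi

# Toy graph: vertices 0 and 1 are documents, 2 an author, 3 a venue.
M = [
    [0.0, 0.5, 0.25, 0.25],
    [0.5, 0.0, 0.25, 0.25],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
pi = random_walk_with_restart(M, Q=[3.0, 1.0, 0.0, 0.0])
```

Since every row of M sums to 1 and Q sums to |V|, the total of π stays at |V| across iterations; in this toy case the document with the larger restart value ends with the larger score.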

8. Use of the method of claim 1 on a paper sharing platform or website, or on a picture sharing platform or website.