CN103631859B

CN103631859B - Intelligent review expert recommending method for science and technology projects

Info

Publication number: CN103631859B
Application number: CN201310509358.2A
Authority: CN
Inventors: 徐小良; 吴仁克; 林建海; 陈秋
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2013-10-24
Filing date: 2013-10-24
Publication date: 2017-01-11
Anticipated expiration: 2033-10-24
Also published as: CN103631859A

Abstract

The invention provides an intelligent method for recommending review experts for science and technology projects. The method comprises the following steps: (1) the main texts of the projects to be reviewed and of the expert information are split into substring sequences, the substring sequences are segmented with the Chinese Academy of Sciences ICTCLAS segmenter, and stop words are filtered from the segmentation result to obtain a word set; (2) a word network is built from the project information and feature words are extracted from statistical and aggregation characteristics, whereas expert information, being relatively concise, uses the word set from step (1) directly as its feature words; (3) a knowledge representation model is built from the fields and weights of the feature words, and an index of the related information is built; (4) for group recommendation, feature merging is performed on the knowledge representation model between fields and between projects; (5) the semantic similarity between each expert and the project or project group to be reviewed is calculated, a truncation threshold is applied, and the final list of recommended experts is generated. The method greatly alleviates the heavy workload of manual recommendation and the lack of scientific rigor in review decisions.

Description

An intelligent recommendation method for review experts for scientific and technological projects

Technical Field

The invention belongs to the technical field of expert recommendation, and in particular relates to a network-service-based method for intelligently recommending review experts for science and technology projects. It is an intelligent method that assists project approval decisions.

Background

As science and technology project management systems have spread rapidly through China's government departments, project review has moved from centralized meetings to an online mode, which removes the geographical constraints on experts. Review experts evaluate project applications according to their domain knowledge and the funding agency's criteria, and the funding agency decides whether to fund a project based on those evaluations.

At present, experts for science and technology projects are mostly chosen on the subjective judgment of project managers. A project usually needs several reviewers, so manual recommendation is inefficient, labor-intensive, and lacking in scientific rigor, and the experts selected are not necessarily the most suitable. Research on intelligent recommendation of review experts is therefore critical: it can effectively alleviate mismatches between experts and the content they review, and can greatly improve the public service capacity of project review work.

Current intelligent recommendation techniques, such as collaborative filtering and content-based recommendation, are mostly applied to film and product recommendation websites and have rarely been studied or applied to expert databases for project review. Because of the constraints of this domain, recommending experts for science and technology projects differs from general recommendation. First, project management systems span all industries, so the domain knowledge is very complex. Second, expert recommendation affects project funding, so the requirements for objectivity, impartiality, and accuracy are very high. China still lacks systematic methodological guidance and mature technical support in this area. Information texts are "semi-structured", however, so the content of expert information and of project information can be matched. The invention makes full use of structural features and word-level semantic information to calculate the similarity between a project and an expert. A high similarity indicates that the expert is familiar with the project, and a list of recommended experts is generated to review it. The invention also provides a decision support system (DSS) that recommends review experts for science and technology projects and assigns them to projects matching their domain knowledge. This assists the decision users in making sound decisions, improves the level and quality of decision-making, and makes reviews more scientific and objective.

Summary of the Invention

To address the deficiencies of the prior art, the invention provides an intelligent method for recommending review experts for science and technology projects.

The review expert recommendation process of the invention comprises the following steps:

Step 1. Common and customary words in project and expert information are collected as a domain-specific stop word list; punctuation marks and non-Chinese characters are collected as a segmentation marker library.

Step 2. Segment the project information and the expert information. Using the segmentation markers in the project information, fields such as the project name, main research content, and technical indicators are split into substring sequences. Using the segmentation markers in the expert information, fields such as the expert's profile, awards, inventions, published papers, projects undertaken and their completion status, and research directions are extracted and split into substring sequences. Each substring sequence corresponds to one field. The substring sequences are then segmented into words with the Chinese Academy of Sciences ICTCLAS segmenter.

Step 3. Extract feature words for the projects. Stop words are filtered from the segmentation result using both the general stop word list, which is the Harbin Institute of Technology (HIT) stop word list, and the domain-specific stop word list. The segmentation result with stop words removed forms a word set.

The domain-specific stop word list is built by a self-learning process that improves over time: word frequencies are counted continuously during segmentation, and a word whose probability of appearing in the texts exceeds a certain threshold is added to the stop word list.
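
The stop word filtering and the self-learning stop word list described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default threshold of 0.5, and the reading of "probability of appearing" as the fraction of texts that contain the word are all assumptions.

```python
from collections import Counter

def filter_stop_words(tokens, general_stop, domain_stop):
    """Drop tokens found in either stop list, keeping the original order."""
    return [t for t in tokens if t not in general_stop and t not in domain_stop]

def update_domain_stop_words(token_lists, domain_stop, threshold=0.5):
    """Add a word to the domain stop list when the fraction of texts that
    contain it exceeds `threshold` (an assumed reading of "probability")."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per text
    n_docs = len(token_lists)
    for word, count in doc_freq.items():
        if n_docs and count / n_docs > threshold:
            domain_stop.add(word)
    return domain_stop
```

In practice the token lists would come from the ICTCLAS segmentation of step 2, and the general stop list would be loaded from the HIT stop word list.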

Project information is comparatively large in volume. Semantic similarity is calculated between the words of the word set, a word network is built from the semantic relations and co-occurrence relations between words, and the aggregation feature value of each word in the network is calculated. The aggregation feature value is then combined with the word's statistical feature value to calculate the word's key degree, from which the project's feature words are extracted. Because this combines the statistical and semantic features of the text, the feature words are extracted more accurately.

The semantic similarity is calculated as follows:

In the HowNet semantic dictionary, suppose word W₁ has n concepts S₁₁, S₁₂, ..., S₁ₙ and word W₂ has m concepts S₂₁, S₂₂, ..., S₂ₘ. The similarity Sim_SEM(W₁, W₂) of the two words is the maximum similarity over all pairs of their concepts:

$\mathrm{Sim}_{SEM}(W_1, W_2) = \max_{i=1,\dots,n;\ j=1,\dots,m} \mathrm{Sim}(S_{1i}, S_{2j})$

Content words and function words use different description languages, so the similarity between their corresponding syntactic sememes or relational sememes must be calculated. A content-word concept is described by four parts: the first basic sememe, the other basic sememes, the relational sememe description, and the relational symbol description. Their similarities are denoted Sim₁(p₁, p₂), Sim₂(p₁, p₂), Sim₃(p₁, p₂), and Sim₄(p₁, p₂). The similarity of two feature structures ultimately reduces to the similarity of basic sememes or specific words.

$\mathrm{Sim}_{4}(S_1, S_2) = \sum_{i=1}^{4} \beta_i \, \mathrm{Sim}_i(S_1, S_2)$

Here β_i (1 ≤ i ≤ 4) are adjustable parameters with β₁ + β₂ + β₃ + β₄ = 1 and β₁ ≥ β₂ ≥ β₃ ≥ β₄.
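
As a rough illustration of the two formulas above, the sketch below combines four partial similarities with weights β_i and takes the maximum over concept pairs. The β values are placeholders chosen only to satisfy the stated constraints, and `sim` stands in for a HowNet-based concept similarity function that is not defined here; both are assumptions.

```python
from itertools import product

def concept_similarity(part_sims, betas=(0.5, 0.2, 0.17, 0.13)):
    """Weighted sum of the four partial similarities Sim_1..Sim_4.
    The beta values are illustrative placeholders, not values from the text."""
    assert abs(sum(betas) - 1.0) < 1e-9
    assert all(a >= b for a, b in zip(betas, betas[1:]))
    return sum(b * s for b, s in zip(betas, part_sims))

def word_similarity(concepts1, concepts2, sim):
    """Maximum of sim(s1, s2) over all concept pairs of the two words.
    Returns 0.0 when either word has no concepts (a convention chosen here)."""
    return max((sim(s1, s2) for s1, s2 in product(concepts1, concepts2)),
               default=0.0)
```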

Let CW = {C₁, C₂, ..., C_m} be the word set obtained after processing. Its semantic similarity adjacency matrix S_m is the m × m matrix whose (i, j) entry is Sim(C_i, C_j), the semantic similarity between words C_i and C_j, with Sim(C_i, C_i) = 1 and Sim(C_i, C_j) = Sim(C_j, C_i).

Computing the semantic similarity over the word set CW = {C₁, C₂, ..., C_m} yields m × (1 + m)/2 inter-word similarity values.

The word co-occurrence relation is calculated as follows:

The word co-occurrence model is one of the important models in statistical natural language processing. According to this model, if two words frequently co-occur in the same window unit of a document (such as a sentence or a paragraph), the two words are related in meaning and together express part of the text's semantic information. A sliding window of length 3 is used to calculate the co-occurrence degree of the words in the word sequence; the sliding window is shown in Figure 1.

First, words are extracted from the word sequence by removing spaces and null entries and merging identical words, which gives the word set CW = {C₁, C₂, ..., C_m}, where m ≤ n and n is the length of the word sequence.

The word co-occurrence matrix C_m corresponding to the word set CW is the m × m matrix whose (i, j) entry is Coo(C_i, C_j). Initially, Coo(C_i, C_j) = 0 for all 1 ≤ i, j ≤ m.

The co-occurrence degree is calculated by sliding the window over the word sequence. The words in the window are T_{i-1} T_i T_{i+1} (1 < i < n):

1) If i = n-1, go to 4). If T_{i-1} is a space or null, slide the window to the next word (i++). Otherwise, go to 2).

2) If T_i is Chinese, then Coo(T_{i-1}, T_i)++ and go to 3). If T_i is null, go to 3). Otherwise, go to 1).

3) If T_i is Chinese, then Coo(T_{i-1}, T_{i+1})++, i++, and go to 1). Otherwise, go to 1).

4) If T_{n-2} is Chinese, go to 5). Otherwise, go to 7).

5) If T_{n-1} is Chinese, then Coo(T_{n-2}, T_{n-1})++ and go to 6). If T_{n-1} is a space, go to 6). Otherwise, end.

6) If T_n is Chinese, then Coo(T_{n-2}, T_n)++ and end. Otherwise, end.

7) If T_{n-1} and T_n are both Chinese, then Coo(T_{n-1}, T_n)++ and end. Otherwise, end.

After these steps, the word co-occurrence matrix C_m is obtained and each of its elements is normalized by dividing it by the largest element of the matrix, max{Coo(C_i, C_j) | 1 ≤ i, j ≤ m}.
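
The sliding-window counting and the normalization can be approximated with the simplified sketch below. It is not a line-by-line transcription of steps 1) to 7): it pairs the first word of each window with the other words in that window, treats `None` and blank strings as the "space or null" entries, and does not test whether a token is Chinese. These simplifications and the function name are assumptions.

```python
from collections import defaultdict

def cooccurrence(tokens, window=3):
    """Count how often the first word of each window co-occurs with the
    other words in that window, then divide by the largest count."""
    def is_word(t):
        return t is not None and t.strip() != ""

    counts = defaultdict(int)
    for i, first in enumerate(tokens):
        if not is_word(first):
            continue
        for other in tokens[i + 1:i + window]:
            if is_word(other) and other != first:
                counts[frozenset((first, other))] += 1  # unordered pair
    peak = max(counts.values(), default=0)
    return {pair: c / peak for pair, c in counts.items()} if peak else {}
```

Keys are unordered word pairs, which matches the symmetric use of the matrix in an undirected word network.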

The word network is constructed as follows:

To build the weighted word network, the weight matrix W_m of the network is obtained first. It combines the semantic similarity matrix S_m and the co-occurrence matrix C_m using the coefficients α and β, where α is 0.3 and β is 0.7, so that the semantic relations between words are strengthened and the co-occurrence relations are weakened.

W_m serves as the adjacency matrix of the word network, whose graph is defined as G = {V, E}. G is an undirected weighted graph, V is its vertex set, E is its edge set, and v_i denotes the i-th vertex (word) in V.
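
One plausible way to form the weight matrix is a linear blend of the two matrices. The defining formula is not available in this text, so the form W = α·Coo + β·Sim used below is an assumption based on the statement that β = 0.7 strengthens the semantic relation and α = 0.3 weakens co-occurrence. The helper names are hypothetical.

```python
def weight_matrix(sim, coo, alpha=0.3, beta=0.7):
    """Blend two m x m matrices (lists of lists) into edge weights.
    ASSUMPTION: alpha scales co-occurrence, beta scales semantic similarity."""
    m = len(sim)
    return [[alpha * coo[i][j] + beta * sim[i][j] for j in range(m)]
            for i in range(m)]

def edges(weights, eps=0.0):
    """Undirected edge list (i, j, w) for i < j whose weight exceeds eps."""
    m = len(weights)
    return [(i, j, weights[i][j])
            for i in range(m) for j in range(i + 1, m)
            if weights[i][j] > eps]
```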

The word aggregation feature value is calculated as follows:

Important properties of a word network include the degree distribution, the average shortest path, the clustering degree, and the clustering coefficient. The degree of a node reflects how it is connected to other nodes. The clustering degree and clustering coefficient of a node reflect how densely the nodes in its local neighborhood are interconnected. Together, a node's degree and clustering coefficient reflect its local importance. The invention calculates a node's aggregation feature value from its weighted degree, clustering coefficient, and betweenness, which gives important words higher weights and also ensures that words associated with many important words score highly.

In the word semantic similarity network graph, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j. The weighted degree of node v_i is defined as:

$WD_i = \sum_{j=1}^{n} w_{ij} / n$

where w_ij is the weight of the edge between nodes v_i and v_j, and n is the total number of nodes.

The unweighted degree of node v_i is D_i = |{(v_i, v_j) : (v_i, v_j) ∈ E, v_i, v_j ∈ V}|. The clustering degree T_i of node v_i is the number of edges that actually exist between its neighbors: T_i = |{(v_j, v_k) : (v_i, v_k) ∈ E, (v_j, v_k) ∈ E, v_i, v_j ∈ V}|. The clustering coefficient C_i of node v_i is then defined as:

$C_i = \frac{T_i}{\binom{D_i}{2}} = \frac{2 T_i}{D_i (D_i - 1)}$
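
The weighted degree and the clustering coefficient can be computed as in the sketch below. The data layouts (a list-of-lists weight matrix and a dict mapping each node to its neighbor set) are assumptions, as are the choices to skip the diagonal in the weighted degree and to return 0.0 when a node has fewer than two neighbors, a case the formula does not cover.

```python
from itertools import combinations

def weighted_degree(weights, i):
    """WD_i: sum of the weights on edges at node i, divided by the node count n.
    The diagonal entry is skipped because a node has no edge to itself."""
    n = len(weights)
    return sum(weights[i][j] for j in range(n) if j != i) / n

def clustering_coefficient(adj, i):
    """C_i = 2*T_i / (D_i*(D_i-1)), with T_i the number of edges that exist
    among the neighbours of node i. `adj` maps node -> set of neighbours."""
    neighbours = adj[i]
    d = len(neighbours)
    if d < 2:
        return 0.0  # convention chosen here for D_i < 2
    t = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2 * t / (d * (d - 1))
```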

In the word semantic similarity network graph, the betweenness of node v_i reflects the probability that a shortest path between nodes x and w passes through v_i. The degree of connection between two non-adjacent nodes depends on the nodes lying on the shortest paths between them, which potentially control the flow of information between those nodes. B_i reflects the interconnection of node v_i within its local environment. The node betweenness is defined as:

$B_i = \sum_{w \in G,\ x \in G} \frac{r_{v_i}(w, x)}{d(w, x)}$

where d(w, x) is the number of shortest paths between any two nodes w and x in the weighted word semantic similarity network graph, and r_{v_i}(w, x) is the number of those shortest paths that pass through v_i (v_i ∈ G).
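
A small sketch of the betweenness computation follows. For brevity it treats every edge as having unit length and uses breadth-first search to count shortest paths, whereas the text speaks of a weighted graph; a weighted variant would replace the search with Dijkstra's algorithm. It also sums over unordered pairs {w, x} that exclude v_i. These choices and the function names are assumptions.

```python
from collections import deque

def shortest_path_counts(adj, source):
    """BFS from source. Returns (dist, sigma), where sigma[v] is the number
    of shortest paths from source to v (edges treated as unit length)."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, i):
    """B_i: sum over unordered pairs {w, x} of r_i(w, x) / d(w, x)."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    dist_i, sigma_i = info[i]
    others = [v for v in adj if v != i]
    total = 0.0
    for a, w in enumerate(others):
        dist_w, sigma_w = info[w]
        for x in others[a + 1:]:
            # i lies on a shortest w-x path iff the distances add up exactly
            if (i in dist_w and x in dist_i and x in dist_w
                    and dist_w[i] + dist_i[x] == dist_w[x]):
                total += sigma_w[i] * sigma_i[x] / sigma_w[x]
    return total
```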

The average weighted degree, clustering coefficient, and betweenness of node v_i are combined by weighting into the node's aggregation feature value. The aggregation feature value Z_i of node v_i is defined as:

$Z_i = a \times WD_i + b \times C_i / \sum_{j=1}^{n} C_j + c \times B_i$

where a + b + c = 1.
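
The combination into Z_i is a direct weighted sum and can be written as below. The default coefficients are illustrative values that merely satisfy a + b + c = 1, and returning 0 for the clustering term when all coefficients are zero is a convention chosen here.

```python
def aggregation_scores(wd, cc, bt, a=0.4, b=0.3, c=0.3):
    """Z_i = a*WD_i + b*C_i/sum_j(C_j) + c*B_i for every node i.
    wd, cc, bt are equal-length lists; a, b, c are placeholder weights."""
    assert abs(a + b + c - 1.0) < 1e-9
    cc_total = sum(cc)
    return [a * w + b * (k / cc_total if cc_total else 0.0) + c * t
            for w, k, t in zip(wd, cc, bt)]
```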

The statistical feature value of a word is calculated as follows:

Word frequency is normalized with a nonlinear function. The word frequency weight TF_i of word W_i in the text is defined as:

$TF_i = \frac{f(W_i)}{\sum_{j=1}^{n} f(p_j)}$

where TF_i is the word frequency weight of word W_i, p_j is a word in the text, and f is the word frequency counting function.
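
The frequency weight above amounts to each word's count divided by the total count of all words, which can be sketched in a few lines. The function name is hypothetical.

```python
from collections import Counter

def term_frequency_weights(tokens):
    """TF_i = f(W_i) / sum_j f(p_j): each word's count over the total count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}
```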

In Chinese text, the words that identify the characteristics of a text are generally content words such as nouns, verbs, and adjectives. Function words such as interjections, prepositions, and conjunctions contribute almost nothing to determining the text category and strongly interfere with feature word extraction. A part-of-speech weight pos_i is therefore defined for word W_i in the text.

Longer words tend to carry more specific information, while shorter words usually have more abstract meanings. Feature words in these documents are mostly compound technical terms, which are long, have precise meanings, and reflect the text's topic well. Increasing the weight of long words helps distinguish terms and reflects their importance in the document more accurately.

A word length weight len_i is likewise defined for word W_i in the text.

For each word in the word sequence, the statistical feature value is

stats_i = A*TF_i + B*pos_i + C*len_i

where A + B + C = 1.

The key degree of word W_i is calculated as follows:

For each node in the weighted word network, the key degree Imp_i is defined as:

Imp_i = β*stats_i + (1-β)*Z_i

where 0 < β < 1.

The key degree values are sorted in descending order, a threshold γ (0 < γ < 1) is set, and the top q values are taken. The corresponding words serve as the feature words of the project; they fully reflect its topic and are its more important words.
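
The statistical value, the key degree, and the final selection can be sketched as below. The part-of-speech and length weights are taken as given inputs because their definitions are not available in this text. The coefficient defaults are placeholders, and `q` is passed in directly because the text does not state how it is derived from γ.

```python
def statistical_value(tf, pos, length, A=0.5, B=0.3, C=0.2):
    """stats_i = A*TF_i + B*pos_i + C*len_i, with placeholder A, B, C."""
    assert abs(A + B + C - 1.0) < 1e-9
    return A * tf + B * pos + C * length

def top_feature_words(stats, z, q, beta=0.5):
    """Rank words by Imp_i = beta*stats_i + (1-beta)*Z_i and keep the top q.
    `stats` and `z` are dicts mapping each word to its value."""
    imp = {w: beta * stats[w] + (1 - beta) * z[w] for w in stats}
    return sorted(imp, key=imp.get, reverse=True)[:q]
```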

Step 4. Extract feature words for the review experts. Expert information is smaller in volume than project information, so the network-based extraction that combines statistical and semantic features is not suitable for it. Instead, stop words are filtered directly using the general and domain-specific stop word lists to obtain each expert's feature word set. The general stop word list is again the HIT stop word list, and the domain-specific stop word list requires continuous manual maintenance.

Step 5. Build field-based knowledge representation models for the projects and the review experts. By extending the vector space model and the matter-element knowledge set model, a text representation model PRO = (id, F, WF, T, V) is built from the different fields of a project, where id is the identifier field in the project database; F is the set of field categories of the project; WF is the set of field weights; T is the set of feature words; and V is the set of words and weights corresponding to each field, that is, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij is the j-th feature word in the i-th field and f(v_ij) is the frequency of v_ij. Project information is represented in this form.

Similarly, a knowledge representation model TM = (id, F, WF, T, V) is built from the different fields of an expert's information, where id is the identifier field in the expert database; F is the set of field categories of the review expert; WF is the set of field weights; T is the set of feature words; and V is the set of feature words and weights corresponding to each field, that is, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij is the j-th feature word in the i-th field and f(v_ij) is the frequency with which v_ij occurs in that field. Review expert information is represented in this form.
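
One way to hold the tuple (id, F, WF, T, V) in code is a pair of small data classes, sketched below. The class names, the attribute layout, and the example field names are hypothetical; the same structure could serve for both PRO and TM.

```python
from dataclasses import dataclass, field

@dataclass
class FieldEntry:
    weight: float                              # WF: weight of this field
    terms: dict = field(default_factory=dict)  # V_i: feature word -> frequency

@dataclass
class KnowledgeModel:
    id: str                                     # identifier in the database
    fields: dict = field(default_factory=dict)  # F: field name -> FieldEntry

    def feature_words(self):
        """T: the set of feature words across all fields."""
        words = set()
        for entry in self.fields.values():
            words.update(entry.terms)
        return words
```

For example, a project with a "name" field and a "content" field would be built as `KnowledgeModel("P001", {"name": FieldEntry(0.6, {...}), "content": FieldEntry(0.4, {...})})`.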

Building the review expert index. Once the expert knowledge representation models are built, the information is indexed: first, the content items of one review expert are read from the expert database; a word semantic network is built from the segmentation result and the expert's feature words are extracted; an index is built with Apache Lucene according to the knowledge representation model; and the index is added to the index library of the corresponding category. This repeats until all review experts have been indexed.

Step 6. Depending on the number of projects, recommendation is either for a single project or for a group of (multiple) projects. Group recommendation applies both inter-field and inter-project feature merging to the project knowledge representation models from step 5, while single-project recommendation applies only inter-field feature merging. Inter-field feature merging is also applied to the expert knowledge representation models from step 5. The merged feature information is indexed with Apache Lucene according to the knowledge representation model. The project index is built at the time of recommendation.

Projects in a project application management system often need to be recommended in groups. The feature merging operations above preserve the differences in contribution to the similarity calculation that result from the field weights set in the knowledge representation model of step 5.

Feature merging for the projects and the review experts is performed through a logical XOR operation, as follows:

(1) Inter-field feature merging for a single project or a single review expert

Suppose the field feature word sets W'₁ and W'₂ are to be merged. The merging rule is defined as:

$W'_1 \oplus W'_2 = \{ \forall i, j,\ \{ word_{1i},\ \frac{f(word_{1i}) + f(word_{2j})}{2} \} \mid word_{1i} = word_{2j} \}$

where word_1i and word_2j are feature words.

Introducing field weights refines and extends this definition. For merging features between the fields of a review expert or a project, the rule becomes:

$W'_1 \oplus W'_2 = \{ \forall i, j,\ \{ word_{1i},\ \frac{w_1 \cdot f(word_{1i}) + w_2 \cdot f(word_{2j})}{\sqrt{w_1^2 + w_2^2}} \} \mid word_{1i} = word_{2j} \}$

(2) Inter-project feature merging for grouped projects

This merging operation applies only to the feature vectors of projects, not to those of review experts, which need only inter-field merging. Let V(d₁) and V(d₂) be the vector models of two projects after inter-field feature merging. For any t_1j ∈ V(d₁) and t_2j ∈ V(d₂), the terms are merged if t_1j and t_2j are identical. The rule is defined as:

$V(d_1) \oplus V(d_2) = \{ \langle t_k,\ w_k(p) = \frac{w_i(d_1) + w_j(d_2)}{2} \rangle \}$

where k = 1, ..., n, t_k is a feature term, and w_k(p) is the weight of t_k.
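
The two merging rules can be sketched as below, with each field or project represented as a dict from feature word to value. The rules are stated only for words that appear on both sides, so this sketch returns only those shared words; how words present on one side only are handled is not specified, and keeping or dropping them is a design decision left open here. The function names are hypothetical.

```python
from math import sqrt

def merge_fields(terms1, terms2, w1=1.0, w2=1.0):
    """Weighted inter-field merge for the words both fields share:
    (w1*f1 + w2*f2) / sqrt(w1^2 + w2^2)."""
    norm = sqrt(w1 ** 2 + w2 ** 2)
    return {t: (w1 * terms1[t] + w2 * terms2[t]) / norm
            for t in terms1.keys() & terms2.keys()}

def merge_projects(v1, v2):
    """Inter-project merge: average the weights of the terms both share."""
    return {t: (v1[t] + v2[t]) / 2 for t in v1.keys() & v2.keys()}
```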

The knowledge representation model for a project group is produced as follows:

a) Merge the inter-field features of each project to obtain its vector model V(d).

b) Merge the vector models of all projects in the group with the merging strategy described above, which yields a vector-space knowledge representation model for the project group:

V(p) = {<t₁, w₁(p)>, <t₂, w₂(p)>, ..., <t_n, w_n(p)>}

where k = 1, ..., n, t_k is a feature term of the project group, and w_k(p) is the weight of t_k.

Step 7. After the inter-field features of the expert and project knowledge representation models have been merged in step 6, let the expert information vector be P = {s₁, f(s₁), s₂, f(s₂), ..., s_n, f(s_n)} and the project (or project group) information vector be Q = {t₁, f(t₁), t₂, f(t₂), ..., t_n, f(t_n)}. The semantic similarity between the project (group) vector and each review expert is calculated with a maximum matching algorithm.

Step 8. A similarity cutoff is set, a recommendation index is generated from the similarity values, and the final list of recommended review experts is produced.
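
The cutoff in step 8 can be illustrated with a short ranking function. It assumes the similarities from step 7 are already available as a dict from expert identifier to score; the threshold default and the optional `top_k` limit are hypothetical parameters.

```python
def recommend(similarities, threshold=0.6, top_k=None):
    """Keep experts whose similarity reaches the threshold and rank them
    from most to least similar. `similarities` maps expert id -> score."""
    ranked = sorted(
        ((expert, s) for expert, s in similarities.items() if s >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k] if top_k is not None else ranked
```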

The beneficial effects of the invention are as follows:

Review experts for science and technology projects can be recommended more conveniently, intelligently, and accurately. The task of assigning review experts, which falls on the staff of the project application management system, is greatly reduced, and management costs are lowered. Review experts are well matched to the domain of the projects they review, which supports objective, impartial, and scientific reviews. The method provides automatic, efficient, and fair decision support and helps avoid approval misconduct such as favoritism through personal networks and the "Matthew effect".

Description of drawings

Fig. 1 shows the sliding window used for word co-occurrence calculation in the present invention.

Fig. 2 is a schematic diagram of the principle of the bipartite-graph-based maximum matching algorithm in the present invention.

Fig. 3 is a flow chart of the intelligent recommendation method for review experts for scientific and technological projects in the present invention.

Fig. 4 is a flow chart of feature word extraction from scientific and technological project information and review expert information in the present invention.

Fig. 5 is a flow chart of the construction of the review expert knowledge index database in the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the present invention or its application. Specific embodiments of the present invention are described in further detail below; all other embodiments obtained from these embodiments by those of ordinary skill in the art without creative work fall within the scope of protection of the present invention.

As shown in Fig. 3, the main idea of the recommendation method of the present invention is: (1) for the expert information and the information on the scientific and technological projects under review in the project application management system, the main text is split into substring sequences and segmented with ICTCLAS of the Chinese Academy of Sciences, and stop words are filtered from the segmentation result to obtain a word set; (2) project information includes the main research content, technical indicators and similar material and is therefore voluminous, so the invention builds a word network from the semantic relationships and co-occurrence relationships of words, calculates the aggregation feature value of each node in the network, combines it by weighting with the statistical feature value to obtain the key degree of each word, and extracts the feature words of each project; (3) expert information is more concise than project information and smaller in volume, so the filtered word set of each expert is used directly as the feature words; (4) field weights are set according to the differing importance of the fields of project and expert information, knowledge representation models for projects and for experts are built from the feature words obtained in (2) and (3), and an expert index database is built; (5) for group recommendation, inter-field and inter-project feature merging is performed on the knowledge representation models of the projects under review, whereas for a single project only inter-field feature merging is performed; inter-field feature merging is also performed on the expert knowledge representation models; (6) taking into account that words match fuzzily at the semantic level, the similarity between expert information and the information of the projects under review is calculated, and the final list of recommended experts is produced by truncation at a set threshold.

Step 3. Feature word extraction for scientific and technological projects: the segmented words are filtered for stop words using the general stop-word lexicon and the professional stop-word lexicon; the general stop-word lexicon uses the Harbin Institute of Technology stop-word list. The segmentation result with stop words removed is taken as a word set; see Fig. 4.

The word network is as follows:

$WD_i = \sum_{j=1}^{n} w_{ij} / n$

The average weighted degree, the clustering coefficient and the betweenness of node v_i are combined by weighting to measure the aggregation feature value of the node. The aggregation feature value Z_i of node v_i is defined as:

$Z_i = a \times WD_i + b \times C_i / \sum_{j=1}^{n} C_j + c \times B_i$

where a + b + c = 1.

stats_i = A*TF_i + B*pos_i + C*len_i

where A + B + C = 1.

Imp_i = β*stats_i + (1-β)*Z_i

where 0 < β < 1.
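The two weighted sums above can be sketched as follows. The coefficient values (A = 0.5, B = 0.3, C = 0.2, β = 0.5), the function names and the sample numbers are illustrative assumptions; the text only requires A + B + C = 1 and 0 < β < 1.

```python
def statistical_value(tf, pos, length, A=0.5, B=0.3, C=0.2):
    """stats_i = A*TF_i + B*pos_i + C*len_i, with A + B + C = 1."""
    if abs(A + B + C - 1.0) > 1e-9:
        raise ValueError("A + B + C must equal 1")
    return A * tf + B * pos + C * length


def key_degree(stats, z, beta=0.5):
    """Imp_i = beta*stats_i + (1 - beta)*Z_i, with 0 < beta < 1."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return {word: beta * stats[word] + (1.0 - beta) * z[word] for word in stats}


def top_feature_words(imp, q):
    """Return the q words with the largest key degree."""
    ranked = sorted(imp.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:q]]


# Hypothetical per-word inputs: (TF_i, pos_i, len_i) and aggregation value Z_i.
stats = {
    "sensor": statistical_value(0.4, 1.0, 0.5),
    "method": statistical_value(0.2, 0.5, 0.25),
}
z = {"sensor": 0.8, "method": 0.1}
imp = key_degree(stats, z, beta=0.5)
print(top_feature_words(imp, q=1))
```

A larger β favors the statistical features (frequency, part of speech, length); a smaller β favors the position of the word in the word network.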

Construction of the review expert information index database: once the knowledge representation models of the review experts have been built, the information is indexed into the database. First, the content items of one review expert are read from the expert database; a word semantic network is built from the word segmentation results and the feature words of that expert are extracted; an index is built with Apache Lucene according to the knowledge representation model; and the resulting index is added to the index database of the corresponding category. This is repeated until all review experts have been indexed; see Fig. 5.

The basic process for generating the knowledge representation model of a scientific and technological project group is as follows:

The semantic similarity between the vector of the project (group) under review and the review expert vector is calculated with the bipartite-graph-based maximum matching algorithm as follows:

Calculating semantic similarity with the maximum matching algorithm means obtaining the similarity of two texts through maximum matching on a bipartite graph. As shown in Fig. 2, each feature word of the project (group) vector is taken as a vertex of part X and each feature word of the review expert vector as a vertex of part Y, so that the problem is equivalent to finding a maximum-weight matching in a complete bipartite graph. The thick lines in Fig. 2 mark, for a feature word in part X, the feature word in part Y with which it has the greatest semantic similarity.

The semantic similarity between words is obtained from HowNet-based similarity calculation. The present invention uses the HowNet semantic dictionary and the maximum matching algorithm to calculate the semantic similarity between the project (group) under review and a review expert; the formula is:

$SimSEM(P, Q) = \left( \sum_{k=1}^{p} \sqrt{f(s_i) * f(t_j)} * SimSEM(s_i, t_j) \right) / \min(m, n)$

where s_i and t_j are the two word nodes joined by an edge of maximum semantic similarity SimSEM(s_i, t_j) (a thick line in Fig. 2); m and n are the numbers of feature words in the project vector and in the review expert vector, respectively; and p is the number of such maximum-similarity edges (thick lines in Fig. 2).
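A minimal sketch of this calculation is given below, under several assumptions. The matching is found by brute force over permutations, which is only practical for a handful of feature words (a production system would more likely use the Hungarian algorithm); the matching maximizes the sum of word similarities and the frequency factor is applied afterwards; and `toy_word_sim` is a hypothetical stand-in for the HowNet-based word similarity, which is not reproduced here.

```python
from itertools import permutations
from math import sqrt


def max_weight_matching(sim):
    """Brute-force maximum-weight matching on a complete bipartite graph.

    sim[i][j] is the weight between left vertex i and right vertex j.
    Returns a list of (i, j) pairs covering the smaller side.
    """
    m, n = len(sim), len(sim[0])
    if m <= n:
        best = max(permutations(range(n), m),
                   key=lambda cols: sum(sim[i][c] for i, c in enumerate(cols)))
        return list(enumerate(best))
    best = max(permutations(range(m), n),
               key=lambda rows: sum(sim[r][j] for j, r in enumerate(rows)))
    return [(r, j) for j, r in enumerate(best)]


def sim_sem(P, Q, word_sim):
    """SimSEM(P, Q) for two dicts mapping feature word -> frequency f."""
    if not P or not Q:
        return 0.0
    s_words, t_words = list(P), list(Q)
    sim = [[word_sim(s, t) for t in t_words] for s in s_words]
    total = 0.0
    for i, j in max_weight_matching(sim):
        s, t = s_words[i], t_words[j]
        total += sqrt(P[s] * Q[t]) * sim[i][j]
    return total / min(len(s_words), len(t_words))


def toy_word_sim(s, t):
    """Hypothetical similarity table standing in for HowNet."""
    if s == t:
        return 1.0
    table = {frozenset(("optics", "photonics")): 0.8}
    return table.get(frozenset((s, t)), 0.0)


expert = {"laser": 1.0, "optics": 1.0}
project = {"laser": 1.0, "photonics": 1.0, "finance": 1.0}
score = sim_sem(expert, project, toy_word_sim)
print(round(score, 3))
```

In this toy case "laser" pairs with "laser" and "optics" with "photonics", while "finance" stays unmatched; dividing by min(m, n) keeps the extra project word from lowering the score.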

The semantic similarity between a project (group) under review and the review expert information involves many factors, such as language, word semantics and word structure. It expresses how well the two match: a high similarity indicates a good match, meaning that the review expert is suited to reviewing the project (group).

The above is only a preferred embodiment of the present invention. It should be noted that, for intelligent machine recommendation of review experts for scientific and technological projects, a number of improvements and variations can be made without departing from the technical principle of the present invention, and such improvements and variations shall also fall within the scope of protection of the present invention.

Claims

1. A method for intelligent recommendation of review experts for scientific and technological projects, characterized in that the method comprises the following steps:

Step 1. Use the common and habitual words found in scientific and technological project information and expert information as a professional stop-word lexicon; use punctuation marks and non-Chinese characters as a segmentation mark library;

Step 2. Segment the scientific and technological project information and the expert information: according to the segmentation marks in the project information, split the project name, main research content and technical indicators into substring sequences; according to the segmentation marks in the review expert information, split the expert information, including awards, inventions, published papers, projects undertaken and their completion status, and research directions, into substring sequences, where each substring sequence is one field of information; segment the substring sequences into words with ICTCLAS of the Chinese Academy of Sciences;

Step 3. Feature word extraction for scientific and technological projects: filter stop words from the segmented words using the general stop-word lexicon and the professional stop-word lexicon, and take the segmentation result with stop words removed as a word set;

Building the professional stop-word lexicon is a self-learning process of continuous improvement: during word segmentation the frequency of each word is counted continuously, and a word whose probability of appearing in the text exceeds a given threshold is added to the stop-word lexicon;

Scientific and technological project information is voluminous. The semantic similarity between words is calculated over the word set, a word network is built from the semantic relationships and co-occurrence relationships of the words, and the aggregation feature value of each word in the network is calculated; this is then combined with the statistical feature value of each word to calculate the key degree of the word and extract the feature words of the project; because the extraction combines the statistical and semantic feature information of the text, the feature words are extracted more accurately;

Step 4. Feature word extraction for review experts: filter stop words using the general stop-word lexicon and the professional stop-word lexicon, and extract the feature word set of each expert;

Step 5. Construct the per-field knowledge representation models of scientific and technological projects and review experts: by extending the vector space model and the matter-element knowledge set model, the text representation model PRO = (id, F, WF, T, V) is established for the project information, where id is the identification field in the project database; F is the set of field categories of the project; WF is the set of field weights; T is the set of feature words; V is the set of feature words and their weights for each field, that is, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij is the j-th feature word in the i-th field and f(v_ij) is the frequency of v_ij in that field; the knowledge representation of the project information is as follows:

Similarly, the knowledge representation model TM = (id, F, WF, T, V) is established from the different fields of the expert information, where id is the identification field in the expert database; F is the set of field categories of the review expert; WF is the set of field weights; T is the set of feature words; V is the set of feature words and their weights for each field, that is, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij is the j-th feature word in the i-th field and f(v_ij) is the frequency of v_ij in that field; the knowledge representation of the review expert information is as follows:

Construction of the review expert information index database: once the knowledge representation models of the review experts have been built, the information is indexed into the database: first, the content items of one review expert are read from the expert database; a word semantic network is built from the word segmentation results and the feature words of that expert are extracted; an index is built with Apache Lucene according to the knowledge representation model; the resulting index is added to the index database of the corresponding category, and this is repeated until all review experts have been indexed;

Step 6. Depending on the number of projects, recommendation is divided into expert recommendation for a single project under review and expert recommendation for a group of projects under review; for group recommendation, both inter-field and inter-project feature merging are performed on the knowledge representation models of the projects under review from Step 5, whereas for a single project only inter-field feature merging is performed; at the same time, inter-field feature merging is performed on the knowledge representation models of the review experts from Step 5; the merged feature information is indexed with Apache Lucene according to the knowledge representation model; the index of the scientific and technological projects is built at recommendation time;

Projects under review in the scientific and technological project application management system often need experts recommended for them in groups. The feature merging operation above ensures that the differences in contribution to the similarity calculation that arise from the field weights set in the knowledge representation model of Step 5 are not eliminated;

Step 7. After the inter-field features of the knowledge representation models of the review experts and the scientific and technological projects have been merged in Step 6, let the review expert information vector be P = {s₁, f(s₁), s₂, f(s₂), ..., s_n, f(s_n)} and the scientific and technological project information vector be Q = {t₁, f(t₁), t₂, f(t₂), ..., t_n, f(t_n)}; the semantic similarity between the vector of the project under review and the review expert is calculated with the maximum matching algorithm;

Step 8. Set the similarity cutoff, generate the recommendation index according to the similarity, and generate the final recommendation review expert list.

2. The method for intelligent recommendation of review experts for scientific and technological projects according to claim 1, characterized in that the semantic similarity calculation process in Step 3 is as follows:

In the HowNet semantic dictionary, if word W₁ has n concepts S11, S12, ..., S1n and word W₂ has m concepts S21, S22, ..., S2m, the similarity SimSEM(W₁, W₂) of W₁ and W₂ is the maximum of the similarities between their concepts:

$SimSEM(W_1, W_2) = \max_{i=1,\dots,n;\ j=1,\dots,m} Sim(S_{1i}, S_{2j})$;
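The maximum over concept pairs can be sketched in a few lines. The function `toy_concept_sim` and its table are hypothetical placeholders for the HowNet concept similarity, which depends on sememe descriptions not reproduced here; returning 0.0 for a word with no concepts is also an assumption.

```python
def word_similarity(concepts_1, concepts_2, concept_sim):
    """SimSEM(W1, W2): the largest similarity over all concept pairs."""
    return max(
        (concept_sim(a, b) for a in concepts_1 for b in concepts_2),
        default=0.0,  # assumption: a word without concepts matches nothing
    )


def toy_concept_sim(a, b):
    """Hypothetical concept similarity used only for this example."""
    if a == b:
        return 1.0
    table = {frozenset(("fruit", "food")): 0.6}
    return table.get(frozenset((a, b)), 0.1)


# A word with the concepts "fruit" and "company" against one with "food" and "device".
best = word_similarity(["fruit", "company"], ["food", "device"], toy_concept_sim)
print(best)
```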

Content words and function words have different description languages, so the similarity between their corresponding syntactic sememes or relational sememes must be calculated; the concept of a content word is described by its first basic sememe, its other basic sememes, its relational sememe description and its relational symbol description, whose similarities are denoted Sim1(p₁, p₂), Sim2(p₁, p₂), Sim3(p₁, p₂) and Sim4(p₁, p₂) respectively; the similarity of two feature structures ultimately reduces to the similarity of basic sememes or of specific words;

$Sim_4(S_1, S_2) = \sum_{i=1}^{4} \beta_i Sim_i(S_1, S_2)$;

β_i (1 ≤ i ≤ 4) are adjustable parameters satisfying β₁ + β₂ + β₃ + β₄ = 1 and β₁ ≥ β₂ ≥ β₃ ≥ β₄;

Let CW = {C1, C2, ..., Cm} be the word set obtained after processing; its semantic similarity adjacency matrix S_m is defined as:

where Sim(C₁, C₂) is the semantic similarity between word C₁ and word C₂, Sim(C_i, C_i) = 1, and Sim(C_i, C_j) = Sim(C_j, C_i);

Because the matrix is symmetric, calculating the word semantic similarities for the word set CW = {C1, C2, ..., Cm} yields m×(1+m)/2 distinct word-to-word similarity values;

The co-occurrence relationship calculation process of the words is as follows:

The word co-occurrence model is one of the important models in statistics-based natural language processing research; according to this model, if two words frequently co-occur within the same window unit of a document, they are related in meaning and express the semantic information of the text to some extent; a sliding window is used to calculate the co-occurrence degree of words in the word sequence:

First, word extraction is performed on the word sequence, that is, removing spaces, nulls and merging the same words to obtain a word set CW={C1, C2, ..., Cm}, where m≤n;

The word co-occurrence matrix Cm corresponding to the word set CW is defined as:

Initially, every element Coo(C_i, C_j) of Cm is 0 (1 ≤ i, j ≤ m);

The word co-occurrence degree is calculated over the word sequence with a sliding window. The words in the sliding window are T_{i-1} T_i T_{i+1} (1 < i < n):

1) If i = n-1, go to 4); if T_{i-1} is a space or null, slide the window to the next word, i++; otherwise, go to 2);

2) If T_i is Chinese, then Coo(T_{i-1}, T_i)++, go to 3); if T_i is null, go to 3); otherwise, go to 1);

3) If T_i is Chinese, then Coo(T_{i-1}, T_{i+1})++, i++, go to 1); otherwise, go to 1);

4) If T_{n-2} is Chinese, go to 5); otherwise, go to 7);

5) If T_{n-1} is Chinese, Coo(T_{n-2}, T_{n-1})++, go to 6); if T_{n-1} is a space, go to 6); otherwise end;

6) If T_n is Chinese, Coo(T_{n-2}, T_n)++, end; otherwise end;

7) If T_{n-1} is Chinese and T_n is also Chinese, then Coo(T_{n-1}, T_n)++, end; otherwise end;

The steps above yield the word co-occurrence matrix Cm. Each element of Cm is then normalized by dividing it by the maximum element of the matrix, max{Coo(C_i, C_j) | 1 ≤ i, j ≤ m};
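The sliding-window count and the normalization can be sketched as follows. This is a simplified reading rather than a line-by-line transcription of steps 1) to 7): for every valid word it counts the next two positions of a three-word window, skips blanks, ignores a word paired with itself, and treats pairs as unordered. The `is_word` predicate is a hypothetical stand-in for the "is Chinese" test.

```python
from collections import defaultdict


def cooccurrence(tokens, is_word=lambda t: bool(t and t.strip())):
    """Normalized co-occurrence counts over a three-word sliding window.

    tokens: the word sequence, where "" or None marks a space/null slot.
    Returns a dict mapping an unordered word pair (sorted tuple) to a
    value in (0, 1], obtained by dividing each count by the largest count.
    """
    coo = defaultdict(float)
    n = len(tokens)
    for i in range(n):
        first = tokens[i]
        if not is_word(first):
            continue  # a blank never starts a window
        for offset in (1, 2):  # the other two slots of the window
            j = i + offset
            if j < n and is_word(tokens[j]) and tokens[j] != first:
                pair = tuple(sorted((first, tokens[j])))
                coo[pair] += 1
    peak = max(coo.values(), default=0)
    return {pair: count / peak for pair, count in coo.items()} if peak else {}


sequence = ["image", "recognition", "", "image", "recognition", "algorithm"]
matrix = cooccurrence(sequence)
for pair, value in sorted(matrix.items()):
    print(pair, round(value, 3))
```

Storing only unordered pairs reflects the symmetry of the co-occurrence relation; a dense m×m matrix can be filled from this dictionary if one is needed.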

The word network described is as follows:

When constructing a weighted word network, the weight matrix of the word network must first be obtained, and the weight matrix Wm is defined as:

Among them, α is 0.3 and β is 0.7, which strengthens the semantic relationship between words and weakens the co-occurrence relationship between words;

W_m is the adjacency matrix of the word network, whose network graph is defined as G = {V, E}, where G is an undirected weighted graph, V is the vertex set of G, E is the edge set of G, and v_i is the i-th vertex in V;

The calculation process of the described word aggregation feature value is as follows:

The important characteristics of the word network are the degree distribution, the average shortest path, the aggregation degree and the clustering coefficient; the degree of a node reflects its association with other nodes; the aggregation degree and clustering coefficient of a node reflect how densely the nodes in its local neighborhood are interconnected; together, the degree and clustering coefficient of a node reflect its importance within a local scope; the aggregation feature value of a node is calculated from its weighted degree, clustering coefficient and betweenness, which not only gives important words a higher weight but also ensures that words associated with many important words receive higher scores;

In the word semantic similarity network graph, the unordered pair (v _i , v _j ) represents the edge between the node v _i and v _j , then the weighted degree of the node v _i is defined as:

$WD_i = \sum_{j=1}^{n} w_{ij} / n$;

Among them, w _ij is the weight on the edge between nodes v _i and v _j , n is the total number of nodes;

In the word semantic similarity network graph, the unordered pair (v_i, v_j) represents the edge between nodes v_i and v_j, and the unweighted degree D_i of node v_i is D_i = |{(v_i, v_j) : (v_i, v_j) ∈ E, v_i, v_j ∈ V}|; the aggregation degree T_i of node v_i is the actual number of edges existing between its neighbor nodes: T_i = |{(v_j, v_k) : (v_i, v_k) ∈ E, (v_j, v_k) ∈ E, v_i, v_j ∈ V}|; the clustering coefficient C_i of node v_i is then defined as:

$C_i = \frac{T_i}{\binom{D_i}{2}} = 2 T_i / (D_i (D_i - 1))$;
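A minimal sketch of the weighted degree and the clustering coefficient on a small weight matrix follows. It assumes that an edge exists wherever the off-diagonal weight is positive, that diagonal entries are ignored, and that a node with fewer than two neighbors has a clustering coefficient of 0; none of these conventions is stated in the text.

```python
def weighted_degree(W, i):
    """WD_i: sum of the weights on the edges of node i, divided by n."""
    n = len(W)
    return sum(W[i][j] for j in range(n) if j != i) / n


def clustering_coefficient(W, i):
    """C_i = 2*T_i / (D_i*(D_i - 1)), T_i = edges among the neighbors of i."""
    n = len(W)
    neighbors = [j for j in range(n) if j != i and W[i][j] > 0]
    degree = len(neighbors)
    if degree < 2:
        return 0.0  # assumption: undefined case mapped to 0
    links = sum(
        1
        for a in range(degree)
        for b in range(a + 1, degree)
        if W[neighbors[a]][neighbors[b]] > 0
    )
    return 2 * links / (degree * (degree - 1))


# Hypothetical symmetric weight matrix of a four-word network.
W = [
    [0.0, 0.5, 0.5, 0.2],
    [0.5, 0.0, 0.4, 0.0],
    [0.5, 0.4, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
]
print(weighted_degree(W, 0), clustering_coefficient(W, 0))
```

Node 0 has three neighbors of which only one pair is linked, so one of the three possible neighbor edges exists.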

In the word semantic similarity network graph, the betweenness of a node is the probability that a shortest path between nodes x and w passes through node v_i; the connection between two non-adjacent nodes depends on the nodes lying on the shortest paths between them, and these nodes potentially control the flow of information between the two; B_i reflects the degree of interconnection of node v_i in its local environment; the betweenness of a node is defined as:

$B_i = \sum_{w \in G,\, x \in G} \frac{r_{v_i}(w, x)}{d(w, x)}$;

d(w, x) is the number of shortest paths between any two nodes w and x in the weighted word semantic similarity network graph, and r_{v_i}(w, x) is the number of those shortest paths that pass through node v_i ∈ G;
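The ratio of shortest paths through a node can be illustrated with a small sketch. It makes simplifying assumptions: path length is measured in hops with breadth-first search rather than with the edge weights of the word network, each unordered pair (w, x) is counted once, and pairs that include v_i itself are skipped. The graph is given as a hypothetical adjacency dictionary.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """Hop distance and number of shortest paths from source to each node."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first time v is reached
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                count[v] += count[u]
    return dist, count


def betweenness(adj, v):
    """Sum over pairs (w, x) of r_v(w, x) / d(w, x) on an undirected graph."""
    dist_v, count_v = shortest_path_counts(adj, v)
    others = [node for node in adj if node != v]
    total = 0.0
    for a, w in enumerate(others):
        dist_w, count_w = shortest_path_counts(adj, w)
        for x in others[a + 1:]:
            if x not in dist_w or v not in dist_w:
                continue  # w is not connected to x or to v
            if dist_w[v] + dist_v[x] == dist_w[x]:
                # shortest paths through v = (paths w..v) * (paths v..x)
                total += count_w[v] * count_v[x] / count_w[x]
    return total


# A four-node cycle a-b-c-d-a: a and c are joined by two shortest paths.
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
print(betweenness(square, "b"))
```

For larger networks Brandes' algorithm computes the same quantity for all nodes more efficiently; the pairwise form is kept here because it mirrors the definition.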

The average weighted degree, the clustering coefficient and the betweenness of node v_i are combined by weighting to measure the aggregation feature value of the node. The aggregation feature value Z_i of node v_i is defined as:

$Z_i = a \times WD_i + b \times C_i / \sum_{j=1}^{n} C_j + c \times B_i$;

Among them, a+b+c=1;

The statistical feature value of a word is calculated as follows:

A nonlinear function is used to normalize the word frequency; the word frequency weight TFi of the word W_i in the text is defined as:

$TF_i = \frac{f(W_i)}{\sum_{j=1}^{n} f(p_j)}$;

Among them, TFi represents the word frequency weight of word W _i , p _j represents a certain word in the text, and f is a word frequency statistical function;
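The word frequency weight can be computed directly from a word list, as in this short sketch. The function name and the sample words are illustrative, and f is taken to be a plain occurrence count.

```python
from collections import Counter


def term_frequency_weights(words):
    """TF_i = f(W_i) / sum_j f(p_j), with f counting occurrences."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


tf = term_frequency_weights(["radar", "signal", "radar", "antenna"])
print(tf)
```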

The part-of-speech weight posi of word W _i in the text is defined as:

The longer a word, the more specific the information it conveys; conversely, shorter words usually express more abstract meanings. In particular, the feature words of a document are mostly compound technical terms, which are longer, clearer in meaning and better at reflecting the theme of the text; increasing the weight of long words therefore helps distinguish between words and reflects the importance of words in the document more accurately;

The word length weight leni of word W _i in the text is defined as:

For each word in the word sequence, its statistical feature value is

stats_i = A*TF_i + B*pos_i + C*len_i;

where A + B + C = 1;

The key degree of the word W_i is calculated as follows:

For each node in the weighted word network, its key degree Imp_i is defined as:

Imp_i = β*stats_i + (1-β)*Z_i;

where 0 < β < 1;

The key degree values obtained are sorted in descending order; a threshold γ, 0 < γ < 1, is set and the first q values are taken; the corresponding words are used as the feature words of the scientific and technological project, since they fully reflect its theme and are relatively important.

3. The method for intelligent recommendation of review experts for scientific and technological projects according to claim 1, characterized in that the feature merging in Step 6 is carried out through a logical XOR operation as follows:

(1) Feature merging between fields of a project to be reviewed and a review expert

Suppose the field feature word sets W'₁ and W'₂ are to be merged; the merging rule for W'₁ and W'₂ is defined as:

$W'_1 \oplus W'_2 = \{ \forall i, j, \{ word_{1i}, \frac{f(word_{1i}) + f(word_{2i})}{2} \} \mid word_{1i} = word_{2j} \}$;

Among them, word _1i and word _2j are feature words;

The definition above is refined and extended by adding field weights (w1 and w2 in the formula below), and the inter-field features of review experts and of scientific and technological projects are merged with the rule:

$W'_1 \oplus W'_2 = \{ \forall i, j, \{ word_{1i}, \frac{w_1 * f(word_{1i}) + w_2 * f(word_{2i})}{\sqrt{w_1^2 + w_2^2}} \} \mid word_{1i} = word_{2j} \}$;
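The weighted inter-field merge can be sketched as below. The rule above only spells out the case of a word that appears in both fields; this sketch additionally keeps words that appear in a single field by treating the missing frequency as 0, which is an assumption and not something the claim states. The function name, the field contents and the weights 3 and 4 are illustrative.

```python
from math import sqrt


def merge_fields(field_1, field_2, w1, w2):
    """Merge two {word: frequency} fields with field weights w1 and w2.

    Shared words get (w1*f1 + w2*f2) / sqrt(w1^2 + w2^2), as in the rule.
    Words found in one field only are kept with the other frequency set
    to 0 (assumption made for this sketch).
    """
    norm = sqrt(w1 ** 2 + w2 ** 2)
    merged = {}
    for word in field_1.keys() | field_2.keys():
        f1 = field_1.get(word, 0.0)
        f2 = field_2.get(word, 0.0)
        merged[word] = (w1 * f1 + w2 * f2) / norm
    return merged


title = {"laser": 2, "optics": 1}   # hypothetical field with weight 3
abstract = {"laser": 1}             # hypothetical field with weight 4
merged = merge_fields(title, abstract, w1=3, w2=4)
print(sorted(merged.items()))
```

Because the field weights stay inside the merged frequency, a word from a heavily weighted field still contributes more to the later similarity calculation.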

(2) Inter-item feature merging of group pending items

This merging operation applies only to the feature vectors of the scientific and technological projects under review, not to the feature vectors of review experts, which only require inter-field feature merging; if V(d₁) and V(d₂) are the vector models of two scientific and technological projects after inter-field feature merging, then for any t_1j ∈ V(d₁) and t_2j ∈ V(d₂), the two entries are merged if t_1j is the same as t_2j; the merge is defined as:

$V(d_1) \oplus V(d_2) = \{ \langle t_k,\ w_k(p) = \frac{w_i(d_1) + w_j(d_2)}{2} \rangle \}$;

Wherein, k=1,...,n, t _k is a feature entry item, and w _k (p) is the weight of t _k ;

The basic process of knowledge representation model generation is as follows:

a). Merge the inter-field features of each scientific and technological project to obtain the vector model V(d) of each project;

b). Apply the merging strategy to the set of vector models of all the scientific and technological projects; in this way, a vector-space-based knowledge representation model is established for the project group:

V(p) = {<t₁, w₁(p)>, <t₂, w₂(p)>, ..., <t_n, w_n(p)>};

where k = 1, ..., n, t_k is a feature-word entry of the project group, and w_k(p) is the weight of t_k.
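Steps a) and b) can be sketched as a fold over the project vectors. Two assumptions are made: a term found in only one project keeps its weight unchanged, and projects are merged pairwise from left to right, so with three or more projects the result depends on the order. The function names and sample vectors are illustrative.

```python
from functools import reduce


def merge_projects(v1, v2):
    """Merge two {term: weight} project vectors.

    A shared term gets the mean of its two weights, as in the rule above;
    a term present in one vector only keeps its weight (assumption).
    """
    merged = dict(v1)
    for term, weight in v2.items():
        merged[term] = (merged[term] + weight) / 2 if term in merged else weight
    return merged


def group_vector(project_vectors):
    """V(p): fold the merge over all project vectors of the group."""
    return reduce(merge_projects, project_vectors, {})


projects = [
    {"laser": 0.8, "optics": 0.4},
    {"laser": 0.4, "radar": 0.6},
]
group = group_vector(projects)
print(sorted(group.items()))
```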