CN1645395A - Method for discovering user interest in e-mail flow and transmitting document effectively

Info

Publication number: CN1645395A
Application number: CN 200510009506
Authority: CN
Inventors: 诸葛海; 丁连红
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2005-02-22
Filing date: 2005-02-22
Publication date: 2005-07-27

Abstract

The invention relates to the field of computer technology, and in particular to a method for discovering user interests in email streams and effectively pushing documents accordingly. Because their research fields overlap, members of the same research team often search for and download the same documents repeatedly. When the issues a member is concerned with change, the member's interests are updated promptly from the emails he or she sends and receives, so that the correct documents can be pushed to members according to their interests. The documents in the team document database are analyzed semantically, and documents that are semantically consistent with a member's interests are pushed to that member, so that the pushed documents are the ones the user needs. Members only need to upload a document to the team's document database; the program then analyzes and pushes it, which makes sharing documents among members simple.

Description

A method for discovering user interests in email streams and effectively pushing documents accordingly

Technical field

The invention relates to the field of computer technology, and in particular to a method, involving semantic understanding, text classification, document sharing and email streams, for discovering user interests in email streams and effectively pushing documents accordingly.

Background

The research fields of the members of a research team usually overlap. On the one hand, members often repeat the same search and download operations to obtain the same documents, which wastes manpower and money. On the other hand, they often exchange information by email and sometimes send valuable documents to other members as attachments. This achieves document sharing among members to some extent, but the following problems remain.

First, there is no guarantee that every member is willing to send other members the documents they need, so the repeated operations that team members perform to obtain the same documents cannot be fundamentally avoided.

Second, even if every member is willing to send other members the documents they need, the following can still happen. A member's interests often change over time, and other members, unaware of the change, may keep sending him documents he no longer needs while not sending the documents he now needs. It is also difficult for one member to grasp the interests of all the other members accurately, so a document cannot be pushed to every member who needs it and documents cannot be fully shared.

To achieve full sharing of scientific and technical documents within a team, the invention first extracts each team member's research interests and then periodically pushes relevant documents to team members according to those interests. Extracting team members' interests accurately is the basis for fully sharing such documents. Sending and receiving emails forms an email stream among team members, and the issues each member is concerned with are usually reflected in the emails he or she sends and receives, so team members' interests can be extracted from the email stream. Building on existing email functionality, the invention extracts user interests from the email stream among team members, which establishes the precondition for fully sharing documents among them.

The basic idea is that the areas in which a member's emails are concentrated are the areas in which the member's research work is concentrated. First, the emails exchanged among members are saved in a database, a process that eliminates interference from spam. Next, a natural-language learning method is used to obtain the valid emails, that is, those that provide useful information for describing user interests. Then, the research fields related to the team are divided into smaller subfields, and the valid emails are classified on that basis. Finally, according to the distribution of valid emails over the subfields, a user's interest is represented by the set of subfields the member is concerned with. Because user interests may change after a relatively long period, a time factor is introduced into the interest extraction process, so that user interests are updated promptly as new emails arrive and time passes. Pushing documents according to user interests ensures that a document is pushed to all team members who need it, without sending it to the wrong members or missing any.

The invention uses the interest point sets that describe the semantics of the subfields as templates to assign each document to the subfield that is semantically close to it. On this basis, the document pushing program pushes the document to the users concerned with that subfield, which ensures that the pushed documents are semantically what the users need.

A team member who wants to share a document with other members only needs to upload it to the team's document database for the document to be understood and pushed. Most team members can accept a simple upload operation, so documents are shared among team members to a large extent and their tedious repeated operations are avoided.

Summary of the invention

The purpose of the invention is to provide a method for discovering user interests in email streams and effectively pushing documents accordingly, so that team resources are used effectively and scientific and technical documents are fully shared among team members. The steps of the method are as follows. First, the emails exchanged among team members are stored in a database. Then, user interests are extracted from the email stream among team members; when the issues a member is concerned with change, the member's interests are updated promptly from the emails he or she sends and receives, so that the correct documents can be pushed to the member. The documents in the team document database are also analyzed semantically. Finally, on the basis of this semantic analysis, documents consistent with a user's interests are pushed to that team member.

The method mainly comprises the following points.

Using a function provided by the email server program, the emails exchanged among team members are forwarded to a fixed account. A mail collection program is run periodically; it decodes the emails in the fixed account and saves the decoded results in the email database, which completes the automatic storage of emails. Most spam comes from unfamiliar email addresses; since the invention saves only emails exchanged among members, spam does not interfere with the extraction of user interests.

Only team members' research interests are considered. A natural-language learning method divides the emails exchanged among members into valid and invalid emails, yielding the valid emails that provide useful information for describing user interests. User interests are extracted on this basis, which ensures their accuracy.

Each research field related to the team is subdivided into subfields, and the background knowledge and semantics of each subfield are represented by its prior knowledge set and its interest point set, respectively.

Valid emails are classified by computing their similarity to the prior knowledge sets. The subfields in which a user's valid emails are concentrated are the subfields in which the user's research work is concentrated, so user interests are extracted from the distribution of the user's valid emails over the subfields and are represented as the set of subfields the user is concerned with.

User interests may change over time, so the power of an email to describe user interests should decrease as the email ages. Time is therefore introduced into the interest extraction process: when a user's work focus shifts, the user's interests are adjusted promptly, so a document can be pushed to all team members who need it, without sending it to the wrong members or missing any. This establishes the precondition for fully sharing scientific and technical documents among team members.

Using the interest point sets that describe the semantics of the subfields as templates, documents are assigned to subfields according to their semantic similarity to each subfield. On this basis, a document is pushed to the users whose set of subfields of interest contains the subfield to which the document belongs, which ensures that the pushed documents are semantically what the users need.

Team members only need to upload a document to the team's document database for it to be understood and pushed. Most team members can accept a simple upload operation, which makes sharing documents among team members simple.

Technical solution

The invention is a method for discovering user interests in email streams and effectively pushing documents accordingly. First, each research field related to the team is subdivided into subfields, and for each subfield a prior knowledge set representing its background knowledge and an interest point set describing its semantics are constructed. An email collection program is run periodically to store the emails exchanged among team members in the email database, from which the valid emails, those that provide useful information for describing user interests, are extracted. Team members can also upload valuable scientific and technical documents to the document database. Then, each valid email is assigned to the subfield whose prior knowledge set is most similar to it, and user interests are extracted from the distribution of valid emails over the subfields. The documents in the document database are analyzed semantically and classified, using the interest point sets of the subfields as templates. Finally, the document pushing program uses the user interests and the document classification results to push documents consistent with each user's interests to that team member.

The solution mainly comprises the following technical points:

1. Automatic storage of emails between team members

First, an email database is built in which each record stores one email, and the email server program automatically forwards the emails exchanged among team members to a fixed account. Then, the mail collection program is run periodically; it decodes the emails in the fixed account and stores the decoded results in the email database, which achieves the automatic storage of emails. Spam usually comes from unfamiliar email addresses; because only emails exchanged among members are saved, the automatic storage process itself filters out spam.

2. Extracting valid emails

The invention is concerned only with users' research interests, so only emails whose content relates to research work are valid. A natural-language learning method is used to extract from the email database the valid emails that provide useful information for describing user interests.

3. Refining the division of research fields and establishing the prior knowledge sets and interest point sets of the subfields

The team's research fields are subdivided to obtain the set of subfields related to the team. A prior knowledge set and an interest point set are established for each subfield, representing its background knowledge and its semantics, respectively. Each element of a prior knowledge set consists of two parts: a keyword representing the main content of the subfield, and the keyword's impact factor (descriptive power) for the subfield. An interest point set consists of the semantic link networks corresponding to the interest points contained in the subfield; one semantic link network describes the semantic information of one interest point.

The prior knowledge set of a subfield represents its background knowledge. Valid emails are classified by computing their similarity to the prior knowledge set of each subfield, and, according to the distribution of valid emails over the subfields, a user's interest is represented by the set of subfields the member is concerned with.

The interest point set of a subfield describes its semantics and serves as a template for assigning documents to the subfields semantically close to them. The document pushing program pushes a document to the members concerned with the subfield to which the document belongs, which ensures that the documents pushed to a user are semantically what the user needs. Team members only need to upload a document to the team's document database, and the program completes the push.

4. Obtaining user interests from the classification of valid emails

The subfield to which each valid email belongs is determined by matching the email against the prior knowledge sets of the subfields, which classifies the valid emails. Based on this classification, the set of subfields a member is currently concerned with is determined from the distribution of the valid emails related to that member, and this set represents the user's interest. The basic idea is that the subfields in which a user's emails are concentrated are also the subfields in which the user's research work is concentrated.

5. Updating user interests promptly

User interests tend to change over time, possibly after a relatively long period, so the power of an email to describe a user's current interest should decrease as the email ages. The method introduces a time factor into the interest extraction process, so that user interests are updated promptly as new emails arrive and time passes, and are adjusted when the issues a user is concerned with change. Pushing documents according to user interests ensures that a document is pushed to all team members who need it, without sending it to the wrong members or missing any.

6. Determining the subfield of a document by semantic analysis

Using the interest point sets of the subfields as templates, the documents in the document database are analyzed semantically and each document is assigned to the subfield semantically close to it, which supports the accuracy of document classification at the semantic level. Documents newly added to the document database are analyzed and assigned periodically.

7. Pushing documents according to user interests and document classification results

The document pushing program is run periodically. According to each user's current interests, it pushes the documents in the document database that are consistent with those interests to the corresponding team member by email. Pushing documents according to user interests ensures that the correct documents are pushed to team members. Because the documents pushed are selected by semantic analysis rather than by simple keyword matching, the pushed documents are semantically what the users need.

Description of drawings

Fig. 1 is a flowchart of the method of the invention for discovering user interests in email streams and effectively pushing documents accordingly.

Fig. 2 shows a semantic link network of the invention and its adjacency matrix representation.

Fig. 3 is a flowchart of document understanding in the invention.

Detailed description

The invention is a method for discovering user interests in email streams and effectively pushing documents accordingly. The method subdivides each research field related to the team into smaller subfields and establishes, for each subfield, a prior knowledge set and an interest point set that represent its background knowledge and its semantics, respectively. A user's interest is the set of subfields the user is concerned with. First, the emails exchanged among team members are saved in the email database, and the valid emails, whose content involves research information, are extracted from it. Then, each valid email is assigned to the subfield whose prior knowledge set is most similar to it, which classifies the valid emails. From the classification results, the proportion of the valid emails sent and received by each member that falls in each subfield is computed, and every subfield whose proportion exceeds a threshold is added to the set of subfields the user is concerned with, which gives the user's interest. In parallel, using the interest point sets of the subfields as templates, the documents in the team document database are analyzed semantically and assigned to the subfields semantically close to them. Finally, the document pushing program pushes relevant documents to each user according to the user's interest. Concretely, a document in the document database is pushed, as an email attachment, to every user whose set of subfields of interest contains the subfield to which the document belongs.
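The final pushing rule described above (a document goes to every user whose set of subfields of interest contains the document's subfield) can be sketched as follows. This is a minimal illustration, not the patent's program: the function names `recipients_for` and `push_plan`, the dictionary layout, and the sample users and file names are all assumptions, and the actual sending of email attachments is left out.

```python
def recipients_for(doc_subfield, user_interests):
    """Return the users whose interest set contains the document's subfield.

    user_interests maps a user name to the set of subfields that user
    is concerned with (hypothetical layout).
    """
    return sorted(
        user for user, subfields in user_interests.items()
        if doc_subfield in subfields
    )


def push_plan(doc_subfields, user_interests):
    """Map each document to the list of users it should be pushed to.

    doc_subfields maps a document name to the subfield it was assigned to
    by the document classification step.
    """
    return {
        doc: recipients_for(subfield, user_interests)
        for doc, subfield in doc_subfields.items()
    }
```

A document whose subfield nobody is concerned with simply gets an empty recipient list, so nothing is sent for it.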

Fig. 1 is the implementation flowchart of the invention, which consists mainly of the following four parts:

I. Automatic email storage and extraction of valid emails

1. Building the email database

Team members use a common email server and server program (e.g. WebEasyMail). A database file (e.g. mail.mdb, hereinafter the email database) is created in a directory on the email server (e.g. F:\database, hereinafter the database directory) to store the emails exchanged among team members. Each email is stored as one record in the email database with six fields, whose names and meanings are as follows:

Sender: the sender's email address

Recipient: the recipient's email address

Cc: the email addresses copied on the email

Sent time: the time at which the email was sent

Subject: the subject of the email

Body: the body text of the email; a body longer than 255 characters is stored as an OLE (object linking and embedding) object
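The six-field record above could be modelled as in the sketch below. The source stores records in an .mdb file; this sketch substitutes Python's built-in `sqlite3` module purely for illustration, and the table name, column names and helper functions are assumptions rather than part of the patent. SQLite's TEXT type has no 255-character limit, so the OLE workaround is not reproduced.

```python
import sqlite3

# Hypothetical SQLite stand-in for the mail.mdb email database.
SCHEMA = """
CREATE TABLE IF NOT EXISTS email (
    sender    TEXT NOT NULL,
    recipient TEXT NOT NULL,
    cc        TEXT,
    sent_time TEXT NOT NULL,
    subject   TEXT,
    body      TEXT
)
"""


def open_email_db(path=":memory:"):
    """Open (or create) the email database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_email(conn, sender, recipient, cc, sent_time, subject, body):
    """Insert one email as one record, mirroring the six fields listed above."""
    conn.execute(
        "INSERT INTO email VALUES (?, ?, ?, ?, ?, ?)",
        (sender, recipient, cc, sent_time, subject, body),
    )
    conn.commit()
```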

2. Automatic email storage

First, the service provided by WebEasyMail automatically forwards all emails exchanged among team members to a fixed account (e.g. the account with user name group). The emails of this account are kept in a fixed directory on the mail server (e.g. C:\WebEasyMail\mail\group, hereinafter the undecoded mail directory). Spam in the traditional sense usually comes from email addresses unfamiliar to the user. Because this process collects only emails exchanged among team members, spam does not interfere with the extraction of user interests.

Then, a purpose-written mail collection program (e.g. MailGatherer) is run periodically (e.g. once a day) to store the emails automatically. The program reads each email in the undecoded mail directory in turn, parses the mail header, decodes the mail body, and saves the decoded email information in the corresponding fields of the email database file. It then moves the processed email to another directory on the email server (e.g. C:\WebEasyMail\mail\group_deleted, hereinafter the decoded mail directory), so that the email is not processed again the next time MailGatherer runs.
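One pass of a collector like MailGatherer might look like the sketch below. It is an illustration under stated assumptions, not the program from the patent: it assumes each file in the undecoded mail directory holds one raw RFC 5322 message (the on-disk format used by WebEasyMail is not described in the source), it uses Python's standard `email` package for decoding, and `parse_email`, `gather` and the `store` callback are hypothetical names.

```python
import email
import shutil
from email import policy
from pathlib import Path


def parse_email(raw: bytes) -> dict:
    """Parse the header and decode the plain-text body of one raw message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    return {
        "sender": str(msg["From"] or ""),
        "recipient": str(msg["To"] or ""),
        "cc": str(msg["Cc"] or ""),
        "sent_time": str(msg["Date"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": body,
    }


def gather(undecoded_dir, decoded_dir, store):
    """Decode every file in undecoded_dir, hand it to store(), then move it.

    Moving processed files into decoded_dir keeps them from being
    handled again on the next run. Returns the number of emails processed.
    """
    decoded = Path(decoded_dir)
    decoded.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(undecoded_dir).iterdir()):
        if not path.is_file():
            continue
        store(parse_email(path.read_bytes()))
        shutil.move(str(path), str(decoded / path.name))
        count += 1
    return count
```

The `store` argument can be any callable that accepts the record dictionary, for example a function that writes the six fields to the email database.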

3. Extracting valid emails

Although spam in the traditional sense has been filtered out in the previous step, not every email stored in the email database provides useful information for describing user interests. Emails that reflect user interests are called valid emails, and emails that do not are called invalid emails. Emails related to the team's research content are valid, whereas the jokes or event notices that team members often send one another are invalid; only members' research interests are considered here. To obtain accurate user interests, the valid emails must be extracted from the email database, which is done with a natural-language learning method.

First, a number of valid emails and invalid emails are selected as the training set $C_1$ of valid emails and the training set $C_2$ of invalid emails, respectively, and the standard vectors $\vec{c}_1$ and $\vec{c}_2$ of valid and invalid emails are obtained from the following formulas:

$\vec{c}_1 = 16 \frac{1}{|C_1|} \sum_{e \in C_1} \frac{\vec{e}}{\|\vec{e}\|} - 4 \frac{1}{|C_2|} \sum_{e \in C_2} \frac{\vec{e}}{\|\vec{e}\|} \qquad (1)$

$\vec{c}_2 = 16 \frac{1}{|C_2|} \sum_{e \in C_2} \frac{\vec{e}}{\|\vec{e}\|} - 4 \frac{1}{|C_1|} \sum_{e \in C_1} \frac{\vec{e}}{\|\vec{e}\|} \qquad (2)$

where $\vec{e} = (e_1, e_2, \ldots, e_{|F|})$ is the vector representation of email $e$, with one component per keyword, and $e_i$ is the number of times keyword $w_i$ occurs in the subject and body of email $e$; $\|\vec{e}\|$ is the length of the vector $\vec{e}$; $|C_1|$ and $|C_2|$ are the numbers of training samples in $C_1$ and $C_2$, that is, the numbers of emails they contain. Then, the similarity between the vector representation $\vec{e}$ of each email $e$ in the email database and the standard vectors $\vec{c}_1$ and $\vec{c}_2$ is computed as follows:

$\cos(\vec{e}, \vec{c}_n) = \frac{\sum_{i=1}^{|F|} e_i c_i}{\sqrt{\sum_{i=1}^{|F|} e_i^2} \sqrt{\sum_{i=1}^{|F|} c_i^2}} \qquad (3)$

where $n = 1$ or $n = 2$, and $c_i$ is the $i$-th component of $\vec{c}_n$.

If $\cos(\vec{e}, \vec{c}_1) > \cos(\vec{e}, \vec{c}_2)$, then $e$ is a valid email; otherwise $e$ is an invalid email. This yields the valid emails used to extract user interests.
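Formulas (1) to (3) and the decision rule can be sketched in plain Python as below. The coefficients 16 and 4 and the cosine comparison come from the text; the tokenisation in `vectorize` (lower-cased word tokens, so keywords must be lower-case single words), the handling of zero vectors, and all function names are assumptions made for the illustration.

```python
import math
import re


def vectorize(text, keywords):
    """Count how often each keyword occurs in the subject plus body text."""
    tokens = re.findall(r"\w+", text.lower())
    return [tokens.count(w) for w in keywords]


def _unit(v):
    """Scale v to length 1; a zero vector is returned unchanged (assumption)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def _mean_unit(vectors):
    """Average of the length-normalised vectors of one training set."""
    units = [_unit(v) for v in vectors]
    return [sum(u[i] for u in units) / len(units) for i in range(len(units[0]))]


def standard_vector(own, other, beta=16.0, gamma=4.0):
    """Formulas (1)/(2): 16 * mean(own set) - 4 * mean(other set)."""
    m_own, m_other = _mean_unit(own), _mean_unit(other)
    return [beta * a - gamma * b for a, b in zip(m_own, m_other)]


def cosine(u, v):
    """Formula (3); defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def is_valid(e, c1, c2):
    """An email is valid when it is closer to c1 than to c2."""
    return cosine(e, c1) > cosine(e, c2)
```

Under these assumptions an email containing none of the keywords has cosine 0 with both standard vectors and is therefore treated as invalid.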

II. Classification of valid emails and extraction of user interests

Each research field related to the team is divided into smaller subfields, and the background knowledge of subfield $nd_i$ is represented by its prior knowledge set $K_i$. $K_i$ is a set of pairs $(n_k, \alpha_k)$, where $n_k$ is one of a group of keywords that together reflect the main content of $nd_i$, and $\alpha_k$ is the weight of $n_k$, which expresses the power of $n_k$ to describe $nd_i$: the higher $\alpha_k$ is, the stronger the descriptive power of $n_k$.

Valid emails are classified by computing their similarity to the prior knowledge set of each subfield, and, according to the distribution of valid emails over the subfields, a user's interest is represented by the set of subfields the member is concerned with.

First, for each valid email $e$, the probability that its content involves subfield $nd_i$ is computed:

$Sim(e, K_i) = \sum_{k=1}^{R} \alpha_k f(S_{kl}) / N \qquad (4)$

where $n_k$ is a keyword belonging to $K_i$ that occurs in the subject and body of email $e$; $D_l$ is the set of these $n_k$, so that $(n_k, \alpha_k) \in K_i$ and $n_k \in D_l$; $S_{kl}$ is the number of times keyword $n_k$ occurs in those parts of email $e$, and $f(S_{kl}) = \tanh(S_{kl}/3)$; $R$ and $N$ are the numbers of elements of $D_l$ and $K_i$, respectively.

Then, $e$ is assigned to the subfield with the highest probability, which classifies the valid emails.
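Formula (4) and the assignment to the highest-scoring subfield can be sketched as follows. The dictionary layout (keyword to weight for a prior knowledge set, keyword to occurrence count for an email), the function names, and the tie-breaking behaviour are assumptions; the source does not say what happens when several subfields score equally or when every score is zero.

```python
import math


def sim(counts, prior):
    """Formula (4): sum of weight * tanh(count / 3) over matched keywords, divided by N.

    counts: keyword -> occurrences in the email's subject and body.
    prior:  keyword -> weight, i.e. the prior knowledge set of one subfield.
    """
    if not prior:
        return 0.0
    total = sum(
        weight * math.tanh(counts[kw] / 3)
        for kw, weight in prior.items()
        if counts.get(kw, 0) > 0
    )
    return total / len(prior)


def classify(counts, priors):
    """Return the subfield whose prior knowledge set scores highest.

    priors: subfield name -> prior knowledge set. On a tie, the first
    subfield in iteration order wins (an assumption of this sketch).
    """
    return max(priors, key=lambda name: sim(counts, priors[name]))
```

Because tanh saturates towards 1, repeating a keyword many times adds little beyond the first few occurrences, which keeps one heavily repeated term from dominating the score.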

一般说来用户所发送或接收的有效电子邮件多数会集中在少数几个子领域中，其从事的研究工作应该也集中在这几个子领域中。也就是说，有效电子邮件集中的子领域正是他科研工作所关注的子领域，成员兴趣就是用其所关注子领域的集合表示的。因此，可以根据有效电子邮件的分类结果，计算用户的研究工作涉及各个子领域的百分比。Generally speaking, most of the effective e-mails sent or received by users will be concentrated in a few subfields, and the research work they are engaged in should also be concentrated in these few subfields. That is to say, the subfields in the effective email collection are exactly the subfields that his research work focuses on, and the members' interest is represented by the set of the subfields they focus on. Therefore, based on the classification results of valid emails, the percentage of the user's research work involving each subfield can be calculated.

然后，计算用户i的研究工作涉及子领域j的百分比per_ij Then, calculate the percentage per _ij of user i's research work involving subfield j

${per per}_{ij ij} = = \frac{α α {Σ Σ}_{(({e e &Element; &Element; nd nd}_{j j})) \cap \cap (({e e &Element; &Element; from from}_{i i}))} 22^{- - \frac{age age ((e e))}{hl hl}} sim sim ((e e,, {K K}_{j j})) + + β β {Σ Σ}_{((e e &Element; &Element; {nd nd}_{j j})) \cap \cap ((e e &Element; &Element; {to to}_{i i}))} 22^{- - \frac{age age ((e e))}{hl hl}} sim sim ((e e,, {K K}_{j j}))}{α α {Σ Σ}_{((e e &Element; &Element; {from from}_{i i}))} 22^{- - \frac{age age ((e e))}{hl hl}} sim sim ((e e,, {K K}_{j j})) + + β β {Σ Σ}_{((e e &Element; &Element; {to to}_{i i}))} 22^{- - \frac{age age ((e e))}{hl hl}} sim sim ((e e,, {K K}_{j j}))} \times \times 100100 % % - - - - - - ((55))$

Here, per_ij is the percentage of user i's research work that involves subfield j; from_i is the set of valid emails sent by user i, and to_i is the set of valid emails received by user i. The coefficients α = 1 and β = 0.8 express how well the valid emails a user sends and the valid emails a user receives, respectively, describe the user's interest. The factor $2^{-age(e)/hl}$ makes the descriptive power of an email decrease as the email ages: age(e) is the difference between the current date and the sending date of email e, and hl = 30 means that an email sent 30 days ago has only half the descriptive power of an email sent today.

How well a valid email received from another member describes the user's interest depends on how well the sender understands the user's research work, whereas the valid emails a user sends generally reflect the user's research interest correctly. Valid emails sent by the user are therefore given greater descriptive power. A user's research focus often changes over a longer period of time, so the descriptive power of an email should also decrease as the email ages; this is achieved by introducing the factor $2^{-age(e)/hl}$ into formula (5).
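
As a quick worked check of this decay, with hl = 30 the factor takes the following values for emails of a few illustrative ages (the ages are chosen here for illustration and do not come from the source):

$2^{-0/30} = 1, \quad 2^{-30/30} = 0.5, \quad 2^{-60/30} = 0.25, \quad 2^{-90/30} = 0.125$

A received email that is 30 days old therefore enters formula (5) with weight β × 0.5 = 0.8 × 0.5 = 0.4 times its similarity value, compared with a weight of 1 for an email the user sent today.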

Finally, if per_ij is greater than the threshold, subfield nd_j is added to the set of subfields that user i focuses on; the threshold here is 10%.
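
Formula (5) and the 10% threshold can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `ValidEmail` record and the function names `per` and `interests` are hypothetical, the values sim(e, K_j) are assumed to have been computed beforehand for every subfield, and ages are assumed to be measured in days.

```python
from dataclasses import dataclass

ALPHA = 1.0       # weight of emails the user sent (value from the text)
BETA = 0.8        # weight of emails the user received (value from the text)
HALF_LIFE = 30    # hl (value from the text), assumed to be in days
THRESHOLD = 10.0  # percent (value from the text)


@dataclass
class ValidEmail:        # hypothetical record type, not from the source
    sent_by_user: bool   # True if e is in from_i, False if e is in to_i
    age_days: float      # age(e)
    subfield: str        # subfield the email was classified into
    sim: dict            # sim(e, K_j) per subfield j, assumed precomputed


def per(emails, j):
    """Percentage of the user's research work that involves subfield j."""
    num = den = 0.0
    for e in emails:
        direction = ALPHA if e.sent_by_user else BETA
        term = direction * 2 ** (-e.age_days / HALF_LIFE) * e.sim.get(j, 0.0)
        den += term                 # denominator: all of the user's valid emails
        if e.subfield == j:
            num += term             # numerator: only emails classified into j
    return 100.0 * num / den if den else 0.0


def interests(emails, subfields):
    """Set of subfields whose percentage exceeds the threshold."""
    return {j for j in subfields if per(emails, j) > THRESHOLD}
```

Calling `interests(emails, subfields)` again after new emails arrive, or simply on a later date with updated ages, gives the updated interest set, which is how the time factor keeps the result current.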

3. Document understanding and classification

A basic concept, viewpoint or method is called an interest point, and a semantic link network (SG) is used to represent the semantic information of an interest point. The interest point set SG-set_i of subfield nd_i describes all the semantics that nd_i implies; its elements are the semantic link networks corresponding to the interest points that nd_i contains. Using the interest point sets of the subfields as templates, each document is assigned to the subfields that are semantically close to it.

SG = (N, R), where N is the set of nodes, consisting of one interest point N₁ and a group of keywords {N₂, N₃, ..., N_m} that together express the semantics of N₁, and R is the set of directed arcs, which represent causal relations between nodes.

Figure 2(a) shows a semantic link network. A directed arc that starts at N_i and ends at N_j represents a causal relation from N_i to N_j; its weight w_ij ∈ [-1, +1] indicates how strongly the cause node N_i influences the effect node N_j.

Figure 2(b) shows the adjacency matrix of that semantic link network. It is an n×n matrix, where n is the number of nodes in the network. If there is a causal relation from N_i to N_j, the element in row i, column j of the matrix is w_ij; otherwise it is 0.
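
The adjacency matrix described here can be built from a list of weighted arcs. The sketch below is illustrative only: the function name `adjacency_matrix`, the dictionary format for arcs, and the three-node example network are assumptions, not taken from the source or from Figure 2.

```python
def adjacency_matrix(n, arcs):
    """Build the n x n adjacency matrix of a semantic link network.

    arcs maps (i, j) to w_ij for a causal arc from node N_i to node N_j,
    using 1-based node numbers as in the text.
    """
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), w in arcs.items():
        if not -1.0 <= w <= 1.0:
            raise ValueError(f"weight {w} for arc ({i}, {j}) is outside [-1, +1]")
        matrix[i - 1][j - 1] = w   # row i, column j holds w_ij
    return matrix


# Hypothetical network: keywords N_2 and N_3 both influence interest point N_1.
E = adjacency_matrix(3, {(2, 1): 0.8, (3, 1): 0.5})
```

Pairs of nodes without a causal arc keep the default value 0, which matches the rule for the matrix elements stated above.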

Figure 3 is the flowchart of document understanding and classification. The steps are as follows:

S3-1. Select a document d from the team document database;

S3-2. Select a subfield nd_i and obtain its interest point set SG-set_i;

S3-3. Calculate the semantic similarity md(d, nd_i) between document d and subfield nd_i:

S3-3.1 Divide document d into several small parts p₁, p₂, ..., p_m. The division can be by byte count or by paragraph; here it is by section, and sections that contain subsections are divided further;

S3-3.2 For each small part p_j, set md_Part-ji = 0;

S3-3.3 For each element SG_r of the interest point set SG-set_i of subfield nd_i:

(1) Calculate the state value V_k′ in p_j of each keyword N_k contained in SG_r: V_k′ = tanh(S_k/3), where S_k is the number of times N_k appears in p_j;

(2) (V₁′, V₂′, ..., V_m′) = (0, V₂′, ..., V_m′) × E_r, where E_r is the adjacency matrix of SG_r;

(3) If md_Part-ji < V₁′, then md_Part-ji = V₁′;

S3-3.4 $md(d, nd_i) = \sum_{j=1}^{m} md_{Part\text{-}ji} / m$, where document d is divided into m small parts;

S3-4. If md(d, nd_i) > 0.65, assign document d to subfield nd_i; go to S3-2.

The method above calculates the semantic similarity between each document and each subfield at the level of interest points, and assigns the document to the subfields with which its semantic similarity is high; a document may belong to several subfields at once. The team document database holds a large number of existing scientific and technical documents and also accepts documents uploaded by team members, so it keeps growing. The database must therefore be checked periodically for newly added documents, and any new document is assigned to the appropriate subfields by the method above.
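
Steps S3-3.1 to S3-3.4 can be sketched as follows. This is one possible reading, not a definitive implementation: the function names are hypothetical, keyword occurrences are counted by naive case-insensitive substring matching (an assumption, since the source does not say how occurrences are counted), each semantic link network is passed as a pair of its keyword list and its adjacency matrix, and V₁′ is taken to be the first component of the row-vector product in step (2).

```python
import math


def state_value(count):
    """V' = tanh(S / 3): saturating keyword count from step S3-3.3 (1)."""
    return math.tanh(count / 3)


def interest_point_activation(part, keywords, E):
    """Return V_1' for one semantic link network SG_r on one document part.

    keywords[k] belongs to node N_(k+2); index 0 is the interest point N_1,
    whose state starts at 0. E is the adjacency matrix of SG_r.
    """
    text = part.lower()
    v = [0.0] + [state_value(text.count(kw.lower())) for kw in keywords]
    # First component of the product (0, V_2', ..., V_m') x E_r.
    return sum(v[k] * E[k][0] for k in range(len(v)))


def md(parts, sg_set):
    """md(d, nd_i): mean over parts of the best-matching interest point."""
    if not parts:
        return 0.0
    total = 0.0
    for part in parts:
        md_part = 0.0                                  # step S3-3.2
        for keywords, E in sg_set:                     # step S3-3.3
            md_part = max(md_part, interest_point_activation(part, keywords, E))
        total += md_part
    return total / len(parts)                          # step S3-3.4


def belongs_to(parts, sg_set, threshold=0.65):
    """Step S3-4 with the threshold given in the text."""
    return md(parts, sg_set) > threshold
```

Because a part keeps only the largest V₁′ over all interest points, a part contributes to the average as soon as it matches one interest point of the subfield well.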

4. Pushing documents effectively according to user interests

Document understanding assigns the documents in the team document database to the subfields related to the team. A document pushing program (for example, FileDeliver) is written that, on the basis of the document classification results and the users' interests, selects suitable documents from the team document database and pushes them to team members. Because team members tend to read documents related to the subfields they focus on, the program pushes each document, as an email attachment, to the users whose set of focused subfields includes a subfield to which the document belongs.

Each document in the document database has two lists, "sent to" and "uploaded by". The "sent to" list records the team members to whom the document has already been pushed; when FileDeliver runs, it pushes a document only to team members who are not in that document's "sent to" list. When a member uploads a document to the team document database, the upload succeeds if the database does not yet contain the document; otherwise the member is told that the document is a duplicate. Whether or not the upload succeeds, the member is recorded in the document's "uploaded by" list. Because a document that a member tries to upload is necessarily one the member already has, FileDeliver does not push a document to members in its "uploaded by" list either. Team members thus only need to perform a simple upload for a document to be shared among all members who need it, which is simple and effective.
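
The recipient selection described above can be sketched as follows. This is not the FileDeliver program itself: the `Document` record, the function names, and the `send` callback (standing in for sending an email attachment) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:                                    # hypothetical record
    name: str
    subfields: set                                 # subfields the document belongs to
    sent_to: set = field(default_factory=set)      # members already pushed to
    uploaded_by: set = field(default_factory=set)  # members who tried to upload it


def recipients(doc, interests):
    """Members who should receive doc; interests maps member -> set of subfields."""
    return {
        member
        for member, focused in interests.items()
        if focused & doc.subfields            # member focuses on one of its subfields
        and member not in doc.sent_to         # not pushed to this member before
        and member not in doc.uploaded_by     # member already owns the document
    }


def deliver(doc, interests, send):
    """Push doc once to each eligible member; send(member, doc) is caller-supplied."""
    for member in sorted(recipients(doc, interests)):
        send(member, doc)
        doc.sent_to.add(member)
```

Running `deliver` repeatedly is safe in this sketch, because every member who receives the document is added to its `sent_to` set and is skipped on the next run.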

Claims

1. A method for discovering user interests in an email stream and effectively pushing documents accordingly, comprising the following steps: first, storing the emails exchanged among team members in an email database and extracting the valid emails from them; then, extracting user interests from the distribution of the valid emails, and classifying the documents in the team document database through semantic analysis; and finally, according to the user interests and the document classification results, pushing to team members, by email, the documents that match their interests.

2. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 1, wherein an email collection program decodes the emails exchanged among team members and stores the decoded content in the email database, and automatic storage of the emails is achieved by running the email collection program periodically; because spam mostly comes from unknown email addresses and the method considers only emails exchanged among members, interference from spam is eliminated when user interests are extracted.

3. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 1, wherein a natural language learning method is used to obtain the valid emails, which provide useful information for describing user interests, and only the interests of team members in their scientific research work are considered, thereby ensuring the accuracy of the extracted user interests.

4. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 1, wherein the research field related to the team is subdivided into subfields, a prior knowledge set is established for each subfield to represent its background knowledge, the valid emails are classified by calculating the similarity between each valid email and the prior knowledge set of each subfield, and user interests are represented by the set of subfields that a member focuses on, according to the distribution of the valid emails over the subfields.

5. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 1, wherein, because user interests may change over a longer period of time, a time factor is introduced into the interest extraction process, so that user interests are updated promptly as new emails are generated and as time passes, and pushing documents according to user interests ensures that each document is always pushed to all team members who need it, with neither wrong deliveries nor missed deliveries.

6. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 1, wherein an interest point set describing the semantics of each subfield is constructed, the documents are assigned to the subfields semantically similar to them by using the interest point sets as templates, and a document pushing program pushes each document to the members who focus on a subfield to which the document belongs, which ensures that the documents pushed to a user are semantically what the user needs; team members only need to upload a document to the team document database for the program to complete its pushing, so the method is simple and easy to implement.

7. A method for discovering user interests in an email stream and effectively pushing documents accordingly, characterized by mainly comprising the following four parts:

First, automatic email storage and valid email extraction, wherein:

1. Building the email database

The team members use a common email server and server program, and a database file is created in a directory on the email server to store the emails exchanged among team members;

2. Automatic email storage

First, the mail server program automatically forwards all emails exchanged among team members to a fixed account, and the mail of that account is stored in a fixed directory on the mail server; then the mail collection program is run periodically to store the emails automatically, the program decoding each email and storing the decoded result in the corresponding fields of the email database;

3. Extracting valid emails

Only the user's interests in scientific research work are considered, and the valid emails, which provide useful information for describing user interests, are extracted by a natural language learning method;

Second, valid email classification and user interest extraction

The research field related to the team is divided into smaller subfields, and the background knowledge of subfield nd_i is represented by its prior knowledge set K_i; K_i is a set of pairs (n_k, a_k), where n_k is one of a group of keywords that together reflect the main content of nd_i, and a_k is the weight of n_k, which represents how well n_k describes nd_i: the higher a_k, the stronger the descriptive power of n_k;

the valid emails are classified by calculating the similarity between each valid email and the prior knowledge set of each subfield, and user interests are represented by the set of subfields that a member focuses on, according to the distribution of the valid emails over the subfields;

Third, document understanding and classification

A basic concept, viewpoint or method is called an interest point, and a semantic link network (SG) represents the semantic information of an interest point, where SG = (N, R); N is the set of nodes, consisting of one interest point N₁ and a group of keywords {N₂, N₃, ..., N_m} that together express the semantics of N₁; R is the set of directed arcs, which represent causal relations between nodes; the interest point set SG-set_i of subfield nd_i describes all the semantic information that nd_i implies, and its elements are the semantic link networks corresponding to the interest points that nd_i contains; using the interest point sets of the subfields as templates, each document is assigned to the subfields that are semantically similar to it;

Fourth, pushing documents effectively according to user interests

A document pushing program is written that pushes each document, as an email attachment, to the users whose set of focused subfields includes a subfield to which the document belongs; each document has two lists, "sent to" and "uploaded by", and the document pushing program pushes a document only to team members who appear in neither list, which avoids repeated sending; members only need to upload a document to the team document database for it to be shared among all members who need it, so the method is simple and effective.

8. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 7, wherein, in the first part, automatic email storage and valid email extraction,

3. Extracting valid emails

First, a certain number of valid emails and invalid emails are selected as the valid email training set C₁ and the invalid email training set C₂, respectively, and the standard vectors $\vec{c}_1$ and $\vec{c}_2$ of valid and invalid emails are obtained by the following formulas:

$\vec{c}_1 = 16 \frac{1}{|C_1|} \sum_{e \in C_1} \frac{\vec{e}}{|\vec{e}|} - 4 \frac{1}{|C_2|} \sum_{e \in C_2} \frac{\vec{e}}{|\vec{e}|} \qquad (1)$

$\vec{c}_2 = 16 \frac{1}{|C_2|} \sum_{e \in C_2} \frac{\vec{e}}{|\vec{e}|} - 4 \frac{1}{|C_1|} \sum_{e \in C_1} \frac{\vec{e}}{|\vec{e}|} \qquad (2)$

where $\vec{e} = (e_1, e_2, \ldots, e_{|F|})$ is the vector representation of email e, and e_i is the number of occurrences of keyword w_i in the subject and body of email e; $|\vec{e}|$ is the length of the vector $\vec{e}$; |C₁| and |C₂| are the numbers of training samples in C₁ and C₂, that is, the numbers of emails they contain.

Then the similarity between the vector representation $\vec{e}$ of email e and each of the standard vectors $\vec{c}_1$ and $\vec{c}_2$ is calculated as follows:

$\cos(\vec{e}, \vec{c}_n) = \frac{\sum_{i=1}^{|F|} e_i c_i}{\sqrt{\sum_{i=1}^{|F|} e_i^2} \sqrt{\sum_{i=1}^{|F|} c_i^2}} \qquad (3)$

where n = 1 or n = 2.

If $\cos(\vec{e}, \vec{c}_1) > \cos(\vec{e}, \vec{c}_2)$, e is a valid email; otherwise e is an invalid email.
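
For illustration only, and not as part of the claim, formulas (1) to (3) and the decision rule can be sketched in Python. The function names are hypothetical, emails are assumed to be already converted to keyword-count vectors of equal length, and the handling of all-zero vectors (left unnormalized, similarity 0) is an assumption the source does not address.

```python
import math


def normalize(v):
    """Return e / |e|; an all-zero vector is returned unchanged (assumption)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)


def mean_unit_vector(vectors):
    """(1 / |C|) times the sum of e / |e| over the training set C."""
    units = [normalize(v) for v in vectors]
    return [sum(col) / len(units) for col in zip(*units)]


def standard_vector(own, other):
    """Formulas (1) and (2): 16 * mean(own set) - 4 * mean(other set)."""
    own_mean = mean_unit_vector(own)
    other_mean = mean_unit_vector(other)
    return [16 * a - 4 * b for a, b in zip(own_mean, other_mean)]


def cosine(u, v):
    """Formula (3); returns 0.0 if either vector has zero length (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_valid(e, c1, c2):
    """An email is valid if it is closer to c1 than to c2."""
    return cosine(e, c1) > cosine(e, c2)
```

With training sets `C1` and `C2`, the two standard vectors would be obtained as `standard_vector(C1, C2)` and `standard_vector(C2, C1)`.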

9. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 7, wherein, in the second part, valid email classification and user interest extraction,

first, the probability that valid email e relates to subfield nd_i is calculated:

where n_k is a keyword belonging to K_i that is contained in the subject and body of email e; D_l is the set of such n_k; S_kl is the number of occurrences of keyword n_k in those parts of the email, and f(S_kl) = tanh(S_kl/3); R and N are the numbers of elements of D_l and K_i, respectively. Email e is assigned to the subfield with the highest probability, which classifies the valid emails;

then, the percentage per_ij of user i's research work that involves subfield j is calculated:

$per_{ij} = \frac{\alpha \sum_{(e \in nd_j) \cap (e \in from_i)} 2^{-\frac{age(e)}{hl}} sim(e, K_j) + \beta \sum_{(e \in nd_j) \cap (e \in to_i)} 2^{-\frac{age(e)}{hl}} sim(e, K_j)}{\alpha \sum_{e \in from_i} 2^{-\frac{age(e)}{hl}} sim(e, K_j) + \beta \sum_{e \in to_i} 2^{-\frac{age(e)}{hl}} sim(e, K_j)} \times 100\% \qquad (5)$

where α = 1 and β = 0.8 express how well the valid emails sent by the user and the valid emails received by the user, respectively, describe the user's interest; the factor $2^{-age(e)/hl}$ makes the descriptive power of an email decrease as the email ages, where age(e) is the difference between the current date and the sending date of email e, and hl = 30 means that an email sent 30 days ago has half the descriptive power of an email sent today; from_i is the set of valid emails sent by user i, and to_i is the set of valid emails received by user i; if per_ij is greater than the threshold, subfield nd_j is added to the set of subfields that user i focuses on, where the threshold is 10%.

10. The method for discovering user interests in an email stream and effectively pushing documents accordingly as claimed in claim 7, wherein the steps of the third part, document understanding and classification, are as follows:

S3-1. Select a document d from the team document database;

S3-2. Select a subfield nd_i and obtain its interest point set SG-set_i;

S3-3. Calculate the semantic matching degree md(d, nd_i) between document d and subfield nd_i:

S3-3.1 Divide document d into several small parts p₁, p₂, ..., p_m;

S3-3.2 For each small part p_j, set md_Part-ji = 0;

S3-3.3 For each element SG_r of the interest point set SG-set_i of subfield nd_i:

(1) Calculate the state value V_k′ in p_j of each keyword N_k contained in SG_r: V_k′ = tanh(S_k/3), where S_k is the number of times N_k appears in p_j;

(2) (V₁′, V₂′, ..., V_m′) = (0, V₂′, ..., V_m′) × E_r, where E_r is the adjacency matrix of SG_r;

(3) If md_Part-ji < V₁′, then md_Part-ji = V₁′;

S3-3.4 $md(d, nd_i) = \sum_{j=1}^{m} md_{Part\text{-}ji} / m$, where document d is divided into m small parts;

S3-4. If md(d, nd_i) > 0.65, assign document d to subfield nd_i; go to S3-2.