CN112084150B

CN112084150B - Model training, data retrieval method, device, equipment and storage medium

Info

Publication number: CN112084150B
Application number: CN202010939453.6A
Authority: CN
Inventors: 潘秋桐; 和为; 刘准; 何伯磊; 李雅楠; 巩江传; 李瑞高
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-09-09
Filing date: 2020-09-09
Publication date: 2024-07-26
Anticipated expiration: 2040-09-09
Also published as: CN112084150A

Abstract

The application discloses a model training and data retrieval method, a device, equipment and a storage medium, and relates to the fields of information retrieval and knowledge sharing. The specific implementation scheme is as follows: acquiring a historical search click log generated by a knowledge sharing system; generating a sample set according to the historical search click log; and training the model by using the sample set to obtain a target model. According to the implementation mode, the search click log generated by the user in the enterprise-level wiki is analyzed, and the target model is obtained through training, so that the required knowledge can be quickly and accurately found through the target model.

Description

Model training, data retrieval method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technology, specifically to the fields of information retrieval and knowledge sharing, and in particular to model training and data retrieval methods, apparatuses, devices and storage media.

Background

Enterprises of every size face a common problem. As a company grows, projects accumulate and staff turn over, a large number of documents containing employees' valuable experience and knowledge are produced. Unless these documents are managed online in a unified manner, it is difficult to systematize and standardize that knowledge, and some knowledge and experience may be lost when key employees leave. Most enterprises have therefore introduced an enterprise-level wiki, which gathers the knowledge documents accumulated in office scenarios in one place and serves as the enterprise's internal search engine.

This introduces a new problem: given a huge amount of knowledge, how can the required knowledge be found quickly and accurately? Most enterprise-level wikis are weak at meeting this user need, which hampers and even slows down knowledge transfer and office efficiency.

Summary of the Invention

Provided are model training and data retrieval methods, apparatuses, devices and storage media.

According to a first aspect, a model training method is provided, including: obtaining a historical search click log generated by a knowledge sharing system; generating a sample set based on the historical search click log; and training a model using the sample set to obtain a target model.

According to a second aspect, a data retrieval method is provided, including: receiving a target search statement input by a user through a terminal; determining a target feature vector based on the target search statement and the target model described in the first aspect; determining a feature vector for each document in a target search result for the target search statement; and sorting the documents in the target search result based on each feature vector and the target feature vector.

According to a third aspect, a model training apparatus is provided, including: a log acquisition unit, configured to obtain a historical search click log generated by a knowledge sharing system; a sample generation unit, configured to generate a sample set based on the historical search click log; and a model training unit, configured to train a model using the sample set to obtain a target model.

According to a fourth aspect, a data retrieval apparatus is provided, including: a search statement receiving unit, configured to receive a target search statement input by a user through a terminal; a first vector determination unit, configured to determine a target feature vector based on the target search statement and the target model described in the first aspect; a second vector determination unit, configured to determine a feature vector for each document in a target search result for the target search statement; and a document sorting unit, configured to sort the documents in the target search result based on each feature vector and the target feature vector.

According to a fifth aspect, a model training electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described in the first aspect.

According to a sixth aspect, a data retrieval electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described in the second aspect.

According to a seventh aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method described in the first aspect.

According to an eighth aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method described in the second aspect.

The technology of the present application analyzes the search click logs generated by users in an enterprise-level wiki and uses them to train a target model, through which the required knowledge can be found quickly and accurately.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present application. In the drawings:

FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;

FIG. 2 is a flow chart of one embodiment of the model training method according to the present application;

FIG. 3 is a flow chart of another embodiment of the model training method according to the present application;

FIG. 4 is a flow chart of one embodiment of the data retrieval method according to the present application;

FIG. 5 is a schematic diagram of an application scenario of the model training method and the data retrieval method according to the present application;

FIG. 6 is a schematic structural diagram of one embodiment of the model training apparatus according to the present application;

FIG. 7 is a schematic structural diagram of one embodiment of the data retrieval apparatus according to the present application;

FIG. 8 is a block diagram of an electronic device used to implement the model training method and the data retrieval method of the embodiments of the present application.

Detailed Description

Exemplary embodiments of the present application are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in combination with the embodiments.

FIG. 1 shows an exemplary system architecture 100 to which embodiments of the model training method, data retrieval method, model training apparatus or data retrieval apparatus of the present application can be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links or optical fiber cables.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications, such as browser applications and social platform applications, may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, in-vehicle computers, laptop computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.

The server 105 may be a server that provides various services, such as a background retrieval server that processes the search statements sent by the terminal devices 101, 102, 103. The background retrieval server may sort the search results and return the sorted results to the terminal devices 101, 102, 103.

It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster consisting of multiple servers or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.

It should be noted that the model training method and the data retrieval method provided in the embodiments of the present application are generally performed by the server 105. Accordingly, the model training apparatus and the data retrieval apparatus are generally provided in the server 105.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, depending on implementation needs.

With continued reference to FIG. 2, a process 200 of one embodiment of the model training method according to the present application is shown. The model training method of this embodiment includes the following steps:

Step 201: obtain the historical search click log generated by the knowledge sharing system.

In this embodiment, the execution subject of the model training method (for example, the server 105 shown in FIG. 1, or another electronic device not shown in FIG. 1) can obtain the historical search click log generated by the knowledge sharing system. Here, the knowledge sharing system may be an enterprise-level wiki. Employees within the enterprise can share knowledge through the knowledge sharing system, and can also search it and obtain knowledge by browsing the search results. Search click logs are generated as users search and browse. The execution subject may take the search click logs generated by the knowledge sharing system during the previous day or the previous week as the historical search click log.

Step 202: generate a sample set based on the historical search click log.

After obtaining the historical search click log, the execution subject can parse it to obtain a sample set. Specifically, the execution subject can extract from the historical search click log the search statements entered by users, as well as the users' click information on each document in the search results. The execution subject can treat a search statement paired with a clicked document as a positive sample, and a search statement paired with an unclicked document as a negative sample. In this way, the execution subject obtains multiple positive samples and multiple negative samples, which together form the sample set.
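Step 202 can be sketched as follows. This is only an illustration under stated assumptions: the log entry layout (`query`, `results`, `clicked`), the `Sample` type and the `build_samples` helper are hypothetical names, since the source does not specify a log format.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    query: str    # the search statement
    doc_id: str   # a document from that statement's search results
    label: int    # 1 = clicked (positive sample), 0 = not clicked (negative sample)


def build_samples(log_entries):
    """Pair each search statement with every result document and label it by click."""
    samples = []
    for entry in log_entries:
        clicked = set(entry["clicked"])
        for doc_id in entry["results"]:
            samples.append(Sample(entry["query"], doc_id, int(doc_id in clicked)))
    return samples


# Hypothetical log entries, one per search.
log = [
    {"query": "vpn setup", "results": ["d1", "d2", "d3"], "clicked": ["d2"]},
    {"query": "expense policy", "results": ["d4", "d5"], "clicked": ["d4", "d5"]},
]
samples = build_samples(log)
positives = [s for s in samples if s.label == 1]
negatives = [s for s in samples if s.label == 0]
```

Here every unclicked result becomes a negative sample; the later embodiment narrows this to unclicked documents near a clicked one.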

Step 203: train the model using the sample set to obtain the target model.

After obtaining the sample set, the execution subject can use it to train the model and obtain the target model. Specifically, the execution subject can take the search statements in the sample set as input and the documents in the positive and negative samples as the expected output, and train to obtain the target model.

The model training method provided by the above embodiment of the present application analyzes the search click logs generated by users in an enterprise-level wiki and uses them to train a target model, through which the required knowledge can be found quickly and accurately.

With continued reference to FIG. 3, a process 300 of another embodiment of the model training method according to the present application is shown. In the embodiment shown in FIG. 3, the method may include the following steps:

Step 301: obtain the historical search click log generated by the knowledge sharing system.

In this embodiment, the historical search click log includes at least one search statement and the click information of each document in the search results for the at least one search statement. It can be understood that the historical search click log also includes the identifier of the user who entered each search statement, and may further include each user's click information.

Step 302: filter the historical search click log to obtain filtered data.

In this embodiment, the execution subject can filter the historical search click log to obtain filtered data. This improves the accuracy of the samples in the sample set and therefore the accuracy of the target model's output. Specifically, the execution subject may filter out search statements in the historical search click log whose search count is less than a preset value, or whose search count is greater than a preset value.

In some optional implementations of this embodiment, the execution subject may filter the historical search click log through the following steps:

Step 3021: based on the historical search click log, determine the number of searches for each search statement and the number of clicks on each document.

Step 3022: filter out the search statements whose search count is less than a first preset threshold and/or the documents whose click count is less than a second preset threshold, to obtain the filtered data.

In this implementation, the execution subject first determines, from the historical search click log, the number of searches for each search statement and the number of clicks on each document. If a search statement has too few searches, it is relatively obscure and contributes little to learning, so it can be filtered out. If a document has too few clicks, it is not a very accurate result for the search statement, so it can be filtered out.
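A minimal sketch of steps 3021 and 3022 is shown below. The entry layout, the `filter_log` helper and the threshold values are assumptions made for illustration. Counting clicks per (search statement, document) pair is one reading of "the number of clicks on each document"; the source does not fix the granularity.

```python
from collections import Counter


def filter_log(entries, min_searches, min_clicks):
    """Drop rarely searched statements and rarely clicked documents."""
    # Step 3021: count searches per statement and clicks per (statement, document).
    search_counts = Counter(e["query"] for e in entries)
    click_counts = Counter((e["query"], d) for e in entries for d in e["clicked"])

    # Step 3022: apply the first and second preset thresholds.
    filtered = []
    for e in entries:
        if search_counts[e["query"]] < min_searches:
            continue  # obscure search statement: discard the whole entry
        kept = [d for d in e["clicked"] if click_counts[(e["query"], d)] >= min_clicks]
        filtered.append({**e, "clicked": kept})
    return filtered


# Hypothetical entries, one per search.
entries = [
    {"query": "vpn setup", "clicked": ["d1"]},
    {"query": "vpn setup", "clicked": ["d1", "d2"]},
    {"query": "rare term", "clicked": ["d9"]},
]
filtered = filter_log(entries, min_searches=2, min_clicks=2)
```

With these thresholds the single search for "rare term" is discarded, and the one-off click on `d2` is removed from the remaining entries.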

Step 303: for each search statement, determine the clicked document set of that search statement from its search results, based on the click information of each document in those results.

In this embodiment, the execution subject can analyze each search statement included in the historical search click log and determine each user's click information on the documents in the search results returned for that search statement. In other words, the execution subject can work out which documents in the search results the user clicked and which the user did not, thereby obtaining the clicked document set.

Step 304: determine a positive sample set based on the search statement and the clicked document set.

The execution subject can determine the positive sample set from the search statement and the clicked document set obtained above. Specifically, the execution subject can treat the search statement paired with a single clicked document as a single positive sample, thereby obtaining the positive sample set.

Step 305: determine a negative sample set for the search statement based on the search results and the clicked document set.

After obtaining the clicked document set, the execution subject can also determine the unclicked documents from the search results. The execution subject can treat the search statement paired with a single unclicked document as a single negative sample, thereby obtaining the negative sample set.

In some optional implementations of this embodiment, the search results may include multiple documents and the ranking of those documents. That is, the documents in the search results have already been ranked. The execution subject may obtain the negative sample set through the following steps:

Step 3051: based on the ranking, determine multiple unclicked documents adjacent to each clicked document in the clicked document set, to obtain an unclicked document set.

Step 3052: obtain the negative sample set based on the search statement and the unclicked document set.

In this implementation, the execution subject can identify, within the ranking, multiple unclicked documents adjacent to each clicked document, to obtain the unclicked document set. For example, if the search statement is "abc" and the user clicked the 4th document in the search results, then the 1st, 2nd, 3rd, 5th and 6th documents are unclicked documents. The execution subject can pair the search statement with each unclicked document to form the negative samples, thereby obtaining the negative sample set.
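The "abc" example can be reproduced with the sketch below. The `adjacent_unclicked` helper and its `window` parameter are assumptions: the source says only that the unclicked documents are "adjacent" to a clicked one and does not state how far adjacency extends.

```python
def adjacent_unclicked(ranked_docs, clicked, window=3):
    """Return unclicked documents ranked within `window` positions of any clicked one."""
    clicked = set(clicked)
    picked = []
    for pos, doc in enumerate(ranked_docs):
        if doc in clicked:
            continue
        lo = max(0, pos - window)
        hi = min(len(ranked_docs), pos + window + 1)
        if any(ranked_docs[j] in clicked for j in range(lo, hi)):
            picked.append(doc)
    return picked


# Six ranked results for the search statement "abc"; the user clicked the 4th.
ranked = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
negatives = [("abc", d) for d in adjacent_unclicked(ranked, clicked=["doc4"])]
```

With a window of 3 the negatives are the 1st, 2nd, 3rd, 5th and 6th documents, matching the example; a narrower window would keep only the immediate neighbours.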

In some optional implementations of this embodiment, the ratio of the number of negative samples to the number of positive samples in the sample set is a preset value.

In this implementation, to preserve the model's learning ability while keeping the computation during training manageable, the ratio of the number of negative samples to the number of positive samples can be held at a preset value. The preset value can be set according to the actual application scenario.
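One simple way to hold that ratio is to downsample the negatives, as sketched here. Random downsampling, the `cap_negatives` helper and the ratio of 3 are assumptions; the source states only that the ratio is a preset value.

```python
import random


def cap_negatives(positives, negatives, ratio, seed=0):
    """Keep at most `ratio` negative samples per positive sample."""
    limit = int(len(positives) * ratio)
    if len(negatives) <= limit:
        return list(negatives)
    # Fixed seed so the sketch is reproducible.
    return random.Random(seed).sample(negatives, limit)


positives = [("abc", "doc4"), ("abc", "doc7")]
negatives = [("abc", f"doc{i}") for i in range(10, 20)]
kept = cap_negatives(positives, negatives, ratio=3)
```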

Step 306: for each sample in the training set, determine the feature vector corresponding to the document in that sample.

In this embodiment, the sample set may include a training set and a test set. The samples in the training set are used to train the model, and the samples in the test set are used to test the trained model. For each sample in the training set, the execution subject can determine the feature vector corresponding to the document in that sample. Specifically, the execution subject can determine the feature vector with an existing feature extraction algorithm, or by evaluating certain features of the document.

In some optional implementations of this embodiment, the execution subject may determine the feature vector corresponding to the document as follows: obtain the feature vector based on the search statement and the document in the sample and on at least one pre-trained model.

In this implementation, the execution subject can obtain the feature vector based on the search statement and the document in the sample and on at least one pre-trained model. For example, the execution subject can compute at least one of the following items to determine the feature vector: the relevance score of the document, obtained by inputting the search statement and the document into a pre-trained relevance model; the click weight of the document under the search statement over the most recent week; the proportion of clicks on the document under the search statement that are first clicks; the proportion that are last clicks; the proportion that are satisfied clicks; the proportion that are long clicks; and the proportion that are short clicks.

Among these items, the relevance model is used to compute the relevance between the search statement and the document. The click weight can be computed with the Wilson score algorithm, which can be used for quality ranking. If the data includes both positive and negative reviews, the algorithm considers both the number of reviews and the positive rate when computing the score. The higher the score, the higher the quality of the data. For example, suppose doctor A has 100 reviews, 99 positive and 1 negative, for a positive rate of 99%, while doctor B has 2 reviews, both positive, for a positive rate of 100%. Which should rank first? With the Wilson score, doctor A scores 0.9440 and doctor B scores 0.3333, so doctor A ranks first.
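The two scores in the doctor example are reproduced by the lower bound of the Wilson score interval with z = 2. That value of z is inferred from the quoted numbers and is not stated in the source, and the function name is a hypothetical one; treat this as a sketch of how the click weight could be computed.

```python
import math


def wilson_lower_bound(positive, total, z=2.0):
    """Lower bound of the Wilson score interval for a proportion.

    z = 2.0 is an assumption chosen because it matches the example scores.
    """
    if total == 0:
        return 0.0
    p = positive / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)


score_a = wilson_lower_bound(99, 100)  # doctor A: 99 positive out of 100
score_b = wilson_lower_bound(2, 2)     # doctor B: 2 positive out of 2
print(round(score_a, 4), round(score_b, 4))
```

For a click weight, `positive` could be the number of clicks on a document under a search statement and `total` the number of times it was shown; that mapping is also an assumption.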

Among these items, the first click is the first click the user makes while browsing the search results, and the last click is the last click the user makes while browsing the search results. A satisfied click is the last click under the last search statement of a single search intent. Whether search statements share the same search intent is determined from the statements themselves. Specifically, the execution subject can compute the similarity between search statements and compare it with a preset threshold to decide whether they are similar. If they are similar, the search intent is considered to be the same. The similarity between search statements can be obtained from the edit distance between them or from semantic analysis of the two.

Long clicks may include satisfied clicks. In addition, consider the last click under a search statement that is not the last one of its search intent. If the time difference between that click and the next search statement is greater than a first preset duration (for example, 40 s) and less than a second preset duration (for example, 3600 s), the click is a long click. If the time difference is greater than a third preset duration (for example, 0 s) and less than a fourth preset duration (for example, 5 s), the click is a short click. Likewise, for two adjacent clicks, if the time difference between the earlier click and the later click is greater than the first preset duration and less than the second preset duration, the earlier click is a long click. If the time difference is greater than the third preset duration and less than the fourth preset duration, the earlier click is a short click.
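The long and short click rules can be sketched as a single classifier over the time gap that follows a click (to the next click or to the next search statement). The function name, the `"other"` label for gaps that fall in neither range, and the `is_satisfied` flag are assumptions for illustration; the default durations are the example values from the text.

```python
def classify_click(gap_seconds, is_satisfied=False,
                   long_min=40, long_max=3600, short_min=0, short_max=5):
    """Label a click as 'long', 'short' or 'other' from the gap that follows it."""
    if is_satisfied:
        return "long"  # long clicks may include satisfied clicks
    if long_min < gap_seconds < long_max:
        return "long"
    if short_min < gap_seconds < short_max:
        return "short"
    return "other"  # the source does not name clicks outside both ranges


labels = [classify_click(g) for g in (120, 3, 20)]
```

A satisfied click has no following event within the same intent, so it is flagged explicitly rather than classified by its gap.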

In addition, the execution subject may determine at least one of the following items for the document: an authority score, computed with a pre-trained authority model; a quality score, computed with a pre-trained quality model; the timeliness of the document, determined from the current time, the document's creation time and a preset time period; an authority feature, computed with the PageRank model (a web page ranking algorithm first proposed and used by Google, which in essence estimates the importance of a web page mainly from the number and quality of the hyperlinks between pages); and the search frequency of the search statement over the most recent week.

The execution subject may use the items above as the feature vector of the document in the sample.
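As an illustration of the PageRank feature, the sketch below runs a basic power iteration over a link graph. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links and the tiny example graph are all assumptions; the source names PageRank but gives no parameters.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration; `links` maps a page to the pages it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links.get(n) or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical wiki pages: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Page `c` receives links from two pages and ends up with the highest score, while `b`, which nothing links to, ends up with the lowest.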

步骤307，将训练集中各样本中的搜索语句作为输入，将输入的搜索语句对应的特征向量作为期望输出，训练得到目标模型。Step 307: Take the search sentence in each sample in the training set as input, take the feature vector corresponding to the input search sentence as the expected output, and train to obtain the target model.

执行主体可以将训练集中的各样本中的搜索语句作为模型的输入，将输入的搜索语句对应的特征向量作为期望输出，训练得到目标模型。The execution subject can use the search sentences in each sample in the training set as the input of the model, and the feature vector corresponding to the input search sentence as the expected output, and train to obtain the target model.

Step 308: use the test set to determine the search effect of the target model.

In this embodiment, the execution subject may also use the test set to determine the search effect of the target model. Specifically, the execution subject may take the search statement in each sample of the test set as the input of the target model and compare the output of the target model with the document corresponding to the input search statement. It can be understood that if the two are similar, the search effect of the target model is considered good; if they are not similar, the search effect is considered poor.

In some specific implementations, the execution subject may evaluate the target model using objective indicators and subjective indicators. The objective indicators include accuracy (ACC) and the area under the ROC curve (AUC). The subjective indicators may include GSB (good, same, bad), in which human raters judge whether the new results are better than, the same as, or worse than a baseline.
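For reference, the objective indicators named above can be computed as in the sketch below. The AUC is computed through its pairwise interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one). The text does not say how GSB judgements are aggregated; the `(good - bad) / total` summary used here is one common convention and is an assumption, as are the function names and the 0.5 threshold.

```python
from typing import List


def accuracy(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gsb(judgements: List[str]) -> float:
    """Summarise 'good' / 'same' / 'bad' ratings as (good - bad) / total."""
    good = judgements.count("good")
    bad = judgements.count("bad")
    return (good - bad) / len(judgements)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small test set; a sort-based rank formula would be preferable for large ones.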

The model training method provided by the above embodiment of the present application analyzes the historical search click log in detail, filters out invalid data, and trains the model on the remaining valid data, thereby improving the accuracy of the model.

Referring to FIG. 4, a flow 400 of an embodiment of the data retrieval method according to the present application is shown. As shown in FIG. 4, the data retrieval method of this embodiment may include the following steps:

Step 401: receive a target search statement input by a user through a terminal.

In this embodiment, the execution subject of the data retrieval method (for example, the server 105 shown in FIG. 1) may receive a target search statement input by a user through a terminal (for example, the terminals 101, 102 and 103 shown in FIG. 1). It should be noted that the execution subject of this embodiment may be the same as or different from the execution subject of the embodiments shown in FIG. 2 and FIG. 3. The user may access the enterprise-level wiki through the terminal and enter the target search statement there.

Step 402: determine a target feature vector according to the target search statement and the target model.

After receiving the target search statement, the execution subject may input it into the target model to obtain the target feature vector of the target search statement. Here, the target model may be the target model obtained in the embodiment of FIG. 2 or FIG. 3.

Step 403: determine the feature vector of each document in the target search results for the target search statement.

The execution subject may further determine the feature vector of each document in the target search results for the target search statement, obtaining it by analyzing each document. For example, the execution subject may obtain at least one of the following items of information for each document: a relevance score of the document, obtained by inputting the search statement and the document into a pre-trained relevance model; the click weight of the document under the search statement in the past week; the proportion of cases in which the document is the first click under the search statement; the proportion of cases in which the document is the last click under the search statement; the proportion of cases in which the document is a satisfied click under the search statement; the proportion of cases in which the document is a long click under the search statement; the proportion of cases in which the document is a short click under the search statement; an authority score of the document, calculated by a pre-trained authority model; a quality score of the document, calculated by a pre-trained quality model; the timeliness of the document, determined from the current time, the creation time of the document and a preset time period; an authority feature of the document, calculated by the PageRank model (a web page ranking algorithm first proposed and used by Google, which roughly estimates the importance of a web page mainly from the number and quality of the hyperlinks between web pages); and the search frequency of the search statement in the past week.
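The PageRank-based authority feature mentioned above can be illustrated with a standard power-iteration sketch. The text does not say how the link graph between wiki documents is built or which parameters are used, so the adjacency-list input, the damping factor of 0.85 and the fixed iteration count are assumptions.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a graph given as {page: [linked pages]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The returned scores sum to 1, so a document's score can be used directly as one entry of its feature vector.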

Step 404: rank the documents in the target search results according to the feature vectors and the target feature vector.

The execution subject may rank the documents in the target search results according to the feature vector of each document and the target feature vector. Specifically, the execution subject may place documents whose feature vectors are similar to the target feature vector toward the front of the search results.
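The ranking in step 404 can be sketched as follows. The text only requires that documents with feature vectors similar to the target feature vector come first; cosine similarity is used here as one possible similarity measure and is an assumption, as are the names `cosine` and `rank_documents`.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(target: List[float],
                   doc_vectors: Dict[str, List[float]]) -> List[str]:
    """Return document ids ordered from most to least similar to `target`."""
    return sorted(doc_vectors,
                  key=lambda doc_id: cosine(target, doc_vectors[doc_id]),
                  reverse=True)
```

Because cosine similarity ignores vector length, features on very different scales (for example, a raw search frequency next to a score in [0, 1]) would normally be normalized before comparison.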

The data retrieval method provided by the above embodiment of the present application uses the trained target model to retrieve relevant documents quickly and accurately.

With continued reference to FIG. 5, a schematic diagram of an application scenario of the model training method and the data retrieval method according to the present application is shown. In the application scenario shown in FIG. 5, a user accesses the enterprise-level wiki through a terminal 501 and enters the search statement "abc". A server 502 may hold the target model locally. After receiving the search statement, the server 502 obtains multiple search results, ranks them using the target model, and outputs the ranked results.

With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a model training apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied to various electronic devices.

As shown in FIG. 6, the model training apparatus 600 of this embodiment includes a log acquisition unit 601, a sample generation unit 602 and a model training unit 603.

The log acquisition unit 601 is configured to acquire a historical search click log generated by a knowledge sharing system.

The sample generation unit 602 is configured to generate a sample set according to the historical search click log.

The model training unit 603 is configured to train a model using the sample set to obtain a target model.

In some optional implementations of this embodiment, the historical search click log includes at least one search statement and click information of each document in the search results for the at least one search statement. The sample generation unit 602 may be further configured to: for each search statement, determine a clicked document set of the search statement from the search results according to the click information of each document in the search results for that search statement; determine a positive sample set according to the search statement and the clicked document set; and determine a negative sample set of the search statement according to the search results and the clicked document set.

In some optional implementations of this embodiment, the search results include a plurality of documents and a ranking of the documents. The sample generation unit 602 may be further configured to: determine, according to the ranking, a plurality of non-clicked documents adjacent to each clicked document in the clicked document set, to obtain a non-clicked document set; and obtain the negative sample set according to the search statement and the non-clicked document set.
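The sample generation described above can be sketched as follows: clicked documents become positive samples, and non-clicked documents ranked next to a clicked document become negative samples. The `window` parameter (how many neighbouring positions count as adjacent), the `(search statement, document id, label)` tuple layout and the function name are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, str, int]  # (search statement, document id, label); hypothetical layout


def build_samples(query: str, ranked_docs: List[str], clicked: Set[str],
                  window: int = 1) -> List[Sample]:
    """Build positive samples from clicks and negatives from adjacent non-clicks."""
    positives = [(query, doc, 1) for doc in ranked_docs if doc in clicked]

    negative_ids: List[str] = []
    for position, doc in enumerate(ranked_docs):
        if doc not in clicked:
            continue
        # Look at the documents ranked within `window` positions of the click.
        start = max(0, position - window)
        end = min(len(ranked_docs), position + window + 1)
        for neighbour in ranked_docs[start:end]:
            if neighbour not in clicked and neighbour not in negative_ids:
                negative_ids.append(neighbour)

    negatives = [(query, doc, 0) for doc in negative_ids]
    return positives + negatives
```

Restricting negatives to neighbours of clicked documents uses results the user very likely saw but skipped, which is a stronger negative signal than a document far down the list that may never have been viewed.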

In some optional implementations of this embodiment, the sample set includes a training set. The model training unit 603 may be further configured to: for each sample of the training set, determine the feature vector corresponding to the document in the sample; and take the search statement in each sample of the training set as input and the feature vector corresponding to the input search statement as the expected output, and train to obtain the target model.

In some optional implementations of this embodiment, the model training unit 603 may be further configured to obtain the feature vector based on the search statement in the sample, the document in the sample, and at least one pre-trained model.

In some optional implementations of this embodiment, the apparatus 600 may further include a data filtering unit, not shown in FIG. 6, configured to filter the historical search click log to obtain filtered data.

In some optional implementations of this embodiment, the data filtering unit is further configured to: determine, according to the historical search click log, the number of searches of each search statement and the number of clicks of each document; and filter out search statements whose number of searches is less than a first preset threshold and/or documents whose number of clicks is less than a second preset threshold, to obtain the filtered data.
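A minimal sketch of this threshold filtering is given below. It assumes a simplified log in which every entry is one `(search statement, clicked document id)` pair, so each entry counts as one search and one click, and it applies both thresholds together (the text allows either or both). The log layout and the name `filter_log` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

LogEntry = Tuple[str, str]  # (search statement, clicked document id); hypothetical layout


def filter_log(entries: List[LogEntry],
               min_searches: int, min_clicks: int) -> List[LogEntry]:
    """Drop entries whose search statement or document is seen too rarely."""
    search_counts = Counter(query for query, _ in entries)
    click_counts = Counter(doc for _, doc in entries)
    return [
        (query, doc)
        for query, doc in entries
        if search_counts[query] >= min_searches and click_counts[doc] >= min_clicks
    ]
```

Rare search statements and rarely clicked documents carry little statistical signal, which is why removing them before sample generation tends to reduce noise in the training data.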

In some optional implementations of this embodiment, the sample set includes a test set. The apparatus 600 may further include a test unit, not shown in FIG. 6, configured to determine the search effect of the target model using the test set.

It should be understood that the units 601 to 603 of the model training apparatus 600 correspond to the respective steps of the method described with reference to FIG. 2. Therefore, the operations and features described above for the model training method also apply to the apparatus 600 and the units it contains, and are not repeated here.

With further reference to FIG. 7, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a data retrieval apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 4, and the apparatus may be applied to various electronic devices.

As shown in FIG. 7, the data retrieval apparatus 700 of this embodiment includes a search statement receiving unit 701, a first vector determination unit 702, a second vector determination unit 703 and a document ranking unit 704.

The search statement receiving unit 701 is configured to receive a target search statement input by a user through a terminal.

The first vector determination unit 702 is configured to determine a target feature vector according to the target search statement and the target model described in the embodiments of FIG. 2 and FIG. 3.

The second vector determination unit 703 is configured to determine the feature vector of each document in the target search results for the target search statement.

The document ranking unit 704 is configured to rank the documents in the target search results according to the feature vectors and the target feature vector.

It should be understood that the units 701 to 704 of the data retrieval apparatus 700 correspond to the respective steps of the method described with reference to FIG. 4. Therefore, the operations and features described above for the data retrieval method also apply to the apparatus 700 and the units it contains, and are not repeated here.

According to embodiments of the present application, the present application further provides an electronic device and a readable storage medium.

FIG. 8 is a block diagram of an electronic device for performing the model training method and the data retrieval method according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present application described and/or claimed herein.

As shown in FIG. 8, the electronic device includes one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory for displaying graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, if desired, together with multiple memories. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). One processor 801 is taken as an example in FIG. 8.

The memory 802 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the model training method and the data retrieval method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the model training method and the data retrieval method provided by the present application.

As a non-transitory computer-readable storage medium, the memory 802 may be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the model training method and the data retrieval method in the embodiments of the present application (for example, the log acquisition unit 601, the sample generation unit 602 and the model training unit 603 shown in FIG. 6, or the search statement receiving unit 701, the first vector determination unit 702, the second vector determination unit 703 and the document ranking unit 704 shown in FIG. 7). By running the non-transitory software programs, instructions and modules stored in the memory 802, the processor 801 executes the various functional applications and data processing of the server, that is, implements the model training method and the data retrieval method of the above method embodiments.

The memory 802 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the electronic device that performs the model training method and the data retrieval method. In addition, the memory 802 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 may optionally include memories located remotely from the processor 801, and these remote memories may be connected over a network to the electronic device that performs the model training method and the data retrieval method. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The electronic device that performs the model training method and the data retrieval method may further include an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803 and the output device 804 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 8.

The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device that performs the model training method and the data retrieval method. Examples of the input device include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball and a joystick. The output device 804 may include a display device, an auxiliary lighting device (for example, an LED) and a haptic feedback device (for example, a vibration motor). The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display and a plasma display. In some embodiments, the display device may be a touch screen.

Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and may receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.

These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus and/or device (for example, a magnetic disk, an optical disk, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory or tactile feedback), and input from the user may be received in any form (including acoustic, speech or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a middleware component (for example, an application server), or a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and techniques described herein), or any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN) and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

According to the technical solution of the embodiments of the present application, the search click logs generated by users in the enterprise-level wiki are analyzed and used to train a target model, through which the required knowledge can be found quickly and accurately.

It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.

The above specific implementations do not limit the protection scope of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims

1. A model training method, comprising:

acquiring a historical search click log generated by a knowledge sharing system, wherein the historical search click log comprises at least one search statement and click information of each document in search results for the at least one search statement;

for each search statement: determining a clicked document set of the search statement from the search results of the search statement according to the click information of each document in the search results, wherein the search results include a plurality of documents and a ranking of the documents; determining a positive sample set according to the search statement and the clicked document set; determining, according to the ranking, a plurality of non-clicked documents adjacent to each clicked document in the clicked document set to obtain a non-clicked document set; and obtaining a negative sample set according to the search statement and the non-clicked document set, wherein the sample set comprises a training set;

for each sample of the training set, determining a feature vector corresponding to the document in the sample, including calculating at least one of the following items of information to determine the feature vector: a relevance score of the document, obtained by inputting the search statement and the document into a pre-trained relevance model; the click weight of the document under the search statement in the past week; the proportion of the document as a first click under the search statement; the proportion of the document as a last click under the search statement; the proportion of the document as a satisfied click under the search statement; the proportion of the document as a long click under the search statement; and the proportion of the document as a short click under the search statement; and

taking the search statement in each sample of the training set as input and the feature vector corresponding to the input search statement as expected output, and training to obtain a target model.

2. The method of claim 1, wherein a ratio of a number of negative samples to a number of positive samples in the sample set is a preset value.

3. The method of claim 1, wherein determining the feature vector corresponding to the document in the sample comprises:

obtaining the feature vector based on the search statement in the sample, the document in the sample, and at least one pre-trained model.

4. The method of claim 1, wherein the method further comprises:

filtering the historical search click log to obtain filtered data.

5. The method of claim 4, wherein the filtering the historical search click log to obtain filtered data comprises:

determining, according to the historical search click log, the number of searches of each search statement and the number of clicks of each document; and

filtering out search statements whose number of searches is less than a first preset threshold and/or documents whose number of clicks is less than a second preset threshold, to obtain the filtered data.

6. The method of claim 1, wherein the set of samples comprises a test set; and

the method further comprises:

determining the search effect of the target model by using the test set.

7. A data retrieval method comprising:

receiving a target search statement input by a user through a terminal;

determining a target feature vector according to the target search statement and a target model obtained by the method of any one of claims 1-6;

determining a feature vector of each document in target search results for the target search statement; and

ranking the documents in the target search results according to the feature vectors and the target feature vector.

8. A model training apparatus comprising:

a log acquisition unit configured to acquire a historical search click log generated by a knowledge sharing system, the historical search click log including at least one search statement and click information of each document in search results for the at least one search statement;

a sample generation unit configured to, for each search statement: determine a clicked document set of the search statement from the search results of the search statement according to the click information of each document in the search results, wherein the search results include a plurality of documents and a ranking of the documents; determine a positive sample set according to the search statement and the clicked document set; determine, according to the ranking, a plurality of non-clicked documents adjacent to each clicked document in the clicked document set to obtain a non-clicked document set; and obtain a negative sample set according to the search statement and the non-clicked document set, wherein the sample set comprises a training set; and

a model training unit configured to determine, for each sample of the training set, a feature vector corresponding to the document in the sample, including calculating at least one of the following items of information to determine the feature vector: a relevance score of the document, obtained by inputting the search statement and the document into a pre-trained relevance model; the click weight of the document under the search statement in the past week; the proportion of the document as a first click under the search statement; the proportion of the document as a last click under the search statement; the proportion of the document as a satisfied click under the search statement; the proportion of the document as a long click under the search statement; and the proportion of the document as a short click under the search statement; and to take the search statement in each sample of the training set as input and the feature vector corresponding to the input search statement as expected output, and train to obtain a target model.

9. The apparatus of claim 8, wherein a ratio of a number of negative samples to a number of positive samples in the sample set is a preset value.
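One way to hold the negative-to-positive ratio of claim 9 at a preset value is to randomly downsample the negative samples. The sketch below assumes that strategy; the claim does not say how the ratio is enforced, and the function name `downsample_negatives`, the `ratio` argument, and the fixed `seed` are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample_negatives(positives: Sequence[T], negatives: Sequence[T],
                         ratio: float, seed: int = 0) -> List[T]:
    """Keep at most ratio * len(positives) negatives, chosen at random.

    ratio is the preset negative-to-positive ratio (assumed here to be a
    configuration value). If fewer negatives exist than the target count,
    all of them are kept and the ratio is not reached.
    """
    target = int(round(ratio * len(positives)))
    if target >= len(negatives):
        return list(negatives)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(list(negatives), target)


pos = ["p1", "p2"]
neg = [f"n{i}" for i in range(10)]
kept = downsample_negatives(pos, neg, ratio=2.0)
```

Here two positive samples and a preset ratio of 2.0 leave four of the ten negative samples in the sample set.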

10. The apparatus of claim 8, wherein the model training unit is further configured to:

11. The apparatus of claim 8, wherein the apparatus further comprises:

a data filtering unit configured to filter the historical search click log to obtain filtered data.

12. The apparatus of claim 11, wherein the data filtering unit is further configured to:

13. The apparatus of claim 8, wherein the sample set comprises a test set; and

the apparatus further comprises a test unit configured to:

determine the search effect of the target model by using the test set.

14. A data retrieval device comprising:

a search statement receiving unit configured to receive a target search statement input by a user through a terminal;

a first vector determination unit configured to determine a target feature vector from the target search statement and a target model trained by the method of any one of claims 1-6;

a second vector determination unit configured to determine feature vectors of respective documents in a target search result for the target search statement; and

a document ordering unit configured to order the documents in the target search result according to the feature vectors and the target feature vector.
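The ordering step of claim 14 can be sketched as a similarity ranking. The claim only states that documents are ordered according to their feature vectors and the target feature vector, so the use of cosine similarity here is an assumption, as are the names `cosine` and `rank_documents`.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(target: Sequence[float],
                   doc_vectors: Dict[str, Sequence[float]]) -> List[str]:
    """Order document ids by similarity to the target feature vector.

    target:      feature vector predicted by the target model for the
                 target search statement.
    doc_vectors: feature vector of each document in the target search result.
    Similarity measure is an assumption; the claim does not name one.
    """
    return sorted(doc_vectors,
                  key=lambda doc: cosine(target, doc_vectors[doc]),
                  reverse=True)


ranking = rank_documents(
    [1.0, 0.0],
    {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [0.5, 0.5]},
)
```

In this toy input, document `b` points almost in the same direction as the target vector and ranks first, followed by `c` and then `a`. A distance-based or learned scoring function could be substituted without changing the overall flow.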

15. A model training electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

16. A data retrieval electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of claim 7.

17. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.

18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of claim 7.