CN112307295A

CN112307295A - Corpus generalization method, apparatus and electronic device combining RPA and AI

Info

Publication number: CN112307295A
Application number: CN202011206419.4A
Authority: CN
Inventors: 汪冠春; 刘金艳; 胡景超; 胡一川; 褚瑞; 李玮
Original assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Current assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Priority date: 2020-03-27
Filing date: 2020-11-02
Publication date: 2021-02-02

Abstract

The present application provides a corpus generalization method, apparatus and electronic device combining RPA and AI. The method includes: the RPA system receives a first request, wherein the first request includes a seed corpus; the RPA system generalizes the seed corpus according to a preset generalization mode based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus; the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines each candidate corpus with a similarity greater than a preset threshold as a generalized corpus of the seed corpus; the RPA system outputs the generalized corpus of the seed corpus. In the method of the present application, the RPA system automatically generalizes the seed corpus in a preset generalization mode and screens the resulting candidate corpora against a preset threshold, thereby obtaining the generalized corpora of the seed corpus and improving the efficiency of corpus generalization.

Description

Corpus generalization method and apparatus combining RPA and AI, and electronic device

Technical Field

The present application relates to the field of natural language processing, and in particular, to a corpus generalization method and apparatus combining RPA and AI, an electronic device, and a storage medium.

Background

Robotic Process Automation (RPA) uses specific "robot software" to simulate human operations on a computer and to execute process tasks automatically according to rules.

Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence.

Natural Language Processing (NLP) is the study of computer systems, especially software systems, that can communicate effectively in natural language. It is an important direction in the fields of computer science and artificial intelligence.

For human-computer interaction products such as search engines, smart speech systems, and customer service robots, the intent of a user's sentence is typically identified by a machine learning model. The model is trained on corpora in advance, and its recognition capability depends on the number of corpora used for training. When the corpora are insufficient, their number can be increased by generalizing the existing corpora.

In the prior art, a corpus generalization task is issued to a plurality of operators as a crowdsourcing task, and the operators generalize the corpus manually by thinking up variants themselves.

However, because the corpus is generalized manually in this way, the efficiency of corpus generalization is low.

Disclosure of Invention

The embodiments of the present application provide a corpus generalization method, apparatus, device and storage medium, so as to solve the problem that corpus generalization is currently inefficient.

In a first aspect, an embodiment of the present application provides a corpus generalization method combining an RPA and an AI, which is applied to a first electronic device, where the first electronic device includes an RPA system, and the method includes:

the RPA system receives a first request, wherein the first request comprises a seed corpus;

the RPA system generalizes the seed corpus based on Natural Language Processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;

the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus;

and the RPA system outputs the generalized corpus of the seed corpus.
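The four steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default threshold, the toy generalization mode, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (character-level ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def generalize_corpus(
    seed: str,
    generalizers: Iterable[Callable[[str], List[str]]],
    threshold: float = 0.5,
) -> List[str]:
    """Collect candidates from each preset mode, then keep those whose
    similarity to the seed is strictly greater than the threshold."""
    candidates: List[str] = []
    for generalize in generalizers:
        for candidate in generalize(seed):
            # Skip the seed itself and candidates already collected.
            if candidate != seed and candidate not in candidates:
                candidates.append(candidate)
    return [c for c in candidates if similarity(c, seed) > threshold]


def toy_mode(seed: str) -> List[str]:
    """Hypothetical generalization mode used only for this illustration."""
    return [seed.replace("contraindication", "taboo"), "plush mugs"]


generalized = generalize_corpus("dietary contraindication", [toy_mode])
```

Here the unrelated candidate is screened out by the threshold while the paraphrase is kept; in practice the similarity function and threshold would be chosen for the target language and domain.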

In a possible embodiment, the preset generalization mode includes at least one of the following modes:

a network crawling mode, a synonym replacement mode, a knowledge base retrieval mode, and a sentence pattern extraction mode.

In a possible implementation manner, when the preset generalization manner includes the network crawling manner, the RPA system generalizes the seed corpus according to a preset generalization manner based on natural language processing NLP, to obtain at least one candidate corpus of the seed corpus, including:

the RPA system generalizes the seed corpus according to the network crawling mode to obtain at least one candidate corpus of the seed corpus;

the RPA system generalizes the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus, including:

the RPA system searches the seed corpus in a webpage searching website to obtain a webpage list for displaying a searching result, wherein the webpage list comprises a plurality of webpage items, and each webpage item has a title sentence;

the RPA system crawls the title sentences of all the webpage items;

and the RPA system takes those title sentences of the webpage items that meet the matching conditions as the candidate corpora of the seed corpus.

In one possible embodiment, the method further comprises:

the RPA system receives filter words;

the RPA system takes those title sentences of the webpage items that meet the matching conditions as the candidate corpora of the seed corpus, including:

and the RPA system, according to the filter words, takes those title sentences of the webpage items that meet the matching conditions as the candidate corpora of the seed corpus.
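The selection of title sentences can be sketched as below. The text does not spell out the matching conditions at this point, so the conditions used here (non-empty, different from the seed, free of filter words, not repeated) are assumptions, as is the function name. The titles are assumed to have been crawled already; no network access is shown.

```python
from typing import Iterable, List


def select_candidate_titles(
    seed: str, titles: Iterable[str], filter_words: Iterable[str] = ()
) -> List[str]:
    """Keep crawled title sentences that pass an assumed matching condition."""
    # Ignore empty filter words, which would otherwise match every title.
    blocked = [w.lower() for w in filter_words if w]
    selected: List[str] = []
    for title in titles:
        text = title.strip()
        lowered = text.lower()
        if not text or lowered == seed.lower():
            continue  # drop blanks and exact copies of the seed
        if any(word in lowered for word in blocked):
            continue  # drop titles containing a filter word
        if text not in selected:
            selected.append(text)
    return selected
```

A caller would pass the titles gathered from the search result pages together with any filter words supplied by the user.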

In one possible embodiment, the web page list comprises at least one presentation page, and each presentation page comprises at least one web page item;

the method further comprises the following steps:

the RPA system receives a specified website address and/or a specified number of pages to crawl;

the RPA system searches the seed corpus in a webpage search website, and the method comprises the following steps:

the RPA system searches the seed corpus in a webpage searching website indicated by the specified website address;

the RPA system crawls title sentences of all webpage items, and the method comprises the following steps:

and the RPA system crawls the title sentences of all the webpage items in the display pages corresponding to the specified number of pages to crawl.

In a possible implementation manner, when the preset generalization manner includes the synonym replacement manner, the RPA system generalizes the seed corpus according to a preset generalization manner based on natural language processing NLP to obtain at least one candidate corpus of the seed corpus, further including:

the RPA system generalizes the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus;

the RPA system generalizes the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus, including:

the RPA system acquires the affiliated field of the seed corpus and selects a synonym table corresponding to the affiliated field from a plurality of preset synonym tables, wherein each synonym table corresponds to one field;

the RPA system identifies keywords in the seed corpus;

and the RPA system performs synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to the field to which the keyword belongs, so as to obtain at least one candidate corpus of the seed corpus.
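A minimal sketch of the synonym replacement mode follows. The table contents, the domain names, and the whitespace tokenization are assumptions for illustration; a Chinese corpus would need a word segmenter instead of `str.split`. Words that appear in the selected table are treated as the keywords.

```python
from typing import Dict, List

# Hypothetical synonym tables, one per field (domain).
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "maternity": {"dietary": ["food", "eating"], "contraindication": ["taboo"]},
    "finance": {"loan": ["credit"]},
}


def synonym_candidates(seed: str, domain: str) -> List[str]:
    """Replace one keyword at a time using the table of the seed's domain."""
    table = SYNONYM_TABLES.get(domain, {})
    words = seed.split()
    candidates: List[str] = []
    for i, word in enumerate(words):
        # A word counts as a keyword here if the domain table lists it.
        for synonym in table.get(word.lower(), []):
            candidates.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return candidates
```

Selecting the table by domain keeps a word such as "loan" from being replaced in a seed that belongs to an unrelated field.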

In a possible implementation manner, when the preset generalization manner includes the knowledge base retrieval manner, the RPA system generalizes the seed corpus according to the preset generalization manner based on the natural language processing NLP to obtain at least one candidate corpus of the seed corpus, and further includes:

the RPA system generalizes the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus;

the RPA system generalizes the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus, and the method comprises the following steps:

the RPA system searches a knowledge base for a generalization corpus corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and corresponding generalization corpuses thereof;

and the RPA system takes the generalized corpora corresponding to the seed corpus in the knowledge base as the candidate corpora of the seed corpus.

In a possible implementation manner, when the preset generalization manner includes the sentence pattern extraction manner, the RPA system generalizes the seed corpus according to the preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, further including:

the RPA system generalizes the seed corpus according to the sentence pattern extraction mode to obtain at least one candidate corpus of the seed corpus;

the RPA system generalizes the seed corpus according to the sentence pattern extraction mode to obtain at least one candidate corpus of the seed corpus, including:

the RPA system identifies and extracts key words in the seed corpus through a dependency syntactic analysis algorithm;

and the RPA system combines the key words of the seed corpus to generate a candidate corpus of the seed corpus.
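The combination step can be sketched as below. The keywords are assumed to have been extracted beforehand by a dependency parser, which is not shown, and reordering the keywords is only one conceivable way of combining them; the text does not specify the combination rule.

```python
from itertools import permutations
from typing import List, Sequence


def combine_keywords(keywords: Sequence[str], seed: str) -> List[str]:
    """Generate candidates by joining the keywords in every order.

    Assumes the keywords come from an upstream dependency-parsing step.
    """
    candidates: List[str] = []
    for order in permutations(keywords):
        sentence = " ".join(order)
        # Skip the seed itself and duplicates caused by repeated keywords.
        if sentence != seed and sentence not in candidates:
            candidates.append(sentence)
    return candidates
```

Because the number of orderings grows factorially, a sketch like this is only practical for a handful of keywords.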

In one possible embodiment, each preset generalization mode corresponds to an identifier;

the method further comprises:

the RPA system receives a specified identifier;

the RPA system generalizes the seed corpus based on natural language processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus, further including:

and the RPA system generalizes the seed corpus by adopting the preset generalization mode corresponding to the specified identifier.
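One way to map identifiers to generalization modes is a small registry, sketched below. The identifier strings and the placeholder mode bodies are hypothetical; only the lookup by specified identifier reflects the text.

```python
from typing import Callable, Dict, Iterable, List

Generalizer = Callable[[str], List[str]]
GENERALIZERS: Dict[str, Generalizer] = {}


def register(identifier: str) -> Callable[[Generalizer], Generalizer]:
    """Associate a generalization mode with its identifier."""
    def decorator(func: Generalizer) -> Generalizer:
        GENERALIZERS[identifier] = func
        return func
    return decorator


@register("synonym")
def by_synonym(seed: str) -> List[str]:
    # Placeholder body standing in for the synonym replacement mode.
    return [seed.replace("diet", "food")]


@register("knowledge_base")
def by_knowledge_base(seed: str) -> List[str]:
    # Placeholder body standing in for the knowledge base retrieval mode.
    return {"pregnancy diet": ["what to eat when pregnant"]}.get(seed, [])


def generalize_with(seed: str, identifiers: Iterable[str]) -> List[str]:
    """Run only the modes whose identifiers were specified."""
    candidates: List[str] = []
    for identifier in identifiers:
        if identifier not in GENERALIZERS:
            raise ValueError(f"unknown generalization mode: {identifier!r}")
        candidates.extend(GENERALIZERS[identifier](seed))
    return candidates
```

Rejecting unknown identifiers early makes a mistyped mode name visible instead of silently producing no candidates.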

In a possible implementation, after the RPA system outputs the generalized corpus of the seed corpus, the method further includes:

the RPA system receives a second request, wherein the second request is used for indicating at least one generalized corpus of the seed corpus;

the RPA system takes the generalization corpus indicated by the second request as a new seed corpus to carry out generalization to obtain a candidate corpus of the new seed corpus;

the RPA system adds the candidate corpus of the new seed corpus to the candidate corpus of the seed corpus, re-identifies the similarity between at least one candidate corpus of the seed corpus and the seed corpus, and determines the candidate corpus with the similarity larger than the preset threshold value as the generalization corpus of the seed corpus;

and the RPA system outputs the generalized corpus of the seed corpus again.
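The handling of the second request can be sketched as follows. The function and parameter names are assumptions, and the generalization and scoring functions are passed in rather than fixed. As in the text, the merged candidates are re-scored against the original seed corpus, not against the new seed.

```python
from typing import Callable, Iterable, List


def regeneralize(
    seed: str,
    candidates: Iterable[str],
    selected: Iterable[str],
    generalize: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Generalize the selected corpora as new seeds, merge the results into
    the existing candidates, and re-screen everything against the seed."""
    merged = list(candidates)
    for new_seed in selected:
        for candidate in generalize(new_seed):
            if candidate != seed and candidate not in merged:
                merged.append(candidate)
    return [c for c in merged if score(c, seed) > threshold]
```

Scoring against the original seed keeps repeated rounds from drifting away from the intent of the first request.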

In one possible embodiment, after the RPA system receives the first request, the method further comprises:

the RPA system searches for the seed corpus in a history record, wherein the history record comprises previously generalized seed corpora and their corresponding generalized corpora;

if the RPA system finds the seed corpus in the history record, acquiring the generalization corpus of the seed corpus from the history record;

if the seed corpus is not found in the history record by the RPA system, generalizing the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.
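The history lookup can be sketched as a simple cache check. The dictionary-based history and the returned flag are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def lookup_or_generalize(
    seed: str,
    history: Dict[str, List[str]],
    generalize: Callable[[str], List[str]],
) -> Tuple[List[str], bool]:
    """Return (corpora, from_history).

    If the seed was generalized before, its stored generalized corpora are
    returned; otherwise fresh candidates are generated, which still need to
    be screened by similarity afterwards.
    """
    if seed in history:
        return list(history[seed]), True
    return generalize(seed), False
```

The flag lets the caller skip the similarity screening for results that were already screened when they were first stored.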

In one possible embodiment, the method further comprises:

the RPA system updates the seed corpus and the generalized corpus of the seed corpus into a knowledge base, wherein the knowledge base at least comprises seed corpora and their corresponding generalized corpora;

the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, including:

and the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus through a logistic regression model, wherein the logistic regression model is trained in advance through a training set consisting of a plurality of corpora in the knowledge base.
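The scoring side of a logistic regression model can be sketched as below. The two features, the weights, and the bias are hypothetical values chosen for illustration; in the described method the parameters would come from training on corpora in the knowledge base, which is not shown here.

```python
import math
from typing import Sequence, Tuple


def pair_features(candidate: str, seed: str) -> Tuple[float, float]:
    """Two simple features: word-set Jaccard overlap and length ratio."""
    a, b = set(candidate.lower().split()), set(seed.lower().split())
    union = a | b
    jaccard = len(a & b) / len(union) if union else 0.0
    longest = max(len(candidate), len(seed))
    length_ratio = min(len(candidate), len(seed)) / longest if longest else 0.0
    return jaccard, length_ratio


def logistic_similarity(
    candidate: str,
    seed: str,
    weights: Sequence[float] = (6.0, 1.0),  # hypothetical, not trained
    bias: float = -3.0,  # hypothetical, not trained
) -> float:
    """Apply the logistic (sigmoid) function to a weighted feature sum."""
    features = pair_features(candidate, seed)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

The output lies strictly between 0 and 1, so it can be compared directly with the preset threshold.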

In a second aspect, an embodiment of the present application provides a corpus generalization method combining RPA and AI, which is applied to a second electronic device, where the second electronic device includes an RPA system, and the method includes:

the RPA system receives a seed corpus input by a user;

the RPA system sends a first request containing the seed corpus to a first electronic device, wherein the first request is used for instructing the first electronic device to generalize the seed corpus based on natural language processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and take each candidate corpus with a similarity larger than a preset threshold value as a generalized corpus of the seed corpus;

and the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device.

In a possible embodiment, there is at least one generalized corpus of the seed corpus;

the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, including:

the RPA system displays at least one generalized corpus of the seed corpus and an indication control, wherein the indication control is used for indicating that a generalized corpus selected by the user is to be generalized as a new seed corpus;

after the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, the method further includes:

the RPA system, in response to a triggering operation on the indication control, sends a second request to the first electronic device, wherein the second request is used for instructing the first electronic device to generalize the generalized corpus selected by the user as a new seed corpus;

and the RPA system receives and displays the generalized corpus of the seed corpus re-sent by the first electronic device.

In a possible implementation manner, the RPA system receives and displays a generalized corpus of the seed corpus sent by the first electronic device, further including:

the RPA system receives the at least one generalization corpus of the seed corpus sent by the first electronic device and the corresponding similarity and/or generalization mode thereof;

the RPA system displays the at least one generalization corpus of the seed corpus and the corresponding similarity and/or generalization mode in a correlated manner;

the RPA system receives a screening instruction input by a user, wherein the screening instruction is used for indicating at least one of the preset generalization modes;

and the RPA system displays the generalized corpora obtained in the generalization mode indicated by the screening instruction.
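The associated display and the screening by generalization mode can be sketched with a small record type. The field names and the tab-separated rendering are assumptions; the actual display interface is a graphical one.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class GeneralizedCorpus:
    text: str
    similarity: float
    mode: str  # identifier of the generalization mode that produced it


def filter_by_modes(
    items: Iterable[GeneralizedCorpus], modes: Iterable[str]
) -> List[GeneralizedCorpus]:
    """Keep only the corpora produced by the modes named in the instruction."""
    wanted = set(modes)
    return [item for item in items if item.mode in wanted]


def render(items: Iterable[GeneralizedCorpus]) -> List[str]:
    """Show each corpus together with its similarity and generalization mode."""
    return [f"{item.text}\t{item.similarity:.2f}\t{item.mode}" for item in items]
```

Keeping the mode on each record is what makes the later screening possible without asking the first electronic device again.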

In a possible embodiment, there is at least one seed corpus;

the RPA system receives and displays the generalized corpora of the seed corpora sent by the first electronic device, further including:

the RPA system displays the at least one seed corpus and the number of generalized corpora of each seed corpus;

after the RPA system displays the at least one seed corpus and the number of its generalized corpora, the method further comprises:

the RPA system receives a trigger instruction from the user directed at the number of generalized corpora of a specified seed corpus, wherein the specified seed corpus is one of the seed corpora;

and the RPA system displays the generalized corpora of the specified seed corpus.

In a third aspect, an embodiment of the present application provides a corpus generalization method combining an RPA and an AI, which is applied to a third electronic device, where the third electronic device includes an RPA system, and the method includes:

the RPA system receives a seed corpus input by a user;

the RPA system generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;

and the RPA system displays the generalized corpus of the seed corpus.

In a fourth aspect, an embodiment of the present application provides a corpus generalization device combining an RPA and an AI, applied to a first electronic device, including:

a first receiving module, configured to receive a first request, wherein the first request comprises a seed corpus;

a first processing module, configured to generalize the seed corpus based on natural language processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;

a first determining module, configured to identify a similarity between the at least one candidate corpus and the seed corpus, and determine a candidate corpus of which the similarity is greater than a preset threshold as a generalized corpus of the seed corpus;

and a first output module, configured to output the generalized corpus of the seed corpus.

In a fifth aspect, an embodiment of the present application provides a corpus generalization device combining an RPA and an AI, applied to a second electronic device, including:

a second receiving module, configured to receive a seed corpus input by a user;

a first sending module, configured to send a first request including the seed corpus to a first electronic device, where the first request is used to instruct the first electronic device to generalize the seed corpus based on natural language processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus, identify a similarity between the at least one candidate corpus and the seed corpus, and use each candidate corpus whose similarity is greater than a preset threshold as a generalized corpus of the seed corpus;

and a first display module, configured to receive and display the generalized corpus of the seed corpus sent by the first electronic device.

In a sixth aspect, an embodiment of the present application provides a corpus generalization device combining an RPA and an AI, applied to a third electronic device, including:

a third receiving module, configured to receive a seed corpus input by a user;

a second processing module, configured to generalize the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;

a second determining module, configured to identify a similarity between the at least one candidate corpus and the seed corpus, and determine a candidate corpus of which the similarity is greater than a preset threshold as a generalized corpus of the seed corpus;

and a second display module, configured to display the generalized corpus of the seed corpus.

In a seventh aspect, an embodiment of the present application provides a first electronic device, including: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the corpus generalization method according to the first aspect and various possible embodiments of the first aspect.

In an eighth aspect, an embodiment of the present application provides a second electronic device, including: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to the second aspect and various possible embodiments of the second aspect.

In a ninth aspect, an embodiment of the present application provides a third electronic device, including: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to the third aspect and various possible embodiments of the third aspect.

In a tenth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to the first aspect and various possible implementation manners of the first aspect is implemented.

In an eleventh aspect, embodiments of the present application provide a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to the second aspect and various possible implementation manners of the second aspect is implemented.

In a twelfth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the corpus generalization method according to the third aspect and various possible embodiments of the third aspect is implemented.

According to the corpus generalization method and device, electronic equipment and storage medium combining RPA and AI, an RPA system receives a first request, wherein the first request includes a seed corpus, generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus, identifies similarity between each candidate corpus and the seed corpus, determines the candidate corpus with the similarity larger than a preset threshold as the generalized corpus of the seed corpus, and outputs the generalized corpus of the seed corpus. According to the method, the seed corpus can be automatically generalized through a preset generalization mode, and the generalized candidate corpus is screened according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic view of a scenario of a corpus generalization method combining RPA and AI according to an embodiment of the present application;

Fig. 2 is a schematic view of a scenario of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application;

Fig. 3 is a schematic flow chart of a corpus generalization method combining RPA and AI according to an embodiment of the present application;

Fig. 4 is a schematic flow chart of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application;

Fig. 5 is a schematic flow chart of a corpus generalization method combining RPA and AI according to another embodiment of the present application;

Fig. 6 is a schematic diagram of a preset generalization mode selection interface according to an embodiment of the present application;

Fig. 7 is a schematic diagram of a configuration interface of the network crawling mode according to an embodiment of the present application;

Fig. 8 is a schematic diagram of a display interface of the generalized corpora according to an embodiment of the present application;

Fig. 9 is a schematic diagram of a display interface of the generalized corpora according to another embodiment of the present application;

Fig. 10 is a signaling interaction diagram of a corpus generalization method combining RPA and AI according to an embodiment of the present application;

Fig. 11 is a schematic flow chart of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application;

Fig. 12 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to an embodiment of the present application;

Fig. 13 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to yet another embodiment of the present application;

Fig. 14 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to another embodiment of the present application;

Fig. 15 is a schematic hardware structure diagram of a first electronic device according to an embodiment of the present application;

Fig. 16 is a schematic hardware structure diagram of a second electronic device according to yet another embodiment of the present application;

Fig. 17 is a schematic hardware structure diagram of a third electronic device according to another embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a schematic view of a scenario of a corpus generalization method combining RPA and AI according to an embodiment of the present application. The scenario may include a first electronic device 11 and a second electronic device 12. The first electronic device 11 may include, but is not limited to, a server, a computer device, and the like. The second electronic device 12 may include, but is not limited to, a mobile phone, a desktop computer, a vehicle-mounted terminal, or a tablet computer. The first electronic device 11 may provide back-end computing or application service support for the second electronic device 12 over the network. For example, the first electronic device 11 may host a corpus generalization platform, and the corpus generalization platform may be a Robotic Process Automation (RPA) system. The second electronic device 12 may access an interface of the corpus generalization platform through an application program, a plug-in in a social application program, a website login, and the like. The user may thus use the corpus generalization platform to generalize corpora by operating the second electronic device 12.

For example, the user may log in to the corpus generalization platform on the second electronic device 12 through a web page, input the seed corpus to be generalized, and trigger a generalization instruction. After receiving the generalization instruction triggered by the user, the second electronic device 12 sends a first request to the first electronic device 11. After the first electronic device 11 generalizes the seed corpus, it returns the generalized corpus of the seed corpus to the second electronic device 12 through the corpus generalization platform. The second electronic device 12 may output the generalized corpus by displaying it, offering it for download, or the like, according to the instruction of the user.

Fig. 2 is a schematic view of a scenario of a corpus generalization method combining RPA and AI according to another embodiment of the present application. The scenario may include a third electronic device 13. The third electronic device 13 may include, but is not limited to, a mobile phone, a desktop computer, a vehicle-mounted terminal, a tablet computer, a robot, and the like. The third electronic device 13 does not need the support of other back-end devices and can implement corpus generalization by itself.

For example, the third electronic device may run an application program for implementing the corpus generalization, and the application program may implement the corpus generalization without interacting with devices such as a background server. The user can run the application program on the third electronic device 13, input the seed corpus to be generalized in the interface of the application program, and trigger the generalization instruction. After receiving the generalization instruction triggered by the user, the third electronic device 13 generalizes the seed corpus, and then outputs the generalized corpus of the seed corpus to the user in the manners of displaying, downloading, and the like. The application program for implementing corpus generalization may be an RPA system.

It should be noted that the method provided in the embodiment of the present application is not limited to the application scenarios shown in fig. 1 and fig. 2, and may also be used in other possible application scenarios, which is not limited.

Fig. 3 is a schematic flow chart of a corpus generalization method combining RPA and AI according to an embodiment of the present application. The method is executed by the first electronic device in Fig. 1, which includes an RPA system. As shown in Fig. 3, the method includes:

S301: The RPA system receives a first request, wherein the first request comprises a seed corpus.

In this embodiment, the seed corpus is a corpus to be generalized. For example, the seed corpus may be "dietary contraindication for the pregnancy preparation period," and the RPA system in the first electronic device generalizes the seed corpus after receiving the first request.

Optionally, the RPA system may receive the first request from the second electronic device.

In this embodiment, the second electronic device may be the second electronic device in fig. 1. When the user needs to generalize the corpus, the seed corpus can be input into the second electronic device, and then the second electronic device can send a first request to the first electronic device to request the first electronic device to generalize the seed corpus.

S302, the RPA system generalizes the seed corpus according to a preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus.

In this embodiment, the RPA system may generalize the seed corpus based on Natural Language Processing (NLP) by using a preset generalization mode to obtain the candidate corpus of the seed corpus. Subsequently, the candidate corpus can be further screened to obtain the generalization corpus of the seed corpus.

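Optionally, the preset generalization manner includes at least one of the following: a network crawling manner, a synonym replacement manner, a knowledge base retrieval manner, and a sentence pattern extraction manner.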

In this embodiment, the RPA system may generalize the seed corpus in one or more predetermined generalization manners. The specific preset generalization mode can be default or specified by the user.

Optionally, each preset generalization manner may correspond to an identifier, and the RPA system may receive a specified identifier.

Further, the RPA system generalizes the seed corpus according to a preset generalization mode, which may include generalizing the seed corpus according to the preset generalization mode corresponding to the designated identifier.

In this embodiment, the identifier may be a name, a code, or the like of a preset generalization manner, which is not limited herein. The specified identifier is an identifier specified by the user. When a plurality of preset generalization manners are configured in the RPA system and a specified identifier is received, the seed corpus is generalized in the preset generalization manner corresponding to the specified identifier. There may be one or more specified identifiers. The specified identifier may be transmitted by the second electronic device. For example, the user selects the identifier of a desired generalization manner from all preset generalization manners and inputs it to the second electronic device, and the second electronic device sends the specified identifier to the first electronic device.
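As an illustrative, non-limiting sketch, the selection of preset generalization manners by specified identifier may be organized as a lookup table. The identifiers `"synonym"` and `"kb"` and the two toy functions are hypothetical stand-ins and are not defined by the present application.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-ins for two preset generalization manners; each takes a
# seed corpus and returns a list of candidate corpora.
def synonym_replacement(seed: str) -> List[str]:
    return [seed.replace("contraindication", "attention")]

def knowledge_base_retrieval(seed: str) -> List[str]:
    return []  # nothing is stored in this toy example

# Identifier -> preset generalization manner (identifiers are illustrative).
PRESET_MANNERS: Dict[str, Callable[[str], List[str]]] = {
    "synonym": synonym_replacement,
    "kb": knowledge_base_retrieval,
}

def generalize(seed: str, specified_ids: Iterable[str] = ()) -> List[str]:
    """Run the manners named by specified_ids, or every manner by default."""
    ids = list(specified_ids) or list(PRESET_MANNERS)
    candidates: List[str] = []
    for manner_id in ids:
        if manner_id not in PRESET_MANNERS:
            raise KeyError(f"unknown generalization manner: {manner_id}")
        for candidate in PRESET_MANNERS[manner_id](seed):
            # Skip the seed itself and duplicates produced by several manners.
            if candidate != seed and candidate not in candidates:
                candidates.append(candidate)
    return candidates
```

When no identifier is specified, the sketch falls back to all registered manners, which corresponds to the default case described above.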

S303, the RPA system identifies the similarity between at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus.

In this embodiment, the RPA system may screen out, according to the similarity, the corpora similar to the seed corpus from the candidate corpora as the generalized corpora of the seed corpus. The generalized corpus is the generalization result of the seed corpus. The preset threshold may be set according to actual requirements and is not limited herein. For example, the preset threshold may be set to 0.9, 0.8, or the like.

Optionally, the RPA system may first identify the similarity between each candidate corpus and the seed corpus by using a deep learning model, then compare the preset threshold with the similarity corresponding to each candidate corpus, and determine each candidate corpus whose similarity is greater than the preset threshold as a generalized corpus of the seed corpus. In this way, the RPA system screens the candidate corpora according to the similarity, which helps ensure the accuracy of the generalized corpora.

Optionally, the deep learning model may be obtained by training an optimal model with an XGB algorithm or a logistic regression model, based on corpus samples in the knowledge base and using various distance features such as Jaccard, coverage, w2v (word vectors), and WMD (Word Mover's Distance). The trained optimal model is then used to calculate the similarity between sentences.
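As a minimal sketch, two of the distance features named above can be computed over word tokens as follows. The definition of coverage used here (the share of seed tokens that also occur in the candidate) is an assumption, since the present application does not define it, and whitespace tokenization stands in for a real word segmenter.

```python
from typing import Dict, List

def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def coverage(seed: List[str], candidate: List[str]) -> float:
    """Assumed definition: fraction of seed tokens found in the candidate."""
    seed_set = set(seed)
    if not seed_set:
        return 0.0
    return len(seed_set & set(candidate)) / len(seed_set)

def pair_features(seed: str, candidate: str) -> Dict[str, float]:
    """Feature dictionary for one (seed, candidate) pair."""
    seed_tokens = seed.lower().split()
    cand_tokens = candidate.lower().split()
    return {
        "jaccard": jaccard(seed_tokens, cand_tokens),
        "coverage": coverage(seed_tokens, cand_tokens),
    }
```

Such features would then be fed, together with the word-vector and WMD features, to the trained classifier; that training step is not shown here.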

As another possible implementation, the RPA system may further identify similarity between each candidate corpus and the seed corpus through a logistic regression model, where the logistic regression model is trained in advance through a training set composed of a plurality of corpora in the knowledge base. In this embodiment, the knowledge base may store the generalized seed corpus and the corresponding generalized corpus. The corpora can be selected from the knowledge base to form a training set, the created logistic regression model is trained through the training set, and the trained logistic regression model is adopted to identify the similarity between each candidate corpus and the seed corpus.

Optionally, in order to ensure the accuracy of the generalized corpus, the RPA system may further calculate the similarity between the generalized candidate corpus and the seed corpus by using a ranking algorithm, rank all the candidate corpora according to the similarity, delete the candidate corpus with the similarity lower than a preset threshold, and further obtain the generalized corpus of the seed corpus.
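The ranking and threshold screening described above may be sketched as follows. The `similarity` argument is a placeholder for whichever trained model is used; it is assumed to return a score between 0 and 1.

```python
from typing import Callable, List, Tuple

def screen_candidates(
    seed: str,
    candidates: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Score each candidate against the seed, rank, and keep those above the threshold."""
    scored = [(cand, similarity(seed, cand)) for cand in candidates]
    # Rank from most to least similar.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only candidates whose similarity is strictly greater than the threshold.
    return [(cand, score) for cand, score in scored if score > threshold]
```

Note that the comparison is strict, so a candidate whose similarity equals the preset threshold is not kept.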

S304, the RPA system outputs the generalized corpus of the seed corpus.

In this embodiment, the first electronic device may output the generalized corpora of the seed corpora to the user through the RPA system, so that the user may view or download the generalized corpora of the seed corpora, and then perform model training according to the generalized corpora. For example, the seed corpus is "dietary contraindication for pregnancy preparation", and the generalization corpus of the seed corpus may be "dietary attention for pregnancy preparation", "food attention for pregnancy preparation", and the like.

To sum up, in the corpus generalization method combining the RPA and the AI according to the embodiment of the present application, the RPA system receives a first request, where the first request includes a seed corpus, then generalizes the seed corpus based on the natural language processing NLP according to a preset generalization method to obtain at least one candidate corpus of the seed corpus, then identifies a similarity between the at least one candidate corpus and the seed corpus, determines the candidate corpus with the similarity greater than a preset threshold as the generalized corpus of the seed corpus, and finally outputs the generalized corpus of the seed corpus. According to the method, the RPA system can automatically generalize the seed corpus in a preset generalization mode, and can screen the generalized candidate corpus according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved under the condition that the generalization effect is ensured.

Optionally, the RPA system in the first electronic device may send the generalized corpus of the seed corpus to the second electronic device.

Therefore, the first electronic device can send the generalized corpus of the seed corpus to the second electronic device through the RPA system, so that the second electronic device can display the generalized corpus of the seed corpus, and the user can conveniently perform subsequent operations such as viewing, selecting, and downloading.

Optionally, the RPA system sends the generalized corpora of the seed corpus to the second electronic device, which may include sending the generalized corpora of the seed corpus and the similarity and/or the generalization manner corresponding to each of the generalized corpora to the second electronic device.

The similarity corresponding to a generalized corpus refers to the similarity between that generalized corpus and the seed corpus. The generalization manner corresponding to a generalized corpus refers to the preset generalization manner adopted by the first electronic device in determining that generalized corpus. For example, the generalized corpus "dietary notes for pregnancy" is obtained by the synonym replacement manner, and the generalized corpus "dietary notes for pregnancy" is obtained by the network crawling manner.

Therefore, when the RPA system in the first electronic device sends the generalization corpus of the seed corpus to the second electronic device, the similarity and/or the generalization mode corresponding to each generalization corpus is sent to the second electronic device at the same time, so that the second electronic device displays the similarity and/or the generalization mode corresponding to each generalization corpus to the user.
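One possible shape for the data sent to the second electronic device is sketched below. The field names and the use of JSON are assumptions made for illustration only; the present application does not prescribe a message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class GeneralizedCorpus:
    text: str          # the generalized corpus itself
    similarity: float  # similarity to the seed corpus
    manner: str        # identifier of the generalization manner that produced it

def build_response(seed: str, results: List[GeneralizedCorpus]) -> str:
    """Serialize the generalized corpora with their similarity and manner."""
    payload = {
        "seed_corpus": seed,
        "generalized_corpora": [asdict(item) for item in results],
    }
    # ensure_ascii=False keeps non-ASCII corpora readable in the output.
    return json.dumps(payload, ensure_ascii=False)
```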

In one embodiment, after S304, the method further includes: the RPA system updates the seed corpus and the generalized corpus of the seed corpus into a knowledge base.

Therefore, when the RPA system carries out corpus generalization by adopting a knowledge base retrieval mode, the candidate corpus of the seed corpus is retrieved from the corpus stored in the knowledge base. After the generalization corpus of the seed corpus is obtained, the seed corpus and the corresponding generalization corpus can be updated to the knowledge base, so that the corpus data in the knowledge base is enriched.

In one embodiment, after S304, the method further includes: the RPA system receives a second request, wherein the second request is used for indicating at least one generalization corpus of the seed corpus, then generalizes the generalization corpus indicated by the second request as a new seed corpus to obtain a candidate corpus of the new seed corpus, then adds the candidate corpus of the new seed corpus into the candidate corpus of the seed corpus, re-identifies the similarity between the at least one candidate corpus of the seed corpus and the seed corpus, determines the candidate corpus with the similarity larger than the preset threshold as the generalization corpus of the seed corpus, and then re-outputs the generalization corpus of the seed corpus.

In this embodiment, the second request may be sent by the second electronic device. After the generalized corpora of the seed corpus are obtained, the user can indicate one or more of them as new seed corpora to be generalized, and the generalization result is merged into the generalized corpora of the original seed corpus. For example, the seed corpus is "dietary contraindication for pregnancy preparation", and the generalized corpora of the seed corpus may be "dietary attention for pregnancy preparation", "food attention for pregnancy preparation", and the like. The user may designate "dietary attention for pregnancy preparation" as a new seed corpus to be generalized. The candidate corpora obtained by generalizing "dietary attention for pregnancy preparation" are added to the candidate corpora of the original seed corpus "dietary contraindication for pregnancy preparation", the generalized corpora of the original seed corpus are screened out again from this enlarged set, and the updated generalized corpora of "dietary contraindication for pregnancy preparation" are output again.

In this embodiment of the present disclosure, after a seed corpus is generalized, if the user finds that the number of generalized corpora is small, a generalized corpus with a relatively accurate generalization effect may be selected as a new seed corpus. The new seed corpus is generalized, the generalization result is added to the candidate corpora of the original seed corpus, and the similarity between each candidate corpus and the original seed corpus is recalculated. Therefore, the user can directly generalize a generalized corpus of the seed corpus, which reduces user operations, improves user experience, and improves generalization efficiency and accuracy.
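The handling of the second request may be sketched as follows. The `generalize` and `similarity` callables are placeholders for the preset generalization manner and the similarity model. Note that the similarity is always measured against the original seed corpus, not against the new seed corpus.

```python
from typing import Callable, List

def regeneralize(
    seed: str,
    candidates: List[str],
    new_seeds: List[str],
    generalize: Callable[[str], List[str]],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Generalize the indicated corpora and rescreen against the original seed."""
    pool = list(candidates)
    for new_seed in new_seeds:
        for cand in generalize(new_seed):
            # Merge new candidates into the original seed's candidate set.
            if cand != seed and cand not in pool:
                pool.append(cand)
    # Rescreen the enlarged candidate set against the ORIGINAL seed corpus.
    return [cand for cand in pool if similarity(seed, cand) > threshold]
```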

In an embodiment, when the preset generalization manner includes the network crawling manner, the RPA system generalizing the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus includes: the RPA system searches the seed corpus in a web page search website to obtain a web page list displaying the search results, wherein the web page list includes a plurality of web page items and each web page item has a title sentence; the RPA system then crawls the title sentence of each web page item, and uses the title sentences that meet a matching condition as the candidate corpora of the seed corpus.

In this embodiment, a specific implementation process of obtaining the candidate corpus of the seed corpus by the RPA system in a network crawling manner is described. The web search website refers to a website for retrieving a corresponding web page according to a keyword. The web page search website may be set by default or may be designated by the user, and is not limited herein.

The RPA system may search the seed corpus in the web page search website to obtain a plurality of web page items related to the seed corpus and the title sentence of each web page item. For example, the seed corpus is "dietary contraindication for pregnancy preparation", and the web page items and title sentences found in the web page search website may include: a first web page item www.aaa.com.cn, whose title sentence is "how good to eat before pregnancy" how to prepare for pregnancy "and how to not eat" AAA web"; a second web page item www.bbb.com.cn, whose title sentence is "[ Progestion preparation diet Yi-Bao ] Progestion preparation diet cautionary item-BBB net"; and a third web page item www.ccc.com.cn, whose title sentence is 'pay attention to 5 points for pregnancy, good habit is given to you more lucky-CCC website', and the like.

Further, the RPA system may crawl the title sentence of each web page item, identify whether each title sentence meets the matching condition, and use the title sentences that meet the matching condition as the candidate corpora of the seed corpus. The matching condition is used to exclude title sentences that do not contain a sentence similar to the seed corpus.

Further, the RPA system may delete the words irrelevant to the seed corpus from each title sentence that meets the matching condition, to obtain the candidate corpus of the seed corpus. For example, the matching condition may be that the title sentence contains each word of the seed corpus or a synonym thereof.

In the above example, the title sentences of the first and second web page items meet the matching condition, while the title sentence of the third web page item does not. Therefore, "no-food for pregnancy" in the title sentence of the first web page item and "food attention for food for pregnancy" in the title sentence of the second web page item can be used as candidate corpora of the seed corpus "dietary contraindication for pregnancy preparation".

Therefore, with the network crawling manner, the RPA system can acquire the candidate corpora of the seed corpus from a generalization library consisting of a plurality of websites. The candidates are more varied in style and closer to the way real users phrase their queries, so the generalization effect better meets user requirements.
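As a minimal sketch of the example matching condition (the title sentence contains each word of the seed corpus or a synonym thereof), the crawled title sentences may be filtered as follows. The synonym table and the English titles in the usage note are hypothetical, and a regular expression stands in for real word segmentation.

```python
import re
from typing import Dict, List, Set

def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def meets_matching_condition(title: str, seed: str, synonyms: Dict[str, Set[str]]) -> bool:
    """True if every seed word, or one of its synonyms, occurs in the title."""
    title_tokens = tokens(title)
    for word in tokens(seed):
        accepted = {word} | synonyms.get(word, set())
        if not (accepted & title_tokens):
            return False
    return True

def candidate_titles(titles: List[str], seed: str, synonyms: Dict[str, Set[str]]) -> List[str]:
    """Keep only the crawled title sentences that meet the matching condition."""
    return [t for t in titles if meets_matching_condition(t, seed, synonyms)]
```

For instance, with the seed "pregnancy diet taboo" and the synonym table `{"taboo": {"precautions"}}`, the title "Pregnancy diet precautions - BBB net" would be kept, whereas "Five habits for a lucky pregnancy - CCC net" would be dropped because it contains neither "diet" nor a synonym of it.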

Optionally, when the preset generalization manner includes the network crawling manner, the method may further include: the RPA system receives filter words and, according to the filter words, uses the title sentences that meet the matching condition among the title sentences of the web page items as the candidate corpora of the seed corpus.

In this embodiment, the RPA system may receive the filter word sent by the second electronic device, determine a corresponding matching condition according to the filter word, screen the title sentence of each web page item, and use the title sentence, which meets the matching condition, in the title sentence of each web page item as the candidate corpus of the seed corpus, so that network crawling may be performed according to the filter word set by the user.

Therefore, by setting the filter words, the user can conveniently adjust the matching conditions of the network crawling mode according to the requirements, so that candidate sentences meeting the requirements are screened out, and the individuation of the network crawling mode is improved.

As another possible implementation, the RPA system may further receive a filter word and a matching mode and, according to the filter word and the matching mode, use the title sentences that meet the matching condition among the title sentences of the web page items as the candidate corpora of the seed corpus.

The matching mode is precise matching or fuzzy matching, and when the matching mode is precise matching, the matching condition is that the filter words are contained in the title sentences; and when the matching mode is fuzzy matching, the matching condition is that the title sentence contains the filter word or the synonym of the filter word.

In this embodiment, the RPA system may perform network crawling according to the filtering words and the matching mode set by the user. The RPA system may receive the filter word and the matching pattern sent by the second electronic device, determine a corresponding matching condition according to the filter word and the matching pattern, and filter the title sentence of each web page item.

Wherein, the matching mode comprises two optional modes: exact matching or fuzzy matching.

The matching condition corresponding to exact matching is that the title sentence must contain the filter word. For example, in the above example, assume that the filter word set by the user is "diet". Since the title sentence of the second web page item contains "diet", it meets the matching condition; the title sentences of the first and third web page items do not contain "diet", so they do not meet the matching condition.

The matching condition corresponding to fuzzy matching is that the title sentence contains the filter word or a synonym of the filter word. For example, in the above example, assume that the filter word set by the user is "diet". Since the title sentence of the first web page item contains "eat" (a synonym of "diet") and the title sentence of the second web page item contains "diet", both meet the matching condition; the title sentence of the third web page item contains neither, so it does not meet the matching condition.

Therefore, by setting the filter words and the matching mode, the user can conveniently adjust the matching conditions of the network crawling mode according to the requirement, so that candidate sentences meeting the requirement are screened out, and the individuation of the network crawling mode is improved.
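The two matching modes may be sketched as follows. The mode names `"exact"` and `"fuzzy"`, the synonym table, and the English titles in the usage note are illustrative assumptions.

```python
import re
from typing import Dict, Set

def title_matches(title: str, filter_word: str, mode: str, synonyms: Dict[str, Set[str]]) -> bool:
    """Apply the exact or fuzzy matching condition to one title sentence."""
    if mode == "exact":
        # Exact matching: the title must contain the filter word itself.
        accepted = {filter_word}
    elif mode == "fuzzy":
        # Fuzzy matching: the filter word or any of its synonyms is enough.
        accepted = {filter_word} | synonyms.get(filter_word, set())
    else:
        raise ValueError(f"unknown matching mode: {mode}")
    title_words = set(re.findall(r"\w+", title.lower()))
    return any(word.lower() in title_words for word in accepted)
```

With the filter word "diet" and the synonym table `{"diet": {"eat"}}`, a title such as "What to eat before pregnancy - AAA web" fails exact matching but passes fuzzy matching, which mirrors the behavior of the first web page item in the example above.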

Optionally, when the preset generalization manner includes the network crawling manner, the web page list includes at least one display page, and each display page includes at least one web page item.

The method further includes: the RPA system receives a specified website address and/or a specified number of pages to crawl.

Searching the seed corpus in the web page search website includes: the RPA system searches the seed corpus in the web page search website indicated by the specified website address.

Crawling the title sentence of each web page item includes: the RPA system crawls the title sentences of the web page items in the display pages corresponding to the specified number of pages to crawl.

In this embodiment, the RPA system may search the seed corpus in the web page search website indicated by the received specified website address. For example, the user may input the address of the web page search website that the user wants to use as the specified website address to the second electronic device, and the second electronic device sends the specified website address to the first electronic device. When the RPA system in the first electronic device generalizes in the network crawling manner, it searches in the website corresponding to the specified website address.

Optionally, the web page list includes at least one display page, and each display page includes at least one web page item. For example, when the RPA system generalizes in the network crawling manner, the web page search website finds 100 web page items related to the seed corpus and displays them in 10 display pages, with 10 web page items on each display page. The RPA system may crawl the title sentences of the web page items in the display pages, starting from the first page, up to the specified number of pages to crawl. For example, if the specified number of pages to crawl is 5, the RPA system crawls the title sentences of the web page items in the display pages from page 1 to page 5. The user can input the specified number of pages to crawl into the second electronic device, and the second electronic device sends it to the RPA system in the first electronic device.

Therefore, by crawling according to the specified website address, the search can be performed on the website specified by the user, which improves user experience. By crawling only the display pages corresponding to the specified number of pages to crawl, only the web page items highly relevant to the seed corpus are crawled, the crawling of irrelevant web page items is avoided, and the processing efficiency for the seed corpus is improved.

In an embodiment, when the preset generalization manner includes the synonym replacement manner, generalizing the seed corpus according to the synonym replacement manner to obtain at least one candidate corpus of the seed corpus, including: the RPA system acquires the affiliated field of the seed corpus, selects a synonym table corresponding to the affiliated field from a plurality of preset synonym tables, wherein each synonym table corresponds to one field, then searches key words in the seed corpus, and then performs synonym replacement on the key words in the seed corpus according to the synonym table corresponding to the affiliated field to obtain at least one candidate corpus of the seed corpus.

In this embodiment, a specific implementation process of obtaining the candidate corpus of the seed corpus by the first electronic device in the synonym replacement manner is described. The field of the seed corpus can be automatically identified or can be specified by a user. The fields may include a network field, a news field, a medical field, a travel field, a life field, etc., and are not limited thereto. For example, the user may determine the domain of the seed corpus, input the domain into the second electronic device, and the second electronic device sends the domain of the seed corpus to the first electronic device. The RPA system in the first electronic device can identify the keywords in the seed corpus, and perform synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to the field to which the keywords belong, so as to obtain the candidate corpus of the seed corpus.

For example, if the seed corpus is "dietary contraindication for pregnancy preparation", the field is "life field", and the keyword "contraindication" in the seed corpus is present in the synonym table corresponding to the field, the candidate corpus obtained through synonym replacement includes "dietary attention for pregnancy preparation" and "dietary attention for pregnancy preparation".
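The domain-specific synonym replacement may be sketched as follows. The contents of the synonym tables are invented for illustration, and whitespace splitting stands in for word segmentation.

```python
from typing import Dict, List

# Hypothetical synonym tables, one per field.
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "life": {"contraindication": ["attention", "precautions"]},
    "medical": {"contraindication": ["counter-indication"]},
}

def synonym_candidates(seed: str, domain: str) -> List[str]:
    """Replace each keyword found in the field's synonym table, one at a time."""
    table = SYNONYM_TABLES[domain]
    words = seed.split()
    candidates: List[str] = []
    for i, word in enumerate(words):
        for synonym in table.get(word.lower(), []):
            candidate = " ".join(words[:i] + [synonym] + words[i + 1:])
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates
```

Because the table is selected by field, the same keyword can yield different candidates in the life field and in the medical field.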

Optionally, the RPA system may perform word segmentation on the seed corpus, find the keywords with a higher TF-IDF (term frequency-inverse document frequency) value, and replace those keywords based on the synonym table, thereby generating candidate corpora. Because synonyms differ between fields, separate synonym tables are maintained for fields such as network, medical treatment, tourism, news, and life, and selecting the field allows synonym replacement to be carried out accurately. The seed corpus may be segmented with the pkuseg word segmentation tool.
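A minimal sketch of TF-IDF keyword selection is given below. It assumes that the seed corpus and the reference corpus have already been segmented into tokens, and it uses one common smoothed IDF formula; the present application does not specify which variant is used.

```python
import math
from collections import Counter
from typing import Dict, List

def top_keywords(seed_tokens: List[str], corpus: List[List[str]], k: int = 1) -> List[str]:
    """Return the k seed tokens with the highest TF-IDF score."""
    n_docs = len(corpus)
    term_counts = Counter(seed_tokens)
    scores: Dict[str, float] = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF; the added 1.0 keeps very common terms above zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (count / len(seed_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```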

Therefore, synonym replacement is carried out on the keywords in the seed corpus through the synonym table corresponding to the field of the seed corpus, the accuracy of the candidate corpus obtained by synonym replacement can be improved, and the accuracy of corpus generalization is further improved.

In an embodiment, when the preset generalization manner includes the knowledge base retrieval manner, the RPA system generalizing the seed corpus according to the knowledge base retrieval manner to obtain at least one candidate corpus of the seed corpus includes: the RPA system searches a knowledge base for the generalized corpora corresponding to the seed corpus, wherein the knowledge base includes a plurality of seed corpora and their corresponding generalized corpora, and then uses the generalized corpora corresponding to the seed corpus in the knowledge base as the candidate corpora of the seed corpus.

In this embodiment, a specific implementation process of obtaining the candidate corpora of a seed corpus in the knowledge base retrieval manner is described. It is understood that the generalized corpora obtained each time a seed corpus is generalized may be added to the knowledge base. Optionally, the generalized corpora with higher accuracy may be selected and stored in the knowledge base after being reviewed by a trainer. Thus, when a seed corpus needs to be generalized, the knowledge base can be searched for generalized corpora corresponding to the seed corpus, and if such generalized corpora exist, they are used as the candidate corpora of the seed corpus. Because the generalized corpora obtained each time are added to the knowledge base, the corpus data in the knowledge base is continuously enriched, and the review step helps ensure its accuracy, which in turn improves the accuracy of subsequent corpus generalization.
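As an illustrative sketch, the knowledge base may be modeled as a mapping from each seed corpus to its stored generalized corpora. The class and method names are hypothetical, and an in-memory dictionary stands in for whatever storage is actually used.

```python
from typing import Dict, List

class KnowledgeBase:
    """In-memory stand-in for the knowledge base of seed and generalized corpora."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = {}

    def retrieve(self, seed: str) -> List[str]:
        """Stored generalized corpora for the seed, used as candidate corpora."""
        return list(self._entries.get(seed, []))

    def update(self, seed: str, generalized: List[str]) -> None:
        """Add newly obtained generalized corpora, skipping duplicates."""
        stored = self._entries.setdefault(seed, [])
        for corpus in generalized:
            if corpus not in stored:
                stored.append(corpus)
```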

In an embodiment, when the preset generalization manner includes the sentence pattern extraction manner, the RPA system generalizing the seed corpus according to the sentence pattern extraction manner to obtain at least one candidate corpus of the seed corpus includes: the RPA system identifies and extracts the key words in the seed corpus through a dependency parsing (DP) algorithm, and then combines the key words of the seed corpus to generate a candidate corpus of the seed corpus.

In this embodiment, a specific implementation process of obtaining the candidate corpora of the seed corpus in the sentence pattern extraction manner is described. The key words may include, but are not limited to, at least one of the subject, predicate, and object of the seed corpus. The RPA system can identify and extract the key words in the seed corpus through the DP algorithm, and then combine the key words to generate candidate corpora of the seed corpus. For example, if the seed corpus is "how to know that the user is pregnant", the key words may be "how", "know", "self", "pregnant", and the like, and the combined candidate corpora may include "know that the user is pregnant", "how to know pregnant", and the like.

Therefore, in this embodiment, dependency parsing is used to extract core words such as the subject, predicate, and object of the seed corpus and form a complete sentence from them, or to delete qualifying modifier phrases, while the meaning of the original sentence is still kept, so that the accuracy of the generated candidate corpus is ensured.
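The combination step may be sketched as follows, assuming that a dependency parser has already labeled each word. The relation labels follow the Universal Dependencies naming (`nsubj`, `root`, `obj`, and so on) and are written by hand here; the parser itself, and the exact set of relations treated as core, are assumptions.

```python
from typing import List, Tuple

# Relations treated as the sentence core: subject, predicate (root), object.
CORE_RELATIONS = {"nsubj", "root", "obj"}

def extract_sentence_pattern(parsed: List[Tuple[str, str]]) -> str:
    """Join the core words in their original order, dropping modifiers."""
    return " ".join(word for word, relation in parsed if relation in CORE_RELATIONS)
```

For the hand-labeled sentence `[("I", "nsubj"), ("really", "advmod"), ("want", "root"), ("healthy", "amod"), ("food", "obj")]`, the adverbial and adjectival modifiers are dropped and the core words are joined into a shorter sentence.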

Fig. 4 is a schematic flow chart of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application. On the basis of the embodiment of fig. 3, this embodiment describes in detail a specific implementation process in which a history record is queried before generalization. As shown in fig. 4, the method includes:

S401, the RPA system receives a first request, wherein the first request includes a seed corpus.

In this embodiment, S401 is similar to S301 in the embodiment of fig. 3, and is not described here again.

S402, the RPA system searches for the seed corpus in a history record, wherein the history record includes historically generalized seed corpora and their corresponding generalized corpora.

S403, if the RPA system finds the seed corpus in the history record, the RPA system acquires the generalized corpus of the seed corpus from the history record.

In this embodiment, the history record may store the seed corpus and the corresponding generalization corpus that are previously input by the user. After receiving the first request, the RPA system first searches whether the seed corpus exists in the history record, if so, directly obtains the generalized corpus of the seed corpus from the history record, and if not, obtains the generalized corpus of the seed corpus by generalization according to the manners of S404 and S405.

S404, if the seed corpus is not found in the history record by the RPA system, generalizing the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.

In this embodiment, S404 is similar to S302 in the embodiment of fig. 3, and is not described here again.

S405, the RPA system identifies the similarity between at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus.

In this embodiment, S405 is similar to S303 in the embodiment of fig. 3, and is not described herein again.

S406, the RPA system outputs the generalized corpus of the seed corpus.

In this embodiment, S406 is similar to S304 in the embodiment of fig. 3, and is not described herein again.

In this embodiment, the history record is first queried for the seed corpus, and for a seed corpus that has been generalized before, the generalized corpus is obtained directly, which improves generalization efficiency. For example, if a user needs to generalize 200 seed corpora in batch and the history query shows that 50 of them have already been generalized, the generalized corpora of those 50 seed corpora are obtained directly from the history record, and only the remaining 150 seed corpora need to be generalized in the preset generalization manner. This reduces the amount of data to be generalized and improves generalization efficiency.
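The history lookup of S402 to S404 may be sketched for a batch of seed corpora as follows. The `generalize` callable stands for the whole of S404 and S405. Writing fresh results back into the history record is an assumption added for illustration; the steps above only describe reading from it.

```python
from typing import Callable, Dict, List

def generalize_batch(
    seeds: List[str],
    history: Dict[str, List[str]],
    generalize: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Reuse history for known seed corpora; generalize only the rest."""
    results: Dict[str, List[str]] = {}
    for seed in seeds:
        if seed in history:
            # S403: the seed corpus was generalized before, so reuse the result.
            results[seed] = history[seed]
        else:
            # S404/S405: generalize and screen the seed corpus afresh.
            results[seed] = generalize(seed)
            history[seed] = results[seed]  # assumed write-back, see lead-in
    return results
```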

Fig. 5 is a schematic flow chart of a corpus generalization method combining RPA and AI according to another embodiment of the present application. The method may be executed by the second electronic device in fig. 1, which includes an RPA system. As shown in fig. 5, the method includes:

S501, the RPA system receives the seed corpus input by the user.

In this embodiment, the RPA system in the second electronic device may receive the seed corpus input by the user. The user may input a single seed corpus or may input a plurality of seed corpora in batch, which is not limited herein. The seed corpus may be input into the input box by a user, or a file containing the seed corpus is uploaded by the user, and the RPA system extracts the seed corpus from the file.

S502, the RPA system sends a first request containing the seed corpus to a first electronic device, wherein the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and take the candidate corpus with a similarity greater than a preset threshold as the generalized corpus of the seed corpus.

In this embodiment, the first electronic device may be the first electronic device in fig. 1. The RPA system in the second electronic device may send the first request to the first electronic device. The first electronic device may generalize the seed corpus according to the first request to obtain its generalized corpus. The specific generalization process is similar to that of the corpus generalization method executed by the first electronic device and is not described here again.

Optionally, the preset generalization manner includes at least one of the following: network crawling mode, synonym replacing mode, knowledge base searching mode and sentence pattern extracting mode.

Optionally, the method further includes: the RPA system displays the identification of each preset generalization mode, then receives a selection instruction input by a user, wherein the selection instruction is used for indicating a specified identification in the identification of the preset generalization mode, and then sends the specified identification to the first electronic equipment.

In this embodiment, the preset generalization manner may include one or more of the above four manners. The implementation process of each predetermined generalization manner is similar to the above embodiment of the corpus generalization method using the first electronic device as the execution main body, and is not described herein again.

When the preset generalization mode includes multiple types, the second electronic device may display the identifier of each preset generalization mode, so that the user may select the preset generalization mode to be adopted.

Optionally, after receiving a selection instruction input by the user, the RPA system sends the specified identifier selected by the user to the first electronic device, so that the first electronic device generalizes the seed corpus in the preset generalization mode corresponding to the specified identifier. The user may select one or more specified identifiers, which is not limited herein.

Fig. 6 is a schematic diagram of a preset generalization mode selection interface provided in an embodiment of the present application. In fig. 6, the user may upload a file including the seed corpus by inputting the seed corpus in the seed corpus input box or by clicking the upload file control. The user can check the preset generalization mode to be used, and after the generalization control is clicked, the RPA system sends a first request to the first electronic device to request the first electronic device to generalize the seed corpus input by the user in the preset generalization mode checked by the user.

Therefore, by displaying the identification of each preset generalization mode and receiving the selection instruction input by the user, the user can conveniently select the preset generalization mode for use, the user operation is facilitated, and the user experience is improved.
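As a rough sketch of how the selected identifiers might travel in the first request, the example below serializes the seed corpora together with the specified identifiers. The identifier strings, the JSON field names, and the function `build_first_request` are hypothetical; the application does not define a wire format.

```python
import json
from typing import Iterable, List

# Hypothetical identifiers for the four preset generalization modes.
PRESET_MODES = {
    "crawl": "network crawling",
    "synonym": "synonym replacement",
    "kb": "knowledge base retrieval",
    "pattern": "sentence pattern extraction",
}


def build_first_request(seeds: List[str], selected: Iterable[str]) -> str:
    """Serialize the seed corpora and the user-selected mode identifiers."""
    selected = list(selected)
    unknown = [m for m in selected if m not in PRESET_MODES]
    if unknown:
        raise ValueError(f"unknown generalization mode(s): {unknown}")
    if not selected:
        raise ValueError("at least one generalization mode must be selected")
    # Assumed payload layout: one list of seeds, one list of identifiers.
    return json.dumps(
        {"seed_corpora": seeds, "mode_ids": selected}, ensure_ascii=False
    )


request_body = build_first_request(
    ["how do I reset my password"], ["crawl", "synonym"]
)
```

Validating the identifiers on the RPA side is a design choice of this sketch; it keeps malformed selections from reaching the first electronic device.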

Optionally, when the preset generalization manner includes the network crawling manner, the method further includes: the RPA system receives a filtering word and a matching mode input by a user, and sends the filtering word and the matching mode to the first electronic equipment.

In this embodiment, the user may configure the network crawling manner. The RPA system can receive the filter words and the matching modes input by the user, and send the filter words and the matching modes to the first electronic equipment, so that the first electronic equipment can perform network crawling according to the filter words and the matching modes. The implementation process of performing network crawling according to the filter word and the matching pattern is similar to the embodiment of the corpus generalization method using the first electronic device as the execution main body, and is not described herein again.

Optionally, when the preset generalization manner includes the network crawling manner, the method further includes: the RPA system receives a specified website address and/or specified number of pages to be crawled input by a user, and sends the specified website address and/or the specified number of pages to the first electronic equipment.

In this embodiment, the user may configure the network crawling manner. The RPA system can receive a specified website address and/or a specified number of crawled pages input by a user, and send the specified website address and/or the specified number of crawled pages to the first electronic device, so that the first electronic device can perform network crawling according to the specified website address and/or the specified number of crawled pages. The implementation process of performing the web crawling according to the designated website address and/or the designated crawling page number is similar to the embodiment of the corpus generalization method using the first electronic device as the execution subject, and is not described herein again.

Fig. 7 is a schematic diagram of a configuration interface of a network crawling manner provided in an embodiment of the present application. In fig. 7, the user can configure a designated website address to be used in a designated network address input box, configure a designated number of crawl pages in a designated crawl page number input box, configure filter words in a filter word input box, and select a matching mode.
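The four settings on the configuration interface could be carried in a small structure such as the one below. This is only a sketch: the class name `CrawlConfig`, its field names, and the validation rules are assumptions, and the two matching modes are taken from the exact and fuzzy matching described elsewhere in this application.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CrawlConfig:
    """User-supplied settings for the network crawling mode (hypothetical)."""

    site_url: Optional[str] = None      # specified website address
    page_count: Optional[int] = None    # specified number of crawled pages
    filter_words: List[str] = field(default_factory=list)
    match_mode: str = "exact"           # "exact" or "fuzzy"

    def __post_init__(self) -> None:
        if self.match_mode not in ("exact", "fuzzy"):
            raise ValueError("match_mode must be 'exact' or 'fuzzy'")
        if self.page_count is not None and self.page_count < 1:
            raise ValueError("page_count must be at least 1")


config = CrawlConfig(
    site_url="https://example.com/search",  # placeholder address
    page_count=2,
    filter_words=["password"],
    match_mode="fuzzy",
)
payload = asdict(config)  # plain dict, ready to be sent with the request
```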

S503, the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device.

In this embodiment, the RPA system may receive and display the generalized corpus of the seed corpus sent by the first electronic device.

To sum up, the corpus generalization method combining the RPA and the AI according to the embodiment of the present application sends a first request including a seed corpus to a first electronic device by receiving the seed corpus input by a user, where the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus, identify a similarity between the at least one candidate corpus and the seed corpus, use the candidate corpus with the similarity greater than a preset threshold as a generalization corpus of the seed corpus, and receive and display the generalization corpus of the seed corpus sent by the first electronic device. According to the method, the seed corpus can be automatically generalized through a preset generalization mode, and the generalized candidate corpus is screened according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved under the condition that the generalization effect is ensured.

Optionally, to make it easier for the user to edit the generalized corpora, the page displayed by the RPA system provides operations such as adding, deleting and modifying, and supports batch operation controls, so that corresponding processing is performed according to the operation control triggered by the user.

In one embodiment, the receiving and displaying, by the RPA system, of the generalized corpus of the seed corpus sent by the first electronic device includes: the RPA system receives at least one generalized corpus of the seed corpus sent by the first electronic device together with its corresponding similarity and/or generalization mode, and displays the at least one generalized corpus of the seed corpus in association with its corresponding similarity and/or generalization mode.

In this embodiment, the RPA system may receive at least one generalization corpus of the seed corpus sent by the first electronic device and the corresponding similarity and/or generalization manner thereof, and perform association display, so as to facilitate a user to view the similarity between each generalization corpus and the seed corpus and obtain the generalization manner of the generalization corpus.

After the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, the method further includes: the RPA system receives a screening instruction input by the user, wherein the screening instruction is used to indicate at least one of the preset generalization modes, and displays the generalized corpora obtained in the generalization mode indicated by the screening instruction.

In this embodiment, the user may also screen the generalized corpora displayed by the RPA system according to a preset generalization mode, and the RPA system only displays the generalized corpora obtained in the generalization mode indicated by the screening instruction according to the screening instruction, so that the user can conveniently check the generalized corpora obtained in different generalization modes.

For example, when generalizing a batch of seed corpora, a user usually selects all of the preset generalization modes in order to obtain as many generalized corpora as possible. Looking back at an individual seed corpus, however, the user will find that the generalization modes that work well differ from one seed corpus to another, and that the generalized corpora produced by a single generalization mode may already satisfy the user's requirements for a given seed corpus. By providing a screening control for the generalization mode, this embodiment further supports personalized processing of the generalized corpora.

Fig. 8 is a schematic diagram of a display interface of the generalized corpora according to the embodiment of the present application. In fig. 8, the corresponding similarity and generalization mode are shown after each generalized corpus. A screening control for the generalization mode is provided in the display interface. After the user clicks the screening control, the RPA system pops up a generalization mode screening window in which the user can select the required generalization mode, and the RPA system then displays in the display interface only the generalized corpora obtained in the generalization mode selected by the user.
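The associated display and the screening by generalization mode can be illustrated with a small data model. The class `DisplayRow`, the function `screen_rows`, the mode identifiers, and the sample similarity values are all hypothetical and serve only to show the filtering step.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DisplayRow:
    """One displayed generalized corpus with its associated metadata."""

    text: str
    similarity: float
    mode: str  # identifier of the generalization mode that produced it


def screen_rows(rows: List[DisplayRow], selected_modes: Iterable[str]) -> List[DisplayRow]:
    """Keep only the rows produced by the modes named in the screening instruction."""
    wanted = set(selected_modes)
    return [row for row in rows if row.mode in wanted]


rows = [
    DisplayRow("how to reset a password", 0.91, "crawl"),
    DisplayRow("reset my passcode", 0.84, "synonym"),
    DisplayRow("password reset steps", 0.88, "kb"),
]
visible = screen_rows(rows, ["synonym", "kb"])
```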

In one embodiment, there is at least one seed corpus.

The method further comprises: the RPA system displays the number of generalized corpora of each seed corpus;

the RPA system displaying the generalized corpus of the seed corpus sent by the first electronic device comprises: receiving a trigger instruction of the user for the number of generalized corpora of a specified seed corpus, wherein the specified seed corpus is one of the seed corpora, and displaying the generalized corpora of the specified seed corpus.

In this embodiment, when there are a plurality of seed corpora, the RPA system may display on the interface the number of generalized corpora of each seed corpus and, after receiving a trigger instruction of the user for the number of generalized corpora of a certain seed corpus, display each generalized corpus of that seed corpus. The display interface therefore stays simple even when there are many seed corpora. For example, when a user is viewing a plurality of seed corpora, a list of generalized corpora may pop up on the right side of the seed corpus clicked by the user, which matches the user's operating habits and makes it convenient to delete and modify the generalized corpora. When the user clicks the seed corpus again, the pop-up may disappear.

In one embodiment, after S403, the method may further include the following, where there is at least one generalized corpus of the seed corpus:

the RPA system displaying the generalized corpus of the seed corpus sent by the first electronic device further comprises: displaying each generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that the generalized corpus selected by the user is to be generalized as a new seed corpus;

further, the method further comprises: the RPA system, in response to a trigger operation on the indication control, sends a second request to the first electronic device, wherein the second request is used to instruct the first electronic device to generalize the generalized corpus selected by the user as a new seed corpus, and receives and displays the generalized corpus of the seed corpus sent again by the first electronic device.

In this embodiment, the RPA system may display each generalized corpus of the seed corpus and the indication control, and send a second request to the first electronic device in response to a trigger operation on the indication control. The first electronic device then generalizes the generalized corpus selected by the user as a new seed corpus, adds the generalization result to the candidate corpora of the original seed corpus, recalculates the similarity between every candidate corpus and the original seed corpus, redetermines the generalized corpora of the original seed corpus, and sends the redetermined generalized corpora of the original seed corpus to the RPA system, so that the RPA system updates and displays the generalized corpora of the original seed corpus. The specific generalization process of the first electronic device is similar to that of the embodiment of the corpus generalization method in which the first electronic device is the execution subject, and is not described herein again.
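The re-generalization triggered by the second request can be sketched as a merge followed by a fresh similarity pass against the original seed corpus. The names `regeneralize` and `word_overlap` are hypothetical, and the word-overlap (Jaccard) score is only a stand-in, since this passage does not specify how similarity is computed.

```python
from typing import Callable, List


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def regeneralize(
    original_seed: str,
    candidates: List[str],
    selected: List[str],
    generalize: Callable[[str], List[str]],
    threshold: float,
) -> List[str]:
    """Merge candidates of the new seeds, then re-screen against the original seed."""
    merged = list(candidates)
    for new_seed in selected:
        for candidate in generalize(new_seed):
            if candidate not in merged:  # avoid duplicate candidates
                merged.append(candidate)
    # Similarity is always measured against the ORIGINAL seed corpus.
    return [c for c in merged if word_overlap(c, original_seed) > threshold]


def fake_generalize(seed: str) -> List[str]:
    """Stand-in for generalizing the user-selected corpus."""
    return ["reset my account password", "change email"]


updated = regeneralize(
    original_seed="reset my password",
    candidates=["reset password", "forgot my password"],
    selected=["reset password"],
    generalize=fake_generalize,
    threshold=0.5,
)
```

With a threshold of 0.5, "forgot my password" scores exactly 0.5 and is dropped because the comparison is strictly greater than the threshold, matching the wording of the method.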

Fig. 9 is a schematic view of a display interface of the generalized corpora according to the embodiment of the present application. In fig. 9, each generalized corpus has a selection box; the user can check one or more of the generalized corpora as new seed corpora through the selection boxes and then click the indication control "seed corpora" on the interface, thereby triggering the RPA system to send the second request to the first electronic device. In this embodiment, providing the indication control makes it convenient for the user to select new seed corpora from the currently displayed generalized corpora, so that the first electronic device generalizes again according to the new seed corpora checked by the user and updates the generalized corpora of the original seed corpus. This improves the convenience of user operation, further improves generalization efficiency, and improves the user experience.

Fig. 10 is a signaling interaction diagram of a corpus generalization method according to an embodiment of the present application. The execution subjects in the signaling interaction diagram comprise the first electronic device and the second electronic device in fig. 1. As shown in fig. 10, the method may include:

S1001, the second electronic device receives the seed corpus input by the user.

S1002, the second electronic device sends a first request containing the seed corpus to the first electronic device.

S1003, the first electronic device generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.

S1004, the first electronic device identifies the similarity between each candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold as the generalization corpus of the seed corpus.

S1005, the first electronic device sends the generalized corpora of the seed corpora to the second electronic device.

S1006, the second electronic device displays the generalization corpus of the seed corpus.

The specific implementation process and technical effects of the method are similar to those of the embodiment of the corpus generalization method in which the first electronic device is the execution subject and of the embodiment in which the second electronic device is the execution subject, so the method is only briefly described here and is not repeated.

Fig. 11 is a flowchart illustrating a corpus generalization method according to yet another embodiment of the present application. The execution subject of the method may be the third electronic device in fig. 2, and the third electronic device includes the RPA system. As shown in fig. 11, the method includes:

S1101, the RPA system receives the seed corpus input by the user.

In this embodiment, the RPA system may receive the seed corpus input by the user, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again.

S1102, the RPA system generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.

In this embodiment, the RPA system may generalize the seed corpus according to the preset generalization mode; the implementation process and technical effect are similar to those of the embodiment of the corpus generalization method in which the first electronic device is the execution subject, and are not described herein again.

Optionally, the preset generalization mode includes at least one of the following: synonym replacement mode, network crawling mode, knowledge base retrieval mode and sentence pattern extraction mode.

Optionally, each preset generalization mode corresponds to an identifier.

The method further comprises: the RPA system displays the identifier of each preset generalization mode and receives a selection instruction input by the user, wherein the selection instruction is used to indicate a specified identifier among the identifiers of the preset generalization modes.

The RPA system generalizing the seed corpus according to a preset generalization mode comprises: the RPA system generalizes the seed corpus in the preset generalization mode corresponding to the specified identifier.

In this embodiment, the RPA system displays the identifier of each preset generalization mode and receives the selection instruction input by the user, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again. The RPA system generalizes the seed corpus in a preset generalization mode corresponding to the designated identifier, and the implementation process and technical effect thereof are similar to the above embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.

S1103, the RPA system identifies the similarity between at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus.

In this embodiment, the RPA system identifies the similarity between each corpus candidate and the seed corpus, and determines the corpus candidate with the similarity greater than the preset threshold as the generalized corpus of the seed corpus, which is similar to the embodiment of the corpus generalization method using the first electronic device as the execution main body, and thus, the implementation process and the technical effect are not repeated herein.
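The threshold screening in S1103 can be sketched as below. Cosine similarity over word counts is used here only as a stand-in measure, since this passage does not say how similarity is identified; the function names and the threshold value 0.6 are likewise illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences."""
    vec_a, vec_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_generalized(
    seed: str, candidates: List[str], threshold: float
) -> List[Tuple[str, float]]:
    """Keep candidates whose similarity to the seed exceeds the threshold."""
    scored = [(c, cosine_similarity(seed, c)) for c in candidates]
    return [(c, score) for c, score in scored if score > threshold]


kept = select_generalized(
    "reset my password",
    ["reset password", "track my order"],
    threshold=0.6,
)
```

Returning the score alongside each kept candidate makes it straightforward to show the similarity next to the generalized corpus on the display interface.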

S1104, the RPA system displays the generalized corpus of the seed corpus.

In this embodiment, the RPA system displays the generalized corpus of the seed corpus, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again.

To sum up, the corpus generalization method combining the RPA and the AI according to the embodiment of the present application generalizes the seed corpus by receiving the seed corpus input by the user according to the preset generalization mode to obtain at least one candidate corpus of the seed corpus, identifies the similarity between the at least one candidate corpus and the seed corpus, determines the candidate corpus with the similarity greater than the preset threshold as the generalized corpus of the seed corpus, and displays the generalized corpus of the seed corpus. According to the method, the seed corpus can be automatically generalized through a preset generalization mode, and the generalized candidate corpus is screened according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved under the condition that the generalization effect is ensured.

In one embodiment, there is at least one generalized corpus of the seed corpus.

The RPA system displaying the generalized corpus of the seed corpus comprises: the RPA system displays at least one generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that the generalized corpus selected by the user is to be generalized as a new seed corpus.

The method further comprises:

the RPA system, in response to a trigger operation on the indication control, generalizes the generalized corpus selected by the user as a new seed corpus to obtain candidate corpora of the new seed corpus, adds the candidate corpora of the new seed corpus to the candidate corpora of the seed corpus, re-identifies the similarity between each candidate corpus of the seed corpus and the seed corpus, determines each candidate corpus whose similarity is greater than the preset threshold as a generalized corpus of the seed corpus, and then redisplays the generalized corpora of the seed corpus.

In this embodiment, the RPA system displays at least one generalization corpus of the seed corpus and the indication control, and re-displays the generalization corpus of the seed corpus, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not repeated herein. The RPA system generalizes the generalization linguistic data selected by the user as a new seed linguistic data to obtain a candidate linguistic data of the new seed linguistic data; the implementation process and technical effect of the method are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not repeated herein.

In an embodiment, when the preset generalization mode includes the network crawling mode, the RPA system generalizing the seed corpus according to the network crawling mode to obtain at least one candidate corpus of the seed corpus includes: the RPA system searches for the seed corpus on a web search website to obtain a web page list displaying the search results, wherein the web page list comprises a plurality of web page items and each web page item has a title sentence; crawls the title sentences of the web page items; and takes the title sentences that meet the matching condition as candidate corpora of the seed corpus.

In this embodiment, the RPA system generalizes the seed corpus in a network crawling manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.

Optionally, the method further comprises: the RPA system receives the filter words and matching patterns input by the user.

The RPA system taking the title sentences that meet the matching condition as candidate corpora of the seed corpus comprises: the RPA system takes, according to the filter word and the matching mode, the title sentences that meet the matching condition among the title sentences of the web page items as candidate corpora of the seed corpus, wherein the matching mode is exact matching or fuzzy matching; when the matching mode is exact matching, the matching condition is that the title sentence contains the filter word; and when the matching mode is fuzzy matching, the matching condition is that the title sentence contains the filter word or a synonym of the filter word.
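The two matching conditions can be sketched as follows, assuming a single filter word and a caller-supplied synonym mapping. The function names and the sample synonym table are hypothetical; the containment test is a plain case-sensitive substring check.

```python
from typing import Dict, List, Optional, Set


def title_matches(
    title: str, filter_word: str, mode: str, synonyms: Dict[str, Set[str]]
) -> bool:
    """Apply the exact or fuzzy matching condition to one title sentence."""
    if mode == "exact":
        # Exact matching: the title must contain the filter word itself.
        return filter_word in title
    if mode == "fuzzy":
        # Fuzzy matching: the filter word or any of its synonyms qualifies.
        accepted = {filter_word} | synonyms.get(filter_word, set())
        return any(word in title for word in accepted)
    raise ValueError("mode must be 'exact' or 'fuzzy'")


def select_candidates(
    titles: List[str],
    filter_word: str,
    mode: str,
    synonyms: Optional[Dict[str, Set[str]]] = None,
) -> List[str]:
    """Return the crawled title sentences that meet the matching condition."""
    table = synonyms or {}
    return [t for t in titles if title_matches(t, filter_word, mode, table)]


titles = [
    "How to reset your password",
    "Recover a forgotten passcode",
    "Order tracking help",
]
synonym_table = {"password": {"passcode"}}  # illustrative synonym mapping
exact_hits = select_candidates(titles, "password", "exact")
fuzzy_hits = select_candidates(titles, "password", "fuzzy", synonym_table)
```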

Optionally, the method further comprises: the RPA system receives the specified website address and/or the specified number of crawled pages input by the user.

The RPA system searching for the seed corpus on a web search website comprises: the RPA system searches for the seed corpus on the web search website indicated by the specified website address;

the RPA system crawling the title sentences of the web page items comprises: the RPA system crawls the title sentences of all web page items in the result pages corresponding to the specified number of crawled pages.

In this embodiment, the RPA system receives the filter word and the matching pattern input by the user, and receives the specified website address and/or the specified crawl page number input by the user, and the implementation process and the technical effect are similar to those of the above embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again. The RPA system generalizes the seed corpus in a network crawling manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
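A sketch of how the specified website address and number of crawled pages might drive the crawl is given below. To stay self-contained it takes the page-fetching step as an injected function instead of performing real network access; `crawl_titles`, `fake_fetch`, the placeholder address, and the reading of the page number as "result pages 1 to N" are all assumptions.

```python
from typing import Callable, Dict, List

# fetch(site_url, query, page) -> title sentences on that result page
FetchTitles = Callable[[str, str, int], List[str]]


def crawl_titles(
    seed: str, site_url: str, page_count: int, fetch: FetchTitles
) -> List[str]:
    """Collect title sentences from result pages 1..page_count."""
    titles: List[str] = []
    for page in range(1, page_count + 1):
        titles.extend(fetch(site_url, seed, page))
    return titles


# Canned result pages standing in for a real search website.
FAKE_PAGES: Dict[int, List[str]] = {
    1: ["How to reset your password", "Password reset FAQ"],
    2: ["Recover a forgotten passcode"],
    3: ["Unrelated result"],
}


def fake_fetch(site_url: str, query: str, page: int) -> List[str]:
    """Stand-in for the search-and-crawl step; ignores the URL and query."""
    return FAKE_PAGES.get(page, [])


crawled = crawl_titles(
    "reset my password", "https://example.com/search", 2, fake_fetch
)
```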

In an embodiment, when the preset generalization mode includes the synonym replacement mode, the RPA system generalizing the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus includes: the RPA system acquires the domain to which the seed corpus belongs, selects the synonym table corresponding to that domain from a plurality of preset synonym tables, wherein each synonym table corresponds to one domain, finds the keywords in the seed corpus, and performs synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to that domain to obtain at least one candidate corpus of the seed corpus.

In this embodiment, the RPA system generalizes the seed corpus in a synonym replacement manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
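Synonym replacement with one synonym table per domain might look like the sketch below. The table contents, the whitespace tokenization, and treating every word that appears in the table as a keyword are assumptions; the domain is passed in directly because this passage does not say how it is acquired.

```python
from itertools import product
from typing import Dict, List

# Hypothetical preset synonym tables, one per domain.
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "banking": {"card": ["bank card", "debit card"], "lost": ["misplaced"]},
    "ecommerce": {"order": ["purchase"]},
}


def synonym_candidates(
    seed: str, domain: str, tables: Dict[str, Dict[str, List[str]]]
) -> List[str]:
    """Generate candidates by replacing keywords with domain synonyms."""
    table = tables[domain]  # synonym table for the seed's domain
    tokens = seed.split()
    # Each token may stay as-is or be replaced by one of its synonyms.
    options = [[token] + table.get(token, []) for token in tokens]
    combos = [" ".join(choice) for choice in product(*options)]
    # Drop the unchanged sentence; it is the seed itself.
    return [c for c in combos if c != seed]


candidates = synonym_candidates("I lost my card", "banking", SYNONYM_TABLES)
```

Because every combination of replacements is generated, the number of candidates grows multiplicatively with the number of keywords, which is one reason the later similarity screening is useful.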

In an embodiment, when the preset generalization mode includes the knowledge base retrieval mode, the RPA system generalizing the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus includes: the RPA system searches a knowledge base for the generalized corpora corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and their corresponding generalized corpora; and takes the generalized corpora corresponding to the seed corpus in the knowledge base as candidate corpora of the seed corpus.

In this embodiment, the RPA system generalizes the seed corpus in a knowledge base retrieval manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.

In an embodiment, when the preset generalization mode includes the sentence pattern extraction mode, the RPA system generalizing the seed corpus according to the sentence pattern extraction mode to obtain at least one candidate corpus of the seed corpus includes: the RPA system identifies and extracts the keywords in the seed corpus through a dependency syntactic analysis algorithm, and combines the keywords of the seed corpus to generate a candidate corpus of the seed corpus.

In this embodiment, the RPA system generalizes the seed corpus in the sentence pattern extraction mode; the implementation process and technical effect are similar to those of the embodiment of the corpus generalization method in which the first electronic device is the execution subject, and are not described herein again.
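The extraction step can be illustrated on an already parsed sentence. The sketch below does not run a dependency parser: the (token, relation) pairs are written by hand in the style of Universal Dependencies labels, and the set of relations treated as "key" is an assumption chosen for this example.

```python
from typing import FrozenSet, List, Tuple

# (token, dependency relation) pairs, as a parser might produce them.
Parse = List[Tuple[str, str]]

# Assumed set of relations whose tokens count as keywords.
CORE_RELATIONS: FrozenSet[str] = frozenset(
    {"root", "nsubj", "obj", "xcomp", "mark"}
)


def extract_sentence_pattern(
    parse: Parse, keep: FrozenSet[str] = CORE_RELATIONS
) -> str:
    """Keep tokens in core relations and join them in their original order."""
    return " ".join(token for token, relation in parse if relation in keep)


# Hand-written parse of "I really want to reset my password now".
parse: Parse = [
    ("I", "nsubj"),
    ("really", "advmod"),
    ("want", "root"),
    ("to", "mark"),
    ("reset", "xcomp"),
    ("my", "nmod:poss"),
    ("password", "obj"),
    ("now", "advmod"),
]
candidate = extract_sentence_pattern(parse)
```

Here the modifiers "really", "my" and "now" are dropped, leaving a shorter candidate built from the sentence's core structure.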

Fig. 12 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to an embodiment of the present application. The corpus generalization device 120 is applied to a first electronic device. As shown in fig. 12, the corpus generalization device 120 includes: a first receiving module 1201, a first processing module 1202, a first determining module 1203, a first outputting module 1204.

A first receiving module 1201, configured to receive a first request, where the first request includes a seed corpus;

a first processing module 1202, configured to, based on natural language processing (NLP), generalize the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;

a first determining module 1203, configured to identify similarity between the at least one candidate corpus and the seed corpus, and determine a candidate corpus of which the similarity is greater than a preset threshold as a generalized corpus of the seed corpus;

a first output module 1204, configured to output the generalized corpus of the seed corpus.
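The four modules can be pictured as four methods on one object, as in the sketch below. The class name, the dictionary-shaped request and response, and the injected generalization and similarity functions are assumptions for illustration; the numbers in the comments refer to the modules listed above.

```python
from typing import Callable, Dict, List

Generalizer = Callable[[str], List[str]]
Similarity = Callable[[str, str], float]


class CorpusGeneralizationDevice:
    """Toy counterpart of device 120 with one method per module."""

    def __init__(
        self, generalizers: List[Generalizer], similarity: Similarity, threshold: float
    ) -> None:
        self.generalizers = generalizers
        self.similarity = similarity
        self.threshold = threshold

    def receive(self, request: Dict[str, str]) -> str:
        # First receiving module 1201: take the seed corpus from the request.
        return request["seed_corpus"]

    def process(self, seed: str) -> List[str]:
        # First processing module 1202: run every preset generalization mode.
        candidates: List[str] = []
        for generalize in self.generalizers:
            for candidate in generalize(seed):
                if candidate not in candidates:
                    candidates.append(candidate)
        return candidates

    def determine(self, seed: str, candidates: List[str]) -> List[str]:
        # First determining module 1203: keep candidates above the threshold.
        return [c for c in candidates if self.similarity(c, seed) > self.threshold]

    def output(self, generalized: List[str]) -> Dict[str, List[str]]:
        # First output module 1204: package the generalized corpora.
        return {"generalized_corpora": generalized}

    def handle(self, request: Dict[str, str]) -> Dict[str, List[str]]:
        seed = self.receive(request)
        return self.output(self.determine(seed, self.process(seed)))


# Fixed scores stand in for a real similarity model.
SCORES = {"reset password": 0.9, "weather today": 0.1}
device = CorpusGeneralizationDevice(
    generalizers=[lambda seed: ["reset password"], lambda seed: ["weather today"]],
    similarity=lambda candidate, seed: SCORES[candidate],
    threshold=0.8,
)
response = device.handle({"seed_corpus": "reset my password"})
```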

In a possible embodiment, the preset generalization includes at least one of the following modes: network crawling mode, synonym replacing mode, knowledge base searching mode and sentence pattern extracting mode.

In a possible implementation manner, when the preset generalization manner includes the network crawling manner, the first processing module 1201 includes: the first processing unit is used for generalizing the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus;

the first processing unit is specifically configured to: search for the seed corpus on a web page search website to obtain a web page list displaying the search results, wherein the web page list comprises a plurality of web page items and each web page item has a title sentence; crawl the title sentence of each web page item; and take the title sentences that meet a matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.

In a possible implementation, the first processing unit is further configured to: receive a filter word; and, according to the filter word, take the title sentences that meet the matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.

the first processing unit is further configured to: receive a specified website address and/or a specified number of crawled pages; search for the seed corpus on the web page search website indicated by the specified website address; and crawl the title sentence of each web page item in the display pages corresponding to the specified number of crawled pages.
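
The title-crawling steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the result pages are passed in as already-fetched HTML strings (no network access is shown), each result title is assumed to sit in an `<h3>` element, and the matching condition is assumed to be "the title contains the filter word". None of these details are specified by the present application.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleCollector(HTMLParser):
    """Collects the text of every <h3> element (assumed to hold a result title)."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: List[str] = []
        self._buffer: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._buffer is not None:
            title = "".join(self._buffer).strip()
            if title:
                self.titles.append(title)
            self._buffer = None


def candidate_titles(result_pages: List[str], max_pages: int,
                     filter_word: str = "") -> List[str]:
    """Crawl titles from the first max_pages pages and keep those containing filter_word."""
    collector = TitleCollector()
    for page_html in result_pages[:max_pages]:
        collector.feed(page_html)
    # An empty filter word matches every title.
    return [t for t in collector.titles if filter_word.lower() in t.lower()]


pages = [
    "<h3>How to reset a password</h3><h3>Cheap flights</h3>",
    "<h3>Password reset guide</h3>",
    "<h3>Password tips</h3>",
]
titles = candidate_titles(pages, max_pages=2, filter_word="password")
```

Here `max_pages` plays the role of the specified number of crawled pages, so the third page is never parsed.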

In a possible implementation manner, when the preset generalization manner includes the synonym replacement manner, the first processing module 1202 further includes: a second processing unit, configured to generalize the seed corpus according to the synonym replacement manner, to obtain at least one candidate corpus of the seed corpus;

the second processing unit is specifically configured to: acquire the field to which the seed corpus belongs, and select the synonym table corresponding to that field from a plurality of preset synonym tables, wherein each synonym table corresponds to one field; search for keywords in the seed corpus; and perform synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to that field, to obtain at least one candidate corpus of the seed corpus.
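
As a minimal sketch of the synonym replacement steps above, assuming whitespace tokenization and that every word present in the selected table counts as a keyword (both are simplifications), the per-field table lookup might look like this. The table contents and field names are hypothetical.

```python
from typing import Dict, List

# Hypothetical per-field synonym tables; one table per field, as described above.
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "banking": {"card": ["bank card", "debit card"], "lost": ["missing"]},
    "telecom": {"card": ["SIM card"]},
}


def synonym_candidates(seed: str, field: str) -> List[str]:
    """Replace one keyword at a time with each of its synonyms from the field's table."""
    table = SYNONYM_TABLES.get(field, {})
    words = seed.split()
    candidates: List[str] = []
    for index, word in enumerate(words):
        for synonym in table.get(word.lower(), []):
            replaced = words[:index] + [synonym] + words[index + 1:]
            candidates.append(" ".join(replaced))
    return candidates


banking = synonym_candidates("my card is lost", "banking")
telecom = synonym_candidates("my card is lost", "telecom")
```

The same seed yields different candidates depending on the field, which is the reason for keeping one table per field.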

In a possible implementation manner, when the preset generalization manner includes the knowledge base retrieval manner, the first processing module 1202 further includes: a third processing unit, configured to generalize the seed corpus according to the knowledge base retrieval manner, to obtain at least one candidate corpus of the seed corpus;

the third processing unit is specifically configured to: search a knowledge base for the generalization corpus corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and their corresponding generalization corpora; and take the generalization corpus corresponding to the seed corpus in the knowledge base as a candidate corpus of the seed corpus.

In a possible implementation manner, when the preset generalization manner includes the sentence pattern extraction manner, the first processing module 1202 further includes: a fourth processing unit, configured to generalize the seed corpus according to the sentence pattern extraction manner, to obtain at least one candidate corpus of the seed corpus;

the fourth processing unit is specifically configured to: identify and extract key words in the seed corpus through a dependency parsing algorithm; and combine the key words of the seed corpus to generate a candidate corpus of the seed corpus.
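
The extraction-and-combination step above can be illustrated as follows. This sketch does not run a dependency parser: the parse is supplied by hand, and both the set of relations treated as "key" and the rule of joining key words in their original order are assumptions made only for illustration.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    dep: str  # dependency relation of this token to its head


# Assumed set of relations whose tokens are treated as key words.
KEY_RELATIONS = {"root", "nsubj", "obj"}


def key_words(parse: List[Token]) -> List[str]:
    """Keep the tokens whose dependency relation marks them as key words."""
    return [token.text for token in parse if token.dep in KEY_RELATIONS]


def pattern_candidate(parse: List[Token]) -> str:
    """Combine the key words, in their original order, into one candidate."""
    return " ".join(key_words(parse))


# Hand-written parse of "how can I quickly cancel the order".
parse = [
    Token("how", "advmod"), Token("can", "aux"), Token("I", "nsubj"),
    Token("quickly", "advmod"), Token("cancel", "root"),
    Token("the", "det"), Token("order", "obj"),
]
candidate = pattern_candidate(parse)
```

In practice the hand-written parse would be replaced by the output of whatever dependency parser the system uses.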

the first processing module 1202 is further configured to: receive a specified identifier; and generalize the seed corpus by using the preset generalization manner corresponding to the specified identifier.
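
Selecting a generalization manner by identifier can be sketched as a lookup in a registry. The identifiers `"kb"` and `"synonym"`, the toy knowledge base, and the two generalizer functions below are hypothetical and serve only to show the dispatch.

```python
from typing import Callable, Dict, List

# Hypothetical knowledge base mapping a seed corpus to its stored generalization corpora.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "reset password": ["how do I reset my password"],
}


def knowledge_base_generalizer(seed: str) -> List[str]:
    """Knowledge base retrieval: return the stored generalizations, if any."""
    return list(KNOWLEDGE_BASE.get(seed, []))


def synonym_generalizer(seed: str) -> List[str]:
    """Toy synonym replacement with a single hard-coded synonym pair."""
    return [seed.replace("password", "passcode")] if "password" in seed else []


# Each preset generalization manner is registered under one identifier.
GENERALIZERS: Dict[str, Callable[[str], List[str]]] = {
    "kb": knowledge_base_generalizer,
    "synonym": synonym_generalizer,
}


def generalize_by_identifier(seed: str, identifier: str) -> List[str]:
    """Run the generalization manner registered under the specified identifier."""
    try:
        generalizer = GENERALIZERS[identifier]
    except KeyError:
        raise ValueError(f"unknown generalization identifier: {identifier}") from None
    return generalizer(seed)
```

For example, `generalize_by_identifier("reset password", "kb")` consults only the knowledge base, while the identifier `"synonym"` triggers only the replacement step.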

In a possible implementation manner, the first output module 1204 is further configured to: receive a second request, wherein the second request is used for indicating at least one generalization corpus of the seed corpus; generalize the generalization corpus indicated by the second request as a new seed corpus to obtain a candidate corpus of the new seed corpus; add the candidate corpus of the new seed corpus to the candidate corpora of the seed corpus, re-identify the similarity between the at least one candidate corpus of the seed corpus and the seed corpus, and determine the candidate corpus whose similarity is greater than the preset threshold as the generalization corpus of the seed corpus; and re-output the generalization corpus of the seed corpus.
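
The second-request flow above merges the new candidates into the existing candidate set and re-scores everything against the original seed corpus, not against the new seed. A minimal sketch, assuming a `difflib`-based stand-in for the similarity score and a hypothetical `generalizer` callable:

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the model used in the source."""
    return SequenceMatcher(None, a, b).ratio()


def regeneralize(seed: str,
                 candidates: List[str],
                 selected: List[str],
                 generalizer: Callable[[str], List[str]],
                 threshold: float) -> Tuple[List[str], List[str]]:
    """Generalize each selected corpus as a new seed, merge, and re-filter against seed."""
    merged = list(candidates)
    for new_seed in selected:
        for candidate in generalizer(new_seed):
            if candidate not in merged:
                merged.append(candidate)
    # Similarity is always measured against the original seed corpus.
    generalized = [c for c in merged if similarity(c, seed) > threshold]
    return merged, generalized


merged, generalized = regeneralize(
    seed="reset password",
    candidates=["reset my password", "zzzz"],
    selected=["reset my password"],
    generalizer=lambda s: [s + " now", "qqqq"],  # hypothetical generalizer
    threshold=0.7,
)
```

Scoring against the original seed keeps repeated rounds from drifting away from the meaning the user started with.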

In a possible implementation, the first receiving module 1201 is further configured to: search for the seed corpus in a historical record, wherein the historical record comprises historically generalized seed corpora and their corresponding generalization corpora; if the seed corpus is found in the historical record, acquire the generalization corpus of the seed corpus from the historical record; and if the seed corpus is not found in the historical record, generalize the seed corpus according to the preset generalization manner to obtain at least one candidate corpus of the seed corpus.
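
The historical-record lookup above behaves like a cache placed in front of the generalization pipeline. In the following sketch, `run_pipeline` is a hypothetical callable standing in for the full generalize-and-filter flow, and writing the new result back into the history is an assumption that the text above does not state.

```python
from typing import Callable, Dict, List


def generalize_with_history(seed: str,
                            history: Dict[str, List[str]],
                            run_pipeline: Callable[[str], List[str]]) -> List[str]:
    """Return the stored generalization corpora if present, otherwise run the pipeline."""
    if seed in history:
        return history[seed]
    result = run_pipeline(seed)
    history[seed] = result  # assumption: new results are recorded for later requests
    return result


calls: List[str] = []


def fake_pipeline(seed: str) -> List[str]:
    """Hypothetical pipeline that records each invocation."""
    calls.append(seed)
    return [seed + "?"]


history = {"reset password": ["how do I reset my password"]}
hit = generalize_with_history("reset password", history, fake_pipeline)
miss = generalize_with_history("cancel order", history, fake_pipeline)
repeat = generalize_with_history("cancel order", history, fake_pipeline)
```

The pipeline runs only once for "cancel order"; the repeated request is served from the history.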

In a possible implementation, the first determining module 1203 is further configured to: updating the seed corpus and the generalization corpus of the seed corpus into a knowledge base, wherein the knowledge base at least comprises the seed corpus and the corresponding generalization corpus; and identifying the similarity between the at least one candidate corpus and the seed corpus through a logistic regression model, wherein the logistic regression model is trained in advance through a training set consisting of a plurality of corpora in the knowledge base.
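
A logistic regression similarity scorer of the kind described above can be sketched in pure Python. The present application does not specify the input features or the training procedure, so the two features used here (word-level Jaccard overlap and a character-level `difflib` ratio), the batch gradient descent settings, and the tiny labelled training set are all assumptions.

```python
import math
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def features(a: str, b: str) -> List[float]:
    """Bias term plus two assumed features: word Jaccard and character ratio."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    jaccard = len(words_a & words_b) / len(union) if union else 0.0
    char_ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return [1.0, jaccard, char_ratio]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Logistic function applied to the weighted feature sum."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs: List[Tuple[str, str, int]],
          epochs: int = 2000, lr: float = 0.5) -> List[float]:
    """Fit the weights by batch gradient descent on the log loss."""
    data = [(features(a, b), label) for a, b, label in pairs]
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, label in data:
            error = predict(weights, x) - label
            for j, xj in enumerate(x):
                grad[j] += error * xj
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights


def similarity(weights: Sequence[float], a: str, b: str) -> float:
    """Probability that b is an acceptable generalization of a."""
    return predict(weights, features(a, b))


# Hypothetical labelled pairs standing in for corpora drawn from the knowledge base.
TRAINING_PAIRS = [
    ("reset my password", "how do I reset my password", 1),
    ("cancel my order", "cancel the order", 1),
    ("reset my password", "weather forecast today", 0),
    ("cancel my order", "opening hours of the store", 0),
]
weights = train(TRAINING_PAIRS)
```

The resulting probability can be compared directly with the preset threshold when deciding which candidates to keep.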

The corpus generalization device combining RPA and AI provided in the embodiment of the present application may be used to implement the method embodiment using the first electronic device as an execution main body, and the implementation principle and technical effect are similar, which are not described herein again.

Fig. 13 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to yet another embodiment of the present application. The corpus generalization device 130 is applied to a second electronic device. As shown in fig. 13, the corpus generalization device 130 includes: a second receiving module 1301, a first sending module 1302, and a first display module 1303.

A second receiving module 1301, configured to receive a seed corpus input by a user;

a first sending module 1302, configured to send a first request including the seed corpus to a first electronic device, where the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus, identify a similarity between the at least one candidate corpus and the seed corpus, and use the candidate corpus whose similarity is greater than a preset threshold as a generalized corpus of the seed corpus;

and the first display module 1303 is configured to receive and display the generalized corpus of the seed corpus sent by the first electronic device.

the first display module 1303 is specifically configured to: display at least one generalization corpus of the seed corpus and an indication control, wherein the indication control is used for indicating that the generalization corpus selected by a user is to be generalized as a new seed corpus; in response to a trigger operation on the indication control, send a second request to the first electronic device, wherein the second request is used for instructing the first electronic device to generalize the generalization corpus selected by the user as a new seed corpus; and receive and display the generalization corpus of the seed corpus re-sent by the first electronic device.

In a possible implementation, the first display module 1303 is further configured to: receive the at least one generalization corpus of the seed corpus sent by the first electronic device, together with its corresponding similarity and/or generalization manner; and display the at least one generalization corpus of the seed corpus in association with its corresponding similarity and/or generalization manner;

the first display module 1303 is further configured to: receive a screening instruction input by a user, wherein the screening instruction is used for indicating at least one of the preset generalization manners; and display the generalization corpora obtained by the generalization manner indicated by the screening instruction.
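
The screening step above amounts to filtering the displayed records by the generalization manner that produced them. A minimal sketch, in which the record type, its field names, and the manner labels are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class GeneralizedItem:
    text: str          # the generalization corpus
    similarity: float  # its similarity to the seed corpus
    manner: str        # the generalization manner that produced it


def screen(items: List[GeneralizedItem], manners: Set[str]) -> List[GeneralizedItem]:
    """Keep only the items produced by one of the manners named in the instruction."""
    return [item for item in items if item.manner in manners]


items = [
    GeneralizedItem("how do I reset my password", 0.92, "web crawling"),
    GeneralizedItem("reset my passcode", 0.81, "synonym replacement"),
    GeneralizedItem("password reset steps", 0.77, "knowledge base retrieval"),
]
shown = screen(items, {"synonym replacement", "knowledge base retrieval"})
```

Because each record carries its similarity and manner, the same list supports both the associated display and the screening described above.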

In a possible implementation, there is at least one seed corpus;

the first display module 1303 is further configured to: display the at least one seed corpus and the number of generalization corpora of each seed corpus; receive a trigger instruction from a user on the number of generalization corpora of a specified seed corpus, wherein the specified seed corpus is one of the seed corpora; and display the generalization corpora of the specified seed corpus.

The corpus generalization device combining RPA and AI provided in the embodiment of the present application may be used to implement the method embodiment using the second electronic device as an execution main body, and the implementation principle and technical effect are similar, which are not described herein again.

Fig. 14 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to another embodiment of the present application. The corpus generalization device 140 is applied to a third electronic device. As shown in fig. 14, the corpus generalization device 140 includes: a third receiving module 1401, a second processing module 1402, a second determining module 1403, and a second displaying module 1404.

A third receiving module 1401, configured to receive a seed corpus input by a user;

a second processing module 1402, configured to generalize the seed corpus according to a preset generalization manner, to obtain at least one candidate corpus of the seed corpus;

a second determining module 1403, configured to identify a similarity between the at least one candidate corpus and the seed corpus, and determine a candidate corpus of which the similarity is greater than a preset threshold as a generalization corpus of the seed corpus;

a second display module 1404, configured to display the generalized corpus of the seed corpus.

In a possible implementation manner, when the preset generalization manner includes the network crawling manner, the second processing module 1402 includes: a fifth processing unit, configured to generalize the seed corpus according to the network crawling manner, to obtain at least one candidate corpus of the seed corpus;

the fifth processing unit is specifically configured to: search for the seed corpus on a web page search website to obtain a web page list displaying the search results, wherein the web page list comprises a plurality of web page items and each web page item has a title sentence; crawl the title sentence of each web page item; and take the title sentences that meet a matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.

In a possible implementation, the fifth processing unit is further configured to: receive a filter word; and, according to the filter word, take the title sentences that meet the matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.

the fifth processing unit is further configured to: receive a specified website address and/or a specified number of crawled pages; search for the seed corpus on the web page search website indicated by the specified website address; and crawl the title sentence of each web page item in the display pages corresponding to the specified number of crawled pages.

In a possible implementation manner, when the preset generalization manner includes the synonym replacement manner, the second processing module 1402 further includes: a sixth processing unit, configured to generalize the seed corpus according to the synonym replacement manner, to obtain at least one candidate corpus of the seed corpus;

the sixth processing unit is specifically configured to: acquire the field to which the seed corpus belongs, and select the synonym table corresponding to that field from a plurality of preset synonym tables, wherein each synonym table corresponds to one field; search for keywords in the seed corpus; and perform synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to that field, to obtain at least one candidate corpus of the seed corpus.

In a possible implementation manner, when the preset generalization manner includes the knowledge base retrieval manner, the second processing module 1402 further includes: a seventh processing unit, configured to generalize the seed corpus according to the knowledge base retrieval manner, to obtain at least one candidate corpus of the seed corpus;

the seventh processing unit is specifically configured to: search a knowledge base for the generalization corpus corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and their corresponding generalization corpora; and take the generalization corpus corresponding to the seed corpus in the knowledge base as a candidate corpus of the seed corpus.

In a possible implementation manner, when the preset generalization manner includes the sentence pattern extraction manner, the second processing module 1402 further includes: an eighth processing unit, configured to generalize the seed corpus according to the sentence pattern extraction manner, to obtain at least one candidate corpus of the seed corpus;

the eighth processing unit is specifically configured to: identify and extract key words in the seed corpus through a dependency parsing algorithm; and combine the key words of the seed corpus to generate a candidate corpus of the seed corpus.

the second processing module 1402 is further configured to: display the identifier of each preset generalization manner; receive a selection instruction input by a user, wherein the selection instruction is used for indicating a specified identifier among the identifiers of the preset generalization manners; and generalize the seed corpus by using the preset generalization manner corresponding to the specified identifier.

the second display module 1404 is further configured to: display at least one generalization corpus of the seed corpus and an indication control, wherein the indication control is used for indicating that the generalization corpus selected by a user is to be generalized as a new seed corpus; in response to a trigger operation on the indication control, generalize the generalization corpus selected by the user as a new seed corpus to obtain a candidate corpus of the new seed corpus; add the candidate corpus of the new seed corpus to the candidate corpora of the seed corpus, re-identify the similarity between the at least one candidate corpus of the seed corpus and the seed corpus, and determine the candidate corpus whose similarity is greater than the preset threshold as the generalization corpus of the seed corpus; and redisplay the generalization corpus of the seed corpus.

The corpus generalization device combining RPA and AI provided in the embodiment of the present application may be used to implement the method embodiment using the third electronic device as an execution main body, and the implementation principle and technical effect are similar, which are not described herein again.

Fig. 15 is a schematic hardware structure diagram of a first electronic device according to an embodiment of the present application. As shown in fig. 15, the first electronic device 150 provided in the present embodiment includes: at least one processor 1501 and a memory 1502. The first electronic device 150 also includes a communication component 1503. The processor 1501, the memory 1502, and the communication component 1503 are connected by a bus 1504.

In a specific implementation process, the at least one processor 1501 executes the computer-executable instructions stored in the memory 1502, so that the at least one processor 1501 executes the corpus generalization method with the first electronic device as an execution subject as described above.

For a specific implementation process of the processor 1501, reference may be made to the above method embodiments, which implement similar principles and technical effects, and this embodiment is not described herein again.

Fig. 16 is a schematic hardware structure diagram of a second electronic device according to yet another embodiment of the present application. As shown in fig. 16, the second electronic device 160 provided in the present embodiment includes: at least one processor 1601 and a memory 1602. The second electronic device 160 further comprises a communication component 1603. The processor 1601, the memory 1602, and the communication component 1603 are connected via a bus 1604.

In a specific implementation process, the at least one processor 1601 executes the computer executable instructions stored in the memory 1602, so that the at least one processor 1601 executes the corpus generalization method with the second electronic device as the execution subject.

For a specific implementation process of the processor 1601, reference may be made to the above method embodiments, which achieve similar implementation principles and technical effects, and details of this embodiment are not described herein again.

Fig. 17 is a schematic hardware structure diagram of a third electronic device according to another embodiment of the present application. As shown in fig. 17, the third electronic device 170 provided in the present embodiment includes: at least one processor 1701 and a memory 1702. The third electronic device 170 further comprises a communication component 1703. The processor 1701, the memory 1702, and the communication component 1703 are connected by a bus 1704.

In particular implementations, the at least one processor 1701 executes the computer executable instructions stored in the memory 1702, so that the at least one processor 1701 executes the corpus generalization method with the third electronic device as an execution subject as described above.

For a specific implementation process of the processor 1701, reference may be made to the above method embodiments, which have similar implementation principles and technical effects, and no further description is given here.

In the embodiments shown in fig. 15, fig. 16, and fig. 17, it should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor.

The memory may comprise high-speed RAM and may also include non-volatile memory (NVM), such as at least one disk memory.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method taking the first electronic device as an execution subject is realized.

The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method taking the second electronic device as an execution subject is realized.

The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method taking the third electronic device as an execution subject is realized.

The readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.

An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A corpus generalization method combining RPA and AI, characterized in that the method is applied to a first electronic device, the first electronic device comprises an RPA system, and the method comprises:

The RPA system receives a first request, wherein the first request includes a seed corpus;

The RPA system generalizes the seed corpus according to a preset generalization method based on natural language processing NLP to obtain at least one candidate corpus of the seed corpus;

The RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;

The RPA system outputs a generalized corpus of the seed corpus.

2. The method according to claim 1, wherein the preset generalization manner comprises at least one of the following manners:

Web crawling, synonym replacement, knowledge base retrieval and sentence extraction.

3. The method according to claim 2, wherein when the preset generalization manner includes the web crawling manner, the RPA system generalizing the seed corpus according to the preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus includes:

The RPA system generalizes the seed corpus according to the web crawling method to obtain at least one candidate corpus of the seed corpus;

The RPA system generalizes the seed corpus according to the web crawling method to obtain at least one candidate corpus of the seed corpus, including:

The RPA system searches for the seed corpus in a web page search website, and obtains a web page list displaying search results, wherein the web page list includes a plurality of web page items, and each web page item has a title statement;

The RPA system crawls the title statement of each web page item;

The RPA system uses the title sentences that meet the matching conditions in the title sentences of each web page item as the candidate corpus of the seed corpus.

4. The method according to claim 3, wherein the method further comprises:

the RPA system receives filter words;

The RPA system uses the title sentences that meet the matching conditions in the title sentences of each web page item as the candidate corpus of the seed corpus, including:

The RPA system selects, according to the filter word, a title sentence that meets the matching condition in the title sentence of each web page item as a candidate corpus of the seed corpus.

5. The method according to claim 3, wherein the webpage list comprises at least one display page, and each display page comprises at least one webpage item;

The method also includes:

The RPA system receives the specified website address and/or the specified number of crawled pages;

The RPA system searches for the seed corpus in the webpage search website, including:

The RPA system searches the seed corpus in the webpage search website indicated by the designated website address;

The RPA system crawls the title statement of each web page item, including:

The RPA system crawls the title statement of each webpage item in the display page corresponding to the specified number of crawled pages.

6. The method according to claim 2, wherein when the preset generalization manner includes the synonym replacement manner, the RPA system generalizing the seed corpus according to the preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus further includes:

The RPA system generalizes the seed corpus according to the synonym replacement method to obtain at least one candidate corpus of the seed corpus, including:

The RPA system obtains the field of the seed corpus, and selects a synonym table corresponding to the field from a plurality of preset synonym tables, wherein each synonym table corresponds to a field;

The RPA system searches for keywords in the seed corpus;

The RPA system performs synonym substitution for keywords in the seed corpus according to the synonym table corresponding to the field to which they belong, to obtain at least one candidate corpus of the seed corpus.

7. The method according to claim 2, wherein when the preset generalization manner includes the knowledge base retrieval manner, the RPA system generalizing the seed corpus according to the preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus further includes:

The RPA system generalizes the seed corpus according to the knowledge base retrieval method to obtain at least one candidate corpus of the seed corpus;

The RPA system generalizes the seed corpus according to the knowledge base retrieval method to obtain at least one candidate corpus of the seed corpus, including:

The RPA system searches the knowledge base for the generalization corpus corresponding to the seed corpus, wherein the knowledge base includes a plurality of seed corpora and their corresponding generalization corpora;

The RPA system uses the generalized corpus corresponding to the seed corpus in the knowledge base as a candidate corpus of the seed corpus.

8. The method according to claim 2, wherein when the preset generalization manner includes the sentence pattern extraction manner, the RPA system generalizing the seed corpus according to the preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus further includes:

The RPA system generalizes the seed corpus according to the sentence pattern extraction method to obtain at least one candidate corpus of the seed corpus;

The RPA system generalizes the seed corpus according to the sentence pattern extraction method to obtain at least one candidate corpus of the seed corpus, including:

The RPA system identifies and extracts key words in the seed corpus through a dependency parsing algorithm;

The RPA system combines key words of the seed corpus to generate candidate corpora of the seed corpus.

9. The method according to claim 1, wherein each preset generalization mode corresponds to an identifier;

The method also includes:

The RPA system receives the specified identifier;

The RPA system generalizes the seed corpus according to a preset generalization method based on natural language processing NLP to obtain at least one candidate corpus of the seed corpus, and further includes:

The RPA system generalizes the seed corpus by using a preset generalization manner corresponding to the specified identifier.

10. The method according to any one of claims 1-9, wherein after the RPA system outputs the generalized corpus of the seed corpus, the method further comprises:

The RPA system receives a second request, wherein the second request is used to indicate at least one generalized corpus of the seed corpus;

The RPA system generalizes the generalization corpus indicated by the second request as a new seed corpus to obtain a candidate corpus of the new seed corpus;

The RPA system adds the candidate corpus of the new seed corpus to the candidate corpora of the seed corpus, re-identifies the similarity between the at least one candidate corpus of the seed corpus and the seed corpus, and determines the candidate corpus whose similarity is greater than the preset threshold as the generalized corpus of the seed corpus;

The RPA system re-outputs the generalized corpus of the seed corpus.

11. The method according to any one of claims 1-9, wherein after the RPA system receives the first request, the method further comprises:

The RPA system searches for the seed corpus in a historical record, and the historical record includes a historically generalized seed corpus and its corresponding generalized corpus;

If the RPA system finds the seed corpus in the historical record, then obtains the generalized corpus of the seed corpus from the historical record;

If the RPA system does not find the seed corpus in the historical record, it generalizes the seed corpus according to a preset generalization method to obtain at least one candidate corpus of the seed corpus.

12. The method according to any one of claims 1-9, wherein the method further comprises:

The RPA system updates the seed corpus and the generalization corpus of the seed corpus into a knowledge base, where the knowledge base at least includes the seed corpus and its corresponding generalization corpus;

The RPA system identifies the similarity between the at least one candidate corpus and the seed corpus through a logistic regression model, wherein the logistic regression model is pre-trained on a training set consisting of multiple corpora in the knowledge base.

13. A corpus generalization method, characterized in that, applied to a second electronic device, the second electronic device comprising an RPA system, the method comprising:

The RPA system receives the seed corpus input by the user;

The RPA system sends a first request including the seed corpus to a first electronic device, where the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and use the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;

The RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device.

14. The method according to claim 13, wherein the generalized corpus of the seed corpus is at least one;

The RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, including:

The RPA system displays at least one generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that the generalized corpus selected by the user is to be generalized as a new seed corpus;

After receiving and displaying the generalized corpus of the seed corpus sent by the first electronic device, the RPA system further includes:

The RPA system sends a second request to the first electronic device in response to the triggering operation for the indicating control, wherein the second request is used to instruct the first electronic device to use the generalized corpus selected by the user Generalize as a new seed corpus;

The RPA system receives and displays the generalized corpus of the seed corpus retransmitted by the first electronic device.

15. The method according to claim 14, wherein the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, further comprising:

The RPA system receives the at least one generalized corpus of the seed corpus sent by the first electronic device and its corresponding similarity and/or generalization method;

The RPA system displays the at least one generalized corpus of the seed corpus in association with its corresponding similarity and/or generalization method;

The RPA system receives a screening instruction input by a user, wherein the screening instruction is used to indicate at least one of the preset generalization methods;

The RPA system displays the generalized corpus obtained by the generalization method indicated by the screening instruction.
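The associated display and screening steps of this claim can be sketched as follows. The record type `GeneralizedCorpus`, the helper names, the display format, and the method labels such as `synonym_replacement` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneralizedCorpus:
    text: str          # the generalized corpus itself
    similarity: float  # similarity to the seed corpus
    method: str        # label of the generalization method that produced it

def format_row(item):
    """Render one generalized corpus together with its similarity and method."""
    return f"{item.text} | similarity={item.similarity:.2f} | method={item.method}"

def filter_by_method(items, selected_methods):
    """Keep only the corpora produced by the methods named in the screening instruction."""
    selected = set(selected_methods)
    return [item for item in items if item.method in selected]

if __name__ == "__main__":
    results = [
        GeneralizedCorpus("how can I reset my password", 0.91, "synonym_replacement"),
        GeneralizedCorpus("how to reset password", 0.84, "back_translation"),
    ]
    for row in filter_by_method(results, ["back_translation"]):
        print(format_row(row))
```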

16. The method according to any one of claims 13-15, wherein there is at least one seed corpus;

The RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, further comprising:

The RPA system displays the at least one seed corpus and the number of its generalized corpora;

After the RPA system displays the at least one seed corpus and the number of its generalized corpora, the method further comprises:

The RPA system receives a triggering instruction from the user on the number of generalized corpora of a specified seed corpus, wherein the specified seed corpus is one of the at least one seed corpus;

The RPA system displays the generalized corpus of the specified seed corpus.
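The count-then-expand display described in this claim can be sketched in a few lines, assuming the results are held as a mapping from each seed corpus to the list of its generalized corpora. The function names and the data layout are assumptions for illustration.

```python
def count_generalized(results):
    """Summary view: each seed corpus with the number of its generalized corpora."""
    return {seed: len(generalized) for seed, generalized in results.items()}

def show_generalized(results, specified_seed):
    """Detail view shown when the user triggers the count of a specified seed corpus."""
    return list(results.get(specified_seed, []))
```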

17. A corpus generalization method combining RPA and AI, wherein the method is applied to a third electronic device, wherein the third electronic device comprises an RPA system, and the method comprises:

The RPA system receives the seed corpus input by the user;

The RPA system generalizes the seed corpus according to a preset generalization method to obtain at least one candidate corpus of the seed corpus;

The RPA system displays a generalized corpus of the seed corpus.

18. A corpus generalization device combining RPA and AI, characterized in that the device is applied to a first electronic device and comprises:

a first receiving module, configured to receive a first request, wherein the first request includes a seed corpus;

a first processing module, configured to generalize the seed corpus according to a preset generalization method based on natural language processing NLP to obtain at least one candidate corpus of the seed corpus;

a first determination module, configured to identify the similarity between the at least one candidate corpus and the seed corpus, and determine the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;

a first output module, configured to output the generalized corpus of the seed corpus.
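The four modules of this device can be pictured as a small pipeline. The sketch below is an assumption-laden illustration: it uses a hypothetical synonym table as the preset generalization method and `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, neither of which is prescribed by the claim, and the request field name `seed_corpus` is invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for a preset NLP generalization method.
SYNONYMS = {"reset": ["change", "recover"], "password": ["passcode"]}

def generalize(seed):
    """Processing module: replace one word at a time with a synonym."""
    words = seed.split()
    candidates = []
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word, []):
            candidates.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return candidates

def score(seed, candidate):
    """Stand-in similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, seed, candidate).ratio()

def handle_first_request(request, threshold=0.8):
    """Receive the request, generalize, keep candidates above the threshold, output them."""
    seed = request["seed_corpus"]                          # receiving module
    scored = [(c, score(seed, c)) for c in generalize(seed)]  # processing module
    return [c for c, s in scored if s > threshold]         # determination and output

if __name__ == "__main__":
    print(handle_first_request({"seed_corpus": "how do I reset my password"}))
```

The threshold is left as a parameter because a suitable value depends on the similarity measure in use.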

19. A corpus generalization device combining RPA and AI, characterized in that the device is applied to a second electronic device and comprises:

a second receiving module, configured to receive a seed corpus input by a user;

a first sending module, configured to send a first request including the seed corpus to a first electronic device, wherein the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization method based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and take the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;

a first display module, configured to receive and display the generalized corpus of the seed corpus sent by the first electronic device.

20. A corpus generalization device combining RPA and AI, characterized in that the device is applied to a third electronic device and comprises:

a third receiving module, configured to receive a seed corpus input by a user;

a second processing module, configured to generalize the seed corpus according to a preset generalization method to obtain at least one candidate corpus of the seed corpus;

a second determination module, configured to identify the similarity between the at least one candidate corpus and the seed corpus, and determine the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;

a second display module, configured to display the generalized corpus of the seed corpus.

21. A first electronic device, comprising: at least one processor and a memory;

the memory stores computer-executable instructions;

The at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor executes the corpus generalization method of any one of claims 1-12.

22. A second electronic device, comprising: at least one processor and a memory;

the memory stores computer-executable instructions;

The at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor executes the corpus generalization method of any one of claims 13-16.

23. A third electronic device, comprising: at least one processor and a memory;

the memory stores computer-executable instructions;

The at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor executes the corpus generalization method of claim 17.

24. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to any one of claims 1-12 is implemented.

25. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to any one of claims 13-16 is implemented.

26. A computer-readable storage medium, characterized in that computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to claim 17 is implemented.