CN112307295A - Corpus generalization method, apparatus and electronic device combining RPA and AI - Google Patents

Info

Publication number
CN112307295A
Authority
CN
China
Prior art keywords
corpus
seed
generalization
rpa system
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011206419.4A
Other languages
Chinese (zh)
Inventor
汪冠春
刘金艳
胡景超
胡一川
褚瑞
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Benying Network Technology Co Ltd
Beijing Laiye Network Technology Co Ltd
Original Assignee
Beijing Benying Network Technology Co Ltd
Beijing Laiye Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Benying Network Technology Co Ltd, Beijing Laiye Network Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a corpus generalization method, apparatus, and electronic device combining RPA and AI. The method includes: an RPA system receives a first request, wherein the first request includes a seed corpus; the RPA system generalizes the seed corpus, based on natural language processing (NLP) and according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus; the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines each candidate corpus whose similarity is greater than a preset threshold to be a generalized corpus of the seed corpus; the RPA system outputs the generalized corpus of the seed corpus. In this method, the RPA system automatically generalizes the seed corpus using the preset generalization mode and screens the candidate corpora against the preset threshold to obtain the generalized corpora of the seed corpus, which improves the efficiency of corpus generalization.

Description

Corpus generalization method and apparatus combining RPA and AI, and electronic device
Technical Field
The present application relates to the field of natural language processing, and in particular to a corpus generalization method and apparatus combining RPA and AI, an electronic device, and a storage medium.
Background
Robotic Process Automation (RPA) uses specific "robot software" to simulate human operations on a computer and to execute process tasks automatically according to rules.
Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence.
Natural Language Processing (NLP) is the science of studying computer systems, especially their software, that can communicate effectively in natural language. It is an important direction in the fields of computer science and artificial intelligence.
For human-computer interaction products such as search engines, smart speech products, and customer service robots, the intent of a user's sentence is typically identified by a machine learning model. The model is trained on corpora in advance, and its recognition capability depends on the number of corpora used for training. When there are too few corpora, their number can be increased by generalizing the existing corpora.
In the prior art, corpus generalization tasks are issued to multiple operators as crowdsourcing tasks, and the operators generalize the corpus by thinking up variants themselves.
However, because the corpus is generalized manually from the operators' imagination, corpus generalization is inefficient.
Disclosure of Invention
The embodiments of the present application provide a corpus generalization method, apparatus, device, and storage medium to solve the problem that corpus generalization is currently inefficient.
In a first aspect, an embodiment of the present application provides a corpus generalization method combining RPA and AI, which is applied to a first electronic device, where the first electronic device includes an RPA system, and the method includes:
the RPA system receives a first request, wherein the first request comprises a seed corpus;
the RPA system generalizes the seed corpus based on Natural Language Processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;
the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus;
and the RPA system outputs the generalized corpus of the seed corpus.
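The four steps of the first aspect can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the application's implementation: the names `generalize` and `token_overlap` are hypothetical, each generalization mode is modelled as a plain function from a seed to a list of candidates, and a simple token-overlap score stands in for whatever similarity model is actually used.

```python
from typing import Callable, Iterable, List


def token_overlap(candidate: str, seed: str) -> float:
    """Jaccard overlap of whitespace tokens, in [0, 1].

    Assumption: this is only a stand-in for the similarity model.
    """
    a, b = set(candidate.lower().split()), set(seed.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def generalize(
    seed: str,
    generators: Iterable[Callable[[str], List[str]]],
    threshold: float = 0.5,
    similarity: Callable[[str, str], float] = token_overlap,
) -> List[str]:
    """Collect candidates from every generator, then keep those whose
    similarity to the seed is strictly greater than the threshold."""
    candidates: List[str] = []
    for generate in generators:
        for candidate in generate(seed):
            # Skip the seed itself and duplicates, preserving order.
            if candidate != seed and candidate not in candidates:
                candidates.append(candidate)
    return [c for c in candidates if similarity(c, seed) > threshold]
```

A caller would pass one generator per enabled generalization mode; the threshold plays the role of the preset threshold in the method.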
In a possible embodiment, the preset generalization mode includes at least one of the following modes:
a network crawling mode, a synonym replacement mode, a knowledge base retrieval mode, and a sentence pattern extraction mode.
In a possible implementation manner, when the preset generalization manner includes the network crawling manner, the RPA system generalizes the seed corpus according to a preset generalization manner based on natural language processing NLP, to obtain at least one candidate corpus of the seed corpus, including:
the RPA system generalizes the seed corpus according to the network crawling mode to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizes the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus, including:
the RPA system searches the seed corpus in a webpage searching website to obtain a webpage list for displaying a searching result, wherein the webpage list comprises a plurality of webpage items, and each webpage item has a title sentence;
the RPA system crawls the title sentences of all the webpage items;
and the RPA system takes, from the title sentences of all the webpage items, the title sentences that meet a matching condition as the candidate corpora of the seed corpus.
In one possible embodiment, the method further comprises:
the RPA system receives filter words;
the RPA system taking the title sentences that meet the matching condition as the candidate corpora of the seed corpus includes:
the RPA system determines, according to the filter words, which title sentences of the webpage items meet the matching condition, and takes those title sentences as the candidate corpora of the seed corpus.
In one possible embodiment, the web page list comprises at least one presentation page, and each presentation page comprises at least one web page item;
the method further comprises the following steps:
the RPA system receives a specified website address and/or a specified number of pages to crawl;
the RPA system searches the seed corpus in a webpage search website, and the method comprises the following steps:
the RPA system searches the seed corpus in a webpage searching website indicated by the specified website address;
the RPA system crawls title sentences of all webpage items, and the method comprises the following steps:
and the RPA system crawls the title sentences of all the webpage items in the display pages corresponding to the specified crawling page number.
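The selection step of the network crawling mode can be sketched as follows. This section does not define the matching condition, so the sketch makes two loud assumptions: a title is rejected if it contains any filter word, and titles shorter than `min_length` characters are dropped. The function name and parameters are hypothetical, and the pages are assumed to be already fetched (no network access is shown).

```python
from typing import List, Sequence


def select_candidate_titles(
    pages: Sequence[Sequence[str]],
    max_pages: int,
    filter_words: Sequence[str],
    min_length: int = 4,
) -> List[str]:
    """Pick candidate title sentences from crawled search-result pages.

    pages[i] holds the title sentences of the webpage items on display page i.
    Only the first `max_pages` pages are considered (the specified number of
    pages to crawl).
    """
    selected: List[str] = []
    for page in pages[:max_pages]:
        for title in page:
            title = title.strip()
            if len(title) < min_length:
                continue  # assumed condition: too short to be a sentence
            if any(word in title for word in filter_words):
                continue  # assumed condition: contains a filter word
            if title not in selected:
                selected.append(title)
    return selected
```

The titles returned here are only candidates; they would still pass through the similarity check before being output as generalized corpora.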
In a possible implementation manner, when the preset generalization manner includes the synonym replacement manner, the RPA system generalizes the seed corpus according to a preset generalization manner based on natural language processing NLP to obtain at least one candidate corpus of the seed corpus, further including:
the RPA system generalizes the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizes the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus, including:
the RPA system acquires the affiliated field of the seed corpus and selects a synonym table corresponding to the affiliated field from a plurality of preset synonym tables, wherein each synonym table corresponds to one field;
the RPA system searches for keywords in the seed corpus;
and the RPA system performs synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to the field to which the seed corpus belongs, so as to obtain at least one candidate corpus of the seed corpus.
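A minimal sketch of the synonym replacement mode, assuming whitespace tokenization and that a keyword is any token present in the selected table. The table contents, the field names `banking` and `telecom`, and the function name are hypothetical examples, not data from the application.

```python
from itertools import product
from typing import Dict, List

# Hypothetical per-field synonym tables: field -> keyword -> synonyms.
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "banking": {"card": ["bank card", "debit card"], "lost": ["missing"]},
    "telecom": {"card": ["SIM card"]},
}


def synonym_candidates(
    seed: str,
    field: str,
    tables: Dict[str, Dict[str, List[str]]] = SYNONYM_TABLES,
) -> List[str]:
    """Generate candidates by replacing keywords with synonyms from the
    table that matches the field of the seed corpus."""
    table = tables.get(field, {})
    # For each token, the options are the token itself plus its synonyms.
    options = [[tok] + table.get(tok.lower(), []) for tok in seed.split()]
    candidates: List[str] = []
    for combo in product(*options):
        sentence = " ".join(combo)
        if sentence != seed and sentence not in candidates:
            candidates.append(sentence)
    return candidates
```

Selecting the table by field keeps replacements appropriate to context: in this toy data, "card" becomes "SIM card" only for the telecom field.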
In a possible implementation manner, when the preset generalization manner includes the knowledge base retrieval manner, the RPA system generalizes the seed corpus according to the preset generalization manner based on the natural language processing NLP to obtain at least one candidate corpus of the seed corpus, and further includes:
the RPA system generalizes the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizes the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus, and the method comprises the following steps:
the RPA system searches a knowledge base for the generalized corpora corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and their corresponding generalized corpora;
and the RPA system takes the generalized corpora corresponding to the seed corpus in the knowledge base as the candidate corpora of the seed corpus.
In a possible implementation manner, when the preset generalization manner includes the sentence pattern extraction manner, the RPA system generalizes the seed corpus according to the preset generalization manner based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, further including:
the RPA system generalizes the seed corpus according to the sentence pattern extraction mode to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizes the seed corpus according to the sentence pattern extraction mode to obtain at least one candidate corpus of the seed corpus, including:
the RPA system identifies and extracts keywords in the seed corpus through a dependency syntactic analysis algorithm;
and the RPA system combines the keywords of the seed corpus to generate a candidate corpus of the seed corpus.
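The sentence pattern extraction mode can be illustrated with a sketch that starts from an already parsed sentence. Everything here is an assumption for illustration: the input is a list of (word, dependency relation) pairs produced by some dependency parser, the relation labels follow the Universal Dependencies naming style, and the choice of which relations count as keywords (`KEY_RELATIONS`) is hypothetical. The application does not specify these details in this section.

```python
from typing import List, Sequence, Tuple

# Assumed set of dependency relations whose words are treated as keywords.
KEY_RELATIONS = ("nsubj", "root", "obj")


def pattern_candidate(
    parsed: Sequence[Tuple[str, str]],
    key_relations: Sequence[str] = KEY_RELATIONS,
) -> str:
    """Keep only the words whose dependency relation marks them as
    keywords, and join them in their original order."""
    keywords: List[str] = [word for word, rel in parsed if rel in key_relations]
    return " ".join(keywords)
```

For a parsed request such as "please cancel my order right now", the sketch keeps the root verb and its object and drops the modifiers.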
In one possible embodiment, each preset generalization mode corresponds to an identifier;
the method further comprises the following steps:
the RPA system receives a specified identifier;
the RPA system generalizing the seed corpus, based on natural language processing (NLP) and according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus further includes:
the RPA system generalizes the seed corpus using the preset generalization mode corresponding to the specified identifier.
In a possible implementation, after the RPA system outputs the generalized corpus of the seed corpus, the method further includes:
the RPA system receives a second request, wherein the second request indicates at least one generalized corpus of the seed corpus;
the RPA system generalizes the generalized corpus indicated by the second request as a new seed corpus to obtain candidate corpora of the new seed corpus;
the RPA system adds the candidate corpora of the new seed corpus to the candidate corpora of the seed corpus, re-identifies the similarity between each candidate corpus of the seed corpus and the seed corpus, and determines the candidate corpora whose similarity is greater than the preset threshold as the generalized corpora of the seed corpus;
and the RPA system outputs the generalized corpora of the seed corpus again.
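The second-request flow can be sketched as below. The point worth noticing is that the new candidates are scored against the original seed corpus, not against the generalized corpus they were derived from. The function name and its parameters are hypothetical; `generate` and `similarity` stand for whichever generalization mode and similarity model are in use.

```python
from typing import Callable, List, Sequence


def regeneralize(
    seed: str,
    existing_candidates: Sequence[str],
    selected: Sequence[str],
    generate: Callable[[str], List[str]],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Use each selected generalized corpus as a new seed, merge the new
    candidates into the existing pool, and re-screen the whole pool
    against the ORIGINAL seed."""
    pool: List[str] = list(existing_candidates)
    for new_seed in selected:
        for candidate in generate(new_seed):
            if candidate not in pool:
                pool.append(candidate)
    return [c for c in pool if similarity(c, seed) > threshold]
```

Scoring against the original seed limits drift: a candidate that is close to the intermediate corpus but far from the original seed is still filtered out.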
In one possible embodiment, after the RPA system receives the first request, the method further comprises:
the RPA system searches for the seed corpus in a history record, wherein the history record comprises previously generalized seed corpora and their corresponding generalized corpora;
if the RPA system finds the seed corpus in the history record, acquiring the generalization corpus of the seed corpus from the history record;
if the seed corpus is not found in the history record by the RPA system, generalizing the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.
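The history lookup amounts to a check before the generalization step. A minimal sketch, assuming the history record is an in-memory mapping from seed corpus to its generalized corpora; the function name and the mapping representation are hypothetical, and the text does not say how the history record is stored.

```python
from typing import Callable, Dict, List


def generalize_with_history(
    seed: str,
    history: Dict[str, List[str]],
    generalize_fn: Callable[[str], List[str]],
) -> List[str]:
    """Return the stored generalized corpora if the seed was generalized
    before; otherwise run the normal generalization."""
    if seed in history:
        return history[seed]
    return generalize_fn(seed)
```

This avoids repeating crawling and similarity scoring for a seed corpus that has already been processed.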
In one possible embodiment, the method further comprises:
the RPA system updates the knowledge base with the seed corpus and the generalized corpora of the seed corpus, wherein the knowledge base at least comprises seed corpora and their corresponding generalized corpora;
the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, including:
and the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus through a logistic regression model, wherein the logistic regression model is trained in advance through a training set consisting of a plurality of corpora in the knowledge base.
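A logistic regression similarity scorer can be sketched in plain Python. The feature set (a bias term, token overlap, and length ratio), the training loop settings, and all names are assumptions chosen for illustration; the text only states that the model is a logistic regression trained on corpora from the knowledge base. Training pairs are (candidate, seed, label), where label 1 would come from seed and generalized corpus pairs in the knowledge base and label 0 from unrelated pairs.

```python
import math
from typing import List, Sequence, Tuple


def features(candidate: str, seed: str) -> List[float]:
    """Hypothetical features: bias, token overlap, token-count ratio."""
    a, b = set(candidate.lower().split()), set(seed.lower().split())
    if not a or not b:
        return [1.0, 0.0, 0.0]
    overlap = len(a & b) / len(a | b)
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    return [1.0, overlap, length_ratio]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Sequence[float], candidate: str, seed: str) -> float:
    """Similarity in (0, 1): the logistic model's probability of a match."""
    x = features(candidate, seed)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(
    pairs: Sequence[Tuple[str, str, int]],
    epochs: int = 500,
    lr: float = 0.5,
) -> List[float]:
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for candidate, seed, label in pairs:
            x = features(candidate, seed)
            error = label - predict(weights, candidate, seed)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights
```

The value returned by `predict` is then compared with the preset threshold to decide whether a candidate becomes a generalized corpus.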
In a second aspect, an embodiment of the present application provides a corpus generalization method combining RPA and AI, which is applied to a second electronic device, where the second electronic device includes an RPA system, and the method includes:
the RPA system receives a seed corpus input by a user;
the RPA system sends a first request containing the seed corpus to a first electronic device, wherein the first request instructs the first electronic device to generalize the seed corpus, based on natural language processing (NLP) and according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and take the candidate corpora whose similarity is greater than a preset threshold as the generalized corpora of the seed corpus;
and the RPA system receives and displays the generalized corpora of the seed corpus sent by the first electronic device.
In a possible embodiment, there is at least one generalized corpus of the seed corpus;
the RPA system receiving and displaying the generalized corpora of the seed corpus sent by the first electronic device includes:
the RPA system displays the at least one generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that a generalized corpus selected by the user is to be generalized as a new seed corpus;
after the RPA system receives and displays the generalized corpora of the seed corpus sent by the first electronic device, the method further includes:
the RPA system, in response to a trigger operation on the indication control, sends a second request to the first electronic device, wherein the second request instructs the first electronic device to generalize the generalized corpus selected by the user as a new seed corpus;
and the RPA system receives and displays the generalized corpora of the seed corpus sent again by the first electronic device.
In a possible implementation manner, the RPA system receives and displays a generalized corpus of the seed corpus sent by the first electronic device, further including:
the RPA system receives the at least one generalization corpus of the seed corpus sent by the first electronic device and the corresponding similarity and/or generalization mode thereof;
the RPA system displays the at least one generalization corpus of the seed corpus and the corresponding similarity and/or generalization mode in a correlated manner;
after the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, the method further includes:
the RPA system receives a screening instruction input by a user, wherein the screening instruction is used for indicating at least one of the preset generalization modes;
and the RPA system displays the generalized corpora obtained by the generalization mode indicated by the screening instruction.
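The display-side screening can be sketched as a filter over records that carry each generalized corpus together with its similarity and generalization mode. The record type, field names, and mode identifiers below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class GeneralizedCorpus:
    text: str
    similarity: float
    mode: str  # identifier of the generalization mode that produced it


def filter_by_modes(
    items: Sequence[GeneralizedCorpus],
    modes: Iterable[str],
) -> List[GeneralizedCorpus]:
    """Keep only the generalized corpora produced by the modes named in
    the screening instruction, preserving display order."""
    wanted = set(modes)
    return [item for item in items if item.mode in wanted]
```

Keeping the similarity and mode alongside each corpus is what lets the interface show them in association and screen on them without another round trip.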
In a possible embodiment, there is at least one seed corpus;
the RPA system receiving and displaying the generalized corpora of the seed corpus sent by the first electronic device further includes:
the RPA system displays the at least one seed corpus and the number of its generalized corpora;
after the RPA system displays the at least one seed corpus and the number of its generalized corpora, the method further comprises:
the RPA system receives a trigger instruction from the user on the number of generalized corpora of a specified seed corpus, wherein the specified seed corpus is one of the seed corpora;
and the RPA system displays the generalized corpora of the specified seed corpus.
In a third aspect, an embodiment of the present application provides a corpus generalization method combining an RPA and an AI, which is applied to a third electronic device, where the third electronic device includes an RPA system, and the method includes:
the RPA system receives a seed corpus input by a user;
the RPA system generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;
the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus;
and the RPA system displays the generalized corpora of the seed corpus.
In a fourth aspect, an embodiment of the present application provides a corpus generalization apparatus combining RPA and AI, applied to a first electronic device, including:
a first receiving module, configured to receive a first request, wherein the first request includes a seed corpus;
a first processing module, configured to generalize the seed corpus, based on natural language processing (NLP) and according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus;
a first determining module, configured to identify the similarity between the at least one candidate corpus and the seed corpus, and determine the candidate corpora whose similarity is greater than a preset threshold as the generalized corpora of the seed corpus;
and a first output module, configured to output the generalized corpora of the seed corpus.
In a fifth aspect, an embodiment of the present application provides a corpus generalization apparatus combining RPA and AI, applied to a second electronic device, including:
a second receiving module, configured to receive a seed corpus input by a user;
a first sending module, configured to send a first request including the seed corpus to a first electronic device, where the first request instructs the first electronic device to generalize the seed corpus, based on natural language processing (NLP) and according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and take the candidate corpora whose similarity is greater than a preset threshold as the generalized corpora of the seed corpus;
and a first display module, configured to receive and display the generalized corpora of the seed corpus sent by the first electronic device.
In a sixth aspect, an embodiment of the present application provides a corpus generalization apparatus combining RPA and AI, applied to a third electronic device, including:
a third receiving module, configured to receive a seed corpus input by a user;
a second processing module, configured to generalize the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus;
a second determining module, configured to identify the similarity between the at least one candidate corpus and the seed corpus, and determine the candidate corpora whose similarity is greater than a preset threshold as the generalized corpora of the seed corpus;
and a second display module, configured to display the generalized corpora of the seed corpus.
In a seventh aspect, an embodiment of the present application provides a first electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the corpus generalization method according to the first aspect and various possible embodiments of the first aspect.
In an eighth aspect, an embodiment of the present application provides a second electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to the second aspect and various possible embodiments of the second aspect.
In a ninth aspect, an embodiment of the present application provides a third electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to the third aspect and various possible embodiments of the third aspect.
In a tenth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to the first aspect and various possible implementation manners of the first aspect is implemented.
In an eleventh aspect, embodiments of the present application provide a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to the second aspect and various possible implementation manners of the second aspect is implemented.
In a twelfth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the corpus generalization method according to the third aspect and various possible embodiments of the third aspect is implemented.
According to the corpus generalization method and device, electronic equipment and storage medium combining RPA and AI, an RPA system receives a first request, wherein the first request includes a seed corpus, generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus, identifies similarity between each candidate corpus and the seed corpus, determines the candidate corpus with the similarity larger than a preset threshold as the generalized corpus of the seed corpus, and outputs the generalized corpus of the seed corpus. According to the method, the seed corpus can be automatically generalized through a preset generalization mode, and the generalized candidate corpus is screened according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a scenario of a corpus generalization method combining RPA and AI according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application;
FIG. 3 is a schematic flowchart of a corpus generalization method combining RPA and AI according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application;
FIG. 5 is a schematic flowchart of a corpus generalization method combining RPA and AI according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a preset generalization mode selection interface according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a configuration interface of the network crawling mode according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a display interface of the generalized corpora according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a display interface of the generalized corpora according to another embodiment of the present application;
FIG. 10 is a signaling interaction diagram of a corpus generalization method combining RPA and AI according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application;
FIG. 12 is a schematic structural diagram of a corpus generalization apparatus combining RPA and AI according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a corpus generalization apparatus combining RPA and AI according to yet another embodiment of the present application;
FIG. 14 is a schematic structural diagram of a corpus generalization apparatus combining RPA and AI according to another embodiment of the present application;
FIG. 15 is a schematic hardware structure diagram of a first electronic device according to an embodiment of the present application;
FIG. 16 is a schematic hardware structure diagram of a second electronic device according to yet another embodiment of the present application;
FIG. 17 is a schematic hardware structure diagram of a third electronic device according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a schematic diagram of a scenario of a corpus generalization method combining RPA and AI according to an embodiment of the present application. The scenario may include a first electronic device 11 and a second electronic device 12. The first electronic device 11 may include, but is not limited to, a server, a computer device, and the like. The second electronic device 12 may include, but is not limited to, a mobile phone, a desktop computer, a vehicle-mounted terminal, or a tablet computer. The first electronic device 11 may provide background computing or application service support for the second electronic device 12 over the network. For example, the first electronic device 11 may host a corpus generalization platform, and the corpus generalization platform may be a Robotic Process Automation (RPA) system. The second electronic device 12 may reach the interface of the corpus generalization platform through an application program, a plug-in in a social application, a website login, and the like. The user accesses the corpus generalization platform and performs corpus generalization by operating the second electronic device 12.
For example, the user may log in to the corpus generalization platform on the second electronic device 12 through a web page, input the seed corpus to be generalized, and trigger a generalization instruction. After receiving the generalization instruction triggered by the user, the second electronic device 12 sends a first request to the first electronic device 11. After the first electronic device 11 generalizes the seed corpus, it returns the generalized corpora of the seed corpus to the second electronic device 12 through the corpus generalization platform. The second electronic device 12 may output the generalized corpora by displaying them, offering them for download, or the like, according to the user's instruction.
Fig. 2 is a schematic view of a scenario of a corpus generalization method combining RPA and AI according to another embodiment of the present application. A third electronic device 13 may be included in the scenario. The third electronic device 13 may include, but is not limited to, a mobile phone, a desktop computer, a vehicle-mounted terminal, a tablet computer, a robot, and the like. The third electronic device 13 does not need the support of other background devices, and can implement corpus generalization by itself.
For example, the third electronic device may run an application program for implementing the corpus generalization, and the application program may implement the corpus generalization without interacting with devices such as a background server. The user can run the application program on the third electronic device 13, input the seed corpus to be generalized in the interface of the application program, and trigger the generalization instruction. After receiving the generalization instruction triggered by the user, the third electronic device 13 generalizes the seed corpus, and then outputs the generalized corpus of the seed corpus to the user in the manners of displaying, downloading, and the like. The application program for implementing corpus generalization may be an RPA system.
It should be noted that the method provided in the embodiment of the present application is not limited to the application scenarios shown in fig. 1 and fig. 2, and may also be used in other possible application scenarios, which is not limited.
Fig. 3 is a schematic flow chart of a corpus generalization method combining RPA and AI according to an embodiment of the present application. The method is executed by the first electronic device in Fig. 1, which includes an RPA system. As shown in Fig. 3, the method includes:
S301, the RPA system receives a first request, wherein the first request comprises a seed corpus.
In this embodiment, the seed corpus is a corpus to be generalized. For example, the seed corpus may be "dietary contraindication for the pregnancy preparation period," and the RPA system in the first electronic device generalizes the seed corpus after receiving the first request.
Optionally, the first request sent by the second electronic device may be received.
In this embodiment, the second electronic device may be the second electronic device in fig. 1. When the user needs to generalize the corpus, the seed corpus can be input into the second electronic device, and then the second electronic device can send a first request to the first electronic device to request the first electronic device to generalize the seed corpus.
S302, the RPA system generalizes the seed corpus based on natural language processing (NLP) according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.
In this embodiment, the RPA system may generalize the seed corpus based on Natural Language Processing (NLP) by using a preset generalization mode to obtain the candidate corpus of the seed corpus. Subsequently, the candidate corpus can be further screened to obtain the generalization corpus of the seed corpus.
Optionally, the preset generalization manner includes at least one of the following:
a network crawling mode, a synonym replacement mode, a knowledge base retrieval mode, and a sentence pattern extraction mode.
In this embodiment, the RPA system may generalize the seed corpus in one or more predetermined generalization manners. The specific preset generalization mode can be default or specified by the user.
Optionally, each preset generalization mode may correspond to an identifier, and the RPA system may receive a specified identifier.
Further, the RPA system generalizes the seed corpus according to a preset generalization mode, which may include generalizing the seed corpus according to the preset generalization mode corresponding to the designated identifier.
In this embodiment, the identifier may be a name, a code, and the like of a preset generalization mode, which is not limited herein. The specified identifier is an identifier specified by the user. When a plurality of preset generalization modes are preset in the RPA system and a specified identifier is received, the preset generalization mode corresponding to the specified identifier is adopted to generalize the seed corpus. There may be one or more specified identifiers. The specified identifier may be transmitted by the second electronic device. For example, the user selects the identifier of a desired generalization mode from all preset generalization modes and inputs it to the second electronic device, and the second electronic device sends the specified identifier to the first electronic device.
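The selection of generalization modes by specified identifier can be illustrated with a minimal sketch. The identifier strings, the placeholder generalizer functions, and the fallback to all modes when no identifier is given are assumptions made for illustration only; the application does not fix any of them.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical placeholder generalizers; a real system would implement each mode.
def web_crawl(seed: str) -> List[str]:
    return [f"{seed} [web_crawl]"]

def synonym(seed: str) -> List[str]:
    return [f"{seed} [synonym]"]

def knowledge_base(seed: str) -> List[str]:
    return [f"{seed} [knowledge_base]"]

def sentence_pattern(seed: str) -> List[str]:
    return [f"{seed} [sentence_pattern]"]

# Hypothetical identifiers mapped to the four preset generalization modes.
GENERALIZERS: Dict[str, Callable[[str], List[str]]] = {
    "web_crawl": web_crawl,
    "synonym": synonym,
    "knowledge_base": knowledge_base,
    "sentence_pattern": sentence_pattern,
}

def generalize(seed: str, specified_ids: Optional[Iterable[str]] = None) -> List[str]:
    """Run the modes named by specified_ids, or all modes by default (an assumption)."""
    ids = list(specified_ids) if specified_ids else list(GENERALIZERS)
    candidates: List[str] = []
    for mode_id in ids:
        if mode_id not in GENERALIZERS:
            raise ValueError(f"unknown generalization mode: {mode_id}")
        for candidate in GENERALIZERS[mode_id](seed):
            if candidate not in candidates:  # keep first occurrence, drop duplicates
                candidates.append(candidate)
    return candidates
```

Under these assumptions, `generalize(seed, ["synonym", "web_crawl"])` would run only the two modes named by the user.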
S303, the RPA system identifies the similarity between at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus.
In this embodiment, the RPA system may screen out, according to the similarity, the corpora similar to the seed corpus from the candidate corpora as the generalized corpora of the seed corpus. The generalized corpus is the generalization result of the seed corpus. The preset threshold may be set according to actual requirements, and is not limited herein. For example, the preset threshold may be set to 0.9, 0.8, or the like.
Optionally, the RPA system may first identify the similarity between each candidate corpus and the seed corpus by using a deep learning model, then compare the similarity corresponding to each candidate corpus with the preset threshold, and determine the candidate corpus with the similarity greater than the preset threshold as the generalized corpus of the seed corpus. Therefore, the RPA system can screen the candidate corpora according to the similarity, which helps ensure the accuracy of the generalized corpora.
Optionally, the deep learning model may be obtained by training an optimal model with an XGB algorithm or a logistic regression model, based on corpus samples in the knowledge base and using various distance features such as Jaccard, coverage, w2v (word vectors), and WMD (word mover's distance). The trained optimal model is used for calculating the similarity between sentences.
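Two of the distance features named above can be sketched over token sets. The Jaccard definition is the standard one; the definition of coverage used here (the fraction of seed tokens that also appear in the candidate) is an assumption, since the application does not define it. The w2v and WMD features require trained word vectors and are not shown.

```python
from typing import Iterable

def jaccard(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def coverage(seed_tokens: Iterable[str], candidate_tokens: Iterable[str]) -> float:
    """Assumed definition: share of seed tokens that occur in the candidate."""
    seed_set = set(seed_tokens)
    if not seed_set:
        return 0.0
    return len(seed_set & set(candidate_tokens)) / len(seed_set)
```

Such feature values could then be fed, together with the other features, into the trained model that produces the final similarity score.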
As another possible implementation, the RPA system may further identify similarity between each candidate corpus and the seed corpus through a logistic regression model, where the logistic regression model is trained in advance through a training set composed of a plurality of corpora in the knowledge base. In this embodiment, the knowledge base may store the generalized seed corpus and the corresponding generalized corpus. The corpora can be selected from the knowledge base to form a training set, the created logistic regression model is trained through the training set, and the trained logistic regression model is adopted to identify the similarity between each candidate corpus and the seed corpus.
Optionally, in order to ensure the accuracy of the generalized corpus, the RPA system may further calculate the similarity between the generalized candidate corpus and the seed corpus by using a ranking algorithm, rank all the candidate corpora according to the similarity, delete the candidate corpus with the similarity lower than a preset threshold, and further obtain the generalized corpus of the seed corpus.
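The ranking and threshold screening described above can be sketched as follows. The `score` callable stands in for whichever similarity model is used (it is a hypothetical parameter, not part of the application), and the default threshold of 0.8 is taken from the example values mentioned earlier.

```python
from typing import Callable, List, Tuple

def select_generalized(
    seed: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Rank candidates by similarity to the seed and keep those above the threshold."""
    scored = [(candidate, score(seed, candidate)) for candidate in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    # Strictly greater than the threshold, as in step S303.
    return [(candidate, sim) for candidate, sim in scored if sim > threshold]
```

Returning the similarity alongside each retained corpus also makes it straightforward to send both to the second electronic device.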
S304, the RPA system outputs the generalized corpus of the seed corpus.
In this embodiment, the first electronic device may output the generalized corpora of the seed corpora to the user through the RPA system, so that the user may view or download the generalized corpora of the seed corpora, and then perform model training according to the generalized corpora. For example, the seed corpus is "dietary contraindication for pregnancy preparation", and the generalization corpus of the seed corpus may be "dietary attention for pregnancy preparation", "food attention for pregnancy preparation", and the like.
To sum up, in the corpus generalization method combining the RPA and the AI according to the embodiment of the present application, the RPA system receives a first request, where the first request includes a seed corpus, then generalizes the seed corpus based on the natural language processing NLP according to a preset generalization method to obtain at least one candidate corpus of the seed corpus, then identifies a similarity between the at least one candidate corpus and the seed corpus, determines the candidate corpus with the similarity greater than a preset threshold as the generalized corpus of the seed corpus, and finally outputs the generalized corpus of the seed corpus. According to the method, the RPA system can automatically generalize the seed corpus in a preset generalization mode, and can screen the generalized candidate corpus according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved under the condition that the generalization effect is ensured.
Optionally, the RPA system in the first electronic device may send the generalized corpus of the seed corpus to the second electronic device.
Therefore, the first electronic device can send the generalized corpus of the seed corpus to the second electronic device through the RPA system, so that the second electronic device can display the generalized corpus of the seed corpus, and the user can conveniently perform subsequent operations such as checking, selecting, and downloading.
Optionally, the RPA system sends the generalized corpora of the seed corpus to the second electronic device, which may include sending the generalized corpora of the seed corpus and the similarity and/or the generalization manner corresponding to each of the generalized corpora to the second electronic device.
The similarity corresponding to a generalized corpus refers to the similarity between that generalized corpus and the seed corpus. The generalization mode corresponding to a generalized corpus refers to the preset generalization mode adopted by the first electronic device in determining that generalized corpus. For example, the generalized corpus "dietary attention for pregnancy preparation" is obtained by the synonym replacement mode, and the generalized corpus "food attention for pregnancy preparation" is obtained by the network crawling mode.
Therefore, when the RPA system in the first electronic device sends the generalization corpus of the seed corpus to the second electronic device, the similarity and/or the generalization mode corresponding to each generalization corpus is sent to the second electronic device at the same time, so that the second electronic device displays the similarity and/or the generalization mode corresponding to each generalization corpus to the user.
In one embodiment, after S304, the method further includes: the RPA system updates the seed corpus and the generalized corpus of the seed corpus into a knowledge base.
Therefore, when the RPA system carries out corpus generalization by adopting a knowledge base retrieval mode, the candidate corpus of the seed corpus is retrieved from the corpus stored in the knowledge base. After the generalization corpus of the seed corpus is obtained, the seed corpus and the corresponding generalization corpus can be updated to the knowledge base, so that the corpus data in the knowledge base is enriched.
In one embodiment, after S304, the method further includes: the RPA system receives a second request, wherein the second request is used for indicating at least one generalization corpus of the seed corpus, then generalizes the generalization corpus indicated by the second request as a new seed corpus to obtain a candidate corpus of the new seed corpus, then adds the candidate corpus of the new seed corpus into the candidate corpus of the seed corpus, re-identifies the similarity between the at least one candidate corpus of the seed corpus and the seed corpus, determines the candidate corpus with the similarity larger than the preset threshold as the generalization corpus of the seed corpus, and then re-outputs the generalization corpus of the seed corpus.
In this embodiment, the second request may be sent by the second electronic device. After the generalization corpuses of the seed corpuses are obtained, the user can indicate one or more generalization corpuses as new seed corpuses to generalize among all the generalization corpuses of the seed corpuses, and the generalization result is updated to the generalization corpuses of the original seed corpuses. For example, the seed corpus is "dietary contraindication for pregnancy preparation", the generalization corpus of the seed corpus may be "dietary attention for pregnancy preparation", "food attention for pregnancy preparation", etc., the user may designate "dietary attention for pregnancy preparation" as a new seed corpus to generalize, add the candidate corpus obtained by generalizing "dietary attention for pregnancy preparation" to the candidate corpus of the original seed corpus "dietary contraindication for pregnancy preparation", screen out the generalization corpus of the "dietary contraindication for pregnancy preparation" of the seed corpus again therefrom, and output the generalization corpus of the "dietary contraindication for pregnancy preparation" again for updating.
In this embodiment of the present application, after a certain seed corpus is generalized, if the user finds that the number of generalized corpora is small, a generalized corpus with a relatively accurate generalization effect may be selected as a new seed corpus. The new seed corpus is generalized, the generalization result is added to the candidate corpora of the original seed corpus, and the similarity between the candidate corpora and the original seed corpus is recalculated. Therefore, the user can directly generalize a generalized corpus of the seed corpus, which reduces user operations, improves the user experience, and improves the generalization efficiency and accuracy.
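The handling of the second request can be sketched as merging the candidates of the new seed corpora into the original candidate pool and screening the pool again against the original seed corpus. The `generalize` and `score` callables are hypothetical stand-ins for the preset generalization modes and the similarity model.

```python
from typing import Callable, Iterable, List

def regeneralize(
    seed: str,
    candidates: List[str],
    new_seeds: Iterable[str],
    generalize: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Extend the candidate pool via new seeds, then re-screen against the original seed."""
    pool = list(candidates)
    for new_seed in new_seeds:
        for candidate in generalize(new_seed):
            if candidate not in pool:  # avoid duplicate candidates
                pool.append(candidate)
    # Similarity is always measured against the ORIGINAL seed corpus.
    return [candidate for candidate in pool if score(seed, candidate) > threshold]
```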
In an embodiment, when the preset generalization mode includes the network crawling mode, the RPA system generalizes the seed corpus according to the network crawling mode to obtain at least one candidate corpus of the seed corpus, including: the RPA system searches the seed corpus in a web page search website to obtain a web page list for displaying search results, where the web page list includes a plurality of web page items and each web page item has a title sentence; the RPA system then crawls the title sentence of each web page item, and uses the title sentences that meet the matching condition as the candidate corpora of the seed corpus.
In this embodiment, a specific implementation process of obtaining the candidate corpus of the seed corpus by the RPA system in a network crawling manner is described. The web search website refers to a website for retrieving a corresponding web page according to a keyword. The web page search website may be set by default or may be designated by the user, and is not limited herein.
The RPA system may search the seed corpus in the web page search website to obtain a plurality of web page items related to the seed corpus and a title sentence of each web page item. For example, the seed corpus is "dietary contraindication for pregnancy preparation", and the web page items and title sentences found in the web page search website may include: a first web page item www.aaa.com.cn, with the title sentence "What is good to eat before pregnancy? What not to eat when preparing for pregnancy - AAA web"; a second web page item www.bbb.com.cn, with the title sentence "[Pregnancy preparation diet Yi-Bao] Pregnancy preparation diet precautions - BBB net"; and a third web page item www.ccc.com.cn, with the title sentence "Pay attention to 5 points when preparing for pregnancy, good habits bring you good luck - CCC website", and the like.
Further, the RPA system may crawl the title sentence of each web page item, then identify whether each title sentence meets the matching condition, and use the title sentences meeting the matching condition as the candidate corpora of the seed corpus. The matching condition is used to exclude title sentences that do not contain a sentence similar to the seed corpus. For example, the matching condition may be that the title sentence contains each word in the seed corpus or a synonym thereof.
Further, the RPA system may delete the words irrelevant to the seed corpus from the title sentences meeting the matching condition, to obtain the candidate corpora of the seed corpus.
In the above example, the title sentences of the first and second web page items meet the matching condition, and the title sentence of the third web page item does not. Therefore, "what not to eat when preparing for pregnancy" in the title sentence of the first web page item and "pregnancy preparation diet precautions" in the title sentence of the second web page item can be used as candidate corpora of the seed corpus "dietary contraindication for pregnancy preparation".
Therefore, the RPA system can acquire the candidate corpora of the seed corpus, according to the network crawling mode, from a generalization library consisting of a plurality of websites. The resulting sentence patterns are more varied and closer to the way real users phrase their queries, so that the generalization effect better meets user requirements.
Optionally, when the preset generalization mode includes the network crawling mode, the method may further include: the RPA system receives filter words and, according to the filter words, uses the title sentences meeting the matching condition among the title sentences of the web page items as the candidate corpora of the seed corpus.
In this embodiment, the RPA system may receive the filter word sent by the second electronic device, determine a corresponding matching condition according to the filter word, screen the title sentence of each web page item, and use the title sentence, which meets the matching condition, in the title sentence of each web page item as the candidate corpus of the seed corpus, so that network crawling may be performed according to the filter word set by the user.
Therefore, by setting the filter words, the user can conveniently adjust the matching condition of the network crawling mode as required, so that candidate sentences meeting the requirements are screened out, and the network crawling mode becomes more customizable.
As another possible implementation, the RPA system may further receive a filter word and a matching pattern, and use a header sentence meeting the matching condition in the header sentences of each web page item as a candidate corpus of the seed corpus according to the filter word and the matching pattern.
The matching mode is precise matching or fuzzy matching, and when the matching mode is precise matching, the matching condition is that the filter words are contained in the title sentences; and when the matching mode is fuzzy matching, the matching condition is that the title sentence contains the filter word or the synonym of the filter word.
In this embodiment, the RPA system may perform network crawling according to the filtering words and the matching mode set by the user. The RPA system may receive the filter word and the matching pattern sent by the second electronic device, determine a corresponding matching condition according to the filter word and the matching pattern, and filter the title sentence of each web page item.
Wherein, the matching mode comprises two optional modes: exact matching or fuzzy matching.
The matching condition corresponding to exact matching is that the title sentence must contain the filter word. For example, in the above example, assume that the filter word set by the user is "diet". Since the title sentence of the second web page item contains "diet", it meets the matching condition; the title sentences of the first and third web page items do not contain "diet", so they do not meet the matching condition.
The matching condition corresponding to fuzzy matching is that the title sentence contains the filter word or a synonym of the filter word. For example, in the above example, assume that the filter word set by the user is "diet". Since the title sentence of the first web page item contains "eat" (a synonym of "diet") and the title sentence of the second web page item contains "diet", both meet the matching condition; the title sentence of the third web page item contains neither "diet" nor a synonym of it, so it does not meet the matching condition.
Therefore, by setting the filter words and the matching mode, the user can conveniently adjust the matching condition of the network crawling mode as required, so that candidate sentences meeting the requirements are screened out, and the network crawling mode becomes more customizable.
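The exact and fuzzy matching conditions can be sketched as a check on the words of a title sentence. The synonym table, the mode names, and the simplified English titles below are hypothetical; word-level comparison is also an assumption, chosen so that a filter word does not match inside a longer unrelated word.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical synonym table for filter words.
FILTER_SYNONYMS: Dict[str, Set[str]] = {"diet": {"eat", "food"}}

def title_matches(
    title: str,
    filter_word: str,
    mode: str = "exact",
    synonyms: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """Return True if the title sentence meets the matching condition."""
    table = FILTER_SYNONYMS if synonyms is None else synonyms
    accepted = {filter_word.lower()}
    if mode == "fuzzy":
        # Fuzzy matching also accepts synonyms of the filter word.
        accepted |= {word.lower() for word in table.get(filter_word, set())}
    elif mode != "exact":
        raise ValueError(f"unknown matching mode: {mode}")
    title_words = set(re.findall(r"\w+", title.lower()))
    return bool(accepted & title_words)

# Simplified stand-ins for the three title sentences in the example.
TITLES = [
    "What to eat before pregnancy - AAA web",
    "Pregnancy preparation diet precautions - BBB net",
    "Pay attention to 5 points when preparing for pregnancy - CCC website",
]
exact_hits = [t for t in TITLES if title_matches(t, "diet", "exact")]
fuzzy_hits = [t for t in TITLES if title_matches(t, "diet", "fuzzy")]
```

With these titles, exact matching keeps only the second title, while fuzzy matching keeps the first and second, mirroring the example in the text.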
Optionally, when the preset generalization mode includes the network crawling mode, the web page list includes at least one display page, and each display page includes at least one web page item.
The method further includes: the RPA system receives a specified website address and/or a specified number of pages to crawl.
Searching the seed corpus in the web page search website includes: the RPA system searches the seed corpus in the web page search website indicated by the specified website address.
Crawling the title sentence of each web page item includes: the RPA system crawls the title sentences of all the web page items in the display pages corresponding to the specified number of pages to crawl.
In this embodiment, the RPA system may search the seed corpus in the web page search website indicated by the specified website address that it receives. For example, the user may input the address of the web page search website that the user normally uses as the specified website address to the second electronic device, and the second electronic device transmits the specified website address to the first electronic device. When the RPA system in the first electronic device performs generalization in the network crawling mode, it searches on the website corresponding to the specified website address.
Optionally, the web page list includes at least one display page, and each display page includes at least one web page item. For example, when the RPA system performs generalization in the network crawling mode, the web page search website finds 100 web page items related to the seed corpus and displays them in 10 display pages, with 10 web page items on each display page. The RPA system may crawl the title sentences of the web page items in the display pages, starting from the first page and covering the specified number of pages. For example, if the specified number of pages to crawl is 5, the RPA system crawls the title sentences of the web page items in the display pages from page 1 to page 5. The user can input the specified number of pages into the second electronic device, and the second electronic device sends it to the RPA system in the first electronic device.
Therefore, by crawling according to the specified website address, the search can be performed on the website specified by the user, which improves the user experience. By crawling only the display pages corresponding to the specified number of pages, only the web page items with high relevance to the seed corpus are crawled, the crawling of irrelevant web page items is avoided, and the processing efficiency for the seed corpus is improved.
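The page-limited crawl can be sketched with the page fetching abstracted away. The `fetch_page` callable is a hypothetical stand-in for whatever the RPA system uses to open one display page of the search website and read its web page items; no real search engine API is implied.

```python
from typing import Callable, Dict, List

def crawl_titles(
    seed: str,
    fetch_page: Callable[[str, int], List[Dict[str, str]]],
    page_count: int,
) -> List[str]:
    """Collect title sentences from display pages 1..page_count."""
    titles: List[str] = []
    for page_number in range(1, page_count + 1):
        items = fetch_page(seed, page_number)
        if not items:  # the search returned fewer pages than requested
            break
        titles.extend(item["title"] for item in items)
    return titles

def fake_fetch_page(seed: str, page_number: int) -> List[Dict[str, str]]:
    """Simulated search result: 10 display pages with 10 web page items each."""
    if page_number > 10:
        return []
    return [{"title": f"{seed} result {page_number}-{i}"} for i in range(1, 11)]
```

With the simulated results, `crawl_titles(seed, fake_fetch_page, 5)` collects the titles of the 50 web page items on pages 1 to 5.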
In an embodiment, when the preset generalization manner includes the synonym replacement manner, generalizing the seed corpus according to the synonym replacement manner to obtain at least one candidate corpus of the seed corpus, including: the RPA system acquires the affiliated field of the seed corpus, selects a synonym table corresponding to the affiliated field from a plurality of preset synonym tables, wherein each synonym table corresponds to one field, then searches key words in the seed corpus, and then performs synonym replacement on the key words in the seed corpus according to the synonym table corresponding to the affiliated field to obtain at least one candidate corpus of the seed corpus.
In this embodiment, a specific implementation process of obtaining the candidate corpus of the seed corpus by the first electronic device in the synonym replacement manner is described. The field of the seed corpus can be automatically identified or can be specified by a user. The fields may include a network field, a news field, a medical field, a travel field, a life field, etc., and are not limited thereto. For example, the user may determine the domain of the seed corpus, input the domain into the second electronic device, and the second electronic device sends the domain of the seed corpus to the first electronic device. The RPA system in the first electronic device can identify the keywords in the seed corpus, and perform synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to the field to which the keywords belong, so as to obtain the candidate corpus of the seed corpus.
For example, if the seed corpus is "dietary contraindication for pregnancy preparation", the field is the "life field", and the keyword "contraindication" in the seed corpus is present in the synonym table corresponding to that field, the candidate corpora obtained through synonym replacement may include "dietary attention for pregnancy preparation".
Optionally, the RPA system may perform word segmentation on the seed corpus, find keywords with higher TF-IDF (term frequency-inverse document frequency) values, and replace the keywords based on the synonym table, thereby generating candidate corpora. Because synonyms differ between fields, separate synonym tables are maintained for fields such as network, medical treatment, tourism, news, and life, and selecting the field allows synonym replacement to be carried out accurately. The seed corpus may be segmented by using the pkuseg word segmentation tool.
Therefore, synonym replacement is carried out on the keywords in the seed corpus through the synonym table corresponding to the field of the seed corpus, the accuracy of the candidate corpus obtained by synonym replacement can be improved, and the accuracy of corpus generalization is further improved.
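Field-specific synonym replacement can be sketched as follows. The table contents are hypothetical, the seed corpus is assumed to be already segmented into words, and keyword selection is simplified to "any word present in the field's synonym table" rather than the TF-IDF ranking mentioned above.

```python
from typing import Dict, List

# Hypothetical synonym tables, one per field.
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "life": {"contraindication": ["attention", "precautions"]},
    "medical": {"contraindication": ["incompatibility"]},
}

def synonym_candidates(seed_tokens: List[str], field: str) -> List[str]:
    """Replace one keyword at a time using the synonym table of the given field."""
    table = SYNONYM_TABLES.get(field, {})
    candidates: List[str] = []
    for index, token in enumerate(seed_tokens):
        for replacement in table.get(token, []):
            replaced = seed_tokens[:index] + [replacement] + seed_tokens[index + 1:]
            candidates.append(" ".join(replaced))
    return candidates

seed = ["dietary", "contraindication", "for", "pregnancy", "preparation"]
life_candidates = synonym_candidates(seed, "life")
```

Selecting a different field changes which replacements are produced, which is the reason for keeping one table per field.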
In an embodiment, when the preset generalization mode includes the knowledge base retrieval mode, the RPA system generalizes the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus, including: the RPA system searches a knowledge base for the generalized corpora corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and their corresponding generalized corpora, and then uses the generalized corpora corresponding to the seed corpus in the knowledge base as the candidate corpora of the seed corpus.
In this embodiment, a specific implementation process of obtaining the candidate corpora of a seed corpus by using the knowledge base retrieval mode is described. It is understood that the generalized corpora obtained after each generalization of a seed corpus may be added to the knowledge base. Optionally, the generalized corpora with higher accuracy may be selected and stored in the knowledge base after being reviewed by a trainer. Thus, when a seed corpus needs to be generalized, the knowledge base can be searched for generalized corpora corresponding to the seed corpus, and if they exist, they are used as the candidate corpora of the seed corpus. Therefore, adding the generalized corpora corresponding to each seed corpus into the knowledge base helps ensure the accuracy of the corpora in the knowledge base and improves the accuracy of subsequent corpus generalization.
In an embodiment, when the preset generalization manner includes the sentence extraction manner, the RPA system generalizes the seed corpus according to the sentence extraction manner to obtain at least one candidate corpus of the seed corpus, including: the RPA system identifies and extracts key words in the seed corpus through a Dependency syntactic analysis (DP) algorithm, and then combines the key words of the seed corpus to generate a candidate corpus of the seed corpus.
In this embodiment, a specific implementation process of obtaining the candidate corpora of the seed corpus by adopting the sentence pattern extraction mode is described. The key words may include, but are not limited to, at least one of the subject, predicate, and object in the seed corpus. The RPA system can identify and extract the key words in the seed corpus through a DP algorithm, and then combine the key words to generate candidate corpora of the seed corpus. For example, if the seed corpus is "how to know that oneself is pregnant", the key words may be "how", "know", "self", "pregnant", and the like, and the combined candidate corpora may include "know that oneself is pregnant", "how to know pregnant", and the like.
Therefore, according to this embodiment, through dependency syntactic analysis, the key words such as the subject, predicate, and object in the seed corpus are extracted to form a complete sentence, or the qualifying modifiers are deleted, while the meaning of the original sentence is still kept, so that the accuracy of the generated candidate corpora is ensured.
Fig. 4 is a schematic flow chart of a corpus generalization method combining RPA and AI according to yet another embodiment of the present application. On the basis of the embodiment of Fig. 3, this embodiment describes in detail a specific implementation process of querying the history record before generalization. As shown in Fig. 4, the method includes:
S401, the RPA system receives a first request, wherein the first request comprises a seed corpus.
In this embodiment, S401 is similar to S301 in the embodiment of fig. 3, and is not described here again.
S402, the RPA system searches for the seed corpus in a history record, wherein the history record comprises historically generalized seed corpora and their corresponding generalized corpora.
S403, if the RPA system finds the seed corpus in the history record, it acquires the generalized corpus of the seed corpus from the history record.
In this embodiment, the history record may store the seed corpus and the corresponding generalization corpus that are previously input by the user. After receiving the first request, the RPA system first searches whether the seed corpus exists in the history record, if so, directly obtains the generalized corpus of the seed corpus from the history record, and if not, obtains the generalized corpus of the seed corpus by generalization according to the manners of S404 and S405.
S404, if the seed corpus is not found in the history record by the RPA system, generalizing the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.
In this embodiment, S404 is similar to S302 in the embodiment of fig. 3, and is not described here again.
S405, the RPA system identifies the similarity between at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus.
In this embodiment, S405 is similar to S303 in the embodiment of fig. 3, and is not described herein again.
S406, the RPA system outputs the generalized corpus of the seed corpus.
In this embodiment, S406 is similar to S304 in the embodiment of fig. 3, and is not described herein again.
In this embodiment, the history record is first queried for the seed corpus, and for a seed corpus that has been generalized before, its generalized corpus is obtained directly, which improves generalization efficiency. For example, if a user needs to generalize 200 seed corpora in batch and querying the history record shows that 50 of them have already been generalized, the generalized corpora of those 50 seed corpora are obtained directly from the history record, and only the remaining 150 seed corpora need to be generalized in the preset generalization mode, which reduces the amount of data to be generalized and improves generalization efficiency.
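The history lookup of S402 to S405 can be sketched as a simple cache in front of the generalization step. The names `generalize_batch` and `fake_generalize` are hypothetical, the history record is modelled as an in-memory dictionary, and writing new results back into the history is an assumption of this sketch rather than something the embodiment states.

```python
def generalize_batch(seeds, history, generalize):
    """Return {seed: generalized corpora}, reusing entries found in `history`.

    `history` maps previously generalized seed corpora to their results;
    `generalize` is a placeholder for steps S404 and S405.
    """
    results = {}
    for seed in seeds:
        if seed in history:
            # S403: the seed was generalized before, reuse the stored result.
            results[seed] = history[seed]
        else:
            # S404-S405: generalize from scratch.
            results[seed] = generalize(seed)
            # Assumption: store the new result so later requests can reuse it.
            history[seed] = results[seed]
    return results


calls = []


def fake_generalize(seed):
    # Stand-in for the real generalization; records which seeds reach it.
    calls.append(seed)
    return [seed + " (variant)"]


history = {"how to reset password": ["reset password how"]}
out = generalize_batch(
    ["how to reset password", "how to open account"], history, fake_generalize
)
print(calls)
```

In the 200-seed example above, only the 150 seeds missing from the history would reach `generalize`.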
Fig. 5 is a schematic flow chart of a corpus generalization method combining RPA and AI according to another embodiment of the present application. The execution subject of the method may be the second electronic device in fig. 1, where the second electronic device includes an RPA system, as shown in fig. 5, and the method includes:
S501, the RPA system receives the seed corpus input by the user.
In this embodiment, the RPA system in the second electronic device may receive the seed corpus input by the user. The user may input a single seed corpus or may input a plurality of seed corpora in batch, which is not limited herein. The seed corpus may be input into the input box by a user, or a file containing the seed corpus is uploaded by the user, and the RPA system extracts the seed corpus from the file.
S502, the RPA system sends a first request containing the seed corpus to a first electronic device, wherein the first request is used to instruct the first electronic device to, based on natural language processing (NLP), generalize the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus, identify the similarity between the at least one candidate corpus and the seed corpus, and take the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus.
In this embodiment, the first electronic device may be the first electronic device in Fig. 1. The RPA system in the second electronic device may send the first request to the first electronic device. The first electronic device may generalize the seed corpus according to the first request to obtain its generalized corpus. The specific generalization process is similar to the embodiment of the corpus generalization method with the first electronic device as the execution subject, and is not described here again.
Optionally, the preset generalization manner includes at least one of the following: network crawling mode, synonym replacing mode, knowledge base searching mode and sentence pattern extracting mode.
Optionally, the method further includes: the RPA system displays the identification of each preset generalization mode, then receives a selection instruction input by a user, wherein the selection instruction is used for indicating a specified identification in the identification of the preset generalization mode, and then sends the specified identification to the first electronic equipment.
In this embodiment, the preset generalization manner may include one or more of the above four manners. The implementation process of each predetermined generalization manner is similar to the above embodiment of the corpus generalization method using the first electronic device as the execution main body, and is not described herein again.
When the preset generalization mode includes multiple types, the second electronic device may display the identifier of each preset generalization mode, so that the user may select the preset generalization mode to be adopted.
Optionally, after receiving a selection instruction input by the user, the RPA system sends the specified identifier selected by the user to the first electronic device, so that the first electronic device generalizes the seed corpus in a preset generalization mode corresponding to the specified identifier. The specific identifier selected by the user may be one or more, and is not limited herein.
Fig. 6 is a schematic diagram of a preset generalization mode selection interface provided in an embodiment of the present application. In Fig. 6, the user may enter the seed corpus in the seed corpus input box, or upload a file containing the seed corpus by clicking the upload-file control. The user can check the preset generalization modes to be used. After the user clicks the generalization control, the RPA system sends a first request to the first electronic device to request that the first electronic device generalize the seed corpus input by the user in the preset generalization modes checked by the user.
Therefore, by displaying the identification of each preset generalization mode and receiving the selection instruction input by the user, the user can conveniently select the preset generalization mode for use, the user operation is facilitated, and the user experience is improved.
Optionally, when the preset generalization manner includes the network crawling manner, the method further includes: the RPA system receives a filtering word and a matching mode input by a user, and sends the filtering word and the matching mode to the first electronic equipment.
In this embodiment, the user may configure the network crawling manner. The RPA system can receive the filter words and the matching modes input by the user, and send the filter words and the matching modes to the first electronic equipment, so that the first electronic equipment can perform network crawling according to the filter words and the matching modes. The implementation process of performing network crawling according to the filter word and the matching pattern is similar to the embodiment of the corpus generalization method using the first electronic device as the execution main body, and is not described herein again.
Optionally, when the preset generalization manner includes the network crawling manner, the method further includes: the RPA system receives a specified website address and/or specified number of pages to be crawled input by a user, and sends the specified website address and/or the specified number of pages to the first electronic equipment.
In this embodiment, the user may configure the network crawling manner. The RPA system can receive a specified website address and/or a specified number of crawled pages input by a user, and send the specified website address and/or the specified number of crawled pages to the first electronic device, so that the first electronic device can perform network crawling according to the specified website address and/or the specified number of crawled pages. The implementation process of performing the web crawling according to the designated website address and/or the designated crawling page number is similar to the embodiment of the corpus generalization method using the first electronic device as the execution subject, and is not described herein again.
Fig. 7 is a schematic diagram of a configuration interface of a network crawling manner provided in an embodiment of the present application. In fig. 7, the user can configure a designated website address to be used in a designated network address input box, configure a designated number of crawl pages in a designated crawl page number input box, configure filter words in a filter word input box, and select a matching mode.
S503, the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device.
In this embodiment, the RPA system may receive and display the generalized corpus of the seed corpus sent by the first electronic device.
To sum up, the corpus generalization method combining the RPA and the AI according to the embodiment of the present application sends a first request including a seed corpus to a first electronic device by receiving the seed corpus input by a user, where the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization mode, to obtain at least one candidate corpus of the seed corpus, identify a similarity between the at least one candidate corpus and the seed corpus, use the candidate corpus with the similarity greater than a preset threshold as a generalization corpus of the seed corpus, and receive and display the generalization corpus of the seed corpus sent by the first electronic device. According to the method, the seed corpus can be automatically generalized through a preset generalization mode, and the generalized candidate corpus is screened according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved under the condition that the generalization effect is ensured.
Optionally, to make it easier for the user to edit the generalized corpora, the page displayed by the RPA system provides operations such as adding, deleting, and modifying, and supports batch operation controls, so that corresponding processing is performed according to the operation control triggered by the user.
In one embodiment, the receiving and displaying, by the RPA system, the generalized corpus of the seed corpus sent by the first electronic device includes: and the RPA system receives at least one generalization corpus of the seed corpus sent by the first electronic equipment and the corresponding similarity and/or generalization mode thereof, and performs associated display on the at least one generalization corpus of the seed corpus and the corresponding similarity and/or generalization mode thereof.
In this embodiment, the RPA system may receive at least one generalization corpus of the seed corpus sent by the first electronic device and the corresponding similarity and/or generalization manner thereof, and perform association display, so as to facilitate a user to view the similarity between each generalization corpus and the seed corpus and obtain the generalization manner of the generalization corpus.
After the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, the method further includes: the RPA system receives a screening instruction input by the user, wherein the screening instruction indicates at least one of the preset generalization modes, and displays the generalized corpora obtained in the generalization mode indicated by the screening instruction.
In this embodiment, the user may also screen the generalized corpora displayed by the RPA system according to a preset generalization mode, and the RPA system only displays the generalized corpora obtained in the generalization mode indicated by the screening instruction according to the screening instruction, so that the user can conveniently check the generalized corpora obtained in different generalization modes.
For example, when generalizing a batch of seed corpora, a user usually selects all the preset generalization modes to obtain as many generalized corpora as possible. Looking back at each individual seed corpus, however, the user will find that the effective generalization modes differ from one seed corpus to another, and that for a given seed corpus the generalized corpora obtained in one particular generalization mode may already meet the user's needs. By providing a screening button for the generalization mode, this embodiment supports personalized processing of the generalized corpora.
Fig. 8 is a schematic diagram of a display interface of the generalized corpora according to an embodiment of the present application. In Fig. 8, the corresponding similarity and generalization mode are shown after each generalized corpus. A screening control for the generalization mode is provided in the display interface. When the user clicks the screening control, the RPA system pops up a generalization mode screening window in which the user can select the required generalization modes, and the RPA system then displays only the generalized corpora obtained in the generalization modes selected by the user.
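The associated display and the screening by generalization mode can be sketched with a small record type. The class `GeneralizedCorpus`, the function `filter_by_mode`, the mode identifiers, and the sample values are all hypothetical and serve only to illustrate the data that the display interface works with.

```python
from dataclasses import dataclass


@dataclass
class GeneralizedCorpus:
    text: str
    similarity: float  # similarity to the seed corpus
    mode: str  # hypothetical identifier, e.g. "web_crawl" or "synonym"


def filter_by_mode(items, selected_modes):
    """Keep only the generalized corpora produced by the selected modes."""
    selected = set(selected_modes)
    return [item for item in items if item.mode in selected]


# Made-up sample rows, as they might be listed in the display interface.
shown = [
    GeneralizedCorpus("how to tell if pregnant", 0.82, "synonym"),
    GeneralizedCorpus("signs that you are pregnant", 0.71, "web_crawl"),
    GeneralizedCorpus("how to know pregnant", 0.90, "sentence_pattern"),
]
for item in filter_by_mode(shown, ["synonym", "sentence_pattern"]):
    print(f"{item.text} | {item.similarity:.2f} | {item.mode}")
```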
In one embodiment, there is at least one seed corpus.
The method further comprises: the RPA system displays each seed corpus and the number of its generalized corpora;
the RPA system displaying the generalized corpus of the seed corpus sent by the first electronic device comprises: receiving a trigger instruction from the user on the number of generalized corpora of a specified seed corpus, wherein the specified seed corpus is one of the seed corpora, and displaying the generalized corpora of the specified seed corpus.
In this embodiment, when there are a plurality of seed corpora, the RPA system may display the number of generalized corpora of each seed corpus on the interface, and, after receiving a trigger instruction from the user on the number of generalized corpora of a certain seed corpus, display each generalized corpus of that seed corpus. The display interface therefore stays simple even when there are many seed corpora. For example, when a user has retrieved multiple seed corpora, a list of generalized corpora may pop up to the right of the seed corpus clicked by the user, so that the user can conveniently delete and modify the generalized corpora, which matches the user's operating habits. When the user clicks the seed corpus again, the pop-up may disappear.
In one embodiment, after S403, the method may further include the following, wherein the seed corpus has at least one generalized corpus:
the RPA system displaying the generalized corpus of the seed corpus sent by the first electronic device further comprises: displaying each generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that the generalized corpus selected by the user is to be generalized as a new seed corpus;
further, the method further comprises: in response to a trigger operation on the indication control, the RPA system sends a second request to the first electronic device, wherein the second request is used to instruct the first electronic device to generalize the generalized corpus selected by the user as a new seed corpus, and the RPA system receives and displays the generalized corpus of the seed corpus resent by the first electronic device.
In this embodiment, the RPA system may display each generalized corpus of the seed corpus together with the indication control, and send a second request to the first electronic device in response to a trigger operation on the indication control. The first electronic device then generalizes the generalized corpus selected by the user as a new seed corpus, adds the generalization result to the candidate corpora of the original seed corpus, recalculates the similarity between all the candidate corpora and the original seed corpus, re-determines the generalized corpora of the original seed corpus, and sends the newly determined generalized corpora to the RPA system, so that the RPA system updates the displayed generalized corpora of the original seed corpus. The specific generalization process of the first electronic device is similar to the embodiment of the corpus generalization method with the first electronic device as the execution subject, and is not described here again.
Fig. 9 is a schematic diagram of a display interface of the generalized corpora according to an embodiment of the present application. In Fig. 9, each generalized corpus has a selection box. The user can check one or more generalized corpora as new seed corpora and then click the indication control "seed corpora" on the interface, which triggers the RPA system to send the second request to the first electronic device. In this embodiment, the indication control makes it convenient for the user to select new seed corpora from the currently displayed generalized corpora, so that the first electronic device generalizes again on the basis of the new seed corpora checked by the user and updates the generalized corpora of the original seed corpus, which makes user operation more convenient, improves generalization efficiency, and improves user experience.
Fig. 10 is a signaling interaction diagram of a corpus generalization method according to an embodiment of the present application. The execution subjects in the signaling interaction diagram comprise the first electronic device and the second electronic device in Fig. 1. As shown in Fig. 10, the method may include:
S1001, the second electronic device receives the seed corpus input by the user.
S1002, the second electronic device sends a first request containing the seed corpus to the first electronic device.
S1003, the first electronic device generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.
S1004, the first electronic device identifies the similarity between each candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold as the generalization corpus of the seed corpus.
S1005, the first electronic device sends the generalized corpora of the seed corpora to the second electronic device.
S1006, the second electronic device displays the generalization corpus of the seed corpus.
The specific implementation process and technical effects of this method are similar to those of the above embodiments of the corpus generalization method with the first electronic device as the execution subject and with the second electronic device as the execution subject, so the method is only briefly described here and the details are not repeated.
Fig. 11 is a flowchart illustrating a corpus generalization method according to yet another embodiment of the present application. The execution subject of the method may be the third electronic device in fig. 2, and the third electronic device includes the RPA system. As shown in fig. 11, the method includes:
S1101, the RPA system receives the seed corpus input by the user.
In this embodiment, the RPA system may receive the seed corpus input by the user, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again.
S1102, the RPA system generalizes the seed corpus according to a preset generalization mode to obtain at least one candidate corpus of the seed corpus.
In this embodiment, the RPA system generalizes the seed corpus according to the preset generalization mode, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method with the first electronic device as the execution subject, and are not described here again.
Optionally, the preset generalization manner includes at least one of the following: synonym replacement mode, network crawling mode, knowledge base retrieval mode and sentence pattern extraction mode.
Optionally, each preset generalization mode corresponds to an identifier.
The method further comprises: the RPA system displays the identifier of each preset generalization mode and receives a selection instruction input by the user, wherein the selection instruction indicates a specified identifier among the identifiers of the preset generalization modes.
The RPA system generalizing the seed corpus according to a preset generalization mode comprises: the RPA system generalizes the seed corpus in the preset generalization mode corresponding to the specified identifier.
In this embodiment, the RPA system displays the identifier of each preset generalization mode and receives the selection instruction input by the user, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again. The RPA system generalizes the seed corpus in a preset generalization mode corresponding to the designated identifier, and the implementation process and technical effect thereof are similar to the above embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
S1103, the RPA system identifies the similarity between at least one candidate corpus and the seed corpus, and determines the candidate corpus with the similarity larger than a preset threshold value as the generalization corpus of the seed corpus.
In this embodiment, the RPA system identifies the similarity between each corpus candidate and the seed corpus, and determines the corpus candidate with the similarity greater than the preset threshold as the generalized corpus of the seed corpus, which is similar to the embodiment of the corpus generalization method using the first electronic device as the execution main body, and thus, the implementation process and the technical effect are not repeated herein.
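The threshold screening of S1103 can be sketched as follows. The embodiment does not prescribe a similarity measure in this passage, so the character-level ratio from Python's `difflib` is used purely as a stand-in. The function names, the sample sentences, and the threshold value 0.6 are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in measure (assumption): character-level ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def select_generalized(seed, candidates, threshold):
    """Keep candidates whose similarity to the seed is strictly greater
    than the threshold, together with their scores."""
    scored = [(c, similarity(c, seed)) for c in candidates]
    return [(c, s) for c, s in scored if s > threshold]


seed = "how to know that oneself is pregnant"
candidates = ["how to know pregnant", "what is the weather today"]
# 0.6 is an arbitrary example threshold, not a value from the embodiment.
for text, score in select_generalized(seed, candidates, threshold=0.6):
    print(f"{score:.2f}  {text}")
```

Returning the score alongside each candidate also supports the associated display of similarity described for the display interface.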
S1104, the RPA system displays the generalized corpus of the seed corpus.
In this embodiment, the RPA system displays the generalized corpus of the seed corpus, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again.
To sum up, the corpus generalization method combining the RPA and the AI according to the embodiment of the present application generalizes the seed corpus by receiving the seed corpus input by the user according to the preset generalization mode to obtain at least one candidate corpus of the seed corpus, identifies the similarity between the at least one candidate corpus and the seed corpus, determines the candidate corpus with the similarity greater than the preset threshold as the generalized corpus of the seed corpus, and displays the generalized corpus of the seed corpus. According to the method, the seed corpus can be automatically generalized through a preset generalization mode, and the generalized candidate corpus is screened according to a preset threshold value, so that the generalized corpus of the seed corpus is screened out, and the corpus generalization efficiency is improved under the condition that the generalization effect is ensured.
In one embodiment, the seed corpus has at least one generalized corpus.
The RPA system displaying the generalized corpus of the seed corpus comprises: the RPA system displays at least one generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that the generalized corpus selected by the user is to be generalized as a new seed corpus.
The method further comprises:
in response to a trigger operation on the indication control, the RPA system generalizes the generalized corpus selected by the user as a new seed corpus to obtain candidate corpora of the new seed corpus, adds the candidate corpora of the new seed corpus to the candidate corpora of the seed corpus, re-identifies the similarity between each candidate corpus of the seed corpus and the seed corpus, determines the candidate corpora whose similarity is greater than the preset threshold as the generalized corpora of the seed corpus, and then redisplays the generalized corpora of the seed corpus.
In this embodiment, the RPA system displays at least one generalization corpus of the seed corpus and the indication control, and re-displays the generalization corpus of the seed corpus, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not repeated herein. The RPA system generalizes the generalization linguistic data selected by the user as a new seed linguistic data to obtain a candidate linguistic data of the new seed linguistic data; the implementation process and technical effect of the method are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not repeated herein.
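The re-generalization loop can be sketched as below: candidates of the user-selected corpora are merged into the candidates of the original seed corpus, and the threshold screening is rerun against the original seed corpus. The function names, the word-overlap similarity, the sample data, and the de-duplication of merged candidates are all assumptions of this sketch.

```python
def word_overlap(a: str, b: str) -> float:
    # Stand-in similarity (assumption): Jaccard overlap of whitespace tokens.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def regeneralize(seed, candidates, chosen, generalize, similarity, threshold):
    """Merge candidates of the chosen corpora into the seed's candidates,
    then rescreen every candidate against the ORIGINAL seed corpus."""
    merged = list(candidates)
    for new_seed in chosen:
        for cand in generalize(new_seed):
            if cand not in merged:  # assumption: skip duplicates
                merged.append(cand)
    kept = [c for c in merged if similarity(c, seed) > threshold]
    return merged, kept


seed = "how to know pregnant"
candidates = ["how know pregnant", "weather today"]
chosen = ["how know pregnant"]  # generalized corpus checked by the user


def fake_generalize(new_seed):
    # Stand-in for the preset generalization modes.
    return ["how to know if pregnant", "pregnant signs"]


merged, kept = regeneralize(
    seed, candidates, chosen, fake_generalize, word_overlap, threshold=0.5
)
print(kept)
```

Scoring against the original seed corpus rather than the new one follows the description above, and keeps results from drifting away from the user's original intent.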
In an embodiment, when the preset generalization mode includes the network crawling mode, the RPA system generalizes the seed corpus according to the network crawling mode to obtain at least one candidate corpus of the seed corpus, including: and the RPA system searches the seed corpus in a webpage searching website to obtain a webpage list for displaying a searching result, wherein the webpage list comprises a plurality of webpage items, each webpage item is provided with a title sentence, the title sentences of the webpage items are crawled, and the title sentences meeting the matching conditions in the title sentences of the webpage items are used as candidate corpora of the seed corpus.
In this embodiment, the RPA system generalizes the seed corpus in a network crawling manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
Optionally, the method further comprises: the RPA system receives the filter words and matching patterns input by the user.
The RPA system taking the title sentences that meet the matching condition among the title sentences of the web page items as candidate corpora of the seed corpus comprises: the RPA system, according to the filter word and the matching mode, takes the title sentences that meet the matching condition among the title sentences of the web page items as candidate corpora of the seed corpus, wherein the matching mode is exact matching or fuzzy matching; when the matching mode is exact matching, the matching condition is that the title sentence contains the filter word; and when the matching mode is fuzzy matching, the matching condition is that the title sentence contains the filter word or a synonym of the filter word.
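The two matching modes can be sketched as a filter over crawled title sentences. The function names, the mode strings `"exact"` and `"fuzzy"`, the synonym dictionary, and the sample titles are hypothetical, and "contains" is interpreted here as plain substring containment.

```python
def title_matches(title, filter_word, mode, synonyms=None):
    """Apply the matching condition for one crawled title sentence."""
    if mode == "exact":
        # Exact matching: the title must contain the filter word itself.
        return filter_word in title
    if mode == "fuzzy":
        # Fuzzy matching: the filter word or any of its synonyms may appear.
        words = [filter_word] + list((synonyms or {}).get(filter_word, []))
        return any(w in title for w in words)
    raise ValueError(f"unknown matching mode: {mode}")


def filter_titles(titles, filter_word, mode, synonyms=None):
    """Keep the title sentences that satisfy the matching condition."""
    return [t for t in titles if title_matches(t, filter_word, mode, synonyms)]


titles = [
    "early signs of pregnancy",
    "early signs of gestation",
    "weather forecast for today",
]
synonyms = {"pregnancy": ["gestation"]}  # made-up synonym entry
print(filter_titles(titles, "pregnancy", "exact"))
print(filter_titles(titles, "pregnancy", "fuzzy", synonyms))
```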
Optionally, the method further comprises: the RPA system receives the specified website address and/or the specified number of crawled pages input by the user.
The RPA system searching for the seed corpus in a web search website comprises: the RPA system searches for the seed corpus in the web search website indicated by the specified website address;
the RPA system crawling the title sentences of the web page items comprises: the RPA system crawls the title sentences of the web page items in the result pages corresponding to the specified number of crawled pages.
In this embodiment, the RPA system receives the filter word and the matching pattern input by the user, and receives the specified website address and/or the specified crawl page number input by the user, and the implementation process and the technical effect are similar to those of the above embodiment of the corpus generalization method using the second electronic device as the execution main body, and are not described herein again. The RPA system generalizes the seed corpus in a network crawling manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
In an embodiment, when the preset generalization mode includes the synonym replacement mode, the RPA system generalizing the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus comprises: the RPA system acquires the domain to which the seed corpus belongs, selects the synonym table corresponding to that domain from a plurality of preset synonym tables, each of which corresponds to one domain, searches for the keywords in the seed corpus, and performs synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to that domain to obtain at least one candidate corpus of the seed corpus.
In this embodiment, the RPA system generalizes the seed corpus in a synonym replacement manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
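The synonym replacement mode can be sketched with one synonym table per domain. The table contents, the domain names, and the function name are made up for illustration. Whitespace tokenization stands in for the keyword search, and replacing one keyword at a time is an assumption of this sketch.

```python
# Hypothetical per-domain synonym tables; real tables would be preset.
SYNONYM_TABLES = {
    "medical": {"pregnant": ["expecting"], "know": ["tell", "find out"]},
    "finance": {"account": ["profile"]},
}


def synonym_candidates(seed, domain, tables):
    """Replace one keyword at a time using the table of the seed's domain."""
    table = tables[domain]
    words = seed.split()
    candidates = []
    for i, word in enumerate(words):
        for syn in table.get(word, []):
            candidates.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    return candidates


print(synonym_candidates("how to know pregnant", "medical", SYNONYM_TABLES))
```

Selecting the table by domain keeps a word such as "account" from being replaced with a synonym that only makes sense in another field.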
In an embodiment, when the preset generalization mode includes the knowledge base retrieval mode, the RPA system generalizes the seed corpus according to the knowledge base retrieval mode to obtain at least one candidate corpus of the seed corpus, including: the RPA system searches a knowledge base for a generalization corpus corresponding to the seed corpus, wherein the knowledge base comprises a plurality of seed corpora and corresponding generalization corpora; and taking the generalization linguistic data corresponding to the seed linguistic data in the knowledge base as the candidate linguistic data of the seed linguistic data.
In this embodiment, the RPA system generalizes the seed corpus in a knowledge base retrieval manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
In an embodiment, when the preset generalization manner includes the sentence pattern extraction manner, the RPA system generalizing the seed corpus according to the sentence pattern extraction manner to obtain at least one candidate corpus of the seed corpus includes: the RPA system identifies and extracts the key words in the seed corpus through a dependency syntactic analysis algorithm, and combines the key words of the seed corpus to generate a candidate corpus of the seed corpus.
In this embodiment, the RPA system generalizes the seed corpus in a sentence extraction manner, and the implementation process and technical effect thereof are similar to those of the embodiment of the corpus generalization method using the first electronic device as the execution main body, and are not described herein again.
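The extraction step can be sketched as follows, assuming the dependency parse is already available. The `Token` type, the hand-labelled parse, and the choice of `ROOT`, `nsubj`, and `dobj` as the relations that mark key words are hypothetical; the application names dependency syntactic analysis but not the parser or the selection rule.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str  # dependency relation to the head, e.g. "ROOT", "nsubj", "dobj"


# Assumption: subject, main predicate, and direct object carry the key meaning.
KEY_RELATIONS = {"ROOT", "nsubj", "dobj"}


def key_words(tokens):
    """Keep the words whose dependency relation marks them as key words."""
    return [t.text for t in tokens if t.dep in KEY_RELATIONS]


def sentence_pattern_candidate(tokens):
    """Combine the key words, in sentence order, into one candidate corpus."""
    return " ".join(key_words(tokens))


# Hand-labelled parse of "the customer really wants a refund today".
PARSED = [
    Token("the", "det"), Token("customer", "nsubj"), Token("really", "advmod"),
    Token("wants", "ROOT"), Token("a", "det"), Token("refund", "dobj"),
    Token("today", "npadvmod"),
]

candidate = sentence_pattern_candidate(PARSED)
```

In practice the `Token` list would come from a dependency parser rather than being written by hand.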
Fig. 12 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to an embodiment of the present application. The corpus generalization device 120 is applied to a first electronic device. As shown in fig. 12, the corpus generalization device 120 includes: a first receiving module 1201, a first processing module 1202, a first determining module 1203, and a first output module 1204.
A first receiving module 1201, configured to receive a first request, where the first request includes a seed corpus;
a first processing module 1202, configured to generalize the seed corpus according to a preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus;
a first determining module 1203, configured to identify similarity between the at least one candidate corpus and the seed corpus, and determine a candidate corpus of which the similarity is greater than a preset threshold as a generalized corpus of the seed corpus;
a first output module 1204, configured to output the generalized corpus of the seed corpus.
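The determining step performed by module 1203 amounts to scoring each candidate against the seed and keeping those above the threshold. The sketch below uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in similarity measure; the function names and the threshold value 0.6 are assumptions, and the application itself describes a logistic regression model for this step.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; not the model described in the application."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_generalized(seed, candidates, threshold=0.6):
    """Keep candidates whose similarity to the seed exceeds the threshold."""
    scored = [(candidate, similarity(seed, candidate)) for candidate in candidates]
    return [(candidate, score) for candidate, score in scored if score > threshold]


result = select_generalized(
    "reset my password",
    ["reset my password now", "zzz"],
    threshold=0.6,
)
```

Returning the score alongside each generalized corpus makes it available for display next to the corpus.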
In a possible embodiment, the preset generalization manner includes at least one of the following: a network crawling manner, a synonym replacement manner, a knowledge base retrieval manner, and a sentence pattern extraction manner.
In a possible implementation manner, when the preset generalization manner includes the network crawling manner, the first processing module 1202 includes: a first processing unit, configured to generalize the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus;
the first processing unit is specifically configured to: search for the seed corpus on a web search site to obtain a web page list displaying the search results, where the web page list includes a plurality of web page items and each web page item has a title sentence; crawl the title sentence of each web page item; and take the title sentences that meet a matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.
In a possible implementation, the first processing unit is further configured to: receive a filter word; and, according to the filter word, take the title sentences that meet the matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.
In one possible embodiment, the web page list comprises at least one presentation page, and each presentation page comprises at least one web page item;
the first processing unit is further configured to: receiving a specified website address and/or a specified number of crawled pages; searching the seed corpus in a webpage searching website indicated by the specified website address; and crawling title sentences of all the webpage items in the display page corresponding to the specified crawling page number.
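The crawling steps can be sketched offline with Python's standard `html.parser`. Everything specific here is an assumption: the result pages are supplied as HTML strings rather than fetched from a site, titles are taken to live in `<h3>` elements, and a filter word is treated as excluding any title that contains it, which this section of the application does not spell out.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of <h3> elements, assumed here to hold result titles."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_title = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_title:
            self._in_title = False
            title = "".join(self._buffer).strip()
            if title:
                self.titles.append(title)

    def handle_data(self, data):
        if self._in_title:
            self._buffer.append(data)


def crawl_titles(result_pages, page_count, filter_words=()):
    """Collect titles from the first page_count pages, dropping filtered ones."""
    titles = []
    for html in result_pages[:page_count]:
        parser = TitleParser()
        parser.feed(html)
        parser.close()
        titles.extend(parser.titles)
    return [
        title for title in titles
        if not any(word.lower() in title.lower() for word in filter_words)
    ]


# Hypothetical result pages for the seed "reset password".
SAMPLE_PAGES = [
    '<div><h3><a href="#">How to reset a password</a></h3>'
    "<h3>Buy a password manager</h3></div>",
    "<h3>Password reset guide</h3>",
]

first_page = crawl_titles(SAMPLE_PAGES, 1, filter_words=["buy"])
both_pages = crawl_titles(SAMPLE_PAGES, 2)
```

A live version would fetch each display page from the specified website address before parsing, subject to that site's terms of use.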
In a possible implementation manner, when the preset generalization manner includes the synonym replacement manner, the first processing module 1202 further includes: the second processing unit is used for generalizing the seed corpus according to the synonym replacement mode to obtain at least one candidate corpus of the seed corpus;
the second processing unit is specifically configured to: acquire the field to which the seed corpus belongs, and select, from a plurality of preset synonym tables, the synonym table corresponding to that field, where each synonym table corresponds to one field; search for keywords in the seed corpus; and perform synonym replacement on the keywords in the seed corpus according to the selected synonym table to obtain at least one candidate corpus of the seed corpus.
In a possible implementation manner, when the preset generalization manner includes the knowledge base retrieval manner, the first processing module 1202 further includes: a third processing unit, configured to generalize the seed corpus according to the knowledge base retrieval manner, to obtain at least one candidate corpus of the seed corpus;
the third processing unit is specifically configured to: search a knowledge base for the generalized corpus corresponding to the seed corpus, where the knowledge base includes a plurality of seed corpora and their corresponding generalized corpora; and take the generalized corpus corresponding to the seed corpus in the knowledge base as a candidate corpus of the seed corpus.
In a possible implementation manner, when the preset generalization manner includes the sentence pattern extraction manner, the first processing module 1202 further includes: a fourth processing unit, configured to generalize the seed corpus according to the sentence pattern extraction manner to obtain at least one candidate corpus of the seed corpus;
the fourth processing unit is specifically configured to: identify and extract the key words in the seed corpus through a dependency syntactic analysis algorithm; and combine the key words of the seed corpus to generate a candidate corpus of the seed corpus.
In one possible embodiment, each preset generalization manner corresponds to an identifier;
the first processing module 1202 is further configured to: receiving a specified identification; and generalizing the seed corpus by adopting a preset generalization mode corresponding to the specified identification.
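Selecting a generalization manner by identifier is essentially a table lookup. In the sketch below the identifiers (`"crawl"`, `"synonym"`, `"kb"`, `"pattern"`) and the placeholder generalizer functions are hypothetical; each placeholder only returns a labelled string so the dispatch itself can be seen.

```python
def by_web_crawling(seed):
    return ["[crawled] " + seed]  # placeholder for the network crawling manner


def by_synonym_replacement(seed):
    return ["[synonym] " + seed]  # placeholder for the synonym replacement manner


def by_knowledge_base(seed):
    return ["[kb] " + seed]  # placeholder for the knowledge base retrieval manner


def by_sentence_pattern(seed):
    return ["[pattern] " + seed]  # placeholder for the sentence pattern manner


# Hypothetical identifiers, one per preset generalization manner.
GENERALIZERS = {
    "crawl": by_web_crawling,
    "synonym": by_synonym_replacement,
    "kb": by_knowledge_base,
    "pattern": by_sentence_pattern,
}


def generalize(seed, identifier):
    """Generalize the seed with the manner that the specified identifier names."""
    try:
        handler = GENERALIZERS[identifier]
    except KeyError:
        raise ValueError("unknown generalization identifier: %r" % identifier) from None
    return handler(seed)


candidates = generalize("reset my password", "kb")
```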
In a possible implementation manner, the first output module 1204 is further configured to: receive a second request, where the second request indicates at least one generalized corpus of the seed corpus; generalize the generalized corpus indicated by the second request as a new seed corpus to obtain candidate corpora of the new seed corpus; add the candidate corpora of the new seed corpus to the candidate corpora of the seed corpus, re-identify the similarity between each candidate corpus of the seed corpus and the seed corpus, and determine each candidate corpus whose similarity is greater than the preset threshold as a generalized corpus of the seed corpus; and re-output the generalized corpus of the seed corpus.
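The handling of the second request can be sketched as below. The point worth noticing is that the merged candidates are scored against the original seed corpus, not against the selected generalized corpus. The word-level Jaccard similarity, the stub generalizer, and the threshold 0.5 are assumptions chosen so the arithmetic is easy to follow.

```python
def jaccard(a, b):
    """Stand-in similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def regeneralize(seed, candidates, selected, generalize, threshold=0.5):
    """Generalize a selected corpus as a new seed, merge, and re-filter."""
    new_candidates = generalize(selected)
    # Merge while preserving order and dropping duplicates.
    merged = list(dict.fromkeys(candidates + new_candidates))
    # Similarity is re-identified against the ORIGINAL seed corpus.
    generalized = [c for c in merged if jaccard(seed, c) > threshold]
    return merged, generalized


def stub_generalize(text):
    # Hypothetical generalizer returning fixed candidates for the demo.
    return {"reset my password now": ["reset my password right now",
                                      "now playing music"]}.get(text, [])


merged, generalized = regeneralize(
    seed="reset my password",
    candidates=["reset my password now", "change password"],
    selected="reset my password now",
    generalize=stub_generalize,
)
```

Scoring against the original seed keeps candidates that drift toward the new seed ("now playing music") out of the result.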
In a possible implementation, the first receiving module 1201 is further configured to: search for the seed corpus in a history record, where the history record includes historically generalized seed corpora and their corresponding generalized corpora; if the seed corpus is found in the history record, acquire the generalized corpus of the seed corpus from the history record; and if the seed corpus is not found in the history record, generalize the seed corpus according to a preset generalization manner to obtain at least one candidate corpus of the seed corpus.
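The history lookup behaves like a cache in front of the generalization pipeline. In this sketch the history record is a plain dictionary and the `generalize` and `select` callables are injected; writing the new result back into the history record is an assumption, since the passage above only describes the lookup.

```python
def generalize_with_history(seed, history, generalize, select):
    """Return generalized corpora, reusing the history record when possible."""
    if seed in history:
        # Found: reuse the generalized corpora recorded earlier.
        return history[seed]
    # Not found: run the normal pipeline.
    candidates = generalize(seed)
    generalized = select(seed, candidates)
    # Assumption: the result is recorded so the next request can reuse it.
    history[seed] = generalized
    return generalized


calls = []


def fake_generalize(seed):
    calls.append(seed)  # record each real generalization run
    return [seed + " now", "unrelated text"]


def fake_select(seed, candidates):
    return [c for c in candidates if c.startswith(seed)]


history = {}
first = generalize_with_history("reset my password", history, fake_generalize, fake_select)
second = generalize_with_history("reset my password", history, fake_generalize, fake_select)
```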
In a possible implementation, the first determining module 1203 is further configured to: updating the seed corpus and the generalization corpus of the seed corpus into a knowledge base, wherein the knowledge base at least comprises the seed corpus and the corresponding generalization corpus; and identifying the similarity between the at least one candidate corpus and the seed corpus through a logistic regression model, wherein the logistic regression model is trained in advance through a training set consisting of a plurality of corpora in the knowledge base.
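A logistic regression similarity model of the kind mentioned above can be sketched in plain Python. The two features (word-level Jaccard overlap and word-count ratio), the four training pairs, the learning rate, and the epoch count are all assumptions for illustration; the application states only that the model is trained in advance on corpora from the knowledge base.

```python
import math


def features(a, b):
    """Feature vector for a pair of non-empty sentences (bias term last)."""
    words_a, words_b = a.lower().split(), b.lower().split()
    set_a, set_b = set(words_a), set(words_b)
    jaccard = len(set_a & set_b) / len(set_a | set_b)
    length_ratio = min(len(words_a), len(words_b)) / max(len(words_a), len(words_b))
    return [jaccard, length_ratio, 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, lr=1.0, epochs=5000):
    """Fit the weights by batch gradient descent on the logistic loss."""
    rows = [features(a, b) for a, b in pairs]
    weights = [0.0] * 3
    for _ in range(epochs):
        grad = [0.0] * 3
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad[i] += error * xi / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def predict(weights, a, b):
    """Similarity of the pair, expressed as a probability in (0, 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, features(a, b))))


# Hypothetical training set: (seed corpus, other corpus) with label 1 = similar.
TRAIN_PAIRS = [
    ("reset my password", "reset my password now"),
    ("how to open an account", "how to open a bank account"),
    ("reset my password", "weather forecast for tomorrow"),
    ("how to open an account", "open the window"),
]
TRAIN_LABELS = [1, 1, 0, 0]

MODEL = train(TRAIN_PAIRS, TRAIN_LABELS)
```

The probability returned by `predict` can then be compared against the preset threshold in the same way as any other similarity score.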
The corpus generalization device combining RPA and AI provided in the embodiment of the present application may be used to implement the method embodiment using the first electronic device as an execution main body, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 13 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to yet another embodiment of the present application. The corpus generalization device 130 is applied to a second electronic device. As shown in fig. 13, the corpus generalization device 130 includes: a second receiving module 1301, a first sending module 1302, and a first display module 1303.
A second receiving module 1301, configured to receive a seed corpus input by a user;
a first sending module 1302, configured to send a first request including the seed corpus to a first electronic device, where the first request is used to instruct the first electronic device to generalize the seed corpus according to a preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus, identify a similarity between the at least one candidate corpus and the seed corpus, and take each candidate corpus whose similarity is greater than a preset threshold as a generalized corpus of the seed corpus;
a first display module 1303, configured to receive and display the generalized corpus of the seed corpus sent by the first electronic device.
In a possible embodiment, the generalization corpus of the seed corpus is at least one;
the first display module 1303 is specifically configured to: display at least one generalized corpus of the seed corpus and an indication control, where the indication control is used to indicate that the generalized corpus selected by a user is to be generalized as a new seed corpus; in response to a trigger operation on the indication control, send a second request to the first electronic device, where the second request is used to instruct the first electronic device to generalize the generalized corpus selected by the user as a new seed corpus; and receive and display the generalized corpus of the seed corpus re-sent by the first electronic device.
In a possible implementation, the first display module 1303 is further configured to: receive the at least one generalized corpus of the seed corpus sent by the first electronic device together with its corresponding similarity and/or generalization manner; and display the at least one generalized corpus of the seed corpus in association with its corresponding similarity and/or generalization manner;
the first display module 1303 is further configured to: receive a screening instruction input by a user, where the screening instruction indicates at least one of the preset generalization manners; and display the generalized corpora obtained in the generalization manner indicated by the screening instruction.
In a possible embodiment, the seed corpus is at least one;
the first display module 1303 is further configured to: display the at least one seed corpus and the number of generalized corpora of each seed corpus; receive a trigger instruction from a user on the number of generalized corpora of a specified seed corpus, where the specified seed corpus is one of the seed corpora; and display the generalized corpora of the specified seed corpus.
The corpus generalization device combining RPA and AI provided in the embodiment of the present application may be used to implement the method embodiment using the second electronic device as an execution main body, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 14 is a schematic structural diagram of a corpus generalization device combining RPA and AI according to another embodiment of the present application. The corpus generalization device 140 is applied to a third electronic device. As shown in fig. 14, the corpus generalization device 140 includes: a third receiving module 1401, a second processing module 1402, a second determining module 1403, and a second displaying module 1404.
A third receiving module 1401, configured to receive a seed corpus input by a user;
a second processing module 1402, configured to generalize the seed corpus according to a preset generalization manner, to obtain at least one candidate corpus of the seed corpus;
a second determining module 1403, configured to identify a similarity between the at least one candidate corpus and the seed corpus, and determine a candidate corpus of which the similarity is greater than a preset threshold as a generalization corpus of the seed corpus;
a second display module 1404, configured to display the generalized corpora of the seed corpora.
In a possible embodiment, the preset generalization manner includes at least one of the following: a network crawling manner, a synonym replacement manner, a knowledge base retrieval manner, and a sentence pattern extraction manner.
In a possible implementation manner, when the preset generalization manner includes the network crawling manner, the second processing module 1402 includes: a fifth processing unit, configured to generalize the seed corpus according to the network crawling manner, to obtain at least one candidate corpus of the seed corpus;
the fifth processing unit is specifically configured to: search for the seed corpus on a web search site to obtain a web page list displaying the search results, where the web page list includes a plurality of web page items and each web page item has a title sentence; crawl the title sentence of each web page item; and take the title sentences that meet a matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.
In a possible implementation, the fifth processing unit is further configured to: receive a filter word; and, according to the filter word, take the title sentences that meet the matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.
In one possible embodiment, the web page list comprises at least one presentation page, and each presentation page comprises at least one web page item;
the fifth processing unit is further configured to: receiving a specified website address and/or a specified number of crawled pages; searching the seed corpus in a webpage searching website indicated by the specified website address; and crawling title sentences of all the webpage items in the display page corresponding to the specified crawling page number.
In a possible implementation manner, when the preset generalization manner includes the synonym replacement manner, the second processing module 1402 further includes: a sixth processing unit, configured to generalize the seed corpus according to the synonym replacement manner, to obtain at least one candidate corpus of the seed corpus;
the sixth processing unit is specifically configured to: acquire the field to which the seed corpus belongs, and select, from a plurality of preset synonym tables, the synonym table corresponding to that field, where each synonym table corresponds to one field; search for keywords in the seed corpus; and perform synonym replacement on the keywords in the seed corpus according to the selected synonym table to obtain at least one candidate corpus of the seed corpus.
In a possible implementation manner, when the preset generalization manner includes the knowledge base retrieval manner, the second processing module 1402 further includes: a seventh processing unit, configured to generalize the seed corpus according to the knowledge base retrieval manner, to obtain at least one candidate corpus of the seed corpus;
the seventh processing unit is specifically configured to: search a knowledge base for the generalized corpus corresponding to the seed corpus, where the knowledge base includes a plurality of seed corpora and their corresponding generalized corpora; and take the generalized corpus corresponding to the seed corpus in the knowledge base as a candidate corpus of the seed corpus.
In a possible implementation manner, when the preset generalization manner includes the sentence pattern extraction manner, the second processing module 1402 further includes: an eighth processing unit, configured to generalize the seed corpus according to the sentence pattern extraction manner to obtain at least one candidate corpus of the seed corpus;
the eighth processing unit is specifically configured to: identify and extract the key words in the seed corpus through a dependency syntactic analysis algorithm; and combine the key words of the seed corpus to generate a candidate corpus of the seed corpus.
In one possible embodiment, each preset generalization manner corresponds to an identifier;
the second processing module 1402 is further configured to: displaying the identification of each preset generalization mode; receiving a selection instruction input by a user, wherein the selection instruction is used for indicating a specified identifier in identifiers of a preset generalization mode; and generalizing the seed corpus by adopting a preset generalization mode corresponding to the specified identification.
In a possible embodiment, the generalization corpus of the seed corpus is at least one;
the second display module 1404 is further configured to: display at least one generalized corpus of the seed corpus and an indication control, where the indication control is used to indicate that the generalized corpus selected by a user is to be generalized as a new seed corpus; in response to a trigger operation on the indication control, generalize the generalized corpus selected by the user as a new seed corpus to obtain candidate corpora of the new seed corpus; add the candidate corpora of the new seed corpus to the candidate corpora of the seed corpus, re-identify the similarity between each candidate corpus of the seed corpus and the seed corpus, and determine each candidate corpus whose similarity is greater than the preset threshold as a generalized corpus of the seed corpus; and redisplay the generalized corpus of the seed corpus.
The corpus generalization device combining RPA and AI provided in the embodiment of the present application may be used to implement the method embodiment using the third electronic device as an execution main body, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 15 is a schematic hardware structure diagram of a first electronic device according to an embodiment of the present application. As shown in fig. 15, the first electronic device 150 provided in the present embodiment includes: at least one processor 1501 and a memory 1502. The first electronic device 150 also includes a communication component 1503. The processor 1501, the memory 1502, and the communication component 1503 are connected by a bus 1504.
In a specific implementation process, the at least one processor 1501 executes the computer-executable instructions stored in the memory 1502, so that the at least one processor 1501 executes the corpus generalization method with the first electronic device as an execution subject as described above.
For a specific implementation process of the processor 1501, reference may be made to the above method embodiments, which implement similar principles and technical effects, and this embodiment is not described herein again.
Fig. 16 is a schematic hardware structure diagram of a second electronic device according to yet another embodiment of the present application. As shown in fig. 16, the second electronic device 160 provided in the present embodiment includes: at least one processor 1601 and a memory 1602. The second electronic device 160 further includes a communication component 1603. The processor 1601, the memory 1602, and the communication component 1603 are connected by a bus 1604.
In a specific implementation process, the at least one processor 1601 executes the computer executable instructions stored in the memory 1602, so that the at least one processor 1601 executes the corpus generalization method with the second electronic device as the execution subject.
For a specific implementation process of the processor 1601, reference may be made to the above method embodiments, which achieve similar implementation principles and technical effects, and details of this embodiment are not described herein again.
Fig. 17 is a schematic hardware structure diagram of a third electronic device according to another embodiment of the present application. As shown in fig. 17, the third electronic device 170 provided in the present embodiment includes: at least one processor 1701 and a memory 1702. The third electronic device 170 further includes a communication component 1703. The processor 1701, the memory 1702, and the communication component 1703 are connected by a bus 1704.
In particular implementations, the at least one processor 1701 executes the computer executable instructions stored in the memory 1702, so that the at least one processor 1701 executes the corpus generalization method with the third electronic device as an execution subject as described above.
For a specific implementation process of the processor 1701, reference may be made to the above method embodiments, which have similar implementation principles and technical effects, and no further description is given here.
In the embodiments shown in fig. 15, fig. 16, and fig. 17, it should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present application may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory may include high-speed RAM and may also include non-volatile memory (NVM), such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method taking the first electronic device as an execution subject is realized.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method taking the second electronic device as an execution subject is realized.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method taking the third electronic device as an execution subject is realized.
The readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (26)

1. A corpus generalization method combining RPA and AI, characterized in that the method is applied to a first electronic device, the first electronic device comprises an RPA system, and the method comprises:
the RPA system receiving a first request, wherein the first request includes a seed corpus;
the RPA system generalizing the seed corpus according to a preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus;
the RPA system identifying a similarity between the at least one candidate corpus and the seed corpus, and determining each candidate corpus whose similarity is greater than a preset threshold as a generalized corpus of the seed corpus;
the RPA system outputting the generalized corpus of the seed corpus.
2. The method according to claim 1, wherein the preset generalization manner comprises at least one of the following manners:
a network crawling manner, a synonym replacement manner, a knowledge base retrieval manner, and a sentence pattern extraction manner.
3. The method according to claim 2, wherein when the preset generalization manner includes the network crawling manner, the RPA system generalizing the seed corpus according to the preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus comprises:
the RPA system generalizing the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizing the seed corpus according to the network crawling manner to obtain at least one candidate corpus of the seed corpus comprises:
the RPA system searching for the seed corpus on a web search site to obtain a web page list displaying search results, wherein the web page list includes a plurality of web page items, and each web page item has a title sentence;
the RPA system crawling the title sentence of each web page item;
the RPA system taking the title sentences that meet a matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.
4. The method according to claim 3, wherein the method further comprises:
the RPA system receiving a filter word;
the RPA system taking the title sentences that meet the matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus comprises:
the RPA system taking, according to the filter word, the title sentences that meet the matching condition, among the title sentences of the web page items, as candidate corpora of the seed corpus.
5. The method according to claim 3, wherein the web page list comprises at least one display page, and each display page comprises at least one web page item;
the method further comprises:
the RPA system receiving a specified website address and/or a specified number of pages to crawl;
the RPA system searching for the seed corpus on a web search site comprises:
the RPA system searching for the seed corpus on the web search site indicated by the specified website address;
the RPA system crawling the title sentence of each web page item comprises:
the RPA system crawling the title sentence of each web page item in the display pages corresponding to the specified number of pages to crawl.
6. The method according to claim 2, wherein when the preset generalization manner includes the synonym replacement manner, the RPA system generalizing the seed corpus according to the preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus further comprises:
the RPA system generalizing the seed corpus according to the synonym replacement manner to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizing the seed corpus according to the synonym replacement manner to obtain at least one candidate corpus of the seed corpus comprises:
the RPA system acquiring the field to which the seed corpus belongs, and selecting, from a plurality of preset synonym tables, the synonym table corresponding to that field, wherein each synonym table corresponds to one field;
the RPA system searching for keywords in the seed corpus;
the RPA system performing synonym replacement on the keywords in the seed corpus according to the synonym table corresponding to that field, to obtain at least one candidate corpus of the seed corpus.
7. The method according to claim 2, wherein when the preset generalization manner includes the knowledge base retrieval manner, the RPA system generalizing the seed corpus according to the preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus further comprises:
the RPA system generalizing the seed corpus according to the knowledge base retrieval manner to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizing the seed corpus according to the knowledge base retrieval manner to obtain at least one candidate corpus of the seed corpus comprises:
the RPA system searching a knowledge base for the generalized corpus corresponding to the seed corpus, wherein the knowledge base includes a plurality of seed corpora and their corresponding generalized corpora;
the RPA system taking the generalized corpus corresponding to the seed corpus in the knowledge base as a candidate corpus of the seed corpus.
8. The method according to claim 2, wherein when the preset generalization manner includes the sentence pattern extraction manner, the RPA system generalizing the seed corpus according to the preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus further comprises:
the RPA system generalizing the seed corpus according to the sentence pattern extraction manner to obtain at least one candidate corpus of the seed corpus;
the RPA system generalizing the seed corpus according to the sentence pattern extraction manner to obtain at least one candidate corpus of the seed corpus comprises:
the RPA system identifying and extracting key words in the seed corpus through a dependency syntactic analysis algorithm;
the RPA system combining the key words of the seed corpus to generate a candidate corpus of the seed corpus.
9. The method according to claim 1, wherein each preset generalization manner corresponds to an identifier;
the method further comprises:
the RPA system receiving a specified identifier;
the RPA system generalizing the seed corpus according to the preset generalization manner, based on natural language processing (NLP), to obtain at least one candidate corpus of the seed corpus further comprises:
the RPA system generalizing the seed corpus by using the preset generalization manner corresponding to the specified identifier.
10. The method according to any one of claims 1-9, wherein after the RPA system outputs the generalized corpus of the seed corpus, the method further comprises:
the RPA system receiving a second request, wherein the second request is used to indicate at least one generalized corpus of the seed corpus;
the RPA system generalizing the generalized corpus indicated by the second request as a new seed corpus to obtain candidate corpora of the new seed corpus;
the RPA system adding the candidate corpora of the new seed corpus to the candidate corpora of the seed corpus, re-identifying the similarity between each candidate corpus of the seed corpus and the seed corpus, and determining each candidate corpus whose similarity is greater than the preset threshold as a generalized corpus of the seed corpus;
the RPA system re-outputting the generalized corpus of the seed corpus.
11. The method according to any one of claims 1-9, wherein after the RPA system receives the first request, the method further comprises:
the RPA system searching for the seed corpus in a history record, wherein the history record includes historically generalized seed corpora and their corresponding generalized corpora;
if the RPA system finds the seed corpus in the history record, acquiring the generalized corpus of the seed corpus from the history record;
if the RPA system does not find the seed corpus in the history record, generalizing the seed corpus according to the preset generalization manner to obtain at least one candidate corpus of the seed corpus.
12. The method according to any one of claims 1-9, wherein the method further comprises:
the RPA system updates the seed corpus and the generalized corpus of the seed corpus into a knowledge base, wherein the knowledge base includes at least the seed corpus and its corresponding generalized corpus;
wherein the RPA system identifying the similarity between the at least one candidate corpus and the seed corpus comprises:
the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus through a logistic regression model, wherein the logistic regression model has been trained in advance on a training set composed of a plurality of corpora in the knowledge base.

13. A corpus generalization method, applied to a second electronic device, the second electronic device comprising an RPA system, the method comprising:
the RPA system receives a seed corpus input by a user;
the RPA system sends a first request containing the seed corpus to a first electronic device, wherein the first request instructs the first electronic device to generalize the seed corpus according to a preset generalization method based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, to identify the similarity between the at least one candidate corpus and the seed corpus, and to take the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;
the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device.

14. The method according to claim 13, wherein there is at least one generalized corpus of the seed corpus;
the RPA system receiving and displaying the generalized corpus of the seed corpus sent by the first electronic device comprises:
the RPA system displays the at least one generalized corpus of the seed corpus and an indication control, wherein the indication control is used to indicate that a generalized corpus selected by the user is to be generalized as a new seed corpus;
after the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, the method further comprises:
the RPA system, in response to a trigger operation on the indication control, sends a second request to the first electronic device, wherein the second request instructs the first electronic device to generalize the generalized corpus selected by the user as a new seed corpus;
the RPA system receives and displays the generalized corpus of the seed corpus re-sent by the first electronic device.

15. The method according to claim 14, wherein the RPA system receiving and displaying the generalized corpus of the seed corpus sent by the first electronic device further comprises:
the RPA system receives the at least one generalized corpus of the seed corpus sent by the first electronic device, together with its corresponding similarity and/or generalization method;
the RPA system displays the at least one generalized corpus of the seed corpus in association with its corresponding similarity and/or generalization method;
after the RPA system receives and displays the generalized corpus of the seed corpus sent by the first electronic device, the method further comprises:
the RPA system receives a filtering instruction input by the user, wherein the filtering instruction indicates at least one of the preset generalization methods;
the RPA system displays the generalized corpora obtained by the generalization method indicated by the filtering instruction.

16. The method according to any one of claims 13-15, wherein there is at least one seed corpus;
the RPA system receiving and displaying the generalized corpus of the seed corpus sent by the first electronic device further comprises:
the RPA system displays the at least one seed corpus and the number of its generalized corpora;
after the RPA system displays the at least one seed corpus and the number of its generalized corpora, the method further comprises:
the RPA system receives a trigger instruction from the user on the number of generalized corpora of a specified seed corpus, wherein the specified seed corpus is one of all the seed corpora;
the RPA system displays the generalized corpora of the specified seed corpus.

17. A corpus generalization method combining RPA and AI, applied to a third electronic device, the third electronic device comprising an RPA system, the method comprising:
the RPA system receives a seed corpus input by a user;
the RPA system generalizes the seed corpus according to a preset generalization method to obtain at least one candidate corpus of the seed corpus;
the RPA system identifies the similarity between the at least one candidate corpus and the seed corpus, and determines the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;
the RPA system displays the generalized corpus of the seed corpus.

18. A corpus generalization apparatus combining RPA and AI, applied to a first electronic device, comprising:
a first receiving module, configured to receive a first request, wherein the first request includes a seed corpus;
a first processing module, configured to generalize the seed corpus according to a preset generalization method based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus;
a first determination module, configured to identify the similarity between the at least one candidate corpus and the seed corpus, and to determine the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;
a first output module, configured to output the generalized corpus of the seed corpus.

19. A corpus generalization apparatus combining RPA and AI, applied to a second electronic device, comprising:
a second receiving module, configured to receive a seed corpus input by a user;
a first sending module, configured to send a first request containing the seed corpus to a first electronic device, wherein the first request instructs the first electronic device to generalize the seed corpus according to a preset generalization method based on natural language processing (NLP) to obtain at least one candidate corpus of the seed corpus, to identify the similarity between the at least one candidate corpus and the seed corpus, and to take the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;
a first display module, configured to receive and display the generalized corpus of the seed corpus sent by the first electronic device.

20. A corpus generalization apparatus combining RPA and AI, applied to a third electronic device, comprising:
a third receiving module, configured to receive a seed corpus input by a user;
a second processing module, configured to generalize the seed corpus according to a preset generalization method to obtain at least one candidate corpus of the seed corpus;
a second determination module, configured to identify the similarity between the at least one candidate corpus and the seed corpus, and to determine the candidate corpus whose similarity is greater than a preset threshold as the generalized corpus of the seed corpus;
a second display module, configured to display the generalized corpus of the seed corpus.

21. A first electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to any one of claims 1-12.

22. A second electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to any one of claims 13-16.

23. A third electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the corpus generalization method according to claim 17.

24. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to any one of claims 1-12 is implemented.

25. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to any one of claims 13-16 is implemented.

26. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the corpus generalization method according to claim 17 is implemented.
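
For illustration only, the flow recited in the claims (history lookup, synonym-replacement generalization, similarity scoring, threshold filtering) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `SYNONYM_TABLES`, `HISTORY`, `synonym_candidates`, `jaccard` and `generalize` are hypothetical, the input is assumed to be whitespace-tokenized text, every word is looked up as a potential keyword, and a simple token-overlap (Jaccard) score stands in for the logistic regression model named in claim 12.

```python
from typing import Dict, List

# Hypothetical per-field synonym tables: one table per field, as in the
# synonym replacement method. The entries are made up for illustration.
SYNONYM_TABLES: Dict[str, Dict[str, List[str]]] = {
    "banking": {
        "check": ["view", "look up"],
        "balance": ["remaining funds"],
    },
}

# Hypothetical history record: seed corpus -> previously accepted generalizations.
HISTORY: Dict[str, List[str]] = {}


def synonym_candidates(seed: str, field: str) -> List[str]:
    """Build candidates by replacing one keyword at a time with a synonym."""
    table = SYNONYM_TABLES.get(field, {})
    words = seed.split()
    candidates = []
    for i, word in enumerate(words):
        for synonym in table.get(word, []):
            candidates.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return candidates


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in, NOT the logistic regression model."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def generalize(seed: str, field: str, threshold: float = 0.45) -> List[str]:
    """Return candidates whose similarity to the seed exceeds the threshold."""
    if seed in HISTORY:  # reuse an earlier result when the seed was seen before
        return HISTORY[seed]
    candidates = synonym_candidates(seed, field)
    accepted = [c for c in candidates if jaccard(c, seed) > threshold]
    HISTORY[seed] = accepted
    return accepted
```

Calling `generalize("check my balance", "banking")` produces three candidates and keeps only those scoring above the threshold. The threshold value 0.45 is arbitrary and chosen only so that the filter visibly rejects some candidates.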
CN202011206419.4A 2020-03-27 2020-11-02 Corpus generalization method, apparatus and electronic device combining RPA and AI Pending CN112307295A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010229468 2020-03-27
CN2020102294683 2020-03-27

Publications (1)

Publication Number Publication Date
CN112307295A true CN112307295A (en) 2021-02-02

Family

ID=74333740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011206419.4A Pending CN112307295A (en) 2020-03-27 2020-11-02 Corpus generalization method, apparatus and electronic device combining RPA and AI

Country Status (1)

Country Link
CN (1) CN112307295A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221034A (en) * 2021-05-06 2021-08-06 北京百度网讯科技有限公司 Data generalization method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137775A1 (en) * 2016-11-11 2018-05-17 International Business Machines Corporation Evaluating User Responses Based on Bootstrapped Knowledge Acquisition from a Limited Knowledge Domain
CN108959258A (en) * 2018-07-02 2018-12-07 昆明理工大学 A domain-specific integrated entity linking method based on representation learning
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning
CN110674378A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN110909539A (en) * 2019-10-15 2020-03-24 平安科技(深圳)有限公司 Word generation method, system, computer device and storage medium of corpus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
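ZHOU Kun; WANG Zhao; YU Bihui: "Corpus collection method based on a semantic-relevance topic crawler" (基于语义相关度主题爬虫的语料采集方法), Computer Systems & Applications (计算机系统应用), no. 05, 15 May 2019 (2019-05-15) *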

Similar Documents

Publication Publication Date Title
JP6053131B2 (en) Information processing apparatus, information processing method, and program
US20080065617A1 (en) Search entry system with query log autocomplete
US20180081880A1 (en) Method And Apparatus For Ranking Electronic Information By Similarity Association
CN107526846B (en) Method, device, server and medium for generating and sorting channel sorting model
US20090249248A1 (en) User directed refinement of search results while preserving the scope of the initial search
CN110580278A (en) personalized search method, system, equipment and storage medium according to user portrait
WO2018113468A1 (en) Search term recommendation method, device, program and medium
CN106407316B (en) Software question and answer recommendation method and device based on topic model
CN115309954B (en) Data retrieval method, device, equipment and storage medium
WO2021188702A1 (en) Systems and methods for deploying computerized conversational agents
CN114238745B (en) Method and device for providing search results, electronic device and medium
CN116842160A (en) A patent search formula generation method, system, equipment and medium
WO2020041413A1 (en) Sibling search queries
WO2021051587A1 (en) Search result sorting method and apparatus based on semantic recognition, electronic device, and storage medium
CN112307295A (en) Corpus generalization method, apparatus and electronic device combining RPA and AI
KR20180015491A (en) Method and apparatus for storing log of access based on kewords
JP5971794B2 (en) Patent search support device, patent search support method, and program
JP2001014333A (en) Image retrieval system and image database management device
EP4328764A1 (en) Artificial intelligence-based system and method for improving speed and quality of work on literature reviews
JP5368900B2 (en) Information presenting apparatus, information presenting method, and program
KR101667918B1 (en) Methodand device of providing query-adaptive smart search service
CN115292478A (en) Method, device, equipment and storage medium for recommending search content
CN109284364B (en) Interactive vocabulary updating method and device for voice microphone-connecting interaction
JP2009146013A (en) Content search method, apparatus, and program
JP7705438B2 (en) Method and system for providing search results reflecting user's intentions related to locations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 1902, 19th Floor, China Electronics Building, No. 3 Danling Road, Haidian District, Beijing

Applicant after: BEIJING LAIYE NETWORK TECHNOLOGY Co.,Ltd.

Applicant after: Laiye Technology (Beijing) Co.,Ltd.

Address before: 1902, 19 / F, China Electronics Building, 3 Danling Road, Haidian District, Beijing 100080

Applicant before: BEIJING LAIYE NETWORK TECHNOLOGY Co.,Ltd.

Country or region before: China

Applicant before: BEIJING BENYING NETWORK TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210202