CN112800315A - Data processing method, device, equipment and storage medium

Info

Publication number: CN112800315A
Application number: CN202110130043.1A
Authority: CN
Inventors: 连义江; 杨新涛
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2021-05-14
Anticipated expiration: 2041-01-29
Also published as: CN112800315B

Abstract

The application discloses a data processing method, a data processing device, data processing equipment and a storage medium, relates to the technical field of computers, and further relates to artificial intelligence technologies such as deep learning and intelligent search. The specific implementation scheme is as follows: identifying the word pairs to be processed in the historical search display logs according to the abnormal seed word pairs in the seed database; the word pairs comprise search requests submitted by a requester and reference keywords provided by a data provider; and updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data search, so that the influence of the abnormal matching condition on the search result is greatly reduced, and the accuracy of the search result is improved.

Description

Data processing method, device, equipment and storage medium

Technical Field

The application relates to the technical field of computers, in particular to the technical field of artificial intelligence such as deep learning and intelligent search.

Background

Three roles may be involved in the data search domain: a requester, a data provider, and a search engine. The requester submits a search request to the search engine, the data provider provides reference keywords and content (such as advertising creatives) to the search engine, and the search engine designs a matching mechanism between search requests and reference keywords. When a search request submitted by a requester matches a reference keyword provided by a data provider, the content (such as an advertising creative) of the data provider is displayed in the search result page of the requester, so the matching between the search request and the reference keyword is very important in this process. However, some abnormal matches inevitably occur in the stage where the search engine matches search requests against reference keywords, and these seriously affect the accuracy of the search results.

Disclosure of Invention

The application provides a data processing method, a data processing device, data processing equipment and a storage medium.

According to an aspect of the present application, there is provided a data processing method, the method including:

identifying the word pairs to be processed in the historical search display logs according to the abnormal seed word pairs in the seed database; the word pairs comprise search requests submitted by a requester and reference keywords provided by a data provider;

and updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data search.

According to another aspect of the present application, there is provided a data processing apparatus comprising:

the identification module is used for identifying the word pairs to be processed in the historical search display logs according to the abnormal seed word pairs in the seed database; the word pairs comprise search requests submitted by a requester and reference keywords provided by a data provider;

and the list updating module is used for updating an abnormal word list according to the identification result, and the abnormal word list is used for data search.

According to another aspect of the present application, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a data processing method according to any of the embodiments of the present application.

According to another aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the data processing method according to any one of the embodiments of the present application.

According to another aspect of the present application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the data processing method according to any of the embodiments of the present application.

According to the technology of the application, the influence of the abnormal matching condition on the search result is greatly reduced, and the accuracy of the search result is improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1A is a flowchart of a data processing method provided according to an embodiment of the present application;

FIG. 1B is a schematic diagram of a data processing method according to an embodiment of the present application;

FIG. 2A is a flow chart of another data processing method provided according to an embodiment of the application;

FIG. 2B is a schematic structural diagram of a synonymy metric model according to an embodiment of the present disclosure;

FIG. 3 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 5 is a block diagram of an electronic device for implementing the data processing method according to the embodiment of the present application.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

FIG. 1A is a flowchart of a data processing method provided according to an embodiment of the present application; FIG. 1B is a schematic diagram of a data processing method according to an embodiment of the present application. This embodiment is applicable to processing data in the field of data search so as to reduce the influence of abnormal matches on search results. The embodiment is applied to a search engine and may be executed by a data processing apparatus, which may be implemented in software and/or hardware and may be integrated in an electronic device configured with a search engine function. As shown in FIG. 1A and FIG. 1B, the data processing method includes:

S101, identifying the word pairs to be processed in the historical search display logs according to the abnormal seed word pairs in the seed library; wherein the word pair comprises a search request submitted by a requester and a reference keyword provided by a data provider.

In this embodiment, for any search request submitted by a requester, the search engine can match at least one reference keyword from the reference keywords provided by the data provider based on a matching mechanism. Each matched reference keyword, together with the search request, forms a word pair. Further, for any word pair, if the relevance between the search request and the reference keyword in the word pair is less than a set threshold, the word pair may be regarded as an abnormal word pair.

The seed library, which may also be called an abnormal seed library, is dedicated to storing abnormal seed word pairs, which are determined according to the abnormal word pairs fed back by the data provider. Optionally, all abnormal word pairs fed back by the data provider can be added to the seed library as abnormal seed word pairs. To improve the accuracy of the search results, this embodiment preferably adds only the abnormal word pairs with lower relevance among those fed back by the data provider to the seed library as abnormal seed word pairs. Because the data provider can feed back abnormal word pairs to the search engine in real time, and in order to improve the accuracy of the search results without affecting the search function of the search engine, this embodiment can periodically update the seed library in an offline state according to the abnormal word pairs fed back by the data provider within a period of time.

The search display log is a log generated while a search engine produces a search result for a search request submitted by a requester. Specifically, the historical search display logs may be all search display logs accumulated by the search engine before the current time, the search display logs accumulated by the search engine within a period of time, or a certain number of historically accumulated search display logs, and the like. Optionally, any historical search display log may include a search request, a reference keyword, the content (such as an advertising creative) corresponding to the reference keyword, and the like. A word pair to be processed is a word pair that is recorded in the historical search display logs and has yet to be identified as abnormal or not.

Optionally, in this embodiment, the identification operation may be triggered by a set condition, for example, a set period elapsing or the number of historical search display logs reaching a set number. When the current state is detected to meet the set condition (for example, the current time matches the set period), the historical search display logs are acquired, the word pairs to be processed are extracted from them, and these word pairs are identified according to the abnormal seed word pairs in the seed library. In this case, the historical search display logs are specifically the logs accumulated by the search engine between the previous identification operation and the current one.

Further, this embodiment may identify the word pairs to be processed in the historical search display logs according to the abnormal seed word pairs in the seed library based on a pre-trained synonymy metric model. Specifically, each word pair to be processed may be input into the synonymy metric model together with the abnormal seed word pairs in the seed library, and the synonymy metric model outputs the degree of correlation between the word pair to be processed and each abnormal seed word pair, so that whether the word pair to be processed is an abnormal word pair can be determined according to the degree of correlation.

For example, in the embodiment, the word pairs to be processed in the historical search display log can be identified according to the abnormal seed word pairs in the seed library in an offline state, so that the online search function of the search engine can be ensured not to be influenced, and the reasonable utilization of system resources is realized.

Furthermore, to improve identification efficiency, the word pairs to be processed in the historical search display logs can be deduplicated before they are identified. That is, if a word pair to be processed occurs two or more times, only one occurrence is retained.
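The deduplication step can be illustrated with a minimal sketch. The `WordPair` alias, the function name `dedupe_word_pairs` and the sample data are assumptions introduced for illustration only; the application does not specify how deduplication is implemented.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (search request, reference keyword).
WordPair = Tuple[str, str]


def dedupe_word_pairs(pairs: Iterable[WordPair]) -> List[WordPair]:
    """Keep the first occurrence of each word pair, preserving log order."""
    seen = set()
    unique: List[WordPair] = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


# Illustrative data only; not taken from a real log.
log_pairs = [
    ("cause of syncope", "what epilepsy is"),
    ("DNA detection", "DNA paternity identification"),
    ("cause of syncope", "what epilepsy is"),
]
unique_pairs = dedupe_word_pairs(log_pairs)
```

Preserving the original order is a design choice of this sketch, not a requirement stated in the application.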

S102, updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data search.

In this embodiment, the abnormal word list is specially used for storing the abnormal word pairs, and can be applied to data search as a basis for screening search results; further, the abnormal word list may include the abnormal word pair fed back by the data provider, or may not include the abnormal word pair fed back by the data provider.

The recognition result may include a result of whether each to-be-processed word pair is an abnormal word pair, or the recognition result may include only the to-be-processed word pair that is an abnormal word pair.

Optionally, in this embodiment, the to-be-processed word pair that is the abnormal word pair may be added to the abnormal word list according to the recognition result, so as to update the abnormal word list.
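As a rough sketch of this update, assume the recognition result is a mapping from each to-be-processed word pair to a flag saying whether it was identified as abnormal, and the abnormal word list is held as a set. Both representations and the name `update_abnormal_list` are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: (search request, reference keyword).
WordPair = Tuple[str, str]


def update_abnormal_list(
    abnormal_list: Set[WordPair], recognition: Dict[WordPair, bool]
) -> Set[WordPair]:
    """Add each pair identified as abnormal; existing entries are kept."""
    abnormal_list.update(pair for pair, abnormal in recognition.items() if abnormal)
    return abnormal_list


# Illustrative data only.
abnormal_list = {("cause of syncope", "what epilepsy is")}
recognition = {
    ("syncope cause", "what cause of epilepsy"): True,
    ("DNA detection", "DNA detection price"): False,
}
update_abnormal_list(abnormal_list, recognition)
```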

It should be noted that some abnormal matches inevitably occur in the stage where a search engine matches search requests against reference keywords, that is, abnormal word pairs exist. These seriously affect the accuracy of the search results, so abnormal word pairs need to be masked during the search process. The prior art typically optimizes the search engine based on the abnormal word pairs fed back by the data provider to prevent similar problems from recurring. However, owing to the diversity of linguistic expression, an abnormal word list constructed only from the abnormal word pairs fed back by the data provider (that is, a limited number of abnormal word pairs) or by manual intervention cannot mask other abnormal word pairs of the same type. For example, suppose the abnormal word pair fed back by the data provider is <cause of syncope, what epilepsy is>, where the search request is "cause of syncope" and the reference keyword is "what epilepsy is". Based on this abnormal word pair alone, the prior art cannot mask similar word pairs such as <cause of syncope, what cause of epilepsy>.

It is worth noting that, in this embodiment, even when there are relatively few abnormal seed word pairs, a rich abnormal word list can be obtained without manual effort by periodically identifying the word pairs to be processed in the historical search display logs. For example, given the abnormal seed word pair <cause of syncope, what disease is epilepsy>, this embodiment can identify to-be-processed word pairs such as <how to return syncope, what cause of epilepsy>, <syncope cause, what cause of epilepsy>, <syncope cause of syncope, what epilepsy> and <how to return syncope, cause of epilepsy> as abnormal word pairs and add them to the abnormal word list. This embodiment provides a new approach to obtaining a rich abnormal word list, and lays a foundation for identifying a whole class of abnormal word pairs from a single abnormal word pair provided by a data provider. In addition, the abnormal word list changes dynamically as the historical search display logs change, which further widens the range of abnormal word pairs that can be identified during data search.

According to the technical solution of this embodiment, a seed library is introduced, the abnormal seed word pairs in the seed library are used as a reference to identify the word pairs to be processed in the historical search display logs, and the abnormal word list is updated based on the identification result. Compared with the prior art, this embodiment can obtain a rich abnormal word list without manual effort and provides a new approach to obtaining such a list. In addition, because the rich abnormal word list is used for data search, a whole class of abnormal word pairs can be identified from a single abnormal word pair provided by a data provider, which greatly reduces the influence of abnormal matches on the search results and improves their accuracy.

As an optional manner of this embodiment, after the abnormal word list is updated, data search may be performed based on the updated abnormal word list. For example, candidate keywords may be selected from the reference keywords provided by the data provider according to a target search request submitted by a requester, and the target search request and the candidate keywords are then checked against the updated abnormal word list.

Specifically, a requester with a search need can submit a target search request to the search engine. Based on a matching mechanism, the search engine may select at least one candidate keyword (multiple candidate keywords are preferred in this embodiment) from the reference keywords provided by the data provider according to the target search request, and take each candidate keyword together with the target search request as a candidate word pair. A candidate word pair is masked if it hits the updated abnormal word list, or if its similarity to any abnormal word pair in the updated abnormal word list is greater than a set similarity threshold. The search engine then shows the requester only the content (such as advertising creatives) corresponding to the unmasked candidate word pairs, which greatly reduces the influence of abnormal matches on the search results, improves the accuracy of the search results, and improves the user experience.
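The masking logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `filter_candidates`, the optional `similarity` callable and the default threshold of 0.8 are hypothetical and do not come from the application, which leaves the similarity measure and threshold unspecified.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

# Hypothetical representation: (search request, reference keyword).
WordPair = Tuple[str, str]


def filter_candidates(
    request: str,
    candidates: Iterable[str],
    abnormal_list: Set[WordPair],
    similarity: Optional[Callable[[WordPair, WordPair], float]] = None,
    threshold: float = 0.8,  # assumed value for illustration
) -> List[str]:
    """Return the candidate keywords whose word pair is not masked."""
    kept: List[str] = []
    for keyword in candidates:
        pair = (request, keyword)
        # Variant 1: the candidate word pair hits the abnormal word list.
        if pair in abnormal_list:
            continue
        # Variant 2: the pair is too similar to some abnormal word pair.
        if similarity is not None and any(
            similarity(pair, bad) > threshold for bad in abnormal_list
        ):
            continue
        kept.append(keyword)
    return kept


# Illustrative data only.
abnormal_list = {("cause of syncope", "what epilepsy is")}
kept = filter_candidates(
    "cause of syncope",
    ["what epilepsy is", "syncope treatment"],
    abnormal_list,
)
```

Only the content associated with the keywords in `kept` would then be shown to the requester.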

Fig. 2A is a flowchart of another data processing method provided in an embodiment of the present application. On the basis of the above embodiment, the embodiment of the present application further explains how to identify the word pairs to be processed in the history search display log according to the abnormal seed word pairs in the seed library. As shown in fig. 2A, the data processing method includes:

S201, determining core words of the word pairs to be processed in the historical search display log.

In this embodiment, for any word pair, the core word of the word pair represents the central idea of the word pair. Further, the core word of a word pair is composed of the core word of its search request and the core word of its reference keyword. For example, for the word pair <DNA detection, how to do DNA pregnancy paternity identification>, the core word is (DNA + detection) + (DNA + pregnancy + identification + paternity).

Optionally, for each to-be-processed word pair in the historical search display logs, a core word sequence tagging tool may be used to identify the core word of the to-be-processed word pair, or other means, such as a pre-constructed core word identification model, may be used to identify it.

S202, selecting a target word pair of the word pair to be processed from the abnormal seed word pair of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed.

An inverted index is a common indexing mechanism. In this embodiment, a core word sequence tagging tool may be used to identify the core word of each abnormal seed word pair in the seed library, and, based on the core words of all abnormal seed word pairs, a core word inverted index is constructed with the core words as index words and the abnormal seed word pairs as index content. For example, suppose abnormal seed word pair 1 and abnormal seed word pair 2 have the same core word, (DNA + detection) + (DNA + pregnancy + identification + paternity). Then, using (DNA + detection) + (DNA + pregnancy + identification + paternity) as the index word, all abnormal seed word pairs with that core word, namely abnormal seed word pair 1 and abnormal seed word pair 2, can be retrieved from the seed library through the core word inverted index.

Furthermore, the core word inverted index can be dynamically updated when the abnormal seed word pairs in the seed library change. For example, when one or more abnormal seed word pairs are newly added to the seed library, the core word of each newly added abnormal seed word pair is identified. If that core word already exists as an index word in the pre-constructed core word inverted index, the newly added abnormal seed word pair is added to the corresponding index content. If it does not, a new index pair is added, with the core word of the newly added abnormal seed word pair as the index word and the newly added abnormal seed word pair as the index content. The core word inverted index associated with the seed library may include at least one index pair.
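The construction, dynamic update and lookup of the core word inverted index can be sketched as below. The application identifies core words with a sequence tagging tool; the `core_key` function here is only a toy stand-in (lower-cased, stop-word-filtered tokens per side), and `STOPWORDS`, `CoreWordIndex` and the sample pairs are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation: (search request, reference keyword).
WordPair = Tuple[str, str]
CoreKey = Tuple[Tuple[str, ...], Tuple[str, ...]]

# Toy stop-word list (assumption); a real system would use a core word tagger.
STOPWORDS = {"how", "to", "do", "what", "is", "the", "of", "a"}


def core_key(pair: WordPair) -> CoreKey:
    """Toy core word extraction: sorted non-stop-word tokens for each side."""

    def side(text: str) -> Tuple[str, ...]:
        return tuple(sorted({t for t in text.lower().split() if t not in STOPWORDS}))

    request, keyword = pair
    return side(request), side(keyword)


class CoreWordIndex:
    """Inverted index: core word (index word) -> abnormal seed word pairs."""

    def __init__(self) -> None:
        self._index: Dict[CoreKey, List[WordPair]] = defaultdict(list)

    def add(self, seed_pair: WordPair) -> None:
        # Appends to an existing entry, or creates a new index pair.
        bucket = self._index[core_key(seed_pair)]
        if seed_pair not in bucket:
            bucket.append(seed_pair)

    def lookup(self, pair: WordPair) -> List[WordPair]:
        """Return seed pairs whose core word equals that of `pair`."""
        return list(self._index.get(core_key(pair), []))


index = CoreWordIndex()
index.add(("DNA detection", "how to do DNA pregnancy paternity identification"))
index.add(("detection DNA", "DNA paternity identification pregnancy"))
matches = index.lookup(("DNA detection", "what is DNA pregnancy identification paternity"))
```

Because `add` handles both the existing-index-word and new-index-word cases, the same call serves for the initial build and for later updates of the seed library.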

Optionally, for each word pair to be processed, the core word of the word pair can be used as an index word to query the core word inverted index associated with the seed library, yielding the abnormal seed word pairs whose core word is the same as that of the word pair to be processed. There may be one or more such abnormal seed word pairs. If there is exactly one, it can be used directly as the target word pair of the word pair to be processed. If there are several, a target word pair can be selected from among them, for example through the following process:

Step A, selecting candidate word pairs of the word pair to be processed from the abnormal seed word pairs of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed.

Optionally, for each word pair to be processed, in this embodiment, an abnormal seed word pair in which the core word in the seed library is the same as the core word of the word pair to be processed may be used as a candidate word pair of the word pair to be processed, where the number of the candidate word pairs is multiple.

Step B, determining the distance between the word pair to be processed and the candidate word pair based on the synonymy metric model.

The synonymy metric model may also be referred to as a similarity metric model. To increase the calculation speed, the synonymy metric model in this embodiment is trained based on the idea of a twin (Siamese) network. For example, as shown in FIG. 2B, the network structure is a Transformer, word embedding is used to convert words into vectors, and the network parameters are shared: the network parameters on the positive sample keyword side are the same as those on the search request side, and the network parameters on the search request side are the same as those on the negative sample keyword side. Further, training is performed in a pairwise manner, and a ranking loss function is used during training to predict the relative distance between input samples.

Optionally, a reference keyword clicked by the requester in the historical search display logs may be used as a positive sample keyword, and the word pair <search request, positive sample keyword> as a positive sample. Meanwhile, some of the reference keywords not clicked by the requester in the historical search display logs may be used as negative sample keywords, and the word pair <search request, negative sample keyword> as a negative sample. To ensure the accuracy of the model, a random factor may be introduced: for example, a set number of <search request, random reference keyword> word pairs may be randomly selected from the historical search display logs and also used as negative samples. In this embodiment, the positive and negative samples may be input to an initial model (that is, an untrained synonymy metric model) for training to obtain the synonymy metric model. Specifically, word embedding is applied to the positive sample keyword, the search request and the negative sample keyword to obtain word vectors; the word vectors are input into the Transformer to obtain a positive sample keyword vector, a search request vector and a negative sample keyword vector; metric learning is then performed between the positive sample keyword vector and the search request vector and between the negative sample keyword vector and the search request vector, a loss is computed with the ranking loss function, and the model is trained on this loss to obtain the synonymy metric model. To further ensure accuracy, manually labeled synonymous and non-synonymous samples can be used as fine-tuning samples to fine-tune the network parameters, yielding a more accurate synonymy metric model.
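The application names only "a ranking loss function" and does not give its form. One common choice, used here purely as an assumed illustration, is a margin-based (hinge) ranking loss over Euclidean distances, which pushes the search request vector closer to the positive sample keyword vector than to the negative sample keyword vector. The vectors below stand in for the outputs of the shared Transformer encoder; the function names and the margin value are hypothetical.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ranking_loss(
    request_vec: Sequence[float],
    positive_vec: Sequence[float],
    negative_vec: Sequence[float],
    margin: float = 1.0,  # assumed hyperparameter
) -> float:
    """Hinge ranking loss: zero once the negative is farther than the
    positive by at least `margin`."""
    d_pos = euclidean(request_vec, positive_vec)
    d_neg = euclidean(request_vec, negative_vec)
    return max(0.0, margin + d_pos - d_neg)


# Well-ordered triple: positive at distance 1, negative at distance 5.
loss_ok = ranking_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Badly ordered triple: positive at distance 5, negative at distance 1.
loss_bad = ranking_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In an actual training loop this loss would be computed on batches of encoder outputs and back-propagated through the shared network parameters; that part is omitted here.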

Optionally, in this embodiment, for each word pair to be processed, the word pair to be processed and a candidate word pair of the word pair to be processed may be input into the synonymy metric model together, and the synonymy metric model outputs a distance between the word pair to be processed and each candidate word pair. Optionally, the distance between the word pair to be processed and each candidate word pair may characterize the degree of correlation between the word pair to be processed and each candidate word pair. Further, the larger the distance, the smaller the correlation.

As an optional manner of this embodiment, the distance between the word pair to be processed and the candidate word pair may be determined as follows: calculating a first distance between the search request in the word pair to be processed and the search request in the candidate word pair, and a second distance between the reference keyword in the word pair to be processed and the reference keyword in the candidate word pair; and determining the distance between the word pair to be processed and the candidate word pair according to the first distance and the second distance.

Specifically, for each candidate word pair of each word pair to be processed, a first distance between a search request in the candidate word pair and the search request in the word pair to be processed may be calculated, a second distance between a reference keyword in the candidate word pair and the reference keyword in the word pair to be processed may be calculated, and a sum of the first distance and the second distance may be used as a distance between the candidate word pair and the word pair to be processed; or, the first weight and the second weight may be preset, and then the first weight may be multiplied by the first distance, the second weight may be multiplied by the second distance, and the sum of the products of the two may be used as the distance between the candidate word pair and the word pair to be processed.

Step C, selecting a target word pair of the word pair to be processed from the candidate word pairs according to the distance.

Specifically, for each word pair to be processed, after determining the distance between the word pair to be processed and each candidate word pair of the word pair to be processed, the candidate word pair corresponding to the minimum distance may be used as the target word pair of the word pair to be processed.

It should be noted that, in this embodiment, core words and a core word inverted index are introduced to screen the abnormal seed word pairs in the seed library, which reduces the amount of distance calculation and improves identification efficiency. Meanwhile, a synonymy metric model trained on the twin network idea is introduced, which greatly improves identification efficiency while maintaining the accuracy of the distance calculation.
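Steps B and C can be sketched as follows: a first distance between the two search requests and a second distance between the two reference keywords are combined (by plain or weighted sum), and the candidate with the minimum combined distance becomes the target word pair. In the application the per-text distance comes from the synonymy metric model; `jaccard_distance` below is only a toy stand-in, and all function names, weights and sample pairs are assumptions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: (search request, reference keyword).
WordPair = Tuple[str, str]


def jaccard_distance(a: str, b: str) -> float:
    """Toy text distance (assumption); stands in for the metric model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def pair_distance(
    pending: WordPair,
    candidate: WordPair,
    text_distance: Callable[[str, str], float] = jaccard_distance,
    w_request: float = 1.0,  # first weight; 1.0 and 1.0 give the plain sum
    w_keyword: float = 1.0,  # second weight
) -> float:
    first = text_distance(pending[0], candidate[0])   # request vs request
    second = text_distance(pending[1], candidate[1])  # keyword vs keyword
    return w_request * first + w_keyword * second


def select_target(
    pending: WordPair, candidates: List[WordPair]
) -> Optional[Tuple[float, WordPair]]:
    """Return (distance, candidate) for the closest candidate, if any."""
    if not candidates:
        return None
    scored = [(pair_distance(pending, c), c) for c in candidates]
    return min(scored, key=lambda item: item[0])


# Illustrative data only.
pending = ("syncope cause", "epilepsy cause")
candidates = [
    ("cause of syncope", "what is epilepsy"),
    ("syncope cause", "cause of epilepsy"),
]
target = select_target(pending, candidates)
```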

S203, identifying whether the word pair to be processed is an abnormal word pair according to the distance between the word pair to be processed and the target word pair.

Optionally, for each word pair to be processed, if the distance between the word pair to be processed and the target word pair of the word pair to be processed is less than or equal to the set distance value, it is determined that the word pair to be processed is an abnormal word pair. And if the distance between the word pair to be processed and the target word pair of the word pair to be processed is greater than a set distance value, determining that the word pair to be processed is not an abnormal word pair.
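The threshold decision can be written as a short sketch. It assumes the distance from each word pair to be processed to its target word pair has already been computed, with `None` marking pairs for which no target word pair was found; treating such pairs as not abnormal is an assumption of this sketch, since the application does not address that case. The name `classify_pairs` and the sample values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: (search request, reference keyword).
WordPair = Tuple[str, str]


def classify_pairs(
    target_distances: Dict[WordPair, Optional[float]], max_distance: float
) -> Dict[WordPair, bool]:
    """Flag a pair as abnormal when its distance to the target word pair
    is less than or equal to the set distance value."""
    return {
        pair: distance is not None and distance <= max_distance
        for pair, distance in target_distances.items()
    }


# Illustrative distances only.
result = classify_pairs(
    {
        ("syncope cause", "epilepsy cause"): 0.2,
        ("DNA detection", "DNA detection price"): 0.9,
        ("weather today", "umbrella shop"): None,
    },
    max_distance=0.5,
)
```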

S204, updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data search.

According to the technical solution of this embodiment, a seed library is introduced and the abnormal seed word pairs in it are used as a reference to identify the word pairs to be processed in the historical search display logs. In the identification process, core words and a core word inverted index are introduced to screen the abnormal seed word pairs in the seed library, which reduces the amount of distance calculation and improves identification efficiency. Compared with the prior art, this embodiment can obtain a rich abnormal word list without manual effort and provides a new approach to obtaining such a list. In addition, the abnormal word list is updated based on the identification result and the resulting rich abnormal word list is used for data search, so that a whole class of abnormal word pairs can be identified from a single abnormal word pair provided by a data provider, which greatly reduces the influence of abnormal matches on the search results and improves their accuracy.

FIG. 3 is a flowchart of yet another data processing method provided in an embodiment of the present application. On the basis of the above embodiments, this embodiment adds an operation of updating the seed library. As shown in FIG. 3, the data processing method includes:

S301, updating the seed library according to the abnormal word pairs fed back by the data provider.

Optionally, in this embodiment, all abnormal word pairs fed back by the data provider within a period of time may be added to the seed library as abnormal seed word pairs to update the seed library. Alternatively, to improve the accuracy of the search results, this embodiment may add only a part of the abnormal word pairs fed back by the data provider to the seed library as abnormal seed word pairs.

Optionally, a part of the abnormal word pairs fed back from the data provider is selected as an abnormal seed word pair, and the abnormal seed word pair is added to the seed library to update the seed library, which may specifically be implemented as follows:

step 1, determining the confidence of the abnormal word pair fed back by the data provider.

Optionally, for each abnormal word pair fed back by the data provider, the similarity between the search request and the reference keyword in the abnormal word pair may be determined, and the confidence of the abnormal word pair may be determined according to the similarity. For example, the similarity may be normalized to a value between 0 and 1, and the confidence may be taken as 1 minus the normalized value. Alternatively, the similarity may be substituted into a preset confidence calculation formula to obtain the confidence. Optionally, the confidence is negatively correlated with the similarity, i.e., the greater the similarity, the lower the confidence.

step 2, selecting abnormal seed word pairs from the abnormal word pairs fed back by the data provider according to the confidence.

Optionally, the abnormal word pairs fed back by the data provider may be sorted in descending order of confidence, and a preset number of top-ranked abnormal word pairs may then be used as abnormal seed word pairs and added to the seed library to update it.

step 3, adding the abnormal seed word pairs to the seed library.

It should be noted that, in this embodiment, abnormal seed word pairs are selected from the abnormal word pairs fed back by the data provider based on the confidence and are added to the seed library, which ensures the accuracy of the abnormal seed word pairs in the seed library and lays a foundation for obtaining an accurate abnormal word list.
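Steps 1 to 3 can be illustrated with a minimal Python sketch. It is not the implementation of the application: the `WordPair` type, the `token_similarity` function (a Jaccard overlap of whitespace-separated tokens) and the `top_n` parameter are assumptions introduced only for illustration, and the confidence follows the "1 minus normalized similarity" example given in step 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordPair:
    query: str    # search request submitted by the requester
    keyword: str  # reference keyword provided by the data provider


def token_similarity(a: str, b: str) -> float:
    """Assumed stand-in similarity in [0, 1]: Jaccard overlap of tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def confidence(pair: WordPair) -> float:
    """Step 1: confidence taken as 1 minus the normalized similarity."""
    return 1.0 - token_similarity(pair.query, pair.keyword)


def select_seed_pairs(feedback: list[WordPair], top_n: int) -> list[WordPair]:
    """Step 2: sort by confidence in descending order and keep the first top_n."""
    ranked = sorted(feedback, key=confidence, reverse=True)
    return ranked[:top_n]


# Hypothetical feedback from a data provider.
feedback = [
    WordPair("cheap flights paris", "flights paris"),
    WordPair("cheap flights paris", "paris hotel deals"),
    WordPair("running shoes", "garden hose"),
]

# Step 3: add the selected pairs to the seed library (modelled here as a set).
seed_library: set[WordPair] = set()
seed_library.update(select_seed_pairs(feedback, top_n=2))
```

With this stand-in similarity, the pair whose request and keyword overlap most receives the lowest confidence and is left out of the seed library.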

S302, identifying the word pairs to be processed in the historical search display logs according to the abnormal seed word pairs in the updated seed library; wherein the word pair comprises a search request submitted by a requester and a reference keyword provided by a data provider.

S303, updating an abnormal word list according to the identification result, wherein the abnormal word list is used for data search.

In the technical solution of this embodiment, the seed library is dynamically updated, which lays a foundation for obtaining a rich abnormal word list. In addition, the abnormal seed word pairs in the updated seed library serve as a reference for identifying the word pairs to be processed in the historical search display log, so that a rich abnormal word list can be obtained without relying on manual effort, which provides a new approach to obtaining such a list. Moreover, the abnormal word list is updated based on the identification result and the enriched list is used for data search, so that a whole class of abnormal word pairs can be identified from a single abnormal word pair provided by a data provider. This greatly reduces the influence of abnormal matching on the search result and improves the accuracy of the search result.

Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. This embodiment is applicable to processing data in the data search field so as to reduce the influence of abnormal matching on search results. The apparatus can implement the data processing method of any embodiment of the present application. As shown in fig. 4, the data processing apparatus includes:

the identification module 401 is configured to identify a word pair to be processed in the history search display log according to an abnormal seed word pair in the seed database; the word pairs comprise search requests submitted by a requester and reference keywords provided by a data provider;

and a list updating module 402, configured to update the abnormal word list according to the identification result, where the abnormal word list is used for data search.

Illustratively, the identification module 401 includes:

the core word determining unit is used for determining the core words of the word pairs to be processed;

the target selection unit is used for selecting a target word pair of the word pair to be processed from the abnormal seed word pair in the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed;

and the abnormity identification unit is used for identifying whether the word pair to be processed is an abnormal word pair or not according to the distance between the word pair to be processed and the target word pair.

Illustratively, the target selection unit includes:

the candidate selection subunit is used for selecting the candidate word pair of the word pair to be processed from the abnormal seed word pair in the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed;

the distance determining subunit is used for determining the distance between the word pair to be processed and the candidate word pair based on the synonymy measurement model;

and the target selection subunit is used for selecting the target word pair of the word pair to be processed from the candidate word pair according to the distance.
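The candidate selection performed through the core word inverted index can be sketched as follows. This is a minimal illustration under stated assumptions: `core_words` is a hypothetical extractor (tokens outside a small stop list), and a seed pair is treated as a candidate when it shares at least one core word with the word pair to be processed. The application may determine core words and match them differently.

```python
from collections import defaultdict

Pair = tuple[str, str]  # (search request, reference keyword)

# Assumed stop list; anything not listed here counts as a core word.
STOP_WORDS = {"the", "a", "of", "for", "buy", "cheap"}


def core_words(pair: Pair) -> set[str]:
    """Hypothetical core-word extractor over both sides of the pair."""
    tokens = set(pair[0].split()) | set(pair[1].split())
    return tokens - STOP_WORDS


def build_inverted_index(seeds: list[Pair]) -> dict[str, set[int]]:
    """Map each core word to the positions of the seed pairs containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for position, seed in enumerate(seeds):
        for word in core_words(seed):
            index[word].add(position)
    return index


def select_candidates(pair: Pair, seeds: list[Pair],
                      index: dict[str, set[int]]) -> list[Pair]:
    """Return the seed pairs that share at least one core word with `pair`."""
    hits: set[int] = set()
    for word in core_words(pair):
        hits |= index.get(word, set())
    return [seeds[i] for i in sorted(hits)]


# Hypothetical abnormal seed word pairs.
seeds = [
    ("apple phone price", "apple fruit wholesale"),
    ("jaguar car dealer", "jaguar zoo tickets"),
    ("python tutorial", "python snake pet"),
]
index = build_inverted_index(seeds)
pending = ("apple laptop", "apple juice")
candidate_pairs = select_candidates(pending, seeds, index)
```

Only the candidates returned by the lookup need a distance computation, which is how the inverted index reduces the amount of distance calculation.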

Illustratively, the distance determining subunit is specifically configured to:

calculating a first distance between a search request in the word pair to be processed and a search request in the candidate word pair, and a second distance between a reference keyword in the word pair to be processed and a reference keyword in the candidate word pair;

and determining the distance between the word pair to be processed and the candidate word pair according to the first distance and the second distance.
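A minimal sketch of the two-part distance is given below. The synonymy metric model is not reproduced here; `text_distance` is an assumed stand-in (Jaccard distance over tokens), and combining the first and second distances by their mean is likewise an assumption made for illustration. The `threshold` parameter is hypothetical.

```python
from typing import Optional

Pair = tuple[str, str]  # (search request, reference keyword)


def text_distance(a: str, b: str) -> float:
    """Assumed stand-in for the synonymy metric model: Jaccard distance."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def pair_distance(pending: Pair, candidate: Pair) -> float:
    """Combine the first (request) and second (keyword) distances by their mean."""
    first = text_distance(pending[0], candidate[0])
    second = text_distance(pending[1], candidate[1])
    return (first + second) / 2.0


def select_target(pending: Pair,
                  candidates: list[Pair]) -> Optional[tuple[Pair, float]]:
    """Pick the candidate with the smallest distance, or None if there is none."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: pair_distance(pending, c))
    return best, pair_distance(pending, best)


def is_abnormal(pending: Pair, candidates: list[Pair], threshold: float) -> bool:
    """Flag the pair as abnormal when its nearest seed pair is close enough."""
    result = select_target(pending, candidates)
    return result is not None and result[1] <= threshold


# Hypothetical data.
pending = ("apple laptop price", "apple juice price")
candidates = [
    ("apple phone price", "apple fruit price"),
    ("jaguar car dealer", "jaguar zoo tickets"),
]
target, distance = select_target(pending, candidates)
```

In this example both component distances to the first candidate are 0.5, so it is chosen as the target word pair with a combined distance of 0.5.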

Exemplarily, the apparatus further includes:

and the seed library updating module is used for updating the seed library according to the abnormal word pairs fed back by the data provider.

Illustratively, the seed library updating module is specifically configured to:

determining the confidence of the abnormal word pair fed back by the data provider;

selecting an abnormal seed word pair from the abnormal word pairs fed back by the data provider according to the confidence coefficient;

adding the abnormal seed word pairs to the seed library.

Exemplarily, the apparatus further includes:

the keyword selection module is used for selecting candidate keywords from reference keywords provided by a data provider according to a target search request submitted by a requester;

the identifying module 401 is further configured to identify the target search request and the candidate keyword according to the updated abnormal word list.
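How the updated abnormal word list might be used at search time can be sketched as follows. The sketch assumes that the list is stored as a set of (search request, reference keyword) pairs, that lookup is by exact match, and that a candidate keyword forming an abnormal pair with the target search request is dropped. These are illustrative assumptions, and `filter_candidates` is a hypothetical name.

```python
AbnormalList = set[tuple[str, str]]  # (search request, reference keyword)


def filter_candidates(target_request: str,
                      candidate_keywords: list[str],
                      abnormal_list: AbnormalList) -> list[str]:
    """Keep the candidate keywords that do not form an abnormal pair."""
    kept = []
    for keyword in candidate_keywords:
        if (target_request, keyword) in abnormal_list:
            continue  # assumed handling: skip pairs identified as abnormal
        kept.append(keyword)
    return kept


# Hypothetical data.
abnormal_list: AbnormalList = {("apple phone price", "apple fruit wholesale")}
candidate_keywords = ["apple phone deals", "apple fruit wholesale"]
shown = filter_candidates("apple phone price", candidate_keywords, abnormal_list)
```

Here the second candidate keyword is removed because it appears in the abnormal word list together with the target search request.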

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 executes the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, which is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of data processing, comprising:

2. The method of claim 1, wherein identifying the word pairs to be processed in the historical search display log according to the abnormal seed word pairs in the seed repository comprises:

determining a core word of a word pair to be processed;

selecting a target word pair of the word pair to be processed from an abnormal seed word pair of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed;

and identifying whether the word pair to be processed is an abnormal word pair or not according to the distance between the word pair to be processed and the target word pair.

3. The method of claim 2, wherein selecting a target word pair of the word pair to be processed from an abnormal seed word pair of the seed repository according to the core word inverted index associated with the seed repository and the core word of the word pair to be processed comprises:

selecting a candidate word pair of the word pair to be processed from an abnormal seed word pair of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed;

determining a distance between the pair of to-be-processed words and the pair of candidate words based on a synonymy metric model;

and selecting a target word pair of the word pair to be processed from the candidate word pairs according to the distance.

4. The method of claim 3, wherein determining a distance between the pair of words to be processed and the pair of candidate words comprises:

calculating a first distance between the search request in the word pair to be processed and the search request in the candidate word pair, and a second distance between the reference keyword in the word pair to be processed and the reference keyword in the candidate word pair;

5. The method of claim 1, further comprising:

and updating the seed database according to the abnormal word pairs fed back by the data provider.

6. The method of claim 5, wherein updating the seed repository based on the abnormal word pairs fed back by the data provider comprises:

adding the pair of abnormal seed words to the seed repository.

7. The method of claim 1, further comprising, after updating the list of abnormal words according to the recognition result:

selecting candidate keywords from reference keywords provided by a data provider according to a target search request submitted by a requester;

and identifying the target search request and the candidate keywords according to the updated abnormal word list.

8. A data processing apparatus comprising:

9. The apparatus of claim 8, wherein the identification module comprises:

the target selection unit is used for selecting a target word pair of the word pair to be processed from an abnormal seed word pair of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed;

and the abnormity identification unit is used for identifying whether the word pair to be processed is an abnormal word pair according to the distance between the word pair to be processed and the target word pair.

10. The apparatus of claim 9, wherein the target selection unit comprises:

the candidate selecting subunit is used for selecting the candidate word pair of the word pair to be processed from the abnormal seed word pair of the seed library according to the core word inverted index associated with the seed library and the core word of the word pair to be processed;

a distance determining subunit, configured to determine, based on a synonymy metric model, a distance between the to-be-processed word pair and the candidate word pair;

and the target selection subunit is used for selecting a target word pair of the word pair to be processed from the candidate word pair according to the distance.

11. The apparatus of claim 10, wherein the distance determining subunit is specifically configured to:

12. The apparatus of claim 8, further comprising:

13. The apparatus of claim 12, wherein the seed repository update module is specifically configured to:

adding the pair of abnormal seed words to the seed repository.

14. The apparatus of claim 8, further comprising:

and the identification module is also used for identifying the target search request and the candidate keywords according to the updated abnormal word list.

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1-7.

16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the data processing method according to any one of claims 1 to 7.

17. A computer program product comprising a computer program which, when executed by a processor, implements a method of data processing according to any one of claims 1 to 7.