
CN114595775A - Training method and device of data generation model - Google Patents


Info

Publication number
CN114595775A
Authority
CN
China
Prior art keywords
data
target
text data
training
generation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210247415.3A
Other languages
Chinese (zh)
Other versions
CN114595775B (en)
Inventor
李丽丽
李博
刘晓龙
陈曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210247415.3A
Publication of CN114595775A
Application granted
Publication of CN114595775B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training method and device for a data generation model, comprising: acquiring a plurality of pieces of target countermeasure text data and clustering them to obtain at least one data cluster; selecting, for each data cluster, at least one data pair from the data cluster, where each data pair comprises two pieces of target countermeasure text data, and taking the data pair as a training data sample of the data generation model; and training the data generation model by taking one piece of target countermeasure text data in the training data sample as source data and the other piece as standard data, to obtain a trained data generation model, where the trained data generation model is used for generating countermeasure text data based on target countermeasure text data to be processed. In this way, abundant countermeasure text data can be generated, which improves the recognition effect of the corresponding data recognition model and effectively improves the recognition rate of countermeasure text data.

Description

Training method and device of data generation model
Technical Field
The present application relates to internet technologies, and in particular, to a method and an apparatus for training a data generation model.
Background
Malicious data is characterized by many variants, strong interference, and strong semantic antagonism. When malicious data undergoes semantic recognition, its semantic information is easily destroyed by adversarial perturbation, and its data features are easily lost; consequently, the data features of malicious data are easily lost during data enhancement as well. If data enhancement is applied to malicious data directly, through similar-word replacement, character translation, and the like, the meaning of the original sentence may be affected, which in turn degrades the recognition of the malicious data.
Disclosure of Invention
The embodiments of the application provide a training method and apparatus for a data generation model, an electronic device, a computer-readable storage medium, and a computer program product, which can generate abundant countermeasure text data, thereby improving the recognition effect of the corresponding data recognition model and effectively improving the recognition rate of countermeasure text data.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a training method of a data generation model, which comprises the following steps:
acquiring a plurality of pieces of target countermeasure text data, and clustering the plurality of pieces of target countermeasure text data to obtain at least one data cluster;
for each data cluster, selecting at least one data pair from the data cluster, each data pair comprising two pieces of the target countermeasure text data, and
taking the data pairs as training data samples of the data generation model;
and training the data generation model by taking one piece of data in the training data sample as source data and the other piece as standard data, so that the trained data generation model can generate countermeasure text data based on target countermeasure text data to be processed.
The embodiment of the present application provides a training device for a data generation model, including:
The acquisition module is used for acquiring a plurality of pieces of target countermeasure text data and clustering the plurality of pieces of target countermeasure text data to obtain at least one data cluster;
the selecting module is used for selecting at least one data pair from the data clusters aiming at each data cluster, wherein each data pair comprises two pieces of target confrontation text data, and the data pair is used as a training data sample of the data generation model;
and the training module is used for training the data generation model by taking one piece of target confrontation text data in the training data sample as source data and the other piece of target confrontation text data as standard data to obtain a trained data generation model, and the trained data generation model is used for generating confrontation text data based on the target confrontation text data to be processed.
In the above scheme, the obtaining module is further configured to perform word segmentation processing on each piece of the plurality of pieces of target countermeasure text data to obtain the word segmentation result corresponding to each piece; determine the vocabulary intersection and vocabulary union between any two pieces of target countermeasure text data based on the word segmentation results; obtain the ratio of the vocabulary intersection to the vocabulary union of any two pieces of target countermeasure text data and determine the ratio as the similarity between the corresponding two pieces; and cluster the plurality of pieces of target countermeasure text data based on the determined similarities to obtain the at least one data cluster.
In the above scheme, the selecting module is further configured to randomly select two pieces of target countermeasure text data from the at least two pieces included in the data cluster and combine them into one data pair, and to repeat this operation until the target number of data pairs has been selected or the data cluster is empty.
In the above scheme, the selecting module is further configured to obtain common text data of the text data included in the data cluster and use the common text data as the cluster seed of the data cluster; select at least one piece of target countermeasure text data from the at least two pieces included in the data cluster; and combine the cluster seed with each selected piece of target countermeasure text data to form a data pair, so as to obtain at least one data pair.
In the above scheme, the apparatus further includes an identification module, where the identification module is configured to perform semantic recognition on the training data sample to obtain a semantic recognition result of the training data sample, and, when the semantic recognition result indicates that the training data sample includes unrecognizable characters, map the unrecognizable characters into recognizable characters based on a preset mapping relation to obtain a new training data sample; the training module is further configured to train the data generation model by taking one piece of target countermeasure text data in the new training data sample as source data and the other piece as standard data.
In the above scheme, the identification module is further configured to detect the unrecognizable characters to determine their positions and types; acquire, based on the type of an unrecognizable character, the unrecognizable-character mapping relation corresponding to that type, and determine the position feature of the unrecognizable character based on its position; and map the unrecognizable characters in the training data sample into recognizable characters based on the mapping relation and the position features, so as to obtain the new training data sample.
In the above scheme, the data generation model includes an encoding layer and a decoding layer, and the training module is further configured to encode the source data and the standard data in the training data sample separately through the encoding layer to obtain a first encoding feature corresponding to the source data and a second encoding feature corresponding to the standard data; decode the first encoding feature through the decoding layer based on the first encoding feature and the second encoding feature to obtain countermeasure text data corresponding to the source data; and obtain the difference between the countermeasure text data and the standard data and update the model parameters of the data generation model based on the difference.
In the above scheme, the training module is further configured to encode the source data and the standard data in the training data sample separately through the encoding layer to obtain the character feature, position feature, and clause feature corresponding to the source data, and the character feature, position feature, and clause feature corresponding to the standard data, where the clause feature is used to distinguish the source data from the standard data; fuse the character feature, position feature, and clause feature corresponding to the source data to obtain the first encoding feature corresponding to the source data; and fuse the character feature, position feature, and clause feature corresponding to the standard data to obtain the second encoding feature corresponding to the standard data.
In the above scheme, the apparatus further includes an encryption module, where the encryption module is configured to encrypt a target number of the features in the second encoding features to obtain encrypted encoding features corresponding to the second encoding features; the training module is further configured to decode, through the decoding layer, the first encoding feature based on the first encoding feature and the encrypted encoding features to obtain countermeasure text data corresponding to the source data; obtain the target subdata in the countermeasure text data corresponding to the encrypted features in the second encoding features, and the standard subdata in the standard data corresponding to those encrypted features; and compare the target subdata with the standard subdata to obtain the difference between them and update the model parameters of the data generation model based on the difference.
In the above scheme, the apparatus further includes a perturbation module, where the perturbation module is configured to perform data perturbation on the training data sample to obtain a perturbed training sample corresponding to the training data sample; the training module is further configured to train the data generation model by taking one piece of target countermeasure text data in the perturbed training sample as source data and the other piece as standard data.
In the above scheme, the apparatus further includes an application module, where the application module is configured to obtain the trained data generation model and original target countermeasure text data, and to generate, through the data generation model, countermeasure text data corresponding to the original target countermeasure text data.
In the above scheme, the application module is further configured to perform data perturbation on the original target countermeasure text data through the data generation model to obtain at least two pieces of perturbed text data, and to generate at least two pieces of countermeasure text data corresponding to the original target countermeasure text data based on the at least two pieces of perturbed text data.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the training method of the data generation model provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for training a data generation model provided in the embodiment of the present application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the training method of the data generation model provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of constructing a data pair of target countermeasure text data as a training data sample by clustering the target countermeasure text data, training a data generation model by using one piece of target countermeasure text data in the training data sample as source data and the other piece of target countermeasure text data as standard data, and generating the countermeasure text data based on the target countermeasure text data through the trained data generation model. Therefore, the recognition effect of the corresponding data recognition model is improved through the generated confrontation text data, and the recognition rate of the confrontation text data is effectively improved.
Drawings
FIG. 1 is a schematic diagram of an architecture of a training system 100 for a data generation model provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating a training method for a data generation model provided by an embodiment of the present application;
FIG. 4 is a diagram of vocabulary intersection and vocabulary union between two pieces of data provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of performing semantic recognition on training data samples to obtain new training data samples according to an embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating mapping non-recognizable characters into recognizable characters according to an embodiment of the present disclosure;
FIG. 7 is a flow chart illustrating a training process of a data generation model provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a data generation model provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a data generation model provided by an embodiment of the present application;
FIG. 10 is a flow chart illustrating a training process of a data generation model provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a data generation model provided by an embodiment of the present application;
FIG. 12 is a flow chart illustrating a training process of a data generation model provided by an embodiment of the present application;
fig. 13 is a flowchart illustrating a generation process of countermeasure text data according to an embodiment of the present application;
fig. 14 is a flowchart illustrating a process of generating countermeasure text data according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" serve only to distinguish similar objects and do not denote a particular order or importance; where permissible, "first/second/third" may be interchanged in a particular order or sequence so that the embodiments of the present application described here can be practiced in an order other than that shown or described. In the following description, the term "plurality" means at least two.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Malicious data refers to illegal data which takes the internet as a medium, takes a network technology as a main means, and brings potential threats (major potential safety hazards) to the safety of a computer information system and the management order of network space.
2) Sequence to Sequence model, a Sequence to Sequence model, refers to a model that maps one Sequence as an input to another Sequence.
3) The BERT model (Bidirectional Encoder Representations from Transformers) is a pre-training technique for natural language processing: it is trained on a large-scale unlabeled corpus to obtain text semantic representations containing rich semantic information; these representations are then fine-tuned for a specific natural language processing task and finally applied to that task.
4) The BertOnlyMLMHead module is a prediction head used to return predicted values when a language model is trained on the basis of masking.
5) Countermeasure text data is text data that is not allowed to be displayed on the current page, that is, text data whose content does not meet the current scene conditions and is therefore antagonistic and interfering with respect to the current scene. The current scene conditions may include one or more of relevant legal provisions, platform provisions, page provisions, and the like, which are not limited here. Countermeasure text data is also a kind of malicious data. Exemplarily, when the page scene where the text data is located is a bullet-screen scene (e.g., bullet comments on a video), only a viewer's own views, content comments, or content exchanges are allowed, and no advertisement content may appear in the bullet screen; thus, when a bullet comment carrying an advertisement exists in the bullet-screen scene, the bullet comment is regarded as illegal text data, that is, countermeasure text data. For another example, when the page scene where the text data is located is a comment scene (e.g., comments on news and articles), the scene specifies that only views, content comments, or content exchanges may be posted, and malicious comments such as foul language cannot appear; thus, when a comment containing foul language occurs in the comment scene, that comment is regarded as illegal text data, that is, countermeasure text data. For another example, some platforms require text data to have semantic integrity; when the semantic information of text data is damaged, the text data without semantic integrity reads poorly, and if it further includes special characters, emoticons, or other hard-to-recognize text, the user's visual experience is affected, and such text data can also be determined to be countermeasure text data.
6) Target countermeasure text data is countermeasure text data whose semantics lack continuity, that is, whose semantic integrity has been destroyed. For example, some malicious advertisements destroy textual semantic information by inserting various special characters and emoticons or by using text variations such as homophone and similar-glyph substitutions, so as to resist recognition by an advertisement recognition model; such malicious data is strongly antagonistic and therefore difficult to recognize. In the embodiments of the present application, target countermeasure text data includes text data whose textual semantic information is destroyed in this way. In some embodiments, target countermeasure text data is countermeasure text data with stronger antagonism.
In the implementation process of the embodiment of the present application, the applicant finds that the following problems exist in the related art:
Because the comment, bullet-screen, and other areas of various social platforms are the channels that reach users and expose and propagate malicious information the fastest and at the lowest cost, at essentially every moment a large number of automata and malicious accounts send massive malicious texts to the comment and bullet-screen areas of each platform to disturb users; for such massive malicious data, comprehensive manual review is a task that basically cannot be completed. However, if review is performed based on a malicious-data recognition model, the malicious data damages the semantic information of the text by inserting various special characters and emoticons or by using text-variation means such as homophone and similar-glyph substitution, which interferes with the recognition of the malicious-data recognition model. That is, the malicious data has many variants and strong antagonism, so it is difficult to enhance the corresponding malicious data, the recognition effect on malicious data suffers, and the recognition of malicious data has become an industry-wide problem.
Enhancement of malicious data in the related art is mainly achieved by three methods. Method 1: a data enhancement method based on vocabulary replacement, mainly replacement based on a dictionary or on word vectors; this method generally selects part of the vocabulary from a text at random and replaces it with synonyms found in a dictionary or among pre-trained word vectors. Method 2: a data enhancement method based on back-translation; an intermediate language is chosen first, and the text is translated into the intermediate language and then translated back into the original language. For example, if the intermediate language is English, the text is translated into English and the English text is then translated back into Chinese; the translated text is generally semantically similar to, but different from, the original text, which achieves the goal of data enhancement, and multiple intermediate languages can also be chosen to add more noise. Method 3: a text error-correction method, mainly based on a seq2seq model or on a BERT model, which corrects the interfering parts of a text and restores them to a normal text form, thereby achieving data enhancement. A minimal sketch of method 2 follows.
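The following Python sketch illustrates the back-translation of method 2 under stated assumptions: the translate() function is a placeholder for an arbitrary machine-translation backend and is not prescribed by this application.

```python
# A minimal sketch of back-translation (method 2). translate() is an assumed
# placeholder; plug in any machine-translation service.

def translate(text: str, src: str, dst: str) -> str:
    """Placeholder for an arbitrary machine-translation backend."""
    raise NotImplementedError("supply a real translation service here")

def back_translate(text: str, source_lang: str = "zh", pivots=("en",)) -> str:
    # Route the text through one or more intermediate languages and back;
    # the round trip keeps the rough meaning but perturbs the surface form,
    # which is the data-enhancement effect described above.
    current = source_lang
    for pivot in pivots:
        text = translate(text, src=current, dst=pivot)
        current = pivot
    return translate(text, src=current, dst=source_lang)
```

Choosing several pivots (e.g. pivots=("en", "fr")) adds more noise, as noted above.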
However, for method 1, the words to be replaced are generally selected at random, and improper selection may affect the meaning of the original sentence. For method 2, translating the text relies heavily on the quality of the translation algorithm; errors propagate easily, again affecting the meaning of the original sentence. For method 3, because malicious data has many variants, strong interference, and strong semantic antagonism, the corresponding error-correction model usually depends on semantic context information, and since each Chinese character has independent semantics, the semantic information is easily destroyed by adversarial perturbation; as a result, malicious data easily loses its data features after passing through the error-correction model, affecting the data recognition effect.
Based on this, the present embodiment provides a training method and apparatus for a data generation model, an electronic device, a computer readable storage medium, and a computer program product, which generate countermeasure text data with similar features by automatically learning features of target countermeasure text data, and can improve recognition effects of corresponding data recognition models, thereby effectively improving recognition rates of the countermeasure text data.
Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a training system 100 for a data generation model provided in an embodiment of the present application, where a terminal (an exemplary terminal 400 is shown) is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two. The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
The terminal 400 is configured to send a plurality of pieces of target countermeasure text data to the server 200;
a server 200 for receiving a plurality of pieces of target countermeasure text data transmitted by the terminal 400; clustering the target countermeasure text data to obtain at least one data cluster; selecting at least one data pair from the data clusters aiming at each data cluster, wherein each data pair comprises two pieces of target confrontation text data, and taking the data pair as a training data sample of a data generation model; and training a data generation model by taking one target countermeasure text data in the training data sample as source data and the other target countermeasure text data as standard data to obtain a trained data generation model, wherein the trained data generation model is used for generating countermeasure text data based on the target countermeasure text data to be processed.
In some embodiments, the terminal 400 may further be provided with a data generation client 400-1. The client 400-1 sends a request for countermeasure text data to the server 200; the server 200 then obtains target countermeasure text data to be processed and generates countermeasure text data based on it through the data generation model; finally, the generated countermeasure text data is sent to the client 400-1.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a set-top box, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device, a smart speaker, and a smart watch), and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, and in practical application, the electronic device may be the server 200 or the terminal 400 shown in fig. 1, and referring to fig. 2, the electronic device shown in fig. 2 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in FIG. 2.
The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., wherein the general purpose Processor may be a microprocessor or any conventional Processor, etc.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the training apparatus for data generation model provided in the embodiments of the present application may be implemented in software, and fig. 2 illustrates the training apparatus 455 for data generation model stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: an acquisition module 4551, a selection module 4552 and a training module 4553, which are logical and thus may be arbitrarily combined or further split depending on the functions implemented.
In other embodiments, the training Device of the data generation model provided in the embodiments of the present Application may be implemented in hardware, and as an example, the training Device of the data generation model provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the training method of the data generation model provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
In some embodiments, the terminal or the server may implement the training method of the data generation model provided by the embodiment of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; the Application program may be a local (Native) Application program (APP), that is, a program that needs to be installed in an operating system to run, such as an instant messaging APP and a web browser APP; or may be an applet, i.e. a program that can be run only by downloading it to a browser environment; but also an applet that can be embedded into any APP. In general, the computer programs described above may be any form of application, module or plug-in.
Based on the above description of the training system and the electronic device for the data generation model provided in the embodiments of the present application, the following description will be given of a training method for the data generation model provided in the embodiments of the present application. In practical implementation, the training method for the data generation model provided in the embodiment of the present application may be implemented by a terminal or a server alone, or implemented by a terminal and a server in cooperation, and the example is described by using the server 200 in fig. 1 to execute the training method for the data generation model provided in the embodiment of the present application alone. Referring to fig. 3, fig. 3 is a flowchart illustrating a training method of a data generation model according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
Step 101, a server acquires a plurality of pieces of target countermeasure text data, and performs clustering processing on the plurality of pieces of target countermeasure text data to obtain at least one data cluster.
It should be noted that, since the recognition rate of the data recognition model for recognizing the countermeasure text data in the related art is low, the countermeasure text data may include countermeasure text data recognizable by the data recognition model and countermeasure text data unrecognizable by the data recognition model, and the target countermeasure text data is countermeasure text data unrecognizable by the data recognition model.
In some embodiments, the server may obtain a plurality of pieces of target countermeasure text data from a target countermeasure text database, wherein the target countermeasure text database stores the target countermeasure text data.
In other embodiments, the server may also obtain the plurality of pieces of target countermeasure text data through a data recognition model.
Next, a process of acquiring a plurality of pieces of target countermeasure text data by the data recognition model will be described.
In one embodiment, the countermeasure text data to be recognized can be matched against preset countermeasure text data through the data recognition model. When the matching result indicates that the similarity between the countermeasure text data to be recognized and the preset countermeasure text data reaches a preset similarity threshold, the countermeasure text data to be recognized is determined to be recognizable countermeasure text data; when the matching result indicates that the similarity does not reach the preset similarity threshold, the countermeasure text data to be recognized is determined to be unrecognizable countermeasure text data. It should be noted that the preset countermeasure text data is countermeasure text data whose semantics have continuity, that is, whose semantic integrity is not destroyed. When the similarity to the preset countermeasure text data does not reach the preset threshold, it is determined that the semantics of the countermeasure text data to be recognized lack integrity, that is, its semantic integrity is destroyed; the countermeasure text data to be recognized is therefore determined to be target countermeasure text data.
In another embodiment, the data recognition model may also recognize the countermeasure text data by recognizing the semantics of the data, for example, when the semantic recognition result of one to-be-processed text data satisfies the current scene condition, the to-be-processed text data may be determined as the non-countermeasure text data; when the semantic recognition result of one text data to be processed does not meet the current scene condition, the text data to be processed can be determined as countermeasure text data, and when the semantic meaning of one text data to be processed cannot be recognized or a credible semantic recognition result cannot be obtained (for example, a semantic recognition result with a confidence coefficient exceeding a confidence coefficient threshold value cannot be obtained), the text data to be processed can be determined as target countermeasure text data.
Therefore, after the target countermeasure text data is obtained, the target countermeasure text data is adopted to construct the training sample to generate rich countermeasure text data, the data recognition model can be further trained in practical application, and the recognition rate of the data recognition model on the countermeasure text data is improved.
In actual implementation, after a plurality of pieces of target countermeasure text data are obtained, word segmentation processing is respectively carried out on the plurality of pieces of target countermeasure text data, and word segmentation results corresponding to the text data are obtained; and determining the similarity between the target countermeasure text data based on the word segmentation results, and clustering the target countermeasure text data based on the determined similarity to obtain at least one data cluster. The process of determining the similarity between the target countermeasure text data based on the word segmentation results specifically comprises the steps of determining a word intersection and a word union between any two target countermeasure text data based on the word segmentation results; and acquiring the ratio of the vocabulary intersection and the vocabulary union of any two target countermeasure text data, and determining the ratio as the similarity between the corresponding two target countermeasure text data.
As an example of the process of determining the similarity between text data based on the word segmentation results: first, two pieces of target countermeasure text data are obtained, such as "9 kk4 is needed for a piece" and "4 ss6 is needed for a piece"; word segmentation is then performed on each piece to obtain its segmentation result. Based on the segmentation results, the vocabulary intersection and vocabulary union between the two pieces of target countermeasure text data are determined; referring to fig. 4, fig. 4 is a schematic diagram of the vocabulary intersection and vocabulary union between two pieces of target countermeasure text data provided by an embodiment of the present application. Based on fig. 4, the ratio of the vocabulary intersection to the vocabulary union of the two pieces is obtained as 7/(7+4+4) = 0.467, and this ratio is determined as the similarity between the corresponding two pieces of target countermeasure text data.
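A minimal Python sketch of this similarity computation follows, with character-level tokenization assumed in place of a real word segmenter:

```python
# Jaccard similarity between two texts: the ratio of the vocabulary
# intersection to the vocabulary union, as described in step 101.
# Character-level tokenization is an illustrative assumption.

def tokenize(text: str) -> set:
    # A real word segmenter would go here; characters stand in for words.
    return {ch for ch in text if not ch.isspace()}

def jaccard_similarity(text_a: str, text_b: str) -> float:
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)
```

Using sets discards duplicate tokens; whether duplicates should count (a multiset Jaccard) is a design choice the description leaves open, and the exact value for the fig. 4 example depends on the segmenter used.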
Step 102, selecting, for each data cluster, at least one data pair from the data cluster, where each data pair comprises two pieces of target countermeasure text data, and taking the data pair as a training data sample of the data generation model.
In actual implementation, after the at least one data cluster corresponding to the plurality of pieces of target countermeasure text data is obtained, at least one data pair is selected from each data cluster. There are two specific ways of selecting the at least one data pair, and the selection process may adopt either or both; the two ways are explained next.
In some embodiments, two pieces of target countermeasure text data are randomly selected from the at least two pieces included in the data cluster, and the two randomly selected pieces are combined into one data pair; this operation is repeated until the target number of data pairs has been selected or the data cluster is empty.
It should be noted that, because target countermeasure text data in the same data cluster have similar data distribution characteristics, data pairs are selected by randomly combining the pieces two by two, and the resulting data pairs are used as training data samples of the data generation model, as in the sketch below.
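A minimal Python sketch of this pairing follows; treating selection as removal from the cluster is one reading of the termination rule and is an assumption here.

```python
# Randomly pair pieces within one cluster until the target number of pairs
# is reached or the cluster pool is exhausted.

import random

def sample_pairs(cluster, target_count):
    pool = list(cluster)            # copy so the original cluster is untouched
    random.shuffle(pool)
    pairs = []
    while len(pairs) < target_count and len(pool) >= 2:
        pairs.append((pool.pop(), pool.pop()))
    return pairs
```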
In some embodiments, the common text data of the text data included in the data cluster is obtained as the cluster seed of the data cluster; at least one piece of target countermeasure text data is selected from the at least two pieces included in the data cluster; and the cluster seed is combined with each selected piece of target countermeasure text data to form a data pair, thereby obtaining at least one data pair.
It should be noted that the common text data of the text data included in the data cluster, that is, the cluster seed, may be obtained using the longest common subsequence or another common-text extraction manner, as sketched below.
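A minimal Python sketch of seed extraction via the longest common subsequence follows; folding the pairwise LCS across the whole cluster is an assumed generalization, not something the description fixes.

```python
# Longest common subsequence (LCS) of two strings, then folded over a
# cluster to obtain its common text (the cluster seed).

from functools import lru_cache, reduce

def lcs(a: str, b: str) -> str:
    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> str:
        if i == len(a) or j == len(b):
            return ""
        if a[i] == b[j]:
            return a[i] + solve(i + 1, j + 1)
        left, right = solve(i + 1, j), solve(i, j + 1)
        return left if len(left) >= len(right) else right
    return solve(0, 0)

def cluster_seed(cluster) -> str:
    # Pairwise LCS folded across the cluster; fine for short texts.
    return reduce(lcs, cluster)
```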
In some embodiments, after the data pair is used as a training data sample of the data generation model, semantic recognition may be further performed on the training data sample to obtain a new training data sample, referring to fig. 5, where fig. 5 is a schematic flow chart of obtaining a new training data sample by performing semantic recognition on the training data sample according to an embodiment of the present application, and based on fig. 3, after step 102, the following steps may also be performed:
step 201, the server performs semantic recognition on the training data sample to obtain a semantic recognition result of the training data sample.
Step 202, when the semantic recognition result represents that the training data sample comprises the unrecognizable character, mapping the unrecognizable character into a recognizable character based on a preset mapping relation to obtain a new training data sample.
In practical implementation, when a semantic recognition result represents that a training data sample includes unrecognizable characters, the unrecognizable characters are detected to obtain a detection result, and based on the detection result and a preset mapping relationship, the unrecognizable characters are mapped to recognizable characters to obtain a new training data sample, as shown in fig. 6, where fig. 6 is a schematic flow diagram of mapping unrecognizable characters to recognizable characters provided in an embodiment of the present application, and based on fig. 5, step 202 may be implemented in the following manner:
step 2021, the unrecognizable characters are detected to determine the location and type of the unrecognizable characters.
In practical implementation, the non-recognizable characters are firstly detected to obtain a detection result, and then the positions and the types of the non-recognizable characters are determined based on the detection result. Here, the location of the unrecognizable character refers to a specific location of the unrecognizable character in the training data sample, and the types of the unrecognizable character at least include uncommon words, special numeric symbols, and emoticons such as emoji emoticons.
Step 2022, obtaining the mapping relation of the unrecognizable characters corresponding to the type based on the type of the unrecognizable characters, and determining the position characteristics of the unrecognizable characters based on the positions of the unrecognizable characters.
In actual implementation, after the type of the unrecognizable character is determined, the mapping relation for unrecognizable characters of that type is acquired from a corresponding pre-constructed unrecognizable-character dictionary. Illustratively, when the type of the unrecognizable character is a special number symbol, the corresponding mapping relation is acquired from a pre-constructed special-character dictionary; when the unrecognizable character is an emoticon, the corresponding mapping relation is acquired from a pre-constructed emoticon dictionary.
In actual implementation, after the position of the non-recognizable character is determined, the position feature of the non-recognizable character is determined based on the position where the non-recognizable character appears.
Step 2023, mapping the unrecognizable characters in the training data sample into recognizable characters based on the mapping relationship and the position features to obtain a new training data sample.
In practical implementation, after the mapping relation and the position feature corresponding to the unrecognizable character are determined, the unrecognizable character in the training data sample is mapped into a recognizable character based on the mapping relation and the position feature. Illustratively, when the unrecognizable character is a stylized variant of a digit, its type is determined to be a special number symbol, and the corresponding mapping relation, e.g. between the variant form and the common character "9", is obtained from the pre-constructed special-character dictionary; when the unrecognizable character is a thumbs-up emoticon, its type is determined to be an emoticon, and the corresponding mapping relation, i.e. between the thumbs-up emoticon and a code such as u'\U0001F44D' / ':thumbs_up:', is obtained from the pre-constructed emoticon dictionary. Then, based on the position feature of the unrecognizable character, identification marks are added before and after its mapped form, so that the unrecognizable characters in the training data sample are mapped into recognizable characters to obtain a new training data sample; a sketch follows.
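The following minimal Python sketch illustrates this mapping; the two dictionary entries and the [UNR]...[/UNR] identification marks are illustrative assumptions, since the description only requires type-specific dictionaries plus marks recording where the character appeared.

```python
# Map unrecognizable characters to recognizable ones via type-specific
# dictionaries, wrapping each mapped character in identification marks.

SPECIAL_NUMBER_MAP = {"\u2468": "9"}          # circled digit nine -> "9" (assumed entry)
EMOTICON_MAP = {"\U0001F44D": ":thumbs_up:"}  # thumbs-up emoji (assumed entry)

def map_unrecognizable(text: str) -> str:
    mapped = []
    for ch in text:
        for table in (SPECIAL_NUMBER_MAP, EMOTICON_MAP):
            if ch in table:
                # Marks before and after the mapped character preserve its
                # position feature for the recognition model.
                mapped.append(f"[UNR]{table[ch]}[/UNR]")
                break
        else:
            mapped.append(ch)
    return "".join(mapped)
```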
In this way, the unrecognizable characters in the training data sample are mapped into recognizable characters based on the mapping relation and the position features, which improves the recognition rate of the corresponding data recognition model; at the same time, recording the positions of the unrecognizable characters avoids losing part of the features of the target countermeasure text data during mapping, so that the corresponding data recognition model can recognize not only through semantic features but also through the positions and frequencies of unrecognizable characters, further improving its recognition rate.
It should be noted that, because a malicious-data detection scene needs to be invoked many thousands of times every day, with data volumes reaching billions of items, and in order to shorten the detection time for unrecognizable characters, the corresponding detection is performed only when the semantic recognition result indicates that the training data sample includes unrecognizable characters; in that case the unrecognizable characters are mapped into recognizable characters to obtain a new training data sample, and the data generation model is trained with one piece of target countermeasure text data in the new training data sample as source data and the other piece as standard data. When the semantic recognition result indicates that the training data sample contains no unrecognizable characters, the data generation model is trained directly with one piece of target countermeasure text data in the training data sample as source data and the other piece as standard data. This shortens the detection time for unrecognizable characters and improves both the training efficiency of the data generation model and the efficiency of data generation.
Step 103, training a data generation model by taking one piece of target countermeasure text data in the training data sample as source data and the other piece as standard data, to obtain a trained data generation model, where the trained data generation model is used for generating countermeasure text data based on target countermeasure text data to be processed.
Referring to fig. 7, fig. 7 is a schematic flowchart of a training process of a data generation model provided in this embodiment of the present application, and it should be noted that the data generation model includes an encoding layer and a decoding layer, referring to fig. 8, fig. 8 is a schematic structural diagram of the data generation model provided in this embodiment of the present application, and based on fig. 7 and fig. 8, with one piece of target countermeasure text data in a training data sample as source data and the other piece of target countermeasure text data as standard data, the process of training the data generation model may be performed as follows:
and step 1031, respectively encoding the source data and the standard data in the training data samples through the encoding layer to obtain a first encoding characteristic corresponding to the source data and a second encoding characteristic corresponding to the standard data.
In actual implementation, the source data and the standard data in a training data sample are first concatenated as [SOS] source data [EOS] standard data [SOS]. Then, referring to fig. 9 (a schematic structural diagram of the data generation model provided in an embodiment of the present application), the source data and the standard data are encoded separately through the encoding layer to obtain the first encoding feature corresponding to the source data and the second encoding feature corresponding to the standard data. Specifically, the encoding layer encodes the source data and the standard data to obtain the character feature, position feature, and clause feature corresponding to each, where the clause feature is used to distinguish the source data from the standard data; the character feature, position feature, and clause feature corresponding to the source data are fused to obtain the first encoding feature, and those corresponding to the standard data are fused to obtain the second encoding feature, as in the sketch below.
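A minimal PyTorch sketch of this feature fusion follows, in the style of BERT input embeddings; the vocabulary size and dimensions are illustrative assumptions.

```python
# Fuse character, position, and clause (segment) features by summation to
# produce one coding feature per token, BERT-style.

import torch
import torch.nn as nn

class FusedEncoding(nn.Module):
    def __init__(self, vocab_size: int = 21128, max_len: int = 512, dim: int = 768):
        super().__init__()
        self.char = nn.Embedding(vocab_size, dim)  # character feature
        self.pos = nn.Embedding(max_len, dim)      # position feature
        self.seg = nn.Embedding(2, dim)            # clause feature: 0 = source, 1 = standard

    def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, segment_ids: (batch, seq)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Fusion by summation yields the first/second encoding features.
        return self.char(token_ids) + self.pos(positions) + self.seg(segment_ids)
```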
Step 1032: decode the first encoding feature through the decoding layer, based on the first encoding feature and the second encoding feature, to obtain confrontation text data corresponding to the source data.

Step 1033: obtain the difference between the confrontation text data and the standard data, and update the model parameters of the data generation model based on the difference.
In actual implementation, after the countermeasure text data corresponding to the source data is obtained, it is compared with the standard data to obtain the difference between the two, and the model parameters of the data generation model are updated based on this difference.
In some embodiments, after obtaining the first encoding feature and the second encoding feature, referring to fig. 10, fig. 10 is a flowchart illustrating a training process of a data generation model provided in an embodiment of the present application, and based on fig. 7, after step 1031, the following steps may be further performed:
Step 301: the server encrypts a target number of features among the second encoding features to obtain encrypted encoding features corresponding to the standard data.

In practical implementation, the server randomly encrypts a target number of features of the standard data; for example, the encryption here may be performed by randomly masking the target number of features. The embodiment of the present application does not limit the manner in which the target number of features are encrypted.
Step 302, decoding the first coding feature through the decoding layer based on the first coding feature and the encryption coding feature to obtain the confrontation text data corresponding to the source data.
Referring to fig. 11, fig. 11 is a schematic structural diagram of the data generation model provided in the embodiment of the present application. Based on fig. 11, the decoding layer decodes the first encoding feature based on the first encoding feature and, for each encrypted position, on the encoding features to the left of that position in the encrypted encoding features, to obtain the confrontation text data corresponding to the source data.
Step 303: obtain the target sub-data in the confrontation text data corresponding to the encrypted features in the second encoding features, and obtain the standard sub-data in the standard data corresponding to those encrypted features.

Step 304: compare the target sub-data with the standard sub-data to obtain the difference between them, and update the model parameters of the data generation model based on this difference; a sketch of this comparison is given below.
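The following sketch assumes the decoder returns per-position vocabulary logits; the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def masked_positions_loss(logits, standard_ids, encrypted_mask):
    """Compare predictions with the standard data only at the encrypted
    (masked) positions, per steps 303-304. logits: (seq_len, vocab_size);
    standard_ids: (seq_len,); encrypted_mask: bool (seq_len,)."""
    standard_sub = standard_ids[encrypted_mask]  # standard sub-data
    target_sub = logits[encrypted_mask]          # predictions at those positions
    # The cross-entropy is the "difference" used to update the model parameters
    # (loss.backward() followed by an optimizer step).
    return F.cross_entropy(target_sub, standard_sub)
```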
It should be noted that, during training of the data generation model, the ending identifier of the standard data segment may also be encrypted, so that the model is expected to learn to end data generation automatically.
In some embodiments, after the data pairs are taken as training data samples of the data generation model, referring to fig. 12, fig. 12 is a flowchart of the training process of the data generation model provided in the embodiment of the present application, and based on fig. 3, the following may also be performed after step 102:
step 401, the server performs data disturbance on the training data sample to obtain a disturbed training sample corresponding to the training data sample.
In practical implementation, there are specifically two ways of performing data disturbance on the training data sample to obtain the disturbed training sample corresponding to the training data sample; the two ways are described next.
In some embodiments, disturbance parameters are obtained first and then combined with the features in the training data sample to obtain the disturbed training sample corresponding to the training data sample. It should be noted that the combination here may be addition, or the disturbance parameter may be used as a feature coefficient, and the like; the embodiment of the present application is not limited in this regard.

In some embodiments, the total features of the training data sample are obtained first, and a target number of target features are then randomly screened from the total features; based on the target features, a target training sample corresponding to the target features is determined as the disturbed training sample corresponding to the training data sample. A minimal sketch of both ways is given below.
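In the sketch below, the Gaussian noise scale and the additive combination are assumptions; the application leaves both the combination manner and the screening strategy open:

```python
import torch

def disturb_combine(features: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    """Way 1: obtain a disturbance parameter and combine it with the sample
    features; addition is used here, but the parameter could equally be
    applied as a feature coefficient (multiplicatively)."""
    return features + sigma * torch.randn_like(features)

def disturb_screen(features: torch.Tensor, target_num: int) -> torch.Tensor:
    """Way 2: randomly screen a target number of features from the total
    features and keep only those as the disturbed training sample."""
    idx = torch.randperm(features.size(0))[:target_num]
    return features[idx.sort().values]  # keep the original feature order
```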
Step 402: train the data generation model with one piece of target confrontation text data in the disturbed training sample as source data and the other piece as standard data.
In this way, the data generation model is trained with disturbed training samples, which prevents the model from depending excessively on certain local features, so that the model neither overfits nor deviates too far from the original data distribution and generalizes more strongly, thereby yielding richer target confrontation text data.
In some embodiments, after training of the data generation model is completed, countermeasure text data is further generated based on the trained data generation model. Referring to fig. 13, fig. 13 is a flowchart of the generation process of countermeasure text data provided in this embodiment, and based on fig. 3, after step 103, the following may also be performed:
step 501, the server obtains a trained data generation model and original target confrontation text data.
In practical implementation, the original target confrontation text data acquired by the server may be original target confrontation text data in a bullet screen in the video playing process, or original target confrontation text data in comments of media information such as videos or articles.
Step 502, generating countermeasure text data corresponding to the original target countermeasure text data based on the original target countermeasure text data through the data generation model.
In practical implementation, the process of generating the countermeasure text data corresponding to the original target countermeasure text data based on the original target countermeasure text data through the data generation model specifically includes performing data disturbance on the original target countermeasure text data through the data generation model to obtain at least two disturbance text data; and generating at least two pieces of confrontation text data corresponding to the original target confrontation text data based on the at least two pieces of disturbance text data.
As an example, referring to fig. 14, fig. 14 is a schematic flowchart of the generation process of countermeasure text data provided in an embodiment of the present application. Based on fig. 14, generating countermeasure text data corresponding to the original target countermeasure text data through the data generation model may be implemented through steps 601 to 606: the original target countermeasure text data is first preprocessed to obtain intermediate target countermeasure text data; the intermediate target countermeasure text data is then perturbed to obtain at least two pieces of perturbation data; and at least two pieces of countermeasure text data corresponding to the original target countermeasure text data are generated based on the at least two pieces of perturbation data.
In practical implementation, after countermeasure text data is obtained based on the data generation model, it is added to a database storing countermeasure text data. When text data to be recognized is later examined, the similarity between the text data to be recognized and the countermeasure text data in the database is determined, and when the similarity reaches a preset similarity threshold, the text data to be recognized is determined to be countermeasure text data. A sketch of such a threshold check is given below.
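The following sketch reuses the Jaccard similarity used elsewhere for clustering; word segmentation is approximated by character sets, and the threshold value is an assumption:

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_countermeasure(text: str, database: list, threshold: float = 0.7) -> bool:
    """Flag the text to be recognized as countermeasure text data when its
    similarity to any stored sample reaches the preset threshold."""
    tokens = set(text)
    return any(jaccard(tokens, set(sample)) >= threshold for sample in database)
```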
It should be noted that, in addition to generating rich confrontation text data with the trained data generation model, the training method of the data generation model of the present application can also generate even richer black-market advertisement data resources through the adversarial interplay between a generator and a discriminator.
By applying the embodiment of the present application, data pairs of target countermeasure text data are constructed as training data samples by clustering the target countermeasure text data; the data generation model is trained with one piece of target countermeasure text data in each training data sample as source data and the other piece as standard data; and the trained data generation model generates countermeasure text data based on target countermeasure text data. The generated countermeasure text data thus improves the recognition effect of the corresponding data recognition model and effectively increases the recognition rate of countermeasure text data.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Regions such as comments and bullet screens in social platforms (video playing software, news clients, reading websites and the like) are the channels that reach users, expose and propagate malicious information most quickly and at the lowest cost, so large numbers of automata and malicious accounts constantly send massive malicious texts to the comment and bullet-screen regions of these platforms to disturb users. Given the sheer volume of malicious data, comprehensive manual auditing is essentially an impossible task. If auditing instead relies on an anti-malicious-data recognition model, malicious data damages the semantic information of the text by inserting various special characters and emoticons, or by text-variation means such as homophone and similar-shape character replacement, thereby impairing recognition by the model. That is, malicious data is difficult to augment because it has many variants and strong adversariality, which affects the recognition effect of malicious data, and recognizing malicious data has become an industry-wide problem.
Malicious data enhancement in the related art mainly adopts three methods. Method 1: data enhancement based on vocabulary replacement, mainly dictionary-based and word-vector-based replacement; this method generally selects part of the vocabulary from a text at random and replaces each selected word with a synonym found in a dictionary or among pre-trained word vectors. Method 2: data enhancement by back-translation; an intermediate language is chosen first, and the text is translated into the intermediate language and then translated back to the original language. For example, if the intermediate language is English, the text is translated into English and the English text is translated back into Chinese; the back-translated text is generally similar in semantics to the original text but differs from it, achieving data enhancement, and multiple intermediate languages can also be chained to add more noise. Method 3: text error correction, mainly based on the seq2seq model and the BERT model; data enhancement is achieved by correcting the interfered parts of the text and restoring them to a normal text form.
However, for method 1, the words to be replaced are generally selected at random, and an unsuitable selection may change the meaning of the original sentence. For method 2, back-translation relies heavily on the quality of the translation algorithm, errors propagate easily, and the meaning of the original sentence is also affected. For method 3, because malicious data has many variants, strong interference and strong semantic adversariality, the corresponding error-correction model usually depends on semantic context, and each Chinese character carries independent semantics; the semantic information is therefore easily destroyed by the adversarial modifications, so the data features of malicious samples are easily damaged after passing through the error-correction model, which affects the data recognition effect.
Based on the above, similar malicious data was studied in depth, the evolution directions of existing malicious data were systematically summarized, and a semi-supervised simulation-generation method for malicious adversarial data is proposed. In service, the malicious advertisement data thus generated can improve the generalization of the malicious-advertisement recognition model, raise the recognition accuracy of malicious advertisements, strike precisely at malicious behaviors on the corresponding platforms, maintain the community environment of each platform, and safeguard the user experience.
In the embodiment of the present application, data enhancement fully considers characteristics of malicious data such as interference with special symbols and strong semantic adversariality. Specifically, the text is first preprocessed, mapping strongly adversarial special characters and emoticons to characters the model can recognize. Second, the text is segmented into words, short texts are clustered based on the Jaccard similarity of the word-segmentation results, and two types of training data are constructed for the model from the clustering results: existing malicious data is clustered, data pairs <malicious advertisement data 1, malicious advertisement data 2> are constructed from the malicious data within each cluster, and a cluster seed of each category is obtained by way of the longest common subsequence, from which <cluster seed, malicious advertisement data> data pairs are constructed. These pairs serve as training data samples for a deep natural-language generation model (the data generation model), the cluster seed being the most essential data feature of its category; a sketch of the longest-common-subsequence computation is given below. Finally, a sample perturbation strategy is introduced in the data generation stage of the model, so that large amounts of malicious data with similar adversarial characteristics are generated automatically, finally achieving the purpose of simulating the upgrade-and-counter cycle of malicious advertising.
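The following sketch shows the standard longest-common-subsequence dynamic program; folding it over a cluster's texts is one plausible way to obtain the per-category seed:

```python
from functools import reduce

def lcs(a: str, b: str) -> str:
    """Longest common subsequence by dynamic programming; dp[i][j] holds
    the LCS of a[:i] and b[:j] as a string."""
    dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + ca
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[len(a)][len(b)]

def cluster_seed(texts: list) -> str:
    """Fold LCS over all texts of one (non-empty) cluster to obtain the
    cluster seed, i.e. the most essential shared character sequence."""
    return reduce(lcs, texts)
```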
For the above text preprocessing, the strongly adversarial special characters (unrecognizable characters), such as special numeric characters and emoticons, are mapped to characters recognizable by the model. It should be noted that the text content of malicious advertisements falls mainly into two types: alphanumeric advertisements, i.e. advertisements that explicitly give alphanumeric contact information in the text, and semantic advertisements, i.e. advertisements that attract users mainly through the semantic content of the text and thereby perform malicious promotion. The alphanumeric type poses a great challenge for data recognition: to evade and resist recognition and striking by the recognition model, malicious data often replaces normal text with various special characters, including but not limited to rare characters, special numeric symbols and emoticons. Therefore, in the embodiment of the present application, the text is first preprocessed to map these special characters into recognizable characters with similar meanings. The specific treatment is as follows. For special numeric characters, a dictionary of special characters is constructed so that a single-character special form of a digit (for example, a stylized or circled form of the digit 9) is mapped to the model-recognizable character 9. For emoticons, corresponding emoticon dictionaries are constructed to map each emoticon to a corresponding code. Finally, the positions where special numeric characters and emoticons appear are recorded: to preserve the characteristics of malicious advertisement data and to distinguish the original malicious data from the mapped normal text, the characters '@#' are added before and after each mapped special character, recording for the model the position where the special character appeared in the text. This avoids losing part of the features of the original malicious data in the mapping process, and lets the model recognize malicious advertisement data not only through semantic features but also through the positions and frequency of unrecognizable characters. A sketch of this mapping follows.
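In the following sketch, the dictionary entries and emoticon codes are assumptions; the '@#' markers before and after each mapped character follow the description above:

```python
DIGIT_MAP = {"①": "1", "⑨": "9"}   # assumed special-character dictionary entries
EMOJI_MAP = {"😀": "[e_grin]"}      # assumed emoticon-to-code dictionary entries

def map_special_chars(text: str) -> str:
    """Map strongly adversarial special characters to recognizable ones,
    wrapping each mapped character with '@#' so the model keeps the
    position (and hence the frequency) of the original special character."""
    out = []
    for ch in text:
        if ch in DIGIT_MAP:
            out.append("@#" + DIGIT_MAP[ch] + "@#")
        elif ch in EMOJI_MAP:
            out.append("@#" + EMOJI_MAP[ch] + "@#")
        else:
            out.append(ch)
    return "".join(out)
```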
It should be noted that, because the malicious-advertisement detection scene handles millions of calls every day and the data volume reaches billions of records, the detection of emoticons and special numeric characters is performed only when characters that the model cannot recognize appear in a text, in order to shorten the detection time for special characters.
In this way, against the strong adversariality of existing malicious advertisement data, in which various confusing characters including special characters are inserted into the text to interfere with the recognition model, the embodiment of the present application preprocesses the text data and, with high probability, eliminates the influence of these problems on the recognition model.
In actual implementation, after the text is preprocessed, the resulting data pairs <malicious advertisement data 1, malicious advertisement data 2> and <cluster seed, malicious advertisement data> are constructed into inputs of the form [SOS] source data [EOS] standard data [SOS]. The corresponding character features, position features and clause features are then obtained; the character feature, position feature and clause feature corresponding to the source data are fused to obtain the encoding feature of the source data (the first encoding feature), and the character feature, position feature and clause feature corresponding to the standard data are fused to obtain the encoding feature of the standard data (the second encoding feature), so that the corresponding data is generated based on the two encoding features. Specifically, following the seq2seq structure, 15% of the features in the standard data are first masked at random, and the masked words are learned and recognized using all the features of the source data plus the features to the left of the masked word in the standard data. Masked positions are replaced by the mask token with 80% probability, by a randomly selected word with 10% probability, and by the true word with 10% probability. It should be noted that the end mark [SOS] at the end of the standard data may also be masked, so that the model is expected to learn to end data generation automatically. A sketch of this masking follows.
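In the sketch below, the 15% ratio and the 80%/10%/10% split follow the description above, while the toy vocabulary is an assumption:

```python
import random

MASK = "[MASK]"
VOCAB = ["的", "一", "9", "加"]  # illustrative replacement vocabulary

def mask_standard(tokens: list, ratio: float = 0.15):
    """Select ~15% of the standard-data positions; replace each selected
    token by [MASK] with 80% probability, by a random word with 10%, and
    keep the true word with 10%. Returns the masked tokens plus a map
    position -> true token, the targets the model must recover."""
    tokens = list(tokens)
    targets = {}
    for i in range(len(tokens)):
        if random.random() < ratio:
            targets[i] = tokens[i]
            r = random.random()
            if r < 0.8:
                tokens[i] = MASK
            elif r < 0.9:
                tokens[i] = random.choice(VOCAB)
            # else: leave the true word in place
    return tokens, targets
```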
It should be noted that, after the corresponding data is generated based on the two encoding features, the outputs at the masked positions are gathered, the prediction data for the masked positions is obtained through the BertOnlyMLMHead module, and the model is finally optimized to maximize the likelihood of the masked standard data given the context.
In actual implementation, because each platform currently adopts real-time manual labeling online, the amount of malicious data missed by the model and fed back online is normally small; moreover, the original model can generate only one fixed output for one piece of input data, so the amount generated is very limited and the model cannot be effectively improved through iteration. For this situation, the embodiment of the present application introduces sample perturbation and requires that, after perturbation is added to the input, the new output distribution remain consistent with the original output distribution, so that the same source data can generate multiple pieces of target data with different contents but similar characteristics. Many perturbation strategies are possible; for example, in the random dropout method, a certain proportion of the embedding is randomly set to 0 each time. A sketch of this strategy is given below.
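In the following sketch the proportion p is an assumption; because the zeroed positions differ on every call, the same source data can be decoded several times into different but similarly distributed outputs:

```python
import torch

def random_dropout_disturb(embedding: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Randomly set a proportion p of the embedding entries to 0,
    the sample-perturbation strategy mentioned above."""
    keep = (torch.rand_like(embedding) >= p).float()
    return embedding * keep
```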
In practical implementation, besides the labeled data fed back online, text data whose recognition probability under the recognition model is lower than 0.5 is also deliberately selected as source data. This addresses the problem that such data, even though the model can still recognize it at present, could easily be upgraded through simple transformations into malicious data that resists the advertisement recognition model, precisely because its recognition probability is low.
Malicious data is updated quickly: a common malicious-recognition strategy usually adopts real-time sampling and real-time manual labeling to perceive sudden variant attacks of malicious data, and then iterates the model with the feedback data. However, manpower is limited, so the feedback data is scarce and the process time-consuming. The embodiment of the present application automatically learns the features of upgraded malicious data and generates large amounts of data with similar distribution and features, which can be called upgraded adversarial malicious data, so that the model iterates quickly, development time is shortened, and the total leakage of black-market content between perceiving a black-market upgrade and striking it is reduced.
Meanwhile, regarding the diversity of malicious advertisement attacks: since each service has different data-distribution characteristics and advertisement-judgment standards, malicious industry groups deliberately adopt different interference methods to generate adversarial texts for different scenes. The embodiment of the present application automatically generates enhanced data carrying the characteristics of each service line from that line's feedback data, and then continuously iterates the malicious-advertisement recognition model for the specific service line, jointly improving the recall rate and the recognition precision of the model.
In this way, the training samples of malicious data are supplemented through data enhancement, and the effect of the corresponding malicious-data recognition model is improved. By striking malicious traffic-diverting advertisements, the exposure of high-quality comments is increased, user attention is attracted, and the overall interaction rate of the comment and bullet-screen regions of the corresponding product is improved.
Continuing with the exemplary structure of the training apparatus 455 for the data generation model provided by the embodiment of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the training apparatus 455 for the data generation model in the memory 440 may include:
an obtaining module 4551, configured to obtain multiple pieces of target countermeasure text data, and perform clustering processing on the multiple pieces of target countermeasure text data to obtain at least one data cluster;
a selecting module 4552, configured to select, for each data cluster, at least one data pair from the data clusters, where each data pair includes two pieces of target confrontation text data, and use the data pair as a training data sample of the data generation model;
a training module 4553, configured to train the data generation model with one piece of target countermeasure text data in the training data sample as source data and the other piece of target countermeasure text data as standard data to obtain a trained data generation model, where the trained data generation model is used to generate countermeasure text data based on target countermeasure text data to be processed.
In some embodiments, the obtaining module 4551 is further configured to perform word segmentation on each text data in the multiple pieces of target countermeasure text data, so as to obtain a word segmentation result corresponding to each text data; determining vocabulary intersection and vocabulary union between any two target confrontation text data based on the word segmentation result; acquiring the ratio of the vocabulary intersection and the vocabulary union of any two target countermeasure text data, and determining the ratio as the similarity between the corresponding two target countermeasure text data; and clustering the plurality of target confrontation text data based on the determined similarity to obtain at least one data cluster.
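A minimal sketch of clustering by this Jaccard similarity over word-segmentation results may look as follows; the greedy single-pass strategy and the threshold are assumptions, since only the similarity measure itself is specified above:

```python
def jaccard_clusters(segmented: list, threshold: float = 0.5) -> list:
    """segmented: list of word lists, one per target countermeasure text.
    A text joins the first cluster whose representative word set reaches
    the Jaccard threshold; otherwise it starts a new cluster."""
    clusters = []  # (representative word set, member word lists)
    for words in segmented:
        s = set(words)
        for rep, members in clusters:
            union = s | rep
            if union and len(s & rep) / len(union) >= threshold:
                members.append(words)
                break
        else:
            clusters.append((s, [words]))
    return [members for _, members in clusters]
```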
In some embodiments, the selecting module 4552 is further configured to randomly select two pieces of target countermeasure text data from the at least two pieces of target countermeasure text data included in the data cluster, and combine the two randomly selected pieces into one data pair; and to repeat the operation until a target number of data pairs have been selected or the data cluster is empty.
In some embodiments, the selecting module 4552 is further configured to obtain common text data of the text data included in the data cluster, as a cluster seed of the data cluster; selecting at least one piece of target confrontation text data from at least two pieces of target confrontation text data included in the data cluster; and respectively combining the cluster seeds and one piece of target confrontation text data in the selected target confrontation text data to form the data pair so as to obtain at least one data pair.
In some embodiments, the apparatus further includes an identification module, where the identification module is configured to perform semantic identification on the training data sample to obtain a semantic identification result of the training data sample; when the semantic recognition result represents that the training data sample comprises unrecognizable characters, mapping the unrecognizable characters into recognizable characters based on a preset mapping relation to obtain a new training data sample; the training module 4553 is further configured to train the data generation model with one piece of target confrontation text data in the new training data sample as source data and the other piece of target confrontation text data as standard data.
In some embodiments, the recognition module is further configured to detect the unrecognizable character to determine a location and a type of the unrecognizable character; acquiring a mapping relation of the unidentifiable characters corresponding to the type based on the type of the unidentifiable characters, and determining the position characteristics of the unidentifiable characters based on the positions of the unidentifiable characters; and mapping the non-identifiable characters in the training data samples into identifiable characters based on the mapping relation and the position characteristics so as to obtain new training data samples.
In some embodiments, the data generation model includes an encoding layer and a decoding layer, and the training module 4553 is further configured to encode, through the encoding layer, the source data and the standard data in the training data sample respectively to obtain a first encoding characteristic corresponding to the source data and a second encoding characteristic corresponding to the standard data; decoding the first coding feature based on the first coding feature and the second coding feature through the decoding layer to obtain confrontation text data corresponding to the source data; and acquiring the difference between the confrontation text data and the standard data, and updating the model parameters of the data generation model based on the difference.
In some embodiments, the training module 4553 is further configured to encode, through the encoding layer, the source data and the standard data in the training data sample, respectively, to obtain a character feature, a position feature and a sentence splitting feature corresponding to the source data, and a character feature, a position feature and a sentence splitting feature corresponding to the standard data; wherein the clause feature is used to distinguish the source data from the standard data; fusing character features, position features and clause features corresponding to the source data to obtain first coding features corresponding to the source data; and fusing character features, position features and clause features corresponding to the standard data to obtain second coding features corresponding to the standard data.
In some embodiments, the apparatus further includes an encryption module, where the encryption module is configured to encrypt a target number of features among the second encoding features to obtain encrypted encoding features corresponding to the standard data; the training module 4553 is further configured to decode, through the decoding layer, the first encoding feature based on the first encoding feature and the encrypted encoding features to obtain confrontation text data corresponding to the source data; acquire target sub-data in the confrontation text data corresponding to the encrypted features in the second encoding features, and acquire standard sub-data in the standard data corresponding to those encrypted features; and compare the target sub-data with the standard sub-data to obtain the difference between them, and update the model parameters of the data generation model based on the difference.
In some embodiments, the apparatus further includes a perturbation module, where the perturbation module is configured to perform data perturbation on the training data sample to obtain a perturbed training sample corresponding to the training data sample; the training module 4553 is further configured to train the data generation model by using one piece of target confrontation text data in the perturbation training sample as source data and another piece of target confrontation text data as standard data.
In some embodiments, the apparatus further comprises an application module for obtaining the trained data generation model and original target confrontation text data; and generating countermeasure text data corresponding to the original target countermeasure text data based on the original target countermeasure text data through the data generation model.
In some embodiments, the application module is further configured to perform data disturbance on the original target countermeasure text data through the data generation model to obtain at least two disturbance text data; and generating at least two pieces of countermeasure text data corresponding to the original target countermeasure text data based on the at least two pieces of disturbance text data.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the data generation model described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform a training method of a data generation model provided by embodiments of the present application, for example, a training method of a data generation model as shown in fig. 3.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiments of the present application have the following beneficial effects:
(1) the confrontation text data with similar characteristics are generated by automatically learning the characteristics of the target confrontation text data, so that the recognition effect of the corresponding data recognition model can be improved, and the recognition rate of the confrontation text data is effectively improved.
(2) Based on the mapping relation and the position features, the unrecognizable characters in the training data sample are mapped to recognizable characters, which improves the recognition rate of the corresponding data recognition model; meanwhile, recording the positions of the unrecognizable characters avoids losing part of the features of the target countermeasure text data in the mapping process, so that the corresponding data recognition model can recognize data not only through semantic features but also through the positions and frequency of unrecognizable characters, further improving its recognition rate.
(3) When the semantic recognition result represents that the training data sample comprises the unidentifiable characters, the unidentifiable characters are correspondingly detected, so that the unidentifiable characters are mapped into the identifiable characters, and a new training data sample is obtained to train the data generation model; and when the semantic recognition result represents that the training data sample does not contain unrecognizable characters, directly training the data generation model by using the training data sample. Therefore, the detection time of the unrecognizable character is shortened, and the training efficiency and the data generation efficiency of the data generation model are improved.
(4) Training the data generation model with disturbed training samples prevents the model from depending excessively on certain local features, so that the model neither overfits nor deviates too far from the original data distribution and generalizes more strongly, thereby yielding more confrontation text data.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (13)

1. A method of training a data generation model, the method comprising:
acquiring a plurality of pieces of target countermeasure text data, and clustering the plurality of pieces of target countermeasure text data to obtain at least one data cluster;
for each data cluster, selecting at least one data pair from the data cluster, wherein each data pair comprises two pieces of target confrontation text data, and
taking the data pairs as training data samples of the data generation model;
and training the data generation model by taking one target confrontation text data in the training data sample as source data and the other target confrontation text data as standard data to obtain a trained data generation model, wherein the trained data generation model is used for generating confrontation text data based on the target confrontation text data to be processed.
2. The method of claim 1, wherein clustering the plurality of target countermeasure text data to obtain at least one data cluster comprises:
performing word segmentation processing on each target countermeasure text data in the plurality of target countermeasure text data respectively to obtain word segmentation results corresponding to each target countermeasure text data;
determining vocabulary intersection and vocabulary union between any two target confrontation text data based on the word segmentation result;
acquiring the ratio of the vocabulary intersection and the vocabulary union of any two target countermeasure text data, and determining the ratio as the similarity between the corresponding two target countermeasure text data;
and clustering the plurality of target confrontation text data based on the determined similarity to obtain at least one data cluster.
3. The method of claim 1, wherein said selecting at least one data pair from said data cluster comprises:
randomly selecting two target countermeasure text data from at least two target countermeasure text data included in the data cluster, and combining the two randomly selected target countermeasure text data into one data pair;
and repeating the operation until a target number of data pairs have been selected or the data cluster is empty.
4. The method of claim 1, wherein said selecting at least one data pair from said data cluster comprises:
acquiring public text data of the text data included in the data cluster, and using the public text data as a cluster seed of the data cluster;
selecting at least one piece of target confrontation text data from at least two pieces of target confrontation text data included in the data cluster;
and respectively combining the cluster seeds and one piece of target countermeasure text data in the selected target countermeasure text data to form the data pair so as to obtain at least one data pair.
5. The method of claim 1, wherein the taking the data pairs as training data samples for the data generation model further comprises:
performing semantic recognition on the training data sample to obtain a semantic recognition result of the training data sample;
when the semantic recognition result represents that the training data sample comprises unrecognizable characters, mapping the unrecognizable characters into recognizable characters based on a preset mapping relation to obtain a new training data sample;
the training of the data generation model by using one target confrontation text data in the training data sample as source data and another target confrontation text data as standard data comprises the following steps:
and training the data generation model by taking one target confrontation text data in the new training data sample as source data and the other target confrontation text data as standard data.
6. The method of claim 5, wherein the mapping the non-recognizable characters into recognizable characters based on a predetermined mapping relationship to obtain a new training data sample comprises:
detecting the unrecognizable character to determine the position and the type of the unrecognizable character;
based on the type of the unrecognizable character, acquiring a mapping relation of the unrecognizable character corresponding to the type, and
determining a position feature of the unrecognizable character based on the position of the unrecognizable character;
and mapping the non-identifiable characters in the training data samples into identifiable characters based on the mapping relation and the position characteristics so as to obtain new training data samples.
7. The method of claim 1, wherein the data generation model comprises an encoding layer and a decoding layer, and the training the data generation model with one target countermeasure text data as source data and another target countermeasure text data as standard data in the training data samples comprises:
respectively encoding the source data and the standard data in the training data sample through the encoding layer to obtain a first encoding characteristic corresponding to the source data and a second encoding characteristic corresponding to the standard data;
decoding the first coding feature based on the first coding feature and the second coding feature through the decoding layer to obtain confrontation text data corresponding to the source data;
and acquiring the difference between the confrontation text data and the standard data, and updating the model parameters of the data generation model based on the difference.
8. The method of claim 7, wherein the encoding, by the encoding layer, the source data and the standard data in the training data samples respectively to obtain a first encoding characteristic corresponding to the source data and a second encoding characteristic corresponding to the standard data comprises:
respectively encoding the source data and the standard data in the training data sample through the encoding layer to obtain character features, position features and clause features corresponding to the source data and character features, position features and clause features corresponding to the standard data;
wherein the clause characteristics are used for distinguishing the source data from the standard data;
fusing character features, position features and clause features corresponding to the source data to obtain first coding features corresponding to the source data;
and fusing character features, position features and clause features corresponding to the standard data to obtain second coding features corresponding to the standard data.
9. The method of claim 7, wherein the method further comprises:
encrypting a target number of features in the second coding features to obtain encrypted coding features corresponding to the standard data;
the decoding, by the decoding layer, the first coding feature based on the first coding feature and the second coding feature to obtain countermeasure text data corresponding to the source data includes:
decoding the first coding feature based on the first coding feature and the encryption coding feature through the decoding layer to obtain confrontation text data corresponding to the source data;
acquiring the difference between the confrontation text data and the standard data, and updating the model parameters of the data generation model based on the difference, wherein the method comprises the following steps:
acquiring target subdata in the countermeasure text data corresponding to the encrypted features in the second coding features, and acquiring standard subdata in the standard data corresponding to the encrypted features in the second coding features;
and comparing the target subdata with the standard subdata to obtain the difference between the target subdata and the standard subdata, and updating the model parameters of the data generation model based on the difference.
10. The method of claim 1, wherein the taking the data pairs as training data samples for the data generation model further comprises:
performing data disturbance on the training data sample to obtain a disturbed training sample corresponding to the training data sample;
the training of the data generation model by using one piece of target confrontation text data in the training data sample as source data and the other piece of target confrontation text data as standard data comprises the following steps:
and training the data generation model by taking one target confrontation text data in the disturbance training sample as source data and the other target confrontation text data as standard data.
11. The method of claim 1, wherein the method further comprises:
acquiring the trained data generation model and original target confrontation text data;
and generating countermeasure text data corresponding to the original target countermeasure text data based on the original target countermeasure text data through the data generation model.
12. The method of claim 11, wherein generating, by the data generation model, countermeasure text data corresponding to the original target countermeasure text data based on the original target countermeasure text data comprises:
performing data disturbance on the original target countermeasure text data through the data generation model to obtain at least two disturbance text data;
and generating at least two pieces of countermeasure text data corresponding to the original target countermeasure text data based on the at least two pieces of disturbance text data.
13. An apparatus for training a data generative model, the apparatus comprising:
the acquisition module is used for acquiring a plurality of pieces of target countermeasure text data and clustering the plurality of pieces of target countermeasure text data to obtain at least one data cluster;
the selecting module is used for selecting at least one data pair from the data clusters aiming at each data cluster, wherein each data pair comprises two pieces of target confrontation text data, and the data pair is used as a training data sample of the data generation model;
and the training module is used for training the data generation model by taking one piece of target confrontation text data in the training data sample as source data and the other piece of target confrontation text data as standard data to obtain a trained data generation model, and the trained data generation model is used for generating confrontation text data based on the target confrontation text data to be processed.
CN202210247415.3A 2022-03-14 2022-03-14 Training method and device for data generation model Active CN114595775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210247415.3A CN114595775B (en) 2022-03-14 2022-03-14 Training method and device for data generation model


Publications (2)

Publication Number Publication Date
CN114595775A true CN114595775A (en) 2022-06-07
CN114595775B CN114595775B (en) 2024-10-22

Family

ID=81816988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210247415.3A Active CN114595775B (en) 2022-03-14 2022-03-14 Training method and device for data generation model

Country Status (1)

Country Link
CN (1) CN114595775B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118733767A * 2024-06-11 2024-10-01 Harbin Institute of Technology A Chinese legal adversarial text generation method based on explainable perturbation strategy

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110191085A * 2019-04-09 2019-08-30 Computer Network Information Center, Chinese Academy of Sciences Intrusion detection method, device and storage medium based on multi-classification
CN111783455A * 2020-07-13 2020-10-16 NetEase (Hangzhou) Network Co., Ltd. Training method and device of text generation model and text generation method and device
CN112241452A * 2020-10-16 2021-01-19 Baidu (China) Co., Ltd. Model training method and device, electronic equipment and storage medium
CN113204974A * 2021-05-14 2021-08-03 Tsinghua University Method, device and equipment for generating confrontation text and storage medium
CN113298184A * 2021-06-21 2021-08-24 Harbin Engineering University Sample extraction and expansion method and storage medium for small sample image recognition
WO2021218226A1 * 2020-04-26 2021-11-04 Huawei Technologies Co., Ltd. Method for verifying labeled data, method and device for model training



Also Published As

Publication number Publication date
CN114595775B (en) 2024-10-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant