CN111461154A

CN111461154A - Method and device for labeling data

Info

Publication number: CN111461154A
Application number: CN201910057667.8A
Authority: CN
Inventors: 李玥; 何小锋; 刘海锋
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2020-07-28

Abstract

The invention discloses a method and a device for labeling data, and relates to the technical field of computers. A specific implementation of the method includes: generating a verification code based on marked data and data to be marked, and displaying it to the user; acquiring the verification mark added by the user to the verification code, and obtaining an effective label of the data to be marked according to that verification mark; and cross-validating the effective labels of the data to be marked to determine the label of the data to be marked. This implementation enables the acquisition of a large amount of training data of rich types at very low cost.

Description

Method and device for labeling data

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for labeling data.

Background

At present, supervised deep learning is a machine learning method generally adopted in the field of artificial intelligence, and a model adopting deep learning needs a large amount of accurate and high-quality training data for training. For example, if a model capable of identifying an apple from any picture is to be trained, a large number of pictures are required to be provided for training, a label "apple" is added to the picture with the apple, then the model automatically summarizes the characteristics of the apple by learning the large number of pictures, and finally any picture is given to the machine, so that whether the picture contains the apple or not can be identified.

The effectiveness of deep learning depends largely on three aspects: rich and massive training data, a suitable algorithm, and parameter tuning. Training data refers to labeled data used for training the model, and acquiring it relies mainly on manual data labeling.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

Data labeling is a monotonous and tedious intensive repeated labor, and in order to obtain a large amount of training data, a large amount of manpower is required to be consumed for data labeling, so that higher labor cost is paid.

Disclosure of Invention

In view of the above, embodiments of the present invention provide a method and an apparatus for tagging data, which can obtain a large amount of training data with abundant types at a very low cost.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of labeling data.

The method for labeling data in the embodiment of the invention comprises the following steps: generating a verification code based on the marked data and the data to be marked, and displaying the verification code to a user; acquiring a verification mark added to the verification code by a user, and acquiring effective marking of the data to be marked according to the verification mark added to the verification code by the user; and carrying out cross validation on the effective labels of the data to be labeled so as to determine the labels of the data to be labeled.

Optionally, generating a verification code based on the marked data and the data to be marked, and displaying it to the user, includes: selecting a marked data source from a plurality of verification code data sources by adopting a weighted random strategy; selecting, according to a preset proportion, M marked data from the marked data source and N data to be marked from the data set to be marked, wherein M is greater than N; randomly mixing the M marked data and the N data to be marked to generate a verification code; and displaying the verification code to the user when a differential verification request from the user is received.

Optionally, acquiring the verification mark added to the verification code by the user, and obtaining the effective label of the data to be marked according to that verification mark, includes: respectively acquiring the verification marks added by the user to the marked data and to the data to be marked; comparing whether the verification mark added by the user to the marked data is consistent with the correct mark of the marked data; if they are consistent, the differential verification passes, and the verification mark added by the user to the data to be marked in the verification code is taken as an effective label of the data to be marked; and if they are not consistent, the differential verification fails.

Optionally, performing cross validation on the effective labels of the data to be marked to determine the label of the data to be marked includes: judging whether the number of effective labels of the data to be marked is greater than or equal to the minimum label number; if so, calculating the consistency rate of the effective labels, and judging whether the consistency rate is greater than or equal to a standard consistency rate; if the consistency rate is greater than or equal to the standard consistency rate, taking the effective label that reaches the standard consistency rate as the label of the data to be marked; if the consistency rate is less than the standard consistency rate, judging whether the number of effective labels is greater than or equal to the maximum label number; if the number of effective labels is greater than or equal to the maximum label number, taking the data to be marked as data to be processed; if it is less than the maximum label number, continuing to randomly mix the data to be marked with marked data to generate a verification code; and if the number of effective labels is less than the minimum label number, likewise continuing to randomly mix the data to be marked with marked data to generate a verification code.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an apparatus for labeling data.

The device for labeling data of the embodiment of the invention comprises: the generating module is used for generating a verification code based on the marked data and the data to be marked and displaying the verification code to a user; the acquisition module is used for acquiring the verification mark added to the verification code by the user and acquiring the effective mark of the data to be marked according to the verification mark added to the verification code by the user; and the verification module is used for performing cross verification on the effective labels of the data to be labeled so as to determine the labels of the data to be labeled.

Optionally, the generating module is further configured to: select a marked data source from a plurality of verification code data sources by adopting a weighted random strategy; select, according to a preset proportion, M marked data from the marked data source and N data to be marked from the data set to be marked, wherein M is greater than N; randomly mix the M marked data and the N data to be marked to generate a verification code; and display the verification code to the user when a differential verification request from the user is received.

Optionally, the acquisition module is further configured to: respectively acquire the verification marks added by the user to the marked data and to the data to be marked; compare whether the verification mark added by the user to the marked data is consistent with the correct mark of the marked data; if they are consistent, determine that the differential verification passes, and take the verification mark added by the user to the data to be marked in the verification code as an effective label of the data to be marked; and if they are not consistent, determine that the differential verification fails.

Optionally, the verification module is further configured to: judge whether the number of effective labels of the data to be marked is greater than or equal to the minimum label number; if so, calculate the consistency rate of the effective labels, and judge whether the consistency rate is greater than or equal to a standard consistency rate; if the consistency rate is greater than or equal to the standard consistency rate, take the effective label that reaches the standard consistency rate as the label of the data to be marked; if the consistency rate is less than the standard consistency rate, judge whether the number of effective labels is greater than or equal to the maximum label number; if the number of effective labels is greater than or equal to the maximum label number, take the data to be marked as data to be processed; if it is less than the maximum label number, continue to randomly mix the data to be marked with marked data to generate a verification code; and if the number of effective labels is less than the minimum label number, likewise continue to randomly mix the data to be marked with marked data to generate a verification code.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an electronic device for annotating data.

An electronic device for labeling data according to an embodiment of the present invention includes: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method for tagging data in an embodiment of the invention.

To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium.

A computer-readable storage medium of an embodiment of the present invention stores thereon a computer program that, when executed by a processor, implements a method of labeling data of an embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits: a verification code is generated based on the marked data and the data to be marked and is displayed to the user; the verification mark added to the verification code by the user is acquired, and the effective label of the data to be marked is obtained from it; and the effective labels of the data to be marked are cross-validated to determine the label of the data to be marked. Because the labeling of the data to be marked is thus accomplished through users' differential verification, the technical problem that labeling training data consumes a large amount of manpower and incurs high labor cost is overcome, and the technical effect of obtaining a large amount of training data of rich types at extremely low cost is achieved.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main steps of a method of annotating data according to an embodiment of the present invention;

FIG. 2 is a schematic view of a main flow of a method of labeling data according to a referential embodiment of the present invention;

FIG. 3 is a flow chart illustrating the generation of verification codes and data annotation according to an embodiment of the present invention;

FIG. 4 is a schematic flow diagram of cross-validation of a method of annotating data according to an embodiment of the present invention;

FIG. 5 is a first diagram illustrating an application of a method for annotating data according to an embodiment of the present invention;

FIG. 6 is a second application diagram of the method for labeling data according to the embodiment of the present invention;

FIG. 7 is a third diagram illustrating an application of the method for annotating data according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of the main blocks of an apparatus for labeling data according to an embodiment of the present invention;

FIG. 9 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 10 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments of the present invention and the technical features of the embodiments may be combined with each other without conflict.

Acquiring training data relies mainly on manual data labeling. Data labeling is monotonous, tedious, intensive, and repetitive labor; obtaining training data for a specific field consumes a large amount of manpower and incurs high labor cost.

Human verification, namely the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA for short), also called a verification code, is a public, fully automatic program for distinguishing whether a user is a computer or a human. In a CAPTCHA test, a computer acting as the server automatically generates a question for the user to answer. The question can be generated and evaluated by a computer but is intended to be solvable only by a human; since a computer cannot solve the CAPTCHA question, a user who answers it can be considered human. At present, CAPTCHAs are widely used for login, lotteries, and similar functions of internet websites, but most verification code systems use a single type of verification code, for example, presenting a picture and asking the user to type the characters in it. A large number of verification code cracking programs can recognize the characters in verification code pictures with ever higher accuracy, so the longer a single type of verification code is used, the greater the risk that it is cracked.

The method for labeling data of the embodiment of the invention combines verification codes with data labeling, using the data to be marked as the verification code. While the user enters the verification code, human verification and the labeling of the data to be marked are completed at the same time, and users participate in this process free of charge, which makes it possible to acquire massive training data at extremely low cost. Moreover, as the types of labeled data become richer, deep learning can be applied widely in various fields: for example, in natural language recognition, whether the emotion type of a sentence is happy or sad needs to be labeled; in face recognition, the face in a picture needs to be labeled; and in building a knowledge graph, whether a commodity picture belongs to a certain category needs to be identified. In addition, the data to be marked is naturally diverse, so using it as verification codes enriches the types of verification codes, and continuously changing verification codes are harder to crack and therefore relatively safer.

FIG. 1 is a diagram illustrating the main steps of a method for labeling data according to an embodiment of the present invention.

As shown in fig. 1, the method for labeling data according to the embodiment of the present invention mainly includes the following steps:

Step S101: generating a verification code based on the marked data and the data to be marked, and displaying the verification code to a user.

The embodiment of the invention takes the marked data and the data to be marked as verification codes for a user to add verification marks, wherein the marked data is used for differential verification, and the differential verification can be common human verification and the like. Since the user does not know which data in the verification code is unmarked, most users can mark all marked data and data to be marked in the verification code as correctly as possible in order to successfully pass the differential verification, and based on the marking, the marking of the data to be marked can be realized through the differential verification of the user.

It should be noted that the marked data and the data to be marked may be pictures, text, audio, or the like. Correspondingly, the verification code can be a picture, so that the user marks a specific object in the picture, selects the pictures with a certain characteristic, or indicates which category a picture belongs to (in face recognition, for example, the face in a picture needs to be marked); it can be text, so that the user marks the emotion type of each sentence; or it can be a sound, so that the user selects the type of the sound.

In the embodiment of the present invention, step S101 may be implemented by the following steps: selecting a marked data source from a plurality of verification code data sources by adopting a weighted random strategy; selecting, according to a preset proportion, M marked data from the marked data source and N data to be marked from the data set to be marked; randomly mixing the M marked data and the N data to be marked to generate a verification code; and displaying the verification code to the user when a differential verification request from the user is received.

When a user logs in some platforms or websites, the user sends a request for logging in to a server of the platform or website, and the server receives the request and then can perform differential verification on the user, namely, a verification code is displayed to the user and an answer of the user is obtained, wherein the verification code can be generated in advance or generated in real time.

At the initial stage of acquiring a type of training data, the amount of marked data available for differential verification is small, so the verification codes are easy to crack by exhaustive enumeration. To ensure the security of the verification code, the marked data can be mixed with other types of verification codes; that is, the marked data is selected from a plurality of verification code data sources. Each time a verification code is generated, a weighted random strategy determines which verification code data source to use. The weighted random strategy assigns a weight to each verification code data source and randomly selects one source according to the weights each time, so that the higher the weight of a source, the higher the probability that it is selected. It should be noted that each verification code data source is responsible for generating a verification code, giving the answer (i.e., the correct mark of each marked data), and verifying the user's answer (i.e., the verification mark added by the user to the verification code). The verification code provided by a verification code data source may be a traditional picture verification code, a data labeling task, or any other type of verification code; that is, the marked data and the data to be marked may be the same type of data or different types of data.
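The weighted random strategy described above can be illustrated with a minimal Python sketch. The source names and weight values below are hypothetical and are not specified by the embodiment; the sketch only shows one possible way to select a source with probability proportional to its weight.

```python
import random

# Hypothetical verification code data sources and weights (not from the embodiment).
SOURCE_WEIGHTS = {
    "classic_picture_code": 5.0,
    "quadrilateral_labeling": 3.0,
    "sentiment_labeling": 2.0,
}

def pick_source(weights, rng=random):
    """Return one source name, chosen with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Draw many times with a seeded generator to observe the selection frequencies.
rng = random.Random(0)
counts = {name: 0 for name in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[pick_source(SOURCE_WEIGHTS, rng)] += 1
print(counts)
```

Raising the weight of a labeling source as its pool of marked data grows would shift more verification traffic toward that source; how the weights are tuned is left open here.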

In addition, to balance user experience against the security of the verification code, the number of marked data in the verification code may be larger than the number of data to be marked. Preferably, the preset ratio M:N may be between 3:1 and 5:1, that is, the number of marked data in the verification code is three to five times the number of data to be marked. The total number of marked data and data to be marked in the verification code can be set according to actual conditions, for example, 6 or 10.
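As an illustration of the mixing step, the following Python sketch draws M marked data and N data to be marked and shuffles them into one verification code. The function name `build_captcha`, the dictionary and list data structures, and the default values M = 6 and N = 2 are assumptions made for this sketch only.

```python
import random

def build_captcha(marked_pool, unmarked_pool, m=6, n=2, rng=random):
    """Mix m marked items with n items to be marked.

    marked_pool:   dict mapping item id -> correct mark (hypothetical structure)
    unmarked_pool: list of item ids that still need a label
    Returns (items, answer_key): the shuffled item ids shown to the user and
    the correct marks of the marked items only.
    """
    if m <= n:
        raise ValueError("M must be greater than N")
    marked_ids = rng.sample(list(marked_pool), m)
    unmarked_ids = rng.sample(list(unmarked_pool), n)
    items = marked_ids + unmarked_ids
    rng.shuffle(items)  # the user cannot tell which items are unmarked
    answer_key = {item: marked_pool[item] for item in marked_ids}
    return items, answer_key

# Example pools with made-up ids.
marked = {f"marked-{i}": (i % 2 == 0) for i in range(20)}
unmarked = [f"todo-{i}" for i in range(20)]
items, answer_key = build_captcha(marked, unmarked, m=6, n=2, rng=random.Random(0))
print(items)
```

Here M:N is 3:1, the lower end of the preferred range; the answer key deliberately contains no entry for the data to be marked.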

Step S102: acquiring the verification mark added to the verification code by the user, and acquiring the effective label of the data to be marked according to the verification mark added to the verification code by the user.

In order to successfully pass the differential verification, most users can correctly label all the labeled data and the data to be labeled in the verification code as much as possible based on self cognition.

In the embodiment of the present invention, step S102 may be implemented by the following steps: respectively acquiring the verification marks added by the user to the marked data and to the data to be marked; comparing whether the verification mark added by the user to the marked data is consistent with the correct mark of the marked data; if they are consistent, the differential verification passes, and the verification mark added by the user to the data to be marked in the verification code is taken as an effective label of the data to be marked; and if they are not consistent, the differential verification fails.

The differential verification is a verification method for verifying the authenticity of the user, and may be human verification or the like. Most users can mark all marked data and data to be marked in the verification code as correctly as possible in order to successfully pass the differential verification, on the basis, if the users mark all marked data correctly, the users pass the differential verification, and the verification marks added by the users to the data to be marked are effective marks of the data to be marked.
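The comparison in step S102 can be sketched as follows. This is a minimal illustration under assumed data structures: the function name `check_response` and the use of dictionaries keyed by item id are hypothetical and not part of the embodiment.

```python
def check_response(items, answer_key, response):
    """Differential verification of one user response.

    items:      ids of all data shown in the verification code
    answer_key: id -> correct mark, for the marked data only
    response:   id -> verification mark added by the user
    Returns (passed, effective_labels). effective_labels maps each data to be
    marked to the user's mark and is empty when verification fails.
    """
    passed = all(response.get(item) == mark for item, mark in answer_key.items())
    if not passed:
        return False, {}
    effective = {
        item: response[item]
        for item in items
        if item not in answer_key and item in response
    }
    return True, effective

items = ["a", "b", "u"]                 # "u" is the data to be marked
answer_key = {"a": True, "b": False}
print(check_response(items, answer_key, {"a": True, "b": False, "u": True}))
print(check_response(items, answer_key, {"a": True, "b": True, "u": True}))
```

Only a response that matches every correct mark contributes effective labels, so marks from users who fail the verification are never recorded.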

Step S103: carrying out cross validation on the effective labels of the data to be marked so as to determine the label of the data to be marked.

In a single differential verification, if the verification marks a user adds to the marked data are correct, the verification mark that user adds to the data to be marked is taken as an effective label of the data to be marked. Such a label is not necessarily accurate, so all effective labels of the data to be marked need to be cross-validated in order to determine the label of the data to be marked accurately.

The principle of cross validation is as follows: each data to be marked has at least k and at most j effective labels (k is the minimum label number, j is the maximum label number, and k < j), a standard consistency rate θ (0 < θ < 1) is defined, and the label of the data to be marked is determined according to the consistency rate. The consistency rate is the ratio of the number of occurrences of the most frequent label content among all effective labels to the total number of effective labels.
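The consistency rate defined above can be computed as in the following Python sketch. The function name `consistency_rate` is hypothetical; the calculation simply divides the count of the most frequent label by the total number of effective labels.

```python
from collections import Counter

def consistency_rate(effective_labels):
    """Return (most_frequent_label, rate) for a non-empty list of effective labels."""
    if not effective_labels:
        raise ValueError("at least one effective label is required")
    label, count = Counter(effective_labels).most_common(1)[0]
    return label, count / len(effective_labels)

# Three of four users marked the item TRUE, so the rate is 3 / 4.
print(consistency_rate([True, True, False, True]))  # (True, 0.75)
```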

In the embodiment of the present invention, step S103 may be implemented by the following steps: judging whether the number of effective labels of the data to be labeled is greater than or equal to the minimum label number or not; if the number of the effective labels is larger than or equal to the minimum number of the labels, calculating the consistency rate of the effective labels, and judging whether the consistency rate of the effective labels is larger than or equal to the standard consistency rate; if the consistency rate of the effective labels is greater than or equal to the standard consistency rate, taking the effective labels with the consistency rate greater than or equal to the standard consistency rate as the labels of the data to be labeled; if the consistency rate of the effective labels is less than the standard consistency rate, judging whether the quantity of the effective labels is more than or equal to the maximum label quantity; if the number of the effective labels is larger than or equal to the maximum label number, taking the data to be labeled as the data to be processed; if the number of the effective labels is less than the maximum label number, the data to be labeled and the labeled data are mixed randomly to generate a verification code; and if the number of the effective labels is less than the minimum label number, continuing to randomly mix the data to be labeled and the labeled data to generate the verification code.

According to the method for labeling data of the embodiment of the invention, the cross validation process is triggered whenever an effective label is added to data to be marked. When the number x of effective labels of the data to be marked is between k and j (k ≤ x < j) and the consistency rate of the x effective labels is not lower than the standard consistency rate, the label content that occurs most often is considered credible. If the number of effective labels is greater than or equal to the maximum label number but the label of the data to be marked still cannot be determined, the data to be marked is taken as data to be processed and is handled manually or discarded.

By continuously iterating steps S101 to S103, unknown data to be marked is continuously either labeled accurately or discarded, until the labeling of all data to be marked is finally completed.

According to the method for labeling data of the embodiment of the invention, a verification code is generated based on the marked data and the data to be marked and is displayed to the user; the verification mark added to the verification code by the user is acquired, and the effective label of the data to be marked is obtained from it; and the effective labels of the data to be marked are cross-validated to determine the label of the data to be marked. Because the labeling of the data to be marked is thus accomplished through users' differential verification, the technical problem that labeling training data consumes a large amount of manpower and incurs high labor cost is overcome, and the technical effect of obtaining a large amount of training data of rich types at extremely low cost is achieved.

FIG. 2 is a schematic diagram of a main flow of a method of labeling data according to one referential embodiment of the present invention.

As shown in fig. 2, the method for labeling data according to the embodiment of the present invention can be implemented by referring to the following processes:

Step S201: selecting a labeled data source from a plurality of verification code data sources by adopting a weighted random strategy:

Each verification code data source is responsible for generating verification codes and verifying answers, and the verification codes provided by each verification code data source can also be traditional picture verification codes, data tagging tasks or any other types of verification codes and the like.

Step S202: selecting M marked data from a marked data source, selecting N data to be marked from a data set to be marked, randomly mixing the M marked data and the N data to be marked, and generating a verification code:

The number of the marked data in the verification code is larger than that of the data to be marked, and the marked data and the data to be marked can be the same type of data or different types of data.

Step S203: displaying the verification code to a user, and respectively acquiring verification marks added by the user to the marked data and the data to be marked in the verification code;

Step S204: comparing whether the verification mark added by the user to the marked data is consistent with the correct mark of the marked data:

If they are consistent, the differential verification passes, and the verification mark added by the user to the data to be marked in the verification code is taken as an effective label of the data to be marked; if they are not consistent, the differential verification fails.

Step S205: carrying out cross validation on the effective labels of the data to be marked so as to determine the label of the data to be marked.

The implementation process of step S205 is the same as step S103, and is not described herein again.

FIG. 3 is a flowchart illustrating a method for generating a verification code and labeling data according to an embodiment of the present invention.

As shown in fig. 3, in the method for annotating data according to the embodiment of the present invention, a data annotation task in the differential verification process needs to provide a small amount of annotated data and a large amount of data to be annotated. Specifically, the procedure for generating the verification code and the data label is as follows:

Step S301: selecting a marked data source from a plurality of verification code data sources by adopting a weighted random strategy;

Step S302: selecting a preset number of marked data from the marked data source, and defining them as a data set m, wherein the data set m comprises M marked data;

Step S303: selecting a preset number of data to be marked, and defining them as a data set n, wherein the data set n comprises N data to be marked;

Step S304: randomly mixing the set m with the set n to generate a verification code:

The set m is used for distinguishing verification, and the set n is used for obtaining the verification label of the user. M + N is the number of questions answered by the user at each verification, and M and N are constants which can be defined.

Step S305: displaying the verification code to the user for differential verification;

Step S306: respectively acquiring the verification marks added by the user to the marked data and the data to be marked in the verification code.

Step S307: comparing whether the verification mark added by the user to the marked data is consistent with the correct mark of the marked data; if so, executing step S308; if not, the differential verification fails and the flow ends.

Step S308: the differential verification passes, and the verification mark added by the user to the data to be marked in the verification code is obtained and used as the effective label of the data to be marked:

If the verification passes, the answer is regarded as a verification code answer entered by a human, and the answers the user entered for set n are effective labels; that is, the verification marks the user added to the data to be marked in set n are effective labels, and they are recorded.

FIG. 4 is a flow chart illustrating cross-validation of a method of annotating data according to an embodiment of the present invention.

In a single differential verification, if the verification marks a user adds to the marked data are correct, the verification mark that user adds to the data to be marked is taken as an effective label of the data to be marked, but that mark is not necessarily accurate. Since labeling by means of verification codes costs almost nothing, the effective labels of the data to be marked can be checked by cross validation, which raises the labeling accuracy to a satisfactory level.

When the number x of effective labels of the data to be marked is between k and j (k ≤ x < j) and the consistency rate of the x effective labels is not lower than the standard consistency rate, the label content that occurs most often is considered credible.

As shown in fig. 4, in the method for labeling data according to the embodiment of the present invention, when an effective label is added to data to be labeled, the following cross validation process is triggered:

Step S401: judging whether the number of the effective labels reaches the minimum label number or not; if yes, go to step S402; if not, ending the cross validation, and continuing to randomly mix the data to be labeled and the labeled data to generate a validation code;

Step S402: judging whether the consistency rate of the effective labels is greater than or equal to the standard consistency rate or not; if yes, go to step S403; if not, go to step S404;

Step S403: taking the effective label with the consistency rate larger than or equal to the standard consistency rate as the label of the data to be labeled, and moving the data to be labeled to the labeled data set;

Step S404: judging whether the number of the effective labels reaches the maximum label number or not; if yes, go to step S405; if not, finishing the cross validation, and continuing to randomly mix the data to be labeled and the labeled data to generate a validation code;

Step S405: moving the data to be marked to a difficult data set:

Data in the difficult data set may be processed manually or discarded.
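The flow of steps S401 to S405 can be sketched as a single decision function. This is an illustrative sketch only: the function name, the string return values, and the reading of the consistency rate as the share of effective labels that agree with the most frequent label are assumptions, since the text does not give a formula for the consistency rate.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def cross_validate(
    effective_labels: Sequence[Hashable],
    min_count: int,
    max_count: int,
    standard_rate: float,
) -> Tuple[str, Optional[Hashable]]:
    """Decide what to do with one item after a new effective label arrives.

    Returns ("labeled", label), ("difficult", None) or ("continue", None).
    """
    x = len(effective_labels)

    # Step S401: not enough effective labels yet, keep collecting.
    if x < min_count:
        return "continue", None

    # Step S402: consistency rate, assumed to be the share of the top label.
    label, votes = Counter(effective_labels).most_common(1)[0]
    if votes / x >= standard_rate:
        # Step S403: accept the label and move the item to the labeled set.
        return "labeled", label

    # Step S404: give up once the maximum label number is reached.
    if x >= max_count:
        # Step S405: move the item to the difficult data set.
        return "difficult", None

    return "continue", None
```

A caller would invoke this each time an effective label is added and, on "continue", keep mixing the item into newly generated verification codes.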

In order to further explain the technical idea of the present invention, the technical solution of the embodiment of the present invention is now described with reference to specific application scenarios.

Taking as an example the data set required to train a model that identifies quadrilaterals, and taking the differential verification of the embodiment of the invention to be human verification, the implementation flow of the data labeling method is as follows:

In the marked data set, each graph is marked according to whether it contains a quadrilateral: the mark is TRUE (contains a quadrilateral) or FALSE (contains no quadrilateral).

As shown in FIG. 5, each verification code is defined to contain 8 graphs: 6 from the marked data set (set m) and 2 from the data set to be marked (set n). As shown in FIG. 6, the 8 graphs in the verification code are randomly arranged to generate a verification code question. The graphs in set n do not participate in human verification and are marked NA, so the standard answer of the verification code is [FALSE, TRUE, FALSE, FALSE, NA, TRUE, NA, FALSE].

As shown in FIG. 7, the user's selections are indicated by circular dashed boxes. A graph selected by the user is TRUE and an unselected graph is FALSE, so the user's answer is [FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE].

The user input is then compared with the standard answer. The comparison results for the graphs in set m are completely consistent, so the human verification passes, and the user's answer for each graph in set n is recorded as one effective label.

And finally, performing cross validation on the effective labels.

Although the user could pass verification by correctly marking only the 6 graphs from the marked data set among the 8 graphs presented, the user does not know this; and even if the user did know, the graphs are randomly arranged, so the user cannot tell which 6 graphs are used for verification and which 2 are used for labeling. It can therefore be assumed that most users try to mark all 8 graphs correctly in order to pass verification, so an effective label is much more likely to be correct than incorrect. With continued reference to FIG. 7, the third graph in the second row of the verification code does not contain a quadrilateral but is marked by the user as containing one; the user has clearly marked this graph incorrectly. Because the user correctly marked all the graphs from set m, the mark on this 7th graph is nevertheless recorded as an effective label. Effective labels are therefore not always accurate and need to be checked, that is, cross-validated.

Assume that the minimum label number and the maximum label number are k = 3 and j = 10 respectively, that the standard consistency rate is 75%, and that the effective labels received in sequence by a certain graph to be marked are [TRUE, FALSE, TRUE, TRUE]. The consistency rate is 75%, which is not lower than the standard consistency rate, and the number of labels, 4, lies between the minimum and maximum label numbers [3, 10]. The conditions are satisfied, so the users' label TRUE for this graph is considered credible.
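The example can be traced label by label. The sketch below assumes that the four effective labels comprise three TRUE and one FALSE (which is what a 75% rate over 4 labels implies), that they arrive in the order shown, and that the consistency rate is the share of the most frequent label; none of these details is fixed by the text.

```python
from collections import Counter

# Assumed arrival order of the four effective labels (3 TRUE, 1 FALSE).
received = [True, False, True, True]
k, j, standard_rate = 3, 10, 0.75

history = []
for x in range(1, len(received) + 1):
    # Most frequent label among the first x effective labels.
    label, votes = Counter(received[:x]).most_common(1)[0]
    rate = votes / x
    # Credible only when k <= x < j and the rate meets the standard.
    credible = k <= x < j and rate >= standard_rate
    history.append((x, label, round(rate, 2), credible))

for row in history:
    print(row)
```

Under these assumptions the third label already satisfies the minimum label number, but its rate of 2/3 is below 75%, so collection continues; the fourth label raises the rate to 3/4 and TRUE becomes credible.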

FIG. 8 is a schematic diagram of main blocks of an apparatus for labeling data according to an embodiment of the present invention.

As shown in fig. 8, an apparatus 800 for labeling data according to an embodiment of the present invention includes: a generation module 801, an acquisition module 802 and a verification module 803.

Wherein,

A generating module 801, configured to generate a verification code based on the tagged data and the data to be tagged, and display the verification code to a user;

An obtaining module 802, configured to obtain a verification mark added to the verification code by a user, and obtain an effective label of the data to be labeled according to the verification mark added to the verification code by the user;

The verification module 803 is configured to perform cross verification on the valid label of the data to be labeled, so as to determine the label of the data to be labeled.

In this embodiment of the present invention, the generating module 801 is further configured to: select a marked data source from a plurality of verification code data sources by adopting a weighted random strategy; select M marked data from the marked data source and N data to be marked from the data set to be marked according to a preset proportion, wherein M is greater than N; randomly mix the M marked data and the N data to be marked to generate a verification code; and, when a differential verification request from a user is received, display the verification code to the user.
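The generation behaviour described for module 801 can be sketched as follows. This is a hedged illustration: the function name `generate_verification_code`, the dictionary layout of the data sources and their weights, and the use of `None` as a placeholder label for data to be marked are assumptions of this sketch, and M and N are passed in directly rather than derived from a preset proportion.

```python
import random
from typing import Any, Dict, List, Optional, Sequence, Tuple


def generate_verification_code(
    marked_sources: Dict[str, Sequence[Tuple[Any, Any]]],  # name -> (item, correct mark)
    source_weights: Dict[str, float],
    to_be_marked: Sequence[Any],
    m: int,
    n: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[Any, Optional[Any]]]:
    """Build one verification code as a shuffled list of (item, mark-or-None)."""
    if m <= n:
        raise ValueError("M must be greater than N")
    rng = rng or random.Random()

    # Weighted random strategy: pick one marked data source.
    names = list(marked_sources)
    weights = [source_weights[name] for name in names]
    source = rng.choices(names, weights=weights, k=1)[0]

    # Select M marked items and N items to be marked, without replacement.
    marked = rng.sample(list(marked_sources[source]), m)
    unmarked = [(item, None) for item in rng.sample(list(to_be_marked), n)]

    # Randomly mix them so the user cannot tell the two kinds apart.
    code = list(marked) + unmarked
    rng.shuffle(code)
    return code
```

Passing a seeded `random.Random` makes the selection reproducible for testing; in deployment the default unseeded generator keeps the arrangement unpredictable to the user.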

In this embodiment of the present invention, the obtaining module 802 is further configured to: respectively acquire the verification marks added by a user to the marked data and to the data to be marked; compare whether the verification mark added by the user to the marked data is consistent with the correct mark of the marked data; if they are consistent, the differential verification passes, and the verification mark added by the user to the data to be marked in the verification code is obtained and used as an effective label of the data to be marked; if they are not consistent, the differential verification fails.

In this embodiment of the present invention, the verification module 803 is further configured to: judge whether the number of effective labels of the data to be labeled is greater than or equal to the minimum label number; if so, calculate the consistency rate of the effective labels and judge whether it is greater than or equal to the standard consistency rate; if the consistency rate meets the standard, take the effective label whose consistency rate is greater than or equal to the standard consistency rate as the label of the data to be labeled; if it does not, judge whether the number of effective labels is greater than or equal to the maximum label number; if the maximum label number has been reached, take the data to be labeled as data to be processed; if it has not been reached, continue to randomly mix the data to be labeled with the labeled data to generate a verification code; and if the minimum label number has not been reached, likewise continue to randomly mix the data to be labeled with the labeled data to generate a verification code.

The device for labeling data according to the embodiment of the present invention generates a verification code based on the marked data and the data to be marked and displays it to the user; acquires the verification mark added to the verification code by the user and obtains from it an effective label of the data to be marked; and cross-validates the effective labels of the data to be marked to determine the label of that data. Because the data to be marked is labeled through users' differential verification, this technical means overcomes the technical problem that labeling training data consumes a large amount of manpower at high labor cost, and achieves the technical effect of obtaining a large amount of training data of rich types at extremely low cost.

FIG. 9 illustrates an exemplary system architecture 900 in which the method of annotating data or the apparatus for annotating data of an embodiment of the present invention can be applied.

As shown in fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904, and a server 905. Network 904 is the medium used to provide communication links between terminal devices 901, 902, 903 and server 905. Network 904 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few.

A user may use the terminal devices 901, 902, 903 to interact with a server 905 over a network 904 to receive or send messages and the like. The terminal devices 901, 902, 903 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 901, 902, 903 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 905 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 901, 902, and 903. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., target push information and product information) to the terminal device.

It should be noted that the method for annotating data provided by the embodiment of the present invention is generally executed by the server 905, and accordingly, the apparatus for annotating data is generally disposed in the server 905.

It should be understood that the number of terminal devices, networks, and servers in fig. 9 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 10, a block diagram of a computer system 1000 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the system 1000 are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, and the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When executed by the Central Processing Unit (CPU) 1001, the computer program performs the above-described functions defined in the system of the present invention.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a generation module, an acquisition module, and a verification module. The names of these modules do not in some cases form a limitation on the modules themselves, for example, a verification module may also be described as a module for performing cross-verification on valid labels of the data to be labeled, so as to determine labels of the data to be labeled.

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: step S101: generating a verification code based on the marked data and the data to be marked, and displaying the verification code to a user; step S102: acquiring a verification mark added to the verification code by the user, and obtaining an effective label of the data to be marked according to the verification mark added to the verification code by the user; step S103: cross-validating the effective labels of the data to be marked to determine the label of the data to be marked.

According to the technical scheme of the embodiment of the invention, a verification code is generated based on the marked data and the data to be marked and displayed to the user; the verification mark added to the verification code by the user is acquired, and an effective label of the data to be marked is obtained from it; and the effective labels of the data to be marked are cross-validated to determine the label of that data. Because the data to be marked is labeled through users' differential verification, this technical means overcomes the technical problem that labeling training data consumes a large amount of manpower at high labor cost, and achieves the technical effect of obtaining a large amount of training data of rich types at extremely low cost.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for labeling data, characterized by comprising:

Generate a verification code based on the marked data and the data to be marked, and display it to the user;

Obtain the verification mark added by the user to the verification code, and obtain the valid mark of the data to be marked according to the verification mark added by the user to the verification code;

Cross-validation is performed on the valid labels of the data to be labelled to determine the label of the data to be labelled.

2. The method according to claim 1, wherein generating a verification code based on the marked data and the data to be marked, and showing to the user comprises:

Use a weighted random strategy to select annotated data sources from multiple CAPTCHA data sources;

Select M pieces of the labeled data from the labeled data source according to a preset ratio, and select N pieces of the to-be-labeled data from the to-be-labeled data set; wherein, M is greater than N;

Randomly mix the M pieces of the marked data and the N pieces of the data to be marked to generate a verification code;

When the user's differential verification request is received, the verification code is displayed to the user.

3. The method according to claim 1, wherein acquiring the verification mark added by the user to the verification code, and obtaining the valid mark of the data to be marked according to the verification mark added by the user to the verification code, comprises:

Respectively acquire the verification marks added by the user to the marked data and the to-be-marked data;

Compare whether the verification mark added by the user to the marked data is consistent with the correct marking of the marked data;

If they are consistent, the differential verification is passed, and the verification mark added by the user to the data to be marked in the verification code is obtained, and used as a valid mark of the data to be marked;

If not, the differential verification fails.

4. The method according to claim 1, characterized in that, performing cross-validation on the valid labels of the data to be labeled to determine the labels of the data to be labeled comprises:

Determine whether the number of valid labels of the data to be labelled is greater than or equal to the minimum number of labels;

If so, calculate the consistency rate of the valid annotations, and determine whether the consistency rate of the valid annotations is greater than or equal to the standard consistency rate;

If so, take the valid label with the consistency rate greater than or equal to the standard consistency rate as the label of the data to be labelled;

If not, determine whether the number of valid labels is greater than or equal to the maximum number of labels;

If so, take the to-be-labeled data as the to-be-processed data;

If not, continue to randomly mix the to-be-labeled data and the labeled data to generate a verification code;

If not, continue to randomly mix the data to be marked and the marked data to generate a verification code.

5. A device for marking data, comprising:

A generation module, configured to generate a verification code based on the marked data and the data to be marked, and display it to the user;

An obtaining module, configured to obtain the verification mark added by the user to the verification code, and obtain the valid mark of the data to be marked according to the verification mark added by the user to the verification code;

A verification module, configured to perform cross-validation on the valid labels of the data to be labelled to determine the label of the data to be labelled.

6. The apparatus according to claim 5, wherein the generating module is further configured to:

Select M marked data from the marked data source according to a preset ratio, and select N to-be-marked data from the to-be-marked data set; wherein, M is greater than N;

Randomly mix the M marked data and the N to-be-marked data to generate a verification code;

7. The device according to claim 5, wherein the acquisition module is further configured to:

If not, the differential verification fails.

8. The device according to claim 5, wherein the verification module is further used for:

If so, take the to-be-labeled data as the to-be-processed data;

9. An electronic device for marking data, comprising:

one or more processors;

storage means for storing one or more programs,

The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.

10. A computer-readable medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the method according to any one of claims 1-4 is implemented.