
CN111859862B - Text data labeling method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN111859862B
CN111859862B (application CN202010712345.5A)
Authority
CN
China
Prior art keywords
data
labeling
text
sub
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010712345.5A
Other languages
Chinese (zh)
Other versions
CN111859862A (en)
Inventor
韩俊明
赵培
马志芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202010712345.5A priority Critical patent/CN111859862B/en
Publication of CN111859862A publication Critical patent/CN111859862A/en
Application granted granted Critical
Publication of CN111859862B publication Critical patent/CN111859862B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/137: Hierarchical processing, e.g. outlines
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text data labeling method and device, a storage medium, and an electronic device. The method includes: acquiring a text to be labeled; labeling the text with a first processing mode, which is layered and serial layer by layer, to obtain first labeling data, and labeling the text with a second processing mode, which processes all layers in parallel without distinguishing them, to obtain second labeling data; labeling the portions where the first labeling data and the second labeling data differ according to a preset rule to obtain third labeling data, and labeling the portions where they agree to obtain fourth labeling data; and determining the third labeling data and the fourth labeling data as the labeling data of the text. By combining the two labeling modes and applying secondary processing to the data on which they disagree, the invention solves the technical problem of low accuracy of text data labeling in the prior art.

Description

Text data labeling method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and apparatus for labeling text data, a storage medium, and an electronic apparatus.
Background
In natural language processing, a large amount of labeled data is required. In general, a labeling accuracy above 90% is usable for model training, but some applications, such as the home appliance industry, require model stability, and the existing data must be 100% accurate. Manually labeled data, however, still carries an error rate of around 10%, and correcting these mislabels requires a later investment of manpower and material resources to check the material and re-verify the labels.
In the prior art, traditional language processing algorithms are used to verify and analyze natural language labels.
In serial, layer-by-layer processing, the complete natural language input is parsed in logical order from coarse to fine. An obvious drawback of this kind of scheme is error accumulation: errors produced at an upper layer are not caught in time but enter the next layer as input for continued recognition, so the erroneous result is inherited from layer to layer, causing a large amount of unnecessary detection and recognition work and a certain waste of resources.
In parallel processing without layering, each layer has its own recognition units and standards, and recognition at one layer does not affect the others, which effectively avoids error propagation. However, a recognition method detached from inter-layer association breaks the strong logic of natural language: analysis methods from different domains may decompose the same sentence differently, and the analysis result may be unsatisfactory.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present invention provide a text data labeling method and apparatus, a storage medium, and an electronic apparatus, to at least solve the technical problem of low accuracy of text data labeling in the prior art.
According to one aspect of the embodiments of the present invention, a text data labeling method is provided, including: acquiring a text to be labeled, where the text includes at least one target object to be labeled; labeling the text with a first processing mode, which is layered and serial layer by layer, to obtain first labeling data, and labeling the text with a second processing mode, which processes all layers in parallel without distinguishing them, to obtain second labeling data; labeling the portions where the first labeling data and the second labeling data differ according to a preset rule to obtain third labeling data, and labeling the portions where they agree to obtain fourth labeling data; and determining the third labeling data and the fourth labeling data as the labeling data of the text.
According to another aspect of the embodiments of the present invention, a text data labeling apparatus is also provided, including: an acquisition unit, configured to acquire a text to be labeled, where the text includes at least one target object to be labeled; a first labeling unit, configured to label the text with the layered, layer-by-layer serial first processing mode to obtain first labeling data, and to label the text with the non-layered parallel second processing mode to obtain second labeling data; a second labeling unit, configured to label the portions where the first labeling data and the second labeling data differ according to a preset rule to obtain third labeling data, and to label the portions where they agree to obtain fourth labeling data; and a determining unit, configured to determine the third labeling data and the fourth labeling data as the labeling data of the text, where the fourth labeling data is the labeling data of the portions on which the first and second labeling data agree.
According to yet another aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, where the computer program is configured to perform the above text data labeling method when run.
According to still another aspect of the embodiments of the present invention, an electronic device is also provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the above text data labeling method through the computer program.
In the embodiments of the present invention, a text to be labeled is acquired, where the text includes at least one target object to be labeled; the text is labeled with the layered, layer-by-layer serial first processing mode to obtain first labeling data, and with the non-layered parallel second processing mode to obtain second labeling data; the portions where the two differ are labeled according to a preset rule to obtain third labeling data, and the portions where they agree are labeled to obtain fourth labeling data; and the third and fourth labeling data are determined as the labeling data of the text. Combining the two labeling modes, comparing their results, and applying secondary processing to the data on which they differ improves the accuracy of text labeling data and solves the technical problem of low accuracy of text data labeling in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative text data annotation method according to an embodiment of the invention;
FIG. 2 is a flow chart of an alternative text data annotation method according to an embodiment of the invention;
FIG. 3 is a flow chart of an alternative first text processing mode according to an embodiment of the invention;
FIG. 4 is a flow chart of an alternative second text processing mode according to an embodiment of the invention;
FIG. 5 is an alternative text semantic hierarchy diagram according to an embodiment of the present invention;
FIG. 6 is a flow chart of an alternative method of verifying annotated data based on multiple layers and multiple models in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative text data annotation device according to an embodiment of the invention;
fig. 8 is a schematic structural diagram of an electronic device according to an alternative text data labeling method according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a text data labeling method is provided. Optionally, as an optional implementation, the method may be applied, but is not limited, to the text data labeling system in the hardware environment shown in fig. 1, where the system may include, but is not limited to, a terminal device 102, a network 110, and a server 112.
The terminal device 102 may include, but is not limited to: a human-machine interaction screen 104, a processor 106, and a memory 108. The human-machine interaction screen 104 is used to acquire human-machine interaction instructions through its interface and to present the text to be annotated; the processor 106 is configured to annotate the text with data in response to those instructions; and the memory 108 stores the text to be annotated as well as the completed annotation data and other information. The server 112 may include, but is not limited to: a database 114 and a processing engine 116, where the processing engine 116 calls the text to be annotated stored in the database 114 and performs: acquiring the text to be annotated, where the text includes at least one target object to be annotated; labeling the text with the layered, layer-by-layer serial first processing mode to obtain first labeling data, and labeling the text with the non-layered parallel second processing mode to obtain second labeling data; labeling the portions where the two differ according to a preset rule to obtain third labeling data, and labeling the portions where they agree to obtain fourth labeling data; and determining the third and fourth labeling data as the labeling data of the text. Combining the two labeling modes and applying secondary processing to the data on which they disagree improves the accuracy of text data labeling and solves the technical problem of low labeling accuracy in the prior art.
The specific process is as follows. The human-machine interaction screen 104 in the terminal device 102 displays a text to be annotated (as shown in fig. 1, the text includes a target object, person A). As in steps S102-S110, the text to be annotated is acquired and sent to the server 112 via the network 110. At the server 112, the text is labeled with the layered, layer-by-layer serial first processing mode to obtain first labeling data, and with the non-layered parallel second processing mode to obtain second labeling data; the portions where the two differ are labeled according to a preset rule to obtain third labeling data, and the portions where they agree are labeled to obtain fourth labeling data; and the third and fourth labeling data are determined as the labeling data of the text. The result of this determination is then returned to the terminal device 102.
Alternatively, as shown in steps S102-S110, the terminal device 102 itself labels the text with the layered, layer-by-layer serial first processing mode to obtain first labeling data, and with the non-layered parallel second processing mode to obtain second labeling data; labels the portions where the two differ according to a preset rule to obtain third labeling data, and labels the portions where they agree to obtain fourth labeling data; and determines the third and fourth labeling data as the labeling data of the text.
Optionally, in this embodiment, the above text data labeling method may be, but is not limited to being, applied to the server 112 to assist an application client in labeling the text to be labeled. The application client may run, but is not limited to running, on the terminal device 102, which may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, or another terminal device supporting the application client. The server 112 and the terminal device 102 may exchange data through, but are not limited to, a network, which may include, but is not limited to, a wireless network or a wired network, where the wireless network includes Bluetooth, WIFI, and other networks enabling wireless communication, and the wired network may include, but is not limited to, a wide area network, a metropolitan area network, or a local area network. The above is merely an example and does not limit the present embodiment.
Optionally, as an optional implementation manner, as shown in fig. 2, the method for labeling data of the text includes:
step S202, a text to be annotated is obtained, wherein the text at least comprises one target object to be annotated.
Step S204, labeling the text with a first processing mode, which is layered and serial layer by layer, to obtain first labeling data, and labeling the text with a second processing mode, which processes all layers in parallel without distinguishing them, to obtain second labeling data.
Step S206, labeling the portions where the first labeling data and the second labeling data differ according to a preset rule to obtain third labeling data, and labeling the portions where they agree to obtain fourth labeling data.
Step S208, determining the third labeling data and the fourth labeling data as the labeling data of the text.
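For illustration only, steps S202-S208 can be sketched as follows. This is a minimal sketch, not the claimed implementation: the two labeler functions and the preset rule are hypothetical stand-ins.

```python
# Sketch of steps S202-S208: two labeling passes over the same text,
# then a merge that routes disagreements through a preset rule.

def label_text(text, serial_labeler, parallel_labeler, rule):
    first = serial_labeler(text)      # S204: layered, layer-by-layer serial pass
    second = parallel_labeler(text)   # S204: non-layered parallel pass
    final = {}
    for key, value in first.items():
        if value == second.get(key):
            final[key] = value        # S206: agreement -> fourth labeling data
        else:                         # S206: difference -> preset rule -> third data
            final[key] = rule(key, value, second.get(key))
    return final                      # S208: merged labeling data of the text

labels = label_text(
    "text to be labeled",
    lambda t: {"category": "home_appliance", "intent": "control"},
    lambda t: {"category": "home_appliance", "intent": "query"},
    lambda key, a, b: f"REVIEW({a}|{b})",   # hypothetical preset rule
)
print(labels)
```

In this toy run the two passes agree on the category, which becomes fourth labeling data, while the conflicting intent is handed to the rule and becomes third labeling data.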
Alternatively, in this embodiment, the text may include, but is not limited to, document text, picture text, and the like. When the text is document text, annotating it with data may include, but is not limited to, annotating the images and words in the document, and the document format may include, but is not limited to, Word format, PDF format, and the like. When the text is picture text, annotating it may include, but is not limited to, data-annotating the objects in the picture, for example characters, animals, and so on; the format of the picture text is not specifically limited here.
Note that, when the text is document text, the target object may include, but is not limited to: words, phrases, long sentences, and the like. When the text is picture text, the target object may include, but is not limited to: people, animals, and the like.
It should also be noted that, before acquisition, the text may already carry partial data annotations; that is, the text to be annotated may include, but is not limited to, text with no annotation data at all and text with partial annotation data.
It can be seen that, in this embodiment, text data annotation may be applied to document text and/or picture text, and the annotation data is input into a neural network to recognize the document text or the picture text.
Optionally, in an embodiment, labeling the text by the first processing manner to obtain first labeling data includes:
s1, determining a first category corresponding to the text, and inputting the text to a first layer of a first neural network according to the first category to obtain annotation data corresponding to the first category;
s2, inputting the labeling data corresponding to the first category into a second layer of the first neural network to obtain the first labeling data.
In practical applications, the first processing mode includes, but is not limited to, verification and recognition with a TextCNN model; the algorithm is highly efficient and suitable for analyzing large amounts of data. Fig. 3 shows a flow chart of the first processing mode.
As shown in fig. 3, the process classifies layer by layer. From the start of execution, the input text is divided at the category layer into different categories (corresponding to the first category of the text), each with its own set of domains. The categorized text, now carrying a category label, then enters the domain-level division under that category. The domain layer is processed in the same way as the layer before it: each domain corresponds to a set of intentions, and once the domain is determined, the corresponding intention layer continues the division and verification. After all levels have been verified and labeled, the processed text carries labels for every level. The labels across layers are logically related, with clear constraints between upper and lower layers.
The first processing mode is suitable for cases with many data labels; splitting the labels by layer effectively improves the computation speed.
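The layered serial pass described above can be sketched as follows. The per-layer classifiers here are toy dictionaries standing in for per-layer models such as TextCNN; the label names are hypothetical.

```python
# Sketch of the layered, layer-by-layer serial pass: each layer's label
# selects which classifier handles the next, finer layer.

CATEGORY_CLF = lambda text: "appliance" if "air conditioner" in text else "chat"
DOMAIN_CLFS = {                      # one domain classifier per category
    "appliance": lambda text: "air_conditioning",
    "chat": lambda text: "small_talk",
}
INTENT_CLFS = {                      # one intent classifier per domain
    "air_conditioning": lambda text: "set_temperature",
    "small_talk": lambda text: "greet",
}

def serial_label(text):
    category = CATEGORY_CLF(text)              # layer 1: category
    domain = DOMAIN_CLFS[category](text)       # layer 2: domain, given category
    intent = INTENT_CLFS[domain](text)         # layer 3: intent, given domain
    return {"category": category, "domain": domain, "intent": intent}

print(serial_label("turn the air conditioner to 26 degrees"))
```

Note how an error at the category layer would be inherited by the domain and intent layers, which is exactly the error-accumulation drawback discussed in the background section.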
Optionally, in this embodiment, labeling the text with the second processing manner to obtain the second labeling data may include:
s1, determining a second category and a third category corresponding to the text according to different classification modes;
S2, inputting the text into a second neural network according to the second category to obtain annotation data corresponding to the second category, and inputting the text into a third neural network according to the third category to obtain annotation data corresponding to the third category;
and S3, processing the annotation data corresponding to the second category and the annotation data corresponding to the third category according to preset conditions to obtain the second annotation data.
In practical applications, the second processing mode is a text verification and labeling method in which each layer is processed independently in parallel; it may be implemented with the RoBERTa algorithm. Fig. 4 shows a flow chart of the second processing mode.
As shown in fig. 4, unlike the first processing mode, the second processing mode splits the different language hierarchies into independent tag sets: a category set, a domain set, an intention set, and so on. Each set contains all labels of its hierarchy, so the sets are larger than in the previous method and the number of labels increases.
The second processing mode feeds the input text material to every layer for parallel analysis, and multiple layers can be executed simultaneously to obtain their analysis results, which improves the processing efficiency of the system. After the layered processing, the number of processing results is an integer multiple of the original input material, each carrying a per-layer labeling result. Once all outputs of one pass are obtained, the results of the different layers for the same text material are merged, and a complete processing result is obtained after integration. After all language materials are merged, the process yields a result similar to that of the first processing mode, containing analysis information on category, intention, domain, and other aspects.
The second processing mode weakens the constraint logic between layers: each label is selected from the whole large set for labeling and verification, there is no constraint relation between upper and lower layers, errors at an upper layer are not propagated downward, and the labeling boundaries between different categories, domains, and intentions are removed.
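The non-layered parallel pass can be sketched as follows, again with toy per-level classifiers (a real system might use one RoBERTa classification head per level); `concurrent.futures` stands in for the parallel execution, and all names are illustrative assumptions.

```python
# Sketch of the non-layered parallel pass: each level is labeled
# independently against its full label set, so no upper-layer error
# propagates downward; the per-level results are then merged.

from concurrent.futures import ThreadPoolExecutor

TOY_LABELS = {
    "category": "appliance",
    "domain": "air_conditioning",
    "intent": "set_temperature",
}

def classify(level, text):
    # Toy classifier over the whole label set of one level.
    return level, TOY_LABELS[level]

def parallel_label(text, levels=("category", "domain", "intent")):
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda lv: classify(lv, text), levels)
    return dict(results)  # merge the per-level results into one record

print(parallel_label("turn the air conditioner to 26 degrees"))
```

The merge step at the end corresponds to integrating the processing results of the different layers for the same language material into one complete result.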
Optionally, in this embodiment, after the two different processes, layered serial processing and non-layered parallel processing, two sets of different processing results are finally obtained for the same set of input data. Both sets of results come from machine verification, so some verification errors exist. To reject and correct these errors in a targeted manner, the results must be compared. The comparison uses two criteria:
1) Compare the verification results for the same language material: the verification result for a piece of data is judged correct if and only if the verification positions of its category bit, domain bit, intention bit, and so on are exactly the same in both results. Otherwise, the data is called "bad data". All bad data generated during comparison is put into a bad-data database.
2) For results that compare as identical, a further discrimination is performed to ensure accuracy: the discrimination probability of each label is extracted and computed, the probability values obtained by the two methods are weighted and averaged, and data whose weighted value is less than 0.9 is added to the bad-data database.
These two comparison criteria produce two batches of data: one batch can be regarded by default as having passed labeling verification and needs no secondary processing, while the data in the bad-data database must be input into a manual verification system for secondary manual verification and discrimination (equivalent to labeling the data according to the preset rule).
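The two comparison criteria can be sketched as follows. The equal 0.5/0.5 weights are an assumption for illustration; the text only specifies a weighted average with a 0.9 threshold.

```python
# Sketch of the two comparison criteria: (1) exact agreement on every
# level, and (2) for agreeing results, a weighted average of the two
# methods' label probabilities must reach 0.9, otherwise the sample is
# routed to the bad-data database.

def compare(result_a, result_b, probs_a, probs_b, weight_a=0.5, threshold=0.9):
    if result_a != result_b:
        return "bad_data"            # criterion 1: any level mismatch
    for level in result_a:
        score = weight_a * probs_a[level] + (1 - weight_a) * probs_b[level]
        if score < threshold:
            return "bad_data"        # criterion 2: low weighted confidence
    return "qualified"               # passes verification, no secondary processing

r = {"category": "appliance", "intent": "set_temperature"}
print(compare(r, dict(r),
              {"category": 0.98, "intent": 0.95},
              {"category": 0.96, "intent": 0.91}))
```

In this run both criteria pass (for example, the intent score is 0.5 * 0.95 + 0.5 * 0.91 = 0.93, above 0.9), so the sample is treated as qualified; any mismatch or low score would send it to the bad-data database for manual review.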
Through the embodiments provided in this application, a text to be labeled is acquired, where the text includes at least one target object to be labeled; the text is labeled with the layered, layer-by-layer serial first processing mode to obtain first labeling data, and with the non-layered parallel second processing mode to obtain second labeling data; the portions where the two differ are labeled according to a preset rule to obtain third labeling data, and the portions where they agree are labeled to obtain fourth labeling data; and the third and fourth labeling data are determined as the labeling data of the text. Combining the two labeling modes, comparing their results, and applying secondary processing to the data on which they differ improves the accuracy of text labeling data and solves the technical problem of low accuracy of text data labeling in the prior art.
As an alternative embodiment, after determining the third annotation data and the fourth annotation data as the annotation data of the text, the method may further include:
inputting the annotation data of the text into a target neural network model, and outputting the probability of executing a target operation on the target object;
in the event that the probability is greater than a predetermined threshold, executing the target operation in response to an instruction on the target object.
Executing the target operation in response to the instruction of the target object includes: deleting the annotation data of the target object in response to the instruction; or adding annotation data for the target object in response to the instruction; or updating the annotation data of the target object in response to the instruction.
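This probability-gated operation can be sketched as follows; the model is a stub and the threshold, operation, and annotation fields are hypothetical.

```python
# Sketch of the optional downstream step: a model scores the annotated
# target object, and the requested operation (delete/add/update
# annotation data) runs only when the probability clears the threshold.

def maybe_execute(annotation, operation, model, threshold=0.9):
    probability = model(annotation)      # probability of executing the operation
    if probability > threshold:
        return operation(annotation)     # respond to the instruction
    return None                          # below threshold: do not execute

result = maybe_execute(
    {"object": "person A", "label": "greeting"},
    lambda ann: ("updated", ann["label"]),   # stand-in update operation
    lambda ann: 0.95,                        # stub model score (assumption)
)
print(result)
```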
As an optional embodiment, the application also provides a marked data verification method based on the multi-level multi-model.
In the analysis of natural language, semantic analysis is roughly divided into several layers, from broad to fine, according to the rules and inherent logical associations of natural language, so that a sentence can be split into a form convenient for a machine to understand and represent. Fig. 5 schematically shows the text semantic hierarchy: categories, domains, intentions, and so on, where domains are contained within categories and the same domain may be divided under different categories; after several rounds of division, the text is finally dissected into keyword form. This hierarchical division is the basis on which the two different text processing modes are built.
As shown in fig. 6, a flowchart of a method for verifying labeled data based on a multi-level multi-model is provided. The method comprises the following specific modes:
Step 1: input the text to be processed into the two text processing systems as input data, where before input, the text samples undergo basic processing such as manual labeling.
And 2, respectively processing the text data to be processed of the same batch of input by the two text processing systems, thereby generating two input text processing processes and finally obtaining two processing results. The two processing procedures are respectively hierarchical layer-by-layer serial processing and a parallel processing method without distinguishing layers, and the two processing procedures are similar to the first processing mode and the second processing mode.
Step 3: compare the results produced by the two text processing flows. The comparison criteria include the analysis result and position at each level. The compared data are divided into two parts: results that are completely consistent and results that differ.
Step 4: the two parts of compared data are processed separately:
1) Data whose comparison results are identical are assumed, by default, to have been analyzed correctly and can be used directly as one part of the output.
2) Data whose comparison results differ are extracted and sent to a manual processing and verification stage, where trained practitioners analyze and annotate the text. The manually annotated data form the other part of the output.
Step 5: merge the two parts of text processing results and output them as the final result of the scheme. The result has undergone two rounds of machine analysis, one round of comparison and verification, and, for part of the data, manual verification, so its accuracy is high.
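The five steps above can be sketched as follows; the two labeling systems and the manual-review function are placeholder lambdas, and only the compare-and-route logic mirrors the described scheme:

```python
# Minimal sketch of the five-step verification flow: run both systems,
# accept agreements automatically, route disagreements to manual review,
# and merge both parts into the final output.

def verify_labels(texts, system_a, system_b, manual_review):
    auto_results, manual_queue = {}, []
    for text in texts:
        a, b = system_a(text), system_b(text)       # step 2: two processes
        if a == b:                                  # step 3: compare
            auto_results[text] = a                  # step 4.1: accept as-is
        else:
            manual_queue.append(text)               # step 4.2: route to humans
    for text in manual_queue:
        auto_results[text] = manual_review(text)
    return auto_results                             # step 5: merged output

# Placeholder systems and reviewer, for illustration only.
sys_a = lambda t: {"intent": "power_on"} if "on" in t else {"intent": "other"}
sys_b = lambda t: {"intent": "power_on"} if "light" in t else {"intent": "other"}
human = lambda t: {"intent": "reviewed"}

merged = verify_labels(["turn on light", "switch on fan"], sys_a, sys_b, human)
```

Here "turn on light" is accepted automatically because both systems agree, while "switch on fan" is routed to manual review because they disagree.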
By way of the embodiments provided herein, the following benefits may be realized:
1. It combines the advantages of manual and machine processing. Manual processing is more accurate, since results analyzed from a human viewpoint accord with common human understanding of natural language, but it consumes considerable time and human resources. Machine processing uses traditional text processing algorithms to verify and analyze labels, with high efficiency. Combining the two modes directs human effort to the parts that the machine has difficulty analyzing correctly, which ensures the accuracy of the analysis results while improving processing efficiency.
2. The machine stage employs two different processes. The hierarchical process preserves the internal logical associations of text analysis and records the semantic result obtained at each layer as it proceeds. The outputs of the two analysis processes are compared, and the identical parts can be output as correct results. This comparison scheme makes the machine processing results more reliable, and at the same time screens out the parts the machine has difficulty identifying and processing so they can enter the manual stage, improving efficiency.
3. A more reliable comparison and discrimination mechanism. When comparing the analysis results produced by the two machine methods, two criteria determine data quality. The first is traditional position-by-position comparison: each judged element is compared one by one, and the data are marked as bad whenever any judgment differs. The second applies to data whose judgments agree at every position: the probability at each recognized position is also examined, and the probability results of the two processes are combined by a weighted average. Because machine recognition has some unavoidable errors, even combined dual-algorithm recognition leaves some data incorrectly identified. The weighted average is therefore compared against an acceptance threshold of 0.9, and data scoring below 0.9 are also added to the bad data set. Together, these two discrimination mechanisms make the machine results more reliable.
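A minimal sketch of the two discrimination criteria, assuming equal weights of 0.5 for the two processes (the text fixes only the 0.9 threshold, not the weights, and the confidence values below are illustrative):

```python
# Two criteria for deciding whether a labeled sample is "good" data:
# (1) position-by-position label comparison: any mismatch -> bad data;
# (2) for positions that agree, a weighted average of the two systems'
#     probabilities must reach the 0.9 threshold.

def is_good_data(result_a, result_b, w_a=0.5, w_b=0.5, threshold=0.9):
    """result_* : list of (label, probability) per judged position."""
    if len(result_a) != len(result_b):
        return False
    for (label_a, p_a), (label_b, p_b) in zip(result_a, result_b):
        if label_a != label_b:                        # criterion 1: mismatch
            return False
        if w_a * p_a + w_b * p_b < threshold:         # criterion 2: low confidence
            return False
    return True

good = is_good_data([("power_on", 0.95)], [("power_on", 0.93)])  # 0.94 >= 0.9
bad  = is_good_data([("power_on", 0.95)], [("power_on", 0.80)])  # 0.875 < 0.9
```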
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
According to another aspect of the embodiment of the invention, a text data labeling device for implementing the text data labeling method is also provided. As shown in fig. 7, the apparatus includes: an acquisition unit 71, a first labeling unit 73, a second labeling unit 75, and a determination unit 77.
The obtaining unit 71 is configured to obtain a text to be annotated, where the text includes at least one target object to be annotated.
The first labeling unit 73 is configured to label the text by a first processing mode of hierarchical, layer-by-layer serial processing to obtain first labeling data, and to label the text by a second processing mode of parallel processing without distinguishing layers to obtain second labeling data.
The second labeling unit 75 is configured to label a portion where the first labeling data and the second labeling data have differences according to a preset rule, obtain third labeling data, and label a portion where the first labeling data and the second labeling data are the same, so as to obtain fourth labeling data.
A determining unit 77 for determining the third annotation data and the fourth annotation data as the annotation data of the text.
Alternatively, in this embodiment, the first labeling unit 73 may include:
The first obtaining module is used for determining a first category corresponding to the text, inputting the text into a first layer of the first neural network according to the first category, and obtaining annotation data corresponding to the first category;
the second obtaining module is used for inputting the labeling data corresponding to the first category into a second layer of the first neural network to obtain the first labeling data.
Alternatively, in this embodiment, the first labeling unit 73 may include:
the determining module is used for determining that the text corresponds to the second category and the third category according to different classification modes;
the third obtaining module is used for inputting the second class into a second neural network to obtain the annotation data corresponding to the second class, and inputting the third class into a third neural network to obtain the annotation data corresponding to the third class;
and the fourth obtaining module is used for processing the marking data corresponding to the second category and the marking data corresponding to the third category according to preset conditions to obtain the second marking data.
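A hypothetical sketch of the parallel (non-layered) second processing mode described by these modules: the text is classified under two independent schemes, each classification feeds its own model, and the two label sets are merged by a preset condition. The placeholder models and the union-style merge rule are assumptions, not the patent's implementation:

```python
# Parallel second processing mode: two independent classification views of
# the same text, merged by a preset rule (here, a simple dict union).

def parallel_label(text, scheme_a, scheme_b, merge_rule):
    labels_a = scheme_a(text)   # e.g. a domain-oriented model
    labels_b = scheme_b(text)   # e.g. an intent-oriented model
    return merge_rule(labels_a, labels_b)

domain_model = lambda t: {"domain": "appliance"}
intent_model = lambda t: {"intent": "power_on"}
union_merge  = lambda a, b: {**a, **b}   # preset condition: keep both views

second_labels = parallel_label("turn on the light", domain_model, intent_model, union_merge)
```

Unlike the serial mode, neither model here sees the other's output; the merge rule is the only point where the two views interact.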
Through the embodiments provided herein, the obtaining unit 71 obtains a text to be annotated, where the text includes at least one target object to be annotated; the first labeling unit 73 labels the text by the first, hierarchical layer-by-layer serial processing mode to obtain first labeling data, and labels the text by the second, parallel processing mode without distinguishing layers to obtain second labeling data; the second labeling unit 75 labels the parts where the first and second labeling data differ according to a preset rule to obtain third labeling data, and labels the parts where they are the same to obtain fourth labeling data; and the determining unit 77 determines the third and fourth labeling data as the labeling data of the text. This combines two labeling modes, compares them to find differing data, and applies secondary processing to that data, achieving the technical effect of improving the accuracy of text labeling data and thereby solving the technical problem of low accuracy of text data labeling in the prior art.
As an alternative embodiment, the apparatus may further include:
the obtaining unit is used for determining the third labeling data and the fourth labeling data as the labeling data of the text, inputting the labeling data of the text into the target neural network model, and outputting the probability of executing target operation on the target object;
and a response unit for executing the target operation in response to the instruction to the target object in case the probability is greater than the predetermined threshold.
Wherein the response unit includes:
the first response module is used for executing an annotation data deletion operation on the target object in response to the instruction to the target object; or
the second response module is used for executing an annotation data addition operation on the target object in response to the instruction to the target object; or
the third response module is used for executing an annotation data update operation on the target object in response to the instruction to the target object.
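The three response modules can be sketched as a small dispatch over a hypothetical annotation store keyed by target object; the operation names and store layout are assumptions for illustration:

```python
# Dispatch sketch for the three response modules: an instruction on a
# target object selects a delete, add, or update of its annotation data.

def respond(annotations, target, instruction, payload=None):
    if instruction == "delete":
        annotations.pop(target, None)                 # first response module
    elif instruction == "add":
        annotations.setdefault(target, []).append(payload)  # second module
    elif instruction == "update":
        annotations[target] = [payload]               # third module
    return annotations

store = {"light": ["power_on"]}
respond(store, "light", "add", "brightness")   # light: [power_on, brightness]
respond(store, "fan", "update", "power_off")   # fan: [power_off]
respond(store, "light", "delete")              # light removed
```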
According to a further aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the data tagging method of text as described above, as shown in fig. 8, the electronic device comprising a memory 802 and a processor 804, the memory 802 storing a computer program, the processor 804 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, acquiring a text to be annotated, wherein the text at least comprises one target object to be annotated;
s2, labeling the text by a first processing mode of hierarchical, layer-by-layer serial processing to obtain first labeling data, and labeling the text by a second processing mode of parallel processing without distinguishing layers to obtain second labeling data;
s3, marking the part with the difference between the first marking data and the second marking data according to a preset rule to obtain third marking data, and marking the part with the same first marking data and the same second marking data to obtain fourth marking data;
s4, determining the third labeling data and the fourth labeling data as the labeling data of the text.
Alternatively, those skilled in the art will understand that the structure shown in fig. 8 is only schematic. The electronic device may also be a terminal device such as a smart phone (e.g. an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), or a PAD. Fig. 8 does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components than shown in fig. 8 (e.g. a network interface), or have a different configuration from that shown in fig. 8.
The memory 802 may be used to store software programs and modules, such as program instructions/modules corresponding to the text data labeling method and apparatus in the embodiment of the present invention, and the processor 804 executes the software programs and modules stored in the memory 802, thereby executing various functional applications and data processing, that is, implementing the text data labeling method described above. Memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 802 may further include memory remotely located relative to processor 804, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may be used for storing information, such as a document to be annotated, annotation data corresponding to the document, and the like. As an example, as shown in fig. 8, the memory 802 may include, but is not limited to, the acquisition unit 71, the first labeling unit 73, the second labeling unit 75, and the determination unit 77 in the data labeling apparatus including the text. In addition, other module units in the data labeling apparatus of the above text may be further included, which is not described in detail in this example.
Optionally, the transmission device 806 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 806 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 806 is a Radio Frequency (RF) module for communicating wirelessly with the internet.
In addition, the electronic device further includes: a display 808 for displaying the document information to be processed; and a connection bus 810 for connecting the respective module parts in the above-described electronic device.
According to a further aspect of embodiments of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring a text to be annotated, wherein the text at least comprises one target object to be annotated;
s2, labeling the text by a first processing mode of hierarchical, layer-by-layer serial processing to obtain first labeling data, and labeling the text by a second processing mode of parallel processing without distinguishing layers to obtain second labeling data;
s3, marking the part with the difference between the first marking data and the second marking data according to a preset rule to obtain third marking data, and marking the part with the same first marking data and the same second marking data to obtain fourth marking data;
s4, determining the third labeling data and the fourth labeling data as the labeling data of the text.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, such as the division of the units, is merely a logical function division, and may be implemented in another manner, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (9)

1. A method for labeling text data, comprising:
obtaining a text to be marked, wherein the text at least comprises one target object to be marked;
Labeling the text by a first processing mode of layering and serial by layers to obtain first labeling data, and labeling the text by a second processing mode of parallel processing without layering to obtain second labeling data;
labeling the part with the difference between the first labeling data and the second labeling data according to a preset rule to obtain third labeling data, and labeling the part with the same first labeling data and the second labeling data to obtain fourth labeling data;
determining the third annotation data and the fourth annotation data as the annotation data of the text;
determining that the first sub-label data and the second sub-label data have differences under the condition that the verification and discrimination positions of the first sub-label data and the second sub-label data are identical and the weighted average value of discrimination probabilities corresponding to the first sub-label data and the second sub-label data is smaller than 0.9;
under the condition that the verification judging positions of the first sub-label data and the second sub-label data are not identical, determining that the first sub-label data and the second sub-label data have differences;
The first sub-annotation data are part of the annotation data in the first annotation data, the second sub-annotation data are part of the annotation data in the second annotation data, and the first sub-annotation data and the second sub-annotation data are data obtained by annotating a target sub-text in the text;
wherein after determining the third annotation data and the fourth annotation data as the annotation data of the text, the method further comprises: inputting the labeling data of the text into a target neural network model, and outputting the probability of executing target operation on a target object; and in the case that the probability is greater than a predetermined threshold, performing the target operation in response to an instruction to the target object.
2. The method of claim 1, wherein labeling the text with the first processing means to obtain first labeling data comprises:
determining a first category corresponding to the text, and inputting the text to a first layer of a first neural network according to the first category to obtain annotation data corresponding to the first category;
and inputting the labeling data corresponding to the first category into a second layer of the first neural network to obtain the first labeling data.
3. The method according to claim 1, wherein labeling the text with the second processing means to obtain second labeling data comprises:
determining that the text corresponds to a second category and a third category according to different classification modes;
inputting the second category into a second neural network to obtain annotation data corresponding to the second category, and inputting the third category into a third neural network to obtain annotation data corresponding to the third category;
and processing the annotation data corresponding to the second category and the annotation data corresponding to the third category according to preset conditions to obtain the second annotation data.
4. The method of claim 1, wherein performing the target operation in response to an instruction to the target object comprises:
executing an annotation data deletion operation on the target object in response to the instruction to the target object; or
executing an annotation data addition operation on the target object in response to the instruction to the target object; or
executing an annotation data update operation on the target object in response to the instruction to the target object.
5. A text data labeling apparatus, comprising:
The device comprises an acquisition unit, a marking unit and a marking unit, wherein the acquisition unit is used for acquiring a text to be marked, and the text at least comprises one target object to be marked;
the first labeling unit is used for labeling the text by a layered serial first processing mode to obtain first labeling data, and labeling the text by a parallel processing mode without distinguishing layers to obtain second labeling data;
the second labeling unit is used for labeling the part with the difference between the first labeling data and the second labeling data according to a preset rule to obtain third labeling data, and labeling the part with the same first labeling data and the second labeling data to obtain fourth labeling data;
the determining unit is used for determining the third annotation data and the fourth annotation data as the annotation data of the text;
the device is further used for determining that the first sub-label data and the second sub-label data have differences when the verification and discrimination positions of the first sub-label data and the second sub-label data are identical and the weighted average value of discrimination probabilities corresponding to the first sub-label data and the second sub-label data is smaller than 0.9;
The device is further used for determining that the first sub-annotation data and the second sub-annotation data have differences under the condition that the verification judging positions of the first sub-annotation data and the second sub-annotation data are not identical;
the first sub-annotation data are part of the annotation data in the first annotation data, the second sub-annotation data are part of the annotation data in the second annotation data, and the first sub-annotation data and the second sub-annotation data are data obtained by annotating a target sub-text in the text;
wherein the apparatus further comprises: the obtaining unit is used for determining the third labeling data and the fourth labeling data as the labeling data of the text, inputting the labeling data of the text into a target neural network model, and outputting the probability of executing target operation on a target object; and the response unit is used for responding to the instruction of the target object to execute the target operation under the condition that the probability is larger than a preset threshold value.
6. The apparatus of claim 5, wherein the first labeling unit comprises:
the first obtaining module is used for determining a first category corresponding to the text, inputting the text into a first layer of a first neural network according to the first category, and obtaining annotation data corresponding to the first category;
And the second obtaining module is used for inputting the labeling data corresponding to the first category into a second layer of the first neural network to obtain the first labeling data.
7. The apparatus of claim 5, wherein the first labeling unit further comprises:
the determining module is used for determining the second category and the third category corresponding to the text according to different classification modes;
the third obtaining module is used for inputting the second class into a second neural network to obtain the annotation data corresponding to the second class, and inputting the third class into a third neural network to obtain the annotation data corresponding to the third class;
and a fourth obtaining module, configured to process the annotation data corresponding to the second category and the annotation data corresponding to the third category according to a preset condition, and obtain the second annotation data.
8. A computer readable storage medium, characterized in that the storage medium comprises a stored program, wherein the program when run performs the method of any of the preceding claims 1 to 4.
9. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1-4 by means of the computer program.
CN202010712345.5A 2020-07-22 2020-07-22 Text data labeling method and device, storage medium and electronic device Active CN111859862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712345.5A CN111859862B (en) 2020-07-22 2020-07-22 Text data labeling method and device, storage medium and electronic device


Publications (2)

Publication Number Publication Date
CN111859862A CN111859862A (en) 2020-10-30
CN111859862B true CN111859862B (en) 2024-03-22


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600577B (en) * 2022-10-21 2023-05-23 文灵科技(北京)有限公司 Event segmentation method and system for news manuscript labeling
CN115638833B (en) * 2022-12-23 2023-03-31 保定网城软件股份有限公司 Monitoring data processing method and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010250814A (en) * 2009-04-14 2010-11-04 Nec (China) Co Ltd Part-of-speech tagging system, training device and method of part-of-speech tagging model
CN103530282A (en) * 2013-10-23 2014-01-22 北京紫冬锐意语音科技有限公司 Corpus tagging method and equipment
CN106707293A (en) * 2016-12-01 2017-05-24 百度在线网络技术(北京)有限公司 Obstacle recognition method and device for vehicles
WO2019095899A1 (en) * 2017-11-17 2019-05-23 中兴通讯股份有限公司 Material annotation method and apparatus, terminal, and computer readable storage medium
CN110427487A (en) * 2019-07-30 2019-11-08 中国工商银行股份有限公司 A kind of data mask method, device and storage medium
CN110598206A (en) * 2019-08-13 2019-12-20 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN110750694A (en) * 2019-09-29 2020-02-04 支付宝(杭州)信息技术有限公司 Data annotation implementation method and device, electronic equipment and storage medium
CN110909768A (en) * 2019-11-04 2020-03-24 北京地平线机器人技术研发有限公司 Method and device for acquiring marked data
CN111159404A (en) * 2019-12-27 2020-05-15 海尔优家智能科技(北京)有限公司 Text classification method and device
CN111159494A (en) * 2019-12-30 2020-05-15 北京航天云路有限公司 Multi-user concurrent processing data labeling method
CN111352348A (en) * 2018-12-24 2020-06-30 北京三星通信技术研究有限公司 Device control method, device, electronic device, and computer-readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010768B2 (en) * 2015-04-30 2021-05-18 Oracle International Corporation Character-based attribute value extraction system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Top-k Learning to Rank: Labeling, Ranking and Evaluation; Shuzi Niu et al.; ResearchGate; 1-10 *
Semantic Annotation of Deep Web Data Based on CPN Networks; Ma Anxiang; Gao Kening; Zhang Xiaohong; Zhang Bin; Journal of Northeastern University (Natural Science) (No. 6); 36-39 *
Text Sentiment Classification Method Based on Sentiment Word Attributes and Cloud Model; Sun Jinguang, Ma Zhifang; Computer Engineering; 211-215 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant