CN110427622A - Evaluation method, device and storage medium for corpus labeling
- Publication number: CN110427622A
- Application number: CN201910668462.3A
- Authority: CN (China)
- Prior art keywords: corpus, assessed, mark, remaining, initial
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to an evaluation method, device and storage medium for corpus labeling. The evaluation method includes: obtaining, from a corpus, at least one corpus entry to be evaluated and the first initial label of each entry to be evaluated; determining a first label for the entry to be evaluated according to the entry and the remaining entries in the corpus; determining a second label for the entry using a trained classification model; and determining, according to the first label and the second label, an evaluation result for the first initial label of the corresponding entry. During manual review, entries whose labels are of low accuracy can then be selected for checking according to the evaluation result, so that the entries in the corpus need not all be checked one by one, which reduces the workload of the labeling personnel and improves review efficiency.
Description
Technical field
This application relates to the field of computer technology, and in particular to an evaluation method, device and storage medium for corpus labeling.
Background
In an intelligent customer service system, the comprehension ability of the machine learning model is generally strengthened by manually labeling a large number of corpus entries, so that the problems described by users are better understood. However, because different labeling personnel may understand the same business differently, and because each of them usually has to complete a large volume of labeling, a certain proportion of the entries in the corpus are labeled incorrectly.
Therefore, to ensure labeling accuracy, the labeled entries need to be reviewed so that the incorrectly labeled ones can be found. Existing technical solutions rely mainly on manual review.
However, as the number of entries in the corpus grows, it becomes difficult for a manual reviewer to go through the existing entries in the corpus for reference, and the review is time-consuming and laborious.
Summary of the invention
The embodiments of the present application provide an evaluation method, device and storage medium for corpus labeling, so as to reduce the workload of manual corpus review and improve its efficiency.
An embodiment of the present application provides an evaluation method for corpus labeling, comprising:
obtaining, from a corpus, at least one corpus entry to be evaluated and the first initial label of each entry to be evaluated;
determining a first label for the entry to be evaluated according to the entry to be evaluated and the remaining entries in the corpus;
determining a second label for the entry to be evaluated using a trained classification model;
determining, according to the first label and the second label, an evaluation result for the first initial label of the corresponding entry to be evaluated.
Determining the first label for the entry to be evaluated according to the entry to be evaluated and the remaining entries in the corpus specifically includes:
determining the similarity between the entry to be evaluated and each remaining entry in the corpus;
determining similar entries from the remaining entries according to the similarity;
obtaining the second initial labels of the similar entries;
determining the first label of the entry to be evaluated according to the second initial labels.
Determining the similarity between the entry to be evaluated and each remaining entry in the corpus specifically includes:
determining first word vectors corresponding to each entry to be evaluated, and second word vectors corresponding to each remaining entry in the corpus;
determining a corresponding first sentence vector from the first word vectors, and a corresponding second sentence vector from the second word vectors;
calculating the similarity between the corresponding entry to be evaluated and the remaining entry from the first sentence vector and the second sentence vector.
Determining the first word vectors corresponding to each entry to be evaluated, and the second word vectors corresponding to each remaining entry in the corpus, specifically includes:
splitting each entry to be evaluated into multiple first character segments, and splitting each remaining entry in the corpus into multiple second character segments;
determining corresponding first keywords from the first character segments, and corresponding second keywords from the second character segments;
determining the corresponding first word vectors from the first keywords, and the corresponding second word vectors from the second keywords.
Determining the first label of the entry to be evaluated according to the second initial labels specifically includes:
placing similar entries that have the same second initial label in one group, to obtain at least one similar-entry group;
counting the number of similar entries in each similar-entry group;
taking the second initial label of the similar-entry group with the most entries as the first label of the entry to be evaluated.
Before determining the second label for the entry to be evaluated using the trained classification model, the method further includes:
obtaining a corpus sample set and the third initial label of each corpus sample in the corpus sample set;
training a preset classification model with the corpus sample set and the third initial labels, to obtain the trained classification model.
Determining, according to the first label and the second label, the evaluation result for the first initial label of the corresponding entry to be evaluated specifically includes:
judging whether the first initial label of the entry to be evaluated is identical to the corresponding first label and second label;
if the first initial label of the entry to be evaluated is identical to both the corresponding first label and the second label, taking a result indicating "correct" as the evaluation result of the first initial label of the entry to be evaluated;
if the first initial label of the entry to be evaluated is identical to either the corresponding first label or the second label, taking a result indicating "suspected error" as the evaluation result of the first initial label of the entry to be evaluated;
if the first initial label of the entry to be evaluated differs from both the corresponding first label and the second label, taking a result indicating "highly suspicious" as the evaluation result of the first initial label of the entry to be evaluated.
An embodiment of the present application also provides an evaluation device for corpus labeling, comprising:
an obtaining module, configured to obtain, from a corpus, at least one corpus entry to be evaluated and the first initial label of each entry to be evaluated;
a first determining module, configured to determine a first label for the entry to be evaluated according to the entry to be evaluated and the remaining entries in the corpus;
a second determining module, configured to determine a second label for the entry to be evaluated using a trained classification model;
a third determining module, configured to determine, according to the first label and the second label, an evaluation result for the first initial label of the corresponding entry to be evaluated.
The first determining module specifically includes:
a first determination unit, configured to determine the similarity between the entry to be evaluated and each remaining entry in the corpus;
a second determination unit, configured to determine similar entries from the remaining entries according to the similarity;
an acquiring unit, configured to obtain the second initial labels of the similar entries;
a third determination unit, configured to determine the first label of the entry to be evaluated according to the second initial labels.
The first determination unit specifically includes:
a first determining subunit, configured to determine first word vectors corresponding to each entry to be evaluated, and second word vectors corresponding to each remaining entry in the corpus;
a second determining subunit, configured to determine a corresponding first sentence vector from the first word vectors, and a corresponding second sentence vector from the second word vectors;
a calculation subunit, configured to calculate the similarity between the corresponding entry to be evaluated and the remaining entry from the first sentence vector and the second sentence vector.
The first determining subunit is specifically configured to:
split each entry to be evaluated into multiple first character segments, and split each remaining entry in the corpus into multiple second character segments;
determine corresponding first keywords from the first character segments, and corresponding second keywords from the second character segments;
determine the corresponding first word vectors from the first keywords, and the corresponding second word vectors from the second keywords.
The third determination unit is specifically configured to:
place similar entries that have the same second initial label in one group, to obtain at least one similar-entry group;
count the number of similar entries in each similar-entry group;
take the second initial label of the similar-entry group with the most entries as the first label of the entry to be evaluated.
The evaluation device for corpus labeling further includes a fourth determining module, which is configured to:
obtain a corpus sample set and the third initial label of each corpus sample in the corpus sample set;
train a preset classification model with the corpus sample set and the third initial labels, to obtain the trained classification model.
The third determining module is specifically configured to:
judge whether the first initial label of the entry to be evaluated is identical to the corresponding first label and second label;
if the first initial label of the entry to be evaluated is identical to both the corresponding first label and the second label, take a result indicating "correct" as the evaluation result of the first initial label of the entry to be evaluated;
if the first initial label of the entry to be evaluated is identical to either the corresponding first label or the second label, take a result indicating "suspected error" as the evaluation result of the first initial label of the entry to be evaluated;
if the first initial label of the entry to be evaluated differs from both the corresponding first label and the second label, take a result indicating "highly suspicious" as the evaluation result of the first initial label of the entry to be evaluated.
An embodiment of the present application also provides a computer-readable storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute any of the above evaluation methods for corpus labeling.
In the evaluation method, device and storage medium for corpus labeling provided by the present application, at least one corpus entry to be evaluated and the first initial label of each entry to be evaluated are obtained from a corpus; a first label for the entry to be evaluated is then determined according to the entry and the remaining entries in the corpus, and a second label is determined using a trained classification model; finally, an evaluation result for the first initial label of the corresponding entry is determined according to the first label and the second label. During manual review, entries whose labels are of low accuracy can therefore be selected for checking according to the evaluation result, so that the entries in the corpus need not all be checked one by one, which reduces the workload of the labeling personnel and improves review efficiency.
Brief description of the drawings
The specific embodiments of the application are described in detail below with reference to the accompanying drawings, which will make the technical solutions and other beneficial effects of the application apparent.
Fig. 1 is a schematic diagram of a scenario of the evaluation system for corpus labeling provided by an embodiment of the present application.
Fig. 2 is a flow diagram of the evaluation method for corpus labeling provided by an embodiment of the present application.
Fig. 3 is a flow diagram of S102 provided by an embodiment of the present application.
Fig. 4 is an execution flow diagram of S1024 provided by an embodiment of the present application.
Fig. 5 is another flow diagram of the evaluation method for corpus labeling provided by an embodiment of the present application.
Fig. 6 is another flow diagram of the evaluation method for corpus labeling provided by an embodiment of the present application.
Fig. 7 is a structural schematic diagram of the evaluation device for corpus labeling provided by an embodiment of the present application.
Fig. 8 is a structural schematic diagram of the first determining module 120 provided by an embodiment of the present application.
Fig. 9 is a structural schematic diagram of the server provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the application without creative effort fall within the protection scope of this application.
The embodiments of the present application provide an evaluation method, device and storage medium for corpus labeling.
Referring to Fig. 1, which is a schematic diagram of a scenario of the evaluation system for corpus labeling provided by an embodiment of the present application, the evaluation system may include any of the evaluation devices for corpus labeling provided by the embodiments of the present application. The evaluation device may be integrated in a server, such as the background server of an intelligent customer service system.
The server may obtain, from a corpus, at least one corpus entry to be evaluated and the first initial label of each entry to be evaluated; determine a first label for the entry to be evaluated according to the entry and the remaining entries in the corpus; determine a second label for the entry using a trained classification model; and determine, according to the first label and the second label, an evaluation result for the first initial label of the corresponding entry.
The corpus and the trained classification model may be stored on the server. The corpus may include several labeled entries, which may belong to the same application field, for example dialogue entries taken from customer service chat records, and the corpus may serve as the training corpus of a machine language understanding model. The label of each labeled entry in the corpus is the first initial label of that entry. The first initial label may have been assigned by labeling personnel, and its accuracy remains to be evaluated.
In addition, the evaluation system for corpus labeling may also include a client on which a corpus labeling tool is installed. The client may be a terminal such as a mobile phone, tablet computer or desktop computer, and allows the user to view the evaluation results of the first initial labels of the entries to be evaluated, so that the user can review the entries whose first initial labels are of low accuracy and correct those whose first initial labels are wrong.
For example, in Fig. 1, the server may obtain from the corpus entry 1 to be evaluated, "cannot withdraw", with its first initial label "withdrawal error", and entry 2 to be evaluated, "reputation point deduction has reached the upper limit", with its first initial label "restore reputation points". According to entry 1 and the remaining entries in the corpus, the server determines that the first label of entry 1 is "change withdrawal failed", and using the trained classification model it determines that the second label of entry 1 is also "change withdrawal failed". In the same way, the first label and the second label of entry 2 are both determined to be "restore reputation points". The server then determines, from the first and second labels of entry 1, that the evaluation result of the first initial label of entry 1 is "highly suspicious"; in the same way, the evaluation result of the first initial label of entry 2 is "correct". The server may then receive a request from the client to view the evaluation results of the first initial labels of the entries to be evaluated, and send those evaluation results to the client in response to the request.
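The three-way rule applied in the Fig. 1 example can be sketched in Python as follows. This is only an illustration: the function name `assess` and the result strings are hypothetical, the label texts are paraphrased translations, and the middle case is read as "exactly one of the two labels matches".

```python
def assess(initial_label, first_label, second_label):
    """Compare the manual (first initial) label with the two derived labels."""
    # Count how many of the two derived labels agree with the manual label.
    matches = int(initial_label == first_label) + int(initial_label == second_label)
    if matches == 2:
        return "correct"
    if matches == 1:
        return "suspected error"
    return "highly suspicious"


# Entry 1: both derived labels disagree with the manual label.
result_1 = assess("withdrawal error",
                  "change withdrawal failed",
                  "change withdrawal failed")
# Entry 2: both derived labels agree with the manual label.
result_2 = assess("restore reputation points",
                  "restore reputation points",
                  "restore reputation points")
print(result_1, "|", result_2)
```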
As shown in Fig. 2, which is a flow diagram of the evaluation method for corpus labeling provided by an embodiment of the present application, the specific flow of the evaluation method may be as follows:
S101. Obtain, from a corpus, at least one corpus entry to be evaluated and the first initial label of each entry to be evaluated.
The corpus may serve as the training corpus of a machine language understanding model. It includes several labeled entries, which belong to the same or closely related application fields, for example dialogue entries taken from customer chat records. In the prior art, the labels of the labeled entries in the corpus are usually assigned by labeling personnel. Because different labeling personnel may understand the same entry differently, and because each of them usually has to complete a large volume of labeling, a certain proportion of the entries in the corpus may be labeled incorrectly. The accuracy of the labels in the corpus therefore needs to be evaluated so that the incorrectly labeled entries can be found, which in turn improves the training effect of the machine language understanding model.
In this embodiment, the evaluation device for corpus labeling may randomly select one or more labeled entries from the corpus to obtain at least one entry to be evaluated. The first initial label of an entry to be evaluated may be the manual label of the corresponding labeled entry, and its accuracy remains to be evaluated.
S102. Determine a first label for the entry to be evaluated according to the entry to be evaluated and the remaining entries in the corpus.
The remaining entries in the corpus are the labeled entries in the corpus other than the entries to be evaluated. In this embodiment, the evaluation device may calculate, one by one, the similarity between an entry to be evaluated and each remaining entry in the corpus, take the remaining labeled entries that are most similar to this entry as its similar entries, and then determine the first label of this entry according to the similar entries.
Specifically, as shown in Fig. 3, S102 may include:
S1021. Determine the similarity between the entry to be evaluated and each remaining entry in the corpus.
Currently, methods for calculating the similarity of corpus entries mainly include the edit distance method, the Jaccard index method, the term frequency (TF) method, the term frequency-inverse document frequency (TF-IDF) method, and the word vector (Word2Vec) method.
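For contrast with the word-vector approach, a purely lexical measure such as the Jaccard index can be sketched in a few lines. This is an illustrative sketch only: the function name and the tokenized example inputs are hypothetical, and the entries are assumed to be segmented into words already.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard index: size of the intersection over size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty entries are identical
    return len(set_a & set_b) / len(set_a | set_b)


# Only "withdraw" is shared among three distinct words.
score = jaccard_similarity(["cannot", "withdraw"], ["withdraw", "failed"])
print(score)
```

Because such a measure only counts shared words, two entries that express the same meaning with different words score low.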
Because the Word2Vec method can take the semantic information of an entry into account, the similarity it produces is more accurate. In this embodiment, therefore, the evaluation device preferably uses the Word2Vec method to calculate the similarity between the entry to be evaluated and each remaining entry in the corpus. Specifically, when similarity is calculated with Word2Vec, the entry is first segmented into words and a word vector is obtained for each resulting word; all the word vectors of the entry are then summed and averaged to obtain the sentence vector of the entry; finally, the similarity of two entries is obtained by calculating the cosine of the angle between their sentence vectors.
In one embodiment, S1021 may include:
S1-1. Determine first word vectors corresponding to each entry to be evaluated, and second word vectors corresponding to each remaining entry in the corpus.
S1-1 may include:
S1-1-1. Split each entry to be evaluated into multiple first character segments, and split each remaining entry in the corpus into multiple second character segments.
The evaluation device may use a word segmentation method such as jieba to segment each entry to be evaluated and each remaining entry in the corpus, thereby obtaining the multiple first character segments of each entry to be evaluated and the multiple second character segments of each remaining entry in the corpus.
S1-1-2. Determine corresponding first keywords from the first character segments, and corresponding second keywords from the second character segments.
The first and second character segments obtained by word segmentation may contain stop words and non-text characters (such as punctuation marks and special symbols). These usually carry no real meaning yet occur very frequently. Therefore, to save storage space and improve the efficiency of machine learning, stop words and non-text characters may be removed from the first and second character segments to obtain the corresponding first keywords and second keywords.
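The filtering step can be sketched as follows. The stop-word list, the function names and the English example tokens are all hypothetical (the source does not give a stop-word list), and the segments are assumed to come from a prior word segmentation step.

```python
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is"}


def is_textual(segment):
    """True if the segment contains at least one letter or digit."""
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in segment)


def extract_keywords(segments, stop_words=STOP_WORDS):
    """Drop stop words and segments made only of punctuation or symbols."""
    return [s for s in segments if s not in stop_words and is_textual(s)]


keywords = extract_keywords(["the", "withdrawal", "failed", "!", "??"])
print(keywords)
```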
S1-1-3. Determine the corresponding first word vectors from the first keywords, and the corresponding second word vectors from the second keywords.
The evaluation device may use a trained word2vec word vector tool to convert the first keywords of each entry to be evaluated into the corresponding first word vectors, and to convert the second keywords of each remaining entry in the corpus into the corresponding second word vectors.
S1-2. Determine a corresponding first sentence vector from the first word vectors, and a corresponding second sentence vector from the second word vectors.
The evaluation device may construct the first sentence vector of each entry to be evaluated by taking a linearly weighted average of its first word vectors, and construct the corresponding second sentence vector in the same way.
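A linearly weighted average of word vectors can be sketched as below. The source does not say how the weights are chosen, so the `weights` parameter is an assumption; when it is omitted the sketch falls back to a plain average. Word vectors are represented as plain lists of floats rather than the output of any particular word2vec tool.

```python
def sentence_vector(word_vectors, weights=None):
    """Weighted average of equally sized word vectors (uniform by default)."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    if weights is None:
        weights = [1.0] * len(word_vectors)
    if len(weights) != len(word_vectors):
        raise ValueError("one weight per word vector is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(word_vectors[0])
    # Each component is the weighted sum of that component over all words,
    # divided by the total weight.
    return [
        sum(w * vec[d] for w, vec in zip(weights, word_vectors)) / total
        for d in range(dim)
    ]


plain = sentence_vector([[1.0, 0.0], [0.0, 1.0]])
weighted = sentence_vector([[2.0, 0.0], [0.0, 4.0]], weights=[3.0, 1.0])
print(plain, weighted)
```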
S1-3. Calculate the similarity between the corresponding entry to be evaluated and the remaining entry from the first sentence vector and the second sentence vector.
The similarity between the entry to be evaluated and each remaining entry in the corpus may be determined by calculating the cosine distance between the first sentence vector and the second sentence vector.
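A minimal sketch of this calculation is given below, reading the "cosine distance" of the text as the cosine of the angle between the two sentence vectors, so that a larger value means more similar. The function name is hypothetical, and returning 0.0 for a zero vector is a choice made here, not something the source specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```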
S1022. Determine similar entries from the remaining entries according to the similarity.
For each entry to be evaluated, after the similarity between that entry and each remaining entry in the corpus has been calculated, the evaluation device may select the labeled entries with the highest similarity from the remaining entries as the similar entries of that entry.
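The source does not state how many similar entries are kept or whether a threshold applies. One possible reading, keeping the `top_k` most similar entries above a minimum similarity, is sketched below; both parameters and the function name are assumptions made for illustration.

```python
def select_similar(similarities, top_k=10, min_similarity=0.0):
    """Return the ids of the most similar entries, best first.

    similarities maps the id of each remaining entry to its similarity
    with the entry being evaluated.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    kept = [entry_id for entry_id, score in ranked if score >= min_similarity]
    return kept[:top_k]


scores = {"a": 0.9, "b": 0.2, "c": 0.7}
print(select_similar(scores, top_k=2))
```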
S1023. Obtain the second initial labels of the similar entries.
The similar entries are labeled entries in the corpus. The second initial label of each similar entry is the label of the corresponding labeled entry, and may specifically be its manual label.
S1024. Determine the first label of the entry to be evaluated according to the second initial labels.
In this embodiment, for each entry to be evaluated, the evaluation device may determine its first label from the second initial labels of all of its similar entries. The first label of the entry may or may not be the same as its first initial label. If the two differ, the accuracy of the first initial label is in doubt and needs to be verified by labeling personnel.
S1024 may include:
S2-1. Place similar entries that have the same second initial label in one group, to obtain at least one similar-entry group.
For example, as shown in Fig. 4, an entry to be evaluated has 10 similar entries, numbered X1 to X10. The similar entries X1, X2, X5 and X7 share the second initial label L11; X3, X6 and X9 share the second initial label L12; X4 and X10 share the second initial label L13; and X8 has the second initial label L14. X1, X2, X5 and X7 can then be placed in a first similar-entry group, X3, X6 and X9 in a second similar-entry group, X4 and X10 in a third similar-entry group, and X8 in a fourth similar-entry group, giving four similar-entry groups.
S2-2. Count the number of similar entries in each similar-entry group.
Continuing the example above, the first, second, third and fourth similar-entry groups contain 4, 3, 2 and 1 similar entries respectively.
S2-3. Take the second initial label of the similar-entry group with the most entries as the first label of the entry to be evaluated.
Continuing the example above, as shown in Fig. 4, the similar-entry group with the most similar entries is the first similar-entry group, whose second initial label is L11; the first label of the entry to be evaluated is therefore L11.
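Steps S2-1 to S2-3 amount to a majority vote over the labels of the similar entries. The sketch below reproduces the Fig. 4 example with `collections.Counter`; the function name is hypothetical, and because the source does not say how ties are broken, the sketch simply keeps the label that was seen first among the tied ones.

```python
from collections import Counter


def first_label_by_vote(similar_labels):
    """Group similar entries by label and return the most common label.

    similar_labels maps each similar entry id to its second initial label.
    Returns (winning label, {label: number of entries in its group}).
    """
    if not similar_labels:
        raise ValueError("at least one similar entry is required")
    group_sizes = Counter(similar_labels.values())
    label, _count = group_sizes.most_common(1)[0]
    return label, dict(group_sizes)


fig4_labels = {
    "X1": "L11", "X2": "L11", "X3": "L12", "X4": "L13", "X5": "L11",
    "X6": "L12", "X7": "L11", "X8": "L14", "X9": "L12", "X10": "L13",
}
first_label, group_sizes = first_label_by_vote(fig4_labels)
print(first_label, group_sizes)
```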
S103. Determine a second label for the entry to be evaluated using a trained classification model.
In this embodiment, the evaluation device may input the entries to be evaluated into the trained classification model one by one, so that each entry is labeled again and its second label is obtained. The second label of an entry may or may not be the same as its first initial label. If the two differ, the accuracy of the first initial label is in doubt and needs to be verified by labeling personnel.
Specifically, in order to obtain the trained classification model, the following steps may also be performed before S103:
Step A. Obtain a corpus sample set and the third initial label of each corpus sample in the set.
In this embodiment, all labeled corpora in the corpus library can be used as corpus samples to obtain the corpus sample set, where the third initial label of each corpus sample is the label of the corresponding labeled corpus, which may specifically be its manual label.
Step B. Train a preset classification model using the corpus sample set and the third initial labels to obtain the trained classification model.
Specifically, the corpus labeling assessment device can first perform feature extraction, such as keyword extraction or feature-word extraction, on each corpus sample in the corpus sample set to construct the feature vector of that sample, and then train the preset classification model on the feature vectors and third initial labels of all corpus samples in the set.
The training process of the preset classification model can be expressed by the following formula:
Ci = f(Ti);
where Ti is corpus i represented as a feature vector, Ci is the label of corpus i, and f is the classification model. In the training stage, a number of (Ti, Ci) pairs are known, and f can be derived from them by machine learning. In this embodiment, methods such as one-hot encoding or a language model (n-gram) can be used to extract features from the corpus samples and obtain their feature vectors; a method such as a support vector machine (SVM) can then be used to learn from these feature vectors and obtain the trained classification model.
After the trained classification model is obtained, S103 uses it to label the corpus to be assessed again and obtain the corresponding second label. As before, feature extraction is first performed on the corpus to be assessed to obtain its feature vector; since f and Ti are both known at this point, Ci, namely the second label of the corpus to be assessed, can be calculated from the formula above.
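The relationship Ci = f(Ti) can be illustrated with a small, dependency-free sketch. Note the substitution: instead of the SVM mentioned above, this sketch uses a nearest-centroid classifier over character 2-gram counts, chosen only so the example runs without external libraries. The sample texts, label names and function names are all hypothetical.

```python
from collections import Counter, defaultdict
import math

def bigram_counts(text):
    """Character 2-gram counts, used here as the sparse feature vector Ti."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(samples):
    """Learn f from (text, third initial label) pairs by summing features per label."""
    centroids = defaultdict(Counter)
    for text, label in samples:
        centroids[label].update(bigram_counts(text))
    return dict(centroids)

def predict(model, text):
    """Ci = f(Ti): return the label whose centroid is most similar to Ti."""
    features = bigram_counts(text)
    return max(model, key=lambda label: cosine(features, model[label]))

# Hypothetical labeled corpus samples.
samples = [
    ("reset my password", "account"),
    ("forgot password again", "account"),
    ("refund my order", "billing"),
    ("order refund status", "billing"),
]
model = train(samples)
second_label = predict(model, "password reset help")
print(second_label)  # account
```

In practice the feature extractor and classifier would be replaced by whatever the implementation actually uses (for example one-hot or n-gram features with an SVM); only the train-then-predict shape is the point here.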
S104. Determine, according to the first label and the second label, the assessment result of the first initial label of the corresponding corpus to be assessed.
For each corpus to be assessed, the first label and the second label are both secondary labels relative to its first initial label, and the three labels are obtained by three different labeling methods. The differences between the first and second labels and the first initial label can therefore be used to assess the accuracy of the first initial label.
Specifically, as shown in Fig. 5, S104 may include:
S1041. Judge whether the first initial label of the corpus to be assessed is the same as the corresponding first label and second label; if it is the same as both, execute S1042; if it is the same as only one of them, execute S1043; if it is the same as neither, execute S1044.
S1042. Take a result indicating "correct" as the assessment result of the first initial label of the corpus to be assessed.
S1043. Take a result indicating "suspected error" as the assessment result of the first initial label of the corpus to be assessed.
S1044. Take a result indicating "highly suspicious" as the assessment result of the first initial label of the corpus to be assessed.
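The branching in S1041 to S1044 can be sketched as a three-way comparison. The function name and the result strings below are illustrative choices, not names taken from the source.

```python
def assess(first_initial_label, first_label, second_label):
    """Compare the first initial label with the two secondary labels (S1041)."""
    matches = (first_initial_label == first_label) + (first_initial_label == second_label)
    if matches == 2:
        return "correct"            # S1042: agrees with both secondary labels
    if matches == 1:
        return "suspected error"    # S1043: agrees with only one of them
    return "highly suspicious"      # S1044: agrees with neither

print(assess("L11", "L11", "L11"))  # correct
print(assess("L11", "L11", "L12"))  # suspected error
print(assess("L11", "L12", "L13"))  # highly suspicious
```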
Further, after the assessment result of the first initial label of the corpus to be assessed is obtained, that corpus can be marked as an assessed corpus, and at least one labeled corpus that has not yet been assessed can be obtained from the corpus library as the next corpus to be assessed. S102, S103 and S104 are then executed again, and the cycle repeats until all labeled corpora in the corpus library have been marked as assessed.
In addition, after the label accuracy of all labeled corpora in the corpus library has been assessed, a user can send, from the user interface for corpus labeling checks, a request to the corpus labeling assessment device to view the assessment results of the labeled corpora in the corpus library, so that the device can send the assessment results to that user interface in response. Specifically, when checking the labeled corpora in the corpus library, corpus labeling personnel can choose to verify only the corpora whose assessment result is "suspected error" or "highly suspicious", and correct the labels of the corpora confirmed to be mislabeled. This greatly reduces the workload of corpus labeling checks and improves their efficiency.
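The batch loop and the selective review described above can be sketched as follows. The batch size, the function names and the precomputed outcomes are assumptions made for illustration; `assess_one` stands in for one pass through S102 to S104.

```python
REVIEW_RESULTS = {"suspected error", "highly suspicious"}

def assess_library(labeled_ids, assess_one, batch_size=2):
    """Assess labeled corpora batch by batch until none is left unassessed."""
    results = {}
    pending = list(labeled_ids)
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        for corpus_id in batch:
            results[corpus_id] = assess_one(corpus_id)  # stands in for S102 to S104
    return results

def corpora_to_review(results):
    """Keep only the corpora that labeling personnel should verify."""
    return [cid for cid, result in results.items() if result in REVIEW_RESULTS]

# Hypothetical precomputed outcomes used in place of a real assessment.
fake_outcomes = {"c1": "correct", "c2": "suspected error", "c3": "highly suspicious"}
results = assess_library(fake_outcomes, fake_outcomes.get)
to_review = corpora_to_review(results)
print(to_review)  # ['c2', 'c3']
```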
As can be seen from the above, the corpus labeling assessment method provided in this embodiment obtains at least one corpus to be assessed and the first initial label of each corpus to be assessed from the corpus library, determines the first label of the corpus to be assessed according to that corpus and the remaining corpora in the corpus library, determines the second label of the corpus to be assessed using the trained classification model, and then determines the assessment result of the first initial label according to the first label and the second label. When corpora are checked manually, the corpora whose labels are likely to be inaccurate can be selected for checking according to the assessment results, so the corpora in the corpus library need not be checked one by one, which reduces the workload of corpus labeling personnel and improves the efficiency of corpus checks.
As shown in Fig. 6, which is another schematic flowchart of the corpus labeling assessment method provided by the embodiments of the present application, the specific flow of the method may be as follows:
S201. Obtain at least one corpus to be assessed and the first initial label of each corpus to be assessed from the corpus library.
For example, the corpus labeling assessment device can randomly obtain at least one labeled corpus from the corpus library as the corpus to be assessed, or it can first divide all labeled corpora in the corpus library into multiple parts and then take the labeled corpora of one part as the corpora to be assessed. The first initial label of a corpus to be assessed may be the manual label of the corresponding labeled corpus, and its accuracy remains to be assessed.
S202. Split each corpus to be assessed into multiple first character fields, and split each remaining corpus in the corpus library into multiple second character fields.
For example, suppose corpus a to be assessed is "Like this, only one official account can be pinned to the top; I want all official accounts pinned to the top together". The jieba word segmentation method can be applied to it to obtain multiple first character fields: "like this", "can only", "pin to top", "one", "official account", ",", "I", "want", " ", "all", "official account", "together" and "pin to top". The same method also applies to each remaining corpus in the corpus library.
S203. Determine the corresponding first keywords according to the first character fields, and determine the corresponding second keywords according to the second character fields.
Continuing the example above, the first character fields obtained by segmenting corpus a contain stop words and non-text characters that carry no practical meaning, such as the stop words "like this", "one" and "together", and the punctuation mark ",".
In this embodiment, a preset stop-word list can be used to remove stop words, such as function words and pronouns, from the first and second character fields. In addition, methods such as regular expressions can be used to filter out the non-text characters in the first and second character fields, yielding the corresponding first keywords and second keywords.
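A minimal sketch of S203 follows, assuming the character fields have already been produced by a segmenter. The stop-word list, the regular expression and the English tokens are illustrative assumptions; a real system would load its own preset stop-word list.

```python
import re

# Hypothetical preset stop-word list, mirroring the example above.
STOP_WORDS = {"like this", "one", "together"}

# A field consisting only of punctuation, symbols or whitespace counts as non-text.
NON_TEXT = re.compile(r"[\W_]+")

def extract_keywords(fields, stop_words=STOP_WORDS):
    """Drop stop words, empty fields and non-text fields, keeping field order."""
    return [
        field for field in fields
        if field.strip()                      # skip empty or blank fields
        and field not in stop_words           # stop-word removal
        and not NON_TEXT.fullmatch(field)     # non-text character filtering
    ]

fields = ["like this", "can only", "pin to top", "one", "official account", ",", "I", "want"]
print(extract_keywords(fields))
# ['can only', 'pin to top', 'official account', 'I', 'want']
```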
S204. Determine the corresponding first word vectors according to the first keywords, and determine the corresponding second word vectors according to the second keywords.
For example, the corpus labeling assessment device can use a trained word2vec word vector tool, combined with the semantic information of the corpora, to convert the first keywords of each corpus to be assessed into corresponding first word vectors, and to convert the second keywords of each remaining corpus in the corpus library into corresponding second word vectors.
S205. Determine the corresponding first sentence vector according to the first word vectors, and determine the corresponding second sentence vector according to the second word vectors.
For example, the first sentence vector of each corpus to be assessed can be constructed by taking a linearly weighted average of its first word vectors; likewise, the second sentence vector of each remaining corpus in the corpus library can be constructed by taking a linearly weighted average of its second word vectors.
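The linearly weighted average in S205 can be sketched as below. The source does not say how the weights are chosen, so the `weights` parameter and its uniform default are assumptions; the toy two-dimensional vectors stand in for real word2vec output.

```python
def sentence_vector(word_vectors, weights=None):
    """Build a sentence vector as the weighted average of its word vectors."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    if weights is None:
        weights = [1.0] * len(word_vectors)  # assumed default: plain average
    total = sum(weights)
    dim = len(word_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, word_vectors)) / total
        for d in range(dim)
    ]

# Toy word vectors (hypothetical values).
vectors = [[1.0, 0.0], [0.0, 1.0]]
print(sentence_vector(vectors))                      # [0.5, 0.5]
print(sentence_vector(vectors, weights=[3.0, 1.0]))  # [0.75, 0.25]
```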
S206. Calculate the similarity between the corresponding corpus to be assessed and each remaining corpus according to the first sentence vector and the second sentence vector.
For example, the similarity between a corpus to be assessed and each remaining corpus in the corpus library can be determined by calculating the cosine distance between the first sentence vector and the second sentence vector.
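As a sketch of S206, the cosine of the angle between two sentence vectors can be computed as follows. The function name is an assumption, and returning 0.0 for a zero vector is a choice made here, not something the source specifies. For arbitrary vectors the value can be negative, so a 0 to 1 range holds only when the vectors have no opposing components.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```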
S207. Determine similar corpora from the remaining corpora according to the similarity.
For example, for each corpus to be assessed, the labeled corpora among the remaining corpora whose similarity exceeds a preset threshold can be taken as the similar corpora of that corpus. The similarity ranges from 0 to 1, and the closer it is to 1, the more similar the two corpora are; the preset threshold may be 0.8.
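The threshold selection in S207 can be sketched as a simple filter. The similarity values and corpus ids below are made up for illustration; 0.8 is the example threshold given above, and returning the matches in descending order of similarity is an added convenience, not a requirement from the source.

```python
def select_similar(similarities, threshold=0.8):
    """Return ids of remaining corpora whose similarity exceeds the threshold."""
    selected = [cid for cid, score in similarities.items() if score > threshold]
    # Most similar first; ordering is an illustrative choice.
    return sorted(selected, key=lambda cid: similarities[cid], reverse=True)

# Hypothetical similarities between one corpus to be assessed and four remaining corpora.
similarities = {"X1": 0.93, "X2": 0.80, "X3": 0.85, "X4": 0.41}
print(select_similar(similarities))  # ['X1', 'X3']
```

Because the comparison is strict, a corpus whose similarity equals the threshold exactly (X2 above) is not selected.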
S208. Obtain the second initial labels of the similar corpora.
The similar corpora are labeled corpora in the corpus library, and the second initial label of each similar corpus is the label of the corresponding labeled corpus, which may specifically be its manual label.
S209. Determine the first label of the corpus to be assessed according to the second initial labels.
For the specific implementation of S209 in this embodiment, refer to the specific implementation of S1024 in the previous method embodiment; details are not repeated here.
S210. Determine the corresponding feature vector according to the first keywords.
For example, the n-gram method can be used to extract features from the first keywords of each corpus to be assessed to obtain the feature vector of that corpus. Because the first keywords are obtained from the corpus to be assessed through word segmentation, stop-word removal and non-text character filtering in turn, the dimension of the feature vector is effectively reduced, which in turn improves the classification efficiency of the classification model.
In addition, the n-gram method starts at the first character of the text, moves forward one character at a time, and takes a feature item of n characters at each position. For example, applying the 3-gram method to a four-character phrase (translated here as "deduction upper limit") yields two feature items: the first to third characters and the second to fourth characters. Extracting features with the n-gram method therefore captures the context of the first keywords, that is, the word order information of the corpus to be assessed.
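The sliding window described above can be sketched in a few lines. The placeholder string "ABCD" stands in for the four-character phrase in the example; the function name is an assumption.

```python
def char_ngrams(text, n):
    """Slide a window of n characters over the text, one character per step."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("ABCD", 3))  # ['ABC', 'BCD']
print(char_ngrams("ABCD", 2))  # ['AB', 'BC', 'CD']
```

A text shorter than n characters yields no feature items, which callers may need to handle.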
S211. Classify the feature vector using the trained classification model to obtain the second label of the corresponding corpus to be assessed.
In this embodiment, the feature vector of the corpus to be assessed is input into the trained classification model, which outputs the second label of that corpus. The trained classification model may be obtained by training on all labeled corpora in the corpus library.
S212. Determine, according to the first label and the second label, the assessment result of the first initial label of the corresponding corpus to be assessed.
In this embodiment, for each corpus to be assessed, the first label and the second label are both secondary labels relative to its first initial label, and the three labels are obtained by three different labeling methods. The differences between the first and second labels and the first initial label can therefore be used to assess the accuracy of the first initial label. For the specific implementation of S212, refer to the specific implementation of S104 in the previous method embodiment; details are not repeated here.
In addition, S204 to S209, which obtain the first label of the corpus to be assessed, use unsupervised machine learning and take the semantic information of the corpus into account, while S210 and S211, which obtain the second label, use supervised machine learning and take the word order information of the corpus into account. The accuracy assessment of the first initial label therefore combines supervised and unsupervised machine learning and fully considers both the semantics and the word order of the corpus, which helps improve the accuracy of the assessment result.
On the basis of the method described in the above embodiments, this embodiment provides a further description from the perspective of the corpus labeling assessment device. Referring to Fig. 7, which shows the corpus labeling assessment device provided by the embodiments of the present application in detail, the device may include an obtaining module 110, a first determining module 120, a second determining module 130 and a third determining module 140, where:
(1) Obtaining module 110
The obtaining module 110 is configured to obtain at least one corpus to be assessed and the first initial label of each corpus to be assessed from the corpus library.
The corpus library can serve as the training corpus of a machine language understanding model and includes a number of labeled corpora that belong to the same or similar application fields, for example dialogue corpora from customer chat records. In this embodiment, the obtaining module 110 can obtain one or more labeled corpora from the corpus library to obtain at least one corpus to be assessed, where the first initial label of a corpus to be assessed may be the manual label of the corresponding labeled corpus, and its accuracy remains to be assessed.
(2) First determining module 120
The first determining module 120 is configured to determine the first label of the corpus to be assessed according to the corpus to be assessed and the remaining corpora in the corpus library.
In this embodiment, the remaining corpora in the corpus library are the labeled corpora in the library other than the corpus to be assessed.
As shown in Fig. 8, the first determining module 120 may specifically include:
(a) A first determination unit 121, configured to determine the similarity between the corpus to be assessed and each remaining corpus in the corpus library.
At present, methods for calculating corpus similarity mainly include the edit distance method, the Jaccard index method, the term frequency (TF) method, the term frequency-inverse document frequency (TF-IDF) method, and the word vector (Word2Vec) method. The Word2Vec method can take the semantic information of the corpora into account, and the corpus similarity it produces is more accurate. In this embodiment, therefore, the first determination unit 121 preferably uses the Word2Vec method to calculate the similarity between the corpus to be assessed and each remaining corpus in the corpus library.
In one embodiment, the first determination unit 121 may specifically include:
(a1) A first determining subunit, configured to determine the first word vectors corresponding to each corpus to be assessed, and to determine the second word vectors corresponding to each remaining corpus in the corpus library.
The first determining subunit may be specifically configured to:
split each corpus to be assessed into multiple first character fields, and split each remaining corpus in the corpus library into multiple second character fields;
determine the corresponding first keywords according to the first character fields, and determine the corresponding second keywords according to the second character fields;
determine the corresponding first word vectors according to the first keywords, and determine the corresponding second word vectors according to the second keywords.
(a2) A second determining subunit, configured to determine the corresponding first sentence vector according to the first word vectors, and to determine the corresponding second sentence vector according to the second word vectors.
The second determining subunit can construct the first sentence vector of each corpus to be assessed by taking a linearly weighted average of its first word vectors, and construct the corresponding second sentence vector in the same way.
(a3) A calculating subunit, configured to calculate the similarity between the corresponding corpus to be assessed and each remaining corpus according to the first sentence vector and the second sentence vector.
The calculating subunit can determine the similarity between the corresponding corpus to be assessed and each remaining corpus in the corpus library by calculating the cosine distance between the first sentence vector and the second sentence vector.
(b) A second determination unit 122, configured to determine similar corpora from the remaining corpora according to the similarity.
For each corpus to be assessed, after the similarity between that corpus and each remaining corpus in the corpus library has been calculated, the second determination unit 122 can select the labeled corpora with higher similarity from the remaining corpora as the similar corpora of that corpus.
(c) An obtaining unit 123, configured to obtain the second initial labels of the similar corpora.
The similar corpora are labeled corpora in the corpus library, and the second initial label of each similar corpus is the label of the corresponding labeled corpus, which may specifically be its manual label.
(d) A third determination unit 124, configured to determine the first label of the corpus to be assessed according to the second initial labels.
In this embodiment, for each corpus to be assessed, the third determination unit 124 can determine its first label based on the second initial labels of all of its similar corpora. The first label may differ from or match the first initial label; if the two differ, the accuracy of the first initial label of that corpus is in doubt and needs to be verified by corpus labeling personnel.
In one embodiment, the third determination unit 124 may be specifically configured to:
group the similar corpora that share the same second initial label, obtaining at least one similar corpus group;
count the number of similar corpora in each similar corpus group;
take the second initial label of the similar corpus group with the most entries as the first label of the corpus to be assessed.
(3) Second determining module 130
The second determining module 130 is configured to determine the second label of the corpus to be assessed using the trained classification model.
In this embodiment, the second determining module 130 can input the corpora to be assessed one by one into the trained classification model, so as to label each corpus to be assessed again and obtain its second label. The second label of a corpus to be assessed may differ from or match its first initial label; if the two differ, the accuracy of the first initial label of that corpus is in doubt and needs to be verified by corpus labeling personnel.
The trained classification model may be obtained by training on all labeled corpora in the corpus library.
(4) Third determining module 140
The third determining module 140 is configured to determine, according to the first label and the second label, the assessment result of the first initial label of the corresponding corpus to be assessed.
In this embodiment, for each corpus to be assessed, the first label and the second label are both secondary labels relative to its first initial label, and the three labels are obtained by three different labeling methods. The differences between the first and second labels and the first initial label can therefore be used to assess the accuracy of the first initial label.
The third determining module 140 may be specifically configured to:
judge whether the first initial label of the corpus to be assessed is the same as the corresponding first label and second label;
if the first initial label of the corpus to be assessed is the same as both the first label and the second label, take a result indicating "correct" as the assessment result of the first initial label of the corpus to be assessed;
if the first initial label of the corpus to be assessed is the same as only one of the first label and the second label, take a result indicating "suspected error" as the assessment result of the first initial label of the corpus to be assessed;
if the first initial label of the corpus to be assessed differs from both the first label and the second label, take a result indicating "highly suspicious" as the assessment result of the first initial label of the corpus to be assessed.
Further, the corpus labeling assessment device may also include a fourth determining module, which may be specifically configured to:
obtain a corpus sample set and the third initial label of each corpus sample in the set;
train a preset classification model using the corpus sample set and the third initial labels to obtain the trained classification model.
Specifically, the fourth determining module can use all labeled corpora in the corpus library as corpus samples to obtain the corpus sample set, where the third initial label of each corpus sample is the label of the corresponding labeled corpus, which may specifically be its manual label. The fourth determining module can also first perform feature extraction, such as keyword extraction or feature-word extraction, on each corpus sample in the corpus sample set to construct the feature vector of that sample, and then train the preset classification model on the feature vectors and third initial labels of all corpus samples in the set.
The training process of the preset classification model can be expressed by the following formula:
Ci = f(Ti);
where Ti is corpus i represented as a feature vector, Ci is the label of corpus i, and f is the classification model. In the training stage, a number of (Ti, Ci) pairs are known, and f can be derived from them by machine learning. In this embodiment, the fourth determining module can use methods such as one-hot encoding or a language model (n-gram) to extract features from the corpus samples and obtain their feature vectors; a method such as a support vector machine (SVM) can then be used to learn from these feature vectors and obtain the trained classification model.
After the trained classification model is obtained, the second determining module 130 uses it to label the corpus to be assessed again and obtain the corresponding second label. As before, the second determining module 130 first performs feature extraction on the corpus to be assessed to obtain its feature vector; since f and Ti are both known at this point, Ci, namely the second label of the corpus to be assessed, can be calculated from the formula above.
In addition, after the third determining module 140 obtains the assessment result of the first initial label of the corpus to be assessed, that corpus can be marked as an assessed corpus, and the obtaining module 110 can be triggered to obtain at least one labeled corpus that has not yet been assessed from the corpus library as the next corpus to be assessed. The accuracy of the first initial label of that corpus is then assessed in turn, and the cycle repeats until all labeled corpora in the corpus library have been marked as assessed.
Further, after the corpus labeling assessment device has assessed the label accuracy of all labeled corpora in the corpus library, a user can send, from the user interface for corpus labeling checks, a request to the device to view the assessment results of the labeled corpora in the corpus library, so that the device can send the assessment results to that user interface in response. Specifically, when checking the labeled corpora in the corpus library, corpus labeling personnel can choose to verify only the corpora whose assessment result is "suspected error" or "highly suspicious", and correct the labels of the corpora confirmed to be mislabeled. This greatly reduces the workload of corpus labeling checks and improves their efficiency.
In specific implementations, each of the above subunits, units and modules may be implemented as an independent entity, or they may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each subunit, unit and module, refer to the foregoing method embodiments; details are not repeated here.
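One way to picture the module structure is as a small object that wires four interchangeable callables together. This is only an illustrative sketch: the class name, field names and the stub behaviors are hypothetical, and each field stands in for the corresponding module (110, 120, 130, 140) rather than implementing it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class LabelAssessmentDevice:
    acquire: Callable[[], List[Tuple[str, str]]]  # module 110: (corpus, first initial label)
    first_label: Callable[[str], str]             # module 120: similarity-based label
    second_label: Callable[[str], str]            # module 130: classifier-based label
    judge: Callable[[str, str, str], str]         # module 140: assessment result

    def run(self) -> Dict[str, str]:
        """Assess every acquired corpus and return corpus -> assessment result."""
        results = {}
        for corpus, initial in self.acquire():
            results[corpus] = self.judge(
                initial, self.first_label(corpus), self.second_label(corpus)
            )
        return results

# Stub modules with fixed, made-up behavior.
device = LabelAssessmentDevice(
    acquire=lambda: [("c1", "A"), ("c2", "B")],
    first_label=lambda corpus: "A",
    second_label=lambda corpus: "A",
    judge=lambda initial, first, second: "correct" if initial == first == second else "check",
)
device_results = device.run()
print(device_results)  # {'c1': 'correct', 'c2': 'check'}
```

Keeping each module behind a plain callable reflects the statement above that the modules may be realized as independent entities or combined into one.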
As can be seen from the above, the corpus labeling assessment device provided in this embodiment obtains at least one corpus to be assessed and the first initial label of each corpus to be assessed from the corpus library, determines the first label of the corpus to be assessed according to that corpus and the remaining corpora in the corpus library, determines the second label of the corpus to be assessed using the trained classification model, and then determines the assessment result of the first initial label according to the first label and the second label. When corpora are checked manually, the corpora whose labels are likely to be inaccurate can be selected for checking according to the assessment results, so the corpora in the corpus library need not be checked one by one, which reduces the workload of corpus labeling personnel and improves the efficiency of corpus checks.
Correspondingly, the embodiments of the present application also provide a server. Fig. 9 shows a schematic structural diagram of the server involved in the embodiments of the present application. Specifically:
The server may include components such as a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a radio frequency (RF) circuit 403, a power supply 404, an input unit 405 and a display unit 406. Those skilled in the art will understand that the server structure shown in Fig. 9 does not limit the server, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components. Specifically:
The processor 401 is the control center of the server. It connects the various parts of the entire server through various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the server as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.
The memory 402 can be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system, the application programs required by at least one function (such as a sound playback function or an image playback function) and the like, and the data storage area can store data created through use of the server and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The RF circuit 403 may be used to receive and send signals while information is being transmitted and received. In particular, after receiving downlink information from a base station, it hands the information to one or more processors 401 for processing; in addition, it sends uplink data to the base station. Typically, the RF circuit 403 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 403 may communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The server further includes a power supply 404 (such as a battery) that powers the components. Preferably, the power supply 404 may be logically connected to the processor 401 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The power supply 404 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The server may further include an input unit 405, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control. Specifically, in one embodiment, the input unit 405 may include a touch-sensitive surface and other input devices. The touch-sensitive surface, also called a touch display screen or trackpad, collects touch operations by the user on or near it (such as operations performed by the user on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connected device according to a preset program. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and passes the signal to the touch controller. The touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 401, and can receive and execute commands sent by the processor 401. The touch-sensitive surface may be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch-sensitive surface, the input unit 405 may also include other input devices. Specifically, the other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The server may further include a display unit 406, which may be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the server. These graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The display unit 406 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch-sensitive surface may cover the display panel. After detecting a touch operation on or near it, the touch-sensitive surface passes the operation to the processor 401 to determine the type of the touch event, and the processor 401 then provides corresponding visual output on the display panel according to the type of the touch event. Although in Fig. 9 the touch-sensitive surface and the display panel are two independent components that implement the input and output functions, in some embodiments the touch-sensitive surface and the display panel may be integrated to implement the input and output functions.
Although not shown, the server may also include a camera, a Bluetooth module, and the like, which are not described here. Specifically, in this embodiment, the processor 401 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402 to implement various functions, as follows:
Obtain, from a corpus library, at least one corpus to be assessed and a first initial mark of each corpus to be assessed;
Determine a first mark of the corpus to be assessed according to the corpus to be assessed and the remaining corpora in the corpus library;
Determine a second mark of the corpus to be assessed using a trained classification model;
Determine, according to the first mark and the second mark, an assessment result for the first initial mark of the corresponding corpus to be assessed.
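The four steps above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the application: the names `assess_label` and `assess_corpus`, the result strings, and the two callables `neighbor_label` and `classifier_label` are hypothetical stand-ins for the similarity-based step and the trained classification model. The three-way outcome follows the comparison rule stated in claim 7.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical result names; the application only distinguishes three outcomes.
CORRECT = "correct"
SUSPECTED_ERROR = "suspected error"
HIGHLY_SUSPICIOUS = "highly suspicious"


def assess_label(initial: str, first: str, second: str) -> str:
    """Compare the first initial mark with the first mark and the second mark."""
    matches = int(initial == first) + int(initial == second)
    if matches == 2:
        return CORRECT            # agrees with both derived marks
    if matches == 1:
        return SUSPECTED_ERROR    # agrees with exactly one derived mark
    return HIGHLY_SUSPICIOUS      # agrees with neither derived mark


def assess_corpus(
    items: List[Tuple[str, str]],
    neighbor_label: Callable[[str], str],
    classifier_label: Callable[[str], str],
) -> Dict[str, str]:
    """Return an assessment result for each (text, first initial mark) pair."""
    results: Dict[str, str] = {}
    for text, initial in items:
        first = neighbor_label(text)     # mark derived from the remaining corpora
        second = classifier_label(text)  # mark predicted by the trained model
        results[text] = assess_label(initial, first, second)
    return results
```

Passing the two marking steps in as callables keeps the comparison logic independent of how similarity or classification is actually computed.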
The server can achieve the beneficial effects achievable by any of the corpus labeling assessment devices provided by the embodiments of the present application. For details, see the preceding embodiments, which are not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
The corpus labeling assessment method, device, and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
1. An assessment method for corpus labeling, characterized by comprising:
Obtaining, from a corpus library, at least one corpus to be assessed and a first initial mark of each corpus to be assessed;
Determining a first mark of the corpus to be assessed according to the corpus to be assessed and the remaining corpora in the corpus library;
Determining a second mark of the corpus to be assessed using a trained classification model;
Determining, according to the first mark and the second mark, an assessment result for the first initial mark of the corresponding corpus to be assessed.
2. The assessment method according to claim 1, characterized in that determining the first mark of the corpus to be assessed according to the corpus to be assessed and the remaining corpora in the corpus library specifically comprises:
Determining the similarity between the corpus to be assessed and each remaining corpus in the corpus library;
Determining similar corpora from the remaining corpora according to the similarity;
Obtaining second initial marks of the similar corpora;
Determining the first mark of the corpus to be assessed according to the second initial marks.
3. The assessment method according to claim 2, characterized in that determining the similarity between the corpus to be assessed and each remaining corpus in the corpus library specifically comprises:
Determining a first word vector corresponding to each corpus to be assessed, and determining a second word vector corresponding to each remaining corpus in the corpus library;
Determining a corresponding first sentence vector according to the first word vector, and determining a corresponding second sentence vector according to the second word vector;
Calculating the similarity between the corresponding corpus to be assessed and remaining corpus according to the first sentence vector and the second sentence vector.
4. The assessment method according to claim 3, characterized in that determining the first word vector corresponding to each corpus to be assessed and determining the second word vector corresponding to each remaining corpus in the corpus library specifically comprises:
Splitting each corpus to be assessed into multiple first character segments, and splitting each remaining corpus in the corpus library into multiple second character segments;
Determining corresponding first keywords according to the first character segments, and determining corresponding second keywords according to the second character segments;
Determining the corresponding first word vector according to the first keywords, and determining the corresponding second word vector according to the second keywords.
5. The assessment method according to claim 2, characterized in that determining the first mark of the corpus to be assessed according to the second initial marks specifically comprises:
Grouping the similar corpora that have the same second initial mark into one group, to obtain at least one similar corpus group;
Counting the number of similar corpora in each similar corpus group;
Taking the second initial mark corresponding to the similar corpus group with the largest number of similar corpora as the first mark of the corpus to be assessed.
6. The assessment method according to claim 1, characterized in that, before determining the second mark of the corpus to be assessed using the trained classification model, the method further comprises:
Obtaining a corpus sample set and a third initial mark of each corpus sample in the corpus sample set;
Training a preset classification model using the corpus sample set and the third initial marks, to obtain the trained classification model.
7. The assessment method according to claim 1, characterized in that determining, according to the first mark and the second mark, the assessment result for the first initial mark of the corresponding corpus to be assessed specifically comprises:
Judging whether the first initial mark of the corpus to be assessed is identical to the corresponding first mark and second mark;
If the first initial mark of the corpus to be assessed is identical to both the corresponding first mark and the second mark, taking a result indicating correctness as the assessment result for the first initial mark of the corpus to be assessed;
If the first initial mark of the corpus to be assessed is identical to either the corresponding first mark or the second mark, taking a result indicating a suspected error as the assessment result for the first initial mark of the corpus to be assessed;
If the first initial mark of the corpus to be assessed differs from both the corresponding first mark and the second mark, taking a result indicating high suspicion as the assessment result for the first initial mark of the corpus to be assessed.
8. An assessment device for corpus labeling, characterized by comprising:
An obtaining module, configured to obtain, from a corpus library, at least one corpus to be assessed and a first initial mark of each corpus to be assessed;
A first determining module, configured to determine a first mark of the corpus to be assessed according to the corpus to be assessed and the remaining corpora in the corpus library;
A second determining module, configured to determine a second mark of the corpus to be assessed using a trained classification model;
A third determining module, configured to determine, according to the first mark and the second mark, an assessment result for the first initial mark of the corresponding corpus to be assessed.
9. The assessment device according to claim 8, characterized in that the first determining module specifically comprises:
A first determining unit, configured to determine the similarity between the corpus to be assessed and each remaining corpus in the corpus library;
A second determining unit, configured to determine similar corpora from the remaining corpora according to the similarity;
An obtaining unit, configured to obtain second initial marks of the similar corpora;
A third determining unit, configured to determine the first mark of the corpus to be assessed according to the second initial marks.
10. A computer-readable storage medium, characterized in that the storage medium stores a plurality of instructions, the instructions being suitable for being loaded by a processor to perform the assessment method for corpus labeling according to any one of claims 1 to 7.
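The similarity-based determination of the first mark recited in claims 2 to 5 can be illustrated with the sketch below. Several details are assumptions made only for illustration, since the claims do not fix them: keywords are taken as already extracted, a sentence vector is formed by averaging word vectors, similarity is measured with cosine similarity, and the similar corpora are the `k` most similar remaining corpora. The function names and the `word_vectors` lookup table are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def sentence_vector(keywords: Sequence[str], word_vectors: Dict[str, Vector]) -> Vector:
    """Average the word vectors of the keywords (averaging is an assumption)."""
    known = [word_vectors[w] for w in keywords if w in word_vectors]
    if not known:
        raise ValueError("no keyword has a word vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_mark(target: Vector, remaining: List[Tuple[Vector, str]], k: int = 3) -> str:
    """Pick the most frequent initial mark among the k most similar remaining corpora."""
    if not remaining:
        raise ValueError("no remaining corpora to compare against")
    ranked = sorted(
        remaining,
        key=lambda item: cosine_similarity(target, item[0]),
        reverse=True,
    )
    # Group the similar corpora by mark and count each group.
    votes = Counter(mark for _, mark in ranked[:k])
    return votes.most_common(1)[0][0]
```

Each entry of `remaining` pairs the sentence vector of a remaining corpus with its second initial mark, so the vote in `first_mark` corresponds to choosing the largest similar corpus group.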
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910668462.3A CN110427622B (en) | 2019-07-23 | 2019-07-23 | Corpus labeling evaluation method, corpus labeling evaluation device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910668462.3A CN110427622B (en) | 2019-07-23 | 2019-07-23 | Corpus labeling evaluation method, corpus labeling evaluation device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110427622A (en) | 2019-11-08 |
CN110427622B (en) | 2024-08-13 |
Family
ID=68412045
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910668462.3A Active CN110427622B (en) | 2019-07-23 | 2019-07-23 | Corpus labeling evaluation method, corpus labeling evaluation device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110427622B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060074634A1 (en) * | 2004-10-06 | 2006-04-06 | International Business Machines Corporation | Method and apparatus for fast semi-automatic semantic annotation |
CN102662930A (en) * | 2012-04-16 | 2012-09-12 | 乐山师范学院 | Corpus tagging method and corpus tagging device |
CN108170670A (en) * | 2017-12-08 | 2018-06-15 | 东软集团股份有限公司 | Distribution method, device, readable storage medium storing program for executing and the electronic equipment of language material to be marked |
CN109992763A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | Language marks processing method, system, electronic equipment and computer-readable medium |
CN109739956A (en) * | 2018-11-08 | 2019-05-10 | 第四范式(北京)技术有限公司 | Corpus cleaning method, device, equipment and medium |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111144088A (en) * | 2019-12-09 | 2020-05-12 | 深圳市优必选科技股份有限公司 | Corpus management method, corpus management device and electronic equipment |
CN112329430A (en) * | 2021-01-04 | 2021-02-05 | 恒生电子股份有限公司 | Model training method, text similarity determination method and text similarity determination device |
CN112925910A (en) * | 2021-02-25 | 2021-06-08 | 中国平安人寿保险股份有限公司 | Method, device and equipment for assisting corpus labeling and computer storage medium |
CN112925910B (en) * | 2021-02-25 | 2024-10-25 | 中国平安人寿保险股份有限公司 | Auxiliary corpus labeling method, device, equipment and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110427622B (en) | 2024-08-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||