CN110287482A - Semi-automation participle corpus labeling training device - Google Patents
Semi-automation participle corpus labeling training device
- Publication number
- CN110287482A (application CN201910455093.XA)
- Authority
- CN
- China
- Prior art keywords
- participle
- model
- corpus
- mark
- automatic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
Abstract
A semi-automatic word-segmentation (participle) corpus annotation and training device, intended to overcome the drawbacks of existing corpus handling in the word-segmentation corpus annotation and training process. The technical scheme is as follows: a text corpus annotation preparation module manages the corpora to be annotated and the segmented corpora and, using multiple segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, CRF and JIEBA, submits the raw-corpus segmentation annotation work to a semi-automatic corpus segmentation annotation module; that module creates segmentation annotation tasks, selects an applicable annotation algorithm model and performs automatic annotation; on the basis of fusing the automatic annotation results, the training corpora and annotation models generated by the text corpus annotation preparation module are fed back to a feedback-type model learning and training module for corpus selection and model learning and training; a unified training model interface is called to generate a core dictionary and update the segmentation training model table; and an annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the models and complete new segmentation annotation tasks.
Description
Technical field
The present invention relates to the field of text mining, and in particular to a semi-automatic word-segmentation corpus annotation and training device.
Background technique
A word is the smallest meaningful language unit that can act independently, but Chinese text has no explicit separators between words; Chinese lexical analysis is therefore the foundation and key of Chinese information processing. The accuracy of word segmentation is closely tied to the accuracy of part-of-speech tagging, and organically fusing the segmentation process with the part-of-speech tagging process helps resolve ambiguity and improves overall efficiency. A Chinese sentence consists of consecutive characters with no spaces between words, and part-of-speech tagging is the process of assigning a suitable part of speech to each word in a sentence. Chinese word segmentation is the first step of Chinese information processing and plays an extremely important role in many application fields (text segmentation, event extraction, text summarization, information retrieval, and so on). Word segmentation and part-of-speech tagging are both basic processing applied to a corpus, collectively referred to here as corpus segmentation annotation. However, annotated segmentation corpora are scarce; the benefit of segmentation to the larger task it serves is indirect, and in a real system different segmentation errors have very different impacts. Moreover, segmented corpora are expensive to obtain: it is difficult for human annotators to label raw corpora consistently against a single standard, so the scale of available segmentation corpora remains limited even in today's era of large data volumes and large computing capacity. Part-of-speech tagging is the step that immediately follows segmentation in the information processing pipeline, and the algorithms it uses are similar to those for segmentation, so many systems integrate segmentation and part-of-speech tagging. At present, however, in-domain segmentation corpora are relatively scarce, and segmentation corpus annotation is still mainly done manually; fully manual part-of-speech annotation of a corpus is extremely time-consuming and suffers from poor annotation quality, cumbersome annotation procedures, low annotation efficiency and high labor cost. Meanwhile, existing segmentation corpus annotation tools support only a single annotation method and cannot automatically update the annotation models. There is therefore an urgent need for a semi-automatic segmentation annotation and training platform that can assist manual corpus annotation and solve the above problems. A semi-automatic segmentation annotation method, and an annotation device designed on such a method, that could fully automatically and rapidly provide pre-annotation results for the segmentation corpus to be processed would be highly desirable.
In recent years, with the rapid development of big data acquisition, extracting maximum value from data has become increasingly urgent, which places entirely new demands on intelligent big data analysis. In this context, technologies such as machine learning and deep learning have developed rapidly in big data applications and achieved great success, and the underlying model algorithms depend on training supported by large amounts of annotated corpus data. Large-scale corpus annotation has a major influence on the training of algorithm models; as a basic task in big data analysis, it supports day-to-day research and development, algorithm tuning, demonstration and verification, and is a key foundation of big data mining and analysis. Word segmentation currently depends heavily on dictionaries; although the dictionary provided by the JIEBA ("stammer") segmenter is not complete, it is sufficient for general applications. The jieba package can segment a passage of Chinese text and offers three segmentation modes adapted to different needs. Chinese word segmentation (Chinese Word Segmentation) refers to cutting a sequence of Chinese characters into individual words; that is, segmentation is the process of recombining a sequence of consecutive characters into a sequence of words according to certain specifications.
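As a brief illustration of the three jieba modes mentioned above (a minimal sketch; the example sentence and the printed output format are ours, not part of the invention):

```python
# Minimal sketch of jieba's three segmentation modes (full, accurate, search-engine).
import jieba

text = "我来到北京清华大学"

# Full mode: enumerate all dictionary words in the text (fast, but ambiguous).
print("/".join(jieba.cut(text, cut_all=True)))

# Accurate mode (default): most suitable for text analysis.
print("/".join(jieba.cut(text, cut_all=False)))

# Search-engine mode: further splits long words to improve recall for retrieval.
print("/".join(jieba.cut_for_search(text)))
```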
Existing segmentation algorithms fall into three categories: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics. Segmentation based on string matching, also called mechanical segmentation, matches the Chinese character string to be analysed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified). Common variants include the following (see the sketch after this list):
1) Forward maximum matching (scanning left to right);
2) Reverse maximum matching (scanning right to left);
3) Minimum segmentation (minimising the number of words cut from each sentence);
4) Bidirectional maximum matching (two scans, one left-to-right and one right-to-left).
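A minimal sketch of dictionary-based bidirectional maximum matching, assuming a plain Python set as the dictionary and the common heuristic of preferring the scan direction that yields fewer words (the tie-breaking rule here is an illustrative choice, not the patented implementation):

```python
def max_match(text, dictionary, max_len=5, reverse=False):
    """Greedy maximum matching over `text`; scans right-to-left if reverse=True."""
    words, s = [], text[::-1] if reverse else text
    i = 0
    while i < len(s):
        for size in range(min(max_len, len(s) - i), 0, -1):
            chunk = s[i:i + size]
            cand = chunk[::-1] if reverse else chunk
            if size == 1 or cand in dictionary:
                words.append(cand)
                i += size
                break
    return words[::-1] if reverse else words

def bidirectional_max_match(text, dictionary):
    fwd = max_match(text, dictionary)
    bwd = max_match(text, dictionary, reverse=True)
    # Prefer the segmentation with fewer words; fall back to fewer single characters.
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles = lambda ws: sum(1 for w in ws if len(w) == 1)
    return fwd if singles(fwd) <= singles(bwd) else bwd
```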
Segmentation based on understanding: this approach achieves word identification by letting the computer simulate a human's understanding of the sentence. Its basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a segmentation subsystem, a syntax-semantics subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity; that is, it simulates a person's process of understanding the sentence. This approach requires a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese linguistic knowledge, it is difficult to organise the various kinds of linguistic information into a form that a machine can read directly, so understanding-based segmentation systems are still at the experimental stage.
Segmentation based on statistics: given a large amount of already segmented text, a statistical machine learning model is used to learn the rules of word segmentation (the training step) and then to segment unseen text; examples include maximum-probability segmentation and maximum-entropy segmentation. With the construction of large-scale corpora and the development of statistical machine learning methods, statistics-based Chinese word segmentation has gradually become the mainstream approach.
The main statistical models are the N-gram model, the hidden Markov model (Hidden Markov Model, HMM), the maximum entropy model (ME) and conditional random fields (Conditional Random Fields, CRF). Lexical analysis is an important basic technology of NLP, covering word segmentation, part-of-speech tagging and entity recognition; its mainstream algorithmic structure is the Bi-LSTM-CRF architecture. Using a CRF on top yields a globally optimal output sequence and can be viewed as a re-use of the LSTM information. In terms of network structure, Bi-LSTM-CRF still applies the overall CRF framework: the output of the LSTM for the i-th tag at each time step t is treated as the "node feature" of the CRF (a feature function depending only on the current position), while the "edge features" (feature functions depending on adjacent positions) are still carried by the CRF. The originally linear feature functions of a linear-chain CRF are thus replaced by the non-linear outputs of the LSTM, which introduces non-linearity into the CRF and fits the data better. Bi-LSTM is a bidirectional LSTM; compared with a unidirectional LSTM, it captures contextual information in the sentence better. A Bi-LSTM is in fact two long short-term memory (LSTM) networks: the reverse LSTM first reverses the input sequence end to end, runs a normal LSTM, and then reverses its output again so that it aligns with the input of the forward LSTM.
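A minimal PyTorch sketch of the Bi-LSTM part of such a tagger, assuming a BMES-style tag set; the emission scores it produces would feed a CRF layer whose transition matrix is a learned parameter (hyperparameters, tensor shapes and the CRF loss itself are illustrative and omitted for brevity):

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    """Bi-LSTM encoder producing per-character emission scores for a CRF layer."""
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True,
                            batch_first=True)
        self.to_tags = nn.Linear(hidden_dim, num_tags)
        # Tag-to-tag transition scores; a CRF layer would use these as "edge features".
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

    def forward(self, char_ids):             # char_ids: (batch, seq_len)
        emb = self.embed(char_ids)            # (batch, seq_len, embed_dim)
        out, _ = self.lstm(emb)               # (batch, seq_len, hidden_dim)
        return self.to_tags(out)              # emissions: (batch, seq_len, num_tags)

# Example: 4 tags for BMES segmentation labelling.
model = BiLSTMEmitter(vocab_size=6000, num_tags=4)
emissions = model(torch.randint(0, 6000, (2, 10)))
```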
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to overcome the above drawbacks of corpus use in the word-segmentation corpus annotation and training process, and to propose a semi-automatic word-segmentation corpus annotation and training device.
The above object of the present invention is achieved by the following measures. A semi-automatic word-segmentation corpus annotation and training device comprises a text corpus annotation preparation module, a semi-automatic corpus segmentation annotation module, a feedback-type model learning and training module, and a segmentation annotation model effect evaluation module. The text corpus annotation preparation module prepares for annotation tasks: it distinguishes data from different sources and selects corpus sources, applies single-algorithm segmentation pre-annotation to the corpus data to be annotated by source or topic, manages both the corpora to be annotated and the segmented corpora, and then submits the raw-corpus segmentation annotation work to the semi-automatic corpus segmentation annotation module through multiple segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, CRF, JIEBA and BI-LSTM. The semi-automatic corpus segmentation annotation module, for different annotation requirements and corpus characteristics, creates segmentation annotation tasks, selects an applicable annotation algorithm model, and performs automatic annotation under annotation business rule management: a segmentation algorithm model selected from bidirectional maximum matching based on the integrated dictionary, CRF, JIEBA, BI-LSTM and so on, together with the business rules, completes the automatic annotation of each type of annotation task, and the algorithm-model-based automatic annotation results are fused with the business-rule-based automatic annotation results. On the basis of the fused automatic annotation results, manual adjudication is performed according to the annotation business standard and the annotation results are saved; the training corpus and annotation model generated by the text corpus annotation preparation module are fed back to the feedback-type model learning and training module, which performs model parameter setting, model corpus selection and model learning and training according to the existing model and externally loaded enhancement models, and returns to parameter setting after the model is improved and updated. The unified training model interface Train is called to generate the core dictionary and the N-gram core dictionary; external algorithm models are imported through the unified model access interface so that models can be updated or exported; the segmentation model file containing the core dictionary and the N-gram dictionary file is saved and the segmentation training model table is updated. An annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the models; through continuous iteration between model update and corpus annotation, the trained models are used to update the models used for segmentation annotation on the platform, completing new segmentation annotation tasks. The segmentation annotation model effect evaluation module builds single-index algorithms according to the index standards, quantifies the indices according to the index computation rules, organises the corresponding indices for different annotation tasks to construct the annotation algorithm comprehensive evaluation model, computes the comprehensive index value, and feeds it back to characterise the annotation model effect.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention evaluates the annotation effect of the models by establishing an annotation algorithm comprehensive evaluation model, feeds the result back into segmentation model learning and training so that the models reach their best effect, and applies them to subsequent new annotation tasks; continuous iteration between model update and corpus annotation improves both corpus segmentation annotation quality and algorithm model effect. For different annotation requirements and corpus characteristics, the system offers automatic annotation based on an independently selected, adapted algorithm as well as on the fusion of multiple algorithms; multi-algorithm automatic annotation fuses the results of several algorithms by voting, and, when correlations are ignored, the integrated method performs better than any single method. The pre-annotation carried out in this way reduces the complexity of manual annotation and lowers labor cost.
The present invention distinguishes data from different sources and thereby manages the segmentation corpora, and introduces a manual adjudication step. By integrating dictionary-based bidirectional maximum matching, CRF-based segmentation, CRF+Bi-LSTM-based segmentation, JIEBA segmentation and other algorithms, applicable annotation algorithms can be selected during annotation for different segmentation corpora, and the corpus data to be annotated can be pre-annotated either with a single segmentation method or with the fusion of multiple segmentation methods, the fusion of multiple segmentation results being done by voting. Real-time automatic feedback adjustment of the background segmentation algorithm models is supported, which greatly improves corpus annotation efficiency and accuracy.
For different segmentation corpora, by integrating dictionary-based bidirectional maximum matching, CRF-based segmentation, CRF+Bi-LSTM-based segmentation and other algorithms, applicable annotation algorithms can be selected during annotation, and the corpus data to be annotated can be pre-annotated with a single segmentation method or with the fusion of several, the fused result again being obtained by voting. After an annotation task is completed, the segmentation model is retrained with the annotated corpus. Finally, the segmentation annotation corpus is confirmed and submitted through a manual confirmation step, completing the corpus segmentation annotation work. By establishing the annotation algorithm comprehensive evaluation model, the annotation effect of the models is evaluated and fed back into segmentation model learning and training so that the models reach their best effect and can be used for subsequent new annotation tasks; continuous iteration between model update and corpus annotation improves corpus segmentation annotation quality and algorithm model effect.
The present invention can implement sequence labelling by building a Bi-LSTM network and thereby perform segmentation with an accuracy of around 95%; the segmentation annotation corpus is then revised, confirmed and submitted through the manual confirmation step, completing the corpus segmentation annotation work. After an annotation task is completed, the segmentation model is retrained with the annotated corpus. The system provides a friendly interactive annotation interface that simplifies the user's annotation workflow.
The present invention provides a unified segmentation model access standard and supports the import, training and use of external models. It can be applied in a variety of electronic devices.
Description of the drawings
Fig. 1 is a schematic diagram of the operating principle of the semi-automatic word-segmentation corpus annotation and training device of the present invention.
Fig. 2 is the segmentation model training and processing flow chart of Fig. 1.
To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Specific embodiment
Referring to Fig. 1. In the preferred embodiment described below, a semi-automatic word-segmentation corpus annotation and training device comprises a text corpus annotation preparation module, a semi-automatic corpus segmentation annotation module, a feedback-type model learning and training module, and a segmentation annotation model effect evaluation module. The text corpus annotation preparation module prepares for annotation tasks: it distinguishes data from different sources and selects corpus sources, applies single-algorithm segmentation pre-annotation to the corpus data to be annotated by source or topic, manages both the corpora to be annotated and the segmented corpora, and then, through multiple segmentation algorithms such as bidirectional maximum matching based on an integrated dictionary, conditional random fields (CRF), JIEBA and the bidirectional LSTM network BI-LSTM, submits the raw-corpus segmentation annotation work to the semi-automatic corpus segmentation annotation module. The semi-automatic corpus segmentation annotation module, for different annotation requirements and corpus characteristics, creates segmentation annotation tasks, selects an applicable annotation algorithm model, and performs automatic annotation under annotation business rule management: a segmentation algorithm model selected from bidirectional maximum matching based on the integrated dictionary, CRF, JIEBA, BI-LSTM and so on, together with the business rules, completes the automatic annotation of each type of annotation task, and the algorithm-model-based and business-rule-based automatic annotation results are fused. On the basis of the fused automatic annotation results, manual adjudication is performed according to the annotation business standard and the annotation results are saved; the training corpus and annotation model generated by the text corpus annotation preparation module are fed back to the feedback-type model learning and training module, which performs model parameter setting, model corpus selection and model learning and training according to the existing model and externally loaded enhancement models, and returns to parameter setting after the model is improved and updated. After the unified training model interface Train is called to generate the core dictionary and the N-gram core dictionary, external algorithm models are imported through the unified model access interface, the models are updated or exported, the segmentation model file containing the core dictionary and the N-gram dictionary file is saved, and the segmentation training model table is updated; an annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the models, and through continuous iteration between model update and corpus annotation, the trained models are used to update the models used for segmentation annotation on the platform, completing new segmentation annotation tasks. The segmentation annotation model effect evaluation module builds single-index algorithms according to the index standards, quantifies the indices according to the index computation rules, organises the corresponding indices for different annotation tasks to construct the annotation algorithm comprehensive evaluation model, computes the comprehensive index value, and feeds it back to characterise the annotation model effect.
Text corpus annotation preparation module: manages the corpora to be annotated by source or topic and prepares for annotation tasks. The semi-automatic corpus segmentation annotation module, for different annotation requirements and corpus characteristics, independently selects an adapted algorithm and performs automatic annotation, and realises interventional adjudication of the annotation results through a manual adjudication step. The specific steps are as follows: the text corpus annotation preparation module creates segmentation annotation tasks from corpora of different sources; the semi-automatic corpus segmentation annotation module selects an effect-adapted algorithm model for each type of annotation task, and within a segmentation annotation task, according to the automatic annotation effect on the corpus, configures the conditional random field CRF, JIEBA or the bidirectional LSTM network BI-LSTM and selects one of them to complete automatic annotation. To build a conditional random field CRF, a set of feature functions is first defined, each feature function taking the whole sentence s, the current position i and the labels at positions i and i-1 as input; a weight is then assigned to each feature function, and for each candidate label sequence l the weighted feature functions are summed; if necessary, the summed score can be normalised into a probability value (see the scoring sketch after this paragraph). The transition matrix A of the CRF is approximated by the CRF layer of the neural network, and the P matrix, i.e. the emission matrix, is approximated by the Bi-LSTM.
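For reference, a standard linear-chain scoring form consistent with this description (in our notation; A is the tag transition matrix and P the Bi-LSTM emission matrix) is:

```latex
\begin{aligned}
\mathrm{score}(s, l) &= \sum_{i=1}^{n} P_{i,\,l_i} \;+\; \sum_{i=1}^{n-1} A_{l_i,\,l_{i+1}},\\
p(l \mid s) &= \frac{\exp\big(\mathrm{score}(s, l)\big)}{\sum_{l'} \exp\big(\mathrm{score}(s, l')\big)}.
\end{aligned}
```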
The model learning and training module creates business annotation rules for specific annotation tasks and manages the annotation business rules; the annotation business rules here mainly comprise business dictionaries and regular expressions. The feedback-type model learning and training module provides model learning, training and feedback-update capabilities for internal and external annotation model algorithms, and performs automatic annotation of the corpus using the annotation business rules.
The segmentation annotation model effect evaluation module fuses the algorithm-model-based automatic annotation results with the business-rule-based automatic annotation results; on the basis of the fused automatic annotation results, the annotation results are manually revised, confirmed and saved according to the annotation business standard. Annotation personnel select and manage corpora of different sources through the text corpus annotation preparation module and save them by annotation task as the text corpora to be annotated, i.e. the raw corpora. In the semi-automatic corpus segmentation annotation module, the corresponding segmentation annotation task is created and an applicable annotation algorithm model is selected; automatic pre-annotation of the segmentation task corpus is carried out on the basis of the selected algorithm model, and at the same time, given the particularity of the domain of the data, the relevant business rules are compiled to perform business-rule-based automatic pre-annotation, and the two classes of annotation results are fused by voting (a sketch of such voting follows). On the basis of the annotation business standard, the fused annotation results are revised and adjusted through the manual adjudication step, and the final saved result becomes the segmentation corpus, providing the corpus needed for model training by the feedback-type model learning and training module.
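A minimal sketch of fusing several automatic annotations by majority voting, assuming each annotator's output is expressed as a per-character BMES tag sequence of equal length (the tie-breaking rule of deferring to the first annotator is an illustrative choice, not one specified by the invention):

```python
from collections import Counter

def vote_fuse(tag_sequences):
    """Fuse several per-character tag sequences (e.g. B/M/E/S) by majority vote.

    tag_sequences: list of equal-length lists of tags, one per annotator/algorithm.
    Ties are broken in favour of the first annotator's tag when it is among the
    tied candidates.
    """
    fused = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        best = max(counts.items(), key=lambda kv: (kv[1], kv[0] == position_tags[0]))
        fused.append(best[0])
    return fused

# Example: CRF-based, dictionary-based and rule-based annotations of one sentence.
crf_tags  = list("BEBMES")
dict_tags = list("BESBES")
rule_tags = list("BEBMES")
print(vote_fuse([crf_tags, dict_tags, rule_tags]))   # ['B','E','B','M','E','S']
```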
The models used by the semi-automatic segmentation corpus annotation module are trained and updated by the feedback-type model learning and training module. Specifically: feedback-type model learning can be performed on the existing models used for annotation, or an external enhancement model can be used for feedback-type model learning; the segmentation annotation model parameters are set; and the corpus required for segmentation model training is selected and model learning and training is carried out.
Referring to Fig. 2, which shows the detailed operation flow of the semi-automatic word-segmentation corpus annotation and training device. In the segmentation model training and processing flow: the model-training user selects, through the model corpus selection module, the corpus used for segmentation model training, selects training with the CRF, JIEBA or BI-LSTM segmentation algorithm, and calls the segmentation training model interface Train to generate the core dictionary and the N-gram core dictionary so that model accuracy reaches its best. It is then judged whether to save the segmentation model: trainable algorithms such as CRF and BI-LSTM are trained offline with the already annotated corpus data, external algorithm models are imported through the unified segmentation training model access interface, the models are updated or exported, the segmentation model file containing the core dictionary and the N-gram dictionary file is saved, and the segmentation training model table is updated. After the segmentation model is updated, the Chinese word segmentation service is started, training with the CRF, JIEBA or BI-LSTM segmentation algorithm is selected, and a new segmentation switch is added to the configuration file; it is then judged whether to update the segmentation model: if yes, the specified segmentation model is read and the segmentation model name obtained; otherwise, the segmentation training model table is read, the algorithm-carried core dictionary is loaded, the dictionaries are merged, the segmentation training model table is updated and the model refined, and the updated annotation model, with its algorithm-carried core dictionary, is fed back to the semi-automatic segmentation corpus annotation module; the trained model is used to update the model used for segmentation annotation on the platform, completing new segmentation annotation tasks. A sketch of two of these steps is given below.
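A minimal, self-contained sketch of two of the steps above: generating a core dictionary and a bigram (N-gram) dictionary from an already segmented corpus, and merging dictionaries during a model update. The function names, counting scheme and thresholds are illustrative assumptions, not the patent's Train interface:

```python
from collections import Counter

def build_dictionaries(segmented_sentences, min_count=1):
    """Build a core word dictionary and a bigram (N-gram, N=2) dictionary
    from an already segmented corpus (a list of word lists)."""
    word_counts, bigram_counts = Counter(), Counter()
    for words in segmented_sentences:
        word_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    core = {w: c for w, c in word_counts.items() if c >= min_count}
    ngram = {bg: c for bg, c in bigram_counts.items() if c >= min_count}
    return core, ngram

def merge_dictionaries(old, new):
    """Merge an existing core dictionary with a newly generated one (counts added)."""
    merged = dict(old)
    for w, c in new.items():
        merged[w] = merged.get(w, 0) + c
    return merged

corpus = [["自然", "语言", "处理"], ["中文", "分词", "标注"]]
core, ngram = build_dictionaries(corpus)
updated_core = merge_dictionaries({"中文": 3}, core)
```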
The segmentation annotation model effect evaluation module provides methods for constructing evaluation indices, construction rules and index quantification, and supports automatic evaluation of the annotation effect of the models by constructing the annotation algorithm comprehensive evaluation model. The specific steps are as follows: the module builds single-index algorithms according to the index standards; the indices are quantified according to the index computation rules, and the corresponding indices are organised for different annotation tasks to construct the annotation algorithm comprehensive evaluation model; the comprehensive index value is computed and fed back to characterise the annotation model effect, as sketched below.
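A minimal sketch of combining several quantified single indices into one comprehensive value by a weighted sum; the particular indices, weights and normalisation below are illustrative assumptions, not values prescribed by the invention:

```python
def comprehensive_score(indices, weights):
    """Weighted aggregation of quantified evaluation indices into a single value.

    indices: dict mapping index name -> value in [0, 1] (e.g. precision, recall).
    weights: dict mapping index name -> non-negative weight (normalised here).
    """
    total_weight = sum(weights.values())
    return sum(indices[name] * w for name, w in weights.items()) / total_weight

# Example: a segmentation annotation task weighting cutting precision/recall and
# ambiguity-resolution accuracy (all values and weights are made up for illustration).
indices = {"precision": 0.94, "recall": 0.92, "crossing_ambiguity_acc": 0.88}
weights = {"precision": 0.4, "recall": 0.4, "crossing_ambiguity_acc": 0.2}
print(round(comprehensive_score(indices, weights), 4))   # 0.92
```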
In the present embodiment, the basic evaluation indices for the segmentation corpus annotation carried out by the present device include cutting precision, cutting recall, the F-measure, crossing-ambiguity accuracy, combining-ambiguity accuracy and ambiguous-category annotation accuracy. They are defined as follows:
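The standard harmonic-mean form consistent with the definition in the next sentence (shown here in our notation) is:

```latex
F = \frac{2PR}{P + R}
```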
Here F denotes the F value, i.e. the harmonic mean of precision and recall, P denotes precision and R denotes recall. Precision and recall are usually inversely related: methods that improve precision tend to lower recall, and vice versa. To reflect an application's different requirements on precision and recall, a weight can be introduced to obtain a weighted E value, where b is the added weight: the larger b is, the larger the weight of precision in the E value, and conversely the larger the weight of recall.
Segmentation ambiguity is also a difficult point of automatic Chinese word segmentation. To examine an algorithm's ability to resolve ambiguity, separate indices are defined for the ambiguous portions, specifically an accuracy for each of the two different ambiguity types, crossing (overlap) ambiguity and combining ambiguity. Analogously to segmentation ambiguity, part-of-speech tagging has its own "ambiguous words": a word that has two or more different parts of speech is said to exhibit part-of-speech conversion. The tagging of such words is clearly the key difficulty of part-of-speech tagging, and a dedicated index, the part-of-speech-conversion tagging accuracy, is defined to examine it.
By managing the corpora to be annotated by source or topic, preparation is provided for annotation tasks; by integrating multiple segmentation algorithms such as CRF, JIEBA and BI-LSTM, the semi-automatic annotation of the segmentation corpus is completed, applicable annotation algorithms can be selected during annotation, and segmentation pre-annotation is applied to the corpus data to be annotated; finally, the annotated corpus is revised, confirmed and submitted through the manual confirmation step, completing the corpus annotation work. After an annotation task is completed, the model is retrained with the annotated corpus. By establishing the annotation algorithm comprehensive evaluation model, the annotation effect of the models is evaluated, and feedback into model learning and training brings the models to their best effect for subsequent new annotation tasks; continuous iteration between model update and corpus annotation improves corpus annotation quality and algorithm model effect.
The above are preferred embodiments of the present invention. It should be noted that the above embodiments are intended to illustrate the present invention rather than to limit it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. For those skilled in the art, various changes and modifications may be made without departing from the spirit and substance of the present invention, and such changes and modifications also fall within the protection scope of the present invention.
Claims (10)
1. A semi-automatic word-segmentation corpus annotation and training device, comprising: a text corpus annotation preparation module, a semi-automatic corpus segmentation annotation module, a feedback-type model learning and training module and a segmentation annotation model effect evaluation module, characterised in that: the text corpus annotation preparation module prepares for annotation tasks, distinguishes data from different sources and selects corpus sources, applies single-algorithm segmentation pre-annotation to the corpus data to be annotated by source or topic, manages the corpora to be annotated and the segmented corpora, and then submits the raw-corpus segmentation annotation work to the semi-automatic corpus segmentation annotation module through multiple segmentation algorithms comprising bidirectional maximum matching based on an integrated dictionary, conditional random fields (CRF), the JIEBA Chinese segmenter and the bidirectional LSTM network BI-LSTM; the semi-automatic corpus segmentation annotation module, for different annotation requirements and corpus characteristics, creates segmentation annotation tasks, selects an applicable annotation algorithm model and performs automatic annotation under annotation business rule management, a segmentation algorithm model selected from the bidirectional maximum matching based on the integrated dictionary, CRF, JIEBA and BI-LSTM together with the business rules completing the automatic annotation of each type of annotation task, and the algorithm-model-based automatic annotation results being fused with the business-rule-based automatic annotation results; on the basis of the fused automatic annotation results, manual adjudication is carried out according to the annotation business standard and the annotation results are saved, the training corpus and annotation model generated by the text corpus annotation preparation module are fed back to the feedback-type model learning and training module, model parameter setting, model corpus selection and model learning and training are carried out according to the existing model and external enhancement models, and parameter setting is returned to after the model is improved and updated; after the unified training model interface Train is called to generate the core dictionary and the N-gram core dictionary, external algorithm models are imported through the unified model access interface, the models are updated or exported, the segmentation model file containing the core dictionary and the N-gram dictionary file is saved, and the segmentation training model table is updated; an annotation algorithm comprehensive evaluation model is established to evaluate the annotation effect of the models, and through continuous iteration between model update and corpus annotation, the trained models are used to update the models used for segmentation annotation on the platform, completing new segmentation annotation tasks.
2. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the segmentation annotation model effect evaluation module builds single-index algorithms according to the index standards, quantifies the indices according to the index computation rules, organises the corresponding indices for different annotation tasks to construct the annotation algorithm comprehensive evaluation model, computes the comprehensive index value, and feeds it back to characterise the annotation model effect.
3. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the semi-automatic corpus segmentation annotation module, for different annotation requirements and corpus characteristics, independently selects an adapted algorithm and performs automatic annotation, and realises interventional adjudication of the annotation results through a manual adjudication step.
4. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the text corpus annotation preparation module creates segmentation annotation tasks from corpora of different sources; the semi-automatic corpus segmentation annotation module selects an effect-adapted algorithm model for each type of annotation task, and within a segmentation annotation task, configures CRF, JIEBA or BI-LSTM according to the automatic annotation effect on the corpus and selects one of these segmentation algorithms to complete automatic annotation.
5. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the model learning and training module creates business annotation rules for specific annotation tasks and manages the annotation business rules, the annotation business rules comprising business dictionaries and regular expressions; the feedback-type model learning and training module provides model learning, training and feedback-update capabilities for internal and external annotation model algorithms and performs automatic annotation of the corpus using the annotation business rules.
6. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the segmentation annotation model effect evaluation module fuses the algorithm-model-based automatic annotation results with the business-rule-based automatic annotation results, and on the basis of the fused automatic annotation results the annotation results are manually revised, confirmed and saved according to the annotation business standard.
7. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the text corpus annotation preparation module selects and manages corpora of different sources and saves them by annotation task as the text corpora to be annotated, i.e. the raw corpora; in the semi-automatic corpus segmentation annotation module, the corresponding segmentation annotation task is created and an applicable annotation algorithm model is selected, automatic pre-annotation of the segmentation task corpus is carried out based on the selected algorithm model, and at the same time, given the particularity of the domain of the data, the relevant business rules are compiled to carry out business-rule-based automatic pre-annotation, and the two classes of annotation results are fused by voting.
8. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: the models of the semi-automatic segmentation corpus annotation module are trained and updated by the feedback-type model learning and training module, which performs feedback-type model learning on the existing models used for annotation or performs feedback-type model learning with an external enhancement model, sets the segmentation annotation model parameters, selects the corpus required for segmentation model training and carries out model learning and training.
9. The semi-automatic word-segmentation corpus annotation and training device according to claim 1, characterised in that: in the segmentation model training and processing flow, the model corpus selection module chooses the corpus for segmentation model training, training with the CRF, JIEBA or BI-LSTM segmentation algorithm is selected, and the segmentation training model interface Train is called to generate the core dictionary and the N-gram core dictionary so that model accuracy reaches its best; it is judged whether to save the segmentation model, trainable algorithms such as CRF and BI-LSTM are trained offline with the already annotated corpus data, external algorithm models are imported through the unified segmentation training model access interface, the models are updated or exported, the segmentation model file containing the core dictionary and the N-gram dictionary file is saved, and the segmentation training model table is updated.
10. The semi-automatic word-segmentation corpus annotation and training device according to claim 9, characterised in that: after the segmentation model is updated, the Chinese word segmentation service is started, training with the CRF, JIEBA or BI-LSTM segmentation algorithm is selected, a new segmentation switch is added to the configuration file, and it is judged whether to update the segmentation model: if yes, the specified segmentation model is read and the segmentation model name obtained; otherwise, the segmentation training model table is read, the algorithm-carried core dictionary is loaded, the dictionaries are merged, the segmentation training model table is updated and the model refined, and the updated annotation model, with its algorithm-carried core dictionary, is fed back to the semi-automatic segmentation corpus annotation module; the trained model is used to update the model used for segmentation annotation on the platform, completing new segmentation annotation tasks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910455093.XA CN110287482B (en) | 2019-05-29 | 2019-05-29 | Semi-automatic participle corpus labeling training device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910455093.XA CN110287482B (en) | 2019-05-29 | 2019-05-29 | Semi-automatic participle corpus labeling training device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287482A true CN110287482A (en) | 2019-09-27 |
CN110287482B CN110287482B (en) | 2022-07-08 |
Family
ID=68002801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910455093.XA Active CN110287482B (en) | 2019-05-29 | 2019-05-29 | Semi-automatic participle corpus labeling training device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287482B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102243649A (en) * | 2011-06-07 | 2011-11-16 | 上海交通大学 | Semi-automatic information extraction processing device of ontology |
CN105718586A (en) * | 2016-01-26 | 2016-06-29 | 中国人民解放军国防科学技术大学 | Word division method and device |
US20190073447A1 (en) * | 2017-09-06 | 2019-03-07 | International Business Machines Corporation | Iterative semi-automatic annotation for workload reduction in medical image labeling |
CN107622050A (en) * | 2017-09-14 | 2018-01-23 | 武汉烽火普天信息技术有限公司 | Text sequence labeling system and method based on Bi LSTM and CRF |
CN108256029A (en) * | 2018-01-11 | 2018-07-06 | 北京神州泰岳软件股份有限公司 | Statistical classification model training apparatus and training method |
CN109033085A (en) * | 2018-08-02 | 2018-12-18 | 北京神州泰岳软件股份有限公司 | The segmenting method of Chinese automatic word-cut and Chinese text |
CN109446369A (en) * | 2018-09-28 | 2019-03-08 | 武汉中海庭数据技术有限公司 | The exchange method and system of the semi-automatic mark of image |
CN109508453A (en) * | 2018-09-28 | 2019-03-22 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Across media information target component correlation analysis systems and its association analysis method |
Non-Patent Citations (1)
Title |
---|
HOU Chao: "Design and Implementation of a Strategy Generation System Based on Natural Language Processing", China Master's Theses Full-text Database, Information Science and Technology, 15 December 2013 (2013-12-15), pages 138-1720 *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
CN111008706A (en) * | 2019-12-09 | 2020-04-14 | 长春嘉诚信息技术股份有限公司 | Processing method for automatically labeling, training and predicting mass data |
CN111008706B (en) * | 2019-12-09 | 2023-05-05 | 长春嘉诚信息技术股份有限公司 | Processing method for automatically labeling, training and predicting mass data |
CN111597807A (en) * | 2020-04-30 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Method, device and equipment for generating word segmentation data set and storage medium thereof |
CN111597807B (en) * | 2020-04-30 | 2022-09-13 | 腾讯科技(深圳)有限公司 | Word segmentation data set generation method, device, equipment and storage medium thereof |
CN111582388A (en) * | 2020-05-11 | 2020-08-25 | 广州中科智巡科技有限公司 | Method and system for quickly labeling image data |
CN112101014A (en) * | 2020-08-20 | 2020-12-18 | 淮阴工学院 | A word segmentation method for Chinese chemical literature based on hybrid feature fusion |
CN112036178A (en) * | 2020-08-25 | 2020-12-04 | 国家电网有限公司 | A Semantic Search Method Related to Distribution Network Entity |
CN113206854A (en) * | 2021-05-08 | 2021-08-03 | 首约科技(北京)有限公司 | Method and device for rapidly developing national standard terminal protocol |
CN113206854B (en) * | 2021-05-08 | 2022-12-13 | 首约科技(北京)有限公司 | Method and device for rapidly developing national standard terminal protocol |
CN113657105A (en) * | 2021-08-31 | 2021-11-16 | 平安医疗健康管理股份有限公司 | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement |
Also Published As
Publication number | Publication date |
---|---|
CN110287482B (en) | 2022-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287482A (en) | Semi-automation participle corpus labeling training device | |
CN107330011B (en) | The recognition methods of the name entity of more strategy fusions and device | |
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
CN109359293B (en) | Mongolian name entity recognition method neural network based and its identifying system | |
CN104050160B (en) | Interpreter's method and apparatus that a kind of machine is blended with human translation | |
CN110287481A (en) | Name entity corpus labeling training system | |
CN101566998B (en) | Chinese question-answering system based on neural network | |
CN109918680A (en) | Entity recognition method, device and computer equipment | |
CN101539907B (en) | Part-of-speech tagging model training device and part-of-speech tagging system and method thereof | |
CN110298033A (en) | Keyword corpus labeling trains extracting tool | |
CN110298032A (en) | Text classification corpus labeling training system | |
CN108304372A (en) | Entity extraction method and apparatus, computer equipment and storage medium | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN110895559B (en) | Model training method, text processing method, device and equipment | |
CN109271631A (en) | Segmenting method, device, equipment and storage medium | |
CN109858041A (en) | A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries | |
CN108899013A (en) | Voice search method and device and voice recognition system | |
CN113312453B (en) | A model pre-training system for cross-language dialogue understanding | |
CN104573028A (en) | Intelligent question-answer implementing method and system | |
CN110866093A (en) | Machine question-answering method and device | |
CN109857846B (en) | Method and device for matching user question and knowledge point | |
CN108052499A (en) | Text error correction method, device and computer-readable medium based on artificial intelligence | |
CN111680512B (en) | Named entity recognition model, telephone exchange extension switching method and system | |
CN112183064A (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN110210036A (en) | A kind of intension recognizing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||