Summary of the invention
In view of this, this application provides a method and a device for recognizing evaluation objects in Chinese text, to solve the problem that the prior art lacks a way to judge whether an evaluation object has been omitted in Chinese text.
To achieve the above objective, the following scheme is proposed:
A method for recognizing evaluation objects in Chinese text, comprising:
Performing word segmentation on each original text in a corpus, and determining the part-of-speech feature of each word feature obtained by the segmentation;
Receiving a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;
Splitting each original text into several clauses;
Selecting target clauses, each target clause containing a word feature labeled as an emotion word;
Extracting corpus features from the target clauses using a preset feature template, to form a training corpus;
Training a maximum entropy classifier with the training corpus to obtain a trained target maximum entropy classifier;
Recognizing evaluation objects in text under test using the target maximum entropy classifier.
Preferably, performing word segmentation on each original text in the corpus, and determining the part-of-speech feature of each word feature obtained by the segmentation, comprises:
Using the Stanford Parser tool to perform word segmentation on each original text in the corpus, and determining the part-of-speech feature of each word feature obtained by the segmentation.
Preferably, extracting corpus features from the target clauses using the preset feature template, to form the training corpus, comprises:
Dividing the target clauses into first-class target clauses and second-class target clauses, a first-class target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-class target clause being a clause that contains a word feature labeled as an evaluation object;
Extracting corpus features from the first-class target clauses using the preset feature template, to form positive training samples;
Extracting corpus features from the second-class target clauses using the preset feature template, to form negative training samples.
Preferably, the feature template comprises:
First feature sub-template: the bag-of-words feature of the current clause, the bag-of-words feature being the segmentation sequence formed by the word features of the clause;
Second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words of the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and the part-of-speech feature of the last word of the current clause.
Preferably, the feature template further comprises:
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
A device for recognizing evaluation objects in Chinese text, comprising:
A word segmentation unit, configured to perform word segmentation on each original text in a corpus, and to determine the part-of-speech feature of each word feature obtained by the segmentation;
A label receiving unit, configured to receive a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;
A clause splitting unit, configured to split each original text into several clauses;
A clause screening unit, configured to select target clauses, each target clause containing a word feature labeled as an emotion word;
A feature extraction unit, configured to extract corpus features from the target clauses using a preset feature template, to form a training corpus;
A training unit, configured to train a maximum entropy classifier with the training corpus to obtain a trained target maximum entropy classifier;
An object recognition unit, configured to recognize evaluation objects in text under test using the target maximum entropy classifier.
Preferably, the word segmentation unit comprises:
A first word segmentation subunit, configured to use the Stanford Parser tool to perform word segmentation on each original text in the corpus, and to determine the part-of-speech feature of each word feature obtained by the segmentation.
Preferably, the feature extraction unit comprises:
A clause division unit, configured to divide the target clauses into first-class target clauses and second-class target clauses, a first-class target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-class target clause being a clause that contains a word feature labeled as an evaluation object;
A positive training sample determining unit, configured to extract corpus features from the first-class target clauses using the preset feature template, to form positive training samples;
A negative training sample determining unit, configured to extract corpus features from the second-class target clauses using the preset feature template, to form negative training samples.
Preferably, the feature template in the feature extraction unit comprises:
First feature sub-template: the bag-of-words feature of the current clause, the bag-of-words feature being the segmentation sequence formed by the word features of the clause;
Second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words of the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and the part-of-speech feature of the last word of the current clause.
Preferably, the feature template further comprises:
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
As can be seen from the above technical scheme, the method for recognizing evaluation objects in Chinese text provided by the embodiments of this application performs word segmentation on each original text in a corpus and determines the part-of-speech feature of each resulting word feature; receives a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word; splits each original text into several clauses; selects target clauses, each containing a word feature labeled as an emotion word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier with the training corpus to obtain a trained target maximum entropy classifier; and uses the target maximum entropy classifier to recognize evaluation objects in text under test. This application uses a maximum entropy classifier that combines multiple features to identify whether the text under test contains an evaluation object, and achieves good results.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
Before the method of this application is introduced, the maximum entropy classifier is first described.
The maximum entropy classification method is based on maximum entropy information theory. Its basic idea is to build a model from all known factors and to exclude all unknown factors. That is, it seeks a probability distribution that satisfies all known facts while keeping the unknown factors as random as possible. Compared with the Naive Bayes method, the most notable characteristic of this method is that it does not require conditional independence between features. The method is therefore suited to combining a variety of different features without having to consider how they affect one another.
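For reference, the conditional maximum entropy model is commonly written in the following form. This formula is the standard textbook formulation and is not given in the original text; the symbols x (the features observed for a clause), y (the class), f_i (indicator feature functions), \lambda_i (their weights) and \tilde{p} (the empirical distribution of the training corpus) are notation assumed here for illustration.

```latex
p_\lambda(y \mid x) =
  \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}
       {\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)},
\qquad
\text{subject to}\quad
\sum_{x, y} \tilde{p}(x, y)\, f_i(x, y)
  = \sum_{x, y} \tilde{p}(x)\, p_\lambda(y \mid x)\, f_i(x, y)
\quad \text{for every } i .
```

The constraints express "satisfying all known facts": the expected value of each feature under the model equals its observed value in the training data. Among all distributions that meet these constraints, the one with the greatest conditional entropy has the exponential form shown on the left.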
Referring to Fig. 1, Fig. 1 is a flowchart of a method for recognizing evaluation objects in Chinese text disclosed in an embodiment of this application.
As shown in Fig. 1, the method comprises:
Step S100: perform word segmentation on each original text in the corpus, and determine the part-of-speech feature of each word feature obtained by the segmentation;
Specifically, the local corpus contains many original texts. Some or all of them may be selected, and each selected original text is segmented to obtain word features. A part-of-speech feature is then determined for each word feature; it indicates the part of speech of that word, such as verb or noun. A concrete example follows:
Original sentence: Cousin will get married.
Word segmentation result: cousin / will / marry / [particle] / .
Part-of-speech tagging result: cousin/NN will/VV marry/VV [particle]/ST ./PU
Here NN, VV and so on are part-of-speech codes defined by the Stanford part-of-speech tagging tool: NN denotes a noun, VV denotes a verb, and PU is the code for all punctuation marks. For other parts of speech, refer to the tag descriptions of the Stanford part-of-speech tagging tool.
It should be understood that word segmentation and part-of-speech determination may be carried out with the maximum probability method, the maximum matching method, the conditional random field method, or the Stanford Parser tool.
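Of the segmentation options listed above, the maximum matching method is the simplest to illustrate. The following is a minimal sketch of greedy forward maximum matching, not the implementation used in this application. The function name `forward_max_match`, the toy vocabulary and the Chinese sample string (an assumed back-translation of the "friend likes apples very much" example) are all hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching against a known vocabulary.

    At each position the longest vocabulary entry (up to max_len characters)
    is taken; a single character is emitted when nothing longer matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary: friend, like, apple.
vocab = {"朋友", "喜欢", "苹果"}
# Assumed source sentence: "friend / very / like / apple".
tokens = forward_max_match("朋友很喜欢苹果", vocab)
print(tokens)
```

A real system would also need a part-of-speech tagger on top of the segmenter; that step is left out here because it depends on the chosen tool.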
Step S110: receive a label input by the user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;
Specifically, each word feature obtained by segmentation is annotated manually by the user, who assigns it a label indicating whether it is an evaluation object or an emotion word. For example, in "My friend likes apples very much", the word feature "apples" is labeled as an evaluation object and the word feature "likes" is labeled as an emotion word.
Step S120: split each original text into several clauses;
Specifically, during clause splitting, marks such as ".", "?" and "!" are treated as the end of a whole sentence, each whole sentence is divided into clauses at marks such as "," and ";", and the contextual order of the clauses is preserved.
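The clause splitting rule above can be sketched with two regular expressions. This is only an illustrative sketch: the function name `split_clauses`, the choice to accept both full-width and ASCII punctuation, and the Chinese sample text (an assumed rendering of "Very good, friend likes it very much. Really?") are assumptions, not part of the original description.

```python
import re

# Marks that end a whole sentence and marks that separate clauses
# (full-width and ASCII variants are both accepted in this sketch).
SENTENCE_END = "。？！.?!"
CLAUSE_END = "，；,;"


def split_clauses(text):
    """Return a list of sentences, each a list of clauses in original order."""
    sentences = re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?", text)
    result = []
    for sentence in sentences:
        clauses = re.findall(rf"[^{CLAUSE_END}]+[{CLAUSE_END}]?", sentence)
        clauses = [c.strip() for c in clauses if c.strip()]
        if clauses:
            result.append(clauses)
    return result


# Assumed sample text: two sentences, the first with two clauses.
sentences = split_clauses("很好，朋友很喜欢。真的吗？")
print(sentences)
```

Keeping the clauses grouped by sentence, in order, is one way to preserve the context relation: the clause preceding any given clause can be looked up by index when the ninth feature sub-template is applied.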
Step S130: select target clauses, each target clause containing a word feature labeled as an emotion word;
Specifically, a clause without an emotion word provides no data for the subsequent sentiment analysis, so such clauses are removed in this step.
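The screening step can be sketched as a simple filter over labeled clauses. The data layout below, a clause as a list of (word, part of speech, label) triples with English glosses standing in for the Chinese words, and the names `select_target_clauses` and `EMOTION_WORD` are hypothetical choices made for this illustration.

```python
EMOTION_WORD = "emotion_word"
EVALUATION_OBJECT = "evaluation_object"


def select_target_clauses(clauses):
    """Keep only clauses that contain at least one word labeled as an emotion word."""
    return [
        clause for clause in clauses
        if any(label == EMOTION_WORD for _word, _pos, label in clause)
    ]


clauses = [
    # Contains the emotion word "likes", so it is a target clause.
    [("friend", "NN", None), ("very", "AD", None),
     ("likes", "VV", EMOTION_WORD), (".", "PU", None)],
    # Contains no emotion word, so it is discarded.
    [("cousin", "NN", None), ("will", "VV", None),
     ("marry", "VV", None), (".", "PU", None)],
]
target_clauses = select_target_clauses(clauses)
print(len(target_clauses))
```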
Step S140: extract corpus features from the target clauses using the preset feature template, to form a training corpus;
Specifically, a feature template is preset that defines the concrete form of the required features. Corpus features are extracted from the target clauses with this template and make up the training corpus.
Step S150: train a maximum entropy classifier with the training corpus to obtain a trained target maximum entropy classifier;
Step S160: recognize evaluation objects in the text under test using the target maximum entropy classifier.
Specifically, the training corpus is used to train the maximum entropy classifier so that it can recognize evaluation objects in text.
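Steps S150 and S160 can be illustrated with a very small maximum entropy trainer. The sketch below fits a conditional maximum entropy model (equivalent to multinomial logistic regression over indicator features) by plain batch gradient ascent. The original text does not say which training algorithm or toolkit is used, so the optimizer, the function names `train_maxent` and `predict_proba`, the hyperparameters and the feature strings are all assumptions.

```python
import math
from collections import defaultdict


def predict_proba(weights, labels, feats):
    """Return p(label | feats) under the exponential (maximum entropy) model."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp.values())
    return {y: v / z for y, v in exp.items()}


def train_maxent(samples, epochs=200, lr=0.5):
    """samples: list of (set of feature strings, class label) pairs."""
    labels = sorted({y for _, y in samples})
    weights = defaultdict(float)  # one weight per (feature, label) pair
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Observed count minus expected count for each (feature, label).
                delta = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    grad[(f, y)] += delta
        for key, g in grad.items():
            weights[key] += lr * g / len(samples)
    return weights, labels


# Hypothetical training corpus: features already extracted from two clauses.
samples = [
    ({"f3=friend", "f8=._PU"}, "positive"),      # evaluation object omitted
    ({"f3=friend", "f8=apples_NN"}, "negative"),  # evaluation object present
]
weights, labels = train_maxent(samples)
probs = predict_proba(weights, labels, {"f3=friend", "f8=._PU"})
print(probs)
```

At recognition time the same feature extraction would be applied to each clause of the text under test, and the class with the highest probability taken as the result.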
In the method for recognizing evaluation objects in Chinese text provided by this embodiment, word segmentation is performed on each original text in the corpus and the part-of-speech feature of each resulting word feature is determined; a label input by the user is received for each word feature, indicating whether the word feature is an evaluation object or an emotion word; each original text is split into several clauses; target clauses containing a word feature labeled as an emotion word are selected; corpus features are extracted from the target clauses using the preset feature template to form a training corpus; a maximum entropy classifier is trained with the training corpus to obtain a trained target maximum entropy classifier; and the target maximum entropy classifier is used to recognize evaluation objects in the text under test. This application uses a maximum entropy classifier that combines multiple features to identify whether the text under test contains an evaluation object, and achieves good results.
Referring to Fig. 2, Fig. 2 is a flowchart of a segmentation method disclosed in an embodiment of this application.
As shown in Fig. 2, the above step S140 may specifically be:
Step S200: divide the target clauses into first-class target clauses and second-class target clauses;
Specifically, a first-class target clause is a clause that does not contain a word feature labeled as an evaluation object, and a second-class target clause is a clause that contains a word feature labeled as an evaluation object.
Step S210: extract corpus features from the first-class target clauses using the preset feature template, to form positive training samples;
Step S220: extract corpus features from the second-class target clauses using the preset feature template, to form negative training samples.
The positive training samples and the negative training samples together make up the training corpus, which is used to train the maximum entropy classifier.
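Steps S200 to S220 amount to assigning each target clause a class according to whether it contains a word labeled as an evaluation object. A minimal sketch follows; the clause layout ((word, part of speech, label) triples with English glosses), the function names and the stand-in feature extractor `first_word_feature` are hypothetical.

```python
EVALUATION_OBJECT = "evaluation_object"
EMOTION_WORD = "emotion_word"


def build_training_samples(target_clauses, extract_features):
    """Pair the features of each target clause with its sample class.

    Clauses without an evaluation object become positive samples,
    clauses with one become negative samples, as in steps S210 and S220.
    """
    samples = []
    for clause in target_clauses:
        has_object = any(label == EVALUATION_OBJECT for _w, _p, label in clause)
        sample_class = "negative" if has_object else "positive"
        samples.append((extract_features(clause), sample_class))
    return samples


def first_word_feature(clause):
    # Stand-in extractor for this sketch: only the word of the first token.
    return {"f3=" + clause[0][0]}


target_clauses = [
    [("friend", "NN", None), ("very", "AD", None),
     ("likes", "VV", EMOTION_WORD), (".", "PU", None)],
    [("friend", "NN", None), ("very", "AD", None),
     ("likes", "VV", EMOTION_WORD), ("apples", "NN", EVALUATION_OBJECT)],
]
samples = build_training_samples(target_clauses, first_word_feature)
print([sample_class for _, sample_class in samples])
```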
It should be understood that the above feature template may specifically comprise the following sub-templates:
First feature sub-template: the bag-of-words feature of the current clause, the bag-of-words feature being the segmentation sequence formed by the word features of the clause;
Second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words of the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and the part-of-speech feature of the last word of the current clause.
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
To explain the above features more clearly, a concrete example is described below:
Take the current clause "Friend likes [it] very much." as an example. After segmentation it is "friend / very / likes / .". The previous clause is "Very good,", which after segmentation is "very / good / ,". The part of speech of "friend" is NN, of "very" is AD, of "likes" is VV, and of "good" is VA; punctuation is uniformly tagged PU.
Feature 1: the bag-of-words feature of the current clause, namely "friend very likes .";
Feature 2: the feature formed by combining the word features and the part-of-speech features of the current clause, that is, "friend_NN very_AD likes_VV ._PU" is taken as the feature;
Feature 3: the word feature of the first word of the current clause, namely "friend";
Feature 4: the part-of-speech features of the first two words of the current clause, namely "NN_AD";
Feature 5: the part-of-speech features of all words of the current clause, namely "NN_AD_VV_PU";
Feature 6: the part-of-speech feature of the first word of the current clause, namely "NN";
Feature 7: the part-of-speech features of the last three words of the current clause, so this feature takes "AD_VV_PU";
Feature 8: the word feature and the part-of-speech feature of the last word of the current clause, namely "._PU";
Feature 9: the bag-of-words feature of the clause preceding the current clause, namely "very good ,".
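The nine features of the worked example can be reproduced with a short function. This is an illustrative sketch rather than the implementation of this application: the function name `extract_features`, the keys "f1" to "f9", the use of spaces and underscores as separators, and the English glosses standing in for the Chinese words are assumptions. The clause is assumed to be non-empty.

```python
def extract_features(clause, previous_clause=None):
    """Apply the nine feature sub-templates to one clause.

    clause and previous_clause are lists of (word, part_of_speech) pairs.
    """
    words = [w for w, _ in clause]
    tags = [t for _, t in clause]
    feats = {
        "f1": " ".join(words),                       # bag of words of the clause
        "f2": " ".join(f"{w}_{t}" for w, t in clause),  # words combined with tags
        "f3": words[0],                              # first word
        "f4": "_".join(tags[:2]),                    # tags of the first two words
        "f5": "_".join(tags),                        # tags of all words
        "f6": tags[0],                               # tag of the first word
        "f7": "_".join(tags[-3:]),                   # tags of the last three words
        "f8": f"{words[-1]}_{tags[-1]}",             # last word with its tag
    }
    if previous_clause:
        # Ninth sub-template: bag of words of the preceding clause, if any.
        feats["f9"] = " ".join(w for w, _ in previous_clause)
    return feats


current = [("friend", "NN"), ("very", "AD"), ("likes", "VV"), (".", "PU")]
previous = [("very", "AD"), ("good", "VA"), (",", "PU")]
features = extract_features(current, previous)
for name, value in features.items():
    print(name, value)
```

Each value could then be turned into an indicator feature string such as "f4=NN_AD" before being passed to the classifier.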
The device for recognizing evaluation objects in Chinese text provided by the embodiments of this application is described below. The device described below and the method described above may be referred to in correspondence with each other.
Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a device for recognizing evaluation objects in Chinese text disclosed in an embodiment of this application.
As shown in Fig. 3, the device comprises:
A word segmentation unit 31, configured to perform word segmentation on each original text in the corpus, and to determine the part-of-speech feature of each word feature obtained by the segmentation;
A label receiving unit 32, configured to receive a label input by the user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;
A clause splitting unit 33, configured to split each original text into several clauses;
A clause screening unit 34, configured to select target clauses, each target clause containing a word feature labeled as an emotion word;
A feature extraction unit 35, configured to extract corpus features from the target clauses using the preset feature template, to form a training corpus;
A training unit 36, configured to train a maximum entropy classifier with the training corpus to obtain a trained target maximum entropy classifier;
An object recognition unit 37, configured to recognize evaluation objects in the text under test using the target maximum entropy classifier.
Optionally, referring to Fig. 4, Fig. 4 is a schematic structural diagram of a feature extraction unit disclosed in an embodiment of this application.
As shown in Fig. 4, the feature extraction unit 35 comprises:
A clause division unit 41, configured to divide the target clauses into first-class target clauses and second-class target clauses, a first-class target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-class target clause being a clause that contains a word feature labeled as an evaluation object;
A positive training sample determining unit 42, configured to extract corpus features from the first-class target clauses using the preset feature template, to form positive training samples;
A negative training sample determining unit 43, configured to extract corpus features from the second-class target clauses using the preset feature template, to form negative training samples.
Optionally, the above feature template may comprise:
First feature sub-template: the bag-of-words feature of the current clause, the bag-of-words feature being the segmentation sequence formed by the word features of the clause;
Second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words of the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and the part-of-speech feature of the last word of the current clause.
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
The device for recognizing evaluation objects in Chinese text provided by this embodiment performs word segmentation on each original text in the corpus and determines the part-of-speech feature of each resulting word feature; receives a label input by the user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word; splits each original text into several clauses; selects target clauses, each containing a word feature labeled as an emotion word; extracts corpus features from the target clauses using the preset feature template to form a training corpus; trains a maximum entropy classifier with the training corpus to obtain a trained target maximum entropy classifier; and uses the target maximum entropy classifier to recognize evaluation objects in the text under test. This application uses a maximum entropy classifier that combines multiple features to identify whether the text under test contains an evaluation object, and achieves good results.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.