CN104298665A - A method and device for identifying evaluation objects in Chinese texts

Info

Publication number: CN104298665A
Application number: CN201410548882.5A
Authority: CN
Inventors: 李寿山; 戴敏; 周国栋
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2014-10-16
Filing date: 2014-10-16
Publication date: 2015-01-21

Abstract

The application discloses a method and device for identifying evaluation objects in Chinese text. The method comprises: performing word segmentation on each original text in a corpus and determining the part-of-speech feature of each word feature obtained by segmentation; receiving a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word; splitting each original text into several clauses; screening out target clauses, a target clause being a clause that contains a word feature labeled as an emotion word; extracting corpus features from the target clauses using a preset feature template to form a training corpus; training a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and using the target maximum entropy classifier to identify evaluation objects in a text to be tested. By using a maximum entropy classifier that combines multiple features to identify whether a text to be tested contains an evaluation object, the application achieves good results.

Description

A method and device for identifying evaluation objects in Chinese text

Technical field

The application relates to the technical field of natural language processing, and more particularly to a method and device for identifying evaluation objects in Chinese text.

Background art

In recent years, with the rapid development of Internet technology in China, several large e-commerce companies, such as Taobao and JD.com, have been quietly changing the way people live. Meanwhile, the rise of information sites such as Dianping and of emerging social networking sites such as microblogs has led Chinese Internet users to post more and more of their subjective opinions online, producing a large volume of Chinese review text. This information-rich review text poses new challenges for Chinese sentiment analysis technology.

Emotion information extraction is a fine-grained sentiment analysis technique: it extracts the valuable emotional information from emotional text. Existing research on emotion information extraction concentrates on three aspects: extracting the opinion holder (Opinion Holder), the evaluation word (Polarity Word), and the evaluation object (Opinion Target). Among these, evaluation object extraction plays a very important role in the emotion information extraction task.

Chinese text often omits part of the information, and in Chinese emotional text the omitted information is frequently the evaluation object, which affects the performance of evaluation object extraction to some extent. At present there is no research that addresses the omission of evaluation objects in emotional text.

Summary of the invention

In view of this, the application provides a method and device for identifying evaluation objects in Chinese text, to solve the problem that the prior art lacks a way to judge whether the evaluation object has been omitted in Chinese text.

To achieve the above object, the following scheme is proposed:

A method for identifying evaluation objects in Chinese text, comprising:

performing word segmentation on each original text in a corpus, and determining the part-of-speech feature of each word feature obtained by segmentation;

receiving a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;

splitting each original text into sentences, and dividing the original text into several clauses;

screening out target clauses, each target clause containing a word feature labeled as an emotion word;

extracting corpus features from the target clauses using a preset feature template to form a training corpus;

training a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;

using the target maximum entropy classifier to identify evaluation objects in a text to be tested.

Preferably, performing word segmentation on each original text in the corpus and determining the part-of-speech feature of each word feature obtained by segmentation comprises:

using the Stanford Parser tool to segment each original text in the corpus and to determine the part-of-speech feature of each word feature obtained by segmentation.

Preferably, extracting corpus features from the target clauses using the preset feature template to form the training corpus comprises:

dividing the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;

extracting corpus features from the first-type target clauses using the preset feature template to form positive training samples;

extracting corpus features from the second-type target clauses using the preset feature template to form negative training samples.

Preferably, the feature template comprises:

First feature sub-template: the bagword feature of the current clause, the bagword feature being the segmentation sequence formed by the word features of the clause;

Second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;

Third feature sub-template: the word feature of the first word of the current clause;

Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;

Fifth feature sub-template: the part-of-speech features of all words of the current clause;

Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;

Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;

Eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.

Preferably, the feature template further comprises:

Ninth feature sub-template: the bagword feature of the clause preceding the current clause.

A device for identifying evaluation objects in Chinese text, comprising:

a word segmentation unit, configured to perform word segmentation on each original text in a corpus and to determine the part-of-speech feature of each word feature obtained by segmentation;

a label receiving unit, configured to receive a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;

a clause splitting unit, configured to split each original text into sentences and to divide the original text into several clauses;

a clause screening unit, configured to screen out target clauses, each target clause containing a word feature labeled as an emotion word;

a feature extraction unit, configured to extract corpus features from the target clauses using a preset feature template to form a training corpus;

a training unit, configured to train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;

an object identification unit, configured to use the target maximum entropy classifier to identify evaluation objects in a text to be tested.

Preferably, the word segmentation unit comprises:

a first word segmentation subunit, configured to use the Stanford Parser tool to segment each original text in the corpus and to determine the part-of-speech feature of each word feature obtained by segmentation.

Preferably, the feature extraction unit comprises:

a clause division unit, configured to divide the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;

a positive training sample determination unit, configured to extract corpus features from the first-type target clauses using the preset feature template to form positive training samples;

a negative training sample determination unit, configured to extract corpus features from the second-type target clauses using the preset feature template to form negative training samples.

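Preferably, the feature template in the feature extraction unit comprises the same first to eighth feature sub-templates as set out above for the method.

Preferably, the feature template further comprises the ninth feature sub-template set out above for the method: the bagword feature of the clause preceding the current clause.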

As can be seen from the above technical scheme, the method for identifying evaluation objects in Chinese text provided by the embodiments of the application performs word segmentation on each original text in the corpus and determines the part-of-speech feature of each word feature obtained by segmentation; receives a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word; splits each original text into several clauses; screens out target clauses, each of which contains a word feature labeled as an emotion word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and uses the target maximum entropy classifier to identify evaluation objects in a text to be tested. By using a maximum entropy classifier that combines multiple features to identify whether a text to be tested contains an evaluation object, the application achieves good results.

Brief description of the drawings

To illustrate the technical schemes of the embodiments of the application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the application; a person of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.

Fig. 1 is a flowchart of a method for identifying evaluation objects in Chinese text disclosed in an embodiment of the application;

Fig. 2 is a flowchart of a segmentation method disclosed in an embodiment of the application;

Fig. 3 is a schematic structural diagram of a device for identifying evaluation objects in Chinese text disclosed in an embodiment of the application;

Fig. 4 is a schematic structural diagram of a feature extraction unit disclosed in an embodiment of the application.

Detailed description of embodiments

The technical schemes in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the application without creative effort fall within the scope of protection of the application.

Before the method of the application is introduced, the maximum entropy classifier is introduced first.

The maximum entropy classification method is based on maximum entropy theory from information theory. Its basic idea is to build a model from all known factors and to exclude all unknown factors. In other words, it seeks a probability distribution that satisfies all known facts while leaving the unknown factors as random as possible. Compared with the Naive Bayes method, its main characteristic is that it does not require conditional independence between features. The method is therefore suitable for combining a variety of different features without having to consider how they influence one another.
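For reference, a conditional maximum entropy model is commonly written in the log-linear form below. The notation is the usual textbook one and is an assumption here, not taken from the application: x is a clause, y is a class label, f_i are feature functions, and \lambda_i are the learned weights.

```latex
p_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{Z_\lambda(x)},
\qquad
Z_\lambda(x) = \sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)
```

Because the weights are fitted jointly, overlapping features (for example, the part-of-speech tag of the first word and the tag sequence of the whole clause) can be used together without an independence assumption.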

Referring to Fig. 1, Fig. 1 is a flowchart of a method for identifying evaluation objects in Chinese text disclosed in an embodiment of the application.

As shown in Fig. 1, the method comprises:

Step S100: perform word segmentation on each original text in the corpus, and determine the part-of-speech feature of each word feature obtained by segmentation;

Specifically, the local corpus contains many original texts. Some or all of the original texts may be selected, and each selected original text is then segmented to obtain word features. A part-of-speech feature is then determined for each word feature; the part-of-speech feature indicates the part of speech of the segmented word, such as verb or noun. A concrete example follows:

Original sentence: Cousin will marry.

Word segmentation result: cousin / will / marry / … / .

Part-of-speech tagging result: cousin/NN will/VV marry/VV …/ST ./PU

Here NN, VV and so on are part-of-speech codes defined by the Stanford part-of-speech tagging tool: NN denotes a noun, VV denotes a verb, and PU is the code for all punctuation marks. For the other part-of-speech codes, refer to the tag description of the Stanford part-of-speech tagging tool.

It should be understood that word segmentation and part-of-speech determination may be carried out with a maximum probability method, a maximum matching method, a conditional random field method, or the Stanford Parser tool.
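As an illustration of the maximum matching option mentioned above, the following is a minimal sketch of greedy forward maximum matching. The function name `forward_max_match`, the toy lexicon, and the Chinese example sentence are assumptions made for illustration; the application does not specify them, and part-of-speech tagging is not covered by this sketch.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = {"朋友", "喜欢", "苹果", "很"}
print(forward_max_match("朋友很喜欢苹果。", LEXICON))
# prints ['朋友', '很', '喜欢', '苹果', '。']
```

Characters that match no lexicon entry, such as the final punctuation mark, are emitted as single-character tokens.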

Step S110: receive a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;

Specifically, each word feature obtained by segmentation is annotated manually by the user, who assigns each word feature a label indicating whether it is an evaluation object or an emotion word. For example, in the sentence "My friend likes apples very much", the label of the word feature "apples" is evaluation object, and the label of the word feature "likes" is emotion word.

Step S120: split each original text into sentences, and divide the original text into several clauses;

Specifically, during clause splitting, ".", "?", "!" and the like mark the end of a whole sentence, and each whole sentence is divided into clauses at ",", ";" and the like, with the contextual order of the clauses preserved.
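The two-level splitting described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_clauses` and the exact punctuation sets (both full-width and ASCII marks) are choices made here, not details given by the application.

```python
import re

SENTENCE_END = "。？！.?!"   # marks that close a whole sentence (assumed set)
CLAUSE_END = "，；,;"        # marks that close a clause inside a sentence (assumed set)


def split_clauses(text):
    """Return a list of sentences, each a list of clauses in original order.

    Keeping the clauses grouped by sentence preserves the context needed
    later to look up the clause that precedes the current one.
    """
    sentences = re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?", text)
    result = []
    for sentence in sentences:
        clauses = re.findall(rf"[^{CLAUSE_END}]+[{CLAUSE_END}]?", sentence)
        clauses = [c.strip() for c in clauses if c.strip()]
        if clauses:
            result.append(clauses)
    return result


print(split_clauses("很好，朋友很喜欢。我也喜欢！"))
```

Each clause keeps its trailing punctuation mark, which matters because punctuation is itself a token with the PU tag in the feature templates.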

Step S130: screen out target clauses, each target clause containing a word feature labeled as an emotion word;

Specifically, a clause without an emotion word provides no data for the subsequent sentiment analysis, so such clauses are removed in this step.
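A minimal sketch of this screening step is given below. The `Token` record and the label names "TARGET", "EMOTION" and "O" are hypothetical; the application only states that each word feature carries a part-of-speech feature and a user-supplied label.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str    # part-of-speech tag, e.g. "NN"
    label: str  # hypothetical label set: "TARGET", "EMOTION", or "O" (neither)


def select_target_clauses(clauses: List[List[Token]]) -> List[List[Token]]:
    """Keep only the clauses that contain at least one emotion-word token."""
    return [c for c in clauses if any(t.label == "EMOTION" for t in c)]


clauses = [
    [Token("very", "AD", "O"), Token("good", "VA", "EMOTION"), Token(",", "PU", "O")],
    [Token("cousin", "NN", "O"), Token("will", "VV", "O"), Token("marry", "VV", "O")],
]
targets = select_target_clauses(clauses)
print(len(targets))  # prints 1: only the first clause contains an emotion word
```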

Step S140: extract corpus features from the target clauses using a preset feature template to form a training corpus;

Specifically, a feature template is preset, and this template defines the concrete form of the required features. Using the feature template, corpus features are extracted from the target clauses to form the training corpus.

Step S150: train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;

Step S160: use the target maximum entropy classifier to identify evaluation objects in a text to be tested.

Specifically, the training corpus is used to train the maximum entropy classifier so that the classifier can identify evaluation objects in text.
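Steps S150 and S160 can be sketched with a small pure-Python trainer. With two classes, a maximum entropy model reduces to logistic regression, which is what the sketch below fits by batch gradient ascent. The function names, hyperparameters, feature strings and toy samples are all assumptions for illustration; the application does not say which training algorithm or toolkit it uses.

```python
import math
from collections import defaultdict


def predict_proba(weights, feats):
    """Probability of the positive class given a set of active feature strings."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, epochs=200, lr=0.5, l2=0.01):
    """Fit a binary maximum entropy (logistic regression) model.

    samples: list of (set_of_feature_strings, label), label 1 or 0.
    Returns a dict mapping feature string -> weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, label in samples:
            p = predict_proba(weights, feats)
            for f in feats:
                grad[f] += label - p  # observed minus expected feature count
        for f, g in grad.items():
            # Averaged gradient step with L2 regularisation.
            weights[f] += lr * (g / len(samples) - l2 * weights[f])
    return weights


# Toy data (hypothetical): label 1 = clause with no evaluation object, 0 = clause with one.
samples = [
    ({"first_pos=AD"}, 1),
    ({"first_pos=NN"}, 0),
    ({"first_pos=AD", "last=._PU"}, 1),
    ({"first_pos=NN", "last=._PU"}, 0),
]
model = train_maxent(samples)
print(predict_proba(model, {"first_pos=AD"}) > 0.5)  # prints True
```

At test time, the same feature extraction is applied to each clause of the text to be tested and `predict_proba` gives the probability that the clause lacks an evaluation object.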

The method for identifying evaluation objects in Chinese text provided by this embodiment of the application performs word segmentation on each original text in the corpus and determines the part-of-speech feature of each word feature obtained by segmentation; receives a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word; splits each original text into several clauses; screens out target clauses, each of which contains a word feature labeled as an emotion word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and uses the target maximum entropy classifier to identify evaluation objects in a text to be tested. By using a maximum entropy classifier that combines multiple features to identify whether a text to be tested contains an evaluation object, the application achieves good results.

Referring to Fig. 2, Fig. 2 is a flowchart of a segmentation method disclosed in an embodiment of the application.

As shown in Fig. 2, the above step S140 may specifically be:

Step S200: divide the target clauses into first-type target clauses and second-type target clauses;

Specifically, a first-type target clause is a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause is a clause that contains a word feature labeled as an evaluation object.

Step S210: extract corpus features from the first-type target clauses using the preset feature template to form positive training samples;

Step S220: extract corpus features from the second-type target clauses using the preset feature template to form negative training samples.

The positive training samples and the negative training samples together constitute the training corpus, which is used to train the maximum entropy classifier.
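Steps S200 to S220 amount to labeling each target clause by whether it contains an evaluation object. The sketch below assumes clauses are lists of (word, part-of-speech, label) triples and uses the hypothetical label names "TARGET" and "EMOTION"; the feature extractor is passed in as a parameter, and `first_pos_only` is a deliberately tiny stand-in for the full feature template.

```python
def build_training_samples(target_clauses, extract_features):
    """Pair each target clause's features with a class label.

    Label 1 (positive): first-type clause, no evaluation object present.
    Label 0 (negative): second-type clause, an evaluation object is present.
    """
    samples = []
    for clause in target_clauses:
        has_target = any(label == "TARGET" for _word, _pos, label in clause)
        samples.append((extract_features(clause), 0 if has_target else 1))
    return samples


def first_pos_only(clause):
    """Stand-in extractor: a single feature, the tag of the first word."""
    return {f"first_pos={clause[0][1]}"}


target_clauses = [
    [("friend", "NN", "O"), ("very", "AD", "O"), ("likes", "VV", "EMOTION"), (".", "PU", "O")],
    [("friend", "NN", "O"), ("likes", "VV", "EMOTION"), ("apples", "NN", "TARGET"), (".", "PU", "O")],
]
samples = build_training_samples(target_clauses, first_pos_only)
print([label for _feats, label in samples])  # prints [1, 0]
```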

It should be understood that the above feature template may specifically comprise the nine feature sub-templates set out in the Summary of the invention.

To explain these features more clearly, a concrete example is described below:

Take the current clause "friend very likes." as an example. After word segmentation it becomes "friend / very / likes / .". The previous clause is "very good,", which after word segmentation becomes "very / good / ,". The part of speech of "friend" is NN, that of "very" is AD, that of "likes" is VV, and that of "good" is VA; all punctuation marks share the part of speech PU.

Feature 1: the bagword feature of the current clause, i.e. "friend very likes .";

Feature 2: the feature formed by combining the word features and the part-of-speech features of the current clause, i.e. "friend_NN very_AD likes_VV ._PU" is taken as the feature;

Feature 3: the word feature of the first word of the current clause, i.e. "friend";

Feature 4: the part-of-speech features of the first two words of the current clause, i.e. "NN_AD";

Feature 5: the part-of-speech features of all words of the current clause, i.e. "NN_AD_VV_PU";

Feature 6: the part-of-speech feature of the first word of the current clause, i.e. "NN";

Feature 7: the part-of-speech features of the last three words of the current clause, so this feature takes the value "AD_VV_PU";

Feature 8: the word feature and part-of-speech feature of the last word of the current clause, i.e. "._PU";

Feature 9: the bagword feature of the clause preceding the current clause, i.e. "very good ,".
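The nine features of this worked example can be reproduced with a short sketch. The function name `extract_features`, the dictionary keys, and the separators (space between words, underscore between tags) are assumptions chosen to mirror the strings in the example; clauses are assumed to be non-empty lists of (word, part-of-speech) pairs.

```python
def extract_features(clause, previous=None):
    """Build Features 1-9 for a non-empty clause of (word, pos) pairs."""
    words = [w for w, _ in clause]
    tags = [p for _, p in clause]
    feats = {
        "f1_bagword": " ".join(words),
        "f2_word_pos": " ".join(f"{w}_{p}" for w, p in clause),
        "f3_first_word": words[0],
        "f4_first_two_pos": "_".join(tags[:2]),
        "f5_all_pos": "_".join(tags),
        "f6_first_pos": tags[0],
        "f7_last_three_pos": "_".join(tags[-3:]),
        "f8_last_word_pos": f"{words[-1]}_{tags[-1]}",
    }
    if previous:  # Feature 9 only exists when there is a preceding clause
        feats["f9_prev_bagword"] = " ".join(w for w, _ in previous)
    return feats


current = [("friend", "NN"), ("very", "AD"), ("likes", "VV"), (".", "PU")]
previous = [("very", "AD"), ("good", "VA"), (",", "PU")]
for name, value in extract_features(current, previous).items():
    print(name, "=", value)
```

To feed these into a classifier that expects a set of binary features, each key-value pair can be joined into a single string such as "f6_first_pos=NN".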

The device for identifying evaluation objects in Chinese text provided by the embodiments of the application is described below. The device described below and the method for identifying evaluation objects in Chinese text described above may be referred to in correspondence with each other.

Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a device for identifying evaluation objects in Chinese text disclosed in an embodiment of the application.

As shown in Fig. 3, the device comprises:

a word segmentation unit 31, configured to perform word segmentation on each original text in a corpus and to determine the part-of-speech feature of each word feature obtained by segmentation;

a label receiving unit 32, configured to receive a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;

a clause splitting unit 33, configured to split each original text into sentences and to divide the original text into several clauses;

a clause screening unit 34, configured to screen out target clauses, each target clause containing a word feature labeled as an emotion word;

a feature extraction unit 35, configured to extract corpus features from the target clauses using a preset feature template to form a training corpus;

a training unit 36, configured to train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;

an object identification unit 37, configured to use the target maximum entropy classifier to identify evaluation objects in a text to be tested.

Optionally, referring to Fig. 4, Fig. 4 is a schematic structural diagram of a feature extraction unit disclosed in an embodiment of the application.

As shown in Fig. 4, the feature extraction unit 35 comprises:

a clause division unit 41, configured to divide the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;

a positive training sample determination unit 42, configured to extract corpus features from the first-type target clauses using the preset feature template to form positive training samples;

a negative training sample determination unit 43, configured to extract corpus features from the second-type target clauses using the preset feature template to form negative training samples.

Optionally, the above feature template may comprise the same feature sub-templates as described for the method, namely Features 1 to 9 above.

The device for identifying evaluation objects in Chinese text provided by this embodiment of the application performs word segmentation on each original text in the corpus and determines the part-of-speech feature of each word feature obtained by segmentation; receives a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word; splits each original text into several clauses; screens out target clauses, each of which contains a word feature labeled as an emotion word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and uses the target maximum entropy classifier to identify evaluation objects in a text to be tested. By using a maximum entropy classifier that combines multiple features to identify whether a text to be tested contains an evaluation object, the application achieves good results.

Finally, it should also be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.

The above description of the disclosed embodiments enables a person skilled in the art to implement or use the application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. The application is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for identifying evaluation objects in Chinese text, characterized by comprising:

performing word segmentation on each original text in a corpus, and determining the part-of-speech feature of each word feature obtained by segmentation;

receiving a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;

splitting each original text into sentences, and dividing the original text into several clauses;

screening out target clauses, each target clause containing a word feature labeled as an emotion word;

extracting corpus features from the target clauses using a preset feature template to form a training corpus;

training a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;

using the target maximum entropy classifier to identify evaluation objects in a text to be tested.

2. The identification method according to claim 1, characterized in that performing word segmentation on each original text in the corpus and determining the part-of-speech feature of each word feature obtained by segmentation comprises:

using the Stanford Parser tool to segment each original text in the corpus and to determine the part-of-speech feature of each word feature obtained by segmentation.

3. The identification method according to claim 1, characterized in that extracting corpus features from the target clauses using the preset feature template to form the training corpus comprises:

dividing the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;

extracting corpus features from the first-type target clauses using the preset feature template to form positive training samples;

extracting corpus features from the second-type target clauses using the preset feature template to form negative training samples.

4. The identification method according to claim 3, characterized in that the feature template comprises:

a first feature sub-template: the bagword feature of the current clause, the bagword feature being the segmentation sequence formed by the word features of the clause;

a second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;

a third feature sub-template: the word feature of the first word of the current clause;

a fourth feature sub-template: the part-of-speech features of the first two words of the current clause;

a fifth feature sub-template: the part-of-speech features of all words of the current clause;

a sixth feature sub-template: the part-of-speech feature of the first word of the current clause;

a seventh feature sub-template: the part-of-speech features of the last three words of the current clause;

an eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.

5. The identification method according to claim 4, characterized in that the feature template further comprises:

a ninth feature sub-template: the bagword feature of the clause preceding the current clause.

6. A device for identifying evaluation objects in Chinese text, characterized by comprising:

a word segmentation unit, configured to perform word segmentation on each original text in a corpus and to determine the part-of-speech feature of each word feature obtained by segmentation;

a label receiving unit, configured to receive a label input by a user for each word feature, the label indicating whether the word feature is an evaluation object or an emotion word;

a clause splitting unit, configured to split each original text into sentences and to divide the original text into several clauses;

a clause screening unit, configured to screen out target clauses, each target clause containing a word feature labeled as an emotion word;

a feature extraction unit, configured to extract corpus features from the target clauses using a preset feature template to form a training corpus;

a training unit, configured to train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;

an object identification unit, configured to use the target maximum entropy classifier to identify evaluation objects in a text to be tested.

7. The identification device according to claim 6, characterized in that the word segmentation unit comprises:

a first word segmentation subunit, configured to use the Stanford Parser tool to segment each original text in the corpus and to determine the part-of-speech feature of each word feature obtained by segmentation.

8. The identification device according to claim 6, characterized in that the feature extraction unit comprises:

a clause division unit, configured to divide the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;

a positive training sample determination unit, configured to extract corpus features from the first-type target clauses using the preset feature template to form positive training samples;

a negative training sample determination unit, configured to extract corpus features from the second-type target clauses using the preset feature template to form negative training samples.

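9. The identification device according to claim 8, characterized in that the feature template in the feature extraction unit comprises:

a first feature sub-template: the bagword feature of the current clause, the bagword feature being the segmentation sequence formed by the word features of the clause;

a second feature sub-template: the feature formed by combining the word features and the part-of-speech features of the current clause;

a third feature sub-template: the word feature of the first word of the current clause;

a fourth feature sub-template: the part-of-speech features of the first two words of the current clause;

a fifth feature sub-template: the part-of-speech features of all words of the current clause;

a sixth feature sub-template: the part-of-speech feature of the first word of the current clause;

a seventh feature sub-template: the part-of-speech features of the last three words of the current clause;

an eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.

10. The identification device according to claim 9, characterized in that the feature template further comprises:

a ninth feature sub-template: the bagword feature of the clause preceding the current clause.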