CN104298665A - A method and device for identifying evaluation objects in Chinese texts - Google Patents

Info

Publication number
CN104298665A
Authority
CN
China
Prior art keywords
feature
clause
word
corpus
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410548882.5A
Other languages
Chinese (zh)
Inventor
李寿山
戴敏
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201410548882.5A
Publication of CN104298665A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method and device for identifying evaluation objects in Chinese text. The method is as follows: perform word segmentation on each raw corpus entry in a corpus and determine the part-of-speech feature of each word feature obtained by the segmentation; receive a user-supplied label for each word feature, the label indicating whether the word feature is an evaluation object or a sentiment word; split each raw corpus entry into several clauses; select target clauses, each of which contains a word feature labeled as a sentiment word; extract corpus features from the target clauses using a preset feature template to form a training corpus; train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and use the target maximum entropy classifier to identify evaluation objects in the text to be tested. The application uses a maximum entropy classifier combined with multiple features to identify whether the text to be tested contains an evaluation object, and achieves good results.

Description

A method and device for identifying evaluation objects in Chinese text
Technical field
The present application relates to the technical field of natural language processing and, more specifically, to a method and device for identifying evaluation objects in Chinese text.
Background
In recent years, with the rapid development of Internet technology in China, several large e-commerce companies, such as Taobao and Jingdong (JD.com), have been quietly changing how people live. Meanwhile, the rise of review sites such as Dazhong Dianping and of social networks such as microblogs has led Chinese Internet users to publish more and more of their own subjective opinions online, producing a large volume of Chinese review text. These information-rich review texts pose new challenges for Chinese sentiment analysis.
Sentiment information extraction is a fine-grained sentiment analysis technique that extracts the valuable sentiment information from sentiment-bearing text. Existing work on this task concentrates on three aspects: the opinion holder, the evaluation word (polarity word), and the evaluation object (opinion target). Among these, evaluation object extraction holds a very important position.
Chinese text often omits part of its information, and in Chinese sentiment text the omitted information is frequently the evaluation object, which degrades the performance of evaluation object extraction. At present there is no research that addresses omitted evaluation objects in sentiment text.
Summary of the invention
In view of this, the present application provides a method and device for identifying evaluation objects in Chinese text, to address the lack in the prior art of any way to judge whether the evaluation object has been omitted from Chinese text.
To achieve the above object, the following scheme is proposed:
A method for identifying evaluation objects in Chinese text, comprising:
Performing word segmentation on each raw corpus entry in a corpus, and determining the part-of-speech feature of each word feature obtained by the segmentation;
Receiving a user-supplied label for each of the word features, the label indicating whether the word feature is an evaluation object or a sentiment word;
Splitting each raw corpus entry into several clauses;
Selecting target clauses, each target clause containing a word feature labeled as a sentiment word;
Extracting corpus features from the target clauses using a preset feature template, to form a training corpus;
Training a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;
Using the target maximum entropy classifier to identify evaluation objects in the text to be tested.
Preferably, performing word segmentation on each raw corpus entry in the corpus and determining the part-of-speech feature of each word feature obtained by the segmentation comprises:
Using the Stanford Parser tool to perform word segmentation on each raw corpus entry in the corpus, and determining the part-of-speech feature of each resulting word feature.
Preferably, extracting corpus features from the target clauses using the preset feature template to form the training corpus comprises:
Dividing the target clauses into first-type target clauses and second-type target clauses, where a first-type target clause contains no word feature labeled as an evaluation object and a second-type target clause contains a word feature labeled as an evaluation object;
Extracting corpus features from the first-type target clauses using the preset feature template, to form positive training samples;
Extracting corpus features from the second-type target clauses using the preset feature template, to form negative training samples.
Preferably, the feature template comprises:
First feature sub-template: the bag-of-words (bagword) feature of the current clause, i.e. the token sequence formed by the segmented words of the clause;
Second feature sub-template: the combined word and part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words in the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.
Preferably, the feature template further comprises:
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
A device for identifying evaluation objects in Chinese text, comprising:
A word segmentation unit, configured to perform word segmentation on each raw corpus entry in a corpus and determine the part-of-speech feature of each word feature obtained by the segmentation;
A label receiving unit, configured to receive a user-supplied label for each of the word features, the label indicating whether the word feature is an evaluation object or a sentiment word;
A clause splitting unit, configured to split each raw corpus entry into several clauses;
A clause screening unit, configured to select target clauses, each target clause containing a word feature labeled as a sentiment word;
A feature extraction unit, configured to extract corpus features from the target clauses using a preset feature template, to form a training corpus;
A training unit, configured to train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;
An object identification unit, configured to use the target maximum entropy classifier to identify evaluation objects in the text to be tested.
Preferably, the word segmentation unit comprises:
A first word segmentation subunit, configured to use the Stanford Parser tool to perform word segmentation on each raw corpus entry in the corpus and determine the part-of-speech feature of each resulting word feature.
Preferably, the feature extraction unit comprises:
A clause division unit, configured to divide the target clauses into first-type target clauses and second-type target clauses, where a first-type target clause contains no word feature labeled as an evaluation object and a second-type target clause contains a word feature labeled as an evaluation object;
A positive training sample determining unit, configured to extract corpus features from the first-type target clauses using the preset feature template, to form positive training samples;
A negative training sample determining unit, configured to extract corpus features from the second-type target clauses using the preset feature template, to form negative training samples.
Preferably, the feature template in the feature extraction unit comprises:
First feature sub-template: the bag-of-words (bagword) feature of the current clause, i.e. the token sequence formed by the segmented words of the clause;
Second feature sub-template: the combined word and part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words in the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.
Preferably, the feature template further comprises:
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
As can be seen from the above technical scheme, the method provided by the embodiments of the present application segments each raw corpus entry in the corpus and determines the part-of-speech feature of each resulting word feature; receives a user-supplied label for each word feature, indicating whether it is an evaluation object or a sentiment word; splits each raw corpus entry into several clauses; selects the target clauses, which contain a word feature labeled as a sentiment word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and uses that classifier to identify evaluation objects in the text to be tested. The application uses a maximum entropy classifier combined with multiple features to identify whether the text to be tested contains an evaluation object, and achieves good results.
Brief description of the drawings
To illustrate the technical schemes of the embodiments of the present application or of the prior art more clearly, the drawings needed in describing them are briefly introduced below. The drawings described below are merely embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for identifying evaluation objects in Chinese text disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of a segmentation method disclosed in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a device for identifying evaluation objects disclosed in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a feature extraction unit disclosed in an embodiment of the present application.
Detailed description
The technical schemes in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within the scope of protection of the present application.
Before introducing the method for the application, what is first introduced is maximum entropy classifiers.
Maximum entropy sorting technique is based on maximum entropy information theory, and its basic thought is all known factor Modling model, and the factor of all the unknowns is foreclosed.Namely to find a kind of probability distribution, meet all known facts, but allow the most randomization of unknown factor.Relative to Nae Bayesianmethod, the maximum feature of the method is exactly the conditional sampling not between demand fulfillment feature and feature.Therefore, the method is applicable to the various different feature of statistics, and without the need to considering the impact between them.
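For reference, the idea above is usually written in the following conditional form. This is the standard textbook formulation, which the source does not spell out; the symbols are introduced here for illustration only, with $x$ a clause, $y$ a class label, $f_i$ the feature functions, and $\lambda_i$ their learned weights.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)
```

Because each feature contributes only its own weighted term to the exponent, no independence assumption between features is required, which is the contrast with Naive Bayes drawn above.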
Referring to Fig. 1, Fig. 1 is a flowchart of a method for identifying evaluation objects in Chinese text disclosed in an embodiment of the present application.
As shown in Fig. 1, the method comprises:
Step S100: perform word segmentation on each raw corpus entry in the corpus, and determine the part-of-speech feature of each word feature obtained by the segmentation;
Specifically, the local corpus contains many raw corpus entries. Some or all of them may be selected, and each selected entry is segmented to obtain word features. A part-of-speech feature is then determined for each word feature; it indicates the part of speech of the word, such as verb or noun. A concrete example follows:
Original sentence: Cousin is going to get married.
Word segmentation result: cousin / will / marry / [...] / .
Part-of-speech tagging result: cousin/NN will/VV marry/VV [...]/ST ./PU
Here NN, VV and so on are the part-of-speech codes defined by the Stanford part-of-speech tagging tool: NN denotes a noun, VV a verb, and PU any punctuation mark. For other parts of speech, refer to the tag definitions of the Stanford tagging tool.
It should be understood that word segmentation and part-of-speech determination may be carried out using the maximum probability method, the maximum matching method, the conditional random field method, or the Stanford Parser tool.
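Of the segmentation options listed above, the maximum matching method is the simplest to illustrate. The following is a minimal sketch of forward maximum matching, not the implementation used in the application. The function name `forward_max_match`, the toy dictionary, and the example string (an illustrative back-translation of "my friend likes apples very much") are all assumptions made for this sketch.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching segmentation.

    At each position, take the longest dictionary word (up to max_len
    characters); fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real system would use a large lexicon.
toy_vocab = {"朋友", "喜欢", "苹果"}
print(forward_max_match("朋友很喜欢苹果", toy_vocab))
```

A dictionary-based segmenter like this does not assign parts of speech; a tagger would still be needed for the part-of-speech features.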
Step S110: receive a user-supplied label for each of the word features, the label indicating whether the word feature is an evaluation object or a sentiment word;
Specifically, each word feature obtained by segmentation is manually annotated by the user, who assigns it a label showing whether it is an evaluation object or a sentiment word. For example, in "my friend likes apples very much", the label of the word feature "apples" is evaluation object and the label of the word feature "likes" is sentiment word.
Step S120: split each raw corpus entry into several clauses;
Specifically, ".", "?", "!" and the like are treated as marking the end of a full sentence, each full sentence is further divided into clauses at ",", ";" and the like, and the contextual order of the clauses is preserved.
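The two-level splitting in step S120 can be sketched with regular expressions. This is an illustrative sketch only: the function name `split_clauses` is hypothetical, and the punctuation sets assume full-width Chinese marks, which the source lists only in translated form.

```python
import re

# Assumed punctuation sets (full-width Chinese marks).
SENTENCE_RE = re.compile(r"[^。?!]+[。?!]?")
CLAUSE_RE = re.compile(r"[^,;]+[,;]?")


def split_clauses(text):
    """Split text into sentences, then each sentence into clauses.

    Returns one list of clauses per sentence, in original order, so the
    clause preceding any given clause remains available as context.
    """
    result = []
    for sentence in SENTENCE_RE.findall(text):
        clauses = [c for c in CLAUSE_RE.findall(sentence) if c.strip()]
        if clauses:
            result.append(clauses)
    return result


print(split_clauses("很好,朋友很喜欢。"))
```

Keeping the clauses grouped by sentence is one way to preserve the context relation; it makes the previous clause easy to look up when the ninth feature sub-template is applied.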
Step S130: select target clauses, each of which contains a word feature labeled as a sentiment word;
Specifically, a clause without any sentiment word provides no data for the subsequent sentiment analysis, so such clauses are discarded in this step.
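The screening in step S130 amounts to a simple filter. In the sketch below, the token representation (a tuple of word, part of speech, and label), the label strings, and the function name `select_target_clauses` are assumptions for illustration; the source does not prescribe a data format.

```python
# Hypothetical label values assigned during manual annotation.
SENTIMENT = "sentiment_word"
OBJECT = "evaluation_object"


def select_target_clauses(clauses):
    """Keep only clauses containing at least one sentiment-word token.

    Each clause is a list of (word, pos, label) tuples.
    """
    return [
        clause for clause in clauses
        if any(label == SENTIMENT for _, _, label in clause)
    ]


clauses = [
    [("friend", "NN", None), ("very", "AD", None), ("likes", "VV", SENTIMENT)],
    [("cousin", "NN", None), ("will", "VV", None), ("marry", "VV", None)],
]
targets = select_target_clauses(clauses)
print(len(targets))
```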
Step S140: extract corpus features from the target clauses using a preset feature template, to form a training corpus;
Specifically, a feature template is preset that defines the concrete form of the required features. Corpus features are extracted from the target clauses according to this template and together form the training corpus.
Step S150: train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;
Step S160: use the target maximum entropy classifier to identify evaluation objects in the text to be tested.
Specifically, training on the training corpus enables the maximum entropy classifier to identify evaluation objects in text.
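Steps S150 and S160 can be sketched with a tiny two-class maximum entropy model, which for binary labels coincides with logistic regression over indicator features. This is a minimal sketch under stated assumptions, not the classifier used in the application: the function names, the feature strings, the learning rate, and the epoch count are all hypothetical, and a production system would use an established toolkit with regularization.

```python
import math


def predict_proba(weights, feats):
    """Probability of the positive class given a set of active features."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on log-likelihood.

    samples: list of (set_of_feature_strings, label), label 1 or 0.
    """
    weights = {}
    for _ in range(epochs):
        for feats, label in samples:
            error = label - predict_proba(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * error
    return weights


# Toy training corpus: label 1 for a clause with no evaluation object.
samples = [
    ({"f3=likes", "f5=VV_PU"}, 1),
    ({"f3=friend", "f5=NN_VV_NN_PU"}, 0),
]
weights = train_maxent(samples)
print(predict_proba(weights, {"f3=likes", "f5=VV_PU"}) > 0.5)
```

At test time, the same feature template is applied to each clause of the text to be tested, and the predicted class indicates whether the clause is judged to contain an evaluation object.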
The method for identifying evaluation objects in Chinese text provided by the embodiments of the present application segments each raw corpus entry in the corpus and determines the part-of-speech feature of each resulting word feature; receives a user-supplied label for each word feature, indicating whether it is an evaluation object or a sentiment word; splits each raw corpus entry into several clauses; selects the target clauses, which contain a word feature labeled as a sentiment word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and uses that classifier to identify evaluation objects in the text to be tested. The application uses a maximum entropy classifier combined with multiple features to identify whether the text to be tested contains an evaluation object, and achieves good results.
Referring to Fig. 2, Fig. 2 is a flowchart of a segmentation method disclosed in an embodiment of the present application.
As shown in Fig. 2, step S140 above may specifically be carried out as follows:
Step S200: divide the target clauses into first-type target clauses and second-type target clauses;
Specifically, a first-type target clause contains no word feature labeled as an evaluation object, and a second-type target clause contains a word feature labeled as an evaluation object.
Step S210: extract corpus features from the first-type target clauses using the preset feature template, to form positive training samples;
Step S220: extract corpus features from the second-type target clauses using the preset feature template, to form negative training samples.
The positive and negative training samples together make up the training corpus used to train the maximum entropy classifier.
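Steps S200 to S220 can be sketched as a partition of the target clauses. As before, the (word, pos, label) token tuples, the label string, and the function name `split_samples` are assumptions made for illustration.

```python
OBJECT = "evaluation_object"  # hypothetical label value


def split_samples(target_clauses):
    """Partition target clauses into positive and negative examples.

    Positive: no token labeled as an evaluation object (first type).
    Negative: at least one token labeled as an evaluation object
    (second type). Each clause is a list of (word, pos, label) tuples.
    """
    positives, negatives = [], []
    for clause in target_clauses:
        has_object = any(label == OBJECT for _, _, label in clause)
        (negatives if has_object else positives).append(clause)
    return positives, negatives


omitted = [("friend", "NN", None), ("likes", "VV", "sentiment_word")]
present = omitted + [("apples", "NN", OBJECT)]
pos, neg = split_samples([omitted, present])
print(len(pos), len(neg))
```

Note the polarity: the positive class is the clause whose evaluation object is missing, since the omission is what the classifier is meant to detect.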
It should be understood that the above feature template may specifically comprise the following sub-templates:
First feature sub-template: the bag-of-words (bagword) feature of the current clause, i.e. the token sequence formed by the segmented words of the clause;
Second feature sub-template: the combined word and part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words in the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
To explain the above features more clearly, a concrete example is given below:
Take as the current clause "friend very likes." (that is, "my friend likes it very much."), which after segmentation is "friend / very / likes / .". The preceding clause is "very good,", which after segmentation is "very / good / ,". The part of speech of "friend" is NN, of "very" is AD, of "likes" is VV, and of "good" is VA; all punctuation is uniformly tagged PU.
Feature 1: the bag-of-words feature of the current clause, i.e. "friend very likes .";
Feature 2: the combined word and part-of-speech features of the current clause, i.e. "friend_NN very_AD likes_VV ._PU" is used as the feature;
Feature 3: the word feature of the first word of the current clause, i.e. "friend";
Feature 4: the part-of-speech features of the first two words of the current clause, i.e. "NN_AD";
Feature 5: the part-of-speech features of all words in the current clause, i.e. "NN_AD_VV_PU";
Feature 6: the part-of-speech feature of the first word of the current clause, i.e. "NN";
Feature 7: the part-of-speech features of the last three words of the current clause, i.e. "AD_VV_PU";
Feature 8: the word feature and part-of-speech feature of the last word of the current clause, i.e. "._PU";
Feature 9: the bag-of-words feature of the clause preceding the current clause, i.e. "very good ,".
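The nine sub-templates in the worked example can be reproduced with a short function. This is a sketch rather than the application's implementation: the function name `extract_features`, the feature keys, the underscore and space separators, and the English glosses standing in for the Chinese tokens are all assumptions. It expects a non-empty clause given as (word, pos) pairs.

```python
def extract_features(clause, prev_clause=None):
    """Apply the nine feature sub-templates to one non-empty clause.

    clause, prev_clause: lists of (word, pos) tuples.
    """
    words = [w for w, _ in clause]
    tags = [t for _, t in clause]
    feats = {
        "f1_bagword": " ".join(words),
        "f2_word_pos": " ".join(f"{w}_{t}" for w, t in clause),
        "f3_first_word": words[0],
        "f4_first_two_pos": "_".join(tags[:2]),
        "f5_all_pos": "_".join(tags),
        "f6_first_pos": tags[0],
        "f7_last_three_pos": "_".join(tags[-3:]),
        "f8_last_word_pos": f"{words[-1]}_{tags[-1]}",
    }
    if prev_clause:
        # Ninth sub-template: only available when a previous clause exists.
        feats["f9_prev_bagword"] = " ".join(w for w, _ in prev_clause)
    return feats


current = [("friend", "NN"), ("very", "AD"), ("likes", "VV"), (".", "PU")]
previous = [("very", "AD"), ("good", "VA"), (",", "PU")]
for name, value in extract_features(current, previous).items():
    print(name, "=", value)
```

To feed these into a classifier that takes indicator features, each key and value pair can be joined into one string such as "f4_first_two_pos=NN_AD".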
The device for identifying evaluation objects in Chinese text provided by the embodiments of the present application is described below. The device described below and the method described above may be referred to in correspondence with each other.
Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a device for identifying evaluation objects disclosed in an embodiment of the present application.
As shown in Fig. 3, the device comprises:
A word segmentation unit 31, configured to perform word segmentation on each raw corpus entry in a corpus and determine the part-of-speech feature of each word feature obtained by the segmentation;
A label receiving unit 32, configured to receive a user-supplied label for each of the word features, the label indicating whether the word feature is an evaluation object or a sentiment word;
A clause splitting unit 33, configured to split each raw corpus entry into several clauses;
A clause screening unit 34, configured to select target clauses, each target clause containing a word feature labeled as a sentiment word;
A feature extraction unit 35, configured to extract corpus features from the target clauses using a preset feature template, to form a training corpus;
A training unit 36, configured to train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;
An object identification unit 37, configured to use the target maximum entropy classifier to identify evaluation objects in the text to be tested.
Optionally, referring to Fig. 4, Fig. 4 is a schematic structural diagram of a feature extraction unit disclosed in an embodiment of the present application.
As shown in Fig. 4, the feature extraction unit 35 comprises:
A clause division unit 41, configured to divide the target clauses into first-type target clauses and second-type target clauses, where a first-type target clause contains no word feature labeled as an evaluation object and a second-type target clause contains a word feature labeled as an evaluation object;
A positive training sample determining unit 42, configured to extract corpus features from the first-type target clauses using the preset feature template, to form positive training samples;
A negative training sample determining unit 43, configured to extract corpus features from the second-type target clauses using the preset feature template, to form negative training samples.
Optionally, the above feature template may comprise:
First feature sub-template: the bag-of-words (bagword) feature of the current clause, i.e. the token sequence formed by the segmented words of the clause;
Second feature sub-template: the combined word and part-of-speech features of the current clause;
Third feature sub-template: the word feature of the first word of the current clause;
Fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
Fifth feature sub-template: the part-of-speech features of all words in the current clause;
Sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
Seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
Eighth feature sub-template: the word feature and part-of-speech feature of the last word of the current clause.
Ninth feature sub-template: the bag-of-words feature of the clause preceding the current clause.
The device for identifying evaluation objects in Chinese text provided by the embodiments of the present application segments each raw corpus entry in the corpus and determines the part-of-speech feature of each resulting word feature; receives a user-supplied label for each word feature, indicating whether it is an evaluation object or a sentiment word; splits each raw corpus entry into several clauses; selects the target clauses, which contain a word feature labeled as a sentiment word; extracts corpus features from the target clauses using a preset feature template to form a training corpus; trains a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier; and uses that classifier to identify evaluation objects in the text to be tested. The application uses a maximum entropy classifier combined with multiple features to identify whether the text to be tested contains an evaluation object, and achieves good results.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to be non-exclusive, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. The application is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for identifying evaluation objects in Chinese text, characterized by comprising:
performing word segmentation on each raw corpus entry in a corpus, and determining the part-of-speech feature of each word feature obtained by the segmentation;
receiving a user-input label for each word feature, the label indicating whether the word feature is an evaluation object or a sentiment word;
splitting each raw corpus entry into several clauses;
selecting target clauses, each target clause containing a word feature labeled as a sentiment word;
extracting corpus features from the target clauses using a preset feature template to form a training corpus;
training a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;
identifying evaluation objects in a text to be tested using the target maximum entropy classifier.

2. The identification method according to claim 1, characterized in that performing word segmentation on each raw corpus entry in the corpus and determining the part-of-speech feature of each word feature obtained by the segmentation comprises:
using the Stanford Parser tool to segment each raw corpus entry in the corpus and to determine the part-of-speech feature of each word feature obtained by the segmentation.

3. The identification method according to claim 1, characterized in that extracting corpus features from the target clauses using the preset feature template to form the training corpus comprises:
dividing the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;
extracting corpus features from the first-type target clauses using the preset feature template to form positive training samples;
extracting corpus features from the second-type target clauses using the preset feature template to form negative training samples.

4. The identification method according to claim 3, characterized in that the feature template comprises:
a first feature sub-template: the bagword feature of the current clause, the bagword feature being the word sequence composed of the segmented-word features of the clause;
a second feature sub-template: a feature combining the word features and the part-of-speech features of the current clause;
a third feature sub-template: the word feature of the first word of the current clause;
a fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
a fifth feature sub-template: the part-of-speech features of all words in the current clause;
a sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
a seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
an eighth feature sub-template: the word feature and the part-of-speech feature of the last word of the current clause.

5. The identification method according to claim 4, characterized in that the feature template further comprises:
a ninth feature sub-template: the bagword feature of the clause preceding the current clause.

6. A device for identifying evaluation objects in Chinese text, characterized by comprising:
a word segmentation unit, configured to perform word segmentation on each raw corpus entry in a corpus and to determine the part-of-speech feature of each word feature obtained by the segmentation;
a label receiving unit, configured to receive a user-input label for each word feature, the label indicating whether the word feature is an evaluation object or a sentiment word;
a clause splitting unit, configured to split each raw corpus entry into several clauses;
a clause screening unit, configured to select target clauses, each target clause containing a word feature labeled as a sentiment word;
a feature extraction unit, configured to extract corpus features from the target clauses using a preset feature template to form a training corpus;
a training unit, configured to train a maximum entropy classifier on the training corpus to obtain a trained target maximum entropy classifier;
an object recognition unit, configured to identify evaluation objects in a text to be tested using the target maximum entropy classifier.

7. The identification device according to claim 6, characterized in that the word segmentation unit comprises:
a first word segmentation subunit, configured to use the Stanford Parser tool to segment each raw corpus entry in the corpus and to determine the part-of-speech feature of each word feature obtained by the segmentation.

8. The identification device according to claim 6, characterized in that the feature extraction unit comprises:
a clause division unit, configured to divide the target clauses into first-type target clauses and second-type target clauses, a first-type target clause being a clause that does not contain a word feature labeled as an evaluation object, and a second-type target clause being a clause that contains a word feature labeled as an evaluation object;
a positive training sample determination unit, configured to extract corpus features from the first-type target clauses using the preset feature template to form positive training samples;
a negative training sample determination unit, configured to extract corpus features from the second-type target clauses using the preset feature template to form negative training samples.

9. The identification device according to claim 8, characterized in that the feature template in the feature extraction unit comprises:
a first feature sub-template: the bagword feature of the current clause, the bagword feature being the word sequence composed of the segmented-word features of the clause;
a second feature sub-template: a feature combining the word features and the part-of-speech features of the current clause;
a third feature sub-template: the word feature of the first word of the current clause;
a fourth feature sub-template: the part-of-speech features of the first two words of the current clause;
a fifth feature sub-template: the part-of-speech features of all words in the current clause;
a sixth feature sub-template: the part-of-speech feature of the first word of the current clause;
a seventh feature sub-template: the part-of-speech features of the last three words of the current clause;
an eighth feature sub-template: the word feature and the part-of-speech feature of the last word of the current clause.

10. The identification device according to claim 9, characterized in that the feature template further comprises:
a ninth feature sub-template: the bagword feature of the clause preceding the current clause.
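
As an illustration only, the clause screening of claim 1, the positive/negative split of claim 3, and the nine feature sub-templates of claims 4 and 5 might be sketched in Python as follows. The claims do not fix any data format, so everything concrete here is an assumption: the `(word, pos, label)` token tuples, the label names `TARGET` and `SENTIMENT`, the punctuation set used to split clauses, the `T1_`...`T9_` feature-string prefixes, and the Penn Chinese Treebank style tags in the example. Word segmentation, part-of-speech tagging, and the training of the maximum entropy classifier are left out.

```python
from typing import List, Optional, Tuple

# Hypothetical token format: (word, part-of-speech tag, label).
# Labels are assumed names: "TARGET" = evaluation object,
# "SENTIMENT" = sentiment word, "O" = anything else.
Token = Tuple[str, str, str]

# Assumed clause delimiters; the claims do not specify how clauses are split.
CLAUSE_DELIMS = {"，", "。", "！", "？", "；", ",", ".", "!", "?", ";"}


def split_clauses(tokens: List[Token]) -> List[List[Token]]:
    """Split one segmented corpus entry into clauses at punctuation tokens."""
    clauses, current = [], []
    for tok in tokens:
        if tok[0] in CLAUSE_DELIMS:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    return clauses


def extract_features(clause: List[Token],
                     prev_clause: Optional[List[Token]] = None) -> List[str]:
    """Apply sub-templates 1-9 to a non-empty clause, returning string features."""
    words = [w for w, _, _ in clause]
    tags = [p for _, p, _ in clause]
    feats = [f"T1_w={w}" for w in words]                       # 1: bagword of clause
    feats += [f"T2_w/p={w}/{p}" for w, p in zip(words, tags)]  # 2: word + POS
    feats.append(f"T3_first_w={words[0]}")                     # 3: first word
    feats.append("T4_first2_p=" + "_".join(tags[:2]))          # 4: POS of first two words
    feats += [f"T5_p={p}" for p in tags]                       # 5: POS of all words
    feats.append(f"T6_first_p={tags[0]}")                      # 6: POS of first word
    feats.append("T7_last3_p=" + "_".join(tags[-3:]))          # 7: POS of last three words
    feats.append(f"T8_last_w/p={words[-1]}/{tags[-1]}")        # 8: last word + POS
    if prev_clause:
        feats += [f"T9_prev_w={w}" for w, _, _ in prev_clause]  # 9: bagword of previous clause
    return feats


def build_samples(tokens: List[Token]) -> List[Tuple[List[str], int]]:
    """Keep clauses containing a sentiment word and label them as in claim 3."""
    samples = []
    clauses = split_clauses(tokens)
    for i, clause in enumerate(clauses):
        labels = {lab for _, _, lab in clause}
        if "SENTIMENT" not in labels:
            continue  # not a target clause
        prev = clauses[i - 1] if i > 0 else None
        # Claim 3: clauses WITHOUT a labeled evaluation object are positive samples.
        y = 0 if "TARGET" in labels else 1
        samples.append((extract_features(clause, prev), y))
    return samples


# Example entry: "屏幕 很 清晰 ， 很 喜欢 。" (tags and labels are illustrative).
entry = [("屏幕", "NN", "TARGET"), ("很", "AD", "O"), ("清晰", "VA", "SENTIMENT"),
         ("，", "PU", "O"), ("很", "AD", "O"), ("喜欢", "VV", "SENTIMENT"),
         ("。", "PU", "O")]
samples = build_samples(entry)
```

In this example the first clause contains a labeled evaluation object and so becomes a negative sample, while the second clause has a sentiment word but no evaluation object and becomes a positive sample. The resulting feature lists could then be passed to any maximum entropy (multinomial logistic regression) implementation; which one the applicants used is not stated in the claims.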
CN201410548882.5A 2014-10-16 2014-10-16 A method and device for identifying evaluation objects in Chinese texts Pending CN104298665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410548882.5A CN104298665A (en) 2014-10-16 2014-10-16 A method and device for identifying evaluation objects in Chinese texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410548882.5A CN104298665A (en) 2014-10-16 2014-10-16 A method and device for identifying evaluation objects in Chinese texts

Publications (1)

Publication Number Publication Date
CN104298665A true CN104298665A (en) 2015-01-21

Family

ID=52318394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410548882.5A Pending CN104298665A (en) 2014-10-16 2014-10-16 A method and device for identifying evaluation objects in Chinese texts

Country Status (1)

Country Link
CN (1) CN104298665A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778163A (en) * 2015-05-11 2015-07-15 苏州大学 Method and system for recognizing event trigger word
CN105069647A (en) * 2015-07-30 2015-11-18 齐鲁工业大学 Improved method for extracting evaluation object in Chinese commodity review
CN105912522A (en) * 2016-03-31 2016-08-31 长安大学 Automatic extraction method and extractor of English corpora based on constituent analyses
CN106202056A (en) * 2016-07-26 2016-12-07 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106202498A (en) * 2016-07-20 2016-12-07 淮阴工学院 A kind of network behavior custom quantization method based on classification corpus key word word frequency record association
CN106951565A (en) * 2017-04-05 2017-07-14 数库(上海)科技有限公司 File classification method and the text classifier of acquisition
CN107025250A (en) * 2016-04-11 2017-08-08 苏州大学 A kind of Internet user's data processing method, apparatus and system
CN107220238A (en) * 2017-05-24 2017-09-29 电子科技大学 A kind of text object abstracting method based on Mixed Weibull distribution
CN108255808A (en) * 2017-12-29 2018-07-06 东软集团股份有限公司 The method, apparatus and storage medium and electronic equipment that text divides
CN108694176A (en) * 2017-04-06 2018-10-23 北京京东尚科信息技术有限公司 Method, apparatus, electronic equipment and the readable storage medium storing program for executing of document sentiment analysis
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN109213996A (en) * 2018-08-08 2019-01-15 厦门快商通信息技术有限公司 A kind of training method and system of corpus
CN109948141A (en) * 2017-12-21 2019-06-28 北京京东尚科信息技术有限公司 A kind of method and apparatus for extracting Feature Words
CN110119506A (en) * 2018-02-05 2019-08-13 北京京东尚科信息技术有限公司 Text Extraction and system, equipment and storage medium
WO2019174423A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Entity sentiment analysis method and related apparatus
WO2020007138A1 (en) * 2018-07-03 2020-01-09 腾讯科技(深圳)有限公司 Method for event identification, method for model training, device, and storage medium
CN111027307A (en) * 2018-09-21 2020-04-17 北京国双科技有限公司 Method and device for judging content influencing judgment result in judgment document
WO2020114373A1 (en) * 2018-12-07 2020-06-11 北京国双科技有限公司 Method and apparatus for realizing element recognition in judicial document
CN111738008A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Entity identification method, device and equipment based on multilayer model and storage medium
CN111966822A (en) * 2019-05-20 2020-11-20 北京京东尚科信息技术有限公司 Method and device for determining emotion category of evaluation information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080249764A1 (en) * 2007-03-01 2008-10-09 Microsoft Corporation Smart Sentiment Classifier for Product Reviews
CN103631961A (en) * 2013-12-17 2014-03-12 苏州大学张家港工业技术研究院 Method for identifying relationship between sentiment words and evaluation objects

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
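戴敏 (Dai Min): "Research on the ellipsis phenomenon in Chinese evaluation object extraction" (中文评价对象抽取中省略现象研究), China Master's Theses Full-text Database, Information Science and Technology series *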

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778163A (en) * 2015-05-11 2015-07-15 苏州大学 Method and system for recognizing event trigger word
CN105069647A (en) * 2015-07-30 2015-11-18 齐鲁工业大学 Improved method for extracting evaluation object in Chinese commodity review
CN105912522A (en) * 2016-03-31 2016-08-31 长安大学 Automatic extraction method and extractor of English corpora based on constituent analyses
CN107025250A (en) * 2016-04-11 2017-08-08 苏州大学 A kind of Internet user's data processing method, apparatus and system
CN106202498A (en) * 2016-07-20 2016-12-07 淮阴工学院 A kind of network behavior custom quantization method based on classification corpus key word word frequency record association
CN106202056B (en) * 2016-07-26 2019-01-04 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106202056A (en) * 2016-07-26 2016-12-07 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106951565A (en) * 2017-04-05 2017-07-14 数库(上海)科技有限公司 File classification method and the text classifier of acquisition
CN106951565B (en) * 2017-04-05 2018-04-27 数库(上海)科技有限公司 File classification method and the text classifier of acquisition
CN108694176B (en) * 2017-04-06 2021-05-25 北京京东尚科信息技术有限公司 Document emotion analysis method and device, electronic equipment and readable storage medium
CN108694176A (en) * 2017-04-06 2018-10-23 北京京东尚科信息技术有限公司 Method, apparatus, electronic equipment and the readable storage medium storing program for executing of document sentiment analysis
CN107220238A (en) * 2017-05-24 2017-09-29 电子科技大学 A kind of text object abstracting method based on Mixed Weibull distribution
CN109948141A (en) * 2017-12-21 2019-06-28 北京京东尚科信息技术有限公司 A kind of method and apparatus for extracting Feature Words
CN108255808A (en) * 2017-12-29 2018-07-06 东软集团股份有限公司 The method, apparatus and storage medium and electronic equipment that text divides
CN108255808B (en) * 2017-12-29 2021-10-22 东软集团股份有限公司 Text division method and device, storage medium and electronic equipment
CN110119506A (en) * 2018-02-05 2019-08-13 北京京东尚科信息技术有限公司 Text Extraction and system, equipment and storage medium
CN110119506B (en) * 2018-02-05 2023-12-08 北京京东尚科信息技术有限公司 Text extraction method, system, equipment and storage medium
WO2019174423A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Entity sentiment analysis method and related apparatus
WO2020007138A1 (en) * 2018-07-03 2020-01-09 腾讯科技(深圳)有限公司 Method for event identification, method for model training, device, and storage medium
US11972213B2 (en) 2018-07-03 2024-04-30 Tencent Technology (Shenzhen) Company Limited Event recognition method and apparatus, model training method and apparatus, and storage medium
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN109213996A (en) * 2018-08-08 2019-01-15 厦门快商通信息技术有限公司 A kind of training method and system of corpus
CN111027307B (en) * 2018-09-21 2023-04-07 北京国双科技有限公司 Method and device for judging content influencing judgment result in judgment document
CN111027307A (en) * 2018-09-21 2020-04-17 北京国双科技有限公司 Method and device for judging content influencing judgment result in judgment document
WO2020114373A1 (en) * 2018-12-07 2020-06-11 北京国双科技有限公司 Method and apparatus for realizing element recognition in judicial document
CN111966822A (en) * 2019-05-20 2020-11-20 北京京东尚科信息技术有限公司 Method and device for determining emotion category of evaluation information
CN111966822B (en) * 2019-05-20 2025-03-21 北京京东尚科信息技术有限公司 Method and device for determining sentiment category of evaluation information
CN111738008B (en) * 2020-07-20 2021-04-27 深圳赛安特技术服务有限公司 Entity identification method, device and equipment based on multilayer model and storage medium
CN111738008A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Entity identification method, device and equipment based on multilayer model and storage medium

Similar Documents

Publication Publication Date Title
CN104298665A (en) A method and device for identifying evaluation objects in Chinese texts
Ljubešić et al. The FRENK datasets of socially unacceptable discourse in Slovene and English
CN103631961B (en) Method for identifying relationship between sentiment words and evaluation objects
CN106407236B (en) A kind of emotion tendency detection method towards comment data
CN105069021B (en) Chinese short text sensibility classification method based on field
CN102682124B (en) Emotion classifying method and device for text
CN102262625B (en) Web page keyword extraction method and device
CN104008091A (en) Sentiment value based web text sentiment analysis method
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN106649519B (en) A method for mining and evaluating product features
CN109446404A (en) A kind of the feeling polarities analysis method and device of network public-opinion
CN104536953B (en) A kind of recognition methods of text emotional valence and device
CN105045857A (en) Social network rumor recognition method and system
CN103577534B (en) Searching method and search engine
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN106407235B (en) A kind of semantic dictionary construction method based on comment data
CN103955453B (en) A kind of method and device for finding neologisms automatic from document sets
KR20120109943A (en) Emotion classification method for analysis of emotion immanent in sentence
CN103294664A (en) Method and system for discovering new words in open fields
CN104915443B (en) A kind of abstracting method of Chinese microblogging evaluation object
CN105740236A (en) Writing feature and sequence feature combined Chinese sentiment new word recognition method and system
WO2014048479A1 (en) A system and method for the automatic creation or augmentation of an electronically rendered publication document
Shalunts et al. Sentiment analysis of German social media data for natural disasters.
CN104504024A (en) Method and system for mining keywords based on microblog content
CN105912644A (en) Network review generation type abstract method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150121
