CN101520775A - Chinese syntax parsing method with merged semantic information - Google Patents
- Publication number
- CN101520775A CN200910131827A
- Authority
- CN
- China
- Prior art keywords
- semantic
- word
- hownet
- grammar
- analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
Description
Technical field
The invention belongs to the field of natural language processing and specifically relates to a Chinese syntactic parsing method that incorporates semantic information, introducing semantic knowledge into parsing to improve parsing performance.
Background
Syntactic parsing is a key technology in natural language processing. It analyzes how words combine into meaningful phrases and sentences, revealing the underlying regularities of a language. Parsing results directly affect natural language understanding. In practical applications, a high-performance parser helps improve higher-level systems such as information extraction, information retrieval, machine translation, and automatic question answering.
Given a grammar model, parsing derives the grammatical structure of a sentence by some algorithm; the structure is usually represented as a tree. For example, the parse of the sentence "大连外贸出口额一半以上来自‘三资’企业。" ("More than half of Dalian's foreign trade exports come from 'three kinds of foreign-invested' enterprises.") can be represented by the tree in Figure 1(a). In this tree the leaf nodes at the bottom are words, called terminals; all non-leaf nodes are called non-terminals, and the lowest layer of non-leaf nodes, which carries the parts of speech, is called the preterminal layer. Because natural language is pervasively ambiguous, the same sentence may admit several different grammatical structures, so effective information and algorithms are needed to resolve the ambiguity and find the most plausible structure. This is the problem that current parsing methods all set out to solve.
Statistical learning methods can learn lexical and structural preferences from a training corpus and thus handle syntactic ambiguity to some extent. The appearance of manually annotated treebanks (such as the Penn Treebank built at the University of Pennsylvania) made statistical parsing methods possible and greatly advanced the field. The most studied statistical approach is the probabilistic context-free grammar (PCFG), which describes sentence structure with a set of context-free rules and assigns a probability to each rule. Its advantages are a simple form and polynomial-time processing.
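As a rough illustration of how a PCFG scores a parse, the sketch below multiplies rule probabilities over a toy tree. The grammar, its probabilities, and the name `tree_prob` are illustrative assumptions and do not come from the patent.

```python
from math import prod

# Toy PCFG (assumed values): each (parent, children) rule has a probability,
# and the rules that share a parent sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("NN",)): 0.6,
    ("NP", ("NR",)): 0.4,
    ("VP", ("VV", "NP")): 1.0,
    ("NR", ("大连",)): 1.0,
    ("VV", ("来自",)): 1.0,
    ("NN", ("企业",)): 1.0,
}

def tree_prob(tree):
    """Probability of a tree written as (label, child, ...); leaves are strings."""
    if isinstance(tree, str):
        return 1.0  # a word contributes no rule of its own
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    # Independence assumption: rule probability times the children's probabilities.
    return RULES[(label, child_labels)] * prod(tree_prob(c) for c in children)

tree = ("S", ("NP", ("NR", "大连")),
             ("VP", ("VV", "来自"), ("NP", ("NN", "企业"))))
print(tree_prob(tree))
```

The product form is exactly what the conditional independence assumption buys: each node is scored without looking at its position in the tree.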
One problem with the PCFG model comes from its conditional independence assumption, under which the expansion of any non-terminal (any node above the word nodes in the parse tree) is independent of the expansion of every other non-terminal. Studies of the distribution of non-terminals at different positions in treebanks show, however, that the expansion of a node sometimes depends on its position in the tree, which a plain PCFG ignores. Two routes are commonly taken to improve the basic model: introducing lexical information, and extending the non-terminal labels; the latter is often called the unlexicalized approach. The most representative lexicalized work is head-driven parsing: in his doctoral thesis, Michael Collins attached lexical, distance, and other information to every non-terminal in the grammar rules to make the grammar more discriminative. Unlexicalized methods either refine some non-terminals by hand or refine the labels automatically through unsupervised learning so that they cover more linguistic phenomena; representative work is that of Dan Klein and colleagues at UC Berkeley. Both approaches have shortcomings: lexical information brings data sparsity, and with automatically refined labels it is unclear how accurately they capture linguistic phenomena.
Summary of the invention
The purpose of the invention is to provide a Chinese syntactic parsing method that incorporates semantic information, using semantic information to improve parsing performance while also obtaining syntactically constrained semantic information from the parsing results.
Theoretical studies have shown that semantic information can help syntactic disambiguation. Semantics concerns the meaning and structure of words and the way things are expressed; related research falls into two parts: the semantics of individual words (word sense), and how word meanings combine into sentence meaning. The main task of semantic analysis is to produce a representation of the lexical semantic units of a text and the dependencies among them. Syntactic and semantic analysis are two different levels of language analysis, but they constrain each other. Chinese word order constrains semantics strongly, and the semantic relations among syntactic constituents are fairly complex. In many cases, analyzing grammatical form alone cannot explain the internal regularities of a sentence. Introducing semantics into Chinese parsing therefore helps resolve structural ambiguity.
Using semantic information presupposes a predefined semantic specification, and the most direct option is an existing semantic resource. The resource used in this method is HowNet (知网), a common-sense knowledge base built on the concepts expressed by Chinese and English words and on the features of those concepts; its basic content is the relations between concepts and between the attributes of concepts. From HowNet we can obtain concepts or concept attributes of a word at different levels and use them as semantic classes. For "汽车" (car), for example, we obtain the chain "entity|实体 => thing|万物 => physical|物质 => inanimate|无生物 => artifact|人工物 => implement|器具 => vehicle|交通工具 => LandVehicle|车", which lists, from left to right, the semantic classes of "汽车" in HowNet from coarse to fine. "entity|实体" is the coarsest class and covers the widest range; "LandVehicle|车" is the finest class, the most specific and the closest in meaning to "汽车".
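The coarse-to-fine chain quoted above can be treated as a list from which one level is picked. The sketch below is only an illustration: the function name `semantic_class` and the choice to return `None` when a path is too short are assumptions, not something the patent specifies.

```python
# Hypernym chain for "汽车" as quoted in the text, ordered coarse to fine.
PATH = [
    "entity|实体", "thing|万物", "physical|物质", "inanimate|无生物",
    "artifact|人工物", "implement|器具", "vehicle|交通工具", "LandVehicle|车",
]

def semantic_class(path, level):
    """Return the class at depth `level` (0 = coarsest).

    Assumption: a word whose chain does not reach the requested depth gets
    no class (None), so that every class used comes from the same level.
    """
    return path[level] if level < len(path) else None

print(semantic_class(PATH, 0))
print(semantic_class(PATH, len(PATH) - 1))
```

Choosing a small `level` gives few, broad classes; a large `level` gives many narrow ones and therefore sparser counts.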
By examining the relationship between syntactic and semantic analysis, the invention integrates semantic information into unlexicalized parsing to address the PCFG model's lack of semantic information, and uses semantic labels to further refine the part-of-speech layer. The added semantic information helps the parser disambiguate, which improves parsing performance to a certain degree.
The basic idea is thus that syntax and semantics are two different levels of language analysis that act together and influence each other, and that semantic information is very helpful for resolving structural ambiguity. Incorporating semantic information into an unlexicalized parser clearly improves its performance, and the resulting analysis contains both the syntactic modification relations and the semantic class of each word.
The starting point is to obtain a high-performance parser, with semantic analysis as an auxiliary means. The base model is an unlexicalized PCFG that refines labels automatically by unsupervised learning to increase the descriptive power of the grammar; its performance has already surpassed that of lexicalized parsers. On top of this, the method uses HowNet as a semantic dictionary to supply semantic classes at a chosen level for some of the words in the treebank, attaches each class to the preterminal (the layer directly above the words) in the parse tree, and trains on the annotated treebank to obtain a grammar model that contains semantic information. Decoding needs no special handling to produce parses with semantic labels. Experiments show that the method effectively improves parsing performance.
The technical scheme is described in detail in three parts below.
1. How semantic information is integrated into parsing
HowNet serves as the semantic dictionary, and the sememes it defines (the smallest units of meaning) serve as semantic classes. Sememes in HowNet stand in hypernym-hyponym relations, as shown in Figure 2. Semantic classes at different levels are extracted along these relations; each word in a parse tree is used as a key to look up its semantic class, and the class is attached to the preterminal. To keep the semantic scheme consistent and to reduce data sparsity, the semantic classes retrieved for all words must lie at the same level of HowNet.
Words with more than one semantic class raise a word sense disambiguation problem. One strategy is to take the first class; alternatively, we designed a sense-annotation system for polysemous words in which their semantic classes are labeled by hand. Words absent from HowNet receive no semantic information.
Figure 1 shows an example of semantic annotation. Figure 1(a) is a treebank sentence before annotation and Figure 1(b) the same sentence after annotation: the semantic class of a word is attached to its preterminal.
Non-terminals above the part-of-speech layer cannot be looked up in HowNet directly. The simplest way to annotate them would resemble head-word extraction: treat the semantic information of a preterminal like a head word and propagate it to the nodes above. But words have many semantic classes, so attaching them to higher nodes could create many more non-terminals and cause severe data sparsity when data is limited. The upper non-terminals are therefore still refined by unsupervised automatic splitting and merging, without semantics.
After this processing, the preterminals of most words in the treebank carry a semantic class from one level of HowNet. Training the parsing model on this treebank yields a grammar model that incorporates semantic information. Decoding with this grammar produces parses with semantic labels, and the parses are also more accurate.
2. Training the parsing model
The base model is an unlexicalized parsing model: non-terminal labels are refined in an unsupervised manner to increase the descriptive power of the grammar. The model is summarized below.
Unlexicalized PCFG parsing has advanced considerably in recent years, and the best model has reached state-of-the-art parsing performance. Within the basic PCFG framework, the model refines non-terminal labels automatically by unsupervised learning. Training consists mainly of two processes, splitting and merging. Splitting divides every non-terminal into two, refining the labels; this increases the complexity of the grammar and its coverage of the linguistic phenomena in the treebank. Merging determines which of those splits are necessary, measured by the effect of a split on the likelihood of the whole treebank: if merging the two sub-labels back together lowers the treebank likelihood only slightly, the split is unnecessary and the sub-labels are merged.
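To make the splitting step concrete, here is a minimal sketch that splits every symbol of a binary-rule grammar in two. It covers only the split; EM re-estimation and the likelihood-based merge are left out. The rule format, the noise scheme, and the name `split_grammar` are assumptions made for illustration, not the patent's or any particular parser's implementation.

```python
import random

def split_grammar(rules, noise=0.01, seed=0):
    """Split every symbol X into X_0 and X_1.

    `rules` maps (parent, left, right) to a probability. Each rule becomes
    eight rules: for each parent half, the original probability is shared
    among the four child combinations, with slight random noise to break
    the symmetry so that later re-estimation can tell the halves apart.
    """
    rng = random.Random(seed)
    split = {}
    for (a, b, c), p in rules.items():
        for i in (0, 1):
            weights = {(j, k): 1.0 + rng.uniform(-noise, noise)
                       for j in (0, 1) for k in (0, 1)}
            total = sum(weights.values())
            for (j, k), w in weights.items():
                # The four shares for one parent half sum back to p.
                split[(f"{a}_{i}", f"{b}_{j}", f"{c}_{k}")] = p * w / total
    return split

rules = {("S", "NP", "VP"): 1.0, ("NP", "NP", "NN"): 0.3}
split = split_grammar(rules)
print(len(split))  # 16
```

In a full trainer, the probabilities of the split grammar would next be re-estimated on the treebank, after which splits that contribute little to the likelihood would be undone.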
This unlexicalized method based on automatic splitting first of all guarantees a strong baseline, and the model readily accommodates semantic information. Adding semantic information from an external dictionary helps constrain the automatic splitting of syntactic labels; conversely, the subsequent automatic splitting ensures that the added semantic classes do not distort the division of syntactic functions.
3. Decoding
For a new sentence, the syntactic structure can be derived from the grammar model obtained in training. The basic method applies the grammar rules bottom-up by chart parsing to derive the most probable parse tree, but the search space of this simple approach is very large. For efficiency, a coarse-to-fine strategy is used: a simple grammar model is first used to decode a set of candidates, and a finer grammar model then decodes within those candidates. Many impossible analyses are pruned before the fine-grained pass, which shrinks the search space and improves efficiency.
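The basic bottom-up chart parsing mentioned above can be sketched as a Viterbi CKY parser over a grammar in Chomsky normal form. This shows only the plain, unpruned search; the coarse-to-fine pruning is not implemented here. The grammar, the data layout, and the name `cky` are illustrative assumptions.

```python
from collections import defaultdict

def cky(words, lexicon, binary, root="S"):
    """Viterbi CKY for a PCFG in Chomsky normal form.

    lexicon: {(tag, word): prob}; binary: {(parent, left, right): prob}.
    Returns (probability, tree) for the best parse labeled `root`, or None.
    """
    n = len(words)
    chart = defaultdict(dict)  # (start, end) -> {label: (prob, tree)}
    for i, w in enumerate(words):
        for (tag, word), p in lexicon.items():
            if word == w:
                chart[i, i + 1][tag] = (p, (tag, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):  # split point
                for (a, b, c), p in binary.items():
                    if b in chart[i, j] and c in chart[j, k]:
                        pb, tb = chart[i, j][b]
                        pc, tc = chart[j, k][c]
                        cand = p * pb * pc
                        # Keep only the most probable analysis per label.
                        if cand > chart[i, k].get(a, (0.0, None))[0]:
                            chart[i, k][a] = (cand, (a, tb, tc))
    return chart[0, n].get(root)

lexicon = {("NR", "大连"): 1.0, ("VV", "来自"): 1.0, ("NN", "企业"): 1.0}
binary = {("S", "NR", "VP"): 1.0, ("VP", "VV", "NN"): 1.0}
print(cky(["大连", "来自", "企业"], lexicon, binary))
```

Because the tags are opaque strings, preterminal labels that carry a semantic class would pass through the same code unchanged, which matches the statement that decoding needs no special handling.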
Positive effects of the invention:
Compared with the prior art, the invention uses semantic information to help syntactic disambiguation, which effectively improves parsing performance and markedly raises both efficiency and accuracy; the semantics-aware parser also yields semantic information for some of the words.
Description of drawings
Figure 1: A parse tree, and the parse tree after semantic information is added;
(a) a treebank sentence before annotation; (b) the sentence after semantic annotation;
Figure 2: An example fragment of the sememe tree in the semantic resource HowNet;
Figure 3: Flow chart of the method of the invention.
Detailed description
Specific embodiments of the invention are described in detail below with reference to the drawings. The flow chart of the method is shown in Figure 3.
1. Build the word-to-semantic-class index
Semantic classes at different levels, from coarse to fine, are extracted according to the hypernym-hyponym relations among the sememes defined in HowNet and associated with each word, giving an index from words to semantic classes. The words here carry part-of-speech information.
2. Add semantic class information to the original treebank
For the original treebank, the word and its part of speech together form the key used to retrieve the semantic class, which is then attached at the part-of-speech (preterminal) level, refining the part-of-speech labels. Some part-of-speech labels thus come to carry semantic information.
Some words have several semantic classes. Two strategies are used in this case: take the first of the classes, or choose by manual annotation according to context.
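Steps 1 and 2 can be sketched as a dictionary lookup followed by a tree rewrite. The index entries below, the `TAG-class` label format, and the name `annotate` are hypothetical; the patent does not give the label format used in Figure 1(b).

```python
# Hypothetical index: (word, POS) -> semantic classes at one fixed HowNet
# level. The entries are invented for illustration.
INDEX = {
    ("企业", "NN"): ["thing|万物"],
    ("出口", "NN"): ["event|事件", "thing|万物"],
}

def annotate(tree, index):
    """Attach a semantic class to each preterminal of a (label, child, ...) tree.

    A preterminal is a node whose only child is a word (a string). Words
    with several classes take the first one; words missing from the index
    keep their plain part-of-speech label.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        classes = index.get((children[0], label))
        if classes:
            label = f"{label}-{classes[0].split('|')[0]}"
        return (label, children[0])
    # Phrase-level labels are left alone; only preterminals are refined.
    return (label, *(annotate(c, index) for c in children))

tree = ("NP", ("NN", "出口"), ("NN", "企业"), ("NR", "大连"))
print(annotate(tree, INDEX))
```

Swapping the first-class rule for a manually annotated choice would only change how `classes[0]` is selected; the tree rewrite stays the same.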
3. Train the grammar model
The treebank with added semantic classes is the training data. Grammar training uses the unlexicalized parsing model described above, refining non-terminals by automatic splitting and merging. To check whether preterminals that already carry semantic information should also go through this refinement, we ran experiments. Adding coarse-grained semantics and still applying automatic refinement works better than not refining, and it also works better than directly adding more discriminative fine-grained semantics without automatic refinement; the effect analysis below gives details.
4. Parse the sentence to be analyzed
With the grammar model trained above, a sentence to be analyzed (already word-segmented) is decoded by the unlexicalized parser described earlier according to the grammar model, yielding the parse together with the semantic annotation of the sentence.
Effect analysis:
To verify the effectiveness of the invention we designed a series of experiments, some of which are described below.
Experimental corpus:
Training and test data come from the Penn Chinese Treebank (UPenn Chinese Tree Bank) 2.0, which contains 325 news articles, divided in the standard way: articles 1-25 form the development set (350 sentences), articles 26-270 the training set (3172 sentences), and articles 271-300 the test set (348 sentences).
The semantic dictionary is HowNet.
Baseline system:
The baseline is the unlexicalized parsing model described above. Non-terminal labels are split and refined automatically without supervision: each iteration splits every label in two, the parameters of the new labels are estimated with the EM algorithm, and split labels are then merged according to their contribution to the likelihood.
Evaluation program:
Evaluation uses EVALB, a widely used parsing evaluation tool. It scores by matching labeled brackets and reports precision, recall, and F-score.
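Labeled bracket scoring can be illustrated with a few lines of code. This is a simplified sketch of the idea, not a reimplementation of EVALB, which applies further conventions that are not modeled here; the tree encoding and the names `brackets` and `prf` are assumptions.

```python
from collections import Counter

def brackets(tree, start=0):
    """Return (labeled spans, end position) for a (label, child, ...) tree.

    Preterminals are skipped, so only phrase-level brackets are counted.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return Counter(), start + 1
    spans, pos = Counter(), start
    for child in children:
        child_spans, pos = brackets(child, pos)
        spans += child_spans
    spans[(label, start, pos)] += 1
    return spans, pos

def prf(gold, test):
    """Labeled precision, recall, and F1 of `test` against `gold`."""
    g, _ = brackets(gold)
    t, _ = brackets(test)
    matched = sum((g & t).values())  # brackets present in both trees
    lp = matched / sum(t.values())
    lr = matched / sum(g.values())
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

gold = ("S", ("NP", ("NR", "a"), ("NN", "b")),
             ("VP", ("VV", "c"), ("NP", ("NN", "d"))))
test = ("S", ("NP", ("NR", "a")),
             ("VP", ("NN", "b"), ("VV", "c"), ("NP", ("NN", "d"))))
print(prf(gold, test))  # (0.5, 0.5, 0.5)
```

Here both trees have four phrase brackets and share two of them, the root S and the final NP, so precision, recall, and F1 all come to 0.5.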
Experimental results and analysis:
Results of the baseline system on the standard CTB data set are given in Table 1:
Table 1: Baseline system performance
S&M denotes the number of split-merge cycles: S&M-1 is one cycle, and S&M-2 is two cycles, that is, one more cycle applied to the grammar obtained from the first. Len is the sentence length in words; Len<=40 means testing only on sentences of no more than 40 words, and All means testing on all sentences. LR is recall, LP is precision, and F1 is the F-score.
To reduce data sparsity, we select the top-level semantic classes of HowNet and apply automatic refinement to all labels. Results on the same data set are given in Table 2.
Table 2: Parsing performance with coarse-grained semantic class labels
Starting from the fourth split-merge iteration, parsing with added semantic classes outperforms the baseline. At the sixth iteration the labels are split too finely and overfitting sets in; the F-score drops somewhat, with the same trend in the baseline and the improved system, but adding semantic classes still beats the baseline. At the fifth iteration the F-score rises from 80.26% to 81.63%, an absolute gain of 1.37 points, which is a substantial improvement in parsing research.
In addition, when trained on the latest release, version 5.0, of the Penn Chinese Treebank (18782 sentences in total), the invention reaches an F-score of up to 86.39%. The comparison before and after adding semantic information follows the same trend as on Penn Chinese Treebank 2.0 above and is not repeated here.
The invention takes an unlexicalized parser as its basis and integrates semantic information into it, using that information to help the parser disambiguate. Parser performance improves clearly, and the semantics-aware parser yields semantic information for some of the words.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009101318275A CN101520775B (en) | 2009-02-17 | 2009-04-08 | Chinese syntax parsing method with merged semantic information |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910078113 | 2009-02-17 | ||
CN200910078113.2 | 2009-02-17 | ||
CN2009101318275A CN101520775B (en) | 2009-02-17 | 2009-04-08 | Chinese syntax parsing method with merged semantic information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101520775A true CN101520775A (en) | 2009-09-02 |
CN101520775B CN101520775B (en) | 2012-05-30 |
Family
ID=41081371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009101318275A Expired - Fee Related CN101520775B (en) | 2009-02-17 | 2009-04-08 | Chinese syntax parsing method with merged semantic information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101520775B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013088287A1 (en) * | 2011-12-12 | 2013-06-20 | International Business Machines Corporation | Generation of natural language processing model for information domain |
CN103189860A (en) * | 2010-11-05 | 2013-07-03 | Sk普兰尼特有限公司 | Machine translation device and machine translation method in which a syntax conversion model and a vocabulary conversion model are combined |
CN107818781A (en) * | 2017-09-11 | 2018-03-20 | 远光软件股份有限公司 | Intelligent interactive method, equipment and storage medium |
CN109298796A (en) * | 2018-07-24 | 2019-02-01 | 北京捷通华声科技股份有限公司 | A kind of Word association method and device |
CN109543195A (en) * | 2018-11-19 | 2019-03-29 | 腾讯科技(深圳)有限公司 | A kind of method, the method for information processing and the device of text translation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5966686A (en) * | 1996-06-28 | 1999-10-12 | Microsoft Corporation | Method and system for computing semantic logical forms from syntax trees |
CN101329666A (en) * | 2008-06-18 | 2008-12-24 | 南京大学 | Chinese Syntax Automatic Analysis Method Based on Corpus and Tree Structure Pattern Matching |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103189860A (en) * | 2010-11-05 | 2013-07-03 | Sk普兰尼特有限公司 | Machine translation device and machine translation method in which a syntax conversion model and a vocabulary conversion model are combined |
WO2013088287A1 (en) * | 2011-12-12 | 2013-06-20 | International Business Machines Corporation | Generation of natural language processing model for information domain |
CN103999081A (en) * | 2011-12-12 | 2014-08-20 | 国际商业机器公司 | Generation of natural language processing model for information domain |
US9740685B2 (en) | 2011-12-12 | 2017-08-22 | International Business Machines Corporation | Generation of natural language processing model for an information domain |
CN107818781A (en) * | 2017-09-11 | 2018-03-20 | 远光软件股份有限公司 | Intelligent interactive method, equipment and storage medium |
CN109298796A (en) * | 2018-07-24 | 2019-02-01 | 北京捷通华声科技股份有限公司 | A kind of Word association method and device |
CN109298796B (en) * | 2018-07-24 | 2022-05-24 | 北京捷通华声科技股份有限公司 | Word association method and device |
CN109543195A (en) * | 2018-11-19 | 2019-03-29 | 腾讯科技(深圳)有限公司 | A kind of method, the method for information processing and the device of text translation |
CN109543195B (en) * | 2018-11-19 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Text translation method, information processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN101520775B (en) | 2012-05-30 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Roark et al. | Processing South Asian languages written in the Latin script: the Dakshina dataset | |
Cotterell et al. | Labeled morphological segmentation with semi-markov models | |
US6810375B1 (en) | Method for segmentation of text | |
CN103154936B (en) | For the method and system of robotization text correction | |
CN102799577B (en) | A kind of Chinese inter-entity semantic relation extraction method | |
CN103500160B (en) | A kind of syntactic analysis method based on the semantic String matching that slides | |
Shindo et al. | Bayesian symbol-refined tree substitution grammars for syntactic parsing | |
Pettersson et al. | A multilingual evaluation of three spelling normalisation methods for historical text | |
CN101329666A (en) | Chinese Syntax Automatic Analysis Method Based on Corpus and Tree Structure Pattern Matching | |
CN104317846A (en) | Semantic analysis and marking method and system | |
Alegria et al. | Representation and treatment of multiword expressions in Basque | |
Constant et al. | Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields | |
CN101520775A (en) | Chinese syntax parsing method with merged semantic information | |
CN110427619A (en) | It is a kind of based on Multichannel fusion and the automatic proofreading for Chinese texts method that reorders | |
Oo et al. | An analysis of ambiguity detection techniques for software requirements specification (SRS) | |
Cheung et al. | Topological field parsing of German | |
CN100424685C (en) | A hierarchical Chinese long sentence syntax analysis method and device based on punctuation processing | |
Leidig et al. | Automatic detection of anglicisms for the pronunciation dictionary generation: a case study on our German IT corpus. | |
Fang et al. | Parsing Japanese with a PCFG treebank grammar | |
CN114444490A (en) | Short text similarity calculation method integrated with priori knowledge | |
Li et al. | A unified model for solving the OOV problem of chinese word segmentation | |
Ariaratnam et al. | A shallow parser for Tamil | |
Tsai et al. | Applying an NVEF Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem | |
KR101638442B1 (en) | Method and apparatus for segmenting chinese sentence | |
Eineborg et al. | ILP in part-of-speech tagging—an overview |
Legal Events
Code | Title | Description | Date |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120530; Termination date: 20180408 ||