CN105630770A

CN105630770A - Word segmentation phonetic transcription and ligature writing method and device based on SC grammar

Info

Publication number: CN105630770A
Application number: CN201510994505.9A
Authority: CN
Inventors: 黄河燕; 黄静
Original assignee: ETONG LANGUAGE TECHNOLOGY (BEIJING) Co Ltd; Beijing Institute of Technology BIT
Current assignee: ETONG LANGUAGE TECHNOLOGY (BEIJING) Co Ltd; Beijing Institute of Technology BIT
Priority date: 2015-12-23
Filing date: 2015-12-25
Publication date: 2016-06-01

Abstract

The invention relates to a method and device for word segmentation, phonetic transcription and joined writing based on SC grammar, and belongs to the technical field of machine translation in computer science. First, using the word-segmentation ambiguity rules of SC grammar and the adjacency constraints of natural language, the invention builds an ambiguity segmentation rule base that excludes illegal segmentations and thereby improves segmentation accuracy. Second, it uses an SC-grammar joined-writing rule base together with a joined-writing corpus statistics base; the corpus statistics base covers joined-writing knowledge that cannot be expressed as rules. Finally, using the SC-grammar dictionary base, it segments the text by forward maximum matching against the dictionary, invokes the segmentation ambiguity rules on ambiguous fields to obtain the correct segmentation, and analyzes the context of each word to obtain the correct part-of-speech tag and phonetic transcription. Compared with the prior art, the invention improves segmentation accuracy, and the segmentation ambiguity rule base, combinational-ambiguity lexicon, joined-writing rule base, dictionary base and joined-writing corpus statistics base are easy to extend and maintain.

Description

A method and device for word segmentation, phonetic transcription and joined writing based on SC grammar

Technical field

The invention relates to a method and device for word segmentation, phonetic transcription and joined writing, in particular to such a method and device based on SC grammar for use in a Chinese-to-Braille translation system, and belongs to the technical field of machine translation in computer science.

Background

Machine translation is the process of using a computer to convert an expression in one natural language into another natural language. A Chinese-to-Braille translation system automatically translates Chinese text into Braille characters, which is of great help to the education and daily life of blind people. Braille is a special form of phonetic script. To translate Chinese characters into Braille, the Chinese text must first be segmented into words with joined writing applied, then converted into Pinyin, and finally converted from Pinyin into Braille. The accuracy of Chinese word segmentation and phonetic transcription therefore largely determines the accuracy of Chinese-to-Braille translation. Word segmentation with joined writing is an important rule unique to Chinese Braille. Word segmentation means writing each word separately; joined writing means writing certain words together, in keeping with the particular nature of Braille, so that the syllable structure is not too loose and the text is easier to read by touch. Segmentation and joined writing must follow the basic rules of Chinese grammar, the logic and conventions of the language, and syllable length. When converting Chinese into Pinyin, individual characters are often polyphonic, but words are polyphonic far less often than single characters, and words of three or more characters are rarely polyphonic, so correct segmentation and joined writing greatly reduces polyphony. Isolated polyphonic characters nevertheless remain, and transcribing them correctly requires natural language analysis of the surrounding context. The conversion from Chinese characters to Braille therefore has two difficulties: 1. improving the correctness of Chinese word segmentation and joined writing; 2. transcribing polyphonic characters correctly by analyzing the context. Translation from Chinese to Braille in China is currently still done manually, and the heavy workload of producing more and better educational material for blind readers lowers accuracy. A high-accuracy method of word segmentation, phonetic transcription and joined writing for Chinese-to-Braille conversion is therefore urgently needed to lay a solid foundation for Chinese-to-Braille translation.

Summary of the invention

The purpose of the invention is to solve the problem of Chinese-to-Braille machine translation by providing a method and device, based on SC grammar, for fast and accurate word segmentation, phonetic transcription and joined writing.

The idea of the invention is as follows. 1. Using the word-segmentation ambiguity rules of SC grammar and the adjacency constraints of natural language, build an ambiguity segmentation rule base that excludes illegal segmentations and so improves segmentation accuracy. 2. Using an SC-grammar joined-writing rule base and a joined-writing corpus statistics base, write certain words together, in keeping with the particular nature of Braille, so that the syllable structure is not too loose and blind readers can read the text by touch more easily; the corpus statistics base handles joined-writing knowledge that cannot be expressed as rules. 3. Using the SC-grammar dictionary base, segment the text by forward maximum matching against the dictionary, invoke the segmentation ambiguity rules on ambiguous fields to obtain the correct segmentation, and analyze the context of each word to obtain the correct part-of-speech tag and phonetic transcription.

The purpose of the invention is achieved through the following technical solution:

A method for word segmentation, phonetic transcription and joined writing based on SC grammar, which relies on a dictionary base, a combinational-ambiguity lexicon, a segmentation ambiguity rule base, a joined-writing rule base and a joined-writing corpus statistics base, comprises the following steps:

(1) Receive the Chinese character string to be segmented and transcribed, together with the article genre type;

The character string is a pure Chinese-character string, that is, one containing no special symbols such as digits, punctuation marks or ASCII characters. If the string contains non-Chinese characters, it is split; the non-Chinese substrings are processed separately, for example by directly generating word nodes and assigning them the corresponding types, while the Chinese substrings go through step (2) onward and, after segmentation, transcription and joined writing, are merged with the processed non-Chinese substrings for output.
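The splitting of mixed input into Chinese and non-Chinese substrings could be sketched as follows. This is only an illustration: the function name `split_runs` and the choice of the basic CJK Unified Ideographs block as the definition of "Chinese character" are assumptions, since the text does not specify how characters are classified.

```python
import re

# Assumption: a "Chinese character" is any code point in the basic
# CJK Unified Ideographs block; the source does not define the range.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_runs(text):
    """Split text into ('han' | 'other', substring) runs, in order."""
    runs = []
    pos = 0
    for match in HAN_RUN.finditer(text):
        if match.start() > pos:
            # Non-Chinese material between two Chinese runs.
            runs.append(("other", text[pos:match.start()]))
        runs.append(("han", match.group()))
        pos = match.end()
    if pos < len(text):
        runs.append(("other", text[pos:]))
    return runs
```

Only the runs labelled `han` would be passed on to step (2); the `other` runs would become word nodes directly and be merged back in order at output time.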

(2) Segment the Chinese character string using the dictionary base, and apply part-of-speech tagging and phonetic transcription to the resulting word chunks;

(3) According to the article genre type, invoke the corresponding joined-writing rule base, and combine the word chunks from step (2) using its Braille segmentation and joined-writing rules;

(4) Apply a second round of joined writing to the combined word chunks using the joined-writing corpus statistics base;

(5) Output the resulting Chinese character string, segmented, transcribed and joined.
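The overall flow of steps (2) to (5) can be pictured as a simple composition of stages. The sketch below is an assumption about structure only: `transcribe`, its parameters and the `rule_bases` mapping are hypothetical names, and the four stage functions are placeholders for the processing described in the text.

```python
def transcribe(text, genre, segment, tag, rule_join, corpus_join, rule_bases):
    """Run steps (2) to (5) on a pure Chinese string.

    All four callables and rule_bases are hypothetical stand-ins for the
    stages and rule bases described in the text.
    """
    chunks = segment(text)                         # step (2): segmentation
    tagged = tag(chunks)                           # step (2): POS tags and Pinyin
    joined = rule_join(tagged, rule_bases[genre])  # step (3): genre-specific rules
    return corpus_join(joined)                     # steps (4)-(5): second round, output
```

Passing the genre as a key into `rule_bases` reflects the one design point the text does state: the joined-writing rule base is chosen by article genre.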

The dictionary base is used for Chinese word segmentation, part-of-speech tagging and phonetic transcription. Each entry contains the Chinese word symbol, grammatical-semantic attribute identifiers, context discrimination functions and the word's Pinyin.

The dictionary base is built as follows: a classification system of grammatical and semantic attributes is defined from Chinese dictionary knowledge and the entries are recorded accordingly; language engineers then refine the base further while debugging corpora.

Segmentation using the dictionary base is carried out as follows:

a. With reference to the dictionary base, split the sentence into word chunks using the forward maximum matching algorithm;

b. Check for cross ambiguity according to the overlap features of the word chunks;

c. Check the word chunks for combinational ambiguity using the combinational-ambiguity lexicon;

d. Resolve the ambiguities by inference according to the ambiguity rules;

e. Output the segmentation result.

Cross ambiguity arises in a string of the form AXB in which AX forms a word and XB also forms a word, where A, X and B are each at least one character long. Examples include “有时间” (有时 / 时间), “不同情况” (不同 / 同情 / 情况) and “大脑袋” (大脑 / 脑袋).

The combinational-ambiguity lexicon is used to identify word chunks that have combinational ambiguity; it contains two-character words with this property. A combinationally ambiguous word is a string of the form AB in which A and B can also each stand as a word on their own, for example “将来” in the sentence “他将来上海。”, which must be read as 将 / 来 rather than as the single word 将来.

The combinational-ambiguity lexicon is built incrementally by language engineers as they debug large volumes of corpus text.

The segmentation ambiguity rule base is used to resolve ambiguous word chunks by inference and obtain the correct segmentation. Each rule consists of an ambiguous word-chunk part, a conditional function and a correct-segmentation operation.

The segmentation ambiguity rule base is built as follows: language engineers gradually summarize and refine the rules while debugging large volumes of corpus text. The rule base is divided into cross-ambiguity rules and combinational-ambiguity rules; word chunks with cross ambiguity are disambiguated by inference with the cross-ambiguity rules, and word chunks with combinational ambiguity with the combinational-ambiguity rules.

The combinational-ambiguity check on a word chunk is carried out as follows:

a. Look up the current word chunk in the combinational-ambiguity lexicon using binary search;

b. Output a combinational-ambiguity flag according to the lookup result.
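The two steps above amount to a membership test on a sorted list. A minimal sketch using Python's standard `bisect` module follows; the function name and the lexicon contents other than “将来” are hypothetical and serve only to illustrate the lookup.

```python
import bisect


def is_combinational_ambiguous(word, sorted_lexicon):
    """Return True if word is in the sorted combinational-ambiguity lexicon."""
    i = bisect.bisect_left(sorted_lexicon, word)
    return i < len(sorted_lexicon) and sorted_lexicon[i] == word


# Hypothetical lexicon; only "将来" is taken from the text.
LEXICON = sorted(["将来", "才能", "马上"])
```

The list must be kept sorted with the same ordering that the comparison uses (here, Python's default code-point ordering), otherwise binary search can miss entries that are present.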

Disambiguation by inference according to the ambiguity rules is carried out as follows:

a. For the current word chunk carrying an ambiguity flag, match it against the ambiguous word-chunk part of each ambiguity rule;

b. If the match succeeds, check the conditional function;

c. If the condition is satisfied, perform the correct-segmentation operation;

d. Output the correct segmentation result.

Part-of-speech tagging and phonetic transcription of the segmented word chunks is carried out as follows:

a. For the current word chunk, retrieve its dictionary information from the dictionary base;

b. Check the context function of each entry in turn;

c. If the context check is satisfied, take the part of speech and Pinyin of that entry.

The joined-writing rule base is used to combine the segmented and tagged word chunks. Each rule consists of a rule word-chunk part, a conditional function and a joined-writing operation. According to article genre, the rule base is divided into a classical Chinese rule base and a modern Chinese rule base.

The joined-writing rule base is built as follows: the joined-writing rules defined in Braille publications are recorded one by one, and language engineers refine the base further while debugging corpora.

Joined writing of word chunks using the joined-writing rules is carried out as follows:

a. For the current run of word chunks, match against the word-chunk part of each joined-writing rule;

b. If the match succeeds, check the conditional function;

c. If the condition is satisfied, perform the joined-writing operation;

d. Output the segmentation result after joined writing.

The joined-writing corpus statistics base is used for a second round of joined writing on the word chunks already combined by the joined-writing rules. It contains word chunks that must be written together, such as “三大纪律”. It is divided into a base lexicon and a user lexicon: the base lexicon contains common joined word chunks, and the user lexicon contains word chunks that the user has defined as requiring joined writing.

The joined-writing corpus statistics base is built as follows: specific joined word chunks defined in Braille publications are recorded, and language engineers refine the base further while debugging corpora.

The second round of joined writing using the corpus statistics base is carried out as follows:

a. Match the current word chunk against the user lexicon first and then the base lexicon;

b. If the match succeeds, join the chunks;

c. Output the joined word chunks.
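The second round can be sketched as a left-to-right scan that tries to join runs of adjacent chunks into a lexicon entry, consulting the user lexicon before the base lexicon. This is a simplified illustration: the function name, the use of plain strings for chunks, the longest-span-first search order and the `max_span` limit are assumptions, not details given for this step.

```python
def second_pass_join(chunks, user_lexicon, base_lexicon, max_span=4):
    """Join adjacent chunks whose concatenation is a lexicon entry.

    max_span (how many adjacent chunks may be joined) is an assumption;
    the source does not state a limit.
    """
    out = []
    i = 0
    while i < len(chunks):
        step = 1
        # Try the longest run of chunks first, then shorter runs.
        for span in range(min(max_span, len(chunks) - i), 1, -1):
            candidate = "".join(chunks[i:i + span])
            # The user lexicon is consulted before the base lexicon.
            if candidate in user_lexicon or candidate in base_lexicon:
                step = span
                break
        out.append("".join(chunks[i:i + step]))
        i += step
    return out
```

For instance, if an earlier stage had produced the hypothetical chunks 三 / 大 / 纪律 and the base lexicon contained “三大纪律”, the three chunks would be emitted as one joined chunk.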

A device for word segmentation, phonetic transcription and joined writing based on SC grammar relies on a dictionary base, a combinational-ambiguity lexicon, a joined-writing corpus statistics base, a joined-writing rule base and a segmentation ambiguity rule base. It comprises, connected in sequence, a word segmentation module, a part-of-speech tagging and phonetic transcription module, a first joined-writing module and a second joined-writing module. The word segmentation module and the part-of-speech tagging and phonetic transcription module are each connected to the dictionary base; the word segmentation module is also connected to the combinational-ambiguity lexicon and to the segmentation ambiguity rule base; the first joined-writing module is connected to the joined-writing rule base; and the second joined-writing module is connected to the joined-writing corpus statistics base.

The word segmentation module splits the input Chinese character string into independent word chunks using the dictionary base. During splitting it checks each chunk for ambiguity, using the cross-ambiguity features and the combinational-ambiguity lexicon, and resolves any segmentation ambiguity using the segmentation ambiguity rule base to obtain the correct word chunks;

The part-of-speech tagging and phonetic transcription module takes the word chunks produced by the word segmentation module and, using the dictionary base and its context function checks, assigns each chunk its correct part of speech and Pinyin;

The first joined-writing module combines the part-of-speech-tagged word chunks; using the joined-writing rule base, it checks the conditional functions to obtain the joined word chunks;

The second joined-writing module looks up the word chunks produced by the first joined-writing module in the joined-writing corpus statistics base to obtain the joined word chunks, and outputs the word chunks with their part-of-speech tags and phonetic transcription.

Preferably, the dictionary base, combinational-ambiguity lexicon, joined-writing corpus statistics base, joined-writing rule base and segmentation ambiguity rule base can all be revised and improved continuously over time, thereby improving segmentation accuracy.

Beneficial effects

Braille is a special form of phonetic script, so the accuracy of Chinese word segmentation and phonetic transcription largely determines the accuracy of Chinese-to-Braille translation. The SC-grammar dictionary structure designed in the invention improves the accuracy of transcribing polyphonic characters, and the SC-grammar segmentation and joined-writing rules improve segmentation accuracy. The segmentation ambiguity rule base, combinational-ambiguity lexicon, joined-writing rule base, dictionary base and joined-writing corpus statistics base are easy to extend and maintain.

Description of the drawings

The invention is described in detail below with reference to the drawings and an example:

Fig. 1 is a schematic flow chart of a method for word segmentation, phonetic transcription and joined writing based on SC grammar according to an embodiment of the invention;

Fig. 2 is a flow chart of the word segmentation process;

Fig. 3 is a flow chart of the part-of-speech tagging and phonetic transcription process;

Fig. 4 is a flow chart of the joined-writing process;

Fig. 5 is a schematic structural diagram of a device for word segmentation, phonetic transcription and joined writing based on SC grammar according to an embodiment of the invention.

Detailed description

The invention is described in detail below with reference to the drawings and an embodiment.

A method for word segmentation, phonetic transcription and joined writing based on SC grammar, whose flow is shown in Fig. 1, comprises the following steps:

⑴ Receive the Chinese character string to be segmented and transcribed, together with the article genre type;

The implementation of the method is illustrated below with a received article genre of modern Chinese and the Chinese character string “2008年，小李晋升为这个项目的总工程师” (“In 2008, Xiao Li was promoted to chief engineer of this project”).

⑵ Segment the Chinese character string using the dictionary base, and apply part-of-speech tagging and phonetic transcription to the resulting word chunks. As shown in Fig. 2, this is done as follows:

2.1 Perform forward maximum matching of the Chinese character string against the dictionary to cut out word chunks.

Combining the dictionary's maximum word length with the maximum possible span in the sentence, determine an optimal maximum span N and look up the N-character field in the dictionary. Take the sentence “2008年，小李晋升为这个项目的总工程师。”: the maximum word length for “年” in the dictionary is 3, because the longest dictionary word beginning with “年” has three characters; the maximum possible span of “年” in the sentence is 1, because the next character is not a Chinese character; so the optimal maximum span N for “年” in this sentence is 1. If the dictionary contains the N-character field, the match succeeds and the field is cut out as a word. If not, the match fails: the last character of the field is dropped and the remaining N−1 characters become the new matching field, and matching is repeated until a word is cut out. This completes one round of matching, which yields one word. The rounds are repeated until all words have been cut out.
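The matching loop described above can be sketched as follows, for a pure Chinese substring. The function names and the toy dictionary in the usage note are hypothetical; treating an unmatched single character as a word of length 1 is an assumption made so that the loop always advances.

```python
def build_max_len(dictionary):
    """Map each first character to the length of the longest word starting with it."""
    table = {}
    for word in dictionary:
        if word:
            table[word[0]] = max(table.get(word[0], 1), len(word))
    return table


def forward_max_match(text, dictionary):
    """Segment a pure Chinese string by forward maximum matching."""
    max_len = build_max_len(dictionary)
    words = []
    i = 0
    while i < len(text):
        # N = min(longest dictionary word for this first character,
        #         characters remaining in the string).
        n = min(max_len.get(text[i], 1), len(text) - i)
        # Drop the last character until the field is a dictionary word;
        # a single character is always accepted as a fallback (assumption).
        while n > 1 and text[i:i + n] not in dictionary:
            n -= 1
        words.append(text[i:i + n])
        i += n
    return words
```

With a toy dictionary containing 小, 李, 晋升, 为, 这, 个, 项目, 目的, 的, 总 and 工程师, the substring “小李晋升为这个项目的总工程师” is cut into 小 / 李 / 晋升 / 为 / 这 / 个 / 项目 / 的 / 总 / 工程师, because “项目” is matched before “目的” can be considered.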

2.2 Word-chunk ambiguity check

If the word just cut out has more than one character, that is, N > 1, check for cross ambiguity: take the second character of the word as the start of a new word and, with a word length of at least N, perform the matching operation described above. If such a word is found, cross ambiguity exists, and the segmentation ambiguity rules are invoked to disambiguate by inference. In the sentence above, when “项目” is cut out, taking “目” as the start with a word length of 2 finds that “目的” is also a word, so “项目” has cross ambiguity.
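A sketch of this check is given below. It follows the description literally: only the second character of the word is tried as a new start, and only candidate lengths of at least N are considered, so that any word found necessarily extends past the end of the current word. The function name and its argument layout are hypothetical.

```python
def has_cross_ambiguity(text, start, word, dictionary):
    """Check whether `word`, found at text[start:], overlaps a following word.

    The second character of `word` is taken as a new starting point and
    dictionary words of length >= len(word) are looked up from there.
    """
    n = len(word)
    if n < 2:
        return False
    head = start + 1
    longest = max((len(w) for w in dictionary), default=0)
    for length in range(n, longest + 1):
        end = head + length
        if end <= len(text) and text[head:end] in dictionary:
            return True
    return False
```

In the substring “这个项目的总工程师”, the word “项目” starts at index 2; starting from “目” with length 2 gives “目的”, which is in the toy dictionary used here, so the function reports cross ambiguity.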

If the current word is two characters long, it may have combinational ambiguity, so the combinational-ambiguity lexicon is queried to determine whether it does. In the example string, “项目” is not in the combinational-ambiguity lexicon, so it has cross ambiguity only. If “项目” were in the lexicon, it would have both cross ambiguity and combinational ambiguity.

2.3 Disambiguation by inference

The segmentation ambiguity rules corresponding to the current word's ambiguity flag type are invoked to disambiguate by inference. The ambiguity rule base contains segmentation rules for particular words, word classes or attributes, such as the combinational-ambiguity rule “NP(将来),NP(PLA)→DWD(A)”. Here “NP(将来),NP(PLA)” is the first part of the rule, the ambiguous word-chunk part; “DWD(A)” is the third part, the correct-segmentation operation; the second part, the conditional function, is empty in this rule. The rule states that when chunk A, “将来”, is followed by a chunk B that is a noun (NP) denoting a place (PLA), chunk A is to be split, “DWD(A)”. For example, in the sentence “他将来上海。”, steps 2.1 and 2.2 find that “将来” has combinational ambiguity; the rule “NP(将来),NP(PLA)→DWD(A)” matches, and the correct segmentation of “将来” is “将/来”. Cross-ambiguity rules and combinational-ambiguity rules have the same form and differ only in content. In the example sentence, “项目” has cross ambiguity, so the cross-ambiguity rules are invoked. No rule in the ambiguity rule base matches, but because the segmentation algorithm of the invention is based on forward maximum matching, the forward longest-first principle gives “项目” as the correct word.
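The three-part rule structure (ambiguous chunk part, conditional function, segmentation operation) and the fallback to the forward-matching result can be sketched as below. The class and field names are hypothetical, and reading “DWD” as "split chunk A into its individual characters" is an interpretation of the single example given, not a documented definition.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass
class Chunk:
    text: str
    tags: FrozenSet[str]  # hypothetical attribute labels such as "NP", "PLA"


@dataclass
class AmbiguityRule:
    word: str                  # ambiguous chunk A, e.g. "将来"
    next_tags: FrozenSet[str]  # labels required on the following chunk B
    condition: Optional[Callable[[List[Chunk], int], bool]]  # None = empty
    operation: str             # "DWD" is read here as "split chunk A"


def apply_rules(chunks, i, rules):
    """Return the replacement for chunks[i], or [chunks[i]] if no rule fires."""
    cur = chunks[i]
    nxt = chunks[i + 1] if i + 1 < len(chunks) else None
    for rule in rules:
        if cur.text != rule.word or nxt is None:
            continue
        if not rule.next_tags <= nxt.tags:
            continue
        if rule.condition is not None and not rule.condition(chunks, i):
            continue
        if rule.operation == "DWD":
            return [Chunk(ch, frozenset()) for ch in cur.text]
    # No rule matched: keep the forward-maximum-matching result.
    return [cur]
```

Applied to the chunks 他 / 将来 / 上海, with 上海 labelled NP and PLA, a rule built from “NP(将来),NP(PLA)→DWD(A)” replaces “将来” with 将 / 来, while a chunk that matches no rule is returned unchanged.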

The remaining Chinese character string is processed by the same steps until all words have been cut out. The word chunks of the example sentence after segmentation are:

2008/年/，/小/李/晋升/为/这/个/项目/的/总/工程师/。/

2.4 Part-of-speech tagging and phonetic transcription

Fig. 3 shows the process of part-of-speech tagging and phonetic transcription of word chunks, which is as follows:

For each Chinese word chunk, look up the dictionary and retrieve the word's dictionary information. For example, “年”, the first Chinese word chunk in the current sentence, is represented in the dictionary as follows:

$年

TIM:(NCGEN,nian)S(L,(1,1),[AP；Q；WH；R])“nian2”

AP:(AGEN)“nian2”

Here “$年” is the first part of the entry, the Chinese word symbol. “TIM:(NCGEN,nian)” is the second part, the grammatical-semantic attribute identifier; it indicates that “年” can serve as a time word (TIM) in a sentence. “S(L,(1,1),[AP；Q；WH；R])” is the third part, the context discrimination function; it states that if “年” serves as a time word (TIM), the first word to its left must be an adjective (AP), a numeral (Q), an interrogative (WH) or a pronoun (R). “nian2” is the fourth part, the word's Pinyin.

In the example sentence, “2008” is a numeral (Q), which satisfies the first entry for “年”, so the part of speech TIM and the Pinyin “nian2” are taken. Continuing in this way, the part-of-speech tagging and phonetic transcription result for the sentence is:

2008/Q/2008年/TIM/nian2，/BD/,小/AP/xiao3李/R/li3晋升/VP/jin4sheng1为/SV/wei2这/R/zhe4个/L/ge4项目/NP/xiang4mu4的/DEF/de0总/AP/zong3工程师/NP/gong1cheng2shi1。/BD/。
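The way the entry for “年” is selected can be sketched as an ordered scan over the word's dictionary entries, taking the first one whose context check passes. The `Sense` class, its field names and the reduction of the context function to "allowed part of speech of the word immediately to the left" are simplifying assumptions that cover only the S(L,(1,1),[…]) form shown in the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Sense:
    pos: str
    pinyin: str
    # Allowed parts of speech for the word immediately to the left;
    # None means the entry has no context function.
    left_pos: Optional[FrozenSet[str]] = None


def tag(senses, left_pos):
    """Return (pos, pinyin) of the first entry whose context check passes."""
    for sense in senses:
        if sense.left_pos is None or left_pos in sense.left_pos:
            return sense.pos, sense.pinyin
    return None


# The two entries for "年" from the example, in dictionary order.
NIAN = [
    Sense("TIM", "nian2", frozenset({"AP", "Q", "WH", "R"})),
    Sense("AP", "nian2"),
]
```

With “2008” tagged Q to its left, `tag(NIAN, "Q")` selects the TIM entry; with a left neighbour outside the allowed set, the scan falls through to the second entry, which has no context restriction.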

After part-of-speech tagging and phonetic transcription, joined writing is carried out by the process shown in Fig. 4, as follows:

⑶ According to the article genre type, invoke the corresponding joined-writing rule base, and combine the word chunks from step ⑵ using its Braille segmentation and joined-writing rules;

The article is of the modern Chinese genre, so the modern Chinese joined-writing rules are invoked and the segmented, tagged word chunks are taken in order from left to right. When the current word chunk is “2008/Q/2008”, the following rule matches:

S1{label:Q}S2{label:NP/L/TIM,length:1}||S1,S2

Here “S1{label:Q}S2{label:NP/L/TIM,length:1}” is the first part of the rule, the rule word-chunk part. It states that the first word chunk in the rule is a numeral (Q) and the second is a noun (NP), measure word (L) or time word (TIM) of word length (length) 1. This rule has no conditional function. “S1,S2” is the third part, the joined-writing operation; it states that word chunks S1 and S2 are to be written together. The word chunks “2008/Q/2008年/TIM/nian2” therefore need to be joined. The new chunk after joining is written “2008年/QCH/2008nian2”, where the QCH flag marks it as a joined chunk. The next candidate chunk, “小/AP/xiao3”, is then taken and matched against the joined-writing rules, and the steps above are repeated, giving the word chunks after the first round of joined writing:

2008年/QCH/2008nian2 ，/BD/, 小李/QCH/xiao3li3 晋升/VP/jin4sheng1 为/SV/wei2 这个/QCH/zhe4ge4 项目/NP/xiang4mu4 的/DEF/de0 总工程师/QCH/zong3gong1cheng2shi1 。/BD/。 (The sentence reads: "In 2008, Xiao Li was promoted to chief engineer of this project.")
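The rule format described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Block` class and the `parse_rule`, `matches` and `apply_rule` helpers are hypothetical names, the rule is assumed to have the form "word blocks|condition|operation", only rules with an empty condition function are handled, and the operation is assumed to join all matched blocks under the QCH tag.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    text: str    # surface form, e.g. "2008"
    label: str   # part-of-speech tag, e.g. "Q"
    pinyin: str  # phonetic transcription, e.g. "nian2"


def parse_rule(rule):
    """Split 'word blocks|condition|operation' and parse each Sn{...} constraint."""
    pattern_part, condition, operation = rule.split("|")
    constraints = []
    for body in re.findall(r"S\d+\{([^}]*)\}", pattern_part):
        fields = dict(item.split(":") for item in body.split(","))
        labels = set(fields["label"].split("/"))  # alternatives such as NP/L/TIM
        length = int(fields["length"]) if "length" in fields else None
        constraints.append((labels, length))
    return constraints, condition, operation.split(",")


def matches(block, constraint):
    labels, length = constraint
    return block.label in labels and (length is None or len(block.text) == length)


def apply_rule(blocks, rule):
    """Scan left to right and join every run of blocks that the rule matches."""
    constraints, condition, _operation = parse_rule(rule)
    if condition:
        # Assumption: condition functions are outside the scope of this sketch.
        raise NotImplementedError("condition functions are not modelled here")
    n = len(constraints)
    out, i = [], 0
    while i < len(blocks):
        window = blocks[i:i + n]
        if len(window) == n and all(map(matches, window, constraints)):
            out.append(Block("".join(b.text for b in window), "QCH",
                             "".join(b.pinyin for b in window)))
            i += n
        else:
            out.append(blocks[i])
            i += 1
    return out


RULE = "S1{label:Q}S2{label:NP/L/TIM,length:1}||S1,S2"
joined = apply_rule([Block("2008", "Q", "2008"), Block("年", "TIM", "nian2")], RULE)
# joined == [Block("2008年", "QCH", "2008nian2")]
```

A real rule base would hold many such rules per genre and would also evaluate the condition function before performing the operation.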

(4) Perform a second round of combination and continuous writing on the combined word blocks, based on the continuous-writing corpus statistics base;

The word blocks produced by the first round of continuous writing are taken out one by one from left to right and matched against the word blocks in the user lexicon and the basic lexicon according to the longest-is-best principle. When a match succeeds, the blocks are combined and written together, giving the word blocks after the second round of combination and continuous writing:

2008年/QCH/2008nian2 ，/BD/, 小李/QCH/xiao3li3 晋升/VP/jin4sheng1 为/SV/wei2 这个/QCH/zhe4ge4 项目/NP/xiang4mu4 的/DEF/de0 总工程师/QCH/zong3gong1cheng2shi1 。/BD/。
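The second round can be sketched as a longest-match lookup over runs of adjacent word blocks. The following Python is an illustration under stated assumptions, not the patented implementation: `second_pass` and `max_span` are hypothetical names, both lexicons are modelled as plain sets of strings, merged blocks are assumed to receive the QCH tag, and the lexicon entry used in the example is invented for demonstration.

```python
def second_pass(blocks, user_lexicon, basic_lexicon, max_span=4):
    """Join adjacent blocks whose concatenated text is a lexicon entry.

    blocks is a list of (text, label, pinyin) tuples. The longest span is
    tried first, and the user lexicon is consulted before the basic one.
    """
    out, i = [], 0
    while i < len(blocks):
        merged = None
        # A span of one block needs no joining, so stop at two.
        for span in range(min(max_span, len(blocks) - i), 1, -1):
            window = blocks[i:i + span]
            text = "".join(b[0] for b in window)
            if text in user_lexicon or text in basic_lexicon:
                merged = (text, "QCH", "".join(b[2] for b in window))
                break
        if merged:
            out.append(merged)
            i += span
        else:
            out.append(blocks[i])
            i += 1
    return out


# Hypothetical user-defined entry, chosen only to show a successful match.
user_lexicon = {"这个项目"}
basic_lexicon = set()
blocks = [("这个", "QCH", "zhe4ge4"), ("项目", "NP", "xiang4mu4"), ("的", "DEF", "de0")]
result = second_pass(blocks, user_lexicon, basic_lexicon)
# result == [("这个项目", "QCH", "zhe4ge4xiang4mu4"), ("的", "DEF", "de0")]
```

With empty lexicons the input passes through unchanged, which is consistent with the example sentence above, where the second round produces the same word blocks as the first.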

(5) Output the resulting Chinese character string after word segmentation, phonetic transcription and continuous writing.

Based on the above method for word segmentation, phonetic transcription and continuous writing based on SC grammar, a corresponding device is implemented, as shown in Figure 5. As the figure shows, the device is based on a dictionary base, a continuous-writing corpus statistics base, a continuous-writing rule base, a combination ambiguity lexicon and a word segmentation ambiguity rule base, and comprises a word segmentation module, a part-of-speech tagging and phonetic transcription module, a first combination continuous-writing module and a second combination continuous-writing module. The word segmentation module and the part-of-speech tagging and phonetic transcription module are each connected to the dictionary base; the word segmentation module is also connected to the combination ambiguity lexicon and to the word segmentation ambiguity rule base; the first combination continuous-writing module is connected to the continuous-writing rule base; and the second combination continuous-writing module is connected to the continuous-writing corpus statistics base.
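The way the four modules pass data to one another can be sketched as a simple pipeline. This Python is only an illustration of the data flow: the `Device` class, its field names and the stub modules in the usage example are hypothetical, and each module is modelled as a plain function that would, in a real system, consult the knowledge bases named in the comments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Block = Tuple[str, str, str]  # (text, part-of-speech label, pinyin)


@dataclass
class Device:
    # Uses the dictionary base, combination ambiguity lexicon and ambiguity rule base.
    segment: Callable[[str], List[str]]
    # Uses the dictionary base (context functions choose part of speech and pinyin).
    tag: Callable[[List[str]], List[Block]]
    # Uses the continuous-writing rule base selected by genre.
    first_join: Callable[[List[Block], str], List[Block]]
    # Uses the continuous-writing corpus statistics base.
    second_join: Callable[[List[Block]], List[Block]]

    def run(self, text: str, genre: str) -> List[Block]:
        words = self.segment(text)
        blocks = self.tag(words)
        blocks = self.first_join(blocks, genre)
        return self.second_join(blocks)


# Stub modules, for demonstration only: split into characters and tag nothing.
device = Device(
    segment=lambda text: list(text),
    tag=lambda words: [(w, "X", "") for w in words],
    first_join=lambda blocks, genre: blocks,
    second_join=lambda blocks: blocks,
)
output = device.run("小李", "modern")
# output == [("小", "X", ""), ("李", "X", "")]
```

Keeping each module behind its own function boundary mirrors the point made in the text that every knowledge base can be maintained independently of the others.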

As time goes by, people keep changing how existing words are used and keep coining new words. The dictionary base, combination ambiguity lexicon, continuous-writing corpus statistics base, continuous-writing rule base and word segmentation ambiguity rule base can therefore all be maintained, so that their content is continually revised and improved as the language develops, which improves the accuracy of word segmentation.

Experimental results

The method for word segmentation, phonetic transcription and continuous writing based on SC grammar effectively solves the problems of Chinese word segmentation ambiguity, continuous writing and correct phonetic transcription of polyphonic characters in Chinese-to-Braille conversion, and achieves efficient, intelligent translation from Chinese to Braille. The translation accuracy is higher than 90%.

The invention uses artificial intelligence technology and integrates several analysis and processing strategies, such as rules and examples, to perform word segmentation, phonetic transcription and continuous writing on Chinese sentences efficiently and accurately, improving the correctness of Chinese-to-Braille translation. The invention also provides a rule representation language based on SC grammar that is extensible, efficient to express and easy for people to use. This rule representation is general and can be extended to other natural language processing problems.

Claims

1. A method for word segmentation, phonetic transcription and continuous writing based on SC grammar, characterized in that it is based on a dictionary base, a combination ambiguity lexicon, a word segmentation ambiguity rule base, a continuous-writing rule base and a continuous-writing corpus statistics base, and comprises the following steps:

Step 1: receive the Chinese character string to be segmented, transcribed and continuously written, together with the genre of the article;

Step 2: segment the Chinese character string based on the dictionary base, and perform part-of-speech tagging and phonetic transcription on the resulting word blocks;

Step 3: according to the genre of the article, call the corresponding continuous-writing rule base, and combine the word blocks from step 2 by continuous writing based on the Braille word segmentation and continuous-writing rules in that rule base;

Step 4: based on the continuous-writing corpus statistics base, perform a second round of combination and continuous writing on the combined word blocks;

Step 5: output the resulting Chinese character string after word segmentation, phonetic transcription and continuous writing.

2. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 1, characterized in that the dictionary base is used for Chinese word segmentation, part-of-speech tagging and phonetic transcription, and comprises Chinese words, grammatical and semantic attribute identifiers, context discrimination functions and the pinyin of the words.

3. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 1, characterized in that the word segmentation based on the dictionary base is completed by the following process:

a. With reference to the dictionary base, split the sentence using the forward maximum matching algorithm to obtain word blocks;

b. Judge crossing ambiguity according to the crossing features of the word blocks;

c. Judge combination ambiguity of the word blocks based on the combination ambiguity lexicon;

d. Eliminate the ambiguity by reasoning according to the ambiguity rules;

e. Output the word segmentation result.

4. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 3, characterized in that the combination ambiguity lexicon is used to identify word blocks that have combination ambiguity, and contains the words that have combination ambiguity.

5. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to any one of claims 3 to 4, characterized in that the word segmentation ambiguity rule base is used to eliminate ambiguous word blocks by reasoning and obtain the correct word segmentation result, and comprises an ambiguous word block part, a condition function and a correct word segmentation operation; eliminating ambiguity by reasoning according to the ambiguity rules is completed by the following process:

a. For the current word block carrying an ambiguity flag, match the ambiguous word block part of the ambiguity rules;

b. If the match succeeds, check the condition function;

c. If the condition check is satisfied, perform the correct word segmentation operation;

d. Output the correct word segmentation result.

6. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 1, characterized in that the part-of-speech tagging and phonetic transcription of the word blocks obtained by word segmentation is completed by the following process:

a. For the current word block, retrieve the dictionary information of the word block from the dictionary base;

b. Check the context functions entry by entry;

c. If the context check is satisfied, take the part of speech and pinyin of that entry.

7. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 1, characterized in that the continuous-writing rule base is used to combine, by continuous writing, the word blocks obtained after word segmentation and tagging, and comprises a rule word block part, a condition function and a continuous-writing operation; according to the genre of the article, the continuous-writing rule base is subdivided into a classical Chinese rule base and a modern Chinese rule base; the combination and continuous writing of word blocks based on the continuous-writing rules is completed by the following process:

a. For the current several word blocks, match the word block part of the continuous-writing rules;

b. If the match succeeds, check the condition function;

c. If the condition check is satisfied, perform the corresponding continuous-writing operation;

d. Output the word segmentation result after continuous writing.

8. The method for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 1, characterized in that the continuous-writing corpus statistics base is used to perform a second round of combination and continuous writing on the word blocks already combined according to the continuous-writing rules, and collects the words that need to be written together; the continuous-writing corpus statistics base is subdivided into a basic lexicon and a user lexicon, wherein the basic lexicon contains general words that are written together and the user lexicon contains user-defined word blocks that need to be written together; the second round of combination and continuous writing based on the continuous-writing corpus statistics base is completed by the following process:

a. For the current word block, match against the user lexicon first and then the basic lexicon;

b. If the match succeeds, perform the combination by continuous writing;

c. Output the resulting word blocks after continuous writing.

9. A device for word segmentation, phonetic transcription and continuous writing based on SC grammar, characterized in that it is based on a dictionary base, a combination ambiguity lexicon, a continuous-writing corpus statistics base, a continuous-writing rule base and a word segmentation ambiguity rule base, and comprises, connected in sequence, a word segmentation module, a part-of-speech tagging and phonetic transcription module, a first combination continuous-writing module and a second combination continuous-writing module; the word segmentation module and the part-of-speech tagging and phonetic transcription module are each connected to the dictionary base, the word segmentation module is also connected to the combination ambiguity lexicon and to the word segmentation ambiguity rule base, the first combination continuous-writing module is connected to the continuous-writing rule base, and the second combination continuous-writing module is connected to the continuous-writing corpus statistics base;

The word segmentation module is used to segment the input Chinese character string into independent word blocks based on the dictionary base; during segmentation it judges, based on crossing ambiguity features and the combination ambiguity lexicon, whether the resulting word blocks are ambiguous, and resolves ambiguous word blocks using the word segmentation ambiguity rule base to obtain the correct word blocks;

The part-of-speech tagging and phonetic transcription module is used to tag and transcribe the word blocks obtained by the word segmentation module, based on the dictionary base and by checking the context functions, so as to obtain the correct part of speech and pinyin of each word block;

The first combination continuous-writing module is used to combine the part-of-speech-tagged word blocks by continuous writing; based on the continuous-writing rule base, it checks the condition functions to obtain the combined word blocks;

The second combination continuous-writing module is used to look up and match the word blocks from the first combination in the continuous-writing corpus statistics base, obtain the word blocks after the second combination, and output the word blocks with part-of-speech tags and phonetic transcription.

10. The device for word segmentation, phonetic transcription and continuous writing based on SC grammar according to claim 9, characterized in that the dictionary base, combination ambiguity lexicon, continuous-writing corpus statistics base, continuous-writing rule base and word segmentation ambiguity rule base can all be maintained, so that their content is continually revised and improved as the language develops, which improves the accuracy of word segmentation.