CN106250362A

CN106250362A - Text segmentation device and text segmenting method

Info

Publication number: CN106250362A
Application number: CN201610111581.5A
Authority: CN
Inventors: 大仓清司; 片冈正弘; 出内将夫
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-06-05
Filing date: 2016-02-29
Publication date: 2016-12-21
Also published as: KR20160143491A; JP2017004127A; KR101841824B1

Abstract

The invention relates to a text segmentation device and a text segmentation method that can efficiently segment text at appropriate positions. A computer searches character string division information, which associates registered character strings that are divided into a plurality of words with a number of distinguished words, for a first character string included in the text (step 201). When the first character string corresponds to a registered character string, the computer divides a second character string, which is the part of the first character string containing as many words as the number of distinguished words associated with that registered character string, into that number of words (step 202).

Description

Text segmentation device and text segmentation method

Technical field

The invention relates to a text segmentation device and a text segmentation method.

Background art

In recent years, the amount of information on the Internet has grown dramatically and businesses that use big data have increased, so efficient processing of big data is desired. For documents in languages such as Japanese, Chinese, or Korean, in which words are not separated by delimiter characters such as spaces, morphological analysis is performed in order to count the frequency of occurrence of words.

Morphological analysis is a process that divides text into morphemes and assigns part-of-speech information to each morpheme. The morphemes obtained by morphological analysis are sometimes treated as words. Performing such morphological analysis makes it possible to determine the relationships between words in a document and the part of speech of each word, and to divide the text of the document into words. However, because the processing load of morphological analysis is large, processing a large amount of text takes a long time.

A word segmentation device that divides a sentence into two or more words at high speed is also known (see, for example, Patent Document 1). From a word segmentation dictionary that can store one or more words and one or more pieces of segmentation information, each being a pair of a word and the two or more segmented words that result from segmenting that word, this device acquires the word that matches the longest character string starting at the sentence pointer, which initially points to the head of the received sentence. When the acquired word has two or more corresponding segmented words, the device performs segmented-word acquisition processing, which acquires the two or more segmented words in place of the matched word. The device moves the sentence pointer to the character following the matched word and repeats the segmented-word acquisition processing until it reaches the word containing the last character of the sentence, thereby obtaining a first segmentation result, which is the set of two or more words obtained by segmenting the sentence.

A morphological analysis system capable of performing morphological analysis accurately and quickly is also known (see, for example, Patent Document 2). In the kanji-string morpheme N-character registration dictionary of this system, when a morpheme is one that, if some other character string follows and is joined to it, becomes two or more morphemes separated within the morpheme's character string, information on that separation position is recorded in association with the morpheme. The kanji-string morphological analysis program obtains a first morpheme candidate by the longest-match method and, if separation position information is recorded for it, tries to obtain a second morpheme candidate by applying the longest-match method again from that position. The hiragana morpheme connection list dictionary records in advance morphemes formed by connecting a plurality of hiragana morphemes, taking the grammatical correctness of the connections into account. The hiragana-string morphological analysis program obtains morphemes by comparing the hiragana morpheme connection list dictionary with the character string data.

Patent Document 1: Japanese Laid-open Patent Publication No. 2014-106707

Patent Document 2: Japanese Laid-open Patent Publication No. 2002-32366

In the conventional word segmentation device and morphological analysis system described above, the segmentation position is determined from information about only a part of the text, so the text is not necessarily segmented at appropriate positions.

This problem is not limited to segmenting text in order to count word frequencies; it also arises when text is segmented for other kinds of text analysis.

Summary of the invention

In one aspect, an object of the present invention is to segment text efficiently at appropriate positions.

In one aspect, a text segmentation program causes a computer to execute the following processing.

(1) The computer searches character string division information, which associates registered character strings that are divided into a plurality of words with a number of distinguished words, for a first character string included in the text.

(2) When the first character string corresponds to a registered character string, the computer divides a second character string, which is the part of the first character string containing as many words as the number of distinguished words associated with the registered character string, into that number of words.

According to the embodiments, text can be segmented efficiently at appropriate positions.

Brief description of the drawings

FIG. 1 is a functional configuration diagram of a text segmentation device.

FIG. 2 is a flowchart of text segmentation processing.

FIG. 3 is a diagram showing character string division information.

FIG. 4 is a flowchart showing a specific example of text segmentation processing.

FIG. 5 is a functional configuration diagram of a text segmentation device that performs registration processing for the number of distinguished words.

FIG. 6 is a flowchart of the registration processing for the number of distinguished words.

FIG. 7 is a configuration diagram of an information processing device.

Description of reference numerals

101…text segmentation device; 111…storage unit; 112…segmentation unit; 121…character string division information; 501…distinguished word number determination unit; 701…CPU; 702…memory; 703…input device; 704…output device; 705…auxiliary storage device; 706…medium drive device; 707…network connection device; 708…bus; 709…removable recording medium.

Detailed description

Embodiments are described in detail below with reference to the drawings.

For example, when the word segmentation device of Patent Document 1 is used to segment the text “そうはいってもっと進んでください”, the text is segmented by longest-match search of the word segmentation dictionary. As a result, although the correct segmentation is “そう/はいって/もっと/進んで/ください”, an undesirable result such as “そう/は/いっても/っと進んでください” is sometimes obtained.

The cause is considered to be that, although the segmentation position can differ depending on the word that immediately follows a given word, the segmentation position is decided simply by longest-match search, without examining context wider than a single word.

In addition, when the morphological analysis system of Patent Document 2 is used to segment text consisting of the compound word “自然言語処理技術”, longest-match search is performed again from a position N characters back, and if a word exists as a second morpheme candidate, that separation position is adopted.

Therefore, even when a correct segmentation result such as “自然言語処理/技術” has been obtained from the first morpheme candidate, an incorrect result such as “自然/言語処理技術” is sometimes adopted based on the second morpheme candidate. Similarly, the correct segmentation of “原子力学会” is “原子力/学会”, but an incorrect result such as “原子/力学/会” is sometimes adopted based on the second morpheme candidate.

The cause is considered to be that the segmentation position is decided from local information, without examining the context of the compound word.

Thus, when the segmentation position is decided from information about only a part of the text, the context of the text as a whole is not examined, so an incorrect segmentation result is sometimes produced. On the other hand, because there are infinitely many possible sentences in a language such as Japanese, the segmentation results of all sentences cannot be registered in a dictionary.

FIG. 1 shows an example of the functional configuration of a text segmentation device according to an embodiment. The text segmentation device 101 in FIG. 1 includes a storage unit 111 and a segmentation unit 112.

The storage unit 111 stores character string division information 121, which associates registered character strings that are divided into a plurality of words with a number of distinguished words. The segmentation unit 112 performs text segmentation processing by referring to the character string division information 121 stored in the storage unit 111.

FIG. 2 is a flowchart showing an example of the text segmentation processing performed by the text segmentation device 101 in FIG. 1. First, the segmentation unit 112 searches the character string division information 121 for a first character string included in the text (step 201). Then, when the first character string corresponds to a registered character string, the segmentation unit 112 divides a second character string, which is the part of the first character string containing as many words as the number of distinguished words associated with the registered character string, into that number of words (step 202).

With such a text segmentation device 101, text can be segmented efficiently at appropriate positions.

The text segmentation device 101 can be applied to text analysis that processes large amounts of text. For example, it may be applied to statistical processing that counts the frequency of occurrence of each word in a text.

FIG. 3 shows an example of the character string division information 121 for Japanese text. The character string division information 121 in FIG. 3 is an n-gram table in which word-level n-grams are registered as character strings, and it corresponds to the dictionary used for text segmentation processing. Each entry of the n-gram table includes identification information (ID) of the entry, the n-gram, the number of distinguished words, the character string length, the character type, and the positions of particles and auxiliary verbs.

An n-gram is a character string consisting of n words, and the number of distinguished words is the number of those n words that are adopted as the segmentation result. The number of distinguished words is an integer from 1 to n inclusive, but a value smaller than n is preferably used. The character string length is the number of characters in the n-gram, and the character type indicates the kind of characters in each word of the n-gram. Character type “1” indicates that all characters in the word are hiragana or katakana, and character type “0” indicates that the word contains other characters. The positions of particles and auxiliary verbs indicate where particles and auxiliary verbs occur in the n-gram.

For example, the character string “そうはいっても” of ID “1” is a 4-gram consisting of “そう”, “は”, “いって”, and “も”; its number of distinguished words is 1 and its character string length is 7. The character type “1111” indicates that each of the four words consists of hiragana or katakana, and the particle and auxiliary verb positions “2, 4” indicate that the second and fourth words from the head of the 4-gram are particles or auxiliary verbs.

The character string “そうはいはいと人” of ID “5” is a 4-gram consisting of “そう”, “はいはい”, “と”, and “人”; its number of distinguished words is 3 and its character string length is 9. The character type “1110” indicates that the first to third of the four words consist of hiragana or katakana and that the fourth word contains other characters, and the particle and auxiliary verb position “3” indicates that the third word is a particle or auxiliary verb.

In addition, a particle and auxiliary verb position of “-1”, as in the entry of ID “3”, indicates that the n-gram contains no particle or auxiliary verb.
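One way to picture an entry of the n-gram table is the following minimal sketch. The class name `NGramEntry`, its field names, and the helper `is_kana` are hypothetical and are not taken from the embodiment; the sketch also assumes that "hiragana or katakana" can be approximated by the Unicode Hiragana and Katakana blocks. The character type is derived from the words, while the particle and auxiliary verb positions are supplied by hand because determining them requires part-of-speech information.

```python
from dataclasses import dataclass


def is_kana(ch: str) -> bool:
    """True for characters in the Unicode Hiragana or Katakana blocks (assumption)."""
    return "\u3040" <= ch <= "\u30ff"


@dataclass(frozen=True)
class NGramEntry:
    """Hypothetical model of one entry of the n-gram table."""
    entry_id: int
    words: tuple[str, ...]                    # the n words of the n-gram, in order
    distinguished: int                        # number of distinguished words, 1..n
    particle_positions: tuple[int, ...] = ()  # 1-based; empty stands for "-1"

    @property
    def text(self) -> str:
        """The registered character string (the words joined together)."""
        return "".join(self.words)

    @property
    def char_type(self) -> str:
        """One digit per word: '1' if the word is all kana, otherwise '0'."""
        return "".join(
            "1" if all(is_kana(c) for c in w) else "0" for w in self.words
        )


# The entries described for ID 1 and ID 5.
entry1 = NGramEntry(1, ("そう", "は", "いって", "も"), 1, (2, 4))
entry5 = NGramEntry(5, ("そう", "はいはい", "と", "人"), 3, (3,))
```

With these definitions, `entry1.text` is the 7-character string “そうはいっても” and `entry1.char_type` is “1111”, while `entry5.char_type` is “1110” because “人” is not kana.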

FIG. 3 shows only entries for specific 2-gram to 4-gram character strings, but the n-gram table also contains entries for other 2-gram to 4-gram character strings that are not shown, as well as 5-gram to 10-gram entries. N-grams with n of 11 or more may also be registered. Registering longer character strings in the n-gram table by increasing n makes it possible to examine context wider than a single word.

The n-gram table can be generated automatically, for example, by analyzing the text of many documents with high-accuracy morphological analysis. Even the same character string may be registered as different n-grams depending on the field in which it is used. For example, the character string “原子力学” can be registered as the two 2-grams “原子力-学” and “原子-力学”. The method of determining the number of distinguished words for each entry is described later.

FIG. 4 is a flowchart showing a specific example of the text segmentation processing in FIG. 2. First, the segmentation unit 112 takes the head of the text to be segmented as the start position and, by longest-match search, searches the character strings registered in the character string division information 121 for a character string that begins at the start position in the text (step 401). The segmentation unit 112 then checks whether the character string beginning at the start position matches the character string of any entry in the character string division information 121 (step 402).

When the character string beginning at the start position does not match the character string of any entry (step 402: No), the segmentation unit 112 advances the start position by one character toward the end of the text (step 406) and repeats the processing from step 401.

When the character string beginning at the start position matches the character string of any entry (step 402: Yes), the segmentation unit 112 refers to the number of distinguished words of the entry corresponding to the longest of the matching character strings (step 403). The segmentation unit 112 then divides the portion of the text that starts at the start position and covers that number of words into the words registered in the entry.

Next, the segmentation unit 112 advances the start position by the number of characters in the portion covered by the number of distinguished words (step 404) and checks whether the end of the longest matching character string is the end of the text (step 405). When the end of the longest character string is not the end of the text (step 405: No), the segmentation unit 112 repeats the processing from step 401.

When the end of the longest character string is the end of the text (step 405: Yes), the segmentation unit 112 divides the character string from the start position onward into the words registered in the entry corresponding to the longest character string, and ends the processing.
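The flow of steps 401 to 406 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `segment` and the table layout (a list of pairs of a word tuple and a number of distinguished words) are hypothetical, the table is scanned linearly rather than through an index, and characters left unmatched at the very end of the text are emitted as a single word, which the description does not specify. The last table entry, “進んで/ください”, is also an assumption, since the description does not list the entry used for the final part of the text.

```python
def segment(text, table):
    """Segment text using (words, distinguished_count) entries (hypothetical layout)."""
    result = []
    start = 0    # current start position
    pending = 0  # first character not yet assigned to any word
    while start < len(text):
        # Steps 401-402: registered strings that match at the start position.
        matches = [(w, k) for w, k in table if text.startswith("".join(w), start)]
        if not matches:
            start += 1  # step 406: advance by one character
            continue
        # Step 403: use the longest matching registered string.
        words, count = max(matches, key=lambda m: len("".join(m[0])))
        if pending < start:
            # Characters skipped before the match are adopted as one word.
            result.append(text[pending:start])
        if start + len("".join(words)) == len(text):
            # Step 405 (Yes): the match reaches the end of the text,
            # so every word of the entry is emitted.
            result.extend(words)
            return result
        committed = words[:count]  # only the distinguished words are fixed
        result.extend(committed)
        start += sum(len(w) for w in committed)  # step 404
        pending = start
    if pending < len(text):
        result.append(text[pending:])  # assumption: trailing remainder is one word
    return result


table = [
    (("そう", "は", "いって", "も"), 1),    # ID 1
    (("そう", "はいって", "もっと"), 1),    # ID 8
    (("はいって", "もっと"), 1),            # ID 18
    (("もっと", "進んで"), 1),              # ID 20
    (("進んで", "ください"), 1),            # assumed entry, not given in the text
]
words = segment("そうはいってもっと進んでください", table)
```

Because only the first word of each longest match is fixed, the 7-character match of ID 1 loses to the 9-character match of ID 8, and `words` becomes the five words “そう”, “はいって”, “もっと”, “進んで”, and “ください”.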

For example, when the text to be segmented is “そうはいってもっと進んでください”, taking the leading two characters “そう” as the search key and searching the n-gram table of FIG. 3 by prefix match extracts the nine entries ID “1” to ID “9”.

Among these entries, those with the longest character string length, “12”, are ID “3” and ID “4”. However, neither the character string “そうはいってもっとむこう” of ID “3” nor the character string “そうはいってずっとむこう” of ID “4” matches the text to be segmented.

The entries with the second longest character string length, “9”, are ID “5”, ID “6”, and ID “8”. Of these, only the character string “そうはいってもっと” of ID “8” matches the text to be segmented, so the text is segmented based on this entry. In this case the number of distinguished words of ID “8” is “1”, so the text is divided at the position between the first word “そう” and the second word “はいって” of the three words in the character string, and the start position moves to that division position.

Next, taking the leading two characters “はい” of the remaining “はいってもっと進んでください” as the search key and searching the n-gram table by prefix match extracts the ten entries ID “10” to ID “19”.

Among these entries, those with the longest character string length, “11”, are ID “11” and ID “12”. However, neither the character string “はいってもっとむこう” of ID “11” nor the character string “はいってずっとむこう” of ID “12” matches the remaining text.

The entries with the second longest character string length, “10”, are ID “16” and ID “17”. However, neither the character string “はいってもっとむこう” of ID “16” nor the character string “はいってずっとむこう” of ID “17” matches the remaining text.

The entries with the third longest character string length, “8”, are ID “14” and ID “19”. However, neither the character string “はいはいと簡単” of ID “14” nor the character string “はいってください” of ID “19” matches the remaining text.

The entries with the fourth longest character string length, “7”, are ID “10”, ID “13”, and ID “18”. Of these, only the character string “はいってもっと” of ID “18” matches the remaining text, so the remaining text is segmented based on this entry. In this case the number of distinguished words of ID “18” is “1”, so the remaining text is divided at the position between the first word “はいって” and the second word “もっと” of the two words in the character string, and the start position moves to that division position.

Next, taking the leading two characters “もっ” of the remaining “もっと進んでください” as the search key and searching the n-gram table by prefix match extracts the two entries ID “20” and ID “21”.

Both entries have a character string length of “6”, but only the character string “もっと進んで” of ID “20” matches the remaining text, so the remaining text is segmented based on this entry. In this case the number of distinguished words of ID “20” is “1”, so the remaining text is divided at the position between the first word “もっと” and the second word “進んで” of the two words in the character string, and the start position moves to that division position. The same segmentation processing is then repeated for the remaining “進んでください”.

With this text segmentation processing, only a part of the character string matched by longest-match search is segmented rather than all of it, so the remainder can be included in the target of the next longest-match search. By registering longer character strings in the character string division information 121, the segmentation result can be fixed gradually while a plurality of registered character strings are compared on the basis of a wider range of context.

For example, it is also possible to fix the segmentation result only for a portion that is judged likely to be correct from the context, such as a portion shared among a plurality of registered character strings, and to leave the segmentation result undetermined for the other portions. In the example of “そうはいってもっと進んでください”, the leading “そう” corresponds to the portion judged likely to be correct, and the portion from “はいって” onward corresponds to the other portions.

The text segmentation processing of FIG. 4 can also segment a character string containing an unknown word, that is, a word that does not exist as a word in the character string division information 121, into a plurality of words.

For example, suppose the text “XY自動車交通(株)の今期の業績は…” is to be segmented and “XY” is an unknown word. First, the leading “X” is set as the start position. Since no character string beginning with “X” matches any entry of the character string division information 121, the start position is advanced by one character and “Y” is set as the start position. Since no character string beginning with “Y” matches any entry either, the start position is advanced by one more character and “自” is set as the start position.

Here, if the 4-gram “自動車-交通-(株)-の” is registered in the character string division information 121 and its number of distinguished words is “3”, the character string “自動車交通(株)” is divided into the three words “自動車”, “交通”, and “(株)”. In addition, the character string “XY” preceding “自動車交通(株)” is adopted as a word. As a result, “XY自動車交通(株)” can be divided into four words, as in “XY/自動車/交通/(株)”.
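The handling of an unknown leading string can be isolated in a small sketch. The function name `split_unknown_prefix` and the plain list of registered strings are hypothetical; the sketch shows only the step of advancing the start position one character at a time until some registered character string matches, after which the skipped characters are treated as a single word.

```python
def split_unknown_prefix(text, registered):
    """Return (skipped_prefix, rest), where rest begins at the first position
    from which some registered character string matches (hypothetical helper)."""
    for start in range(len(text)):
        if any(text.startswith(s, start) for s in registered):
            return text[:start], text[start:]
    return text, ""  # no registered string matches anywhere in the text


registered = ["自動車交通(株)の"]
prefix, rest = split_unknown_prefix("XY自動車交通(株)の今期の業績は", registered)
```

Here `prefix` is “XY”, which is adopted as a word, and `rest` begins with the registered 4-gram, which is then divided according to its number of distinguished words.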

In addition, even when a word exists as a word in the character string division information 121 but a character string containing that word is not registered, such an unregistered character string can be segmented into a plurality of words.

For example, suppose the text “そんなスリッパの…” is to be segmented and “そんなスリッパ” is an unregistered character string. First, the leading “そ” is set as the start position. Since no character string beginning with “そ” matches any entry of the character string division information 121, the start position is advanced by one character and “ん” is set as the start position.

Since no character string beginning with “ん” matches any entry of the character string division information 121 either, the start position is advanced by one character and “な” is set as the start position. Likewise, no character string beginning with “な” matches any entry, so the start position is advanced by one more character and “ス” is set as the start position.

Here, if the 2-gram “スリッパ-の” is registered in the character string division information 121 and its number of distinguished words is “1”, the character string “スリッパの” is divided into the two words “スリッパ” and “の”. In addition, the character string “そんな” preceding “スリッパの” is adopted as a word. As a result, “そんなスリッパ” can be divided into two words, as in “そんな/スリッパ”.

Thus, with the text segmentation processing of FIG. 4, even a character string that is not registered in the character string division information 121 can be segmented appropriately. It is therefore unnecessary to register character strings covering every word in the character string division information 121; it suffices to register only character strings that occur with statistically high frequency. This suppresses growth of the storage area needed to store the character string division information 121.

FIG. 5 shows an example of the functional configuration of a text segmentation device that performs registration processing for the number of distinguished words. The text segmentation device 101 in FIG. 5 has a configuration in which a distinguished word number determination unit 501 is added to the text segmentation device 101 in FIG. 1. The distinguished word number determination unit 501 determines the number of distinguished words based on attributes of the character string of each entry in the character string division information 121, and registers the determined number in the character string division information 121.

FIG. 6 is a flowchart showing an example of the registration processing for the number of distinguished words performed by the distinguished word number determination unit 501. First, the distinguished word number determination unit 501 takes the character string registered in one entry of the character string division information 121 as the processing target, extracts the attributes of that character string (step 601), and determines the number of distinguished words for the target character string based on the extracted attributes (step 602).

Next, the distinguished word number determination unit 501 checks whether any other entry contains the same character string as the target character string (step 603). When the same character string exists (step 603: Yes), the distinguished word number determination unit 501 changes the determined number of distinguished words to the number of words shared by the plurality of identical character strings (step 604). The distinguished word number determination unit 501 then registers the changed number of distinguished words in the entry of the target character string (step 605).

另一方面，在不存在相同的字符串的情况下(步骤603：否)，区分单词数决定部501将决定的区分单词数登记于处理对象的字符串的项(步骤605)。On the other hand, when the same character string does not exist (step 603: No), the distinguishing word count determination unit 501 registers the determined distinguishing word count in the item of the character string to be processed (step 605).

接下来，区分单词数决定部501检查是否处理了字符串分割信息121的全部项(步骤606)。在剩余未处理的项的情况下(步骤606：否)，区分单词数决定部501将下一个项中所登记的字符串作为处理对象，反复步骤601以后的处理。而且，在处理了全部项的情况下(步骤606：是)，区分单词数决定部501结束处理。Next, the distinguishable word number determination unit 501 checks whether all the items of the character string division information 121 have been processed (step 606 ). When unprocessed items remain (step 606: No), the number-of-differentiated word determination unit 501 makes the character string registered in the next item a processing target, and repeats the processing from step 601 onward. Then, when all the terms have been processed (step 606: Yes), the number-of-differentiated word determination unit 501 ends the processing.
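The loop of FIG. 6 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the entry layout (a dict with a `words` list), the function name `register_counts`, and the two callables `determine` and `count_shared` are all hypothetical. The use of `min` assumes that a shared-word count larger than the attribute-based count does not raise the registered value.

```python
def register_counts(entries, determine, count_shared):
    """Attach a distinguished word count "z" to every entry (steps 601-606).

    entries      -- list of dicts, each with a "words" list (hypothetical layout)
    determine    -- callable(entry) -> int, standing in for steps 601-602
    count_shared -- callable(words_a, words_b) -> int, standing in for step 604
    """
    for entry in entries:                       # repeated until step 606 says YES
        z = determine(entry)                    # steps 601-602
        text = "".join(entry["words"])
        for other in entries:                   # step 603: same string elsewhere?
            if other is not entry and "".join(other["words"]) == text:
                # step 604: never keep more words than the entries agree on
                z = min(z, count_shared(entry["words"], other["words"]))
        entry["z"] = z                          # step 605
    return entries
```

Passing the two decisions in as callables keeps the control flow of the flowchart separate from the rules used in steps 602 and 604, which the description treats as replaceable.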

The distinguished word count determination unit 501 may perform the registration processing of FIG. 6 before the text segmentation processing of FIG. 4 starts, or may perform it in parallel with the text segmentation processing.

The attributes of the target character string extracted in step 601 of FIG. 6 can include at least one of: the number of characters contained in part or all of the character string, the character types of the words contained in the character string, and the position of a predetermined part of speech within the character string. The character type of a word indicates, for example, whether the word is written in hiragana or katakana or in other characters. Particles and auxiliary verbs, for example, are used as the predetermined parts of speech.
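As a rough illustration of the character type attribute, the sketch below labels a word "1" when it is written entirely in hiragana or katakana and "0" otherwise, matching the way the two labels are used in the worked examples of this description. The function name `char_type` and the rule that a mixed word counts as "0" are assumptions. The code point ranges are the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks.

```python
# Unicode blocks: Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF
KANA_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF))


def char_type(word):
    """Return "1" if every character of word is hiragana or katakana, else "0".

    Treating a mixed word (for example kanji plus kana) as "0" is an
    assumption; the description does not specify that case.
    """
    if not word:
        return "0"
    is_kana = all(
        any(lo <= ord(ch) <= hi for lo, hi in KANA_RANGES) for ch in word
    )
    return "1" if is_kana else "0"


print(char_type("も"), char_type("人"))
```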

For a character string that includes words written in hiragana or katakana, the division position is often not uniquely determined. The number of distinguished words for such a character string is therefore preferably set smaller than that of a character string made up of words in other character types.

Similarly, for a character string that includes a particle or an auxiliary verb, the division position is often not uniquely determined. The number of distinguished words for such a character string is therefore preferably set smaller than that of a character string containing no particles or auxiliary verbs.

In step 602, the distinguished word count determination unit 501 can determine the number of distinguished words z of the target character string by, for example, the following procedure.

First, if the n-th word from the beginning of the character string (the last word) is a punctuation mark ("。" or "，"), the distinguished word count determination unit 501 sets z = n. If the n-th word is not a punctuation mark, it sets z = n-1.

Next, the distinguished word count determination unit 501 checks the character type of the n-th word and the part of speech of the (n-1)-th word.

When the (n-1)-th word is a particle or an auxiliary verb, the character string up to that word corresponds to one phrase (a sequence of connected words), and a phrase boundary may exist between the (n-1)-th word and the n-th word. However, when the following n-th word is written in hiragana or katakana, a boundary does not necessarily exist between the two words. Conversely, when the n-th word is written in characters other than hiragana and katakana, a boundary between the (n-1)-th word and the n-th word is highly likely.

Therefore, when the character type of the n-th word is "0" (characters other than hiragana and katakana) and the (n-1)-th word is a particle or an auxiliary verb, the distinguished word count determination unit 501 does not change z.

On the other hand, when the character type of the n-th word is "1" (hiragana or katakana), or when the (n-1)-th word is a part of speech other than a particle or an auxiliary verb, the distinguished word count determination unit 501 reduces z by the following procedure.

First, the distinguished word count determination unit 501 takes the number of characters k in the range from the beginning of the character string to the z-th word and checks whether k < z*3. When z = n, k is the total number of characters in the target character string. When z = n-1, k is the number of characters contained in the first through (n-1)-th words of the target character string.

When a character string contains few characters, the division position is often not uniquely determined, so z is preferably made smaller. Therefore, when k < z*3, the distinguished word count determination unit 501 sets z = z-1.

Even when the character string does not contain few characters, the division position is often not uniquely determined if the character type is hiragana or katakana. Therefore, when k ≥ z*3 and the character types of the first through (n-1)-th words are all "1", the distinguished word count determination unit 501 also sets z = z-1.

The distinguished word count determination unit 501 may compare k with a threshold other than z*3, and may set z to an even smaller value instead of z = z-1.
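The step 602 procedure above can be sketched as a single function. This is a minimal sketch under stated assumptions: the function name `determine_z`, the argument layout (parallel lists of words and character types plus a set of 0-based indices of particles and auxiliary verbs), and the requirement of at least two words are hypothetical. The threshold 3 and the reduction by one follow the procedure as described, and the punctuation set follows the two marks named in the text.

```python
PUNCTUATION = {"。", "，"}  # the two marks named in the description


def determine_z(words, char_types, particle_positions):
    """Return the number of distinguished words z for one registered string.

    words              -- the n words of the registered string, in order
    char_types         -- "1" (hiragana/katakana) or "0" (other), one per word
    particle_positions -- 0-based indices of particles and auxiliary verbs
                          (hypothetical encoding of the part-of-speech attribute)
    """
    n = len(words)
    if n < 2 or len(char_types) != n:
        raise ValueError("expected at least two words with one type each")

    # z = n when the last word is punctuation, otherwise z = n - 1
    z = n if words[-1] in PUNCTUATION else n - 1

    # A non-kana last word after a particle/auxiliary verb: keep z as is
    if char_types[-1] == "0" and (n - 2) in particle_positions:
        return z

    # Otherwise reduce z when the first z words are short ...
    k = sum(len(w) for w in words[:z])
    if k < z * 3:
        return z - 1
    # ... or when words 1..(n-1) are all hiragana/katakana
    if all(t == "1" for t in char_types[: n - 1]):
        return z - 1
    return z


# "そう-は-いって-も": k = 6 is below z*3 = 9, so z drops from 3 to 2
print(determine_z(["そう", "は", "いって", "も"], ["1"] * 4, {1, 3}))
```

Character counts here are Python string lengths (code points), which matches the per-character counting used in the description for kana and kanji.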

By determining the number of distinguished words from the attributes of the character strings registered in the character string division information 121 in this way, the portion of each character string whose division result is treated as confirmed is set according to the context of that character string. The text can thus be divided with high accuracy without lowering the processing speed.

In step 604, the distinguished word count determination unit 501 compares, from the beginning, character strings that are registered in multiple entries as the same character string but with different division positions, and sets z to the number of words they share. However, when the number of shared words is greater than or equal to the number of distinguished words determined in step 602, the distinguished word count determination unit 501 may leave the determined number unchanged.

When the same character string is registered with different division positions, setting the number of distinguished words to the number of words shared by the division results reduces the risk of dividing the character string formed by the remaining words at a wrong position.
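The comparison in step 604 amounts to counting how many leading words two divisions of the same string have in common. The sketch below illustrates this; the function names `shared_leading_words` and `adjust_z` are hypothetical, and the second division used in the usage line ("そう-はいって-も") is an assumed alternative division chosen only for illustration.

```python
def shared_leading_words(words_a, words_b):
    """Count the words two divisions of the same string share from the front."""
    count = 0
    for a, b in zip(words_a, words_b):
        if a != b:
            break
        count += 1
    return count


def adjust_z(z, words, other_divisions):
    """Apply step 604: lower z to the shared-word count, never raise it."""
    for other in other_divisions:
        z = min(z, shared_leading_words(words, other))
    return z


# Only "そう" is shared, so a z of 2 from step 602 is lowered to 1
print(adjust_z(2, ["そう", "は", "いって", "も"], [["そう", "はいって", "も"]]))
```

Taking the minimum reflects the remark that the value from step 602 may be kept when the shared-word count is not smaller.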

For example, when "そうはいっても" of ID "1" in FIG. 3 is the target character string, n = 4 and the fourth word "も" is not a punctuation mark, so z = n-1 = 3 is set. Next, because the character type of the fourth word "も" is "1", the condition k < z*3 is checked. In this case k = 2+1+3 = 6, so k < z*3 = 9 holds and z = z-1 = 2 is determined (step 602).

Next, "そうはいっても" of ID "9" is the same character string, and the only word shared by "そう-は-いって-も" of ID "1" and "そうはいっても" of ID "9" is "そう", so the value is changed to z = 1 (step 604).

When "そうはいってた" of ID "2" is the target character string, z = 2 is determined in the same way (step 602). If the 3-gram "そう-はいって-た" is assumed to be registered in another entry, the value is changed to z = 1 (step 604).

When "そうはいってもっとむこう" of ID "3" is the target character string, n = 4 and the fourth word "むこう" is not a punctuation mark, so z = n-1 = 3 is set. Next, because the character type of the fourth word "むこう" is "1", the condition k < z*3 is checked. In this case k = 2+4+3 = 9, so k = z*3 and the condition does not hold. However, the character types of the first word "そう", the second word "はいって", and the third word "もっと" are all "1", so z = z-1 = 2 is determined (step 602). No other entry contains the same character string as "そうはいってもっとむこう", so z = 2 is final.

When "そうはいはいと人" of ID "5" is the target character string, n = 4 and the fourth word "人" is not a punctuation mark, so z = n-1 = 3 is set. Next, because the character type of the fourth word "人" is "0" and the third word "と" is a particle, z = 3 is determined (step 602). No other entry contains the same character string as "そうはいはいと人", so z = 3 is final.

The configurations of the text segmentation device 101 in FIGS. 1 and 5 are merely examples, and some components may be omitted or changed according to the use and conditions of the text segmentation device 101. For example, in the text segmentation device 101 of FIG. 5, the division unit 112 can be omitted when the text segmentation processing is performed by an external device.

The flowcharts of FIGS. 2, 4, and 6 are merely examples, and part of the processing may be omitted or changed according to the configuration and conditions of the text segmentation device 101. For example, step 401 of the text segmentation processing in FIG. 4 does not necessarily require a longest match search; any one of the registered character strings that match in a prefix match search may be adopted.

In steps 601 and 602 of the registration processing in FIG. 6, the distinguished word count determination unit 501 may use types such as kanji, alphabetic characters, digits, and symbols, in addition to hiragana and katakana, as the character types of the words contained in a character string. The distinguished word count determination unit 501 may also use parts of speech such as nouns, verbs, adjectives, and adverbs, in addition to particles and auxiliary verbs, as the predetermined parts of speech in a character string. The distinguished word count determination unit 501 may determine the number of distinguished words based on only one attribute among the number of characters contained in part or all of the character string, the character types of the words contained in the character string, and the position of a predetermined part of speech within the character string.

In the registration processing of FIG. 6, steps 601 and 602 can be omitted when the number of distinguished words is not determined from the attributes of the character string. Steps 603 and 604 can be omitted when the number of words shared by identical character strings is not registered as the number of distinguished words.

Instead of performing the registration processing of FIG. 6, the distinguished word count determination unit 501 may register in the character string division information 121 a number of distinguished words specified by a user or operator.

The character string division information 121 of FIG. 3 is merely an example, and other character string division information 121 may be used according to the configuration and conditions of the text segmentation device 101. For example, when the text segmentation device 101 does not perform the registration processing, the character string length, the character type, and the positions of particles and auxiliary verbs in FIG. 3 can be omitted. The character strings registered in the character string division information 121 need not be in n-gram form and may be in another form that indicates the boundary positions between words. When text in a language other than Japanese is divided, character strings in that language are registered in the character string division information 121.

The text segmentation device 101 of FIGS. 1 and 5 can be implemented using, for example, an information processing device (computer) such as the one shown in FIG. 7.

The information processing device of FIG. 7 includes a central processing unit (CPU) 701, a memory 702, an input device 703, an output device 704, an auxiliary storage device 705, a medium drive device 706, and a network connection device 707. These components are connected to one another by a bus 708.

The memory 702 is, for example, a semiconductor memory such as a read only memory (ROM), a random access memory (RAM), or a flash memory. The memory 702 stores the programs and data used for the text segmentation processing or the distinguished word count registration processing. The memory 702 can be used as the storage unit 111 of FIGS. 1 and 5.

The CPU 701 (processor) operates as the division unit 112 and the distinguished word count determination unit 501 of FIGS. 1 and 5 by, for example, executing a program using the memory 702.

The input device 703 is, for example, a keyboard or a pointing device, and is used to input instructions and information from a user or operator. The output device 704 is, for example, a display device, a printer, or a speaker, and is used to output inquiries to the user or operator and processing results. A processing result may be the result of dividing text.

The auxiliary storage device 705 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, or a tape device. The auxiliary storage device 705 may be a hard disk drive or a flash memory. The information processing device can store programs and data in the auxiliary storage device 705 in advance and load them into the memory 702 for use. The auxiliary storage device 705 can be used as the storage unit 111 of FIGS. 1 and 5.

The medium drive device 706 drives a removable recording medium 709 and accesses its recorded contents. The removable recording medium 709 is a memory device, a flexible disk, an optical disk, a magneto-optical disk, or the like. The removable recording medium 709 may also be a compact disc read only memory (CD-ROM), a digital versatile disc (DVD), a universal serial bus (USB) memory, or the like. A user or operator can store programs and data in the removable recording medium 709 and load them into the memory 702 for use.

The computer-readable recording media that store the programs and data in this way are physical (non-transitory) recording media such as the memory 702, the auxiliary storage device 705, and the removable recording medium 709.

The network connection device 707 is a communication interface that is connected to a communication network such as a local area network (LAN) or the Internet and performs the data conversion that accompanies communication. The information processing device can receive programs and data from an external device via the network connection device 707 and load them into the memory 702 for use.

The information processing device can also receive instructions and information from a user terminal via the network connection device 707, perform the text segmentation processing or the distinguished word count registration processing, and send the processing result to the user terminal.

The information processing device need not include all the components of FIG. 7, and some components can be omitted according to the use and conditions. For example, the input device 703 may be omitted when no instructions or information are input from a user or operator, and the output device 704 may be omitted when no inquiries to the user or operator and no processing results are output. The medium drive device 706 or the network connection device 707 may be omitted when the information processing device does not access the removable recording medium 709 or a communication network.

The disclosed embodiments and their advantages have been described in detail, but those skilled in the art can make various changes, additions, and omissions without departing from the scope of the invention as set forth in the claims.

The following notes are further disclosed with regard to the embodiments described with reference to FIGS. 1 to 7.

(Note 1)

A text segmentation program that causes a computer to execute processing comprising:

searching character string division information, which associates a registered character string divided into a plurality of words with a number of distinguished words, for a first character string included in text; and

when the first character string corresponds to the registered character string, dividing a second character string within the first character string, the second character string including as many words as the number of distinguished words associated with the registered character string, into that number of words.

(Note 2)

The text segmentation program according to Note 1, wherein the number of distinguished words is determined based on an attribute of the registered character string.

(Note 3)

The text segmentation program according to Note 2, wherein the attribute of the registered character string includes at least one of the number of characters contained in part or all of the registered character string, the character types of the plurality of words contained in the registered character string, and the position of a predetermined part of speech within the registered character string.

(Note 4)

The text segmentation program according to any one of Notes 1 to 3, wherein the character string division information includes a character string that is identical to the registered character string and is divided into a plurality of words at division positions different from those of the registered character string, and the number of distinguished words is determined based on the number of words shared by the registered character string and the character string divided at the different division positions.

(Note 5)

The text segmentation program according to any one of Notes 1 to 4, wherein the computer searches the character string division information for the first character string by a longest match search.

(Note 6)

A text segmentation device comprising: a storage unit that stores character string division information associating a registered character string divided into a plurality of words with a number of distinguished words; and a division unit that searches the character string division information for a first character string included in text and, when the first character string corresponds to the registered character string, divides a second character string within the first character string, the second character string including as many words as the number of distinguished words associated with the registered character string, into that number of words.

(Note 7)

The text segmentation device according to Note 6, wherein the number of distinguished words is determined based on an attribute of the registered character string.

(Note 8)

The text segmentation device according to Note 7, wherein the attribute of the registered character string includes at least one of the number of characters contained in part or all of the registered character string, the character types of the plurality of words contained in the registered character string, and the position of a predetermined part of speech within the registered character string.

(Note 9)

The text segmentation device according to any one of Notes 6 to 8, wherein the character string division information includes a character string that is identical to the registered character string and is divided into a plurality of words at division positions different from those of the registered character string, and the number of distinguished words is determined based on the number of words shared by the registered character string and the character string divided at the different division positions.

(Note 10)

The text segmentation device according to any one of Notes 6 to 9, wherein the division unit searches the character string division information for the first character string by a longest match search.

(Note 11)

A text segmentation method, wherein

a computer searches character string division information, which associates a registered character string divided into a plurality of words with a number of distinguished words, for a first character string included in text and, when the first character string corresponds to the registered character string, divides a second character string within the first character string, the second character string including as many words as the number of distinguished words associated with the registered character string, into that number of words.

(Note 12)

The text segmentation method according to Note 11, wherein the number of distinguished words is determined based on an attribute of the registered character string.

(Note 13)

The text segmentation method according to Note 12, wherein the attribute of the registered character string includes at least one of the number of characters contained in part or all of the registered character string, the character types of the plurality of words contained in the registered character string, and the position of a predetermined part of speech within the registered character string.

(Note 14)

The text segmentation method according to any one of Notes 11 to 13, wherein the character string division information includes a character string that is identical to the registered character string and is divided into a plurality of words at division positions different from those of the registered character string, and the number of distinguished words is determined based on the number of words shared by the registered character string and the character string divided at the different division positions.

(Note 15)

The text segmentation method according to any one of Notes 11 to 14, wherein the computer searches the character string division information for the first character string by a longest match search.
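The search-and-divide step common to the notes above, combined with the longest match search, can be sketched as one lookup step. This is an illustration under assumptions, not the disclosed implementation: the function name `segment_step`, the table layout (registered string mapped to its word list and number of distinguished words z), the linear scan over the table, and the choice to return the position just after the z-th word are all hypothetical.

```python
def segment_step(text, pos, table):
    """Divide the text at pos using the longest matching registered string.

    table maps a registered string to (words, z), where words is its division
    into words and z is its number of distinguished words (hypothetical layout).
    Returns (confirmed_words, next_pos), or None when nothing matches.
    """
    best = None
    for registered in table:                      # longest match search
        if text.startswith(registered, pos):
            if best is None or len(registered) > len(best):
                best = registered
    if best is None:
        return None

    words, z = table[best]
    confirmed = words[:z]                         # only the first z words are kept
    # Assumption: scanning resumes right after the confirmed words
    next_pos = pos + sum(len(w) for w in confirmed)
    return confirmed, next_pos


table = {
    "そうは": (["そう", "は"], 1),
    "そうはいっても": (["そう", "は", "いって", "も"], 2),
}
print(segment_step("そうはいってもいい", 0, table))
```

A production lookup would more likely use a trie or similar index than a linear scan; the scan is used here only to keep the sketch short.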

Claims

1. A text segmentation device, characterized by comprising:

a storage unit that stores character string division information associating a registered character string divided into a plurality of words with a number of distinguished words; and

a division unit that searches the character string division information for a first character string included in text and, when the first character string corresponds to the registered character string, divides a second character string within the first character string, the second character string including as many words as the number of distinguished words associated with the registered character string, into that number of words.

2. The text segmentation device according to claim 1, wherein

the number of distinguished words is determined based on an attribute of the registered character string.

3. The text segmentation device according to claim 2, wherein

the attribute of the registered character string includes at least one of the number of characters contained in part or all of the registered character string, the character types of the plurality of words contained in the registered character string, and the position of a predetermined part of speech within the registered character string.

4. The text segmentation device according to any one of claims 1 to 3, wherein

the character string division information includes a character string that is identical to the registered character string and is divided into a plurality of words at division positions different from those of the registered character string, and the number of distinguished words is determined based on the number of words shared by the registered character string and the character string divided at the different division positions.

5. The text segmentation device according to claim 1, wherein

the division unit searches the character string division information for the first character string by a longest match search.

6. A text segmentation method, characterized in that

a computer searches character string division information, which associates a registered character string divided into a plurality of words with a number of distinguished words, for a first character string included in text,

and, when the first character string corresponds to the registered character string, divides a second character string within the first character string, the second character string including as many words as the number of distinguished words associated with the registered character string, into that number of words.

7. The text segmentation method according to claim 6, wherein

the number of distinguished words is determined based on an attribute of the registered character string.

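8. The text segmentation method according to claim 7, wherein

the attribute of the registered character string includes at least one of the number of characters contained in part or all of the registered character string, the character types of the plurality of words contained in the registered character string, and the position of a predetermined part of speech within the registered character string.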

9. The text segmentation method according to any one of claims 6 to 8, wherein

the character string division information includes a character string that is identical to the registered character string and is divided into a plurality of words at division positions different from those of the registered character string, and the number of distinguished words is determined based on the number of words shared by the registered character string and the character string divided at the different division positions.

10. The text segmentation method according to claim 6, wherein

the computer searches the character string division information for the first character string by a longest match search.