JP2014119977A

JP2014119977A - Daily word extractor, method, and program

Info

Publication number: JP2014119977A
Application number: JP2012274791A
Authority: JP
Inventors: Kugatsu Sadamitsu; 九月貞光; Ryuichiro Higashinaka; 竜一郎東中; Kuniko Saito; 邦子齋藤; Toshiaki Makino; 俊朗牧野; Yoshihiro Matsuo; 義博松尾; Takeshi Yoshimura; 健吉村; Wataru Uchida; 渉内田
Original assignee: NTT Docomo Inc; Nippon Telegraph and Telephone Corp
Current assignee: NTT Docomo Inc; Nippon Telegraph and Telephone Corp
Priority date: 2012-12-17
Filing date: 2012-12-17
Publication date: 2014-06-30
Anticipated expiration: 2032-12-17
Also published as: JP5676552B2

Abstract

PROBLEM TO BE SOLVED: To extract daily words with high accuracy. SOLUTION: For a document consisting of at least one morphologically analyzed sentence, a daily word extraction unit 32 uses a regular expression representing a character string pattern that simultaneously contains a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string. From each portion matching the regular expression, the unit extracts the character string that does not represent a date and time as a daily word, that is, a character string representing an event.

Description

The present invention relates to a daily word extraction device, method, and program, and in particular to a daily word extraction device, method, and program that extract daily words appearing in an input document.

A daily word is a character string that (1) represents an event occurring within a short period and (2) represents an event that readily becomes public information. For example, “announcement” (発表), “marriage” (結婚), “lunch” (ランチ), “part-time job” (バイト), and “has come” (来てる) satisfy requirement (1) and are therefore daily word candidates. Although marriage is an ongoing state, the present invention assumes that questions asked during that state tend to take a form such as “Who is X's husband?”, whereas the word “marriage” itself tends to be used in contexts such as “marriage [announced / revealed]”. It is therefore treated as a daily word candidate.

Furthermore, “announcement” and “marriage” concern public information and so also satisfy requirement (2), which makes them daily words. In contrast, “lunch”, “part-time job”, and “has come” concern private information, do not satisfy requirement (2), and are therefore not daily words.

As prior art close to daily word detection, a method is known that estimates bursting events by means of burst topic detection, a technique for estimating bursting fields (topics) (Non-Patent Document 1).

“Finding Bursty Topics from Microblogs”, Qiming Diao, Jing Jiang, Feida Zhu, Ee-Peng Lim, ACL 2012, http://www.mysmu.edu/faculty/jingjiang/papers/ACL'12.pdf

However, the method of Non-Patent Document 1 uses a topic model and an HMM to detect where bursts occur and which topics they belong to. Burst words can be detected from the topic model, but daily words appear across topics and are therefore difficult to detect with this method.

The present invention has been made to solve the above problem, and its object is to provide a daily word extraction device, method, and program that extract daily words with high accuracy.

To achieve the above object, the daily word extraction device of the present invention comprises: storage means for storing a document consisting of at least one morphologically analyzed sentence; and extraction means that, for the document stored in the storage means, uses a regular expression representing a character string pattern that simultaneously contains a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string, and that extracts, as a daily word (a character string representing an event), each character string that does not represent a date and time in a portion matching the regular expression and whose appearance frequency is equal to or greater than a predetermined threshold.

The daily word extraction method of the present invention is a method performed in a daily word extraction device that includes storage means for storing a document consisting of at least one morphologically analyzed sentence, and extraction means. In this method, the extraction means, for the document stored in the storage means, uses a regular expression representing a character string pattern that simultaneously contains a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string, and extracts, as a daily word (a character string representing an event), each character string that does not represent a date and time in a portion matching the regular expression and whose appearance frequency is equal to or greater than a predetermined threshold.

The extraction means according to the present invention may, based on appearance frequency information obtained in advance for each character string in public documents, extract as a daily word a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or greater than the threshold, and whose appearance frequency in the public documents is equal to or greater than a predetermined threshold.

The extraction means according to the present invention may use newspapers as the public documents. Based on appearance frequency information obtained in advance for each character string in newspaper headlines and appearance frequency information obtained in advance for each character string in newspaper body text, it may extract as a daily word a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or greater than the threshold, and whose appearance frequency in newspaper headlines is equal to or greater than a predetermined threshold or whose appearance frequency in newspaper body text is equal to or greater than a predetermined threshold.

The extraction means according to the present invention may extract as a daily word a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or greater than the threshold, and that co-occurs with at least one of a plurality of predetermined character strings representing search queries used in a search engine. It may also extract as a daily word a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or greater than the threshold, whose appearance frequency in newspaper headlines is equal to or greater than a predetermined threshold or whose appearance frequency in newspaper body text is equal to or greater than a predetermined threshold, and that does not co-occur with any of a plurality of predetermined character strings used in private documents.

According to the present invention, for a document consisting of at least one morphologically analyzed sentence, a regular expression is used that represents a character string pattern simultaneously containing a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string. Each character string that does not represent a date and time in a portion matching the regular expression, and whose appearance frequency is equal to or greater than a predetermined threshold, is extracted as a daily word, that is, a character string representing an event.

In this way, daily words can be extracted with high accuracy by extracting them from portions that match a regular expression representing a character string pattern that simultaneously contains a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string.

The program of the present invention is a program for causing a computer to function as each of the means constituting the above daily word extraction device.

As described above, the daily word extraction device, method, and program of the present invention can extract daily words with high accuracy by extracting them from portions that match a regular expression representing a character string pattern that simultaneously contains a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string.

A block diagram showing the functional configuration of the daily word extraction device of the first embodiment.
A flowchart showing the daily word extraction processing routine in the daily word extraction device of the first embodiment.
A diagram showing how daily words are positioned.
A block diagram showing the functional configuration of the daily word extraction device of the second embodiment.
A flowchart showing the daily word extraction processing routine in the daily word extraction device of the second embodiment.
A block diagram showing the functional configuration of the daily word extraction device of the third embodiment.
A flowchart showing the daily word extraction processing routine in the daily word extraction device of the third embodiment.
A diagram showing the daily word extraction results of an experiment.

Embodiments of the present invention are described in detail below with reference to the drawings.

<System configuration>
The daily word extraction device according to the first embodiment of the present invention will be described. As shown in FIG. 1, the daily word extraction device 100 according to the first embodiment includes an input unit 10, a computation unit 20 that executes the daily word extraction processing routine described later, and an output unit 50.

The input unit 10 receives a plurality of morphologically analyzed documents from an input device such as a keyboard. The input unit 10 may also accept documents supplied from outside via a network or the like.

The computation unit 20 is a computer comprising a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the daily word extraction processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as including a morphologically analyzed document storage unit 30, a named entity extraction unit 31, and a daily word extraction unit 32.

The morphologically analyzed document storage unit 30 stores the plurality of morphologically analyzed documents received by the input unit 10.

The named entity extraction unit 31 extracts named entities from the plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30: proper nouns such as person names, organization names, and artifact names, as well as date expressions (DATE) and time expressions (TIME). Any conventionally known named entity extraction technique may be used. The example in this embodiment uses CaboCha.

The daily word extraction unit 32 uses regular expressions representing character string patterns that simultaneously contain a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or numeric expression related to that character string. For the plurality of documents stored in the morphologically analyzed document storage unit 30, it extracts, from each portion matching a regular expression pattern, the character string that does not represent a date and time as a daily word candidate. For example, using the pattern “[DAT]の[X]は[Y]” (roughly, “the [X] of [DAT] is [Y]”), the character string X in each matching portion is extracted as a common-noun daily word candidate. Using the pattern “[DAT]に[X](する|した)[Y]” (roughly, “[Y], which does/did [X] on [DAT]”), the character string X in each matching portion is extracted as a verbal-noun daily word candidate. Only character strings X whose combined appearance frequency as a common noun and as a verbal noun is equal to or greater than a predetermined threshold are targeted for extraction (the threshold in this embodiment is 10).

Here, DAT is either a character string whose part of speech is date/time, or a date expression (DATE) or time expression (TIME) extracted by the named entity extraction unit 31. The character string X is any character string whose part of speech is not date/time. The character string Y is a named entity (a proper noun such as a person name, organization name, or artifact name) or a numeric expression.
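The two patterns can be illustrated with a minimal Python sketch. The input format here is an assumption made for illustration, not something the patent specifies: each sentence is a string of space-separated `surface/TAG` tokens, with the hypothetical tags `DAT` (date/time expression), `NE` (named entity), `NUM` (number), `N` (noun that is not a date/time), `P` (particle), and `V` (verb). The function name `extract_candidates` is likewise hypothetical.

```python
import re

# Assumed token format: "surface/TAG" separated by single spaces.
# Pattern 1: [DAT]の[X]は[Y]  -> X is a common-noun candidate.
COMMON_NOUN = re.compile(
    r"(?:^| )\S+/DAT の/P (?P<x>\S+)/N は/P \S+/(?:NE|NUM)(?= |$)"
)
# Pattern 2: [DAT]に[X](する|した)[Y]  -> X is a verbal-noun candidate.
VERBAL_NOUN = re.compile(
    r"(?:^| )\S+/DAT に/P (?P<x>\S+)/N (?:する|した)/V \S+/(?:NE|NUM)(?= |$)"
)

def extract_candidates(tagged_sentence):
    """Return (X, kind) pairs for every pattern match in one tagged sentence."""
    found = []
    for kind, pattern in (("common", COMMON_NOUN), ("verbal", VERBAL_NOUN)):
        for match in pattern.finditer(tagged_sentence):
            found.append((match.group("x"), kind))
    return found

if __name__ == "__main__":
    print(extract_candidates("明日/DAT の/P 会見/N は/P 首相/NE"))
    print(extract_candidates("昨日/DAT に/P 発表/N した/V ＡＢＣ社/NE"))
```

Because X must carry the non-date tag `N` and Y must carry `NE` or `NUM`, a sentence whose final slot is an ordinary noun produces no candidate.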

Based on the candidate extraction results, the daily word extraction unit 32 extracts as daily words those character strings that were extracted as daily word candidates N or more times (for example, 10 times), and outputs them as a daily word list via the output unit 50.
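The frequency cut-off can be sketched as follows. This is an illustration under assumptions: candidates are given as a flat list of surface strings (so common-noun and verbal-noun occurrences of the same string are counted together), and the name `select_daily_words` and its `min_count` parameter are hypothetical.

```python
from collections import Counter

def select_daily_words(candidates, min_count=10):
    """Keep candidate strings that were extracted at least min_count times."""
    counts = Counter(candidates)
    return sorted(word for word, count in counts.items() if count >= min_count)

if __name__ == "__main__":
    sample = ["発表"] * 12 + ["ランチ"] * 3
    print(select_daily_words(sample))
```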

<Operation of the daily word extraction device>

Next, the operation of the daily word extraction device 100 according to the first embodiment of the present invention will be described. First, when morphologically analyzed documents are input via the input unit 10, they are stored in the morphologically analyzed document storage unit 30. The CPU then executes the program stored in the ROM of the daily word extraction device 100, whereby the daily word extraction processing routine shown in FIG. 2 is executed.

First, in step S100, the plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30 are read.

Next, in step S101, named entity extraction is performed on the plurality of morphologically analyzed documents obtained in step S100.

Next, in step S102, for the morphologically analyzed documents obtained in step S100, the character string corresponding to X is extracted as a common-noun daily word candidate from each portion matching the regular expression for the pattern “[DAT]の[X]は[Y]”, and the character string corresponding to X is extracted as a verbal-noun daily word candidate from each portion matching the regular expression for the pattern “[DAT]に[X](する|した)[Y]”.

Then, among the character strings extracted as daily word candidates, those extracted 10 or more times are extracted as daily words.

Next, in step S104, the daily words extracted in step S102 are output as a daily word list by the output unit 50, and the processing ends.

As described above, the daily word extraction device according to the first embodiment of the present invention can extract daily words with high accuracy by extracting them from portions that match a regular expression representing a character string pattern that simultaneously contains a character string that does not represent a date and time, a date/time expression related to that character string, and a named entity or number related to that character string.

Moreover, because the referent of a character string serving as a daily word changes from day to day, using a regular expression that contains both a date/time expression and a named entity allows daily words to be extracted with high accuracy.

Daily words that include private information, such as “social gathering” (懇親会) and “birthday” (誕生日), can also be extracted.

Furthermore, when finding the answer to a natural-language question requires deciding whether to consult highly real-time social media or non-real-time web articles, the daily word list automatically extracted in this embodiment can be used: if the question contains a daily word, it can be classified as one for which highly real-time social media should be consulted.
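This routing decision amounts to a membership test, sketched below. The sketch assumes the question has already been split into tokens by a morphological analyzer; the function name `choose_source` and the labels `"social_media"` and `"web_articles"` are hypothetical.

```python
def choose_source(question_tokens, daily_words):
    """Pick the source to consult: social media if any token is a daily word."""
    if any(token in daily_words for token in question_tokens):
        return "social_media"
    return "web_articles"

if __name__ == "__main__":
    daily = {"発表", "結婚"}
    print(choose_source(["Ｘ", "の", "結婚", "は", "いつ"], daily))
    print(choose_source(["富士山", "の", "高さ", "は"], daily))
```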

In addition, detecting burst words from the number of articles in which they appear does not always yield truly meaningful information (Non-Patent Document 2: Fujikawa, Kaji, Yoshinaga, Kitsuregawa, “A rumor detection support system based on topic extraction on microblogs and classification of user attitudes” (マイクロブログ上の話題抽出とユーザの態度の分類に基づく流言検出支援システム), DEIM2012, http://www.tkl.iis.u-tokyo.ac.jp/top/modules/newdb/extract/1176/data/DEIM2012.pdf). To address this problem, burst words that co-occur with daily words are assumed to be more meaningful, and the daily word list automatically extracted in this embodiment can be used to extract truly meaningful burst words (FIG. 3).
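One way to picture this filtering is the sketch below. It is an illustration under assumptions the text does not fix: co-occurrence is taken to mean appearing in the same document, documents are lists of tokens, the set of burst words comes from some separate detector, and the name `meaningful_burst_words` is hypothetical.

```python
def meaningful_burst_words(documents, burst_words, daily_words):
    """Keep burst words that share at least one document with a daily word.

    A burst word that is itself a daily word must co-occur with a
    different daily word to be kept.
    """
    kept = set()
    for tokens in documents:
        token_set = set(tokens)
        daily_here = token_set & daily_words
        for word in token_set & burst_words:
            if daily_here - {word}:
                kept.add(word)
    return kept

if __name__ == "__main__":
    docs = [["地震", "発表", "速報"], ["地震", "眠い"], ["眠い", "ランチ"]]
    print(meaningful_burst_words(docs, {"地震", "眠い"}, {"発表"}))
```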

Next, the second embodiment will be described. Parts configured in the same way as in the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first in that, among the daily word candidates, only character strings that appear in public documents are extracted as daily words.

<Configuration of the daily word extraction device>
As shown in FIG. 4, the daily word extraction device 200 according to the second embodiment of the present invention includes an input unit 10, a computation unit 120, and an output unit 50.

The input unit 10 receives, from an input device such as a keyboard, a plurality of morphologically analyzed documents and a plurality of items of morphologically analyzed newspaper data obtained by morphological analysis of newspaper documents. The input unit 10 may also accept data supplied from outside via a network or the like.

The computation unit 120 is a computer comprising a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the daily word extraction processing routine described later. Functionally, as shown in FIG. 4, this computer can be represented as including a morphologically analyzed document storage unit 30, a named entity extraction unit 31, a daily word candidate extraction unit 34, an initial daily word list storage unit 36, a morphologically analyzed newspaper data storage unit 38, a word frequency measurement unit 40, a frequency list storage unit 42, and a daily word extraction unit 44.

The daily word candidate extraction unit 34 performs the same processing as the daily word extraction unit 32 of the first embodiment to extract daily word candidates from the documents stored in the morphologically analyzed document storage unit 30, and stores an initial daily word list consisting of the extracted candidates in the initial daily word list storage unit 36.

The initial daily word list storage unit 36 stores the initial daily word list consisting of the daily word candidates extracted by the daily word candidate extraction unit 34.

The morphologically analyzed newspaper data storage unit 38 stores the plurality of items of morphologically analyzed newspaper data, obtained by morphological analysis of newspaper documents, received by the input unit 10.

The word frequency measurement unit 40 splits each item of morphologically analyzed newspaper data stored in the morphologically analyzed newspaper data storage unit 38 into headline data and body data. From the headline data of each item, it measures the appearance frequency of each character string that is a content word (for example, “part-time job” (バイト) or “starter” (先発)) and creates a frequency list of character strings in headlines. Likewise, from the body data of each item, it measures the appearance frequency of each character string that is a content word and creates a frequency list of character strings in body text. The headline frequency list and the body frequency list (hereinafter collectively called the word frequency list) are stored in the frequency list storage unit 42.
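The two frequency lists can be built with simple counters, as in the sketch below. The data layout is assumed for illustration: each article is a dict with `"headline"` and `"body"` keys holding lists of content words that have already been selected from the morphological analysis. The name `build_frequency_lists` is hypothetical.

```python
from collections import Counter

def build_frequency_lists(articles):
    """Count content words separately for headlines and for body text."""
    headline_freq = Counter()
    body_freq = Counter()
    for article in articles:
        headline_freq.update(article["headline"])
        body_freq.update(article["body"])
    return headline_freq, body_freq

if __name__ == "__main__":
    articles = [
        {"headline": ["新製品", "発表"], "body": ["同社", "新製品", "発表", "会見"]},
        {"headline": ["結婚", "発表"], "body": ["会見", "結婚"]},
    ]
    headline_freq, body_freq = build_frequency_lists(articles)
    print(headline_freq["発表"], body_freq["会見"])
```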

The frequency list storage unit 42 stores the word frequency list supplied by the word frequency measurement unit 40.

For each daily word candidate in the initial daily word list stored in the initial daily word list storage unit 36, the daily word extraction unit 44 makes the following determination. If the candidate is a verbal-noun daily word candidate, the unit determines whether the candidate's appearance frequency in the headline frequency list, taken from the word frequency list stored in the frequency list storage unit 42, is equal to or greater than a threshold (10 in this example). If the candidate is a common-noun daily word candidate, the unit determines whether the candidate's appearance frequency in the body frequency list is equal to or greater than a threshold (10 in this example). The daily word extraction unit 44 extracts as daily words the candidates whose appearance frequency is determined to be equal to or greater than the threshold, and outputs them as a daily word list via the output unit 50. Candidates whose appearance frequency is below the threshold are judged not to be daily words and are excluded.
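The selection rule can be sketched as below. Assumptions for illustration: each candidate is a `(word, kind)` pair where `kind` is the hypothetical label `"verbal"` or `"common"`, and the frequency lists are plain dicts mapping words to counts. The function name `filter_by_newspaper` is also hypothetical.

```python
def filter_by_newspaper(candidates, headline_freq, body_freq, threshold=10):
    """Keep verbal-noun candidates frequent in headlines and
    common-noun candidates frequent in body text."""
    daily_words = []
    for word, kind in candidates:
        freq_list = headline_freq if kind == "verbal" else body_freq
        if freq_list.get(word, 0) >= threshold:
            daily_words.append(word)
    return daily_words

if __name__ == "__main__":
    candidates = [("発表", "verbal"), ("バイト", "verbal"), ("会見", "common")]
    headline_freq = {"発表": 25, "バイト": 1}
    body_freq = {"会見": 14, "バイト": 40}
    print(filter_by_newspaper(candidates, headline_freq, body_freq))
```

Note that in this sketch a verbal-noun candidate is checked only against headlines, so a word that is frequent in body text alone is still excluded.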

<Operation of the daily word extraction device>
Next, the operation of the daily word extraction device 200 according to the second embodiment of the present invention will be described. Processing that is the same as in the first embodiment is given the same reference numerals, and its detailed description is omitted.

First, when a plurality of items of morphologically analyzed newspaper data, obtained by morphological analysis of newspaper documents, are input via the input unit 10, the word frequency measurement unit 40 splits each item into headline data and body data. From the headline data of each item, it measures the appearance frequency of each character string that is a content word and creates a frequency list of character strings in headlines. From the body data of each item, it likewise measures the appearance frequency of each character string that is a content word and creates a frequency list of character strings in body text. The lists are stored in the frequency list storage unit 42.

そして、入力部１０により、複数の形態素解析済み文書が入力されると、形態素解析済み文書記憶部３０に記憶され、デイリーワード抽出装置１００のＲＯＭに記憶されたプログラムを、ＣＰＵが実行することにより、図５に示すデイリーワード抽出処理ルーチンが実行される。 Then, when the input unit 10 receives a plurality of morphologically analyzed documents, they are stored in the morphologically analyzed document storage unit 30, and the CPU executes the program stored in the ROM of the daily word extraction device 100, whereby the daily word extraction processing routine shown in FIG. 5 is executed.

ステップＳ１００において、形態素解析済み文書記憶部３０に記憶されている複数の形態素解析済み文書を読み込む。 In step S100, a plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30 are read.

ステップＳ２００において、頻度リスト記憶部４２に記憶されている単語頻度リストを読み込む。 In step S200, the word frequency list stored in the frequency list storage unit 42 is read.

ステップＳ１０１において、ステップＳ１００において得られた複数の形態素解析済み文書について、固有表現抽出を行う。 In step S101, named entity extraction is performed on the plurality of morphologically analyzed documents obtained in step S100.

ステップＳ２０４において、ステップＳ１００において得られた複数の形態素解析済み文書について、「[ＤＡＴ]の[Ｘ]は[Ｙ]」という文字列のパターンの正規表現に一致する部分の各々から、文字列Ｘに該当する文字列を普通名詞のデイリーワード候補として抽出すると共に、「[ＤＡＴ]に[Ｘ]（する｜した）[Ｙ]」という文字列のパターンの正規表現に一致する部分の各々から、文字列Ｘに該当する文字列を動作性名詞としてのデイリーワード候補として抽出する。そして、デイリーワード候補として抽出された文字列であって、１０回以上抽出された文字列を抽出し、初期デイリーワードリストとして初期デイリーワードリスト記憶部３６に格納する。 In step S204, for the plurality of morphologically analyzed documents obtained in step S100, the character string corresponding to X is extracted as a common-noun daily word candidate from each portion that matches the regular expression for the pattern 「[ＤＡＴ]の[Ｘ]は[Ｙ]」 (roughly, "the [X] of [DAT] is [Y]"). In addition, the character string corresponding to X is extracted as an action-noun daily word candidate from each portion that matches the regular expression for the pattern 「[ＤＡＴ]に[Ｘ]（する｜した）[Ｙ]」 (roughly, "[Y], who does or did [X] on [DAT]"). Character strings that were extracted as candidates ten or more times are then selected and stored in the initial daily word list storage unit 36 as the initial daily word list.
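The candidate extraction in step S204 could look roughly like the sketch below. It assumes, purely for illustration, that the named entity step has already rewritten each sentence so that date expressions appear as the literal tag `[DAT]` and named entities or numbers appear as `[NE]` or `[NUM]`. These tag spellings, the function name `extract_candidates`, and the character class used for X are hypothetical. The two patterns and the minimum count of ten follow the text.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# "[DAT]の[X]は[Y]": X is taken as a common-noun candidate.
COMMON_NOUN = re.compile(r"\[DAT\]の(?P<x>[^\[\]、。]+?)は\[(?:NE|NUM)\]")
# "[DAT]に[X](する|した)[Y]": X is taken as an action-noun candidate.
ACTION_NOUN = re.compile(r"\[DAT\]に(?P<x>[^\[\]、。]+?)(?:する|した)\[(?:NE|NUM)\]")


def extract_candidates(
    sentences: Iterable[str], min_count: int = 10
) -> Dict[Tuple[str, str], int]:
    """Return {(candidate, kind): count} for candidates seen min_count+ times."""
    counts: Counter = Counter()
    for sentence in sentences:
        for m in COMMON_NOUN.finditer(sentence):
            counts[(m.group("x"), "common")] += 1
        for m in ACTION_NOUN.finditer(sentence):
            counts[(m.group("x"), "action")] += 1
    return {key: n for key, n in counts.items() if n >= min_count}
```

A real system would more likely match over morpheme sequences than over raw tagged strings; the string-level regular expressions here are only meant to show the shape of the two patterns.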

ステップＳ２０６において、ステップＳ２０４で初期デイリーワードリスト記憶部３６に格納された初期デイリーワードリストに含まれるデイリーワード候補の文字列について、ステップＳ２００において得られた単語頻度リストに基づいて、デイリーワード候補の文字列が動作性名詞の場合、見出しにおける文字列の出現頻度が、閾値以上である否かを判定し、デイリーワード候補の文字列が普通名詞の場合、本文における文字列の出現頻度が、閾値以上であるか否かを判定し、上記の出現頻度が閾値以上であると判定されたデイリーワード候補の文字列をデイリーワードとして抽出する。 In step S206, each daily word candidate character string in the initial daily word list stored in the initial daily word list storage unit 36 in step S204 is checked against the word frequency list obtained in step S200. If the candidate is an action noun, it is determined whether its appearance frequency in headlines is equal to or higher than the threshold; if the candidate is a common noun, it is determined whether its appearance frequency in body text is equal to or higher than the threshold. A candidate whose appearance frequency is determined to be equal to or higher than the threshold is extracted as a daily word.

ステップＳ２０７において、初期デイリーワードリスト記憶部３６に格納された初期デイリーワードリストに含まれるデイリーワード候補の文字列の全てについて、上記ステップＳ２０６のデイリーワードの抽出処理を行ったか否かを判定する。すべてのデイリーワード候補についてデイリーワードの抽出処理を行った場合には、ステップＳ１０４に移行し、デイリーワードの抽出処理を行っていないデイリーワード候補の文字列が存在する場合には、ステップＳ２０６に移行して当該文字列を判定対象の文字列として上記の処理を繰り返す。 In step S207, it is determined whether or not the daily word extraction processing in step S206 has been performed for all the character strings of the daily word candidates included in the initial daily word list stored in the initial daily word list storage unit 36. If the daily word extraction process has been performed for all the daily word candidates, the process proceeds to step S104. If there is a character string of a daily word candidate that has not been subjected to the daily word extraction process, the process proceeds to step S206. Then, the above process is repeated using the character string as a character string to be determined.

ステップＳ１０４において、ステップＳ２０６において抽出されたデイリーワードをデイリーワードリストとして出力部５０により出力して処理を終了する。 In step S104, the daily word extracted in step S206 is output as a daily word list by the output unit 50, and the process ends.

以上、説明したように、本発明の第２の実施の形態に係るデイリーワード抽出装置によれば、デイリーワード候補を抽出し、さらにデイリーワード候補のうち公的な文書に出現する文字列のみをデイリーワードとして抽出することにより、高精度にデイリーワードを抽出することができる。 As described above, according to the daily word extraction device of the second embodiment of the present invention, the daily word candidates are extracted, and only the character strings appearing in the official document among the daily word candidates are extracted. By extracting as a daily word, a daily word can be extracted with high accuracy.

また、新聞等の公性の強いリソースの中の頻度情報と照らし合わせることにより、デイリーワードとして抽出される文字列の公性を保証して、デイリーワードを抽出することができる。 In addition, by checking against frequency information in a highly public resource such as a newspaper, it is possible to extract a daily word while guaranteeing the publicity of a character string extracted as a daily word.

また、公的な文書の頻度情報を用いることにより、デイリーワードとして私的な情報を過剰に獲得することを避けることが出来る。 Moreover, it is possible to avoid excessive acquisition of private information as a daily word by using the frequency information of a public document.

次に、第３の実施の形態について説明する。なお、第１及び第２の実施の形態と同様の構成となる部分については、同一符号を付して説明を省略する。 Next, a third embodiment will be described. Parts that have the same configuration as in the first and second embodiments are given the same reference numerals, and their description is omitted.

第３の実施の形態では、デイリーワード候補を抽出し、公的な文書に出現する文字列であるか否かと、検索クエリとして用いられる単語と共起するか否かと、私的な文書に用いられる単語と共起するか否かとに基づいてデイリーワードを抽出する点が第２の実施の形態と異なっている。 The third embodiment differs from the second embodiment in that, after daily word candidates are extracted, daily words are extracted based on three criteria: whether the candidate is a character string that appears in public documents, whether it co-occurs with a word used as a search query, and whether it co-occurs with a word used in private documents.

＜デイリーワード抽出装置の構成＞ <Configuration of Daily Word Extractor>
本発明の第３の実施の形態に係るデイリーワード抽出装置３００は、図６に示すように、入力部１０と、演算部２２０と、出力部５０とを備えている。 As illustrated in FIG. 6, the daily word extraction device 300 according to the third embodiment of the present invention includes an input unit 10, a calculation unit 220, and an output unit 50.

入力部１０は、キーボードなどの入力装置から、複数の形態素解析済み文書と、新聞の文書を形態素解析した複数の形態素解析済み新聞データと、私的な文書について用いられる予め定められた複数の単語からなるプライベート単語リストと、検索エンジンに用いられる検索クエリを示す予め定められた複数の単語からなる検索クエリリストとを受け付ける。なお、入力部１０は、ネットワーク等を介して外部から入力されたものを受け付けるようにしてもよい。 The input unit 10 receives, from an input device such as a keyboard, a plurality of morphologically analyzed documents, a plurality of morphologically analyzed newspaper data items obtained by morphological analysis of newspaper documents, a private word list consisting of a plurality of predetermined words used in private documents, and a search query list consisting of a plurality of predetermined words representing search queries used with a search engine. The input unit 10 may also accept input from outside via a network or the like.

演算部２２０は、ＣＰＵ（Central Processing Unit）と、ＲＡＭ（Random Access Memory）と、後述するデイリーワード抽出処理ルーチンを実行するためのプログラムを記憶したＲＯＭ（Read Only Memory）とを備えたコンピュータで構成されている。このコンピュータは、機能的には、図６に示すように、形態素解析済み文書記憶部３０と、固有表現抽出部３１と、デイリーワード候補抽出部３４と、初期デイリーワードリスト記憶部３６と、形態素解析済み新聞データ記憶部３８と、単語頻度計測部４０と、頻度リスト記憶部４２と、デイリーワード抽出部４４と、プライベート単語リスト記憶部６０と、検索クエリリスト記憶部６２とを含んだ構成で表すことができる。 The calculation unit 220 is a computer that includes a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a program for executing the daily word extraction processing routine described later. Functionally, as shown in FIG. 6, this computer can be represented as a configuration that includes a morphologically analyzed document storage unit 30, a named entity extraction unit 31, a daily word candidate extraction unit 34, an initial daily word list storage unit 36, a morphologically analyzed newspaper data storage unit 38, a word frequency measurement unit 40, a frequency list storage unit 42, a daily word extraction unit 44, a private word list storage unit 60, and a search query list storage unit 62.

プライベート単語リスト記憶部６０には、入力部１０において受け付けたプライベート単語リストが記憶されている。 The private word list storage unit 60 stores the private word list received by the input unit 10.

検索クエリリスト記憶部６２には、入力部１０において受け付けた検索クエリリストが記憶されている。 The search query list storage unit 62 stores the search query list received by the input unit 10.

デイリーワード抽出部４４は、初期デイリーワードリスト記憶部３６に記憶されている初期デイリーワードリストに含まれるデイリーワード候補の文字列について、当該デイリーワード候補の文字列が、検索クエリリスト記憶部６２に記憶されている検索クエリリストに含まれる少なくとも１つの単語である文字列と、当該デイリーワード候補の文字列が含まれている形態素解析済み文書内で共起するか否かを判定し、共起すると判定された場合には、当該デイリーワード候補の文字列をデイリーワードとして抽出する。 For each daily word candidate character string in the initial daily word list stored in the initial daily word list storage unit 36, the daily word extraction unit 44 determines whether the candidate co-occurs, within a morphologically analyzed document that contains the candidate, with at least one word in the search query list stored in the search query list storage unit 62. If co-occurrence is found, the candidate is extracted as a daily word.

当該デイリーワード候補の文字列が検索クエリリストに含まれる何れの文字列とも共起しない場合には、当該デイリーワード候補の文字列が動作性名詞におけるデイリーワード候補であれば、頻度リスト記憶部４２に記憶されている単語頻度リストの内、見出しにおける文字列の頻度リストから得られる当該デイリーワード候補の文字列の出現頻度が、閾値（本実施例では１０回）以上であるかを判定し、当該デイリーワード候補の文字列が普通名詞におけるデイリーワード候補であれば、頻度リスト記憶部４２に記憶されている単語頻度リストの内、本文における文字列の頻度リストから得られる当該デイリーワード候補の文字列の出現頻度が、閾値（本実施例では１０回）以上であるか否かを判定する。デイリーワード抽出部４４は、当該デイリーワード候補の文字列について、上記の出現頻度が閾値以上であると判定されると、当該デイリーワード候補の文字列が、プライベート単語リスト記憶部６０に記憶されているプライベート単語リストの何れの単語とも、当該デイリーワード候補の文字列が含まれている形態素解析済み文書内で共起しないか否かを判定し、何れの単語とも共起しないと判定された場合には、当該デイリーワード候補の文字列をデイリーワードとして抽出する。 If the candidate does not co-occur with any character string in the search query list, a frequency check is applied. If the candidate is an action-noun candidate, it is determined whether its appearance frequency, obtained from the headline frequency list in the word frequency list stored in the frequency list storage unit 42, is equal to or higher than a threshold (10 occurrences in this embodiment); if the candidate is a common-noun candidate, it is determined whether its appearance frequency, obtained from the body-text frequency list in the same word frequency list, is equal to or higher than the threshold (10 occurrences in this embodiment). When the appearance frequency is determined to be equal to or higher than the threshold, the daily word extraction unit 44 determines whether the candidate co-occurs, within a morphologically analyzed document that contains the candidate, with any word in the private word list stored in the private word list storage unit 60. If the candidate co-occurs with none of those words, it is extracted as a daily word.
一方、当該デイリーワード候補の文字列について、上記の出現頻度が、閾値未満である場合には、当該デイリーワード候補の文字列は、デイリーワードではないと判断され除外される。また、当該デイリーワード候補の文字列について、プライベート単語リスト記憶部６０に記憶されているプライベート単語リストの少なくとも１つの単語と、当該デイリーワード候補の文字列が含まれている形態素解析済み文書内で共起すると判定された場合には、当該デイリーワード候補の文字列は、デイリーワードではないと判断され除外される。 Conversely, a candidate whose appearance frequency is below the threshold is judged not to be a daily word and is excluded. A candidate that is determined to co-occur, within a morphologically analyzed document that contains it, with at least one word in the private word list stored in the private word list storage unit 60 is likewise judged not to be a daily word and is excluded.
デイリーワード抽出部４４は、上記のデイリーワード抽出処理を、初期デイリーワードリスト記憶部３６に記憶されている初期デイリーワードリストに含まれる全てのデイリーワード候補の文字列の各々について繰り返し行い、抽出されたデイリーワードを、デイリーワードリストとして出力部５０により出力する。 The daily word extraction unit 44 repeats this extraction processing for every daily word candidate character string in the initial daily word list stored in the initial daily word list storage unit 36, and the extracted daily words are output by the output unit 50 as a daily word list.
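The three-stage decision of the third embodiment (search query co-occurrence, then newspaper frequency, then private word co-occurrence) can be sketched as a single predicate. This is an illustrative reading under stated assumptions: documents are modeled as sets of words, and the name `is_daily_word`, its parameters, and the sample word lists are hypothetical rather than taken from the specification.

```python
from typing import Dict, List, Set


def is_daily_word(
    candidate: str,
    kind: str,  # "action" (action noun) or "common" (common noun)
    documents: List[Set[str]],
    query_words: Set[str],
    private_words: Set[str],
    headline_freq: Dict[str, int],
    body_freq: Dict[str, int],
    threshold: int = 10,
) -> bool:
    # Words appearing in any document that also contains the candidate.
    containing = [doc for doc in documents if candidate in doc]
    cooccurring = set().union(*containing) - {candidate}

    # 1. Co-occurrence with a search query word: accept immediately.
    if cooccurring & query_words:
        return True
    # 2. Otherwise require a sufficient newspaper frequency.
    table = headline_freq if kind == "action" else body_freq
    if table.get(candidate, 0) < threshold:
        return False
    # 3. Reject if the candidate co-occurs with any private word.
    return not (cooccurring & private_words)
```

Note the ordering: a candidate that co-occurs with a search query word is accepted before the frequency and private word checks are consulted, which matches the sequence described above.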

＜デイリーワード抽出装置の作用＞ <Operation of Daily Word Extractor>
次に、本発明の第３の実施の形態に係るデイリーワード抽出装置３００の作用について説明する。なお、第１及び第２の実施の形態と同様の処理については、同一符号を付して詳細な説明を省略する。 Next, the operation of the daily word extraction device 300 according to the third embodiment of the present invention will be described. Processes that are the same as in the first and second embodiments are given the same reference numerals, and their detailed description is omitted.

まず、入力部１０により、新聞の文書を形態素解析した複数の形態素解析済み新聞データが入力されると、新聞データの各々について、見出し部分のデータと、本文部分のデータとに分け、複数の形態素解析済み新聞データの各々の見出し部分のデータから、内容語である文字列の各々について、出現頻度を計測し、見出しにおける文字列の頻度リストを作成する。また、複数の形態素解析済み新聞データの各々の本文部分のデータから、内容語である文字列の各々について、出現頻度を計測し、本文における文字列の頻度リストを作成し、頻度リスト記憶部４２に格納する。 First, when the input unit 10 receives a plurality of morphologically analyzed newspaper data items obtained by morphological analysis of newspaper documents, each item is split into headline data and body-text data. From the headline data of each item, the appearance frequency of each content-word character string is measured, and the headline frequency list is created. Likewise, from the body-text data of each item, the appearance frequency of each content-word character string is measured, the body-text frequency list is created, and it is stored in the frequency list storage unit 42.

また、入力部１０により、プライベート単語リストが入力されると、入力されたプライベート単語リストをプライベート単語リスト記憶部６０に格納する。 When a private word list is input by the input unit 10, the input private word list is stored in the private word list storage unit 60.

また、入力部１０により、検索クエリリストが入力されると、入力された検索クエリリストを検索クエリリスト記憶部６２に格納する。 When a search query list is input by the input unit 10, the input search query list is stored in the search query list storage unit 62.

そして、入力部１０により、形態素解析済み文書が入力されると、形態素解析済み文書記憶部３０に記憶され、デイリーワード抽出装置３００のＲＯＭに記憶されたプログラムを、ＣＰＵが実行することにより、図７に示すデイリーワード抽出処理ルーチンが実行される。 Then, when the input unit 10 receives the morphologically analyzed documents, they are stored in the morphologically analyzed document storage unit 30, and the CPU executes the program stored in the ROM of the daily word extraction device 300, whereby the daily word extraction processing routine shown in FIG. 7 is executed.

ステップＳ３００において、プライベート単語リスト記憶部６０に記憶されているプライベート単語リストを読み込む。 In step S300, the private word list stored in the private word list storage unit 60 is read.

ステップＳ３０２において、検索クエリリスト記憶部６２に記憶されている検索クエリリストを読み込む。 In step S302, the search query list stored in the search query list storage unit 62 is read.

ステップＳ３０４において、ステップＳ２０４で初期デイリーワードリスト記憶部３６に格納された初期デイリーワードリストに含まれるデイリーワード候補の文字列について、ステップＳ３０２において取得した検索クエリリストに基づいて、検索クエリリストに含まれる少なくとも１つの単語である文字列と、当該デイリーワード候補の文字列が含まれている形態素解析済み文書内で共起するか否かを判定し、共起する場合には、該デイリーワード候補の文字列の各々をデイリーワードとして抽出する。 In step S304, for each daily word candidate character string in the initial daily word list stored in the initial daily word list storage unit 36 in step S204, it is determined, based on the search query list acquired in step S302, whether the candidate co-occurs with at least one word in the search query list within a morphologically analyzed document that contains the candidate. Each candidate that does co-occur is extracted as a daily word.

検索クエリリストに含まれる少なくとも１つの単語である文字列と共起しない場合には、ステップＳ２００において得られた単語頻度リストに基づいて、当該デイリーワード候補の文字列の見出し又は本文における文字列の出現頻度が、閾値以上である否かを判定する。 If the candidate does not co-occur with any word in the search query list, it is determined, based on the word frequency list obtained in step S200, whether the candidate's appearance frequency in headlines or in body text is equal to or higher than the threshold.

上記の出現頻度が閾値以上と判定された場合には、ステップＳ３００において取得したプライベート単語リストに基づいて、当該デイリーワード候補の文字列がプライベート単語リストに含まれる何れの単語とも共起しないか否かを判定し、何れの単語とも共起しないと判定されたデイリーワード候補の文字列をデイリーワードとして抽出する。 If the appearance frequency is determined to be equal to or higher than the threshold, it is determined, based on the private word list acquired in step S300, whether the candidate co-occurs with any word in the private word list. A candidate determined to co-occur with none of them is extracted as a daily word.

ステップＳ２０７において、初期デイリーワードリスト記憶部３６に格納された初期デイリーワードリストに含まれるデイリーワード候補の文字列の全てについて、上記ステップＳ３０４のデイリーワードの抽出処理を行ったか否かを判定する。すべてのデイリーワード候補についてデイリーワードの抽出処理を行った場合には、ステップＳ１０４に移行し、デイリーワードの抽出処理を行っていないデイリーワード候補の文字列が存在する場合には、ステップＳ３０４に移行して当該文字列を判定対象の文字列として上記の処理を繰り返す。 In step S207, it is determined whether or not the daily word extraction processing in step S304 has been performed for all the character strings of the daily word candidates included in the initial daily word list stored in the initial daily word list storage unit 36. If the daily word extraction process has been performed for all the daily word candidates, the process proceeds to step S104. If there is a character string of a daily word candidate that has not been subjected to the daily word extraction process, the process proceeds to step S304. Then, the above process is repeated using the character string as a character string to be determined.

ステップＳ１０４において、ステップＳ３０４において抽出されたデイリーワードをデイリーワードリストとして出力部５０により出力して処理を終了する。 In step S104, the daily word extracted in step S304 is output as a daily word list by the output unit 50, and the process ends.

以上、説明したように、本発明の第３の実施の形態に係るデイリーワード抽出装置によれば、デイリーワード候補を抽出し、公的な文書に出現する文字列であるか否かと、検索クエリとして用いられる単語と共起するか否かと、私的な文書に用いられる単語と共起するか否かとに基づいてデイリーワードを抽出することにより、高精度にデイリーワードを抽出することが出来る。 As described above, according to the daily word extracting device according to the third embodiment of the present invention, the daily word candidate is extracted, whether or not the character string appears in the official document, and the search query. By extracting a daily word based on whether it co-occurs with a word used as a word and whether it co-occurs with a word used in a private document, a daily word can be extracted with high accuracy.

＜実験例＞ <Experimental example>
本発明の効果を検証するため、本発明の第２の実施の形態で説明した手法を用いてデイリーワードを抽出する実験を行った。実験におけるデイリーワードの抽出結果を図８に示す。本実験においては形態素解析済みデータとしてｂｌｏｇのデータ（３５、２９４、６８４記事）を使用し、公的な文書として新聞のデータ（６００００文）を用いた。図８では、取り消し線が引いてある文字列は頻度情報に基づいて過剰に除去されたものを示しており、下線が引いてある文字列は、本来は除去されるべき文字列であるが、頻度情報に基づいて除去されずにデイリーワードとして抽出されたものを示している。図８に示す実験結果から、デイリーワードが高精度に抽出されることが分かった。 To verify the effect of the present invention, an experiment was conducted in which daily words were extracted using the method described in the second embodiment. The extraction results are shown in FIG. 8. The experiment used blog data (35,294,684 articles) as the morphologically analyzed data and newspaper data (60,000 sentences) as the public documents. In FIG. 8, character strings with a strikethrough are those that were removed excessively on the basis of the frequency information, and underlined character strings are those that should have been removed but were instead extracted as daily words because the frequency information did not filter them out. The experimental results shown in FIG. 8 indicate that daily words are extracted with high accuracy.

なお、各要素の１カラム目は入力されたｂｌоｇ中の記事における当該デイリーワード候補の文字列としての正規表現のマッチ数であり、２カラム目は新聞本文の頻度情報、３カラム目は新聞の見出しにおける頻度情報を表す。 The first column of each element is the number of regular expression matches as the character string of the daily word candidate in the article in the inputted blog, the second column is the frequency information of the newspaper text, the third column is the newspaper Represents frequency information in the heading.

なお、本発明は、上記の実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the gist of the present invention.

例えば、上記の第２及び第３の実施の形態では、演算部１２０及び２２０において、新聞の文書の形態素解析済みデータを入力として単語頻度リストを作成したが、これに限定されるものではなく、事前に作成した単語頻度リストを入力し、それを用いても良い。 For example, in the above-described second and third embodiments, the calculation units 120 and 220 create the word frequency list by inputting the morphologically analyzed data of the newspaper document, but the present invention is not limited to this. A previously created word frequency list may be input and used.

また、上記の第３の実施の形態では、検索クエリリストの少なくとも１つの単語と共起する文字列をデイリーワードとし、プライベート単語リストの少なくとも１つの単語と共起する文字列をデイリーワードから除外する場合を説明したが、これに限定されるものではない。例えば、検索クエリリストの少なくとも１つの単語と共起する文字列をデイリーワードとして抽出すると共に、新聞データにおける頻度情報に基づいて出現頻度が閾値以上である文字列をデイリーワードとして抽出してもよい。また、検索クエリリストとの共起を考慮せずに、新聞データにおける頻度情報に基づいて出現頻度が閾値以上であり、かつ、プライベート単語リストの何れの単語とも共起しない文字列をデイリーワードとして抽出してもよい。 In the third embodiment, a character string that co-occurs with at least one word in the search query list is a daily word, and a character string that co-occurs with at least one word in the private word list is excluded from the daily word. However, the present invention is not limited to this. For example, a character string that co-occurs with at least one word in the search query list may be extracted as a daily word, and a character string having an appearance frequency equal to or higher than a threshold value may be extracted as a daily word based on frequency information in newspaper data. . Further, without considering co-occurrence with the search query list, a character string whose appearance frequency is equal to or higher than a threshold based on frequency information in newspaper data and does not co-occur with any word in the private word list is set as a daily word. It may be extracted.

また、上記の第３の実施の形態では、検索クエリリスト又はプライベート単語リストの少なくとも１つの単語と、デイリーワード候補の文字列が、当該文字列の含まれている文書内において共起したか否かを判定しているが、これに限定されるものではなく、当該文字列の含まれている文内において共起したか否かを判断してもよい。 In the third embodiment, whether or not at least one word in the search query list or the private word list and the character string of the daily word candidate co-occur in the document including the character string. However, the present invention is not limited to this, and it may be determined whether or not co-occurrence occurs in a sentence including the character string.
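The difference between document-level and sentence-level co-occurrence can be illustrated with the sketch below. It assumes a document is a list of sentences and a sentence is a list of words; the function name `cooccurs` and the `scope` parameter are hypothetical names introduced for this example.

```python
from typing import Iterable, List, Set

# Assumed representation: a document is a list of sentences,
# and a sentence is a list of words.
Document = List[List[str]]


def cooccurs(
    candidate: str,
    words: Set[str],
    documents: Iterable[Document],
    scope: str = "document",
) -> bool:
    """True if candidate appears together with any of `words` in one unit.

    scope="document" treats a whole document as the unit;
    scope="sentence" treats each sentence as its own unit.
    """
    for doc in documents:
        if scope == "sentence":
            units = [set(sentence) for sentence in doc]
        else:
            units = [{word for sentence in doc for word in sentence}]
        for unit in units:
            if candidate in unit and (unit - {candidate}) & words:
                return True
    return False
```

Sentence-level scope is the stricter of the two, so a private word that appears elsewhere in the same document would no longer cause a candidate to be rejected.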

また、本発明の実施の形態では、正規表現として、「[ＤＡＴ]の[Ｘ]は[Ｙ]」及び「[ＤＡＴ]に[Ｘ]（する｜した）[Ｙ]」の文字列のパターンを用いてデイリーワードを抽出するようにしたが、これに限定されるものではなく、日時を表さない文字列と、文字列に関する日時表現と、文字列に関する固有表現又は数とを同時に含む他の文字列のパターンを示す正規表現を用いてもよい。 In the embodiments of the present invention, daily words are extracted using regular expressions for the character string patterns 「[ＤＡＴ]の[Ｘ]は[Ｙ]」 and 「[ＤＡＴ]に[Ｘ]（する｜した）[Ｙ]」, but the invention is not limited to these. A regular expression for any other character string pattern may be used, provided the pattern simultaneously contains a character string that does not represent a date and time, a date and time expression related to that character string, and a named entity or number related to that character string.

また、上述のデイリーワード抽出装置１００、２００、及び３００は内部にコンピュータシステムを有しているが、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）を含むものとする。 The above-described daily word extraction devices 100, 200, and 300 each have a computer system. However, if the “computer system” uses a WWW system, a homepage providing environment (or display) Environment).

また、本願明細書中において、プログラムが予めインストールされている実施形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能であるし、ネットワークを介して提供することも可能である。また、本実施の形態のデイリーワード抽出装置１００、２００、及び３００の各部をハードウエアにより構成してもよい。また、形態素解析済み新聞データ、頻度リスト、形態素解析済み文書、プライベート単語リスト、検索クエリリストが記憶されるデータベースとしては、ハードディスク装置やファイルサーバ等に例示される記憶手段によって実現可能であり、デイリーワード抽出装置１００、２００、及び３００内部にデータベースを設けても良いし、外部装置に設けてもよい。 Although this specification describes embodiments in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium, or provided via a network. Each part of the daily word extraction devices 100, 200, and 300 of the present embodiments may also be implemented in hardware. The database that stores the morphologically analyzed newspaper data, the frequency lists, the morphologically analyzed documents, the private word list, and the search query list can be realized by storage means such as a hard disk device or a file server, and may be provided inside the daily word extraction devices 100, 200, and 300 or in an external device.

DESCRIPTION OF SYMBOLS
１０入力部 10 Input unit
２０，１２０，２２０演算部 20, 120, 220 Calculation unit
３０形態素解析済み文書記憶部 30 Morphologically analyzed document storage unit
３１固有表現抽出部 31 Named entity extraction unit
３２，４４デイリーワード抽出部 32, 44 Daily word extraction unit
３４デイリーワード候補抽出部 34 Daily word candidate extraction unit
３６初期デイリーワードリスト記憶部 36 Initial daily word list storage unit
３８形態素解析済み新聞データ記憶部 38 Morphologically analyzed newspaper data storage unit
４０単語頻度計測部 40 Word frequency measurement unit
４２頻度リスト記憶部 42 Frequency list storage unit
５０出力部 50 Output unit
６０プライベート単語リスト記憶部 60 Private word list storage unit
６２検索クエリリスト記憶部 62 Search query list storage unit
１００，２００，３００デイリーワード抽出装置 100, 200, 300 Daily word extraction device

Claims

1. A daily word extraction device comprising:
storage means for storing a document composed of at least one morphologically analyzed sentence; and
extraction means for extracting, from the document stored in the storage means, as a daily word, which is a character string representing an event, a character string that is the character string not representing a date and time within a portion matching a regular expression and that has an appearance frequency equal to or higher than a predetermined threshold, the regular expression representing a character string pattern that contains a character string not representing a date and time, a date and time expression related to that character string, and a named entity or number related to that character string.

2. The daily word extraction device according to claim 1, wherein the extraction means, based on appearance frequency information of each character string obtained in advance for public documents, extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, and whose appearance frequency in the public documents is equal to or higher than a predetermined threshold.

3. The daily word extraction device according to claim 2, wherein the extraction means, based on appearance frequency information of each character string obtained in advance for newspaper headline portions serving as the public documents and appearance frequency information of each character string obtained in advance for newspaper body-text portions, extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, and whose appearance frequency in the newspaper headline portions is equal to or higher than a predetermined threshold or whose appearance frequency in the newspaper body-text portions is equal to or higher than a predetermined threshold.

4. The daily word extraction device according to claim 3, wherein the extraction means extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, and that co-occurs with at least one of a plurality of predetermined character strings representing search queries used with a search engine, and also extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, whose appearance frequency in the newspaper headline portions is equal to or higher than a predetermined threshold or whose appearance frequency in the newspaper body-text portions is equal to or higher than a predetermined threshold, and that does not co-occur with any of a plurality of predetermined character strings used in private documents.

5. A daily word extraction method in a daily word extraction device comprising storage means for storing a document composed of at least one morphologically analyzed sentence, and extraction means, wherein
the extraction means extracts, from the document stored in the storage means, as a daily word, which is a character string representing an event, a character string that is the character string not representing a date and time within a portion matching a regular expression and that has an appearance frequency equal to or higher than a predetermined threshold, the regular expression representing a character string pattern that contains a character string not representing a date and time, a date and time expression related to that character string, and a named entity or number related to that character string.

6. The daily word extraction method according to claim 5, wherein the extraction means, based on appearance frequency information of each character string obtained in advance for public documents, extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, and whose appearance frequency in the public documents is equal to or higher than a predetermined threshold.

7. The daily word extraction method according to claim 6, wherein the extraction means, based on appearance frequency information of each character string obtained in advance for newspaper headline portions serving as the public documents and appearance frequency information of each character string obtained in advance for newspaper body-text portions, extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, and whose appearance frequency in the newspaper headline portions is equal to or higher than a predetermined threshold or whose appearance frequency in the newspaper body-text portions is equal to or higher than a predetermined threshold.

8. The daily word extraction method according to claim 7, wherein the extraction means extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, and that co-occurs with at least one of a plurality of predetermined character strings representing search queries used with a search engine, and also extracts as the daily word a character string that is the character string not representing a date and time within a portion matching the regular expression, that has an appearance frequency equal to or higher than the threshold, whose appearance frequency in the newspaper headline portions is equal to or higher than a predetermined threshold or whose appearance frequency in the newspaper body-text portions is equal to or higher than a predetermined threshold, and that does not co-occur with any of a plurality of predetermined character strings used in private documents.

9. A program for causing a computer to function as each means constituting the daily word extraction device according to any one of claims 1 to 4.