JP5964149B2 - Apparatus and program for identifying co-occurrence words

Info

Publication number: JP5964149B2
Application number: JP2012138820A
Authority: JP
Inventors: 加藤 剛志; 落合 桂一
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2012-06-20
Filing date: 2012-06-20
Publication date: 2016-08-03
Anticipated expiration: 2032-06-20
Also published as: JP2014002653A

Description

The present invention relates to a technique for identifying co-occurrence words, that is, words that appear in the same text as a keyword representing information of interest to a user.

Information retrieval over the Internet is widespread. A user searching for information enters a word of interest as a search keyword, for example on the search screen of a Web browser running on a terminal device, and sends it to a server device that provides a search engine function. From the many Web pages published on the Internet, the server device extracts, according to predetermined rules, the Web pages that contain the search keyword sent from the terminal device, and sends a Web page showing the extraction result to the terminal device. By opening the Web pages linked from the Web page sent by the server device, the user can obtain information related to the search keyword.

However, the extraction result sent from the server device in response to a search keyword often links to Web pages that do not provide the information the user is looking for. The user can improve the quality of the extraction result by, for example, combining many search keywords, but doing so is laborious.

As a technique for reducing this effort, Patent Document 1, for example, proposes identifying in advance, as co-occurrence words, words that are used with high frequency in the same text as a search keyword specified by the user, and automatically adding those co-occurrence words to the user-specified search keyword when extracting text, thereby improving the quality of the extraction result.

Patent Document 1: JP 2003-22275 A

One reason that unwanted information, so-called noise, gets mixed into extraction results based on search keywords is that topic trends at the time of the search are not necessarily taken into account. For example, suppose a user who wants to learn about the features of the latest liquid crystal displays enters the search keyword "liquid crystal display" and extracts Web pages. If Web page A, which explains the basic principles of liquid crystal displays in detail, has been viewed by many people over many years, Web page A will be included in the extraction result. Because this user is not interested in the basic principles of liquid crystal displays, Web page A in the extraction result is noise.

When the user wants to search for information that reflects the topic trends of the present moment, as in the case above, the user can enter keywords that have recently become topical, such as "3D", "white LED", or "IPS panel", as search keywords together with "liquid crystal display". In the present application, a "topic trend" means a topic in which many people show interest. Because a topic arises from a combination of words rather than from a single word, combining multiple search keywords in this way yields an extraction result that takes topic trends into account.

However, users cannot always easily learn the current topic trends, so it is often difficult for them to identify the search keywords that should be added. Moreover, as in this example, the user may want to learn the topic trends themselves. In that case, the user has to start by entering a general search keyword such as "liquid crystal display" and then pick out the needed information from the resulting noisy extraction result.

In view of the above circumstances, an object of the present invention is to provide a technique that allows a user to easily learn the topic trends, within a specific period, related to a keyword of interest to the user.

To solve the above problem, the present invention proposes an apparatus comprising: text data acquisition means for acquiring text data representing text together with time data representing a time; morphological analysis means for dividing the text represented by the text data acquired by the text data acquisition means into morphemes by morphological analysis and generating a plurality of morpheme data items each representing one of the resulting morphemes; morpheme data storage means for storing, for each of one or more of the morpheme data items generated by the morphological analysis means by dividing each of a plurality of text data items, the morpheme data item, text data identification data identifying the text data used to generate the morpheme data item, and the time data acquired by the text data acquisition means together with the text data used to generate the morpheme data item; and co-occurrence word data extraction means for extracting, for one keyword data item representing one keyword, as co-occurrence word data corresponding to the one keyword data item, morpheme data that is stored in the morpheme data storage means together with time data representing a time within a predetermined period and together with the same text data identification data as text data identification data stored in the morpheme data storage means together with morpheme data representing the one keyword.

With such an apparatus, the co-occurrence words used together with a specific keyword in a large number of text data items published within a specific period are extracted, so information indicating the topic trends related to that keyword in that period is obtained.
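The extraction described above can be sketched in a few lines. This is only an illustrative sketch, not the patented implementation: the names `MorphemeRecord` and `extract_cooccurrence` are hypothetical, the store is an in-memory list, and two details are assumptions (the keyword itself is excluded from the result, and every stored occurrence is counted).

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple


class MorphemeRecord(NamedTuple):
    text_id: str      # text data identification data
    time: datetime    # time data acquired with the source text
    morpheme: str     # one morpheme generated from that text


def extract_cooccurrence(records, keyword, start, end):
    """Count morphemes that share a text ID with `keyword` inside [start, end]."""
    in_period = [r for r in records if start <= r.time <= end]
    # Text IDs that are stored together with a morpheme equal to the keyword.
    keyword_texts = {r.text_id for r in in_period if r.morpheme == keyword}
    # Every other morpheme stored with one of those text IDs is a co-occurrence word.
    return Counter(
        r.morpheme
        for r in in_period
        if r.text_id in keyword_texts and r.morpheme != keyword
    )


records = [
    MorphemeRecord("t1", datetime(2012, 6, 1), "display"),
    MorphemeRecord("t1", datetime(2012, 6, 1), "3D"),
    MorphemeRecord("t2", datetime(2012, 6, 5), "display"),
    MorphemeRecord("t2", datetime(2012, 6, 5), "3D"),
    MorphemeRecord("t2", datetime(2012, 6, 5), "LED"),
    MorphemeRecord("t3", datetime(2011, 1, 1), "display"),    # outside the period
    MorphemeRecord("t3", datetime(2011, 1, 1), "principle"),
    MorphemeRecord("t4", datetime(2012, 6, 7), "camera"),     # keyword absent
]
cooc = extract_cooccurrence(
    records, "display", datetime(2012, 5, 20), datetime(2012, 6, 20)
)
```

In this toy data, the old text `t3` and the unrelated text `t4` contribute nothing, which mirrors how restricting the period keeps long-standing but no longer topical pages out of the co-occurrence counts.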

The above apparatus may further comprise relevance data generation means that, for one text data identification data item stored together with morpheme data representing the one keyword, and for each morpheme data item extracted by the co-occurrence word data extraction means as co-occurrence word data corresponding to the one keyword data item, extracts the identical morpheme data stored in the morpheme data storage means together with the one text data identification data item, as co-occurrence word data corresponding to the one keyword data item contained in the one text data item, and that generates relevance data indicating the degree of relevance between the one text data item and the one keyword data item, based on the count of each co-occurrence word data item extracted by the co-occurrence word data extraction means for the one keyword data item and on the count of each co-occurrence word data item contained in the one text data item for the one keyword data item.

With such an apparatus, the degree of relevance between specific text data and a specific keyword within a specific period is determined from how much the co-occurrence words used with that keyword across the many text data items published in the period are also used with that keyword in the specific text data. Therefore, even without knowing the topic trends of that period, the user can obtain text data that is likely to be related to the keyword with those topic trends taken into account.
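One way to picture a relevance score built from these two sets of counts is shown below. The formula is an assumption chosen for illustration (the share of the period-wide co-occurrence count that a single text reproduces); the text here does not fix a formula, and the function name `relevance` is hypothetical.

```python
from collections import Counter


def relevance(global_counts, text_morphemes, keyword):
    """Share of the period-wide co-occurrence count that one text reproduces."""
    if keyword not in text_morphemes or not global_counts:
        return 0.0
    # Counts of the period-wide co-occurrence words inside this one text.
    local_counts = Counter(m for m in text_morphemes if m in global_counts)
    # Cap each word at its period-wide count so the score stays within [0, 1].
    matched = sum(min(n, global_counts[w]) for w, n in local_counts.items())
    return matched / sum(global_counts.values())


global_counts = Counter({"3D": 2, "LED": 1})
score_a = relevance(global_counts, ["display", "3D", "LED"], "display")
score_b = relevance(global_counts, ["display", "principle"], "display")
```

Under this assumed formula, a text that uses the trending co-occurrence words scores higher than one that mentions the keyword alone.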

In the above apparatus, the relevance data generation means may weight the count for each co-occurrence word data item when generating the relevance data, using weights determined according to a predetermined rule based on the counts of the co-occurrence word data extracted by the co-occurrence word data extraction means.

With such an apparatus, co-occurrence words with many occurrences, which generally characterize a topic trend more strongly, are given greater consideration in determining relevance. Consequently, even when there are many co-occurrence words with few occurrences that do little to characterize the topic trend, relevance data that sufficiently reflects the characteristics of the topic trend is obtained.
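The effect of weighting can be illustrated as follows. The "predetermined rule" is not specified at this point in the text, so the rule used here (each word's count divided by the largest count) is purely a hypothetical choice, as are the names `weights_from_counts` and `weighted_relevance` and the sample counts.

```python
from collections import Counter


def weights_from_counts(global_counts):
    """Hypothetical rule: weight each word by its share of the largest count."""
    if not global_counts:
        return {}
    top = max(global_counts.values())
    return {word: count / top for word, count in global_counts.items()}


def weighted_relevance(global_counts, text_morphemes):
    """Sum the per-text counts of co-occurrence words, scaled by their weights."""
    weights = weights_from_counts(global_counts)
    local_counts = Counter(m for m in text_morphemes if m in weights)
    return sum(weights[w] * n for w, n in local_counts.items())


# Two frequent co-occurrence words and three rare ones (made-up counts).
global_counts = Counter({"3D": 40, "LED": 20, "stand": 1, "cable": 1, "box": 1})
trendy = ["display", "3D", "LED"]
incidental = ["display", "stand", "cable", "box"]
```

With these sample counts, the text containing two frequent co-occurrence words outweighs the text containing three rare ones, which is the behavior the paragraph above describes.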

The above apparatus may further comprise: keyword data reception means for receiving, from a terminal device, the one keyword data item or text data containing the one keyword data item; text data storage means for storing the text data acquired by the text data acquisition means; and relevance data transmission means for transmitting to the terminal device the relevance data, calculated by the relevance data generation means, between each of a plurality of text data items stored in the text data storage means and the one keyword data item received from the terminal device by the keyword data reception means, in association with each of the plurality of text data items.

With such an apparatus, the processing for determining relevance is performed in a device other than the user's terminal device, so each user's terminal device does not need to perform morphological analysis of a large number of text data items, identify co-occurrence words, or store the results of that processing.

In the above apparatus, the relevance data transmission means may sort the plurality of text data items in an order according to the degree of relevance indicated by the relevance data and transmit them to the terminal device together with the relevance data.

With such an apparatus, by viewing the text data sorted by relevance, the user of the terminal device can easily obtain the desired information with the topic trends of the specific period taken into account, and can also learn from the ordering what the topic trends are, that is, which topics are attracting the most public interest.

The present invention also proposes a program for causing a computer to function as: text data acquisition means for acquiring text data representing text together with time data representing a time; morphological analysis means for dividing the text represented by the text data acquired by the text data acquisition means into morphemes by morphological analysis and generating a plurality of morpheme data items each representing one of the resulting morphemes; morpheme data storage means for storing, for each of one or more of the morpheme data items generated by the morphological analysis means by dividing each of a plurality of text data items, the morpheme data item, text data identification data identifying the text data used to generate the morpheme data item, and the time data acquired by the text data acquisition means together with the text data used to generate the morpheme data item; and co-occurrence word data extraction means for extracting, for one keyword data item representing one keyword, as co-occurrence word data corresponding to the one keyword data item, morpheme data that is stored in the morpheme data storage means together with time data representing a time within a predetermined period and together with the same text data identification data as text data identification data stored in the morpheme data storage means together with morpheme data representing the keyword.

With such a program, the above apparatus is realized using a general-purpose computer.

According to the present invention, a user can easily learn the topic trends, within a specific period, related to a keyword of interest.

FIG. 1 is a diagram showing the configuration of a text search system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional configuration of a server device according to an embodiment of the present invention.
FIG. 3 is a diagram schematically showing the data structure of a text DB according to an embodiment of the present invention.
FIG. 4 is a diagram schematically showing the data structure of a morpheme DB according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example result of the morphological analysis performed by a morphological analysis unit according to an embodiment of the present invention.
FIG. 6 is a diagram schematically showing the data structure of a co-occurrence word DB according to an embodiment of the present invention.
FIG. 7 is a diagram showing the processing that a server device according to an embodiment of the present invention performs to register morpheme data.
FIG. 8 is a diagram showing the processing that a server device according to an embodiment of the present invention performs to generate co-occurrence word data and co-occurrence coefficient data.
FIG. 9 is a diagram showing the processing that a server device according to an embodiment of the present invention performs to generate relevance data and to transmit the relevance data and text data.
FIG. 10 is a diagram schematically showing the display screen of extraction results displayed on a terminal device according to an embodiment of the present invention.

[Embodiment]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows the configuration of a text search system 1 according to the present embodiment. The text search system 1 comprises a terminal device 11 that a user uses to search for text, and a server device 12 that receives from the terminal device 11 a search keyword entered by the user and transmits to the terminal device 11 a plurality of text data items corresponding to the received search keyword, together with relevance data indicating their relevance to the search keyword. The terminal device 11 and the server device 12 exchange various data with each other via a network 9.

FIG. 1 shows only one terminal device 11 for simplicity, but the number of terminal devices 11 varies with the number of users of the text search system 1. The number of server devices 12 may likewise vary with the scale of the text data search service provided by the text search system 1.

The terminal device 11 may be any device that can transmit data entered by the user to the server device 12, receive data transmitted from the server device 12, and display information to the user according to the received data. It may therefore take the form of, for example, a mobile phone, a smartphone, a notebook PC (Personal Computer), a touchpad PC, a desktop PC, a PDA (Personal Digital Assistant), a game terminal with a communication function, or a television with a communication function. The configuration and operation of the terminal device 11 are the same as those of a general terminal device capable of data communication with a server device, so their description is omitted.

The hardware configuration of the server device 12 is the same as that of a general computer with a communication function, so its description is omitted. By performing processing according to the application program of the present invention, the server device 12 operates as a device having the functional configuration shown in FIG. 2.

The server device 12 includes the following functional components.
(Timer unit 121) Continuously measures the time elapsed since a reference time and generates time data indicating the current time.
(Morphological analysis unit 122) Divides the text represented by text data that the server device 12 has acquired from other server devices via the Internet into morphemes using a morphological analysis method, and generates morpheme data representing those morphemes.
(Text data acquisition unit 123) Periodically acquires newly published text data from other server devices via the Internet.
(Search keyword data reception unit 124) Receives search keyword data representing a search keyword from the terminal device 11.

(Co-occurrence word data extraction unit 125) Based mainly on the morpheme data generated by the morphological analysis unit 122, extracts co-occurrence words, that is, words used within a predetermined period in the same text as the search keyword represented by the search keyword data received by the search keyword data reception unit 124, and generates co-occurrence word data representing the extracted co-occurrence words.
(Relevance data generation unit 126) Based on the morpheme data generated by the morphological analysis unit 122 and the co-occurrence word data generated by the co-occurrence word data extraction unit 125, generates relevance data indicating the degree of relevance between the search keyword data received by the search keyword data reception unit 124 and the text data acquired by the text data acquisition unit 123.
(Relevance data transmission unit 127) Sorts the text data acquired by the text data acquisition unit 123 according to the relevance data generated by the relevance data generation unit 126, and transmits it to the terminal device 11 together with the relevance data.
(Storage unit 128) Stores various data.

The storage unit 128 stores the following data.
(Dictionary DB) A DB (database) that stores the dictionary data used by the morphological analysis unit 122 when performing morphological analysis.
(Grammar DB) A DB that stores the grammar data used by the morphological analysis unit 122 when performing morphological analysis.
(Text DB) A DB that stores the text data acquired by the text data acquisition unit 123.
(Morpheme DB) A DB that stores the morpheme data generated by the morphological analysis unit 122.
(Co-occurrence word DB) A DB that stores the co-occurrence word data generated by the co-occurrence word data extraction unit 125.

The dictionary DB and the grammar DB are well-known databases used in general morphological analysis, so a description of their data structures is omitted.

FIG. 3 schematically shows an example of the data structure of the text DB. The text DB is a collection of data records, one for each text data item acquired by the text data acquisition unit 123. Each data record has a data field "text ID" that stores a text ID (identifier) identifying the text data, a data field "time" that stores time data indicating when the text data was acquired, and a data field "text" that stores the text data.

FIG. 4 schematically shows an example of the data structure of the morpheme DB. The morpheme DB is a collection of data records, one for each morpheme data item generated by the morphological analysis unit 122. Each data record has a data field "text ID" that stores the text ID of the text data used to generate the morpheme data, a data field "time" that stores time data indicating when that text data was acquired, and a data field "morpheme" that stores the morpheme data.
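The two record layouts can be pictured as simple typed records. The following is only a sketch: the class names, field names, and the helper `to_morpheme_records` are hypothetical, and a real deployment would presumably use database tables rather than in-memory objects.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TextRecord:
    """One row of the text DB: fields "text ID", "time", "text"."""
    text_id: str
    time: datetime
    text: str


@dataclass(frozen=True)
class MorphemeRecord:
    """One row of the morpheme DB: fields "text ID", "time", "morpheme"."""
    text_id: str
    time: datetime
    morpheme: str


def to_morpheme_records(text_record, base_forms):
    """Each morpheme row copies the text ID and time of its source text."""
    return [
        MorphemeRecord(text_record.text_id, text_record.time, base_form)
        for base_form in base_forms
    ]


source = TextRecord("T0001", datetime(2012, 6, 20, 3, 0), "sample text")
rows = to_morpheme_records(source, ["sample", "text"])
```

Copying the text ID and time into every morpheme row is what later allows co-occurrence to be found by matching text IDs and filtering on time without consulting the text DB.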

The morphological analysis performed by the morphological analysis unit 122 is a known technique, so a detailed description is omitted; FIG. 5 shows an example result. FIG. 5 shows the result of the morphological analysis unit 122 analyzing the text data 「○○ツリーはやはり高い。○○ツリータウンも面白そうだな。」 ("The ○○ Tree really is tall. ○○ Tree Town looks interesting, too."). The data field "surface form" in FIG. 5 shows the morphemes into which the input text was divided. The data field "part of speech" shows the part of speech of each morpheme. The data field "base form" shows the base form of each morpheme. For example, the morpheme with the surface form 「面白」 has the base form 「面白い」 ("interesting"). In the processing of the server device 12 described below, words are compared by base form, so inflected forms of the same word are not treated as different words. Accordingly, the data stored in the "morpheme" data field of the morpheme DB represents the base form of each morpheme.

Note that the morpheme data that the morphological analysis unit 122 registers in the morpheme DB is not all of the morphemes (base forms) divided from the text as shown in FIG. 5, but is limited to those whose part of speech carries meaning on its own. Specifically, nouns, verbs, adjectives, adjectival nouns (形容動詞), and the like are registered in the morpheme DB, whereas particles, adverbs, auxiliary verbs, conjunctions, symbols, and the like are not.
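The part-of-speech filter can be sketched as below. The tokens are hand-written English stand-ins, not the contents of FIG. 5, and the tag names in `CONTENT_POS` are assumptions; an actual analyzer would emit its own tag set.

```python
# Parts of speech treated as meaningful on their own (tag names are assumed).
CONTENT_POS = {"noun", "verb", "adjective", "adjectival noun"}


def content_base_forms(tokens):
    """Keep the base form of each token whose part of speech is a content class.

    Each token is a (surface form, part of speech, base form) triple.
    """
    return [base for _surface, pos, base in tokens if pos in CONTENT_POS]


tokens = [
    ("towers", "noun", "tower"),
    ("looked", "verb", "look"),
    ("really", "adverb", "really"),
    ("tall", "adjective", "tall"),
    (".", "symbol", "."),
]
kept = content_base_forms(tokens)
```

Because only base forms are returned, inflected variants such as "looked" and "look" end up as the same entry in the morpheme DB.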

FIG. 6 schematically shows an example of the data structure of the co-occurrence word DB. The co-occurrence word DB is a collection of data records, one for each co-occurrence word data item generated by the co-occurrence word data extraction unit 125. Each data record has a data field "search ID" that stores a search ID identifying a search request sent from the terminal device 11 to the server device 12, a data field "search keyword" that stores the search keyword data sent from the terminal device 11 in the search request, a data field "time window" that stores time window data indicating the time window used to narrow the morphemes subject to extraction by the acquisition time of their source text data, a data field "co-occurrence word" that stores co-occurrence word data representing a morpheme extracted as a co-occurrence word, and a data field "co-occurrence coefficient" that stores co-occurrence coefficient data indicating the co-occurrence coefficient (described later) between the search keyword and the co-occurrence word.

As shown in FIG. 6, the co-occurrence word data and co-occurrence coefficient data generated by the co-occurrence word data extraction unit 125 represent the co-occurrence words, and their co-occurrence coefficients, extracted from morphemes generated from text acquired within a specific time window. In the present embodiment, the time window is the one-month period immediately preceding the time at which the server device 12 received the search request from the terminal device 11, and it is set automatically by the co-occurrence word data extraction unit 125.
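Computing such a window takes a little care because months differ in length. The sketch below assumes "one month" means the same day of the previous calendar month, clamped to that month's last day; the text does not say whether a calendar month or a fixed number of days is intended, and the function names are hypothetical.

```python
import calendar
from datetime import datetime


def one_month_before(moment):
    """Same day and time in the previous calendar month, clamped to its last day."""
    if moment.month > 1:
        year, month = moment.year, moment.month - 1
    else:
        year, month = moment.year - 1, 12
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def search_window(request_time):
    """Return (start, end) of the window ending when the request was received."""
    return one_month_before(request_time), request_time


start, end = search_window(datetime(2012, 6, 20, 9, 30))
```

A fixed 30-day `timedelta` would be a simpler alternative if the exact calendar boundary does not matter.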

Next, the operation of the text search system 1 is described. First, the server device 12 periodically (for example, once a day at a predetermined time) crawls various server devices for text data representing documents newly published on the Internet since the previous day, performs morphological analysis on that text data, and registers the resulting morpheme data. FIG. 7 shows the processing that the server device 12 performs to register the morpheme data.

First, the text data acquisition unit 123 sends each external server device a request for text data that includes time data indicating the date and time of its most recent such request, and receives the newly updated text data that each server device sends in response (S101). The text data acquisition unit 123 assigns a text ID to each acquired text data item and stores it in the text DB (FIG. 3) together with time data obtained from the timer unit 121 (indicating the time of acquisition) (S102).

The morphological analysis unit 122 performs morphological analysis on the text data stored in the text DB by the text data acquisition unit 123 (S103), extracts from the generated morpheme data those morphemes whose part of speech carries meaning on its own (S104), and stores the extracted morpheme data in the morpheme DB (FIG. 4) together with the text ID and time data of the text data from which it was generated (S105). This completes the processing for registering morpheme data.
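Steps S102 to S105 can be sketched roughly as follows. This is only an illustration: the function `register_text`, the dictionary and list standing in for the text DB and morpheme DB, and the `toy_tokenize` stand-in for a real morphological analyzer are all assumptions, not part of the patent.

```python
from datetime import datetime

def register_text(text, acquired_at, tokenize, content_pos, text_db, morpheme_db):
    """Store one text and its content-word morphemes (steps S102 to S105)."""
    # S102: assign a text ID and store the text with its acquisition time.
    text_id = len(text_db) + 1
    text_db[text_id] = (text, acquired_at)
    # S103: morphological analysis; `tokenize` yields (surface, part_of_speech).
    for surface, pos in tokenize(text):
        # S104: keep only parts of speech that carry meaning on their own.
        if pos in content_pos:
            # S105: one morpheme DB row per occurrence.
            morpheme_db.append((surface, text_id, acquired_at))
    return text_id

def toy_tokenize(text):
    """Hypothetical stand-in for a morphological analyzer (whitespace split)."""
    return [(w, "particle" if w == "in" else "noun") for w in text.split()]

text_db, morpheme_db = {}, []
acquired = datetime(2012, 6, 20, 9, 0)
register_text("ramen in tokyo", acquired, toy_tokenize, {"noun"}, text_db, morpheme_db)
```

Storing one row per occurrence, rather than one row per distinct morpheme, keeps the per-text occurrence counts available for the later relevance calculation.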

A user enters a search keyword of interest into the terminal device 11 and sends it to the server device 12. In response, the server device 12 sends the terminal device 11 text data related to that search keyword, which the user can browse together with relevance data indicating its relevance to the search keyword. The relevance data used to select the text data sent from the server device 12 to the terminal device 11 is generated from co-occurrence word data and co-occurrence coefficient data, which are in turn generated from, for example, morpheme data derived from text data acquired during the past month. The text data provided to the user is therefore extracted with the topic trends of the past month taken into account.

FIG. 8 shows the processing performed in the server device 12 to generate co-occurrence word data and co-occurrence coefficient data when a search keyword entered into the terminal device 11 by the user is sent to the server device 12.

First, the search keyword data sent from the terminal device 11 is received by the search keyword data receiving unit 124 (S201). Triggered by this reception, the co-occurrence word data extraction unit 125 extracts from the morpheme DB (FIG. 4) the group of data records whose "time" field holds time data indicating a time within the past month (S202).

Next, from the group of data records extracted in step S202, the co-occurrence word data extraction unit 125 extracts the data records whose "morpheme" field holds the same data as the search keyword data received in step S201 (S203).

Next, the co-occurrence word data extraction unit 125 groups the data records extracted in step S203 by the "text ID" field and identifies the number of groups, that is, the number of pieces of text data containing the search keyword data, as the document frequency df(w_i) of the search keyword (S204).

Next, the co-occurrence word data extraction unit 125 groups the data records extracted in step S202 by the "morpheme" field (S205). In the following, it is assumed that step S205 produces groups of data records for the first to n-th morphemes (where n is an arbitrary natural number).

Next, from the n groups of data records generated in step S205, the co-occurrence word data extraction unit 125 selects the group for the m-th morpheme (where m is an arbitrary natural number from 1 to n) (S206).

Next, the co-occurrence word data extraction unit 125 groups the data records of the m-th morpheme selected in step S206 by the "text ID" field and identifies the number of groups, that is, the number of pieces of text data containing the m-th morpheme data, as the document frequency df(w_jm) of the m-th morpheme (S207).

Next, the co-occurrence word data extraction unit 125 identifies, as the co-occurrence document count df(w_i, w_jm), the number of text IDs that are stored in the "text ID" field of some data record extracted in step S203 and also in the "text ID" field of some data record in the m-th morpheme's group selected in step S206 (S208). The co-occurrence document count df(w_i, w_jm) identified in this way is the number of pieces of text data that contain both the search keyword data and the morpheme data indicating the m-th morpheme.

Next, using the document frequency df(w_i) identified in step S204, the document frequency df(w_jm) identified in step S207, and the co-occurrence document count df(w_i, w_jm) identified in step S208, the co-occurrence word data extraction unit 125 calculates the co-occurrence coefficient Dice(w_i, w_jm) according to the following formula (S209).

$$\mathrm{Dice}(w_i, w_{jm}) = \frac{2\,\mathrm{df}(w_i, w_{jm})}{\mathrm{df}(w_i) + \mathrm{df}(w_{jm})}$$

A co-occurrence coefficient is an index of how frequently two words (here, the search keyword and the m-th morpheme) appear in the same document. The formula above is the formula for the co-occurrence coefficient known as the Dice coefficient.

Next, the co-occurrence word data extraction unit 125 stores co-occurrence coefficient data indicating the co-occurrence coefficient Dice(w_i, w_jm) calculated in step S209 in the co-occurrence word DB (FIG. 6), together with the search ID that the search keyword data receiving unit 124 assigned to the search request when the search keyword data was received in step S201, the search keyword data received in step S201, time period data indicating the one month preceding the reception of the search keyword data, and co-occurrence word data indicating the m-th morpheme (S210).

Next, the co-occurrence word data extraction unit 125 determines whether the registration processing of step S210 has been completed for all of the n morpheme groups formed in step S205 (S211). If it determines in step S211 that some morpheme group has not yet been registered (S211; No), the co-occurrence word data extraction unit 125 returns to step S206, selects the group of data records for the (m+1)-th morpheme, and performs step S207 and the subsequent steps. If it determines in step S211 that registration has been completed for all morpheme groups (S211; Yes), the co-occurrence word data extraction unit 125 ends the series of processing.

This completes the processing for generating the co-occurrence word data and co-occurrence coefficient data for the search keyword entered by the user.
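As a rough illustration of steps S202 to S209, the following sketch computes a Dice coefficient for every morpheme that shares at least one document with the keyword. It makes several assumptions that are not in the patent: morpheme DB rows are modelled as `(morpheme, text_id, time)` tuples, "one month" is approximated as 30 days, the explicit loop of steps S206 to S211 is replaced by iterating over a dictionary, and the keyword itself and morphemes with no shared document are skipped rather than stored with a coefficient.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def cooccurrence_coefficients(records, keyword, now, window=timedelta(days=30)):
    """Return {morpheme: Dice coefficient} for morphemes co-occurring with keyword."""
    # S202: keep only records whose time falls within the window before `now`.
    recent = [r for r in records if now - window <= r[2] <= now]
    # S205, S207: map each morpheme to the set of text IDs that contain it.
    docs_by_morpheme = defaultdict(set)
    for morpheme, text_id, _ in recent:
        docs_by_morpheme[morpheme].add(text_id)
    # S203, S204: documents containing the keyword; df(w_i) is the set size.
    keyword_docs = docs_by_morpheme.get(keyword, set())
    result = {}
    for morpheme, docs in docs_by_morpheme.items():
        if morpheme == keyword:
            continue  # assumption: the keyword is not its own co-occurrence word
        co_df = len(keyword_docs & docs)  # S208: df(w_i, w_jm)
        if co_df == 0:
            continue  # assumption: non-co-occurring morphemes are not stored
        # S209: Dice coefficient.
        result[morpheme] = 2 * co_df / (len(keyword_docs) + len(docs))
    return result

now = datetime(2012, 6, 20)
day = timedelta(days=1)
sample_records = [
    ("ramen", "T1", now - day), ("tokyo", "T1", now - day),
    ("ramen", "T2", now - 2 * day), ("osaka", "T2", now - 2 * day),
    ("tokyo", "T3", now - 3 * day),
    ("ramen", "T4", now - 60 * day), ("kyoto", "T4", now - 60 * day),  # too old
]
coefficients = cooccurrence_coefficients(sample_records, "ramen", now)
```

Using sets of text IDs means that repeated occurrences of a morpheme within one document count once, which matches the document-level counts df(w_i), df(w_jm), and df(w_i, w_jm) described above.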

When the co-occurrence word data extraction unit 125 has finished generating the co-occurrence word data and co-occurrence coefficient data, the relevance data generation unit 126 generates relevance data, and the relevance data transmission unit 127 sends the text data and relevance data to the terminal device 11. FIG. 9 shows this processing.

First, the relevance data generation unit 126 extracts from the co-occurrence word DB (FIG. 6) the data records whose "search ID" field contains the search ID that the search keyword data receiving unit 124 assigned to the search request when it received the search keyword data from the terminal device 11 in step S201 (S301).

Next, the relevance data generation unit 126 extracts from the morpheme DB (FIG. 4) the data records whose "time" field holds time data indicating a time within the past month (S302).

Next, the relevance data generation unit 126 groups the data records extracted in step S302 by the "text ID" field (S303). In the following, it is assumed that step S303 produces groups of data records for the first to x-th pieces of text data (where x is an arbitrary natural number).

Next, from the x groups of data records generated in step S303, the relevance data generation unit 126 selects the group for the y-th piece of text data (where y is an arbitrary natural number from 1 to x) (S304).

Next, the relevance data generation unit 126 groups the data records of the y-th piece of text data selected in step S304 by the "morpheme" field and counts the number of data records in each group (S305). Assume that the grouping in step S305 produces groups of data records for the first to p-th morphemes (where p is an arbitrary natural number), and that the number of data records C_q is counted for the group corresponding to the q-th morpheme (where q is an arbitrary natural number from 1 to p).

Next, for each of the first to p-th morphemes (hereinafter, the q-th morpheme), the relevance data generation unit 126 searches the co-occurrence word DB (FIG. 6) for the data record whose "search ID" field holds the search ID assigned by the search keyword data receiving unit 124 in step S201 and whose "co-occurrence word" field holds co-occurrence word data indicating the q-th morpheme, and reads the co-occurrence coefficient data stored in the "co-occurrence coefficient" field of that record as the co-occurrence coefficient D_q for the q-th morpheme (S306).

Next, the relevance data generation unit 126 calculates the relevance R_y between the y-th piece of text data and the search keyword data according to the following formula, and generates relevance data indicating the calculated relevance R_y (S307).

$$R_y = \sum_{q=1}^{p} C_q \, D_q$$

Next, the relevance data generation unit 126 determines whether the relevance data generation of step S307 has been completed for all of the x text data groups formed in step S303 (S308). If it determines in step S308 that relevance data has not yet been generated for some text data group (S308; No), the relevance data generation unit 126 returns to step S304, selects the group of data records for the (y+1)-th piece of text data, and performs step S305 and the subsequent steps.

If, on the other hand, it determines in step S308 that relevance data has been generated for all text data groups (S308; Yes), the relevance data generation unit 126 ends its series of processing, and the relevance data transmission unit 127 then sends the text data and relevance data.
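The loop of steps S303 to S308 can be sketched as below, assuming that the relevance of a text is the sum, over the morphemes it contains, of the occurrence count C_q multiplied by the co-occurrence coefficient D_q (the description of the modifications characterizes relevance this way). The function name `relevance_scores`, the `(morpheme, text_id)` row shape, and the choice to treat morphemes with no stored coefficient as contributing 0 are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def relevance_scores(records, coefficients):
    """Return {text_id: relevance} from (morpheme, text_id) rows and {morpheme: D_q}."""
    # S303, S305: group rows by text, then count occurrences C_q per morpheme.
    counts_by_text = defaultdict(Counter)
    for morpheme, text_id in records:
        counts_by_text[text_id][morpheme] += 1
    scores = {}
    for text_id, counts in counts_by_text.items():
        # S306, S307: look up D_q and accumulate C_q * D_q.
        # Assumption: a morpheme with no stored coefficient contributes 0.
        scores[text_id] = sum(
            c * coefficients.get(m, 0.0) for m, c in counts.items()
        )
    return scores

rows = [
    ("tokyo", "T1"), ("tokyo", "T1"), ("osaka", "T1"),
    ("tokyo", "T2"), ("kyoto", "T2"),
]
scores = relevance_scores(rows, {"tokyo": 0.5, "osaka": 0.25})
```

In this toy data, text T1 mentions "tokyo" twice and "osaka" once, so it scores higher than T2, which mentions "tokyo" once and otherwise contains only a morpheme with no coefficient.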

For each of the groups of data records formed in step S303, the relevance data transmission unit 127 uses the text ID stored in the "text ID" field as a search key to extract the first to x-th pieces of text data from the text DB (FIG. 3) (S309).

Next, the relevance data transmission unit 127 associates each of the first to x-th pieces of text data extracted in step S309 with the relevance data generated for it in step S307, and then sorts the text data in descending order of the relevance indicated by the relevance data (S310).

Next, the relevance data transmission unit 127 sends the text data sorted in step S310 to the terminal device 11 together with the associated relevance data (S311). This completes the processing for generating relevance data by the relevance data generation unit 126 and for sending text data and relevance data by the relevance data transmission unit 127.

When the terminal device 11 receives the text data and accompanying relevance data sent from the server device 12, it displays their content. FIG. 10 schematically shows the extraction result screen displayed on the terminal device 11. By viewing this screen, the user can learn the content of documents related to the search keyword entered earlier. The displayed text data has been selected with regard to how frequently co-occurrence words appeared in the same documents as the search keyword during the past month, and it is displayed in descending order of the relevance calculated from those frequencies. By reading it, the user can therefore learn the topic trends related to the search keyword over the past month.

[Modifications]
The embodiment described above is one embodiment of the present invention and can be modified in various ways within the scope of the technical idea of the present invention. Examples of such modifications are given below.

In the embodiment described above, the text search system 1 consists of the terminal device 11 and the server device 12, and all processing other than the input of search keyword data and the display of results is performed in the server device 12. Instead, a configuration may be adopted in which all or part of the processing performed by the server device 12 is performed in the terminal device 11. For example, if the terminal device 11 has sufficient processing speed, communication speed, storage capacity, and so on, as a desktop PC does, the terminal device 11 can also serve as the server device 12.

In the embodiment described above, the Dice coefficient is used as the co-occurrence coefficient, but any index other than the Dice coefficient may be used as the co-occurrence coefficient as long as it indicates the degree of co-occurrence.

In the embodiment described above, the text data presented to the user is extracted from the text data acquired by the server device 12 during the past month. The range of acquisition times for the text data subject to extraction can, however, be changed freely, and a configuration that performs no extraction based on acquisition time may also be adopted. In other words, while the acquisition time of the morpheme data used to generate the co-occurrence word data and co-occurrence coefficient data is limited to a predetermined period such as the past month, the text data subject to extraction may have been acquired outside that period.

In the embodiment described above, the predetermined period used to narrow down, by acquisition time, the morpheme data used to generate the co-occurrence word data and co-occurrence coefficient data is the past month, but this period can be changed freely. For example, a configuration may be adopted in which the user can specify the period when entering the search keyword, and the co-occurrence word data and co-occurrence coefficient data are generated from the morpheme data acquired within the period the user specified.

In the embodiment described above, all morphemes are used uniformly in generating the relevance data, regardless of how often they appear within the predetermined period. Alternatively, morphemes that appear many times within the predetermined period may be given a larger weight than morphemes that appear few times, and the relevance data may be generated with this weighting applied, so that a frequently appearing morpheme occurring as a co-occurrence word yields higher relevance than a rarely appearing morpheme does. For example, a weight of "0" may be given to morphemes whose number of appearances is below a predetermined threshold and a weight of "1" to morphemes whose number of appearances is at or above the threshold, so that only morphemes appearing at least the threshold number of times are considered in generating the relevance data.
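The threshold-based weighting in this modification might look like the following sketch. The helper names `threshold_weights` and `weighted_relevance`, and the way the weight multiplies each C_q * D_q term, are illustrative assumptions; the patent leaves the exact weighting rule open.

```python
from collections import Counter

def threshold_weights(occurrences, threshold):
    """Weight 1 for morphemes seen at least `threshold` times in the period, else 0."""
    totals = Counter(occurrences)
    return {m: 1 if n >= threshold else 0 for m, n in totals.items()}

def weighted_relevance(counts, coefficients, weights):
    """Sum of weight * occurrence count * co-occurrence coefficient per morpheme."""
    return sum(
        weights.get(m, 0) * c * coefficients.get(m, 0.0)
        for m, c in counts.items()
    )

# All morpheme occurrences in the period: "tokyo" appears 3 times, "osaka" once.
weights = threshold_weights(["tokyo", "tokyo", "tokyo", "osaka"], threshold=2)
score = weighted_relevance(
    {"tokyo": 2, "osaka": 5}, {"tokyo": 0.5, "osaka": 0.5}, weights
)
```

With a threshold of 2, "osaka" is ignored even though it occurs five times in the scored text, because it appeared only once across the whole period.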

In the embodiment described above, the time data associated with morpheme data is time data indicating when the source text data was acquired. The time data associated with morpheme data is not limited to this. For example, if the source text data is accompanied by time data indicating when the document was published, that time data may be used as the time data associated with the morpheme data.

In the embodiment described above, the user enters a single search keyword and relevance data is generated for that single piece of search keyword data. Alternatively, a configuration may be adopted in which the user enters a plurality of search keywords and relevance data is generated for the plurality of pieces of search keyword data that respectively indicate them. In that case, for example, the relevance data generation unit 126 may generate relevance data indicating the sum of the relevance values calculated for the individual pieces of search keyword data.
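A minimal sketch of the additive combination mentioned here, assuming each keyword has already produced its own mapping from text ID to relevance. The function name `combined_relevance` and the treatment of a text missing from one keyword's results as contributing 0 are assumptions.

```python
def combined_relevance(per_keyword_scores):
    """Sum per-keyword relevance values for each text ID."""
    total = {}
    for scores in per_keyword_scores:
        for text_id, relevance in scores.items():
            # Assumption: a text absent from one keyword's scores adds nothing.
            total[text_id] = total.get(text_id, 0.0) + relevance
    return total

combined = combined_relevance([
    {"T1": 1.0, "T2": 0.5},   # relevance to the first keyword
    {"T2": 0.25},             # relevance to the second keyword
])
```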

Furthermore, a configuration may be adopted in which the user specifies text data instead of entering a search keyword, and the plurality of pieces of morpheme data that the morphological analysis unit 122 of the server device 12 generates by dividing that text data are used as search keywords. In that case, by specifying a document of interest, the user can read documents related to it. Here too, the text data presented to the user is selected with the topic trends of the predetermined period taken into account.

In the embodiment described above, the relevance data indicates the sum of the numbers of occurrences of the co-occurrence words, each weighted by its co-occurrence coefficient. Alternatively, for example, data indicating the numbers of occurrences of the co-occurrence words may be used directly as relevance data without the co-occurrence coefficient, or a coefficient other than the co-occurrence coefficient may be used.

In the embodiment described above, the text data sent from the server device 12 to the terminal device 11 is sorted in descending order of relevance. Alternatively, for example, the server device 12 may send the terminal device 11 text data in an arbitrary order together with the relevance data, and the terminal device 11 may sort the text data. The sort order can also be changed freely. For example, the text data may be sorted by the length of the text it represents or from newest to oldest acquisition time. In addition, if the text data is posted on a Web page and data indicating the level of public interest in it is available, such as the number of links to it from other Web pages or the number of views of the Web page, the text data may be sorted according to those indicators.

In the embodiment described above, the server device 12 is realized by causing a general-purpose computer to execute processing according to the program of the present invention. Alternatively, the server device 12 may be configured as a so-called dedicated machine in which, for example, the functional configuration shown in FIG. 2 is realized in hardware.

The formulas and the content and order of the processing flows shown in the embodiment described above are examples given for explanation and can be changed in various ways.

DESCRIPTION OF SYMBOLS
1: text search system, 9: network, 11: terminal device, 12: server device, 121: timekeeping unit, 122: morphological analysis unit, 123: text data acquisition unit, 124: search keyword data receiving unit, 125: co-occurrence word data extraction unit, 126: relevance data generation unit, 127: relevance data transmission unit, 128: storage unit

Claims

An apparatus comprising:
text data acquisition means for acquiring text data indicating a sentence together with time data indicating a time;
morpheme analysis means for dividing the sentence indicated by the text data acquired by the text data acquisition means into morphemes by morphological analysis and generating a plurality of pieces of morpheme data each indicating one of the divided morphemes;
morpheme data storage means for storing, for each of the one or more pieces of morpheme data that the morpheme analysis means generates by dividing each of a plurality of pieces of text data, the morpheme data, text data identification data identifying the text data used to generate the morpheme data, and the time data acquired by the text data acquisition means together with the text data used to generate the morpheme data; and
co-occurrence word data extraction means for extracting, with respect to one piece of keyword data indicating one keyword, as co-occurrence word data corresponding to the one piece of keyword data, morpheme data that is stored in the morpheme data storage means together with time data indicating a time within a predetermined period and together with the same text data identification data as text data identification data stored in the morpheme data storage means together with morpheme data indicating the one keyword.

The apparatus according to claim 1, further comprising relevance data generation means for: with respect to one piece of text data identification data stored together with morpheme data indicating the one keyword indicated by the one piece of keyword data, identifying, for each piece of morpheme data extracted by the co-occurrence word data extraction means as co-occurrence word data corresponding to the one piece of keyword data, the number of pieces of morpheme data identical to that morpheme data that are stored in the morpheme data storage means together with the one piece of text data identification data, as the number of pieces of co-occurrence word data corresponding to the one piece of keyword data that are included in the one piece of text data identified by the one piece of text data identification data; and generating, using the identified numbers, relevance data indicating the degree of relevance between the one piece of text data and the one piece of keyword data.

The apparatus according to claim 2, wherein, in generating the relevance data, the relevance data generation means weights the number relating to each piece of co-occurrence word data by a weight determined, according to a predetermined rule, from the number of pieces of co-occurrence word data extracted by the co-occurrence word data extraction means.

The apparatus according to claim 2, further comprising:
keyword data receiving means for receiving, from a terminal device, the one piece of keyword data or text data including the one piece of keyword data;
text data storage means for storing the text data acquired by the text data acquisition means; and
relevance data transmission means for transmitting to the terminal device, in association with each of a plurality of pieces of text data stored in the text data storage means, the relevance data calculated by the relevance data generation means between that piece of text data and the one piece of keyword data received from the terminal device by the keyword data receiving means.

The apparatus according to claim 4, wherein the relevance data transmission means sorts the plurality of pieces of text data in an order according to the relevance indicated by the relevance data and then transmits them to the terminal device together with the relevance data.

A program for causing a computer to function as:
text data acquisition means for acquiring text data indicating a sentence together with time data indicating a time;
morpheme analysis means for dividing the sentence indicated by the text data acquired by the text data acquisition means into morphemes by morphological analysis and generating a plurality of pieces of morpheme data each indicating one of the divided morphemes;
morpheme data storage means for storing, for each of the one or more pieces of morpheme data that the morpheme analysis means generates by dividing each of a plurality of pieces of text data, the morpheme data, text data identification data identifying the text data used to generate the morpheme data, and the time data acquired by the text data acquisition means together with the text data used to generate the morpheme data; and
co-occurrence word data extraction means for extracting, with respect to one piece of keyword data indicating one keyword, as co-occurrence word data corresponding to the one piece of keyword data, morpheme data that is stored in the morpheme data storage means together with time data indicating a time within a predetermined period and together with the same text data identification data as text data identification data stored in the morpheme data storage means together with morpheme data indicating the one keyword.