JP5710519B2

JP5710519B2 - Synonym extraction device, method, and program

Info

Publication number: JP5710519B2
Application number: JP2012027809A
Authority: JP
Inventors: 健司江崎; 内山　匡; 典子高屋; 裕介市川; 翔一長野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-10
Filing date: 2012-02-10
Publication date: 2015-04-30
Anticipated expiration: 2032-02-10
Also published as: JP2013164751A

Description

The present invention relates to a synonym extraction device, method, and program, and more particularly to a synonym extraction device, method, and program that extract synonyms of words contained in product descriptions on a site.

In recent years, a growing number of companies have opened stores on e-commerce (EC) sites in addition to physical stores. As a result, the EC market is expanding, and increasing sales on EC sites has become important. To increase sales on an EC site, a retailer must stock products that match customer needs, which in turn requires estimating those needs.

To that end, the conditions under which many customers, or important customers, selected products can be taken as an estimate of customer needs, and stocking products that satisfy those conditions is expected to increase sales.

Hereinafter, the condition under which a viewed product was selected is referred to as a product selection condition.

One method of extracting product selection conditions is to extract, as the product selection condition, words that appear in common across the viewed products.

However, as shown in FIG. 12, the product selection condition "one-piece" may be written in different ways, such as "dress", "mini one-piece", or "shirt one-piece". Such words are called synonyms.

When descriptions differ because of synonyms, the frequency with which a word appears in common across the viewed products cannot be calculated correctly, because each variant is counted as a separate word. Consequently, the product selection condition cannot be extracted.

If the synonyms of the words contained in the viewed products are filled in, a word that appears in common across many of the viewed products becomes the product selection condition, and the condition can be extracted correctly.
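As a rough illustration of the problem and its remedy, the sketch below counts how many viewed products share each word, first on the raw words and then after mapping variants to a canonical word. The sample products, the synonym table, and the function name `shared_word_counts` are illustrative assumptions, not part of the described device.

```python
from collections import Counter

# Hypothetical browsing data: the word set of each product viewed in one session.
viewed_products = [
    {"one-piece", "black"},
    {"dress", "floral"},
    {"shirt one-piece", "white"},
]

# Hypothetical synonym table mapping each variant to a canonical word.
synonyms = {"dress": "one-piece", "shirt one-piece": "one-piece"}


def shared_word_counts(products, synonym_map=None):
    """Count, for each word, the number of products whose description contains it."""
    synonym_map = synonym_map or {}
    counts = Counter()
    for words in products:
        # Map variants to their canonical form; the set avoids double counting per product.
        counts.update({synonym_map.get(w, w) for w in words})
    return counts


without_synonyms = shared_word_counts(viewed_products)
with_synonyms = shared_word_counts(viewed_products, synonyms)
print(without_synonyms["one-piece"], with_synonyms["one-piece"])  # 1 3
```

Without the synonym table, each variant is counted separately and no word stands out; with it, "one-piece" is shared by all three products.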

One method of finding synonyms is to extract, as synonyms of a given word, words that frequently appear together with it in a document. Here, a document is anything that contains arbitrary text, such as a page.

However, when synonyms are extracted from the descriptions on product detail pages of an EC site, a product described as "one-piece" is rarely also described as "shirt one-piece". This is because product detail page descriptions are short and tend to state different information rather than repeat similar descriptions. Here, a product detail page is a page that gives detailed information about a particular product.

It is therefore considered more effective to calculate how often words appear together in the descriptions within the same session than how often they appear together for the same product. Here, a session is a series of page views made consecutively with the same intent.

Non-Patent Document 1 uses search results that contain a given word, but the same technique can be applied by treating a session as a search result.

Non-Patent Document 1: Takenobu Tokunaga, "Information Retrieval and Natural Language Processing", University of Tokyo Press, pp. 154-159, 1999.

When the number of sessions in which words co-occur is calculated as the co-occurrence frequency, some sessions may involve browsing different product categories within the same session.

Thus, when the co-occurrence frequency is the number of sessions in which words co-occur, synonyms do not necessarily obtain a high co-occurrence frequency. This is because the products viewed in a single session do not necessarily share the same product selection condition, so a word with a high co-occurrence frequency may not be a synonym.

The present invention has been made in view of the above circumstances, and its object is to provide a synonym extraction device, method, and program capable of accurately extracting synonyms of words contained in product descriptions on a site.

To achieve the above object, the synonym extraction device of the present invention extracts synonyms of words contained in product descriptions on a site and comprises: session extraction means for extracting, for each extraction target word w1, sessions having at least a predetermined number of viewed products that contain the word w1, based on input session information that includes, for each session, at least one viewed product and the group of words contained in the description of each viewed product; co-occurrence frequency extraction means for extracting, for each extraction target word w1 and based on the session information of the sessions extracted for that word by the session extraction means, the number of sessions in which each word w2 appears among all the sessions extracted for the word w1, as the co-occurrence frequency of w2 with w1; IDF extraction means for extracting, based on the co-occurrence frequencies extracted by the co-occurrence frequency extraction means and for each word w2 for which a co-occurrence frequency with an extraction target word w1 has been extracted, a word IDF that is the reciprocal of the sum of a predetermined positive constant and the number of extraction target words w1 with which the co-occurrence frequency of w2 is greater than or equal to a first threshold; and synonym extraction means for outputting, for each extraction target word w1, as synonyms of w1, the words w2 whose co-occurrence frequency with w1 is greater than or equal to the first threshold, excluding any word w2 whose word IDF is less than or equal to a second threshold.

The synonym extraction method of the present invention is performed in a synonym extraction device that includes session extraction means, co-occurrence frequency extraction means, IDF extraction means, and synonym extraction means, and that extracts synonyms of words contained in product descriptions on a site. The method comprises: a step in which the session extraction means extracts, for each extraction target word w1, sessions having at least a predetermined number of viewed products that contain the word w1, based on input session information that includes, for each session, at least one viewed product and the group of words contained in the description of each viewed product; a step in which the co-occurrence frequency extraction means extracts, for each extraction target word w1 and based on the session information of the sessions extracted for that word by the session extraction means, the number of sessions in which each word w2 appears among all the sessions extracted for the word w1, as the co-occurrence frequency of w2 with w1; a step in which the IDF extraction means extracts, based on the extracted co-occurrence frequencies and for each word w2 for which a co-occurrence frequency with an extraction target word w1 has been extracted, a word IDF that is the reciprocal of the sum of a predetermined positive constant and the number of extraction target words w1 with which the co-occurrence frequency of w2 is greater than or equal to a first threshold; and a step in which the synonym extraction means outputs, for each extraction target word w1, as synonyms of w1, the words w2 whose co-occurrence frequency with w1 is greater than or equal to the first threshold, excluding any word w2 whose word IDF is less than or equal to a second threshold.

The program of the present invention is a program for extracting synonyms of words contained in product descriptions on a site, and causes a computer to function as: session extraction means for extracting, for each extraction target word w1, sessions having at least a predetermined number of viewed products that contain the word w1, based on input session information that includes, for each session, at least one viewed product and the group of words contained in the description of each viewed product; co-occurrence frequency extraction means for extracting, for each extraction target word w1 and based on the session information of the sessions extracted for that word by the session extraction means, the number of sessions in which each word w2 appears among all the sessions extracted for the word w1, as the co-occurrence frequency of w2 with w1; IDF extraction means for extracting, based on the extracted co-occurrence frequencies and for each word w2 for which a co-occurrence frequency with an extraction target word w1 has been extracted, a word IDF that is the reciprocal of the sum of a predetermined positive constant and the number of extraction target words w1 with which the co-occurrence frequency of w2 is greater than or equal to a first threshold; and synonym extraction means for outputting, for each extraction target word w1, as synonyms of w1, the words w2 whose co-occurrence frequency with w1 is greater than or equal to the first threshold, excluding any word w2 whose word IDF is less than or equal to a second threshold.

As described above, according to the synonym extraction device, method, and program of the present invention, sessions having at least a predetermined number of viewed products that contain the word w1 are extracted for each extraction target word w1, the co-occurrence frequency with w1 is calculated for each word w2, and the synonyms of w1 are extracted after excluding the words w2 whose word IDF is less than or equal to a threshold. This has the effect that synonyms of words contained in product descriptions on a site can be extracted accurately.

FIG. 1 is a block diagram showing the functional configuration of the synonym extraction device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the contents of the session group input buffer.
FIG. 3 is a diagram showing the contents of the short-term session group input buffer.
FIG. 4 is a diagram showing the contents of the co-occurrence frequency storage buffer.
FIG. 5 is a diagram showing the contents of the co-occurrence frequency input buffer.
FIG. 6 is a diagram showing the contents of the IDF extraction buffer.
FIG. 7 is a diagram showing the output of the synonym extraction device.
FIG. 8 is a flowchart showing the synonym extraction processing routine in the synonym extraction device according to the embodiment of the present invention.
FIG. 9 is a flowchart showing the flow of the process of extracting short-term sessions in the synonym extraction device according to the embodiment of the present invention.
FIG. 10 is a flowchart showing the flow of the process of extracting co-occurrence frequencies in the synonym extraction device according to the embodiment of the present invention.
FIG. 11 is a flowchart showing the flow of the process of extracting the word IDF in the synonym extraction device according to the embodiment of the present invention.
FIG. 12 is a diagram for explaining synonyms.
FIG. 13 is a diagram showing the result of calculating co-occurrence frequencies over all sessions.
FIG. 14 is a diagram showing the result of calculating co-occurrence frequencies over short-term sessions.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the present invention>
When the co-occurrence frequency is calculated over all sessions as the number of sessions in which words co-occur within the same session, some sessions may involve browsing different product categories within the same session.

Thus, when the co-occurrence frequency is the number of sessions in which words co-occur, synonyms do not necessarily obtain a high co-occurrence frequency. This is because even words that co-occur within the same session do not necessarily share the same product selection condition, so a word with a high co-occurrence frequency may not be a synonym.

FIG. 13 shows the co-occurrence frequencies calculated over all sessions that contain "one-piece". Synonyms such as "shirt one-piece" and "mini one-piece" do not rank near the top. Instead, the top ranks are occupied by words that tend to appear on any product detail page, such as "black", and by non-synonymous words related to other categories.
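The effect described for FIG. 13 can be reproduced on toy data. The sketch below counts, over every session that contains a target word, the number of sessions in which each other word appears. The function name `naive_cooccurrence` and the sample sessions are illustrative assumptions and do not reproduce the data of FIG. 13.

```python
from collections import Counter


def naive_cooccurrence(sessions, target):
    """Over all sessions containing `target`, count the sessions in which each other word appears."""
    counts = Counter()
    for products in sessions:
        # Union of the word sets of every product viewed in the session.
        words = set().union(*products)
        if target in words:
            counts.update(words - {target})
    return counts


# Hypothetical sessions: each session is a list of per-product word sets.
sessions = [
    [{"one-piece", "black"}, {"sneakers", "black"}],   # mixed categories
    [{"one-piece", "black"}, {"shirt one-piece"}],
    [{"one-piece"}, {"bag", "black"}],                 # mixed categories
]

counts = naive_cooccurrence(sessions, "one-piece")
print(counts.most_common(1))  # [('black', 3)]
```

On this data the generic word "black" outranks the actual synonym "shirt one-piece", which appears in only one session.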

Therefore, in the present invention, synonyms are extracted by calculating the co-occurrence frequency only over sessions that were browsed under the same product selection condition.

A session browsed under the same product selection condition is defined as a session that has N or more viewed products containing the target word (called a short-term session). This definition reduces the influence of words viewed under different product selection conditions.

In addition, synonyms are extracted more accurately by removing, from the words that co-occur in sessions browsed under the same product selection condition, those that co-occur with multiple target words.

FIG. 14 shows an example in which the co-occurrence frequency is calculated, and synonyms are extracted, only over sessions that contain "one-piece" and were browsed under the same product selection condition. Unlike FIG. 13, the synonyms "mini one-piece" and "shirt one-piece" rank near the top. However, words that tend to appear on any product detail page, such as "black", still remain near the top. Therefore, on the assumption that a synonym does not co-occur frequently with many different words, such words are removed by dropping from the synonym candidates every word whose word IDF, the reciprocal of the number of words with which it frequently co-occurs, is less than or equal to a threshold.

<System configuration>
The synonym extraction device 100 according to the embodiment of the present invention is a computer that includes a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a program for executing the synonym extraction processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as a configuration including a session group input unit 10, a synonym extraction unit 20, a word group dictionary database 30, and a synonym output unit 40. The synonym output unit 40 is an example of the synonym extraction means.

The session group input unit 10 accepts input of session information for each session obtained from at least one EC site. Here, a session is a log of the pages that the same viewer browsed consecutively on the same site. As shown in FIG. 2, the session information includes a session ID, which is a predefined identifier of the session; a user ID identifying the user who browsed; the products viewed during the session; a word group, which is the set of words that appear in the description of a viewed product; and the appearance frequency (0 or 1) of each word in the word group. The viewed product, word group, and per-word appearance frequencies are held for each viewed product.
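One possible in-memory representation of this session information is sketched below. The class names `Session` and `ViewedProduct`, the field names, and the sample values are assumptions made for illustration; the description only specifies which items the session information contains.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ViewedProduct:
    product_id: str
    # Word -> appearance frequency (0 or 1) in this product's description.
    word_flags: Dict[str, int] = field(default_factory=dict)

    def words(self) -> Set[str]:
        """Return the words that actually appear in the description."""
        return {w for w, flag in self.word_flags.items() if flag == 1}


@dataclass
class Session:
    session_id: str
    user_id: str
    # One entry per product viewed during the session.
    products: List[ViewedProduct] = field(default_factory=list)


# Hypothetical example record.
session = Session(
    session_id="s1",
    user_id="u1",
    products=[ViewedProduct("p1", {"one-piece": 1, "black": 1, "dress": 0})],
)
print(session.products[0].words())
```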

The word group dictionary database 30 stores a large number of word entries as a word group dictionary.

The synonym extraction unit 20 includes a short-term session extraction unit 21, a co-occurrence frequency extraction unit 22, and an IDF extraction unit 23.

The short-term session extraction unit 21 includes a session group input buffer 211.

The session group input buffer 211 stores the session information for each session, shown in FIG. 2, that was input through the session group input unit 10.

The short-term session extraction unit 21 starts operating after writing to the session group input buffer 211 has completed and, as described below, extracts the short-term session for each word in the word group dictionary database 30.

First, one word w1 is taken from the word group dictionary database 30. Next, one piece of session information is taken from the session group input buffer 211, and its contents are written to a session storage buffer (not shown). If, among the viewed products in the session information held in the session storage buffer, the number of products containing the word w1 is greater than or equal to a threshold N, the word w1 and the session information (session ID, user ID, viewed products, word groups, and appearance frequency of each word) are written to the short-term session group input buffer 221 described later, as shown in FIG. 3.

In this way, for each word w1 in the word group dictionary database 30, the short-term session extraction unit 21 extracts, from the set of all sessions for which session information was input, the subset of sessions in which the number of viewed products containing w1 is greater than or equal to the threshold N, as the short-term session for w1. It then writes the information of each session in the short-term session to the short-term session group input buffer 221 in association with the word w1, as shown in FIG. 3.
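The extraction just described can be sketched as follows. This is a minimal illustration that assumes sessions are given as a mapping from session ID to a list of per-product word sets; the function name `extract_short_term_sessions` and the sample data are hypothetical, and the buffers of the device are replaced by plain dictionaries.

```python
def extract_short_term_sessions(sessions, vocabulary, n):
    """For each target word w1, keep the IDs of sessions in which at least
    n viewed products contain w1 (the short-term session for w1)."""
    short_term = {}
    for w1 in vocabulary:
        short_term[w1] = [
            session_id
            for session_id, products in sessions.items()
            # Count the viewed products whose word set contains w1.
            if sum(1 for words in products if w1 in words) >= n
        ]
    return short_term


# Hypothetical input: session ID -> word sets of the products viewed.
sessions = {
    "s1": [{"one-piece", "black"}, {"one-piece", "mini one-piece"}, {"shoes"}],
    "s2": [{"one-piece"}, {"bag", "black"}],
    "s3": [{"shirt one-piece", "one-piece"}, {"one-piece", "white"}],
}
vocabulary = ["one-piece", "black", "shoes"]

short_term = extract_short_term_sessions(sessions, vocabulary, n=2)
print(short_term["one-piece"])  # ['s1', 's3']
```

Session "s2" is dropped for "one-piece" because only one of its viewed products contains the word, which suggests the viewer was not browsing under that selection condition.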

The co-occurrence frequency extraction unit 22 includes a short-term session group input buffer 221 and a co-occurrence frequency storage buffer 222.

The short-term session group input buffer 221 stores the short-term session information, shown in FIG. 3, that was written by the short-term session extraction unit 21.

The co-occurrence frequency extraction unit 22 starts operating after writing to the short-term session group input buffer 221 has completed and, as described below, calculates for each word w1 the co-occurrence frequency of each word w2 with w1 in the short-term session for w1.

All the session information extracted as the short-term session for the word w1 is taken from the short-term session group input buffer 221. Next, one word w2 is taken from the word group dictionary database 30. The frequency with which the word w2 appears in the short-term session for w1 (the number of sessions containing w2) is calculated and written to the co-occurrence frequency storage buffer 222 as the co-occurrence frequency of w2 with w1, as shown in FIG. 4.

In this way, for each word w1, the number of sessions in which each word w2 appears within the short-term session for w1 is calculated and stored in the co-occurrence frequency storage buffer 222 as the co-occurrence frequency with w1, as shown in FIG. 4.
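The counting step can be sketched as below. The function name `cooccurrence_counts`, the input shape (target word mapped to its short-term sessions, each a list of per-product word sets), and the sample data are assumptions for illustration. Excluding w1 from its own counts is also an assumption, since the description does not say whether w2 may equal w1.

```python
from collections import Counter


def cooccurrence_counts(short_term_sessions):
    """Map each target word w1 to a Counter giving, for each word w2,
    the number of sessions in w1's short-term session that contain w2."""
    result = {}
    for w1, sessions in short_term_sessions.items():
        counts = Counter()
        for products in sessions:
            # A word counts once per session, however many products contain it.
            session_words = set().union(*products)
            session_words.discard(w1)  # assumption: w1 is not its own candidate
            counts.update(session_words)
        result[w1] = counts
    return result


# Hypothetical short-term session for "one-piece" (two sessions).
short_term_sessions = {
    "one-piece": [
        [{"one-piece", "black"}, {"one-piece", "mini one-piece"}],
        [{"shirt one-piece", "one-piece"}, {"one-piece", "mini one-piece", "black"}],
    ]
}

cooc = cooccurrence_counts(short_term_sessions)
print(cooc["one-piece"]["mini one-piece"], cooc["one-piece"]["shirt one-piece"])  # 2 1
```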

The IDF extraction unit 23 includes a co-occurrence frequency input buffer 231 and an IDF extraction buffer 232.

As shown in FIG. 5, the co-occurrence frequency input buffer 231 stores the same contents as the co-occurrence frequency storage buffer 222.

The IDF extraction unit 23 starts operating after writing to the co-occurrence frequency input buffer 231 has completed and, as described below, extracts the word IDF of each word w2 for each word w1 and extracts the synonyms.

First, one word w1 is taken from the word group dictionary database 30. From the co-occurrence frequency input buffer 231, the words w2 whose co-occurrence frequency with w1 is greater than or equal to a threshold M are extracted as synonym candidates. For each word w2 extracted as a synonym candidate, the number of words w1 with which the co-occurrence frequency of w2 is greater than or equal to the threshold M is calculated as the word appearance frequency (word DF) (see FIG. 6). The word IDF of w2 is then calculated from its word appearance frequency using expression (1), and the words whose word IDF is less than or equal to a threshold L are removed from the synonym candidates of w1.

In expression (1), f is the word appearance frequency calculated for a word w2 whose co-occurrence frequency with w1 is greater than or equal to the threshold M, plus 1, and k is a normalization constant.

For each word w1, the IDF extraction unit 23 removes from the synonym candidates (the words that co-occur with w1) every word w2 whose word IDF is less than or equal to the threshold L, and then writes the remaining synonym candidates to the synonym output unit 40. Based on the synonym candidates written for each word w1, the synonym output unit 40 outputs the synonyms of each word w1, as shown in FIG. 7.
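The candidate selection and IDF-based removal can be sketched as follows. Expression (1) itself is not reproduced in this text, so the sketch assumes the form word IDF = k / f with f = word DF + 1, which is consistent with the statement that the word IDF is the reciprocal of the word DF plus a positive constant; the exact placement of k is an assumption. The function name `filter_by_word_idf`, the parameter names, and the sample data are also hypothetical.

```python
def filter_by_word_idf(cooc, min_cooc, idf_threshold, k=1.0):
    """Select synonym candidates by co-occurrence threshold M (min_cooc) and
    drop those whose word IDF is at or below threshold L (idf_threshold)."""
    # Synonym candidates: words w2 whose co-occurrence frequency with w1 is >= M.
    candidates = {
        w1: {w2 for w2, count in counts.items() if count >= min_cooc}
        for w1, counts in cooc.items()
    }
    # Word DF: number of target words w1 for which w2 is a candidate.
    df = {}
    for words in candidates.values():
        for w2 in words:
            df[w2] = df.get(w2, 0) + 1
    # ASSUMPTION: expression (1) is taken to be k / f, with f = DF + 1.
    idf = {w2: k / (n + 1) for w2, n in df.items()}
    synonyms = {
        w1: {w2 for w2 in words if idf[w2] > idf_threshold}
        for w1, words in candidates.items()
    }
    return synonyms, idf


# Hypothetical co-occurrence frequencies: w1 -> {w2: number of sessions}.
cooc = {
    "one-piece": {"mini one-piece": 5, "black": 6, "shoes": 1},
    "skirt": {"mini skirt": 4, "black": 7},
    "bag": {"tote": 3, "black": 5},
}

synonyms, idf = filter_by_word_idf(cooc, min_cooc=3, idf_threshold=0.3)
print(synonyms["one-piece"])  # {'mini one-piece'}
```

Here "black" is a candidate for all three target words, so its word DF is 3 and its assumed word IDF is 0.25, which falls at or below the threshold and removes it from every candidate list.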

<Operation of the synonym extraction device>
When session information for each session is input to the synonym extraction device 100 of the present embodiment, the synonym extraction device 100 stores it in the session group input buffer 211. The synonym extraction device 100 then executes the synonym extraction processing routine shown in FIG. 8.

First, in step S100, the short-term session extraction unit 21 extracts, for each word w1, the subset of sessions in the session group input buffer 211 in which the number of viewed products containing w1 is N or more, as the short-term session. In step S101, the co-occurrence frequency extraction unit 22 calculates, for each word w1, the appearance frequency of each word w2 in each session of the short-term session extracted for w1, and calculates for each word w2 the co-occurrence frequency with w1 over the entire short-term session extracted for w1. Then, in step S102, the IDF extraction unit 23 calculates, for each word w1, the word IDF of each word w2 among the synonym candidates of w1, removes from the synonym candidates of w1 each word w2 whose word IDF is less than the threshold L as a word that co-occurs with many words w1, extracts the synonyms of w1, and ends the synonym extraction processing routine.

上記ステップＳ１００は、図９に示す短期的セッション抽出処理ルーチンによって実現される。 The step S100 is realized by a short-term session extraction processing routine shown in FIG.

ステップＳ１１０において、単語群辞書データベース３０から、１つの単語ｗ１を取り出す。ステップＳ１１１において、セッション群入力バッファ２１１から、１つのセッション情報を取り出し、セッション保存バッファに書き込む。ステップＳ１１２において、上記ステップＳ１１１で取り出したセッション情報に含まれる閲覧商品のうち、Ｎ個以上の閲覧商品が、単語ｗ１を含む場合には、セッション保存バッファに書き込んだセッション情報を、単語ｗ１の短期的セッションの一つとして、短期的セッション群入力バッファ２２１に書き込む。 In step S110, one word w1 is extracted from the word group dictionary database 30. In step S111, one piece of session information is extracted from the session group input buffer 211 and written to the session storage buffer. In step S112, when N or more browsed products among the browsed products included in the session information extracted in step S111 include the word w1, the session information written in the session storage buffer is changed to the short-term word w1. As one of the sessions, the short-term session group input buffer 221 is written.

次のステップＳ１１３では、セッション群入力バッファ２１１が空になったか否かを判定し、セッション群入力バッファ２１１が空でない場合には、上記ステップＳ１１１に戻り処理を継続する。一方、セッション群入力バッファ２１１が空である場合には、ステップＳ１１４へ進む。 In the next step S113, it is determined whether or not the session group input buffer 211 is empty. If the session group input buffer 211 is not empty, the process returns to step S111 and continues. On the other hand, if the session group input buffer 211 is empty, the process proceeds to step S114.

In step S114, it is determined whether the above processing has been executed for all words in the word group dictionary database 30. If not all words in the word group dictionary database 30 have been processed, the process returns to step S110 and continues. If all words in the word group dictionary database 30 have been processed, the short-term session extraction processing routine ends.
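The loop structure of steps S110 to S114 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the representation of a session as a list of word sets, the function name `extract_short_term_sessions`, and the sample data are all hypothetical, and the buffers 211 and 221 are modeled as ordinary in-memory lists.

```python
from typing import Dict, List, Set

Product = Set[str]       # assumed model: the words in one product description
Session = List[Product]  # assumed model: the products browsed in one session


def extract_short_term_sessions(
    sessions: List[Session], vocabulary: List[str], n: int
) -> Dict[str, List[Session]]:
    """For each word w1, keep the sessions in which at least n products contain w1."""
    short_term: Dict[str, List[Session]] = {}
    for w1 in vocabulary:                 # S110 / S114: iterate over dictionary words
        kept: List[Session] = []
        for session in sessions:          # S111 / S113: iterate over all sessions
            hits = sum(1 for product in session if w1 in product)
            if hits >= n:                 # S112: N or more browsed products contain w1
                kept.append(session)
        short_term[w1] = kept
    return short_term


# Hypothetical sample data, for illustration only.
sessions = [
    [{"one-piece", "floral"}, {"dress", "floral"}, {"one-piece", "mini"}],
    [{"one-piece", "shirt"}, {"sneaker"}],
    [{"sneaker", "white"}, {"sneaker", "black"}],
]
result = extract_short_term_sessions(sessions, ["one-piece", "sneaker"], n=2)
print(len(result["one-piece"]), len(result["sneaker"]))
```

With n = 2, only the first sample session qualifies for "one-piece" and only the third for "sneaker"; the second session mentions each word in just one product and is therefore dropped for both.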

Step S101 above is implemented by the co-occurrence frequency extraction processing routine shown in FIG. 10.

In step S120, the information on each session included in the short-term session of the word w1 is retrieved from the short-term session group input buffer 221, and in step S121, one word w2 is retrieved from the word group dictionary database 30. In step S122, the appearance frequency of the word w2 in the short-term session of the word w1 retrieved in step S120 (that is, the number of sessions in which the word w2 appears) is calculated and written to the co-occurrence frequency storage buffer 222 as the co-occurrence frequency with the word w1.

Then, in step S123, it is determined whether the above processing has been executed for all words in the word group dictionary database 30. If not all words in the word group dictionary database 30 have been processed, the process returns to step S121 and continues. If all words in the word group dictionary database 30 have been processed, the process proceeds to step S124.

In step S124, it is determined whether the short-term session group input buffer 221 has become empty. If the short-term session group input buffer 221 is not empty, the process returns to step S120 and continues. If the short-term session group input buffer 221 is empty, the co-occurrence frequency extraction processing routine ends.
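Steps S120 to S124 amount to counting, for each word w1, the number of its short-term sessions in which each word w2 appears. A minimal Python sketch of that counting is given below. The function name `cooccurrence_frequencies` and the data layout are assumptions made for illustration. The sketch also simplifies the routine in two labeled ways: it counts every word found in the sessions rather than iterating over the word group dictionary database 30, and it leaves w1 out of its own counts.

```python
from collections import Counter
from typing import Dict, List, Set

Product = Set[str]       # assumed model: the words in one product description
Session = List[Product]  # assumed model: the products browsed in one session


def cooccurrence_frequencies(
    short_term: Dict[str, List[Session]]
) -> Dict[str, Counter]:
    """Map each w1 to {w2: number of w1's short-term sessions in which w2 appears}."""
    table: Dict[str, Counter] = {}
    for w1, sessions in short_term.items():       # S120 / S124
        counts: Counter = Counter()
        for session in sessions:
            # A word is counted once per session, however many products mention it.
            words_in_session = set().union(*session)
            # Assumption: w1 is not counted as co-occurring with itself.
            counts.update(words_in_session - {w1})  # S121 to S123
        table[w1] = counts
    return table


# Hypothetical short-term sessions for a single word w1.
short_term = {
    "one-piece": [
        [{"one-piece", "floral"}, {"dress", "floral"}],
        [{"one-piece"}, {"one-piece", "dress"}],
    ]
}
table = cooccurrence_frequencies(short_term)
print(table["one-piece"]["dress"], table["one-piece"]["floral"])
```

Because the frequency is a session count, "floral" contributes 1 even though two products in the first sample session contain it.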

Step S102 above is implemented by the IDF extraction processing routine shown in FIG. 11.

In step S130, one word w1 is retrieved from the word group dictionary database 30. In step S131, words w2 whose co-occurrence frequency with the word w1 is equal to or greater than the threshold M are extracted from the co-occurrence frequency input buffer 231 as synonym candidates, and the word appearance frequency (DF) is calculated for each synonym-candidate word w2.

Then, in step S132, the word IDF of each word w2 is calculated from the word appearance frequency of each word w2 calculated in step S131, and any word w2 whose word IDF is equal to or less than the threshold L is removed from the synonym candidates of the word w1. In the next step S133, the synonym candidates remaining after the removal in step S132 are written to the synonym output unit 40 as the synonyms of the word w1.

Then, in step S134, it is determined whether the above processing has been executed for all words in the word group dictionary database 30. If not all words in the word group dictionary database 30 have been processed, the process returns to step S130 and continues. If all words in the word group dictionary database 30 have been processed, the IDF extraction processing routine ends.
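Steps S130 to S134 can be sketched as below. The sketch follows the definition given in the claims, where the word IDF of w2 is the reciprocal of the sum of the number of words w1 whose co-occurrence frequency with w2 is at least the first threshold and a predetermined positive constant. The function name `filter_by_idf`, the parameter names, the default constant of 1.0, and the sample frequencies are assumptions introduced for illustration.

```python
from typing import Dict, List


def filter_by_idf(
    cooc: Dict[str, Dict[str, int]],
    min_cooc: int,          # threshold M on the co-occurrence frequency
    idf_threshold: float,   # threshold L on the word IDF
    constant: float = 1.0,  # "predetermined positive constant"; this value is assumed
) -> Dict[str, List[str]]:
    """Return, per w1, the candidates w2 whose word IDF exceeds idf_threshold."""
    # S131: synonym candidates are the words w2 with co-occurrence frequency >= M.
    candidates = {
        w1: [w2 for w2, freq in row.items() if freq >= min_cooc]
        for w1, row in cooc.items()
    }
    # DF of w2: the number of words w1 for which w2 is a synonym candidate.
    df: Dict[str, int] = {}
    for words in candidates.values():
        for w2 in words:
            df[w2] = df.get(w2, 0) + 1
    # S132 / S133: drop w2 whose IDF = 1 / (DF + constant) is <= L, keep the rest.
    return {
        w1: sorted(w2 for w2 in words if 1.0 / (df[w2] + constant) > idf_threshold)
        for w1, words in candidates.items()
    }


# Hypothetical co-occurrence frequencies, for illustration only.
cooc = {
    "one-piece": {"dress": 5, "free-shipping": 6, "floral": 1},
    "sneaker": {"trainer": 4, "free-shipping": 7},
    "bag": {"tote": 3, "free-shipping": 5},
}
synonyms = filter_by_idf(cooc, min_cooc=3, idf_threshold=0.3)
print(synonyms)
```

In this sample, "free-shipping" is a candidate for all three words, so its IDF is 1 / (3 + 1) = 0.25 and it is removed, while "dress", "trainer", and "tote" each have an IDF of 1 / (1 + 1) = 0.5 and are kept.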

The synonym output unit 40 then outputs the synonyms written for each word w1.

As described above, the synonym extraction device according to the embodiment of the present invention extracts, for each extraction-target word w1, the sessions containing a predetermined number or more of browsed products that include w1 as short-term sessions; calculates the co-occurrence frequency of each word w2 with w1 in the short-term sessions; and extracts the synonyms of w1 by excluding, from the synonym candidates extracted on the basis of the co-occurrence frequency, any word w2 whose word IDF is equal to or less than the threshold. By restricting the sessions over which appearance frequencies are calculated to those browsed under the same product selection condition (short-term sessions), and by assuming that a synonym has a high co-occurrence frequency only with particular words, synonyms of words included in the descriptions of products on a site can be extracted accurately.
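The effect summarized above can be seen on a toy data set. The following Python sketch chains the three stages in one function. It is an illustrative sketch only: the function name `synonym_pipeline`, the parameter names, the constant of 1.0, and the sessions are hypothetical, and candidate words w2 are taken from the session contents rather than from a separate dictionary.

```python
from collections import Counter


def synonym_pipeline(sessions, vocabulary, n, m, idf_threshold, constant=1.0):
    """Chain S100 -> S101 -> S102 over sessions given as lists of word sets."""
    cooc = {}
    for w1 in vocabulary:
        # S100: short-term sessions, where at least n browsed products contain w1.
        short_term = [s for s in sessions if sum(w1 in p for p in s) >= n]
        # S101: number of those sessions in which each other word w2 appears.
        counts = Counter()
        for s in short_term:
            counts.update(set().union(*s) - {w1})
        cooc[w1] = counts
    # S102: candidates have co-occurrence >= m; words with low IDF are removed.
    candidates = {
        w1: {w2 for w2, freq in row.items() if freq >= m} for w1, row in cooc.items()
    }
    df = Counter(w2 for words in candidates.values() for w2 in words)
    return {
        w1: sorted(w2 for w2 in words if 1.0 / (df[w2] + constant) > idf_threshold)
        for w1, words in candidates.items()
    }


# Hypothetical browsing sessions; "sale" appears alongside every kind of product.
toy_sessions = [
    [{"one-piece", "sale"}, {"one-piece", "dress", "sale"}],
    [{"one-piece", "dress"}, {"one-piece", "sale"}, {"dress"}],
    [{"sneaker", "sale"}, {"sneaker", "trainer"}],
    [{"sneaker", "trainer", "sale"}, {"sneaker"}],
    [{"one-piece"}, {"sneaker"}],
]
toy_result = synonym_pipeline(
    toy_sessions, ["one-piece", "sneaker"], n=2, m=2, idf_threshold=0.4
)
print(toy_result)
```

Here "sale" co-occurs twice with both target words, so its IDF is 1 / (2 + 1) and falls below the threshold of 0.4, whereas "dress" and "trainer" each co-occur with only one target word and survive with an IDF of 0.5. The last session contributes nothing because neither target word appears in two or more of its products.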

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the synonym extraction device described above contains a computer system; when a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Session group input unit
20 Synonym extraction unit
21 Short-term session extraction unit
22 Co-occurrence frequency extraction unit
23 IDF extraction unit
30 Word group dictionary database
40 Synonym output unit
100 Synonym extraction device

Claims

A synonym extraction device that extracts synonyms of words included in descriptions of products on a site, the device comprising:
session extraction means for extracting, for each extraction-target word w1, sessions in which a predetermined number or more of browsed products include the extraction-target word w1, based on input session information that includes, for each session, at least one browsed product and the word group included in the description of each browsed product;
co-occurrence frequency extraction means for extracting, for each extraction-target word w1, based on the session information of the sessions extracted for that word w1 by the session extraction means, the number of sessions, among all the sessions extracted for the word w1, in which each word w2 appears, as the co-occurrence frequency of the word w2 with the extraction-target word w1;
IDF extraction means for extracting, for each extraction-target word w1 and for each word w2 for which a co-occurrence frequency with the word w1 has been extracted, based on the co-occurrence frequencies extracted for each extraction-target word w1 by the co-occurrence frequency extraction means, a word IDF that is the reciprocal of the sum of the number of extraction-target words w1 for which the co-occurrence frequency is equal to or greater than a first threshold and a predetermined positive constant; and
synonym extraction means for outputting, for each extraction-target word w1, as synonyms of the extraction-target word w1, the words w2 whose co-occurrence frequency with the word w1 is equal to or greater than the first threshold, excluding any word w2 whose word IDF extracted for the word w1 is equal to or less than a second threshold.

A synonym extraction method performed in a synonym extraction device that includes session extraction means, co-occurrence frequency extraction means, IDF extraction means, and synonym extraction means, and that extracts synonyms of words included in descriptions of products on a site, the method comprising:
extracting, by the session extraction means, for each extraction-target word w1, sessions in which a predetermined number or more of browsed products include the extraction-target word w1, based on input session information that includes, for each session, at least one browsed product and the word group included in the description of each browsed product;
extracting, by the co-occurrence frequency extraction means, for each extraction-target word w1, based on the session information of the sessions extracted for that word w1 by the session extraction means, the number of sessions, among all the sessions extracted for the word w1, in which each word w2 appears, as the co-occurrence frequency of the word w2 with the extraction-target word w1;
extracting, by the IDF extraction means, for each extraction-target word w1 and for each word w2 for which a co-occurrence frequency with the word w1 has been extracted, based on the co-occurrence frequencies extracted for each extraction-target word w1 by the co-occurrence frequency extraction means, a word IDF that is the reciprocal of the sum of the number of extraction-target words w1 for which the co-occurrence frequency is equal to or greater than a first threshold and a predetermined positive constant; and
outputting, by the synonym extraction means, for each extraction-target word w1, as synonyms of the extraction-target word w1, the words w2 whose co-occurrence frequency with the word w1 is equal to or greater than the first threshold, excluding any word w2 whose word IDF extracted for the word w1 is equal to or less than a second threshold.

A program for extracting synonyms of words included in descriptions of products on a site, the program causing a computer to function as:
session extraction means for extracting, for each extraction-target word w1, sessions in which a predetermined number or more of browsed products include the extraction-target word w1, based on input session information that includes, for each session, at least one browsed product and the word group included in the description of each browsed product;
co-occurrence frequency extraction means for extracting, for each extraction-target word w1, based on the session information of the sessions extracted for that word w1 by the session extraction means, the number of sessions, among all the sessions extracted for the word w1, in which each word w2 appears, as the co-occurrence frequency of the word w2 with the extraction-target word w1;
IDF extraction means for extracting, for each extraction-target word w1 and for each word w2 for which a co-occurrence frequency with the word w1 has been extracted, based on the co-occurrence frequencies extracted for each extraction-target word w1 by the co-occurrence frequency extraction means, a word IDF that is the reciprocal of the sum of the number of extraction-target words w1 for which the co-occurrence frequency is equal to or greater than a first threshold and a predetermined positive constant; and
synonym extraction means for outputting, for each extraction-target word w1, as synonyms of the extraction-target word w1, the words w2 whose co-occurrence frequency with the word w1 is equal to or greater than the first threshold, excluding any word w2 whose word IDF extracted for the word w1 is equal to or less than a second threshold.