JP4071657B2

JP4071657B2 - Text processing device

Info

Publication number: JP4071657B2
Application number: JP2003078682A
Authority: JP
Inventors: 哲郎長束
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-03-20
Filing date: 2003-03-20
Publication date: 2008-04-02
Anticipated expiration: 2023-03-20
Also published as: JP2004287816A

Description

【０００１】
【発明の属する技術分野】
本発明は、異表記処理を行い解析した単語の代表表記を、処理対象であるテキストに含まれる表記の出現頻度により決定するテキスト処理装置に関する。
【０００２】
【従来の技術】
従来、大量の文書情報を処理するための技術として、文書検索、文書分類、文書分析などの技術が開示されている。こうした文書処理技術の多くは、文書内のテキストを解析する日本語解析処理を利用している。日本語解析処理の基本は、単語を解析する形態素解析である。形態素解析方法の多くでは、表記の異なる単語を同じ単語として解析する異表記処理を行う。これにより、同じ意味で表記の異なる単語を同じ単語として抽出することができ、その後の処理において異表記による精度の劣化を防ぐことができる。こうした異表記処理は、異表記単語（あるいは同義語）の辞書を用いていたり、表記パターンのルールを適用したりすることで行われる。表記の異なる複数の単語を同じ単語として扱う場合、抽出した単語情報をユーザに提示したりする場合は、その単語の代表的な表記を提示することになる。多くは１種類の代表的な表記を利用する。
【０００３】
特許文献１には、ユーザが指定した関心概念を表すユーザ入力を受け付け、前記電子的に格納された文書を分析してユーザが指定した関心概念を論じている位置を特定し、前記電子的に格納された文書内で前記関心概念を論じている部分の大まかな位置を読者に提供するように、その存在部分に指標を表示することで、希望する情報を確実に検索することができるように読者を補助する技術が開示されている。
【０００４】
特許文献２には、文書を含む大量のデータから特異な特徴を有する概念を抽出することにより、有効な知識を獲得する方法およびシステムと、大量データから注目に値する概念を自動的に見つけ出す方法およびシステムと、大量のデータから有効な特徴的な概念を獲得するにあたり、ユーザビリティに優れた分析方法およびシステムとを提供する技術が開示されている。
【０００５】
【特許文献１】
特開２００１−６０２０６号公報
【特許文献２】
特開２００１−７５９６６号公報
【０００６】
【発明が解決しようとする課題】
従来の技術では、抽出された単語の代表表記として、辞書情報などのあらかじめ決められた表記を利用する。しかし、これでは代表表記が解析対象であるテキストに出現していない、あるいは出現頻度が少ない場合でも、抽出された単語の表記として利用されてしまい、例えば単語情報を提示されたユーザが出現していない、あるいは出現頻度が少ない表記を提示されることで、理解や把握を妨げられてしまうという問題があった。
【０００７】
また、最近ではテキストに含まれる概念をテキストに含まれる単語の組み合わせなどにより表現する概念表現方法も考案されている。このような概念表現方法においては、概念表現を提示する際に、構成している単語の表記が実際のテキストにおける表記と異なると、ユーザの理解や操作を誤らせる原因になる可能性が高い。
【０００８】
本発明は、上記事情に鑑みてなされたものであり、異表記処理を行って解析された単語の代表表記を、処理対象であるテキストに含まれる表記の出現頻度により決定することを目的とする。
【０００９】
また、本発明は、テキストに含まれる単語を解析する際に、異なる表記の複数の単語を１つの単語として解析する形態素解析手段を利用するテキスト処理装置において、解析された単語の代表表記をテキスト内における単語表記の出現頻度に基づいて抽出することを目的とする。
【００１０】
また、本発明は、テキストに含まれる単語を解析する際に、異なる表記の複数の単語を１つの単語として解析する形態素解析手段を利用するテキスト処理装置であって、かつ、テキストに含まれる概念をテキストに含まれる１つ以上の単語を用いて表現する概念表現方法を利用するテキスト処理装置において、解析された単語の代表表記をテキスト内における単語表記の出現頻度に基づいて抽出することを目的とする。
【００１１】
また、本発明は、テキストに含まれる単語を解析する際に、異なる表記の複数の単語を１つの単語として解析する形態素解析手段を利用するテキスト処理装置であって、かつ、テキストに含まれる概念をテキストに含まれる１つ以上の単語を用いて表現する概念表現方法を利用するテキスト処理装置において、概念表現を構成する単語の代表表記を、概念表現と適合するテキスト部分における単語表記の出現頻度に基づいて抽出することを目的とする。
【００１２】
【課題を解決するための手段】
上記目的を達成するために、請求項１に記載の発明は、テキスト処理装置であって、前記テキスト処理装置が、入力された各テキストに対して係り受け解析と、形態素解析と、を行うことにより解析結果を得る言語解析部と、前記言語解析部で得られた解析結果に基づいて、前記テキストを構成する各文節から単語を抽出する単語抽出部と、前記言語解析部で得られた解析結果に基づいて、前記テキストを構成する各文節が、特定の表現を持つ付属語を含む場合に、該文節に付加されている意図である意図表現を抽出する意図表現抽出部と、前記言語解析部で得られた解析結果、前記単語抽出部で抽出した単語、前記意図表現抽出部で抽出した意図表現からなる文節情報と、前記文節に含まれる単語、該単語の表記、該表記の出現頻度が対応づけられた単語リストと、を記憶する記憶部と、前記記憶部が記憶した文節情報を用いて各文節について、単語のみ、あるいは単語と意図表現の組み合わせから構成される概念表現の基本単位を生成し、文節間の係り受け関係に基づいた該概念表現基本単位間の関係表現により、概念表現を生成する概念表現生成部とを有し、前記概念表現を構成する単語を基に、前記記憶部で記憶した単語リストから前記単語に対応づけられた表記の中で出現頻度が高い表記を抽出し、該抽出した表記を、前記概念表現を構成する単語の表記として利用することを特徴としている。
【００１３】
請求項２に記載の発明は、請求項１に記載の発明において、前記出現頻度は、前記言語解析部で得られた解析結果を用いて、前記テキストにおける単語の表記別の出現頻度を算出することを特徴としている。
【００１４】
請求項３に記載の発明は、請求項１に記載の発明において、前記概念表現生成部によって生成した概念表現を構成する単語の表記に、概念表現と適合するテキスト部分における単語の表記のうち出現頻度が高い単語の表記を利用することを特徴としている。
【００２１】
【発明の実施の形態】
以下、本発明の実施の形態を添付図面を参照しながら詳細に説明する。
【００２２】
＜実施例１＞
図１は、実施例１のテキスト処理装置の一実施例の構成を示す図である。テキスト入力部１０１では、テキストを入力する。すでにテキストが記録されている場合はそのテキストを入力とすることもできる。入力するテキストは１つでも複数でもかまわないが、以下複数のテキストが入力されたものとして説明する。
【００２３】
言語解析部１０２では、入力された各テキストに対して、言語解析を行う。実施例１の装置は言語解析処理として、形態素解析を利用するものである。また利用する形態素解析は、異なる表記の単語を１つの単語として解析する形態素解析処理である。異なる表記の単語を１つの単語として解析する処理は、異表記辞書や同義語辞書などの辞書を利用したり、カタカナ表記の単語の表記パターンルールの適用による異表記生成などの方法により実施することができる。図２は、本実施例の言語解析の異表記例を示す図である。
【００２４】
単語表記計数部１０３では、言語解析部１０２で解析された単語情報を用いて、単語リストを生成し、表記別の出現頻度情報を算出する。単語リストの例を図３に示す。単語はＩＤが付与されて管理される。単語の情報として、品詞、出現頻度、表記情報が管理される。単語表記の情報として、異なる表記ごとに表記ＩＤ、表記、表記別頻度を管理する。活用のある単語の場合はその終止形を表記として利用する。
【００２５】
これにより、異なる表記の単語を単語ＩＤにより１つの単語として扱うことができ、さらに、異なる表記ごとの表記ＩＤ、表記、表記別頻度情報も利用することができる。
【００２６】
言語解析結果記憶部１０４では、単語表記計数部１０３で生成された図３の単語情報を記憶、管理する。
【００２７】
単語情報をユーザに提示したりする際には、言語解析結果記憶部に記憶されている単語情報から、単語ＩＤに基づいて表記情報を参照する。たとえば表記別頻度の一番高い表記をその単語の代表表記としてユーザに提示してもよい。また、表記別頻度の高い順に複数の表記を提示する方法も考えられる。
【００２８】
本実施例のテキストデータ処理装置では、テキストに含まれる概念をテキストに含まれる１つ以上の単語を用いて表現する概念表現方法を利用することを前提としている。まず概念表現について説明する。
【００２９】
本実施例にて用いる概念表現は、テキストを言語解析した結果得られる文節あるいは文節間関係情報に基づいている。言語解析としては、たとえば形態素解析、文節係り受け解析を利用することができる。形態素解析はテキストに含まれる単語を分析する。係り受け解析は、テキストに含まれる文節を解析し、文節間の関係として係りと受けの関係にある文節を解析する。「ソフトウェアのインストールが正常に実行できない」というテキストの場合、言語解析の結果として図４のような情報を得ることができる。
【００３０】
図４の「自」は自立語を、「付」は付属語をあらわす。自立語は動詞、形容詞、名詞などの品詞の単語であり、付属語とは助詞、助動詞などの品詞の単語である。通常文節は１個の自立語と、０個以上の付属語で構成される。解析方法によっては、１文節に複数個の自立語が含まれるような結果を出すものもあるが、ここでは、文節にはかならず１個のみの自立語しか含まないように文節を生成する解析方法を利用するものとする。
【００３１】
本実施例で用いる概念表現は、概念表現の基本単位と基本単位間の関係表現により表現される。概念表現の基本単位は、トークンおよび意図表現を利用して表現される。
【００３２】
トークンはそれ自体で１つの意味をあらわす単語であり、自立語を利用することができる。たとえば図４では、「ソフトウェア」「インストール」「正常」「実行」がトークンとなる。トークンの表現はトークンの表記を利用することもできるし、トークンの代表的表記に変換したものを利用することもできる。
【００３３】
意図表現とは、文節内の付属語による意味の付加を表す表現であり、付属語のある特定の表現パターンを抽出することで、その文節に付加されている意図を解析する。たとえば、「〜ない（助動詞）」「〜ず（助動詞）」という表現は「打消」の意味を、「〜できる（補助動詞）」という表現は「可能」の意味を、「〜たい（助動詞）」という表現は「要望」の意味を、文節に対して付加しているとすることができる。たとえば、図４の「実行できない」という文節から「可能」と「打消」の意図表現が抽出される。意図表現はたとえば「（＋打消）」「（＋可能−打消）」というように表現することができる、ここで「＋ＸＸ」はその意図表現が付加されていることを、「−ＸＸ」はその意図表現が付加されていないことを表している。
【００３４】
概念表現の基本単位としては、トークンのみ、意図表現のみ、あるいはトークンと意図表現の組み合わせで表現され、たとえば図５のように表現される。
【００３５】
トークンと意図表現の組み合わせとは、ある文節に指定されたトークンが含まれていて、かつその文節に指定された意図表現が付加されていることを意味する。
【００３６】
基本単位間の関係は、基本単位間に意味的な強い関係があることを表す。意味的な強い関係とは、基本的には係り受け関係にある文節に含まれることを表す。たとえば、基本単位間の関係を「⇒」であらわすものとすると、「情報⇒検索」という概念表現は、係り受け関係にある２つの文節において係り文節に「情報」が、受け文節に「検索」が含まれていることを意味する（「情報を検索する」）。基本単位間の関係として文節係り受け関係を利用することで、よく文書検索などで利用される単語の論理式「ソフトウェア＆インストール」のように単にテキスト内の共起出現関係を指定するのではなく、基本単位がテキスト内で意味的に強い関係をもって出現していることを指定することができる。
【００３７】
文節係り受け関係は、ある文節が係り文節になる場合は受け文節は１つのみであるが、複数の係り文節が同じ１つの受け文節に係ることができる（上記例１の文節４は文節２と文節３の受け文節となっている）。そのため、概念表現における基本単位間の関係の表現は複数の係り文節を持つ受け文節という文節間関係を表現する場合と、しない場合の２通りが可能である。
【００３８】
複数の係り文節を持つ受け文節という文節間関係を表現しない場合、概念表現は図６のような基本単位の単純な１次元のリスト表現となる。
【００３９】
複数の係り文節を持つ受け文節という文節間関係を表現する場合、概念表現は図７のような基本単位のツリー表現となる。
【００４０】
複数の係り文節を持つ受け文節という文節間関係を表現しない場合、概念表現はユーザにとって簡単でわかりやすく、表現の拡張などの操作も行いやすいが、複雑な文節係り受け関係構造の表現ができない問題がある。複数の係り文節を持つ受け文節という文節間関係を表現する場合、複雑な文節係り受け関係構造も表現できるが、ユーザにとっては複雑でわかりにくく、操作も行いにくいと考えられる。両場合とも利用することができるが、以降の実施例では、ユーザにとってわかりやすく操作もしやすい複数の係り文節を持つ受け文節という文節間関係を表現しない場合の表現方法を用いることとして説明する。
【００４１】
複数の係り文節を持つ受け文節という文節間関係を表現しない場合のテキストから生成することのできる概念表現例を図８に示す。
【００４２】
＜実施例２＞
図９は、実施例２のテキスト処理装置における一実施例の構成を示す図である。テキスト入力部２０１では、テキストを入力する。すでにテキストが記録されている場合はそのテキストを入力とすることもできる。入力するテキストは１つでも複数でもかまわないが、以下入力テキストは１つとして説明する。
【００４３】
言語解析部２０２では、入力された各テキストに対して、形態素解析と係り受け解析の言語解析を行う。形態素解析ではテキストに含まれる単語を解析する。係り受け解析ではテキストに含まれる文、文節を解析し、文節間の関係として係りと受けの関係にある文節を解析する。たとえば、「ソフトウェアのインストールが正常に実行できない。」という文を解析した場合の解析結果例は図４に示す。単語の区切りを「／」で表している。また各単語の上の「自」は自立語を、「付」は付属語を表す。
【００４４】
通常文節は１つの自立語を含む。１つの文節に複数の自立語を含むように解析する処理方法もあるが、本実施例では、文節には必ず１つの自立語だけを含むように解析する方法を利用するものとする。
【００４５】
トークン抽出部２０３では、言語解析部２０２によって解析された各文節からトークンを抽出する。文節内の単語情報から、自立語品詞である単語を抽出してトークンとする。図４のテキスト例からは図１０が抽出される。
【００４６】
意図表現抽出部２０４では、言語解析部２０２によって解析された各文節から意図表現を抽出する。文節内の単語情報から、特定の表現パターンを抽出し、意図表現情報を生成する。たとえば「打消」「要望」「疑問」「可能」という意図表現は、図１１のような単語あるいは表現パターンが含まれている場合に抽出することができる。図４のテキスト例の場合、意図表現として「文節４意図表現：＋可能＋打消以下」が抽出される。
【００４７】
テキストデータ構造記憶部２０５では、言語解析部によって解析されたテキストの構造と、トークン抽出手段、意図表現抽出手段で抽出されたトークンおよび意図表現の情報を記憶する。
【００４８】
テキストは図１２に示すようなデータ構造で記憶される。各構成要素は図１３に示す情報を保持している。本実施例では、図３に示すように、テキストに含まれる単語に対してユニークな識別子を付与した単語リストを生成し、単語の管理を行うものとする。その際、単語の表記情報（表記ＩＤ、表記、表記別頻度）も管理する。
【００４９】
図１３に各構成要素が保持する情報を示す。各テキストはユニークなＩＤを付与されて管理される。各テキストはテキストに含まれる文ＩＤリストを管理し、文は自分の文ＩＤと文に含まれる文節リストを管理する。文節は自分の文節ＩＤと文節に含まれる単語ＩＤ−表記ＩＤリスト、係り文節ＩＤリスト、受け文節ＩＤを管理する。単語ＩＤ−表記ＩＤは図３に示した単語リストにおける単語ＩＤと表記ＩＤの組み合わせである。これにより単語だけでなく、どの表記であったかが管理される。係り文節ＩＤリストは、当該文節を受けとする係り文節のＩＤである。上記例にもあるように、１つの受け文節に対して複数の文節が係り文節となりうるので係り文節ＩＤリストで管理する。受け文節ＩＤは当該文節が係り文節となる受け文節のＩＤである。係り文節は受け文節を１つしかとることができない。また文節はその文節から抽出されたトークンと意図表現も管理する。
【００５０】
文節が管理する情報として、係り受けの関係の種類を保持することも可能である。たとえば連体修飾なのか連用修飾なのか、などである。図１４に文節が保持するデータ例を示す。
【００５１】
概念表現生成部２０６では、テキストデータ構造記憶部２０５に記憶されている文節情報から概念表現を生成する。文節のトークン、意図表現情報から概念表現基本単位を生成し、文節間の係り受け関係に基づいた概念表現基本単位間の関係表現により、概念表現を生成する。複数の係り文節を持つ受け文節という文節間関係を表現しない場合のテキストからはたとえば複数の係り文節を持つ受け文節という文節間関係を表現する場合に示す概念表現を生成することができる。
【００５２】
本実施例では、概念表現を構成する概念表現基本単位のトークンの表記を、テキスト内での表記の出現頻度に基づいて抽出する。
【００５３】
本実施例の装置では、たとえば図３の単語リストにおける単語ＩＤにより、「単語ＩＤ１⇒単語ＩＤ３」という概念表現が生成された場合、単語の表記をテキストにおける表記頻度の高いものを利用する。この例の場合、単語ＩＤ３の単語は表記が３種類（速い、早い、はやい）あるが、たとえば表記頻度の一番高い表記（早い）を利用することで、「処理⇒早い」のような概念表現表記を生成することができる。また、複数表記がある場合は、表記頻度の高い順に複数の表記を併記する方法も考えられる。例えば、「処理⇒（早い、速い、はやい）」等である。
【００５４】
本実施例の装置では、たとえば図３の単語リストにおける単語ＩＤにより、単語ＩＤ１⇒単語ＩＤ３という概念表現が生成された場合、単語の表記を、概念表現と適合するテキスト部分における単語表記の出現頻度を利用する。テキスト全体における出現頻度で表記を決定することもできるし、「単語ＩＤ１⇒単語ＩＤ３」という概念表現が適合するテキスト部分での出現頻度を利用することもできる。つまり、「処理⇒早い」「処理⇒速い」「処理⇒はやい」のどの出現パターンが多いかにより、表記を決定する。たとえばテキスト全体では「早い」という表記が一番多く出現していても、「処理⇒単語ＩＤ３」というパターンでは「速い」という表記が一番多ければ、「単語ＩＤ１⇒単語ＩＤ３」という概念表現の表記は「処理⇒速い」となる。
【００５５】
概念表現記憶部２０７では、概念表現生成部２０６により生成された概念表現を記憶する。
【００５６】
【発明の効果】
以上の説明から明らかなように、本発明によれば、テキストに含まれる単語を解析する際に、異なる表記の複数の単語を１つの単語として解析する形態素解析手段を利用するテキスト処理装置において、解析された単語の代表表記をテキスト内における単語表記の出現頻度に基づいて抽出することにより、抽出された単語あるいはそれを利用した情報をユーザに提示する、あるいは操作させる場合に、実際にテキストに含まれる単語表記を提示することができる。
【００５７】
また、本発明によれば、テキストに含まれる単語を解析する際に、異なる表記の複数の単語を１つの単語として解析する形態素解析手段を利用するテキスト処理装置であって、かつ、テキストに含まれる概念をテキストに含まれる１つ以上の単語を用いて表現する概念表現方法を利用するテキスト処理装置において、解析された単語の代表表記をテキスト内における単語表記の出現頻度に基づいて抽出することにより、概念表現情報をユーザに提示する、あるいは操作させる場合に、実際にテキストに含まれる単語表記を提示することができる。
【００５８】
また、本発明によれば、テキストに含まれる単語を解析する際に、異なる表記の複数の単語を１つの単語として解析する形態素解析手段を利用するテキスト処理装置であって、かつ、テキストに含まれる概念をテキストに含まれる１つ以上の単語を用いて表現する概念表現方法を利用するテキスト処理装置において、概念表現を構成する単語の代表表記を、概念表現と適合するテキスト部分における単語表記の出現頻度に基づいて抽出することにより、概念表現情報をユーザに提示する、あるいは操作させる場合に、実際に概念表現と適合するテキスト部分に出現する単語表記を提示することができる。
【図面の簡単な説明】
【図１】実施例１のテキスト処理装置における一実施例の構成を示す図である。
【図２】実施例１の言語解析においての異表記例を示す図である。
【図３】単語リストの例を示す図である。
【図４】「ソフトウェアのインストールが正常に実行できない」というテキストの場合の言語解析の結果を示す図である。
【図５】概念表現の基本単位を表現する一実施例を示す図である。
【図６】複数の係り文節を持つ受け文節という文節間関係を表現しない場合の概念表現を示す図である。
【図７】複数の係り文節を持つ受け文節という文節間関係を表現する場合の概念表現を示す図である。
【図８】複数の係り文節を持つ受け文節という文節間関係を表現しない場合のテキストから生成することのできる概念表現例を示す。
【図９】実施例２のテキスト処理装置における一実施例の構成を示す図である。
【図１０】実施例２の方法で図４のテキスト例から抽出される情報を示す図である。
【図１１】「打消」「要望」「疑問」「可能」という意図表現を抽出する際の意図表現情報の一実施例を示す図である。
【図１２】テキストを記憶する際のデータ構造を示す図である。
【図１３】各構成要素が保持する情報を示す図である。
【図１４】文節が保持するデータ例を示す図である。
【符号の説明】
１０１、２０１テキスト入力部
１０２、２０２言語解析部
１０３単語表記計数部
１０４言語解析結果記憶部
２０３トークン抽出部
２０４意図表現抽出部
２０５テキストデータ構造記憶部
２０６概念表現生成部
２０７概念表現記憶部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a text processing apparatus that determines the representative notation of a word, analyzed with different notation processing, from the appearance frequency of the notations contained in the text being processed.
[0002]
[Prior art]
Conventionally, techniques such as document search, document classification, and document analysis have been disclosed as techniques for processing a large amount of document information. Many of these document processing techniques use Japanese analysis processing for analyzing text in a document. The basis of Japanese analysis processing is morphological analysis, which analyzes words. Many morphological analysis methods perform different notation processing, in which words having different notations are analyzed as the same word. As a result, words having the same meaning but different notations can be extracted as the same word, and loss of accuracy due to different notations can be prevented in subsequent processing. Such different notation processing is performed by using a dictionary of different notation words (or synonyms) or by applying notation pattern rules. When a plurality of words having different notations are handled as the same word and the extracted word information is presented to the user, a representative notation of the word is presented. In most cases, a single representative notation is used.
[0003]
Patent Document 1 discloses a technique that accepts a user input representing a concept of interest designated by the user, analyzes an electronically stored document to identify the positions where that concept is discussed, and displays indicators at those positions so as to give the reader the rough location of the parts of the document that discuss the concept, thereby assisting the reader in reliably finding the desired information.
[0004]
Patent Document 2 discloses a technique that provides a method and system for acquiring useful knowledge by extracting concepts having distinctive features from a large amount of data including documents, a method and system for automatically finding noteworthy concepts in a large amount of data, and an analysis method and system with good usability for acquiring useful, characteristic concepts from a large amount of data.
[0005]
[Patent Document 1]
JP 2001-60206 A
[Patent Document 2]
JP 2001-75966 A
[0006]
[Problems to be solved by the invention]
In the conventional technique, a predetermined notation, such as one taken from dictionary information, is used as the representative notation of an extracted word. However, that notation is then used for the extracted word even when it does not appear in the text to be analyzed or appears only rarely. As a result, a user who is shown the word information, for example, is presented with a notation that does not appear in the text or appears only rarely, which hampers understanding and comprehension.
[0007]
Recently, concept expression methods that express a concept contained in a text by, for example, a combination of words contained in the text have also been devised. In such a method, if the notation of the constituent words differs from the notation in the actual text when a concept expression is presented, it is highly likely to mislead the user's understanding and operation.
[0008]
The present invention has been made in view of the above circumstances, and an object thereof is to determine the representative notation of a word analyzed by performing different notation processing based on the appearance frequency of the notation included in the text to be processed.
[0009]
Another object of the present invention is, in a text processing apparatus that uses morphological analysis means for analyzing a plurality of words having different notations as one word when analyzing the words included in a text, to extract the representative notation of an analyzed word based on the appearance frequency of the word notations in the text.
[0010]
Another object of the present invention is, in a text processing apparatus that uses morphological analysis means for analyzing a plurality of words having different notations as one word when analyzing the words included in a text, and that uses a concept expression method for expressing a concept included in the text with one or more words included in the text, to extract the representative notation of an analyzed word based on the appearance frequency of the word notations in the text.
[0011]
Another object of the present invention is, in a text processing apparatus that uses morphological analysis means for analyzing a plurality of words having different notations as one word when analyzing the words included in a text, and that uses a concept expression method for expressing a concept included in the text with one or more words included in the text, to extract the representative notation of the words constituting a concept expression based on the appearance frequency of the word notations in the text portions that match the concept expression.
[0012]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 is a text processing apparatus comprising: a language analysis unit that obtains an analysis result by performing dependency analysis and morphological analysis on each input text; a word extraction unit that extracts a word from each clause constituting the text based on the analysis result obtained by the language analysis unit; an intention expression extraction unit that, based on the analysis result obtained by the language analysis unit, extracts an intention expression, which is the intention added to a clause, when that clause includes an attached word having a specific expression; a storage unit that stores clause information consisting of the analysis result obtained by the language analysis unit, the words extracted by the word extraction unit, and the intention expressions extracted by the intention expression extraction unit, and a word list in which the words included in the clauses, the notations of those words, and the appearance frequencies of those notations are associated with one another; and a concept expression generation unit that, using the clause information stored in the storage unit, generates for each clause a basic unit of a concept expression consisting of a word alone or a combination of a word and an intention expression, and generates a concept expression from relation expressions between those basic units based on the dependency relationships between clauses. Based on the words constituting the concept expression, the apparatus extracts, from the notations associated with each word in the word list stored in the storage unit, a notation having a high appearance frequency, and uses the extracted notation as the notation of the word constituting the concept expression.
[0013]
The invention according to claim 2 is characterized in that, in the invention according to claim 1, the appearance frequency is obtained by calculating, from the analysis result obtained by the language analysis unit, the appearance frequency of each notation of a word in the text.
[0014]
The invention according to claim 3 is characterized in that, in the invention according to claim 1, the notation used for a word constituting the concept expression generated by the concept expression generation unit is the notation that has a high appearance frequency among the notations of that word in the text portions that match the concept expression.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0022]
<Example 1>
FIG. 1 is a diagram illustrating the configuration of an embodiment of the text processing apparatus according to the first embodiment. Text input unit 101 inputs text. If text is already recorded, it can be used as input. There may be one or a plurality of texts to be input, but the following description will be made assuming that a plurality of texts are input.
[0023]
The language analysis unit 102 performs language analysis on each input text. The apparatus according to the first embodiment uses morphological analysis as the language analysis processing. The morphological analysis used is one that analyzes words having different notations as one word. Analyzing words with different notations as one word can be implemented by using dictionaries such as a different notation dictionary or a synonym dictionary, or by generating different notations through notation pattern rules for words written in katakana. FIG. 2 is a diagram illustrating examples of different notations in the language analysis of the present embodiment.
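The two approaches named above (dictionary lookup and a katakana pattern rule) could be sketched roughly as follows. This is only an illustration: the dictionary contents, the choice of key, and the trailing long-vowel rule are assumptions, since FIG. 2 is not reproduced here, and `VARIANT_DICT`, `strip_trailing_long_vowel`, and `word_key` are hypothetical names.

```python
# Hypothetical different-notation dictionary: surface notation -> shared word key.
# 速い / 早い / はやい are the notations the description gives for one word;
# the key chosen here ("はやい") is an arbitrary assumption.
VARIANT_DICT = {"速い": "はやい", "早い": "はやい", "はやい": "はやい"}


def strip_trailing_long_vowel(surface: str) -> str:
    """Toy katakana pattern rule (an assumption): drop a trailing long-vowel mark."""
    if len(surface) > 1 and surface.endswith("ー"):
        return surface[:-1]
    return surface


def word_key(surface: str) -> str:
    """Return the key under which all notations of a word are treated as one word."""
    if surface in VARIANT_DICT:
        return VARIANT_DICT[surface]
    return strip_trailing_long_vowel(surface)
```

With this sketch, `word_key("コンピューター")` and `word_key("コンピュータ")` yield the same key, so both notations would be counted as one word.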
[0024]
The word notation counting unit 103 generates a word list using the word information analyzed by the language analysis unit 102 and calculates appearance frequency information for each notation. An example of the word list is shown in FIG. 3. Each word is assigned an ID and managed under it. The part of speech, appearance frequency, and notation information are managed as word information. As word notation information, a notation ID, the notation, and the per-notation frequency are managed for each distinct notation. For a word that conjugates, its terminal (dictionary) form is used as the notation.
[0025]
Thus, different notation words can be handled as one word by the word ID, and also the notation ID, notation, and notation frequency information for each different notation can be used.
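A minimal sketch of such a word list is shown below, assuming the analyzer already yields a shared word key, a part of speech, and the observed notation for each occurrence. The names `WordEntry` and `build_word_list` and the sample data are hypothetical; notation IDs are left implicit as the insertion order of the counter.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    word_id: int
    pos: str
    # notation -> per-notation frequency
    notation_counts: Counter = field(default_factory=Counter)

    @property
    def frequency(self) -> int:
        """Total appearance frequency of the word over all of its notations."""
        return sum(self.notation_counts.values())


def build_word_list(analyzed):
    """analyzed: iterable of (word_key, part_of_speech, notation) tuples."""
    entries = {}
    for key, pos, notation in analyzed:
        entry = entries.get((key, pos))
        if entry is None:
            # Word IDs are assigned in order of first appearance.
            entry = WordEntry(word_id=len(entries) + 1, pos=pos)
            entries[(key, pos)] = entry
        entry.notation_counts[notation] += 1
    return entries


# Illustrative analysis output (invented for this sketch).
analyzed = [
    ("処理", "noun", "処理"),
    ("はやい", "adjective", "早い"),
    ("はやい", "adjective", "速い"),
    ("はやい", "adjective", "早い"),
]
word_list = build_word_list(analyzed)
hayai = word_list[("はやい", "adjective")]
```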
[0026]
The language analysis result storage unit 104 stores and manages the word information of FIG. 3 generated by the word notation counting unit 103.
[0027]
When presenting the word information to the user, the notation information is referred to based on the word ID from the word information stored in the language analysis result storage unit. For example, the notation with the highest frequency by notation may be presented to the user as the representative notation of the word. A method of presenting a plurality of notations in descending order of the notation frequency is also conceivable.
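Both presentation options described above can be sketched with a per-notation counter. The counts below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def ranked_notations(notation_counts: Counter) -> list:
    """Notations in descending order of per-notation frequency."""
    return [notation for notation, _ in notation_counts.most_common()]


def representative_notation(notation_counts: Counter) -> str:
    """The single most frequent notation, used as the representative notation."""
    return ranked_notations(notation_counts)[0]


# Illustrative per-notation frequencies for one word (not taken from FIG. 3).
counts = Counter({"早い": 5, "速い": 3, "はやい": 1})
best = representative_notation(counts)
ranking = ranked_notations(counts)
```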
[0028]
The text data processing apparatus according to the present embodiment is premised on using a concept expression method that expresses a concept included in a text using one or more words included in the text. First, the concept expression will be described.
[0029]
The concept expression used in this embodiment is based on the clauses, or the inter-phrase relation information, obtained by language analysis of the text. As the language analysis, for example, morphological analysis and clause dependency analysis can be used. Morphological analysis analyzes the words contained in the text. Dependency analysis analyzes the clauses contained in the text and, as the relationship between clauses, identifies which clauses stand in a dependency (係り) and receiving (受け) relationship. For the text “Software installation cannot be executed normally” (「ソフトウェアのインストールが正常に実行できない」), information such as that shown in FIG. 4 can be obtained as the result of language analysis.
[0030]
In FIG. 4, “自” marks an independent word and “付” marks an attached word. Independent words are words whose part of speech is verb, adjective, noun, and the like, and attached words are words whose part of speech is particle, auxiliary verb, and the like. Normally a clause consists of one independent word and zero or more attached words. Some analysis methods produce results in which one clause contains a plurality of independent words, but here it is assumed that an analysis method is used that generates clauses so that each clause always contains exactly one independent word.
[0031]
The concept expression used in the present embodiment is expressed by a basic unit of the concept expression and a relation expression between the basic units. The basic unit of concept expression is expressed using tokens and intention expressions.
[0032]
A token is a word that expresses one meaning by itself, and an independent word can be used. For example, in FIG. 4, “software”, “installation”, “normal”, and “execution” are tokens. The token expression can use the token notation, or can be converted to a representative token notation.
[0033]
The intention expression is an expression representing the meaning added by the attached words in a clause; the intention added to the clause is analyzed by extracting specific expression patterns of the attached words. For example, the expressions “~nai (auxiliary verb)” and “~zu (auxiliary verb)” can be taken to add the meaning “negation” (打消) to the clause, the expression “~dekiru (auxiliary verb)” the meaning “possible”, and the expression “~tai (auxiliary verb)” the meaning “request”. For example, the intention expressions “possible” and “negation” are extracted from the clause “cannot be executed” (実行できない) in FIG. 4. An intention expression can be written, for example, as “(+negation)” or “(+possible −negation)”, where “+XX” indicates that the intention expression XX is added to the clause and “−XX” indicates that it is not added.
[0034]
As a basic unit of the concept expression, it is expressed by only a token, only an intention expression, or a combination of a token and an intention expression, for example, as shown in FIG.
[0035]
The combination of a token and an intention expression means that a specified token is included in a certain clause and the specified intention expression is added to the clause.
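One way to model a basic unit and test it against a clause is sketched below, under the assumption that each clause has already been reduced to one token and a set of intention labels. `BasicUnit` and its fields are hypothetical names; "negation" stands for 打消.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BasicUnit:
    token: Optional[str] = None        # None means "intention expression only"
    required: frozenset = frozenset()  # "+XX": intention must be added to the clause
    forbidden: frozenset = frozenset() # "-XX": intention must not be added

    def matches(self, clause_token: str, clause_intents: set) -> bool:
        """True if the clause contains the token and satisfies the intention constraints."""
        if self.token is not None and self.token != clause_token:
            return False
        return self.required <= clause_intents and not (self.forbidden & clause_intents)


# Clause "実行できない": token 実行 with intentions possible and negation.
clause_token, clause_intents = "実行", {"possible", "negation"}

unit = BasicUnit("実行", required=frozenset({"possible", "negation"}))
strict = BasicUnit("実行", required=frozenset({"possible"}), forbidden=frozenset({"negation"}))
intent_only = BasicUnit(required=frozenset({"negation"}))
```

Here `unit` and `intent_only` match the clause, while `strict`, which corresponds to “(+possible −negation)”, does not.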
[0036]
The relationship between basic units indicates that there is a strong semantic relationship between them. A strong semantic relationship basically means that the units are contained in clauses that stand in a dependency relationship. For example, if the relationship between basic units is written “⇒”, the concept expression “information⇒search” means that, in two clauses in a dependency relationship, the dependency clause contains “information” and the receiving clause contains “search” (as in “search for information”). By using the clause dependency relationship as the relationship between basic units, one can specify that the basic units appear in the text with a strong semantic relationship, rather than merely specifying co-occurrence in the text as with a logical expression of words such as “software & installation”, which is often used in document search.
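The difference between plain co-occurrence and the “⇒” relation can be illustrated with a small sketch. The clause records are a hand-written stand-in for an analysis of 「ソフトウェアのインストールが正常に実行できない」; the link from clause 1 to clause 2 is an assumption, since FIG. 4 is not reproduced here, and the dictionary keys and function names are hypothetical.

```python
# Each clause: its ID, its token, and the ID of its receiving clause (None for the last one).
clauses = [
    {"id": 1, "token": "ソフトウェア", "head": 2},
    {"id": 2, "token": "インストール", "head": 4},
    {"id": 3, "token": "正常", "head": 4},
    {"id": 4, "token": "実行", "head": None},
]


def cooccurs(clauses, a, b):
    """Logical-AND style check: both tokens merely appear somewhere in the text."""
    tokens = {c["token"] for c in clauses}
    return a in tokens and b in tokens


def depends(clauses, dep, head):
    """'dep => head': dep is in a dependency clause whose receiving clause contains head."""
    by_id = {c["id"]: c for c in clauses}
    return any(
        c["token"] == dep
        and c["head"] is not None
        and by_id[c["head"]]["token"] == head
        for c in clauses
    )
```

In this data, ソフトウェア and 実行 co-occur, but “ソフトウェア⇒実行” does not hold because the two clauses are not directly linked.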
[0037]
In the clause dependency relationship, a clause that acts as a dependency clause has only one receiving clause, but a plurality of dependency clauses can depend on the same receiving clause (in the example above, clause 4 is the receiving clause of both clause 2 and clause 3). For this reason, the relationship between basic units in a concept expression can be expressed in two ways: either expressing or not expressing the inter-phrase relationship in which one receiving clause has a plurality of dependency clauses.
[0038]
In the case where the inter-phrase relationship of receiving clauses having a plurality of dependency clauses is not expressed, the conceptual representation is a simple one-dimensional list representation of basic units as shown in FIG.
[0039]
When expressing the inter-phrase relationship of receiving clauses having a plurality of dependency clauses, the concept representation is a tree representation of basic units as shown in FIG.
[0040]
When the inter-phrase relationship of a receiving clause having a plurality of dependency clauses is not expressed, the concept expression is simple and easy for the user to understand, and operations such as extending the expression are easy, but complex clause dependency structures cannot be expressed. When that relationship is expressed, complex clause dependency structures can be expressed, but the expression is likely to be complicated and hard for the user to understand and operate. Either approach can be used, but the following embodiments are described using the approach that does not express this relationship, since it is easier for the user to understand and operate.
[0041]
FIG. 8 shows an example of a conceptual expression that can be generated from a text in a case where the inter-phrase relationship of receiving clauses having a plurality of dependency clauses is not expressed.
[0042]
<Example 2>
FIG. 9 is a diagram illustrating a configuration of one embodiment of the text processing apparatus according to the second embodiment. The text input unit 201 inputs text. If text is already recorded, it can be used as input. There may be one or more texts to be input, but the following description will be made assuming that there is only one input text.
[0043]
The language analysis unit 202 performs morphological analysis and dependency analysis on each input text. Morphological analysis analyzes the words included in the text. Dependency analysis analyzes the sentences and clauses included in the text and, as the relationship between clauses, identifies which clauses stand in a dependency and receiving relationship. For example, FIG. 4 shows an example of the analysis result for the sentence “Software installation cannot be executed normally.” Word boundaries are indicated by “/”. The “自” above a word marks an independent word, and “付” marks an attached word.
[0044]
Normally a clause contains one independent word. Some processing methods analyze a clause so that it may contain a plurality of independent words, but this embodiment uses a method in which each clause always contains exactly one independent word.
[0045]
The token extraction unit 203 extracts a token from each clause analyzed by the language analysis unit 202. From the word information in the clause, the word whose part of speech is that of an independent word is extracted and used as the token. From the text example of FIG. 4, the information shown in FIG. 10 is extracted.
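Token extraction as described could look roughly like this, assuming each clause arrives as a list of (surface, part of speech) pairs. The part-of-speech tag names and `extract_token` are hypothetical; the check for exactly one independent word reflects the assumption stated for this embodiment.

```python
# Parts of speech treated as independent words (tag names are an assumption).
INDEPENDENT_POS = {"noun", "verb", "adjective"}


def extract_token(clause_words):
    """Return the single independent word of a clause as its token."""
    independents = [surface for surface, pos in clause_words if pos in INDEPENDENT_POS]
    if len(independents) != 1:
        raise ValueError("expected exactly one independent word per clause")
    return independents[0]


# Clauses 1 and 4 of the example sentence, hand-tagged for illustration.
clause1 = [("ソフトウェア", "noun"), ("の", "particle")]
clause4 = [("実行", "noun"), ("できる", "auxiliary_verb"), ("ない", "auxiliary_verb")]
```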
[0046]
The intention expression extraction unit 204 extracts an intention expression from each clause analyzed by the language analysis unit 202. A specific expression pattern is extracted from the word information in the clause, and intention expression information is generated. For example, the intention expressions “negation” (打消), “request”, “question”, and “possible” can be extracted when a word or expression pattern such as those shown in FIG. 11 is included. In the case of the text example of FIG. 4, “clause 4 intention expression: +possible +negation” is extracted as the intention expression.
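A rough sketch of pattern-based intention extraction follows. It assumes the analyzer supplies the dictionary form of each attached word, and it uses only the patterns named in the description (ない, ず, できる, たい); patterns for “question” are omitted because FIG. 11 is not reproduced here. `INTENT_PATTERNS`, `extract_intents`, and `format_intents` are hypothetical names.

```python
# Intention label -> attached-word patterns that signal it ("negation" is 打消).
INTENT_PATTERNS = {
    "negation": ("ない", "ず"),
    "possible": ("できる",),
    "request": ("たい",),
}


def extract_intents(attached_words):
    """attached_words: dictionary forms of the attached words in one clause."""
    found = set()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(word in patterns for word in attached_words):
            found.add(intent)
    return found


def format_intents(intents):
    """Render a set of intentions in the '(+XX+YY)' style, sorted for stable output."""
    return "(" + "".join("+" + intent for intent in sorted(intents)) + ")"


# Clause 4 of the example, 実行できない: attached words できる and ない.
clause4_intents = extract_intents(["できる", "ない"])
```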
[0047]
The text data structure storage unit 205 stores the text structure analyzed by the language analysis unit and the token and intention expression information extracted by the token extraction unit and the intention expression extraction unit.
[0048]
The text is stored in a data structure as shown in FIG. 12. Each component holds the information shown in FIG. 13. In this embodiment, as shown in FIG. 3, a word list in which a unique identifier is assigned to each word included in the text is generated, and words are managed through it. The notation information of each word (notation ID, notation, per-notation frequency) is managed as well.
[0049]
FIG. 13 shows the information held by each component. Each text is assigned a unique ID and managed under it. Each text manages the list of IDs of the sentences it contains, and each sentence manages its own sentence ID and the list of clauses it contains. Each clause manages its own clause ID, the list of word ID-notation ID pairs for the words it contains, a dependency clause ID list, and a receiving clause ID. A word ID-notation ID pair is the combination of a word ID and a notation ID in the word list shown in FIG. 3. This records not only which word occurred but also which notation was used. The dependency clause ID list holds the IDs of the clauses that depend on the clause in question. As in the example above, a plurality of clauses can depend on one receiving clause, so they are managed as a list. The receiving clause ID is the ID of the clause on which the clause in question depends. A dependency clause can have only one receiving clause. Each clause also manages the token and intention expressions extracted from it.
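The text, sentence, and clause records described above might be modeled as follows. This is a sketch under assumptions: the field names are hypothetical, objects are nested directly instead of being referenced through ID lists, and the word ID and notation ID values in the sample are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Clause:
    clause_id: int
    words: List[Tuple[int, int]]          # (word ID, notation ID) pairs
    dependent_ids: List[int] = field(default_factory=list)  # clauses that depend on this one
    head_id: Optional[int] = None         # the single receiving clause, if any
    token: Optional[str] = None
    intents: Set[str] = field(default_factory=set)


@dataclass
class Sentence:
    sentence_id: int
    clauses: List[Clause]


@dataclass
class Text:
    text_id: int
    sentences: List[Sentence]


def link(clauses):
    """Fill each dependent_ids list from the head_id of the other clauses."""
    by_id = {c.clause_id: c for c in clauses}
    for c in clauses:
        if c.head_id is not None:
            by_id[c.head_id].dependent_ids.append(c.clause_id)


sample = [
    Clause(1, [(1, 1)], head_id=2, token="ソフトウェア"),
    Clause(2, [(2, 1)], head_id=4, token="インストール"),
    Clause(3, [(3, 1)], head_id=4, token="正常"),
    Clause(4, [(4, 1)], token="実行", intents={"possible", "negation"}),
]
link(sample)
text = Text(1, [Sentence(1, sample)])
```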
[0050]
It is also possible to hold the type of dependency relationship as information managed by the clause, for example whether it is adnominal modification (連体修飾) or adverbial modification (連用修飾). FIG. 14 shows an example of the data held by a clause.
[0051]
The concept expression generation unit 206 generates a concept expression from the phrase information stored in the text data structure storage unit 205. A concept representation basic unit is generated from clause tokens and intention expression information, and a concept representation is generated by a relationship expression between concept representation basic units based on the dependency relationship between clauses. From the text in the case of not expressing the inter-phrase relationship of receiving clauses having a plurality of dependency clauses, for example, it is possible to generate a conceptual expression shown when expressing the inter-phrase relationship of receiving clauses having a plurality of dependency clauses.
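For the list-style representation, one plausible way to derive concept expressions is to follow the receiving-clause links from each clause that nothing depends on. The sketch below is an assumption about how such chains could be produced, not a reproduction of FIG. 8; the clause records and function names are hypothetical.

```python
clauses = [
    {"id": 1, "token": "ソフトウェア", "intents": set(), "head": 2},
    {"id": 2, "token": "インストール", "intents": set(), "head": 4},
    {"id": 3, "token": "正常", "intents": set(), "head": 4},
    {"id": 4, "token": "実行", "intents": {"possible", "negation"}, "head": None},
]


def unit_label(clause):
    """Basic unit label: the token, followed by its intention expressions if any."""
    intents = "".join("+" + i for i in sorted(clause["intents"]))
    return clause["token"] + (f"({intents})" if intents else "")


def chain_expressions(clauses):
    """Build one 'a⇒b⇒c' chain per clause that has no dependency clauses of its own."""
    by_id = {c["id"]: c for c in clauses}
    heads = {c["head"] for c in clauses if c["head"] is not None}
    chains = []
    for c in clauses:
        if c["id"] in heads:
            continue  # start only from clauses that nothing depends on
        labels, cur = [], c
        while cur is not None:
            labels.append(unit_label(cur))
            cur = by_id.get(cur["head"])  # None once the last clause is reached
        chains.append("⇒".join(labels))
    return chains


chains = chain_expressions(clauses)
```

Sub-chains such as “インストール⇒実行(+negation+possible)” could be derived from these by taking contiguous slices.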
[0052]
In the present embodiment, the token notation of the concept expression basic unit constituting the concept expression is extracted based on the appearance frequency of the notation in the text.
[0053]
In the apparatus of the present embodiment, when a concept expression “word ID1 ⇒ word ID3” is generated from the word IDs in the word list of FIG. 3, for example, the notation of each word that appears with high frequency in the text is used. In this example, word ID3 has three notations (three different written forms of “fast”). By using the notation with the highest appearance frequency, a concept expression notation such as “processing ⇒ fast” can be generated. When a word has a plurality of notations, the notations may instead be written together in descending order of appearance frequency, for example “processing ⇒ (fast, fast, fast)”.
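The selection described in this paragraph can be sketched as follows. The word list layout (word ID mapped to per-notation counts), the function names, and the placeholder notations `fast_a`, `fast_b`, and `fast_c` are assumptions made for illustration; they stand in for the three written forms of word ID3 and are not taken from the figures. Ties are broken alphabetically here, which the description does not specify.

```python
from typing import Dict, List

# Hypothetical word list: word ID -> {notation: appearance frequency in the text}
WordList = Dict[int, Dict[str, int]]


def notations_by_frequency(word_list: WordList, word_id: int) -> List[str]:
    """Return the notations of a word, most frequent first."""
    counts = word_list[word_id]
    return sorted(counts, key=lambda notation: (-counts[notation], notation))


def render_concept(word_list: WordList, word_ids: List[int],
                   all_notations: bool = False) -> str:
    """Write a concept expression using notations chosen by frequency.

    With all_notations=True, words with several notations are written as a
    parenthesized list in descending order of frequency.
    """
    parts = []
    for word_id in word_ids:
        ranked = notations_by_frequency(word_list, word_id)
        if all_notations and len(ranked) > 1:
            parts.append("(" + ", ".join(ranked) + ")")
        else:
            parts.append(ranked[0])
    return " ⇒ ".join(parts)


word_list = {
    1: {"processing": 6},
    3: {"fast_a": 2, "fast_b": 5, "fast_c": 1},
}
print(render_concept(word_list, [1, 3]))
print(render_concept(word_list, [1, 3], all_notations=True))
```

With these counts, the first call uses the single most frequent notation of word ID3, and the second lists all three notations from most to least frequent.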
[0054]
In the apparatus according to the present embodiment, when a concept expression “word ID1 ⇒ word ID3” is generated from the word IDs in the word list of FIG. 3, for example, the appearance frequency of the word notation in the text portions that match the concept expression may be used. That is, the notation can be determined either by the appearance frequency in the entire text or by the appearance frequency in the text portions to which the concept expression “word ID1 ⇒ word ID3” applies. In the latter case, the notation is determined by which of the occurrence patterns of “processing ⇒ fast”, one for each of the three notations of word ID3, appears most often. For example, even if one notation of “fast” appears most frequently in the entire text, if a different notation of “fast” is the most frequent in the pattern “processing ⇒ word ID3”, the concept expression “word ID1 ⇒ word ID3” is written as “processing ⇒ fast” using the latter notation.
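The difference between the two counting scopes can be illustrated with the sketch below. It assumes the text has already been reduced to dependency pairs of (word ID, notation) occurrences; that representation, the function names, and the placeholder notations `fast_a` and `fast_b` are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical occurrence record: (word ID, notation as it appeared in the text)
Occurrence = Tuple[int, str]
# A dependency pair: (dependency-side occurrence, receiving-side occurrence)
Pair = Tuple[Occurrence, Occurrence]


def notation_in_whole_text(pairs: Iterable[Pair], word_id: int) -> Optional[str]:
    """Most frequent notation of a word counted over every occurrence."""
    counts = Counter(
        notation
        for pair in pairs
        for (wid, notation) in pair
        if wid == word_id
    )
    if not counts:
        return None
    return min(counts, key=lambda n: (-counts[n], n))


def notation_in_pattern(pairs: Iterable[Pair], src_id: int, dst_id: int) -> Optional[str]:
    """Most frequent notation of dst_id counted only where src_id => dst_id matches."""
    counts = Counter(
        dst_notation
        for (s_id, _), (d_id, dst_notation) in pairs
        if s_id == src_id and d_id == dst_id
    )
    if not counts:
        return None
    return min(counts, key=lambda n: (-counts[n], n))


pairs = [
    ((1, "processing"), (3, "fast_a")),
    ((1, "processing"), (3, "fast_a")),
    ((1, "processing"), (3, "fast_b")),
    ((2, "response"), (3, "fast_b")),
    ((2, "response"), (3, "fast_b")),
    ((2, "response"), (3, "fast_b")),
]
print(notation_in_whole_text(pairs, 3))
print(notation_in_pattern(pairs, 1, 3))
```

In this sample, `fast_b` is the most frequent notation of word ID3 overall, but within the pattern “word ID1 ⇒ word ID3” the notation `fast_a` is more frequent, so the two scopes select different notations.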
[0055]
The concept expression storage unit 207 stores the concept expression generated by the concept expression generation unit 206.
[0056]
[Effects of the invention]
As is apparent from the above description, according to the present invention, in a text processing apparatus that uses a morphological analysis unit which, when analyzing the words included in a text, analyzes a plurality of words having different notations as one word, the representative notation of an analyzed word is extracted based on the appearance frequency of the word notations in the text. Consequently, when an extracted word, or information using the extracted word, is presented to or operated on by the user, a word notation actually included in the text can be presented.
[0057]
In addition, according to the present invention, in a text processing apparatus that uses a morphological analysis unit which, when analyzing the words included in a text, analyzes a plurality of words having different notations as one word, and that also uses a concept expression method which expresses a concept included in the text using one or more words included in the text, the representative notation of an analyzed word is extracted based on the appearance frequency of the word notations in the text. Consequently, when concept expression information is presented to or operated on by the user, a word notation actually included in the text can be presented.
[0058]
In addition, according to the present invention, in a text processing apparatus that uses a morphological analysis unit which, when analyzing the words included in a text, analyzes a plurality of words having different notations as one word, and that also uses a concept expression method which expresses a concept included in the text using one or more words included in the text, the representative notation of the words constituting a concept expression is extracted based on the appearance frequency of the word notations in the text portion that matches the concept expression. Consequently, when concept expression information is presented to or operated on by the user, a word notation that appears in the text portion actually matching the concept expression can be presented.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of one embodiment of a text processing apparatus according to a first embodiment.
FIG. 2 is a diagram illustrating an example of different notation in language analysis according to the first embodiment.
FIG. 3 is a diagram illustrating an example of a word list.
FIG. 4 is a diagram illustrating a result of language analysis in the case of a text “Software installation cannot be executed normally”.
FIG. 5 is a diagram illustrating an example of expressing a basic unit of concept expression.
FIG. 6 is a diagram illustrating a concept expression in the case where the inter-phrase relationship of a receiving clause having a plurality of dependency clauses is not expressed.
FIG. 7 is a diagram illustrating a concept expression in the case where the inter-phrase relationship of a receiving clause having a plurality of dependency clauses is expressed.
FIG. 8 shows an example of a concept expression that can be generated from text in a case where the inter-phrase relation of receiving clauses having a plurality of dependency clauses is not expressed.
FIG. 9 is a diagram illustrating a configuration of one embodiment of a text processing apparatus according to a second embodiment.
FIG. 10 is a diagram illustrating information extracted from the text example of FIG. 4 by the method of the second embodiment.
FIG. 11 is a diagram illustrating an example of intention expression information when extracting intention expressions of “cancellation”, “request”, “question”, and “possible”.
FIG. 12 is a diagram showing a data structure when text is stored.
FIG. 13 is a diagram illustrating information held by each component.
FIG. 14 is a diagram illustrating an example of data held in a phrase.
[Explanation of symbols]
101, 201 Text input unit
102, 202 Language analysis unit
103 Word notation counting unit
104 Language analysis result storage unit
203 Token extraction unit
204 Intention expression extraction unit
205 Text data structure storage unit
206 Concept expression generation unit
207 Concept expression storage unit

Claims

A text processing device comprising:
a language analysis unit that obtains an analysis result by performing dependency analysis and morphological analysis on each input text;
a word extraction unit that, based on the analysis result obtained by the language analysis unit, extracts words from each clause constituting the text;
an intention expression extraction unit that, based on the analysis result obtained by the language analysis unit, extracts an intention expression, which is an intention added to a clause, when that clause constituting the text includes an attached word having a specific expression;
a storage unit that stores the analysis result obtained by the language analysis unit, phrase information composed of the words extracted by the word extraction unit and the intention expressions extracted by the intention expression extraction unit, and a word list that associates each word included in the phrases with the notations of the word and the appearance frequency of each notation; and
a concept expression generation unit that, using the phrase information stored in the storage unit, generates for each phrase a concept expression basic unit composed of a word alone or of a combination of a word and an intention expression, and generates a concept expression as a relational expression between concept expression basic units based on the dependency relationships between phrases,
wherein, for each word constituting the concept expression, a notation having a high appearance frequency is extracted from the word list stored in the storage unit, and the extracted notation is used as the notation of that word constituting the concept expression.

The text processing device according to claim 1, wherein the appearance frequency is the appearance frequency of each notation of the word in the text, calculated using the analysis result obtained by the language analysis unit.

The text processing device described, wherein, among the notations of the word in the text portion to which a concept expression applies, the notation having a high appearance frequency is used as the notation of the word constituting the concept expression generated by the concept expression generation unit.