JP6502279B2

JP6502279B2 - Outlier location extraction device, method and program

Info

Publication number: JP6502279B2
Application number: JP2016035300A
Authority: JP
Inventors: 早苗藤田; 正嗣服部
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-02-26
Filing date: 2016-02-26
Publication date: 2019-04-17
Anticipated expiration: 2036-02-26
Also published as: JP2017151849A

Description

The present invention relates to a technique for determining and/or extracting, from among at least one fragment contained in a text, a fragment whose difficulty deviates by at least a predetermined criterion.

Research on estimating the difficulty of text has a long history. Most of the proposed methods, however, aim to estimate the difficulty of a text as a whole. They cannot be used to support writing by feeding back information about individual sentences, words, or turns of phrase.

For example, Non-Patent Document 1 uses only character bigrams as features for estimating difficulty, so the target text must contain at least a certain amount of characters. Non-Patent Document 1 reports that, when 25 or more effective character bigrams are available, difficulty can be estimated with a high correlation coefficient of 0.9 or more (see, for example, Non-Patent Document 1).

Non-Patent Document 2 proposes a multiple regression formula that estimates the target school grade (difficulty) from two variables: the average number of predicates per sentence and the proportion of hiragana in the whole text. This approach has the advantage that only a small amount of text is needed for estimation. However, because it uses only the predicate count and the hiragana proportion, it cannot give feedback on, for example, the difficulty of individual words (see, for example, Non-Patent Document 2).

[Non-Patent Document 1] Kensuke Kojima (小島健輔) and two others, "Difficulty Estimation of Japanese Text Using a Character Bigram Model," Proceedings of the 15th Annual Meeting of the Association for Natural Language Processing, March 2009, pp. 897-900.
[Non-Patent Document 2] Hideko Shibasaki (柴崎秀子) and one other, "Construction of a Grade-Level Formula for Text Difficulty in Elementary and Junior High Schools Based on Japanese Language Textbooks," Journal of the Japan Society for Educational Technology, Vol. 33, No. 4, pp. 449-458, 2010.

Conventional difficulty estimation techniques were proposed for estimating the difficulty of a text as a whole. They could neither indicate that a part of the text fails to match the required difficulty nor give feedback on the parts that do not match.

An object of the present invention is to provide an outlier location extraction device, method, and program that determine that a fragment whose difficulty deviates by at least a predetermined criterion, among at least one fragment contained in a text, is an outlier location of the text, and/or extract that fragment as an outlier location of the text.

An outlier location extraction device according to one aspect of the present invention comprises: a fragment difficulty estimation unit that estimates the difficulty class of each fragment contained in an input text, the fragments being obtained by dividing the text in predetermined units; an overall difficulty estimation unit that estimates the difficulty class of the text; and an outlier location extraction unit that, based on a comparison between the estimated difficulty class of each fragment and the estimated difficulty class of the text, extracts outlier locations, which are fragments of the text whose difficulty is far from the estimated difficulty class of the text. The fragment difficulty estimation unit estimates the likelihood that each fragment belongs to each difficulty class, based on the frequency of occurrence of each word n-gram contained in the fragment and on the occurrence probability of each word obtained in advance for each difficulty class, and takes the difficulty class with the highest likelihood as the estimated difficulty class of the fragment. The overall difficulty estimation unit estimates the likelihood that the text belongs to each difficulty class, based on the frequency of occurrence of each word n-gram contained in the text and on the occurrence probability of each word obtained in advance for each difficulty class, and takes the difficulty class with the highest likelihood as the estimated difficulty class of the text.

An outlier location extraction device according to another aspect of the present invention comprises: a fragment difficulty estimation unit that estimates the difficulty class of each fragment contained in an input text, the fragments being obtained by dividing the text in predetermined units; and an outlier location extraction unit that, based on a comparison between the estimated difficulty class of each fragment and a predetermined difficulty class, extracts outlier locations, which are fragments of the text whose difficulty is far from the predetermined difficulty class. The fragment difficulty estimation unit estimates the likelihood that each fragment belongs to each difficulty class, based on the frequency of occurrence of each word n-gram contained in the fragment and on the occurrence probability of each word obtained in advance for each difficulty class, and takes the difficulty class with the highest likelihood as the estimated difficulty class of the fragment.
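The likelihood-based estimation described in these aspects can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes unigrams as the word n-grams, a log-likelihood sum as the likelihood, and a small floor probability for unseen words. The function name, the floor value, and the probability table are all hypothetical.

```python
import math
from collections import Counter

def estimate_difficulty_class(ngrams, class_probs, floor=1e-6):
    """Return the difficulty class with the highest log-likelihood.

    ngrams:      word n-grams (here, unigram strings) of a fragment or a text.
    class_probs: {difficulty_class: {ngram: occurrence probability}}.
    floor:       assumed probability for n-grams unseen in a class
                 (an illustrative choice, not taken from the source).
    """
    counts = Counter(ngrams)
    best_class, best_score = None, -math.inf
    for cls in sorted(class_probs):
        probs = class_probs[cls]
        # Frequency-weighted sum of log occurrence probabilities.
        score = sum(freq * math.log(probs.get(gram, floor))
                    for gram, freq in counts.items())
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Toy occurrence probabilities (illustrative values only).
CLASS_PROBS = {
    3: {"きっかけ": 0.02, "親子": 0.01, "契機": 0.00001},
    11: {"きっかけ": 0.005, "親子": 0.01, "契機": 0.01},
}

fragment_class = estimate_difficulty_class(["契機"], CLASS_PROBS)
text_class = estimate_difficulty_class(["きっかけ", "親子"], CLASS_PROBS)
```

The same function serves both units in this sketch: it is applied to one fragment's n-grams for the fragment difficulty estimation unit and to all n-grams of the text for the overall difficulty estimation unit.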

According to the present invention, a fragment whose difficulty deviates by at least a predetermined criterion, among at least one fragment contained in a text, can be determined to be an outlier location of the text and/or extracted as an outlier location of the text.

FIG. 1 is a block diagram illustrating an example of the outlier location extraction device.
FIG. 2 is a block diagram illustrating an example of the outlier location extraction device.
FIG. 3 is a flowchart illustrating an example of the outlier location extraction method.
FIG. 4 is a diagram showing an example of first-appearance difficulty classes.
FIG. 5 is a diagram showing an example of the familiarity of fragments.
FIG. 6 is a diagram showing an example of the familiarity corresponding to each difficulty class.
FIG. 7 is a diagram showing an example of the average number of words in each difficulty class.
FIG. 8 is a diagram showing an example of word occurrence probabilities.
FIG. 9 is a diagram showing an example of word occurrence probabilities.

Embodiments of the outlier location extraction device and method are described below with reference to the drawings.

[First Embodiment]
As shown in FIG. 1, the outlier location extraction device of the first embodiment includes, for example, a preprocessing unit 1, a fragment difficulty estimation unit 31, an overall difficulty estimation unit 32, an outlier location extraction unit 4, an alternative expression presentation unit 5, an alternative expression replacement unit 6, a storage unit 71, a storage unit 72, and an alternative expression storage unit 8.

The outlier location extraction method is realized, for example, by the units of the outlier location extraction device performing the processing of steps S1 to S6 in FIG. 3.

<Preprocessing unit 1>
A text is input to the preprocessing unit 1.

The preprocessing unit 1 applies preprocessing such as morphological analysis, syntactic parsing, named entity extraction, and predicate-argument structure analysis to the input text, and outputs the result to the fragment difficulty estimation unit 31 and the overall difficulty estimation unit 32 (step S1). When morphological analysis is performed as the preprocessing, the result of the morphological analysis is output.

Existing techniques can be used for preprocessing such as morphological analysis, syntactic parsing, named entity extraction, and predicate-argument structure analysis.

An example of the result of morphological analysis, which is one form of preprocessing, is described below. The following is the morphological analysis result for the text 「親子のコミュニケーションの契機になるように」 ("so that it becomes a trigger for parent-child communication"). The analyzer output is reproduced in the original Japanese.
「親子」：名詞, 普通名詞, 一般,*,*,*, オヤコ, 親子, 親子, オヤコ, 親子, オヤコ, 和,*,*,*,*
「の」：助詞, 格助詞,*,*,*,*, ノ, の, の, ノ, の, ノ, 和,*,*,*,*
「コミュニケーション」：名詞, 普通名詞, サ変可能,*,*,*, コミュニケーション, コミュニケーション-communication, コミュニケーション, コミュニケーション, コミュニケーション, コミュニケーション, 外,*,*,*,*
「の」：助詞, 格助詞,*,*,*,*, ノ, の, の, ノ, の, ノ, 和,*,*,*,*
「契機」：名詞, 普通名詞, 一般,*,*,*, ケイキ, 契機, 契機, ケーキ, 契機, ケーキ, 漢,*,*,*,*
「に」：助詞, 格助詞,*,*,*,*, ニ, に, に, ニ, に, ニ, 和,*,*,*,*
「なる」：動詞, 非自立可能,*,*, 五段-ラ行, 連体形-一般, ナル, 成る, なる, ナル, なる, ナル, 和,*,*,*,*
「よう」：形状詞, 助動詞語幹,*,*,*,*, ヨウ, 様, よう, ヨー, よう, ヨー, ,*,*,*,*
「に」：助動詞,*,*,*, 助動詞-ダ, 連用形-ニ, ダ, だ, に, ニ, だ, ダ, 和,*,*,*,*

<Fragment difficulty estimation unit 31>
The text and the result of the preprocessing are input to the fragment difficulty estimation unit 31.

The fragment difficulty estimation unit 31 estimates the difficulty class of each fragment contained in the input text (step S31). The estimated difficulty class of each fragment is output to the outlier location extraction unit 4.

The input text consists of a plurality of fragments obtained by dividing the text in predetermined units. A predetermined unit is an element that makes up the text, such as at least one word, a word n-gram, or at least one sentence. In other words, a fragment is an element of the text such as at least one word, a word n-gram, or at least one sentence. Examples of word n-grams are the word unigram (n=1), the word bigram (n=2), and the word trigram (n=3).
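As a small illustration of word n-gram fragments, the sketch below builds unigrams, bigrams, and trigrams from the word sequence of the example sentence. The helper name `word_ngrams` is hypothetical, and the word list is assumed to come from the morphological analysis of the preprocessing step.

```python
def word_ngrams(words, n):
    """Return all contiguous word n-grams of the given word sequence as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Surface forms from the morphological analysis of the example sentence.
WORDS = ["親子", "の", "コミュニケーション", "の", "契機", "に", "なる", "よう", "に"]

unigrams = word_ngrams(WORDS, 1)
bigrams = word_ngrams(WORDS, 2)
trigrams = word_ngrams(WORDS, 3)
```

A sequence shorter than n simply yields an empty list, so short fragments need no special handling in this sketch.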

Because a paragraph consists of at least one sentence, the case in which at least one sentence is used as a fragment includes the case in which each paragraph of the text is used as a fragment.

For example, a feature extraction unit 111 in the fragment difficulty estimation unit 31 extracts a feature of each fragment, and the fragment difficulty estimation unit 31 estimates the difficulty class of the fragment from the extracted feature. For example, the difficulty class corresponding to the extracted feature of each fragment is taken as the difficulty class of that fragment. Any feature may be used as long as it relates to the difficulty of the fragment. For example, when a fragment is at least one sentence, the feature may be any of the following: the average number of words in the fragment, the number of phrases (bunsetsu) in the fragment, the ratio of kanji in the fragment, the ratio of katakana in the fragment, the ratio of hiragana in the fragment, the ratio of kanji and katakana in the fragment, the ratio of active or passive voice in the fragment, or the depth of the syntax tree of the fragment.

The ratio of kanji in a fragment is, for example, (number of kanji characters in the fragment) / (number of characters in the fragment).

The ratio of katakana in a fragment is, for example, (number of katakana characters in the fragment) / (number of characters in the fragment).

The ratio of hiragana in a fragment is, for example, (number of hiragana characters in the fragment) / (number of characters in the fragment).

The ratio of kanji and katakana in a fragment is, for example, (number of kanji and katakana characters in the fragment) / (number of characters in the fragment).

The ratio of active or passive voice in a fragment is, for example, (number of occurrences of active or passive voice in the fragment) / (number of occurrences of verbs in the fragment).
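The character-type ratios can be computed, for instance, as in the following sketch. It assumes that character types are identified by Unicode block (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF). The source does not specify how characters are classified, so this classification and the function name are assumptions.

```python
def script_ratios(fragment):
    """Return character-type ratios of a fragment, by character count."""
    total = len(fragment)
    if total == 0:
        return {"kanji": 0.0, "katakana": 0.0, "hiragana": 0.0, "kanji_katakana": 0.0}
    # Classify characters by Unicode block (an assumption of this sketch).
    kanji = sum(1 for ch in fragment if "\u4e00" <= ch <= "\u9fff")
    katakana = sum(1 for ch in fragment if "\u30a0" <= ch <= "\u30ff")
    hiragana = sum(1 for ch in fragment if "\u3040" <= ch <= "\u309f")
    return {
        "kanji": kanji / total,
        "katakana": katakana / total,
        "hiragana": hiragana / total,
        "kanji_katakana": (kanji + katakana) / total,
    }

ratios = script_ratios("親子のコミュニケーションの契機になるように")
```

For the example sentence, which has 21 characters, this counts 4 kanji, 9 katakana (the long-vowel mark falls in the Katakana block), and 8 hiragana.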

When a fragment is a word n-gram, the fragment difficulty estimation unit 31 may obtain the difficulty class of each fragment by referring to information, stored in advance in the storage unit 71, that associates each fragment with its difficulty class. For example, the difficulty class in which a fragment first appears, in other words the lowest difficulty class in which the fragment appears, is stored in the storage unit 71 as the difficulty of that fragment. FIG. 4 shows an example of first-appearance difficulty classes, that is, the difficulty classes in which fragments first appear. The fragment difficulty estimation unit 31 can obtain the difficulty class of each fragment by reading the difficulty class corresponding to that fragment from the storage unit 71.

Alternatively, for example, the difficulty class in which a fragment appears frequently, in other words the difficulty class in which the fragment appears most often, may be stored in the storage unit 71 as the difficulty of that fragment.
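Both kinds of lookup table could be built from a graded corpus along the following lines. This is a sketch under stated assumptions: the corpus format (pairs of a difficulty class and a word list), the function name, and the toy data are hypothetical, and ties for the most frequent class are resolved toward the lower class, which the source does not specify.

```python
from collections import Counter, defaultdict

def build_class_tables(corpus):
    """Build first-appearance and most-frequent difficulty class tables.

    corpus: iterable of (difficulty_class, list_of_words) pairs.
    Returns (first_appearance, most_frequent), both mapping word -> class.
    """
    first_appearance = {}
    counts = defaultdict(Counter)
    for cls, words in corpus:
        for word in words:
            # Lowest class in which the word has been seen so far.
            first_appearance[word] = min(cls, first_appearance.get(word, cls))
            counts[word][cls] += 1
    # Iterating classes in ascending order makes ties resolve to the lower class.
    most_frequent = {word: max(sorted(c), key=c.__getitem__)
                     for word, c in counts.items()}
    return first_appearance, most_frequent

# Toy graded corpus (illustrative only).
CORPUS = [
    (3, ["親子", "きっかけ"]),
    (9, ["親子", "契機"]),
    (11, ["契機", "契機", "親子"]),
]
first_appearance, most_frequent = build_class_tables(CORPUS)
```

In this toy data the two tables disagree for 「契機」: it first appears in class 9 but is most frequent in class 11, which shows why the choice of table matters.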

<Overall difficulty estimation unit 32>
The text and the result of the preprocessing are input to the overall difficulty estimation unit 32.

The overall difficulty estimation unit 32 estimates the difficulty class of the input text as a whole (step S32). For brevity, the difficulty class of the text as a whole is also referred to simply as the "difficulty class of the text."

When the fragment difficulty estimation unit 31 has extracted a feature for each fragment, the overall difficulty estimation unit 32 estimates the difficulty class of the input text using the extracted features. For example, the overall difficulty estimation unit 32 may calculate the average of the features of the fragments contained in the input text and use it as the difficulty class of the text. As another example, when the average sentence length has been calculated as a feature, the difficulty class may be obtained by comparing the average sentence length against thresholds. This makes it possible to assign difficulty classes that reflect the tendency of texts with many long sentences to be more difficult.
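Thresholding the average sentence length might look like the following sketch. The threshold values and the five-class scale are purely hypothetical; the source gives neither the thresholds nor the number of classes used for this variant.

```python
from bisect import bisect_right

# Hypothetical boundaries on the mean number of words per sentence.
# A mean below 8 maps to class 1, 8 up to (not including) 11 to class 2, and so on.
THRESHOLDS = [8, 11, 14, 17]

def class_from_mean_length(sentence_lengths, thresholds=THRESHOLDS):
    """Map the mean sentence length (in words) to a difficulty class."""
    mean = sum(sentence_lengths) / len(sentence_lengths)
    return bisect_right(thresholds, mean) + 1

text_class = class_from_mean_length([10, 12, 11])  # mean is 11
```

Keeping the thresholds in a sorted list lets the class boundaries be tuned against graded texts without changing the function.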

Alternatively, a feature extraction unit 321 in the overall difficulty estimation unit 32 may obtain a feature of the text, and the overall difficulty estimation unit 32 may use the obtained feature as the difficulty class of the text. Any feature may be used as long as it relates to the difficulty of the text. For example, the feature of the text may be the ratio of kanji in the text, the ratio of katakana in the text, the ratio of hiragana in the text, the ratio of kanji and katakana in the text, or the ratio of active or passive voice in the text.

The ratio of kanji in a text is, for example, (number of kanji characters in the text) / (number of characters in the text).

The ratio of katakana in a text is, for example, (number of katakana characters in the text) / (number of characters in the text).

The ratio of hiragana in a text is, for example, (number of hiragana characters in the text) / (number of characters in the text).

The ratio of kanji and katakana in a text is, for example, (number of kanji and katakana characters in the text) / (number of characters in the text).

The ratio of active or passive voice in a text is, for example, (number of occurrences of active or passive voice in the text) / (number of occurrences of verbs in the text).

<Outlier location extraction unit 4>
The difficulty class of each fragment estimated by the fragment difficulty estimation unit 31 and the difficulty class of the text estimated by the overall difficulty estimation unit 32 are input to the outlier location extraction unit 4.

Using difficulty pairs, each consisting of the difficulty class of a fragment estimated by the fragment difficulty estimation unit 31 and the difficulty class of the text estimated by the overall difficulty estimation unit 32, the outlier location extraction unit 4 extracts outlier locations, which are fragments of the text whose difficulty is far from the difficulty class of the text estimated by the overall difficulty estimation unit 32 (step S4). The extracted outlier locations are output to the alternative expression presentation unit 5 and the alternative expression replacement unit 6.

The difficulty class of each fragment and the difficulty class of the text do not necessarily have to be formed into a pair. The description below uses an example in which they are paired and compared, but any comparison method may be used as long as the difference between the difficulty class of each fragment and the difficulty class of the text can be evaluated in order to extract fragments whose difficulty is far from the difficulty class of the text.

For example, the outlier location extraction unit 4 extracts locations that deviate from the difficulty class of the text estimated by the overall difficulty estimation unit 32 by a predetermined threshold or more. That is, when the absolute value of the difference between the difficulty class of a fragment and the estimated difficulty class of the text is equal to or greater than, or strictly greater than, a predetermined threshold (for example, 1), the outlier location extraction unit 4 extracts that fragment as an outlier location.

For example, suppose the estimated difficulty class of the text is 9. As shown in FIG. 8, described later, the word 「契機」 ("trigger") in the example sentence tends to appear at difficulty class 11 and above, so its difficulty class is 11. When the predetermined threshold is 1, the absolute value of the difference between the two difficulty classes, which is 2, is equal to or greater than (and also strictly greater than) the threshold. 「契機」 is therefore extracted as an outlier location.
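The threshold comparison can be sketched as below. The function name and the `strict` flag, which switches between "equal to or greater than" and "strictly greater than", are illustrative choices. Only the values for 「契機」 (class 11, text class 9, threshold 1) come from the example above; the other fragment classes are made up.

```python
def extract_outliers(fragment_classes, text_class, threshold=1, strict=False):
    """Return fragments whose class differs from the text class by the threshold.

    fragment_classes: list of (fragment, difficulty_class) pairs in text order.
    strict: if True, require |difference| > threshold instead of >= threshold.
    """
    outliers = []
    for fragment, cls in fragment_classes:
        gap = abs(cls - text_class)
        if gap > threshold or (not strict and gap == threshold):
            outliers.append(fragment)
    return outliers

# 「契機」 has class 11 as in the example; the other classes are hypothetical.
FRAGMENTS = [("親子", 8), ("コミュニケーション", 9), ("契機", 11)]

strict_outliers = extract_outliers(FRAGMENTS, text_class=9, threshold=1, strict=True)
loose_outliers = extract_outliers(FRAGMENTS, text_class=9, threshold=1)
```

With the strict comparison only 「契機」 is extracted, while the inclusive comparison also flags the hypothetical class-8 fragment, so the choice of comparison directly controls how much feedback the writer receives.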

The outlier location extraction unit 4 may extract outlier locations using an index other than the difficulty class, such as familiarity. When a fragment is a word, the familiarity of the fragment is an index of how familiar the fragment feels to people (see, for example, Reference 1).

[Reference 1] Shigeaki Amano (天野成昭) and one other, "Basic Word Database: Word Familiarity by Word Sense," Gakushu Kenkyusha (学習研究社), 2008.

It is assumed that the familiarity corresponding to each fragment is determined in advance and stored in the storage unit 72, and that the familiarity corresponding to each difficulty class is also stored in the storage unit 72. FIG. 5 shows an example of the familiarity corresponding to each fragment, and FIG. 6 shows an example of the familiarity corresponding to each difficulty class. In the example of FIG. 5, the familiarity corresponding to each difficulty class is a range. The familiarity corresponding to each overall difficulty may thus be a range. In the example of FIG. 6, the familiarity corresponding to each of 12 difficulty classes is shown as a range. At overall difficulties lower than difficulty class 5, all words are highly familiar to adults. For this reason, the same familiarity may be associated with all overall difficulties lower than difficulty class 5.

The familiarity of each fragment can be determined based on an existing database such as that of Reference 1. Familiarity can be defined, for example, on a seven-level scale. For example, a familiarity of 1 indicates that very few people know the word, and a familiarity of 5 or more indicates that 95% or more of adults know it. If the input text is intended for young children, familiarity may be determined using the age in months at which children acquire each word.

When extracting outlier locations using familiarity, the outlier location extraction unit 4 reads, from the storage unit 72, the familiarity corresponding to each fragment and the familiarity corresponding to the difficulty class of the text estimated by the overall difficulty estimation unit 32.

For example, when the familiarity of a fragment is lower than the familiarity corresponding to the difficulty class estimated by the overall difficulty estimation unit 32, the outlier location extraction unit 4 treats that fragment as an outlier location.

In this way, the outlier location extraction unit 4 may extract outlier locations based on a comparison between the familiarity of each fragment and the familiarity corresponding to the difficulty class of the text estimated by the overall difficulty estimation unit 32.

The outlier location extraction unit 4 may also extract outlier locations using both the difficulty pair and the familiarity pair. For example, when the absolute value of the difference between the difficulty class of a fragment and the estimated difficulty class of the text is equal to or greater than, or strictly greater than, a predetermined threshold (for example, 1), and the familiarity of that fragment is lower than the familiarity corresponding to the difficulty class estimated by the overall difficulty estimation unit 32, the outlier location extraction unit 4 treats that fragment as an outlier location.

As described above, the outlier location extraction unit 4 may extract outlier locations using at least one of the familiarity pair and the difficulty pair.
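The familiarity-based test and its combination with the difficulty-class test can be sketched as follows. The sketch assumes that the familiarity for a difficulty class is stored as a (lower, upper) range and that "lower than the familiarity corresponding to the class" means below the lower bound of that range. The table values and function names are hypothetical.

```python
# Hypothetical familiarity ranges per difficulty class: class -> (lower, upper).
CLASS_FAMILIARITY = {9: (5.5, 7.0), 11: (4.5, 7.0)}

def familiarity_outlier(frag_familiarity, text_class, class_familiarity):
    """True if the fragment is less familiar than the range for the text's class."""
    lower, _upper = class_familiarity[text_class]
    return frag_familiarity < lower

def combined_outlier(frag_class, frag_familiarity, text_class,
                     class_familiarity, threshold=1):
    """True only if both the difficulty-class test and the familiarity test hold."""
    far_in_class = abs(frag_class - text_class) >= threshold
    return far_in_class and familiarity_outlier(
        frag_familiarity, text_class, class_familiarity)

# Illustrative values: a class-11 word with familiarity 5.0 in a class-9 text.
flag_by_familiarity = familiarity_outlier(5.0, 9, CLASS_FAMILIARITY)
flag_combined = combined_outlier(11, 5.0, 9, CLASS_FAMILIARITY)
```

Requiring both conditions is the conservative variant. Using either test alone corresponds to the "at least one of" wording and would flag more fragments.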

The outlier location extraction unit 4 may also extract outlier locations using at least one of the difficulty class of each fragment and the familiarity of each fragment. For example, when the difficulty class of a fragment is equal to or higher than a predetermined threshold and the familiarity of that fragment is equal to or lower than a predetermined value, the outlier location extraction unit 4 treats that fragment as an outlier location. This allows outlier locations to be extracted even for words that appear extremely rarely or whose occurrences are skewed.

FIG. 7 shows an example of the average number of words in each difficulty class. Suppose the difficulty class of the input text is 3 and a fragment consisting of one sentence contains 15 words. The word count of this fragment, about 15, greatly exceeds the average word count of about 11 for difficulty class 3. This one-sentence fragment may therefore be treated as an outlier location.
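A sentence-length check of this kind might be sketched as follows. The source says only that the word count "greatly exceeds" the class average, so the 1.25 margin used here is an assumption, as is the function name. The class-3 average of 11 words is taken from the example above.

```python
# Average number of words per sentence for each difficulty class.
# Only the class-3 value (about 11) comes from the example in the text.
MEAN_WORDS = {3: 11}

def is_long_sentence_outlier(word_count, text_class, mean_words, margin=1.25):
    """True if the sentence is longer than the class average by the given margin."""
    return word_count > mean_words[text_class] * margin

flag = is_long_sentence_outlier(15, 3, MEAN_WORDS)  # 15 > 11 * 1.25 = 13.75
```

A relative margin scales with the class average. An absolute margin in words would be an equally plausible reading of the source.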

<Alternative expression presentation unit 5>
The outlier locations are input to the alternative expression presentation unit 5.

The alternative expression presentation unit 5 presents an alternative expression to the user (step S5). An alternative expression is a fragment that has the same meaning as the fragment at an outlier location extracted by the outlier location extraction unit 4 and whose difficulty is closer to the overall difficulty estimated by the overall difficulty estimation unit 32 than the difficulty class of the extracted fragment is. The presentation is made, for example, via a display device such as a CRT or a liquid crystal display. The alternative expression presentation unit 5 may additionally present the outlier location itself to the user.

For example, if the outlier location extraction unit 4 extracts the word "契機" as an outlier location because its difficulty class is too high, the alternative expression presentation unit 5 may present the word "きっかけ", which has a lower difficulty class than "契機", as a replacement candidate.

It is assumed that the alternative expressions corresponding to each fragment are stored in advance in the alternative expression storage unit 8 together with their difficulty classes. Alternative expressions can be created from existing databases such as synonym dictionaries or the ALAGIN language resource and speech resource site (https://alaginrc.nict.go.jp/), which collects similar words and expressions from large corpora. The alternative expression presentation unit 5 reads an appropriate alternative expression by looking up the alternative expression storage unit 8 with the input outlier location as the key.
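The keyed lookup can be sketched as a dictionary from fragment to candidate list. The contents of `ALTERNATIVES`, the candidate "機会" and all class values are hypothetical. The filter keeps only candidates whose class is closer to the target class than the outlier's own class, which is how the description characterizes an alternative expression.

```python
# Hypothetical contents of the alternative expression storage:
# fragment -> list of (alternative expression, difficulty class).
ALTERNATIVES = {
    "契機": [("きっかけ", 3), ("機会", 5)],
}


def candidate_alternatives(outlier, outlier_class, target_class, store=ALTERNATIVES):
    """Return alternatives whose class is closer to target_class than the outlier's."""
    gap = abs(outlier_class - target_class)
    closer = [(alt, cls) for alt, cls in store.get(outlier, [])
              if abs(cls - target_class) < gap]
    # Closest candidates first.
    return sorted(closer, key=lambda item: abs(item[1] - target_class))


candidates = candidate_alternatives("契機", outlier_class=10, target_class=3)
```

A fragment with no stored entry simply yields an empty candidate list.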

The alternative expression presentation unit 5 may ask the user whether to replace the outlier location with the presented alternative expression. When the user accepts the proposal through an input device such as a keyboard, mouse, or touch panel, a correction request signal indicating the acceptance is output from the alternative expression presentation unit 5 to the alternative expression replacement unit 6.

One candidate or several candidates may be presented as replacements. When there are several, the alternative expression presentation unit 5 may let the user choose which of the presented alternative expressions to substitute for the outlier location, or choose to make no correction. When the user indicates, through an input device such as a keyboard, mouse, or touch panel, which candidate to accept or that no correction is to be made, a correction request signal indicating that choice is output from the alternative expression presentation unit 5 to the alternative expression replacement unit 6.

<Alternative Expression Replacement Unit 6>
Outlier locations are input to the alternative expression replacement unit 6. A correction request signal is also input to the alternative expression replacement unit 6.

The alternative expression replacement unit 6 outputs text in which the fragment at the outlier location extracted by the outlier location extraction unit 4 has been replaced with an alternative expression (step S6). The alternative expression is a fragment that has a meaning similar to that of the outlier fragment and whose difficulty class is closer to the difficulty class estimated by the overall difficulty estimation unit 32 than the difficulty class of the outlier fragment is.

When a correction request signal is received from the alternative expression presentation unit 5, the alternative expression replacement unit 6 replaces the outlier location with the alternative expression presented by the alternative expression presentation unit 5 and outputs the corrected text.

It is assumed that the alternative expressions corresponding to each fragment are stored in advance in the alternative expression storage unit 8 together with their difficulty classes. Alternative expressions can be created from existing databases such as synonym dictionaries or the ALAGIN language resource and speech resource site (https://alaginrc.nict.go.jp/), which collects similar words and expressions from large corpora. The alternative expression replacement unit 6 reads an appropriate alternative expression by looking up the alternative expression storage unit 8 with the input outlier location as the key, and performs the replacement.

The device may also be configured to skip the presentation of alternative expressions, automatically replace each outlier location with the closest alternative, and output the corrected text. In that case, the alternative expression presentation unit 5 is unnecessary.
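Automatic replacement with the closest alternative might be sketched as below. The storage contents, the function name, and the use of plain string replacement are assumptions. In particular, a real system would replace fragments at the positions found during extraction rather than every matching substring.

```python
def auto_replace(text, outliers, target_class, store):
    """Replace each outlier with the stored alternative closest to target_class.

    outliers: dict mapping outlier fragment -> its difficulty class.
    store: dict mapping fragment -> list of (alternative, difficulty class).
    """
    for fragment, fragment_class in outliers.items():
        options = [(alt, cls) for alt, cls in store.get(fragment, [])
                   if abs(cls - target_class) < abs(fragment_class - target_class)]
        if not options:
            continue  # nothing closer to the target; leave the fragment as is
        best, _ = min(options, key=lambda item: abs(item[1] - target_class))
        text = text.replace(fragment, best)
    return text


# Hypothetical storage contents and input sentence.
store = {"契機": [("きっかけ", 3), ("機会", 5)]}
revised = auto_replace("これを契機に本を読んだ。", {"契機": 10},
                       target_class=3, store=store)
```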

[Second Embodiment]
The outlier location extraction device and method of the second embodiment extract outlier locations relative to a predetermined difficulty class, rather than relative to the difficulty class of the text estimated by the overall difficulty estimation unit 32. Only the differences from the first embodiment are described below. Description of the parts that are the same as in the first embodiment is omitted.

The predetermined difficulty class is set by the user as appropriate.

Unlike the outlier location extraction device of the first embodiment, the outlier location extraction device of the second embodiment does not include the overall difficulty estimation unit 32 that estimates the difficulty class of the text, as shown in FIG. 2. That is, the outlier location extraction method of the second embodiment does not perform the processing of step S32.

The outlier location extraction unit 4, the alternative expression presentation unit 5, and the alternative expression replacement unit 6 of the second embodiment perform the same processing as in the first embodiment, using the predetermined difficulty class in place of the difficulty class of the text estimated by the overall difficulty estimation unit 32.

That is, the outlier location extraction unit 4 of the second embodiment uses a difficulty pair, consisting of the difficulty class of each fragment estimated by the fragment difficulty estimation unit 31 and the predetermined difficulty class, to extract outlier locations, which are fragments of the text whose difficulty is far from the predetermined difficulty class. In other words, the outlier location extraction unit 4 of the second embodiment extracts these outlier locations by comparing the difficulty class of each fragment estimated by the fragment difficulty estimation unit 31 with the predetermined difficulty class.
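The comparison against a predetermined class can be illustrated with a simple distance test. The description does not say how far a fragment must be from the predetermined class to count as an outlier, so the `max_gap` parameter, the function name, and the example class values below are assumptions.

```python
def outliers_against_target(fragment_classes, target_class, max_gap=2):
    """Return fragments whose class differs from target_class by more than max_gap."""
    return [frag for frag, cls in fragment_classes.items()
            if abs(cls - target_class) > max_gap]


# Hypothetical per-fragment classes; the target is class 3.
found = outliers_against_target({"きっかけ": 3, "契機": 10, "親子": 2},
                                target_class=3)
```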

The alternative expression presentation unit 5 of the second embodiment presents an alternative expression to the user (step S5). The alternative expression is a fragment that has a meaning similar to that of the fragment at the outlier location extracted by the outlier location extraction unit 4 and whose difficulty is closer to the predetermined difficulty class than the difficulty class of the outlier fragment is.

The alternative expression replacement unit 6 of the second embodiment outputs text in which the fragment at the outlier location extracted by the outlier location extraction unit 4 has been replaced with an alternative expression (step S6). The alternative expression is a fragment that has a meaning similar to that of the outlier fragment and whose difficulty class is closer to the predetermined difficulty class than the difficulty class of the outlier fragment is.

In addition, assuming that the familiarity of each fragment and the familiarity corresponding to the predetermined difficulty class are determined in advance, the outlier location extraction unit 4 of the second embodiment may extract outlier locations by comparing the familiarity of each fragment with the familiarity corresponding to the predetermined difficulty class.

In this way, a predetermined difficulty class specified in advance by the user may be used in place of the difficulty class estimated by the overall difficulty estimation unit 32. For example, if the user wants to target the difficulty of about the third year of elementary school, the class representing that difficulty is used as the predetermined difficulty class.

For example, if the user wants to target the difficulty of about the third year of elementary school and also wants to check whether the text as a whole is harder than that level, the overall difficulty estimation unit 32 estimates the overall difficulty. When that check is not needed, it is enough to treat the locations that are far from the difficulty class for about the third year of elementary school as outlier locations, so the overall difficulty estimation by the overall difficulty estimation unit 32 may be omitted.

[Third Embodiment]
The outlier location extraction device and method of the third embodiment estimate the difficulty class of fragments and/or of the text using the frequency of occurrence of word n-grams. Only the differences from the first embodiment are described below. Description of the parts that are the same as in the first embodiment is omitted.

<Fragment Difficulty Estimation Unit 31>
The fragment difficulty estimation unit 31 estimates the difficulty class of each fragment contained in the input text.

For example, when a fragment consists of at least one sentence (specifically, when the fragment is a paragraph or the like), the fragment difficulty estimation unit 31 estimates the difficulty class of each fragment using the frequency of occurrence of each word n-gram.

Based on the frequency of occurrence of each word n-gram contained in each fragment and on the probability of occurrence of each word n-gram obtained in advance for each difficulty class, the fragment difficulty estimation unit 31 estimates the likelihood that each fragment belongs to each difficulty class, and takes the difficulty class with the highest likelihood as the difficulty class of that fragment.

The likelihood L(i|S) that a fragment S belongs to difficulty class i is defined, for example, by equations (3') and (4').

Here, tf·idf(W_j) is the weight of word n-gram W_j; f(W_j, S) is the frequency of occurrence of word n-gram W_j in the fragment; Σ_L f(W_L, S) is the number of word n-grams contained in the fragment; D is the number of predetermined training texts (that is, the sum Σ_{i=1}^{N} |D_i| of the numbers of elements |D_i| of the sets D_i of training texts for the difficulty classes); df_i is the number of training texts in which word n-gram W_j appears; and P_i(W_j) is the probability of occurrence of word n-gram W_j in difficulty class i.
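Equations (3') and (4') themselves do not appear in this text. One form that is consistent with the symbols defined above, and with the later remark that the likelihood cannot be computed when P_i(w_j) = 0, is a tf·idf-weighted sum of log probabilities. The following is an assumed reconstruction for illustration only and should not be read as the published equations:

```latex
\begin{align}
L(i \mid S) &= \sum_{j} \mathrm{tf{\cdot}idf}(W_j)\,\log P_i(W_j) \tag{3'} \\
\mathrm{tf{\cdot}idf}(W_j) &= \frac{f(W_j, S)}{\sum_{L} f(W_L, S)} \cdot \log\frac{D}{df_i} \tag{4'}
\end{align}
```

Under this assumed form, the first factor of (4') is the relative frequency of W_j within the fragment and the second factor down-weights n-grams that occur in many training texts.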

The probability of occurrence P_i(W_j) of word n-gram W_j in difficulty class i is defined, for example, by equation (1). Here, i is a value (i = 1, ..., N) denoting one of the difficulty classes set in advance, and j is a natural number equal to or greater than 1.

Here, N is the number of difficulty classes and is a predetermined positive integer. f(w_j, D_i) is the frequency of occurrence of fragment w_j in D_i.

P_i(w_j) is computed in advance as the feature of each word n-gram and stored in the storage unit 71. The other parameters needed to evaluate equations (3') and (4'), such as D and df_i, are also stored in the storage unit 71. The fragment difficulty estimation unit 31 reads these values from the storage unit 71 and evaluates equations (3') and (4').
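The estimation step, which reads stored parameters, scores every difficulty class and takes the best one, can be sketched as follows. The scoring formula is an assumption: a tf·idf-weighted sum of log probabilities, with tf·idf taken as relative frequency times log(D / df). All names and numeric values are hypothetical, and every stored probability is assumed to be non-zero.

```python
import math


def tf_idf(count, total, num_texts, doc_freq):
    """Assumed weight: relative frequency in the fragment times log(D / df)."""
    return (count / total) * math.log(num_texts / doc_freq)


def log_likelihoods(ngram_counts, probs, doc_freq, num_texts):
    """Score each difficulty class for one fragment.

    ngram_counts: n-gram -> frequency in the fragment, f(W_j, S)
    probs: class -> {n-gram -> P_i(W_j)}, all values assumed non-zero
    doc_freq: n-gram -> number of training texts containing it
    num_texts: D, the number of training texts
    """
    total = sum(ngram_counts.values())
    scores = {}
    for cls, table in probs.items():
        scores[cls] = sum(
            tf_idf(count, total, num_texts, doc_freq[gram]) * math.log(table[gram])
            for gram, count in ngram_counts.items()
        )
    return scores


def estimate_class(ngram_counts, probs, doc_freq, num_texts):
    """Pick the class with the highest score."""
    scores = log_likelihoods(ngram_counts, probs, doc_freq, num_texts)
    return max(scores, key=scores.get)


# Hypothetical stored parameters for two classes and two uni-grams.
probs = {3: {"きっかけ": 0.6, "契機": 0.05}, 10: {"きっかけ": 0.1, "契機": 0.7}}
doc_freq = {"きっかけ": 40, "契機": 10}
best = estimate_class({"契機": 2, "きっかけ": 1}, probs, doc_freq, num_texts=100)
```

Because the score takes a logarithm of each stored probability, a zero entry would raise an error, which is why the stored values are assumed to have been corrected beforehand.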

The outlier location extraction device may include a precalculation unit 9 for computing P_i(w_j) in advance. To avoid the situation in which the likelihood defined by equations (3) and (3') cannot be computed because P_i(w_j) = 0, the precalculation unit 9 may correct P_i(w_j) by applying equation (2) until no P_i(w_j) is 0.

Even when P_i(w_j) is not 0, the feature extraction unit 2 may use the value obtained from equation (2). For example, the feature extraction unit 2 may use the value obtained from equation (2) when it is larger than the value obtained from equation (1).

Note that equation (2), which replaces a value with half the sum of its two neighboring values only when the value is 0, can cause a problem.

For example, suppose that P_d(w_j) for difficulty class d is 0 before correction and that correcting it with equation (2) gives P_d(w_j) = 0.3. The order of b and d is then reversed by the correction: before correction, P_b(w_j) = 0.1 for difficulty class b was greater than P_d(w_j) = 0 for difficulty class d, whereas after correction P_b(w_j) = 0.1 is less than P_d(w_j) = 0.3. A value that is close to 0 and adjacent to a value larger than itself may therefore also be corrected even though it is not 0. Specifically, in the case above, b may also be corrected with equation (2).
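The zero correction described above, replacing each zero with half the sum of its two neighbors and repeating until no zero remains, can be sketched as follows. Several details are assumptions not stated in the description: a missing neighbor at either end of the class range is treated as 0, all updates within one pass use the values from before that pass, and the loop is capped so that an all-zero row cannot iterate forever.

```python
def smooth_zero_probabilities(probs, max_rounds=100):
    """Fill zero entries of a per-class probability list from their neighbours.

    probs: list of P_i(w_j) for classes i = 1..N, in class order.
    """
    p = list(probs)
    if not any(v > 0 for v in p):
        return p  # nothing to propagate from
    for _ in range(max_rounds):
        if all(v > 0 for v in p):
            break
        prev = p[:]  # use pre-pass values for every update in this pass
        for i, v in enumerate(prev):
            if v == 0:
                left = prev[i - 1] if i > 0 else 0.0
                right = prev[i + 1] if i + 1 < len(prev) else 0.0
                p[i] = (left + right) / 2
    return p


smoothed = smooth_zero_probabilities([0.4, 0.0, 0.0, 0.0])
```

This sketch corrects only exact zeros. Extending it to small non-zero values that sit next to a larger neighbor, as the description suggests, would need an additional threshold that the description does not specify.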

When a fragment consists of a single word, that is, in the case of word uni-grams, the probability of occurrence P_i(w_j) of fragment w_j in difficulty class i is the probability of occurrence, in each difficulty class, of a word such as "きっかけ", "契機", or "親子".

FIG. 8 shows an example of the probability of occurrence in each difficulty class for the words "きっかけ", "契機", and "親子". FIG. 9 shows an example for the words "おおかみ", "オオカミ", and "狼", which are the word for wolf written in hiragana, katakana, and kanji, respectively. In FIGS. 8 and 9, the horizontal axis is the difficulty class and the vertical axis is the probability of occurrence. There are 12 difficulty classes, numbered 1 to 12. In this example, a larger class value means a higher difficulty. The 12 classes correspond to school grades from the first year of elementary school to the third year of high school: difficulty class 1 corresponds to the first year of elementary school, difficulty class 2 to the second year of elementary school, and so on up to difficulty class 12, which corresponds to the third year of high school.

FIG. 8 shows that the word "契機" has a high probability of occurrence only in the high difficulty classes 9 to 12. FIG. 9 shows that "おおかみ" has a high probability of occurrence in the low difficulty classes, "オオカミ" in the middle difficulty classes, and "狼" in the high difficulty classes.

When a fragment consists of two words, that is, in the case of word bigrams, the probability of occurrence P_i(W_j) of word n-gram W_j in difficulty class i is the probability of occurrence, in each difficulty class, of a two-word sequence such as "親子の". By computing probabilities of occurrence for fragments of two or more words in this way, the probability of occurrence in each difficulty class can be obtained for compound expressions longer than a word, turns of phrase, and the like.
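Counting word n-grams in a fragment can be sketched as below. The input is assumed to be already segmented into words. Japanese text would first need a morphological analyzer, which is outside this sketch, and the token list shown is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count word n-grams in a list of already-segmented words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical pre-segmented fragment.
tokens = ["親子", "の", "きずな", "親子", "の", "会話"]
bigrams = ngram_counts(tokens, 2)
```

Here `bigrams[("親子", "の")]` gives the frequency f(W_j, S) of that bigram in the fragment.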

Using these probabilities of occurrence, the likelihood for each difficulty class can also be computed for fragments and texts made up of longer n-gram units.

Note that the fragment difficulty estimation unit 31 may estimate the difficulty class of each fragment by the same processing as described in the first embodiment. For example, when a fragment is a word n-gram, the fragment difficulty estimation unit 31 may obtain the difficulty class of each fragment by referring to information, stored in advance in the storage unit 71, that associates each fragment with its difficulty class.

<Overall Difficulty Estimation Unit 32>
The overall difficulty estimation unit 32 estimates the difficulty class of the input text using the frequency of occurrence of each word n-gram (step S32). The estimated difficulty class of the text is output to the outlier location extraction unit 4.

For example, based on the frequency of occurrence of each word n-gram contained in the input text and on the probability of occurrence of each word n-gram obtained in advance for each difficulty class, the overall difficulty estimation unit 32 estimates the likelihood that the input text belongs to each difficulty class, and takes the difficulty class with the highest likelihood as the difficulty class of the input text.

The likelihood L(i|T) that the input text T belongs to difficulty class i is defined, for example, by equations (3) and (4).

Here, tf·idf(W_j) is the weight of word n-gram W_j; f(W_j, T) is the frequency of occurrence of word n-gram W_j in T; Σ_L f(W_L, T) is the number of word n-grams contained in the text; D is the number of predetermined training texts; df_i is the number of training texts in which word n-gram W_j appears; and P_i(W_j) is the probability of occurrence of word n-gram W_j in difficulty class i.

P_i(W_j) is computed in advance as the feature of each word n-gram and stored in the storage unit 71. The other parameters needed to evaluate equations (3) and (4), such as D and df_i, are also stored in the storage unit 71. The overall difficulty estimation unit 32 reads these values from the storage unit 71 and evaluates equations (3) and (4).

The definition and precalculation of P_i(W_j) are the same as described above, so the description is not repeated here.

Note that the overall difficulty estimation unit 32 may estimate the difficulty class of the text by the same processing as described in the first embodiment. For example, the feature extraction unit 321 in the overall difficulty estimation unit 32 may obtain a feature of the text, and the overall difficulty estimation unit 32 may take the obtained feature as the difficulty class of the text.

Note that the overall difficulty estimation unit 32 may also estimate the final difficulty class with a learner that takes all of the following as features: the likelihood for each difficulty class based on word n-grams, obtained by estimating the difficulty class of the input text from the frequency of occurrence of each word n-gram; and the average sentence length of the text, the proportion of kanji, the proportions of passive and active voice, and the like, obtained by the same processing as described in the first embodiment. A known learner such as SVM-RANK can be used. The model for estimating the difficulty class from these features is built from training data in advance and stored.
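Assembling such a feature vector might look like the sketch below, which covers only two of the surface features named above and leaves the learner itself out. The function names, measuring sentence length in characters, splitting sentences on "。", "！" and "？", and detecting kanji via the basic CJK Unified Ideographs block are all assumptions. The voice ratios would need a morphological analyzer and are omitted.

```python
import re

KANJI = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs (basic block only)


def surface_features(text):
    """Return average sentence length (in characters) and kanji ratio."""
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()]
    chars = "".join(sentences)
    if not chars:
        return {"avg_sentence_length": 0.0, "kanji_ratio": 0.0}
    return {
        "avg_sentence_length": len(chars) / len(sentences),
        "kanji_ratio": len(KANJI.findall(chars)) / len(chars),
    }


def feature_vector(text, class_likelihoods):
    """Concatenate per-class likelihoods with surface features for a learner."""
    feats = surface_features(text)
    ordered = [class_likelihoods[c] for c in sorted(class_likelihoods)]
    return ordered + [feats["avg_sentence_length"], feats["kanji_ratio"]]


# Hypothetical likelihood values for two classes.
vector = feature_vector("狼が来た。おおかみがきた。", {1: -2.0, 2: -3.5})
```

The resulting list would then be passed to whichever ranking or classification learner has been trained on the same feature layout.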

[Fourth Embodiment]
Unlike the outlier location extraction device and method of the third embodiment, the outlier location extraction device and method of the fourth embodiment extract outlier locations relative to a predetermined difficulty class, rather than relative to the difficulty class of the text estimated by the overall difficulty estimation unit 32. Only the differences from the third embodiment are described below. Description of the parts that are the same as in the third embodiment is omitted.

Unlike the outlier location extraction device of the third embodiment, the outlier location extraction device of the fourth embodiment does not include the overall difficulty estimation unit 32 that estimates the difficulty class of the text, as shown in FIG. 2. That is, the outlier location extraction method of the fourth embodiment does not perform the processing of step S32.

The outlier location extraction unit 4, the alternative expression presentation unit 5, and the alternative expression replacement unit 6 of the fourth embodiment perform the same processing as in the third embodiment, using the predetermined difficulty class in place of the difficulty class of the text estimated by the overall difficulty estimation unit 32.

That is, the outlier location extraction unit 4 of the fourth embodiment uses a difficulty pair, consisting of the difficulty class of each fragment estimated by the fragment difficulty estimation unit 31 and the predetermined difficulty class, to extract outlier locations, which are fragments of the text whose difficulty is far from the predetermined difficulty class. In other words, the outlier location extraction unit 4 of the fourth embodiment extracts these outlier locations by comparing the difficulty class of each fragment estimated by the fragment difficulty estimation unit 31 with the predetermined difficulty class.

The alternative expression presentation unit 5 of the fourth embodiment presents an alternative expression to the user (step S5). The alternative expression is a fragment that has a meaning similar to that of the fragment at the outlier location extracted by the outlier location extraction unit 4 and whose difficulty is closer to the predetermined difficulty class than the difficulty class of the outlier fragment is.

The alternative expression replacement unit 6 of the fourth embodiment outputs text in which the fragment at the outlier location extracted by the outlier location extraction unit 4 has been replaced with an alternative expression (step S6). The alternative expression is a fragment that has a meaning similar to that of the outlier fragment and whose difficulty class is closer to the predetermined difficulty class than the difficulty class of the outlier fragment is.

In addition, assuming that the familiarity of each fragment and the familiarity corresponding to the predetermined difficulty class are determined in advance, the outlier location extraction unit 4 of the fourth embodiment may extract outlier locations by comparing the familiarity of each fragment with the familiarity corresponding to the predetermined difficulty class.

[Program and Recording Medium]
When the processing in the outlier location extraction device is implemented by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on the computer implements each process on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

[Modifications]
Instead of extracting outlier locations, the outlier extraction device and method described above may determine, by the same processing as above, that a location is an outlier location. The outlier extraction device and method may also determine that a location is an outlier location and then extract the location so determined.

The above outlier extraction device can be regarded as one example of an outlier extraction device comprising: a fragment difficulty estimation unit 31 that estimates the difficulty of at least one fragment contained in the input text, the fragments being obtained by dividing the text into predetermined units; and an outlier location extraction unit 4 that determines a fragment whose difficulty deviates by at least a predetermined criterion, among the at least one fragment, to be an outlier location of the text and/or extracts it as an outlier location of the text.

As long as the outlier location extraction unit 4 can determine a fragment whose difficulty deviates by at least a predetermined criterion, among the at least one fragment, to be an outlier location of the text and/or extract it as an outlier location of the text, it may perform the determination and/or extraction of outlier locations by processing other than that described above.

For example, the outlier location extraction unit 4 may determine and/or extract outlier locations based on a predetermined statistical criterion, such as the following: when the distribution of the difficulties of the fragments is drawn statistically, a fragment whose difficulty lies further toward the tail of that distribution than a predetermined position is treated as an outlier location.
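One possible reading of such a statistical criterion is a z-score cutoff, sketched below. The text does not specify the statistic or the "predetermined position"; the function name `tail_outliers`, the use of the population standard deviation, and the cutoff `k` are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def tail_outliers(difficulties, k=2.0):
    """Return the indices of fragments whose difficulty lies more than
    `k` standard deviations from the mean of all fragment difficulties.
    `k` stands in for the "predetermined position" in the distribution."""
    mu = mean(difficulties)
    sigma = pstdev(difficulties)
    if sigma == 0:
        return []  # every fragment has the same difficulty; nothing is in the tail
    return [i for i, d in enumerate(difficulties) if abs(d - mu) / sigma > k]


# Illustrative difficulty values for eight fragments (e.g. school-grade classes).
indices = tail_outliers([3, 3, 4, 3, 4, 3, 3, 11])
```

In this illustrative input only the last fragment is flagged. A percentile-based cutoff would serve equally well when the difficulties are far from normally distributed.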

The difficulty class of a fragment described above is one example of the difficulty of a fragment. The difficulty of a fragment may take discrete values or continuous values.

In the examples of FIG. 7 and FIG. 8, there are 12 difficulty classes, corresponding to the first grade of elementary school through the third year of high school, but this is only an example. The number of difficulty classes may be N, where N is a predetermined positive integer. The N difficulty classes may also correspond to indicators other than age or school grade. For example, the difficulty classes may correspond to specialized fields such as "general", "specialty 1 (newspaper)", "specialty 2 (patent)", and "specialty 3 (textbook)". This allows the outlier location extraction device and method to be used when writing a manual in a particular specialized field.

The outlier location extraction device need not include the alternative expression presentation unit 5.

Not only words or paragraphs but also pages may be extracted from the text as outlier locations. In this case, each fragment is a page. A text whose difficulty class differs may also be extracted, as an outlier location, from among a plurality of texts. In this case, the difficulty class of the plurality of texts as a whole and the difficulty class of each text, each text being a fragment, are estimated, and a text whose difficulty class is far from the difficulty class of the plurality of texts is extracted as an outlier location. By changing the definitions of "input text" and "fragment" as appropriate in this way, the outlier location extraction device and method may be extended to various applications.
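Because only the definition of "fragment" changes, the same comparison can serve for words, pages, or whole texts. The sketch below assumes that difficulty classes are integers (for example 1 to 12 as in FIG. 7 and FIG. 8) and that "far from" means a class gap larger than a threshold; the function name `outlier_fragments` and the parameter `max_gap` are hypothetical.

```python
def outlier_fragments(fragment_classes, reference_class, max_gap=2):
    """Return the indices of fragments whose estimated difficulty class
    differs from `reference_class` by more than `max_gap` classes.
    A fragment may be a word, a page, or one text in a collection."""
    return [
        i
        for i, c in enumerate(fragment_classes)
        if abs(c - reference_class) > max_gap
    ]


# Four texts in a collection whose overall class is assumed to be 4.
text_classes = [4, 5, 4, 11]
far_texts = outlier_fragments(text_classes, reference_class=4)
```

Here the fourth text is the only one reported, since its class is seven steps away from the class of the collection.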

The processes described above need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

1 Preprocessing unit
2 Feature extraction unit
31 Fragment difficulty estimation unit
32 Overall difficulty estimation unit
4 Outlier location extraction unit
5 Alternative expression presentation unit
6 Alternative expression replacement unit
71 Storage unit
72 Storage unit
8 Alternative expression storage unit

Claims

An outlier location extraction device comprising:
a fragment difficulty estimation unit that estimates a difficulty class of each fragment contained in an input text, the fragments being obtained by dividing the text into predetermined units;
an overall difficulty estimation unit that estimates a difficulty class of the text; and
an outlier location extraction unit that extracts, based on a comparison between the estimated difficulty class of each fragment and the estimated difficulty class of the text, an outlier location that is a fragment of the text having a difficulty far from the estimated difficulty class of the text,
wherein the fragment difficulty estimation unit estimates, based on the appearance frequency of each word n-gram contained in each fragment and the occurrence probability of each word obtained in advance for each difficulty class, the likelihood that each fragment belongs to each difficulty class, and takes the difficulty class with the highest likelihood as the estimated difficulty class of that fragment, and
wherein the overall difficulty estimation unit estimates, based on the appearance frequency of each word n-gram contained in the text and the occurrence probability of each word obtained in advance for each difficulty class, the likelihood that the text belongs to each difficulty class, and takes the difficulty class with the highest likelihood as the estimated difficulty class of the text.

An outlier location extraction device comprising:
a fragment difficulty estimation unit that estimates a difficulty class of each fragment contained in an input text, the fragments being obtained by dividing the text into predetermined units; and
an outlier location extraction unit that extracts, based on a comparison between the estimated difficulty class of each fragment and a predetermined difficulty class, an outlier location that is a fragment of the text having a difficulty far from the predetermined difficulty class,
wherein the fragment difficulty estimation unit estimates, based on the appearance frequency of each word n-gram contained in each fragment and the occurrence probability of each word obtained in advance for each difficulty class, the likelihood that each fragment belongs to each difficulty class, and takes the difficulty class with the highest likelihood as the estimated difficulty class of that fragment.

The outlier location extraction device according to claim 1 or 2, wherein
letting S be each fragment, f(W_j, S) be the appearance frequency of each word n-gram W_j in each fragment, Σ_L f(W_L, S) be the number of word n-grams contained in each fragment, D be the predetermined number of training texts, df_j be the number of the training texts in which word n-gram W_j appears, and P_i(W_j) be the occurrence probability of word n-gram W_j in difficulty class i, the likelihood L(i|S) that each fragment belongs to difficulty class i is defined by the following equation:

letting T be the text, f(W_j, T) be the appearance frequency of each word n-gram W_j in T, Σ_L f(W_L, T) be the number of word n-grams contained in the text, D be the predetermined number of training texts, df_j be the number of the training texts in which word n-gram W_j appears, and P_i(W_j) be the occurrence probability of word n-gram W_j in difficulty class i, the likelihood L(i|T) that the text belongs to difficulty class i is defined by the following equation:

The outlier location extraction device according to any one of claims 1 to 3, wherein
each fragment is a word,
the familiarity (intimacy degree) of a word is an index indicating how familiar the word feels, and the familiarity of each fragment and the familiarity corresponding to the estimated difficulty class of the text or to the predetermined difficulty class are determined in advance, and
the outlier location extraction unit extracts the outlier location based on a comparison between the familiarity of each fragment and the familiarity corresponding to the estimated difficulty class of the text and/or the predetermined difficulty class.

The outlier location extraction device according to any one of claims 1 to 4, wherein any one of the following holds:
the fragment is at least one sentence, and the feature quantity of the fragment is the average number of words constituting the fragment;
the fragment is at least one sentence, and the feature quantity of the fragment is the number of clauses constituting the fragment;
the fragment is at least one sentence, and the feature quantity of the fragment is the ratio of kanji in the fragment;
the fragment is at least one sentence, and the feature quantity of the fragment is the ratio of katakana in the fragment;
the fragment is at least one sentence, and the feature quantity of the fragment is the ratio of hiragana in the fragment;
the fragment is at least one sentence, and the feature quantity of the fragment is the ratio of kanji and katakana in the fragment;
the fragment is at least one sentence, and the feature quantity of the fragment is the ratio of active or passive voice in the fragment; or
the fragment is at least one sentence, and the feature quantity of the fragment is the depth of the syntax tree of the fragment.

The outlier location extraction device according to any one of claims 1 to 5, further comprising
an alternative expression presentation unit that presents to the user an alternative expression, the alternative expression being a fragment that has the same meaning as the extracted outlier-location fragment and whose difficulty is closer than that of the extracted outlier-location fragment to the estimated difficulty class of the text or to the predetermined difficulty class.

The outlier location extraction device according to any one of claims 1 to 5, further comprising
an alternative expression replacement unit that outputs a text in which the extracted outlier-location fragment in the text is replaced by an alternative expression, the alternative expression being a fragment that has the same meaning as the extracted outlier-location fragment and whose difficulty is closer than that of the extracted outlier-location fragment to the estimated difficulty class of the text or to the predetermined difficulty class.

  An outlier location extraction method comprising:
  a fragment difficulty estimation step in which a fragment difficulty estimation unit estimates a difficulty class of each fragment contained in an input text, the fragments being obtained by dividing the text into predetermined units;
  an overall difficulty estimation step in which an overall difficulty estimation unit estimates a difficulty class of the text; and
  an outlier location extraction step in which an outlier location extraction unit extracts, based on a comparison between the estimated difficulty class of each fragment and the estimated difficulty class of the text, an outlier location that is a fragment of the text having a difficulty far from the estimated difficulty class of the text,
  wherein, in the fragment difficulty estimation step, the likelihood that each fragment belongs to each difficulty class is estimated based on the appearance frequency of each word n-gram contained in each fragment and the occurrence probability of each word obtained in advance for each difficulty class, and the difficulty class with the highest likelihood is taken as the estimated difficulty class of that fragment, and
  wherein, in the overall difficulty estimation step, the likelihood that the text belongs to each difficulty class is estimated based on the appearance frequency of each word n-gram contained in the text and the occurrence probability of each word obtained in advance for each difficulty class, and the difficulty class with the highest likelihood is taken as the estimated difficulty class of the text.

  An outlier location extraction method comprising:
  a fragment difficulty estimation step in which a fragment difficulty estimation unit estimates a difficulty class of each fragment contained in an input text, the fragments being obtained by dividing the text into predetermined units; and
  an outlier location extraction step in which an outlier location extraction unit extracts, based on a comparison between the estimated difficulty class of each fragment and a predetermined difficulty class, an outlier location that is a fragment of the text having a difficulty far from the predetermined difficulty class,
  wherein, in the fragment difficulty estimation step, the likelihood that each fragment belongs to each difficulty class is estimated based on the appearance frequency of each word n-gram contained in each fragment and the occurrence probability of each word obtained in advance for each difficulty class, and the difficulty class with the highest likelihood is taken as the estimated difficulty class of that fragment.

A program for causing a computer to function as each unit of the outlier location extraction device according to any one of claims 1 to 7.