JP2011070529A - Document processing apparatus

Info

Publication number: JP2011070529A
Application number: JP2009222602A
Authority: JP
Inventors: Masakazu Fujio; 正和藤尾; Takeshi Eisaki; 健永崎; Juichi Takahashi; 寿一高橋; Takehiro Urano; 雄大浦野
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2009-09-28
Filing date: 2009-09-28
Publication date: 2011-04-07

Abstract

[Problem] To provide a document structure extraction apparatus capable of efficiently extracting bibliographic data.
[Solution] An extraction target metadata definition dictionary 161, which describes the identifiers of the bibliographic information to be extracted, and a metadata-specific feature definition dictionary 162, which describes the features of each metadata item, are provided in an information holding unit 16. Character lines and their order are extracted based on the result of determining the column structure of the document. For each character line, a metadata score is calculated from the metadata-specific features stored in the information holding unit 16. Based on the character line positions, a weighted moving average of the metadata likelihood is calculated from each metadata score. Starting from the first character line, local maxima of the metadata score of any of the metadata to be extracted are searched for, and the boundary position of a bibliographic information label is determined from the valley of the score values between two consecutive local maxima.
[Selected drawing] FIG. 1

Description

The present invention relates to a document processing apparatus that labels electronic documents, as well as a wide variety of academic literature documents digitized by multifunction peripherals or the like, with titles, author names, abstracts, English creator names, English abstracts, and so on, thereby making the provision of academic information services and the reuse of documents more efficient.

In government offices and private companies, a large number of documents are generated every day, both in non-routine work such as pre-sales activities and planning and in routine work such as examinations and applications. These documents are usually merely digitized; their content is not labeled (no metadata is extracted), and they lie dormant in various document management systems without ever being organized. In recent years, however, from the standpoints of compliance, labor productivity, and customer service, there has been growing demand to access accumulated documents efficiently from various viewpoints and to use them for document reuse, knowledge sharing, responding to inquiries, and the like. This requires technology that recognizes the bibliographic information written in a document (title, subtitle, creator, creation date, etc.) and the information characteristic of each document (quotation deadline, order number, etc.), and manages it as metadata associated with the body text.

In addition, academic information service organizations and the like need, as back-end systems for literature search sites, to efficiently extract bibliographic information such as the Japanese title, Japanese creator, English title, English creator, abstract, and English abstract, as well as the body text region and cited reference information, from a wide variety of academic documents (images or electronic documents). At present, bibliographic information is extracted by hand from each academic document, registered in a database, and published on a literature search site. Because the way bibliographic information is presented (its order and its position on the page) differs for each academic document and publication, automation is difficult.
Several techniques for extracting bibliographic metadata from such non-standardized documents have been disclosed. Patent Document 1 discloses a technique for acquiring document structure information, such as tables of contents and chapter structure, based on the character size, character type, line spacing, and leading character position in a document. Non-Patent Document 1 discloses a technique for extracting document structure based on a document model that describes the correspondence between layout structure and logical structure.

Patent Document 1: JP 2001-265762 A

Non-Patent Document 1: Kise, "Document structure analysis based on layout model", IEICE Transactions, 72-D2, 7, pp. 1029-1039

However, a rule-definition-based document structure extraction apparatus such as that of Non-Patent Document 1 can only process documents that satisfy specific layout conditions. Layout analysis rules can cope with a handful of document types, but supporting the hundreds to thousands of types of academic literature handled by an academic information service would incur enormous costs for creating and maintaining the rules.

Patent Document 1, meanwhile, is applicable only to relatively restricted layouts such as tables of contents and chapter structures. It cannot handle cases where the column layout, or the order and position in which bibliographic information such as author names and organization names is written, differs between types of academic literature, or where such information is omitted.

In view of these circumstances, an object of the present invention is to provide a document processing apparatus that extracts bibliographic information such as titles, author names, and abstracts from a wide variety of academic literature documents, thereby making academic information services and document reuse more efficient.

To achieve the above object, the present invention provides a document processing apparatus that comprises an information holding unit and a processing unit, and in which the processing unit extracts bibliographic metadata based on the positions at which character lines appear in a document and on their character strings. The information holding unit holds metadata-specific features for each metadata item. The processing unit extracts character lines and character line positions based on the result of determining the column structure of the document, calculates a metadata score for each character line based on the metadata-specific features, and determines the boundary positions of the bibliographic information based on the metadata scores.

In the present invention, the processing unit further executes, based on the character line positions, a propagation process on the metadata scores of each metadata item; searches, starting from the first character line, for local maxima of the propagated metadata score of any of the metadata to be extracted; determines a boundary position based on the change in the metadata scores between two consecutive local maxima found; and assigns the metadata of the first local maximum to the character lines before the determined boundary position and the metadata of the second local maximum to the character lines after it.

Furthermore, in the present invention, the processing unit executes the metadata score propagation process by smoothing the metadata scores of each metadata item with a weighted moving average calculated on the basis of the character line positions.

Still further, in the present invention, the information holding unit holds, as the metadata-specific features, at least one of the following: the number of characters in a character line, the English personal-name pattern length, the average inter-word distance, and the average inter-character distance.

Preferably, to achieve the above object, the present invention provides a document processing apparatus in which the information holding unit is provided with an extraction target metadata definition dictionary describing the identifiers of the metadata to be extracted and a metadata-specific feature definition dictionary describing the features of each metadata item, and in which the processing unit (1) extracts character lines and their order based on the result of determining the column structure of the document; (2) calculates a metadata score for each character line based on the features of each metadata item stored in association with it in the information holding unit; (3) executes the metadata score propagation process by calculating, based on the character line positions, a weighted moving average of the likelihood of each metadata item; (4) searches, starting from the first character line, for local maxima of the metadata score of any of the metadata to be extracted; and (5) determines the boundary position of the bibliographic information from the valley of the score values between two consecutive local maxima.
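Steps (4) and (5) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `label_lines`, the use of the per-line maximum over all metadata scores (an "envelope") to locate peaks and valleys, and the choice to give the valley line itself to the second label are all assumptions made for illustration. The input is assumed to be the already smoothed scores, one list per metadata item.

```python
from typing import Dict, List


def label_lines(smoothed: Dict[str, List[float]]) -> List[str]:
    """Assign one metadata label per character line from smoothed scores.

    smoothed maps a metadata identifier to its score for every line,
    listed in reading order. All names here are hypothetical.
    """
    names = list(smoothed)
    n = len(smoothed[names[0]])
    # Envelope: the best-scoring metadata and its score at each line.
    best = [max(names, key=lambda m: smoothed[m][i]) for i in range(n)]
    env = [smoothed[best[i]][i] for i in range(n)]

    # Step (4): scan from the first line for local maxima of the envelope.
    peaks = [i for i in range(n)
             if (i == 0 or env[i] > env[i - 1])
             and (i == n - 1 or env[i] >= env[i + 1])]

    # Step (5): cut at the valley between each pair of consecutive peaks.
    result = [""] * n
    start = 0
    for a, b in zip(peaks, peaks[1:]):
        between = range(a + 1, b)
        # Assumption: the valley line is the first line of the second region.
        valley = min(between, key=lambda i: env[i]) if len(between) else b
        for i in range(start, valley):
            result[i] = best[a]
        start = valley
    for i in range(start, n):
        result[i] = best[peaks[-1]]
    return result
```

With title scores `[0.9, 0.6, 0.2, 0.1, 0.0]` and author scores `[0.1, 0.3, 0.4, 0.8, 0.5]`, the peaks fall on lines 0 and 3, the valley on line 2, and the first two lines are labeled as title and the remaining three as author.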

In a more preferred aspect, the present invention provides a document processing apparatus that has at least an input unit capable of receiving document information, an information holding unit, a processing unit, and an output unit, and that extracts bibliographic information based on the positions at which character lines, each consisting of a set of characters, appear in a document and on their character strings. The processing unit extracts character elements from the input document information; generates horizontal and vertical projection component histograms of the extracted character elements; extracts column information from the projection histograms; extracts character lines based on the character order determined by the coordinate values of the character elements and the column information; sorts the character lines into sequence data according to the coordinates of the extracted character lines and the column information of the columns that contain them; calculates, for each sorted character line, a metadata score based on the features of each metadata item stored in advance in the information holding unit; executes the metadata score propagation process for each metadata item based on the character line positions; searches, starting from the first character line, for local maxima of the propagated metadata scores; sets a boundary point based on the change in the metadata scores between two consecutive local maxima found; and assigns the metadata of the first local maximum to the character lines before the boundary point and the metadata of the second local maximum to the character lines after it.
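As a rough illustration of the projection histogram step, the following Python sketch counts how many character bounding boxes cover each coordinate along one axis and then splits the axis wherever an empty gap is wide enough. The box format, the function names `projection_histogram` and `split_columns`, and the `min_gap` threshold are assumptions for illustration; the text above does not specify how gaps are turned into column information.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def projection_histogram(boxes: List[Box], axis: int, size: int) -> List[int]:
    """Count the boxes covering each coordinate along axis (0 = X, 1 = Y)."""
    hist = [0] * size
    for box in boxes:
        lo, hi = box[axis], box[axis + 2]
        for c in range(max(lo, 0), min(hi, size)):
            hist[c] += 1
    return hist


def split_columns(hist: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) spans of occupied runs separated by empty gaps
    that are at least min_gap wide (hypothetical threshold)."""
    spans = []
    start = None
    end = 0
    gap = 0
    for c, v in enumerate(hist):
        if v > 0:
            if start is None:
                start = c
            gap = 0
            end = c + 1
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, end))
                start = None
    if start is not None:
        spans.append((start, end))
    return spans
```

For a page whose characters occupy X coordinates 0 to 3 and 8 to 11, the X-axis histogram has an empty run of width 4 in the middle, so `split_columns(hist, 3)` reports two columns, while a threshold of 5 treats the page as a single column.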

Similarly, in another more preferred aspect, the present invention provides a document processing apparatus that has at least an input unit capable of receiving document information, an information holding unit, a processing unit, and an output unit, and that extracts bibliographic information based on the positions at which character lines, each consisting of a set of characters, appear in a document and on their character strings. The processing unit extracts character elements from the input document information; generates horizontal and vertical projection component histograms based on the coordinates of the extracted character elements; extracts column information from the projection component histograms; extracts character lines based on the character order determined by the coordinate values of the character elements and the column information; sorts the character lines into sequence data according to the coordinates of the extracted character lines and the column information of the columns that contain them; calculates, for each sorted character line, a metadata score based on the features of each metadata item stored in advance in the information holding unit; sets, based on the metadata scores of each character line, the cost of that line being each metadata item; and searches, based on candidate orders of metadata identifiers stored in advance in the information holding unit, for the metadata label sequence that minimizes the total cost over all character lines.
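The minimum-cost label sequence search can be sketched as a small dynamic program. This is an illustrative reading, not the patented procedure: it assumes that each candidate order is a plain list of metadata identifiers, that a line may keep the previous line's label or move to any later label in the order (so labels can be skipped), and that per-line costs are supplied directly. The function name `best_label_sequence` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def best_label_sequence(costs: List[Dict[str, float]],
                        orders: Sequence[Sequence[str]]
                        ) -> Tuple[float, List[str]]:
    """Find the cheapest labeling of all lines that respects one of the orders.

    costs[i][m] is the cost of labeling line i with metadata m.
    costs must contain at least one line.
    """
    best: Tuple[float, List[str]] = (float("inf"), [])
    for order in orders:
        # dp[j]: cheapest (total cost, path) with the current line labeled order[j]
        dp = [(costs[0][m], [m]) for m in order]
        for line in costs[1:]:
            new_dp = []
            for j, m in enumerate(order):
                # The previous line had the same label or an earlier one.
                prev_cost, prev_path = min(dp[:j + 1], key=lambda t: t[0])
                new_dp.append((prev_cost + line[m], prev_path + [m]))
            dp = new_dp
        candidate = min(dp, key=lambda t: t[0])
        if candidate[0] < best[0]:
            best = candidate
    return best
```

The ordering constraint matters when a line looks locally like an earlier label: a third line that scores well as a title is still labeled as abstract if the second line has already been committed to author and the order is title, author, abstract.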

According to the present invention, bibliographic information can be extracted from a wide variety of academic literature without defining a document structure for each publication.

Moreover, even when the number of types of academic literature to be processed increases, bibliographic metadata can be extracted with high accuracy without creating new document structure definitions.

Furthermore, even when multiple documents are bundled together and printed by a multifunction peripheral or the like, bibliographic metadata can be extracted for each document, and the mixed documents can be separated into individual documents.

A diagram showing the schematic configuration of the bibliographic information extraction system according to the first embodiment.
An overall flowchart of the metadata extraction flow according to the first embodiment.
A diagram showing examples of column structure extraction for a cover region according to the first embodiment.
A flowchart outlining the column structure extraction process for a cover region according to the first embodiment.
A flowchart outlining the vertical division process according to the first embodiment.
A diagram showing an example of the X-axis and Y-axis projection component histograms used for column structure extraction of a cover region according to the first embodiment.
A flowchart outlining the character line extraction process according to the first embodiment.
A diagram showing an example of a character line extraction result according to the first embodiment.
A diagram showing an example in which the extracted character lines are ordered and sorted according to the first embodiment.
A diagram showing an example of a table of the metadata-specific feature definition dictionary that defines the features used to calculate the score of each metadata item according to the first embodiment.
A diagram explaining the word break determination method according to the first embodiment.
A diagram showing an example of a table of the metadata-specific feature definition dictionary that defines the score calculation features of each metadata item according to the first embodiment.
A diagram showing an example of a table of the metadata feature definition dictionary that defines the constraint features of each metadata item according to the first embodiment.
A flowchart outlining the calculation of each metadata score for each character line according to the first embodiment.
A diagram showing an example in which the metadata scores are plotted in character line order according to the first embodiment.
A diagram showing an example in which the metadata scores after the weighted moving average are plotted in character line order according to the first embodiment.
A flowchart outlining the metadata region determination process using multiple metadata score graphs according to the first embodiment.
A diagram illustrating the process of determining a boundary position from two local maxima according to the first embodiment.
A diagram showing an example of a metadata score graph for a series of documents obtained by copying multiple publications in succession according to the second embodiment.
A diagram showing an example of the relationship between character line positions and metadata scores according to the third embodiment.
A diagram showing an example of searching for the minimum cost path in the matrix relating character line positions to metadata scores according to the third embodiment.
A diagram showing a data structure that represents the possible metadata orders as a tree according to the third embodiment.
A flowchart outlining the minimum cost label sequence search based on metadata scores according to the third embodiment.
A diagram showing an example of metadata label costs between character lines according to the third embodiment.
A diagram showing an example of a table of the extraction target metadata definition dictionary that defines the metadata to be extracted in the bibliographic information extraction system of each embodiment.

As described in detail below, the present invention provides a document processing apparatus that labels electronic documents, as well as a wide variety of academic literature documents digitized by multifunction peripherals or the like, with titles, author names, abstracts, English creator names, English abstracts, and so on (that is, it extracts the metadata that constitutes bibliographic information), thereby making the provision of academic information services and the reuse of documents more efficient.

In this specification, "bibliographic information" collectively refers to bibliographic information in the narrow sense written in a document (title, subtitle, creator, creation date, etc.), such as the Japanese title, Japanese creator, English title, English creator, abstract, and English abstract in an academic literature document (image or electronic document), as well as information characteristic of each document (quotation deadline, order number, etc.). A "metadata score" is a numerical expression of the likelihood of each metadata item; the larger the value, the more likely the metadata. "Character line position" means the position of a character line in the document, or the relative value of the character line's Y coordinate measured from the top of the document. "Propagation process" is a generic term for smoothing processes such as a simple moving average or a weighted moving average.

<Configuration of metadata extraction device>

FIG. 1 shows the schematic configuration of a metadata assignment apparatus according to the first embodiment, that is, a document processing apparatus constituting a bibliographic information extraction system. The input is assumed to be electronic documents as well as paper documents digitized by a multifunction peripheral or the like.

The bibliographic information extraction system 10 comprises an input unit 11 consisting of an input device 110 and an image input device 111; an output unit consisting of a display device 12 and a printing device 14; a processing unit consisting of a central processing unit (CPU) 13; and a storage unit consisting of a work area 15 and an information holding unit 16.

The work area 15 either already contains an operating system (OS) 151, a communication program 152, and a document processing program 153, or loads them from the information holding unit 16 as needed. The information holding unit 16 holds the various dictionaries required by the document processing program 153.

Examples of the input device 110 include a keyboard, a mouse, and a tablet for entering input data, commands, and the like into the document processing program 153.

Examples of the image input device 111 include a scanner or similar device that captures a document as image data when paper documents are to be processed.

The OS 151 controls the operation of the input unit 11, the display device 12, the CPU 13, the printing device 14, the communication program 152, the document processing program 153, the metadata editing program 154, and other memory and storage devices (not shown). The communication program 152 provides a communication function for acquiring documents to be processed over a network.

The document processing program 153 of this embodiment extracts metadata (bibliographic information) from a document image input through the image input device 111 or from electronic document data acquired by the communication program 152. The metadata editing program 154 provides a function for editing the extracted metadata.

The information holding unit 16, located in memory (not shown), contains an extraction target metadata definition dictionary 161 and a metadata-specific feature definition dictionary 162. As described in detail later, these dictionaries serve as the dictionary database that the document processing program 153 consults when extracting metadata.

The extraction target metadata definition dictionary 161 stores a dictionary containing a list or tree structure of the identifiers that represent the bibliographic information to be extracted. The metadata-specific feature definition dictionary 162 stores a dictionary describing the features of the character lines that are labeled with each metadata item.

The display device 12 is a display or similar device that shows the metadata extraction results of the document processing program 153 and the editing results of the metadata editing program 154. The CPU 13 loads the various programs in the work area 15 in memory (not shown) and executes them in cooperation with the OS 151. The printing device 14 outputs results such as the conversion results that the document processing program 153 produces for input character strings. The communication network 19 provides access to data, work areas, and information holding units on other devices connected to the network, such as a file server 21.

<Overall flow of metadata extraction>
FIG. 2 is an overall flowchart of the bibliographic metadata extraction flow in the first embodiment.

In input P-A101, the document data to be processed is loaded into the work area 15 through the input device 110 of FIG. 1 or the communication network 19. In column structure determination P-A102, the column layout of the input document is determined so that character lines can be extracted correctly even from academic documents with multi-column structures. The details of this determination are described with reference to FIG. 3 onward. In character line extraction P-A103, the character lines that serve as the units of labeling are extracted based on the column structure information (column information) obtained in column structure determination P-A102. In one-dimensional serialization P-A104, the character lines extracted in P-A103 are sorted according to their XY coordinates and according to which of the columns determined in P-A102 contains them. In character line feature setting P-A105, the language and layout features of each character line are set based on the layout features, such as line height, line position, and average inter-character distance, and on the character string patterns and character counts of lines, which are stored in advance in the information holding unit 16 of FIG. 1 in association with each metadata item.
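One-dimensional serialization P-A104 can be pictured as a sort with a composite key. The sketch below is only one possible reading: it assumes that column structure determination has already given each character line the reading-order number of the region that contains it, and that within a region lines are read top to bottom and then left to right. The `TextLine` class, its fields, and the `serialize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    text: str
    x: float      # left edge of the line
    y: float      # top edge of the line
    region: int   # reading-order number of the column region holding the line


def serialize(lines: List[TextLine]) -> List[TextLine]:
    """Order lines by region first, then top to bottom, then left to right."""
    return sorted(lines, key=lambda ln: (ln.region, ln.y, ln.x))
```

Sorting by region before the Y coordinate is what keeps a line low in the left column ahead of a line high in the right column, which a sort on coordinates alone would get wrong.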

In per-metadata score calculation P-A106, the likelihood of each metadata item specified in the extraction target metadata definition dictionary is calculated from the features extracted in P-A105 and saved as a metadata score. In weighted moving average calculation P-A107, a weighted moving average of the metadata scores calculated in P-A106 is computed according to the order obtained in one-dimensional serialization P-A104 and the distances between character lines. In boundary position determination P-A108, the boundary of each metadata region and the metadata label of each region are determined based on the distribution pattern of the metadata scores propagated to neighboring regions by the weighted moving average. As noted above, this specification uses "score propagation process" as a generic term for propagating metadata scores to neighboring regions by a simple moving average, a weighted moving average, or the like.
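The weighted moving average of P-A107 might look like the following Python sketch. The weighting function is an assumption: the text says only that the average depends on line order and inter-line distance, so a triangular kernel that falls to zero at a distance of `window` is used here purely for illustration. The function name `propagate_scores` and the `window` parameter are hypothetical.

```python
from typing import List


def propagate_scores(scores: List[float], positions: List[float],
                     window: float = 2.0) -> List[float]:
    """Smooth one metadata item's scores with a distance-weighted average.

    scores[i]    raw metadata score of character line i, in reading order
    positions[i] position of line i along the one-dimensional sequence
    window       assumed kernel reach (> 0); farther lines get weight 0
    """
    smoothed = []
    for p in positions:
        num = 0.0
        den = 0.0
        for s, q in zip(scores, positions):
            w = max(0.0, 1.0 - abs(q - p) / window)  # triangular weight
            num += w * s
            den += w
        # den is at least 1 because each line has weight 1 on itself
        smoothed.append(num / den)
    return smoothed
```

For example, `propagate_scores([0, 0, 1, 0, 0], [0, 1, 2, 3, 4])` returns `[0.0, 0.25, 0.5, 0.25, 0.0]`: the isolated score on the middle line spreads to its immediate neighbors, which is the propagation effect the boundary determination relies on.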

With the above processing flow, this embodiment can properly extract bibliographic metadata using a small dictionary of extraction target metadata, even from hundreds to thousands of academic documents with non-standardized layouts. Extraction is also faster than when the order is defined for each publication and matching is performed against each publication template. Furthermore, even when multiple documents are bundled together and printed by a multifunction peripheral or the like, metadata can be extracted for each document, and the mixed documents can be sorted into individual documents.

The details of each process of FIG. 2 in this embodiment are described below with reference to FIGS. 3 to 18.

<Column structure determination P-A102>
As the example in FIG. 3 shows, the way bibliographic information is presented in academic documents varies by academic society and publisher. The cover layouts F-A001 to F-A003, shown in the upper row at the left, center, and right of FIG. 3, are schematic representations of bibliographic data layouts found in academic documents. The column structures F-A101 to F-A103 in the lower row of the figure are the corresponding column structures, and the circled numbers 1, 2, 3, and so on are region numbers. For example, on cover F-A001, the “title”, “author name”, “English title”, and “English author name” are laid out in a single column, after which the “abstract” begins at the top of the left column of a double-column area. On cover F-A002, the “title”, “author name”, and “abstract” are laid out in a single column, after which the authors' affiliation appears in the lower-left region of the double-column area. On cover F-A003, the entire page is double-column, and the “title”, “author name”, and “abstract” are laid out from the top of the left column. In addition to these layout variations, some academic documents carry explicit keywords that mark the start of bibliographic data, such as “paper”, “abstract”, or “keywords”, while others carry no such keywords at all. In the first embodiment of bibliographic metadata extraction, these layout variations are determined first, and the character lines are then expressed as one-dimensional series data.

The detailed flow of the column structure determination P-A102 in FIG. 2 is described below with reference to FIGS. 4 and 5. FIG. 4 is a flowchart outlining the column structure determination process for the cover region. Step P-B001 is the vertical division of the document region, and step P-B002 is the horizontal division of the document region. Examples of vertical division are the split between regions 1 and 2 of column structure F-A103 in FIG. 3, the split between regions 2 and 3 of column structure F-A101, and the split between regions 2 and 3 of column structure F-A102.

First, in the vertical division process P-B001, vertical division is attempted based on the histogram of the projection of the in-page elements onto the X axis. The division succeeds or fails depending on the histogram pattern (the details of this division process are described with reference to FIG. 5). If it succeeds, the page is judged to be double-column and the process ends. If it fails, the horizontal division process P-B002 is executed, and if that succeeds, the vertical division process P-B003 is executed again on the lower region produced by the division. Following processing flow P-B101 yields a column structure like F-A103 in FIG. 3, and following processing flow P-B102 yields a column structure like F-A101 or F-A102 in FIG. 3.

FIG. 5 shows the detailed flow of the vertical region division process P-B001 of FIG. 4. In process P-C001, the projections onto the X axis of the text, ruled-line, and image elements in the cover page are extracted, and a projection histogram is generated. The frequency in each histogram bin is the sum of the extents of the elements in the Y-axis direction. Next, in process P-C002, the histogram generated in process P-C001 is propagated to adjacent regions. A smoothing process such as a simple moving average or a weighted moving average can be used for this propagation. Smoothing is performed to remove the concave points that would otherwise act as noise in the next process, P-C003. In process P-C003, the X coordinates that are candidates for division are searched for. Specifically, the histogram is scanned in ascending order of X to find points where its shape is concave (process P-C004). Then, in process P-C005, the X coordinate is judged to be a valid division point if the histogram value at the concave point, relative to the histogram value at the immediately preceding convex point, is at or below a fixed division threshold. If no X coordinate satisfies the division threshold, the division fails. Satisfying the division threshold here means that the ratio of the histogram value at the concave point to that at the immediately preceding convex point is at or below a fixed value. A projection histogram onto the Y axis can be generated in the same way, so its description is omitted.
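The steps P-C001 to P-C005 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the integer bounding-box format `(x0, y0, x1, y1)`, the window size, and the ratio threshold of 0.2 are all assumptions made for the example, and a simple moving average stands in for whichever smoothing is actually used.

```python
def projection_histogram(boxes, page_width):
    """P-C001: each element box (x0, y0, x1, y1) adds its Y extent to every X bin it covers."""
    hist = [0.0] * page_width
    for x0, y0, x1, y1 in boxes:
        for x in range(max(0, x0), min(page_width, x1)):
            hist[x] += y1 - y0
    return hist


def smooth(hist, window=3):
    """P-C002: simple moving average (centered window, truncated at the edges)."""
    half = window // 2
    out = []
    for i in range(len(hist)):
        seg = hist[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out


def find_split_x(hist, ratio_threshold=0.2):
    """P-C003 to P-C005: scan in ascending X and return the first concave point
    whose value divided by the preceding peak is <= ratio_threshold, else None."""
    if not hist:
        return None
    peak = hist[0]  # highest value seen since the previous concave point
    for i in range(1, len(hist) - 1):
        peak = max(peak, hist[i])
        is_valley = hist[i] < hist[i - 1] and hist[i] <= hist[i + 1]
        if is_valley:
            if peak > 0 and hist[i] / peak <= ratio_threshold:
                return i
            peak = hist[i]  # shallow valley: start tracking the next peak
    return None  # division failed


# Two hypothetical text columns separated by a gap at x = 8..11.
two_columns = [(0, 0, 8, 100), (12, 0, 20, 100)]
split_x = find_split_x(smooth(projection_histogram(two_columns, 20)))
```

The same functions apply to horizontal division if the roles of X and Y in the boxes are swapped.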

FIG. 6 shows an example of the projection histograms onto the X and Y axes described above. Histogram F-B001 in the upper part of FIG. 6 is an example of a (smoothed) projection histogram onto the X axis. Histogram F-B002 in the lower part of the figure is an example of a (smoothed) projection histogram onto the Y axis. Suppose the concave point at X coordinate F-B101 is being tested as a division point (processes P-C003 to P-C005 in FIG. 5). The immediately preceding convex point is point F-B102, and the ratio used for the judgment in process P-C005 is β1/α1, the frequency β1 at point F-B101 divided by the frequency α1 at point F-B102. The horizontal division process P-B002 works the same way using the projection histogram onto the Y axis. For example, when the concave point at Y coordinate F-B201 is tested as a division point, horizontal division is performed if the ratio β2/α2 is at or below a threshold specified in advance.

The method detailed above makes it possible to determine column structures such as those shown as F-A101, F-A102, and F-A103 in FIG. 3.

<Character line extraction P-A103>
The character line extraction process P-A103 uses the result of the column structure determination process P-A102 of FIGS. 4 and 5 together with the XY coordinates of each character element.

FIG. 7 is a flowchart outlining the character line extraction P-A103 in FIG. 2. In process P-D001, the set of characters in the cover page is sorted based on the XY coordinates and the column structure. For example, if the XY coordinates are expressed with the origin at the lower left, the characters are sorted by the following keys, in this order:
(1) the order of the column region containing the character (from upper left to lower right)
(2) ascending Y coordinate of the character
(3) descending X coordinate of the character

Next, in process P-D002, character lines are extracted starting from the first character. The characters are traversed from the beginning, and each one is appended to the current character line until a new character line is detected. When a line break occurs, a new character line is started.
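Process P-D002 can be pictured with the short sketch below. The text does not say how a line break is detected, so the rule used here (a change of column region, or a jump in Y larger than a tolerance `y_tol`) is purely an assumption for illustration, as are the tuple layout and the function name.

```python
def extract_lines(sorted_chars, y_tol=2.0):
    """Group already-sorted characters into character lines.

    sorted_chars: list of (char, column_index, x, y) in reading order.
    A new line starts when the column changes or Y moves by more than y_tol
    (assumed break rule; the source does not specify one).
    """
    lines = []
    for ch, column, x, y in sorted_chars:
        current = lines[-1] if lines else None
        same_line = (
            current is not None
            and current["column"] == column
            and abs(current["y"] - y) <= y_tol
        )
        if same_line:
            current["text"] += ch  # keep extending the current character line
        else:
            lines.append({"text": ch, "column": column, "y": y})  # line break
    return lines


chars = [("a", 1, 0, 10), ("b", 1, 5, 10), ("c", 1, 0, 20), ("d", 2, 0, 20)]
line_texts = [line["text"] for line in extract_lines(chars)]
```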

An example of character lines is described with reference to FIG. 8. FIG. 8 shows the character lines extracted for column structure F-A101 of FIG. 3. Element F-C301 represents a character, and the number inside it is the position of that character in the order produced by process P-D001. Character line F-C201 is a character line in column region 1. Because a line break occurs between character 9 and character 10, the characters up to character 9 make up character line F-C201, and character 10 becomes the first character of a new line, F-C202. Likewise, in column region 2 a line break occurs at character 23, so characters 19 to 22 make up character line F-C203.

<One-dimensional serialization P-A104>
Next, the one-dimensional serialization process P-A104 of FIG. 2, which is the step that sorts the character lines, sorts the character lines extracted by the character line extraction process P-A103 using the same ordering as for the characters. For example, if the XY coordinates are expressed with the origin at the lower left, the character lines are ordered by the following keys:
(1) the order of the column region containing the character line (from upper left to lower right)
(2) ascending Y coordinate of the character line
(3) descending X coordinate of the character line
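The three ordering keys map naturally onto a tuple sort key. The sketch below follows the directions exactly as stated in the text (ascending Y, descending X); the `TextLine` type and its field names are hypothetical, and in practice the signs depend on the coordinate convention of the input.

```python
from typing import NamedTuple


class TextLine(NamedTuple):
    text: str
    column: int  # column region order, 1 = first region in reading order
    x: float
    y: float


def serialize(lines):
    """Order character lines by (1) column region, (2) ascending Y, (3) descending X."""
    return sorted(lines, key=lambda line: (line.column, line.y, -line.x))


page = [
    TextLine("B", 1, 0.0, 5.0),
    TextLine("C", 2, 0.0, 1.0),
    TextLine("A", 1, 0.0, 2.0),
]
series = [line.text for line in serialize(page)]
```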

FIG. 9 shows an example of character lines converted into a one-dimensional series. In FIG. 9, cover page F-D001 on the left is the character line extraction result for column structure F-A101 of FIG. 3. Elements A to J in cover page F-D001 are the character lines extracted by the preceding process. The one-dimensional series data F-D101 on the right of the figure shows the character lines ordered according to criteria (1) to (3) above.

<Feature amount calculation for each character line P-A105>
In the feature amount calculation P-A105 of FIG. 2, layout feature amounts and character string feature amounts are calculated for each character line. Table T-A100 in FIG. 10 is an example of the metadata-specific feature definition dictionary 162 of FIG. 1 and lists the feature amounts used in the metadata assigning apparatus of this embodiment. In the table, item T-A001 is the identification ID of each feature amount, item T-A002 is its name, item T-A003 is its content, and item T-A004 is an example of how its weight is calculated. Items T-A002, T-A003, and T-A004 together define the feature amount identified by item T-A001.

The feature amount with identification ID 1 is the number of words. Word units are determined from spaces. FIG. 11 is a schematic diagram of space-based word segmentation. Rectangle F-E101 represents a character, as do the rectangles b to h that follow it. The characters are separated by spaces of some width. Whether a given space is a word-separating space is judged from the relative change in spacing. For example, space F-E001 is wider than spaces F-E002, F-E003, and F-E004. This can be judged using a threshold on the ratio between spaces or a relative threshold on the difference in width (relative to the page width, for example). Counting the word-separating spaces found in this way gives the number of words, the feature amount with ID 1.
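As a rough illustration of feature ID 1, the sketch below flags a gap as a word break when it is at least `ratio` times the median gap in the line. The median baseline, the default ratio of 2.0, and the choice to report "breaks + 1" as the word count are assumptions for the example, since the text only says that a ratio or relative-width threshold can be used.

```python
import statistics


def count_words(gaps, ratio=2.0):
    """Estimate the word count of one character line from inter-character gap widths.

    gaps: widths of the spaces between consecutive characters.
    A gap counts as a word break if it is positive and >= ratio * median gap
    (assumed rule for illustration).
    """
    if not gaps:
        return 1  # a single character (or no measurable gap) is one word
    baseline = statistics.median(gaps)
    breaks = sum(1 for g in gaps if g > 0 and g >= ratio * baseline)
    return breaks + 1


# Seven characters with one clearly wider gap after the third character.
words = count_words([1.0, 1.0, 5.0, 1.0, 1.0, 1.0])
```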

The feature amount with identification ID 2 is the number of characters in the character line. It can be used, for example, as one of the features for identifying abstract and body regions. The raw count may be used as is, or the ratio of the count to the maximum character count among the character lines extracted in the character line extraction process P-A103 may be used instead.

The feature amount with identification ID 3 captures a pattern characteristic of English personal names. English personal names are often written like “Taro HITACHI”, with the family name in uppercase and the given name as an uppercase letter followed by lowercase letters. The ratio of the characters matching this pattern to the number of characters in the character line is therefore used.
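One way to approximate feature ID 3 is with a regular expression, as sketched below. The specific pattern (a capitalized word or an all-uppercase word of two or more letters) and the decision to ignore whitespace when counting characters are assumptions made for this example, not details given in the text.

```python
import re

# Capitalized word ("Taro") or all-uppercase word of 2+ letters ("HITACHI").
NAME_TOKEN = re.compile(r"\b(?:[A-Z][a-z]+|[A-Z]{2,})\b")


def english_name_ratio(line):
    """Ratio of characters inside name-like tokens to all non-space characters."""
    total = sum(1 for c in line if not c.isspace())
    if total == 0:
        return 0.0
    matched = sum(len(m.group()) for m in NAME_TOKEN.finditer(line))
    return matched / total


ratio_name = english_name_ratio("Taro HITACHI")
ratio_text = english_name_ratio("we propose a method")
```

A pattern this loose also matches ordinary capitalized words, which is one reason such a feature would be combined with others rather than used alone.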

The feature amount with identification ID 4 captures a pattern characteristic of English words. Here, the ratio of alphabetic and symbol characters to the number of characters in the character line is used.

The feature amount with identification ID 5 is the number of characters in personal-name words. Personal-name words can be identified by morphological analysis or by matching against a personal-name dictionary. The feature value is the ratio of the number of matching characters to the number of characters in the character line.

The feature amount with identification ID 6 is the number of characters in organization-name words. Organization-name words can be identified by morphological analysis or by matching against an organization-name dictionary. The feature value is the ratio of the number of matching characters to the number of characters in the character line.

The feature amount with identification ID 7 is the number of characters in place-name words. Place-name words can be identified by morphological analysis or by matching against a place-name dictionary. The feature value is the ratio of the number of matching characters to the number of characters in the character line.

The feature amount with identification ID 8 is the number of comma characters in the character line. The ratio of the number of commas to the number of words can serve as one of the features of the English personal-name pattern.

The feature amount with identification ID 9 is the average distance between words. Words are identified by the same means as for the feature amount with ID 1. Because spacing tends to be wider in Japanese personal names, this can serve as one of the features of Japanese personal names.

The feature amount with identification ID 10 is the average distance between characters. The average inter-character distance in Japanese personal names tends to be large because (1) there are fewer characters than in titles or abstracts and (2) the spaces at word boundaries are wide. This can serve as one of the features of Japanese personal names.

The feature amount with identification ID 11 is based on the position at which the character line appears in the one-dimensional series produced by process P-A104. For items that tend strongly to appear at the top, such as titles, the following equation (Equation 1) can be used.

With Equation 1, the first line has the highest weight, 1, and the remaining lines take values between 1 and 0 in descending order.
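Equation 1 itself does not appear in this text, so its exact form is unknown. The sketch below shows just one function with the stated properties (weight 1 for the first line, then strictly decreasing values between 1 and 0); the linear form and the names `index` and `total` are assumptions for illustration and may differ from the actual equation.

```python
def position_weight(index, total):
    """Hypothetical position weight: 1.0 for the first line, decreasing linearly.

    index: 1-based position of the character line in the one-dimensional series.
    total: number of character lines in the series.
    """
    return 1.0 - (index - 1) / total


weights = [position_weight(i, 5) for i in range(1, 6)]
```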

The feature amount with identification ID 12 is the number of Japanese characters. It can be used as a constraint for distinguishing the Japanese regions, English regions, and so on within an academic document.

The feature amount with identification ID 13 is the result of an item dictionary lookup. For each metadata type for which an item word can be defined, the item word is defined in advance, and the edit distance between it and each character line is used. Edit distance is used because documents such as OCR output contain misread characters. In addition, although an item word is normally expected to match at the start of a character line, an inserted symbol or the like may push it away from the start. Using the edit distance lets a match partway into the line still contribute a score, which prevents metadata misjudgments caused by failing to read the item word. Taking the reciprocal of “edit distance + 1” gives a score of 1 when the item word matches exactly from the start of the character line. Each shift from the start, insertion, or deletion lowers the score, which takes a value between 1 and 0.
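The reciprocal-of-(edit distance + 1) score can be sketched as follows. The text does not say which part of the character line the item word is compared against, so this example assumes a comparison with the best-matching prefix of the line, computed with a standard Levenshtein dynamic program. The function names are hypothetical.

```python
def prefix_edit_distance(word, line):
    """Smallest Levenshtein distance between `word` and any prefix of `line`."""
    prev = list(range(len(line) + 1))  # distance from "" to line[:k] is k
    for i, wc in enumerate(word, 1):
        cur = [i]  # distance from word[:i] to the empty prefix
        for k, lc in enumerate(line, 1):
            cur.append(min(
                prev[k] + 1,               # delete a character of word
                cur[k - 1] + 1,            # insert a character of line
                prev[k - 1] + (wc != lc),  # match or substitute
            ))
        prev = cur
    return min(prev)  # best prefix length is chosen implicitly


def item_score(word, line):
    """Score in (0, 1]: 1.0 for an exact match at the start of the line."""
    return 1.0 / (prefix_edit_distance(word, line) + 1)


exact = item_score("Abstract", "Abstract: We propose a method")
misread = item_score("Abstract", "Abstnact: We propose a method")
shifted = item_score("Abstract", "*Abstract: We propose a method")
```

Here a single OCR misread or a single leading symbol each cost one edit, so both lines still receive a score of 0.5 rather than being rejected outright.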

The feature amount with identification ID 14 is the position relative to the item word. When an item word is present, no character line appearing before it can belong to that metadata type, so this feature amount can be used as a constraint.

The feature amounts listed above are examples of features considered effective for identifying bibliographic metadata, and the usable features are not limited to them. By combining these feature amounts, the character lines extracted by process P-A103 can be labeled with bibliographic information, that is, classified by metadata type.

<Metadata score calculation P-A106>
Next, in the metadata score calculation P-A106 of FIG. 2, a metadata-likeness score is calculated according to the metadata-specific feature definition dictionary 162, which is recorded in association with each metadata type. For example, the feature amounts of FIG. 10 detailed above are used.

FIG. 25 shows table T-D100, the part of the extraction target metadata definition dictionary 161 of FIG. 1 that lists the types of metadata to be extracted. Item T-D001 is the meta ID, an identifier for the type of bibliographic metadata. Item T-D002 is the name of the bibliographic metadata. Item T-D003 indicates whether the metadata is an extraction target. When the value of item T-D003 is “○”, the metadata is an extraction target; otherwise it is not.

FIG. 12 shows table T-B100, the part of the metadata-specific feature definition dictionary 162 of FIG. 1 that defines the feature amounts used to calculate the score of each bibliographic metadata type. In the table, item T-B001 is the meta ID identifying the type of bibliographic metadata. Item T-B002 is the metadata name, that is, the content of the bibliographic information. Item T-B003 lists the IDs of the feature amounts used in the score calculation. Each feature amount ID corresponds to item number T-A001 of table T-A100 of the extraction target metadata definition dictionary 161 in FIG. 10. The score of each metadata type is obtained by looking up the feature items T-B003 used for score calculation in table T-B100 of FIG. 12 and, for the item numbers defined there, calculating the linear sum of the values of the weight calculation item T-A004 in table T-A100 of FIG. 10. The calculation is not limited to a linear sum. For example, values learned from a large number of samples by discriminant analysis may be used.

FIG. 13 shows table T-C100, the part of the metadata-specific feature definition dictionary 162 of FIG. 1 that defines the constraints (the feature items used as constraints) each metadata type must satisfy. In table T-C100, item T-C001 is the identifier of the metadata type. Item T-C002 is the metadata name. Item T-C003 is the list of IDs of the features used as constraints and corresponds to item number T-A001 in FIG. 10. Item T-C004 is the threshold (lower bound) that the value of each constraint feature must satisfy. Based on table T-C100, each character line is first checked against the constraints of the bibliographic metadata, and the metadata score described above is calculated only for those that satisfy them.

The flow of the metadata score calculation P-A106 described above is shown as a processing flow in FIG. 14. In the metadata score calculation, process P-E001 first determines whether the constraints for being each metadata type defined in the extraction target metadata definition dictionary 161 of FIG. 1 are satisfied. The determination checks whether the feature amount thresholds illustrated in FIG. 13 are met. Metadata types whose conditions are not satisfied are given a score of 0. For metadata types whose conditions are satisfied, process P-E002 calculates the metadata score using the feature amount values illustrated in FIG. 12. This processing is repeated for all character lines, or for the character lines from the first up to a fixed number of lines (P-E003).
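The constraint check followed by a linear sum (P-E001, P-E002) can be sketched as below. The tables of FIGS. 12 and 13 are not reproduced in this text, so the feature IDs and threshold values in the usage lines are made-up placeholders, and the function name and dictionary layout are assumptions.

```python
def metadata_score(features, score_feature_ids, constraints):
    """Score one character line for one metadata type.

    features: {feature_id: weighted feature value} for the character line.
    score_feature_ids: feature IDs summed for the score (role of item T-B003).
    constraints: {feature_id: lower bound} (role of items T-C003 and T-C004).
    """
    # P-E001: every constraint feature must reach its lower bound.
    for feature_id, lower_bound in constraints.items():
        if features.get(feature_id, 0.0) < lower_bound:
            return 0.0
    # P-E002: linear sum of the selected feature values.
    return sum(features.get(feature_id, 0.0) for feature_id in score_feature_ids)


# Placeholder values: feature 11 = position weight, 2 = character-count ratio,
# 12 = Japanese character ratio.
line_features = {11: 1.0, 2: 0.5, 12: 0.0}
score_ok = metadata_score(line_features, [11, 2], {})
score_blocked = metadata_score(line_features, [11, 2], {12: 0.1})
```

Running this for every character line and every extraction target metadata type yields one score series per metadata type.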

<Weighted moving average calculation P-A107>
FIG. 15 illustrates the score of each metadata type calculated in process P-A106 of FIG. 2, plotted against the position of the character line. Axis F-F200 is the metadata score axis, with larger values toward the left. Axis F-F300 is the position of the character line obtained by process P-A104, expressed either as the line number counted from the first line or as the Y coordinate of the character line within the series data. Graph F-F001 is an example of the metadata score for the title. Graph F-F002 shows the metadata score for the author. Graph F-F003 shows the metadata score for the English title. For example, in graph F-F001 the score is high for the first line and falls afterward. In graph F-F003, for the English title, the score is highest a few lines from the top and low afterward, although isolated stretches with a higher score appear partway through.

Calculating a metadata score for each character line in this way makes it possible to determine the metadata regions and their boundaries to some extent. In practice, however, the scores of several metadata types may differ very little, and the metadata boundaries may not be determined correctly from the metadata score values alone. In this embodiment, therefore, the feature-based metadata scores are converted into metadata score graphs that also reflect layout information.

FIG. 16 shows the metadata score graphs obtained by applying a weighted moving average, with respect to the character line position or the Y coordinate of the character line within the series, to the feature-based metadata scores illustrated in FIG. 15. Graph F-G001 is an example of the metadata score for the title. Graph F-G002 shows the metadata score for the author. Graph F-G003 shows the metadata score for the English title.

A weighted moving average of width n can be calculated, for example, by the following equation (Equation 2).

In Equation 2, P_Y is the metadata score of the character line at Y coordinate Y, the target of the weighted moving average calculation (Y being the coordinate along the character line series measured from the first character line, that is, the character line position). n is the window size of the weighted moving average. Propagating the metadata scores with this equation smooths out noisy scores and produces a metadata score graph that reflects the underlying trend.
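Equation 2 is not reproduced in this text, so the sketch below should be read as one common form of a weighted moving average rather than the equation itself. It assumes a symmetric window of `n` lines on each side with triangular weights that fall off with line distance, and it uses the line index as the position; a variant based on actual Y coordinates would replace the index distance with a coordinate distance.

```python
def weighted_moving_average(scores, n=2):
    """Smooth a per-line metadata score series with triangular weights.

    scores: metadata scores in one-dimensional series order.
    n: number of neighboring lines considered on each side (assumed meaning).
    The weight of a neighbor at distance k is n + 1 - |k|.
    """
    smoothed = []
    for i in range(len(scores)):
        weighted_sum = 0.0
        weight_total = 0.0
        for k in range(-n, n + 1):
            j = i + k
            if 0 <= j < len(scores):  # truncate the window at both ends
                weight = n + 1 - abs(k)
                weighted_sum += weight * scores[j]
                weight_total += weight
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# An isolated spike is spread onto its neighbors and reduced in height.
spread = weighted_moving_average([0.0, 0.0, 1.0, 0.0, 0.0], n=1)
```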

<Metadata region determination P-A108>
Next, the boundary position determination process P-A108 of FIG. 2 for metadata regions is described. It operates on the graphs of metadata scores, based on multiple feature amounts, produced by the means described above.

FIG. 17 is a flowchart outlining the metadata region boundary determination P-A108, including the weighted moving average calculation P-A107 of FIG. 2. First, in process P-F001, the weighted moving average of the metadata scores described in the processing flow of FIG. 14 is calculated. This calculation is as illustrated by Equation 2 and corresponds to the weighted moving average calculation P-A107 of FIG. 2.

Next, in process P-F002, the first local maximum P1 of the score of any extraction target metadata type is searched for, starting from the first character line. Let M1 be the metadata type to which P1 belongs. In process P-F003, the next metadata score local maximum P2 after the local maximum of M1 is searched for, and the metadata type to which it belongs is denoted M2. Then, in process P-F004, the region boundary between metadata M1 and metadata M2 is determined from the interval between local maxima P1 and P2.
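The search for P1 and P2 can be sketched as follows. How ties, plateaus, and the two ends of the series are treated is not specified in the text, so the rules used here (a point counts as a local maximum if it is positive, not lower than its left neighbor, and higher than its right neighbor) are assumptions, as are the function names and sample scores.

```python
def first_peak(curve, start=0):
    """Index of the first local maximum of `curve` at or after `start`, or None."""
    for i in range(start, len(curve)):
        left = curve[i - 1] if i > 0 else float("-inf")
        right = curve[i + 1] if i + 1 < len(curve) else float("-inf")
        if curve[i] > 0 and curve[i] >= left and curve[i] > right:
            return i
    return None


def earliest_peak(curves, start=0):
    """Earliest peak over all metadata curves as (line_index, metadata_name), or None."""
    found = []
    for name, curve in curves.items():
        index = first_peak(curve, start)
        if index is not None:
            found.append((index, name))
    return min(found) if found else None


smoothed = {
    "title": [1.0, 0.6, 0.2, 0.0, 0.0],
    "author": [0.0, 0.3, 0.8, 0.4, 0.0],
}
p1 = earliest_peak(smoothed)                            # P-F002
others = {k: v for k, v in smoothed.items() if k != p1[1]}
p2 = earliest_peak(others, start=p1[0] + 1)             # P-F003
```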

FIG. 18 is a schematic diagram illustrating how process P-F004 determines the boundary position from the two maximum points. Axis F-H301 represents the score value. Axis F-H302 represents the character line position (the Y coordinate in the character line sequence counted from the first character line). Graph F-H001 is the score graph of metadata M1, and graph F-H002 is the score graph of metadata M2. Point F-H201 on the graph is the maximum point P1 of the score of metadata M1, and point F-H202 is the maximum point P2 of the score of metadata M2. One way to determine the boundary between P1 and P2 is to use the intersection F-H203 of graph F-H001 and graph F-H002 as the boundary. Alternatively, the position at which the inter-class variance, obtained by the following equation, is maximized may be used as the metadata boundary point.

Equation 3 gives the variance between the two regions when the boundary position is y; the position y that maximizes this value is selected as the boundary position.
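Both boundary rules can be sketched as follows. Equation 3 is not reproduced in this text, so the second function uses the standard between-class variance w1·w2·(μ1 − μ2)², applied to the difference between the M1 and M2 scores; that choice, the function names, and the sample values are assumptions for illustration only.

```python
def crossing_boundary(m1, m2, p1, p2):
    """First line after p1 at which the M2 score reaches or exceeds the M1 score."""
    for y in range(p1 + 1, p2 + 1):
        if m2[y] >= m1[y]:
            return y
    return p2


def between_class_variance_boundary(values, p1, p2):
    """Pick y in (p1, p2] that maximizes the variance between the two regions.

    Lines p1..y-1 form region 1 and lines y..p2 form region 2.
    Assumption: `values` is the M1 score minus the M2 score on each line.
    """
    total = p2 - p1 + 1
    best_y, best_var = p1 + 1, -1.0
    for y in range(p1 + 1, p2 + 1):
        left, right = values[p1:y], values[y:p2 + 1]
        w1, w2 = len(left) / total, len(right) / total
        mu1, mu2 = sum(left) / len(left), sum(right) / len(right)
        var = w1 * w2 * (mu1 - mu2) ** 2
        if var > best_var:
            best_y, best_var = y, var
    return best_y


m1 = [0.2, 0.9, 0.6, 0.2, 0.1]  # hypothetical smoothed scores of M1
m2 = [0.0, 0.1, 0.3, 0.8, 0.4]  # hypothetical smoothed scores of M2
diff = [a - b for a, b in zip(m1, m2)]
y_cross = crossing_boundary(m1, m2, 1, 3)
y_var = between_class_variance_boundary(diff, 1, 3)
```

In this example both rules place the boundary at line 3, so lines before it receive the label of M1 and line 3 onward receives the label of M2.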

<Modification of weighted moving average calculation P-A107>
The first embodiment has been described in detail above with reference to the drawings. As the distance value used in the weighted moving average calculation P-A107, the Y coordinate may be used as it is, or the distance may be converted according to the pattern of consecutive lines, in which case the weighted moving average is calculated on the converted distance. For example, the virtual position of each character line is changed by focusing on the following features.

(1) When the length of the lower of two consecutive lines is α times the length of the upper line
=> Multiply the line spacing by α·ε.

(2) When the lower of two consecutive lines is shifted to the left of the upper line by X0
=> Multiply the line spacing by (X0 / width of the lower line).

<Automatic segmentation method for multiple document data>
The metadata extraction method of the bibliographic information extraction system according to the first embodiment described above can also be applied when a plurality of documents are printed consecutively.

FIG. 19 shows an example of a metadata score graph for a series of pages in which a plurality of documents have been copied consecutively. In the upper part of the figure, axis F-I100 represents the position of the copied pages. Axis F-I200 represents the metadata score; the higher the position on the axis, the higher the score. Graphs F-I001, F-I002, F-I003, and F-I004 are graphs of the metadata scores shown in FIG. 16 after the weighted moving average calculation. In this example, four documents are printed consecutively, and a group of metadata score maximum points appears four times. A group of metadata score maximum points can be found, for example, as a region in which at least a certain number of maximum points exist within a window of a certain size.
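The window-based search for groups of maximum points can be sketched as below. The function name, the parameters `window` and `min_peaks`, and the sample positions are hypothetical; positions are assumed to be line indices counted over the whole copied sequence.

```python
def peak_cluster_starts(peak_positions, window, min_peaks):
    """Return the start position of each group of score maxima.

    A group is reported when at least `min_peaks` maxima lie within
    `window` positions of its first maximum.
    """
    peaks = sorted(peak_positions)
    starts = []
    last_end = None
    for i, start in enumerate(peaks):
        if last_end is not None and start <= last_end:
            continue  # this maximum already belongs to the previous group
        in_window = [p for p in peaks[i:] if p < start + window]
        if len(in_window) >= min_peaks:
            starts.append(start)
            last_end = in_window[-1]
    return starts


# Maxima near lines 2-5 and 102-106 form groups; 40 and 150 are isolated.
all_peaks = [2, 4, 5, 40, 102, 103, 106, 150]
group_starts = peak_cluster_starts(all_peaks, window=10, min_peaks=3)
```

Each reported start marks the beginning of a region dense in metadata score maxima, which is a candidate for the first page of a new document.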

Labels F-I301, F-I302, F-I303, and F-I304 shown in the lower part of the figure are the metadata labels assigned to part of the series of copied pages by applying the means described with the flowchart of FIG. 17. Boundary lines F-I301, F-I302, and F-I303 represent the document boundaries determined from the assignment results of metadata labels F-I301, F-I302, F-I303, and F-I304. A document boundary can be placed immediately before the page that contains a group of metadata score maximum points.

By the above means, even when a plurality of documents are bundled and printed by a multifunction machine or the like, metadata can be extracted for each document, and the mixed documents can be separated into individual documents.

<Two-axis weighted moving average calculation of metadata score>
In the first embodiment described above, the weighted moving average of the metadata scores is calculated after the character lines are arranged in one dimension. This calculation is not limited to the Y-axis direction and may instead be performed along the X-axis direction. The weighted moving average can also be calculated along two axes rather than one. In the second embodiment, the character lines are not converted into a one-dimensional sequence, and the weighted moving average is calculated on two-dimensional coordinates. The calculation formula in this case is as follows.

As in the one-dimensional case, the boundary region can be determined by finding the boundary points of the metadata scores along each of the X axis and the Y axis. According to this embodiment, a two-dimensional boundary region can be determined easily.

<Determination method of metadata area boundary by dynamic programming>
Next, as a third embodiment, a configuration is described in which, during the processing of the document processing program 153 executed by the CPU 13, dynamic programming is used to determine the boundaries from the metadata scores. Using dynamic programming makes it possible to assign metadata labels that reflect the relative positional relationships among the metadata.

FIG. 20 shows an example of the relationship between character line position and metadata type. Matrix F-J100 in the upper part of the figure shows the relationship between the character line position and the normalized metadata score. Axis F-J110 represents the character line position; positions further to the right are further from the first line. Axis F-J120 is the metadata type axis. Metadata identifiers F-J121 represent a provisional order of metadata appearance. Each number in the matrix is a cost based on the provisional metadata score of the corresponding character line. Matrix F-J200 in the lower part of the figure is a matrix of metadata selection costs based on the rank of each metadata score for each character line. For example, if the score of metadata T ranks third for a given character line, the cost is set to "3".

Once one metadata order is assumed, the assignment of metadata to each character line can be determined by finding, on matrix F-J100 or matrix F-J200, the path with the minimum total cost among the paths that reach the bottom row. The shortest path can be searched for with dynamic programming, as commonly used to calculate the edit distance between two character strings. For dynamic programming with metadata scores in this embodiment, the cost of moving through the matrix is defined as follows.

Cost by matrix movement pattern:
(1) Moving right in the matrix: consecutive character lines are given the same metadata label. If the space between the character lines is wider than a certain amount, a cost is added.
(2) Moving down in the matrix: multiple metadata are assigned to the same character line (or a metadata is omitted). A deletion cost is added.
(3) Moving diagonally in the matrix: a transition is made to the next metadata. If the space between the character lines is sufficiently narrow, a transition cost is added.

In this embodiment, the path that minimizes the sum of the above movement pattern costs and the matrix costs is obtained by dynamic programming.
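The minimum cost path search over the three movement patterns can be sketched as follows. The penalty values, the gap threshold, the requirement that the path end in the bottom-right cell, and the rule of taking the lowest path cell as the label of a line are all assumptions made for this illustration; the function name `label_lines` is hypothetical.

```python
def label_lines(cost, gaps, gap_threshold=1.5, gap_penalty=1.0,
                transition_penalty=1.0, skip_penalty=2.0):
    """cost[k][i]: cost of giving the k-th label of the assumed order to line i.
    gaps[i]: spacing between line i-1 and line i (gaps[0] is unused).
    Returns (total_cost, list with one label index per line)."""
    K, N = len(cost), len(cost[0])
    INF = float("inf")
    D = [[INF] * N for _ in range(K)]
    back = [[None] * N for _ in range(K)]
    D[0][0] = cost[0][0]
    for i in range(N):
        for k in range(K):
            if i == 0 and k == 0:
                continue
            wide = i > 0 and gaps[i] >= gap_threshold
            moves = []  # diagonal is listed first so that it wins ties
            if i > 0 and k > 0:  # (3) diagonal: next label on the next line
                extra = 0.0 if wide else transition_penalty
                moves.append((D[k - 1][i - 1] + extra, (k - 1, i - 1)))
            if i > 0:  # (1) right: same label on the next line
                extra = gap_penalty if wide else 0.0
                moves.append((D[k][i - 1] + extra, (k, i - 1)))
            if k > 0:  # (2) down: label skipped or shared on the same line
                moves.append((D[k - 1][i] + skip_penalty, (k - 1, i)))
            best, prev = min(moves, key=lambda m: m[0])
            D[k][i] = best + cost[k][i]
            back[k][i] = prev
    # Trace back from the bottom-right cell; keep the lowest cell per line.
    labels = [None] * N
    cell = (K - 1, N - 1)
    while cell is not None:
        k, i = cell
        if labels[i] is None:
            labels[i] = k
        cell = back[k][i]
    return D[K - 1][N - 1], labels


# Two labels in the assumed order, four lines, a wide gap before line 2.
cost = [[0.1, 0.2, 0.9, 0.8],
        [0.9, 0.8, 0.1, 0.2]]
gaps = [0.0, 1.0, 2.0, 1.0]
total, labels = label_lines(cost, gaps)
```

In this example the path switches labels at the wide gap, so the first two lines receive label 0 and the last two receive label 1.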

FIG. 21 shows an example of a minimum cost path search using matrix F-J100 or F-J200 of FIG. 20. Axis F-K110 represents the position of the character line. Axis F-K120 represents the metadata type. Metadata sequence F-K121 represents a provisional order of metadata (one example). Cell F-K201 is the cell with the minimum cost among the costs obtained when the matrix has been searched up to "B" in the metadata sequence. To trace the minimum cost path, the values of the diagonally upper-left cell and the left cell are compared, and the cell with the smaller value is followed. If the values are equal, the diagonally upper-left cell is given priority. In the example of FIG. 21, the hatched cells in matrix F-J100 correspond to the minimum cost path. The metadata label of a character line is obtained by scanning the column for that line vertically (in the direction of axis F-K120), finding the cell on the minimum cost path, and taking the metadata at the row position of that cell.

As the inter-line space cost, (A) a value relative to the character line height, (B) a value relative to the page height, (C) the rank obtained when the inter-line distances are sorted in descending order, or the like can be used. FIG. 24 shows an example of the metadata label cost between character lines. Elements F-M100 to F-M107 represent character lines. Width F-M001 represents the page height. Width F-M002 represents the inter-line distance between character lines F-M100 and F-M101. Width F-M003 represents the inter-line distance between character lines F-M103 and F-M104. Width F-M004 represents the inter-line distance between character lines F-M106 and F-M107. Width F-M005 represents the height of character line F-M100. Width F-M006 represents the height of character line F-M103. Width F-M007 represents the height of character line F-M106. Using these values, (A), (B), and (C) above are expressed as follows.

Example of (A):
- Cost when character lines F-M100 and F-M101 share the same metadata (matrix cost (1)) = α1/β1
- Cost when character lines F-M103 and F-M104 share the same metadata (matrix cost (1)) = α2/β2
- Cost when character lines F-M106 and F-M107 share the same metadata (matrix cost (1)) = α3/β3
Example of (B):
- Cost when character lines F-M100 and F-M101 share the same metadata (matrix cost (1)) = α1/γ1
- Cost when character lines F-M103 and F-M104 share the same metadata (matrix cost (1)) = α2/γ2
- Cost when character lines F-M106 and F-M107 share the same metadata (matrix cost (1)) = α3/γ3
Example of (C): the reciprocal of the rank of the inter-line distance (α2 > α3 > α1) is used
- Cost when character lines F-M100 and F-M101 share the same metadata (matrix cost (1)) = 1/3
- Cost when character lines F-M103 and F-M104 share the same metadata (matrix cost (1)) = 1/1
- Cost when character lines F-M106 and F-M107 share the same metadata (matrix cost (1)) = 1/2
By the above means, when a provisional metadata order is given, optimal metadata labeling can be performed that takes into account the metadata likeness of each character line, the metadata order, and the layout.
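The three inter-line space costs can be computed as in the sketch below. It assumes that α1 to α3 are the inter-line distances, β1 to β3 the corresponding line heights, and γ the page height; the text does not define these symbols explicitly, and the function name and numeric values are hypothetical.

```python
def spacing_costs(gaps, line_heights, page_height):
    """Return the (A), (B), and (C) costs for each inter-line gap."""
    cost_a = [g / h for g, h in zip(gaps, line_heights)]  # relative to line height
    cost_b = [g / page_height for g in gaps]              # relative to page height
    # (C): reciprocal of the rank when gaps are sorted in descending order.
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    rank = {index: position + 1 for position, index in enumerate(order)}
    cost_c = [1 / rank[i] for i in range(len(gaps))]
    return cost_a, cost_b, cost_c


# Hypothetical distances with alpha2 > alpha3 > alpha1, as in example (C).
alphas = [10.0, 30.0, 20.0]
betas = [10.0, 20.0, 10.0]
cost_a, cost_b, cost_c = spacing_costs(alphas, betas, page_height=1000.0)
```

For these values the rank-based costs are 1/3, 1/1, and 1/2, which matches the pattern of example (C).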

The method described so far applies when the metadata order is given. Finally, a method is described for searching for the label sequence that minimizes the total metadata cost when the metadata order is not fixed. The simplest approach is to run the above dynamic programming for every metadata order and select the result of the metadata sequence with the minimum cost. However, the same subsequences occur repeatedly among the metadata orders, so the same processing would be executed repeatedly, which is inefficient. Representing the metadata orders that can occur as a tree structure makes it possible to avoid part of this duplicated processing.

FIG. 22 shows a data structure that represents the possible metadata orders as a tree. Searching all paths of this tree avoids duplicated processing for portions whose order from the beginning is the same. Tree node F-L001 is the root node, and its identifier is T (title). Tree nodes F-L002, F-L003, and F-L004 are child nodes of node F-L001. In the example, NE (English personal name), TE (English title), or N (Japanese author name) can appear after T. As this example shows, each path of the tree corresponds to one order of metadata appearance. Links F-L201 and F-L202 connect the heads of paths in the tree that share a common subsequence. Each link is created, following the order in which the tree is fully traversed, from the later subsequence to the earlier one. With this configuration, when tree node F-L005 is reached during the search, for example, the link can be followed and the result of the already processed tree node F-L006 can be used. This avoids duplicated processing for portions that share a common trailing subsequence.
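A tree of metadata orders with links between shared trailing subsequences can be sketched as follows. The class name `OrderNode`, the function `build_order_tree`, and the two sample orders are hypothetical; the labels T, TE, N, and NE are those used in the description, and the linking rule (a newly created node links to the first node seen with the same remaining suffix) is an assumption about how such links might be built.

```python
class OrderNode:
    """One node of the metadata order tree."""

    def __init__(self, label):
        self.label = label
        self.children = {}
        self.link = None  # earlier node whose remaining suffix is identical


def build_order_tree(orders):
    """Build a prefix tree from metadata orders that share the same first label."""
    root = OrderNode(orders[0][0])
    first_node_for_suffix = {}
    for order in orders:
        if order[0] != root.label:
            raise ValueError("all orders must start with the root label")
        node = root
        for depth in range(1, len(order)):
            label = order[depth]
            created = label not in node.children
            if created:
                node.children[label] = OrderNode(label)
            node = node.children[label]
            suffix = tuple(order[depth:])
            earlier = first_node_for_suffix.setdefault(suffix, node)
            if created and earlier is not node:
                node.link = earlier  # link from the later suffix to the earlier one
    return root


orders = [("T", "N", "NE"), ("T", "TE", "N", "NE")]
root = build_order_tree(orders)
linked = root.children["TE"].children["N"].link
```

In this example both orders share the prefix T and the trailing subsequence N, NE, so the N node reached through TE is linked to the N node directly under the root.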

In this embodiment, combining the tree-structured search described above with the dynamic programming based on metadata scores described with reference to FIGS. 20 and 21 makes it possible to extract bibliographic metadata with high accuracy, without creating a new document structure definition, even when the number of types of academic literature to be processed increases.

FIG. 23 is a flowchart outlining the minimum cost label sequence search based on metadata scores in the third embodiment described above. As stated earlier, this processing flow can be realized by the CPU 13 executing the document processing program 153 stored in the work area 15 or the like of the storage unit of the bibliographic information extraction system 10 shown in FIG. 1.

First, in process P-G001, a tree is constructed from the possible metadata orders. At this time, links for common suffix sequences are also created.

In input P-G101, as in the first embodiment, the document data to be processed is loaded into the work area 15 through the input device 110 or the communication network 19 of FIG. 1. In column structure determination process P-G102, the column layout of the input document is determined so that character lines can be extracted correctly even from academic literature with a multi-column structure. In character line extraction process P-G103, the character lines that serve as labeling units are extracted based on the column structure information obtained in column structure determination P-G102. In one-dimensional sequencing process P-G104, the character lines extracted in character line extraction process P-G103 are sorted according to the XY coordinates of each character line and the column, as determined in column structure determination P-G102, that contains it. In character line feature setting P-G105, the language and layout feature amounts of each character line are set based on layout features such as line height, line position, and average inter-character distance, and on the character string pattern and number of characters of the line, which are stored in advance in the information holding unit 16 of FIG. 1 in association with each metadata.

Next, in process P-G002, the score of each metadata is calculated for the character lines to be processed. Next, in process P-G003, the metadata ranks are determined from the metadata scores of each character line, and a matrix whose costs are the ranks is constructed. The cost is not limited to this; a value such as "1 - normalized metadata score" may be used instead. Next, in process P-G004, one path of the tree to be searched is selected. Then, in process P-G005, the minimum cost path search based on the metadata scores is performed, as shown in FIG. 21, using the metadata that appear in the path. During the search, determination P-G101 checks whether a link exists. If a link exists, the flow moves to process P-G007; if not, the minimum cost path search continues in process P-G006. In process P-G007, the link is followed and the search is completed using the minimum cost path search result of the link destination. When all tree paths have been searched, process P-G008 finally takes the metadata labeling result extracted for the tree path with the minimum cost as the final result.
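The two cost definitions mentioned for process P-G003 can be sketched as below. The function names and the sample scores are hypothetical, and dividing by the highest score on each line is only one possible way to normalize; the text does not specify the normalization.

```python
def rank_costs(score_by_label):
    """Cost of a label on a line = rank of its score on that line (1 = best)."""
    labels = list(score_by_label)
    n_lines = len(next(iter(score_by_label.values())))
    costs = {label: [0] * n_lines for label in labels}
    for i in range(n_lines):
        ranked = sorted(labels, key=lambda l: score_by_label[l][i], reverse=True)
        for rank, label in enumerate(ranked, start=1):
            costs[label][i] = rank
    return costs


def complement_costs(score_by_label):
    """Cost = 1 - normalized score, normalizing by the best score on each line."""
    labels = list(score_by_label)
    n_lines = len(next(iter(score_by_label.values())))
    costs = {label: [1.0] * n_lines for label in labels}
    for i in range(n_lines):
        top = max(score_by_label[l][i] for l in labels)
        if top > 0:
            for label in labels:
                costs[label][i] = 1.0 - score_by_label[label][i] / top
    return costs


scores = {"T": [0.9, 0.2], "N": [0.3, 0.8], "NE": [0.1, 0.4]}
by_rank = rank_costs(scores)
by_complement = complement_costs(scores)
```

Either matrix can then be searched for the minimum cost path for each candidate metadata order, and the order with the lowest total cost gives the final labeling.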

With the above processing, this embodiment makes it possible to extract metadata, that is, bibliographic information, with high accuracy, without creating a new document structure definition, even when the number of types of academic literature to be processed increases.

The present invention described in detail above is useful as a document processing apparatus that extracts bibliographic metadata such as titles, author names, abstracts, English author names, and English abstracts from electronic documents as well as from a wide variety of academic literature digitized by multifunction machines or the like, and that thereby makes the provision of academic information services and the reuse of documents more efficient.

DESCRIPTION OF SYMBOLS
11 ... Input unit
110 ... Input device
111 ... Image input device
12 ... Display device
13 ... CPU
14 ... Printing device
15 ... Work area
151 ... OS
152 ... Communication program
153 ... Document processing program
154 ... Bibliographic information (metadata) editing program
16 ... Information holding unit
161 ... Extraction target metadata definition dictionary
162 ... Metadata-specific feature definition dictionary
19 ... Communication network
21 ... File server

Claims

A document processing apparatus comprising an information holding unit and a processing unit, and extracting metadata of bibliographic information by the processing unit based on an appearance position and a character string in a document of a character line,
The information holding unit
Holds a feature definition dictionary for each metadata that describes the feature values for each metadata.
The processor is
Based on the document column structure determination result, extract the character line and character line position,
A metadata score is calculated for each of the character lines based on the feature amount for each metadata held in the information holding unit,
Determining a boundary position of the bibliographic information based on the metadata score;
A document processing apparatus characterized by that.

The document processing apparatus according to claim 1,
The processor is
Based on the character line position, a propagation process of the metadata score is executed for each of the metadata; starting from the first character line, a maximum point of the metadata score after the propagation process is searched for with respect to any of the metadata to be extracted; the boundary position is determined based on the change of the metadata score between the two consecutive maximum points thus found; the metadata of the first maximum point is set for the character lines before the determined boundary position; and the metadata of the second maximum point is set for the subsequent character lines.
A document processing apparatus characterized by that.

The document processing apparatus according to claim 2,
The processor is
Based on the character line position, performing the metadata score propagation process by performing a smoothing process by weighted moving average calculation of the metadata score for each of the metadata,
A document processing apparatus characterized by that.

The document processing apparatus according to claim 1,
The information holding unit
The number of characters constituting a character line is held as the feature amount for each metadata.
A document processing apparatus characterized by that.

The document processing apparatus according to claim 1,
The information holding unit
As a feature quantity for each metadata, an English name pattern length is retained.
A document processing apparatus characterized by that.

The document processing apparatus according to claim 1,
The information holding unit
As the feature amount for each metadata, the average distance between words and the average distance between characters are retained.
A document processing apparatus characterized by that.

The document processing apparatus according to claim 1,
The processor is
Character elements are extracted from the input document information, horizontal and vertical projection component histograms of the extracted character elements are generated, and column information is extracted from the generated projection component histograms to determine the column structure of the document,
A document processing apparatus characterized by that.

Bibliographic information is extracted based on the appearance position and character string in the document of a character line consisting of a set of characters, at least having an input unit capable of inputting document information, an information holding unit, a processing unit, and an output unit. A document processing device,
The processor is
Extract character elements from the input document information,
Generating a horizontal and vertical projection component histogram of the extracted character elements, extracting column information from the projection histogram;
Based on the order of characters determined by the coordinate values of the character elements and the column information, character lines are extracted,
Based on the extracted coordinates of the character line and the column information including the character line, the character line is rearranged as series data,
For each of the sorted character lines, calculate a metadata score based on the feature amount of each metadata stored in advance in the information holding unit,
Based on the character line position, execute the propagation process of the metadata score for each of the metadata,
The maximum point of the metadata score after propagation processing is searched from the first character line, and the boundary point is set based on the change of the metadata score between the two consecutive maximum points searched. The first maximum point metadata is set in the character line before the boundary point, and the second maximum point metadata is set in the subsequent character line.
A document processing apparatus characterized by that.

The document processing apparatus according to claim 8, wherein:
The processor is
Performing the propagation process by performing a weighted moving average calculation of the metadata score for each of the metadata;
A document processing apparatus characterized by that.

The document processing apparatus according to claim 8, wherein:
The information holding unit
As the feature amount for each metadata, it holds at least one of the number of characters constituting the character line, the English name pattern length, the average distance between words, and the average distance between characters.
A document processing apparatus characterized by that.

Bibliographic information is extracted based on the appearance position and character string in the document of a character line consisting of a set of characters, at least having an input unit capable of inputting document information, an information holding unit, a processing unit, and an output unit. A document processing device,
The processor is
Extract character elements from the input document information, generate horizontal and vertical projection component histograms based on the coordinates of the extracted character elements, and extract column information from the projection component histograms,
Based on the order of characters determined by the coordinate values of the character elements and the column information, character lines are extracted,
Based on the extracted coordinates of the character line and the column information including the character line, the character line is rearranged as series data,

For each of the sorted character lines, calculate a metadata score based on the feature amount of each metadata stored in advance in the information holding unit,
Based on the metadata score of each character line, set the cost incurred when each character line is assigned each metadata, and, based on the metadata identifier order candidates stored in advance in the information holding unit, search for a metadata label sequence that minimizes the total cost over all character lines,
A document processing apparatus characterized by that.

The document processing apparatus according to claim 11,
The processor is
Determining the rank based on the metadata score of each character line, and setting the rank as the cost;
A document processing apparatus characterized by that.

The document processing apparatus according to claim 11,
The processor is
A value obtained by subtracting a value obtained by normalizing the metadata score of each character line from 1 is set as the cost.
A document processing apparatus characterized by that.

The document processing apparatus according to claim 11,
The processor is
Set the distance between the character lines as the cost,
A document processing apparatus characterized by that.