JP2007317131A

JP2007317131A - Document management method, document retrieval method and device, and program

Info

Publication number: JP2007317131A
Application number: JP2006148893A
Authority: JP
Inventors: Harumi Kawamura; 春美川村; Takehito Abe; 剛仁阿部; Tomonori Takada; 智規高田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-05-29
Filing date: 2006-05-29
Publication date: 2007-12-06

Abstract

PROBLEM TO BE SOLVED: To use image features to retrieve, from among documents that contain images and are managed on paper media, the documents that use the same or a similar image.

SOLUTION: A document that contains an image and is not digitized but printed on a paper medium or the like is registered in a database with its image feature associated with the identification information of the document. When an image is input as retrieval information, its image feature is extracted and the database is searched with that feature, so that documents containing an image with the same or a similar feature are retrieved.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a document management method, a document retrieval method, an apparatus, and a program. More particularly, it relates to a document management method, a document retrieval method, an apparatus, and a program that digitize documents managed on paper media, extract their image regions, and store and manage them in a database, thereby enabling retrieval of images and of documents that contain images.

Conventional document management techniques include a technique for managing the update history of a document (the first technique) and a document management technique for documents written in a language such as HTML or XML (the second technique).

The first technique outputs the management information of a document as numerals or a barcode when the document is printed and, at retrieval time, detects the barcode pattern by pattern matching to obtain the management information of the corresponding document (see, for example, Patent Document 1).

The second technique exploits the fact that documents written in HTML, XML, or the like are structured by tags (which indicate the attributes of information). It extracts the information at the locations corresponding to specific tags and stores it in a database together with the inheritance relationships between documents, thereby managing (structured) documents (see, for example, Patent Document 2).
Patent Document 1: Japanese Examined Patent Publication No. 7-76905, "Document Management Device"
Patent Document 2: JP 2003-330938 A, "Structured document storage method and search method thereof"

However, both the first technique (Patent Document 1) and the second technique (Patent Document 2) presuppose either the stage at which an electronic document is being created or an electronic document that has already been created and exists in a database. Documents managed only on paper media are outside their scope.

Furthermore, the first technique (Patent Document 1) extracts meta information (revision history, management information, etc.) from information such as a barcode on an electronic document. To search for a document using meta information as a key, that meta information must be stored in a database in advance. The first technique therefore cannot be applied as it is to searching for documents by meta information.

The second technique (Patent Document 2) can retrieve information, images, and the like contained in a document, but it presupposes that the document is structured with tags. It therefore cannot extract and manage information from a document that has no tags and consists only of text and images.

The present invention has been made in view of the above points. Its object is to provide a document management method, a document retrieval method, an apparatus, and a program capable of using image features to retrieve, from among image-mixed documents managed on paper media, those in which the same or a similar image is used.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a document management method for digitizing documents in which images are mixed and managing the image-mixed documents in a database, the method performing:
a document input step (step 1) in which document input means inputs an electronic document or a paper document and, if the document is on paper, digitizes it;
an image region extraction step (step 2) in which image region extraction means extracts an image region from the input electronic document;
an image feature extraction step (step 3) in which image feature extraction means extracts, from the extracted image region, an information sequence representing the features of the image itself as an image feature; and
an image feature registration step (step 4) in which image feature registration means registers the image feature in the database in association with the identification information of the document.

The present invention (claim 2) further performs a document identification information embedding step of embedding the identification information of the document in the image itself and,
in the image feature registration step,
registers the image feature in the database in association with the image in which the document identification information is embedded.

The present invention (claim 3) is a document management method for managing information on image-mixed documents stored in a database, the method performing:
an image feature search step in which, each time an image feature is registered in the database, image feature search means searches the database to determine whether it contains an identical and/or similar image feature; and
an image feature information update step in which image feature information update means, if the image feature search step finds an identical and/or similar image feature in the database, updates the existing image feature information in the database by appending the identification information of the corresponding document and, if no such feature exists, adds the image feature information of the registered image feature to the database.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 4) is a document management apparatus for digitizing documents in which images are mixed and managing the image-mixed documents in a database 106, the apparatus comprising:
document input means 101 for inputting an electronic document or a paper document and, if the document is on paper, digitizing it;
image region extraction means 103 for extracting an image region from the electronic document;
image feature extraction means 104 for extracting, from the extracted image region, an information sequence representing the features of the image itself as an image feature; and
image feature registration means 105 for registering the image feature in the database 106 in association with the identification information of the document.

The present invention (claim 5) further comprises document identification information embedding means for embedding the identification information of the document in the image itself, and
the image feature registration means 105
includes means for registering the image feature in the database in association with the image in which the document identification information is embedded.

The present invention (claim 6) is a document management apparatus for managing information on image-mixed documents stored in the database 106, the apparatus comprising:
image feature search means 107 for searching the database 106, each time an image feature is registered, to determine whether it contains an identical and/or similar image feature; and
image feature information update means 108 for updating the existing image feature information in the database 106 if the image feature search means 107 finds an identical and/or similar image feature in the database 106 and, if no such feature exists, adding the image feature information of the registered image feature to the database 106.

The present invention (claim 7) is an image-mixed document retrieval method for searching a database in which digitized image-mixed documents are managed, based on an image in a document, the method performing:
a document input step (step 11) in which document input means inputs an electronic document containing the image to be searched for, the image itself, or a paper document and, if the document is on paper, digitizes it;
an image region extraction step (step 12) in which image region extraction means extracts an image region from the electronic document;
an image feature extraction step (step 13) in which image feature extraction means extracts, from the extracted image region, an information sequence representing the features of the image itself as an image feature;
an image feature search step (step 14) in which image feature search means searches the database for an image feature identical to the extracted image feature; and
a document output step (step 15) in which document output means outputs the document associated with the image feature retrieved from the database.

Further, in the present invention (claim 8), the image feature search means performs a document identification information acquisition step of acquiring document identification information from the image associated with the image feature obtained in the image feature search step and,
in the document output step,
the document is acquired from the database based on the document identification information and output.

The present invention (claim 9) is an image-mixed document retrieval apparatus for searching a database in which digitized image-mixed documents are managed, based on an image in a document, the apparatus comprising:
document input means 101 for inputting an electronic document containing the image to be searched for, the image itself, or a paper document and, if the document is on paper, digitizing it;
image region extraction means 103 for extracting an image region from the electronic document;
image feature extraction means 104 for extracting, from the extracted image region, an information sequence representing the features of the image itself as an image feature;
image feature search means 107 for searching the database for an image feature identical to the extracted image feature; and
document output means 110 for outputting the document associated with the image feature retrieved from the database.

Further, in the present invention (claim 10), the image feature search means 107
includes document identification information acquisition means for acquiring document identification information from the image associated with the image feature obtained by the image feature search means, and
the document output means 110
includes means for acquiring a document from the database 106 based on the document identification information and outputting it.

The present invention (claim 11) is a document management program that causes a computer to execute each means of the document management apparatus according to claims 4 to 6.

The present invention (claim 12) is a document retrieval program that causes a computer to execute each means of the document retrieval apparatus according to claims 9 to 10.

According to the present invention, a non-digitized document containing images and printed on a paper medium is digitized, the feature quantities of the images extracted from the document are computed, and these are stored in a database in association with the identification information of the document. This makes it possible to retrieve image-mixed documents in which the same or a similar image is used.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
This embodiment describes the process of registering an input image-mixed document in the database.

FIG. 3 shows the configuration of the document management apparatus according to the first embodiment of the present invention.

The document management apparatus 100 shown in FIG. 3 comprises a document input unit 101, a content storage memory 102, an image region extraction unit 103, an image feature extraction unit 104, an image feature registration unit 105, and a database 106.

The content storage memory 102 stores the electronic file of the document obtained by the document input unit 101, the images obtained by the image region extraction unit 103, and the image feature corresponding to each image obtained by the image feature extraction unit 104.

FIG. 4 is a flowchart of the operation of the document management apparatus according to the first embodiment of the present invention. The operation of the apparatus configuration of FIG. 3 is described below with reference to FIG. 4.

Step 201) Document input step:
The document input unit 101 inputs an electronic document or a paper document. A document managed on a paper medium is digitized with an input device such as a scanner, and the digitized data is stored in the content storage memory 102 as an electronic file. A scanner is used here as an example of the digitizing means, but a digital camera may be used instead.

Step 202) Image region extraction step:
The image region extraction unit 103 extracts the regions classified as images from the electronic document stored in the content storage memory 102 and stores them in the content storage memory 102. Image regions can be extracted from an electronic document by, for example, the following method.

One such technique binarizes a color document and applies a filter suited to document-area extraction and a filter suited to figure/table-area extraction to the same document image, so that character areas are obtained from the former and figure/table areas from the latter (Katsuyama, Kurokawa, Takebe, Fujiki, Naoi: "Color document image layout analysis method using two types of binarization methods for character string extraction / chart extraction", IEICE General Conference, D-12-88, p. 220 (2006)). With this technique, the figure/table areas obtained may be treated as image regions as they are, or they may be further separated into figure areas and table areas, with only the figure areas extracted as image regions. The separation can be done by computing histograms in the horizontal and vertical directions and treating an area whose histograms have periodic peaks in both directions as a table area, which leaves only the figure areas to be extracted.
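The histogram test for separating tables from figures can be sketched as follows. This is a minimal illustration, not the cited paper's method: the region is assumed to be a binarized list of rows (1 = ink), and the function names, the peak threshold `ratio`, and the spacing `tolerance` are hypothetical choices made for the example.

```python
def projection_histograms(region):
    """Return (row_sums, col_sums) for a binary region given as a list of rows."""
    rows = [sum(row) for row in region]
    cols = [sum(col) for col in zip(*region)]
    return rows, cols


def peak_positions(hist, ratio=0.8):
    """Start index of each run of bins that reach `ratio` of the maximum (assumed threshold)."""
    top = max(hist, default=0)
    if top == 0:
        return []
    peaks, prev_high = [], False
    for i, value in enumerate(hist):
        high = value >= ratio * top
        if high and not prev_high:  # count a thick line or solid blob as one peak
            peaks.append(i)
        prev_high = high
    return peaks


def is_periodic(peaks, tolerance=1, min_peaks=3):
    """True if there are enough peaks and their spacing is nearly constant."""
    if len(peaks) < min_peaks:
        return False
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return max(gaps) - min(gaps) <= tolerance


def looks_like_table(region):
    """Treat a region as a table when both projections show periodic peaks."""
    rows, cols = projection_histograms(region)
    return is_periodic(peak_positions(rows)) and is_periodic(peak_positions(cols))


# A 9x9 grid with ruled lines every 4 pixels, and a solid 5x5 blob.
grid = [[1 if (y % 4 == 0 or x % 4 == 0) else 0 for x in range(9)] for y in range(9)]
blob = [[1 if (2 <= y <= 6 and 2 <= x <= 6) else 0 for x in range(9)] for y in range(9)]
```

Here `looks_like_table(grid)` holds because the ruled lines produce evenly spaced peaks in both directions, while the solid blob yields a single merged peak and is kept as a figure.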

Step 203) Image feature extraction step:
The image feature extraction unit 104 acquires the images obtained by the image region extraction unit 103 from the content storage memory 102 and extracts, for each image, an image feature by which the image is uniquely identified. When several images have been extracted from one document, an image feature is extracted from every image in the same way. The resulting image features are stored in the content storage memory 102. Examples of image features include one in which the image is divided into blocks and the average pixel values of the blocks are listed for all blocks (Jia Li, James Z Wang, Gio Wiederhold, "IRM: Integrated Region Matching for Image Retrieval", Proceedings of the eighth ACM international conference on Multimedia, pp.147-156 (2000)), and one in which the image is divided into blocks and each block is represented by the rank of its average pixel value among all blocks (for example, its position counted from the darkest block) (Takada, Abe, Kawamura: "Content identification method with transformation tolerance", Meeting on Image Recognition and Understanding (MIRU2005), 2005).
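The two block-based features mentioned above can be sketched as follows. This is a simplified illustration under assumptions, not the cited papers' implementations: the image is a grayscale list of rows whose sides are divisible by `blocks_per_side`, and both function names are hypothetical.

```python
def block_means(pixels, blocks_per_side):
    """Average pixel value of each block, listed row by row over all blocks."""
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // blocks_per_side, width // blocks_per_side
    means = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            block = [pixels[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            means.append(sum(block) / len(block))
    return means


def rank_feature(means):
    """Rank of each block's mean among all blocks, 1 = darkest."""
    order = sorted(range(len(means)), key=lambda i: means[i])
    ranks = [0] * len(means)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [90, 90, 50, 50],
    [90, 90, 50, 50],
]
brighter = [[value + 20 for value in row] for row in image]

means = block_means(image, 2)                          # [10.0, 200.0, 90.0, 50.0]
ranks = rank_feature(means)                            # [1, 4, 3, 2]
ranks_brighter = rank_feature(block_means(brighter, 2))
```

The rank-based feature is unchanged when every pixel is brightened uniformly, which is one reason an ordering can tolerate conversions that shift absolute pixel values.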

Step 204) Image feature registration step:
The image feature registration unit 105 retrieves from the content storage memory 102 the image features, obtained by the image feature extraction unit 104, that correspond to the images contained in one document, and stores them in the database 106 in association with the document information (document management number). The image features can be associated with the document information at registration in, for example, either of two ways: managing the document management numbers and image features in table form, as shown in FIG. 5; or embedding the document management number in the image itself as a digital watermark, as shown in FIG. 6, and managing each image feature as a pair with information indicating where the image is kept (image storage location), as shown in FIG. 7.
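The two association formats can be pictured with plain dictionaries. This is a minimal sketch: the in-memory dictionaries stand in for database 106, the function names are hypothetical, and the management number "ZZ-0000", its feature strings, and the file path are made-up sample values rather than entries from the figures.

```python
def register_table_format(table, management_number, features):
    """FIG. 5 style: one entry per image feature, listing the documents that contain it."""
    for feature in features:
        documents = table.setdefault(feature, [])
        if management_number not in documents:
            documents.append(management_number)


def register_location_format(table, feature, image_location):
    """FIG. 7 style: image feature paired with where the watermarked image is stored."""
    table[feature] = image_location


# Table format: a document with two images yields two entries.
by_document = {}
register_table_format(by_document, "ZZ-0000", ["111122223333", "444455556666"])
register_table_format(by_document, "ZZ-0000", ["111122223333"])  # repeated, no duplicate

# Location format: the management number lives inside the image as a watermark,
# so the table only needs to point at the image file (hypothetical path).
by_location = {}
register_location_format(by_location, "111122223333", "images/zz-0000_01.png")
```

In the table format a lookup returns management numbers directly, while in the location format it returns an image from which the management number still has to be read.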

[Second Embodiment]
This embodiment describes the process of updating the database 106 in which image features and document information are stored.

FIG. 8 shows the configuration of the document management apparatus according to the second embodiment of the present invention.

In FIG. 8, components identical to those in FIG. 3 are given the same reference numerals, and their description is omitted.

The document management apparatus 200 shown in FIG. 8 has a function for searching image features and a function for updating image feature information, and comprises the image feature registration unit 105, the database 106, an image feature search unit 107, an image feature information update unit 108, and the content storage memory 102. Although this configuration is shown separately from FIG. 3, it may be formed integrally by adding the image feature search unit 107 and the image feature information update unit 108 to the configuration of FIG. 3.

The operation of the document management apparatus 200 is described below.

FIG. 9 is a flowchart of the operation in the second embodiment of the present invention.

Step 204) Image feature registration step:
This corresponds to step 204 in the first embodiment.

Step 205) Image feature search step:
For each image feature registered in the database 106 in step 204, the image feature search unit 107 searches the database 106 to determine whether it contains an identical or similar image feature, and temporarily stores the result in the content storage memory 102.

The following describes the case where the image feature information is managed in the format shown in FIG. 5. In the search, an image feature in the database 106 is judged identical if its character string matches that of the input image feature exactly, and judged similar if its similarity to the input image feature is high. The similarity of image features is obtained from, for example, the inner product between vectors when the character strings of the image features are regarded as multidimensional vectors, or the number or proportion of mismatches when the character strings are compared element by element; in either case, a smaller value indicates a higher similarity. The image feature search unit 107 thereby obtains the management numbers of the documents that contain an image judged to match or resemble the image being searched for.
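The element-wise mismatch measures can be sketched as follows. This is an illustration under assumptions: image features are equal-length strings compared character by character, the function names are hypothetical, and the threshold `max_rate` is a made-up value, since the text does not state where "similar" ends.

```python
def mismatch_count(a, b):
    """Number of positions at which two equal-length feature strings differ."""
    if len(a) != len(b):
        raise ValueError("feature strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_rate(a, b):
    """Proportion of differing positions; smaller means more similar."""
    return mismatch_count(a, b) / len(a)


def classify(a, b, max_rate=0.1):
    """Return 'identical', 'similar', or 'different' (max_rate is an assumed threshold)."""
    if a == b:
        return "identical"
    return "similar" if mismatch_rate(a, b) <= max_rate else "different"


same = classify("123123123456", "123123123456")
near = classify("123123123456", "123123123446")   # one of twelve elements differs
far = classify("123123123456", "004567891234")
```

With these sample strings, one differing element out of twelve falls under the assumed threshold and is classified as similar, while the third string differs in every position.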

When the image feature information is managed in the format shown in FIG. 7, the search instead yields the location (image storage location) of each image that has a matching or similar image feature.

Step 206) Image feature information update step:
The image feature information update unit 108 retrieves the result obtained by the image feature search unit 107 from the content storage memory 102 and updates the information stored in the database 106.

The update is described in detail below, taking as an example the case where the image feature information is managed in the format shown in FIG. 5.

When the database 106 already contains an image feature that matches the one input from the image feature registration unit 105, that is, when, as shown in FIG. 10(a), the feature data of an image contained in the document with management number "EF-0099" matches "234234567567", the image feature information update unit 108 updates the image feature information by appending the document management number "EF-0099" to the entry for that image feature. When the image feature input from the image feature registration unit 105 is judged neither identical nor similar to any image feature in the database 106, the image feature "004567891234" of document management number "CD-1022" is newly added, as shown in FIG. 10(b). Finally, when the database 106 contains an image feature similar to the one input from the image feature registration unit 105, that is, when, as shown in FIG. 10(c), the image feature "123123123446" of document management number "EF-0003" is similar to the image feature "123123123456" of document management number "AB-0001", the new image feature and its document management number are added and, at the same time, "123123123446" is appended as a similar image feature to the entry for the image feature "123123123456" that was retrieved as similar.
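The three update cases of FIG. 10 can be sketched as one function. This is a minimal illustration under assumptions: a dictionary stands in for database 106, the entry layout with `documents` and `similar` lists is hypothetical, similarity is taken to be at most `max_mismatches` differing elements, and the management number "XY-0000" for the pre-existing feature "234234567567" is a made-up placeholder.

```python
def update_feature_info(db, management_number, feature, max_mismatches=1):
    """Apply the identical / new / similar update and report which case occurred."""
    if feature in db:
        # Case (a): identical feature exists, append the document management number.
        if management_number not in db[feature]["documents"]:
            db[feature]["documents"].append(management_number)
        return "identical"

    # Find similar features before the new entry is added.
    similar = [
        known for known in db
        if len(known) == len(feature)
        and sum(x != y for x, y in zip(known, feature)) <= max_mismatches
    ]
    # Cases (b) and (c): add the new feature with its management number.
    db[feature] = {"documents": [management_number], "similar": []}
    # Case (c): note the new feature on each similar existing entry.
    for known in similar:
        db[known]["similar"].append(feature)
    return "similar" if similar else "new"


db = {
    "123123123456": {"documents": ["AB-0001"], "similar": []},
    "234234567567": {"documents": ["XY-0000"], "similar": []},  # placeholder number
}
case_a = update_feature_info(db, "EF-0099", "234234567567")
case_b = update_feature_info(db, "CD-1022", "004567891234")
case_c = update_feature_info(db, "EF-0003", "123123123446")
```

After these three calls the entry for "234234567567" lists both documents, "004567891234" exists as a new entry, and the entry for "123123123456" records "123123123446" as a similar feature.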

In this way, the information in the database 106 is updated by determining, for each new image feature and document management number registered by the image feature registration unit 105, whether the feature matches or resembles an image feature already in the database 106.

It is evident that the information can be updated in the same way when the image feature information is managed in the format shown in FIG. 7.

[Third Embodiment]
This embodiment describes the search process.

FIG. 11 shows the configuration of the search device according to the third embodiment of the present invention. In FIG. 11, components identical to those in FIGS. 3 and 8 are given the same reference numerals, and their description is omitted.

The search device 300 shown in FIG. 11 comprises the content storage memory 102, the image region extraction unit 103, the image feature extraction unit 104, the database 106, the image feature search unit 107, a search information input unit 109, and a document output unit 110.

Although the search device 300 shown in FIG. 11 is configured independently, it may also be included in the document management apparatus shown in FIGS. 3 and 8.

The components that do not overlap with the configurations of FIGS. 3 and 8 are described below following the flowchart of FIG. 12. In the flowchart of FIG. 12, operations identical to those in FIGS. 4 and 9 are given the same step numbers.

Step 207) Search information input step:
The search information input unit 109 inputs a document that contains the image to be searched for, or the image itself. When the input is on a paper medium, it is digitized through a device such as a scanner, as in the first embodiment, and stored in the content storage memory 102. When the input is electronic data, it is stored in the content storage memory 102 as it is.

Step 202) Image region extraction step:
When the data stored in the content storage memory 102 by the search information input unit 109 is in the form of a document, the image region extraction unit 103 extracts the image regions and stores the resulting one or more pieces of image data in the content storage memory 102.

Step 203) Image feature extraction step:
The image feature extraction unit 104 extracts image features from the image data stored in the content storage memory 102.

Step 205) Image feature search step:
The image feature search unit 107 searches the image features stored in the database 106 for those that match the extracted image feature, and obtains either the management numbers of the documents containing an image that matches or resembles the input image feature (when the format shown in FIG. 5 is used) or the storage locations of such images (when the format shown in FIG. 7 is used). In the latter case, the retrieved images are first displayed on an external device such as a monitor, and the document management information embedded as digital watermark information is then detected from the image that the user designates with a mouse, keyboard, or the like.
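Reading a management number back out of an image can be illustrated with a least-significant-bit scheme. The text does not say which digital watermarking method is used, so LSB embedding is swapped in here purely as a simple stand-in; the function names, the flat list of 8-bit pixel values, and the one-byte length prefix are all assumptions of this sketch.

```python
def embed_management_number(pixels, management_number):
    """Write a length-prefixed ASCII management number into the pixels' lowest bits."""
    payload = management_number.encode("ascii")
    if len(payload) > 255:
        raise ValueError("management number too long for a one-byte length prefix")
    data = bytes([len(payload)]) + payload
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for this management number")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit  # replace only the lowest bit
    return marked


def extract_management_number(pixels):
    """Recover the management number written by embed_management_number."""
    def read_byte(offset):
        value = 0
        for pixel in pixels[offset:offset + 8]:
            value = (value << 1) | (pixel & 1)
        return value

    length = read_byte(0)
    return bytes(read_byte(8 * (k + 1)) for k in range(length)).decode("ascii")


original = [128] * 200                      # flat grayscale image, 8-bit values
marked = embed_management_number(original, "AB-0001")
recovered = extract_management_number(marked)
largest_change = max(abs(a - b) for a, b in zip(original, marked))
```

Each pixel changes by at most one gray level, so the marked image looks the same as the original. A plain LSB mark like this would not survive printing and rescanning, which is a limitation of the stand-in rather than a statement about the method in FIG. 6.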

Step 208) Document output step:
The document output unit 110 outputs the electronic document corresponding to the document management number obtained from the image feature search unit 107 to an external device. Here, the external device that displays the electronic document and the image itself may be a monitor such as a CRT, or it may be a printer.
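
For illustration only, Steps 203 to 208 above can be sketched as follows for a database managed in the format of FIG. 5 (image feature associated with document management numbers). The description does not specify the feature or the matching rule, so everything concrete here is an assumption: the coarse RGB histogram used as the image feature, the Euclidean distance and its threshold, the record field names `feature` and `doc_ids`, and the sample document management numbers.

```python
from math import sqrt


def extract_feature(pixels, bins=4):
    """Step 203 (assumed feature): normalized coarse RGB histogram.

    `pixels` is a list of (r, g, b) tuples with values in 0..255.
    """
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = len(pixels) or 1
    return [h / total for h in hist]


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_documents(query_feature, database, threshold=0.1):
    """Step 205: collect document management numbers whose registered
    image feature is identical or similar to the query feature."""
    hits = []
    for record in database:
        d = distance(query_feature, record["feature"])
        if d <= threshold:
            hits.append((d, record["doc_ids"]))
    hits.sort(key=lambda hit: hit[0])  # closest matches first
    result = []
    for _, doc_ids in hits:
        for doc_id in doc_ids:
            if doc_id not in result:
                result.append(doc_id)
    return result


# Hypothetical database contents: one feature may map to several documents.
database = [
    {"feature": extract_feature([(250, 10, 10)] * 100),
     "doc_ids": ["DOC-001", "DOC-007"]},
    {"feature": extract_feature([(10, 10, 250)] * 100),
     "doc_ids": ["DOC-002"]},
]

# Step 208 would output the documents identified by these numbers.
query_pixels = [(245, 5, 12)] * 100
matches = search_documents(extract_feature(query_pixels), database)
```

In this sketch a query image that is almost entirely red falls into the same histogram bin as the first registered image, so both documents associated with that feature are returned. The watermark-based variant of FIG. 7 is not modeled.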

The functions of the document management apparatus and the search apparatus shown in FIGS. 3, 8, and 11 of the first to third embodiments can be built as a program. The program can be installed and executed on a computer used as the document management apparatus or search apparatus, or distributed via a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the above embodiments, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for digitizing image-containing documents and managing them in a database, and in particular to color document image recognition techniques.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a block diagram of the document management apparatus in the first embodiment of the present invention.
FIG. 4 is a flowchart of the operation of the document management apparatus in the first embodiment of the present invention.
FIG. 5 is an example (part 1) of the database in the first embodiment of the present invention.
FIG. 6 is an example of embedding the management number using a digital watermark in the first embodiment of the present invention.
FIG. 7 is an example (part 2) of the database in the first embodiment of the present invention.
FIG. 8 is a block diagram of the document management apparatus in the second embodiment of the present invention.
FIG. 9 is a flowchart of the operation in the second embodiment of the present invention.
FIG. 10 is an example of the database during update processing in the second embodiment of the present invention.
FIG. 11 is a block diagram of the search apparatus in the third embodiment of the present invention.
FIG. 12 is a flowchart of the operation of the document search apparatus in the third embodiment of the present invention.

Explanation of symbols

100 Document management apparatus
101 Document input means, document input unit
102 Content storage memory
103 Image region extraction means, image region extraction unit
104 Image feature extraction means, image feature extraction unit
105 Image feature registration means, image feature registration unit
106 Database
107 Image feature search means, image feature search unit
108 Image feature information update means, image feature information update unit
109 Search information input means, search information input unit
110 Document output means, document output unit
200 Document management apparatus
300 Document search apparatus

Claims

A document management method for digitizing a document containing images and managing the image-containing document in a database, the method comprising:
A document input step in which document input means inputs an electronic document or a paper medium document and, if the document is a paper medium, digitizes the document;
An image region extraction step of extracting an image region from the electronic document;
An image feature extraction step in which image feature extraction means extracts, from the extracted image region, an information sequence representing the features of the image itself as an image feature; and
An image feature registration step in which image feature registration means registers the image feature in the database in association with identification information of the document.

The document management method according to claim 1, further comprising a document identification information embedding step of embedding the document identification information in the image itself,
Wherein, in the image feature registration step, the image feature and the image in which the document identification information is embedded are registered in the database in association with each other.

A document management method for managing information related to image-containing documents stored in a database, the method comprising:
An image feature search step of searching the database, each time an image feature is registered in the database, to determine whether an identical and/or similar image feature already exists in the database; and
An image feature information update step in which image feature information update means, when the image feature search step finds an identical and/or similar image feature in the database, updates the database by adding the identification information of the document to the existing image feature information, and otherwise adds the image feature information of the image feature being registered to the database.

A document management apparatus for digitizing a document containing images and managing the image-containing document in a database, the apparatus comprising:
Document input means for inputting an electronic document or a paper medium document and, if the document is a paper medium, digitizing the document;
Image region extraction means for extracting an image region from the electronic document;
Image feature extraction means for extracting, from the extracted image region, an information sequence representing the features of the image itself as an image feature; and
Image feature registration means for registering the image feature in the database in association with identification information of the document.

The document management apparatus according to claim 4, further comprising document identification information embedding means for embedding the document identification information in the image itself,
Wherein the image feature registration means includes means for registering the image feature and the image in which the document identification information is embedded in the database in association with each other.

A document management apparatus for managing information related to image-containing documents stored in a database, the apparatus comprising:
Image feature search means for determining, each time an image feature is registered in the database, whether an identical and/or similar image feature already exists in the database; and
Image feature information update means for, when the image feature search means finds an identical and/or similar image feature in the database, updating the database by adding the identification information of the document to the existing image feature information, and otherwise adding the image feature information of the image feature being registered to the database.

A document search method for searching, based on an image in a document, a database in which digitized image-containing documents are managed, the method comprising:
A document input step in which document input means inputs an electronic document including an image to be searched for, the image to be searched for itself, or a paper medium document and, if the document is a paper medium, digitizes the document;
An image region extraction step of extracting an image region from the electronic document;
An image feature extraction step in which image feature extraction means extracts, from the extracted image region, an information sequence representing the features of the image itself as an image feature;
An image feature search step in which image feature search means searches the database for an image feature identical to the extracted image feature; and
A document output step in which document output means outputs the document associated with the image feature retrieved from the database.

The document search method according to claim 7, wherein the image feature search means performs a document identification information acquisition step of acquiring document identification information from the image associated with the image feature obtained in the image feature search step, and
Wherein, in the document output step, the document is obtained from the database based on the document identification information and is output.

A document search apparatus for searching, based on an image in a document, a database in which digitized image-containing documents are managed, the apparatus comprising:
Document input means for inputting an electronic document including an image to be searched for, the image to be searched for itself, or a paper medium document and, if the document is a paper medium, digitizing the document;
Image region extraction means for extracting an image region from the electronic document;
Image feature extraction means for extracting, from the extracted image region, an information sequence representing the features of the image itself as an image feature;
Image feature search means for searching the database for an image feature identical to the extracted image feature; and
Document output means for outputting the document associated with the image feature retrieved from the database.

The document search apparatus according to claim 9, wherein the image feature search means includes document identification information acquisition means for acquiring document identification information from the image associated with the image feature obtained by the image feature search means, and
Wherein the document output means includes means for obtaining the document from the database based on the document identification information and outputting it.

A document management program for causing a computer to execute each means of the document management apparatus according to claim 4.

A document search program for causing a computer to execute each means of the document search apparatus according to claim 9.