
CN1269060C - Method and system of digitizing ancient Chinese books and automatizing the content search - Google Patents


Info

Publication number
CN1269060C
Authority
CN
China
Prior art keywords
feature
page
retrieval
module
display
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 00119542
Other languages
Chinese (zh)
Other versions
CN1336604A (en)
Inventor
施伯乐
张亮
王勇
陈智峰
印峻
陈国梁
舒韵宏
焦宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd
Fudan University
Original Assignee
SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd and Fudan University
Priority to CN 00119542
Publication of CN1336604A
Application granted
Publication of CN1269060C
Anticipated expiration
Expired - Fee Related (current legal status)

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention develops and combines several known techniques into a complete set of technical measures and process flows. Together with a computer system formed from known computer hardware and the software modules that implement these measures and flows, it solves the technical problem of converting ancient Chinese books, a scarce resource characterized mainly by brush-handwritten Chinese characters, into a medium that can be reused without limit at low cost, and of separating and extracting, directly on that medium, the part of the resource that satisfies given objective conditions.

Figure 00119542

Description

Method for processing and reusing ancient Chinese books, and the computer software and hardware system it employs
Technical field
The present invention relates to a method that uses computer hardware and software to process and reuse ancient Chinese books, and to the computer system employed by that method.
Background technology
As an important part of humanity's cultural heritage, ancient books have high value for academic research and artistic appreciation. Because they are scarce and precious, the public cannot draw on that value on a large scale, and even when access is strictly restricted, the safety and long-term preservation of the originals remain hard to guarantee. Uncovering and effectively using ancient literature has become one of the main goals of digital library projects in many countries. The approaches proposed so far for digitizing ancient books and using the resulting digital media can be summarized as follows:
Index plus image browsing. The pages of the ancient book are first scanned at a predetermined resolution; after noise removal, the resulting digital media of the pages ("page images" for short) are stored on a mass storage device (commonly optical discs). Library or museum professionals then index the page images (for example by division/class/genus/item classification, title, author's dynasty, author's name, mode of authorship, year of publication, place of publication, publisher, format, line layout, annotators and collators, authors of prefaces and postscripts, collectors' seals, cover, title page, preface, front/back inserted leaves, explanatory notes, table of contents, illustrations, appendices, postscript, and so on). These entries serve as additional information about the page images, and the corresponding index is built and kept on the storage device for later lookup. Using a data input device (keyboard or mouse), the searcher retrieves an ancient book through the limited number of retrieval points the system provides (commonly call number, division/class/genus/item classification, title, author's dynasty, author's name), and then browses the page images of the whole book or of selected pages. Using the index information prepared in advance, the searcher can also browse the cover, front/back inserted leaves, title page, preface, explanatory notes, table of contents, illustrations, appendices, postscript, and so on. Such systems generally also provide auxiliary functions during browsing, such as paging forward and backward and zooming the image in and out. The characteristics of this approach are:
● It converts the scarce resource of ancient books into digitized page images that can be widely used.
● It cannot separate and extract part of the resource from the page images.
Auxiliary text plus full-text search. A text corresponding to the ancient book is first produced (for example by manual keyboard entry); full-text search technology is then applied to this auxiliary text to retrieve by textual content; finally, the page images are fetched through the correspondence between text and pages. This indirect approach depends on the auxiliary text file, and the stage in which that file is produced is subject to constraints that the professional holders of ancient books (such as libraries or museums) cannot accept: verifying that the text is identical to the content of the original, the size of the character set, the handling of special symbols, and the degree of automation. Because of these problems, the text-based information retrieval method and system proposed in Chinese patent application publication CN-1151558A cannot be applied to content retrieval over ancient-book pages, which are in essence images. In addition, the extensive use of interchangeable (loan) characters in ancient books leaves full-text search without the capability needed for ancient-book content retrieval.
Optical character recognition plus full-text search. This approach uses optical character recognition (OCR) to generate the text corresponding to the ancient book as the search target, applies full-text search to this auxiliary text to retrieve by textual content, and finally fetches the page images through the correspondence. However, because ancient books differ in year of publication and edition, the characters they use vary enormously, and no dictionary covering all characters of every period can be built. Moreover, in ancient Chinese books many factors make accurate recognition of brush handwriting difficult: blurred and irregular brush strokes, unstable relative positions between strokes and between components, unstable stroke angles and relative lengths, differences in writing style, and deformation of soft-brush strokes. Chinese patent application publication CN-1165571A proposes generating text strings similar in shape to the search target (such as "middle final accounts" and "between ox final accounts") and running a full-text search for each possible variant, so as to avoid the retrieval problems caused by misrecognition. This method, however, does not help for ancient books, because the number of variants of a text string grows exponentially with its length. For example, if each character has k variants on average and the string has length n, the total number of possible variant strings is k^n. The method therefore lacks algorithmic scalability, which in practice means it lacks practicality. Another major drawback of OCR as the tool for generating the auxiliary text file is that the semantics of the characters and symbols in the ancient book (hereinafter "objects") are "frozen" at the OCR recognition stage: the image of each object is mapped deterministically to one character. During retrieval the searcher has no ability to change the semantic mapping frozen by whoever produced the auxiliary text file. In ancient Chinese books characterized mainly by brush handwriting, variation in the handwriting and staining of the paper inevitably mean that the semantics of an object cannot be determined uniquely; the searcher needs to make the choice on the spot, for example deciding the trade-off between recall and precision according to the goal. Content retrieval methods based on OCR cannot meet this requirement.
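The exponential growth described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the patent: the function names and the sample confusable forms are hypothetical, and every character is assumed to have exactly k confusable forms.

```python
from itertools import product


def variant_count(k: int, n: int) -> int:
    """Count the variant strings when each of n characters has k confusable forms."""
    return k ** n


def enumerate_variants(confusable_forms):
    """List every variant string; confusable_forms holds one list of forms per character."""
    return ["".join(choice) for choice in product(*confusable_forms)]


if __name__ == "__main__":
    # Hypothetical confusable forms for a three-character string (k = 2, n = 3).
    forms = [["A", "a"], ["B", "b"], ["C", "c"]]
    print(len(enumerate_variants(forms)))  # 2 ** 3 = 8 full-text searches
    # With k = 3 the number of searches grows quickly with the string length n.
    for n in (2, 4, 8, 16):
        print(n, variant_count(3, n))
```

With k = 3, a 16-character string already needs more than 43 million separate full-text searches, which is the scalability problem the paragraph points to.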
In short, for ancient Chinese books characterized mainly by brush-handwritten Chinese characters, it is very difficult to solve two problems at once: converting this scarce resource into a medium that can be reused without limit at low cost, and separating and extracting, directly on that medium, the part of the resource that satisfies given objective conditions. No effective, direct technology or system exists at present; that is, the technical effect cannot be achieved by any known technique alone, nor by combining the measures and process flows of the prior art.
Summary of the invention
The object of the present invention is to combine known techniques into a complete, new set of technical measures and process flows that solves the technical problem of converting the scarce resource of ancient Chinese books, characterized mainly by brush-handwritten Chinese characters, into a medium that can be reused without limit at low cost, and of separating and extracting, directly on that medium, the part of the resource that satisfies given objective conditions.
One aspect of the present invention provides a method that uses computer hardware and software to process and reuse ancient Chinese books. The method converts ancient Chinese books, characterized mainly by brush-handwritten Chinese characters, into a medium that can be reused without limit, and separates and extracts, directly on that medium, the part of the resource that satisfies given objective conditions. It is characterized in that it consists of two successive processing flows: a feature-space organization flow that is completed once, and an ancient-book content retrieval flow that can be repeated.
The feature-space organization flow comprises the following steps: the scanning and preprocessing module produces page images and stores them in the page image library, and at the same time passes them through the skeleton to the subsequent feature extraction module, where the objects in each page image are decomposed into an ordered set of independent images; the feature extraction module separates this ordered set of objects into page features, object global-position features and morphological feature vectors, and saves these features in the feature table; the feature indexing module organizes the global-position features and morphological feature vectors and stores them in the feature-space index data structure; the feature-space index module clusters the morphological feature vectors by visual similarity and, automatically by computer, excludes character and symbol images dissimilar to the retrieval point, so that the global-position features of the remaining objects can be fed back.
The content retrieval flow comprises the following steps: the retrieval-sample marking module records the page coordinates on the page image and the order of the coordinate sequence to form the retrieval sample, and passes the order of the coordinate sequence to the constraint verification module as the constraint condition; the feature acquisition module uses the page-coordinate sequence as the condition to identify the specific objects in the page image and obtains the corresponding morphological feature vectors from the feature table; the approximate query module takes each morphological feature vector as a reference point and searches for its nearest-neighbor elements, which form the set of objects similar to that reference point, and passes the corresponding similar-object sets, assembled into a cluster of global-position feature sets, to the constraint verification module; the constraint verification module checks, against the constraint condition, which combinations of elements from the cluster are valid, and forms the retrieval result; and the result display/browse module presents the retrieval result on the searcher's client screen.
Another aspect of the present invention provides a computer software and hardware system that implements the method of the first aspect. The system consists of a configuration of known computer hardware together with the component modules of a computer application software system that carries out the method flow of the first aspect, and is characterized as follows.
The hardware configuration comprises a server, made up of a central processing unit, random access memory, hard disk, keyboard, display, network access device, scanner and pointing device, and a client, made up of a central processing unit, main memory, hard disk or read-only memory, keyboard, display, network access device and pointing device.
The software system comprises the following modules, which operate in combination with the hardware configuration:
the scanning and preprocessing module, which uses the computing power of the central processing unit, the storage capacity of the random access memory, the display capability of the display and the image acquisition capability of the scanner to produce page images and store them in the page image library, which resides on the hard disk, and which at the same time generates the skeleton object;
the feature extraction module, which uses the computing power of the central processing unit, the storage capacity of the random access memory, the display capability of the display and the positioning capability of the pointing device, is connected to the scanning and preprocessing module through the skeleton object, receives and decomposes the objects in the page image, separates the ordered set of objects into page features, object global-position features and morphological feature sequences, and stores them in the feature table; this process is driven by entering the relevant commands and parameters, and the feature table resides on the hard disk;
the feature indexing module, which uses the computing power of the central processing unit and the storage capacity of the random access memory to organize the global-position features and morphological feature vectors extracted by the feature extraction module; the feature vectors reside on the hard disk;
the feature-space index data structure module, which keeps the organized global-position features and morphological feature vectors on the hard disk;
the hard disk of the server, which permanently stores the computer operating system and the digitized ancient-book library;
the retrieval-sample marking module, which uses the computing power of the central processing unit, the storage capacity of the main memory, the display capability of the display and the positioning capability of the pointing device to determine the page coordinates of the objects and the order of the coordinate sequence, forming the searcher's retrieval sample;
the feature acquisition module, which uses the computing power of the central processing unit and the storage capacity of the main memory, takes the page-coordinate sequence as the condition to identify the specific objects in the page image, and obtains the corresponding feature vectors from the feature table;
the approximate query module, which uses the computing power of the central processing unit and the storage capacity of the random access memory, takes each reference point as the clue to match its set of similar objects in the feature-space index on the hard disk, and assembles the similar-object sets of all objects in the retrieval sample into a cluster of global-position feature sets;
the constraint verification module, which uses the computing power of the central processing unit and the storage capacity of the random access memory, receives the order of the coordinate sequence as the constraint condition together with the cluster of global-position feature sets, checks which combinations of elements from the cluster are valid, and forms the retrieval result; and
the result display/browse module, which uses the computing power of the central processing unit, the storage capacity of the random access memory and the display capability of the display to present the retrieval result on the display screen of the searcher's client.
The network access devices are used for communication between the server and the client.
The method and system of the present invention achieve the technical effect of converting the scarce resource of ancient Chinese books, characterized mainly by brush-handwritten Chinese characters, into a medium that can be reused without limit at low cost, namely the digitized page images of the ancient works, and of separating and extracting, directly on that medium, the part of the resource that satisfies given objective conditions, namely the page images that satisfy preset conditions.
Description of drawings
Embodiments of the present invention are described below with reference to the drawings.
Fig. 1 shows the system architecture and the basic processing flow;
Fig. 2 is a block diagram of the system hardware structure;
Fig. 3 is the overall flowchart of the retrieval method;
Fig. 4 is the flowchart of feature-space organization;
Fig. 5 is the flowchart of content retrieval;
Fig. 6 explains the symbols used in the flowcharts;
Fig. 7 is a schematic diagram of the vertical projection of a binary bitmap;
Fig. 8 shows smoothing with an auxiliary grid;
Figs. 9a and 9b are page images with the line-separation marks filled in;
Figs. 10a and 10b are the results of segmenting objects from the lines;
Fig. 11 is an example of a segmented object;
Fig. 12 is the thinned bitmap of Fig. 11;
Fig. 13 is the bitmap of Fig. 12 after normalization;
Fig. 14 is an example of dividing Fig. 13 into level-1 and level-2 regions based on the center of gravity;
Fig. 15 defines the horizontal, vertical, left-falling and right-falling stroke factors;
Fig. 16 shows the coding rules for level-1 and level-2 regions;
Fig. 17 shows the distribution of the left-falling and right-falling stroke factors in the level-1 regions of Fig. 14 and the distribution of the horizontal and vertical stroke factors in the level-2 regions.
Embodiment
In the present invention, the medium that can be reused without limit at low cost is the digitized page image of the ancient work (hereinafter "page image"). Separating and extracting, directly on the medium, the part of the resource that satisfies objective conditions takes the form of content retrieval, and the extraction result is the set of page images that satisfy preset conditions.
The basic processing flow of the retrieval method in the system of the present invention is now described with reference to Fig. 1. Note that two processing units in Fig. 1, retrieval-sample marking 121 and result display/browsing 125, are stored as program files, separately or together, on hard disk 204b of Fig. 2; the processing units represented by the remaining blocks are stored as data files or program files, separately or together, on hard disk 204a of Fig. 2.
The retrieval method and technology of the present invention consist of two successive processing stages, feature-space organization 100 and ancient-book content retrieval 120. The digitized ancient-book library 110 produced by the former provides the basis for the latter. The feature-space organization stage 100 is completed once, whereas the content retrieval stage 120 can be repeated any number of times as the searcher requires.
The ancient book passes through scanning and preprocessing module 101, which on the one hand produces page images and stores them in page image library 111 for users to browse, and on the other hand passes the page images through the skeleton to the subsequent feature extraction module 102, where the objects in them are decomposed into an ordered set of independent images. The page images stored in library 111 may be the original scans (color or grayscale images), which preserve the original appearance and style of the ancient book, or clear images produced by preprocessing, which are more legible. Feature extraction module 102 then separates the ordered set of objects and converts it into three kinds of features: page features, the global-position features of the objects, and morphological feature sequences. These features are saved in feature table 112. In addition, the global-position features and morphological feature vectors extracted by module 102 are organized by high-dimensional feature indexing module 103 and stored in the feature-space index data structure 113. Besides clustering the feature vectors, which are the mathematical representation of the objects, by visual similarity, feature-space index structure 113 has a second function: it promptly excludes character and symbol images that are dissimilar to the retrieval point, which speeds up the search for objects visually similar to the query point. This is the basis for high-speed ancient-book content retrieval.
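The way an index can exclude dissimilar objects without comparing them in full can be illustrated with a toy example. The sketch below is an assumption for illustration only: this passage does not specify the index structure, so a single-pivot index with triangle-inequality pruning stands in for it, and the names `ObjectFeature` and `PivotIndex` and all feature values are hypothetical.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple


@dataclass(frozen=True)
class ObjectFeature:
    page: int                  # page feature: the page image that holds the object
    position: int              # global-position feature: reading-order index on the page
    vector: Tuple[float, ...]  # morphological feature vector (hypothetical values)


class PivotIndex:
    """Toy feature-space index that prunes candidates with the triangle inequality."""

    def __init__(self, features: List[ObjectFeature], pivot: Tuple[float, ...]):
        self.pivot = pivot
        # Precompute each object's distance to the pivot once, at organization time.
        self.entries = sorted(
            ((dist(f.vector, pivot), f) for f in features), key=lambda e: e[0]
        )

    def range_query(self, query: Tuple[float, ...], radius: float):
        """Return (objects within radius of query, number of full comparisons made)."""
        query_to_pivot = dist(query, self.pivot)
        hits, examined = [], 0
        for object_to_pivot, feature in self.entries:
            # If the two pivot distances differ by more than radius, the object
            # cannot lie within radius of the query, so it is excluded early.
            if abs(query_to_pivot - object_to_pivot) > radius:
                continue
            examined += 1
            if dist(query, feature.vector) <= radius:
                hits.append(feature)
        return hits, examined


if __name__ == "__main__":
    table = [
        ObjectFeature(1, 0, (0.0, 0.0)),
        ObjectFeature(1, 1, (0.1, 0.0)),
        ObjectFeature(2, 0, (5.0, 5.0)),
        ObjectFeature(2, 1, (0.0, 0.2)),
    ]
    index = PivotIndex(table, pivot=(0.0, 0.0))
    hits, examined = index.range_query((0.05, 0.05), radius=0.2)
    print([(f.page, f.position) for f in hits], examined)
```

In this example the object at (5.0, 5.0) is rejected from its precomputed pivot distance alone, so only three of the four objects are compared with the query in full. A production index would use many pivots or a tree over the high-dimensional space.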
The content retrieval stage 120 works by query by example. Retrieval-sample marking module 121 lets the searcher mark objects on the page image being browsed, at any time and at will; it records the page coordinates at which the client pointing device 209b clicks the page image, together with the order of this coordinate sequence, and so forms the searcher's retrieval sample. The order of the coordinate sequence is passed to constraint verification module 124 as the constraint condition. Feature acquisition module 122 uses the page-coordinate sequence itself as the condition to identify the specific objects in the page image and obtains the corresponding feature vectors from feature table 112. Approximate query module 123 takes each feature vector obtained as a reference point and searches feature-space index 113 for its nearest-neighbor elements, which form the set of objects similar to that reference point. Module 123 also assembles the similar-object sets corresponding to all objects in the retrieval sample into a cluster of global-position feature sets and hands it to constraint verification module 124. Module 124 checks, against the constraint condition it received, which combinations of elements from the cluster are valid, and forms the retrieval result. Result display/browse module 125 presents these results prominently on the searcher's client screen 206b, so that the user can browse them and examine their context.
Referring now to Fig. 2, the figure illustrates a system hardware structure for implementing the present invention: a server 200a and a client 200b, both connected to network 210. Server 200a stores, maintains, manages and retrieves the data and page images and transmits the retrieval results. Its hardware is a general-purpose computer architecture interconnected by bus 201a, comprising: central processing unit 202a, which performs computation and controls input/output; random access memory 203a, which holds programs and intermediate data; hard disk 204a, which permanently stores the computer operating system, the retrieval application software, the page images, the feature-space index files and the like; keyboard 205a for entering commands and parameters; display 206a for showing command feedback; network access device 207a; scanner 208 for digitizing the pages of ancient books; and pointing device 209a for function selection and auxiliary positioning. Client 200b handles the human-computer interface, sends query and browse requests, and displays the query results for browsing. Its hardware is a general-purpose computer architecture interconnected by bus 201b, comprising: central processing unit 202b, which performs computation and controls input/output; main memory 203b, which holds programs and intermediate data; hard disk 204b (or read-only memory 204b), which permanently stores the computer operating system, the retrieval application software and the like; keyboard 205b for entering commands and parameters; display 206b for showing page images and command feedback; network access device 207b; and pointing device 209b (such as a mouse or stylus) for designating positions on the screen of display 206b. The server and the client are connected through network 210 by network access devices 207a and 207b and exchange information.
As one special case of the above embodiment, network 210 may be a wide area network (WAN, such as the Internet). In the system architecture known as the browser/server model, communication between client 200b and server 200a follows HTTP (Hypertext Transfer Protocol). Client 200b designates a Web page by the uniform resource locator (URL) address of server 200a, helps the searcher prepare the retrieval/browse request, sends the request to server 200a, and receives the page images and related information (such as Java applets) that server 200a returns. Server 200a stores hypermedia files written in HTML (Hypertext Markup Language) and runs an HTTP daemon that receives the requests made by client 200b and responds to them. When this process receives a request, it creates a new child process to serve it, which performs validity checks and processes and prepares data for the client's request, including using CGI (Common Gateway Interface) programs to pre-process and post-process the data; the processed page images and other results are then sent to the requesting client 200b.
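The request/response exchange of the browser/server model can be sketched with Python's standard library. This is an illustrative stand-in only: the paths, the in-memory `PAGE_IMAGES` store and the function `fetch_page` are hypothetical, and the server handles each request in a new thread rather than in a new child process running CGI programs.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical store of processed page images, keyed by request path.
PAGE_IMAGES = {"/pages/1": b"fake-image-bytes-for-page-1"}


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE_IMAGES.get(self.path)
        if body is None:
            # Validity check failed: the requested page image does not exist.
            self.send_error(404, "unknown page")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_page(path):
    """Start a loopback server, request one path as a client, return (status, body)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        connection = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5
        )
        connection.request("GET", path)
        response = connection.getresponse()
        result = (response.status, response.read())
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_page("/pages/1"))
```

Running the client and the server on the loopback address in one process also mirrors the single-machine configuration, in which no external network is involved.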
As another special case of above-mentioned embodiment, network 210 can be a Local Area Network.
As yet another special case of the above embodiment, server 200a and client 200b may be the same machine. In this case there is no network 210 and no network access devices 207a and 207b, and a loopback adapter is used; the bus is 201a, the central processing unit 202a, the random access memory 203a, the hard disk 204a, the keyboard 205a, the display 206b, the scanner 208 and the pointing device 209a.
In another embodiment the client may be a mobile computing device (such as a notebook computer or a PDA).
The server operating system may be Windows 95/Windows 98 (Microsoft trademarks), MacOS (Apple trademark), or any of the various implementations of Unix (such as IBM's AIX or the free software Linux). A multi-window graphical user interface is not required, but the HTTP access protocol must be supported. The client may use any of the above operating systems, but it requires a multi-window graphical user interface as well as support for the HTTP access protocol. When the client/server embodiment is realized on a single computer, the operating system takes the client-side configuration. When the client is a handheld device such as a PDA, the operating system of the handheld device, or its equivalent, must support the HTTP access protocol.
The flow characteristics of the retrieval method of the present invention and the technologies it uses are described further below.
The visual-similarity-based computer method for ancient-book content retrieval of the present invention is formed by the organic combination of a series of technical units. Each technical unit can be implemented with a known technical scheme, or with the technical scheme proposed by the present invention in exchange for higher execution efficiency. The main content of the present invention is to combine these technical units into a new technical scheme that solves the technical problem of converting the scarce resource of ancient Chinese books, characterized mainly by brush-handwritten Chinese characters, into a medium that can be reused without limit at low cost, and of separating and extracting, directly on that medium, the part of the resource that satisfies given objective conditions. Fig. 3 is the overall flowchart of the retrieval method, and Figs. 4 and 5 are detailed flowcharts of Fig. 3. Fig. 6 explains the symbols used in the flowcharts.
As stated above, the retrieval method consists of two successive processing flows, feature-space organization 100 and ancient-book content retrieval 120. The feature-space organization flow 100 is completed once, in advance, by the provider of the ancient-book information service. Its result, the digitized ancient-book library 110 of Fig. 1, is kept on the server-side hard disk or optical disc 204a of Fig. 2. The content retrieval flow 120 can be repeated any number of times as the searcher requires, and it uses the digitized ancient-book library stored on hard disk or optical disc 204a. The two flows 100 and 120 need not be contiguous in time; it is only required that they occur in the order given in Fig. 3.
The feature-space organization stage is now described further with reference to Fig. 4. As stated above, the purpose of feature-space organization is to generate feature clusters from the content of the ancient book (the objects and their sequence relations) and to build an index structure that makes it easy to search quickly for approximate objects by visual similarity. The basic steps of the feature-space organization stage are as follows:
1. Scan the ancient-book pages 101a
The ancient book is scanned page by page, in page-number order, under visible light or another light source to obtain digitized color or grayscale images. For intact ancient books an ordinary flatbed scanner can be used; for ancient books damaged by fire or other causes, far-infrared or other light sources can be used to reveal the obscured text.
2. Preprocessing 101b
To bring out the content of the ancient book, overcome scanning errors, separate foreground objects from background noise and obtain the objects, preprocessing is carried out before the feature-space index 113 is formally constructed: page skew correction, noise removal, binarization, line/object segmentation, object thinning and the like. This can be done with the preprocessing techniques of standard Chinese optical character recognition (OCR) or with general-purpose image processing functions, with a small amount of manual intervention where necessary. Some embodiments are given below.
(1) Colour and greyscale processing
The digitized page images obtained in scanning step 101a may be colour or greyscale. This preserves the original appearance of the ancient book as far as possible for the user's appreciation. For the needs of subsequent steps, the page image used for feature extraction should be converted to black and white, that is, to a binary image or bitmap. The page image shown to the user can keep its original colour or grey levels.
A colour image is generally represented as a set of points in RGB (red, green, blue) or in another colour space such as YIQ (one luminance component and two chrominance components). From the standpoint of image compression, non-RGB colour spaces are the more common choice, because they concentrate the principal features of the image on one coordinate axis of the space; the greyscale image along that axis essentially preserves the shapes in the image. In Chinese ancient-book content retrieval, converting a colour image to greyscale in this way still preserves the form of the character/symbol objects.
One specific embodiment decomposes the colour image into its Y, I and Q components and retains the Y component as the greyscale image for further processing. The Y component contains the main information of the original image. The conversion between YIQ and RGB is:
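$$\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.322 \\ 0.211 & -0.523 & -0.312 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}, \qquad \begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1.0 & 1.176 & 0.763 \\ 1.0 & -0.411 & -0.677 \\ 1.0 & -0.964 & 1.487 \end{bmatrix} \begin{bmatrix} Y \\ I \\ Q \end{bmatrix}$$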
The greyscale image is converted to a two-tone bitmap by binarization. The key to binarization is choosing a suitable threshold. One method is to determine a global threshold from the grey-level histogram. Let the number of grey levels be G, the total number of pixels in the image be n, and the number of pixels at grey level k (1 ≤ k ≤ G) be $n_k$. The frequency of occurrence of grey level k in the image is
$$p(k) = \frac{n_k}{n}, \qquad k = 1, 2, \ldots, G$$
Plotting p(k) on the ordinate against k on the abscissa gives the grey-level histogram of the image. The grey-level histogram of a Chinese ancient book is generally bimodal, the two peaks representing foreground and background pixels respectively. The grey threshold g (1 ≤ g ≤ G) can be taken at the trough between the two peaks. With threshold g, the greyscale image $IMG_g$ is converted to the two-tone bitmap $IMG_b$:
$$IMG_b(i,j) = \begin{cases} 1, & IMG_g(i,j) \ge g \\ 0, & IMG_g(i,j) < g \end{cases} \qquad i = 1, 2, \ldots, R;\quad j = 1, 2, \ldots, C$$
where R and C are the numbers of rows and columns of the image pixel matrix.
For a multimodal grey-level histogram, a local-threshold binarization method can be used.
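The greyscale conversion and global thresholding described above can be sketched in Python as follows. This is only an illustration under stated assumptions: grey levels are indexed from 0 rather than 1, the function names are hypothetical, and the peak search (the highest level plus the best level at least `gap` levels away) is a heuristic of this sketch, not something the text prescribes.

```python
def luminance(r, g, b):
    """Y component of YIQ for one RGB pixel."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def histogram(gray, levels):
    """Relative frequency p(k) of each grey level k = 0 .. levels-1."""
    counts = [0] * levels
    for row in gray:
        for v in row:
            counts[v] += 1
    n = sum(counts)
    return [c / n for c in counts]


def valley_threshold(p, gap=2):
    """Grey level with the lowest frequency between the two main peaks.

    Assumption of this sketch: the histogram is bimodal and the second
    peak lies at least `gap` levels away from the highest one.
    """
    first = max(range(len(p)), key=lambda k: p[k])
    far = [k for k in range(len(p)) if abs(k - first) >= gap]
    second = max(far, key=lambda k: p[k])
    lo, hi = sorted((first, second))
    return min(range(lo + 1, hi), key=lambda k: p[k])


def binarize(gray, g):
    """Apply the rule of the text: 1 where the grey value is >= g, else 0."""
    return [[1 if v >= g else 0 for v in row] for row in gray]


gray = [[0, 0, 7, 7],
        [0, 1, 6, 7]]
g = valley_threshold(histogram(gray, 8))
bitmap = binarize(gray, g)
```

Whether the value 1 ends up on the ink or on the paper depends on how the scanner encodes grey levels; the comparison can be inverted if the foreground is the darker side.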
(2) Page skew correction
A scanned page image may be skewed because the original was not placed at an exact angle, which affects subsequent processing. In most cases the skew angle is small. Let the range of deviation from the normal (for example vertical) position be [-A, +A]. The two-tone bitmap is rotated from -A to +A in increments of a, and the projection density is computed for each rotation by the method below. The bitmap with the maximum projection density is taken as the corrected image.
Referring to Fig. 7, a rotated bitmap (upper half of Fig. 7) is first projected longitudinally to obtain the horizontal distribution of the foreground pixels (lower half of Fig. 7). Let the projection width be W; the mean line height is then
$$h = \frac{\sum_i \sum_j IMG_b(i,j)}{W}.$$
The projection density is computed on the part of the horizontal distribution that lies above the mean line:
$$\rho = \sum_k \frac{n_k}{W_k},$$
where $n_k$ is the number of points above h in the k-th continuous segment above the mean line, and $W_k$ is the projection width of that segment on the mean line. Using the projection above the mean line rather than the whole horizontal distribution helps reduce the influence of horizontal rules and of the upper and lower borders of the page image.
(3) Noise removal
Smoothing is used to remove residual isolated points from the two-tone bitmap and to smooth stroke edges. The smoothing process is an application of low-pass filtering from image processing.
A simple embodiment uses the 3 × 3 grid shown in Fig. 8 to decide the value of pixel $x_0$. Writing x for "pixel x has value 1 (foreground colour)" and ~x for "pixel x has value 0 (background colour)", the value of pixel $x_0$ after smoothing is:
x 0′=~x 0[x 3x 7(x 1+x 5)+x 1x 5(x 3+x 7)]+
x 0~[(x 3+x 7)
(~(x 4+x 5+x 6)+~(x 1+x 2+x 8)+~(x 1+x 5)+(~(x 6+x 7+x 8)+~(x 2+x 3+x 4))]
(4) Object segmentation
The line and character segmentation techniques of Chinese character OCR can be used directly for object segmentation. A comparatively simple alternative is given below. It consists of three successive steps: column splitting, character splitting and adjustment. As before, the two-tone bitmap $IMG_b$ has width C and height R; the pixel at coordinate (i, j) is written $IMG_b(i, j)$, and $IMG_b(i, j) = 1$ means that the point is foreground.
A. Column splitting
Let $C_j$ be the sum of all pixels in column j. The horizontal distribution formed by the $C_j$ is smoothed to give
$$S_j = \frac{1}{\mu} \sum_{d=0}^{\mu-1} C_{j+d}, \qquad j = 0, \ldots, C-\mu,$$
where μ is the smoothing step. Denote the maximum and minimum of $S_j$ and their difference by
$$M = \max_j\{S_j\}, \qquad m = \min_j\{S_j\}, \qquad D = M - m.$$
Further let Th = m + αD, where the threshold parameter α is generally 0.1 or 0.2. Find the values of j at which $S_j$ = Th, namely $j_0, j_1, \ldots, j_{2n-1}$, and pair them up in order: $p_k = (j_{2k} + j_{2k+1})/2$, 0 ≤ k < n. This gives the sequence $p_k$ of column split lines of the page, shown as dashed lines in Fig. 9a. To obtain object columns that are easy to handle, the ruled vertical lines should also be removed before character splitting. The method is: compute the average column width δ = ($p_{n-1}$ - $p_0$)/(n - 1); if the distance between two adjacent column split lines $p_k$ and $p_{k+1}$ is less than 0.1δ, what lies between them is taken to be a ruled column separator of the ancient book; the space between them is filled with the background colour and the two split lines are replaced by ($p_k$ + $p_{k+1}$)/2. After the vertical lines are removed from Fig. 9a, the columns separated by white bars in Fig. 9b are obtained.
B. Character splitting
Treat each object column obtained above as a parent page image and repeat step A with the roles of rows and columns exchanged. This gives the basic division into individual objects. The result is shown in Fig. 10a.
C. Adjustment
Automatic segmentation occasionally produces a small number of wrong regions, so the segmentation tool should provide image feedback that lets the operator adjust the regions manually. The operator selects the delete/add function with the server-side pointing device 209a of Fig. 1 and then clicks the corresponding object or position. For example, deleting the useless split line at the top of Fig. 10a, which was caused by the outer border of the original book, gives the correct object segmentation shown in Fig. 10b. An example of a fully segmented object is shown in Fig. 11.
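Step A above (column splitting) can be sketched in Python as follows. The sketch rests on assumptions that the text does not state: since a discrete profile rarely equals Th exactly, each maximal run of columns with $S_j$ ≤ Th is treated as one gap and its midpoint is taken as the split line; the function names are hypothetical; and the filling of ruled lines with background colour is omitted.

```python
def column_profile(bitmap):
    """C_j: number of foreground (1) pixels in each column j."""
    return [sum(col) for col in zip(*bitmap)]


def smooth(profile, mu):
    """S_j: mean of mu consecutive column sums, j = 0 .. C - mu."""
    return [sum(profile[j:j + mu]) / mu
            for j in range(len(profile) - mu + 1)]


def split_lines(s, alpha=0.1):
    """Midpoint of every run of columns whose S_j is at or below Th."""
    m, big = min(s), max(s)
    th = m + alpha * (big - m)
    lines, start = [], None
    for j, v in enumerate(s + [big + 1]):  # sentinel closes a trailing run
        if v <= th and start is None:
            start = j
        elif v > th and start is not None:
            lines.append((start + j - 1) / 2)
            start = None
    return lines


def drop_rulings(lines, ratio=0.1):
    """Merge split lines closer together than ratio * average column width."""
    if len(lines) < 2:
        return list(lines)
    delta = (lines[-1] - lines[0]) / (len(lines) - 1)
    merged = [lines[0]]
    for p in lines[1:]:
        if p - merged[-1] < ratio * delta:
            merged[-1] = (merged[-1] + p) / 2
        else:
            merged.append(p)
    return merged


page = [[0, 1, 1, 1, 0, 0, 1, 1, 1, 0]] * 4
cuts = split_lines(smooth(column_profile(page), 1))
```

Step B can reuse the same functions on each column after transposing it, for example with `list(zip(*column))`.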
(5) Thinning
The two-tone bitmap of each object is converted to a skeleton image whose lines are one pixel wide, to reduce the influence of differing stroke widths on feature extraction. The thinning algorithm is as follows:
i. I'' = $IMG_b$;
ii. Do
a. I = I'';
b. Scan all pixels of I to form a new bitmap I'. For each pixel $x_0$ of I, examine its neighbourhood as shown in Fig. 8; if condition $C_1$ holds, set the corresponding position of I' to 1;
c. Scan all pixels of I' to form a new bitmap I''. For each pixel $x_0$ of I', examine its neighbourhood as shown in Fig. 8; if condition $C_2$ holds, set the corresponding position of I'' to 1;
Until I = I'';
iii. Return I''.
The condition $C_1$ in the algorithm is:
C 1=x 0~x 1~x 2~x 3x 4x 5x 6~x 7~x 8+x 0~x 1~x 2x 3~x 4x 5~x 6~x 7~x 8+x 0~x 1~x 2x 3x 4x 5~x 6~x 7~x 8+
x 0~x 1~x 2x 3~x 4x 5x 6~x 7~x 8+x 0~x 1~x 2x 3x 4x 5x 6~x 7~x 8+x 0~x 1~x 2~x 3~x 4x 5~x 6x 7~x 8+
x 0~x 1~x 2~x 3x 4x 5~x 6x 7~x 8+x 0~x 1~x 2~x 3~x 4x 5x 6x 7~x 8+x 0~x 1~x 2~x 3x 4x 5x 6x 7~x 8+
x 0~x 1~x 2x 3~x 4x 5~x 6x 7~x 8+x 0~x 1~x 2x 3x 4x 5~x 6x 7~x 8+x 0~x 1~x 2x 3~x 4x 5x 6x 7~x 8+
x 0~x 1~x 2x 3x 4x 5x 6x 7~x 8+x 0~x 1x 2x 3x 4~x 5~x 6~x 7~x 8+x 0~x 1x 2x 3~x 4x 5~x 6~x 7~x 8+
x 0~x 1x 2x 3x 4x 5~x 6~x 7~x 8+x 0~x 1x 2x 3~x 4x 5x 6~x 7~x 8+x 0~x 1x 2x 3x 4x 5x 6~x 7~x 8+
x 0~x 1x 2x 3~x 4x 5~x 6x 7~x 8+x 0~x 1x 2x 3x 4x 5~x 6x 7~x 8+x 0~x 1x 2x 3~x 4x 5x 6x 7~x 8+
x 0~x 1x 2x 3x 4x 5x 6x 7~x 8+x 0~x 1~x 2~x 3~x 4x 5~x 6x 7x 8+x 0~x 1~x 2~x 3x 4x 5~x 6x 7x 8+
x 0~x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0~x 1~x 2~x 3~x 4x 5x 6x 7x 8+x 0~x 1~x 2~x 3x 4x 5x 6x 7x 8+
x 0~x 1~x 2x 3~x 4x 5~x 6x 7x 8+x 0~x 1~x 2x 3x 4x 5~x 6x 7x 8+x 0~x 1~x 2x 3~x 4x 5x 6x 7x 8+
x 0~x 1~x 2x 3x 4x 5x 6x 7x 8+x 0x 1~x 2~x 3~x 4~x 5~x 6x 7x 8+x 0x 1~x 2~x 3~x 4x 5~x 6x 7x 8+
x 0x 1~x 2~x 3x 4x 5~x 6x 7x 8+x 0x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1~x 2~x 3~x 4x 5x 6x 7x 8+
x 0x 1~x 2~x 3x 4x 5x 6x 7x 8
The bitmap returned when the algorithm ends is the thinned skeleton image. The condition $C_2$ in the algorithm is:
C 2=x 0~x 1~x 2x 3x 4x 5~x 6~x 7~x 8+x 0~x 1x 2x 3x 4~x 5~x 6~x 7~x 8+x 0~x 1x 2x 3x 4x 5~x 6~x 7~x 8+
x 0x 1~x 2x 3~x 4~x 5~x 6~x 7~x 8+x 0x 1~x 2x 3x 4~x 5~x 6~x 7~x 8+x 0x 1~x 2x 3x 4x 5~x 6~x 7~x 8+
x 0x 1~x 2~x 3~x 4~x 5~x 6x 7~x 8+x 0x 1~x 2~x 3~x 4~x 5x 6x 7~x 8+x 0x 1~x 2x 3~x 4~x 5~x 6x 7~x 8+
x 0x 1~x 2x 3x 4~x 5~x 6x 7~x 8+x 0x 1~x 2x 3~x 4~x 5x 6x 7~x 8+x 0x 1x 2x 3~x 4~x 5~x 6~x 7~x 8+
x 0x 1x 2x 3x 4~x 5~x 6~x 7~x 8+x 0x 1x 2x 3x 4x 5~x 6~x 7~x 8+x 0x 1x 2~x 3~x 4~x 5~x 6x 7~x 8+
x 0x 1x 2~x 3~x 4~x 5x 6x 7~x 8+x 0x 1x 2x 3~x 4~x 5~x 6x 7~x 8+x 0x 1x 2x 3x 4~x 5~x 6x 7~x 8+
x 0x 1x 2x 3~x 4~x 5x 6x 7~x 8+x 0~x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1~x 2x 3~x 4~x 5~x 6~x 7x 8+
x 0x 1~x 2x 3x 4~x 5~x 6~x 7x 8+x 0x 1~x 2x 3x 4x 5~x 6~x 7x 8+x 0x 1~x 2~x 3~x 4~x 5~x 6x 7x 8+
x 0x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1~x 2x 3~x 4~x 5~x 6x 7x 8+x 0x 1~x 2x 3x 4~x 5~x 6x 7x 8+
x 0x 1~x 2x 3~x 4~x 5x 6x 7x 8+x 0x 1x 2~x 3~x 4~x 5~x 6~x 7x 8+x 0x 1x 2x 3~x 4~x 5~x 6~x 7x 8+
x 0x 1x 2x 3x 4~x 5~x 6~x 7x 8+x 0x 1x 2x 3x 4x 5~x 6~x 7x 8+x 0x 1x 2~x 3~x 4~x 5~x 6x 7x 8+
x 0x 1x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1x 2x 3~x 4~x 5~x 6x 7x 8+x 0x 1x 2x 3x 4~x 5~x 6x 7x 8+
x 0x 1x 2x 3~x 4~x 5x 6x 7x 8
(6) Normalization
To eliminate the influence of variations in the size and position of handwritten objects, the skeleton image of each object is normalized. For example, Fig. 13 is the normalized bitmap of the skeleton image of Fig. 12; the outer frame marks the boundary of the new bitmap.
The normalization takes the larger of the height and width of the skeleton image as the side length of a square bitmap and places the skeleton image at the centre of that square. The square is called the MBS (Minimal Bounding Square). Compared with conventional normalization using the bounding rectangle MBB (Minimal Bounding Box), this method preserves the aspect ratio of the object, so elongated objects are less prone to distortion during feature extraction.
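A minimal Python sketch of the MBS normalization follows. It assumes that the skeleton is first cropped to the bounding box of its foreground pixels, which the text does not state explicitly, and the function name `to_mbs` is hypothetical.

```python
def to_mbs(bitmap):
    """Centre the foreground of `bitmap` in a square of side max(height, width)."""
    rows = [i for i, row in enumerate(bitmap) if any(row)]
    cols = [j for j in range(len(bitmap[0])) if any(row[j] for row in bitmap)]
    if not rows:                      # empty image: nothing to normalize
        return [[0]]
    top, left = rows[0], cols[0]
    h = rows[-1] - top + 1            # height of the foreground
    w = cols[-1] - left + 1           # width of the foreground
    side = max(h, w)
    off_r, off_c = (side - h) // 2, (side - w) // 2
    square = [[0] * side for _ in range(side)]
    for i in range(h):
        for j in range(w):
            square[off_r + i][off_c + j] = bitmap[top + i][left + j]
    return square


stroke = [[0, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 1, 0, 0]]
mbs = to_mbs(stroke)
```

The vertical stroke keeps its 3:1 proportions inside a 3 × 3 square, whereas stretching its bounding box to a fixed size would turn it into a filled block.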
3. Feature extraction 102
For a single ancient book, the method defines and extracts three classes of basic features: the page feature, and the global position feature and morphological feature of each object. If several volumes transcribed by the same person are processed together, only a book identifier needs to be added. These features describe the content of the ancient book.
When module 102 is reached, every object has been separated from the page image and has a definite geometric coordinate and size range within the page. The three classes of basic features and their extraction methods are defined below.
Definition 1. The global position feature (GLF) of an object is the linear sequence number of the object among the pages of an ancient book.
The linear order in the definition may take any form, as long as the correspondence between objects and global position features is one-to-one. For example, the extraction method can follow the transcription conventions of ancient books (pages in ascending page number, columns within a page from right to left, characters within a column from top to bottom) to number the objects produced by the scanning and preprocessing module. For complex page layouts, the extraction method can first traverse the layout regions along a recursive space-filling curve such as a Hilbert or Peano curve, and then process the interior of each region in the usual way.
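For the simple layout case, the numbering that follows the transcription conventions can be sketched as a sort. The tuple layout (page number, column x position, y position) and the function name are assumptions made for this illustration; objects in the same column are assumed to share the same column x value.

```python
def assign_glf(objects):
    """Number objects page by page, columns right to left, top to bottom.

    `objects` is a list of (page, column_x, y) tuples; the result maps each
    tuple to its global position feature, starting at 1.
    """
    ordered = sorted(objects, key=lambda o: (o[0], -o[1], o[2]))
    return {obj: glf for glf, obj in enumerate(ordered, start=1)}


glf = assign_glf([(1, 100, 10), (1, 300, 50), (1, 300, 10), (2, 300, 10)])
```

Because every object receives a distinct number, the one-to-one requirement of Definition 1 is met as long as no two objects share the same page and coordinates.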
Definition 2. The page feature (PF) of an ancient book consists of the page number and the geometric coordinates of each object within the page.
The page feature describes the geometric layout of the objects within a page.
The morphological feature of an object characterizes its visual semantics. Moreover, apart from polyphonic characters, the written form of a Chinese character uniquely determines its linguistic meaning. In other words, approximate matching of the meaning of characters and symbols can be achieved by comparing character shapes. The character feature extraction technique of any Chinese OCR system can serve as a method of extracting the morphological feature of objects.
However, in Chinese ancient books characterized by brush-written characters, many variable factors affect the extraction of character components and their constituent strokes. For example, uneven stroke thickness, blurred or incomplete strokes, shifts in the relative position of strokes/components between occurrences of the same character, and variation in stroke slant and relative length all affect the matching of objects in the visual sense. A feature extraction technique with stronger fault tolerance is needed. Noting the fact that "the fixed positions and standardized proportions of the components of regular-script characters are the long-standing crystallization of Chinese calligraphy", a morphological feature description and extraction technique is given below that accumulates stroke factors within multilevel centroid-partitioned regions. It is highly tolerant of the variations found in Chinese ancient books described above.
Definition 3. The morphological feature (MF) of an object is the accumulated value of the stroke factor components of its image within the multilevel centroid-partitioned regions.
The morphological feature is extracted as follows:
First, the MBS of the object is partitioned in several layers according to the object's centre of gravity. The partition point of each region is the centre of gravity of the object's foreground points in that region (the black points in the figures). Deeper partitioning proceeds recursively on the basis of the layer above. The one-layer and two-layer partitions of Fig. 13 are shown in Fig. 14.
Then the stroke factors in each region are counted and accumulated by class to form the feature vector. A stroke factor is a basic element making up one of the four strokes: horizontal, vertical, left-falling and right-falling; their dot-matrix arrangements are shown in Fig. 15. Compared with features built on complete strokes, features built on stroke factors are more tolerant of uneven or blurred strokes and of irregular slant or relative length in brush-written characters, and they also allow the non-character symbol objects in ancient books to be handled uniformly. Stroke factors are easy to extract from the bitmap of an object, and many embodiments are possible. For example, taking each of the four stroke factors in turn as a structuring element, the erosion operation of mathematical morphology is applied to the foreground points of Fig. 13 (the black points inside the frame) to obtain the distribution of the four stroke factors within the frame. Applying this extraction to the regions of Fig. 14 gives the stroke distribution in each partitioned region; dividing it by the number of foreground pixels in the region gives the distribution density of the stroke factors in each region. Since horizontal and vertical strokes occur far more often in Chinese characters than left-falling and right-falling strokes, and in order to reduce the dimension of the feature space and improve indexing and retrieval efficiency, the left-falling and right-falling stroke factors can be counted at a shallower level than the horizontal and vertical ones, that is, with one fewer layer of region partitioning. One specific scheme uses two-layer partitioning for the horizontal and vertical stroke factors and one-layer partitioning for the left-falling and right-falling stroke factors. Fig. 17 illustrates the horizontal and vertical stroke distributions in the two-layer partitioned regions and the left-falling and right-falling stroke distributions in the one-layer partitioned regions. With the region numbering rule of Fig. 16, the morphological feature vectors of all objects in the ancient book span a feature space of 16 × 2 + 4 × 2 = 40 dimensions. A vector f in this space is computed by:
$$f(i) = \sum_{1 \le k \le i} \frac{h(k)}{p_2(k)}, \qquad f(16+i) = \sum_{1 \le k \le i} \frac{s(k)}{p_2(k)}, \qquad i = 1, 2, \ldots, 16,$$
$$f(32+j) = \sum_{1 \le k \le j} \frac{p(k)}{p_1(k)}, \qquad f(36+j) = \sum_{1 \le k \le j} \frac{n(k)}{p_1(k)}, \qquad j = 1, 2, 3, 4.$$
In these formulas, $p_1(k)$ and $p_2(k)$ are the numbers of pixels in first-level and second-level partition region k of the bitmap before feature extraction, and h(k), s(k), p(k) and n(k) are the numbers of black pixels of the horizontal, vertical, left-falling and right-falling stroke factors in region k.
Using the accumulated stroke factor values of the multilevel centroid-partitioned regions as the morphological feature captures the visual content of handwritten Chinese characters well, and expresses characters/symbols flexibly through relative stroke distribution density. Defining a metric (or distance) on the feature space yields a vector space; one such metric is the well-known Euclidean distance. In the resulting feature vector space, the morphological feature vector of an object is the coordinate of a point. The feature points of morphologically similar objects therefore cluster naturally, while the feature points of different characters lie further apart.
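The recursive centroid partition and the accumulation of densities can be sketched in Python as follows. Several points are assumptions of this sketch: regions are half-open boxes (top, bottom, left, right), the centroid is rounded down to a pixel index, an empty region is split at its geometric centre, and regions are returned in row-major order, which need not match the numbering rule of Fig. 16. The per-region stroke factor counts are taken as given; the erosion step that produces them is not shown.

```python
def centroid(bitmap, top, bottom, left, right):
    """Centre of gravity of the foreground pixels inside a half-open box."""
    pts = [(i, j) for i in range(top, bottom)
           for j in range(left, right) if bitmap[i][j]]
    if not pts:                                   # empty region: use its middle
        return (top + bottom) // 2, (left + right) // 2
    return (sum(i for i, _ in pts) // len(pts),
            sum(j for _, j in pts) // len(pts))


def partition(bitmap, depth, box=None):
    """Split `box` at its centroid, recursively; returns 4**depth boxes."""
    if box is None:
        box = (0, len(bitmap), 0, len(bitmap[0]))
    if depth == 0:
        return [box]
    top, bottom, left, right = box
    ci, cj = centroid(bitmap, *box)
    quads = [(top, ci, left, cj), (top, ci, cj, right),
             (ci, bottom, left, cj), (ci, bottom, cj, right)]
    return [b for q in quads for b in partition(bitmap, depth - 1, q)]


def cumulative_density(factor_counts, region_pixels):
    """f(i) = sum over k <= i of count(k) / pixels(k), one value per region."""
    out, acc = [], 0.0
    for c, p in zip(factor_counts, region_pixels):
        acc += c / p if p else 0.0
        out.append(acc)
    return out


glyph = [[1, 0, 0, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 0, 0, 1]]
level1 = partition(glyph, 1)
level2 = partition(glyph, 2)
```

Concatenating `cumulative_density` for the horizontal and vertical counts over the 16 second-level regions with the same for the left-falling and right-falling counts over the 4 first-level regions gives the 40 components.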
At this point the features of the ancient book have been extracted, and the page feature together with the morphological and global position features of the objects are saved in feature table 112. That is, the feature table consists of 4-tuples of the form (page number, geometric coordinates within the page, global position feature, morphological feature), one for each object determined by the scanning and preprocessing module 101.
4. Feature space index 113
In practice the generated feature space usually has a high dimension and a large number of feature points. A spatial index structure suited to the application must be designed to organize all feature points sensibly, trading a small storage overhead for fast queries. In principle any spatial indexing method (such as the R-tree and its improvements, the X-tree, SR-tree, PK-tree and so on) can serve as an embodiment of the feature space index structure. However, the performance of some index algorithms, such as the R-tree, degrades sharply as the dimension of the space increases. A preferred embodiment using the SR-tree is given here. For the internal implementation and performance analysis of the SR-tree, see the relevant papers and the documentation of the software package.
A. Data structure
Define the data item $E_i = (MF_i, GLF_i) = (f_i, GLF_i)$. Here $f_i$ is the coordinate of point i in the feature space, that is, the morphological feature vector of object i, and $GLF_i$ is the global position feature of object i.
B. Creating the SR-tree
Call the function new_HnSRTreeFile(Path, Dimension, DataSize, BlockSize, SplitFactor, ReinsertFactor). It creates an empty SR-tree and returns it; the return type is HnSRTreeFile.
The meanings and values of the input parameters are given in the following table:
Parameter name | Type | Meaning | Value
Path | string | Name of the data file in which the SR-tree is saved | (ancient-book name).idx
Dimension | integer | Dimension of the feature space | 40
DataSize | integer | Number of bytes of the attribute (GLF) associated with a feature point | 2
BlockSize | integer | Data block size (bytes) | 8192 (system default)
SplitFactor | integer | Minimum block utilization (percent) | 40 (system default)
ReinsertFactor | integer | Reinsertion factor (percent) | 30 (system default)
C. Inserting data items
Using the SR-tree object File returned in B, call its method Store(...) to insert the data item $E_i = (f_i, GLF_i)$ into the SR-tree. The specific steps are:
HnSRTreeFile File;
File.Store(Point, Data);
The meanings and values of the parameters are given in the following table:
Parameter name | Type | Meaning | Value
Point | HnPoint& | Storage address of the point coordinate in the feature space | Morphological feature vector f of the object
Data | HnData& | Storage address of the attribute of the point in the feature space | GLF of the object
5. Processing flow control
The ancient book is processed in a loop. Within one page image, steps 102 to 113 are applied to each object; step 105 of Fig. 4 tests whether all objects on the page have been processed. If the page has further objects, the above is repeated; otherwise processing moves to the next page. Step 106 of Fig. 4 tests whether the ancient book has been completely converted into the digitized ancient-book library 110.
The retrieval stage 120 is now described with reference to Fig. 5. Content retrieval can only be carried out after the feature space organization step 100 has been completed for the ancient book being searched. Once a feature space index structure has been built, the retriever can perform content retrieval any number of times (see Fig. 3). The purpose of content retrieval is to use the index structure produced by feature space organization to quickly obtain all other objects whose visual content is similar to that of a given object. The basic steps of content retrieval are as follows:
(1) Reading the precision control parameter 501
The retriever adjusts the retrieval precision control parameter interactively. The parameter expresses only the notions of "strict" and "loose"; choosing its value requires no specialized knowledge. The parameter is generally divided into several levels, and the distance thresholds corresponding to the levels can be set freely by the implementer so long as they increase monotonically from zero. One embodiment sets 11 levels: level 0 has a distance threshold of zero, meaning strict matching; level 10 is the loosest condition, with a distance threshold of 1; between them the threshold increases in steps of 0.1. Because content retrieval can be repeated, the retriever can adjust the precision control parameter dynamically in the light of the previous result, striking a new balance between recall and precision for the next search. The precision control parameter affects the search range of approximate object query 123 in the feature space index 113.
(2) Opening the initial page for browsing 502
The retriever can call up a page image by entering any page number, or reach a page by way of a conventional indexing method. Entering the page number directly is the simplest scheme; combining it with an indexing method is more practical. This is not only consistent with the existing retrieval practice of libraries and ancient-book CD servers, but the resulting two-level search pattern is also better suited to handling large numbers of ancient documents in different hands. The access points provided by the indexing method guide the retriever to candidate ancient-book volumes in the digital library or CD server, and the visual-similarity content retrieval method then helps the retriever find the target within a volume.
(3) Marking the search objects 121
On the displayed page image the retriever clicks objects with a pointing device 209b such as a mouse or stylus, and sets or adjusts their order. The sample-marking module 121 records the page number and in-page geometric coordinates supplied by the pointing device, and marks on the page image a natural number showing the order set by the retriever. In combination with the browsing controls, search objects can be marked on several pages. When the retriever starts the search, module 121 forms the retrieval sample from the page numbers and in-page geometric coordinates of these objects and the order of the coordinate sequence. The page numbers and coordinates are passed to the feature acquisition module 122, and the order of the coordinate sequence is passed to the constraint verification module 124 as the constraint. Steps 122 to 123 are then applied to each member object of the retrieval sample; step 506 performs post-processing and tests for the end of the loop.
(4) Obtaining the morphological feature vector of the retrieval sample 122
The morphological feature vector of each member object of the retrieval sample is obtained from the feature table according to the object's page number and in-page geometric coordinates. How this is done depends on how the in-page geometric coordinates are organized in the feature table. After object segmentation, every object on a page image has a rectangle that encloses it (see 2 (4)). If the in-page coordinate of an object is given by the centre point of this rectangle, then, among the entries with the same page number, the entry nearest to the sample member's position by Euclidean distance is found first, and the morphological feature vector is taken from that entry. If the in-page coordinate is given by the corner coordinates of the rectangle, then, among the entries with the same page number, the entry whose rectangle contains the sample member's position is found, and the morphological feature vector is taken from it. The former method saves the storage of one pair of coordinates per object; the latter avoids multiplication and division during comparison and so runs faster. The former is preferable when the ancient book contains many objects or the retrieval sample is generally short.
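The two lookup strategies of step (4) can be compared in a short Python sketch. The feature table layout (a list of dictionaries with the keys `page`, `x0`, `y0`, `x1`, `y1` and `mf`) and the function names are assumptions made for the illustration; for simplicity the centre is derived here from the stored corners rather than stored itself.

```python
import math


def lookup_by_center(table, page, x, y):
    """Entry on `page` whose rectangle centre is nearest to (x, y)."""
    rows = [r for r in table if r["page"] == page]
    return min(rows, key=lambda r: math.hypot((r["x0"] + r["x1"]) / 2 - x,
                                              (r["y0"] + r["y1"]) / 2 - y))


def lookup_by_corners(table, page, x, y):
    """Entry on `page` whose rectangle contains (x, y), or None."""
    for r in table:
        if (r["page"] == page
                and r["x0"] <= x <= r["x1"] and r["y0"] <= y <= r["y1"]):
            return r
    return None


table = [
    {"page": 1, "x0": 0, "y0": 0, "x1": 10, "y1": 10, "mf": [0.1]},
    {"page": 1, "x0": 20, "y0": 0, "x1": 30, "y1": 10, "mf": [0.9]},
]
near = lookup_by_center(table, 1, 12, 5)
hit = lookup_by_corners(table, 1, 22, 5)
miss = lookup_by_corners(table, 1, 12, 5)
```

A click that falls between two objects still resolves to the nearest one with the centre-based lookup, while the corner-based lookup reports no match.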
(5) Approximate object query 123
For a given sample member object, the set of visually similar objects is looked up in the feature space index by the nearest-neighbour rule. Specifically, let v be the morphological feature vector obtained in 122 and r the retrieval precision control parameter read in 501; the set of global position features (GLF) of the similar objects is then obtained by steps A and B below.
A. Setting the range boundaries from parameter r.
For each dimension of the feature space, let its range of variation be W. First set the width of the search range
$$w = \begin{cases} \varepsilon, & r = 0 \\ W \times r / s, & 0 < r \le s \end{cases}$$
where ε is a very small number, generally 0.0001, corresponding to strict search, and s is the maximum value of r. With the embodiment described in the step of reading the precision control parameter, s = 10.
The system then adjusts the position of the search range automatically: on this dimension of the feature space it obtains an interval of width w that contains the query point x and lies within W, with x as close to the midpoint of the interval as possible. Denote the boundaries of the interval by $a_i$ and $b_i$.
The search range is set with the SetRange method of HnRect in the SR-tree package, that is, for dimension i:
rect.SetRange(a_i, HnRange::INCLUSIVE, b_i, HnRange::INCLUSIVE, i)
where HnRange::INCLUSIVE is a constant defined in the package.
B. Range search.
Using the search range set in A, the global position features GLF of the similar objects are returned one by one from the feature space index 113 to form the similar-object set of this sample member. The algorithm is:
i) Call the GetFirst method of the HnSRTreeFile object File, which returns the GLF of the first approximate object;
ii) Add this GLF to the result set;
iii) Call the GetNext method of the HnSRTreeFile object File repeatedly, each call returning the GLF of the next approximate object, and add this GLF to the result set, until the Key.isValid() test on the returned parameter is false.
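The range computation of step A can be illustrated for a single dimension with the Python sketch below. The function names are hypothetical, and the rule used to keep the interval inside W (shift it until it fits) is one plausible reading of "x as close to the midpoint as possible".

```python
def search_width(r, s=10, span=1.0, eps=0.0001):
    """Width w of the search range for precision level r (0 <= r <= s)."""
    return eps if r == 0 else span * r / s


def search_interval(x, w, lo, hi):
    """Interval [a, b] of width w inside [lo, hi], centred on x where possible."""
    w = min(w, hi - lo)               # the interval cannot exceed the full range
    a = x - w / 2                     # ideal left edge, centred on x
    a = max(lo, min(a, hi - w))       # shift it back inside [lo, hi]
    return a, a + w


strict = search_width(0)
loose = search_width(5)
edge = search_interval(0.1, loose, 0.0, 1.0)
middle = search_interval(0.5, loose, 0.0, 1.0)
```

Near the edge of the range the interval is shifted rather than shrunk, so the search width stays the same for every query point.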
(6) Processing the lookup results 123
For all member objects of the retrieval sample, the GLFs of their approximate-object sets are gathered into a cluster and passed to the constraint verification module 124.
(7) Verifying the constraint 124
The constraint is the relative order of the object elements that the retriever marked in 121. The verification proceeds as follows:
A. Let the retrieval sample contain M member objects, denoted in order $e_1, e_2, \ldots, e_M$, and denote the M GLF sets of the cluster obtained from 506 by $L_1, L_2, \ldots, L_M$.
B. Take $L_1$ as L, and for subscript i running from 2 to M in steps of 1, carry out C.
C. For each element e of L, let its GLF be j; if $L_i$ contains no object whose GLF is j + i - 1, delete e from L.
D. The elements remaining in L when the loop ends are the list of first elements of the retrieval results.
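Steps A to D amount to a filter over the candidates for the first sample member, which the following Python sketch makes explicit. Representing each $L_i$ as a set of integer GLFs, and the function name `verify_order`, are assumptions of the sketch.

```python
def verify_order(clusters):
    """Keep the GLFs j in L_1 for which j + i - 1 occurs in L_i, i = 2 .. M.

    `clusters[i - 1]` is L_i, the set of GLFs similar to the i-th member
    of the retrieval sample.
    """
    survivors = set(clusters[0])
    for i in range(2, len(clusters) + 1):
        survivors = {j for j in survivors if j + i - 1 in clusters[i - 1]}
    return sorted(survivors)


# Three-character sample: only position 3 is followed by matches at 4 and 5.
heads = verify_order([{3, 10, 20}, {4, 11, 30}, {5, 40}])
```

Each surviving GLF marks the start of M consecutive objects that match the sample in order.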
(8) Marking the retrieval results on the page image 508
The GLFs of the first elements are taken one by one from the retrieval results of 127 and used as keys to look up feature table 112, which determines the page number and in-page coordinates of each first element. An additional mark such as a red dot is overlaid on the page image to indicate the M consecutive objects starting there. If fewer than M objects remain on the page from that offset, the remaining objects are marked from the top of the next page.
(9) Page image display/browsing 125
Jump buttons such as first mark, previous mark, next mark and last mark are provided and, together with the usual first-page, previous-page, next-page and last-page navigation buttons, let the retriever examine the retrieval results and their context.

Claims (2)

1. A method for processing and reusing ancient Chinese books by means of computer hardware and software, the method converting ancient Chinese books, whose main characteristic is brush-handwritten Chinese characters, into an indefinitely reusable medium, and separating and extracting directly on that medium those resources that satisfy objective conditions; characterized in that the method consists of a feature space organization (100) process, performed once, and an ancient-book content retrieval (120) process, which can be reused many times;
the feature space organization (100) process produces a low-cost, reusable content medium, namely the digitized ancient book library (110), and consists of the following steps:
generating page images by scanning and preprocessing (101) and storing them in the page image library (111), while passing the generated skeletons to the subsequent feature extraction (102) so that the objects in a page image are decomposed into an ordered set of independent images;
separating, by feature extraction (102), the ordered set of objects into page features, object global position features and morphological feature vectors, and saving these features in the feature table (112);
organizing, by feature indexing (103), the global position features and morphological feature vectors and saving them in the data structure feature space index (113); and
clustering the morphological feature vectors by means of the data structure feature space index (113) and excluding character and symbol images unrelated to the retrieval point;
the feature space organization process produces the reusable digitized ancient book library (110), comprising the page image library (111), the feature table (112) and the feature space index (113);
the content retrieval (120) process separates and extracts resources directly on the digitized medium of the ancient books and consists of the following steps:
determining, by demarcating the retrieval sample (121), the page coordinates of the objects and the order of the coordinate sequence so as to form the retrieval sample, and passing the order of the coordinate sequence as the constraint condition to constraint verification (124);
identifying, by feature acquisition (122), the specific objects of the page image from the feature table (112) with the page coordinate sequence as the condition, so as to obtain the morphological feature vector corresponding to each object;
matching, by approximate query (123), the nearest-neighbor elements in the feature space index (113) with the morphological feature vector as the reference point so as to form the set of objects similar to the reference point, and passing the corresponding similar-object sets, assembled into a cluster of global position features, to constraint verification (124);
checking, by constraint verification (124), the valid combinations of cluster elements against said constraint condition so as to form the retrieval results; and
presenting the retrieval results on the searcher's client screen (206b) by displaying/browsing the retrieval results (125).

2. A computer hardware and software system implementing the method of claim 1, the system consisting of a configuration of known computer hardware devices together with the component modules of a computer application software system that implements the method flow of claim 1, characterized in that: the configuration of known computer hardware devices comprises a server (200a) consisting of a central processing unit (202a), random access memory (203a), hard disk (204a), keyboard (205a), display (206a), network connection device (207a), scanner (208) and pointing device (209a), and a client (200b) consisting of a central processing unit (202b), main memory (203b), hard disk or read-only memory (204b), keyboard (205b), display (206b), network connection device (207b) and pointing device (209b); the components of the computer application software system comprise the following modules, combined with the configuration of computer hardware devices:
a scanning and preprocessing module (101), which uses the computing capability of the central processing unit (202a), the storage capability of the random access memory (203a), the display capability of the display (206a) and the image acquisition capability of the scanner (208) to generate page images and store them in the page image library (111), the page image library residing on the storage medium of the hard disk (204a), and which at the same time generates skeleton objects;
a feature extraction module (102), which uses the computing capability of the central processing unit (202a), the storage capability of the random access memory (203a), the display capability of the display (206a) and the positioning capability of the pointing device (209a), is connected to the scanning and preprocessing module through the skeleton objects, and receives and decomposes the objects in the page image so as to separate the ordered set of objects into page features, object global position features and morphological feature sequences, which are saved in the feature table (112); the relevant commands and parameters for this process are entered via the keyboard (205a), and the feature table resides on the storage medium of the hard disk (204a);
a feature indexing module (103), which uses the computing capability of the central processing unit (202a) and the storage capability of the random access memory (203a) to organize the global position features and morphological feature vectors extracted by the feature extraction module, the feature vectors residing on the storage medium of the hard disk (204a);
a data structure feature space index module (113), which saves the organized global position features and morphological feature vectors on the storage medium of the hard disk (204a);
the hard disk (204a) of the server (200a) permanently stores the computer operating system and the digitized ancient book library (110);
a retrieval sample demarcation module (121), which uses the computing capability of the central processing unit (202b), the storage capability of the main memory (203b), the display capability of the display (206b) and the positioning capability of the pointing device (209b) to determine the page coordinates of the objects and the order of the coordinate sequence so as to form the searcher's retrieval sample;
a feature acquisition module (122), which uses the computing capability of the central processing unit (202b) and the storage capability of the main memory (203b) to identify the specific objects in the page image from the feature table (112) with the page coordinate sequence as the condition, so as to obtain the feature vector corresponding to each object;
an approximate query module (123), which uses the computing capability of the central processing units (202a, 202b) and the storage capability of the random access memories (203a, 203b) to match, with the reference point as the clue, the set of similar objects in the feature space index (113) on the storage medium of the hard disk (204a), and combines the similar-object sets of all objects in the retrieval sample into a cluster of global position features;
a constraint verification module (124), which uses the computing capability of the central processing units (202a, 202b) and the storage capability of the random access memories (203a, 203b), and receives the order of the coordinate sequence as the constraint condition together with the cluster of global position features, so as to check the valid combinations of cluster elements and form the retrieval results; and
a retrieval result display/browsing module (125), which uses the computing capability of the central processing unit (202b), the storage capability of the random access memory (203b) and the display capability of the display (206b) to present the retrieval results on the searcher's client display screen (206b);
the network connection devices (207a, 207b) are used for communication between the server (200a) and the client (200b).
CN 00119542 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search Expired - Fee Related CN1269060C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 00119542 CN1269060C (en) 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 00119542 CN1269060C (en) 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search

Publications (2)

Publication Number Publication Date
CN1336604A (en) 2002-02-20
CN1269060C (en) 2006-08-09

Family

ID=4587786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 00119542 Expired - Fee Related CN1269060C (en) 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search

Country Status (1)

Country Link
CN (1) CN1269060C (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100664311B1 (en) * 2005-11-18 2007-01-04 삼성전자주식회사 Image forming apparatus capable of automatic index generation and automatic index generation method
CN101393643B (en) * 2007-09-21 2012-01-18 华东师范大学 Computer stroke deforming system and method
CN102253990A (en) * 2011-07-05 2011-11-23 广东星海数字家庭产业技术研究院有限公司 Interactive application multimedia data query method and device
KR20150006740A (en) * 2013-07-09 2015-01-19 류중하 Method for Composing Mark Image Corresponding Letter, and Method for Analyzing Mark Image About Corresponding Letter
CN105183744A (en) * 2015-06-29 2015-12-23 努比亚技术有限公司 Method and device for carrying out paper book keyword retrieval by mobile phone
US9898452B2 (en) 2015-10-16 2018-02-20 International Business Machines Corporation Annotation data generation and overlay for enhancing readability on electronic book image stream service
CN106502974A (en) * 2016-10-17 2017-03-15 王忠义 A kind of inscriptions on bones or tortoise shells carves diction Electronic record template construction method
CN106503247A (en) * 2016-11-09 2017-03-15 天津赛因哲信息技术有限公司 Ancient book document management system and method based on knowledge discovery technology
CN108550154A (en) * 2018-04-11 2018-09-18 中国科学院西双版纳热带植物园 A kind of method of accurately measuring karst earth's surface bare rock accounting
TWI734037B (en) * 2018-09-28 2021-07-21 愛探極溫度行銷有限公司 Real estate holding and inheritance calculation management system
CN111666262B (en) * 2020-05-28 2021-06-22 重庆中联信息产业有限责任公司 Working method for extracting feature points of massive medical images in network attached storage NAS state

Also Published As

Publication number Publication date
CN1336604A (en) 2002-02-20

Similar Documents

Publication Publication Date Title
US8520889B2 (en) Automated generation of form definitions from hard-copy forms
CN1248138C (en) Image processing method and image processing system
CN1158627C (en) Method and device for character recognition
JP5095534B2 (en) System and method for generating a junction
JP4577931B2 (en) Document processing system and index information acquisition method
AU2010311067B2 (en) System and method for increasing the accuracy of optical character recognition (OCR)
CN1625741A (en) An electronic filing system searchable by a handwritten search query
CN1542655A (en) Information processing apparatus, method, storage medium and program
CN1900933A (en) Image search system, image search method, and storage medium
DE102011079443A1 (en) Learning weights of typed font fonts in handwriting keyword retrieval
CN1877598A (en) Method for gathering and recording business card information in mobile phone by using image recognition
CN113326797A (en) Method for converting form information extracted from PDF document into structured knowledge
CN1215201A (en) Character Recognition/Correction Method
CN1177407A (en) Method and system for velocity-based head writing recognition
US20040139384A1 (en) Removal of extraneous text from electronic documents
CN1269060C (en) Method and system of digitizing ancient Chinese books and automatizing the content search
JP2004334339A (en) Information processor, information processing method, and storage medium, and program
CN1041773C (en) Character recognition method and apparatus based on 0-1 pattern representation of histogram of character image
CN115830620B (en) Archive text data processing method and system based on OCR
CN1525378A (en) Form definition data generation method and form processing device
CN115774805B (en) File intelligent query method and system based on digital processing
WO2021140682A1 (en) Information processing device, information processing method, and information processing program
CN1549192A (en) Computer Recognition and Automatic Input Method of Handwritten Characters
KR20060007204A (en) Document Image Processing and Verification System and Method for the Digitization of Massive Data
CN112464907A (en) Document processing system and method

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060809

Termination date: 20130801