Embodiment
Low-cost unlimited repeated use medium is the digitizing page-images (hereinafter to be referred as page-images) of ancient books works among the present invention, directly on medium, separate and extract the part resource that satisfies objective condition showing as content retrieval, extract the result for satisfying the partial page image of the condition that preestablishes.
The base conditioning flow process of system of the present invention search method is described referring now to Fig. 1.Should note: two processing units among Fig. 1 demarcate sample retrievals 121 and show/browse result for retrieval 125 as program file separately or global storage in the hard disk 204b of Fig. 2; The processing unit that all the other each block schemes are represented as data file or program file separately or global storage in the hard disk 204a of Fig. 2.
Search method among the present invention and technology are made of in succession the processing stage 120 two of feature space tissue 100 and ancient books content retrievals, and the digitizing ancient books storehouse 110 that the former produces provides the basis for the latter.Feature space is organized disposable finishing of stages 100, and the ancient books content retrieval stage 120 can repeatedly repeat according to retrieval person's requirement.
Ancient books is through overscanning and pretreatment module 101, produce on the one hand page-images and deposit page-images storehouse 111 in and browse in order to the user, the object in the page-images is passed to the ordered set that follow-up extraction characteristic module 102 is broken down into independent image by skeleton on the other hand.The page-images that deposits in the storehouse 111 can be original scanning result (as coloured image or a gray level image), keeps original visual image of ancient books and style; Also can be through the picture rich in detail after the pre-service processing, obtain readable preferably.Object ordered set is extracted characteristic module 102 again and is separated into and is converted to three category features: the global position feature of page feature, object and morphological feature sequence.These features are kept in the mark sheet 112.Global position feature that module 102 is extracted and morphological feature vector are by higher dimensional space index characteristic module 103 favorable tissue and being stored in the data structure feature space index module 113 in addition.Except the visual similarity cluster to the mathematical expression of proper vector object, another function of feature space index structure 113 is exactly in time to get rid of and the dissimilar literal/glyph image of retrieval point, the object that the vision of acceleration search query point is similar.This is the basis that the ancient books content retrieval is realized high speed.
The content retrieval stage 120 is adopted the working method of inquiry by example.Demarcation sample retrieval module 121 supports that retrieval persons at any time, at random demarcate object on the page-images of being browsed, the page coordinates when record client indicating equipment 209b clicks page-images and the order of this coordinate sequence, formation retrieval person's sample retrieval.The order of coordinate sequence is passed to checking constraint condition module 124 as constraint condition.The page coordinates sequence itself is acquired characteristic module 122 and is used as condition concrete object in definite page-images from mark sheet 112, obtains and the corresponding proper vector of object.The approximate object module 123 of inquiry is a reference point with the proper vector that obtains, and the searching arest neighbors element in feature space index 113 constitutes the analogical object set of reference point.This module 123 will be combined into the global position feature set with the set of all objects are corresponding in the sample retrieval analogical object simultaneously and bunch give checking constraint condition module 124.By the effective combination of module 124, form result for retrieval according to the constraint condition inspection set bunch element that obtains.These results are by showing/browse that result for retrieval module 125 is apparent on retrieval person's the client screen 206b in eye-catching mode.Browse and observe its context for the user.
Referring now to Fig. 2, among the figure illustration in order to implement system hardware structure of the present invention.They are server 200a and the client computer 200b that are connected in network 210.Server 200a is used for the transmission of storage, maintenance, management, retrieval and the result for retrieval of data and page-images.Its hardware system is the universal computer architecture that is linked together by bus 201a, comprises the CPU (central processing unit) 202a with computing and control input/output function, the random access memory 203a of save routine and computing intermediate data, the permanent storage computer operating system, retrieve application software, page-images, the hard disk 204a of contents such as feature space index file, in order to key in order and the keyboard 205a of parameter and the display 206a of display command feedback result, network access device 207a, digitized scanner 208 of the ancient books page and function selecting and auxiliary positioning equipment are indicating equipment 209a; Client computer 200b be responsible for man-machine interface operation, send the demand of inquiring and browsing and display navigation Query Result.Its hardware system is the universal computer architecture that is linked together by bus 201b, comprising the CPU (central processing unit) 202b with computing and control input/output function, the primary memory 203b of save routine and computing intermediate data, the permanent storage computer operating system, the hard disk 204b (or ROM (read-only memory) 204b) of contents such as retrieve application software, in order to key in the keyboard 205b of order and parameter, the display 206b of display page image and order feedback result, network access device 207b, help designated display 206b to go up the indicating equipment of screen position (as Genius mouse, writing pencil) 209b; Server and client computer connect via network 210 by network access device 207a, 207b, exchange information.
As the another kind of special case of above-mentioned embodiment, network 210 can be wide area network (WAN is as Internet).In the system architecture that is known as the browser/server pattern, HTTP (HTML (Hypertext Markup Language)) agreement is followed in the communication between client computer 200a and the server 200b.Client computer 200b specifies certain Web page or leaf by uniform resource locator (URL) address of given server 200a, help retrieval person to prepare retrieval/browse request then, transmission is asked to server 200a, and accepts page-images and relevant information (as the JAVA applet) that server 200a transmits; Server 200a deposits the hypermedia file with HTML (HTML (Hypertext Markup Language)) language compilation, it has a HTTP finger daemon, the request of its subscribing client 200b proposition is also made response, when this process receives a request, just create a new subprocess and be this request service, finish validity checking, handle and make data at the request of client computer, comprise and use CGI (CGI (Common Gateway Interface)) program that data are carried out early stage and post-processed, then, page-images of handling well etc. is sent to the client computer 200b that files a request.
As another special case of above-mentioned embodiment, network 210 can be a Local Area Network.
As another special case again of above-mentioned embodiment, server 200a and client computer 200b can be same machines, do not have network 210, network access device 207a, 207b this moment, adopt the loopback adapter; Bus is that 201a, CPU (central processing unit) are that 202a, random access memory are that 203a, hard disk are that 204a, keyboard are that 205a, display are that 206b, scanner are 208, indicating equipment is 209a.
Client computer in another embodiment can adopt mobile computing device (as notebook, PDA etc.).
The operating system of server can be that the various realization versions of Windows95/Windows98 (Microsoft trade mark), MacOS (Apple trade mark), Unix are as (AIX of IBM or free software Linux), do not require multiwindow and figure man-machine interface, but should support the HTTP access protocal; Client computer can adopt above-mentioned any operating system, but requires multiwindow and figure man-machine interface simultaneously, and supports the HTTP access protocal; When the embodiment that adopts client/server on a computing machine, operating system is got the configuration of client-side; When client computer was handheld devices such as PDA, the operating system of this handheld device or its equivalent should be supported the HTTP access protocal.
Further specify the flow process characteristics of search method of the present invention and the technology that is adopted below.
The computing machine ancient books content search method of visual similarity of the present invention is formed by a series of technical unit organic assembling.Each technical unit can adopt the technique known scheme to realize, also can realize with the technical scheme that the present invention proposes, to exchange higher execution efficient for.Make up these technical units and form the new technical scheme of a cover, solution with writing brush personal letter Chinese character for this scarce resource of Chinese ancient book of its principal character is converted to can the low-cost unlimited medium of reusing, and directly on this medium, separate and the technical matters of extracting the part resource that satisfies objective condition is main contents of the present invention.Fig. 3 is the overview flow chart of search method, and Fig. 4, Fig. 5 are the detail flowcharts of Fig. 3.Fig. 6 uses the symbolic significance explanation in the process flow diagram.
As previously mentioned, search method by 120 two of feature space tissue 100 and ancient books content retrievals in succession treatment scheme constitute.Feature space organization flow 100 is finished by ancient books information services provision merchant is disposable in advance.It generates the result, and promptly the digitizing ancient books storehouse 110 among Fig. 1 is kept in the hard disk or CD 204a of server end among Fig. 2.Ancient books content retrieval flow process 120 can repeatedly repeat according to retrieval person's requirement, and it utilizes the digitizing ancient books storehouse of storing among hard disk or the CD 204a.Two flow processs 100 and 120 needn't be continuous in time, only requires to guarantee that the order that provides as Fig. 3 gets final product.
Now further specify feature space and organize the stage in conjunction with Fig. 4.The purpose of feature space tissue is that the content (object and sequence relation thereof) in the ancient books generates its feature clustering as previously mentioned, sets up the index structure that is easy to search fast according to visual similarity approximate object.Feature space organizes the basic step in stage as follows:
1. scan ancient books page 101a
Scan ancient books by visible light or other light sources page by page according to ancient books page number numbering, obtain its digitizing colour or gray scale image.To intact ancient books, can adopt ordinary flat formula scanner, for the ancient books that is damaged by fire damage or other reasons, available far infrared or other light sources irradiation manifest the literal of being covered.
2. pre-service 101b
For outstanding ancient books content, overcome scanning errors, separate foreground object and ground unrest, the acquisition object, before formal structural attitude spatial index 113, carry out the pre-service work such as graduation, object refinement of space of a whole page slant correction, noise removing, binaryzation and row/object.The function that the preprocessing means of Chinese optical character identification (OCR) technology of available standards or pool image are handled needs a spot of manual intervention to realize in case of necessity.Below provide some embodiment.
(1) color and gray scale are handled
The digitizing ancient books page-images that is obtained by scanning step 101a can be colour or gray scale.The purpose of doing like this is in order to keep the original appearance of ancient books to greatest extent, to be convenient to the user and to view and admire.Be the processing needs of subsequent step, the page-images of confession extraction feature should be converted to black and white, promptly so-called bianry image or bitmap.The page-images of viewing and admiring for the user still can keep original color or gray scale.
Coloured image generally is expressed as RGB (RGB) or other color spaces, as the point set of YIQ (brightness, colourity, saturation degree).From the angle of compression of images, adopt the situation of scheme of non-rgb color space more general.Because these schemes concentrate on the principal character of image on some coordinate axis in the space, the gray level image on this is handled, can embody image aspects substantially.In Chinese ancient book content retrieval field, adopt such scheme to change coloured image into form that gray level image still can keep literal/symbol object.
A kind of specific embodiments is that coloured image is decomposed into Y, I, three components of Q, again Y component is wherein given over to further processing as gray level image.The Y component has comprised the main information of original image.Transformational relation between YIQ and RGB is:
Gray level image becomes bitonal bitmap through binaryzation.The key of binaryzation is to determine appropriate threshold.A kind of system of selection is to determine global threshold according to grey level histogram.If number of grayscale levels is G, the pixel of image adds up to n, and (number of picture elements of 1≤k≤G) is n to k level gray scale
k, statistical picture is in the gray level (frequency of occurrences of k1≤k≤G) locate
k=1,2,...,G
And be ordinate with p (k), k is the horizontal ordinate mapping, obtains the grey level histogram of image.The grey level histogram of Chinese ancient book generally is bimodal, and two spikes have been represented prospect and background pixels respectively.Gray threshold can be taken at the trough place between bimodal, for example value 1≤g≤G.According to gray threshold g with gray level image IMG
gChange bitonal bitmap IMG into
b:
Wherein, R, C are respectively the number of lines and columns of image pixel matrix.
For the grey level histogram of multimodal, can adopt the local threshold binarization method.
(2) space of a whole page is proofreaied and correct
Deflection can take place because the ancient books original copy puts the inaccurate of angle in the page-images that scanning obtains, and influences subsequent treatment.In most cases, the angle of deflection is not too large.If departing from the scope of normal position (as vertical) is [A ,+A].With a is increment, from-A rotation bitonal bitmap, calculate projected density by following method, until+A.The bitonal bitmap that records maximal projection density is as correction chart.
With reference to Fig. 7, at first,, obtain the horizontal distribution (the latter half of Fig. 7) of display foreground pixel with a certain postrotational bitmap (the first half of Fig. 7) projection longitudinally.Make that projection width is W, then the average line height
On the average line of horizontal distribution, calculate projected density
In the following formula, n
kBe to be higher than counting of h, W in k the continuous segment on the average line
kBe these projection widths on average line.Select the projection of projection on the average line rather than all horizontal distribution to help to reduce the page-images influence on horizontal line and border up and down.
(3) eliminate noise
Use smoothing technique to eliminate residual isolated point in the bitonal bitmap, level and smooth stroke edge.Smoothing process is the application of low-pass filtering in the image processing techniques.
A kind of simple embodiment is 3 * 3 grids decision pixel x that adopts as shown in Figure 8
0Value.If represent that with x pixel x value is 1 (foreground), represent that with~x pixel x value is 0 (background colour), then pixel x
0Result after level and smooth is:
x
0′=~x
0[x
3x
7(x
1+x
5)+x
1x
5(x
3+x
7)]+
x
0~[(x
3+x
7)
(~(x
4+x
5+x
6)+~(x
1+x
2+x
8)+~(x
1+x
5)+(~(x
6+x
7+x
8)+~(x
2+x
3+x
4))]
(4) Object Segmentation
The row of Chinese character OCR, character segmentation technology can be directly used in Object Segmentation.It below is another comparatively simple Object Segmentation method.It is divided into apportion, participle and three subsequent steps of adjustment.As previously mentioned, bitonal bitmap IMG
bWidth be C, highly be R, (i, the pixel of j) locating is designated as IMG at coordinate
b(i, j), IMG
b(i, j)=1 this point of expression is foreground.
A. apportion
Make j list sum of all pixels being
C
jResult after the horizontal distribution figure that constitutes is smooth is designated as
(j=0,...,C-μ).
Wherein, μ is smooth step-length.S
jMaximal value, minimum value and both difference do not remember with:
M=max{S
j},m=min{S
j},D=max-min
Order again: Th=m+ α D, wherein the α threshold parameter generally gets 0.1 or 0.2.Obtain S
jThe j value j of=Th
0, j
1..., j
2n-1' these values organize in twos in regular turn right, that is: p
k=(j
2k+ j
2k+1)/2,0<k<n can obtain the column split line sequence p of the page
kShown in the dotted line among Fig. 9 a.For extracting easy-to-handle object row, before participle, also should get rid of vertical line.Concrete grammar is: calculate average col width δ=(p
N-1-P
0)/(n-1) is if two adjacent column split line (p
kAnd p
K+1) spacing is less than 0.1 δ, thinks that then between this two adjacent column split line be the column split vertical line of ancient books, fill out between with them into background colour and with (p
k+ p
K+1)/2 substitute these two column split lines.Fig. 9 a obtains white stick separated among Fig. 9 b through after getting rid of vertical line.
B. participle
The object that obtains row are considered as the parent page image, the transposing steps A. in the row, column mark.Can obtain the basic division of each object.Concrete outcome is seen Figure 10 a.
C. adjust
Automatically there is a spot of erroneous judgement result sometimes in cut zone, and cutting techniques should provide image feedback, for treatment people manual setting cut zone.This is the indicating equipment 209a selection deletion/increase function with server end among Fig. 1, clicks corresponding object or position then.For example, obtain correct Object Segmentation behind the useless cut-off rule that deletion Figure 10 a top is caused by former ancient books outer rim, shown in Figure 10 b.One to cut apart the object diagram of finishing for example shown in Figure 11.
(5) refinement
The bitonal bitmap of object is converted into the skeleton image that live width is single pixel, to reduce because of of the influence of stroke width difference to feature extraction.Thinning algorithm is as follows:
i.I”=IMG
b;
ii.Do
a.I=I”;
B. all pixels in the scans I form new bitmap I '.To pixel x among the I
0, investigate its neighborhood as shown in Figure 8, if C
1Set up, then the relevant position puts 1 among the I ';
C. scans I ' in all pixels, form new bitmap I ".To pixel x among the I '
n, investigate its neighborhood as shown in Figure 8, if G
2Set up, then I " in the relevant position put 1;
Until I=I”;
Iii. return I ".
C
1=x
0~x
1~x
2~x
3x
4x
5x
6~x
7~x
8+x
0~x
1~x
2x
3~x
4x
5~x
6~x
7~x
8+x
0~x
1~x
2x
3x
4x
5~x
6~x
7~x
8+
x
0~x
1~x
2x
3~x
4x
5x
6~x
7~x
8+x
0~x
1~x
2x
3x
4x
5x
6~x
7~x
8+x
0~x
1~x
2~x
3~x
4x
5~x
6x
7~x
8+
x
0~x
1~x
2~x
3x
4x
5~x
6x
7~x
8+x
0~x
1~x
2~x
3~x
4x
5x
6x
7~x
8+x
0~x
1~x
2~x
3x
4x
5x
6x
7~x
8+
x
0~x
1~x
2x
3~x
4x
5~x
6x
7~x
8+x
0~x
1~x
2x
3x
4x
5~x
6x
7~x
8+x
0~x
1~x
2x
3~x
4x
5x
6x
7~x
8+
x
0~x
1~x
2x
3x
4x
5x
6x
7~x
8+x
0~x
1x
2x
3x
4~x
5~x
6~x
7~x
8+x
0~x
1x
2x
3~x
4x
5~x
6~x
7~x
8+
x
0~x
1x
2x
3x
4x
5~x
6~x
7~x
8+x
0~x
1x
2x
3~x
4x
5x
6~xl~x
8+x
0~x
1x
2x
3x
4x
5x
6~x
7~x
8+
x
0~x
1xlx
3~x
4x
5~x
6x
7~x
8+x
0~x
1x
2x
3x
4x
5~x
6x
7~x
8+x
0~x
1x
2x
3~x
4x
5x
6x
7~x
8+
x
0~x
1x
2x
3x
4x
5x
6x
7~x
8+x
0~x
1~x
2~x
3~x
4x
5~x
6x
7x
8+x
0~x
1~x
2~x
3x
4x
5~x
6x
7x
8+
x
0~x
1~x
2~x
3~x
4~x
5x
6x
7x
8+x
0~x
1~x
2~x
3~x
4x
5x
6x
7x
8+x
0~x
1~x
2~x
3x
4x
5x
6x
7x
8+
x
0~x
1~x
2x
3~x
4x
5~x
6x
7x
8+x
0~x
1~x
2x
3x
4x
5~x
6X
7x
8+x
0~x
1~x
2x
3~x
4x
5x
6x
7x
8+
x
0~x
1~x
2x
3x
4x
5x
6x
7x
8+x
0x
1~x
2~x
3~x
4~x
5~x
6x
7x
8+x
0x
1~x
2~x
3~x
4x
5~x
6x
7x
8+
x
0x
1~x
2~x
3x
4x
5~x
6x
7x
8+x
0x
1~x
2~x
3~x
4~x
5x
6x
7x
8+x
0x
1~x
2~x
3~x
4x
5x
6x
7x
8+
x
0x
1~x
2~x
3x
4x
5x
6x
7x
8
Bitmap when algorithm finishes is the skeleton image after the refinement.Condition in the algorithm
C
2=x
0~x
1~x
2x
3x
4x
5~x
6~x
7~x
8+x
0~x
1x
2x
3x
4~x
5~x
6~x
7~x
8+x
0~x
1x
2x
3x
4x
5~x
6~x
7~x
8+
x
0x
1~x
2x
3~x
4~x
5~x
6~x
7~x
8~x
0x
1~x
2x
3x
4~x
5~x
6~x
7~x
8+x
0x
1~x
2x
3x
4x
5~x
6~x
7~x
8+
x
0x
1~x
2~x
3~x
4~x
5~x
6x
7~x
8+x
0x
1~x
2~x
3~x
4~x
5x
6x
7~x
8+x
0x
1~x
2x
3~x
4~x
5~x
6x
7~x
8+
x
0x
1~x
2x
3x
4~x
5~x
6x
7~x
8+x
0x
1~x
2x
3~x
4~x
5x
6x
7~x
8+x
0x
1x
2x
3~x
4~x
5~x
6~x
7~x
8+
x
0x
1x
2x
3x
4~x
5~x
6~x
7~x
8+x
0x
1x
2x
3x
4x
5~x
6~x
7~x
8+x
0x
1x
2~x
3~x
4~x
5~x
6x
7~x
8+
x
0x
1x
2~x
3~x
4~x
5x
6x
7~x
8+x
0x
1x
2x
3~x
4~x
5~x
6x
7~x
8+x
0x
1x
2x
3x
4~x
5~x
6x
7~x
8+
x
0x
1x
2x
3~x
4~x
5x
6x
7~x
8+x
0~x
1~x
2~x
3~x
4~x
5x
6x
7x
8+x
0x
1~x
2x
3~x
4~x
5~x
6~x
7x
8+
x
0x
1~x
2x
3x
4~x
5~x
6~x
7x
8+x
0x
1~x
2x
3x
4x
5~x
6~x
7x
8+x
0x
1~x
2~x
3~x
4~x
5~x
6x
7x
8+
x
0x
1~x
2~x
3~x
4~x
5x
6x
7x
8+x
0x
1~x
2x
3~x
4~x
5~x
6x
7x
8+x
0x
1~x
2x
3x
4~x
5~x
6x
7x
8+
x
0x
1~x
2x
3~x
4~x
5x
6x
7x
8+x
0x
1x
2~x
3~x
4~x
5~x
6~x
7x
8+x
0x
1x
2x
3~x
4~x
5~x
6~x
7x
8+
x
0x
1x
2x
3x
4~x
5~x
6~x
7x
8+x
0x
1x
2x
3x
4x
5~x
6~x
7x
8+x
0x
1x
2~x
3~x
4~x
5~x
6x
7x
8+
x
0x
1x
2~x
3~x
4~x
5x
6x
7x
8+x
0x
1x
2x
3~x
4~x
5~x
6x
7x
8+x
0x
1x
2x
3x
4~x
5~x
6x
7x
8+
x
0x
1x
2x
3~x
4~x
5x
6x
7x
8
(6) normalization
For eliminating the influence of handwritten form object size and change in location, the skeleton image of each object of standardizing.For example, Figure 13 is the normalization bitmap of the skeleton image of Figure 12, and housing is represented the border of new bitmap.
Standardized method is to select the maximal value of the height of skeleton image and width as monolateral length, makes a square bitmap.Then skeleton image is placed this square bitmap center.Deserving to be called and stating square is MBS (Minimal BoundingSquare).Compare with the conventional standardized method that uses boundary rectangle MBB (Minimal Bounding Box), the standardized method has here kept the ratio of width to height of object.Be difficult for causing elongated objects deviation when feature extraction, to occur.
3. feature extraction 102
This method is at single ancient books definition and extract three class essential characteristics, that is: the global position feature and the morphological feature of page feature, object.If the multireel ancient books that same people transcribes is combined processing, only need to add the books sign.Above-mentioned feature description the ancient books content.
In module 102, each object is separated from page-images, and each object has all possessed geometric coordinate and range of size in the clear and definite page.Following mask body defines described three class essential characteristic and extracting method thereof.
The global position feature (GLF) that defines 1 object is the linear order numbering of this object in the page of an ancient books.
As long as can guarantee object is that 1-1 is corresponding with its global position feature, the linear order in the definition can be taked arbitrary form.For example, global position Feature Extraction method can according to ancient books transcribe custom (page number from small to large, in the page or leaf from right to left, each row from top to bottom), obtain the global position feature of each object of obtaining by the scanner uni pretreatment module.For complicated space of a whole page layout, global position Feature Extraction method can be utilized recurrence curve such as Hilbert or Piano curved scanning layout area earlier, and each intra-zone is handled in the usual way again then.
The page feature (PF) that defines 2 ancient books is made of the geometric coordinate of each object in the page number and the page.
Page feature description by the geometric layout of object in page relation.
The morphological feature of object has been portrayed the vision semanteme of object.And then, remove outside the polyphone, a Chinese character write unique linguistics semanteme that has determined this word.In other words, by comparison, can realize the approximate match of literal, semiotics semanteme to Chinese character-type.Research of Chinese Feature Extraction technology among any Chinese OCR all can be used as object extracting of morphological method.
Yet,, exist a lot of variable factors to influence the extraction of Hanzi component and formation stroke thereof being in the Chinese ancient book of feature with writing brush personal letter Chinese character.For example, stroke weight is inhomogeneous, the part stroke is fuzzy or owe, the relative position skew during repeatedly the occurring of same literal between stroke/parts, stroke inclination angle/relative length variation etc., all can influence the coupling of object on the vision meaning.Need the stronger Feature Extraction Technology of exploitation fault-tolerant ability.Notice " the fixing standardized of block character parts position and ratio is the crystallization of Chinese character calligraphy art for a long time " this fact, below provide a kind of morphological feature of in multistage barycenter graduation zone, adding up stroke factor aggregate-value and describe and extractive technique.It has stronger fault-tolerant ability to the above-mentioned changing factor that exists in the Chinese ancient book.
The morphological feature (MF) that defines 3 objects is the aggregate-value of its image stroke factor component in multistage barycenter graduation zone.
The extracting of morphological method is as follows:
At first, according to the center of gravity of object its MBS is done the multilayer graduation.Each regional graduation point is decided to be the center of gravity of object foreground point in this zone (the stain collection in the accompanying drawing).Further graduation recurrence on the basis of shallow one deck is carried out.The concrete mode of one, two layer of graduation of Figure 13 as shown in figure 14.
Then, add up stroke factor in each zone, classification accumulative total back forms proper vector.So-called stroke factor is meant to constitute fundamental element horizontal, vertical, that cast aside, press down four kinds of strokes that its dot matrix is arranged as shown in figure 15.With respect to complete stroke, phenomenons such as the feature formation based on the stroke factor is inhomogeneous to soft stroke handwriting Chinese character stroke, stroke fuzzy, inclination angle/relative length lacks rule all have stronger fault-tolerant ability, also are convenient to the unified of non-legible symbol object in the ancient books handled.It is easy to extract the stroke factor scheme from the bitmap of object, has multiple embodiments.For example, be structural element (Structure Elements) with four kinds of stroke factors respectively, the applied mathematics morphological method is done corrosion (Erosion) computing to the foreground point (stain among the figure in the square frame) of Figure 13, obtains four kinds of stroke factor distributions in square frame.The extracting method of stroke factor is acted among Figure 14, and the stroke that can obtain in the graduation zone distributes, and the pixel count with all foreground points in the zone removes it again, obtains the distribution density of stroke factor in each zone.Notice in the Chinese character that the occurrence frequency of stroke is much higher than to cast aside anyhow and press down stroke, simultaneously for reducing the dimension of feature space, improve index and effectiveness of retrieval,, can decompose one deck less in the promptly regional graduation casting aside the statistics shallow level of stroke more anyhow of pressing down the stroke factor.A kind of concrete mode is that horizontal, vertical stroke factor is all used two layer region graduation, casts aside, presses down the stroke factor and all use one deck graduation.Among Figure 17 illustration in the double-layer separate partition territory horizontal, vertical stroke distribute and one deck graduation zone in cast aside, press down the stroke distribution.Adopt the zone number rule of Figure 16, the morphological feature vector of all objects has been opened into 16 * 2+4 * 2=40 dimensional feature space in the ancient books.Vector f in the space is calculated by following formula:
In the following formula, p
1(k) and p
2(k) be respectively resembling among the preceding bitmap firsts and seconds zoning k of feature extraction
That vegetarian refreshments number, h (k), s (k), p (k), n (k) are respectively is horizontal, vertical, cast aside, press down the black pixel of stroke factor in bitmap region k counts.
Adopt the object morphological feature of multistage barycenter graduation zone stroke factor aggregate-value, embodied the vision content of handwritten Chinese character preferably, can express literal/symbol with stroke distribution density relatively flexibly.Certain tolerance of definition in feature space (or claiming distance) can form vector space.A kind of tolerance is known Euclidean distance.In the characteristic vector space that forms, the morphological feature vector of object has constituted the coordinate of feature space mid point.Therefore, the unique point of plesiomorphism object has formed cluster naturally, and bigger distance is arranged between the unique point of discrepant Chinese character.
So far, the feature of ancient books has been extracted and has been finished, and the morphological feature and the global position feature of ancient books page feature, object remain to mark sheet 112.Be that mark sheet is made up of as the four-tuple of (geometric coordinate, global position feature, morphological feature in page number, the page or leaf) a plurality of shapes, a plurality of numbers is the object number that scanning pretreatment module 101 is determined.
4. the feature space index 113
In the practical application, the feature space of generation generally has characteristics such as dimension height, unique point quantity are many.Need the design space index structure corresponding with application target, the unique point that rationalization is all exchanges information inquiry fast for less storage overhead.Say on the principle that all space index methods (as R-tree and improve one's methods, X-tree, SR-tree, PK-tree etc.) can both become the embodiment of feature space index structure.Yet the performance of partial index algorithm such as R-tree can sharply descend with the increase of space dimensionality.Provide the optimization embodiment of SR-tree herein.About inner realization and the performance evaluation thereof of SR-tree, see also relevant paper and package description.
A. data structure
Definition of data item E
i=(MF
i, GLF
I)=(f
i, GLF
I).f
iBe the coordinate of feature space mid point i, the morphological feature vector of object i just; GLF
iIt is the global position feature of object i.
B. create the SR-tree
Call function new_HnSRTreeFilePath, Dimension, DataSize, BlockSize, SplitFactor, ReinsertFactor.Generate an empty SR-tree and return it, return data type HnSRTreeFile.
The meaning of the input parameter in calling and value such as following table:
Parameter name | Type | The parameter meaning | Value |
Path Dimension DataSize BlockSize SplitFactor ReinsertFactor | Character string integer integer integer integer integer | Preserve the Data Filename feature space Dimension Characteristics spot correlation attribute GLF byte number data block size of SR-tree, the minimum utilization factor of (byte) database, (percent) insert the factor again, (percent) | Ancient books name .idx 40 2 8192 (system default value) 40 (system default values) 30 (system default value) |
C. insert data item
SR-tree object File according to B. returns calls its method Store (...) with data item E
i=(f
i, GLF
I) insertion SR-tree.Concrete steps are:
HnSRTreeFile File;
File.Store(Point,Data)。
The meaning of parameter wherein and value such as following table:
Parameter name | Type | The parameter meaning | Value |
Point | HnPoint& | The storage address of point coordinate in the feature space | The morphological feature vector f of object |
Data | HnData& | The storage address of feature space mid point attribute | The GLF of this object |
5. treatment scheme control
Ancient books is handled and is adopted recycle design to finish.In a width of cloth page image, each object is implemented 102 to 113 processing, the object in one page whether finish dealing with Fig. 4 105 in judge.If this page or leaf also has other objects, then repeat said process, handle otherwise change time page or leaf.Ancient books whether be converted into fully digitizing ancient books storehouse 110 Fig. 4 106 in judge.
The processing stage of now retrieving 120 in conjunction with Fig. 5 description.Content retrieval must have been finished feature space at the ancient books that is retrieved and carry out after organizing 100 steps.For a cover feature space index structure of being set up, retrieval person can carry out the content retrieval of arbitrary number of times referring to Fig. 3.The purpose of content retrieval is to utilize feature space to organize resulting index structure, obtains all other objects similar to given object vision content fast.The basic step of content retrieval is as follows:
(1) reads precision controlled variable 501
Retrieval person adjusts the retrieval precision controlled variable by man-machine interaction mode.This parameter is only represented notional " strictness " and " loose ", value determine need not any quiet and secluded knowledge.Parameter value generally is divided into multistage, and pairing distance thresholds at different levels can be implemented the people by setting arbitrarily to big monotone increasing add mode by zero by invention.A kind of embodiment is to set 11 grades, and the 0th grade of predetermined distance threshold value is zero, the strict coupling of expression; The 10th grade is the loosest precision controlled condition, and the predetermined distance threshold value is 1; Progressively increase distance threshold by 0.1 increment therebetween.Because content retrieval can repeatedly carry out, retrieval person can dynamically adjust the precision controlled variable with reference to result for retrieval last time, gives new balance to next time recall ratio and precision ratio, satisfies its needs.The approximate hunting zone of Object Query 123 in feature space index 113 of precision controlled variable influence.
(2) open open the beginning browsing pages 502
Retrieval person can page number accesses corresponding page-images or enters certain page in conjunction with general indexing method by importing arbitrarily.Directly the scheme of input page numbering is the simplest.Comparatively practical with the scheme that indexing method is used.This is not only harmonious with the existing retrieval mode of library and ancient books CD server, and formed 2-level search pattern is more convenient for handling the different literature of ancient book of a large amount of writing styles.The retrieval point guiding retrieval person that indexing method provides further offers help for retrieval person finds target in volume based on the content search method of visual similarity at digital library or CD server discovery candidate's ancient books folder.
(3) demarcate searching object 121
On the page displayed image, retrieval person utilizes indicating equipment 209b such as mouse or writing pencil to click object, sets or adjust object-order.Interior geometric coordinate of page number, the page that demarcation sample retrieval module 121 record indicating equipments provide and the natural number of representing this order according to order mark on page-images that retrieval person sets.Can cooperate and browse controlling mechanism, in multipage, demarcate searching object.When retrieval person starts when retrieval, module 121 forms sample retrieval according to the order of geometric coordinate sequence and coordinate sequence in the page number of above-mentioned object, the page.Page number and coordinate set are passed to and are obtained characteristic module 122, and the order of coordinate sequence is passed to checking constraint condition module 124 as constraint condition.Afterwards the member object of each sample retrieval is implemented 122 to 123 and handle, in 506 steps, carry out aftertreatment and judge loop ends.
(4) obtain the morphological feature vector 122 of sample retrieval
From mark sheet, obtain the morphological feature vector of this object according to geometric coordinate in the page label of the member object of sample retrieval and the page.Acquisition methods depends on the organizational form of the interior geometric coordinate of the page of mark sheet.Page-images is after Object Segmentation, and each object all has a rectangle (referring to 2 (4)) that comprises it.If geometric coordinate is provided by the middle point coordinate of this rectangle in the page of object, then should be under the identical situation of page number, in mark sheet, calculate and the immediate point in sample member position earlier, and then obtain the morphological feature vector of object from this list item according to Euclidean distance; If geometric coordinate angular coordinate is provided in the page of object by this rectangle, then should be under the identical situation of page number, whether the check rectangle comprises sample member position in mark sheet earlier, and then obtains the morphological feature vector of object from the list item that comprises sample.Before a kind of method each object is saved the storage space of a pair of coordinate, a kind of method in back can be avoided the multiplication and division computing relatively the time, execution speed is very fast.In ancient books the more or sample retrieval length of object number generally more in short-term, a kind of method is favourable before using.
(5) approximate Object Query 123
With respect to certain sample member object, in the feature space index, search the similar object set of its vision according to nearest neighbouring rule.Specific practice is that establishing by the 123 form vectors that obtain is v, is r by the 501 search precision controlled variable that read, and then uses the set that following A~B obtains the global position feature GLF of analogical object.
A. according to parameter r setting range border.
To each dimension of feature space, establishing its mobility scale is W, then at first sets the range of search width
Wherein, ε is a very little number, general value 0.0001, the situation of corresponding strict search.S is the maximum occurrences of r.If read described in the precision controlled variable step s=10 according to aforementioned.
Then, system adjusts the position of range of search automatically, obtains on this dimension of feature space an interval w who comprises retrieval point x and be positioned at W, makes the w mid point that x is positioned at as much as possible.The border of note w is respectively a
iAnd b
i
Utilize the method SetRange of HnRect in the SR-tree program bag to set range of search, promptly i is tieed up
rect.SetRange(a
i,HnRange::INCLUSIVE,b
i,HnRange::INCLUSIVE,i).
Wherein, HnRange::INCLUSIVE is the constant that defines in the software package.
B. scope is searched (Range Search).
According to the range of search of setting among the A., from feature space index 113, return the global position feature GLF of analogical object one by one, form this sample member's analogical object set.Specific algorithm is as follows:
I) call the GetFirst method of HnSRTreeFile object File, return the GLF of first approximate object;
Ii) incorporate this GLF into results set
Iii) call the GetNext method of HnSRTreeFile object File repeatedly, return the GLF of next approximate object.Incorporate this GLF into results set, Key.isValid () test is for false in return parameters.
(6) handle lookup result 123
To all member objects of sample retrieval, the GLF of their approximate object set is accumulated cluster, passes to checking constraint condition module 124.
(7) checking constraint condition 124
So-called constraint condition promptly is the relative order of object elements that retrieval person demarcates in 121.Concrete proof procedure is as follows:
A. make sample retrieval comprise M member object, remember with e successively relatively in proper order by it
1, e
2..., e
M, from 506 obtain bunch M GLF souvenir with L
1, L
2..., L
M
B. with L
1As L, be circulated to M with subscript i from 2 with increment 1, carry out C
C. to each the element e among the L, establishing its GLF is j, if L
iIn not have GLF be the object of j+i-1, then e is left out from L
D. the result who keeps among the L during loop ends is exactly first element list of result for retrieval.
(8) mark result for retrieval 508 on page image
Take out the GLF of header element one by one from 127 result for retrieval, as index, search mark sheet 112, deterministic retrieval is the page number and the page internal coordinate of header element as a result.On page-images, paste additional marking such as red round dot, indicate a continuous N object that begins thus.When this page or leaf when side-play amount begins not enough M object, from inferior beginning of the page portion beginning label residue object.
(9) page-images shows/browses 125
Set up hop button such as first term mark, preceding paragraph mark, consequent mark, last item mark,, provide retrieval person to observe result for retrieval and its contextual function of observation in conjunction with common homepage, preceding page or leaf, back page or leaf and last page navigation button.