CN1186744C

CN1186744C - Chinese character recognizing method based on structure model

Info

Publication number: CN1186744C
Application number: CNB021259496A
Authority: CN
Inventors: 贾云得; 刘峡壁
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2002-08-06
Filing date: 2002-08-06
Publication date: 2005-01-26
Anticipated expiration: 2022-08-06
Also published as: CN1474351A

Abstract

The present invention relates to a Chinese character recognition method based on structural models and belongs to the fields of pattern recognition, artificial intelligence, and Chinese information processing. Using two primitives, the stroke segment and the stroke, the invention establishes two mathematical models for describing Chinese character structure: a stroke-segment center point model and a stroke relation matrix model. A stroke-segment center point recognition method and a stroke relation matrix recognition method are built on these models. The two methods are combined into a complete recognition method: the stroke-segment center point method performs coarse classification, and the stroke relation matrix method performs fine classification. Printed and handwritten Chinese characters are handled by a single mechanism, and the method can be used for both offline and online recognition. The invention offers high recognition accuracy and stable performance.

Description

A Chinese character recognition method based on structural models

Technical field

The present invention relates to a Chinese character recognition method based on structural models. The claimed technical solution belongs to the fields of pattern recognition, artificial intelligence, and Chinese information processing.

Background technology

After decades of development, Chinese character recognition technology has made great progress. However, unconstrained handwritten Chinese character recognition, and offline handwritten recognition in particular, still falls short of expectations. To address offline handwritten recognition, current work mostly adopts statistical and neural network methods, which adapt to character deformation by learning from large numbers of handwritten samples. These methods require collecting massive sample sets and spending a great deal of training time, yet the results are still not ideal. Structural methods adapt well to deformation and carry no burden of sample collection or training. Existing structural methods have been quite successful in online handwritten Chinese character recognition, but they have been difficult to apply to offline recognition.

Summary of the invention

The technical problem to be solved by the present invention is to provide an effective structural method for recognizing Chinese characters. The method has high recognition accuracy and good stability. It can be used for both handwritten and printed characters, and for both offline and online recognition.

The primary problem in recognizing Chinese characters with a structural method is establishing a structural model of the character image. The invention provides two mathematical models for describing Chinese character structure: the stroke-segment center point model and the stroke relation matrix model.

The stroke-segment center point model takes the stroke segment as the primitive that makes up a Chinese character, and describes a character by the types and positions of its stroke segments. Here, a stroke segment is a set of foreground pixels in the character image that matches people's understanding of one of the four basic strokes: horizontal, vertical, left-falling, and right-falling (other strokes can be composed from these four). The model is expressed as follows:

1) Segment type

According to the direction vector of the stroke segment, segments are divided into four types: horizontal, vertical, left-falling, and right-falling.

2) Segment position

The position of a stroke segment is represented by the Euclidean coordinates of its midpoint, referred to as the center point coordinates. These coordinates are computed on the normalized character image.

3) Model composition

$H = \{(X_i, Y_i, T_i)\},\quad i = 1, 2, \ldots, N \qquad (1)$

Here, $H$ denotes the Chinese character, $X_i$ and $Y_i$ are the abscissa and ordinate of the center point of the i-th stroke segment, $T_i$ is the type of the i-th stroke segment (one of horizontal, vertical, left-falling, or right-falling), and $N$ is the number of stroke segments that make up the character.

Formula (1) states that if a normalized character image has, at every specified position (given by $X_i$ and $Y_i$), a stroke segment of the specified type (given by $T_i$), then the image is the character $H$; otherwise it is not.
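As an illustration of formula (1), the following Python sketch builds the set of (X_i, Y_i, T_i) triples from segment endpoints on a normalized image. The endpoint input format, the 45-degree-wide angle bins, and all function names are assumptions made for this example; the text only states that the type follows from the direction vector and that the position is the midpoint.

```python
import math
from dataclasses import dataclass

HORIZONTAL, VERTICAL = "horizontal", "vertical"
LEFT_FALLING, RIGHT_FALLING = "left_falling", "right_falling"

@dataclass(frozen=True)
class Segment:
    x: float  # X_i: abscissa of the segment's center point
    y: float  # Y_i: ordinate of the segment's center point
    t: str    # T_i: one of the four segment types

def classify(dx, dy):
    """Map a direction vector to a segment type (image coordinates, y grows downward).

    The bin boundaries (22.5, 67.5, 112.5, 157.5 degrees) are assumed, not from the source.
    """
    angle = math.degrees(math.atan2(dy, dx)) % 180.0  # undirected angle in [0, 180)
    if angle < 22.5 or angle >= 157.5:
        return HORIZONTAL
    if angle < 67.5:
        return RIGHT_FALLING  # runs from upper left to lower right
    if angle < 112.5:
        return VERTICAL
    return LEFT_FALLING       # runs from upper right to lower left

def make_segment(p0, p1):
    """Build one (X_i, Y_i, T_i) entry from the two endpoints of a segment."""
    (x0, y0), (x1, y1) = p0, p1
    return Segment((x0 + x1) / 2, (y0 + y1) / 2, classify(x1 - x0, y1 - y0))

def build_model(endpoint_pairs):
    """Model H of formula (1): the list of all segment entries of one character."""
    return [make_segment(p0, p1) for p0, p1 in endpoint_pairs]

# A cross shape on a 64x64 normalized image: one horizontal and one vertical segment.
cross = build_model([((10, 32), (54, 32)), ((32, 8), (32, 56))])
```

Both segments of the cross share the same center point and differ only in type, which shows why the type component T_i is needed alongside the coordinates.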

Based on the stroke-segment center point model, the invention provides the following recognition method, called the stroke-segment center point recognition method.

First, the standard stroke-segment center point model for each Chinese character class is determined. During recognition, the distance between the stroke-segment center point model of the character to be recognized and every standard model is computed, and the class with the smallest distance, or the N classes with the smallest distances, is taken as the recognition result. The distance is computed as follows:

Here, $D(SP, RP)$ is the distance between the standard center point set and the center point set to be recognized; $Q$ is the maximum number of stroke segments that can be matched between the two sets; $I$ is the number of stroke segments in the standard center point set; $J$ is the number of stroke segments in the center point set to be recognized; and $J'$ is the number of input stroke segments that remain after removing those judged during matching to be connecting strokes. $(G_{iX}, G_{iY})$ are the center point coordinates in the standard set, and $(H_{jX}, H_{jY})$ are the center point coordinates in the set to be recognized. $MS_i$ is the subset of segments in the set to be recognized that have already been matched with the first i-1 segments of the standard set. $Simi(ST_i, PT_j)$ is the similarity between the type of the i-th segment in the standard set and the type of the j-th segment in the set to be recognized. $V$ is the threshold on the allowed difference in segment counts, $T$ is the maximum distance assigned to segments that cannot be matched, and $W$ is the threshold on the minimum distance between segments that are allowed to match.

The concrete steps of the stroke-segment center point recognition method are as follows:

(1) Build the standard stroke-segment center point set for each Chinese character;

(2) Normalize the character to be recognized to the standard size, then extract all of its stroke segments to form the center point set to be recognized;

(3) Use formula (2) to compute the distance between each standard center point set and the center point set to be recognized, and take it as the distance between the corresponding standard character and the character to be recognized;

(4) Among all standard characters, take the one with the smallest distance to the character to be recognized, or the N with the smallest distances, as the recognition result.
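The steps above can be sketched in Python as follows. Formula (2) is not reproduced in this text, so the aggregate in `model_distance` is a simplified stand-in and not the patent's formula. The per-segment rule (Euclidean distance divided by type similarity, with a maximum value when the similarity is 0) and the greedy matching follow the steps recited in the claims. The constant 0.5 for partially similar types, the threshold values, the handling of the count threshold, and all names are assumptions; removal of connecting strokes is omitted.

```python
import math

HORIZONTAL, VERTICAL = "horizontal", "vertical"
LEFT_FALLING, RIGHT_FALLING = "left_falling", "right_falling"
OPPOSED = {frozenset((HORIZONTAL, VERTICAL)), frozenset((LEFT_FALLING, RIGHT_FALLING))}

def type_similarity(a, b):
    """1 for equal types, 0 for the two opposed pairs, otherwise an assumed constant."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in OPPOSED:
        return 0.0
    return 0.5  # assumption: the source grades this case by angle deviation

def segment_distance(s, p, max_dist):
    """Euclidean center distance divided by type similarity; max_dist if similarity is 0."""
    sim = type_similarity(s[2], p[2])
    if sim == 0.0:
        return max_dist
    return math.hypot(s[0] - p[0], s[1] - p[1]) / sim

def model_distance(standard, unknown, match_limit=12.0, max_dist=64.0, count_limit=3):
    """Stand-in distance between two lists of (x, y, type) segments."""
    if abs(len(standard) - len(unknown)) > count_limit:
        return math.inf  # assumed use of the segment-count threshold V
    remaining = list(unknown)
    total = 0.0
    for s in standard:
        best = min(remaining, key=lambda p: segment_distance(s, p, max_dist), default=None)
        if best is not None and segment_distance(s, best, max_dist) <= match_limit:
            total += segment_distance(s, best, max_dist)
            remaining.remove(best)  # a matched input segment cannot be reused
        else:
            total += max_dist       # standard segment without a match
    total += max_dist * len(remaining)  # unmatched input segments
    return total / max(len(standard) + len(remaining), 1)

def coarse_classify(models, unknown, n=10):
    """Return the n class labels whose standard models are closest to the input."""
    return sorted(models, key=lambda label: model_distance(models[label], unknown))[:n]

models = {
    "十": [(32, 32, HORIZONTAL), (32, 32, VERTICAL)],
    "二": [(32, 20, HORIZONTAL), (32, 44, HORIZONTAL)],
    "一": [(32, 32, HORIZONTAL)],
}
unknown = [(31, 33, HORIZONTAL), (33, 31, VERTICAL)]
candidates = coarse_classify(models, unknown, n=2)
```

With these toy models, the slightly shifted cross is ranked closest to 十, because both of its segments find a same-type partner within the matching threshold.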

The stroke relation matrix model takes the stroke as the primitive that makes up a Chinese character, and describes a character by the types of its strokes and their positions relative to one another. Here, a stroke means a Chinese character stroke as commonly understood. The model takes the following form:

(1) Stroke types

See Fig. 1.

(2) Mutual positional relations between strokes

To capture as much as possible what the various written forms of a character have in common, and to ignore factors that may vary sharply, the mutual positional relation between each pair of strokes is fuzzified into six kinds: above, below, left, right, crossing, and connected.

(3) Combined model

Because a Chinese character image is two-dimensional, expressing the strokes and their mutual positional relations in two dimensions reflects the structure of the character more accurately. A matrix form is used:

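$$
\begin{array}{c|ccccc}
 & S_1 & S_2 & \cdots & S_{N-1} & S_N \\ \hline
S_1 & R_{11} & R_{12} & \cdots & R_{1(N-1)} & R_{1N} \\
S_2 & R_{21} & R_{22} & \cdots & R_{2(N-1)} & R_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
S_{N-1} & R_{(N-1)1} & R_{(N-1)2} & \cdots & R_{(N-1)(N-1)} & R_{(N-1)N} \\
S_N & R_{N1} & R_{N2} & \cdots & R_{N(N-1)} & R_{NN}
\end{array}
$$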

Here, $S$ denotes a stroke, $R$ a relation, and $N$ the number of strokes. $S_1$ to $S_N$ label the rows and columns, that is, the stroke types. $R_{11}$ to $R_{NN}$ are the matrix elements, each giving the mutual positional relation between the stroke of its row and the stroke of its column.
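A minimal Python sketch of this matrix representation is given below. The dictionary layout, the stroke type labels, the convention that entry (i, j) describes stroke i relative to stroke j, the automatic filling of the inverse relation, and the empty diagonal are all assumptions made for illustration; the text does not specify them.

```python
# Each relation paired with the relation seen from the other stroke (assumed convention).
INVERSE = {
    "above": "below", "below": "above",
    "left": "right", "right": "left",
    "cross": "cross", "connected": "connected",
}

def build_relation_matrix(stroke_types, pair_relations):
    """Build an N x N relation matrix from relations given for some pairs (i, j).

    stroke_types   -- list of stroke type labels S_1..S_N (labels are illustrative)
    pair_relations -- dict mapping (i, j) to the relation of stroke i with respect to j
    """
    n = len(stroke_types)
    relations = [[None] * n for _ in range(n)]  # diagonal stays None (assumed)
    for (i, j), rel in pair_relations.items():
        relations[i][j] = rel
        relations[j][i] = INVERSE[rel]
    return {"strokes": list(stroke_types), "relations": relations}

# Three stacked horizontal strokes, as in the character 三.
san = build_relation_matrix(
    ["horizontal", "horizontal", "horizontal"],
    {(0, 1): "above", (0, 2): "above", (1, 2): "above"},
)
```

Only the six relation kinds named in the text appear as values, so two written forms of a character that differ in stroke length or exact position can still yield the same matrix.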

Based on the stroke relation matrix model, the invention provides the following recognition method, called the stroke relation matrix recognition method:

First, the standard stroke relation matrix model for each Chinese character class is determined. During recognition, the similarity between the stroke segment set of the character to be recognized and every standard stroke relation matrix model is computed, and the class with the largest similarity is taken as the recognition result. The similarity is computed as follows:

Here, $S(SP, RP)$ is the similarity between the standard matrix and the matrix to be recognized; $BN(SP)$ is the number of stroke segments corresponding to the standard matrix; $BN(RP)$ is the number of stroke segments corresponding to the matrix to be recognized; $BN(RP')$ is the number of stroke segments that remain after removing, from those corresponding to the matrix to be recognized, the segments judged during matching to be connecting strokes; $SS(S_k, T_k)$ is the type similarity between the k-th stroke in the standard matrix and the k-th stroke in the matrix to be recognized (k is i or j); $RS(R_{ij}, G_{ij})$ is the similarity between the element in row i, column j of the standard matrix and the element in row i, column j of the matrix to be recognized; and $V$ is the threshold on the allowed difference in stroke segment counts.

The concrete steps of the stroke relation matrix recognition method are as follows:

(1) Build the standard stroke relation matrix model for each Chinese character.

(2) Normalize the character to be recognized to the standard size, then extract all of its stroke segments to form the input stroke segment set.

(3) Use formula (3) to compute the similarity between each standard matrix and the input stroke segment set, and take it as the similarity between the corresponding standard character and the character to be recognized.

(4) Among all standard characters, take the one with the largest similarity to the character to be recognized as the recognition result.
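The steps above can be sketched in Python as follows. Formula (3) is not reproduced in this text, so `matrix_similarity` is a stand-in that averages, over all stroke pairs, the product of the two type similarities and the relation similarity; this combination is an assumption suggested only by the symbol definitions above. The binary similarity functions, the requirement of equal stroke counts, and the class label "T" are also assumptions, and the search that derives the matrix to be recognized from the input segments is omitted.

```python
def stroke_type_similarity(a, b):
    """Assumed binary stand-in for SS: 1 for the same stroke type, else 0."""
    return 1.0 if a == b else 0.0

def relation_similarity(a, b):
    """Assumed binary stand-in for RS: 1 for the same relation, else 0."""
    return 1.0 if a == b else 0.0

def matrix_similarity(std, unk):
    """Stand-in similarity between two matrices whose strokes are already aligned."""
    s, t = std["strokes"], unk["strokes"]
    if len(s) != len(t):
        return 0.0  # assumption: only equal stroke counts are compared
    n = len(s)
    if n == 1:
        return stroke_type_similarity(s[0], t[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the diagonal carries no relation in this sketch
            total += (
                stroke_type_similarity(s[i], t[i])
                * stroke_type_similarity(s[j], t[j])
                * relation_similarity(std["relations"][i][j], unk["relations"][i][j])
            )
    return total / (n * (n - 1))

def fine_classify(models, unknown):
    """Return the class label whose standard matrix is most similar to the input."""
    return max(models, key=lambda label: matrix_similarity(models[label], unknown))

# Two toy classes with the same stroke types but different relations.
cross = {"strokes": ["horizontal", "vertical"],
         "relations": [[None, "cross"], ["cross", None]]}
tee = {"strokes": ["horizontal", "vertical"],
       "relations": [[None, "connected"], ["connected", None]]}
observed = {"strokes": ["horizontal", "vertical"],
            "relations": [[None, "cross"], ["cross", None]]}
best = fine_classify({"十": cross, "T": tee}, observed)
```

The two toy classes differ only in how the strokes relate, so the decision rests entirely on the relation entries, which is the information the center point model does not carry.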

The two methods have different strengths: the stroke relation matrix recognition method is more accurate, while the stroke-segment center point recognition method is faster. The recognition method of the invention therefore uses the stroke-segment center point method for coarse classification and the stroke relation matrix method for fine classification. The accuracy of the stroke-segment center point method on characters of fairly regular shape is also satisfactory, so when the invention is applied to such characters, the stroke-segment center point method can be used alone for fine classification.
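The coarse-to-fine combination can be sketched in Python as a two-stage ranking. The function `recognize`, its parameters, and the toy scores are hypothetical; the two scoring callables stand for the distance and similarity computations described above and are passed in rather than implemented here.

```python
def recognize(unknown, classes, coarse_distance, fine_similarity, n=10):
    """Two-stage recognition: shortlist by coarse distance, then pick by fine similarity.

    coarse_distance(label, unknown) -- smaller means closer (coarse stage)
    fine_similarity(label, unknown) -- larger means more similar (fine stage)
    n                               -- number of coarse candidates kept
    """
    shortlist = sorted(classes, key=lambda label: coarse_distance(label, unknown))[:n]
    return max(shortlist, key=lambda label: fine_similarity(label, unknown))

# Toy scores standing in for the two classifiers.
coarse_scores = {"a": 1.0, "b": 2.0, "c": 9.0}
fine_scores = {"a": 0.6, "b": 0.9, "c": 1.0}

result = recognize(
    unknown=None,
    classes=list(coarse_scores),
    coarse_distance=lambda label, _: coarse_scores[label],
    fine_similarity=lambda label, _: fine_scores[label],
    n=2,
)
```

In this toy run class "c" has the highest fine similarity but is never examined, because the coarse stage keeps only the two nearest classes. The design trades a small risk of dropping the correct class for running the slower, more accurate comparison on a short candidate list.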

The present invention has the following advantages:

1. The recognition method provided by the invention recognizes Chinese characters with a unified mechanism. It can be used for both offline and online recognition, and for both handwritten and printed characters.

2. The recognition method provided by the invention has high recognition accuracy, adapts well to deformation, and is stable.

Description of drawings

Fig. 1 shows the stroke types in the stroke relation matrix model;

Fig. 2 is a schematic diagram of the stroke-segment center point model;

Fig. 3 is a schematic diagram of the stroke relation matrix model;

Fig. 4 is the overall block diagram of the Chinese character recognition method;

Fig. 5 is the recognition flow chart of the stroke-segment center point recognition method;

Fig. 6 is the recognition flow chart of the stroke relation matrix recognition method.

Embodiment

The invention can be implemented wherever Chinese character recognition is needed. Preferred forms are online handwritten Chinese character recognition systems and devices, offline printed Chinese character recognition systems and devices, and offline handwritten Chinese character recognition systems and devices. In one embodiment, unconstrained free handwritten characters within the 6763 Chinese characters specified by GB2312-80 were recognized. The stroke-segment center point classifier placed the correct character among its top ten candidates with an accuracy above 99%, at an average speed of 1 second per character. The stroke relation matrix classifier achieved a recognition accuracy above 91.2%, at an average speed of 0.2 seconds per character.

Claims

1. A Chinese character recognition method based on structural models, characterized in that:

a stroke-segment center point recognition method, based on a stroke-segment center point model, is used for coarse classification; and a stroke relation matrix recognition method, based on a stroke relation matrix model, is used to perform fine classification on the coarse classification results;

the stroke-segment center point model has the following form: a Chinese character image is first normalized to a standard size and then decomposed into a set of stroke segments; each stroke segment is assigned one of four types: horizontal, vertical, left-falling, or right-falling; finally, the center point coordinates and the types of these stroke segments form a model representing the Chinese character. The model can be summarized by the following formula:

$H = \{(X_i, Y_i, T_i)\},\quad i = 1, 2, \ldots, N$

where $H$ denotes a Chinese character, $X_i$ is the abscissa of the center point of the i-th stroke segment, $Y_i$ is the ordinate of the center point of the i-th stroke segment, $T_i$ is the type of the i-th stroke segment, taking one of the four values horizontal, vertical, left-falling, or right-falling, and $N$ is the number of stroke segments that make up the Chinese character;

the stroke-segment center point recognition method performs recognition according to the distance between the standard stroke-segment center point model and the stroke-segment center point model to be recognized, the distance being calculated by the following formula:

where $D(SP, RP)$ is the distance between the standard center point set and the center point set to be recognized; $Q$ is the maximum number of stroke segments that can be matched between the standard center point set and the center point set to be recognized; $I$ is the number of stroke segments in the standard center point set; $J$ is the number of stroke segments in the center point set to be recognized; $J'$ is the number of all stroke segments in the matching set and the non-matching set; $(G_{iX}, G_{iY})$ are the center point coordinates of the standard center point set; $(H_{jX}, H_{jY})$ are the center point coordinates of the center point set to be recognized; $MS_i$ is the subset of stroke segments in the center point set to be recognized that have been matched with the first i-1 stroke segments of the standard center point set; $Simi(ST_i, PT_j)$ is the similarity between the type of the i-th stroke segment in the standard center point set and the type of the j-th stroke segment in the center point set to be recognized; $V$ is the threshold on the allowed difference in stroke segment counts; $T$ is the maximum distance assigned to stroke segments that cannot be matched; and $W$ is the threshold on the minimum distance between stroke segments that are allowed to match;

the stroke relation matrix model has the following form: a Chinese character image is first normalized to a standard size and then decomposed into a set of strokes of predefined types; the mutual positional relations between these strokes are determined; finally, the strokes and their mutual positional relations are arranged in a matrix to form a model representing the Chinese character, which can be summarized by the following matrix:

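$$
\begin{array}{c|ccccc}
 & S_1 & S_2 & \cdots & S_{N-1} & S_N \\ \hline
S_1 & R_{11} & R_{12} & \cdots & R_{1(N-1)} & R_{1N} \\
S_2 & R_{21} & R_{22} & \cdots & R_{2(N-1)} & R_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
S_{N-1} & R_{(N-1)1} & R_{(N-1)2} & \cdots & R_{(N-1)(N-1)} & R_{(N-1)N} \\
S_N & R_{N1} & R_{N2} & \cdots & R_{N(N-1)} & R_{NN}
\end{array}
$$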

where $S$ denotes a stroke, $R$ a relation, and $N$ the number of strokes; $S_1$ to $S_N$ label the rows and columns, that is, the stroke types; $R_{11}$ to $R_{NN}$ are the matrix elements, each representing the mutual positional relation between the strokes of the corresponding row and column, and taking one of six values: above, below, left, right, crossing, or connected;

the stroke relation matrix recognition method performs recognition according to the similarity between the standard stroke relation matrix model and the stroke relation matrix model to be recognized, the similarity being calculated by the following formula:

where $S(SP, RP)$ is the similarity between the standard model and the model to be recognized; $BN(SP)$ is the number of stroke segments corresponding to the standard model; $BN(RP)$ is the number of stroke segments corresponding to the model to be recognized; $BN(RP')$ is the number of stroke segments contained in the stroke segment set corresponding to the model to be recognized; $SS(S_k, T_k)$ is the type similarity between the k-th stroke in the standard model and the k-th stroke in the model to be recognized (k is i or j); $RS(R_{ij}, G_{ij})$ is the similarity between the element in row i, column j of the standard model and the element in row i, column j of the model to be recognized; and $V$ is the threshold on the allowed difference in stroke segment counts.

2. The Chinese character recognition method based on structural models according to claim 1, characterized in that the stroke-segment center point recognition method comprises the following steps: (1) build a standard model library: according to the stroke-segment center point model, build the standard stroke-segment center point model of each Chinese character and store it in the model library; (2) determine, from each standard stroke-segment center point model and the input stroke segment set, the corresponding stroke-segment center point model to be recognized; (3) calculate the distance between the standard stroke-segment center point model and the stroke-segment center point model to be recognized; (4) take as the recognition result the Chinese character corresponding to the standard model with the smallest distance, or the Chinese characters corresponding to the N standard models with the smallest distances.

3. The Chinese character recognition method based on structural models according to claim 2, characterized in that the method of determining the stroke-segment center point model to be recognized from the standard stroke-segment center point model and the input stroke segment set comprises the following steps: (1) for each stroke segment in the standard stroke-segment center point model, find the stroke segment in the input stroke segment set with the smallest distance to it; (2) if this smallest distance is greater than the defined maximum threshold, the standard stroke segment is considered to have no matching stroke segment in the input stroke segment set; otherwise the two stroke segments are matched and deleted from their respective sets; (3) repeat the above process until every stroke segment in the standard model has been processed; (4) the input stroke segments matched to the standard model in the above process form the matching set; (5) from the input stroke segments not included in the matching set, remove those that connect two stroke segments in the matching set; the remaining stroke segments form the non-matching set; (6) determine the types and center point coordinates of all stroke segments in the matching set and the non-matching set, which form the stroke-segment center point model to be recognized.

4. The Chinese character recognition method based on structural models according to claim 3, characterized in that the method of calculating the distance between a standard stroke segment and an input stroke segment comprises the following steps: (1) calculate the Euclidean distance between the center point of the standard stroke segment and the center point of the input stroke segment; (2) determine the type similarity from the types of the standard stroke segment and the input stroke segment: the similarity between horizontal and vertical, and between left-falling and right-falling, is 0; the similarity between identical types is 1; in other cases the similarity is determined by the degree to which the angle of the stroke segment to be recognized deviates from the angle range of the type of the standard stroke segment; (3) divide the distance obtained in step (1) by the type similarity obtained in step (2) to obtain the final distance; if the type similarity is 0, the final distance is the assigned maximum value.

5. The Chinese character recognition method based on structural models according to claim 1, characterized in that the stroke relation matrix recognition method comprises the following steps: (1) build a standard model library: according to the stroke relation matrix model, build the standard stroke relation matrix model of each Chinese character and store it in the model library; (2) according to each standard stroke relation matrix model, determine strokes and their mutual positional relations from the input stroke segment set to form a stroke relation matrix model to be recognized; (3) calculate the similarity between the standard stroke relation matrix model and the stroke relation matrix model to be recognized; (4) repeat steps (2) and (3) until every stroke relation matrix model to be recognized that can be derived from the standard model has been evaluated, and take the largest similarity as the final similarity for that standard model; (5) take the Chinese character corresponding to the standard stroke relation matrix model with the largest final similarity as the recognition result.

6. The Chinese character recognition method based on structural models according to claim 5, characterized in that the method of determining the stroke relation matrix model to be recognized from the input stroke segment set according to the standard stroke relation matrix model comprises the following steps: (1) for each stroke in the standard stroke relation matrix model, search the input stroke segment set for identical or similar strokes to form a corresponding stroke subset; (2) take one stroke from the subset corresponding to each stroke in the standard model to form all the strokes of the stroke relation matrix model to be recognized; the strokes taken must not conflict with one another, that is, they must not share stroke segments; (3) temporarily delete from the input stroke segment set those stroke segments that are not included in the model to be recognized but connect two strokes in it; the remaining stroke segments form the stroke segment set corresponding to the model to be recognized; (4) determine the types and mutual positional relations of all strokes in the model to be recognized, forming the stroke relation matrix model to be recognized.

7. The Chinese character recognition method based on structural models according to claim 6, characterized in that the method of finding identical or similar strokes in the input stroke segment set comprises the following steps: (1) establish a template describing each stroke type; (2) establish the type similarity values between strokes; (3) determine the stroke types to be searched for according to the given similarity threshold; (4) search the input stroke segment set according to the templates of the stroke types to be searched for, to determine the corresponding stroke subset.