CN1570958A - Method for identifying multi-font multi-character size print form Tibetan character - Google Patents
Method for identifying multi-font multi-character size print form Tibetan character
- Publication number
- CN1570958A, CN200410034107A, CN200410034107
- Authority
- CN
- China
- Prior art keywords
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 46
- 239000013598 vector Substances 0.000 claims abstract description 66
- 238000010606 normalization Methods 0.000 claims abstract description 52
- 238000012360 testing method Methods 0.000 claims abstract description 13
- 238000004458 analytical method Methods 0.000 claims abstract description 12
- 238000000605 extraction Methods 0.000 claims abstract description 11
- 230000005484 gravity Effects 0.000 claims abstract description 6
- 239000011159 matrix material Substances 0.000 claims description 33
- 238000012549 training Methods 0.000 claims description 27
- 230000009466 transformation Effects 0.000 claims description 27
- 230000006870 function Effects 0.000 claims description 22
- 230000008569 process Effects 0.000 claims description 15
- 239000000284 extract Substances 0.000 claims description 11
- 238000013461 design Methods 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 9
- 238000002474 experimental method Methods 0.000 claims description 7
- 230000006835 compression Effects 0.000 claims description 5
- 238000007906 compression Methods 0.000 claims description 5
- 230000008859 change Effects 0.000 claims description 2
- 238000000926 separation method Methods 0.000 claims 2
- 230000001174 ascending effect Effects 0.000 claims 1
- 230000015572 biosynthetic process Effects 0.000 claims 1
- 238000006243 chemical reaction Methods 0.000 claims 1
- 238000012937 correction Methods 0.000 claims 1
- 230000008707 rearrangement Effects 0.000 claims 1
- 230000009467 reduction Effects 0.000 claims 1
- 238000012546 transfer Methods 0.000 claims 1
- 239000000203 mixture Substances 0.000 abstract description 6
- 239000000523 sample Substances 0.000 description 13
- 230000011218 segmentation Effects 0.000 description 9
- 238000009826 distribution Methods 0.000 description 5
- 238000004422 calculation algorithm Methods 0.000 description 4
- 238000004364 calculation method Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 4
- 238000007781 pre-processing Methods 0.000 description 4
- 238000011160 research Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000003825 pressing Methods 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000013095 identification testing Methods 0.000 description 1
- 238000012805 post-processing Methods 0.000 description 1
Landscapes
- Image Analysis (AREA)
Abstract
The method for recognizing multi-font, multi-size printed Tibetan characters belongs to the field of character recognition. It is characterized by a normalization scheme designed for the features of printed Tibetan characters, which are not square-shaped: the character image is split at the baseline (the upper horizontal line) into two non-overlapping sub-images, and each sub-image undergoes position normalization with a reference point that combines the center of gravity and the bounding box, and size normalization based on cubic B-spline interpolation. Four-direction line element features, which fully reflect the composition of Tibetan characters, are extracted and compressed by linear discriminant analysis (LDA) to obtain a compact character feature vector. Character classes are decided with a coarse-to-fine two-stage classification strategy based on confidence analysis; the coarse and fine classifiers use the Euclidean distance with deviation (EDD) and the modified quadratic discriminant function (MQDF), respectively. The invention achieves a recognition accuracy of 99.83% on a multi-font, multi-size printed Tibetan single-character test set, and a recognition rate above 99% on real text.
Description
Technical Field
The method for recognizing multi-font, multi-size printed Tibetan characters belongs to the field of character recognition.
Background Art
Tibetan character recognition technology is an important component of Chinese multilingual information processing systems and has great theoretical value and broad application prospects. Character recognition methods fall into two categories: statistical decision methods and syntactic-structural methods. In statistical decision methods, each character pattern is represented by a feature vector and regarded as a point in feature space, and recognition consists of assigning the pattern of the character to be recognized to its correct class in that space. Syntactic-structural methods, for a given character set, extract a limited number of indivisible minimal sub-patterns (primitives); combining these primitives in particular orders and according to particular rules can form any character in the set. Exploiting the similarity between character structure and language, character recognition can then describe and parse the structure of characters with grammars from formal linguistics (including syntactic rules).
The large number of characters, the complex glyph structure, the many font styles, and the high proportion of similar characters make research on Tibetan character recognition challenging. Research on Tibetan recognition at home and abroad is still very limited, and no successful algorithm or system has yet appeared. Although Tibetan is an alphabetic script and every character is composed of several components (letters and variant forms of letters), the structure of the components and the ways they connect to one another are complex, which makes it very difficult to separate the components of a character correctly. Considering also notable weaknesses of the syntactic-structural approach, such as poor robustness to noise, the present invention studies multi-font, multi-size printed Tibetan character recognition with a statistical decision approach, taking a whole single Tibetan character as the basic recognition unit.
In Chinese character recognition, direction line elements describe well the quantitative relationship among the four basic stroke units (horizontal, vertical, left-falling and right-falling) at different positions of the occupied space, and thus reflect the composition of a Chinese character comprehensively, accurately and stably. A Tibetan character is built from components stacked vertically in a fixed order, each component is built from strokes, and the connections between strokes within a component are fixed. Every Tibetan character therefore has a specific structure, and this structure can be reflected at the level of layers, parts and details; direction line elements are an effective means of describing these structural features.
On the basis of a thorough examination of the characteristics of Tibetan characters, the present invention selects an appropriate normalization method for their particular shapes, extracts highly descriptive direction line element features, and obtains recognition results with a two-stage statistical classifier based on confidence analysis, realizing a high-performance multi-font, multi-size Tibetan character recognition method that has not been used in any previous literature.
Summary of the Invention
The purpose of the present invention is to provide a method for recognizing multi-font, multi-size printed Tibetan characters. Taking a single Tibetan character as the processing object, the method first applies the necessary normalization, including position normalization and size normalization; it then extracts four-direction line element features that reflect the characteristics of the character well, compresses them with the LDA (linear discriminant analysis) method, and performs the classification decision with a coarse-to-fine two-stage statistical classifier based on confidence analysis. A very high single-character recognition accuracy is obtained in this way. A multi-font, multi-size printed Tibetan character recognition system has been implemented according to the method.
A complete printed Tibetan character recognition system also includes the collection of single-character samples: the system first scans printed Tibetan text and segments it into characters automatically. From the training sample database built in this way, direction line element features are extracted and transformed to obtain the feature database of the training samples, on the basis of which the classifier parameters are determined experimentally. For an unknown input character sample, features are extracted in the same way and then compared against the feature database by the classifier to decide the class of the input character.
The invention consists of the following parts: character normalization, four-direction line element feature extraction, feature transformation, and classifier design.
1. Character Normalization
1.1 Position Normalization
Let the original character image be [F(i,j)] of size W×H, where W is the image width, H the height, and F(i,j) the value of the pixel in row i and column j, i = 1, 2, ..., H, j = 1, 2, ..., W. According to the characteristics of Tibetan characters, [F(i,j)] can be viewed as the vertical concatenation of two non-overlapping sub-images [F1(i,j)] of size W×H1 and [F2(i,j)] of size W×H2, where [F1(i,j)] is the part of the image above the baseline (the upper horizontal line), i.e. the upper-vowel part, [F2(i,j)] is the part below the baseline, and H1 + H2 = H. The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed as V(i) = Σ_{j=1..W} F(i,j), and the ordinate P_I of the baseline is then determined from this projection.
H1 can be determined from P_I and the ordinate of the top of the character; in the coordinate system adopted by the invention (Fig. 4), H1 is numerically equal to P_I.
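The following Python sketch illustrates this step on a binary character image. The function name is ours, and taking the baseline at the row where the horizontal projection V(i) peaks is an assumption made for illustration; the invention determines P_I from V(i).

```python
import numpy as np

def split_at_baseline(F):
    """Split a binary character image (1 = stroke pixel) at the baseline.

    V(i) is the horizontal projection of row i.  The baseline ordinate P_I is
    assumed here to be the row with the maximal projection (the long upper
    horizontal stroke of a Tibetan character); this choice is our assumption.
    """
    H, W = F.shape
    V = F.sum(axis=1)                 # V(i) = sum over j of F(i, j)
    P_I = int(np.argmax(V)) + 1       # 1-based row index of the baseline
    F1 = F[:P_I, :]                   # part above the baseline (upper vowels), H1 = P_I
    F2 = F[P_I:, :]                   # part below the baseline, H2 = H - P_I
    return F1, F2
```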
Let the normalized character image be [G(i,j)] of size M×N, where M is the width, N the height, and G(i,j) the value of the pixel in row i and column j, i = 1, 2, ..., N, j = 1, 2, ..., M. Likewise, [G(i,j)] can be viewed as the vertical concatenation of two non-overlapping sub-images [G1(i,j)] of size M×N1 and [G2(i,j)] of size M×N2, where [G1(i,j)] is the part above the baseline and [G2(i,j)] the part below it; based on an analysis of where the baseline lies in Tibetan characters, N1 = N/4 and N2 = 3N/4 are used here. Normalization can then be regarded as mapping the input lattices [F1(i,j)] and [F2(i,j)] onto the target lattices [G1(i,j)] and [G2(i,j)], respectively. In this process, a reference point Uk(uIk, uJk), k = 1, 2, is selected in each input lattice [Fk(i,j)], and the input lattice is shifted so that this reference point lies at the center of the target lattice [Gk(i,j)], which completes the position normalization of the input character.
Let Ak(aIk, aJk) and Bk(bIk, bJk), k = 1, 2, denote the center of gravity of [Fk(i,j)] and the geometric center of its bounding box, respectively. Uk(uIk, uJk), k = 1, 2, is then taken as a point lying between Ak(aIk, aJk) and Bk(bIk, bJk), where β is a constant with 0 ≤ β ≤ 1 that controls the position of Uk between the two.
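A minimal sketch of the reference-point computation, assuming Uk is the linear interpolation β·Ak + (1 − β)·Bk between the center of gravity Ak and the bounding-box center Bk; the interpolation direction and the function name are our assumptions, since only the constraint 0 ≤ β ≤ 1 is stated.

```python
import numpy as np

def reference_point(Fk, beta=0.5):
    """Reference point for position normalization of one sub-image.

    A_k: center of gravity of the stroke pixels; B_k: geometric center of
    their bounding box.  U_k is assumed to be the linear interpolation
    beta * A_k + (1 - beta) * B_k, with 0 <= beta <= 1.
    """
    rows, cols = np.nonzero(Fk)               # coordinates of stroke (black) pixels
    A = np.array([rows.mean(), cols.mean()])  # center of gravity
    B = np.array([(rows.min() + rows.max()) / 2.0,
                  (cols.min() + cols.max()) / 2.0])  # bounding-box center
    return beta * A + (1.0 - beta) * B
```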
1.2 Size Normalization
Tibetan characters are not square-shaped: character widths are relatively stable, while heights differ greatly from character to character, so they cannot be normalized to a square lattice the way Chinese characters are. Statistics on the height-to-width ratio of the 710,400 characters in 1,200 collected sets of Tibetan character samples (6 fonts, 7 font sizes, 592 characters per set) show that a height-to-width ratio of 2 after normalization is reasonable; it is a compromise among the differing aspect ratios of the various fonts.
Comparing the input character image [Fk(i,j)] of size W×Hk, k = 1, 2, with the normalized target character lattice [Gk(i,j)] of size M×Nk, k = 1, 2, the relationship is
Gk(i,j) = Fk(i/ri, j/rj), k = 1, 2,
where ri and rj are the scale factors in the i and j directions: ri = Nk/Hk, rj = M/W. According to this formula, the point (i,j) of the output lattice corresponds to the point (i/ri, j/rj) of the input character. Fk(i,j) is a discrete function, and i/ri and j/rj are in general not integers, so the value of Fk at (i/ri, j/rj) must be estimated from its values at the known discrete points. The invention uses cubic B-spline interpolation for this purpose, to reduce distortions such as staircase edges in the normalized character lattice. For a given (i,j), the neighboring integer sample positions are obtained with the rounding function [·], and the interpolated value is a weighted sum of the sample values, with weights given by the cubic B-spline function RB(z), where W(z) denotes the unit step function.
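The following sketch shows the size normalization with cubic B-spline interpolation. The standard cubic B-spline kernel is used on the assumption that it matches RB(z); the 4×4 sample neighborhood, the boundary handling and the final re-binarization threshold are also our choices.

```python
import numpy as np

def cubic_bspline(z):
    """Standard cubic B-spline kernel; assumed to match the patent's R_B(z)."""
    z = abs(z)
    if z < 1.0:
        return 2.0 / 3.0 - z * z + 0.5 * z ** 3
    if z < 2.0:
        return (2.0 - z) ** 3 / 6.0
    return 0.0

def resize_bspline(Fk, M, Nk):
    """Scale one sub-image to M x Nk with cubic B-spline interpolation.

    G_k(i, j) = F_k(i / r_i, j / r_j) with r_i = Nk / Hk, r_j = M / W; the
    value at the non-integer source position is a weighted sum of the 4 x 4
    surrounding samples, weighted by the B-spline kernel.
    """
    Hk, W = Fk.shape
    ri, rj = Nk / Hk, M / W
    G = np.zeros((Nk, M))
    for i in range(Nk):
        for j in range(M):
            y, x = i / ri, j / rj            # source position in F_k
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            val = 0.0
            for m in range(y0 - 1, y0 + 3):
                for n in range(x0 - 1, x0 + 3):
                    if 0 <= m < Hk and 0 <= n < W:
                        val += Fk[m, n] * cubic_bspline(y - m) * cubic_bspline(x - n)
            G[i, j] = val
    return (G > 0.5).astype(np.uint8)        # re-binarize the interpolated lattice
```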
2. Direction Line Element Feature Extraction
2.1 Extracting the Character Contour
Assume that in the character image the points belonging to strokes are black pixels and the background points are white pixels. A stroke pixel is called a contour point if its 8-neighborhood contains at least one white pixel and the pixel itself is not isolated (an isolated black pixel is one whose 8-neighborhood contains no black pixel). The contour image is extracted by scanning the whole character lattice: for a black pixel at a given position, if both the number of black pixels and the number of white pixels in its 8-neighborhood are greater than 0, the pixel is kept; otherwise its value in the lattice is set to 0. In this way the contour image [G′(i,j)] of size M×N is obtained from the normalized character image [G(i,j)].
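A direct sketch of the contour-extraction rule just described; treating missing neighbors at the image border as absent is our choice.

```python
import numpy as np

def extract_contour(G):
    """Keep a black pixel only if its 8-neighborhood contains both black and
    white pixels; everything else becomes background.  G is a binary N x M
    lattice with 1 for stroke (black) pixels."""
    N, M = G.shape
    Gp = np.zeros_like(G)
    for i in range(N):
        for j in range(M):
            if G[i, j] == 0:
                continue
            nb = G[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            black = int(nb.sum()) - 1          # 8-neighbors that are black
            white = nb.size - 1 - black        # 8-neighbors that are white
            if black > 0 and white > 0:
                Gp[i, j] = 1
    return Gp
```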
2.2 Block Division and Composition of the Feature Vector
For every black pixel in the character contour lattice [G′(i,j)], one of four line elements, horizontal (0°), vertical (90°), left-falling (45°) or right-falling (135°), is assigned according to its positional relationship with two neighboring black pixels. Two cases are distinguished: if the three black pixels lie on the same straight line, only one line element is assigned to the central pixel, with value 2 (Fig. 9a-d); if they do not lie on the same straight line, two line elements are assigned to the central pixel, each with value 1 (Fig. 9e-p). In the case shown in Fig. 9k, for example, the central pixel receives the right-falling and vertical elements, each with value 1, and the remaining cases follow by analogy. Assigning line elements to every black pixel of the character lattice according to these principles yields, for each black pixel (i,j), a 4-dimensional vector X(i,j) = (xv, xk, xp, xo)T whose components give the amounts of the four line elements at that pixel.
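A sketch of the per-pixel line element assignment. The exact case table of Fig. 9 is not reproduced here, so the pairing rule below (value 2 for a pair of black neighbors collinear with the center, value 1 for each of the two directions otherwise) is an approximation of the assignment described above.

```python
import numpy as np

# Direction index for each of the 8 neighbor offsets (di, dj):
# 0 = horizontal (0 deg), 1 = vertical (90 deg), 2 = left-falling (45 deg),
# 3 = right-falling (135 deg).
DIRECTION = {(0, 1): 0, (0, -1): 0, (1, 0): 1, (-1, 0): 1,
             (-1, 1): 2, (1, -1): 2, (-1, -1): 3, (1, 1): 3}

def line_elements(Gp):
    """4-dimensional line-element vector X(i, j) for every contour pixel.

    Simplified reading of the rule: for each pair of black neighbors of a
    contour pixel, if the two lie on one straight line through the center the
    corresponding direction gets value 2, otherwise both directions get 1.
    """
    N, M = Gp.shape
    X = np.zeros((N, M, 4))
    for i in range(N):
        for j in range(M):
            if Gp[i, j] == 0:
                continue
            nbrs = [(di, dj) for (di, dj) in DIRECTION
                    if 0 <= i + di < N and 0 <= j + dj < M and Gp[i + di, j + dj]]
            for a in range(len(nbrs)):
                for b in range(a + 1, len(nbrs)):
                    p, q = nbrs[a], nbrs[b]
                    if p[0] == -q[0] and p[1] == -q[1]:      # collinear through the center
                        X[i, j, DIRECTION[p]] += 2
                    else:
                        X[i, j, DIRECTION[p]] += 1
                        X[i, j, DIRECTION[q]] += 1
    return X
```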
After this, the M×N lattice is divided evenly into sub-regions of width M0 and height N0 (Fig. 10); adjacent sub-regions overlap by M0/2 pixels horizontally and N0/2 pixels vertically, so the whole M×N lattice yields (2M/M0 − 1) × (2N/N0 − 1) sub-regions. Each sub-region is in turn composed of four blocks A, B, C, D (Fig. 11), and the direction line element feature vector XS = (xv, xk, xp, xo)T of a whole sub-region is the weighted sum of the feature vectors of the blocks it contains:
XS = αA XA + αB XB + αC XC + αD XD,
where αA, αB, αC, αD are constants between 0 and 1 that express how strongly the feature vector of each block contributes to the overall feature vector of the sub-region. A 4-dimensional feature vector is thus obtained from every sub-region, and arranging the feature vectors of all sub-regions in order gives the original direction line element feature vector of the input character, whose dimension is four times the number of sub-regions.
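A sketch of the block division and pooling, assuming a 48×96 lattice with M0 = N0 = 16 as in the embodiment below. The layout of the blocks A, B, C, D inside a sub-region follows Fig. 11, which is not reproduced here, so four nested frames around the sub-region center are used as a stand-in.

```python
import numpy as np

def subregion_features(X, M0=16, N0=16, alphas=(0.4, 0.3, 0.2, 0.1)):
    """Pool per-pixel line-element vectors X (shape N x M x 4) into
    sub-region vectors X_S and concatenate them into one feature vector.

    Sub-regions are M0 x N0 windows with half-window overlap.  Blocks A..D
    are modeled here as four nested frames around the window center,
    weighted by alpha_A..alpha_D from the inside out (an assumed layout).
    """
    N, M, _ = X.shape
    feats = []
    for top in range(0, N - N0 + 1, N0 // 2):
        for left in range(0, M - M0 + 1, M0 // 2):
            sub = X[top:top + N0, left:left + M0, :]
            # nested sums: nested[0] = whole window, nested[3] = innermost block
            nested = [sub[s * N0 // 8: N0 - s * N0 // 8,
                          s * M0 // 8: M0 - s * M0 // 8, :].sum(axis=(0, 1))
                      for s in range(4)]
            XA = nested[3]                     # innermost block
            XB = nested[2] - nested[3]         # frame around A
            XC = nested[1] - nested[2]
            XD = nested[0] - nested[1]         # outermost frame
            XS = alphas[0] * XA + alphas[1] * XB + alphas[2] * XC + alphas[3] * XD
            feats.append(XS)
    return np.concatenate(feats)               # 4 x (number of sub-regions) dimensions
```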
3. Feature Transformation
A high feature dimension combined with a shortage of training samples causes serious problems both for estimating the classifier parameters and for the computational cost of recognition. A common rule of thumb in classifier design is that the number of training samples should be at least ten times the feature dimension. To reduce the difficulties that the excessive original feature dimension and the relative shortage of training samples cause for classifier design and parameter estimation, the invention uses the LDA method to compress the high-dimensional original features.
Let the number of character classes be c (c = 592 in Tibetan character recognition) and the number of training samples of class ω be Oω, ω = 1, 2, ..., c. Extracting the four-direction line element features from the training samples of each class by the method above gives a set of feature vectors for that class.
First the center μω of the feature vectors of each character class ω (1 ≤ ω ≤ c) and the center μ of the feature vectors of all character classes are computed. Then the between-class scatter matrix Sb and the average within-class scatter matrix Sw are computed.
A transformation matrix Φ is sought that maximizes tr[(ΦT Sw Φ)−1(ΦT Sb Φ)], so that the ratio of between-class scatter to within-class scatter is maximized and the separability of the pattern classes increases.
The leading eigenvectors of Sw−1 Sb, computed with a matrix computation tool and sorted by decreasing eigenvalue, form the columns of Φ; the corresponding LDA feature transformation is Y = ΦT X, where Y is the most discriminative d-dimensional feature.
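A sketch of the LDA transformation described above. The per-class weighting used for Sb and Sw is our assumption; only the scatter matrices themselves and the criterion tr[(ΦT Sw Φ)−1(ΦT Sb Φ)] are named in the text.

```python
import numpy as np

def lda_transform(samples, d):
    """Compute the LDA transformation matrix Phi from training features.

    `samples` maps each class label to an (O_w x D) array of feature vectors.
    """
    mus = {w: x.mean(axis=0) for w, x in samples.items()}
    total = np.vstack(list(samples.values()))
    mu = total.mean(axis=0)                       # center of all classes
    D = total.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    n = sum(len(x) for x in samples.values())
    for w, x in samples.items():
        diff = (mus[w] - mu)[:, None]
        Sb += len(x) / n * (diff @ diff.T)        # between-class scatter
        xc = x - mus[w]
        Sw += (xc.T @ xc) / n                     # sample-weighted within-class scatter
    # leading eigenvectors of Sw^-1 Sb give the columns of Phi
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)[:d]
    Phi = eigvecs.real[:, order]
    return Phi                                    # transformed feature: Y = Phi.T @ X
```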
4. Classifier Design
Classifier design is one of the core technologies of character recognition, and researchers have proposed many pattern classifiers for different problems. Under the constraints of several practical factors, however, minimum-distance classifiers are still usually chosen for large-character-set recognition. The invention uses a coarse-to-fine two-stage classification strategy based on confidence analysis (Fig. 13) to decide the class of the input Tibetan character to be recognized.
4.1 Coarse Classification
The goal of coarse classification is to select quickly, from a large character set, a relatively small subset of candidate characters while keeping the probability that the subset contains the correct class of the character to be recognized as high as possible. This requires the coarse classifier to have a simple structure and a fast computation. For this purpose, the invention designs a Euclidean distance with deviation (EDD) classifier.
Let Y = (y1, y2, ..., yd)T be the d-dimensional feature vector of the unknown input character and Yω = (yω1, yω2, ..., yωd)T the standard feature vector of character class ω. The Euclidean distance with deviation is defined componentwise from the differences between Y and Yω, where σωk is the standard deviation of the k-th component of the feature vectors of class ω, θω and γω are constants associated with class ω, and C is a constant independent of the character class. The most important property of this definition is that second-order statistics of the character features are introduced into the Euclidean distance, which gives the classifier some ability to describe the spatial distribution of the features.
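The exact expression of the EDD is not reproduced above, so the following sketch is only an illustrative reading of a distance built from |yk − yωk|, σωk, θω, γω and C; the particular combination used here (a variance-based tolerance subtracted before squaring, plus the constant C) is an assumption, not the invention's formula.

```python
import numpy as np

def edd_distance(Y, Yw, sigma_w, theta_w, gamma_w, C=20.0):
    """Illustrative Euclidean-distance-with-deviation between an input feature
    Y and the standard feature Yw of one class.

    sigma_w: per-dimension standard deviations of the class; theta_w, gamma_w:
    class-dependent constants; C: class-independent constant.  The combination
    below is an assumed stand-in for the patented formula.
    """
    diff = np.abs(Y - Yw)
    tol = np.minimum(theta_w * sigma_w, gamma_w)     # class-dependent deviation term
    return float(np.sum(np.maximum(diff - tol, 0.0) ** 2) + C)
```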
4.2 Fine Classification
The Bayesian classifier is theoretically the optimal statistical classifier, and in practice one tries to approximate it as closely as possible. When the character features are Gaussian and the prior probabilities of the feature distributions of all classes are equal, the Bayesian classifier reduces to a Mahalanobis distance classifier. These conditions are usually hard to satisfy in practice, however, and the performance of the Mahalanobis distance classifier deteriorates badly as estimation errors in the covariance matrices grow. The invention uses the MQDF (modified quadratic discriminant function), a variant of the Mahalanobis distance, as the fine classification measure.
In the MQDF discriminant function, λωl and φωl are the l-th eigenvalue and eigenvector of the covariance matrix Σω of class ω, K is the number of principal eigenvectors retained, that is, the dimension of the principal subspace of the pattern class, whose optimal value is determined experimentally, and h2 is an experimental estimate of the small eigenvalues. The MQDF produces a quadratic decision surface; because only the first K principal eigenvectors of each class covariance matrix need to be estimated, the negative influence of estimation errors in the small eigenvalues is avoided. The MQDF discriminant distance can be viewed as the weighted sum of the Mahalanobis distance in the K-dimensional principal subspace and the Euclidean distance in the remaining (d−K)-dimensional space, with weighting factor 1/h2.
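A sketch of the MQDF distance in its usual form, which matches the description above (Mahalanobis distance in the K-dimensional principal subspace plus the residual Euclidean distance weighted by 1/h2); the log-eigenvalue terms included here follow the standard MQDF and are our assumption about the exact variant used.

```python
import numpy as np

def mqdf_distance(Y, mu_w, eigvals_w, eigvecs_w, K, h2):
    """Modified quadratic discriminant function distance of Y to class w.

    eigvals_w / eigvecs_w are the leading eigenvalues lambda_{w,l} and
    eigenvectors phi_{w,l} of the class covariance matrix Sigma_w (only the
    first K are needed).  h2 estimates the small eigenvalues.
    """
    diff = Y - mu_w
    proj = eigvecs_w[:, :K].T @ diff               # coordinates in the principal subspace
    maha = np.sum(proj ** 2 / eigvals_w[:K])       # Mahalanobis part in K dimensions
    resid = np.dot(diff, diff) - np.sum(proj ** 2) # energy outside the subspace
    d = len(Y)
    return (maha + resid / h2
            + np.sum(np.log(eigvals_w[:K])) + (d - K) * np.log(h2))
```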
4.3 Confidence Computation
Let the candidate set output by the coarse classifier be CanSet = {(e1, D1), (e2, D2), ..., (eL, DL)}, where L is the capacity of the candidate set, ek and Dk are the candidate characters and the corresponding coarse classification distances, and D1 ≤ D2 ≤ ... ≤ DL. The role of the fine classifier is to reorder CanSet according to the recomputed discriminant distances and to find the most likely class of the input character. If the coarse classification result is sufficiently reliable, in other words if e1 is already the correct class of the input character, fine classification is unnecessary. The invention therefore performs a confidence analysis on CanSet to decide whether fine classification is needed, computing the confidence from the distances output by the EDD.
When the confidence is below a threshold ConfTH, CanSet is passed to the fine classifier for processing; otherwise CanSet is output directly. The invention is characterized in that it is a printed Tibetan character recognition technique capable of recognizing multiple fonts and multiple font sizes. It comprises the following steps in order:
The input single Tibetan character first undergoes appropriate position normalization and size normalization, so as to remove as far as possible the differences in shape and pose caused by different font sizes and fonts; four-direction line element features that reflect the structural characteristics of Tibetan characters well are then extracted; on this basis an LDA transformation extracts the most discriminative features to reduce the feature dimension, and the transformed features are passed to a coarse-to-fine two-stage classifier based on recognition confidence analysis, which decides the class of the character. In a system consisting of an image acquisition device and a computer, the method comprises the following steps in order:
1. Collection of character samples
Text printed with multi-font, multi-size Tibetan characters is scanned in; after necessary preprocessing with existing algorithms, such as noise removal and binarization, the Tibetan text is segmented to separate individual characters, and each character image is labeled with the internal code of the correct character, completing the collection of Tibetan single-character samples for training and testing and establishing the training sample database.
2. Normalization, comprising linear normalization of character position and size
2.1 Locating the baseline of a single Tibetan character
Let the original character image be [F(i,j)] of size W×H, where W is the image width, H the image height, and F(i,j) the value of the pixel in row i and column j, i = 1, 2, ..., H, j = 1, 2, ..., W.
The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed as V(i) = Σ_{j=1..W} F(i,j), and the position P_I of the baseline is then determined from this projection.
2.2 Splitting the input image into two sub-images at the baseline
[F(i,j)] can be viewed as the vertical concatenation of two sub-images [F1(i,j)] of size W×H1 and [F2(i,j)] of size W×H2, where [F1(i,j)] is the part above the baseline, i.e. the upper-vowel part, and [F2(i,j)] is the part below the baseline. The two do not overlap but are combined vertically to form [F(i,j)], and H1 + H2 = H.
Correspondingly, the normalized target character image [G(i,j)] of size M×N can be viewed as the vertical concatenation of two sub-images [G1(i,j)] of size M×N1 and [G2(i,j)] of size M×N2, where M is the width of the target image and N its height. [G1(i,j)] is the part above the baseline, i.e. the upper-vowel part, and [G2(i,j)] is the part below it. These two likewise do not overlap but are combined vertically into [G(i,j)], with N1 = N/4 and N2 = 3N/4.
2.3 Selection of the position normalization reference point Uk(uIk, uJk), k = 1, 2
Let Ak(aIk, aJk) and Bk(bIk, bJk), k = 1, 2, be the center of gravity of [Fk(i,j)] and the center of its bounding box, respectively. Uk(uIk, uJk), k = 1, 2, is taken as a point between Ak(aIk, aJk) and Bk(bIk, bJk), where β is a constant with 0 ≤ β ≤ 1.
The input image lattice is shifted so that this reference point lies at the geometric center of the target lattice [Gk(i,j)] of size M×Nk, k = 1, 2, which completes the position normalization of the input character.
2.4 Size normalization
Since the relationship between [Fk(i,j)] of size W×Hk and [Gk(i,j)] of size M×Nk, k = 1, 2, is Gk(i,j) = Fk(i/ri, j/rj), where ri and rj are the scale factors in the i and j directions (ri = Nk/Hk, rj = M/W), cubic B-spline interpolation is used to estimate the value of Fk at the generally non-integer source positions and to reduce distortions such as staircase edges in the normalized characters. For a given (i,j), the neighboring integer sample positions are obtained with the rounding function [·], the interpolated value is expressed as a weighted sum of the sample values with weights given by the cubic B-spline function RB(z), and W(z) denotes the unit step function.
3. Extraction of the four-direction line element features of the Tibetan character
3.1 Character contour extraction
The whole character lattice is scanned; for a black pixel at a given position, the distribution of pixels in its 8-neighborhood decides whether the pixel is kept. In this way the contour image [G′(i,j)] of size M×N is obtained from the normalized character image [G(i,j)].
3.2 Extraction of the direction line element features
First, every black pixel (i,j) in the character contour lattice [G′(i,j)] is assigned line elements among horizontal (0°), vertical (90°), left-falling (45°) and right-falling (135°) according to its positional relationship with two neighboring black pixels, and the result is recorded as a 4-dimensional vector X(i,j) = (xv, xk, xp, xo)T.
The whole M×N character contour image [G′(i,j)] is divided evenly into overlapping sub-regions, each composed of blocks A, B, C and D.
The direction line element feature vector XS = (xv, xk, xp, xo)T of a whole sub-region is the weighted sum of the feature vectors of the blocks in that sub-region:
XS = αA XA + αB XB + αC XC + αD XD. A 4-dimensional feature vector is thus obtained from every sub-region, and arranging the feature vectors of all sub-regions in order gives the original feature vector representing the input character.
4. Feature transformation
Let the number of character classes be c and the number of training samples of class ω be Oω, ω = 1, 2, ..., c; extracting the four-direction line element features from the training samples of each class by the method above gives a set of feature vectors for that class.
The original features are compressed with the LDA transformation as follows.
First the center μω of the feature vectors of each character class ω (1 ≤ ω ≤ c), the center μ of the feature vectors of all character classes, the between-class scatter matrix Sb and the average within-class scatter matrix Sw are computed.
A transformation matrix Φ is then sought that maximizes tr[(ΦT Sw Φ)−1(ΦT Sb Φ)]; the corresponding LDA feature transformation is Y = ΦT X, where Y is the most discriminative d-dimensional feature.
5. Decision on the class of the input character: for a character image of unknown class, features are extracted and compared with the data already stored in the recognition library to determine the correct character code.
5.1 Classifier design
For the feature vectors Y obtained by LDA compression, the mean vector of each character class and the other quantities required by the coarse and fine classifiers are computed from the feature set of each Tibetan character class ω (1 ≤ ω ≤ c) and stored in the library file.
5.2 Classification decision
For an input character image of unknown class, position normalization and size normalization are carried out first, the four-direction line element feature X is then extracted, and the LDA linear transformation matrix Φ converts the original direction line element feature X into Y = ΦT X = (y1, y2, ..., yd)T, where d is the dimension of the transformed feature.
The mean vectors of all character classes are read from the library file, and the EDD distance from Y to every class is computed.
All the computed distances, ω = 1, 2, ..., c, are sorted in ascending order, and the first L (1 ≤ L ≤ c) distances together with the character class codes ek, k = 1, 2, ..., L, that they represent form the coarse classification candidate set CanSet = {(e1, D1), (e2, D2), ..., (eL, DL)}, with D1 ≤ D2 ≤ ... ≤ DL.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed.
If Conf(CanSet) is higher than the threshold ConfTH, (e1, D1) is output directly as the recognition result for the input character, that is, the input character is taken to belong to the class corresponding to e1 and the recognition distance is D1. Otherwise, the MQDF discriminant distance from Y to the character class corresponding to each code in CanSet, ω = 1, 2, ..., L, is computed, and the class with the smallest MQDF distance is taken as the recognition result for the input character.
Experiments show that the invention achieves a recognition accuracy of 99.83% on the multi-font, multi-size printed Tibetan single-character test set, and a recognition rate above 99% on real text.
Brief Description of the Drawings
Fig. 1 Hardware configuration of a typical Tibetan character recognition system.
Fig. 2 Generation of Tibetan single-character samples.
Fig. 3 Structure of the Tibetan character recognition system.
Fig. 4 Image coordinate system used.
Fig. 5 Character normalization flow.
Fig. 6 Example of character normalization.
Fig. 7 Direction line element feature extraction flow.
Fig. 8 A normalized character and its contour.
Fig. 9 The four direction attributes (horizontal, vertical, left-falling, right-falling) of the four-direction line element features.
Fig. 10 Division of the image into sub-regions.
Fig. 11 The small blocks that make up a sub-region.
Fig. 12 Flow chart of the LDA feature transformation.
Fig. 13 Classification strategy.
Fig. 14 A multi-font, multi-size printed Tibetan character recognition system based on this algorithm.
Fig. 15 A multi-font printed Tibetan (mixed Chinese-English) document recognition system.
Detailed Description of the Embodiments
As shown in Fig. 1, a printed Tibetan character recognition system consists of two hardware parts: an image acquisition device and a computer. The image acquisition device is usually a scanner and is used to obtain digital images of Tibetan characters. The computer processes the digital images and performs the classification decision.
Fig. 2 shows the process of generating the Tibetan single-character samples used for training and testing. A printed Tibetan sample page is first scanned into the computer as a digital image. Preprocessing such as binarization and noise removal yields a binary image. The image is then segmented into text lines, each text line is segmented into individual Tibetan characters, and the character class of each character image is labeled. A check is then carried out, and errors produced during line and character segmentation and during class labeling are corrected manually. Finally, the original character images corresponding to each character class are extracted and saved, completing the collection of Tibetan single-character samples.
As shown in Fig. 3, the printed Tibetan character recognition algorithm is divided into two parts: a training system and a testing system. In the training system, every sample in the input Tibetan single-character training set is normalized appropriately, the four-direction line element features reflecting its composition are extracted, LDA transforms the features to reduce the original dimension, and a suitable classifier is then trained to obtain the feature library file. In the testing system, an input character image of unknown class is normalized and its features are extracted with the same methods as in the training system, the features are transformed with the transformation matrix obtained by the training system, and they are then passed to the classifier for classification to decide the class of the input character.
The realization of a practical multi-font, multi-size printed Tibetan character recognition system therefore involves the following aspects:
A) acquisition of Tibetan single-character samples;
B) implementation of the training system;
C) implementation of the testing system.
These three aspects are described in detail below.
A) Acquisition of Tibetan single-character samples
The process of obtaining printed Tibetan single-character samples is shown in Fig. 2. An input paper document printed in Tibetan is converted into a digital image by a scanner and read into the computer. The image then undergoes preprocessing such as noise removal and binarization. Removing noise with various filtering methods is extensively documented in the existing literature, and binarization can use existing global or locally adaptive methods. Layout analysis of the document then yields the character regions. Line segmentation and character segmentation of these regions with horizontal and vertical projection histograms produce individual characters; segmentation errors at this stage are corrected manually. The class of each obtained Tibetan character is then labeled, generally automatically by the computer, and errors are handled manually (corrected, deleted, and so on). Finally, the original character images of different fonts and sizes that correspond to characters with the same internal code are saved, giving the multi-font, multi-size printed Tibetan single-character samples.
B) Implementation of the training system
B.1 Character normalization
B.1.1 Position normalization
Let the original character image be [F(i,j)] of size W×H, where W is the width, H the height, and F(i,j) the value of the pixel in row i and column j, i = 1, 2, ..., H, j = 1, 2, ..., W. [F(i,j)] can be viewed as the vertical concatenation of two sub-images, the part above the baseline [F1(i,j)] of size W×H1 and the part below the baseline [F2(i,j)] of size W×H2, with H1 + H2 = H. The horizontal projection V(i), i = 1, 2, ..., H, of the character image is computed as V(i) = Σ_{j=1..W} F(i,j), and the ordinate P_I of the baseline is determined from it.
H1 can then be obtained from P_I and the ordinate of the top of the character; in the coordinate system adopted by the invention (Fig. 4), H1 is numerically equal to P_I.
Let the normalized character image be [G(i,j)] of size M×N, where M is the width, N the height, and G(i,j) the value of the pixel in row i and column j, i = 1, 2, ..., N, j = 1, 2, ..., M. Likewise, [G(i,j)] can be viewed as the vertical concatenation of two sub-images, the part above the baseline [G1(i,j)] of size M×N1 and the part below the baseline [G2(i,j)] of size M×N2, with N1 = N/4 and N2 = 3N/4. Normalization then maps the input lattices [F1(i,j)] and [F2(i,j)] onto the target lattices [G1(i,j)] and [G2(i,j)], respectively: a reference point Uk(uIk, uJk), k = 1, 2, is selected in each input lattice [Fk(i,j)], and the input lattice is shifted so that this point lies at the center of the target lattice [Gk(i,j)], completing the position normalization of the input character.
Let Ak(aIk, aJk) and Bk(bIk, bJk), k = 1, 2, denote the center of gravity of [Fk(i,j)] and the geometric center of its bounding box, respectively. Uk(uIk, uJk), k = 1, 2, is taken as a point between Ak(aIk, aJk) and Bk(bIk, bJk), where β is a constant with 0 ≤ β ≤ 1.
B.1.2 Size normalization
Comparing the input character image [Fk(i,j)] of size W×Hk with the normalized target character lattice [Gk(i,j)] of size M×Nk, k = 1, 2, the relationship is
Gk(i,j) = Fk(i/ri, j/rj), k = 1, 2,
where ri and rj are the scale factors in the i and j directions: ri = Nk/Hk, rj = M/W. According to this formula, the point (i,j) of the output lattice corresponds to the point (i/ri, j/rj) of the input character. Fk(i,j) is a discrete function, and i/ri and j/rj are in general not integers, so the value of Fk at (i/ri, j/rj) must be estimated from its values at the known discrete points. Cubic B-spline interpolation is used to reduce distortion of the normalized characters: for a given (i,j), the neighboring integer sample positions are obtained with the rounding function [·], the interpolated value is a weighted sum of the sample values with weights given by the cubic B-spline function RB(z), and W(z) denotes the unit step function.
B.2 Direction line element feature extraction
B.2.1 Extracting the character contour
The whole character lattice is scanned; for a black pixel at a given position, if both the number of black pixels and the number of white pixels in its 8-neighborhood are greater than 0, the pixel is kept, otherwise its value in the lattice is set to 0. In this way the contour image [G′(i,j)] of size M×N is obtained from the normalized character image [G(i,j)].
B.2.2 Block division and composition of the feature vector
For every black pixel in the character contour lattice [G′(i,j)], one of four line elements, horizontal (0°), vertical (90°), left-falling (45°) or right-falling (135°), is assigned according to its positional relationship with two neighboring black pixels. Two cases are distinguished: if the three black pixels lie on the same straight line, only one line element is assigned to the central pixel, with value 2; if they do not, two line elements are assigned to the central pixel, each with value 1. Assigning line elements to every black pixel in this way yields, for each black pixel (i,j), a 4-dimensional vector X(i,j) = (xv, xk, xp, xo)T whose components give the amounts of the four line elements at that pixel.
After this, the M×N lattice is divided evenly into sub-regions of width M0 and height N0; adjacent sub-regions overlap by M0/2 pixels horizontally and N0/2 pixels vertically, so the total number of sub-regions is (2M/M0 − 1) × (2N/N0 − 1). The direction line element feature vector XS = (xv, xk, xp, xo)T of a whole sub-region is the weighted sum of the feature vectors of the blocks it contains:
XS = αA XA + αB XB + αC XC + αD XD,
where αA, αB, αC, αD are constants between 0 and 1 that express how strongly the feature vector of each block contributes to the overall feature vector of the sub-region. A 4-dimensional feature vector is thus obtained from every sub-region, and arranging the feature vectors of all sub-regions in order gives the original direction line element feature vector of the input character.
B.3 Feature transformation
Let the number of character classes be c (c = 592 in Tibetan character recognition) and the number of training samples of class ω be Oω, ω = 1, 2, ..., c; the original direction line element feature vectors of each class form one set.
First the center μω of the feature vectors of each character class ω (1 ≤ ω ≤ c), the center μ of the feature vectors of all character classes, the between-class scatter matrix Sb and the average within-class scatter matrix Sw are computed.
A transformation matrix Φ is sought that maximizes tr[(ΦT Sw Φ)−1(ΦT Sb Φ)], so that the ratio of between-class scatter to within-class scatter is maximized and the separability of the pattern classes increases.
The leading eigenvectors of Sw−1 Sb, computed with a matrix computation tool and sorted by decreasing eigenvalue, form the columns of Φ, and the corresponding LDA feature transformation Y = ΦT X yields the most discriminative d-dimensional feature.
B.4 Classifier design
For the feature vectors Y obtained by the LDA transformation, the mean vector of each character class, together with the other statistics required by the coarse classifier (EDD) and the fine classifier (MQDF), is computed from the most discriminative feature set of each Tibetan character class ω (1 ≤ ω ≤ c) and stored in the feature library file.
C) Implementation of the testing system
For an input character image of unknown class, position normalization and size normalization are carried out first, the four-direction line element feature X is then extracted, and the LDA linear transformation matrix Φ converts the original feature X into Y = ΦT X = (y1, y2, ..., yd)T, where d is the dimension of the transformed feature.
The mean vectors of all character classes are read from the library file and the EDD distance from Y to every class is computed.
All the computed distances, ω = 1, 2, ..., c, are sorted in ascending order, and the first L (1 ≤ L ≤ c) distances together with the character class codes ek, k = 1, 2, ..., L, that they represent form the coarse classification candidate set CanSet = {(e1, D1), (e2, D2), ..., (eL, DL)}, with D1 ≤ D2 ≤ ... ≤ DL.
The recognition confidence Conf(CanSet) of the first candidate in CanSet is computed.
If Conf(CanSet) is higher than the threshold ConfTH, (e1, D1) is output directly as the recognition result for the input character, that is, the input character is taken to belong to the class corresponding to e1 and the recognition distance is D1. Otherwise, the MQDF discriminant distance from Y to the character class corresponding to each code in CanSet, ω = 1, 2, ..., L, is computed, and the class with the smallest MQDF distance is taken as the recognition result for the input character.
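A sketch of the coarse-to-fine decision described above, reusing the edd_distance and mqdf_distance sketches given earlier. The confidence measure of the invention is computed from the EDD distances of the candidates but its exact formula is not reproduced here, so the relative gap between the two best coarse distances serves as a stand-in; all function names and the parameter containers are ours.

```python
import numpy as np

def classify(Y, means, edd_params, mqdf_params, L=10, conf_th=0.9):
    """Coarse-to-fine decision for one transformed feature vector Y.

    means[w]: mean feature vector of class w; edd_params[w]: (sigma_w,
    theta_w, gamma_w); mqdf_params[w]: (mu_w, eigvals_w, eigvecs_w, K, h2).
    Returns the decided class code and its distance.
    """
    # coarse stage: EDD distance to every class, keep the L best candidates
    dists = {w: edd_distance(Y, means[w], *edd_params[w]) for w in means}
    canset = sorted(dists.items(), key=lambda kv: kv[1])[:L]
    (e1, D1), (e2, D2) = canset[0], canset[1]
    conf = (D2 - D1) / D2 if D2 > 0 else 1.0   # assumed confidence measure
    if conf >= conf_th:
        return e1, D1                          # coarse result is trusted directly
    # fine stage: rerank the candidate set with the MQDF distance
    fine = [(w, mqdf_distance(Y, *mqdf_params[w])) for w, _ in canset]
    return min(fine, key=lambda kv: kv[1])
```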
Example 1: multi-font, multi-size printed Tibetan character recognition system
A multi-font, multi-size printed Tibetan character recognition system based on the present invention is shown in Fig. 14a. Experiments were carried out on 1,200 collected sets of printed Tibetan documents, each containing all 592 modern Tibetan characters. Most of these sample documents were taken from the main current Tibetan publishing systems (Fangzheng and Huaguang), and a small number were printed directly from TrueType fonts. The fonts include not only the most common white, black and general styles but also round, elongated and bamboo styles, and the font sizes range from size 6 to the largest (chuhao). Sample quality varies, with normal, broken and touching characters in a ratio of roughly 2:1:1. After scanning, line and character segmentation and internal-code labeling, the 1,200 document sets were converted into 1,200 sets of single-character samples (that is, 1,200 samples per character class); 900 sets were drawn at random to form the training set and the remaining 300 sets were kept as test samples.
In the experiments, each Tibetan character was normalized to a 48×96 lattice with the method of the invention, with normalization parameter β = 0.5. For the four-direction line element features, the sub-regions were divided as shown in Fig. 10 with M0 = N0 = 16, and the weights of the block feature vectors within a sub-region were αA = 0.4, αB = 0.3, αC = 0.2, αD = 0.1. After the direction line element features were extracted following the flow of Fig. 7, LDA was used for feature compression and the transformed feature dimension d was set to 128 (Fig. 14c). The parameters of the coarse EDD classifier were θ1 = θ2 = ... = θ592 = 0.8, γ1 = γ2 = ... = γ592 = 2.2 and C = 20; the threshold used in the coarse classification confidence analysis was ConfTH = 0.9; the parameter of the fine MQDF classifier was K = 32 (Fig. 14b), and h2 was estimated by the mean of the K-th eigenvalues of the covariance matrices of the character classes. The experimental results on the test set are shown in Table 1.
Table 1 Recognition rates of the system on the test sample sets of the six Tibetan fonts
As Table 1 shows, the average recognition accuracy for multi-font, multi-size Tibetan characters reaches 99.83%, which demonstrates the effectiveness of the method proposed in the invention.
实施例2:多字体印刷藏文(混排汉英)文档识别系统Embodiment 2: multi-font printing Tibetan (mixed Chinese-English) document recognition system
多字体印刷藏文(混排汉英)文档识别系统的研究是为适应藏族地区办公自动化和促进中文多文种信息处理技术发展的需求而展开的,它的系统框图如图15所示。主要包括图像输入和预处理子系统、行字切分子系统、字符识别子系统和后处理子系统。本发明是字符识别子系统的主要组成部分,在汉字和英文识别核心的配合下对藏文占主体、夹杂一定汉字和英文、数字、符号的多字体印刷文档进行自动识别,将文档图像转换为计算机可“阅读”的文本。The research on multi-font printed Tibetan (mixed Chinese-English) document recognition system is carried out to meet the needs of office automation in Tibetan areas and to promote the development of Chinese multilingual information processing technology. Its system block diagram is shown in Figure 15. It mainly includes image input and preprocessing subsystem, line word cutting subsystem, character recognition subsystem and postprocessing subsystem. The present invention is the main component of the character recognition subsystem. With the cooperation of the Chinese character and English recognition cores, it automatically recognizes multi-font printed documents that are dominated by Tibetan and mixed with certain Chinese characters, English, numbers and symbols, and converts the document image into Text that a computer can "read".
The Tibetan character recognition part of this system uses the method proposed by the present invention, with the same parameters as in Embodiment 1, and the character feature library of Embodiment 1 was transplanted directly. The system passed the expert appraisal organized by the Ministry of Education in November 2003. For the appraisal test, 62 pages containing 95,583 characters were randomly selected from more than 500 pages (over 520,000 characters) of actual printed Tibetan documents provided by Northwest University for Nationalities, collected from books, newspapers, magazines and other publications. The results are as follows:
Table 2. Test performance of the multi-font printed Tibetan (mixed Chinese-English) document recognition system
Note: ACE is the error rate of errors identifiable as recognition errors, ASE is the error rate of errors identifiable as segmentation errors, and UTE is the error rate of errors whose type cannot be determined. The results show that the multi-font, multi-size printed Tibetan character recognition proposed by the present invention fully meets the needs of practical application, achieves good recognition performance, and has broad application prospects.
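For reference, if the three error categories reported in Table 2 are treated as disjoint and as covering all errors (an assumption made here only for illustration, since the table rows are not reproduced above), the overall character accuracy over the 95,583 test characters would be: accuracy = 1 - (ACE + ASE + UTE).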
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200410034107 CN1251130C (en) | 2004-04-23 | 2004-04-23 | Method for identifying multi-font multi-character size print form Tibetan character |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1570958A (en) | 2005-01-26 |
CN1251130C CN1251130C (en) | 2006-04-12 |
Family
ID=34481469
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200410034107 Expired - Fee Related CN1251130C (en) | 2004-04-23 | 2004-04-23 | Method for identifying multi-font multi-character size print form Tibetan character |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1251130C (en) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101366017B (en) * | 2005-12-12 | 2010-06-16 | Microsoft Corp | Method and system for character recognition based on logical structure and layout |
CN100440250C (en) * | 2007-03-09 | 2008-12-03 | Tsinghua University | Printed Mongolian Character Recognition Method |
WO2009114967A1 (en) * | 2008-03-19 | 2009-09-24 | Dongguan BBK Educational Electronic Products Co Ltd | Motion scan-based image processing method and device |
CN101510259B (en) * | 2009-03-18 | 2011-04-06 | Northwest University for Nationalities | On-line identification method for 'ding' of handwriting Tibet character |
CN102184383A (en) * | 2011-04-18 | 2011-09-14 | Harbin Institute of Technology | Automatic generation method of image sample of printed character |
CN102184383B (en) * | 2011-04-18 | 2013-04-10 | Harbin Institute of Technology | Automatic generation method of image sample of printed character |
CN103999097B (en) * | 2011-07-11 | 2017-04-12 | Huawei Technologies Co Ltd | System and method for compact descriptor for visual search |
CN103999097A (en) * | 2011-07-11 | 2014-08-20 | Huawei Technologies Co Ltd | System and method for compact descriptor for visual search |
CN102360436A (en) * | 2011-10-24 | 2012-02-22 | Institute of Software, Chinese Academy of Sciences | Identification method for on-line handwritten Tibetan characters based on components |
CN102360436B (en) * | 2011-10-24 | 2012-11-07 | Institute of Software, Chinese Academy of Sciences | Identification method for on-line handwritten Tibetan characters based on components |
CN104809442B (en) * | 2015-05-04 | 2017-11-17 | Beijing Information Science and Technology University | A kind of Dongba pictograph grapheme intelligent identification Method |
CN104809442A (en) * | 2015-05-04 | 2015-07-29 | Beijing Information Science and Technology University | Intelligent recognition method for graphemes of Dongba pictographs |
CN107025452A (en) * | 2016-01-29 | 2017-08-08 | Fujitsu Ltd | Image-recognizing method and image recognition apparatus |
CN106355200A (en) * | 2016-08-29 | 2017-01-25 | Dalian Minzu University | Manchu handwritten recognition device |
CN106408002A (en) * | 2016-08-29 | 2017-02-15 | Dalian Minzu University | Hand-written manchu alphabet identification system |
CN106127266A (en) * | 2016-08-29 | 2016-11-16 | Dalian Minzu University | Hand-written Manchu alphabet recognition methods |
CN108932454A (en) * | 2017-05-23 | 2018-12-04 | Hangzhou Hikvision System Technology Co Ltd | A kind of character recognition method based on picture, device and electronic equipment |
CN107730511A (en) * | 2017-09-20 | 2018-02-23 | Beijing University of Technology | A kind of Tibetan language historical document line of text cutting method based on baseline estimations |
CN107730511B (en) * | 2017-09-20 | 2020-10-27 | Beijing University of Technology | A Text Line Segmentation Method of Tibetan Historical Documents Based on Baseline Estimation |
CN110858317A (en) * | 2018-08-24 | 2020-03-03 | Beijing Sogou Technology Development Co Ltd | Handwriting recognition method and device |
CN111553336A (en) * | 2020-04-27 | 2020-08-18 | Xidian University | A system and method for image recognition of printed Uyghur documents based on conjoined segments |
CN111553336B (en) * | 2020-04-27 | 2023-03-24 | Xidian University | Print Uyghur document image recognition system and method based on link segment |
CN111583217A (en) * | 2020-04-30 | 2020-08-25 | SonoScape Medical Corp | Tumor ablation curative effect prediction method, device, equipment and computer medium |
Also Published As
Publication number | Publication date |
---|---|
CN1251130C (en) | 2006-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1251130C (en) | Method for identifying multi-font multi-character size print form Tibetan character | |
CN1158627C (en) | Method and device for character recognition | |
CN1156791C (en) | Pattern recognizing apparatus and method | |
CN1187952C (en) | Apparatus and method for correcting distortion of input image | |
CN1818927A (en) | Fingerprint identification method and system | |
CN1177407A (en) | Method and system for velocity-based head writing recognition | |
CN100336070C (en) | Method of robust human face detection in complicated background image | |
CN1794266A (en) | Biocharacteristics fusioned identity distinguishing and identification method | |
CN1151465C (en) | Model identification equipment using condidate table making classifying and method thereof | |
CN1924897A (en) | Image processing apparatus and method and program | |
CN1200387C (en) | Statistic handwriting identification and verification method based on separate character | |
CN1310825A (en) | Methods and apparatus for classifying text and for building a text classifier | |
CN1573742A (en) | Image retrieving system, image classifying system, image retrieving program, image classifying program, image retrieving method and image classifying method | |
CN1459761A (en) | Character identification technique based on Gabor filter set | |
CN1599913A (en) | Iris identification system and method, and storage media having program thereof | |
CN101055620A (en) | Shape comparison device and method | |
CN1215201A (en) | Character Recognition/Correction Method | |
CN1041773C (en) | Character recognition method and apparatus based on 0-1 pattern representation of histogram of character image | |
CN1552041A (en) | Face meta-data creation and face similarity calculation | |
CN1338703A (en) | Device for extracting drawing line from multiple value image | |
CN1664846A (en) | On-line Handwritten Chinese Character Recognition Method Based on Statistical Structural Features | |
CN1251128C (en) | Pattern ranked matching device and method | |
CN1973757A (en) | Computerized disease sign analysis system based on tongue picture characteristics | |
CN1310182C (en) | Method, device and storage medium for enhancing document, image and character recognition | |
CN1266643C (en) | Printed font character identification method based on Arabic character set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20060412; Termination date: 20140423 |