CN108564078B

CN108564078B - A method for extracting the central axis of a Manchu word image

Info

Publication number: CN108564078B
Application number: CN201810371803.6A
Authority: CN
Inventors: 郑蕊蕊; 李敏; 贺建军; 许爽; 吴宝春; 卢海涛
Original assignee: Dalian Minzu University
Current assignee: Dalian Minzu University
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2020-11-13
Anticipated expiration: 2038-04-24
Also published as: CN108564078A

Abstract

A method for extracting the central axis of a Manchu word image, belonging to the field of text segmentation and intended to improve the accuracy of Manchu segmentation. Extraction of the central axis of a Manchu word image directly affects segmentation accuracy, so improving segmentation accuracy requires improving the precision of central-axis extraction. By locating the central axis and detecting its width, the central axis can be extracted accurately.

Description

A method for extracting the central axis of a Manchu word image

Technical field

The invention belongs to the field of text segmentation and relates to a method for extracting the central axis of a Manchu word image.

Background

Manchu is the language and script used by the Manchu, Xibe and other ethnic minorities in China. It was promoted and used as an official script during the Qing Dynasty, which produced a large body of valuable Manchu documents. Because the Manchu language is now on the verge of disappearing, the urgent need to rescue and protect the Manchu linguistic and cultural heritage has been recognized and valued by the state and by all sectors of society. Research on optical character recognition for Manchu is therefore particularly important for protecting and passing on the cultural heritage of the Qing Dynasty. Manchu is a phonemic script with 38 letters in total: 6 vowels, 22 consonants, and 10 special letters used only for spelling Chinese loanwords. Manchu is written with characters running from top to bottom and columns running from left to right. Manchu recognition usually requires first segmenting the text into basic units (such as letters) and then recognizing them, so improving the accuracy of Manchu recognition can start with improving segmentation accuracy.

Summary of the invention

To improve the accuracy of Manchu segmentation, the present invention proposes the following technical solution:

A method for extracting the central axis of a Manchu word image, comprising the following steps:

S1. Locate the central axis of the Manchu word image;

S2. Detect the width of the central axis of the Manchu word image.

As a supplement to the technical solution, step S1 specifically comprises:

S1.1. Invert the Manchu word image so that text pixels take the value 1 and background pixels take the value 0;

S1.2. Thin the Manchu word image morphologically using the morphological thinning function of the MATLAB Image Processing Toolbox;

S1.3. Apply the Hough transform to the thinned Manchu word image to determine the column coordinate of the thinned central axis; this column coordinate is taken as the position of the central axis of the Manchu word image. The Hough transform is restricted to lines at angle θ = 90, so that only vertical lines are sought; line segments at the same vertical position whose gaps are smaller than the height of the Manchu word image and whose own length exceeds 1 pixel are connected into a single line, which gives the center position of the central axis.

As a supplement to the technical solution, step S2 specifically comprises:

S2.1. Determine the search area for the maximum run-length ratio method;

S2.2. Apply the maximum run-length ratio method to the Manchu word image within the search area to determine the width of the central axis;

S2.3. Compute the left and right boundaries of the central axis from its center position and its width.

As a supplement to the technical solution, step S2.1 is specifically:

The search area of the maximum run-length ratio method is determined by the range specified by the following formula:

[Formula image missing from the source text]

where sl is the left boundary of the restricted search range, sr is the right boundary of the restricted search range, baseline is the center position of the central axis, round denotes rounding to the nearest integer, and W is the width of the Manchu word image.

As a supplement to the technical solution, the maximum run-length ratio method of step S2.2 proceeds as follows: scan every row of the word image within the search area, and record the run length of each run of consecutive black pixels together with the number of times each length occurs; the run length with the largest number of occurrences is the width of the central axis of the Manchu word image.

As a supplement to the technical solution, the left and right boundaries of the central axis in step S2.3 are computed by the following formula:

[Formula image missing from the source text]

where bl is the left boundary of the central axis, br is the right boundary of the central axis, baseline is the center position of the central axis of the Manchu word image, baseline_width is the width of the central axis, and round denotes rounding to the nearest integer.

Beneficial effects: extraction of the central axis of a Manchu word image directly affects segmentation accuracy, so improving segmentation accuracy requires improving the precision of central-axis extraction. By locating the central axis and detecting its width, the central axis can be extracted accurately.

Description of the drawings

Figure 1 is a flowchart of the construction of the Manchu component set;

Figure 2 is a flowchart of Manchu component segmentation;

Figure 3 shows examples of central-axis extraction errors produced by traditional methods on Manchu word images;

Figure 4 illustrates determining the width of the Manchu central axis with the region-limited maximum run-length ratio method, in which: (1) an error example of the maximum run-length ratio method, (2) the search range as limited by the present invention, (3) the result of the method of the present invention;

Figure 5 shows central-axis extraction results of the method of the present invention;

Figure 6 is a flowchart of Manchu component cutting;

Figure 7 shows Manchu component segmentation results, in which: (1) weak segmentation, (2) a weakly segmented region after fine segmentation, (3) over-segmentation, (4) an over-segmented region after merging, (5) selected segmentation results.

Detailed description

From the perspective of optical character recognition, Manchu has the following characteristics:

(1) Depending on its position in a word, the same Manchu letter generally has four different forms: isolated, initial, medial and final. Counting these different forms, Manchu has 114 letter shapes in total.

(2) Words in the same column of a Manchu document lie near the same central axis, and words in adjacent columns of printed Manchu essentially never overlap, which helps column extraction. Manchu words within a column are separated by a certain gap, which helps word extraction.

(3) A Manchu word consists of one or more Manchu letters joined along a vertical central axis, with no gap between the letters of a word. However, the junctions between letters lie on the central axis of the Manchu word image, so the pixel characteristics at the central axis can be used to segment Manchu letters.

(4) Some Manchu letter shapes stand for several letters. For example, the character [glyph] is the medial form of the letters a, e and n at the same time; during recognition these can be distinguished by the spelling rules of the adjacent letters.

(5) Some Manchu letters share the same constituent parts. For example, the character [glyph] (initial form of the letter o) can be seen as a combination of the character [glyph] (initial form of the letter e) and the character [glyph] (medial form of the letter o). Taking the Manchu letter as the basic segmentation unit is therefore prone to over-segmentation and weak segmentation.

(6) Some letter combinations are not separable. For example, it is very difficult to split [glyph] (bo) into [glyph] (letter b) and [glyph] (letter o).

Based on these characteristics, this embodiment proposes deconstructing Manchu words into components, using Manchu components (hereinafter "components") as the basic unit of segmentation and recognition. This addresses the over-segmentation and weak-segmentation problems that arise when the Manchu letter is the basic segmentation unit. The Manchu component set draws on three sources: Manchu letters, parts of letters or letter combinations, and letter combinations. The purpose of the component set is to reduce recognition errors caused by segmentation. If letters are the basic segmentation unit then, as analyzed above, over-segmentation and weak segmentation occur easily, and a classifier trained to recognize letters will inevitably misrecognize, or fail to recognize, the over-segmented and weakly segmented parts. The component set proposed here is instead designed around the output of the segmentation method: common over-segmentations (parts of letters or letter combinations) and weak segmentations (letter combinations) are no longer treated as errors but as correct segmentations, so a classifier designed afterwards can recognize these components, which reduces recognition errors caused by segmentation errors.

Components can be understood by analogy with English word recognition. Taking the English word study as an example, the whole word study can be recognized directly; alternatively the word can be split into the letters s, t, u, d, y, each letter recognized, and the letters recombined into study. If splitting into letters is hard but splitting into components is relatively easy, for example into st, u, dy (each of which is a component), then the components are recognized and combined into the word. Because of the characteristics listed above, however, segmenting Manchu into components is not as easy as in this English example.

As shown in Figure 1, the Manchu component set is constructed as follows. With reference to the Manchu alphabet, the national standard of the People's Republic of China "Information technology - Universal multiple-octet coded character set - Xibe and Manchu glyphs", and the Mongolian component sets in references [1-2], an initial Manchu component set of 99 components (hereinafter the "initial set") is proposed, and Flag = 0 is set for every component. Manchu word images are segmented with the Manchu segmentation method and the results are counted and analyzed: if a segmented component does not belong to the initial set, it is added to the initial set with Flag = 1; if it does belong to the initial set, the Flag of the corresponding component is set to 1. The initial set is then checked for components with Flag = 0, that is, components that never appeared in the segmentation results, and any such component is deleted from the initial set. The Manchu component set is then organized and output. It contains 106 components in total; see attached Table 1. References [1-2] cited above are:

[1] Hongxi Wei, Guanglai Gao. A keyword retrieval system for historical Mongolian document images[J]. International Journal on Document Analysis and Recognition, 2014, 17(1): 33-45.

[2] Liangrui Peng, Changsong Liu, Xiaoqing Ding, Jianming Jin, Youshou Wu, Hua Wang, Yanhua Bao. Multi-font printed Mongolian document recognition system[J]. International Journal on Document Analysis and Recognition, 2010, 13(2): 93-106.
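The flag-based construction of the component set described above can be sketched in a few lines. This is only an illustration under stated assumptions: components are represented by hashable labels, and the function name build_component_set and its arguments are hypothetical rather than taken from the patent.

```python
def build_component_set(initial_set, segmented_components):
    """Sketch of the Flag procedure: keep only components that were observed.

    initial_set          -- labels of the hand-built initial components (assumed)
    segmented_components -- labels produced by running segmentation (assumed)
    """
    # Every initial component starts with Flag = 0.
    flags = {component: 0 for component in initial_set}
    for component in segmented_components:
        # A new component is added with Flag = 1; a known one is marked Flag = 1.
        flags[component] = 1
    # Components whose Flag is still 0 never appeared and are dropped.
    return [component for component, flag in flags.items() if flag == 1]
```

For example, with an initial set ["a", "b", "c"] and segmentation output ["b", "d", "b"], the result keeps "b", adds "d", and discards "a" and "c".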

As shown in Figure 2, Manchu components are segmented in the following steps:

S1. The paper Manchu document is converted by an optical scanning device into a digital image that a computer can store and process, and the digital image of the Manchu document is preprocessed (smoothing, binarization);

S2. Layout analysis (skew correction, column segmentation and word segmentation);

S3. Extraction of Manchu word images;

S4. Position normalization;

S5. Central-axis extraction;

S6. Segmentation of Manchu components according to the relationship between the components and the position of the central axis.

Skew correction uses the Hough transform to determine the skew angle of the page, after which the image is rotated back to an upright vertical-text state. The skew-corrected Manchu document is split into columns by vertical projection; words are split by horizontal projection and the Manchu words in each column image are extracted; each Manchu word image is then position-normalized. These steps complete the preprocessing of the Manchu word image, whose height is denoted H and width W. Position normalization means cropping the superfluous white background margins from the Manchu word image. In the flow shown in Figure 2 the image is inverted for programming convenience, so what is shown being removed is the black margin around the Manchu word. The white-on-black images in Figure 2 are these inverted images: the original is black text on a white background, but removing the margins is easier to program on the inverted image, so the figure directly shows the image after inversion and margin removal.
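Position normalization as described here amounts to cropping the image to the bounding box of its text pixels. The following is a minimal sketch, assuming the inverted binary image is stored as a list of rows with text pixels equal to 1 and background equal to 0; the function name crop_to_text is hypothetical.

```python
def crop_to_text(img):
    """Crop an inverted binary image (text = 1, background = 0) to its text."""
    # Rows and columns that contain at least one text pixel.
    rows = [i for i, row in enumerate(img) if any(row)]
    if not rows:
        return []  # no text pixels at all
    cols = [j for j in range(len(img[0])) if any(row[j] for row in img)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]
```

After cropping, H is the number of rows and W the number of columns of the returned image.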

In this embodiment, extraction of the central axis of the Manchu word image directly affects segmentation accuracy; the specific scheme is described in detail below.

For central-axis extraction, that is, step S5, the prior art generally uses the vertical projection method or the maximum cumulative vertical projection method. Both, however, suffer from offsets in the located central axis and from wrong estimates of its width, as shown in Figure 3. This embodiment provides a method for extracting the central axis of a Manchu word image, comprising the following steps:

S5.1. Locating the central axis of the Manchu word image:

First, the Manchu word image is inverted so that text pixels take the value 1 and background pixels take the value 0. The image is then thinned with the morphological thinning function of the MATLAB Image Processing Toolbox, using 3×3 structuring-element templates. Each template contains 9 pixels and each pixel can only be 0 or 1, so a template can take 512 different forms; the templates are grouped into 8 directions to thin the Manchu word image. The Hough transform is then applied to the thinned image to determine the column coordinate of the thinned central axis, which is the position of the central axis of the Manchu word image. During this extraction the Hough transform is restricted to lines at angle θ = 90, so that only vertical lines are sought; line segments at the same vertical position whose gaps are smaller than the word image height H and whose own length exceeds 1 pixel are connected into a single line. This yields the center position of the central axis, denoted baseline. The central axis of a Manchu word image means the column coordinate, within the image, of the central axis of the Manchu word; it is not the center line of the image.
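Because the search is restricted to vertical lines, the localization step can be approximated by voting per column on the thinned image. The sketch below is a simplified stand-in for that idea, not the MATLAB Hough routines the text refers to: it assumes the input is already thinned (text = 1), counts only vertical runs longer than 1 pixel, and treats all such runs in one column as a single line, since any gap inside the image is smaller than H. The name locate_axis is hypothetical and the returned column is 0-based.

```python
def locate_axis(thin):
    """Return the 0-based column with the most pixels in vertical runs > 1 px."""
    height, width = len(thin), len(thin[0])
    best_col, best_votes = 0, -1
    for x in range(width):
        votes, run = 0, 0
        # Iterate one step past the last row so the final run is flushed.
        for y in range(height + 1):
            if y < height and thin[y][x]:
                run += 1
            else:
                if run > 1:  # ignore isolated single pixels
                    votes += run
                run = 0
        if votes > best_votes:
            best_col, best_votes = x, votes
    return best_col
```

The column returned plays the role of baseline; ties are resolved in favor of the leftmost column, which is an arbitrary choice of this sketch.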

S5.2. Detecting the width of the central axis of the Manchu word image

S5.2.1. The width is found with the maximum run-length ratio method. Every row of the Manchu word image is scanned, and the run length of each run of consecutive black pixels is recorded together with the number of times each length occurs. After all rows have been scanned, the run length with the largest number of occurrences is the width of the central axis of the Manchu word image, denoted w₀. The maximum run-length ratio method is effective for detecting the width of the central axis, but errors such as the one shown in Figure 4(1) still occur. The cause is that the method collects run-length statistics over the whole Manchu word image, and the glyph deformations of different Manchu fonts severely disturb these global statistics. Statistics of Manchu writing show that the width of the central axis generally does not exceed 1/2 of the word width W. The search area of the maximum run-length ratio method is therefore restricted to the range specified by formula (1); the result is called the region-limited maximum run-length ratio method.

[Formula (1): image missing from the source text]

In formula (1), sl is the left boundary of the restricted search range, sr is the right boundary of the restricted search range, baseline is the center position of the central axis, and round denotes rounding to the nearest integer. Restricting the search area weakens the statistical influence of detached strokes and branch strokes on the central-axis width. The maximum run-length ratio method is then applied to the Manchu word image within the restricted search range to detect the width of the central axis; the result is shown in Figure 4(3).
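The region-limited run-length count can be sketched as follows. Since formula (1) is not reproduced in this text, the sketch takes sl and sr as inputs instead of computing them. It assumes an inverted binary image (text = 1) stored as a list of rows, 0-based inclusive column bounds, and a hypothetical function name axis_width; ties between equally frequent run lengths are resolved by whichever was seen first, which the text does not specify.

```python
from collections import Counter


def axis_width(img, sl, sr):
    """Most frequent horizontal run length of text pixels in columns sl..sr."""
    counts = Counter()
    for row in img:
        run = 0
        # A trailing 0 acts as a sentinel that closes a run touching column sr.
        for value in list(row[sl:sr + 1]) + [0]:
            if value:
                run += 1
            elif run:
                counts[run] += 1
                run = 0
    if not counts:
        return 0  # no text pixels inside the search area
    return counts.most_common(1)[0][0]
```

Calling the function with sl = 0 and sr = W - 1 gives the unrestricted method, which makes it easy to compare both variants on the same image.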

S5.2.2. From the width baseline_width of the central axis and its center position baseline, compute the left boundary bl and the right boundary br of the central axis according to formula (2).

[Formula (2): image missing from the source text]

A total of 400 Manchu images in different fonts and font sizes were selected at random, and the central axis was extracted with the region-limited maximum run-length ratio method of this embodiment and with the vertical projection method; the results are given in Table 1. Some examples in which the method of the present invention extracts the central axis correctly are shown in Figure 5. The experimental results show that morphological thinning combined with the Hough transform locates the central axis of a Manchu word image accurately, and that the region-limited maximum run-length ratio method determines its width correctly.

Table 1. Central-axis extraction results for Manchu word images

                     Method of the invention    Vertical projection method
Correct samples      397                        210
Incorrect samples    3                          190
Accuracy             99.25%                     52.50%

In this embodiment, the precision of Manchu character segmentation is the bottleneck for improving Manchu recognition accuracy; the specific scheme is described in detail below.

Manchu component segmentation, that is, step S6, is shown in Figure 6 and comprises:

S6.1. Coarse segmentation of Manchu components;

S6.2. Weak-segmentation decision and fine segmentation of candidate regions;

S6.3. Over-segmentation decision and merging of candidate regions.

These steps are described in detail below:

S6.1. Coarse segmentation of Manchu components

Because Manchu components are joined along the central axis, the Manchu word is first divided into left, middle and right parts about the central axis. The left part spans columns 1 to bl-1 of the Manchu word, and the right part spans columns br+1 to W. The left and right parts are each projected horizontally, giving pl and pr. The segmentation cost function of row i is defined as:

Cost(i) = pl(i) + pr(i), i = 1, 2, …, H    (3)

Ideally the cost of a cut row is 0, meaning that neither the left nor the right part has any stroke in that row apart from the central axis. In practice, however, preprocessing steps such as scanning, skew correction and binarization introduce noise, and too strict a constraint on cut rows leads to severe weak segmentation. Let T1 be the coarse-segmentation threshold for Manchu components; its value was determined through extensive experiments [value given as a formula image that is missing from the source text]. Only rows satisfying the condition

Cost(i) ≤ T1    (4)

are candidate cut rows, and the sequence of all candidate cut rows satisfying condition (4) is denoted Can_seg. The experiments for determining T1 took different multiples of baseline_width as T1, all of them fractions ≤ 1, ran the Manchu component segmentation method with each, and compared the segmented images; the T1 that gave the better segmentation of the Manchu word images was finally chosen as the value above.
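Cost function (3) and condition (4) can be sketched directly. The value of T1 is not reproduced in this text, so the sketch takes it as a parameter t1. It assumes an inverted binary image (text = 1) stored as a list of rows, with bl and br given as the 0-based inclusive columns of the central axis; the function name candidate_rows is hypothetical, and the returned row indices are 0-based.

```python
def candidate_rows(img, bl, br, t1):
    """Rows whose stroke pixels outside the axis columns bl..br total <= t1."""
    candidates = []
    for i, row in enumerate(img):
        pl = sum(row[:bl])       # horizontal projection of the left part
        pr = sum(row[br + 1:])   # horizontal projection of the right part
        if pl + pr <= t1:        # condition (4): Cost(i) <= T1
            candidates.append(i)
    return candidates
```

With t1 = 0 only rows that are completely clean on both sides of the axis qualify; raising t1 tolerates a few noise pixels.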

The set of candidate cut rows obtained from coarse segmentation can exhibit three situations:

1) The first row of the image is taken as a candidate cut row. This is clearly unreasonable, so it should be deleted from the candidate set;

2) A run of consecutive adjacent rows starting at row 1, or a run of consecutive adjacent rows ending at the last row (row H), is an unreasonable candidate sub-segment, so these sub-segments should be deleted from the candidate set;

3) For any other sub-segment of consecutive adjacent rows, only the one candidate cut row in the middle is needed and the rest are not; the middle candidate row should therefore replace the whole sub-segment.

Can_seg thus often contains redundant candidate cut rows, which are removed with the following strategy:

(1) If Can_seg contains only one candidate cut row and it is row 1, delete it; otherwise go to step (2);

(2) Find each sub-segment conti_subseg of consecutive candidate cut rows. If the sub-segment starts at row 1 or ends at row H, delete all rows of the sub-segment; otherwise go to step (3);

(3) Within conti_subseg, with rows sorted in ascending order, replace all rows of the sub-segment with their median (for an even number of candidate rows, take the mean of the two middle values and round up);

(4) Output the new cut-row sequence Can_seg_new, from which the redundant candidate cut rows have been removed.
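The pruning strategy above can be sketched as a single pass over runs of consecutive rows. The sketch assumes 0-based row indices, so the first row is 0 and the last is height - 1; an isolated candidate is treated as a run of length one, which also covers step (1). The function name prune_candidates is hypothetical.

```python
def prune_candidates(cands, height):
    """Collapse runs of consecutive candidate rows to their (rounded-up) median."""
    cands = sorted(cands)
    kept = []
    i = 0
    while i < len(cands):
        # Extend j to the end of the run of consecutive rows starting at i.
        j = i
        while j + 1 < len(cands) and cands[j + 1] == cands[j] + 1:
            j += 1
        start, end = cands[i], cands[j]
        # Runs touching the first or last image row are discarded entirely.
        if start != 0 and end != height - 1:
            # Median of consecutive integers, rounded up for even-length runs.
            kept.append((start + end + 1) // 2)
        i = j + 1
    return kept
```

For a run of rows 5 to 8 the two middle values are 6 and 7, so the run collapses to row 7.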

S6.2. Weak-segmentation decision and fine segmentation of candidate regions

Components produced by coarse segmentation may be weakly segmented. Statistics show that the height of a Manchu component generally does not exceed 5 times baseline_width, so the weak-segmentation threshold is set to T_less = 5. The height hl of each segmented region in Can_seg_new is computed, and any region with hl > (T_less × baseline_width) is judged to be weakly segmented. Each weakly segmented region is segmented a second time using the coarse segmentation method above with the fine-segmentation threshold T2, and the result is stored in the sequence Seg1. T2 further relaxes the constraint on candidate cut rows relative to coarse segmentation; its value was determined through extensive experiments [value given as a formula image that is missing from the source text].

The experiments for determining T2 took different multiples of baseline_width as T2, all of them fractions ≤ 1, ran the Manchu component segmentation method with each, and compared the segmented images; the T2 that gave the better segmentation of the Manchu word images was finally chosen as the value above.
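The weak-segmentation test is a simple height comparison. In the sketch below, regions are taken to be the row intervals between consecutive cut rows, including the intervals before the first cut and after the last one; that interpretation, the half-open (top, bottom) representation and the name weak_regions are assumptions of this illustration.

```python
def weak_regions(cuts, height, baseline_width, t_less=5):
    """Return the (top, bottom) regions taller than t_less * baseline_width."""
    bounds = [0] + sorted(cuts) + [height]
    regions = list(zip(bounds[:-1], bounds[1:]))
    limit = t_less * baseline_width
    return [(top, bottom) for top, bottom in regions if bottom - top > limit]
```

Each region returned would then be passed through the coarse segmentation step again with the looser threshold T2.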

S7.3.候选分割区域的过分割判决与合并S7.3. Over-segmentation decision and merging of candidate segmentation regions

经粗切分和细切分后，Seg1序列还可能存在过分割区域。统计结果表明，满文部件的高度一般大于baseline_width，故设过分割判定阈值T_over＝1。计算Seg1中每个切分区域的高度ho，则高度ho＜(T_over×baseline_width)的切分区域被判定为过分割区域，需要合并，合并会有以下情况：After rough segmentation and fine segmentation, there may also be over-segmented regions in the Seg1 sequence. Statistical results show that the height of Manchu parts is generally greater than baseline_width, so the over-segmentation judgment threshold T_over=1 is set. Calculate the height ho of each segmented area in Seg1, and the segmented area with height ho<(T_over×baseline_width) is judged as an over-segmented area and needs to be merged. The merge will have the following situations:

1) Counting from top to bottom, if the first segmented region is judged to be over-segmented, it can only be merged with the second region;

2) Counting from bottom to top, if the second-to-last region is judged to be over-segmented, it can only be merged with the last region;

3) If the over-segmented region lies in the middle, both its upper and lower neighbours must be considered. The height h_up of the region obtained by merging with the upper neighbour and the height h_lw of the region obtained by merging with the lower neighbour are calculated, and the merge that gives the smaller height is chosen;

4) If the heights after merging with the upper and lower neighbours are equal, so that 3) cannot decide, the number of connected components after merging with each neighbour is calculated, and the merge that gives fewer connected components is chosen;

5) The segmentation rows obtained after merging are output.

Accordingly, over-segmented regions are merged using the following rules:

(1) If the first segmented region is over-segmented, merge it with the second segmented region; otherwise go to step (2).

(2) If the second-to-last segmented region is over-segmented, merge it with the last segmented region; otherwise go to step (3).

(3) If the over-segmented region is neither the first nor the second-to-last, calculate the heights h_up and h_lw of its adjacent upper and lower segmented regions. If h_up < h_lw, merge it with the upper region; if h_up > h_lw, merge it with the lower region; otherwise go to step (4).

(4) If the upper and lower neighbours of the over-segmented region have equal heights, calculate the numbers of connected components num_up and num_lw obtained after merging with the upper and the lower region, respectively. If num_up < num_lw, merge with the upper region; if num_up > num_lw, merge with the lower region.

(5) Output the sequence of segmentation rows after the over-segmented regions have been merged.
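Rules (1) to (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: regions are assumed to be contiguous (top_row, bottom_row) pairs with inclusive bounds, count_components is a hypothetical callback that returns the number of connected components in a merged row span, and two cases the rules leave open (an over-segmented last region, and equal component counts) are resolved here by merging upward.

```python
def height(region):
    top, bottom = region
    return bottom - top + 1


def merge_over_segmented(regions, baseline_width, count_components, t_over=1):
    """Merge regions with height ho < t_over * baseline_width into a neighbour."""
    regions = list(regions)
    limit = t_over * baseline_width
    while len(regions) > 1:
        over = [i for i, r in enumerate(regions) if height(r) < limit]
        if not over:
            break
        i = over[0]
        last = len(regions) - 1
        if i == 0:                      # rule (1): first region merges downward
            j = 1
        elif i == last - 1:             # rule (2): second-to-last merges with last
            j = last
        elif i == last:                 # not covered by the rules: assumed upward
            j = last - 1
        else:                           # rule (3): compare neighbour heights
            h_up, h_lw = height(regions[i - 1]), height(regions[i + 1])
            if h_up < h_lw:
                j = i - 1
            elif h_up > h_lw:
                j = i + 1
            else:                       # rule (4): compare connected components
                num_up = count_components((regions[i - 1][0], regions[i][1]))
                num_lw = count_components((regions[i][0], regions[i + 1][1]))
                # equal counts are not covered by the rules: assumed upward
                j = i + 1 if num_up > num_lw else i - 1
        lo, hi = min(i, j), max(i, j)
        regions[lo:hi + 1] = [(regions[lo][0], regions[hi][1])]
    return regions                      # rule (5): merged segmentation rows


# Hypothetical data: central axis 4 pixels wide, component counts fixed at 1.
seg1 = [(0, 2), (3, 12), (13, 14), (15, 30), (31, 40)]
print(merge_over_segmented(seg1, 4, lambda span: 1))
# prints [(0, 14), (15, 30), (31, 40)]
```

The loop re-checks heights after every merge, so a merged region that is still shorter than the limit is merged again on the next pass.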

The above scheme yields the Manchu component segmentation results shown in Figure 7: Figures 7(1)-(2) show weakly segmented regions after fine segmentation, and Figures 7(3)-(4) show over-segmented regions after merging.

The Manchu component segmentation results obtained above are processed further in order to recognize the Manchu components. In addition to the segmentation of the Manchu word image described above, the recognition method includes the following steps:

(1) Normalization of Manchu components

This includes position normalization and size normalization of the Manchu components.

Position normalization takes the uppermost, lowermost, leftmost and rightmost stroke pixels of the Manchu component image as boundaries and crops away the background, keeping only the part that contains strokes. Size normalization scales the position-normalized image to a common size (for example, 64 pixels × 64 pixels).
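The two normalization steps can be sketched as below. The sketch assumes a binary image stored as a list of rows in which stroke pixels are 1 and background pixels are 0, and it uses nearest-neighbour sampling for the resize; the text does not state which interpolation is used, so that choice and the function names are assumptions.

```python
def crop_to_strokes(img):
    """Position normalization: crop to the bounding box of the stroke pixels."""
    rows = [r for r, row in enumerate(img) if any(row)]
    if not rows:
        raise ValueError("image contains no stroke pixels")
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def resize_nearest(img, size=64):
    """Size normalization: scale to size x size by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


# Hypothetical 4 x 4 component image with a 2 x 2 stroke area in the middle.
component = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
normalized = resize_nearest(crop_to_strokes(component), size=64)
print(len(normalized), len(normalized[0]))  # prints 64 64
```

The resize stretches each axis independently, so the aspect ratio of the component is not preserved; this matches the statement that all components are brought to the same size.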

(2) Feature extraction for Manchu components

First, features commonly used for minority-script characters are extracted separately, including contour features, grid features, directional line element features, visual direction features and affine invariant distance features. These features are then fused, and principal component analysis is applied to reduce the dimensionality of the fused features.

(3) Recognition of Manchu components

A support vector machine classifier with a Gaussian kernel is used, and the binary classifiers are combined with the "one-versus-rest" rule to recognize each Manchu component.
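The decision stage of a one-versus-rest combination of Gaussian-kernel classifiers can be sketched as follows. The models here are toy stand-ins built by hand rather than trained support vector machines, and the class labels, centres, coefficients and gamma value are all hypothetical; the sketch only shows how one score per class is computed and how the highest score selects the label.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, model, gamma):
    """Kernel expansion: sum of coef * K(support_vector, x), plus a bias."""
    total = model["bias"]
    for sv, coef in zip(model["support_vectors"], model["dual_coefs"]):
        total += coef * gaussian_kernel(sv, x, gamma)
    return total


def predict_one_vs_rest(x, models, gamma):
    """Return the label whose 'this class versus the rest' score is largest."""
    return max(models, key=lambda label: decision_value(x, models[label], gamma))


# Toy models: each class scores its own centre positively and the others negatively.
centres = {"a": (0.0, 0.0), "b": (3.0, 0.0), "c": (0.0, 3.0)}
models = {
    label: {
        "support_vectors": list(centres.values()),
        "dual_coefs": [1.0 if other == label else -0.5 for other in centres],
        "bias": 0.0,
    }
    for label in centres
}
print(predict_one_vs_rest((2.8, 0.1), models, gamma=0.5))  # prints b
```

In practice the support vectors, coefficients and bias of each binary classifier would come from training on the reduced feature vectors of step (2).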

(4) Post-processing of Manchu component recognition

For the recognized Manchu components, the components are reassembled into words according to the recognition results of vertically adjacent components and the spelling rules of Manchu letters, thereby achieving recognition of Manchu words.

Attached Table 1:

Claims

1. A method for extracting the central axis of a Manchu word image, characterized by comprising the following steps:

S1, locating the central axis of the Manchu word:

S1.1, inverting the Manchu word image so that the pixel value of the character part is 1 and the pixel value of the background part is 0;

S1.2, using the morphological thinning function of the MATLAB Image Processing Toolbox to perform morphological thinning of the Manchu word image;

S1.3, applying the Hough transform to the morphologically thinned Manchu word image to determine the column coordinate of the thinned central axis, this column coordinate being used as the position of the central axis of the Manchu word, wherein the angle of the straight lines searched by the Hough transform is limited to θ = 90 so that only straight lines in the vertical direction are searched, straight lines at the same longitudinal position are connected, straight lines whose spacing is smaller than the height of the Manchu word image and whose length is larger than 1 pixel are treated as one straight line, and the central position of the central axis is calculated;

S2, detecting the width of the central axis of the Manchu word:

S2.1, determining the search area of the maximum run-length ratio method:

the search area of the maximum run-length ratio method is determined by the range specified by the following formula:

(1)

wherein sl is the left boundary of the defined search range, sr is the right boundary of the defined search range, baseline is the central position of the central axis, round denotes rounding to the nearest integer, and W is the width of the Manchu word image;

S2.2, applying the maximum run-length ratio method to the Manchu word image within the search area to determine the width of the central axis of the Manchu word image: scanning each row of the search area of the Manchu word image, and counting the run lengths of continuous black pixels and the frequency with which each run length occurs, wherein the run length with the highest frequency of occurrence is the width of the central axis of the Manchu word;

S2.3, calculating the left boundary and the right boundary of the central axis from the central position of the central axis of the Manchu word image and the width of the central axis, according to the following formula:

(2)

wherein bl is the left boundary of the central axis, br is the right boundary of the central axis, baseline is the central position of the central axis of the Manchu word image, baseline_width is the width of the central axis of the Manchu word image, and round denotes rounding to the nearest integer.