CN103679678B - Semi-automatic splicing and restoration method for rectangular shredded paper fragments with text features - Google Patents
- Publication number: CN103679678B
- Application number: CN201310697323.6A
- Authority: CN (China)
- Prior art keywords
- fragment
- edge
- paper slip
- horizontal
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention provides a method for splicing and restoring rectangular shredded paper fragments. Depending on the cutting method (longitudinal only, or both horizontal and longitudinal), the strip type (single-sided or double-sided), and the script (Chinese or English), the method is divided into three parts in order of increasing difficulty: longitudinally cut Chinese and English strips, cross-cut Chinese and English strips, and cross-cut double-sided English strips. In each part, the input images are first converted to grayscale, binarized, and inverted; the fragments are then spliced and restored using algorithms such as left-edge fragment detection, edge matching, and height detection. To improve accuracy, manual intervention is added for cross-cut fragments, and morphological preprocessing is added for cross-cut English strips. The method is well targeted and simple to operate. It is far more efficient than manual splicing, noticeably more accurate than fully automatic splicing, and has clear advantages for restoring the rectangular fragments produced by a shredder.
Description
Technical field
The invention provides a splicing and restoration method for rectangular shredded paper fragments with text features, and belongs to the technical field of image processing and pattern recognition in computer applications.
Technical background
Automatic splicing of fragments can be regarded approximately as a jigsaw puzzle problem [1], and it has important applications in the restoration of forensic evidence, historical documents, works of art, and damaged banknotes, as well as in military intelligence gathering [2]. Traditionally, splicing and restoration are done by hand, which is accurate but very inefficient. When the number of fragments is large, manual splicing can hardly be completed in a short time. Shredded paper is one kind of fragment and is found everywhere in daily life. With the rapid development of computer technology, automatic splicing techniques for shredded paper have been pursued to improve restoration efficiency.
Researchers in China and abroad have studied fragment splicing and restoration extensively. For paper, the work mainly concerns hand-torn fragments, and the key technique is shape matching. For example, reference [3] proposes a multiscale method for splicing two-dimensional fragments, which performs multiscale analysis on strings of curvature values sampled along the fragment contours and refines each match by dynamic programming. Reference [4] proposes an automatic fragment splicing method based on elastic matching. Reference [5] uses a semi-automatic, human-computer interactive splicing method. Reference [2] proposes a shape-matching method for semi-automatic splicing of shredded paper that starts from the extracted fragment contours and matches them using a boundary criterion and an area criterion. Reference [6] proposes a fragment edge detection algorithm based on line-segment scanning. All of these algorithms operate on irregular shapes. Reference [7], targeting text documents, proposes a semi-automatic splicing algorithm based on the text-line features or table features of the fragments. By taking text-line and table-line features into account and combining them with manual steps, it is considerably more accurate than purely automatic splicing.
With advances in printing technology, handwriting in document records has gradually been replaced by print, which makes text-line features and table features more uniform. At the same time, shredders have become increasingly common in offices, so document fragments now come in two forms, horizontally cut and longitudinally cut, and algorithms based on shape features are increasingly ineffective for splicing and restoration. To address this situation, we propose a restoration method for rectangular shredded paper fragments that splices and restores them using the features of the printed text together with left-edge detection.
[1] Wolfson H, Schonberg E, Kalvin A, et al. Solving jigsaw puzzles by computer [J]. Annals of Operations Research, 1988, 12(1): 51-64.
[2] Jia Haiyan, Zhu Liangjia, Zhou Zongtan, et al. A shape matching method for automatic splicing of shredded paper [J]. Computer Simulation, 2006, 23(11): 180-183.
[3] da Gama H C, Stolfi J. A multiscale method for the reassembly of two-dimensional fragmented objects [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9): 1239-1251.
[4] Kong W, Kimia B B. On solving 2D and 3D puzzles using curve matching [C]// Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). IEEE, 2001, 2: II-583-II-590.
[5] De Smet P, De Bock J, Corluy E. Computer vision techniques for semi-automatic reconstruction of ripped-up documents [C]// AeroSense 2003. International Society for Optics and Photonics, 2003: 189-197.
[6] Luo Zhizhong. Research on a shredded-paper boundary detection algorithm based on line-segment scanning [J]. Chinese Journal of Scientific Instrument, 2011, 32(002): 289-294.
[7] Luo Zhizhong. Semi-automatic splicing of document fragments based on text features [J]. Computer Engineering and Applications, 2012, 48(5): 207-210.
Summary of the invention
At present, most fragment splicing and restoration techniques are based on fragment shape features, and some of the algorithms are complex and poorly targeted. With the spread of printing and of shredders in daily life, shredded fragments are increasingly regular rectangles and table features are becoming less prominent. To address these shortcomings of the prior art, the invention provides a splicing and restoration method for rectangular shredded paper fragments with text features. The method consists of fragment preprocessing, left-edge fragment detection, and rightward matching by an edge-matching algorithm, assisted by manual processing. In order of increasing splicing difficulty, it handles longitudinally cut Chinese and English strips, cross-cut Chinese and English strips, and cross-cut double-sided English strips.
The technical scheme of the invention is as follows:
A splicing and restoration method for rectangular shredded paper fragments with text features, comprising restoration methods for longitudinally cut Chinese and English strip fragments, cross-cut Chinese and English strip fragments, and cross-cut double-sided English strip fragments:
1. Longitudinally cut Chinese and English strip fragments
(1) Use a scanner to read in digital images of the text fragments and convert the images to grayscale;
Because fragments cut only longitudinally are long and narrow, the longitudinal edge vector is a practical quantity to compare. The flow is shown in Fig. 1(a);
(2) Binarize the images, detect the left-edge fragment, match edges, and display the spliced, restored image:
(a) Binarize the images: number the strip fragments sequentially (00, 01, 02, ...); the computer reads in the fragment images in numerical order, binarizes and inverts them, and extracts the left and right edge vectors of each fragment;
(b) Apply left-edge detection to the fragments from step (a) in turn to determine whether each fragment lies on the left edge of the original document: if the left edge of a numbered fragment is blank, that fragment is the left edge of the document;
(c) Order the fragments by the edge-matching criterion: once step (b) has identified the left edge of the original document, take the corresponding fragment as fragment 1 and display it with its number; compare the right edge vector of fragment 1 with the left edge vectors of the other fragments in numerical order until fragment 2, whose left edge matches the right edge of fragment 1, is found, then display fragment 2 and its number to the right of fragment 1;
(d) Using the edge-matching criterion of step (c), compare and match the remaining fragments in numerical order, working from left to right, until the last fragment has been matched; display the matched fragments from left to right in matching order together with their numbers to form the final restored document;
(e) Save the order of the matched fragments from steps (c) and (d) and the corresponding sequence of numbers.
Preferably, according to the invention, the final restored document from step (d) is inspected by eye to verify the restoration result;
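The longitudinal-strip procedure in steps (1) and (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: images are assumed to be lists of rows of gray values (0 to 255), the fixed threshold of 128, the single-column edge vectors, and the sum-of-absolute-differences distance are all assumptions, and every function name is hypothetical. The "smallest edge difference" rule for choosing the next fragment follows the criterion the text states for in-row adjustment.

```python
def binarize_inverted(gray, threshold=128):
    """Binarize and invert: ink (dark) pixels become 1, paper becomes 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def left_edge(img):
    """Left edge vector: the first pixel column of the fragment."""
    return [row[0] for row in img]


def right_edge(img):
    """Right edge vector: the last pixel column of the fragment."""
    return [row[-1] for row in img]


def edge_distance(a, b):
    """Assumed dissimilarity measure: sum of absolute pixel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_left_fragment(fragments):
    """Return the index of the first fragment whose left edge is blank."""
    for i, img in enumerate(fragments):
        if not any(left_edge(img)):
            return i
    raise ValueError("no fragment with a blank left edge")


def order_strips(fragments):
    """Greedy left-to-right ordering starting from the left-edge fragment."""
    order = [find_left_fragment(fragments)]
    remaining = set(range(len(fragments))) - set(order)
    while remaining:
        tail = right_edge(fragments[order[-1]])
        # Scan candidates in numerical order and keep the closest left edge.
        best = min(sorted(remaining),
                   key=lambda j: edge_distance(tail, left_edge(fragments[j])))
        order.append(best)
        remaining.remove(best)
    return order
```

The returned list holds fragment numbers from left to right, which corresponds to the ordering saved in step (e). A greedy chain like this can go wrong when several left edges are equally close, which is why the text recommends a visual check of the result.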
2. Cross-cut Chinese and English strip fragments
Because these fragments are cut on all four sides, and because of character spacing, line spacing, and first-line paragraph indentation, fragment edges may be blank, which introduces large errors into edge-vector matching. In this case a criterion based on the position (height) of each text row is added: rows are matched by the approximate position of each row of characters on the fragment.
Chinese characters are square and relatively regular in shape, so character-height detection is a highly practical grouping criterion. English letters vary in height, so preprocessing is needed to bring the letters into a fairly consistent height range before the overall relative position is determined, which improves the recognition rate. Because Chinese and English characters have different properties, they cannot be handled with the same criterion; the two kinds of document are treated separately below.
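The height-based grouping idea for square characters can be sketched as below. This is a simplified illustration under stated assumptions: the feature height is taken as the index of the first pixel row that contains ink in a binarized, inverted fragment, and fragments are grouped when adjacent sorted heights differ by no more than a hypothetical `tolerance`. The function names and the tolerance value are not from the patent.

```python
def first_text_row(img):
    """Index of the first pixel row containing ink, or None if blank."""
    for y, row in enumerate(img):
        if any(row):
            return y
    return None


def group_by_height(heights, tolerance=2):
    """Group fragment ids whose feature heights are close.

    heights maps a fragment id to its feature height. The ids are sorted
    by height, and a new group starts whenever the gap to the previous
    height exceeds the tolerance.
    """
    groups = []
    previous = None
    for frag_id, height in sorted(heights.items(), key=lambda kv: kv[1]):
        if previous is None or height - previous > tolerance:
            groups.append([])
        groups[-1].append(frag_id)
        previous = height
    return groups
```

For example, heights of 5, 6, 20, 21, and 40 pixels fall into three groups with the default tolerance. Borderline gaps are exactly the cases where the text calls for manual guidance.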
① Chinese cross-cut fragments
The restoration flow for Chinese cross-cut fragments is shown in Fig. 1(b):
First, read in the digital images of the cross-cut strip fragments, convert them to grayscale, binarize and invert them, and then extract the left, right, upper, and lower edge vectors of each fragment;
Second, detect the text height on the fragments, detect the left-edge fragment within each group and match within the group, and finally match the horizontal strips vertically to restore the document:
i) Detect the height of the text on the fragments: because Chinese characters are square and relatively regular in shape, detect row by row where characters appear on every fragment, classify the fragments into groups according to whether the upper edge falls on text or on blank space and according to the position of the highest pixel of the first row of characters, and number the fragments within each group. To avoid special cases and to group accurately, manual guidance is added during classification: the text heights are sorted by value, and fragments whose heights are adjacent and close in value are placed in the same group;
ii) Matching within a group: match the fragments classified in step i) within their groups:
(a) Apply left-edge detection to all fragments in turn to determine whether each fragment is a left-edge fragment of the original document: if the left edge of a fragment is blank, the fragment lies on the left edge of the document;
(b) Order the fragments by the edge-matching criterion: if a detected left-edge fragment belongs to one of the groups formed in step i), use it as the starting point and match the fragments of that group rightward by their edges. Once step (a) has identified the left edge of the original document, take the corresponding fragment as fragment 1 and display it with its number; compare the right edge vector of fragment 1 with the left edge vectors of the other fragments in numerical order until fragment 2, whose left edge matches the right edge of fragment 1, is found, then display fragment 2 and its number to the right of fragment 1;
If a left-edge fragment does not belong to any group, set it aside until the other groups have been matched, and then match it into the document using the upper and lower edge vectors of the left-edge fragments of the original document;
(c) Using the edge-matching criterion of step (b), compare and match the remaining fragments in numerical order, working from left to right, until the last fragment has been matched; display the matched fragments from left to right in matching order together with their numbers to form a restored horizontal strip of the original document;
(d) Save the order of the matched fragments from steps (b) and (c) and the corresponding sequence of numbers.
Preferably, according to the invention, the restored horizontal strip from step (c) is inspected by eye to verify the restoration result. Because the fragments have low resolution, every group must be screened manually during matching: a wrong match should be rejected and the correct match made again, or the correct fragment inserted before matching continues;
iii) Adjustment within a row: matching selects the fragment with the smallest edge difference. If the group does not contain the true next fragment, a wrong fragment is matched, which in turn affects the rest of the ordering. The row is therefore adjusted by manual intervention, and the horizontal strips are obtained;
iv) Vertical matching of the horizontal strips:
Extract the upper and lower edge vectors of the horizontal strips for comparison and matching,
(a) Apply upper-edge detection and lower-edge detection to all horizontal strips in turn to determine whether each strip is the top strip or the bottom strip of the original document: if the upper edge or the lower edge of a horizontal strip is blank, the strip lies on the upper edge or the lower edge of the document;
(b) Order the horizontal strips vertically by the edge-matching criterion: once step (a) has identified the upper edge of the original document, take the corresponding strip as horizontal strip 1 and display it with its number; compare the lower edge vector of horizontal strip 1 with the upper edge vectors of the other horizontal strips in numerical order until horizontal strip 2, whose upper edge matches the lower edge of horizontal strip 1, is found, then display horizontal strip 2 and its number below horizontal strip 1;
(c) Using the edge-matching criterion of step (b), compare and match the remaining horizontal strips in numerical order, working from top to bottom, until the last strip has been matched; display the matched strips from top to bottom in matching order together with their numbers to form the restored original document;
(d) Save the order of the matched horizontal strips from steps (b) and (c) and the corresponding sequence of numbers;
Preferably, according to the invention, the restored document from sub-step (c) of step iv) is inspected by eye to verify the restoration result. Because of line spacing, the upper and lower edges of the strips obtained by in-group matching may be blank, in which case the upper and lower edge vectors cannot be compared. A manual check is therefore made after the vertical matching to confirm that the content reads coherently from strip to strip; any misplaced strip is adjusted according to the content of the text and the positions of the characters;
② English cross-cut fragments
Because of the shapes of English letters, detecting only the highest and lowest positions of the letters in the first row is inaccurate. By analogy with the four-line, three-space ruling of English handwriting practice paper, shown in Fig. 2, reference lines are used to match rows. According to their position and extent in the standard four-line grid, letters are divided in advance into four height classes: letters occupying only the middle space (such as a and e), letters occupying the upper two spaces (such as h and l), letters occupying the lower two spaces (such as y and g), and letters occupying all three spaces (such as j). Before grouping, the heights of the different letter classes are calculated and used as the basis for processing the letter heights and then grouping.
In the four-line grid, every letter has content in the middle space, whereas only some letters have content in the upper or lower space. Locating the second and third lines of the grid therefore positions a row of letters fairly accurately.
Content in the upper space has distinctive morphological characteristics in the vertical direction. With suitable processing, the upper-space content of all letters can be removed, except for the upper-space content of capital letters.
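The text does not specify which morphological operation is used. One plausible reading, sketched here purely as an assumption, is a row-wise horizontal closing followed by a horizontal opening: in the middle space the strokes are dense, so closing merges them into long runs that survive the opening, while isolated thin ascender strokes in the upper space are too short and are removed. The structuring-element lengths `close_k` and `open_k` and all function names are hypothetical.

```python
def runs(row, value):
    """Yield half-open (start, end) intervals of consecutive `value` pixels."""
    start = None
    for x, px in enumerate(row):
        if px == value and start is None:
            start = x
        elif px != value and start is not None:
            yield start, x
            start = None
    if start is not None:
        yield start, len(row)


def close_row(row, k):
    """1-D closing: fill interior gaps shorter than k pixels."""
    out = list(row)
    for s, e in runs(row, 0):
        if s > 0 and e < len(row) and e - s < k:
            out[s:e] = [1] * (e - s)
    return out


def open_row(row, k):
    """1-D opening: delete ink runs shorter than k pixels."""
    out = list(row)
    for s, e in runs(row, 1):
        if e - s < k:
            out[s:e] = [0] * (e - s)
    return out


def suppress_thin_strokes(img, close_k=3, open_k=4):
    """Keep original ink only where the closed-then-opened mask is set."""
    result = []
    for row in img:
        mask = open_row(close_row(row, close_k), open_k)
        result.append([px & m for px, m in zip(row, mask)])
    return result
```

Wide upper-space content, such as that of capital letters, can form runs long enough to survive this filter, which is consistent with the exception noted in the text. Suitable lengths would have to be tuned to the font size and scan resolution.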
Furthermore, capital letters and letters that occupy the lower space make up only a small proportion of the text, so a processed fragment almost always contains at least one row with content only in the middle space. Once such a row is found, the second-line and third-line positions of the other rows can be derived from the previously calculated spacing and height data, which makes grouping by height possible.
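The extrapolation step can be illustrated as follows. This sketch assumes a binarized, inverted fragment (ink = 1), a known middle-space height `x_height`, and a known distance `line_pitch` between the second lines of consecutive rows, both in pixels. These parameters, the tolerance, and the function names are assumptions introduced for illustration.

```python
def text_bands(img):
    """Return half-open (top, bottom) intervals of pixel rows containing ink."""
    bands, start = [], None
    for y, row in enumerate(img):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(img)))
    return bands


def second_line_positions(img, x_height, line_pitch, tol=1):
    """Estimate the second-line position of every text row in a fragment.

    A band whose height is within tol of x_height is taken as a row with
    content only in the middle space, so its top is a second line. The
    other second lines are placed at multiples of line_pitch from it.
    """
    reference = next((top for top, bottom in text_bands(img)
                      if abs((bottom - top) - x_height) <= tol), None)
    if reference is None:
        return []
    first = reference % line_pitch  # topmost second line inside the fragment
    return list(range(first, len(img), line_pitch))
```

The first value returned could serve as the fragment's height feature for grouping. A band that is cut by the fragment edge may match `x_height` by coincidence, so results near the edges deserve a manual look.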
First, read in the digital images of the strip fragments, convert them to grayscale, and binarize and invert them so that the text stands out; then extract the left, right, upper, and lower edge vectors of each fragment;
Second, following the steps shown in Fig. 1(b), determine the second-line and third-line positions in each fragment and group the fragments by text height; then detect the left-edge fragment within each group and match within the group, and finally match the horizontal strips vertically to restore the document:
i) Fragment preprocessing: to determine the position of each row of letters and to offset, as far as possible, the height-matching error caused by letters occupying different spaces, the letters are first processed morphologically so that the parts occupying the upper space are removed as far as possible and the second-line and third-line positions can be determined.
ii) Determine the second-line and third-line positions in each fragment and group by height: first find, in the processed fragment, a row that occupies only the middle space, then compute the second-line and third-line positions of each row above it in turn until the upper edge is passed. Take the first second-line height that appears in each fragment as that fragment's characteristic height and group on that basis: detect row by row where characters appear on every fragment, classify the fragments into groups according to whether the upper edge falls on text or on blank space and according to the position within the fragment of the second line of the first row of characters, and number the fragments within each group. To avoid special cases and to group accurately, manual guidance is added during classification: the text heights are sorted by value, and fragments whose heights are adjacent and close in value are placed in the same group.
iii)组内匹配:将步骤(ii)中分类完毕的碎片进行组内匹配: iii) Intra-group matching: perform intra-group matching on the fragments classified in step (ii):
(a)对所有碎片依次进行左边缘检测,以确定所述碎片是否为原始文档的左边缘碎片:判断碎片左边缘是否为空白,若为空白则碎片为文档左边缘; (a) carry out left edge detection to all fragments successively, to determine whether described fragment is the left edge fragment of original document: judge whether the left edge of fragment is blank, if it is blank then fragment is the left edge of document;
(b)按照边缘匹配准则进行碎片排序:若检测出的左边缘碎片存在于步骤i)过程中分好的组内,则以该左边缘碎片为起点对相应的分组的碎片进行向右边缘匹配;通过步骤(a)确定原始文档左边缘后,确定文档左边缘所对应的碎片为第1碎片,显示第1碎片及其编号;将所述第1碎片的右边缘矢量,按照编号顺序依次与其它碎片的左边缘矢量进行对比匹配,直到找与第1碎片右边缘匹配的第2碎片,在所述第1碎片的右侧增加显示第2碎片及其编号; (b) Sorting the fragments according to the edge matching criterion: if the detected left edge fragment exists in the group divided in the process of step i), then use the left edge fragment as the starting point to perform right edge matching on the corresponding grouped fragments ; After determining the left edge of the original document by step (a), determine that the fragment corresponding to the left edge of the document is the first fragment, and display the first fragment and its number; the right edge vector of the first fragment is sequentially connected with The left edge vectors of other fragments are compared and matched until the second fragment matching the right edge of the first fragment is found, and the second fragment and its number are displayed on the right side of the first fragment;
If a left-edge fragment does not belong to any group, it is set aside until the other groups have been matched, and is then matched into the document using the upper and lower edge vectors of the left-edge fragments of the original document;
(c) Following the edge matching criterion of step (b), the remaining fragments are compared and matched from left to right in numbering order until the last fragment has been matched; the matched fragments are displayed from left to right in matching order, together with their numbers, forming a restored horizontal strip of the original document;
(d) The order of the matched fragments from steps (b) and (c), and the corresponding sequence of numbers, is saved;
Preferably, according to the present invention, the horizontal strip restored in step (c) of step iii) is inspected by eye to verify the restoration result. Because the fragment images have low resolution, every group must be screened manually during matching: a wrong match is rejected and either redone correctly or repaired by inserting the correct fragment, after which matching continues;
iv) In-row adjustment: matching selects the fragment with the smallest edge difference, so if the group does not contain the true next fragment, a wrong fragment is matched and the subsequent ordering is affected. In-row adjustment is therefore carried out by manual intervention, finally yielding the horizontal strips;
v) Vertical matching of the horizontal strips:
The upper and lower edge vectors of the horizontal strips are extracted for comparison and matching:
(a) Upper-edge and lower-edge detection is performed on all horizontal strips in turn to determine whether each strip is the upper-edge or lower-edge strip of the original document: if the upper or lower edge of a strip is blank, the strip lies on the upper or lower edge of the document, respectively;
(b) The horizontal strips are ordered vertically according to the edge matching criterion: once the upper edge of the original document has been determined in step (a), the strip on the upper edge is taken as the first horizontal strip, and the first horizontal strip and its number are displayed; the lower edge vector of the first horizontal strip is compared, in numbering order, with the upper edge vectors of the other strips until the second horizontal strip, whose upper edge matches the lower edge of the first horizontal strip, is found, and the second horizontal strip and its number are then displayed below the first horizontal strip;
(c) Following the edge matching criterion of step (b), the remaining horizontal strips are compared and matched from top to bottom in numbering order until the last strip has been matched; the matched strips are displayed from top to bottom in matching order, together with their numbers, forming the restored original document;
(d) The order of the matched horizontal strips from steps (b) and (c), and the corresponding sequence of numbers, is saved;
Preferably, according to the present invention, the document restored in step (c) of step v) is inspected by eye to verify the restoration result. Because of line spacing, the upper or lower edge of a strip may be blank after intra-group matching, in which case the upper and lower edge vectors cannot be compared. A manual check is therefore made after the vertical matching of the horizontal strips to confirm that the document reads coherently; any misplaced strip is adjusted according to the content of the text and the positions of the characters;
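The vertical ordering of step v) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: strips are taken to be already binarized and inverted (1 = ink) and stored as lists of pixel rows, the first strip is supplied by the caller (from step (a) or from the operator), and the names `order_rows`, `mismatch` and `is_blank` are hypothetical. Joins where either edge is blank are returned separately, since the text notes that such joins cannot be decided by edge vectors and need a manual check.

```python
def top_edge(strip):
    """Upper edge vector: the first pixel row of a binarized strip."""
    return list(strip[0])


def bottom_edge(strip):
    """Lower edge vector: the last pixel row of a binarized strip."""
    return list(strip[-1])


def is_blank(vec):
    """True if the edge vector contains no ink at all."""
    return not any(vec)


def mismatch(u, v):
    """Number of positions at which two edge vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def order_rows(strips, first):
    """Greedy top-to-bottom ordering of horizontal strips.

    strips: dict mapping strip number -> binarized image (list of rows).
    first:  number of the strip assumed to be the top of the page.
    Returns (order, review), where review lists the joins that involve a
    blank edge and therefore should be checked by a person.
    """
    order = [first]
    remaining = sorted(k for k in strips if k != first)
    review = []
    while remaining:
        edge = bottom_edge(strips[order[-1]])
        # Pick the strip whose upper edge differs least from the current lower edge.
        best = min(remaining, key=lambda k: mismatch(edge, top_edge(strips[k])))
        if is_blank(edge) or is_blank(top_edge(strips[best])):
            review.append((order[-1], best))
        order.append(best)
        remaining.remove(best)
    return order, review
```

Returning the doubtful joins alongside the order keeps the procedure semi-automatic: the operator only needs to look at the joins the edge vectors could not settle.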
3. Double-sided English fragments cut horizontally and vertically
Because the front and back of a fragment differ, there is twice as much distinct fragment information as for single-sided paper. However, since the front and back of each fragment correspond, matching one side is enough to obtain the result expression. The added difficulties relative to the single-sided case are: the amount of data doubles, so the number of fragments in each group increases greatly during height grouping; and the larger number of fragments with blank edges raises the edge matching error rate and increases the manual correction workload. As to Chinese versus English fragments, the solutions in Section 2 show that English is far harder to splice and restore than Chinese. Double-sided English fragments cut horizontally and vertically are therefore chosen for treatment here; double-sided Chinese paper can be handled by combining the method of Section 2 for Chinese fragments cut horizontally and vertically with the double-sided English method of this Section 3.
For double-sided fragments the approach is as follows: find all left-edge fragments and first match these left-edge fragments manually, which gives the left edge of each side. If a group contains two left-edge fragments, each is used as a starting point for rightward matching; errors caused by blank edges during matching require manual correction, which is not described again here. If a group contains only one left-edge fragment and many fragments, matching starts from that fragment; if the group contains few fragments, its fragments are added to the result expression by manual correction and filling after the other groups have been matched.
The splicing and restoration flow for double-sided English fragments cut horizontally and vertically is shown in Fig. 1(c):
First, the digital images of the front and back of the fragments are read in and converted to grayscale, and the grayscale images are binarized and inverted to bring out the text;
Second, height detection and manually assisted grouping are carried out; front and back edges are detected within each group, and intra-group matching is performed and checked manually; the horizontal strips are then matched vertically to complete the splicing and restoration:
i) Fragment preprocessing: to obtain the position of each line of letters and to offset, as far as possible, the height matching error caused by letters occupying different cells, morphological processing is applied to the letters in each fragment to remove as much as possible of the letter parts that occupy the upper cell, in preparation for determining the second- and third-line positions;
ii) Determine the positions of the second and third lines in each fragment and group the fragments by height: first find, in the preprocessed fragment, a text line that occupies only the middle cell, then work upward, computing the second- and third-line positions of each preceding text line until the upper edge of the fragment is passed. The height of the first second line found in a fragment is defined as that fragment's characteristic height. Character positions are then detected line by line in all fragments; the fragments are grouped according to whether the upper edge is text or blank and according to the position within the fragment of the highest pixel at the second-line height of the first line of characters, and the fragments in each group are numbered. To handle special cases, manual guidance is added during classification: the text heights are sorted by value, and fragments with close heights are placed in the same group;
iii) Intra-group matching: the fragments classified in step ii) are matched within each group:
(a) Left-edge detection is performed on the fragments in each group, and the left-edge fragment found in each group is taken as the left edge of the corresponding line of the original document: if the left edge of a fragment is blank, the fragment lies on the left edge of the document;
(b) Rightward matching according to the edge matching criterion: if a detected left-edge fragment belongs to one of the groups formed in step ii), rightward matching is performed on the fragments of that group, starting from the left-edge fragment; the right edge vector of the first fragment is compared, in numbering order, with the left edge vectors of the other fragments until the second fragment, whose left edge matches the right edge of the first fragment, is found, and the second fragment and its number are then displayed to the right of the first fragment;
(c) Manual intervention is required at this point: the matching result is examined, surplus fragments are moved to the pool of ungrouped fragments, and missing fragments are sought in that pool. Because the heights are fairly concentrated, grouping is not difficult and only a few discontinuous fragments need attention. The ungrouped fragments also include two left-edge fragments, which are matched separately, so that the fragments of one group are separated into two rows;
(d) Following the edge matching criterion of step (b), the remaining fragments are compared and matched from left to right in numbering order until the last fragment has been matched; the matched fragments are displayed from left to right in matching order, together with their numbers, forming a restored horizontal strip of the original document;
(e) The order of the matched fragments from steps (b) and (d), and the corresponding sequence of numbers, is saved.
Preferably, according to the present invention, the horizontal strip restored in step (d) of step iii) is inspected by eye to verify the restoration result. Because the fragment images have low resolution, every group must be screened manually during matching: a wrong match is rejected and either redone correctly or repaired by inserting the correct fragment, after which matching continues;
iv) In-row adjustment: matching selects the fragment with the smallest edge difference, so if the group does not contain the true next fragment, a wrong fragment is matched and the subsequent ordering is affected. In-row adjustment is therefore carried out by manual intervention, finally yielding the horizontal strips;
v) Vertical matching of the horizontal strips: the upper and lower edge vectors of the horizontal strips are extracted for comparison and matching:
(a) Upper-edge and lower-edge detection is performed on all horizontal strips in turn to determine whether each strip is the upper-edge or lower-edge strip of the original document: if the upper or lower edge of a strip is blank, the strip lies on the upper or lower edge of the document, respectively;
(b) The horizontal strips are ordered vertically according to the edge matching criterion: once the upper edge of the original document has been determined in step (a), the strip on the upper edge is taken as the first horizontal strip, and the first horizontal strip and its number are displayed; the lower edge vector of the first horizontal strip is compared, in numbering order, with the upper edge vectors of the other strips until the second horizontal strip, whose upper edge matches the lower edge of the first horizontal strip, is found, and the second horizontal strip and its number are then displayed below the first horizontal strip;
(c) Following the edge matching criterion of step (b), the remaining horizontal strips are compared and matched from top to bottom in numbering order until the last strip has been matched; the matched strips are displayed from top to bottom in matching order, together with their numbers, forming the restored original document;
(d) The order of the matched horizontal strips from steps (b) and (c), and the corresponding sequence of numbers, is saved;
Preferably, according to the present invention, the document restored in step (c) of step v) is inspected by eye to verify the restoration result. Because of line spacing, the upper or lower edge of a strip may be blank after intra-group matching, in which case the upper and lower edge vectors cannot be compared. A manual check is therefore made after the vertical matching of the horizontal strips to confirm that the document reads coherently; any misplaced strip is adjusted according to the content of the text and the positions of the characters;
vi) Step v) identifies all the rows belonging to the same side; once one side of the document has been matched, the whole document can be restored.
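As an illustration of why matching one side suffices, the sketch below derives the layout of the reverse side from the restored layout of the front side. It rests on assumptions the text does not state: each physical fragment carries a known pair of face identifiers (the hypothetical `back_of` mapping), and the sheet is turned over about its vertical axis, so rows keep their vertical order while the fragments within each row appear in reverse order. If the sheet were turned about its horizontal axis, the row order would reverse instead. The function name `back_side_layout` is likewise hypothetical.

```python
def back_side_layout(front_rows, back_of):
    """Derive the reverse-side layout from the restored front-side layout.

    front_rows: list of rows, each a left-to-right list of front-face ids.
    back_of:    dict mapping each front-face id to the id of its back face
                (assumed to be known from scanning both faces of a fragment).
    Assumes the sheet is flipped about its vertical axis, so every row is
    mirrored left to right while the top-to-bottom row order is unchanged.
    """
    return [[back_of[f] for f in reversed(row)] for row in front_rows]


# Usage with made-up identifiers for a 2 x 2 sheet:
front = [["001a", "002a"], ["003a", "004a"]]
pairs = {"001a": "001b", "002a": "002b", "003a": "003b", "004a": "004b"}
back = back_side_layout(front, pairs)
```

The derived layout should still be inspected by eye, as for the front side, because an error in the front-side order carries over unchanged.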
The beneficial effects of the present invention are:
1. The present invention is aimed at the rectangular pieces produced by a paper shredder and processes such regular rectangular fragments specifically;
2. The present invention handles the common cutting patterns separately: single-sided vertical cutting, single-sided horizontal and vertical cutting, and double-sided horizontal and vertical cutting;
3. The present invention takes the differences between Chinese and English documents into account and treats the restoration of horizontally and vertically cut Chinese documents and of horizontally and vertically cut English documents separately;
4. By handling different documents and cutting patterns in this way, the present invention greatly improves the efficiency and accuracy of splicing and restoration.
Description of drawings
Fig. 1: splicing and restoration flow for vertically cut Chinese and English strips;
Fig. 2: four-line grid layout of English letters;
Fig. 3: a Chinese page cut vertically into 19 fragments;
Fig. 4: an English page cut vertically into 19 fragments;
Fig. 5(a): restoration result for the vertically cut Chinese fragments;
Fig. 5(b): restoration result for the vertically cut English fragments;
Fig. 6: a Chinese page cut horizontally and vertically into 209 fragments;
Fig. 7(a): example of where blank space appears at the edge of a fragment;
Fig. 7(b): example of where pixels appear at the edge of a fragment;
Fig. 8: example of the fifth-row fragments of the Chinese page cut horizontally and vertically;
Fig. 9: restoration result for the Chinese page cut horizontally and vertically;
Fig. 10: ruling lines on English letters and height measurement;
Fig. 11: image preprocessing of English fragments cut horizontally and vertically;
Fig. 12: restoration result for the English page cut horizontally and vertically;
Fig. 13(a): side-a fragments of the double-sided English page cut horizontally and vertically;
Fig. 13(b): side-b fragments of the double-sided English page cut horizontally and vertically;
Fig. 14(a): side A of the restored double-sided English page cut horizontally and vertically;
Fig. 14(b): side B of the restored double-sided English page cut horizontally and vertically.
Detailed description:
The present invention is described in detail below with reference to the drawings and examples, but is not limited to them.
Embodiment 1
A splicing and restoration method for rectangular paper fragments, as shown in Fig. 1, comprising restoration methods for vertically cut Chinese and English fragments, horizontally and vertically cut Chinese and English fragments, and horizontally and vertically cut double-sided English fragments:
1. Chinese and English fragments cut vertically
(1) The digital images of the text fragments are read in with a scanner and converted to grayscale; the Chinese fragments are shown in Fig. 3 and the English fragments in Fig. 4, 19 fragments in each case;
When the fragments are cut vertically only, they are long and narrow, so the vertical (left and right) edge vectors are the more practical quantity to compare; the flow is shown in Fig. 1(a);
(2) The images are then binarized, the left-edge fragment is detected, the edges are matched, and the spliced, restored image is displayed:
(a) Binarization: the fragments are numbered sequentially, e.g. 00, 01, 02, ..., 18; the computer reads in the fragment images in numbering order, binarizes and inverts them, and extracts the left and right edge vectors of each fragment;
(b) Left-edge detection is performed on the fragments of step (a) in turn to determine which fragment is the left edge of the original document: if the left edge of fragment 00 is blank, fragment 00 is the left edge of the document. Otherwise the remaining rectangular fragments of equal width and height are examined in numbering order until the fragment with a blank margin along its left edge is found; that fragment is the left-edge fragment of the original page;
(c) The fragments are ordered according to the edge matching criterion: once the left edge of the original document has been determined in step (b), the fragment on the left edge is taken as the first fragment, and the first fragment and its number are displayed. The right edge vector of the first fragment is compared, in numbering order, with the left edge vectors of the other fragments; that is, starting from the first fragment found by left-edge detection, the right edge vector of the previously placed fragment is repeatedly compared with the left edge vectors of the other fragments, and the comparison counter xx is incremented by 1 for every element in which the two vectors differ. After all comparisons, the fragment with the smallest xx is taken as the second fragment, matching the right edge of the first fragment, and the second fragment and its number are displayed to the right of the first fragment;
(d) Following the edge matching criterion of step (c), the remaining fragments are compared and matched from left to right in numbering order until the last fragment has been matched; the matched fragments are displayed from left to right in matching order, together with their numbers, forming the final restored document. The resulting order of fragment numbers is given in Table 1 for the vertically cut Chinese fragments and in Table 2 for the vertically cut English fragments;
Table 1. Fragment number order for the vertically cut Chinese page:
Table 2. Fragment number order for the vertically cut English page:
(e) The order of the matched fragments from steps (c) and (d), and the corresponding sequence of numbers, is saved.
Preferably, according to the present invention, the final restored document of step (d) is inspected by eye to verify the restoration result; the restoration results for the Chinese and English fragments are shown in Fig. 5(a) and Fig. 5(b), respectively;
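Steps (b) to (d) above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: fragments are taken to be already binarized and inverted (1 = ink) and stored as lists of pixel rows, the blank-margin test looks at a hypothetical `margin` of 2 pixel columns, and the function names are illustrative. The counter returned by `mismatch` plays the role of xx in step (c).

```python
def left_edge(img):
    """Left edge vector: the first pixel column of a binarized fragment."""
    return [row[0] for row in img]


def right_edge(img):
    """Right edge vector: the last pixel column of a binarized fragment."""
    return [row[-1] for row in img]


def mismatch(u, v):
    """Comparison counter xx: number of elements in which two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def is_left_margin(img, margin=2):
    """True if the first `margin` columns hold no ink (assumed blank-margin test)."""
    return all(px == 0 for row in img for px in row[:margin])


def order_strips(fragments, margin=2):
    """Greedy left-to-right ordering of vertically cut fragments.

    fragments: dict mapping fragment number -> binarized image (list of rows).
    Returns the fragment numbers in restored left-to-right order.
    """
    starts = [k for k in sorted(fragments) if is_left_margin(fragments[k], margin)]
    if len(starts) != 1:
        raise ValueError("expected exactly one left-margin fragment, found %d" % len(starts))
    order = starts[:]
    remaining = sorted(k for k in fragments if k != starts[0])
    while remaining:
        edge = right_edge(fragments[order[-1]])
        # The fragment with the smallest xx is taken as the next one to the right.
        best = min(remaining, key=lambda k: mismatch(edge, left_edge(fragments[k])))
        order.append(best)
        remaining.remove(best)
    return order
```

The greedy choice always returns some fragment, even when the smallest xx is large, which is why the visual check of the restored document remains part of the procedure.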
2. Chinese and English fragments cut horizontally and vertically
Because every fragment has cut edges on all four sides, character spacing, line spacing, first-line indentation and similar factors may leave the edges of a fragment blank, which introduces large errors into edge vector matching. In this case the position (height) of each line of characters is added as a criterion; that is, rows are matched using the approximate position of each line of characters on the fragment.
Chinese characters are square and relatively regular in shape, so character height detection is a highly practical grouping criterion. English letters vary in height, so some preprocessing is needed to bring the letters into a fairly uniform height range before the overall relative position is determined, which improves the recognition rate. Because Chinese and English characters have different properties, they cannot be processed with the same criterion. Chinese and English documents are analyzed separately below.
① Chinese fragments cut horizontally and vertically
The splicing and restoration flow for Chinese fragments cut horizontally and vertically is shown in Fig. 1(b):
First, the digital images of the horizontally and vertically cut fragments are read in and converted to grayscale. As shown in Fig. 6, the page is cut 10 times horizontally and 18 times vertically, giving 11 × 19 = 209 fragments. The images are binarized and inverted, and the left, right, upper and lower edge vectors of each fragment are extracted;
Second, the text height in each fragment is detected; left-edge fragments are detected within each group and intra-group matching is performed; finally the horizontal strips are matched vertically to complete the restoration:
i) Detect the height of the text on the fragments: because Chinese characters are square and relatively regular in shape, character positions are detected line by line in all fragments, and the fragments are classified according to whether the upper edge of the fragment is text or blank and according to the position of the highest pixel of the first line of characters, as shown in Fig. 7(a) and Fig. 7(b); the fragments in each group are numbered. To handle special cases and keep the grouping accurate, manual guidance is added during classification: the text heights are sorted by value, and fragments whose heights are adjacent and close in value are placed in the same group;
Example of height grouping with manual intervention: the extracted heights are sorted by value, and fragments whose heights are adjacent and close in value are placed in one group; the procedure is fairly straightforward. For example, the first 35 heights are:
The first three heights are clearly outliers, and the corresponding fragments are assigned to the ungrouped pool. Starting from the third value, 42, and continuing to 38, there are 19 values of very similar size, which form one group. From 28 onward the heights are again very similar and form another group (the data are not listed in full). The 33 in between is also an outlier and goes to the ungrouped pool. For the Chinese fragments of Fig. 6, 25 fragments could not be grouped (shown in bold in Table 3, the restoration number table for the horizontally and vertically cut Chinese page); the rest fell into 10 groups of 17 to 19 unordered fragments each, which is a good grouping result;
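The grouping rule in this example (sort the heights, keep adjacent close values together, send isolated values to the ungrouped pool) can be sketched as below. The tolerance `tol`, the minimum group size `min_size`, the function name and the sample heights are all assumptions made for illustration; the text leaves the decision of what counts as "close" to the operator.

```python
def group_by_height(heights, tol=3, min_size=2):
    """Group fragments whose sorted heights are adjacent and close in value.

    heights:  dict mapping fragment number -> characteristic height in pixels.
    tol:      largest gap between neighbouring sorted heights that still
              counts as "close" (an assumed threshold, tuned by the operator).
    min_size: runs shorter than this are treated as outliers.
    Returns (groups, ungrouped): a list of groups of fragment numbers and
    the list of fragments left for manual placement.
    """
    ranked = sorted(heights.items(), key=lambda kv: kv[1])
    runs, current = [], []
    for frag, h in ranked:
        # A gap larger than tol closes the current run and starts a new one.
        if current and h - current[-1][1] > tol:
            runs.append(current)
            current = []
        current.append((frag, h))
    if current:
        runs.append(current)
    groups = [[f for f, _ in run] for run in runs if len(run) >= min_size]
    ungrouped = [f for run in runs if len(run) < min_size for f, _ in run]
    return groups, ungrouped


# Usage with made-up heights: 33 sits alone between two clusters.
sample = {"a": 42, "b": 41, "c": 40, "d": 33, "e": 28, "f": 27, "g": 29}
groups, ungrouped = group_by_height(sample)
```

Chaining on the gap between neighbours, rather than on the distance from a group centre, lets a run drift gradually (for instance from 42 down to 38) and still stay in one group, which mirrors the way the example reads the sorted list.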
ii) Intra-group matching: the fragments classified in step i) are matched within each group:
(a) Left-edge detection is performed on all fragments in turn to determine whether each fragment is a left-edge fragment of the original document: if the left edge of a fragment is blank, the fragment lies on the left edge of the document. In the example, a 72 × 10 matrix is applied to all fragments, yielding 11 left-edge fragments (the original page has a blank left margin), so a starting fragment for matching is determined in each of the height groups;
(b) The fragments are ordered according to the edge matching criterion: if a detected left-edge fragment belongs to one of the groups formed in step i), rightward edge matching is performed on the fragments of that group, starting from the left-edge fragment. Once the left edge of the original document has been determined in step (a), the fragment on the left edge is taken as the first fragment, and the first fragment and its number are displayed; the right edge vector of the first fragment is compared, in numbering order, with the left edge vectors of the other fragments until the second fragment, whose left edge matches the right edge of the first fragment, is found, and the second fragment and its number are then displayed to the right of the first fragment;
若左边缘碎片不属于任何一个分组,则先搁置该碎片(在检测出的左边缘碎片中,014、029和071不存在于任何一个高度组),直到匹配完其它分组后,按照原始文档的左边缘碎片的上下边缘矢量,再进行匹配入文档; If the fragment at the left edge does not belong to any group, put the fragment on hold first (among the detected fragments at the left edge, 014, 029 and 071 do not exist in any height group), until other groups are matched, follow the original document The upper and lower edge vectors of the left edge fragments are matched into the document;
(c) Apply the edge-matching criterion of step (b) to the remaining fragments in numbering order, matching from left to right until the last fragment has been matched. Display the matched fragments from left to right in matching order, each with its number, to form a restored horizontal strip of the original document;
(d) Save the order of the matched fragments from steps (b) and (c) together with the corresponding sequence of fragment numbers.
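Steps (a) to (d) can be illustrated with a minimal sketch. It assumes binarized, inverted fragments stored as lists of rows (1 = ink, 0 = background). The function names, the `margin` parameter and the pixel-disagreement count used as the edge difference are assumptions made for illustration; the patent does not specify the distance measure.

```python
def is_left_edge(fragment, margin=10):
    """True when the first `margin` columns of the fragment contain no ink."""
    return all(pixel == 0 for row in fragment for pixel in row[:margin])


def edge_distance(right_edge, left_edge):
    """Hypothetical edge difference: number of rows where the two edges disagree."""
    return sum(1 for a, b in zip(right_edge, left_edge) if a != b)


def order_row(fragments, start):
    """Chain fragments to the right of `start`, always taking the smallest difference."""
    order = [start]
    remaining = sorted(k for k in fragments if k != start)  # numbering order
    while remaining:
        right = [row[-1] for row in fragments[order[-1]]]
        best = min(
            remaining,
            key=lambda k: edge_distance(right, [row[0] for row in fragments[k]]),
        )
        order.append(best)
        remaining.remove(best)
    return order  # step (d): this list of numbers is what gets saved


# Tiny illustrative group of three 4x3 fragments.
group = {
    "001": [[0, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 1]],
    "002": [[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]],
    "003": [[0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 1]],
}
starts = [k for k in sorted(group) if is_left_edge(group[k], margin=2)]
print(starts, order_row(group, starts[0]))
```

Because the chain always accepts the best remaining candidate, a fragment that is missing from the group produces a wrong neighbor rather than a failure, which is why the operator check remains necessary.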
Preferably according to the present invention, an operator inspects the horizontal strips restored in step (c) and verifies the result. In this example, five rows were found to be matched completely (the vertical position of each row relative to the others is still undetermined); in one row 18 fragments were matched correctly and only the last was missing; and two rows contained a wrong match partway through. Manual intervention is needed at this point: each of the three remaining left-edge fragments, which are among the ungroupable fragments, is matched against the groups that have no left-edge fragment, and where the edges agree the fragment is assigned to that group. This determines the groups of fragments 014 and 029, and therefore also that of 071, and so on.
iii) In-row adjustment: matching always selects the fragment with the smallest edge difference, so if the true neighbor is not in the group a wrong fragment is chosen, which corrupts the rest of the ordering. In-row adjustment is therefore done by manual intervention, and the horizontal strips are obtained at the end. Looking at the matching results as a whole (Table 3), the third group (only 18 fragments) stops agreeing from the sixth fragment onward, and in the seventh group (only 18 fragments) the last fragment does not agree with the one before it. The group containing 014 has only 18 fragments and stops agreeing from the seventh fragment onward. The group containing 029 matches well. Fragment 071 matches exactly 18 other fragments, so only four ungroupable fragments remain. One of them is the last fragment of the second row and can be picked out by eye. Each of the other three is compared, along the adjacent edge, with the last correctly matched fragment of a group in which a mismatch occurred; once its position is fixed, matching continues with the remaining fragments of that group to form a complete row. The picture is then displayed from the final order vector, the restoration result is inspected, and row restoration is complete.
Table 3: Fragment numbers for the splicing restoration of Chinese horizontally and vertically cut fragments (bold marks fragments that could not be grouped by height):
iv) Vertical matching of the horizontal strips:
Extract the upper and lower edge vectors of the horizontal strips and compare and match them:
(a) Perform upper-edge and lower-edge detection on every horizontal strip in turn to determine whether it is the top-edge or bottom-edge strip of the original document: if the upper or lower edge of a strip is blank, the strip lies on the top or bottom edge of the document;
(b) Order the horizontal strips vertically according to the edge-matching criterion: once step (a) has identified the top edge of the original document, take the strip on that edge as horizontal strip 1 and display it with its number. Compare the lower-edge vector of strip 1, in numbering order, with the upper-edge vectors of the other strips until strip 2, whose upper edge matches the lower edge of strip 1, is found; then display strip 2 and its number below strip 1;
(c) Apply the edge-matching criterion of step (b) to the remaining horizontal strips in numbering order, matching from top to bottom until the last strip has been matched. Display the matched strips from top to bottom in matching order, each with its number, to form the restored original document;
(d) Save the order of the matched horizontal strips from steps (b) and (c) together with the corresponding sequence of numbers;
Preferably according to the present invention, an operator inspects the document restored in step (c) and verifies the result. Because of line spacing, a strip may have a blank upper or lower edge after in-group matching, so the upper and lower edge vectors cannot be compared. The fifth row of fragments in Figure 8 is an example: none of its fragments could be grouped with other fragments by height, and after in-group matching the blank upper edge rules out edge comparison. After the strips have been matched vertically, the context of the content must therefore be checked manually; any misplaced strip is adjusted according to the content of the text, character positions and so on. The final spliced result is shown in Figure 9;
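The vertical matching of step iv), together with the blank-edge limitation just described, can be sketched as follows. Each strip is reduced to a pair of binary edge vectors (top, bottom). The names, the disagreement count used as the difference, and the decision to stop the chain at a blank edge are illustrative assumptions, not details stated in the patent.

```python
def mismatch(bottom_edge, top_edge):
    """Hypothetical difference measure: columns where the two edges disagree."""
    return sum(1 for a, b in zip(bottom_edge, top_edge) if a != b)


def order_strips(strips, first):
    """Chain strips downward from `first`.

    strips maps a strip number to (top_edge, bottom_edge).
    Returns (order, leftover); leftover strips need manual placement.
    """
    order = [first]
    remaining = sorted(k for k in strips if k != first)  # numbering order
    while remaining:
        bottom = strips[order[-1]][1]
        # A blank edge carries no information, so the comparison is skipped.
        candidates = [k for k in remaining if any(strips[k][0])]
        if not any(bottom) or not candidates:
            break
        best = min(candidates, key=lambda k: mismatch(bottom, strips[k][0]))
        order.append(best)
        remaining.remove(best)
    return order, remaining


strips = {
    "s1": ([0, 0, 0, 0], [1, 0, 1, 1]),  # blank top edge: top of the page
    "s2": ([1, 0, 1, 0], [0, 0, 0, 0]),  # blank bottom edge: line spacing
    "s3": ([0, 0, 0, 0], [1, 1, 0, 0]),
    "s4": ([1, 1, 0, 0], [0, 0, 0, 0]),
}
print(order_strips(strips, "s1"))
```

In this toy input the chain stops after `s2` because its lower edge is blank, leaving `s3` and `s4` to be placed by reading the text, which mirrors the manual check described above.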
② English horizontally and vertically cut fragments
Because of the shapes of English letters, detecting only the highest and lowest positions at which letters appear in the first line is not accurate. By analogy with the four-line, three-space ruling of English exercise paper (Figure 2), the ruling lines are used to match rows. According to position and occupancy within the standard four-line, three-space ruling, letters are divided in advance into four height classes: letters occupying the middle space only (a, e, etc.), the upper two spaces (h, l, etc.), the lower two spaces (y, g, etc.), and all three spaces (j). Before grouping, the heights of the different letter classes must be calculated; the letter heights are then processed on that basis and the fragments grouped.
As shown in Figure 10, letter height is measured by drawing a line wherever pixels are detected and returning the line heights. The two returned sets of line heights are [1,52,90,116,154,180] and [1,14,44,83,122,147,171,180]. The scale used by the algorithm is as follows: a letter occupying the middle space (between the second and third lines of the standard ruling) is 171-147=24 pixels high, and a letter occupying the upper (or lower) two spaces is 90-52=38 pixels high.
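The arithmetic above can be reproduced directly from the two measured lists. The variable names are illustrative; the observation that the gaps in the first list alternate between 38 and 26 pixels (a period of 64 pixels) is simple arithmetic on the quoted values, and reading that period as the row pitch is an assumption.

```python
set_one = [1, 52, 90, 116, 154, 180]
set_two = [1, 14, 44, 83, 122, 147, 171, 180]


def gaps(lines):
    """Differences between consecutive measured line positions."""
    return [b - a for a, b in zip(lines, lines[1:])]


middle_space_height = 171 - 147  # second to third line: 24 pixels
two_space_height = 90 - 52       # upper (or lower) two spaces: 38 pixels

print(middle_space_height, two_space_height)
print(gaps(set_one))  # [51, 38, 26, 38, 26]
print(gaps(set_two))
```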
In the four-line ruling every letter has content in the middle space, whereas only some letters have content in the upper or lower space, so locating the second and third lines gives a fairly accurate position for the row of letters.
The content in the upper space has distinct morphological features in the vertical direction, so with suitable processing it can be removed for all letters except capital letters.
Furthermore, capital letters and letters that extend into the lower space are relatively rare, so in a processed fragment it is almost always possible to find at least one row with content only in the middle space. Once such a row is found, the positions of the second and third lines of the other rows can be computed from the spacing and height data already obtained, which makes height grouping possible.
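A minimal sketch of this idea follows: from one reliably located second line, the second lines of the other rows are obtained by stepping in multiples of the row pitch, the topmost one serves as the fragment's characteristic height, and fragments whose characteristic heights are close are grouped. The 64-pixel pitch, the 180-pixel fragment height, the tolerance and all function names are assumptions made for illustration.

```python
def second_line_positions(reference, pitch, fragment_height):
    """Second-line positions of every row, derived from one located row."""
    first = reference % pitch  # step upward until the upper edge is passed
    return list(range(first, fragment_height, pitch))


def characteristic_height(reference, pitch, fragment_height):
    """Position of the first second line that appears in the fragment."""
    return second_line_positions(reference, pitch, fragment_height)[0]


def group_by_height(heights, tolerance=2):
    """Sort by height and start a new group when the gap exceeds `tolerance`."""
    groups, last = [], None
    for frag_id, h in sorted(heights.items(), key=lambda kv: (kv[1], kv[0])):
        if last is None or h - last > tolerance:
            groups.append([])
        groups[-1].append(frag_id)
        last = h
    return groups


print(second_line_positions(147, 64, 180))  # [19, 83, 147]
heights = {"a": 19, "b": 20, "c": 41, "d": 18, "e": 43}
print(group_by_height(heights))  # [['d', 'a', 'b'], ['c', 'e']]
```

Chaining on adjacent sorted values means a slow drift in height can merge two rows into one group, so an operator would still review group sizes.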
First, read in the digital images of the fragments, convert them to grayscale, and binarize and invert them to make the text stand out; then extract the left and right edge vectors and the upper and lower edge vectors of each fragment;
Next, detect the fragment heights, detect the left-edge fragments within each group and match within the group, and finally match the horizontal strips vertically to restore the document:
Then proceed according to the steps shown in Figure 1(b): determine the positions of the second and third lines in each fragment, group the fragments by text height, match within each group, and finally match the horizontal strips vertically to restore the document:
i) Fragment preprocessing: to determine the position of each row of letters and cancel as far as possible the height-matching error caused by letters occupying different spaces, the letters are first processed morphologically so that the parts of letters in the upper space are removed as far as possible, which makes it possible to locate the second and third lines. The preprocessing is shown in Figure 11(a-d): Figure 11(a) is the inverted original image; Figure 11(b) is the result of opening the fragment with a 27*1 structuring element; Figure 11(c) is the difference between Figure 11(a) and Figure 11(b); and Figure 11(d) is the result of removing from Figure 11(c) the connected regions smaller than 50 pixels.
ii) Determine the positions of the second and third lines in each fragment and group by height: first find, in the processed fragment, a row that occupies only the middle space, then compute the second- and third-line positions of each row above it until the upper edge is passed. The height of the first second line appearing in a fragment is taken as that fragment's characteristic height and is the basis for grouping. Character positions are detected row by row for all fragments; fragments are classified into groups according to whether the upper edge of the fragment carries text or is blank, and according to the position in the fragment of the highest pixel at the second-line height of the first row of characters, and the fragments in each group are numbered. To avoid special cases and obtain accurate groups, classification is guided by manual intervention: the text heights are sorted by value and fragments whose adjacent heights are close are placed in one group. The grouping results are shown in Table 4.
Table 4: Fragment numbers for the splicing restoration of English horizontally and vertically cut fragments (bold marks fragments that could not be grouped by height)
iii) In-group matching: match the fragments classified in step (ii) within each group:
(a) Perform left-edge detection on the fragments of each group and take the left edge found in each group as the left edge of a row of the original document: if the left edge of a fragment in the group is blank, the fragment lies on the left edge of the document;
(b) Match toward the right according to the edge-matching criterion: if a detected left-edge fragment belongs to one of the groups formed in step (ii), use it as the starting point and match the fragments of that group toward the right. Compare the right-edge vector of the starting fragment, in numbering order, with the left-edge vectors of the other fragments until the fragment whose left edge matches it is found, and display that fragment and its number to the right of the starting fragment;
(c) Manual intervention is needed at this point: inspect the matching results, move surplus fragments to the pool of ungroupable fragments, and look for missing fragments in that pool. Because the heights are fairly concentrated, grouping is not difficult and only a few discontinuous fragments need attention. The ungroupable fragments also include two left-edge fragments; each is matched separately, so the fragments of one group are split into two rows;
(d) Apply the edge-matching criterion of step (b) to the remaining fragments in numbering order, matching from left to right until the last fragment has been matched. Display the matched fragments from left to right in matching order, each with its number, to form a restored horizontal strip of the original document;
(e) Save the order of the matched fragments from steps (b) and (c) together with the corresponding sequence of fragment numbers.
Preferably according to the present invention, an operator inspects the horizontal strips restored in step (c) and verifies the result. Because the fragments have low resolution, every group must be screened manually during matching: a wrong match is rejected and either redone correctly or corrected by inserting the right fragment before matching continues;
iv) In-row adjustment: matching always selects the fragment with the smallest edge difference, so if the true neighbor is not in the group a wrong fragment is chosen, which corrupts the rest of the ordering. In-row adjustment is therefore done by manual intervention, and the horizontal strips are obtained at the end; as in Figure 6, 11 horizontal strips are obtained;
v) Vertical matching of the horizontal strips:
Extract the upper and lower edge vectors of the 11 horizontal strips obtained and compare and match them:
(a) Perform upper-edge and lower-edge detection on every horizontal strip in turn to determine whether it is the top-edge or bottom-edge strip of the original document: if the upper or lower edge of a strip is blank, the strip lies on the top or bottom edge of the document;
(b) Order the horizontal strips vertically according to the edge-matching criterion: once step (a) has identified the top edge of the original document, take the strip on that edge as horizontal strip 1 and display it with its number. Compare the lower-edge vector of strip 1, in numbering order, with the upper-edge vectors of the other strips until strip 2, whose upper edge matches the lower edge of strip 1, is found; then display strip 2 and its number below strip 1;
(c) Apply the edge-matching criterion of step (b) to the remaining horizontal strips in numbering order, matching from top to bottom until the last strip has been matched. Display the matched strips from top to bottom in matching order, each with its number, to form the restored original document;
(d) Save the order of the matched horizontal strips from steps (b) and (c) together with the corresponding sequence of numbers;
Preferably according to the present invention, an operator inspects the document restored in step (c) and verifies the result. Because of line spacing, a strip may have a blank upper or lower edge after in-group matching, so the upper and lower edge vectors cannot be compared. After the strips have been matched vertically, a manual check is therefore made to confirm that the context of the document content is reasonable; any misplaced strip is adjusted according to the content of the text and character positions. The matching result is shown in Figure 12;
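The preprocessing of step i) in this section (opening with a tall structuring element, subtraction, then removal of small connected regions) can be sketched in plain Python as follows. Reading "27*1" as 27 rows by 1 column, using 8-connectivity, and all function names are assumptions made for illustration. For a vertical line element, binary opening keeps exactly those pixels that lie in a vertical run at least as long as the element, which is what `vertical_opening` implements.

```python
from collections import deque


def vertical_opening(img, length):
    """Binary opening with a `length` x 1 vertical line: keep long vertical runs."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for x in range(w):
        y = 0
        while y < h:
            if not img[y][x]:
                y += 1
                continue
            end = y
            while end < h and img[end][x]:
                end += 1
            if end - y >= length:
                for yy in range(y, end):
                    out[yy][x] = 1
            y = end
    return out


def subtract(a, b):
    """Pixels set in a but not in b."""
    return [[1 if pa and not pb else 0 for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def remove_small_components(img, min_pixels):
    """Clear 8-connected regions that contain fewer than `min_pixels` pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if not img[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue, region = deque([(sy, sx)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(region) < min_pixels:
                for y, x in region:
                    out[y][x] = 0
    return out


def preprocess(img, length=27, min_pixels=50):
    """Figure 11 pipeline: (a) - open(a), then drop small regions."""
    opened = vertical_opening(img, length)
    return remove_small_components(subtract(img, opened), min_pixels)
```

With the 24-pixel middle-space height quoted earlier, a 27-pixel element removes only strokes taller than a middle-space letter, which is consistent with the stated aim of suppressing the upper-space parts. In practice an image library's opening and small-object removal routines would replace these loops.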
3. Double-sided English paper fragments cut horizontally and vertically
Because the front and back of a fragment differ, there is twice as much distinct fragment information as for single-sided paper; with the cutting scheme of part 2, 418 distinct fragment images are produced. Since the front and back of each fragment correspond, however, the result expression can be obtained as soon as one side has been matched. The additional difficulties compared with the single-sided case are that the amount of data doubles, the number of fragments per group grows considerably during height grouping, and the larger number of fragments with blank edges raises the edge-matching error rate and hence the manual correction workload. Comparing Chinese and English fragments, the solutions in part 2 show that splicing English is much harder than splicing Chinese, so this section deals with double-sided English fragments; for double-sided Chinese paper, refer to the method for Chinese horizontally and vertically cut fragments in part 2 together with the method for double-sided English fragments in part 3.
For double-sided fragments the approach is as follows. Find all left-edge fragments and first match them manually, which gives the left edge of each side. If a group contains two left-edge fragments, each is used as a starting fragment and matched toward the right; where blank edges cause errors during matching, manual correction is needed, and the procedure is not repeated here. If a group contains only one left-edge fragment and many fragments, that fragment is used as the starting fragment; if the group contains few fragments, its fragments are added to the result expression by manual correction and filling after the other groups have been matched.
The splicing restoration process for double-sided English horizontally and vertically cut fragments is shown in Figure 1(c):
First, read in the 418 front and back digital images of the fragments. For each fragment one side is arbitrarily designated the front and the other the back, and the data for the two sides are numbered 000a-208a and 000b-208b, as shown in Figures 13(a) and 13(b). Convert the images to grayscale, then binarize and invert them to make the text stand out;
Next, detect heights and group with manual assistance; detect front and back edges within each group, match within the group and check manually; then match the horizontal strips vertically and splice to restore the document:
i) Fragment preprocessing: to obtain the position of each row of letters and cancel as far as possible the height-matching error caused by letters occupying different spaces, the letters in each fragment are processed morphologically so that the parts of letters in the upper space are removed as far as possible, in preparation for locating the second and third lines.
ii) Determine the positions of the second and third lines in each fragment and group by height: first find, in the processed fragment, a row that occupies only the middle space, then compute the second- and third-line positions of each row above it until the upper edge is passed. The height of the first second line appearing in a fragment is defined as that fragment's characteristic height. Character positions are then detected row by row for all fragments, and the fragments are grouped according to whether the upper edge carries text or is blank and according to the position in the fragment of the highest pixel at the second-line height of the first row of characters; the fragments in each group are numbered. This gives 10 groups of 35-40 fragments each, with just over 30 fragments left over. To avoid special cases, classification is guided by manual intervention: the text heights are sorted by value and fragments with similar heights are placed in one group.
iii) In-group matching: match the fragments classified in step (ii) within each group:
(a) Perform left-edge detection on the fragments of each group and take the left edge found in each group as the left edge of a row of the original document: if the left edge of a fragment in the group is blank, the fragment lies on the left edge of the document. In the 10 groups, 20 left-edge starting fragments can be determined;
(b) Match toward the right according to the edge-matching criterion: if a detected left-edge fragment belongs to one of the groups formed in step (ii), use it as the starting point and match the fragments of that group toward the right. Compare the right-edge vector of the starting fragment, in numbering order, with the left-edge vectors of the other fragments until the fragment whose left edge matches it is found, and display that fragment and its number to the right of the starting fragment;
(c) Manual intervention is needed at this point: inspect the matching results, move surplus fragments to the pool of ungroupable fragments, and look for missing fragments in that pool. Because the heights are fairly concentrated, grouping is not difficult and only a few discontinuous fragments need attention. The ungroupable fragments also include two left-edge fragments; each is matched separately, so the fragments of one group are split into two rows;
(d) Apply the edge-matching criterion of step (b) to the remaining fragments in numbering order, matching from left to right until the last fragment has been matched. Display the matched fragments from left to right in matching order, each with its number, to form a restored horizontal strip of the original document;
(e) Save the order of the matched fragments from steps (b) and (c) together with the corresponding sequence of fragment numbers.
Preferably according to the present invention, an operator inspects the horizontal strips restored in step (c) and verifies the result. Because the fragments have low resolution, every group must be screened manually during matching: a wrong match is rejected and either redone correctly or corrected by inserting the right fragment before matching continues;
iv) In-row adjustment: matching always selects the fragment with the smallest edge difference, so if the true neighbor is not in the group a wrong fragment is chosen, which corrupts the rest of the ordering. In-row adjustment is therefore done by manual intervention, and 22 horizontal strips are obtained at the end. The matching results for sides A and B of the sheet are shown in Table 5 and Table 6 respectively.
Table 5: Fragment numbers for side A
Table 6: Fragment numbers for side B
v) Vertical matching of the horizontal strips: extract the upper and lower edge vectors of the 22 horizontal strips obtained in (iv) and compare and match them:
(a) Perform upper-edge and lower-edge detection on every horizontal strip in turn to determine whether it is the top-edge or bottom-edge strip of the original document: if the upper or lower edge of a strip is blank, the strip lies on the top or bottom edge of the document;
(b) Order the horizontal strips vertically according to the edge-matching criterion: once step (a) has identified the top edge of the original document, take the strip on that edge as horizontal strip 1 and display it with its number. Compare the lower-edge vector of strip 1, in numbering order, with the upper-edge vectors of the other strips until strip 2, whose upper edge matches the lower edge of strip 1, is found; then display strip 2 and its number below strip 1;
(c) Apply the edge-matching criterion of step (b) to the remaining horizontal strips in numbering order, matching from top to bottom until the last strip has been matched. Display the matched strips from top to bottom in matching order, each with its number, to form the restored original document;
(d) Save the order of the matched horizontal strips from steps (b) and (c) together with the corresponding sequence of numbers;
Preferably according to the present invention, an operator inspects the document restored in step (c) and verifies the result. Because of line spacing, a strip may have a blank upper or lower edge after in-group matching, so the upper and lower edge vectors cannot be compared. After the strips have been matched vertically, a manual check is therefore made to confirm that the context of the document content is reasonable; any misplaced strip is adjusted according to the content of the text and character positions;
vi) Step (v) yields the 11 rows that belong to the same side; once one side of the document has been matched, the entire document can be restored, as shown in FIG. 13.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310697323.6A CN103679678B (en) | 2013-12-18 | 2013-12-18 | A kind of semi-automatic splicing restored method of rectangle character features a scrap of paper |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103679678A (en) | 2014-03-26 |
CN103679678B (en) | 2016-11-23 |
Family
ID=50317133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310697323.6A Expired - Fee Related CN103679678B (en) | 2013-12-18 | 2013-12-18 | A kind of semi-automatic splicing restored method of rectangle character features a scrap of paper |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103679678B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103996180B (en) * | 2014-05-05 | 2016-09-07 | 河海大学 | Shredder based on English words feature crushes document restored method |
CN104021534A (en) * | 2014-06-05 | 2014-09-03 | 温州医科大学 | Shredded paper splicing method |
CN107132959A (en) * | 2016-02-29 | 2017-09-05 | 四川效率源信息安全技术股份有限公司 | A kind of wrong method of quick amendment fragment picture restructuring result |
CN105809623A (en) * | 2016-03-04 | 2016-07-27 | 重庆交通大学 | A method of splicing and restoring shredded paper |
CN106097242A (en) * | 2016-05-31 | 2016-11-09 | 四川效率源信息安全技术股份有限公司 | A kind of method for correcting of picture recombination error |
CN106952230B (en) * | 2017-03-19 | 2021-02-02 | 北京工业大学 | Cross-cutting fragment recovery method based on clustering and ant colony algorithm |
CN106991082B (en) * | 2017-03-31 | 2020-06-26 | 西安理工大学 | Grouping method of multi-page similar document fragments |
CN108628920A (en) * | 2017-11-13 | 2018-10-09 | 淄博职业学院 | A kind of Art Design internet auxiliary puzzle system and design method |
CN108510442B (en) * | 2018-03-23 | 2021-12-31 | 中南大学 | Single-side paper scrap splicing and restoring method based on absolute value distance optimization |
CN109191407A (en) * | 2018-09-20 | 2019-01-11 | 湘潭大学 | A kind of a scrap of paper splicing restored method and system based on extreme learning machine |
CN109584163B (en) * | 2018-12-17 | 2020-12-08 | 深圳市华星光电半导体显示技术有限公司 | Method for restoring original file of paper scrap |
CN117994170B (en) * | 2024-03-08 | 2025-03-11 | 广州大学 | A method and system for splicing and restoring shredded paper based on semantic error correction and association |
Also Published As
Publication number | Publication date |
---|---|
CN103679678A (en) | 2014-03-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2016-11-23; termination date: 2017-12-18 |