CN101354727A

CN101354727A - A method and device for establishing a link between a digital document catalog and a text

Info

Publication number: CN101354727A
Application number: CNA2008102227847A
Authority: CN
Inventors: 高良才; 褚一民; 陶欣; 汤帜
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Apabi Technology Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Founder Apabi Technology Ltd
Priority date: 2008-09-24
Filing date: 2008-09-24
Publication date: 2009-01-28
Anticipated expiration: 2028-09-24
Also published as: CN101354727B

Abstract

The invention discloses a method and device for establishing links between the catalog (table of contents) and the text of a digital document. It provides an automatic way to establish such links and improves the efficiency of establishing them. The method includes: obtaining at least one piece of catalog item information from each saved catalog entry; determining, according to the at least one piece of catalog item information, the logical page in the digital document that corresponds to each catalog entry; and establishing a link between each catalog entry and its corresponding logical page. By establishing the links between the catalog and the text automatically, the proposed solution improves the efficiency of link establishment and thereby speeds up the production of digital documents.

Description

A method and device for establishing a link between a digital document catalog and a text

Technical field

The invention relates to the technical field of document processing, and in particular to a method and device for establishing a link between a digital document catalog and a text.

Background

The catalog of a document serves as an index to the document and makes it easier for readers to search and read. For a digital document, readers generally expect that clicking a catalog item jumps to the part of the text corresponding to that item, which speeds up both finding content and reading.

In paper documents, non-text parts such as the copyright page, catalog pages, preface, foreword, postscript, appendices, and references appear at the front of the document or between chapters. When pages are numbered, each part usually has its own page numbering. For example, if the catalog occupies 5 pages, those pages are numbered one to five under their own numbering, and the text starts on the sixth page of the document. Under the numbering of the text, however, that page is page 1 of the text, and the page number recorded for it in the catalog is also 1. That page number therefore does not represent the page's real logical page number.

After a paper document is converted into a digital document through Optical Character Recognition (OCR), or a digital document is generated directly by document processing software such as Adobe Acrobat or Founder Feiteng typesetting software, what the digital document records for each page is its logical page, that is, the position of the page within the digital document taken as a whole. Consequently, the natural page recorded in each catalog entry of the digital document does not correspond directly to a logical page of the digital document. In the prior art, links between the catalog pages and the text of a digital document are generally established manually, which is inefficient, slow, and not very accurate.

Summary of the invention

In view of this, embodiments of the present invention provide a method and device for establishing a link between a digital document catalog and a text, so as to establish such links automatically and improve the efficiency of establishing them.

An embodiment of the present invention provides a method for establishing a link between a digital document catalog and a text, wherein the digital document catalog contains a plurality of catalog entries and each catalog entry contains at least one piece of catalog item information. The method includes:

obtaining at least one piece of catalog item information from each saved catalog entry, and determining, according to the at least one piece of catalog item information, the logical page in the digital document corresponding to each catalog entry;

establishing a link between each catalog entry and its corresponding logical page.

An embodiment of the present invention provides a device for establishing a link between a digital document catalog and a text, wherein the digital document catalog contains a plurality of catalog entries and each catalog entry contains at least one piece of catalog item information. The device includes:

a logical page identification module, configured to obtain at least one piece of catalog item information from each saved catalog entry and to determine, according to the at least one piece of catalog item information, the logical page in the digital document corresponding to each catalog entry;

a link establishing module, configured to establish a link between each catalog entry and its corresponding logical page.

In the method provided by the embodiments of the present invention, at least one piece of catalog item information is obtained from the saved catalog entries and matched against the pages of the digital document; the logical page corresponding to each catalog entry is determined from the matching result, and a link is established between each catalog entry and that logical page. Establishing the links automatically in this way improves the efficiency of link establishment between the catalog and the text, and in turn the production efficiency of digital documents.

Description of the drawings

Fig. 1 is a flowchart of a method for automatically establishing a link between a digital document catalog and a text provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a specific implementation of the method for establishing a link between a digital document catalog and a text provided by an embodiment of the present invention;

Fig. 3 is a flowchart of a method, provided by an embodiment of the present invention, for determining the logical page corresponding to each catalog entry according to the page number item in the catalog entry;

Fig. 4 is a schematic diagram of determining character coordinates provided by an embodiment of the present invention;

Fig. 5 is a flowchart of a method, provided by an embodiment of the present invention, for determining the logical page corresponding to each catalog entry according to the title item in the catalog entry;

Fig. 6A is a flowchart of a concrete example of the method for establishing a link between a digital document catalog and a text provided by an embodiment of the present invention;

Fig. 6B is a flowchart of a method, provided by an embodiment of the present invention, for determining a logical page according to both the page number item information and the title item information;

Fig. 7 is a catalog page of the digital document provided by an embodiment of the present invention;

Fig. 8 is a text page of the digital document provided by an embodiment of the present invention;

Fig. 9 is a structural diagram of a device for establishing a link between a digital document catalog and a text provided by an embodiment of the present invention.

Detailed description

To establish links between a digital document catalog and the text automatically, and to improve the efficiency of establishing them, an embodiment of the present invention provides the method shown in Fig. 1, wherein the digital document catalog contains a plurality of catalog entries and each catalog entry contains at least one piece of catalog item information. The method includes the following steps:

S101: Obtain at least one piece of catalog item information from each saved catalog entry, and determine, according to the at least one piece of catalog item information, the logical page in the digital document corresponding to each catalog entry.

The obtained catalog item information includes page number item information and/or title item information.

S102: Establish a link between each catalog entry and its corresponding logical page.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The digital document used in the embodiments of the present invention can be read page by page. The characters of each page can be obtained, together with the coordinates of each character on its page, and the font information of the text (font type, font size, and so on) can be recognized.

Fig. 2 shows the method for establishing a link between a digital document catalog and a text in an embodiment of the present invention, which includes the following steps:

S201: Read in the digital document and obtain the saved information of each catalog entry.

The saved catalog entries are obtained as follows: based on the recognized catalog of the digital document, each line of the catalog is taken as one catalog entry. A catalog entry includes: a chapter number item, which is the chapter numbering represented by the catalog line, for example Chapter 2 or Section 10; or a title item, which is the title represented by the catalog line, that is, the text after the chapter number and before the page number; or a page number item, which is the natural page on which the chapter is located according to the catalog line.
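
As an illustration only, a saved catalog entry could be modelled as a small record with one optional field per catalog item. The class name `CatalogEntry` and its field names are hypothetical and do not come from the patent; the sketch merely mirrors the three items described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One line of the catalog (hypothetical structure, not from the patent)."""
    chapter_number: Optional[str] = None  # e.g. "Section 1"
    title: Optional[str] = None           # text between chapter number and page number
    page_number: Optional[int] = None     # natural page printed in the catalog

    def available_items(self) -> List[str]:
        """Names of the catalog items actually present in this entry."""
        items = (
            ("chapter_number", self.chapter_number),
            ("title", self.title),
            ("page_number", self.page_number),
        )
        return [name for name, value in items if value is not None]
```

An entry needs only one of the items to be usable, which is why every field is optional here.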

S202: Determine the logical page corresponding to each catalog entry according to at least one piece of catalog item information in the entry.

The at least one piece of catalog item information in each catalog entry includes the page number item information in the entry, or the title item information in the entry, or a combination of the two.

S203: Establish a link between each catalog entry and its corresponding logical page.

Fig. 3 shows the method, provided by an embodiment of the present invention, for determining the logical page corresponding to each catalog entry according to the page number item information in the entry. It includes the following steps:

S301: According to the page number in the page number item of each catalog entry, determine the candidate pages among which the logical page corresponding to that page number is located.

The candidate pages are determined according to a preset relationship between candidate pages and page numbers. Specifically, the candidate pages for a catalog entry are determined from the page number in its page number item, the total number of catalog pages of the digital document, and a set range threshold parameter.

When the page number in the page number item is n, the total number of catalog pages of the digital document is K, and the set range threshold parameter is D, the candidate pages N for the logical page corresponding to page number n satisfy: n + K - D ≤ N ≤ n + K + D. In practice, the range threshold parameter D can be set flexibly as needed; a suitable value improves the efficiency of link establishment while still meeting the accuracy requirement.
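
The candidate range above can be sketched in a few lines of Python. The function name `candidate_pages` is hypothetical, and clamping the range to the bounds of the document (when `total_pages` is given) is an added assumption that the patent does not state.

```python
from typing import Optional


def candidate_pages(n: int, k: int, d: int, total_pages: Optional[int] = None) -> range:
    """Logical pages N with n + k - d <= N <= n + k + d.

    n: page number from the catalog entry
    k: total number of catalog pages
    d: range threshold parameter
    total_pages: optional document length used for clamping (assumption)
    """
    low = max(n + k - d, 1)  # logical pages are assumed to start at 1
    high = n + k + d
    if total_pages is not None:
        high = min(high, total_pages)
    return range(low, high + 1)
```

For instance, `list(candidate_pages(20, 5, 3))` yields the seven pages from 22 to 28. A larger `d` tolerates more unnumbered front matter at the cost of more pages to inspect.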

S302: Extract valid information from each candidate page.

Specifically: according to the saved type-area (page core) range and the coordinates of each character on each candidate page, determine the characters located outside the type area, and extract the numeric characters from them. In other words, find the characters in the header and footer outside the type area and extract the digits among them. The saved type-area information includes the upper, lower, left, and right boundary lines of the type area.

The coordinates of each character are determined from the character's minimum bounding rectangle and are expressed by the coordinates of two diagonally opposite vertices of that rectangle. As shown in Fig. 4, the coordinates of the character "目" can be expressed by vertices 1 and 3, or by vertices 2 and 4. If vertices 1 and 3 are used, the coordinates of the character are written (x₁, y₁, x₂, y₂), where x₁ is the abscissa of vertex 1 (its distance from the y axis), y₁ is the ordinate of vertex 1 (its distance from the x axis), x₂ is the abscissa of vertex 3 (its distance from the y axis), and y₂ is the ordinate of vertex 3 (its distance from the x axis).
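
A minimal sketch of step S302 follows. The names `Char`, `TypeArea`, `is_outside`, and `margin_digits` are hypothetical. The sketch assumes that vertex 1 is the top-left corner, vertex 3 is the bottom-right corner, and the ordinate grows downward; the patent does not state the axis orientation explicitly.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    text: str
    x1: float  # vertex 1 (assumed top-left)
    y1: float
    x2: float  # vertex 3 (assumed bottom-right)
    y2: float


class TypeArea(NamedTuple):
    top: float
    bottom: float
    left: float
    right: float


def is_outside(ch: Char, area: TypeArea) -> bool:
    """True if the character's box crosses any boundary of the type area."""
    return (ch.y1 < area.top or ch.y2 > area.bottom
            or ch.x1 < area.left or ch.x2 > area.right)


def margin_digits(chars: List[Char], area: TypeArea) -> List[Char]:
    """Numeric characters lying outside the type area (header/footer)."""
    return [c for c in chars
            if c.text.isascii() and c.text.isdigit() and is_outside(c, area)]
```

Restricting the search to the margins keeps digits that occur in the running text from being mistaken for printed page numbers.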

S303: Merge the extracted valid information.

For the extracted numeric characters, judge whether the distance between each pair of numeric characters exceeds a set spacing threshold. If it does not, merge the two numeric characters into one numeric string; otherwise treat them as two separate numeric strings.

The judgment can be made from the coordinates of the two characters, determined as shown in Fig. 4. First, decide whether the two characters can be regarded as being on the same line by comparing their ordinates: when the absolute value of the difference between the corresponding ordinates of the two numeric characters is smaller than a set first condition value, the two characters are on the same line; otherwise they are not. "Corresponding" means that if the ordinate of vertex 3 is used for one character, the ordinate of vertex 3 must also be used for the other. Then decide whether the horizontal spacing of two characters on the same line satisfies a set second condition value, for example by comparing their abscissas: when the absolute value of the difference between the corresponding abscissas is smaller than the second condition value, the two characters can be merged into one numeric string. Other coordinate-based methods for deciding whether two numeric characters should be merged can also be used and are not enumerated here.
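
The two-stage test above (same line, then horizontal spacing) might be sketched as follows. The names `Char` and `merge_digits` and both threshold parameters are hypothetical; comparing the vertex-3 ordinates and the vertex-1 abscissas is one possible choice of "corresponding" coordinates, not the only one.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def merge_digits(digits: List[Char], line_tol: float, gap_tol: float) -> List[str]:
    """Merge digits into strings: same line if |dy| < line_tol, joined if |dx| < gap_tol."""
    # Stage 1: group characters whose vertex-3 ordinates are close into lines.
    lines: List[List[Char]] = []
    for ch in sorted(digits, key=lambda c: c.y2):
        if lines and abs(ch.y2 - lines[-1][0].y2) < line_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])

    # Stage 2: within a line, join left-to-right neighbours that are close enough.
    strings: List[str] = []
    for line in lines:
        line.sort(key=lambda c: c.x1)
        current = line[0].text
        for prev, cur in zip(line, line[1:]):
            if abs(cur.x1 - prev.x1) < gap_tol:
                current += cur.text
            else:
                strings.append(current)
                current = cur.text
        strings.append(current)
    return strings
```

Suitable values for `line_tol` and `gap_tol` depend on the font size of the page numbers and would have to be tuned per document.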

S304: Match the page number in the page number item of the catalog entry against the merged valid information, and determine the logical page corresponding to the catalog entry according to the matching result.

Specifically: compare the page number in the page number item with each merged numeric string, and determine a first catalog entry confidence of each candidate page with respect to the catalog entry according to whether each merged string equals the page number. The candidate page with the highest first catalog entry confidence can be selected as the logical page corresponding to the catalog entry.

One possible implementation is as follows. First, assign the same initial confidence X to every candidate page for the page number in the page number item. Then match the page number against each merged numeric string on each candidate page: each time a numeric string matching the page number is found, add Y to the confidence of that candidate page; each time a numeric string not matching the page number is found, subtract E from it. The result is the first confidence of the candidate page with respect to the catalog entry. For example, if a candidate page yields 5 merged numeric strings and has initial confidence X, and one string matches the page number in the catalog entry while 4 do not, the confidence of that candidate page is X + Y - 4E. X, Y, and E are all positive real numbers.
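
The scoring rule above can be sketched as below. The function names and the dictionary layout (`strings_by_page` maps a candidate logical page to its merged numeric strings) are hypothetical; comparing the values as integers is an assumption.

```python
from typing import Dict, List


def page_number_confidence(target: int, strings: List[str],
                           x: float, y: float, e: float) -> float:
    """First confidence of one candidate page: x + y per match - e per mismatch."""
    score = x
    for s in strings:
        if int(s) == target:
            score += y
        else:
            score -= e
    return score


def best_candidate(target: int, strings_by_page: Dict[int, List[str]],
                   x: float, y: float, e: float) -> int:
    """Candidate page with the highest first confidence (first one wins on ties)."""
    return max(strings_by_page,
               key=lambda p: page_number_confidence(target, strings_by_page[p], x, y, e))
```

With one matching string and four non-matching ones, `page_number_confidence` returns `x + y - 4 * e`, which agrees with the example in the text.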

In the embodiments of the present invention, the logical page corresponding to each catalog entry can also be determined from the title item in the entry. Fig. 5 shows this method, which includes the following steps:

S501: According to the coordinates of all the characters on each page, arrange the characters of each page into lines.

Specifically, all the characters on each candidate page are sorted. First, judge for each pair of characters whether they are on the same line. Taking horizontal typesetting of the catalog items as an example, check whether the vertical distance between two characters exceeds a preset spacing parameter h, where h is a positive real number: when the vertical distance is not greater than h, the two characters are placed on the same line; otherwise they are not. Then, within each line, sort the characters in order of increasing abscissa.

As shown in Fig. 4, the minimum bounding rectangle of the line obtained after sorting is (x_m, y_m, x_n, y_n), and it encloses the minimum bounding rectangles of all the characters in the line. Here x_m is the abscissa of the leftmost character 1 in the line (the abscissa of that character's upper-left or lower-left vertex); y_m is the ordinate of the uppermost character 3 in the line (the ordinate of that character's upper-left or upper-right vertex); x_n is the abscissa of the rightmost character 4 in the line (the abscissa of that character's upper-right or lower-right vertex); and y_n is the ordinate of the lowest character 2 in the line (the ordinate of that character's lower-left or lower-right vertex).
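
Step S501 can be sketched as follows. The names `Char`, `group_into_lines`, and `line_box` are hypothetical, and the sketch assumes horizontal typesetting with the ordinate growing downward, so that the uppermost character has the smallest y value.

```python
from typing import List, NamedTuple, Tuple


class Char(NamedTuple):
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def group_into_lines(chars: List[Char], h: float) -> List[List[Char]]:
    """Put characters whose vertical distance is at most h on the same line."""
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: c.y1):
        if lines and abs(ch.y1 - lines[-1][0].y1) <= h:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    for line in lines:
        line.sort(key=lambda c: c.x1)  # increasing abscissa within a line
    return lines


def line_box(line: List[Char]) -> Tuple[float, float, float, float]:
    """Minimum bounding rectangle (x_m, y_m, x_n, y_n) of a line."""
    return (min(c.x1 for c in line), min(c.y1 for c in line),
            max(c.x2 for c in line), max(c.y2 for c in line))
```

Each line is compared against the first character assigned to it, which is a simplification; a running average of the line's ordinates would be another option.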

S502: Match the characters of each line against the title information in the title item of the saved catalog entry.

The matching process includes: matching the similarity between character strings according to the Longest Common Subsequence (LCS) algorithm, where the strings are the title in the catalog entry and the characters of each line; and then determining, according to at least one configured feature, the total confidence of the line with respect to the title item information.

The at least one configured feature includes: the position of the line within its page of the digital document; or the average character width of the line compared with the average character width of the text; or whether the string matched by the LCS algorithm shares its line with other characters. From these features together with the LCS result, the total confidence of each line with respect to the title item information can be determined.
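
A minimal sketch of the LCS part of the matching is given below. It implements the longest common subsequence (the English name used in the text); normalising by the title length is an assumption, and the additional features (position, character width, sharing a line with other characters) are deliberately left out. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def title_similarity(title: str, line: str) -> float:
    """Fraction of the title's characters found, in order, in the line."""
    if not title:
        return 0.0
    return lcs_length(title, line) / len(title)
```

A subsequence match tolerates characters that OCR dropped or inserted, which makes it more forgiving than exact string comparison.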

S503: Determine the logical page corresponding to the catalog entry according to the result of matching each line of each page against the title information.

A total confidence is obtained for each line from how well its characters match the title information. The highest total confidence among the lines of a page is taken as the second catalog entry confidence of that page with respect to the catalog entry, and the logical page corresponding to each catalog entry is determined from the second catalog entry confidences of the pages.

Determining the logical page from the title information in the catalog entry is highly reliable, but it also lowers the efficiency of establishing links between the catalog and the text. The page number information and the title information can therefore be combined. Specifically: determine the candidate pages for the logical page according to the page number information; match the page number information within each candidate page to obtain the first catalog entry confidence of each candidate page with respect to the catalog entry; match the title information within each candidate page to obtain the second catalog entry confidence of each candidate page with respect to the catalog entry; then, using the first and second catalog entry confidences together with weight coefficients set for page number matching and title matching, determine the total confidence of each candidate page with respect to the catalog entry, and from it the logical page corresponding to each catalog entry.
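
The combination step might look like the sketch below. The patent mentions weight coefficients but gives no formula, so the weighted sum used here is an assumption, and the function and parameter names are hypothetical.

```python
from typing import Dict


def total_confidence(first: float, second: float,
                     w_page: float, w_title: float) -> float:
    """Weighted sum of page-number and title confidences (assumed form)."""
    return w_page * first + w_title * second


def pick_logical_page(first_by_page: Dict[int, float],
                      second_by_page: Dict[int, float],
                      w_page: float, w_title: float) -> int:
    """Candidate page with the highest total confidence."""
    return max(
        first_by_page,
        key=lambda p: total_confidence(first_by_page[p],
                                       second_by_page.get(p, 0.0),
                                       w_page, w_title),
    )
```

Because title matching runs only on the candidate pages selected by page number, the combined approach avoids scanning every page of the document for every title.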

Fig. 6A illustrates the method for establishing links between a digital document catalog and the text, taking "Mental Health Education" (《心理健康教育》), published by Inner Mongolia University Press in 2006, as an example. It includes the following steps:

Step 601: Read in the digital document and obtain the saved catalog entry information.

The digital document has 236 pages and 59 catalog entries. The catalog entry for Section 1 of Chapter 2, shown in Fig. 7, is used to describe in detail how a link between a catalog entry and the text is established. In this catalog entry, the chapter number item is "Section 1" (第一节), the title item is "Self-Awareness Overview" (自我意识概述), and the page number item is "20".

Step 602: Determine the logical page corresponding to each catalog entry according to at least one piece of catalog item information taken from the saved catalog entry.

Step 603: According to the logical page determined for each catalog entry, establish a link between each catalog entry and its corresponding logical page.

Figure 6B shows the method, in this embodiment of the present invention, of determining the logical page corresponding to each catalog entry from the page number in the page-number catalog item and the title in the title catalog item. The process includes:

Step 602a: According to the page number in the page-number catalog item of each catalog entry, determine the candidate pages in which the logical page corresponding to that catalog entry may lie.

The page number n in the page-number catalog item of this catalog entry is 20. The preset relationship between a candidate page N and the page number is n+K-D≤N≤n+K+D, where the page-count parameter K of the digital document is 5 and the range threshold parameter D is 3. The candidate pages for this catalog entry are therefore pages 22 to 28.
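The candidate range above can be sketched in a few lines of Python. This is only an illustration of the inequality n+K-D≤N≤n+K+D; the function name `candidate_pages` and the clamping of the range to the pages that actually exist are assumptions that do not appear in the original text.

```python
def candidate_pages(n, k, d, total_pages=None):
    """Return the natural page numbers N with n + k - d <= N <= n + k + d.

    n: page number from the catalog entry, k: parameter K, d: range threshold D.
    Clamping to [1, total_pages] is an assumption added for safety.
    """
    low = max(1, n + k - d)
    high = n + k + d
    if total_pages is not None:
        high = min(high, total_pages)
    return list(range(low, high + 1))


# Values from the embodiment: n = 20, K = 5, D = 3, 236 pages in total.
print(candidate_pages(n=20, k=5, d=3, total_pages=236))
```

With the embodiment's values the list runs from page 22 to page 28, which agrees with the range stated in the text.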

Step 602b: Extract valid information from each candidate page.

In this embodiment of the present invention, the saved page-core (版心) range of the digital document is as follows: the ordinate of the upper boundary is 80.73, the abscissa of the left boundary is 0, the abscissa of the right boundary is 485, and the ordinate of the lower boundary is 697.10. In each candidate page, the characters lying outside the page core are determined from the coordinates of each character, and numeric characters are extracted from them. In the actual calculation, as shown in Figure 4, the coordinates of each character are given by vertex 1 and vertex 3 of its minimum bounding rectangle. A character is considered to lie outside the page core when the ordinate of its vertex 1 is smaller than 80.73, or the ordinate of its vertex 3 is larger than 697.10, or the abscissa of its vertex 1 is smaller than 0, or the abscissa of its vertex 3 is larger than 485.
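The four-way boundary test can be written as a short predicate. The sketch below assumes that a character's coordinates are stored as (x1, y1, x3, y3), that is, vertex 1 followed by vertex 3, which matches the order of the coordinate tuples quoted in this embodiment; the names `Box`, `CORE`, and `outside_core` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float  # abscissa of vertex 1
    y1: float  # ordinate of vertex 1
    x3: float  # abscissa of vertex 3
    y3: float  # ordinate of vertex 3


# Page-core range from the embodiment: left 0, top 80.73, right 485, bottom 697.10.
CORE = Box(x1=0.0, y1=80.73, x3=485.0, y3=697.10)


def outside_core(ch: Box, core: Box = CORE) -> bool:
    """True when any of the four boundary conditions in the text is met."""
    return (
        ch.y1 < core.y1      # vertex 1 above the upper boundary
        or ch.y3 > core.y3   # vertex 3 below the lower boundary
        or ch.x1 < core.x1   # vertex 1 left of the left boundary
        or ch.x3 > core.x3   # vertex 3 right of the right boundary
    )


# The digit "7" used later in the embodiment sits below the lower boundary.
print(outside_core(Box(421.05, 699.83, 425.76, 706.94)))
```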

Numeric characters are extracted from the characters lying outside the page core, and the extracted numeric characters are merged. For example, suppose the numeric characters "7" and "1" are extracted, where the coordinates of "7" are (421.05, 699.83, 425.76, 706.94) and the coordinates of "1" are (416.74, 699.83, 419.47, 706.94). The extracted numeric characters are sorted by their coordinates. Here the abscissa of vertex 1 of "1" is smaller than that of vertex 1 of "7", the abscissa of vertex 3 of "1" is smaller than that of vertex 3 of "7", and the two characters have the same ordinates, so the two characters are on the same line and "1" is to the left of "7".

In addition, the difference between the abscissa of vertex 1 of "7" and the abscissa of vertex 3 of "1" is 1.58, which is below the set spacing threshold (a value between 2.37 and 4.71), so the two numeric characters can be merged into one numeric string. The merged string is "17", and its coordinates are (416.74, 699.83, 425.76, 706.94).
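A minimal sketch of the sort-and-merge step follows. It assumes each extracted digit is a pair of its text and an (x1, y1, x3, y3) tuple, treats two digits as being on the same line only when their ordinates are exactly equal (as in the example), and uses 2.37, the lower end of the stated threshold interval, as the spacing threshold. The function name `merge_digits` is hypothetical.

```python
def merge_digits(chars, gap_threshold=2.37):
    """Merge neighbouring digits on the same line into numeric strings.

    chars: list of (text, (x1, y1, x3, y3)).
    gap_threshold: assumed spacing threshold (the text gives 2.37 to 4.71).
    """
    # Sort top-to-bottom, then left-to-right by vertex 1.
    ordered = sorted(chars, key=lambda c: (c[1][1], c[1][0]))
    merged = []
    for text, (x1, y1, x3, y3) in ordered:
        if merged:
            prev_text, (px1, py1, px3, py3) = merged[-1]
            same_line = py1 == y1 and py3 == y3
            # Gap between this digit's vertex 1 and the previous digit's vertex 3.
            if same_line and x1 - px3 < gap_threshold:
                merged[-1] = (prev_text + text, (px1, py1, x3, y3))
                continue
        merged.append((text, (x1, y1, x3, y3)))
    return merged


digits = [
    ("7", (421.05, 699.83, 425.76, 706.94)),
    ("1", (416.74, 699.83, 419.47, 706.94)),
]
print(merge_digits(digits))
```

For the two digits of the example the gap is about 1.58, so they are joined into the single string "17" spanning from vertex 1 of "1" to vertex 3 of "7".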

Step 602c: Match the extracted valid information against the page number in the saved catalog entry, and determine the first catalog-entry confidence of each candidate page with respect to that catalog entry.

The merged numeric string is compared with the page number in the catalog entry. The page number in this catalog entry is 20 and the merged numeric string is 17, so the two do not match. The confidence of the candidate page is therefore reduced by E. In this embodiment the initial confidence X is 50 and E is 6, so the confidence of this candidate page is 44.

Applying the above method to each candidate page, natural pages 22 to 28, gives first catalog-entry confidences after page-number matching of 44, 44, 44, 80, 44, 44, and 44, respectively.
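The page-number scoring can be sketched as below. The initial confidence X = 50 and the penalty E = 6 come from the text. The bonus applied on a match is not stated; the value 30 used here is an assumption inferred from the confidence of 80 listed for the matching page. The numeric strings attributed to each candidate page are likewise hypothetical, apart from "17" on the first candidate page.

```python
INITIAL_CONFIDENCE = 50   # X in the embodiment
MISMATCH_PENALTY = 6      # E in the embodiment
MATCH_BONUS = 30          # assumption: inferred from the listed value 80


def page_number_confidence(found_numbers, catalog_page):
    """Score one candidate page from the numeric strings found outside the page core."""
    if str(catalog_page) in found_numbers:
        return INITIAL_CONFIDENCE + MATCH_BONUS
    return INITIAL_CONFIDENCE - MISMATCH_PENALTY


# Hypothetical numeric strings on natural pages 22..28.
found = {22: ["17"], 23: ["18"], 24: ["19"], 25: ["20"],
         26: ["21"], 27: ["22"], 28: ["23"]}
scores = [page_number_confidence(found[p], 20) for p in range(22, 29)]
print(scores)
```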

Step 602d: According to the coordinates of all the characters in each page, arrange the characters of each page into rows.

To ensure that all characters in a page are arranged by row, the vertical distance between the horizontal central axes of characters must satisfy a certain condition. The position of a character's horizontal central axis can be computed as the average of the ordinates of its upper and lower vertices, and the vertical distance between two characters is the difference between these averages. In this embodiment of the present invention, whether two characters A and B can be placed in the same row is determined as follows: compute the average of the two ordinates of character A and the difference between its larger and smaller ordinates (its height); compute the same two quantities for character B; then check whether the difference between the two averages is smaller than the product of a parameter and the smaller of the two heights, that is, check whether:

$\left| \frac{Y_{1}(A) + Y_{2}(A)}{2} - \frac{Y_{1}(B) + Y_{2}(B)}{2} \right| < \mathrm{MIN}\left[ Y_{2}(A) - Y_{1}(A),\ Y_{2}(B) - Y_{1}(B) \right] \times j$

Here MIN denotes the smaller of the two values, j is a positive real number less than 1, Y₁(A) and Y₂(A) are the smaller and larger ordinate values of character A, and Y₁(B) and Y₂(B) are the smaller and larger ordinate values of character B. If the condition holds, A and B are placed in the same row; otherwise they are placed in different rows. The ordinates of B and the next character C are then tested against the same condition to decide whether B and C belong to the same row, and so on until all characters in the page have been arranged. After this arrangement, each row has its own minimum bounding rectangle, as shown in Figure 4.
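The same-row condition and the pairwise chaining (A with B, then B with C) can be sketched as follows. Each character is reduced to its pair of ordinates (Y₁, Y₂); the value j = 0.5 is an assumption, since the text only requires 0 < j < 1, and the function names are hypothetical.

```python
def same_row(a, b, j=0.5):
    """Apply the row condition to two characters given as (y1, y2) with y1 < y2.

    j = 0.5 is an assumed value; the text only says 0 < j < 1.
    """
    mid_a = (a[0] + a[1]) / 2
    mid_b = (b[0] + b[1]) / 2
    min_height = min(a[1] - a[0], b[1] - b[0])
    return abs(mid_a - mid_b) < min_height * j


def group_rows(chars, j=0.5):
    """Chain consecutive characters into rows by comparing each with its predecessor."""
    rows = []
    for ch in chars:
        if rows and same_row(rows[-1][-1], ch, j):
            rows[-1].append(ch)
        else:
            rows.append([ch])
    return rows


# Two characters whose central axes differ by 1 share a row; the third does not.
print(group_rows([(100.0, 110.0), (101.0, 111.0), (120.0, 130.0)]))
```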

Step 602e: Match each row of each candidate page against the title in the catalog entry, and determine the second catalog-entry confidence of each candidate page with respect to the catalog entry.

Step 602f: According to the first catalog-entry confidence and the second catalog-entry confidence of each candidate page with respect to each catalog entry, determine the total confidence of each candidate page with respect to each catalog entry, and determine the logical page corresponding to each catalog entry from the total confidence.

The total confidence of each candidate page with respect to each catalog entry is determined from the first catalog-entry confidence dPageVeri, the second catalog-entry confidence dTitleVeri, the weight coefficient dPageWeight of the first catalog-entry confidence, and the weight coefficient dTitleWeight of the second catalog-entry confidence. The two weight coefficients are positive real numbers whose sum is 1; for example, dPageWeight is 0.4 and dTitleWeight is 0.6. With this method, the total confidence of the candidate page whose natural page is page 25, shown in Figure 8, is 94. The candidate page with the highest total confidence is selected as the logical page corresponding to the catalog entry. Alternatively, a total-confidence threshold can be set, and a candidate page whose total confidence exceeds the threshold is taken as the logical page corresponding to the catalog entry.
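One plausible reading of this step is a weighted sum of the two page-level confidences followed by picking the best candidate. The text names the four quantities but does not spell out the combining formula, so the weighted sum below is an assumption, and the input scores in the usage lines are hypothetical values that are not meant to reproduce the figure of 94 quoted for page 25.

```python
def total_confidence(d_page_veri, d_title_veri,
                     d_page_weight=0.4, d_title_weight=0.6):
    """Combine the two catalog-entry confidences (assumed to be a weighted sum)."""
    if d_page_weight <= 0 or d_title_weight <= 0:
        raise ValueError("weights must be positive")
    if abs(d_page_weight + d_title_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return d_page_veri * d_page_weight + d_title_veri * d_title_weight


def pick_logical_page(totals):
    """Return the candidate page with the highest total confidence."""
    return max(totals, key=totals.get)


# Hypothetical (dPageVeri, dTitleVeri) pairs for three candidate pages.
scores = {24: (44, 10), 25: (80, 100), 26: (44, 20)}
totals = {page: total_confidence(p, t) for page, (p, t) in scores.items()}
print(pick_logical_page(totals))
```

The threshold variant mentioned in the text would replace `pick_logical_page` with a filter that keeps every page whose total exceeds the chosen threshold.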

The process of determining the second catalog-entry confidence of each candidate page with respect to the catalog entry is as follows.

Figure 8 shows the content of natural page 25 after row arrangement, as provided by this embodiment of the present invention. The LCS algorithm is used to match the characters of each row in the candidate page against the title characters in the catalog entry, and the first catalog-item confidence of each row with respect to the title catalog item is determined from the matching result. The second catalog-item confidence is determined as follows: determine where the characters of each row are positioned within the row, and determine the second confidence of the row from that position; compare the average character width of each row with the average character width of the body text, and determine the third confidence of the row; and, for the string successfully matched by the LCS algorithm, determine whether it shares its row with other text characters, and determine the fourth confidence of the row. The second catalog-item confidence of each row with respect to the title catalog item is determined from the second, third, and fourth confidences and the weight coefficient of each condition. The total catalog-item confidence of each row is then determined from its first and second catalog-item confidences. Among the total catalog-item confidences of all rows in a page, the largest value is taken as the second catalog-entry confidence of that candidate page with respect to the catalog entry.

The LCS algorithm matches the characters of each row in the candidate page against the title in the catalog entry. Its input is two strings, the title of the catalog entry and the string of the row to be matched, and it returns the longest common substring of the two. The similarity of the two strings, and hence the first catalog-item confidence of the row, is determined from the returned longest common substring. For example, the characters of the second row of page 25, "Section 1 Overview of Self-Awareness" (第一节自我意识概述), are matched against the catalog entry title "Overview of Self-Awareness" (自我意识概述). The output is "Overview of Self-Awareness" (自我意识概述), which is identical to the catalog entry title, so the first catalog-item confidence of this row is 100. The higher a row's first catalog-item confidence, the more similar its characters are to the catalog entry title.
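The matching described here can be illustrated with a standard dynamic-programming longest-common-substring routine. The text does not say how the similarity is turned into a confidence; the mapping used below, the matched length as a percentage of the title length, is an assumption chosen so that a fully contained title scores 100. Both function names are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def title_similarity(row_text, title):
    """Assumed scoring: matched length as a percentage of the title length."""
    if not title:
        return 0.0
    return 100.0 * len(longest_common_substring(row_text, title)) / len(title)


row = "第一节自我意识概述"
title = "自我意识概述"
print(longest_common_substring(row, title), title_similarity(row, title))
```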

The character position of each row is also evaluated. Specifically, the absolute value of the horizontal distance between a first central axis, determined from the coordinates of the row containing the characters, and a second central axis, determined from the left and right boundaries of the page core, is computed, and the second confidence of the row is determined from this absolute value. The smaller the absolute value, the higher the row's second confidence.

Alternatively, when determining the position of the characters of each row, the coordinates of the row can be used to determine whether the row lies roughly at the middle position defined by the left and right boundaries of the page core. When the row lies at the middle position, the coordinates of the row are used to check whether its length satisfies a set length condition, for example that the length is less than 80% of the difference between the right and left boundaries of the page core, and the second confidence of the row is determined according to whether the condition is satisfied. Likewise, when the row lies at the left of the page core, its length is checked against a set length condition, for example that the length is less than 70% of the difference between the right and left boundaries of the page core, and the second confidence of the row is determined accordingly.

When determining the approximate position of the characters of each row from the row coordinates, such as the row coordinates shown in Figure 4, a first difference, between X_m in the row coordinates and the left boundary of the page core, can be compared with a second difference, between the right boundary of the page core and X_n. When the first and second differences differ by more than a set difference threshold, the row is judged to lie toward one side of the page core: if the first difference is larger than the second, the row lies at the right end of the page core, and if the first difference is smaller than the second, the row lies at the left end. Many other methods are of course possible in practice, but any method that, based on the idea of this embodiment, judges the position of each row of characters from coordinate differences should fall within the protection scope of the present invention.
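The position test and the length condition can be sketched together. The sketch assumes that X_m and X_n are the left and right abscissas of the row's bounding rectangle, reads "the difference between the first and second differences" as an absolute value, and uses a difference threshold of 20, which is purely illustrative. The 80% and 70% ratios are the examples given in the text; the function names are hypothetical.

```python
CORE_LEFT, CORE_RIGHT = 0.0, 485.0  # page-core boundaries from the embodiment


def row_position(x_m, x_n, diff_threshold=20.0):
    """Classify a row as 'left', 'center' or 'right' within the page core.

    diff_threshold is an assumed value; the text only calls it a set threshold.
    """
    first = x_m - CORE_LEFT      # gap between the row start and the left boundary
    second = CORE_RIGHT - x_n    # gap between the row end and the right boundary
    if abs(first - second) <= diff_threshold:
        return "center"
    return "right" if first > second else "left"


def length_ok(x_m, x_n, position):
    """Check the example length conditions: under 80% (center) or 70% (left)."""
    ratios = {"center": 0.8, "left": 0.7}
    if position not in ratios:
        return False
    return (x_n - x_m) < ratios[position] * (CORE_RIGHT - CORE_LEFT)


pos = row_position(180.0, 305.0)
print(pos, length_ok(180.0, 305.0, pos))
```

How a passed or failed check translates into a numeric second confidence is left open by the text, so the sketch stops at the boolean result.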

The average character width of each row is also compared with the average character width of the body text of the digital document. For example, the average character width of the second row in Figure 8 is 13.77 and the average character width of the body text of the digital document is 10.29; the second row's average character width is larger than that of the body text, and the third confidence of the row is determined accordingly. The larger the average character width of a row, the higher its third confidence.

In addition, for the string successfully matched by the LCS algorithm, it is determined whether the row contains other text characters. Specifically, a text character whose coordinates are directly adjacent to the matched string counts as another text character in the row. A text character that is only indirectly connected to the matched string does not count. For example, in "Section 2 Overview of Self-Awareness" (第二节自我意识概述), the matched string is "Overview of Self-Awareness" (自我意识概述); what adjoins this string in coordinates is a space, so "Section 2" (第二节) is only indirectly connected to it, and the row containing the matched string is considered to contain no other text characters. The fourth confidence of the row is determined according to whether the row containing the matched string contains other text characters.

The total catalog-item confidence of each row is determined from the row's first catalog-item confidence, its second catalog-item confidence (itself determined from the second, third, and fourth confidences), and the weight coefficient of each confidence, where every weight coefficient is a positive real number. The highest total catalog-item confidence among the rows is taken as the second catalog-entry confidence of the candidate page with respect to the catalog entry.
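The row-level aggregation can be sketched as a weighted combination followed by a maximum over the rows of a page. The text does not give the combining formula or the weights, so flattening the four row confidences into a single weighted sum and the weights (0.4, 0.2, 0.2, 0.2) are assumptions, and the row scores in the usage lines are hypothetical.

```python
def row_total_confidence(lcs_conf, position_conf, width_conf, alone_conf,
                         weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of the LCS, position, width and same-row confidences.

    The weights are assumed values; the text only requires them to be positive.
    """
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    parts = (lcs_conf, position_conf, width_conf, alone_conf)
    return sum(w * p for w, p in zip(weights, parts))


def page_title_confidence(rows):
    """Second catalog-entry confidence of a page: the best row total."""
    return max(row_total_confidence(*row) for row in rows)


# Hypothetical rows: (LCS, position, width, same-row) confidences.
rows = [(10, 50, 0, 0), (100, 100, 100, 100), (0, 20, 0, 0)]
print(page_title_confidence(rows))
```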

Moreover, in a digital document the main reason the page number in a catalog entry does not correspond to the actual logical page is the content that precedes the body text, such as the copyright page, the catalog pages, the foreword, the preface, and appendices. Therefore, in the process of establishing links between the catalog and the body text, it suffices to identify the number of pages occupied by this preceding content; from that number and the page numbers in the saved catalog entries, the logical page corresponding to each catalog entry can also be determined, and the link between each catalog entry and each logical page can be established. Those skilled in the art can carry out the specific implementation according to the method provided by the embodiments of the present invention, so the details are not repeated here.
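This alternative reduces to adding a fixed offset to each catalog page number. A minimal sketch follows; the front-matter count of 5 is an assumption read off the worked example, in which the catalog page number 20 was found on natural page 25, and the second catalog entry and the function name `build_links` are hypothetical.

```python
def build_links(catalog_entries, front_matter_pages):
    """Map each catalog title to its logical page by a fixed page offset.

    catalog_entries: list of (title, printed_page) pairs.
    front_matter_pages: number of pages before the body text (assumed known).
    """
    return {title: page + front_matter_pages for title, page in catalog_entries}


entries = [
    ("自我意识概述", 20),          # entry from the embodiment
    ("hypothetical entry", 31),    # made-up entry for illustration
]
print(build_links(entries, front_matter_pages=5))
```

A fixed offset only holds when nothing unnumbered, such as inserted plates, interrupts the body text, which is one reason the confidence-based matching described earlier is the more general route.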

As shown in Figure 9, an embodiment of the present invention provides a device for establishing links between a digital document catalog and the body text, where the digital document catalog contains multiple catalog entries and each catalog entry contains at least one catalog item. The device includes:

A logical page identification module 90, configured to obtain at least one catalog item from each saved catalog entry and, according to the at least one catalog item, determine in the digital document each logical page corresponding to each catalog entry;

A link establishing module 91, configured to establish a link between each catalog entry and each corresponding logical page.

The logical page identification module 90 includes:

A first identification unit 901, configured to determine in the digital document each logical page corresponding to each catalog entry when the at least one catalog item obtained is the page-number catalog item.

The first identification unit 901 includes:

A first candidate page determination subunit 9010, configured to determine, according to preset rules and the page-number catalog item of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry lies;

A first matching subunit 9011, configured to extract valid information from each candidate page and compare whether each piece of valid information is the same as the page-number catalog item of the catalog entry;

A first calculation subunit 9012, configured to determine the first confidence of each candidate page with respect to the catalog entry according to whether each piece of valid information is the same as the page-number catalog item;

A first logical page determination subunit 9013, configured to determine each logical page corresponding to each catalog entry according to the first confidence.

A second identification unit 902, configured to determine in the digital document each logical page corresponding to each catalog entry when the at least one catalog item obtained is the title catalog item.

The second identification unit 902 includes:

A first row-confidence determination subunit 9020, configured to determine, according to the similarity between the title catalog item of each catalog entry and each row of characters in each page of the digital document, the first confidence of each row of characters in that page with respect to the title catalog item;

A second row-confidence determination subunit 9021, configured to determine, according to at least one piece of feature information of each row of characters in that page, the second confidence of each row of characters in that page with respect to the title catalog item;

A second calculation subunit 9022, configured to determine the second confidence of that page with respect to the catalog entry according to the total confidence of each row of characters in that page with respect to the title catalog item;

A second logical page determination subunit 9023, configured to determine each logical page corresponding to each catalog item according to the second confidence.

The at least one piece of feature information used by the second row-confidence determination subunit 9021 includes: position information of each row of characters in the digital document, or average character width information of each row of characters in the digital document, or whether the characters in the digital document that fully match the title catalog item share their row with other text characters.

A third identification unit 903, configured to determine in the digital document each logical page corresponding to each catalog entry when the at least one catalog item obtained is the page-number catalog item and the title catalog item.

The third identification unit 903 includes:

A second candidate page determination subunit 9030, configured to determine, according to the page-number catalog item of each catalog entry, the candidate pages in which the logical page corresponding to each catalog entry lies;

A first confidence determination subunit 9031, configured to determine the first confidence of each candidate page with respect to each catalog entry;

A second confidence determination subunit 9032, configured to determine the second confidence of each candidate page with respect to each catalog entry;

A total confidence determination subunit 9033, configured to determine the total confidence of each candidate page with respect to each catalog entry according to the first confidence and the second confidence;

A third logical page determination subunit 9034, configured to determine the logical page corresponding to each catalog entry according to the total confidence.

The first confidence determination subunit 9031 includes:

A first matching submodule, configured to extract valid information from each candidate page and compare whether each piece of valid information is the same as the page-number catalog item of the catalog entry;

A first calculation submodule, configured to determine the first catalog-entry confidence of each candidate page with respect to the catalog entry according to whether each piece of valid information is the same as the page-number catalog item.

The second confidence determination subunit 9032 includes:

A first row-confidence determination submodule, configured to determine, according to the similarity between the title catalog item of each catalog entry and each row of characters in each candidate page, the first catalog-item confidence of each row of characters in that candidate page with respect to the title catalog item;

A second row-confidence determination submodule, configured to determine, according to at least one piece of feature information of each row of characters in the candidate page, the second catalog-item confidence of each row of characters in that candidate page with respect to the title catalog item;

A second calculation submodule, configured to determine the second catalog-entry confidence of the candidate page with respect to the catalog entry according to the total confidence of each row of characters in the candidate page with respect to the title catalog item.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A method for establishing links between a digital document catalog and a body text, wherein the digital document catalog contains a plurality of catalog entries and each catalog entry contains at least one catalog item, characterized in that the method comprises:

obtaining at least one catalog item from each saved catalog entry, and determining, according to the at least one catalog item, each logical page in the digital document corresponding to each catalog entry;

establishing a link between each catalog entry and each corresponding logical page.

2. The method according to claim 1, wherein said obtaining at least one catalog item comprises:

obtaining a page-number catalog item and/or a title catalog item.

3. The method according to claim 2, wherein when the obtained at least one catalog item information is page number catalog item information, determining each logical page corresponding to each catalog entry in the digital document comprises:

Determining, according to preset rules and the page number directory item information of each directory entry, the candidate pages among which the logical page corresponding to that directory entry is located;

Extracting valid information from each candidate page, and comparing each piece of valid information with the page number directory item information in the directory entry to check whether they are the same;

According to the comparison result of each valid information and the page number directory entry information, determine the first directory entry confidence level of each candidate page corresponding to the directory entry;

Each logical page corresponding to each directory entry is determined according to the first directory entry confidence.
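The page-number matching of claim 3 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the claim does not define the "preset rules", the "valid information", or the confidence formula, so the offset-plus-window rule, the numeric-token extraction, and the 1.0/0.0 score below are hypothetical choices, as are all function and parameter names.

```python
import re

def candidate_pages(printed_number, offset, window, page_count):
    """Assumed preset rule: physical index = printed number + offset, +/- window."""
    center = printed_number + offset
    return [p for p in range(center - window, center + window + 1)
            if 0 <= p < page_count]

def page_number_confidence(page_lines, printed_number):
    """Assumed first directory entry confidence: 1.0 if any numeric token
    on the page equals the printed page number, otherwise 0.0."""
    tokens = re.findall(r"\d+", " ".join(page_lines))
    return 1.0 if str(printed_number) in tokens else 0.0

def locate_by_page_number(pages, printed_number, offset=0, window=1):
    """Return the candidate page index with the highest confidence, or None.
    Ties are broken in favour of the page closest to the expected index."""
    center = printed_number + offset
    best_page, best_key = None, None
    for p in candidate_pages(printed_number, offset, window, len(pages)):
        conf = page_number_confidence(pages[p], printed_number)
        key = (conf, -abs(p - center))
        if conf > 0 and (best_key is None or key > best_key):
            best_page, best_key = p, key
    return best_page
```

Here `pages` is assumed to be a list of pages, each a list of text lines; a real implementation would restrict the search to header and footer regions rather than scanning every number on the page.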

4. The method according to claim 2, wherein when at least one catalog item information acquired is title catalog item information, determining each logical page corresponding to each catalog entry in the digital document comprises:

Determining, according to the similarity between the title directory item information in each directory entry and each line of characters in each page of the digital document, a first directory item information confidence of each line of characters in that page with respect to the title directory item information;

Determining, according to at least one piece of feature information of each line of characters in that page, a second directory item information confidence of each line of characters in that page with respect to the title directory item information;

Determining, according to the first directory item information confidence and the second directory item information confidence of each line of characters, a total directory item information confidence of each line of characters in that page with respect to the title directory item information;

Determining, according to the total directory item information confidence of each line of characters in that page, a second directory entry confidence of that page with respect to the directory entry;

Determining each logical page corresponding to each directory entry according to the second directory entry confidence.

5. The method according to claim 4, wherein the at least one feature information includes:

The position information of each line of characters in the digital document, or the average character width information of each line of characters in the digital document, or information on whether the characters in the digital document that exactly match the title directory item information share a line with other text characters.
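The line-level scoring of claims 4 and 5 can be sketched as follows. This is a hedged illustration, not the claimed formulas: the use of `difflib.SequenceMatcher` for similarity, the equal-weight average of the three features, the product used as the total line confidence, and the maximum over lines used as the page confidence are all assumptions, and the `Line` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Line:
    text: str
    y_ratio: float         # vertical position: 0.0 = top of page, 1.0 = bottom
    avg_char_width: float  # average character width of the line

def similarity_confidence(title, line):
    """First directory item information confidence: text similarity in [0, 1]."""
    return SequenceMatcher(None, title.strip().lower(),
                           line.text.strip().lower()).ratio()

def feature_confidence(title, line, body_char_width):
    """Second directory item information confidence from the three features
    named in claim 5 (position, average width, title alone on its line)."""
    position = 1.0 - line.y_ratio                                  # headings tend to sit high
    width = min(line.avg_char_width / body_char_width, 2.0) / 2.0  # larger type scores higher
    alone = 1.0 if line.text.strip().lower() == title.strip().lower() else 0.5
    return (position + width + alone) / 3.0

def page_title_confidence(title, lines, body_char_width=10.0):
    """Second directory entry confidence of a page: best total line confidence."""
    return max((similarity_confidence(title, l) *
                feature_confidence(title, l, body_char_width) for l in lines),
               default=0.0)
```

Under these assumptions, a page whose heading line matches the title in large type near the top scores higher than a page that merely mentions the title inside body text.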

6. The method according to claim 2, wherein when the at least one catalog item information is page number catalog item information and title catalog item information, determining each logical page corresponding to each catalog entry in the digital document comprises:

According to the page number directory entry information in each directory entry, determine the candidate page where the logical page corresponding to each directory entry is located;

Extracting valid information from each candidate page, comparing each piece of valid information with the page number directory item information in the directory entry to check whether they are the same, and determining a first directory entry confidence of each candidate page with respect to the directory entry; and

Determining, according to the similarity between the title directory item information in each directory entry and each line of characters in each candidate page, a first directory item information confidence of each line of characters in the candidate page with respect to the title directory item information; determining, according to at least one piece of feature information of each line of characters in the candidate page, a second directory item information confidence of each line of characters in the candidate page with respect to the title directory item information; determining, according to the first directory item information confidence and the second directory item information confidence of each line of characters in the candidate page, a total directory item information confidence of each line of characters in the candidate page with respect to the title directory item information; and determining, according to the total directory item information confidence, a second directory entry confidence of the candidate page with respect to the directory entry;

determining the total confidence of each candidate page corresponding to each directory entry according to the first directory entry confidence and the second directory entry confidence;

A logical page corresponding to each directory entry is determined according to the total confidence.
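The final combination step of claim 6 can be sketched as follows. The claim does not state how the first and second directory entry confidences are merged, so the weighted sum and the `page_weight` parameter below are assumptions made only for illustration.

```python
def total_confidence(first_conf, second_conf, page_weight=0.5):
    """Assumed total confidence: weighted sum of the page-number (first)
    and title (second) directory entry confidences."""
    return page_weight * first_conf + (1.0 - page_weight) * second_conf

def pick_logical_page(candidates, page_weight=0.5):
    """candidates maps page index -> (first_conf, second_conf).
    Returns the page index with the highest total confidence, or None."""
    if not candidates:
        return None
    return max(candidates,
               key=lambda p: total_confidence(*candidates[p], page_weight))
```

For example, with candidates `{4: (1.0, 0.2), 5: (1.0, 0.9), 6: (0.0, 0.95)}` and equal weights, page 5 is selected because both its page number and its title evidence are strong.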

7. A device for establishing a link between a digital document catalog and a text, wherein the digital document catalog contains a plurality of catalog entries, and each catalog entry contains at least one catalog item information, wherein the device includes:

A logical page identification module, configured to obtain at least one directory entry information from each stored directory entry, and determine each logical page corresponding to each directory entry in the digital document according to the at least one directory entry information;

A link establishing module, configured to establish a link between each directory entry and each corresponding logical page.

8. The device according to claim 7, wherein the logical page identification module comprises:

The first identifying unit is configured to determine each logical page corresponding to each directory entry in the digital document when the obtained at least one directory entry information is page number directory entry information.

9. The device according to claim 8, wherein the first identification unit comprises:

The first candidate page determination subunit is configured to determine the candidate page where the logical page corresponding to each directory entry is located according to the page number directory entry information of each directory entry according to preset rules;

The first matching subunit is used to extract valid information from each candidate page, and compare whether each valid information is the same as the page number directory entry information in the directory entry;

The first calculation subunit is used to determine the first directory entry confidence of each candidate page with respect to the directory entry according to whether each piece of valid information is the same as the page number directory item information;

The logical page first determining subunit is configured to determine each logical page corresponding to each directory entry according to the first directory entry confidence.

10. The device according to claim 7, wherein the logical page identification module comprises:

The second identifying unit is configured to determine each logical page corresponding to each directory entry in the digital document when the acquired at least one directory entry information is title directory entry information.

11. The device according to claim 10, wherein the second identification unit comprises:

The first line-confidence determination subunit is used to determine, according to the similarity between the title directory item information in each directory entry and each line of characters in each page of the digital document, a first directory item information confidence of each line of characters in that page with respect to the title directory item information;

The second determination subunit of line confidence is used to determine the second directory item information confidence level of each line of characters in the page of digital documents corresponding to the title directory item information according to at least one characteristic information of each line of characters in the page of digital documents;

The second calculation subunit is used to determine the second directory entry confidence level of the page digital document corresponding to the directory entry according to the total confidence level of each line of characters in the page digital document corresponding to the title directory entry information;

The second logical page determining subunit is configured to determine each logical page corresponding to each directory entry according to the second directory entry confidence.

12. The device according to claim 7, wherein the logical page identification module comprises:

The third identifying unit is configured to determine each logical page corresponding to each directory entry in the digital document when the at least one directory entry information is page number directory entry information and title directory entry information.

13. The device according to claim 12, wherein the third identification unit comprises:

The second candidate page determination unit is configured to determine the candidate page where the logical page corresponding to each directory entry is located according to the page number directory entry information in each directory entry;

A first confidence determination subunit, configured to determine a first directory entry confidence of each candidate page with respect to each directory entry;

The second confidence determination subunit is configured to determine a second directory entry confidence of each candidate page with respect to each directory entry;

A total confidence determination subunit, configured to determine the total confidence of each candidate page corresponding to each directory entry according to the first directory entry confidence and the second directory entry confidence;

The third logical page determination subunit is configured to determine the logical page corresponding to each directory entry according to the total confidence.

14. The device according to claim 13, wherein the first confidence determination subunit comprises:

The first matching submodule is used to extract valid information from each candidate page, and compare whether each valid information is the same as the page number directory item information in the directory entry;

The first calculation sub-module is configured to determine the first directory entry confidence level of each candidate page corresponding to the directory entry according to whether each valid information is the same as the page number directory entry information.

15. The device according to claim 13, wherein the second confidence determination subunit comprises:

The first line-confidence determination submodule is used to determine, according to the similarity between the title directory item information in each directory entry and each line of characters in each candidate page, a first directory item information confidence of each line of characters in the candidate page with respect to the title directory item information;

The second determination submodule of line confidence is used to determine the second directory item information confidence level of each line of characters in the candidate page corresponding to the title directory item information according to at least one characteristic information of each line of characters in the candidate page;

The second calculation submodule is used to determine the second directory entry confidence of the candidate page with respect to the directory entry according to the total confidence of each line of characters in the candidate page with respect to the title directory item information.