CN110442719B

CN110442719B - Text processing method, device, equipment and storage medium

Info

Publication number: CN110442719B
Application number: CN201910734656.9A
Authority: CN
Inventors: 张航
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing Feishu Technology Co ltd; Douyin Vision Co Ltd; Douyin Vision Beijing Co Ltd
Priority date: 2019-08-09
Filing date: 2019-08-09
Publication date: 2022-03-04
Anticipated expiration: 2039-08-09
Also published as: CN110442719A

Abstract

The embodiments of the disclosure disclose a text processing method, apparatus, device and storage medium. The method comprises: obtaining character position information contained in a text to be blocked, and determining at least one text line and text line position information of the text line according to the character position information; determining dividing line information contained in the text to be blocked, and determining a target distance between the text lines according to the text line position information and the dividing line information; and clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to a clustering result of the text lines. By blocking the text according to the character position information and the dividing line information, the method simplifies the text blocking process and improves the accuracy of the text blocking result.

Description

Text processing method, device, equipment and storage medium

Technical Field

The embodiments of the present disclosure relate to the field of information technologies, and in particular, to a text processing method, apparatus, device, and storage medium.

Background

Portable Document Format (PDF) is a file format that presents documents in a manner independent of application programs, hardware, and operating systems. A PDF file reproduces the document style faithfully, but because its main purpose is to guarantee the rendering result, it ignores the structural information of the content. The logical or semantic structure between the contents of a PDF document therefore cannot be obtained directly, which makes the document difficult to structure. If the PDF document is not divided into text blocks, extracting the characters directly can scramble their order. The text regions therefore need to be framed so that the text order inside each block is correct, with the blocks themselves arranged from top to bottom and from left to right. Text blocking is thus the basis for structuring PDF documents.

Existing text blocking methods include: converting the two-dimensional plane segmentation problem into a one-dimensional character string analysis problem through the horizontal and vertical coordinates of page elements and then distinguishing the corresponding elements by rules; segmentation algorithms based on morphological operations; the Thiessen polygon (Voronoi) algorithm; the constrained run-length algorithm; and region detection algorithms based on deep learning. However, these methods either require a large number of rules and parameters and yield recognition results of limited accuracy, or require a large amount of labeled training data, which makes the process cumbersome.

Disclosure of Invention

The disclosure provides a text processing method, a text processing device, a text processing apparatus and a storage medium, so as to simplify a text blocking process and improve the accuracy of a text blocking result.

In a first aspect, an embodiment of the present disclosure provides a text processing method, including:

acquiring character position information contained in a text to be blocked, and determining at least one text line and text line position information of the text line according to the character position information;

determining dividing line information contained in the text to be blocked, and determining a target distance between text lines according to the text line position information and the dividing line information;

and clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.

In a second aspect, an embodiment of the present disclosure further provides a text processing apparatus, including:

the text line determining module is used for acquiring character position information contained in a text to be blocked and determining at least one text line and text line position information of the text line according to the character position information;

the target distance determining module is used for determining the dividing line information contained in the text to be divided according to the text line position information and determining the target distance between the text lines according to the text line position information and the dividing line information;

and the text block determining module is used for clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines.

In a third aspect, an embodiment of the present disclosure further provides a terminal device, where the terminal device includes:

one or more processing devices;

storage means for storing one or more programs;

when executed by the one or more processing devices, the one or more programs cause the one or more processing devices to implement the text processing method according to any of the embodiments of the present disclosure.

In a fourth aspect, the embodiments of the present disclosure also provide a computer-readable storage medium storing computer-executable instructions which, when executed by a computer processor, perform the text processing method according to any one of the embodiments of the present disclosure.

According to the embodiments of the present disclosure, character position information contained in a text to be blocked is obtained; at least one text line and its text line position information are determined according to the character position information; dividing line information contained in the text to be blocked is determined; a target distance between the text lines is determined according to the text line position information and the dividing line information; the text lines are clustered according to the target distance; and at least one text block of the text to be blocked is determined according to the clustering result of the text lines. Because the text is blocked according to the character position information and the dividing line information, the text blocking process is simplified and the accuracy of the text blocking result is improved.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

Fig. 1 is a flowchart of a text processing method according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a text processing method according to an embodiment of the present disclosure;

Fig. 3a is a flowchart of a text processing method according to an embodiment of the disclosure;

Fig. 3b is a schematic diagram of a text block extraction result in a text processing method according to an embodiment of the present disclosure;

Fig. 3c is a diagram illustrating segmentation in a text processing method according to an embodiment of the present disclosure;

Fig. 3d is a schematic diagram of a text line clustering result in a text processing method according to an embodiment of the present disclosure;

Fig. 3e is a schematic diagram of a text to be blocked in a text processing method according to an embodiment of the present disclosure;

Fig. 3f is a schematic diagram of a blocking result of a text to be blocked in a text processing method according to an embodiment of the present disclosure;

Fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

In the following embodiments, optional features and examples are provided in each embodiment, and various features described in the embodiments may be combined to form a plurality of alternatives, and each numbered embodiment should not be regarded as only one technical solution.

Example one

Fig. 1 is a flowchart of a text processing method according to an embodiment of the present disclosure. The embodiments of the present disclosure may be applicable to the case when text blocking is performed on PDF text, and the method may be performed by a text processing apparatus, which may be implemented in software and/or hardware, for example, the text processing apparatus may be configured in a terminal device. As shown in fig. 1, the method includes:

S110, character position information contained in the text to be blocked is obtained, and at least one text line and text line position information of the text line are determined according to the character position information.

In the embodiment of the present disclosure, the text to be blocked may be one or more pages contained in the PDF text. The character position information may be character coordinates of each character in the text to be partitioned. It can be understood that the PDF text may include elements such as words and pictures, and a data stream of the PDF text includes element information of all elements, and corresponding element information is different for different element types. For example, when the element type is text, the element information may be information such as coordinates, a font, a size, and the like, and when the element type is picture, the element information may be information such as coordinates, a height, and the like.

By parsing the data stream of PDF text, the coordinates of each word can be obtained. For example, whether an element corresponding to the element information is a character may be determined according to the element information, and when the element corresponding to the element information is a character, a coordinate included in the element information is acquired as a character coordinate. Optionally, it may be determined whether the element information includes information corresponding to a "font", and if the element information includes information corresponding to a "font", it is determined that the element corresponding to the element information is a character.
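The optional "font" check described above can be sketched as follows. This is a minimal illustration only: it assumes the element information has already been parsed into dictionaries, and the keys `x`, `y`, `font`, `size` and `height` are hypothetical names chosen for the example, not a real PDF library's API.

```python
from typing import Dict, List, Tuple


def character_coordinates(elements: List[Dict]) -> List[Tuple[float, float]]:
    """Return the (x, y) coordinates of every element judged to be a character.

    Assumption: an element is a character if its element information
    contains a "font" entry, as in the optional rule described in the text.
    """
    coords = []
    for info in elements:
        if "font" in info:  # only characters carry font information
            coords.append((info["x"], info["y"]))
    return coords


# Hypothetical parsed element information for one page.
elements = [
    {"x": 10.0, "y": 20.0, "font": "SimSun", "size": 12.0},  # character
    {"x": 0.0, "y": 100.0, "height": 50.0},                  # picture
    {"x": 22.0, "y": 20.0, "font": "SimSun", "size": 12.0},  # character
]
print(character_coordinates(elements))
```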

After the character position information of all characters contained in the text to be blocked is obtained, the text lines and their text line position information are determined according to the character position information of each character. Optionally, the characters may be divided into text lines according to the character coordinates of each character, and the text line position information of each text line may then be determined. The manner of dividing the characters into text lines is not limited herein. For example, a region formed by characters that have the same ordinate and whose abscissas are within a set distance threshold of each other may be used as one text line, or a region formed by characters that have the same ordinate and continuous abscissas may be used as one text line. The distance threshold may be set according to the character layout of the text to be blocked.

Optionally, the position information of a text line may include the vertex coordinates of the text line (e.g., the upper-left, lower-left, upper-right or lower-right vertex coordinates), the height of the text line, and the width of the text line. After the text lines are determined according to the character coordinates, for each text line, the character coordinates of the end character on at least one side of the text line are determined, and the vertex coordinates of the text line are determined from them. For example, if the vertex coordinate of the text line is taken to be its upper-left vertex, the upper-left vertex coordinate of the leftmost character of the text line is used as the vertex coordinate of the text line. The element information corresponding to the characters is then acquired, and the height of the text line is determined based on the font size contained in the element information.

In the embodiment of the present disclosure, the width of a text line may be determined according to the number of characters and the element information included in the text line, or according to the character coordinates of the characters at the end points of the text line. For example, the number of characters contained in a text line may be determined, and the width of the text line may be determined based on the font size, the number of characters, and the character spacing. Alternatively, the character coordinates of the end points on both sides of the text line may be obtained, and the distance between them used as the width of the text line.

In one embodiment, the determining the at least one text line and the text line position information of the text line according to the text position information includes: and taking the characters with continuous horizontal coordinates and same vertical coordinates as a text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.

Preferably, the text line position information of the text line may be determined based on the character position information in the text line, with characters having continuous abscissa and the same ordinate as one text line. The characters with continuous horizontal coordinates and same vertical coordinates are used as a text line, so that each divided text line is ensured to be continuous character information, and the division of the text lines is more accurate. For the manner of determining the position information of the text line according to the position information of the characters in the text line, reference may be made to the above description, which is not repeated herein.
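The line-forming rule (same ordinate, continuous abscissas) and the derivation of text line position information can be sketched as below. This is an illustrative sketch under stated assumptions: `Char`, `TextLine`, `group_into_lines` and the `max_gap` tolerance are hypothetical names, each character is assumed to carry its own width, and the line width is taken as the distance from the left edge of the first character to the right edge of the last one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    x: float      # left edge of the character
    y: float      # top edge of the character
    width: float  # character width
    size: float   # font size, used here as the character height


@dataclass
class TextLine:
    x: float  # upper-left vertex, abscissa
    y: float  # upper-left vertex, ordinate
    w: float  # width of the text line
    h: float  # height of the text line


def group_into_lines(chars: List[Char], max_gap: float = 1.0) -> List[TextLine]:
    """Group characters with the same ordinate and continuous abscissas."""
    groups: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if groups:
            last = groups[-1][-1]
            same_row = ch.y == last.y
            # "Continuous" is approximated by a small allowed gap (assumption).
            contiguous = ch.x - (last.x + last.width) <= max_gap
            if same_row and contiguous:
                groups[-1].append(ch)
                continue
        groups.append([ch])

    lines = []
    for group in groups:
        first, last = group[0], group[-1]
        lines.append(TextLine(
            x=first.x,
            y=first.y,
            w=(last.x + last.width) - first.x,
            h=max(c.size for c in group),
        ))
    return lines


chars = [Char(0, 0, 10, 12), Char(10, 0, 10, 12), Char(50, 0, 10, 12), Char(0, 20, 10, 12)]
for line in group_into_lines(chars):
    print(line)
```

In this example the third character lies on the same row but far to the right, so it starts a separate text line, which is what lets multi-column layouts be separated later.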

S120, determining the dividing line information contained in the text to be divided, and determining the target distance between the text lines according to the text line position information and the dividing line information.

In order to more accurately divide text blocks of a text to be partitioned, in the embodiment of the disclosure, dividing line information included in the text to be partitioned is used as a parameter for dividing the text blocks, a target distance between text lines is determined based on text line position information and the dividing line information, and the text blocks are divided based on the target distance between the text lines. The dividing line information is used as a parameter for dividing the text block, so that two text lines which are close to each other but are not actually in the same text area (for example, the two text lines are close to each other but have the dividing line therebetween) are divided into different text blocks.

In the embodiment of the disclosure, the segmentation line information contained in the text to be segmented can be determined according to an image segmentation algorithm. In consideration of the fact that the characters in the text to be partitioned may affect the image segmentation result, and the extracted segmentation line information is inaccurate, in the embodiment of the disclosure, the character parts in the text to be partitioned may be color-filled, so as to remove the influence of the character parts on the image segmentation result.

Optionally, the image after the edge detection may be used to obtain a pixel matrix of the image only including the partition line, and if the pixel value of a certain pixel point in the image satisfies the set pixel value range, it indicates that the color change at the pixel point is large, and may be the partition line.

In one embodiment, the determining information of the segmentation line included in the text to be segmented includes:

converting other areas except the text line in the text to be blocked into a picture format, and carrying out graying on the converted picture to obtain a grayscale picture;

filling the pixel values of the pixel points in the grayscale picture that lie in the regions corresponding to the text line position information, to obtain a picture to be detected;

and carrying out edge detection on the picture to be detected through an edge detection algorithm, and taking the detected edge information as the information of the dividing line.

In order to fill the color of the character part in the text to be blocked, other areas except the text line in the text to be blocked need to be converted into a picture format, the converted picture is grayed to obtain a grayscale picture, and the text line is filled with the pixel value based on the grayscale picture. Optionally, if the background color of the text to be segmented is a single color, the pixel value of the background color can be directly used as the pixel value of the text line region; if the background of the text to be blocked comprises multiple colors, such as gradient colors, interpolation filling can be performed on pixel values in the text line region based on pixel values of pixel points around the text line region, so that the picture to be detected is obtained. The interpolation filling mode can enable the pixel value in the text line area to be closer to the background pixel value, so that the extraction of the segmentation line information is more accurate. The interpolation filling method is not limited herein. Illustratively, the filling of pixel values within a text line region may be performed using bilinear interpolation.
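For the single-color background case, the filling step can be sketched as follows. The sketch assumes the grayscale picture is a row-major list of lists of integer pixel values and that text line regions are given as integer `(x, y, w, h)` boxes; the function name `fill_text_regions` is hypothetical. Interpolation filling for gradient backgrounds is not shown.

```python
from typing import List, Tuple

Gray = List[List[int]]  # row-major grayscale picture, values 0..255


def fill_text_regions(gray: Gray,
                      boxes: List[Tuple[int, int, int, int]],
                      background: int) -> Gray:
    """Return a copy of `gray` with every text line box set to `background`."""
    filled = [row[:] for row in gray]  # leave the input picture untouched
    height = len(filled)
    width = len(filled[0]) if filled else 0
    for x, y, w, h in boxes:
        # Clip the box to the picture so out-of-range boxes are harmless.
        for q in range(max(y, 0), min(y + h, height)):
            for p in range(max(x, 0), min(x + w, width)):
                filled[q][p] = background
    return filled


# A 2x4 picture with a white background and dark "text" in the middle columns.
gray = [[255, 0, 0, 255],
        [255, 0, 0, 255]]
print(fill_text_regions(gray, [(1, 0, 2, 2)], background=255))
```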

After the picture to be detected is obtained, the edge detection algorithm is used for detecting the information of the dividing line contained in the picture to be detected. In the embodiment of the present disclosure, the edge detection algorithm is not limited. For example, the edge detection may be performed on the picture to be detected by using Canny operator, Roberts operator, Sobel operator, and the like.
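As one possible instance of the edge detection step, the sketch below applies the Sobel operator in plain Python. It is an illustration rather than the embodiment's implementation: the function name `sobel_magnitude` is hypothetical, the gradient magnitude is approximated by |gx| + |gy| clipped to 255, and border pixels are simply left at 0.

```python
from typing import List

Gray = List[List[int]]  # row-major grayscale picture, values 0..255


def sobel_magnitude(gray: Gray) -> Gray:
    """Approximate gradient magnitude with the 3x3 Sobel kernels."""
    h = len(gray)
    w = len(gray[0]) if h else 0
    out = [[0] * w for _ in range(h)]
    for q in range(1, h - 1):
        for p in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = ((gray[q - 1][p + 1] + 2 * gray[q][p + 1] + gray[q + 1][p + 1])
                  - (gray[q - 1][p - 1] + 2 * gray[q][p - 1] + gray[q + 1][p - 1]))
            # Vertical gradient: bottom row minus top row.
            gy = ((gray[q + 1][p - 1] + 2 * gray[q + 1][p] + gray[q + 1][p + 1])
                  - (gray[q - 1][p - 1] + 2 * gray[q - 1][p] + gray[q - 1][p + 1]))
            out[q][p] = min(255, abs(gx) + abs(gy))
    return out


# A white 5x5 picture with a black horizontal dividing line on row 2.
picture = [[255] * 5 for _ in range(5)]
picture[2] = [0] * 5
edges = sobel_magnitude(picture)
for row in edges:
    print(row)
```

The rows directly above and below the dividing line respond strongly, which is the kind of pixel matrix the later division point counting can threshold.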

In one embodiment, the spatial distance between the text lines may be adjusted according to the information of the dividing lines between the text lines, so as to obtain the target distance between the text lines. Illustratively, the spatial distance between text lines is calculated according to the text line position information, the existence condition of the segmentation line between the text lines is judged according to the text line position information and the segmentation line information, the adjustment parameter of the spatial distance is determined according to the existence condition of the segmentation line between the text lines, and the target distance between the text lines is calculated according to the spatial distance and the adjustment parameter. Optionally, a correspondence between the existence of the segmentation lines between the text lines and the adjustment parameters may be preset, and after the existence of the segmentation lines between the text lines is determined, the adjustment parameters of the spatial distance between the text lines are determined by searching the preset correspondence. For example, the spatial distance and the adjustment parameter may be summed or multiplied, and the obtained operation result is used as the target distance.

The existence condition of dividing lines between text lines may be that a dividing line exists between the text lines or that no dividing line exists between them. The case in which a dividing line exists may be further subdivided according to the length of the dividing line, for example by dividing the length into N length ranges and taking the case corresponding to each length range as one existence condition; or according to the ratio of the length of the dividing line to the width of the text line, by dividing the ratio into M ratio ranges and taking the case corresponding to each ratio range as one existence condition. M and N are integers greater than 1.
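The preset correspondence between existence conditions and adjustment parameters can be sketched as a small lookup. All numeric values below are hypothetical and chosen only for illustration; the sketch uses M = 2 ratio ranges and the summation variant (spatial distance plus adjustment parameter).

```python
from typing import List, Tuple

# Hypothetical correspondence: (upper bound of the ratio
# dividing-line length / text line width, adjustment parameter).
RATIO_TABLE: List[Tuple[float, float]] = [(0.5, 1.0), (float("inf"), 3.0)]
NO_DIVIDER_ADJUSTMENT = 0.0  # no dividing line between the two text lines


def adjustment_parameter(divider_length: float, line_width: float) -> float:
    """Look up the adjustment parameter for one pair of text lines."""
    if divider_length <= 0:
        return NO_DIVIDER_ADJUSTMENT
    ratio = divider_length / line_width
    for upper_bound, value in RATIO_TABLE:
        if ratio <= upper_bound:
            return value
    return RATIO_TABLE[-1][1]


def target_distance(spatial: float, divider_length: float, line_width: float) -> float:
    """Summation variant: target distance = spatial distance + adjustment."""
    return spatial + adjustment_parameter(divider_length, line_width)


print(target_distance(0.5, 0, 100))   # no dividing line
print(target_distance(0.5, 30, 100))  # short dividing line
print(target_distance(0.5, 90, 100))  # long dividing line
```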

S130, clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.

In the embodiment of the disclosure, after the target distance between every two text lines is determined, the text lines are clustered according to the target distance between the text lines, and the text line set in the same category is used as a text block. In the embodiment of the present disclosure, the clustering method is not limited. For example, algorithms such as K-means clustering, hierarchical clustering algorithms, SOM neural network clustering algorithms, etc. may be used to cluster lines of text based on target distances between lines of text.
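Since the clustering method is left open, the sketch below uses one simple stand-in: threshold-cut single-linkage clustering, a basic form of hierarchical clustering, implemented with a union-find structure. The function name `cluster_lines` and the threshold value are assumptions for illustration, not values from the embodiment.

```python
from typing import Dict, List


def cluster_lines(dist: List[List[float]], threshold: float) -> List[List[int]]:
    """Merge text lines whose pairwise target distance is <= threshold.

    `dist[i][j]` is the target distance between text line i and text line j.
    Returns the clusters as sorted lists of text line indices.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Find the cluster representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical target distances for five text lines.
D = [
    [0.0, 0.5, 1.1, 5.0, 5.0],
    [0.5, 0.0, 0.6, 5.0, 5.0],
    [1.1, 0.6, 0.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 0.0, 0.4],
    [5.0, 5.0, 5.0, 0.4, 0.0],
]
print(cluster_lines(D, threshold=1.0))
```

Each returned cluster corresponds to one text block. Lines 0 and 2 end up together even though their direct distance exceeds the threshold, because both are close to line 1.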

In one embodiment, the clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines includes:

determining an adjacency matrix corresponding to the text line cluster according to the target distance between the text lines;

and clustering the text lines based on the adjacency matrix, and determining the text block position information corresponding to the same category according to the text line position information of the same category.

Alternatively, the text lines may be clustered based on the target distances between them using a spectral clustering algorithm. Specifically, each text line is taken as a node with its own identifier, and the target distance between two text lines is taken as the weight of the edge formed by the two corresponding nodes, giving an adjacency matrix. Illustratively, the element $d_{ij}$ in the $i$-th row and $j$-th column of the adjacency matrix has, as its value, the target distance between text line $l_i$ and text line $l_j$. After the adjacency matrix is obtained, Ncut segmentation is performed to obtain the clustering result of the text lines, and the text block information is determined according to the clustering result of the text lines.
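Building the adjacency matrix from pairwise target distances can be sketched as follows. The sketch covers only the matrix construction; the Ncut step is not shown. The function name `adjacency_matrix` is hypothetical, and the distance function in the usage example is a stand-in (vertical offset only), not the target distance of the embodiment.

```python
from typing import Callable, List, Sequence, TypeVar

Line = TypeVar("Line")


def adjacency_matrix(lines: Sequence[Line],
                     distance: Callable[[Line, Line], float]) -> List[List[float]]:
    """Return the matrix whose element [i][j] is distance(lines[i], lines[j])."""
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Both directions are evaluated, so an asymmetric
                # distance function is preserved as given.
                matrix[i][j] = float(distance(lines[i], lines[j]))
    return matrix


# Stand-in example: each "text line" is represented by its ordinate only.
ordinates = [0.0, 10.0, 30.0]
for row in adjacency_matrix(ordinates, lambda a, b: abs(a - b)):
    print(row)
```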

Illustratively, if the clustering result is that text line 1, text line 2 and text line 3 belong to cluster set 1, and text line 4 and text line 5 belong to cluster set 2, it is determined that text line 1, text line 2 and text line 3 form text block 1, and that text line 4 and text line 5 form text block 2. The position information of text block 1 is determined according to the position information of text line 1, text line 2 and text line 3, and the position information of text block 2 is determined according to the position information of text line 4 and text line 5. For example, the position information of a text block may include the vertex coordinates of the text block, the width of the text block, the height of the text block, and the like. The vertex coordinates of the text block can be determined according to the vertex coordinates of the text lines in the text block, and the width and the height of the text block can each be determined according to the coordinate information of the text lines in the text block.
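Deriving the position information of a text block from its text lines can be sketched as a bounding-box computation. The sketch assumes each text line is an `(x, y, w, h)` tuple with `(x, y)` the upper-left vertex and the ordinate increasing downward; the function name `block_box` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), upper-left vertex


def block_box(lines: List[Box]) -> Box:
    """Return the smallest box that encloses every text line of one cluster."""
    left = min(x for x, _, _, _ in lines)
    top = min(y for _, y, _, _ in lines)
    right = max(x + w for x, _, w, _ in lines)
    bottom = max(y + h for _, y, _, h in lines)
    return (left, top, right - left, bottom - top)


# Three text lines assigned to the same cluster.
cluster = [(10, 10, 100, 12), (10, 24, 80, 12), (12, 38, 90, 12)]
print(block_box(cluster))
```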

Example two

Fig. 2 is a flowchart of a text processing method according to an embodiment of the present disclosure. The embodiments of the present disclosure may be combined with various alternatives in one or more of the embodiments described above. As shown in fig. 2, the method includes:

S210, character position information contained in the text to be blocked is obtained, and at least one text line and text line position information of the text line are determined according to the character position information.

S220, determining the information of the segmentation lines contained in the text to be segmented.

S230, determining the spatial distance between the text lines according to the text line position information.

In the embodiment of the present disclosure, a spatial distance calculation rule may be preset, and the spatial distance between text lines may be calculated according to the spatial distance calculation rule and the text line position information. Optionally, the position information of a text line may be represented by 4 parameters. Illustratively, the spatial distance between text line $l_i$ and text line $l_j$ is defined as:

$$d_0(l_i, l_j) = \alpha \frac{|x_i - x_j|}{\max(w_i, w_j)} + \frac{|y_i - y_j|}{\max(h_i, h_j)}$$

where $d_0(l_i, l_j)$ represents the spatial distance between text line $l_i$ and text line $l_j$; $(x_i, y_i)$ is the position coordinate of the upper-left vertex of text line $l_i$, $h_i$ is the height of text line $l_i$, and $w_i$ is the width of text line $l_i$; $(x_j, y_j)$ is the position coordinate of the upper-left vertex of text line $l_j$, $h_j$ is the height of text line $l_j$, and $w_j$ is the width of text line $l_j$; and $\alpha$ is a parameter controlling the relative importance of the row spacing and the column spacing. Optionally, the value of $\alpha$ may be 1.5.
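The spatial distance rule can be transcribed directly. In this sketch each text line is assumed to be an `(x, y, w, h)` tuple of its upper-left vertex, width and height; the function name `spatial_distance` is hypothetical, and the default `alpha=1.5` is the optional value mentioned in the text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), upper-left vertex


def spatial_distance(a: Box, b: Box, alpha: float = 1.5) -> float:
    """d0 = alpha * |xi - xj| / max(wi, wj) + |yi - yj| / max(hi, hj)."""
    xi, yi, wi, hi = a
    xj, yj, wj, hj = b
    horizontal = alpha * abs(xi - xj) / max(wi, wj)
    vertical = abs(yi - yj) / max(hi, hj)
    return horizontal + vertical


line_a = (0, 0, 100, 10)
below = (0, 20, 100, 10)    # same left edge, two line heights lower
shifted = (50, 0, 100, 10)  # same row, shifted right by half a width
print(spatial_distance(line_a, below))
print(spatial_distance(line_a, shifted))
```

Normalizing by the larger width and height makes the distance independent of the page scale and font size.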

S240, determining the dividing distance between the text lines according to the text line position information and the dividing line information, wherein the dividing distance is the number of the dividing points existing between the text lines.

In the disclosed embodiment, the segmentation distance between text lines is embodied as the number of segmentation points existing between text lines. In one embodiment, the determining the dividing distance between the text lines according to the text line position information and the dividing line information includes:

determining a segmentation point identification range according to the text line position information;

and acquiring the pixel values of the pixels in the division point identification range, and taking the number of the pixels with the pixel values larger than a set threshold value in the division point identification range as the division distance.

Optionally, the division point identification range can be determined according to the position information of text line $l_i$ and the position information of text line $l_j$. Illustratively, if the upper-left vertex of text line $l_i$ has coordinates $(x_i, y_i)$ and the upper-left vertex of text line $l_j$ has coordinates $(x_j, y_j)$, then the position range corresponding to the set of points $(p, q)$ satisfying $x_j \le p \le x_i + w_i$ and $y_i + h_i \le q \le y_j$ may be taken as the division point identification range. After the division point identification range is determined, the number of division points in the range is determined according to the pixel value of each pixel point in the range. For example, a pixel point whose pixel value is greater than a set threshold may be used as a division point.

Alternatively, the number of division points between text lines may be determined based on the pixel matrix of the picture obtained by edge detection. Illustratively, consistent with the division point identification range described above, the division distance between text line $l_i$ and text line $l_j$ can be written as:

$$d_1(l_i, l_j) = \sum_{p = x_j}^{x_i + w_i} \; \sum_{q = y_i + h_i}^{y_j} \mathbb{1}\left[ I(p, q) > \theta \right]$$

where $d_1(l_i, l_j)$ represents the division distance between text line $l_i$ and text line $l_j$; $(x_i, y_i)$ is the position coordinate of the upper-left vertex of text line $l_i$, $h_i$ is the height of text line $l_i$, and $w_i$ is the width of text line $l_i$; $(x_j, y_j)$ is the position coordinate of the upper-left vertex of text line $l_j$; $I(p, q)$ is the pixel value of the pixel point with coordinates $(p, q)$ in the pixel matrix; $\mathbb{1}[\cdot]$ equals 1 when its condition holds and 0 otherwise; and $\theta$ is a preset pixel value threshold. Optionally, the value of $\theta$ may be 50.
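Counting division points inside the identification range can be sketched as below. The sketch assumes integer pixel coordinates, an edge-detected pixel matrix stored row-major as `pixels[q][p]`, and text lines given as `(x, y, w, h)` tuples; the function name `division_distance` is hypothetical, and the default `theta=50` is the optional value mentioned in the text.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), upper-left vertex


def division_distance(pixels: List[List[int]], a: Box, b: Box, theta: int = 50) -> int:
    """Count pixels with value > theta for x_j <= p <= x_i + w_i
    and y_i + h_i <= q <= y_j (the division point identification range)."""
    xi, yi, wi, hi = a
    xj, yj, _, _ = b
    count = 0
    for q in range(yi + hi, yj + 1):
        if not 0 <= q < len(pixels):
            continue  # skip rows that fall outside the picture
        for p in range(xj, xi + wi + 1):
            if 0 <= p < len(pixels[q]) and pixels[q][p] > theta:
                count += 1
    return count


# Edge picture of 6 rows by 4 columns with a detected dividing line on row 3.
edge_picture = [[0] * 4 for _ in range(6)]
edge_picture[3] = [255] * 4
upper = (0, 0, 3, 2)  # text line above the dividing line
lower = (0, 5, 3, 1)  # text line below the dividing line
print(division_distance(edge_picture, upper, lower))
```

With these coordinates the range is empty whenever the second text line is not below the first, so the result depends on the order of the pair.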

S250, determining the target distance between the text lines according to the spatial distance and the segmentation distance.

After the spatial distance and the segmentation distance between the text lines are determined, the target distance between the text lines is calculated from them. Optionally, a weighted summation of the spatial distance and the segmentation distance may be performed to obtain the target distance. In one embodiment, the determining the target distance between text lines according to the spatial distance and the segmentation distance includes: carrying out weighted summation on the spatial distance and the segmentation distance to obtain the target distance.

Optionally, the target distance calculation rule is defined as: $d(l_i, l_j) = d_0(l_i, l_j) + \lambda d_1(l_i, l_j)$, where $d(l_i, l_j)$ is the target distance between text line $l_i$ and text line $l_j$, $d_0(l_i, l_j)$ is the spatial distance between text line $l_i$ and text line $l_j$, $d_1(l_i, l_j)$ is the segmentation distance between text line $l_i$ and text line $l_j$, and $\lambda$ is the weight of the segmentation distance. The value of $\lambda$ may be adjusted according to the position parameters of the text lines.
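As a small worked illustration of this rule, the sketch below evaluates the weighted sum for two hypothetical pairs of text lines. The function name `target_distance`, the weight 0.5, and the distance values are assumptions chosen for the example; the text does not give a value for the weight.

```python
def target_distance(d0: float, d1: float, lam: float) -> float:
    """Target distance d = d0 + lam * d1 (spatial plus weighted segmentation)."""
    return d0 + lam * d1


# Two pairs of text lines with the same spatial distance (hypothetical values).
# Only the second pair has a dividing line between its lines (40 edge pixels).
same_block = target_distance(d0=12.0, d1=0, lam=0.5)
split_by_line = target_distance(d0=12.0, d1=40, lam=0.5)
```

With equal spatial distance, the pair separated by a dividing line ends up farther apart in target distance, which is what lets the clustering step place the two lines in different text blocks.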

S260, clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.

According to the technical solution of this embodiment, the determination of the target distance between the text lines according to the text line position information and the dividing line information is refined as follows: the spatial distance between the text lines is determined according to the text line position information, the segmentation distance between the text lines is determined according to the text line position information and the dividing line information, and the target distance between the text lines is determined according to the spatial distance and the segmentation distance. This makes the calculation of the target distance more accurate, and in turn makes the text line clustering result based on the target distance more accurate.

Example three

Fig. 3a is a flowchart of a text processing method according to an embodiment of the present disclosure. This embodiment provides a preferred example based on the above embodiments. As shown in fig. 3a, the method comprises:

S310, start.

S320, acquiring the PDF file.

The PDF file input by the user is acquired.

S330, extracting the character information of the PDF file.

Character information is extracted from the PDF file information stream, where the character information includes character coordinates, font size, width, height, and the like.

S340, generating text lines by using the character information.

Text lines are determined according to the character position information. Fig. 3b is a schematic diagram of a text block extraction result in a text processing method according to an embodiment of the present disclosure. As shown in fig. 3b, the black areas are the extracted text lines.
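One way to picture this step is the sketch below, which merges character boxes into text lines using the rule stated elsewhere in this disclosure: characters with continuous horizontal coordinates and the same vertical coordinate form one text line. The `(x, y, w, h)` box layout, the function names, and the `max_gap` tolerance that decides what counts as "continuous" are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left origin


def _bounding_box(boxes: List[Box]) -> Box:
    """Smallest box covering all given boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


def group_into_lines(chars: List[Box], max_gap: float = 1.0) -> List[Box]:
    """Merge characters that share a y coordinate and are horizontally adjacent."""
    rows: Dict[float, List[Box]] = defaultdict(list)
    for box in chars:
        rows[box[1]].append(box)  # same vertical coordinate -> same row
    lines: List[Box] = []
    for y in sorted(rows):
        run: List[Box] = []
        for box in sorted(rows[y], key=lambda b: b[0]):
            # A horizontal gap wider than max_gap starts a new text line.
            if run and box[0] - (run[-1][0] + run[-1][2]) > max_gap:
                lines.append(_bounding_box(run))
                run = []
            run.append(box)
        lines.append(_bounding_box(run))
    return lines


# Hypothetical characters: two runs on the first row, one character below.
chars = [(0, 0, 5, 10), (5, 0, 5, 10), (10, 0, 5, 10),
         (40, 0, 5, 10), (45, 0, 5, 10), (0, 20, 5, 10)]
lines = group_into_lines(chars)
```

The wide gap on the first row splits it into two text lines, as would happen with two columns printed at the same height. Real PDF coordinates are rarely exactly equal, so a practical version would likely need a tolerance on the vertical coordinate as well.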

S350, converting the PDF file into a grayscale picture, and removing the character part by using bilinear interpolation.

In order to represent the other segmentation information in the document, the parts of the document other than the characters are converted into a picture, the picture is grayed to obtain a grayscale picture, and the character regions in the picture are filled by using bilinear interpolation.
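The text does not spell out how the bilinear filling is carried out. One possible reading, sketched below, overwrites each character rectangle with values interpolated bilinearly from the four pixels at the corners of the one-pixel border around it. The function name `fill_rect_bilinear`, the list-of-rows picture layout, and the choice of corner samples are assumptions made for this example.

```python
from typing import List


def fill_rect_bilinear(img: List[List[int]], x: int, y: int,
                       w: int, h: int) -> List[List[int]]:
    """Fill the rectangle (x, y, w, h) in place from its four outer corners."""
    rows, cols = len(img), len(img[0])
    # Corner samples lie one pixel outside the rectangle, clamped to the picture.
    x0, x1 = max(x - 1, 0), min(x + w, cols - 1)
    y0, y1 = max(y - 1, 0), min(y + h, rows - 1)
    tl, tr = img[y0][x0], img[y0][x1]
    bl, br = img[y1][x0], img[y1][x1]
    for q in range(max(y, 0), min(y + h, rows)):
        v = (q - y0) / (y1 - y0) if y1 > y0 else 0.0  # vertical weight
        for p in range(max(x, 0), min(x + w, cols)):
            u = (p - x0) / (x1 - x0) if x1 > x0 else 0.0  # horizontal weight
            top = tl * (1 - u) + tr * u
            bottom = bl * (1 - u) + br * u
            img[q][p] = int(round(top * (1 - v) + bottom * v))
    return img


# Hypothetical 5x5 grayscale picture with a horizontal gradient background
# and a 3x3 "character" region painted white in the middle.
picture = [[25 * p for p in range(5)] for _ in range(5)]
for q in range(1, 4):
    for p in range(1, 4):
        picture[q][p] = 255
fill_rect_bilinear(picture, x=1, y=1, w=3, h=3)
```

After the fill, the character region blends into the background gradient, so a later edge detector no longer responds to glyph outlines. When a rectangle touches the picture border, the clamped corner samples fall on character pixels, which a practical version would need to handle.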

S360, performing edge detection on the picture to obtain a segmentation picture.

Edge detection is performed on the picture by using the Canny operator to obtain a segmentation picture containing the dividing lines. Fig. 3c is a schematic diagram of the segmentation picture in a text processing method according to an embodiment of the present disclosure. As shown in fig. 3c, the white lines in the figure are the extracted dividing lines.

S370, calculating the distances between the text lines to obtain an adjacency matrix.

The segmentation distance between the text lines is calculated according to the segmentation picture and combined with the spatial distance between the text lines to obtain the adjacency matrix.
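The text does not state how the distances are turned into the entries of the adjacency matrix. The sketch below assumes a Gaussian kernel, a common choice for spectral clustering, so that a small target distance gives a weight near 1 and a large one a weight near 0. The function name `adjacency_matrix`, the callables `spatial` and `segmentation`, and the parameters `lam` and `sigma` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def adjacency_matrix(lines: Sequence[T],
                     spatial: Callable[[T, T], float],
                     segmentation: Callable[[T, T], float],
                     lam: float = 1.0,
                     sigma: float = 10.0) -> List[List[float]]:
    """Symmetric affinity matrix built from target distances d = d0 + lam * d1."""
    n = len(lines)
    adj = [[0.0] * n for _ in range(n)]  # zero diagonal: no self-loops
    for i in range(n):
        for j in range(i + 1, n):
            d = spatial(lines[i], lines[j]) + lam * segmentation(lines[i], lines[j])
            # Gaussian kernel (an assumption): larger distance, smaller weight.
            adj[i][j] = adj[j][i] = math.exp(-d * d / (2.0 * sigma * sigma))
    return adj


# Toy example: text lines reduced to their y coordinate, no dividing lines.
ys = [0.0, 10.0, 100.0]
adj = adjacency_matrix(ys,
                       spatial=lambda a, b: abs(a - b),
                       segmentation=lambda a, b: 0.0)
```

The callables are invoked with the earlier line first, so a segmentation distance that expects the upper text line as its first argument would require the lines to be sorted from top to bottom beforehand.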

S380, obtaining a text line clustering result, namely the text blocking, by using spectral clustering.

The text lines are clustered by spectral clustering based on the adjacency matrix to obtain a clustering result, and the text blocking result is determined according to the clustering result. Fig. 3d is a schematic diagram of a text line clustering result in a text processing method according to an embodiment of the present disclosure. As shown in fig. 3d, the text lines are clustered into three clusters: cluster set 1 includes text line H, cluster set 2 includes text lines A, B, and C, and cluster set 3 includes text lines D, G, F, and E. Text line H constitutes text block 1; text lines A, B, and C constitute text block 2; and text lines D, G, F, and E constitute text block 3.
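Once every text line carries a cluster label, the position of each text block can be derived from the text lines that share a label. The sketch below takes the labels as given and merges the boxes of each cluster. The labels could come, for example, from scikit-learn's `SpectralClustering` with `affinity="precomputed"` applied to the adjacency matrix. The function name `blocks_from_labels` and the `(x, y, w, h)` box layout are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left origin


def blocks_from_labels(lines: Sequence[Box],
                       labels: Sequence[int]) -> Dict[int, Box]:
    """Return one bounding box per cluster label."""
    groups: Dict[int, List[Box]] = defaultdict(list)
    for box, label in zip(lines, labels):
        groups[label].append(box)
    blocks: Dict[int, Box] = {}
    for label, boxes in groups.items():
        x0 = min(b[0] for b in boxes)
        y0 = min(b[1] for b in boxes)
        x1 = max(b[0] + b[2] for b in boxes)
        y1 = max(b[1] + b[3] for b in boxes)
        blocks[label] = (x0, y0, x1 - x0, y1 - y0)
    return blocks


# Hypothetical text lines: two close together, one far below.
text_lines = [(0, 0, 100, 10), (0, 15, 80, 10), (0, 60, 50, 10)]
cluster_labels = [0, 0, 1]
blocks = blocks_from_labels(text_lines, cluster_labels)
```

Sorting the resulting blocks from top to bottom and from left to right would then give the reading order described in the background section.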

S390, end.

Fig. 3e is a schematic diagram of a text to be partitioned in the text processing method according to the embodiment of the present disclosure. Fig. 3f is a schematic diagram of a blocking result of a text to be blocked in the text processing method according to the embodiment of the present disclosure. Fig. 3e and 3f exemplarily show the blocking effect of text blocking by using the text processing method provided by the embodiment of the present disclosure. As shown in fig. 3f, a first text block 301f, a second text block 302f, a third text block 303f, a fourth text block 304f, a fifth text block 305f, a sixth text block 306f, a seventh text block 307f, an eighth text block 308f, a ninth text block 309f, a tenth text block 310f, and an eleventh text block 311f are text blocks obtained by text blocking of the text to be blocked shown in fig. 3e by using the text processing method provided by the embodiment of the present disclosure. It can be seen that the text blocking result obtained based on the text processing method provided by the embodiment of the disclosure has high accuracy.

In this embodiment, the PDF document information is used to process the character regions and the non-character regions separately, which avoids mutual interference; an edge detection algorithm is used to extract information such as the dividing lines from the picture; both the spatial distance and the segmentation distance between text lines are taken into account; and a clustering algorithm is used to obtain the text blocks automatically. No large set of rules or training data is needed, and the document can be accurately divided into text blocks by determining only a small number of parameters.

Example four

Fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present disclosure. The embodiment of the disclosure is applicable to the situation when the PDF text is subjected to text blocking. The text processing apparatus may be implemented in software and/or hardware, and may be configured in a terminal device, for example. As shown in fig. 4, the text processing apparatus includes: a text line determination module 410, a target distance determination module 420, and a text block determination module 430. Wherein:

a text line determining module 410, configured to obtain text position information included in a text to be segmented, and determine at least one text line and text line position information of the text line according to the text position information;

a target distance determining module 420, configured to determine, according to the text line position information, dividing line information included in the text to be partitioned, and determine, according to the text line position information and the dividing line information, a target distance between text lines;

and a text block determining module 430, configured to cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to a clustering result of the text lines.

In this embodiment, the text line determining module obtains the character position information contained in the text to be blocked and determines at least one text line and the text line position information of the text line according to the character position information; the target distance determining module determines the dividing line information contained in the text to be blocked and determines the target distance between the text lines according to the text line position information and the dividing line information; and the text block determining module clusters the text lines according to the target distance and determines at least one text block of the text to be blocked according to the clustering result of the text lines. Blocking the text according to the character position information and the dividing line information simplifies the text blocking process and improves the accuracy of the text blocking result.

Optionally, on the basis of the foregoing technical solution, the target distance determining module 420 includes:

the space distance determining unit is used for determining the space distance between the text lines according to the text line position information;

a dividing distance determining unit, configured to determine a dividing distance between text lines according to the text line position information and the dividing line information, where the dividing distance is the number of dividing points existing between the text lines;

and the target distance determining unit is used for determining the target distance between the text lines according to the space distance and the segmentation distance.

Optionally, on the basis of the above technical solution, the segmentation distance determining unit is specifically configured to:

Optionally, on the basis of the above technical solution, the target distance determining unit is specifically configured to:

and carrying out weighted summation on the space distance and the segmentation distance to obtain the target distance.

Optionally, on the basis of the foregoing technical solution, the target distance determining module 420 includes a segmentation information detecting unit, configured to:

Optionally, on the basis of the foregoing technical solution, the text block determining module 430 is specifically configured to:

Optionally, on the basis of the foregoing technical solution, the text position information includes text coordinates, and the text line determining module 410 is specifically configured to:

and taking the characters with continuous horizontal coordinates and same vertical coordinates as a text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.

The text processing device provided by the embodiment of the disclosure can execute the text processing method provided by the embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.

It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the embodiments of the present disclosure.

Example five

Referring now to fig. 5, a block diagram of a terminal device 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, devices such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 5, the terminal device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the terminal device 500 are also stored. The processing means 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication device 509 may allow the terminal device 500 to perform wireless or wired communication with other devices to exchange data. While fig. 5 illustrates a terminal device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer means may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing means 501.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be included in the terminal device; or may exist separately without being assembled into the terminal device.

The computer readable medium carries one or more programs which, when executed by the terminal device, cause the terminal device to:

Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, terminal devices, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules and units described in the embodiments of the present disclosure may be implemented by software or hardware. For example, the text line determination module may be further described as a "module for acquiring character position information included in a text to be segmented and determining at least one text line and text line position information of the text line according to the character position information".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, example one provides a text processing method, including:

According to one or more embodiments of the present disclosure, example two provides a text processing method, and on the basis of the text processing method of example one, the determining a target distance between text lines according to the text line position information and the dividing line information includes:

determining the space distance between text lines according to the text line position information;

determining the dividing distance between text lines according to the text line position information and the dividing line information, wherein the dividing distance is the number of dividing points existing between the text lines;

and determining the target distance between text lines according to the space distance and the segmentation distance.

According to one or more embodiments of the present disclosure, example three provides a text processing method, and on the basis of the text processing method of example two, the determining a dividing distance between text lines according to the text line position information and the dividing line information includes:

According to one or more embodiments of the present disclosure, example four provides a text processing method, and on the basis of the text processing method of example two, the determining a target distance between text lines according to the spatial distance and the segmentation distance includes:

According to one or more embodiments of the present disclosure, example five provides a text processing method, and on the basis of the text processing method of example one, the determining of the segmentation line information included in the text to be segmented includes:

According to one or more embodiments of the present disclosure, example six provides a text processing method, and on the basis of the text processing method of example one, the clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to a clustering result of the text lines includes:

According to one or more embodiments of the present disclosure, example seven provides a text processing method, and on the basis of the text processing method of example one, the text position information includes text coordinates, and the determining at least one text line and text line position information of the text line according to the text position information includes:

Example eight provides, in accordance with one or more embodiments of the present disclosure, a text processing apparatus comprising:

Example nine provides, in accordance with one or more embodiments of the present disclosure, a terminal device, comprising:

one or more processing devices;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the text processing method of any of examples one to seven.

Example ten provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements a text processing method as in any one of examples one to seven, in accordance with one or more embodiments of the present disclosure.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. One example is a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A text processing method, characterized by comprising:

Obtain the text position information contained in the text to be divided, and determine at least one text line and the text line position information of the text line according to the text position information;

Determine the dividing line information contained in the text to be divided, determine the target distance between the text lines according to the text line position information and the dividing line information, and calculate the target distance according to the spatial distance and the dividing distance;

Clustering the text lines according to the target distance, and determining at least one text block of the text to be divided according to the clustering result of the text lines;

Wherein, the calculation of the segmentation distance includes:

Determine the segmentation point recognition range according to the text line position information;

Acquiring the pixel values of the pixel points within the segmentation point recognition range, and taking the number of pixel points within that range whose pixel values are greater than the set threshold as the segmentation distance.

2. The method according to claim 1, wherein the determining the target distance between the text lines according to the text line position information and the dividing line information comprises:

Determine the spatial distance between text lines according to the text line position information;

A target distance between text lines is determined according to the spatial distance and the segmentation distance.

3. The method according to claim 2, wherein the determining the target distance between the text lines according to the spatial distance and the segmentation distance comprises:

Weighted summation is performed on the spatial distance and the segmentation distance to obtain the target distance.

4. The method according to claim 1, wherein the determining the dividing line information contained in the text to be divided comprises:

Converting other areas other than the text line in the text to be divided into a picture format, and graying the converted picture to obtain a grayscale picture;

Filling the pixel values of the pixel points in the area corresponding to the position information of the text to be divided in the grayscale picture to obtain the picture to be detected;

Edge detection is performed on the picture to be detected by an edge detection algorithm, and the detected edge information is used as the dividing line information.

5. The method according to claim 1, wherein the clustering the text lines according to the target distance, and determining at least one text block of the text to be divided according to the clustering result of the text lines comprises:

Determine the adjacency matrix corresponding to the text line clustering according to the target distance between the text lines;

The text lines are clustered based on the adjacency matrix, and the text block position information corresponding to the category is determined according to the text line position information of the same category.

6. The method according to claim 1, wherein the text position information comprises text coordinates, and the determining at least one text line and the text line position information of the text line according to the text position information comprises:

The text with continuous horizontal coordinates and the same vertical coordinate is regarded as a text line, and the text line position information of the text line is determined according to the text position information of the text in the text line.

7. A text processing device, comprising:

a text line determination module, configured to obtain text position information contained in the text to be divided, and determine at least one text line and text line position information of the text line according to the text position information;

A target distance determination module, configured to determine the dividing line information contained in the text to be divided according to the text line position information, and determine the target distance between the text lines according to the text line position information and the dividing line information, The target distance is calculated according to the spatial distance and the segmentation distance;

a text block determination module, configured to cluster the text lines according to the target distance, and determine at least one text block of the text to be divided according to the clustering result of the text lines;

Wherein, the calculation of the segmentation distance includes:

8. A terminal device, wherein the terminal device comprises:

one or more processing devices;

a storage device for storing one or more programs;

When the one or more programs are executed by the one or more processing apparatuses, the one or more processing apparatuses implement the text processing method according to any one of claims 1-6.

9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the text processing method according to any one of claims 1-6 is implemented.