
CN110442719B - Text processing method, device, equipment and storage medium - Google Patents

Info

Publication number
CN110442719B
Authority
CN
China
Prior art keywords
text
line
position information
distance
lines
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910734656.9A
Other languages
Chinese (zh)
Other versions
CN110442719A (en
Inventor
张航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Feishu Technology Co ltd
Douyin Vision Co Ltd
Douyin Vision Beijing Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201910734656.9A priority Critical patent/CN110442719B/en
Publication of CN110442719A publication Critical patent/CN110442719A/en
Application granted granted Critical
Publication of CN110442719B publication Critical patent/CN110442719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Input (AREA)

Abstract

The embodiment of the disclosure discloses a text processing method, a text processing device, text processing equipment and a storage medium, wherein the method comprises the following steps: the method comprises the steps of obtaining character position information contained in a text to be blocked, determining at least one text line and text line position information of the text line according to the character position information, determining dividing line information contained in the text to be blocked, determining a target distance between the text lines according to the text line position information and the dividing line information, clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to a clustering result of the text lines. According to the method provided by the embodiment of the disclosure, the text to be blocked is blocked according to the character position information and the dividing line information, so that the text blocking process is simplified, and the accuracy of the text blocking result is improved.

Description

Text processing method, device, equipment and storage medium
Technical Field
The embodiments of the present disclosure relate to the field of information technologies, and in particular, to a text processing method, apparatus, device, and storage medium.
Background
Portable Document Format (PDF) is a file format that presents documents in a manner independent of application programs, hardware, and operating systems. A PDF file reproduces the document style well, but because its main purpose is to guarantee the rendering result, the structural information of the content is ignored. The logical or semantic structure between the contents of a PDF document therefore cannot be obtained directly, which makes the document difficult to structure. If text blocking is not performed on the PDF document, directly extracting the characters can produce text in the wrong order. The text areas therefore need to be framed so that the text order inside each block is correct, with the blocks arranged from top to bottom and from left to right. Text blocking is thus the basis for structuring PDF documents.
At present, text blocking methods include: a rule-based blocking method that converts the two-dimensional plane segmentation problem into a one-dimensional character string analysis problem through the horizontal and vertical coordinates of page elements and then distinguishes the corresponding elements by rules; segmentation algorithms based on morphological operations; the Thiessen polygon (Voronoi) algorithm; the constrained run-length algorithm; and region detection algorithms based on deep learning. However, the existing text blocking methods either require a large number of rules and parameters to be set and yield recognition results of limited accuracy, or require a large amount of labeled data for training, which makes the process complicated.
Disclosure of Invention
The disclosure provides a text processing method, a text processing device, a text processing apparatus and a storage medium, so as to simplify a text blocking process and improve the accuracy of a text blocking result.
In a first aspect, an embodiment of the present disclosure provides a text processing method, including:
acquiring character position information contained in a text to be partitioned, and determining at least one text line and text line position information of the text line according to the character position information;
determining segmentation line information contained in the text to be segmented, and determining a target distance between text lines according to the text line position information and the segmentation line information;
and clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
In a second aspect, an embodiment of the present disclosure further provides a text processing apparatus, including:
the text line determining module is used for acquiring character position information contained in a text to be blocked and determining at least one text line and text line position information of the text line according to the character position information;
the target distance determining module is used for determining the dividing line information contained in the text to be divided according to the text line position information and determining the target distance between the text lines according to the text line position information and the dividing line information;
and the text block determining module is used for clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
In a third aspect, an embodiment of the present disclosure further provides a terminal device, where the terminal device includes:
one or more processing devices;
storage means for storing one or more programs;
when executed by the one or more processing devices, the one or more programs cause the one or more processing devices to implement the text processing method according to any of the embodiments of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a computer processor, performs the text processing method according to any embodiment of the present disclosure.
According to the embodiments of the present disclosure, character position information contained in a text to be blocked is obtained; at least one text line and the text line position information of the text line are determined according to the character position information; dividing line information contained in the text to be blocked is determined; a target distance between text lines is determined according to the text line position information and the dividing line information; the text lines are clustered according to the target distance; and at least one text block of the text to be blocked is determined according to the clustering result of the text lines. Because the text to be blocked is blocked according to both the character position information and the dividing line information, the text blocking process is simplified and the accuracy of the text blocking result is improved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of a text processing method according to an embodiment of the present disclosure;
Fig. 3a is a flowchart of a text processing method according to an embodiment of the present disclosure;
Fig. 3b is a schematic diagram of a text block extraction result in a text processing method according to an embodiment of the present disclosure;
Fig. 3c is a schematic diagram of segmentation in a text processing method according to an embodiment of the present disclosure;
Fig. 3d is a schematic diagram of a text line clustering result in a text processing method according to an embodiment of the present disclosure;
Fig. 3e is a schematic diagram of a text to be blocked in a text processing method according to an embodiment of the present disclosure;
Fig. 3f is a schematic diagram of a blocking result of a text to be blocked in a text processing method according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
In the following embodiments, optional features and examples are provided in each embodiment, and various features described in the embodiments may be combined to form a plurality of alternatives, and each numbered embodiment should not be regarded as only one technical solution.
Example one
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present disclosure. The embodiments of the present disclosure may be applicable to the case when text blocking is performed on PDF text, and the method may be performed by a text processing apparatus, which may be implemented in software and/or hardware, for example, the text processing apparatus may be configured in a terminal device. As shown in fig. 1, the method includes:
S110, character position information contained in the text to be blocked is obtained, and at least one text line and the text line position information of the text line are determined according to the character position information.
In the embodiment of the present disclosure, the text to be blocked may be one or more pages contained in the PDF text. The character position information may be character coordinates of each character in the text to be partitioned. It can be understood that the PDF text may include elements such as words and pictures, and a data stream of the PDF text includes element information of all elements, and corresponding element information is different for different element types. For example, when the element type is text, the element information may be information such as coordinates, a font, a size, and the like, and when the element type is picture, the element information may be information such as coordinates, a height, and the like.
By parsing the data stream of PDF text, the coordinates of each word can be obtained. For example, whether an element corresponding to the element information is a character may be determined according to the element information, and when the element corresponding to the element information is a character, a coordinate included in the element information is acquired as a character coordinate. Optionally, it may be determined whether the element information includes information corresponding to a "font", and if the element information includes information corresponding to a "font", it is determined that the element corresponding to the element information is a character.
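The optional "font" check described above can be sketched as follows. This is a minimal illustration only: the element information is assumed to have already been parsed into dictionaries, and the key names `x`, `y`, `font`, `size` and `height` are hypothetical, not part of the disclosure or of any PDF library.

```python
from typing import Dict, List, Tuple


def extract_char_coords(elements: List[Dict]) -> List[Tuple[float, float]]:
    """Return the (x, y) coordinates of every element that carries a "font" entry."""
    coords = []
    for info in elements:
        # An element whose information includes a "font" entry is treated as a character.
        if "font" in info:
            coords.append((info["x"], info["y"]))
    return coords


# Hypothetical element information for one character and one picture.
elements = [
    {"x": 10.0, "y": 20.0, "font": "SimSun", "size": 12.0},
    {"x": 50.0, "y": 80.0, "height": 100.0},  # picture: no "font" entry
]
char_coords = extract_char_coords(elements)
```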
After the character position information of all characters contained in the text to be blocked is obtained, the text lines and the text line position information of each text line are determined according to the character position information of each character. Optionally, the characters may be divided into text lines according to the character coordinates of each character, and the text line position information of each text line may then be determined. The manner of dividing the characters into text lines is not limited herein. Alternatively, an area formed by characters that have the same ordinate and whose abscissas lie within a set distance threshold of each other may be used as one text line, or an area formed by characters that have the same ordinate and continuous abscissas may be used as one text line. The distance threshold may be set according to how the characters are laid out in the text to be blocked.
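One way to divide characters into text lines by a distance threshold could look like the sketch below. It assumes each character is a dictionary with hypothetical keys `x`, `y` (top-left corner) and `w` (width), and it compares ordinates for exact equality as the text describes; real PDF coordinates may need a tolerance instead.

```python
from typing import Dict, List


def group_into_lines(chars: List[Dict], gap_threshold: float) -> List[List[Dict]]:
    """Group characters that share an ordinate and whose horizontal gap is within gap_threshold."""
    lines: List[List[Dict]] = []
    # Sort by ordinate, then abscissa, so candidates for the same line are adjacent.
    for ch in sorted(chars, key=lambda c: (c["y"], c["x"])):
        if lines:
            prev = lines[-1][-1]
            gap = ch["x"] - (prev["x"] + prev["w"])
            if ch["y"] == prev["y"] and gap <= gap_threshold:
                lines[-1].append(ch)
                continue
        lines.append([ch])
    return lines


chars = [
    {"x": 0.0, "y": 0.0, "w": 10.0},
    {"x": 10.0, "y": 0.0, "w": 10.0},
    {"x": 60.0, "y": 0.0, "w": 10.0},  # same ordinate but far away: starts a new line
    {"x": 0.0, "y": 20.0, "w": 10.0},
]
lines = group_into_lines(chars, gap_threshold=2.0)
```

Setting `gap_threshold` to 0 approximates the "continuous abscissas" variant.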
Optionally, the position information of the text line may include vertex coordinates of the text line (e.g., top left vertex coordinates, bottom left vertex coordinates, top right vertex coordinates, or bottom right vertex coordinates), a height of the text line, and a width of the text line. After the text line is determined according to the character coordinates, for each text line, the character coordinates of the end points on at least one side of the text line are determined, and the vertex coordinates of the text line are determined based on the character coordinates of the end points of the text line. For example, if the vertex coordinate of the text line is set as the vertex coordinate of the upper left corner of the text line, the vertex coordinate of the upper left corner of the text of the endpoint on the left side of the text line is obtained as the vertex coordinate of the text line. Acquiring element information corresponding to the characters, and determining the height of the text line based on the font size contained in the element information.
In the embodiment of the present disclosure, the width of the text line may be determined according to the number of characters and element information included in the text line, or may be determined according to the character coordinates of the characters at the end point of the text line. For example, the number of words contained within a line of text may be determined, and the width of the line of text may be determined based on the word font size, the number of words, and the word spacing. And the character coordinates of end points on two sides of the text line can be obtained, the distance between the character coordinates of the end points on the two sides of the text line is calculated, and the distance between the characters of the end points on the two sides of the text line is used as the width of the text line.
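The four position parameters of a text line can be derived from its end-point characters roughly as follows. This is a sketch under assumptions: the character keys `x`, `y`, `w` and `size` are hypothetical, the largest font size in the line is taken as the line height, and the width includes the width of the rightmost character.

```python
from typing import Dict, List


def line_position(line: List[Dict]) -> Dict[str, float]:
    """Compute top-left vertex, width and height of a text line from its characters."""
    left = min(line, key=lambda c: c["x"])              # end-point character on the left
    right = max(line, key=lambda c: c["x"] + c["w"])    # end-point character on the right
    x, y = left["x"], left["y"]                          # top-left vertex of the line
    width = right["x"] + right["w"] - x                  # distance between the two end points
    height = max(c["size"] for c in line)                # assumption: height from font size
    return {"x": x, "y": y, "w": width, "h": height}


line = [
    {"x": 0.0, "y": 5.0, "w": 10.0, "size": 12.0},
    {"x": 10.0, "y": 5.0, "w": 10.0, "size": 12.0},
]
position = line_position(line)
```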
In one embodiment, the determining the at least one text line and the text line position information of the text line according to the text position information includes: and taking the characters with continuous horizontal coordinates and same vertical coordinates as a text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.
Preferably, the text line position information of the text line may be determined based on the character position information in the text line, with characters having continuous abscissa and the same ordinate as one text line. The characters with continuous horizontal coordinates and same vertical coordinates are used as a text line, so that each divided text line is ensured to be continuous character information, and the division of the text lines is more accurate. For the manner of determining the position information of the text line according to the position information of the characters in the text line, reference may be made to the above description, which is not repeated herein.
S120, determining the dividing line information contained in the text to be divided, and determining the target distance between the text lines according to the text line position information and the dividing line information.
In order to more accurately divide text blocks of a text to be partitioned, in the embodiment of the disclosure, dividing line information included in the text to be partitioned is used as a parameter for dividing the text blocks, a target distance between text lines is determined based on text line position information and the dividing line information, and the text blocks are divided based on the target distance between the text lines. The dividing line information is used as a parameter for dividing the text block, so that two text lines which are close to each other but are not actually in the same text area (for example, the two text lines are close to each other but have the dividing line therebetween) are divided into different text blocks.
In the embodiment of the disclosure, the segmentation line information contained in the text to be segmented can be determined according to an image segmentation algorithm. In consideration of the fact that the characters in the text to be partitioned may affect the image segmentation result, and the extracted segmentation line information is inaccurate, in the embodiment of the disclosure, the character parts in the text to be partitioned may be color-filled, so as to remove the influence of the character parts on the image segmentation result.
Optionally, the image after the edge detection may be used to obtain a pixel matrix of the image only including the partition line, and if the pixel value of a certain pixel point in the image satisfies the set pixel value range, it indicates that the color change at the pixel point is large, and may be the partition line.
In one embodiment, the determining information of the segmentation line included in the text to be segmented includes:
converting other areas except the text line in the text to be blocked into a picture format, and carrying out graying on the converted picture to obtain a grayscale picture;
filling pixel values of pixel points in a region corresponding to the text position information to be blocked in the gray picture to obtain a picture to be detected;
and carrying out edge detection on the picture to be detected through an edge detection algorithm, and taking the detected edge information as the information of the dividing line.
In order to fill the color of the character part in the text to be blocked, other areas except the text line in the text to be blocked need to be converted into a picture format, the converted picture is grayed to obtain a grayscale picture, and the text line is filled with the pixel value based on the grayscale picture. Optionally, if the background color of the text to be segmented is a single color, the pixel value of the background color can be directly used as the pixel value of the text line region; if the background of the text to be blocked comprises multiple colors, such as gradient colors, interpolation filling can be performed on pixel values in the text line region based on pixel values of pixel points around the text line region, so that the picture to be detected is obtained. The interpolation filling mode can enable the pixel value in the text line area to be closer to the background pixel value, so that the extraction of the segmentation line information is more accurate. The interpolation filling method is not limited herein. Illustratively, the filling of pixel values within a text line region may be performed using bilinear interpolation.
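For the single-colour background case, the filling step can be sketched in plain Python as follows. The grayscale picture is assumed to be a list of rows of integer pixel values, text line boxes are assumed to be `(x, y, w, h)` tuples in pixel units, and taking the most frequent grey level as the background colour is an assumption of this sketch rather than something the disclosure specifies.

```python
from collections import Counter
from typing import List, Tuple


def fill_text_regions(gray: List[List[int]],
                      line_boxes: List[Tuple[int, int, int, int]]) -> List[List[int]]:
    """Return a copy of gray in which every text line box is painted with the background value."""
    # Assumption: the most frequent grey level is the (single) background colour.
    counts = Counter(value for row in gray for value in row)
    background = counts.most_common(1)[0][0]
    out = [row[:] for row in gray]
    for x, y, w, h in line_boxes:
        for q in range(y, y + h):
            for p in range(x, x + w):
                out[q][p] = background
    return out


gray = [[255] * 8 for _ in range(6)]
for q in range(1, 3):
    for p in range(1, 5):
        gray[q][p] = 0  # dark pixels standing in for glyphs
filled = fill_text_regions(gray, [(1, 1, 4, 2)])
```

For gradient backgrounds, the constant fill would be replaced by interpolation from the pixels surrounding each box, as the text suggests.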
After the picture to be detected is obtained, the edge detection algorithm is used for detecting the information of the dividing line contained in the picture to be detected. In the embodiment of the present disclosure, the edge detection algorithm is not limited. For example, the edge detection may be performed on the picture to be detected by using Canny operator, Roberts operator, Sobel operator, and the like.
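As an illustration of the edge detection step, the sketch below applies the Sobel operator in plain Python. It is not the disclosed implementation: the picture is assumed to be a list of rows of grey values, border pixels are left at 0, and the gradient magnitude is approximated by |gx| + |gy| clipped to 255.

```python
from typing import List

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray: List[List[int]]) -> List[List[int]]:
    """Return a pixel matrix whose large values mark strong intensity changes (edges)."""
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for q in range(1, height - 1):
        for p in range(1, width - 1):
            gx = gy = 0
            for dq in range(3):
                for dp in range(3):
                    value = gray[q + dq - 1][p + dp - 1]
                    gx += SOBEL_X[dq][dp] * value
                    gy += SOBEL_Y[dq][dp] * value
            out[q][p] = min(255, abs(gx) + abs(gy))
    return out


# A white picture with one black horizontal rule in row 2.
picture = [[255] * 7 for _ in range(5)]
picture[2] = [0] * 7
edges = sobel_magnitude(picture)
```

The rows directly above and below the rule receive high values, which is the kind of response the later thresholding step counts as division points.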
In one embodiment, the spatial distance between the text lines may be adjusted according to the information of the dividing lines between the text lines, so as to obtain the target distance between the text lines. Illustratively, the spatial distance between text lines is calculated according to the text line position information, the existence condition of the segmentation line between the text lines is judged according to the text line position information and the segmentation line information, the adjustment parameter of the spatial distance is determined according to the existence condition of the segmentation line between the text lines, and the target distance between the text lines is calculated according to the spatial distance and the adjustment parameter. Optionally, a correspondence between the existence of the segmentation lines between the text lines and the adjustment parameters may be preset, and after the existence of the segmentation lines between the text lines is determined, the adjustment parameters of the spatial distance between the text lines are determined by searching the preset correspondence. For example, the spatial distance and the adjustment parameter may be summed or multiplied, and the obtained operation result is used as the target distance.
The existence condition of the dividing lines between text lines may be either that a dividing line exists between the text lines or that no dividing line exists between them. The case in which a dividing line exists may be further subdivided according to the length of the dividing line: for example, the dividing line lengths are divided into N length ranges, and the case corresponding to each length range is taken as one existence condition. Alternatively, the case may be subdivided according to the ratio of the dividing line length to the text line width: the ratio is divided into M ratio ranges, and the case corresponding to each ratio range is taken as one existence condition. Here M and N are integers greater than 1.
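The preset correspondence between existence conditions and adjustment parameters can be sketched as a lookup table. Everything numeric here is hypothetical: the sketch uses M = 2 ratio ranges split at 0.5, illustrative adjustment values, and the summation variant for combining the spatial distance with the adjustment parameter.

```python
from typing import Sequence

# Hypothetical correspondence: case 0 = no dividing line, cases 1..M = ratio ranges.
ADJUSTMENT = {0: 0.0, 1: 1.0, 2: 3.0}


def dividing_line_case(line_length: float, text_width: float,
                       ratio_bounds: Sequence[float] = (0.5,)) -> int:
    """Map the dividing line between two text lines to an existence condition index."""
    if line_length <= 0:
        return 0  # no dividing line between the text lines
    ratio = line_length / text_width
    case = 1
    for bound in ratio_bounds:
        if ratio > bound:
            case += 1
    return case


def target_distance(spatial: float, line_length: float, text_width: float) -> float:
    """Sum the spatial distance and the adjustment parameter looked up for the case."""
    return spatial + ADJUSTMENT[dividing_line_case(line_length, text_width)]


no_line = target_distance(1.0, 0.0, 100.0)
short_line = target_distance(1.0, 30.0, 100.0)
long_line = target_distance(1.0, 90.0, 100.0)
```

A longer dividing line pushes the two text lines further apart in target distance, so they are less likely to fall into the same cluster.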
S130, clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
In the embodiment of the disclosure, after the target distance between every two text lines is determined, the text lines are clustered according to the target distance between the text lines, and the text line set in the same category is used as a text block. In the embodiment of the present disclosure, the clustering method is not limited. For example, algorithms such as K-means clustering, hierarchical clustering algorithms, SOM neural network clustering algorithms, etc. may be used to cluster lines of text based on target distances between lines of text.
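Since the clustering method is left open, the sketch below uses one of the named options, hierarchical clustering, in its simplest single-linkage form: text lines whose target distance is at most a threshold are merged, which is equivalent to cutting the single-linkage dendrogram at that threshold. The threshold value and the example distances are hypothetical.

```python
from typing import Dict, List


def cluster_lines(dist: List[List[float]], threshold: float) -> List[int]:
    """Return a cluster label per text line using single-linkage merging below threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Target distances between three text lines; the third line is far from the others.
dist = [
    [0.0, 1.0, 9.0],
    [1.0, 0.0, 9.0],
    [9.0, 9.0, 0.0],
]
labels = cluster_lines(dist, threshold=2.0)
```

Text lines sharing a label form one text block.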
In one embodiment, the clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines includes:
determining an adjacency matrix corresponding to the text line cluster according to the target distance between the text lines;
and clustering the text lines based on the adjacency matrix, and determining the text block position information corresponding to the same category according to the text line position information of the same category.
Alternatively, text lines may be clustered based on the target distances between text lines using a spectral clustering algorithm. Specifically, each text line is treated as a node with its own identifier, and the target distance between two text lines is used as the weight of the edge formed by the two corresponding nodes, yielding an adjacency matrix. Illustratively, the element $d_{ij}$ in the $i$-th row and $j$-th column of the adjacency matrix has as its value the target distance between text line $l_i$ and text line $l_j$. After the adjacency matrix is obtained, Ncut segmentation is performed to obtain the clustering result of the text lines, and the text block information is determined according to the clustering result of the text lines.
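The adjacency matrix and the normalized cut criterion can be illustrated with the sketch below. It is not the disclosed implementation: it assumes distances are turned into affinities with exp(-d) before the cut is evaluated, and it finds the best two-way split by brute-force enumeration, which is only feasible for a handful of text lines. Practical spectral clustering would use the eigenvector relaxation of the same criterion.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Set


def adjacency_matrix(lines: Sequence, distance: Callable) -> List[List[float]]:
    """Element [i][j] is the target distance between text line i and text line j."""
    n = len(lines)
    return [[0.0 if i == j else distance(lines[i], lines[j]) for j in range(n)]
            for i in range(n)]


def ncut_value(weights: List[List[float]], part_a: Sequence[int]) -> float:
    """Normalized cut value: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)."""
    n = len(weights)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


def best_bipartition(dist: List[List[float]]) -> Set[int]:
    """Return the smaller side of the two-way split with minimal normalized cut."""
    n = len(dist)
    # Assumption: affinity decays exponentially with target distance.
    weights = [[0.0 if i == j else math.exp(-dist[i][j]) for j in range(n)]
               for i in range(n)]
    best = None
    for size in range(1, n // 2 + 1):
        for part in combinations(range(n), size):
            value = ncut_value(weights, part)
            if best is None or value < best[0]:
                best = (value, set(part))
    return best[1]


# Five text lines reduced to a single vertical position each, for illustration.
lines = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = adjacency_matrix(lines, lambda a, b: abs(a - b))
group = best_bipartition(dist)
```

Applying the split recursively to each side would yield more than two clusters.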
Illustratively, if the clustering result is that text line 1, text line 2 and text line 3 belong to cluster set 1, and text line 4 and text line 5 belong to cluster set 2, it is determined that text line 1, text line 2 and text line 3 form text block 1 and that text line 4 and text line 5 form text block 2. The position information of text block 1 is determined according to the position information of text line 1, text line 2 and text line 3, and the position information of text block 2 is determined according to the position information of text line 4 and text line 5. For example, the position information of a text block may include the vertex coordinates of the text block, the width of the text block, the height of the text block, and the like. The vertex coordinates of the text block can be determined according to the vertex coordinates of the text lines in the text block, and the width and height of the text block can each be determined according to the coordinate information of the text lines in the text block.
According to the embodiments of the present disclosure, character position information contained in a text to be blocked is obtained; at least one text line and the text line position information of the text line are determined according to the character position information; dividing line information contained in the text to be blocked is determined; a target distance between text lines is determined according to the text line position information and the dividing line information; the text lines are clustered according to the target distance; and at least one text block of the text to be blocked is determined according to the clustering result of the text lines. Because the text to be blocked is blocked according to both the character position information and the dividing line information, the text blocking process is simplified and the accuracy of the text blocking result is improved.
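Deriving the position information of a text block from its text lines amounts to taking their bounding box. The sketch assumes each text line is a dictionary with hypothetical keys `x`, `y` (top-left vertex), `w` and `h`, in a coordinate system where the ordinate grows downward.

```python
from typing import Dict, List


def block_position(lines: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the top-left vertex, width and height enclosing all text lines of one cluster."""
    left = min(line["x"] for line in lines)
    top = min(line["y"] for line in lines)
    right = max(line["x"] + line["w"] for line in lines)
    bottom = max(line["y"] + line["h"] for line in lines)
    return {"x": left, "y": top, "w": right - left, "h": bottom - top}


cluster = [
    {"x": 10, "y": 10, "w": 100, "h": 12},
    {"x": 10, "y": 26, "w": 80, "h": 12},
]
block = block_position(cluster)
```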
Example two
Fig. 2 is a flowchart of a text processing method according to an embodiment of the present disclosure. The embodiments of the present disclosure may be combined with various alternatives in one or more of the embodiments described above. As shown in fig. 2, the method includes:
S210, character position information contained in the text to be blocked is obtained, and at least one text line and the text line position information of the text line are determined according to the character position information.
S220, determining the information of the segmentation lines contained in the text to be segmented.
S230, determining the spatial distance between the text lines according to the text line position information.
In the embodiment of the present disclosure, a spatial distance calculation rule may be preset, and the spatial distance between text lines may be calculated according to the spatial distance calculation rule and the text line position information. Optionally, the position information of a text line may be represented by 4 parameters. Illustratively, the spatial distance between text line $l_i$ and text line $l_j$ is defined as:
$$d_0(l_i, l_j) = \alpha \frac{|x_i - x_j|}{\max(w_i, w_j)} + \frac{|y_i - y_j|}{\max(h_i, h_j)}$$
where $d_0(l_i, l_j)$ denotes the spatial distance between text line $l_i$ and text line $l_j$; $(x_i, y_i)$ is the position coordinate of the top-left vertex of text line $l_i$, $h_i$ is the height of text line $l_i$, and $w_i$ is the width of text line $l_i$; $(x_j, y_j)$ is the position coordinate of the top-left vertex of text line $l_j$, $h_j$ is the height of text line $l_j$, and $w_j$ is the width of text line $l_j$; and $\alpha$ is a parameter controlling the relative importance of row spacing and column spacing. Optionally, the value of $\alpha$ may be 1.5.
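The spatial distance defined above translates directly into a small function. In this sketch each text line is assumed to be an `(x, y, w, h)` tuple, where `(x, y)` is the top-left vertex; the tuple layout is a choice made for illustration.

```python
from typing import Tuple

Line = Tuple[float, float, float, float]  # (x, y, w, h), top-left vertex plus size


def spatial_distance(li: Line, lj: Line, alpha: float = 1.5) -> float:
    """Weighted sum of the normalised horizontal and vertical offsets of two text lines."""
    xi, yi, wi, hi = li
    xj, yj, wj, hj = lj
    return alpha * abs(xi - xj) / max(wi, wj) + abs(yi - yj) / max(hi, hj)


stacked = spatial_distance((0.0, 0.0, 100.0, 10.0), (0.0, 20.0, 100.0, 10.0))
side_by_side = spatial_distance((0.0, 0.0, 100.0, 10.0), (50.0, 0.0, 100.0, 10.0))
```

With the optional value of 1.5 for alpha, a horizontal offset of half a line width contributes 0.75, while a vertical offset of two line heights contributes 2.0.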
S240, determining the dividing distance between the text lines according to the text line position information and the dividing line information, wherein the dividing distance is the number of the dividing points existing between the text lines.
In the disclosed embodiment, the segmentation distance between text lines is embodied as the number of segmentation points existing between text lines. In one embodiment, the determining the dividing distance between the text lines according to the text line position information and the dividing line information includes:
determining a segmentation point identification range according to the text line position information;
and acquiring the pixel values of the pixels in the division point identification range, and taking the number of the pixels with the pixel values larger than a set threshold value in the division point identification range as the division distance.
Optionally, the division point identification range may be determined according to the position information of text line $l_i$ and the position information of text line $l_j$. Illustratively, if the position coordinate of the top-left vertex of text line $l_i$ is $(x_i,y_i)$ and the position coordinate of the top-left vertex of text line $l_j$ is $(x_j,y_j)$, then the position range corresponding to the set of points $(p,q)$ satisfying $x_j\le p\le x_i+w_i$ and $y_i+h_i\le q\le y_j$ may be taken as the division point identification range. After the division point identification range is determined, the number of division points in the range is determined according to the pixel value of each pixel in the range. For example, a pixel whose pixel value is greater than a set threshold may be taken as a division point.
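A minimal sketch of this counting step follows, assuming the edge picture is available as a list of rows of grayscale values indexed as `edge_image[q][p]` and that each text line is an `(x, y, w, h)` tuple in integer pixel coordinates. The function name and the clamping to the picture bounds are assumptions made for the example, not details given in the text.

```python
def segmentation_distance(edge_image, line_i, line_j, theta=50):
    """Count pixels with value > theta in the range
    xj <= p <= xi + wi and yi + hi <= q <= yj."""
    xi, yi, wi, hi = line_i
    xj, yj, _, _ = line_j
    height = len(edge_image)
    width = len(edge_image[0]) if height else 0
    # Clamp the identification range to the picture (an assumption of this sketch).
    p_lo, p_hi = max(xj, 0), min(xi + wi, width - 1)
    q_lo, q_hi = max(yi + hi, 0), min(yj, height - 1)
    count = 0
    for q in range(q_lo, q_hi + 1):
        for p in range(p_lo, p_hi + 1):
            if edge_image[q][p] > theta:
                count += 1
    return count


# 10 x 10 edge picture with one horizontal dividing line on row 5.
picture = [[0] * 10 for _ in range(10)]
picture[5] = [255] * 10
upper = (1, 1, 6, 2)  # ends at row 3
lower = (1, 7, 6, 2)  # starts at row 7
print(segmentation_distance(picture, upper, lower))
```

With these inputs the range covers columns 1 to 7 and rows 3 to 7, so the seven pixels of the dividing line inside it are counted. If the second line lies above the first, the range is empty and the count is zero.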
Optionally, the number of division points included between text lines may be determined based on the pixel matrix of the picture obtained by edge detection. Illustratively, the segmentation distance between text line $l_i$ and text line $l_j$ is defined as:
Figure BDA0002161775750000121
where $d_1(l_i,l_j)$ represents the segmentation distance between text line $l_i$ and text line $l_j$, $(x_i,y_i)$ is the position coordinate of the top-left vertex of text line $l_i$, $h_i$ is the height of text line $l_i$, $w_i$ is the width of text line $l_i$, $(x_j,y_j)$ is the position coordinate of the top-left vertex of text line $l_j$, $I(p,q)$ is the pixel value of the pixel at coordinate $(p,q)$ in the pixel matrix, and $\theta$ is a preset pixel value threshold; optionally, the value of $\theta$ may be 50.
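The formula itself is given only as a figure. One form that is consistent with the symbol definitions above and with the division point identification range described earlier, offered as a reconstruction under those assumptions rather than as the exact expression in the figure, is:

```latex
d_1(l_i, l_j) = \sum_{p = x_j}^{x_i + w_i} \; \sum_{q = y_i + h_i}^{y_j} \mathbf{1}\!\left[\, I(p, q) > \theta \,\right]
```

Here $\mathbf{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise, so $d_1$ counts the pixels in the range whose value exceeds $\theta$. The sum is empty, and $d_1 = 0$, when $y_j < y_i + h_i$ or $x_i + w_i < x_j$.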
S250, determining the target distance between the text lines according to the spatial distance and the segmentation distance.
After the spatial distance and the segmentation distance between the text lines are determined, the target distance between the text lines is calculated from them. Optionally, the target distance may be obtained by a weighted summation of the spatial distance and the segmentation distance. In one embodiment, the determining the target distance between text lines according to the spatial distance and the segmentation distance includes: carrying out weighted summation on the spatial distance and the segmentation distance to obtain the target distance.
Optionally, the target distance calculation rule is defined as: $d(l_i,l_j)=d_0(l_i,l_j)+\lambda\,d_1(l_i,l_j)$, where $d(l_i,l_j)$ is the target distance between text line $l_i$ and text line $l_j$, $d_0(l_i,l_j)$ is the spatial distance between text line $l_i$ and text line $l_j$, $d_1(l_i,l_j)$ is the segmentation distance between text line $l_i$ and text line $l_j$, and $\lambda$ is the weight of the segmentation distance. The value of $\lambda$ can be adjusted according to the position parameters of the text lines.
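As a small worked illustration of the weighted sum, the sketch below combines a spatial distance and a segmentation distance. The function name and the default weight `lam=0.1` are hypothetical; the text gives no numeric value for $\lambda$.

```python
def target_distance(d0: float, d1: float, lam: float = 0.1) -> float:
    """d = d0 + lambda * d1, with lam an assumed illustrative weight."""
    return d0 + lam * d1


# Two adjacent lines with no dividing line between them ...
same_block = target_distance(d0=14 / 12, d1=0)
# ... and the same geometry with 7 division points in between.
across_rule = target_distance(d0=14 / 12, d1=7)
print(same_block, across_rule)
```

The dividing line increases the distance by $\lambda \cdot 7$, which pushes the two lines towards different clusters even though their spacing is unchanged.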
S260, clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
According to the technical scheme of the embodiment of the disclosure, the spatial distance between text lines is determined according to the text line position information, the segmentation distance between text lines is determined according to the text line position information and the dividing line information, and the target distance between text lines is determined according to the spatial distance and the segmentation distance, so that the calculation of the target distance is more accurate, and the text line clustering result based on the target distance is in turn more accurate.
Example three
Fig. 3a is a flowchart of a text processing method according to an embodiment of the present disclosure. The embodiment of the present disclosure provides a preferred embodiment based on the above-mentioned embodiments. As shown in fig. 3a, the method comprises:
S310, starting.
S320, acquiring the PDF file.
The PDF file input by the user is acquired.
S330, extracting the text information of the PDF file.
Character information is extracted from the PDF file information stream, where the character information includes character coordinates, font size, width, height and the like.
S340, generating text lines by using the text information.
Text lines are determined according to the character position information. Fig. 3b is a schematic diagram of a text block extraction result in a text processing method according to an embodiment of the present disclosure. As shown in fig. 3b, the black areas are the extracted text lines.
S350, converting the PDF file into a grayscale picture, and removing the character part by using bilinear interpolation.
In order to represent the other segmentation information in the document, the part of the document other than the characters is converted into a picture, the picture is grayed to obtain a grayscale picture, and the character part of the picture is filled in by using bilinear interpolation.
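The filling step can be illustrated as follows. The text does not say which reference pixels the interpolation uses, so this sketch assumes that each character box is overwritten with values bilinearly interpolated from the four pixels located just outside the corners of the box. The function name and the `(x, y, w, h)` box layout are likewise hypothetical.

```python
def fill_box_bilinear(gray, box):
    """Overwrite the pixels of box (x, y, w, h) in gray (a list of rows,
    modified in place) by bilinear interpolation between the four pixels
    just outside the box corners."""
    x, y, w, h = box
    rows, cols = len(gray), len(gray[0])
    # Reference corners, clamped to the picture bounds.
    x0, x1 = max(x - 1, 0), min(x + w, cols - 1)
    y0, y1 = max(y - 1, 0), min(y + h, rows - 1)
    tl, tr = gray[y0][x0], gray[y0][x1]
    bl, br = gray[y1][x0], gray[y1][x1]
    for q in range(max(y, 0), min(y + h, rows)):
        v = (q - y0) / (y1 - y0) if y1 > y0 else 0.0
        for p in range(max(x, 0), min(x + w, cols)):
            u = (p - x0) / (x1 - x0) if x1 > x0 else 0.0
            top = tl + u * (tr - tl)        # interpolate along the upper edge
            bottom = bl + u * (br - bl)     # interpolate along the lower edge
            gray[q][p] = round(top + v * (bottom - top))
    return gray


# White page with a dark 2 x 2 "character" at columns 2-3, rows 2-3.
page = [[255] * 6 for _ in range(6)]
for q in (2, 3):
    for p in (2, 3):
        page[q][p] = 0
fill_box_bilinear(page, (2, 2, 2, 2))
print(page[2])
```

On a uniform background the character pixels are replaced by the background value, so they no longer produce edges in the later edge detection step. If a box touches the picture border, the clamped corners fall on the box itself, which a fuller implementation would need to handle.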
S360, carrying out edge detection on the picture to obtain a segmentation picture.
Edge detection is carried out on the picture by using the Canny operator to obtain a segmentation picture containing the dividing lines. Fig. 3c is a schematic diagram of the segmentation picture in a text processing method according to an embodiment of the present disclosure. As shown in fig. 3c, the white lines in the figure are the extracted dividing lines.
S370, calculating the distance between the text lines to obtain an adjacency matrix.
The segmentation distance between the text lines is calculated according to the segmentation picture and combined with the spatial distance between the text lines to obtain the adjacency matrix.
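One way to turn pairwise distances into an adjacency matrix is sketched below. The text does not state how a distance is converted into an edge weight, so the Gaussian kernel, the `sigma` parameter and the averaging of the two directed distances are all assumptions of this example. The `distance` argument stands for any callable returning the target distance between two text lines.

```python
import math


def adjacency_matrix(lines, distance, sigma=1.0):
    """Symmetric matrix of edge weights exp(-d^2 / (2 sigma^2)); zero diagonal."""
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The segmentation term depends on argument order, so average both directions.
            d = 0.5 * (distance(lines[i], lines[j]) + distance(lines[j], lines[i]))
            weight = math.exp(-(d * d) / (2.0 * sigma * sigma))
            matrix[i][j] = matrix[j][i] = weight
    return matrix


# Toy example: three "lines" described only by a vertical position.
rows = [0.0, 1.0, 10.0]
A = adjacency_matrix(rows, lambda a, b: abs(a - b), sigma=2.0)
print(A[0][1], A[0][2])
```

Close lines receive a weight near 1 and distant lines a weight near 0, which is the form of input that spectral clustering on a precomputed affinity usually expects.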
S380, obtaining a text line clustering result, namely the text blocks, by using spectral clustering.
The text lines are clustered by spectral clustering based on the adjacency matrix to obtain a clustering result, and the text blocking result is determined according to the clustering result. Fig. 3d is a schematic diagram of a text line clustering result in a text processing method according to an embodiment of the present disclosure. As shown in fig. 3d, the text lines are clustered into three clusters: cluster set 1 includes text line H, cluster set 2 includes text lines A, B and C, and cluster set 3 includes text lines D, G, F and E. Text line H constitutes text block 1, text lines A, B and C constitute text block 2, and text lines D, G, F and E constitute text block 3.
S390, ending.
Fig. 3e is a schematic diagram of a text to be partitioned in the text processing method according to the embodiment of the present disclosure. Fig. 3f is a schematic diagram of a blocking result of a text to be blocked in the text processing method according to the embodiment of the present disclosure. Fig. 3e and 3f exemplarily show the blocking effect of text blocking by using the text processing method provided by the embodiment of the present disclosure. As shown in fig. 3f, a first text block 301f, a second text block 302f, a third text block 303f, a fourth text block 304f, a fifth text block 305f, a sixth text block 306f, a seventh text block 307f, an eighth text block 308f, a ninth text block 309f, a tenth text block 310f, and an eleventh text block 311f are text blocks obtained by text blocking of the text to be blocked shown in fig. 3e by using the text processing method provided by the embodiment of the present disclosure. It can be seen that the text blocking result obtained based on the text processing method provided by the embodiment of the disclosure has high accuracy.
The embodiment of the present disclosure uses the information in the PDF document to process the character areas and the non-character areas separately, so that they do not interfere with each other. Information such as the dividing line picture is extracted by using an edge detection algorithm and used to segment the text, while both the spatial distance and the segmentation distance between text lines are taken into account. The text blocks are obtained automatically by using a clustering algorithm, so that a large number of rules or a large amount of training data is not needed, and the document can be accurately divided into text blocks by determining only a small number of parameters.
Example four
Fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present disclosure. The embodiment of the disclosure is applicable to the situation when the PDF text is subjected to text blocking. The text processing apparatus may be implemented in software and/or hardware, and may be configured in a terminal device, for example. As shown in fig. 4, the text processing apparatus includes: a text line determination module 410, a target distance determination module 420, and a text block determination module 430. Wherein:
a text line determining module 410, configured to obtain character position information included in a text to be partitioned, and determine at least one text line and text line position information of the text line according to the character position information;
a target distance determining module 420, configured to determine dividing line information included in the text to be partitioned, and determine a target distance between text lines according to the text line position information and the dividing line information;
and a text block determining module 430, configured to cluster the text lines according to the target distance, and determine at least one text block of the text to be blocked according to a clustering result of the text lines.
According to the embodiment of the present disclosure, the text line determining module obtains the character position information contained in the text to be partitioned, and determines at least one text line and the text line position information of the text line according to the character position information; the target distance determining module determines the dividing line information contained in the text to be partitioned, and determines the target distance between text lines according to the text line position information and the dividing line information; and the text block determining module clusters the text lines according to the target distance, and determines at least one text block of the text to be partitioned according to the clustering result of the text lines. Partitioning the text according to the character position information and the dividing line information simplifies the text partitioning process and improves the accuracy of the text partitioning result.
Optionally, on the basis of the foregoing technical solution, the target distance determining module 420 includes:
the space distance determining unit is used for determining the space distance between the text lines according to the text line position information;
a dividing distance determining unit, configured to determine a dividing distance between text lines according to the text line position information and the dividing line information, where the dividing distance is the number of dividing points existing between the text lines;
and the target distance determining unit is used for determining the target distance between the text lines according to the space distance and the segmentation distance.
Optionally, on the basis of the above technical solution, the segmentation distance determining unit is specifically configured to:
determining a segmentation point identification range according to the text line position information;
and acquiring the pixel values of the pixels in the division point identification range, and taking the number of the pixels with the pixel values larger than a set threshold value in the division point identification range as the division distance.
Optionally, on the basis of the above technical solution, the target distance determining unit is specifically configured to:
and carrying out weighted summation on the space distance and the segmentation distance to obtain the target distance.
Optionally, on the basis of the foregoing technical solution, the target distance determining module 420 includes a segmentation information detecting unit, configured to:
converting other areas except the text line in the text to be blocked into a picture format, and carrying out graying on the converted picture to obtain a grayscale picture;
filling the pixel values of the pixels in the region of the grayscale picture that corresponds to the character position information of the text to be partitioned, to obtain a picture to be detected;
and carrying out edge detection on the picture to be detected through an edge detection algorithm, and taking the detected edge information as the information of the dividing line.
Optionally, on the basis of the foregoing technical solution, the text block determining module 430 is specifically configured to:
determining an adjacency matrix corresponding to the text line cluster according to the target distance between the text lines;
and clustering the text lines based on the adjacency matrix, and determining the text block position information corresponding to the same category according to the text line position information of the same category.
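The last step, deriving text block position information from the text lines of one category, can be sketched as a bounding-box union. The function name, the `(x, y, w, h)` tuples and the label list are hypothetical; the labels stand for the output of whatever clustering routine is applied to the adjacency matrix.

```python
def text_blocks(lines, labels):
    """Return {label: (x, y, w, h)}, the bounding box of all lines sharing a label."""
    extents = {}  # label -> (left, top, right, bottom)
    for (x, y, w, h), label in zip(lines, labels):
        left, top, right, bottom = x, y, x + w, y + h
        if label in extents:
            l, t, r, b = extents[label]
            left, top = min(l, left), min(t, top)
            right, bottom = max(r, right), max(b, bottom)
        extents[label] = (left, top, right, bottom)
    return {label: (l, t, r - l, b - t) for label, (l, t, r, b) in extents.items()}


lines = [(10, 10, 100, 12), (10, 24, 80, 12), (10, 100, 50, 12)]
labels = [0, 0, 1]  # assumed clustering result: the first two lines belong together
print(text_blocks(lines, labels))
```

Each text block is then described by the same 4 parameters as a text line, so later processing can treat lines and blocks uniformly.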
Optionally, on the basis of the foregoing technical solution, the character position information includes character coordinates, and the text line determining module 410 is specifically configured to:
and taking the characters with continuous horizontal coordinates and same vertical coordinates as a text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.
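The grouping rule can be sketched as a single pass over the characters sorted by position. In this illustration "continuous horizontal coordinates" is read as "the gap to the previous character is at most `gap_tolerance`", and "same vertical coordinates" as exact equality of the top coordinate. Both readings, the tolerance value and the function name are assumptions.

```python
def group_into_lines(chars, gap_tolerance=1.0):
    """chars: list of (x, y, w, h) character boxes.
    Returns the text lines as (x, y, w, h) boxes, top to bottom, left to right."""
    lines = []
    current = None  # [left, top, right, bottom] of the line being built
    for x, y, w, h in sorted(chars, key=lambda c: (c[1], c[0])):
        if current is not None and y == current[1] and x - current[2] <= gap_tolerance:
            # Same baseline and no large gap: extend the current line.
            current[2] = max(current[2], x + w)
            current[3] = max(current[3], y + h)
        else:
            if current is not None:
                lines.append(current)
            current = [x, y, x + w, y + h]
    if current is not None:
        lines.append(current)
    return [(l, t, r - l, b - t) for l, t, r, b in lines]


chars = [(0, 0, 5, 10), (5, 0, 5, 10), (10, 0, 5, 10), (30, 0, 5, 10), (0, 20, 5, 10)]
print(group_into_lines(chars))
```

The first three characters touch and form one line, the fourth starts a new line after a wide gap, and the fifth lies on a different row.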
The text processing device provided by the embodiment of the disclosure can execute the text processing method provided by the embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the embodiments of the present disclosure.
Example five
Referring now to fig. 5, a block diagram of a terminal device 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, devices such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the terminal device 500 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 501 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the terminal device 500 are also stored. The processing means 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication device 509 may allow the terminal device 500 to perform wireless or wired communication with other devices to exchange data. While fig. 5 illustrates a terminal device 500 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing means 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the terminal device; or may exist separately without being assembled into the terminal device.
The computer readable medium carries one or more programs which, when executed by the terminal device, cause the terminal device to:
acquiring character position information contained in a text to be partitioned, and determining at least one text line and text line position information of the text line according to the character position information;
determining segmentation line information contained in the text to be segmented, and determining a target distance between text lines according to the text line position information and the segmentation line information;
and clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, terminal devices, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and units described in the embodiments of the present disclosure may be implemented by software or hardware. For example, the text line determination module may be further described as a "module for acquiring character position information included in a text to be segmented and determining at least one text line and text line position information of the text line according to the character position information".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example one provides a text processing method, including:
acquiring character position information contained in a text to be partitioned, and determining at least one text line and text line position information of the text line according to the character position information;
determining segmentation line information contained in the text to be segmented, and determining a target distance between text lines according to the text line position information and the segmentation line information;
and clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
According to one or more embodiments of the present disclosure, example two provides a text processing method, and on the basis of the text processing method of example one, the determining a target distance between text lines according to the text line position information and the dividing line information includes:
determining the space distance between text lines according to the text line position information;
determining the dividing distance between text lines according to the text line position information and the dividing line information, wherein the dividing distance is the number of dividing points existing between the text lines;
and determining the target distance between text lines according to the space distance and the segmentation distance.
According to one or more embodiments of the present disclosure, example three provides a text processing method, and on the basis of the text processing method of example two, the determining a dividing distance between text lines according to the text line position information and the dividing line information includes:
determining a segmentation point identification range according to the text line position information;
and acquiring the pixel values of the pixels in the division point identification range, and taking the number of the pixels with the pixel values larger than a set threshold value in the division point identification range as the division distance.
According to one or more embodiments of the present disclosure, example four provides a text processing method, and on the basis of the text processing method of example two, the determining a target distance between text lines according to the spatial distance and the segmentation distance includes:
and carrying out weighted summation on the space distance and the segmentation distance to obtain the target distance.
According to one or more embodiments of the present disclosure, example five provides a text processing method, and on the basis of the text processing method of example one, the determining of the segmentation line information included in the text to be segmented includes:
converting other areas except the text line in the text to be blocked into a picture format, and carrying out graying on the converted picture to obtain a grayscale picture;
filling the pixel values of the pixels in the region of the grayscale picture that corresponds to the character position information of the text to be partitioned, to obtain a picture to be detected;
and carrying out edge detection on the picture to be detected through an edge detection algorithm, and taking the detected edge information as the information of the dividing line.
According to one or more embodiments of the present disclosure, example six provides a text processing method, and on the basis of the text processing method of example one, the clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to a clustering result of the text lines includes:
determining an adjacency matrix corresponding to the text line cluster according to the target distance between the text lines;
and clustering the text lines based on the adjacency matrix, and determining the text block position information corresponding to the same category according to the text line position information of the same category.
According to one or more embodiments of the present disclosure, example seven provides a text processing method, and on the basis of the text processing method of example one, the character position information includes character coordinates, and the determining at least one text line and text line position information of the text line according to the character position information includes:
and taking the characters with continuous horizontal coordinates and same vertical coordinates as a text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.
Example eight provides, in accordance with one or more embodiments of the present disclosure, a text processing apparatus comprising:
the text line determining module is used for acquiring character position information contained in a text to be blocked and determining at least one text line and text line position information of the text line according to the character position information;
the target distance determining module is used for determining the dividing line information contained in the text to be partitioned, and determining the target distance between the text lines according to the text line position information and the dividing line information;
and the text block determining module is used for clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines.
Example nine provides, in accordance with one or more embodiments of the present disclosure, a terminal device, comprising:
one or more processing devices;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement a text processing method as in any of examples one to seven.
Example ten provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements a text processing method as in any one of examples one to seven, in accordance with one or more embodiments of the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with features having similar functions disclosed in (but not limited to) this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (9)

1. A text processing method, comprising:
acquiring character position information contained in a text to be blocked, and determining at least one text line and text line position information of the text line according to the character position information;
determining dividing line information contained in the text to be blocked, and determining a target distance between text lines according to the text line position information and the dividing line information, wherein the target distance is calculated from a spatial distance and a segmentation distance;
clustering the text lines according to the target distance, and determining at least one text block of the text to be blocked according to the clustering result of the text lines;
wherein calculating the segmentation distance comprises:
determining a segmentation point identification range according to the text line position information;
acquiring pixel values of the pixels within the segmentation point identification range, and taking the number of pixels within that range whose pixel value is greater than a set threshold as the segmentation distance.

2. The method according to claim 1, wherein determining the target distance between text lines according to the text line position information and the dividing line information comprises:
determining the spatial distance between text lines according to the text line position information;
determining the target distance between text lines according to the spatial distance and the segmentation distance.

3. The method according to claim 2, wherein determining the target distance between text lines according to the spatial distance and the segmentation distance comprises:
performing a weighted summation of the spatial distance and the segmentation distance to obtain the target distance.

4. The method according to claim 1, wherein determining the dividing line information contained in the text to be blocked comprises:
converting the regions of the text to be blocked other than the text lines into a picture format, and graying the converted picture to obtain a grayscale picture;
filling the pixel values of the pixels in the region of the grayscale picture that corresponds to the position information of the text to be blocked, to obtain a picture to be detected;
performing edge detection on the picture to be detected with an edge detection algorithm, and taking the detected edge information as the dividing line information.

5. The method according to claim 1, wherein clustering the text lines according to the target distance and determining at least one text block of the text to be blocked according to the clustering result of the text lines comprises:
determining an adjacency matrix for text line clustering according to the target distance between text lines;
clustering the text lines based on the adjacency matrix, and determining the text block position information of each category according to the text line position information of the text lines in that category.

6. The method according to claim 1, wherein the character position information comprises character coordinates, and determining at least one text line and the text line position information of the text line according to the character position information comprises:
taking characters whose abscissas are continuous and whose ordinates are identical as one text line, and determining the text line position information of the text line according to the character position information of the characters in the text line.

7. A text processing apparatus, comprising:
a text line determining module, configured to acquire character position information contained in a text to be blocked, and to determine at least one text line and text line position information of the text line according to the character position information;
a target distance determining module, configured to determine dividing line information contained in the text to be blocked according to the text line position information, and to determine a target distance between text lines according to the text line position information and the dividing line information, wherein the target distance is calculated from a spatial distance and a segmentation distance;
a text block determining module, configured to cluster the text lines according to the target distance, and to determine at least one text block of the text to be blocked according to the clustering result of the text lines;
wherein calculating the segmentation distance comprises:
determining a segmentation point identification range according to the text line position information;
acquiring pixel values of the pixels within the segmentation point identification range, and taking the number of pixels within that range whose pixel value is greater than a set threshold as the segmentation distance.

8. A terminal device, comprising:
one or more processing devices;
a storage device for storing one or more programs;
wherein the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the text processing method according to any one of claims 1-6.

9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the text processing method according to any one of claims 1-6.

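The distance computation recited in claims 1 to 3 can be sketched as follows. This is an illustrative reading under stated assumptions rather than the patented implementation: the claims do not say how the segmentation point identification range is derived, so the sketch assumes it is the rectangle lying between two vertically stacked text lines, and it assumes the spatial distance is the vertical gap between them. The function names, the integer `(x, y, width, height)` box format, the edge map given as a list of pixel rows, and the default weights are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format in pixels: (x, y, width, height).
Box = Tuple[int, int, int, int]


def spatial_distance(a: Box, b: Box) -> float:
    """Vertical gap between two line boxes (0 when they overlap vertically)."""
    _, ay, _, ah = a
    _, by, _, bh = b
    return float(max(by - (ay + ah), ay - (by + bh), 0))


def segmentation_distance(edges: List[List[int]], a: Box, b: Box,
                          threshold: int = 128) -> int:
    """Count pixels above `threshold` in the assumed range between two lines."""
    upper, lower = (a, b) if a[1] <= b[1] else (b, a)
    top = upper[1] + upper[3]          # first row below the upper line
    bottom = lower[1]                  # first row of the lower line
    left = min(a[0], b[0])
    right = max(a[0] + a[2], b[0] + b[2])
    return sum(
        1
        for row in edges[top:bottom]
        for value in row[left:right]
        if value > threshold
    )


def target_distance(edges: List[List[int]], a: Box, b: Box,
                    w_spatial: float = 1.0, w_seg: float = 1.0,
                    threshold: int = 128) -> float:
    """Weighted sum of the spatial distance and the segmentation distance."""
    return (w_spatial * spatial_distance(a, b)
            + w_seg * segmentation_distance(edges, a, b, threshold))
```

With such a weighting, two lines separated by a detected dividing line end up farther apart than two lines with the same vertical gap and no dividing line between them, which is what lets the clustering step keep them in different text blocks.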
CN201910734656.9A 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium Active CN110442719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910734656.9A CN110442719B (en) 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910734656.9A CN110442719B (en) 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110442719A CN110442719A (en) 2019-11-12
CN110442719B true CN110442719B (en) 2022-03-04

Family

ID=68434244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910734656.9A Active CN110442719B (en) 2019-08-09 2019-08-09 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110442719B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627133B (en) * 2020-05-08 2024-12-13 珠海金山办公软件有限公司 A page splitting method and device
CN111680491B (en) * 2020-05-27 2024-02-02 北京字跳网络技术有限公司 Method and device for extracting document information and electronic equipment
CN113177959B (en) * 2021-05-21 2022-05-03 广州普华灵动机器人技术有限公司 A real-time extraction method of QR code during fast movement
CN118629057A (en) * 2023-03-07 2024-09-10 凯钿行动科技股份有限公司 Method, device, computer equipment and storage medium for determining text blocks of PDF text

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832756A (en) * 2017-10-24 2018-03-23 讯飞智元信息科技有限公司 Express delivery list information extracting method and device, storage medium, electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012057891A1 (en) * 2010-10-26 2012-05-03 Hewlett-Packard Development Company, L.P. Transformation of a document into interactive media content
US10579707B2 (en) * 2017-12-29 2020-03-03 Konica Minolta Laboratory U.S.A., Inc. Method for inferring blocks of text in electronic documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
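Chinese page layout segmentation method based on minimum spanning tree clustering; Zhang Chong (张充) et al.; Computer Engineering (《计算机工程》); 20080805 (No. 15); full text *
Web page blocking algorithm for mobile devices; Lu Songfeng (路松峰) et al.; Journal of Chinese Computer Systems (《小型微型计算机系统》); 20070915 (No. 09); pp. 1672-1677 *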

Also Published As

Publication number Publication date
CN110442719A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
CN110442719B (en) Text processing method, device, equipment and storage medium
US10846524B2 (en) Table layout determination using a machine learning system
CN108304775B (en) Remote sensing image recognition method and device, storage medium and electronic equipment
CN108304835B (en) character detection method and device
WO2021244270A1 (en) Image processing method and apparatus, device, and computer readable storage medium
CN111950555A (en) Text recognition method and device, readable medium and electronic equipment
CN111680491A (en) Document information extraction method and device and electronic equipment
CN115861400B (en) Target object detection method, training device and electronic equipment
CN114511041B (en) Model training method, image processing method, apparatus, equipment and storage medium
CN113420757B (en) Text auditing method and device, electronic equipment and computer readable medium
CN114037985A (en) Information extraction method, device, equipment, medium and product
CN111626919B (en) Image synthesis method and device, electronic equipment and computer readable storage medium
US11734799B2 (en) Point cloud feature enhancement and apparatus, computer device and storage medium
US20160232420A1 (en) Method and apparatus for processing signal data
CN111738252A (en) Method and device for detecting text lines in image and computer system
CN103946865B (en) Method and apparatus for contributing to the text in detection image
CN112560857B (en) Character area boundary detection method, equipment, storage medium and device
CN112488095B (en) Seal image recognition method and device and electronic equipment
CN113255812A (en) Video frame detection method and device and electronic equipment
CN114155545A (en) Form identification method and device, readable medium and electronic equipment
CN114612647A (en) Image processing method, device, electronic device and storage medium
CN108230332B (en) Character image processing method and device, electronic equipment and computer storage medium
CN110633595A (en) Target detection method and device by utilizing bilinear interpolation
CN115359502A (en) Image processing method, device, equipment and storage medium
CN111881778B (en) Method, apparatus, device and computer readable medium for text detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Douyin Vision Co.,Ltd.

Country or region after: China

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: Tiktok vision (Beijing) Co.,Ltd.

Country or region before: China

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Tiktok vision (Beijing) Co.,Ltd.

Country or region after: China

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20241024

Address after: 100190, 10th Floor, Building 4, Zijin Digital Park, Haidian District, Beijing, 1004

Patentee after: Beijing Feishu Technology Co.,Ltd.

Country or region after: China

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: Douyin Vision Co.,Ltd.

Country or region before: China
