Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to." The term "based on" means "based at least in part on." The term "one embodiment" means "at least one embodiment"; "another embodiment" means "at least one additional embodiment"; and "some embodiments" means "at least some embodiments." Relevant definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "a" should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that, prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed, in an appropriate manner in accordance with relevant laws and regulations, of the type, usage range, usage scenario, etc. of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly indicate that the operation the user is requesting will require obtaining and using the user's personal information. Thus, according to the prompt information, the user can autonomously choose whether to provide personal information to software or hardware such as an electronic device, an application program, a server, or a storage medium that executes the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the prompt information may be sent to the user by way of, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control for the user to choose whether to provide personal information to the electronic device, e.g., in a "consent" or "decline" manner.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Meanwhile, it can be understood that the data involved in the technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the requirements of the corresponding laws, regulations, and related provisions.
Fig. 2 is a flow diagram of a text overlap detection method according to an embodiment of the present disclosure. As shown in fig. 2, the text overlap detection method includes the following steps S21 to S27.
In step S21, text recognition is performed on the object to be detected to obtain the text recognition confidence of each text line in the object to be detected.
The object to be detected refers to an object requiring text overlap detection. The object to be detected may be, for example, a screenshot of an APP, a screenshot of a game interface, etc.
Text recognition may be implemented using Optical Character Recognition (OCR) or any other type of text recognition algorithm.
In this step, the text recognition confidence of each text line of the object to be detected is obtained through text recognition. For example, assuming that the object to be detected has three text lines, namely text line 1, text line 2, and text line 3, text recognition is performed on the three text lines respectively to obtain text recognition confidence 1 for text line 1, text recognition confidence 2 for text line 2, and text recognition confidence 3 for text line 3.
In step S22, a text line having a text recognition confidence below a preset recognition confidence threshold is added to the first set of candidate abnormal regions.
The preset recognition confidence threshold can be set empirically or can be obtained by self-learning. For example, the text recognition confidence of the text line where the text overlap exists is self-learned, and accordingly, an appropriate preset recognition confidence threshold is set. For example, the preset recognition confidence threshold may be set to 0.8 or other suitable value.
In this step, if the text recognition confidence of a text line is lower than the preset recognition confidence threshold, the text line is added to the first candidate abnormal region set. For example, if the aforementioned text recognition confidence 1 is lower than the preset recognition confidence threshold while text recognition confidence 2 and text recognition confidence 3 are both above it, text line 1 is added to the first candidate abnormal region set.
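By way of non-limiting illustration, the confidence filtering of steps S21 and S22 may be sketched as follows; the tuple format of the recognition output and the function name are hypothetical assumptions, while the 0.8 threshold echoes the example above:

```python
def first_candidate_set(ocr_lines, threshold=0.8):
    """Collect text lines whose text recognition confidence falls below
    the preset recognition confidence threshold (step S22)."""
    return {line_id for line_id, confidence in ocr_lines if confidence < threshold}

# Hypothetical OCR output: (text line identifier, recognition confidence).
lines = [("text line 1", 0.55), ("text line 2", 0.93), ("text line 3", 0.88)]
print(first_candidate_set(lines))  # only text line 1 falls below 0.8
```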
In step S23, text line images of the respective text lines of the object to be detected are cut from the object to be detected, and the text line images are text-classified.
In some embodiments, capturing text line images of the individual text lines of the object to be detected may include first obtaining the coordinate information of each text line in the object to be detected, e.g., using a text recognition tool (e.g., OCR), and then cropping the text line image of each text line from the object to be detected, i.e., each text line is cropped as one text line image. Assuming the object to be detected includes three text lines in total, three text line images are cropped.
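A minimal sketch of cropping one text line image from the object to be detected, treating the image as a row-major array of pixels and the OCR coordinates as a hypothetical (x, y, width, height) box:

```python
def crop_text_line(image, box):
    """Crop the text line image given its (x, y, w, h) coordinates
    obtained from text recognition."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

# A 3x4 toy "image"; crop the box at x=1, y=0 of width 2, height 2.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
```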
In some embodiments, text classifying the text line images may include performing text classification on each text line image separately using a text classifier, for example, a classification algorithm based on visual tasks.
In step S24, a text line whose text classification result is overlapping text is added to the second candidate abnormal region set.
After the text classification result of each text line image is obtained in step S23, a text line whose text classification result is overlapping text may be added to the second candidate abnormal region set in step S24.
For example, assuming that the text classification result of text line 1 is overlapping text and the text classification results of text line 2 and text line 3 are both normal text, text line 1 would be added to the second set of candidate abnormal regions.
In step S25, object detection is performed on the overlapped text in the object to be detected.
Various object detection algorithms may be employed to perform object detection on the overlapping text in the object to be detected, for example, the Darknet object detection algorithm.
In some embodiments, performing target detection on overlapping text in the object to be detected may include performing target detection in the object to be detected with overlapping text lines and overlapping areas as the detection targets. That is, there are two types of detection targets: overlapping text lines and overlapping areas.
In step S26, a text line whose target detection result is overlapping text is added to the third candidate abnormal region set.
For example, if the object detection algorithm detects that text line 1 is overlapping text and text line 2 and text line 3 are both normal text, text line 1 will be added to the third set of candidate abnormal regions in this step.
In some embodiments, in the case where overlapping text lines and overlapping areas are taken as the detection targets in the object to be detected, adding the target detection result to the third candidate abnormal region set in step S26 may include adding an overlapping text line to the third candidate abnormal region set if the target detection result indicates that the coordinates of that overlapping text line overlap with the coordinates of some overlapping area. In this way, the accuracy of target detection can be improved and false alarms of target detection can be reduced. For example, for the object to be detected shown in fig. 3, if overlapping text line 1 and overlapping area 2 are detected and there is a coordinate overlap between them, overlapping text line 1 may be added to the third candidate abnormal region set.
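The coordinate-overlap check of step S26 reduces to testing whether two axis-aligned boxes intersect; a sketch, assuming (x1, y1, x2, y2) corner coordinates:

```python
def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes share any area, i.e. a detected
    overlapping text line and a detected overlapping area coincide."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
```

Boxes that merely touch along an edge share no area and are treated as non-overlapping here.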
In step S27, the intersection of the first, second, and third candidate abnormal region sets is determined as a text overlap detection result.
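The final determination of step S27 is a plain set intersection; sketched below with hypothetical text line identifiers:

```python
def text_overlap_result(first_set, second_set, third_set):
    """Step S27: a text line is reported as overlapping text only if all
    three candidate abnormal region sets contain it."""
    return first_set & second_set & third_set
```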
By adopting the above technical solution, the first candidate abnormal region set of the object to be detected is obtained by means of text recognition, the second candidate abnormal region set is obtained by means of text classification, the third candidate abnormal region set is obtained by means of target detection, and the text overlap detection result is determined as the intersection of the first, second, and third candidate abnormal region sets. Because a text line is reported only when all three detection approaches agree, false alarms from any single approach are suppressed. In addition, the labor cost of text overlap detection is greatly reduced.
In some embodiments, the text classification of the text line image may include cutting the text line image into a plurality of image blocks, and performing text classification on each image block using a text classifier that utilizes a class vector, a Transformer structure, and a classification multi-layer perceptron, wherein the class vector is used to integrate the whole-image features of the text line image.
The text line image is cut because, in most text lines, generally only part of the line overlaps. In the example of the "1990/us/science fiction" text line in fig. 1, only the "0/us" area is overlapped and the other areas are not, so cutting the text line image makes it easier to distinguish text lines exhibiting text overlap from normal text lines.
The classification multi-layer perceptron refers to a multi-layer perceptron adopting a binary classification model, the classification results of which are overlapping text and normal text.
By adopting the above technical solution, text classification can be performed on each image block and on the class vector; the whole-image feature of the text line image (e.g., whether text overlaps) can then be known from the text classification result of the class vector, and the feature of each image block (e.g., whether text overlaps) can be known from the text classification result of that image block.
Fig. 4 is a flow chart of text classification of a text line image according to an embodiment of the present disclosure. As shown in fig. 4, the text classification process includes steps S41 to S46.
In step S41, the text line image is cut into a plurality of image blocks, resulting in a first vector.
In some embodiments, the size of the convolution kernel of the neural network employed determines the size of the image blocks into which the text line image is cut. For example, if the size of the convolution kernel used is 16×16, then the size of the cut image blocks should be 16×16.
After the cut, the first vector obtained is k×n×m×j, where k is the number of image blocks obtained by cutting, n×m is the size of each image block, and j is the number of channels; for example, for a color text line image, the number of channels j is 3.
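The cut of step S41 and the resulting shape can be checked with a little arithmetic; a sketch assuming the image height and width are already integer multiples of the block size p (the disclosure's padding embodiment handles the other case):

```python
def first_vector_shape(h, w, j, p=16):
    """Shape k*n*m*j of the first vector after cutting an h-by-w image with
    j channels into p-by-p image blocks (step S41); h and w are assumed to
    be integer multiples of p."""
    assert h % p == 0 and w % p == 0
    k = (h // p) * (w // p)   # number of image blocks
    return (k, p, p, j)
```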
The text line image is cut because, in most text lines, generally only part of the line overlaps; in the "1990/us/science fiction" text line example in fig. 1, only the "0/us" area is overlapped, so cutting the text line image makes it easier to distinguish text lines exhibiting text overlap from normal text lines.
In step S42, the first vector is linearly transformed to obtain a second vector.
The linear transformation (i.e., a fully connected layer) flattens the pixel vector of each image block and feeds each image block into a linear projection layer. The compressed dimension adopted in the linear transformation is D; for example, D may be 512 or another value. This step may also be referred to as patch embedding (Patch Embedding).
The second vector obtained through the processing of step S42 is k×n×m×D.
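A sketch of the patch embedding of step S42 in its common form: each image block is flattened into one pixel vector and multiplied by a learned projection matrix to reach dimension D. The toy weights and the per-block flattening are assumptions for illustration; the shape notation in this disclosure may differ:

```python
def patch_embed(block, weights):
    """Flatten one n-by-m-by-j image block and linearly project it to
    dimension D (step S42, Patch Embedding). `weights` holds D columns,
    each of length n*m*j."""
    flat = [v for row in block for pixel in row for v in pixel]
    return [sum(f * w for f, w in zip(flat, column)) for column in weights]
```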
In step S43, a learnable class vector is added to the second vector to obtain a third vector, where the class vector is used to integrate the whole-image features of the text line image.
The third vector obtained after adding the class vector cls_token is (k+1)×n×m×D.
In step S44, a position code is added to the third vector to obtain a fourth vector, where the position code is used to characterize the relative positional relationship of the image blocks.
Since the order information of the input sequence is lost in the subsequent encoding process, a position code is added here so that the relative positional relationship of the individual image blocks is still known after encoding. The fourth vector obtained after adding the position code is (k+1)×(n×m+1)×D.
In step S45, the fourth vector is encoded using a Transformer structure.
Fig. 5 shows a schematic diagram of the architecture of a Transformer encoder. As shown in fig. 5, after the embedded image blocks (i.e., the fourth vector) are input to the Transformer encoder, they are first processed by layer normalization (Layer Normalization) and then by a multi-head attention module for feature enhancement; the output of the multi-head attention module is combined with the embedded image blocks through a residual connection, processed by layer normalization again, and then passed through a multi-layer perceptron to extract features; the resulting output is combined with its input through another residual connection to obtain the final encoder output.
Through the encoding, the feature difference between overlapping text and non-overlapping text (i.e., normal text) is made more pronounced.
In step S46, the encoded fourth vector is classified using the classification multi-layer perceptron to obtain the text classification results of the class vector and of each image block.
The architecture of the classification multi-layer perceptron can vary; fig. 6 shows a schematic diagram of one such architecture. As shown in fig. 6, the input from the encoder first passes through a linear layer, is activated by an activation function (e.g., GeLU), and has its number of channels reduced; it then passes through another linear layer and a further channel reduction to obtain the final text classification result.
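A simplified sketch of a two-layer classification head in the spirit of fig. 6 (linear layer, GeLU activation, second linear layer producing one logit per class); the exact channel reductions of fig. 6 are omitted and all weights are hypothetical toy values:

```python
import math

def gelu(x):
    """Exact GeLU activation via the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp_head(features, w1, w2):
    """Linear -> GeLU -> linear; returns logits for the two classes
    (overlapping text / normal text). w1 and w2 hold weight columns."""
    hidden = [gelu(sum(f * w for f, w in zip(features, column))) for column in w1]
    return [sum(h * w for h, w in zip(hidden, column)) for column in w2]
```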
By adopting the above text classification technical solution, text classification can be performed on each image block and on the class vector; the whole-image feature of the text line image (e.g., whether text overlaps) can then be known from the text classification result of the class vector, and the feature of each image block (e.g., whether text overlaps) can be known from the text classification result of that image block.
In some embodiments, the size of a certain text line image may not meet the size requirement of the convolution kernel of the neural network, e.g., the size of the text line image is not a multiple of the size of the convolution kernel. In this case, before the text line image is cut into a plurality of image blocks, the text line image may be resized while maintaining its aspect ratio, and pixels (e.g., 0) may then be padded into the resized image so that the size of the resulting text line image is an integer multiple of the size of the convolution kernel; padding here refers to expanding on the basis of the original text line image. In addition, a mask of the padded positions, i.e., a padding_mask, may be recorded so that it is known at which positions the padding was performed. Fig. 7 shows a schematic diagram of a text line image after padding.
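The padding step can be sketched as follows: round each side up to the next multiple of the convolution kernel size and record a padding_mask marking the filled positions (1 = padded). The plain rounding-up strategy is an assumption for illustration; the disclosure additionally resizes the image while keeping its aspect ratio first:

```python
def pad_to_multiple(h, w, p=16):
    """Target height/width padded up to multiples of the kernel size p,
    plus a padding_mask grid in which 1 marks a filled (padded) pixel."""
    padded_h = -(-h // p) * p   # ceiling division, scaled back up
    padded_w = -(-w // p) * p
    mask = [[1 if (y >= h or x >= w) else 0 for x in range(padded_w)]
            for y in range(padded_h)]
    return padded_h, padded_w, mask
```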
In the case where the text line image is padded, encoding the fourth vector using the Transformer structure described in step S45 of fig. 4 may include encoding the fourth vector with the Transformer structure while not applying the attention mechanism to the padded region during encoding. By excluding the padded region from attention, the feature difference between overlapping text and non-overlapping text (i.e., normal text) is made more pronounced.
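Not applying attention to the padded region is commonly implemented by masking the attention scores before the softmax, so padded positions receive zero attention weight; a sketch assuming at least one unpadded position:

```python
import math

def masked_softmax(scores, padding_mask):
    """Softmax over attention scores; positions with padding_mask == 1
    are excluded and receive zero attention weight."""
    neg_inf = float("-inf")
    masked = [s if m == 0 else neg_inf for s, m in zip(scores, padding_mask)]
    highest = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) if s != neg_inf else 0.0 for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```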
In some embodiments, adding a text line whose text classification result is overlapping text to the second candidate abnormal region set in step S24 includes adding the text line to the second candidate abnormal region set if the text classification result of the class vector corresponding to the text line image is overlapping text, the text classification results of more than N consecutive image blocks of the text line image are overlapping text, and the text classification confidences of the class vector and of those more than N consecutive image blocks are all greater than a preset classification confidence threshold.
N is a positive integer greater than or equal to 3.
The preset classification confidence threshold may be set empirically or may be derived by self-learning. For example, the classification confidence of text lines where there is text overlap is self-learned, whereby an appropriate preset classification confidence threshold is set. For example, the preset classification confidence threshold may be set to 0.9 or other suitable value.
For example, for a certain text line image, if the text classification result of the class vector corresponding to the text line image is overlapping text, image block 1, image block 2, and image block 3 of the text line image are consecutive image blocks whose text classification results are all overlapping text, and the text classification confidences of the class vector and of these three image blocks are all greater than the preset classification confidence threshold, the corresponding text line is added to the second candidate abnormal region set.
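The admission rule for the second candidate abnormal region set can be sketched as a pure function; `blocks` is a hypothetical sequence of (is_overlap, confidence) pairs in reading order, and "more than N consecutive image blocks" is interpreted here as a run of at least N qualifying blocks:

```python
def admit_to_second_set(cls_is_overlap, cls_confidence, blocks,
                        n=3, threshold=0.9):
    """A text line enters the second candidate abnormal region set only if
    the class vector says overlapping text with high confidence AND at
    least n consecutive image blocks do so as well."""
    if not (cls_is_overlap and cls_confidence > threshold):
        return False
    run = 0
    for is_overlap, confidence in blocks:
        if is_overlap and confidence > threshold:
            run += 1
            if run >= n:
                return True
        else:
            run = 0  # the run of overlapping blocks is broken
    return False
```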
By adopting this technical solution, a text line is considered to contain overlapping text only when the text classification result of the class vector corresponding to the text line image is overlapping text, the text classification results of more than N consecutive image blocks of the text line image are overlapping text, and the corresponding text classification confidences all exceed the preset classification confidence threshold. This improves the accuracy of text overlap detection and improves recall precision.
Fig. 8 shows an architectural diagram of a text classifier according to an embodiment of the present disclosure. As shown in fig. 8, the text line image is first padded while maintaining its aspect ratio; the text line image is then cut into a plurality of image blocks (fig. 8 takes cutting into 16 image blocks as an example); the image blocks are linearly transformed in a linear projection layer; position codes and a class vector are added, where "×" in fig. 8 indicates the class vector; encoding is performed in a Transformer encoder; and classification is performed in an MLP to obtain the text classification result.
By adopting the architecture shown in fig. 8, it is possible to perform text classification for each text line image.
Fig. 9 shows an architectural diagram of target detection according to an embodiment of the present disclosure, which adopts the Darknet architecture. As shown in fig. 9, three outputs are obtained after Darknet-53 processing, with dimensions (batch_size, 52, 52, 21), (batch_size, 26, 26, 21), and (batch_size, 13, 13, 21), where 21 = 3 × (2 + 4 + 1): 3 is the number of anchor boxes, 2 is the number of target categories within the anchor boxes (in this disclosure the target categories are overlapping text lines and overlapping areas), 4 is the number of coordinate offset values (i.e., tx, ty, tw, th), and 1 indicates whether there is an overlapping target. In this architecture there are 9 anchor boxes in total, divided equally among the aforementioned three feature layers. Multi-scale detection is thereby realized, so that overlapping targets of all sizes can be detected. In fig. 9, the grid dimensions "52×52", "26×26", and "13×13" and the anchor box count of 3 are merely examples, and the present disclosure is not limited thereto.
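The 21-channel arithmetic of fig. 9 follows directly from the anchor-box layout; a one-line check:

```python
def head_channels(num_anchors=3, num_classes=2, num_box_offsets=4,
                  num_objectness=1):
    """Per-cell output channels of each detection head: anchors times
    (class scores + coordinate offsets tx, ty, tw, th + objectness)."""
    return num_anchors * (num_classes + num_box_offsets + num_objectness)
```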
In some embodiments, the text overlap detection method according to embodiments of the present disclosure further includes training a classifier that performs text classification and a target detector that performs target detection.
For deep learning network training, the number of training samples is very important. However, since there is very little real text overlap data in real environments, the text overlap detection method according to the embodiments of the present disclosure further includes a step of automatically generating training samples, to avoid consuming a great deal of manpower to review and annotate text overlap data.
The training samples may be automatically generated by at least one of:
(1) Writing characters on a normal text line, wherein the fonts, colors and character strings of the written characters are random;
(2) Extracting foreground characters from a text line image and superimposing the extracted foreground characters on another text line image. As shown by reference numeral 4 in fig. 10, the text at reference numeral 4 was originally normal text; the overlapping text was formed by extracting foreground characters from other positions in the image and superimposing them at reference numeral 4.
In both of the above ways, the coordinates of the text lines can be obtained by a text recognition algorithm (e.g., OCR), so that it can be determined, according to the coordinates of the text lines, where to write characters or superimpose foreground characters.
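The superimposition in way (2) can be sketched on toy pixel grids: nonzero foreground character pixels overwrite the background text line image, producing a synthetic overlapping sample. The grid representation is an assumption for illustration:

```python
def superimpose(background, foreground):
    """Overlay extracted foreground character pixels (nonzero values)
    onto another text line image to synthesize overlapping text."""
    out = [row[:] for row in background]  # copy; keep the original intact
    for y, row in enumerate(foreground):
        for x, value in enumerate(row):
            if value:
                out[y][x] = value
    return out
```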
In addition, automatically generated coordinate information of text lines with overlapping text may also be recorded for use in training the text classifier and the object detector.
After enough training samples are obtained, the classifier for performing text classification and the target detector for performing target detection can be trained by using the training samples so as to improve the classification precision of the classifier and the target detection precision of the target detector.
Fig. 11 is a schematic block diagram of a text overlap detection apparatus according to an embodiment of the present disclosure. As shown in fig. 11, the text overlap detection apparatus comprises: a text recognition module 121 for performing text recognition on an object to be detected to obtain the text recognition confidence of each text line in the object to be detected, and adding text lines whose text recognition confidence is lower than a preset recognition confidence threshold to a first candidate abnormal region set; a text classification module 122 for cropping text line images of the text lines of the object to be detected from the object to be detected, performing text classification on the text line images, and adding text lines whose text classification result is overlapping text to a second candidate abnormal region set; a target detection module 123 for performing target detection on overlapping text in the object to be detected and adding text lines whose target detection result is overlapping text to a third candidate abnormal region set; and a determination module 124 for determining the intersection of the first, second, and third candidate abnormal region sets as the text overlap detection result.
By adopting the above technical solution, the first candidate abnormal region set of the object to be detected is obtained by means of text recognition, the second candidate abnormal region set is obtained by means of text classification, the third candidate abnormal region set is obtained by means of target detection, and the text overlap detection result is determined as the intersection of the first, second, and third candidate abnormal region sets. In addition, the labor cost of text overlap detection is greatly reduced.
In some embodiments, the text classification module 122 performing text classification on the text line image includes cutting the text line image into a plurality of image blocks, and performing text classification on each image block using a text classifier that utilizes a class vector, a Transformer structure, and a classification multi-layer perceptron, wherein the class vector is used to integrate the whole-image features of the text line image.
In some embodiments, the text classification module 122 performing text classification on each image block using a text classifier that utilizes a class vector, a Transformer structure, and a two-class multi-layer perceptron includes: performing linear transformation on a first vector composed of the plurality of image blocks to obtain a second vector; adding a learnable class vector to the second vector to obtain a third vector; adding a position code to the third vector to obtain a fourth vector, wherein the position code is used to characterize the relative positional relationship of the image blocks; encoding the fourth vector using the Transformer structure; and classifying the encoded fourth vector using the two-class multi-layer perceptron to obtain the text classification results of the class vector and of each image block.
In some embodiments, the text classification module 122 is further configured to, prior to cutting the text line image into the plurality of image blocks, resize the text line image while maintaining the aspect ratio of the text line image, and fill pixels in the resized text line image;
the text classification module 122 is further configured to encode the fourth vector using the Transformer structure without applying the attention mechanism to the padded region during encoding.
In some embodiments, the text classification module 122 adding a text line whose text classification result is overlapping text to the second candidate abnormal region set includes adding the text line to the second candidate abnormal region set if the text classification result of the class vector corresponding to the text line image is overlapping text, the text classification results of more than N consecutive image blocks of the text line image are overlapping text, and the text classification confidences of the class vector and of those more than N consecutive image blocks are all greater than a preset classification confidence threshold.
In some embodiments, the object detection module 123 performs object detection on overlapping text in the object to be detected, including performing the object detection with overlapping text lines and overlapping areas as detection objects in the object to be detected.
In some embodiments, the object detection module 123 adds the object detection result to the third set of candidate abnormal regions for the text line of overlapping text, including adding the overlapping text line to the third set of candidate abnormal regions if the object detection result indicates that the coordinates of a certain overlapping text line overlap with the coordinates of a certain overlapping region.
In some embodiments, the text overlap detection apparatus according to the embodiments of the present disclosure further includes a training module for automatically generating training samples by at least one of: writing characters on normal text lines, wherein the font, color, and character string of the written characters are random; and extracting foreground characters from a text line image and superimposing the extracted foreground characters on other text line images; the training module is further configured to train the classifier performing the text classification and the target detector performing the target detection using the training samples.
Specific implementation manners of operations performed by each module in the text overlap detection apparatus according to the embodiments of the present disclosure have been described in detail in related methods, and are not described herein.
Referring now to fig. 12, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 12 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 12, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and communication devices 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 12 shows an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or from the ROM 602. When the computer program is executed by the processing device 601, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of a computer readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform text recognition on an object to be detected to obtain text recognition confidences of text lines in the object to be detected, and add text lines whose text recognition confidence is lower than a preset recognition confidence threshold to a first candidate abnormal region set; crop text line images of all text lines of the object to be detected from the object to be detected, perform text classification on the text line images, and add text lines whose text classification result is overlapping text to a second candidate abnormal region set; perform target detection on overlapping text in the object to be detected, and add text lines whose target detection result is overlapping text to a third candidate abnormal region set; and determine the intersection of the first candidate abnormal region set, the second candidate abnormal region set, and the third candidate abnormal region set as the text overlap detection result.
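The intersection of the three candidate sets described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the per-line verdicts (`rec_conf`, `cls_overlap`, `det_overlap`) and the threshold value are hypothetical stand-ins for the outputs of a real OCR engine, text classifier, and target detector.

```python
# Sketch of combining the three candidate abnormal region sets.
# A text line is reported as overlapping text only when all three
# detection routes agree (set intersection).

def detect_text_overlap(lines, rec_conf_threshold=0.9):
    """Each element of `lines` is a dict with hypothetical keys:
    'id', 'rec_conf' (OCR confidence), 'cls_overlap' (classifier
    verdict), and 'det_overlap' (target-detector verdict)."""
    # First candidate set: OCR confidence below the preset threshold.
    set1 = {ln["id"] for ln in lines if ln["rec_conf"] < rec_conf_threshold}
    # Second candidate set: classifier labels the line as overlapping text.
    set2 = {ln["id"] for ln in lines if ln["cls_overlap"]}
    # Third candidate set: target detector flags the line.
    set3 = {ln["id"] for ln in lines if ln["det_overlap"]}
    return set1 & set2 & set3
```

Requiring agreement of all three routes trades recall for precision: a line flagged by only one or two detectors is treated as a false alarm.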
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the first acquisition module may also be described as "a module that acquires at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-a-Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a text overlap detection method, including: performing text recognition on an object to be detected to obtain text recognition confidences of text lines in the object to be detected, and adding text lines whose text recognition confidence is lower than a preset recognition confidence threshold to a first candidate abnormal region set; cropping text line images of each text line of the object to be detected from the object to be detected, performing text classification on the text line images, and adding text lines whose text classification result is overlapping text to a second candidate abnormal region set; performing target detection on overlapping text in the object to be detected, and adding text lines whose target detection result is overlapping text to a third candidate abnormal region set; and determining the intersection of the first candidate abnormal region set, the second candidate abnormal region set, and the third candidate abnormal region set as the text overlap detection result.
Example 2 provides the method of Example 1 according to one or more embodiments of the present disclosure, wherein performing text classification on the text line image comprises: cutting the text line image into a plurality of image blocks; and performing text classification on each of the image blocks using a text classifier that employs a class vector, a Transformer structure, and a binary-classification multi-layer perceptron, wherein the class vector is used to integrate whole-image features of the text line image.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein performing text classification on each of the image blocks using a text classifier that employs a class vector, a Transformer structure, and a binary-classification multi-layer perceptron comprises: performing a linear transformation on a first vector composed of the plurality of image blocks to obtain a second vector; adding a learnable class vector to the second vector to obtain a third vector; adding a positional encoding to the third vector to obtain a fourth vector, wherein the positional encoding is used to characterize the relative positional relationship of each of the image blocks; encoding the fourth vector using the Transformer structure; and classifying the encoded fourth vector using the binary-classification multi-layer perceptron to obtain the text classification results of the class vector and of each of the image blocks.
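The sequence-assembly steps of Example 3 can be illustrated with a small NumPy sketch. The random projection matrix, class vector, and positional encodings below are stand-ins for learned parameters, and the Transformer encoder and multi-layer-perceptron head that would follow are omitted; the shapes are purely illustrative.

```python
# Assembling the classifier input described in Example 3:
# flattened image blocks -> linear transformation ("second vector"),
# prepend the learnable class vector ("third vector"),
# add a positional encoding ("fourth vector").
import numpy as np

def build_input_sequence(blocks, w_proj, class_vec, pos_enc):
    # blocks: (num_blocks, block_dim) flattened image blocks
    tokens = blocks @ w_proj                 # linear transformation
    tokens = np.vstack([class_vec, tokens])  # prepend the class vector
    return tokens + pos_enc[: len(tokens)]   # add positional encoding

rng = np.random.default_rng(0)
blocks = rng.normal(size=(6, 48))      # 6 blocks cut from a text line image
w_proj = rng.normal(size=(48, 32))     # learned projection (stand-in)
class_vec = rng.normal(size=(1, 32))   # learnable class vector (stand-in)
pos_enc = rng.normal(size=(7, 32))     # one position code per token
seq = build_input_sequence(blocks, w_proj, class_vec, pos_enc)
```

After encoding, the token corresponding to the class vector carries the whole-image verdict, while each block token carries a per-block verdict.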
In accordance with one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein, before cutting the text line image into the plurality of image blocks, the method further comprises: adjusting the size of the text line image while maintaining the aspect ratio of the text line image, and filling pixels into the adjusted text line image;
wherein encoding the fourth vector using the Transformer structure comprises: encoding the fourth vector using the Transformer structure without applying the attention mechanism to the padded region during encoding.
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 3 or 4, wherein adding text lines whose text classification result is overlapping text to the second candidate abnormal region set comprises: adding the text line to the second candidate abnormal region set if the text classification result of the class vector corresponding to the text line image is overlapping text, more than N consecutive image blocks in the text line image have a text classification result of overlapping text, and the text classification confidences of the class vector and of the more than N consecutive image blocks are each greater than a preset classification confidence threshold.
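The admission rule of Example 5 can be written out directly. The function and parameter names are hypothetical; block results are modeled as `(is_overlap, confidence)` pairs.

```python
# Sketch of Example 5's rule for the second candidate set: admit a text
# line only if (a) the class-vector verdict is "overlapping text" with
# confidence above the threshold, and (b) more than N consecutive image
# blocks are classified as overlapping, each above the same threshold.

def admits_to_second_set(class_overlap, class_conf, block_results,
                         n=3, conf_threshold=0.8):
    if not class_overlap or class_conf <= conf_threshold:
        return False
    run = 0  # length of the current run of confident overlap blocks
    for is_overlap, conf in block_results:
        if is_overlap and conf > conf_threshold:
            run += 1
            if run > n:          # strictly more than N consecutive blocks
                return True
        else:
            run = 0              # run broken; start counting again
    return False
```

Requiring a consecutive run (rather than any N blocks) filters out isolated misclassified blocks scattered along an otherwise normal line.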
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 1, wherein performing target detection on overlapping text in the object to be detected comprises: performing the target detection in the object to be detected with overlapping text lines and overlapping areas as detection targets.
Example 7 provides the method of Example 6 according to one or more embodiments of the present disclosure, wherein adding text lines whose target detection result is overlapping text to the third candidate abnormal region set comprises: adding an overlapping text line to the third candidate abnormal region set if the target detection result indicates that the coordinates of the overlapping text line overlap with the coordinates of an overlapping area.
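The coordinate test of Example 7 reduces to an axis-aligned bounding-box intersection check. The helper names below are illustrative; boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
# Sketch of Example 7: an overlapping text line enters the third
# candidate set when its bounding box intersects any detected
# overlap-area box.

def boxes_overlap(a, b):
    """Standard axis-aligned intersection test on (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def third_candidate_set(text_line_boxes, overlap_area_boxes):
    # Return the indices of text lines whose box hits any overlap area.
    return {i for i, tl in enumerate(text_line_boxes)
            if any(boxes_overlap(tl, oa) for oa in overlap_area_boxes)}
```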
Example 8 provides the method of Example 1 according to one or more embodiments of the present disclosure, wherein the method further comprises: automatically generating training samples by at least one of: writing text on normal text lines, wherein the font, color, and character string of the written text are random; or performing foreground character extraction on a text line image and superimposing the extracted foreground characters on other text line images; and training, with the training samples, a classifier that performs the text classification and a target detector that performs the target detection.
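The second sample-generation strategy of Example 8 can be sketched with a toy grayscale representation. This is a deliberately simplified illustration: images are nested lists with 0 as black ink and 255 as white background, foreground extraction is a plain intensity threshold, and the function names are hypothetical; a real pipeline would use proper binarization plus random placement and scaling of the pasted characters.

```python
# Sketch of Example 8's synthesis: extract dark (ink) pixels from one
# text line image and superimpose them onto another, producing an
# "overlapping text" sample with a known label for training.

def extract_foreground(img, ink_threshold=128):
    # Keep ink pixels; None marks background to be discarded.
    return [[p if p < ink_threshold else None for p in row] for row in img]

def superimpose(fg, base):
    # Foreground ink wins wherever present; otherwise keep the base pixel.
    return [[fg_p if fg_p is not None else base_p
             for fg_p, base_p in zip(fg_row, base_row)]
            for fg_row, base_row in zip(fg, base)]

donor = [[255, 0, 255],
         [255, 0, 255]]   # a vertical stroke on white background
base  = [[255, 255, 255],
         [0, 255, 255]]   # a different text line image
sample = superimpose(extract_foreground(donor), base)
```

Because the synthesis controls where the foreground is pasted, the same procedure yields both classification labels and target-detection boxes for free.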
The foregoing description is merely of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the method embodiments and will not be described in detail herein.