
CN114842460A - Scene character detection method and device - Google Patents

Scene character detection method and device

Info

Publication number
CN114842460A
CN114842460A (application CN202210265977.0A)
Authority
CN
China
Prior art keywords
text
character
image
detected
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210265977.0A
Other languages
Chinese (zh)
Inventor
徐鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Kunpeng Jiangsu Technology Co Ltd
Original Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Kunpeng Jiangsu Technology Co Ltd filed Critical Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority to CN202210265977.0A priority Critical patent/CN114842460A/en
Publication of CN114842460A publication Critical patent/CN114842460A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and device for scene text detection, relating to the technical field of image text processing. A specific embodiment of the method includes: acquiring an image to be detected, the image to be detected containing a text region; determining a text centerline map and a direction distance map of the text region; and determining the text contour of the text region according to the text centerline map and the direction distance map. In the scene text detection method of the embodiments of the present invention, the text centerline map can effectively separate closely adjacent text, and the direction distance map can detect scene text of arbitrary shape and orientation, solving the problem of detecting irregular scene text and thereby effectively improving detection performance.

Description

A method and device for scene text detection

Technical Field

The present invention relates to the field of image text processing, and in particular to a method and device for scene text detection.

Background

Scene text detection has broad application prospects; for example, it can help autonomous driving systems obtain real-time road-condition and geographic information.

Current scene text detection methods fall mainly into two categories. One is regression-based: it predicts the offsets between candidate text boxes and ground-truth text boxes to obtain horizontal boxes, oriented rectangles, or quadrilaterals, but it is not suitable for detecting irregularly shaped text. The other is segmentation-based: it uses a fully convolutional network to classify, pixel by pixel, whether each point of the image belongs to a text region or a non-text region, but it has difficulty separating closely adjacent text regions, which reduces detection accuracy.

Summary of the Invention

In view of this, embodiments of the present invention provide a method and device for scene text detection that can effectively separate closely adjacent scene text within a text region, solve the problem of detecting scene text of irregular shape and orientation, and improve detection performance.

To achieve the above object, according to one aspect of the embodiments of the present invention, a method for scene text detection is provided, comprising:

acquiring an image to be detected, the image to be detected containing a text region;

determining a text centerline map and a direction distance map of the text region;

determining the text contour of the text region according to the text centerline map and the direction distance map;

wherein the text centerline map is constructed from the text centerline of the text region, and the direction distance map is obtained by regressing, in polar form, the distances from points on the text centerline to the edge of the text contour along a plurality of preset directions.

Optionally, before determining the text centerline map and the direction distance map of the text region, the method comprises:

acquiring an image data training set, wherein the image data training samples in the training set contain text regions;

constructing a network structure and a multi-task loss function for the network structure; and

training the network structure with the image data training set until the value of the multi-task loss function reaches a preset condition, thereby obtaining a scene text detection model.

Optionally, the multi-task loss function comprises a segmentation loss function and a regression loss function; the segmentation loss function is constructed from the predicted and ground-truth text centerline maps, and the regression loss function is constructed from the predicted and ground-truth direction distance maps;

the determining of the text centerline map and the direction distance map of the text region comprises:

inputting the image to be detected into the trained scene text detection model and predicting the text centerline map and the direction distance map corresponding to the image to be detected.

Optionally, inputting the image to be detected into the trained scene text detection model and predicting the text centerline map and the direction distance map corresponding to the image to be detected comprises:

extracting features of the image to be detected and performing feature fusion according to the scene text detection model to obtain a fused feature map;

predicting the text centerline map and the direction distance map corresponding to the image to be detected according to the fused feature map and the scene text detection model.

Optionally, determining the text contour of the text region according to the text centerline map and the direction distance map comprises:

aggregating adjacent points on the text centerline in the text centerline map to form a connected region;

determining sampling points on the connected region;

determining, according to the sampling points and the direction distance map, the direction points corresponding to each sampling point;

determining the text contour of the text region according to the direction points of the sampling points.

Optionally, determining the sampling points on the connected region comprises:

determining the minimum rotated bounding rectangle of the connected region;

dividing the rectangle into n equal parts to form n-1 perpendicular lines, where n is an integer greater than 1;

taking the midpoint of the intersection of each perpendicular line with the connected region as a sampling point.

Optionally, extracting features of the image to be detected and performing feature fusion to obtain a fused feature map comprises:

performing feature extraction on the image to be detected through a backbone network to obtain a plurality of feature maps of different scales;

fusing the plurality of feature maps of different scales to obtain the fused feature map.

Optionally, performing feature extraction on the image to be detected through the backbone network to obtain a plurality of feature maps of different scales comprises: extracting features of the image to be detected through a plurality of convolution modules of the backbone network to obtain a plurality of deep feature maps and shallow feature maps of different scales;

fusing the plurality of feature maps of different scales to obtain the fused feature map comprises: performing context feature extraction on the plurality of deep feature maps using a standard convolution and a plurality of dilated convolutions with different dilation rates, and merging the deep feature maps after context feature extraction with the shallow feature map by concatenation to obtain the fused feature map.

Another aspect of the embodiments of the present invention provides a device for scene text detection, comprising:

an acquisition module for acquiring an image to be detected, the image to be detected containing a text region;

a model prediction module for determining a text centerline map and a direction distance map of the text region;

a determination module for determining the text contour of the text region according to the text centerline map and the direction distance map,

wherein the text centerline map is constructed from the text centerline of the text region, and the direction distance map is obtained by regressing, in polar form, the distances from points on the text centerline to the edge of the text contour along a plurality of preset directions.

According to another aspect of the embodiments of the present invention, an electronic device is provided, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for scene text detection provided by the present invention.

According to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided, on which a computer program is stored; when the program is executed by a processor, the method for scene text detection provided by the present invention is implemented.

An embodiment of the above invention has the following advantages or beneficial effects: an image to be detected containing a text region is acquired; a text centerline map and a direction distance map of the text region are determined from the image to be detected; and the text contour of the text region is obtained from the text centerline map and the direction distance map. In the scene text detection method of the embodiments of the present invention, the text centerline map can effectively separate closely adjacent text, and the direction distance map derived from the centerline map can detect scene text of arbitrary shape and orientation, solving the problem of detecting irregular scene text and thereby effectively improving the detection performance of the scene text detector.

Further effects of the above non-conventional alternatives are described below in conjunction with the specific embodiments.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present invention and do not constitute an undue limitation of it. In the drawings:

FIG. 1 is a schematic diagram of the main flow of a method for scene text detection according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of representations of text of different shapes according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process for obtaining a fused feature map according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a process for determining the text contour of a text region according to an embodiment of the present invention;

FIG. 5 is a schematic flowchart of a method for scene text detection according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the main modules of a device for scene text detection according to an embodiment of the present invention;

FIG. 7 is a diagram of an exemplary system architecture to which an embodiment of the present invention may be applied;

FIG. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding; these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

In recent years, deep-learning-based scene text detection has been widely applied across computer vision; autonomous driving, image and video retrieval, and text translation, for example, all require recognizing scene text. Many existing scene text detection methods rely on hand-crafted features to distinguish text regions from non-text regions, which requires extensive feature engineering and cannot guarantee the robustness of text detection. Deep-learning-based scene text detection methods include regression-based and segmentation-based approaches, but they cannot effectively separate closely adjacent text regions, and when detecting irregularly shaped text, such as curved text, they predict redundant background information, which strongly interferes with subsequent text recognition and lowers detection accuracy. To address these problems, embodiments of the present invention provide a method for scene text detection that can separate closely adjacent text, is suitable for detecting irregularly shaped scene text, and improves detection performance.

FIG. 1 is a schematic diagram of the main flow of a method for scene text detection according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S101: acquiring an image to be detected, the image to be detected containing a text region;

Step S102: determining a text centerline map and a direction distance map of the text region;

Step S103: determining the text contour of the text region according to the text centerline map and the direction distance map.

In an embodiment of the present invention, the image to be detected may be an image obtained in scenarios such as autonomous driving, image and video retrieval, or text translation. The image to be detected contains a text region, that is, a region containing scene text (a text instance); the scene text may be, for example, one or more words or lines of text. The image to be detected may contain one or more text regions. Optionally, the scene text may have a regular or an irregular shape; for example, the scene text may be curved.

In an embodiment of the present invention, the text centerline map is constructed from the text centerline of the text region, and the direction distance map is obtained by regressing, in polar form, the distances from points on the text centerline to the edge of the text contour along a plurality of preset directions.

FIG. 2 is a schematic diagram of representations of text of different shapes. In FIG. 2, (a) shows the text contour of regularly shaped text obtained with Euclidean coordinates; (b) shows the text contour of regularly shaped text obtained using a single point carrying polar coordinates; (c) shows that when the text has an irregular shape, a single point with polar coordinates cannot cover the whole text region, making it difficult to obtain the text contour accurately; and (d) shows that, for irregularly shaped text, an embodiment of the present invention first obtains the text centerline map of the text region and then regresses the distances from points on the text centerline to the text contour along preset directions, for example eight directions with an angle of 45 degrees between adjacent directions, thereby obtaining the contour of the irregularly shaped text. The method of the embodiments of the present invention can thus precisely locate text of arbitrary shape in a scene image.
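To make the polar representation concrete, the following minimal Python sketch converts the eight regressed distances at one centerline point into the eight contour points they describe; the function name and the assumption that the eight directions are spaced 45 degrees apart starting from 0 are ours for illustration:

```python
import numpy as np

# Eight preset directions, 45 degrees apart (an assumed ordering).
ANGLES = np.deg2rad(np.arange(8) * 45.0)

def direction_points(x, y, distances):
    """Map the 8 regressed distances at centerline point (x, y)
    to the 8 contour points they describe."""
    distances = np.asarray(distances, dtype=np.float64)
    xs = x + distances * np.cos(ANGLES)
    ys = y + distances * np.sin(ANGLES)
    return np.stack([xs, ys], axis=1)  # shape (8, 2)

# Example: a centerline point whose contour is 5 px away in every direction.
print(direction_points(100.0, 40.0, np.full(8, 5.0)))
```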

In an embodiment of the present invention, before the text centerline map and the direction distance map of the text region are determined, the method includes:

acquiring an image data training set, the image data containing text regions;

constructing a network structure and a multi-task loss function for the network structure; and

training the network structure with the image data training set until the value of the multi-task loss function reaches a preset condition, thereby obtaining the scene text detection model.

In an embodiment of the present invention, the scene text detection model must be obtained before the text centerline map and the direction distance map are determined. First, an image data training set is acquired; it contains a plurality of image data training samples, each of which contains a text region. The manner in which the image data are obtained is not particularly limited. A network structure is then constructed, consisting of a CNN (convolutional neural network) and an FPN (feature pyramid network).
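The patent does not spell out the prediction layers, but a minimal sketch of what the two output heads could look like on top of the fused CNN+FPN feature map is given below; the 1-channel centerline map and 8-channel direction distance map follow the text, while the layer names and the channel count of the fused features are assumptions:

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Assumed prediction heads: per-pixel TCL score and 8 direction distances."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.tcl_head = nn.Conv2d(in_channels, 1, kernel_size=1)  # centerline map
        self.dd_head = nn.Conv2d(in_channels, 8, kernel_size=1)   # distance map

    def forward(self, fused):                  # fused: (N, C, H, W)
        tcl = torch.sigmoid(self.tcl_head(fused))
        dd = self.dd_head(fused)               # raw regressed distances
        return tcl, dd

heads = DetectionHeads()
tcl, dd = heads(torch.randn(1, 256, 128, 128))
print(tcl.shape, dd.shape)  # (1, 1, 128, 128) (1, 8, 128, 128)
```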

Model training is performed with the image training samples of the plurality of images to obtain the scene text detection model. Optionally, the model may be based on an FCN (fully convolutional network) to implement semantic segmentation of the image.

In an embodiment of the present invention, the multi-task loss function comprises a segmentation loss function and a regression loss function; the segmentation loss function is constructed from the predicted and ground-truth text centerline maps, and the regression loss function is constructed from the predicted and ground-truth direction distance maps.

In an embodiment of the present invention, the value of the constructed multi-task loss function is computed as shown in equation (1),

$L = \lambda L_{tcd} + L_{dd}$    (1)

where L_tcd denotes the binary segmentation loss of the TCL (text centerline map), L_dd denotes the regression loss of the DD (direction distance map), and λ is a weight coefficient that balances the two loss terms, for example 0.1.

Predicting the TCL can be regarded as a binary classification problem of deciding, pixel by pixel, whether each pixel of the image belongs to text or to the background. Because the scales of text instances in natural scenes differ significantly, if all text pixels carried the same weight, small text instances might be missed because they contribute little to the total binary segmentation loss. The embodiments of the present invention therefore adopt an instance-balanced Dice loss; the binary segmentation loss function of the TCL is constructed as shown in equation (2),

$$L_{tcd} = 1 - \frac{2\sum_{p} W(p)\,G(p)\,P(p)}{\sum_{p} W(p)\,G(p)^{2} + \sum_{p} W(p)\,P(p)^{2}} \qquad (2)$$

In equation (2), G and P denote the ground-truth and predicted TCL regions in the image, respectively, and W denotes the weight map of the TCL; the weight w_c(p) of any pixel p in W is obtained from equation (3),

$$w_c(p) = \frac{Area(C)}{N \cdot Area(C_p)} \qquad (3)$$

In equation (3), w_c(p) denotes the weight of pixel p in the TCL, C denotes the set of TCL pixels, Area(C) is the total number of pixels in region C, N is the number of text instances in the image, and C_p denotes the centerline region in the image that contains pixel p.
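A minimal sketch of equations (2) and (3) in PyTorch follows; the treatment of background pixels (weight 1) and the tensor layout are our assumptions for illustration:

```python
import torch

def instance_weights(gt, instance_ids):
    """Equation (3): gt is an (H, W) 0/1 TCL mask; instance_ids is an
    (H, W) integer label map with 0 for background (assumed weight 1)."""
    w = torch.ones_like(gt, dtype=torch.float32)
    area_c = gt.float().sum()                    # Area(C): all TCL pixels
    n = int(instance_ids.max().item())           # N: number of text instances
    for i in range(1, n + 1):
        mask = instance_ids == i
        w[mask] = area_c / (n * mask.sum())      # Area(C) / (N * Area(C_p))
    return w

def tcl_loss(pred, gt, w, eps=1e-6):
    """Equation (2): weighted Dice loss over (H, W) tensors."""
    inter = (w * pred * gt).sum()
    denom = (w * pred * pred).sum() + (w * gt * gt).sum() + eps
    return 1.0 - 2.0 * inter / denom
```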

In an optional implementation of the embodiments of the present invention, predicting the direction distances means predicting the distances from points on the TCL to the text contour along the plurality of directions. The DD loss function is constructed based on the Smooth L1 loss, as shown in equation (4),

$$L_{dd} = \frac{1}{|C|} \sum_{(x,y) \in C} \mathrm{SmoothL1}\!\left(\frac{d_{x,y} - d^{*}_{x,y}}{Norm_{x,y}}\right) \qquad (4)$$

In equation (4), d_{x,y} and d*_{x,y} denote the ground-truth and predicted distances from the point (x, y) to the text edge, respectively, and Norm_{x,y} is obtained from equation (5),

$$Norm_{x,y} = \sqrt{Box\_H_{x,y}^{2} + Box\_W_{x,y}^{2}} \qquad (5)$$

In equation (5), Box_H_{x,y} and Box_W_{x,y} denote the height and width of the bounding box of the text instance in which the point (x, y) lies.
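A matching sketch of the DD regression loss of equations (4) and (5); restricting the loss to centerline pixels and the exact reduction are assumptions:

```python
import torch
import torch.nn.functional as F

def dd_loss(d_pred, d_gt, box_h, box_w, tcl_mask):
    """d_pred, d_gt: (8, H, W) distances; box_h, box_w, tcl_mask: (H, W)."""
    norm = torch.sqrt(box_h ** 2 + box_w ** 2).clamp(min=1.0)  # equation (5)
    diff = (d_pred - d_gt) / norm               # broadcast over the 8 channels
    loss = F.smooth_l1_loss(diff, torch.zeros_like(diff), reduction="none")
    mask = tcl_mask.float()                     # only centerline pixels count
    return (loss * mask).sum() / (8.0 * mask.sum().clamp(min=1.0))
```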

In an embodiment of the present invention, the neural network (for example, an FCN) is trained end to end so as to minimize the value of the multi-task loss function. Training iterates until the value of the multi-task loss function reaches a preset convergence condition, for example until the change in the value of the multi-task loss function falls within a preset range, at which point iteration stops and the scene text detection model is obtained.
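Put together, an end-to-end training loop might look like the sketch below, reusing DetectionHeads, tcl_loss, and dd_loss from the earlier sketches; the optimizer, learning rate, and data-loader format are assumptions:

```python
import torch

def train(model, loader, lam=0.1, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for fused, tcl_gt, w, dd_gt, box_h, box_w in loader:
            # fused: (1, C, H, W); the rest: (H, W) or (8, H, W) tensors.
            tcl_pred, dd_pred = model(fused)
            # Equation (1): lambda-weighted TCL loss plus DD loss.
            loss = lam * tcl_loss(tcl_pred[0, 0], tcl_gt, w) \
                 + dd_loss(dd_pred[0], dd_gt, box_h, box_w, tcl_gt)
            opt.zero_grad()
            loss.backward()
            opt.step()
```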

In an embodiment of the present invention, determining the text centerline map and the direction distance map of the text region includes inputting the image to be detected into the trained scene text detection model and predicting the text centerline map and the direction distance map corresponding to the image to be detected. Specifically, this includes: extracting features of the image to be detected according to the scene text detection model and performing feature fusion to obtain a fused feature map; and predicting the text centerline map and the direction distance map corresponding to the image to be detected according to the fused feature map and the scene text detection model.

In an embodiment of the present invention, extracting features of the image to be detected and performing feature fusion to obtain a fused feature map includes: performing feature extraction on the image to be detected through a backbone network to obtain a plurality of feature maps of different scales, and fusing the plurality of feature maps of different scales to obtain the fused feature map. Optionally, a ResNet-50 residual convolutional network with the fully connected layer removed is used as the backbone (main network) to extract the features of the image to be detected.
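A minimal sketch of such a backbone, using torchvision's ResNet-50 with its average-pooling and fully connected layers stripped off; the input size is an arbitrary example:

```python
import torch
import torchvision.models as models

resnet = models.resnet50(weights=None)
# Keep only the convolutional stages; drop avgpool and fc.
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

feat = backbone(torch.randn(1, 3, 512, 512))
print(feat.shape)  # torch.Size([1, 2048, 16, 16]): the deepest stage output
```

In practice the intermediate stage outputs would also be tapped to obtain the shallow and deep feature maps described below.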

In an embodiment of the present invention, performing feature extraction on the image to be detected through the backbone network to obtain a plurality of feature maps of different scales includes: extracting features of the image to be detected through a plurality of convolution modules of the backbone network to obtain a plurality of deep feature maps and shallow feature maps of different scales, and performing context feature extraction on the plurality of deep feature maps using a standard convolution and a plurality of dilated convolutions with different dilation rates.

In an embodiment of the present invention, fusing the plurality of feature maps of different scales to obtain the fused feature map includes: merging the deep feature maps after context feature extraction with the shallow feature map by concatenation to obtain the fused feature map.

Because of the limited receptive field of standard convolution, standard convolution is not suited to scene text that varies greatly in shape and aspect ratio. The embodiments of the present invention therefore introduce dilated convolution, which gives the network a larger receptive field with the same number of parameters and thus improves the detection of long text.

A standard convolution and a plurality of dilated convolutions with different dilation rates form a context feature extraction (CFE) module, which is applied to the deep feature maps to extract their rich contextual features.

FIG. 3 is a schematic diagram of a process for obtaining a fused feature map according to an embodiment of the present invention. The image to be detected (Image) is input, and features are extracted through the backbone network: after a downsampling (/2) stage (stage1) with 64 convolution kernels, the image passes through four further downsampling (/2) stages (stage2, stage3, stage4, and stage5) with 256, 512, 1024, and 2048 kernels, respectively. The output of stage2 is processed by a 1×1 convolution layer (Conv1*1) to obtain the shallow feature map, and the outputs of stage3, stage4, and stage5 are processed by 1×1 convolutions with different channel counts to obtain three deep feature maps of different scales. The CFE module performs context feature extraction on the deep feature maps; it consists of three dilated convolutions with different dilation rates (r set to 3, 5, and 7, respectively) and one 1×1 standard convolution, whose outputs are fused by element-wise summation (Element-wise Sum). The three deep feature maps after context extraction and the one shallow feature map are then connected along the channel axis (Concat) to fuse feature information from different receptive field ranges; that is, the shallow and deep feature maps are merged in a cascaded manner, yielding a fused feature map of the same size as the image to be detected.
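A minimal sketch of the CFE module as described for FIG. 3: three dilated convolutions with dilation rates 3, 5, and 7 plus one 1×1 standard convolution, fused by element-wise summation. The 3×3 kernel size and the padding are assumptions chosen so the spatial size is preserved:

```python
import torch
import torch.nn as nn

class CFE(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.dilated = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (3, 5, 7)                  # dilation rates from the text
        ])
        self.standard = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        out = self.standard(x)
        for conv in self.dilated:
            out = out + conv(x)                 # element-wise sum of branches
        return out

cfe = CFE(256)
print(cfe(torch.randn(1, 256, 64, 64)).shape)   # unchanged: (1, 256, 64, 64)
```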

In an embodiment of the present invention, determining the text contour of the text region according to the text centerline map and the direction distance map includes: forming a connected region from the text centerline; determining sampling points on the connected region; determining, according to the sampling points and the direction distance map, the direction points corresponding to each sampling point; and determining the text contour of the text region from the direction points.

Optionally, aggregating adjacent points on the text centerline in the text centerline map to form a connected region includes: aggregating adjacent points of the text centerline using image-processing connectivity operations, such as erosion and dilation, to form the connected region.
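A minimal sketch of this aggregation step with OpenCV; the binarization threshold and the kernel size are illustrative assumptions:

```python
import cv2
import numpy as np

def centerline_regions(tcl_prob, thresh=0.5):
    """Aggregate adjacent predicted centerline points into connected regions."""
    mask = (tcl_prob > thresh).astype(np.uint8)
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8))  # join adjacent points
    n, labels = cv2.connectedComponents(mask)
    # labels[i, j] == k marks pixel (i, j) as part of centerline region k.
    return n - 1, labels            # number of regions and the label map
```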

In an embodiment of the present invention, determining the sampling points on the connected region includes: determining the minimum rotated bounding rectangle of the connected region; dividing the rectangle into n equal parts to form n-1 perpendicular lines, where n is an integer greater than 1; and taking the midpoint of the intersection of each perpendicular line with the connected region as a sampling point.
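The sampling step could be sketched as follows; the OpenCV angle convention is version-dependent, and approximating each midpoint by the mean of the region pixels on the cut line is our simplification:

```python
import cv2
import numpy as np

def sample_points(region_mask, n=11):
    """Sample n-1 points along one connected centerline region."""
    ys, xs = np.nonzero(region_mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(pts)  # min rotated rectangle
    theta = np.deg2rad(angle if w >= h else angle + 90.0)
    axis = np.array([np.cos(theta), np.sin(theta)]) # unit vector, long side
    long_side = max(w, h)
    proj = (pts - np.array([cx, cy])) @ axis        # position along long axis
    samples = []
    for k in range(1, n):                           # n-1 equally spaced cuts
        cut = -long_side / 2.0 + k * long_side / n
        near = pts[np.abs(proj - cut) < 1.0]        # pixels on the cut line
        if len(near):
            samples.append(near.mean(axis=0))       # midpoint of intersection
    return np.array(samples)
```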

Optionally, the sampling points on the connected region may also be determined in other ways; for example, sampling points may be taken at preset intervals along the connected region.

Optionally, a polygonal bounding box, which is the text contour of the text region, can be generated from the direction points; for example, the Alpha-shape algorithm (an algorithm that uses certain characteristic points to trace the intuitive outline of a point set) can be used to generate the polygonal bounding box from the direction points.

In an embodiment of the present invention, once the text centerline map and the direction distance map have been determined, post-processing is performed to obtain the text contour. FIG. 4 is a schematic diagram of a process for determining the text contour of a text region. In FIG. 4, (a) shows the connected regions formed by two detected text centerlines; in (b), for one of the connected regions, the minimum rotated bounding rectangle is obtained and divided into n equal parts along its long side, with n = 11, yielding 10 perpendicular lines, and the midpoint of each perpendicular line's intersection with the connected region is taken as a sampling point, giving 10 sampling points; in (c), based on the sampling points and the direction distance map, the corresponding direction points are computed for each sampling point, yielding the direction points of the text edge region; in (d), the Alpha-shape algorithm is applied to the obtained direction points to produce a polygonal bounding box, which is the text contour of the text region. Compared with segmentation-based methods such as PixelLink, TextSnake, and TextField, the post-processing of the embodiments of the present invention is simpler and more efficient and can improve the efficiency of scene text detection.
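Tying the post-processing together, the sketch below reuses sample_points and direction_points from the earlier sketches to collect the candidate boundary points of one text instance; the final Alpha-shape step is omitted for brevity:

```python
import numpy as np

def text_contour_points(region_mask, dd_map, n=11):
    """dd_map: (8, H, W) predicted direction distances for the image."""
    contour = []
    for x, y in sample_points(region_mask, n):
        dists = dd_map[:, int(round(y)), int(round(x))]
        contour.extend(direction_points(x, y, dists))  # 8 points per sample
    return np.array(contour)   # input to the Alpha-shape contour step
```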

FIG. 5 is a schematic diagram of the process of a method for scene text detection according to an embodiment of the present invention. In FIG. 5, (a) is the acquired image to be detected, which contains a text region of multiple words; the words are curved and closely adjacent to one another. The image to be detected is input into the scene text detection model, yielding the single-channel text centerline map shown in (b) and the eight-channel direction distance map shown in (c), where the direction distance map is obtained by regressing, in polar form, the distances from points on the text centerline to the text contour along eight directions (up, down, left, right, upper left, lower left, upper right, and lower right). Post-processing is then performed based on the obtained direction distances and the text centerline map to obtain the text contour shown in (d), thereby reconstructing the text instance.

The method for scene text detection provided by the embodiments of the present invention represents text in polar coordinates and constructs a network structure to realize an end-to-end trainable deep learning model, namely the scene text detection model. Determining the text centerline map of the text region of the image to be detected by semantic segmentation effectively separates closely adjacent scene text within the text region; the direction distance map is then obtained based on the text centerline map, and the text contour of the text region is parameterized in polar form, so that scene text of arbitrary shape can be detected without including redundant background information. This facilitates subsequent text recognition, solves the problem of detecting irregular scene text, and allows text instances of arbitrary shape to be reconstructed more accurately. In addition, to handle the large scale variation of scene text, dilated convolution is introduced to extract rich contextual feature information, effectively improving the detection of long text. The post-processing is simple and efficient, which improves the efficiency of scene text detection.

As shown in FIG. 6, another aspect of the embodiments of the present invention provides a device 600 for scene text detection, comprising:

an acquisition module 601 for acquiring an image to be detected, the image to be detected containing a text region;

a model prediction module 602 for determining a text centerline map and a direction distance map of the text region;

a determination module 603 for determining the text contour of the text region according to the text centerline map and the direction distance map,

wherein the text centerline map is constructed from the text centerline of the text region, and the direction distance map is obtained by regressing, in polar form, the distances from points on the text centerline to the edge of the text contour along a plurality of preset directions.

In an embodiment of the present invention, the model prediction module 602 is further configured to: before determining the text centerline map and the direction distance map of the text region, acquire an image data training set in which the image data contain text regions; construct a network structure and a multi-task loss function for the network structure; and train the network structure with the image data training set until the value of the loss function reaches a preset condition, thereby obtaining the scene text detection model.

In an embodiment of the present invention, the model prediction module 602 is further configured to: extract features of the image to be detected and perform feature fusion to obtain a fused feature map; and input the fused feature map into the scene text detection model to obtain the text centerline map and the direction distance map, the direction distance map being obtained by regressing, in polar form, the distances from points on the text centerline to the edge of the text contour along a plurality of preset directions.

In an embodiment of the present invention, the determination module 603 is further configured to: form a connected region from the text centerline; determine sampling points on the connected region; determine, according to the sampling points and the direction distance map, a group of direction points corresponding to each sampling point; and determine the text contour of the text region from the groups of direction points of the sampling points.

In an embodiment of the present invention, the determination module 603 is further configured to: determine the minimum rotated bounding rectangle of the connected region; divide the rectangle into n equal parts to form n-1 perpendicular lines, where n is an integer greater than 1; and take the midpoint of the intersection of each perpendicular line with the connected region as a sampling point.

In an embodiment of the present invention, the model prediction module 602 is further configured to: perform feature extraction on the image to be detected through the backbone network to obtain a plurality of feature maps of different scales, and fuse the plurality of feature maps of different scales to obtain the fused feature map.

In an embodiment of the present invention, the model prediction module 602 is further configured to: extract features of the image to be detected through a plurality of convolution modules of the backbone network to obtain a plurality of deep feature maps and shallow feature maps of different scales; and fuse the plurality of feature maps of different scales to obtain the fused feature map by performing context feature extraction on the plurality of deep feature maps using a standard convolution and a plurality of dilated convolutions with different dilation rates and merging the deep feature maps after context feature extraction with the shallow feature map by concatenation.

A further aspect of the embodiments of the present invention provides an electronic device, comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for scene text detection provided by the embodiments of the present invention.

A still further aspect of the embodiments of the present invention provides a computer-readable medium on which a computer program is stored; when the program is executed by a processor, the method for scene text detection of the embodiments of the present invention is implemented.

FIG. 7 shows an exemplary system architecture 700 to which the method or device for scene text detection of embodiments of the present invention may be applied.

As shown in FIG. 7, the system architecture 700 may include terminal devices 701, 702, and 703, a network 704, and a server 705. The network 704 is the medium providing communication links between the terminal devices 701, 702, 703 and the server 705, and may include various connection types such as wired links, wireless communication links, or fiber-optic cables.

Users may use the terminal devices 701, 702, 703 to interact with the server 705 through the network 704 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 701, 702, 703, such as shopping applications, web browsers, search applications, instant messaging tools, email clients, and social platform software (by way of example only).

The terminal devices 701, 702, 703 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.

The server 705 may be a server providing various services, for example a back-end management server (by way of example only) that supports shopping websites browsed by users on the terminal devices 701, 702, 703. The back-end management server may analyze and otherwise process data such as the image to be detected and feed the processing result (for example, the text contour, by way of example only) back to the terminal device.

It should be noted that the method for scene text detection provided by the embodiments of the present invention is generally executed by the server 705; accordingly, the device for scene text detection is generally disposed in the server 705.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 7 are merely illustrative; there may be any number of terminal devices, networks, and servers according to implementation needs.

Referring now to FIG. 8, it shows a schematic structural diagram of a computer system 800 suitable for implementing a terminal device according to an embodiment of the present invention. The terminal device shown in FIG. 8 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in FIG. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

In particular, according to the disclosed embodiments of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication section 809 and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above-described functions defined in the system of the present invention are performed.

It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless links, wires, optical cables, RF, and the like, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be disposed in a processor; for example, a processor may be described as comprising an acquisition module, a model prediction module, and a determination module. The names of these modules do not in any case limit the modules themselves; for example, the acquisition module may also be described as "a module for acquiring an image to be detected".

As another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: acquire an image to be detected, the image to be detected containing a text region; determine a text centerline map and a direction distance map of the text region; and determine the text contour of the text region according to the text centerline map and the direction distance map.

根据本发明实施例的技术方案,提供了采用极坐标进行文字表达的方式,构建网络结构实现端到端可训练的深度学习模型即场景文字检测模型;通过确定待检测图像的文字区域的文字中心线图,能够有效地将文字区域中紧邻的场景文字分离,然后基于文字中心线图获得方向距离图,利用极坐标的方式来参数化文字区域的文字轮廓,从而可以检测任意形状的场景文字,不包括多余的背景信息,利于后续的文字识别,解决了不规则场景文字的检测问题,从而能够更加精确的重建出任意形状的文字实例。此外,为了解决场景文字尺度差异大的问题,引入空洞卷积来提取丰富的上下文特征信息,有效提升了对长文字的检测性能。并且后处理方法简单高效,能够提高场景文字检测的效率。According to the technical solution of the embodiment of the present invention, a method of using polar coordinates to express text is provided, and a network structure is constructed to realize an end-to-end trainable deep learning model, namely a scene text detection model; by determining the text center of the text area of the image to be detected Line map, which can effectively separate the text of the scene adjacent to the text area, and then obtain the direction distance map based on the text center line map, and use polar coordinates to parameterize the text outline of the text area, so that scene text of any shape can be detected. It does not include redundant background information, which is beneficial to subsequent text recognition, and solves the problem of text detection in irregular scenes, so that text instances of any shape can be more accurately reconstructed. In addition, in order to solve the problem of large differences in scene text scales, atrous convolution is introduced to extract rich contextual feature information, which effectively improves the detection performance of long texts. Moreover, the post-processing method is simple and efficient, which can improve the efficiency of scene text detection.

The above specific embodiments do not limit the protection scope of the present invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (11)

1. A method for scene text detection, comprising:
acquiring an image to be detected, wherein the image to be detected contains a text region;
determining a text centerline map and a direction distance map of the text region; and
determining a text contour of the text region according to the text centerline map and the direction distance map;
wherein the text centerline map is formed according to a text centerline of the text region, and the direction distance map is obtained by regressing, in polar coordinates, distances from points on the text centerline to an edge of the text contour along a plurality of preset directions.
2. The method of claim 1, wherein, before the determining of the text centerline map and the direction distance map of the text region, the method further comprises:
acquiring an image data training set, wherein each training sample in the image data training set contains a text region;
constructing a network structure and a multi-task loss function for the network structure; and
training the network structure with the image data training set until a value of the multi-task loss function reaches a preset condition, so as to obtain a scene text detection model.
3. The method of claim 2, wherein the multi-task loss function comprises a segmentation loss function and a regression loss function, the segmentation loss function being constructed from the predicted and ground-truth text centerline maps, and the regression loss function being constructed from the predicted and ground-truth direction distance maps; and
wherein the determining of the text centerline map and the direction distance map of the text region comprises:
inputting the image to be detected into the trained scene text detection model, and predicting the text centerline map and the direction distance map corresponding to the image to be detected.
4. The method of claim 3, wherein the inputting of the image to be detected into the trained scene text detection model to predict the corresponding text centerline map and direction distance map comprises:
extracting features of the image to be detected and performing feature fusion according to the scene text detection model to obtain a fused feature map; and
predicting the text centerline map and the direction distance map corresponding to the image to be detected according to the fused feature map and the scene text detection model.
5. The method of claim 1, wherein the determining of the text contour of the text region according to the text centerline map and the direction distance map comprises:
aggregating adjacent points on the text centerline in the text centerline map to form a connected region;
determining sampling points on the connected region;
determining, for each sampling point, the corresponding direction points according to the sampling point and the direction distance map; and
determining the text contour of the text region according to the direction points of the sampling points.
6. The method of claim 5, wherein the determining of the sampling points on the connected region comprises:
determining a minimum rotated bounding rectangle of the connected region;
dividing the rectangle into n equal parts to form n-1 vertical lines, wherein n is an integer greater than 1; and
taking the midpoint of the intersection of each vertical line with the connected region as a sampling point.
7. The method of claim 4, wherein the extracting of the features of the image to be detected and the performing of feature fusion to obtain the fused feature map comprise:
extracting features of the image to be detected through a backbone network to obtain a plurality of feature maps of different scales; and
fusing the plurality of feature maps of different scales to obtain the fused feature map.
8. The method of claim 7, wherein the extracting of the features through the backbone network to obtain the plurality of feature maps of different scales comprises:
extracting features of the image to be detected through a plurality of convolution modules of the backbone network to obtain deep feature maps and shallow feature maps of different scales;
and wherein the fusing of the plurality of feature maps of different scales to obtain the fused feature map comprises:
performing context feature extraction on the deep feature maps with a standard convolution and a plurality of dilated (atrous) convolutions of different dilation rates, and merging the context-enhanced deep feature maps with the shallow feature maps in a cascaded manner to obtain the fused feature map.
9. An apparatus for scene text detection, comprising:
an acquisition module for acquiring an image to be detected, wherein the image to be detected contains a text region;
a model prediction module for determining a text centerline map and a direction distance map of the text region; and
a determination module for determining a text contour of the text region according to the text centerline map and the direction distance map;
wherein the text centerline map is formed according to a text centerline of the text region, and the direction distance map is obtained by regressing, in polar coordinates, distances from points on the text centerline to an edge of the text contour along a plurality of preset directions.
10. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-8.
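
The post-processing recited in claims 5 and 6 can be sketched as follows; the use of OpenCV and NumPy is an implementation choice rather than something the claims require, and the convex-hull ordering at the end is a simplification of the contour-assembly step:

```python
import cv2
import numpy as np

def text_contours(centerline_map, distance_map, angles, n=8, thresh=0.5):
    """Reconstruct text contours from a centerline map (H x W scores) and a
    direction distance map (H x W x K) whose K channels correspond to the
    preset polar directions in `angles` (radians)."""
    binary = (centerline_map > thresh).astype(np.uint8)
    # Aggregate adjacent centerline points into connected regions (claim 5).
    num, labels = cv2.connectedComponents(binary)
    contours = []
    for lab in range(1, num):
        ys, xs = np.nonzero(labels == lab)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        # Minimum rotated bounding rectangle of the connected region (claim 6).
        (cx, cy), (w, h), angle = cv2.minAreaRect(pts)
        # Long-axis direction; OpenCV angle conventions vary by version,
        # which is adequate for a sketch.
        theta = np.deg2rad(angle if w >= h else angle + 90.0)
        axis = np.array([np.cos(theta), np.sin(theta)], dtype=np.float32)
        length = max(w, h)
        # Dividing the rectangle into n equal parts yields n-1 lines across
        # it; the midpoint of each line's intersection with the region is a
        # sampling point, approximated here by averaging a narrow pixel band.
        proj = (pts - np.array([cx, cy], dtype=np.float32)) @ axis
        samples = []
        for i in range(1, n):
            band = pts[np.abs(proj - (i / n - 0.5) * length) < 1.0]
            if len(band):
                samples.append(band.mean(axis=0))
        # Direction points: step from each sampling point along every preset
        # direction by the regressed distance (claim 5).
        dir_pts = []
        for sx, sy in samples:
            d = distance_map[int(sy), int(sx)]
            for k, a in enumerate(angles):
                dir_pts.append([sx + d[k] * np.cos(a), sy + d[k] * np.sin(a)])
        if dir_pts:
            hull = cv2.convexHull(np.array(dir_pts, dtype=np.float32))
            contours.append(hull.reshape(-1, 2))
    return contours
```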
CN202210265977.0A 2022-03-17 2022-03-17 Scene character detection method and device Pending CN114842460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210265977.0A CN114842460A (en) 2022-03-17 2022-03-17 Scene character detection method and device

Publications (1)

Publication Number Publication Date
CN114842460A true CN114842460A (en) 2022-08-02

Family

ID=82562166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210265977.0A Pending CN114842460A (en) 2022-03-17 2022-03-17 Scene character detection method and device

Country Status (1)

Country Link
CN (1) CN114842460A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807351A (en) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 Scene character detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIE CHEN et al.: "TextPolar: irregular scene text detection using polar representation", International Journal on Document Analysis and Recognition (IJDAR), 23 May 2021 (2021-05-23), pages 2-4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052175A (en) * 2022-10-28 2023-05-02 北京迈格威科技有限公司 Text detection method, electronic device, storage medium and computer program product

Similar Documents

Publication Publication Date Title
US11861919B2 (en) Text recognition method and device, and electronic device
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
JP2023022845A (en) Method of processing video, method of querying video, method of training model, device, electronic apparatus, storage medium and computer program
CN114444619B (en) Sample generation method, training method, data processing method and electronic device
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN114861889A (en) Deep learning model training method, target object detection method and device
CN112084954A (en) Video target detection method and device, electronic equipment and storage medium
CN113297525B (en) Webpage classification method, device, electronic equipment and storage medium
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN117708300A (en) Knowledge base question-answering method, device, equipment and medium
Qu et al. A method of single‐shot target detection with multi‐scale feature fusion and feature enhancement
CN113742485A (en) A method and apparatus for processing text
CN113127639B (en) Abnormal conversation text detection method and device
CN111492407A (en) System and method for drawing beautification
WO2023077963A1 (en) Image text recognition method and apparatus, computer readable medium, and electronic device
CN114842460A (en) Scene character detection method and device
CN113139110A (en) Regional feature processing method, device, equipment, storage medium and program product
CN113051381A (en) Information quality inspection method, device, computer system and computer readable storage medium
CN117873897A (en) Test method, test device, test equipment and storage medium
CN111191242A (en) Vulnerability information determination method, apparatus, computer-readable storage medium and device
CN111858916A (en) Method and device for clustering sentences
CN115019321A (en) Text recognition method, text model training method, text recognition device, text model training equipment and storage medium
CN111783572B (en) Text detection method and device
CN115496734A (en) Quality evaluation method of video content, network training method and device
CN111881778A (en) Text detection method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination