
CN110110715A - Text detection model training method, text region and content determination method, and apparatus - Google Patents


Info

Publication number
CN110110715A
CN110110715A
Authority
CN
China
Prior art keywords
text
candidate region
image
updated
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910367675.2A
Other languages
Chinese (zh)
Inventor
苏驰
李凯
刘弘也
赵志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd, Beijing Kingsoft Cloud Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN201910367675.2A priority Critical patent/CN110110715A/en
Publication of CN110110715A publication Critical patent/CN110110715A/en
Priority to PCT/CN2020/087809 priority patent/WO2020221298A1/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a text detection model training method, a text region and content determination method, and an apparatus. The text detection model training method includes: extracting multiple initial feature maps of a target training image through a first feature extraction network; fusing the multiple initial feature maps through a feature fusion network to obtain a fused feature map; inputting the fused feature map into a first output network, which outputs candidate regions for the text regions in the target training image and a probability value for each candidate region; determining a first loss value through a preset detection loss function; and training the first initial model according to the first loss value until the parameters of the first initial model converge, yielding the text detection model. The present invention can quickly, comprehensively, and accurately detect various kinds of text in images across scenarios with multiple font sizes, multiple fonts, multiple shapes, and multiple orientations, which in turn benefits the accuracy of subsequent text recognition and improves the recognition effect.

Description

Text detection model training method, text region and content determination method, and apparatus

Technical Field

The present invention relates to the technical field of image processing, and in particular to a text detection model training method, a text region determination method, a content determination method, and corresponding apparatus.

Background

In the related art, text detection and recognition can be implemented through character segmentation or through deep learning. However, these approaches are usually only suitable for simple scenarios, such as those with a single font and font size, a simple background, and a single text orientation; in complex scenarios, such as those with multiple font sizes, multiple fonts, multiple shapes, multiple orientations, or varied backgrounds, the above text detection and recognition approaches perform poorly.

SUMMARY OF THE INVENTION

In view of this, the purpose of the present invention is to provide a text detection model training method, a text region and content determination method, and an apparatus, so as to quickly, comprehensively, and accurately detect various kinds of text in images across scenarios with multiple font sizes, multiple fonts, multiple shapes, and multiple orientations, which in turn benefits the accuracy of subsequent text recognition and improves the recognition effect.

In a first aspect, an embodiment of the present invention provides a text detection model training method. The method includes: determining a target training image based on a preset training set; inputting the target training image into a first initial model, where the first initial model includes a first feature extraction network, a feature fusion network, and a first output network; extracting multiple initial feature maps of the target training image through the first feature extraction network, where the multiple initial feature maps differ in scale; fusing the multiple initial feature maps through the feature fusion network to obtain a fused feature map; inputting the fused feature map into the first output network, which outputs candidate regions for the text regions in the target training image and a probability value for each candidate region; determining a first loss value for the candidate regions and their probability values through a preset detection loss function; and training the first initial model according to the first loss value until the parameters of the first initial model converge, yielding the text detection model.

In some embodiments, the first feature extraction network includes multiple groups of first convolutional networks connected in sequence; each group of first convolutional networks includes a convolutional layer, a batch normalization layer, and an activation function layer connected in sequence.

In some embodiments, the step of fusing the multiple initial feature maps through the feature fusion network to obtain the fused feature map includes: arranging the multiple initial feature maps in order according to their scales, where the initial feature map at the topmost level has the smallest scale and the initial feature map at the bottommost level has the largest scale; taking the topmost initial feature map as the topmost fused feature map; for each level other than the topmost, fusing the initial feature map of the current level with the fused feature map of the level above it to obtain the fused feature map of the current level; and taking the fused feature map of the lowest level as the final fused feature map.
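
As an illustration of the level-by-level fusion described above, the following sketch fuses three feature maps from the top (smallest) level down. The patent does not specify the fusion operation, so element-wise addition after nearest-neighbour 2x upsampling is assumed, and all names are illustrative:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(initial_maps):
    """Fuse feature maps ordered from bottom (largest scale) to top (smallest).

    The topmost (smallest) map is taken as-is; every other level adds the
    upsampled fused map from the level above to its own initial map, and the
    bottom-level (largest) result is returned as the final fused feature map.
    """
    fused = initial_maps[-1]                 # topmost level, smallest scale
    for level in reversed(initial_maps[:-1]):
        fused = level + upsample2x(fused)    # element-wise fusion (an assumption)
    return fused

# Three levels: scales 8x8, 4x4, 2x2 with one channel each.
maps = [np.ones((1, 8, 8)), np.ones((1, 4, 4)), np.ones((1, 2, 2))]
final = top_down_fuse(maps)
print(final.shape)       # (1, 8, 8)
print(final[0, 0, 0])    # 3.0: 1 + upsampled(1 + upsampled(1))
```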

In some embodiments, the first output network includes a first convolutional layer and a second convolutional layer, and the step of inputting the fused feature map into the first output network and outputting the candidate regions of the text regions in the target training image and the probability value of each candidate region includes: inputting the fused feature map into the first convolutional layer and the second convolutional layer respectively; performing a first convolution operation on the fused feature map through the first convolutional layer to output a coordinate matrix, where the coordinate matrix includes the vertex coordinates of the candidate regions of the text regions in the target training image; and performing a second convolution operation on the fused feature map through the second convolutional layer to output a probability matrix, where the probability matrix includes the probability value of each candidate region.
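
A minimal sketch of the two output heads follows. It assumes 1x1 convolutions, eight coordinate channels (x and y for four quadrilateral vertices per location), and a sigmoid mapping the probability channel into (0, 1); none of these specifics are stated in the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(feat, weights):
    """1x1 convolution over a (C, H, W) map: a matmul across the channel axis."""
    c, h, w = feat.shape
    out = weights @ feat.reshape(c, h * w)       # (C_out, H*W)
    return out.reshape(weights.shape[0], h, w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

fused = rng.normal(size=(32, 16, 16))            # fused feature map (C, H, W)

# First head: 8 channels = (x, y) for 4 vertices of a quadrilateral candidate
# region at each location; second head: 1 probability channel.
w_coord = rng.normal(size=(8, 32)) * 0.1
w_prob = rng.normal(size=(1, 32)) * 0.1

coord_matrix = conv1x1(fused, w_coord)           # vertex coordinates
prob_matrix = sigmoid(conv1x1(fused, w_prob))    # probabilities in (0, 1)
print(coord_matrix.shape, prob_matrix.shape)     # (8, 16, 16) (1, 16, 16)
```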

In some embodiments, the detection loss function includes a first function and a second function. The first function is L1 = |G* - G|, where G* is the coordinate matrix of the pre-labeled text regions in the target training image, and G is the coordinate matrix of the candidate regions of the text regions in the target training image output by the first output network. The second function is L2 = -Y*·log(Y) - (1 - Y*)·log(1 - Y), where Y* is the probability matrix of the pre-labeled text regions in the target training image, Y is the probability matrix of the candidate regions of the text regions in the target training image output by the first output network, and log denotes the logarithmic operation. The first loss value of the candidate regions and their probability values is L = L1 + L2.
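
The first loss value can be computed as sketched below, assuming the element-wise terms L1 and L2 are reduced by averaging (the reduction is not specified in the patent):

```python
import numpy as np

def detection_loss(g_star, g, y_star, y, eps=1e-12):
    """L = L1 + L2: mean absolute coordinate error plus binary cross-entropy."""
    l1 = np.abs(g_star - g).mean()
    l2 = -(y_star * np.log(y + eps)
           + (1.0 - y_star) * np.log(1.0 - y + eps)).mean()
    return l1 + l2

g_star = np.array([[0.0, 0.0, 1.0, 1.0]])   # labeled vertex coordinates
g      = np.array([[0.1, 0.0, 0.9, 1.0]])   # predicted coordinates
y_star = np.array([1.0])                    # labeled text probability
y      = np.array([0.8])                    # predicted probability

loss = detection_loss(g_star, g, y_star, y)
print(round(loss, 4))
```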

In some embodiments, the step of training the first initial model according to the first loss value until its parameters converge to obtain the text detection model includes: updating the parameters of the first initial model according to the first loss value; judging whether the updated parameters have all converged; if the updated parameters have all converged, determining the first initial model with the updated parameters as the detection model; and if not, continuing to execute the step of determining a target training image based on the preset training set until the updated parameters all converge.

In some embodiments, the step of updating the parameters of the first initial model according to the first loss value includes: determining, according to a preset rule, a parameter to be updated in the first initial model; computing the derivative ∂L/∂W of the first loss value with respect to the parameter to be updated, where L is the first loss value and W is the parameter to be updated; and updating the parameter as W ← W - α·∂L/∂W, where α is a preset coefficient.
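
The update rule W ← W - α·∂L/∂W is ordinary gradient descent; the toy example below applies it to a scalar loss L(W) = (W - 3)² with a hand-computed derivative, purely to illustrate the update step:

```python
def update(w, alpha=0.1):
    """One gradient-descent step W <- W - alpha * dL/dW for L(W) = (W - 3)^2."""
    grad = 2.0 * (w - 3.0)      # dL/dW for the toy loss
    return w - alpha * grad

w = 0.0
for _ in range(100):
    w = update(w)
print(round(w, 4))  # converges towards the minimum at W = 3
```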

In a second aspect, an embodiment of the present invention provides a text region determination method. The method includes: acquiring an image to be detected; inputting the image to be detected into a pre-trained text detection model, which outputs multiple candidate regions for the text regions in the image and a probability value for each candidate region, where the text detection model is trained through the above text detection model training method; and determining the text regions in the image to be detected from the multiple candidate regions according to the probability values of the candidate regions and the degree of overlap among them.

In some embodiments, the step of determining the text regions in the image to be detected from the multiple candidate regions according to their probability values and mutual overlap includes: arranging the multiple candidate regions in order of probability value, so that the first candidate region has the largest probability value and the last has the smallest; taking the first candidate region as the current candidate region and computing, one by one, its degree of overlap with every other candidate region; removing those other candidate regions whose degree of overlap exceeds a preset overlap threshold; taking the next remaining candidate region as the new current candidate region and repeating the overlap computation, until the last candidate region is reached; and determining the candidate regions that remain after removal as the text regions in the image to be detected.
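
The procedure above is essentially non-maximum suppression. A sketch follows, assuming axis-aligned boxes and intersection-over-union as the measure of overlap (the patent does not fix either choice):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, probs, overlap_threshold=0.5):
    """Visit candidates by descending probability; drop any candidate whose
    overlap with an already-kept candidate exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: probs[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= overlap_threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
probs = [0.9, 0.8, 0.7]
print(nms(boxes, probs))  # the second box overlaps the first and is removed
```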

In some embodiments, before the step of arranging the multiple candidate regions by probability value, the method further includes: removing, from the multiple candidate regions, those whose probability value is below a preset probability threshold, to obtain the final set of candidate regions.

In a third aspect, an embodiment of the present invention provides a text content determination method. The method includes: obtaining a text region in an image through the above text region determination method; inputting the text region into a pre-trained text recognition model, which outputs a recognition result for the text region; and determining the text content in the text region according to the recognition result.

In some embodiments, before the step of inputting the text region into the pre-trained recognition model, the method further includes: normalizing the text region to a preset size.

In some embodiments, the text recognition model is trained as follows: determining a target training text image based on a preset training set; inputting the target training text image into a second initial model, where the second initial model includes a second feature extraction network, a feature splitting network, a second output network, and a classification function; extracting a feature map of the target training text image through the second feature extraction network; splitting the feature map into at least one sub-feature map through the feature splitting network; inputting each sub-feature map into the second output network, which outputs an output matrix for each sub-feature map; inputting each output matrix into the classification function, which outputs a probability matrix for each sub-feature map; determining a second loss value for the probability matrices through a preset recognition loss function; and training the second initial model according to the second loss value until its parameters converge, yielding the text recognition model.

In some embodiments, the second feature extraction network includes multiple groups of second convolutional networks connected in sequence; each group of second convolutional networks includes a convolutional layer, a pooling layer, and an activation function layer connected in sequence.

In some embodiments, the step of splitting the feature map into at least one sub-feature map through the feature splitting network includes: splitting the feature map into at least one sub-feature map along the column direction of the feature map, where the column direction of the feature map is perpendicular to the text line direction.
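
Splitting the feature map into column-strip sub-feature maps can be sketched as below, assuming a (channels, height, width) layout in which the width axis runs along the text line:

```python
import numpy as np

def split_columns(feature_map, num_slices):
    """Split a (C, H, W) feature map into sub-maps along the width axis, so
    each sub-map is a vertical strip covering one segment of the text line."""
    return np.split(feature_map, num_slices, axis=2)

feat = np.arange(2 * 4 * 8).reshape(2, 4, 8)   # C=2, H=4, W=8
subs = split_columns(feat, 4)
print(len(subs), subs[0].shape)  # 4 sub-maps of shape (2, 4, 2)
```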

In some embodiments, the second output network includes multiple fully connected layers, the number of which corresponds to the number of sub-feature maps. The step of inputting the sub-feature maps into the second output network and outputting the output matrix for each sub-feature map includes: inputting each sub-feature map into its corresponding fully connected layer, so that each fully connected layer outputs the output matrix corresponding to its sub-feature map.

In some embodiments, the classification function includes a Softmax function: p_t^i = e^(z_t^i) / Σ_{m=1}^{K+1} e^(z_t^m), where e denotes the natural constant, t indexes the t-th probability matrix, K denotes the number of distinct characters contained in the target training text images of the training set, m ranges from 1 to K+1, Σ denotes the summation operation, z_t^i is the i-th element of the output matrix, and p_t^i is the i-th element of the probability matrix p_t.
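
The Softmax function over the K+1 entries of one output matrix (K characters plus one null character) can be sketched as:

```python
import numpy as np

def softmax(z):
    """Softmax over the K+1 entries of one output matrix."""
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z_t = np.array([1.0, 2.0, 3.0])  # toy output matrix with K + 1 = 3 entries
p_t = softmax(z_t)
print(p_t.sum())                 # the probabilities sum to 1
```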

In some embodiments, the recognition loss function includes L = -log p(y | {p_t}_{t=1…T}), where y is the pre-labeled probability matrix of the target training text image, t indexes the t-th probability matrix, p_t is the probability matrix corresponding to each sub-feature map output by the classification function, T is the total number of probability matrices, p denotes the computed probability, and log denotes the logarithmic operation.

In some embodiments, the step of training the second initial model according to the second loss value until its parameters converge to obtain the text recognition model includes: updating the parameters of the second initial model according to the second loss value; judging whether the updated parameters have all converged; if the updated parameters have all converged, determining the second initial model with the updated parameters as the text recognition model; and if not, continuing to execute the step of determining a target training text image based on the preset training set until the updated parameters all converge.

In some embodiments, the step of updating each parameter of the second initial model according to the second loss value includes: determining, according to a preset rule, a parameter to be updated in the second initial model; computing the derivative ∂L'/∂W' of the second loss value with respect to the parameter to be updated, where L' is the loss value of the probability matrices and W' is the parameter to be updated; and updating the parameter as W' ← W' - α'·∂L'/∂W', where α' is a preset coefficient.

In some embodiments, the recognition result of the text region includes multiple probability matrices corresponding to the text region, and the step of determining the text content in the text region according to the recognition result includes: determining the position of the maximum probability value in each probability matrix; obtaining, from a preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value; arranging the obtained characters according to the order of the probability matrices; and determining the text content in the text region according to the arranged characters.

In some embodiments, the step of determining the text content in the text region according to the arranged characters includes: deleting, according to a preset rule, the repeated characters and null characters among the arranged characters to obtain the text content in the text region.
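
Combining this step with the preceding argmax-and-lookup step, a greedy decoding sketch follows; the choice of index 0 for the null character and the toy alphabet are assumptions for illustration:

```python
import numpy as np

BLANK = 0                                    # index reserved for the null character

def decode(prob_matrices, alphabet):
    """Take the argmax of each probability matrix in order, then collapse
    consecutive repeats and drop the null character."""
    best = [int(np.argmax(p)) for p in prob_matrices]
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:     # skip repeats and blanks
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)

alphabet = {1: "c", 2: "a", 3: "t"}
probs = [np.array([0.1, 0.8, 0.05, 0.05]),   # c
         np.array([0.1, 0.7, 0.1, 0.1]),     # c (repeat, collapsed)
         np.array([0.8, 0.1, 0.05, 0.05]),   # null character
         np.array([0.1, 0.1, 0.7, 0.1]),     # a
         np.array([0.1, 0.1, 0.1, 0.7])]     # t
print(decode(probs, alphabet))  # cat
```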

In some embodiments, after the step of determining the text content in the text region according to the recognition result, the method further includes: if the image contains multiple text regions, obtaining the text content of each text region; and determining, through a pre-established sensitive word lexicon, whether the text content corresponding to the image contains sensitive information.

In some embodiments, the step of determining through the pre-established sensitive word lexicon whether the text content corresponding to the image contains sensitive information includes: performing a word segmentation operation on the obtained text content; matching the resulting segments one by one against the pre-established sensitive word lexicon; and if at least one segment matches successfully, determining that the text content corresponding to the image contains sensitive information.
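
A minimal sketch of the matching step follows; whitespace splitting stands in for a real word segmenter, and all names are illustrative:

```python
def contains_sensitive(text_contents, sensitive_words, segment):
    """Segment the text of each region and match segments against the lexicon.

    Returns the list of (region_index, word) pairs that matched; the image is
    flagged as containing sensitive information if at least one segment matches.
    """
    hits = []
    for region_idx, text in enumerate(text_contents):
        for token in segment(text):
            if token in sensitive_words:
                hits.append((region_idx, token))
    return hits

regions = ["buy cheap meds now", "welcome to our store"]
lexicon = {"meds", "gamble"}
matches = contains_sensitive(regions, lexicon, str.split)
print(bool(matches), matches)  # True [(0, 'meds')]
```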

In some embodiments, after determining that the text content corresponding to the image contains sensitive information, the method further includes: obtaining the text region to which the successfully matched segment belongs, and marking in the image either the obtained text region or the successfully matched segment.

In a fourth aspect, an embodiment of the present invention provides a text detection model training apparatus. The apparatus includes: a training image determination module, configured to determine a target training image based on a preset training set; a training image input module, configured to input the target training image into a first initial model, where the first initial model includes a first feature extraction network, a feature fusion network, and a first output network; a feature extraction module, configured to extract multiple initial feature maps of the target training image through the first feature extraction network, where the multiple initial feature maps differ in scale; a feature fusion module, configured to fuse the multiple initial feature maps through the feature fusion network to obtain a fused feature map; an output module, configured to input the fused feature map into the first output network, which outputs candidate regions for the text regions in the target training image and a probability value for each candidate region; and a loss value determination and training module, configured to determine a first loss value for the candidate regions and their probability values through a preset detection loss function, and to train the first initial model according to the first loss value until the parameters of the first initial model converge, yielding the text detection model.

In some embodiments, the first feature extraction network includes multiple groups of first convolutional networks connected in sequence; each group of first convolutional networks includes a convolutional layer, a batch normalization layer, and an activation function layer connected in sequence.

In some embodiments, the feature fusion module is further configured to: arrange the multiple initial feature maps in order according to their scales, where the initial feature map at the topmost level has the smallest scale and the initial feature map at the bottommost level has the largest scale; take the topmost initial feature map as the topmost fused feature map; for each level other than the topmost, fuse the initial feature map of the current level with the fused feature map of the level above it to obtain the fused feature map of the current level; and take the fused feature map of the lowest level as the final fused feature map.

In some embodiments, the first output network includes a first convolutional layer and a second convolutional layer, and the output module is further configured to: input the fused feature map into the first convolutional layer and the second convolutional layer respectively; perform a first convolution operation on the fused feature map through the first convolutional layer to output a coordinate matrix, where the coordinate matrix includes the vertex coordinates of the candidate regions of the text regions in the target training image; and perform a second convolution operation on the fused feature map through the second convolutional layer to output a probability matrix, where the probability matrix includes the probability value of each candidate region.

In some embodiments, the detection loss function includes a first function and a second function. The first function is L1 = |G* - G|, where G* is the coordinate matrix of the pre-labeled text regions in the target training image, and G is the coordinate matrix of the candidate regions of the text regions in the target training image output by the first output network. The second function is L2 = -Y*·log(Y) - (1 - Y*)·log(1 - Y), where Y* is the probability matrix of the pre-labeled text regions in the target training image, Y is the probability matrix of the candidate regions of the text regions in the target training image output by the first output network, and log denotes the logarithmic operation. The first loss value of the candidate regions and their probability values is L = L1 + L2.

In some embodiments, the loss value determination and training module is further configured to: update the parameters of the first initial model according to the first loss value; judge whether the updated parameters have all converged; if the updated parameters have all converged, determine the first initial model with the updated parameters as the detection model; and if not, continue to execute the step of determining a target training image based on the preset training set until the updated parameters all converge.

In some embodiments, the loss value determination and training module is further configured to: determine, according to a preset rule, a parameter to be updated in the first initial model; compute the derivative ∂L/∂W of the first loss value with respect to the parameter to be updated, where L is the first loss value and W is the parameter to be updated; and update the parameter as W ← W - α·∂L/∂W, where α is a preset coefficient.

In a fifth aspect, an embodiment of the present invention provides a text region determination apparatus. The apparatus includes: an image acquisition module, configured to acquire an image to be detected; a detection module, configured to input the image to be detected into a pre-trained text detection model and output multiple candidate regions of the text regions in the image to be detected, as well as the probability value of each candidate region, the text detection model being trained by the above text detection model training method; and a text region determination module, configured to determine the text regions in the image to be detected from the multiple candidate regions according to the probability values of the candidate regions and the degree of overlap between the multiple candidate regions.

In some embodiments, the text region determination module is further configured to: arrange the multiple candidate regions in sequence according to their probability values, such that the first candidate region has the largest probability value and the last candidate region has the smallest; take the first candidate region as the current candidate region, and calculate, one by one, the degree of overlap between the current candidate region and each candidate region other than the current candidate region; eliminate, from the candidate regions other than the current candidate region, those whose degree of overlap is greater than a preset overlap threshold; take the next candidate region after the current candidate region as the new current candidate region, and continue executing the step of calculating, one by one, the degree of overlap between the current candidate region and the candidate regions other than the current candidate region, until the last candidate region is reached; and determine the candidate regions remaining after elimination as the text regions in the image to be detected.
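The elimination procedure above is standard greedy non-maximum suppression. A sketch over axis-aligned boxes follows; the embodiment's candidate regions may be arbitrary quadrilaterals, so using rectangle IoU as the overlap measure here is a simplifying assumption.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, overlap_threshold=0.5):
    """Keep the highest-probability candidates, eliminating any candidate whose
    overlap with an already-kept candidate exceeds the preset threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        current = order.pop(0)
        keep.append(current)
        order = [j for j in order
                 if iou(boxes[current], boxes[j]) <= overlap_threshold]
    return keep
```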

In some embodiments, the apparatus further includes: a region elimination module, configured to eliminate, from the multiple candidate regions, the candidate regions whose probability value is lower than a preset probability threshold, to obtain the final multiple candidate regions.

In a sixth aspect, an embodiment of the present invention provides a text content determination apparatus. The apparatus includes: a region acquisition module, configured to acquire the text regions in an image through the above text region determination method; a recognition module, configured to input a text region into a pre-trained text recognition model and output a recognition result for the text region; and a text content determination module, configured to determine the text content in the text region according to the recognition result.

In some embodiments, the apparatus further includes: a normalization module, configured to normalize the text region to a preset size.

In some embodiments, the apparatus further includes a text recognition model training module, configured to train the text recognition model in the following manner: determining a target training text image based on a preset training set; inputting the target training text image into a second initial model, the second initial model including a second feature extraction network, a second output network and a classification function; extracting a feature map of the target training text image through the second feature extraction network; splitting the feature map into at least one sub-feature map through the second initial model; inputting the sub-feature maps into the second output network respectively, and outputting the output matrix corresponding to each sub-feature map; inputting the output matrix corresponding to each sub-feature map into the classification function respectively, and outputting the probability matrix corresponding to each sub-feature map; determining a second loss value of the probability matrices through a preset recognition loss function; and training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain the text recognition model.

In some embodiments, the second feature extraction network includes multiple groups of second convolutional networks connected in sequence; each group of second convolutional networks includes a convolutional layer, a pooling layer and an activation function layer connected in sequence.

In some embodiments, the recognition model training module is further configured to: split the feature map into at least one sub-feature map along the column direction of the feature map, the column direction of the feature map being perpendicular to the direction of the text line.
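Splitting along the column direction can be pictured as taking each column of an H×W feature map as one sub-feature map; a single channel is assumed here purely for simplicity.

```python
def split_columns(feature_map):
    """Split an H x W feature map into W column sub-feature maps, each of
    which spans the full height (perpendicular to the text line direction)."""
    height, width = len(feature_map), len(feature_map[0])
    return [[feature_map[r][c] for r in range(height)] for c in range(width)]
```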

In some embodiments, the second output network includes multiple fully connected layers, the number of fully connected layers corresponding to the number of sub-feature maps; the recognition model training module is further configured to: input each sub-feature map into its corresponding fully connected layer, so that each fully connected layer outputs the output matrix corresponding to that sub-feature map.

In some embodiments, the classification function includes a Softmax function. The Softmax function is p_t(i) = e^{x_t(i)} / Σ_{m=1..K+1} e^{x_t(m)}, where e denotes the natural constant; t denotes the t-th probability matrix; K denotes the number of different characters contained in the target training text images of the training set; m ranges from 1 to K+1; Σ denotes the summation operation; x_t(i) is the i-th element of the output matrix; and p_t(i) is the i-th element of the probability matrix p_t.
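A direct implementation of this Softmax over the K+1 entries of one output matrix follows; the max-subtraction is a standard numerical-stability detail not stated in the embodiment.

```python
import math

def softmax(x):
    """p_t(i) = e^{x_t(i)} / sum_{m=1..K+1} e^{x_t(m)} for one output matrix x."""
    mx = max(x)                           # subtracting the max leaves p unchanged
    exps = [math.exp(v - mx) for v in x]
    total = sum(exps)
    return [v / total for v in exps]
```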

In some embodiments, the recognition loss function includes L = -log p(y | {p_t}, t = 1…T), where y is the pre-labeled probability matrix of the target training text image; t denotes the t-th probability matrix; p_t is the probability matrix corresponding to each sub-feature map output by the classification function; T is the total number of probability matrices; p denotes computing a probability; and log denotes the logarithm operation.

In some embodiments, the recognition model training module is further configured to: update the parameters in the second initial model according to the second loss value; judge whether all of the updated parameters have converged; if all of the updated parameters have converged, determine the second initial model with the updated parameters as the text recognition model; and if not all of the updated parameters have converged, continue executing the step of determining a target training text image based on the preset training set until all of the updated parameters converge.

In some embodiments, the recognition model training module is further configured to: determine a parameter to be updated from the second initial model according to a preset rule; calculate the derivative ∂L′/∂W′ of the second loss value with respect to the parameter to be updated, where L′ is the loss value of the probability matrices and W′ is the parameter to be updated; and update the parameter to be updated to obtain the updated parameter W′ - α′·∂L′/∂W′, where α′ is a preset coefficient.

In some embodiments, the recognition result for the text region includes multiple probability matrices corresponding to the text region; the text content determination module is further configured to: determine the position of the maximum probability value in each probability matrix; obtain, from the preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value; arrange the obtained characters according to the order of the multiple probability matrices; and determine the text content in the text region according to the arranged characters.

In some embodiments, the text content determination module is further configured to: delete, according to a preset rule, the repeated characters and blank characters among the arranged characters, to obtain the text content in the text region.
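The decoding described in the two paragraphs above — taking the position of the maximum probability in each matrix, mapping positions to characters, then deleting repeats and blanks — can be sketched as follows. The character set, the convention that the last position denotes the blank, and the CTC-style collapse rule (collapse consecutive repeats, then drop blanks) are illustrative assumptions.

```python
def decode(prob_matrices, charset, blank="-"):
    """Greedy decoding of a recognition result.

    For each probability matrix, take the position of the maximum value and
    map it to a character; positions beyond the character set denote the blank.
    Consecutive repeated characters are then collapsed and blanks removed.
    """
    raw = []
    for p in prob_matrices:
        idx = max(range(len(p)), key=lambda i: p[i])
        raw.append(charset[idx] if idx < len(charset) else blank)
    collapsed = []
    for ch in raw:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(ch for ch in collapsed if ch != blank)
```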

In some embodiments, the apparatus further includes: an information acquisition module, configured to acquire the text content in each text region if the image contains multiple text regions; and a sensitive information determination module, configured to determine, through a pre-established sensitive word lexicon, whether the text content corresponding to the image contains sensitive information.

In some embodiments, the sensitive information determination module is further configured to: perform a word segmentation operation on the acquired text content; match the segmented words obtained by the word segmentation operation against the pre-established sensitive word lexicon one by one; and if at least one segmented word is matched successfully, determine that the text content corresponding to the image contains sensitive information.
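A minimal sketch of this matching step, assuming the word segmentation has already been performed upstream (e.g. by a tokenizer) and that the sensitive word lexicon is a simple set of strings:

```python
def contains_sensitive(segments, lexicon):
    """Return True if at least one segmented word matches the sensitive lexicon."""
    sensitive = set(lexicon)
    return any(word in sensitive for word in segments)

def matched_words(segments, lexicon):
    """Return the segmented words that matched, e.g. to locate their text regions."""
    sensitive = set(lexicon)
    return [word for word in segments if word in sensitive]
```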

In some embodiments, the apparatus further includes: a region identification module, configured to obtain the text region to which a successfully matched segmented word belongs, and mark the obtained text region in the image.

In a seventh aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor, and the processor executing the machine-executable instructions to implement the steps of the above text detection model training method, the above text region determination method, or the above text content determination method.

In an eighth aspect, an embodiment of the present invention provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the steps of the above text detection model training method, the above text region determination method, or the above text content determination method.

The embodiments of the present invention bring the following beneficial effects:

In the text detection model training method provided by the embodiments of the present invention, multiple initial feature maps of mutually different scales are first extracted from a target training image; the multiple initial feature maps are then fused to obtain a fused feature map; the fused feature map is further input into a first output network, which outputs the candidate regions of the text regions in the target training image and the probability value of each candidate region; and after a first loss value is determined through a preset detection loss function, the first initial model is trained according to the first loss value to obtain the detection model. In this manner, the feature extraction network can automatically extract features of different scales, so the text detection model can obtain the candidate regions of text regions of various scales in an image from a single input image, without manually transforming the image scale. The operation is convenient; in particular, in scenarios with multiple font sizes, fonts, shapes and orientations, all kinds of text in an image can be detected quickly, comprehensively and accurately, which also benefits the accuracy of subsequent text recognition and improves the text recognition effect.

Other features and advantages of the present invention will be set forth in the following description; alternatively, some features and advantages may be inferred or unambiguously determined from the description, or may be learned by practicing the above techniques of the present invention.

To make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.

Description of the Drawings

To explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a flowchart of a text detection model training method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a first feature extraction network provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of fusing multiple initial feature maps provided by an embodiment of the present invention;

Fig. 4 is a flowchart of a text region determination method provided by an embodiment of the present invention;

Fig. 5 is a flowchart of another text region determination method provided by an embodiment of the present invention;

Fig. 6 is a flowchart of a text content determination method provided by an embodiment of the present invention;

Fig. 7 is a flowchart of a text recognition model training method provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a second feature extraction network provided by an embodiment of the present invention;

Fig. 9 is a flowchart of another text content determination method provided by an embodiment of the present invention;

Fig. 10 is a flowchart of another text content determination method provided by an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of a text detection model training apparatus provided by an embodiment of the present invention;

Fig. 12 is a schematic structural diagram of a text region determination apparatus provided by an embodiment of the present invention;

Fig. 13 is a schematic structural diagram of a text content determination apparatus provided by an embodiment of the present invention;

Fig. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are a part, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

In traditional text recognition techniques, text regions that may contain text are detected from a picture according to manually set rules; the detected text regions are then segmented into characters to obtain an image block for each character, and each image block is recognized by a pre-trained classifier to obtain the final text recognition result. In this approach, because the number of manually set rules is limited, most of the detected text regions are regularly shaped regions; the application range is limited, and the approach is difficult to apply to text detection and recognition in complex scenarios, such as those with multiple font sizes, fonts, shapes and orientations and changing backgrounds. Moreover, this approach recognizes individual characters without considering the correlation between characters, resulting in poor detection and recognition performance in complex scenarios.

In addition, text recognition can also be implemented through deep learning. A recognition model first needs to be trained using a recurrent neural network; the picture to be detected is then transformed into multiple scales, which are input into the recognition model one by one to detect text regions and recognize the text. In this approach, the image scale needs to be transformed manually, and images of multiple scales are input into the recognition model separately so that the model can recognize text of different sizes; the operation is cumbersome and can hardly meet real-time recognition requirements. Furthermore, since a recurrent neural network performs recursive operations following a time sequence, it is difficult to parallelize and its computation is slow. Moreover, such a recognition model usually detects text regions with rectangular detection boxes, so it can only detect and recognize horizontal text; its recognition performance for text at arbitrary angles is poor, making it difficult to apply to text detection and recognition in complex scenarios.

In summary, the text detection and recognition approaches in the related art perform poorly in complex scenarios. On this basis, the embodiments of the present invention provide a text detection model training method, and text region and text content determination methods and apparatuses. This technique can be widely applied to text detection and text recognition in various scenarios, and in particular to text detection and text recognition in complex scenarios such as online live streaming, cable television broadcasting, games and videos.

To facilitate understanding of this embodiment, a text detection model training method disclosed in an embodiment of the present invention is first introduced in detail. The text detection model can be used for text detection, which can be understood as locating, from an image, the image regions that contain text. As shown in Fig. 1, the method includes the following steps:

Step S102, determining a target training image based on a preset training set.

The training set may contain multiple images. To broaden the applicability of the detection model, the images in the training set may cover various scenarios, for example, live-streaming scene images, game scene images, outdoor scene images and indoor scene images; the images in the training set may also contain text lines of multiple font sizes, shapes, fonts and languages, so that the trained detection model can detect all kinds of text lines.

Each image contains manually labeled text regions of text lines; a text region may be labeled with a quadrilateral box such as a rectangle, or with another polygonal box. A labeled text region can usually cover the entire text line completely while fitting the text line tightly. In addition, the multiple images in the training set may be divided into a training subset and a test subset according to a preset ratio. During training, target training images can be obtained from the training subset; after training is completed, target test images can be obtained from the test subset to test the performance of the detection model.
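The division into training and test subsets can be sketched as a simple random split; the 8:2 ratio and the fixed seed below are illustrative assumptions, not values stated in the embodiment.

```python
import random

def split_dataset(images, train_ratio=0.8, seed=0):
    """Split the annotated images into a training subset and a test subset
    according to a preset ratio."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = images[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```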

Step S104, inputting the target training image into a first initial model; the first initial model includes a first feature extraction network, a feature fusion network and a first output network.

Before being input into the first initial model, the target training image may be resized to a preset size, such as 512*512.

Step S106, extracting multiple initial feature maps of the target training image through the first feature extraction network; the multiple initial feature maps differ from one another in scale.

The first feature extraction network may be implemented with multiple convolutional layers. Typically, the convolutional layers are connected in sequence, and each convolutional layer is configured with a different convolution kernel so as to extract feature maps of different scales. Among the multiple initial feature maps of the target training image, each initial feature map may be obtained by the convolution computation of its corresponding convolutional layer. Taking four convolutional layers as an example, each convolutional layer may output one initial feature map, and each convolutional layer may be configured with a convolution kernel of a different size, so that the initial feature maps output by the layers differ in scale. In practice, the convolutional layer that receives the target training image may be set to output the initial feature map of the largest scale, with the scale of the initial feature map output by each subsequent convolutional layer decreasing gradually.
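The gradually decreasing scales can be illustrated as follows. The real network uses convolutional layers; 2×2 average pooling stands in for them here purely to show one map being produced per level with halved spatial size.

```python
def pyramid(feature_map, levels=4):
    """Produce feature maps of successively halved scale (2x2 average pooling),
    mimicking the shrinking spatial size along the convolution stack."""
    maps = [feature_map]
    for _ in range(levels - 1):
        prev = maps[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        pooled = [[(prev[2 * r][2 * c] + prev[2 * r][2 * c + 1] +
                    prev[2 * r + 1][2 * c] + prev[2 * r + 1][2 * c + 1]) / 4.0
                   for c in range(w)] for r in range(h)]
        maps.append(pooled)
    return maps
```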

Step S108, fusing the multiple initial feature maps through the feature fusion network to obtain a fused feature map.

Generally, a smaller convolution kernel can perceive high-frequency features in an image, and the initial feature map output by a convolutional network using a smaller kernel carries small-scale text line features; a larger convolution kernel can perceive low-frequency features in an image, and the initial feature map output by a convolutional layer using a larger kernel carries large-scale text line features. On this basis, the multiple initial feature maps of different scales carry text line features of various scales, and the fused feature map obtained by fusing them also carries text line features of various scales. In this way, the detection model can detect text lines of various scales without manually transforming the image scale before detection.

In practice, since the scales of the multiple initial feature maps differ, before fusion the smaller-scale initial feature maps may be interpolated so as to expand them to match the larger-scale initial feature maps. During fusion, feature points at the same position in different initial feature maps may be multiplied or added to obtain the final fused feature map.
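A sketch of this fusion step follows, using nearest-neighbour interpolation to expand the smaller map and element-wise addition as the combining operation (the embodiment also allows multiplication); single-channel maps are assumed for brevity.

```python
def upsample_nearest(fm, target_h, target_w):
    """Nearest-neighbour interpolation of a small feature map to a target size."""
    h, w = len(fm), len(fm[0])
    return [[fm[r * h // target_h][c * w // target_w]
             for c in range(target_w)] for r in range(target_h)]

def fuse(maps):
    """Element-wise addition of feature maps after resizing all of them to the
    scale of the largest (first) map."""
    th, tw = len(maps[0]), len(maps[0][0])
    resized = [upsample_nearest(m, th, tw) for m in maps]
    return [[sum(m[r][c] for m in resized) for c in range(tw)]
            for r in range(th)]
```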

Step S110, inputting the fused feature map into the first output network, and outputting the candidate regions of the text regions in the target training image and the probability value of each candidate region.

The first output network is used to extract the required features from the fused feature map to obtain output results. If the detection model produces a single output, the first output network usually contains one group of networks; if the detection model produces multiple outputs, the first output network usually contains multiple groups of networks arranged in parallel, with each group outputting one result. The first output network may be composed of convolutional layers or fully connected layers. In the above step, the first output network needs to output two results, the candidate regions and the probability values of the candidate regions; it may therefore contain two groups of networks, each of which may be a convolutional network or a fully connected network.

Step S112, determining the first loss value of the candidate regions and of the probability value of each candidate region through a preset detection loss function; and training the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain the text detection model.

Standard text regions are pre-labeled in the target training image. Based on the positions of the labeled text regions, a coordinate matrix of the text regions and a probability matrix of the text regions can be generated. The coordinate matrix contains the vertex coordinates of the standard text regions; the probability matrix contains the probability values of the text regions, which are usually 1.

The detection loss function compares the difference between the coordinate matrix of the candidate regions and that of the standard text regions, as well as the difference between the probability values of the candidate regions and those of the standard text regions; generally, the larger the difference, the larger the first loss value. Based on the first loss value, the parameters of each part of the first initial model can be adjusted to achieve training. When all parameters in the model converge, training ends and the detection model is obtained.

In the text detection model training method provided by the embodiments of the present invention, multiple initial feature maps of mutually different scales are first extracted from a target training image; the multiple initial feature maps are then fused to obtain a fused feature map; the fused feature map is further input into a first output network, which outputs the candidate regions of the text regions in the target training image and the probability value of each candidate region; and after a first loss value is determined through a preset detection loss function, the first initial model is trained according to the first loss value to obtain the detection model. In this manner, the feature extraction network can automatically extract features of different scales, so the text detection model can obtain the candidate regions of text regions of various scales in an image from a single input image, without manually transforming the image scale. The operation is convenient; in particular, in scenarios with multiple font sizes, fonts, shapes and orientations, all kinds of text in an image can be detected quickly, comprehensively and accurately, which also benefits the accuracy of subsequent text recognition and improves the text recognition effect.

本发明实施例还提供另一种文本检测模型训练方法,该方法在上述实施例所述方法的基础上实现;该方法重点描述上述训练方法中各个步骤的具体实现过程;该方法包括如下步骤:The embodiment of the present invention also provides another text detection model training method, which is implemented on the basis of the method described in the above-mentioned embodiment; the method focuses on describing the specific implementation process of each step in the above-mentioned training method; the method includes the following steps:

步骤202,基于预设的训练集合确定目标训练图像。Step 202: Determine a target training image based on a preset training set.

步骤204,将目标训练图像输入至第一初始模型;该第一初始模型包括第一特征提取网络、特征融合网络和第一输出网络。Step 204 , input the target training image into a first initial model; the first initial model includes a first feature extraction network, a feature fusion network and a first output network.

步骤206,通过第一特征提取网络提取目标训练图像的多个初始特征图;多个初始特征图之间的尺度不同。Step 206: Extract multiple initial feature maps of the target training image through the first feature extraction network; the scales of the multiple initial feature maps are different.

在实际实现时，为了提高第一特征提取网络的性能，该第一特征提取网络可以包括依次连接的多组第一卷积网络；每组第一卷积网络包括依次连接的卷积层、批归一化层和激活函数层。图2示出了一种第一特征提取网络的结构示意图；图2中以四组第一卷积网络为例进行说明，后一组第一卷积网络的卷积层连接前一组第一卷积网络的激活函数层。另外，第一特征提取网络中还可以包含更多组或更少组的第一卷积网络。In actual implementation, in order to improve the performance of the first feature extraction network, the first feature extraction network may include multiple groups of first convolutional networks connected in sequence; each group of first convolutional networks includes a convolutional layer, a batch normalization layer, and an activation function layer connected in sequence. Fig. 2 shows a schematic structural diagram of a first feature extraction network; in Fig. 2, four groups of first convolutional networks are taken as an example, where the convolutional layer of each subsequent group is connected to the activation function layer of the preceding group. In addition, the first feature extraction network may also contain more or fewer groups of first convolutional networks.

第一卷积网络中的批归一化层用于对卷积层输出的特征图进行归一化处理，该过程可以加快第一特征提取网络以及检测模型的收敛速度，并且可以缓解在多层卷积网络中梯度弥散的问题，使得第一特征提取网络更加稳定。第一卷积网络中的激活函数层可以对归一化处理后的特征图进行函数变换，该变换过程打破卷积层输入的线性组合，可以提高第一卷积网络的特征表达能力。该激活函数层具体可以为Sigmoid函数、tanh函数、Relu函数等。The batch normalization layer in the first convolutional network is used to normalize the feature map output by the convolutional layer. This process can speed up the convergence of the first feature extraction network and the detection model, and can alleviate the problem of gradient vanishing in multi-layer convolutional networks, making the first feature extraction network more stable. The activation function layer in the first convolutional network applies a functional transformation to the normalized feature map; this transformation breaks the linear combination of the convolutional layer's input and improves the feature expression ability of the first convolutional network. The activation function may specifically be a Sigmoid function, a tanh function, a ReLU function, or the like.
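As a minimal numeric illustration of the normalization and activation described above (plain Python over a flat list of activations; the learnable scale and shift parameters of a real batch-normalization layer, and its batch dimension, are omitted for brevity — this is a sketch, not the patent's implementation):

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize a list of activations to zero mean and unit variance.
    eps guards against division by zero, as in standard batch normalization."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def relu(xs):
    """ReLU activation: clamp negative values to zero,
    breaking the linearity of the preceding convolution."""
    return [max(0.0, x) for x in xs]

out = relu(batch_norm([1.0, 2.0, 3.0, 4.0]))
```

After normalization the activations sum to (approximately) zero, and the ReLU then passes only the non-negative half through.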

步骤208,通过上述特征融合网络对多个所述初始特征图进行融合处理,得到融合特征图。Step 208: Perform fusion processing on a plurality of the initial feature maps through the feature fusion network to obtain a fusion feature map.

下述步骤02-08提供一种步骤208的具体的实现方式,该方式中,以金字塔特征为例进行说明,即各个卷积层输出的初始特征图的尺度依次减小:The following steps 02-08 provide a specific implementation of step 208. In this method, the pyramid feature is used as an example to illustrate, that is, the scale of the initial feature map output by each convolutional layer is sequentially reduced:

步骤02,根据初始特征图的尺度,将多个初始特征图依次排列;其中,最顶层级的初始特征图的尺度最小;最底层级的初始特征图的尺度最大;Step 02, according to the scale of the initial feature map, arrange the multiple initial feature maps in sequence; wherein, the scale of the initial feature map of the topmost level is the smallest; the scale of the initial feature map of the bottommost level is the largest;

步骤04,将最顶层级的初始特征图确定为最顶层级的融合特征图;Step 04, the initial feature map of the topmost level is determined as the fusion feature map of the topmost level;

步骤06,除最顶层级以外,将当前层级的初始特征图和当前层级的上一层级的融合特征图进行融合,得到当前层级的融合特征图;Step 06, except for the topmost level, the initial feature map of the current level and the fusion feature map of the previous level of the current level are fused to obtain the fusion feature map of the current level;

由于当前层级的上一层级的融合特征图的尺度小于当前层级的初始特征图，二者在进行融合之前，可以通过插值运算，将当前层级的上一层级的融合特征图的尺度扩展至与当前层级的初始特征图的尺度相同，进而再进行逐点相加或逐点相乘的融合处理，得到当前层级的融合特征图。Since the scale of the fused feature map of the level above the current level is smaller than that of the current level's initial feature map, before the two are fused, the fused feature map of the level above can be expanded by interpolation to the same scale as the current level's initial feature map; the two are then fused by pointwise addition or pointwise multiplication to obtain the fused feature map of the current level.

步骤08,将最低层级的融合特征图确定为最终的融合特征图。Step 08: Determine the fusion feature map of the lowest level as the final fusion feature map.

图3示出了一种对多个初始特征图进行融合处理的示意图；目标训练图像经第一特征提取网络进行卷积处理后得到四层初始特征图；最顶层级的初始特征图作为最顶层级的融合特征图；最顶层级的融合特征图与第二层级的初始特征图进行融合，得到第二层级的融合特征图；第二层级的融合特征图与第三层级的初始特征图进行融合，得到第三层级的融合特征图；第三层级的融合特征图与第四层级的初始特征图进行融合，得到第四层级的融合特征图；该第四层级的融合特征图即最终的融合特征图。Fig. 3 shows a schematic diagram of fusing multiple initial feature maps: the target training image is convolved by the first feature extraction network to obtain four levels of initial feature maps; the top-level initial feature map serves as the top-level fused feature map; the top-level fused feature map is fused with the second-level initial feature map to obtain the second-level fused feature map; the second-level fused feature map is fused with the third-level initial feature map to obtain the third-level fused feature map; the third-level fused feature map is fused with the fourth-level initial feature map to obtain the fourth-level fused feature map, which is the final fused feature map.

步骤210,将融合特征图输入至第一输出网络,输出目标训练图像中文本区域的候选区域以及每个候选区域的概率值。Step 210: Input the fusion feature map to the first output network, and output the candidate regions of the text region in the target training image and the probability value of each candidate region.

以卷积网络为例，上述第一输出网络包括第一卷积层和第二卷积层；其中，第一卷积层和第二卷积层并列设置，第一卷积层和第二卷积层分别用于输出候选区域的顶点坐标和候选区域的概率值，上述步骤210还可以通过下述步骤12-16实现：Taking a convolutional network as an example, the above first output network includes a first convolutional layer and a second convolutional layer; the first convolutional layer and the second convolutional layer are arranged in parallel, and are used to output the vertex coordinates of the candidate regions and the probability values of the candidate regions, respectively. The above step 210 may also be implemented through the following steps 12-16:

步骤12,将融合特征图分别输入至第一卷积层和第二卷积层;Step 12, input the fusion feature map to the first convolutional layer and the second convolutional layer respectively;

步骤14,通过第一卷积层对融合特征图进行第一卷积运算,输出坐标矩阵;该坐标矩阵包括目标训练图像中文本区域的候选区域的顶点坐标;Step 14, performing the first convolution operation on the fusion feature map through the first convolution layer, and outputting a coordinate matrix; the coordinate matrix includes the vertex coordinates of the candidate region of the text region in the target training image;

例如，该坐标矩阵可以表示为n*H*W，其中H和W分别为坐标矩阵的高度和宽度，n为坐标矩阵的维度；例如，当候选区域为四边形时，一个候选区域需要通过四个顶点坐标确定，因而n为8；当候选区域为其他多边形时，则n的数值通常为候选区域边数的两倍。For example, the coordinate matrix may be expressed as n*H*W, where H and W are the height and width of the coordinate matrix, respectively, and n is the dimension of the coordinate matrix. For example, when a candidate region is a quadrilateral, it is determined by four vertex coordinates, so n is 8; when the candidate region is another polygon, the value of n is usually twice the number of sides of the candidate region.

步骤16,通过第二卷积层对融合特征图进行第二卷积运算,输出概率矩阵;该概率矩阵包括每个候选区域的概率值。In step 16, a second convolution operation is performed on the fusion feature map through the second convolution layer, and a probability matrix is output; the probability matrix includes the probability value of each candidate region.

每个候选区域的概率值也可以称为每个候选区域的得分,概率值可以用于表征候选区域能够完整包含有文本行的概率。The probability value of each candidate region can also be referred to as the score of each candidate region, and the probability value can be used to represent the probability that the candidate region can completely contain text lines.

步骤212，通过预设的检测损失函数确定上述候选区域以及每个候选区域的概率值的第一损失值；根据该第一损失值对第一初始模型进行训练，直至第一初始模型中的参数收敛，得到文本检测模型。Step 212: Determine, through the preset detection loss function, the first loss value of the above candidate regions and of the probability value of each candidate region; train the first initial model according to the first loss value until the parameters in the first initial model converge, obtaining the text detection model.

在实际实现时，上述检测损失函数包括第一函数和第二函数，分别用于计算候选区域的顶点坐标以及每个候选区域的概率值的损失值；其中，第一函数为L1=|G*-G|；其中，G*为预先标注的目标训练图像中文本区域的坐标矩阵；G为第一输出网络输出的目标训练图像中文本区域的候选区域的坐标矩阵；第二函数为L2=-Y*·logY-(1-Y*)·log(1-Y)；其中，Y*为预先标注的目标训练图像中文本区域的概率矩阵；Y为第一输出网络输出的目标训练图像中文本区域的候选区域的概率矩阵；log表示对数运算。上述候选区域的顶点坐标以及每个候选区域的概率值的第一损失值为上述第一函数和第二函数之和，即L=L1+L2。In actual implementation, the above detection loss function includes a first function and a second function, used respectively to compute the loss on the vertex coordinates of the candidate regions and the loss on the probability value of each candidate region. The first function is L1 = |G* − G|, where G* is the coordinate matrix of the text regions pre-annotated in the target training image, and G is the coordinate matrix of the candidate regions of the text regions output by the first output network. The second function is L2 = −Y*·log Y − (1 − Y*)·log(1 − Y), where Y* is the probability matrix of the text regions pre-annotated in the target training image, Y is the probability matrix of the candidate regions output by the first output network, and log denotes the logarithm. The first loss value of the vertex coordinates of the candidate regions and the probability value of each candidate region is the sum of the first function and the second function, i.e., L = L1 + L2.
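The loss L = L1 + L2 described above can be sketched numerically. A minimal illustration in plain Python, assuming the coordinate matrices G*, G and probability matrices Y*, Y have been flattened into lists and taking the elementwise sum as the matrix reduction (the patent states only the matrix forms, so this scalar reduction is one straightforward reading):

```python
import math

def detection_loss(g_star, g, y_star, y):
    """First loss value L = L1 + L2.
    L1 = |G* - G|: absolute coordinate differences, summed elementwise here.
    L2 = -Y* log(Y) - (1 - Y*) log(1 - Y): cross-entropy on the score maps."""
    l1 = sum(abs(a - b) for a, b in zip(g_star, g))
    l2 = sum(-a * math.log(b) - (1 - a) * math.log(1 - b)
             for a, b in zip(y_star, y))
    return l1 + l2
```

With perfect coordinates (L1 = 0) and a predicted score of 0.5 on a positive pixel (Y* = 1), the loss reduces to -log 0.5 = log 2, matching the cross-entropy term.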

基于上述对第一损失值的描述,上述步骤中,根据该第一损失值对第一初始模型进行训练的过程,还可以通过下述步骤22-28实现:Based on the above description of the first loss value, in the above steps, the process of training the first initial model according to the first loss value can also be implemented through the following steps 22-28:

步骤22,根据第一损失值更新第一初始模型中的参数;Step 22, update the parameters in the first initial model according to the first loss value;

在实际实现时,可以预先设置函数映射关系,将原始参数和第一损失值输入至该函数映射关系中,即可计算得到更新的参数。不同参数的函数映射关系可以相同,也可以不同。In actual implementation, the function mapping relationship can be preset, and the original parameters and the first loss value are input into the function mapping relationship, and then the updated parameters can be calculated. The function mapping relationship of different parameters can be the same or different.

具体而言，可以首先按照预设规则，确定待更新参数；该待更新参数可以为第一初始模型中的所有参数，也可以是随机从第一初始模型中确定的部分参数；再计算第一损失值对第一初始模型中待更新参数的导数∂L/∂W；其中，L为第一损失值；W为待更新参数；该待更新参数也可以称为各神经元的权值。该过程也可以称为反向传播算法；如果第一损失值较大，则说明当前的第一初始模型的输出与期望输出结果不符，则求出上述第一损失值对第一初始模型中待更新参数的导数，该导数可以作为调整待更新参数的依据。Specifically, the parameters to be updated can first be determined according to a preset rule; the parameters to be updated may be all parameters in the first initial model, or some parameters randomly selected from the first initial model. The derivative ∂L/∂W of the first loss value with respect to each parameter to be updated is then computed, where L is the first loss value and W is the parameter to be updated; the parameters to be updated may also be called the weights of the neurons. This process may also be called the back-propagation algorithm: if the first loss value is large, the output of the current first initial model does not match the expected output, and the derivative of the first loss value with respect to each parameter to be updated is computed as the basis for adjusting that parameter.

得到各个待更新参数的导数后，再更新各个待更新参数，得到更新后的待更新参数W′=W−α·∂L/∂W；其中，α为预设系数。该过程也可以称为随机梯度下降算法；各个待更新参数的导数也可以理解为相对于当前参数，第一损失值下降最快的方向，通过该方向调整参数，可以使第一损失值快速降低，使该参数收敛。另外，当第一初始模型经一次训练后，得到一个第一损失值，此时可以从第一初始模型中各个参数中随机选择一个或多个参数进行上述的更新过程，该方式的模型训练时间较短，算法较快；当然也可以对第一初始模型中所有参数进行上述的更新过程，该方式的模型训练更加准确。After the derivative of each parameter to be updated is obtained, each parameter is updated to obtain the updated parameter W′ = W − α·∂L/∂W, where α is a preset coefficient. This process may also be called the stochastic gradient descent algorithm; the derivative of each parameter can be understood as the direction in which the first loss value decreases fastest relative to the current parameter, so adjusting the parameter along this direction reduces the first loss value quickly and makes the parameter converge. In addition, after one training pass of the first initial model, a first loss value is obtained; at this point one or more parameters may be randomly selected from the parameters of the first initial model for the above update process, which shortens training time and speeds up the algorithm; of course, the above update process may also be applied to all parameters of the first initial model, which makes model training more accurate.
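The update rule W′ = W − α·∂L/∂W can be sketched in a few lines. A minimal illustration in plain Python, with a toy loss L(w) = w² (gradient 2w) standing in for the model's actual loss — the function name and the toy loss are illustrative, not from the patent:

```python
def sgd_update(params, grads, alpha):
    """One stochastic-gradient-descent step: W' = W - alpha * dL/dW,
    applied to each parameter selected for update."""
    return [w - alpha * g for w, g in zip(params, grads)]

# toy check: minimizing L(w) = w^2 (gradient 2w) drives w toward the minimum at 0
w = [4.0]
for _ in range(100):
    w = sgd_update(w, [2.0 * w[0]], alpha=0.1)
```

Each step moves the parameter against the gradient, i.e., along the direction in which the loss decreases fastest, which is exactly the convergence behavior described above.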

步骤24,判断更新后的各个参数是否均收敛;如果更新后的各个参数均收敛,执行步骤26;如果更新后的各个参数没有均收敛,执行步骤28;Step 24, determine whether the updated parameters are all converged; if the updated parameters are converged, go to step 26; if the updated parameters are not converged, go to step 28;

步骤26,将参数更新后的第一初始模型确定为检测模型;结束。Step 26: Determine the first initial model after parameter update as the detection model; end.

步骤28,继续执行基于预设的训练集合确定目标训练图像的步骤,直至更新后的各个参数均收敛。Step 28: Continue to perform the step of determining the target training image based on the preset training set until all the updated parameters converge.

具体地,可以从训练集合中重新获取新的图像作为目标训练图像,也可以继续将当前的目标训练图像作为目标训练图像进行训练。Specifically, a new image can be re-acquired from the training set as the target training image, or the current target training image can be continued to be used as the target training image for training.

上述方式中，特征提取网络可以自动提取不同尺度的特征图，进而再将不同尺度的特征图进行融合处理，基于得到的融合特征图获取图像中各种尺度的文本区域的候选区域。该检测模型，只需要输入一张图像即可得到该图像中各种尺度的文本区域的候选区域，无需再人工变换图像尺度，操作便捷，尤其在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于后续文本识别的准确性，提高了文本识别的效果。In the above approach, the feature extraction network automatically extracts feature maps of different scales, the feature maps of different scales are then fused, and candidate regions of text regions of various scales in the image are obtained from the resulting fused feature map. With this detection model, candidate regions of text regions of various scales can be obtained from a single input image, without manually transforming the image scale. The operation is convenient, and especially in scenarios with multiple font sizes, fonts, shapes, and orientations, various kinds of text in the image can be detected quickly, comprehensively, and accurately, which also benefits the accuracy of subsequent text recognition and improves the text recognition effect.

基于上述实施例提供的文本检测模型训练方法,本发明实施例还提供一种文本区域确定方法,该方法在上述实施例所述的文本检测模型训练方法的基础上实现;如图4所示,该方法包括如下步骤:Based on the text detection model training method provided in the foregoing embodiment, the embodiment of the present invention further provides a text region determination method, which is implemented on the basis of the text detection model training method described in the foregoing embodiment; as shown in FIG. 4 , The method includes the following steps:

步骤S402,获取待检测图像;该待检测图像可以是图片,也可以是从视频文件或直播视频中截取的视频帧等。Step S402, acquiring an image to be detected; the image to be detected may be a picture, or a video frame intercepted from a video file or a live video, or the like.

步骤S404，将待检测图像输入至预先训练完成的文本检测模型，输出待检测图像中文本区域的多个候选区域，以及每个候选区域的概率值；该文本检测模型通过上述文本检测模型的训练方法训练得到；Step S404: Input the image to be detected into the pre-trained text detection model, and output multiple candidate regions of the text regions in the image to be detected as well as the probability value of each candidate region; the text detection model is trained through the above text detection model training method.

步骤S406,根据候选区域的概率值以及多个候选区域之间的重叠程度,从多个候选区域中确定待检测图像中的文本区域。Step S406, according to the probability value of the candidate area and the degree of overlap between the multiple candidate areas, determine the text area in the image to be detected from the multiple candidate areas.

上述文本检测模型输出的候选区域中，可能有多个候选区域均对应同一个文本行；为了从多个候选区域中找出与文本行最匹配的区域，需要对多个候选区域进行筛选。大多情况下，相互重叠程度较高的多个候选区域，通常对应同一个文本行，进而再根据相互重叠程度较高的多个候选区域的概率值，即可从中确定该文本行对应的文本区域；例如，将相互重叠程度较高的多个候选区域中，概率值最大的候选区域确定为文本区域。如果图像中存在多个文本行，则通常最终确定出多个文本区域。Among the candidate regions output by the above text detection model, multiple candidate regions may correspond to the same text line; to find the region that best matches a text line, the candidate regions need to be screened. In most cases, multiple candidate regions that overlap each other to a high degree correspond to the same text line, and the text region corresponding to that text line can then be determined from the probability values of those overlapping candidate regions; for example, among the highly overlapping candidate regions, the one with the largest probability value is determined as the text region. If there are multiple text lines in the image, multiple text regions are usually determined in the end.

本发明实施例提供的上述文本区域确定方法，将获取到的待检测图像输入至文本检测模型，输出待检测图像中文本区域的多个候选区域以及每个候选区域的概率值；进而根据候选区域的概率值以及多个候选区域之间的重叠程度，从多个候选区域中确定待检测图像中的文本区域。该方式中，文本检测模型可以自动提取不同尺度的特征，因而只需要输入一张图像至该模型即可得到该图像中各种尺度的文本区域的候选区域，无需再人工变换图像尺度，操作便捷，尤其在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于后续文本识别的准确性，提高了文本识别的效果。In the above text region determination method provided by the embodiments of the present invention, the acquired image to be detected is input into the text detection model, which outputs multiple candidate regions of the text regions in the image and the probability value of each candidate region; the text region in the image to be detected is then determined from the multiple candidate regions according to their probability values and the degree of overlap between them. In this method, the text detection model automatically extracts features of different scales, so candidate regions of text regions of various scales can be obtained from a single input image, without manually transforming the image scale. The operation is convenient, and especially in scenarios with multiple font sizes, fonts, shapes, and orientations, various kinds of text in the image can be detected quickly, comprehensively, and accurately, which also benefits the accuracy of subsequent text recognition and improves the text recognition effect.

本发明实施例还提供另一种文本区域确定方法，该方法在上述实施例所述方法的基础上实现；该方法重点描述根据检测网络输出的候选区域的顶点坐标以及候选区域的概率值确定待检测图像中的文本区域的具体过程；如图5所示，该方法包括如下步骤：The embodiments of the present invention also provide another text region determination method, implemented on the basis of the method described in the above embodiment; this method focuses on the specific process of determining the text region in the image to be detected according to the vertex coordinates of the candidate regions output by the detection network and the probability values of the candidate regions; as shown in Fig. 5, the method includes the following steps:

步骤S502,获取待检测图像。Step S502, acquiring an image to be detected.

步骤S504,将待检测图像输入至预先训练完成的文本检测模型,输出待检测图像中文本区域的多个候选区域,以及每个候选区域的概率值;Step S504, input the image to be detected into the pre-trained text detection model, output multiple candidate regions of the text region in the image to be detected, and the probability value of each candidate region;

步骤S506,将多个候选区域中,概率值低于预设的概率阈值的候选区域剔除,得到最终的多个候选区域。Step S506 , among the plurality of candidate regions, candidate regions whose probability value is lower than a preset probability threshold are eliminated to obtain a plurality of final candidate regions.

该步骤S506为可选步骤，即下述步骤S508中，可以对检测模型输出的每个候选区域进行排列，也可以先将检测模型输出的候选区域中概率值低于预设的概率阈值的候选区域剔除，再对剩余的候选区域进行排列。上述预设的概率阈值可以预先设置，如0.2、0.1等；通过剔除概率值低于预设的概率阈值的候选区域，有利于降低后续确定待检测图像中的文本区域的运算量，提高运算速度。Step S506 is optional: in the following step S508, every candidate region output by the detection model may be sorted, or the candidate regions whose probability value is lower than a preset probability threshold may first be removed and only the remaining candidate regions sorted. The preset probability threshold can be set in advance, e.g., 0.2 or 0.1; removing candidate regions whose probability value is below the threshold helps reduce the amount of computation in the subsequent determination of the text region in the image to be detected and improves the computation speed.

步骤S508,根据候选区域的概率值,将多个候选区域依次排列;其中,第一个候选区域的概率值最大,最后一个候选区域的概率值最小;Step S508, according to the probability value of the candidate region, arrange the multiple candidate regions in sequence; wherein, the probability value of the first candidate region is the largest, and the probability value of the last candidate region is the smallest;

步骤S510,将第一个候选区域作为当前候选区域,逐一计算当前候选区域与除当前候选区域以外的候选区域的重叠程度;Step S510, taking the first candidate region as the current candidate region, and calculating the degree of overlap between the current candidate region and the candidate regions other than the current candidate region one by one;

除当前候选区域以外的候选区域也可以简称为其他候选区域，在计算当前候选区域与每个其他候选区域的重叠程度时，具体可以计算两个候选区域的交并比，该交并比等于两个候选区域交集的区域大小与两个候选区域并集的区域大小之比。可以理解，交并比越大，两个候选区域的重叠程度越大。对于当前候选区域而言，与该当前候选区域重叠程度较大的其他候选区域通常与该当前候选区域表征同一个文本行，又由于其他候选区域的概率值小于当前候选区域，因此可以将该其他候选区域剔除，以通过当前候选区域表征该文本行。Candidate regions other than the current candidate region may be referred to as other candidate regions for short. When computing the degree of overlap between the current candidate region and each other candidate region, the intersection-over-union of the two candidate regions may be computed, which equals the ratio of the area of the intersection of the two candidate regions to the area of their union. Understandably, the larger the intersection-over-union, the greater the overlap between the two candidate regions. For the current candidate region, other candidate regions that overlap it to a high degree usually represent the same text line as the current candidate region, and since their probability values are smaller than that of the current candidate region, they can be removed so that the text line is represented by the current candidate region.

步骤S512,将除当前候选区域以外的候选区域中,重叠程度大于预设的重叠阈值的候选区域剔除;该重叠阈值可以预先设置,如0.5、0.6等。Step S512: Eliminate candidate regions whose overlap degree is greater than a preset overlap threshold in candidate regions other than the current candidate region; the overlap threshold may be preset, such as 0.5, 0.6, and the like.

步骤S514,将当前候选区域的下一个候选区域作为新的当前候选区域,继续执行逐一计算当前候选区域与除当前候选区域以外的候选区域的重叠程度的步骤,直至到达最后一个候选区域。Step S514, taking the next candidate region of the current candidate region as the new current candidate region, and continuing to perform the step of calculating the overlap degree between the current candidate region and the candidate regions other than the current candidate region one by one, until the last candidate region is reached.

上述步骤S510-S514中包含有循环过程,每轮循环中都会剔除部分候选区域,当遍历至最后一个候选区域,循环结束,将最终剩余的候选区域确定为待检测图像中的文本区域。如果最终剩余的候选区域为多个,则可以确定待检测图像中的文本区域为多个。The above steps S510-S514 include a loop process, and some candidate regions will be eliminated in each round of loops. When the last candidate region is traversed, the loop ends, and the final remaining candidate regions are determined as the text regions in the image to be detected. If there are multiple candidate regions remaining in the end, it can be determined that there are multiple text regions in the image to be detected.

步骤S516,将剔除后的剩余的候选区域确定为待检测图像中的文本区域。In step S516, the remaining candidate regions after the elimination are determined as the text regions in the image to be detected.
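Steps S506-S516 amount to score-threshold filtering followed by non-maximum suppression over the candidate regions. A minimal sketch in plain Python, using axis-aligned boxes (x1, y1, x2, y2) instead of the general quadrilaterals described above, and with hypothetical threshold values in line with the examples given (0.2 for the probability threshold, 0.5 for the overlap threshold):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2):
    the ratio of the intersection area to the union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / float(union)

def nms(boxes, scores, score_thr=0.2, iou_thr=0.5):
    """Drop candidates below the probability threshold (step S506), sort the
    rest by score descending (step S508), then repeatedly keep the current
    highest-scoring box and remove others overlapping it beyond the overlap
    threshold (steps S510-S514). The survivors are the text regions (S516)."""
    cand = sorted(((s, b) for s, b in zip(scores, boxes) if s >= score_thr),
                  reverse=True)
    kept = []
    while cand:
        score, box = cand.pop(0)
        kept.append(box)
        cand = [(s, b) for s, b in cand if iou(box, b) <= iou_thr]
    return kept
```

For example, two boxes covering nearly the same text line are collapsed into the higher-scoring one, while a distant box survives untouched.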

上述方式中，通过文本检测模型可以得到多个候选区域以及每个候选区域的概率值，进而再通过非极大抑制的方式从多个候选区域中确定文本区域。该方式中，文本检测模型可以自动提取不同尺度的特征，因而只需要输入一张图像至该模型即可得到该图像中各种尺度的文本区域的候选区域，无需再人工变换图像尺度，操作便捷，尤其在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于后续文本识别的准确性，提高了文本识别的效果。In the above approach, multiple candidate regions and the probability value of each candidate region are obtained through the text detection model, and the text region is then determined from the multiple candidate regions by non-maximum suppression. In this method, the text detection model automatically extracts features of different scales, so candidate regions of text regions of various scales can be obtained from a single input image, without manually transforming the image scale. The operation is convenient, and especially in scenarios with multiple font sizes, fonts, shapes, and orientations, various kinds of text in the image can be detected quickly, comprehensively, and accurately, which also benefits the accuracy of subsequent text recognition and improves the text recognition effect.

基于上述实施例提供的文本区域确定方法,本发明实施例还提供一种文本内容确定方法,该方法在上述实施例所述的文本区域确定方法的基础上实现;如图6所示,该方法包括如下步骤:Based on the text area determination method provided by the foregoing embodiment, the embodiment of the present invention further provides a text content determination method, which is implemented on the basis of the text area determination method described in the foregoing embodiment; as shown in FIG. 6 , this method It includes the following steps:

步骤S602,通过上述文本区域确定方法,获取图像中的文本区域;Step S602, obtaining the text area in the image by the above-mentioned text area determination method;

步骤S604,将文本区域输入至预先训练完成的文本识别模型,输出文本区域的识别结果;Step S604, input the text area into the pre-trained text recognition model, and output the recognition result of the text area;

步骤S606,根据识别结果确定文本区域中的文本内容。Step S606: Determine the text content in the text area according to the recognition result.

上述文本识别模型可以通过多种方式训练得到，如循环神经网络、卷积神经网络，当然也可以通过光学字符识别的方式得到文本区域的识别结果。可以将文本识别模型输出的识别结果确定为文本区域中的文本内容，也可以先对文本识别模型输出的识别结果进行优化处理，如删除重复字符和空字符等，进而将处理后的识别结果确定为文本区域中的文本内容。The above text recognition model can be trained in various ways, such as with a recurrent neural network or a convolutional neural network; the recognition result of the text region can of course also be obtained by optical character recognition. The recognition result output by the text recognition model can be determined directly as the text content of the text region, or the recognition result can first be post-processed, for example by removing repeated characters and blank characters, and the processed recognition result is then determined as the text content of the text region.

本发明实施例提供的文本内容确定方法，首先通过上述文本区域确定方法获取图像中的文本区域；再将该文本区域输入至预先训练完成的文本识别模型，输出文本区域的识别结果；最后根据该识别结果确定文本区域中的文本信息。该方式中，由于上述文本区域确定方法可以通过文本检测模型获取到各种尺度的文本区域，在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于文本识别的准确性，提高了文本识别的效果。In the text content determination method provided by the embodiments of the present invention, the text region in the image is first obtained through the above text region determination method; the text region is then input into the pre-trained text recognition model, which outputs the recognition result of the text region; finally, the text content in the text region is determined according to the recognition result. In this method, since the above text region determination method can obtain text regions of various scales through the text detection model, various kinds of text in the image can be detected quickly, comprehensively, and accurately in scenarios with multiple font sizes, fonts, shapes, and orientations, which also benefits the accuracy of text recognition and improves the text recognition effect.

本发明实施例还提供另一种文本内容确定方法，该方法在上述实施例所述方法的基础上实现；该方法重点描述文本识别模型的训练方法；该文本识别模型可以用于文本识别，该文本识别可以理解为：对图片中文本区域进行检测，从而定位出包含有文本的图片区域，进而在该图片区域中识别出文本的具体内容。如图7所示，该文本识别模型通过下述方式训练完成：The embodiments of the present invention also provide another text content determination method, implemented on the basis of the methods described in the above embodiments; this method focuses on the training method of the text recognition model. The text recognition model can be used for text recognition, which can be understood as: detecting the text regions in a picture to locate the picture regions containing text, and then recognizing the specific text content in those picture regions. As shown in Fig. 7, the text recognition model is trained in the following manner:

步骤S702,基于预设的训练集合确定目标训练文本图像;Step S702, determining a target training text image based on a preset training set;

该目标训练文本图像可以为单独的图像，也可以为标注在图像上的图像区域。该训练集合中可以包含有多张图像，为了提高文本识别模型的应用广泛性，训练集合中的图像可以包含各种场景下的图像，例如，直播场景图像、游戏场景图像、户外场景图像、室内场景图像等；训练集合中的图像也可以包含多种字号、形状、字体、语言的文本行，以使训练出的文本识别模型能够检测各类文本行。每张目标训练文本图像对应有由人工标注的文本行的文本内容，如“你好”“真棒”等。每张目标训练文本图像对应一个标注的文本内容。The target training text image may be a separate image or an image region annotated on an image. The training set may contain multiple images; to improve the general applicability of the text recognition model, the images in the training set may cover various scenarios, for example live-streaming scenes, game scenes, outdoor scenes, and indoor scenes. The images in the training set may also contain text lines of various font sizes, shapes, fonts, and languages, so that the trained text recognition model can detect various kinds of text lines. Each target training text image corresponds to the manually annotated text content of a text line, such as "你好" ("Hello") or "真棒" ("Awesome"). Each target training text image corresponds to one piece of annotated text content.

在标注完成后，还可以通过训练集合中所有图像对应的所有文本行的文本内容，建立字符库；具体而言，获取到训练集合中所有图像对应的所有文本行的文本内容，从中提取不同的字符，将彼此不同的字符组成字符库。另外，还可以将上述训练集合中的多张图像按照预设比例划分为训练子集和测试子集。在训练过程中，可以从训练子集中获取目标训练图像。训练完成后，可以从测试子集中获取目标测试图像，用于测试文本识别模型的性能。After the annotation is completed, a character library can also be built from the text content of all text lines corresponding to all images in the training set; specifically, the text content of all text lines corresponding to all images in the training set is obtained, the distinct characters are extracted from it, and those mutually distinct characters form the character library. In addition, the multiple images in the training set may be divided into a training subset and a test subset according to a preset ratio. During training, the target training image can be obtained from the training subset. After training, a target test image can be obtained from the test subset to test the performance of the text recognition model.
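The character-library construction described above can be sketched in a few lines of plain Python. Assigning each distinct character a stable integer index is an added convention (not stated by the patent) that the later classification step commonly needs:

```python
def build_char_library(labels):
    """Collect the distinct characters across all annotated text lines
    and map each one to an integer index (sorted for a stable ordering)."""
    chars = sorted(set("".join(labels)))
    return {c: i for i, c in enumerate(chars)}

# annotated text contents of the training text lines (toy examples)
lib = build_char_library(["你好", "真棒", "好棒"])
```

Characters that appear in several text lines contribute only one entry, so the library contains exactly the set of distinct characters in the training annotations.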

Step S704: input the target training text image into a second initial model; the second initial model includes a second feature extraction network, a feature splitting network, a second output network, and a classification function.

Step S706: extract a feature map of the target training text image through the second feature extraction network.

The second feature extraction network may be implemented with multiple convolutional layers connected in sequence. Each convolutional layer performs a convolution over its input with its configured convolution kernels, and the data output by the last convolutional layer serves as the feature map of the target training text image.

Step S708: split the feature map into at least one sub-feature map through the feature splitting network.

For the purpose of recognizing text content, the text recognition model needs to split the feature map corresponding to a text line so that each sub-feature map contains one character or a small number of characters or symbols, which facilitates recognition of the text content. During splitting, the size of the sub-feature maps may be preset and the feature map split according to that size; alternatively, the number of sub-feature maps may be preset and the feature map split according to that number. Of course, if the text line is inherently short, for example only one character, the feature map may be split into only one sub-feature map.

Step S710: input the sub-feature maps into the second output network respectively, and output an output matrix corresponding to each sub-feature map.

The second output network further processes each sub-feature map. In the output matrix corresponding to each sub-feature map, each position corresponds to a preset character, and the value at a position represents how well the sub-feature map matches the character corresponding to that position. The second output network may be a convolutional network or a fully connected network.

Step S712: input the output matrix corresponding to each sub-feature map into the classification function respectively, and output a probability matrix corresponding to each sub-feature map.

The classification function maps each value in the output matrix to a probability value, yielding a probability matrix. The probability value at each position in the probability matrix represents the probability that the sub-feature map matches the character corresponding to that position.

Step S714: determine a second loss value of the probability matrices through a preset recognition loss function; train the second initial model according to the second loss value until the parameters in the second initial model converge, obtaining the text recognition model.

The target training text image is pre-annotated with standard text content consisting of one or more standard characters. A ground-truth probability matrix can be generated from this text content: the position corresponding to the standard character of a sub-feature map has a probability value of 1, and all other positions have a probability value of 0. The recognition loss function compares the probability matrix output by the classification function with the ground-truth probability matrix; generally, the larger the difference, the larger the second loss value. Based on the second loss value, the parameters of each part of the second initial model can be adjusted to achieve the purpose of training. When all parameters of the model converge, training ends and the text recognition model is obtained.
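The embodiment does not fix the exact formula used to compare the predicted probability matrix with the ground-truth one-hot probability matrix; as an illustration only, a cross-entropy between the two distributions behaves as described, growing as the prediction diverges from the annotation (all values below are hypothetical):

```python
import math

def one_hot(index, size):
    """Ground-truth probability vector: 1 at the annotated character's position, 0 elsewhere."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def cross_entropy(target, predicted, eps=1e-12):
    """Illustrative loss: grows as the predicted distribution diverges from the target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

target = one_hot(2, 4)            # annotated character sits at position 2
good = [0.05, 0.05, 0.85, 0.05]   # prediction close to the annotation -> small loss
bad = [0.40, 0.30, 0.10, 0.20]    # prediction far from the annotation -> large loss
```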

In the above training method for the text recognition model, the feature map of the target training text image is first extracted; the feature map is then split into at least one sub-feature map; the sub-feature maps are input into the second output network respectively to output the output matrix corresponding to each sub-feature map; the probability matrix corresponding to each sub-feature map is then obtained through the classification function; and after the second loss value of the probability matrices is determined through the preset recognition loss function, the second initial model is trained according to the second loss value to obtain the text recognition model. In this method, the model automatically segments the feature map of the image, so the text recognition model only needs an input image containing a text line to obtain the text content of that image; there is no need to segment the text line separately, and its text content is obtained directly. The operation is convenient, the computation is fast, and the text recognition accuracy is high.

The embodiment of the present invention further provides another training method for a text recognition model, implemented on the basis of the method described in the above embodiment; this method focuses on the specific implementation of each step of the above training method, and includes the following steps:

Step 802: determine a target training text image based on a preset training set.

Step 804: input the target training text image into a second initial model; the second initial model includes a second feature extraction network, a feature splitting network, a second output network, and a classification function.

Step 806: extract a feature map of the target training text image through the second feature extraction network.

To improve the performance of the second feature extraction network, it may include multiple groups of second convolutional networks connected in sequence; each group of second convolutional networks includes a convolutional layer, a pooling layer, and an activation function layer connected in sequence. Fig. 8 shows a schematic structural diagram of such a second feature extraction network, taking four groups of second convolutional networks as an example, where the convolutional layer of each subsequent group is connected to the activation function layer of the preceding group. The second feature extraction network may also contain more or fewer groups of second convolutional networks.

It can be understood that the convolutional layer in a second convolutional network extracts features and generates a feature map. The pooling layer may be an average pooling layer (mean pooling), a global average pooling layer, a max pooling layer, etc. The pooling layer compresses the feature map output by the convolutional layer, retaining the main features and discarding the minor ones, so as to reduce the dimensions of the feature map. Taking the average pooling layer as an example, it averages the feature values within a neighborhood of preset size around the current feature point and uses the average as the new value of that feature point. In addition, pooling helps the feature map maintain certain invariances, such as rotation invariance, translation invariance, and scaling invariance. The activation function layer applies a functional transformation to the pooled feature map; this transformation breaks the linear combinations produced by the convolutional layer and improves the feature expression ability of the second convolutional network. The activation function may specifically be a Sigmoid function, a tanh function, a ReLU function, or the like.

Step 808: split the feature map into at least one sub-feature map through the feature splitting network.

Considering that most text lines are arranged horizontally, and to make each split sub-feature map contain the features of one character or a small number of characters, the feature map may be split into at least one sub-feature map along its column direction; the column direction of the feature map can be understood as the direction perpendicular to the text line. In practice, the width of the sub-feature maps is set according to the width of most characters, and the feature map is split according to this width. For example, if the feature map is H*W*C and the preset sub-feature map width is k, the feature map is split into W/k sub-feature maps of size H*k*C each. Alternatively, the number of sub-feature maps may be preset, e.g., T, in which case each sub-feature map is H*(W/T)*C.
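The column-direction split of step 808 can be sketched as follows (a minimal pure-Python illustration; the map sizes, and the use of nested lists in place of real H*W*C tensors, are hypothetical):

```python
def split_feature_map(feature_map, t):
    """Split an H x W map (nested lists; channels omitted for brevity) into
    t sub-maps of width W // t along the column (width) direction."""
    w = len(feature_map[0])
    assert w % t == 0, "width must be divisible by the number of sub-maps"
    k = w // t
    return [[row[j * k:(j + 1) * k] for row in feature_map] for j in range(t)]

# Illustrative 2 x 8 feature map whose values mark the column index
fmap = [list(range(8)) for _ in range(2)]
subs = split_feature_map(fmap, 4)   # T = 4 sub-feature maps, each 2 x 2
```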

Step 810: input the sub-feature maps into the second output network respectively, and output an output matrix corresponding to each sub-feature map.

Taking a fully connected network as an example, the second output network includes multiple fully connected layers arranged in parallel; the number of fully connected layers corresponds to the number of sub-feature maps, and each sub-feature map is input into its corresponding fully connected layer, so that each fully connected layer outputs the output matrix corresponding to its sub-feature map.

Step 812: input the output matrix corresponding to each sub-feature map into the classification function respectively, and output the probability matrix corresponding to each sub-feature map.

The classification function may be a Softmax function, which can be written as p_t^i = e^{x_t^i} / Σ_{m=1}^{K+1} e^{x_t^m}, where e denotes the natural constant; t indexes the t-th probability matrix; K denotes the number of distinct characters contained in the target training text images of the training set; m ranges from 1 to K+1; Σ denotes the summation operation; x_t^i is the i-th element of the output matrix; and p_t^i is the i-th element of the probability matrix p_t.

Compared with the elements of the output matrix themselves, the exponential function values e^{x} of the elements enlarge the differences between elements. For example, if the output matrix is [3, 1, -3], the matrix of exponential function values is approximately [20, 2.7, 0.05]. Computing each element's probability from its exponential function value widens the probability gaps between elements, so that the probability of the correct recognition result is higher, which benefits the accuracy of the recognition result.
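The Softmax computation and the widening effect of the exponential can be checked with a short sketch (the input vector [3, 1, -3] is the example from the text):

```python
import math

def softmax(xs):
    """Exponentiate each element, then normalize so the values sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3, 1, -3])   # exponentials are roughly [20, 2.7, 0.05]
```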

Step 814: determine the second loss value of the probability matrices through the preset recognition loss function; train the second initial model according to the second loss value until the parameters in the second initial model converge, obtaining the text recognition model.

The recognition loss function includes L = -log p(y | {p_t}_{t=1...T}), where y is the pre-annotated probability matrix (label) of the target training text image; t indexes the t-th probability matrix; p_t is the probability matrix corresponding to each sub-feature map output by the classification function; T is the total number of probability matrices; p denotes the computed probability; and log denotes the logarithm operation. Based on this recognition loss function, the process of training the second initial model according to the second loss value in the above steps may also be implemented through the following steps 32-38:

Step 32: update the parameters in the second initial model according to the second loss value.

In practice, a functional mapping may be preset; inputting an original parameter and the second loss value into this mapping yields the updated parameter. The functional mappings of different parameters may be the same or different.

Specifically, the parameters to be updated may be determined from the second initial model according to a preset rule; they may be all parameters of the second initial model, or a subset of parameters chosen at random from it. The derivative of the second loss value with respect to each parameter to be updated, ∂L′/∂W′, is then computed, where L′ is the loss value of the probability matrices and W′ is a parameter to be updated; a parameter to be updated may also be called the weight of a neuron. This process may also be called the back-propagation algorithm: if the second loss value is large, the output of the current second initial model does not match the expected output, so the derivative of the second loss value with respect to each parameter to be updated is computed and used as the basis for adjusting that parameter.

After the derivative of each parameter to be updated is obtained, each parameter is updated as W′ ← W′ − α′·∂L′/∂W′, where α′ is a preset coefficient. This process may also be called the stochastic gradient descent algorithm: the derivative of each parameter to be updated gives the direction in which the second loss value decreases fastest from the current parameter value, and adjusting the parameter along this direction quickly reduces the loss and drives the parameter to convergence. In addition, after each round of training of the second initial model, a second loss value is obtained; at this point one or more parameters may be randomly selected from the parameters of the second initial model for the above update, which shortens training time and speeds up the algorithm; of course, the above update may also be applied to all parameters of the second initial model, which makes model training more accurate.
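One update under the stochastic gradient descent rule W′ ← W′ − α′·∂L′/∂W′ can be sketched as follows (the weights, gradients, and learning rate are illustrative values, not from the embodiment):

```python
def sgd_step(params, grads, lr):
    """Apply W' <- W' - lr * dL'/dW' to every parameter selected for update."""
    return [w - lr * g for w, g in zip(params, grads)]

params = [0.5, -1.2, 3.0]   # current parameters W'
grads = [0.1, 0.4, -2.0]    # derivatives dL'/dW' from back-propagation
updated = sgd_step(params, grads, lr=0.5)
```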

Step 34: judge whether all the updated parameters have converged; if all the updated parameters have converged, execute step 36; otherwise, execute step 38.

Step 36: determine the second initial model with the updated parameters as the recognition model.

Step 38: continue executing the step of determining a target training text image based on the preset training set, until all the updated parameters converge.

Specifically, a new image may be re-acquired from the training set as the target training text image, or the current target training text image may continue to be used for training.

In the above method, the model automatically segments the feature map of the image, so the text recognition model only needs an input image containing a text line to obtain the text content of that image; there is no need to segment the text line separately, and its text content is obtained directly. The operation is convenient, the computation is fast, and the text recognition accuracy is high.

Based on the text content determination method provided by the above embodiments, the embodiment of the present invention further provides another text content determination method, implemented on the basis of the text content determination method or the text recognition model training method described in the above embodiments; this method focuses on the process of obtaining the text content of a text region from the recognition result output by the text recognition model. As shown in Fig. 9, the method includes the following steps:

Step S902: obtain a text region in the image through the above text region determination method.

Step S904: normalize the text region according to a preset size.

The preset size may include a preset length and width. If the text region does not meet the preset size, it may be scaled, cropped, or padded with blank area, so that the processed text region meets the preset size.

Step S906: input the processed text region into a pre-trained text recognition model and output a recognition result for the text region; the recognition result of the text region includes multiple probability matrices corresponding to the text region.

During recognition, the text recognition model segments the feature map corresponding to the text region, outputs an output matrix for each segmented sub-feature map through the corresponding output network, and then obtains the probability matrix corresponding to each output matrix through the classification function. The recognition result of the text region therefore includes multiple probability matrices, and each probability matrix usually corresponds to one character or a small number of characters.

Step S908: determine the position of the maximum probability value in each probability matrix.

Step S910: obtain, from the preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value.

As described in the above embodiments, the probability value at each position in a probability matrix represents the probability that the sub-feature map matches the character corresponding to that position, so the character corresponding to the position of the maximum probability value can be determined as the recognition result of the corresponding sub-feature map. In most cases, the character corresponding to the position of the maximum probability value is a single character, but it may also be multiple characters. The above correspondence between positions and characters may be established as follows: first, characters are collected, which may include text of multiple languages, punctuation marks, mathematical symbols, network emoticons, etc.; specifically, characters may be collected while building the training set, or collected from dictionaries, character libraries, symbol libraries, and the like.
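Steps S908 and S910 amount to an argmax over each probability matrix followed by a table lookup; a minimal sketch (the position-to-character table and probability values are hypothetical):

```python
def decode_position(prob_vector, charset):
    """Return the character at the position holding the maximum probability."""
    best = max(range(len(prob_vector)), key=lambda i: prob_vector[i])
    return charset[best]

charset = ["h", "e", "l", "o", "-"]   # hypothetical position-to-character correspondence
char = decode_position([0.1, 0.7, 0.1, 0.05, 0.05], charset)
```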

Step S912: arrange the obtained characters according to the arrangement order of the multiple probability matrices.

The multiple probability matrices output by the text recognition model are usually ordered according to the positions, in the feature map, of the sub-feature maps corresponding to them, so the order of the probability matrices usually matches the order of the characters contained in those sub-feature maps. Based on this, arranging the obtained characters in the order of the probability matrices makes the arranged characters consistent with the character order of the original text line, and the text content of the text region can therefore be determined from the arranged characters.

Step S914: determine the text content in the text region according to the arranged characters.

In practice, the arranged characters may be used directly as the text content of the text region. However, considering that the characters in the text may have different font sizes, the text recognition model may not split the feature map exactly one character per sub-feature map; as a result, the finally arranged characters may contain duplicate characters. To further optimize the text recognition result, duplicate characters and blank characters may be deleted from the arranged characters according to preset rules to obtain the text content of the text region.

Specifically, a reduplication lexicon may be established in advance. If duplicate characters appear among the arranged characters, the lexicon is consulted to see whether the duplication is legitimate; if it is not found there, the duplicate is deleted and only one copy of the character is kept. The semantics of the surrounding characters may also be combined to judge whether duplicate characters should appear in the current context. For blank characters, the current context can likewise be used to judge whether to delete them: if a blank character lies between two English words, it need not be deleted and may be kept. For example, if the arranged characters are "--hh-e-l-ll-oo-", where "-" denotes a blank character, the text content obtained after deleting duplicate and blank characters is "hello".
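A simplified sketch of the deletion rule in the "hello" example: collapse runs of repeated characters, then drop the blanks. (The embodiment additionally consults a reduplication lexicon and the surrounding context, which this sketch omits.)

```python
def collapse(chars, blank="-"):
    """Keep only the first character of each consecutive run, then remove blanks."""
    out, prev = [], None
    for c in chars:
        if c != prev:
            out.append(c)
        prev = c
    return "".join(c for c in out if c != blank)

text = collapse("--hh-e-l-ll-oo-")   # the example string from the text
```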

In the above method, the obtained text region is first normalized; the recognition result of the text region is then obtained through the text recognition model; the recognized characters are then determined through each probability matrix in the recognition result, and the text content of the text region is obtained. Since the text recognition model automatically segments the feature map of the image, this method only needs an input image containing a text line to obtain the recognition result of that image and then the text content; there is no need to segment the text line separately, and the text content of the text line is obtained directly. The operation is convenient, the computation is fast, and the text recognition accuracy is high.

Based on the text content determination method provided by the above embodiments, the embodiment of the present invention further provides another text content determination method, implemented on the basis of the above method; this method focuses on the process of judging whether an image contains sensitive words based on the text content obtained for its text regions.

Usually, a sensitive-word lexicon needs to be established in advance and used to determine whether the text content corresponding to an image contains sensitive information. The sensitive-word lexicon contains sensitive words, such as words involving pornography, subversion, or terrorism. The words in the text content may be matched against the sensitive-word lexicon one by one; if a match succeeds, the current word is a sensitive word. Based on this, the text content determination method of this embodiment includes the following steps, as shown in Fig. 10:

Step S1002: obtain a text region in the image through the above text region determination method.

Step S1004: normalize the text region according to a preset size.

The preset size may include a preset length and width. If the text region does not meet the preset size, it may be scaled, cropped, or padded with blank area, so that the processed text region meets the preset size.

Step S1006: input the processed text region into the pre-trained text recognition model and output the recognition result for the text region; the recognition result of the text region includes multiple probability matrices corresponding to the text region.

Step S1008: determine the position of the maximum probability value in each probability matrix.

Step S1010: obtain, from the preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value.

Step S1012: arrange the obtained characters according to the arrangement order of the multiple probability matrices.

Step S1014: determine the text content in the text region according to the arranged characters.

Step S1016: if the image contains multiple text regions, obtain the text content of each text region.

Step S1018: perform a word segmentation operation on the obtained text content.

The word segmentation operation may also be called a word cutting operation. In practice, a lexicon may be established and the segmentation performed against it. Specifically, starting from the first character of the text content, the first and second characters are taken as a combination and looked up in the lexicon; if no word containing this combination is found, the first character is split off as a separate word; if a word containing the combination is found, the third character is added to the combination and the lookup continues, until no word containing the combination can be found, at which point the characters in the combination except the last one are split off as a word; and so on, until the word segmentation of the text content is complete.
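The greedy lookup described above can be sketched as follows (a pure-Python illustration; the lexicon contents are hypothetical, and "a lexicon word containing the combination" is approximated here by a prefix test):

```python
def segment(text, lexicon):
    """Extend the current combination while some lexicon word still starts with it;
    otherwise split off the part matched so far as one word."""
    words, i = [], 0
    while i < len(text):
        j = i + 1
        while j < len(text) and any(w.startswith(text[i:j + 1]) for w in lexicon):
            j += 1
        words.append(text[i:j])
        i = j
    return words
```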

Step S1020: match the segmented words obtained by the word segmentation operation against the pre-established sensitive-word lexicon one by one.

Step S1022: if at least one segmented word matches successfully, determine that the text content corresponding to the image contains sensitive information.
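Steps S1020 and S1022 reduce to membership tests against the lexicon; a minimal sketch (the word lists are hypothetical placeholders):

```python
def find_sensitive(words, sensitive_lexicon):
    """Return the segmented words that match the sensitive-word lexicon."""
    return [w for w in words if w in sensitive_lexicon]

hits = find_sensitive(["hello", "badword", "world"], {"badword"})
contains_sensitive = len(hits) > 0   # at least one match -> sensitive information present
```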

步骤S1024,获取匹配成功的分词所属的文本区域,在图像中标识出获取到的文本区域,或者匹配成功的分词。Step S1024: Acquire the text area to which the successfully matched word segmentation belongs, and identify the acquired text area in the image, or the successfully matched word segmentation.

在实际实现时,可以以标识框的方式标识获取到的文本区域,或者匹配成功的分词;如果是在视频播放或实时直播场景下的实时检测,可以使用马赛克或模糊化的方式标识获取到的文本区域,或者匹配成功的分词,以达到过滤敏感词的目的。In actual implementation, the acquired text area can be identified by means of an identification box, or the successfully matched word segmentation; if it is real-time detection in video playback or real-time live broadcast scenarios, the acquired text area can be identified by mosaic or fuzzification. Text area, or matching successful word segmentation, in order to achieve the purpose of filtering sensitive words.

上述方式中,获取到文本区域的文本内容后,再通过敏感词库从文本内容中识别敏感词,以实现言论监管的目的;该方式可以实时获取内容并识别敏感词,有利于实现在网络直播、视频直播等场景下的言论监管,并限制敏感词传播的目的。In the above manner, after the text content of the text region is obtained, sensitive words are identified from the text content through the sensitive-word lexicon so as to achieve speech supervision; this manner can acquire content and identify sensitive words in real time, which facilitates speech supervision in scenarios such as webcasts and live video streams and limits the spread of sensitive words.

需要说明的是,上述各方法实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。It should be noted that the above method embodiments are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same and similar parts of the various embodiments may be referred to each other.

对应于上述方法实施例,参见图11所示的一种文本检测模型训练装置的结构示意图,该装置包括:Corresponding to the above method embodiment, refer to the schematic structural diagram of a text detection model training device shown in FIG. 11 , the device includes:

训练图像确定模块110,用于基于预设的训练集合确定目标训练图像;A training image determination module 110, configured to determine a target training image based on a preset training set;

训练图像输入模块111,用于将目标训练图像输入至第一初始模型;第一初始模型包括第一特征提取网络、特征融合网络和第一输出网络;The training image input module 111 is used to input the target training image into the first initial model; the first initial model includes a first feature extraction network, a feature fusion network and a first output network;

特征提取模块112,用于通过第一特征提取网络提取目标训练图像的多个初始特征图;多个初始特征图之间的尺度不同;The feature extraction module 112 is used for extracting multiple initial feature maps of the target training image through the first feature extraction network; the scales between the multiple initial feature maps are different;

特征融合模块113,用于通过特征融合网络对多个初始特征图进行融合处理,得到融合特征图;The feature fusion module 113 is configured to perform fusion processing on a plurality of initial feature maps through a feature fusion network to obtain a fusion feature map;

输出模块114,用于将融合特征图输入至第一输出网络,输出目标训练图像中文本区域的候选区域以及每个候选区域的概率值;The output module 114 is used to input the fusion feature map to the first output network, and output the candidate regions of the text region in the target training image and the probability value of each candidate region;

损失值确定和训练模块115,用于通过预设的检测损失函数确定候选区域以及每个候选区域的概率值的第一损失值;根据第一损失值对第一初始模型进行训练,直至第一初始模型中的参数收敛,得到文本检测模型。The loss value determination and training module 115 is configured to determine a first loss value of the candidate regions and of each candidate region's probability value through a preset detection loss function, and to train the first initial model according to the first loss value until the parameters in the first initial model converge, obtaining the text detection model.

本发明实施例提供的文本检测模型训练装置,首先提取目标训练图像的尺度相互不同的多个初始特征图;再对多个初始特征图进行融合处理,得到融合特征图;进而将融合特征图输入至第一输出网络,输出目标训练图像中文本区域的候选区域以及每个候选区域的概率值;通过预设的检测损失函数确定第一损失值后,根据该第一损失值对第一初始模型进行训练,得到检测模型。该方式中,特征提取网络可以自动提取不同尺度的特征,因而该文本检测模型,只需要输入一张图像即可得到该图像中各种尺度的文本区域的候选区域,无需再人工变换图像尺度,操作便捷,尤其在多种字号、多种字体、多种形状、多种方向场景下,可以快速全面准确地检测出图像中的各类文本,进而也有利于后续文本识别的准确性,提高了文本识别的效果。The text detection model training device provided by the embodiment of the present invention first extracts multiple initial feature maps of mutually different scales from the target training image; then fuses the multiple initial feature maps to obtain a fused feature map; the fused feature map is then input into the first output network, which outputs the candidate regions of the text region in the target training image and the probability value of each candidate region; after a first loss value is determined through the preset detection loss function, the first initial model is trained according to this first loss value to obtain the detection model. In this manner, the feature extraction network can automatically extract features of different scales, so the text detection model only needs a single input image to obtain candidate regions for text regions of all scales in that image, with no need to manually transform the image scale. The operation is convenient, and especially in scenarios with multiple font sizes, fonts, shapes, and orientations, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which in turn benefits the accuracy of subsequent text recognition and improves the text recognition effect.

在一些实施例中,上述第一特征提取网络包括依次连接的多组第一卷积网络;每组第一卷积网络包括依次连接的卷积层、批归一化层和激活函数层。In some embodiments, the above-mentioned first feature extraction network includes a plurality of groups of first convolutional networks connected in sequence; each group of first convolutional networks includes a convolutional layer, a batch normalization layer, and an activation function layer that are connected in sequence.

在一些实施例中,上述特征融合模块还用于:根据初始特征图的尺度,将多个初始特征图依次排列;其中,最顶层级的初始特征图的尺度最小;最底层级的初始特征图的尺度最大;将最顶层级的初始特征图确定为最顶层级的融合特征图;除最顶层级以外,将当前层级的初始特征图和当前层级的上一层级的融合特征图进行融合,得到当前层级的融合特征图;将最低层级的融合特征图确定为最终的融合特征图。In some embodiments, the above feature fusion module is further configured to: arrange the multiple initial feature maps in sequence according to their scales, wherein the initial feature map at the topmost level has the smallest scale and the initial feature map at the bottommost level has the largest scale; determine the initial feature map of the topmost level as the fused feature map of the topmost level; for each level other than the topmost, fuse the initial feature map of the current level with the fused feature map of the level above it to obtain the fused feature map of the current level; and determine the fused feature map of the lowest level as the final fused feature map.
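The top-down fusion described in this embodiment can be illustrated with a minimal sketch. The patent does not specify the fusion operation or the upsampling method; element-wise addition after nearest-neighbour upsampling (as in feature-pyramid-style networks) is assumed here, and the channel dimension is omitted for brevity.

```python
import numpy as np

def fuse(initial_maps):
    """initial_maps: 2-D arrays ordered from topmost level (smallest scale)
    to bottommost level (largest scale), with sizes that are exact multiples.
    Each level's fused map = its initial map + the upsampled fused map of the
    level above; the lowest level's result is the final fused feature map."""
    fused = initial_maps[0]  # topmost fused map is the topmost initial map
    for feat in initial_maps[1:]:
        ry = feat.shape[0] // fused.shape[0]
        rx = feat.shape[1] // fused.shape[1]
        # nearest-neighbour upsampling via block repetition
        up = np.kron(fused, np.ones((ry, rx)))
        fused = feat + up  # assumed fusion: element-wise addition
    return fused
```

For three all-ones maps of sizes 1x1, 2x2, and 4x4, the final fused map is 4x4 with every element equal to 3 (1 from each level after upsampling).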

在一些实施例中,上述第一输出网络包括第一卷积层和第二卷积层;上述输出模块还用于:将融合特征图分别输入至第一卷积层和第二卷积层;通过第一卷积层对融合特征图进行第一卷积运算,输出坐标矩阵;坐标矩阵包括目标训练图像中文本区域的候选区域的顶点坐标;通过第二卷积层对融合特征图进行第二卷积运算,输出概率矩阵;概率矩阵包括每个候选区域的概率值。In some embodiments, the above first output network includes a first convolutional layer and a second convolutional layer; the above output module is further configured to: input the fused feature map into the first convolutional layer and the second convolutional layer respectively; perform a first convolution operation on the fused feature map through the first convolutional layer and output a coordinate matrix, the coordinate matrix including the vertex coordinates of the candidate regions of the text region in the target training image; and perform a second convolution operation on the fused feature map through the second convolutional layer and output a probability matrix, the probability matrix including the probability value of each candidate region.

在一些实施例中,上述检测损失函数包括第一函数和第二函数;第一函数为L1=|G*-G|;其中,G*为预先标注的目标训练图像中文本区域的坐标矩阵;G为第一输出网络输出的目标训练图像中文本区域的候选区域的坐标矩阵;第二函数为L2=-Y*logY-(1-Y*)log(1-Y);其中,Y*为预先标注的目标训练图像中文本区域的概率矩阵;Y为第一输出网络输出的目标训练图像中文本区域的候选区域的概率矩阵;候选区域以及每个候选区域的概率值的第一损失值L=L1+L2。In some embodiments, the above detection loss function includes a first function and a second function. The first function is L1 = |G* - G|, where G* is the coordinate matrix of the pre-labeled text region in the target training image and G is the coordinate matrix of the candidate regions of the text region output by the first output network. The second function is L2 = -Y*·log(Y) - (1 - Y*)·log(1 - Y), where Y* is the probability matrix of the pre-labeled text region in the target training image and Y is the probability matrix of the candidate regions output by the first output network. The first loss value of the candidate regions and of each candidate region's probability value is L = L1 + L2.
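As a worked example of the loss L = L1 + L2 above. The patent does not state how the matrix-valued terms are reduced to a scalar; this sketch assumes |·| is an element-wise absolute difference summed over the matrix and that the cross-entropy term is likewise summed, and the box corners and probabilities are hypothetical values.

```python
import numpy as np

G_star = np.array([0.0, 0.0, 10.0, 10.0])  # labelled box coordinates (hypothetical)
G = np.array([1.0, 0.0, 9.0, 10.0])        # predicted candidate coordinates
L1 = np.abs(G_star - G).sum()              # regression term |G* - G| = 2.0

Y_star = np.array([1.0, 0.0])              # labelled text/non-text probabilities
Y = np.array([0.9, 0.1])                   # predicted probabilities
# cross-entropy term -Y*·log(Y) - (1 - Y*)·log(1 - Y)
L2 = (-Y_star * np.log(Y) - (1 - Y_star) * np.log(1 - Y)).sum()

L = L1 + L2                                # combined first loss value
```

With these values L2 = -2·log(0.9) ≈ 0.211, so L ≈ 2.211; a perfect prediction would drive both terms to zero.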

在一些实施例中,上述损失值确定和训练模块还用于:根据第一损失值更新第一初始模型中的参数;判断更新后的参数是否均收敛;如果更新后的参数均收敛,将参数更新后的第一初始模型确定为检测模型;如果更新后的参数没有均收敛,继续执行基于预设的训练集合确定目标训练图像的步骤,直至更新后的参数均收敛。In some embodiments, the above loss value determination and training module is further configured to: update the parameters in the first initial model according to the first loss value; judge whether the updated parameters have all converged; if so, determine the first initial model with the updated parameters as the detection model; if not, continue to perform the step of determining a target training image based on the preset training set until the updated parameters all converge.

在一些实施例中,上述损失值确定和训练模块还用于:按照预设规则,从第一初始模型确定待更新参数;计算第一损失值对第一初始模型中待更新参数的导数∂L/∂W;其中,L为第一损失值;W为待更新参数;更新待更新参数,得到更新后的待更新参数W* = W - α·∂L/∂W;其中,α为预设系数。In some embodiments, the above loss value determination and training module is further configured to: determine the parameters to be updated from the first initial model according to a preset rule; calculate the derivative ∂L/∂W of the first loss value with respect to a parameter to be updated, where L is the first loss value and W is the parameter to be updated; and update the parameter to obtain the updated parameter W* = W - α·∂L/∂W, where α is a preset coefficient.

参见图12所示的一种文本区域确定装置的结构示意图;该装置包括:Referring to the schematic structural diagram of a text area determination device shown in FIG. 12 ; the device includes:

图像获取模块120,用于获取待检测图像;an image acquisition module 120, configured to acquire an image to be detected;

检测模块122,用于将待检测图像输入至预先训练完成的文本检测模型,输出待检测图像中文本区域的多个候选区域,以及每个候选区域的概率值;文本检测模型通过上述文本检测模型的训练方法训练得到;The detection module 122 is configured to input the image to be detected into the pre-trained text detection model and output multiple candidate regions of the text region in the image to be detected as well as the probability value of each candidate region; the text detection model is trained through the above text detection model training method;

文本区域确定模块124,用于根据候选区域的概率值以及多个候选区域之间的重叠程度,从多个候选区域中确定待检测图像中的文本区域。The text area determination module 124 is configured to determine the text area in the image to be detected from the multiple candidate areas according to the probability value of the candidate area and the degree of overlap between the multiple candidate areas.

本发明实施例提供的上述文本区域确定装置,将获取到的待检测图像输入至文本检测模型,输出待检测图像中文本区域的多个候选区域以及每个候选区域的概率值;进而根据候选区域的概率值以及多个候选区域之间的重叠程度,从多个候选区域中确定待检测图像中的文本区域。该方式中,文本检测模型可以自动提取不同尺度的特征,因而只需要输入一张图像至该模型即可得到该图像中各种尺度的文本区域的候选区域,无需再人工变换图像尺度,操作便捷,尤其在多种字号、多种字体、多种形状、多种方向场景下,可以快速全面准确地检测出图像中的各类文本,进而也有利于后续文本识别的准确性,提高了文本识别的效果。The above text region determination device provided by the embodiment of the present invention inputs the acquired image to be detected into the text detection model, which outputs multiple candidate regions of the text region in the image and the probability value of each candidate region; the text region in the image to be detected is then determined from the multiple candidate regions according to the probability values of the candidate regions and the degree of overlap between them. In this manner, the text detection model can automatically extract features of different scales, so a single input image is sufficient to obtain candidate regions for text regions of all scales in that image, with no need to manually transform the image scale. The operation is convenient, and especially in scenarios with multiple font sizes, fonts, shapes, and orientations, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which in turn benefits the accuracy of subsequent text recognition and improves the text recognition effect.

在一些实施例中,上述文本区域确定模块还用于:根据候选区域的概率值,将多个候选区域依次排列;其中,第一个候选区域的概率值最大,最后一个候选区域的概率值最小;将第一个候选区域作为当前候选区域,逐一计算当前候选区域与除当前候选区域以外的候选区域的重叠程度;将除当前候选区域以外的候选区域中,重叠程度大于预设的重叠阈值的候选区域剔除;将当前候选区域的下一个候选区域作为新的当前候选区域,继续执行逐一计算当前候选区域与除当前候选区域以外的候选区域的重叠程度的步骤,直至到达最后一个候选区域;将剔除后的剩余的候选区域确定为待检测图像中的文本区域。In some embodiments, the above text region determination module is further configured to: arrange the multiple candidate regions in sequence according to their probability values, with the first candidate region having the largest probability value and the last having the smallest; take the first candidate region as the current candidate region and calculate, one by one, the degree of overlap between the current candidate region and each of the other candidate regions; remove, from the candidate regions other than the current one, those whose degree of overlap exceeds a preset overlap threshold; take the next remaining candidate region as the new current candidate region and repeat the step of calculating the degrees of overlap one by one until the last candidate region is reached; and determine the candidate regions remaining after removal as the text regions in the image to be detected.
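The candidate-region filtering described in this embodiment follows the classic non-maximum suppression procedure. A minimal sketch, assuming the unspecified overlap measure is intersection-over-union on axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Sort by probability, keep the current best box, drop the others that
    overlap it beyond the threshold, and repeat on the survivors."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        cur = order.pop(0)
        keep.append(cur)
        order = [i for i in order if iou(boxes[cur], boxes[i]) <= thresh]
    return keep
```

For example, with two heavily overlapping boxes and one distant box, only the higher-scoring of the overlapping pair and the distant box survive.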

在一些实施例中,上述装置还包括:区域剔除模块,用于将多个候选区域中,概率值低于预设的概率阈值的候选区域剔除,得到最终的多个候选区域。In some embodiments, the above-mentioned apparatus further includes: a region elimination module, configured to eliminate candidate regions whose probability value is lower than a preset probability threshold from the plurality of candidate regions to obtain the final plurality of candidate regions.

参见图13所示的一种文本内容确定装置的结构示意图;该装置包括:Referring to the schematic structural diagram of a text content determination device shown in FIG. 13; the device includes:

区域获取模块130,用于通过权利要求8-10任一项的文本区域确定方法,获取图像中的文本区域;The area acquisition module 130 is used to acquire the text area in the image by the text area determination method of any one of claims 8-10;

识别模块132,用于将文本区域输入至预先训练完成的文本识别模型,输出文本区域的识别结果;The recognition module 132 is used to input the text area into the pre-trained text recognition model, and output the recognition result of the text area;

文本内容确定模块134,用于根据识别结果确定文本区域中的文本内容。The text content determination module 134 is configured to determine the text content in the text area according to the recognition result.

本发明实施例提供的文本内容确定装置,首先通过上述文本区域确定方法获取图像中的文本区域;再将该文本区域输入至预先训练完成的文本识别模型,输出文本区域的识别结果;最后根据该识别结果确定文本区域中的文本信息。该方式中,由于上述文本区域确定方法可以通过文本检测模型获取到各种尺度的文本区域,在多种字号、多种字体、多种形状、多种方向场景下,可以快速全面准确地检测出图像中的各类文本,进而也有利于文本识别的准确性,提高了文本识别的效果。The text content determination device provided by the embodiment of the present invention first acquires the text region in the image through the above text region determination method; the text region is then input into the pre-trained text recognition model, which outputs the recognition result of the text region; finally, the text information in the text region is determined according to the recognition result. In this manner, since the above text region determination method can obtain text regions of all scales through the text detection model, all kinds of text in the image can be detected quickly, comprehensively, and accurately in scenarios with multiple font sizes, fonts, shapes, and orientations, which also benefits the accuracy of text recognition and improves the text recognition effect.

在一些实施例中,上述装置还包括:归一化模块,用于按照预设尺寸,对文本区域进行归一化处理。In some embodiments, the above apparatus further includes: a normalization module, configured to perform normalization processing on the text area according to a preset size.

在一些实施例中,上述装置还包括文本识别模型训练模块,用于使文本识别模型通过下述方式训练完成:基于预设的训练集合确定目标训练文本图像;将目标训练文本图像输入至第二初始模型;第二初始模型包括第二特征提取网络、第二输出网络和分类函数;通过第二特征提取网络提取目标训练文本图像的特征图;通过第二初始模型将特征图拆分成至少一个子特征图;将子特征图分别输入至第二输出网络,输出每个子特征图对应的输出矩阵;将每个子特征图对应的输出矩阵分别输入至分类函数,输出每个子特征图对应的概率矩阵;通过预设的识别损失函数确定概率矩阵的第二损失值;根据第二损失值对第二初始模型进行训练,直至第二初始模型中的参数收敛,得到文本识别模型。In some embodiments, the above apparatus further includes a text recognition model training module, configured to train the text recognition model in the following manner: determining a target training text image based on a preset training set; inputting the target training text image into a second initial model, the second initial model including a second feature extraction network, a second output network, and a classification function; extracting a feature map of the target training text image through the second feature extraction network; splitting the feature map into at least one sub-feature map through the second initial model; inputting the sub-feature maps into the second output network respectively and outputting the output matrix corresponding to each sub-feature map; inputting the output matrix of each sub-feature map into the classification function respectively and outputting the probability matrix corresponding to each sub-feature map; determining a second loss value of the probability matrices through a preset recognition loss function; and training the second initial model according to the second loss value until the parameters in the second initial model converge, obtaining the text recognition model.

在一些实施例中,上述第二特征提取网络包括依次连接的多组第二卷积网络;每组第二卷积网络包括依次连接的卷积层、池化层和激活函数层。In some embodiments, the above-mentioned second feature extraction network includes a plurality of groups of second convolution networks connected in sequence; each group of second convolution networks includes a convolution layer, a pooling layer, and an activation function layer connected in sequence.

在一些实施例中,上述文本识别模型训练模块还用于:沿着特征图的列方向,将特征图拆分成至少一个子特征图;特征图的列方向为文本行方向的垂直方向。In some embodiments, the above text recognition model training module is further configured to: split the feature map into at least one sub-feature map along the column direction of the feature map; the column direction of the feature map is the vertical direction of the text row direction.
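The column-wise split of the feature map described in this embodiment can be sketched as follows. The (H, W, C) memory layout, the `split_columns` name, and the sub-map width of one column are assumptions of this illustration.

```python
import numpy as np

def split_columns(feature_map, width=1):
    """Split an (H, W, C) feature map along the column axis (perpendicular to
    the text-line direction) into W / width sub-feature maps."""
    n_parts = feature_map.shape[1] // width
    return np.split(feature_map, n_parts, axis=1)
```

A feature map of shape (8, 32, 64) would thus yield 32 sub-feature maps of shape (8, 1, 64), one per column, each later fed to its own fully-connected output layer.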

在一些实施例中,上述第二输出网络包括多个全连接层;全连接层的数量与子特征图的数量对应;识别模型训练模块还用于:将每个子特征图分别输入至对应的全连接层中,以使每个全连接层输出子特征图对应的输出矩阵。In some embodiments, the above second output network includes multiple fully-connected layers, the number of which corresponds to the number of sub-feature maps; the recognition model training module is further configured to input each sub-feature map into its corresponding fully-connected layer, so that each fully-connected layer outputs the output matrix corresponding to that sub-feature map.

在一些实施例中,上述分类函数包括Softmax函数;Softmax函数为p_t^i = e^{z_t^i} / ∑_{m=1}^{K+1} e^{z_t^m};其中,e表示自然常数;t表示第t个概率矩阵;K表示所述训练集合的目标训练文本图像所包含的不同字符的个数;m表示从1到K+1;∑表示求和运算;z_t^i为所述输出矩阵中的第i个元素;所述p_t^i为所述概率矩阵pt中的第i个元素。In some embodiments, the above classification function includes a Softmax function, given by p_t^i = e^{z_t^i} / ∑_{m=1}^{K+1} e^{z_t^m}, where e is the natural constant; t denotes the t-th probability matrix; K is the number of distinct characters contained in the target training text images of the training set; m runs from 1 to K+1; ∑ denotes summation; z_t^i is the i-th element of the output matrix; and p_t^i is the i-th element of the probability matrix p_t.
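A minimal sketch of the Softmax function over the K+1 classes of one output matrix (the function name and plain-list representation are choices of this illustration, not the embodiment):

```python
import math

def softmax(z):
    """p_t^i = e^{z_t^i} / sum_m e^{z_t^m}: normalize one output vector
    into a probability distribution over the K+1 character classes."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [v / total for v in exps]
```

The outputs are positive and sum to 1; equal logits map to equal probabilities.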

在一些实施例中,上述识别损失函数包括L=-log p(y|{pt}t=1…T);其中,y为预先标注的所述目标训练文本图像的概率矩阵;t表示第t个概率矩阵;pt为所述分类函数输出的每个所述子特征图对应的概率矩阵;T为所述概率矩阵的总数量;p表示计算概率;log表示对数运算。In some embodiments, the above-mentioned recognition loss function includes L=-log p(y|{p t } t=1...T ); wherein, y is the pre-labeled probability matrix of the target training text image; t represents the first t probability matrices; p t is the probability matrix corresponding to each of the sub-feature maps output by the classification function; T is the total number of the probability matrices; p represents the calculation probability; log represents the logarithmic operation.

在一些实施例中,上述识别模型训练模块还用于:根据第二损失值更新第二初始模型中的参数;判断更新后的各个参数是否均收敛;如果更新后的各个参数均收敛,将参数更新后的第二初始模型确定为文本识别模型;如果更新后的各个参数没有均收敛,继续执行基于预设的训练集合确定目标训练文本图像的步骤,直至更新后的各个参数均收敛。In some embodiments, the above recognition model training module is further configured to: update the parameters in the second initial model according to the second loss value; judge whether the updated parameters have all converged; if so, determine the second initial model with the updated parameters as the text recognition model; if not, continue to perform the step of determining a target training text image based on the preset training set until the updated parameters all converge.

在一些实施例中,上述识别模型训练模块还用于:按照预设规则,从第二初始模型确定待更新参数;计算第二损失值对待更新参数的导数∂L′/∂W′;其中,L′为概率矩阵的损失值;W′为待更新参数;更新待更新参数,得到更新后的待更新参数W′* = W′ - α′·∂L′/∂W′;其中,α′为预设系数。In some embodiments, the above recognition model training module is further configured to: determine the parameters to be updated from the second initial model according to a preset rule; calculate the derivative ∂L′/∂W′ of the second loss value with respect to a parameter to be updated, where L′ is the loss value of the probability matrices and W′ is the parameter to be updated; and update the parameter to obtain the updated parameter W′* = W′ - α′·∂L′/∂W′, where α′ is a preset coefficient.

在一些实施例中,上述文本区域的识别结果包括文本区域对应的多个概率矩阵;文本内容确定模块还用于:确定每个概率矩阵中的最大概率值的位置;从预先设置的概率矩阵中各个位置与字符的对应关系中,获取最大概率值的位置对应的字符;按照多个概率矩阵的排列顺序,排列获取到的字符;根据排列后的字符确定文本区域中的文本内容。In some embodiments, the recognition result of the above text region includes multiple probability matrices corresponding to the text region; the text content determination module is further configured to: determine the position of the maximum probability value in each probability matrix; obtain the character corresponding to that position from a preset correspondence between positions in the probability matrix and characters; arrange the obtained characters in the order of the multiple probability matrices; and determine the text content in the text region according to the arranged characters.

在一些实施例中,上述文本内容确定模块还用于:按照预设规则,删除排列后的字符中的重复字符和空字符,得到文本区域中的文本内容。In some embodiments, the above-mentioned text content determination module is further configured to: delete repeated characters and null characters in the arranged characters according to a preset rule to obtain the text content in the text area.
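The argmax lookup of the preceding embodiment combined with the removal of repeated and blank characters described here resembles CTC-style greedy decoding; a sketch, where the character set, the blank symbol, and the collapse rule (drop a character equal to its immediate predecessor, then drop blanks) are assumptions of this illustration:

```python
def decode(prob_matrices, charset, blank="-"):
    """Take the argmax character of each probability matrix in order, then
    collapse consecutive repeats and drop the blank symbol."""
    raw = [charset[max(range(len(p)), key=p.__getitem__)] for p in prob_matrices]
    out, prev = [], None
    for ch in raw:
        if ch != prev and ch != blank:  # skip repeats and blanks
            out.append(ch)
        prev = ch
    return "".join(out)
```

With charset ["a", "b", "-"], the argmax sequence a, a, -, b decodes to "ab", while a, -, a decodes to "aa": the blank separates genuinely repeated characters.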

在一些实施例中,上述装置还包括:信息获取模块,用于如果图像中包含有多个文本区域,获取每个文本区域中的文本内容;敏感信息确定模块,用于通过预先建立的敏感词库确定图像对应的文本内容中是否包含有敏感信息。In some embodiments, the above apparatus further includes: an information acquisition module, configured to acquire the text content of each text region if the image contains multiple text regions; and a sensitive information determination module, configured to determine, through a pre-established sensitive-word lexicon, whether the text content corresponding to the image contains sensitive information.

在一些实施例中,上述敏感信息确定模块还用于:对获取到的文本内容进行分词操作;逐一将分词操作后得到的分词与预先建立的敏感词库进行匹配;如果至少一个分词匹配成功,确定图像对应的文本内容中包含有敏感信息。In some embodiments, the above sensitive information determination module is further configured to: perform a word segmentation operation on the acquired text content; match the segmented words obtained from the operation, one by one, against the pre-established sensitive-word lexicon; and if at least one word is successfully matched, determine that the text content corresponding to the image contains sensitive information.

在一些实施例中,上述装置还包括:区域标识模块,用于获取匹配成功的分词所属的文本区域,在图像中标识出获取到的文本区域。In some embodiments, the above-mentioned apparatus further includes: a region identification module, configured to acquire the text region to which the successfully matched word segmentation belongs, and identify the acquired text region in the image.

本发明实施例所提供的装置,其实现原理及产生的技术效果和前述方法实施例相同,为简要描述,装置实施例部分未提及之处,可参考前述方法实施例中相应内容。The implementation principle and technical effects of the device provided by the embodiment of the present invention are the same as those of the foregoing method embodiment. For brief description, for the parts not mentioned in the device embodiment, reference may be made to the corresponding content in the foregoing method embodiment.

本发明实施例还提供了一种电子设备,参见图14所示,该电子设备包括存储器100和处理器101,其中,存储器100用于存储一条或多条计算机指令,一条或多条计算机指令被处理器101执行,以实现上述文本检测模型训练方法,文本区域确定方法,或者文本内容确定方法的步骤。An embodiment of the present invention further provides an electronic device. As shown in FIG. 14, the electronic device includes a memory 100 and a processor 101, where the memory 100 is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor 101 to implement the steps of the above text detection model training method, text region determination method, or text content determination method.

进一步地,图14所示的电子设备还包括总线102和通信接口103,处理器101、通信接口103和存储器100通过总线102连接。Further, the electronic device shown in FIG. 14 further includes a bus 102 and a communication interface 103 , and the processor 101 , the communication interface 103 and the memory 100 are connected through the bus 102 .

其中,存储器100可能包含高速随机存取存储器(RAM,Random Access Memory),也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。通过至少一个通信接口103(可以是有线或者无线)实现该系统网元与至少一个其他网元之间的通信连接,可以使用互联网,广域网,本地网,城域网等。总线102可以是ISA总线、PCI总线或EISA总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图14中仅用一个双向箭头表示,但并不表示仅有一根总线或一种类型的总线。The memory 100 may include a high-speed Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system network element and at least one other network element is implemented through at least one communication interface 103 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one bidirectional arrow is shown in FIG. 14, but this does not mean that there is only one bus or one type of bus.

处理器101可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器101中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器101可以是通用处理器,包括中央处理器(Central Processing Unit,简称CPU)、网络处理器(Network Processor,简称NP)等;还可以是数字信号处理器(Digital Signal Processing,简称DSP)、专用集成电路(Application Specific Integrated Circuit,简称ASIC)、现成可编程门阵列(Field-Programmable Gate Array,简称FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本发明实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本发明实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器100,处理器101读取存储器100中的信息,结合其硬件完成前述实施例的方法的步骤。The processor 101 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above methods may be completed by an integrated logic circuit of hardware in the processor 101 or by instructions in the form of software. The above processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like. The steps of the methods disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100, and completes the steps of the methods in the foregoing embodiments in combination with its hardware.

本发明实施例还提供了一种机器可读存储介质,该机器可读存储介质存储有机器可执行指令,该机器可执行指令在被处理器调用和执行时,机器可执行指令促使处理器实现上述文本检测模型训练方法,文本区域确定方法,或者文本内容确定方法的步骤,具体实现可参见方法实施例,在此不再赘述。Embodiments of the present invention further provide a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are invoked and executed by a processor, the machine-executable instructions cause the processor to implement The steps of the above-mentioned text detection model training method, text area determination method, or text content determination method can be found in the method embodiments, which will not be repeated here.

本发明实施例所提供的文本检测模型训练方法、文本区域、内容确定方法、装置和电子设备的计算机程序产品,包括存储了程序代码的计算机可读存储介质,所述程序代码包括的指令可用于执行前面方法实施例中所述的方法,具体实现可参见方法实施例,在此不再赘述。The text detection model training method, text area, content determination method, apparatus, and computer program product of an electronic device provided by the embodiments of the present invention include a computer-readable storage medium storing program codes, and the instructions included in the program codes can be used for The methods described in the foregoing method embodiments are executed. For specific implementation, reference may be made to the method embodiments, which will not be repeated here.

所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention can be embodied in the form of a software product in essence, or the part that contributes to the prior art or the part of the technical solution. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes .

最后应说明的是:以上所述实施例,仅为本发明的具体实施方式,用以说明本发明的技术方案,而非对其限制,本发明的保护范围并不局限于此,尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,其依然可以对前述实施例所记载的技术方案进行修改或可轻易想到变化,或者对其中部分技术特征进行等同替换;而这些修改、变化或者替换,并不使相应技术方案的本质脱离本发明实施例技术方案的精神和范围,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。Finally, it should be noted that the above embodiments are only specific implementations of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art can still, within the technical scope disclosed by the present invention, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (52)

1. A text detection model training method, characterized in that the method comprises:
determining a target training image based on a preset training set;
inputting the target training image into a first initial model; the first initial model comprises a first feature extraction network, a feature fusion network, and a first output network;
extracting a plurality of initial feature maps of the target training image through the first feature extraction network; the scales of the plurality of initial feature maps are different from one another;
performing fusion processing on the plurality of initial feature maps through the feature fusion network to obtain a fused feature map;
inputting the fused feature map into the first output network, and outputting candidate regions of the text region in the target training image and a probability value of each candidate region;
determining a first loss value of the candidate regions and of the probability value of each candidate region through a preset detection loss function; training the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain a text detection model.
2. The method according to claim 1, characterized in that the first feature extraction network comprises a plurality of sequentially connected first convolutional networks; each first convolutional network comprises a sequentially connected convolutional layer, batch normalization layer, and activation function layer.
3. The method according to claim 1, characterized in that the step of fusing the plurality of initial feature maps through the feature fusion network to obtain a fused feature map comprises:
arranging the plurality of initial feature maps in order according to their scales, wherein the initial feature map at the highest level has the smallest scale and the initial feature map at the lowest level has the largest scale;
taking the initial feature map at the highest level as the fused feature map of the highest level;
for each level other than the highest level, fusing the initial feature map of the current level with the fused feature map of the level above the current level to obtain the fused feature map of the current level;
taking the fused feature map of the lowest level as the final fused feature map.
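The top-down fusion scheme of claim 3 can be sketched as follows. The element-wise addition and the 2x nearest-neighbour upsampling used to align scales are assumptions for illustration; the claim itself does not specify the fusion operation:

```python
import numpy as np

def fuse_feature_maps(initial_maps):
    """Top-down fusion as described in claim 3: maps are ordered from the
    highest level (smallest scale) to the lowest level (largest scale).
    The fused map of each level combines the current level's initial map
    with the (upsampled) fused map of the level above."""
    fused = initial_maps[0]  # highest level: its fused map is its initial map
    for current in initial_maps[1:]:
        # 2x nearest-neighbour upsampling to match the current level's scale
        upsampled = fused.repeat(2, axis=0).repeat(2, axis=1)
        fused = current + upsampled  # element-wise fusion (assumption)
    return fused  # fused map of the lowest level is the final result
```

With three all-ones maps of scales 2x2, 4x4, and 8x8, each level accumulates one more contribution, so the final 8x8 map holds the value 3 everywhere.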
4. The method according to claim 1, characterized in that the first output network comprises a first convolutional layer and a second convolutional layer;
the step of inputting the fused feature map into the first output network and outputting candidate regions of the text region in the target training image and a probability value of each candidate region comprises:
inputting the fused feature map into the first convolutional layer and the second convolutional layer respectively;
performing a first convolution operation on the fused feature map through the first convolutional layer to output a coordinate matrix, the coordinate matrix comprising vertex coordinates of the candidate regions of the text region in the target training image;
performing a second convolution operation on the fused feature map through the second convolutional layer to output a probability matrix, the probability matrix comprising the probability value of each candidate region.
5. The method according to claim 1, characterized in that the detection loss function comprises a first function and a second function;
the first function is L1 = |G* - G|, wherein G* is the coordinate matrix of the text region in the target training image annotated in advance, and G is the coordinate matrix of the candidate regions of the text region in the target training image output by the first output network;
the second function is L2 = -Y*·log Y - (1 - Y*)·log(1 - Y), wherein Y* is the probability matrix of the text region in the target training image annotated in advance, Y is the probability matrix of the candidate regions of the text region in the target training image output by the first output network, and log denotes a logarithm operation;
the first loss value of the candidate regions and the probability value of each candidate region is L = L1 + L2.
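A minimal numeric sketch of the first loss value L = L1 + L2 of claim 5: an absolute-error term on the coordinate matrices plus a binary cross-entropy term on the probability matrices. Summing over all matrix elements and clipping Y for numerical safety are assumptions not stated in the claim:

```python
import numpy as np

def detection_loss(G_star, G, Y_star, Y, eps=1e-7):
    """First loss value from claim 5.
    L1: absolute error between annotated (G*) and predicted (G) coordinates.
    L2: binary cross-entropy between annotated (Y*) and predicted (Y)
    probabilities. Element-wise sums are an assumption; the claim only
    gives the per-element form."""
    L1 = np.abs(G_star - G).sum()
    Y = np.clip(Y, eps, 1 - eps)  # numerical safety, not in the claim
    L2 = (-Y_star * np.log(Y) - (1 - Y_star) * np.log(1 - Y)).sum()
    return L1 + L2
```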
6. The method according to claim 1, characterized in that the step of training the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain a text detection model, comprises:
updating the parameters in the first initial model according to the first loss value;
judging whether the updated parameters converge;
if the updated parameters converge, taking the first initial model with the updated parameters as the detection model;
if the updated parameters do not converge, continuing to perform the step of determining a target training image based on the preset training set, until the updated parameters converge.
7. The method according to claim 6, characterized in that the step of updating the parameters in the first initial model according to the first loss value comprises:
determining a parameter to be updated from the first initial model according to a preset rule;
computing the derivative ∂L/∂w of the first loss value with respect to the parameter to be updated in the first initial model, wherein L is the first loss value and w is the parameter to be updated;
updating the parameter to be updated to obtain the updated parameter w - α·∂L/∂w, wherein α is a preset coefficient.
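The update rule of claim 7 is ordinary gradient descent. A minimal sketch follows, with a numeric central-difference derivative standing in for the backpropagation a real model would use:

```python
def numeric_grad(loss_fn, w, h=1e-6):
    # central-difference approximation of dL/dw for a scalar parameter
    return (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)

def sgd_step(loss_fn, w, alpha=0.1):
    # claim 7: w_updated = w - alpha * dL/dw, with alpha a preset coefficient
    return w - alpha * numeric_grad(loss_fn, w)
```

Iterating this step on a toy quadratic loss drives the parameter toward the loss minimum, which is the convergence condition claims 6 and 7 describe.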
8. A text region determining method, characterized in that the method comprises:
obtaining an image to be detected;
inputting the image to be detected into a text detection model trained in advance, and outputting a plurality of candidate regions of the text region in the image to be detected and a probability value of each candidate region, the text detection model being trained by the text detection model training method according to any one of claims 1-7;
determining the text region in the image to be detected from the plurality of candidate regions according to the probability value of each candidate region and the degree of overlap among the plurality of candidate regions.
9. The method according to claim 8, characterized in that the step of determining the text region in the image to be detected from the plurality of candidate regions according to the probability value of each candidate region and the degree of overlap among the plurality of candidate regions comprises:
arranging the plurality of candidate regions in order according to their probability values, wherein the first candidate region has the largest probability value and the last candidate region has the smallest probability value;
taking the first candidate region as the current candidate region, and computing one by one the degree of overlap between the current candidate region and each candidate region other than the current candidate region;
discarding, among the candidate regions other than the current candidate region, those whose degree of overlap exceeds a preset overlap threshold;
taking the candidate region next to the current candidate region as the new current candidate region, and continuing to perform the step of computing one by one the degree of overlap between the current candidate region and the candidate regions other than the current candidate region, until the last candidate region is reached;
taking the candidate regions remaining after the discarding as the text region in the image to be detected.
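The procedure of claim 9 is standard non-maximum suppression. A sketch follows, using intersection-over-union as the overlap degree; the claim does not fix the overlap measure, so IoU is an assumption:

```python
def iou(a, b):
    """Overlap degree between two boxes (x1, y1, x2, y2); IoU is one
    common choice, not mandated by the claim."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, overlap_threshold=0.5):
    """Claim 9: sort candidates by probability, keep the best, discard any
    remaining candidate whose overlap with it exceeds the preset threshold,
    then repeat with the next surviving candidate."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        current = order.pop(0)
        keep.append(current)
        order = [i for i in order
                 if iou(boxes[current], boxes[i]) <= overlap_threshold]
    return [boxes[i] for i in keep]
```

Two heavily overlapping candidates collapse to the higher-scoring one, while a disjoint candidate survives.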
10. The method according to claim 9, characterized in that before the step of arranging the plurality of candidate regions in order according to their probability values, the method further comprises:
discarding, among the plurality of candidate regions, those whose probability value is lower than a preset probability threshold, to obtain the final plurality of candidate regions.
11. A text content determining method, characterized in that the method comprises:
obtaining the text region in an image by the text region determining method according to any one of claims 8-10;
inputting the text region into a text recognition model trained in advance, and outputting a recognition result of the text region;
determining the text content in the text region according to the recognition result.
12. The method according to claim 11, characterized in that before the step of inputting the text region into the text recognition model trained in advance, the method further comprises: normalizing the text region according to a preset size.
13. The method according to claim 11, characterized in that the text recognition model is trained in the following manner:
determining a target training text image based on a preset training set;
inputting the target training text image into a second initial model, the second initial model comprising a second feature extraction network, a feature splitting network, a second output network, and a classification function;
extracting a feature map of the target training text image through the second feature extraction network;
splitting the feature map into at least one sub-feature map through the feature splitting network;
inputting each sub-feature map into the second output network, and outputting an output matrix corresponding to each sub-feature map;
inputting the output matrix corresponding to each sub-feature map into the classification function, and outputting a probability matrix corresponding to each sub-feature map;
determining a second loss value of the probability matrices through a preset recognition loss function, and training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain a text recognition model.
14. The method according to claim 13, characterized in that the second feature extraction network comprises a plurality of sequentially connected second convolutional networks; each second convolutional network comprises a sequentially connected convolutional layer, pooling layer, and activation function layer.
15. The method according to claim 13, characterized in that the step of splitting the feature map into at least one sub-feature map through the feature splitting network comprises:
splitting the feature map into at least one sub-feature map along the column direction of the feature map, the column direction of the feature map being perpendicular to the direction of the text line.
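The column-wise splitting of claim 15 can be sketched as follows; slicing into equal-width pieces is an assumption, since the claim only fixes the split direction:

```python
import numpy as np

def split_feature_map(feature_map, num_slices):
    """Claim 15: split the feature map along its column direction (the
    direction perpendicular to the text line) into sub-feature maps.
    axis=1 is the column (width) axis of a (height, width) map."""
    return np.split(feature_map, num_slices, axis=1)
```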
16. The method according to claim 13, characterized in that the second output network comprises a plurality of fully connected layers, the number of fully connected layers corresponding to the number of sub-feature maps;
the step of inputting each sub-feature map into the second output network and outputting the output matrix corresponding to each sub-feature map comprises: inputting each sub-feature map into its corresponding fully connected layer, so that each fully connected layer outputs the output matrix corresponding to that sub-feature map.
17. The method according to claim 13, characterized in that the classification function comprises a Softmax function;
the Softmax function is p_i^t = e^(x_i^t) / Σ_{m=1}^{K+1} e^(x_m^t), wherein e denotes the natural constant, t denotes the t-th probability matrix, K denotes the number of distinct characters contained in the target training text images of the training set, m ranges from 1 to K+1, Σ denotes a summation operation, x_i^t is the i-th element of the output matrix, and p_i^t is the i-th element of the probability matrix p_t.
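The Softmax function of claim 17, applied to a single output matrix (a vector of K+1 scores). Subtracting the maximum before exponentiation is a standard numerical-stability step, not part of the claim:

```python
import numpy as np

def softmax(x):
    """p_i = e^{x_i} / sum_{m=1}^{K+1} e^{x_m}, per claim 17.
    The max subtraction leaves the result unchanged mathematically
    but avoids overflow for large scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()
```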
18. The method according to claim 13, characterized in that the recognition loss function comprises L = -log p(y | {p_t}_{t=1,...,T}), wherein y is the probability matrix of the target training text image annotated in advance, t denotes the t-th probability matrix, p_t is the probability matrix corresponding to each sub-feature map output by the classification function, T is the total number of probability matrices, p denotes computing a probability, and log denotes a logarithm operation.
19. The method according to claim 13, characterized in that the step of training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain a text recognition model, comprises:
updating the parameters in the second initial model according to the second loss value;
judging whether the updated parameters converge;
if the updated parameters converge, taking the second initial model with the updated parameters as the text recognition model;
if the updated parameters do not converge, continuing to perform the step of determining a target training text image based on the preset training set, until the updated parameters converge.
20. The method according to claim 19, characterized in that the step of updating the parameters in the second initial model according to the second loss value comprises:
determining a parameter to be updated from the second initial model according to a preset rule;
computing the derivative ∂L'/∂w' of the second loss value with respect to the parameter to be updated, wherein L' is the second loss value of the probability matrices and w' is the parameter to be updated;
updating the parameter to be updated to obtain the updated parameter w' - α'·∂L'/∂w', wherein α' is a preset coefficient.
21. The method according to claim 11, characterized in that the recognition result of the text region comprises a plurality of probability matrices corresponding to the text region;
the step of determining the text content in the text region according to the recognition result comprises:
determining the position of the largest probability value in each probability matrix;
obtaining, from a preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the largest probability value;
arranging the obtained characters according to the order of the plurality of probability matrices;
determining the text content in the text region according to the arranged characters.
22. The method according to claim 21, characterized in that the step of determining the text content in the text region according to the arranged characters comprises:
deleting, according to a preset rule, the repeated characters and null characters from the arranged characters, to obtain the text content in the text region.
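Claims 21 and 22 together describe a greedy best-path decoding with repeat and null-character removal (CTC-style). A sketch, assuming index 0 of the character correspondence is the null (blank) character:

```python
def decode(prob_matrices, charset, blank=0):
    """Claims 21-22: take the position of the largest value in each
    probability matrix, map positions to characters via the preset
    correspondence (charset), then delete consecutive repeats and the
    null character. Index 0 as blank is an illustrative assumption."""
    best = [max(range(len(p)), key=p.__getitem__) for p in prob_matrices]
    chars, prev = [], None
    for idx in best:
        if idx != blank and idx != prev:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)
```

Note that a blank between two identical best positions separates genuinely repeated characters, so "a", blank, "a" decodes to "aa" rather than collapsing to "a".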
23. The method according to claim 11, characterized in that after the step of determining the text content in the text region according to the recognition result, the method further comprises:
if the image contains a plurality of text regions, obtaining the text content in each text region;
determining, through a pre-built sensitive dictionary, whether the text content corresponding to the image contains sensitive information.
24. The method according to claim 23, characterized in that the step of determining, through the pre-built sensitive dictionary, whether the text content corresponding to the image contains sensitive information comprises:
performing a word segmentation operation on the obtained text content;
matching, one by one, the words obtained by the word segmentation operation against the pre-built sensitive dictionary;
if at least one word matches successfully, determining that the text content corresponding to the image contains sensitive information.
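The matching step of claim 24 can be sketched as follows. Whitespace tokenization stands in for a real word segmenter, which the claim does not name:

```python
def contains_sensitive(text_content, sensitive_words, tokenize=None):
    """Claim 24: segment the recognized text into words and match each
    word against the pre-built sensitive dictionary; a single successful
    match is enough to flag the content."""
    tokenize = tokenize or (lambda s: s.split())  # stand-in segmenter
    matches = [w for w in tokenize(text_content) if w in sensitive_words]
    return (len(matches) > 0, matches)  # matched words support claim 25
```

Returning the matched words as well as the flag supports the follow-up step of claim 25, where the matched word or its text region is marked in the image.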
25. The method according to claim 24, characterized in that after determining that the text content corresponding to the image contains sensitive information, the method further comprises:
obtaining the text region to which the successfully matched word belongs, and marking, in the image, the obtained text region or the successfully matched word.
26. A text detection model training apparatus, characterized in that the apparatus comprises:
a training image determining module, configured to determine a target training image based on a preset training set;
a training image input module, configured to input the target training image into a first initial model, the first initial model comprising a first feature extraction network, a feature fusion network, and a first output network;
a feature extraction module, configured to extract a plurality of initial feature maps of the target training image through the first feature extraction network, the plurality of initial feature maps differing in scale;
a feature fusion module, configured to fuse the plurality of initial feature maps through the feature fusion network to obtain a fused feature map;
an output module, configured to input the fused feature map into the first output network and output candidate regions of the text region in the target training image and a probability value of each candidate region;
a loss determination and training module, configured to determine a first loss value for the candidate regions and the probability value of each candidate region through a preset detection loss function, and to train the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain a text detection model.
27. The apparatus according to claim 26, characterized in that the first feature extraction network comprises a plurality of sequentially connected first convolutional networks; each first convolutional network comprises a sequentially connected convolutional layer, batch normalization layer, and activation function layer.
28. The apparatus according to claim 26, characterized in that the feature fusion module is further configured to:
arrange the plurality of initial feature maps in order according to their scales, wherein the initial feature map at the highest level has the smallest scale and the initial feature map at the lowest level has the largest scale;
take the initial feature map at the highest level as the fused feature map of the highest level;
for each level other than the highest level, fuse the initial feature map of the current level with the fused feature map of the level above the current level to obtain the fused feature map of the current level;
take the fused feature map of the lowest level as the final fused feature map.
29. The apparatus according to claim 26, characterized in that the first output network comprises a first convolutional layer and a second convolutional layer;
the output module is further configured to:
input the fused feature map into the first convolutional layer and the second convolutional layer respectively;
perform a first convolution operation on the fused feature map through the first convolutional layer to output a coordinate matrix, the coordinate matrix comprising vertex coordinates of the candidate regions of the text region in the target training image;
perform a second convolution operation on the fused feature map through the second convolutional layer to output a probability matrix, the probability matrix comprising the probability value of each candidate region.
30. The apparatus according to claim 26, characterized in that the detection loss function comprises a first function and a second function;
the first function is L1 = |G* - G|, wherein G* is the coordinate matrix of the text region in the target training image annotated in advance, and G is the coordinate matrix of the candidate regions of the text region in the target training image output by the first output network;
the second function is L2 = -Y*·log Y - (1 - Y*)·log(1 - Y), wherein Y* is the probability matrix of the text region in the target training image annotated in advance, Y is the probability matrix of the candidate regions of the text region in the target training image output by the first output network, and log denotes a logarithm operation;
the first loss value of the candidate regions and the probability value of each candidate region is L = L1 + L2.
31. The apparatus according to claim 26, characterized in that the loss determination and training module is further configured to:
update the parameters in the first initial model according to the first loss value;
judge whether the updated parameters converge;
if the updated parameters converge, take the first initial model with the updated parameters as the detection model;
if the updated parameters do not converge, continue to perform the step of determining a target training image based on the preset training set, until the updated parameters converge.
32. The apparatus according to claim 31, characterized in that the loss determination and training module is further configured to:
determine a parameter to be updated from the first initial model according to a preset rule;
compute the derivative ∂L/∂w of the first loss value with respect to the parameter to be updated in the first initial model, wherein L is the first loss value and w is the parameter to be updated;
update the parameter to be updated to obtain the updated parameter w - α·∂L/∂w, wherein α is a preset coefficient.
33. A text region determining apparatus, characterized in that the apparatus comprises:
an image obtaining module, configured to obtain an image to be detected;
a detection module, configured to input the image to be detected into a text detection model trained in advance and output a plurality of candidate regions of the text region in the image to be detected and a probability value of each candidate region, the text detection model being trained by the text detection model training method according to any one of claims 1-7;
a text region determining module, configured to determine the text region in the image to be detected from the plurality of candidate regions according to the probability value of each candidate region and the degree of overlap among the plurality of candidate regions.
34. The apparatus according to claim 33, characterized in that the text region determining module is further configured to:
arrange the plurality of candidate regions in order according to their probability values, wherein the first candidate region has the largest probability value and the last candidate region has the smallest probability value;
take the first candidate region as the current candidate region, and compute one by one the degree of overlap between the current candidate region and each candidate region other than the current candidate region;
discard, among the candidate regions other than the current candidate region, those whose degree of overlap exceeds a preset overlap threshold;
take the candidate region next to the current candidate region as the new current candidate region, and continue to perform the step of computing one by one the degree of overlap between the current candidate region and the candidate regions other than the current candidate region, until the last candidate region is reached;
take the candidate regions remaining after the discarding as the text region in the image to be detected.
35. The apparatus according to claim 34, characterized in that the apparatus further comprises: a region discarding module, configured to discard, among the plurality of candidate regions, those whose probability value is lower than a preset probability threshold, to obtain the final plurality of candidate regions.
36. A text content determining apparatus, characterized in that the apparatus comprises:
a region obtaining module, configured to obtain the text region in an image by the text region determining method according to any one of claims 8-10;
a recognition module, configured to input the text region into a text recognition model trained in advance and output a recognition result of the text region;
a text content determining module, configured to determine the text content in the text region according to the recognition result.
37. The apparatus according to claim 36, characterized in that the apparatus further comprises: a normalization module, configured to normalize the text region according to a preset size.
38. The apparatus according to claim 36, characterized in that the apparatus further comprises a text recognition model training module, configured to train the text recognition model in the following manner:
determining a target training text image based on a preset training set;
inputting the target training text image into a second initial model, the second initial model comprising a second feature extraction network, a second output network, and a classification function;
extracting a feature map of the target training text image through the second feature extraction network;
splitting the feature map into at least one sub-feature map through the second initial model;
inputting each sub-feature map into the second output network, and outputting an output matrix corresponding to each sub-feature map;
inputting the output matrix corresponding to each sub-feature map into the classification function, and outputting a probability matrix corresponding to each sub-feature map;
determining a second loss value of the probability matrices through a preset recognition loss function, and training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain a text recognition model.
39. The apparatus according to claim 38, characterized in that the second feature extraction network comprises a plurality of sequentially connected second convolutional networks; each second convolutional network comprises a sequentially connected convolutional layer, pooling layer, and activation function layer.
40. The apparatus according to claim 38, characterized in that the recognition model training module is further configured to:
split the feature map into at least one sub-feature map along the column direction of the feature map, the column direction of the feature map being perpendicular to the direction of the text line.
41. The apparatus according to claim 38, characterized in that the second output network comprises a plurality of fully connected layers, the number of fully connected layers corresponding to the number of sub-feature maps;
the recognition model training module is further configured to: input each sub-feature map into its corresponding fully connected layer, so that each fully connected layer outputs the output matrix corresponding to that sub-feature map.
42. The apparatus according to claim 38, characterized in that the classification function comprises a Softmax function;
the Softmax function is p_i^t = e^(x_i^t) / Σ_{m=1}^{K+1} e^(x_m^t), wherein e denotes the natural constant, t denotes the t-th probability matrix, K denotes the number of distinct characters contained in the target training text images of the training set, m ranges from 1 to K+1, Σ denotes a summation operation, x_i^t is the i-th element of the output matrix, and p_i^t is the i-th element of the probability matrix p_t.
43. The apparatus according to claim 38, characterized in that the recognition loss function comprises L = -log p(y | {p_t}_{t=1,...,T}), wherein y is the probability matrix of the target training text image annotated in advance, t denotes the t-th probability matrix, p_t is the probability matrix corresponding to each sub-feature map output by the classification function, T is the total number of probability matrices, p denotes computing a probability, and log denotes a logarithm operation.
44. The apparatus according to claim 38, characterized in that the recognition model training module is further configured to:
update the parameters in the second initial model according to the second loss value;
judge whether the updated parameters converge;
if the updated parameters converge, take the second initial model with the updated parameters as the text recognition model;
if the updated parameters do not converge, continue to perform the step of determining a target training text image based on the preset training set, until the updated parameters converge.
45. The apparatus according to claim 44, characterized in that the recognition model training module is further configured to:
determine a parameter to be updated from the second initial model according to a preset rule;
compute the derivative ∂L'/∂w' of the second loss value with respect to the parameter to be updated, wherein L' is the second loss value of the probability matrices and w' is the parameter to be updated;
update the parameter to be updated to obtain the updated parameter w' - α'·∂L'/∂w', wherein α' is a preset coefficient.
46. The apparatus according to claim 36, characterized in that the recognition result of the text region comprises a plurality of probability matrices corresponding to the text region;
the text content determining module is further configured to:
determine the position of the largest probability value in each probability matrix;
obtain, from a preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the largest probability value;
arrange the obtained characters according to the order of the plurality of probability matrices;
determine the text content in the text region according to the arranged characters.
47. The apparatus according to claim 46, characterized in that the text content determining module is further configured to:
delete, according to a preset rule, the repeated characters and null characters from the arranged characters, to obtain the text content in the text region.
48. The apparatus according to claim 36, characterized in that the apparatus further comprises:
an information obtaining module, configured to obtain, if the image contains a plurality of text regions, the text content in each text region;
a sensitive information determining module, configured to determine, through a pre-built sensitive dictionary, whether the text content corresponding to the image contains sensitive information.
49. The apparatus according to claim 48, characterized in that the sensitive information determining module is further configured to:
perform a word segmentation operation on the obtained text content;
match, one by one, the words obtained by the word segmentation operation against the pre-built sensitive dictionary;
if at least one word matches successfully, determine that the text content corresponding to the image contains sensitive information.
50. The apparatus according to claim 49, characterized in that the apparatus further comprises:
a region marking module, configured to obtain the text region to which the successfully matched word belongs, and mark the obtained text region in the image.
51. An electronic device, characterized by comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor executing the machine-executable instructions to implement the steps of the text detection model training method according to any one of claims 1 to 7, the text region determining method according to any one of claims 8 to 10, or the text content determining method according to any one of claims 11 to 25.
52. A machine-readable storage medium, characterized in that the machine-readable storage medium stores machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the steps of the text detection model training method according to any one of claims 1 to 7, the text region determining method according to any one of claims 8 to 10, or the text content determining method according to any one of claims 11 to 25.
CN201910367675.2A 2019-04-30 2019-04-30 Text detection model training method, text region determination method, text content determination method, and apparatus Pending CN110110715A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910367675.2A CN110110715A (en) 2019-04-30 2019-04-30 Text detection model training method, text region determination method, text content determination method, and apparatus
PCT/CN2020/087809 WO2020221298A1 (en) 2019-04-30 2020-04-29 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910367675.2A CN110110715A (en) 2019-04-30 2019-04-30 Text detection model training method, text region determination method, text content determination method, and apparatus

Publications (1)

Publication Number Publication Date
CN110110715A true CN110110715A (en) 2019-08-09

Family

ID=67488106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910367675.2A Pending CN110110715A (en) 2019-04-30 2019-04-30 Text detection model training method, text region determination method, text content determination method, and apparatus

Country Status (2)

Country Link
CN (1) CN110110715A (en)
WO (1) WO2020221298A1 (en)

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610166A (en) * 2019-09-18 2019-12-24 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN110674804A (en) * 2019-09-24 2020-01-10 上海眼控科技股份有限公司 Text image detection method and device, computer equipment and storage medium
CN110705460A (en) * 2019-09-29 2020-01-17 北京百度网讯科技有限公司 Image category identification method and device
CN110751146A (en) * 2019-10-23 2020-02-04 北京印刷学院 Text area detection method, device, electronic terminal and computer-readable storage medium
CN110929647A (en) * 2019-11-22 2020-03-27 科大讯飞股份有限公司 Text detection method, device, equipment and storage medium
CN110942067A (en) * 2019-11-29 2020-03-31 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111062389A (en) * 2019-12-10 2020-04-24 腾讯科技(深圳)有限公司 Character recognition method, device, computer readable medium and electronic device
CN111062385A (en) * 2019-11-18 2020-04-24 上海眼控科技股份有限公司 Network model construction method and system for image text information detection
CN111104934A (en) * 2019-12-22 2020-05-05 上海眼控科技股份有限公司 Engine label detection method, electronic device and computer readable storage medium
CN111353442A (en) * 2020-03-03 2020-06-30 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111382740A (en) * 2020-03-13 2020-07-07 深圳前海环融联易信息科技服务有限公司 Text picture analysis method and device, computer equipment and storage medium
CN111784623A (en) * 2020-09-07 2020-10-16 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN112183672A (en) * 2020-11-05 2021-01-05 北京金山云网络技术有限公司 Image classification method, and training method and device of feature extraction network
CN112287763A (en) * 2020-09-27 2021-01-29 北京旷视科技有限公司 Image processing method, apparatus, device and medium
CN112541491A (en) * 2020-12-07 2021-03-23 沈阳雅译网络技术有限公司 End-to-end text detection and identification method based on image character region perception
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium
CN112686317A (en) * 2020-12-30 2021-04-20 北京迈格威科技有限公司 Neural network training method and device, electronic equipment and storage medium
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN112767431A (en) * 2021-01-12 2021-05-07 云南电网有限责任公司电力科学研究院 Power grid target detection method and device for power system
CN112801097A (en) * 2021-04-14 2021-05-14 北京世纪好未来教育科技有限公司 Training method and device of text detection model and readable storage medium
CN112818975A (en) * 2021-01-27 2021-05-18 北京金山数字娱乐科技有限公司 Text detection model training method and device and text detection method and device
CN112990181A (en) * 2021-04-30 2021-06-18 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and storage medium
CN113033593A (en) * 2019-12-25 2021-06-25 上海智臻智能网络科技股份有限公司 Text detection training method and device based on deep learning
CN113076944A (en) * 2021-03-11 2021-07-06 国家电网有限公司 Document detection and identification method based on artificial intelligence
CN113112511A (en) * 2021-04-19 2021-07-13 新东方教育科技集团有限公司 Method and device for correcting test paper, storage medium and electronic equipment
CN113205426A (en) * 2021-05-27 2021-08-03 中库(北京)数据系统有限公司 Method and device for predicting popularity level of social media content
CN113205160A (en) * 2021-07-05 2021-08-03 北京世纪好未来教育科技有限公司 Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN113221711A (en) * 2021-04-30 2021-08-06 北京金山数字娱乐科技有限公司 Information extraction method and device
CN113298156A (en) * 2021-05-28 2021-08-24 有米科技股份有限公司 Neural network training method and device for image gender classification
CN113313022A (en) * 2021-05-27 2021-08-27 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
CN113409776A (en) * 2021-06-30 2021-09-17 南京领行科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113807096A (en) * 2021-04-09 2021-12-17 京东科技控股股份有限公司 Text data processing method and device, computer equipment and storage medium
CN114005019A (en) * 2021-10-29 2022-02-01 北京有竹居网络技术有限公司 Method for identifying copied image and related equipment thereof
CN114065768A (en) * 2021-12-08 2022-02-18 马上消费金融股份有限公司 Feature fusion model training and text processing method and device
CN114120287A (en) * 2021-12-03 2022-03-01 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and storage medium
CN114239499A (en) * 2021-04-30 2022-03-25 北京金山数字娱乐科技有限公司 Recruitment information management method, system and device
CN114529891A (en) * 2020-11-05 2022-05-24 中移(苏州)软件技术有限公司 Text recognition method, and training method and device of text recognition network
CN114663594A (en) * 2022-03-25 2022-06-24 中国电信股份有限公司 Image feature point detection method, device, medium, and apparatus
CN114724144A (en) * 2022-05-16 2022-07-08 北京百度网讯科技有限公司 Text recognition method, model training method, device, equipment and medium
CN115205562A (en) * 2022-07-22 2022-10-18 四川云数赋智教育科技有限公司 Random test paper registration method based on feature points
CN115578728A (en) * 2022-10-20 2023-01-06 江苏联著实业股份有限公司 A Text Line Recognition Method and Device for Arbitrary Text Direction
CN116311320A (en) * 2023-05-22 2023-06-23 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN116630755A (en) * 2023-04-10 2023-08-22 雄安创新研究院 Method, system and storage medium for detecting text position in scene image
CN117077814A (en) * 2023-09-25 2023-11-17 北京百度网讯科技有限公司 Image retrieval model training method, image retrieval method and device

Families Citing this family (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417847A (en) * 2020-11-19 2021-02-26 湖南红网新媒体集团有限公司 News content safety monitoring method, system, device and storage medium
CN114611512A (en) * 2020-11-23 2022-06-10 华为技术有限公司 Parameter adjusting method and device
CN112434510B (en) * 2020-11-24 2024-03-29 北京字节跳动网络技术有限公司 Information processing method, device, electronic equipment and storage medium
CN112328710B (en) * 2020-11-26 2024-06-11 北京百度网讯科技有限公司 Entity information processing method, device, electronic equipment and storage medium
CN112560476B (en) * 2020-12-09 2024-10-15 科大讯飞(北京)有限公司 Text completion method, electronic equipment and storage device
CN112686812B (en) * 2020-12-10 2023-08-29 广州广电运通金融电子股份有限公司 Bank card inclination correction detection method and device, readable storage medium and terminal
CN112418209B (en) * 2020-12-15 2022-09-13 润联软件系统(深圳)有限公司 Character recognition method and device, computer equipment and storage medium
CN112580495A (en) * 2020-12-16 2021-03-30 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN112613376B (en) * 2020-12-17 2024-04-02 深圳集智数字科技有限公司 Re-identification method and device and electronic equipment
CN112541496B (en) * 2020-12-24 2023-08-22 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for extracting POI (point of interest) names
CN112734699B (en) * 2020-12-24 2024-06-14 浙江大华技术股份有限公司 Article state alarm method and device, storage medium and electronic device
CN112597918B (en) * 2020-12-25 2024-09-10 创新奇智(西安)科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112784692B (en) * 2020-12-31 2024-07-09 科大讯飞股份有限公司 Method, device, equipment and storage medium for identifying text content of image
CN112651373B (en) * 2021-01-04 2024-02-09 广联达科技股份有限公司 Method and device for identifying text information of building drawing
CN113591893B (en) * 2021-01-26 2024-06-28 腾讯医疗健康(深圳)有限公司 Image processing method and device based on artificial intelligence and computer equipment
CN113763503B (en) * 2021-01-29 2025-03-18 北京沃东天骏信息技术有限公司 Graphics generation method, device and computer readable storage medium
CN112802139A (en) * 2021-02-05 2021-05-14 歌尔股份有限公司 Image processing method and device, electronic equipment and readable storage medium
CN112861739B (en) * 2021-02-10 2022-09-09 中国科学技术大学 End-to-end text recognition method, model training method and device
CN112949653B (en) * 2021-02-23 2024-04-16 科大讯飞股份有限公司 Text recognition method, electronic equipment and storage device
CN112966690B (en) * 2021-03-03 2023-01-13 中国科学院自动化研究所 Scene character detection method based on anchor-free frame and suggestion frame
CN112966609B (en) * 2021-03-05 2023-08-11 北京百度网讯科技有限公司 Target detection method and device
CN112989844A (en) * 2021-03-10 2021-06-18 北京奇艺世纪科技有限公司 Model training and text recognition method, device, equipment and storage medium
CN112906686B (en) * 2021-03-11 2024-11-15 北京小米移动软件有限公司 Text recognition method, device, electronic device and storage medium
CN113011312A (en) * 2021-03-15 2021-06-22 中国科学技术大学 Training method of motion positioning model based on weak supervision text guidance
CN113076823B (en) * 2021-03-18 2023-12-12 深圳数联天下智能科技有限公司 Training method of age prediction model, age prediction method and related device
CN112927173B (en) * 2021-04-12 2023-04-18 平安科技(深圳)有限公司 Model compression method and device, computing equipment and storage medium
CN113139463B (en) * 2021-04-23 2022-05-13 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for training a model
CN113160196A (en) * 2021-04-28 2021-07-23 东南大学 DBNet-based wiring detection method for secondary circuit terminal block in intelligent substation
CN113205041B (en) * 2021-04-29 2023-07-28 百度在线网络技术(北京)有限公司 Structured information extraction method, device, equipment and storage medium
CN113205047B (en) * 2021-04-30 2024-05-10 平安科技(深圳)有限公司 Medicine name identification method, device, computer equipment and storage medium
CN113221718B (en) * 2021-05-06 2024-01-16 新东方教育科技集团有限公司 Formula recognition method, device, storage medium and electronic device
CN113344027B (en) * 2021-05-10 2024-04-23 北京迈格威科技有限公司 Method, device, equipment and storage medium for retrieving objects in image
CN113139625B (en) * 2021-05-18 2023-12-15 北京世纪好未来教育科技有限公司 A model training method, electronic device and storage medium thereof
CN113326887B (en) * 2021-06-16 2024-03-29 深圳思谋信息科技有限公司 Text detection method, device and computer equipment
CN113361515A (en) * 2021-06-16 2021-09-07 上海眼控科技股份有限公司 Text detection model training method, text detection method, device and equipment
CN113379500B (en) * 2021-06-21 2024-09-24 北京沃东天骏信息技术有限公司 Sequencing model training method and device, and article sequencing method and device
CN113379592B (en) * 2021-06-23 2023-09-01 北京百度网讯科技有限公司 Processing method and device for sensitive area in picture and electronic equipment
CN113343970B (en) * 2021-06-24 2024-03-08 中国平安人寿保险股份有限公司 Text image detection method, device, equipment and storage medium
CN113378832B (en) * 2021-06-25 2024-05-28 北京百度网讯科技有限公司 Text detection model training method, text prediction box method and device
CN113298079B (en) * 2021-06-28 2023-10-27 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113361524B (en) * 2021-06-29 2024-05-03 北京百度网讯科技有限公司 Image processing method and device
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113516126A (en) * 2021-07-02 2021-10-19 成都信息工程大学 An adaptive threshold scene text detection method based on attention feature fusion
CN115705678B (en) * 2021-08-09 2025-01-07 腾讯科技(深圳)有限公司 Image data processing method, computer device and medium
CN113780087B (en) * 2021-08-11 2024-04-26 同济大学 A postal package text detection method and device based on deep learning
CN113762109B (en) * 2021-08-23 2023-11-07 北京百度网讯科技有限公司 Training method of character positioning model and character positioning method
CN113780131B (en) * 2021-08-31 2024-04-12 众安在线财产保险股份有限公司 Text image orientation recognition method, text content recognition method, device and equipment
CN113469878B (en) * 2021-09-02 2021-11-12 北京世纪好未来教育科技有限公司 A text erasing method and training method, device and storage medium for its model
CN113869392A (en) * 2021-09-24 2021-12-31 北京沃东天骏信息技术有限公司 Picture analysis model training method, advertisement picture selection method and electronic equipment
CN113806589B (en) * 2021-09-29 2024-03-08 云从科技集团股份有限公司 Video clip positioning method, device and computer readable storage medium
CN113935327A (en) * 2021-10-09 2022-01-14 新华智云科技有限公司 Method and device for identifying domain entity
CN114022695A (en) * 2021-10-29 2022-02-08 北京百度网讯科技有限公司 A detection model training method, device, electronic device and storage medium
CN114419199B (en) * 2021-12-20 2023-11-07 北京百度网讯科技有限公司 Picture marking method and device, electronic equipment and storage medium
CN114022882B (en) * 2022-01-04 2022-04-12 北京世纪好未来教育科技有限公司 Text recognition model training method, text recognition device, text recognition equipment and medium
CN114549698A (en) * 2022-02-22 2022-05-27 上海云从企业发展有限公司 Text synthesis method, device and electronic device
CN114821622B (en) * 2022-03-10 2023-07-21 北京百度网讯科技有限公司 Text extraction method, text extraction model training method, device and equipment
CN114596577B (en) * 2022-03-17 2024-12-31 北京百度网讯科技有限公司 Image processing method, device, electronic device and storage medium
CN114743019B (en) * 2022-03-24 2025-01-07 国网山东省电力公司莱芜供电公司 Cross-modal target detection method and system based on multi-scale features
CN114782463A (en) * 2022-03-25 2022-07-22 珠海金山办公软件有限公司 Text detection method and device, electronic equipment and medium
CN114937267B (en) * 2022-04-20 2024-04-02 北京世纪好未来教育科技有限公司 Text recognition model training method, device and electronic device
CN115019312A (en) * 2022-06-06 2022-09-06 中邮信息科技(北京)有限公司 Character recognition method and device, electronic equipment and storage medium
CN115129918B (en) * 2022-06-13 2025-01-14 深圳市腾讯计算机系统有限公司 Data processing method, model training method, device and electronic equipment
CN114758332B (en) * 2022-06-13 2022-09-02 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114842483B (en) * 2022-06-27 2023-11-28 齐鲁工业大学 Standard file information extraction method and system based on neural network and template matching
CN114827132B (en) * 2022-06-27 2022-09-09 河北东来工程技术服务有限公司 Ship traffic file transmission control method, system, device and storage medium
CN115171110B (en) * 2022-06-30 2023-08-22 北京百度网讯科技有限公司 Text recognition method and device, equipment, medium and product
CN115601553B (en) * 2022-08-15 2023-08-18 杭州联汇科技股份有限公司 Visual model pre-training method based on multi-level picture description data
CN116226319B (en) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 A hybrid heterogeneous model training method, device, equipment and readable storage medium
CN116503517B (en) * 2023-06-27 2023-09-05 江西农业大学 Method and system for generating images from long text
CN117315702B (en) * 2023-11-28 2024-02-23 山东正云信息科技有限公司 Text detection methods, systems and media based on set prediction
CN117611580B (en) * 2024-01-18 2024-05-24 深圳市宗匠科技有限公司 Flaw detection method, flaw detection device, computer equipment and storage medium
CN117593752B (en) * 2024-01-18 2024-04-09 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment
CN118643158B (en) * 2024-08-15 2024-11-12 深圳市智慧城市科技发展集团有限公司 A data classification method, device, equipment and storage medium
CN119007224B (en) * 2024-10-15 2025-01-28 浙江老鹰半导体技术有限公司 Chip identifier identification method, device, computer equipment, medium and product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text
CN108764226A (en) * 2018-04-13 2018-11-06 顺丰科技有限公司 Image text recognition methods, device, equipment and its storage medium
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network
CN109447469A (en) * 2018-10-30 2019-03-08 阿里巴巴集团控股有限公司 A kind of Method for text detection, device and equipment
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460357B2 (en) * 2014-01-08 2016-10-04 Qualcomm Incorporated Processing text images with shadows
CN108288078B (en) * 2017-12-07 2020-09-29 腾讯科技(深圳)有限公司 Method, device and medium for recognizing characters in image
CN108764228A (en) * 2018-05-28 2018-11-06 嘉兴善索智能科技有限公司 Word object detection method in a kind of image
CN109299274B (en) * 2018-11-07 2021-12-17 南京大学 Natural scene text detection method based on full convolution neural network
CN110135248A (en) * 2019-04-03 2019-08-16 华南理工大学 A deep learning-based text detection method in natural scenes
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110110715A (en) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, text region determination method, text content determination method, and apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text
CN108764226A (en) * 2018-04-13 2018-11-06 顺丰科技有限公司 Image text recognition methods, device, equipment and its storage medium
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network
CN109447469A (en) * 2018-10-30 2019-03-08 阿里巴巴集团控股有限公司 A kind of Method for text detection, device and equipment
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MR_HEALTH: "RPN Construction and the Corresponding Loss Functions in FPN Networks", 《CSDN》 *
PENGYUAN LYU ET AL.: "Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes", 《ARXIV》 *
TSUNG-YI LIN ET AL.: "Feature Pyramid Networks for Object Detection", 《ARXIV》 *

Cited By (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN110610166B (en) * 2019-09-18 2022-06-07 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN110610166A (en) * 2019-09-18 2019-12-24 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN110674804A (en) * 2019-09-24 2020-01-10 上海眼控科技股份有限公司 Text image detection method and device, computer equipment and storage medium
CN110705460A (en) * 2019-09-29 2020-01-17 北京百度网讯科技有限公司 Image category identification method and device
CN110705460B (en) * 2019-09-29 2023-06-20 北京百度网讯科技有限公司 Image category identification method and device
CN110751146A (en) * 2019-10-23 2020-02-04 北京印刷学院 Text area detection method, device, electronic terminal and computer-readable storage medium
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN111062385A (en) * 2019-11-18 2020-04-24 上海眼控科技股份有限公司 Network model construction method and system for image text information detection
CN110929647A (en) * 2019-11-22 2020-03-27 科大讯飞股份有限公司 Text detection method, device, equipment and storage medium
CN110942067A (en) * 2019-11-29 2020-03-31 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111062389A (en) * 2019-12-10 2020-04-24 腾讯科技(深圳)有限公司 Character recognition method, device, computer readable medium and electronic device
CN111062389B (en) * 2019-12-10 2025-03-28 腾讯科技(深圳)有限公司 Text recognition method, device, computer readable medium and electronic device
CN111104934A (en) * 2019-12-22 2020-05-05 上海眼控科技股份有限公司 Engine label detection method, electronic device and computer readable storage medium
CN113033593A (en) * 2019-12-25 2021-06-25 上海智臻智能网络科技股份有限公司 Text detection training method and device based on deep learning
CN113033593B (en) * 2019-12-25 2023-09-01 上海智臻智能网络科技股份有限公司 Text detection training method and device based on deep learning
CN111353442A (en) * 2020-03-03 2020-06-30 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111382740A (en) * 2020-03-13 2020-07-07 深圳前海环融联易信息科技服务有限公司 Text picture analysis method and device, computer equipment and storage medium
CN111382740B (en) * 2020-03-13 2023-11-21 深圳前海环融联易信息科技服务有限公司 Text picture analysis method, text picture analysis device, computer equipment and storage medium
CN111784623A (en) * 2020-09-07 2020-10-16 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN112287763A (en) * 2020-09-27 2021-01-29 北京旷视科技有限公司 Image processing method, apparatus, device and medium
CN114529891A (en) * 2020-11-05 2022-05-24 中移(苏州)软件技术有限公司 Text recognition method, and training method and device of text recognition network
CN112183672A (en) * 2020-11-05 2021-01-05 北京金山云网络技术有限公司 Image classification method, and training method and device of feature extraction network
CN112541491A (en) * 2020-12-07 2021-03-23 沈阳雅译网络技术有限公司 End-to-end text detection and identification method based on image character region perception
CN112541491B (en) * 2020-12-07 2024-02-02 沈阳雅译网络技术有限公司 End-to-end text detection and recognition method based on image character region perception
CN112686317A (en) * 2020-12-30 2021-04-20 北京迈格威科技有限公司 Neural network training method and device, electronic equipment and storage medium
CN112767431B (en) * 2021-01-12 2024-04-23 云南电网有限责任公司电力科学研究院 Power grid target detection method and device for power system
CN112767431A (en) * 2021-01-12 2021-05-07 云南电网有限责任公司电力科学研究院 Power grid target detection method and device for power system
CN112818975A (en) * 2021-01-27 2021-05-18 北京金山数字娱乐科技有限公司 Text detection model training method and device and text detection method and device
CN112818975B (en) * 2021-01-27 2024-09-24 北京金山数字娱乐科技有限公司 Text detection model training method and device, text detection method and device
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium
CN113076944A (en) * 2021-03-11 2021-07-06 国家电网有限公司 Document detection and identification method based on artificial intelligence
CN113807096A (en) * 2021-04-09 2021-12-17 京东科技控股股份有限公司 Text data processing method and device, computer equipment and storage medium
CN112801097A (en) * 2021-04-14 2021-05-14 北京世纪好未来教育科技有限公司 Training method and device of text detection model and readable storage medium
CN113112511B (en) * 2021-04-19 2024-01-05 新东方教育科技集团有限公司 Method and device for correcting test paper, storage medium and electronic equipment
CN113112511A (en) * 2021-04-19 2021-07-13 新东方教育科技集团有限公司 Method and device for correcting test paper, storage medium and electronic equipment
CN114239499B (en) * 2021-04-30 2025-03-07 北京金山数字娱乐科技有限公司 Recruitment information management method, system and device
CN114239499A (en) * 2021-04-30 2022-03-25 北京金山数字娱乐科技有限公司 Recruitment information management method, system and device
CN113221711A (en) * 2021-04-30 2021-08-06 北京金山数字娱乐科技有限公司 Information extraction method and device
CN112990181A (en) * 2021-04-30 2021-06-18 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and storage medium
CN113313022A (en) * 2021-05-27 2021-08-27 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
CN113205426A (en) * 2021-05-27 2021-08-03 中库(北京)数据系统有限公司 Method and device for predicting popularity level of social media content
CN113313022B (en) * 2021-05-27 2023-11-10 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
CN113298156A (en) * 2021-05-28 2021-08-24 有米科技股份有限公司 Neural network training method and device for image gender classification
CN113409776A (en) * 2021-06-30 2021-09-17 南京领行科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113409776B (en) * 2021-06-30 2024-06-07 南京领行科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113205160A (en) * 2021-07-05 2021-08-03 北京世纪好未来教育科技有限公司 Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN114005019A (en) * 2021-10-29 2022-02-01 北京有竹居网络技术有限公司 Method for identifying copied image and related equipment thereof
CN114005019B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 Method for identifying flip image and related equipment thereof
CN114120287A (en) * 2021-12-03 2022-03-01 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and storage medium
CN114120287B (en) * 2021-12-03 2024-08-09 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN114065768A (en) * 2021-12-08 2022-02-18 马上消费金融股份有限公司 Feature fusion model training and text processing method and device
CN114663594A (en) * 2022-03-25 2022-06-24 中国电信股份有限公司 Image feature point detection method, device, medium, and apparatus
CN114724144A (en) * 2022-05-16 2022-07-08 北京百度网讯科技有限公司 Text recognition method, model training method, device, equipment and medium
CN114724144B (en) * 2022-05-16 2024-02-09 北京百度网讯科技有限公司 Text recognition method, training device, training equipment and training medium for model
CN115205562A (en) * 2022-07-22 2022-10-18 四川云数赋智教育科技有限公司 Random test paper registration method based on feature points
CN115578728A (en) * 2022-10-20 2023-01-06 江苏联著实业股份有限公司 A Text Line Recognition Method and Device for Arbitrary Text Direction
CN116630755B (en) * 2023-04-10 2024-04-02 雄安创新研究院 Method, system and storage medium for detecting text position in scene image
CN116630755A (en) * 2023-04-10 2023-08-22 雄安创新研究院 Method, system and storage medium for detecting text position in scene image
CN116311320A (en) * 2023-05-22 2023-06-23 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN116311320B (en) * 2023-05-22 2023-08-22 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN117077814A (en) * 2023-09-25 2023-11-17 北京百度网讯科技有限公司 Image retrieval model training method, image retrieval method and device

Also Published As

Publication number Publication date
WO2020221298A1 (en) 2020-11-05

Similar Documents

Publication Publication Date Title
CN110110715A (en) Text detection model training method, text region determination method, text content determination method, and apparatus
CN110766096B (en) Video classification method and device and electronic equipment
CN112966691A (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN105005764B (en) The multi-direction Method for text detection of natural scene
CN110826476A (en) Image detection method, device, electronic device and storage medium for identifying target object
CN111553406A (en) Target detection system, method and terminal based on improved YOLO-V3
CN111460936A (en) Remote sensing image building extraction method, system and electronic equipment based on U-Net network
CN112183672B (en) Image classification method, feature extraction network training method and device
CN112734775A (en) Image annotation, image semantic segmentation and model training method and device
CN110717534A (en) Target classification and positioning method based on network supervision
CN115457565A (en) OCR character recognition method, electronic equipment and storage medium
CN107871314B (en) Sensitive image identification method and device
CN105144239A (en) Image processing device, program, and image processing method
CN111428593A (en) Character recognition method and device, electronic equipment and storage medium
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN108319672B (en) Mobile terminal bad information filtering method and system based on cloud computing
CN112836625A (en) Face living body detection method and device and electronic equipment
CN112131944B (en) Video behavior recognition method and system
CN111815582A (en) A two-dimensional code region detection method with improved background prior and foreground prior
CN113570540A (en) Image tampering blind evidence obtaining method based on detection-segmentation architecture
CN115953612A (en) ConvNeXt-based remote sensing image vegetation classification method and device
CN112733686A (en) Target object identification method and device used in image of cloud federation
CN111401343B (en) Method for identifying attributes of people in image and training method and device for identification model
CN115131811A (en) Target recognition and model training method, device, equipment and storage medium
CN112749599B (en) Image enhancement method, device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190809