CN117371445B

CN117371445B - Information error correction method, device, computer equipment and storage medium

Info

Publication number: CN117371445B
Application number: CN202311668330.3A
Authority: CN
Inventors: 冯帆
Original assignee: Beijing Wisdom Spark Tech Co ltd
Current assignee: Beijing Wisdom Spark Tech Co ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-11-05
Anticipated expiration: 2043-12-07
Also published as: CN117371445A

Abstract

The present application relates to an information error correction method, device, computer equipment and storage medium. The method comprises: identifying the initial text information corresponding to a target image; segmenting the initial text information using a word segmentation dictionary to obtain multiple pieces of word segmentation information corresponding to the initial text information; analyzing the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity; if the current word segmentation information is determined not to meet the preset conditions, using the current word segmentation information as the text information to be corrected; matching the text information to be corrected against the similarity dictionary to obtain multiple corresponding pieces of similar text information, replacing the text information to be corrected in the initial text information with each piece of similar text information in turn, and using each result as candidate text information; and using the candidate text information with the lowest degree of confusion as the corrected text information. This method can improve the accuracy of identifying abnormal information, thereby improving the efficiency and accuracy of error correction for abnormal information.

Description

Information error correction method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of image recognition technology, and in particular to an information error correction method, device, computer equipment and storage medium.

Background Art

With the development of image recognition technology, OCR technology has emerged for converting paper documents into electronic documents. However, because current OCR technology still produces erroneous recognition results, those results need to be corrected.

The traditional method mainly treats the scattered characters left over after word segmentation as abnormal information and performs error correction on that abnormal information.

However, this traditional method judges whether information is abnormal only at the level of character composition, so it easily misjudges normal information as abnormal information. This reduces the accuracy of identifying abnormal information, which in turn reduces the efficiency and accuracy of correcting it.

Summary of the Invention

Based on this, and in response to the above technical problems, it is necessary to provide an information error correction method, device, computer equipment, computer-readable storage medium and computer program product that can efficiently and accurately identify and correct abnormal information.

In a first aspect, the present application provides an information error correction method, comprising:

acquiring a target image, and identifying initial text information corresponding to the target image;

segmenting the initial text information using a word segmentation dictionary to obtain multiple pieces of word segmentation information corresponding to the initial text information, analyzing the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity, and judging each piece of word segmentation information; if the current word segmentation information is judged not to meet the preset conditions, using the current word segmentation information as the text information to be corrected; wherein the character granularity denotes a segmentation unit at the character level, and the word granularity denotes a segmentation unit at the word level;

matching the text information to be corrected against a similarity dictionary to obtain multiple corresponding pieces of similar text information, replacing the text information to be corrected in the initial text information with each piece of similar text information in turn, and using each replaced initial text information as candidate text information;

calculating the degree of confusion corresponding to each candidate text information, and taking the candidate text information with the smallest degree of confusion as the corrected text information.

In one of the embodiments, acquiring a target image and identifying initial text information corresponding to the target image includes: based on a regular expression, matching the book title marks in the initial text information and the text information within the book title marks according to the left and right boundaries of the book title marks, and inputting the text information within the book title marks into the word segmentation dictionary.

In one of the embodiments, analyzing the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity includes: analyzing the grammatical structure and semantic information corresponding to each piece of word segmentation information based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules.

In one of the embodiments, using the current word segmentation information as the text information to be corrected if it is judged not to meet the preset conditions includes: if the current word segmentation information is a single character, analyzing the current word segmentation information based on the standard rules; if the current word segmentation information does not meet the standard rules, using the current word segmentation information as the text information to be corrected.

In one of the embodiments, matching the text information to be corrected against the similarity dictionary and replacing the text information to be corrected in the initial text information with each piece of similar text information in turn includes: if there are at least two pieces of text information to be corrected, matching each of them separately against the similarity dictionary to obtain its corresponding pieces of similar text information, combining the similar text information corresponding to the different pieces of text information to be corrected to obtain multiple pieces of similar text combination information, and replacing the corresponding text information to be corrected in the initial text information with each piece of similar text combination information in turn.

In one of the embodiments, calculating the degree of confusion corresponding to each candidate text information includes: inputting each candidate text information into a trained model, wherein the model obtains the degree of confusion of a candidate text information from the conditional probability of each character within that candidate text information; wherein the conditional probability refers to the probability of the next word appearing given the current word.

In a second aspect, the present application also provides an information error correction device, comprising:

an acquisition module, used to acquire a target image and identify initial text information corresponding to the target image;

a word segmentation module, used to segment the initial text information using a word segmentation dictionary to obtain multiple pieces of word segmentation information corresponding to the initial text information, analyze the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity, and judge each piece of word segmentation information; if the current word segmentation information is judged not to meet the preset conditions, the current word segmentation information is used as the text information to be corrected; wherein the character granularity denotes a segmentation unit at the character level, and the word granularity denotes a segmentation unit at the word level;

a matching module, used to match the text information to be corrected against a similarity dictionary to obtain multiple corresponding pieces of similar text information, replace the text information to be corrected in the initial text information with each piece of similar text information in turn, and use each replaced initial text information as candidate text information;

a calculation module, used to calculate the degree of confusion corresponding to each candidate text information, and take the candidate text information with the smallest degree of confusion as the corrected text information.

In one of the embodiments, the matching module is also used to: if there are at least two pieces of text information to be corrected, match each of them separately against the similarity dictionary to obtain its corresponding pieces of similar text information, combine the similar text information corresponding to the different pieces of text information to be corrected to obtain multiple pieces of similar text combination information, and replace the corresponding text information to be corrected in the initial text information with each piece of similar text combination information in turn.

In a third aspect, the present application further provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the following steps are implemented:

In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the following steps are implemented:

The above information error correction method, device, computer equipment and storage medium identify the initial text information corresponding to the target image, segment the initial text information using a word segmentation dictionary to obtain the corresponding pieces of word segmentation information, and analyze the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity, so that the text information to be corrected is identified comprehensively and accurately among the word segmentation information. This improves the accuracy of identifying abnormal information, and thereby the efficiency and accuracy of error correction for abnormal information.

Brief Description of the Drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed for describing the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flow chart of an information error correction method in one embodiment;

FIG. 2 is a schematic flow chart of an information error correction method in yet another embodiment;

FIG. 3 is a structural block diagram of an information error correction device in one embodiment;

FIG. 4 is a diagram showing the internal structure of a computer device in one embodiment.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

In one embodiment, as shown in FIG. 1, an information error correction method is provided. This embodiment is illustrated with the method applied to a server. It can be understood that the method can also be applied to a terminal, or to a system including a terminal and a server, in which case it is implemented through interaction between the terminal and the server. In this embodiment, the method includes steps S102 to S108, wherein:

Step S102: acquire a target image, and identify initial text information corresponding to the target image.

Exemplarily, the text information corresponding to the target image may be recognized using OCR (Optical Character Recognition) technology.

Step S104: segment the initial text information using a word segmentation dictionary to obtain multiple pieces of word segmentation information corresponding to the initial text information, analyze the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity, and judge each piece of word segmentation information; if the current word segmentation information is judged not to meet the preset conditions, use the current word segmentation information as the text information to be corrected; wherein the character granularity denotes a segmentation unit at the character level, and the word granularity denotes a segmentation unit at the word level.

Here, the word segmentation dictionary refers to a dictionary containing Chinese words and their corresponding information, and is used for the task of segmenting Chinese text into words.

Exemplarily, word information corresponding to the words in the initial text information is retrieved and matched in the word segmentation dictionary, and the initial text information is split according to the matched word information, so that the pieces of word segmentation information obtained after splitting correspond to the word information matched in the word segmentation dictionary.

Exemplarily, the grammatical structure and semantic information corresponding to each piece of word segmentation information are analyzed at the character level and at the word level respectively. If a piece of word segmentation information does not meet the preset conditions, that is, its grammatical structure and semantic information do not match the grammatical structure and semantic information of the word information retrieved in the word segmentation dictionary, that piece of word segmentation information is used as the text information to be corrected; in other words, it is judged to be the result of a recognition error made while recognizing the initial text information from the image.

Optionally, if every piece of word segmentation information meets the preset conditions, the word segmentation information and the corresponding initial text information are normal information that does not require error correction.
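The dictionary-based splitting described in step S104 can be illustrated with a minimal sketch. The patent does not name a specific segmentation algorithm, so forward maximum matching is used here as an assumption; the function name `segment`, the parameter `max_len` and the toy dictionary are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching (an assumed algorithm, not stated in the source).

    At each position the longest dictionary word is taken; a character that
    starts no dictionary word is emitted as a single-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one would also carry per-word information.
TOY_DICTIONARY = {"图像", "识别", "技术"}

clean_tokens = segment("图像识别技术", TOY_DICTIONARY)
noisy_tokens = segment("图像识别技木", TOY_DICTIONARY)  # "术" misread as "木"
```

In this sketch the misrecognized text falls apart into the single characters "技" and "木", which is the kind of leftover that the later single-character check examines.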

Step S106: match the text information to be corrected against the similarity dictionary to obtain multiple corresponding pieces of similar text information, replace the text information to be corrected in the initial text information with each piece of similar text information in turn, and use each replaced initial text information as candidate text information.

Here, the similarity dictionary refers to a dictionary containing Chinese characters with similar shapes, structures or spellings, and is used for the task of distinguishing and looking up easily confused Chinese glyphs.

Exemplarily, similar text information that corresponds to the text information to be corrected in shape or structure is retrieved and matched in the similarity dictionary; in the initial text information, the text information to be corrected is replaced with that similar text information, and the replaced initial text information is used as candidate text information. If multiple pieces of similar text information are matched, the text information to be corrected is replaced with each of them in turn, and the resulting replaced initial text information is used as separate candidate text information.

Exemplarily, the initial text information containing the text information to be corrected can also be used directly as candidate text information; that is, the text information to be corrected is regarded as having matched an identical character in the similarity dictionary, and that identical character replaces the text information to be corrected in the initial text information.

Optionally, if no similar text information corresponding to the text information to be corrected in shape or structure is retrieved and matched in the similarity dictionary, the initial text information containing the text information to be corrected is used directly as the candidate text information.
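As a rough illustration of step S106, the sketch below builds candidate texts for a single piece of text to be corrected. The dictionary layout (a mapping from a character to a list of look-alike characters), the function name `build_candidates` and the sample entries are assumptions made for illustration only.

```python
def build_candidates(initial_text, suspect, start, similarity_dict):
    """Return one candidate text per replacement of `suspect` at index `start`.

    The unchanged initial text is always kept as the first candidate, which
    also covers the case where the similarity dictionary has no entry.
    """
    look_alikes = [c for c in similarity_dict.get(suspect, []) if c != suspect]
    replacements = [suspect] + look_alikes
    end = start + len(suspect)
    return [initial_text[:start] + r + initial_text[end:] for r in replacements]


# Hypothetical similarity dictionary entry: characters that resemble "木".
TOY_SIMILARITY = {"木": ["术", "本"]}

candidates = build_candidates("图像识别技木", "木", 5, TOY_SIMILARITY)
no_match = build_candidates("图像识别技木", "技", 4, TOY_SIMILARITY)
```

Keeping the original text among the candidates means the later scoring step can still decide that no replacement is better than the recognized text.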

Step S108: calculate the degree of confusion corresponding to each candidate text information, and take the candidate text information with the smallest degree of confusion as the corrected text information.

Here, the degree of confusion refers to the complexity of text information in terms of grammar, structure or semantics, and characterizes how difficult the text information is to understand.

Exemplarily, each candidate text information can be input into a trained language model, which calculates the degree of confusion corresponding to each candidate text information. According to the calculation results, the candidate text information with the lowest degree of confusion is used as the corrected text information; that is, it is judged to be the text that correct recognition of the image would have produced.

Optionally, text length can be used as an indicator of the degree of confusion, since text information that is too long or too short may be difficult to understand; the grammatical structure of the text information can be used as an indicator, since complex or unconventional grammatical structures may make the text information harder to understand; and the ambiguity of the text information can be used as an indicator, since a word or phrase that admits multiple interpretations makes the text information more ambiguous and thus increases the degree of confusion.
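The selection in step S108 can be sketched with a tiny character-level bigram model. This is only a stand-in for the trained language model mentioned above: treating the degree of confusion as a perplexity-style score, the add-one smoothing, the training sentences and the names `train_bigram` and `confusion` are all assumptions, not details from the patent.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character bigrams over a list of sentences (toy stand-in model)."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence)
        contexts.update(chars[:-1])
        bigrams.update(zip(chars[:-1], chars[1:]))
    vocab = {c for s in corpus for c in s} | {"<s>"}
    return contexts, bigrams, len(vocab)


def confusion(text, model):
    """Perplexity-style score for non-empty text: lower means more plausible."""
    contexts, bigrams, vocab_size = model
    chars = ["<s>"] + list(text)
    log_prob = 0.0
    for prev, cur in zip(chars[:-1], chars[1:]):
        # Add-one smoothing keeps unseen character pairs from having zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


model = train_bigram(["图像识别技术", "识别技术的发展"])  # hypothetical corpus
candidates = ["图像识别技木", "图像识别技术", "图像识别技本"]
best = min(candidates, key=lambda c: confusion(c, model))
```

Because the three candidates differ only in the last character, the score is decided by how often that character followed "技" in the toy corpus, so the candidate ending in "术" is selected.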

In the above information error correction method, the initial text information corresponding to the target image is identified, the initial text information is segmented using a word segmentation dictionary to obtain the corresponding pieces of word segmentation information, and the grammatical structure and semantic information corresponding to each piece of word segmentation information are analyzed at character granularity and word granularity, so that the text information to be corrected is identified comprehensively and accurately among the word segmentation information. This improves the accuracy of identifying abnormal information, and thereby the efficiency and accuracy of error correction for abnormal information.

In an exemplary embodiment, acquiring a target image and identifying initial text information corresponding to the target image includes step S202, wherein:

Step S202: based on a regular expression, match the book title marks in the initial text information and the text information within the book title marks according to the left and right boundaries of the book title marks, and input the text information within the book title marks into the word segmentation dictionary.

Here, a regular expression refers to a tool used to match, search for or replace text that conforms to a specific text pattern.

Exemplarily, the left and right boundaries of the book title marks are written into a regular expression, so that the regular expression matches text enclosed in book title marks. Based on this regular expression, the book title marks and the text information within them are matched in the initial text information, so that the text information within the book title marks can be extracted separately and segmented.

Optionally, the title portions of the text information can be quickly identified based on a regular expression containing book title marks; the question portions can be quickly identified based on a regular expression containing question marks; and the emphasized portions can be quickly identified based on a regular expression containing exclamation marks.

In this embodiment, text information conforming to a specific text pattern is quickly matched based on regular expressions, so that information can be identified and corrected in a way that is adapted and targeted to different application scenarios.
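The book-title-mark matching of step S202 can be sketched with Python's standard `re` module. The exact pattern used in the patent is not given, so the pattern below, which takes "《" and "》" as the left and right boundaries and does not handle nested marks, is an assumption, as are the names `BOOK_TITLE` and `extract_titles`.

```python
import re

# Left boundary "《", right boundary "》"; the group captures the text between them.
# Nested book title marks are deliberately not handled in this sketch.
BOOK_TITLE = re.compile(r"《([^《》]+)》")


def extract_titles(text):
    """Return the text found inside each pair of book title marks."""
    return BOOK_TITLE.findall(text)


titles = extract_titles("他读过《红楼梦》和《西游记》。")
```

Each extracted string could then be passed on to word segmentation separately from the surrounding text.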

In an exemplary embodiment, analyzing the grammatical structure and semantic information corresponding to each piece of word segmentation information at character granularity and word granularity includes step S302, wherein:

Step S302: based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary, analyze the grammatical structure and semantic information corresponding to each piece of word segmentation information; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules.

Here, part-of-speech rules are rules formulated for the part-of-speech information of vocabulary, and help determine the part of speech of a word, such as noun, verb, adjective or adverb; phrase collocation rules are rules defined according to how words combine, and help determine fixed collocations of phrases and customary expressions in text; polysemous word rules are rules formulated for words with multiple meanings, and help identify and distinguish such words in text; proper noun recognition rules are rules formulated for proper nouns, and are used to identify proper nouns in text; grammatical component rules are rules formulated for how different components are arranged and combined in a sentence, and help determine the grammatical position of a word in the text by determining its grammatical component, such as subject, predicate or object, thereby ensuring that the text has a valid grammatical structure.

Exemplarily, the word segmentation dictionary carries standard rules formulated for grammatical structure or semantic information, such as part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules. Based on these standard rules, the grammatical structure and semantic information corresponding to each piece of word segmentation information are analyzed.

In this embodiment, based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary, each piece of word segmentation information is analyzed along the dimensions of grammatical structure and semantic information, so that the text information to be corrected is identified comprehensively and accurately.

In an exemplary embodiment, using the current word segmentation information as the text information to be corrected if it is judged not to meet the preset conditions includes step S402, wherein:

Step S402: if the current word segmentation information is a single character, analyze the current word segmentation information based on the standard rules; if the current word segmentation information does not meet the standard rules, use the current word segmentation information as the text information to be corrected.

Exemplarily, segmenting the initial text information yields multiple pieces of word segmentation information. If one of them is a single character, that character did not match any corresponding word information in the word segmentation dictionary, so the word segmentation information for that character may be abnormal information and needs to be analyzed using the standard rules of the word segmentation dictionary. If it does not meet the standard rules, it is used as the text information to be corrected.

Optionally, if a piece of word segmentation information is a single character, that is, the character did not match any corresponding word information in the word segmentation dictionary, but the word segmentation information meets the standard rules of the word segmentation dictionary, then the word segmentation information for that single character is normal information; for example, the character exists as an independent word.

In this embodiment, word segmentation information consisting of a single character, which may be abnormal information, is further analyzed using the standard rules, which avoids judging normal information as abnormal information and improves the accuracy of identifying abnormal information.
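The single-character check of step S402 might look like the sketch below. The standard rules themselves are not spelled out in enough detail to implement, so a small set of characters that commonly stand alone as words is used as a hypothetical stand-in; `STANDALONE_CHARS` and `find_suspects` are illustrative names only.

```python
# Hypothetical stand-in for the standard rules: characters accepted as
# independent words. Real rules would use part of speech, collocation, etc.
STANDALONE_CHARS = {"的", "和", "是", "在"}


def find_suspects(tokens, standalone=STANDALONE_CHARS):
    """Return (offset, character) for single-character tokens that fail the rules."""
    suspects = []
    offset = 0
    for token in tokens:
        # Only single-character tokens are examined; multi-character tokens
        # already matched a dictionary word.
        if len(token) == 1 and token not in standalone:
            suspects.append((offset, token))
        offset += len(token)
    return suspects


suspects = find_suspects(["图像", "识别", "的", "技", "木"])
```

Here "的" passes because it is accepted as an independent word, while "技" and "木" are reported together with their positions in the original text.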

In an exemplary embodiment, matching the text information to be corrected with multiple corresponding pieces of similar text information in the similarity dictionary, and replacing the text information to be corrected in the initial text information with each piece of similar text information in turn, includes step S502, wherein:

Step S502: if there are at least two pieces of text information to be corrected, each of them is matched with multiple corresponding pieces of similar text information in the similarity dictionary; the similar text information corresponding to the different pieces of text information to be corrected is combined to obtain multiple pieces of similar text combination information; and the corresponding text information to be corrected in the initial text information is replaced with each piece of similar text combination information in turn.

Exemplarily, if the text information contains at least two pieces of text information to be corrected, each of them is looked up separately in the similarity dictionary to match its similar text information. The similar text information of the different pieces is then combined to obtain multiple pieces of similar text combination information, and each combination in turn replaces the corresponding text information to be corrected in the initial text information. The number of pieces of candidate text information finally generated equals the number of pieces of similar text combination information.

Optionally, suppose A, C and D are Chinese characters with similar glyphs, B, E and F are Chinese characters with similar glyphs, and the text information contains two pieces of text information to be corrected, A and B. In the similarity dictionary, A is matched to the similar text information A, C and D, and B is matched to the similar text information B, E and F. Combining the similar text information of the two yields nine pieces of similar text combination information: AB, AE, AF, CB, CE, CF, DB, DE and DF. Replacing AB in the initial text information with each combination in turn produces nine pieces of candidate text information, containing AB, AE, AF, CB, CE, CF, DB, DE and DF respectively.

In this embodiment, the similar text information corresponding to the different pieces of text information to be corrected is combined into multiple pieces of similar text combination information, and each combination replaces the corresponding text information to be corrected. This provides a complete set of candidate text information and thereby improves the accuracy of error correction.
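The combination in step S502 amounts to a Cartesian product over the look-alike lists. A minimal sketch, assuming each piece of text to be corrected is a single character identified by its position in the text (the `build_candidates` function and its `alternatives` mapping are illustrative names, not taken from the specification):

```python
from itertools import product
from typing import Dict, List


def build_candidates(text: str, alternatives: Dict[int, List[str]]) -> List[str]:
    """Generate one candidate per combination of look-alike characters.

    `alternatives` maps the position of each flagged character to the list
    of similar characters matched for it in the similarity dictionary.
    """
    positions = sorted(alternatives)
    option_lists = [alternatives[pos] for pos in positions]
    candidates = []
    for combo in product(*option_lists):
        chars = list(text)
        for pos, replacement in zip(positions, combo):
            chars[pos] = replacement
        candidates.append("".join(chars))
    return candidates


# The A/B example from the text: 3 x 3 = 9 candidates.
nine = build_candidates("AB", {0: ["A", "C", "D"], 1: ["B", "E", "F"]})
```

The candidate count is the product of the list lengths, so it grows quickly with the number of flagged characters; an implementation may want to cap the list sizes.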

在一个示例性的实施例中，计算每一候选文本信息对应的混淆程度，包括步骤S602，其中：In an exemplary embodiment, calculating the confusion degree corresponding to each candidate text information includes step S602, wherein:

Step S602: each piece of candidate text information is input into a trained model, and the model obtains the confusion degree of each candidate from the conditional probability of each character within that candidate, where the conditional probability is the probability that the next word appears given the current word.

The model may be a language model. By learning the language structure and contextual relationships of large-scale text data, it can calculate the plausibility and probability distribution of a given text, and from these obtain the confusion degree of that text.

Exemplarily, each piece of candidate text information is input into the trained model, which works from the conditional probability of each character within that candidate. Within one candidate, the model first calculates the probability that the current character appears, then calculates the probability that the next character appears given the current character, and proceeds through the characters in order. The result is the probability that all characters of the candidate appear together.

Optionally, the probability of a piece of candidate text information varies inversely with its confusion degree: the candidate with the highest probability is the one most likely to be normal information, has the lowest confusion degree, and is easier to understand than the other candidates.

可选地，可根据各个字符之间的语义关联关系，以及字符在文本中的语法位置关系，得到出现每一字符所对应的概率。其中，各个字符之间的语义关联关系可表示为在文本中的各个字符相互之间的含义关系，例如相邻字符之间的影响、字符的组合和排列等；字符在文本中的语法位置关系可表示为各个字符在文本中的语法角色，例如主语、谓语、宾语等。Optionally, the probability of each character appearing can be obtained based on the semantic association between the characters and the grammatical position relationship of the characters in the text. The semantic association between the characters can be expressed as the meaning relationship between the characters in the text, such as the influence between adjacent characters, the combination and arrangement of characters, etc. The grammatical position relationship of the characters in the text can be expressed as the grammatical role of each character in the text, such as subject, predicate, object, etc.

本实施例中，通过模型对候选文本信息中的各个字符出现概率进行计算，得到该候选文本信息对应的概率，从而高效地、准确地得到该候选文本信息的混淆程度。In this embodiment, the probability of occurrence of each character in the candidate text information is calculated through a model to obtain the probability corresponding to the candidate text information, thereby efficiently and accurately obtaining the confusion degree of the candidate text information.
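To make the scoring concrete, the sketch below computes a perplexity-style confusion degree from character-level conditional probabilities. It is a toy under stated assumptions: the specification does not name a model, so an add-one-smoothed bigram model trained on a three-sentence corpus is substituted here, and the function names `train_bigram` and `perplexity` are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram(corpus: Iterable[str]) -> Model:
    """Count character bigrams; '<s>' marks the start of each sentence."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence)
        contexts.update(chars[:-1])
        bigrams.update(zip(chars[:-1], chars[1:]))
        vocab.update(sentence)
    return contexts, bigrams, len(vocab) + 1  # +1 reserves mass for unseen characters


def perplexity(text: str, model: Model) -> float:
    """exp of the negative mean log-probability of each character given the previous one."""
    if not text:
        raise ValueError("text must not be empty")
    contexts, bigrams, vocab_size = model
    chars = ["<s>"] + list(text)
    log_prob = 0.0
    for prev, cur in zip(chars[:-1], chars[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))


toy_model = train_bigram(["都市狂龙", "都市生活", "都市传说"])
```

With this toy corpus, `perplexity("都市狂龙", toy_model)` is lower than `perplexity("都布狂龙", toy_model)`, because the bigram "都布" never occurs in the training text. A production system would use a model trained on large-scale text instead.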

In an exemplary embodiment, as shown in FIG. 2, the method includes the following steps:

In a business scenario of recognizing and correcting the title content of an advertising poster, OCR is used to recognize the initial text information of the poster, namely "热门短剧《都布狂龙》限时免费" ("The popular short drama 《都布狂龙》 is free for a limited time"). Based on a regular expression containing book title marks (《》), the book title marks and the text information inside them are matched in the initial text information, and subsequent error correction is performed on the text information inside the book title marks, namely the poster's title content "都布狂龙".
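The book-title-mark extraction can be illustrated with Python's standard `re` module. The pattern below is an assumption, since the specification does not give the regular expression it uses; it anchors on the left boundary 《 and the right boundary 》 and captures the text between them.

```python
import re
from typing import List

# Assumed pattern: left boundary, one or more non-boundary characters, right boundary.
TITLE_PATTERN = re.compile(r"《([^《》]+)》")


def extract_titles(text: str) -> List[str]:
    """Return the text found inside each pair of book title marks."""
    return TITLE_PATTERN.findall(text)


titles = extract_titles("热门短剧《都布狂龙》限时免费")
```

Excluding the marks themselves from the character class keeps each match confined to a single pair when a line contains several titles.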

The title "都布狂龙" is segmented through the word segmentation dictionary, yielding three pieces of word segmentation information: "都", "布" and "狂龙". The grammatical structure and semantic information of each piece are analyzed at character granularity and word granularity; "布" does not satisfy the word segmentation dictionary's standard rules for grammatical structure and semantic information, so it is used as the text information to be corrected. The word segmentation dictionary may be the jieba lexicon, or a custom dictionary that records specific vocabulary for a specific business scenario, so that it adapts to different business scenarios and to iterating business requirements. The similarity dictionary may be customized in the same way.
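The dictionary-based segmentation can be sketched without any third-party dependency. The code below uses forward maximum matching over a tiny hand-made dictionary; this is a simplified stand-in chosen for illustration and is not how jieba segments text internally. The `segment` function and the toy dictionary are hypothetical.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    if none matches, emit the single character on its own.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_dictionary = {"都市", "狂龙"}
tokens = segment("都布狂龙", toy_dictionary)
```

Because "都布" is not a dictionary word, the misrecognized title falls apart into lone characters before "狂龙", which is the signal the single-character check relies on.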

Through the similarity dictionary, the text information to be corrected "布" is matched to four pieces of similar text information: "布", "市", "币" and "巾". Each of the four replaces "布" in "都布狂龙" in turn, yielding four pieces of candidate text information: "都布狂龙", "都市狂龙", "都币狂龙" and "都巾狂龙".

The four pieces of candidate text information are each input into the natural language model, and the perplexity of each is calculated: 0.5 for "都布狂龙", 0.01 for "都市狂龙", 0.3 for "都币狂龙" and 0.3 for "都巾狂龙". The candidates are sorted from lowest to highest perplexity, and "都市狂龙", which has the lowest perplexity, is used as the corrected text information. Here, perplexity corresponds to the confusion degree.
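The replacement and ranking steps of this example can be traced in a few lines. The scores are copied from the worked example rather than computed, and the helper names `make_candidates` and `rank_by_confusion` are illustrative only.

```python
from typing import Dict, List, Tuple


def make_candidates(text: str, target: str, alternatives: List[str]) -> List[str]:
    """Replace the first occurrence of `target` with each alternative in turn."""
    return [text.replace(target, alt, 1) for alt in alternatives]


def rank_by_confusion(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidates from lowest to highest confusion degree."""
    return sorted(scores.items(), key=lambda item: item[1])


candidates = make_candidates("都布狂龙", "布", ["布", "市", "币", "巾"])
# Scores taken from the worked example; a real system would obtain them from the model.
example_scores = dict(zip(candidates, [0.5, 0.01, 0.3, 0.3]))
ranking = rank_by_confusion(example_scores)
corrected = ranking[0][0]
```

Python's sort is stable, so the two candidates tied at 0.3 keep their original relative order in `ranking`.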

In this embodiment, text information recognized from an image may be wrong when the image has low resolution, strong interference, deformed text or similar problems. With this method, the erroneous text information can be corrected efficiently and accurately, which keeps the results reliable.

应该理解的是，虽然如上所述的各实施例所涉及的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，如上所述的各实施例所涉及的流程图中的至少一部分步骤可以包括多个步骤或者多个阶段，这些步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the various steps in the flowcharts involved in the above-mentioned embodiments are displayed in sequence according to the indication of the arrows, these steps are not necessarily executed in sequence according to the order indicated by the arrows. Unless there is a clear explanation in this article, the execution of these steps does not have a strict order restriction, and these steps can be executed in other orders. Moreover, at least a part of the steps in the flowcharts involved in the above-mentioned embodiments can include multiple steps or multiple stages, and these steps or stages are not necessarily executed at the same time, but can be executed at different times, and the execution order of these steps or stages is not necessarily to be carried out in sequence, but can be executed in turn or alternately with other steps or at least a part of the steps or stages in other steps.

基于同样的发明构思，本申请实施例还提供了一种用于实现上述所涉及的信息纠错方法的信息纠错装置。该装置所提供的解决问题的实现方案与上述方法中所记载的实现方案相似，故下面所提供的一个或多个信息纠错装置实施例中的具体限定可以参见上文中对于信息纠错方法的限定，在此不再赘述。Based on the same inventive concept, the embodiment of the present application also provides an information error correction device for implementing the information error correction method involved above. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the above method, so the specific limitations in one or more information error correction device embodiments provided below can refer to the limitations of the information error correction method above, and will not be repeated here.

In an exemplary embodiment, as shown in FIG. 3, an information error correction device is provided, including an acquisition module 702, a word segmentation module 704, a matching module 706 and a calculation module 708, wherein:

获取模块702，用于获取目标图像，并识别目标图像对应的初始文本信息。The acquisition module 702 is used to acquire a target image and identify initial text information corresponding to the target image.

分词模块704，用于将初始文本信息通过分词词典进行分词处理，得到初始文本信息对应的多个分词信息，根据字粒度和词粒度分析各个分词信息对应的语法结构和语义信息，并对各个分词信息进行判断，若判断当前分词信息不满足预设条件，则将当前分词信息作为待纠错文本信息；其中字粒度表示在字符级别的分词单位，词粒度表示在词语级别的分词单位。The word segmentation module 704 is used to perform word segmentation processing on the initial text information through the word segmentation dictionary to obtain multiple word segmentation information corresponding to the initial text information, analyze the grammatical structure and semantic information corresponding to each word segmentation information according to the character granularity and the word granularity, and judge each word segmentation information. If it is judged that the current word segmentation information does not meet the preset conditions, the current word segmentation information is used as the text information to be corrected; wherein the character granularity represents the word segmentation unit at the character level, and the word granularity represents the word segmentation unit at the word level.

匹配模块706，用于将待纠错文本信息在形似字典中匹配对应的多个形似文本信息，将每一形似文本信息依次替换在初始文本信息中的待纠错文本信息，将替换处理后的初始文本信息作为候选文本信息。The matching module 706 is used to match the text information to be corrected with corresponding multiple similar text information in the similarity dictionary, replace the text information to be corrected in the initial text information with each similar text information in turn, and use the replaced initial text information as candidate text information.

计算模块708，用于计算每一候选文本信息对应的混淆程度，将混淆程度最小的候选文本信息作为已纠错文本信息。The calculation module 708 is used to calculate the confusion degree corresponding to each candidate text information, and take the candidate text information with the smallest confusion degree as the corrected text information.
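One way to picture how the four modules cooperate is a small pipeline object whose fields stand in for modules 702 to 708. This is a structural sketch only: the field names, the callable signatures and the assumption that the segmenter's tokens concatenate back to the original text are all illustrative and are not stated in the specification.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List


@dataclass
class ErrorCorrector:
    recognize: Callable[[bytes], str]    # acquisition module 702 (e.g. OCR)
    segment: Callable[[str], List[str]]  # word segmentation module 704
    is_suspect: Callable[[str], bool]    # preset-condition check made in 704
    similar: Dict[str, List[str]]        # similarity dictionary used by 706
    confusion: Callable[[str], float]    # scoring used by calculation module 708

    def correct(self, image: bytes) -> str:
        text = self.recognize(image)
        tokens = self.segment(text)
        # For each token, list the strings that may stand in its place.
        options = [
            self.similar.get(tok, [tok]) if self.is_suspect(tok) else [tok]
            for tok in tokens
        ]
        candidates = ["".join(combo) for combo in product(*options)]
        # The candidate with the smallest confusion degree is the corrected text.
        return min(candidates, key=self.confusion)
```

Passing the modules in as callables keeps the sketch independent of any particular OCR engine, segmenter or language model.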

In an exemplary embodiment, the acquisition module 702 is further configured to use a regular expression that relies on the left and right boundaries of the book title marks to match the book title marks and the text information inside them in the initial text information, and to input the text information inside the book title marks into the word segmentation dictionary.

在一个示例性的实施例中，分词模块704还用于基于分词词典所携带的针对语法结构和语义信息的标准规则，对各个分词信息对应的语法结构和语义信息进行分析；其中标准规则包括词性规则、词组搭配规则、多义词规则、专有名词识别规则以及语法成分规则。In an exemplary embodiment, the word segmentation module 704 is also used to analyze the grammatical structure and semantic information corresponding to each word segmentation information based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules.

在一个示例性的实施例中，分词模块704还用于若当前分词信息为单个字符，则基于标准规则对当前分词信息进行分析，若当前分词信息不满足标准规则，则将当前分词信息作为待纠错文本信息。In an exemplary embodiment, the word segmentation module 704 is also used to analyze the current word segmentation information based on standard rules if the current word segmentation information is a single character, and to use the current word segmentation information as text information to be corrected if the current word segmentation information does not meet the standard rules.

在一个示例性的实施例中，匹配模块706还用于若至少包括两个待纠错文本信息，则将各个待纠错文本信息分别在形似字典中匹配对应的多个形似文本信息，将不同待纠错文本信息分别对应的形似文本信息相互组合，得到多个形似文本组合信息，并将多个形似文本组合信息依次地替换在初始文本信息中对应的待纠错文本信息。In an exemplary embodiment, the matching module 706 is also used to match each text information to be corrected with corresponding multiple similar text information in the similarity dictionary if at least two text information to be corrected are included, combine the similar text information corresponding to different text information to be corrected with each other to obtain multiple similar text combination information, and replace the corresponding text information to be corrected in the initial text information with the multiple similar text combination information in sequence.

在一个示例性的实施例中，计算模块708还用于将每一候选文本信息输入至已训练的模型中，模型根据每一候选文本信息中的各个字符在对应的候选文本信息中的条件概率，得到对应的候选文本信息的混淆程度；其中条件概率是指在给定当前词的情况下，下一个词出现的概率。In an exemplary embodiment, the calculation module 708 is also used to input each candidate text information into a trained model, and the model obtains the confusion level of the corresponding candidate text information based on the conditional probability of each character in each candidate text information in the corresponding candidate text information; wherein the conditional probability refers to the probability of the next word appearing given the current word.

上述信息纠错装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中，也可以以软件形式存储于计算机设备中的存储器中，以便于处理器调用执行以上各个模块对应的操作。Each module in the above-mentioned information error correction device can be implemented in whole or in part by software, hardware and their combination. Each module can be embedded in or independent of the processor in the computer device in the form of hardware, or can be stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to each module above.

In an exemplary embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, an input/output (I/O) interface and a communication interface. The processor, the memory and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores images, various text information, word segmentation dictionary data, similarity dictionary data and language model data. The input/output interface is used to exchange information between the processor and external devices. The communication interface is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements an information error correction method.

本领域技术人员可以理解，图4中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art will understand that the structure shown in FIG. 4 is merely a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.

在一个示例性的实施例中，提供了一种计算机设备，包括存储器和处理器，存储器中存储有计算机程序，该处理器执行计算机程序时实现以下步骤：In an exemplary embodiment, a computer device is provided, including a memory and a processor, wherein a computer program is stored in the memory, and when the processor executes the computer program, the following steps are implemented:

获取目标图像，并识别目标图像对应的初始文本信息；Acquire a target image and identify initial text information corresponding to the target image;

将初始文本信息通过分词词典进行分词处理，得到初始文本信息对应的多个分词信息，根据字粒度和词粒度分析各个分词信息对应的语法结构和语义信息，并对各个分词信息进行判断，若判断当前分词信息不满足预设条件，则将当前分词信息作为待纠错文本信息；其中字粒度表示在字符级别的分词单位，词粒度表示在词语级别的分词单位；The initial text information is segmented through a segmentation dictionary to obtain multiple segmentation information corresponding to the initial text information, and the grammatical structure and semantic information corresponding to each segmentation information are analyzed according to the character granularity and the word granularity, and each segmentation information is judged. If it is judged that the current segmentation information does not meet the preset conditions, the current segmentation information is used as the text information to be corrected; wherein the character granularity represents the segmentation unit at the character level, and the word granularity represents the segmentation unit at the word level;

将待纠错文本信息在形似字典中匹配对应的多个形似文本信息，将每一形似文本信息依次替换在初始文本信息中的待纠错文本信息，将替换处理后的初始文本信息作为候选文本信息；Matching the text information to be corrected with corresponding multiple similar text information in the similarity dictionary, replacing the text information to be corrected in the initial text information with each similar text information in turn, and using the replaced initial text information as candidate text information;

在一个实施例中，处理器执行计算机程序时还实现以下步骤：基于正则表达式，根据书名号的左边界和右边界，匹配初始文本信息中的书名号以及书名号内的文本信息，将书名号内的文本信息输入至分词词典中。In one embodiment, when the processor executes the computer program, the following steps are also implemented: based on regular expressions, according to the left and right boundaries of the book title marks, the book title marks in the initial text information and the text information within the book title marks are matched, and the text information within the book title marks is input into the word segmentation dictionary.

在一个实施例中，处理器执行计算机程序时还实现以下步骤：基于分词词典所携带的针对语法结构和语义信息的标准规则，对各个分词信息对应的语法结构和语义信息进行分析；其中标准规则包括词性规则、词组搭配规则、多义词规则、专有名词识别规则以及语法成分规则。In one embodiment, when the processor executes the computer program, the following steps are also implemented: based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary, the grammatical structure and semantic information corresponding to each word segmentation information are analyzed; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules.

在一个实施例中，处理器执行计算机程序时还实现以下步骤：若当前分词信息为单个字符，则基于标准规则对当前分词信息进行分析，若当前分词信息不满足标准规则，则将当前分词信息作为待纠错文本信息。In one embodiment, when the processor executes the computer program, the following steps are also implemented: if the current word segmentation information is a single character, the current word segmentation information is analyzed based on standard rules; if the current word segmentation information does not meet the standard rules, the current word segmentation information is used as text information to be corrected.

在一个实施例中，处理器执行计算机程序时还实现以下步骤：若至少包括两个待纠错文本信息，则将各个待纠错文本信息分别在形似字典中匹配对应的多个形似文本信息，将不同待纠错文本信息分别对应的形似文本信息相互组合，得到多个形似文本组合信息，并将多个形似文本组合信息依次地替换在初始文本信息中对应的待纠错文本信息。In one embodiment, when the processor executes the computer program, the following steps are also implemented: if at least two text information to be corrected are included, each text information to be corrected is matched with corresponding multiple similar text information in the similarity dictionary, the similar text information corresponding to different text information to be corrected are combined with each other to obtain multiple similar text combination information, and the multiple similar text combination information are replaced in sequence with the corresponding text information to be corrected in the initial text information.

在一个实施例中，处理器执行计算机程序时还实现以下步骤：将每一候选文本信息输入至已训练的模型中，模型根据每一候选文本信息中的各个字符在对应的候选文本信息中的条件概率，得到对应的候选文本信息的混淆程度；其中条件概率是指在给定当前词的情况下，下一个词出现的概率。In one embodiment, when the processor executes the computer program, the following steps are also implemented: each candidate text information is input into a trained model, and the model obtains the confusion degree of the corresponding candidate text information based on the conditional probability of each character in each candidate text information in the corresponding candidate text information; wherein the conditional probability refers to the probability of the next word appearing given the current word.

在一个实施例中，提供了一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现以下步骤：In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

在一个实施例中，计算机程序被处理器执行时还实现以下步骤：基于正则表达式，根据书名号的左边界和右边界，匹配初始文本信息中的书名号以及书名号内的文本信息，将书名号内的文本信息输入至分词词典中。In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: based on regular expressions, according to the left and right boundaries of the book title marks, the book title marks in the initial text information and the text information within the book title marks are matched, and the text information within the book title marks is input into the word segmentation dictionary.

在一个实施例中，计算机程序被处理器执行时还实现以下步骤：基于分词词典所携带的针对语法结构和语义信息的标准规则，对各个分词信息对应的语法结构和语义信息进行分析；其中标准规则包括词性规则、词组搭配规则、多义词规则、专有名词识别规则以及语法成分规则。In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary, the grammatical structure and semantic information corresponding to each word segmentation information are analyzed; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules.

在一个实施例中，计算机程序被处理器执行时还实现以下步骤：若当前分词信息为单个字符，则基于标准规则对当前分词信息进行分析，若当前分词信息不满足标准规则，则将当前分词信息作为待纠错文本信息。In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: if the current word segmentation information is a single character, the current word segmentation information is analyzed based on standard rules; if the current word segmentation information does not meet the standard rules, the current word segmentation information is used as text information to be corrected.

在一个实施例中，计算机程序被处理器执行时还实现以下步骤：若至少包括两个待纠错文本信息，则将各个待纠错文本信息分别在形似字典中匹配对应的多个形似文本信息，将不同待纠错文本信息分别对应的形似文本信息相互组合，得到多个形似文本组合信息，并将多个形似文本组合信息依次地替换在初始文本信息中对应的待纠错文本信息。In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: if at least two text information to be corrected are included, each text information to be corrected is matched with corresponding multiple similar text information in the similarity dictionary, the similar text information corresponding to different text information to be corrected are combined with each other to obtain multiple similar text combination information, and the multiple similar text combination information are replaced in sequence with the corresponding text information to be corrected in the initial text information.

在一个实施例中，计算机程序被处理器执行时还实现以下步骤：将每一候选文本信息输入至已训练的模型中，模型根据每一候选文本信息中的各个字符在对应的候选文本信息中的条件概率，得到对应的候选文本信息的混淆程度；其中条件概率是指在给定当前词的情况下，下一个词出现的概率。In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: each candidate text information is input into a trained model, and the model obtains the confusion degree of the corresponding candidate text information based on the conditional probability of each character in each candidate text information in the corresponding candidate text information; wherein the conditional probability refers to the probability of the next word appearing given the current word.

需要说明的是，本申请所涉及的用户信息（包括但不限于用户设备信息、用户个人信息等）和数据（包括但不限于用于分析的数据、存储的数据、展示的数据等），均为经用户授权或者经过各方充分授权的信息和数据，且相关数据的收集、使用和处理需要符合相关规定。It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with relevant regulations.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. Any reference to a memory, database or other medium used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory and the like. Volatile memory may include random access memory (RAM), external cache memory and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing and the like.

The technical features of the above embodiments may be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.

The above-described embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be understood as limiting the patent scope of the present application. It should be pointed out that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

1. An information error correction method, characterized in that the method comprises:

Acquiring a target image and identifying initial text information corresponding to the target image, including: matching text information corresponding to the target symbol in the initial text information based on a target symbol included in a regular expression, and inputting the text information corresponding to the target symbol into a word segmentation dictionary, wherein the target symbol includes a book title mark, a question mark, and an exclamation mark;

The initial text information is segmented through a segmentation dictionary to obtain a plurality of segmentation information corresponding to the initial text information, the grammatical structure and semantic information corresponding to each segmentation information are analyzed according to the character granularity and the word granularity, and each segmentation information is judged, and if it is judged that the current segmentation information does not meet the preset conditions, the current segmentation information is used as the text information to be corrected, including: if the grammatical structure and semantic information corresponding to the current segmentation information do not match the grammatical structure and semantic information corresponding to the word information retrieved in the segmentation dictionary, the current segmentation information is used as the text information to be corrected; wherein the character granularity represents the segmentation unit at the character level, and the word granularity represents the segmentation unit at the word level;

Matching the text information to be corrected with corresponding multiple similar text information in a similarity dictionary, replacing the text information to be corrected in the initial text information with each similar text information in turn, and using the replaced initial text information as candidate text information;

Calculating the degree of confusion corresponding to each candidate text information, and taking the candidate text information with the smallest degree of confusion as the corrected text information; wherein the text length, grammatical structure and text ambiguity of each candidate text information are used as indicators to measure the degree of confusion; the degree of confusion represents the complexity of the candidate text information in terms of grammar, structure or semantics, and is used to characterize the difficulty of understanding the candidate text information;

The grammatical structure and semantic information corresponding to each word segmentation information is analyzed according to the character granularity and word granularity, including:

Based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary, the grammatical structure and semantic information corresponding to each word segmentation information are analyzed; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules;

If it is determined that the current word segmentation information does not meet the preset condition, the current word segmentation information is used as text information to be corrected, including:

If the current word segmentation information is a single character, the current word segmentation information is analyzed based on the standard rule; if the current word segmentation information does not meet the standard rule, the current word segmentation information is used as text information to be corrected.
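
As a rough illustration of the segmentation and detection steps of claim 1, the following Python sketch segments text with a dictionary and flags suspicious single-character segments. It is only a sketch under stated assumptions: forward maximum matching is substituted for whatever segmentation algorithm the dictionary actually drives, and the "standard rule" is reduced to a hypothetical whitelist of characters that may stand alone. The names `forward_max_match`, `flag_suspects` and `allowed_single_chars` are illustrative and do not come from the application.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text greedily, preferring the longest dictionary word at each position.

    Assumption: forward maximum matching stands in for the dictionary-based
    segmentation described in the claim.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback segment.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def flag_suspects(tokens: List[str], allowed_single_chars: Set[str]) -> List[str]:
    """Return single-character segments that fail the simplified rule.

    Assumption: the rule is reduced to "a lone character must be whitelisted".
    """
    return [t for t in tokens if len(t) == 1 and t not in allowed_single_chars]


if __name__ == "__main__":
    words = {"今天", "天气", "很好"}
    segments = forward_max_match("今天天汽很好", words)
    print(segments)
    print(flag_suspects(segments, {"的", "了", "天"}))
```

In this simplified form, only the leftover single character is flagged as text to be corrected; the part-of-speech, collocation and other rules named in the claim would need a much richer check.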

2. The method according to claim 1, wherein acquiring a target image and identifying initial text information corresponding to the target image comprises:

Based on regular expressions, the book title marks in the initial text information and the text information in the book title marks are matched according to the left and right boundaries of the book title marks, and the text information in the book title marks is input into the word segmentation dictionary.
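
The boundary-based matching of claim 2 could look roughly like the Python sketch below. The pattern and the helper names (`BOOK_TITLE_PATTERN`, `extract_book_titles`, `add_titles_to_dictionary`) are assumptions made for illustration, and the segmentation dictionary is modelled as a plain set of words, which the application does not specify.

```python
import re
from typing import List, Set

# Assumed pattern: the left boundary is "《", the right boundary is "》",
# and the captured group is the text between them (no nested marks).
BOOK_TITLE_PATTERN = re.compile(r"《([^《》]+)》")


def extract_book_titles(text: str) -> List[str]:
    """Return the text found inside each pair of book title marks."""
    return BOOK_TITLE_PATTERN.findall(text)


def add_titles_to_dictionary(text: str, dictionary: Set[str]) -> Set[str]:
    """Add every matched title to the (hypothetical) segmentation dictionary.

    Registering a title as one entry keeps it from being split into
    smaller segments later.
    """
    for title in extract_book_titles(text):
        dictionary.add(title)
    return dictionary


if __name__ == "__main__":
    sample = "我读了《红楼梦》和《西游记》。"
    print(extract_book_titles(sample))
    print(add_titles_to_dictionary(sample, {"我", "读"}))
```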

3. The method according to claim 1, characterized in that the step of matching the text information to be corrected with corresponding multiple similar text information in a similarity dictionary and replacing the text information to be corrected in the initial text information with each similar text information in turn comprises:

If at least two pieces of text information to be corrected are included, each piece of text information to be corrected is matched with corresponding multiple similar text information in the similarity dictionary, the similar text information corresponding to different pieces of text information to be corrected is combined with each other to obtain multiple pieces of similar text combination information, and the corresponding text information to be corrected in the initial text information is replaced with each piece of similar text combination information in turn.
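
One way to read the combination step of claim 3 is as a Cartesian product over the similar-text lists of each segment to be corrected. The Python sketch below illustrates that reading only; the function name `build_candidates` and the shape of the `similar` mapping are assumptions, and `str.replace` is a simplification that replaces every occurrence of a segment rather than one specific position.

```python
from itertools import product
from typing import Dict, List


def build_candidates(text: str, similar: Dict[str, List[str]]) -> List[str]:
    """Build one candidate text per combination of replacements.

    similar maps each segment to be corrected to its list of similar texts
    (hypothetical structure standing in for the similarity dictionary lookup).
    """
    targets = list(similar)
    candidates = []
    # Each combo picks exactly one similar text for every segment to be corrected.
    for combo in product(*(similar[t] for t in targets)):
        candidate = text
        for target, replacement in zip(targets, combo):
            # Simplification: replaces all occurrences of the target segment.
            candidate = candidate.replace(target, replacement)
        candidates.append(candidate)
    return candidates


if __name__ == "__main__":
    lookup = {"天汽": ["天气", "天启"], "公圆": ["公园", "公元"]}
    for line in build_candidates("今天天汽很好，我们去公圆", lookup):
        print(line)
```

With two segments and two similar texts each, this yields four candidate texts, which would then be ranked by degree of confusion.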

4. The method according to claim 1, wherein calculating the degree of confusion corresponding to each candidate text information comprises:

Each candidate text information is input into a trained model, and the model obtains the degree of confusion of the corresponding candidate text information according to the conditional probability of each character of that candidate text information within that candidate text information; wherein the conditional probability refers to the probability of the next word appearing given the current word.

5. The method according to claim 4, characterized in that the step of inputting each candidate text information into a trained model, wherein the model obtains the degree of confusion of the corresponding candidate text information according to the conditional probability of each character of that candidate text information within that candidate text information, comprises:

In the current candidate text information, the probability of the current character appearing is calculated; given that the current character appears, the probability of the next character also appearing is calculated; and the probability of each character appearing is calculated in sequence in this way, to obtain the conditional probability corresponding to all characters in the current candidate text information appearing together.
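
The character-by-character calculation of claims 4 and 5 resembles the usual perplexity computation of a language model. The Python sketch below illustrates this with a character bigram model using add-one smoothing. This is an assumption made for illustration: the application does not state which trained model is used, and the class name `BigramCharModel`, the toy corpus and the smoothing choice are all hypothetical.

```python
import math
from collections import Counter
from typing import List


class BigramCharModel:
    """Toy character bigram model standing in for the trained model (assumption)."""

    def __init__(self, corpus: List[str]) -> None:
        self.context_counts = Counter()  # how often each character is followed by another
        self.bigram_counts = Counter()   # how often each (previous, current) pair occurs
        self.vocab = set()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence)  # "<s>" marks the sentence start
            self.vocab.update(chars)
            self.context_counts.update(chars[:-1])
            self.bigram_counts.update(zip(chars, chars[1:]))

    def cond_prob(self, prev: str, cur: str) -> float:
        """P(cur | prev) with add-one smoothing, so unseen pairs get a small probability."""
        size = len(self.vocab) + 1  # one extra slot for unseen characters
        return (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + size)

    def perplexity(self, text: str) -> float:
        """Chain the conditional probabilities and normalise by the text length."""
        chars = ["<s>"] + list(text)
        log_prob = sum(math.log(self.cond_prob(p, c)) for p, c in zip(chars, chars[1:]))
        return math.exp(-log_prob / max(len(text), 1))


if __name__ == "__main__":
    model = BigramCharModel(["今天天气很好", "明天天气不错", "天气预报"])
    candidates = ["今天天汽很好", "今天天气很好"]
    # The candidate with the smallest degree of confusion is kept as the correction.
    print(min(candidates, key=model.perplexity))
```

Under these assumptions, the degree of confusion is the exponential of the negative average log conditional probability, so a candidate whose characters commonly follow each other in the training text scores lower.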

6. The method according to claim 4, characterized in that the probability corresponding to the occurrence of each character is determined based on the semantic association relationship between the characters and the grammatical position relationship of the characters in the text;

The semantic association between characters is represented by the meaning relationship between characters in the text, and the meaning relationship between characters includes the influence between adjacent characters, the combination and arrangement of characters;

The grammatical position relationship of characters in the text is represented by the grammatical role of each character in the text. The grammatical role of each character in the text includes subject, predicate and object.

7. An information error correction device, characterized in that the device comprises:

an acquisition module, for acquiring a target image and identifying initial text information corresponding to the target image, including: matching text information corresponding to the target symbol in the initial text information based on a target symbol included in a regular expression, and inputting the text information corresponding to the target symbol into a word segmentation dictionary, wherein the target symbol includes a book title mark, a question mark, and an exclamation mark;

a word segmentation module, used to perform word segmentation processing on the initial text information through a word segmentation dictionary to obtain multiple word segmentation information corresponding to the initial text information, analyze the grammatical structure and semantic information corresponding to each word segmentation information according to the character granularity and the word granularity, and judge each word segmentation information; if it is judged that the current word segmentation information does not meet the preset conditions, the current word segmentation information is used as the text information to be corrected, including: if the grammatical structure and semantic information corresponding to the current word segmentation information do not match the grammatical structure and semantic information corresponding to the word information retrieved in the word segmentation dictionary, the current word segmentation information is used as the text information to be corrected; wherein the character granularity represents the word segmentation unit at the character level, and the word granularity represents the word segmentation unit at the word level;

a matching module, used for matching the text information to be corrected with corresponding multiple similar text information in a similarity dictionary, replacing the text information to be corrected in the initial text information with each similar text information in turn, and using the replaced initial text information as candidate text information;

A calculation module, used to calculate the degree of confusion corresponding to each candidate text information, and take the candidate text information with the smallest degree of confusion as the corrected text information; wherein the text length, grammatical structure and text ambiguity of each candidate text information are used as indicators to measure the degree of confusion; the degree of confusion represents the complexity of the candidate text information in terms of grammar, structure or semantics, and is used to characterize the difficulty of understanding the candidate text information;

The word segmentation module is also used to: analyze the grammatical structure and semantic information corresponding to each word segmentation information based on the standard rules for grammatical structure and semantic information carried by the word segmentation dictionary; wherein the standard rules include part-of-speech rules, phrase collocation rules, polysemous word rules, proper noun recognition rules and grammatical component rules;

The word segmentation module is also used to: if the current word segmentation information is a single character, analyze the current word segmentation information based on the standard rule; if the current word segmentation information does not meet the standard rule, use the current word segmentation information as text information to be corrected.

8. The device according to claim 7, characterized in that the matching module is further used for:

9. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 6 are implemented.