CN114359086B - Molecular formula recognition method and related device, equipment and storage medium

Info

Publication number: CN114359086B
Application number: CN202111630143.7A
Authority: CN
Inventors: 吴浩
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2024-11-08
Anticipated expiration: 2041-12-28
Also published as: CN114359086A

Abstract

The present application discloses a molecular formula recognition method and related devices, equipment and storage medium, the method comprising: using a molecular formula recognition model to recognize an image to be recognized to obtain a symbol sequence; based on the symbol sequence, recovering the target molecular formula in the image to be recognized; wherein the molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula. The above scheme can improve the accuracy of molecular formula recognition and the generalization ability of molecular formula recognition.

Description

Molecular formula recognition method and related device, equipment and storage medium

Technical Field

The present application relates to the field of image recognition technology, and in particular to a molecular formula recognition method and related devices, equipment and storage media.

Background Art

With the development of deep learning, image-text recognition technology has matured and is being applied in more and more industries. In education in particular, recognizing photographed images of homework, answers and test papers has become very important: on the one hand it enables efficient digitization of teaching material, and on the other hand it supports automatic grading, which in turn enables learning-situation analysis and teaching tailored to each student. It has become an essential requirement.

At present, image-text recognition is essentially mature for Chinese and English text, and its ability to recognize structured data such as formulas has also reached a usable level, which covers most scenarios in Chinese, English, mathematics, physics and chemistry. For some special scenarios, however, it is not yet usable, and organic chemistry is one of them: because images of organic chemistry questions and answers look very different from Chinese and English text and from mathematical formulas, existing technology cannot yet recognize organic chemistry data.

Summary of the Invention

The main technical problem solved by the present application is to provide a molecular formula recognition method and related devices, equipment and storage media that can improve the accuracy of molecular formula recognition and its generalization ability.

To solve the above technical problem, a first aspect of the present application provides a molecular formula recognition method, including: recognizing an image to be recognized using a molecular formula recognition model to obtain a symbol sequence; and recovering, based on the symbol sequence, the target molecular formula in the image to be recognized; wherein the molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula.

To solve the above technical problem, a second aspect of the present application provides a molecular formula recognition device, including a sequence recognition module and a formula recovery module; the sequence recognition module is used to recognize an image to be recognized using a molecular formula recognition model to obtain a symbol sequence; the formula recovery module is used to recover, based on the symbol sequence, the target molecular formula in the image to be recognized; wherein the molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula.

To solve the above technical problem, a third aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, the memory storing program instructions, and the processor being used to execute the program instructions to implement the molecular formula recognition method of the first aspect.

To solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being used to implement the molecular formula recognition method of the first aspect.

In the above scheme, the image to be recognized is recognized using the molecular formula recognition model to obtain a symbol sequence, and the target molecular formula in the image to be recognized is recovered based on the symbol sequence; the molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula. Therefore, the symbol sequence obtained by the molecular formula recognition model accurately reflects the graphic visual form of the target molecular formula; that is, it faithfully reflects the content of the molecular formula in the image to be recognized and fully retains its original information, thereby improving the accuracy and generalization ability of recognizing the target molecular formula.

Brief Description of the Drawings

FIG. 1 is a schematic flow chart of an embodiment of the molecular formula recognition method provided by the present application;

FIG. 2 is a schematic diagram of an embodiment of a SMILES recognition result provided by the present application;

FIG. 3 is a schematic diagram of an embodiment of a Chemfig annotation result provided by the present application;

FIG. 4 is a schematic structural diagram of an embodiment of the molecular formula recognition model provided by the present application;

FIG. 5 is a schematic diagram of an embodiment of sample data provided by the present application;

FIG. 6 is a schematic flow chart of an embodiment of step S11 shown in FIG. 1;

FIG. 7 is a schematic flow chart of an embodiment of processing an original label sequence to obtain a sample symbol sequence provided by the present application;

FIG. 8 is a schematic flow chart of an embodiment of determining whether an original label sequence is correct provided by the present application;

FIG. 9 is a schematic diagram of an embodiment of a sample symbol sequence provided by the present application;

FIG. 10 is a schematic flow chart of an embodiment of step S112 shown in FIG. 6;

FIG. 11 is a schematic diagram of another embodiment of a sample symbol sequence provided by the present application;

FIG. 12 is a schematic flow chart of an embodiment of training the molecular formula recognition model provided by the present application;

FIG. 13 is a schematic diagram of an embodiment of the decoding process of the molecular formula recognition model provided by the present application;

FIG. 14 is a schematic flow chart of an embodiment of determining the end of decoding provided by the present application;

FIG. 15 is a schematic framework diagram of an embodiment of the molecular formula recognition device provided by the present application;

FIG. 16 is a schematic framework diagram of an embodiment of the electronic device provided by the present application;

FIG. 17 is a schematic framework diagram of an embodiment of the computer-readable storage medium provided by the present application.

Detailed Description

The schemes of the embodiments of the present application are described in detail below with reference to the accompanying drawings.

In the following description, specific details such as particular system structures, interfaces and techniques are set forth for the purpose of explanation rather than limitation, to provide a thorough understanding of the present application.

The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean that A exists alone, that A and B both exist, or that B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it, and "multiple" herein means two or more.

Please refer to FIG. 1, which is a schematic flow chart of an embodiment of the molecular formula recognition method provided by the present application. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the step order shown in FIG. 1. As shown in FIG. 1, this embodiment includes:

Step S11: Recognize the image to be recognized using the molecular formula recognition model to obtain a symbol sequence.

The method of this embodiment recognizes an image to be recognized in order to recover the target molecular formula in that image. The image to be recognized may be an image containing any molecular formula, and may be obtained from local storage or cloud storage. It can be understood that, in other embodiments, the image to be recognized may also be captured in real time by an image acquisition device, which is not specifically limited here.

In this embodiment, the image to be recognized is recognized using the molecular formula recognition model to obtain a symbol sequence. The molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula. In other words, the model is trained on sample images whose annotations are sample symbol sequences built from the graphic visual form of the sample molecular formulas; once trained, the model can process an image to be recognized so that the target molecular formula can be obtained.

For example, FIG. 2 is a schematic diagram of an embodiment of a SMILES recognition result provided by the present application. Deep-learning-based molecular formula recognition methods generally apply an existing Encoder-Decoder model directly; the model learns a mapping from the molecular formula image to the recognition result, which is usually stored as a SMILES string. However, as shown in FIG. 2, the left and right input images are molecular formulas handwritten by two students. The formula in the left image is written correctly, while the formula in the right image is written incorrectly (the subscript in the dashed box in the right image can only be 3; any other value violates the valence rules). Under the SMILES specification, only the left formula can be expressed correctly; the right one cannot. That is, SMILES cannot faithfully express handwritten answers that do not conform to chemical rules. In addition, the left input image in FIG. 2 contains the atomic group "H3C", but "H3C" does not appear in the SMILES recognition result because it is dropped under the rule that carbon and hydrogen are omitted; likewise, the "N" atom is adjusted to "N+" and the "O" atom to "O-", both for reasons of chemical principle. The SMILES recognition result is therefore a standardized expression, but it rests on a large amount of specialized chemical knowledge, does not take the written form into account, and cannot faithfully reflect the content of the image. It is not derived from the graphic visual form; in other words, the image and the text are not related.

By contrast, as shown in FIG. 3, which is a schematic diagram of an embodiment of a Chemfig annotation result provided by the present application, the "H3C" atomic group in the left input image complies with the valence rules, whereas the "H4C" atomic group in the right input image does not and is a writing error. The Chemfig annotation result reflects this error: the annotation for the right input image contains "H_{4}C". Because the Chemfig syntax relies little on abstract chemical knowledge and is instead designed around the molecular formula image itself, a sample symbol sequence obtained with the Chemfig syntax is constructed from the graphic visual form of the sample molecular formula. In other words, the Chemfig syntax is a "what you see is what you get" way of annotating molecular formula images; it fully retains the original information of the sample molecular formula, so that the output of the trained molecular formula recognition model can be graded. It can be understood that, in other embodiments, other rule syntaxes can also be used to construct a sample symbol sequence that reflects the graphic visual form of the sample molecular formula, which is not specifically limited here. For ease of description, the following uses sample symbol sequences constructed with the Chemfig syntax as an example; this does not limit the syntax rules used to construct the sample symbol sequence.

The network architecture of the molecular formula recognition model is not limited and can be set according to actual needs. In one embodiment, as shown in FIG. 4, which is a schematic structural diagram of an embodiment of the molecular formula recognition model provided by the present application, the molecular formula recognition model includes a molecular formula encoding network and a molecular formula decoding network. The molecular formula encoding network is initialized from the image-text encoding network of an image-text recognition model, and the image-text encoding network is trained using sample images containing at least one of sample text and a sample formula. For example, as shown in FIG. 5, which is a schematic diagram of an embodiment of sample data provided by the present application, the sample text is the Chinese-English text "and passes through point A" and the sample formula is a mathematical formula. The image-text encoding network is therefore trained using sample images containing at least one of Chinese-English text and mathematical formulas. In other words, some parameters of the molecular formula encoding network can directly reuse the trained parameters of at least one of a Chinese-English text recognition model and a mathematical formula recognition model. This speeds up the convergence of the molecular formula recognition model and, to some extent, compensates for the shortage of molecular formula training data caused by the difficulty of annotation, thereby improving the recognition accuracy of the molecular formula recognition model.
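
The parameter reuse described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each model's parameters are held in a plain dictionary keyed by name and that encoder parameters share the hypothetical prefix "encoder."; neither the storage format nor the naming comes from the patent.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical stand-in for real weight tensors


def init_encoder_from_pretrained(
    formula_params: Params,
    pretrained_params: Params,
    encoder_prefix: str = "encoder.",  # assumed naming convention
) -> List[str]:
    """Copy encoder weights from a pretrained image-text model.

    Only parameters whose name starts with `encoder_prefix`, that also exist
    in the molecular formula model, and that have the same size are copied.
    Returns the names of the parameters that were reused.
    """
    copied = []
    for name, value in pretrained_params.items():
        if not name.startswith(encoder_prefix):
            continue  # decoder weights are not reused
        target = formula_params.get(name)
        if target is not None and len(target) == len(value):
            formula_params[name] = list(value)
            copied.append(name)
    return copied


# Toy example: the encoder layer is reused, the decoder keeps its own weights.
pretrained = {"encoder.conv1": [0.1, 0.2], "decoder.out": [0.9]}
formula_model = {"encoder.conv1": [0.0, 0.0], "decoder.out": [0.0, 0.0, 0.0]}
copied = init_encoder_from_pretrained(formula_model, pretrained)
```

In a real deep learning framework the same idea would typically be expressed by loading a filtered set of pretrained weights into the new model before training begins.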

In one embodiment, the sample symbol sequence includes strings representing atomic groups in the sample molecular formula and strings representing chemical bonds in the sample molecular formula, and each string representing a chemical bond includes at least the angle of the bond. Specifically, as shown in FIG. 3, annotating the left input sample image with the Chemfig syntax yields the sample symbol sequence "H_{3}C-N(=[1]O)=[-1]O", which includes the string "H_{3}C" representing an atomic group and the strings "-" and "=" representing chemical bonds. The strings representing chemical bonds carry the bond angle: for example, the "N" atom is connected to two double bonds, pointing to the upper right and the lower right respectively, and the sample symbol sequence uses "angle codes" such as "[1]" and "[-1]" to indicate the approximate direction of each "=" bond. The bond angle information describes the graphic visual form of the sample molecular formula more closely and provides richer visual annotation information.
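
To make the structure of such a sequence concrete, the following is a small sketch of a tokenizer for a restricted subset of the notation: atomic groups, the bonds "-", "=" and "~", optional integer angle codes, and parentheses. It assumes ASCII bond characters and the subscript form "H_{3}C" used in the FIG. 3 discussion, and it relies on Chemfig's default convention that an integer angle code is a multiple of 45 degrees. The function name and token layout are illustrative, not part of the patent.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, str, Optional[int]]  # (kind, text, angle in degrees or None)

TOKEN_RE = re.compile(
    r"(?P<bond>[-=~])(?:\[(?P<angle>-?\d+)\])?"  # bond with optional angle code
    r"|(?P<paren>[()])"                          # branch delimiters
    r"|(?P<atom>[A-Za-z0-9_{}]+)"                # atomic group such as H_{3}C
)


def tokenize(seq: str) -> List[Token]:
    """Split a restricted Chemfig-style string into atom, bond and paren tokens."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(seq):
        m = TOKEN_RE.match(seq, pos)
        if m is None:
            raise ValueError(f"unexpected character {seq[pos]!r} at position {pos}")
        if m.group("bond"):
            code = m.group("angle")
            # An integer angle code n stands for n * 45 degrees.
            degrees = int(code) * 45 if code is not None else None
            tokens.append(("bond", m.group("bond"), degrees))
        elif m.group("paren"):
            tokens.append(("paren", m.group("paren"), None))
        else:
            tokens.append(("atom", m.group("atom"), None))
        pos = m.end()
    return tokens


tokens = tokenize("H_{3}C-N(=[1]O)=[-1]O")
```

Here the two double bonds come out with angles of 45 and -45 degrees, matching the "upper right" and "lower right" directions described above.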

In one embodiment, the sample symbol sequence further includes branch symbols representing branches in the sample molecular formula, and each branch symbol at least indicates the direction of its branch. That is, the sample symbol sequence includes strings representing atomic groups in the sample molecular formula, strings representing chemical bonds in the sample molecular formula (each including at least the angle of the bond), and branch symbols representing branches in the sample molecular formula.

In one embodiment, when the sample symbol sequence includes branch symbols representing branches in the sample molecular formula, the sample symbol sequence consists of a sample first subsequence for the trunk of the sample molecular formula and a sample second subsequence for each branch. The sample first subsequence contains the branch symbols representing the respective branches, and each branch symbol also carries the identifier of its branch; each sample second subsequence contains an order symbol, which carries the identifier of the branch.

Step S12: Recover the target molecular formula in the image to be recognized based on the symbol sequence.

In this embodiment, the target molecular formula in the image to be recognized is recovered from the symbol sequence obtained as described above. Specifically, the characters in the second subsequence representing each branch are attached, via the chemical bond in that second subsequence, at the position of the corresponding branch symbol in the first subsequence, and the direction in which the bond joins the first subsequence is adjusted according to the bond angle, so as to recover the target molecular formula.
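
The attachment of branches to the trunk can be sketched as follows. The concrete spelling of branch symbols and order symbols is not given in this part of the description, so the markers "<b:ID>" (branch symbol in the first subsequence) and "<o:ID>" (order symbol at the head of a second subsequence) are hypothetical placeholders chosen only for this illustration. Bond angles are simply carried along inside each branch string.

```python
import re
from typing import Dict, FrozenSet, List

BRANCH_RE = re.compile(r"<b:(\w+)>")  # hypothetical branch symbol in the trunk
ORDER_RE = re.compile(r"^<o:(\w+)>")  # hypothetical order symbol heading a branch


def index_branches(second_subsequences: List[str]) -> Dict[str, str]:
    """Map each branch identifier to the branch content that follows its order symbol."""
    branches: Dict[str, str] = {}
    for sub in second_subsequences:
        m = ORDER_RE.match(sub)
        if m is None:
            raise ValueError(f"missing order symbol in {sub!r}")
        branches[m.group(1)] = sub[m.end():]
    return branches


def restore(first_subsequence: str, second_subsequences: List[str]) -> str:
    """Replace every branch symbol with its parenthesized branch content."""
    branches = index_branches(second_subsequences)

    def expand(seq: str, active: FrozenSet[str]) -> str:
        def attach(m: "re.Match[str]") -> str:
            branch_id = m.group(1)
            if branch_id in active:
                raise ValueError(f"branch {branch_id!r} refers to itself")
            # Branches may themselves contain branch symbols, so expand recursively.
            return "(" + expand(branches[branch_id], active | {branch_id}) + ")"

        return BRANCH_RE.sub(attach, seq)

    return expand(first_subsequence, frozenset())


restored = restore("H_{3}C-N<b:1>=[-1]O", ["<o:1>=[1]O"])
```

With these placeholder markers, the trunk and its single branch are merged back into the string "H_{3}C-N(=[1]O)=[-1]O".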

In one embodiment, the symbol sequence can be restored to the DFS (Depth First Search) form of the target molecular formula; the specific meaning of the DFS form is given in the related description below and is not repeated here. It can be understood that, in other embodiments, the symbol sequence can also be restored to a graph data structure form of the target molecular formula.

In the above embodiment, the image to be recognized is recognized using the molecular formula recognition model to obtain a symbol sequence, and the target molecular formula in the image to be recognized is recovered based on the symbol sequence; the molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula. Therefore, the symbol sequence obtained by the molecular formula recognition model accurately reflects the graphic visual form of the target molecular formula; that is, it faithfully reflects the content of the molecular formula in the image to be recognized and fully retains its original information, thereby improving the accuracy and generalization ability of recognizing the target molecular formula.

Please refer to FIG. 6, which is a schematic flow chart of an embodiment of step S11 shown in FIG. 1. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the step order shown in FIG. 6. As shown in FIG. 6, this embodiment includes:

Step S111: Perform structural parsing on the original label sequence to obtain graph data of the sample molecular formula.

The sample molecular formula is annotated in advance as an original label sequence in a preset molecular formula markup language. The preset molecular formula markup language is not limited and can be set according to actual needs. For example, the preset molecular formula markup language is the Chemfig language.

For the image of a single sample molecular formula, different choices of starting point and traversal order produce multiple different but correct annotation results; that is, the same image can have multiple different original label sequences. An annotator picks only one of them, essentially at random, and the annotation process cannot be constrained by explicit rules. The original label sequences obtained with the Chemfig language are therefore ambiguous, which makes the molecular formula recognition model harder to train. In addition, to keep the original label sequences compact, the Chemfig syntax defines many prior rules that reduce the number of characters an annotator has to type. For example, a benzene ring can be written simply as "*6(-----)", in which the 6 single bonds need no explicit direction; their directions are inferred automatically from the ring rule. Such prior rules are fairly complex and hard for the molecular formula recognition model to learn, which degrades generalization performance.

Therefore, in this embodiment, structural parsing is performed on the original label sequence to obtain graph data of the sample molecular formula, in order to remove the ambiguity and complexity of the original label sequence and unify the representation of the same sample molecular formula as far as possible. Specifically, FIG. 7 is a schematic flow chart of an embodiment of processing an original label sequence to obtain a sample symbol sequence provided by the present application. The "Original Chemfig annotation string" column contains three different Chemfig annotation strings, that is, three different original label sequences, which all express the same sample molecular formula. In this embodiment, the original label sequence is therefore parsed structurally according to the original Chemfig syntax rules to obtain the graph data of the sample molecular formula.

The graph data consists of several data elements, which include nodes and edges connecting the nodes; nodes represent atomic groups and edges represent chemical bonds. Specifically, using the Chemfig syntax rules, hand-written parsing code recovers all the data elements behind the original label sequence, such as atomic groups and chemical bonds, and connects them to obtain the graph data. As shown in FIG. 7, the "Annotated Graph representation" column shows the graph data of the sample molecular formula obtained after structural parsing. The nodes, drawn as rectangular boxes, are atomic groups. An atomic group can be a string such as "HO" or "COOH", or it can be empty, as for the vertices of a benzene ring; for a benzene ring drawn with a circle, the circle in the middle is also treated as a special atomic group. The edges, that is, the line segments other than the atomic groups, are chemical bonds. A chemical bond can be a single, double or triple bond, among others; FIG. 7 contains only single bonds. Each end of a chemical bond must be connected to an atomic group, and an atomic group can be connected to several chemical bonds. With atomic groups as nodes and chemical bonds as edges, this connectivity yields the graph data of the sample molecular formula. As can be seen from FIG. 7, however varied the original label sequences are, as long as they describe the same sample molecular formula the underlying graph data is the same, which removes the ambiguity of original label sequences annotated in the Chemfig language.
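
The way different annotation strings collapse to the same graph can be sketched for a restricted subset of the notation: named atomic groups, the bonds "-", "=" and "~" with optional angle codes, and parenthesized branches. Empty nodes, ring shorthand and bond angles are deliberately left out, and the comparison below is a simple label-based edge signature rather than a full graph isomorphism test. All function names are illustrative assumptions.

```python
import re
from typing import List, Optional, Tuple

TOKEN_RE = re.compile(r"([-=~])(?:\[-?\d+\])?|([()])|([A-Za-z0-9_{}]+)")

Edge = Tuple[int, int, str]  # (node index, node index, bond symbol)


def parse_graph(seq: str) -> Tuple[List[str], List[Edge]]:
    """Parse a restricted Chemfig-style string into node labels and edges."""
    nodes: List[str] = []
    edges: List[Edge] = []
    branch_points: List[int] = []    # atoms at which an open branch started
    current: Optional[int] = None    # atom the next bond attaches to
    pending: Optional[str] = None    # bond waiting for its second endpoint
    pos = 0
    while pos < len(seq):
        m = TOKEN_RE.match(seq, pos)
        if m is None:
            raise ValueError(f"unexpected character {seq[pos]!r} at position {pos}")
        bond, paren, atom = m.groups()
        if bond:
            if current is None or pending is not None:
                raise ValueError("a bond must follow an atomic group")
            pending = bond
        elif paren == "(":
            if current is None:
                raise ValueError("a branch must start at an atomic group")
            branch_points.append(current)
        elif paren == ")":
            current = branch_points.pop()  # return to where the branch started
        else:
            nodes.append(atom)
            new = len(nodes) - 1
            if pending is not None:
                edges.append((current, new, pending))
                pending = None
            current = new
        pos = m.end()
    return nodes, edges


def signature(seq: str) -> List[Tuple[Tuple[str, ...], str]]:
    """Order-independent description of the edges, keyed by node labels."""
    nodes, edges = parse_graph(seq)
    return sorted((tuple(sorted((nodes[u], nodes[v]))), b) for u, v, b in edges)


# Three annotations of the same molecule with different starts and traversal orders.
variants = ["Cl-C(=O)-Br", "Br-C(-Cl)=O", "O=C(-Cl)-Br"]
signatures = [signature(s) for s in variants]
```

All three variants produce the same signature, which is the behavior the graph representation is meant to provide; a production parser would also need to handle repeated labels, empty nodes and the ring shorthand.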

Each data element in the graphic data is marked with data attributes; that is, structural parsing of the original label sequence also yields the attributes of each data element in the graphic data of the sample molecular formula. The data attributes marked on each data element are not limited here and can be set according to actual needs.

In one embodiment, the data attributes of a node include the characters representing the atomic group, i.e., which atomic group the node represents. In one embodiment, the data attributes of an edge include at least the angle of the chemical bond. It can be understood that, in other embodiments, the data attributes of an edge may further include which chemical bond the edge represents, or whether the chemical bond is a single, double, or triple bond, i.e., the covalent bond type.
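As an illustration of the node and edge attributes described above, the graphic data could be held in a structure like the following. This is a minimal sketch, not the patent's implementation: the class names `Node`, `Edge`, and `MolGraph`, the index-based addressing, and the use of Chemfig-style bond characters (`-`, `=`, `~`) as the bond-type attribute are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An atomic group; `text` may be empty (e.g. a bare ring vertex)."""
    text: str = ""


@dataclass
class Edge:
    """A chemical bond between two nodes, addressed by node index."""
    a: int
    b: int
    angle: int        # direction from node a to node b, in degrees
    bond: str = "-"   # assumed encoding: "-" single, "=" double, "~" triple


@dataclass
class MolGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, text=""):
        self.nodes.append(Node(text))
        return len(self.nodes) - 1

    def add_edge(self, a, b, angle, bond="-"):
        self.edges.append(Edge(a, b, angle % 360, bond))

    def neighbors(self, i):
        """Yield (neighbor index, edge) pairs for node i."""
        for e in self.edges:
            if e.a == i:
                yield e.b, e
            elif e.b == i:
                yield e.a, e


# Three atomic groups joined by two single bonds drawn left to right.
g = MolGraph()
ho = g.add_node("HO")
c = g.add_node("")        # vertex with no content
cooh = g.add_node("COOH")
g.add_edge(ho, c, 0)
g.add_edge(c, cooh, 0)
```

Because every original label sequence for the same drawing would be parsed into the same set of nodes and edges, equality of two such graphs is what removes the labeling ambiguity.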

To ensure that the graphic data of the sample molecular formula obtained by structurally parsing the original label sequence is correct, the original label sequence itself must be correctly labeled. In one embodiment, as shown in FIG. 8, which is a schematic flowchart of an embodiment of determining whether the original label sequence is correct provided by the present application, before the original label sequence is structurally parsed to obtain the graphic data of the sample molecular formula, whether the original label sequence is correctly labeled is determined by comparing the rendered molecular formula with the sample molecular formula. This specifically includes the following sub-steps:

Step S81: Render the original label sequence using a rendering engine of a preset molecular formula markup language to obtain a rendered molecular formula.

In this embodiment, the original label sequence is rendered using a rendering engine of a preset molecular formula markup language to obtain a rendered molecular formula. The preset molecular formula markup language is not limited here and can be set according to actual needs. For example, the preset molecular formula markup language is the Chemfig language.

In a specific embodiment, as shown in FIG. 3, the preset molecular formula markup language is the Chemfig language, for which a rendering engine already exists. The original label sequence is rendered by the Chemfig rendering engine to obtain the rendered molecular formula, i.e., the annotation rendering result in FIG. 3.

Step S82: Determine whether the original label sequence is correctly labeled based on the result of checking the difference between the rendered molecular formula and the sample molecular formula.

In this embodiment, whether the original label sequence is correctly labeled is determined from the result of checking the difference between the rendered molecular formula and the sample molecular formula. Specifically, as shown in FIG. 3, the annotation rendering result is consistent with its corresponding input image, that is, there is no difference between the rendered molecular formula and the corresponding sample molecular formula, so the original label sequence is determined to be correctly labeled.

Step S112: Traverse the graphic data to obtain the sample symbol sequence.

Because the data structure of the graphic data of the sample molecular formula is relatively complex, it must be converted into a form on which the molecular formula recognition model can be trained. Therefore, in this embodiment, the sample symbol sequence is obtained by traversing the graphic data. The traversal method is not limited here and can be set according to actual needs. For example, the graphic data may be traversed by depth-first search (DFS) to obtain the sample symbol sequence, or by breadth-first search (BFS) to obtain the sample symbol sequence.

Specifically, a starting point is selected according to a fixed rule, the graphic data is traversed according to a fixed rule, and each visited node and edge is appended to the end of a string, which completes the traversal of the graphic data and yields the sample symbol sequence.

In a specific embodiment, as shown in FIG. 7 and FIG. 9, where FIG. 9 is a schematic diagram of an embodiment of the sample symbol sequence provided by the present application, the graphic data shown in FIG. 7 is traversed depth-first to obtain the sample symbol sequence, i.e., the regularized label, shown in the upper right corner of FIG. 7 and in FIG. 9. As this sample symbol sequence shows, a branch is represented by a pair of parentheses, and the sequence contains four kinds of elements: "atomic group", "chemical bond", "angle of chemical bond", and "virtual body". Every chemical bond is followed by the angle descriptor "[:<angle value>]"; the angle value is already computed when the graphic data is built and can be output directly in this format. A "virtual body" is used to represent the nesting of branches. Consequently, when the sample molecular formula has many deep branches, the sample symbol sequence obtained by depth-first traversal contains heavy virtual-body nesting, and the molecular formula recognition model focuses too much on depth during learning and neglects the global structure.
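The depth-first serialization described above can be sketched as follows. This is only an illustration under stated assumptions: the graph is given as hypothetical `labels` and `adj` dictionaries, it is assumed to be acyclic (rings would need extra bookkeeping), and the rule "every child except the last is wrapped in parentheses" is an assumed fixed rule, not one stated in the source.

```python
def dfs_serialize(labels, adj, start):
    """Serialize an acyclic molecular graph depth-first.

    labels: node id -> atomic-group text ('' allowed)
    adj:    node id -> list of (bond, angle, neighbor) in a fixed order
    """
    visited = set()

    def visit(node):
        visited.add(node)
        out = labels[node]
        children = [(b, a, n) for b, a, n in adj[node] if n not in visited]
        for k, (bond, angle, nxt) in enumerate(children):
            # Each bond is followed by its angle descriptor "[:<angle>]".
            piece = f"{bond}[:{angle}]" + visit(nxt)
            # Every child except the last is a side branch -> parentheses.
            out += piece if k == len(children) - 1 else f"({piece})"
        return out

    return visit(start)


# Hypothetical example: HO-CH(-CH3)-CH2-OH drawn left to right.
labels = {0: "HO", 1: "CH", 2: "CH2", 3: "CH3", 4: "OH"}
adj = {
    0: [("-", 0, 1)],
    1: [("-", 180, 0), ("-", 90, 3), ("-", 0, 2)],
    2: [("-", 180, 1), ("-", 0, 4)],
    3: [("-", 270, 1)],
    4: [("-", 180, 2)],
}
sequence = dfs_serialize(labels, adj, 0)
```

Each additional level of branching adds another level of parentheses, which is the nesting growth the paragraph above identifies as a drawback of depth-first traversal.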

In other specific embodiments, as shown in FIG. 7, the graphic data shown in FIG. 7 is traversed breadth-first to obtain the sample symbol sequence, i.e., the regularized label, shown in the lower right corner of FIG. 7. As shown in FIG. 10, which is a schematic flowchart of an embodiment of step S112 shown in FIG. 6, traversing the graphic data breadth-first to obtain the sample symbol sequence specifically includes the following sub-steps:

Step S1001: Traverse, in the graphic data, the data elements on the trunk of the sample molecular formula to obtain a sample first subsequence, and traverse the data elements on the branches of the sample molecular formula to obtain sample second subsequences.

In this embodiment, the data elements on the trunk of the sample molecular formula are first traversed in the graphic data to obtain the sample first subsequence, and the data elements on the branches are then traversed to obtain the sample second subsequences. The character string representing a data element in the sample symbol sequence includes the data attributes of that data element. In the sample first subsequence, each branch is represented by a branch symbol, which characterizes the direction and the identifier of the branch; each sample second subsequence contains a sequence symbol, which characterizes the identifier of the branch. In other words, this embodiment follows a whole-first, part-second traversal strategy: the data elements on the trunk are traversed first (the whole), and the data elements on the branches are traversed afterwards (the parts). With this strategy, when the resulting sample symbol sequence is used to train the molecular formula recognition model, the model can better capture the overall structure of the sample molecular formula without neglecting its branch structure, i.e., its details.

The form of the branch symbol and the sequence symbol is not limited here and can be set according to actual needs. For example, "\place{x}[:{θ}]" denotes a branch symbol: when a branch is encountered while traversing the data elements on the trunk, the content of the branch is not output for the time being, and the branch symbol stands in for it; {x} is an integer serial number, and {θ} is an angle characterizing the direction of the branch. "\solveplace{x}" denotes a sequence symbol, which marks the start of the output of the content of the corresponding branch symbol "\place{x}[:{θ}]"; {x} is again the integer serial number.

In one embodiment, once the content corresponding to a branch symbol has been output, that is, once the data elements on any one branch of the sample molecular formula have been traversed in the graphic data, the output ends with a preset end symbol; in other words, the last symbol of the content output for a branch symbol is the preset end symbol. The form of the preset end symbol is not limited here and can be set according to actual needs. For example, the preset end symbol is "\eol" or "\eos".

For example, as shown in FIG. 11, which is a schematic diagram of another embodiment of the sample symbol sequence provided by the present application, the graphic data of the sample molecular formula is shown in FIG. 11(a). Following the whole-first, part-second principle, the data elements on the trunk are traversed first to obtain the sample first subsequence, shown in FIG. 11(b). The data elements on the branches are then traversed; since there are three branch structures, three second subsequences are obtained. The second subsequence for the branch symbol "\place1[:300]" is "\solveplace1\eol", that for "\place2[:0]" is "\solveplace2-[:0]C O O H\eol", and that for "\place3[:60]" is "\solveplace3--[:60]\circle\eol".

Specifically, in the breadth-first traversal, the sample symbol sequence text may first be initialized to "\place[:0]\eol", the queue Q set to [(x,0)], where the element x is any element of the graphic data G, and the branch indicator ind initialized to 1. On this basis, while the queue Q is not empty, the following steps are executed in a loop: (1) take the element x and the current placeholder indicator ind' from the head of the queue Q (ind' is 0 initially and increases by 1 each time this step is executed); (2) extract from the graphic data G the maximal connected subgraph G' containing the element x; (3) find the longest branch on G' and use it as the main branch Bmain; (4) delete all elements on Bmain from the graphic data G; (5) output "\solveplace{ind'}" to the sample symbol sequence text; (6) traverse each element y on Bmain in connection order and append y to the end of text; if y is an atomic group connected to a chemical bond e that is not on Bmain, a branch is determined to exist at y, the branch direction angle is set to the angle of the chemical bond e, "\place{ind}[:{angle}]" is output to text, (e,ind) is added to the queue Q, and ind is increased by 1. After Bmain has been traversed, "\eol" is appended to the end of text, which completes one iteration. Note that the first iteration yields the first subsequence of the trunk, and the subsequent iterations yield the second subsequences of the branches.
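The loop above can be sketched in a few dozen lines. The following is a simplified illustration rather than the patent's procedure: it assumes an acyclic graph given as hypothetical `labels` and `adj` dictionaries, takes the main branch to be the longest path starting from a given node (instead of searching the whole connected subgraph), enqueues the bond together with the node it leads to, omits the initial "\place[:0]\eol" prefix, and writes the serial number without braces, as in the FIG. 11 example.

```python
from collections import deque


def longest_path(adj, removed, node, came_from=None):
    """Longest simple path [node, (bond, angle), node, ...] starting at
    `node` in an acyclic graph, ignoring nodes already in `removed`."""
    best = [node]
    for bond, angle, nxt in adj[node]:
        if nxt == came_from or nxt in removed:
            continue
        cand = [node, (bond, angle)] + longest_path(adj, removed, nxt, node)
        if len(cand) > len(best):
            best = cand
    return best


def bfs_serialize(labels, adj, start):
    """Whole-first, part-second serialization. Node ids must not be tuples."""
    tokens, removed = [], set()
    queue = deque([(None, start, 0)])  # (entry bond, first node, placeholder id)
    next_id = 1
    while queue:
        entry, node, pid = queue.popleft()
        path = longest_path(adj, removed, node)           # main branch
        removed.update(p for p in path if not isinstance(p, tuple))
        tokens.append(f"\\solveplace{pid}")
        if entry is not None:                             # bond leading into the branch
            tokens.append(f"{entry[0]}[:{entry[1]}]")
        for item in path:
            if isinstance(item, tuple):                   # a bond on the main branch
                tokens.append(f"{item[0]}[:{item[1]}]")
                continue
            if labels[item]:
                tokens.append(labels[item])
            for bond, angle, nxt in adj[item]:
                if nxt not in removed:                    # bond leaves the main branch
                    tokens.append(f"\\place{next_id}[:{angle}]")
                    queue.append(((bond, angle), nxt, next_id))
                    next_id += 1
        tokens.append("\\eol")
    return tokens


# Hypothetical example: HO-CH(-CH3)-CH2-OH drawn left to right.
labels = {0: "HO", 1: "CH", 2: "CH2", 3: "CH3", 4: "OH"}
adj = {
    0: [("-", 0, 1)],
    1: [("-", 180, 0), ("-", 90, 3), ("-", 0, 2)],
    2: [("-", 180, 1), ("-", 0, 4)],
    3: [("-", 270, 1)],
    4: [("-", 180, 2)],
}
tokens = bfs_serialize(labels, adj, 0)
```

In this example the trunk is emitted first with a single placeholder at the CH group, and the methyl branch follows as its own subsequence, so the nesting depth of the output stays flat no matter how deeply the molecule branches.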

Step S1002: Combine the sample first subsequence and the sample second subsequences to obtain the sample symbol sequence.

In this embodiment, the sample first subsequence and the sample second subsequences are combined to obtain the sample symbol sequence. That is, the first subsequence obtained by traversing the data elements on the trunk and the second subsequences obtained by traversing the data elements on the branches are combined into the sequence corresponding to the data elements of the entire sample molecular formula, i.e., the sample symbol sequence.
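Combining amounts to concatenation, and the inverse operation is equally simple. The sketch below is an illustration only: it assumes that every subsequence, including the trunk, starts with a "\solveplaceN" token and ends with "\eol", as in the loop described for the breadth-first traversal, and the function names are hypothetical.

```python
def combine(first, seconds):
    """Concatenate the trunk subsequence and the branch subsequences."""
    out = list(first)
    for sub in seconds:
        out.extend(sub)
    return out


def split_subsequences(tokens):
    """Group tokens by the serial number of their \\solveplace token."""
    groups, current = {}, None
    for tok in tokens:
        if tok.startswith("\\solveplace"):
            current = int(tok[len("\\solveplace"):])
            groups[current] = []
        elif tok == "\\eol":
            current = None
        elif current is not None:
            groups[current].append(tok)
    return groups


trunk = r"\solveplace0 HO -[:0] CH \place1[:90] \eol".split()
branch = r"\solveplace1 -[:90] CH3 \eol".split()
sequence = combine(trunk, [branch])
groups = split_subsequences(sequence)
```

Splitting by serial number is the first step one would need when recovering the molecular formula from a decoded symbol sequence, since each group can then be attached at its matching placeholder.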

Please refer to FIG. 12, which is a schematic flowchart of an embodiment of training the molecular formula recognition model provided by the present application. Note that, provided substantially the same result is obtained, this embodiment is not limited to the order of the flow shown in FIG. 12.

As shown in FIG. 4, the molecular formula recognition model is built before it is trained. The model includes a molecular formula encoding network and a molecular formula decoding network. Specifically, the molecular formula encoding network E encodes the input image I and outputs the sample feature map F. The molecular formula decoding network D uses autoregressive, iterative modeling: at each decoding moment t, it takes the sample feature map F and the decoding result $y_{t-1}$ of the previous decoding moment as input and outputs the decoding result $y_t$ of the current decoding moment, iterating in this way until an end signal is encountered. Assuming there are T decoding moments in total, the final decoding result is $y = [y_1, y_2, \ldots, y_T]$, where $y_t$ is a basic modeling unit. The sample symbol sequence shown in FIG. 11 is split on spaces to obtain the sequence of modeling units.

The decoding process of a conventional molecular formula decoding network D is as follows:

$$c_t = \mathrm{Attn}(s_{t-1}, y_{t-1}, F)$$

$$o_t, s_t = \mathrm{RNN}(c_t, y_{t-1}, s_{t-1})$$

$$p_t = \mathrm{MLP}(o_t)$$

$$y_t = \arg\max_{i}\, p_t(i), \quad i \in \{0, 1, \ldots, V-1\}$$

Here, $c_t$ is the feature to be attended to at the current decoding moment; Attn is the attention network; $s_{t-1}$ is the decoding state at the previous decoding moment; $y_{t-1}$ is the decoding result at the previous decoding moment; F is the sample feature map; $s_t$ is the hidden state of the recurrent neural network RNN; $o_t$ is the output feature of the RNN; $p_t$ is the output probability distribution, $p_t \in \mathbb{R}^V$, where V is the number of kinds of modeling units; MLP is a neural network composed of a fully connected layer and a softmax layer; and $y_t$ is the decoding result at the current decoding moment. Specifically, correlations are computed between the previous decoding state $s_{t-1}$ together with the previous decoding result $y_{t-1}$ and the features at all positions of the sample feature map F, and the features at all positions of F are summed with these correlations as weights to obtain the feature $c_t$ to be attended to at the current decoding moment. The RNN then relates $c_t$, the previous decoding state $s_{t-1}$, and the previous decoding result $y_{t-1}$ to obtain the output feature $o_t$ and the updated decoding state $s_t$ for the current decoding moment. Symbol prediction is then performed on $o_t$ to obtain the probability distribution $p_t$ over the preset symbols. Specifically, a decoding symbol dictionary is provided, and the probability distribution contains a predicted probability for each preset decoding symbol in the dictionary; the preset symbol with the largest predicted probability is taken as the decoding result $y_t$ of the current decoding moment.
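The weighted-sum and argmax steps just described can be illustrated in plain Python. This is a sketch of those two steps only, under stated assumptions: dot-product scoring is assumed as the correlation measure (the source does not specify one), the query vector is assumed to have already been formed from $s_{t-1}$ and $y_{t-1}$, and the RNN and MLP are left out. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(query, feature_map):
    """Weight every position of F by its correlation with the query, then sum.

    feature_map: list of feature vectors, one per spatial position.
    Returns the context vector c_t and the attention weights.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in feature_map]
    weights = softmax(scores)
    dim = len(feature_map[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, feature_map))
               for d in range(dim)]
    return context, weights


def greedy_symbol(logits, vocab):
    """Turn logits into p_t and pick the most probable preset symbol."""
    p = softmax(logits)
    best = max(range(len(p)), key=p.__getitem__)
    return vocab[best], p


# Two positions in a toy feature map; the query strongly matches the first.
context, weights = attend([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
symbol, probs = greedy_symbol([0.1, 2.0, -1.0], ["C", "O", "\\eol"])
```

In a real decoder these operations would run on tensors inside a deep-learning framework; the list-based version is only meant to make the data flow of one decoding moment explicit.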

The sample symbol sequence contains many pairs of branch symbols (e.g., \place{x}[:{θ}]) and sequence symbols (e.g., \solveplace{x}), and everything decoded from a sequence symbol up to the next preset end symbol (e.g., \eol) is the expanded content of the corresponding branch symbol. Decoding a branch therefore relies on the decoding state from the moment its branch symbol was decoded. In the sample symbol sequence, however, a sequence symbol can be far from its branch symbol. As shown in FIG. 11(b), the sequence symbol "\solveplace2" and the placeholder "\place2[:0]" are 8 decoding moments apart, and the memory of the RNN in a conventional molecular formula decoding network D cannot relate two decoding moments that far apart. Therefore, to let the molecular formula recognition model make full use of the sample symbol sequence and learn the graphic visual form of the sample molecular formula, in this embodiment an "angle-conditioned decoding" mechanism is added to the molecular formula decoding network D of the model being built, so that the model can effectively relate the features of a placeholder (branch symbol) and of the sequence symbol that expands it even when they are far apart in time. Specifically, as shown in FIG. 12, this embodiment includes:

Step S1201: Randomly select a reference state that has not yet been selected.

In this embodiment, the "angle-conditioned decoding" mechanism added to the molecular formula decoding network D consists of an additional conditional input $z_t \in [0, t-1]$, which tells the molecular formula recognition model which branch symbol's content is currently being expanded.

In one embodiment, the sample molecular formula is composed of a trunk and branches, and the sample symbol sequence includes a sample subsequence of the trunk and sample subsequences of the branches. Before any branch is decoded, the predicted subsequence of the trunk of the sample molecular formula must be decoded first. Specifically, decoding is performed based on a preset state and the sample feature map to obtain the predicted subsequence of the trunk. Because decoding the trunk is the first decoding stage and no decoding state is yet available for reference, the preset state in this first stage may be set to a zero vector.

In this embodiment, after the predicted subsequence of the trunk has been decoded, the predicted subsequences of the branches must be decoded. First, a reference state that has not yet been selected is chosen at random, where a reference state is the decoding state at the moment a branch symbol was decoded, so that the unexpanded branch symbol can then be decoded. Choosing the reference state at random perturbs the order in which branch symbols are decoded during training. This prevents the molecular formula recognition model from overfitting to one fixed order of traversing branches and forces it to take the input conditional information fully into account and use that information to decode correctly, which improves the generalization ability of the model.

For example, as shown in FIG. 13, which is a schematic diagram of an embodiment of the decoding process of the molecular formula recognition model provided by the present application, after the model decodes an end symbol "eol" at decoding moment t-1, it randomly selects a not-yet-selected reference state at the next decoding moment.

Step S1202: Take the branch represented by the branch symbol corresponding to the reference state as the branch to be decoded.

In this embodiment, the branch represented by the branch symbol corresponding to the reference state is taken as the branch to be decoded. That is, the branch whose branch symbol produced the selected decoding state becomes the branch to be decoded, so that the content corresponding to that branch symbol can be decoded next.

Step S1203: Decode based on the reference state and the sample feature map extracted from the sample image to obtain the predicted subsequence of the branch to be decoded.

In this embodiment, decoding is performed based on the reference state and the sample feature map extracted from the sample image to obtain the predicted subsequence of the branch to be decoded. Specifically, as shown in FIG. 13, the molecular formula recognition model takes the branch represented by the branch symbol corresponding to the randomly selected reference state as the branch to be decoded and decodes its content. The predicted subsequence of the branch to be decoded is obtained as follows:

$$c_t = \mathrm{Attn}(s_{t-1}, s_{z_t}, y_{t-1}, F)$$

$$o_t, s_t = \mathrm{RNN}(c_t, y_{t-1}, s_{t-1}, s_{z_t})$$

$$p_t = \mathrm{MLP}(o_t)$$

Here, $c_t$ is the feature to be attended to at the current decoding moment; Attn is the attention network; $s_{t-1}$ is the decoding state at the previous decoding moment; $s_{z_t}$ is the decoding state at the moment the branch symbol was decoded; $y_{t-1}$ is the decoding result at the previous decoding moment; F is the sample feature map; $s_t$ is the hidden state of the recurrent neural network RNN; $o_t$ is the output feature of the RNN; $p_t$ is the output probability distribution, $p_t \in \mathbb{R}^V$, where V is the number of kinds of modeling units; MLP is a neural network composed of a fully connected layer and a softmax layer; and $y_t$ is the decoding result at the current decoding moment. Specifically, correlations are computed between the previous decoding state $s_{t-1}$, the previous decoding result $y_{t-1}$, and the branch-symbol decoding state $s_{z_t}$ on the one hand and the features at all positions of the sample feature map F on the other, and the features at all positions of F are summed with these correlations as weights to obtain the feature $c_t$ to be attended to at the current decoding moment. The RNN then relates $c_t$, the previous decoding state $s_{t-1}$, the previous decoding result $y_{t-1}$, and the branch-symbol decoding state $s_{z_t}$ to obtain the output feature $o_t$ and the updated decoding state $s_t$ for the current decoding moment. Symbol prediction is then performed on $o_t$ to obtain the probability distribution $p_t$ over the preset symbols. Specifically, a decoding symbol dictionary is provided, and the probability distribution contains a predicted probability for each preset decoding symbol in the dictionary; the preset symbol with the largest predicted probability is taken as the decoding result $y_t$ of the current decoding moment.

Step S1204: Adjust the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch.

In this embodiment, the network parameters of the molecular formula recognition model are adjusted according to the difference between the sample subsequence and the predicted subsequence belonging to the same branch, so that the model converges, which improves both its recognition ability and its generalization ability.

In one embodiment, a loss function such as cross entropy is used to compute a sub-prediction loss between each character of the sample subsequence and the corresponding predicted character of the predicted subsequence belonging to the same branch, and the sub-prediction losses are aggregated into a total loss. The network parameters of the molecular formula recognition model can then be adjusted based on the total loss by an optimization method such as gradient descent, and this process is repeated until the model has been trained to convergence on the sample images containing sample molecular formulas.
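The loss aggregation described above can be illustrated as follows. This is a minimal sketch: summation is assumed as the way the sub-prediction losses are aggregated (the source does not say whether they are summed or averaged), probability distributions are plain lists, and the function names are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Sub-prediction loss for one decoding moment."""
    return -math.log(probs[target_index])


def subsequence_loss(pred_dists, target_ids):
    """Loss of one predicted subsequence against its sample subsequence."""
    return sum(cross_entropy(p, t) for p, t in zip(pred_dists, target_ids))


def total_loss(subsequences):
    """subsequences: list of (pred_dists, target_ids), one pair per
    trunk or branch; pairs must belong to the same branch."""
    return sum(subsequence_loss(p, t) for p, t in subsequences)


# Toy example with a two-symbol vocabulary and two subsequences.
branch_a = ([[0.5, 0.5], [0.25, 0.75]], [0, 1])
branch_b = ([[0.9, 0.1]], [0])
loss = total_loss([branch_a, branch_b])
```

Pairing predictions with labels per branch, rather than by position in the full sequence, is what allows the branch order to be shuffled during training without changing the loss.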

In other embodiments, the sample molecular formula is composed of a trunk and branches, and the sample symbol sequence includes a sample subsequence of the trunk and sample subsequences of the branches. The network parameters of the molecular formula recognition model may then be adjusted according to both the difference between the predicted subsequence and the sample subsequence of the trunk and the difference between the sample subsequence and the predicted subsequence belonging to the same branch.

在一实施方式中，如图14所示，图14是本申请提供的确定解码结束一实施例的流程示意图，在调整分子式识别模型的网络参数之前，需要确认整个样本图像的解码是否结束，具体包括如下子步骤：In one embodiment, as shown in FIG. 14 , FIG. 14 is a flow chart of an embodiment of determining the end of decoding provided by the present application. Before adjusting the network parameters of the molecular formula recognition model, it is necessary to confirm whether the decoding of the entire sample image is completed, which specifically includes the following sub-steps:

Step S1401: Check whether the end symbol has been decoded and all reference states have been selected.

In this embodiment, it is determined whether the end symbol (for example, eol) has been decoded and whether all reference states have been selected. If the end symbol has not yet been decoded, or the end symbol has been decoded but some reference states have not yet been selected, decoding of the sample image has not ended; in the latter case, step S1402 is executed. If the end symbol has been decoded and all reference states have been selected, decoding of the sample image has ended, and step S1403 is executed.

Step S1402: In response to the end symbol having been decoded while some reference states have not yet been selected, execute the step of randomly selecting a reference state that has not been selected and the subsequent steps.

In this embodiment, the sample subsequence ends with the end symbol. In response to the end symbol having been decoded while some reference states have not yet been selected, the step of randomly selecting a reference state that has not been selected and the subsequent steps are executed again. That is, as long as unselected reference states remain, this step and the subsequent steps are repeated until all reference states have been selected, which indicates that decoding has ended.

Step S1403: In response to the end symbol having been decoded and all reference states having been selected, execute the step of adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch.

In this embodiment, in response to the end symbol having been decoded and all reference states having been selected, it is determined that decoding has ended, and the step of adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch is executed. That is, once the end symbol (for example, "eol") has been decoded in the current decoding stage and all reference states have been selected, decoding has ended and the parameter adjustment step is then performed.
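Steps S1401 to S1403 can be sketched as a simple control loop. The sketch below is illustrative only: it assumes that all reference states have already been collected while decoding the trunk, and it stands in for the real decoder with a hypothetical callback `decode_branch`; the function names and data layout are assumptions, not part of the present application.

```python
import random

EOL = "eol"  # end symbol, following the example given in the text


def decoding_finished(last_symbol, unselected):
    """S1401: decoding has ended only if the end symbol was decoded
    and no unselected reference state remains."""
    return last_symbol == EOL and len(unselected) == 0


def decode_branches(trunk_subseq, reference_states, decode_branch, rng):
    """Decode every branch once, picking reference states in random order.

    trunk_subseq: predicted subsequence of the trunk, ending with EOL.
    reference_states: hypothetical dict mapping branch id -> saved decoder state.
    decode_branch: hypothetical callable(state) -> predicted subsequence.
    rng: a random.Random instance used for the random selection.
    """
    unselected = dict(reference_states)
    predicted = {}
    last_symbol = trunk_subseq[-1]
    while not decoding_finished(last_symbol, unselected):
        if last_symbol != EOL:
            raise ValueError("subsequence did not end with the end symbol")
        # S1402: randomly select a reference state that has not been selected.
        branch_id = rng.choice(sorted(unselected))
        state = unselected.pop(branch_id)
        subseq = decode_branch(state)
        predicted[branch_id] = subseq
        last_symbol = subseq[-1]
    # S1403: decoding has ended; the caller may now compute the loss
    # and adjust the network parameters.
    return predicted
```

A caller would typically pass `random.Random()` as `rng` and compare the returned predicted subsequences against the sample subsequences of the same branches.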

Please refer to FIG. 15, which is a schematic framework diagram of an embodiment of a molecular formula recognition device provided by the present application. The molecular formula recognition device 150 includes a sequence recognition module 151 and a formula recovery module 152. The sequence recognition module 151 is used to recognize an image to be recognized by using a molecular formula recognition model to obtain a symbol sequence; the formula recovery module 152 is used to recover the target molecular formula in the image to be recognized based on the symbol sequence. The molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from the graphic visual form of the sample molecular formula.

The sample symbol sequence includes character strings representing atomic groups in the sample molecular formula and character strings representing chemical bonds in the sample molecular formula, and a character string representing a chemical bond includes at least the angle of the chemical bond.

The sample symbol sequence further includes branch symbols representing branches in the sample molecular formula, and a branch symbol represents at least the direction of the branch.

The sample symbol sequence is composed of a sample first subsequence of the trunk of the sample molecular formula and a sample second subsequence of each branch. The sample first subsequence contains branch symbols respectively representing the branches, and each branch symbol also represents the identifier of its branch. Each sample second subsequence contains an order symbol, and the order symbol represents the identifier of the branch.

The sample molecular formula is annotated in advance as an original label sequence in a preset molecular formula markup language, and the grammatical rules of the preset molecular formula markup language follow the graphic visual form of the molecular formula. The molecular formula recognition device 150 further includes an acquisition module 153, and the steps by which the acquisition module 153 obtains the sample symbol sequence include: performing structural analysis based on the original label sequence to obtain graphic data of the sample molecular formula, wherein the graphic data is composed of a plurality of data elements, the data elements include nodes and edges connecting the nodes, the nodes represent atomic groups, the edges represent chemical bonds, and each data element in the graphic data is marked with a data attribute; and traversing the graphic data to obtain the sample symbol sequence.

The data attribute of a node includes the characters representing the atomic group; and/or the data attribute of an edge includes at least the angle of the chemical bond.

The acquisition module 153 being used to traverse the graphic data to obtain the sample symbol sequence specifically includes: traversing, in the graphic data, the data elements on the trunk of the sample molecular formula to obtain the sample first subsequence, and traversing, in the graphic data, the data elements on the branches of the sample molecular formula to obtain the sample second subsequences; and combining the sample first subsequence and the sample second subsequences to obtain the sample symbol sequence. The character string representing a data element in the sample symbol sequence includes the data attribute of that data element; a branch is represented in the sample first subsequence by a branch symbol, and the branch symbol represents the direction and the identifier of the branch; each sample second subsequence contains an order symbol, and the order symbol represents the identifier of the branch.
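The traversal described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the present application: the token spellings `bond[...]`, `branch[...]`, and `order[...]` are invented for the example, the trunk is supplied as an ordered list of node ids, each branch is a simple acyclic chain, and the direction of a branch is taken to be the angle of its first bond.

```python
from dataclasses import dataclass

EOL = "eol"


@dataclass
class Edge:
    src: str    # id of the node where the bond starts
    dst: str    # id of the node where the bond ends
    angle: int  # data attribute of the edge: bond angle in degrees


def build_sequences(nodes, edges, trunk):
    """Return (first_subsequence, list_of_second_subsequences).

    nodes: dict node id -> atomic-group characters (node data attribute).
    edges: list of Edge objects (chemical bonds).
    trunk: ordered node ids forming the trunk (assumed to be known).
    """
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge.src, []).append(edge)
    trunk_set = set(trunk)

    first, branches = [], []
    for i, node_id in enumerate(trunk):
        first.append(nodes[node_id])
        for edge in outgoing.get(node_id, []):
            if edge.dst not in trunk_set:
                branch_id = len(branches) + 1
                # The branch symbol carries a direction and an identifier.
                first.append(f"branch[{edge.angle},{branch_id}]")
                branches.append((branch_id, edge))
        if i + 1 < len(trunk):
            nxt = next(e for e in outgoing[node_id] if e.dst == trunk[i + 1])
            first.append(f"bond[{nxt.angle}]")
    first.append(EOL)

    seconds = []
    for branch_id, edge in branches:
        # The order symbol ties the subsequence back to its branch symbol.
        seq = [f"order[{branch_id}]", f"bond[{edge.angle}]", nodes[edge.dst]]
        current = edge.dst
        while outgoing.get(current):  # follow the chain (assumed acyclic)
            nxt = outgoing[current][0]
            seq += [f"bond[{nxt.angle}]", nodes[nxt.dst]]
            current = nxt.dst
        seq.append(EOL)
        seconds.append(seq)
    return first, seconds


def combine(first, seconds):
    """Concatenate the first subsequence with all second subsequences."""
    combined = list(first)
    for seq in seconds:
        combined.extend(seq)
    return combined
```

For a hypothetical three-node trunk CH3, CH, CH3 with a single OH branch attached to the middle node at 90 degrees, the first subsequence contains one branch symbol with identifier 1, and the single second subsequence starts with the matching order symbol.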

Before performing structural analysis based on the original label sequence to obtain the graphic data of the sample molecular formula, the acquisition module 153 is further used to: render the original label sequence using a rendering engine of the preset molecular formula markup language to obtain a rendered molecular formula; and determine whether the original label sequence is correctly annotated based on the result of checking the difference between the rendered molecular formula and the sample molecular formula.

The sample molecular formula is composed of a trunk and branches, the sample symbol sequence includes a sample subsequence of the trunk and sample subsequences of the branches, and each branch is represented by a branch symbol in the sample subsequence of the trunk. The molecular formula recognition device 150 further includes a training module 154, and the steps by which the training module 154 trains the molecular formula recognition model include: randomly selecting a reference state that has not been selected, wherein a reference state is the decoding state at the time a branch symbol is decoded; taking the branch represented by the branch symbol corresponding to the reference state as the branch to be decoded; decoding based on the reference state and the sample feature map extracted from the sample image to obtain a predicted subsequence of the branch to be decoded; and adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch.

The sample subsequence ends with the end symbol. The training module 154 being used to randomly select a reference state that has not been selected specifically includes: in response to the end symbol having been decoded while some reference states have not yet been selected, executing the step of randomly selecting a reference state that has not been selected and the subsequent steps.

Before adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch, the training module 154 is further used to: check whether the end symbol has been decoded and all reference states have been selected; and, in response to the end symbol having been decoded and all reference states having been selected, execute the step of adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch.

Before randomly selecting a reference state that has not been selected, the training module 154 is further used to decode based on a preset state and the sample feature map to obtain a predicted subsequence of the trunk. The training module 154 being used to adjust the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch specifically includes: adjusting the network parameters of the molecular formula recognition model based on the difference between the predicted subsequence and the sample subsequence of the trunk, and the difference between the sample subsequence and the predicted subsequence belonging to the same branch.

The molecular formula recognition model includes a molecular formula encoding network and a molecular formula decoding network, and the molecular formula encoding network is initialized from the image-text encoding network of an image-text recognition model, wherein the image-text encoding model is trained using sample images containing at least one of sample text and sample formulas.

Please refer to FIG. 16, which is a schematic framework diagram of an embodiment of an electronic device provided by the present application. The electronic device 160 includes a memory 161 and a processor 162 coupled to each other; the memory 161 stores program instructions, and the processor 162 is used to execute the program instructions to implement the steps in any of the above molecular formula recognition method embodiments. Specifically, the electronic device 160 may include, but is not limited to, a desktop computer, a laptop computer, a server, a mobile phone, a tablet computer, and the like, without limitation here.

Specifically, the processor 162 is used to control itself and the memory 161 to implement the steps in any of the above molecular formula recognition method embodiments. The processor 162 may also be referred to as a CPU (Central Processing Unit). The processor 162 may be an integrated circuit chip with signal processing capabilities. The processor 162 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor. In addition, the processor 162 may be jointly implemented by a plurality of integrated circuit chips.

Please refer to FIG. 17, which is a schematic framework diagram of an embodiment of a computer-readable storage medium provided by the present application. The computer-readable storage medium 170 stores program instructions 171 that can be run by a processor, and the program instructions 171 are used to implement the steps in any of the above molecular formula recognition method embodiments.

In some embodiments, the functions or modules of the device provided by the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments. For their specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for the sake of brevity.

The above description of the embodiments tends to emphasize the differences between them. For the parts that are the same or similar, the embodiments may be referred to each other, and they are not repeated here for the sake of brevity.

In the several embodiments provided in the present application, it should be understood that the disclosed methods and devices may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are merely embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

1. A molecular formula recognition method, comprising:

Recognizing an image to be recognized by using a molecular formula recognition model to obtain a symbol sequence;

Based on the symbol sequence, recovering the target molecular formula in the image to be recognized;

The molecular formula recognition model is trained using a sample image containing a sample molecular formula, the sample image is annotated with a sample symbol sequence of the sample molecular formula, and the sample symbol sequence is constructed from a graphic visual form of the sample molecular formula;

The sample molecular formula is pre-marked as an original label sequence in a preset molecular formula markup language, and the grammatical rules of the preset molecular formula markup language follow the graphic visual form of the molecular formula. The step of obtaining the sample symbol sequence includes:

Structural analysis is performed based on the original label sequence to obtain graphic data of the sample molecular formula; wherein the graphic data is composed of a plurality of data elements, the plurality of data elements include nodes and edges connecting the nodes, the nodes represent atomic groups, the edges represent chemical bonds, and each of the data elements in the graphic data is marked with a data attribute;

The sample symbol sequence is obtained by traversing based on the graphic data.

2. The method according to claim 1, characterized in that the sample symbol sequence includes a character string representing an atomic group in the sample molecular formula and a character string representing a chemical bond in the sample molecular formula, and the character string representing the chemical bond at least includes an angle of the chemical bond.

3. The method according to claim 2, characterized in that the sample symbol sequence further includes a branch symbol representing a branch in the sample molecular formula, and the branch symbol at least represents the direction of the branch.

4. The method according to claim 3, characterized in that the sample symbol sequence is composed of a sample first subsequence of the sample molecular formula trunk and a sample second subsequence of each of the branches, the sample first subsequence contains branch symbols representing each of the branches respectively, and the branch symbols also represent the identifiers of the branches, and the sample second subsequence contains an order symbol, and the order symbol represents the identifier of the branch.

5. The method according to claim 1, characterized in that the data attribute of the node includes a character representing the atomic group;

And/or, the data attributes of the edge include at least the angle of the chemical bond.

6. The method according to claim 1, characterized in that the traversal based on the graphic data to obtain the sample symbol sequence comprises:

Traversing the data elements on the trunk of the sample molecular formula in the graphic data to obtain a first subsequence of samples, and traversing the data elements on the branches of the sample molecular formula in the graphic data to obtain a second subsequence of samples;

Combining the first subsequence of samples with the second subsequence of samples to obtain the sample symbol sequence;

Wherein the character string representing the data element in the sample symbol sequence includes the data attribute of the data element, the branch is represented by a branch symbol in the sample first subsequence, and the branch symbol represents the direction and identifier of the branch, and the sample second subsequence contains an order symbol, and the order symbol represents the identifier of the branch.

7. The method according to claim 1, characterized in that before performing structural analysis based on the original label sequence to obtain graphic data of the sample molecular formula, the method further comprises:

Rendering the original label sequence using a rendering engine of the preset molecular formula markup language to obtain a rendered molecular formula;

Based on the difference check result between the rendered molecular formula and the sample molecular formula, it is determined whether the original label sequence is correctly labeled.

8. The method according to claim 1, characterized in that the sample molecular formula consists of a trunk and branches, the sample symbol sequence includes a sample subsequence of the trunk and a sample subsequence of the branch, and the branch is represented by a branch symbol in the sample subsequence of the trunk; the training step of the molecular formula recognition model comprises:

Randomly select a reference state that has not been selected; wherein the reference state is a decoding state when decoding to the branch symbol;

Taking the branch represented by the branch symbol corresponding to the reference state as the branch to be decoded;

Decoding is performed based on the reference state and the sample feature map extracted from the sample image to obtain a predicted subsequence of the branch to be decoded;

Based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch, the network parameters of the molecular formula recognition model are adjusted.

9. The method according to claim 8, wherein the sample subsequence ends with an end symbol; and the randomly selecting a reference state that has not been selected comprises:

In response to decoding the end symbol and there being a reference state that has not been selected, the step of randomly selecting a reference state that has not been selected and subsequent steps are performed.

10. The method according to claim 9, characterized in that before adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch, the method further comprises:

Checking whether the end symbol is decoded and all the reference states have been selected;

In response to decoding the end symbol and all the reference states being selected, the step of adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch is performed.

11. The method according to claim 8, characterized in that before randomly selecting a reference state that has not been selected, the method further comprises:

Decoding is performed based on a preset state and the sample feature map to obtain a predicted subsequence of the trunk;

The adjusting the network parameters of the molecular formula recognition model based on the difference between the sample subsequence and the predicted subsequence belonging to the same branch comprises:

Based on the difference between the predicted subsequence and the sample subsequence of the trunk, and the difference between the sample subsequence and the predicted subsequence belonging to the same branch, the network parameters of the molecular formula recognition model are adjusted.

12. The method according to claim 1, characterized in that the molecular formula recognition model includes a molecular formula encoding network and a molecular formula decoding network, and the molecular formula encoding network is initialized by the image-text encoding network of the image-text recognition model, and the image-text encoding model is trained using a sample image containing at least one of a sample text and a sample formula.

13. A molecular formula recognition device, comprising:

A sequence recognition module is used to recognize the image to be recognized by using a molecular formula recognition model to obtain a symbol sequence;

A formula recovery module, used for recovering the target molecular formula in the image to be recognized based on the symbol sequence;

The sample symbol sequence is obtained by traversing based on the graphic data.

14. An electronic device, characterized in that it comprises a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is used to execute the program instructions to implement the molecular formula recognition method according to any one of claims 1 to 12.

15. A computer-readable storage medium, characterized in that program instructions that can be executed by a processor are stored therein, wherein the program instructions are used to implement the molecular formula recognition method according to any one of claims 1 to 12.