CN115033635A - Data extraction method and device, processor and electronic equipment

Info

Publication number: CN115033635A
Application number: CN202210841925.3A
Authority: CN
Inventors: 赵文怡; 朱芳鹏
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2022-09-09

Abstract

The application discloses a data extraction method, a data extraction device, a processor and electronic equipment, and relates to the field of big data and other fields. The method comprises the following steps: acquiring structured data from a structured database to obtain a plurality of first data; acquiring structured data from a document to obtain a plurality of second data, wherein the document is an unstructured document; taking the first data and the second data as initial data, establishing a preset association relationship among all the initial data according to association fields, and establishing a target database according to the initial data; and determining a first target field and a second target field, and extracting target data from the target database through the preset association relationship, the first target field and the second target field. The method and the device solve the problem in the related art that, because a data source contains both structured data and unstructured data and the unstructured data cannot be retrieved, it is difficult to extract all the required data directly from the data source.

Description

Data extraction method, device, processor and electronic device

Technical Field

The present application relates to the field of big data and other fields, and in particular to a data extraction method, apparatus, processor and electronic device.

Background

Enterprises accumulate massive amounts of data from their daily operations and store massive amounts of structured data in structured database systems. However, enterprises also hold a large amount of unstructured data such as documents and reports, which cannot be stored uniformly in a single database. In addition, documents of the same type are often similar in structure but different in content, and for the same specific project multiple documents are often generated repeatedly with different emphases.

In the related art, business personnel review documents one by one by hand, extract useful information, look up the required fields in the database system, and compile the results into a new document. This manual approach involves a heavy workload and is inefficient. In addition, because of information asymmetry among business personnel, reported content overlaps, which wastes a great deal of labor. In the prior art, the required information elements are extracted by applying natural language processing to a certain type of document. Specifically, relevant metadata is extracted from unstructured text data and converted into structured data according to constraints.

However, the prior art can only extract content from a single type of document, whereas actual business work involves a large number of documents of various types related to a project. In the face of growing demand for unstructured text extraction, the massive amounts of accumulated information remain relatively independent of one another, and the existing structured data is not fully utilized.

No effective solution has yet been proposed for the problem in the related art that, because a data source contains both structured data and unstructured data and the unstructured data cannot be retrieved, it is difficult to extract all the required data directly from the data source.

Summary of the Invention

The main purpose of the present application is to provide a data extraction method, apparatus, processor and electronic device, so as to solve the problem in the related art that, because a data source contains both structured data and unstructured data and the unstructured data cannot be retrieved, it is difficult to extract all the required data directly from the data source.

To achieve the above purpose, according to one aspect of the present application, a data extraction method is provided. The method includes: acquiring structured data from a structured database to obtain a plurality of first data, wherein the structured database is used for storing structured data; acquiring structured data from documents to obtain a plurality of second data, wherein the documents are unstructured documents; taking the first data and the second data as initial data, establishing a preset association relationship among all the initial data according to association fields, and establishing a target database according to the initial data; and determining a first target field and a second target field, and extracting target data from the target database through the preset association relationship, the first target field and the second target field.

Optionally, establishing the preset association relationship among all the initial data according to the association fields includes: establishing an association relationship between each pair of initial data that share an identical field to obtain a plurality of association relationships, and combining the plurality of association relationships into the preset association relationship.

Optionally, extracting the target data from the target database through the preset association relationship, the first target field and the second target field includes: traversing the target database, determining the initial data containing the first target field to obtain first initial data, and determining the initial data containing the second target field to obtain second initial data; determining the first initial data as a start node and the second initial data as an end node, and determining all paths from the start node to the end node through the preset association relationship to obtain a plurality of paths, wherein each path contains nodes corresponding to at least two initial data; determining a target path from the plurality of paths; and extracting, from the target database, the data corresponding to each node in the target path to obtain the target data.

Optionally, determining the target path from the plurality of paths includes: in each path, determining the weight of each node according to the degree of accuracy of the initial data corresponding to the node, wherein the degree of accuracy characterizes how accurately the initial data was extracted from its data source; calculating the sum of the weights of all nodes on each path to obtain a plurality of path weights; and determining the path with the minimum path weight among the plurality of paths as the target path.

Optionally, determining the weight of a node according to the degree of accuracy of the initial data corresponding to the node includes: in the case that the initial data is second data, obtaining the accuracy rate with which the data extraction model extracted the second data, and determining the reciprocal of the accuracy rate as the weight of the node; and in the case that the initial data is first data, determining a preset value as the weight of the node.

Optionally, acquiring the plurality of second data from a plurality of documents includes: determining, according to the type of each document, the data extraction model of the corresponding type from a plurality of data extraction models; and extracting second data from each document according to the data extraction model of the corresponding type to obtain the plurality of second data.

Optionally, after extracting the target data from the target database through the preset association relationship, the first target field and the second target field, the method further includes: determining a preset condition, and removing data that does not meet the preset condition from the target data to obtain updated target data.

To achieve the above purpose, according to another aspect of the present application, a data extraction apparatus is provided. The apparatus includes: a first acquisition unit, configured to acquire structured data from a structured database to obtain a plurality of first data, wherein the structured database is used for storing structured data; a second acquisition unit, configured to acquire structured data from documents to obtain a plurality of second data, wherein the documents are unstructured documents; a determining unit, configured to take the first data and the second data as initial data, establish a preset association relationship among all the initial data according to association fields, and establish a target database according to the initial data; and an extraction unit, configured to determine a first target field and a second target field, and extract target data from the target database through the preset association relationship, the first target field and the second target field.

Through the present application, the following steps are adopted: acquiring structured data from a structured database to obtain a plurality of first data, wherein the structured database is used for storing structured data; acquiring structured data from documents to obtain a plurality of second data, wherein the documents are unstructured documents; taking the first data and the second data as initial data, establishing a preset association relationship among all the initial data according to association fields, and establishing a target database according to the initial data; and determining a first target field and a second target field, and extracting target data from the target database through the preset association relationship, the first target field and the second target field. This solves the problem in the related art that, because a data source contains both structured data and unstructured data and the unstructured data cannot be retrieved, it is difficult to extract all the required data directly from the data source. By extracting structured data from unstructured documents and linking the structured data in the existing database with the extracted structured data, data can be retrieved from unstructured documents.

Brief Description of the Drawings

The accompanying drawings, which form a part of the present application, are provided for further understanding of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of a data extraction method provided according to an embodiment of the present application;

FIG. 2 is a schematic diagram of paths provided according to an embodiment of the present application;

FIG. 3 is a flowchart of a document generation method provided according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a data extraction apparatus provided according to an embodiment of the present application;

FIG. 5 is a schematic diagram of an electronic device provided according to an embodiment of the present application.

Detailed Description

It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other where no conflict arises. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative work shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances for the purposes of the embodiments of the present application described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

It should be noted that the user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data for display and data for analysis) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties.

The present invention is described below in conjunction with preferred implementation steps. FIG. 1 is a flowchart of a data extraction method provided according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:

Step S101: acquire structured data from a structured database to obtain a plurality of first data, wherein the structured database is used for storing structured data.

Specifically, structured data is data whose corresponding information can be obtained through an inherent key, and whose format is fixed, such as numbers and symbols.

Step S102: acquire structured data from documents to obtain a plurality of second data, wherein the documents are unstructured documents.

Specifically, an unstructured document is a document that stores unstructured data such as text and pictures. Structured data is extracted from a plurality of documents through data extraction models of various types to obtain a plurality of second data. For example, second data is extracted from documents of the invoice type using an invoice document extraction model.

Step S103: take the first data and the second data as initial data, establish a preset association relationship among all the initial data according to association fields, and establish a target database according to the initial data.

Specifically, the initial data includes the first data and the second data. To facilitate retrieval, the first data and the second data are stored in the same target database. An association field is an identical field shared by initial data. For example, if two initial data both contain the same company name, an association relationship is established between them. Each initial data may have multiple association relationships. The association relationships among all the initial data together constitute the preset association relationship.

Step S104: determine a first target field and a second target field, and extract target data from the target database through the preset association relationship, the first target field and the second target field.

Specifically, the first target field and the second target field may be fields selected by the user for retrieving data. For example, the name of company m is used as the first target field, and the transfer records of company m are used as the second target field. The start node is determined in the preset association relationship through the first target field, the end node is determined in the preset association relationship through the second target field, and a target path is determined through the preset association relationship. The initial data corresponding to all nodes on the target path is then extracted from the target database to obtain the target data.

In the data extraction method provided by the embodiments of the present application, structured data is acquired from a structured database to obtain a plurality of first data, wherein the structured database is used for storing structured data; structured data is acquired from documents to obtain a plurality of second data, wherein the documents are unstructured documents; the first data and the second data are taken as initial data, a preset association relationship among all the initial data is established according to association fields, and a target database is established according to the initial data; and a first target field and a second target field are determined, and target data is extracted from the target database through the preset association relationship, the first target field and the second target field. This solves the problem in the related art that, because a data source contains both structured data and unstructured data and the unstructured data cannot be retrieved, it is difficult to extract all the required data directly from the data source. By extracting structured data from unstructured documents and linking the structured data in the existing database with the extracted structured data, data can be retrieved from unstructured documents.

To facilitate extracting data from the target database, a preset association relationship is established among the initial data. Optionally, in the data extraction method provided by the embodiments of the present application, establishing the preset association relationship among all the initial data according to the association fields includes: establishing an association relationship between each pair of initial data that share an identical field to obtain a plurality of association relationships, and combining the plurality of association relationships into the preset association relationship.

Specifically, the identical field may be the same customer code, customer name, certificate number, and so on. For example, if the fields of initial data A contain the customer code m0101 and the fields of initial data B contain the same customer code m0101, an association relationship is established between initial data A and initial data B. An association relationship is established between every pair of initial data sharing an identical field, and all the association relationships are combined into the preset association relationship. Establishing the preset association relationship among the initial data makes it easy to extract mutually related data.
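
The pairing rule described above can be sketched as follows. This is only an illustrative sketch, not the applicant's implementation: the record layout (a dict of field name to value), the function name `build_associations`, and the field names `customer_code` and `customer_name` are all assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def build_associations(records, link_fields):
    """Return an undirected adjacency map linking records that share a field value.

    `records` maps a record id to a dict of fields (hypothetical layout).
    `link_fields` lists the field names allowed to act as association fields.
    """
    # Index record ids by (field name, value) so records with equal values meet.
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for name in link_fields:
            if name in fields:
                index[(name, fields[name])].add(rec_id)

    # Every pair of records under the same (field, value) key gets an edge.
    graph = {rec_id: set() for rec_id in records}
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Initial data A and B share customer code m0101; C shares nothing.
records = {
    "A": {"customer_code": "m0101", "customer_name": "m"},
    "B": {"customer_code": "m0101", "amount": 500},
    "C": {"customer_code": "m0202"},
}
graph = build_associations(records, ["customer_code", "customer_name"])
```

Here the union of all edges in `graph` plays the role of the preset association relationship: A and B are linked, and C remains isolated.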

Once the preset association relationship has been established, target data can be extracted quickly according to it. Optionally, in the data extraction method provided by the embodiments of the present application, extracting the target data from the target database through the preset association relationship, the first target field and the second target field includes: traversing the target database, determining the initial data containing the first target field to obtain first initial data, and determining the initial data containing the second target field to obtain second initial data; determining the first initial data as a start node and the second initial data as an end node, and determining all paths from the start node to the end node through the preset association relationship to obtain a plurality of paths, wherein each path contains nodes corresponding to at least two initial data; determining a target path from the plurality of paths; and extracting, from the target database, the data corresponding to each node in the target path to obtain the target data.

Specifically, the first target field may be a field that the user needs to retrieve, such as company name C, and the second target field may be a field related to the first target field, such as the transaction records of company C. The user needs to extract from the target database all data associated with the first target field. Therefore, the initial data containing the first target field is taken as the start node and the initial data containing the second target field is taken as the end node. All paths from the start node to the end node are obtained according to the association relationships among the initial data, and a target path is determined from among them. The target path is the path with the smallest weight among all the paths, and it is also the path with the most accurate retrieval result. After the target path is determined, the initial data corresponding to all nodes on the target path is extracted; this is the target data. Extracting the target data according to the preset association relationship improves retrieval efficiency.
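
As a rough illustration of "determining all paths from the start node to the end node", the sketch below enumerates simple paths (paths that visit no node twice) with a depth-first search. The source does not say how paths are enumerated, so the algorithm, the function name `all_simple_paths` and the adjacency-map representation are assumptions. The edges are chosen to be consistent with the two paths later listed for FIG. 2.

```python
def all_simple_paths(graph, start, end):
    """Yield every path from start to end that visits no node twice."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # skip nodes already on this path
                stack.append((nxt, path + [nxt]))


# Assumed edges: A-B, B-C, B-D, C-D.
graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"B", "C"},
}
paths = sorted(all_simple_paths(graph, "A", "D"))
```

With these edges, `paths` contains exactly A-B-C-D and A-B-D. Exhaustive enumeration grows quickly on dense graphs; a shortest-path algorithm over node weights would be one possible alternative when only the minimum-weight path is needed.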

Determining the target path requires calculating the path weight of each path. Optionally, in the data extraction method provided by the embodiments of the present application, determining the target path from the plurality of paths includes: in each path, determining the weight of each node according to the degree of accuracy of the initial data corresponding to the node, wherein the degree of accuracy characterizes how accurately the initial data was extracted from its data source; calculating the sum of the weights of all nodes on each path to obtain a plurality of path weights; and determining the path with the minimum path weight among the plurality of paths as the target path.

Specifically, the degree of accuracy may be the accuracy with which the initial data in the target database was acquired. The weight of each node is obtained as the reciprocal of the degree of accuracy of its corresponding initial data. The weights of the nodes on each path are summed to give that path's weight, and the path with the smallest path weight is the target path. Determining the target path helps the user extract the most accurate target data.

For example, FIG. 2 is a schematic diagram of paths provided according to an embodiment of the present application. As shown in FIG. 2, the start node is A and the end node is D, and there are two paths from A to D, namely A-B-D and A-B-C-D. The weight of node A is 1.5, the weight of node B is 1.6, the weight of node C is 1.4, and the weight of node D is 1. The path weight of A-B-D is therefore 1.5 + 1.6 + 1 = 4.1, and the path weight of A-B-C-D is 1.5 + 1.6 + 1.4 + 1 = 5.5. The path with the smallest weight is A-B-D, so the target path is A-B-D.
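
The FIG. 2 arithmetic can be reproduced in a few lines. The node weights and the two paths come from the example above; the names `node_weights`, `path_weight` and `target_path` are only illustrative.

```python
# Node weights and candidate paths as given in the FIG. 2 example.
node_weights = {"A": 1.5, "B": 1.6, "C": 1.4, "D": 1.0}
paths = [["A", "B", "D"], ["A", "B", "C", "D"]]


def path_weight(path, weights):
    """Sum the weights of all nodes on the path."""
    return sum(weights[node] for node in path)


# The target path is the candidate with the smallest summed node weight.
target_path = min(paths, key=lambda p: path_weight(p, node_weights))
```

Because every node weight is positive, adding a node to a path can only increase its weight, which is why the shorter path A-B-D wins here.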

Before the target path is determined, the weight of each node must be determined. Optionally, in the data extraction method provided by the embodiments of the present application, determining the weight of a node according to the degree of accuracy of the initial data corresponding to the node includes: in the case that the initial data is second data, obtaining the accuracy rate with which the data extraction model extracted the second data, and determining the reciprocal of the accuracy rate as the weight of the node; and in the case that the initial data is first data, determining a preset value as the weight of the node.

Specifically, the weight of a node is determined by the type of its initial data. If the initial data is second data, it was extracted by a data extraction model, and different data extraction models have different accuracy rates; the reciprocal of the accuracy rate is used as the weight of the second data. If the initial data is first data, it comes from the structured data of the database, so its accuracy rate is 1 and its weight is also 1; that is, the preset value is 1. Setting node weights in this way allows the most accurate data extraction path to be selected.
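
A minimal sketch of this weighting rule follows. The function name `node_weight`, the `source` labels "database" and "document", and the validation of the accuracy range are assumptions added for illustration; only the reciprocal rule and the preset value of 1 come from the text.

```python
def node_weight(source, accuracy=None, preset=1.0):
    """Return the weight of a node from the origin of its initial data.

    source   -- "database" for first data, "document" for second data
                (hypothetical labels)
    accuracy -- accuracy rate of the extraction model, in (0, 1]
    preset   -- weight used for first data; the text uses 1
    """
    if source == "database":
        return preset
    if source == "document":
        if accuracy is None or not 0 < accuracy <= 1:
            raise ValueError("accuracy must be in (0, 1] for document data")
        return 1.0 / accuracy
    raise ValueError(f"unknown source: {source!r}")


w_first = node_weight("database")        # preset value for first data
w_second = node_weight("document", 0.5)  # reciprocal of the model accuracy
```

Since an accuracy rate never exceeds 1, a node holding extracted data never weighs less than a node holding database data, so minimum-weight paths lean toward more reliable sources.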

The second data is extracted from documents using data extraction models of various types. Optionally, in the data extraction method provided by the embodiments of the present application, acquiring the plurality of second data from a plurality of documents includes: determining, according to the type of each document, the data extraction model of the corresponding type from a plurality of data extraction models; and extracting second data from each document according to the data extraction model of the corresponding type to obtain the plurality of second data.

Specifically, the data extraction models may include a contract document extraction model, a credit document extraction model, an invoice document extraction model, and so on. Second data is extracted from documents of the contract type using the contract document extraction model, from documents of the credit type using the credit document extraction model, and from documents of the invoice type using the invoice document extraction model, giving a plurality of second data. Extracting the second data from documents with type-specific data extraction models avoids the problem that target data cannot be retrieved from unstructured data.
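
The per-type dispatch can be sketched as a lookup from document type to extractor. The source does not describe the extraction models themselves, so the regex-based functions below are hypothetical stand-ins for trained models, and the type labels, field names and sample text are all invented for illustration.

```python
import re


def _find(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Toy stand-ins for the invoice and contract extraction models.
def extract_invoice(text):
    return {
        "customer_name": _find(r"Customer:\s*(\w+)", text),
        "amount": _find(r"Amount:\s*(\d+)", text),
    }


def extract_contract(text):
    return {
        "customer_name": _find(r"Party A:\s*(\w+)", text),
        "contract_no": _find(r"Contract No\.\s*(\w+)", text),
    }


EXTRACTORS = {"invoice": extract_invoice, "contract": extract_contract}


def extract_second_data(documents):
    """Run each document through the extractor registered for its type."""
    results = []
    for doc in documents:
        extractor = EXTRACTORS.get(doc["type"])
        if extractor is None:
            continue  # no model for this document type; skip it
        results.append(extractor(doc["text"]))
    return results


documents = [
    {"type": "invoice", "text": "Customer: m Amount: 500"},
    {"type": "contract", "text": "Contract No. C42 Party A: m"},
    {"type": "memo", "text": "no extractor registered"},
]
second_data = extract_second_data(documents)
```

Supporting a new document type then only requires registering another entry in `EXTRACTORS`.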

由于抽取到的目标数据中仍存在部分属于用户不需要提取的数据，因此对目标数据进行筛选，得到更新后的目标数据，可选地，在本申请实施例提供的数据提取方法中，在通过预设关联关系、第一目标字段和第二目标字段在目标数据库中提取目标数据之后，方法还包括：确定预设条件，从目标数据中剔除不符合预设条件的数据，得到更新后的目标数据。Since there are still some data that the user does not need to extract in the extracted target data, the target data is filtered to obtain updated target data. After the preset association relationship, the first target field and the second target field are extracted from the target database, the method further includes: determining a preset condition, removing data that does not meet the preset condition from the target data, and obtaining an updated target data.

For example, the preset condition may be that the target data has customer name m; data that does not have customer name m is removed from the target data to obtain the updated target data. Further filtering the target data ensures that it is not mixed with invalid data.
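The filtering step can be illustrated with a short Python sketch. It assumes, purely for illustration, that each item of target data is a dictionary and that the customer name is stored under a hypothetical key `customer_name`; the preset condition is modelled as a predicate function.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def has_customer_name(name: str) -> Callable[[Record], bool]:
    """Build a preset condition: the record must carry the given customer name."""
    return lambda record: record.get("customer_name") == name


def filter_target_data(
    target_data: List[Record], condition: Callable[[Record], bool]
) -> List[Record]:
    """Drop every record that does not meet the preset condition."""
    return [record for record in target_data if condition(record)]
```

Modelling the condition as a predicate keeps the filter independent of any particular field, so other preset conditions can be supplied without changing `filter_target_data`.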

According to another embodiment of the present application, a document generation method is provided. FIG. 3 is a flowchart of the document generation method provided according to an embodiment of the present application. As shown in FIG. 3, the indicators extracted from the massive documents of the modeling platform are first obtained, and a corresponding database is established. The obtained indicator library is then fused with the existing database, knowledge of the form document/table - associated field - document/table is extracted, and an association between the database and the document library is established. Next, the shortest path between the data containing the required fields is retrieved. Finally, a document is automatically generated from all of the extracted fields according to a document template.
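The last step, filling a document template with the extracted fields, might look like the following sketch. It uses Python's standard `string.Template`; the template text and field names are assumptions for illustration, and the application does not specify a template format.

```python
from string import Template
from typing import Dict


def generate_document(template_text: str, fields: Dict[str, str]) -> str:
    """Fill `$name` placeholders in the template with the extracted field values.

    `safe_substitute` leaves any placeholder without a value untouched, so a
    missing field shows up visibly in the output instead of raising an error.
    """
    return Template(template_text).safe_substitute(fields)
```

For example, `generate_document("Customer: $customer_name", {"customer_name": "Acme Ltd"})` returns `"Customer: Acme Ltd"`.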

With the document generation method provided by the embodiment of the present application, unstructured data is converted and fused with structured data, so that all of the data is placed in the same database. By extracting from massive documents and establishing an indicator library, business personnel can define the indicators to be output and directly generate documents containing them, which reduces their workload.

It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

The embodiment of the present application further provides a data extraction device. It should be noted that the data extraction device of the embodiment of the present application may be used to execute the data extraction method provided by the embodiment of the present application. The data extraction device provided by the embodiment of the present application is described below.

FIG. 4 is a schematic diagram of a data extraction device according to an embodiment of the present application. As shown in FIG. 4, the device includes:

a first obtaining unit 10, configured to obtain structured data from a structured database to obtain a plurality of first data, wherein the structured database is used to store structured data;

a second obtaining unit 20, configured to obtain structured data from a document to obtain a plurality of second data, wherein the document is an unstructured document;

a determining unit 30, configured to take the first data and the second data as initial data, establish a preset association relationship between all the initial data according to associated fields, and establish a target database according to the initial data; and

an extraction unit 40, configured to determine a first target field and a second target field, and extract target data from the target database through the preset association relationship, the first target field and the second target field.

In the data extraction device provided by the embodiment of the present application, the first obtaining unit 10 obtains structured data from a structured database to obtain a plurality of first data, wherein the structured database is used to store structured data; the second obtaining unit 20 obtains structured data from a document to obtain a plurality of second data, wherein the document is an unstructured document; the determining unit 30 takes the first data and the second data as initial data, establishes a preset association relationship between all the initial data according to associated fields, and establishes a target database according to the initial data; and the extraction unit 40 determines a first target field and a second target field, and extracts target data from the target database through the preset association relationship, the first target field and the second target field. This solves the problem in the related art that, because a data source contains both structured and unstructured data and the unstructured data cannot be searched, it is difficult to extract all of the required data directly from the data source. By extracting structured data from unstructured documents and establishing connections between the structured data in the existing database and the extracted structured data, data can be retrieved from unstructured documents.

Optionally, in the data extraction device provided in the embodiment of the present application, the determining unit 30 includes: an establishing module, configured to establish an association relationship between each pair of initial data that have the same field, obtain a plurality of association relationships, and combine the plurality of association relationships into the preset association relationship.
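One way to picture the establishing module is as building an undirected graph over the initial data. The sketch below assumes that each item of initial data (a table or an extracted document) is described by the set of field names it contains, and that "the same field" means a shared field name; both the representation and the example names are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Set


def build_associations(initial_data: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Link every pair of initial data items that share at least one field name.

    The returned adjacency mapping plays the role of the combined preset
    association relationship.
    """
    graph: Dict[str, Set[str]] = {name: set() for name in initial_data}
    for (a, fields_a), (b, fields_b) in combinations(initial_data.items(), 2):
        if fields_a & fields_b:  # at least one common field
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

An adjacency mapping is used because later steps only need to know which items are directly associated, not which field links them; a real system might also record the linking field on each edge.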

Optionally, in the data extraction device provided in the embodiment of the present application, the extraction unit 40 includes: a traversal module, configured to traverse the target database, determine the initial data in which the first target field exists to obtain first initial data, and determine the initial data in which the second target field exists to obtain second initial data; a first determination module, configured to determine the first initial data as a start node, determine the second initial data as an end node, and determine all paths from the start node to the end node through the preset association relationship to obtain a plurality of paths, wherein each path contains the nodes corresponding to at least two initial data; a second determination module, configured to determine a target path from the plurality of paths; and an extraction module, configured to extract from the target database the data corresponding to each node in the target path, to obtain the target data.
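The path determination performed by the first determination module can be sketched as a depth-first enumeration of simple paths. The example assumes the preset association relationship is available as an adjacency mapping and that the start and end nodes are distinct; the function name `all_paths` is hypothetical.

```python
from typing import Dict, List, Set


def all_paths(graph: Dict[str, Set[str]], start: str, end: str) -> List[List[str]]:
    """Enumerate every simple path (no repeated node) from `start` to `end`."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if node == end:
            paths.append(path)
            return
        # Neighbours are visited in sorted order so the result is deterministic.
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in path:  # never revisit a node on the current path
                dfs(neighbour, path + [neighbour])

    dfs(start, [start])
    return paths
```

Enumerating all simple paths can be expensive on a densely connected graph; if only the minimum-weight path is needed, a shortest-path algorithm could be substituted, although the application describes enumerating all paths first.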

Optionally, in the data extraction device provided in the embodiment of the present application, the second determination module includes: a first determination sub-module, configured to determine, in each path, the weight of each node according to the degree of accuracy of the initial data corresponding to the node, wherein the degree of accuracy characterizes how accurately the initial data was extracted from the data source; a calculation sub-module, configured to calculate the sum of the weights of all nodes on each path to obtain a plurality of path weights; and a second determination sub-module, configured to determine the path with the minimum path weight among the plurality of paths as the target path.

Optionally, in the data extraction device provided in the embodiment of the present application, the first determination sub-module includes: an accuracy rate acquisition module, configured to, when the initial data is second data, obtain the accuracy rate with which the second data was extracted by the data extraction model, and determine the reciprocal of the accuracy rate as the weight of the node; and a weight determination module, configured to, when the initial data is first data, determine a preset value as the weight of the node.
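The weighting and selection rules above can be combined in a short Python sketch. It assumes each node is described by a dictionary with a hypothetical `kind` key (`"first"` or `"second"`) and, for second data, an `accuracy` value in (0, 1]. The preset value of 1.0 is an assumption chosen so that structured data is weighted like a perfectly accurate extraction; the application does not state the value.

```python
from typing import Dict, List

PRESET_WEIGHT = 1.0  # assumed preset value for first (structured) data


def node_weight(node: Dict[str, object]) -> float:
    """Weight of a node: reciprocal of the extraction accuracy for second data,
    the preset value for first data."""
    if node["kind"] == "second":
        return 1.0 / float(node["accuracy"])
    return PRESET_WEIGHT


def path_weight(path: List[str], nodes: Dict[str, Dict[str, object]]) -> float:
    """Sum of the weights of all nodes on the path."""
    return sum(node_weight(nodes[name]) for name in path)


def select_target_path(
    paths: List[List[str]], nodes: Dict[str, Dict[str, object]]
) -> List[str]:
    """Return the path with the minimum path weight."""
    return min(paths, key=lambda path: path_weight(path, nodes))
```

Because the weight is the reciprocal of the accuracy, a less accurate extraction yields a larger weight, so the minimum-weight path favours both fewer nodes and more reliable data.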

Optionally, in the data extraction device provided in the embodiment of the present application, the second obtaining unit 20 includes: a third determination module, configured to determine, according to the type of each document, a data extraction model of the corresponding type from a plurality of data extraction models; and an extraction module, configured to extract second data from each document according to the data extraction model of the corresponding type, to obtain a plurality of second data.

Optionally, in the data extraction device provided in the embodiment of the present application, the device further includes: an update unit, configured to determine a preset condition and remove from the target data any data that does not meet the preset condition, to obtain updated target data.

The data extraction device includes a processor and a memory. The first obtaining unit 10, the second obtaining unit 20, the determining unit 30, the extraction unit 40 and the like are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and data is retrieved from unstructured documents by adjusting kernel parameters.

The memory may include forms of computer-readable media such as non-permanent memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

An embodiment of the present invention provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the data extraction method is implemented.

An embodiment of the present invention provides a processor configured to run a program, wherein the data extraction method is executed when the program runs.

As shown in FIG. 5, an embodiment of the present invention provides an electronic device. The device 501 includes a processor, a memory, and a program stored in the memory and executable on the processor; when the processor executes the program, the steps of the data extraction method are implemented. The device herein may be a server, a PC, a PAD, a mobile phone, or the like.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the data extraction method.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of computer-readable media such as non-permanent memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "including a..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims

1. A data extraction method, characterized by comprising:

obtaining structured data from a structured database to obtain a plurality of first data, wherein the structured database is used to store structured data;

obtaining structured data from a document to obtain a plurality of second data, wherein the document is an unstructured document;

taking the first data and the second data as initial data, establishing a preset association relationship between all the initial data according to associated fields, and establishing a target database according to the initial data; and

determining a first target field and a second target field, and extracting target data from the target database through the preset association relationship, the first target field and the second target field.

2. The method according to claim 1, wherein establishing the preset association relationship between all the initial data according to the associated fields comprises:

establishing an association relationship between each pair of the initial data that have the same field to obtain a plurality of association relationships, and combining the plurality of association relationships into the preset association relationship.

3. The method according to claim 1, wherein extracting target data from the target database through the preset association relationship, the first target field and the second target field comprises:

traversing the target database, determining the initial data in which the first target field exists to obtain first initial data, and determining the initial data in which the second target field exists to obtain second initial data;

determining the first initial data as a start node, and determining the second initial data as an end node;

determining all paths from the start node to the end node according to the preset association relationship to obtain a plurality of paths, wherein each of the paths includes the nodes corresponding to at least two of the initial data;

determining a target path from the plurality of paths; and

extracting, from the target database, the data corresponding to each node in the target path to obtain the target data.

4. The method according to claim 3, wherein determining a target path from the plurality of paths comprises:

determining, in each path, the weight of each node according to the degree of accuracy of the initial data corresponding to the node, wherein the degree of accuracy characterizes the accuracy with which the initial data was extracted from the data source;

calculating the sum of the weights of all the nodes on each of the paths to obtain a plurality of path weights; and

determining the path corresponding to the smallest path weight among the plurality of paths as the target path.

5. The method according to claim 4, wherein determining the weight of the node according to the degree of accuracy of the initial data corresponding to the node comprises:

in the case that the initial data is the second data, obtaining the accuracy rate with which the second data was extracted by a data extraction model, and determining the reciprocal of the accuracy rate as the weight of the node; and

in the case that the initial data is the first data, determining a preset value as the weight of the node.

6. The method according to claim 1, wherein acquiring a plurality of second data from a plurality of documents comprises:

determining, according to the type of each document, a data extraction model of the corresponding type from a plurality of data extraction models; and

extracting the second data from each of the documents according to the data extraction model of the corresponding type, to obtain a plurality of the second data.

7. The method according to claim 1, wherein after extracting target data from the target database through the preset association relationship, the first target field and the second target field, the method further comprises:

determining a preset condition, and removing from the target data any data that does not meet the preset condition, to obtain updated target data.

8. A data extraction device, characterized by comprising:

a first obtaining unit, configured to obtain structured data from a structured database to obtain a plurality of first data, wherein the structured database is used to store structured data;

a second obtaining unit, configured to obtain structured data from a document to obtain a plurality of second data, wherein the document is an unstructured document;

a determining unit, configured to take the first data and the second data as initial data, establish a preset association relationship between all the initial data according to associated fields, and establish a target database according to the initial data; and

an extraction unit, configured to determine a first target field and a second target field, and extract target data from the target database through the preset association relationship, the first target field and the second target field.

9. A processor, wherein the processor is configured to run a program, and the data extraction method according to any one of claims 1 to 7 is executed when the program runs.

10. An electronic device, comprising one or more processors and a memory, the memory being used to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data extraction method according to any one of claims 1 to 7.