CN114581680A - Table information extraction method and device, electronic equipment and storage medium - Google Patents

Table information extraction method and device, electronic equipment and storage medium

Info

Publication number
CN114581680A
Authority
CN
China
Prior art keywords
cell
layout
data
information
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210282207.7A
Other languages
Chinese (zh)
Inventor
范诗剑
朱昱锦
李超
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202210282207.7A
Publication of CN114581680A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Embodiments of the present invention disclose a table information extraction method and apparatus, an electronic device, and a storage medium. The method includes: acquiring semi-structured table data of a basic original table; performing normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table; marking the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table; and extracting table information in a multi-tuple format according to the layout distribution table. The technical solution of the embodiments enables structured processing and extraction of the table body, thereby meeting table information extraction requirements and improving the efficiency of table information processing and the applicability of the table information.

Description

Table information extraction method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a table information extraction method and device, electronic equipment and a storage medium.
Background
With the development of information technology, data resources have become increasingly abundant. Besides unstructured data, a large amount of tabular data exists, carried in many forms of information such as videos, pictures, and web pages. How to convert and process such table data into structured form, so as to facilitate automatic processing and decision-making, has long been an important topic in artificial intelligence and enterprise digital transformation.
In the prior art, extraction and conversion of table data usually proceeds by directly acquiring all the information contained in a table, extracting the acquired information item by item, and finally generating structured table information.
In the process of implementing the invention, the inventors found the following defects in the prior art: the overall efficiency of converting and processing table information is low, and the extracted table data can hardly meet the requirements of enterprises.
Disclosure of Invention
Embodiments of the present invention provide a table information extraction method and apparatus, an electronic device, and a storage medium, which can meet table information extraction requirements and improve the efficiency of table information processing and the applicability of the table information.
In a first aspect, an embodiment of the present invention provides a table information extraction method, including:
acquiring semi-structured table data of a basic original table;
performing normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table;
marking the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table;
and extracting table information in a multi-tuple format according to the layout distribution table.
In a second aspect, an embodiment of the present invention further provides a table information extraction apparatus, including:
a data acquisition module, configured to acquire semi-structured table data of a basic original table;
a data processing module, configured to perform normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table;
a table marking module, configured to mark the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table;
and a table information extraction module, configured to extract table information in a multi-tuple format according to the layout distribution table.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a storage apparatus for storing one or more computer programs;
wherein the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the table information extraction method provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a table information extraction method according to any embodiment of the present invention.
In the embodiments of the present invention, semi-structured table data of a basic original table is acquired; normalization standard processing is performed on the semi-structured table data to obtain structured table data of a standard layout table; the cells of the standard layout table are marked according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table; and table information in a multi-tuple format is extracted from the layout distribution table.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a table information extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another table information extraction method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another table information extraction method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a table information extraction service flow provided in an embodiment of the present invention;
Fig. 5 is a schematic view of a table information extraction service processing flow provided in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a table information extraction system according to an embodiment of the present invention;
Fig. 7 is a block diagram of a table information extraction apparatus according to an embodiment of the present invention;
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Fig. 1 is a schematic flowchart of a table information extraction method according to an embodiment of the present invention. This embodiment is applicable to extracting and processing table information efficiently. The method may be executed by a table information extraction apparatus, which may be implemented in software and/or hardware and is generally integrated in an electronic device. The electronic device may be a terminal device or a server device; the embodiment of the present invention does not limit the specific device type. Accordingly, as shown in Fig. 1, the method includes the following steps:
S110, acquiring semi-structured table data of the basic original table.
The semi-structured table data may be table data, contained in the basic original table, that has a certain data structure. It is understood that, in table information extraction, a table contained in any kind of table-bearing information may serve as the basic original table, and the semi-structured table data is then obtained from that basic original table. The table-bearing information may be any information that contains a table, for example a video, a picture, or a document; the embodiment of the present invention does not specifically limit this.
In the embodiment of the present invention, the basic original table may be obtained from various kinds of table-bearing information, and the semi-structured table data is then obtained from the basic original table. Optionally, the semi-structured table data may be JSON (JavaScript Object Notation) data, XML (Extensible Markup Language) data, or the like. For example, the semi-structured table data may be a table whose numbers of rows and columns are not uniform: one column may hold three rows of data while another holds five, and one row may hold one column of data while another holds four.
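As a hedged illustration, semi-structured table data in JSON might look like the payload below, in which the rows carry different numbers of cells. The key names `rows` and `text` are assumptions made for this sketch and do not come from the source.

```python
import json

# Hypothetical JSON payload; the keys "rows" and "text" are assumed names.
raw = """
{"rows": [
  [{"text": "name"}, {"text": "age"}, {"text": "gender"}],
  [{"text": "Zhang San"}, {"text": "23"}],
  [{"text": "Li Si"}]
]}
"""

table = json.loads(raw)

# Count the cells in each row; unequal counts show the table is not yet regular.
cells_per_row = [len(row) for row in table["rows"]]
is_regular = len(set(cells_per_row)) == 1
```

Here `is_regular` is false because the three rows hold three, two, and one cells, which is the kind of irregularity the later normalization step is meant to remove.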
S120, performing normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table.
The normalization standard processing may be processing that converts the semi-structured table data to a standard layout.
The standard layout table may be a table with a standard layout format, obtained by performing normalization standard processing on the basic original table. The structured table data may be the table data with a standard data structure contained in the standard layout table.
In the embodiment of the present invention, the acquired semi-structured table data is normalized according to the standard layout format of a table to obtain the standard layout table, and the semi-structured table data is converted according to the standard layout table to obtain the structured table data. The cell size in the structured table data may be a standard cell size; that is, every cell in the standard layout table may be a standard cell of the same size.
S130, marking the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table.
The cell layout category may be the functional type of a cell in the table. Optionally, the cell layout categories may include a row-header column mark (that is, the column header above the row headers, which may be understood as the table header), a row header, a column header, a data area, table metadata, and the like. The data area consists of the cells that hold the valid data of the table, and the table metadata may be additional information such as the title and units of the table.
The layout distribution table may be the standard layout table after its cells have been marked with their cell layout categories.
In the embodiment of the present invention, each cell in the structured table data has a corresponding cell layout category. After the structured table data is obtained, the cell layout category of each cell in the standard layout table can be determined, and each cell is marked accordingly. Once all cells are marked, the layout distribution table matching the standard layout table is obtained.
S140, extracting table information in a multi-tuple format according to the layout distribution table.
The multi-tuple format may be a sequence format containing a finite number of objects.
Correspondingly, the structured table data in the layout distribution table is converted into table information in the multi-tuple format according to the cell layout category and the data of each cell in the layout distribution table.
In this way, the table is first parsed, then standardized and analyzed for layout to obtain the layout distribution table, and finally the table information in the multi-tuple format is extracted according to the layout distribution table. This meets table information extraction requirements and improves the efficiency of table information processing and the applicability of the table information.
In this embodiment, semi-structured table data of a basic original table is acquired; normalization standard processing is performed on the semi-structured table data to obtain structured table data of a standard layout table; the cells of the standard layout table are marked according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table; and table information in a multi-tuple format is extracted from the layout distribution table.
Example two
Fig. 2 is a schematic flowchart of another table information extraction method according to an embodiment of the present invention. This embodiment refines the above embodiment and provides several specific optional implementations for obtaining the semi-structured table data from the basic original table and for obtaining the layout distribution table matching the standard layout table. Specifically, as shown in Fig. 2, the method includes the following steps:
S210, acquiring original table data.
The original table data may be any data that contains a basic original table, for example document data or image data; this solution does not limit the data type of the original table data.
In the embodiment of the present invention, the original table data may be acquired by any data acquisition means, for example by crawling relevant websites with a crawler, by collecting data with a real-time collection tool, or by obtaining data from a data service provider. The embodiment of the present invention does not specifically limit the means of acquiring the data.
S220, determining an original form extraction tool according to the data type of the original form data.
The original table extraction tool may be a tool for extracting the basic original table from the original table data.
Accordingly, after the original table data is acquired, its data type may be identified so that the corresponding original table extraction tool can be matched to it.
S230, extracting initial table information from the original table data with the original table extraction tool to obtain the semi-structured table data of the basic original table.
Correspondingly, the original table data is processed with the original table extraction tool matched to its data type, and the resulting initial table information is used as the semi-structured table data of the basic original table.
It is understood that different types of original table data may call for different original table extraction tools. For example, when the original table data is image data, character recognition may be performed on it using OCR (Optical Character Recognition) to obtain the basic original table in the image. When the original table data is document data, a parser for the document type may parse it to obtain the basic original table: Word document data may be parsed with a Word (Microsoft Office Word) parser, PDF (Portable Document Format) document data with a PDF parser, and so on.
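The matching of an extraction tool to a data type in S220 and S230 can be sketched as a simple registry lookup. This is only an illustrative sketch: the type labels (`"image"`, `"word"`, `"pdf"`), the function names, and the `Table` alias are all hypothetical, and the extractors are stubs rather than real OCR or document parsers.

```python
from typing import Callable, Dict, List

# Rows of cell texts; rows may differ in length (semi-structured).
Table = List[List[str]]


def extract_from_image(data: bytes) -> Table:
    """Stub: a real version would run OCR on the image."""
    raise NotImplementedError


def extract_from_word(data: bytes) -> Table:
    """Stub: a real version would call a Word document parser."""
    raise NotImplementedError


def extract_from_pdf(data: bytes) -> Table:
    """Stub: a real version would call a PDF parser."""
    raise NotImplementedError


# Hypothetical registry mapping a data type label to its extraction tool.
EXTRACTORS: Dict[str, Callable[[bytes], Table]] = {
    "image": extract_from_image,
    "word": extract_from_word,
    "pdf": extract_from_pdf,
}


def select_extractor(data_type: str) -> Callable[[bytes], Table]:
    """Return the extraction tool registered for the given data type."""
    try:
        return EXTRACTORS[data_type]
    except KeyError:
        raise ValueError(f"no extraction tool registered for {data_type!r}") from None
```

A registry keeps the type check in one place, so supporting a further data type only requires adding one entry.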
S240, obtaining the basic table size and the table row and column data of the basic original table.
The basic table size may be the size of the basic original table, for example 10cm × 10cm; the table row and column data may be the row and column counts of the basic original table.
In the embodiment of the present invention, the basic table size and the numbers of rows and columns can be obtained from the basic original table. It will be appreciated that, because the basic original table is not sufficiently regular, the number of columns may differ from row to row and the number of rows may differ from column to column.
S250, determining the target row number and the target column number of the semi-structured table data according to the table row and column data.
The target row number may be the maximum row count of the basic original table, and the target column number may be its maximum column count.
Correspondingly, the maximum row count and maximum column count of the basic original table are determined from the acquired row and column counts, and are taken as the target row number and the target column number of the semi-structured table data, respectively.
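A minimal sketch of S250, assuming the table row and column data is available as two lists of counts (rows found in each column, columns found in each row). The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def target_shape(rows_per_column: List[int], columns_per_row: List[int]) -> Tuple[int, int]:
    """Return (target row number, target column number) as the maximum counts."""
    return max(rows_per_column), max(columns_per_row)


# Counts borrowed from the irregular-table example: columns hold 3 or 5 rows,
# rows hold 1 or 4 columns.
shape = target_shape(rows_per_column=[3, 5], columns_per_row=[1, 4])
```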
S260, dividing the basic table size according to the target row number and the target column number to obtain the normalized standard cell size.
The normalized standard cell size is the size of a single cell after normalization.
In the embodiment of the present invention, the basic table size obtained above is divided evenly according to the target row number and the target column number to obtain the normalized standard cell size. Optionally, every normalized standard cell in the divided table has the same size.
For example, if the basic original table is 10cm × 10cm and the target row number and target column number are both 5, the table may be divided evenly into 5 rows and 5 columns. Every cell in the divided table then measures 2cm × 2cm, and this 2cm × 2cm size is used as the normalized standard cell size.
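The even division in S260 amounts to two divisions. The sketch below reproduces the 10cm × 10cm example; the function name and the (width, height) ordering are assumptions.

```python
from typing import Tuple


def normalized_cell_size(
    table_width: float, table_height: float, target_rows: int, target_cols: int
) -> Tuple[float, float]:
    """Divide the table evenly; returns (cell width, cell height)."""
    return table_width / target_cols, table_height / target_rows


# 10cm x 10cm table, 5 target rows and 5 target columns.
cell_size = normalized_cell_size(10.0, 10.0, target_rows=5, target_cols=5)
```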
S270, splitting the basic original table according to the normalized standard cell size to obtain a normalized standard cell.
In the embodiment of the present invention, the basic original table is split according to the normalized standard cell size, following a rounding principle, and the normalized standard cells are obtained after the split.
For example, suppose the normalized standard cell size is 2cm × 2cm. If a cell in the basic original table measures 2cm × 2cm, it can be used directly as a normalized standard cell. If a cell measures 3cm × 2cm, it can be split into a 2cm × 2cm normalized standard cell and a 1cm × 2cm remainder, and the remainder can be further expanded into a 2cm × 2cm normalized standard cell. If a cell measures 1cm × 2cm, it can be expanded into a 2cm × 2cm normalized standard cell.
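One way to read the rounding principle is that any partial remainder is expanded to a full standard cell, which corresponds to rounding up. The sketch below takes that reading as an assumption (the source does not state the exact rule) and computes how many standard cells an original cell spans along one axis; `cells_spanned` is a hypothetical name.

```python
import math


def cells_spanned(length: float, standard_length: float) -> int:
    """Number of standard cells covered along one axis, rounding remainders up.

    A small tolerance guards against floating-point noise such as 2.0000001.
    """
    return max(1, math.ceil(length / standard_length - 1e-9))


# Widths of 2cm, 3cm and 1cm against a 2cm standard cell.
spans = [cells_spanned(w, 2.0) for w in (2.0, 3.0, 1.0)]
```

Under this reading a 3cm-wide cell becomes two standard cells, and a 1cm-wide cell becomes one, matching the three cases in the example.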
S280, copying and filling the data of the normalized standard cells according to the original form data of the basic original form to obtain the structured form data of the standard layout form.
Correspondingly, the original table data of the basic original table is acquired, and, according to the position of that data in the original cells of the basic original table, it is copied into the normalized standard cells produced by splitting those original cells. This yields the structured table data of the standard layout table.
For example, if a 3cm × 2cm cell in the basic original table is split into two 2cm × 2cm normalized standard cells, then during copying and filling the content of the 3cm × 2cm cell before splitting is copied directly into both 2cm × 2cm normalized standard cells.
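The copy-and-fill step can be sketched as expanding each original cell over the standard cells it covers. The `SpanCell` record, with positions and spans measured in standard cells, is a hypothetical representation chosen for this sketch.

```python
from typing import List, NamedTuple


class SpanCell(NamedTuple):
    """An original cell located on the standard grid (all fields are assumed names)."""
    row: int
    col: int
    row_span: int
    col_span: int
    text: str


def fill_grid(cells: List[SpanCell], n_rows: int, n_cols: int) -> List[List[str]]:
    """Copy each original cell's text into every standard cell it covers."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                grid[r][c] = cell.text
    return grid


# "score" spans two standard columns, so its text is copied into both.
structured = fill_grid(
    [
        SpanCell(0, 0, 1, 2, "score"),
        SpanCell(1, 0, 1, 1, "math"),
        SpanCell(1, 1, 1, 1, "art"),
    ],
    n_rows=2,
    n_cols=2,
)
```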
S290, acquiring cell association information of each cell in the standard layout table according to the structured table data.
The cell association information may include, but is not limited to, cell text information and cell position information. The cell text information may be the text of the data in the cell, and the cell position information may be the row and column indices of the cell in the structured table data. For example, a cell may be described by the text information "23" and the position information "row 3, column 4" of the whole table.
In the embodiment of the present invention, the text recorded in each cell of the standard layout table and the row and column indices of that cell are identified from the structured table data, and together serve as the cell association information of the cell.
S2100, determining the cell layout type of each cell in the standard layout table according to the cell association information through a layout type classifier.
The layout category classifier may be a classification model trained in advance, or a classification dictionary constructed in advance.
In the embodiment of the present invention, when the cell association information is input into the layout category classifier, the classifier outputs the layout category of the cell based on the cell text information and cell position information it contains. The cell layout categories may include, but are not limited to, a row-header column mark, a row header, a column header, a data area, and table metadata. For example, if the cell text information of the current cell is "gender" and its position information is "first row, second column", the layout category of the current cell is determined to be a column header.
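A dictionary-based classifier could look like the toy sketch below. The dictionary contents, the category labels, and the positional fallback (first row holds column headers, first column holds row headers) are all assumptions for illustration; a trained model could replace the whole function. Row and column indices are zero-based here.

```python
# Hypothetical classification dictionary: known header texts and their category.
HEADER_DICTIONARY = {
    "gender": "column_header",
    "age": "column_header",
}


def classify_cell(text: str, row: int, col: int) -> str:
    """Return a layout category from cell text and zero-based position."""
    key = text.strip().lower()
    if key in HEADER_DICTIONARY:
        return HEADER_DICTIONARY[key]
    # Assumed positional fallback for texts missing from the dictionary.
    if row == 0 and col == 0:
        return "row_header_column_mark"
    if row == 0:
        return "column_header"
    if col == 0:
        return "row_header"
    return "data"


# "gender" in the first row, second column is a column header.
category = classify_cell("gender", row=0, col=1)
```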
S2110, marking the cells of the standard layout table according to the cell layout category of each cell in the standard layout table to obtain a layout distribution table matching the standard layout table.
Correspondingly, after the layout categories of all cells in the standard layout table have been identified, the cells are marked with their cell layout categories; once every cell in the standard layout table has been marked, the layout distribution table matching the standard layout table is obtained.
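Marking can be sketched as applying a classifier to every cell and keeping the labels in a grid of the same shape. The `positional_classifier` below is a hypothetical stand-in for the layout category classifier, and the label strings are assumed names.

```python
from typing import Callable, List


def positional_classifier(text: str, row: int, col: int) -> str:
    """Hypothetical stand-in classifier that uses position only."""
    if row == 0 and col == 0:
        return "row_header_column_mark"
    if row == 0:
        return "column_header"
    if col == 0:
        return "row_header"
    return "data"


def mark_layout(
    grid: List[List[str]], classify: Callable[[str, int, int], str]
) -> List[List[str]]:
    """Build a label grid with the same shape as the standard layout table."""
    return [
        [classify(text, r, c) for c, text in enumerate(row)]
        for r, row in enumerate(grid)
    ]


grid = [["name\\attribute name", "age"], ["Zhang San", "23"]]
layout = mark_layout(grid, positional_classifier)
```

Keeping the labels in a separate grid leaves the cell texts untouched, so the later extraction step can read both side by side.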
S2120, extracting table information in a multi-tuple format according to the layout distribution table.
In this embodiment, original table data is acquired and the semi-structured table data of the basic original table is obtained from it. The basic table size and the table row and column data of the basic original table are then acquired, the basic table size is divided to obtain the normalized standard cell size, and the basic original table is split by that size to obtain normalized standard cells. The cells of the standard layout table are then marked according to the cell layout category of each cell in the structured table data to obtain the layout distribution table matching the standard layout table, and table information in a multi-tuple format is extracted from the layout distribution table. This addresses the problems in the prior art that table information extraction is inefficient and inaccurate and can hardly meet practical application requirements, and achieves structured processing and extraction of the table body, thereby meeting table information extraction requirements and improving the efficiency of table information processing and the applicability of the table information.
Example three
Fig. 3 is a schematic flowchart of another table information extraction method according to an embodiment of the present invention. This embodiment further refines the above embodiments and provides several specific optional implementations for extracting table information in a multi-tuple format according to the layout distribution table. Accordingly, as shown in Fig. 3, the method includes the following steps:
S310, acquiring semi-structured table data of the basic original table.
S320, performing normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table.
S330, marking the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain a layout distribution table matching the standard layout table.
S340, acquiring row header and column header distribution data of the layout distribution table.
The row header and column header distribution data may be the data of the cells in the layout distribution table whose layout category is row header or column header.
In the embodiment of the present invention, all row header and column header distribution data in the layout distribution table is determined from the marking information in the layout distribution table.
S350, splitting the layout distribution table according to the row header and column header distribution data to obtain basic layout distribution tables.
A basic layout distribution table may be a layout distribution table that contains the data of only one table and cannot be further divided into multiple tables. It is understood that the layout distribution table may contain the table data of at least one table, and the number of basic layout distribution tables it contains may be determined from its row header and column header distribution data.
In the embodiment of the present invention, the number of basic layout distribution tables contained in the current layout distribution table is determined from the row header and column header distribution data, and the layout distribution table is split according to that data to obtain all the basic layout distribution tables.
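The source does not spell out the splitting rule. As one possible sketch, assume that stacked tables are separated vertically and that a new basic table starts wherever a row made up entirely of header labels follows a non-header row. The rule, the label strings, and the function name are all assumptions.

```python
from typing import List, Tuple

# Assumed labels that mark a header row of a table.
HEADER_LABELS = {"row_header_column_mark", "column_header"}


def split_rows(labels: List[List[str]]) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, one per basic table."""
    blocks: List[Tuple[int, int]] = []
    start = 0
    previous_was_header = True
    for r, row in enumerate(labels):
        is_header = all(label in HEADER_LABELS for label in row)
        # A header row right after body rows begins a new basic table.
        if is_header and not previous_was_header:
            blocks.append((start, r))
            start = r
        previous_was_header = is_header
    if labels:
        blocks.append((start, len(labels)))
    return blocks


header = ["row_header_column_mark", "column_header"]
body = ["row_header", "data"]
ranges = split_rows([header, body, body, header, body])
```

The number of returned ranges is the number of basic layout distribution tables under this assumed rule; tables placed side by side would need an analogous column-wise pass.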
And S360, extracting the basic table information in the multi-tuple format from the basic layout distribution table.
In the embodiment of the invention, structured table data in the layout distribution table is converted into table information in a multi-element format according to the cell layout type of each cell in the basic layout distribution table and the data of the cells.
Optionally, in another embodiment of the present invention, the extracting the basic table information in the tuple format from the basic layout distribution table may include: acquiring basic cell association information of the basic layout distribution table; the basic cell associated information comprises basic cell position information, basic cell layout types and basic cell text information; generating a search rule of upper cells of the basic cells in the basic layout distribution table according to the basic cell association information; searching the upper unit cell of each basic unit cell according to the upper unit cell searching rule of the basic unit cell; wherein the upper unit cells comprise row upper unit cells and column upper unit cells; carrying out error correction and duplicate removal processing on the upper cells to obtain target upper cell information; and generating basic table information in the multi-tuple format according to the basic cell association information and the target upper cell information.
The basic cell association information may include, but is not limited to, basic cell text information and basic cell position information. The basic cell text information may be the text of the data in a basic cell, and the basic cell position information may be the row number and column number of each basic cell in the basic table data. The upper cell search rule may be a preset default rule or a rule generated according to the basic cell association information. The upper cells comprise row upper cells and column upper cells. It can be understood that, in a basic layout table, each basic cell has a row upper cell and a column upper cell. A search along the row and the column where the basic cell is located yields the row header and the column header of the basic cell; that is, the row header is the row upper cell of the basic cell, and the column header is the column upper cell of the basic cell.
Correspondingly, an upper cell search rule for the basic cells in the basic layout distribution table is generated from the basic cell association information of the basic layout distribution table, and the upper cells of all the basic cells are searched according to the upper cell search rule and the basic cell position information included in the cell association information. The obtained upper cells are then subjected to error correction and deduplication processing to obtain the target upper cell information, and the basic table information in the multi-tuple format is generated according to the target upper cell information. The error correction and deduplication processing may consist of retaining only one row header or column header when a plurality of repeated row headers or column headers is obtained.
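The deduplication step described above can be illustrated with a minimal sketch. The function name `dedupe_headers` and the sample header texts are hypothetical and do not come from the patent; the sketch only shows one plausible way to keep a single copy of a repeated header.

```python
from typing import Iterable, List


def dedupe_headers(headers: Iterable[str]) -> List[str]:
    """Drop repeated header texts while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() removes later repeats only.
    return list(dict.fromkeys(headers))


# Made-up search result in which one header was found twice.
found = ["2021", "2021", "revenue"]
print(dedupe_headers(found))
```

Repeats of this kind would be expected after a merged header cell has been split and copied during normalization, although the patent does not state where the repeats originate.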
In a specific example, the current cell records "23", its layout category is "data area", and its position information is "row number: 3; column number: 2". Starting from the current cell, the search proceeds upward and to the left according to the search rule until the corresponding row header and column header are found; for example, the row header may be "Zhang San" and the column header may be "age". If the current cell is the row header "Zhang San", the search may proceed upward. If the current cell is the column header "age", the search may proceed to the left. If the current cell is the row-and-column header cell labeled "name \ attribute name", no search is performed.
S370, performing aggregation processing on the basic table information to obtain the table information in the multi-tuple format of the layout distribution table.
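The search in the example above can be sketched as follows for a data-area cell. This is an illustration under stated assumptions, not the patent's search rule: the layout labels other than the data area, the function name `find_upper_cells`, and every table value other than "23", "Zhang San", "age" and "name \ attribute name" are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical layout labels; the patent names a "data area" but not the others.
DATA, ROW_HEADER, COL_HEADER, CORNER = "data", "row_header", "col_header", "corner"


def find_upper_cells(
    texts: List[List[str]],
    layout: List[List[str]],
    row: int,
    col: int,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (row upper cell, column upper cell) for the cell at (row, col).

    For a data cell, scan left along the row for the nearest row header and
    up along the column for the nearest column header.
    """
    if layout[row][col] != DATA:
        return None, None  # searches for header cells are omitted in this sketch

    key_row = None
    for c in range(col - 1, -1, -1):  # search to the left
        if layout[row][c] == ROW_HEADER:
            key_row = texts[row][c]
            break

    key_col = None
    for r in range(row - 1, -1, -1):  # search upward
        if layout[r][col] == COL_HEADER:
            key_col = texts[r][col]
            break

    return key_row, key_col


# Illustrative table; indices are 0-based, so "row 3, column 2" is (2, 1).
texts = [
    ["name \\ attribute name", "age", "city"],
    ["Li Si", "31", "Beijing"],
    ["Zhang San", "23", "Shanghai"],
]
layout = [
    [CORNER, COL_HEADER, COL_HEADER],
    [ROW_HEADER, DATA, DATA],
    [ROW_HEADER, DATA, DATA],
]
print(find_upper_cells(texts, layout, 2, 1))
```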
In the embodiment of the invention, the acquired basic table information in the multi-tuple format is aggregated, and all the basic table information in the multi-tuple format is aggregated into the table information in the multi-tuple format, so that the table information in the multi-tuple format of the layout distribution table is obtained.
Optionally, in the embodiment of the present invention, the table information in the multi-tuple format may include cell content information, row upper cell information, column upper cell information, a cell layout category, and a cell binary classification result.
In a specific example, the table information in the multi-tuple format can be represented as (value, key_row, key_col, value_type, k_or_v), where value represents the cell content information, key_row represents the row upper cell information, key_col represents the column upper cell information, value_type represents the cell layout category, and k_or_v represents the cell binary classification result. The binary classification is as follows: a cell outside the data area is denoted k, and a cell in the data area is denoted v. Thus, for a cell with content "23", row upper cell information "Zhang San", column upper cell information "age", cell layout category "data area", and binary classification result "v", the resulting tuple may be (23, Zhang San, age, data area, v).
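The five-field tuple and the aggregation of step S370 can be sketched as below. The field names follow the text; the class name `TableTuple`, the helpers `make_tuple` and `aggregate`, and the "column header" label are hypothetical. Aggregation is assumed here to be plain concatenation of the per-sub-table results, which the patent does not specify.

```python
from typing import Iterable, List, NamedTuple


class TableTuple(NamedTuple):
    value: str       # cell content information
    key_row: str     # row upper cell information
    key_col: str     # column upper cell information
    value_type: str  # cell layout category
    k_or_v: str      # "v" for data-area cells, "k" for all other cells


def make_tuple(value: str, key_row: str, key_col: str, value_type: str) -> TableTuple:
    """Build one tuple, deriving the binary class from the layout category."""
    k_or_v = "v" if value_type == "data area" else "k"
    return TableTuple(value, key_row, key_col, value_type, k_or_v)


def aggregate(groups: Iterable[Iterable[TableTuple]]) -> List[TableTuple]:
    """Assumed aggregation: concatenate the tuples of every basic sub-table."""
    return [item for group in groups for item in group]


cell = make_tuple("23", "Zhang San", "age", "data area")
header = make_tuple("age", "", "", "column header")  # hypothetical category label
print(aggregate([[cell], [header]]))
```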
Fig. 4 is a schematic diagram of a table information extraction service process provided by an embodiment of the present invention, and in a specific example, as shown in fig. 4, the table information extraction service process may include: table extraction, table analysis, and table abstraction, wherein:
Table extraction mainly involves extracting table data from unstructured data, where the unstructured data may be of file types such as image files (Img format), Txt (text document) files, pictures, documents, and PDF files. For example, in table extraction, when the original table data is image data, character recognition may be performed on the image data by an OCR technology to obtain the basic original table in the image. When the original table data is document data, it can be parsed by a parser for the document type; for example, Word document data is parsed by a Word document parser to obtain the basic original table, or PDF document data is parsed by a PDF parser to obtain the basic original table.
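Choosing the extraction tool from the data type can be sketched as a simple lookup. Everything named here is an assumption made for illustration: the file suffix stands in for the data type, and the three extractor functions are placeholders that return a label instead of calling a real OCR engine or document parser.

```python
from pathlib import Path
from typing import Callable, Dict


# Placeholder extractors; a real system would call an OCR engine or a parser.
def extract_with_ocr(path: str) -> str:
    return f"ocr:{path}"


def extract_from_word(path: str) -> str:
    return f"word:{path}"


def extract_from_pdf(path: str) -> str:
    return f"pdf:{path}"


# Hypothetical mapping from file suffix to extraction tool.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".png": extract_with_ocr,
    ".jpg": extract_with_ocr,
    ".docx": extract_from_word,
    ".pdf": extract_from_pdf,
}


def choose_extractor(path: str) -> Callable[[str], str]:
    """Pick the extraction tool that matches the file's data type."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no table extractor registered for {suffix!r}")
    return EXTRACTORS[suffix]


print(choose_extractor("report.pdf")("report.pdf"))
```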
The table analysis is to perform table standardization and layout analysis on the unstructured data, and fig. 5 is a schematic diagram of a table information extraction business processing flow provided by an embodiment of the present invention. In a specific example, as shown in fig. 5, in the table analysis, standard cell units may first be defined for the document layout structured data from the geometric dimensions of the basic original table. The non-standard table is then converted into a standard table data structure by splitting and copying each cell whose row or column span exceeds one standard unit. Finally, a classifier performs layout analysis and labeling on the standard table data structure to generate a layout distribution map with a table layout description. The types of semi-structured data may include, but are not limited to, XML files, HTML files, and JSON files, and the types of structured data may include, but are not limited to, CSV (Comma-Separated Values) files and RDB (Relational Database) files.
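The split-copy operation on spanning cells can be sketched as follows. The record type `SpanCell`, the function `normalize`, and the sample cells are hypothetical; the sketch assumes each source cell is already described by its top-left position and its span measured in standard units.

```python
from typing import List, NamedTuple


class SpanCell(NamedTuple):
    row: int      # top-left row index, 0-based
    col: int      # top-left column index, 0-based
    rowspan: int  # height in standard units
    colspan: int  # width in standard units
    text: str


def normalize(cells: List[SpanCell], n_rows: int, n_cols: int) -> List[List[str]]:
    """Split every spanning cell into unit cells and copy its text into each."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[r][c] = cell.text
    return grid


# Made-up table: one header spanning two columns above two ordinary cells.
cells = [
    SpanCell(0, 0, 1, 2, "score"),
    SpanCell(1, 0, 1, 1, "math"),
    SpanCell(1, 1, 1, 1, "art"),
]
print(normalize(cells, n_rows=2, n_cols=2))
```

After this step every position in the grid holds exactly one unit cell, which is what allows a per-cell layout classifier to run without special handling of merged cells.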
Further, table abstraction is performed on the basis of the layout distribution map: according to the distribution of the row headers and column headers, the table is divided into basic substructures and converted into corresponding abstract data structures. The upper cell information is then obtained through upper information search, and a data set in the multi-tuple format is generated after error correction and deduplication logic. Entities and their attributes required for constructing a knowledge graph are extracted from the data set in the multi-tuple format to generate an external knowledge graph. The tuple data may include RDF (Resource Description Framework) files and OWL (Web Ontology Language) files.
The embodiment of the invention obtains the semi-structured table data of the basic original table, performs normalization standard processing on the semi-structured table data to obtain the structured table data of the standard layout table, and marks the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain the layout distribution table matched with the standard layout table. It then splits the layout distribution table according to its row header and column header distribution data to obtain the basic layout distribution table, extracts the basic table information in the multi-tuple format from the basic layout distribution table, and finally aggregates the basic table information to obtain the table information in the multi-tuple format of the layout distribution table. This solves the problems in the prior art that table information is extracted with low efficiency and accuracy and that practical application requirements are difficult to meet. The table text can be processed and extracted in a structured manner, which meets the requirement of table information extraction and improves both the efficiency of table information processing and the applicability of the table information.
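One possible way to derive entity and attribute data from the multi-tuple data set is sketched below. The patent does not describe this mapping; treating the row header as the entity and the column header as the attribute is an assumption, as are the function name `to_triples` and the "column header" label. The output is a plain list of triples, not an RDF or OWL serialization.

```python
from typing import Iterable, List, Tuple

# (value, key_row, key_col, value_type, k_or_v), as in the text above.
Row = Tuple[str, str, str, str, str]
Triple = Tuple[str, str, str]  # (entity, attribute, value)


def to_triples(rows: Iterable[Row]) -> List[Triple]:
    """Turn data-area tuples into (entity, attribute, value) triples.

    Assumption: the row header names the entity and the column header names
    the attribute. Header tuples ("k") carry no value and are skipped.
    """
    triples = []
    for value, key_row, key_col, _value_type, k_or_v in rows:
        if k_or_v == "v" and key_row and key_col:
            triples.append((key_row, key_col, value))
    return triples


rows = [
    ("23", "Zhang San", "age", "data area", "v"),
    ("age", "", "", "column header", "k"),  # hypothetical header tuple
]
print(to_triples(rows))
```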
The embodiment of the invention also provides a table information extraction system, and a user can automatically extract the table information by using the table information extraction system. Fig. 6 is a schematic structural diagram of a table information extraction system according to an embodiment of the present invention, and as shown in fig. 6, the table information extraction system may include a presentation layer, a tool layer, an algorithm layer, and a data layer. Wherein:
The presentation layer mainly concerns user invocation and result presentation. In the presentation layer, a RESTful (Representational State Transfer, REST for short) interface encapsulates the whole processing flow, and a user can complete the full call chain of semi-structured table data input, background logic processing, and multi-tuple data output by calling the corresponding Application Programming Interface (API). The RPC (Remote Procedure Call) interface provides the same function as the RESTful interface; only the background logic is encapsulated in RPC mode instead. The Web (World Wide Web) refers to the corresponding front-end page, through which a user can upload an input file as original table data and view the final table information extraction result.
The tool layer mainly comprises a set of tools that ease development and debugging in engineering. The visualization tool is a tool set for visualizing the input semi-structured data and the output results, which helps users with data inspection tasks such as error troubleshooting and negative sample analysis. Index calculation refers to a set of tools that compute the system indices (accuracy, recall, table type, and sample number distribution). Log monitoring provides the logging and monitoring components required by the system. Configuration management is the configuration management module required by the system framework and is used for configuring the functional modules of the system and the like.
The algorithm layer mainly comprises the core logic components of the system framework. It can be used to encapsulate and abstract the models, dictionaries, rules, and the like required by the steps and processes involved in table information extraction, and it provides corresponding interfaces that expose the core algorithm functions. The training and inference logic encapsulation may include the interface definitions of the model training and inference methods, from which the training and inference implementations of the subsequent model, rule, and dictionary methods can be derived. The algorithm engine implements the core algorithm and mainly involves the model, rule, and dictionary methods related to cell classification and layout analysis. Metadata management refers to the ability to define corresponding class code descriptions for the component (cell/table) element entities in a table. Model management performs model definition and model lifecycle management (registration, validation, and deregistration). Rule management performs rule base definition and rule lifecycle management. The rules may include search rules and layout analysis rules for determining the cell layout type. Dictionary management performs dictionary definition and dictionary lifecycle management. The dictionary may specifically be used to perform decision analysis on the cells to determine the cell layout type. Algorithm configuration management is used for configuring the relevant model parameters, managing the configuration of the table information extraction link, and the like. Data loading and management is used for the scripts that parse and convert different types of input data. The model may use the data loader for model definition and management.
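The model lifecycle mentioned above (registration, validation, deregistration) can be sketched with a small registry. The class `ModelRegistry` and its method names are hypothetical, and the rule that a model must pass validation before it can be fetched is an assumption; the patent only names the lifecycle stages.

```python
from typing import Any, Callable, Dict


class ModelRegistry:
    """Hypothetical registry covering registration, validation, deregistration."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, model: Any) -> None:
        if name in self._entries:
            raise ValueError(f"model {name!r} is already registered")
        self._entries[name] = {"model": model, "validated": False}

    def validate(self, name: str, check: Callable[[Any], bool]) -> bool:
        """Run a caller-supplied check and record whether the model passed."""
        entry = self._entries[name]
        entry["validated"] = bool(check(entry["model"]))
        return entry["validated"]

    def get(self, name: str) -> Any:
        entry = self._entries[name]
        if not entry["validated"]:
            raise RuntimeError(f"model {name!r} has not passed validation")
        return entry["model"]

    def deregister(self, name: str) -> None:
        del self._entries[name]  # raises KeyError for unknown names


registry = ModelRegistry()
registry.register("layout_classifier", lambda cell: "data area")
registry.validate("layout_classifier", callable)
print(registry.get("layout_classifier")("23"))
```

The same shape could serve rule and dictionary lifecycle management, since the text describes those in parallel terms.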
The whole table information extraction method can be realized based on an algorithm layer. That is, the algorithm layer may implement the table information extraction method provided in any embodiment of the present invention, and the other layers may serve as supports to provide a data support function for the algorithm layer.
The data layer comprises the read/write interfaces of the storage engines of different databases and can be used to obtain related data by connecting to those databases. The HDFS (Hadoop Distributed File System) interface is the data read/write interface for the distributed file system, the RDBMS (Relational Database Management System) interface is the data read/write interface for relational databases, the File interface is the data read/write interface for the stand-alone file system, and the NoSQL (a general term for non-relational databases) interface is the data read/write interface for non-relational databases; for example, the non-relational database may be MongoDB (a database based on distributed file storage), Redis (Remote Dictionary Server), or the like.
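A uniform read/write interface over different storage engines can be sketched with an abstract base class. The names `StorageBackend`, `InMemoryBackend` and `copy_object` are hypothetical; the in-memory backend merely stands in for HDFS, RDBMS, File or NoSQL adapters, none of which is implemented here.

```python
from abc import ABC, abstractmethod
from typing import Dict


class StorageBackend(ABC):
    """Hypothetical common read/write interface for all storage engines."""

    @abstractmethod
    def read(self, key: str) -> bytes:
        ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None:
        ...


class InMemoryBackend(StorageBackend):
    """Stand-in engine that keeps objects in a dictionary."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._blobs[key]

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data


def copy_object(src: StorageBackend, dst: StorageBackend, key: str) -> None:
    """Move data between engines using only the shared interface."""
    dst.write(key, src.read(key))


source, target = InMemoryBackend(), InMemoryBackend()
source.write("table.json", b'{"rows": 3}')
copy_object(source, target, "table.json")
print(target.read("table.json"))
```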
Example four
Fig. 7 is a schematic diagram of a table information extracting apparatus according to a fourth embodiment of the present invention, and as shown in fig. 7, the apparatus includes: a data acquisition module 410, a data processing module 420, a table marking module 430, and a table information extraction module 440, wherein:
a data obtaining module 410, configured to obtain semi-structured table data of a base raw table;
the data processing module 420 is configured to perform normalization standard processing on the semi-structured form data to obtain structured form data of a standard layout form;
a table marking module 430, configured to mark cells of the standard layout table according to the cell layout categories of the cells in the structured table data, so as to obtain a layout distribution table matched with the standard layout table;
the table information extracting module 440 is configured to extract the table information in the multi-tuple format according to the layout distribution table.
The embodiment of the invention obtains the semi-structured table data of the basic original table, performs normalization standard processing on the semi-structured table data to obtain the structured table data of the standard layout table, marks the cells of the standard layout table according to the cell layout category of each cell in the structured table data to obtain the layout distribution table matched with the standard layout table, and extracts the table information in the multi-tuple format according to the layout distribution table.
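The way the four modules hand data to one another can be sketched as a simple chain. The class `TableInfoExtractor` and its field names are hypothetical, and each module is represented by an arbitrary callable; the stubs in the example only record the order in which the stages run.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TableInfoExtractor:
    """Hypothetical wiring of modules 410 to 440 as four callables."""

    acquire: Callable[[Any], Any]    # data acquisition module 410
    normalize: Callable[[Any], Any]  # data processing module 420
    mark: Callable[[Any], Any]       # table marking module 430
    extract: Callable[[Any], Any]    # table information extraction module 440

    def run(self, raw: Any) -> Any:
        semi_structured = self.acquire(raw)
        structured = self.normalize(semi_structured)
        layout_table = self.mark(structured)
        return self.extract(layout_table)


# Stub stages that append their own name to a trace list.
def stage(name: str) -> Callable[[list], list]:
    return lambda trace: trace + [name]


pipeline = TableInfoExtractor(
    acquire=stage("acquire"),
    normalize=stage("normalize"),
    mark=stage("mark"),
    extract=stage("extract"),
)
print(pipeline.run([]))
```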
Optionally, the data obtaining module 410 is specifically configured to: acquiring original form data; determining an original table extraction tool according to the data type of the original table data; and extracting initial table information from the original table data through the original table extraction tool to obtain the semi-structured table data of the basic original table.
Optionally, the data processing module 420 is specifically configured to: acquiring the basic form size and the form row and column data of the basic original form; determining the target line number and the target column number of the semi-structured table data according to the table row and column data; dividing the basic table size according to the target row number and the target column number to obtain a normalized standard cell size; splitting the basic original table according to the normalized standard cell size to obtain a normalized standard cell; and copying and filling the data of the normalized standard cells according to the original table data of the basic original table to obtain the structured table data of the standard layout table.
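The division of the basic table size into normalized standard cells is simple arithmetic and can be sketched as below. The function names and the sample dimensions are hypothetical; rounding a measured cell length to the nearest whole number of units is an assumption about how spans might be derived, not something the patent states.

```python
from typing import Tuple


def standard_cell_size(
    table_width: float,
    table_height: float,
    target_rows: int,
    target_cols: int,
) -> Tuple[float, float]:
    """Return (unit width, unit height) of one normalized standard cell."""
    if target_rows <= 0 or target_cols <= 0:
        raise ValueError("target row and column counts must be positive")
    return table_width / target_cols, table_height / target_rows


def span_in_units(length: float, unit: float) -> int:
    """Assumed rule: a cell spans the nearest whole number of units, at least 1."""
    return max(1, round(length / unit))


# Made-up table of 300 x 120 units with 4 target rows and 3 target columns.
unit_w, unit_h = standard_cell_size(300.0, 120.0, target_rows=4, target_cols=3)
print(unit_w, unit_h, span_in_units(200.0, unit_w))
```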
Optionally, the table marking module 430 is specifically configured to: acquiring cell association information of each cell in the standard layout table according to the structured table data; the cell associated information comprises cell text information and cell position information; determining the cell layout category of each cell in the standard layout table according to the cell association information through a layout category classifier; and marking the cells of the standard layout table according to the cell layout categories of the cells in the standard layout table to obtain a layout distribution table matched with the standard layout table.
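Marking every cell with a layout category can be sketched with a stand-in classifier. The positional rule below is NOT the patent's layout category classifier, which also uses the cell text information; it is a deliberately naive placeholder, and the labels other than "data area" as well as the function names are hypothetical.

```python
from typing import List, Tuple


def classify_cell(row: int, col: int) -> str:
    """Naive positional stand-in for the layout category classifier."""
    if row == 0 and col == 0:
        return "corner"
    if row == 0:
        return "column header"
    if col == 0:
        return "row header"
    return "data area"


def mark_layout(grid: List[List[str]]) -> List[List[Tuple[str, str]]]:
    """Pair each cell text with its layout category (a layout distribution table)."""
    return [
        [(text, classify_cell(r, c)) for c, text in enumerate(line)]
        for r, line in enumerate(grid)
    ]


grid = [
    ["name \\ attribute name", "age"],
    ["Zhang San", "23"],
]
for line in mark_layout(grid):
    print(line)
```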
Optionally, the table information extracting module 440 is specifically configured to: acquiring row header and column header distribution data of the layout distribution table; splitting the layout distribution table according to the row header and column header distribution data to obtain a basic layout distribution table; extracting basic table information in the multi-tuple format from the basic layout distribution table; and performing aggregation processing on the basic table information to obtain the table information in the multi-tuple format of the layout distribution table.
Optionally, the table information extracting module 440 is further specifically configured to: acquiring basic cell association information of the basic layout distribution table; the basic cell associated information comprises basic cell position information, basic cell layout types and basic cell text information; generating a search rule of upper cells of the basic cells in the basic layout distribution table according to the basic cell association information; searching the upper unit cell of each basic unit cell according to the upper unit cell searching rule of the basic unit cell; wherein the upper unit cells comprise row upper unit cells and column upper unit cells; carrying out error correction and duplicate removal processing on the upper cells to obtain target upper cell information; and generating basic table information in the multi-tuple format according to the basic cell association information and the target upper cell information.
Optionally, the table information in the multi-tuple format includes cell content information, upper-row cell information, upper-column cell information, cell layout categories, and cell binary classification results.
The table information extraction device can execute the table information extraction method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For details of the technique not described in detail in this embodiment, reference may be made to the table information extraction method provided in any embodiment of the present invention.
Since the above-described table information extraction device is a device capable of executing the table information extraction method in the embodiment of the present invention, a person skilled in the art can, based on the table information extraction method described in the embodiment of the present invention, understand the specific implementation of the table information extraction device and its variations. Therefore, how the table information extraction device implements the table information extraction method in the embodiment of the present invention is not described in detail here. Any device that a person skilled in the art uses to implement the table information extraction method in the embodiments of the present invention falls within the scope of the present application.
EXAMPLE five
FIG. 8 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM)12, a Random Access Memory (RAM)13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM)12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the table information extraction method.
In some embodiments, the table information extraction method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the table information extraction method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the table information extraction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package, partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
EXAMPLE six
An embodiment of the present invention further provides a computer storage medium storing a computer program, where the computer program is executed by a computer processor to perform the table information extraction method according to any one of the above embodiments of the present invention: acquiring semi-structured table data of a basic original table; carrying out normalization standard processing on the semi-structured form data to obtain structured form data of a standard layout form; marking the cells of the standard layout table according to the cell layout categories of the cells in the structured table data to obtain a layout distribution table matched with the standard layout table; and extracting the table information in the multi-tuple format according to the layout distribution table.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM, or flash Memory), an optical fiber, a portable compact disc Read Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A table information extraction method, characterized by comprising:
acquiring semi-structured table data of a basic original table;
performing normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table;
marking the cells of the standard layout table according to the cell layout category of each cell in the structured table data, to obtain a layout distribution table matching the standard layout table;
extracting table information in a tuple format according to the layout distribution table.

2. The method according to claim 1, wherein acquiring the semi-structured table data of the basic original table comprises:
acquiring original table data;
determining an original table extraction tool according to the data type of the original table data;
extracting initial table information from the original table data by means of the original table extraction tool, to obtain the semi-structured table data of the basic original table.

3. The method according to claim 1, wherein performing normalization standard processing on the semi-structured table data to obtain the structured table data of the standard layout table comprises:
acquiring the basic table size and the table row and column data of the basic original table;
determining a target number of rows and a target number of columns of the semi-structured table data according to the table row and column data;
dividing the basic table size according to the target number of rows and the target number of columns to obtain a normalized standard cell size;
splitting the basic original table according to the normalized standard cell size to obtain normalized standard cells;
copying and filling the data of the normalized standard cells according to the original table data of the basic original table, to obtain the structured table data of the standard layout table.

4. The method according to claim 1, wherein marking the cells of the standard layout table according to the cell layout category of each cell in the structured table data, to obtain the layout distribution table matching the standard layout table, comprises:
acquiring cell association information of each cell in the standard layout table according to the structured table data, wherein the cell association information includes cell text information and cell position information;
determining, by a layout category classifier, the cell layout category of each cell in the standard layout table according to the cell association information;
marking the cells of the standard layout table according to the cell layout category of each cell in the standard layout table, to obtain the layout distribution table matching the standard layout table.

5. The method according to claim 1, wherein extracting the table information in a tuple format according to the layout distribution table comprises:
acquiring row header and column header distribution data of the layout distribution table;
splitting the layout distribution table according to the row header and column header distribution data to obtain basic layout distribution tables;
extracting basic table information in the tuple format from each basic layout distribution table;
aggregating the basic table information to obtain the table information in the tuple format of the layout distribution table.

6. The method according to claim 5, wherein extracting the basic table information in the tuple format from the basic layout distribution table comprises:
acquiring basic cell association information of the basic layout distribution table, wherein the basic cell association information includes basic cell position information, basic cell layout category, and basic cell text information;
generating an upper cell search rule for the basic cells in the basic layout distribution table according to the basic cell association information;
searching for the upper cells of each basic cell according to the upper cell search rule of the basic cells, wherein the upper cells include row upper cells and column upper cells;
performing error correction and deduplication on the upper cells to obtain target upper cell information;
generating the basic table information in the tuple format according to the basic cell association information and the target upper cell information.

7. The method according to any one of claims 1-6, wherein the table information in the tuple format comprises cell content information, row upper cell information, column upper cell information, cell layout category, and cell binary classification result.

8. A table information extraction apparatus, characterized by comprising:
a data acquisition module, configured to acquire semi-structured table data of a basic original table;
a data processing module, configured to perform normalization standard processing on the semi-structured table data to obtain structured table data of a standard layout table;
a table marking module, configured to mark the cells of the standard layout table according to the cell layout category of each cell in the structured table data, to obtain a layout distribution table matching the standard layout table;
a table information extraction module, configured to extract table information in a tuple format according to the layout distribution table.

9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the table information extraction method according to any one of claims 1-7.

10. A computer storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the table information extraction method according to any one of claims 1-7.
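
For illustration only, the flow recited in claims 1, 3, 5 and 6 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the application: the `RawCell` structure, the function names, the label strings, and the rule-based `label_layout` (a stand-in for the layout category classifier of claim 4) are all hypothetical, the table size is counted in grid cells rather than physical dimensions, and the error correction step and the binary classification result of claim 7 are omitted.

```python
from dataclasses import dataclass


@dataclass
class RawCell:
    """Hypothetical semi-structured cell: position, text, and merge spans."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def normalize(cells):
    """Split merged cells into unit cells and copy-fill their text (cf. claim 3)."""
    n_rows = max(c.row + c.rowspan for c in cells)  # target number of rows
    n_cols = max(c.col + c.colspan for c in cells)  # target number of columns
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                grid[r][k] = c.text
    return grid


def label_layout(grid, header_rows=1, header_cols=1):
    """Assumed rule-based stand-in for the layout category classifier."""
    labels = []
    for r, row in enumerate(grid):
        row_labels = []
        for k in range(len(row)):
            if r < header_rows:
                row_labels.append("col_header")
            elif k < header_cols:
                row_labels.append("row_header")
            else:
                row_labels.append("data")
        labels.append(row_labels)
    return labels


def dedupe(texts):
    """Drop empty strings and repeats (copy-fill duplicates), keeping order."""
    return tuple(dict.fromkeys(t for t in texts if t))


def extract_tuples(grid, labels):
    """For each data cell, collect its row and column upper cells (cf. claim 6)."""
    result = []
    for r, row in enumerate(grid):
        for k, text in enumerate(row):
            if labels[r][k] != "data":
                continue
            row_upper = dedupe(grid[r][j] for j in range(k)
                               if labels[r][j] == "row_header")
            col_upper = dedupe(grid[i][k] for i in range(r)
                               if labels[i][k] == "col_header")
            result.append((text, row_upper, col_upper, labels[r][k]))
    return result


# Usage: a small table with a two-row header and merged header cells.
cells = [
    RawCell(0, 0, "Item", rowspan=2),
    RawCell(0, 1, "2021", colspan=2),
    RawCell(0, 3, "Total", rowspan=2),
    RawCell(1, 1, "Q1"),
    RawCell(1, 2, "Q2"),
    RawCell(2, 0, "Revenue"),
    RawCell(2, 1, "10"),
    RawCell(2, 2, "12"),
    RawCell(2, 3, "22"),
]
grid = normalize(cells)
labels = label_layout(grid, header_rows=2, header_cols=1)
tuples = extract_tuples(grid, labels)
for t in tuples:
    print(t)
```

In this sketch the copy-fill step is what makes the upper cell search a simple scan along the row and column: every unit cell under a merged header carries the header text, and the deduplication step then collapses the repeated values (for example the vertically merged "Total" header) into a single entry.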
CN202210282207.7A 2022-03-21 2022-03-21 Table information extraction method and device, electronic equipment and storage medium Pending CN114581680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210282207.7A CN114581680A (en) 2022-03-21 2022-03-21 Table information extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210282207.7A CN114581680A (en) 2022-03-21 2022-03-21 Table information extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114581680A true CN114581680A (en) 2022-06-03

Family

ID=81781924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210282207.7A Pending CN114581680A (en) 2022-03-21 2022-03-21 Table information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114581680A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115454999A (en) * 2022-08-19 2022-12-09 杭州美创科技有限公司 Non-index table data comparison method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015060533A (en) * 2013-09-20 2015-03-30 株式会社モーダルコンセプトジャパン Creation device, creation method and creation program for layout chart
CN106709032A (en) * 2016-12-29 2017-05-24 深圳市华傲数据技术有限公司 Method and device for extracting structured information from spreadsheet document
CN111914805A (en) * 2020-08-18 2020-11-10 科大讯飞股份有限公司 Table structuring method, device, electronic device and storage medium
US20210397420A1 (en) * 2016-12-03 2021-12-23 Thomas STACHURA Spreadsheet-Based Software Application Development
JP2022015969A (en) * 2020-07-10 2022-01-21 京セラドキュメントソリューションズ株式会社 Data generation system and data generation program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
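Yang Jingmin: "Template Recognition and Extraction of Complex Table Document Images" (复杂表格文档图像的模板识别与提取), China Master's Theses Full-text Database, Information Science and Technology Series, 15 August 2019 (2019-08-15), pages 1-78 *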

Similar Documents

Publication Publication Date Title
US11416768B2 (en) Feature processing method and feature processing system for machine learning
CN113220710B (en) Data query method, device, electronic device and storage medium
CN114416662A (en) File comparison method and device, electronic equipment and storage medium
CN115757596A (en) General electric power unstructured data to structured data conversion method
CN118467595A (en) Search method, device, equipment, and medium for target domain based on large language model
CN114328885A (en) An information processing method, device and computer readable storage medium
CN115146070B (en) Key-value generation methods, knowledge graph generation methods, devices, equipment and media
CN118245285A (en) Intelligent disaster recovery management and control platform and method for data backup of data center
CN114860867A (en) Training document information extraction model, method and device for document information extraction
CN115130435A (en) Document processing method and device, electronic equipment and storage medium
CN119066177A (en) A policy interpretation method, device, equipment, medium and product
CN116028618B (en) Text processing, text retrieval methods, devices, electronic equipment and storage media
WO2024152550A1 (en) Picture processing method and apparatus, and electronic device and storage medium
CN114581680A (en) Table information extraction method and device, electronic equipment and storage medium
CN113377922B (en) Methods, devices, electronic devices and media for matching information
CN116561095A (en) Data migration method, device, electronic device and storage medium
CN113609100A (en) Data storage method, data query method, device and electronic device
CN110716994B (en) A retrieval method and device supporting heterogeneous geographic data resource retrieval
CN119396859A (en) Large language model data analysis method, device, computer equipment and storage medium
CN119886084A (en) Data table merging and splitting method, device, equipment, medium and program product
CN115203428B (en) A knowledge graph construction method, device, electronic equipment and storage medium
CN113361249B (en) Document weight determination method, device, electronic equipment and storage medium
CN116894021A (en) A log data storage method, query method, device, equipment and medium
CN117312574A (en) Information extraction method, device, equipment and storage medium
CN114201607B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220603