[go: up one dir, main page]

CN118467538A - An automated processing method and system for multi-field data table management and warehousing - Google Patents

An automated processing method and system for multi-field data table management and warehousing Download PDF

Info

Publication number
CN118467538A
CN118467538A CN202410579469.9A CN202410579469A CN118467538A CN 118467538 A CN118467538 A CN 118467538A CN 202410579469 A CN202410579469 A CN 202410579469A CN 118467538 A CN118467538 A CN 118467538A
Authority
CN
China
Prior art keywords
data
sub
field
warehousing
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410579469.9A
Other languages
Chinese (zh)
Other versions
CN118467538B (en
Inventor
杨璞
赵子墨
孙长杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chaozhou Zhuoshu Big Data Industry Development Co Ltd filed Critical Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority to CN202410579469.9A priority Critical patent/CN118467538B/en
Publication of CN118467538A publication Critical patent/CN118467538A/en
Application granted granted Critical
Publication of CN118467538B publication Critical patent/CN118467538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2365Ensuring data consistency and integrity
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种多字段数据表治理及入库的自动化处理方法及系统,属于大数据及数据库管理技术领域,通过数据预处理,构建图模型,数据表分割,建立字段映射关系,为子表创建关联ID,设置索引导入子表,并写成自动化治理流程工具,创新策略优化算法,实现对复杂数据表的高效治理和快速入库。本发明可以实现更高效得多字段数据表治理及入库流程,大大减少手动操作,提高导入效率,确保数据的一致性和完整性,并为数据治理提供便利;灵活的入库机制可以根据数据表的结构和字段特性进行定制化的处理。

The present invention discloses an automated processing method and system for multi-field data table management and warehousing, which belongs to the field of big data and database management technology. Through data preprocessing, graph model construction, data table segmentation, field mapping relationship establishment, creation of associated ID for subtable, index setting and import of subtable, and writing of automated management process tool, innovative strategy optimization algorithm, efficient management and rapid warehousing of complex data tables are realized. The present invention can realize more efficient multi-field data table management and warehousing process, greatly reduce manual operation, improve import efficiency, ensure data consistency and integrity, and provide convenience for data management; flexible warehousing mechanism can be customized according to the structure and field characteristics of the data table.

Description

一种多字段数据表治理及入库的自动化处理方法及系统An automated processing method and system for multi-field data table management and warehousing

技术领域Technical Field

本发明涉及大数据及数据库管理技术领域,具体地说是一种多字段数据表治理及入库的自动化处理方法及系统。The present invention relates to the technical field of big data and database management, and in particular to an automated processing method and system for managing and warehousing multi-field data tables.

背景技术Background Art

随着大数据时代的到来,数据规模呈指数级增加,使得数据及数据表内的结构和内容变的越来越复杂,因此对于多字段数据表的治理和入库变得尤为关键。传统的数据表治理和入库方法在处理大规模、多字段数据表时多依赖于手动操作或简单的自动化脚本,效率低下,很难满足数据实时分析和业务的及时响应需求。因此,如何将这种数据表高效的治理及入库,并保障数据安全,成为了急需解决的重要问题。With the advent of the big data era, the scale of data has increased exponentially, making the structure and content of data and data tables increasingly complex. Therefore, the management and storage of multi-field data tables have become particularly critical. Traditional data table management and storage methods rely on manual operations or simple automated scripts when processing large-scale, multi-field data tables. They are inefficient and difficult to meet the needs of real-time data analysis and timely response to business. Therefore, how to efficiently manage and store such data tables and ensure data security has become an important issue that needs to be solved urgently.

发明内容Summary of the invention

本发明的技术任务是针对以上不足之处,提供一种多字段数据表治理及入库的自动化处理方法及系统,通过创新的策略和优化算法,实现了对复杂数据表的高效治理和快速入库。The technical task of the present invention is to address the above deficiencies and provide an automated processing method and system for multi-field data table management and warehousing, which realizes efficient management and rapid warehousing of complex data tables through innovative strategies and optimization algorithms.

本发明解决其技术问题所采用的技术方案是:The technical solution adopted by the present invention to solve the technical problem is:

一种多字段数据表治理及入库的自动化处理方法,通过数据预处理,构建图模型,数据表分割,建立字段映射关系,为子表创建关联ID,设置索引导入子表,并写成自动化治理流程工具,创新策略优化算法,实现对复杂数据表的高效治理和快速入库。An automated processing method for multi-field data table management and warehousing, which implements data preprocessing, graph model construction, data table segmentation, field mapping relationship establishment, creation of associated IDs for sub-tables, index setting and import of sub-tables, and writing of automated management process tools, innovative strategy optimization algorithms, to achieve efficient management of complex data tables and rapid warehousing.

进一步的,该方法的实现过程如下:Furthermore, the implementation process of this method is as follows:

(1)数据预处理:(1) Data preprocessing:

首先接收待治理和入库的数据表,并对数据进行清洗和预处理,去除重复、无效或格式错误的数据,以确保数据的质量和一致性;First, the data tables to be managed and stored are received, and the data is cleaned and preprocessed to remove duplicate, invalid or erroneous data to ensure data quality and consistency;

(2)构建图模型:(2) Constructing a graph model:

基于业务需求和各字段之间的关联性,分析字段间的依赖关系和访问频率,构建图模型为分表做准备;Based on business requirements and the correlation between fields, analyze the dependencies and access frequencies between fields, and build a graph model to prepare for table sharding.

(3)数据表分割:(3) Data table segmentation:

将相关性高或访问频繁的字段聚集到一个子表中;Aggregate highly relevant or frequently accessed fields into a subtable;

通过图模型分析字段之间的关联关系及关联频率,确定哪些字段应该被划分到同一个子表里面;Use the graph model to analyze the association and association frequency between fields and determine which fields should be divided into the same sub-table;

(4)建立字段映射关系:(4) Establish field mapping relationship:

为每个子表建立字段映射关系,以确保不同表间数据的正确映射;Establish field mapping relationships for each sub-table to ensure correct mapping of data between different tables;

(5)为子表创建关联ID,设置索引:(5) Create an associated ID for the subtable and set an index:

关联ID不仅用于标识子表,还可用于构建子表间的关联关系,实现跨表数据关联查询;The association ID is not only used to identify the sub-table, but also to build the association relationship between sub-tables to realize cross-table data association query;

(6)将数据导入子表,为子表匹配其他子表的唯一ID字段:(6) Import the data into the sub-table and match the unique ID field of other sub-tables to the sub-table:

通过sql语句,将总表数据导入到各个子表中;Import the data of the master table into each sub-table through SQL statements;

(7)自动化入库工具开发:(7) Development of automated warehousing tools:

通过python打包开发一套多字段数据表自动化拆分工具,实现数据表的自动分割、字段映射关系的自动建立、关联ID的自动生成,自动导入子表任务。A set of multi-field data table automatic splitting tools is developed through Python packaging to realize automatic segmentation of data tables, automatic establishment of field mapping relationships, automatic generation of associated IDs, and automatic import of sub-table tasks.

进一步的,所述数据预处理,基于分割策略和字段映射关系,实现对数据表的自动清洗、分割、转换、合并操作。大大缩短数据处理周期。提高了数据查询、处理及入库效率。Furthermore, the data preprocessing can realize automatic cleaning, segmentation, conversion and merging of data tables based on segmentation strategies and field mapping relationships, greatly shortening the data processing cycle and improving the efficiency of data query, processing and storage.

进一步的,所述构建图模型,根据字段间的依赖关系和访问频率构建图模型,将相关性高或访问频繁的字段聚集到一个子表中;优化后期数据查询效率,减少跨表操作。Furthermore, the graph model is constructed according to the dependency and access frequency between fields, and the fields with high correlation or frequent access are aggregated into a sub-table; the efficiency of subsequent data query is optimized and cross-table operations are reduced.

优选的,所述构建图模型,在图模型中,每个节点代表字段,边代表字段之间的关联,根据字段之间可能存在的联系为两节点之间连线。Preferably, in the graph model, each node represents a field, and an edge represents an association between fields, and a line is formed between two nodes according to possible connections between the fields.

进一步的,所述数据表分割,根据数据表的特点和业务需求设计合理的分割策略,将数据表拆分为多个子表;Furthermore, the data table segmentation is to design a reasonable segmentation strategy according to the characteristics of the data table and business requirements, and split the data table into multiple sub-tables;

所述为每个子表建立字段映射关系,通过元数据管理和数据字典技术,为每个字段定义数据类型、格式、取值范围等等,方便跨表查询和数据整合。The field mapping relationship is established for each sub-table, and the data type, format, value range, etc. are defined for each field through metadata management and data dictionary technology, so as to facilitate cross-table query and data integration.

进一步的,为子表建立唯一id,以保证各子表之间的关联性;Furthermore, a unique ID is established for each sub-table to ensure the association between the sub-tables;

通过分布式唯一ID生成器或者mysql中的创建UUID,为子表中的每条数据创建唯一ID;并根据数据的特点和查询需求,为合适的字段添加索引策略,提高查询效率。Create a unique ID for each piece of data in the subtable through a distributed unique ID generator or create UUID in MySQL; and add index strategies for appropriate fields based on the characteristics of the data and query requirements to improve query efficiency.

本发明还要求保护一种多字段数据表治理及入库的自动化处理系统,包括数据预处理模块,构建图模型模块,数据表分割模块,建立字段映射关系模块,子表创建关联ID模块,数据导入子表模块,自动化入库工具生成模块;The present invention also claims protection for an automated processing system for multi-field data table management and warehousing, including a data preprocessing module, a graph model building module, a data table segmentation module, a field mapping relationship establishment module, a sub-table creation association ID module, a data import sub-table module, and an automated warehousing tool generation module;

该系统通过上述的方法实现多字段数据表治理及入库的自动化处理。The system realizes the automatic processing of multi-field data table management and warehousing through the above method.

本发明还要求保护一种多字段数据表治理及入库的自动化处理装置,包括至少一个存储器和至少一个处理器;The present invention also claims protection for an automated processing device for managing and warehousing a multi-field data table, comprising at least one memory and at least one processor;

所述至少一个存储器,用于存储机器可读程序;The at least one memory is used to store a machine-readable program;

所述至少一个处理器,用于调用所述机器可读程序,实现上述的方法。The at least one processor is used to call the machine-readable program to implement the above method.

本发明还要求保护一种计算机可读介质,所述计算机可读介质上存储有计算机指令,所述计算机指令在被处理器执行时,使所述处理器实现上述的方法。The present invention also claims protection for a computer-readable medium having computer instructions stored thereon, which, when executed by a processor, enable the processor to implement the above method.

本发明的一种多字段数据表治理及入库的自动化处理方法及系统与现有技术相比,具有以下有益效果:Compared with the prior art, the method and system for automatically processing and storing multi-field data tables of the present invention have the following beneficial effects:

本发明提供的多字段数据表治理及入库的流程方法,通过创新的策略和优化算法,实现了对复杂数据表的高效治理和快速入库,大大提高了数据处理和数据管理的效率和准确性,降低了操作成本。同时,本发明的方法能够适用于各种具有多个字段的数据表,实现更高效的表治理及入库流程,大大减少手动操作,提高导入效率,确保数据的一致性和完整性,并为数据治理提供便利,具有广泛的应用前景。The multi-field data table management and warehousing process method provided by the present invention realizes efficient management and rapid warehousing of complex data tables through innovative strategies and optimization algorithms, greatly improves the efficiency and accuracy of data processing and data management, and reduces operating costs. At the same time, the method of the present invention can be applied to various data tables with multiple fields, realize more efficient table management and warehousing processes, greatly reduce manual operations, improve import efficiency, ensure data consistency and integrity, and provide convenience for data management, and has broad application prospects.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是本发明实施例提供的多字段数据表治理及入库的自动化处理方法流程示图;1 is a flowchart of an automated processing method for managing and warehousing a multi-field data table provided by an embodiment of the present invention;

图2是本发明实施例提供的构建图模型示例图。FIG. 2 is an example diagram of a graph model constructed according to an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

下面结合具体实施例对本发明作进一步说明。The present invention will be further described below in conjunction with specific embodiments.

本发明实施例提供一种多字段数据表治理及入库的自动化处理方法,通过数据预处理,构建图模型,数据表分割,建立字段映射关系,为子表创建关联ID,设置索引导入子表,并写成自动化治理流程工具,创新策略优化算法,实现对复杂数据表的高效治理和快速入库。The embodiment of the present invention provides an automated processing method for multi-field data table management and warehousing, which implements efficient management and rapid warehousing of complex data tables through data preprocessing, graph model construction, data table segmentation, establishment of field mapping relationships, creation of associated IDs for subtables, setting of indexes to import subtables, and writing of automated management process tools and innovative strategy optimization algorithms.

其中,所述数据预处理,通过自动化的多字段数据治理入库工具,基于分割策略和字段映射关系,实现对数据表的自动清洗、分割、转换、合并等操作。大大缩短数据处理周期。提高了数据查询、处理及入库效率。The data preprocessing, through the automated multi-field data management and warehousing tool, can realize automatic cleaning, segmentation, conversion, merging and other operations on the data table based on segmentation strategy and field mapping relationship, greatly shortening the data processing cycle and improving the efficiency of data query, processing and warehousing.

所述构建图模型,根据字段间的依赖关系和访问频率构建图模型,将相关性高或访问频繁的字段聚集到一个子表中。优化后期数据查询效率,减少跨表操作。The graph model is constructed based on the dependency and access frequency between fields, and the fields with high correlation or frequent access are aggregated into a sub-table to optimize the efficiency of later data query and reduce cross-table operations.

数据表分割,根据数据表的特点和业务需求设计合理的分割策略,将数据表拆分为多个子表。Data table segmentation: design a reasonable segmentation strategy based on the characteristics of the data table and business needs, and split the data table into multiple sub-tables.

该方法适用于各种具有多个字段的数据表,实现更高效的表治理及入库流程,大大减少手动操作,提高导入效率,确保数据的一致性和完整性。This method is applicable to various data tables with multiple fields, achieving more efficient table management and warehousing processes, greatly reducing manual operations, improving import efficiency, and ensuring data consistency and integrity.

为子表建立唯一id,保证各子表之间的关联性。Create a unique ID for the sub-table to ensure the association between the sub-tables.

针对入库过程中的瓶颈问题适当的建立索引,提高数据入库效率及后期的数据查询效率。Appropriate indexing is done to address bottleneck issues during the warehousing process, improving data warehousing efficiency and subsequent data query efficiency.

该方法的具体实现过程如下:The specific implementation process of this method is as follows:

1、数据预处理:1. Data preprocessing:

首先接收待治理和入库的数据表,并对数据进行清洗和预处理,去除重复、无效或格式错误的数据,以确保数据的质量和一致性。First, the data tables to be managed and stored are received, and the data is cleaned and preprocessed to remove duplicate, invalid or erroneous data to ensure the quality and consistency of the data.

2、构建图模型:2. Build a graph model:

基于业务需求和各字段之间的关联性,分析字段间的依赖关系和访问频率,构建图模型为分表做准备;Based on business requirements and the correlation between fields, analyze the dependencies and access frequencies between fields, and build a graph model to prepare for table sharding.

例如:在社区人房数据中,姓名、人员标签、居住住址、电话等通常会有频繁的关联查询。根据图模型算法,可以将这些关联字段放在一个子表中,以便于快速的查询和整合数据。For example, in community housing data, there are often frequent related queries for names, personnel labels, residential addresses, telephone numbers, etc. According to the graph model algorithm, these related fields can be placed in a subtable to facilitate fast query and data integration.

具体方法为:在图模型中的每个节点代表字段,边代表字段之间的关联,可以自己根据字段之间可能存在的联系为两节点之间连线。如年龄与收入之间不可避免的存在一定联系,电话与用户之间的存在对应关系等等。如图2所示。The specific method is: each node in the graph model represents a field, and the edge represents the relationship between the fields. You can connect two nodes based on the possible relationship between the fields. For example, there is an inevitable relationship between age and income, and there is a corresponding relationship between phone numbers and users, etc. See Figure 2.

3、数据表分割:3. Data table segmentation:

将相关性高或访问频繁的字段聚集到一个子表中。这样可以优化数据查询效率,减少跨表操作。Aggregate highly relevant or frequently accessed fields into a subtable. This can optimize data query efficiency and reduce cross-table operations.

通过图模型分析字段之间的关联关系及关联频率,确定哪些字段应该被划分到同一个子表里面;如根据上图所示图模型,可以把多字段数据表分为人口基础信息表,人口附加信息表,社区表,房屋表4张子表。The graph model is used to analyze the association and association frequency between fields, and determine which fields should be divided into the same sub-table. For example, according to the graph model shown in the figure above, the multi-field data table can be divided into four sub-tables: population basic information table, population additional information table, community table, and housing table.

人口基础表中包含:姓名,所属社区名称,年龄,人员标签,居住地址,电话,身份证号,籍贯等与姓名字段关联频率较高的字段;The population basic table contains: name, community name, age, person label, residential address, telephone number, ID number, place of origin and other fields that are frequently associated with the name field;

人口附加表包含婚姻状况,学历,收入,职业,政治面貌等关联频率较低的字段;The population supplementary table contains fields with low correlation frequency, such as marital status, education, income, occupation, and political status;

房屋表包含具体房屋所在的省,市,区,街道,楼号,房号等字段;The house table contains fields such as province, city, district, street, building number, room number, etc. where the specific house is located;

社区表则包含了社区以及对应的社区编号等社区其他相关字段。The community table contains the community and other community related fields such as the corresponding community number.

4、建立字段映射关系:4. Establish field mapping relationship:

为每个子表建立字段映射关系,以确保不同表间数据的正确映射;如通过元数据管理和数据字典技术,为每个字段定义数据类型、格式、取值范围等等,方便跨表查询和数据整合。Establish field mapping relationships for each sub-table to ensure correct mapping of data between different tables; for example, through metadata management and data dictionary technology, define the data type, format, value range, etc. for each field to facilitate cross-table query and data integration.

5、为子表创建关联ID,设置索引:5. Create an associated ID for the subtable and set an index:

将总表的数据分别导入子表时,要想使每个子表相互关联,必须建立唯一ID。关联ID不仅用于标识子表,还可用于构建子表间的关联关系,实现跨表数据关联查询。可以通过分布式唯一ID生成器或者mysql中的创建UUID,为子表中的每条数据创建唯一ID。并根据数据的特点和查询需求,为合适的字段添加索引策略,提高查询效率。When importing data from the master table into sub-tables separately, a unique ID must be created to associate each sub-table with each other. The association ID is not only used to identify the sub-table, but also to build the association relationship between sub-tables to implement cross-table data association query. You can create a unique ID for each piece of data in the sub-table through the distributed unique ID generator or create UUID in MySQL. And according to the characteristics of the data and query requirements, add index strategies for appropriate fields to improve query efficiency.

在上述例子中,需要通过分布式唯一ID生成器为三个子表建立唯一ID,分别为:人员基础表的userid,房屋表的houseid,社区表的communityId;随后在人员基础表中创建存储houseid,communityId的对应字段,并在导入人口附加表时根据身份证号匹配人员基础表的userid,使其与人员基础表相关联。同时根据数据的特点和查询需求,为身份证号等相关字段添加索引策略,提高查询效率。In the above example, a distributed unique ID generator is required to create unique IDs for the three sub-tables, namely: userid of the personnel basic table, houseid of the house table, and communityId of the community table; then corresponding fields for storing houseid and communityId are created in the personnel basic table, and when the population additional table is imported, the userid of the personnel basic table is matched according to the ID card number to associate it with the personnel basic table. At the same time, according to the characteristics of the data and query requirements, index strategies are added for related fields such as the ID card number to improve query efficiency.

6、将数据导入子表,为子表匹配其他子表的唯一ID字段:6. Import the data into the sub-table and match the unique ID fields of other sub-tables for the sub-table:

通过sql语句,将总表数据导入到各个子表中:Import the total table data into each subtable through SQL statements:

6.1、先导房屋表数据(distinct保证数据的唯一性),并生成houseid,6.1. First, obtain the house table data (distinct ensures the uniqueness of the data) and generate houseid.

Insert into房屋表(省,市,区,街道,楼号,房号)Insert into housing table (province, city, district, street, building number, room number)

Select distinct省,市,区,街道,楼号,房号from社区人房总表;Select distinct province, city, district, street, building number, room number from community room list;

6.2、同理导社区表数据,6.2. Similarly, export the community table data.

Insert into社区表(所属社区,社区编号...)Insert into community table (community, community number...)

Select distinct所属社区,社区编号...from社区人房总表;Select distinct community, community number...from community room total table;

6.3、在相应子表中分别生成houseid及communityid数据,6.3. Generate houseid and communityid data in the corresponding subtables respectively.

6.4、导人员基础表数据,6.4. Data of the basic table of personnel,

Insert into人员基础表(userid,姓名,所属社区名称,年龄,人员标签,居住住址,电话,身份证号,籍贯)Insert into personnel basic table (userid, name, community name, age, personnel label, residential address, phone number, ID number, place of origin)

SELECT UUID(),姓名,所属社区名称,年龄,人员标签,居住住址,电话,身份证号,籍贯FROM社区人房总数据表;SELECT UUID(), name, community name, age, person tag, residential address, phone number, ID number, native place FROM community person and house total data table;

6.5、创建相关索引,利用update语句为人员基础表匹配houseid,communityid字段。6.5. Create relevant indexes and use the update statement to match the houseid and communityid fields for the personnel base table.

6.6、创建索引,导入人员附加表数据,6.6. Create index and import personnel additional table data.

Insert into人员附加表(婚姻状况,学历,收入,职业,政治面貌,userid)Insert into additional table of personnel (marital status, education, income, occupation, political status, userid)

SELECT婚姻状况,学历,收入,职业,政治面貌,(SELECT marital status, education, income, occupation, political status, (

SELECTSELECT

useriduserid

FROMFROM

人员基础表bPersonnel basic table b

WHEREWHERE

a.身份证号=b.身份证号a. ID number = b. ID number

)AS userid,籍贯FROM社区人房总数据表a。)AS userid, place of origin FROM community housing data table a.

7、自动化入库工具开发:7. Development of automated warehousing tools:

通过python打包开发一套多字段数据表自动化拆分工具,实现数据表的自动分割、字段映射关系的自动建立、关联ID的自动生成,自动导入子表任务。A set of multi-field data table automatic splitting tools is developed through Python packaging to realize automatic segmentation of data tables, automatic establishment of field mapping relationships, automatic generation of associated IDs, and automatic import of sub-table tasks.

在多字段数据表的治理及入库过程中,通常把整张数据表通过sql导入到数据库中,在此过程中导入效率快慢、数据一致性问题以及后期数据治理的复杂度都难以得到保障。本方法可以实现更高效得多字段数据表治理及入库流程,大大减少手动操作,提高导入效率,确保数据的一致性和完整性,并为数据治理提供便利。In the process of managing and storing multi-field data tables, the entire data table is usually imported into the database through SQL. In this process, the import efficiency, data consistency and the complexity of subsequent data management are difficult to guarantee. This method can achieve a more efficient multi-field data table management and storage process, greatly reduce manual operations, improve import efficiency, ensure data consistency and integrity, and facilitate data management.

另外,它提供一种灵活的入库机制,可以根据数据表的结构和字段特性进行定制化的处理。总的来说,它是为了实现对多字段数据表的自动化治理和入库,从而提高数据处理和数据管理的效率和准确性,降低操作成本。In addition, it provides a flexible storage mechanism that can be customized according to the structure and field characteristics of the data table. In general, it is to achieve automatic management and storage of multi-field data tables, thereby improving the efficiency and accuracy of data processing and data management and reducing operating costs.

本发明实施例还提供一种多字段数据表治理及入库的自动化处理系统,包括数据预处理模块,构建图模型模块,数据表分割模块,建立字段映射关系模块,子表创建关联ID模块,数据导入子表模块,自动化入库工具生成模块;该系统通过上述实施例所述的方法实现多字段数据表治理及入库的自动化处理。An embodiment of the present invention also provides an automated processing system for multi-field data table management and warehousing, including a data preprocessing module, a graph model building module, a data table segmentation module, a field mapping relationship establishment module, a sub-table association ID creation module, a data import sub-table module, and an automated warehousing tool generation module; the system realizes automated processing of multi-field data table management and warehousing through the methods described in the above embodiments.

1、数据预处理模块,1. Data preprocessing module,

首先接收待治理和入库的数据表,并对数据进行清洗和预处理,去除重复、无效或格式错误的数据,确保数据的质量和一致性。First, the data tables to be managed and stored are received, and the data is cleaned and preprocessed to remove duplicate, invalid or erroneous data to ensure data quality and consistency.

2、构建图模型模块,2. Build the graph model module.

基于业务需求和各字段之间的关联性,分析字段间的依赖关系和访问频率,构建图模型为分表做准备。Based on business needs and the correlation between fields, analyze the dependencies and access frequencies between fields, and build a graph model to prepare for table sharding.

例如:在社区人房数据中,姓名,人员标签,居住住址,电话等通常会有频繁的关联查询。根据图模型算法,可以将这些关联字段放在一个子表中,以便于快速的查询和整合数据。For example, in community housing data, there are usually frequent related queries for names, personnel labels, residential addresses, telephone numbers, etc. According to the graph model algorithm, these related fields can be placed in a subtable to facilitate fast query and integration of data.

具体方法为:在图模型中的每个节点代表字段,边代表字段之间的关联,我们可以自己根据字段之间可能存在的联系为两节点之间连线。如年龄与收入之间不可避免的存在一定联系,电话与用户之间的存在对应关系等等。The specific method is: each node in the graph model represents a field, and the edge represents the relationship between the fields. We can connect two nodes based on the possible relationship between the fields. For example, there is an inevitable relationship between age and income, and there is a corresponding relationship between phone numbers and users.

3、数据表分割模块,3. Data table segmentation module,

将相关性高或访问频繁的字段聚集到一个子表中。这样可以优化数据查询效率,减少跨表操作。通过图模型分析字段之间的关联关系及关联频率,确定哪些字段应该被划分到同一个子表里面。Aggregate highly correlated or frequently accessed fields into a subtable. This can optimize data query efficiency and reduce cross-table operations. Use the graph model to analyze the association and association frequency between fields and determine which fields should be divided into the same subtable.

4、建立字段映射关系模块,4. Establish a field mapping relationship module.

为每个子表建立字段映射关系,确保不同表间数据的正确映射。如通过元数据管理和数据字典技术,为每个字段定义数据类型、格式、取值范围等等,方便跨表查询和数据整合。Establish field mapping relationships for each sub-table to ensure the correct mapping of data between different tables. For example, through metadata management and data dictionary technology, define the data type, format, value range, etc. for each field to facilitate cross-table query and data integration.

5、子表创建关联ID模块,5. Create an associated ID module for the subtable.

为子表建立唯一id,保证各子表之间的关联性。将总表的数据分别导入子表时,要想使每个子表相互关联,必须建立唯一ID。关联ID不仅用于标识子表,还可用于构建子表间的关联关系,实现跨表数据关联查询。可以通过分布式唯一ID生成器或者mysql中的创建UUID,为子表中的每条数据创建唯一ID。并根据数据的特点和查询需求,为合适的字段添加索引策略,提高查询效率。Create a unique ID for the subtable to ensure the association between each subtable. When importing the data of the main table into the subtables separately, a unique ID must be created to associate each subtable with each other. The association ID is not only used to identify the subtable, but also to build the association relationship between subtables to achieve cross-table data association query. You can create a unique ID for each piece of data in the subtable through the distributed unique ID generator or create UUID in MySQL. And according to the characteristics of the data and query requirements, add index strategies for appropriate fields to improve query efficiency.

在上述例子中,需要通过分布式唯一ID生成器为三个子表建立唯一ID,分别为:人员基础表的userid,房屋表的houseid,社区表的communityId;随后在人员基础表中创建存储houseid,communityId的对应字段,并在导入人口附加表时根据身份证号匹配人员基础表的userid,使其与人员基础表相关联。同时根据数据的特点和查询需求,为身份证号等相关字段添加索引策略,提高查询效率。In the above example, a distributed unique ID generator is required to create unique IDs for the three sub-tables, namely: userid of the personnel basic table, houseid of the house table, and communityId of the community table; then corresponding fields for storing houseid and communityId are created in the personnel basic table, and when the population additional table is imported, the userid of the personnel basic table is matched according to the ID card number to associate it with the personnel basic table. At the same time, according to the characteristics of the data and query requirements, index strategies are added for related fields such as the ID card number to improve query efficiency.

6、数据导入子表模块,6. Import data into sub-table module,

将数据导入子表,为子表匹配其他子表的唯一ID字段。通过sql语句,将总表数据导入到各个子表中。Import data into the sub-table, and match the sub-table with the unique ID field of other sub-tables. Use SQL statements to import the master table data into each sub-table.

7、自动化入库工具生成模块,7. Automatic warehousing tool generation module,

结合上述各模块,通过python打包开发一套多字段数据表自动化拆分工具,实现数据表的自动分割、字段映射关系的自动建立、关联ID的自动生成,自动导入子表等任务。Combining the above modules, a set of multi-field data table automatic splitting tools was developed through Python packaging to realize tasks such as automatic segmentation of data tables, automatic establishment of field mapping relationships, automatic generation of associated IDs, and automatic import of sub-tables.

本发明实施例还提供一种多字段数据表治理及入库的自动化处理装置,包括至少一个存储器和至少一个处理器;The embodiment of the present invention also provides an automated processing device for managing and storing multi-field data tables, comprising at least one memory and at least one processor;

所述至少一个存储器,用于存储机器可读程序;The at least one memory is used to store a machine-readable program;

所述至少一个处理器,用于调用所述机器可读程序,实现上述实施例中所述的多字段数据表治理及入库的自动化处理方法。The at least one processor is used to call the machine-readable program to implement the automated processing method for managing and warehousing the multi-field data table described in the above embodiment.

本发明实施例还提供一种计算机可读介质,所述计算机可读介质上存储有计算机指令,所述计算机指令在被处理器执行时,使所述处理器执行上述实施例中所述的多字段数据表治理及入库的自动化处理方法。具体地,可以提供配有存储介质的系统或者装置,在该存储介质上存储着实现上述实施例中任一实施例的功能的软件程序代码,且使该系统或者装置的计算机(或CPU或MPU)读出并执行存储在存储介质中的程序代码。The embodiment of the present invention further provides a computer-readable medium, on which computer instructions are stored, and when the computer instructions are executed by a processor, the processor executes the automatic processing method for managing and storing multi-field data tables described in the above embodiment. Specifically, a system or device equipped with a storage medium can be provided, on which software program codes for implementing the functions of any of the above embodiments are stored, and a computer (or CPU or MPU) of the system or device can read and execute the program code stored in the storage medium.

在这种情况下,从存储介质读取的程序代码本身可实现上述实施例中任何一项实施例的功能,因此程序代码和存储程序代码的存储介质构成了本发明的一部分。In this case, the program code itself read from the storage medium can realize the function of any one of the above-mentioned embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

用于提供程序代码的存储介质实施例包括软盘、硬盘、磁光盘、光盘(如CD-ROM、CD-R、CD-RW、DVD-ROM、DVD-RAM、DVD-RW、DVD+RW)、磁带、非易失性存储卡和ROM。可选择地,可以由通信网络从服务器计算机上下载程序代码。The storage medium embodiments for providing the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a non-volatile memory card, and a ROM. Alternatively, the program code can be downloaded from a server computer by a communication network.

此外,应该清楚的是,不仅可以通过执行计算机所读出的程序代码,而且可以通过基于程序代码的指令使计算机上操作的操作系统等来完成部分或者全部的实际操作,从而实现上述实施例中任意一项实施例的功能。In addition, it should be clear that the functions of any of the above embodiments can be implemented not only by executing the program code read by the computer, but also by enabling an operating system operating on the computer to complete part or all of the actual operations based on instructions from the program code.

此外,可以理解的是,将由存储介质读出的程序代码写到插入计算机内的扩展板中所设置的存储器中或者写到与计算机相连接的扩展单元中设置的存储器中,随后基于程序代码的指令使安装在扩展板或者扩展单元上的CPU等来执行部分和全部实际操作,从而实现上述实施例中任一实施例的功能。In addition, it can be understood that the program code read from the storage medium is written to a memory provided in an expansion board inserted into the computer or written to a memory provided in an expansion unit connected to the computer, and then based on the instructions of the program code, a CPU installed on the expansion board or the expansion unit is enabled to perform part or all of the actual operations, thereby realizing the functions of any of the above-mentioned embodiments.

上文通过附图和优选实施例对本发明进行了详细展示和说明,然而本发明不限于这些已揭示的实施例,基与上述多个实施例本领域技术人员可以知晓,可以组合上述不同实施例中的代码审核手段得到本发明更多的实施例,这些实施例也在本发明的保护范围之内。The present invention is shown and described in detail above through the accompanying drawings and preferred embodiments. However, the present invention is not limited to these disclosed embodiments. Based on the above multiple embodiments, those skilled in the art can know that the code review methods in the above different embodiments can be combined to obtain more embodiments of the present invention, and these embodiments are also within the protection scope of the present invention.

Claims (10)

1. An automatic processing method for managing and warehousing multi-field data table is characterized in that a graph model is constructed through data preprocessing, the data table is divided, a field mapping relation is established, an associated ID is created for a sub-table, a cable is arranged and guided into the sub-table, and the sub-table is written into an automatic management flow tool, so that a strategy optimization algorithm is innovated, and efficient management and quick warehousing of complex data tables are realized.
2. The automated processing method for managing and warehousing multi-field data table according to claim 1, wherein the implementation process of the method is as follows:
1) Data preprocessing:
Firstly, receiving a data table to be treated and put in storage, and cleaning and preprocessing the data to remove repeated, invalid or wrong data;
2) Constructing a graph model:
based on the service requirement and the relevance among the fields, analyzing the dependency relationship and the access frequency among the fields, and constructing a graph model to prepare for sub-tables;
3) Data table segmentation:
aggregating fields with high correlation or frequent access into a sub-table;
analyzing the association relation and association frequency among the fields through the graph model, and determining which fields should be divided into the same sub-table;
4) Establishing a field mapping relation:
establishing a field mapping relation for each sub-table to ensure correct mapping of data among different tables;
5) Creating an association ID for a sub-table, setting an index:
the association ID is not only used for identifying the sub-tables, but also used for constructing association relations among the sub-tables, so as to realize cross-table data association inquiry;
6) Importing data into the sub-table, and matching the sub-table with unique ID fields of other sub-tables:
importing the total table data into each sub-table through the sql statement;
7) Automatic warehouse tool development:
Developing an automatic splitting tool for the multi-field data table, realizing automatic splitting of the data table, automatic establishment of field mapping relation, automatic generation of association ID and automatic importing of sub-table tasks.
3. The automated processing method for managing and warehousing multi-field data tables according to claim 1 or 2, wherein the data preprocessing realizes the operations of automatically cleaning, dividing, converting and merging the data tables based on a dividing strategy and a field mapping relation.
4. The automated processing method for managing and warehousing a multi-field data table according to claim 1 or 2, wherein the graph model is constructed according to the dependency relationship between fields and the access frequency, and the fields with high correlation or frequent access are gathered into a sub-table.
5. The automated multi-field data table management and warehousing method according to claim 4, wherein the graph model is constructed, each node represents a field, the edges represent the association between the fields, and the connection between the two nodes is based on the possible association between the fields.
6. The automated processing method for managing and warehousing a multi-field data table according to claim 1 or 2, wherein the data table is divided, and a division strategy is designed according to characteristics and service requirements of the data table, so that the data table is divided into a plurality of sub-tables;
And establishing a field mapping relation for each sub-table, and defining a data type, a format and a value range for each field through metadata management and data dictionary technology.
7. An automated processing method for managing and warehousing multi-field data tables according to claim 1 or 2, wherein a unique id is established for the sub-tables to ensure the relevance between the sub-tables;
creating a unique ID for each piece of data in the sub-table by a distributed unique ID generator or creating a UUID in mysql; and adding an index strategy for the proper field according to the characteristics of the data and the query requirement.
8. An automatic processing system for managing and warehousing a multi-field data table is characterized by comprising a data preprocessing module, a graph model building module, a data table dividing module, a field mapping relation building module, a sub-table creation association ID module, a data importing sub-table module and an automatic warehousing tool generating module;
The system realizes the automation treatment of multi-field data table management and warehousing by the method of any one of claims 1 to 7.
9. An automatic processing device for managing and warehousing multi-field data tables is characterized by comprising at least one memory and at least one processor;
The at least one memory for storing a machine readable program;
the at least one processor being configured to invoke the machine readable program to implement the method of any of claims 1 to 7.
10. A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to implement the method of any of claims 1 to 7.
CN202410579469.9A 2024-05-11 2024-05-11 Automatic processing method and system for multi-field data table management and warehousing Active CN118467538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410579469.9A CN118467538B (en) 2024-05-11 2024-05-11 Automatic processing method and system for multi-field data table management and warehousing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410579469.9A CN118467538B (en) 2024-05-11 2024-05-11 Automatic processing method and system for multi-field data table management and warehousing

Publications (2)

Publication Number Publication Date
CN118467538A true CN118467538A (en) 2024-08-09
CN118467538B CN118467538B (en) 2025-08-08

Family

ID=92167900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410579469.9A Active CN118467538B (en) 2024-05-11 2024-05-11 Automatic processing method and system for multi-field data table management and warehousing

Country Status (1)

Country Link
CN (1) CN118467538B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120804363A (en) * 2025-09-15 2025-10-17 富盛科技股份有限公司 Video resource processing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060136469A1 (en) * 2004-12-17 2006-06-22 International Business Machines Corporation Creating a logical table from multiple differently formatted physical tables having different access methods
CN101566986A (en) * 2008-04-21 2009-10-28 阿里巴巴集团控股有限公司 Method and device for processing data in online business processing
CN111046035A (en) * 2019-10-29 2020-04-21 三盟科技股份有限公司 Data automation processing method, system, computer equipment and readable storage medium
CN113259972A (en) * 2021-06-08 2021-08-13 网络通信与安全紫金山实验室 Data warehouse construction method, system, equipment and medium based on wireless communication network
CN115391428A (en) * 2022-08-31 2022-11-25 广东工业大学 Real-time calculation method for dynamic association relation of mass financial time sequence data
WO2023078130A1 (en) * 2021-11-03 2023-05-11 中兴通讯股份有限公司 Index creation method and apparatus, and computer-readable storage medium
CN117971902A (en) * 2024-02-05 2024-05-03 广东科杰通信息科技有限公司 Data index slicing compression processing method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060136469A1 (en) * 2004-12-17 2006-06-22 International Business Machines Corporation Creating a logical table from multiple differently formatted physical tables having different access methods
CN101566986A (en) * 2008-04-21 2009-10-28 阿里巴巴集团控股有限公司 Method and device for processing data in online business processing
CN111046035A (en) * 2019-10-29 2020-04-21 三盟科技股份有限公司 Data automation processing method, system, computer equipment and readable storage medium
CN113259972A (en) * 2021-06-08 2021-08-13 网络通信与安全紫金山实验室 Data warehouse construction method, system, equipment and medium based on wireless communication network
WO2023078130A1 (en) * 2021-11-03 2023-05-11 中兴通讯股份有限公司 Index creation method and apparatus, and computer-readable storage medium
CN115391428A (en) * 2022-08-31 2022-11-25 广东工业大学 Real-time calculation method for dynamic association relation of mass financial time sequence data
CN117971902A (en) * 2024-02-05 2024-05-03 广东科杰通信息科技有限公司 Data index slicing compression processing method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120804363A (en) * 2025-09-15 2025-10-17 富盛科技股份有限公司 Video resource processing method and device

Also Published As

Publication number Publication date
CN118467538B (en) 2025-08-08

Similar Documents

Publication Publication Date Title
CN112199366B (en) Data table processing method, device and equipment
CN118819778A (en) Task processing method, device, equipment, storage medium and program product based on large model intelligent agent arrangement
CN111984745B (en) Database field dynamic expansion method, device, equipment and storage medium
WO2025039361A1 (en) Data processing method, device and system, and storage medium
CN112347122B (en) SQL workflow processing method, device, electronic equipment and storage medium
CN111666344B (en) Heterogeneous data synchronization method and device
CN111782820B (en) Knowledge graph creation method and device, readable storage medium and electronic equipment
CN102708203A (en) A Dynamic Database Management Method Based on XML Metadata
CN115185986A (en) Province and city address information matching method, device, computer equipment and storage medium
CN110515999A (en) General record processing method, device, electronic device and storage medium
CN112800149A (en) Data blood margin analysis-based data management method and system
CN114741276A (en) Method and device for multiplexing test cases of domestic operating system
CN116680270A (en) Data table conversion method, device and storage medium
CN116610714A (en) Data query method, device, computer equipment and storage medium
US20080294673A1 (en) Data transfer and storage based on meta-data
CN114564621A (en) A method, apparatus, device and readable storage medium for associating data
CN118467538A (en) An automated processing method and system for multi-field data table management and warehousing
CN111897837A (en) Data query method, device, equipment and medium
CN114238318A (en) Data storage structure processing method, processing device and data storage system
CN119357209A (en) A coding system and coding method based on unified housing construction coding standards
CN116127047B (en) Method and device for establishing enterprise information database
CN118689988A (en) A question-and-answer method based on the enterprise internal management manual
CN112835897B (en) Geographic area division management method, data conversion method and related equipment
CN100565519C (en) Data bank service disposal route based on the uniform data sign
CN116561106A (en) Configuration item data management method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant