CN119271750A - Data processing method, device, equipment, medium and program product for data warehouse - Google Patents
Data processing method, device, equipment, medium and program product for data warehouse Download PDFInfo
- Publication number
- CN119271750A CN119271750A CN202411354924.1A CN202411354924A CN119271750A CN 119271750 A CN119271750 A CN 119271750A CN 202411354924 A CN202411354924 A CN 202411354924A CN 119271750 A CN119271750 A CN 119271750A
- Authority
- CN
- China
- Prior art keywords
- data
- target
- tables
- data warehouse
- business
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 43
- 230000001360 synchronised effect Effects 0.000 claims abstract description 28
- 238000000034 method Methods 0.000 claims description 43
- 230000008569 process Effects 0.000 claims description 28
- 238000012545 processing Methods 0.000 claims description 28
- 238000004590 computer program Methods 0.000 claims description 14
- 230000008685 targeting Effects 0.000 claims 4
- 230000004044 response Effects 0.000 abstract description 8
- 238000005516 engineering process Methods 0.000 abstract description 6
- 235000019580 granularity Nutrition 0.000 description 52
- 238000010586 diagram Methods 0.000 description 13
- 238000012423 maintenance Methods 0.000 description 7
- 230000002776 aggregation Effects 0.000 description 4
- 238000004220 aggregation Methods 0.000 description 4
- 238000013473 artificial intelligence Methods 0.000 description 4
- 238000004891 communication Methods 0.000 description 4
- 238000013499 data model Methods 0.000 description 4
- 238000013461 design Methods 0.000 description 4
- 238000007726 management method Methods 0.000 description 4
- 238000012546 transfer Methods 0.000 description 3
- 230000001960 triggered effect Effects 0.000 description 3
- 101001121408 Homo sapiens L-amino-acid oxidase Proteins 0.000 description 2
- 102100026388 L-amino-acid oxidase Human genes 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000004140 cleaning Methods 0.000 description 2
- 238000007405 data analysis Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000010354 integration Effects 0.000 description 2
- 238000005065 mining Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 239000013589 supplement Substances 0.000 description 2
- 230000001131 transforming effect Effects 0.000 description 2
- 238000011144 upstream manufacturing Methods 0.000 description 2
- 101000827703 Homo sapiens Polyphosphoinositide phosphatase Proteins 0.000 description 1
- 102100023591 Polyphosphoinositide phosphatase Human genes 0.000 description 1
- 239000008280 blood Substances 0.000 description 1
- 210000004369 blood Anatomy 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 239000000047 product Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000013068 supply chain management Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/248—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
本发明涉及大数据技术领域,提供一种数据仓库的数据处理方法、装置、设备、介质和程序产品,包括:响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系;基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。本发明通过数据仓库中ODS层自动同步业务源表且自动构建实体关系模型并向下层传递建模信息的方式,实现直观展示数据仓库模型信息的目的,无需人为参与,确保数据仓库中的建模信息完善更加高效和全面,方便使用者全方位了解数据资产,提升数据查询效率和数据使用效率。
The present invention relates to the field of big data technology, and provides a data processing method, device, equipment, medium and program product for a data warehouse, including: in response to a business update instruction, obtaining each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtaining the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table; constructing an entity relationship model based on the association relationship and the dependency relationship, and delivering the modeling information of the entity relationship model to the data warehouse layer by layer. The present invention achieves the purpose of intuitively displaying the data warehouse model information by automatically synchronizing the business source table at the ODS layer in the data warehouse, automatically constructing the entity relationship model and delivering the modeling information to the lower layer, without the need for human participation, ensuring that the modeling information in the data warehouse is more efficient and comprehensive, facilitating users to fully understand data assets, and improving data query efficiency and data usage efficiency.
Description
技术领域Technical Field
本发明涉及大数据技术领域,尤其涉及数据仓库的数据处理方法、装置、设备、介质和程序产品。The present invention relates to the field of big data technology, and in particular to a data processing method, device, equipment, medium and program product for a data warehouse.
背景技术Background Art
目前,数据仓库因其具备强大的数据集成、存储和分析能力而被广泛应用,但因其规范不严格或模型设计信息丢失等其它历史遗留问题,使得数据仓库中存在大量建模信息缺失的数据表,造成用户或数据工程师查询数据和使用数据的难度增大。为了填补丢失的建模信息,通常由数据工程师根据自身的业务知识和过往经验进行梳理,以重新设计出丢失的建模信息;整个数据维护过程不仅工作量大、难以持续迭代维护,而且设计标准主观固定、容易随数据工程师的流动而混乱,从而不能实现对数据仓库建模信息的高效且持续完善。At present, data warehouses are widely used because of their powerful data integration, storage and analysis capabilities. However, due to lax specifications or other historical problems such as loss of model design information, there are a large number of data tables with missing modeling information in the data warehouse, which makes it more difficult for users or data engineers to query and use data. In order to fill the missing modeling information, data engineers usually sort out the missing modeling information based on their own business knowledge and past experience to redesign the missing modeling information; the entire data maintenance process is not only labor-intensive and difficult to maintain continuously, but also the design standards are subjective and fixed, and easily confused with the flow of data engineers, thus failing to achieve efficient and continuous improvement of data warehouse modeling information.
发明内容Summary of the invention
本发明旨在至少解决相关技术中存在的技术问题之一。为此,本发明提出一种数据仓库的数据处理方法,实现直观展示数据仓库中所有表的关联关系和依赖关系等模型信息,整个数据维护过程无需人为参与,确保数据仓库中的建模信息完善更加高效和全面,方便使用者全方位了解数据资产,从而大幅提升了数据查询效率和数据使用效率。The present invention aims to solve at least one of the technical problems existing in the related art. To this end, the present invention proposes a data processing method for a data warehouse, which can intuitively display model information such as association relationships and dependency relationships of all tables in the data warehouse. The entire data maintenance process does not require human participation, ensuring that the modeling information in the data warehouse is more efficient and comprehensive, and facilitating users to fully understand data assets, thereby greatly improving data query efficiency and data usage efficiency.
本发明还提出一种数据仓库的数据处理装置。The present invention also provides a data processing device for a data warehouse.
本发明还提出一种电子设备。The invention also provides an electronic device.
本发明还提出一种非暂态计算机可读存储介质。The present invention also provides a non-transitory computer-readable storage medium.
本发明还提出一种计算机程序产品。The present invention also provides a computer program product.
根据本发明第一方面实施例的数据仓库的数据处理方法,包括:A data processing method for a data warehouse according to a first embodiment of the present invention includes:
响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与所述各业务源表对应的数据仓库目标表,并获取所述各业务源表之间的关联关系及所述各业务源表与所述数据仓库目标表之间的依赖关系;In response to the business update instruction, each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table are obtained, and the association relationship between the business source tables and the dependency relationship between the business source tables and the data warehouse target table are obtained;
基于所述关联关系和所述依赖关系构建实体关系模型,并向所述数据仓库逐层传递所述实体关系模型的建模信息。An entity relationship model is constructed based on the association relationship and the dependency relationship, and modeling information of the entity relationship model is transmitted layer by layer to the data warehouse.
根据本发明实施例的数据仓库的数据处理方法,响应业务更新指令时,首先获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系,再进一步基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。这样,在业务系统中业务数据发生变化的情况下,通过数据仓库中ODS层自动同步业务源表且自动构建实体关系模型并向下层传递建模信息的方式,实现直观展示数据仓库中所有表的关联关系和依赖关系等模型信息,整个数据维护过程无需人为参与,确保数据仓库中的建模信息完善更加高效和全面,方便使用者全方位了解数据资产,从而大幅提升了数据查询效率和数据使用效率。According to the data processing method of the data warehouse of the embodiment of the present invention, when responding to the business update instruction, first obtain each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtain the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table, and then further build an entity relationship model based on the association relationship and dependency relationship, and pass the modeling information of the entity relationship model to the data warehouse layer by layer. In this way, when the business data in the business system changes, the business source table is automatically synchronized by the ODS layer in the data warehouse, and the entity relationship model is automatically built and the modeling information is passed to the lower layer, so as to realize the intuitive display of the model information such as the association relationship and dependency relationship of all tables in the data warehouse, and the entire data maintenance process does not require human participation, ensuring that the modeling information in the data warehouse is improved more efficiently and comprehensively, and it is convenient for users to understand data assets in an all-round way, thereby greatly improving the efficiency of data query and data use.
根据本发明的一个实施例,所述基于所述关联关系和所述依赖关系构建实体关系模型,包括:According to an embodiment of the present invention, the step of constructing an entity relationship model based on the association relationship and the dependency relationship includes:
获取所述各业务源表各自的数据粒度和主题域;Obtaining the data granularity and subject domain of each of the business source tables;
基于所述关联关系和所述依赖关系,以及各所述数据粒度和各所述主题域,构建所述实体关系模型。The entity relationship model is constructed based on the association relationship and the dependency relationship, as well as each data granularity and each subject domain.
根据本发明的一个实施例,所述基于所述关联关系和所述依赖关系,以及各所述数据粒度和各所述主题域,构建所述实体关系模型,包括:According to an embodiment of the present invention, constructing the entity relationship model based on the association relationship and the dependency relationship, as well as each data granularity and each subject domain includes:
确定第一表集合,所述第一表集合中包括基于所述各业务源表确定的多个ODS表,每个所述ODS表分别标识对应的主题域和数据粒度以及对应的所述关联关系和/或对应的所述依赖关系;Determine a first table set, wherein the first table set includes a plurality of ODS tables determined based on the business source tables, each of the ODS tables respectively identifying a corresponding subject domain and data granularity and a corresponding association relationship and/or a corresponding dependency relationship;
将所述第一表集合中的所述ODS表移出,并确定以所述ODS表为源表的全部后继目标表;Remove the ODS table from the first table set, and determine all subsequent target tables with the ODS table as a source table;
确定以所述全部后继目标表中后继目标表为目标表的全部前继源表,并基于所述全部前继源表中两两交叉连接关系的查询结果以及所述实体关系模型,确定目标实体关系模型;Determine all predecessor source tables that use the successor target table as the target table in all the successor target tables, and determine the target entity relationship model based on the query results of the cross-connection relationships between each two in all the predecessor source tables and the entity relationship model;
基于所述后继目标表对所述第一表集合进行填充,再从所述全部后继目标表中选取下一个后继目标表,返回执行所述确定以所述全部后继目标表中后继目标表为目标表的全部前继源表的步骤;直至遍历完所述全部后继目标表时将对应得到的填充后的第一表集合作为新的第一表集合,以及将遍历完所述全部后继目标表时对应得到的所述目标实体关系模型作为新的实体关系模型,重复执行上述过程;直至所述第一表集合为空时将对应得到的所述目标实体关系模型确定为构建好的所述实体关系模型。The first table set is filled based on the successor target table, and then the next successor target table is selected from all the successor target tables, and the step of determining all the predecessor source tables with the successor target table in all the successor target tables as the target table is returned to execute; until all the successor target tables are traversed, the corresponding filled first table set is used as the new first table set, and the corresponding target entity relationship model obtained when all the successor target tables are traversed is used as the new entity relationship model, and the above process is repeated; until the first table set is empty, the corresponding target entity relationship model is determined as the constructed entity relationship model.
根据本发明的一个实施例,所述确定以所述全部后继目标表中后继目标表为目标表的全部前继源表,包括:According to one embodiment of the present invention, determining all predecessor source tables that use the successor target table in all the successor target tables as the target table includes:
在将所述第一表集合中的所述ODS表移出至第二表集合中的情况下,若所述后继目标表同时未存在于所述第一表集合和所述第二表集合中,则从所述实体关系模型中确定以所述后继目标表为目标表的全部前继源表。When the ODS table in the first table set is moved out to the second table set, if the successor target table does not exist in both the first table set and the second table set, all predecessor source tables with the successor target table as the target table are determined from the entity relationship model.
根据本发明的一个实施例,所述从所述实体关系模型中确定以所述后继目标表为目标表的全部前继源表,包括:According to an embodiment of the present invention, determining all predecessor source tables with the successor target table as the target table from the entity relationship model includes:
在所述实体关系模型中所述ODS表为所述后继目标表的主表的情况下,将所述ODS表的所述数据粒度和所述主题域复制到所述后继目标表中,并基于复制结果确定以所述后继目标表为目标表的全部前继源表;在所述ODS表为所述后继目标表的非主表的情况下,确定以所述后继目标表为目标表的全部前继源表。In the case where the ODS table in the entity relationship model is the primary table of the successor target table, the data granularity and the subject domain of the ODS table are copied to the successor target table, and all predecessor source tables with the successor target table as the target table are determined based on the copy result; in the case where the ODS table is a non-primary table of the successor target table, all predecessor source tables with the successor target table as the target table are determined.
根据本发明的一个实施例,所述方法还包括:According to one embodiment of the present invention, the method further includes:
在所述ODS表为所述后继目标表的主表的情况下,将所述ODS表的所述数据粒度和所述主题域,以及所述ODS表的业务分类信息均复制到所述后继目标表中。In the case where the ODS table is the master table of the subsequent target table, the data granularity and the subject domain of the ODS table, as well as the business classification information of the ODS table are copied to the subsequent target table.
根据本发明的一个实施例,所述基于所述全部前继源表中两两交叉连接关系的查询结果以及所述实体关系模型,确定目标实体关系模型,包括:According to an embodiment of the present invention, determining the target entity relationship model based on the query results of the pairwise cross-connection relationships in all the predecessor and successor source tables and the entity relationship model includes:
在各所述全部前继源表中查询到所述两两交叉连接关系的情况下,基于所述两两交叉连接关系对所述实体关系模型进行更新,得到所述目标实体关系模型;When the pairwise cross-connection relationship is found in all the predecessor and successor source tables, the entity relationship model is updated based on the pairwise cross-connection relationship to obtain the target entity relationship model;
在各所述全部前继源表中未查询到所述两两交叉连接关系的情况下,将所述实体关系模型确定为所述目标实体关系模型。When the pairwise cross-connection relationship is not found in all the predecessor source tables, the entity relationship model is determined as the target entity relationship model.
根据本发明第二方面实施例的数据仓库的数据处理装置,包括:A data processing device for a data warehouse according to an embodiment of the second aspect of the present invention includes:
关系获取单元,用于响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与所述各业务源表对应的数据仓库目标表,并获取所述各业务源表之间的关联关系及所述各业务源表与所述数据仓库目标表之间的依赖关系;A relationship acquisition unit, configured to acquire, in response to a business update instruction, each business source table synchronized to the ODS layer in the data warehouse and a data warehouse target table corresponding to each business source table, and acquire an association relationship between each business source table and a dependency relationship between each business source table and the data warehouse target table;
模型构建单元,用于基于所述关联关系和所述依赖关系构建实体关系模型,并向所述数据仓库逐层传递所述实体关系模型的建模信息。A model building unit is used to build an entity relationship model based on the association relationship and the dependency relationship, and transmit modeling information of the entity relationship model to the data warehouse layer by layer.
根据本发明实施例的数据仓库的数据处理装置,响应业务更新指令时,首先获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系,再进一步基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。这样,在业务系统中业务数据发生变化的情况下,通过数据仓库中ODS层自动同步业务源表且自动构建实体关系模型并向下层传递建模信息的方式,实现直观展示数据仓库中所有表的关联关系和依赖关系等模型信息,整个数据维护过程无需人为参与,确保数据仓库中的建模信息完善更加高效和全面,方便使用者全方位了解数据资产,从而大幅提升了数据查询效率和数据使用效率。According to the data processing device of the data warehouse of the embodiment of the present invention, when responding to the business update instruction, it first obtains each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtains the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table, and then further builds an entity relationship model based on the association relationship and dependency relationship, and transmits the modeling information of the entity relationship model to the data warehouse layer by layer. In this way, when the business data in the business system changes, the business source table is automatically synchronized by the ODS layer in the data warehouse, and the entity relationship model is automatically constructed and the modeling information is transmitted to the lower layer, so as to realize the intuitive display of the model information such as the association relationship and dependency relationship of all tables in the data warehouse. The entire data maintenance process does not require human participation, ensuring that the modeling information in the data warehouse is improved more efficiently and comprehensively, and it is convenient for users to understand data assets in an all-round way, thereby greatly improving the efficiency of data query and data use.
本发明实施例中的上述一个或多个技术方案,至少具有如下技术效果之一:响应业务更新指令时,首先获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系,再进一步基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。这样,在业务系统中业务数据发生变化的情况下,通过数据仓库中ODS层自动同步业务源表且自动构建实体关系模型并向下层传递建模信息的方式,实现直观展示数据仓库中所有表的关联关系和依赖关系等模型信息,整个数据维护过程无需人为参与,确保数据仓库中的建模信息完善更加高效和全面,方便使用者全方位了解数据资产,从而大幅提升了数据查询效率和数据使用效率。The above one or more technical solutions in the embodiments of the present invention have at least one of the following technical effects: when responding to a business update instruction, first obtain each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtain the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table, and then further build an entity relationship model based on the association relationship and dependency relationship, and pass the modeling information of the entity relationship model to the data warehouse layer by layer. In this way, when the business data in the business system changes, the business source table is automatically synchronized by the ODS layer in the data warehouse, and the entity relationship model is automatically built and the modeling information is passed to the lower layer, so as to realize the intuitive display of the model information such as the association relationship and dependency relationship of all tables in the data warehouse. The entire data maintenance process does not require human participation, ensuring that the modeling information in the data warehouse is more efficient and comprehensive, and it is convenient for users to understand data assets in all aspects, thereby greatly improving data query efficiency and data usage efficiency.
进一步的,通过主题域自动划分和数据仓库表粒度自动识别的方式完善数据仓库中主题和数据粒度信息,不仅能够有效补充数据模型信息,还能起到检查数据仓库模型规范程度的作用,且由程序自动执行,避免建模人员经验差异导致的标准混乱,能够广泛应用于数据仓库和分析挖掘等领域。Furthermore, the subject and data granularity information in the data warehouse can be improved by automatically dividing the subject domain and automatically identifying the granularity of the data warehouse table. This can not only effectively supplement the data model information, but also play a role in checking the degree of standardization of the data warehouse model. It is automatically executed by the program to avoid standard confusion caused by differences in experience among modelers. It can be widely used in fields such as data warehouses and analytical mining.
更进一步的,通过对数据仓库表的业务领域和数据粒度进行自动识别,方便业务人员根据业务领域快速定位所需数据,帮助开发人员减少探索数据粒度等工作量。Furthermore, by automatically identifying the business domain and data granularity of the data warehouse table, it is convenient for business personnel to quickly locate the required data according to the business domain, helping developers reduce the workload of exploring data granularity.
本发明的附加方面和优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本发明的实践了解到。Additional aspects and advantages of the present invention will be given in part in the following description and in part will be obvious from the following description, or will be learned through practice of the present invention.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本发明实施例或相关技术中的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention or related technologies, the drawings required for use in the embodiments or related technical descriptions are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.
图1是本发明实施例提供的数据仓库的数据处理方法的流程示意图之一。FIG. 1 is a flowchart of a data processing method for a data warehouse provided by an embodiment of the present invention.
图2是本发明实施例提供的数据仓库的数据处理方法的流程示意图之二。FIG. 2 is a second flowchart of the data processing method for a data warehouse provided in an embodiment of the present invention.
图3是本发明实施例提供的数据仓库的数据处理方法的流程示意图之三。FIG. 3 is a third flowchart of the data processing method for a data warehouse provided in an embodiment of the present invention.
图4是本发明实施例提供的数据仓库的数据处理装置的结构示意图。FIG. 4 is a schematic diagram of the structure of a data processing device for a data warehouse provided in an embodiment of the present invention.
图5是本发明实施例提供的电子设备的结构示意图。FIG. 5 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present invention.
具体实施方式DETAILED DESCRIPTION
为使本发明的目的、技术方案和优点更加清楚,下面将结合本发明中的附图,对本发明中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make the purpose, technical solution and advantages of the present invention clearer, the technical solution of the present invention will be clearly and completely described below in conjunction with the drawings of the present invention. Obviously, the described embodiments are part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.
在本发明的实施例中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。在本发明的文字描述中,字符“/”一般表示前后关联对象是一种“或”的关系。此外,需要说明的是,本发明中为描述的对象所编序号本身,例如“第一”、“第二”等,仅用于区分所描述的对象,不具有任何顺序或技术含义。In the embodiments of the present invention, "at least one" refers to one or more, and "more than one" refers to two or more. "And/or" describes the association relationship of associated objects, indicating that three relationships may exist. For example, A and/or B may represent: A exists alone, A and B exist at the same time, and B exists alone, where A and B may be singular or plural. In the textual description of the present invention, the character "/" generally indicates that the previous and next associated objects are in an "or" relationship. In addition, it should be noted that the serial numbers themselves, such as "first", "second", etc., used to distinguish the objects described in the present invention are only used to distinguish the objects described, and do not have any order or technical meaning.
下面对本发明中涉及的相关专业名词术语进行解释说明:The following is an explanation of the relevant professional terms involved in the present invention:
数据模型:用于描述数据仓库中数据的结构和关系,包含表的实体属性,关联和依赖关系等。Data model: used to describe the structure and relationship of data in the data warehouse, including entity attributes, associations and dependencies of tables, etc.
数据域:从业务数据的角度对业务过程进行自下而上的归纳、抽象,划分出来的数据集合。Data domain: A data set that is divided by summarizing and abstracting the business process from the bottom up from the perspective of business data.
主题域:从数据分析应用角度将业务系统中的数据进行自上而下的综合、归纳和分析利用的一种抽象概念,有多种划分方法,若按照业务系统划分则与数据域的意义相同。Subject domain: An abstract concept that integrates, summarizes, and analyzes the data in the business system from top to bottom from the perspective of data analysis application. There are many ways to divide it. If it is divided according to the business system, it has the same meaning as the data domain.
数据粒度:数据所描述的现象或事件的具体程度或范围。Data granularity: The degree of detail or scope of the phenomenon or event described by the data.
ODS(Operational Data Store)操作数据存储:数据仓库用于存储业务操作性数据的临时区域,其表字段与业务侧同构,作为数据仓库表依赖层级关系的最上游(即:均为起点表)。ODS (Operational Data Store) is a temporary area used by the data warehouse to store business operational data. Its table fields are isomorphic to the business side and are the most upstream of the data warehouse table dependency hierarchy (i.e., all are starting tables).
schema:表示数据库中的结构,包括表、列、数据类型等描述及主键和唯一键等约束信息。Schema: represents the structure in the database, including descriptions of tables, columns, data types, and constraint information such as primary keys and unique keys.
sqoop.job.data.publish.class:一种Sqoop组件的hook插件,在Mysql/Oracle等业务数据抽取到数据仓库时触发,可从中获取所同步数据的业务源表和数据仓库ODS目标表信息。sqoop.job.data.publish.class: A hook plug-in of the Sqoop component, which is triggered when business data such as MySQL/Oracle is extracted to the data warehouse, and can obtain the business source table of the synchronized data and the data warehouse ODS target table information.
hive.exec.post.hooks:针对Hive组件的hook插件,在ETL脚本执行成功后触发,可从中获取Hive SQL的表关联关系和依赖关系等信息。hive.exec.post.hooks: A hook plug-in for Hive components. It is triggered after the ETL script is successfully executed. It can be used to obtain information such as the table association and dependency of Hive SQL.
spark.sql.queryExecutionListeners:针对Spark组件的hook插件,在ETL脚本执行成功后触发,可从中获取Spark SQL的表关联关系和依赖关系等信息。spark.sql.queryExecutionListeners: A hook plug-in for Spark components. It is triggered after the ETL script is successfully executed, from which information such as Spark SQL table associations and dependencies can be obtained.
表的关联关系和依赖关系:从ETL代码中抽取的数据仓库表的关系,关联关系表示ETL源表之间的JOIN关系,依赖关系表示源表到目标表之间的血缘关系,比如ETL逻辑。Table associations and dependencies: The relationships between data warehouse tables extracted from the ETL code. Associations represent JOIN relationships between ETL source tables, and dependencies represent blood relationships between source tables and target tables, such as ETL logic.
示例性的,ETL逻辑为:INSERT OVERWRITE TABLE t SELECT s1.field,s2.col,s3.value FROM s1 JOIN s2 ON s1.id=s2.key LEFT JOIN s3 ON s2.fk=s3.name;则可以抽取到的依赖关系为::(s1,t)、(s2,t),(s3,t),其中s1,s2,s3为源表,t1为目标表;关联关系为:<s1,JOIN:id=key,s2>和<s2,LEFT JOIN:fk=name,s3>,d=key和fk=name为关联键。For example, the ETL logic is: INSERT OVERWRITE TABLE t SELECT s1.field,s2.col,s3.value FROM s1 JOIN s2 ON s1.id=s2.key LEFT JOIN s3 ON s2.fk=s3.name; the extracted dependencies are: (s1,t), (s2,t), (s3,t), where s1, s2, s3 are source tables and t1 is the target table; the association relationships are: <s1,JOIN:id=key,s2> and <s2,LEFT JOIN:fk=name,s3>, where d=key and fk=name are association keys.
在大数据技术领域,随着企业中业务数据量的不断增大,使用数据仓库这一专门用于集成、存储和管理企业各类数据的系统,可以确保用户轻松进行复杂数据查询和数据分析操作。但是数据仓库在具体应用过程中也存在历史遗留问题,如数据仓库因其规范不严格或模型设计信息丢失等其它历史遗留问题,使得数据仓库中存在大量建模信息缺失的数据表,造成用户或数据工程师查询数据和使用数据的难度增大。In the field of big data technology, as the amount of business data in enterprises continues to increase, the use of data warehouses, a system dedicated to integrating, storing and managing various types of enterprise data, can ensure that users can easily perform complex data queries and data analysis operations. However, data warehouses also have historical problems in the specific application process. For example, due to the lack of strict specifications or other historical problems such as the loss of model design information, there are a large number of data tables with missing modeling information in the data warehouse, which makes it more difficult for users or data engineers to query and use data.
为了填补丢失的建模信息,通常由数据工程师根据自身的业务知识和过往经验进行梳理,以重新设计出丢失的建模信息;整个数据维护过程不仅工作量大、难以持续迭代维护,而且设计标准主观固定、容易随数据工程师的流动而混乱,从而不能实现对数据仓库建模信息的高效且持续。因此,如何对数据仓库中各类数据进行高效维护就显得尤为重要。In order to fill in the missing modeling information, data engineers usually sort out the information based on their own business knowledge and past experience to redesign the missing modeling information; the entire data maintenance process is not only labor-intensive and difficult to maintain continuously and iteratively, but also the design standards are subjective and fixed, and easily confused with the flow of data engineers, thus failing to achieve efficient and continuous modeling of data warehouse information. Therefore, how to efficiently maintain various types of data in the data warehouse is particularly important.
为解决上述技术问题,本发明提供一种数据仓库的数据处理方法、装置、设备、介质和程序产品。下面结合图1-图5描述本发明提供的数据仓库的数据处理方法、装置、设备、介质和程序产品,其中数据仓库的数据处理方法的执行主体为数据仓库。进一步的,数据处理方法还可以应用于设置在数据仓库中的数据处理装置中,该数据处理装置可以通过软件、硬件或者两者的结合来实现。下面以该数据处理方法的执行主体为数据仓库为例,对该数据处理方法进行描述。In order to solve the above technical problems, the present invention provides a data processing method, device, equipment, medium and program product for a data warehouse. The data processing method, device, equipment, medium and program product for a data warehouse provided by the present invention are described below in conjunction with Figures 1 to 5, wherein the execution subject of the data processing method for a data warehouse is a data warehouse. Furthermore, the data processing method can also be applied to a data processing device arranged in a data warehouse, and the data processing device can be implemented by software, hardware or a combination of the two. The data processing method is described below by taking the execution subject of the data processing method as a data warehouse as an example.
为了便于理解本发明实施例提供的数据仓库的数据处理方法,下面,将通过下述几个示例地实施例对本发明提供的数据仓库的数据处理方法进行详细地说明。可以理解的是,下面这几个示例地实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例不再赘述。In order to facilitate understanding of the data processing method of the data warehouse provided by the embodiment of the present invention, the data processing method of the data warehouse provided by the present invention will be described in detail below through the following exemplary embodiments. It can be understood that the following exemplary embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
参照图1,为本发明实施例提供的数据仓库的数据处理方法的流程示意图,如图1所示,该数据处理方法包括如下步骤110和步骤120。1 , which is a flow chart of a data processing method for a data warehouse provided in an embodiment of the present invention, as shown in FIG1 , the data processing method includes the following steps 110 and 120 .
步骤110、响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系。Step 110, in response to the business update instruction, obtain each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtain the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table.
其中,业务更新指令是某个业务系统中的业务数据发生变化时自动生成的指令。关联关系和依赖关系的数量均至少为1个。The business update instruction is an instruction automatically generated when the business data in a business system changes. The number of association relationships and dependency relationships is at least 1.
每个关联关系具体可以表征多个业务源表中至少两个业务源表之间具备关联关系;每个依赖关系具体可以表征数据仓库目标表与多个业务源表中对应至少两个业务源表之间具备依赖关系,如多个业务源表中对应至少两个业务源表经过加工和清洗等其它处理后生成对应的数据仓库目标表。Each association relationship can specifically represent that at least two business source tables in multiple business source tables have an association relationship; each dependency relationship can specifically represent that a data warehouse target table has a dependency relationship with at least two corresponding business source tables in multiple business source tables, such as generating a corresponding data warehouse target table after processing, cleaning and other processing such as at least two corresponding business source tables in multiple business source tables.
企业可以根据其具体需求和业务范围通常设置多个业务系统,多个业务系统可以包括但不限定人力资源管理系统、财务管理系统、供应链管理系统、客户关系管理系统、销售营销系统和信贷管理系统等其它业务。Enterprises can usually set up multiple business systems according to their specific needs and business scope. Multiple business systems may include but are not limited to human resource management systems, financial management systems, supply chain management systems, customer relationship management systems, sales and marketing systems, credit management systems and other businesses.
数据仓库自上而下包括有操作数据存储(Operational Data Store,ODS)层、数据明细(DWD)层、主题汇总(DWS)层和数据应用(ADM)层。ODS层在数据仓库中,用于存储业务操作性数据的临时区域,其表字段与业务侧同构,作为数据仓库表依赖层级关系的最上游(即:均为起点表)。The data warehouse includes the operational data store (ODS) layer, data detail (DWD) layer, subject summary (DWS) layer and data application (ADM) layer from top to bottom. The ODS layer is a temporary area used to store business operational data in the data warehouse. Its table fields are isomorphic with the business side and are the upstream of the data warehouse table dependency hierarchy (i.e., they are all starting tables).
具体的,数据仓库中预先设置有大数据同步插件,此大数据同步插件用于监听多个业务系统中发生变化的业务数据同步到数据仓库的数据同步事件;数据仓库通过利用数据同步插件监听到数据同步事件时,可以认为自动生成并响应了业务更新指令,此时数据仓库可以基于数据同步插件中预先配置的信息获取方法,从该数据同步事件中获取发生变化的业务数据(也即业务更新数据)同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表。大数据同步插件是一种将业务数据抽到数据仓库的组件,如Sqoop,SeaTunnel和DataX。Specifically, a big data synchronization plug-in is pre-set in the data warehouse. This big data synchronization plug-in is used to monitor the data synchronization events of the business data changed in multiple business systems and synchronize them to the data warehouse; when the data warehouse monitors the data synchronization events by using the data synchronization plug-in, it can be considered that the business update instruction is automatically generated and responded to. At this time, the data warehouse can obtain the changed business data (that is, business update data) from the data synchronization event based on the information acquisition method pre-configured in the data synchronization plug-in and synchronize it to each business source table of the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table. The big data synchronization plug-in is a component that extracts business data to the data warehouse, such as Sqoop, SeaTunnel and DataX.
需要说明的是,本发明可以通过数据仓库ETL引擎(如Hive SQL或Spark SQL)的代码解析出每个关联关系和每个依赖关系,ETL(Extract-Transform-Load):从来源端经过抽取(extract)、转换(transform)、加载(load)至目标端的过程,一般由Hive SQL或SparkSQL定义其计算逻辑,数据仓库中除ODS层外其余各层数据基本是通过ETL加工得来的。It should be noted that the present invention can parse each association relationship and each dependency relationship through the code of the data warehouse ETL engine (such as Hive SQL or Spark SQL). ETL (Extract-Transform-Load) is a process of extracting, transforming, and loading from the source end to the target end. Its calculation logic is generally defined by Hive SQL or SparkSQL. Except for the ODS layer, the data of other layers in the data warehouse are basically obtained through ETL processing.
例如,开发如图2所示的ETL过程监听器,以ETL执行引擎为Hive SQL为例,首先扩展:For example, to develop the ETL process listener shown in Figure 2, take Hive SQL as the ETL execution engine as an example, first expand:
org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext接口,在run方法中监听数据仓库ETL事件,从HookContext参数中获取同步到数据仓库的数据同步事件,从db.connect.string参数中获取Hive SQL执行计划,并从Hive SQL执行计划中解析HiveSQL脚本中的业务源表和数据仓库目标表之间的依赖关系、各业务源表中左表和右表的关连键(也即左表和右表之间的关联关系,如JOIN ON)以及子查询块的聚合键(GROUP BY);以此完成图2中采集模块的采集功能,也即采集模块不仅可以通过数据同步hook插件获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表等元数据,还可以通过数据ETL过程监听器获取到到达ODS层的所有ODS表之间的关联键、聚合键和表依赖关系。org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface, listen to data warehouse ETL events in the run method, obtain data synchronization events synchronized to the data warehouse from the HookContext parameter, obtain the Hive SQL execution plan from the db.connect.string parameter, and parse the dependency between the business source table and the data warehouse target table in the HiveSQL script from the Hive SQL execution plan, the association key between the left table and the right table in each business source table (that is, the association relationship between the left table and the right table, such as JOIN ON), and the aggregation key (GROUP BY) of the subquery block; in this way, the collection function of the collection module in Figure 2 is completed, that is, the collection module can not only obtain the metadata such as each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table through the data synchronization hook plug-in, but also obtain the association key, aggregation key and table dependency between all ODS tables reaching the ODS layer through the data ETL process listener.
此时,数据仓库将ETL过程中涉及的表的关联键和表之间的依赖关系等关系保存到元数据库中,以便于数据仓库中的建模模块构建实体关系图,其中数据仓库中表关联关系如表1所示,数据仓库中表依赖关系如表2所示;以此完成图2中建模模块的建模功能,也即,建模模块用于根据数据表依赖关系,从ODS表出发,不断驱动业务源表的后继目标表生成主要包括主题域和数据粒度,以及ER图等的建模信息。At this time, the data warehouse saves the association keys of the tables involved in the ETL process and the dependencies between the tables into the metadata database, so that the modeling module in the data warehouse can build an entity relationship diagram, where the table association relationships in the data warehouse are shown in Table 1, and the table dependencies in the data warehouse are shown in Table 2; this completes the modeling function of the modeling module in Figure 2, that is, the modeling module is used to continuously drive the subsequent target table of the business source table to generate modeling information mainly including subject domains and data granularity, as well as ER diagrams, based on the data table dependencies, starting from the ODS table.
表1Table 1
表2Table 2
示例性的,上述图2所示数据ETL过程监听器的ETL计算引擎为其他非Hive SQL组件时,可换成对应组件的插件,例如使用Spark SQL时则可以开发spark.sql.queryExecutionListeners插件:For example, when the ETL computing engine of the data ETL process listener shown in FIG. 2 is other non-Hive SQL components, it can be replaced with a plug-in of the corresponding component. For example, when Spark SQL is used, the spark.sql.queryExecutionListeners plug-in can be developed:
扩展org.apache.spark.sql.util.QueryExecutionListener接口,在onSuccess方法中监听数据仓库ETL事件,从QueryExecution参数中获取Spark SQL执行计划,从执行计划中解析SQL中的源表和目标表依赖关系、左表和右表的关联键(如:JOIN ON),以及子查询块的聚合键(如:GROUP BY)。Extend the org.apache.spark.sql.util.QueryExecutionListener interface, listen to data warehouse ETL events in the onSuccess method, obtain the Spark SQL execution plan from the QueryExecution parameter, and parse the source and target table dependencies, the association keys of the left and right tables (such as JOIN ON), and the aggregation keys of the subquery block (such as GROUP BY) in the SQL from the execution plan.
需要说明的是,数据仓库目标表的数量至少为1个,每个数据仓库目标表中的业务数据是由多个业务源表中至少两个相互关联业务源表各自的业务数据经过加工和/或清洗等操作后生成的;通过分析所有业务源表之间的关系,得到各业务源表之间的至少1个关联关系;通过分析所有业务源表与和至少1个数据仓库目标表之间的关系,得到各业务源表与对应数据仓库目标表之间的至少1个依赖关系。It should be noted that the number of data warehouse target tables is at least one, and the business data in each data warehouse target table is generated after processing and/or cleaning of the business data of at least two interrelated business source tables in multiple business source tables; by analyzing the relationship between all business source tables, at least one association relationship between the business source tables is obtained; by analyzing the relationship between all business source tables and at least one data warehouse target table, at least one dependency relationship between each business source table and the corresponding data warehouse target table is obtained.
步骤120、基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。Step 120: construct an entity relationship model based on the association relationship and the dependency relationship, and transfer the modeling information of the entity relationship model to the data warehouse layer by layer.
具体的,数据仓库可以将同步至ODS层的各业务源表和数据仓库目标表均作为实体点、将每个业务源表和数据仓库目标表各自的不同级别主题和预设键字段信息均作为属性、将各业务源表之间具备的所有关联关系以及将各业务源表和对应数据仓库目标表之间的依赖关系均作为关系边构建实体关系模型,也即构建实体关系(Entity-Relationship,ER)。Specifically, the data warehouse can use each business source table and data warehouse target table synchronized to the ODS layer as entity points, each business source table and data warehouse target table's respective different levels of subject and preset key field information as attributes, all the associations between the business source tables and the dependencies between the business source tables and the corresponding data warehouse target tables as relationship edges to build an entity relationship model, that is, to build an entity relationship (Entity-Relationship, ER).
示例性的,数据仓库中其它层的主题域和数据粒度可以根据依赖关系由前继源表传递给后继目标表,在传递到当前目标表的情况下将其对应的所有源表的关联关系构建关系边三元组,并以构建的关系边三元组更新实体关系模型;以此完成向数据仓库逐层传递实体关系模型的建模信息的目的。Exemplarily, the subject domains and data granularity of other layers in the data warehouse can be transferred from the predecessor source table to the successor target table according to the dependency relationship. When transferred to the current target table, the association relationships of all corresponding source tables are constructed into relationship edge triples, and the entity relationship model is updated with the constructed relationship edge triples; thereby achieving the purpose of transferring the modeling information of the entity relationship model to the data warehouse layer by layer.
本发明实施例提供的数据仓库的数据处理方法,数据仓库响应业务更新指令时,首先获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系,再进一步基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。这样,在业务系统中业务数据发生变化的情况下,通过数据仓库中ODS层自动同步业务源表且自动构建实体关系模型并向下层传递建模信息的方式,实现直观展示数据仓库中所有表的关联关系和依赖关系等模型信息,整个数据维护过程无需人为参与,确保数据仓库中的建模信息完善更加高效和全面,方便使用者全方位了解数据资产,从而大幅提升了数据查询效率和数据使用效率。The data processing method of the data warehouse provided by the embodiment of the present invention, when the data warehouse responds to the business update instruction, first obtains each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtains the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table, and then further builds an entity relationship model based on the association relationship and dependency relationship, and transmits the modeling information of the entity relationship model to the data warehouse layer by layer. In this way, when the business data in the business system changes, the business source table is automatically synchronized by the ODS layer in the data warehouse, and the entity relationship model is automatically constructed and the modeling information is transmitted to the lower layer, so as to realize the intuitive display of the model information such as the association relationship and dependency relationship of all tables in the data warehouse, and the entire data maintenance process does not require human participation, ensuring that the modeling information in the data warehouse is improved more efficiently and comprehensively, and it is convenient for users to understand data assets in an all-round way, thereby greatly improving the efficiency of data query and data use.
基于上述图1所示的数据处理方法,在一种示例实施例中,步骤120中基于关联关系和依赖关系构建实体关系模型可以通过下述步骤实现。Based on the data processing method shown in FIG. 1 , in an exemplary embodiment, constructing an entity relationship model based on association relationships and dependency relationships in step 120 can be implemented through the following steps.
首先,获取各业务源表各自的数据粒度和主题域;然后,基于关联关系和依赖关系,以及各数据粒度和各主题域,构建实体关系模型。First, the data granularity and subject domain of each business source table are obtained; then, based on the association and dependency relationships, as well as the data granularity and subject domain, an entity relationship model is constructed.
具体的,数据仓库在获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系的同时,还可以获取每个业务源表的数据粒度和主题域。Specifically, the data warehouse can obtain the data granularity and subject domain of each business source table while obtaining the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table.
每个主题域可以为从对应业务源表的业务数据角度对业务过程进行自下而上的归纳、抽象,划分出来的数据主题集合,通常可以根据业务类别、数据来源和数据用途等多个维度,对企业的业务数据进行区域划分,将同类型数据分别归类,以此得到联系紧密的数据主题集合。Each subject domain can be a bottom-up summary and abstraction of the business process from the perspective of the business data of the corresponding business source table. The resulting data subject set can usually be used to divide the enterprise's business data into regions based on multiple dimensions such as business category, data source, and data usage, and classify the same type of data separately to obtain a closely related data subject set.
每个数据粒度可以为对应业务源表中数据所描述的现象或事件的具体程度或范围,例如对应业务源表的主键或唯一键。Each data granularity may be a specific degree or scope of a phenomenon or event described by the data in the corresponding business source table, such as a primary key or a unique key of the corresponding business source table.
数据仓库可以将同步至ODS层的各业务源表和数据仓库目标表均作为实体点、将每个业务源表各自的数据粒度和主题域均作为属性、将各业务源表之间具备的所有关联关系以及将各业务源表和对应数据仓库目标表之间的依赖关系均作为关系边构建实体关系模型。The data warehouse can use each business source table and data warehouse target table synchronized to the ODS layer as entity points, the data granularity and subject domain of each business source table as attributes, all associations between the business source tables, and the dependency relationships between each business source table and the corresponding data warehouse target table as relationship edges to build an entity relationship model.
需要说明的是,本发明可以通过数据仓库ETL引擎(如Hive SQL或Spark SQL)的代码或hook插件中解析出每个关联关系和每个依赖关系,ETL(Extract-Transform-Load):从来源端经过抽取(extract)、转换(transform)、加载(load)至目标端的过程,一般由HiveSQL或Spark SQL定义其计算逻辑,数据仓库中除ODS层外其余各层数据基本是通过ETL加工得来的。It should be noted that the present invention can parse out each association relationship and each dependency relationship through the code or hook plug-in of the data warehouse ETL engine (such as Hive SQL or Spark SQL). ETL (Extract-Transform-Load): the process of extracting, transforming, and loading from the source end to the target end. Its calculation logic is generally defined by HiveSQL or Spark SQL. Except for the ODS layer, the data of other layers in the data warehouse are basically obtained through ETL processing.
示例性的,开发如图2所示的数据同步hook插件,以大数据同步工具为Sqoop为例,首先扩展org.apache.sqoop.SqoopJobDataPublisher接口,在publish方法中监听业务更新数据同步到数据仓库的数据同步事件,从db.connect.string参数中获取源系统业务库jdbc连接串,从db.query参数提取所同步业务更新数据的初始业务源表,从hive.database.name和hive.table.name参数中获取数据仓库的ODS层中每个ODS表的库名和表名;此时,数据仓库可以根据源系统业务库jdbc连接串和每个初始业务源表的名称,查询其所属业务系统名称或数据库转码名称,作为数据仓库中ODS层对应ODS表的一级主题,将每个初始业务源表的业务源表评论意见作为对应ODS表的二级主题,将每个初始业务源表的主键或唯一键作为对应ODS表的数据粒度。此处数据仓库的ODS层中所有ODS表包括业务更新数据同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表。Exemplarily, a data synchronization hook plug-in as shown in Figure 2 is developed. Taking the big data synchronization tool Sqoop as an example, the org.apache.sqoop.SqoopJobDataPublisher interface is first extended, and the data synchronization events of business update data synchronization to the data warehouse are listened to in the publish method. The source system business library jdbc connection string is obtained from the db.connect.string parameter, the initial business source table of the synchronized business update data is extracted from the db.query parameter, and the library name and table name of each ODS table in the ODS layer of the data warehouse are obtained from the hive.database.name and hive.table.name parameters. At this time, the data warehouse can query the business system name or database transcoding name to which it belongs based on the source system business library jdbc connection string and the name of each initial business source table, as the first-level subject of the corresponding ODS table in the ODS layer of the data warehouse, and the business source table comments of each initial business source table are used as the second-level subject of the corresponding ODS table, and the primary key or unique key of each initial business source table is used as the data granularity of the corresponding ODS table. Here, all ODS tables in the ODS layer of the data warehouse include business update data synchronized to each business source table of the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table.
可以理解的是,根据业务源表的数据库连接信息查询对应业务表的schema,含:业务主键和唯一键(作为数据粒度),评论或者意见(即:业务表中文名称,作为数据二级数据域),业务源表所属业务系统(系统编号转换为中文名称)作为一级数据域。It can be understood that the schema of the corresponding business table is queried according to the database connection information of the business source table, including: business primary key and unique key (as data granularity), comments or opinions (that is, the Chinese name of the business table, as the secondary data field of the data), and the business system to which the business source table belongs (the system number is converted into the Chinese name) as the primary data field.
对于hook插件需要说明的是,本发明提供的hook插件可将同步到数据仓库的业务数据所属系统/库,表/字段,主键/唯一键等信息自动集成到对应的数据仓库表,作为数据域划分和粒度识别的原始资料,从而提高数据域和业务领域的相关度以及数据粒度识别的准确度。What needs to be explained about the hook plug-in is that the hook plug-in provided by the present invention can automatically integrate the system/library, table/field, primary key/unique key and other information of the business data synchronized to the data warehouse into the corresponding data warehouse table as the original data for data domain division and granularity identification, thereby improving the relevance of the data domain and the business field and the accuracy of data granularity identification.
此时,数据仓库可以将数据仓库的ODS层中所有ODS表各自的主题域和数据粒度保存到元数据库中,以便于数据仓库中的建模模块向后继表传递模型信息,其中数据仓库中ODS表的主题域和数据粒度可以如表3所示。At this time, the data warehouse can save the subject domain and data granularity of all ODS tables in the ODS layer of the data warehouse into the metadata database to facilitate the modeling module in the data warehouse to pass model information to subsequent tables, where the subject domain and data granularity of the ODS table in the data warehouse can be shown in Table 3.
表3Table 3
需要说明的是,若上述图2所示数据同步hook插件的数据集成工具为其他非Sqoop组件时,可换成对应组件的hook插件,如:当使用DataX同步业务更新数据时则可以开发com.alibaba.datax.common.spi.Hook插件,当使用SeaTunnel同步数据时则可以开发Spark监听器或自定义插件解析source和sink配置。It should be noted that if the data integration tool of the data synchronization hook plug-in shown in Figure 2 is other non-Sqoop components, it can be replaced with the hook plug-in of the corresponding component. For example, when using DataX synchronization service to update data, you can develop the com.alibaba.datax.common.spi.Hook plug-in; when using SeaTunnel to synchronize data, you can develop a Spark listener or a custom plug-in to parse the source and sink configurations.
本发明提供的数据仓库的数据处理方法,通过主题域自动划分和数据仓库表粒度自动识别的方式完善数据仓库中主题和数据粒度信息,不仅能够有效补充数据模型信息,还能起到检查数据仓库模型规范程度的作用,且由程序自动执行,避免建模人员经验差异导致的标准混乱,能够广泛应用于数据仓库和分析挖掘等领域。The data processing method for data warehouse provided by the present invention improves the subject and data granularity information in the data warehouse by automatically dividing the subject domain and automatically identifying the granularity of the data warehouse table. It can not only effectively supplement the data model information, but also play a role in checking the standardization degree of the data warehouse model. It is automatically executed by the program to avoid standard confusion caused by differences in experience of modeling personnel. It can be widely used in fields such as data warehouses and analysis and mining.
基于上述图1所示的数据处理方法,在一种示例实施例中,数据仓库基于关联关系和依赖关系,以及各数据粒度和各主题域,构建实体关系模型,具体可以可以通过下述步骤实现。Based on the data processing method shown in FIG. 1 , in an exemplary embodiment, the data warehouse constructs an entity relationship model based on association relationships and dependency relationships, as well as various data granularities and various subject domains, which can be specifically implemented through the following steps.
首先,确定第一表集合,第一表集合中包括基于各业务源表确定的多个ODS表,每个ODS表分别标识对应的主题域和数据粒度以及对应的关联关系和/或对应的依赖关系;然后,将第一表集合中的ODS表移出,并确定以ODS表为源表的全部后继目标表;再进一步确定以全部后继目标表中后继目标表为目标表的全部前继源表,并基于全部前继源表中两两交叉连接关系的查询结果以及实体关系模型,确定目标实体关系模型;最后,基于后继目标表对第一表集合进行填充,再从全部后继目标表中选取下一个后继目标表,返回执行确定以全部后继目标表中后继目标表为目标表的全部前继源表的步骤;直至遍历完全部后继目标表时将对应得到的填充后的第一表集合作为新的第一表集合,以及将遍历完全部后继目标表时对应得到的目标实体关系模型作为新的实体关系模型,重复执行上述过程;直至第一表集合为空时将对应得到的目标实体关系模型确定为构建好的实体关系模型。First, determine the first table set, which includes multiple ODS tables determined based on each business source table, each ODS table identifies the corresponding subject domain and data granularity and the corresponding association relationship and/or the corresponding dependency relationship; then, remove the ODS table in the first table set, and determine all the successor target tables with the ODS table as the source table; further determine all the predecessor source tables with the successor target tables in all the successor target tables as the target table, and determine the target entity relationship model based on the query results of the cross-connection relationships between the two predecessor source tables and the entity relationship model; finally, based on The successor target table fills the first table set, then selects the next successor target table from all the successor target tables, and returns to execute the step of determining all the predecessor source tables with the successor target table as the target table; until all the successor target tables are traversed, the corresponding filled first table set is used as the new first table set, and the corresponding target entity relationship model obtained when all the successor target tables are traversed is used as the new entity relationship model, and the above process is repeated; until the first table set is empty, the corresponding target entity relationship model is determined as the constructed entity relationship model.
其中,实体关系模型中的所有实体点为各业务源表和数据仓库目标表,此处将各业务源表均作为数据仓库源头表,也即表征各业务源表全部同步到数据仓库的ODS层,此时将其均作为ODS表,每个ODS表均可以与对应业务系统中的业务数据表同构,每个ODS表的主题域和数据粒度沿用对应业务系统中业务数据表的主题域和数据粒度,且均可以通过上述表3抽取所需数据。Among them, all entity points in the entity-relationship model are business source tables and data warehouse target tables. Here, each business source table is used as the source table of the data warehouse, that is, it represents that all business source tables are synchronized to the ODS layer of the data warehouse. At this time, they are all used as ODS tables. Each ODS table can be isomorphic to the business data table in the corresponding business system. The subject domain and data granularity of each ODS table follow the subject domain and data granularity of the business data table in the corresponding business system, and the required data can be extracted through the above Table 3.
可以理解的是,由于数据仓库ODS表与业务数据保持同构,不做任何转换,故,数据仓库ODS表的主题域和数据粒度与业务源表保持一致,至此,数据仓库所有起点表都获取了主题域和数据粒度等建模信息。It is understandable that since the data warehouse ODS table remains isomorphic to the business data without any conversion, the subject domain and data granularity of the data warehouse ODS table remain consistent with the business source table. At this point, all starting tables of the data warehouse have obtained modeling information such as subject domain and data granularity.
实体关系模型的建模信息包括关联关系和依赖关系,以及每个ODS表分别标识的对应数据粒度和主题域。The modeling information of the entity relationship model includes association relationships and dependency relationships, as well as the corresponding data granularity and subject domain identified by each ODS table.
具体的,参照图3所示的数据处理方法的流程示意图之二,如图3所示,可以通过下述步骤实现步骤100-步骤107实现。Specifically, referring to the second flowchart of the data processing method shown in FIG. 3 , as shown in FIG. 3 , steps 100 to 107 may be implemented through the following steps.
步骤100、数据仓库将其ODS层中的全部ODS表放入第一表集合中,第一表集合可以记为TODO集合,TODO集合用于存放当前已标注过主题域和数据粒度、下一步需向后继目标表传递以及构建其目标表所依赖的全部源表实体关系的表清单;转步骤200。Step 100, the data warehouse puts all ODS tables in its ODS layer into the first table set, which can be recorded as the TODO set. The TODO set is used to store the table list that has been marked with the subject domain and data granularity, which needs to be transferred to the subsequent target table in the next step, and constructs all source table entity relationships on which its target table depends; go to step 200.
步骤200、判断TODO集合是否为空;当TODO集合为空时表示所有ODS表均已标注过主题域和数据粒度且已无后继目标表需处理,则数据处理流程结束;当TODO集合非空时表示还有未完成建模的ODS表,转步骤110(图3中未示出步骤110)。Step 200, determine whether the TODO set is empty; when the TODO set is empty, it means that all ODS tables have been marked with subject domains and data granularity and there are no subsequent target tables to be processed, and the data processing flow ends; when the TODO set is not empty, it means that there are ODS tables that have not been modeled, and go to step 110 (step 110 is not shown in Figure 3).
步骤110、从TODO集合中移出第k个ODS表Sk,并确定以第k个ODS表Sk为源表的全部后继目标表;转步骤120(图3中未示出步骤120)。Step 110 , remove the kth ODS table Sk from the TODO set, and determine all subsequent target tables with the kth ODS table Sk as the source table; then go to step 120 (step 120 is not shown in FIG. 3 ).
步骤120、基于全部前继源表中两两交叉连接关系的查询结果以及实体关系模型,确定目标实体关系模型;转步骤107。Step 120 : Determine the target entity relationship model based on the query results of the pairwise cross-connection relationships in all predecessor and successor source tables and the entity relationship model; and go to step 107 .
步骤107、基于后继目标表Ti对第一表集合进行填充,也即将后继目标表Ti放入TODO集合。至此,源表Sk对应的后继目标表Ti已经标注上了主题域和数据粒度,后继目标表Ti的所有存在关联的前继源表的关联关系已更新到实体关系图中,此时再从全部后继目标表中选取下一个后继目标表,返回步骤120,继续处理源表Sk的其它还未处理的后继目标表;直至全部后继目标表均被处理过时将对应得到的填充后的TODO集合作为新的TODO集合,以及将全部后继目标表均被处理过时对应得到目标实体关系模型作为新的实体关系模型,重复执行上述过程;直至第一表集合为空,生成待向下一层传递的建模信息。Step 107, fill the first table set based on the successor target table Ti , that is, put the successor target table Ti into the TODO set. So far, the successor target table Ti corresponding to the source table Sk has been marked with the subject domain and data granularity, and the association relationship of all the predecessor source tables associated with the successor target table Ti has been updated to the entity relationship diagram. At this time, the next successor target table is selected from all the successor target tables, and the process returns to step 120 to continue processing other unprocessed successor target tables of the source table Sk ; until all the successor target tables are processed, the corresponding filled TODO set is used as a new TODO set, and the corresponding target entity relationship model is used as a new entity relationship model when all the successor target tables are processed, and the above process is repeated; until the first table set is empty, the modeling information to be transferred to the next layer is generated.
本发明提供的数据仓库的数据处理方法,通过对数据仓库表的业务领域和数据粒度进行自动识别,方便业务人员根据业务领域快速定位所需数据,帮助开发人员减少探索数据粒度等工作量。The data processing method for the data warehouse provided by the present invention automatically identifies the business domain and data granularity of the data warehouse table, so as to facilitate business personnel to quickly locate required data according to the business domain and help developers reduce the workload of exploring data granularity.
基于上述图1所示的数据处理方法,在一种示例实施例中,数据仓库确定以全部后继目标表中后继目标表为目标表的全部前继源表的过程包括:在将第一表集合中的ODS表移出至第二表集合中的情况下,若后继目标表同时未存在于第一表集合和第二表集合中,则从实体关系模型中确定以后继目标表为目标表的全部前继源表。Based on the data processing method shown in Figure 1 above, in an exemplary embodiment, the process of the data warehouse determining all predecessor source tables with the successor target table as the target table in all successor target tables includes: when the ODS table in the first table set is moved out to the second table set, if the successor target table does not exist in both the first table set and the second table set, then determining all predecessor source tables with the successor target table as the target table from the entity relationship model.
示例性的,前述步骤110中具体可以通过图3中的步骤101-步骤202实现。Exemplarily, the aforementioned step 110 may be specifically implemented through steps 101 to 202 in FIG. 3 .
步骤101、从TODO集合中移出第k个ODS表Sk并移入第二表集合中,此处k的取值范围为1至G,G表示TODO集合中ODS表的总个数;且此处第二表集合具体为DONE集合,集合DONE存放已完成建模的数据表;将第k个ODS表Sk从TODO集合移入DONE集合,表示第k个ODS表Sk经后续步骤处理后即可完成建模;转步骤102。Step 101, remove the kth ODS table S k from the TODO set and move it into the second table set, where the value range of k is 1 to G, and G represents the total number of ODS tables in the TODO set; and the second table set here is specifically the DONE set, and the set DONE stores the data tables for which modeling has been completed; moving the kth ODS table S k from the TODO set to the DONE set indicates that the kth ODS table S k can be modeled after being processed in subsequent steps; go to step 102.
步骤102、取出以第k个ODS表Sk为源表的全部后继目标表放入第三表集合,此处第三表集合可以为集合T,集合T存放源表为Sk的所有目标表清单,如表2所示集合T存放source_table为Sk的所有target_table清单;转步骤201。Step 102, take out all the successor target tables with the kth ODS table Sk as the source table and put them into the third table set, where the third table set can be set T, and set T stores a list of all target tables with the source table Sk . As shown in Table 2, set T stores a list of all target_tables with source_table Sk; go to step 201.
步骤201、判断集合T是否为空;集合T为空表示源表Sk已无后继目标表(也即Sk是孤立点,没有后继也没有与其它表关联),由于源表Sk放入TODO集合时已标注主题域和数据粒度,同时在步骤101中也已放入DONE集合,至此源表Sk表已建模完毕,故跳过后续步骤直接返回步骤200,继续从TODO集合中迭代下一个ODS表;反之,转步骤103。Step 201, determine whether the set T is empty; if the set T is empty, it means that the source table Sk has no successor target table (that is, Sk is an isolated point, has no successor and is not associated with other tables). Since the subject domain and data granularity have been marked when the source table Sk is placed in the TODO set, and it has also been placed in the DONE set in step 101, the source table Sk has been modeled, so the subsequent steps are skipped and the process directly returns to step 200 to continue iterating the next ODS table from the TODO set; otherwise, go to step 103.
步骤103、在集合T为非空集合的情况下,从集合T中移出一个后继目标表,此处将移出的该个后继目标表记为后继目标表Ti;转步骤202。Step 103 , when the set T is a non-empty set, remove a successor target table from the set T, where the removed successor target table is recorded as the successor target table Ti ; go to step 202 .
步骤202、判断后继目标表Ti是否存在于TODO集合或者DONE集合,也即判断源表Sk对应的后继目标表Ti是否处理过,并确定后继目标表Ti同时未存在于TODO集合和DONE集合中时,转步骤210(图3中未示出步骤210);反之,若确定后继目标表Ti存在于TODO集合或者DONE集合中时,则说明后继目标表Ti已标注上了主题域和数据粒度,后继目标表Ti的所有存在关联的前继源表的关联关系已更新到实体关系模型中,此时返回步骤201,继续处理集合T中的下一个后继目标表。Step 202, determine whether the successor target table Ti exists in the TODO set or the DONE set, that is, determine whether the successor target table Ti corresponding to the source table Sk has been processed, and when it is determined that the successor target table Ti does not exist in both the TODO set and the DONE set, go to step 210 (step 210 is not shown in Figure 3); on the contrary, if it is determined that the successor target table Ti exists in the TODO set or the DONE set, it means that the successor target table Ti has been marked with the subject domain and data granularity, and the association relationships of all the associated predecessor source tables of the successor target table Ti have been updated in the entity relationship model. At this time, return to step 201 and continue to process the next successor target table in the set T.
步骤210、从实体关系模型中确定以后继目标表Ti为目标表的全部前继源表。Step 210: Determine all predecessor source tables that take the successor target table Ti as the target table from the entity relationship model.
基于上述图1所示的数据处理方法,在一种示例实施例中,数据仓库从实体关系模型中确定以后继目标表为目标表的全部前继源表(也即步骤210)的过程包括:在实体关系模型中ODS表为后继目标表的主表的情况下,将ODS表的数据粒度和主题域复制到后继目标表中,并基于复制结果确定以后继目标表为目标表的全部前继源表;在ODS表为后继目标表的非主表的情况下,确定以后继目标表为目标表的全部前继源表。Based on the data processing method shown in FIG1 , in an exemplary embodiment, the process of the data warehouse determining all predecessor source tables of the target table with the successor target table as the target table from the entity relationship model (i.e., step 210) includes: when the ODS table is the master table of the successor target table in the entity relationship model, copying the data granularity and subject domain of the ODS table to the successor target table, and determining all predecessor source tables of the target table with the successor target table as the target table based on the copy result; when the ODS table is a non-master table of the successor target table, determining all predecessor source tables of the target table with the successor target table as the target table.
示例性的,步骤210具体可以通过图3中的步骤203和步骤105实现。Exemplarily, step 210 may be implemented specifically through step 203 and step 105 in FIG. 3 .
步骤203、判断源表Sk是否为后继目标表Ti的主表;若源表Sk为后继目标表Ti的主表,则转步骤104;反之,后继目标表Ti不继承源表Sk的主题域,跳过步骤104直接转步骤105。Step 203, determine whether the source table Sk is the master table of the subsequent target table Ti ; if the source table Sk is the master table of the subsequent target table Ti , go to step 104; otherwise, the subsequent target table Ti does not inherit the subject domain of the source table Sk , skip step 104 and go directly to step 105.
步骤104、将源表Sk的数据粒度和主题域复制到后继目标表Ti中,并在复制成功的情况下转步骤105。Step 104: copy the data granularity and subject domain of the source table Sk to the subsequent target table Ti , and if the copy is successful, go to step 105.
步骤105、在实体关系模型中取出以后继目标表Ti为目标表的全部前继源表放入集合S中,例如从表2中取出以target_table为Ti的所有source_table放入集合S中。Step 105: Take out all predecessor source tables with successor target table Ti as target table in the entity relationship model and put them into set S. For example, take out all source_tables with target_table as Ti from Table 2 and put them into set S.
基于上述图1所示的数据处理方法,在一种示例实施例中,本发明提供的数据仓库的数据处理方法还可以包括:Based on the data processing method shown in FIG. 1 above, in an exemplary embodiment, the data processing method of the data warehouse provided by the present invention may further include:
在ODS表为后继目标表的主表的情况下,将ODS表的数据粒度和主题域,以及ODS表的业务分类信息均复制到后继目标表中。In the case where the ODS table is the master table of the subsequent target table, the data granularity and subject domain of the ODS table, as well as the business classification information of the ODS table are copied to the subsequent target table.
具体的,利用生成式人工智能(Artificial Intelligence Generated Content,AIGC)模型确定ODS表的业务分类信息,如通过向AIGC模型提供角色名称、ODS表的评论意见(comment)、表字段列表(及其comment)和数据样例等提示词的方式,获取AIGC模型输出的业务域和主题域;此处的业务域可以包括但不限定业务分类信息,或者,业务域即为业务分类信息。Specifically, the business classification information of the ODS table is determined using a generative artificial intelligence (AIGC) model, such as by providing the AIGC model with prompts such as role names, comments on the ODS table, a table field list (and its comments), and data samples, to obtain the business domain and subject domain output by the AIGC model; the business domain here may include but is not limited to business classification information, or the business domain is the business classification information.
对于ODS表的主题域的确定方式,还可以对ODS表及其字段的comment进行文本分类,并分类结果作为该ODS表的主题域。As for the method of determining the subject domain of the ODS table, text classification can also be performed on the comments of the ODS table and its fields, and the classification result is used as the subject domain of the ODS table.
对于ODS表的主题域所含不同主题的确定方式,还可以对ODS表及其字段的comment进行聚类,并从聚类到的所有表中,将文本相似度最接近的表的主题作为该ODS表的主题。As for the method of determining different topics contained in the subject domain of the ODS table, the comments of the ODS table and its fields can also be clustered, and from all the clustered tables, the topic of the table with the closest text similarity is used as the topic of the ODS table.
需要说明的是,当ODS表沿用其源表(也即业务系统中某个业务数据表)时,所属主题概率可以通过式(1)计算得到。It should be noted that when the ODS table uses its source table (that is, a business data table in the business system), the subject probability can be calculated using formula (1).
(1)。 (1).
式(1)中,表示字段总数,表示ODS表,表示ODS表的字段索引。In formula (1), Indicates the total number of fields. Represents an ODS table, Indicates the field index of the ODS table.
此外,还可以根据本发明填充的建模信息检查数据仓库中的模型规范程度:数据粒度字段聚合非唯一,表示存在数据发散或没有以最细粒度表为主表等不规范问题;目标表对应源表的主题差异越小表示数据表越符合“高内聚”规范,边界划分越合理。In addition, the degree of model standardization in the data warehouse can also be checked based on the modeling information filled in by the present invention: if the data granularity field aggregation is non-unique, it means that there are irregularities such as data divergence or the finest granularity table is not the main table; the smaller the subject difference between the target table and the source table, the more the data table complies with the "high cohesion" specification and the more reasonable the boundary division.
基于上述图1所示的数据处理方法,在一种示例实施例中,数据仓库基于全部前继源表中两两交叉连接关系的查询结果以及实体关系模型,确定目标实体关系模型(也即步骤120)的过程包括:在各全部前继源表中查询到两两交叉连接关系的情况下,基于两两交叉连接关系对实体关系模型进行更新,得到目标实体关系模型;在各全部前继源表中未查询到两两交叉连接关系的情况下,将实体关系模型确定为目标实体关系模型。Based on the data processing method shown in FIG. 1 , in an exemplary embodiment, the process of determining the target entity relationship model (i.e., step 120) by the data warehouse based on the query results of the pairwise cross-connection relationships in all predecessor source tables and the entity relationship model includes: when the pairwise cross-connection relationships are found in all predecessor source tables, the entity relationship model is updated based on the pairwise cross-connection relationships to obtain the target entity relationship model; when the pairwise cross-connection relationships are not found in all predecessor source tables, the entity relationship model is determined as the target entity relationship model.
示例性的,步骤120具体可以通过图3中的步骤106实现。Exemplarily, step 120 may be implemented specifically through step 106 in FIG. 3 .
步骤106、在集合S中查询前继源表两两交叉所构造的两两交叉连接关系,若存在两两交叉连接关系,则基于查询到的两两交叉连接关系对实体关系图进行更新,并将更新后的实体关系图作为目标实体关系图;反之,若未存在两两交叉连接关系,则对实体关系图不进行更新,也即保持实体关系不变。Step 106, query the pairwise cross-connection relationship constructed by the pairwise cross-connection of the predecessor source tables in the set S. If a pairwise cross-connection relationship exists, the entity relationship diagram is updated based on the queried pairwise cross-connection relationship, and the updated entity relationship diagram is used as the target entity relationship diagram; conversely, if no pairwise cross-connection relationship exists, the entity relationship diagram is not updated, that is, the entity relationship remains unchanged.
例如,对于集合S中的任意两个不同的前继源表SL和SR,如果表1中存在left_table为SL且rigth_table为SR的记录,则可以确定SL和SR存在关联关系K(其中关联类型为join_type,关键键为join_on),两两交叉连接关系记为:<SL,K,SR>,L≠R,以此更新实体关系图;反之,则说明SL和SR不存在关联,则不更新实体关系图。For example, for any two different predecessor source tables SL and SR in set S, if there are records in Table 1 where left_table is SL and rigth_table is SR , it can be determined that there is an association relationship K between SL and SR (where the association type is join_type and the key is join_on), and the pairwise cross-connection relationship is recorded as: < SL , K, SR >, L≠R, and the entity relationship diagram is updated accordingly; otherwise, it means that there is no association between SL and SR , and the entity relationship diagram is not updated.
至此,通过图3完成待向数据仓库下一层传递的建模信息,可以按照业务源表在数据仓库上的数据依赖顺序,逐个传递每个业务源表的主题域和数据粒度等信息到其目标表,完毕则继续以目标表为新的业务源表,将与其具备依赖关系的下游表作为新的目标表,继续传递,直到完成数据仓库上所有数据表的建模信息标注。以此可以直观展现数据仓库所有表的关联关系、表的字段信息、表的依赖关系,表所属业务领域和数据粒度等数据模型信息,方便使用者全方位了解数据资产,提升数据的寻找和使用效率。At this point, the modeling information to be transferred to the next layer of the data warehouse is completed through Figure 3. According to the data dependency order of the business source table in the data warehouse, the subject domain and data granularity of each business source table can be transferred to its target table one by one. After completion, the target table is used as the new business source table, and the downstream table with dependency relationship is used as the new target table, and the transfer is continued until the modeling information of all data tables in the data warehouse is completed. In this way, the association relationship of all tables in the data warehouse, the field information of the table, the dependency relationship of the table, the business field and data granularity of the table, and other data model information can be intuitively displayed, which is convenient for users to fully understand data assets and improve the efficiency of data search and use.
下面对本发明提供的数据仓库的数据处理装置进行描述,下文描述的数据仓库的数据处理装置与上文描述的数据仓库的数据处理方法可相互对应参照。The data processing device of the data warehouse provided by the present invention is described below. The data processing device of the data warehouse described below and the data processing method of the data warehouse described above can be referenced to each other.
参照图4,为本发明实施例提供的数据仓库的数据处理装置的结构示意图,如图4所示,该数据仓库的数据处理装置400,包括:关系获取单元410和数据处理单元420。4 , which is a schematic diagram of the structure of a data processing device for a data warehouse provided in an embodiment of the present invention, as shown in FIG4 , the data processing device 400 for a data warehouse includes: a relationship acquisition unit 410 and a data processing unit 420 .
关系获取单元410,用于响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系。The relationship acquisition unit 410 is used to respond to the business update instruction, obtain the business source tables synchronized to the ODS layer in the data warehouse and the data warehouse target tables corresponding to the business source tables, and obtain the association relationships between the business source tables and the dependency relationships between the business source tables and the data warehouse target tables.
数据处理单元420,用于基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。The data processing unit 420 is used to construct an entity relationship model based on the association relationship and the dependency relationship, and transmit the modeling information of the entity relationship model to the data warehouse layer by layer.
可选的,数据处理单元420,具体用于获取各业务源表各自的数据粒度和主题域;基于关联关系和所述依赖关系,以及各数据粒度和各主题域,构建实体关系模型。Optionally, the data processing unit 420 is specifically used to obtain the data granularity and subject domain of each business source table; and construct an entity relationship model based on the association relationship and the dependency relationship, as well as each data granularity and each subject domain.
可选的,数据处理单元420,具体用于确定第一表集合,第一表集合中包括基于各业务源表确定的多个ODS表,每个ODS表分别标识对应的主题域和数据粒度以及对应的关联关系和/或对应的依赖关系;将第一表集合中的ODS表移出,并确定以ODS表为源表的全部后继目标表;确定以全部后继目标表中后继目标表为目标表的全部前继源表,并基于全部前继源表中两两交叉连接关系的查询结果以及实体关系模型,确定目标实体关系模型;基于后继目标表对第一表集合进行填充,再从全部后继目标表中选取下一个后继目标表,返回执行确定以全部后继目标表中后继目标表为目标表的全部前继源表的步骤;直至遍历完全部后继目标表时将对应得到的填充后的第一表集合作为新的第一表集合,以及将遍历完全部后继目标表时对应得到的目标实体关系模型作为新的实体关系模型,重复执行上述过程;直至第一表集合为空时将对应得到的目标实体关系模型确定为构建好的实体关系模型。Optionally, the data processing unit 420 is specifically used to determine a first table set, the first table set includes multiple ODS tables determined based on each business source table, each ODS table respectively identifies a corresponding subject domain and data granularity and a corresponding association relationship and/or a corresponding dependency relationship; remove the ODS table in the first table set, and determine all subsequent target tables with the ODS table as the source table; determine all predecessor source tables with the successor target table in all the successor target tables as the target table, and determine the target entity relationship model based on the query results of the cross-connection relationships between the two in all the predecessor source tables and the entity relationship model. ; Fill the first table set based on the successor target table, then select the next successor target table from all the successor target tables, and return to execute the step of determining all the predecessor source tables with the successor target table as the target table; until all the successor target tables are traversed, the corresponding filled first table set is used as the new first table set, and the corresponding target entity relationship model obtained when all the successor target tables are traversed is used as the new entity relationship model, and the above process is repeated; until the first table set is empty, the corresponding target entity relationship model is determined as the constructed entity relationship model.
可选的,数据处理单元420,具体用于在将第一表集合中的ODS表移出至第二表集合中的情况下,若后继目标表同时未存在于第一表集合和第二表集合中,则从实体关系模型中确定以后继目标表为目标表的全部前继源表。Optionally, the data processing unit 420 is specifically used to determine all predecessor source tables with the successor target table as the target table from the entity relationship model when the ODS table in the first table set is moved out to the second table set if the successor target table does not exist in both the first table set and the second table set.
可选的,数据处理单元420,具体用于在实体关系模型中ODS表为后继目标表的主表的情况下,将ODS表的数据粒度和主题域复制到后继目标表中,并基于复制结果确定以后继目标表为目标表的全部前继源表;在ODS表为后继目标表的非主表的情况下,确定以后继目标表为目标表的全部前继源表。Optionally, the data processing unit 420 is specifically used to copy the data granularity and subject domain of the ODS table to the successor target table when the ODS table is the primary table of the successor target table in the entity relationship model, and determine all predecessor source tables with the successor target table as the target table based on the copy result; when the ODS table is a non-primary table of the successor target table, determine all predecessor source tables with the successor target table as the target table.
可选的,数据处理单元420,具体用于在ODS表为后继目标表的主表的情况下,将ODS表的数据粒度和主题域,以及ODS表的业务分类信息均复制到后继目标表中。Optionally, the data processing unit 420 is specifically configured to copy the data granularity and subject domain of the ODS table, as well as the business classification information of the ODS table to the subsequent target table when the ODS table is the main table of the subsequent target table.
可选的,数据处理单元420,具体用于在各全部前继源表中查询到两两交叉连接关系的情况下,基于两两交叉连接关系对实体关系模型进行更新,得到目标实体关系模型;在各全部前继源表中未查询到两两交叉连接关系的情况下,将实体关系模型确定为目标实体关系模型。Optionally, the data processing unit 420 is specifically used to update the entity relationship model based on the pairwise cross-connection relationship to obtain a target entity relationship model when a pairwise cross-connection relationship is found in all predecessor source tables; and to determine the entity relationship model as the target entity relationship model when no pairwise cross-connection relationship is found in all predecessor source tables.
图5示例了一种电子设备的实体结构示意图,如图5所示,该电子设备可以包括:处理器(processor)510、通信接口(Communications Interface)520、存储器(memory)530和通信总线540,其中,处理器510,通信接口520,存储器530通过通信总线540完成相互间的通信。处理器510可以调用存储器530中的逻辑指令,以执行如下方法:响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系;基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。FIG5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG5 , the electronic device may include: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may call the logic instructions in the memory 530 to execute the following method: in response to the business update instruction, obtain each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtain the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table; construct an entity relationship model based on the association relationship and the dependency relationship, and transfer the modeling information of the entity relationship model to the data warehouse layer by layer.
此外,上述的存储器530中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the logic instructions in the above-mentioned memory 530 can be implemented in the form of a software functional unit and can be stored in a computer-readable storage medium when it is sold or used as an independent product. Based on such an understanding, the technical solution of the present invention, or the part that contributes to the relevant technology or the part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including a number of instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk and other media that can store program codes.
另一方面,本发明实施例公开一种计算机程序产品,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,计算机能够执行上述各方法实施例所提供的数据仓库的数据处理方法,例如包括:响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系;基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。On the other hand, an embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the data processing methods of the data warehouse provided by the above-mentioned method embodiments, for example, including: in response to a business update instruction, obtaining each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtaining the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table; constructing an entity relationship model based on the association relationship and the dependency relationship, and delivering the modeling information of the entity relationship model to the data warehouse layer by layer.
又一方面,本发明实施例还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现以执行上述各实施例提供的数据仓库的数据处理方法,例如包括:响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系;基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。On the other hand, an embodiment of the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon. When the computer program is executed by a processor, it is implemented to execute the data processing method of the data warehouse provided by the above-mentioned embodiments, for example, including: in response to a business update instruction, obtaining each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtaining the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table; constructing an entity relationship model based on the association relationship and the dependency relationship, and delivering the modeling information of the entity relationship model to the data warehouse layer by layer.
又一方面,本发明实施例还提供一种计算机程序产品,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,计算机能够执行上述各方法实施例所提供的数据仓库的数据处理方法,例如包括:响应于业务更新指令,获取同步到数据仓库中ODS层的各业务源表以及与各业务源表对应的数据仓库目标表,并获取各业务源表之间的关联关系及各业务源表与数据仓库目标表之间的依赖关系;基于关联关系和依赖关系构建实体关系模型,并向数据仓库逐层传递实体关系模型的建模信息。On the other hand, an embodiment of the present invention also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the data processing methods of the data warehouse provided by the above-mentioned method embodiments, for example, including: in response to a business update instruction, obtaining each business source table synchronized to the ODS layer in the data warehouse and the data warehouse target table corresponding to each business source table, and obtaining the association relationship between each business source table and the dependency relationship between each business source table and the data warehouse target table; constructing an entity relationship model based on the association relationship and the dependency relationship, and delivering the modeling information of the entity relationship model to the data warehouse layer by layer.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Ordinary technicians in this field can understand and implement it without paying creative labor.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the relevant technology can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or some parts of the embodiment.
最后应说明的是,以上实施方式仅用于说明本发明,而非对本发明的限制。尽管参照实施例对本发明进行了详细说明,本领域的普通技术人员应当理解,对本发明的技术方案进行各种组合、修改或者等同替换,都不脱离本发明技术方案的精神和范围,均应涵盖在本发明的权要范围中。Finally, it should be noted that the above embodiments are only used to illustrate the present invention, rather than to limit the present invention. Although the present invention is described in detail with reference to the embodiments, it should be understood by those skilled in the art that various combinations, modifications or equivalent substitutions of the technical solutions of the present invention do not depart from the spirit and scope of the technical solutions of the present invention, and should be included in the scope of the present invention.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411354924.1A CN119271750A (en) | 2024-09-25 | 2024-09-25 | Data processing method, device, equipment, medium and program product for data warehouse |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411354924.1A CN119271750A (en) | 2024-09-25 | 2024-09-25 | Data processing method, device, equipment, medium and program product for data warehouse |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119271750A true CN119271750A (en) | 2025-01-07 |
Family
ID=94105121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411354924.1A Pending CN119271750A (en) | 2024-09-25 | 2024-09-25 | Data processing method, device, equipment, medium and program product for data warehouse |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119271750A (en) |
-
2024
- 2024-09-25 CN CN202411354924.1A patent/CN119271750A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rost et al. | Distributed temporal graph analytics with GRADOOP | |
US10242016B2 (en) | Systems and methods for management of data platforms | |
JP6434960B2 (en) | Support for a combination of flow-based ETL and entity relationship-based ETL | |
CN109739878B (en) | Big data query method, device, server and storage medium | |
US9507807B1 (en) | Meta file system for big data | |
JP6434154B2 (en) | Identifying join relationships based on transaction access patterns | |
CN109684330A (en) | User's portrait base construction method, device, computer equipment and storage medium | |
US20150095303A1 (en) | Knowledge Graph Generator Enabled by Diagonal Search | |
US20150356123A1 (en) | Systems and methods for management of data platforms | |
US10970300B2 (en) | Supporting multi-tenancy in a federated data management system | |
US9311617B2 (en) | Processing event instance data in a client-server architecture | |
US12148027B2 (en) | Data processing method and apparatus, device, and readable storage medium | |
US20040015486A1 (en) | System and method for storing and retrieving data | |
CN114925045A (en) | PaaS platform for large data integration and management | |
US20190057147A1 (en) | Data portal | |
US20250200053A1 (en) | System and method for analysis of data from multiple data sources | |
CN110414259A (en) | A kind of method and apparatus for constructing data element, realizing data sharing | |
CN115422169B (en) | Data warehouse construction method and device based on commercial advertisement scene | |
CN115374105A (en) | Data processing method and device | |
CN118394829A (en) | Data blood edge analysis method, device, equipment and readable storage medium | |
CN109213826A (en) | Data processing method and equipment | |
CN115329011A (en) | Data model construction method, data query method, data model construction device and data query device, and storage medium | |
CN114328759A (en) | Data construction and management method and terminal for data warehouse | |
Rost et al. | Seraph: Continuous queries on property graph streams | |
Arputhamary et al. | A review on big data integration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |