Disclosure of Invention
To solve some or all of the problems in the prior art, embodiments of the present invention provide a big data platform management system, method, device, and storage medium that automatically obtain source table data according to an extraction configuration file, automatically perform date formatting and code value conversion, and automatically load buffer layer data into a physical table of a base layer, without requiring a large amount of manual configuration, thereby saving labor cost, reducing the error rate, and greatly improving ETL work efficiency.
According to a first aspect of the present invention, an embodiment of the present invention provides a big data platform management system, which includes: an initialization module, configured to acquire an extraction configuration file and a standardized configuration; a data extraction module, configured to read the full data or the incremental data of a source table based on the extraction configuration file and store the data in a physical table of a buffer layer; a data conversion module, configured to automatically perform date formatting and/or code value conversion according to the standardized configuration when the data of the buffer layer is loaded to a base layer; and a data loading module, configured to automatically load the data of the buffer layer into a physical table of the base layer according to a loading script.
According to the embodiment of the invention, the data extraction module automatically acquires the data of the source table according to the extraction configuration file, the data conversion module automatically performs date formatting and code value conversion, and the data loading module automatically loads the data of the buffer layer into the physical table of the base layer, so that a large amount of manual configuration is not needed, the labor cost is reduced, the error rate is reduced, and the working efficiency of the ETL can be greatly improved.
In some embodiments of the present invention, the initialization module determines a service table to be extracted into a data warehouse according to a service requirement, and determines an extraction policy and a loading policy according to a table structure of the service table, where the service table name, the extraction policy, and the loading policy of the service table constitute extraction configuration information; and the initialization module fills the extraction configuration information into a configuration template to generate the extraction configuration file.
In some embodiments of the invention, the initialization module is further configured to: automatically acquire a data structure of the source table according to the extraction configuration file, and automatically create metadata information and a dependency relationship between the physical table of the buffer layer and the physical table of the base layer; and automatically create the physical table of the buffer layer and the physical table of the base layer according to the metadata information.
According to the embodiment of the invention, a small amount of determined configuration information is filled into the configuration template to obtain the extraction configuration file used for data extraction; the data structure of the source table is then automatically obtained according to the extraction configuration file, and the metadata information, the physical table of the buffer layer, the physical table of the base layer, and the dependency relationship between the physical table of the buffer layer and the physical table of the base layer are automatically created, so that the amount of manual configuration can be minimized, the working efficiency in the data extraction stage is greatly improved, and the high error rate caused by a large amount of manual configuration is avoided.
In some embodiments of the present invention, automatically loading the data of the buffer layer into the physical table of the base layer according to the loading script comprises: generating an automatic loading script in real time according to the metadata information, the extraction policy, the loading policy and the standardized configuration, and calling the automatic loading script to automatically load the data of the buffer layer into the physical table of the base layer; and deleting the automatic loading script after the automatic loading script is backed up.
According to the embodiment of the invention, the automatic loading script is automatically generated through the data loading module, a large amount of loading scripts do not need to be manually compiled, and the working efficiency of data loading can be greatly improved. Meanwhile, when the data structure of the source table is changed, the metadata information is correspondingly changed, and the automatic loading script is generated in real time according to the metadata information, the configuration information and the like without manually modifying or adjusting the loading script, so that normal loading of data can be ensured, and the maintenance cost is saved.
In some embodiments of the present invention, automatically loading the data of the buffer layer into the physical table of the base layer according to the loading script further comprises: automatically detecting whether a manual loading script exists; if the manual loading script exists, preferentially calling the manual loading script to load the data of the buffer layer into the physical table of the base layer; and if the manual loading script does not exist, calling the automatic loading script to load the data of the buffer layer into the physical table of the base layer.
In some embodiments of the invention, the standardized configuration comprises a date standardized configuration and a code value standardized configuration.
In some embodiments of the invention, automatically performing date formatting according to the standardized configuration comprises: in the insert script that loads the data of the buffer layer into the base layer, automatically converting the date format string into the SQL date format string of the data warehouse according to the date standardized configuration.
According to the embodiment of the invention, in the process of loading the data of the buffer layer into the base layer, the data conversion module automatically converts the date format string into the SQL date format string of the data warehouse according to the date standardized configuration, so that date format conversion of the data from the buffer layer to the base layer is efficiently realized.
In some embodiments of the present invention, automatically performing code value conversion according to the standardized configuration comprises: in the insert script that loads the data of the buffer layer into the base layer, automatically generating a join statement associated with a code value mapping table according to the code value standardized configuration, and obtaining the converted code value from the code value mapping table.
According to the embodiment of the invention, the code values are managed in a centralized manner through the code value mapping table of the data conversion module, the converted code values are automatically acquired according to the code value standardized configuration in the process of loading the data of the buffer layer to the base layer, and the code value standard conversion of the data from the buffer layer to the base layer is efficiently realized.
In some embodiments of the invention, the big data platform management system further comprises: and the metadata management module is used for managing the table structure of the physical table of the buffer layer of the big data platform, the field mapping relation between the source table and the physical table of the buffer layer, and the standardized configuration.
In some embodiments of the invention, the big data platform management system further comprises: and the scheduling module is used for automatically generating a scheduling script according to the dependency relationship between the source table and the physical table of the buffer layer, the dependency relationship between the physical table of the buffer layer and the physical table of the base layer and the dependency relationship between the data processing tasks, and scheduling the tasks according to the scheduling script.
According to the embodiment of the invention, the scheduling script is automatically and rapidly generated by the scheduling module according to the dependency relationship between the source table and the physical table of the buffer layer, the dependency relationship between the physical table of the buffer layer and the physical table of the base layer, and the dependency relationship between the data processing tasks, and the user does not need to manually maintain the upstream and downstream dependency relationships, so that the manual maintenance cost is reduced, the accuracy of the scheduling dependency relationship can be ensured, and the batch running success rate is ensured.
According to a second aspect of the present invention, an embodiment of the present invention provides a big data platform management method, including: acquiring an extraction configuration file and a standardized configuration; reading the full data or the incremental data of a source table based on the extraction configuration file and storing the data in a physical table of a buffer layer; when the data of the buffer layer is loaded to a base layer, automatically performing date formatting and/or code value conversion according to the standardized configuration; and automatically loading the data of the buffer layer into a physical table of the base layer according to a loading script.
According to the embodiment of the invention, the data of the source table is automatically acquired according to the extraction configuration file, the date formatting and the code value conversion are automatically carried out, and the data of the buffer layer is automatically loaded into the physical table of the base layer without a large amount of manual configuration, so that the labor cost is reduced, the error rate is reduced, and the working efficiency of the ETL can be greatly improved.
In some embodiments of the present invention, a service table to be extracted into a data warehouse is determined according to a service requirement, and an extraction policy and a loading policy are determined according to a table structure of the service table, wherein the service table name, the extraction policy and the loading policy of the service table constitute extraction configuration information; and the extraction configuration information is filled into a configuration template to generate the extraction configuration file.
In some embodiments of the invention, the method further comprises: automatically acquiring a data structure of the source table according to the extraction configuration file, and automatically creating metadata information and a dependency relationship between the physical table of the buffer layer and the physical table of the base layer; and automatically creating the physical table of the buffer layer and the physical table of the base layer according to the metadata information.
According to the embodiment of the invention, a small amount of determined configuration information is filled into the configuration template to obtain the extraction configuration file used for data extraction; the data structure of the source table is then automatically obtained according to the extraction configuration file, and the metadata information, the physical table of the buffer layer, the physical table of the base layer, and the dependency relationship between the physical table of the buffer layer and the physical table of the base layer are automatically created, so that the amount of manual configuration can be minimized, the working efficiency in the data extraction stage is greatly improved, and the high error rate caused by a large amount of manual configuration is avoided.
In some embodiments of the present invention, automatically loading the data of the buffer layer into the physical table of the base layer according to the loading script comprises: generating an automatic loading script in real time according to the metadata information, the extraction policy, the loading policy and the standardized configuration, and calling the automatic loading script to automatically load the data of the buffer layer into the physical table of the base layer; and deleting the automatic loading script after the automatic loading script is backed up.
According to the embodiment of the invention, the automatic loading script is automatically generated, a large amount of loading scripts do not need to be written manually, and the working efficiency of data loading can be greatly improved. Meanwhile, when the data structure of the source table is changed, the metadata information is correspondingly changed, and the automatic loading script is generated in real time according to the metadata information, the configuration information and the like without manually modifying or adjusting the loading script, so that normal loading of data can be ensured, and the maintenance cost is saved.
In some embodiments of the present invention, automatically loading the data of the buffer layer into the physical table of the base layer according to the loading script further comprises: automatically detecting whether a manual loading script exists; if the manual loading script exists, preferentially calling the manual loading script to load the data of the buffer layer into the physical table of the base layer; and if the manual loading script does not exist, calling the automatic loading script to load the data of the buffer layer into the physical table of the base layer.
In some embodiments of the invention, the standardized configuration comprises a date standardized configuration and a code value standardized configuration.
In some embodiments of the invention, automatically performing date formatting according to the standardized configuration comprises: in the insert script that loads the data of the buffer layer into the base layer, automatically converting the date format string into the SQL date format string of the data warehouse according to the date standardized configuration.
According to the embodiment of the invention, in the process of loading the data of the buffer layer into the base layer, the date format string is automatically converted into the SQL date format string of the data warehouse according to the date standardized configuration, so that date format conversion of the data from the buffer layer to the base layer is efficiently realized.
In some embodiments of the present invention, automatically performing code value conversion according to the standardized configuration comprises: in the insert script that loads the data of the buffer layer into the base layer, automatically generating a join statement associated with a code value mapping table according to the code value standardized configuration, and obtaining the converted code value from the code value mapping table.
According to the embodiment of the invention, the code values are managed in a centralized manner through the code value mapping table, the converted code values are automatically acquired according to the code value standardized configuration in the process of loading the data of the buffer layer to the base layer, and the code value standard conversion of the data from the buffer layer to the base layer is efficiently realized.
In some embodiments of the present invention, the big data platform management method further includes: and managing a table structure of a physical table of a buffer layer of the big data platform, a field mapping relation between a source table and the physical table of the buffer layer, and the standardized configuration.
In some embodiments of the present invention, the big data platform management method further includes: and automatically generating a scheduling script according to the dependency relationship between the source table and the physical table of the buffer layer, the dependency relationship between the physical table of the buffer layer and the physical table of the base layer and the dependency relationship between the data processing tasks, and scheduling the tasks according to the scheduling script.
According to the embodiment of the invention, the scheduling script is automatically and quickly generated according to the dependency relationship between the source table and the physical table of the buffer layer, the dependency relationship between the physical table of the buffer layer and the physical table of the base layer and the dependency relationship between the data processing tasks, and the user does not need to manually maintain the upstream and downstream dependency relationships, so that the manual maintenance cost is reduced, the accuracy of the scheduling dependency relationship can be ensured, and the batch running success rate is ensured.
According to a third aspect of the present invention, the present invention provides a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, cause a computer to perform the steps of the big data platform management method according to any of the above embodiments.
According to a fourth aspect of the present invention, the present invention provides a computer device including a memory and a processor, wherein the memory is used for storing one or more computer readable instructions, and the one or more computer readable instructions, when executed by the processor, can implement the big data platform management method according to any one of the above embodiments.
Therefore, by implementing the big data platform management system, method, device and storage medium provided by the invention, the source table data can be automatically acquired according to the extraction configuration file, the date formatting and the code value conversion can be automatically carried out, and the data of the buffer layer can be automatically loaded into the physical table of the base layer without a large amount of manual configuration, so that the labor cost is reduced, the error rate is reduced, and the working efficiency of the ETL is greatly improved.
Detailed Description
Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known modules, units and their interconnections, links, communications or operations with each other are not shown or described in detail. Furthermore, the described features, architectures, or functions can be combined in any manner in one or more implementations. It will be understood by those skilled in the art that the various embodiments described below are illustrative only and are not intended to limit the scope of the present invention. It will also be readily understood that the modules or units or processes of the embodiments described herein and illustrated in the figures can be combined and designed in a wide variety of different configurations.
The terms used herein are briefly described below.
ETL: the abbreviation of Extract-Transform-Load, used to describe the process of extracting (Extract), transforming (Transform), and loading (Load) data from a source end to a destination end.
Metadata: data that describes data (data about data), mainly information describing the properties of data, used to support functions such as indicating storage location, historical data, resource search, and file recording.
Task scheduling: the ETL task scheduling (ETL scheduling for short) is used to control the start operation (start time, operation cycle, and trigger condition) of the ETL task, and implement the data transmission conversion operation.
Data warehouse: a data warehouse, which may be abbreviated as DW or DWH, is a strategic collection that provides all types of data support for the decision-making processes at all levels of an enterprise.
Buffer layer: in the ETL process, stores the original data extracted from the business database for data processing by the base layer.
Base layer: stores the original data extracted from the buffer layer, after standardization, according to different loading policies, for data processing at subsequent levels (secondary processing).
FIG. 1 is an architecture diagram of a big data platform management system according to one embodiment of the present invention.
As shown in fig. 1, the big data platform management system includes:
An initialization module 110, configured to obtain an extraction configuration file and a standardized configuration. The standardized configuration includes, but is not limited to, a date standardized configuration and a code value standardized configuration.
In one embodiment, the initialization module obtains the extraction configuration file as follows: the initialization module determines a service table to be extracted into the data warehouse according to service requirements, and determines an extraction policy and a loading policy according to the table structure of the service table, wherein the service table name, the extraction policy and the loading policy of the service table form the extraction configuration information; and the initialization module fills the extraction configuration information into a configuration template to generate the extraction configuration file.
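As an illustration of filling the extraction configuration information into a configuration template, the following is a minimal sketch. The JSON layout, the field names (`table`, `extract_policy`, `load_policy`) and the policy values are assumptions made for the example; the text does not specify the format of the template or of the resulting file.

```python
import json
from string import Template

# Hypothetical configuration template; the JSON layout and key names are assumed.
CONFIG_TEMPLATE = Template(
    '{"table": "$table", "extract_policy": "$extract_policy", '
    '"load_policy": "$load_policy"}'
)


def build_extraction_config(table, extract_policy, load_policy):
    """Fill the three items of extraction configuration information into the template."""
    text = CONFIG_TEMPLATE.substitute(
        table=table, extract_policy=extract_policy, load_policy=load_policy
    )
    return json.loads(text)


# Example values are illustrative only.
config = build_extraction_config("orders", "incremental", "append")
```

Only the three determined items are supplied by hand in this sketch; everything else is derived from the template, which reflects the stated goal of keeping manual configuration small.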
In a further embodiment, the initialization module is further configured to automatically obtain the data structure of the source table according to the extraction configuration file, automatically create metadata information and a dependency relationship between the physical table of the buffer layer and the physical table of the base layer, and automatically create the physical table of the buffer layer and the physical table of the base layer according to the metadata information.
By filling a small amount of determined configuration information into the configuration template to obtain the extraction configuration file used for data extraction, then automatically acquiring the data structure of the source table according to the extraction configuration file, and automatically creating the metadata information, the physical table of the buffer layer, the physical table of the base layer, and the dependency relationship between the physical table of the buffer layer and the physical table of the base layer, the amount of manual configuration can be minimized, the working efficiency of the data extraction stage is greatly improved, and the high error rate caused by a large amount of manual configuration is avoided.
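The automatic creation of metadata and layer tables described above can be sketched as follows. This is an illustrative example only: an in-memory SQLite database stands in for both the business database and the data warehouse, and the `buf_` and `base_` table prefixes and the shape of the dependency dictionary are assumptions, not details given in the text.

```python
import sqlite3


def read_source_structure(conn, table):
    """Read (column name, declared type) pairs of the source table."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]


def create_layer_tables(conn, table, columns):
    """Create the buffer and base tables and return their dependency relationship."""
    column_sql = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    for prefix in ("buf", "base"):  # assumed naming convention
        conn.execute(f"CREATE TABLE IF NOT EXISTS {prefix}_{table} ({column_sql})")
    # Each key depends on the tables listed in its value.
    return {f"buf_{table}": [table], f"base_{table}": [f"buf_{table}"]}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, created TEXT)")

columns = read_source_structure(conn, "orders")       # metadata information
dependencies = create_layer_tables(conn, "orders", columns)
created = {
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
}
```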
A data extraction module 120, configured to read the full data or the incremental data of the source table based on the extraction configuration file and store the data in the physical table of the buffer layer.
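One possible way to distinguish full extraction from incremental extraction is sketched below. The configuration keys (`extract_policy`, `increment_column`) and the use of a "last extracted value" watermark are assumptions for illustration; the text does not state how incremental data is identified.

```python
def build_extract_query(config, last_watermark=None):
    """Return (sql, parameters) for reading the source table.

    A 'full' policy reads every row; any other policy is treated here as
    incremental and reads only rows newer than the last watermark.
    """
    table = config["table"]
    if config["extract_policy"] == "full":
        return f"SELECT * FROM {table}", ()
    column = config["increment_column"]  # hypothetical configuration key
    return f"SELECT * FROM {table} WHERE {column} > ?", (last_watermark,)


full_query = build_extract_query({"table": "orders", "extract_policy": "full"})
incremental_query = build_extract_query(
    {"table": "orders", "extract_policy": "incremental", "increment_column": "updated_at"},
    last_watermark="2024-01-31",
)
```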
A data conversion module 130, configured to, when the data of the buffer layer is loaded to the base layer, automatically perform date formatting and/or code value conversion according to the standardized configuration.
In one embodiment, automatically performing date formatting according to the standardized configuration includes: in the insert script that loads the data of the buffer layer into the base layer, automatically converting the date format string into the SQL (Structured Query Language) date format string of the data warehouse according to the date standardized configuration. Automatically performing code value conversion according to the standardized configuration includes: in the insert script that loads the data of the buffer layer into the base layer, automatically generating a join statement associated with a code value mapping table according to the code value standardized configuration, and obtaining the converted code value from the code value mapping table.
Through the data conversion module, code values are centrally managed and a date standardized configuration is preset. In the process of loading the data of the buffer layer into the base layer, the date format string is automatically converted into the SQL date format string of the data warehouse, and the converted code values are obtained according to the code value standardized configuration, so that date format conversion and code value standard conversion of the data from the buffer layer to the base layer are achieved efficiently.
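The generation of such an insert script can be sketched as below. This is a sketch under stated assumptions, not the patented implementation: SQLite stands in for the data warehouse, the `code_map` table layout, the `buf_`/`base_` prefixes, the scheme name `crm_v1` and the `YYYYMMDD` format key are all hypothetical, and the date expression is specific to SQLite (a real warehouse would use its own date functions).

```python
import sqlite3

# Assumed: source date format name -> SQLite expression yielding 'YYYY-MM-DD'.
DATE_EXPR = {
    "YYYYMMDD": "substr({c},1,4) || '-' || substr({c},5,2) || '-' || substr({c},7,2)",
}


def build_insert_script(table, columns, date_config, code_config):
    """Build an INSERT ... SELECT that formats dates and joins the code value mapping table."""
    select_parts, joins = [], []
    for col in columns:
        if col in date_config:
            select_parts.append(DATE_EXPR[date_config[col]].format(c=f"b.{col}"))
        elif col in code_config:
            alias = f"m_{col}"
            select_parts.append(f"{alias}.standard_value")
            joins.append(
                f"LEFT JOIN code_map {alias} ON {alias}.scheme = '{code_config[col]}' "
                f"AND {alias}.source_value = b.{col}"
            )
        else:
            select_parts.append(f"b.{col}")
    return (
        f"INSERT INTO base_{table} ({', '.join(columns)}) "
        f"SELECT {', '.join(select_parts)} FROM buf_{table} b {' '.join(joins)}"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE buf_orders (id INTEGER, status TEXT, created TEXT);
CREATE TABLE base_orders (id INTEGER, status TEXT, created TEXT);
CREATE TABLE code_map (scheme TEXT, source_value TEXT, standard_value TEXT);
INSERT INTO buf_orders VALUES (1, 'P', '20240131');
INSERT INTO code_map VALUES ('crm_v1', 'P', 'PAID');
""")

script = build_insert_script(
    "orders", ["id", "status", "created"],
    date_config={"created": "YYYYMMDD"},   # date standardized configuration
    code_config={"status": "crm_v1"},      # code value standardized configuration
)
conn.execute(script)
rows = conn.execute("SELECT * FROM base_orders").fetchall()
```

With the sample data, the single buffer row is loaded into the base table with the code `P` replaced by its mapped value and the date rewritten in ISO form.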
In a further embodiment, code values are centrally managed and custom code value conversion rules are supported through the code value mapping table and the code value standardized configuration. A user can bind a code value conversion rule to a code value field, so that the rule can be conveniently replaced at any time. The code value conversion rule takes effect automatically during loading from the source table to the physical table of the buffer layer, realizing standard conversion of code values from the source table to the physical table of the buffer layer. Likewise, to address the common problem of date standardization in data conversion, a date format string can be customized for each field, so that the predefined date format string is automatically applied during loading from the source table to the physical table of the buffer layer, realizing date format conversion from the source table to the physical table of the buffer layer. The code value mapping table is a configuration table that represents the correspondence between the code values of each source service system and the standard code values; the code value standardized configuration acquired by the initialization module is configuration information that specifies which field of which table uses which code value mapping scheme. In centralized code value management based on the code value mapping table and the code value standardized configuration, the correspondence between the code values of each source service system and the standard code values may be divided into multiple mapping schemes. When the initialization module obtains the standardized configuration, the mapping scheme to be used is determined, and that mapping scheme determines the standard for data conversion.
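The relationship between the code value mapping table (several mapping schemes) and the code value standardized configuration (which field of which table uses which scheme) can be illustrated with plain dictionaries. All scheme names, table names, field names and code values below are hypothetical, and passing unmapped values through unchanged is a design choice of this sketch rather than something the text specifies.

```python
# Code value mapping table: scheme -> code type -> source value -> standard value.
CODE_VALUE_MAPPING = {
    "scheme_a": {"gender": {"1": "M", "2": "F"}},
    "scheme_b": {"gender": {"male": "M", "female": "F"}},
}

# Code value standardized configuration: (table, field) -> (scheme, code type).
CODE_VALUE_CONFIG = {("customer", "sex"): ("scheme_a", "gender")}


def convert_code(table, field, value):
    """Return the standard code value for a bound field, or the value unchanged."""
    binding = CODE_VALUE_CONFIG.get((table, field))
    if binding is None:
        return value  # field has no conversion rule bound
    scheme, code_type = binding
    return CODE_VALUE_MAPPING[scheme][code_type].get(value, value)


before = convert_code("customer", "sex", "1")

# Replacing the rule only requires rebinding the field to another scheme.
CODE_VALUE_CONFIG[("customer", "sex")] = ("scheme_b", "gender")
after = convert_code("customer", "sex", "female")
```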
And the data loading module 140 is configured to automatically load the data of the buffer layer into the physical table of the base layer according to the loading script.
In one embodiment, automatically loading the data of the buffer layer into the physical table of the base layer according to the loading script comprises: generating an automatic loading script in real time according to the metadata information, the extraction policy, the loading policy and the standardized configuration, and calling the automatic loading script to automatically load the data of the buffer layer into the physical table of the base layer; and deleting the automatic loading script after the automatic loading script is backed up.
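The generate, call, back up, and delete sequence can be sketched as follows. The file name `auto_load.sql`, the directory layout and the `execute` callback are assumptions for illustration; here the callback merely records the script text instead of running it against a warehouse.

```python
import shutil
import tempfile
from pathlib import Path


def run_and_archive(sql_text, work_dir, backup_dir, execute):
    """Write the generated script, run it, back it up, then delete the working copy."""
    script = Path(work_dir) / "auto_load.sql"  # hypothetical file name
    script.write_text(sql_text, encoding="utf-8")
    backup = Path(backup_dir) / script.name
    try:
        execute(script.read_text(encoding="utf-8"))
    finally:
        # Back up first, delete second, so the executed text remains available.
        Path(backup_dir).mkdir(parents=True, exist_ok=True)
        shutil.copy2(script, backup)
        script.unlink()
    return script, backup


executed = []  # stand-in for a warehouse connection
root = Path(tempfile.mkdtemp())
script_path, backup_path = run_and_archive(
    "INSERT INTO base_orders SELECT * FROM buf_orders",
    root, root / "backup", executed.append,
)
```

Because the working copy is removed after every run, the next run always regenerates the script from the current metadata, which is consistent with the real-time generation described above.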
In another embodiment, automatically loading the data of the buffer layer into the physical table of the base layer according to the loading script comprises: automatically detecting whether a manual loading script exists; if the manual loading script exists, preferentially calling the manual loading script to load the data of the buffer layer into the physical table of the base layer; and if the manual loading script does not exist, calling the generated automatic loading script to load the data of the buffer layer into the physical table of the base layer.
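The precedence of a manual loading script over the automatic one can be sketched as a simple file lookup. The convention of one `<table>.sql` file per table in a manual-script directory is an assumption for this example, as is the trivial auto-generator.

```python
import tempfile
from pathlib import Path


def choose_load_script(table, manual_dir, generate_auto_script):
    """Prefer a hand-written script for the table; otherwise generate one."""
    manual = Path(manual_dir) / f"{table}.sql"  # assumed naming convention
    if manual.is_file():
        return "manual", manual.read_text(encoding="utf-8")
    return "auto", generate_auto_script(table)


def generate_auto_script(table):
    # Placeholder generator; a fuller version would use metadata and policies.
    return f"INSERT INTO base_{table} SELECT * FROM buf_{table}"


manual_dir = Path(tempfile.mkdtemp())
(manual_dir / "orders.sql").write_text("-- hand-written load for orders", encoding="utf-8")

orders_choice = choose_load_script("orders", manual_dir, generate_auto_script)
customer_choice = choose_load_script("customer", manual_dir, generate_auto_script)
```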
The automatic loading scripts are automatically generated, a large number of loading scripts do not need to be written manually, and the working efficiency of data loading can be greatly improved. Meanwhile, when the data structure of the source table is changed, the metadata information is correspondingly changed, and the automatic loading script is generated in real time according to the metadata information, the configuration information and the like without manually modifying or adjusting the loading script, so that normal loading of data can be ensured, and the maintenance cost is saved. In addition, to support personalized needs, manual code is also supported.
With the big data platform management system provided by the embodiment of the invention, the data extraction module automatically obtains the data of the source table according to the extraction configuration file, the data conversion module automatically performs date formatting and code value conversion, and the data loading module automatically loads the data of the buffer layer into the physical table of the base layer, so that a large amount of manual configuration is not needed, the labor cost is reduced, the error rate is reduced, and the working efficiency of ETL can be greatly improved.
As shown in FIG. 1, in some embodiments, the big data platform management system may also include a metadata management module 150 and a scheduling module 160.
The metadata management module 150 is configured to manage a table structure of a physical table of a buffer layer of the big data platform, a field mapping relationship between a source table and the physical table of the buffer layer, and the standardized configuration.
In a further embodiment, the metadata of a table is automatically created when the data extraction configuration is initially brought online, and the user can also modify the metadata configuration (such as field type, date formatting, and standardized code conversion) in the foreground. If fields are added to or deleted from the upstream service table, the buffer layer table and the base layer table need to be modified accordingly. To facilitate synchronous modification on the big data platform, the metadata management module also has a metadata detection function: it automatically compares the structural difference between the source table and the physical table of the buffer layer and prompts the user as to which fields have been added or deleted, and the user can use the "metadata update" function in the foreground to automatically add the fields newly added in the source table to the buffer layer and the base layer. Through this automatic detection function, the data warehouse tables automatically adapt to changes in the data structure of the business database, and the maintainability of the big data platform is improved.
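The metadata detection function can be illustrated by comparing two column lists. The `(name, type)` representation, the `buf_`/`base_` prefixes and the generated `ALTER TABLE` statements are assumptions of this sketch; the text only states that differences are detected and that newly added fields can be propagated.

```python
def diff_structure(source_columns, buffer_columns):
    """Return (added, removed) field names, relative to the buffer table."""
    source, buffer_ = dict(source_columns), dict(buffer_columns)
    added = [name for name in source if name not in buffer_]
    removed = [name for name in buffer_ if name not in source]
    return added, removed


def metadata_update_statements(table, source_columns, added):
    """Build statements that add new source fields to the buffer and base tables."""
    types = dict(source_columns)
    return [
        f"ALTER TABLE {layer}_{table} ADD COLUMN {name} {types[name]}"
        for layer in ("buf", "base")
        for name in added
    ]


source = [("id", "INTEGER"), ("status", "TEXT"), ("channel", "TEXT")]
buffer_table = [("id", "INTEGER"), ("status", "TEXT"), ("legacy_flag", "TEXT")]

added, removed = diff_structure(source, buffer_table)
statements = metadata_update_statements("orders", source, added)
```

In this sketch removed fields are only reported, not dropped, matching the text's description that the update function adds the newly added fields.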
The scheduling module 160 is configured to automatically generate a scheduling script according to a dependency relationship between the source table and the physical table of the buffer layer, a dependency relationship between the physical table of the buffer layer and the physical table of the base layer, and a dependency relationship between data processing tasks, and perform task scheduling according to the scheduling script.
In a further embodiment, the scheduling module automatically generates the upstream and downstream dependency relationship of the extraction scheduling, that is, the dependency relationship between the source table and the physical table of the buffer layer, according to the extraction configuration; the scheduling module automatically parses the SQL script semantics of each data processing task and generates the lineage dependency relationships among the data processing tasks; the scheduling module then automatically creates a dependency tree according to the dependency relationship between the source table and the physical table of the buffer layer, the dependency relationship between the physical table of the buffer layer and the physical table of the base layer, and the lineage dependency relationships among the data processing tasks, automatically generates a scheduling script according to the dependency tree, and deploys the scheduling script to a scheduling platform, so that a large number of complex scheduling tasks can be handled efficiently.
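A minimal sketch of deriving a run order from these dependencies is given below. It makes two simplifying assumptions: lineage is extracted with a naive regular expression rather than true SQL semantic parsing, and the "scheduling script" is reduced to an ordered list of task names computed with the standard-library `graphlib.TopologicalSorter` (Python 3.9+). Table and task names are hypothetical.

```python
import re
from graphlib import TopologicalSorter


def lineage(sql):
    """Naively extract (target table, source tables) from an INSERT ... SELECT."""
    target = re.search(r"insert\s+into\s+(\w+)", sql, re.I).group(1)
    sources = set(re.findall(r"(?:from|join)\s+(\w+)", sql, re.I))
    return target, sources


# Dependencies known from the extraction configuration and layer metadata:
# each key depends on the nodes in its value.
graph = {
    "buf_orders": {"orders"},
    "base_orders": {"buf_orders"},
    "buf_customer": {"customer"},
    "base_customer": {"buf_customer"},
}

# Lineage dependency derived from a data processing task's SQL script.
task_sql = (
    "INSERT INTO rpt_sales SELECT o.id, c.name "
    "FROM base_orders o JOIN base_customer c ON o.cust_id = c.id"
)
target, sources = lineage(task_sql)
graph[target] = sources

# Any valid order runs every node after all of its upstream dependencies.
order = list(TopologicalSorter(graph).static_order())
```

A production scheduler would translate `order` (or the graph itself) into the job definitions of whatever scheduling platform is in use; that translation is outside the scope of this sketch.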
Therefore, the big data platform management system provided by the embodiment of the invention realizes one-stop ETL data acquisition, conversion, loading and scheduling. ETL work in the construction of the big data platform becomes simple and efficient, no manual management is needed, and the scheduling is generated automatically. The system is reliable and easy to use, greatly improves the working efficiency of ETL, makes large-scale ETL work easy to manage, saves human resources, and improves working accuracy. Specifically, the working efficiency is improved by 80% compared with the original ETL tool, the error rate is reduced by 90%, and the system maintainability is greatly improved.
In a further embodiment, the invention reserves interfaces for various data sources and scheduling tools, so that new types of databases and scheduling tools can be connected quickly and conveniently. The whole big data platform management system is therefore easy to extend to various databases and scheduling tools, is suitable for big data platform construction in all industries, and can support various types of ETL work.
Fig. 2 is a schematic diagram of a specific operation flow of the initialization module in the big data platform management system shown in fig. 1.
As shown in fig. 2, the initialization module is configured to perform four operations: extraction preparation, configuration warehousing, creation of warehousing resources, and data warehousing. Specifically, these comprise step S21, step S22, step S23, step S24, step S25, step S26, step S27, and step S28, which are described in detail below.
In step S21, the tables in the service library that are to be extracted into the data warehouse are identified according to the service requirement.
In step S22, the structure of each table to be warehoused is analyzed, and an extraction strategy and a loading strategy are formulated for the table.
In step S23, the resulting extraction configuration (service library table name, extraction strategy, loading strategy, etc.) is filled into the configuration template to generate an extraction configuration file, which completes the extraction configuration.
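A minimal sketch of filling in such a configuration template is given below. The embodiment does not specify the file format or the allowed strategy values, so the JSON layout, the field names, and the strategy names `full`, `incremental`, `overwrite`, and `append` are all assumptions made for illustration.

```python
import json

# Hypothetical strategy names; the source does not enumerate the allowed values.
EXTRACTION_STRATEGIES = {"full", "incremental"}
LOADING_STRATEGIES = {"overwrite", "append"}


def make_entry(source_table, extraction_strategy, loading_strategy):
    """Validate one row of the template and return it as a dict."""
    if extraction_strategy not in EXTRACTION_STRATEGIES:
        raise ValueError(f"unknown extraction strategy: {extraction_strategy}")
    if loading_strategy not in LOADING_STRATEGIES:
        raise ValueError(f"unknown loading strategy: {loading_strategy}")
    return {
        "source_table": source_table,
        "extraction_strategy": extraction_strategy,
        "loading_strategy": loading_strategy,
    }


def render_config(entries):
    """Serialize the filled-in template as the text of a configuration file."""
    return json.dumps({"tables": entries}, indent=2)


entries = [
    make_entry("crm.customer", "full", "overwrite"),
    make_entry("crm.orders", "incremental", "append"),
]
config_text = render_config(entries)
print(config_text)
```

Validating the strategy values when the template is filled in means that a mistyped value is rejected before the file is uploaded, rather than at extraction time.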
In step S24, after the extraction preparation is completed, the extraction configuration file is uploaded on the foreground page of the big data platform management system, which completes the configuration warehousing.
In step S25, a warehousing resource is automatically created.
Specifically, in some embodiments, after the extraction configuration has been warehoused, the user clicks "preparation for storage" on a foreground page of the big data platform management system, and the system automatically creates the warehousing resources in the background through the following steps:
(1) automatically acquiring the data structure of the source table through a data detection technology according to the extraction configuration file, and creating metadata information;
(2) automatically creating the dependency relationship between the buffer layer table and the base layer table.
Through the steps, the big data platform management system automatically creates the metadata and the dependency relationship in the big data platform without manual processing, and the working efficiency is greatly improved.
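The two background steps can be sketched as follows. SQLite and its `PRAGMA table_info` statement are used here only as a stand-in for the "data detection technology", which the embodiment does not describe further; the `buf_` and `base_` table name prefixes are likewise hypothetical.

```python
import sqlite3


def read_metadata(conn, table):
    """Read the field list of a source table from the database catalog.

    PRAGMA table_info returns one row per field:
    (cid, name, type, notnull, dflt_value, pk).
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"name": row[1], "type": row[2], "primary_key": bool(row[5])}
        for row in rows
    ]


def layer_dependency(table):
    """Record that the base-layer table depends on the buffer-layer table."""
    return {f"base_{table}": {f"buf_{table}"}}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, birth_date TEXT)"
)

metadata = read_metadata(conn, "customer")
dependency = layer_dependency("customer")
print(metadata)
print(dependency)
```

Other databases expose the same information through their own catalogs (for example `information_schema.columns`), so only `read_metadata` would need to change per data source.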
In step S26, it is determined whether or not the standardized configuration is required, and if necessary, step S27 is performed, and if not, step S28 is performed.
In step S27, date or code value standardization is configured. For example, if a field in the table requires conversion of a date string to a date data type, a date standardization format string can be configured for that field in the initialization module (for example, the date September 26, 2021, written as 2021-09-26, can be described by the format string "@yyyy@-@MM@-@dd@"). Likewise, if a field in the table requires code value standardization, the code value type and the conversion scheme of the field can be configured in the initialization module. After the configuration is completed, the system automatically converts the date format and the code values when data is extracted, that is, when data is extracted from the source table to the physical table of the buffer layer.
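The date standardization can be illustrated with a short sketch. It assumes that the format string has the form "@yyyy@-@MM@-@dd@", with each token enclosed in a pair of @ characters, and it translates the tokens into Python `strptime` directives. The token table is an assumption; a real implementation would instead emit the date format syntax of the data warehouse SQL dialect.

```python
import re
from datetime import date, datetime

# Assumed token set; the source only shows the year, month and day tokens.
TOKEN_TO_DIRECTIVE = {
    "yyyy": "%Y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
}


def to_strptime_format(pattern):
    """Translate an @token@ format string into a strptime format string."""

    def replace(match):
        token = match.group(1)
        if token not in TOKEN_TO_DIRECTIVE:
            raise ValueError(f"unsupported date token: {token}")
        return TOKEN_TO_DIRECTIVE[token]

    return re.sub(r"@([A-Za-z]+)@", replace, pattern)


def parse_date(text, pattern):
    """Convert a date string to a date value using the configured pattern."""
    return datetime.strptime(text, to_strptime_format(pattern)).date()


print(to_strptime_format("@yyyy@-@MM@-@dd@"))
print(parse_date("2021-09-26", "@yyyy@-@MM@-@dd@"))
```

Text outside the @ pairs, such as the hyphens, is passed through unchanged, so the same function also handles patterns like "@yyyy@@MM@@dd@" for compact date strings.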
In step S28, data is initialized. In some embodiments, after steps S21-S27 are completed, the user clicks "data initialization" in the foreground, which completes the initialization work before data extraction.
Thereafter, data can be extracted from the source table to the physical table of the data warehouse buffer layer. The big data platform management system performs the following work in the background: (1) automatically creating the physical tables of the buffer layer and the base layer according to the metadata information; (2) reading the full data of the source table and storing it in the buffer layer; and (3) loading the data of the buffer layer to the base layer. After these three steps, the initialization of the extraction configuration and of data loading from the source table to the physical table of the buffer layer is complete. Subsequently, the scheduling module only needs to generate the schedule automatically and run the normal batch every day to ensure that data is loaded automatically from the source table to the physical table of the buffer layer.
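The background work of this initialization can be sketched end to end with an in-memory SQLite database. Everything in one database is a simplification for illustration: in the embodiment the source table lives in the service library and the buffer and base layers live in the data warehouse. The `buf_` and `base_` prefixes and the function names are hypothetical.

```python
import sqlite3


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from (field_name, field_type) metadata."""
    column_sql = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} ({column_sql})"


def initialize(conn, source, columns):
    """Create both physical tables, then copy source -> buffer -> base."""
    buffer_table, base_table = f"buf_{source}", f"base_{source}"
    names = ", ".join(name for name, _ in columns)
    # Step 1: create the buffer-layer and base-layer physical tables.
    conn.execute(create_table_sql(buffer_table, columns))
    conn.execute(create_table_sql(base_table, columns))
    # Step 2: read the full data of the source table into the buffer layer.
    conn.execute(f"INSERT INTO {buffer_table} ({names}) SELECT {names} FROM {source}")
    # Step 3: load the buffer-layer data into the base layer.
    conn.execute(f"INSERT INTO {base_table} ({names}) SELECT {names} FROM {buffer_table}")
    return buffer_table, base_table


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

columns = [("id", "INTEGER"), ("name", "TEXT")]
buffer_table, base_table = initialize(conn, "customer", columns)
rows = conn.execute(f"SELECT id, name FROM {base_table} ORDER BY id").fetchall()
print(rows)
```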
FIG. 3 is a flow chart of a big data platform management method according to an embodiment of the invention.
As shown in fig. 3, in an embodiment of the present invention, the big data platform management method may include: step S31, step S32, step S33, and step S34, which are described in detail below.
In step S31, an extraction configuration file and a standardized configuration are acquired. The standardized configuration includes, but is not limited to, a date standardized configuration and a code value standardized configuration.
In one embodiment, the extraction configuration file is obtained by: determining a service table to be extracted into a data warehouse according to service requirements, and determining an extraction strategy and a loading strategy according to the table structure of the service table, wherein the service table name, the extraction strategy and the loading strategy of the service table form the extraction configuration information; and filling the extraction configuration information into a configuration template to generate the extraction configuration file.
In a further embodiment, the data structure of the source table is automatically obtained according to the extraction configuration file and metadata information is created; the physical table of the buffer layer, the physical table of the base layer, and the dependency relationship between them are then automatically created according to the metadata information.
By filling a small amount of determined configuration information into a configuration template, an extraction configuration file for data extraction is obtained. The data structure of the source table is then automatically acquired according to the extraction configuration file, and the metadata information, the physical table of the buffer layer, the physical table of the base layer, and the dependency relationship between the two physical tables are automatically created. This minimizes the amount of manual configuration, greatly improves the working efficiency of the data extraction stage, and avoids the high error rate caused by a large amount of manual configuration.
In step S32, the full data or the incremental data of the source table is read based on the extraction configuration file and stored in the physical table of the buffer layer.
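The difference between full and incremental reading can be sketched as follows. The embodiment does not state how incremental data is identified; this sketch assumes a watermark field (here `updated_at`) whose last extracted value is remembered between runs. SQLite is used as a stand-in, and all table, field, and function names are hypothetical.

```python
import sqlite3


def extract(conn, source, buffer_table, strategy,
            watermark_field=None, last_value=None):
    """Copy source rows into the buffer-layer table and return its row count.

    "full" replaces the buffer contents with the whole source table;
    "incremental" appends only rows whose watermark exceeds last_value.
    """
    if strategy == "full":
        conn.execute(f"DELETE FROM {buffer_table}")
        conn.execute(f"INSERT INTO {buffer_table} SELECT * FROM {source}")
    elif strategy == "incremental":
        conn.execute(
            f"INSERT INTO {buffer_table} SELECT * FROM {source} "
            f"WHERE {watermark_field} > ?",
            (last_value,),
        )
    else:
        raise ValueError(f"unknown extraction strategy: {strategy}")
    return conn.execute(f"SELECT COUNT(*) FROM {buffer_table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.execute("CREATE TABLE buf_orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2021-09-24"), (2, "2021-09-25"), (3, "2021-09-26")],
)

full_count = extract(conn, "orders", "buf_orders", "full")

# A new row arrives in the source; only that row is picked up incrementally.
conn.execute("INSERT INTO orders VALUES (4, '2021-09-27')")
incremental_count = extract(
    conn, "orders", "buf_orders", "incremental", "updated_at", "2021-09-26"
)
print(full_count, incremental_count)
```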
In step S33, when the data of the buffer layer is loaded to the base layer, date formatting and/or code value conversion is automatically performed according to the standardized configuration.
In one embodiment, automatically performing date formatting according to the standardized configuration includes: in the insert script that loads the data of the buffer layer into the base layer, automatically converting the configured date format string into the SQL (structured query language) date format string of the data warehouse according to the date standardized configuration. Automatically converting code values according to the standardized configuration includes: in the insert script that loads the data of the buffer layer into the base layer, automatically generating a join statement that associates the data with a code value mapping table according to the code value standardized configuration, and obtaining the converted code value from the code value mapping table.
With code values managed centrally and the date standardization configured in advance, the data conversion module automatically converts the configured date format string into the SQL date format string of the data warehouse while the data of the buffer layer is loaded to the base layer, and obtains the converted code values according to the code value standardized configuration. Date formatting conversion and code value standard conversion of the data from the buffer layer to the base layer are thus achieved efficiently.
In a further embodiment, code values are centrally managed through the code value mapping table and the code value standardized configuration, and custom code value conversion rules are supported. Users can bind a code value conversion rule to a code value field, so that the rule can be conveniently replaced at any time. The code value conversion rule takes effect automatically while data is loaded from the source table to the physical table of the buffer layer, which realizes standard conversion of the code values. Meanwhile, to address the common problem of date standardization in data conversion, a date format string can be customized for each field, so that the predefined date format string is applied automatically while data is loaded from the source table to the physical table of the buffer layer, which realizes the date formatting conversion.
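The generation of such an insert script can be sketched as a small function that builds the SQL text. `TO_DATE` is used only as a placeholder for the date conversion function of the data warehouse, which varies by SQL dialect, and the mapping table name `code_value_map` with its fields `code_type`, `source_value`, and `target_value` is a hypothetical layout not taken from the embodiment.

```python
def build_insert_sql(base_table, buffer_table, columns, date_formats, code_types,
                     mapping_table="code_value_map"):
    """Build an INSERT ... SELECT script from the standardized configuration.

    date_formats maps a field name to the warehouse date format string;
    code_types maps a field name to its code value type.
    """
    select_parts, joins = [], []
    for column in columns:
        if column in date_formats:
            # Date standardization: wrap the field in a date conversion call.
            select_parts.append(
                f"TO_DATE(b.{column}, '{date_formats[column]}') AS {column}"
            )
        elif column in code_types:
            # Code value standardization: look the value up in the mapping table.
            alias = f"m_{column}"
            select_parts.append(f"{alias}.target_value AS {column}")
            joins.append(
                f"LEFT JOIN {mapping_table} {alias} "
                f"ON {alias}.code_type = '{code_types[column]}' "
                f"AND {alias}.source_value = b.{column}"
            )
        else:
            select_parts.append(f"b.{column}")
    lines = [
        f"INSERT INTO {base_table} ({', '.join(columns)})",
        "SELECT " + ", ".join(select_parts),
        f"FROM {buffer_table} b",
        *joins,
    ]
    return "\n".join(lines)


sql = build_insert_sql(
    "base_customer",
    "buf_customer",
    ["id", "birth_date", "gender"],
    date_formats={"birth_date": "YYYY-MM-DD"},
    code_types={"gender": "GENDER"},
)
print(sql)
```

A `LEFT JOIN` is chosen in this sketch so that rows whose code value has no mapping are still loaded, with a null converted value; whether unmapped values should instead be rejected is a design decision the embodiment leaves open.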
In step S34, the data of the buffer layer is automatically loaded to the physical table of the base layer according to the loading script.
In one embodiment, automatically loading the data of the buffer layer to the physical table of the base layer according to a loading script comprises: generating an automatic loading script in real time according to the metadata information, the extraction strategy, the loading strategy and the standardized configuration, and calling the automatic loading script to automatically load the data of the buffer layer to the physical table of the base layer; and deleting the automatic loading script after it has been backed up.
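The life cycle of the automatic loading script (generate, call, back up, delete) can be sketched as follows. The file name `auto_load.sql`, the directory layout, and the `execute` callback that stands in for the warehouse client are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_generated_script(sql, work_dir, backup_dir, execute):
    """Write the generated script, run it, back it up, then delete it.

    `execute` is any callable that accepts the SQL text; it stands in for
    whatever client actually runs the script against the data warehouse.
    """
    work_dir, backup_dir = Path(work_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    script = work_dir / "auto_load.sql"
    script.write_text(sql, encoding="utf-8")
    execute(script.read_text(encoding="utf-8"))
    backup = backup_dir / script.name
    shutil.copy2(script, backup)  # keep a copy for auditing
    script.unlink()               # the working copy is regenerated on every run
    return backup


executed = []
with tempfile.TemporaryDirectory() as root:
    work = Path(root, "work")
    work.mkdir()
    backup = run_generated_script(
        "INSERT INTO base_t SELECT * FROM buf_t",
        work,
        Path(root, "backup"),
        executed.append,
    )
    backup_exists = backup.is_file()
    script_removed = not (work / "auto_load.sql").exists()

print(executed, backup_exists, script_removed)
```

Because the working copy is deleted after every run, the next run always regenerates the script from the current metadata, which is what lets the loading follow structure changes without manual edits.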
In another embodiment, automatically loading the data of the buffer layer to the physical table of the base layer according to the loading script comprises: automatically detecting whether a manual loading script exists; if so, preferentially calling the manual loading script to load the data of the buffer layer to the physical table of the base layer; and if the manual loading script does not exist, calling the generated automatic loading script to load the data of the buffer layer to the physical table of the base layer.
Because the automatic loading scripts are generated automatically, a large number of loading scripts no longer have to be written by hand, which greatly improves the working efficiency of data loading. Moreover, when the data structure of the source table changes, the metadata information changes accordingly, and the automatic loading script is regenerated in real time from the metadata information, the configuration information and the like. The loading script therefore does not need to be modified or adjusted manually, normal loading of data is ensured, and maintenance cost is saved. In addition, manually written loading scripts are also supported to meet personalized needs.
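The preference for a manual loading script over the generated one can be sketched as a simple lookup. The convention that a manual script is a file named `<table>.sql` in a dedicated directory is an assumption made for illustration, as are the function names.

```python
import tempfile
from pathlib import Path


def generate_script(table):
    """Stand-in for the automatic loading script generator."""
    return f"INSERT INTO base_{table} SELECT * FROM buf_{table}"


def resolve_loading_script(table, manual_dir, generate):
    """Return (kind, sql), preferring a manual script when one exists."""
    manual = Path(manual_dir) / f"{table}.sql"
    if manual.is_file():
        return "manual", manual.read_text(encoding="utf-8")
    return "auto", generate(table)


with tempfile.TemporaryDirectory() as manual_dir:
    # Only the "orders" table has a hand-written script.
    Path(manual_dir, "orders.sql").write_text(
        "INSERT INTO base_orders SELECT id FROM buf_orders", encoding="utf-8"
    )
    manual_choice = resolve_loading_script("orders", manual_dir, generate_script)
    auto_choice = resolve_loading_script("customer", manual_dir, generate_script)

print(manual_choice[0], auto_choice[0])
```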
By adopting the method of the embodiment of the invention, the data of the source table is automatically obtained according to the extraction configuration file, the date formatting and the code value conversion are automatically carried out, and the data of the buffer layer is automatically loaded to the physical table of the base layer without a large amount of manual configuration, so that the labor cost and the error rate are reduced and the working efficiency of the ETL is greatly improved.
In further embodiments, the table structure of the physical table of the buffer layer of the big data platform, the field mapping relationship between the source table and the physical table of the buffer layer, and the standardized configuration are managed.
Optionally, the metadata of a table is initially created automatically when its data extraction configuration goes online, and the user may also modify the metadata configuration (e.g., field type, date formatting, standardized transcoding, etc.) in the foreground. If fields are added to or deleted from the upstream service table, the buffer layer table and the base layer table must be modified accordingly. To facilitate this synchronous modification of the big data platform, a metadata detection function can automatically compare the structure of the source table with that of the physical table of the buffer layer and prompt the user as to which fields need to be added or deleted. Through this automatic detection function, the database tables automatically adapt to changes in the data structure of the business database, which improves the maintainability of the big data platform.
In a further embodiment, a scheduling script is automatically generated according to the dependency relationship between the source table and the physical table of the buffer layer, the dependency relationship between the physical table of the buffer layer and the physical table of the base layer, and the dependency relationship between the data processing tasks, and task scheduling is performed according to the scheduling script.
Optionally, the upstream and downstream dependency relationship of the extraction scheduling, that is, the dependency relationship between the source table and the physical table of the buffer layer, is automatically generated; the SQL script semantics of each data processing task are automatically analyzed to generate the lineage dependency relationship among the data processing tasks; a dependency tree is then automatically created from the dependency between the source table and the physical table of the buffer layer, the dependency between the physical table of the buffer layer and the physical table of the base layer, and the lineage dependency among the data processing tasks; and a scheduling script is automatically generated from the dependency tree and deployed to a scheduling platform, so that a large number of complex scheduling tasks are handled efficiently.
Therefore, the big data platform management method provided by the embodiment of the invention realizes one-stop ETL data acquisition, conversion, loading and scheduling. ETL work in the construction of the big data platform becomes simple and efficient, no manual management is needed, and the scheduling is generated automatically. The method is reliable and easy to use, greatly improves the working efficiency of ETL, makes ETL work on the scale of thousands of tables easy to manage, saves human resources, and improves working accuracy.
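How lineage dependencies between data processing tasks might be derived from SQL scripts is sketched below. The regular expressions are a deliberately naive stand-in for the semantic analysis mentioned in the embodiment: they ignore subqueries, common table expressions, comments, and quoted identifiers, and a production system would use a proper SQL parser. Task and table names are hypothetical.

```python
import re

# Naive patterns: the table after INSERT INTO is written, tables after FROM/JOIN are read.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Map each table written by the script to the set of tables it reads."""
    targets = set(TARGET_RE.findall(sql))
    sources = set(SOURCE_RE.findall(sql)) - targets
    return {target: sources for target in targets}


def task_dependencies(scripts):
    """Derive {task: upstream_tasks} from {task: sql_text}.

    A task depends on another task when it reads a table that the other writes.
    """
    writer_of, reads = {}, {}
    for task, sql in scripts.items():
        lineage = table_lineage(sql)
        for target in lineage:
            writer_of[target] = task
        reads[task] = set().union(*lineage.values())
    return {
        task: {writer_of[t] for t in tables if t in writer_of and writer_of[t] != task}
        for task, tables in reads.items()
    }


scripts = {
    "load_sales": (
        "INSERT INTO rpt_sales SELECT o.id, c.name "
        "FROM base_orders o JOIN base_customer c ON o.cid = c.id"
    ),
    "load_summary": "INSERT INTO rpt_summary SELECT COUNT(*) FROM rpt_sales",
}
dependencies = task_dependencies(scripts)
print(dependencies)
```

The resulting task-level map has the same shape as the table-level dependencies, so the two can be merged into a single dependency tree before a scheduling order is computed.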
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software and a hardware platform. With this understanding, all or part of the technical solutions of the present invention that contribute over the prior art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and which includes instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments.
Correspondingly, the embodiment of the invention also provides a computer readable storage medium on which computer readable instructions or a program are stored. When the computer readable instructions or the program are executed by a processor, the computer is caused to execute operations that include the steps of the big data platform management method according to any of the above embodiments, which are not described again here. The storage medium may include, for example, optical disks, hard disks, floppy disks, flash memory, magnetic tape, etc.
In addition, the embodiment of the present invention also provides a computer device including a memory and a processor, where the memory is used for storing one or more computer readable instructions or programs, and when the one or more computer readable instructions or programs are executed by the processor, the big data platform management method according to any one of the above embodiments can be implemented. The computer device may be, for example, a server, a desktop computer, a notebook computer, a tablet computer, or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.