CN117591611A - Classification management method for table and function metadata based on Flink data directory - Google Patents
- Publication number
- CN117591611A (application number CN202311735752.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for the classified management of table and function metadata based on a Flink data catalog, which aims to classify and manage effectively the table and function metadata stored in the Hive database corresponding to the Flink data catalog. The method comprises: connecting a third-party data source; creating a data catalog of type Hive, each catalog corresponding one-to-one to a Hive database; creating a Flink SQL task and defining Flink tables and Flink functions in it; running the Flink SQL task so that the table and function metadata are stored in the Hive database corresponding to the data catalog; performing full and incremental acquisition of metadata from the Hive database; parsing and classifying the metadata and saving it to a storage database; and providing classified queries and historical-version queries of the metadata. The invention improves the classification and visibility of data catalog metadata, strengthens the security and personalized management of data catalogs, improves query functions and history tracking, and introduces an automated metadata change management mechanism.
Description
Technical Field
The invention belongs to the technical field of data processing and management, and particularly relates to a table and function metadata classification management method based on a Flink data directory.
Background
In modern data processing and analysis, data engineers and analysts often need to process both stream data and batch data. This gives rise to stream-batch unification, that is, integrating stream data and batch data into one platform or system. The Flink Hive Catalog is particularly important in a unified stream-batch environment because it serves as the key metadata management tool that allows Flink applications to access the metadata of tables and functions in a Hive database. As unified stream-batch processing has become widespread, users want to create a data catalog and, through the Flink Hive Catalog, register the metadata of different Flink tables and Flink user-defined functions in a Hive database, so that the metadata of these tables and functions can be managed and accessed in one unified Hive database.
However, the prior art has the following limitations: (1) The Hive database stores not only the metadata of Hive tables and Hive user-defined functions, but also the metadata of Flink tables that users register in it through the Flink Hive Catalog. Flink tables have different table types depending on the type of their data source, so the table and function metadata stored in the Hive database becomes disordered, and users cannot clearly view and distinguish the metadata of the different types of tables and functions in the Hive database corresponding to a data catalog. (2) Users cannot customize the visibility of the table and function metadata in the Hive database corresponding to a data catalog, nor delete their own metadata. (3) Users cannot access historical versions of table and function metadata, so no more comprehensive view of the data is available to them. (4) Users cannot learn in time that the metadata of a table or function has been changed or deleted, and therefore cannot respond quickly and take appropriate measures.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a method for the classified management of table and function metadata based on a Flink data catalog. The first objective is to solve the problem that users cannot clearly view and distinguish the metadata of different types of tables and functions in the Hive database corresponding to a data catalog. The second objective is to solve the problem that users cannot customize the visibility of the table and function metadata in that Hive database and cannot delete their own metadata. The third objective is to solve the problem that users cannot access historical versions of table and function metadata and therefore lack a more comprehensive view of the data. The fourth objective is to solve the problem that users cannot learn in time that the metadata of a table or function has been changed or deleted, and therefore cannot respond quickly and take appropriate measures.
The invention adopts the following technical solution to solve the technical problems:
A method for the classified management of table and function metadata based on a Flink data catalog, characterized by comprising the following steps:
step one, connecting a third-party data source;
step two, creating a data catalog whose catalog type is Hive and which corresponds one-to-one to a Hive database;
step three, creating a Flink SQL task in which the user, using the data catalog, defines Flink tables and Flink functions by writing SQL creation statements;
step four, running the Flink SQL task, whereby the table and function metadata are stored in the Hive database corresponding to the data catalog;
step five, performing full acquisition and incremental acquisition of metadata from the Hive database;
step six, parsing and classifying the metadata from the Hive database;
step seven, saving the parsed and classified metadata to a storage database;
step eight, providing classified queries and historical-version queries of the metadata.
Further, the full acquisition and incremental acquisition of metadata from the Hive database in step five proceed as follows:
1) Introducing a permission control mechanism for the data catalog, which not only allows the visibility of the catalog's tables and functions to be customized, but also grants different users and organizations permission to add and deregister their own metadata;
2) Metadata acquisition: obtaining the full metadata of the tables and functions in the Hive databases corresponding to all data catalogs through the Flink Hive Catalog API, and obtaining the incremental metadata of the tables and functions in those Hive databases through a custom Hive Hook program.
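The permission control mechanism in item 1) could be sketched as follows. The text does not specify a data model, so everything here is an assumption made for illustration: the `CatalogEntry` class, its `owner` and `visible_to` fields, and the rule that only the owner may deregister an entry.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One table or function registered in a data catalog (hypothetical model)."""
    name: str
    owner: str                                     # user or organization that registered it
    visible_to: set = field(default_factory=set)   # principals allowed to see the entry
    registered: bool = True


def can_view(entry, principal):
    """The owner always sees its own entry; others need explicit visibility."""
    return entry.registered and (principal == entry.owner or principal in entry.visible_to)


def deregister(entry, principal):
    """Assumed rule: only the owner may deregister its own metadata."""
    if principal != entry.owner:
        raise PermissionError(f"{principal} does not own {entry.name}")
    entry.registered = False


orders = CatalogEntry(name="orders", owner="team_a", visible_to={"team_b"})
```

In this sketch visibility is an explicit allow-list per entry, which keeps the check a single set lookup; a real platform would more likely resolve users to organizations and roles first.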
Further, in item 2) of step five, the full metadata of the tables and functions in the Hive databases corresponding to all data catalogs is obtained through the Flink Hive Catalog API as follows:
i. Obtaining the names of the Hive databases corresponding to all data catalogs created in the system;
ii. Establishing a connection to the Flink Hive Catalog, using the Hive database name and the configuration file path of the data catalog;
iii. Obtaining metadata: once the connection is established, Flink sends requests to Hive through the Flink Hive Catalog API to obtain the metadata of the tables and functions.
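The three steps above can be sketched as a single pass over all catalogs. The actual Flink Hive Catalog API is a Java interface, so this sketch uses a hypothetical in-memory stand-in, `FakeHiveCatalog`, whose method names are only loosely modeled on the `listTables`, `getTable`, `listFunctions`, and `getFunction` methods of Flink's `Catalog` interface. The database name, configuration path, and sample entries are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHiveCatalog:
    """Hypothetical in-memory stand-in for a catalog connection (not the Flink API)."""
    database: str
    conf_dir: str
    tables: dict = field(default_factory=dict)      # table name -> properties
    functions: dict = field(default_factory=dict)   # function name -> properties

    def list_tables(self):
        return sorted(self.tables)

    def get_table(self, name):
        return self.tables[name]

    def list_functions(self):
        return sorted(self.functions)

    def get_function(self, name):
        return self.functions[name]


def acquire_full_metadata(catalogs):
    """Read every table and function of every catalog's Hive database once."""
    snapshot = []
    for catalog in catalogs:
        for name in catalog.list_tables():
            snapshot.append(("table", catalog.database, name, catalog.get_table(name)))
        for name in catalog.list_functions():
            snapshot.append(("function", catalog.database, name, catalog.get_function(name)))
    return snapshot


demo = FakeHiveCatalog(
    database="sales_db",
    conf_dir="/path/to/hive-conf",  # hypothetical configuration directory
    tables={"orders": {"connector": "kafka"}},
    functions={"mask_phone": {"class_name": "com.example.MaskPhone"}},
)
full_snapshot = acquire_full_metadata([demo])
```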
Further, in item 2) of step five, the incremental metadata of the tables and functions in the Hive databases corresponding to all data catalogs is obtained through the custom Hive Hook program as follows:
i. Monitoring table and function events in Hive, namely creating a table, deleting a table, creating a function, and deleting a function;
ii. When a change occurs, the Hive Hook captures it immediately and triggers the metadata classification and metadata synchronization flows.
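The monitoring behaviour described above can be sketched as a small dispatcher. This is not Hive's hook API: `MetadataChangeHook`, the four event-type strings, and the `on_change` callback are hypothetical names chosen to mirror the four monitored events.

```python
class MetadataChangeHook:
    """Hypothetical hook that forwards the four monitored event kinds."""

    MONITORED = ("create_table", "drop_table", "create_function", "drop_function")

    def __init__(self, on_change):
        self._on_change = on_change  # called once for every captured change

    def handle(self, event_type, database, name, payload=None):
        if event_type not in self.MONITORED:
            return False  # events outside the four monitored kinds are ignored
        self._on_change({
            "event": event_type,
            "database": database,
            "name": name,
            "payload": payload or {},
        })
        return True


captured = []
hook = MetadataChangeHook(captured.append)
handled_create = hook.handle("create_table", "sales_db", "orders", {"flink.connector": "kafka"})
handled_alter = hook.handle("alter_table", "sales_db", "orders")
```

In the described system the callback would start the classification and synchronization flows; here it only appends to a list so the captured events can be inspected.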
Further, the full metadata and incremental metadata in item 2) of step five are obtained as follows:
A. Full acquisition obtains, through the Flink Hive Catalog API, all metadata under the Hive database name of the current data catalog, covering both tables and functions;
B. Incremental acquisition obtains only the changes since the last full acquisition, that is, newly added data and deletions;
In Flink, a Catalog is an interface for managing external data sources, and the Flink Hive Catalog is a concrete implementation that connects to Apache Hive. Through the Flink Hive Catalog, Flink can connect to the Hive metadata store, query the metadata of tables and functions in Hive, and allow Flink jobs to access and operate on this data directly;
Further, the full acquisition of link A in item 2) of step five occurs only once, after which metadata is obtained incrementally. The metadata obtained by links A and B comprises table metadata and function metadata. Table metadata comprises the table name, column names, data types, partition information, storage format, and connection information. Function metadata comprises the function name, the function class name, the path under which the function is registered in Hive, and the function category.
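The fields listed above could be carried in two record types such as the following. The class names, field names, and sample values are assumptions made for illustration; only the list of fields comes from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableMetadata:
    name: str
    columns: list                                    # (column name, data type) pairs
    partition_keys: list = field(default_factory=list)
    storage_format: Optional[str] = None
    connection: dict = field(default_factory=dict)   # connection (connector) properties


@dataclass
class FunctionMetadata:
    name: str
    class_name: str
    hive_path: Optional[str] = None    # path under which the function is registered in Hive
    category: Optional[str] = None     # filled in later by classification


orders = TableMetadata(
    name="orders",
    columns=[("order_id", "BIGINT"), ("amount", "DECIMAL(10, 2)")],
    partition_keys=["dt"],
    connection={"connector": "kafka"},
)
mask_phone = FunctionMetadata(name="mask_phone", class_name="com.example.MaskPhone")
```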
Further, the parsing and classification of metadata from the Hive database in step six proceed as follows:
1) When table metadata is obtained by full acquisition, the value of the connection parameter connector in the table metadata is parsed, and the table is classified according to that connector value;
2) When function metadata is obtained by full acquisition, the Hive storage address field in the function metadata is parsed and the function is classified by its value: if the value exists, the function is classified as a Hive function; otherwise it is classified as a Flink function;
3) During incremental acquisition, a newly created table invokes the onCreateTable(CreateTableEvent tableEvent) method, in which the Map<String, String> parameters object of the table object carried by the table event is parsed, and classification is completed according to the parameters object and its flink.connector property: if the parameters object does not exist, the table is classified as a Hive table; if it exists, the flink.connector property is read from it and the table is classified according to that value; if the flink.connector property cannot be obtained, the table is classified as a Hive table;
4) During incremental acquisition, a newly created function invokes the onCreateFunction(CreateFunctionEvent funcEvent) method, in which the properties of the function object carried by the function event are parsed and the function is classified by the value of its Hive storage address field: if the value exists, the function is classified as a Hive function; otherwise it is classified as a Flink function.
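The classification rules in items 1) to 4) reduce to two small decision functions, sketched below for the incremental case. The category labels (`"hive"`, `"hive_function"`, `"flink_function"`) and the use of the raw connector value as the table category are assumptions; the text only says that tables are classified according to the connector value.

```python
HIVE_TABLE = "hive"
HIVE_FUNCTION = "hive_function"
FLINK_FUNCTION = "flink_function"


def classify_table(parameters):
    """Classify a table from its parameter map (None when the map is absent)."""
    if not parameters:
        return HIVE_TABLE            # no parameters object: plain Hive table
    connector = parameters.get("flink.connector")
    if not connector:
        return HIVE_TABLE            # no flink.connector property: plain Hive table
    return connector                 # e.g. "kafka" or "jdbc" (assumed category label)


def classify_function(hive_storage_address):
    """A function with a Hive storage address is a Hive function, otherwise a Flink function."""
    return HIVE_FUNCTION if hive_storage_address else FLINK_FUNCTION
```

For example, `classify_table({"flink.connector": "kafka"})` yields `"kafka"`, while a table whose parameters hold only unrelated keys is treated as a Hive table. For full acquisition the same function would read the `connector` key instead.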
Further, in step seven, the parsed and classified metadata is saved to the storage database. When metadata is obtained by full acquisition, the process is as follows:
(1) The basic metadata of all tables is obtained and saved to the storage database; an is_delete field is added to the basic metadata and set to false, marking the metadata as newly added data;
(2) The basic metadata of all functions is obtained and saved to the storage database; an is_delete field is added to the basic metadata and set to false, marking the metadata as newly added data;
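Saving a full snapshot could look like the sketch below. The storage database is not named in the text, so SQLite is used here purely as a stand-in, and the `metadata` table layout is an assumption; only the `is_delete` field and its initial value of false come from the description.

```python
import sqlite3

SCHEMA = """
CREATE TABLE metadata (
    kind          TEXT NOT NULL,      -- 'table' or 'function'
    database_name TEXT NOT NULL,
    name          TEXT NOT NULL,
    category      TEXT NOT NULL,
    is_delete     INTEGER NOT NULL,   -- 0 = false (newly added), 1 = true (deleted)
    PRIMARY KEY (kind, database_name, name)
)
"""


def store_full_snapshot(conn, records):
    """Save every record from a full acquisition with is_delete set to false."""
    conn.executemany(
        "INSERT INTO metadata (kind, database_name, name, category, is_delete) "
        "VALUES (?, ?, ?, ?, 0)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_full_snapshot(conn, [
    ("table", "sales_db", "orders", "kafka"),
    ("function", "sales_db", "mask_phone", "flink_function"),
])
rows = conn.execute("SELECT name, is_delete FROM metadata ORDER BY name").fetchall()
```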
Further, in step seven, when metadata is obtained incrementally, the parsed and classified metadata is saved to the storage database as follows:
(1) When a table is created, the table-creation method of the custom Hive Hook program is triggered; the basic metadata of the table is obtained and saved to the storage database, an is_delete field is added to the basic metadata and set to false to mark the metadata as newly added data, and a message is pushed to the creator of the data catalog;
(2) When a function is created, the function-creation method of the custom Hive Hook program is triggered; the basic metadata of the function is obtained and saved to the storage database, an is_delete field is added to the basic metadata and set to false to mark the metadata as newly added data, and a message is pushed to the creator of the data catalog;
(3) When a table is deleted, the table-deletion method of the custom Hive Hook program is triggered; the basic metadata of the table is obtained, an is_delete field is added to it and set to true to mark the metadata as deleted, the record is looked up in the storage database by database name and table name and updated so that the historical data is kept in the storage database, and a message is pushed to the creator of the data catalog;
(4) When a function is deleted, the function-deletion method of the custom Hive Hook program is triggered; the basic metadata of the function is obtained, an is_delete field is added to it and set to true to mark the metadata as deleted, the record is looked up in the storage database by database name and function name and updated so that the historical data is kept in the storage database, and a message is pushed to the creator of the data catalog;
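The four incremental cases share one pattern: insert with `is_delete` false on creation, flip the flag on deletion, and notify the catalog creator either way. A minimal sketch follows, again with SQLite standing in for the unnamed storage database. The event dictionary layout, the `apply_change` function, and the `notify` callback are hypothetical.

```python
import sqlite3


def apply_change(conn, event, notify):
    """Apply one captured change to the storage database and notify the catalog creator."""
    key = (event["kind"], event["database"], event["name"])
    if event["action"] == "create":
        conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?, 0)",
            key + (event["category"],),
        )
    elif event["action"] == "delete":
        # Soft delete: the row stays as history, only the flag changes.
        conn.execute(
            "UPDATE metadata SET is_delete = 1 "
            "WHERE kind = ? AND database_name = ? AND name = ?",
            key,
        )
    else:
        raise ValueError(f"unsupported action: {event['action']}")
    conn.commit()
    notify(f"{event['action']} {event['kind']} {event['database']}.{event['name']}")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (kind TEXT, database_name TEXT, name TEXT, "
    "category TEXT, is_delete INTEGER, PRIMARY KEY (kind, database_name, name))"
)
messages = []
apply_change(conn, {"action": "create", "kind": "table", "database": "sales_db",
                    "name": "orders", "category": "kafka"}, messages.append)
apply_change(conn, {"action": "delete", "kind": "table", "database": "sales_db",
                    "name": "orders"}, messages.append)
row = conn.execute("SELECT category, is_delete FROM metadata").fetchone()
```

Keeping the deleted row rather than removing it is what makes the historical-version query possible later.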
Further, the classified query of metadata provided in step eight queries the metadata in the storage database by category, and the historical-version query of metadata provided in step eight queries the records in the storage database that are marked as deleted.
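Both queries are simple filters over the storage database, as the following sketch shows. SQLite and the `metadata` table layout are assumptions used for illustration; the filters themselves (by category for current data, by the deleted flag for history) follow the description.

```python
import sqlite3


def query_by_category(conn, category):
    """Classified query: current (not deleted) metadata of one category."""
    return conn.execute(
        "SELECT kind, database_name, name FROM metadata "
        "WHERE category = ? AND is_delete = 0 ORDER BY name",
        (category,),
    ).fetchall()


def query_history(conn):
    """Historical-version query: records marked as deleted."""
    return conn.execute(
        "SELECT kind, database_name, name, category FROM metadata "
        "WHERE is_delete = 1 ORDER BY name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (kind TEXT, database_name TEXT, name TEXT, "
    "category TEXT, is_delete INTEGER)"
)
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?)", [
    ("table", "sales_db", "orders", "kafka", 0),
    ("table", "sales_db", "old_orders", "kafka", 1),
    ("table", "sales_db", "users", "hive", 0),
])
current_kafka = query_by_category(conn, "kafka")
deleted = query_history(conn)
```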
Advantageous effects of the invention
1. Improved classification and visibility of data catalog metadata: the invention provides a mechanism for registering and classifying, in Hive, the metadata of the tables and functions of a Flink data catalog, so that different types of tables and functions can be managed at a fine granularity. This solves the prior-art problem that users cannot clearly view and distinguish different types of tables and functions. Users can easily identify and access metadata of a specific type, which improves the precision and convenience of metadata management.
2. Enhanced security and personalized management of the data catalog: the invention introduces a permission control mechanism for the data catalog, which allows the visibility of table and function metadata to be customized and grants different users and organizations permission to deregister their own metadata. This addresses the insufficient management flexibility of data catalogs in the prior art and strengthens their security and personalized management. By allowing users and organizations to deregister different types of table and function metadata, the invention reduces the risk of leaking data catalog metadata and supports compliant data management and a secure data environment.
3. Improved query functions and history tracking: by classifying the table and function metadata of the data catalog and storing it locally, the invention supports advanced query functions such as fuzzy search and multi-condition filtering, and adds the ability to view historical versions of metadata. This improves the flexibility and accuracy of user queries during data analysis. Access to historical versions also gives users a more comprehensive view, so that analysis can take the change history of the data into account rather than only its current state.
4. Automated metadata change management and notification: the invention introduces an automated metadata change management mechanism that handles metadata automatically when it is added or deleted and is equipped with a corresponding notification system. This addresses the lack of automated handling and timely notification in prior-art metadata management. Through automated change management and real-time notification, users and administrators learn promptly when any metadata is added or deleted and can respond quickly and take appropriate measures.
Drawings
FIG. 1 is a flowchart of the method for the classified management of table and function metadata based on a Flink data catalog according to the invention;
FIG. 2 is a flowchart of the metadata acquisition method within the metadata classification management method of the invention;
FIG. 3 is a flowchart of the full metadata acquisition method within the metadata acquisition method of the invention;
FIG. 4 is a flowchart of the incremental metadata acquisition method within the metadata acquisition method of the invention;
FIG. 5 is a flowchart of the metadata classification and parsing method of the invention;
FIG. 6 is a schematic diagram of the platform for the classified management of table and function metadata based on a Flink data catalog according to the invention.
Detailed Description
Design principle of the invention
1. Design difficulties of the invention: (1) Achieving accurate classified management. In existing Flink data catalogs, fine-grained classified management of different types of tables and functions is a challenge, especially when the metadata is complex and diverse. Metadata example 1: in an Excel workbook, the column headers are metadata and the sheets are also metadata, whereas the cell contents are data, not metadata. Metadata example 2: for a two-dimensional database table, the field names, field types, field lengths, table name, table description, database name, and database description are metadata, whereas the content stored in each field is data. Classified management of such metadata must be not only accurate but also able to adapt to changing data environments. (2) Real-time metadata capture, parsing, and classification. Capture, parsing, and classification are divided into full capture and incremental capture. For full capture, a dedicated program can read all metadata at once; the key difficulty is monitoring, capturing, parsing, and classifying in real time the metadata produced when a user submits a task. This requires capturing every subtle metadata change accurately and updating the metadata in real time, which is technically challenging.
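The distinction drawn in metadata example 2 can be made concrete with a toy database table. The table contents and the `split_metadata_and_data` helper are invented for illustration.

```python
table = {
    "database": "sales_db",
    "name": "orders",
    "description": "customer orders",
    "fields": [
        {"name": "order_id", "type": "BIGINT", "length": 8},
        {"name": "customer", "type": "VARCHAR", "length": 64},
    ],
    "rows": [(1, "alice"), (2, "bob")],
}


def split_metadata_and_data(table):
    """Everything that describes the table is metadata; the stored rows are data."""
    metadata = {key: value for key, value in table.items() if key != "rows"}
    return metadata, table["rows"]


metadata, data = split_metadata_and_data(table)
```

Only the `metadata` part is what the method acquires, classifies, and stores; the rows never leave the source system.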
2. Solution ideas. For the first difficulty: the invention introduces dynamic classification logic that adjusts to real-time changes in the metadata. The strategy takes the diversity and dynamics of the data into account and thereby achieves accurate and flexible classified management. For the second difficulty: the invention develops a dedicated Hive Hook program that monitors the tables and functions users register in the Hive database. Once a change occurs, the Hive Hook captures it immediately and triggers the real-time metadata update and the parsing and classification flow. This design ensures the timeliness and accuracy of the metadata and improves the speed and flexibility with which the whole system responds to changes.
Through this design and implementation, the invention addresses the technical problems of real-time metadata capture, parsing, and classification, and improves the timeliness and accuracy of table and function metadata management in the Flink data catalog.
3. Definitions of terms:
Flink: Apache Flink is an open-source stream processing and batch processing framework for large-scale data streams and batch data. It is a distributed data processing engine designed for real-time stream processing, event-driven applications, and batch jobs.
Flink Hive Catalog: the Flink Hive Catalog (also called HiveCatalog) is the component of Apache Flink that integrates Apache Hive. It manages and accesses the metadata of Hive tables, Hive functions, Flink tables, and Flink user-defined functions stored in a Hive database, and lets Flink users access and operate on that metadata easily in their stream processing and batch applications.
Flink Hive Catalog API: the programming interface provided by the Flink Hive Catalog. Through this API, a developer's application can query the metadata of the Hive database corresponding to a data catalog, including table metadata and function metadata.
Flink SQL: Flink SQL is the query language and programming interface in Apache Flink for running SQL queries and data operations in Flink stream processing and batch applications. It allows users to query, transform, and process real-time data streams and batch data with standard SQL syntax, without working directly with the underlying Flink programming model.
4. The invention is a classification management method that runs on a service platform for the classified management of table and function metadata based on a Flink data catalog. As shown in FIG. 6, the service platform comprises a third-party data source connection module, a data catalog creation module, a Flink SQL task creation module, a Flink SQL task running module, a metadata acquisition module, a metadata classification and parsing module, a metadata synchronous storage module, and a classified query and historical-version query module. The third-party data source connection module connects third-party data sources in the service platform and stores the connection information. The data catalog creation module creates, in the service platform, data catalogs of catalog type Hive, each corresponding one-to-one to a Hive database. The Flink SQL task creation module uses a Hive-type data catalog to create a Flink SQL task in the service platform, in which the user defines table metadata and function metadata by writing SQL creation statements. The Flink SQL task running module runs the Flink SQL tasks defined in the service platform in the Flink runtime environment; during execution, Flink stores the table and function metadata in the Hive database corresponding to the data catalog through the Flink Hive Catalog API. The method is characterized in that:
The metadata acquisition module obtains the full metadata and incremental metadata of the Hive database in the service platform. The metadata classification and parsing module parses and classifies the obtained full and incremental metadata in the service platform. The metadata synchronous storage module synchronously saves the parsed and classified metadata to a local storage database in the service platform. The classified query and historical-version query module performs classified queries and historical data queries on the metadata in the local storage database in the service platform.
Further, the metadata acquisition module comprises a full acquisition module and an incremental acquisition module for the metadata of the Hive database. Full acquisition of the Hive database metadata is performed only once; incremental acquisition then obtains, on top of the full acquisition, only the data that has been added and the changes caused by deletions.
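The "full once, then incremental" rule can be sketched as a small coordinator. `MetadataAcquirer`, its method names, and the sample data are hypothetical; the sketch only illustrates the ordering constraint described above.

```python
class MetadataAcquirer:
    """Hypothetical coordinator: one full load, then incremental changes only."""

    def __init__(self, load_full):
        self._load_full = load_full     # callable returning {name: metadata}
        self._full_done = False
        self.current = {}

    def run_full(self):
        if self._full_done:
            return False                # full acquisition happens only once
        self.current = dict(self._load_full())
        self._full_done = True
        return True

    def apply_increment(self, action, name, metadata=None):
        if not self._full_done:
            raise RuntimeError("run the full acquisition first")
        if action == "add":
            self.current[name] = metadata
        elif action == "delete":
            self.current.pop(name, None)
        else:
            raise ValueError(f"unsupported action: {action}")


acquirer = MetadataAcquirer(lambda: {"orders": {"connector": "kafka"}})
first = acquirer.run_full()
second = acquirer.run_full()
acquirer.apply_increment("add", "users", {"connector": "jdbc"})
acquirer.apply_increment("delete", "orders")
```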
Further, the full acquisition module for the metadata of the Hive database comprises a module for obtaining the names of all Hive databases, a module for establishing the connection, a module for obtaining the full metadata of all tables of the Hive databases, and a module for obtaining the full metadata of all functions of the Hive databases.
Further, the incremental acquisition module for the metadata of the Hive database comprises an event monitoring module, an event change capture module, and a module that triggers metadata classification and metadata synchronization. The event monitoring module monitors the events of creating a table, deleting a table, creating a function, and deleting a function. The classification and metadata synchronization module parses and classifies the captured newly added metadata and saves the parsed and classified metadata to the local storage database.
Further, the metadata classification and parsing module comprises a sub-module for parsing and classifying table metadata obtained by full acquisition, a sub-module for parsing and classifying function metadata obtained by full acquisition, a sub-module for parsing and classifying table metadata obtained incrementally, and a sub-module for parsing and classifying function metadata obtained incrementally. When table metadata is obtained by full acquisition, the value of the connection parameter connector in the table metadata is parsed and the table is classified according to that connector value. When function metadata is obtained by full acquisition, the Hive storage address field in the function metadata is parsed and the function is classified by its value: if the value exists, the function is classified as a Hive function; otherwise it is classified as a Flink function. During incremental acquisition, a newly created table invokes the onCreateTable(CreateTableEvent tableEvent) method, in which the Map<String, String> parameters object of the table object carried by the table event passed to the method is parsed, and classification is completed according to the parameters object and its flink.connector property: if the parameters object does not exist, the table is classified as a Hive table; if it exists, the flink.connector property is read from it and the table is classified according to that value; if the flink.connector property cannot be obtained, the table is classified as a Hive table. During incremental acquisition, a newly created function invokes the onCreateFunction(CreateFunctionEvent funcEvent) method, in which the properties of the function object carried by the function event passed to the method are parsed and the function is classified by the value of its Hive storage address field: if the value exists, the function is classified as a Hive function; otherwise it is classified as a Flink function.
Further, the metadata synchronization and storage module comprises a synchronization module for metadata acquired in full and a synchronization module for metadata acquired incrementally; the synchronization module for metadata acquired in full marks the metadata as newly added data; the synchronization module for metadata acquired incrementally marks the metadata as newly added data when data is added and as deleted data when data is deleted.
Further, the metadata classified query and historical version query module comprises a module for classified query of newly added data and a module for classified query of historical data; the former displays query results for metadata that is marked as newly added data, according to the category assigned to the metadata; the latter displays query results for metadata that is marked as deleted data, according to the category assigned to the metadata.
Based on the principle and the service platform of the invention, the invention provides a classification management method for table and function metadata based on a Flink data catalog, as shown in figures 1-5, characterized by comprising the following steps:
Step one, connecting a third-party data source;
Step two, creating a data catalog of type Hive, each data catalog corresponding one-to-one to a Hive database;
Supplementary explanation 1
(1) When the data catalog is created, the system records its creator, who is notified by message when the metadata later changes;
(2) Creating a data catalog requires associating it with a Hive database; the specific SQL syntax is:
create catalog hive_catalog with(
'type'='hive',
'default-database'='default',
'hive-conf-dir'='/etc/hive/conf'
);
The statement is explained as follows:
create catalog hive_catalog: creates a Flink catalog named hive_catalog.
with: starts the configuration clause.
'type'='hive': specifies that the catalog type is Hive, meaning that the catalog to be created connects to Hive and allows Flink to use the data and metadata in Hive.
'default-database'='default': specifies the default database; when no database is specified, Flink looks up or registers metadata objects such as tables and functions in the database named default.
'hive-conf-dir'='/etc/hive/conf': specifies the Hive configuration directory, that is, the path of the Hive configuration files. The Flink Hive Catalog loads the Hive configuration from this directory in order to connect to and access the Hive metadata store, which ensures that Flink interacts correctly with Hive.
(3) By creating and configuring a data catalog, Flink can define the data structures of tables, functions, and so on, so that this structured data can be queried, read, and written directly by Flink jobs as if it were a local table. This unified interface reduces the complexity of integrating with external systems, allowing developers to manipulate and process data from different data sources more easily in a Flink job.
(4) The data catalog also provides metadata management: developers can register, query, and manipulate the metadata of data sources in Flink, such as table structures and data locations, to better manage and use external data.
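As an illustration of item (2) above, the CREATE CATALOG statement could be rendered from its three options. This is only a sketch: the function names `build_create_catalog_sql` and `quote_literal` are hypothetical helpers, not part of Flink, and the catalog name is assumed to be a plain identifier that needs no quoting.

```python
def quote_literal(text: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_create_catalog_sql(name: str, default_database: str, hive_conf_dir: str) -> str:
    """Render a CREATE CATALOG statement for a Hive-type catalog (hypothetical helper)."""
    # Assumption: the catalog name is a simple identifier; anything else is rejected.
    if not name.isidentifier():
        raise ValueError(f"unsupported catalog name: {name!r}")
    options = {
        "type": "hive",                        # catalog type is Hive
        "default-database": default_database,  # database used when none is specified
        "hive-conf-dir": hive_conf_dir,        # directory holding the Hive configuration
    }
    body = ",\n".join(
        f"  {quote_literal(key)} = {quote_literal(value)}" for key, value in options.items()
    )
    return f"CREATE CATALOG {name} WITH (\n{body}\n);"
```

Calling `build_create_catalog_sql("hive_catalog", "default", "/etc/hive/conf")` yields a statement equivalent to the one shown in item (2).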
Step three, creating a Flink SQL task, in which the user writes SQL that uses the data catalog to define Flink tables and Flink functions;
Step four, running the Flink SQL task, whereby the table and function metadata are stored in the Hive database corresponding to the data catalog;
Supplementary explanation 2
(1) "Running the Flink SQL task" means that the task runs in the Flink runtime environment; while it runs, Flink stores the table and function metadata in the Hive database corresponding to the data catalog through the Flink Hive Catalog API.
(2) "Defining a Flink table in the task" means that the user defines the metadata of a table or function through SQL. The table metadata comprises the table name, column names, data types, partition information, storage format, and connection information; the function metadata comprises the function name, the function class name, the path under which the function is registered in Hive, and the function class.
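The metadata fields listed in item (2) above can be pictured as two record types. The following is a minimal sketch; the class and field names are illustrative assumptions rather than a schema given in the source, and `function_class` simply mirrors the term used in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnMetadata:
    name: str       # column name
    data_type: str  # column data type, e.g. "STRING"


@dataclass
class TableMetadata:
    database: str
    name: str
    columns: List[ColumnMetadata]
    partition_keys: List[str] = field(default_factory=list)            # partition information
    storage_format: Optional[str] = None                               # storage format
    connection_properties: Dict[str, str] = field(default_factory=dict)  # connection information


@dataclass
class FunctionMetadata:
    database: str
    name: str
    class_name: str                             # function class name
    hive_registered_path: Optional[str] = None  # path under which the function is registered in Hive
    function_class: Optional[str] = None        # "function class" as named in the text (meaning assumed)
```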
Step five, performing full acquisition and incremental acquisition of metadata from the Hive database;
Step six, parsing and classifying the metadata from the Hive database;
Step seven, storing the parsed and classified metadata in a storage database;
Step eight, providing classified query of the metadata and historical version query of the metadata.
Further, the full acquisition and incremental acquisition of metadata from the Hive database in step five proceed as follows:
1) Introducing a permission control mechanism for the data catalog, which allows the visibility of the catalog's tables and functions to be customized and also grants different users and organizations permission to add and remove their respective metadata;
2) Metadata acquisition: obtaining the full metadata of the tables and functions in the databases corresponding to all data catalogs through the Flink Hive Catalog API, and obtaining the incremental metadata of those tables and functions through a custom Hive Hook program.
Further, in process 2) of step five, the full metadata of the tables and functions in the databases corresponding to all data catalogs is obtained through the Flink Hive Catalog API, specifically as follows:
i. Obtaining the Hive database names corresponding to all data catalogs created in the system;
ii. Establishing a connector to the Flink Hive Catalog: the connection is made using the Hive database name and the configuration file path of the data catalog;
iii. Obtaining the metadata: once the connection is established, Flink sends requests to Hive through the Flink Hive Catalog API to obtain the metadata of the tables and functions.
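The three sub-steps above can be sketched as a loop over the registered catalogs. This is an illustration only: `InMemoryHiveCatalog` is a hypothetical stand-in for a real catalog connection, and `fetch_full_metadata` and the `connect` callback are assumed names. In Flink itself, the Java `Catalog` interface offers comparable `listTables`, `getTable`, `listFunctions`, and `getFunction` methods.

```python
from typing import Callable, Dict, List


class InMemoryHiveCatalog:
    """Hypothetical stand-in for a catalog connection; holds metadata in plain dicts."""

    def __init__(self, tables: Dict[str, Dict[str, dict]], functions: Dict[str, Dict[str, dict]]):
        self._tables = tables
        self._functions = functions

    def list_tables(self, database: str) -> List[str]:
        return sorted(self._tables.get(database, {}))

    def get_table(self, database: str, name: str) -> dict:
        return self._tables[database][name]

    def list_functions(self, database: str) -> List[str]:
        return sorted(self._functions.get(database, {}))

    def get_function(self, database: str, name: str) -> dict:
        return self._functions[database][name]


def fetch_full_metadata(catalog_configs: List[dict], connect: Callable[[str, str], InMemoryHiveCatalog]) -> dict:
    """Collect all table and function metadata for every registered data catalog."""
    snapshot = {}
    for config in catalog_configs:
        # i. the Hive database name that corresponds to this data catalog
        database = config["database"]
        # ii. connect using the database name and the configuration directory
        client = connect(database, config["hive_conf_dir"])
        # iii. request the metadata of every table and function
        snapshot[database] = {
            "tables": {n: client.get_table(database, n) for n in client.list_tables(database)},
            "functions": {n: client.get_function(database, n) for n in client.list_functions(database)},
        }
    return snapshot
```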
Further, in process 2) of step five, the incremental metadata of the tables and functions in the databases corresponding to all data catalogs is obtained through the custom Hive Hook program, specifically as follows:
i. Monitoring table-related and function-related events in the Hive catalog; the events include: creating a table, deleting a table, creating a function, and deleting a function;
ii. When a change occurs, the Hive Hook captures it immediately and triggers the metadata classification and metadata synchronization flow.
Supplementary explanation 3
(1) The table creation, table deletion, function creation, and function deletion events trigger the following methods:
onCreateTable(CreateTableEvent tableEvent): invoked when a table is created.
onDropTable(DropTableEvent tableEvent): invoked when a table is deleted.
onCreateFunction(CreateFunctionEvent funcEvent): invoked when a function is created.
onDropFunction(DropFunctionEvent funcEvent): invoked when a function is deleted.
(2) The Flink Hive Catalog is the component of Apache Flink that integrates Apache Hive, allowing the data and metadata in Hive to be used directly in Flink jobs.
(3) Through the Flink Hive Catalog, Flink can connect to the Hive metadata store and use the tables, functions, and related metadata objects defined there.
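The four callbacks in item (1) above all feed the same flow: classify the changed object, then synchronize it. The sketch below models that dispatch in Python. The class `MetadataChangeListener`, its method names, and the dict-shaped events are assumptions for illustration; an actual Hive Hook would be a Java listener receiving Hive event objects.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]  # assumed event shape: a plain dict describing the changed object


class MetadataChangeListener:
    """Hypothetical listener that routes each change into classification and synchronization."""

    def __init__(
        self,
        classify: Callable[[str, Event], str],
        synchronize: Callable[[str, str, str, Event], None],
    ):
        self._classify = classify
        self._synchronize = synchronize

    def on_create_table(self, event: Event) -> None:
        self._handle("table", "create", event)

    def on_drop_table(self, event: Event) -> None:
        self._handle("table", "drop", event)

    def on_create_function(self, event: Event) -> None:
        self._handle("function", "create", event)

    def on_drop_function(self, event: Event) -> None:
        self._handle("function", "drop", event)

    def _handle(self, object_type: str, action: str, event: Event) -> None:
        # Classify first, then hand the result to the synchronization step.
        category = self._classify(object_type, event)
        self._synchronize(object_type, action, category, event)
```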
Further, process 2) of step five obtains the full metadata and the incremental metadata in the following stages:
A. Full acquisition obtains, through the Flink Hive Catalog API, all metadata under the Hive database name of the current data catalog, covering both tables and functions;
B. Incremental acquisition obtains only the changes since the last full acquisition, that is, the metadata that has been added or deleted;
In Flink, a Catalog is an interface for managing external data sources, and the Flink Hive Catalog is the specific implementation that connects to Apache Hive; through the Flink Hive Catalog, Flink can connect to the Hive metadata store, query the metadata of tables and functions in Hive, and allow Flink jobs to access and manipulate this data directly;
Supplementary explanation 4
(1) Incremental data refers to the data added or deleted within a period of time; in contrast to full data, incremental data contains only the most recently added or deleted part of the metadata, not the whole of it;
(2) Full data refers to all data in the dataset, that is, all records, information, or content; in contrast to incremental data, full data covers the whole of the metadata, both historical and current; full metadata is acquired once, whereas incremental data is acquired repeatedly as metadata is added or deleted;
(3) The metadata includes field names, field types, field lengths, table names, table descriptions, database names, and database descriptions. If a field name needs to be modified, the change is made by replacing the whole table rather than altering the single field: the table is deleted and then created again. When the table is deleted, the is_delete field of its metadata record is set to true, marking the metadata as deleted; when the table is created, an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
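Item (3) above treats a field rename as a delete followed by a create. A minimal in-memory sketch of that bookkeeping follows; the function name `rename_field` and the record layout are assumed for illustration only.

```python
from typing import Dict, List


def rename_field(records: List[Dict], database: str, table: str, old_name: str, new_name: str) -> Dict:
    """Model a field rename as 'delete the table, then create it again'."""
    # Locate the current (not yet deleted) record of the table.
    current = next(
        r for r in records
        if r["database"] == database and r["table"] == table and not r["is_delete"]
    )
    # Deleting the table: the existing record is kept but marked as deleted.
    current["is_delete"] = True
    # Creating the table again: a new record with the renamed field, marked as newly added.
    replacement = {
        "database": database,
        "table": table,
        "fields": [new_name if f == old_name else f for f in current["fields"]],
        "is_delete": False,
    }
    records.append(replacement)
    return replacement
```

The old record remains available as history, which is what the historical version query later relies on.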
Further, in process 2) of step five, the full acquisition of all metadata in stage A occurs only once, after which metadata is acquired incrementally; the metadata acquired in full and incrementally in stages A and B of step five comprises table metadata and function metadata; the table metadata comprises the table name, column names, data types, partition information, storage format, and connection information; the function metadata comprises the function name, the function class name, the path under which the function is registered in Hive, and the function class.
Further, the parsing and classification of metadata from the Hive database in step six proceed as follows:
1) When table metadata is acquired in full, the value of the connector connection parameter in the table metadata is parsed, and the table is classified according to that value;
2) When function metadata is acquired in full, the Hive storage address field in the function metadata is parsed, and classification is determined by its value: if the field has a value, the function is classified as a Hive function, otherwise as a Flink function;
3) When data is acquired incrementally, creating a table invokes the onCreateTable(CreateTableEvent tableEvent) method, in which the Map<String, String> parameters object of the table object carried by the incoming tableEvent is parsed; classification is completed according to the parameters object or the flink.connector attribute value within it; if the parameters object does not exist, the table is classified as a Hive table; if it exists, the flink.connector attribute value is read from it and the table is classified according to that value; if the flink.connector attribute value cannot be obtained, the table is classified as a Hive table.
4) When data is acquired incrementally, creating a function invokes the onCreateFunction(CreateFunctionEvent funcEvent) method, in which the attributes of the function object carried by the incoming funcEvent are parsed, and classification is determined by the Hive storage address field of the function object: if the field has a value, the function is classified as a Hive function, otherwise as a Flink function.
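The classification rules above reduce to two small decisions. The sketch below assumes that the category of a Flink table is simply its flink.connector value (the source does not name the categories), and the function names `classify_table` and `classify_function` are illustrative.

```python
from typing import Mapping, Optional

HIVE = "hive"
FLINK = "flink"


def classify_table(parameters: Optional[Mapping[str, str]]) -> str:
    """Classify a table from its parameters map."""
    # No parameters object at all: treat the table as a Hive table.
    if not parameters:
        return HIVE
    connector = parameters.get("flink.connector")
    # The flink.connector value cannot be obtained: also a Hive table.
    if not connector:
        return HIVE
    # Assumption: the connector value itself serves as the category label.
    return connector


def classify_function(hive_storage_address: Optional[str]) -> str:
    """Classify a function by whether its Hive storage address field has a value."""
    return HIVE if hive_storage_address else FLINK
```

For example, a table whose parameters contain `flink.connector` set to `kafka` falls into the `kafka` category, while a table without that property is treated as a Hive table.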
Further, in step seven the parsed and classified metadata is stored in a storage database;
the specific process is as follows:
when metadata is acquired in full:
(1) Obtaining the basic metadata information of all tables and storing it in the storage database; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
(2) Obtaining the basic metadata information of all functions and storing it in the storage database; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
Further, in step seven the parsed and classified metadata is stored in a storage database;
the specific process is as follows:
when metadata is acquired incrementally:
(1) When a table is created, the table-creation method of the custom Hive Hook program is triggered, the basic table metadata information is obtained and stored in the storage database, and a message is pushed to the creator of the data catalog; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
(2) When a function is created, the function-creation method of the custom Hive Hook program is triggered, the basic function metadata information is obtained and stored in the storage database, and a message is pushed to the creator of the data catalog; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
(3) When a table is deleted, the table-deletion method of the custom Hive Hook program is triggered and the basic table metadata information is obtained; an is_delete field is added to it and set to true, marking the metadata as deleted; the record is located in the storage database by database name and table name and updated, so that the historical data is kept in the storage database, and a message is pushed to the creator of the data catalog;
(4) When a function is deleted, the function-deletion method of the custom Hive Hook program is triggered and the basic function metadata information is obtained; an is_delete field is added to it and set to true, marking the metadata as deleted; the record is located in the storage database by database name and function name and updated, so that the historical data is kept in the storage database, and a message is pushed to the creator of the data catalog;
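The incremental synchronization rules above amount to an insert with is_delete set to false on creation and an update to true on deletion, each followed by a notification. The sketch below uses an in-memory SQLite database as a stand-in for the storage database; the table layout, function names, and `notify` callback are assumptions, not the schema of the source.

```python
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the storage database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE metadata (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               database_name TEXT NOT NULL,
               object_type TEXT NOT NULL,   -- 'table' or 'function'
               object_name TEXT NOT NULL,
               category TEXT NOT NULL,
               is_delete INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def record_created(conn, database: str, object_type: str, name: str, category: str,
                   notify: Callable[[str], None]) -> None:
    """Store newly created metadata with is_delete = false and notify the catalog creator."""
    conn.execute(
        "INSERT INTO metadata (database_name, object_type, object_name, category, is_delete) "
        "VALUES (?, ?, ?, ?, 0)",
        (database, object_type, name, category),
    )
    conn.commit()
    notify(f"{object_type} {database}.{name} created")


def record_deleted(conn, database: str, object_type: str, name: str,
                   notify: Callable[[str], None]) -> int:
    """Mark the matching record as deleted; the row is kept as history. Returns rows updated."""
    cursor = conn.execute(
        "UPDATE metadata SET is_delete = 1 "
        "WHERE database_name = ? AND object_type = ? AND object_name = ? AND is_delete = 0",
        (database, object_type, name),
    )
    conn.commit()
    if cursor.rowcount:
        notify(f"{object_type} {database}.{name} deleted")
    return cursor.rowcount
```

Updating the flag instead of removing the row is what keeps deleted metadata available for later historical queries.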
Further, the classified query of metadata provided in step eight queries the metadata in the storage database by category, and the historical version query of metadata provided in step eight queries the storage database for the metadata marked as deleted.
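Both queries of step eight can be expressed as filters on the category and the is_delete flag. The following sketch again uses an in-memory SQLite table with an assumed layout; `create_store`, `query_current_by_category`, and `query_history_by_category` are hypothetical names.

```python
import sqlite3
from typing import List, Tuple


def create_store(rows: List[Tuple[str, str, str, str, int]]) -> sqlite3.Connection:
    """Build an in-memory stand-in for the storage database, seeded with the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (database_name TEXT, object_type TEXT, "
        "object_name TEXT, category TEXT, is_delete INTEGER)"
    )
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def query_current_by_category(conn: sqlite3.Connection, category: str) -> List[tuple]:
    """Classified query: metadata of one category that is marked as newly added."""
    return conn.execute(
        "SELECT database_name, object_type, object_name FROM metadata "
        "WHERE category = ? AND is_delete = 0 ORDER BY object_name",
        (category,),
    ).fetchall()


def query_history_by_category(conn: sqlite3.Connection, category: str) -> List[tuple]:
    """Historical version query: metadata of one category that is marked as deleted."""
    return conn.execute(
        "SELECT database_name, object_type, object_name FROM metadata "
        "WHERE category = ? AND is_delete = 1 ORDER BY object_name",
        (category,),
    ).fetchall()
```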
The foregoing merely illustrates and explains the principles of the invention. Those skilled in the art may make various modifications, additions, or similar substitutions to the specific embodiments described without departing from the principles of the invention or exceeding the scope of the appended claims.
Claims (10)
1. A classification management method for table and function metadata based on a Flink data directory, characterized by comprising the following steps:
Step one, connecting a third-party data source;
Step two, creating a data catalog of type Hive, each data catalog corresponding one-to-one to a Hive database;
Step three, creating a Flink SQL task, in which the user writes SQL that uses the data catalog to define Flink tables and Flink functions;
Step four, running the Flink SQL task, whereby the table and function metadata are stored in the Hive database corresponding to the data catalog;
Step five, performing full acquisition and incremental acquisition of metadata from the Hive database;
Step six, parsing and classifying the metadata from the Hive database;
Step seven, storing the parsed and classified metadata in a storage database;
Step eight, providing classified query of the metadata and historical version query of the metadata.
2. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 1, characterized in that the full acquisition and incremental acquisition of metadata from the Hive database in step five proceed as follows:
1) Introducing a permission control mechanism for the data catalog, which allows the visibility of the catalog's tables and functions to be customized and also grants different users and organizations permission to add and remove their respective metadata;
2) Metadata acquisition: obtaining the full metadata of the tables and functions in the databases corresponding to all data catalogs through the Flink Hive Catalog API, and obtaining the incremental metadata of those tables and functions through a custom Hive Hook program.
3. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 2, characterized in that, in process 2) of step five, the full metadata of the tables and functions in the databases corresponding to all data catalogs is obtained through the Flink Hive Catalog API, specifically as follows:
i. Obtaining the Hive database names corresponding to all data catalogs created in the system;
ii. Establishing a connector to the Flink Hive Catalog: the connection is made using the Hive database name and the configuration file path of the data catalog;
iii. Obtaining the metadata: once the connection is established, Flink sends requests to Hive through the Flink Hive Catalog API to obtain the metadata of the tables and functions.
4. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 2, characterized in that, in process 2) of step five, the incremental metadata of the tables and functions in the databases corresponding to all data catalogs is obtained through a custom Hive Hook program, specifically as follows:
i. Monitoring table-related and function-related events in Hive; the events include: creating a table, deleting a table, creating a function, and deleting a function;
ii. When a change occurs, the Hive Hook captures it immediately and triggers the metadata classification and metadata synchronization flow.
5. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 2, characterized in that process 2) of step five obtains the full metadata and the incremental metadata in the following stages:
A. Full acquisition obtains, through the Flink Hive Catalog API, all metadata under the Hive database name of the current data catalog, covering both tables and functions;
B. Incremental acquisition obtains only the changes since the last full acquisition, that is, the metadata that has been added or deleted;
In Flink, a Catalog is an interface for managing external data sources, and the Flink Hive Catalog is the specific implementation that connects to Apache Hive; through the Flink Hive Catalog, Flink can connect to the Hive metadata store, query the metadata of tables and functions in Hive, and allow Flink jobs to access and manipulate this data directly.
6. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 5, characterized in that, in process 2) of step five, the full acquisition of all metadata in stage A occurs only once, after which metadata is acquired incrementally; the metadata acquired in full and incrementally in stages A and B of step five comprises table metadata and function metadata; the table metadata comprises the table name, column names, data types, partition information, storage format, and connection information; the function metadata comprises the function name, the function class name, the path under which the function is registered in Hive, and the function class.
7. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 1, characterized in that the parsing and classification of metadata from the Hive database in step six proceed as follows:
1) When table metadata is acquired in full, the value of the connector connection parameter in the table metadata is parsed, and the table is classified according to that value;
2) When function metadata is acquired in full, the Hive storage address field in the function metadata is parsed, and classification is determined by its value: if the field has a value, the function is classified as a Hive function, otherwise as a Flink function;
3) When data is acquired incrementally, creating a table invokes the onCreateTable(CreateTableEvent tableEvent) method, in which the Map<String, String> parameters object of the table object carried by the incoming tableEvent is parsed; classification is completed according to the parameters object or the flink.connector attribute value within it; if the parameters object does not exist, the table is classified as a Hive table; if it exists, the flink.connector attribute value is read from it and the table is classified according to that value; if the flink.connector attribute value cannot be obtained, the table is classified as a Hive table.
4) When data is acquired incrementally, creating a function invokes the onCreateFunction(CreateFunctionEvent funcEvent) method, in which the attributes of the function object carried by the incoming funcEvent are parsed, and classification is determined by the Hive storage address field of the function object: if the field has a value, the function is classified as a Hive function, otherwise as a Flink function.
8. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 1, characterized in that in step seven the parsed and classified metadata is stored in a storage database;
the specific process is as follows:
when metadata is acquired in full:
(1) Obtaining the basic metadata information of all tables and storing it in the storage database; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
(2) Obtaining the basic metadata information of all functions and storing it in the storage database; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data.
9. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 1, characterized in that in step seven the parsed and classified metadata is stored in a storage database;
the specific process is as follows:
when metadata is acquired incrementally:
(1) When a table is created, the table-creation method of the custom Hive Hook program is triggered, the basic table metadata information is obtained and stored in the storage database, and a message is pushed to the creator of the data catalog; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
(2) When a function is created, the function-creation method of the custom Hive Hook program is triggered, the basic function metadata information is obtained and stored in the storage database, and a message is pushed to the creator of the data catalog; an is_delete field is added to the basic metadata information and set to false, marking the metadata as newly added data;
(3) When a table is deleted, the table-deletion method of the custom Hive Hook program is triggered and the basic table metadata information is obtained; an is_delete field is added to it and set to true, marking the metadata as deleted; the record is located in the storage database by database name and table name and updated, so that the historical data is kept in the storage database, and a message is pushed to the creator of the data catalog;
(4) When a function is deleted, the function-deletion method of the custom Hive Hook program is triggered and the basic function metadata information is obtained; an is_delete field is added to it and set to true, marking the metadata as deleted; the record is located in the storage database by database name and function name and updated, so that the historical data is kept in the storage database, and a message is pushed to the creator of the data catalog.
10. The classification management method for table and function metadata based on a Flink data directory as claimed in claim 1, characterized in that the classified query of metadata provided in step eight queries the metadata in the storage database by category, and the historical version query of metadata provided in step eight queries the storage database for the metadata marked as deleted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311735752.8A CN117591611A (en) | 2023-12-16 | 2023-12-16 | Classification management method for table and function metadata based on Flink data directory |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117591611A (en) | 2024-02-23 |
Family
ID=89909975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311735752.8A Pending CN117591611A (en) | 2023-12-16 | 2023-12-16 | Classification management method for table and function metadata based on Flink data directory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117591611A (en) |
2023-12-16: application CN202311735752.8A filed in CN (published as CN117591611A); status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||