CN115599863A - Bank data synchronization method and device based on Hudi, electronic equipment and medium - Google Patents

Info

Publication number
CN115599863A
Authority
CN
China
Prior art keywords
data
source information
data source
hudi
incremental
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211317617.7A
Other languages
Chinese (zh)
Inventor
李海博
王鑫毅
王长生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202211317617.7A
Publication of CN115599863A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a Hudi-based bank data synchronization method and apparatus, an electronic device, and a medium. The method includes: acquiring current data source information and determining whether a Hudi table matching the current data source information exists, where the Hudi table is used to store historical data files corresponding to historical data source information; if so, determining that the current data source information is incremental data source information and performing data synchronization on the incremental data source information; otherwise, determining that the current data source information is full data source information and performing data synchronization on the full data source information. While meeting the security and auditability requirements of a bank system, this technical solution enables incremental update operations on full data and can achieve a "T+0" near-real-time data synchronization effect, that is, data is consumed on the day it is produced, which effectively improves data synchronization efficiency and data timeliness and reduces resource space occupation.

Figure 202211317617

Description

Hudi-based bank data synchronization method and device, electronic equipment and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for synchronizing bank data based on Hudi, an electronic device, and a medium.
Background
The bank data analysis and mining system is a one-stop artificial intelligence platform for bank data analysis that supports all data analysis projects in a centralized manner. At present, the storage media supported by the data analysis and mining system fall into two main categories: HDFS (Hadoop Distributed File System) storage based on Hadoop, and the Gbase database based on MPP (Massively Parallel Processing). Hadoop is a distributed system framework for large clusters; a computing task under the framework can be divided into many small tasks that run on different nodes, and the HDFS provided by Hadoop stores data on the computing nodes so as to provide very high aggregate bandwidth across the data center. Gbase MPP is a large-scale distributed parallel database cluster with a columnar storage architecture. It offers high performance, high availability, and high scalability, can provide a cost-effective general computing platform for ultra-large-scale data management, and is widely used to support data warehouse systems, business intelligence systems, and decision support systems.
At present, the bank system faces several difficulties in using the bank's internal enterprise data, the bank's subject data, and the bank's external data. First, data timeliness is low: the data consumption cycle is approximately "T+3", that is, data is consumed on the third day. Second, modifying an HDFS file requires replacing it in full; incremental update operations such as adding, deleting, or modifying records cannot be performed, so computing and storage resources are heavily occupied. Third, the bank's existing data analysis and mining platform relies mainly on a single process for data synchronization, which makes it difficult to meet the demands of large-scale data synchronization.
Disclosure of Invention
The invention provides a Hudi-based bank data synchronization method and apparatus, an electronic device, and a medium, which can perform incremental update operations on full data while meeting the security and auditability requirements of a bank system, effectively improve data synchronization efficiency and data timeliness, and reduce resource space occupation.
According to an aspect of the present invention, there is provided a Hudi-based banking data synchronization method, the method comprising:
acquiring current data source information, and determining whether a Hudi table matched with the current data source information exists or not; the Hudi table is used for storing historical data files corresponding to historical data source information;
if yes, determining that the current data source information is incremental data source information, and performing data synchronization on the incremental data source information;
otherwise, determining that the current data source information is full data source information, and performing data synchronization on the full data source information.
According to another aspect of the present invention, there is provided a Hudi-based bank data synchronization apparatus, including:
the current data source information matching module is used for acquiring current data source information and determining whether a Hudi table matched with the current data source information exists or not; the Hudi table is used for storing historical data files corresponding to historical data source information;
the incremental data source data synchronization module is used for determining, if such a Hudi table exists, that the current data source information is incremental data source information, and performing data synchronization on the incremental data source information;
and the full data source data synchronization module is used for determining, otherwise, that the current data source information is full data source information, and performing data synchronization on the full data source information.
According to another aspect of the present invention, there is provided an electronic device for Hudi-based bank data synchronization, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the Hudi-based banking data synchronization method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement a Hudi-based bank data synchronization method according to any one of the embodiments of the present invention when executed.
According to the technical solution of the embodiment of the invention, current data source information is acquired, and whether a Hudi table matching the current data source information exists is determined; the Hudi table is used for storing historical data files corresponding to historical data source information; if so, the current data source information is determined to be incremental data source information, and data synchronization is performed on the incremental data source information; otherwise, the current data source information is determined to be full data source information, and data synchronization is performed on the full data source information. This technical solution enables incremental update operations on full data while meeting the security and auditability requirements of the bank system, and can achieve a "T+0" near-real-time data synchronization effect, that is, data is consumed on the day it is produced, which effectively improves data synchronization efficiency and data timeliness and reduces resource space occupation.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a Hudi-based bank data synchronization method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a Hudi-based bank data synchronization method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a Hudi-based bank data synchronization apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device implementing the Hudi-based bank data synchronization method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," "object," and the like in the description and claims of the invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a Hudi-based bank data synchronization method according to an embodiment of the present invention, where this embodiment is applicable to a situation where bank data is efficiently synchronized, and the method may be implemented by a Hudi-based bank data synchronization apparatus, where the Hudi-based bank data synchronization apparatus may be implemented in a form of hardware and/or software, and the Hudi-based bank data synchronization apparatus may be configured in an electronic device with data processing capability. As shown in fig. 1, the method includes:
S110, acquiring current data source information, and determining whether a Hudi table matching the current data source information exists; the Hudi table is used for storing historical data files corresponding to historical data source information.
The technical solution of this embodiment is based on Hudi (Hadoop Upserts Deletes and Incrementals) and realizes incremental update operations on full data. Hudi ingests and manages large analytical datasets stored on a DFS (the Hadoop HDFS or cloud storage) and supports incremental or full updates to the data in an existing table. Hudi organizes a table into a directory structure under a specified base path (basepath) on HDFS; the second-level directories are partition paths (the partition path key is specified by the user), which divide the table into a plurality of partitions. Each partition directory stores the files of that partition, by default in Parquet format. In addition, the Hudi-based bank data synchronization apparatus that performs the Hudi-based bank data synchronization method includes, but is not limited to, a stand-alone machine or a server cluster.
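The directory layout described above can be sketched as follows. This is a minimal illustration rather than anything taken from the patent: the base path, partition value, and file name are hypothetical, and the only Hudi-specific detail assumed is that table metadata is kept in a `.hoodie` directory under the base path.

```python
from pathlib import PurePosixPath


def table_metadata_dir(base_path: str) -> PurePosixPath:
    """Hudi keeps table metadata in a .hoodie directory under the base path."""
    return PurePosixPath(base_path) / ".hoodie"


def data_file_path(base_path: str, partition: str, file_name: str) -> PurePosixPath:
    """Compose <base path>/<partition path>/<file name>."""
    return PurePosixPath(base_path) / partition / file_name


# Hypothetical base path, partition value, and file name.
path = data_file_path("/warehouse/hudi/customer", "dt=2022-10-08", "part-0001.parquet")
print(path)
print(table_metadata_dir("/warehouse/hudi/customer"))
```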
The data source information may include bank subject data, internal enterprise data of the bank, and external enterprise data of the bank. The storage formats of data source information fall roughly into two types. The first is structured data, which is usually managed by a relational database such as Gbase MPP. The second is unstructured and semi-structured data, which is usually stored and managed by a distributed file system such as HDFS. The internal enterprise data of the bank may refer to information generated within the bank system whose subject is an enterprise; it is usually stored in a database as structured data. The external enterprise data of the bank may refer to enterprise-subject data that the bank purchases from an external third-party company. The Hudi table is a Hudi data table and can be used to store historical data files corresponding to historical data source information.
It should be noted that, in this embodiment, a data source acquisition period may be preset according to actual application requirements, and data source information may be acquired based on that period. For example, the data source acquisition period may be set to 1 day, 1 month, 1 quarter, or immediate acquisition, according to the data timeliness requirements of different businesses.
S120, if so, determining that the current data source information is incremental data source information, and performing data synchronization on the incremental data source information.
The incremental data source information may refer to data source information that requires an incremental data update. In this embodiment, if a Hudi table matching the current data source information exists, the current data source information is not being obtained for the first time. In that case, the current data source information may be determined to be incremental data source information, and the original data source information is updated with the incremental data extracted from the current data source information. This synchronizes the incremental data source information without modifying the full data, which improves data synchronization efficiency.
Illustratively, a Spark job for updating the incremental data source information may be generated based on the Hudi DeltaStreamer utility class, and the update of the incremental data source information may be completed by that Spark job. Spark is a fast, general-purpose computing engine designed for large-scale data processing; it can keep intermediate job output in memory instead of reading from and writing to HDFS, which makes its data processing fast.
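As a rough illustration of how such a Spark job might be configured, the sketch below builds the write options for Hudi's Spark datasource writer. Note that this swaps in the datasource writer in place of the DeltaStreamer utility named above, purely because it is easier to show in a few lines. The option keys are standard Hudi write configurations; the table name and field names are hypothetical.

```python
def hudi_write_options(table_name: str, record_key: str, precombine_field: str,
                       partition_field: str, operation: str) -> dict:
    """Build Hudi Spark datasource write options for one write."""
    if operation not in {"upsert", "insert", "bulk_insert"}:
        raise ValueError(f"unsupported write operation: {operation}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": operation,
    }


# Hypothetical table and column names; "upsert" applies incremental changes.
options = hudi_write_options("customer", "customer_id", "updated_at", "dt", "upsert")

# With a Spark session and the Hudi bundle on the classpath, a DataFrame `df`
# of incremental rows could then be written roughly like this:
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
print(options["hoodie.datasource.write.operation"])
```

For the first full load, the same helper could be called with `bulk_insert` or `insert` as the operation.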
In addition, a change log may be generated from the changed content (that is, the incremental data) of the incremental data source information and used for bank auditing. The change log may include the change time, the changed object, the changed content, and the like. For security, the changed content may also be persisted to storage, for example to a disk or a USB drive, and the persisted data may be transmitted through a dedicated channel. For transmission across multiple data sources, the Hadoop command hadoop distcp and the Gbase MPP statement SELECT ... INTO OUTFILE can be used.
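A change log entry with the three fields mentioned above might look like the following sketch. The class name, field names, and JSON-lines serialization are assumptions made for illustration; the patent does not prescribe a log format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeLogEntry:
    change_time: str      # ISO 8601 timestamp of the change
    change_object: str    # what was changed, e.g. a table name (hypothetical)
    change_content: dict  # the incremental data itself


def make_entry(change_object: str, change_content: dict,
               now: Optional[datetime] = None) -> ChangeLogEntry:
    """Create an entry, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return ChangeLogEntry(now.isoformat(), change_object, change_content)


def to_json_line(entry: ChangeLogEntry) -> str:
    """Serialize one entry as a single JSON line for an append-only audit file."""
    return json.dumps(asdict(entry), ensure_ascii=False, sort_keys=True)


entry = make_entry("customer", {"customer_id": "42", "op": "update"},
                   now=datetime(2022, 10, 8, tzinfo=timezone.utc))
print(to_json_line(entry))
```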
S130, otherwise, determining that the current data source information is full data source information, and performing data synchronization on the full data source information.
The full data source information may refer to data source information that requires a full data update. In this embodiment, if no Hudi table matching the current data source information exists, the current data source information is being obtained for the first time. In that case, the current data source information may be determined to be full data source information, and a full data write is performed according to the current data source information, so as to synchronize the full data source information. Illustratively, a Spark job for updating the full data source information may be generated based on the Hudi DeltaStreamer utility class, and the update of the full data source information may be completed by that Spark job.
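Steps S110 to S130 amount to a simple dispatch, which can be sketched as below. The function names, the registry mapping source identifiers to table base paths, and the callback signatures are all hypothetical; the real synchronization work would be done by the Spark jobs mentioned above.

```python
from typing import Callable, Mapping


def synchronize(source_id: str,
                existing_tables: Mapping[str, str],
                sync_incremental: Callable[[str, str], None],
                sync_full: Callable[[str], None]) -> str:
    """Choose incremental or full synchronization for one data source.

    existing_tables maps a data source identifier to the base path of the
    Hudi table that already stores its historical data files.
    """
    base_path = existing_tables.get(source_id)
    if base_path is not None:
        # S120: a matching Hudi table exists, so only the increment is applied.
        sync_incremental(source_id, base_path)
        return "incremental"
    # S130: first time this source is seen, so the full data is written.
    sync_full(source_id)
    return "full"


calls = []
tables = {"customer": "/warehouse/hudi/customer"}
first = synchronize("customer", tables,
                    lambda s, p: calls.append(("incremental", s, p)),
                    lambda s: calls.append(("full", s)))
second = synchronize("loan", tables,
                     lambda s, p: calls.append(("incremental", s, p)),
                     lambda s: calls.append(("full", s)))
print(first, second)
```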
According to the technical solution of the embodiment of the invention, current data source information is acquired, and whether a Hudi table matching the current data source information exists is determined; the Hudi table is used for storing historical data files corresponding to historical data source information; if so, the current data source information is determined to be incremental data source information, and data synchronization is performed on the incremental data source information; otherwise, the current data source information is determined to be full data source information, and data synchronization is performed on the full data source information. This technical solution enables incremental update operations on full data while meeting the security and auditability requirements of a bank system, and can achieve a "T+0" near-real-time data synchronization effect, that is, data is consumed on the day it is produced, which effectively improves data synchronization efficiency and data timeliness and reduces resource space occupation.
In this embodiment, optionally, determining whether there is a Hudi table matching with the current data source information includes: determining first identification information of current data source information and second identification information of a Hudi table; and determining whether a Hudi table matched with the current data source information exists according to the matching result of the first identification information and the second identification information.
The first identification information may be used to uniquely identify the current data source information. The second identification information may be used to uniquely identify the Hudi table. It should be noted that, in this embodiment, the expression form of the first identification information and the second identification information is not limited at all, and may be set according to the actual application requirement. For example, the first identification information and the second identification information may be represented in the form of a character string, and the character string may include one or more of numbers, letters, and special characters.
In this embodiment, whether a Hudi table matching the current data source information exists may be determined in a manner of matching the identification information. Specifically, first identification information of current data source information and second identification information of each Hudi table are determined, then the first identification information and the second identification information are matched, and whether the Hudi table matched with the current data source information exists is determined according to a matching result of the first identification information and the second identification information. If second identification information successfully matched with the first identification information exists, determining that a Hudi table matched with the current data source information exists; if the second identification information which is successfully matched with the first identification information does not exist, the situation that the Hudi table which is matched with the current data source information does not exist can be determined.
It should be noted that this embodiment does not limit the method used to match identification information, which may be set according to actual requirements. Illustratively, identification information may be matched based on a similarity calculation, for example cosine similarity, matrix similarity, or string edit distance.
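One of the similarity measures mentioned, string edit distance, can be sketched as follows. The normalization to a 0..1 score, the threshold parameter, and the example identifiers are assumptions for illustration; with the default threshold of 1.0 the match degenerates to exact equality.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def find_matching_table(source_id: str, table_ids: Iterable[str],
                        threshold: float = 1.0) -> Optional[str]:
    """Return the most similar table identifier, or None if below threshold."""
    best = max(table_ids, key=lambda t: similarity(source_id, t), default=None)
    if best is not None and similarity(source_id, best) >= threshold:
        return best
    return None


print(find_matching_table("cust_2022", ["cust_2022", "loan_2022"]))
```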
By means of the scheme, whether the Hudi table matched with the current data source information exists can be judged quickly and accurately through the identification information, and therefore whether the current data source information is the data source information obtained for the first time or not is determined according to the matching result.
In this embodiment, optionally, the data synchronization of the full data source information includes: determining a full data file according to the full data source information; and carrying out data synchronization on the full data source information based on the full data file.
Wherein, the full data file can be used for recording the full data source information. In this embodiment, after the current data source information is determined as the full data source information, the full data source information may be all written into the same file, so as to determine the full data file, and then data synchronization is performed on the full data source information based on the full data file.
According to the scheme, the full data source information is completely written into the full data file, data synchronization is carried out on the full data source information based on the full data file, and the full data can be stored and managed uniformly and conveniently through the full data file.
In this embodiment, optionally, performing data synchronization on the full data source information based on the full data file includes: performing data format conversion on the data in the full data file based on a preset data format to obtain a first updated full data file, where the preset data format is determined based on format characteristics of a Hudi table; performing data type conversion on the data in the first updated full data file based on a preset data type to obtain a second updated full data file, where the preset data type includes a data type of at least one dimension; and creating a Hudi table according to the second updated full data file, so as to perform data synchronization on the full data source information.
The preset data format is a data format configured in advance. Specifically, it may be determined from the format characteristics of the Hudi table, so that it is a data format the Hudi table supports. For example, the preset data format may be the Parquet format. The first updated full data file may refer to the file obtained after data format conversion is performed on the full data file. The preset data type is a data type configured in advance. It may include a data type of at least one dimension and may be set according to actual application requirements, which this embodiment does not limit. For example, the preset data types may include data types of different dimensions such as decimal, datetime, uuid, timestamp, or a structure. The second updated full data file may refer to the file obtained after data type conversion is performed on the data in the first updated full data file.
In this embodiment, data format conversion is first performed on the full data file based on the preset data format to obtain a first updated full data file in the Parquet format supported by the Hudi table; at this point all data in the first updated full data file is represented as character strings. To ensure data quality, the data in the first updated full data file may also be cleaned, including but not limited to removing dirty data such as null values and abnormal values. Data type conversion is then performed on the cleaned data in the first updated full data file based on the preset data type to obtain a second updated full data file. Finally, a Hudi table is created according to the second updated full data file, and the second updated full data file is stored in the newly created Hudi table, so as to perform data synchronization on the full data source information.
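The cleaning and type conversion steps can be sketched on in-memory rows as below. This is only an illustration under stated assumptions: rows are dictionaries of strings, the column names and target types are hypothetical, a value that fails conversion is treated as abnormal, and writing Parquet or creating the Hudi table is left out.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical per-column target types; unlisted columns stay strings.
CONVERTERS = {
    "balance": Decimal,
    "opened": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}


def is_clean(row: dict) -> bool:
    """A row is kept only if no value is null or blank."""
    return all(v is not None and v.strip() != "" for v in row.values())


def convert(row: dict) -> dict:
    """Convert each string value to its preset data type."""
    return {k: CONVERTERS.get(k, str)(v) for k, v in row.items()}


def clean_and_convert(rows: list) -> list:
    """Drop dirty rows, then convert the remaining rows column by column."""
    result = []
    for row in rows:
        if not is_clean(row):
            continue  # null or blank value
        try:
            result.append(convert(row))
        except (InvalidOperation, ValueError):
            continue  # value that cannot be converted is treated as abnormal
    return result


rows = [
    {"id": "1", "balance": "10.50", "opened": "2022-10-08"},
    {"id": "2", "balance": "", "opened": "2022-10-08"},
    {"id": "3", "balance": "abc", "opened": "2022-10-08"},
    {"id": "4", "balance": "7", "opened": "not-a-date"},
]
converted = clean_and_convert(rows)
print(converted)
```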
It should be noted that, in the data type conversion process, a chain structure may be set up to implement fast multidimensional conversion of data types. The chain structure can perform data type conversion sequentially and/or in parallel, and further data processing can be performed according to the converted data type; the corresponding data processing mode can be flexibly set and modified according to actual requirements. Specifically, the data processing mode may process data of a single dimension data type or data of multiple dimension data types. For example, assume that datetime data such as 10/8/2022 is obtained after data type conversion. If the preset data processing mode is to extract the month, then 10 is extracted from 10, 8, and 2022 as the data processing result. When data of different dimension data types is processed, the processing result for one dimension data type can be used as an input parameter of the processing flow for another dimension data type.
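A sequential chain of this kind can be sketched with function composition, using the month-extraction example from the text. The helper names are hypothetical, and the date string is assumed to be in month/day/year order, which is what makes 10 the month.

```python
from datetime import datetime
from functools import reduce
from typing import Any, Callable


def chain(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose steps so that each step's result feeds the next step."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def to_datetime(text: str) -> datetime:
    """Type conversion step: string to datetime (month/day/year assumed)."""
    return datetime.strptime(text, "%m/%d/%Y")


def extract_month(value: datetime) -> int:
    """Processing step applied to the converted datetime value."""
    return value.month


month_of = chain(to_datetime, extract_month)
print(month_of("10/8/2022"))  # prints 10
```

Because each step only sees the previous step's output, steps can be added, removed, or reordered without touching the others.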
According to the scheme, the data conversion type of the full data can be flexibly set according to the actual application requirement, so that the multi-dimensional conversion of the data type is realized, and the data synchronization of the full data source information is performed according to the second updated full data file after the data type conversion.
Example two
Fig. 2 is a flowchart of a Hudi-based bank data synchronization method according to a second embodiment of the present invention, where the second embodiment is optimized based on the foregoing embodiment. The specific optimization is as follows: performing data synchronization on incremental data source information, including: determining an incremental data file according to the incremental data source information; and carrying out data synchronization on the incremental data source information according to the incremental data file.
As shown in fig. 2, the method of the present embodiment specifically includes the following steps:
S210, acquiring current data source information, and determining whether a Hudi table matching the current data source information exists; the Hudi table is used for storing historical data files corresponding to historical data source information.
S220, if so, determining that the current data source information is incremental data source information, and determining an incremental data file according to the incremental data source information.
The incremental data file can be used to record the incremental data information in the incremental data source information. In this embodiment, after the current data source information is determined to be incremental data source information, the incremental data information may be extracted from the incremental data source information and written into a single file, which serves as the incremental data file. It is understood that the incremental data file records only the incremental data information corresponding to the incremental data source information, not all of the incremental data source information.
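The text does not say how the incremental data information is identified. As one possible illustration, the Python sketch below assumes that each source record carries an `updated_at` timestamp in ISO 8601 form and that the time of the last synchronization is known; both are assumptions, as are the function name `write_incremental_file` and the JSON Lines file layout.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_incremental_file(records: List[Dict], last_sync: str, path: str) -> int:
    """Write only the records changed after last_sync into one JSON Lines file.

    ISO 8601 timestamps of equal length compare correctly as plain strings.
    """
    changed = [r for r in records if r["updated_at"] > last_sync]
    with open(path, "w", encoding="utf-8") as fh:
        for record in changed:
            fh.write(json.dumps(record) + "\n")
    return len(changed)


# Hypothetical source records; only the second one changed after the last sync.
source_records = [
    {"id": 1, "balance": 100, "updated_at": "2022-10-25T09:00:00"},
    {"id": 2, "balance": 300, "updated_at": "2022-10-26T08:30:00"},
]
incremental_path = os.path.join(tempfile.mkdtemp(), "incremental.jsonl")
written = write_incremental_file(source_records, "2022-10-26T00:00:00", incremental_path)
```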
S230, performing data synchronization on the incremental data source information according to the incremental data file.
In this embodiment, after determining the incremental data file, data synchronization may be further performed on the incremental data source information according to the incremental data file. Optionally, performing data synchronization on the incremental data source information according to the incremental data file, including: determining a target Hudi table matched with the incremental data source information; and updating the target data file in the target Hudi table based on the incremental data file so as to perform data synchronization on the incremental data source information.
The target Hudi table may refer to a Hudi table matched with the incremental data source information. The target data file may refer to a data file in a target Hudi table. In this embodiment, a target Hudi table matched with incremental data source information is first determined, and then a target data file in the target Hudi table is updated based on the incremental data file, so as to implement data synchronization of the incremental data source information.
In this embodiment, optionally, updating the target data file in the target Hudi table based on the incremental data file includes: performing data format conversion on data in the incremental data file based on a preset data format to obtain a first updated incremental data file, where the preset data format is determined based on the format characteristics of the Hudi table; performing data type conversion on data in the first updated incremental data file based on a preset data type to obtain a second updated incremental data file, where the preset data type comprises a data type of at least one dimension; and updating the target data file in the target Hudi table according to the second updated incremental data file.
The first updated incremental data file may refer to the file obtained by performing data format conversion on the incremental data file. The second updated incremental data file may refer to the file obtained by performing data type conversion on the data in the first updated incremental data file.
In this embodiment, data format conversion is first performed on the incremental data file based on the preset data format to obtain a first updated incremental data file in a queue format supported by the Hudi table, in which all data is represented as character strings. Similarly, to ensure data quality, the data in the first updated incremental data file may also be cleaned, including but not limited to removing dirty data such as null values and abnormal values. Data type conversion is then performed on the cleaned data based on the preset data type to obtain a second updated incremental data file. As with the full data, a chain structure can be set up in the data type conversion process to implement fast multi-dimensional conversion of data types, and further data processing can be performed according to the converted data type. Finally, the target data file in the target Hudi table can be updated according to the second updated incremental data file, thereby synchronizing the incremental data source information.
With this scheme, the data type conversion applied to the incremental data can be set flexibly according to actual application requirements, enabling multi-dimensional data type conversion; data synchronization of the incremental data source information is then performed according to the second updated incremental data file obtained after data type conversion.
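Updating the target data file from the incremental data file amounts to an update-or-insert merge keyed on a record identifier. The in-memory Python sketch below illustrates that idea only; the function name `upsert`, the `id` key, and the sample rows are hypothetical, and the source does not state which record key is used.

```python
from typing import Dict, List


def upsert(target: List[Dict], incremental: List[Dict], key: str) -> List[Dict]:
    """Merge incremental rows into target rows by record key (update or insert)."""
    merged = {row[key]: row for row in target}
    for row in incremental:
        merged[row[key]] = row  # overwrite an existing key or add a new one
    return list(merged.values())


target_rows = [{"id": 1, "balance": 100.0}, {"id": 2, "balance": 300.0}]
incremental_rows = [{"id": 2, "balance": 350.0}, {"id": 3, "balance": 40.0}]

# Record 2 is updated in place and record 3 is appended; record 1 is untouched.
updated_rows = upsert(target_rows, incremental_rows, key="id")
```

In an actual deployment this merge would normally be delegated to Hudi's own upsert write operation rather than performed in application memory; the sketch only shows the intended result.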
S240, otherwise, determining that the current data source information is the full data source information, and performing data synchronization on the full data source information.
According to the technical scheme of this embodiment of the invention, current data source information is acquired, and whether a Hudi table matching the current data source information exists is determined, the Hudi table being used for storing historical data files corresponding to historical data source information; if so, the current data source information is determined to be incremental data source information, an incremental data file is determined according to the incremental data source information, and data synchronization is performed on the incremental data source information according to the incremental data file; otherwise, the current data source information is determined to be full data source information, and data synchronization is performed on the full data source information. This scheme supports incremental updates to the full data while meeting the security and auditability requirements of a bank system, achieves a near-real-time data synchronization effect similar to T+0 so that data can be consumed on the same day, effectively improves data synchronization efficiency and data timeliness, and reduces the occupation of resource space.
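The branching in steps S210 to S240 can be summarized as a small dispatcher. The Python sketch below is an assumption-laden illustration: the set `existing_tables` stands in for whatever catalog records the existing Hudi tables, and the two callbacks are hypothetical placeholders for the incremental and full synchronization routines.

```python
from typing import Callable, List, Set


def synchronize(source_name: str,
                existing_tables: Set[str],
                sync_incremental: Callable[[str], None],
                sync_full: Callable[[str], None]) -> str:
    """Route a data source to incremental or full synchronization."""
    if source_name in existing_tables:   # S210/S220: a matching table exists
        sync_incremental(source_name)    # S230: apply only the increment
        return "incremental"
    sync_full(source_name)               # S240: no table yet, load everything
    existing_tables.add(source_name)     # the full load creates the table
    return "full"


calls: List[str] = []
tables = {"customer_accounts"}  # hypothetical table names

first = synchronize("loan_contracts", tables, calls.append, calls.append)
second = synchronize("loan_contracts", tables, calls.append, calls.append)
```

The first call takes the full path because no table exists yet; the second call for the same source takes the incremental path, which mirrors how a source moves from full to incremental synchronization over time.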
Example three
Fig. 3 is a schematic structural diagram of a Hudi-based bank data synchronization apparatus according to a third embodiment of the present invention, which is capable of executing the Hudi-based bank data synchronization method according to any embodiment of the present invention, and has corresponding functional modules and beneficial effects of the execution method. As shown in fig. 3, the apparatus includes:
a current data source information matching module 310, configured to obtain current data source information, and determine whether a Hudi table matching the current data source information exists; the Hudi table is used for storing historical data files corresponding to historical data source information;
an incremental data source data synchronization module 320, configured to, if such a Hudi table exists, determine that the current data source information is incremental data source information, and perform data synchronization on the incremental data source information;
a full data source data synchronization module 330, configured to otherwise determine that the current data source information is full data source information, and perform data synchronization on the full data source information.
Optionally, the current data source information matching module 310 is specifically configured to:
determining first identification information of the current data source information and second identification information of a Hudi table;
and determining whether a Hudi table matched with the current data source information exists according to the matching result of the first identification information and the second identification information.
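The matching of the first identification information against the second identification information can be sketched as a simple lookup. The source excerpt here does not specify what the identification information consists of, so the Python sketch below assumes, purely for illustration, that a data source is identified by a system name plus a table name and that each Hudi table carries an identifier in the same `system.table` form; all names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def source_identifier(source_info: Dict[str, str]) -> str:
    """First identification information, derived from assumed source fields."""
    return f"{source_info['system']}.{source_info['table']}".lower()


def find_matching_table(source_info: Dict[str, str],
                        table_identifiers: Iterable[str]) -> Optional[str]:
    """Return the Hudi table identifier matching the source, or None."""
    first_id = source_identifier(source_info)
    for second_id in table_identifiers:  # second identification information
        if second_id.lower() == first_id:
            return second_id
    return None


hudi_tables = ["core.customer_accounts", "core.loan_contracts"]

match = find_matching_table({"system": "CORE", "table": "Loan_Contracts"}, hudi_tables)
miss = find_matching_table({"system": "core", "table": "card_txn"}, hudi_tables)
```

A non-empty result corresponds to the incremental case, and `None` corresponds to the full case in which a new Hudi table has to be created.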
Optionally, the incremental data source data synchronization module 320 includes:
an incremental data file determining unit, configured to determine an incremental data file according to the incremental data source information;
and the incremental data source data synchronization unit is used for carrying out data synchronization on the incremental data source information according to the incremental data file.
Optionally, the incremental data source data synchronization unit includes:
a target Hudi table determining subunit, configured to determine a target Hudi table that matches the incremental data source information;
and the target data file updating subunit is used for updating the target data file in the target Hudi table based on the incremental data file so as to perform data synchronization on the incremental data source information.
Optionally, the target data file updating subunit is configured to:
performing data format conversion on data in the incremental data file based on a preset data format to obtain a first updated incremental data file; the preset data format is determined based on the format characteristics of a Hudi table;
performing data type conversion on data in the first updated incremental data file based on a preset data type to obtain a second updated incremental data file; the preset data type comprises a data type of at least one dimension;
and updating the target data file in the target Hudi table according to the second updated incremental data file.
Optionally, the full data source data synchronization module 330 includes:
the full data file determining unit is used for determining a full data file according to the full data source information;
and the full data source data synchronization unit is used for carrying out data synchronization on the full data source information based on the full data file.
Optionally, the full data source data synchronization unit is configured to:
performing data format conversion on data in the full data file based on a preset data format to obtain a first updated full data file; the preset data format is determined based on the format characteristics of a Hudi table;
performing data type conversion on data in the first updated full data file based on a preset data type to obtain a second updated full data file; the preset data type comprises a data type of at least one dimension;
and establishing a Hudi table according to the second updated full data file so as to perform data synchronization on the full data source information.
The Hudi-based bank data synchronization device provided by the embodiment of the invention can execute the Hudi-based bank data synchronization method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 4 illustrates a block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as the Hudi-based bank data synchronization method.
In some embodiments, the Hudi-based bank data synchronization method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the Hudi-based bank data synchronization method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the Hudi-based bank data synchronization method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A Hudi-based bank data synchronization method, characterized in that the method comprises:
acquiring current data source information, and determining whether a Hudi table matching the current data source information exists; wherein the Hudi table is used for storing historical data files corresponding to historical data source information;
if so, determining that the current data source information is incremental data source information, and performing data synchronization on the incremental data source information;
otherwise, determining that the current data source information is full data source information, and performing data synchronization on the full data source information.

2. The method according to claim 1, characterized in that determining whether a Hudi table matching the current data source information exists comprises:
determining first identification information of the current data source information and second identification information of a Hudi table;
determining, according to a matching result of the first identification information and the second identification information, whether a Hudi table matching the current data source information exists.

3. The method according to claim 1, characterized in that performing data synchronization on the incremental data source information comprises:
determining an incremental data file according to the incremental data source information;
performing data synchronization on the incremental data source information according to the incremental data file.

4. The method according to claim 3, characterized in that performing data synchronization on the incremental data source information according to the incremental data file comprises:
determining a target Hudi table matching the incremental data source information;
updating a target data file in the target Hudi table based on the incremental data file, so as to perform data synchronization on the incremental data source information.

5. The method according to claim 4, characterized in that updating the target data file in the target Hudi table based on the incremental data file comprises:
performing data format conversion on data in the incremental data file based on a preset data format to obtain a first updated incremental data file; wherein the preset data format is determined based on the format characteristics of the Hudi table;
performing data type conversion on data in the first updated incremental data file based on a preset data type to obtain a second updated incremental data file; wherein the preset data type comprises a data type of at least one dimension;
updating the target data file in the target Hudi table according to the second updated incremental data file.

6. The method according to claim 1, characterized in that performing data synchronization on the full data source information comprises:
determining a full data file according to the full data source information;
performing data synchronization on the full data source information based on the full data file.

7. The method according to claim 6, characterized in that performing data synchronization on the full data source information based on the full data file comprises:
performing data format conversion on data in the full data file based on a preset data format to obtain a first updated full data file; wherein the preset data format is determined based on the format characteristics of the Hudi table;
performing data type conversion on data in the first updated full data file based on a preset data type to obtain a second updated full data file; wherein the preset data type comprises a data type of at least one dimension;
creating a Hudi table according to the second updated full data file, so as to perform data synchronization on the full data source information.

8. A Hudi-based bank data synchronization apparatus, characterized in that the apparatus comprises:
a current data source information matching module, configured to acquire current data source information and determine whether a Hudi table matching the current data source information exists; wherein the Hudi table is used for storing historical data files corresponding to historical data source information;
an incremental data source data synchronization module, configured to, if such a Hudi table exists, determine that the current data source information is incremental data source information, and perform data synchronization on the incremental data source information;
a full data source data synchronization module, configured to otherwise determine that the current data source information is full data source information, and perform data synchronization on the full data source information.

9. A Hudi-based bank data synchronization electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the Hudi-based bank data synchronization method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the Hudi-based bank data synchronization method according to any one of claims 1-7.
CN202211317617.7A 2022-10-26 2022-10-26 Bank data synchronization method and device based on Hudi, electronic equipment and medium Pending CN115599863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211317617.7A CN115599863A (en) 2022-10-26 2022-10-26 Bank data synchronization method and device based on Hudi, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115599863A true CN115599863A (en) 2023-01-13

Family

ID=84850471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211317617.7A Pending CN115599863A (en) 2022-10-26 2022-10-26 Bank data synchronization method and device based on Hudi, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115599863A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118760867A (en) * 2024-09-06 2024-10-11 浙江有数数智科技有限公司 A data management processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019539A (en) * 2017-07-14 2019-07-16 北京京东尚科信息技术有限公司 A kind of method and apparatus that the data of data warehouse are synchronous
CN112612797A (en) * 2020-12-30 2021-04-06 杭州拼便宜网络科技有限公司 Multi-source same-table data loading method, device, equipment and medium
CN114265896A (en) * 2021-11-30 2022-04-01 携程计算机技术(上海)有限公司 Data synchronization method, system, device and medium for multiple data sources
CN114595230A (en) * 2022-03-14 2022-06-07 北京百姓车服网络科技有限公司 Data synchronous warehousing method, module, computing device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination