Detailed Description
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," "target," and the like in the description, the claims, and the drawings are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that elements so designated are interchangeable under appropriate circumstances, such that the embodiments of the invention described herein can operate in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
Example one
Fig. 1 is a flowchart of a Hudi-based bank data synchronization method according to the first embodiment of the present invention. This embodiment is applicable to scenarios in which bank data needs to be synchronized efficiently. The method may be performed by a Hudi-based bank data synchronization apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device with data processing capability. As shown in Fig. 1, the method includes:
S110, acquiring current data source information, and determining whether a Hudi table matching the current data source information exists; the Hudi table is used for storing historical data files corresponding to historical data source information.
The technical solution of this embodiment is based on Hudi (Hadoop Upserts Deletes and Incrementals) and implements incremental updates on full data. Hudi ingests and manages large analytical data sets stored on a DFS (HDFS in a Hadoop system, or other cloud storage), and supports incremental or full updates to the current data table. Hudi organizes a table into a directory structure under a specified base directory (basepath) on HDFS. The second-level directories are partition paths, determined by a user-specified partition key, so that the table is divided into multiple partitions. Each partition directory stores the data files of that partition, by default in Parquet format. In addition, the Hudi-based bank data synchronization apparatus that performs the method may be, but is not limited to, a stand-alone machine or a server cluster.
The data source information may include bank subject data, bank internal enterprise data, and bank external enterprise data. Specifically, data source information is stored in roughly two kinds of formats. The first is structured data, which is usually managed in a relational database such as GBase MPP. The second is unstructured and semi-structured data, which is usually stored and managed in a distributed file system such as HDFS. Bank internal enterprise data may refer to information that is generated within the bank's systems and whose subject is an enterprise; it is usually stored in a database as structured data. Bank external enterprise data may refer to enterprise data that the bank purchases from external third-party companies. The Hudi table is a Hudi data table and can be used to store historical data files corresponding to historical data source information.
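The directory layout described above can be pictured with a small sketch. The base path, partition value, and file name below are hypothetical and chosen only for illustration; real Hudi data files carry generated names, and the table's own metadata directory is omitted here.

```python
from pathlib import PurePosixPath


def partition_dir(base_path: str, partition_value: str) -> PurePosixPath:
    """Return the second-level (partition) directory under the table base path."""
    return PurePosixPath(base_path) / partition_value


def data_file_path(base_path: str, partition_value: str, file_name: str) -> PurePosixPath:
    """Return the location of one data file inside a partition directory."""
    return partition_dir(base_path, partition_value) / file_name


# Hypothetical table partitioned by a date key.
example = data_file_path(
    "/warehouse/bank/enterprise_info", "dt=2022-10-08", "part-0001.parquet"
)
print(example)
```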
It should be noted that, in this embodiment, a data source acquisition period may be preset according to actual application requirements, and data source information may be acquired at that period. For example, depending on how timely the data must be for a given business, the acquisition period may be set to 1 day, 1 month, 1 quarter, or real-time acquisition.
S120, if so, determining that the current data source information is incremental data source information, and performing data synchronization on the incremental data source information.
The incremental data source information may refer to data source information that needs to be synchronized by an incremental update. In this embodiment, if a Hudi table matching the current data source information exists, the current data source information is not being acquired for the first time. In this case, the current data source information may be determined to be incremental data source information, and the incremental data extracted from it is applied to the previously stored data. Data synchronization is thus achieved without rewriting the full data, which improves synchronization efficiency.
Illustratively, a Spark job for updating the incremental data source information may be generated with the Hudi DeltaStreamer utility, and the update of the incremental data source information may be completed by that Spark job. Spark is a fast, general-purpose computing engine designed for large-scale data processing. It can keep the intermediate output of a job in memory instead of reading and writing HDFS, which makes its data processing fast.
In addition, a change log may be generated from the changed content (that is, the incremental data) of the incremental data source information and used for bank auditing. The change log may include the change time, the changed object, the changed content, and the like. For security, the changed content may also be persisted to physical storage, such as a disk or a USB drive, and the persisted data may be transmitted through a dedicated channel. Transfers between multiple data sources can be implemented with the Hadoop command hadoop distcp and the GBase MPP statement SELECT ... INTO OUTFILE.
S130, otherwise, determining that the current data source information is full data source information, and performing data synchronization on the full data source information.
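A change log entry with the three fields named above could be modeled as follows. This is only a sketch: the class name, the field names, the JSON serialization, and the sample values are assumptions made for illustration, not a format prescribed by this embodiment.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeLogEntry:
    change_time: str      # ISO-8601 timestamp of the change
    change_object: str    # what was changed, e.g. a table or record key
    change_content: dict  # the incremental data that was applied


def make_change_log(change_object: str, change_content: dict,
                    now: Optional[datetime] = None) -> ChangeLogEntry:
    """Build one audit entry; `now` can be injected to make output reproducible."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return ChangeLogEntry(timestamp, change_object, change_content)


def to_audit_line(entry: ChangeLogEntry) -> str:
    """Serialize an entry as a single JSON line suitable for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True, ensure_ascii=False)


# Hypothetical change: the rating of enterprise E001 was updated.
entry = make_change_log(
    "enterprise_info:E001",
    {"credit_rating": "AA"},
    now=datetime(2022, 10, 8, tzinfo=timezone.utc),
)
print(to_audit_line(entry))
```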
The full data source information may refer to data source information that needs to be synchronized by a full write. In this embodiment, if no Hudi table matching the current data source information exists, the current data source information is being acquired for the first time. In this case, the current data source information may be determined to be full data source information, and a full data write is performed according to it, so as to synchronize the full data source information. Illustratively, a Spark job for writing the full data source information may be generated with the Hudi DeltaStreamer utility, and the write may be completed by that Spark job.
According to the technical solution of this embodiment, current data source information is acquired, and it is determined whether a Hudi table matching the current data source information exists, the Hudi table being used for storing historical data files corresponding to historical data source information; if so, the current data source information is determined to be incremental data source information and is synchronized as such; otherwise, the current data source information is determined to be full data source information and is synchronized as such. This technical solution enables incremental updates to the full data while meeting the security and auditability requirements of a bank system. It achieves a near-real-time, T+0-like synchronization effect, so that data can be consumed on the day it is produced, effectively improves data synchronization efficiency and data timeliness, and reduces the storage space occupied.
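The branch between steps S120 and S130 can be sketched as a choice of Hudi write operation. Note that this sketch swaps in the Hudi Spark datasource write options in place of the DeltaStreamer utility mentioned above; the table and field names are hypothetical, and choosing bulk_insert for the first full load is an assumption rather than something this embodiment specifies.

```python
from typing import Dict


def hudi_write_options(table_name: str, record_key: str, partition_field: str,
                       precombine_field: str, table_exists: bool) -> Dict[str, str]:
    """Pick the write operation from the result of the table-existence check."""
    # Existing table -> incremental data -> upsert; no table -> first full load.
    operation = "upsert" if table_exists else "bulk_insert"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


# Hypothetical table and columns.
incremental_opts = hudi_write_options("enterprise_info", "enterprise_id", "dt", "updated_at", True)
full_opts = hudi_write_options("enterprise_info", "enterprise_id", "dt", "updated_at", False)
print(incremental_opts["hoodie.datasource.write.operation"])
print(full_opts["hoodie.datasource.write.operation"])
```

With PySpark and the Hudi Spark bundle available, such a dictionary would typically be passed to a DataFrame writer via df.write.format("hudi").options(**opts) before saving to the table base path; that call is not executed in this sketch.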
In this embodiment, optionally, determining whether there is a Hudi table matching with the current data source information includes: determining first identification information of current data source information and second identification information of a Hudi table; and determining whether a Hudi table matched with the current data source information exists according to the matching result of the first identification information and the second identification information.
The first identification information may be used to uniquely identify the current data source information. The second identification information may be used to uniquely identify the Hudi table. It should be noted that, in this embodiment, the expression form of the first identification information and the second identification information is not limited at all, and may be set according to the actual application requirement. For example, the first identification information and the second identification information may be represented in the form of a character string, and the character string may include one or more of numbers, letters, and special characters.
In this embodiment, whether a Hudi table matching the current data source information exists may be determined by matching identification information. Specifically, the first identification information of the current data source information and the second identification information of each Hudi table are determined, the first identification information is matched against each piece of second identification information, and the existence of a matching Hudi table is determined from the matching result. If some second identification information successfully matches the first identification information, it is determined that a Hudi table matching the current data source information exists; if none does, it is determined that no Hudi table matching the current data source information exists.
It should be noted that the method of matching identification information is not limited in any way and may be set according to actual requirements. Illustratively, the matching may be implemented with a similarity calculation. For example, the similarity between pieces of identification information may be calculated using cosine similarity, matrix similarity, or string edit distance.
By means of the scheme, whether the Hudi table matched with the current data source information exists can be judged quickly and accurately through the identification information, and therefore whether the current data source information is the data source information obtained for the first time or not is determined according to the matching result.
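The identification matching described above can be sketched with string edit distance, one of the similarity measures mentioned. The normalization of the distance into a 0 to 1 score, the threshold parameter, and the sample identifiers are assumptions for illustration; with the default threshold of 1.0 the function reduces to exact matching.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def find_matching_table(source_id: str, table_ids: Iterable[str],
                        threshold: float = 1.0) -> Optional[str]:
    """Return the best-matching table identifier, or None if no table matches."""
    best_id, best_score = None, -1.0
    for table_id in table_ids:
        score = similarity(source_id, table_id)
        if score >= threshold and score > best_score:
            best_id, best_score = table_id, score
    return best_id


# A match means incremental synchronization; None means a first, full load.
print(find_matching_table("ent_info", ["cust_info", "ent_info"]))
print(find_matching_table("ent_info", ["loan_flow"]))
```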
In this embodiment, optionally, the data synchronization of the full data source information includes: determining a full data file according to the full data source information; and carrying out data synchronization on the full data source information based on the full data file.
Wherein, the full data file can be used for recording the full data source information. In this embodiment, after the current data source information is determined as the full data source information, the full data source information may be all written into the same file, so as to determine the full data file, and then data synchronization is performed on the full data source information based on the full data file.
According to the scheme, the full data source information is completely written into the full data file, data synchronization is carried out on the full data source information based on the full data file, and the full data can be stored and managed uniformly and conveniently through the full data file.
In this embodiment, optionally, performing data synchronization on the full data source information based on the full data file includes: performing data format conversion on data in the full data file based on a preset data format to obtain a first updated full data file; the preset data format is determined based on format characteristics of a Hudi table; performing data type conversion on data in the first updated full data file based on a preset data type to obtain a second updated full data file; the preset data type comprises a data type of at least one dimension; and creating a Hudi table according to the second updated full data file so as to perform data synchronization on the full data source information.
The preset data format may refer to a data format set in advance. Specifically, it may be determined based on the format characteristics of the Hudi table, so that it is a data format the Hudi table supports. For example, the preset data format may be the Parquet format. The first updated full data file may refer to the file obtained by performing data format conversion on the full data file. The preset data type may refer to a data type set in advance. It may include a data type of at least one dimension and may be set according to actual application requirements; this embodiment does not limit it. For example, the preset data types may include data types of different dimensions such as decimal, datetime, uuid, timestamp, or struct. The second updated full data file may refer to the file obtained by performing data type conversion on the data in the first updated full data file.
In this embodiment, data format conversion is first performed on the full data file based on the preset data format to obtain a first updated full data file in the Parquet format supported by the Hudi table; at this stage all data in the first updated full data file is represented as character strings. To ensure data quality, the data in the first updated full data file may further be cleaned, including but not limited to removing dirty data such as null values and abnormal values. Data type conversion is then performed on the cleaned data based on the preset data type to obtain a second updated full data file. Finally, a Hudi table is created according to the second updated full data file, and the second updated full data file is stored in the newly created Hudi table, so that the full data source information is synchronized.
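The cleaning and type conversion stages described above can be sketched on in-memory rows. Everything specific here is an assumption for illustration: the column names, the schema mapping, the set of values treated as null, and the choice to treat unparseable values as abnormal. The format conversion and the creation of the Hudi table are not shown.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, List

Row = Dict[str, str]  # after format conversion every value is still a string

# Hypothetical preset data types, one converter per column.
SCHEMA: Dict[str, Callable[[str], object]] = {
    "enterprise_id": str,
    "balance": Decimal,
    "updated_at": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}

NULL_MARKERS = (None, "", "NULL")


def is_clean(row: Row) -> bool:
    """A row is kept only if every schema column is present and non-null."""
    return all(row.get(col) not in NULL_MARKERS for col in SCHEMA)


def convert(row: Row) -> Dict[str, object]:
    """Apply the preset data type of each column to its string value."""
    return {col: cast(row[col]) for col, cast in SCHEMA.items()}


def clean_and_convert(rows: List[Row]) -> List[Dict[str, object]]:
    result = []
    for row in rows:
        if not is_clean(row):
            continue  # drop rows with null values
        try:
            result.append(convert(row))
        except (ValueError, InvalidOperation):
            continue  # drop rows whose values cannot be parsed (abnormal values)
    return result


rows = [
    {"enterprise_id": "E001", "balance": "1024.50", "updated_at": "2022-10-08 09:30:00"},
    {"enterprise_id": "E002", "balance": "", "updated_at": "2022-10-08 09:31:00"},
    {"enterprise_id": "E003", "balance": "abc", "updated_at": "2022-10-08 09:32:00"},
]
converted = clean_and_convert(rows)
print(len(converted))
```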
It should be noted that, in the data type conversion process, a chain structure may be set up to implement fast multi-dimensional conversion of data types. The chain structure may perform data type conversion sequentially and/or in parallel, further data processing may be performed on the converted data, and the corresponding processing mode may be flexibly set and modified according to actual requirements. Specifically, the data processing may operate on data of one dimension's data type or on data of multiple dimensions' data types. For example, assume that data type conversion yields datetime data such as 10/8/2022. If the preset data processing mode is to extract the month, then 10 is extracted from the components 10, 8, and 2022 as the data processing result. When data of different dimensions' data types is processed, the processing result for one dimension's data type may be used as an input parameter of the processing flow for another dimension's data type.
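A sequential form of such a chain can be sketched as function composition, using the month extraction example from the text. The helper names are hypothetical, the date string is assumed to be in month/day/year order (as the example's result of 10 implies), and the quarter step is an added illustration of feeding one step's result into the next; the parallel mode is not shown.

```python
from datetime import datetime
from functools import reduce
from typing import Any, Callable, Sequence

Step = Callable[[Any], Any]


def chain(steps: Sequence[Step]) -> Step:
    """Compose steps so that each step's output is the next step's input."""
    def run(value: Any) -> Any:
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


def to_datetime(text: str) -> datetime:
    """Type conversion: string -> datetime, assuming month/day/year order."""
    return datetime.strptime(text, "%m/%d/%Y")


def extract_month(moment: datetime) -> int:
    """Data processing on the converted value: take the month component."""
    return moment.month


def month_to_quarter(month: int) -> int:
    """A further step that consumes the previous step's result."""
    return (month - 1) // 3 + 1


month_of = chain([to_datetime, extract_month])
quarter_of = chain([to_datetime, extract_month, month_to_quarter])

print(month_of("10/8/2022"))    # 10
print(quarter_of("10/8/2022"))  # 4
```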
According to the scheme, the data conversion type of the full data can be flexibly set according to the actual application requirement, so that the multi-dimensional conversion of the data type is realized, and the data synchronization of the full data source information is performed according to the second updated full data file after the data type conversion.
Example two
Fig. 2 is a flowchart of a Hudi-based bank data synchronization method according to a second embodiment of the present invention, where the second embodiment is optimized based on the foregoing embodiment. The specific optimization is as follows: performing data synchronization on incremental data source information, including: determining an incremental data file according to the incremental data source information; and carrying out data synchronization on the incremental data source information according to the incremental data file.
As shown in Fig. 2, the method of this embodiment specifically includes the following steps:
S210, acquiring current data source information, and determining whether a Hudi table matching the current data source information exists; the Hudi table is used for storing historical data files corresponding to historical data source information.
S220, if so, determining that the current data source information is incremental data source information, and determining an incremental data file according to the incremental data source information.
The incremental data file can be used for recording the incremental data information in the incremental data source information. In this embodiment, after the current data source information is determined to be incremental data source information, the incremental data information may be extracted from it and written into a single file, which is the incremental data file. It is understood that the incremental data file records only the incremental data information corresponding to the incremental data source information, not all of the incremental data source information.
S230, performing data synchronization on the incremental data source information according to the incremental data file.
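How the incremental data information is extracted is not specified here. One common approach, shown purely as an assumption, is to keep only the records changed since the previous synchronization; the updated_at field, the record layout, and the sample values are hypothetical.

```python
from datetime import datetime
from typing import Dict, List

Record = Dict[str, object]


def extract_increment(records: List[Record], last_sync: datetime) -> List[Record]:
    """Keep only records modified after the previous synchronization time."""
    return [r for r in records if r["updated_at"] > last_sync]


source_records = [
    {"enterprise_id": "E001", "rating": "AA", "updated_at": datetime(2022, 10, 8, 9, 0)},
    {"enterprise_id": "E002", "rating": "B", "updated_at": datetime(2022, 10, 6, 9, 0)},
]
increment = extract_increment(source_records, last_sync=datetime(2022, 10, 7))
print([r["enterprise_id"] for r in increment])
```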
In this embodiment, after determining the incremental data file, data synchronization may be further performed on the incremental data source information according to the incremental data file. Optionally, performing data synchronization on the incremental data source information according to the incremental data file, including: determining a target Hudi table matched with the incremental data source information; and updating the target data file in the target Hudi table based on the incremental data file so as to perform data synchronization on the incremental data source information.
The target Hudi table may refer to a Hudi table matched with the incremental data source information. The target data file may refer to a data file in a target Hudi table. In this embodiment, a target Hudi table matched with incremental data source information is first determined, and then a target data file in the target Hudi table is updated based on the incremental data file, so as to implement data synchronization of the incremental data source information.
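The effect of updating the target data file with the incremental data file can be sketched as a keyed merge on in-memory records. This is a simplified model of upsert semantics, not how Hudi rewrites files on storage; the record key, the ordering field, and the sample records are assumptions.

```python
from typing import Dict, List

Record = Dict[str, object]


def upsert(target: Dict[str, Record], increment: List[Record],
           key: str = "enterprise_id", ordering: str = "updated_at") -> Dict[str, Record]:
    """Return a new table state: update matching keys, insert new keys."""
    merged = dict(target)  # leave the caller's table untouched
    for record in increment:
        existing = merged.get(record[key])
        # Apply the change unless the stored record is strictly newer.
        if existing is None or record[ordering] >= existing[ordering]:
            merged[record[key]] = record
    return merged


target_table = {
    "E001": {"enterprise_id": "E001", "rating": "A", "updated_at": 1},
    "E002": {"enterprise_id": "E002", "rating": "B", "updated_at": 1},
}
incremental_records = [
    {"enterprise_id": "E001", "rating": "AA", "updated_at": 2},  # update
    {"enterprise_id": "E003", "rating": "C", "updated_at": 2},   # insert
]
synced = upsert(target_table, incremental_records)
print(sorted(synced))
```

Only the two changed keys are touched; the record for E002 is carried over as-is, which is the sense in which the full data does not need to be rewritten.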
In this embodiment, optionally, updating the target data file in the target Hudi table based on the incremental data file includes: performing data format conversion on data in the incremental data file based on a preset data format to obtain a first updated incremental data file; the preset data format is determined based on the format characteristics of the Hudi table; performing data type conversion on data in the first updated incremental data file based on a preset data type to obtain a second updated incremental data file; the preset data type comprises a data type of at least one dimension; and updating the target data file in the target Hudi table according to the second updated incremental data file.
The first updated incremental data file may refer to the file obtained by performing data format conversion on the incremental data file. The second updated incremental data file may refer to the file obtained by performing data type conversion on the data in the first updated incremental data file.
In this embodiment, data format conversion is first performed on the incremental data file based on the preset data format to obtain a first updated incremental data file in the Parquet format supported by the Hudi table; at this stage all data in the first updated incremental data file is represented as character strings. Similarly, to ensure data quality, the data in the first updated incremental data file may further be cleaned, including but not limited to removing dirty data such as null values and abnormal values. Data type conversion is then performed on the cleaned data based on the preset data type to obtain a second updated incremental data file. Similarly, in the data type conversion process, a chain structure may be set up to implement fast multi-dimensional conversion of data types, and further data processing may be performed on the converted data. Finally, the target data file in the target Hudi table is updated according to the second updated incremental data file, so that the incremental data source information is synchronized.
According to the scheme, the data conversion type of the incremental data can be flexibly set according to the actual application requirement, so that the multi-dimensional conversion of the data type is realized, and the data synchronization of the incremental data source information is performed according to the second updated incremental data file after the data type conversion.
S240, otherwise, determining that the current data source information is the full data source information, and performing data synchronization on the full data source information.
According to the technical solution of this embodiment, current data source information is acquired, and it is determined whether a Hudi table matching the current data source information exists, the Hudi table being used for storing historical data files corresponding to historical data source information; if so, the current data source information is determined to be incremental data source information, an incremental data file is determined according to the incremental data source information, and the incremental data source information is synchronized according to the incremental data file; otherwise, the current data source information is determined to be full data source information and is synchronized as such. This technical solution enables incremental updates to the full data while meeting the security and auditability requirements of a bank system. It achieves a near-real-time, T+0-like synchronization effect, so that data can be consumed on the day it is produced, effectively improves data synchronization efficiency and data timeliness, and reduces the storage space occupied.
Example three
Fig. 3 is a schematic structural diagram of a Hudi-based bank data synchronization apparatus according to a third embodiment of the present invention, which is capable of executing the Hudi-based bank data synchronization method according to any embodiment of the present invention, and has corresponding functional modules and beneficial effects of the execution method. As shown in fig. 3, the apparatus includes:
a current data source information matching module 310, configured to obtain current data source information, and determine whether a Hudi table matching the current data source information exists; the Hudi table is used for storing historical data files corresponding to historical data source information;
an incremental data source data synchronization module 320, configured to determine that the current data source information is incremental data source information if such a Hudi table exists, and perform data synchronization on the incremental data source information;
a full data source data synchronization module 330, configured to determine that the current data source information is full data source information if no such Hudi table exists, and perform data synchronization on the full data source information.
Optionally, the current data source information matching module 310 is specifically configured to:
determining first identification information of the current data source information and second identification information of a Hudi table;
and determining whether a Hudi table matched with the current data source information exists according to the matching result of the first identification information and the second identification information.
Optionally, the incremental data source data synchronization module 320 includes:
an incremental data file determining unit, configured to determine an incremental data file according to the incremental data source information;
and the incremental data source data synchronization unit is used for carrying out data synchronization on the incremental data source information according to the incremental data file.
Optionally, the incremental data source data synchronization unit includes:
a target Hudi table determining subunit, configured to determine a target Hudi table that matches the incremental data source information;
and the target data file updating subunit is used for updating the target data file in the target Hudi table based on the incremental data file so as to perform data synchronization on the incremental data source information.
Optionally, the target data file updating subunit is configured to:
performing data format conversion on data in the incremental data file based on a preset data format to obtain a first updated incremental data file; the preset data format is determined based on the format characteristics of a Hudi table;
performing data type conversion on data in the first updated incremental data file based on a preset data type to obtain a second updated incremental data file; the preset data type comprises a data type of at least one dimension;
and updating the target data file in the target Hudi table according to the second updated incremental data file.
Optionally, the full-data source data synchronization module 330 includes:
the full data file determining unit is used for determining a full data file according to the full data source information;
and the full data source data synchronization unit is used for carrying out data synchronization on the full data source information based on the full data file.
Optionally, the full data source data synchronization unit is configured to:
performing data format conversion on data in the full data file based on a preset data format to obtain a first updated full data file; the preset data format is determined based on the format characteristics of a Hudi table;
performing data type conversion on data in the first updated full data file based on a preset data type to obtain a second updated full data file; the preset data type comprises a data type of at least one dimension;
and creating a Hudi table according to the second updated full data file so as to perform data synchronization on the full data source information.
The Hudi-based bank data synchronization device provided by the embodiment of the invention can execute the Hudi-based bank data synchronization method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 4 illustrates a block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as the Hudi-based banking data synchronization method.
In some embodiments, the Hudi-based bank data synchronization method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the Hudi-based bank data synchronization method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the Hudi-based bank data synchronization method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, provided that the desired results of the technical solution of the present invention can be achieved; no limitation is imposed herein.
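As an illustration only, the following minimal Python sketch shows how steps that do not depend on one another can be run sequentially, in a different order, or in parallel while yielding the same result. The step functions `count_records` and `sum_amounts`, the `SAMPLE` data, and the runner functions are hypothetical names introduced for this sketch; they are not steps of the method described in the present invention.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, mutually independent steps (illustrative only; these are
# not the steps of the synchronization method itself).
def count_records(records):
    return len(records)

def sum_amounts(records):
    return sum(r["amount"] for r in records)

STEPS = [count_records, sum_amounts]

def run_sequentially(records, steps=STEPS):
    # Execute the steps one after another in the listed order.
    return {step.__name__: step(records) for step in steps}

def run_reordered(records, steps=STEPS):
    # Execute the same steps in the reverse order.
    return {step.__name__: step(records) for step in reversed(steps)}

def run_in_parallel(records, steps=STEPS):
    # Submit every step to a thread pool and collect the results.
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {step.__name__: pool.submit(step, records) for step in steps}
        return {name: future.result() for name, future in futures.items()}

# Assumed sample data for the sketch.
SAMPLE = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
```

Because the two steps only read their input and share no state, all three runners return equal results for the same records. Steps that depend on the output of an earlier step would instead need to keep their relative order.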
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.