CN106156359B - A kind of data synchronization updating method under cloud computing platform - Google Patents
Classifications
- G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
- G06F16/273 — Asynchronous replication or reconciliation
- H04L67/1095 — Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
- H04L67/1097 — Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Abstract
The invention proposes a data synchronization updating method under a cloud computing platform, comprising: 1. constructing a cloud data backup system based on the Hadoop distributed file system, physically divided into clients, backup servers, and a Hadoop distributed file system cluster; 2. each client stores the information of the backup server that serves it, and sends the corresponding request to that backup server when backup or recovery is needed; 3. the backup server receives the client's request and carries out the backup and recovery of files. The proposed method improves the efficiency of backing up and updating files.
Description
Technical Field
The invention relates to the field of cloud computing, in particular to a data synchronization updating method under a cloud computing platform.
Background
In recent years, with the rapid development of information technology and the explosive growth of information, data processing has become part of people's daily life, study, and work. While this development facilitates the exchange and sharing of information, it also adds to people's workload. The rapid development and efficient application of computer technology has made it possible to process massive amounts of data quickly. Synchronous backup of data is therefore important for enterprises and individuals to better ensure the security and integrity of their data.
For most of small and medium-sized enterprises, a large amount of data is stored in a local database and a server cluster, and developers or managers synchronously backup the data in an automatic or manual mode through a database management system. The existing database synchronous backup technology mainly comprises a medium transmission technology, a data replication technology, a backup technology owned by database software and the like. The realization of the technologies requires that a local server disk has a certain data storage space to store the data after synchronous backup, and managers need to regularly maintain system software and a hard disk to prevent the data from being damaged and lost. Among them, the medium transmission technology is not widely used because it requires a long time delay to store the update data in the physical medium.
For individual users, data is primarily stored on mobile devices and computer hard disks, and personal data is synchronously backed up mainly through mobile devices and synchronization software. Users who often travel between remote locations must either carry these devices with them or rely on software with backup capabilities to store their data. The information on a mobile device must be checked and collated periodically to ensure data security, because the frequency and age of use of the device affect the efficiency of data transmission and the reliability of synchronous backup. Similarly, a user who uploads and downloads personal data with synchronization software must install and configure a software client on each different PC in order to receive personal data such as mail and web page favorites, which is more cumbersome than direct browser access.
In view of the shortcomings of mass data storage and existing synchronous backup methods, cloud computing, as an emerging computing industry, frees users from the management and maintenance of equipment and provides massive cloud storage space; its pay-as-needed, available-on-demand characteristics can largely satisfy current computing applications. As the frequency of and requirements for synchronized data backup increase, users would otherwise need to keep upgrading hardware to accommodate the growing data. The powerful platform and massive space provided by cloud computing not only reduce the labor cost of user maintenance and the equipment cost of data storage, but also allow users, especially ordinary PC users without specialized data management skills, to perform synchronous backup of data through a browser via the application program interface deployed on the cloud platform.
Cloud computing is a product of fusion of computer technologies and network technologies such as grid computing, utility computing, network storage, virtualization, load balancing and the like. It integrates all computing resources and realizes automatic management by software, and users do not need to participate in actual management. This makes enterprises and individuals unnecessarily bother with computing power and storage, as well as with management of these resources, and can focus more on their business processes, which is beneficial to innovation and cost reduction. As a service providing mode with on-demand service, high cost performance and transparent resources, the essential point is that the personalized and multi-level service requirements of different users are met through pooling of resources such as calculation, storage, transmission and the like of the Internet. The cloud computing provides a reliable and safe data storage center, and users do not need to worry about serious problems such as data loss, virus invasion and the like; meanwhile, the cloud computing can be applied to various user terminal devices, and terminals such as computers, mobile phones and televisions can be accessed; in addition, cloud computing can easily realize data and application sharing among different devices, and more importantly, cloud computing also provides infinite possibilities in terms of network use.
At present, the problem of low efficiency of data synchronization under a cloud computing platform exists.
Disclosure of Invention
The invention at least partially solves the problems in the prior art, and provides a data synchronization updating method under a cloud computing platform, which comprises the following steps:
1. the cloud data backup system based on the Hadoop distributed file system is constructed and physically divided into a client, a backup server and a Hadoop distributed file system cluster;
2. the client stores the information of the backup server providing service for the local computer, and sends a corresponding request to the backup server when backup or recovery is needed;
3. the backup server receives a request of a client side and performs file backup and recovery;
wherein,
the client is a plurality of computer nodes needing data backup/recovery service in an enterprise, and is divided into a plurality of groups according to regions and system categories, when data backup or recovery is needed, the client makes a request to a backup server in charge of the group, and file backup and recovery operation is performed after permission is obtained; the client is used for realizing data backup and recovery, including file packing, compression strategies, and data backup and recovery;
the backup server is a bridge for data backup and recovery between the client and the Hadoop distributed file system cluster, and is composed of a plurality of high-performance and large-storage-capacity servers, and each server is responsible for one client cluster. The client side receives a backup recovery request of the client side, caches backup data of the client side, respectively merges, divides and compresses the backup data according to different conditions of the backup data, uploads the merged backup data to a Hadoop distributed file system cluster for backup, simultaneously saves a mapping table of backup files of the client side, reads the backup files from the Hadoop distributed file system cluster when the client side puts forward the recovery request, and sends the backup files to the client side according to the file mapping table;
the Hadoop distributed file system cluster consists of a computer provided with Hadoop distributed file system software, and under the framework of the Hadoop distributed file system software, uploading and downloading services are provided for a plurality of backup servers through configuration, so that the core function of the system is realized;
the Hadoop distributed file system cluster adopts a master/slave structure and consists of a name node Namenode and a certain number of data nodes Datanodes, wherein the Namenode is used as a central server and is responsible for managing a namespace (namespace) of a file system and accessing files by clients; the Namenode executes namespace operations of opening, closing, renaming files or directories of the file system; the data node is used for storing data, is configured by a large number of cheap computers in the enterprise, and can be dynamically expanded according to the scale of backup data. The file is divided into one or more data blocks at the time of backup, and the data blocks are stored on a group of dataodes; the dataode is responsible for processing read-write requests of the file system client and performing operations such as creation, deletion and copying of data blocks under unified scheduling of the Namenode.
Preferably, the backup server comprises the following specific functional modules:
(1) the backup management module: the core functional module of the system, mainly responsible for file backup management;
(2) the recovery management module: responsible for recovering backup files;
(3) the security management module: its functions include controlling the transmission security and storage security of files, and authenticating and authorizing clients;
(4) the directory management module: responsible for client management and backup file directory management; a file backup information table manages the directories of the backup files, and a client information table manages all clients for which the backup server is responsible;
(5) the user interface module: provides a friendly user interface for displaying and configuring backup operation information, so that a user can select a backup mode as required;
(6) the synchronization processing module: mainly responsible for file synchronization; it monitors changes to client files, performs synchronization between the client and the Hadoop distributed file system cluster, and synchronously updates the corresponding files on the cluster when a client file is detected to have changed.
Preferably, when the file of the client is monitored to be changed, the corresponding file on the Hadoop distributed file system cluster is synchronously updated in the following mode:
1. When client file CF_old is monitored to have changed into file CF_new, the changed file's ID is sent to the Hadoop distributed file system cluster;
2. According to the file ID sent by the client, the Hadoop distributed file system cluster divides the file SF_old corresponding to CF_old into blocks of size B, where SF_old[(i-1)B, iB-1] denotes the file content from offset address (i-1)B to iB-1, i takes values in [1, 2, 3, ..., N], and N is the number of blocks into which SF_old is divided; it then computes two hash values for each block B_i: q_i = h_q(B_i) and r_i = h_m(B_i), where h_q(B_i) denotes the Adler-32 checksum of block B_i and h_m(B_i) denotes the MD5 checksum of block B_i, and sends the two checksums to the client;
3. The client receives the two hash values (q_i, r_i) of each block from the Hadoop distributed file system cluster and builds a hash table;
4. The client traverses file CF_new, starting from offset address j = 0 and repeatedly performing the following steps 4.1 to 4.4:
4.1 compute h_q(CF_new[j, j+B-1]);
4.2 look up the hash table for a matching hash value;
4.3 if a matching hash value is found, compute h_m(CF_new[j, j+B-1]); if h_m also matches, send the offset address j of the block and the block's size information to the distributed file system cluster, and set j = j + B;
4.4 if no matching hash value is found, or h_m does not match, send CF_new[j], the content of file CF_new at offset address j, to the Hadoop distributed file system cluster, and set j = j + 1;
5. The Hadoop distributed file system cluster constructs the file SF_new corresponding to CF_new from the content sent by the client and SF_old.
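The block-matching procedure above is essentially the rsync algorithm with an Adler-32 weak hash and an MD5 strong hash. A minimal Python sketch follows, with the simplifying assumption that a match reports the block's offset in SF_old (which is what the cluster needs in order to copy the block when rebuilding SF_new); all function names are illustrative:

```python
import hashlib
import zlib

def block_signatures(old: bytes, B: int):
    """Cluster side (steps 1-2): split SF_old into blocks of size B and
    compute an (Adler-32, MD5) signature pair for each block."""
    sigs = {}
    for i in range(0, len(old), B):
        block = old[i:i + B]
        weak = zlib.adler32(block)
        strong = hashlib.md5(block).hexdigest()
        sigs.setdefault(weak, {})[strong] = i  # hash table keyed by weak hash
    return sigs

def compute_delta(new: bytes, sigs, B: int):
    """Client side (step 4): traverse CF_new, emitting either a matched
    (old offset, size) pair or a single literal byte, as in 4.3/4.4."""
    delta, j = [], 0
    while j < len(new):
        window = new[j:j + B]
        weak = zlib.adler32(window)
        strong_map = sigs.get(weak)
        if strong_map is not None:
            strong = hashlib.md5(window).hexdigest()
            if strong in strong_map:           # both hashes match (step 4.3)
                delta.append(("match", strong_map[strong], len(window)))
                j += B
                continue
        delta.append(("lit", new[j:j + 1]))    # no match: send one byte (4.4)
        j += 1
    return delta

def apply_delta(old: bytes, delta) -> bytes:
    """Cluster side (step 5): rebuild SF_new from SF_old and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "match":
            _, off, size = op
            out += old[off:off + size]         # copy the matched block
        else:
            out += op[1]                       # append the literal byte
    return bytes(out)
```

Only changed regions travel as literal bytes; unchanged blocks are transmitted as a few bytes of bookkeeping, which is the source of the claimed update efficiency.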
Preferably, when the file of the client is monitored to be changed, the corresponding file on the Hadoop distributed file system cluster is synchronously updated in the following mode:
1. When client file CF_old is monitored to have changed into file CF_new, the changed file's ID is sent to the Hadoop distributed file system cluster;
2. According to the file ID sent by the client, the Hadoop distributed file system cluster divides the file SF_old corresponding to CF_old into blocks of size B, where SF_old[(i-1)B, iB-1] denotes the file content from offset address (i-1)B to iB-1, i takes values in [1, 2, 3, ..., N], and N is the number of blocks into which SF_old is divided; it then computes two hash values for each block B_i: q_i = h_q(B_i) and r_i = h_m(B_i), where h_q(B_i) denotes the Adler-32 checksum of block B_i and h_m(B_i) denotes the MD5 checksum of block B_i, and sends the two checksums to the client;
3. The client receives the two hash values (q_i, r_i) of each block from the Hadoop distributed file system cluster and builds a hash table;
4. The client traverses file CF_new, starting from offset address j = 0 and repeatedly performing the following steps 4.1 to 4.4:
4.1 compute h_q(CF_new[j, j+B-1]);
4.2 look up the hash table for a matching hash value;
4.3 if a matching hash value is found, compute h_m(CF_new[j, j+B-1]); if h_m also matches, store the offset address j of the block and the block's size information into a list MatchList, and set j = j + B;
4.4 if no matching hash value is found, or h_m does not match, store CF_new[j], the content of file CF_new at offset address j, into the list MatchList; then judge whether the total size of the CF_new[j] contents stored in MatchList has reached CK, the minimum storage unit of the Hadoop distributed file system cluster; if so, send the contents of MatchList to the Hadoop distributed file system cluster and continue with the following operations, otherwise continue directly; set j = j + 1;
5. The Hadoop distributed file system cluster constructs the file SF_new corresponding to CF_new from the content sent by the client and SF_old.
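This second variant differs from the first only in batching: traversal output accumulates in a MatchList and is flushed once a minimum storage unit CK worth of literal bytes has built up, reducing the number of transmissions to the cluster. A hedged sketch, where the `send` callback and the signature-table layout are assumptions:

```python
import hashlib
import zlib

def build_signatures(old: bytes, B: int):
    """Signature table as in steps 1-2: weak (Adler-32) hash keyed,
    mapping strong (MD5) hash to the block's offset in SF_old."""
    sigs = {}
    for i in range(0, len(old), B):
        blk = old[i:i + B]
        sigs.setdefault(zlib.adler32(blk), {})[hashlib.md5(blk).hexdigest()] = i
    return sigs

def batched_delta(new: bytes, sigs, B: int, CK: int, send):
    """Client-side traversal with MatchList batching (step 4): matches and
    literal bytes accumulate in match_list, which is flushed via `send`
    whenever CK literal bytes have been buffered (step 4.4)."""
    match_list, lit_bytes, j = [], 0, 0
    while j < len(new):
        window = new[j:j + B]
        hit = sigs.get(zlib.adler32(window), {}).get(
            hashlib.md5(window).hexdigest())
        if hit is not None:
            match_list.append(("match", hit, len(window)))  # step 4.3
            j += B
        else:
            match_list.append(("lit", new[j:j + 1]))        # step 4.4
            lit_bytes += 1
            j += 1
            if lit_bytes >= CK:        # a full storage unit is buffered: flush
                send(match_list)
                match_list, lit_bytes = [], 0
    if match_list:                     # flush the remainder at end of file
        send(match_list)
```

Batching trades a little latency for fewer, larger messages, which suits a file system whose smallest storage unit is CK.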
The invention provides a new method for backing up and updating files based on a cloud computing platform, and the efficiency of backing up and updating files is improved.
Drawings
Fig. 1 is a flowchart of a data synchronization updating method under a cloud computing platform according to the present invention;
Detailed Description
The technical solution of the present invention will be described clearly and completely below with reference to the accompanying drawings. Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Referring to fig. 1, the present invention provides a data synchronization updating method under a cloud computing platform, including:
1. the cloud data backup system based on the Hadoop distributed file system is constructed and physically divided into a client, a backup server and a Hadoop distributed file system cluster;
the client is a plurality of computer nodes needing data backup/recovery service in an enterprise, and is divided into a plurality of groups according to regions, system types and the like, when data backup or recovery is needed, the client makes a request to a backup server in charge of the group, and file backup and recovery operation is carried out after permission is obtained. The client is used for realizing data backup and recovery, including file packaging and compression strategies, and data backup and recovery.
The backup server is a bridge for data backup and recovery between the clients and the Hadoop distributed file system cluster. It consists of several high-performance, large-capacity servers, each responsible for one client group. The backup server receives clients' backup and recovery requests, caches client backup data, and merges, splits, and compresses the backup data according to its characteristics before uploading it to the Hadoop distributed file system cluster for backup, while maintaining a mapping table of the clients' backup files; when a client issues a recovery request, the backup server reads the backup file from the Hadoop distributed file system cluster and sends it to the client according to the file mapping table.
The backup server comprises the following specific functional modules:
(1) the backup management module: the core functional module of the system, mainly responsible for file backup management;
(2) the recovery management module: responsible for recovering backup files;
(3) the security management module: its functions include controlling the transmission security and storage security of files, and authenticating and authorizing clients;
(4) the directory management module: responsible for client management and backup file directory management; a file backup information table manages the directories of the backup files, and a client information table manages all clients for which the backup server is responsible;
(5) the user interface module: provides a friendly user interface for displaying and configuring backup operation information, so that a user can select a backup mode as required;
(6) the synchronization processing module: mainly responsible for file synchronization; it monitors changes to client files, performs synchronization between the client and the Hadoop distributed file system cluster, and synchronously updates the corresponding files on the cluster when a client file is detected to have changed.
The Hadoop distributed file system cluster is composed of computers provided with Hadoop distributed file system software, and under the framework of the Hadoop distributed file system software, uploading and downloading services are provided for a plurality of backup servers through configuration, so that the core function of the system is realized.
The Hadoop distributed file system cluster adopts a master/slave structure and consists of a name node (NameNode) and a number of data nodes (DataNodes). The NameNode acts as the central server, responsible for managing the file system namespace and clients' access to files; it executes namespace operations such as opening, closing, and renaming files or directories. The data nodes store the data; they are built from a large number of inexpensive computers in the enterprise and can be dynamically expanded according to the scale of the backup data. At backup time a file is divided into one or more data blocks, which are stored on a group of DataNodes. Each DataNode handles read and write requests from file system clients and performs operations such as creation, deletion, and replication of data blocks under the unified scheduling of the NameNode.
The cloud data backup system based on the Hadoop distributed file system uses the backup server as a bridge between the clients and the backup cluster for the following reasons. First, the backup server shields clients from direct access to the backup cluster, improving the cluster's security; data security between the backup server and the client can be achieved through technical means such as firewalls and secure channels, ensuring the security of the whole system. Second, the backup server can temporarily store data and upload it at a suitable time according to the load of the backup cluster and the network conditions, ensuring load balance on the backup cluster. Although in special situations the backup server may become a bottleneck under a large number of client backup/recovery requests, using high-performance machines as backup servers and scheduling clients reasonably avoids this situation to the greatest extent. Finally, uploading and downloading files to a Hadoop distributed file system cluster directly would require installing Hadoop-specific components on every computer, which is unrealistic for a large number of clients of uneven capability.
2. The client stores the information of the backup server providing service for the local computer, and sends a corresponding request to the backup server when backup or recovery is needed;
before the client module backs up data, tools such as tar and winrar are applied to pack all data files into a back-up file, and the back-up file is named according to the rule of client Id-back-up date-bak; meanwhile, the compression is carried out to save the storage space and reduce the backup recovery time.
The backup process of the client file specifically comprises the following steps:
b1 calling tool to pack the backup data;
b2 calling a compression tool to compress the packed file;
b3 making a backup request to the backup server;
b4 judging whether the backup request passes;
b5 uploading the data file to the backup server as the backup request passes.
The recovery process of the client file specifically comprises the following steps:
h1 makes a restore request to the backup server;
h2 judges whether the recovery request passes;
h3 downloading the data file if the recovery request passes;
h4 calls tools to decompress the packed file;
h5 calls a tool to unpack the backup file.
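Steps b1-b2 and h4-h5 above can be sketched with Python's standard tarfile module standing in for tar/WinRAR. The in-memory transport, the helper names, and the ".tar.gz" extension are illustrative assumptions; the archive name follows the "client ID-backup date-bak" rule from the description:

```python
import io
import tarfile
from datetime import date

def pack_backup(client_id, files):
    """Steps b1-b2: pack the data files into one archive and gzip-compress
    it; `files` maps file path -> content bytes (an assumption)."""
    name = "{}-{}-bak.tar.gz".format(client_id, date.today().isoformat())
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:  # pack + compress
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return name, buf.getvalue()

def unpack_backup(blob):
    """Steps h4-h5: decompress and unpack a downloaded backup archive,
    returning the original path -> content mapping."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        for member in tar.getmembers():
            out[member.name] = tar.extractfile(member).read()
    return out
```

Steps b3-b5 and h1-h3 (request, authorization check, transfer) are protocol interactions with the backup server and are omitted here.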
3. The backup server receives a request of a client side and performs file backup and recovery;
3.1 backup operation of the backup server specifically includes:
after receiving a backup request of a client, a backup server firstly identifies and authenticates the client, receives a backup file uploaded by the client after the authentication is passed, temporarily stores the backup file after the backup file is uploaded and added with a timestamp number, records the information of the backup file into a backup file information table, and then calls a cloud data uploading algorithm to upload data to a Hadoop distributed file system cluster by taking the file name as a parameter.
The cloud data upload algorithm first checks whether the size of the file uploaded by the user is greater than or equal to a threshold th_size. If it is, the file is uploaded to the Hadoop distributed file system cluster; after a successful upload, the corresponding upload flag in the backup file information table is set to true, the uploaded file name is filled in, and the file on the backup server is deleted. If the file is smaller than th_size, the backup file information table is read to obtain the information of all backup files not yet uploaded and their total size is computed; if that total is greater than or equal to th_size, all pending files are packed into one file named in the form "filename1-filename2-...-filenamen" and uploaded, after which the corresponding upload flags in the backup file information table are set to true, the uploaded file name is filled in, and the files are deleted. If the total size of all pending files is still smaller than th_size, no upload to the Hadoop distributed file system cluster takes place for the time being.
3.2 the recovery operation of the backup server specifically includes:
after receiving a recovery request of a client, a backup server firstly identifies and authenticates the client, checks a backup file information table after the authentication is passed, and sends a file to the client from the backup server if the backup file is temporarily stored locally; if the backup file is stored in the Hadoop distributed file system cluster, the backup file is downloaded from the Hadoop distributed file system cluster and then sent to the client, and if the backup file is formed by packaging a plurality of files, the backup file also needs to be unpacked and then sent to the client.
The backup server follows the following rules when downloading and uploading data:
when the backup server needs to download data, the data is immediately downloaded; when data needs to be uploaded, if no other backup server uploads the data, the data is uploaded immediately, otherwise, the data is called to generate conflict, the data is detected after waiting for a period of time to determine whether to be uploaded, the length of the waiting time is determined by a backoff algorithm, and the backoff algorithm specifically comprises the following steps:
1) on the first detected collision, the parameter L is set to 2;
2) the backoff interval is a random number of 1 to L time slices;
3) on each repeated collision, L is doubled, up to a maximum of 256; once L reaches 256 it is not increased further;
4) once the number of collisions exceeds 8, the data is uploaded unconditionally.
With this backoff algorithm, the more collisions a backup server detects, the more likely it is to wait a long time, which keeps the number of repeated collision checks as small as possible when the system is heavily loaded; at the same time, once a backup server has backed off more than 8 times it uploads its data immediately, which ensures fairness.
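Rules 1) to 4) amount to truncated binary exponential backoff. A small sketch, where the time-slice unit and the function name are assumptions:

```python
import random

def backoff_slots(attempt, max_l=256, max_attempts=8):
    """Backoff per rules 1)-4): attempt 1 uses L = 2, L doubles per
    repeated collision up to max_l = 256; past max_attempts collisions
    the caller uploads unconditionally (signalled here by 0 slots)."""
    if attempt > max_attempts:
        return 0                          # rule 4: upload immediately
    L = min(2 ** attempt, max_l)          # L = 2, 4, 8, ..., capped at 256
    return random.randint(1, L)           # rule 2: wait 1..L time slices
```

A caller would sleep for the returned number of time slices before re-checking whether another backup server is uploading.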
The synchronization of large files is a difficult point of cloud synchronization. Large files occupy a large amount of cloud storage space, and uploading and downloading them must cope with unstable network transmission, file security, file verification, and file encryption and compression. At present, most cloud synchronization applications at home and abroad only support synchronization of files below 100 MB. Synchronization of large files mainly faces the following problems: 1. instability of network transmission; 2. security of file transfer; 3. limited network bandwidth; 4. efficiency of large file updates.
Therefore, the invention adopts a file segmentation technique that splits a file into several independent file blocks, improving the efficiency of synchronization processing. After segmentation, the size of each file block is within a controllable range: no matter how large the original file is, the segmented blocks remain within the range the cloud storage system can accept. The file storage system of the Hadoop distributed file system cluster can therefore handle cloud-synchronized file storage quickly and manage the corresponding file blocks, avoiding both the performance problems and the waste of cluster storage space that overly large file blocks would cause.
When the file is uploaded and restored, the file is managed in a file splitting mode. Before uploading the file, dividing the file into small file blocks, and uploading the file blocks; when the file is restored, the file blocks of the file are downloaded first, and the file blocks are combined into the original file after all the file blocks are downloaded.
The uploading of the file comprises the following steps:
1. file segmentation: the original user file is divided into a plurality of small file blocks, the file division is a storage problem that a storage file of a large file is changed into a plurality of small files, and a plurality of technical problems which need to be dealt with in the storage of the large file can be directly avoided;
2. File block encryption: file blocks are encrypted with public-key cryptography, and the public and private keys are obtained from the Hadoop distributed file system cluster. Encryption guarantees the confidentiality of the file data; data confidentiality is a necessary requirement of any cloud synchronization application, since users will not store their data in an application from which it might leak;
3. File block compression: the encrypted file blocks are compressed;
4. File block verification: after a file block has been encrypted and compressed, its hash value is computed with a hash algorithm; both upload and recovery are verified against this hash value to confirm that the block was not corrupted in transit. Meanwhile, if the hash value is found to already exist, i.e. the same file block is already stored on the server, the block need not be uploaded again. File verification guarantees data integrity; avoiding the upload of identical file content saves server storage space, reduces data traffic, and improves synchronization efficiency.
5. File block upload: file blocks are synchronized through a remote interface provided by the Hadoop distributed file system cluster and uploaded to the cluster; after upload, the cluster verifies each block against its hash value to confirm it is free of errors.
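Steps 1–5 above can be sketched end to end. Note the stand-ins: the XOR keystream below only marks where the public-key encryption obtained from the cluster would run (it is not real cryptography), and `server_store` is a hypothetical dictionary playing the role of the cluster's block store:

```python
import hashlib
import zlib

def xor_cipher(block: bytes, key: bytes) -> bytes:
    """Reversible stand-in for the public-key encryption of step 2."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(block))

def prepare_block(block: bytes, key: bytes):
    """Steps 2-4: encrypt, compress, then hash the block for verification/dedup."""
    payload = zlib.compress(xor_cipher(block, key))
    return hashlib.md5(payload).hexdigest(), payload

def upload_blocks(blocks, key, server_store):
    """Step 5 with deduplication: a block whose hash the server already
    holds is not uploaded again."""
    uploaded = 0
    for block in blocks:
        digest, payload = prepare_block(block, key)
        if digest not in server_store:
            server_store[digest] = payload
            uploaded += 1
    return uploaded
```

In practice, compressing before encrypting is usually preferable, since well-encrypted data is essentially incompressible; the order shown simply follows the patent's numbered steps.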
The file recovery comprises the following steps:
1. Acquiring the file block list: the list of file blocks corresponding to a file is obtained through the file ID, detailed block information is obtained from each block's ID, and downloading the blocks indirectly completes the file download function;
2. Downloading the file blocks: each file block is located at its designated position by its ID, and the blocks in the list are downloaded to the local machine;
3. Checking the file blocks: after a file block is downloaded, its size and hash value are checked to verify that the download succeeded; if verification fails, the block is invalid and must be downloaded again or handled by a manual strategy;
4. file block decompression: decompressing the file blocks by adopting a file block decompression algorithm corresponding to the file block compression;
5. file block decryption: acquiring a private key for decrypting a file block from the Hadoop distributed file system cluster, and decrypting the file block by adopting a decryption algorithm corresponding to file block encryption;
6. File block merging: after the file blocks have been downloaded, verified, decompressed, and decrypted, the separated blocks are recombined to restore the user's original file.
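The recovery side reverses the upload pipeline; a sketch under the same stand-in assumptions (XOR in place of the real decryption, `store` as a hypothetical block store keyed by hash):

```python
import hashlib
import zlib

def xor_cipher(block: bytes, key: bytes) -> bytes:
    """Reversible stand-in for the block decryption of step 5."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(block))

def recover_block(payload: bytes, expected_digest: str, key: bytes) -> bytes:
    """Steps 3-5: verify the downloaded block, then decompress and decrypt it."""
    if hashlib.md5(payload).hexdigest() != expected_digest:
        raise ValueError("block failed verification; re-download or handle manually")
    return xor_cipher(zlib.decompress(payload), key)

def recover_file(block_ids, store, key) -> bytes:
    """Steps 1-6: fetch every block of the file by its ID and merge them."""
    return b"".join(recover_block(d and store[d], d, key) if False else
                    recover_block(store[d], d, key) for d in block_ids)
```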
When the file of the client is monitored to be changed, the corresponding file on the Hadoop distributed file system cluster is synchronously updated in the following mode:
1. When the client file CF_old is monitored to have changed into file CF_new, the ID of the changed file is sent to the Hadoop distributed file system cluster;
2. According to the file ID sent by the client, the Hadoop distributed file system cluster divides SF_old, the server file corresponding to CF_old, into blocks of size B, where SF_old[(i-1)B, iB-1] denotes the file content from offset address (i-1)B to iB-1, i takes values in [1, 2, 3, …, N], and N is the number of blocks into which SF_old is divided; it then calculates two hash values for each block B_i: q_i = h_q(B_i) and r_i = h_m(B_i), where h_q(B_i) denotes an Adler-32 checksum of block B_i and h_m(B_i) an MD5 checksum of block B_i, and sends the two checksums to the client;
3. The client receives the two hash values (q_i, r_i) of each block from the Hadoop distributed file system cluster and builds a hash table;
4. The client traverses file CF_new; starting from offset address j = 0, it repeatedly performs the following steps 4.1 to 4.4:
4.1 compute h_q(CF_new[j, j+B-1]);
4.2 look up whether the hash table contains a matching hash value;
4.3 if a matching hash value is found, compute h_m(CF_new[j, j+B-1]); if h_m also matches, send the offset address j of the block and the block size to the distributed file system cluster, and set j = j + B;
4.4 if no matching hash value is found, or h_m does not match, transmit CF_new[j] to the Hadoop distributed file system cluster, where CF_new[j] denotes the content of file CF_new at offset address j, and set j = j + 1;
5. The Hadoop distributed file system cluster constructs the file SF_new corresponding to CF_new from the content transmitted by the client and SF_old.
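Steps 1–5 amount to an rsync-style delta computed with a weak hash h_q (here `zlib.adler32`) and a strong hash h_m (MD5); a runnable sketch, with an illustrative block size and delta encoding:

```python
import hashlib
import zlib

B = 4  # tiny block size for illustration; real deployments use far larger blocks

def block_hashes(sf_old: bytes):
    """Server side (step 2): hash every B-sized block of SF_old with Adler-32
    (weak, fast) and MD5 (strong), keyed weak-then-strong as in step 3's table."""
    table = {}
    for i in range(0, len(sf_old), B):
        blk = sf_old[i:i + B]
        table.setdefault(zlib.adler32(blk), {})[hashlib.md5(blk).hexdigest()] = i
    return table

def make_delta(cf_new: bytes, table):
    """Client side (step 4): emit ('match', old_offset) for blocks the server
    already has, otherwise a single literal byte ('data', ...)."""
    delta, j = [], 0
    while j < len(cf_new):
        blk = cf_new[j:j + B]
        off = table.get(zlib.adler32(blk), {}).get(hashlib.md5(blk).hexdigest())
        if off is not None and len(blk) == B:
            delta.append(("match", off))
            j += B          # step 4.3: advance by a whole block
        else:
            delta.append(("data", cf_new[j:j + 1]))
            j += 1          # step 4.4: advance by one byte
    return delta

def rebuild(sf_old: bytes, delta) -> bytes:
    """Server side (step 5): construct SF_new from SF_old plus the delta."""
    out = bytearray()
    for kind, val in delta:
        out += sf_old[val:val + B] if kind == "match" else val
    return bytes(out)
```

Sending each unmatched byte as a separate transmission is exactly the bandwidth overhead discussed next, which the MatchList variant addresses by batching.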
This synchronous updating method requires little computation and is fast. The algorithm can be further improved for the case of small modifications: when the i-th block of CF_new matches the j-th block of SF_old, it is highly likely that the (i+1)-th block of CF_new matches the (j+1)-th block of SF_old; since the algorithm transmits data every time it finds a matching block, it performs too many transmissions and bandwidth utilization is low.
When the file of the client is monitored to be changed, the method can also be used for synchronously updating the corresponding file on the Hadoop distributed file system cluster in the following modes:
1. When the client file CF_old is monitored to have changed into file CF_new, the ID of the changed file is sent to the Hadoop distributed file system cluster;
2. According to the file ID sent by the client, the Hadoop distributed file system cluster divides SF_old, the server file corresponding to CF_old, into blocks of size B, where SF_old[(i-1)B, iB-1] denotes the file content from offset address (i-1)B to iB-1, i takes values in [1, 2, 3, …, N], and N is the number of blocks into which SF_old is divided; it then calculates two hash values for each block B_i: q_i = h_q(B_i) and r_i = h_m(B_i), where h_q(B_i) denotes an Adler-32 checksum of block B_i and h_m(B_i) an MD5 checksum of block B_i, and sends the two checksums to the client;
3. The client receives the two hash values (q_i, r_i) of each block from the Hadoop distributed file system cluster and builds a hash table;
4. The client traverses file CF_new; starting from offset address j = 0, it repeatedly performs the following steps 4.1 to 4.4:
4.1 compute h_q(CF_new[j, j+B-1]);
4.2 look up whether the hash table contains a matching hash value;
4.3 if a matching hash value is found, compute h_m(CF_new[j, j+B-1]); if h_m also matches, store the offset address j of the block and the block size into the list MatchList, and set j = j + B;
4.4 if no matching hash value is found, or h_m does not match, store CF_new[j] into the list MatchList, where CF_new[j] denotes the content of file CF_new at offset address j; then judge whether the total size of the CF_new[j] entries stored in the MatchList has reached CK, the minimum storage unit of the Hadoop distributed file system cluster; if so, send the content stored in the MatchList to the Hadoop distributed file system cluster and continue with the following operations, otherwise continue directly; set j = j + 1;
5. The Hadoop distributed file system cluster constructs the file SF_new corresponding to CF_new from the content transmitted by the client and SF_old.
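The MatchList batching of step 4.4 can be sketched as follows (the hash-table builder, block size, and CK value are illustrative; `hash_table` plays the role of step 2's server-side hashing):

```python
import hashlib
import zlib

B = 4   # illustrative block size
CK = 8  # illustrative minimum storage unit of the cluster

def hash_table(sf_old: bytes):
    """Server side (step 2): weak-then-strong hash table over SF_old's blocks."""
    table = {}
    for i in range(0, len(sf_old), B):
        blk = sf_old[i:i + B]
        table.setdefault(zlib.adler32(blk), {})[hashlib.md5(blk).hexdigest()] = i
    return table

def make_delta_batched(cf_new: bytes, table, ck: int = CK):
    """Client side (step 4): matches and literal bytes accumulate in a MatchList,
    which is sent only once the buffered literal bytes reach CK."""
    batches, match_list, literal_bytes, j = [], [], 0, 0
    while j < len(cf_new):
        blk = cf_new[j:j + B]
        off = table.get(zlib.adler32(blk), {}).get(hashlib.md5(blk).hexdigest())
        if off is not None and len(blk) == B:
            match_list.append(("match", off))             # step 4.3
            j += B
        else:
            match_list.append(("data", cf_new[j:j + 1]))  # step 4.4
            literal_bytes += 1
            if literal_bytes >= ck:       # CK reached: transmit the MatchList
                batches.append(match_list)
                match_list, literal_bytes = [], 0
            j += 1
    if match_list:
        batches.append(match_list)        # final partial transmission
    return batches
```

Each element of `batches` models one transmission to the cluster, so long runs of matches cost no extra round trips.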
In the invention, the specific implementation process of reading the file by the client comprises the following steps:
1. The client opens the file it wishes to read by calling the open() method on an instance of the distributed file system;
2. The distributed file system calls the name node remotely through RPC to obtain the locations of the blocks at the beginning of the file. For each block, the name node returns the addresses of the data nodes holding it, sorted by their distance to the client; if the client is itself a data node, it reads the data locally. The distributed file system returns to the client an FSDataInputStream, an input stream object that supports file positioning, and the client reads data from this FSDataInputStream;
3. the client calls a read () method of FSDataInputStream;
4. The DFSInputStream, which holds the data node addresses for the blocks at the start of the file, connects to the closest data node; data is read from that node and returned to the client by repeatedly calling read() on the stream;
5. When the first block has been read completely, DFSInputStream closes the connection to that data node and begins the operation for the second block;
6. As the client reads data from the stream, the blocks are read in sequence, with DFSInputStream opening new connections to data nodes as needed; DFSInputStream also calls the name node to retrieve the data node locations for the next group of required blocks. When the client has finished reading, it calls the close() method of FSDataInputStream to close the data stream.
During file reading, if the client encounters an error reading from a data node, it selects the next closest data node. It also remembers the failed data node and no longer selects it when reading subsequent blocks.
An important aspect of this design is that the client contacts the data nodes directly to receive data; the name node merely directs the client to the best data node containing the desired data. This design lets the Hadoop distributed file system scale to a large number of clients, because data traffic is spread across all the data nodes in the cluster. The name node only needs to serve block-location queries, which it answers from location information held in memory and which are therefore very efficient; it does not need to serve the data itself, otherwise it would quickly become a bottleneck as the number of clients grows.
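The nearest-node selection and failure handling just described can be simulated in a few lines (`replicas` is a list of hypothetical (node, distance) pairs and `read_from` stands in for the actual data node RPC):

```python
def read_block(replicas, read_from, failed):
    """Try the data nodes holding a block nearest-first; on a read error,
    remember the failed node so later blocks never select it again."""
    for node, _distance in sorted(replicas, key=lambda nd: nd[1]):
        if node in failed:
            continue                  # previously failed node: skip it
        try:
            return read_from(node)
        except IOError:
            failed.add(node)          # remember the failed data node
    raise IOError("no healthy data node holds this block")
```

The shared `failed` set is what carries the "keep in mind the failed data node" behavior across successive block reads.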
In the invention, the specific implementation process of writing the file by the client comprises the following steps:
1. the client creates a file by calling a create () method of the distributed file system;
2. The distributed file system calls the name node remotely through RPC and creates a new file in the file system's namespace; at this point no blocks are associated with the new file. The name node performs checks to ensure that the file does not already exist and that the client has the right to create it; if the checks pass, the name node records the new file, otherwise file creation fails and an exception is thrown to the client. The distributed file system returns an FSDataOutputStream for the client to start writing data; the FSDataOutputStream wraps a DFSOutputStream, which handles communication with the data nodes and the name node;
3. When the client writes data, the DFSOutputStream divides the data into packets and writes them to an internal data queue. The packets in the data queue are consumed by the data streamer, which asks the name node to choose a list of suitable data nodes and to allocate new blocks on them to store the replicated data; this list of data nodes forms a pipeline;
4. The data streamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline; the second stores it and passes it to the third, and so on until the packet reaches the last data node in the pipeline;
5. The DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged by the data nodes, called the acknowledgement queue. A packet counts as written successfully only when every data node in the pipeline has returned success; an acknowledgement is then sent to the DFSOutputStream, the packet is removed from the acknowledgement queue, and the next packet is written;
If a data node fails during data writing, the following operations are executed: first, the pipeline is closed, and any packets in the acknowledgement queue are added back to the front of the data queue so that data nodes downstream of the failed node miss no packet; the current block on the normally working data nodes is given a new identity, which is communicated to the name node, so that the partial data block on the failed data node can be deleted when that node recovers later; the failed data node is removed from the pipeline, and the remaining data of the block is written to the two good data nodes left in the pipeline; when the name node notices that the block is under-replicated, it arranges for another copy to be created on a different node; subsequent blocks then continue normal processing;
6. After the client finishes writing data, it calls close() on the FSDataOutputStream;
7. After the block has been replicated to the minimum number of copies, the name node returns success.
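Steps 3–5 and the failure handling can be simulated in a few lines (the pipeline, packets, and failure set are all hypothetical; real HDFS performs this over the network):

```python
from collections import deque

def write_packets(packets, pipeline, failing=frozenset()):
    """Packets leave the data queue, are replicated node to node along the
    pipeline, and leave the ack queue only when every node has acknowledged.
    On failure, unconfirmed packets return to the data queue and the failed
    node is removed from the pipeline."""
    data_queue = deque(packets)
    ack_queue = deque()
    pipeline = list(pipeline)
    stored = {node: [] for node in pipeline}
    while data_queue:
        pkt = data_queue.popleft()
        ack_queue.append(pkt)
        if any(node in failing for node in pipeline):
            data_queue.extendleft(reversed(ack_queue))  # re-queue unacknowledged packets
            ack_queue.clear()
            pipeline = [n for n in pipeline if n not in failing]
            continue
        for node in pipeline:       # each node stores and forwards the packet
            stored[node].append(pkt)
        ack_queue.popleft()         # all nodes returned success for this packet
    return pipeline, stored
```

The simulation keeps the two-queue invariant of the text: nothing is ever dropped between the data queue and the acknowledgement queue, even when the pipeline shrinks.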
The invention provides a new method for backing up and updating files based on a cloud computing platform, improving the efficiency of file backup and update.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (3)
1. A data synchronization updating method under a cloud computing platform comprises the following steps:
(1) the cloud data backup system based on the Hadoop distributed file system is constructed and physically divided into a client, a backup server and a Hadoop distributed file system cluster;
(2) the client stores the information of the backup server providing service for the local computer, and sends a corresponding request to the backup server when backup or recovery is needed;
(3) the backup server receives a request of a client side and performs file backup and recovery;
wherein,
the clients are a plurality of computer nodes in an enterprise that need data backup/recovery service, divided into groups by region and system category; when data backup or recovery is needed, a client makes a request to the backup server in charge of its group, and performs the file backup or recovery operation after permission is granted; the client implements data backup and recovery, including file packing, compression strategies, and data backup and restoration;
the backup server is the bridge for data backup and recovery between the clients and the Hadoop distributed file system cluster, and consists of a number of high-performance, large-capacity servers; each server is responsible for one client group: it receives the backup and recovery requests of its clients, caches their backup data, merges, splits, and compresses the backup data according to its characteristics, and uploads it to the Hadoop distributed file system cluster for backup, while also storing an image table of each client's backup files; when a client issues a recovery request, the server reads the backup files from the Hadoop distributed file system cluster and sends them to the client according to the image table;
the Hadoop distributed file system cluster consists of computers on which the Hadoop distributed file system software is installed; under this software framework, it is configured to provide upload and download services to the backup servers, realizing the core function of the system;
the Hadoop distributed file system cluster adopts a master/slave structure and consists of a name node (NameNode) and a certain number of data nodes (DataNodes); the NameNode, as the central server, is responsible for managing the namespace of the file system and client access to files, and executes the namespace operations of opening, closing, and renaming files or directories; the data nodes store the data, are drawn from a large number of inexpensive computers within the enterprise, and can be expanded dynamically as the volume of backup data grows; during backup, files are divided into one or more data blocks, and these data blocks are stored on a group of data nodes; the DataNode is responsible for serving read and write requests from file system clients and, under the unified scheduling of the NameNode, for creating, deleting, and replicating data blocks;
when the file of the client is monitored to be changed, the corresponding file on the Hadoop distributed file system cluster is synchronously updated in the following mode:
(1) when the client file CF_old is monitored to have changed into file CF_new, the ID of the changed file is sent to the Hadoop distributed file system cluster;
(2) according to the file ID sent by the client, the Hadoop distributed file system cluster divides SF_old, the server file corresponding to CF_old, into blocks of size B, where SF_old[(i-1)B, iB-1] denotes the file content from offset address (i-1)B to iB-1, i takes values in [1, 2, 3, …, N], and N is the number of blocks into which SF_old is divided; it then calculates two hash values for each block B_i: q_i = h_q(B_i) and r_i = h_m(B_i), where h_q(B_i) denotes an Adler-32 checksum of block B_i and h_m(B_i) an MD5 checksum of block B_i, and sends the two checksums to the client;
(3) the client receives the two hash values (q_i, r_i) of each block from the Hadoop distributed file system cluster and builds a hash table;
(4) the client traverses file CF_new; starting from offset address j = 0, it repeatedly performs the following steps 4.1 to 4.4:
(4.1) compute h_q(CF_new[j, j+B-1]);
(4.2) look up whether the hash table contains a matching hash value;
(4.3) if a matching hash value is found, compute h_m(CF_new[j, j+B-1]); if h_m also matches, send the offset address j of the block and the block size to the distributed file system cluster, and set j = j + B;
(4.4) if no matching hash value is found, or h_m does not match, transmit CF_new[j] to the Hadoop distributed file system cluster, where CF_new[j] denotes the content of file CF_new at offset address j, and set j = j + 1;
(5) the Hadoop distributed file system cluster constructs the file SF_new corresponding to CF_new from the content transmitted by the client and SF_old.
2. The data synchronization updating method under the cloud computing platform according to claim 1, wherein the backup server includes the following specific functional modules:
(1) backup management module: the core functional module of the system, mainly responsible for the backup management of files;
(2) recovery management module: responsible for recovering backup files;
(3) security management module: its functions include controlling the transmission and storage security of files, and authenticating and authorizing clients;
(4) directory management module: responsible for client management and backup file directory management; the file backup information table manages the directories of the backup files, and the client information table manages all clients for which the backup server is responsible;
(5) user interface module: provides a friendly user interface for displaying and configuring backup operation information, so that a user can select a backup mode according to his needs;
(6) synchronization processing module: mainly responsible for file synchronization; it monitors changes to client files and performs the synchronization between the client and the Hadoop distributed file system cluster, updating the corresponding file on the cluster whenever a client file is monitored to have changed.
3. The data synchronization updating method under the cloud computing platform according to claim 1, wherein when it is monitored that the file of the client is changed, the corresponding file on the Hadoop distributed file system cluster can be updated synchronously by using the following method:
(1) when the client file CF_old is monitored to have changed into file CF_new, the ID of the changed file is sent to the Hadoop distributed file system cluster;
(2) according to the file ID sent by the client, the Hadoop distributed file system cluster divides SF_old, the server file corresponding to CF_old, into blocks of size B, where SF_old[(i-1)B, iB-1] denotes the file content from offset address (i-1)B to iB-1, i takes values in [1, 2, 3, …, N], and N is the number of blocks into which SF_old is divided; it then calculates two hash values for each block B_i: q_i = h_q(B_i) and r_i = h_m(B_i), where h_q(B_i) denotes an Adler-32 checksum of block B_i and h_m(B_i) an MD5 checksum of block B_i, and sends the two checksums to the client;
(3) the client receives the two hash values (q_i, r_i) of each block from the Hadoop distributed file system cluster and builds a hash table;
(4) the client traverses file CF_new; starting from offset address j = 0, it repeatedly performs the following steps 4.1 to 4.4:
(4.1) compute h_q(CF_new[j, j+B-1]);
(4.2) look up whether the hash table contains a matching hash value;
(4.3) if a matching hash value is found, compute h_m(CF_new[j, j+B-1]); if h_m also matches, store the offset address j of the block and the block size into the list MatchList, and set j = j + B;
(4.4) if no matching hash value is found, or h_m does not match, store CF_new[j] into the list MatchList, where CF_new[j] denotes the content of file CF_new at offset address j; then judge whether the total size of the CF_new[j] entries stored in the MatchList has reached CK, the minimum storage unit of the Hadoop distributed file system cluster; if so, send the content stored in the MatchList to the Hadoop distributed file system cluster and continue with the following operations, otherwise continue directly; set j = j + 1;
(5) the Hadoop distributed file system cluster constructs the file SF_new corresponding to CF_new from the content transmitted by the client and SF_old.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610608344.XA CN106156359B (en) | 2016-07-28 | 2016-07-28 | A kind of data synchronization updating method under cloud computing platform |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106156359A CN106156359A (en) | 2016-11-23 |
| CN106156359B true CN106156359B (en) | 2019-05-21 |
Family
ID=58060997
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610608344.XA Active CN106156359B (en) | 2016-07-28 | 2016-07-28 | A kind of data synchronization updating method under cloud computing platform |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106156359B (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101056235A (en) * | 2007-05-08 | 2007-10-17 | 中山大学 | System for realizing the real time data backup in the digital home network |
| CN102638566A (en) * | 2012-02-28 | 2012-08-15 | 山东大学 | BLOG system running method based on cloud storage |
| CN103023996A (en) * | 2012-11-30 | 2013-04-03 | 江苏乐买到网络科技有限公司 | Cloud data storage system |
| CN103530387A (en) * | 2013-10-22 | 2014-01-22 | 浪潮电子信息产业股份有限公司 | Improved method aimed at small files of HDFS |
| CN104572357A (en) * | 2014-12-30 | 2015-04-29 | 清华大学 | Backup and recovery method for HDFS (Hadoop distributed filesystem) |
| CN105354250A (en) * | 2015-10-16 | 2016-02-24 | 浪潮(北京)电子信息产业有限公司 | Data storage method and device for cloud storage |
| WO2016095149A1 (en) * | 2014-12-18 | 2016-06-23 | 华为技术有限公司 | Data compression and storage method and device, and distributed file system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| 2019-04-24 | TA01 | Transfer of patent application right | Address after: 511458 Room 421, 80 Jingang Avenue, Nansha District, Guangzhou City, Guangdong Province; Applicant after: Guangdong Olympic data Polytron Technologies Inc. Address before: 610041 No. 4-4 Building 1, No. 9, Pioneer Road, Chengdu High-tech Zone, Sichuan Province; Applicant before: Sichuan Xinhuanjia Technology Development Co., Ltd. |
| | GR01 | Patent grant | |