CN101814045A - Data organization method for backup services

Info

Publication number: CN101814045A
Application number: CN 201010152397
Authority: CN
Inventors: 周可; 王桦; 张鹏
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2010-04-22
Filing date: 2010-04-22
Publication date: 2010-08-25
Anticipated expiration: 2030-04-22
Also published as: CN101814045B

Abstract

The invention discloses a data organization method for the storage server side of backup service software, intended to improve the efficiency of data organization and data management on the storage server. The method comprises: ① initializing the storage server's storage space into a metadata area (comprising the main record, the index header and the data index) and a data area; ② receiving a user command and determining its type, proceeding to step ③ for a backup operation, to step ④ for a restore operation, and to step ⑤ for a delete operation; ③ handling the backup operation: backing up user data to the data area of the storage server, using deduplication to avoid backing up duplicate data, then returning to step ②; ④ handling the restore operation: locating the data the user has asked to restore in the data area of the storage server and transferring it to the client, then returning to step ②; ⑤ handling the delete operation: finding the data the user has asked to delete and processing it according to the reference counts of the corresponding backup data blocks in the data area of the storage server, then returning to step ②. The method improves the utilization, manageability and scalability of the storage server side, saves network bandwidth, and improves backup efficiency.

Description

A data organization method for backup service

Technical field

The invention belongs to the field of computer data storage and backup, and specifically relates to a data organization method for backup services that implements block-level deduplication.

Background

With the rapid development of the information society, everything from daily life to business operations is surrounded by increasingly pervasive information systems, and dependence on them keeps growing. Particularly in industries such as finance, telecommunications, transportation and insurance, the loss or corruption of critical data can cause immeasurable losses to individuals and enterprises.

The backup service discussed here is essentially a backup and recovery software system with a degree of disaster tolerance. It provides individual and enterprise users with comprehensive data backup, recovery and related management functions, and lets them customize backup policies to their actual needs. The backup service is also a software delivery model: the provider builds all the network infrastructure and the software and hardware platforms an enterprise needs for its information systems, and is responsible for initial deployment, subsequent maintenance and related services. The enterprise can use the information system over the Internet without purchasing software or hardware, building a machine room, or hiring IT staff. Just as one turns on a tap to get water, enterprises rent software services according to their actual needs.

Data backup and recovery are important measures for ensuring information security. As data becomes ever more important, the data on storage systems must be protected effectively and comprehensively. With the emergence and development of technologies such as high-speed networking and communications and mass storage, the underlying storage resources have changed dramatically. The growing use of information systems has also caused the volume of data worth protecting to grow geometrically. All of this places higher demands on the development of data backup and recovery software and on research into the related technologies.

Users of a backup service typically have the following requirements for storage space and data: they can increase or decrease their storage usage on demand; they can use existing space without obstruction, that is, any backup task executes correctly as long as free space remains and the network is reachable; and they can restore backed-up data at any time. Meeting these requirements demands a degree of logical independence for user space and data, so user space management and backup data organization methods need to be studied. An efficient space allocation and reclamation mechanism is also needed, one that fully exploits the coupling of storage space among duplicate data while preserving the logical independence of user data and supporting efficient data lookup and access.

In a typical backup software architecture, a storage server is a physical medium authenticated by the management console of the data backup software. It may be disk space on a server, a storage device attached to a server, or a mapped network disk. Multiple storage servers can be configured through the management console; under the unified management of the backup server, backup clients back up data to the appropriate storage server.

Earlier designs for the storage server side of backup software mostly used file-level backup. In file-level backup, the backup software is aware only of files and copies all files on the source disk to a destination medium. File-level backup software therefore either relies on the file system interface provided by the operating system, or implements file system functionality itself so that it can interpret file system metadata. In short, file-level backup software reads data file by file and stores each file on another medium. This clearly becomes a performance bottleneck for PB-scale storage systems. Because the unit of data managed by the storage server is the file, large amounts of duplicate data are inevitably backed up, which also makes the storage server very inconvenient to manage. Backing up data with block-level deduplication can largely solve these problems.

In addition, in earlier backup software the storage capacity initially allocated to a user on the storage server was usually fixed, which greatly reduces system scalability. In practice the system cannot predict how much storage each user will eventually use (although there may be a maximum capacity limit based on the user's permissions and type). Allocating too much is likely to waste storage space and lower utilization, while allocating too little may severely restrict the user.

EMC recently acquired Avamar, whose patented deduplication and global single-instance storage technology ensures that each backup data segment is stored only once globally. This can reduce the amount of data moved and restored by a factor of 300 while still allowing daily full backups and fast restores. For each 24 KB data segment, Avamar generates a unique 20-byte ID using the SHA-1 hash algorithm. This ID is the fingerprint of the segment, and Avamar's software uses it to determine whether a segment has been stored before. However, SHA-1 is computationally expensive and consumes considerable CPU time. Moreover, because the segments are small, the fingerprints occupy a large amount of space when users back up large volumes of data, and there are scalability problems as well.

Summary of the invention

The purpose of the present invention is to provide a data organization method for backup services that implements block-level deduplication and improves the efficiency of data organization and management.

The present invention provides a data organization method for backup services, comprising the following steps:

(1) Initialization:

Initialize the metadata information, including assigning initial values to the index header information, data index information, and data area metadata information in the metadata area;

Pre-allocate one data space in the data area, ready to accept users' backup requests;

(2) Receive a user command and determine its type:

Determine the type of the user command: for a backup operation, go to step (3); for a restore operation, go to step (4); for a delete operation, go to step (5);

(3) Perform backup processing as follows:

(3.1) First split the user's file to be backed up into blocks, then hash the content of each data block with the MD5 algorithm to obtain a fingerprint that uniquely identifies the data block; data blocks are stored, indexed by fingerprint, in the index header and data index of the storage server's metadata;

(3.2) The backup client sends the fingerprint of the data block to the storage server, and the storage server uses the fingerprint to check whether the data block already exists;

(3.3) If the fingerprint does not exist, the backup client sends the data block to the storage server; the data block is then a new backup data block, for which the storage server dynamically allocates storage space and completes the write operation. If the fingerprint exists, only the index information for that data block on the storage server needs to be updated, incrementing its reference count by one;

(3.4) Go to step (2);

(4) To restore, the backup server looks up the Hash list of the file to be restored and uses the Hash list to access the storage space metadata and locate each block's logical position in the corresponding data space. The file data is then read block by block from the storage server into a memory buffer and sent over a socket to the backup client, which assembles the required file set; then go to step (2);

(5) Delete backed-up files as follows:

(5.1) The backup server of the backup system software looks up the Hash list of the file to be deleted;

(5.2) Look up each Hash value in the index header and data index mapping table of the storage server's metadata area; if the Hash value passed in does not exist, return immediately with the value false;

(5.3) Otherwise, decrement by 1 the reference count in the object metadata corresponding to the Hash value and return true;

(5.4) Go to step (2).
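As an illustration only, the delete procedure of steps (5.1) to (5.4) might be sketched in Python as follows. The names `ObjectMeta`, `delete_block` and `delete_file` are hypothetical, and the index is modelled as an in-memory dictionary rather than the on-disk index header and data index. What happens once a reference count reaches zero is not specified in these steps, so the sketch leaves it out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ObjectMeta:
    """Hypothetical per-object metadata; only the reference count is modelled."""
    ref_count: int


def delete_block(index: Dict[str, ObjectMeta], hash_value: str) -> bool:
    """Steps (5.2)-(5.3): False if the hash is unknown, else decrement and True."""
    meta = index.get(hash_value)
    if meta is None:
        return False          # (5.2) hash value not found in the index
    meta.ref_count -= 1       # (5.3) one fewer backup refers to this block
    return True


def delete_file(index: Dict[str, ObjectMeta], hash_list: List[str]) -> List[bool]:
    """Step (5.1) supplies hash_list; apply delete_block to every entry."""
    return [delete_block(index, h) for h in hash_list]


if __name__ == "__main__":
    idx = {"aa": ObjectMeta(ref_count=2)}
    print(delete_file(idx, ["aa", "zz"]), idx["aa"].ref_count)
```

Because a block shared by several files is only unreferenced, not erased, deleting one file cannot damage another file that contains the same block.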

Enterprise data today is not only varied and fast-growing but also highly redundant. Many identical files or pieces of data are stored within and across systems, and edited files are largely redundant with their earlier versions. Traditional backup software backs up this redundant data again and again, amplifying the redundancy. A better solution is deduplication, which not only achieves high compression ratios and frees storage space but also lowers the cost of disk-based backup and of data management. The present invention is a method for organizing and managing data on the storage server side based on block-level deduplication. It efficiently transfers backup and restore data to and from clients, and manages local storage space and organizes data according to the backup server's policies. The invention achieves global deduplication without affecting users' primary backup and restore operations, and the benefit of deduplication becomes more pronounced as the number of users and the volume of backup data grow. It can greatly reduce the amount of data users need to back up, saving backup time, network bandwidth and the storage space required for backups.

Description of drawings

Fig. 1 shows the storage data organization used by the method of the invention and the process of locating a data item;

Fig. 2 is a flow chart of the method of the invention;

Fig. 3 is a flow chart of dynamic storage space allocation in the invention;

Fig. 4 is a flow chart of the write operation within the backup operation of the invention.

Detailed description

The invention is described in more detail below by means of an implementation example. The example is illustrative only, and the scope of protection of the invention is not limited by it.

The backup service system is based on a three-party architecture consisting of a backup server, storage servers and backup clients. The backup client accepts user-defined data backup policies, restore requests and other data management requests. The backup server connects backup clients and storage servers and is the control center of the entire data backup software; it is responsible for user access control, global job scheduling and global storage management. When a backup client initiates a backup or restore job, the backup server directs it to connect to the designated storage server and begin execution. The backup server also monitors the computing, transfer and storage load of each storage server and applies a load-balancing policy. User information, storage server status and the other basic metadata that support the backup server are to be stored in a database. The storage server transfers backup and restore data to and from clients, and manages local storage space and organizes data according to the backup server's policies.

The four data structures used in this example are the main record area, the index header, the data index and the data area; their structure is shown in Figure 1.

Main record area: describes the storage space as a whole. It holds the index header information, the data index information, and the data area metadata information.

Index header: an object hash table that maps an object ID (a 160-bit Hash value generated from the data content) to the data index table. An object here is the basic unit of data storage in the storage system. Unlike files and blocks, the basic components of traditional storage systems, an object combines application data with attributes that define how it is stored (metadata), and contains the data plus enough additional information for the data to be autonomous and self-managing. The Hash value serves as the object ID in object-based storage: file content is the basis for storage, and the Hash index establishes the mapping between content and objects. Because Hash values are statistically globally unique, the namespace is globally unique, which improves the manageability of shared systems. The system uses the mature MD5 algorithm, which transforms data content of arbitrary length into a 128-bit (16-byte) large integer, namely the object ID.

Data index: an array of size N, where N is the number of entries in the data index table and ranges from 2²⁰ to 2³⁰. Each element is the metadata structure of one object and contains: the object ID; the starting offset address of the object in the data area (I is the data space number, J is the logical data block number within that data space, and K is the offset within that logical data block); the size of the object's data; the number of copies of the object's data content; and the position in the object table of the next object that maps to the same slot of the object hash table. This last field links all objects that map to the same slot of the object hash table into a linked list.

Data area: stores object data, which comprises the object ID, the length of the data content, and the data content itself. To simplify storage space management, the data area is divided into several contiguous data spaces (each represented by a separate data file), and each data space consists of several logical data blocks.

Data block: whenever the backup service system performs a backup or restore operation, the data being processed is split into fixed-length blocks; each of these is a data block.

Backup data block: when a user backs up specified files and folders through the backup client, the backup client first splits the data to be backed up into fixed-length blocks (4M in the actual backup service software); each of these is a backup data block.

Logical data block: on the storage server side, to simplify management and use the storage server's space efficiently, each data space is divided into several sub-storage units; each sub-storage unit is a logical data block (1G each in the actual backup service software).
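The metadata structures described above could be modelled roughly as in the following Python sketch. It is a sketch under stated assumptions, not the on-disk layout: the class and field names (`DataIndexEntry`, `Metadata`, `next_entry` and so on) are hypothetical, and the value 0 is assumed to mean "empty" both in the index header and in the chain link, in line with the text's description of zero-initialization.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataIndexEntry:
    """One element of the data index array (hypothetical field names)."""
    object_id: bytes = b""  # fingerprint of the data content
    space: int = 0          # I: data space number
    block: int = 0          # J: logical data block number within the data space
    offset: int = 0         # K: offset within the logical data block
    size: int = 0           # size of the object's data
    ref_count: int = 0      # number of copies / references to the content
    next_entry: int = 0     # next entry in the same hash slot (0 = none, assumed)


@dataclass
class Metadata:
    """Index header plus data index, zero-initialized."""
    m: int                  # number of hash bits used to address the index header
    n: int                  # N: number of entries in the data index array
    index_header: List[int] = field(init=False)
    data_index: List[DataIndexEntry] = field(init=False)

    def __post_init__(self) -> None:
        # Every header slot is 0, meaning "available".
        self.index_header = [0] * (1 << self.m)
        # Every data index element is zeroed, meaning "nothing written yet".
        self.data_index = [DataIndexEntry() for _ in range(self.n)]


if __name__ == "__main__":
    md = Metadata(m=4, n=8)   # tiny sizes for demonstration only
    print(len(md.index_header), len(md.data_index))
```

The demonstration uses a very small `m` and `n`; the ranges quoted in the text (N between 2²⁰ and 2³⁰) would call for a more compact representation than a list of Python objects.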

The implementation of this example is described further below with reference to the drawings.

As shown in Figure 2, the method of the invention comprises the following steps:

(1) Initialization:

Stored data is normally divided into two parts: the metadata area and the data area. The data users actually back up is stored in the data area, while the information describing that data is stored in the metadata area. Initializing the metadata area mainly means assigning initial values to the index header information, data index information and data area metadata information. The index header, that is, the object hash table, is set entirely to 0 to indicate that every slot is available, and every element of the data index array is likewise set to zero to indicate that no data has been written yet. One data space is pre-allocated in the data area, ready to accept users' backup requests. The total number of data spaces is defined as S, which must not exceed 1000. The number of data spaces currently in use in the data area is V, with V ≤ S. The number of logical data blocks pre-allocated in each data space is defined as P, which must not exceed 10. The maximum number of logical data blocks in each data space is defined as W, which must not exceed 1024. In our current deployment of the backup service software, each data space consists of 1024 logical data blocks of 1G each, so each data space is at most 1T.

(3.1) First split the user's file to be backed up into blocks (the data block size is defined as b, with b between 1M and 4M; b is 4M in the actual deployment of the backup service software), then hash the content of each data block with the MD5 algorithm to obtain a fingerprint that uniquely identifies the data block; data blocks are stored, indexed by fingerprint, in the index header and data index of the storage server's metadata;

(3.3) If the fingerprint does not exist, the backup client sends the data block to the storage server, which dynamically allocates storage space and completes the write operation for the data block. If the fingerprint exists, no data needs to be sent; only the index information for that data block on the storage server is updated, incrementing the reference count by one.

(3.4) Go to step (2);
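The backup steps above can be illustrated with a minimal Python sketch. `StorageServer` is a hypothetical in-memory stand-in for the storage server, the fingerprint query and block transfer are collapsed into direct method calls instead of network messages, and a newly stored block is assumed to start with a reference count of 1. Only `hashlib.md5` is taken from the Python standard library as-is.

```python
import hashlib
from typing import Dict, Iterator, List

BLOCK_SIZE = 4 * 1024 * 1024  # b = 4M, the deployed value quoted in step (3.1)


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Split data into fixed-length blocks; the last block may be shorter."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def fingerprint(block: bytes) -> str:
    """MD5 digest of the block content, used as its fingerprint."""
    return hashlib.md5(block).hexdigest()


class StorageServer:
    """Hypothetical in-memory stand-in for the storage server."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}
        self.ref_count: Dict[str, int] = {}

    def has(self, fp: str) -> bool:
        return fp in self.blocks

    def store(self, fp: str, block: bytes) -> None:
        self.blocks[fp] = block
        self.ref_count[fp] = 1      # assumed initial count for a new block

    def add_reference(self, fp: str) -> None:
        self.ref_count[fp] += 1


def backup(data: bytes, server: StorageServer,
           block_size: int = BLOCK_SIZE) -> List[str]:
    """Back up data and return the file's hash list."""
    hash_list = []
    for block in split_blocks(data, block_size):
        fp = fingerprint(block)           # (3.1) fingerprint the block
        if server.has(fp):                # (3.2) does the server know it?
            server.add_reference(fp)      # (3.3) duplicate: count only
        else:
            server.store(fp, block)       # (3.3) new block: send and write
        hash_list.append(fp)
    return hash_list


if __name__ == "__main__":
    srv = StorageServer()
    hashes = backup(b"aaaabbbbaaaa", srv, block_size=4)
    print(len(hashes), len(srv.blocks))
```

In the demonstration the first and third blocks are identical, so three blocks are listed for the file but only two are stored.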

In step (3.3) above, storage space can be allocated dynamically by the process shown in Figure 3, as follows:

(a1) Determine whether the P logical data blocks of the current data space have enough free space for a backup data block of the specified size; if so, go to step (a5), otherwise go to step (a2);

(a2) Determine whether P < W holds; if so, go to step (a6), otherwise go to step (a3);

(a3) Determine whether any other data space in the storage server's main record has enough free space for a backup data block of the specified size; if so, go to step (a5), otherwise go to step (a4);

(a4) Determine whether V < S holds; if so, add a data space on the storage server, allocate a data index for the backup data block to be written into the new data space, and go to step (a8); otherwise go to step (a7);

(a5) Allocate a data index for the backup data block to be written into the free space found, then go to step (a8);

(a6) Grow the data space on the storage server by the size of one backup data block, allocate a data index for that backup data block, then go to step (a8);

(a7) Declare the dynamic allocation failed, because no free space large enough for a backup data block of the specified size can be found;

(a8) End the dynamic allocation process.
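The decision sequence (a1) to (a8) might be sketched in Python as below. This is a simplified model under several assumptions that are not in the text: `DataSpace` and `Allocator` are hypothetical names, free space is tracked as a bump pointer per logical data block, and step (a6) grows the data space by one whole logical data block so that the P < W comparison remains a count of logical data blocks. The function returns an (I, J, K) address or `None` on failure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DataSpace:
    block_size: int                                 # logical data block size
    free: List[int] = field(default_factory=list)   # free bytes per logical block

    def find(self, size: int) -> Optional[int]:
        """Index of the first logical block with at least `size` bytes free."""
        for j, free_bytes in enumerate(self.free):
            if free_bytes >= size:
                return j
        return None

    def grow(self) -> None:
        self.free.append(self.block_size)

    def take(self, j: int, size: int) -> int:
        """Reserve `size` bytes in block j and return the offset K."""
        k = self.block_size - self.free[j]
        self.free[j] -= size
        return k


@dataclass
class Allocator:
    block_size: int   # logical data block size
    p: int            # P: logical blocks pre-allocated per data space (>= 1)
    w: int            # W: maximum logical blocks per data space
    s: int            # S: maximum number of data spaces
    spaces: List[DataSpace] = field(default_factory=list)
    current: int = 0

    def __post_init__(self) -> None:
        if self.p < 1:
            raise ValueError("p must be at least 1 in this sketch")
        self._add_space()

    def _add_space(self) -> None:
        space = DataSpace(self.block_size)
        for _ in range(self.p):
            space.grow()
        self.spaces.append(space)
        self.current = len(self.spaces) - 1

    def allocate(self, size: int) -> Optional[Tuple[int, int, int]]:
        if size > self.block_size:
            return None
        cur = self.spaces[self.current]
        j = cur.find(size)                               # (a1)
        if j is not None:
            return self.current, j, cur.take(j, size)    # (a5)
        if len(cur.free) < self.w:                       # (a2)
            cur.grow()                                   # (a6), assumed unit
            j = len(cur.free) - 1
            return self.current, j, cur.take(j, size)
        for i, space in enumerate(self.spaces):          # (a3)
            if i != self.current:
                j = space.find(size)
                if j is not None:
                    return i, j, space.take(j, size)     # (a5)
        if len(self.spaces) < self.s:                    # (a4): V < S
            self._add_space()
            cur = self.spaces[self.current]
            j = cur.find(size)
            return self.current, j, cur.take(j, size)
        return None                                      # (a7)


if __name__ == "__main__":
    alloc = Allocator(block_size=8, p=1, w=2, s=2)
    print([alloc.allocate(8) for _ in range(5)])
```

With one pre-allocated block, at most two blocks per space and at most two spaces, the demonstration fills the first space, grows it, opens a second space, grows that, and then fails on the fifth request.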

The write operation can be completed by the process shown in Figure 4, as follows:

(b1) Dynamically search for available storage space on the storage server, looking for a logical data block that satisfies the requirement;

(b2) If no logical data block is available, return failure;

(b3) If there is available storage space large enough for the new backup data block, create a new data index, write the new backup data block to the corresponding location on the storage server, then write the corresponding index header and data index metadata to the main record area.

(4) To restore, the backup server of the backup system software looks up the Hash list of the file to be restored and uses the Hash list to access the storage space metadata and locate each block's logical position in the corresponding data space. The file data is then read block by block from the storage server into a memory buffer and sent over a socket to the backup client, which assembles the required file set; then go to step (2).

As shown in Figure 2, the process of using the Hash list to access the storage space metadata and locate the logical position in the corresponding data space is as follows:

(4.1) Let m be the preset number of bits of the index header. The first m bits of the data's Hash value index into the index header, and the content of the index header entry gives the data index number.

Typically each index header entry occupies two bytes and there are 2^m entries in total; m usually ranges from 20 to 30.

(4.2) Address the data index through the index header. The data index in turn addresses the data item of an operation, in three parts:

(4.2.1) Use the structure member I of the data index (I is the data space number) to find the specific data space;

(4.2.2) Use the structure member J of the data index (J is the logical data block number within the data space) to find the block within the data space;

(4.2.4) Use the structure member K of the data index (K is the offset within the logical data block) to find the offset of the data item within the logical data block. This amounts to three-level addressing, and allows the data header and data body of the data item to be located.

(4.4) With the three-level logical data block address obtained above, the corresponding location in the data area can be found and the data read.
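The lookup path just described (hash prefix, index header, data index, then the address I, J, K) can be sketched in Python as follows. The names `Entry`, `bucket_of`, `locate` and `file_offset` are hypothetical, slot 0 of the data index is assumed to be reserved so that 0 can mean "empty", and collisions are resolved by following the chain link described for the data index. The final helper assumes each data space is a single file made of equally sized logical data blocks.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical data index element: object ID plus its (I, J, K) address."""
    object_id: bytes
    i: int                # I: data space number
    j: int                # J: logical data block number within the data space
    k: int                # K: offset within the logical data block
    next_entry: int = 0   # next entry in the same slot, 0 = end of chain (assumed)


def bucket_of(object_id: bytes, m: int) -> int:
    """Step (4.1): the first m bits of the hash value, as an integer."""
    total_bits = len(object_id) * 8
    return int.from_bytes(object_id, "big") >> (total_bits - m)


def locate(object_id: bytes, index_header: List[int],
           data_index: List[Optional[Entry]], m: int
           ) -> Optional[Tuple[int, int, int]]:
    """Step (4.2): follow the index header into the data index."""
    slot = index_header[bucket_of(object_id, m)]
    while slot != 0:
        entry = data_index[slot]
        if entry.object_id == object_id:
            return entry.i, entry.j, entry.k
        slot = entry.next_entry       # collision: try the next entry in the chain
    return None


def file_offset(j: int, k: int, logical_block_size: int) -> int:
    """Byte position inside the data space file for block J, offset K."""
    return j * logical_block_size + k


if __name__ == "__main__":
    m = 4
    oid = hashlib.md5(b"block").digest()
    header = [0] * (1 << m)
    index: List[Optional[Entry]] = [None, Entry(oid, 2, 3, 4096)]
    header[bucket_of(oid, m)] = 1
    print(locate(oid, header, index, m))
```

Reading the data then amounts to opening the file for data space I and seeking to `file_offset(J, K, logical_block_size)`.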

(5.1) The backup server of the backup system software looks up the Hash list of the backup file to be deleted;

(5.4) Go to step (2).

Because what we provide is an online backup service, the backup server and storage server run permanently in the background as daemon processes. There is therefore no end state; they are always waiting for user requests. The backup client is the interface through which users access the online backup service; users can log in to the backup client at any time to perform operations such as backup, restore and delete.

实例：Example:

备份服务系统应用的运行支撑环境为：The running support environment of the backup service system application is:

1.硬件环境与支持环境1. Hardware environment and supporting environment

备份客户端要求主机具备512M及以上内存，10Mbps及以上网络吞吐能力。The backup client requires the host to have a memory of 512M or above and a network throughput of 10Mbps or above.

调度服务器要求主机具备2GB及以上内存，1000Mbps及以上网络吞吐能力。The scheduling server requires the host to have a memory of 2GB or above and a network throughput of 1000Mbps or above.

存储服务器要求主机具备4GB及以上内存和TB级外存储能力，1000Mbps级以上网络吞吐能力。The storage server requires the host to have 4GB or more memory and TB-level external storage capacity, and a network throughput capacity of 1000Mbps or more.

调度服务器和存储服务器软件所在主机之间具备GB级网络交换能力，客户端与服务端软件所在主机之间具备网络连通能力。要求服务器主机所在环境具备冗余电源保障、冗余通信链路保障、温度控制系统、防火系统等确保主机正常运转的基本条件。The scheduling server and the host where the storage server software are located have GB-level network exchange capabilities, and the client and server software hosts have network connectivity capabilities. The environment where the server host is located is required to have redundant power supply guarantee, redundant communication link guarantee, temperature control system, fire prevention system and other basic conditions to ensure the normal operation of the host.

2. Software operating environment

The backup client program runs on Windows XP or later, or on an operating system based on the Linux 2.6 kernel.

The scheduling server and the storage server run on Windows Server 2003.

In the online backup service system currently implemented and in normal operation, each data space on the storage server is 1 TB in size, and there are at most 20 data spaces. Each data space is divided into 1024 logical data blocks of 1 GB each.
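The sizing figures above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes binary units (1 GB taken as 1024^3 bytes), which is what makes 1024 blocks of 1 GB equal one 1 TB data space, and the constant names are hypothetical.

```python
# Assumption: binary units, so that 1024 logical blocks of 1 GB equal 1 TB.
GIB = 1024 ** 3

LOGICAL_BLOCK_BYTES = 1 * GIB   # size of one logical data block
BLOCKS_PER_SPACE = 1024         # logical data blocks per data space
MAX_DATA_SPACES = 20            # maximum number of data spaces

# Capacity of one data space and of the whole storage server data area.
space_bytes = BLOCKS_PER_SPACE * LOGICAL_BLOCK_BYTES
total_bytes = MAX_DATA_SPACES * space_bytes
total_blocks = MAX_DATA_SPACES * BLOCKS_PER_SPACE
```

Under these assumptions the data area tops out at 20 data spaces, that is, 20480 logical data blocks in total.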

The above is only a preferred embodiment of the present invention, and the invention is not limited to what is disclosed in this embodiment and the accompanying drawings. All equivalents or modifications made without departing from the spirit disclosed by the invention fall within its scope of protection.

Claims

1. A data organization method for backup services, characterized in that the method comprises the following steps:

(1) Initialization:

Initialize the metadata, which comprises assigning initial values to the index header information, the data index information, and the data area metadata information of the metadata area;

Pre-allocate one data space in the data area in preparation for accepting user backup requests;

(2) Receive a user command and determine its type:

If the command is a backup operation, go to step (3); if it is a recovery operation, go to step (4); if it is a deletion operation, go to step (5);

(3) Perform backup processing as follows:

(3.1) First divide the user's file to be backed up into blocks, then hash the content of each data block with the MD5 algorithm to obtain a fingerprint that uniquely identifies the data block; each data block is stored, indexed by its fingerprint, in the index header and data index of the storage server metadata;

(3.2) The backup client sends the fingerprint of the data block to the storage server, and the storage server queries by fingerprint whether the block already exists;

(3.3) If the fingerprint does not exist, the backup client transmits the data block to the storage server; the data block is then a new backup data block, for which storage space is dynamically allocated on the storage server and the write operation is completed. If the fingerprint exists, only the index information corresponding to the data block on the storage server needs to be updated, incrementing its reference count by one;

(3.4) Go to step (2);

(4) For recovery, the backup server looks up the Hash list of the file to be recovered; the storage space metadata is accessed according to the Hash list to locate the logical position in the corresponding data space; the data of the file to be recovered are then read in sequence from the storage server into a memory buffer, transmitted to the backup client through a socket, and assembled into the required file set; then go to step (2);

(5) Delete a backup file as follows:

(5.1) The backup server in the backup system software looks up the Hash list of the file to be deleted;

(5.2) Search the index header and data index mapping table in the metadata area of the storage server according to the hash value; if the input hash value does not exist, return immediately with the return value false;

(5.3) Otherwise, decrement by 1 the reference count of the object metadata corresponding to the hash value, and return true;

(5.4) Go to step (2).
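The backup, recovery, and deletion steps of claim 1 can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the claimed implementation: the class `DedupStore`, its method names, the fixed block size, and the use of Python dictionaries in place of the metadata area and data area are all hypothetical. Only the MD5 fingerprint, the reference count, and the true/false return of the deletion step come from the text.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory stand-in for the storage server."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block bytes (stands in for the data area)
        self.refcount = {}  # fingerprint -> reference count (stands in for the metadata)

    def backup_block(self, block: bytes) -> str:
        """Steps (3.1)-(3.3): fingerprint the block, store it only if new."""
        fp = hashlib.md5(block).hexdigest()
        if fp in self.refcount:
            self.refcount[fp] += 1   # known block: only update the index information
        else:
            self.blocks[fp] = block  # new backup data block: write it
            self.refcount[fp] = 1
        return fp

    def backup_file(self, data: bytes, block_size: int = 4) -> list:
        """Split into fixed-size blocks (an assumption) and return the Hash list."""
        return [self.backup_block(data[i:i + block_size])
                for i in range(0, len(data), block_size)]

    def restore_file(self, hash_list: list) -> bytes:
        """Step (4): read each block by fingerprint and reassemble the file."""
        return b"".join(self.blocks[fp] for fp in hash_list)

    def delete_block(self, fp: str) -> bool:
        """Steps (5.2)-(5.3): False if the hash is unknown, else decrement."""
        if fp not in self.refcount:
            return False
        self.refcount[fp] -= 1
        return True


store = DedupStore()
hash_list = store.backup_file(b"abcdabcdxyz")  # blocks: abcd, abcd, xyz
restored = store.restore_file(hash_list)
```

In this run the repeated block `abcd` is stored once with a reference count of 2. The sketch deliberately does nothing when a count reaches zero, because the claim only specifies the decrement.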

2. The data organization method for backup services according to claim 1, characterized in that, in step (3.3), P denotes the number of logical data blocks pre-allocated in each data space, W denotes the maximum number of logical data blocks that each data space can hold, V denotes the number of data spaces already in use, and S denotes the total number of data spaces;

The dynamic allocation of storage space proceeds as follows:

(a1) Determine whether the P logical data blocks of the current data space have remaining free space sufficient for a backup data block of the specified size; if so, go to step (a5); otherwise, go to step (a2);

(a2) Determine whether P < W holds; if so, go to step (a6); otherwise, go to step (a3);

(a3) Determine whether any other data space in the storage server master record has remaining free space sufficient for a backup data block of the specified size; if so, go to step (a5); otherwise, go to step (a4);

(a4) Determine whether V < S holds; if so, add a data space on the storage server, allocate a data index for the backup data block to be written in the new data space, and then go to step (a8); otherwise, go to step (a7);

(a5) Allocate a data index for the backup data block to be written in the remaining free space, and then go to step (a8);

(a6) Add space of one logical data block to this data space on the storage server, allocate a data index for the backup data block, and then go to step (a8);

(a7) Report that the dynamic allocation has failed;

(a8) End the dynamic allocation procedure.
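Steps (a1) to (a8) can be sketched as a single function. This is a minimal model under assumptions that are not in the claim: the `DataSpace` class, the representation of a data index as a `(space_no, block_no, offset)` tuple, first-fit placement inside a block, and the `prealloc` parameter are all hypothetical. Here P is `len(space.used)`, W is `max_blocks`, V is `len(spaces)`, and S is `max_spaces`.

```python
from dataclasses import dataclass, field


@dataclass
class DataSpace:
    block_capacity: int                       # bytes per logical data block
    max_blocks: int                           # W
    used: list = field(default_factory=list)  # bytes used per allocated block; len == P

    def find_block(self, size):
        """Return the first allocated block with enough free space, or None."""
        for i, u in enumerate(self.used):
            if self.block_capacity - u >= size:
                return i
        return None

    def place(self, block_no, size):
        """Reserve size bytes in the block and return the starting offset."""
        offset = self.used[block_no]
        self.used[block_no] += size
        return offset


def allocate(spaces, current, size, max_spaces, prealloc=1):
    """Return a hypothetical data index (space_no, block_no, offset), or None."""
    cur = spaces[current]
    if size > cur.block_capacity:
        return None                                   # assumption: blocks never span
    b = cur.find_block(size)                          # (a1)
    if b is not None:
        return (current, b, cur.place(b, size))       # (a5)
    if len(cur.used) < cur.max_blocks:                # (a2): P < W
        cur.used.append(0)                            # (a6): grow by one logical block
        b = len(cur.used) - 1
        return (current, b, cur.place(b, size))
    for n, sp in enumerate(spaces):                   # (a3): try other data spaces
        if n != current:
            b = sp.find_block(size)
            if b is not None:
                return (n, b, sp.place(b, size))      # (a5)
    if len(spaces) < max_spaces:                      # (a4): V < S
        new = DataSpace(cur.block_capacity, cur.max_blocks, [0] * prealloc)
        spaces.append(new)
        return (len(spaces) - 1, 0, new.place(0, size))
    return None                                       # (a7): allocation failed


spaces = [DataSpace(block_capacity=10, max_blocks=2, used=[0])]
first = allocate(spaces, 0, 6, max_spaces=2)    # fits in the pre-allocated block
second = allocate(spaces, 0, 6, max_spaces=2)   # grows the current data space
third = allocate(spaces, 0, 6, max_spaces=2)    # opens a new data space
```

Note that, following the claim literally, only the current data space is grown in step (a6); other data spaces are searched for free space but not extended.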

3. The data organization method for backup services according to claim 1, characterized in that the write operation comprises the following steps:

(b1) Dynamically search for free storage space on the storage server, checking whether a logical data block satisfying the condition exists;

(b2) If no logical data block is available, return failure;

(b3) If there is free storage space sufficient for the new backup data block, create a new data index, write the new backup data block to the corresponding storage server location, and then write the corresponding index header and data index metadata to the master record area.

4. The data organization method for backup services according to claim 1, characterized in that the process of accessing the storage space metadata according to the Hash list to locate the logical position in the corresponding data space is as follows:

(4.1) Index the index header according to its predefined number of bits m; the content of the index header is the data index number formed from the first m bits of the data hash value;

(4.2) Address the data index through the index header, then address the data item of a single operation through the data index, which comprises three parts:

(4.2.1) Find the specific data space number from the data space number member of the data index structure;

(4.2.2) Find the block number within the data space from the logical data block number member of the data index structure;

(4.2.4) Find the offset address of the data item within the logical data block from the offset address member of the data index structure, and locate the data header and data entity of the data item;

(4.4) Using the address information obtained, locate the corresponding data area and read the data.
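The two-level addressing of claim 4 can be sketched as follows. The sketch assumes that the index header is a table of 2^m entries selected by the first m bits of the hash value, and that each entry holds a list of data index records; the list, the names `DataIndex`, `IndexTable`, and `header_number`, and the field names are hypothetical, since the claim does not state how several data indexes under one header are organized.

```python
import hashlib
from typing import NamedTuple


class DataIndex(NamedTuple):
    fingerprint: bytes  # full hash value of the data block
    space_no: int       # data space number
    block_no: int       # logical data block number within the data space
    offset: int         # offset address within the logical data block


def header_number(fingerprint: bytes, m: int) -> int:
    """Return the first m bits of the hash value as an integer (1 <= m <= bit length)."""
    total_bits = len(fingerprint) * 8
    return int.from_bytes(fingerprint, "big") >> (total_bits - m)


class IndexTable:
    def __init__(self, m: int):
        self.m = m
        self.headers = [[] for _ in range(1 << m)]  # one entry per m-bit prefix

    def insert(self, entry: DataIndex) -> None:
        self.headers[header_number(entry.fingerprint, self.m)].append(entry)

    def locate(self, fingerprint: bytes):
        """Steps (4.1)-(4.2): header lookup, then match the full fingerprint."""
        for entry in self.headers[header_number(fingerprint, self.m)]:
            if entry.fingerprint == fingerprint:
                return (entry.space_no, entry.block_no, entry.offset)
        return None


table = IndexTable(m=8)
fp = hashlib.md5(b"block").digest()
table.insert(DataIndex(fp, space_no=3, block_no=17, offset=4096))
location = table.locate(fp)
```

With m = 8 the header number is simply the first byte of the MD5 digest. The returned triple is what the final step would use to seek into the data area and read the data header and data entity.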