CN100547555C

CN100547555C - A Data Backup System Based on Fingerprint

Info

Publication number: CN100547555C
Application number: CNB2007101687158A
Authority: CN
Inventors: 冯丹; 刘景宁; 杨天明; 周可; 牛中盈; 张航; 刘高
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2007-12-10
Filing date: 2007-12-10
Publication date: 2009-10-07
Anticipated expiration: 2027-12-10
Also published as: CN101183323A

Abstract

A fingerprint-based data backup system, belonging to the technical field of computer storage backup, aims to reduce the management, storage and network overhead of data backup and to improve backup performance. The invention comprises a backup server, backup agents, a storage server and a Web server, which communicate over a network to perform data backup and recovery. The invention uses anchor-based file chunking to identify redundant data in backup files; this technique is stable under modification and has low computational overhead. Data blocks are stored on the disk array of the storage server indexed by their fingerprints, which eliminates the backup of redundant data and saves disk space. Once stored, data blocks are never erased and are appended contiguously to the disk, which eliminates disk fragmentation. An effective backup buffering strategy reduces the network overhead of backup, increases backup speed, and reduces the impact of backup on application servers.

Description

A Data Backup System Based on Fingerprint

Technical field

The invention belongs to the field of computer storage backup, and in particular relates to a data backup system.

Background

In today's information age, data is a precious resource for enterprises and individuals alike. At best, data loss disrupts business continuity and costs an enterprise its competitive advantage for a time; at worst, it can drive a company into bankruptcy. Data loss has many causes, including software and hardware failures, human error or sabotage, and force majeure (natural disasters, war). The traditional way to protect data is to copy it periodically to removable media such as tapes or optical discs and then transport the media offline to a relatively safe location, so that the data can be restored when necessary.

This traditional method has some obvious disadvantages. (1) Removable media such as tapes and optical discs wear out or become damaged over time, which reduces their reliability and makes them unsuitable for long-term data storage. (2) Tape, the medium commonly used to back up large volumes of data, is often slow to read and write. Because it is a sequential-access device, restoring data usually involves frequent mechanical rewinding, and if the backup data is spread across several tapes, time-consuming loading and unloading is also required. Backup and recovery with tape is therefore quite time-consuming. (3) Staff must be hired to transport the backup data to a remote site and to keep it secure during transport and storage. Traditional data backup thus requires manual intervention for many tasks and is costly and tedious.

To improve the efficiency of data backup and recovery and to overcome the shortcomings of traditional data protection, well-known IT companies and research institutions have developed a wide variety of data backup systems over the past two decades, including IBM's TotalStorage; HP's OpenView storage mirroring software, CASA, XPCA and EVACA; EMC's SRDF and MirrorView; and VERITAS's NetBackup. These commercial systems have no deduplication function. To store the large amount of redundant data produced by backups, they often rely on disk-to-tape (D2T) technology: high-speed disks serve as a backup buffer to improve online backup efficiency, and the backup data in the disk buffer is then migrated in the background to low-speed, high-capacity media such as tape libraries or optical disc libraries. The back-end storage devices therefore still require considerable manpower and material resources for routine maintenance.

Because disk storage is easier to manage and faster to access than tape, backup systems that store data on disk have attracted growing attention as disk technology has developed. Current disk technology makes it easy to build a storage system at the terabyte or even petabyte scale, and the steadily falling price per bit of disk storage makes permanent archiving of data on disk practical. For a disk-based backup system, storing backup data permanently without erasure has several advantages. First, data can be written to disk contiguously, and no fragmentation arises from space reclamation. Second, the user's data history is preserved in full, so the user can easily browse any historical version of a file. Third, the user's backup data is protected against accidental deletion.

However, the biggest challenge for a disk-based backup system with permanent storage is the ever-growing volume of backup data. Enterprise data is typically highly redundant: many duplicate data items and files are stored in the system, and multiple edited versions of a file share much of their content. The file-based backup techniques in wide use today cannot identify redundant data across files, so more and more duplicate data is backed up. This not only lowers the disk space utilization of the backup system but also needlessly transfers large amounts of redundant data over the network, increasing the network overhead of backup and lengthening backup time.

It is therefore worthwhile to develop a disk-based backup system with permanent storage that uses a new backup technique to eliminate redundant backup data and improve the storage efficiency of the system.

Summary of the invention

The invention proposes a fingerprint-based data backup system. The system stores backup data permanently on disk and uses a fingerprint-based backup technique to eliminate redundant data from backups, with the aim of reducing the management, storage and network overhead of data backup and improving backup performance.

The fingerprint-based data backup system of the invention comprises a backup server, backup agents, a storage server and a Web server, which communicate with one another over a network to perform data backup and recovery, and is characterized in that:

The backup server holds a configuration file and a directory database. The configuration file records user-defined job objects; a job object contains attributes that specify how the system runs the job, and the backup server controls the entire backup and recovery process through job objects. The directory database stores job records, and a job record holds the management information for a run of a job object;

A backup agent is installed on every host in the network whose data needs to be backed up. During backup, the backup agent reads the files to be backed up from the file system of its host, performs anchor-based chunking on each file, computes the fingerprint of each block, and sends the fingerprints, together with the data of those blocks that are actually needed, to the storage server over the network. During recovery, the backup agent receives file data from the storage server over the network and writes it to the specified directory in the file system of its host. The backup agent performs anchor-based chunking of a file in the following steps:

(1) Take the first 48 bytes of the file, b₁, b₂, ..., b₄₈, as a window, and compute the hash value of this first window as H₁ = (b₁*p⁴⁷ + b₂*p⁴⁶ + ... + b₄₈) mod M, where p is 17 and M is 2³²; the hash value is stored in the variable H₁;

(2) Slide the window forward by one byte and compute the hash value of the second window b₂, b₃, ..., b₄₉ as H₂ = (p*H₁ + b₄₉ - b₁*p⁴⁸) mod M; the hash value is stored in the variable H₂;

(3) Continue in the same way to compute the hash values of all windows of the file;

(4) For the hash value of each window, take its low-order 13 bits as a binary number; if this number equals 61, the corresponding window is an anchor. The file is divided at the anchors into data blocks of varying size;
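Steps (1) to (4) can be sketched in Python as follows. The window size, p, M, the 13-bit mask and the value 61 come from the text. Two details are assumptions made only for illustration: the block boundary is placed immediately after the last byte of an anchor window, and SHA-1 is used as the fingerprint function, since the text names neither.

```python
import hashlib

WINDOW = 48                 # window size in bytes (from the text)
P = 17                      # polynomial base p (from the text)
M = 1 << 32                 # modulus M = 2^32 (from the text)
ANCHOR_MASK = (1 << 13) - 1 # selects the low-order 13 bits
ANCHOR_VALUE = 61           # a window is an anchor when the masked hash equals 61
P_POW_WINDOW = pow(P, WINDOW, M)  # p^48 mod M, used when the oldest byte leaves


def window_hashes(data: bytes):
    """Yield (end_offset, hash) for every 48-byte window of data."""
    if len(data) < WINDOW:
        return
    h = 0
    for b in data[:WINDOW]:
        h = (h * P + b) % M  # Horner form of b1*p^47 + b2*p^46 + ... + b48
    yield WINDOW, h
    for i in range(WINDOW, len(data)):
        # Rolling update: H_next = (p*H + b_in - b_out*p^48) mod M
        h = (P * h + data[i] - data[i - WINDOW] * P_POW_WINDOW) % M
        yield i + 1, h


def chunk(data: bytes):
    """Split data into variable-size blocks at anchors.

    Assumption: the boundary falls right after the anchor window's last byte.
    """
    blocks, start = [], 0
    for end, h in window_hashes(data):
        if (h & ANCHOR_MASK) == ANCHOR_VALUE:
            blocks.append(data[start:end])
            start = end
    if start < len(data):
        blocks.append(data[start:])  # trailing bytes after the last anchor
    return blocks


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-1 stands in for the unspecified fingerprint function.
    return hashlib.sha1(block).hexdigest()
```

Because an anchor depends only on the 48 bytes inside its window, inserting or deleting bytes in one place moves block boundaries only near the edit; blocks further along keep their content and therefore their fingerprints.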

The storage server is equipped with a large-capacity disk array, which is the destination of data backup. During backup, the storage server receives fingerprints or data blocks from the corresponding backup agent over the network, stores the data blocks on disk, and builds an index for each file. During recovery, it reconstructs files from the disk array according to the file index and sends the file data over the network to the corresponding backup agent;

The Web server provides the web-based user management interface of the system in browser/server (B/S) mode. By logging in to the Web server, a user can instruct the system to run interactive backup or recovery jobs, monitor the progress of automatically scheduled jobs, modify the configuration file of the backup server, customize job objects, and manage devices.

The fingerprint-based data backup system is further characterized in that the backup server comprises a backup server initialization module, a command listening module, a command processing module, a job processing module and a network communication module;

The backup server initialization module performs initialization, which includes reading the configuration file, building the in-memory resource linked list, checking the state of the directory database, ensuring the consistency and integrity of the data in the configuration file and the directory database, opening the command listening port to accept user commands from the Web server, initializing the job queue and the user command queue, loading job objects into the job queue, and starting the job and network monitoring services;

The command listening module is a network listening thread created by the system. It authenticates connection requests from Web servers, so that only Web servers authorized by the system can connect, and listens for command requests sent by authenticated Web servers. When a command request arrives, it is added to the user command queue to await processing;

The command processing module comprises a user command queue and N command worker threads. When the user command queue overflows, the command listening module goes to sleep. The command worker threads continuously take commands from the user command queue and execute them, performing different functions depending on the command. When the command listening module adds a command to the user command queue, a new command worker thread is created if no command worker thread is idle and the number of active command worker threads has not reached N. Each time a command worker thread takes a command from the user command queue, it checks the state of the command listening module and wakes it if it is asleep;

The job processing module comprises a job queue, L job worker threads and a job queue loading thread. When the job queue overflows, the job queue loading thread goes to sleep. The job worker threads continuously take job objects from the job queue and execute them, calling different resources and performing different functions depending on the attributes of the job object. The job queue loading thread performs job scheduling: it checks the scheduling policy attribute of every job object in the job resource list and adds the job objects that are due to run to the job queue; if no job worker thread is idle and the number of active job worker threads has not reached L, a new job worker thread is created. Each time a job worker thread takes a job object from the job queue, it checks the state of the job queue loading thread and wakes it if it is asleep;

The network communication module wraps the standard network communication application programming interface and provides a network communication interface to the command worker threads and job worker threads. This interface implements the data transfer protocol between the backup server, the backup agents and the storage server.
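The queue-and-worker pattern shared by the command processing and job processing modules can be sketched as below. This is only an illustration under stated assumptions: the class name `CommandPool` and its methods are hypothetical, and Python's bounded `queue.Queue` stands in for the explicit sleep and wake-up logic, since a blocked `put` resumes as soon as a worker's `get` frees a slot.

```python
import queue
import threading


class CommandPool:
    """Bounded command queue served by at most max_workers lazily created threads."""

    def __init__(self, max_workers: int, queue_size: int):
        self.commands = queue.Queue(maxsize=queue_size)  # the user command queue
        self.max_workers = max_workers                   # N in the text
        self.workers = []
        self.idle = 0
        self.lock = threading.Lock()

    def submit(self, command):
        """Called by the listener; blocks (sleeps) while the queue is full."""
        with self.lock:
            # Create a worker only if none is idle and the limit N is not reached.
            if self.idle == 0 and len(self.workers) < self.max_workers:
                worker = threading.Thread(target=self._work, daemon=True)
                self.workers.append(worker)
                worker.start()
        self.commands.put(command)

    def _work(self):
        while True:
            with self.lock:
                self.idle += 1
            command = self.commands.get()  # frees a slot, waking a blocked submit()
            with self.lock:
                self.idle -= 1
            try:
                command()
            finally:
                self.commands.task_done()
```

A listener thread would call `pool.submit(handler)` for each authenticated request; `pool.commands.join()` waits until every queued command has been executed.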

The fingerprint-based data backup system is further characterized in that the backup agent comprises a backup agent initialization module, a request listening module, a job processing module, a file chunking module and a network communication module;

The backup agent initialization module performs initialization, which includes reading the backup agent configuration file, building the in-memory resource linked list, initializing the job queue, and starting the request listening module;

The request listening module listens for connection requests from backup servers on the network and authenticates each connecting backup server. After authentication succeeds, it creates a network connection socket for communicating with that backup server and adds the socket to the job queue;

The job processing module comprises a job queue and M job worker threads. When the job queue overflows, the request listening module goes to sleep. After a job worker thread takes a network connection socket from the job queue, it first creates a job control record for the job and links the socket into a member variable of the record. It then interacts with the backup server through this socket and, after conversion, assigns the relevant attributes of the backup server's job object to the corresponding member variables of the job control record. Next, it uses the job ticket obtained from the backup server to connect to the corresponding storage server, creating a network connection socket for communicating with the storage server and linking it into a member variable of the job control record. When the request listening module adds a network connection socket to the job queue, a new job worker thread is created if no job worker thread is idle and the number of active job worker threads has not reached M. Each time a job worker thread takes a network connection socket from the job queue, it checks the state of the request listening module and wakes it if it is asleep;

The file chunking module carries out the file chunking task of a backup job on command from a job worker thread of the job processing module. It opens each file of the file set on the client file system, performs anchor-based chunking on the file, computes the block fingerprints, and cooperates with the corresponding storage server to execute the backup algorithm of the first backup process;

The network communication module consists of the network connection sockets of the jobs. Each job of the backup agent owns two network connection sockets, used to communicate with the corresponding backup server job and storage server job respectively.

The fingerprint-based data backup system is further characterized in that the storage server comprises a storage server initialization module, a connection monitoring module, a job ticket table, a job processing module and a network communication module, as well as an index buffer, a block buffer, a block hash table and a disk log;

The storage server initialization module performs initialization, which includes parsing the storage server configuration file, building the in-memory resource linked list, and starting the related service threads;

The connection monitoring module monitors connection requests from backup servers and backup agents. It authenticates each connecting backup server and, after authentication succeeds, creates a network connection socket for communicating with that backup server and adds the socket to the job queue. For a connecting backup agent, it checks the job ticket presented by the agent against the job ticket table; after authentication succeeds, it creates a network connection socket for communicating with that backup agent and links the socket into a member variable of the corresponding job control record;

The job ticket table stores the tickets used to authenticate backup agent jobs;

The job processing module comprises a job queue and W job worker threads. When the job queue overflows, the connection monitoring module enters the "reject backup server connection requests" state. After a job worker thread takes a network connection socket from the job queue, it first creates a job control record for the job and links the socket into a member variable of the record. It then interacts with the backup server through this socket and, after conversion, assigns the relevant attributes of the backup server's job object to the corresponding member variables of the job control record. It also randomly generates a job ticket, registers it in the job ticket table, and sends it to the backup server's job object. When the connection monitoring module adds a network connection socket to the job queue, a new job worker thread is created if no job worker thread is idle and the number of active job worker threads has not reached W. Each time a job worker thread takes a network connection socket from the job queue, it checks the state of the connection monitoring module; if the module is in the "reject backup server connection requests" state, the thread clears that state so that the module accepts backup server connection requests again;
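The ticket handshake described above (the storage server issues a random ticket for a job, the backup server relays it to the backup agent, and the agent presents it when connecting) can be sketched as a small table. The class and method names are hypothetical, and the use of `secrets.token_hex` for random generation is an assumption; the text says only that the ticket is generated randomly.

```python
import secrets


class JobTicketTable:
    """Maps randomly generated job tickets to storage-server job identifiers."""

    def __init__(self):
        self._tickets = {}

    def issue(self, job_id: str) -> str:
        # Assumption: a 128-bit random hex string serves as the job ticket.
        ticket = secrets.token_hex(16)
        self._tickets[ticket] = job_id
        return ticket

    def authenticate(self, ticket: str):
        """Return the job id for a registered ticket, or None if unknown."""
        return self._tickets.get(ticket)
```

In this sketch a job worker thread would call `issue` when it creates the job control record, and the connection monitoring module would call `authenticate` when a backup agent connects.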

The network communication module consists of the network connection sockets of the jobs. Each job of the storage server owns two network connection sockets, used to communicate with the corresponding backup server job and backup agent job respectively;

The index buffer is infrastructure used by storage server jobs in the first and second backup processes. It is implemented as an in-memory hash table and stores all fingerprints contained in Job_x(t_{n-1}), the job instance that precedes the current instance Job_x(t_n) in the job chain, together with the fingerprints newly generated while the current job runs;

The block buffer is infrastructure used by storage server jobs in the first and second backup processes. It is implemented on a separate disk array and temporarily stores the data blocks whose fingerprints are not found in the index buffer during the first backup process;

The block hash table is infrastructure used by storage server jobs in the second backup process. It is implemented on a separate disk array and maps each block fingerprint to the storage address of that block in the disk log;

The disk log is infrastructure used by storage server jobs in the second backup process. It is implemented on a separate disk array and stores the data blocks and the file indexes, which are themselves stored as blocks.
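The relationship between the block hash table and the disk log can be illustrated with a minimal in-memory sketch. The names `DiskLog`, `put` and `get` are hypothetical; an `io.BytesIO` buffer stands in for the disk array, a Python dict stands in for the on-disk block hash table, and SHA-1 is an assumed fingerprint function.

```python
import hashlib
import io


class DiskLog:
    """Append-only block store addressed by fingerprint."""

    def __init__(self):
        self._log = io.BytesIO()   # stand-in for the disk log
        self._block_hash = {}      # fingerprint -> (offset, length) in the log

    def put(self, block: bytes) -> str:
        """Store a block once; duplicates only return the existing fingerprint."""
        fp = hashlib.sha1(block).hexdigest()  # assumed fingerprint function
        if fp not in self._block_hash:
            offset = self._log.seek(0, io.SEEK_END)  # always append at the end
            self._log.write(block)
            self._block_hash[fp] = (offset, len(block))
        return fp

    def get(self, fp: str) -> bytes:
        offset, length = self._block_hash[fp]
        self._log.seek(offset)
        return self._log.read(length)

    def size(self) -> int:
        return len(self._log.getvalue())
```

Because blocks are only ever appended and never erased, the log grows without gaps, and writing the same content twice consumes no additional space.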

The advantages of the invention are:

1. Anchor-based file chunking divides a file into variable-length blocks in order to identify redundant data within a file or across files. The technique is stable under modification: a change to a file affects only the blocks adjacent to the modified region, and the boundaries of the other blocks do not move. In an incremental backup of a file, only the few modified blocks need to be backed up, and the remaining blocks can be shared with earlier backups of the file. Because the hash is computed with a sliding window, the computational overhead is low.

2. Data blocks are stored on the disk array of the storage server indexed by their fingerprints. This ties the storage address of data to its content, departing from the traditional separation of storage address and content, and it eliminates the backup of redundant data and saves disk space;

3. Once stored, data blocks are never erased and can be appended contiguously to the disk, which eliminates disk fragmentation. The user's data history is preserved in full, so the user can easily browse any historical version of a file, and important data is protected against accidental deletion.

4. An effective backup buffering strategy reduces the network overhead of backup, increases backup speed, and reduces the impact of backup on application servers.

Brief description of the drawings

Fig. 1 is a schematic diagram of the structure of the invention;

Fig. 2 is a schematic diagram of the structure of the backup server;

Fig. 3 is a schematic diagram of the structure of the backup agent;

Fig. 4 is a schematic diagram of the structure of the storage server;

Fig. 5 is a schematic diagram of how a file is stored on the disk log;

Fig. 6 is a schematic diagram of multiple files sharing data blocks and index blocks on the disk log;

Fig. 7 is a structural diagram of the index buffer of the invention;

Fig. 8 is a schematic diagram of file chunking in the anchor-based file chunking technique.

Detailed description

The invention is described in further detail below with reference to the drawings and embodiments.

1. Overall system structure

Fig. 1 is a schematic diagram of the system architecture of the invention. The invention comprises a backup server, backup agents, a storage server and a Web server, which communicate with one another over a network to perform data backup and recovery.

Fig. 2 is a schematic diagram of the structure of the backup server. The backup server comprises a backup server initialization module, a command listening module, a command processing module, a job processing module and a network communication module; it also holds a configuration file and a directory database.

The backup server is the command center of the entire network backup system and controls the whole backup and recovery process through job objects. Job objects give the user a means of customizing backup and recovery jobs. A job object contains a number of attributes that specify how the system runs the job. For example, the backup agent attribute specifies the host from which the job backs up data or to which it restores data; the file set attribute specifies the directories the job backs up or restores; and the scheduling policy attribute specifies the policy by which the system schedules the job. Denote a job object by Job_x. When the job object is scheduled to run at time t, it produces a run instance Job_x(t). The chronological sequence of run instances Job_x(t₀), Job_x(t₁), ..., Job_x(t_n) (t₀ < t₁ < ... < t_n) of job object Job_x forms a job chain of that job object, denoted Job_x(t₀, t₁, ..., t_n). The backup server also maintains a directory database that records the management information of Job_x(t). Specifically, the management information of Job_x(t) is stored in the job record Job_x(t).Record of that job in the directory database.

Directory database: stores the management information of job runs, that is, Job_x(t).Record. Job_x(t).Record mainly holds the root blocks of the files contained in the job, the fingerprint file Job_x(t).FF of the job, and so on. Every completed job Job_x(t) keeps a fingerprint file Job_x(t).FF in the directory database; Job_x(t).FF stores all fingerprints contained in job Job_x(t). Job_x(t_n).FF is used to initialize the index buffer of job Job_x(t_{n+1}).

Fig. 3 is a schematic diagram of the structure of the backup agent. The backup agent comprises a backup agent initialization module, a request listening module, a job processing module, a file chunking module and a network communication module.

Fig. 4 is a schematic diagram of the structure of the storage server. The storage server comprises a storage server initialization module, a connection monitoring module, a job ticket table, a job processing module and a network communication module, as well as an index buffer, a block buffer, a block hash table and a disk log.

The storage server manages a large-capacity disk array (RAID) for storing data blocks. Blocks are stored on the disk array indexed by their fingerprints. Once written to disk, a data block is never erased, so the whole disk array behaves like a log: data blocks are appended to the disk without gaps, which eliminates fragmentation. The disk that stores the data blocks is called the disk log. The storage server uses a dedicated disk array to hold the block hash table, which maps each block fingerprint to the storage address of that block in the disk log. All data blocks of a backed-up file are indexed through index blocks, and all index blocks of a file form an index tree. Every file also has exactly one block called the root block, which stores the index of the root of the file's index tree along with the file's metadata and some management information. The root block and the index blocks of a file are also stored on the disk log as data blocks.

The storage server uses a backup buffering strategy to increase the backup speed of the system, as follows. (1) An in-memory index buffer stores all fingerprints contained in Job_x(t_{n-1}), the job instance that precedes the current instance Job_x(t_n) in the job chain, together with the fingerprints newly generated while the current job runs. (2) A dedicated disk array serves as a block buffer that temporarily stores the data blocks whose fingerprints are not found in the index buffer during backup. (3) The backup of a job is completed in two phases, called the first backup process and the second backup process.

In the first backup process, the backup agent and the storage server interact to back up the file blocks: the index buffer is used to look up block fingerprints, and the block buffer stores the data blocks whose fingerprints are not found in the index buffer. From the backup agent's point of view, the backup of the job is finished once the first backup process completes. Because this process looks up fingerprints in the in-memory index buffer and avoids time-consuming queries of the block hash table, it is fast.

The second backup process is run by the storage server when the system is relatively idle. It moves the data blocks held temporarily in the block buffer to the disk log, using the block hash table for fingerprint lookups, and it also builds the index tree of each file on the disk log. Because the second backup process is carried out in the background by the storage server alone, it has no impact on the application servers that run backup agents. When a file is restored, the storage server reconstructs the file according to the file index and sends the file data over the network to the corresponding backup agent.

Web server: the invention provides a web user interface in B-S (browser-server) mode. From any location, a user can log in to the system's management interface with a Web browser to direct the system to perform interactive backup or recovery jobs, monitor the automatically scheduled jobs, customize jobs, configure the backup server and manage devices.

2. Storage server disk log

Backup data chunks are stored in the storage server's disk log, indexed by their fingerprints. This guarantees that no two identical chunks are ever stored on disk, which eliminates the backup of redundant data. A chunk is never erased once stored, so chunks can be appended to the disk log contiguously, which eliminates disk fragmentation. The data blocks that belong to a backed-up file are indexed by index blocks, and the index blocks of the file are also stored in the disk log.

2.1 Chunk header

To ease management, every chunk is preceded by a chunk header. The header supplies the information needed for system management, including integrity checking, file indexing and reconstruction of the chunk hash table. It is 39 bytes long and consists of the following fields:

magic: a 6-character header flag;

fingerprint: the fingerprint of this chunk, 20 bytes;

type: the type of this chunk. There are three types of chunk: data block, index block and file root block, denoted dc, ic and rc respectively;

size: the size of this chunk, excluding the header. The system requires that an index block not exceed 16KB;

offset: the storage address of this chunk in the disk log.
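As an illustration of the header just described, the following Python sketch packs and parses a 39-byte header. Only the 6-byte magic, the 20-byte fingerprint and the 39-byte total come from the text. The 1/4/8-byte split of type, size and offset, the magic value, the numeric type codes and the use of SHA-1 as the fingerprint function are assumptions made for the example.

```python
import hashlib
import struct

# Assumed layout: 6-byte magic, 20-byte fingerprint, 1-byte type,
# 4-byte size, 8-byte offset. Only the first two widths and the
# 39-byte total come from the text; the rest is a guess that adds up.
HEADER = struct.Struct("<6s20sBIQ")

MAGIC = b"FPCHNK"                         # hypothetical 6-character flag
TYPE_CODES = {"dc": 0, "ic": 1, "rc": 2}  # hypothetical numeric codes
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}
MAX_INDEX_BLOCK = 16 * 1024               # index blocks may not exceed 16KB


def pack_header(payload: bytes, kind: str, offset: int) -> bytes:
    """Build the header that precedes `payload` in the disk log."""
    if kind == "ic" and len(payload) > MAX_INDEX_BLOCK:
        raise ValueError("index block larger than 16KB")
    fingerprint = hashlib.sha1(payload).digest()  # 20 bytes; SHA-1 is assumed
    return HEADER.pack(MAGIC, fingerprint, TYPE_CODES[kind], len(payload), offset)


def unpack_header(raw: bytes) -> dict:
    """Parse a header and reject anything that lacks the magic flag."""
    magic, fingerprint, code, size, offset = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("not a chunk header")
    return {"fingerprint": fingerprint, "type": TYPE_NAMES[code],
            "size": size, "offset": offset}
```

Because each header repeats the fingerprint, type and offset of its chunk, a scan of the disk log that parses headers in this way would be enough to rebuild the chunk hash table, which is the reconstruction use the text mentions.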

2.2 File index

Figure 5 shows how a file is stored in the disk log. The data blocks of a file are indexed by index blocks, which are also stored in the disk log; all index blocks of a file form an index tree. Every file has exactly one root block in the disk log, which stores the index entry of the root of the file's index tree together with the file's metadata and some management information about the file. After a file has been backed up, its root block is also stored, as job management information, in the job record in the directory database. In Figure 5, F₀ denotes a file, D_i a data block and I_i an index block. An index block consists of index entries. P(X) denotes an index entry, which is a triple <H(X), offset, type>, where X is the indexed chunk, H(X) is the fingerprint of X, offset is the storage address of X in the disk log, and type is the type of X. X may be an index block I_i or a data block D_i. The arrows in the figure link each indexed block to its index entry. M(F₀) denotes the metadata and management information of file F₀. Index blocks I₀, I₁ and I₂ form the index tree of F₀, with I₀ as its root. R₀ denotes the root block of F₀; it consists of M(F₀) and an index entry P(I₀) that points to the root I₀ of the file's index tree. All data blocks and index blocks in the disk log can be shared by different files. Figure 6 shows different files sharing data blocks and index blocks; the symbols have the same meaning as in Figure 5.

3. Storage server chunk hash table

The chunk hash table of the storage server maps each chunk fingerprint to the storage address of that chunk in the disk log. It consists of buckets of equal size. The number of buckets is chosen according to the size of the disk log: the larger the disk log, the more buckets the table contains, so as to reduce the probability of bucket collisions. Given the number of buckets, the system takes the first n bits of a fingerprint as the bucket number and maps the fingerprint to that bucket. Each fingerprint is stored in its bucket as a triple <fingerprint, offset, type>, where fingerprint is the fingerprint of the chunk, offset is the storage address of the corresponding chunk in the disk log, and type is the type of that chunk. If a collision occurs in a bucket, the triple is stored in an adjacent bucket.
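The mapping described above can be sketched in Python as follows. The text does not say how large a bucket is or exactly how the adjacent bucket is chosen, so the per-bucket capacity, the "try the next bucket, wrapping around" probing rule and the class name `ChunkHashTable` are assumptions of this sketch. The real table lives on a dedicated disk array, whereas this one is held in memory.

```python
class ChunkHashTable:
    """In-memory stand-in for the on-disk chunk hash table."""

    def __init__(self, n_bits: int, bucket_capacity: int):
        self.n_bits = n_bits              # first n bits select the bucket
        self.capacity = bucket_capacity   # hypothetical fixed bucket size
        self.buckets = [[] for _ in range(1 << n_bits)]

    def bucket_no(self, fingerprint: bytes) -> int:
        """Take the first n bits of the fingerprint as the bucket number."""
        total_bits = len(fingerprint) * 8
        return int.from_bytes(fingerprint, "big") >> (total_bits - self.n_bits)

    def insert(self, fingerprint: bytes, offset: int, kind: str) -> None:
        start = self.bucket_no(fingerprint)
        for step in range(len(self.buckets)):
            bucket = self.buckets[(start + step) % len(self.buckets)]
            if len(bucket) < self.capacity:
                bucket.append((fingerprint, offset, kind))
                return
        raise RuntimeError("chunk hash table is full")

    def lookup(self, fingerprint: bytes):
        start = self.bucket_no(fingerprint)
        for step in range(len(self.buckets)):
            bucket = self.buckets[(start + step) % len(self.buckets)]
            for entry in bucket:
                if entry[0] == fingerprint:
                    return entry
            if len(bucket) < self.capacity:
                return None   # entries are never deleted, so a gap ends the probe
        return None
```

Stopping the probe at the first bucket that still has room is only valid because chunks, and therefore table entries, are never erased.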

4. Storage server index buffer

Figure 7 shows the structure of the index buffer. The index buffer is an in-memory hash table made up of a bucket array and many linked lists. The bucket array has 1024*1024 buckets, numbered 00000H to FFFFFH. A bucket may be empty; a non-empty bucket holds a pointer to one linked list, whose entries store the fingerprints hashed into that bucket. To hash a fingerprint, its first 20 bits are taken as the bucket number and the fingerprint is placed in the linked list that the bucket points to.

A linked-list entry has the following structure:

tag: identifier, 4 bits, indicating the state of this fingerprint during the first and second backup processes;

fingerprintTail: the last 140 bits of the chunk's fingerprint; because the first 20 bits are implied by the bucket number, only the last 140 bits need to be stored;

offset: storage address, 64 bits; if not null, it is the storage address in the disk log of the chunk that corresponds to this fingerprint;

next: 32 bits, pointer to the next entry.

The "one fingerprint" example in Figure 7 shows the fingerprint 7E54F36A4EC62…3B being hashed into the index buffer. Step (1) uses the first 20 bits of the fingerprint, "7E54F", as the bucket number (bucketNo) to locate bucket 7E54FH. Step (2) searches the linked list of that bucket for an entry whose fingerprintTail is "36A4EC62…3B". If one is found, the fingerprint is already in the index buffer; otherwise a new entry is created to store it.

The tag of a linked-list entry in the index buffer takes one of three values, with the following meanings:

0000: the fingerprint comes from the fingerprint file of the previous job and has not been hit during the current backup;

1000: the fingerprint comes from the fingerprint file of the previous job and has been hit during the current backup;

1100: the fingerprint was newly generated during the current backup.
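The index buffer lookup and the three tag values can be illustrated with a short Python sketch. A Python list per bucket stands in for the linked list, so the 32-bit next pointer is not modelled. The names `IndexBuffer`, `split_fingerprint`, `UNHIT`, `HIT` and `NEW` are introduced here for illustration and do not appear in the source.

```python
from collections import defaultdict

TAIL_BITS = 140                      # 160-bit fingerprint minus 20 bucket bits
UNHIT, HIT, NEW = 0b0000, 0b1000, 0b1100   # the three tag values


def split_fingerprint(fp: bytes):
    """Return (bucketNo, fingerprintTail) for a 20-byte fingerprint."""
    value = int.from_bytes(fp, "big")
    return value >> TAIL_BITS, value & ((1 << TAIL_BITS) - 1)


class IndexBuffer:
    def __init__(self):
        # bucketNo -> list of [tag, fingerprintTail, offset] entries
        self.buckets = defaultdict(list)

    def find(self, fp: bytes):
        """Step (1): pick the bucket. Step (2): scan its list for the tail."""
        bucket_no, tail = split_fingerprint(fp)
        for entry in self.buckets.get(bucket_no, ()):
            if entry[1] == tail:
                return entry
        return None

    def add(self, fp: bytes, tag: int, offset=None):
        """Create a new entry; offset stays None until the address is known."""
        bucket_no, tail = split_fingerprint(fp)
        entry = [tag, tail, offset]
        self.buckets[bucket_no].append(entry)
        return entry
```

Storing only the 140-bit tail saves 20 bits per entry without losing information, since the bucket number already fixes the leading bits.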

When a backup job Job_x(t_n-1) completes, all fingerprints it contains are saved as pairs <fingerprint, offset> (where fingerprint is the fingerprint of a chunk and offset is the storage address of that chunk in the disk log) in the file Job_x(t_n-1).FF, which is stored in the job record Job_x(t_n-1).Record in the directory database. Job_x(t_n-1).FF is used to initialize the index buffer of job Job_x(t_n). Because adjacent jobs in the same job chain usually share a large number of files or data, initializing the index buffer of Job_x(t_n) from Job_x(t_n-1).FF raises the fingerprint hit rate of the buffer.

5. Backup process

For convenience, the following notation is defined:

BS: backup server job worker thread;

BA: backup agent job worker thread;

SS: storage server job worker thread;

F: a file;

H: a fingerprint;

M(F): the metadata of file F;

R(F): the root block of file F;

H(D): the fingerprint of chunk D;

D(H): the data block or index block that corresponds to fingerprint H;

F.Index: the memory buffer in which the index tree of file F is built;

index cache: the index buffer;

chunk cache: the chunk buffer;

hash table: the chunk hash table;

Job_x(t_n).FileSet: the file set of job object Job_x(t_n);

I(F, level): the set of index blocks at level `level` of the index tree F.Index. The leaves of the index tree are defined as level 0, the parents of the leaves are level 1, and so on;

I_w(F, level): the working node in I(F, level), that is, the node currently used to store triples <H, offset, type>;

<H, offset, type>: a triple, where H is a fingerprint, offset is the storage address of chunk D(H) in the disk log, and type is the type of chunk D(H).

5.1 First backup process

The first backup process is carried out mainly by the backup agent job worker thread and the storage server job worker thread working together. Its steps are:

(1) SS: initialize the index cache from Job_x(t_n-1).FF;

(2) BA: if (Job_x(t_n).FileSet is empty) go to (19); else read a file F_i from Job_x(t_n).FileSet;

(3) BA: send M(F_i) to SS;

(4) SS: cache M(F_i) in the chunk cache;

(5) BA: perform anchor-based chunking on F_i;

(6) BA: compute the fingerprint of each chunk and send the resulting fingerprint set to SS;

(7) SS: if (the fingerprint set is empty) go to (17); else take a fingerprint H_j from the fingerprint set and look it up in the index cache;

(8) SS: if (H_j is found in the index cache) {

(9) SS: if (tag == 0000) { tag = 1000; cache <H_j, offset> in the chunk cache; }

(10) SS: else if (tag == 1000) cache <H_j, offset> in the chunk cache;

(11) SS: else if (tag == 1100) cache <H_j, null> in the chunk cache; }

(12) SS: else { cache H_j in the index cache with tag = 1100 and offset = null;

(13) SS: request D(H_j) from BA;

(14) BA: send D(H_j) to SS;

(15) SS: cache <H_j, D(H_j)> in the chunk cache; }

(16) SS: return to step (7);

(17) SS: notify BA to back up the next file;

(18) BA: return to step (2);

(19) BA: report the end status of job Job_x(t_n) to BS and SS, then exit;

(20) SS: on receiving the job end signal from BA, end the first backup process and move on to the second backup process;

(21) BS: on receiving the job end signal from BA, disconnect from BA and wait for SS to carry out the second backup process.
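Steps (7) to (16), the storage server's handling of one file's fingerprint set, can be sketched as below. This is a simplified model rather than the patented implementation: the index cache is a plain dict mapping a fingerprint to [tag, offset], the chunk cache is a list, `fetch_chunk` is a hypothetical callback standing in for the request to the backup agent in steps (13) and (14), and the record labels "ref", "pending" and "data" are names chosen here to tell <H, offset>, <H, null> and <H, D(H)> apart.

```python
UNHIT, HIT, NEW = 0b0000, 0b1000, 0b1100   # tag values 0000, 1000, 1100


def first_phase_file(fingerprints, index_cache, chunk_cache, fetch_chunk):
    """Process the fingerprint set of one file on the storage server side."""
    for fp in fingerprints:                       # step (7)
        entry = index_cache.get(fp)
        if entry is not None:                     # step (8): fingerprint known
            tag, offset = entry
            if tag == UNHIT:                      # step (9): first hit
                entry[0] = HIT
                chunk_cache.append(("ref", fp, offset))
            elif tag == HIT:                      # step (10)
                chunk_cache.append(("ref", fp, offset))
            else:                                 # step (11): seen earlier in this job
                chunk_cache.append(("pending", fp))
        else:                                     # step (12): new fingerprint
            index_cache[fp] = [NEW, None]
            data = fetch_chunk(fp)                # steps (13) and (14)
            chunk_cache.append(("data", fp, data))  # step (15)
```

In this model only a fingerprint that misses the index cache costs a chunk transfer; every other chunk is represented in the chunk cache by its fingerprint alone.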

5.1.1 Anchor-based file chunking

In step (5) of the first backup process, the backup agent job worker thread calls the backup agent's file chunking module to perform anchor-based chunking. The steps are:

(1) Take the first 48 bytes of the file, b₁, b₂, ..., b₄₈, as a window and compute the hash of this first window as H₁ = (b₁*p⁴⁷ + b₂*p⁴⁶ + ... + b₄₈) mod M, where p is a prime (for example 17) and M is a constant (for example 2³²). The hash is stored in variable H₁.

(2) Slide the window one byte toward the end of the file and compute the hash of the second window b₂, b₃, ..., b₄₉ as H₂ = (p*H₁ + b₄₉ - b₁*p⁴⁸) mod M, storing it in variable H₂.

(3) Continue in the same way to compute the hash of every window in the file.

(4) For the hash of each window, take its low 13 bits as a binary number. If this number equals a predetermined value (for example 61), the corresponding window is an anchor. The anchors are the boundaries that divide the file into data blocks of varying size.

Anchor-based chunking observes three conventions: a) if the file is smaller than 48 bytes, the anchor-based algorithm is not applied and the whole file is one data block; b) if a stretch of the byte stream contains too many anchors, some anchors are discarded so that no chunk is smaller than 2KB (the chunk at the end of the file is the only one that may be smaller than 2KB); c) if a run of 64KB of the byte stream contains no anchor, that 64KB is taken as one chunk.
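The steps and conventions above can be sketched in Python as follows, using the example values from the text (48-byte window, p = 17, M = 2³², low 13 bits equal to 61, 2KB minimum, 64KB maximum). Two details are assumptions of this sketch because the text leaves them open: a chunk is cut immediately after the last byte of the anchor window, and the rolling hash keeps running across chunk boundaries instead of restarting at each cut.

```python
WINDOW = 48
P = 17
M = 1 << 32
ANCHOR_MASK = (1 << 13) - 1       # low 13 bits of the window hash
ANCHOR_VALUE = 61
MIN_CHUNK = 2 * 1024
MAX_CHUNK = 64 * 1024
P_POW_WINDOW = pow(P, WINDOW, M)  # p^48 mod M, used to drop the oldest byte


def rolling_hashes(data: bytes):
    """Yield (end, hash) for every window data[end - 48:end]."""
    if len(data) < WINDOW:
        return
    h = 0
    for b in data[:WINDOW]:       # H1 by Horner's rule
        h = (h * P + b) % M
    yield WINDOW, h
    for end in range(WINDOW, len(data)):
        # H_next = (p*H + incoming byte - outgoing byte * p^48) mod M
        h = (h * P + data[end] - data[end - WINDOW] * P_POW_WINDOW) % M
        yield end + 1, h


def chunk_boundaries(data: bytes) -> list:
    """Return the exclusive end offset of every chunk of `data`."""
    n = len(data)
    if n < WINDOW:                # convention a): whole file is one chunk
        return [n] if n else []
    cuts, start = [], 0
    for end, h in rolling_hashes(data):
        size = end - start
        is_anchor = (h & ANCHOR_MASK) == ANCHOR_VALUE
        # convention b): ignore anchors that would give a chunk under 2KB
        # convention c): force a cut after 64KB without a usable anchor
        if (is_anchor and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            cuts.append(end)
            start = end
    if start < n:                 # trailing chunk, possibly under 2KB
        cuts.append(n)
    return cuts
```

Each window hash is derived from the previous one in constant time, so a file is chunked in a single pass over its bytes. With 13 bits compared, an anchor is expected roughly once every 8KB of random data, before the 2KB and 64KB limits are applied.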

Anchor-based file chunking as used in the invention has two properties. (1) It is stable under modification: a change to a file affects only the data blocks adjacent to the modified region, and the boundaries of the other data blocks do not move. When a file is backed up incrementally, only the few modified data blocks need to be backed up, and the remaining data blocks are shared with the earlier backup of the file. This stability also ensures that similar data within a file and across files is not missed because its position has shifted, so duplicate data is detected to the greatest possible extent. (2) The sliding window is cheap to compute: the hash of the next window is easily derived from the hash of the previous window, so anchor-based chunking has low computational overhead. The time complexity of the whole algorithm is O(n), where n is the number of bytes in the file.

Figure 8 shows how the chunks of a file change when the file is edited after chunking. It illustrates the stability of anchor-based chunking under modification: a change to a file affects only the data blocks adjacent to the modified region, and the boundaries of the other data blocks do not move. Row a shows a file divided by anchors into 8 blocks of varying size, B₁ to B₈; the serrated part at each block boundary is the 48-byte anchor. Rows b, c and d show the blocks after the first, second and third modification of the file, with the modified parts shaded. Row b: the first modification falls inside block B₄. No new block is created; B₄ simply becomes B₉, and the other blocks are unchanged. Backing up the file now only requires backing up B₉ to replace the original B₄. Row c: the second modification falls inside block B₅ and creates a new anchor, which splits B₅ into two blocks, B₁₀ and B₁₁; the other blocks are unchanged. Backing up the file now only requires backing up B₁₀ and B₁₁ to replace the original B₅. Row d: the third modification falls on the boundary between B₂ and B₃. The anchor between them is lost and the two blocks merge into one block, B₁₂. Backing up the file now only requires backing up B₁₂ to replace the original B₂ and B₃.

5.2 Second backup process

The second backup process is carried out mainly by the storage server job worker thread when the system is relatively idle. Its steps are:

(1) SS: if (Job_x(t_n).FileSet is empty) go to (19); else take a file name F_i from Job_x(t_n).FileSet;

(2) SS: create the memory buffer F_i.Index for file F_i, create R(F_i) in F_i.Index, then store M(F_i) from the chunk cache in R(F_i);

(3) SS: if (the chunk cache contains no tuple related to F_i) go to (14); else read one tuple related to F_i from the chunk cache;

(4) SS: if (it is <H_j, offset>) go to step (12);

(5) SS: else if (it is <H_j, D(H_j)>) {

(6) SS: look up H_j in the hash table;

(7) SS: if (found) write the offset value to the entry for H_j in the index cache and go to step (12);

(8) SS: else { append D(H_j) to the disk log and update the hash table;

(9) SS: write the offset value to the entry for H_j in the index cache and go to step (12); }}

(10) SS: else if (it is <H_j, null>)

(11) SS: read the offset value from the entry for H_j in the index cache;

(12) SS: insert(<H_j, offset, dc>, 0, F_i.Index);

(13) SS: return to step (3);

(14) SS: storeRemain(F_i.Index, R(F_i));

(15) SS: append R(F_i) to the disk log and update the hash table;

(16) SS: send R(F_i) to BS;

(17) BS: send R(F_i) to the directory database and store it in Job_x(t_n).Record;

(18) SS: return to step (1);

(19) SS: create the file Job_x(t_n).FF;

(20) SS: read the index cache and, for every entry that satisfies (tag == 1000 or tag == 1100), write <H, offset> to the file Job_x(t_n).FF;

(21) SS: send the file Job_x(t_n).FF to BS;

(22) BS: send the file Job_x(t_n).FF to the directory database and store it in Job_x(t_n).Record;

(23) SS: report the end status of job Job_x(t_n) to BS;

(24) BS: close the connection with SS, write the end status of job Job_x(t_n) to Job_x(t_n).Record in the directory database, and end the run of job Job_x(t_n).
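Steps (3) to (13) and step (20) of the second backup process can be sketched as below. The model is deliberately small and makes several assumptions: the chunk cache holds records labelled "ref" for <H, offset>, "data" for <H, D(H)> and "pending" for <H, null> (labels invented for this sketch), the index cache is a dict mapping a fingerprint to [tag, offset], the chunk hash table is a dict, and the disk log is a list whose positions serve as offsets. The function returns the leaf triples that step (12) would pass to insert().

```python
UNHIT, HIT, NEW = 0b0000, 0b1000, 0b1100   # tag values 0000, 1000, 1100


def second_phase_file(chunk_cache, index_cache, hash_table, disk_log):
    """Resolve every chunk of one file to a <H, offset, dc> triple."""
    triples = []
    for record in chunk_cache:                    # step (3)
        kind, fp = record[0], record[1]
        if kind == "ref":                         # step (4): offset already known
            offset = record[2]
        elif kind == "data":                      # step (5)
            if fp in hash_table:                  # steps (6) and (7): duplicate chunk
                offset = hash_table[fp]
            else:                                 # step (8): genuinely new chunk
                offset = len(disk_log)
                disk_log.append(record[2])
                hash_table[fp] = offset
            index_cache[fp][1] = offset           # steps (7) and (9)
        else:                                     # steps (10) and (11)
            offset = index_cache[fp][1]
        triples.append((fp, offset, "dc"))        # argument of step (12)
    return triples


def fingerprint_file(index_cache):
    """Step (20): keep only fingerprints hit or created by this job."""
    return [(fp, offset) for fp, (tag, offset) in index_cache.items()
            if tag in (HIT, NEW)]
```

A "pending" record can always be resolved in step (11) because the "data" record for the same fingerprint was cached earlier in the first backup process and is therefore processed first here.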

The two functions used in steps (12) and (14) of the above algorithm are as follows:

Algorithm of step (12)

insert(<H, offset, type>, level, F.Index)

{ // Store the triple <H, offset, type> in F.Index.

// level: the level, within the index tree F.Index, of the index node that stores the triple <H, offset, type>.

if (I_w(F, level) does not exist)

{ create I_w(F, level); store <H, offset, type> in I_w(F, level); return; }

else if (I_w(F, level) is not full)

{ store <H, offset, type> in I_w(F, level); return; }

else if (I_w(F, level) is full)

{ compute H(I_w(F, level));

look up H(I_w(F, level)) in the hash table;

if (not found)

append I_w(F, level) to the disk log and update the hash table;

insert(<H(I_w(F, level)), offset, ic>, level+1, F.Index);

create a new index node I_w(F, level);

store <H, offset, type> in I_w(F, level); return;

}}

Algorithm of step (14)

storeRemain(F.Index, R(F))

{ // Store the working index node of every level of F.Index in the disk log.

int level := 0;

loop: compute H(I_w(F, level));

look up H(I_w(F, level)) in the hash table;

if (not found)

append I_w(F, level) to the disk log and update the hash table;

if (|I(F, level)| = 1)

{ store <H(I_w(F, level)), offset, ic> in R(F); return; }

else

{ insert(<H(I_w(F, level)), offset, ic>, level+1, F.Index);

level := level+1; goto loop;

}}
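The two routines above can be sketched together in Python. The sketch rests on several assumptions: the class name `IndexTreeBuilder`, the `fanout` parameter (the text only bounds an index block at 16KB), the 30-byte serialisation of an index entry, SHA-1 as the fingerprint function, a list standing in for the disk log with list positions as offsets, and a dict standing in for the chunk hash table.

```python
import hashlib


class IndexTreeBuilder:
    """Builds F.Index bottom-up, in the spirit of insert() and storeRemain()."""

    def __init__(self, log: list, table: dict, fanout: int):
        self.log = log        # stand-in disk log; offset = position in the list
        self.table = table    # stand-in chunk hash table: fingerprint -> offset
        self.fanout = fanout  # max entries per index block (hypothetical)
        self.working = []     # working[level] plays the role of I_w(F, level)
        self.flushed = []     # index blocks already stored at each level

    def _store(self, node):
        """Append an index block to the log unless an identical one exists."""
        payload = b"".join(fp + off.to_bytes(8, "big") + kind.encode()
                           for fp, off, kind in node)
        fp = hashlib.sha1(payload).digest()
        if fp not in self.table:
            self.table[fp] = len(self.log)
            self.log.append(payload)
        return fp, self.table[fp]

    def insert(self, triple, level=0):
        if level == len(self.working):               # I_w(F, level) does not exist
            self.working.append([])
            self.flushed.append(0)
        if len(self.working[level]) == self.fanout:  # working node is full
            fp, off = self._store(self.working[level])
            self.flushed[level] += 1
            self.insert((fp, off, "ic"), level + 1)
            self.working[level] = []
        self.working[level].append(triple)

    def store_remain(self):
        """Flush the working node of every level; return the root's entry."""
        if not self.working:
            return None                              # file without any chunk
        level = 0
        while True:
            fp, off = self._store(self.working[level])
            if self.flushed[level] == 0:             # |I(F, level)| = 1: the root
                return (fp, off, "ic")
            self.insert((fp, off, "ic"), level + 1)
            level += 1
```

The value returned by `store_remain` corresponds to the index entry that the text stores in the root block R(F). Because index blocks are themselves addressed by fingerprint, rebuilding the tree for an unchanged file appends nothing to the log, which is how files come to share index blocks as in Figure 6.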

Claims

1. A fingerprint-based data backup system, comprising a backup server, a backup agent, a storage server and a Web server, which communicate with one another over a network to perform data backup and recovery, characterized in that:

the backup server is provided with a configuration file and a catalog database; the configuration file of the backup server records user-defined job objects, each job object comprising attributes that specify a job for the system to run, and the backup server controls the whole process of data backup and recovery through the job objects; the catalog database stores job records, which hold the management information of job object runs;

the backup agent is installed on every host in the network whose data needs to be backed up; during backup, the backup agent reads the files to be backed up from the file system of its host, performs anchor-based chunking on each file, computes the fingerprint of each chunk, and sends the fingerprints and those chunks that are needed to the storage server over the network; during recovery, the backup agent receives file data from the storage server over the network and writes it under a specified directory in the file system of its host; the backup agent performs anchor-based chunking of a file by the following steps:

(1) taking the first 48 bytes of the file, b₁, b₂, ..., b₄₈, as a window, and computing the hash of the first window of the file as H₁ = (b₁*p⁴⁷ + b₂*p⁴⁶ + ... + b₄₈) mod M, where p is 17 and M is 2³², the hash being stored in variable H₁;

(2) sliding the window one byte toward the end of the file, and computing the hash of the second window b₂, b₃, ..., b₄₉ of the file as H₂ = (p*H₁ + b₄₉ - b₁*p⁴⁸) mod M, the hash being stored in variable H₂;

(3) continuing in the same way to compute the hashes of all windows of the file;

(4) for the hash of each window, taking its low 13 bits as a binary number and, if this number equals 61, determining that the corresponding window is an anchor, the anchors being the boundaries that divide the file into data blocks of varying size;

the anchor-based file chunking observing the following three conventions: a) if the file is smaller than 48 bytes, the anchor-based file chunking algorithm is not applied and the whole file is one data block; b) if a stretch of the byte stream contains too many anchors, some anchors are discarded so that the smallest chunk is not smaller than 2KB, the chunk at the end of the file being the only chunk that may be smaller than 2KB; c) if a run of 64KB of the byte stream contains no anchor, that 64KB is taken as one chunk;

the storage server is provided with a large-capacity disk array, which is the destination of data backup; during backup, the storage server receives fingerprints or data chunks from the corresponding backup agent over the network, stores the data chunks on disk and builds the index of each file; during recovery, it reconstructs a file from the large-capacity disk array according to the file index and sends the file data to the corresponding backup agent over the network;

the Web server provides the B-S mode web user management interface of the system; by logging in to the Web server, a user can direct the system to perform interactive backup or recovery jobs, monitor the running of the system's automatically scheduled jobs, modify the configuration file of the backup server, customize job objects and manage devices.

2. the data backup system based on fingerprint as claimed in claim 1 is characterized in that, described backup server comprises backup server initialization module, order monitoring module, command processing module, operation processing module and network communication module;

Described backup server initialization module is carried out initial work, comprises reading configuration file, set up resource chained list in the internal memory, check catalog data base state, the data consistency that guarantees configuration file and catalog data base and integrality, startup command policing port, accepting user command, initialization job queue and user command formation, load operations object, initiating task and network monitoring service in job queue from Web server;

It is a network monitoring thread that is generated by system that module is monitored in described order, connection request to Web server authenticates, assurance has only the Web server ability connected system through system authorization, monitors the command request of having sent by the Web server that authenticates; Receive orders when asking, command request is joined in the user command formation wait for system handles;

Described command processing module comprises a user command formation and N command job thread, and when the user command formation was overflowed, order was monitored module and changed sleep state over to; Constantly reading order and the execution from the user command formation of these command job threads finished different functions according to the difference of performed order; When order is monitored module adds an order in the user command formation,, just generate a new command job thread if when the number of current command job thread that does not have a free time and active command job thread does not reach N; The command job thread all checks from the user command formation that at every turn order monitors the state of module during reading order, if it is in sleep state then wakes it up;

The job processing module comprises a job queue, L job worker threads and a job queue loading thread; when the job queue overflows, the job queue loading thread goes to sleep; the job worker threads continually take job objects from the job queue and execute them, invoking different resources and performing different functions according to the attributes of the job object; the job queue loading thread performs job scheduling: it checks the scheduling policy attribute of each job object in the job resource chain and adds the job objects that are due to be scheduled to the job queue, and a new job worker thread is created if there is currently no idle job worker thread and the number of active job worker threads has not reached L; each time a job worker thread reads a job object from the job queue, it checks the state of the job queue loading thread and wakes it up if it is asleep;

The network communication module encapsulates the standard network communication application programming interface and provides a network communication interface to the command worker threads and the job worker threads; the network communication interface implements the data transfer protocol among the backup server, the backup agent and the storage server.
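
The queue-and-worker pattern recited in this claim (a bounded queue, worker threads created on demand up to a limit N, and a producer that sleeps while the queue is full) can be illustrated with a minimal Python sketch. The class name CommandPool, its methods and the capacity value are hypothetical and are not taken from the patent. Python's queue.Queue is used as a stand-in for the explicit sleep and wake-up steps: put() blocks while the queue is full, and each get() by a worker releases a blocked producer.

```python
import queue
import threading


class CommandPool:
    """Bounded command queue served by at most max_workers lazily created threads."""

    def __init__(self, max_workers, capacity):
        self.max_workers = max_workers                 # "N" in the claim
        self.commands = queue.Queue(maxsize=capacity)  # the user command queue
        self.lock = threading.Lock()
        self.idle = 0       # workers currently waiting for a command
        self.workers = 0    # worker threads created so far
        self.results = []

    def submit(self, command):
        """Called by the listener; blocks (the "sleep state") while the queue is full."""
        self.commands.put(command)
        with self.lock:
            # Create a worker only if nobody is idle and the limit is not reached.
            if self.idle == 0 and self.workers < self.max_workers:
                self.workers += 1
                threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        while True:
            with self.lock:
                self.idle += 1
            command = self.commands.get()  # frees a slot and wakes a blocked submit()
            with self.lock:
                self.idle -= 1
            result = command()
            with self.lock:
                self.results.append(result)
            self.commands.task_done()

    def wait(self):
        """Block until every submitted command has been executed."""
        self.commands.join()


pool = CommandPool(max_workers=2, capacity=4)
for i in range(10):
    pool.submit(lambda i=i: i * i)
pool.wait()
print(sorted(pool.results), "workers created:", pool.workers)
```

The same structure would apply to the job queue with L job worker threads, with the job queue loading thread taking the role of the producer.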

3. The fingerprint-based data backup system as claimed in claim 1, characterized in that the backup agent comprises a backup agent initialization module, a request listening module, a job processing module, a file chunking module and a network communication module;

The backup agent initialization module performs initialization, which comprises reading the backup agent configuration file, building the in-memory resource linked lists, initializing the job queue, and starting the backup server request listening module;

The request listening module listens on the network for connection requests from the backup server and authenticates the connecting backup server; after authentication succeeds, it creates a network connection socket for communicating with that backup server and adds the socket to the job queue;

The job processing module comprises a job queue and M job worker threads; when the job queue overflows, the request listening module goes to sleep; after taking a network connection socket from the job queue, a job worker thread first creates a job control record for the job and links the network connection socket into a member variable of the job control record, then interacts with the backup server over this socket, converting the relevant attributes of the backup server job object and assigning them to the corresponding member variables of the job control record; it then uses the job ticket obtained from the backup server to connect to the corresponding storage server, creating a network connection socket for communicating with the storage server and linking it into a member variable of the job control record; when the request listening module adds a network connection socket to the job queue, a new job worker thread is created if there is currently no idle job worker thread and the number of active job worker threads has not reached M; each time a job worker thread takes a network connection socket from the job queue, it checks the state of the request listening module and wakes it up if it is asleep;

The file chunking module accepts commands from the job worker threads of the job processing module to perform the file chunking task of a backup job: it opens each file of the file set on the client file system, performs anchor-based chunking of the file, computes the block fingerprints, and cooperates with the corresponding storage server to carry out the backup algorithm of the first backup procedure;

The network communication module consists of the network connection sockets of the jobs; each job of the backup agent has two network connection sockets, used respectively to communicate with the backup server job and the storage server job corresponding to that job.
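
The anchor-based chunking and fingerprinting performed by the file chunking module can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the polynomial rolling hash, the window size, the anchor mask, the minimum and maximum block sizes, and the use of SHA-1 as the fingerprint function are all assumed values chosen for the example, and the names chunk and fingerprints are hypothetical.

```python
import hashlib
import random

WINDOW = 16                       # bytes in the sliding window (assumed)
MASK = (1 << 6) - 1               # anchor when the low 6 bits of the hash are zero (assumed)
MIN_BLOCK, MAX_BLOCK = 32, 256    # block size limits in bytes (assumed)
BASE, MOD = 257, (1 << 31) - 1    # parameters of the polynomial rolling hash (assumed)
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunk(data: bytes):
    """Split data into blocks whose boundaries (anchors) depend on local content."""
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        if (size >= MIN_BLOCK and (h & MASK) == 0) or size >= MAX_BLOCK:
            blocks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])   # trailing block cut by end of file
    return blocks


def fingerprints(blocks):
    """Fingerprint each block; SHA-1 is an assumption made for this sketch."""
    return [hashlib.sha1(b).hexdigest() for b in blocks]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4000))
edited = b"XYZ" + original            # a small insertion at the front of the file
shared = set(fingerprints(chunk(original))) & set(fingerprints(chunk(edited)))
print(len(chunk(original)), "blocks,", len(shared), "fingerprints shared after the edit")
```

Because an anchor depends only on the bytes inside the window, an insertion typically changes only the blocks near the edit, and the later blocks keep their fingerprints, so only the changed blocks would need to be sent to the storage server.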

4. The fingerprint-based data backup system as claimed in claim 1, characterized in that the storage server comprises a storage server initialization module, a connection listening module, a job ticket table, a job processing module and a network communication module, as well as an index buffer, a block buffer, a block hash table and a disk log;

The storage server initialization module performs initialization, which comprises parsing the storage server configuration file, building the in-memory resource linked lists, and starting the related service threads;

The connection listening module listens for connection requests from the backup server and the backup agent; it authenticates a connecting backup server and, after authentication succeeds, creates a network connection socket for communicating with that backup server and adds the socket to the job queue; for a connecting backup agent, it authenticates the agent by checking the job ticket that the agent presents against the job ticket table and, after authentication succeeds, creates a network connection socket for communicating with that backup agent and links it into a member variable of the corresponding job control record;

The job ticket table stores the tickets with which jobs authenticate backup agents;

The job processing module comprises a job queue and W job worker threads; when the job queue overflows, the connection listening module enters the "refuse backup server connection requests" state; after taking a network connection socket from the job queue, a job worker thread first creates a job control record for the job and links the network connection socket into a member variable of the job control record, then interacts with the backup server over this socket, converting the relevant attributes of the backup server job object and assigning them to the corresponding member variables of the job control record, and generates a random job ticket, registers it in the job ticket table and sends it to the backup server job object; when the connection listening module adds a network connection socket to the job queue, a new job worker thread is created if there is currently no idle job worker thread and the number of active job worker threads has not reached W; each time a job worker thread takes a network connection socket from the job queue, it checks the state of the connection listening module and, if the module is in the "refuse backup server connection requests" state, cancels that state so that the module accepts backup server connection requests again;

The network communication module consists of the network connection sockets of the jobs; each job of the storage server has two network connection sockets, used respectively to communicate with the backup server job and the backup agent job corresponding to that job;

The index buffer is infrastructure with which a storage server job carries out the first backup procedure and the second backup procedure; it is implemented as an in-memory hash table and stores all fingerprints contained in the job instance Job_x(t_{n-1}) that precedes the current job instance Job_x(t_n) in the job chain, together with the fingerprints newly generated while the current job runs;

The block buffer is infrastructure with which a storage server job carries out the first backup procedure and the second backup procedure; it is implemented on an independent disk array and temporarily stores the data blocks whose fingerprints were not found in the index buffer during the first backup procedure;

The block hash table is infrastructure with which a storage server job carries out the second backup procedure; it is implemented on an independent disk array and maps each block fingerprint to the storage address of that block in the disk log;

The disk log is infrastructure with which a storage server job carries out the second backup procedure; it is implemented on an independent disk array and stores the data blocks and the file indexes, the latter also being stored in block form.
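
How the four storage structures of this claim might interact can be sketched in Python. The sketch rests on several assumptions that the claim does not state: in-memory containers stand in for the independent disk arrays, SHA-1 stands in for the fingerprint function, the second backup procedure is assumed to append to the disk log only those buffered blocks whose fingerprints are absent from the block hash table, and the file index is returned to the caller rather than written to the log. The names StorageServer, backup and restore are hypothetical.

```python
import hashlib


class StorageServer:
    """Toy model of the block hash table and the append-only disk log."""

    def __init__(self):
        self.block_hash_table = {}    # fingerprint -> (offset, length) in the disk log
        self.disk_log = bytearray()   # blocks are only appended, never erased

    def backup(self, blocks, previous_index=()):
        """Run both backup procedures for one job instance and return its file index."""
        # Index buffer: fingerprints of the previous job instance, plus new ones.
        index_buffer = set(previous_index)
        block_buffer = []             # blocks whose fingerprints were not found
        file_index = []

        # First backup procedure: filter blocks against the index buffer.
        for block in blocks:
            fp = hashlib.sha1(block).digest()
            file_index.append(fp)
            if fp not in index_buffer:
                index_buffer.add(fp)
                block_buffer.append((fp, block))

        # Second backup procedure (assumed): store buffered blocks that are new
        # to the whole system and record their addresses in the block hash table.
        for fp, block in block_buffer:
            if fp not in self.block_hash_table:
                self.block_hash_table[fp] = (len(self.disk_log), len(block))
                self.disk_log += block
        return file_index

    def restore(self, file_index):
        """Rebuild a file from its index by looking up each block in the disk log."""
        out = bytearray()
        for fp in file_index:
            offset, length = self.block_hash_table[fp]
            out += self.disk_log[offset:offset + length]
        return bytes(out)


server = StorageServer()
index1 = server.backup([b"alpha", b"beta", b"alpha"])
index2 = server.backup([b"alpha", b"gamma"], previous_index=index1)
print(len(server.disk_log), server.restore(index2))
```

In this toy run the second job instance stores only the block b"gamma", because the fingerprint of b"alpha" is already present in the index buffer loaded from the previous job instance.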