CN103412929A - Mass data storage method

Info

Publication number: CN103412929A
Application number: CN2013103596144A
Authority: CN
Inventors: 柯宗贵; 柯宗庆; 杨育斌; 曹兴财
Original assignee: Bluedon Information Security Technologies Co Ltd
Current assignee: Bluedon Information Security Technologies Co Ltd
Priority date: 2013-08-16
Filing date: 2013-08-16
Publication date: 2013-11-27

Abstract

The invention discloses a mass data storage method comprising the following steps: when a user requests to submit a file, the server determines whether the file already exists in the storage space, using the received md5 value as an index value; if the same md5 value exists, the server goes on to compare the md5 values of the file's slices; if an identical file exists, the server updates the data block record; if a slice differs from the md5 value recorded for the data block, the source file slice and its md5 value are uploaded, and the server updates the information recorded for that data block. The method reduces the inefficiency of verifying and uploading duplicate data in mass data storage, adjusts copies dynamically, and improves the allocation of storage space.

Description

A storage method for mass data

Technical field

The present invention relates to the technical field of data storage, and in particular to a storage method for mass data.

Background art

In a mass data storage system, a large amount of duplicate data not only increases cost but also reduces retrieval efficiency; deleting duplicate data, and thereby reducing the storage space used, is a problem that urgently needs solving. Keeping multiple copies guarantees the reliability of the system: when a single node fails, copies on other nodes can continue to provide service and keep the system running normally. However, increasing the number of copies raises the cost of keeping the copies consistent, and synchronizing multiple copies also consumes more bandwidth. Copy placement should therefore be planned sensibly while taking data reliability into account.

In the prior art, open-source deduplication technology on Linux divides a file into several small blocks, first compares a simple checksum, and then compares md5 values.

HDFS adopts a full backup policy: by default it creates 3 backup copies of each file and places the 3 copies in dispersed locations, which prevents a possible single point of failure.

In mass data, however, identifying a file by its file name is not very reliable: the system may contain files with different names but identical content. Yet if a large file is divided into many small data blocks that are compared one by one, the computation takes too long. In addition, the files in a system differ from one another and are accessed with different frequencies, so if every file uses the same backup policy, storage space cannot be used efficiently.

Summary of the invention

The object of the invention is to overcome the defects of the prior art by providing a storage method for mass data. The specific flow of the method is:

When a user requests to submit a file, the server uses the received md5 value as an index value to determine whether the file exists. If the same md5 value exists, the server goes on to compare the md5 values of the file's slices; if an identical file already exists, it updates the data block record. If a slice differs from the md5 value in the data block record, the source file slice and its md5 value are uploaded, and the server updates the information in the data block record.

In the above method, the strategy for determining whether a file already exists in the storage space is as follows:

Let size be the size of the source file in bytes, and let the given constant m_size be the decision threshold. When size is less than or equal to m_size, the md5 of the whole file is computed, and the md5 value and the source file length are then passed to the server. If size is greater than m_size, the constant m_length is used as the computation length, and an md5 value is computed for each of three parts of the source file: the head, the middle, and the tail. These three md5 values are concatenated with the source file length to form a string, the md5 value of that string is computed, and this md5 value and the source file length are sent to the server.

Beneficial effects of the technical solution of the invention:

The invention not only improves the efficiency of verifying and uploading duplicate data in mass data storage, but also adjusts copies dynamically and improves the allocation of storage space.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the method of the invention;

Fig. 2 is a flowchart of the file determination strategy of the invention;

Fig. 3 is a structure diagram of the storage nodes that record a complete file in the invention;

Fig. 4 is a flowchart of file retrieval in the invention;

Fig. 5 is a flowchart of multi-copy eviction in the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The invention provides a storage method for mass data. It addresses two problems: first, improving the efficiency of duplicate-data verification when a file is submitted; second, adjusting copies dynamically so that hard disk space is used sensibly.

The flow of the method is shown in Fig. 1 and is as follows:

S1: When a user requests to submit a file, the server uses the received md5 value as an index value to determine whether the file exists. If the same md5 value exists, the server goes on to compare the md5 values of the file's slices; if an identical file already exists, it updates the data block record.

S2: If a slice differs from the md5 value in the data block record, the source file slice and its md5 value are uploaded, and the server updates the information in the data block record.
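By way of illustration only, steps S1 and S2 might be sketched as follows. The class `DedupServer`, its `submit` method, the in-memory dictionaries, and the use of a per-slice reference count as the "data block record" are all assumptions made for this sketch; the source does not specify them. The sketch also collapses the exchange into one call, whereas a real client would send the md5 values first and upload only the slices the server lacks.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


class DedupServer:
    """Hypothetical in-memory stand-in for the server index and block store."""

    def __init__(self):
        # file-level md5 -> list of stored variants (each a list of slice md5s)
        self.files = {}
        # slice md5 -> {"data": bytes, "refs": int}
        self.blocks = {}

    def submit(self, file_md5, slices):
        """Handle one submission; return how many slices had to be uploaded."""
        slice_md5s = [md5_hex(s) for s in slices]

        # S1: the file-level md5 is the index; if it is known, compare slice md5s.
        if slice_md5s in self.files.get(file_md5, []):
            # Identical file already stored: only the records are updated.
            for h in slice_md5s:
                self.blocks[h]["refs"] += 1
            return 0

        # S2: store the slices whose md5 is not recorded, then update the records.
        uploaded = 0
        for h, data in zip(slice_md5s, slices):
            if h in self.blocks:
                self.blocks[h]["refs"] += 1
            else:
                self.blocks[h] = {"data": data, "refs": 1}
                uploaded += 1
        self.files.setdefault(file_md5, []).append(slice_md5s)
        return uploaded
```

In this sketch, a second submission of the same file transfers no slice data, and a different file that shares a slice with a stored file transfers only its new slices.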

In the above method, the strategy for determining whether a file already exists in the storage space is shown in Fig. 2 and is as follows:

Let size be the size of the source file in bytes, and let the given constant m_size be the decision threshold. When size is less than or equal to m_size, the md5 of the whole file is computed, and the md5 value and the source file length are then passed to the server. If size is greater than m_size, the constant m_length is used as the computation length, and an md5 value is computed for each of three parts of the source file: the head, the middle, and the tail. These three md5 values are concatenated with the source file length to form a string, the md5 value of that string is computed, and this md5 value and the source file length are sent to the server.
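The determination strategy can be sketched as below. The names `m_size` and `m_length` come from the text; the function name `file_fingerprint`, the default values of the two constants, the exact offset chosen for the "middle" part, and the way the three digests and the length are joined into a string are assumptions of this sketch, since the source does not fix them.

```python
import hashlib
import os

M_SIZE = 4 * 1024 * 1024   # assumed threshold; the source gives no value
M_LENGTH = 64 * 1024       # assumed sample length; the source gives no value


def file_fingerprint(path, m_size=M_SIZE, m_length=M_LENGTH):
    """Return (md5 value, file length) as the client would send them."""
    if m_length > m_size:
        raise ValueError("m_length must not exceed m_size")
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= m_size:
            # Small file: md5 over the whole content.
            digest = hashlib.md5(f.read()).hexdigest()
        else:
            # Large file: md5 over head, middle, and tail samples only.
            offsets = (0, (size - m_length) // 2, size - m_length)
            parts = []
            for offset in offsets:
                f.seek(offset)
                parts.append(hashlib.md5(f.read(m_length)).hexdigest())
            # Join the three digests with the length, then hash the string.
            joined = "".join(parts) + str(size)
            digest = hashlib.md5(joined.encode("ascii")).hexdigest()
    return digest, size
```

Because only three samples of a large file are hashed, two large files of equal length that differ solely outside the sampled regions produce the same value here; this is consistent with the method going on to compare slice md5 values when the index value matches.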

Fig. 3 shows the structure of the storage nodes that record a complete file:

Head nodes such as SID1 and SID2 store information such as the md5 value and the file name, and serve to index the file. The nodes linked after a head node are the slice pointers of the file and serve to index the slices; together these nodes make up the complete file.
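One possible reading of this node structure is a head node followed by a singly linked chain of slice nodes. The sketch below assumes that reading; the class names `HeadNode` and `SliceNode` and all field names are hypothetical, and the slice pointer is represented simply by the slice's md5 value.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class SliceNode:
    """Points at one stored slice (here identified by its md5 value)."""
    slice_md5: str
    next: Optional["SliceNode"] = None


@dataclass
class HeadNode:
    """Head node such as SID1: indexes the file by md5 value and file name."""
    sid: str
    file_md5: str
    filename: str
    first: Optional[SliceNode] = None

    def append(self, slice_md5: str) -> None:
        """Link a new slice node at the end of the chain."""
        node = SliceNode(slice_md5)
        if self.first is None:
            self.first = node
            return
        current = self.first
        while current.next is not None:
            current = current.next
        current.next = node

    def slices(self) -> Iterator[str]:
        """Yield the slice md5 values in file order."""
        current = self.first
        while current is not None:
            yield current.slice_md5
            current = current.next
```

Walking the chain from the head node yields the slices in order, which is what allows the complete file to be rebuilt from its slices.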

In a specific implementation of the invention, if the user asks to retrieve a file after storing it, the flow is as shown in Fig. 4 and is as follows:

First determine whether the file exists; if it does not, respond that the file does not exist.

If the file to be retrieved exists, the client receives the data blocks sent by the server. If the client wants to delete the file, the server also decrements the reference count of the block storage by 1; if the reference count is 0, all information about the whole file is deleted. The client merges the data blocks sent by the server in sequence-number order to obtain the original file.
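The retrieval and deletion flow might look like the following sketch. `FileStore`, `reassemble`, and their signatures are hypothetical names introduced for illustration. For brevity the reference count is kept once per stored file, whereas the source attaches it to block storage; that simplification is an assumption of this sketch.

```python
class FileStore:
    """Hypothetical server-side store keyed by the file's md5 value."""

    def __init__(self):
        # file md5 -> {"blocks": [bytes, ...], "refs": int}
        self.records = {}

    def put(self, file_md5, blocks):
        """Store a file, or add one reference if it is already stored."""
        record = self.records.get(file_md5)
        if record is None:
            self.records[file_md5] = {"blocks": list(blocks), "refs": 1}
        else:
            record["refs"] += 1

    def fetch(self, file_md5):
        """Return numbered blocks, or None when the file does not exist."""
        record = self.records.get(file_md5)
        if record is None:
            return None
        return list(enumerate(record["blocks"]))

    def delete(self, file_md5):
        """Drop one reference; remove everything once the count reaches 0."""
        record = self.records.get(file_md5)
        if record is None:
            return False
        record["refs"] -= 1
        if record["refs"] == 0:
            del self.records[file_md5]
        return True


def reassemble(numbered_blocks):
    """Client side: merge blocks in sequence-number order."""
    ordered = sorted(numbered_blocks, key=lambda pair: pair[0])
    return b"".join(data for _, data in ordered)
```

Numbering the blocks lets the client rebuild the file even if the blocks arrive out of order, and the reference count keeps shared data alive until the last owner deletes it.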

In a specific implementation, some expired data must be evicted in order to use storage space sensibly. The flow is shown in Fig. 5 and is as follows:

When storing data, the server uses two-tier storage and prefers the SAS hard disk. Following the LRU algorithm, expired data is first evicted to the SATA hard disk; if the space freed is still not enough to hold the data to be stored, data that has been accessed only a few times within a certain period is also evicted to the SATA hard disk. If the SATA hard disk runs short of space, the data on the SATA hard disk is compared with the data evicted from the SAS hard disk, and expired data is evicted first; if the free space is still not enough to hold the data to be stored after the expired data has been deleted, data with fewer accesses within a certain period is evicted, and a message is sent to the log center warning that storage space is insufficient.
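A rough sketch of this two-tier eviction follows. Everything beyond the SAS/SATA ordering is an assumption: "expired" is modeled as "not accessed for longer than `ttl`", "fewer accesses" as a per-entry hit counter, the log center as a plain list, and the names `Tier`, `make_room`, and `store` are hypothetical.

```python
class Tier:
    """One storage tier (SAS or SATA) with a fixed capacity in bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}   # key -> {"size": int, "last": float, "hits": int}

    def free(self):
        return self.capacity - sum(i["size"] for i in self.items.values())


def is_expired(item, now, ttl):
    return now - item["last"] > ttl


def make_room(tier, needed, now, ttl):
    """Evict until `needed` bytes are free: expired entries first
    (least recently used first), then live entries with the fewest hits."""
    def rank(key):
        item = tier.items[key]
        if is_expired(item, now, ttl):
            return (0, item["last"])
        return (1, item["hits"])

    evicted = []
    for key in sorted(tier.items, key=rank):
        if tier.free() >= needed:
            break
        evicted.append((key, tier.items.pop(key)))
    return evicted


def store(sas, sata, key, size, now, ttl, log):
    """Store new data on SAS, demoting evicted SAS entries to SATA."""
    if size > sas.capacity:
        raise ValueError("object larger than the SAS tier")
    for old_key, item in make_room(sas, size, now, ttl):
        dropped = make_room(sata, item["size"], now, ttl)
        # Warn when live (non-expired) data had to be discarded from SATA.
        if any(not is_expired(d, now, ttl) for _, d in dropped):
            log.append("warning: storage space insufficient")
        if sata.free() >= item["size"]:
            sata.items[old_key] = item
        else:
            log.append("warning: storage space insufficient")
    sas.items[key] = {"size": size, "last": now, "hits": 1}
```

The warning is raised only when expired data alone could not free enough space, which mirrors the last step of the flow described above.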

The storage method for mass data provided by the embodiments of the invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, this description should not be construed as limiting the invention.

Claims

1. A storage method for mass data, characterized in that the specific flow of the method is:

When a user requests to submit a file, the server uses the received md5 value as an index value to determine whether the file exists; if the same md5 value exists, the server goes on to compare the md5 values of the file's slices; if an identical file already exists, it updates the data block record;

If a slice differs from the md5 value in the data block record, the source file slice and its md5 value are uploaded, and the server updates the information in the data block record.

2. The method according to claim 1, characterized in that, in the method, the strategy for determining whether a file already exists in the storage space is as follows:

3. The method according to claim 2, characterized in that the md5 value and the file name are stored in a head node, which serves to index the file; the nodes linked after it are the slice pointers of the file and serve to index the slices; together these nodes make up the complete file.

4. The method according to claim 1, characterized in that, when the user asks to retrieve a file after storing it, the specific flow is:

First determine whether the file exists; if it does not, respond that the file does not exist;

If the file to be retrieved exists, the client receives the data blocks sent by the server; if the client wants to delete the file, the server also decrements the reference count of the block storage by 1, and if the reference count is 0, all information about the whole file is deleted; the client merges the data blocks sent by the server in sequence-number order to obtain the original file.

5. The method according to claim 1, characterized in that, during data storage, some expired data must be evicted in order to use storage space sensibly, the specific flow being:

When storing data, the server uses two-tier storage and prefers the SAS hard disk; following the LRU algorithm, expired data is first evicted to the SATA hard disk; if the space freed is still not enough to hold the data to be stored, data that has been accessed only a few times within a certain period is also evicted to the SATA hard disk; if the SATA hard disk runs short of space, the data on the SATA hard disk is compared with the data evicted from the SAS hard disk, and expired data is evicted first; if the free space is still not enough to hold the data to be stored after the expired data has been deleted, data with fewer accesses within a certain period is evicted, and a message is sent to the log center warning that storage space is insufficient.