Summary of the invention
The object of the present invention is to provide a data storage method for a distributed storage system. The method divides a file to be stored into blocks and identifies redundant data by comparing the resulting file blocks, which raises the probability of detecting partially duplicated data within a file and enables precise data deduplication. A further object of the present invention is to provide a data storage apparatus, device and readable storage medium for a distributed storage system that have the same beneficial effects.
To solve the above technical problem, the present invention provides a data storage method for a distributed storage system, comprising:
dividing a file to be stored into blocks to obtain several file blocks to be stored;
comparing the content of each file block to be stored with the pre-stored file blocks, and judging whether the system contains a file block whose content matches the file block to be stored;
if so, obtaining the data storage location of the content-matching file block in the system;
establishing a data index for the matched file block to be stored according to the data storage location.
Preferably, comparing the content of the file block to be stored with the pre-stored file blocks and judging whether the system contains a file block whose content matches the file block to be stored comprises:
calculating the hash value of the file block to be stored to obtain a hash value to be stored;
comparing the hash value to be stored with the file block hash values in an index table, and judging whether the index table contains a file block hash value identical to the hash value to be stored; wherein the index table stores the file block hash values of the files already stored in the system and the corresponding file block storage locations.
Preferably, the data storage method for a distributed storage system further comprises:
if the index table contains no file block hash value identical to the hash value to be stored, storing the data to be stored, and adding the hash value to be stored and the corresponding data storage location to the index table.
Preferably, the data storage method for a distributed storage system further comprises:
publishing the updated index table to each node in the system in real time.
Preferably, calculating the hash value of the file block to be stored comprises:
calculating the hash value of the file block to be stored by the SHA-1 hash function.
Preferably, the method for generating the index table comprises:
judging whether pre-stored files exist in the system;
if so, calculating the file block hash values of the stored files according to the file block occupancy of the stored files;
generating the index table from each calculated file block hash value of the stored files and the corresponding data storage location.
Preferably, the data storage method for a distributed storage system further comprises:
comparing the file block hash values in the index table pairwise, and judging whether the index table contains file blocks with identical hash values;
if so, determining a retained file block and a non-retained file block;
replacing the stored data of the non-retained file block with the data storage location of the retained file block.
The present invention discloses a data storage apparatus for a distributed storage system, comprising:
a blocking unit, configured to divide a file to be stored into blocks to obtain several file blocks to be stored;
a comparing unit, configured to compare the content of each file block to be stored with the pre-stored file blocks, and to judge whether the system contains a file block whose content matches the file block to be stored;
a data information acquiring unit, configured to obtain, if so, the data storage location of the content-matching file block in the system;
an index establishing unit, configured to establish a data index for the matched file block to be stored according to the data storage location.
The present invention discloses a data storage device for a distributed storage system, comprising:
a memory, configured to store a program;
a processor, configured to implement the steps of the data storage method for a distributed storage system when executing the program.
The present invention discloses a readable storage medium on which a program is stored, the program implementing the steps of the data storage method for a distributed storage system when executed by a processor.
The data storage method for a distributed storage system provided by the present invention divides the file to be stored into blocks, analyzes the data at the level of data blocks, and compares the content of each file block to be stored with the pre-stored file blocks. This raises the probability of detecting partially duplicated data within a file and greatly increases the number of redundant file blocks that can be found. Because redundant data is identified precisely by comparing the divided file blocks, the probability of detecting redundant data is much higher than when the whole file is analyzed as a unit. If the system contains a file block whose content matches a file block to be stored, the content of the block currently being stored already exists in the system, that is, the file block to be stored is a redundant file block. A data index is established for the redundant file block according to the data storage location of the content-matching file block in the system; the data of the redundant file block is not stored, and the current file block is pointed at the pre-stored matching data. This meets the storage requirement of the file to be stored while greatly reducing the storage space occupied by redundant data.
In addition, another embodiment of the present invention discloses the technical feature of comparing file block content by means of file block hash values. The hash value of a file block represents the distinguishing characteristics of the block content with a simple characteristic value, which not only greatly reduces the system resources and load consumed in comparing file content but also improves the efficiency of data deduplication, achieving efficient deduplication.
The present invention also provides a data storage apparatus, device and readable storage medium for a distributed storage system, which have the above beneficial effects and are not described again here.
Specific embodiment
The core of the present invention is to provide a data storage method for a distributed storage system. The method divides a file to be stored into blocks and identifies redundant data by comparing the resulting file blocks, which raises the probability of detecting partially duplicated data within a file and enables precise data deduplication. Another core of the present invention is to provide a data storage apparatus, device and readable storage medium for a distributed storage system.
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a flowchart of the data storage method for a distributed storage system provided in this embodiment. The method may mainly include:
Step s110: divide the file to be stored into blocks to obtain several file blocks to be stored.
After the client receives a data storage request, the system first divides the file to be stored into blocks. This embodiment does not limit the blocking method: a unified blocking rule may be specified so that all files to be stored are divided into file blocks of a fixed size, or the blocking rule for the data to be stored may be customized according to the file content and the storage nodes.
Specifically, under a customized rule the block size may be determined by the data size and the number of storage nodes (deduplication nodes). If there are many storage nodes and the data file is not large, the block size may be set to a lower value, such as 1 KB; if the data volume is large and the number of storage nodes is not large, the block size may be set to a higher value, such as 512 KB.
Dividing the file to be stored yields several file blocks corresponding to the file, each of which stores file data.
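The blocking step can be illustrated with a minimal Python sketch. It is only an example under stated assumptions: the function names choose_block_size and split_into_blocks, and the cut-offs for a "small" file and for "many" nodes, are hypothetical and do not come from the source; only the 1 KB and 512 KB block sizes are taken from the text. Cases the text does not cover fall back to the larger size here.

```python
from typing import List


def choose_block_size(file_size: int, node_count: int) -> int:
    """Pick a block size in bytes from the file size and the number of storage nodes."""
    small_file = 1 * 1024 * 1024   # assumed cut-off for a "small" file (not from the source)
    many_nodes = 16                # assumed cut-off for "many" storage nodes (not from the source)
    if file_size <= small_file and node_count >= many_nodes:
        return 1 * 1024            # lower value named in the text: 1 KB
    return 512 * 1024              # higher value named in the text: 512 KB


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Divide the file content into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

The last block may be shorter than the others; joining the blocks in order restores the original content.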
Step s120: compare the content of the file block to be stored with the pre-stored file blocks, and judge whether the system contains a file block whose content matches the file block to be stored.
The divided file blocks are compared in content with the file blocks pre-stored in the system, to judge whether the pre-stored data contains a file block whose content matches a divided file block to be stored. The matching rule may be determined according to the file content: two file blocks may be regarded as matching when their content similarity reaches 98% or more, or only when their content similarity reaches 100%, and so on. The specific matching rule is not limited here; when the importance level of the file is higher and the required precision of the file content is higher, a higher matching degree may be set.
A content match means the data is duplicated: if the content of a file block matches, the file block is judged to be a duplicate, and the file block to be stored can be determined to be a redundant block.
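The 98% and 100% matching rules can be illustrated with a short Python sketch. The source does not say how content similarity is measured; the example assumes the ratio computed by difflib.SequenceMatcher from the Python standard library, and the function name is_content_match and its threshold parameter are hypothetical.

```python
from difflib import SequenceMatcher


def is_content_match(block_a: bytes, block_b: bytes, threshold: float = 1.0) -> bool:
    """Return True when the similarity of two blocks reaches the threshold.

    Similarity is difflib's ratio (an assumption; the source does not define
    the measure). threshold=1.0 demands identical content.
    """
    if threshold >= 1.0:
        return block_a == block_b  # exact rule: plain byte comparison
    ratio = SequenceMatcher(None, block_a, block_b, autojunk=False).ratio()
    return ratio >= threshold
```

A rule below 100% treats slightly different blocks as duplicates, so the data read back may differ from what was written; this is presumably why the text suggests a higher matching degree for important files.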
This embodiment does not limit the specific method of comparing file block content. The file blocks may be compared one by one using existing methods for matching whole files, or characteristic values may be extracted from the file block content and the characteristic values of the file block to be stored and of the pre-stored file blocks compared. Comparing extracted characteristic values greatly simplifies the data comparison and reduces the resources it occupies.
If the system contains a file block whose content matches the file block to be stored, step s130 is performed.
Step s130: obtain the data storage location of the content-matching file block in the system.
Step s140: establish a data index for the matched file block to be stored according to the data storage location.
If the system contains a file block whose content matches the file block to be stored, the file block currently to be stored is a redundant file block, because the file blocks pre-stored in the system include one with matching content. To reduce the storage space occupied by redundant file blocks, this redundant data is not stored. Instead, the matching data pre-stored in the system serves as the data of the block: the data index of the matched file block to be stored is pointed at the corresponding pre-stored file block, that is, a data index is established for the matched file block to be stored according to the storage location of the matching data.
If no match is found, the data block to be stored is non-redundant data. The data storage rule in this case is not limited, and the data may be stored according to existing data storage rules.
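Steps s110 to s140 can be sketched end to end as follows. This is a toy in-memory illustration, not the implementation described by the source: the class BlockStore and its methods are hypothetical, a list position stands in for the data storage location, the matching rule is exact (100%) content equality, and non-redundant blocks are simply appended.

```python
from typing import Dict, List, Optional


class BlockStore:
    """Toy in-memory stand-in for the storage system (hypothetical)."""

    def __init__(self) -> None:
        self.blocks: List[bytes] = []                # stored block data; position = storage location
        self.file_index: Dict[str, List[int]] = {}   # file name -> storage locations of its blocks

    def find_match(self, block: bytes) -> Optional[int]:
        # s120: compare content with every pre-stored block
        for location, stored in enumerate(self.blocks):
            if stored == block:
                return location                      # s130: location of the matching block
        return None

    def store_file(self, name: str, data: bytes, block_size: int) -> None:
        locations: List[int] = []
        for start in range(0, len(data), block_size):    # s110: divide into blocks
            block = data[start:start + block_size]
            location = self.find_match(block)
            if location is None:                     # non-redundant: store the data
                self.blocks.append(block)
                location = len(self.blocks) - 1
            locations.append(location)               # s140: index entry points at the data
        self.file_index[name] = locations

    def read_file(self, name: str) -> bytes:
        return b"".join(self.blocks[loc] for loc in self.file_index[name])
```

Storing b"AAAABBBBAAAA" with a block size of 4 keeps two blocks and records the locations [0, 1, 0]; the third block is only an index entry pointing at the first.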
Based on the above description, the data storage method for a distributed storage system disclosed in the embodiment of the present invention divides the file to be stored into blocks, analyzes the data at the level of data blocks, and compares the content of each file block to be stored with the pre-stored file blocks. This raises the probability of detecting partially duplicated data within a file and greatly increases the number of redundant file blocks that can be found. Because redundant data is identified precisely by comparing the divided file blocks, the probability of detecting redundant data is much higher than when the whole file is analyzed as a unit. If the system contains a file block whose content matches a file block to be stored, the content of the block currently being stored already exists in the system, that is, the file block to be stored is a redundant file block. A data index is established for the redundant file block according to the data storage location of the content-matching file block in the system; the data of the redundant file block is not stored, and the current file block is pointed at the pre-stored matching data. This meets the storage requirement of the file to be stored while greatly reducing the storage space occupied by redundant data.
The above embodiment does not limit the specific method of comparing file block content; characteristic values may be extracted from the file block content, and the characteristic values of the file block to be stored and of the pre-stored file blocks compared, so as to reduce the resources occupied by the comparison. This embodiment describes the content comparison method for file blocks in detail.
There are many methods for extracting characteristic data from a file block, and existing characteristic extraction methods may be used, for example the NMF algorithm, the FAST algorithm, the SURF algorithm, a hash algorithm, and so on. Among these, a hash value is more compact, the calculated characteristic value is relatively simple and, barring hash collisions, unique, and it accurately reflects the characteristics of the file block content. Preferably, the content of file blocks may be compared by calculating and comparing the hash values of the file blocks. Specifically, comparing the content of the file block to be stored with the pre-stored file blocks and judging whether the system contains a file block whose content matches the file block to be stored may include the following steps:
calculating the hash value of the file block to be stored to obtain a hash value to be stored;
comparing the hash value to be stored with the file block hash values in the index table, and judging whether the index table contains a file block hash value identical to the hash value to be stored; wherein the index table stores the file block hash values of the files already stored in the system and the corresponding file block storage locations.
Comparing file block content by means of file block hash values not only greatly reduces the system resources and load consumed in comparing file content, but also improves the efficiency of data deduplication (deletion of duplicate data), achieving efficient deduplication. The specific steps of calculating a file block hash value may follow the prior art and are not described again here.
Based on the above embodiment, the handling of the case where the index table contains no file block hash value identical to the hash value to be stored is not limited; the data may be stored directly. Preferably, while the data is being stored, the hash value to be stored and the corresponding data storage location may also be added to the index table, so that the file block content recorded in the index table is updated in real time.
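The hash comparison and the index table update can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the index table is modeled as a dictionary from hash value to storage location, the storage as a list whose positions serve as storage locations, and the names block_hash and store_block are hypothetical. hashlib.sha1 from the standard library computes the SHA-1 hash; identical hash values are treated as identical content, so hash collisions are ignored.

```python
import hashlib
from typing import Dict, List, Tuple


def block_hash(block: bytes) -> str:
    """Return the SHA-1 hash value of a file block as a hex string."""
    return hashlib.sha1(block).hexdigest()


def store_block(index_table: Dict[str, int], storage: List[bytes],
                block: bytes) -> Tuple[int, bool]:
    """Store one block with deduplication.

    Returns (storage location, is_redundant). A redundant block is not
    stored again; only the location of the existing data is returned.
    """
    digest = block_hash(block)              # hash value to be stored
    location = index_table.get(digest)      # compare with hash values in the index table
    if location is not None:
        return location, True               # match: point at the pre-stored data
    storage.append(block)                   # no match: store the data
    location = len(storage) - 1
    index_table[digest] = location          # add hash value and location to the index table
    return location, False
```

A dictionary lookup takes roughly constant time per block regardless of how much data is already stored, which is the practical benefit of comparing hash values instead of block contents.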
In addition, a distributed storage system includes several nodes that implement different data functions, such as data storage and data management. The index table is generally stored on the management node, and to relieve the management load on the management node, several ordinary nodes are generally selected to share its management tasks. After the index table is updated, other nodes in the system may receive the information late and continue to compare against the previous index table. To reduce such cases and avoid storing data repeatedly, the updated index table may be published in real time to each node in the system, so that the index information on every node is updated promptly.
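The publishing step can be sketched as follows. This is an in-process simulation only: the classes Node and ManagementNode and their methods are hypothetical, and a real system would send the update over the network, which the source does not describe.

```python
from typing import Dict, Iterable, List


class Node:
    """An ordinary node holding its own copy of the index table (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.index_table: Dict[str, int] = {}

    def apply_update(self, digest: str, location: int) -> None:
        self.index_table[digest] = location


class ManagementNode:
    """Holds the authoritative index table and publishes every change (hypothetical)."""

    def __init__(self, nodes: Iterable[Node]) -> None:
        self.index_table: Dict[str, int] = {}
        self.nodes: List[Node] = list(nodes)

    def add_entry(self, digest: str, location: int) -> None:
        self.index_table[digest] = location
        for node in self.nodes:             # publish the update as soon as it is made
            node.apply_update(digest, location)
```

Because each entry is pushed as soon as it is added, no node compares against a stale index table in this simplified model.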
There are many kinds of hash functions, and the specific hash function selected is not limited. Preferably, the hash value of the file block to be stored may be calculated by the SHA-1 hash function. The SHA-1 hash function is fast to compute, which speeds up information comparison and improves data storage efficiency.
In addition, before the data storage method provided by the present invention is applied, data may already have been stored on the system disks. This pre-stored data need not be compared; data comparison may be performed only for files stored after the storage method is adopted, that is, the index table may exclude this pre-stored data, so as to reduce the occupation of system resources.
When the volume of this pre-stored data is large, however, newly stored file blocks may contain a large amount of data duplicating it. To further reduce the occupation of system space and raise the deduplication rate, preferably, it may be judged whether pre-stored files exist in the system; if so, the file block hash values of the stored files are calculated according to their file block occupancy, and the index table is generated from each calculated file block hash value and the corresponding data storage location.
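Generating the index table from pre-stored data can be sketched as follows. The sketch assumes the occupied blocks are available as a mapping from storage location to block data; the name build_index_table and the entry layout (hash value, storage location) are hypothetical. One entry is produced per occupied block, so duplicates among the pre-stored blocks remain visible in the result.

```python
import hashlib
from typing import Dict, List, Tuple


def build_index_table(disk: Dict[int, bytes]) -> List[Tuple[str, int]]:
    """Return one (SHA-1 hash value, storage location) entry per occupied block."""
    if not disk:                            # no pre-stored files in the system
        return []
    return [
        (hashlib.sha1(data).hexdigest(), location)
        for location, data in sorted(disk.items())
    ]
```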
In addition, the pre-stored data may itself contain duplicate data. To reduce the duplication rate of the pre-stored data and improve system storage performance, preferably, the file block hash values in the index table may be compared pairwise to judge whether the index table contains file blocks with identical hash values; if so, a retained file block and a non-retained file block are determined, and the stored data of the non-retained file block is replaced with the data storage location of the retained file block. By comparing the hash values in the index table, any duplicate data found is replaced with a data index to the corresponding retained data, which deduplicates the pre-stored data and reduces the storage space occupied by system data.
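Deduplicating the pre-stored data can be sketched as follows. Instead of a literal pairwise comparison, the sketch groups entries by hash value with a dictionary, which finds the same duplicates with fewer comparisons. The retention rule (keep the first block seen), the BlockRef marker and the function name deduplicate_stored are assumptions made for illustration; the source does not specify how the retained block is chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class BlockRef:
    """Marker stored in place of duplicate data: the location of the retained block."""
    location: int


def deduplicate_stored(disk: Dict[int, Union[bytes, BlockRef]],
                       index_table: List[Tuple[str, int]]) -> Dict[str, int]:
    """Replace duplicate blocks on disk with references to the retained block.

    Returns the deduplicated index table: hash value -> retained location.
    """
    retained: Dict[str, int] = {}
    for digest, location in index_table:
        if digest in retained:
            # non-retained block: drop its data, keep only a pointer
            disk[location] = BlockRef(retained[digest])
        else:
            retained[digest] = location     # first block with this hash is retained
    return retained
```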
To deepen understanding of the data storage method provided by the invention, the overall storage flow is introduced here, taking file block hash comparison as an example. Other implementations based on the data storage method for a distributed storage system provided by the present invention may refer to the description of this embodiment.
The client receives a data storage request and divides the received file to be stored into blocks.
After blocking, each file block is distributed to the storage nodes (deduplication nodes) of the cluster, so that multiple nodes perform the deduplication service in parallel.
The hash value of the data block to be stored is calculated by the SHA-1 hash function, and the data index is queried once the data block hash value is obtained. If the hash value exists in the index table, the data block has already been stored and the data block corresponding to the file block to be stored is redundant data, so only the pointer position of the hash value found is recorded. If no match is found in the data index, the data block is non-redundant data; the data is then stored and the pointer position of the data block is saved in the data index.
The index table is kept in a database. Each file block corresponds to one group of data in the index table, with the hash value as the index key. Because the hash value serves as the unique identifier of a block, it ensures the accuracy of the data blocks recorded in the data index, and it also speeds up data index queries and improves deduplication efficiency.
Because a distributed mass storage system stores data on multiple nodes, the deduplication algorithm allows the nodes to run the deduplication service in parallel. After a data block is issued, the deduplication service checks against the data index shared by all nodes whether the block is duplicate data. If it is redundant data, only the pointer in the data index is stored; if it is non-redundant data, the shared data index table is updated and published in real time to every node. With this distributed deduplication technique for mass storage (where the growth of stored data capacity tends toward infinity, with no upper limit), multi-node parallel deduplication is achieved, the deduplication process is optimized, and data is deduplicated more efficiently.
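Parallel deduplication against a shared index can be sketched as follows. Threads stand in for storage nodes, and a lock-protected dictionary stands in for the shared data index; the names SharedIndex and deduplicate_parallel are hypothetical, and a real cluster would need a distributed index and network communication that the source does not detail.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class SharedIndex:
    """Index table shared by all workers, guarded by a lock (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.table: Dict[str, int] = {}     # hash value -> storage location
        self.stored: List[bytes] = []       # stored block data

    def put(self, block: bytes) -> int:
        digest = hashlib.sha1(block).hexdigest()   # hashing runs outside the lock
        with self._lock:                            # check-and-insert must be atomic
            location = self.table.get(digest)
            if location is None:
                self.stored.append(block)
                location = len(self.stored) - 1
                self.table[digest] = location
            return location


def deduplicate_parallel(blocks: List[bytes],
                         workers: int = 4) -> Tuple[SharedIndex, List[int]]:
    """Deduplicate blocks with several workers; locations follow the input order."""
    index = SharedIndex()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        locations = list(pool.map(index.put, blocks))
    return index, locations
```

The lock makes the lookup and the insertion a single atomic step, so two workers handling identical blocks at the same moment cannot both store the data. The locations assigned to individual blocks depend on scheduling, but the number of stored blocks always equals the number of distinct blocks.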
In this embodiment, files prepared for storage are checked before being stored: the hash value of each file block is calculated with the SHA-1 hash function, a data index is established from pointers, and the hash values of existing data blocks are checked by judging whether the index table contains a hash value identical to that of the data block. If it does, the data is redundant data and only the hash value pointer of the data block need be stored, which reduces the capacity used on storage devices, improves space utilization, reduces system load, and shortens data read/write latency. If it does not, the data block is not redundant data and its hash value is recorded in the data index table. In addition, a mass storage system distributes and shares data across multiple nodes, so when different data is stored on different nodes, deduplication at block (KB) level can be performed directly in parallel, which effectively improves storage efficiency for data such as backup and disaster recovery data.
Referring to FIG. 2, FIG. 2 is a structural block diagram of the data storage apparatus for a distributed storage system provided in an embodiment of the present invention. The apparatus may include a blocking unit 210, a comparing unit 220, a data information acquiring unit 230 and an index establishing unit 240. The data storage apparatus provided in this embodiment and the above data storage method for a distributed storage system may be cross-referenced.
The blocking unit 210 is mainly configured to divide a file to be stored into blocks to obtain several file blocks to be stored;
the comparing unit 220 is mainly configured to compare the content of each file block to be stored with the pre-stored file blocks, and to judge whether the system contains a file block whose content matches the file block to be stored;
the data information acquiring unit 230 is mainly configured to obtain, if the system contains a file block whose content matches the file block to be stored, the data storage location of the content-matching file block in the system;
the index establishing unit 240 is mainly configured to establish a data index for the matched file block to be stored according to the data storage location.
Preferably, the comparing unit may specifically be a hash comparing unit, comprising:
a hash value calculating subunit, configured to calculate the hash value of the file block to be stored to obtain a hash value to be stored;
a hash value comparing subunit, configured to compare the hash value to be stored with the file block hash values in the index table, and to judge whether the index table contains a file block hash value identical to the hash value to be stored; wherein the index table stores the file block hash values of the files already stored in the system and the corresponding file block storage locations.
Preferably, the data storage apparatus for a distributed storage system provided in this embodiment may further include a storage unit. The storage unit is connected to the hash value comparing subunit and is mainly configured to store the data to be stored, and to add the hash value to be stored and the corresponding data storage location to the index table, if the index table contains no file block hash value identical to the hash value to be stored.
Preferably, the data storage apparatus for a distributed storage system provided in this embodiment may further include an updating unit. The updating unit is connected to the storage unit and is mainly configured to publish the updated index table to each node in the system in real time.
Preferably, the hash value calculating subunit may specifically be configured to calculate the hash value of the file block to be stored by the SHA-1 hash function.
Preferably, an index table generating unit in the data storage apparatus for a distributed storage system may mainly include:
a judging subunit, configured to judge whether pre-stored files exist in the system;
a pre-stored calculating subunit, configured to calculate, if pre-stored files exist in the system, the file block hash values of the stored files according to the file block occupancy of the stored files;
the index table being generated from each calculated file block hash value of the stored files and the corresponding data storage location.
Preferably, the index table generating unit may further include an index table duplicate comparing unit.
The index table duplicate comparing unit includes:
a pre-stored comparing subunit, configured to compare the file block hash values in the index table pairwise, and to judge whether the index table contains file blocks with identical hash values;
a file block determining subunit, configured to determine a retained file block and a non-retained file block if the index table contains file blocks with identical hash values;
a data replacing subunit, configured to replace the stored data of the non-retained file block with the data storage location of the retained file block.
In the data storage apparatus for a distributed storage system provided in this embodiment, the blocking unit divides the file to be stored into blocks, and the comparing unit identifies redundant data by comparing the divided file blocks, which raises the probability of detecting partially duplicated data within a file and enables precise data deduplication.
Referring to FIG. 3, FIG. 3 is a structural block diagram of the data storage device for a distributed storage system provided in this embodiment. The device may include a memory 300 and a processor 310. For the data storage device, refer to the description of the above data storage method for a distributed storage system.
The memory 300 is mainly configured to store a program;
the processor 310 is mainly configured to implement the steps of the above data storage method for a distributed storage system when executing the program.
Referring to FIG. 4, which is a schematic structural diagram of the data storage device for a distributed storage system provided in this embodiment, the data storage device may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 322, a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may provide transient or persistent storage. A program stored in the storage medium 330 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the data processing device. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the data storage device 301, the series of instruction operations in the storage medium 330.
The data storage device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
The steps of the data storage method for a distributed storage system described above with reference to FIG. 1 can be implemented by the structure of the data storage device for a distributed storage system.
This embodiment discloses a readable storage medium on which a program is stored. When executed by a processor, the program implements the steps of the data storage method for a distributed storage system; for the method, refer to the embodiment corresponding to FIG. 1, which is not described again here.
The readable storage medium may specifically be any readable storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The data storage method, apparatus, device and readable storage medium for a distributed storage system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the protection scope of the claims of the present invention.