CN104657362A - Method and device for storing and querying data - Google Patents
Method and device for storing and querying data Download PDFInfo
- Publication number
- CN104657362A CN104657362A CN201310577254.5A CN201310577254A CN104657362A CN 104657362 A CN104657362 A CN 104657362A CN 201310577254 A CN201310577254 A CN 201310577254A CN 104657362 A CN104657362 A CN 104657362A
- Authority
- CN
- China
- Prior art keywords
- data
- content
- packed
- bucket
- cell
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 66
- 238000013507 mapping Methods 0.000 claims description 44
- 238000007906 compression Methods 0.000 claims description 30
- 230000008878 coupling Effects 0.000 claims description 30
- 238000010168 coupling process Methods 0.000 claims description 30
- 238000005859 coupling reaction Methods 0.000 claims description 30
- 230000006835 compression Effects 0.000 claims description 29
- 230000001174 ascending effect Effects 0.000 claims description 12
- 238000013500 data storage Methods 0.000 claims description 7
- 230000006837 decompression Effects 0.000 claims description 7
- 238000005259 measurement Methods 0.000 claims description 7
- 238000012545 processing Methods 0.000 claims description 7
- 238000010276 construction Methods 0.000 claims description 6
- 230000013011 mating Effects 0.000 claims description 4
- 241001269238 Data Species 0.000 description 14
- 230000007935 neutral effect Effects 0.000 description 10
- 230000008569 process Effects 0.000 description 9
- 238000005516 engineering process Methods 0.000 description 6
- 238000004422 calculation algorithm Methods 0.000 description 5
- 238000004891 communication Methods 0.000 description 2
- 238000013506 data mapping Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 230000001133 acceleration Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000010267 cellular communication Effects 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 239000002360 explosive Substances 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000031068 symbiosis, encompassing mutualism through parasitism Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention provides a method for storing data. The data to be stored include original data identifiers, and original data contents corresponding to the original data identifiers. The method comprises the steps of clustering the data to be stored according to the original data identifier, wherein each class is corresponding to a data block; generating data block identifiers according to the original data identifiers of the store to be stored and corresponding to the data blocks; compressing the original data contents of the data to be stored and corresponding to the data blocks; storing the compressed data contents into data areas of the data blocks and acquiring address identifiers of the compressed data contents; generating indexes according to the original data identifiers of the data to be stored and corresponding to the data blocks and the address identifiers; storing the indexes into index areas in the data blocks. With the adoption of the data storing module, the use efficiency of the data is greatly increased. The invention further provides a data storing device, a data querying method and a data querying device.
Description
Technical field
The present invention relates to field of computer technology, particularly relate to the storage of a kind of data, querying method and device.
Background technology
Along with the progress of computer technology, computer technology creates great impact to the life of people, work, and creating a large amount of data needs Computer Storage and process.At present, the data that computing machine will store are by original MB(megabyte), GB(GB) order of magnitude, risen to TB(terabyte), even PB(thousand terabyte) order of magnitude, in the face of the explosive growth of data, how to reduce data carrying cost and become people's urgent problem.
The means of traditional reduction data carrying cost, generally store after directly compression process being carried out to data, but, this sacrifice exchange for computing time ground storage space mode, reduce the service efficiency of data, such as the compression of the data all solutions of compression could must be used during data query.
Summary of the invention
Based on this, be necessary the problem of the service efficiency reducing data for the method for traditional packed data, a kind of date storage method and device, data enquire method and device are provided.
A kind of date storage method, data to be stored comprise raw data mark and identify corresponding original data content with described raw data, and described method comprises:
Carry out cluster according to described raw data mark to described data to be stored, each class is a corresponding data block respectively, and generates data block identifier according to the raw data mark of data to be stored corresponding to described data block;
The original data content of data to be stored corresponding for described data block is compressed, packed data content is stored in the data field of described data block, and obtain the address designation of described packed data content;
According to raw data mark and the described address designation generating indexes of data to be stored corresponding to described data block, and by the index area of described index stores in described data block.
A kind of data storage device, data to be stored comprise raw data mark and identify corresponding original data content with described raw data, and described device comprises:
Data block generation module, for carrying out cluster according to described raw data mark to described data to be stored, each class is a corresponding data block respectively, and generates data block identifier according to the raw data mark of data to be stored corresponding to described data block;
Original data content compression module, for being compressed by the original data content of data to be stored corresponding for described data block, is stored in the data field of described data block by packed data content, and obtains the address designation of described packed data content;
Index generation module, for according to the raw data mark of data to be stored corresponding to described data block and described address designation generating indexes, and by the index area of described index stores in described data block.
Above-mentioned date storage method and device, after data to be stored are carried out cluster, each class data to be stored are a corresponding data block respectively, and generate data block identifier according to the raw data mark of data to be stored corresponding to data block, and data block comprises index area and data field, data field stores packed data content, and index area stores the index generated according to the address designation of packed data content in raw data mark and data field.During data query, the data block at data query place directly can be determined according to data query mark, the address designation of data query content is obtained according to the index in data block, thus Compressed text search data content can be obtained according to this address designation from data field, just can obtain Compressed text search data content after decompression.Both can reach good compression effectiveness, and reduce and store space shared by data, again can rapidly locating position, realize fast query data, and without the need to being decompressed by all packed datas, drastically increase the service efficiency of data.
A kind of data enquire method, described method comprises:
Obtain original query Data Identification;
The data block corresponding to described original query Data Identification is determined according to described original query Data Identification and the data block identifier prestored;
According to the index in the index area of described data block, obtain the address designation corresponding with described original query Data Identification;
Obtain Compressed text search data content according to described address designation from the data field of described data block, decompress described Compressed text search data content, obtains data query content.
A kind of data query arrangement, described device comprises:
Original query Data Identification acquisition module, for obtaining original query Data Identification;
Data block determination module, for determining the data block corresponding to described original query Data Identification according to described original query Data Identification and the data block identifier prestored;
Address designation acquisition module, for according to the index in the index area of described data block, obtains the address designation corresponding with described original query Data Identification;
Data query content obtaining module, for obtaining Compressed text search data content, decompression module according to described address designation from the data field of described data block, for the described Compressed text search data content that decompresses, obtains data query content.
Above-mentioned data enquire method and device, after obtaining original query Data Identification, the data block corresponding to original query Data Identification can be determined by this original query Data Identification and the data block identifier prestored, then determine the data block at data query place, the address designation corresponding with original query Data Identification is obtained from the index area data block, thus Compressed text search data content can be obtained from the data field data block according to this address designation, being decompressed by Compressed text search data content just can obtain data query content.Can rapidly locating position, realize quick-searching, and without the need to being decompressed by all packed datas, drastically increase the service efficiency of data.
Accompanying drawing explanation
Fig. 1 is the schematic flow sheet of date storage method in an embodiment;
Fig. 2 clips low portion the raw data mark of data to be stored corresponding from data block in an embodiment, acquisition packed data identifies, and by packed data mark and the address designation corresponding stored of packed data content in the schematic flow sheet of the step of the index area of data block;
The original data content of data to be stored corresponding for data block compresses in an embodiment by Fig. 3, packed data content is stored in the data field of data block, and obtains the schematic flow sheet of the step of the address designation of packed data content;
Fig. 4 is the schematic flow sheet of the step generating decoding accelerometer in an embodiment;
Fig. 5 is the structured flowchart of data block in an embody rule scene;
Fig. 6 is the schematic flow sheet of data enquire method in an embodiment;
Fig. 7 is the low portion clipping original query Data Identification in an embodiment, obtains Compressed text search Data Identification, the schematic flow sheet of the step of the address designation that the Compressed text search Data Identification stored in the index area of acquisition data block is corresponding;
Fig. 8 is the schematic diagram of Huffman tree and decoding accelerometer in an embody rule scene;
Fig. 9 is the structured flowchart of data storage device in an embodiment;
Figure 10 is the structured flowchart of index generation module in an embodiment;
Figure 11 is the structured flowchart of original data content compression module in an embodiment;
Figure 12 is the structured flowchart of data storage device in another embodiment;
Figure 13 is the structured flowchart of data query arrangement in an embodiment;
Figure 14 is the structured flowchart of address designation acquisition module in an embodiment;
Figure 15 is the structured flowchart of the first acquisition module in an embodiment;
Figure 16 is the structured flowchart of the first acquisition module in another embodiment;
Figure 17 is the structured flowchart of data query content obtaining module in an embodiment;
Figure 18 is the module map of a computer system that can realize the embodiment of the present invention in an embodiment.
Embodiment
In order to make object of the present invention, technical scheme and advantage clearly understand, below in conjunction with drawings and Examples, the present invention is further elaborated.Should be appreciated that specific embodiment described herein only in order to explain the present invention, be not intended to limit the present invention.
Be appreciated that term used in the present invention " first ", " second " etc. can in this article for describing various element, but these elements do not limit by these terms.These terms are only for distinguishing first element and another element.For example, without departing from the scope of the invention, figure place can be preset first and be called that second presets figure place, and similarly, figure place can be preset second and be called that first presets figure place.First presets figure place and second presets figure place both default figure places, but it is not same default figure place.
Unless context separately has the description of specific distinct, the element in the present invention and assembly, the form that quantity both can be single exists, and form that also can be multiple exists, and the present invention does not limit this.Although the step in the present invention arranges with label, and be not used in and limit the precedence of step, the order of step or the execution of certain step need based on other steps unless expressly stated, otherwise the relative rank of step is adjustable.Be appreciated that term "and/or" used herein relates to and contains the one or more any and all possible combination in the Listed Items be associated.
As shown in Figure 1, in one embodiment, provide a kind of date storage method, the method comprises:
Step 102, carries out cluster according to raw data mark to data to be stored, and each class is a corresponding data block respectively, and generates data block identifier according to the raw data mark of data to be stored corresponding to data block.
Data to be stored comprise raw data mark and identify corresponding original data content with raw data." original " represents data state originally, uncompressed or other process.Raw data mark can be the key (key) of data to be stored, can be used for distinguishing different data to be stored.Cluster is carried out to data to be stored, the raw data mark centralized stores of public part can be had, and the public part each raw data identified extracts, no longer repeated storage, only can store non-common parts when storing raw data mark, reach the object of raw data mark being carried out to fixed length compression.And the data after fixed length compression still possess the characteristic of random access, be convenient to search.
Data block is one group or continuous arrangement several groups of records together in order, is the data unit carrying out between primary memory and Input/Output Device or external storage transmitting.Data block can be stored in the internal memory of computing machine.After data to be stored are carried out cluster according to raw data mark, the corresponding data block of each class, and generate data block identifier according to the raw data mark of data to be stored corresponding to data block.Like this when data query, can just can the data block at locator data place according to data query mark, thus reach the object of fast finding.
Step 104, compresses the original data content of data to be stored corresponding for data block, packed data content is stored in the data field of data block, and obtain the address designation of packed data content.
In order to obtain larger compressibility, variable-length encoding can be used to compress original data content.Variable-length encoding is exactly entropy code, a kind of coded system that it is is foundation with the probability of occurrence of each data cell or frequency, allows and occurs the shortest code identification of maximum code characters, shorten code length so as much as possible; Last coding result is a binary serial data, and Huffman encoding algorithm is exactly a kind of typical variable-length encoding.
Use variable-length encoding can reach very high compressibility, but do not fix due to code length, make the memory address of packed data be difficult to directly be predicted by data structure, generally need all packed datas to decompress just can inquire required data.Address designation, for identifying the memory location of packed data content, when stores compressed data content, obtains the address designation of packed data content, can be convenient to directly locate packed data content according to this address designation, not need all packed datas all to decode.
Step 106, according to raw data mark and the address designation generating indexes of data to be stored corresponding to data block, and by the index area of index stores in data block.
The raw data mark of to be stored data corresponding according to data block and the address designation generating indexes of packed data content, and by the index area of this index stores in data block, be convenient to obtain according to raw data mark the address designation identifying corresponding packed data content with raw data, and then obtain packed data content according to this address designation, thus realize fast query data.
Above-mentioned date storage method, after data to be stored are carried out cluster, each class data to be stored are a corresponding data block respectively, and generate data block identifier according to the raw data mark of data to be stored corresponding to data block, and data block comprises index area and data field, data field stores packed data content, and index area stores the index generated according to the address designation of packed data content in raw data mark and data field.During data query, the data block at data query place directly can be determined according to data query mark, the address designation of data query content is obtained according to the index in data block, thus Compressed text search data content can be obtained according to this address designation from data field, just can obtain Compressed text search data content after decompression.Both can reach good compression effectiveness, and reduce and store space shared by data, again can rapidly locating position, realize fast query data, and without the need to being decompressed by all packed datas, drastically increase the service efficiency of data.
In one embodiment, step 102 comprises: using data to be stored identical for the low portion of the first default figure place in the raw data of data to be stored mark as a class, each class is a corresponding data block respectively, and using the data block identifier of low portion as data block.
In the present embodiment, using data to be stored identical for the low portion of the first default figure place as a class, each class is a corresponding data block respectively, and using the data block identifier of low portion as data block.During data query, the data block at data query place can be judged according to the low portion of the default figure place of data query mark, can fast query be realized.
Wherein the first default figure place sets as required, if than representing that raw data identifies by 32 signless integers, then a raw data mark needs to store by 4 bytes, if a raw data is designated " 0x00000200 ", wherein " 0x " represents sexadecimal, then the least-significant byte of this raw data mark is " 00 ", and now the first default figure place is 8.Here least-significant byte refers to the least-significant byte under scale-of-two, is equivalent to last 2 under sexadecimal.
In one embodiment, step 106 comprises: from the raw data mark of data to be stored corresponding to data block, clip low portion, acquisition packed data identifies, and by packed data mark and the address designation corresponding stored of packed data content in the index area of data block.
Owing to presetting the identical data to be stored of the low portion of figure place using first as a class, so the raw data mark in each class has identical low portion, clip low portion and can reduce the storage space stored in index area needed for Data Identification, realize the compression to raw data mark, obtain packed data mark.Such as clip the low portion of 8, the storage space of 1 byte can be reduced.And the low portion clipped is as data block identifier, data block identifier and packed data identifier combination can be obtained raw data and identify.By packed data mark and the address designation corresponding stored of packed data content in the index area of data block, this packed data identifies and the corresponding relation of address designation is exactly the index being stored in index area.
As shown in Figure 2, in one embodiment, from the raw data mark of data to be stored corresponding to data block, clip low portion, obtain packed data mark, and by packed data mark and the address designation corresponding stored of packed data content in the step of the index area of data block, specifically comprise:
Step 202, using raw data mark identical for the high-order portion of the second default figure place in the raw data of data block mark as same subclass, each data bucket in the index area of the raw data mark difference corresponding data block of each subclass, the data bucket of high-order portion as data bucket is identified, and in the index area of data block, stores the corresponding relation of data bucket mark and the bucket initial memory address of data bucket in index area.
Wherein, first total bit that is that figure place and second presets figure place and that be less than or equal to raw data mark is preset.Data bucket (bucket) is for storing the data structure of data in the index area of data block.
Particularly, using raw data mark identical for the high-order portion of the second default figure place as same subclass, and the index area of data block is divided into multiple data bucket, each subclass and each data bucket one_to_one corresponding.The raw data mark of each subclass has identical high-order portion, and this high-order portion can be used to identify as the data bucket of data bucket.
The corresponding relation storing data bucket mark and the bucket initial memory address of data bucket in index area is also needed in index area, particularly, the array length array identical with the quantity of data bucket in data block can be used to store the bucket initial memory address of data bucket, be designated as data bucket mark under array, thus realize the corresponding relation storing data bucket mark and the bucket initial memory address of data bucket in index area.First presets total bit that is that figure place and second presets figure place and that be no more than raw data mark, thus ensures that a raw data mark can normally be compressed.
Step 204, clips low portion and high-order portion, obtains packed data mark, and packed data mark be stored in the data bucket identified using high-order portion as data bucket from the raw data mark of data to be stored corresponding to data block.
First presets the integral multiple that figure place can be 8 under scale-of-two, and second presets the integral multiple that figure place also can be 8 under scale-of-two.Because 8 is a byte, can ensure that the part clipped is the integral multiple of byte, and a usual character needs the storage space occupying a byte, thus effectively can reduce data and store space, place.
Low portion and high-order portion is clipped from the raw data mark of data to be stored corresponding to data block, raw data mark is compressed, acquisition packed data identifies, and packed data is identified the high-order portion that is stored in clip as in the data bucket that data bucket identifies, data block identifier, data bucket mark and packed data mark can form raw data mark, and carrying out compression to raw data mark can save storage space.
Step 206, stores the address designation identifying corresponding packed data content with the packed data stored in data bucket in the index area of data block.
Particularly, stores compressed data can identify and the data content address designation within a data area corresponding with packed data in the index area of data block.Thus obtain by packed data mark the packed data content stored in data field.
In the present embodiment, by opening up multiple data bucket in the index area of data block, after the high-order portion identify raw data and low portion clip rear acquisition packed data mark, packed data mark is stored in data bucket, also stored in index area and identify corresponding address designation with packed data.During data query, by the data bucket mark that stores in index area and the corresponding relation of the bucket initial memory address of data bucket in index area, the packed data mark of storage is obtained from data bucket, obtain the address designation that packed data mark is corresponding again from index area, thus obtain data query content according to this address designation from the data field of data block.Be stored in index area by after the compression of raw data mark, save storage space, and by the packed data content in rapidly locating district, index area, ensure that the efficiency of data query.
In one embodiment, the address designation of packed data content is the side-play amount of packed data content memory address within a data area relative to initial memory address in data field; Then step 206 comprises: in index area, store the start offset amount corresponding to data bucket, and the side-play amount of the packed data content corresponding to packed data mark in data bucket is relative to the bucket bias internal amount of start offset amount.
In the present embodiment, using packed data content memory address within a data area relative to the side-play amount of initial memory address in data field as the address designation of packed data content.Initial memory address refers in data field the memory address starting to store data, because data are Coutinuous stores, obtain the side-play amount of the packed data content of initial memory address and packed data mark correspondence, just can obtain the actual storage address of packed data content, namely in data field, start the memory address storing this packed data content.And the side-play amount of packed data content corresponding to adjacent next packed data mark is identified by this packed data, the storage end position of packed data content corresponding to this packed data mark can be obtained, thus do not need extra storage data length, the storage space taken is little.
Start offset amount be the memory address that starts within a data area to store of the packed data content corresponding to packed data mark in data bucket relative to the side-play amount of above-mentioned initial memory address, bucket bias internal amount refers to the side-play amount of side-play amount relative to start offset amount of each packed data content stored in data bucket.The start offset amount that packed data mark is corresponding and bucket bias internal amount and the real offset that is exactly this packed data mark, this real offset and initial memory address and the actual storage address that is exactly packed data content corresponding to this packed data mark.In the present embodiment, more storage space to be saved owing to storing side-play amount than store storage address itself, the space shared by memory address mark can be reduced further by bucket bias internal amount.
In one embodiment, the address designation of packed data content is the side-play amount of packed data content memory address within a data area relative to initial memory address in data field; Step 206 comprises: the start offset amount storing the packed data content corresponding to data bucket in index area, and the data length of each packed data content corresponding to data bucket.
In the present embodiment, the start offset amount of the packed data content corresponding to data bucket is stored in index area, and the data length of each packed data content corresponding to data bucket, the storage space ratio deviation amount shared by data length is less, can packed data volume further.Need to obtain the data length that in start offset amount and data bucket, before Compressed text search Data Identification, all packed data marks of (comprising self) are corresponding during data query.Because data are Coutinuous stores, the data length that packed data marks that (not comprise self) before Compressed text search Data Identification in data bucket all are corresponding and, be exactly the bucket bias internal amount of Compressed text search Data Identification; Bucket bias internal amount and start offset amount and be exactly the memory address of packed data content corresponding to Compressed text search Data Identification, namely start the address of stores compressed data content.And the end address of packed data content can be obtained by the data length of Compressed text search Data Identification self.
In one embodiment, packed data is identified in the data bucket of the index area of data block and stores according to the numerical values recited ascending order of packed data mark or descending.
In the present embodiment, the packed data in data bucket is identified the numerical values recited ascending order according to packed data mark or descending storage, can be convenient to by binary chop quick position packed data mark, thus the speed of data query can be increased substantially.
Particularly, binary chop is a kind of searching algorithm searching a certain element-specific in subordinate ordered array.Searching plain process is from the neutral element of array, if neutral element is just in time the element that will search, then searches plain process and terminates; If a certain element-specific is greater than or less than neutral element, is then greater than or less than in array in that half of neutral element and searches, and equally compare from neutral element with starting.If be empty in a certain step array, then representative can not find.This searching algorithm more all makes hunting zone reduce half each time, greatly can improve the efficiency of data query.
As shown in Figure 3, in one embodiment, step 104 comprises:
Step 302, is divided into data cell by the original data content of data to be stored, obtains data cell set, calculates the frequency of occurrences of each data cell in data cell set.
The division of data cell can be determined according to actual needs, and original data content as a data cell, also can be divided into multiple data cell by original data content itself.If be IP address A.B.C.D than data content, can use this IP address of 4 byte representations, each byte represents A, B, C or D respectively, IP address can be divided into A, B, C and D tetra-data cells.
Step 304, according to the frequency of occurrences of data cell, for data cell distributes data cell encoding, and according to the corresponding relation of data cell and data cell coding, the original data content of data to be stored corresponding for data block is encoded according to data cell, obtain packed data content.
Calculate the frequency of occurrences of data cell, the data cell high for the frequency of occurrences distributes shorter coding, the data cell low for the frequency of occurrences distributes longer coding, thus according to the corresponding relation of data cell and data cell coding, the original data content of data to be stored corresponding for data block is carried out variable-length encoding according to data cell, obtain packed data content.
Step 306, the coding schedule of the corresponding relation of packed data content being encoded with record data cell and data cell is stored in the data field of data block, and obtains the address designation of packed data content.
Packed data content and coding schedule are stored in the data field of data block, wherein coding schedule have recorded the corresponding relation of data cell and data cell coding, can decode during data query according to this coding schedule to packed data content, obtains original data content.
In the present embodiment, by the original data content of data to be stored is divided into data cell, store after variable-length encoding being carried out to original data content according to the frequency of occurrences of data cell, significantly can reduce and store the required storage space of raw data mark, and each packed data content is to having address designation, by the memory address of this address designation quick position packed data content during data query, take into account the efficiency of data volume and data query.
In one embodiment, according to the frequency of occurrences of data cell, for data cell distributes the step of data cell encoding, comprising: according to the frequency of occurrences of data cell, structure Huffman tree, according to the coordinates measurement data cell coding from the root node of Huffman tree to leaf node; The method also comprises: store Huffman tree in the data field of data block.
Particularly, can by the leaf node of corresponding for each data cell Huffman tree, data cell is sorted according to the frequency of occurrences, minimum for the frequency of occurrences 2 leaf nodes is merged and obtains a new node, the frequency of occurrences of this node be the frequency of occurrences of these two data cells and.Get rid of these two leaf nodes merged, in remaining leaf node and new node, remerge two minimum nodes of the wherein frequency of occurrences, till being merged into the root node of Huffman tree always.The left subtree of the node of definition Huffman tree is 0, and it is 1 that right subnumber 1(also can define left subtree, and right subtree is 0, only illustrates principle here), thus according to the binary data cell coding of the coordinates measurement from root node to leaf node.When stores compressed data content and coding schedule, Huffman tree can be stored in the data field of data block in the lump.
When using Huffman tree decoding compressed data content, using packed data content as bit stream, can decode by turn.Particularly, according to definition during structure Huffman tree, from the root node of Huffman tree, run into 0 and forward left child node to, run into 1 and forward right child node to, until leaf node, the binary coding now from root node to leaf node is a data cell encoding, and the length of this data cell coding is identical with the degree of depth of this leaf node in Huffman tree.Thus obtain data cell corresponding to this data cell coding according to coding schedule, then continue to decode by turn from the root node of Huffman tree, until the data of acquisition presets or decoding complete.
If original data content has set form, then can set the data of presets, when obtaining the data of this presets, decoding terminates, without the need to a complete packed data content all being decoded, obtain required content, the efficiency of data query can be improved.
In the present embodiment, adopt Huffman encoding to compress original data content, significantly can reduce the storage space needed for packed data content.
As shown in Figure 4, in one embodiment, this date storage method also comprises the step generating decoding accelerometer, comprising:
Step 402 is the data cell code construction decoding accelerometer of preset length according to data length, corresponding leaf node of encoding with the data cell in decoding accelerometer in each data cell coding mapping to Huffman tree in decoding accelerometer.
General to the decoding data through Huffman encoding, need to decode by turn, efficiency comparison is low, and therefore conventional decoding process cannot meet performance requirement in High Performance Data Query service, needs to build decoding accelerometer and realizes fast decoding.Particularly, usage data length is the data cell code construction decoding accelerometer of preset length, and first by leaf node corresponding in each data cell coding mapping to Huffman tree in decoding accelerometer.
Step 404, by the data cell coding mapping identical with the prefix of the preset length of long data cell encoding in decoding accelerometer to node corresponding to the prefix of long data cell encoding in Huffman tree.
Wherein, long data cell encoding is the data cell coding that data length is greater than preset length.Data length is greater than to the long data cell encoding of preset length, then by the data cell coding mapping node that this prefix is corresponding in Huffman tree identical with the prefix of the preset length of long data cell encoding in decoding accelerometer.
Step 406, to encode the prefix data cell coding mapping identical with short data cell encoding in decoding accelerometer corresponding leaf node to the short-and-medium data cell of Huffman tree.
Wherein, short data cell encoding is the data cell coding that data length is less than preset length.Data length is less than to the short data cell encoding of preset length, then by the data cell coding mapping leaf node that this short data cell encoding is corresponding in Huffman tree identical with short data cell encoding for prefix in decoding accelerometer.
Step 408, is stored into the data field of data block by decoding accelerometer.
When packed data content is decoded, decoding accelerometer is obtained from the data field of data block, can use reading pointer from packed data content, get the data of preset length, a node in Huffman tree is mapped to according to decoding accelerometer, if the node of this mapping is leaf node, then the data of this preset length are the data cell coding of preset length, or short data cell encoding, directly can obtain corresponding data cell according to coding schedule.And then the distance corresponding with the degree of depth of this leaf node is moved in the direction of the initial memory address of tripping contracting data content of being supported or opposed by reading pointer, namely after moving distance corresponding to length that the data cell corresponding with this leaf node encode, continue the decoding data getting preset length, until decoding obtains the partial data of presets in original data content or original data content.
If the node mapped is non-leaf nodes, after the distance corresponding with preset length is moved in the direction of the initial memory address of the tripping contracting data query content that then supported or opposed by reading pointer, use Huffman tree from the node that coded data maps, move reading pointer by turn and decode until leaf node, obtain the data cell that leaf node is corresponding, continue to read the data of preset length, until decoding to obtain in data query the partial data of presets in perhaps data query content.
In the present embodiment, generate decoding accelerometer, and be stored in the data field of data block, once can get the decoding data of preset length during decoding from packed data content, avoid as far as possible and decode by turn, improve the efficiency of decoding.
The principle of above-mentioned date storage method is described by an embody rule scene below, and this application scene is applied to computing machine to illustrate with this date storage method.Data to be stored comprise account and the IP address corresponding with account, and account is raw data mark, and IP address is original data content, as following table:
Account | IP address |
0000001 | 192.168.10.11 |
0000002 | 192.168.11.9 |
0000003 | 192.168.9.11 |
…… | …… |
Table one data to be stored
Here 32 signless integers are used to be used as account and side-play amount is described.When not compressing, each account needs 8 bytes to store, and wherein 4 bytes are used for storing account, and 4 bytes are used for storing the side-play amount of compressed IP address in the data field corresponding with account.
As shown in Figure 5, first account is divided into 256 classes by least-significant byte, every class is a corresponding data block 520 respectively, then symbiosis becomes 256 data blocks 520, the data block identifier that the least-significant byte (under scale-of-two) of the account of data block 520 correspondence is data block.Data block 520 comprises index area 522 and data field 524.Index area 522 comprises 65536(that is 2
16) individual data bucket, data bucket is designated the high 16 of account, and to open up an array length be the array of pointers of 65536, records the reference position of each data bucket.Then only need store 8 ~ 15 of account in each data bucket, to reach the compression to account.The bucket bias internal amount that in the start offset amount of each data bucket and data bucket, each compression account is corresponding is also stored in index area.Store compressed IP address, data field 524, and the bucket bias internal amount corresponding with compression account according to the start offset amount of data bucket in index area, can locate the memory location of compressed IP address in data field 524.
In this application scene, the extra storage overhead of data block is only the 65536*4=256KB shared by array of pointers, storage overhead is 256KB*256=64MB altogether, and to compress the storage space that account reduces be N*3 byte, wherein N is the number of account, and the more compression effectiveness of visible account number are more obvious.
As shown in Figure 6, in one embodiment, provide a kind of data enquire method, for inquiring about the data adopting above-mentioned date storage method to store, the method comprises:
Step 602, obtains original query Data Identification.
Original query Data Identification can be inputted by user, or obtains from third party application, for inquiring about the data query content corresponding with this original query Data Identification according to original query Data Identification from the data stored.
Step 604, according to original query Data Identification and the data block corresponding to the data block identifier determination original query Data Identification prestored.
Data block identifier can distinguish different data blocks, and data block identifier generates according to Data Identification, therefore can determine and the data block identifier that original query Data Identification mates according to original query Data Identification, thus the data block corresponding to raw data mark can be determined, this data block is the data block storing Compressed text search data content corresponding to original query Data Identification.
Step 606, according to the index in the index area of data block, obtains the address designation corresponding with original query Data Identification.
Store the index generated according to Data Identification and address designation in the index area of data block, according to this index, the address designation corresponding with original query Data Identification can be obtained.
Step 608, obtains Compressed text search data content, decompression Compressed text search data content according to address designation from the data field of data block, obtains data query content.
This address designation identifies the memory location of Compressed text search data content in the data field in data block, therefore can obtain Compressed text search data content from the data field of data block according to this address designation, after being decoded by Compressed text search data content, just can obtain the data query content corresponding with original query Data Identification.
Above-mentioned data enquire method, after obtaining original query Data Identification, the data block corresponding to original query Data Identification can be determined by this original query Data Identification and the data block identifier prestored, then determine the data block at data query place, the address designation corresponding with original query Data Identification is obtained from the index area data block, thus Compressed text search data content can be obtained from the data field data block according to this address designation, being decompressed by Compressed text search data content just can obtain data query content.Can rapidly locating position, realize quick-searching, and without the need to being decompressed by all packed datas, drastically increase the service efficiency of data.
In one embodiment, step 604 comprises: determine the data block corresponding with the data block identifier that the first low portion presetting figure place in original query Data Identification mates.
Particularly, the low portion of figure place is preset as data block identifier using first of Data Identification when storing data, first low portion presetting figure place that therefore can intercept original query Data Identification when data query compares with the data block identifier prestored, if data block identifier is mated with this low portion, illustrate that data block corresponding to this data block identifier is exactly the data block of store compressed data query content.
In the present embodiment, use the data block of the low portion determination store compressed data query content of first of original query Data Identification the default figure place, calculate simple, efficiency is high.
In one embodiment, step 606 comprises: the low portion clipping original query Data Identification, obtains Compressed text search Data Identification, the address designation that the Compressed text search Data Identification stored in the index area of acquisition data block is corresponding.
Particularly, the corresponding relation of packed data mark and the address designation clipping low portion is stored in the index area of data block, the low portion of original query Data Identification is clipped during inquiry, obtain Compressed text search Data Identification, thus obtain the address designation corresponding with Compressed text search Data Identification from the index area of data block.
In the present embodiment, storage be compression after Data Identification, can storage space be saved, and the low portion only need clipping original query Data Identification when inquiring about just can realize inquiry, do not affect search efficiency, taken into account data volume and data service efficiency.
As shown in Figure 7, in one embodiment, clip the low portion of original query Data Identification, obtain Compressed text search Data Identification, the step of the address designation that the Compressed text search Data Identification stored in the index area of acquisition data block is corresponding comprises:
Step 702, determines that the data bucket mated with second of original query Data Identification the high-order portion presetting figure place identifies.
Particularly, the index area of data block comprises multiple data bucket, is identified by the high-order portion that second of Data Identification presets figure place when storing data as data bucket.Therefore during data query, in the index area of data block, determine that the data bucket mated with second of original query Data Identification the high-order portion presetting figure place identifies, the data bucket that the data bucket mark of this coupling is corresponding is exactly the data bucket that store compressed data query identifies.Wherein, first presets total bit that is that figure place and second presets figure place and that be less than or equal to original query Data Identification, ensures the validity of inquiry.
Step 704, obtains the bucket initial memory address of data bucket index area corresponding to data bucket mark from the index area of data block.
When storing data, store the corresponding relation of data bucket mark and the bucket initial memory address of data bucket in index area in the index area of data block.Therefore, during data query, the bucket initial memory address of data bucket in this index area that the data bucket mark of coupling is corresponding can be obtained from the index area of established data block.Obtaining this barrel of initial memory address, just reading by accessing this memory address the data stored in data bucket.
Step 706, clips high-order portion and the low portion of original query Data Identification, obtains Compressed text search Data Identification.
In the present embodiment, when storing Data Identification, high-order portion and the low portion acquisition packed data of having clipped Data Identification identify and store.Therefore when data query, high-order portion and the low portion of original query Data Identification need be clipped, obtain Compressed text search Data Identification.
Step 708, according to the bucket initial memory address of data bucket in index area, searches the packed data mated with Compressed text search Data Identification and identifies in data bucket.
Obtain the bucket initial memory address of data bucket in index area, just from this barrel of initial memory address, the packed data mated with Compressed text search Data Identification can be searched in this data bucket and identify.In addition can be determined the end position of this data query bucket by the bucket initial memory address of the next data bucket adjacent with this data query bucket, thus avoid and cannot not stop inquiry when there is not Compressed text search Data Identification in data bucket.
Step 710, obtains and identifies corresponding address designation with the packed data of coupling from index area.
The corresponding relation of packed data mark and address designation is stored in the index area of data block, therefore identified by the packed data of this coupling, can obtain from index area and identify corresponding address designation with the packed data of coupling, this address designation is for determining the memory location of Compressed text search data content in the data field of data block.
In the present embodiment, when storing data, Data Identification is compressed, a large amount of storage space can be saved, use original query Data Identification can determine the data block at Compressed text search data content place fast during data query, obtain address designation from the index area of this data block again, thus obtain the Compressed text search data content stored in the data field of data block according to address designation, inquiry velocity is fast, without the need to all packed datas that decompresses, drastically increase the service efficiency of data.
In one embodiment, the address designation of packed data content is the side-play amount of Compressed text search data content memory address within a data area relative to initial memory address in data field; Step 710 comprises: from index area, obtain the bucket bias internal amount that the packed data mark of start offset amount corresponding to data bucket and coupling is corresponding; The side-play amount of data query content is calculated according to start offset amount and bucket bias internal gauge.
In the present embodiment, using packed data content memory address within a data area relative to the side-play amount of initial memory address in data field as the address designation of packed data content.Initial memory address refers in data field the memory address starting to store data, because data are Coutinuous stores, obtain the side-play amount that the packed data mark of initial memory address and coupling is corresponding, just the actual storage address of the packed data content of coupling can be obtained, namely the actual storage address of Compressed text search data content.And the side-play amount of packed data content corresponding to adjacent next packed data mark is identified by the packed data of this coupling, the storage end position of the packed data content of the packed data mark correspondence of this coupling can be obtained, thus do not need extra storage data length, the storage space taken is little.
Start offset amount be the memory address that starts within a data area to store of the packed data content corresponding to packed data mark in data bucket relative to the side-play amount of above-mentioned initial memory address, bucket bias internal amount refers to the side-play amount of side-play amount relative to start offset amount of each packed data content stored in data bucket.When calculating side-play amount, the start offset amount that Compressed text search Data Identification is corresponding and bucket bias internal amount and the real offset that is exactly this Compressed text search Data Identification, this real offset and initial memory address and the actual storage address that is exactly packed data content corresponding to this Compressed text search Data Identification.In the present embodiment, more storage space to be saved owing to storing side-play amount than store storage address itself, the space shared by memory address mark can be reduced further by bucket bias internal amount, and less on search efficiency impact.
In one embodiment, the address designation of packed data content is the side-play amount of Compressed text search data content memory address within a data area relative to initial memory address in data field; Step 710 comprises: the data length of packed data marks all obtain the packed data mark of packed data mark and the coupling of mating in start offset amount corresponding to data bucket and data bucket from index area before; The side-play amount of data query content is calculated according to the data length of start offset amount and acquisition.
In the present embodiment, the start offset amount of the packed data content corresponding to data bucket is stored in index area, and the data length of each packed data content corresponding to data bucket, the storage space ratio deviation amount shared by data length is less, can packed data volume further.Need to obtain the data length that in start offset amount and data bucket, before Compressed text search Data Identification, all packed data marks of (comprising self) are corresponding during data query.Because data are Coutinuous stores, the data length that packed data marks that (not comprise self) before Compressed text search Data Identification in data bucket all are corresponding and, be exactly the bucket bias internal amount of Compressed text search Data Identification; Bucket bias internal amount and start offset amount and be exactly the memory address of packed data content corresponding to Compressed text search Data Identification, namely start the address of stores compressed data content.And the end address of packed data content can be obtained by the data length of Compressed text search Data Identification self.
The data length of recording compressed data content is adopted to carry out alternative bucket bias internal amount in the present embodiment, higher compressibility can be realized, take storage space less, although compared with bucket bias internal amount, the more time can be spent during data query, but when storage resources is nervous, use record data length to be more excellent selection.
In one embodiment, the packed data in the data bucket of the index area of data block identifies the numerical values recited ascending order or descending storage that identify according to packed data; In data bucket, search the step that the packed data that mates with Compressed text search Data Identification identifies comprise: use binary chop in data bucket, search the packed data mated with Compressed text search Data Identification and identify.
In the present embodiment, the packed data in data bucket is identified the numerical values recited ascending order according to packed data mark or descending storage, can be convenient to by binary chop quick position packed data mark, thus the speed of data query can be increased substantially.
Particularly, binary chop is a kind of searching algorithm searching a certain element-specific in subordinate ordered array.Searching plain process is from the neutral element of array, if neutral element is just in time the element that will search, then searches plain process and terminates; If a certain element-specific is greater than or less than neutral element, is then greater than or less than in array in that half of neutral element and searches, and equally compare from neutral element with starting.If be empty in a certain step array, then representative can not find.This searching algorithm more all makes hunting zone reduce half each time, greatly can improve the efficiency of data query.
In one embodiment, step 608 comprises: the coding schedule obtaining the corresponding relation of record data cell and data cell coding from data field, according to coding schedule, the Compressed text search data content comprising data cell coding is decoded, obtain data cell, obtain data query content according to data cell.
The coding schedule of the corresponding relation of record data cell and data cell coding is stored in the data field of data block, coding schedule is obtained from data field, thus according to this coding schedule, Compressed text search data content can be decoded, obtain data cell, the data cell of acquisition forms data query content.
In one embodiment, the coding schedule of the corresponding relation of record data cell and data cell coding is obtained from data field, according to coding schedule, the Compressed text search data content comprising data cell coding is decoded, obtain data cell, the step obtaining data query content according to data cell comprises: the coding schedule obtaining the corresponding relation that Huffman tree is encoded with record data cell and data cell from data field; Compressed text search data content is divided into data cell coding according to Huffman tree, obtains according to coding schedule and data cell is encoded corresponding data cell; Data query content is obtained according to data cell.
When using the decoding compressed data query content of Huffman tree, using Compressed text search data content as bit stream, can decode by turn.Particularly, according to definition during structure Huffman tree, the left subtree of node is 0, right subnumber 1, from the root node of Huffman tree, runs into 0 and forwards left child node to, run into 1 and forward right child node to, until leaf node, the binary coding now from root node to leaf node is a data cell encoding, and the length of this data cell coding is identical with the degree of depth of this leaf node in Huffman tree.Thus obtain data cell corresponding to this data cell coding according to coding schedule, then continue to decode by turn from the root node of Huffman tree, until the data of acquisition presets or decoding complete.
In the present embodiment, adopt Huffman encoding to compress original data content, significantly can reduce the storage space needed for packed data content.
In one embodiment, the coding schedule of the corresponding relation that Huffman tree is encoded with record data cell and data cell is obtained from data field; Compressed text search data content is divided into data cell coding according to Huffman tree, obtains according to coding schedule and data cell is encoded corresponding data cell; The step obtaining data query content according to data cell comprises step 1) ~ step 4):
1) obtain Huffman tree from data field, the coding schedule of corresponding relation that decoding accelerometer is encoded with record data cell and data cell.
Wherein, decoding accelerometer comprises the data cell coding of multiple preset length, corresponding leaf node of encoding with the data cell in decoding accelerometer in each data cell coding mapping to Huffman tree in decoding accelerometer.In decoding accelerometer, the data cell coding mapping identical with the prefix of the preset length of long data cell encoding is to node corresponding to the prefix of long data cell encoding in Huffman tree; Long data cell encoding is the data cell coding that data length is greater than preset length.In decoding accelerometer, the prefix data cell coding mapping identical with short data cell encoding to be encoded corresponding leaf node to the short-and-medium data cell of Huffman tree.Short data cell encoding is the data cell coding that data length is less than preset length.Long data cell encoding is the data cell coding that data length is greater than preset length.
2) use reading pointer from the initial memory address of Compressed text search data content, read the coded data of preset length, coded data is mapped to the node in Huffman tree by use decoding accelerometer, and judges whether the node mapped is leaf node.
3) if the node that coded data maps is leaf node, then obtain data cell corresponding to leaf node according to coding schedule decoding, and after the direction of the initial memory address of the tripping contracting data query content that supported or opposed by reading pointer moves the distance corresponding with the degree of depth of leaf node, continue to read the coded data of preset length, until decoding to obtain in data query the partial data of presets in perhaps data query content.
4) if the node that coded data maps is non-leaf nodes, after the distance corresponding with preset length is moved in the direction of the initial memory address of the tripping contracting data query content that then supported or opposed by reading pointer, use Huffman tree from the node that coded data maps, move reading pointer by turn and decode until leaf node, obtain the data cell that leaf node is corresponding, continue to read the coded data of preset length, until decoding to obtain in data query the partial data of presets in perhaps data query content.
Illustrate, as shown in Figure 8, in Huffman tree 802, the left subtree of non-leaf nodes is 0, right subtree is 1, for bit stream 010010 ..., when using Huffman tree 802 to decode by turn, run into 0 left-hand rotation, run into 1 right-hand rotation, as shown in Figure 8, from root node, turn left, turn right, turn left, turn left, turn right, turn left, final arrival leaf node, can obtain the data cell of this 010010 correspondence according to coding schedule.
When using decoding accelerometer 804 to decode, each data cell coding in decoding accelerometer is 16.Use reading pointer to read the coded data of 16, this coded data is mapped to the node in Huffman tree 802 by use decoding accelerometer 804, and judges whether the node mapped is leaf node.
If the node that coded data maps is leaf node, then there are two kinds of situations, namely this coded data is the data cell coding of 16, or the prefix of this data encoding of 16 is the short data cell encoding that length is less than 16, encoding unit encodes that can be corresponding according to the leaf node mapped directly obtains coding unit, and distance corresponding to degree of depth reading pointer being moved this node (such as, if the degree of depth of this node is 16, then reading pointer can be moved 16, this is because this 16 bit data is decoded), continue the coded data of reading 16, until decoded or the data cell obtained constitutes the data of presets time stop decoding.
If the node that coded data maps is non-leaf nodes, then these 16 coded datas are the prefix part of long codes data, after reading pointer being moved 16, continue to use Huffman tree 802 to continue to decode by turn, decoding acquisition data cell during arrival leaf node, from current read pointer, now continue reading 16 coded datas decode, until decoded or the data cell that obtains constitutes the data of presets time stop decoding.
In the present embodiment, generate decoding accelerometer, and be stored in the data field of data block, once can get the decoding data of preset length during decoding from packed data content, avoid as far as possible and decode by turn, improve the efficiency of decoding.
As shown in Figure 9, in one embodiment, provide a kind of data storage device, data to be stored comprise raw data mark and identify corresponding original data content with raw data, and this device comprises data block generation module 902, original data content compression module 904 and index generation module 906.
Data block generation module 902 is for carrying out cluster according to raw data mark to data to be stored, and each class is a corresponding data block respectively, and generates data block identifier according to the raw data mark of data to be stored corresponding to data block.
Data to be stored comprise raw data mark and identify corresponding original data content with raw data." original " represents data state originally, uncompressed or other process.Raw data mark can be the key (key) of data to be stored, can be used for distinguishing different data to be stored.Data block generation module 902 is for carrying out cluster to data to be stored, the raw data mark centralized stores of public part can be had, and the public part each raw data identified extracts, no longer repeated storage, only can store non-common parts when storing raw data mark, reach the object of raw data mark being carried out to fixed length compression.And the data after fixed length compression still possess the characteristic of random access, be convenient to search.
After data to be stored are carried out cluster according to raw data mark, the corresponding data block of each class, and generate data block identifier according to the raw data mark of data to be stored corresponding to data block.Like this when data query, can just can the data block at locator data place according to data query mark, thus reach the object of fast finding.
Packed data content, for being compressed by the original data content of data to be stored corresponding for data block, is stored in the data field of data block by original data content compression module 904, and obtains the address designation of packed data content.
In order to obtain larger compressibility, original data content compression module 904 can be used for using variable-length encoding to compress original data content.Use variable-length encoding can reach very high compressibility, but do not fix due to code length, make the memory address of packed data be difficult to directly be predicted by data structure, generally need all packed datas to decompress just can inquire required data.Address designation, for identifying the memory location of packed data content, when stores compressed data content, obtains the address designation of packed data content, can be convenient to directly locate packed data content according to this address designation, not need all packed datas all to decode.
Index generation module 906 for according to the raw data mark of data to be stored corresponding to data block and address designation generating indexes, and by the index area of index stores in data block.
Index generation module 906 can be used for the address designation generating indexes according to the raw data of data to be stored corresponding to data block mark and packed data content, and by the index area of this index stores in data block, be convenient to obtain according to raw data mark the address designation identifying corresponding packed data content with raw data, and then obtain packed data content according to this address designation, thus realize fast query data.
Above-mentioned data storage device, after data to be stored are carried out cluster, each class data to be stored are a corresponding data block respectively, and generate data block identifier according to the raw data mark of data to be stored corresponding to data block, and data block comprises index area and data field, data field stores packed data content, and index area stores the index generated according to the address designation of packed data content in raw data mark and data field.During data query, the data block at data query place directly can be determined according to data query mark, the address designation of data query content is obtained according to the index in data block, thus Compressed text search data content can be obtained according to this address designation from data field, just can obtain Compressed text search data content after decompression.Both can reach good compression effectiveness, and reduce and store space shared by data, again can rapidly locating position, realize fast query data, and without the need to being decompressed by all packed datas, drastically increase the service efficiency of data.
In one embodiment, data block generation module 902 is also for first presetting the identical data to be stored of the low portion of figure place as a class in the raw data of data to be stored being identified, each class is a corresponding data block respectively, and using the data block identifier of low portion as data block.
In the present embodiment, using data to be stored identical for the low portion of the first default figure place as a class, each class is a corresponding data block respectively, and using the data block identifier of low portion as data block.During data query, the data block at data query place can be judged according to the low portion of the default figure place of data query mark, can fast query be realized.
In one embodiment, index generation module 906 also for clipping low portion from the raw data mark of data to be stored corresponding to data block, acquisition packed data identifies, and by packed data mark and the address designation corresponding stored of packed data content in the index area of data block.
Owing to presetting the identical data to be stored of the low portion of figure place using first as a class, so the raw data mark in each class has identical low portion, clip low portion and can reduce the storage space stored in index area needed for Data Identification, realize the compression to raw data mark, obtain packed data mark.Such as clip the low portion of 8, the storage space of 1 byte can be reduced.And the low portion clipped is as data block identifier, data block identifier and packed data identifier combination can be obtained raw data and identify.By packed data mark and the address designation corresponding stored of packed data content in the index area of data block, this packed data identifies and the corresponding relation of address designation is exactly the index being stored in index area.
As shown in Figure 10, in one embodiment, index generation module 906 comprises data bucket generation module 906a, packed data identifier generation module 906b and the first memory module 906c.
Data bucket generation module 906a is used for raw data mark identical for the high-order portion of the second default figure place in the raw data of data block mark as same subclass, each data bucket in the index area of the raw data mark difference corresponding data block of each subclass, the data bucket of high-order portion as data bucket is identified, and in the index area of data block, stores the corresponding relation of data bucket mark and the bucket initial memory address of data bucket in index area.
Wherein, first total bit that is that figure place and second presets figure place and that be less than or equal to raw data mark is preset.Particularly, data bucket generation module 906a is used for using raw data mark identical for the high-order portion of the second default figure place as same subclass, and the index area of data block is divided into multiple data bucket, each subclass and each data bucket one_to_one corresponding.The raw data mark of each subclass has identical high-order portion, and this high-order portion can be used to identify as the data bucket of data bucket.
The corresponding relation storing data bucket mark and the bucket initial memory address of data bucket in index area is also needed in index area, particularly, data bucket generation module 906a can be used for use array length array identical with the quantity of data bucket in data block to store the bucket initial memory address of data bucket, be designated as data bucket mark under array, thus realize the corresponding relation storing data bucket mark and the bucket initial memory address of data bucket in index area.First presets total bit that is that figure place and second presets figure place and that be no more than raw data mark, thus ensures that raw data mark can normally be compressed.
Packed data identifier generation module 906b is used for clipping low portion and high-order portion from the raw data mark of data to be stored corresponding to data block, acquisition packed data identifies, and packed data mark is stored in the data bucket identified using high-order portion as data bucket.
First presets the integral multiple that figure place can be 8 under scale-of-two, and second presets the integral multiple that figure place also can be 8 under scale-of-two.Because 8 is a byte, can ensure that the part clipped is the integral multiple of byte, and a usual character needs the storage space occupying a byte, thus effectively can reduce data and store space, place.
Low portion and high-order portion is clipped from the raw data mark of data to be stored corresponding to data block, raw data mark is compressed, acquisition packed data identifies, and packed data is identified the high-order portion that is stored in clip as in the data bucket that data bucket identifies, data block identifier, data bucket mark and packed data mark can form raw data mark, and carrying out compression to raw data mark can save storage space.
First memory module 906c is used in the index area of data block, store the address designation identifying corresponding packed data content with the packed data stored in data bucket.
Particularly, the first memory module 906c is used in stores compressed data mark and the data content address designation within a data area corresponding with packed data in the index area of data block.Thus obtain by packed data mark the packed data content stored in data field.
In the present embodiment, by opening up multiple data bucket in the index area of data block, after the high-order portion identify raw data and low portion clip rear acquisition packed data mark, packed data mark is stored in data bucket, also stored in index area and identify corresponding address designation with packed data.During data query, by the data bucket mark that stores in index area and the corresponding relation of the bucket initial memory address of data bucket in index area, the packed data mark of storage is obtained from data bucket, obtain the address designation that packed data mark is corresponding again from index area, thus obtain data query content according to this address designation from the data field of data block.Be stored in index area by after the compression of raw data mark, save storage space, and by the packed data content in rapidly locating district, index area, ensure that the efficiency of data query.
In one embodiment, the address designation of packed data content is the side-play amount of packed data content memory address within a data area relative to initial memory address in data field; First memory module 906c also for storing the start offset amount corresponding to data bucket in index area, and the side-play amount of the packed data content corresponding to packed data mark in data bucket is relative to the bucket bias internal amount of start offset amount.
In the present embodiment, using packed data content memory address within a data area relative to the side-play amount of initial memory address in data field as the address designation of packed data content.Initial memory address refers in data field the memory address starting to store data, because data are Coutinuous stores, obtain the side-play amount of the packed data content of initial memory address and packed data mark correspondence, just can obtain the actual storage address of packed data content, namely in data field, start the memory address storing this packed data content.And the side-play amount of packed data content corresponding to adjacent next packed data mark is identified by this packed data, the storage end position of packed data content corresponding to this packed data mark can be obtained, thus do not need extra storage data length, the storage space taken is little.
Start offset amount be the memory address that starts within a data area to store of the packed data content corresponding to packed data mark in data bucket relative to the side-play amount of above-mentioned initial memory address, bucket bias internal amount refers to the side-play amount of side-play amount relative to start offset amount of each packed data content stored in data bucket.The start offset amount that packed data mark is corresponding and bucket bias internal amount and the real offset that is exactly this packed data mark, this real offset and initial memory address and the actual storage address that is exactly packed data content corresponding to this packed data mark.In the present embodiment, more storage space to be saved owing to storing side-play amount than store storage address itself, the space shared by memory address mark can be reduced further by bucket bias internal amount.
In one embodiment, the address designation of packed data content is the side-play amount of packed data content memory address within a data area relative to initial memory address in data field; First memory module 906c also for storing the start offset amount of the packed data content corresponding to data bucket in index area, and the data length of each packed data content corresponding to data bucket.
In the present embodiment, the start offset amount of the packed data content corresponding to data bucket is stored in index area, and the data length of each packed data content corresponding to data bucket, the storage space ratio deviation amount shared by data length is less, can packed data volume further.Need to obtain the data length that in start offset amount and data bucket, before Compressed text search Data Identification, all packed data marks of (comprising self) are corresponding during data query.Because data are Coutinuous stores, the data length that packed data marks that (not comprise self) before Compressed text search Data Identification in data bucket all are corresponding and, be exactly the bucket bias internal amount of Compressed text search Data Identification; Bucket bias internal amount and start offset amount and be exactly the memory address of packed data content corresponding to Compressed text search Data Identification, namely start the address of stores compressed data content.And the end address of packed data content can be obtained by the data length of Compressed text search Data Identification self.
In one embodiment, packed data is identified in the data bucket of the index area of data block and stores according to the numerical values recited ascending order of packed data mark or descending.
In the present embodiment, the packed data in data bucket is identified the numerical values recited ascending order according to packed data mark or descending storage, can be convenient to by binary chop quick position packed data mark, thus the speed of data query can be increased substantially.
As shown in figure 11, in one embodiment, original data content compression module 904 comprises frequency computation part module 904a, packed data content generating module 904b and the second memory module 904c.
Frequency computation part module 904a is used for the original data content of data to be stored to be divided into data cell, obtains data cell set, calculates the frequency of occurrences of each data cell in data cell set.
The division of data cell can be determined according to actual needs, and original data content as a data cell, also can be divided into multiple data cell by original data content itself.
Packed data content generating module 904b is used for the frequency of occurrences according to data cell, for data cell distributes data cell encoding, and according to the corresponding relation of data cell and data cell coding, the original data content of data to be stored corresponding for data block is encoded according to data cell, obtain packed data content.
Calculate the frequency of occurrences of data cell, the data cell high for the frequency of occurrences distributes shorter coding, the data cell low for the frequency of occurrences distributes longer coding, thus according to the corresponding relation of data cell and data cell coding, the original data content of data to be stored corresponding for data block is carried out variable-length encoding according to data cell, obtain packed data content.
Second memory module 904c is used for data field packed data content and the coding schedule of the corresponding relation recording data cell and data cell coding being stored in data block, and obtains the address designation of packed data content.
Packed data content and coding schedule are stored in the data field of data block, wherein coding schedule have recorded the corresponding relation of data cell and data cell coding, can decode during data query according to this coding schedule to packed data content, obtains original data content.
In the present embodiment, by the original data content of data to be stored is divided into data cell, store after variable-length encoding being carried out to original data content according to the frequency of occurrences of data cell, significantly can reduce and store the required storage space of raw data mark, and each packed data content is to having address designation, by the memory address of this address designation quick position packed data content during data query, take into account the efficiency of data volume and data query.
In one embodiment, packed data content generating module 904b is also for the frequency of occurrences according to data cell, and structure Huffman tree, according to the coordinates measurement data cell coding from the root node of Huffman tree to leaf node; Second memory module 904c is also for storing Huffman tree in the data field of data block.
Particularly, packed data content generating module 904b can be used for the leaf node of corresponding for each data cell Huffman tree, data cell is sorted according to the frequency of occurrences, minimum for the frequency of occurrences 2 leaf nodes are merged and obtain a new node, the frequency of occurrences of this node be the frequency of occurrences of these two data cells and.Get rid of these two leaf nodes merged, in remaining leaf node and new node, remerge two minimum nodes of the wherein frequency of occurrences, till being merged into the root node of Huffman tree always.The left subtree of the node of definition Huffman tree is 0, and it is 1 that right subnumber 1(also can define left subtree, and right subtree is 0, only illustrates principle here), thus according to the binary data cell coding of the coordinates measurement from root node to leaf node.When stores compressed data content and coding schedule, Huffman tree can be stored in the data field of data block in the lump.
When using Huffman tree decoding compressed data content, using packed data content as bit stream, can decode by turn.Particularly, according to definition during structure Huffman tree, from the root node of Huffman tree, run into 0 and forward left child node to, run into 1 and forward right child node to, until leaf node, the binary coding now from root node to leaf node is a data cell encoding, and the length of this data cell coding is identical with the degree of depth of this leaf node in Huffman tree.Thus obtain data cell corresponding to this data cell coding according to coding schedule, then continue to decode by turn from the root node of Huffman tree, until the data of acquisition presets or decoding complete.
If original data content has set form, then can set the data of presets, when obtaining the data of this presets, decoding terminates, without the need to a complete packed data content all being decoded, obtain required content, the efficiency of data query can be improved.
In the present embodiment, adopt Huffman encoding to compress original data content, significantly can reduce the storage space needed for packed data content.
As shown in figure 12, in one embodiment, device also comprises decoding accelerometer generation module 905, comprises the first mapping block 905a, the second mapping block 905b, the 3rd mapping block 905c and the 3rd reflects storage block 905d.
The data cell code construction decoding accelerometer of the first mapping block 905a for according to data length being preset length, corresponding leaf node of encoding with the data cell in decoding accelerometer in each data cell coding mapping to Huffman tree in decoding accelerometer.
General to the decoding data through Huffman encoding, need to decode by turn, efficiency comparison is low, and therefore conventional decoding process cannot meet performance requirement in High Performance Data Query service, needs to build decoding accelerometer and realizes fast decoding.Particularly, usage data length is the data cell code construction decoding accelerometer of preset length, and first by leaf node corresponding in each data cell coding mapping to Huffman tree in decoding accelerometer.
Second mapping block 905b is used for the data cell coding mapping identical with the prefix of the preset length of long data cell encoding in decoding accelerometer to node corresponding to the prefix of long data cell encoding in Huffman tree; Long data cell encoding is the data cell coding that data length is greater than preset length.
Data length is greater than to the long data cell encoding of preset length, then by the data cell coding mapping node that this prefix is corresponding in Huffman tree identical with the prefix of the preset length of long data cell encoding in decoding accelerometer.
3rd mapping block 905c is used for the prefix data cell coding mapping identical with short data cell encoding in decoding accelerometer to encode corresponding leaf node to the short-and-medium data cell of Huffman tree; Short data cell encoding is the data cell coding that data length is less than preset length.
Data length is less than to the short data cell encoding of preset length, then by the data cell coding mapping leaf node that this short data cell encoding is corresponding in Huffman tree identical with short data cell encoding for prefix in decoding accelerometer.
3rd reflects storage block 905d for decoding accelerometer being stored into the data field of data block.
When packed data content is decoded, decoding accelerometer is obtained from the data field of data block, can use reading pointer from packed data content, get the data of preset length, a node in Huffman tree is mapped to according to decoding accelerometer, if the node of this mapping is leaf node, then the data of this preset length are the data cell coding of preset length, or short data cell encoding, directly can obtain corresponding data cell according to coding schedule.And then the distance corresponding with the degree of depth of this leaf node is moved in the direction of the initial memory address of tripping contracting data content of being supported or opposed by reading pointer, namely after moving distance corresponding to length that the data cell corresponding with this leaf node encode, continue the decoding data getting preset length, until decoding obtains the partial data of presets in original data content or original data content.
If the node mapped is non-leaf nodes, after the distance corresponding with preset length is moved in the direction of the initial memory address of the tripping contracting data query content that then supported or opposed by reading pointer, use Huffman tree from the node that coded data maps, move reading pointer by turn and decode until leaf node, obtain the data cell that leaf node is corresponding, continue to read the data of preset length, until decoding to obtain in data query the partial data of presets in perhaps data query content.
In the present embodiment, generate decoding accelerometer, and be stored in the data field of data block, once can get the decoding data of preset length during decoding from packed data content, avoid as far as possible and decode by turn, improve the efficiency of decoding.
As shown in figure 13, in one embodiment, provide a kind of data query arrangement, comprise original query Data Identification acquisition module 1302, data block determination module 1304, address designation acquisition module 1306 and data query content obtaining module 1308.
Original query Data Identification acquisition module 1302 is for obtaining original query Data Identification.
Original query Data Identification can be inputted by user, or obtains from third party application, for inquiring about the data query content corresponding with this original query Data Identification according to original query Data Identification from the data stored.
Data block determination module 1304, for the data block corresponding to original query Data Identification and the data block identifier determination original query Data Identification that prestores.
Data block identifier can distinguish different data blocks, and data block identifier generates according to Data Identification, therefore can determine and the data block identifier that original query Data Identification mates according to original query Data Identification, thus the data block corresponding to raw data mark can be determined, this data block is the data block storing Compressed text search data content corresponding to original query Data Identification.
Address designation acquisition module 1306, for according to the index in the index area of data block, obtains the address designation corresponding with original query Data Identification.
Store the index generated according to Data Identification and address designation in the index area of data block, according to this index, the address designation corresponding with original query Data Identification can be obtained.
Data query content obtaining module 1308, for obtaining Compressed text search data content, decompression module according to address designation from the data field of data block, for the Compressed text search data content that decompresses, obtains data query content.
This address designation identifies the memory location of Compressed text search data content in the data field in data block, therefore can obtain Compressed text search data content from the data field of data block according to this address designation, after being decoded by Compressed text search data content, just can obtain the data query content corresponding with original query Data Identification.
Above-mentioned data query arrangement, after obtaining original query Data Identification, the data block corresponding to original query Data Identification can be determined by this original query Data Identification and the data block identifier prestored, then determine the data block at data query place, the address designation corresponding with original query Data Identification is obtained from the index area data block, thus Compressed text search data content can be obtained from the data field data block according to this address designation, being decompressed by Compressed text search data content just can obtain data query content.Can rapidly locating position, realize quick-searching, and without the need to being decompressed by all packed datas, drastically increase the service efficiency of data.
In one embodiment, data block determination module 1304 also first presets data block corresponding to data block identifier that the low portion of figure place mates for determining with original query Data Identification.
Particularly, the low portion of figure place is preset as data block identifier using first of Data Identification when storing data, therefore when data query, data block determination module 1304 can be used for intercepting the low portion that first of original query Data Identification presets figure place, this low portion is compared with the data block identifier prestored, if data block identifier is mated with this low portion, illustrate that data block corresponding to this data block identifier is exactly the data block of store compressed data query content.
In the present embodiment, use the data block of the low portion determination store compressed data query content of first of original query Data Identification the default figure place, calculate simple, efficiency is high.
In one embodiment, address designation acquisition module 1306, also for clipping the low portion of original query Data Identification, obtains Compressed text search Data Identification, the address designation that the Compressed text search Data Identification stored in the index area of acquisition data block is corresponding.
Particularly, the corresponding relation of packed data mark and the address designation clipping low portion is stored in the index area of data block, the low portion of original query Data Identification is clipped during inquiry, obtain Compressed text search Data Identification, thus obtain the address designation corresponding with Compressed text search Data Identification from the index area of data block.
In the present embodiment, storage be compression after Data Identification, can storage space be saved, and the low portion only need clipping original query Data Identification when inquiring about just can realize inquiry, do not affect search efficiency, taken into account data volume and data service efficiency.
As shown in figure 14, in one embodiment, address designation acquisition module 1306 comprises data bucket mark determination module 1306a, bucket initial memory address determination module 1306b, Compressed text search Data Identification acquisition module 1306c, searches module 1306d and the first acquisition module 1306e.
Data bucket mark determination module 1306a is for determining that the data bucket mated with second of original query Data Identification the high-order portion presetting figure place identifies.
Particularly, the index area of data block comprises multiple data bucket, is identified by the high-order portion that second of Data Identification presets figure place when storing data as data bucket.Therefore during data query, data bucket mark determination module 1306a is used in the index area of data block, determine that the data bucket mated with second of original query Data Identification the high-order portion presetting figure place identifies, and the data bucket of the data bucket mark correspondence of this coupling is exactly the data bucket of store compressed data query mark.Wherein, first presets total bit that is that figure place and second presets figure place and that be less than or equal to original query Data Identification, ensures the validity of inquiry.
Bucket initial memory address determination module 1306b is used for obtaining the bucket initial memory address of data bucket index area corresponding to data bucket mark from the index area of data block.
When storing data, store the corresponding relation of data bucket mark and the bucket initial memory address of data bucket in index area in the index area of data block.Therefore, during data query, the bucket initial memory address of data bucket in this index area that the data bucket mark of coupling is corresponding can be obtained from the index area of established data block.Obtaining this barrel of initial memory address, just reading by accessing this memory address the data stored in data bucket.
Compressed text search Data Identification acquisition module 1306c, for clipping high-order portion and the low portion of original query Data Identification, obtains Compressed text search Data Identification.
In the present embodiment, when storing Data Identification, high-order portion and the low portion acquisition packed data of having clipped Data Identification identify and store.Therefore, when data query, Compressed text search Data Identification acquisition module 1306c, for clipping high-order portion and the low portion of original query Data Identification, obtains Compressed text search Data Identification.
Search module 1306d for according to the bucket initial memory address of data bucket in index area, in data bucket, search the packed data mated with Compressed text search Data Identification identify.
Obtain the bucket initial memory address of data bucket in index area, just from this barrel of initial memory address, the packed data mated with Compressed text search Data Identification can be searched in this data bucket and identify.In addition can be determined the end position of this data query bucket by the bucket initial memory address of the next data bucket adjacent with this data query bucket, thus avoid and cannot not stop inquiry when there is not Compressed text search Data Identification in data bucket.
First acquisition module 1306e is used for obtaining from index area identifying corresponding address designation with the packed data of coupling.
The corresponding relation of packed data mark and address designation is stored in the index area of data block, therefore identified by the packed data of this coupling, can obtain from index area and identify corresponding address designation with the packed data of coupling, this address designation is for determining the memory location of Compressed text search data content in the data field of data block.
In the present embodiment, when storing data, Data Identification is compressed, a large amount of storage space can be saved, use original query Data Identification can determine the data block at Compressed text search data content place fast during data query, obtain address designation from the index area of this data block again, thus obtain the Compressed text search data content stored in the data field of data block according to address designation, inquiry velocity is fast, without the need to all packed datas that decompresses, drastically increase the service efficiency of data.
In one embodiment, the address designation of packed data content is the side-play amount of Compressed text search data content memory address within a data area relative to initial memory address in data field.
As shown in figure 15, the first acquisition module 1306e comprises the second acquisition module 1306e1 and the first side-play amount computing module 1306e2.
Second acquisition module 1306e1 is used for the bucket bias internal amount of the packed data mark correspondence of start offset amount and the coupling obtained from index area corresponding to data bucket.
First side-play amount computing module 1306e2 is used for the side-play amount calculating data query content according to start offset amount and bucket bias internal gauge.
In the present embodiment, using packed data content memory address within a data area relative to the side-play amount of initial memory address in data field as the address designation of packed data content.Initial memory address refers in data field the memory address starting to store data, because data are Coutinuous stores, obtain the side-play amount that the packed data mark of initial memory address and coupling is corresponding, just the actual storage address of the packed data content of coupling can be obtained, namely the actual storage address of Compressed text search data content.And the side-play amount of packed data content corresponding to adjacent next packed data mark is identified by the packed data of this coupling, the storage end position of the packed data content of the packed data mark correspondence of this coupling can be obtained, thus do not need extra storage data length, the storage space taken is little.
Start offset amount be the memory address that starts within a data area to store of the packed data content corresponding to packed data mark in data bucket relative to the side-play amount of above-mentioned initial memory address, bucket bias internal amount refers to the side-play amount of side-play amount relative to start offset amount of each packed data content stored in data bucket.When calculating side-play amount, the start offset amount that Compressed text search Data Identification is corresponding and bucket bias internal amount and the real offset that is exactly this Compressed text search Data Identification, this real offset and initial memory address and the actual storage address that is exactly packed data content corresponding to this Compressed text search Data Identification.In the present embodiment, more storage space to be saved owing to storing side-play amount than store storage address itself, the space shared by memory address mark can be reduced further by bucket bias internal amount, and less on search efficiency impact.
In one embodiment, the address designation of packed data content is the side-play amount of Compressed text search data content memory address within a data area relative to initial memory address in data field.
As shown in figure 16, the first acquisition module 1306e comprises the 3rd acquisition module 1306e3 and the second side-play amount computing module 1306e4.
The data length of the packed data mark that the 3rd acquisition module 1306e3 is all before being used for the packed data mark of packed data mark and the coupling of mating in the start offset amount that obtains from index area corresponding to data bucket and data bucket;
Second side-play amount computing module 1306e4 is for calculating the side-play amount of data query content according to the data length of start offset amount and acquisition.
In the present embodiment, the start offset amount of the packed data content corresponding to data bucket is stored in index area, and the data length of each packed data content corresponding to data bucket, the storage space ratio deviation amount shared by data length is less, can packed data volume further.Need to obtain the data length that in start offset amount and data bucket, before Compressed text search Data Identification, all packed data marks of (comprising self) are corresponding during data query.Because data are Coutinuous stores, the data length that packed data marks that (not comprise self) before Compressed text search Data Identification in data bucket all are corresponding and, be exactly the bucket bias internal amount of Compressed text search Data Identification; Bucket bias internal amount and start offset amount and be exactly the memory address of packed data content corresponding to Compressed text search Data Identification, namely start the address of stores compressed data content.And the end address of packed data content can be obtained by the data length of Compressed text search Data Identification self.
The data length of recording compressed data content is adopted to carry out alternative bucket bias internal amount in the present embodiment, higher compressibility can be realized, take storage space less, although compared with bucket bias internal amount, the more time can be spent during data query, but when storage resources is nervous, use record data length to be more excellent selection.
In one embodiment, the packed data in the data bucket of the index area of data block identifies the numerical values recited ascending order or descending storage that identify according to packed data; Search module 1306d also to identify for using binary chop to search the packed data mated with Compressed text search Data Identification in data bucket.
In the present embodiment, the packed data in data bucket is identified the numerical values recited ascending order according to packed data mark or descending storage, can be convenient to by binary chop quick position packed data mark, thus the speed of data query can be increased substantially.
In one embodiment, data query content obtaining module 1308 is also for obtaining the coding schedule of the corresponding relation of record data cell and data cell coding from data field, according to coding schedule, the Compressed text search data content comprising data cell coding is decoded, obtain data cell, obtain data query content according to data cell.
The coding schedule of the corresponding relation of record data cell and data cell coding is stored in the data field of data block, coding schedule is obtained from data field, thus according to this coding schedule, Compressed text search data content can be decoded, obtain data cell, the data cell of acquisition forms data query content.
In one embodiment, data query content obtaining module 1308 is also for obtaining the coding schedule of the corresponding relation that Huffman tree is encoded with record data cell and data cell from data field; Compressed text search data content is divided into data cell coding according to Huffman tree, obtains according to coding schedule and data cell is encoded corresponding data cell; Data query content is obtained according to data cell.
When using the decoding compressed data query content of Huffman tree, using Compressed text search data content as bit stream, can decode by turn.Particularly, according to definition during structure Huffman tree, the left subtree of node is 0, right subnumber 1, from the root node of Huffman tree, runs into 0 and forwards left child node to, run into 1 and forward right child node to, until leaf node, the binary coding now from root node to leaf node is a data cell encoding, and the length of this data cell coding is identical with the degree of depth of this leaf node in Huffman tree.Thus obtain data cell corresponding to this data cell coding according to coding schedule, then continue to decode by turn from the root node of Huffman tree, until the data of acquisition presets or decoding complete.
In the present embodiment, adopt Huffman encoding to compress original data content, significantly can reduce the storage space needed for packed data content.
As shown in figure 17, in one embodiment, data query content obtaining module 1308 comprises the 4th acquisition module 1308a, node mapping module 1308b, leaf node processing module 1308c and non-leaf nodes processing module 1308d.
The coding schedule of the corresponding relation that the 4th acquisition module 1308a is used for obtaining Huffman tree from data field, decoding accelerometer is encoded with record data cell and data cell.
Wherein, decoding accelerometer comprises the data cell coding of multiple preset length, corresponding leaf node of encoding with the data cell in decoding accelerometer in each data cell coding mapping to Huffman tree in decoding accelerometer.In decoding accelerometer, the data cell coding mapping identical with the prefix of the preset length of long data cell encoding is to node corresponding to the prefix of long data cell encoding in Huffman tree; Long data cell encoding is the data cell coding that data length is greater than preset length.In decoding accelerometer, the prefix data cell coding mapping identical with short data cell encoding to be encoded corresponding leaf node to the short-and-medium data cell of Huffman tree.Short data cell encoding is the data cell coding that data length is less than preset length.Long data cell encoding is the data cell coding that data length is greater than preset length.
The coded data of node mapping module 1308b for using reading pointer to read preset length from the initial memory address of Compressed text search data content, coded data is mapped to the node in Huffman tree by use decoding accelerometer.
If the node that leaf node processing module 1308c is used for coded data mapping is leaf node, then obtain data cell corresponding to leaf node according to coding schedule decoding, and after the direction of the initial memory address of the tripping contracting data query content that supported or opposed by reading pointer moves the distance corresponding with the degree of depth of leaf node, continue to read the coded data of preset length, until decoding to obtain in data query the partial data of presets in perhaps data query content.
If the node that non-leaf nodes processing module 1308d is used for coded data mapping is non-leaf nodes, after the distance corresponding with preset length is moved in the direction of the initial memory address of the tripping contracting data query content that then supported or opposed by reading pointer, use Huffman tree from the node that coded data maps, move reading pointer by turn and decode until leaf node, obtain the data cell that leaf node is corresponding, continue to read the coded data of preset length, until decoding to obtain in data query the partial data of presets in perhaps data query content.
In the present embodiment, generate decoding accelerometer, and be stored in the data field of data block, once can get the decoding data of preset length during decoding from packed data content, avoid as far as possible and decode by turn, improve the efficiency of decoding.
Figure 18 is the module map of a computer system 1000 that can realize the embodiment of the present invention.This computer system 1000 is an example being applicable to computer environment of the present invention, can not think to propose any restriction to usable range of the present invention.Computer system 1000 can not be interpreted as the combination needing the one or more parts depending on or have in illustrated exemplary computer system 1000.
Computer system 1000 shown in Figure 18 is the examples being suitable for computer system of the present invention.Other framework with different sub-systems configuration also can use.
As shown in figure 18, computer system 1000 comprises processor 1010, storer 1020 and system bus 1022.The various system components comprising storer 1020 and processor 1010 are connected on system bus 1022.Processor 1010 is the hardware being used for being performed by arithmetic sum logical operation basic in computer system computer program instructions.Storer 1020 be one for storing the physical equipment of calculation procedure or data (such as, program state information) temporarily or permanently.System bus 1020 can be any one in the bus structure of following several types, comprises memory bus or memory controller, peripheral bus and local bus.Processor 1010 and storer 1020 can carry out data communication by system bus 1022.Wherein storer 1020 comprises ROM (read-only memory) (ROM) or flash memory (all not shown in figure), and random-access memory (ram), and RAM typically refers to the primary memory being loaded with operating system and application program.
Computer system 1000 also comprises display interface 1030(such as, Graphics Processing Unit), display device 1040(such as, liquid crystal display), audio interface 1050(such as, sound card) and audio frequency apparatus 1060(such as, loudspeaker).Display device 1040 and audio frequency apparatus 1060 are the media devices for experiencing content of multimedia.
Computer system 1000 generally comprises a memory device 1070.Memory device 1070 can be selected from multiple computer-readable medium, and computer-readable medium refers to any available medium can accessed by computer system 1000, that comprise movement and fixing two media.Such as, computer-readable medium includes but not limited to, flash memory (miniature SD card), CD-ROM, digital versatile disc (DVD) or other optical disc storage, tape cassete, tape, disk storage or other magnetic storage apparatus, or can be used for storing information needed and other medium any can accessed by computer system 1000.
Computer system 1000 also comprises input media 1080 and input interface 1090(such as, I/O controller).User can pass through input media 1080, and as the touch panel equipment in keyboard, mouse, display device 1040, input instruction and information are in computer system 1000.Input media 1080 is normally connected on system bus 1022 by input interface 1090, but also can be connected by other interface or bus structure, as USB (universal serial bus) (USB).
Computer system 1000 can be carried out logic with one or more network equipment in a network environment and is connected.The network equipment can be PC, server, router, smart phone, panel computer or other common network node.Computer system 1000 is connected with the network equipment by Local Area Network interface 1100 or mobile comm unit 1110.Local Area Network refers in limited area, such as family, school, computer laboratory or use the office building of the network media, the computer network of interconnected composition.WiFi and twisted-pair feeder wiring Ethernet are two kinds of technology of the most frequently used structure LAN (Local Area Network).WiFi is a kind of technology that can make computer system 1000 swapping data or be connected to wireless network by radiowave.Mobile comm unit 1110 can be answered by radio communication diagram while movement and call in a wide geographic area.Except call, mobile comm unit 1110 is also supported in the 2G providing mobile data service, carries out internet access in 3G or 4G cellular communication system.
It should be pointed out that other computer system comprising the subsystem more more or less than computer system 1000 also can be applicable to invention.Such as, computer system 1000 can comprise the bluetooth unit that can exchange data in short distance, for the imageing sensor of taking a picture, and for the accelerometer of acceleration measurement.
As described in detail, be applicable to the assigned operation that computer system 1000 of the present invention can perform date storage method and/or data enquire method above.The form of the software instruction that computer system 1000 is operated in computer-readable medium by processor 1010 performs these operations.These software instructions can be read into storer 1020 from memory device 1070 or by lan interfaces 1100 from another equipment.The software instruction be stored in storer 1020 makes processor 1010 perform above-mentioned date storage method and/or data enquire method.In addition, also the present invention can be realized equally by hardware circuit or hardware circuit in conjunction with software instruction.Therefore, the combination that the present invention is not limited to any specific hardware circuit and software is realized.
The above embodiment only have expressed several embodiment of the present invention, and it describes comparatively concrete and detailed, but therefore can not be interpreted as the restriction to the scope of the claims of the present invention.It should be pointed out that for the person of ordinary skill of the art, without departing from the inventive concept of the premise, can also make some distortion and improvement, these all belong to protection scope of the present invention.Therefore, the protection domain of patent of the present invention should be as the criterion with claims.
Claims (40)
1. a date storage method, data to be stored comprise raw data mark and identify corresponding original data content with described raw data, and described method comprises:
Carry out cluster according to described raw data mark to described data to be stored, each class is a corresponding data block respectively, and generates data block identifier according to the raw data mark of data to be stored corresponding to described data block;
The original data content of data to be stored corresponding for described data block is compressed, packed data content is stored in the data field of described data block, and obtain the address designation of described packed data content;
According to raw data mark and the described address designation generating indexes of data to be stored corresponding to described data block, and by the index area of described index stores in described data block.
2. method according to claim 1, it is characterized in that, described mark according to described raw data carries out cluster to described data to be stored, and each class is a corresponding data block respectively, and generate data block identifier according to the raw data mark of data to be stored corresponding to described data block, comprising:
Using data to be stored identical for the low portion of the first default figure place in the raw data of described data to be stored mark as a class, each class is a corresponding data block respectively, and using the data block identifier of described low portion as described data block.
3. method according to claim 2, is characterized in that, the raw data mark of the described to be stored data corresponding according to described data block and described address designation generating indexes, and by the index area of described index stores in described data block, comprising:
From the raw data mark of data to be stored corresponding to described data block, clip described low portion, obtain packed data mark, and by the address designation corresponding stored of described packed data mark and described packed data content in the index area of described data block.
4. method according to claim 3, it is characterized in that, described low portion is clipped the raw data mark of the described data to be stored corresponding from described data block, acquisition packed data identifies, and by the address designation corresponding stored of described packed data mark and described packed data content in the index area of described data block, comprising:
Using raw data mark identical for the high-order portion of the second default figure place in the raw data of described data block mark as same subclass, each data bucket in the index area of the corresponding described data block of raw data mark difference of each subclass, the data bucket of described high-order portion as described data bucket is identified, and in the index area of described data block, stores the corresponding relation of described data bucket mark and the bucket initial memory address of described data bucket in described index area;
From the raw data mark of data to be stored corresponding to described data block, clip described low portion and described high-order portion, obtain packed data mark, and described packed data mark is stored in the data bucket identified using described high-order portion as data bucket;
The address designation identifying corresponding packed data content with the packed data stored in described data bucket is stored in the index area of described data block.
5. method according to claim 4, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described packed data content in described data field relative to initial memory address in described data field;
Describedly to store in the index area of described data block and the packed data stored in described data bucket identifies the address designation of corresponding packed data content, comprising:
In described index area, store the start offset amount corresponding to described data bucket, and the side-play amount of the packed data content corresponding to packed data mark in described data bucket is relative to the bucket bias internal amount of described start offset amount.
6. method according to claim 4, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described packed data content in described data field relative to initial memory address in described data field;
Describedly to store in the index area of described data block and the packed data stored in described data bucket identifies the address designation of corresponding packed data content, comprising:
The start offset amount of the packed data content corresponding to described data bucket is stored in described index area, and the data length of each packed data content corresponding to described data bucket.
7. the method according to claim 4-6 any one, is characterized in that, described packed data is identified at the numerical values recited ascending order that identifies according to described packed data in the data bucket of the index area of described data block or descending stores.
8. the method according to claim 1-6 any one, it is characterized in that, the described original data content by data to be stored corresponding for described data block compresses, packed data content is stored in the data field of described data block, and obtain the address designation of described packed data content, comprising:
The original data content of described data to be stored is divided into data cell, obtains data cell set, calculate the frequency of occurrences of each described data cell in described data cell set;
According to the frequency of occurrences of described data cell, for described data cell distributes data cell encoding, and according to the corresponding relation of described data cell and described data cell coding, the original data content of data to be stored corresponding for described data block is encoded according to data cell, obtain packed data content;
Described packed data content and the coding schedule of the corresponding relation recording described data cell and described data cell coding are stored in the data field of described data block, and obtain the address designation of described packed data content.
9. method according to claim 8, is characterized in that, the described frequency of occurrences according to described data cell, for described data cell distributes data cell encoding, comprising:
According to the frequency of occurrences of described data cell, structure Huffman tree, according to the coordinates measurement data cell coding from the root node of described Huffman tree to leaf node;
Described method also comprises: store described Huffman tree in the data field of described data block.
10. method according to claim 9, is characterized in that, described method also comprises:
According to the data cell code construction decoding accelerometer that data length is preset length, each data cell coding mapping in described decoding accelerometer is encoded with the data cell in described decoding accelerometer corresponding leaf node in described Huffman tree;
By data cell coding mapping identical with the prefix of the preset length of long data cell encoding in described decoding accelerometer to node corresponding to the prefix of the cell encoding of long data described in Huffman tree;
By leaf node corresponding for short data cell encoding described in data cell coding mapping identical with short data cell encoding for prefix in described decoding accelerometer to described Huffman tree;
Described decoding accelerometer is stored into the data field of described data block.
11. 1 kinds of data enquire methods, described method comprises:
Obtain original query Data Identification;
The data block corresponding to described original query Data Identification is determined according to described original query Data Identification and the data block identifier prestored;
According to the index in the index area of described data block, obtain the address designation corresponding with described original query Data Identification;
Obtain Compressed text search data content according to described address designation from the data field of described data block, decompress described Compressed text search data content, obtains data query content.
12. methods according to claim 11, is characterized in that, the described data block determining corresponding to described original query Data Identification according to described original query Data Identification and the data block identifier that prestores, comprising:
Determine the data block corresponding with the data block identifier that the first low portion presetting figure place in described original query Data Identification mates.
13. methods according to claim 12, is characterized in that, described according to the index in the index area of described data block, obtain the address designation corresponding with described original query Data Identification, comprising:
Clip the described low portion of described original query Data Identification, obtain Compressed text search Data Identification, obtain the address designation that the described Compressed text search Data Identification that stores in the index area of described data block is corresponding.
14. methods according to claim 13, it is characterized in that, described in clip the described low portion of described original query Data Identification, obtain Compressed text search Data Identification, obtain the address designation that the described Compressed text search Data Identification that stores in the index area of described data block is corresponding, comprising:
Determine that the data bucket mated with second of described original query Data Identification the high-order portion presetting figure place identifies;
The bucket initial memory address of data bucket described index area that described data bucket mark is corresponding is obtained from the index area of described data block;
Clip the described high-order portion of described original query Data Identification and described low portion, obtain Compressed text search Data Identification;
According to the bucket initial memory address of described data bucket in described index area, in described data bucket, search the packed data mated with described Compressed text search Data Identification identify;
Obtain from described index area and identify corresponding address designation with the packed data of described coupling.
15. methods according to claim 14, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described Compressed text search data content in described data field relative to initial memory address in described data field;
Described acquisition from described index area identifies corresponding address designation with the packed data of described coupling, comprising:
The bucket bias internal amount that the packed data mark of start offset amount corresponding to described data bucket and described coupling is corresponding is obtained from described index area;
The side-play amount of described data query content is calculated according to described start offset amount and described bucket bias internal gauge.
16. methods according to claim 14, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described Compressed text search data content in described data field relative to initial memory address in described data field;
Described acquisition from described index area identifies corresponding address designation with the packed data of described coupling, comprising:
The data length of packed data marks all obtain the packed data mark of packed data mark and the described coupling of mating described in start offset amount corresponding to described data bucket and described data bucket from described index area before;
The side-play amount of described data query content is calculated according to the data length of described start offset amount and described acquisition.
17. methods according to claim 14-16 any one, is characterized in that, the numerical values recited ascending order that the packed data mark in the data bucket of the index area of described data block identifies according to described packed data or descending store;
The described packed data mated with described Compressed text search Data Identification of searching in described data bucket identifies, and comprising:
Use binary chop in described data bucket, search the packed data mated with described Compressed text search Data Identification to identify.
18. methods according to claim 11-16 any one, it is characterized in that, described according to the data field acquisition Compressed text search data content of described address designation from described data block, decompress described Compressed text search data content, obtain data query content, comprising:
The coding schedule of the corresponding relation of record data cell and data cell coding is obtained from described data field, according to described coding schedule, the described Compressed text search data content comprising data cell coding is decoded, obtain data cell, obtain data query content according to described data cell.
19. methods according to claim 18, it is characterized in that, the described coding schedule obtaining the corresponding relation of record data cell and data cell coding from described data field, according to described coding schedule, the described Compressed text search data content comprising data cell coding is decoded, obtain data cell, obtain data query content according to described data cell, comprising:
The coding schedule of the corresponding relation that Huffman tree is encoded with record data cell and data cell is obtained from described data field; Described Compressed text search data content is divided into data cell coding according to described Huffman tree, obtains according to described coding schedule and described data cell is encoded corresponding data cell; Data query content is obtained according to described data cell.
20. methods according to claim 19, is characterized in that, the described coding schedule obtaining the corresponding relation that Huffman tree is encoded with record data cell and data cell from described data field; Described Compressed text search data content is divided into data cell coding according to described Huffman tree, obtains according to described coding schedule and described data cell is encoded corresponding data cell; Obtain data query content according to described data cell, comprising:
From the coding schedule of the corresponding relation that described data field acquisition Huffman tree, decoding accelerometer are encoded with record data cell and data cell; Wherein, described decoding accelerometer comprises the data cell coding of multiple preset length, and each data cell coding mapping in described decoding accelerometer is encoded with the data cell in described decoding accelerometer corresponding leaf node in described Huffman tree; Data cell coding mapping identical with the prefix of the preset length of long data cell encoding in described decoding accelerometer is to node corresponding to the prefix of the cell encoding of long data described in Huffman tree; The leaf node that described in the data cell coding mapping that in described decoding accelerometer, prefix is identical with short data cell encoding to described Huffman tree, short data cell encoding is corresponding;
Use reading pointer from the initial memory address of described Compressed text search data content, read the coded data of described preset length, use described decoding accelerometer described coded data to be mapped to node in described Huffman tree;
If the node that described coded data maps is leaf node, then obtain data cell corresponding to described leaf node according to described coding schedule decoding, and after described reading pointer is moved the distance corresponding with the degree of depth of described leaf node to the direction of the initial memory address deviating from described Compressed text search data content, continue the coded data reading preset length, until decoding obtains the partial data of presets in perhaps described data query content in described data query;
If the node that described coded data maps is non-leaf nodes, after then described reading pointer being moved the distance corresponding with described preset length to the direction of the initial memory address deviating from described Compressed text search data content, use described Huffman tree from the node that described coded data maps, move described reading pointer by turn and decode until leaf node, obtain the data cell that described leaf node is corresponding, continue the coded data reading preset length, until decoding obtains the partial data of presets in perhaps described data query content in described data query.
21. 1 kinds of data storage devices, is characterized in that, data to be stored comprise raw data mark and identify corresponding original data content with described raw data, and described device comprises:
Data block generation module, for carrying out cluster according to described raw data mark to described data to be stored, each class is a corresponding data block respectively, and generates data block identifier according to the raw data mark of data to be stored corresponding to described data block;
Original data content compression module, for being compressed by the original data content of data to be stored corresponding for described data block, is stored in the data field of described data block by packed data content, and obtains the address designation of described packed data content;
Index generation module, for according to the raw data mark of data to be stored corresponding to described data block and described address designation generating indexes, and by the index area of described index stores in described data block.
22. devices according to claim 21, it is characterized in that, described data block generation module is also for first presetting the identical data to be stored of the low portion of figure place as a class in the raw data of described data to be stored being identified, each class is a corresponding data block respectively, and using the data block identifier of described low portion as described data block.
23. devices according to claim 22, it is characterized in that, described index generation module also for clipping described low portion from the raw data mark of data to be stored corresponding to described data block, acquisition packed data identifies, and by the address designation corresponding stored of described packed data mark and described packed data content in the index area of described data block.
24. devices according to claim 23, is characterized in that, described index generation module, comprising:
Data bucket generation module, for raw data identical for the high-order portion of the second default figure place in the raw data of described data block mark is identified as same subclass, each data bucket in the index area of the corresponding described data block of raw data mark difference of each subclass, the data bucket of described high-order portion as described data bucket is identified, and in the index area of described data block, stores the corresponding relation of described data bucket mark and the bucket initial memory address of described data bucket in described index area;
Packed data identifier generation module, for clipping described low portion and described high-order portion from the raw data mark of data to be stored corresponding to described data block, acquisition packed data identifies, and is stored in the data bucket identified using described high-order portion as data bucket by described packed data mark;
First memory module, for storing the address designation identifying corresponding packed data content with the packed data stored in described data bucket in the index area of described data block.
25. devices according to claim 24, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described packed data content in described data field relative to initial memory address in described data field;
Described first memory module comprises also for storing the start offset amount corresponding to described data bucket in described index area, and the side-play amount of the packed data content corresponding to packed data mark in described data bucket is relative to the bucket bias internal amount of described start offset amount.
26. devices according to claim 24, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described packed data content in described data field relative to initial memory address in described data field;
Described first memory module comprises also for storing the start offset amount of the packed data content corresponding to described data bucket in described index area, and the data length of each packed data content corresponding to described data bucket.
27. devices according to claim 24-26 any one, is characterized in that, described packed data is identified at the numerical values recited ascending order that identifies according to described packed data in the data bucket of the index area of described data block or descending stores.
28. devices according to claim 21-26 any one, it is characterized in that, described original data content compression module comprises:
Frequency computation part module, for the original data content of described data to be stored is divided into data cell, obtains data cell set, calculates the frequency of occurrences of each described data cell in described data cell set;
Packed data content generating module, for the frequency of occurrences according to described data cell, for described data cell distributes data cell encoding, and according to the corresponding relation of described data cell and described data cell coding, the original data content of data to be stored corresponding for described data block is encoded according to data cell, obtain packed data content;
Second memory module, for described packed data content and the coding schedule of the corresponding relation recording described data cell and described data cell coding being stored in the data field of described data block, and obtains the address designation of described packed data content.
29. devices according to claim 28, it is characterized in that, described packed data content generating module is also for the frequency of occurrences according to described data cell, and structure Huffman tree, according to the coordinates measurement data cell coding from the root node of described Huffman tree to leaf node;
Described second memory module is also for storing described Huffman tree in the data field of described data block.
30. devices according to claim 29, is characterized in that, described device also comprises decoding accelerometer generation module, comprising:
First mapping block, for being the data cell code construction decoding accelerometer of preset length according to data length, each data cell coding mapping in described decoding accelerometer is encoded with the data cell in described decoding accelerometer corresponding leaf node in described Huffman tree;
Second mapping block, for by data cell coding mapping identical with the prefix of the preset length of long data cell encoding in described decoding accelerometer to node corresponding to the prefix of the cell encoding of long data described in Huffman tree;
3rd mapping block, for by leaf node corresponding for short data cell encoding described in data cell coding mapping identical with short data cell encoding for prefix in described decoding accelerometer to described Huffman tree;
3rd memory module, for being stored into the data field of described data block by described decoding accelerometer.
31. 1 kinds of data query arrangement, is characterized in that, described device comprises:
Original query Data Identification acquisition module, for obtaining original query Data Identification;
Data block determination module, for determining the data block corresponding to described original query Data Identification according to described original query Data Identification and the data block identifier prestored;
Address designation acquisition module, for according to the index in the index area of described data block, obtains the address designation corresponding with described original query Data Identification;
Data query content obtaining module, for obtaining Compressed text search data content, decompression module according to described address designation from the data field of described data block, for the described Compressed text search data content that decompresses, obtains data query content.
32. devices according to claim 31, is characterized in that, described data block determination module also first presets data block corresponding to data block identifier that the low portion of figure place mates for determining with described original query Data Identification.
33. devices according to claim 32, it is characterized in that, described address designation acquisition module is also for clipping the described low portion of described original query Data Identification, obtain Compressed text search Data Identification, obtain the address designation that the described Compressed text search Data Identification that stores in the index area of described data block is corresponding.
34. devices according to claim 33, is characterized in that, described address designation acquisition module comprises:
Data bucket mark determination module, for determining that the data bucket mated with second of described original query Data Identification the high-order portion presetting figure place identifies;
Bucket initial memory address determination module, for obtaining the bucket initial memory address of data bucket in described index area corresponding to described data bucket mark from the index area of described data block;
Compressed text search Data Identification acquisition module, for clipping the described high-order portion of described original query Data Identification and described low portion, obtains Compressed text search Data Identification;
Search module, for according to the bucket initial memory address of described data bucket in described index area, in described data bucket, search the packed data mated with described Compressed text search Data Identification identify;
First acquisition module, identifies corresponding address designation for obtaining from described index area with the packed data of described coupling.
35. devices according to claim 34, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described Compressed text search data content in described data field relative to initial memory address in described data field;
Described first acquisition module comprises:
Second acquisition module, the bucket bias internal amount that the packed data mark for obtaining start offset amount corresponding to described data bucket and described coupling from described index area is corresponding;
First side-play amount computing module, for calculating the side-play amount of described data query content according to described start offset amount and described bucket bias internal gauge.
36. devices according to claim 34, is characterized in that, the address designation of described packed data content is the side-play amount of the memory address of described Compressed text search data content in described data field relative to initial memory address in described data field;
Described first acquisition module comprises:
3rd acquisition module, for the data length of packed data marks all before obtain the packed data mark of packed data mark and the described coupling of mating described in start offset amount corresponding to described data bucket and described data bucket from described index area;
Second side-play amount computing module, for calculating the side-play amount of described data query content according to the data length of described start offset amount and described acquisition.
37. devices according to claim 34-36 any one, is characterized in that, the numerical values recited ascending order that the packed data mark in the data bucket of the index area of described data block identifies according to described packed data or descending store;
Described module of searching also identifies for using binary chop to search the packed data mated with described Compressed text search Data Identification in described data bucket.
38. devices according to claim 31-36 any one, it is characterized in that, described data query content obtaining module is also for obtaining the coding schedule of the corresponding relation of record data cell and data cell coding from described data field, according to described coding schedule, the described Compressed text search data content comprising data cell coding is decoded, obtain data cell, obtain data query content according to described data cell.
39., according to device according to claim 38, is characterized in that, described data query content obtaining module is also for obtaining the coding schedule of the corresponding relation that Huffman tree is encoded with record data cell and data cell from described data field; Described Compressed text search data content is divided into data cell coding according to described Huffman tree, obtains according to described coding schedule and described data cell is encoded corresponding data cell; Data query content is obtained according to described data cell.
40., according to device according to claim 39, is characterized in that, described data query content obtaining module comprises:
4th acquisition module, for obtaining Huffman tree, the coding schedule of corresponding relation that decoding accelerometer is encoded with record data cell and data cell from described data field; Wherein, described decoding accelerometer comprises the data cell coding of multiple preset length, and each data cell coding mapping in described decoding accelerometer is encoded with the data cell in described decoding accelerometer corresponding leaf node in described Huffman tree; Data cell coding mapping identical with the prefix of the preset length of long data cell encoding in described decoding accelerometer is to node corresponding to the prefix of the cell encoding of long data described in Huffman tree; The leaf node that described in the data cell coding mapping that in described decoding accelerometer, prefix is identical with short data cell encoding to described Huffman tree, short data cell encoding is corresponding;
Node mapping module, for the coded data using reading pointer to read described preset length from the initial memory address of described Compressed text search data content, uses described decoding accelerometer described coded data to be mapped to node in described Huffman tree;
Leaf node processing module, if the node mapped for described coded data is leaf node, then obtain data cell corresponding to described leaf node according to described coding schedule decoding, and after described reading pointer is moved the distance corresponding with the degree of depth of described leaf node to the direction of the initial memory address deviating from described Compressed text search data content, continue the coded data reading preset length, until decoding obtains the partial data of presets in perhaps described data query content in described data query;
Non-leaf nodes processing module, if the node mapped for described coded data is non-leaf nodes, after then described reading pointer being moved the distance corresponding with described preset length to the direction of the initial memory address deviating from described Compressed text search data content, use described Huffman tree from the node that described coded data maps, move described reading pointer by turn and decode until leaf node, obtain the data cell that described leaf node is corresponding, continue the coded data reading preset length, until decoding obtains the partial data of presets in perhaps described data query content in described data query.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310577254.5A CN104657362B (en) | 2013-11-18 | 2013-11-18 | Data storage, querying method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310577254.5A CN104657362B (en) | 2013-11-18 | 2013-11-18 | Data storage, querying method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104657362A true CN104657362A (en) | 2015-05-27 |
CN104657362B CN104657362B (en) | 2018-07-10 |
Family
ID=53248510
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310577254.5A Active CN104657362B (en) | 2013-11-18 | 2013-11-18 | Data storage, querying method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104657362B (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105743509A (en) * | 2016-01-26 | 2016-07-06 | 华为技术有限公司 | Data compression device and method |
CN106202398A (en) * | 2016-07-08 | 2016-12-07 | 北京易车互联信息技术有限公司 | A kind of method and device indexing foundation |
CN106326464A (en) * | 2016-08-31 | 2017-01-11 | 成都科来软件有限公司 | Network conversation packet indexing method based on projection of retrieval information |
CN106484852A (en) * | 2016-09-30 | 2017-03-08 | 华为技术有限公司 | Data compression method, equipment and computing device |
CN106686401A (en) * | 2017-01-13 | 2017-05-17 | 山东鑫诚信电子科技有限公司 | Video data distributed storage method, video data distributed storage device, video data retrieval method and video data retrieval device |
CN107015985A (en) * | 2016-01-27 | 2017-08-04 | 阿里巴巴集团控股有限公司 | A kind of data storage and acquisition methods and device |
CN107204776A (en) * | 2016-03-18 | 2017-09-26 | 余海箭 | A kind of Web3D data compression algorithms based on floating number situation |
CN107944957A (en) * | 2017-11-22 | 2018-04-20 | 广州优视网络科技有限公司 | Application program method for pushing, device and computer equipment |
CN105608019B (en) * | 2015-12-21 | 2018-06-29 | 山东海量信息技术研究院 | A kind of method in the quick searching datas of stone ram |
CN108628898A (en) * | 2017-03-21 | 2018-10-09 | 中国移动通信集团河北有限公司 | The method, apparatus and equipment of data loading |
CN109145118A (en) * | 2018-09-06 | 2019-01-04 | 北京京东尚科信息技术有限公司 | Approaches to IM and device |
CN110147330A (en) * | 2019-05-23 | 2019-08-20 | 深圳市创维软件有限公司 | A kind of caching method of character pattern data, device, equipment and storage medium |
WO2020024772A1 (en) * | 2018-08-03 | 2020-02-06 | 华为技术有限公司 | Method and apparatus for querying data |
CN110866127A (en) * | 2018-08-27 | 2020-03-06 | 华为技术有限公司 | Method for establishing index and related device |
CN110874416A (en) * | 2018-09-04 | 2020-03-10 | 深圳云天励飞技术有限公司 | Image characteristic value storage method and device and electronic equipment |
CN111159202A (en) * | 2019-12-30 | 2020-05-15 | 深信服科技股份有限公司 | Data processing method, virtual device, equipment and storage medium |
CN111190908A (en) * | 2018-11-15 | 2020-05-22 | 华为技术有限公司 | Data management method, device and system |
TWI695264B (en) * | 2019-05-20 | 2020-06-01 | 慧榮科技股份有限公司 | A data storage device and a data processing method |
CN111274259A (en) * | 2020-02-16 | 2020-06-12 | 西安奥卡云数据科技有限公司 | Data updating method for storage nodes in distributed storage system |
CN111309261A (en) * | 2020-02-16 | 2020-06-19 | 西安奥卡云数据科技有限公司 | Physical data position mapping method on single node in distributed storage system |
CN111833496A (en) * | 2020-07-17 | 2020-10-27 | 长园共创电力安全技术股份有限公司 | Unlocking method and device based on intelligent key and storage medium |
CN112486915A (en) * | 2020-12-18 | 2021-03-12 | 上海哔哩哔哩科技有限公司 | Data storage method and device |
CN108075978B (en) * | 2016-11-17 | 2021-03-30 | 华为技术有限公司 | Message sending method, node configuration method and related equipment |
CN112584155A (en) * | 2020-12-11 | 2021-03-30 | 南京中兴力维软件有限公司 | Video data processing method and device |
CN112783418A (en) * | 2019-11-01 | 2021-05-11 | 华为技术有限公司 | Method for storing application program data and mobile terminal |
CN113535709A (en) * | 2020-04-15 | 2021-10-22 | 北京字节跳动网络技术有限公司 | Data processing method and device and electronic equipment |
CN113852556A (en) * | 2021-08-31 | 2021-12-28 | 天翼数字生活科技有限公司 | Method and system for compressing and retrieving routing information |
CN114629500A (en) * | 2022-03-10 | 2022-06-14 | Oppo广东移动通信有限公司 | Memory data compression method and device, electronic device |
CN117033254A (en) * | 2023-06-30 | 2023-11-10 | 超聚变数字技术有限公司 | Memory page access frequency determining method and computing device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1858747A (en) * | 2006-04-30 | 2006-11-08 | 北京金山软件有限公司 | Data storage/searching method and system |
CN101676899A (en) * | 2008-09-18 | 2010-03-24 | 上海宝信软件股份有限公司 | Profiling and inquiring method for massive database records |
US20100257174A1 (en) * | 2009-04-02 | 2010-10-07 | Matthew Dino Minuti | Method for data compression utilizing pattern-analysis and matching means such as neural networks |
CN102024047A (en) * | 2010-12-14 | 2011-04-20 | 青岛普加智能信息有限公司 | Data searching method and device thereof |
-
2013
- 2013-11-18 CN CN201310577254.5A patent/CN104657362B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1858747A (en) * | 2006-04-30 | 2006-11-08 | 北京金山软件有限公司 | Data storage/searching method and system |
CN101676899A (en) * | 2008-09-18 | 2010-03-24 | 上海宝信软件股份有限公司 | Profiling and inquiring method for massive database records |
US20100257174A1 (en) * | 2009-04-02 | 2010-10-07 | Matthew Dino Minuti | Method for data compression utilizing pattern-analysis and matching means such as neural networks |
CN102024047A (en) * | 2010-12-14 | 2011-04-20 | 青岛普加智能信息有限公司 | Data searching method and device thereof |
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105608019B (en) * | 2015-12-21 | 2018-06-29 | 山东海量信息技术研究院 | A kind of method in the quick searching datas of stone ram |
CN105743509B (en) * | 2016-01-26 | 2019-05-24 | 华为技术有限公司 | Data compression device and method |
WO2017128763A1 (en) * | 2016-01-26 | 2017-08-03 | 华为技术有限公司 | Data compression device and method |
CN105743509A (en) * | 2016-01-26 | 2016-07-06 | 华为技术有限公司 | Data compression device and method |
CN107015985A (en) * | 2016-01-27 | 2017-08-04 | 阿里巴巴集团控股有限公司 | A kind of data storage and acquisition methods and device |
CN107204776A (en) * | 2016-03-18 | 2017-09-26 | 余海箭 | A kind of Web3D data compression algorithms based on floating number situation |
CN106202398A (en) * | 2016-07-08 | 2016-12-07 | 北京易车互联信息技术有限公司 | A kind of method and device indexing foundation |
CN106326464A (en) * | 2016-08-31 | 2017-01-11 | 成都科来软件有限公司 | Network conversation packet indexing method based on projection of retrieval information |
CN106484852A (en) * | 2016-09-30 | 2017-03-08 | 华为技术有限公司 | Data compression method, equipment and computing device |
CN106484852B (en) * | 2016-09-30 | 2019-10-18 | 华为技术有限公司 | Data compression method, equipment and calculating equipment |
CN108075978B (en) * | 2016-11-17 | 2021-03-30 | 华为技术有限公司 | Message sending method, node configuration method and related equipment |
CN106686401A (en) * | 2017-01-13 | 2017-05-17 | 山东鑫诚信电子科技有限公司 | Video data distributed storage method, video data distributed storage device, video data retrieval method and video data retrieval device |
CN108628898B (en) * | 2017-03-21 | 2021-04-23 | 中国移动通信集团河北有限公司 | Data storage method, device and device |
CN108628898A (en) * | 2017-03-21 | 2018-10-09 | 中国移动通信集团河北有限公司 | The method, apparatus and equipment of data loading |
CN107944957A (en) * | 2017-11-22 | 2018-04-20 | 广州优视网络科技有限公司 | Application program method for pushing, device and computer equipment |
WO2020024772A1 (en) * | 2018-08-03 | 2020-02-06 | 华为技术有限公司 | Method and apparatus for querying data |
US11579986B2 (en) | 2018-08-03 | 2023-02-14 | Huawei Cloud Computing Technologies Co., Ltd. | Data query method and apparatus |
CN110866127A (en) * | 2018-08-27 | 2020-03-06 | 华为技术有限公司 | Method for establishing index and related device |
CN110874416A (en) * | 2018-09-04 | 2020-03-10 | 深圳云天励飞技术有限公司 | Image characteristic value storage method and device and electronic equipment |
CN110874416B (en) * | 2018-09-04 | 2022-06-24 | 深圳云天励飞技术有限公司 | Image characteristic value storage method and device and electronic equipment |
CN109145118A (en) * | 2018-09-06 | 2019-01-04 | 北京京东尚科信息技术有限公司 | Approaches to IM and device |
CN111190908A (en) * | 2018-11-15 | 2020-05-22 | 华为技术有限公司 | Data management method, device and system |
CN111190908B (en) * | 2018-11-15 | 2023-09-22 | 华为技术有限公司 | Data management method, device and system |
TWI695264B (en) * | 2019-05-20 | 2020-06-01 | 慧榮科技股份有限公司 | A data storage device and a data processing method |
CN110147330A (en) * | 2019-05-23 | 2019-08-20 | 深圳市创维软件有限公司 | A kind of caching method of character pattern data, device, equipment and storage medium |
CN110147330B (en) * | 2019-05-23 | 2023-09-01 | 深圳市创维软件有限公司 | Word matrix data caching method, device, equipment and storage medium |
CN112783418A (en) * | 2019-11-01 | 2021-05-11 | 华为技术有限公司 | Method for storing application program data and mobile terminal |
CN111159202A (en) * | 2019-12-30 | 2020-05-15 | 深信服科技股份有限公司 | Data processing method, virtual device, equipment and storage medium |
CN111274259A (en) * | 2020-02-16 | 2020-06-12 | 西安奥卡云数据科技有限公司 | Data updating method for storage nodes in distributed storage system |
CN111309261A (en) * | 2020-02-16 | 2020-06-19 | 西安奥卡云数据科技有限公司 | Physical data position mapping method on single node in distributed storage system |
CN113535709A (en) * | 2020-04-15 | 2021-10-22 | 北京字节跳动网络技术有限公司 | Data processing method and device and electronic equipment |
CN111833496A (en) * | 2020-07-17 | 2020-10-27 | 长园共创电力安全技术股份有限公司 | Unlocking method and device based on intelligent key and storage medium |
CN112584155A (en) * | 2020-12-11 | 2021-03-30 | 南京中兴力维软件有限公司 | Video data processing method and device |
CN112486915B (en) * | 2020-12-18 | 2023-01-20 | 上海哔哩哔哩科技有限公司 | Data storage method and device |
CN112486915A (en) * | 2020-12-18 | 2021-03-12 | 上海哔哩哔哩科技有限公司 | Data storage method and device |
CN113852556B (en) * | 2021-08-31 | 2023-04-14 | 天翼数字生活科技有限公司 | Method and system for compressing and retrieving routing information |
CN113852556A (en) * | 2021-08-31 | 2021-12-28 | 天翼数字生活科技有限公司 | Method and system for compressing and retrieving routing information |
CN114629500A (en) * | 2022-03-10 | 2022-06-14 | Oppo广东移动通信有限公司 | Memory data compression method and device, electronic device |
CN117033254A (en) * | 2023-06-30 | 2023-11-10 | 超聚变数字技术有限公司 | Memory page access frequency determining method and computing device |
Also Published As
Publication number | Publication date |
---|---|
CN104657362B (en) | 2018-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104657362A (en) | Method and device for storing and querying data | |
CN112953550B (en) | Data compression method, electronic device and storage medium | |
US8838551B2 (en) | Multi-level database compression | |
CN111629081B (en) | Internet Protocol (IP) address data processing method and device and electronic equipment | |
US8120516B2 (en) | Data compression using a stream selector with edit-in-place capability for compressed data | |
CN103326732B (en) | The method of compression data, the decompression method of data, encoder | |
CN116680648B (en) | Service fusion data generation method and system for digital twin city | |
CN101667843B (en) | Methods and devices for compressing and uncompressing data of embedded system | |
US9665590B2 (en) | Bitmap compression for fast searches and updates | |
CN104008134B (en) | Efficient storage method and system based on Hbase | |
CN113014903B (en) | Point cloud neighbor determination, point cloud prediction, point cloud encoding, point cloud decoding method and device | |
CN105144157A (en) | System and method for compressing data in database | |
KR100484137B1 (en) | Improved huffman decoding method and apparatus thereof | |
CN112307138B (en) | Method, system and medium for storing and inquiring regional information | |
CN116301656A (en) | Data storage method, system and equipment based on log structure merging tree | |
CN101551820B (en) | Generation method and apparatus for index database of points of interest attribute | |
CN115408350A (en) | Log compression method, log recovery method, log compression device, log recovery device, computer equipment and storage medium | |
CN114579561B (en) | Data processing method and device, and storage medium | |
CN111061722A (en) | Data compression method, data decompression method, device and equipment | |
CN109302449B (en) | Data writing method, data reading device and server | |
JPWO2014097359A1 (en) | Compression program, compression device, decompression program, and decompression device | |
WO2024149207A1 (en) | Data processing method and apparatus, and medium and computer device | |
CN113742332B (en) | Data storage method, device, equipment and storage medium | |
WO2009001174A1 (en) | System and method for data compression and storage allowing fast retrieval | |
CN106599112A (en) | Massive incomplete data storage and operation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |