CN113655968A - Unstructured data storage method

Info

Publication number: CN113655968A
Application number: CN202110971730.6A
Authority: CN
Inventors: 郭殿勇
Original assignee: Shanghai Jinshuo Information Technology Co ltd
Current assignee: Shanghai Jinshuo Information Technology Co ltd
Priority date: 2021-08-24
Filing date: 2021-08-24
Publication date: 2021-11-16
Anticipated expiration: 2041-08-24
Also published as: CN113655968B

Abstract

The invention relates to the field of electronic digital data processing, and in particular to an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on mining characteristics; searching each storage area based on the secondary label and generating a mapping; and storing the mapping relation in a second storage area. The unstructured data is thus stored in blocks according to the category given by the main label, and a mapping is then established between the secondary label and the data in all the storage areas. This improves the retrieval speed and the read/write speed of the stored data, so that the unstructured data remains usable and the response speed is higher.

Description

Unstructured data storage method

Technical Field

The invention relates to the field of electronic digital data processing, in particular to an unstructured data storage method.

Background

The continuous development of computer applications has led to a dramatic increase in the amount of data. Because structuring data is limited by the speed of manual processing, unstructured data grows at a much greater rate than structured data. For large-scale data that has now reached the TB and PB levels and continues to grow, better tools or technologies are needed to organize and manage files; an efficient data organization method helps users quickly obtain the data they want from large-scale back-end data when it is needed.

Existing unstructured data is generally stored in memory sequentially, so the stored items have no relation to one another, retrieval takes a long time, and the data is difficult to use.

Disclosure of Invention

The invention aims to provide an unstructured data storage method that improves the retrieval speed and the read/write speed of stored data and keeps unstructured data usable, so that the response speed is higher.

In order to achieve the above object, the present invention provides an unstructured data storage method, comprising: acquiring unstructured data;

identifying unstructured data and generating a main label;

carrying out block storage on the unstructured data based on the main label;

generating a secondary label based on mining characteristics;

searching each storage area based on the secondary label and generating a mapping;

and storing the mapping relation in a second storage area.

Specifically, the unstructured data is acquired from data channels such as websites, computer programs and mobile phone apps.

The main label may be a file type or a data source channel.

The specific steps of storing the unstructured data in blocks based on the main label are as follows:

generating a partitioned storage area based on the main label;

identifying unstructured data based on the main label;

storing the unstructured data into the corresponding partitioned storage areas;

when the capacity of the partitioned storage area is exhausted, a new storage area is added and a mapping is established.

The specific steps of generating the partitioned storage area based on the main label are as follows:

acquiring the address of the total storage area;

acquiring the number of main labels;

and dividing the address of the total storage area based on the number of main labels to generate partitioned storage areas corresponding to the number of main labels.

The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.

The specific steps of storing the unstructured data into the corresponding block storage areas are as follows:

setting a standard capacity value;

comparing the capacity value of the current file unit with the standard capacity value; if the current file unit is smaller than the standard capacity value, merging it with the next adjacent file unit and comparing the merged unit with the standard capacity value again, until the merged unit is larger than the standard capacity value, and then generating a storage unit;

an index of the storage unit is generated.

When the capacity of the partitioned storage area is used up, adding a new storage area and establishing mapping specifically comprise the following steps:

detecting the capacity of the partitioned storage area;

searching a new storage area when the capacity is lower than a threshold value;

an address mapping is established when a new storage area is found.

The invention discloses an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on mining characteristics; searching each storage area based on the secondary label and generating a mapping; and storing the mapping relation in a second storage area. The unstructured data is thus stored in blocks according to the category given by the main label, and a mapping is then established between the secondary label and the data in all the storage areas. This improves the retrieval speed and the read/write speed of the stored data, so that the unstructured data remains usable and the response speed is higher.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the unstructured data storage method of the present invention;

FIG. 2 is a flow chart of storing unstructured data in blocks based on the main label;

FIG. 3 is a flow chart of generating partitioned storage areas based on the main label;

FIG. 4 is a flow chart of storing unstructured data into the corresponding partitioned storage areas;

FIG. 5 is a flow chart of adding a new storage area and establishing a mapping when the capacity of a partitioned storage area is exhausted;

FIG. 6 is a flow chart of searching each partitioned storage area based on the secondary labels and generating a mapping.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals throughout refer to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are exemplary, are intended to illustrate the invention, and are not to be construed as limiting it.

Referring to FIG. 1 to FIG. 6, the present invention provides an unstructured data storage method, which includes:

S101, acquiring unstructured data;

Unstructured data is data that has an irregular or incomplete data structure, has no predefined data model, and is inconvenient to represent with the two-dimensional logical tables of a database. It includes office documents, text, pictures, XML, HTML, various types of reports, images, audio/video information, and the like, in all formats.

Data in a computer information system is divided into structured data and unstructured data. Unstructured data is highly varied in format and in standards, and is technically more difficult to standardize and understand than structured data.

The required unstructured data can be obtained from data channels such as websites, computer programs and mobile phone apps, and may take the form of documents, text, pictures and the like.

S102, identifying unstructured data and generating a main label;

The main label may be a file type or a data source channel. File types include documents, text, pictures and the like. Classifying files by a file-type main label allows files of the same type to be retrieved in the same way, which improves retrieval efficiency. Classifying by data source channel allows the data from a channel to be updated or deleted at any time, which makes maintenance more convenient.
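As a minimal sketch of step S102, the function below derives a main label either from the file type or from the data source channel. The extension table, the function name `main_label` and the `by` parameter are assumptions made for illustration; the patent does not specify how the file type is recognized.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-type table; the patent does not enumerate one.
FILE_TYPE_BY_EXTENSION = {
    ".doc": "document", ".docx": "document", ".pdf": "document",
    ".txt": "text", ".html": "text", ".xml": "text",
    ".jpg": "picture", ".png": "picture",
    ".mp3": "audio", ".mp4": "video",
}


def main_label(filename: str, source_channel: str, by: str = "file_type") -> str:
    """Return the main label: the file type, or the data source channel."""
    if by == "source_channel":
        return source_channel
    extension = PurePosixPath(filename).suffix.lower()
    # Unknown extensions fall into a catch-all category (an assumption).
    return FILE_TYPE_BY_EXTENSION.get(extension, "other")
```

For example, `main_label("report.PDF", "website")` yields a file-type label, while passing `by="source_channel"` labels the same file by the channel it came from.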

S103, carrying out block storage on the unstructured data based on the main label;

The specific steps are as follows:

S201, generating a partitioned storage area based on the main label;

The specific steps are as follows:

S301, acquiring the address of the total storage area;

By reading the address of the total storage area, the size of the entire storage can be obtained, which facilitates the division.

S302, acquiring the number of the main tags;

The main label has a plurality of categories, and each category needs to be assigned a storage area.

S303, dividing the address of the total storage area based on the number of the main labels, and generating a block storage area corresponding to the number of the main labels.

The size of each address range can be adjusted according to the characteristics of the category. For example, documents and text generally occupy less space than pictures and video. The categories can therefore be graded by expected capacity before the division, so that categories of the same grade are allocated address ranges of the same capacity and space is used more fully.
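Steps S301 to S303 can be sketched as follows, assuming the total storage area is a contiguous address range and each main label carries an integer weight for its capacity grade. The function name `partition_address_space` and the weighting scheme are illustrative assumptions, not details given in the patent.

```python
def partition_address_space(total_start: int, total_size: int,
                            weights: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Split [total_start, total_start + total_size) into one region per label.

    Returns {label: (start_address, size)}. Sizes are proportional to the
    weights; the remainder left by integer division goes to the last label.
    """
    if not weights:
        raise ValueError("at least one main label is required")
    total_weight = sum(weights.values())
    labels = list(weights)
    regions = {}
    cursor = total_start
    for i, label in enumerate(labels):
        if i == len(labels) - 1:
            # Last label takes whatever is left so no address is wasted.
            size = total_start + total_size - cursor
        else:
            size = total_size * weights[label] // total_weight
        regions[label] = (cursor, size)
        cursor += size
    return regions
```

Giving documents and text a weight of 1 and pictures and video a weight of 4 allocates the two small categories equal, smaller regions and the two large categories equal, larger ones.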

The partitioned storage area comprises a cache area and a storage area, the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.

While unstructured data is in use, new data is continuously collected, so older data becomes less relevant and is accessed less often. To improve retrieval and read/write efficiency, each partitioned storage area can be split into a cache area and a storage area. The cache area has a higher read/write speed and can therefore respond quickly; the storage area has a lower read/write speed but a low storage cost, which makes it suitable for holding large amounts of old data. To distinguish new data from old data, a time node can be set, for example 7 days or 10 days.
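One way to apply the time node is sketched below, assuming each item records when it was collected. The function name `choose_tier` and the use of collection time as a stand-in for access frequency are assumptions for illustration.

```python
from datetime import datetime, timedelta


def choose_tier(collected_at: datetime, now: datetime,
                time_node_days: int = 7) -> str:
    """Return "cache" for data newer than the time node, else "storage"."""
    age = now - collected_at
    return "cache" if age < timedelta(days=time_node_days) else "storage"
```

With the default 7-day time node, an item collected four days ago goes to the cache area and one collected three weeks ago goes to the storage area.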

S202, identifying unstructured data based on the main label;

The unstructured data is identified by the category in the main label, so that each item can be matched to its corresponding storage area.

S203, storing the unstructured data into the corresponding block storage area;

The specific steps are as follows:

S401, setting a standard capacity value;

To reduce the number of small files and improve the processing rate, small files can be merged into a large file for storage. A standard capacity value is therefore set: a file unit whose capacity is smaller than the standard capacity value needs to be merged, and one whose capacity is larger does not.

S402, comparing the capacity value of the current file unit with the standard capacity value; if the current file unit is smaller than the standard capacity value, merging it with the next adjacent file unit and comparing the merged unit with the standard capacity value again, until the merged unit is larger than the standard capacity value, and then generating a storage unit;

These steps are executed in sequence, so that small files are merged one after another into storage units, which facilitates fast reading and writing.

S403, generating an index of the storage unit.

An index is built for every small file in the storage unit, so that each small file can be retrieved conveniently.
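Steps S401 to S403 can be sketched as below, assuming files arrive as `(name, bytes)` pairs in their adjacent order. The function names, the in-memory representation and the `(unit, offset, length)` index layout are illustrative assumptions; the patent does not define an index format.

```python
def pack_into_storage_units(files, standard_capacity):
    """Merge adjacent small files until a unit exceeds standard_capacity.

    files: iterable of (name, payload_bytes) in adjacent order.
    Returns (units, index), where units is a list of bytes objects and
    index maps each file name to (unit_number, offset, length).
    """
    units = []
    index = {}
    current = bytearray()
    for name, payload in files:
        # Record where this file will sit inside the unit being built.
        index[name] = (len(units), len(current), len(payload))
        current.extend(payload)
        if len(current) > standard_capacity:
            units.append(bytes(current))
            current = bytearray()
    if current:
        # Assumption: leftover files form a final unit even if it is small.
        units.append(bytes(current))
    return units, index


def read_file(units, index, name):
    """Retrieve one small file from its storage unit through the index."""
    unit_number, offset, length = index[name]
    return units[unit_number][offset:offset + length]
```

A file that already exceeds the standard capacity value becomes a storage unit on its own, while smaller files accumulate until the merged unit crosses the threshold.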

S204, when the capacity of the partitioned storage area is used up, adding a new storage area and establishing mapping.

The specific steps are as follows:

S501, detecting the capacity of the partitioned storage area;

S502, searching for a new storage area when the capacity is lower than a threshold value;

A capacity threshold is set for the partitioned storage area, and when the remaining capacity falls below the threshold, other available storage areas in the network are searched for.

S503, establishing an address mapping when a new storage area is found.

The address of the new storage area is linked to the current partitioned storage area through the address mapping, so that the capacity can be expanded without overwriting other adjacent storage areas.
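The expansion in S501 to S503 might look like the following sketch, in which a partition is a list of `(start, size)` regions and the address mapping translates a logical offset into a physical address. The `Partition` class, the `free_pool` list of available regions and the first-fit choice are assumptions made for illustration.

```python
class Partition:
    """A partitioned storage area made of one or more mapped regions."""

    def __init__(self, label, regions, threshold):
        self.label = label
        self.regions = list(regions)  # [(start_address, size), ...]
        self.used = 0
        self.threshold = threshold    # minimum free capacity to keep

    def free(self):
        return sum(size for _, size in self.regions) - self.used

    def ensure_capacity(self, free_pool):
        """S501-S503: map in a new region when free capacity is too low.

        Returns True if a region was added, False otherwise.
        """
        if self.free() >= self.threshold or not free_pool:
            return False
        self.regions.append(free_pool.pop(0))  # take the first available region
        return True

    def resolve(self, logical_offset):
        """Translate a logical offset into a physical address."""
        for start, size in self.regions:
            if logical_offset < size:
                return start + logical_offset
            logical_offset -= size
        raise IndexError("offset beyond mapped capacity")
```

Because the new region is reached only through the mapping, it does not need to be adjacent to the original one, and writes never spill into a neighbouring partition.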

S104, generating a secondary label based on the mining characteristics;

When the data needs to be used, corresponding mining characteristics are set so that the relevant data can be found. For more accurate retrieval, the secondary label may be based on keywords in the mining characteristics.
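A minimal sketch of step S104 follows, assuming the mining characteristic is given as a short text description and that keywords are simply its non-trivial words. The function name `secondary_labels` and the stopword list are hypothetical; the patent does not say how keywords are extracted.

```python
import re

# Hypothetical stopword list used only for this illustration.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "for", "to"})


def secondary_labels(mining_characteristic: str) -> list[str]:
    """Return the distinct keywords of a mining characteristic, in order."""
    labels = []
    for word in re.findall(r"\w+", mining_characteristic.lower()):
        if word not in STOPWORDS and word not in labels:
            labels.append(word)
    return labels
```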

S105, retrieving and generating mapping on the basis of the secondary label in each storage area;

The specific steps are as follows:

S601, using each secondary label to scan a plurality of partitioned storage areas simultaneously;

Because all data is stored in partitions, the separate storage areas can all be scanned for data based on the secondary labels at the same time, which improves retrieval efficiency.

S602, determining whether content matching each secondary label is recorded in each partitioned storage area, and marking the content for which the determination result is yes.

S603, mapping the data in the partitioned storage areas to the secondary labels by means of the marks.

In this way, each secondary label is associated with all the partitioned storage areas, which further improves retrieval speed.
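Steps S601 to S603 can be sketched as below, assuming each partitioned storage area is a dictionary of record identifiers to text content and that a match means the label appears as a substring. The function names and the thread pool used for simultaneous scanning are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition_name, records, labels):
    """S602: mark every record whose content contains a secondary label."""
    marks = []
    for record_id, content in records.items():
        for label in labels:
            if label in content:
                marks.append((label, partition_name, record_id))
    return marks


def build_mapping(partitions, labels):
    """S601 and S603: scan all partitions concurrently, then map by mark."""
    mapping = {label: [] for label in labels}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(scan_partition, name, records, labels)
                   for name, records in partitions.items()]
        # Collect in submission order so the mapping is deterministic.
        for future in futures:
            for label, name, record_id in future.result():
                mapping[label].append((name, record_id))
    return mapping
```

The resulting dictionary lets a later query for a secondary label jump straight to the matching records in every partition instead of rescanning them.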

S106, storing the mapping relation into the second storage area.

The second storage area stores the mapping relation, which is updated as required. When the mining characteristics need to be updated, the secondary label is generated again in step S104 and the subsequent steps are repeated, so that the data in the partitioned storage areas can be used flexibly as needed and processing efficiency is improved.
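As a sketch of step S106, the mapping relation could be persisted separately from the data, here as a JSON file standing in for the second storage area. The file format, the function names and the `{label: [(partition, record_id), ...]}` layout are assumptions for illustration only.

```python
import json
from pathlib import Path


def store_mapping(mapping, second_area: Path) -> None:
    """Write {label: [(partition, record_id), ...]} to the second storage area."""
    second_area.write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8")


def load_mapping(second_area: Path):
    """Read the mapping back, restoring the (partition, record_id) tuples."""
    raw = json.loads(second_area.read_text(encoding="utf-8"))
    return {label: [tuple(location) for location in locations]
            for label, locations in raw.items()}
```

When the mining characteristics change, the mapping would be rebuilt and `store_mapping` called again to overwrite the previous relation.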

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. An unstructured data storage method, characterized in that

the method comprises the following steps: acquiring unstructured data;

identifying unstructured data and generating a main label;

carrying out block storage on the unstructured data based on the main label;

generating a secondary label based on the mining characteristics;

retrieving and generating a mapping based on the secondary label in each storage area;

and storing the mapping relation into the second storage area.

2. The unstructured data storage method of claim 1,

the unstructured data is specifically acquired from data channels such as websites, computer programs and mobile phone apps.

3. The unstructured data storage method of claim 1,

the main label may be a file type or a data source channel.

4. The unstructured data storage method of claim 3,

the specific steps of carrying out block storage on the unstructured data based on the main label are as follows:

generating a partitioned storage area based on the main label;

identifying unstructured data based on the main label;

storing the unstructured data into the corresponding partitioned storage areas;

when the capacity of the partitioned storage area is exhausted, a new storage area is added and a mapping is established.

5. The unstructured data storage method of claim 4,

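the specific steps of generating the partitioned storage area based on the main label are as follows:

acquiring the address of the total storage area;

acquiring the number of main labels;

and dividing the address of the total storage area based on the number of main labels to generate partitioned storage areas corresponding to the number of main labels.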

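6. The unstructured data storage method of claim 4,

the partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.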

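7. The unstructured data storage method of claim 4,

the specific steps of storing the unstructured data into the corresponding partitioned storage areas are as follows:

setting a standard capacity value;

comparing the capacity value of the current file unit with the standard capacity value; if the current file unit is smaller than the standard capacity value, merging it with the next adjacent file unit and comparing the merged unit with the standard capacity value again, until the merged unit is larger than the standard capacity value, and then generating a storage unit;

an index of the storage unit is generated.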

8. The unstructured data storage method of claim 4,

when the capacity of the partitioned storage area is used up, the specific steps of adding a new storage area and establishing mapping are as follows:

detecting the capacity of the partitioned storage area;

searching a new storage area when the capacity is lower than a threshold value;

an address mapping is established when a new storage area is found.