CN113655968A - Unstructured data storage method - Google Patents
- Publication number
- CN113655968A
- Authority
- CN
- China
- Prior art keywords
- unstructured data
- storage area
- storage
- data
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the field of electronic digital data processing, and in particular to an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on the mining characteristics; searching each storage area based on the secondary label and generating a mapping; and storing the mapping relation in a second storage area. The unstructured data is thus stored in blocks according to the category given by the main label, and a mapping is then established between the secondary label and the data in all storage areas. This improves the retrieval speed and the read/write speed of the stored data, keeps the unstructured data usable, and makes the response faster.
Description
Technical Field
The invention relates to the field of electronic digital data processing, in particular to an unstructured data storage method.
Background
The continuous development of computer applications has led to a dramatic increase in the amount of data. Unstructured data is growing much faster than structured data, because structuring data is limited by the speed of manual processing. For large-scale data that keeps growing and has already reached the TB and PB level, better tools and technologies are needed to organize and manage files; an efficient data organization method helps users quickly obtain the data they want from large-scale back-end data when they need it.
Existing unstructured data is generally stored in memory sequentially, so the data items are not related to one another, retrieval takes a long time, and the data is difficult to use.
Disclosure of Invention
The invention aims to provide an unstructured data storage method that improves the retrieval speed and the read/write speed of stored data and keeps the unstructured data usable, so that the response speed is higher.
In order to achieve the above object, the present invention provides an unstructured data storage method, comprising: acquiring unstructured data;
identifying unstructured data and generating a main label;
carrying out block storage on the unstructured data based on the main label;
generating a secondary label based on the mining characteristics;
searching each storage area based on the secondary label and generating a mapping;
and storing the mapping relation in a second storage area.
The unstructured data is acquired from data channels such as websites, computer programs and mobile phone apps.
The main label may be a file type or a data source channel.
The method for storing the unstructured data in blocks based on the main label comprises the following specific steps:
generating a blocked storage area based on the main label;
identifying unstructured data based on the primary label;
storing the unstructured data into corresponding block storage areas;
when the capacity of the partitioned storage area is exhausted, a new storage area is added and a mapping is established.
The method for generating the block storage area based on the main label comprises the following specific steps:
acquiring the address of a total storage area;
acquiring the number of main labels;
and dividing the address of the total storage area based on the number of the main labels to generate a blocked storage area corresponding to the number of the main labels.
The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.
The specific steps of storing the unstructured data into the corresponding block storage areas are as follows:
setting a standard capacity value;
comparing the capacity value of the current file unit with the standard capacity value; if the capacity value is smaller than the standard capacity value, merging the current file unit with the next adjacent file unit, storing them together, and comparing the merged result with the standard capacity value again, until the capacity value is larger than the standard capacity value, whereupon a storage unit is generated;
an index of the storage unit is generated.
Adding a new storage area and establishing a mapping when the capacity of the partitioned storage area is used up specifically comprises the following steps:
detecting the capacity of the partitioned storage area;
searching a new storage area when the capacity is lower than a threshold value;
an address mapping is established when a new storage area is found.
The invention discloses an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on the mining characteristics; searching each storage area based on the secondary label and generating a mapping; and storing the mapping relation in a second storage area. The unstructured data is thus stored in blocks according to the category given by the main label, and a mapping is then established between the secondary label and the data in all storage areas. This improves the retrieval speed and the read/write speed of the stored data, keeps the unstructured data usable, and makes the response faster.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method of unstructured data storage of the present invention;
FIG. 2 is a flow chart of the present invention for storing unstructured data in blocks based on the main label;
FIG. 3 is a flow chart of the present invention for generating a partitioned storage area based on the main label;
FIG. 4 is a flow chart of the present invention for storing unstructured data into corresponding partitioned storage areas;
FIG. 5 is a flow chart of the present invention for adding a new storage area and creating a mapping when the capacity of the partitioned storage area is exhausted;
FIG. 6 is a flow chart of the present invention for searching each storage area based on the secondary label and generating a mapping.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are illustrative; they are intended to explain the invention and are not to be construed as limiting it.
Referring to fig. 1 to 6, the present invention provides an unstructured data storage method, which includes:
S101, acquiring unstructured data;
Unstructured data is data that has an irregular or incomplete data structure and no predefined data model, and that is inconvenient to represent with two-dimensional logical database tables. It includes office documents, text, pictures, XML, HTML, various types of reports, images, and audio/video information in all formats.
Data in a computer information system is divided into structured data and unstructured data. Unstructured data comes in many formats and follows many standards, and it is technically more difficult to standardize and understand than structured data.
The required unstructured data can be obtained from data channels such as websites, computer programs and mobile phone apps, and may take the form of documents, text, pictures and the like.
S102, identifying unstructured data and generating a main label;
The main label may be a file type or a data source channel. File types include the documents, text, pictures and so on mentioned above. Main labels based on file type classify the various files well, so that files of the same type can be retrieved in the same way, which improves retrieval efficiency. Classifying by data source channel allows the data to be updated and deleted at any time, which makes maintenance more convenient.
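As a minimal sketch of step S102, the following Python function derives a main label either from the file type or from the data source channel. The extension table, the `by` parameter and the fallback label `"other"` are illustrative assumptions and do not come from the patent text.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a file-type main label.
EXTENSION_LABELS = {
    ".doc": "document", ".docx": "document", ".pdf": "document",
    ".txt": "text", ".xml": "text", ".html": "text",
    ".jpg": "picture", ".png": "picture",
    ".mp3": "audio", ".mp4": "video",
}


def main_label(filename: str, source_channel: str, by: str = "file_type") -> str:
    """Return the main label for one item of unstructured data.

    by="file_type" labels by extension; by="source_channel" labels by the
    channel the item was collected from (website, program, app, ...).
    """
    if by == "source_channel":
        return source_channel
    suffix = Path(filename).suffix.lower()
    # Unknown extensions fall back to an assumed catch-all category.
    return EXTENSION_LABELS.get(suffix, "other")
```

Labelling by file type keeps items that are retrieved in the same way together, while labelling by source channel groups items that are likely to be updated or deleted together.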
S103, carrying out block storage on the unstructured data based on the main label;
The specific steps are as follows:
S201, generating a partitioned storage area based on the main label;
The specific steps are as follows:
S301, acquiring the address of the total storage area;
Reading the address of the total storage area gives the size of the entire memory, which makes the division easier.
S302, acquiring the number of main labels;
The main label has a plurality of categories, and each category needs to be allocated a storage area.
S303, dividing the address of the total storage area based on the number of main labels, and generating partitioned storage areas corresponding to the number of main labels.
The size of each address range can be adjusted according to the characteristics of the category. For example, documents and text generally occupy less space than pictures and video. The categories can therefore be graded by capacity before the division, so that categories of the same grade are given address ranges of the same capacity and the space is used more fully.
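Steps S301 to S303 can be sketched as a weighted split of an address range. This is only an illustration under stated assumptions: addresses are modelled as integers, and the per-label `weights` stand in for the capacity grades mentioned above; neither detail is specified in the patent.

```python
def divide_storage(start: int, end: int, weights: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Split the address range [start, end) into one region per main label.

    `weights` maps each main label to a relative capacity grade, so labels
    with the same grade receive regions of the same size.
    """
    total = end - start
    weight_sum = sum(weights.values())
    regions: dict[str, tuple[int, int]] = {}
    cursor = start
    labels = list(weights)
    for i, label in enumerate(labels):
        if i == len(labels) - 1:
            size = end - cursor  # the last region absorbs any rounding remainder
        else:
            size = total * weights[label] // weight_sum
        regions[label] = (cursor, cursor + size)
        cursor += size
    return regions
```

For example, `divide_storage(0, 1000, {"text": 1, "document": 1, "picture": 3, "video": 5})` gives text and document equal small regions and reserves half of the range for video.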
The partitioned storage area comprises a cache area and a storage area, the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.
As unstructured data is used, new data is continuously collected, so older data becomes less relevant and is used less often. To improve retrieval and read/write efficiency, a partitioned storage area can be split into a cache area and a storage area. The cache area has a higher read/write speed and can therefore respond quickly; the storage area has a lower read/write speed but is cheaper, which makes it suitable for storing large amounts of old data. To distinguish new data from old data, a time node can be set, for example 7 days or 10 days.
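The time-node rule can be sketched as follows. The function name `choose_tier` and the tier names are hypothetical, and the sketch assumes that data age alone decides placement, which is a simplification of "high-frequency" versus "low-frequency" access.

```python
from datetime import datetime, timedelta


def choose_tier(collected_at: datetime, now: datetime, time_node_days: int = 7) -> str:
    """Place data newer than the time node in the cache area, older data in the storage area."""
    age = now - collected_at
    return "cache" if age <= timedelta(days=time_node_days) else "storage"
```

A real system would probably also track access counts, since recently collected data is not guaranteed to be the most frequently read.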
S202, identifying unstructured data based on the main label;
The unstructured data is identified by the category in the main label, so that it can be matched to the corresponding storage area.
S203, storing the unstructured data into the corresponding block storage area;
The specific steps are as follows:
S401, setting a standard capacity value;
To reduce the number of small files and improve the processing rate, small files can be merged into a large file for storage. A standard capacity value is therefore set: a file unit whose capacity value is smaller than the standard capacity value needs to be merged, and one whose capacity value is larger does not.
S402, comparing the capacity value of the current file unit with the standard capacity value; if the capacity value is smaller than the standard capacity value, merging the current file unit with the next adjacent file unit, storing them together, and comparing the merged result with the standard capacity value again, until the capacity value is larger than the standard capacity value, whereupon a storage unit is generated;
Executing these steps in sequence merges the small files one after another into storage units, which facilitates fast reading and writing.
S403, generating an index of the storage unit.
An index is built for all the small files in the storage unit, so that each small file can be retrieved conveniently.
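Steps S401 to S403 can be sketched as a simple packing loop. The `StorageUnit` class, the in-memory `bytes` buffer and the `(offset, length)` index layout are assumptions made for illustration; the patent does not prescribe a concrete format.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    data: bytes = b""
    # Index of the small files inside this unit: name -> (offset, length).
    index: dict[str, tuple[int, int]] = field(default_factory=dict)


def pack_files(files: list[tuple[str, bytes]], standard_capacity: int) -> list[StorageUnit]:
    """Merge adjacent files until each unit is larger than the standard capacity value."""
    units: list[StorageUnit] = []
    current = StorageUnit()
    for name, payload in files:
        current.index[name] = (len(current.data), len(payload))
        current.data += payload
        if len(current.data) > standard_capacity:
            units.append(current)  # the unit is large enough, so close it
            current = StorageUnit()
    if current.index:
        units.append(current)  # trailing files that never reached the standard value
    return units


def read_file(unit: StorageUnit, name: str) -> bytes:
    """Use the unit's index to recover one small file."""
    offset, length = unit.index[name]
    return unit.data[offset:offset + length]
```

A file that is already larger than the standard capacity value ends up in a unit of its own, which matches the rule that such files do not need merging.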
S204, when the capacity of the partitioned storage area is used up, adding a new storage area and establishing mapping.
The specific steps are as follows:
S501, detecting the capacity of the partitioned storage area;
S502, searching for a new storage area when the capacity is lower than a threshold value;
A capacity threshold is set for the partitioned storage area; when the remaining capacity falls below this threshold, the network is searched for another new storage area that can be used.
S503, establishing an address mapping when a new storage area is found.
The address of the new storage area is linked to the current partitioned storage area by means of an address mapping, so that the capacity can be expanded without overwriting other adjacent storage areas.
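Steps S501 to S503 can be sketched as below. The `Region` and `PartitionedArea` classes are hypothetical, the address mapping is modelled as an ordered list of linked regions, and "searching the network" is reduced to scanning a list of candidate regions.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int
    size: int
    used: int = 0

    @property
    def free(self) -> int:
        return self.size - self.used


class PartitionedArea:
    def __init__(self, first: Region, threshold: int):
        self.regions = [first]      # address mapping: regions linked in order
        self.threshold = threshold  # minimum remaining capacity before expanding

    def needs_expansion(self) -> bool:
        """S501/S502: detect whether the remaining capacity is below the threshold."""
        return self.regions[-1].free < self.threshold

    def expand(self, candidates: list[Region]) -> bool:
        """S502/S503: link the first unused candidate region; return False if none fits."""
        if not self.needs_expansion():
            return False
        for region in candidates:
            if region.used == 0 and region not in self.regions:
                self.regions.append(region)
                return True
        return False
```

Because the new region is linked rather than assumed to be adjacent, writes that overflow the first region go to the mapped address instead of spilling into a neighbouring partition.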
S104, generating a secondary label based on the mining characteristics;
When the data needs to be used, corresponding mining characteristics are set so that the corresponding data can be found. For more accurate retrieval, the secondary labels may be based on keywords in the mining characteristics.
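One possible reading of step S104 is keyword extraction from a textual description of the mining characteristics. The sketch below assumes the characteristics arrive as a free-text string and uses a small hypothetical stop-word list; the patent does not say how keywords are chosen.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = frozenset({"the", "a", "of", "and", "in", "for"})


def secondary_labels(mining_characteristics: str) -> list[str]:
    """Return the distinct keywords of the mining characteristics, in first-seen order."""
    words = re.findall(r"\w+", mining_characteristics.lower())
    labels: list[str] = []
    for word in words:
        if word not in STOP_WORDS and word not in labels:
            labels.append(word)
    return labels
```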
S105, retrieving and generating mapping on the basis of the secondary label in each storage area;
The specific steps are as follows:
S601, scanning a plurality of partitioned storage areas simultaneously with each secondary label;
Because all data is stored in partitions, all the separate storage areas can be scanned simultaneously for data matching the secondary labels, which improves retrieval efficiency.
S602, determining whether content identical to each secondary label is recorded in each partitioned storage area, and marking the content for which the determination result is yes.
S603, mapping the data in the partitioned storage areas to the secondary labels by means of the marks.
In this way each secondary label is associated with all the partitioned storage areas, which further improves retrieval speed.
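Steps S601 to S603 can be sketched with a thread pool that scans every partitioned storage area at the same time. The data layout (each area as a dictionary of record id to text) and the case-insensitive substring match are assumptions; the patent only requires that matching content be marked and mapped to the secondary label.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_area(area_name: str, records: dict[str, str], label: str) -> list[tuple[str, str]]:
    """S602: mark the records of one area whose content contains the secondary label."""
    return [(area_name, record_id) for record_id, text in records.items()
            if label in text.lower()]


def build_mapping(areas: dict[str, dict[str, str]],
                  labels: list[str]) -> dict[str, list[tuple[str, str]]]:
    """S601/S603: scan all areas concurrently and map each label to its marked records."""
    mapping: dict[str, list[tuple[str, str]]] = {}
    with ThreadPoolExecutor() as pool:
        for label in labels:
            futures = [pool.submit(scan_area, name, records, label)
                       for name, records in areas.items()]
            hits: list[tuple[str, str]] = []
            for future in futures:  # collect in submission order for stable output
                hits.extend(future.result())
            mapping[label] = hits
    return mapping
```

Once the mapping exists, a later lookup for a secondary label reads the mapping directly instead of rescanning every area.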
S106 stores the mapping relation into the second storage area.
The second storage area stores the mapping relation and is updated as required. When the mining characteristics need to be updated, the secondary label is generated again in step S104 and the subsequent steps are performed, so that the data in the partitioned storage areas can be used flexibly as needed and processing efficiency is improved.
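Step S106 can be sketched by treating the second storage area as a JSON file that is rewritten whenever the mapping is regenerated. The file-based store and the function names are assumptions; the patent does not specify how the second storage area is implemented.

```python
import json
from pathlib import Path


def save_mapping(mapping: dict[str, list[tuple[str, str]]], path: Path) -> None:
    """Write the label-to-data mapping to the second storage area (here: a JSON file)."""
    serializable = {label: [list(hit) for hit in hits] for label, hits in mapping.items()}
    path.write_text(json.dumps(serializable, indent=2), encoding="utf-8")


def load_mapping(path: Path) -> dict[str, list[tuple[str, str]]]:
    """Read the mapping back, restoring the (area, record id) tuples."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {label: [tuple(hit) for hit in hits] for label, hits in raw.items()}
```

When the mining characteristics change, the mapping is rebuilt from the new secondary labels and `save_mapping` overwrites the stored copy.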
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. An unstructured data storage method characterized by,
the method comprises the following steps: acquiring unstructured data;
identifying unstructured data and generating a main label;
carrying out block storage on the unstructured data based on the main label;
generating a secondary label based on the mining characteristics;
retrieving and generating a mapping based on the secondary label in each storage area;
and storing the mapping relation into the second storage area.
2. The unstructured data storage method of claim 1,
the specific mode of acquiring the unstructured data is to acquire the unstructured data from data channels such as websites, computer programs and mobile phone APP.
3. The unstructured data storage method of claim 1,
the main label may be a file type or a data source channel.
4. An unstructured data storage method according to claim 3,
the specific steps of carrying out block storage on the unstructured data based on the main label are as follows:
generating a blocked storage area based on the main label;
identifying unstructured data based on the primary label;
storing the unstructured data into corresponding block storage areas;
when the capacity of the partitioned storage area is exhausted, a new storage area is added and a mapping is established.
5. The unstructured data storage method of claim 4,
the specific steps of generating the block storage area based on the main label are as follows:
acquiring the address of a total storage area;
acquiring the number of main labels;
and dividing the address of the total storage area based on the number of the main labels to generate a blocked storage area corresponding to the number of the main labels.
6. The unstructured data storage method of claim 4,
the partitioned storage area comprises a cache area and a storage area, the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.
7. The unstructured data storage method of claim 4,
the specific steps of storing the unstructured data into the corresponding block storage areas are as follows:
setting a standard capacity value;
comparing the capacity value of the current file unit with the standard capacity value; if the capacity value is smaller than the standard capacity value, merging the current file unit with the next adjacent file unit, storing them together, and comparing the merged result with the standard capacity value again, until the capacity value is larger than the standard capacity value, whereupon a storage unit is generated;
an index of the storage unit is generated.
8. The unstructured data storage method of claim 4,
when the capacity of the partitioned storage area is used up, the specific steps of adding a new storage area and establishing mapping are as follows:
detecting the capacity of the partitioned storage area;
searching a new storage area when the capacity is lower than a threshold value;
an address mapping is established when a new storage area is found.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971730.6A CN113655968B (en) | 2021-08-24 | 2021-08-24 | Unstructured data storage method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113655968A (en) | 2021-11-16 |
CN113655968B (en) | 2024-06-18 |
Family
ID=78481689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110971730.6A Active CN113655968B (en) | 2021-08-24 | 2021-08-24 | Unstructured data storage method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113655968B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114691769A (en) * | 2022-04-08 | 2022-07-01 | 南方电网数字电网研究院有限公司 | Unstructured data processing method and device for power monitoring system |
CN116719822A (en) * | 2023-08-10 | 2023-09-08 | 深圳市连用科技有限公司 | Method and system for storing massive structured data |
WO2024021107A1 (en) * | 2022-07-29 | 2024-02-01 | 西门子股份公司 | Industrial data storage method and apparatus |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101369275A (en) * | 2008-09-10 | 2009-02-18 | 浙江大学 | A Product Attribute Mining Method in Unstructured Text |
US20090164416A1 (en) * | 2007-12-10 | 2009-06-25 | Aumni Data Inc. | Adaptive data classification for data mining |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102915373B (en) * | 2012-11-06 | 2016-08-10 | 无锡江南计算技术研究所 | A kind of date storage method and device |
CN107315798A (en) * | 2017-06-19 | 2017-11-03 | 北京神州泰岳软件股份有限公司 | Structuring processing method and processing device based on multi-threaded semantic label information MAP |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114691769A (en) * | 2022-04-08 | 2022-07-01 | 南方电网数字电网研究院有限公司 | Unstructured data processing method and device for power monitoring system |
WO2024021107A1 (en) * | 2022-07-29 | 2024-02-01 | 西门子股份公司 | Industrial data storage method and apparatus |
CN116719822A (en) * | 2023-08-10 | 2023-09-08 | 深圳市连用科技有限公司 | Method and system for storing massive structured data |
CN116719822B (en) * | 2023-08-10 | 2023-12-22 | 深圳市连用科技有限公司 | Method and system for storing massive structured data |
Also Published As
Publication number | Publication date |
---|---|
CN113655968B (en) | 2024-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113655968B (en) | Unstructured data storage method | |
CN109446344B (en) | Intelligent analysis report automatic generation system based on big data | |
US9645787B1 (en) | Tag-based electronic media playlist processing | |
US11226976B2 (en) | Systems and methods for graphical exploration of forensic data | |
US20040103102A1 (en) | System and method for automatically linking items with multiple attributes to multiple levels of folders within a content management system | |
US7788267B2 (en) | Image metadata action tagging | |
CN102622332B (en) | E-book realizing method and making system | |
MXPA04006378A (en) | Method and apparatus for automatic detection of data types for data type dependent processing. | |
CN110888837B (en) | Object storage small file merging method and device | |
CN102624770B (en) | Information extraction method and extraction information network storage management system based on cloud calculation | |
US11625526B2 (en) | Systems and methods for displaying digital forensic evidence | |
CN110851663B (en) | Method and device for managing metadata | |
CN113486026A (en) | Data processing method, device, equipment and medium | |
CN116821133A (en) | Data processing method and device | |
CN102457817A (en) | A method and system for extracting news content from a mobile newspaper | |
US20080228734A1 (en) | Document management method and document management apparatus using the same | |
CN101408882B (en) | A method and system for retrieving authorized documents | |
CN114359924A (en) | Data processing method, apparatus, equipment and storage medium | |
CN112948503A (en) | Target characteristic tree structure graph rendering method and device | |
CN111881660A (en) | Report generation method and device, computer equipment and storage medium | |
CN115454947A (en) | Method, device and equipment for storing unstructured data and storage medium | |
KR20090037704A (en) | How to generate metadata for images for intuitive image retrieval | |
CN116150129B (en) | Sea-entry sewage outlet data reorganization evaluation method | |
CN117931269A (en) | Multi-version management method and system for development page of low-code platform | |
CN101808296B (en) | Automatic realization method for editing and massively transmitting multimedia message and automatic realization system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||