CN113655968A - Unstructured data storage method - Google Patents

Unstructured data storage method

Info

Publication number
CN113655968A
Authority
CN
China
Prior art keywords
unstructured data
storage area
storage
data
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110971730.6A
Other languages
Chinese (zh)
Other versions
CN113655968B (en)
Inventor
郭殿勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinshuo Information Technology Co ltd
Original Assignee
Shanghai Jinshuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinshuo Information Technology Co ltd
Priority to CN202110971730.6A
Publication of CN113655968A
Application granted
Publication of CN113655968B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of electronic digital data processing, and in particular to an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on the mining characteristics; searching each storage area using the secondary label and generating a mapping; and storing the mapping relation in a second storage area. The unstructured data can thus be stored in blocks according to the category in the main label, and a mapping can then be established between the secondary label and the data in all storage areas. This improves the retrieval speed and the read/write speed of the stored data, keeps the unstructured data usable, and shortens response time.

Description

Unstructured data storage method
Technical Field
The invention relates to the field of electronic digital data processing, in particular to an unstructured data storage method.
Background
The continuous development of computer applications has led to a dramatic increase in the amount of data. Because structuring data is limited by the speed of manual processing, unstructured data is growing much faster than structured data. With large-scale data now reaching the TB and PB levels and still growing, better tools and techniques are needed to organize and manage files; an efficient data organization method helps users quickly retrieve the data they need from large-scale background data.
Existing unstructured data is generally stored in memory sequentially, so the data items have no relationship to one another, retrieval takes a long time, and the data is difficult to use.
Disclosure of Invention
The invention aims to provide an unstructured data storage method that improves the retrieval speed and the read/write speed of stored data and keeps unstructured data usable, so that response is faster.
In order to achieve the above object, the present invention provides an unstructured data storage method, comprising: acquiring unstructured data;
identifying unstructured data and generating a main label;
carrying out block storage on the unstructured data based on the main label;
generating a secondary label based on the mining characteristics;
searching each storage area using the secondary label and generating a mapping;
and storing the mapping relation in a second storage area.
The unstructured data is acquired from data channels such as websites, computer programs and mobile phone apps.
The main label may be a file type or a data source channel.
The method for storing the unstructured data in blocks based on the main label comprises the following specific steps:
generating a blocked storage area based on the main label;
identifying unstructured data based on the primary label;
storing the unstructured data into corresponding block storage areas;
when the capacity of the partitioned storage area is exhausted, a new storage area is added and a mapping is established.
The method for generating the block storage area based on the main label comprises the following specific steps:
acquiring the address of a total storage area;
acquiring the number of main labels;
and dividing the address of the total storage area based on the number of the main labels to generate a blocked storage area corresponding to the number of the main labels.
The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.
The specific steps of storing the unstructured data into the corresponding block storage areas are as follows:
setting a standard capacity value;
comparing the capacity value of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the merged capacity is larger than the standard capacity value, and then generating a storage unit;
generating an index of the storage unit.
Adding a new storage area and establishing a mapping when the capacity of the partitioned storage area is used up comprises the following specific steps:
detecting the capacity of the partitioned storage area;
searching for a new storage area when the capacity is lower than a threshold value;
establishing an address mapping when a new storage area is found.
With the unstructured data storage method of the invention, the unstructured data is stored in blocks according to the category in the main label, and a mapping is then established between the secondary label and the data in all storage areas. This improves the retrieval speed and the read/write speed of the stored data, keeps the unstructured data usable, and shortens response time.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the unstructured data storage method of the present invention;
FIG. 2 is a flow chart of storing unstructured data in blocks based on the main label;
FIG. 3 is a flow chart of generating a partitioned storage area based on the main label;
FIG. 4 is a flow chart of storing unstructured data into the corresponding partitioned storage areas;
FIG. 5 is a flow chart of adding a new storage area and establishing a mapping when the capacity of the partitioned storage area is used up;
FIG. 6 is a flow chart of searching each storage area using the secondary label and generating a mapping.
Detailed Description
Embodiments of the present invention are described in detail below, and examples of them are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are illustrative; they are intended to explain the invention and are not to be construed as limiting it.
Referring to FIG. 1 to FIG. 6, the present invention provides an unstructured data storage method, which includes:
S101, acquiring unstructured data;
Unstructured data is data that has an irregular or incomplete structure, has no predefined data model, and is inconvenient to represent as a two-dimensional logical table in a database. It includes office documents, text, pictures, XML, HTML, reports of various types, images, and audio/video information in all formats.
Data in a computerized information system is divided into structured data and unstructured data. Unstructured data comes in many formats and follows many standards, and it is technically harder to standardize and interpret than structured data.
The required unstructured data can be acquired from data channels such as websites, computer programs and mobile phone apps, and may take the form of documents, text, pictures and the like.
S102, identifying unstructured data and generating a main label;
The main label may be a file type or a data source channel. File types include documents, text, pictures and the like. Classifying by a file-type main label groups the files well, so that files of the same type can be retrieved in the same way, which improves retrieval efficiency. Classifying by data source channel allows data to be updated and deleted at any time, which makes maintenance more convenient.
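As a minimal illustrative sketch of step S102, a main label could be derived either from the file type or from the data source channel. The extension table `EXTENSION_LABELS`, the function name `main_label`, its parameters and the fallback label `other` are assumptions made for this example; the description does not specify how the label is computed.

```python
from pathlib import Path
from typing import Optional

# Hypothetical table mapping file extensions to file-type main labels.
EXTENSION_LABELS = {
    ".doc": "document", ".docx": "document", ".pdf": "document",
    ".txt": "text", ".xml": "text", ".html": "text",
    ".jpg": "picture", ".png": "picture",
    ".mp3": "audio", ".mp4": "video",
}


def main_label(filename: str,
               source_channel: Optional[str] = None,
               by: str = "file_type") -> str:
    """Return a main label based on the file type or the data source channel."""
    if by == "source_channel":
        if source_channel is None:
            raise ValueError("source_channel is required when by='source_channel'")
        return source_channel
    # Classify by file type; unknown extensions fall back to "other".
    suffix = Path(filename).suffix.lower()
    return EXTENSION_LABELS.get(suffix, "other")
```

For example, `main_label("report.PDF")` yields a file-type label, while `main_label("x.png", "website", by="source_channel")` labels the same kind of file by its channel instead.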
S103, carrying out block storage on the unstructured data based on the main label;
The specific steps are as follows:
S201, generating a partitioned storage area based on the main label;
The specific steps are as follows:
S301, acquiring the address of the total storage area;
Reading the address of the total storage area gives the size of the entire memory, which makes the division easier.
S302, acquiring the number of main labels;
The main label has a plurality of categories, and each category needs to be assigned a storage area.
S303, dividing the address of the total storage area based on the number of main labels, and generating partitioned storage areas corresponding to the number of main labels.
The size of each address range can be adjusted according to the characteristics of the category. For example, documents and text generally occupy less space than pictures and video. The categories can therefore be graded by expected capacity before the division, so that categories of the same grade receive address ranges of the same capacity and space is used more fully.
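Steps S301 to S303 can be sketched as splitting one contiguous address range among the main labels. In this hedged example the per-label `weights` stand in for the capacity grades mentioned above; the function name `partition_addresses`, the integer weights and the half-open `(start, end)` representation are assumptions, not details given in the description.

```python
from typing import Dict, Tuple


def partition_addresses(total_start: int,
                        total_size: int,
                        weights: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Split [total_start, total_start + total_size) among main labels.

    Each label receives a share proportional to its weight. The last label
    takes whatever remains so that no address is left unassigned.
    """
    total_weight = sum(weights.values())
    labels = list(weights)
    areas: Dict[str, Tuple[int, int]] = {}
    cursor = total_start
    for i, label in enumerate(labels):
        if i == len(labels) - 1:
            size = total_start + total_size - cursor  # remainder
        else:
            size = total_size * weights[label] // total_weight
        areas[label] = (cursor, cursor + size)  # half-open address range
        cursor += size
    return areas
```

Giving every label the same weight reproduces an even split by the number of main labels; larger weights for pictures and video reflect the observation that they usually need more space than documents and text.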
The partitioned storage area comprises a cache area and a storage area; the cache area stores high-frequency access data, and the storage area stores low-frequency access data.
As unstructured data is used, new data is continuously collected, so old data becomes less relevant and is accessed less often. To improve retrieval and read/write efficiency, a partitioned storage area can be divided into a cache area and a storage area. The cache area has a higher read/write speed and can respond quickly; the storage area has a lower read/write speed but is cheap, which suits large amounts of old data. To distinguish new data from old data, a time node can be set, for example 7 days or 10 days.
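The time-node rule for choosing between the cache area and the storage area can be sketched as a simple age check. The function name `choose_tier`, the tier names and the default of 7 days are assumptions for illustration; the description only gives 7 or 10 days as example values.

```python
from datetime import datetime, timedelta


def choose_tier(collected_at: datetime, now: datetime, cutoff_days: int = 7) -> str:
    """Return "cache" for data newer than the time node, otherwise "storage"."""
    age = now - collected_at
    return "cache" if age <= timedelta(days=cutoff_days) else "storage"
```

A real system would probably also consider measured access frequency; this sketch uses age alone because that is the only criterion the text names.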
S202, identifying unstructured data based on the main label;
The unstructured data is identified by the category in its main label, so that it can be matched to the corresponding storage area.
S203, storing the unstructured data into the corresponding partitioned storage area;
The specific steps are as follows:
S401, setting a standard capacity value;
To reduce the number of small files and improve the processing rate, small files can be merged into a large file for storage. A standard capacity value is therefore set: a file unit whose capacity is smaller than the standard capacity value needs to be merged, and one whose capacity is larger does not.
S402, comparing the capacity value of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the merged capacity is larger than the standard capacity value, and then generating a storage unit;
Executing these steps in sequence merges the small files into storage units one after another, which facilitates fast reading and writing.
S403, generating an index of the storage unit.
An index is built for every small file in the storage unit, so that the small files can be retrieved conveniently.
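Steps S401 to S403 can be sketched as a single pass over adjacent files. This is an illustrative sketch only: the function name `merge_small_files`, the `(name, size)` input format and the `(unit number, offset, length)` index entries are assumptions, and a merged unit whose size exactly equals the standard capacity value keeps merging here because the description does not define that case.

```python
from typing import Dict, List, Tuple


def merge_small_files(files: List[Tuple[str, int]],
                      standard_capacity: int
                      ) -> Tuple[List[List[str]], Dict[str, Tuple[int, int, int]]]:
    """Merge adjacent files into storage units and index every file.

    A unit is closed as soon as its accumulated size exceeds the standard
    capacity value. The index maps each file name to
    (unit number, offset inside the unit, length).
    """
    units: List[List[str]] = []
    index: Dict[str, Tuple[int, int, int]] = {}
    current: List[str] = []
    current_size = 0
    for name, size in files:
        index[name] = (len(units), current_size, size)
        current.append(name)
        current_size += size
        if current_size > standard_capacity:
            units.append(current)          # storage unit is complete
            current, current_size = [], 0
    if current:
        units.append(current)              # trailing unit below the standard value
    return units, index
```

With a standard capacity value of 10, the files `a` (3), `b` (4) and `c` (5) end up in one storage unit, while a file of size 20 forms a unit on its own because it already exceeds the standard value.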
S204, when the capacity of the partitioned storage area is used up, adding a new storage area and establishing a mapping.
The specific steps are as follows:
S501, detecting the capacity of the partitioned storage area;
S502, searching for a new storage area when the capacity is lower than a threshold value;
A capacity threshold is set for the partitioned storage area, and when the remaining capacity falls below this threshold, other usable new storage areas are searched for in the network.
S503, establishing an address mapping when a new storage area is found.
The address mapping links the address of the new storage area to the current partitioned storage area, so that capacity can be expanded without overwriting adjacent storage areas.
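Steps S501 to S503 can be sketched with a partitioned area that keeps a list of address extents as its address mapping. The class `PartitionedArea`, the function `ensure_capacity` and the `free_pool` list (which stands in for the search for new storage areas in the network) are hypothetical names introduced for this example.

```python
from typing import List, Tuple


class PartitionedArea:
    """A partitioned storage area made of one or more mapped address extents."""

    def __init__(self, label: str, start: int, size: int) -> None:
        self.label = label
        # Address mapping: ordered list of (start address, size) extents.
        self.extents: List[Tuple[int, int]] = [(start, size)]
        self.used = 0

    @property
    def capacity(self) -> int:
        return sum(size for _, size in self.extents)

    @property
    def free(self) -> int:
        return self.capacity - self.used


def ensure_capacity(area: PartitionedArea,
                    free_pool: List[Tuple[int, int]],
                    threshold: int) -> bool:
    """Map a new extent into the area when its free capacity is below threshold.

    Returns True if a new storage area was found and mapped.
    """
    if area.free >= threshold:      # S501: capacity is still sufficient
        return False
    if not free_pool:               # S502: no new storage area available
        return False
    area.extents.append(free_pool.pop(0))  # S503: establish the address mapping
    return True
```

Because the new extent is appended to the mapping rather than grown in place, the area expands without touching the address ranges of neighbouring areas.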
S104, generating a secondary label based on the mining characteristics;
When the data is to be used, the corresponding mining characteristics need to be set so that the matching data can be found. For more accurate retrieval, the secondary labels may be based on keywords in the mining characteristics.
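One hedged way to picture step S104 is to treat the mining characteristics as a short text description and take its keywords as secondary labels. The function name `secondary_labels`, the stopword list and the simple word-splitting rule are assumptions; the description does not say how keywords are extracted.

```python
import re
from typing import FrozenSet, List

# Hypothetical stopword list; a real system would use a fuller one.
DEFAULT_STOPWORDS = frozenset({"the", "of", "and", "in", "for", "a"})


def secondary_labels(mining_characteristics: str,
                     stopwords: FrozenSet[str] = DEFAULT_STOPWORDS) -> List[str]:
    """Return the distinct keywords of the mining characteristics, in order."""
    words = re.findall(r"\w+", mining_characteristics.lower())
    labels: List[str] = []
    for word in words:
        if word not in stopwords and word not in labels:
            labels.append(word)
    return labels
```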
S105, searching each storage area using the secondary label and generating a mapping;
The specific steps are as follows:
S601, using each secondary label to scan a plurality of partitioned storage areas simultaneously;
Because all data is stored in partitions, the separate storage areas can be scanned for data at the same time based on the secondary labels, which improves retrieval efficiency.
S602, determining whether content identical to each secondary label is recorded in each partitioned storage area, and marking the content for which the determination result is yes.
S603, mapping the data in the partitioned storage areas to the secondary label by means of the marks.
This establishes the association between each secondary label and all partitioned storage areas, which further improves retrieval speed.
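Steps S601 to S603 can be sketched with a thread pool that scans all partitioned storage areas at the same time for each secondary label. In this illustrative example each area is modelled as a dictionary from record key to text content, and a record is marked when the label occurs in its content; these modelling choices and the names `scan_area` and `build_mapping` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def scan_area(area_name: str,
              records: Dict[str, str],
              label: str) -> Tuple[str, List[str]]:
    """S602: mark the records of one area whose content contains the label."""
    marked = [key for key, content in records.items() if label in content]
    return area_name, marked


def build_mapping(areas: Dict[str, Dict[str, str]],
                  labels: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """S601 and S603: scan all areas concurrently and map labels to records."""
    mapping: Dict[str, List[Tuple[str, str]]] = {label: [] for label in labels}
    with ThreadPoolExecutor() as pool:
        for label in labels:
            # One scan task per partitioned storage area, submitted together.
            futures = [pool.submit(scan_area, name, records, label)
                       for name, records in areas.items()]
            for future in futures:
                name, keys = future.result()
                mapping[label].extend((name, key) for key in keys)
    return mapping
```

The resulting dictionary lists, for every secondary label, the `(area, record)` pairs that matched, so a later lookup by label does not need to rescan the areas.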
S106, storing the mapping relation into the second storage area.
The second storage area stores the mapping relation and is updated as required. When the mining characteristics need to be updated, the secondary label is generated again in step S104 and the subsequent steps are repeated, so that the data in the partitioned storage areas can be used flexibly as needed and processing efficiency is improved.
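Step S106 can be sketched by persisting the mapping relation separately from the data. Here the second storage area is modelled as a JSON file, which is purely an assumption for illustration, as are the function names `save_mapping`, `load_mapping` and `refresh_mapping` and the caller-supplied `rebuild` function that repeats steps S104 and S105.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Mapping = Dict[str, List[Tuple[str, str]]]


def save_mapping(mapping: Mapping, second_area_path: str) -> None:
    """Write the mapping relation to the second storage area."""
    serializable = {label: [list(hit) for hit in hits]
                    for label, hits in mapping.items()}
    Path(second_area_path).write_text(
        json.dumps(serializable, ensure_ascii=False, indent=2), encoding="utf-8")


def load_mapping(second_area_path: str) -> Mapping:
    """Read the mapping relation back from the second storage area."""
    raw = json.loads(Path(second_area_path).read_text(encoding="utf-8"))
    return {label: [(hit[0], hit[1]) for hit in hits] for label, hits in raw.items()}


def refresh_mapping(second_area_path: str, rebuild: Callable[[], Mapping]) -> Mapping:
    """Regenerate the mapping after the mining characteristics change."""
    mapping = rebuild()
    save_mapping(mapping, second_area_path)
    return mapping
```

Keeping the mapping in its own area means that updating the mining characteristics only rewrites this file; the partitioned data itself stays where it is.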
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An unstructured data storage method, characterized in that
the method comprises the following steps: acquiring unstructured data;
identifying the unstructured data and generating a main label;
carrying out block storage on the unstructured data based on the main label;
generating a secondary label based on the mining characteristics;
searching each storage area using the secondary label and generating a mapping;
and storing the mapping relation into a second storage area.
2. The unstructured data storage method of claim 1,
the unstructured data is acquired from data channels such as websites, computer programs and mobile phone apps.
3. The unstructured data storage method of claim 1,
the main label may be a file type or a data source channel.
4. An unstructured data storage method according to claim 3,
the specific steps of carrying out block storage on the unstructured data based on the main label are as follows:
generating a blocked storage area based on the main label;
identifying unstructured data based on the primary label;
storing the unstructured data into corresponding block storage areas;
when the capacity of the partitioned storage area is exhausted, a new storage area is added and a mapping is established.
5. The unstructured data storage method of claim 4,
the specific steps of generating the block storage area based on the main label are as follows:
acquiring the address of a total storage area;
acquiring the number of main labels;
and dividing the address of the total storage area based on the number of the main labels to generate a blocked storage area corresponding to the number of the main labels.
6. The unstructured data storage method of claim 4,
the partitioned storage area comprises a cache area and a storage area, the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.
7. The unstructured data storage method of claim 4,
the specific steps of storing the unstructured data into the corresponding block storage areas are as follows:
setting a standard capacity value;
comparing the capacity value of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the merged capacity is larger than the standard capacity value, and then generating a storage unit;
generating an index of the storage unit.
8. The unstructured data storage method of claim 4,
when the capacity of the partitioned storage area is used up, the specific steps of adding a new storage area and establishing mapping are as follows:
detecting the capacity of the partitioned storage area;
searching a new storage area when the capacity is lower than a threshold value;
an address mapping is established when a new storage area is found.
CN202110971730.6A 2021-08-24 2021-08-24 Unstructured data storage method Active CN113655968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110971730.6A CN113655968B (en) 2021-08-24 2021-08-24 Unstructured data storage method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110971730.6A CN113655968B (en) 2021-08-24 2021-08-24 Unstructured data storage method

Publications (2)

Publication Number Publication Date
CN113655968A (en) 2021-11-16
CN113655968B (en) 2024-06-18

Family

ID=78481689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110971730.6A Active CN113655968B (en) 2021-08-24 2021-08-24 Unstructured data storage method

Country Status (1)

Country Link
CN (1) CN113655968B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101369275A (en) * 2008-09-10 2009-02-18 浙江大学 A Product Attribute Mining Method in Unstructured Text
US20090164416A1 (en) * 2007-12-10 2009-06-25 Aumni Data Inc. Adaptive data classification for data mining

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915373B (en) * 2012-11-06 2016-08-10 无锡江南计算技术研究所 A kind of date storage method and device
CN107315798A (en) * 2017-06-19 2017-11-03 北京神州泰岳软件股份有限公司 Structuring processing method and processing device based on multi-threaded semantic label information MAP

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691769A (en) * 2022-04-08 2022-07-01 南方电网数字电网研究院有限公司 Unstructured data processing method and device for power monitoring system
WO2024021107A1 (en) * 2022-07-29 2024-02-01 西门子股份公司 Industrial data storage method and apparatus
CN116719822A (en) * 2023-08-10 2023-09-08 深圳市连用科技有限公司 Method and system for storing massive structured data
CN116719822B (en) * 2023-08-10 2023-12-22 深圳市连用科技有限公司 Method and system for storing massive structured data

Also Published As

Publication number Publication date
CN113655968B (en) 2024-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant