CN114661832B - Multi-mode heterogeneous data storage method and system based on data quality

Info

Publication number: CN114661832B
Application number: CN202210281261.XA
Authority: CN
Inventors: 李冬; 张志钧; 单晓欢; 宋宝燕; 陈廷伟; 王俊陆; 纪婉婷
Original assignee: Liaoning University
Current assignee: Liaoning University
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2024-08-13
Anticipated expiration: 2042-03-22
Also published as: CN114661832A

Abstract

The invention relates to a multimodal heterogeneous data storage method and system based on data quality, comprising the following steps: 1) storing the original text data in a distributed manner in an original database in key-value format; 2) performing data modeling on the original multimedia data and storing it in a distributed manner in a file database in the form of files; 3) converting the key-value data into relational data and constructing a relational database; 4) constructing a graph database according to the relationships among the entities in the relational database; 5) performing data modeling on the activity data of the entities in a chain structure to construct a chain database; 6) converting the multimedia data into text data and storing the text data in a multimedia database and the original database respectively according to data type; 7) linking the entity data of each sub-database by constructing a multi-level index structure; 8) constructing a log file maintenance system for the multimodal database covering the data integration methods and each sub-database. The method can greatly reduce the time required to query data and ensure the efficiency of the personnel who use the data.

Description

A multi-modal heterogeneous data storage method and system based on data quality

Technical Field

The present invention belongs to the field of computer technology, and in particular relates to a heterogeneous storage method and system for a multimodal database based on data quality.

Background Art

At present, users generate large amounts of behavior data on different network platforms. This data is no longer only text or only images; it is multimodal data that includes text, images, video, and so on from different platforms, and it comprises structured, semi-structured, and unstructured data. Structured data is data that can be represented and stored in a relational database in two-dimensional form. Its general characteristics are that data is organized by rows, one row represents the information of one entity, and every row has the same attributes. Semi-structured data is a form of structured data that does not conform to the data model of relational databases or other data tables, but contains tags that separate semantic elements and impose a hierarchy on records and fields; it is therefore also called a self-describing structure. Unstructured data is data without a fixed structure, such as documents, pictures, video, and audio; such data is generally stored directly as a whole.

In recent years, with the emergence of massive multimodal data, the cost of data storage has increased. How to build a sound and efficient multimodal database has become a problem that practitioners across the computer industry need to solve.

Summary of the Invention

To solve the above technical problems, the present invention provides a multimodal heterogeneous data storage method and system based on data quality.

To achieve the above purpose, the present invention adopts the following technical solutions:

A multimodal heterogeneous data storage method and system based on data quality, characterized by comprising the following steps:

1) For original data from Internet data sources (including original text data and original multimedia data), the original text data is stored in a distributed manner in the original database in key-value format;

2) Data modeling is performed on the original multimedia data from the Internet, which is stored in a distributed manner in the file database in the form of files;

3) The original text data is converted into relational data through data integration methods such as event extraction, entity linking, and incomplete data filling; the relational data is then modeled to construct a relational database;

4) The entities in the relational database that are associated with each other, together with the relationships between them, are modeled to construct a graph database;

5) The activity data of each entity in the relational database has typical time-series characteristics; the activity data is modeled in a chain structure to construct a chain database;

6) The video data and audio data in the multimedia data are converted into text data through a data conversion method, stored in the multimedia database in the form of files, and stored in the original database in key-value format;

7) According to data quality, database optimization is performed on the different distributed databases, and the entity data of each sub-database is linked by constructing a multi-level index structure to ensure data consistency;

8) A log file maintenance system for the multimodal database is constructed for the data integration methods and each sub-database.

In another aspect, the present invention provides a multimodal heterogeneous data storage system based on data quality, comprising: an original database, used to store original data from the Internet in key-value format; a relational database, used to convert the key-value data in the original database into relational data, which is then modeled and stored; a graph database, used to represent as a graph and store the associated entities in the relational database and the relationships between them; a multimedia database, used to store video data and audio data converted into text format; and a chain database, used to store the chain structure of the activity data of each entity in the relational database.

A computer-readable storage medium cluster, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the five distributed sub-databases described in the multimodal heterogeneous data storage method based on data quality.

Further, the specific method for storing the original data in step 1) is as follows:

1.1) The MongoDB database system is used as the database system for key-value data storage. The relevant data crawled from the Internet is saved as JSON files and stored in the MongoDB database. MongoDB automatically generates a unique key for each record as its unique identifier, and each specific record can be located through its key;

1.2) A MongoDB Replica Set is used as the distributed storage solution for the original database. Following the MongoDB Replica Set layout, the aforementioned computer-readable storage medium cluster is configured with 1 primary node, 1 replica node, and 1 arbiter node. The primary node receives all requests; the replica node maintains the same data set as the primary node and can take part in primary election; the arbiter node votes in the primary election.
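The key-value storage of step 1.1) can be illustrated with a minimal in-memory sketch. It is not the MongoDB implementation: the `RawStore` class and the sample record are hypothetical, and a random UUID stands in for the unique identifier that MongoDB generates for each record (its `_id` field).

```python
import json
import uuid
from typing import Optional


class RawStore:
    """In-memory stand-in for the original (key-value) database."""

    def __init__(self):
        self._docs = {}

    def insert(self, document: dict) -> str:
        # Generate a unique key per record (assumption: a UUID replaces
        # the identifier MongoDB would generate automatically).
        key = uuid.uuid4().hex
        # Round-trip through JSON to mimic storing a crawled JSON record.
        self._docs[key] = json.loads(json.dumps(document))
        return key

    def find(self, key: str) -> Optional[dict]:
        # Locate one specific record through its key.
        return self._docs.get(key)


store = RawStore()
key = store.insert({"source": "example-site", "title": "sample", "body": "text"})
record = store.find(key)
```

Each inserted record is addressable only through the returned key, which mirrors the lookup behavior described in 1.1).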

Further, the specific method of multimedia data modeling in step 2) is as follows: multimedia data, including video, audio, and image data, is stored in a distributed file system according to a specific rule, where the rule is that the data source of the multimedia data determines the distributed file system node on which it is stored. That is, each multimedia item is stored in the distributed file system on the storage node that corresponds to its data source.
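The source-based placement rule of step 2) can be sketched as a lookup from data source to storage node. The source names, node names, fallback node, and path layout below are all hypothetical; the patent does not specify them.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from data source to distributed file system node.
SOURCE_TO_NODE = {"site_a": "node1", "site_b": "node2"}
DEFAULT_NODE = "node3"  # assumed fallback for sources with no assigned node


def storage_location(source: str, media_type: str, filename: str) -> tuple:
    """Return (node, path) for a multimedia file, chosen by its data source."""
    node = SOURCE_TO_NODE.get(source, DEFAULT_NODE)
    path = PurePosixPath("/multimedia") / source / media_type / filename
    return node, str(path)
```

For example, `storage_location("site_a", "video", "clip.mp4")` places the file on `node1` under `/multimedia/site_a/video/clip.mp4`.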

Further, the specific method for constructing the relational data model in step 3) is as follows:

3.1) Data integration methods, including event extraction, entity linking, and incomplete data filling, are used to convert the original text data into structured data. Event extraction annotates the original text data according to specific rules to form a data set, trains an event extraction model on that data set, and stores the results in structured form. Entity linking disambiguates certain specific entities in the event extraction results against the database and stores the disambiguated structured data. Incomplete data filling fills in the missing parts of the converted structured data using a missing-data filling method to ensure data integrity;

3.2) The MySQL database system is used as the database system for relational data storage. The structured data produced by data integration is stored in the MySQL relational database;

3.3) MySQL Cluster is used as the distributed storage solution for the relational database. Following the MySQL Cluster layout, the aforementioned computer-readable storage medium cluster is configured with 1 management node, 2 data nodes, and 1 application node. The management node manages the relevant configuration files, the data nodes store data in a distributed manner, and the application node performs operations such as reading and writing.
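The incomplete data filling mentioned in 3.1) is not specified further in the text. As one possible illustration only, the sketch below fills missing values in a column with the mean (for numbers) or the most frequent value (otherwise); the function name and the choice of statistic are assumptions, not the patented filling method.

```python
from collections import Counter
from statistics import mean


def fill_missing(rows, column):
    """Return copies of `rows` with None in `column` replaced by a fill value."""
    present = [r[column] for r in rows if r.get(column) is not None]
    if not present:
        return rows  # nothing to derive a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)  # numeric column: use the mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # otherwise: the mode
    return [
        {**r, column: fill if r.get(column) is None else r[column]}
        for r in rows
    ]
```

The input rows are left unmodified, so the pre-fill state remains available for logging or comparison.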

Further, the specific method for constructing the graph database storage model in step 4) is as follows:

4.1) HBase is used as the underlying graph data storage solution. The elements in the relational database that have specific relationships, together with those relationships, are extracted and stored in HBase, where data is stored row by row and addressed by rowkey. Following the HBase layout, the aforementioned computer-readable storage medium cluster is configured with 1 master node, 1 slave node, and 1 backup node;

4.2) Neo4j is used as the visual query solution for the graph database. Hive is used to export part of the data in HBase and store it in Neo4j to build a knowledge graph that can meet different query requirements. A mapping between HBase and Hive is established, the HBase data is restored to relational-database-like data, and relationships among the data are established through Neo4j;

4.3) The entities in the relational database are modeled through the relationships between them and represented as a series of nodes and edges, where entities are nodes and relationships are edges, and the result is visualized through Neo4j.
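The node-and-edge modeling of 4.3) can be sketched with plain Python structures. This is an illustration of the data model only and does not use HBase, Hive, or Neo4j; the entity and relation tuples are hypothetical sample data.

```python
def build_graph(entities, relations):
    """entities: [(id, name)]; relations: [(source_id, relation_type, target_id)]."""
    # Each entity becomes a node keyed by its id.
    nodes = {entity_id: {"name": name} for entity_id, name in entities}
    # Each relationship becomes a directed, typed edge between two known nodes.
    edges = [
        {"from": src, "type": rel, "to": dst}
        for src, rel, dst in relations
        if src in nodes and dst in nodes  # skip references to unknown entities
    ]
    return nodes, edges


def neighbors(edges, entity_id):
    """List (relation_type, target_id) pairs for edges leaving `entity_id`."""
    return [(e["type"], e["to"]) for e in edges if e["from"] == entity_id]


nodes, edges = build_graph(
    entities=[(1, "Company A"), (2, "Person B")],
    relations=[(2, "WORKS_FOR", 1), (2, "KNOWS", 99)],
)
```

The relation that points at the unknown entity 99 is dropped, so every stored edge connects two existing nodes.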

Further, in step 5) the chain database exists in the form of a consortium chain and a private chain. The consortium chain stores structured data, and the private chain stores semi-structured and unstructured data, including text, pictures, videos, and so on. The consortium chain uses a MySQL database to store structured data, and the private chain uses HDFS to store semi-structured and unstructured data.
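The text does not detail the internal layout of the chain structure. One common way to model time-ordered activity data as a chain is a hash-linked list, sketched below under that assumption; the block fields, the SHA-256 choice, and the `time` key are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first block


def _block_hash(prev_hash, record):
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(activities):
    """Link activity records in time order; each block stores the previous hash."""
    chain, prev_hash = [], GENESIS
    for record in sorted(activities, key=lambda a: a["time"]):
        block_hash = _block_hash(prev_hash, record)
        chain.append({"prev": prev_hash, "data": record, "hash": block_hash})
        prev_hash = block_hash
    return chain


def verify_chain(chain):
    """Recompute every hash and check the links between consecutive blocks."""
    prev_hash = GENESIS
    for block in chain:
        if block["prev"] != prev_hash or block["hash"] != _block_hash(prev_hash, block["data"]):
            return False
        prev_hash = block["hash"]
    return True
```

Because each hash covers the previous one, changing any stored record causes `verify_chain` to fail from that block onward.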

Further, the specific method for storing multimedia data in step 6) is as follows:

6.1) Relevant multimedia data, including video, audio, image, and text, is crawled from the Internet according to the multimedia data sources. A multimedia data index table is designed according to the data attributes; using attributes such as data source, data type, storage node, storage path, and file name, the index table locates the specific position of each multimedia item. The index table is stored in the relational database as structured data;

6.2) A data conversion storage model is designed: video data is converted into text data through a "video->audio->text" process; audio data through an "audio->text" process; and image data through an "image->text" process. The resulting text data is stored in the original database and the multimedia database.
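The multimedia index table of 6.1) can be sketched with an in-memory SQLite database standing in for the relational database. The table name, column names, and sample row are assumptions chosen to match the attributes listed in the text.

```python
import sqlite3

# SQLite stands in for the relational database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE media_index (
           data_source TEXT, data_type TEXT, storage_node TEXT,
           storage_path TEXT, file_name TEXT
       )"""
)
conn.execute(
    "INSERT INTO media_index VALUES (?, ?, ?, ?, ?)",
    ("site_a", "video", "node1", "/multimedia/site_a/video", "clip.mp4"),
)


def locate(data_source, file_name):
    """Return (storage_node, full_path) for a multimedia file, or None."""
    row = conn.execute(
        "SELECT storage_node, storage_path, file_name FROM media_index "
        "WHERE data_source = ? AND file_name = ?",
        (data_source, file_name),
    ).fetchone()
    return None if row is None else (row[0], f"{row[1]}/{row[2]}")
```

A lookup by data source and file name returns the node and path at which the file itself is stored, so the relational database holds only the index, not the media.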

Further, in step 7), data quality is defined in terms of accuracy, completeness, consistency, and relevance:

7.1) Accuracy means that, in data integration methods such as event extraction, data filling, and data consistency detection and conversion, the accuracy of data conversion is ensured through indicators such as the accuracy of the conversion models and methods;

7.2) Completeness means that, for the original text data of a given entity, the multimodal heterogeneous data storage system holds the structured data produced by data conversion, the semi-structured data in key-value format, and the unstructured data in document form, stored in both the original database and the relational database. Likewise, for multimodal data, the system holds the structured data produced by multimedia data conversion, the semi-structured data in key-value format, and the unstructured data in the form of multimedia files, stored in both the original database and the multimedia database;

7.3) Consistency means that data consistency detection and conversion are applied to the data of the same entity across the sub-databases, covering consistency of dimension (units), consistency of expression, consistency of data values, and so on, so that the relevant data stored in each sub-database agrees with the original text data and the original multimedia data;

7.4) Relevance means that the sub-databases are associated with one another by entity id or entity name. The data of the same entity is updated synchronously in every sub-database, and the association by entity id or entity name makes the data traceable to its source.
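The consistency check of 7.3) can be illustrated by comparing the values that different sub-databases hold for the same entity. The sketch below only detects disagreements and does not perform the conversion step; the dictionary layout and the sample values are hypothetical.

```python
def check_consistency(entity_id, sub_databases, fields):
    """Return {field: {db_name: value}} for fields whose values disagree.

    sub_databases maps a sub-database name to {entity_id: {field: value}}.
    Values are assumed to be hashable (strings, numbers).
    """
    conflicts = {}
    for field in fields:
        values = {
            name: db[entity_id][field]
            for name, db in sub_databases.items()
            if entity_id in db and field in db[entity_id]
        }
        if len(set(values.values())) > 1:
            conflicts[field] = values
    return conflicts


dbs = {
    "relational": {"e1": {"name": "A", "height": "180 cm"}},
    "graph": {"e1": {"name": "A", "height": "1.8 m"}},
}
conflicts = check_consistency("e1", dbs, ["name", "height"])
```

Here the `height` field is reported because the two sub-databases express it in different units, which is the kind of dimensional inconsistency the text describes.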

In another aspect, the present invention provides a multimodal heterogeneous data optimization method based on data quality, comprising a multi-level index structure and a log file maintenance module;

The multi-level index and dynamic maintenance module comprises a global index, local indexes, and a dynamic maintenance part:

The global index constructs primary-key and foreign-key indexes between the sub-databases of the multimodal database, effectively linking the sub-databases and enabling queries over related data;

The local indexes construct an independent index structure within each sub-database of the multimodal database to index the content of that sub-database locally, including:

The original database local index module, which indexes every key of the data and sets the indexed field as the shard key, improving query efficiency through the index;

The relational database local index module, which indexes the commonly used fields of the data (for example, the entity name in entity data), improving query efficiency through the index;

The graph database local index module, which builds secondary indexes through Apache Phoenix: a mapping to the HBase tables is established in Phoenix so that the HBase tables can be operated on from Phoenix, improving query efficiency through the index;

The chain database local index module, which is mainly divided into name indexing, ordering, and dynamic incremental updating. A name index is built on specific fields, the consortium chain is constructed in chronological order, and the data is updated dynamically and incrementally;

The multimedia database local index module, which builds a local index structure from the basic information of the multimedia data, including storage node, path, file name, and extension, and stores it in the relational database. Using attributes such as data source, data type, storage node, storage path, and file name, the index table locates the specific position of each multimedia item.
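The global index can be pictured as a table that maps one entity id to the local key of that entity in each sub-database, in the manner of a primary key referenced by foreign keys. A minimal sketch follows; the class, the sub-database names, and the sample keys are hypothetical.

```python
from collections import defaultdict


class GlobalIndex:
    """Maps an entity id to its local key in each sub-database."""

    def __init__(self):
        self._locations = defaultdict(dict)

    def register(self, entity_id, sub_database, local_key):
        # Record where this entity lives in one sub-database.
        self._locations[entity_id][sub_database] = local_key

    def lookup(self, entity_id):
        # Return a copy so callers cannot alter the index by accident.
        return dict(self._locations.get(entity_id, {}))


index = GlobalIndex()
index.register("e1", "original", "doc-key-1")
index.register("e1", "relational", 42)
index.register("e1", "graph", "row:e1")
```

A query that starts from an entity id can then fan out to each sub-database using that sub-database's own local index.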

The log file maintenance module comprises log file maintenance for the multimodal database and log file maintenance for data integration. Log file maintenance for the multimodal database covers the relational database, the graph database, the chain database, and the original database; log file maintenance for data integration covers event extraction, entity linking, incomplete data filling, and data consistency;

8.1) Log files of the multimodal database are maintained by each sub-database through the log files of the system or solution it uses;

8.2) Log file maintenance for data integration means that, whenever a data integration operation changes data, the whole process of the operation is recorded and saved as a log file. The content of the log file includes the time of the operation, the type of data integration, the category of the operation log, and the attributes specific to each data integration method. For every type of data integration method, the log content includes the time at which the operation occurred (Timestamp), the data integration type, and the log level (INFO, WARNING, ERROR, etc.), plus log content elements designed specifically for that method;

8.3) The data integration methods used in the present invention are event extraction (EE), entity linking (EL), incomplete data filling (DF), and data consistency detection (DC). There are five log levels: FATAL, a serious error that causes the application to exit; ERROR, an error that does not prevent the system from continuing to run; WARNING, a potentially erroneous situation; INFO, coarse-grained messages that highlight the progress of the application; and DEBUG, fine-grained messages that are useful for debugging the application;

8.4) An event extraction log record has the form: [Timestamp][EE][Log level][Event type code][Event ID][Title][Event time];

8.5) An entity linking log record has the form: [Timestamp][EL][Log level][Entity link type code][Data value of the unique primary key][Link table name][Data value of composite primary key 1 used when linking to the master table][Data value of composite primary key 2 used when linking to the master table][Data value of composite primary key 3 used when linking to the master table];

8.6) An incomplete data filling log record has the form: [Timestamp][DF][Log level][Operation content][Operation result];

8.7) A data consistency detection log record has the form: [Timestamp][DC][Log level][Type of data consistency check][Operation content][Operation result].
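The bracketed record layouts of 8.4) to 8.7) share a common prefix of timestamp, integration type, and log level, followed by method-specific fields. A minimal formatter sketch follows; the function name and the timestamp format are assumptions, since the text does not fix how the timestamp is written.

```python
from datetime import datetime

INTEGRATION_TYPES = {"EE", "EL", "DF", "DC"}
LOG_LEVELS = {"FATAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def format_record(kind, level, fields, timestamp=None):
    """Build one log line: [Timestamp][type][level] plus method-specific fields."""
    if kind not in INTEGRATION_TYPES:
        raise ValueError(f"unknown data integration type: {kind}")
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    # Assumed timestamp format; the patent does not specify one.
    stamp = (timestamp or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    parts = [stamp, kind, level, *map(str, fields)]
    return "".join(f"[{part}]" for part in parts)


line = format_record(
    "DF", "INFO", ["fill column count", "ok"],
    timestamp=datetime(2022, 3, 22, 10, 0, 0),
)
```

The same function covers all four record types because only the trailing fields differ between them.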

Beneficial effects of the present invention: with the above scheme, the present invention collects data from different data sources; stores the original data in the original database in key-value form; converts the key-value data into relational data through data integration methods such as event extraction, entity linking, and incomplete data filling, and stores it in the relational database; stores the data of the relational database in the graph database in entity-relationship form; stores the activity data with typical time-series characteristics in the chain database; stores the multimedia data in the multimedia database; and processes the data according to its accuracy, completeness, consistency, and relevance, finally achieving heterogeneous data storage and database optimization. Its advantage is that several database models are designed and data in different formats is stored according to the storage characteristics of each database, forming a multimodal database. The index structures greatly improve query efficiency, and the maintained log files allow the multimodal database to be recovered after failures and the operation of each data integration method to be reviewed. Through these steps, a multimodal heterogeneous distributed database system with high data quality is obtained. The heterogeneous storage method and system of the present invention store data in different distributed databases according to the form of the data and optimize data queries, which can greatly reduce the time required to query data and ensure the efficiency of the personnel who use the data.

Brief Description of the Drawings

FIG. 1 is an architecture diagram of the multimodal heterogeneous data storage system based on data quality;

FIG. 2 is an architecture diagram of the original database;

FIG. 3 is an architecture diagram of the relational database;

FIG. 4 is a flowchart of graph database data display;

FIG. 5 is a flowchart of the chain database;

FIG. 6 is a flowchart of the global index structure.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The embodiment described below is only one embodiment of the present invention, not all of them.

The present invention provides a multimodal heterogeneous data storage method based on data quality, whose design concept is as follows. First, data is collected from different data sources; the original text data is saved in key-value form and the multimedia data is saved as files. Second, data modeling is performed: the original text data is converted into relational data through data integration methods and stored in the relational database; the elements in the relational database that have specific relationships, together with those relationships, are extracted and stored in HBase; Hive is used to export part of the data in HBase and store it in Neo4j to build a graph that can meet different query requirements; and the original data and the compressed multimedia data are stored in the chain database. Finally, database optimization is performed: a global index structure, local index structures, and a log file maintenance module are established, forming the multimodal heterogeneous database.

The framework of a multimodal heterogeneous data storage system based on data quality designed according to this method is shown in FIG. 1. It comprises an original database, a relational database, a graph database, a chain database, and a multimedia database.

The functions of each database are as follows:

Original database: used to store original data from the Internet in key-value format;

Relational database: used to convert the key-value data in the original database into relational data, which is then modeled and stored;

Graph database: used to represent as a graph and store the associated entities in the relational database and the relationships between them;

Multimedia database: used to store video data and audio data converted into text format;

Chain database: used to store the chain structure of the activity data of each entity in the relational database.

The above system is used to implement the multimodal heterogeneous data storage method based on data quality, with the following steps:

1) Original data is crawled from the Internet and stored in the original database in key-value form;

Specifically:

1.1) The relevant data crawled from the Internet is saved as JSON files and stored in the MongoDB database;

1.2) In the present invention, the original database uses a MongoDB Replica Set + Sharding cluster to implement its distributed storage mode. With this cluster mode, the database consists of three nodes: a primary node, a replica node, and an arbiter node. The architecture of the original database is shown in FIG. 2. Viewed vertically, the three nodes act as three servers, and each server runs a routing process, a config server process, and the corresponding shards. When the database is read or written, the routing process receives the client's request and forwards it to the corresponding shard, while the config server stores the metadata configuration of the database. Viewed horizontally, the present invention defines three shards, and each shard uses a Replica Set to form a primary, backup, and arbiter arrangement across the three nodes.
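The request routing in 1.2) can be pictured as mapping a shard key to one of the three shards. The sketch below assumes a simple hash-and-modulo rule purely for illustration; it is not MongoDB's actual routing, which relies on metadata held by the config servers, and the shard names are hypothetical.

```python
import hashlib

SHARDS = ["shard1", "shard2", "shard3"]  # hypothetical shard names


def route(shard_key: str) -> str:
    """Pick a shard deterministically from the shard key."""
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it modulo the shard count.
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

The same key always routes to the same shard, which is the property the routing process needs in order to find a record again on read.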

2) The original text data is converted into relational data through data integration methods and stored in the relational database;

Specifically:

2.1) The original text data is converted into relational data through data integration methods such as event extraction, entity linking, and incomplete data filling, and stored in the MySQL database;

2.2) In the present invention, the relational database uses a MySQL Cluster to implement its distributed storage model. With this cluster mode, the database consists of four nodes: 1 management node, 2 data nodes, and 1 application node. The architecture of the relational database is shown in FIG. 3. In the relational database model provided by the present invention, the client performs basic database operations by connecting to the application node, and data is stored in structured form. After a client operation completes, the two data nodes automatically synchronize and replicate the same data to keep the data safe. The management node can monitor the status of the other nodes at any time and can add and configure new nodes.

3) Extract the elements in the relational database that have specific relationships, together with the relationships between them, store them in the graph database, and display them;

This specifically includes:

3.1) Extract the elements in the relational database that have specific relationships, together with the relationships between them, and store them in HBase;

3.2) Use Hive to export part of the data in HBase and store it in Neo4j, building a knowledge graph that can serve different query requirements. A mapping between HBase and Hive is established, the HBase data is restored to relational-style (table-like) data, and the relationships among the data are then created in Neo4j.

3.3) For visualization based on Neo4j, the relevant data is extracted from HBase and stored in the Neo4j graph. Figure 4 shows the flow chart of graph database data display. The graph database uses a cluster of four nodes to provide distributed storage. In the graph database model provided by the invention, the client performs basic query operations through Neo4j, and the data is displayed in visual form. The underlying graph data is stored in HBase.
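The extraction in step 3) can be pictured as turning relational rows into entity-relation-entity triples, the representation the description later uses for the graph database. The following is a minimal sketch that uses plain Python structures in place of HBase and Neo4j; the column names `source_entity`, `relation`, and `target_entity` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def rows_to_triples(rows):
    """Convert relational rows into de-duplicated (subject, relation, object)
    triples, preserving the order in which they first appear."""
    seen = set()
    triples = []
    for row in rows:
        triple = (row["source_entity"], row["relation"], row["target_entity"])
        if triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


def neighbours(triples, entity):
    """Group the entities linked to `entity` by relation name, treating
    `entity` as the central node regardless of edge direction."""
    linked = defaultdict(list)
    for subject, relation, obj in triples:
        if subject == entity:
            linked[relation].append(obj)
        elif obj == entity:
            linked[relation].append(subject)
    return dict(linked)
```

In a real deployment each triple would become two nodes and one relationship in the graph store; the sketch only shows the shape of the data that crosses from the relational side to the graph side.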

4) Store the original data and the compressed multimedia data in the chain database;

The flow chart of the chain database in step 4) is shown in Figure 5:

In the consortium chain, the JSON text file is parsed and its attributes are stored in the corresponding fields of the corresponding table in the MySQL database. In the private chain, the HDFS distributed file system stores the original detailed content of each event. Each event corresponds to that event's text files, pictures, videos, and so on in the local file system; these are packaged and compressed into an archive, a hash value is computed over the archive, the hash value is stored in the corresponding hash field in MySQL, and the archive is uploaded to HDFS.
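The package, hash, and store sequence for the private chain can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: plain dictionaries stand in for the MySQL table and for HDFS, SHA-256 is an assumed choice because the text does not name a hash function, the `/events/...` path is hypothetical, and `verify_event` is an added illustration of how the stored hash could be used.

```python
import hashlib
import io
import zipfile


def package_event(files):
    """Compress an event's files (name -> bytes) into one in-memory archive.
    A fixed timestamp keeps the archive bytes, and so its hash, reproducible."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            archive.writestr(info, files[name],
                             compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def store_event(event_id, files, mysql_rows, hdfs_store):
    """Package the files, hash the archive, record the hash in the stand-in
    MySQL table, and place the archive in the stand-in HDFS store."""
    archive = package_event(files)
    digest = hashlib.sha256(archive).hexdigest()  # assumed hash function
    path = f"/events/{event_id}.zip"              # hypothetical HDFS path
    hdfs_store[path] = archive
    mysql_rows[event_id] = {"hash": digest, "hdfs_path": path}
    return digest


def verify_event(event_id, mysql_rows, hdfs_store):
    """Recompute the archive hash and compare it with the stored value."""
    row = mysql_rows[event_id]
    archive = hdfs_store[row["hdfs_path"]]
    return hashlib.sha256(archive).hexdigest() == row["hash"]
```

Keeping only the hash on the structured side means a modified archive can be detected without storing the bulky event content in the relational table.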

5) A global index is constructed from the data attributes shared by the sub-databases. The invention links the related data held in the relational database, graph database, chain database, multimedia database, and original database through indexes on certain specific fields, and thereby provides three functions: first, querying of related data; second, through the constructed index, tracing data in the other databases back to its original data in the original database; third, after the video, audio, and image data in the multimedia data have been converted into text data, copy management for the multiple copies of that text data stored in the relational database, the original database, and the multimedia database. Constructing this multi-modal global index structure links the related data and makes it traceable. The flow chart of the global index structure is shown in Figure 6:

5.1) In the relational database, the entity basic information table holds basic attributes of each entity, such as the entity ID, entity name, and object ID. The entity ID attribute links an entity in the relational database to that entity's JSON-format original data in the original database, which makes the relational data traceable to the original data;

5.2) The entity ID attribute in each entity business data table of the relational database serves as a foreign key that references the entity's basic information in the entity basic information table, which enables associated queries from the basic information data to the business data;

5.3) The entity ID attribute of the entity information table in the relational database links to the multimedia index table for that entity, which is also stored in the relational database and holds the storage node information, path, file name, extension, and other data for the entity's video, audio, image, and text data; this enables associated modification and deletion from the relational database to the multimedia database;

5.4) In the multimedia index table, the storage node information, path, file name, extension, and other attributes of each multimedia file (video, audio, image, or text data) are combined, so that the index table can be used to look up the multimedia files stored on each node;

5.5) The chain database uses the entity name as the key and stores the entity's event information in a consortium chain + private chain structure. The entity name in the entity basic information table of the relational database links the relational data to the chain database, so that the entity's data in the chain database can be queried;

5.6) The graph database uses the entity name as the key and, by constructing triples of entity information, stores and displays the links between entities with that entity as the central node. The entity name in the entity basic information table of the relational database links the relational data to the graph database, so that the entity's data in the graph database and the associations between entities can be queried;

5.7) In the data conversion provided by the present invention, the text data converted from video, audio, and image data is stored in the multimedia database and, in order to expose a richer data interface, is also stored as original data in the original database. The text data is associated with the JSON-format files in the original database through a text ID, which links the text data in the multimedia database to the original data in the original database.
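The links in 5.1) to 5.6) can be summarised in a short sketch in which each sub-database is replaced by an in-memory structure. All field names (`entity_id`, `entity_name`, `node`, `path`, `file_name`, `extension`) are hypothetical stand-ins for the attributes the text describes, and the class is an illustration of the linking idea, not the patented index implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class MultiModalStore:
    original: dict      # entity_id -> original JSON string (original database)
    entity_basic: dict  # entity_id -> {"entity_name": ...} (relational)
    business: list      # rows carrying an "entity_id" foreign key (relational)
    media_index: list   # rows: entity_id, node, path, file_name, extension
    chain: dict         # entity_name -> list of events (chain database)
    graph: list         # (subject, relation, object) triples (graph database)

    def _name(self, entity_id):
        return self.entity_basic[entity_id]["entity_name"]

    def trace_to_original(self, entity_id):
        """5.1: follow the entity ID back to the original JSON record."""
        return json.loads(self.original[entity_id])

    def business_rows(self, entity_id):
        """5.2: foreign-key lookup from basic information to business data."""
        return [r for r in self.business if r["entity_id"] == entity_id]

    def media_locations(self, entity_id):
        """5.3 and 5.4: combine node, path, file name, and extension."""
        return [
            f'{r["node"]}:{r["path"]}/{r["file_name"]}.{r["extension"]}'
            for r in self.media_index
            if r["entity_id"] == entity_id
        ]

    def chain_events(self, entity_id):
        """5.5: the entity name is the key into the chain database."""
        return self.chain.get(self._name(entity_id), [])

    def graph_links(self, entity_id):
        """5.6: triples in which the entity appears as either endpoint."""
        name = self._name(entity_id)
        return [t for t in self.graph if name in (t[0], t[2])]
```

The entity ID carries the links within the relational side and back to the original data, while the entity name carries the links out to the chain and graph databases, which is the division the description draws.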

Claims

1. A multi-modal heterogeneous data storage method based on data quality, characterized by comprising the following steps:

1) Storing original data obtained from the Internet in an original database in a distributed manner in key-value format; the original data comprises original text data and original multimedia data;

2) Modeling the original multimedia data and storing it, in file form, in a file database in a distributed manner;

3) Converting the original text data into relational data through the data integration methods of event extraction, entity linking, and incomplete-data filling, modeling the relational data, and constructing a relational database;

4) Modeling the entities in the relational database that have association relationships, together with the relationships among them, to construct a graph database;

5) The activity data of each entity in the relational database has typical time-series characteristics; the activity data is modeled in a chained structure to construct a chain database;

6) Converting video data and audio data in the multimedia data into text data through a data conversion method, storing the text data in a multimedia database in a file form, and storing the text data in an original database in a key-value format;

7) Optimizing the different distributed databases according to the data quality, and linking the entity data of each sub-database by constructing a multi-level index structure, thereby ensuring the consistency of the data;

8) Constructing a log file maintenance system for the multi-modal database, covering the data integration methods and each sub-database.

2. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein in the step 1), the specific method for storing the original text data in the key-value format in the original database in a distributed manner is as follows:

2.1 Using the MongoDB database system as the database system for key-value data storage;

2.2 Using the MongoDB Replica Set in MongoDB as the distributed storage solution for the original database.

3. The method according to claim 1, wherein in the step 2), the multimedia data is stored in the distributed file system according to the type of the data source, i.e. the video, audio or picture data is stored in the storage node corresponding to the data source.

4. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein in the step 3), the specific method for constructing the relational database is as follows:

3.1 The original text data is converted into relational data through three data integration methods of relation extraction, entity linking and incomplete data filling;

3.2 Using MySQL database system as the database system for relational data storage;

3.3 Using MySQL Cluster as a distributed storage solution for relational databases.

5. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein in the step 4), the specific method for constructing the graph database is as follows:

4.1 Using HBase as the underlying graph data storage scheme;

4.2 Using Neo4j as a graph database visualization query scheme;

4.3 Modeling the entities in the relational database according to the relationships between them.

6. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein in the step 5), the specific method for constructing the chain database is as follows:

The chain database uses MySQL and HDFS as its data storage schemes, and data is stored into the chain database through MySQL and HDFS; the consortium chain stores structured data in MySQL, and the private chain stores semi-structured data in HDFS.

7. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein the specific method for storing the multi-media data in step 6) is as follows:

6.1 Crawling relevant multimedia data including video, audio, images and texts from the Internet according to the multimedia data source; designing a multimedia data index table according to the data attribute, positioning the multimedia data to a specific position through the index table according to the data source, the data type, the storage node, the storage path and the file name attribute, and storing the index table in a relational database in a structured data form;

6.2 Designing a data conversion storage model: video data is converted into text data through a video -> audio -> text process; audio data is converted into text data through an audio -> text process; image data is converted into text data through an image -> text process; and the resulting text data is stored in the original database and the multimedia database.

8. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein in the step 7), the multi-level index structure is composed of a global index and a local index; the dynamic maintenance process is as follows:

The global index constructs primary-key and foreign-key indexes among the original database, the relational database, the graph database, the chain database, and the multimedia database, effectively linking all the sub-databases to support queries over related data; the local index constructs an independent index structure within each database to index the contents of that sub-database locally;

The index module of each sub-database is as follows: the original database local index module establishes an index on each key of the data, sets a shard key on an indexed field, and improves query efficiency through the index;

The relational database local index module establishes indexes on commonly used fields in the data, builds secondary indexes through Apache Phoenix, establishes in Phoenix a mapping to the tables in HBase so that table operations can be carried out in Phoenix, and improves query efficiency through the indexes;

The chain database local index module is mainly divided into a name index part, an ordering part, and a dynamic incremental update part; a name index is established on a specific field, the consortium chain is constructed in time order, and incremental data is updated dynamically;

The multimedia database local index module constructs a local index structure from the basic information of the multimedia data, including the storage node information, path, file name, and extension of the data, and stores the local index structure in the relational database; through the index table, the specific location of the multimedia data can be found according to the data source, data type, storage node, storage path, and file name attributes.

9. The method for storing multi-modal heterogeneous data based on data quality according to claim 1, wherein in the step 8), the specific method for maintaining the log file is as follows:

The log file maintenance is divided into log file maintenance of the multi-modal database and log file maintenance of data integration; the log file maintenance of the multi-modal database comprises log file maintenance of the relational database, the graph database, the chain database, and the original database; the log file maintenance of data integration comprises log file maintenance of event extraction, entity linking, incomplete-data filling, and data consistency.

10. A data quality based multi-modal heterogeneous data storage system for implementing a data quality based multi-modal heterogeneous data storage method as claimed in any one of claims 1 to 9, comprising:

Original database: used for storing the original data obtained from the Internet, in key-value format;

Relational database: used for converting the key-value data in the original database into relational data, building a model, and storing it;

Graph database: used for storing, in graph form, the related entities in the relational database and the relationships among them;

Multimedia database: used for storing the video data and audio data that have been converted into text format;

Chain database: used for storing, in a chained structure, the activity data of each entity in the relational database.