CN104850632A - Generic similarity calculation method and system based on heterogeneous information network

Info

Publication number: CN104850632A
Application number: CN201510267247.4A
Authority: CN
Inventors: 张邦佐; 汤树林; 尹宗铭; 徐桂萍; 蔡永健; 徐坤
Original assignee: Northeast Normal University
Current assignee: Northeast Normal University
Priority date: 2015-05-22
Filing date: 2015-05-22
Publication date: 2015-08-19

Abstract

The invention discloses a general similarity calculation method based on a heterogeneous information network, comprising: step 1, preprocessing the input data set to ensure the validity of the input data; step 2, extracting metadata, that is, the description information of the input data, and storing it in a metadata database; step 3, building the heterogeneous information network schema through user interaction and storing it; step 4, calculating similarity with a meta-path-based similarity measure on the heterogeneous information network; step 5, post-processing the similarity results to form an overall similarity as the final output. Beneficial effects of the invention: by modeling with a heterogeneous information network, a general similarity calculation method is provided; data sets of different types can be processed; a variety of similarity calculation needs can be met; users can specify multiple calculation methods and result post-processing methods, which gives a high degree of freedom of choice, improves calculation accuracy and efficiency, and better addresses the problem of information overload.

Description

A general similarity calculation method and system based on a heterogeneous information network

Technical Field

The present invention relates to the fields of information technology and Internet technology, and in particular to a general similarity calculation method and system based on a heterogeneous information network.

Background

With the development of information technology and the Internet, people have moved from the early era of data scarcity into an era of information overload. Particularly in the current big data era, how to deal with information overload and extract valuable information from massive data is a key problem that urgently needs solving. In information retrieval systems and in personalized recommendation systems and applications alike, similarity calculation between pieces of information is a key technology, and it usually plays a decisive role in the processing accuracy of those systems and applications.

The heterogeneous information network is a relatively new research field that emerged with the development of social networks; it is also called a heterogeneous social network or a multi-relational social network. Through its network schema, a heterogeneous information network specifies type constraints on sets of objects and relationship constraints between objects. These constraints make a heterogeneous information network semi-structured, which guides the exploration of the network's semantics. Heterogeneous information networks can be constructed from many interconnected large-scale data sets, ranging from social, scientific and engineering to commercial applications, and can also be applied to e-commerce sites such as Amazon and eBay, online movie databases such as IMDb (Internet Movie Database), and various other databases.

As a general tool for big data mining, the heterogeneous information network is well suited to expressing the relationships and structural features among data. Modeling real-world relationships with a heterogeneous information network makes it possible to calculate the similarity between pieces of real-world information effectively.

At present, the more traditional methods for calculating similarity between real-world entities usually target specific data only. They are simple and fixed, cannot adequately reflect the rich relationships between real-world entities, and lack a general calculation method and framework. They typically rely on simple similarity measures and consider only a small number of relatively fixed attributes of the data. When different types of data or different data attributes must be considered, the calculation method has to be reconsidered and the system modified, which leads to poor generality, low calculation efficiency and low accuracy of results, and cannot meet the requirements of today's big data era.

Summary of the Invention

To solve the above problems, the object of the present invention is to provide a general similarity calculation method and system in which the data attributes and calculation methods can be selected according to user requirements.

The present invention provides a general similarity calculation method based on a heterogeneous information network, including:

Step 1, preprocessing the input data set to ensure the validity of the input data;

Step 2, extracting metadata, that is, the description information of the input data, and storing the description information in a metadata database, wherein the description information includes global information on the overall state of the input data set, local information on each record, and the conversion and correspondence information between the identifiers of the data attributes and their internal representations;

Step 3, the user selects the entities and data attributes that take part in the similarity calculation; the corresponding metadata is queried, the data type and value range of each metadata item are displayed, and the user is prompted to select a processing method for the metadata from a preset processing library; with the entities as central nodes, the entities are linked according to their semantic relationships and each attribute is connected to its entity, generating a heterogeneous information network schema, which is then stored;

Step 4, in the heterogeneous information network schema, after the user specifies the data attributes and the meta-path, similarity is calculated according to the selected processing method;

Step 5, according to the weights of the selected data attributes, the multiple similarity results calculated in step 4 are fused into a unified similarity result, which is converted to the required format and output to the user.

As a further improvement of the present invention, in step 1, the preprocessing of the input data set includes data cleaning and data integration, wherein:

data cleaning performs format conversion, noise elimination and removal of inconsistent data: the input data set is cleaned, useless noisy data is removed, and the corresponding format conversion is carried out;

data integration combines data from multiple data sources.
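The cleaning and integration described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `clean_records` and `integrate`, the rule that a row with an empty required field counts as noise, and the sample rows are all hypothetical and are not taken from the patent.

```python
def clean_records(records, required_fields):
    """Data cleaning: drop incomplete rows and trim stray whitespace."""
    cleaned = []
    for rec in records:
        # Assumption: a row missing any required field is treated as noise.
        if any(rec.get(f) in (None, "") for f in required_fields):
            continue
        cleaned.append({k: v.strip() if isinstance(v, str) else v
                        for k, v in rec.items()})
    return cleaned


def integrate(primary, secondary, key):
    """Data integration: attach fields from a second source via a shared key."""
    index = {rec[key]: rec for rec in secondary}
    return [{**index.get(rec[key], {}), **rec} for rec in primary]


# Made-up rows standing in for two data sources.
ratings_source = [
    {"movie_id": 1, "title": " Movie A "},
    {"movie_id": 2, "title": ""},  # incomplete row, removed by cleaning
    {"movie_id": 3, "title": "Movie C"},
]
film_source = [
    {"movie_id": 1, "director": "Director X"},
    {"movie_id": 3, "director": "Director Y"},
]

cleaned = clean_records(ratings_source, ["movie_id", "title"])
merged = integrate(cleaned, film_source, "movie_id")
print(merged)
```

In this sketch the primary source wins when both sources carry the same field, which is one possible policy for resolving inconsistent data.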

As a further improvement of the present invention, step 4 specifically includes:

Step 401, the user specifies the data attributes and selects a meta-path; if several data attributes are involved, the user specifies the order in which they are linked, which forms the meta-path;

Step 402, after the user specifies a single data attribute, the data of that attribute is retrieved and built into an adjacency matrix; if the user specifies a meta-path, the adjacency matrix is built from all relations on that meta-path; the adjacency matrix is then normalized;

Step 403, after the adjacency matrix has been normalized, matrix operations are used to obtain the similarity result matrix for the corresponding attribute.
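Steps 402 and 403 can be illustrated with a small sketch. The text does not fix the normalization or the matrix operation, so the choices here are assumptions: each row is divided by its sum, and the similarity matrix is the product of the normalized matrix with its own transpose. The helper names and the toy movie and actor identifiers are hypothetical.

```python
def build_adjacency(links, row_objects, col_objects):
    """Build a 0/1 adjacency matrix from (row_object, col_object) pairs."""
    row_index = {obj: i for i, obj in enumerate(row_objects)}
    col_index = {obj: j for j, obj in enumerate(col_objects)}
    matrix = [[0.0] * len(col_objects) for _ in row_objects]
    for row_obj, col_obj in links:
        matrix[row_index[row_obj]][col_index[col_obj]] = 1.0
    return matrix


def row_normalize(matrix):
    """Assumed normalization: divide every row by its sum (zero rows stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def similarity_matrix(normalized):
    """Assumed matrix operation: normalized matrix times its transpose."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in normalized]
            for r1 in normalized]


movies = ["m1", "m2", "m3"]
actors = ["a1", "a2"]
links = [("m1", "a1"), ("m1", "a2"), ("m2", "a1"), ("m3", "a2")]

adjacency = build_adjacency(links, movies, actors)
normalized = row_normalize(adjacency)
similarity = similarity_matrix(normalized)
print(similarity)
```

Movies m2 and m3 share no actor, so their entry in `similarity` is zero, while m1 shares one actor with each of them.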

As a further improvement of the present invention, the formats of the input data set in step 1 include relational databases, NoSQL databases, ARFF files, CSV files, text files and Excel spreadsheets.

As a further improvement of the present invention, in step 2, the global information includes the number of records in the input data set and the number of data attributes in each record, and the local information includes the identifier, data type and value range of each data attribute.

The present invention also provides a general similarity calculation system based on a heterogeneous information network, including:

a processing module, which preprocesses the input data set to ensure the validity of the input data;

an extraction module, which extracts metadata, that is, the description information of the input data, and stores the description information in a metadata database, wherein the description information includes global information on the overall state of the input data set, local information on each record, and the conversion and correspondence information between the identifiers of the data attributes and their internal representations;

a modeling module, in which the user selects the entities and data attributes that take part in the similarity calculation; the corresponding metadata is queried, the data type and value range of each metadata item are displayed, and the user is prompted to select a processing method for the metadata from a preset processing library; with the entities as central nodes, the entities are linked according to their semantic relationships and each attribute is connected to its entity, generating a heterogeneous information network schema, which is then stored;

a calculation module, in which, in the heterogeneous information network schema, after the user specifies the data attributes and the meta-path, similarity is calculated according to the selected processing method;

a post-processing module, which, according to the weights of the selected data attributes, fuses the multiple similarity results calculated in step 4 into a unified similarity result, converts it to the required format and outputs it to the user.

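As a further improvement of the present invention, the preprocessing of the input data set by the processing module includes data cleaning and data integration, wherein:

data cleaning performs format conversion, noise elimination and removal of inconsistent data: the input data set is cleaned, useless noisy data is removed, and the corresponding format conversion is carried out;

data integration combines data from multiple data sources.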

As a further improvement of the present invention, the calculation module includes:

a specification module, in which the user specifies the data attributes and selects a meta-path; if several data attributes are involved, the user specifies the order in which they are linked, which forms the meta-path;

a construction module, which, after the user specifies a single data attribute, retrieves the data of that attribute and builds it into an adjacency matrix; if the user specifies a meta-path, the adjacency matrix is built from all relations on that meta-path; the adjacency matrix is then normalized;

an operation module, which, after the adjacency matrix has been normalized, uses matrix operations to obtain the similarity result matrix for the corresponding attribute.

As a further improvement of the present invention, the formats of the input data set in the processing module include relational databases, NoSQL databases, ARFF files, CSV files, text files and Excel spreadsheets.

As a further improvement of the present invention, in the extraction module, the global information includes the number of records in the input data set and the number of data attributes in each record, and the local information includes the identifier, data type and value range of each data attribute.

The beneficial effects of the present invention are:

1. By modeling with a heterogeneous information network, it addresses the increasingly important problem of similarity calculation and has broad application prospects and practical value;

2. It has good generality and compatibility and can handle data sets of different types;

3. It meets users' needs for multiple kinds of similarity calculation on data objects. It can calculate the similarity between objects of the same type as well as between objects of different types, and is suitable for practical applications and systems in many fields. Examples are the similarities most often used in recommendation systems (between users, between items, and between users and items) and the similarity between system objects in a search system;

4. Users can specify multiple similarity calculation methods and result post-processing methods, which is flexible and convenient and gives users a high degree of freedom of choice; it also improves calculation accuracy, offers advantages in calculation efficiency, and helps to address the problem of information overload.

Brief Description of the Drawings

Fig. 1 is a flow chart of a general similarity calculation method based on a heterogeneous information network according to an embodiment of the present invention;

Fig. 2 is a flow chart of step 4 of the general similarity calculation method based on a heterogeneous information network according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the heterogeneous information network schema in the general similarity calculation method based on a heterogeneous information network according to an embodiment of the present invention;

Fig. 4 is a system diagram of a general similarity calculation system based on a heterogeneous information network according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the calculation module in the general similarity calculation system based on a heterogeneous information network according to an embodiment of the present invention.

Detailed Description

The present invention is described in further detail below through specific embodiments and with reference to the accompanying drawings.

Embodiment 1. As shown in Fig. 1, a general similarity calculation method based on a heterogeneous information network according to an embodiment of the present invention includes:

Step 1, preprocessing the input data set to ensure the validity of the input data;

The following takes the MovieLens 100K movie recommendation data set provided by the University of Minnesota as an example. When only movie recommendation is of interest, much of the user data is redundant and needs to be removed. The data set lacks information such as movie actors (Actor) and directors (Director), but it provides links from each movie to the Internet Movie Database (IMDb); combining the MovieLens 100K data set with IMDb data yields valid data.

Step 2, extracting metadata, that is, the description information of the input data, and storing the description information in a metadata database, wherein the description information includes global information on the overall state of the input data set, local information on each record, and the conversion and correspondence information between the identifiers of the data attributes and their internal representations;

In the MovieLens 100K data set, the global information includes: 1682 movies, 943 users and 100,000 movie ratings, with ratings ranging from 1 to 5. The local information includes the identifiers Actor, release year (Year), Director, movie genre (Genre), Country and Keyword. For the Actor attribute, for example, the data type is string, the value range is 20 characters, and exact matching is required. Regarding the conversion between data attribute identifiers and internal representations, the attributes in MovieLens 100K are labeled internally as follows: Actor is attribute 1 (A1), Year is attribute 2 (A2), Director is attribute 3 (A3), Genre is attribute 4 (A4), Country is attribute 5 (A5) and Keyword is attribute 6 (A6). This lets later processing treat the data as a whole without considering the specifics of each attribute. This information is extracted and saved for use by subsequent steps.
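A minimal sketch of the metadata extraction in step 2 is given below. The function name `extract_metadata`, the dictionary layout and the type-inference rule are assumptions made for illustration; the sample records are invented and are not rows of MovieLens 100K.

```python
def extract_metadata(records):
    """Return global info, per-attribute local info and an identifier -> label map."""
    attributes = list(records[0].keys()) if records else []

    # Global information: record count and number of attributes per record.
    global_info = {"record_count": len(records),
                   "attribute_count": len(attributes)}

    # Local information: identifier, data type and value range of each attribute.
    local_info = {}
    for name in attributes:
        values = [rec[name] for rec in records if rec.get(name) is not None]
        if values and all(isinstance(v, (int, float)) for v in values):
            local_info[name] = {"type": "numeric",
                                "range": (min(values), max(values))}
        else:
            # Assumption: for strings, the value range is the longest observed length.
            local_info[name] = {"type": "string",
                                "range": max((len(str(v)) for v in values), default=0)}

    # Conversion between identifiers and internal labels A1, A2, ...
    internal = {name: f"A{i}" for i, name in enumerate(attributes, start=1)}
    return global_info, local_info, internal


records = [
    {"Actor": "Actor One", "Year": 1995},
    {"Actor": "Actor Two", "Year": 1997},
]
global_info, local_info, internal = extract_metadata(records)
print(global_info, local_info, internal)
```

The three returned dictionaries stand in for the contents of the metadata database that later steps query.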

Step 3, the user selects the entities and data attributes that take part in the similarity calculation; the corresponding metadata is queried, the data type and value range of each metadata item are displayed, and the user is prompted to select a processing method for the metadata from a preset processing library; with the entities as central nodes, the entities are linked according to their semantic relationships and each attribute is connected to its entity, generating a heterogeneous information network schema, which is then stored;

For the MovieLens 100K data set, the commonly used data attributes of movies need to be completed by extending the movie attributes from the IMDb database. With Movie selected as the entity and Actor, Year, Director, Genre, Country and Keyword selected as the data attributes, the heterogeneous information network schema of movies is as shown in Fig. 3.
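The star-shaped schema described above, with Movie as the central node and the six attributes attached to it, could be represented roughly as follows. The class `NetworkSchema` and its methods are hypothetical names chosen for this sketch, and relations are assumed to be undirected.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkSchema:
    """Node types plus the undirected relations allowed between them."""
    node_types: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def link(self, type_a, type_b):
        """Register a relation between two node types."""
        self.node_types.update((type_a, type_b))
        self.relations.add(frozenset((type_a, type_b)))

    def is_valid_meta_path(self, path):
        """A meta-path is valid if every consecutive pair of types is related."""
        return len(path) >= 2 and all(
            frozenset(pair) in self.relations for pair in zip(path, path[1:]))


schema = NetworkSchema()
for attribute in ("Actor", "Year", "Director", "Genre", "Country", "Keyword"):
    schema.link("Movie", attribute)  # each attribute connects to the central entity

print(schema.is_valid_meta_path(["Movie", "Actor", "Movie"]))  # allowed by the schema
print(schema.is_valid_meta_path(["Actor", "Director"]))        # no direct relation
```

Storing the schema this way lets a later step reject a user-specified meta-path that the schema does not permit.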

Step 4, in the heterogeneous information network schema, after the user specifies the data attributes and the meta-path, similarity is calculated according to the selected processing method.

As shown in Fig. 2, this specifically includes the following steps:

Step 401, the user specifies the data attributes and selects a meta-path; if several data attributes are involved, the user specifies the order in which they are linked, which forms the meta-path;

The similarity in a heterogeneous information network is usually determined by data attributes, so the user needs to specify which data attribute the similarity calculation is based on. For example, if popular actors are relied on to secure box office results, movies with the same actors are considered similar.

Step 402, after the user specifies a single data attribute, the data of that attribute is retrieved and built into an adjacency matrix; if the user specifies a meta-path, the adjacency matrix is built from all relations on that meta-path; the adjacency matrix is then normalized;

In the MovieLens 100K data set, selecting Actor as the data attribute yields the adjacency matrix between movies with respect to the actor attribute. This similarity is based on a selected meta-path: starting from a given movie (Movie), the path can be explored from movie to actor (Actor) and back to movie, for example along the path Movie-(Actor-Movie-Actor)^r-Movie, where r is a non-negative integer.
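To make the idea of following a meta-path concrete, the sketch below multiplies adjacency matrices along a path written out as an explicit list of types, which counts the path instances between every pair of movies. It is an illustration under assumptions: `path_count_matrix` is a hypothetical helper, the 3 x 2 movie-actor matrix is invented, and a relation stored in one direction is traversed backwards by transposing it.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def path_count_matrix(meta_path, relations):
    """Multiply the adjacency matrices of consecutive types along the meta-path."""
    result = None
    for src, dst in zip(meta_path, meta_path[1:]):
        if (src, dst) in relations:
            step = relations[(src, dst)]
        else:
            # Relation stored in the opposite direction: use its transpose.
            step = transpose(relations[(dst, src)])
        result = step if result is None else matmul(result, step)
    return result


# Rows are movies, columns are actors (toy data).
relations = {("Movie", "Actor"): [[1, 1], [1, 0], [0, 1]]}

mam = path_count_matrix(["Movie", "Actor", "Movie"], relations)
mamam = path_count_matrix(["Movie", "Actor", "Movie", "Actor", "Movie"], relations)
print(mam)
print(mamam)
```

The second and third movies share no actor, so they have no Movie-Actor-Movie path instance, but the longer path connects them through the first movie.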

Step 403, after the adjacency matrix has been normalized, matrix operations are used to obtain the similarity result matrix for the corresponding attribute;

If the meta-path is $P(A_1 A_2 \ldots A_i \ldots A_l)$, where $P$ is the meta-path and $A_i$ denotes an attribute of the data, define $R_i$ as the relation from $A_i$ to $A_{i+1}$, that is, there is an edge connecting $A_i$ and $A_{i+1}$. The similarity between any two objects $s$ and $t$ (where $s$ is an object of attribute $A_1$ and $t$ is an object of attribute $A_l$) can then be calculated.

The HeteSim similarity can be used for the calculation, as given by formula (1), where the functions O(R) and I(R) denote the out-degree and in-degree of relation R, respectively.
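Formula (1) itself does not appear in this text. For orientation only, the HeteSim measure as it is usually defined in the published literature has the recursive form below. Treat this as an assumption about what formula (1) states rather than a transcription of it. In this notation, $O(s \mid R_1)$ is the set of out-neighbors of $s$ under relation $R_1$ and $I(t \mid R_l)$ is the set of in-neighbors of $t$ under relation $R_l$, so their sizes are the corresponding out-degree and in-degree.

```latex
\mathrm{HeteSim}(s, t \mid R_1 \circ R_2 \circ \cdots \circ R_l)
  = \frac{1}{|O(s \mid R_1)|\,|I(t \mid R_l)|}
    \sum_{i=1}^{|O(s \mid R_1)|} \sum_{j=1}^{|I(t \mid R_l)|}
    \mathrm{HeteSim}\bigl(O_i(s \mid R_1),\, I_j(t \mid R_l) \mid R_2 \circ \cdots \circ R_{l-1}\bigr)
```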

The PathSim similarity can also be used for the calculation; its formula is:

$\mathrm{PathSim}(s, t) = \frac{2 \times |\{p_{s \to t} : p_{s \to t} \in P\}|}{|\{p_{s \to s} : p_{s \to s} \in P\}| + |\{p_{t \to t} : p_{t \to t} \in P\}|} \qquad (2)$

where $p_{s \to t}$, $p_{s \to s}$ and $p_{t \to t}$ are path instances between $s$ and $t$, between $s$ and $s$, and between $t$ and $t$, respectively.
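Formula (2) can be evaluated directly once the path instances along a symmetric meta-path have been counted. The sketch below does this for the path Movie-Actor-Movie; the helper names `path_counts` and `pathsim` and the toy movie-actor matrix are assumptions made for illustration.

```python
def path_counts(adjacency):
    """Count Movie-Actor-Movie path instances: adjacency times its transpose."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in adjacency]
            for r1 in adjacency]


def pathsim(counts, s, t):
    """Formula (2): twice the s-to-t count over the sum of the two self counts."""
    denominator = counts[s][s] + counts[t][t]
    return 2 * counts[s][t] / denominator if denominator else 0.0


# Rows are movies, columns are actors (toy data).
movie_actor = [[1, 1], [1, 0], [0, 1]]
counts = path_counts(movie_actor)
scores = [[pathsim(counts, s, t) for t in range(3)] for s in range(3)]
print(scores)
```

Every object has similarity 1 with itself, and two movies with no shared actor score 0, which follows directly from the counts in the numerator.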

Step 5, according to the weights of the selected data attributes, the multiple similarity results calculated in step 4 are fused into a unified similarity result, which is converted to the required format and output to the user;

Depending on the user's needs, the weighting can take various forms, such as the average, maximum or minimum over the data attributes, or a weighted average.
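The fusion options listed above can be sketched as a single function. The name `fuse`, the mode strings and the two small similarity matrices are hypothetical, and the weighted mode is assumed to divide by the sum of the weights.

```python
def fuse(matrices, mode="mean", weights=None):
    """Combine per-attribute similarity matrices element by element."""
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    fused = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            values = [m[i][j] for m in matrices]
            if mode == "mean":
                fused[i][j] = sum(values) / len(values)
            elif mode == "max":
                fused[i][j] = max(values)
            elif mode == "min":
                fused[i][j] = min(values)
            elif mode == "weighted":
                # Assumption: weights are normalized by their sum.
                fused[i][j] = sum(w * v for w, v in zip(weights, values)) / sum(weights)
            else:
                raise ValueError(f"unknown fusion mode: {mode}")
    return fused


actor_sim = [[1.0, 0.5], [0.5, 1.0]]
genre_sim = [[1.0, 0.0], [0.0, 1.0]]

mean_sim = fuse([actor_sim, genre_sim], mode="mean")
max_sim = fuse([actor_sim, genre_sim], mode="max")
weighted_sim = fuse([actor_sim, genre_sim], mode="weighted", weights=[3, 1])
print(mean_sim, max_sim, weighted_sim)
```

With weights 3 and 1, the actor-based similarity counts three times as much as the genre-based one in the unified result.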

Further, the formats of the input data set in step 1 include relational databases, NoSQL databases, ARFF files, CSV files, text files and Excel spreadsheets.

Further, in step 2, the global information includes the number of records in the input data set and the number of data attributes in each record, and the local information includes the identifier, data type and value range of each data attribute.

The preset processing library in step 3 is a library of basic processing methods for similarity calculation on data. The data types involved are usually limited to a few kinds, such as numeric values, times and strings, and each must be handled differently during similarity calculation. For a numeric attribute, two values can be treated as equal when they are strictly equal, or when they differ by no more than a given percentage or absolute error. For a time attribute, two values can be treated as equal when they fall on the same day, in the same month or in the same year, or when they differ by no more than a given percentage or range. For a string attribute, strict equality or approximate matching can usually be set. These common processing methods are the general solutions offered by the method for the user to choose from. Users can also define custom matching rules through user interaction, for example restricting the time to one particular day.
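A processing library of this kind might look like the sketch below. The function names, the parameter names and the dictionary `PROCESSING_LIBRARY` are all hypothetical. Approximate string matching is implemented here with `difflib.SequenceMatcher` from the Python standard library, which is one possible choice and is not named in the patent.

```python
from datetime import date
from difflib import SequenceMatcher


def numeric_match(a, b, abs_tol=0.0, rel_tol=0.0):
    """Equal if within an absolute error or a percentage of the larger value."""
    return abs(a - b) <= max(abs_tol, rel_tol * max(abs(a), abs(b)))


def date_match(a, b, granularity="day"):
    """Equal if both dates fall in the same day, month or year."""
    if granularity == "year":
        return a.year == b.year
    if granularity == "month":
        return (a.year, a.month) == (b.year, b.month)
    return a == b


def string_match(a, b, min_ratio=1.0):
    """Strict equality by default; approximate matching below a ratio of 1."""
    if min_ratio >= 1.0:
        return a == b
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= min_ratio


# Lookup keyed by the data type recorded in the metadata.
PROCESSING_LIBRARY = {
    "numeric": numeric_match,
    "time": date_match,
    "string": string_match,
}

same_year = PROCESSING_LIBRARY["time"](date(1995, 1, 1), date(1995, 6, 1), "year")
same_month = PROCESSING_LIBRARY["time"](date(1995, 1, 1), date(1995, 6, 1), "month")
close_numbers = PROCESSING_LIBRARY["numeric"](100, 104, rel_tol=0.05)
loose_names = PROCESSING_LIBRARY["string"]("Actor One", "actor one", min_ratio=0.9)
print(same_year, same_month, close_numbers, loose_names)
```

A user-defined rule, such as restricting a date to one particular day, could be added to the same dictionary under its own key.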

Embodiment 2. As shown in Fig. 4, the present invention also provides a general similarity calculation system based on a heterogeneous information network, including:

a processing module, which preprocesses the input data set to ensure the validity of the input data;

an extraction module, which extracts metadata, that is, the description information of the input data, and stores the description information in a metadata database, wherein the description information includes global information on the overall state of the input data set, local information on each record, and the conversion and correspondence information between the identifiers of the data attributes and their internal representations;

a modeling module, in which the user selects the entities and data attributes that take part in the similarity calculation; the corresponding metadata is queried, the data type and value range of each metadata item are displayed, and the user is prompted to select a processing method for the metadata from a preset processing library; with the entities as central nodes, the entities are linked according to their semantic relationships and each attribute is connected to its entity, generating a heterogeneous information network schema, which is then stored;

a calculation module, in which, in the heterogeneous information network schema, after the user specifies the data attributes and the meta-path, similarity is calculated according to the selected processing method;

a post-processing module, which, according to the weights of the selected data attributes, fuses the multiple similarity results calculated in step 4 into a unified similarity result, converts it to the required format and outputs it to the user.

Further, the preprocessing of the input data set by the processing module includes data cleaning and data integration, wherein:

data cleaning performs format conversion, noise elimination and removal of inconsistent data: the input data set is cleaned, useless noisy data is removed, and the corresponding format conversion is carried out;

data integration combines data from multiple data sources.

As shown in Fig. 5, the calculation module includes:

a specification module, in which the user specifies the data attributes and selects a meta-path; if several data attributes are involved, the user specifies the order in which they are linked, which forms the meta-path;

a construction module, which, after the user specifies a single data attribute, retrieves the data of that attribute and builds it into an adjacency matrix; if the user specifies a meta-path, the adjacency matrix is built from all relations on that meta-path; the adjacency matrix is then normalized;

an operation module, which, after the adjacency matrix has been normalized, uses matrix operations to obtain the similarity result matrix for the corresponding attribute.

Further, the formats of the input data set in the processing module include relational databases, NoSQL databases, ARFF files, CSV files, text files and Excel spreadsheets.

Further, in the extraction module, the global information includes the number of records in the input data set and the number of data attributes in each record, and the local information includes the identifier, data type and value range of each data attribute.

The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A general similarity calculation method based on a heterogeneous information network is characterized by comprising the following steps:

step 1, preprocessing an input data set to ensure the validity of the input data;

step 2, extracting metadata, namely description information of the input data, and storing the description information in a metadata database, wherein the description information comprises global information on the overall state of the input data set, local information on each record, and conversion and correspondence information between the identifiers of the data attributes and their internal representations;

step 3, a user selecting the entities and data attributes participating in the similarity calculation, querying the corresponding metadata, displaying the data type and value range of each metadata item, prompting the user to select a processing method for the metadata from a preset processing library, linking the entities according to their semantic relationships with the entities as central nodes, connecting each attribute to its corresponding entity, generating a heterogeneous information network schema, and storing the heterogeneous information network schema;

step 4, in the heterogeneous information network schema, after the user specifies data attributes and a meta-path, performing the similarity calculation according to the selected processing method;

step 5, fusing the plurality of similarity results calculated in step 4 according to the weights of the selected data attributes to obtain a unified similarity result, performing format conversion on the unified similarity result, and outputting it to the user.

2. The general similarity calculation method based on a heterogeneous information network according to claim 1, wherein in step 1, the preprocessing of the input data set comprises data cleaning and data integration;

wherein,

data cleaning comprises cleaning the input data set by removing useless noise data, deleting inconsistent data, and performing the corresponding format conversion;

data integration comprises combining data from multiple data sources.

3. The general similarity calculation method based on a heterogeneous information network according to claim 1, wherein step 4 specifically comprises:

step 401, the user specifies data attributes and selects a meta path, and if a plurality of data attributes are involved, the user specifies the link order of the plurality of data attributes to form the meta path;

step 402, after the user specifies a single data attribute, retrieving the data of the corresponding data attribute and constructing the data into an adjacency matrix, or, if the user specifies a meta path, constructing the adjacency matrix from all relations on the meta path, and then normalizing the adjacency matrix;

step 403, after the adjacency matrix is normalized, obtaining a similarity result matrix corresponding to the attribute by matrix operations.

4. The general similarity calculation method based on a heterogeneous information network according to claim 1, wherein the format of the input data set in step 1 comprises a relational database format, a NoSQL database format, an ARFF file, a CSV file, a text file, and an Excel spreadsheet.

5. The method according to claim 1, wherein in step 2, the global information includes the number of records of the input data set and the number of data attributes in each record, and the local information includes an identifier, a data type, and a value range of each data attribute.

6. A general similarity calculation system based on a heterogeneous information network, comprising:

the processing module is used for preprocessing the input data set and ensuring the validity of the input data;

the extraction module is used for extracting metadata: extracting description information of the input data and storing the description information in a metadata database, wherein the description information comprises global information on the overall state of the input data set, local information on each record, and the conversion and correspondence information between the identifier and the internal representation of each data attribute;

the modeling module is used for allowing a user to select the entities and data attributes that participate in the similarity calculation; querying the corresponding metadata and displaying the data type and value range of each metadata item; prompting the user to select a processing method for the metadata from a preset processing library; and, taking the entities as central nodes, linking the entities according to their semantic relations and connecting the attributes to their corresponding entities, thereby generating and storing a heterogeneous information network mode;

the computing module is used for carrying out similarity calculation in the heterogeneous information network mode according to the selected processing method after the user specifies data attributes and a meta path;

and the post-processing module is used for fusing the plurality of similarity results calculated in step 4 according to the weights of the selected data attributes to obtain a unified similarity result, and converting the format of the unified similarity result for output to the user.

7. The general similarity calculation system based on a heterogeneous information network according to claim 6, wherein the preprocessing of the input data set by the processing module comprises data cleaning and data integration;

wherein,

data integration comprises combining data from multiple data sources.

8. The general similarity calculation system based on a heterogeneous information network according to claim 6, wherein the computing module comprises:

the specifying module is used for allowing the user to specify data attributes and select a meta path, and if a plurality of data attributes are involved, the user specifies the link order of the plurality of data attributes to form the meta path;

the construction module is used for, after the user specifies a single data attribute, retrieving the data of the corresponding data attribute and constructing the data into an adjacency matrix, or, if the user specifies a meta path, constructing the adjacency matrix from all relations on the meta path, and then normalizing the adjacency matrix;

and the operation module is used for obtaining a similarity result matrix corresponding to the attribute by matrix operations after the adjacency matrix is normalized.

9. The general similarity calculation system based on a heterogeneous information network according to claim 6, wherein the format of the input data set in the processing module comprises a relational database format, a NoSQL database format, an ARFF file, a CSV file, a text file, and an Excel spreadsheet.

10. The system according to claim 6, wherein in the extraction module, the global information includes the number of records of the input data set and the number of data attributes in each record, and the local information includes an identifier, a data type, and a value range of each data attribute.
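
For illustration only, the adjacency-matrix construction, normalization, and matrix operation of steps 401 to 403, together with the weighted fusion of step 5, might be sketched as follows in Python. The claims do not fix a particular normalization or matrix operation, so this sketch assumes L2 row normalization followed by a product with the transpose (cosine similarity), and a fusion that divides the weighted sum by the total weight. The function names `matmul`, `normalize_rows`, `meta_path_similarity`, and `fuse` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_rows(m):
    """Scale each row to unit L2 norm (assumed normalization); zero rows stay zero."""
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm if norm else 0.0 for x in row])
    return out


def meta_path_similarity(adjacency_matrices):
    """Chain the adjacency matrices along a meta path, normalize, and compare rows.

    A single data attribute corresponds to a list with one adjacency matrix.
    """
    commuting = adjacency_matrices[0]
    for nxt in adjacency_matrices[1:]:
        commuting = matmul(commuting, nxt)
    unit = normalize_rows(commuting)
    transposed = [list(col) for col in zip(*unit)]
    # Product of unit rows with their transpose gives pairwise cosine similarity.
    return matmul(unit, transposed)


def fuse(similarities, weights):
    """Weighted average of several similarity matrices of identical shape."""
    total = sum(weights)
    n = len(similarities[0])
    return [
        [sum(w * s[i][j] for s, w in zip(similarities, weights)) / total
         for j in range(n)]
        for i in range(n)
    ]
```

In use, one similarity matrix would be computed per selected attribute or meta path (for example, a movie-actor matrix and a movie-director matrix) and the results passed to `fuse` with user-chosen weights. Other normalizations or similarity measures from the preset processing library could replace the cosine choice made here.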