CN110134874A

CN110134874A - A Collaborative Filtering Method for Optimizing User Similarity

Info

Publication number: CN110134874A
Application number: CN201910312071.8A
Authority: CN
Inventors: 安彦涵; 张新鹏; 吴汉舟; 余江; 王子驰
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2019-08-16

Abstract

The invention proposes a collaborative filtering method for optimizing user similarity. It improves the accuracy of the recommendation algorithm without increasing server latency. The method standardizes the users' rating data as a preprocessing step, then computes the Pearson similarity, the evaluation weight of the user vector distance, and the asymmetric similarity weight, and uses the two weights to optimize the Pearson similarity, thereby improving the recommendation accuracy of traditional collaborative filtering. The method is suitable for user-item rating datasets.

Description

A Collaborative Filtering Method for Optimizing User Similarity

Technical Field

For recommender systems based on collaborative filtering, the present invention proposes a collaborative filtering method for optimizing user similarity.

Background

The rapid development and spread of the Internet have made it far easier for users to obtain, share, and disseminate information. At the same time, the sharp growth in the volume of information has reduced how effectively it can be used: users struggle to find, in a timely manner, the information that is truly useful to them, which leads to the problem of information overload. One effective response is the recommender system, which recommends content and products of interest to users based on their needs and interests. Compared with a search engine, a recommender system studies users' interests and preferences and performs personalized computation to discover what interests them, guiding users to recognize their own information needs and to obtain useful information. A good recommender system not only provides personalized service but also builds close relationships among different users, so that users come to rely on its recommendations.

Recommender systems mainly fall into content-based filtering and collaborative filtering. A content-based recommender system derives the features of the items a user cares about from the user's past browsing or purchase records, and recommends new items that best match those interest features. A collaborative-filtering recommender system instead measures how similar users are by comparing their historical records, finds other users whose preferences resemble those of the target user, and recommends the items those users are interested in to the target user.

A content-based recommender system considers only the target user, whereas a collaborative-filtering recommender system makes full use of collective intelligence, that is, it draws answers from the behavior and data of a large population, and its recommendations are more personalized. Collaborative filtering is therefore the most widely used and effective recommendation algorithm in personalized recommendation services.

Collaborative-filtering recommender systems are further divided into model-based and memory-based systems. The former use methods from machine learning, data mining, and statistics to train on users' historical data, build a corresponding user model, and use that model to provide predictions and recommendations; the techniques involved include matrix factorization and latent semantic analysis. The latter are divided into user-based and item-based collaborative filtering recommender systems.

Although the traditional user-based collaborative filtering recommender system measures similarity with the Pearson formula, it does not preprocess the dataset, does not consider the distance between user rating vectors, and does not consider the asymmetry of the similarity relationship between users, all of which degrade recommendation quality. The present invention therefore optimizes the user-based collaborative filtering recommendation algorithm on these three points to improve recommendation quality.

Summary of the Invention

The present invention aims to reduce the mean absolute error of the traditional user-based collaborative filtering recommendation algorithm and to effectively improve the recommendation quality of the recommender system, by providing a collaborative filtering method for optimizing user similarity.

To achieve the above object, the present invention proposes the following technical solution:

A collaborative filtering method for optimizing user similarity standardizes each user's rating vector, optimizes the Pearson similarity by combining it with the evaluation weight of the user vector distance and the asymmetric similarity weight, and finally predicts user ratings. The specific steps are as follows:

1) Prepare the experimental database: collect the ratings given by a certain number of users to different items, and build an experimental database;

2) Standardization preprocessing: use the Z-score method to standardize each user's rating vector, and generate the user-item rating matrix from the standardized user rating vectors;

3) Compute the user similarity matrix: from the user-item rating matrix generated in step 2), compute the Pearson similarity, the evaluation weight of the user vector distance, and the asymmetric similarity weight; optimize the Pearson similarity by combining it with the evaluation weight of the user vector distance and the asymmetric similarity weight to obtain the optimized similarity formula; compute the similarity between each user and every other user with the optimized formula, and finally generate the similarity matrix;

4) Predict ratings: based on the similarity between the target user and the other users, determine the target user's set of neighbor users, and predict the target user's unrated items with the rating formula.

Compared with the prior art, the present invention has the following advantage:

The method optimizes user similarity within the recommendation algorithm module of collaborative filtering, so that the recommendation quality of the recommender system is effectively improved without increasing server latency.

Description of Drawings

Figure 1 is a flowchart of the method of the present invention.

Detailed Description

Specific embodiments of the present invention are further described below with reference to the accompanying drawing.

This embodiment analyzes the MovieLens-100k dataset (available for download from the website https://movielens.org/), which contains a total of 100,000 rating records from 943 users on 1682 movies. Ratings are integers from 1 to 5, where 1 is the lowest rating and 5 is the highest. Each user has rated at least 20 movies. 80% of the data is used as the training set and 20% as the test set.

As shown in Figure 1, a collaborative filtering method for optimizing user similarity standardizes each user's rating vector, optimizes the Pearson similarity by combining it with the evaluation weight of the user vector distance and the asymmetric similarity weight, and finally predicts user ratings. The specific steps are as follows:

1) Prepare the experimental database: collect the ratings given by a certain number of users to different items, and build an experimental database.

2) Standardization preprocessing: use the Z-score method to standardize each user's rating vector, and generate the user-item rating matrix from the standardized user rating vectors. The specific steps are as follows:

Let the rating vector of the u-th user in the training set be R_u = (r_(u,1), r_(u,2), …, r_(u,m)), where r_(u,m) denotes user u's rating of item m. As shown in formula (1), R_u is standardized with the Z-score method, where z_(u,m) is the standardized rating of item m by user u, r̄_u is the mean of the components of R_u, and σ_u is the standard deviation of the components of R_u:

z_(u,m) = (r_(u,m) − r̄_u) / σ_u (1)

The standardized rating vector of user u is denoted Z_u = (z_(u,1), z_(u,2), …, z_(u,m)); Z_u has mean 0 and standard deviation 1. A user-item rating matrix of size 943×1682 is generated, where 943 is the number of users and 1682 is the number of items. Z_u is stored in row u of the user-item rating matrix, and the rating value of any item that user u has not rated is recorded as 0.
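The standardization step can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the toy data, and the choice of the population standard deviation are assumptions, as is the handling of a user whose ratings are all identical (the text does not say what to do when σ_u is 0).

```python
from statistics import mean, pstdev


def zscore_user(ratings):
    """Standardize one user's ratings, given as {item_index: rating}."""
    values = list(ratings.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All ratings identical: no spread to scale by, so map them to 0.
        return {item: 0.0 for item in ratings}
    return {item: (r - mu) / sigma for item, r in ratings.items()}


def build_rating_matrix(all_ratings, n_items):
    """Return one row per user; unrated items stay at 0, as in step 2)."""
    matrix = []
    for user in sorted(all_ratings):
        row = [0.0] * n_items
        for item, z in zscore_user(all_ratings[user]).items():
            row[item] = z
        matrix.append(row)
    return matrix


# Toy data (hypothetical): two users, four items, 0-based item indices.
toy_ratings = {0: {0: 5, 1: 3, 3: 1}, 1: {1: 4, 2: 4}}
toy_matrix = build_rating_matrix(toy_ratings, n_items=4)
```

Because unrated items are stored as 0, which is also the standardized value of an exactly average rating, any later step that needs the set of rated items has to track it separately from the matrix.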

3) Compute the user similarity matrix: from the user-item rating matrix generated in step 2), compute the Pearson similarity, the evaluation weight of the user vector distance, and the asymmetric similarity weight; optimize the Pearson similarity by combining it with the evaluation weight of the user vector distance and the asymmetric similarity weight to obtain the optimized similarity formula; compute the similarity between each user and every other user with the optimized formula, and finally generate the similarity matrix. Taking any two users u and v in the MovieLens-100k training set as an example, the similarity of user u to user v is computed as follows:

3.1) Compute the Pearson similarity: as shown in formula (2), the Pearson similarity sim_(u,v) of user u and user v is measured with the Pearson similarity formula, where the set S is the set of items rated by both user u and user v.
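Formula (2) itself is not reproduced in this text, so the sketch below uses the textbook Pearson correlation restricted to the common item set S. Computing the means over S only, returning 0 when fewer than two items are shared, and the function name `pearson_similarity` are all assumptions made for illustration.

```python
from math import sqrt


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over the items rated by both users (the set S).

    Each argument maps item ids to (raw or standardized) ratings.
    """
    common = sorted(set(ratings_u) & set(ratings_v))  # the set S
    if len(common) < 2:
        return 0.0  # assumption: too little overlap counts as no similarity
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # a constant vector has no defined correlation
    return num / (den_u * den_v)


# Toy users (hypothetical): v rises with u, w falls as u rises.
user_u = {1: 1, 2: 2, 3: 3}
user_v = {1: 2, 2: 4, 3: 6, 4: 5}
user_w = {1: 3, 2: 2, 3: 1}
same_trend = pearson_similarity(user_u, user_v)
opposite_trend = pearson_similarity(user_u, user_w)
```

Item 4 is rated only by `user_v`, so it is outside S and does not affect the result.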

3.2) Compute the evaluation weight of the user vector distance: as shown in formula (3), compute the evaluation weight D_(u,v) of the user vector distance between Z_u and Z_v, where S is the set of items rated by both user u and user v, N(S) is the number of elements in S, and α is the threshold on the rating difference for a single item. The larger α is, the closer D_(u,v) is to 1; the smaller α is, the closer D_(u,v) is to 0.
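Formula (3) is not reproduced in this text, so its exact form is unknown. The sketch below is a hypothetical stand-in, not the patent's formula: it takes D_(u,v) to be the fraction of common items whose standardized ratings differ by at most α. It is offered only because it shows the stated behavior (a larger α pushes the weight toward 1, a smaller α toward 0); the function name and toy values are likewise assumptions.

```python
def distance_weight(z_u, z_v, alpha):
    """Hypothetical stand-in for D_(u,v); NOT the patent's formula (3).

    Returns the share of common items whose standardized ratings differ
    by no more than the per-item threshold alpha.
    """
    common = set(z_u) & set(z_v)  # the set S
    if not common:
        return 0.0
    close = sum(1 for i in common if abs(z_u[i] - z_v[i]) <= alpha)
    return close / len(common)  # divide by N(S)


# Toy standardized ratings (hypothetical values).
z_u = {1: 0.0, 2: 1.0, 3: -1.0}
z_v = {1: 0.2, 2: -0.5, 3: -1.1}
loose = distance_weight(z_u, z_v, alpha=2.0)
medium = distance_weight(z_u, z_v, alpha=0.5)
strict = distance_weight(z_u, z_v, alpha=0.05)
```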

3.3) Compute the asymmetric similarity weight: as shown in formula (4), compute the asymmetric similarity weight w_(u,v) of user u to user v, where S is the set of items rated by both user u and user v, I_u is the set of items rated by user u, N(S) is the number of elements in S, and N(I_u) is the number of elements in I_u.
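Formula (4) is not reproduced in this text. Given that it is defined in terms of N(S) and N(I_u) only, a natural reading is the ratio N(S) / N(I_u), and the sketch below assumes exactly that. The function name and the toy item sets are hypothetical.

```python
def asymmetric_weight(items_u, items_v):
    """Assumed form of w_(u,v): N(S) / N(I_u), the share of u's rated
    items that v has also rated."""
    items_u, items_v = set(items_u), set(items_v)
    if not items_u:
        return 0.0
    return len(items_u & items_v) / len(items_u)


# Toy item sets (hypothetical): u rated 4 items, v rated 8, 2 in common.
rated_by_u = {1, 2, 3, 4}
rated_by_v = {3, 4, 5, 6, 7, 8, 9, 10}
w_uv = asymmetric_weight(rated_by_u, rated_by_v)
w_vu = asymmetric_weight(rated_by_v, rated_by_u)
```

Under this reading the weight is not symmetric: the two shared items are half of what u has rated but only a quarter of what v has rated, so w_(u,v) and w_(v,u) differ.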

3.4) User similarity formula: as shown in formula (5), fusing formulas (2), (3), and (4) gives the optimized similarity of user u to user v, sim′_(u,v):

sim′_(u,v) = D_(u,v) * w_(u,v) * sim_(u,v) (5)

3.5) Compute the user similarity matrix: compute the similarity between different users according to formula (5), which finally yields a 943×943 user similarity matrix.
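Steps 3.4) and 3.5) can be sketched as follows. The product in `fused_similarity` is formula (5) as given in the text; everything else is an assumption made for illustration: the function names, the input layout (three precomputed n×n tables), the toy numbers, and leaving the diagonal at 0 so a user is never chosen as their own neighbor.

```python
def fused_similarity(d_uv, w_uv, sim_uv):
    """Formula (5): sim'_(u,v) = D_(u,v) * w_(u,v) * sim_(u,v)."""
    return d_uv * w_uv * sim_uv


def similarity_matrix(n_users, d, w, sim):
    """Build the n_users x n_users matrix of optimized similarities.

    d, w and sim are n_users x n_users tables of the three components.
    The diagonal is left at 0 (a choice made for this sketch).
    """
    matrix = [[0.0] * n_users for _ in range(n_users)]
    for u in range(n_users):
        for v in range(n_users):
            if u != v:
                matrix[u][v] = fused_similarity(d[u][v], w[u][v], sim[u][v])
    return matrix


# Toy component tables (hypothetical values) for two users.
d_table = [[1.0, 0.8], [0.8, 1.0]]
w_table = [[1.0, 0.5], [0.25, 1.0]]      # asymmetric: w[0][1] != w[1][0]
sim_table = [[1.0, 0.9], [0.9, 1.0]]
optimized = similarity_matrix(2, d_table, w_table, sim_table)
```

Because w_(u,v) is asymmetric, the resulting matrix is generally not symmetric either, so both triangles have to be computed and stored.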

4) Predict ratings: based on the similarity between the target user and the other users, determine the target user's set of neighbor users, and predict the target user's unrated items with the rating formula. In this example, an unrated item a of an arbitrary user u in the training set is used to show how the predicted rating of user u for item a is computed. The specific steps are as follows:

4.1) Determine the set of neighbor users: in the training set, find the set of users who have rated item a, denoted U_a = {u_(1,a), u_(2,a), …, u_(q,a)}, where u_(q,a) is the q-th user who has rated item a. Sort these q users in descending order of their similarity to user u, and denote the sorted set U′_a = {u′_(1,a), u′_(2,a), …, u′_(q,a)}. Then take the first k users of U′_a as the set of neighbor users of user u, denoted U = {u′_(1,a), u′_(2,a), …, u′_(k,a)}.
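The neighbor selection in step 4.1) amounts to a filter, a sort, and a cut. The sketch below assumes ratings are held as a dictionary per user and similarities as a row-indexed table; the function name and the toy data are hypothetical.

```python
def select_neighbors(u, item, ratings, sim_matrix, k):
    """Return the k users most similar to u among those who rated `item`.

    ratings: {user: {item: rating}}; sim_matrix[u][v] is sim'_(u,v).
    """
    # U_a: every other user who has rated the item.
    raters = [v for v in ratings if v != u and item in ratings[v]]
    # U'_a: the same users in descending order of similarity to u.
    raters.sort(key=lambda v: sim_matrix[u][v], reverse=True)
    # U: the first k users of the sorted list.
    return raters[:k]


# Toy data (hypothetical): users 1, 2 and 3 rated item 2; user 4 did not.
toy_ratings = {0: {1: 4}, 1: {2: 5}, 2: {2: 3}, 3: {2: 1}, 4: {3: 2}}
toy_sim = [
    [0.0, 0.2, 0.9, 0.5, 0.7],  # similarities of user 0 to users 0..4
    [0.0] * 5,
    [0.0] * 5,
    [0.0] * 5,
    [0.0] * 5,
]
top_two = select_neighbors(0, 2, toy_ratings, toy_sim, k=2)
```

User 4 is more similar to user 0 than user 3 is, but is excluded because user 4 has not rated the item.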

4.2) Predict user u's rating of item a: compute the predicted rating p_(u,a) of user u for item a according to formula (6), where the set U is the set of neighbor users of user u, r̄_u is the mean of the components of R_u, σ_u is the standard deviation of the components of R_u, z_(v,a) is user v's standardized rating of item a, and sim′_(u,v) is the similarity of user u to user v.
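Formula (6) is not reproduced in this text. The quantities it is defined with match the usual Z-score-based neighborhood predictor, so the sketch below assumes that form: a similarity-weighted average of the neighbors' standardized ratings, rescaled by σ_u and shifted by the mean of R_u. Normalizing by the sum of absolute similarities, falling back to the user's mean when there are no usable neighbors, and the function name are all assumptions.

```python
def predict_rating(mean_u, sigma_u, neighbor_terms):
    """Assumed form of formula (6):

        p_(u,a) = mean_u + sigma_u * sum(sim' * z) / sum(|sim'|)

    neighbor_terms is a list of (sim'_(u,v), z_(v,a)) pairs, one per
    neighbor v in U.
    """
    norm = sum(abs(s) for s, _ in neighbor_terms)
    if norm == 0:
        return mean_u  # assumption: no usable neighbors, use u's own mean
    weighted = sum(s * z for s, z in neighbor_terms) / norm
    return mean_u + sigma_u * weighted


# Toy case (hypothetical): one neighbor liked the item, one disliked it.
predicted = predict_rating(3.0, 1.0, [(0.5, 1.0), (0.25, -1.0)])
fallback = predict_rating(3.0, 1.0, [])
```

Multiplying by σ_u and adding the mean undoes the standardization of formula (1), so the prediction is back on user u's own rating scale.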

As shown in formula (7), the mean absolute error (MAE) is used to measure recommendation accuracy; a smaller MAE means a smaller error and higher accuracy. Here p_i is the user's predicted rating for item i, r_i is the user's actual rating of item i in the test set, and n is the number of ratings in the test set:

MAE = (1/n) Σ_(i=1..n) |p_i − r_i| (7)
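The MAE is the average of the absolute differences between predicted and actual ratings over the test set. A minimal sketch, with a hypothetical function name and toy numbers:

```python
def mean_absolute_error(predicted, actual):
    """MAE over paired lists of predicted and actual test-set ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one rating is required")
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)


# Toy example (hypothetical values): errors of 0.5, 0.0 and 1.0.
toy_mae = mean_absolute_error([3.5, 4.0, 2.0], [4, 4, 1])
```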

In this embodiment the neighbor set size is 10. The MAE of the present invention is 0.74086, which is 3.06% lower than that of the original method using Pearson similarity. Computing the user similarity matrix takes 61.3 seconds and rating prediction takes 7.04 seconds. In practice, the user similarity matrix can be computed offline, so when the recommendation algorithm of the present invention is used to predict item ratings online, the real-time computation time is almost the same as that of the original method, and the user's online waiting time does not increase.

Claims

1. A collaborative filtering method for optimizing user similarity, which standardizes each user's rating vector, optimizes the Pearson similarity by combining it with the evaluation weight of the user vector distance and the asymmetric similarity weight, and finally predicts user ratings, characterized in that the specific steps are as follows:

1) Prepare the experimental database: collect the scores of a certain number of users for different items, and establish an experimental database;

2) Standardization preprocessing: use the Z-score method to standardize the rating vector of each user, and generate a user-item rating matrix based on the standardized user rating vector;

3) Compute the user similarity matrix: from the user-item rating matrix generated in step 2), compute the Pearson similarity, the evaluation weight of the user vector distance, and the asymmetric similarity weight; optimize the Pearson similarity by combining it with the evaluation weight of the user vector distance and the asymmetric similarity weight to obtain the optimized similarity formula; compute the similarity between each user and every other user with the optimized formula, and finally generate the similarity matrix;

4) Predict ratings: based on the similarity between the target user and the other users, determine the target user's set of neighbor users, and predict the target user's unrated items with the rating formula.

2. The collaborative filtering method for optimizing user similarity according to claim 1, characterized in that the specific steps of step 2) are as follows: let the rating vector of the u-th user in the training set be R_u = (r_(u,1), r_(u,2), …, r_(u,m)), where r_(u,m) is user u's rating of item m; as shown in formula (1), R_u is standardized with the Z-score method, where z_(u,m) is the standardized rating of item m by user u, r̄_u is the mean of the components of R_u, and σ_u is the standard deviation of the components of R_u:

z_(u,m) = (r_(u,m) − r̄_u) / σ_u (1)

The standardized rating vector of user u is denoted Z_u = (z_(u,1), z_(u,2), …, z_(u,m)); Z_u has mean 0 and standard deviation 1. The user-item rating matrix is then generated: Z_u is stored in row u of the user-item rating matrix, and the rating value of any item that user u has not rated is recorded as 0.

3. The collaborative filtering method for optimizing user similarity according to claim 1, characterized in that, in step 3), taking any two users u and v in the training set as an example, the similarity of user u to user v is computed as follows:

3.1) Compute the Pearson similarity: as shown in formula (2), the Pearson similarity sim_(u,v) of user u and user v is measured with the Pearson similarity formula, where the set S is the set of items rated by both user u and user v.

3.2) Compute the evaluation weight of the user vector distance: as shown in formula (3), compute the evaluation weight D_(u,v) of the user vector distance between Z_u and Z_v, where S is the set of items rated by both user u and user v, N(S) is the number of elements in S, and α is the threshold on the rating difference for a single item. The larger α is, the closer D_(u,v) is to 1; the smaller α is, the closer D_(u,v) is to 0.

3.3) Compute the asymmetric similarity weight: as shown in formula (4), compute the asymmetric similarity weight w_(u,v) of user u to user v, where S is the set of items rated by both user u and user v, I_u is the set of items rated by user u, N(S) is the number of elements in S, and N(I_u) is the number of elements in I_u.

3.4) User similarity formula: as shown in formula (5), fusing formulas (2), (3), and (4) gives the optimized similarity of user u to user v, sim′_(u,v):

sim′_(u,v) = D_(u,v) * w_(u,v) * sim_(u,v) (5)

3.5) Calculate the user similarity matrix: Calculate the similarity between different users according to formula (5), and finally obtain the user similarity matrix.

4. The collaborative filtering method for optimizing user similarity according to claim 1, characterized in that, in step 4), taking an unrated item a of an arbitrary user u in the training set as an example, the predicted rating of user u for item a is computed as follows:

4.1) Determine the set of neighbor users: in the training set, find the set of users who have rated item a, denoted U_a = {u_(1,a), u_(2,a), …, u_(q,a)}, where u_(q,a) is the q-th user who has rated item a. Sort these q users in descending order of their similarity to user u, and denote the sorted set U′_a = {u′_(1,a), u′_(2,a), …, u′_(q,a)}. Then take the first k users of U′_a as the set of neighbor users of user u, denoted U = {u′_(1,a), u′_(2,a), …, u′_(k,a)};

4.2) Predict user u's rating of item a: compute the predicted rating p_(u,a) of user u for item a according to formula (6), where the set U is the set of neighbor users of user u, r̄_u is the mean of the components of R_u, σ_u is the standard deviation of the components of R_u, z_(v,a) is user v's standardized rating of item a, and sim′_(u,v) is the similarity of user u to user v.

As shown in formula (7), the mean absolute error (MAE) is used to measure recommendation accuracy; a smaller MAE means a smaller error and higher accuracy. Here p_i is the user's predicted rating for item i, r_i is the user's actual rating of item i in the test set, and n is the number of ratings in the test set:

MAE = (1/n) Σ_(i=1..n) |p_i − r_i| (7)