
CN115618127A - A Neural Network Recommender System Collaborative Filtering Algorithm - Google Patents


Info

Publication number
CN115618127A
Authority
CN
China
Prior art keywords
users
item
user
project
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211273752.6A
Other languages
Chinese (zh)
Inventor
朵琳
龙国虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202211273752.6A
Publication of CN115618127A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a collaborative filtering algorithm for a neural network recommender system. A convolutional neural network model is first built from users' past behaviors and the attributes of the items themselves; it extracts features of users and item attributes, and the ratings are fitted through a fully connected layer. The idea of association rules is then introduced: an item association matrix is generated from the association relationships between items and is used to produce a candidate item set. A modified Pearson correlation coefficient is proposed, a user-rating matrix is built and optimized, and the rating similarity between users is calculated. Users are ranked by rating similarity from high to low, the top-ranked users are selected as nearest neighbors, the target user's ratings for unrated items are predicted, and the N items with the highest predicted ratings are recommended to the user. The movie recommendation algorithm based on convolutional and recurrent neural networks achieves good results on the corresponding evaluation metrics.

Figure 202211273752

Description

A Neural Network Recommender System Collaborative Filtering Algorithm

Technical field

The invention relates to the technical field of medical devices, and in particular to a collaborative filtering algorithm for a neural network recommender system.

Background

Advances in technology have made life increasingly convenient: people can see the world without leaving home. They share interesting experiences on social platforms and search the web extensively for content that interests them. Each person generates a large volume of behavioral data, so the amount of information online has exploded. In such an environment it is difficult to find meaningful data accurately and quickly. The most effective solution at present appears to be the recommender system, which analyzes the behaviors users generate while visiting web pages. A movie recommender system, for example, combines a user's viewing history, browsing history, and movie ratings, and then recommends movies the user is likely to enjoy. A mature movie recommender gives online users real-time recommendations, which increases their engagement with the website or application.

Recommender systems are now ubiquitous; e-commerce, books, music, social networking, and movies each have their own recommendation algorithms. The earliest approach ranked items by users' browsing data and recommended the top-ranked items to every user. Take movies as an example: a comedy that many users have clicked on ranks highly, but recommending it to a user who likes suspense films and dislikes comedies is inappropriate. Using this approach for online real-time recommendation can drive away large numbers of users, and a product that users do not like cannot survive. Many video and movie websites therefore place great weight on the quality and efficiency of their recommendation algorithms, and some companies fund competitions to obtain better ones. Netflix, for example, spent heavily on a video recommendation competition and applied the best-performing methods to its own movie recommendation. Methods such as association rules, singular value decomposition (SVD), and collaborative filtering brought new vitality to movie recommender systems and achieved very good recommendation results. However, traditional collaborative filtering cannot take context into account and cannot effectively capture changes in users' interests, so recommendation accuracy declines; in the cold-start case of a new user, the item to be predicted also has too few neighbors. A collaborative filtering algorithm for a neural network recommender system is therefore proposed.

Summary of the invention

The purpose of the invention is to provide a collaborative filtering algorithm for a neural network recommender system that addresses the problems of traditional collaborative filtering: it cannot take context into account when recommending, it cannot effectively capture changes in users' interests, which lowers recommendation accuracy, and, when facing the cold-start problem of a new user, the item to be predicted has too few neighbors.

The above technical purpose is achieved through the following technical solution:

A collaborative filtering algorithm for a neural network recommender system, comprising the following steps:

S1. A convolutional neural network model is first built from users' past behaviors and the attributes of the items themselves, so that more accurate recommendations can be provided to users;

The data set mainly contains two kinds of static objects, users and items. User attributes include user ID, gender, occupation, and age; item attributes include item ID, item type, and item name. These attributes are first encoded numerically, converting discrete variables into continuous-valued vectors. Categorical fields, such as the item type, are stored as character strings and are usually converted to numeric values with one-hot encoding;
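The one-hot step described above can be sketched as follows. This is a minimal illustration in plain Python; the function name `one_hot` and the sample genre values are hypothetical and do not come from the patent.

```python
def one_hot(values):
    """Map each distinct category to a 0/1 indicator vector."""
    categories = sorted(set(values))              # fixed, reproducible column order
    index = {c: k for k, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                         # exactly one position is set per value
        vectors.append(row)
    return categories, vectors


# Hypothetical item-type column stored as character strings.
genres = ["Comedy", "Thriller", "Comedy", "Drama"]
categories, encoded = one_hot(genres)
```

Each row of `encoded` has a single 1 in the column of its category, so the string field becomes a numeric vector that a network can consume.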

S2. Features of users and item attributes are extracted, and the ratings are fitted through a fully connected layer;

Each attribute value is first converted to a number, which serves as an index into an embedding matrix with N indices. Each index is assigned a latent-factor vector of fixed size, usually 32, so the embedding matrix has dimensions (N, 32). After these operations on the data attributes, the user feature vector and the item feature vector are obtained, and the rating value can then be fitted;
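The index-to-vector lookup described above can be sketched as follows. This is a minimal illustration, assuming randomly initialised (untrained) latent factors; the helper names `encode_attribute` and `build_embedding` and the occupation values are hypothetical, and in a real model the table would be learned together with the fully connected layers.

```python
import random

EMBED_DIM = 32  # latent-factor size stated in the text


def encode_attribute(values):
    """Assign each distinct attribute value an integer index, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index


def build_embedding(num_indices, dim=EMBED_DIM, seed=0):
    """Create an (N, dim) table of small random latent factors (assumed initialisation)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_indices)]


# Hypothetical occupation column.
occupations = ["student", "engineer", "student", "artist"]
occ_index = encode_attribute(occupations)
occ_table = build_embedding(len(occ_index))        # shape (N, 32) with N = 3
user_vector = occ_table[occ_index["engineer"]]     # row lookup by index
```

Concatenating such rows over all of a user's attributes gives one possible form of the user feature vector; the item feature vector is built the same way.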

S3. The idea of association rules is introduced: an item association matrix is generated from the association relationships between items and is used to generate the candidate item set;

The association degree expresses the strength of the relationship between items. This method defines it as the likelihood that a user who has browsed one item goes on to browse another. The association degree is denoted r, and r_ij is the association degree of item j with respect to item i, calculated as the number of users who browsed both item i and item j divided by the number of users who browsed item i:

Figure BDA0003895662740000041

where N_i and N_j are the numbers of users who rated item i and item j, respectively. From the association degrees, an n×n item association matrix G can be built whose entry r_ij (i, j = 1, 2, ..., n) is the association degree of item j with respect to item i. The main diagonal elements of the matrix are all 0. Association degrees generally differ from pair to pair, their magnitude indicates the strength of the relationship between items, and the matrix is generally asymmetric, that is, r_ij ≠ r_ji;

Figure BDA0003895662740000042

The association degree is introduced into the selection of the candidate item set: the item association matrix is used in place of the similarity matrix to generate the candidate item set;
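Step S3 can be sketched as follows, reading the verbal definition as r_ij = (users who browsed both i and j) / (users who browsed i). The formula images are not reproduced here, so this reading, the `threshold` parameter, and the rule for turning the matrix into a candidate set are assumptions made for illustration only.

```python
def association_matrix(item_users):
    """Build G with G[a][b] = |U_i & U_j| / |U_i| and a zero main diagonal."""
    items = sorted(item_users)
    n = len(items)
    G = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(items):
        for b, j in enumerate(items):
            if a == b or not item_users[i]:
                continue                      # diagonal stays 0; skip items nobody browsed
            shared = len(item_users[i] & item_users[j])
            G[a][b] = shared / len(item_users[i])
    return items, G


def candidate_items(seen, items, G, threshold=0.5):
    """Collect unseen items whose association with any seen item reaches the threshold."""
    pos = {item: k for k, item in enumerate(items)}
    candidates = set()
    for i in seen:
        for b, j in enumerate(items):
            if j not in seen and G[pos[i]][b] >= threshold:
                candidates.add(j)
    return candidates


# Hypothetical browsing data: item -> set of user IDs.
item_users = {"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {4, 5}}
items, G = association_matrix(item_users)
picked = candidate_items({"B"}, items, G)
```

In this toy data, G is asymmetric: the entry for (A, B) is 2/4 while the entry for (B, A) is 2/2, which matches the statement that r_ij and r_ji generally differ.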

S4. To address the poor accuracy of similarity measures on sparse data, a modified Pearson correlation coefficient is proposed. A user-rating matrix is built and optimized with a forgetting function, and the rating similarity between the target user and every other user is calculated. Users are ranked by rating similarity from high to low, and the top-ranked users are selected as the target user's nearest neighbors. The target user's ratings for unrated items are predicted from the neighbors' ratings, and the N items with the highest predicted ratings are recommended to the user;

When selecting co-rating data, the number of co-rating users must also be considered: the higher the proportion of co-rating users among the users who rated two items, the more similar the two items are, to a certain extent. For example, suppose item i_1 is rated by 10 users, item i_2 by 12 users, and item i_3 by 50 users. If items i_1 and i_2 share 8 co-rating users and items i_1 and i_3 also share 8, then i_1 and i_2 are considered more similar. Likewise, items i_1 and i_2 are more similar with 8 co-rating users than with 5. A modified Pearson correlation coefficient is therefore proposed:

Figure BDA0003895662740000051

where U_i, U_j, and U are the sets of users who rated item i, item j, and both item i and item j, respectively; |U_i|, |U_j|, and |U| are the numbers of elements in these sets; r_ui and r_uj are user u's ratings of item i and item j; and r̄_i and r̄_j (Figure BDA0003895662740000052) are the average ratings of item i and item j;
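The effect of the co-rating proportion can be sketched as follows. The formula image is not reproduced here, so the exact correction is unknown; this sketch assumes one plausible form, the ordinary Pearson coefficient multiplied by |U| / |U_i ∪ U_j|, and the names `overlap_weight` and `modified_pearson` are hypothetical.

```python
from math import sqrt


def overlap_weight(n_i, n_j, n_common):
    """Assumed correction factor: share of co-rating users among all raters of either item."""
    union = n_i + n_j - n_common
    return n_common / union if union else 0.0


def modified_pearson(ratings_i, ratings_j):
    """Pearson correlation over co-rating users, scaled by the assumed overlap weight.

    ratings_i and ratings_j map user ID -> rating for items i and j.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0
    mean_i = sum(ratings_i.values()) / len(ratings_i)   # average rating of item i
    mean_j = sum(ratings_j.values()) / len(ratings_j)   # average rating of item j
    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    den_i = sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common))
    den_j = sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    if den_i == 0 or den_j == 0:
        return 0.0
    weight = overlap_weight(len(ratings_i), len(ratings_j), len(common))
    return weight * num / (den_i * den_j)


# The example from the text: 10, 12, and 50 raters with 8 co-rating users.
w_12 = overlap_weight(10, 12, 8)
w_13 = overlap_weight(10, 50, 8)
sim = modified_pearson({1: 5, 2: 3, 3: 1}, {1: 4, 2: 3, 3: 2})
```

Under this assumed weight, the pair (i_1, i_2) is favored over (i_1, i_3), and 8 co-rating users count for more than 5, which is the ordering the text requires.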

The algorithm is as follows:

(1) Environment setup

Let user u be a new user of the current system with no historical behavior information there, but with historical behavior information in other systems such as social networking, e-commerce, or video sites (obtainable through authorization or through a cross-site cookie held by the current system);

(2) Construction of the user relationship network

A relationship network between users is built from their historical information in other systems and is then divided into communities. When the network is built from users' purchase or rating information, the following similarity formula is used:

Figure BDA0003895662740000053

where r_u and r_v are the rating vectors of users u and v, covar(r_u, r_v) is the covariance of r_u and r_v, and the symbols shown in Figure BDA0003895662740000054 and Figure BDA0003895662740000055 are the standard deviations of the non-zero elements of r_u and r_v, respectively [20];

Note that a similarity of 0 means no edge is added between the two users. The two vectors must overlap in at least three elements; otherwise the similarity between the two users is taken to be 0;
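The network construction in step (2) can be sketched as follows. The formula image is not reproduced here; this sketch assumes the similarity is covariance divided by the product of standard deviations, computed over the items both users rated (non-zero in both vectors). The names `user_similarity` and `build_edges` and the sample ratings are hypothetical.

```python
from math import sqrt

MIN_OVERLAP = 3  # fewer than three co-rated items gives similarity 0, as stated in the text


def user_similarity(r_u, r_v):
    """Pearson-style similarity of two rating vectors, where 0 means unrated."""
    common = [k for k in range(len(r_u)) if r_u[k] != 0 and r_v[k] != 0]
    if len(common) < MIN_OVERLAP:
        return 0.0
    xs = [r_u[k] for k in common]
    ys = [r_v[k] for k in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(common)
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / len(common))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / len(common))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def build_edges(ratings):
    """Add a weighted edge for every user pair whose similarity is not 0."""
    users = sorted(ratings)
    edges = {}
    for a, u in enumerate(users):
        for v in users[a + 1:]:
            s = user_similarity(ratings[u], ratings[v])
            if s != 0:
                edges[(u, v)] = s
    return edges


# Hypothetical rating vectors over four items.
ratings = {"u": [5, 3, 0, 1], "v": [4, 2, 0, 1], "w": [0, 0, 5, 1]}
edges = build_edges(ratings)
```

Users u and v share three rated items and receive an edge; w shares only one rated item with each of them, so no edge is added.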

If a user has friends on a social networking site, that is, a circle of friends, community division can be applied to that friendship network directly;

(3) Community division

Community division uses the principal modularity maximization (PMM) method, whose basic principle is to maximize the modularity index Q. Q measures the strength of a community division of the real network by the difference between the connections inside communities and those expected in a random network with the same degree distribution. It is defined as:

Figure BDA0003895662740000061
where
Figure BDA0003895662740000062

S is the membership matrix between nodes and communities: S_uC is 1 if node u belongs to community C and 0 otherwise. B is the modularity matrix, A is the adjacency matrix of the node relationships, d is the degree sequence of all nodes, and m is the total number of edges in the relationship network. A value of Q below 0 indicates a very poor division;
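The modularity index can be sketched as follows. The formula images are not reproduced here, so this sketch assumes the standard definition that matches the symbols above, Q = Tr(SᵀBS) / (2m) with B = A − d dᵀ / (2m). It only evaluates Q for a given hard partition; it does not perform the PMM optimisation itself, and the toy network is hypothetical.

```python
def modularity(adjacency, community_of):
    """Q = (1 / 2m) * sum over same-community node pairs of (A_uv - d_u * d_v / 2m)."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)                      # every undirected edge is counted twice
    if two_m == 0:
        return 0.0
    total = 0.0
    n = len(adjacency)
    for u in range(n):
        for v in range(n):
            if community_of[u] == community_of[v]:
                total += adjacency[u][v] - degrees[u] * degrees[v] / two_m
    return total / two_m


# Hypothetical toy network: two triangles joined by a single edge.
edge_list = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edge_list:
    A[u][v] = A[v][u] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_merged = modularity(A, [0] * 6)             # everything in a single community
```

Splitting the network along the bridging edge gives a clearly positive Q, while placing all nodes in one community gives Q = 0, so the split is the better division under this measure.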

Because the relationships between users are multi-dimensional, including users' ratings of items and friendships between users, PMM divides communities according to these multi-dimensional relationships. Each element of the resulting matrix S indicates how strongly user u belongs to community C;

(4) Change in the neighbor selection strategy

Previously, the k most similar neighbors were selected from among all users; this method selects only from users in the same community as the target user u. The predicted rating of target user u for item j is:

Figure BDA0003895662740000071

where r̄_u (Figure BDA0003895662740000072) is the average rating of user u, a is a constant, C is the community of user v, r_v,j is the rating of user v for item j, and r̄_v (Figure BDA0003895662740000073) is the average rating of user v.
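The community-restricted prediction can be sketched as follows. The formula image is not reproduced here; this sketch assumes the common mean-centered form, r̄_u + a · Σ sim(u, v) · (r_v,j − r̄_v) over neighbors v in the same community, and assumes the constant a normalises by the sum of absolute similarities. The function names and all sample data are hypothetical.

```python
def mean_rating(user_ratings):
    """Average of a user's ratings (dict item -> rating); 0.0 if the user has none."""
    return sum(user_ratings.values()) / len(user_ratings) if user_ratings else 0.0


def predict(u, item, ratings, community, sim):
    """Predict u's rating of item from same-community users who rated it."""
    base = mean_rating(ratings[u])
    neighbours = [
        v for v in ratings
        if v != u and community[v] == community[u] and item in ratings[v]
    ]
    norm = sum(abs(sim(u, v)) for v in neighbours)
    if norm == 0:
        return base                            # no usable neighbour: fall back to the mean
    a = 1.0 / norm                             # assumed normalising constant
    offset = sum(sim(u, v) * (ratings[v][item] - mean_rating(ratings[v])) for v in neighbours)
    return base + a * offset


# Hypothetical data: w is similar to u but sits in another community, so it is ignored.
ratings = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 3, "c": 5},
    "w": {"a": 1, "c": 1},
    "x": {"a": 5, "c": 1},
}
community = {"u": 0, "v": 0, "w": 1, "x": 0}
sims = {frozenset(("u", "v")): 1.0, frozenset(("u", "x")): 0.5, frozenset(("u", "w")): 0.9}


def sim(p, q):
    return sims.get(frozenset((p, q)), 0.0)


prediction = predict("u", "c", ratings, community, sim)
```

Restricting the neighbor set to the target user's community is the only change relative to ordinary user-based collaborative filtering in this sketch.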

In summary, the invention has the following beneficial effects:

Simulation experiments validated the model on the public movie data set MovieLens. A series of comparative experiments shows that the movie recommendation algorithm based on convolutional and recurrent neural networks studied here achieves good results on the corresponding evaluation metrics.

Description of the drawings

Fig. 1 compares recommendation based on user community division with traditional collaborative filtering recommendation;

Fig. 2 compares the prediction accuracy of UBCF and IBCF;

Fig. 3 compares the recommendation precision of UBCF and IBCF;

Fig. 4 shows the size of the candidate item set;

Fig. 5 shows the proportion of items of interest to users;

Fig. 6 is part of the user-rating table;

Fig. 7 is part of the user information table;

Fig. 8 is the item table;

Fig. 9 compares the traditional and improved Pearson correlation coefficients;

Fig. 10 compares the traditional and improved vector cosine similarity;

Fig. 11 compares the traditional and improved adjusted vector cosine similarity;

Fig. 12 compares the traditional and improved Pearson correlation;

Fig. 13 shows the framework of the convolutional neural network model.

Detailed description

Embodiment: a collaborative filtering algorithm for a neural network recommender system, comprising the following steps:

S1. As shown in Fig. 13, a convolutional neural network model is first built from users' past behaviors and the attributes of the items themselves, so that more accurate recommendations can be provided to users;

The data set mainly contains two kinds of static objects, users and items. User attributes include user ID, gender, occupation, and age; item attributes include item ID, item type, and item name. In this embodiment the items are movies, so the item attributes are movie ID, movie genre, and movie title; the movie year is a new feature extracted from the movie title. Items could equally be books, music, and so on. These attributes are first encoded numerically, converting discrete variables into continuous-valued vectors. Categorical fields, such as the item type, are stored as character strings and are usually converted to numeric values with one-hot encoding;

S2. Features of users and item attributes are extracted, and the ratings are fitted through a fully connected layer;

Each attribute value is first converted to a number, which serves as an index into an embedding matrix with N indices. Each index is assigned a latent-factor vector of fixed size, usually 32, so the embedding matrix has dimensions (N, 32). After these operations on the data attributes, the user feature vector and the item feature vector are obtained, and the rating value can then be fitted;

S3. The idea of association rules is introduced: an item association matrix is generated from the association relationships between items and is used to generate the candidate item set;

The association degree expresses the strength of the relationship between items. This method defines it as the likelihood that a user who has browsed one item goes on to browse another. The association degree is denoted r, and r_ij is the association degree of item j with respect to item i, calculated as the number of users who browsed both item i and item j divided by the number of users who browsed item i:

Figure BDA0003895662740000091

where N_i and N_j are the numbers of users who rated item i and item j, respectively. From the association degrees, an n×n item association matrix G can be built whose entry r_ij (i, j = 1, 2, ..., n) is the association degree of item j with respect to item i. The main diagonal elements of the matrix are all 0. Association degrees generally differ from pair to pair, their magnitude indicates the strength of the relationship between items, and the matrix is generally asymmetric, that is, r_ij ≠ r_ji;

Figure BDA0003895662740000092

The association degree is introduced into the selection of the candidate item set: the item association matrix is used in place of the similarity matrix to generate the candidate item set;

S4. To address the poor accuracy of similarity measures on sparse data, a modified Pearson correlation coefficient is proposed. A user-rating matrix is built and optimized with a forgetting function, and the rating similarity between the target user and every other user is calculated. Users are ranked by rating similarity from high to low, and the top-ranked users are selected as the target user's nearest neighbors. The target user's ratings for unrated items are predicted from the neighbors' ratings, and the N items with the highest predicted ratings are recommended to the user;

When selecting co-rating data, the number of co-rating users must also be considered: the higher the proportion of co-rating users among the users who rated two items, the more similar the two items are, to a certain extent. For example, suppose item i_1 is rated by 10 users, item i_2 by 12 users, and item i_3 by 50 users. If items i_1 and i_2 share 8 co-rating users and items i_1 and i_3 also share 8, then i_1 and i_2 are considered more similar. Likewise, items i_1 and i_2 are more similar with 8 co-rating users than with 5. A modified Pearson correlation coefficient is therefore proposed:

Figure BDA0003895662740000101

where U_i, U_j, and U are the sets of users who rated item i, item j, and both item i and item j, respectively; |U_i|, |U_j|, and |U| are the numbers of elements in these sets; r_ui and r_uj are user u's ratings of item i and item j; and r̄_i and r̄_j (Figure BDA0003895662740000102) are the average ratings of item i and item j;

The algorithm is as follows:

(1) Environment setup

Let user u be a new user of the current system with no historical behavior information there, but with historical behavior information in other systems such as social networking, e-commerce, or video sites (obtainable through authorization or through a cross-site cookie held by the current system);

(2) Construction of the user relationship network

A relationship network between users is built from their historical information in other systems and is then divided into communities. When the network is built from users' purchase or rating information, the following similarity formula is used:

Figure BDA0003895662740000111

where r_u and r_v are the rating vectors of users u and v, covar(r_u, r_v) is the covariance of r_u and r_v, and the symbols shown in Figure BDA0003895662740000112 and Figure BDA0003895662740000113 are the standard deviations of the non-zero elements of r_u and r_v, respectively [20];

Note that a similarity of 0 means no edge is added between the two users. The two vectors must overlap in at least three elements; otherwise the similarity between the two users is taken to be 0;

If a user has friends on a social networking site, that is, a circle of friends, community division can be applied to that friendship network directly;

(3) Community division

Community division uses the principal modularity maximization (PMM) method, whose basic principle is to maximize the modularity index Q. Q measures the strength of a community division of the real network by the difference between the connections inside communities and those expected in a random network with the same degree distribution. It is defined as:

Figure BDA0003895662740000114
where
Figure BDA0003895662740000115

S is the membership matrix between nodes and communities: S_uC is 1 if node u belongs to community C and 0 otherwise. B is the modularity matrix, A is the adjacency matrix of the node relationships, d is the degree sequence of all nodes, and m is the total number of edges in the relationship network. A value of Q below 0 indicates a very poor division;

由于包括用户对项目的评分、用户之间的好友关系等,考虑的是用户之间的多维度关系,于是PMM就根据这多维度关系进行社区划分,得到划分结果S中的元素表示用户u之于社团C的归属度是多高;Considering the multi-dimensional relationship between users, including the user's rating of the project, the friendship between users, etc., the PMM divides the community according to the multi-dimensional relationship, and the elements in the division result S represent the user u. How high is the degree of belonging to community C;

(4) Change in the similar-neighbor selection strategy

Previously, the k most similar neighbors were selected from all users; this method selects them only from users in the same community as the target user u. The predicted rating of target user u for item j is:

Figure BDA0003895662740000121

where $\bar{r}_u$ is the average rating of user u, a is a constant, C is the community in which user v is located, $r_{v,j}$ is user v's rating of item j, and $\bar{r}_v$ is the average rating of user v.
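A community-restricted prediction of this kind can be sketched as below. The exact formula is only available as an image, so this sketch assumes the common mean-centred form: the user's average rating plus a normalized, similarity-weighted sum of neighbor deviations, with the constant a taken as one over the sum of absolute similarities. The data layout (`ratings`, `sims`, `community`) and the function name `predict_rating` are hypothetical.

```python
def predict_rating(u, j, ratings, sims, community, k=10):
    """ratings: {user: {item: rating}}; sims: {(u, v): similarity};
    community: {user: community label}."""
    def mean(user):
        vals = list(ratings.get(user, {}).values())
        return sum(vals) / len(vals) if vals else 0.0

    # Candidate neighbors: same community as u and have rated item j.
    candidates = [
        (sims.get((u, v), sims.get((v, u), 0.0)), v)
        for v in ratings
        if v != u and community.get(v) == community.get(u) and j in ratings[v]
    ]
    neighbours = sorted(candidates, reverse=True)[:k]  # k most similar
    norm = sum(abs(s) for s, _ in neighbours)
    if norm == 0:
        return mean(u)  # no usable neighbor: fall back to the user's average
    offset = sum(s * (ratings[v][j] - mean(v)) for s, v in neighbours)
    return mean(u) + offset / norm  # assumes a = 1 / norm
```

Users outside the target user's community are ignored even when their similarity is higher, which is the point of the changed strategy.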

Simulation experiment:

The dataset for this experiment comes from the imhonet website, a multi-faceted social networking site. It contains users' online connections together with their ratings of books and movies. Every user has at least one connection; 48,000 users gave 900,000 ratings to 50,000 movies, and 13,000 users gave 1.2 million ratings to 14,000 books. Each rating record consists of a user ID, an item ID, and a rating value (0-10, where 0 means unrated). From this dataset, the users with at least 20 ratings on books and movies were selected. The resulting experimental dataset contains 2,363,394 ratings by 6,089 users on 12,621 movies and 1,138,401 ratings by the same users on 17,907 books.

20% of the users in the dataset form the test set and the remaining 80% form the training set. To simulate the cold-start problem, all book ratings of the test-set users are deleted, so that with respect to books every test-set user is a new user whose book ratings must be predicted.
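The split described above can be sketched as follows. This is an illustration under assumptions: ratings are tuples whose first element is the user ID, the split is random with a fixed seed, and the function name `cold_start_split` is hypothetical.

```python
import random

def cold_start_split(users, book_ratings, test_fraction=0.2, seed=0):
    """Hold out a fraction of users and hide all of their book ratings."""
    rng = random.Random(seed)
    shuffled = sorted(users)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    test_users = set(shuffled[:n_test])
    # Training data keeps book ratings only from non-test users.
    train = [r for r in book_ratings if r[0] not in test_users]
    # Hidden ratings serve as ground truth when scoring predictions.
    held_out = [r for r in book_ratings if r[0] in test_users]
    return test_users, train, held_out
```

Movie ratings and social connections of the test users would stay available, so only the book domain is "cold" for them.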

This experiment is comparative. The baseline is the traditional collaborative filtering algorithm, which selects the N nearest neighbors for prediction and computes the similarity between users with the Pearson correlation coefficient. The comparison is made by varying the number of nearest neighbors from 10 to 100 in steps of 10. The evaluation metric is again MAE, which measures prediction accuracy as the deviation between predicted and actual user ratings; the smaller the MAE, the higher the quality of the recommendation algorithm.

The experimental results are shown in Figure 1. The dashed line with square markers is the rating prediction of traditional collaborative filtering based on Pearson correlation, and the dashed line with rectangular markers is the rating prediction based on user community division.

Experimental conclusions

The results show that rating prediction based on user community division outperforms traditional collaborative filtering. In particular, once the number of nearest neighbors exceeds 40, the recommendation quality of traditional collaborative filtering levels off, whereas the community-based prediction continues to improve. The cold-start problem of recommender systems was simulated by deleting users' book ratings from the dataset, and the community-based method stays more accurate than the traditional one as the number of neighbors grows; that is, the algorithm can overcome the cold-start problem and improve recommendation quality.

Figures 2 and 3 compare item-based collaborative filtering (IBCF) and user-based collaborative filtering (UBCF) on the MovieLens dataset in terms of MAE and Precision, respectively. The dataset is split 80%-20% into training and test sets and the experiments use 5-fold cross-validation; cosine similarity is used for both user-user and item-item similarity. In the MAE comparison, the number of neighbors knear increases from 5 to 50 in steps of 5. In the Precision comparison, the number of recommendations topN increases from 2 to 20 in steps of 2; when generating the candidate item set, the 5 most similar items or users are taken for each rated item or user (the number of candidate neighbors is 5), and the number of neighbors is 30 for UBCF and 15 for IBCF (the values differ because each algorithm already reaches a stable prediction accuracy at its respective value).

The two figures show that, when the data are quite sparse, IBCF has better prediction accuracy (MAE) than UBCF and does mitigate the effect of data sparsity to some extent, but its recommendation precision (Precision) is lower than that of UBCF. In fact, IBCF achieves a better MAE than UBCF in the vast majority of cases, not only when the data are sparse. In addition, the average candidate item set per user differs between the two algorithms: it is larger for UBCF, and it can be made smaller by reducing the number of candidate neighbors. When the average candidate item sets of the two algorithms are of similar size (2 candidate neighbors for UBCF and 5 for IBCF), the recommendation precision of UBCF is far better than that of IBCF.

This method analyzes the core process of the traditional item-based collaborative filtering algorithm: first generate the candidate item set, then predict the active user's ratings for the items in it. Both the size and the accuracy of the candidate set affect the final recommendation precision. The item-based algorithm operates on all rated items i ∈ I_u of user u: for each rated item it reads the set of k nearest neighbors N_i = {i_1, i_2, …, i_k}, merges all N_i, and removes the items already rated in I_u to obtain the candidate item set C.
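The candidate-set construction just described amounts to a union followed by a set difference. The sketch below assumes a precomputed mapping `neighbours` from each item to its neighbor list sorted by decreasing similarity; that mapping and the function name `candidate_items` are hypothetical.

```python
def candidate_items(rated_items, neighbours, k):
    """Merge the k nearest neighbors of every rated item, then drop rated items."""
    candidates = set()
    for i in rated_items:
        # N_i: the k most similar items to i (list assumed sorted by similarity).
        candidates.update(neighbours.get(i, [])[:k])
    return candidates - set(rated_items)
```

Because every rated item contributes up to k neighbors, the candidate set can grow quickly with k, which is the effect the following figures examine.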

Figure 4 shows, for the MovieLens dataset, how the average size of a user's candidate item set produced by item-based collaborative filtering varies with the number of candidate-set neighbors k. Cosine similarity is used to compute item-item similarity. The horizontal axis is k, increasing from 2 to 20 in steps of 2; the vertical axis is the average size of a user's candidate item set, i.e., the number of candidate items. When k = 2 the average candidate item set already exceeds 50 items; it grows sharply as k increases, and when k = 20 the average candidate set per user exceeds 300 items.

Figure 5 shows, under the same experimental conditions, the proportion of the candidate item set made up of items the user is interested in. The horizontal axis is the number of candidate-set neighbors k, and the vertical axis is the percentage of items of interest. The proportion peaks at about 12.2% when k = 2, declines gradually as k increases, and has fallen to about 4.8% when k = 20. Clearly, with item-based collaborative filtering, items the user is interested in make up only a very small share of the candidate item set. Item-based collaborative filtering does not take differences between users into account, so items the user is not interested in easily enter the candidate item set and recommendation precision suffers. Moreover, an oversized candidate item set lengthens the subsequent step of predicting ratings for the candidate items, which directly affects the scalability of the system.

The following two tables show the MAE of several similarity calculation methods on the MovieLens dataset and on the Jester dataset, respectively. For each method the tables list the best, worst, and average MAE and the number of neighbors at which the best value is reached. The first table gives the results on MovieLens, where the number of neighbors knear increases from 5 to 50 in steps of 5. The second table gives the results on Jester, where knear increases from 2 to 30 in steps of 2.

The first table shows that cosine similarity performs very well in terms of both the best and the average value, whereas the Pearson correlation coefficient performs less well when the data are sparse, because few users have rated both items of a pair and the computed result tends to be either too large or too small. The second table shows that the Pearson correlation coefficient improves markedly and outperforms the other traditional methods, because as sparsity decreases it has enough co-rating users to work with. The proposed modified Pearson correlation coefficient performs well in both the sparse and the less sparse case; with sparse data in particular it far outperforms the Pearson correlation coefficient. As sparsity decreases, the gap between the two narrows, and both perform very well. The tables also show that with sparse data the Pearson correlation coefficient converges slowly, reaching its best value only at k = 50, and that its convergence improves markedly as sparsity decreases. The modified Pearson correlation coefficient converges well in both cases, reaching its best value at about k = 15 and k = 10, respectively.

Table 1 Comparison of similarity calculation methods - 1

Figure BDA0003895662740000161

Figure BDA0003895662740000171

Table 2 Comparison of similarity calculation methods - 2

Figure BDA0003895662740000172

This experiment is comparative: the proposed improved algorithm is compared with several other recommendation algorithms to verify its effectiveness.

The experimental data are the 100K dataset collected from the MovieLens website, which contains 1,682 movies, 943 users, and 100,000 ratings of these movies by these users. Ratings take the five levels 1, 2, 3, 4, and 5, and every user has rated at least 20 movies. With m users U = {u_1, u_2, u_3, …, u_m} and n movies M = {m_1, m_2, m_3, …, m_n}, the users' ratings of the movies can be represented by an m × n matrix R, where r_ui is user u's rating of movie i and r_ui = 0 if there is no rating.

First, the similarity between users is computed with Pearson correlation, vector cosine, and modified vector cosine, and then with the improved algorithm proposed here. Similarities are computed from the data in the training set, and user ratings are then predicted for the data in the prediction set. Two experiments were carried out. In the first, the neighbors of the target user are chosen by Top-N selection, and different numbers of neighbors are used to verify the feasibility and effectiveness of the proposed algorithm. In the second, the similarity threshold is increased step by step from 0.1 to 0.9 in increments of 0.1, and the prediction accuracy is checked at each threshold.

Recommender system performance evaluation criteria

Mean absolute error (MAE) measures the accuracy of a recommender system's predictions as the deviation between predicted and actual user ratings. The smaller the MAE, the higher the recommendation quality of the algorithm. Let the set of predicted user ratings be {t_1, t_2, …, t_n} and the corresponding set of actual ratings be {p_1, p_2, …, p_n}; then MAE is:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|t_i - p_i\right|$$
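As a small worked example of the metric, the sketch below computes MAE for paired lists of predicted ratings t_i and actual ratings p_i, assuming the usual definition as the mean of the absolute differences. The function name `mean_absolute_error` is chosen for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute deviation between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(t - p) for t, p in zip(predicted, actual)) / len(predicted)
```

For predictions (4, 3, 5) against actual ratings (5, 3, 3) the absolute errors are 1, 0, and 2, so the MAE is 1.0.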

Recall and Precision are two metrics widely used in statistical classification and information retrieval; here they are applied to recommender systems to evaluate the quality of the results. Recall is the probability that a resource rated by the user appears in the recommendation list generated by the system, and it reflects how completely the user's interests are covered. Let L_i be the recommendation list generated by the system for user u_i and T_i be the set of resources in the test set that user u_i is interested in; recall is calculated as:

Figure BDA0003895662740000191

Precision, as the name suggests, is the proportion of resources in the generated recommendation list that the user is actually interested in, that is, the accuracy with which user interests are predicted:

Figure BDA0003895662740000192
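The two metrics can be illustrated with the sketch below. The exact formulas are only available as images, so the sketch assumes the common aggregated form: hits |L_i ∩ T_i| summed over all users, divided by the total size of the T_i for recall and by the total size of the L_i for precision. The function name `precision_recall` and the dictionary layout are hypothetical.

```python
def precision_recall(recommended, relevant):
    """recommended: {user: list L_i}; relevant: {user: set T_i}."""
    hits = sum(len(set(recommended[u]) & set(relevant.get(u, ())))
               for u in recommended)
    n_recommended = sum(len(set(items)) for items in recommended.values())
    n_relevant = sum(len(set(relevant.get(u, ()))) for u in recommended)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall
```

With two users, four recommended items in total, five relevant items in total, and three hits, precision is 0.75 and recall is 0.6.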

Experimental data preparation and design

Data preparation: as shown in Figures 6 and 7, the experimental data are the MovieLens 100K dataset collected from the MovieLens website, comprising 100,000 ratings by 943 users on 1,682 movies. Ratings take the five levels 1, 2, 3, 4, and 5, and every user has rated at least 20 movies. 80% of the collected dataset is used as the training set (base set) and the remaining 20% as the prediction set (test set).

As shown in Figure 8, the user-rating table has four fields: user ID (userid), item ID (itemid), rating (rating), and timestamp (timestamp). The timestamp is Unix time, counted from January 1, 1970, UTC.
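A record with these four fields can be parsed as sketched below. The sketch assumes whitespace-separated fields in the order userid, itemid, rating, timestamp, as in the tab-separated rating file of the MovieLens 100K release; the `Rating` type and `parse_rating` function are hypothetical names.

```python
from datetime import datetime, timezone
from typing import NamedTuple

class Rating(NamedTuple):
    userid: int
    itemid: int
    rating: int
    timestamp: datetime

def parse_rating(line):
    """Parse one 'userid itemid rating timestamp' record."""
    userid, itemid, rating, ts = line.split()
    # The timestamp counts seconds since 1970-01-01 00:00:00 UTC.
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Rating(int(userid), int(itemid), int(rating), when)
```

Keeping the timestamp as a timezone-aware `datetime` makes it straightforward to compute the age of a rating, which a time-based weighting of ratings needs.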

Experimental design

Two groups of experiments were carried out. In the first group, the Pearson, vector cosine, and modified vector cosine similarity methods are each compared with the similarity method based on the forgetting curve; the effectiveness of the proposed algorithm is verified by varying the number of neighbors, and the results are measured by MAE. In the second group, Pearson similarity and the forgetting-curve-based similarity are compared by varying the similarity threshold. As before, the accuracy of predicted ratings is measured by the deviation between predicted and actual user ratings; the lower the MAE, the better the recommendation quality.

Experiment 1

In this group of experiments, the comparison is made by varying the number of nearest neighbors from 10 to 100 in steps of 10. Figure 9 shows the results for traditional Pearson similarity and the improved similarity algorithm (the solid red line is the traditional algorithm and the dashed blue line is the improved algorithm; the same convention applies below). Figure 9 shows that when the number of neighbors is 30 the two algorithms have roughly equal MAE, but as the number of neighbors increases further to 100 the improved similarity algorithm has the smaller MAE, indicating that it performs better.

Figure 10 shows the results for traditional vector cosine and the forgetting-curve-based vector cosine similarity algorithm. As the number of neighbors increases, the improved algorithm has a smaller MAE than traditional vector cosine, indicating that its predictions are more accurate.

Figure 11 shows the results for traditional modified vector cosine and the improved forgetting-curve-based similarity algorithm. As the number of neighbors increases from 10 to 100, the MAE of the improved algorithm is always smaller than that of traditional modified vector cosine, indicating that the forgetting-curve-based algorithm makes predictions more accurate when modified vector cosine is used to compute similarity.

All three comparisons above verify the effectiveness of the proposed algorithm. When computing the similarity between users, changes in users' ratings of resources caused by natural forgetting should be taken into account, since user interests change over time. Simulating the human forgetting process improves prediction accuracy.
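The idea of attenuating older ratings can be illustrated as follows. The forgetting function used by the method is not reproduced in this part of the text, so the sketch substitutes a plain exponential decay with a half-life; both the decay form and the `half_life_days` parameter are assumptions made purely for illustration, and `decayed_rating` is a hypothetical name.

```python
import math

def decayed_rating(rating, rated_at, now, half_life_days=180.0):
    """Attenuate a rating by its age; timestamps are Unix seconds.

    Assumed form: weight = 0.5 ** (age / half_life), not the patent's function.
    """
    age_days = max(0.0, (now - rated_at) / 86400.0)
    weight = math.exp(-math.log(2) * age_days / half_life_days)
    return rating * weight
```

The decayed values would replace the raw entries of the user-rating matrix before similarities are computed, so that recent ratings dominate.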

Experiment 2

Pearson similarity and the forgetting-curve-based similarity are compared by varying the similarity threshold to verify the effectiveness of the proposed algorithm. As before, the accuracy of predicted ratings is measured by the deviation between predicted and actual user ratings; the lower the MAE, the better the recommendation quality. The threshold increases from 0.1 to 0.9 in steps of 0.1; for each threshold, only neighbor users whose similarity exceeds the threshold are used, and the experiment is run separately. The results are shown in Figure 12. When the threshold is greater than 0.4, the MAE of the traditional Pearson correlation coefficient is clearly higher than that of the improved forgetting-curve-based algorithm, whereas below 0.4 the traditional method is somewhat better. In other words, the improved algorithm predicts more accurately when the threshold is greater than 0.4.
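Threshold-based neighbor selection, as opposed to taking a fixed number of neighbors, can be sketched as below. The dictionary layout and the names `neighbours_above` and `THRESHOLDS` are hypothetical; the strict "greater than" comparison follows the wording of the text.

```python
def neighbours_above(similarities, threshold):
    """Keep only neighbors whose similarity exceeds the threshold."""
    return {v: s for v, s in similarities.items() if s > threshold}

# Threshold sweep from 0.1 to 0.9 in steps of 0.1, as in the experiment.
THRESHOLDS = [round(0.1 * i, 1) for i in range(1, 10)]
```

A higher threshold leaves fewer but more similar neighbors, so some users may end up with no neighbors at all and need a fallback prediction.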

In summary, the results of the two groups of experiments show that, overall, the forgetting-curve-based similarity calculation performs better than the traditional algorithms. In a recommender system, attenuating user ratings according to the law of forgetting described by the Ebbinghaus forgetting curve can clearly improve the accuracy of the system's predictions. This also shows that the laws of human cognition can play an important role in recommender systems.

Claims (1)

1. A collaborative filtering algorithm for a neural network recommender system, characterized in that the method comprises the following steps:
S1. a convolutional neural network model is built from the user's past behaviors and the attributes of items, so as to provide more accurate recommendations to the user;
the dataset mainly comprises two kinds of static objects, users and items, wherein the user attributes comprise user ID, gender, occupation, and age, and the item attributes comprise item ID, item type, and item name; the data are first encoded numerically so that discrete variables are converted into continuous-valued vectors; category fields in the data, such as the item type attribute, are represented by characters and are usually converted into numerical values by one-hot encoding;
S2. features are extracted from the user attributes and the item attributes, and the ratings are fitted through fully connected layers;
attribute values are first converted into numbers, which serve as the N indices of an embedding matrix; each index is then assigned a latent-factor vector of fixed size, usually 32, so that the embedding matrix has dimension (N, 32); after the attribute data are processed in this way, the user feature vector and the item feature vector are obtained and the rating values are fitted;
S3. the idea of association rules is introduced: an item association matrix is generated from the association relations between items and is used to generate the candidate item set;
the association degree represents the strength of the relation between items and is defined as the likelihood that a user who has browsed one item also browses another; it is denoted r, where r_ij is the association degree of item j with respect to item i, defined as the ratio of the proportion of users who browsed both item i and item j to the proportion of users who browsed item i, and calculated as follows:
Figure FDA0003895662730000021
where N_i and N_j denote the numbers of users who rated item i and item j, respectively; an n × n item association matrix G is built from the association degrees r_ij, with all main diagonal elements equal to 0; the association degrees between items generally differ, their magnitude indicates the strength of the relation between items, and the item association matrix is generally asymmetric, i.e., r_ij ≠ r_ji:
Figure FDA0003895662730000022
the association degree is introduced into the selection of the candidate item set, and the item association matrix replaces the similarity matrix when generating the candidate item set;
S4. to address the poor accuracy of similarity under sparse data, a modified Pearson correlation coefficient is proposed: a user-rating matrix is built and optimized with a forgetting function; the rating similarity between the target user and the other users is calculated; the users are ranked by rating similarity from high to low and the top-ranked users are selected as the nearest neighbors of the target user; the predicted ratings of the target user for unrated items are calculated from the neighbors' rating information; and the N items with the highest predicted ratings are recommended to the user;
when selecting co-rating data, the number of co-rating users is also considered: the higher the proportion of co-rating users among the users who rated the two items, the more similar the two items are to some extent; for example, if item i_1 is rated by 10 users, item i_2 by 12 users, and item i_3 by 50 users, and items i_1 and i_2 as well as items i_1 and i_3 each have 8 co-rating users, then items i_1 and i_2 are considered more similar; likewise, two items with 8 co-rating users are more similar than two items with 5 co-rating users; for this reason a modified Pearson correlation coefficient is proposed, with the following formula:
Figure FDA0003895662730000031
where U_i, U_j, and U denote the sets of users who rated item i, item j, and both item i and item j, respectively; |U_i|, |U_j|, and |U| denote the numbers of elements in the sets U_i, U_j, and U; r_ui and r_uj are user u's ratings of item i and item j, respectively; and $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of item i and item j, respectively;
the algorithm is as follows:
(1) Environment setting
Let user u be a new user of the current system whose historical behavior information there is empty, but who already has historical behavior information in other systems (such as social networking, e-commerce, or video sites), obtained either through authorization or through a cross-site cookie held by the current system;
(2) Construction of a user relationship network
A relationship network among users is constructed from the users' historical information in other systems, and the users are then divided into communities; when the network is built from users' purchase or rating information, the following similarity formula is used:
Figure FDA0003895662730000033
where r_u and r_v denote the rating vectors of users u and v, respectively, covar(r_u, r_v) denotes the covariance of r_u and r_v, and $\sigma_{r_u}$ and $\sigma_{r_v}$ denote the standard deviations of the non-zero elements of r_u and r_v, respectively;
it should be noted that a similarity of 0 means that no edge is added between the two corresponding users, and the two vectors must overlap in at least three elements, otherwise the similarity between the two corresponding users is taken to be 0;
if a user has friends on a social networking site, that is, already has a friend circle, community division can be performed on that friendship network directly;
(3) Community partitioning
Community division uses principal modularity maximization (PMM); the basic principle is to maximize the modularity index Q, which measures the strength of a community division of the real network by the difference between the connections inside communities and those expected in a random network with the same degree distribution, and which is defined as follows:
Figure FDA0003895662730000043
wherein
Figure FDA0003895662730000044
S is an attribution relationship matrix of the node u and the community C, if the node u belongs to the community C, S uC If the number is 1, otherwise, the number is 0, B is a module matrix, A is an adjacent matrix of the node relationship, d is a degree sequence of all nodes, m is the total number of edges in the node relationship network, and Q is less than 0, which indicates that the partitioning effect is very poor;
because the multi-dimensional relationship among the users is considered due to the fact that the multi-dimensional relationship among the users comprises the scores of the users for the items, the friend relationships among the users and the like, the PMM performs community division according to the multi-dimensional relationship, and the element in the division result S shows how high the affiliation degree of the user u to the community C is;
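As a small illustration of the modularity index used in step (3), the sketch below evaluates Q for a given hard partition of a single-dimension network. It assumes the standard definitions $Q=\frac{1}{2m}\mathrm{Tr}(S^{T}BS)$ and $B=A-dd^{T}/(2m)$; the function name `modularity` is hypothetical, and the sketch only scores a partition rather than performing the PMM optimization itself.

```python
import numpy as np

def modularity(A, S):
    """Modularity Q of a partition.

    A: symmetric adjacency matrix (n x n).
    S: membership matrix (n x c), S[u, C] = 1 if node u is in community C.
    """
    A = np.asarray(A, dtype=float)
    S = np.asarray(S, dtype=float)
    d = A.sum(axis=1)                       # degree sequence of all nodes
    m = A.sum() / 2.0                       # total number of edges
    B = A - np.outer(d, d) / (2.0 * m)      # modularity matrix
    return float(np.trace(S.T @ B @ S) / (2.0 * m))

# Two disjoint triangles: nodes 0-2 and nodes 3-5.
A = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[a, b] = A[b, a] = 1.0

good = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
bad = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                [1, 0, 0], [0, 1, 0], [0, 0, 1]])
q_good = modularity(A, good)   # each triangle is its own community
q_bad = modularity(A, bad)     # every community mixes the two triangles
```

In this toy network the natural partition gives a positive Q, while the partition that pairs nodes from different triangles gives a negative Q, matching the remark that Q < 0 indicates a very poor division.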
(4) Changing the similar-neighbor selection strategy
Instead of selecting the k neighbors with the greatest similarity from among all users, only users in the same community as the target user u are selected, and the predicted score of target user u for item j is:
Figure FDA0003895662730000051
wherein $\bar{r}_u$ is the average score of user u, a is a constant, C is the community in which user v is located, $r_{v,j}$ is the score given by user v to item j, and $\bar{r}_v$ is the average score of user v.
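The community-restricted neighbor selection in step (4) could be sketched as below. The exact prediction formula of the claim is not reproduced here; as a clearly labeled substitute, the sketch uses the conventional mean-centered, similarity-weighted average from neighborhood-based collaborative filtering. The function name `predict_score`, its parameters, and the fallback to the target user's own average are all assumptions made for illustration.

```python
import numpy as np

def predict_score(u, j, ratings, sim, community, k=2):
    """Predict user u's score for item j using only same-community neighbors.

    ratings:   user-by-item score matrix, 0 marks 'not rated'.
    sim:       user-by-user similarity matrix.
    community: community label of each user (hard assignment, an assumption).
    """
    ratings = np.asarray(ratings, dtype=float)
    sim = np.asarray(sim, dtype=float)
    community = np.asarray(community)
    rated = ratings != 0

    def mean_score(x):
        # Average over the items user x has actually rated (0.0 if none).
        return ratings[x][rated[x]].mean() if rated[x].any() else 0.0

    # Candidates: other users in u's community who have rated item j.
    candidates = [v for v in range(len(ratings))
                  if v != u and community[v] == community[u] and rated[v, j]]
    # Keep the k candidates most similar to u.
    neighbors = sorted(candidates, key=lambda v: sim[u, v], reverse=True)[:k]

    base = mean_score(u)
    denom = sum(abs(sim[u, v]) for v in neighbors)
    if denom == 0:
        return float(base)                  # no usable neighbors
    num = sum(sim[u, v] * (ratings[v, j] - mean_score(v)) for v in neighbors)
    return float(base + num / denom)        # substitute formula, not the claimed one

ratings = [[5, 3, 0], [4, 2, 4], [1, 5, 2], [5, 4, 5]]
community = [0, 0, 1, 0]
sim = np.zeros((4, 4))
for a, b, s in [(0, 1, 0.9), (0, 2, 0.95), (0, 3, 0.5)]:
    sim[a, b] = sim[b, a] = s

p_top1 = predict_score(0, 2, ratings, sim, community, k=1)
p_top2 = predict_score(0, 2, ratings, sim, community, k=2)
```

In this toy data, user 2 is the most similar to user 0 but belongs to a different community, so it is never used as a neighbor; this is the behavioral change that step (4) describes.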
CN202211273752.6A 2022-10-18 2022-10-18 A Neural Network Recommender System Collaborative Filtering Algorithm Pending CN115618127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211273752.6A CN115618127A (en) 2022-10-18 2022-10-18 A Neural Network Recommender System Collaborative Filtering Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211273752.6A CN115618127A (en) 2022-10-18 2022-10-18 A Neural Network Recommender System Collaborative Filtering Algorithm

Publications (1)

Publication Number Publication Date
CN115618127A true CN115618127A (en) 2023-01-17

Family

ID=84863098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211273752.6A Pending CN115618127A (en) 2022-10-18 2022-10-18 A Neural Network Recommender System Collaborative Filtering Algorithm

Country Status (1)

Country Link
CN (1) CN115618127A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574177A (en) * 2024-01-15 2024-02-20 每日互动股份有限公司 Data processing method, device, medium and equipment for user wire expansion
CN117574177B (en) * 2024-01-15 2024-04-19 每日互动股份有限公司 Data processing method, device, medium and equipment for user wire expansion
CN118051675A (en) * 2024-02-05 2024-05-17 广州工程技术职业学院 Collaborative filtering-based movie recommendation method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110162700B (en) Training method, device and equipment for information recommendation and model and storage medium
WO2019205795A1 (en) Interest recommendation method, computer device, and storage medium
CN109918563B (en) Book recommendation method based on public data
CN103377250B (en) Top k based on neighborhood recommend method
CN111475744B (en) Personalized position recommendation method based on ensemble learning
CN115618127A (en) A Neural Network Recommender System Collaborative Filtering Algorithm
CN111475724A (en) Random walk social network event recommendation method based on user similarity
CN114861050A (en) Feature fusion recommendation method and system based on neural network
CN118071400A (en) Application method and system based on graph computing technology in information consumption field
Hassan et al. Performance analysis of neural networks-based multi-criteria recommender systems
CN116561426A (en) Personalized recommendation method based on graph comparison learning and negative interest propagation
CN113836393B (en) Cold start recommendation method based on preference self-adaptive meta-learning
Du et al. Research on personalized book recommendation based on improved similarity calculation and data filling collaborative filtering algorithm
CN117171449B (en) Recommendation method based on graph neural network
Zhang et al. Multi-view dynamic heterogeneous information network embedding
CN114861079B (en) A collaborative filtering recommendation method and system integrating product features
Wu et al. Enhancing recommendation capabilities using multi-head attention-based federated knowledge distillation
Hsieh et al. Leveraging attribute latent features for addressing new item cold-start issue
Niu et al. Tourism event knowledge graph for attractions recommendation
Wan et al. A recommendation approach based on heterogeneous network and dynamic knowledge graph
Li Research on e-business requirement information resource extraction method in network big data
Hassanpour et al. Improving Accuracy of Recommender Systems using Social Network Information and Longitudinal Data
Xiaoyi et al. A hybrid collaborative filtering model with context and folksonomy for social recommendation
CN118674017B (en) Model training method, content recommendation method, device and electronic equipment
Roy et al. Big data analytics based recommender system for tele-communication industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination