CN108876536A

CN108876536A - Collaborative filtering recommendation method based on nearest neighbor information

Info

Publication number: CN108876536A
Application number: CN201810621062.2A
Authority: CN
Inventors: 韩玥; 王颖; 张子洋; 金志刚
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2018-11-23

Abstract

The present invention relates to a collaborative filtering recommendation method based on nearest neighbor information, comprising the following steps. Step 1: input the rating data set in the user-item matrix and determine the target user u. Step 2: using the rating data and Pearson similarity, calculate the availability between the target user and the other users, and select the k′ users with the highest availability to generate the nearest neighbor candidate set N_u′ of the target user u. Step 3: calculate the trust degree between each user v in N_u′ and the target user u. Step 4: filter out users with high availability but low trust, select the K users with the highest trust as the nearest neighbor users, and generate the nearest neighbor set N_u. Step 5: using the rating information of N_u, calculate the predicted ratings of the target user u for all unrated items. Step 6: select the items with high predicted ratings to generate a list of recommended items.

Description

Collaborative filtering recommendation method based on nearest neighbor information

Technical field

The invention belongs to the technical field of recommendation based on collaborative filtering, and in particular relates to a collaborative filtering recommendation method based on nearest neighbor information.

Background

With the rapid development of the Internet, the network has long been part of daily life. At the same time, online information has become more complex and is growing rapidly, so that the Internet faces problems such as "information overload" when sharing information. In e-commerce the "information overload" phenomenon is particularly severe. Against this background, personalized recommendation technology has gradually developed.

Compared with traditional search methods, a personalized recommendation system provides each user with a tailored service: by collecting the user's historical behavior data and analyzing the user's current and potential interests, it offers products the user is interested in and reduces the time and effort spent searching for them. It also attracts more consumers to e-commerce websites, recommends products to them, and increases sales and profit, a win-win for users and suppliers. The core of such a system is the personalized recommendation technique. These systems usually include a rating system, and the historical rating records can be used to estimate a user's preferences. Some platforms have only a review system rather than a rating system; existing techniques such as sentiment analysis can convert users' text, emoticons, and other review content into numerical ratings, so that a review system can be converted into a rating system.

As mentioned above, personalized recommendation is used on major platforms such as Taobao, Amazon, JD.com, and Toutiao. Users can nevertheless easily notice remaining problems. For example, a user who has just bought a laptop on an e-commerce platform is unlikely to buy another one for some time, yet the system will recommend laptops to that user repeatedly in the near future. It is also easy to see that platforms with only ratings or reviews achieve good personalized recommendation because they already had a large number of users and a large amount of data before enabling it. Some small and medium-sized companies do not have many users, and even those that do may lack sufficient historical behavior records, so the available data is very sparse and good personalized recommendation cannot be achieved.

In order to solve the problems of low nearest neighbor selection accuracy and low recommendation accuracy under sparse data, the present invention establishes an optimized nearest neighbor selection method on the basis of the traditional collaborative filtering algorithm, providing better technical support for personalized recommendation.

Summary of the invention

To solve the problems of low nearest neighbor selection accuracy and low recommendation accuracy under sparse data, the present invention provides a more accurate personalized recommendation method: on the basis of the traditional collaborative filtering algorithm, it establishes a collaborative filtering recommendation method based on nearest neighbor information. To achieve this object, the present invention adopts the following technical solution:

A collaborative filtering recommendation method based on nearest neighbor information, comprising the following steps:

Step 1: Input the rating data set in the user-item matrix and determine the target user u;

Step 2: Using the rating data and Pearson similarity, calculate the availability between the target user and the other users, select the k′ users with the highest availability, and generate the nearest neighbor candidate set N_u′ of the target user u;

Step 3: Calculate, per time window θ, the trust degree tru(u,v) between each user v in N_u′ and the target user u:

a) According to whether user v's preference for an item i in a time window θ is close to or far from the preference of the target user u, user u is considered to have formed a consistent preference or an inconsistent preference with respect to user v in window θ, denoted g_i^θ(u,v) and b_i^θ(u,v) respectively;

b) Since user v may rate several items for the target user u in window θ, the consistent or inconsistent preference value for window θ is calculated as the sum of that user's consistent or inconsistent preferences over all items, denoted g^θ(u,v) and b^θ(u,v) respectively;

c) Assign the same forgetting factor f to all ratings falling in the same time window θ, so that users' earlier preferences are gradually forgotten: for time window θ, a consistent-preference forgetting factor f_g and an inconsistent-preference forgetting factor f_b are applied, and the total values of consistent and inconsistent preference are denoted G^θ(u,v) and B^θ(u,v) respectively;

d) Finally, calculate the trust degree between user v and the target user u from the total values of consistent and inconsistent preference of user v toward user u: the higher the total consistent preference, or the lower the total inconsistent preference, the higher the trust degree;

Step 4: Filter out users with high availability but low trust, select the K users with the highest trust as the nearest neighbor users, and generate the nearest neighbor set N_u;

Step 5: Using the rating information of N_u, calculate the predicted ratings of the target user u for all unrated items;

Step 6: Select the items with high predicted ratings to generate a list of recommended items.

By adopting the above technical solution, the present invention has the following advantages:

(1) Most existing user similarity measures, such as Pearson similarity, compute the similarity between a pair of users as a symmetric value, whereas in practice two users are not equally able to recommend to each other. Building on the traditional similarity calculation, the present invention takes the asymmetry of user similarity and the availability of recommendations into account, ensuring that the nearest neighbors are actually able to recommend;

(2) Most current neighbor selection methods do not consider the consistency of users' preferences across different items and ignore how user preferences change over time. The present invention considers preference consistency across items and adds a time window and a forgetting factor, further ensuring that the preferences of the nearest neighbors and the target user remain consistent across different time periods.

Description of drawings

Figure 1 is the flow chart of the collaborative filtering recommendation method based on optimized nearest neighbor selection.

Detailed description

On the basis of the traditional collaborative filtering algorithm, the present invention establishes a collaborative filtering recommendation method based on nearest neighbor information, which comprises two key models: an availability calculation model and a dynamic trust calculation model.

To achieve the above object, the present invention adopts the following technical solution:

(1) Availability calculation model: traditional similarity measures rely mainly on the set of co-rated items, so when two users have few co-rated items the computed similarity deviates considerably from reality. Traditional measures also treat the ability to recommend as symmetric between users, whereas in practice two users are not equally able to recommend to each other. To address these problems, the present invention takes into account the ratio of the number of co-rated items to the number of items rated by the target user, combines it with the traditional similarity measure, and proposes a user availability model.

(2) Dynamic trust calculation model: in traditional recommendation algorithms, similarity calculation relies mainly on the ratings that users have given to co-rated items. In many of today's recommender systems, however, one of the key problems facing recommendation technology is data sparsity. With few co-rated items, the computed similarity cannot accurately represent how similar two users are, which degrades the subsequent recommendations. In addition, recommender systems contain many users who submit false or malicious ratings, which makes collaborative filtering recommender systems vulnerable to attack: untruthful reviews of a product, or reviews written with an ulterior motive, mislead other users' perception of it. The present invention proposes a trust model that, on the one hand, alleviates the data sparsity problem and, on the other hand, computes the trust degree between users so as to select trustworthy users and exclude malicious ones, further improving recommendation accuracy.

In the collaborative filtering recommendation method based on optimized nearest neighbor selection, the traditional trust model is adapted to practical conditions when computing the dynamic trust model: a time window and a forgetting factor are introduced, each user's trust value is calculated from the perspective of the target user, and the time factor is taken into account to capture changes in user preferences over time, so that neighbors with higher trust are selected. The present invention will greatly improve the accuracy of the recommendations the system makes to the target user. Figure 1 shows the flow of the invention. The specific implementation steps are as follows:

(1) Input the rating data set in the user-item matrix and determine the target user u.

(2) Calculate the availability between the target user and the other users from the rating data, using the following formula:

where ava(u,v) denotes the availability of user v to user u, I_v and I_u denote the sets of items rated by users v and u respectively, and sim(u,v) denotes the similarity between users u and v. When I_v ⊆ I_u, that is, when every item rated by user v has also been rated by user u, user v has no ability to recommend anything to user u, so ava(u,v) = 0. When sim(u,v) < 0, the similarity between users u and v is too low, so ava(u,v) = 0 as well. |I_u ∩ I_v| denotes the number of items rated by both u and v. When user v has rated items that user u has not, the method takes the ratio of the number of co-rated items to the number of items rated by user u and combines it with the traditional similarity measure, offsetting the traditional measure's excessive dependence on co-rated items. For equal sim(u,v), a larger |I_u ∩ I_v|/|I_u| means that more rating data went into the traditional similarity, and therefore the availability of user v to user u is higher.
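The availability rules described above can be sketched in Python. The text fixes the two zero cases and says that the overlap ratio |I_u ∩ I_v|/|I_u| is combined with sim(u,v), but the exact combination is not reproduced here, so the product used below is an assumption. The function names, the dictionary layout (item -> rating), and the choice to compute Pearson means over co-rated items only are also illustrative assumptions.

```python
from math import sqrt


def pearson(ru, rv):
    """Pearson correlation over co-rated items; ru and rv map item -> rating."""
    common = sorted(set(ru) & set(rv))
    if len(common) < 2:
        return 0.0  # assumption: treat an undefined correlation as 0
    mu = sum(ru[i] for i in common) / len(common)
    mv = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = sqrt(sum((ru[i] - mu) ** 2 for i in common)) * sqrt(
        sum((rv[i] - mv) ** 2 for i in common)
    )
    return num / den if den else 0.0


def availability(ru, rv):
    """Availability of neighbor v to target u (product form is an assumption)."""
    iu, iv = set(ru), set(rv)
    if not iu or iv <= iu:
        return 0.0  # v has rated nothing that u has not already rated
    sim = pearson(ru, rv)
    if sim < 0:
        return 0.0  # similarity too low
    return sim * len(iu & iv) / len(iu)
```

Unlike plain Pearson similarity, this quantity is asymmetric: if v has rated everything u has plus more, v can be useful to u while u has zero availability to v.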

(3) Select the K′ users with the highest availability to generate the nearest neighbor candidate set N′(u) of the target user u.

K′ = [ε × K]

where ε is the nearest neighbor selection coefficient, ε ∈ {ε ∈ R | ε ≥ 1}, and [x] is the floor function, that is, the largest integer not exceeding the real number x. Users in N′(u) generally have high availability: they are highly similar to the target user and have a strong ability to recommend.
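As a minimal sketch of this step, the function below computes K′ = [ε × K] with the floor function and keeps the K′ users with the highest availability. The function name, the dictionary input (user -> availability), and the decision to drop users whose availability is zero are assumptions made for illustration.

```python
from math import floor


def candidate_set(ava_by_user, k, eps):
    """Return up to K' = floor(eps * K) users ranked by availability."""
    if eps < 1:
        raise ValueError("eps must be >= 1")
    k_prime = floor(eps * k)
    ranked = sorted(ava_by_user.items(), key=lambda kv: kv[1], reverse=True)
    # Assumption: users with zero availability are never useful candidates.
    return [user for user, ava in ranked[:k_prime] if ava > 0]
```

Because ε ≥ 1, the candidate set is at least as large as the final neighbor set of size K, which leaves room for the later trust-based filtering.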

(4) Calculate the trust degree tru(u,w) between every user w in N′(u) and the target user u.

a) The present invention gradually forgets users' earlier preferences by introducing a time window and a forgetting factor. Given a particular rating provided at time t_c, the index of the window into which that rating falls is denoted θ,

θ = [(t_c - t_s)/t_w] + 1

where t_s and t_w denote the training start time and the window length respectively. In the proposed method, all ratings falling in the same time window θ are assigned the same forgetting factor.
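The window index formula translates directly into code. In this sketch the function name and the input validation are illustrative additions; times may be in any consistent unit (for example days or Unix seconds).

```python
def window_index(t_c, t_s, t_w):
    """theta = floor((t_c - t_s) / t_w) + 1, so the first window has index 1."""
    if t_w <= 0:
        raise ValueError("window length t_w must be positive")
    if t_c < t_s:
        raise ValueError("rating time t_c precedes training start t_s")
    return int((t_c - t_s) // t_w) + 1
```

With t_s = 0 and t_w = 30, ratings at times 0 through 29 fall into window 1 and a rating at time 30 opens window 2.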

b) According to whether v's preference for an item is close to or far from the preference of the target user u, user v is considered to have shown "good behavior" (a consistent preference) or "bad behavior" (an inconsistent preference) toward user u, denoted g_i(u-v) and b_i(u-v) respectively, each of which can be quantified as a continuous value in the range (0, 1). The rating behavior of each user v in window θ can thus be quantified (denoted g_i^θ(u,v) or b_i^θ(u,v)),

where R_max and R_min denote the maximum and minimum rating values in the recommender system, and the remaining two symbols denote the ratings given to item i in time window θ by user v and by the target user u, respectively.

c) Since user v may rate several items for the target user u in window θ, the total good or bad behavior value in window θ is calculated as the sum of that user's good or bad behavior values over all items.
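The per-window totals can be sketched as follows. The per-item quantization formula is not reproduced in this text, so the sketch assumes a simple stand-in: the absolute rating difference normalized by R_max - R_min counts as bad behavior and its complement counts as good behavior. That stand-in, the function name, and the default 1-to-5 scale are all assumptions, and this version yields values in the closed interval [0, 1] rather than the open range mentioned above.

```python
def window_behavior(ratings_u, ratings_v, r_min=1.0, r_max=5.0):
    """Sum v's good and bad behavior toward u over co-rated items in one window.

    ratings_u, ratings_v: item -> rating, already restricted to one window theta.
    Returns (g, b), the window totals of good and bad behavior.
    """
    span = r_max - r_min
    g = b = 0.0
    for item in set(ratings_u) & set(ratings_v):
        # Assumed quantization: normalized distance between the two ratings.
        d = abs(ratings_v[item] - ratings_u[item]) / span
        g += 1.0 - d  # close ratings count as consistent preference
        b += d        # distant ratings count as inconsistent preference
    return g, b
```

Items rated by only one of the two users contribute nothing, since there is no preference of u to compare against.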

d) So that more recent behavior receives a higher weight when computing a user's trustworthiness, the present invention uses forgetting factors to gradually forget the user's earlier behavior. Specifically, for time window θ, the total values of "good behavior" and "bad behavior" (denoted G^θ(u,v) and B^θ(u,v)) are calculated as:

G^θ(u,v) = G^(θ-1)(u,v) × f_g + g^θ(u,v)

B^θ(u,v) = B^(θ-1)(u,v) × f_b + b^θ(u,v)

where f_g and f_b are the forgetting factors for "good behavior" and "bad behavior" respectively, each taking values in the range (0,1).
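The two recurrences can be evaluated by a single pass over the windows in chronological order. In the sketch below the function name and input layout are illustrative, and the initial values G^0 = B^0 = 0 are an assumption, since the text does not state them.

```python
def accumulate(per_window, f_g, f_b):
    """Apply G = G_prev * f_g + g and B = B_prev * f_b + b window by window.

    per_window: list of (g, b) tuples ordered by theta = 1, 2, ...
    Windows without ratings should appear as (0.0, 0.0) so that decay applies.
    """
    if not (0 < f_g < 1 and 0 < f_b < 1):
        raise ValueError("forgetting factors must lie in (0, 1)")
    total_g = total_b = 0.0  # assumed initial values G^0 = B^0 = 0
    for g, b in per_window:
        total_g = total_g * f_g + g
        total_b = total_b * f_b + b
    return total_g, total_b
```

Choosing f_b larger than f_g would make bad behavior fade more slowly than good behavior, which is one reason to keep the two factors separate.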

e) Finally, the trust degree tru(u,v) of the target user u in user v is calculated as follows; the better the behavior of user v, the higher the trust value that user obtains.
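The trust formula itself is not reproduced in this text, which only states that trust rises with the total good behavior and falls with the total bad behavior. As a hedged stand-in with that monotonic behavior, the sketch below uses a beta-reputation style ratio (G + 1)/(G + B + 2); this specific form is an assumption and should not be read as the formula of the invention.

```python
def trust(total_g, total_b):
    """Stand-in trust score in (0, 1): increases with G, decreases with B."""
    if total_g < 0 or total_b < 0:
        raise ValueError("behavior totals must be non-negative")
    return (total_g + 1.0) / (total_g + total_b + 2.0)
```

With no observed behavior the score is 0.5, a neutral prior; any other monotone function of G and B could be substituted.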

(5) Filter out users with high availability but low trust, select the K users with the highest trust as the nearest neighbor users, and generate the nearest neighbor set N(u). The nearest neighbors of the target user are thus users with both high availability and high trust: their preferences are similar to those of the target user, and that similarity persists over time.
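A minimal sketch of this filtering step: given the candidate users and a trust value for each, keep the K most trusted. The function name and the default trust of 0.0 for candidates without a computed value are illustrative assumptions.

```python
def nearest_neighbors(candidates, trust_by_user, k):
    """Keep the K candidates with the highest trust, in descending order."""
    ranked = sorted(
        candidates,
        key=lambda v: trust_by_user.get(v, 0.0),  # assumption: unknown trust is 0
        reverse=True,
    )
    return ranked[:k]
```

Because the candidates were already chosen for high availability, ranking them by trust alone removes exactly the users with high availability but low trust.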

(6) Using the rating information of the nearest neighbor set, calculate the predicted ratings of the target user u for all unrated items (for example, the predicted rating P_ui for target item i), mainly by the following calculation method.
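The prediction formula is not reproduced in this text. A common choice in neighborhood-based collaborative filtering is the mean-centered weighted average, and the sketch below uses it as a stand-in; the formula, the function name, and the idea of passing in per-neighbor weights (for example trust or availability values) are assumptions.

```python
def predict(target_ratings, neighbor_ratings, weights, item):
    """Mean-centered weighted average prediction of the target's rating for item.

    target_ratings: item -> rating for the target user u.
    neighbor_ratings: neighbor -> {item -> rating}.
    weights: neighbor -> non-negative weight (hypothetical, e.g. trust).
    """
    mean_u = sum(target_ratings.values()) / len(target_ratings)
    num = den = 0.0
    for v, ratings in neighbor_ratings.items():
        if item not in ratings:
            continue  # this neighbor has no opinion on the item
        mean_v = sum(ratings.values()) / len(ratings)
        num += weights[v] * (ratings[item] - mean_v)
        den += abs(weights[v])
    # Fall back to the target's own mean when no neighbor rated the item.
    return mean_u if den == 0 else mean_u + num / den
```

Centering on each user's mean compensates for neighbors who rate systematically higher or lower than the target user.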

(7) The higher a rating in the predicted rating list, the more the target user is expected to like the corresponding item. Items with high predicted ratings are therefore selected to generate the list of recommended items.
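The final step amounts to a top-N selection over the predicted ratings. In this sketch the function name and the list length n are illustrative; the text does not specify how many items are recommended.

```python
def recommend(predicted, n):
    """Return the n items with the highest predicted ratings, best first."""
    ranked = sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:n]]
```

A rating threshold could be used instead of a fixed n; either reading is compatible with "items with high predicted ratings".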

Claims

1. A collaborative filtering recommendation method based on nearest neighbor information, comprising the following steps:

Step 1: Input the rating data set in the user-item matrix and determine the target user u;

Step 2: Using the rating data and Pearson similarity, calculate the availability between the target user and the other users, select the k′ users with the highest availability, and generate the nearest neighbor candidate set N_u′ of the target user u;

Step 3: Calculate, per time window θ, the trust degree tru(u,v) between each user v in N_u′ and the target user u:

a) According to whether user v's preference for an item i in a time window θ is close to or far from the preference of the target user u, user u is considered to have formed a consistent preference or an inconsistent preference with respect to user v in window θ, denoted g_i^θ(u,v) and b_i^θ(u,v) respectively;

b) Since user v may rate several items for the target user u in window θ, the consistent or inconsistent preference value for window θ is calculated as the sum of that user's consistent or inconsistent preferences over all items, denoted g^θ(u,v) and b^θ(u,v) respectively;

c) Assign the same forgetting factor f to all ratings falling in the same time window θ, so that users' earlier preferences are gradually forgotten: for time window θ, a consistent-preference forgetting factor f_g and an inconsistent-preference forgetting factor f_b are applied, and the total values of consistent and inconsistent preference are denoted G^θ(u,v) and B^θ(u,v) respectively;

d) Finally, calculate the trust degree between user v and the target user u from the total values of consistent and inconsistent preference of user v toward user u: the higher the total consistent preference, or the lower the total inconsistent preference, the higher the trust degree;

Step 4: Filter out users with high availability but low trust, select the K users with the highest trust as the nearest neighbor users, and generate the nearest neighbor set N_u;

Step 5: Using the rating information of N_u, calculate the predicted ratings of the target user u for all unrated items;

Step 6: Select the items with high predicted ratings to generate a list of recommended items.