CN110060102B

CN110060102B - Big data prediction method for user store location based on partial label learning

Info

Publication number: CN110060102B
Application number: CN201910313789.9A
Authority: CN
Inventors: 王进; 闵子剑; 孙开伟; 许景益; 邓欣; 刘彬
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Beijing Xinsuo Consulting Co ltd
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2022-05-03
Anticipated expiration: 2039-04-18
Also published as: CN110060102A

Abstract

The invention discloses a big data prediction method, based on partial label learning, for locating the store a user is in. The method comprises the following steps: 101, preprocessing the user's shopping-state data; 102, constructing a partial label data set from the candidate store set corresponding to each sample; 103, performing feature extraction on the partial label data set; 104, constructing a similarity graph from the feature space; 105, performing probability propagation over the similarity graph; 106, using the converged propagated probabilities to predict, from the candidate store set of the partial label data set, the store with which the user will interact in the future. By preprocessing the user's historical data, extracting features, converting the data into a partial label data set and building a partial label learning model, the method predicts, from the candidate store set of each user, the store with which the user will interact, based on the partial label data set of the user's location behavior. Users thereby receive more accurate personalized push services, which improves their shopping experience.

Description

Big data prediction method for user store location based on partial label learning

Technical Field

The invention belongs to the technical fields of partial label learning and big data processing, and in particular relates to big data prediction of the store a user is in, based on a probability propagation model.

Background

Partial label learning is a form of weakly supervised learning in which the output for each training sample is associated with a set of candidate labels. Exactly one candidate label is the true label; the remaining labels are treated as distracting noise labels. During training, the true label of each sample is hidden in its candidate label set, so a mapping from the input space to the output space cannot be learned directly from the data set as in fully supervised learning. In practice, however, data sets with accurate, unique label information are increasingly difficult to obtain, so we must face the serious problem of learning from data sets whose labels are neither unique nor unambiguous. Recently, partial label learning has provided many effective methods for this kind of problem and has been widely applied in practice, with particularly notable progress on the problem of locating the store a user is in.

With the rapid spread of mobile payment, people enjoy more and more of the convenience that intelligent positioning brings. For example, when a customer walks into a restaurant in a shopping mall, the phone automatically shows a coupon for that restaurant; when the customer walks into a clothing store, the phone can recommend clothes in that store that the customer likes; when the customer passes a jewelry store, the phone can point out that a diamond ring the customer has long wanted is now in stock; and when the customer leaves the mall's parking lot, the phone can pay the parking fee automatically with the customer's permission. All of these services depend on big data mining and machine learning behind the scenes. Analysis of which store a customer is in gives the customer an unobtrusive artificial intelligence experience and makes it easier for users to learn about the stores they are interested in, which indirectly increases customers' purchasing power. Providing the most effective service to users at the right time and in the right place is a new challenge for intelligent services in the big data era.

Summary of the Invention

The present invention aims to solve the above problems of the prior art. It proposes a big data prediction method, based on partial label learning, for locating the store a user is in, which enables users to obtain more accurate personalized push services and improves their shopping experience. The technical scheme of the present invention is as follows:

A big data prediction method, based on partial label learning, for locating the store a user is in, comprising the following steps:

101. Preprocess the user's location behavior data, including cleaning abnormal samples and filling in missing Wi-Fi information;

102. Construct a partial label data set from the candidate store set corresponding to each sample. Each sample in the data set is one shopping state of one user, and different shopping states of a user correspond to different candidate store sets. The candidate store set of each sample is obtained by a rule that can be summarized in three steps: 1. find the 10 stores closest, by distance, to the user's current shopping state; 2. solve a convex quadratic programming problem to obtain the importance of each of these 10 stores to the user's current shopping state; 3. select the stores whose importance exceeds the threshold 0.4 as the candidate store set;

103. Perform feature extraction on the partial label data set, extracting Wi-Fi distance-strength feature vectors to form the feature space. This feature vector resembles a one-hot feature vector: each dimension holds the distance-strength value, in the user's current shopping state, of one of the Wi-Fi networks that appear in the data set;

104. Construct a similarity graph from the feature space, specifically:

For each sample x_i in the data set, repeat the same operations: 1. treat x_i as a node of the similarity graph; 2. taking x_i as the center point, compute the Euclidean distance between the Wi-Fi distance-strength feature vector of x_i and those of the other samples in the data set, select the 10 samples with the smallest Euclidean distance, and connect the node of x_i to the nodes of these 10 samples by edges in the similarity graph;

105. Perform probability propagation over the similarity graph. For each sample x_i in the data set, repeat the same operations: 1. Initialization: compute the optimal parameters from the likelihood function (formula (6)), and from them the probability that the user will interact with each candidate store in the candidate store set of x_i; this probability distribution is the initial probability distribution over the candidate stores of x_i. 2. In the t-th iteration of the probability propagation algorithm: obtain the probability distribution over the candidate stores of x_i at iteration t from the similarity-graph-based formula. Evaluating this formula is one round of probability propagation, and it propagates only between the two nodes joined by each edge of the similarity graph. Because propagation may give a non-zero interaction probability to stores outside the candidate store set of x_i, the interaction probabilities of all stores with respect to x_i are disambiguated and normalized: a. the interaction probability of every store outside the candidate store set is set to 0; b. the interaction probabilities of the stores in the candidate store set are min-max normalized.

106. Using the probabilities to which the propagation of step 105 converges, predict, from the candidate store set of the partial label data set, the store with which the user will interact in the future.

Further, the specific steps of preprocessing the user's shopping-state data in step 101 are:

1011. Abnormal sample cleaning: first, using the longitude and latitude in the original data set and the Wi-Fi strength information of the current shopping state, compute the anomaly confidence c_i of each sample according to formula (1), whose inputs are the longitude λ_i, the latitude and the current-state Wi-Fi strength τ_i of the user corresponding to the i-th sample, and where m is the number of samples in the data set. If the anomaly confidence c_i of a sample is lower than 0.15 or higher than 0.85, the sample is judged abnormal and is filtered out of the original data set;
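The thresholding rule of step 1011 can be sketched as follows. Formula (1) itself is not reproduced in this text, so the sketch assumes the anomaly confidences have already been computed and only applies the 0.15 and 0.85 bounds; the function name `drop_abnormal` and the sample identifiers are hypothetical.

```python
def drop_abnormal(samples, confidences, low=0.15, high=0.85):
    """Keep samples whose anomaly confidence c_i is neither below `low` nor above `high`.

    `confidences` is assumed to hold the c_i values produced by formula (1),
    which is not reproduced here.
    """
    return [s for s, c in zip(samples, confidences) if low <= c <= high]


# Illustrative data: four samples with made-up confidence values.
samples = ["s0", "s1", "s2", "s3"]
confidences = [0.10, 0.50, 0.85, 0.90]
kept = drop_abnormal(samples, confidences)
```

A confidence exactly equal to a bound is kept, because the text only rejects values strictly lower than 0.15 or strictly higher than 0.85.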

1012. Filling missing Wi-Fi information: first find the 10 samples whose longitude and latitude are most similar to those of the sample with missing Wi-Fi strength information, all 10 of which have known Wi-Fi strength information. The similarity between two samples is computed according to formula (2), whose inputs are the longitude and latitude of the user corresponding to sample a (longitude λ_a), the longitude and latitude of the user corresponding to sample b (longitude λ_b), and the variances of longitude and latitude over the entire data set. The missing Wi-Fi strength information is then filled from these 10 samples according to formula (3), where sample a is the sample to be filled, a_i (i = 1, 2, …, 10) are the 10 nearest-neighbor samples of sample a, and the remaining term is the Wi-Fi strength information of sample a_i.
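The neighbor-based filling of step 1012 might look like the sketch below. Formulas (2) and (3) are not reproduced in this text, so two stand-ins are assumed and should not be read as the patent's formulas: a variance-scaled Gaussian kernel on longitude and latitude for the similarity, and a similarity-weighted mean of the neighbors' strengths for the fill. All function names are hypothetical, and a missing strength is represented as `None`.

```python
import math


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def similarity(pos_a, pos_b, var_lon, var_lat):
    # ASSUMED stand-in for formula (2): Gaussian kernel scaled by the
    # data-set variances of longitude and latitude.
    d_lon = pos_a[0] - pos_b[0]
    d_lat = pos_a[1] - pos_b[1]
    return math.exp(-(d_lon ** 2 / var_lon + d_lat ** 2 / var_lat))


def fill_missing_wifi(positions, wifi, k=10):
    """Fill None entries of `wifi` from the k most similar samples with known strength."""
    var_lon = variance([p[0] for p in positions]) or 1e-12
    var_lat = variance([p[1] for p in positions]) or 1e-12
    known = [i for i, w in enumerate(wifi) if w is not None]
    filled = list(wifi)
    for a, w in enumerate(wifi):
        if w is not None:
            continue
        scored = sorted(
            ((similarity(positions[a], positions[b], var_lon, var_lat), b) for b in known),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in scored)
        # ASSUMED stand-in for formula (3): similarity-weighted mean.
        filled[a] = sum(s * wifi[b] for s, b in scored) / total
    return filled


# Four known samples at the corners of a unit square, one unknown at the centre.
positions = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
wifi = [-40.0, -50.0, -60.0, -70.0, None]
filled = fill_missing_wifi(positions, wifi, k=4)
```

In this toy layout the four neighbors are equally similar to the centre sample, so the filled value is simply their average.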

Further, the specific steps of constructing the partial label data set from the candidate store set of each sample in step 102 are:

For each sample in the original data, repeat the following operations to construct the partial label data set: (1) from the user longitude and latitude and the store longitude and latitude in the original data set, compute the distance d between the sample and each store, using the longitude and latitude of store A (longitude λ_A) and the longitude and latitude of sample a (longitude λ_a); (2) according to the computed distance d, select the 10 stores nearest to the sample; (3) using the longitude and latitude of these 10 nearest stores, optimize the quadratic programming problem of formula (4) to solve for the weight of each of the 10 stores relative to the sample, where the longitude and latitude of the user corresponding to sample a (longitude λ_a) and the longitudes and latitudes of the 10 stores nearest to sample a are the inputs, and ω_a,i (i = 1, 2, …, 10) is the weight, relative to sample a, of store i among those 10 stores. If the computed weight of a store is greater than 0.4, the store is added to the candidate store set of the sample.
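The three-step candidate selection can be illustrated with a small sketch. The quadratic program of formula (4) and the distance formula are not reproduced in this text, so the sketch makes two assumptions that are not the patent's own formulas: distance is plain planar distance on (longitude, latitude), and the quadratic program reconstructs the user position as a convex combination of the nearby store positions (weights non-negative, summing to 1), solved by projected gradient descent. The names `project_to_simplex` and `candidate_stores` are hypothetical.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def candidate_stores(user, stores, k=10, threshold=0.4, iters=2000):
    """Return {store: weight} for stores whose weight exceeds `threshold`."""
    # Steps (1)-(2): k nearest stores, using store offsets relative to the user.
    offsets = {n: (lon - user[0], lat - user[1]) for n, (lon, lat) in stores.items()}
    nearest = sorted(offsets, key=lambda n: math.hypot(*offsets[n]))[:k]
    d = [offsets[n] for n in nearest]
    # Step (3), ASSUMED objective: minimise ||sum_i w_i * d_i||^2 over the simplex.
    w = [1.0 / len(d)] * len(d)
    step = 1.0 / (2.0 * sum(x * x + y * y for x, y in d) or 1.0)
    for _ in range(iters):
        rx = sum(wi * x for wi, (x, _) in zip(w, d))
        ry = sum(wi * y for wi, (_, y) in zip(w, d))
        grad = [2.0 * (rx * x + ry * y) for x, y in d]
        w = project_to_simplex([wi - step * g for wi, g in zip(w, grad)])
    return {n: wi for n, wi in zip(nearest, w) if wi > threshold}


# Toy layout: the user stands outside the triangle of stores, closest to store A.
stores = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.0)}
candidates = candidate_stores(user=(-1.0, -1.0), stores=stores)
```

Under the assumed sum-to-one constraint at most two stores can exceed a weight of 0.4; whether the patent's formula (4) has the same property cannot be confirmed from this text.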

Further, the feature extraction on the partial label data set in step 103 specifically comprises:

Wi-Fi distance strength: first discretize the Wi-Fi names into a 1000-dimensional feature vector whose feature values are the Wi-Fi strengths corresponding to the Wi-Fi names; then, according to conversion formula (5), convert the discretized Wi-Fi strength feature vector into the Wi-Fi distance-strength feature vector. In formula (5), the output is the 1000-dimensional Wi-Fi distance-strength feature vector of the i-th sample, the input vector is the 1000-dimensional Wi-Fi strength feature vector of the i-th sample indexed by Wi-Fi name, |Y_i| is the size of the candidate store set of the i-th sample, and the remaining inputs are the longitude and latitude of each candidate store A_j of the sample and the longitude and latitude of the corresponding user (longitude λ_a).
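A rough sketch of the feature construction follows. Conversion formula (5) is not reproduced in this text; the sketch assumes, as a placeholder only, that the Wi-Fi strength vector is scaled by the mean planar distance between the user and the candidate stores (the text elsewhere says the feature combines Wi-Fi strength with this average distance, but not how). The fixed `vocabulary` of Wi-Fi names and all function names are hypothetical, and the demo uses 3 dimensions instead of 1000.

```python
import math


def wifi_strength_vector(readings, vocabulary):
    """One slot per Wi-Fi name in `vocabulary`; the value is the observed strength, 0.0 if unseen."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vec = [0.0] * len(vocabulary)
    for name, strength in readings.items():
        if name in index:  # names outside the vocabulary are ignored
            vec[index[name]] = float(strength)
    return vec


def mean_candidate_distance(user, candidates):
    """(1 / |Y_i|) * sum of planar distances from the user to each candidate store."""
    return sum(math.hypot(lon - user[0], lat - user[1]) for lon, lat in candidates) / len(candidates)


def distance_strength_vector(readings, vocabulary, user, candidates):
    # ASSUMPTION: elementwise scaling stands in for formula (5).
    scale = mean_candidate_distance(user, candidates)
    return [scale * v for v in wifi_strength_vector(readings, vocabulary)]


vocabulary = ["ap1", "ap2", "ap3"]
readings = {"ap1": -40, "ap3": -70, "ap9": -55}
feature = distance_strength_vector(
    readings, vocabulary, user=(0.0, 0.0), candidates=[(3.0, 4.0), (0.0, 5.0)]
)
```

Both candidate stores in the demo are 5 units from the user, so each known strength is multiplied by 5 and the unseen network stays at 0.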

Further, the specific steps of constructing the similarity graph from the feature space in step 104 are:

To construct the feature-space-based similarity graph <V, E, ω_e>, the nodes V, the edges E and the edge weights ω_e of the similarity graph must each be defined;

1041. Nodes of the similarity graph: each sample in the partial label data set is treated as a node of the similarity graph;

1042. Edges of the similarity graph: for each sample in the partial label data set, that is, each node of the similarity graph, select the 10 samples other than itself whose Wi-Fi distance-strength vectors are nearest in Euclidean distance, and connect the corresponding pairs of nodes as edges of the similarity graph;

1043. Edge weights of the similarity graph: similar(a, b) from formula (2) is used as the weight of edge (a, b), where a and b are the two samples of the partial label data set that correspond to the two nodes of the edge.
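The graph construction of steps 1041 to 1043 can be sketched as a k-nearest-neighbor graph. The edge weight function is passed in as a callback because formula (2) is not reproduced in this text; the name `knn_graph`, the dictionary representation of edges and the constant weight used in the demo are illustrative assumptions.

```python
import math


def knn_graph(features, weight, k=10):
    """Build {(i, j): weight(i, j)} with an edge from each node i to its k nearest nodes j.

    Nearness is Euclidean distance between feature vectors; a node is never
    its own neighbour. Edges are stored as directed pairs.
    """
    edges = {}
    for i, x in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: math.dist(x, features[j]))
        for j in others[:k]:
            edges[(i, j)] = weight(i, j)
    return edges


# Four one-dimensional feature vectors; k=2 keeps the demo readable.
features = [[0.0], [1.0], [2.0], [10.0]]
edges = knn_graph(features, weight=lambda i, j: 1.0, k=2)
neighbours_of_3 = sorted(j for (i, j) in edges if i == 3)
```

Storing directed pairs means node 3 links to nodes 1 and 2 even though neither links back; whether the graph should be symmetrized is a design choice the text does not settle.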

Further, the probability propagation over the similarity graph in step 105 proceeds as follows:

1051. Probability initialization: for each sample, first assume that the probability of a store appearing in its candidate store set equals the proportion of that store's occurrences in the whole data set; that is, the probability of the store appearing in the data set serves as prior knowledge of the probability that it appears in the candidate store set of the sample. Further assume that, given the Wi-Fi distance-strength vector of the i-th sample, the probability that a store in the candidate set is the true label follows a logistic distribution. The likelihood function of formula (6) is then constructed from the existing partial label data set, where p(y∈S_i|x_i,θ) is the probability, given the Wi-Fi distance-strength vector of the i-th sample, that the true label lies in the candidate store set of that sample; n_y is the number of times store y appears in the whole data set; π_i,y is the probability that store y appears in the candidate store set; and p(y|x_i,θ) is the probability, given the Wi-Fi distance-strength vector of the i-th sample, that store y is the true label. This likelihood function formalizes the known fact that the true label of every sample in the data set lies in its candidate store set, and the parameter θ can be estimated by maximum likelihood. The resulting probability that, given the Wi-Fi distance-strength feature vector of the sample, the user corresponding to the sample will interact with store y in the future is used as the initial probability for probability propagation;

1052. Probability propagation: in the t-th iteration of probability propagation, a new probability matrix F_t, influenced by propagation from neighboring samples, is obtained from the probability matrix F_t-1 of the previous iteration and the initial probability matrix P = [p(y_i = j|x_i,θ)]_m×q according to the propagation formula, where W ∈ R^(m×m) is the sample-to-sample similarity matrix. Propagation runs for 50 iterations in total. In each iteration, the store interaction probabilities of each sample are propagated to its neighboring samples according to the similarity between samples, and each sample updates its own interaction probability for a store from the store interaction probabilities of its 10 neighboring samples.

Further, in the partial label learning problem, each iteration must disambiguate the updated probability matrix: for each sample, the interaction probabilities of stores outside the candidate store set are set to 0, and the interaction probabilities of stores in the candidate store set are normalized.

Further, in step 106, the store with which the user will interact in the future is predicted from the candidate store set of the partial label data set, using the converged propagated probabilities, as follows:

From the probability matrix F_t to which the propagation of step 105 converges, the predicted store for each sample is the store with which the corresponding user is most likely to interact.
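Propagation, disambiguation and prediction can be put together in the following sketch. The propagation and normalization formulas are not reproduced in this text, so the sketch assumes the update used in classical label propagation, F_t = α·Ŵ·F_t-1 + (1 − α)·P with a row-normalized similarity matrix Ŵ; the mixing weight `alpha` is hypothetical and does not appear in the source. Disambiguation follows the text: entries outside the candidate set are zeroed and candidate entries are min-max scaled, with equal candidate entries mapped to 1 as an added convention.

```python
def disambiguate(row, candidate_set):
    """Zero the non-candidate entries and min-max scale the candidate entries."""
    values = [row[j] for j in candidate_set]
    lo, hi = min(values), max(values)
    out = [0.0] * len(row)
    for j in candidate_set:
        out[j] = (row[j] - lo) / (hi - lo) if hi > lo else 1.0
    return out


def propagate(W, P, candidates, alpha=0.5, rounds=50):
    """ASSUMED update: F_t = alpha * W_hat @ F_{t-1} + (1 - alpha) * P, then disambiguate."""
    m, q = len(P), len(P[0])
    W_hat = []
    for row in W:  # row-normalise so each sample averages over its neighbours
        total = sum(row)
        W_hat.append([w / total if total else 0.0 for w in row])
    F = [list(row) for row in P]
    for _ in range(rounds):
        new_F = []
        for i in range(m):
            row = [
                alpha * sum(W_hat[i][n] * F[n][j] for n in range(m)) + (1 - alpha) * P[i][j]
                for j in range(q)
            ]
            new_F.append(disambiguate(row, candidates[i]))
        F = new_F
    return F


def predict(F, candidates):
    """For each sample, pick the candidate store with the highest propagated probability."""
    return [max(sorted(c), key=lambda j: F[i][j]) for i, c in enumerate(candidates)]


# Three samples, two stores. Sample 1 is ambiguous and far more similar to sample 0.
W = [[0.0, 1.0, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.0]]
P = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
candidates = [{0}, {0, 1}, {1}]
F = propagate(W, P, candidates)
predicted = predict(F, candidates)
```

The ambiguous sample inherits store 0 from its most similar neighbor. A dense matrix is used here for brevity; with 10 neighbors per sample a sparse representation would be the more natural choice.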

The advantages and beneficial effects of the present invention are as follows:

1. In store location applications, the most common prediction approach is ordinary multi-class machine learning. Multi-class methods consume a large amount of resources, and they treat every label as a possible true label even though the true label of each sample can only be one of a small subset of labels, which limits their accuracy. This patent therefore treats store location as a partial label learning problem, which makes full use of the label information of the few stores with which each sample could actually interact and greatly improves the accuracy of the model;

2. In the abnormal sample cleaning step, considering that all samples in the data set lie in the same business district, this patent defines an anomaly confidence based on the longitude and latitude of the user corresponding to each sample and the Wi-Fi strength of the current shopping state, and cleans out samples whose confidence deviates too far above the average confidence level of the data set or is itself too low.

3. Based on the principle that, in store location applications, the more similar the user longitude and latitude of two samples, the more similar their shopping states should be, this patent defines a similarity formula that expresses the degree of similarity between samples. This similarity serves two purposes in this patent: (1) the 10 samples most similar to a sample with missing Wi-Fi information are used to fill in its missing information; (2) the similarity serves as the edge weight between samples in the similarity graph.

4. When constructing a partial label data set, the conventional approach simply takes the 10 stores nearest to the user corresponding to the sample. This introduces too much noise into the partial label data set, so the 10 nearest stores must be filtered further. This patent defines a quadratic programming problem over the store longitudes and latitudes and the longitude and latitude of the user corresponding to the sample, with the interaction weight of each store relative to the sample as the decision variable. The optimal solution of the quadratic program selects, as far as possible, the stores that are relatively closest (compared with the other 9 stores) to the user's current shopping state, which greatly reduces the noise caused by oversized candidate label sets in the partial label data set.

5. In feature extraction, this patent exploits the fact that the average distance between the user corresponding to each sample and the stores in its candidate store set separates candidate stores from non-candidate stores well. Because the average distance alone cannot distinguish between the stores within a candidate store set, the Wi-Fi strength of each sample is combined with the average distance to form the Wi-Fi distance-strength vector feature, which separates candidate stores from non-candidate stores while preserving discrimination among the stores in the candidate store set.

6. For probability propagation, this patent adapts the classic label propagation algorithm. The classic algorithm considers only whether a candidate store appears or not and ignores the underlying probability distribution over the candidate store set, so its performance is unsatisfactory. Building on the label propagation framework, the probability propagation algorithm proposed in this patent uses maximum likelihood estimation based on the logistic distribution to uncover the probability distribution over the candidate store set of each sample, feeds the estimated distribution into the label propagation framework, and introduces disambiguation normalization (a. the interaction probability of every store outside the candidate store set is set to 0; b. the interaction probabilities of the stores in the candidate store set are min-max normalized) to handle non-candidate stores acquiring non-zero probability during propagation. In essence, the probability propagation algorithm overcomes the limitation that label propagation mines only the surface of the data, and greatly improves the prediction results of partial label learning.

Description of the Drawings

FIG. 1 is a flowchart of a big data prediction method, based on partial label learning, for locating the store a user is in, according to a preferred embodiment of the present invention.

FIG. 2 is a sample similarity graph used in the method according to a preferred embodiment of the present invention.

FIG. 3 is an overall framework diagram of the practical application of the partial label learning model in the method according to a preferred embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention.

The technical scheme by which the present invention solves the above technical problems is:

Referring to FIG. 1, which is a flowchart of a big data prediction method, based on partial label learning, for locating the store a user is in, according to Embodiment 1 of the present invention, the method specifically includes:

101. Preprocess the user's shopping-state data, as follows. 1011. Abnormal sample cleaning: first, using the longitude and latitude of the user corresponding to each sample in the original data set and the Wi-Fi strength information of the current state, compute the anomaly confidence of each sample according to formula (1). If the anomaly confidence c_i of a sample is lower than 0.15 or higher than 0.85, the sample is judged abnormal and is filtered out of the original data set. 1012. Filling missing Wi-Fi information: for reasons beyond control, the Wi-Fi strength information of some samples cannot be obtained accurately. Following the idea that samples with similar longitude and latitude should also have similar Wi-Fi strength information, first find the 10 samples whose user longitude and latitude are most similar to those of the sample with missing Wi-Fi strength information, all 10 of which have known Wi-Fi strength information. The similarity between two samples is computed according to formula (2), and the missing Wi-Fi strength information of the sample is then filled from these 10 samples according to formula (3).

102. Construct the partial label data set according to the candidate store set corresponding to each user, as follows. For each sample in the original data, repeat the following operations: (1) from the user's longitude and latitude and each store's longitude and latitude in the original data set, calculate the distance d between the sample and each store (where λ_A, … respectively denote the longitude and latitude of store A, and λ_a, … respectively denote the longitude and latitude of user a); (2) according to the calculated distance d, select the 10 stores closest to the sample; (3) using the longitude and latitude of these 10 closest stores, optimize the quadratic programming problem (formula (4)) to solve for the weight of each of the 10 stores relative to the sample; if the calculated weight of a store is greater than 0.4, add the store to the candidate store set of the sample.
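The candidate-set construction of step 102 can be sketched as follows. This is an illustration under stated assumptions: the haversine distance is assumed because the distance formula is not reproduced here, and `inverse_distance_weights` is a hypothetical placeholder for the quadratic program of formula (4), not the optimization itself. Only the values 10 and 0.4 come from the description.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def inverse_distance_weights(user, nearest):
    """Hypothetical placeholder for formula (4): normalised inverse distances."""
    inv = [1.0 / (1.0 + haversine_m(user[0], user[1], s["lon"], s["lat"])) for s in nearest]
    total = sum(inv)
    return [v / total for v in inv]

def candidate_stores(user, stores, weight_fn=inverse_distance_weights, k=10, threshold=0.4):
    """Return ids of the stores among the k nearest whose weight exceeds `threshold`."""
    nearest = sorted(
        stores, key=lambda s: haversine_m(user[0], user[1], s["lon"], s["lat"])
    )[:k]
    weights = weight_fn(user, nearest)
    return [s["id"] for s, w in zip(nearest, weights) if w > threshold]
```

Passing the weighting as `weight_fn` keeps the nearest-store selection and the thresholding separate from whichever solver is used for the weights.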

103. Perform feature extraction on the partial label data set according to the user's longitude and latitude and the Wi-Fi strength information, as follows: first, the Wi-Fi names are discretized into a 1000-dimensional feature vector whose feature values are the Wi-Fi strengths corresponding to the Wi-Fi names; then the discretized Wi-Fi strength feature vector is converted into a Wi-Fi distance strength feature vector according to conversion formula (5).
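The discretization in step 103 can be illustrated with a short sketch. It covers only the mapping from Wi-Fi names to a fixed-length strength vector; the conversion of formula (5) is not reproduced in this text and is omitted. Choosing the most frequent names for the vocabulary and using 0.0 for absent networks are assumptions, and the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(all_readings, dim=1000):
    """Map up to `dim` Wi-Fi names to vector indices, most frequent first (assumed rule)."""
    counts = Counter(name for readings in all_readings for name in readings)
    return {name: i for i, (name, _) in enumerate(counts.most_common(dim))}

def wifi_strength_vector(readings, vocabulary, dim=1000, missing=0.0):
    """Place each observed Wi-Fi strength at the index of its name.

    Names outside the vocabulary are ignored; unobserved slots keep `missing`.
    """
    vec = [missing] * dim
    for name, strength in readings.items():
        idx = vocabulary.get(name)
        if idx is not None and idx < dim:
            vec[idx] = strength
    return vec
```

Each sample's `readings` is assumed to be a dictionary from Wi-Fi name to measured strength.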

104. Construct the similarity graph according to the feature space, as follows: to construct the similarity graph <V, E, ω_e> based on the feature space (see Figure 2), the nodes V, the edges E, and the edge weights ω_e of the similarity graph must each be defined.

1041. Definition of the nodes of the similarity graph: each sample in the partial label data set is treated as a node in the similarity graph.

1042. Definition of the edges of the similarity graph: for each sample in the partial label data set (each node in the similarity graph), select the 10 samples (nodes) other than itself whose Wi-Fi distance strength feature vectors are closest in Euclidean distance as its associated objects, and connect each such pair of nodes as an edge of the similarity graph.
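The node and edge definitions of steps 1041 and 1042 amount to a k-nearest-neighbor graph. A minimal sketch, assuming the feature vectors are plain lists of floats and representing each edge as a directed pair (i, j) meaning "sample i selected sample j as a neighbor":

```python
import math

def knn_edges(features, k=10):
    """Connect every sample to its k nearest other samples by Euclidean distance."""
    edges = set()
    for i, xi in enumerate(features):
        # Distances from sample i to every other sample, nearest first.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(features) if j != i
        )
        for _, j in dists[:k]:
            edges.add((i, j))
    return edges
```

This brute-force version compares all pairs of samples, which is adequate for illustration; a data set of the size the method targets would likely need a spatial index or an approximate neighbor search.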

105. Perform probability propagation according to the similarity graph, as follows:

1051. Probability initialization: for each sample, first assume that the probability of a store appearing in its candidate store set equals the proportion of that store's occurrences in the entire data set; that is, the frequency of a store in the data set serves as prior knowledge of the probability that the store appears in the sample's candidate store set. Further assume that, given the Wi-Fi distance strength of the i-th sample, the probability that a store in the candidate set is the true label follows a logistic distribution. From the existing partial label data set, the likelihood function of formula (6) is then constructed. This likelihood function formalizes the known fact that the true label of every sample in the data set lies in its candidate store set. The parameter θ can be estimated by maximum likelihood estimation, where … is the probability that, given the Wi-Fi distance strength feature vector of the sample, store y will be interacted with in the future by the user corresponding to the sample; it serves as the initial probability for probability propagation.
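The idea behind the likelihood of step 1051, that each sample's true label lies somewhere in its candidate set, can be sketched as below. This is an assumption-laden illustration: a multinomial logistic (softmax) model is assumed for p(y | x_i, θ), the store-frequency prior described above is left out, and the exact form of formula (6) is not reproduced in this text. The function names are hypothetical.

```python
import math

def softmax_probs(theta, x):
    """p(y | x, theta) for every store y; `theta` holds one weight row per store."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def candidate_log_likelihood(theta, X, candidate_sets):
    """Sum over samples of log P(true label lies in the sample's candidate set)."""
    total = 0.0
    for x, cands in zip(X, candidate_sets):
        p = softmax_probs(theta, x)
        total += math.log(sum(p[y] for y in cands))
    return total
```

Maximizing this quantity over θ with any gradient-based optimizer yields the per-store probabilities that serve as the initial values for propagation.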

1052. Probability propagation: in the t-th iteration of probability propagation, a new probability matrix F_t, influenced by propagation from neighboring samples, is obtained from the probability matrix F_{t-1} of the previous iteration and the initialization probability matrix P = [p(y_i = j | x_i, θ)]_{m×q} according to formula (7). Probability propagation runs for 50 iterations in total. In each round, the store interaction probabilities of each sample are propagated to its neighboring samples according to the similarity between samples, and each sample updates its own interaction probability for a store from the interaction probabilities of its 10 neighboring samples. In the partial label learning problem, each iteration must also disambiguate the updated probability matrix: the interaction probabilities of stores outside each sample's candidate store set are set to 0, and the interaction probabilities of the stores in the candidate store set are normalized as shown in formula (8).
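One round-by-round reading of step 1052 is sketched below. Formulas (7) and (8) are not reproduced in this text, so the update rule F_t = α·W·F_{t-1} + (1 − α)·P, the mixing factor `alpha`, and the sum-to-one normalization are assumptions borrowed from common label propagation practice rather than the patented formulas. `W` is assumed to be a row-normalized similarity matrix given as nested lists.

```python
def propagate(P, W, candidate_sets, alpha=0.5, rounds=50):
    """Propagate store probabilities over the similarity graph, disambiguating each round."""
    m, q = len(P), len(P[0])
    F = [row[:] for row in P]
    for _ in range(rounds):
        new_F = []
        for i in range(m):
            row = []
            for j in range(q):
                # Probability mass received from neighbors, weighted by similarity.
                spread = sum(W[i][n] * F[n][j] for n in range(m))
                row.append(alpha * spread + (1 - alpha) * P[i][j])
            # Disambiguation: zero out stores outside the candidate set ...
            row = [v if j in candidate_sets[i] else 0.0 for j, v in enumerate(row)]
            # ... then renormalise the remaining probabilities to sum to 1.
            total = sum(row)
            new_F.append([v / total for v in row] if total > 0 else row)
        F = new_F
    return F
```

The default of 50 rounds follows the description; `alpha=0.5` is an arbitrary illustrative value.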

106. Using the probabilities to which the propagation converges, predict from the candidate store set of the partial label data set the store with which the user will interact in the future, as follows: from the probability matrix F_t obtained by the propagation in step 105, the predicted store with which the user corresponding to each sample is most likely to interact is obtained as shown in formula (9). The probability propagation method based on partial labels enables users to receive more accurate personalized push services and improves their shopping experience, providing an effective means of prediction when labels are difficult to obtain. The overall framework of the practical application of the big-data-based partial label learning model to locating the store where a user is situated is shown in Figure 3.
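The prediction of step 106 can be illustrated as picking, for each sample, the candidate store with the highest propagated probability. Formula (9) is not reproduced in this text, so treating it as an arg-max restricted to the candidate set is an assumption, and `predict_stores` is a hypothetical name.

```python
def predict_stores(F, candidate_sets):
    """Return, per sample, the candidate store index with the largest probability in F."""
    predictions = []
    for row, cands in zip(F, candidate_sets):
        # Sorting the candidates makes tie-breaking deterministic (lowest index wins).
        predictions.append(max(sorted(cands), key=lambda j: row[j]))
    return predictions
```

Restricting the search to the candidate set matches the disambiguation step, which already forces non-candidate probabilities to zero.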

The above embodiments should be understood as merely illustrating the present invention and not as limiting its scope of protection. After reading the contents of this description, a person skilled in the art may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.

Claims

1. A big data prediction method for locating the store where a user is, based on partial label learning, characterized in that it comprises the following steps:

101. Perform preprocessing operations on the user's location behavior data, including cleaning abnormal samples and filling in missing Wi-Fi information;

102. Construct a partial label data set according to the candidate store set corresponding to each sample. Each sample in the data set is a shopping state of a certain user, and the different shopping states of each user correspond to different candidate store sets. The candidate store set is obtained according to a rule that, for each sample, is summarized in three steps: 1. find the 10 stores closest to the user's current shopping state according to distance; 2. solve a convex quadratic programming problem to obtain the importance of these 10 stores to the user's current shopping state; 3. according to the importance, select the stores whose weight value is greater than the threshold 0.4 as the candidate store set, thereby constructing the partial label data set;

103. Perform feature extraction on the partial label data set: extract the Wi-Fi distance strength feature vector to form the feature space, where each dimension of the feature vector represents, for one Wi-Fi that appears in the data set, its distance strength value under the user's current shopping state;

104. Build a similarity graph according to the feature space, including:

For each sample x_i in the data set, repeat the same operations: 1. take x_i as a node of the similarity graph; 2. with x_i as the center point, compute the Euclidean distance between the Wi-Fi distance strength feature vector of x_i and those of the other samples in the data set, select the 10 samples with the smallest Euclidean distance to x_i, and, treating x_i as the center sample point of these 10 samples, connect the corresponding nodes with edges in the similarity graph;

105. Perform probability propagation according to the similarity graph. For each sample x_i in the data set, repeat the same operations: 1. initialization: calculate the optimal parameters from the likelihood function, and from them the probability that each candidate store in the candidate store set corresponding to x_i will be interacted with; this probability distribution is taken as the initial probability distribution over the candidate stores of x_i; 2. for the t-th iteration of the probability propagation algorithm: obtain the probability distribution over the candidate stores of x_i at the t-th iteration from the formula based on the similarity graph, thereby realizing the t-th round of probability propagation. Evaluating this formula is itself the propagation process, and propagation occurs only between the two nodes joined by each edge of the similarity graph. Because propagation may make the interaction probability of stores outside the candidate store set of x_i non-zero, the interaction probabilities of all stores relative to x_i must be disambiguated and normalized: a. the interaction probabilities of stores not in the candidate store set are set to 0; b. the interaction probabilities of stores in the candidate store set are min-max normalized;

106. Using the probabilities to which the propagation in step 105 converges, predict from the candidate store set of the partial label data set the store with which the user will interact in the future;

The specific steps of preprocessing the user's shopping status data in step 101 are as follows:

1011. Cleaning of abnormal samples: using the longitude and latitude in the original data set and the Wi-Fi strength information of the current shopping state, the anomaly confidence of each sample is calculated according to the formula

[formula (1)]

where λ_i, …

1012. Filling of missing Wi-Fi information: first, find the 10 samples whose longitude and latitude are most similar to those of the sample whose Wi-Fi strength information is missing, the Wi-Fi strength information of these 10 samples all being known. The similarity between two samples is calculated according to the formula

[formula (2)]

where λ_a, … and λ_b, …

[formula (3)]

is used to fill the missing Wi-Fi strength information of the sample, where sample a is the sample to be filled, a_i (i = 1, 2, ..., 10) are the 10 neighbor samples of sample a, and … is the Wi-Fi strength information corresponding to sample a_i;

The specific steps of constructing the partial label data set according to the candidate store set corresponding to each sample in step 102 are as follows:

For each sample in the original data, repeat the following operations to construct the partial label data set: (1) calculate the distance between the sample and each store from the user's longitude and latitude and the store's longitude and latitude in the original data set

[distance formula]

where λ_A, … respectively denote the longitude and latitude of store A, and λ_a, …

… solve for the weight values of the 10 stores corresponding to this sample relative to this sample, where λ_a, … respectively denote the longitude and latitude of the 10 nearest stores corresponding to sample a. If the calculated weight of a store is greater than 0.4, the store is added to the candidate store set of the sample;

The feature extraction operation on the partial label data set in step 103 specifically comprises:

Wi-Fi distance strength: first, the Wi-Fi names are discretized into a 1000-dimensional feature vector whose feature values are the Wi-Fi strengths corresponding to the Wi-Fi names; then the discretized Wi-Fi strength feature vector is converted into the Wi-Fi distance strength feature vector according to the conversion formula:

[formula (5)]

where … respectively denote the longitude and latitude of the user corresponding to the sample;

The specific steps of performing probability propagation according to the similarity graph in step 105 are:

1051. Probability initialization: for each sample, first assume that the probability of a store appearing in its candidate store set equals the proportion of that store's occurrences in the entire data set; that is, the frequency of a store in the data set is taken as the probability that the store appears in the sample's candidate store set. Further assume that, given the Wi-Fi distance strength of the i-th sample, the probability that a store in the candidate set is the true label follows a logistic distribution. From the existing partial label data set, construct the likelihood function:

[formula (6)]

where p(y ∈ S_i | x_i, θ) is the probability that, given the Wi-Fi distance strength vector of the i-th sample, the true label lies in the candidate store set of the sample; n_y is the number of occurrences of store y in the entire data set; π_{i,y} is the probability that store y appears in the candidate store set; and p(y | x_i, θ) is the probability that, given the Wi-Fi distance strength vector of the i-th sample, store y is the true label. This likelihood function formalizes the known fact that the true label of every sample in the data set lies in its candidate store set, and the parameter θ is estimated by maximum likelihood estimation, where …

1052. Probability propagation: in the t-th iteration of probability propagation, a new probability matrix F_t, influenced by propagation from neighboring samples, is obtained from the probability matrix F_{t-1} of the previous iteration and the initialization probability matrix P = [p(y_i = j | x_i, θ)]_{m×q}:

[formula (7)]

where W ∈ R^{m×m} is the similarity matrix between samples. Probability propagation runs for 50 iterations in total. In each round, the store interaction probabilities of each sample are propagated to its neighboring samples according to the similarity between samples, and each sample updates its own interaction probability for a store from the interaction probabilities of its 10 neighboring samples.

2. The big data prediction method for locating the store where a user is, based on partial label learning, according to claim 1, characterized in that the specific steps of constructing the similarity graph according to the feature space in step 104 are:

To construct the similarity graph <V, E, ω_e> based on the feature space, the nodes V, the edges E, and the edge weights ω_e of the similarity graph must each be defined;

1041. Definition of the nodes of the similarity graph: each sample in the partial label data set is treated as a node in the similarity graph;

1042. Definition of the edges of the similarity graph: for each sample in the partial label data set, that is, each node in the similarity graph, select the 10 samples other than itself whose Wi-Fi distance strength feature vectors are closest in Euclidean distance as its associated objects, and connect each such pair of nodes as an edge of the similarity graph;

1043. Definition of the edge weights of the similarity graph: Similar(a, b) in formula (2) is used as the weight of edge (a, b) of the similarity graph, where a and b are the two samples in the partial label data set corresponding to the two nodes in the similarity graph.

3. The big data prediction method for locating the store where a user is, based on partial label learning, according to claim 2, characterized in that, in the partial label learning problem, each iteration must disambiguate the updated probability matrix, that is, set the interaction probabilities of stores outside each sample's candidate store set to 0 and normalize the interaction probabilities of the stores in the candidate store set:

[formula (8)]

4. The big data prediction method for locating the store where a user is, based on partial label learning, according to claim 3, characterized in that the specific steps of step 106, in which the store with which the user will interact in the future is predicted from the candidate store set of the partial label data set using the probabilities to which the propagation converges, are:

From the probability matrix F_t obtained by the propagation in step 105, the predicted store with which the user corresponding to each sample is most likely to interact is obtained:

[formula (9)]