CN103744928B

CN103744928B - Network video classification method based on historical access records

Info

Publication number: CN103744928B
Application number: CN201310743880.7A
Authority: CN
Inventors: 宿红毅; 朱叶; 王彩群; 闫波; 郑宏
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2013-12-30
Filing date: 2013-12-30
Publication date: 2017-10-03
Anticipated expiration: 2033-12-30
Also published as: CN103744928A

Abstract

The invention relates to a network video classification method based on historical access records and belongs to the technical field of computer network data mining. First, the historical access record data set of the videos is analyzed automatically, meaningful features are extracted, and a data file is generated; through this data file the historical access records are converted into a structured document that can be used for training. Logistic regression is then applied to the structured document to learn a prediction model. With the prediction model, a video to be predicted is classified by a procedure chosen according to how complete its historical access record information is. Compared with the prior art, the invention reduces labor cost, uses a more compact set of parameters in the computation, predicts more accurately, and takes less time. Because clustering can be applied or skipped depending on the completeness of the historical access record information of the video to be predicted, the model is also more widely applicable.

Description

A Network Video Classification Method Based on Historical Access Records

Technical Field

The invention relates to a network video classification method and belongs to the technical field of computer network data mining.

Background

With the rapid development of database technology, the wide adoption of database management systems, and the spread of the Internet, the volume of historical access records for videos on the Internet (hereinafter "videos") has grown sharply. Behind this surge of data lies a great deal of "treasure": information that is previously unknown but potentially useful. Data mining emerged to deal with data at this scale. It is the process of extracting implicit, previously unknown, yet useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world data.

The main tasks of data mining include classification, prediction, association analysis, time-series pattern mining, clustering, and deviation detection. For each kind of problem there are many specific data mining or statistical models.

Classification is a technique that builds a classifier from the characteristics of a data set and uses it to assign classes to samples whose class is unknown. It generally has two steps: model training and classification with the model. In the training phase, the characteristics of the training data set are analyzed and an accurate description, or model, of the corresponding data is produced for each class. In the usage phase, the model classifies each object according to the data that describes it.

Classification algorithms mainly include neural networks, decision trees, and statistical methods. The main statistical methods are regression and naive Bayes classification. Regression-based classification includes ordinary linear regression and logistic regression, both of which separate data into two classes. Ordinary logistic regression classifies by the odds of an event, that is, the ratio of the probability that the event occurs to the probability that it does not; for multi-class problems its natural extension, multinomial logit regression, is used. The most widely used prediction methods today are based on logistic regression: the data set is analyzed and modeled, and a binary classification is predicted for each object to be classified. However, the knowledge (attributes) in a data set is not all equally important, and some of it is redundant, which hinders correct and concise decisions. A good data set has few attributes, few rules per attribute, and few final generalized rules. Existing prediction methods based on logistic regression are limited in how well they reduce the data set: for example, they only rank attributes by importance and ignore the discrete distribution of attribute values, or they do not consider the correlation between attributes.

Summary of the Invention

The purpose of the invention is to overcome the shortcomings of current logistic-regression-based prediction methods in reducing the data set, by providing a network video classification method based on historical access records.

While keeping the classification and decision-making ability of the knowledge base unchanged, the method optimizes the feature extraction process for the data set, deletes irrelevant or unimportant attributes, and avoids overlap in the information carried by different variables. This makes the data set as compact as possible and reduces manual effort. Because fewer parameters take part in the computation, prediction is more accurate and faster. The method is simple and easy to implement, and it suits today's widely used distributed computing applications.

The method of the invention comprises the following steps:

Step 1. Build the prediction model

First, perform feature extraction. The historical access record data set of the videos is analyzed automatically, the most compact set of attribute features is extracted, and a working data file is generated; through this data file the historical access record data set is converted into a structured document that can be used for training.

Then, train the model. Logistic regression is applied to the structured document to learn the prediction model.

Step 2. Use the prediction model to predict the popularity of a video

First, check how complete the video's historical access record information is. If the video is new, that is, its historical access record information is incomplete, a clustering method is used to find the video most similar to it, and that video's historical access record information is used as the new video's. If the video is not new, that is, its historical access record information is complete, proceed directly to the next operation.

Then, extract features from the historical access record information of the video to be predicted, and use the prediction model to classify it as popular or unpopular.

Beneficial Effects

The invention uses a network video classification method based on historical access records to predict whether a video will be popular. A prediction model is built after attribute reduction, including feature extraction, on the videos' historical access record data set. This complete analysis of historical access records reduces manual effort, uses a more compact set of parameters in the computation, predicts more accurately, and takes less time. Because clustering can be applied or skipped depending on how complete the historical access record information of the video to be predicted is, the model is also more widely applicable.

Description of the Drawings

Fig. 1 is a flowchart of the method of the invention.

Detailed Description

Specific implementations of the invention are described in further detail below with reference to the drawing and an embodiment.

As shown in Fig. 1, a network video classification method based on historical access records comprises the following steps:

Step 1. Analyze the video historical access record data set, extract the most compact set of attribute features, and generate a working data file. Through this data file the video historical access records are converted into a structured document for training. The specific process is as follows:

First, apply value analysis to the video historical access record data set to remove data and attributes with abnormal values: attributes whose value never changes, missing or noisy data, and video records whose play count is below a given threshold. The result is data set U.
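The value-analysis pass described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dictionaries, the field name `play_count` and the function name `clean_records` are hypothetical, a missing value is represented by `None`, and noise detection is left out because the text does not say how noise is identified.

```python
from typing import Dict, List


def clean_records(records: List[Dict[str, object]],
                  min_plays: float,
                  play_key: str = "play_count") -> List[Dict[str, object]]:
    """Return a cleaned data set U (illustrative, not the patented procedure)."""
    # 1. Drop records with a missing value or a play count below the threshold.
    kept = [
        r for r in records
        if all(v is not None for v in r.values()) and r[play_key] >= min_plays
    ]
    if not kept:
        return []
    # 2. Drop attributes whose value never changes across the kept records.
    varying = [k for k in kept[0] if len({r[k] for r in kept}) > 1]
    return [{k: r[k] for k in varying} for r in kept]
```

Filtering rows before columns matters here: an attribute can become constant only after the sparse records are gone, so the column check is run on the surviving rows.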

Then, reduce the attribute set of data set $U$ with a heuristic attribute reduction algorithm based on the mutual information gain ratio. The reduction starts from the core and repeatedly adds the attribute that maximizes $Z(c, R, D)$, ending when the selected attribute subset has the same classification ability as the full attribute set. The specific steps are as follows:

First step: define the prediction system $S$ as a 4-tuple $S = (U, A, V, f)$, where $U = \{u_1, u_2, \dots, u_n\}$ is the set of video objects, that is, the universe of discourse; $A$ is the set of video attributes; $V = \bigcup_{a \in A} V_a$ is the set of attribute values, with $V_a$ the value range of attribute $a$; and $f : U \times A \to V$ is a mapping that assigns a unique value $f(u, a) \in V_a$ to each attribute $a$ of each video object $u$ in $U$.

For the prediction system $S$, the attribute set $A$ is divided into a condition attribute set $C$ and a decision attribute set $D$, with $A = C \cup D$ and $C \cap D = \emptyset$. The condition attribute set $C$ contains video ID $c_1$, title $c_2$, type $c_3$, duration level $c_4$, URL $c_5$, URL reputation $c_6$, play count $c_{10}$, comment count $c_{12}$, share count $c_{15}$, favorite count $c_{16}$, download count $c_{17}$, share rate $c_{18}$, favorite rate $c_{19}$, download rate $c_{20}$, like rate $c_{21}$, play count growth rate $c_{22}$, positive review rate $c_{23}$, timestamp $c_{24}$, watched duration $c_{25}$, and watched duration ratio $c_{26}$. The decision attribute set $D$ contains the single attribute $d$, popular or not. The prediction system $S$ partitioned in this way is called the decision system $L$.

In $S$, every attribute subset $G \subseteq A$ induces a binary equivalence relation

$$I_G = \{(x, y) \in U \times U : a(x) = a(y) \text{ for all } a \in G\},$$

called the indiscernibility relation induced by $G$. For the decision system $L = (U, C \cup D, V, f)$, let $R \subseteq C$, and let the partitions of $U$ induced by $I_R$ and $I_D$ be $X = \{X_1, X_2, \dots, X_n\}$ and $Y = \{Y_1, Y_2, \dots, Y_n\}$ respectively. The entropy of $R$ is defined as

$$H(R) = -\sum_{i} p(X_i) \log p(X_i),$$

where $p(X_i) = \mathrm{card}(X_i)/\mathrm{card}(U)$. The conditional entropy of $D$ given $R$ is defined as

$$H(D/R) = -\sum_{i} p(X_i) \sum_{j} p(Y_j/X_i) \log p(Y_j/X_i),$$

where $p(Y_j/X_i) = \mathrm{card}(Y_j \cap X_i)/\mathrm{card}(X_i)$. The mutual information between the decision attribute set $D$ and the condition attribute subset $R$ is defined as

$$W(R; D) = H(D) - H(D/R),$$

and the importance of an attribute $c$ is measured by

$$Z(c, R, D) = \frac{W(R \cup \{c\}; D) - W(R; D)}{H(c)}, \qquad H(c) = -\sum_{i=1}^{m} p_i \log p_i,$$

where attribute $c$ takes $m$ distinct values $x_1, x_2, \dots, x_m$, $N$ is the total number of objects, and $p_i$ is the proportion of the $N$ objects whose value of $c$ is $x_i$.
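The quantities defined above translate directly into code. The sketch below is illustrative only: it assumes each video is a dictionary of discrete attribute values, uses base-2 logarithms (the text does not state a base), and computes $H(D/R)$ through the standard identity $H(D/R) = H(R \cup D) - H(R)$, which gives the same value as the double sum. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(rows, attrs):
    """H(R): entropy of the partition of `rows` induced by attributes `attrs`."""
    n = len(rows)
    blocks = Counter(tuple(r[a] for a in attrs) for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in blocks.values())


def conditional_entropy(rows, decision, given):
    """H(D/R), computed as H(R u D) - H(R)."""
    return entropy(rows, list(given) + list(decision)) - entropy(rows, given)


def mutual_information(rows, cond, decision):
    """W(R; D) = H(D) - H(D/R)."""
    return entropy(rows, decision) - conditional_entropy(rows, decision, cond)


def significance(rows, c, R, decision):
    """Z(c, R, D) = (W(R u {c}; D) - W(R; D)) / H(c)."""
    h_c = entropy(rows, [c])
    if h_c == 0:
        return 0.0  # a constant attribute carries no information
    gain = (mutual_information(rows, list(R) + [c], decision)
            - mutual_information(rows, R, decision))
    return gain / h_c
```

Dividing the gain by $H(c)$ is what makes this a gain ratio: an attribute with many distinct values, such as an ID, is penalized for splitting the data finely.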

Second step: compute the mutual information between the condition attribute set $C$ and the decision attribute set $D$, $W(C; D) = H(D) - H(D/C)$.

Third step: compute the core $R = \mathrm{CORE}_D(C)$ and then $W(R; D)$. The core is computed as follows:

1. Let $\mathrm{CORE}_D(C) = \emptyset$.

2. For every attribute $r$ in the condition attribute set $C$, if $H(\{d\}/C) < H(\{d\}/(C - \{r\}))$, then

$\mathrm{CORE}_D(C) = \mathrm{CORE}_D(C) \cup \{r\}$.

3. End.

Fourth step: let $C_{\mathrm{candidate}} = C - R$, compute the importance of each attribute in $C_{\mathrm{candidate}}$ as $Z(c, R, D) = (W(R \cup \{c\}; D) - W(R; D))/H(c)$, and select the attribute $c_i$ that maximizes $Z(c, R, D)$.

Fifth step: let $R = R \cup \{c_i\}$. If $W(C; D) = W(R; D)$, terminate and denote the data set restricted to the reduced attribute set by $U'$; otherwise return to the fourth step.
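The core computation and the greedy loop of the second to fifth steps can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names are hypothetical, logarithms are base 2, and a small tolerance `eps` replaces exact floating-point equality. It also tests the stopping condition before each addition, so no attribute is added when the core alone already reaches $W(C; D)$; the text as written adds one attribute before testing.

```python
import math
from collections import Counter


def H(rows, attrs):
    """Entropy of the partition of `rows` induced by `attrs`."""
    n = len(rows)
    blocks = Counter(tuple(r[a] for a in attrs) for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in blocks.values())


def H_cond(rows, dec, given):
    """H(D/R) computed as H(R u D) - H(R)."""
    return H(rows, list(given) + list(dec)) - H(rows, given)


def W(rows, cond, dec):
    """Mutual information W(R; D) = H(D) - H(D/R)."""
    return H(rows, dec) - H_cond(rows, dec, cond)


def Z(rows, c, R, dec):
    """Importance Z(c, R, D) of candidate attribute c."""
    h_c = H(rows, [c])
    return 0.0 if h_c == 0 else (W(rows, R + [c], dec) - W(rows, R, dec)) / h_c


def core(rows, C, dec, eps=1e-12):
    """Attributes whose removal increases H(D / C - {r})."""
    h_full = H_cond(rows, dec, C)
    return [r for r in C
            if h_full < H_cond(rows, dec, [a for a in C if a != r]) - eps]


def reduce_attributes(rows, C, dec, eps=1e-12):
    """Grow R from the core until W(R; D) reaches W(C; D)."""
    target = W(rows, C, dec)
    R = core(rows, C, dec)
    while W(rows, R, dec) < target - eps:
        candidates = [c for c in C if c not in R]
        if not candidates:
            break
        R.append(max(candidates, key=lambda c: Z(rows, c, R, dec)))
    return R
```

The loop adds one attribute per pass, so it finishes after at most as many passes as there are condition attributes.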

Afterwards, perform principal component analysis on data set $U'$ to obtain several mutually uncorrelated principal components. The specific steps are as follows:

First step: apply Z-score standardization to data set $U'$ to obtain data set $U''$.

Second step: perform principal component analysis on data set $U''$ to obtain the eigenvalue, variance contribution rate, and cumulative variance contribution rate of each principal component, with the principal components sorted by eigenvalue in descending order. The number of principal components $k$ is the number needed for the cumulative variance contribution rate to exceed 85%. From the factor loading table produced by the analysis, write out the relation between each of the $k$ principal components and the attributes of data set $U''$:

$$Z_k = \sum_{m} \beta_{km} c_m,$$

where $Z_k$ is the $k$-th principal component, $\beta_{km}$ is the $m$-th factor loading of $Z_k$, and $c_m$ is the value of the $m$-th attribute in data set $U''$, with $c_m \in$ {video ID $c_1$, title $c_2$, type $c_3$, duration level $c_4$, URL $c_5$, URL reputation $c_6$, play count $c_{10}$, comment count $c_{12}$, like rate $c_{21}$, share rate $c_{18}$, favorite rate $c_{19}$, play count growth rate $c_{22}$, watched duration ratio $c_{26}$}.
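The parts of this step that do not need an eigen-solver can be shown in a short sketch: Z-score standardization of one attribute column, the choice of $k$ from a list of eigenvalues using the 85% rule, and the evaluation of one principal component from its loadings. The eigenvalues and loadings are assumed to come from a numerical library; the function names are hypothetical, and the use of the sample standard deviation (divisor $n-1$) is an assumption, since the text does not say which form is used.

```python
import math


def z_standardize(column):
    """Z-score one attribute column (sample standard deviation, n - 1)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    # A constant column would give std == 0; such columns are assumed to
    # have been removed earlier by the value-analysis pass.
    return [(x - mean) / std for x in column]


def choose_k(eigenvalues, threshold=0.85):
    """Smallest k whose cumulative variance contribution exceeds threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, lam in enumerate(ordered, start=1):
        cumulative += lam / total
        if cumulative > threshold:
            return k
    return len(ordered)


def component_score(loadings, values):
    """Z_k = sum over m of beta_km * c_m for one principal component."""
    return sum(b * c for b, c in zip(loadings, values))
```

For example, eigenvalues 3.0, 1.0, 0.5 and 0.5 give cumulative contributions of 60%, 80%, 90% and 100%, so three components are kept.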

Step 2. Apply logistic regression to the structured document to learn the prediction model. The specific process is as follows:

Perform binary logistic regression on the principal component values obtained in Step 1 to obtain the logistic regression model

$$P = \frac{1}{1 + e^{-(\alpha_1 Z_1 + \alpha_2 Z_2 + \dots + \alpha_k Z_k)}},$$

where $\alpha_1, \alpha_2, \dots, \alpha_k$ are the parameters obtained by training the prediction model. The closer $P$ is to 1, the more popular the video to be classified; the closer $P$ is to 0, the less popular it is. If $P \ge 0.5$, the video to be classified is a popular video; if $P < 0.5$, it is an unpopular video.
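A binary logistic regression over the principal component values, with the 0.5 decision rule stated above, might look like the sketch below. It is a minimal illustration under several assumptions: the model has no intercept because the text names only the parameters for the $k$ components, training uses plain batch gradient descent, and the learning rate, epoch count, and function names are hypothetical choices rather than anything specified by the source.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict_proba(alphas, z):
    """P for one video given its principal component values z."""
    return sigmoid(sum(a * x for a, x in zip(alphas, z)))


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit the alpha parameters by batch gradient descent on the log-loss."""
    k, n = len(samples[0]), len(samples)
    alphas = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for z, y in zip(samples, labels):
            err = predict_proba(alphas, z) - y
            for j in range(k):
                grad[j] += err * z[j]
        alphas = [a - lr * g / n for a, g in zip(alphas, grad)]
    return alphas


def classify(alphas, z):
    """Apply the decision rule: P >= 0.5 means popular."""
    return "popular" if predict_proba(alphas, z) >= 0.5 else "unpopular"
```

Because the inputs are principal components of standardized data, they are centered on zero, which is one reason a model without an intercept can still be workable in a sketch like this.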

Step 3. Use the prediction model to test whether a video is popular. The specific process is as follows:

First, check how complete the video's historical access record information is. If the video to be predicted is new, it has no historical access records but does have some feature information of its own, such as video ID, query ID, title, description, and keywords. Compute tf-idf values from this feature information and use the tf-idf matrix as the input to a clustering model. With tf-idf, videos can be clustered by their text content to find the video most similar to the new video, and that video's historical access record information is then used as the new video's. If the video to be predicted is not new, proceed directly to the next step.
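The lookup of the most similar video can be illustrated with tf-idf weights and cosine similarity. Note the substitution: the text feeds the tf-idf matrix to a clustering model that it does not name, whereas this sketch uses a direct nearest-neighbor search, which is a simpler stand-in. The tokenized input format, the particular tf-idf weighting, and the function names are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {video_id: list of tokens}. Returns {video_id: {term: weight}}."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for vid, tokens in docs.items():
        tf = Counter(tokens)
        vectors[vid] = {
            t: (count / len(tokens)) * math.log(n / df[t])
            for t, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(new_id, docs):
    """ID of the existing video whose text is closest to the new video's."""
    vectors = tfidf_vectors(docs)
    others = [vid for vid in docs if vid != new_id]
    return max(others, key=lambda vid: cosine(vectors[new_id], vectors[vid]))
```

Once the closest video is known, its historical access records can stand in for the new video's, for instance `history[new_id] = history[most_similar(new_id, docs)]` with a hypothetical `history` dictionary.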

Then, convert the historical access record data of the video to be predicted accordingly, that is, perform feature extraction.

Finally, use the prediction model to classify the video as popular or unpopular.

Embodiment

The method of the invention has three stages: the first extracts features from the videos' historical access records, the second trains the prediction model, and the third predicts whether a video to be classified is popular.

Referring to Fig. 1, the first stage of this embodiment proceeds as follows:

Step 1: Depending on the volume of the videos' historical access record data, remove the access records of videos whose play count is below a threshold. Analysis of several data sets shows that these historical access records follow a long-tail distribution to some degree, that is, they contain many records for videos with few clicks. The first processing step is therefore to set a threshold Q and remove the video records whose click count is below it. Attribute columns whose value never changes are then removed, which yields the preliminary input data set U.

Step 2: Reduce the attribute set of data set $U$. The reduction starts from the core and repeatedly adds important attributes, ending when the selected attribute subset has the same classification ability as the full attribute set. After the preliminary screening in Step 1, the condition attribute set of the input data set is $C$ = {video ID $c_1$, title $c_2$, type $c_3$, duration level $c_4$, URL $c_5$, URL reputation $c_6$, uploader ID $c_7$, uploader fan level $c_8$, upload time $c_9$, play count $c_{10}$, number of distinct viewer IP addresses $c_{11}$, comment count $c_{12}$, positive review count $c_{13}$, video quality $c_{14}$, share count $c_{15}$, favorite count $c_{16}$, download count $c_{17}$, share rate $c_{18}$, favorite rate $c_{19}$, download rate $c_{20}$, like rate $c_{21}$, play count growth rate $c_{22}$, positive review rate $c_{23}$, timestamp $c_{24}$, watched duration $c_{25}$, watched duration ratio $c_{26}$}. First compute the mutual information between the condition attributes $C$ and the decision attribute $D$, $W(C; D) = 0.283$, and the relative core $K_D(C)$ = {video ID, URL reputation, uploader fan level, play count, comment count}. Then compute the importance of each remaining attribute:

$$Z(c_{21}, R, D) = \frac{W(R \cup \{c_{21}\}; D) - W(R; D)}{H(c_{21})} = 0.2182,$$

$$Z(c_{9}, R, D) = \frac{W(R \cup \{c_{9}\}; D) - W(R; D)}{H(c_{9})} = 0.2180,$$

$$Z(c_{4}, R, D) = \frac{W(R \cup \{c_{4}\}; D) - W(R; D)}{H(c_{4})} = 0.2160,$$

…,

$$Z(c_{14}, R, D) = \frac{W(R \cup \{c_{14}\}; D) - W(R; D)}{H(c_{14})} = 0.0110.$$

Adding attributes to the condition attribute set in descending order of importance gives $C'$ = {video ID $c_1$, title $c_2$, type $c_3$, duration level $c_4$, URL $c_5$, URL reputation $c_6$, uploader ID $c_7$, uploader fan level $c_8$, upload time $c_9$, play count $c_{10}$, comment count $c_{12}$, share count $c_{15}$, favorite count $c_{16}$, share rate $c_{18}$, favorite rate $c_{19}$, like rate $c_{21}$, play count growth rate $c_{22}$, watched duration ratio $c_{26}$}.

Step 3: Perform principal component analysis on the condition attribute set $C'$ to obtain several mutually uncorrelated principal components. The specific steps are as follows:

i) Apply Z-score standardization to the data set $U'$ corresponding to the condition attribute set $C'$ to obtain data set $U''$.

ii) Perform principal component analysis on data set $U''$ to obtain the eigenvalue of each principal component (sorted in descending order), its variance contribution rate, and its cumulative variance contribution rate. The number of principal components $k$ is the number needed for the cumulative variance contribution rate to exceed 85%. From the factor loading table produced by the analysis, write out the relations between the $k$ principal components and the attributes in $C'$, each of the form $Z_k = \sum_m \beta_{km} c_m$.

Steps 1 to 3 above make up the feature extraction stage, the first stage of this embodiment. They produce the structured document used as input to the subsequent model training.

After the first stage comes the second stage, model training, in which logistic regression is applied to the structured document from the first stage to learn the prediction model.

Among the many machine learning algorithms, logistic regression is efficient and performs well. It uses all of the features when training the prediction model and yields a logistic regression model of the kind described in Step 2 above.

The third stage predicts whether a video is popular. It comprises the following steps:

Step 1: Check how complete the description information of the video to be predicted is.

Step 2: If the video to be predicted is not new, that is, it has some historical access record data, extract feature values from that data, convert them into structured document form, and feed them to the prediction model to predict whether the video is popular.

Step 3: If the video to be predicted is new, use clustering to find the video most similar to it, take that video's information as the description information of the video to be predicted, and then carry out the corresponding prediction.

Specifically, the problem of predicting whether a new video is popular is converted into finding the set of videos most similar to it, that is, into a clustering problem.

The invention computes tf-idf values for the condition attributes of the video to be predicted and uses the tf-idf matrix as the input to the clustering model. tf-idf allows clustering on the content of the data set, and the similarity computed this way is more accurate.

Through these three steps of processing, this embodiment obtains a prediction of whether a new video is popular, so that videos can be predicted more accurately and targeted more precisely.

The specific example described above further explains the invention and does not limit its scope of protection. Any change or equivalent substitution made within the principle and spirit of the invention falls within its scope of protection.

Claims

1. A network video classification method based on historical access records, characterized in that it comprises the following steps:

Step 1. Analyze the video historical access record data set, extract attribute features, and generate a working data file; through this data file, convert the video historical access records into a structured document for training; the specific process is as follows:

First, apply value analysis to the video historical access record data set to remove data and attributes with abnormal values, including attributes whose value never changes, missing or noisy data, and video records whose play count is below a given threshold, to obtain data set $U$;

Then, reduce the attribute set of data set $U$ with a heuristic attribute reduction algorithm based on the mutual information gain ratio; the reduction starts from the core and repeatedly adds the attribute that maximizes $Z(c, R, D)$, ending when the selected attribute subset has the same classification ability as the full attribute set; the specific steps are as follows:

First step: define the prediction system $S$ as a 4-tuple $S = (U, A, V, f)$, where $U = \{u_1, u_2, \dots, u_n\}$ is the set of video objects, that is, the universe of discourse; $A$ is the set of video attributes; $V = \bigcup_{a \in A} V_a$ is the set of attribute values, with $V_a$ the value range of attribute $a$; and $f : U \times A \to V$ is a mapping that assigns a unique value $f(u, a) \in V_a$ to each attribute $a$ of each video object $u$ in $U$;

For the prediction system $S$, the attribute set $A$ is divided into a condition attribute set $C$ and a decision attribute set $D$, with $A = C \cup D$ and $C \cap D = \emptyset$, where the condition attribute set $C$ contains video ID $c_1$, title $c_2$, type $c_3$, duration level $c_4$, URL $c_5$, URL reputation $c_6$, play count $c_{10}$, comment count $c_{12}$, share count $c_{15}$, favorite count $c_{16}$, download count $c_{17}$, share rate $c_{18}$, favorite rate $c_{19}$, download rate $c_{20}$, like rate $c_{21}$, play count growth rate $c_{22}$, positive review rate $c_{23}$, timestamp $c_{24}$, watched duration $c_{25}$, and watched duration ratio $c_{26}$; the decision attribute set $D$ contains the attribute $d$, popular or not; the prediction system $S$ partitioned in this way is called the decision system $L$; in $S$, every attribute subset $G \subseteq A$ induces a binary equivalence relation $I_G = \{(x, y) \in U \times U : a(x) = a(y) \text{ for all } a \in G\}$, called the indiscernibility relation induced by $G$; for the decision system $L = (U, C \cup D, V, f)$, let $R \subseteq C$, and let the partitions of $U$ induced by $I_R$ and $I_D$ be $X = \{X_1, X_2, \dots, X_n\}$ and $Y = \{Y_1, Y_2, \dots, Y_n\}$ respectively; the entropy of $R$ is defined as $H(R) = -\sum_{i} p(X_i) \log p(X_i)$, where $p(X_i) = \mathrm{card}(X_i)/\mathrm{card}(U)$; the conditional entropy of $D$ given $R$ is defined as $H(D/R) = -\sum_{i} p(X_i) \sum_{j} p(Y_j/X_i) \log p(Y_j/X_i)$, where $p(Y_j/X_i) = \mathrm{card}(Y_j \cap X_i)/\mathrm{card}(X_i)$; the mutual information between the decision attribute set $D$ and the condition attribute subset $R$ is defined as $W(R; D) = H(D) - H(D/R)$; the importance of an attribute is measured by $Z(c, R, D) = (W(R \cup \{c\}; D) - W(R; D))/H(c)$ with $H(c) = -\sum_{i=1}^{m} p_i \log p_i$, where attribute $c$ takes $m$ distinct values $x_1, x_2, \dots, x_m$, $N$ is the total number of objects, and $p_i$ is the proportion of the $N$ objects whose value of $c$ is $x_i$;

Second step: compute the mutual information between the condition attribute set $C$ and the decision attribute set $D$, $W(C; D) = H(D) - H(D/C)$;

Third step: compute the core $R = \mathrm{CORE}_D(C)$ and then $W(R; D)$, where the core is computed as follows:

Let $\mathrm{CORE}_D(C) = \emptyset$; for every attribute $r$ in the condition attribute set $C$, if $H(\{d\}/C) < H(\{d\}/(C - \{r\}))$, then $\mathrm{CORE}_D(C) = \mathrm{CORE}_D(C) \cup \{r\}$;

Fourth step: let $C_{\mathrm{candidate}} = C - R$, compute the importance of each attribute in $C_{\mathrm{candidate}}$ as $Z(c, R, D) = (W(R \cup \{c\}; D) - W(R; D))/H(c)$, and select the attribute $c_i$ that maximizes $Z(c, R, D)$;

Fifth step: let $R = R \cup \{c_i\}$; if $W(C; D) = W(R; D)$, terminate and denote the data set restricted to the reduced attribute set by $U'$; otherwise return to the fourth step;

Afterwards, perform principal component analysis on data set $U'$ to obtain several mutually uncorrelated principal components; the specific steps are as follows:

First step: apply Z-score standardization to data set $U'$ to obtain data set $U''$;

Second step: perform principal component analysis on data set $U''$ to obtain the eigenvalue, variance contribution rate, and cumulative variance contribution rate of each principal component, with the principal components sorted by eigenvalue in descending order; the number of principal components $k$ is the number needed for the cumulative variance contribution rate to exceed 85%; from the factor loading table produced by the analysis, write out the relation between each of the $k$ principal components and the attributes of data set $U''$ as $Z_k = \sum_{m} \beta_{km} c_m$, where $Z_k$ is the $k$-th principal component, $\beta_{km}$ is the $m$-th factor loading of $Z_k$, and $c_m$ is the value of the $m$-th attribute in data set $U''$, with $c_m \in$ {video ID $c_1$, title $c_2$, type $c_3$, duration level $c_4$, URL $c_5$, URL reputation $c_6$, play count $c_{10}$, comment count $c_{12}$, like rate $c_{21}$, share rate $c_{18}$, favorite rate $c_{19}$, play count growth rate $c_{22}$, watched duration ratio $c_{26}$};

Step 2. Apply logistic regression to the structured document to learn the prediction model; the specific process is as follows:

Perform binary logistic regression on the principal component values obtained in Step 1 to obtain the logistic regression model $P = 1/\left(1 + e^{-(\alpha_1 Z_1 + \alpha_2 Z_2 + \dots + \alpha_k Z_k)}\right)$,

where $\alpha_1, \alpha_2, \dots, \alpha_k$ are the parameters obtained by training the prediction model; the closer $P$ is to 1, the more popular the video to be classified, and the closer $P$ is to 0, the less popular it is; if $P \ge 0.5$, the video to be classified is a popular video; if $P < 0.5$, it is an unpopular video;

Step 3. Use the above prediction model to test whether a video is popular; the specific process is as follows:

First, check how complete the video's historical access record information is; if the video to be predicted is new, that is, it has no historical access records, compute tf-idf values from the feature information of the video, use the tf-idf matrix as the input to a clustering model to find the video most similar to the new video, and use that video's historical access record information as the new video's; if the video to be predicted is not new, proceed directly to the next step;

Then, convert the historical access record data of the video to be predicted accordingly, that is, perform feature extraction;

Finally, use the prediction model to classify the video as popular or unpopular.