CN109558543A - Sample sampling method, sample sampling device, server and storage medium

Info

Publication number: CN109558543A
Application number: CN201811510255.7A
Authority: CN
Inventors: 徐龙; 钟航标; 彭振
Original assignee: Lazas Network Technology Shanghai Co Ltd
Current assignee: Lazas Network Technology Shanghai Co Ltd
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2019-04-02

Abstract

Embodiments of the present invention relate to the field of data statistics and disclose a sample sampling method, a sample sampling device, a server and a storage medium. The sample sampling method includes: acquiring scale factors and corresponding data volumes at a plurality of predetermined moments; performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model; predicting a scale factor according to the scale factor prediction model and a data volume; and sampling data using the predicted scale factor. The invention also discloses a sample sampling device, a server and a non-volatile storage medium, so that the scale factor changes dynamically and better suits the data characteristics at the sampling moment.

Description

A sample sampling method, sample sampling device, server and storage medium

Technical Field

Embodiments of the present invention relate to the field of data statistics, and in particular to a sample sampling method, a sample sampling device, a server, and a storage medium.

Background

The e-commerce industry generates large volumes of data, which makes it well suited to solutions such as personalized recommendation that improve the user experience. These solutions all require statistical analysis of the data, and one existing approach is to train a model on historical data. With this approach, the accuracy of the model depends on the ratio of positive to negative samples in the data. Generally, the proportion of positive samples needs to be higher than that of negative samples, but in practice the samples are often imbalanced, for example with far too many negative samples, which greatly reduces the accuracy of the resulting model.

The inventors of the present application found that, to address this problem, the prior art sets a sampling scale factor and samples the actual data volume with it in order to adjust the ratio of positive to negative samples. However, the value of this sampling scale factor is static. Because the data volume differs at each sampling moment, sampling with the same scale factor still leaves the resulting sample ratio hard to control, especially in application scenarios where the data volume varies widely.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a sample sampling method, a sample sampling device, a server and a storage medium in which the scale factor changes dynamically and better suits the data characteristics at the sampling moment.

To solve the above technical problem, an embodiment of the present invention provides a sample sampling method, including: acquiring scale factors and corresponding data volumes at a plurality of predetermined moments; performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model; predicting a scale factor according to the scale factor prediction model and a data volume; and sampling data using the predicted scale factor.

An embodiment of the present invention also provides a sample sampling device, including: an acquisition module for acquiring scale factors and corresponding data volumes at a plurality of predetermined moments; a model generation module for performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model; a prediction module for predicting a scale factor according to the scale factor prediction model and a data volume; and a sampling module for sampling data using the predicted scale factor.

An embodiment of the present invention also provides a server, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following steps: acquiring scale factors and corresponding data volumes at a plurality of predetermined moments; performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model; predicting a scale factor according to the scale factor prediction model and a data volume; and sampling data using the predicted scale factor.

An embodiment of the present invention also provides a non-volatile storage medium for storing a computer-readable program, the computer-readable program being used by a computer to execute the sample sampling method described above.

Compared with the prior art, the main difference and effect of the embodiments of the present invention are as follows. A prediction model for the scale factor is trained from a plurality of scale factors at predetermined moments in a historical period, and the scale factor required for the current sampling is then predicted with that model. The scale factor generated in this way changes dynamically with the moment at which sampling takes place, even within the same statistical period, and the change reflects variations in the historical data. The scale factor therefore better matches the data characteristics of the current moment, and sampling yields a more suitable sample volume. Even in scenarios where the data volume varies widely, the embodiments of the present invention can sample a suitable amount of sample data, so that the positive and negative sample volumes needed when the samples are used are better balanced.

As a further improvement, the model training is regression fitting training based on the Xgboost model. Regression fitting training with the Xgboost model can obtain the prediction model more quickly and accurately.

As a further improvement, the predetermined moments include: same-period moments of the moment to be sampled, and/or sequential moments of the moment to be sampled.

As a further improvement, the same-period moments of the moment to be sampled include: the moments corresponding to the moment to be sampled in the previous N statistical periods, where N is a natural number greater than 0. The sequential moments of the moment to be sampled include: M moments that fall within the same statistical period as the moment to be sampled and are earlier than it, where M is a natural number greater than 0.

As a further improvement, the statistical period is one day. Using one day as the statistical period avoids stale data caused by too long a period and overly frequent computation caused by too short a period, and it also better matches how often users use the service.

As a further improvement, the data to be sampled is negative sample data.

As a further improvement, the data volume is traffic data.

The above is only an overview of the technical solutions of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the present invention are easier to understand, specific embodiments of the present invention are set out below.

Brief Description of the Drawings

One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.

FIG. 1 is a flowchart of a sample sampling method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a sample sampling device according to another embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a server according to yet another embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that many technical details are set out in the embodiments so that the reader can better understand the present application. However, the technical solutions claimed in the present application can be realized even without these technical details and with various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description and does not limit the specific implementation of the present invention; the embodiments may be combined with and refer to one another where they do not contradict each other.

A first embodiment of the present invention relates to a sample sampling method. Its flow is shown in FIG. 1 and is as follows:

Step 101: acquire scale factors and corresponding data volumes at a plurality of predetermined moments.

Take as an example the case where the predetermined moments include same-period moments of the moment to be sampled. Specifically, the same-period moments of the moment to be sampled include the moments corresponding to the moment to be sampled in the previous N statistical periods, where N is a natural number greater than 0. For example, if the moment to be sampled is 10:00 today and the statistical period is one day, the corresponding same-period moments may include 10:00 yesterday, 10:00 the day before yesterday, 10:00 three days ago, and so on. In practice, to make data statistics more convenient, each same-period moment may be widened into a time window, such as 10:00-11:00 yesterday, 10:00-11:00 the day before yesterday, 10:00-11:00 three days ago, and so on.

The predetermined moments may also include sequential moments of the moment to be sampled. Specifically, the sequential moments of the moment to be sampled include M moments that fall within the same statistical period as the moment to be sampled and are earlier than it, where M is a natural number greater than 0. For example, if the moment to be sampled is 10:00 today, the corresponding sequential moments may include 9:50 today, 9:40 today, 9:30 today, and so on.

In practice, the plurality of predetermined moments may include both the same-period moments and the sequential moments of the moment to be sampled, that is, both sets of data described above; the details are not repeated here.
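The two kinds of predetermined moments can be enumerated with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, the statistical period is taken to be one calendar day starting at midnight, and the 10-minute spacing of the sequential moments is inferred from the 9:50, 9:40, 9:30 example rather than specified by the text.

```python
from datetime import datetime, timedelta


def same_period_moments(target, n, period=timedelta(days=1)):
    """Moments corresponding to `target` in the previous n statistical periods."""
    if n < 1:
        raise ValueError("n must be a natural number greater than 0")
    return [target - k * period for k in range(1, n + 1)]


def sequential_moments(target, m, step=timedelta(minutes=10)):
    """Up to m earlier moments that lie in the same (daily) period as `target`."""
    if m < 1:
        raise ValueError("m must be a natural number greater than 0")
    # Assumption: the statistical period is the calendar day of `target`.
    period_start = target.replace(hour=0, minute=0, second=0, microsecond=0)
    candidates = [target - k * step for k in range(1, m + 1)]
    # Drop candidates that would fall into the previous statistical period.
    return [t for t in candidates if t >= period_start]


to_sample = datetime(2018, 12, 11, 10, 0)
predetermined = same_period_moments(to_sample, 3) + sequential_moments(to_sample, 3)
```

Here `predetermined` holds 10:00 on the three preceding days followed by 9:50, 9:40 and 9:30 on the day of sampling. Shortly after midnight, fewer than M sequential moments are available, which is why the second function may return a shorter list.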

After the plurality of predetermined moments have been determined, the scale factor and the corresponding data volume at each predetermined moment are acquired. The scale factor at a predetermined moment may be a preset value or the result of the previous prediction, and the corresponding data volume is the volume of data generated at that moment, such as the click-through rate.

Step 102: perform model training according to the acquired scale factors and corresponding data volumes.

Specifically, regression fitting training of the scale factor may be performed based on the Xgboost model. More specifically, the Xgboost model may use CART (Classification and Regression Trees). The value at a leaf node of a CART tree is an actual score; in this embodiment the score indicates whether the scale factor at a given moment affects the finally estimated scale factor, which makes it possible to identify which classes (moments) of scale factor affect the prediction result. Training with the Xgboost model produces the scale factor prediction model.
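To make the idea of boosted regression trees concrete, the following sketch fits a scale factor as a function of data volume using plain gradient boosting over depth-1 regression trees (stumps) with squared-error loss. It is a deliberately simplified stand-in written in pure Python, not the Xgboost library and not the patent's implementation: it has no regularization and no second-order terms, and the function names, the single data-volume feature and the training figures are all assumptions chosen for illustration.

```python
def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:
        # All rows are identical, so no split exists: fall back to a constant leaf.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def leaf_value(stump, row):
    j, thr, left, right = stump
    return left if row[j] <= thr else right


def train(X, y, n_rounds=100, learning_rate=0.3):
    """Gradient boosting: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * leaf_value(stump, row) for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(leaf_value(s, row) for s in stumps)


# Hypothetical history: data volume at each predetermined moment -> scale factor used.
volumes = [[1000.0], [2000.0], [4000.0], [8000.0]]
factors = [0.8, 0.4, 0.2, 0.1]
model = train(volumes, factors)
```

With the real library, a regressor such as `xgboost.XGBRegressor` would take the place of `train` and `predict`; the sketch only shows why summing many small trees can reproduce a relationship between data volume and scale factor.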

Step 103: predict the scale factor according to the scale factor prediction model and the data volume.

Specifically, the scale factor prediction model in this step is the one obtained in step 102. The data volume may itself be obtained from a historical estimation model, which makes the estimate more accurate and allows the upcoming data volume to be predicted earlier, facilitating real-time sampling in practical applications.

Step 104: sample the data using the predicted scale factor.

Specifically, this step performs sampling with the scale factor predicted in step 103.
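The text does not say how the scale factor is applied during sampling. One plausible reading, assumed here, is that the factor is the probability of keeping each record, so the expected sample size is the factor times the data volume. The function name and the optional `seed` parameter are hypothetical.

```python
import random


def sample_with_scale_factor(records, scale_factor, seed=None):
    """Keep each record independently with probability `scale_factor`."""
    if not 0.0 <= scale_factor <= 1.0:
        raise ValueError("scale_factor is assumed to be a keep probability in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is in [0.0, 1.0), so a factor of 1.0 keeps every record.
    return [r for r in records if rng.random() < scale_factor]


kept = sample_with_scale_factor(range(1000), 0.3, seed=42)
```

Because each record is drawn independently, the size of `kept` varies around 300 rather than being exactly 300. An implementation that needs an exact count could use `random.sample` with a computed size instead.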

Steps 101 and 102 above constitute the model preparation stage. The generated model can be used over a long period, so these steps need not be executed for every sampling. Steps 103 and 104 constitute the actual sampling process and are executed each time sampling is performed.

Compared with the prior art, the main difference and effect of this embodiment are as follows. A prediction model for the scale factor is trained from a plurality of scale factors at predetermined moments in a historical period, and the scale factor required for the current sampling is then predicted with that model. The scale factor generated in this way changes dynamically with the moment at which sampling takes place, even within the same statistical period, and the change reflects variations in the historical data. The scale factor therefore better matches the data characteristics of the current moment, and sampling yields a more suitable sample volume. Even in scenarios where the data volume varies widely, this embodiment can sample a suitable amount of sample data, so that the positive and negative sample volumes needed when the samples are used are better balanced.

A second embodiment of the present invention relates to a sample sampling method. This embodiment is applied specifically to negative sample sampling for product click-through rate.

Specifically, when training a click-through rate estimation model, a user's click on a product can be treated as a positive sample, and a product that was shown to the user but not clicked can be treated as a negative sample. In practice, the number of negative samples is often five times, or even more than ten times, the number of positive samples. This embodiment therefore uses the sample sampling method to sample the negative samples and reduce their number, so that the positive and negative samples used to train the click-through rate estimation model are balanced.

Steps 101 to 104 of the first embodiment are followed in the same way. The data volume may be traffic data, and after the scale factor has been predicted, it is used to perform negative sampling on that traffic data.
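The negative sampling described in this embodiment can be sketched as follows. The record layout `(item_id, clicked)`, the function name and the example figures are assumptions made for illustration; the scale factor is again read as the probability of keeping a negative sample, while every positive sample (a click) is kept.

```python
import random


def downsample_negatives(impressions, scale_factor, seed=None):
    """Keep all clicked impressions; keep unclicked ones with probability scale_factor."""
    rng = random.Random(seed)
    kept = []
    for item_id, clicked in impressions:
        if clicked or rng.random() < scale_factor:
            kept.append((item_id, clicked))
    return kept


# Hypothetical traffic: 100 clicks and 1000 impressions without a click (1:10).
traffic = [(i, True) for i in range(100)] + [(i, False) for i in range(100, 1100)]
balanced = downsample_negatives(traffic, scale_factor=0.1, seed=7)
positives = sum(1 for _, clicked in balanced if clicked)
negatives = len(balanced) - positives
```

With a factor of 0.1, the roughly 1:10 ratio in `traffic` moves toward 1:1 in `balanced`. When traffic doubles at a later moment, a static factor would let twice as many negatives through, which is the situation the predicted, time-varying factor is meant to handle.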

When applied to negative sample sampling for product click-through rate, this embodiment reduces the number of negative samples used in training the click-through rate estimation model and balances the positive and negative samples, so that the product click-through rate model trained afterwards better reflects the actual situation and its click-through rate estimates are more accurate.

The division of the above methods into steps is only for clarity of description. In implementation, steps may be merged into one step, or a step may be split into several steps; as long as the same logical relationship is included, such variants fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this patent.

A third embodiment of the present invention relates to a sample sampling device which, as shown in FIG. 2, includes:

an acquisition module for acquiring scale factors and corresponding data volumes at a plurality of predetermined moments;

a model generation module for performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model;

a prediction module for predicting a scale factor according to the scale factor prediction model and a data volume; and

a sampling module for sampling data using the predicted scale factor.
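The four modules listed above can be pictured as four pluggable callables wired together in order. The sketch below is one possible arrangement, not the structure of FIG. 2: the class name, type aliases and the two stand-in functions are hypothetical, and the stand-in model is a nearest-volume lookup rather than a trained regression model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# One history record: (data_volume, scale_factor) at a predetermined moment.
History = List[Tuple[float, float]]
Model = Callable[[float], float]  # maps a data volume to a scale factor


@dataclass
class SampleSamplingDevice:
    acquire: Callable[[], History]              # acquisition module
    generate_model: Callable[[History], Model]  # model generation module
    predict: Callable[[Model, float], float]    # prediction module
    sample: Callable[[Sequence, float], list]   # sampling module

    def run(self, data: Sequence, current_volume: float) -> list:
        history = self.acquire()
        model = self.generate_model(history)
        factor = self.predict(model, current_volume)
        return self.sample(data, factor)


# Hypothetical stand-ins so the sketch runs; a real model generation module
# would train a regression model (for example with Xgboost) instead.
def nearest_volume_model(history: History) -> Model:
    def model(volume: float) -> float:
        return min(history, key=lambda rec: abs(rec[0] - volume))[1]
    return model


def bernoulli_sample(data: Sequence, factor: float) -> list:
    rng = random.Random(0)
    return [x for x in data if rng.random() < factor]


device = SampleSamplingDevice(
    acquire=lambda: [(1000.0, 1.0), (8000.0, 0.0)],
    generate_model=nearest_volume_model,
    predict=lambda model, volume: model(volume),
    sample=bernoulli_sample,
)
```

In this sketch `run` retrains the model on every call for brevity. Since the generated model can be reused over a long period, a practical arrangement would cache the result of `generate_model` and call only the prediction and sampling modules at each sampling moment.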

In one example, the model training is regression fitting training based on the Xgboost model.

In one example, the predetermined moments include: same-period moments of the moment to be sampled, and/or sequential moments of the moment to be sampled. Specifically, the same-period moments of the moment to be sampled include the moments corresponding to the moment to be sampled in the previous N statistical periods, where N is a natural number greater than 0; the sequential moments of the moment to be sampled include M moments that fall within the same statistical period as the moment to be sampled and are earlier than it, where M is a natural number greater than 0.

In one example, the statistical period is one day.

In one example, the data to be sampled is negative sample data.

In one example, the data volume is traffic data.

Compared with the prior art, this embodiment trains a prediction model for the scale factor from a plurality of scale factors at predetermined moments in a historical period, and then predicts the scale factor required for the current sampling with that model. The scale factor generated in this way changes dynamically with the moment at which sampling takes place, even within the same statistical period, and the change reflects variations in the historical data. The scale factor therefore better matches the data characteristics of the current moment, and sampling yields a more suitable sample volume. Even in scenarios where the data volume varies widely, this embodiment can sample a suitable amount of sample data, so that the positive and negative sample volumes needed when the samples are used are better balanced.

This embodiment is a device embodiment corresponding to the first embodiment and can be implemented in cooperation with it. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment also apply to the first embodiment.

It is worth noting that the modules in this embodiment are all logical modules. In practice, a logical unit may be a physical unit, part of a physical unit, or a combination of several physical units. In addition, to highlight the innovative part of the present invention, this embodiment does not introduce units that are not closely related to solving the technical problem addressed by the invention, but this does not mean that no other units exist in this embodiment.

A fourth embodiment of the present invention relates to a server. As shown in FIG. 3, the server includes: at least one processor 301; a memory 302 communicatively connected to the at least one processor 301; and a communication component 303 communicatively connected to a scanning device, the communication component 303 receiving and sending data under the control of the processor 301. The memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 to:

acquire scale factors and corresponding data volumes at a plurality of predetermined moments;

perform model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model;

predict a scale factor according to the scale factor prediction model and a data volume; and

sample data using the predicted scale factor.

Specifically, the server includes one or more processors 301 and a memory 302; FIG. 3 takes one processor 301 as an example. The processor 301 and the memory 302 may be connected by a bus or by other means; FIG. 3 takes a bus connection as an example. As a non-volatile computer-readable storage medium, the memory 302 can store non-volatile software programs, non-volatile computer-executable programs and modules. By running the non-volatile software programs, instructions and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the device, that is, implements the sample sampling method described above.

The memory 302 may include a program storage area and a data storage area. The program storage area may store an operating system and the application program required by at least one function; the data storage area may store an option list and the like. In addition, the memory 302 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 optionally includes memory located remotely from the processor 301, and such remote memory may be connected to external devices via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

One or more modules are stored in the memory 302 and, when executed by the one or more processors 301, perform the sample sampling method of any of the method embodiments above.

The above product can execute the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.

A fifth embodiment of the present invention relates to a non-volatile storage medium for storing a computer-readable program, the computer-readable program being used by a computer to execute some or all of the method embodiments above.

That is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program is stored in a storage medium and includes a number of instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Those of ordinary skill in the art will understand that the above embodiments are specific examples of implementing the present invention, and that in practice various changes in form and detail may be made to them without departing from the spirit and scope of the present invention.

Embodiments of the present application provide A1. A sample sampling method, including:

acquiring scale factors and corresponding data volumes at a plurality of predetermined moments;

performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model;

predicting a scale factor according to the scale factor prediction model and a data volume;

A2. The sample sampling method according to A1, wherein the model training is regression fitting training based on the Xgboost model.

A3. The sample sampling method according to A1, wherein the predetermined moments include: same-period moments of the moment to be sampled, and/or sequential moments of the moment to be sampled.

A4. The sample sampling method according to A3, wherein the same-period moments of the moment to be sampled include: the moments corresponding to the moment to be sampled in the previous N statistical periods, where N is a natural number greater than 0;

the sequential moments of the moment to be sampled include: M moments that fall within the same statistical period as the moment to be sampled and are earlier than it, where M is a natural number greater than 0.

A5. The sample sampling method according to A4, wherein the statistical period is one day.

A6. The sample sampling method according to any one of A1 to A5, wherein the data to be sampled is negative sample data.

A7. The sample sampling method according to any one of A1 to A5, wherein the data volume is traffic data.

Embodiments of the present application also provide B8. A sample sampling device, including:

an acquisition module for acquiring scale factors and corresponding data volumes at a plurality of predetermined moments;

a model generation module for performing model training on the acquired scale factors and corresponding data volumes to generate a scale factor prediction model;

a prediction module for predicting a scale factor according to the scale factor prediction model and a data volume;

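Embodiments of the present application also provide C9. A server, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of:

acquiring scale factors and corresponding data volumes at a plurality of predetermined times;

performing model training on the acquired scale factors and the corresponding data volumes to generate a scale factor prediction model;

predicting the scale factor according to the scale factor prediction model and the data volume;

sampling the data using the predicted scale factor.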

C10. The server according to C9, wherein the model training is regression fitting training based on an XGBoost model.

C11. The server according to C9, wherein the predetermined times include: same-period (year-on-year) times of the time to be sampled, and/or sequential (chain-ratio) times of the time to be sampled.

C12. The server according to C11, wherein the same-period times of the time to be sampled include: the times corresponding to the time to be sampled in each of the previous N statistical periods, where N is a natural number greater than 0;

The sequential times of the time to be sampled include: M times that fall within the same statistical period as the time to be sampled and are earlier than the time to be sampled, where M is a natural number greater than 0.

C13. The server according to C12, wherein the statistical period is one day.

C14. The server according to any one of C9 to C13, wherein the data to be sampled is negative sample data.

C15. The server according to any one of C9 to C14, wherein the data volume is traffic data.

Embodiments of the present application also provide D16. A non-volatile storage medium for storing a computer-readable program, the computer-readable program being used by a computer to execute the sample sampling method according to any one of A1 to A7.
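
The overall flow that such a program would run (acquire, train, predict, sample) can be sketched as one class whose methods mirror the modules of the device. This is only an illustration: the class and method names are hypothetical, the "training" is a nearest-volume lookup standing in for the real regression model, and the scale factor is assumed to be a keep probability.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Record = Tuple[float, float]  # (scale factor, data volume) at one time
Model = Callable[[float], float]


def nearest_volume_trainer(records: Sequence[Record]) -> Model:
    """Placeholder training: answer with the factor of the closest known volume."""
    def model(data_volume: float) -> float:
        factor, _ = min(records, key=lambda rec: abs(rec[1] - data_volume))
        return factor
    return model


class SampleSamplingDevice:
    def __init__(self, history: Dict[Hashable, Record],
                 trainer: Callable[[Sequence[Record]], Model] = nearest_volume_trainer,
                 seed=None):
        self.history = history  # hypothetical store keyed by time
        self.trainer = trainer
        self.rng = random.Random(seed)

    def acquire(self, times: Sequence[Hashable]) -> List[Record]:
        """Acquisition: look up factor and volume at each predetermined time."""
        return [self.history[t] for t in times]

    def generate_model(self, records: Sequence[Record]) -> Model:
        """Model generation: fit a predictor on the acquired records."""
        return self.trainer(records)

    def predict(self, model: Model, data_volume: float) -> float:
        """Prediction: clamp the output so it is usable as a probability."""
        return min(1.0, max(0.0, model(data_volume)))

    def sample(self, data: Sequence, scale_factor: float) -> List:
        """Sampling: keep each item with probability scale_factor (an assumption)."""
        return [item for item in data if self.rng.random() < scale_factor]


device = SampleSamplingDevice({"d-1": (0.5, 1000.0), "d-2": (0.2, 5000.0)}, seed=0)
model = device.generate_model(device.acquire(["d-1", "d-2"]))
kept = device.sample(list(range(100)), device.predict(model, 4800.0))
```

Passing a different `trainer` swaps in another regression model without touching the other three steps.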

Claims

1. A sample sampling method, comprising:

acquiring scale factors and corresponding data volumes at a plurality of predetermined times;

performing model training on the acquired scale factors and the corresponding data volumes to generate a scale factor prediction model;

predicting a scale factor according to the scale factor prediction model and a data volume;

sampling data using the predicted scale factor.

2. The sample sampling method of claim 1, wherein the model training is regression fitting training based on an XGBoost model.

3. The sample sampling method of claim 1, wherein the predetermined times comprise: same-period (year-on-year) times of the time to be sampled, and/or sequential (chain-ratio) times of the time to be sampled.

4. The sample sampling method of claim 3, wherein the same-period times of the time to be sampled comprise: the times corresponding to the time to be sampled in each of the previous N statistical periods, where N is a natural number greater than 0;

the sequential times of the time to be sampled comprise: M times that fall within the same statistical period as the time to be sampled and are earlier than the time to be sampled, where M is a natural number greater than 0.

5. The sample sampling method of claim 4, wherein the statistical period is one day.

6. The sample sampling method according to any one of claims 1 to 5, wherein the data to be sampled is negative sample data.

7. The sample sampling method according to any one of claims 1 to 5, wherein the data volume is traffic data.

8. A sample sampling device, comprising:

an acquisition module, configured to acquire scale factors and corresponding data volumes at a plurality of predetermined times;

a model generation module, configured to perform model training on the acquired scale factors and the corresponding data volumes to generate a scale factor prediction model;

a prediction module, configured to predict a scale factor according to the scale factor prediction model and a data volume;

a sampling module, configured to sample data using the predicted scale factor.

9. A server, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

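the memory stores instructions executable by the at least one processor, the instructions, when executed by the at least one processor, causing the at least one processor to perform the steps of:

acquiring scale factors and corresponding data volumes at a plurality of predetermined times;

performing model training on the acquired scale factors and the corresponding data volumes to generate a scale factor prediction model;

predicting a scale factor according to the scale factor prediction model and a data volume;

sampling data using the predicted scale factor.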

10. A non-volatile storage medium storing a computer-readable program for causing a computer to perform the sample sampling method of any one of claims 1 to 7.