
CN114169401B - Data processing, predictive model training methods and equipment - Google Patents

Data processing, predictive model training methods and equipment

Info

Publication number
CN114169401B
CN114169401B (granted from application CN202111350521.6A)
Authority
CN
China
Prior art keywords
sample
time
data
type
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111350521.6A
Other languages
Chinese (zh)
Other versions
CN114169401A (en)
Inventor
张腾
谭剑
李飞飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111350521.6A
Publication of CN114169401A
Application granted
Publication of CN114169401B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646: Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647: Migration mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


Embodiments of this application provide a data processing method, a prediction model training method, and corresponding devices. The method includes: determining feature information from target data and its access records; inputting the feature information into a prediction model to obtain time information about when the target data will be accessed in the future; the prediction model is trained on training samples, each comprising sample features, a sample time label, and a sample type, where the time label and type are determined by whether access records for the data corresponding to the sample features exist before and after a random moment within the sampling period; and identifying the target data as hot or cold based on the time information. By using the sample time label as the training label, and taking the model's predicted interval until the next access to the target data as the prediction result, hot and cold data identification is performed according to the size of that interval, which effectively improves the accuracy of hot/cold identification.

Description

Data processing and prediction model training method and equipment
Technical Field
The application relates to the field of computers, in particular to a method and equipment for data processing and predictive model training.
Background
With the rapid growth of data processing requirements, data storage costs have increased significantly. Stored data often exhibits clear cold and hot characteristics: data in some regions is accessed frequently, while data in other regions is rarely accessed. If a large amount of cold data occupies high-performance devices, storage resources are wasted.
In the prior art, cold and hot data are stored separately on different types of storage media with different storage modes. Before cold and hot data can be separated, the mixed data must be accurately identified. One class of approaches identifies cold and hot data by hand-crafted rules, such as LRU, LFU, LIRS, or Exponential Decay. Another class is based on machine learning, using a data item's access history to predict a future period during which it will not be accessed. However, the accuracy of the identification results obtained in these ways is relatively low, so a solution that improves the accuracy of cold and hot data identification is needed.
Disclosure of Invention
In order to solve or improve the problems existing in the prior art, the embodiments of the present application provide a method and apparatus for data processing and predictive model training.
In a first aspect, in one embodiment of the present application, a data processing method is provided. The method comprises the following steps:
determining characteristic information according to target data and access records of the target data;
inputting the feature information into a prediction model to obtain time information about when the target data will be accessed in the future, wherein the prediction model is trained on training samples, each comprising sample features, a sample time label, and a sample type, the sample time label and sample type being determined by whether access records for the data corresponding to the sample features exist before and after a random moment within the sample sampling period;
and carrying out cold and hot data identification on the target data according to the time information.
In a second aspect, in one embodiment of the present application, a predictive model training method is provided. The method comprises the following steps:
constructing a training sample, wherein the training sample comprises sample features, a sample time label, and a sample type, the sample time label and sample type being determined by whether access records for the data corresponding to the sample features exist before and after a random moment within the sample sampling period;
inputting the training sample into a prediction model to obtain a prediction result;
Optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type;
wherein the predictive model is used to identify cold and hot data.
In a third aspect, an embodiment of the application provides a non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the data processing method of the first aspect or the predictive model training method of the second aspect.
In a fourth aspect, in one embodiment of the application, an electronic device is provided that includes a memory and a processor, wherein,
The memory is used for storing programs;
The processor is coupled to the memory and is configured to execute the program stored in the memory, so as to implement a data processing method according to the first aspect or a predictive model training method according to the second aspect.
According to the technical solutions provided by the embodiments, the feature information of the target data is input into a pre-trained prediction model, which outputs the time information of when the target data will next be accessed, i.e., the time difference between that access and the current moment. So that the prediction model can predict this time information accurately, each constructed training sample comprises sample features, a sample time label, and a sample type, where the time label and type are determined by whether access records for the data corresponding to the sample features exist before and after a random moment within the sampling period. A prediction model trained on such samples can accurately identify the target data as cold or hot. In this scheme, the sample time label serves as the training label, the predicted interval until the next access to the target data serves as the prediction result, and cold/hot identification is performed according to the size of that interval, which effectively improves the accuracy of hot/cold identification of the target data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for constructing training samples according to an embodiment of the present application;
FIG. 3 is a schematic diagram of splitting a sampling period according to an embodiment of the present application;
FIG. 4 is a schematic diagram of constructing training samples based on an observation window according to an embodiment of the present application;
FIG. 5 is a flowchart of a training method for a prediction model according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for optimizing parameters of a prediction model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the process of matching a sample type against the correspondence according to an embodiment of the present application;
FIG. 8 is a schematic diagram of cold and hot data identification according to an embodiment of the present application;
FIG. 9 is a schematic flow chart of a predictive model training method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a hot and cold data identification system according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a prediction model training device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
It has been found during data storage that data often has clear hot and cold characteristics: data in one region is accessed relatively frequently, while data in another region is rarely accessed, or successive accesses occur at very long intervals (e.g., three days, one week, one month, or half a year). If a large amount of cold data occupies high-performance devices, storage resources are wasted. In the prior art, cold and hot data are stored separately on different types of storage media with different storage modes; before they can be separated, the mixed data must be accurately identified. Some schemes employ machine learning, using a data item's access history to predict a future period during which it will not be accessed, but the richness of the samples obtained is limited by the size of the sampled observation window (e.g., the window may cover only one day or one week of data). In general, a single sample is generated within a limited observation window, data outside the window cannot be used as training samples, and such a model cannot predict future accesses well. Therefore, a solution is needed that identifies cold and hot data without being limited by the observation window size or the number of features.
In order to enable those skilled in the art to better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings.
In some of the flows described in the description, the claims, and the figures above, a number of operations appear in a particular order, but these operations may be performed out of that order or concurrently. The sequence numbers of operations such as 101 and 102 merely distinguish the operations and do not by themselves imply any execution order. The flows may also include more or fewer operations, which may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein distinguish different messages, devices, modules, etc.; they do not imply an order, nor do they require that the "first" and "second" items be of different types.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application. The method comprises the following steps:
And 101, determining characteristic information according to the target data and the access record of the target data.
And 102, inputting the characteristic information into a prediction model to obtain time information of future accessed target data, wherein the prediction model is obtained through training of a training sample, the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of sample characteristic corresponding data exist before and after random time in a sample sampling period.
And 103, identifying the cold and hot data of the target data according to the time information.
The target data referred to here is data that requires cold and hot identification. The feature information includes access features of the target data, data related to the target data, semantic information of the database layer (such as /domain_name (data_size, SQL TEMPLATE)), and the like. The content included in the feature information may be increased or decreased as needed.
The prediction model may be, for example, a survival analysis model, trained in advance on training samples. The time information may be the interval between a future time point at which the target data is accessed and the current (or a designated) time point. The longer that interval, the lower the access frequency and the more likely the target data is classified as cold; conversely, the shorter the interval, the higher the access frequency and the more likely the data is classified as hot. Thus, when identifying cold and hot data, this method does not directly classify the target data; instead, it uses the interval between the future access time and the current time as the basis for distinguishing cold from hot data.
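Since the interval, not a direct class, is the basis for distinguishing cold from hot data, the final decision reduces to a threshold comparison. A minimal sketch, assuming a hypothetical cutoff value and function name that are not given in the patent:

```python
# Hypothetical sketch: deriving a cold/hot decision from the predicted
# time until the next access. HOT_THRESHOLD_HOURS is an assumed cutoff,
# not a value from the patent; it would be tuned per workload.
HOT_THRESHOLD_HOURS = 24.0

def classify(predicted_interval_hours: float) -> str:
    """A longer predicted interval means lower access frequency, hence cold."""
    return "cold" if predicted_interval_hours > HOT_THRESHOLD_HOURS else "hot"
```

Data classified as cold would then be migrated to cheaper media, while hot data stays on high-performance devices.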
The training samples are generated from the access records and related information of multiple target data over a historical period. Specifically, a training sample may be represented as (x_i, y_i, e_i), where x_i denotes the sample features (covariates, such as access features), y_i denotes the sample time label, and e_i denotes the sample type. It should be noted that the sample time label and sample type are determined by whether access records for the data corresponding to the sample features exist before and after a random moment within the sample sampling period; sample construction is illustrated below. Sample types here comprise two kinds: the deletion (censored) type and the non-deletion type.
The length of the sample sampling period is not limited; sampling may be based on a long period (e.g., one year or half a year). In actual sampling, the time range a training sample can cover is limited by the observation window, so sampling is performed by splitting the period into phases, as explained in the following examples.
In the training samples, the sample time label serves as the label when training the prediction model; the trained model can then output the time information as the prediction result for the target data.
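To make the notation concrete, the (x_i, y_i, e_i) triple can be modeled as a small container type. This is an illustrative sketch only: the field names are assumptions, and the sample type e_i is represented here as a boolean censored flag rather than the deletion/non-deletion wording:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    """One training sample (x_i, y_i, e_i) as described in the text."""
    features: List[float]  # x_i: covariates, e.g. recent access intervals
    time_label: float      # y_i: time to next access, or to the window end
    censored: bool         # e_i: True = deletion (censored) type,
                           #      False = non-deletion type

# an example sample: access observed 12 time units after the random moment
sample = TrainingSample(features=[0.5, 3.0, 7.5], time_label=12.0, censored=False)
```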
The scheme for constructing the training samples will be specifically described below.
Fig. 2 is a schematic flow chart of a method for constructing training samples according to an embodiment of the present application. As can be seen from fig. 2, constructing the training sample specifically includes the following steps:
201, acquiring access records of the sample data over a sampling period.
And 202, setting a random moment in the sampling period, and splitting the sampling period to obtain a characteristic extraction period and an observation window period.
And 203, generating sample characteristics based on the access records in the characteristic extraction period of the sample data set.
And 204, searching whether at least one access record for the sample data exists after the random moment.
And 205, when at least one access record aiming at the sample data exists, determining the sample time tag according to the random moment and the at least one access record, and setting the sample type as a non-deletion type.
And 206, when no access record for the sample data exists after the random moment, determining the sample time label according to the random moment and the end time of the observation window, and setting the sample type to the deletion type.
As noted above, the sampling period can span a wide time range; a long historical period can serve as the sampling period, and actual sampling may cover only part of it. In this application's technical scheme, the access frequencies of different cold and hot data are not identical, so when certain target data is accessed infrequently, its access records may be sparse and irregular (for example, three consecutive access intervals of one month, two months, and three months). To make full use of the data, the sampling period is split at a random moment (the pivot time in FIG. 3) into a feature extraction period (the history phase in FIG. 3) and an observation window period (the observation phase in FIG. 3). In practice there is a large amount of historical access data, and inserting a random moment can involve the access records of many sample data; for ease of understanding, the following embodiments take a single piece of target data as an example.
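The split at a random pivot described above can be sketched as follows. The 20%-80% placement of the pivot follows the trace-period rule given later in this text; the function name and the numeric time axis are assumptions for illustration:

```python
import random

def split_sampling_period(start: float, end: float,
                          lo: float = 0.2, hi: float = 0.8,
                          rng=None):
    """Pick a random pivot inside the middle [lo, hi] fraction of the
    sampling period and split the period into a feature extraction phase
    (history) and an observation window phase (observation)."""
    rng = rng or random
    pivot = start + (lo + (hi - lo) * rng.random()) * (end - start)
    return (start, pivot), (pivot, end)
```

Samples generated at several different pivots over the same trace would then be merged into one data set, as the text describes.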
FIG. 3 is a schematic diagram of splitting the sampling period according to an embodiment of the present application. As can be seen from FIG. 3:
The feature extraction period is used to obtain the access features of the sample data (for example, the ACCESS INTERVAL as a dynamic feature), which together with static features, such as the size of the file to which the data belongs and the table name, form the feature information; the longer this period, the richer the access features that can be extracted.
The observation window corresponding to the observation window period is used for marking the sample time label and the sample type during training. The longer this window, the less censored data there is; conversely, a shorter window increases the proportion of censored data.
Therefore, both the feature extraction period and the observation window period must be long enough to ensure the quality of the generated training samples. Moreover, sample data near the very beginning and very end of either period is unstable. For example, within the observation window period the random moment is generated only in the middle portion, excluding the initial 20% and final 20% of the window length, i.e., within the 20%-80% span of the trace period; the data sets corresponding to each random moment are then merged into the full data set.
Specifically, FIG. 4 is a schematic diagram of constructing training samples based on the observation window according to an embodiment of the present application. The access records in the sampling period in FIG. 4 occur at T0, T1, T2, T3, T4, and T5 in sequence, i.e., the sample data is accessed six times in the sampling period, at moments T0 through T5. The observation window starts at Ts and ends at Te. As can be seen from FIG. 4, T5 is not within the observation window; in other words, the access event at T5 is not observable. Although T5 is not observed, the sample data still has an access event at a future moment outside the window. Random moments may then be set uniformly or randomly within the observation window: the more random moments are set for the same sample data, the more training samples are obtained.
For example, the random moment T may be set between T0 and T1, between T1 and T2, between T2 and T3, between T3 and T4, or between T4 and T5. Since Te is less than T5 and greater than T4, if T is taken as the current observation moment and set between T4 and T5, the access event at T5 falls beyond the window and cannot be observed.
In practical application, the random moment is varied, and for each value it is checked whether at least one access record for the sample data exists afterwards; the sample time label and sample type are then determined from the search result. Specifically:
After the random moment is set, it is further determined whether at least one access record for the sample data can be found after it. For example:
If the random time T is set between T0 and T1, 4 access records can be observed in the observation window, and a training sample A1 is obtained.
If the random time T is set between T1 and T2, 3 access records can be observed in the observation window, and a training sample A2 is obtained.
If the random time T is set between T2 and T3, 2 access records can be observed in the observation window, and a training sample A3 is obtained.
If the random time T is set between T3 and T4, 1 access record can be observed in the observation window, and a training sample A4 is obtained.
At this time, the sample types corresponding to training samples A1, A2, A3, and A4 may each be set to the non-deletion type.
If the random time T is set between T4 and T5, no access records can be observed in the observation window (the access at T5 falls outside it), and a training sample A5 is obtained. At this time, the sample type corresponding to training sample A5 may be set to the deletion (censored) type.
Therefore, training sets of different sizes can be obtained by controlling the number of random moments; generally, the larger the training set, the better the algorithm metrics of the prediction model and the more accurate its predictions.
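The labeling rule illustrated above by samples A1-A5 can be collected into a single function: given the access times observable inside the window, a random moment, and the window end, it returns the time label and whether the sample is of the deletion (censored) type. The function and variable names are assumptions:

```python
from typing import List, Tuple

def label_sample(window_accesses: List[float], t: float,
                 window_end: float) -> Tuple[float, bool]:
    """Return (time_label, censored) for one random moment t.

    window_accesses: access times observable within the observation window,
    sorted ascending. If an access is found after t, the label is the first
    time difference (non-deletion type); otherwise it is the second time
    difference, up to the window end (deletion type)."""
    later = [a for a in window_accesses if a > t]
    if later:
        return later[0] - t, False  # like A1..A4: y = next access - t
    return window_end - t, True     # like A5: y = Te - t, censored
```

With observable accesses [T0..T4] = [1, 2, 3, 4, 5] and Te = 6, a pivot t = 2.5 yields (0.5, False), matching the non-deletion rule, while t = 5.5 yields (0.5, True), matching sample A5's censored case.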
The manner of determining the sample time label corresponding to step 205 will now be illustrated. Specifically: if the last access time within the observation window is later than the random moment, the first time difference, i.e., the difference between the random moment and the first access time after it, is recorded as the sample time label, and the sample type is marked as the non-deletion type.
In practice, when a random moment is set, the sample time label must also be computed from it to serve as the label of the training sample. Because an access event can be observed after the random moment within the observation window, i.e., an access record exists and no data censoring occurs, the sample type of the training sample is set to the non-deletion type, and the corresponding sample time label is the first time difference between the random moment and the first access time after it. Continuing the example above: training sample A1 has sample time label y1 = T1 - T, A2 has y2 = T2 - T, A3 has y3 = T3 - T, and A4 has y4 = T4 - T.
The manner of determining the sample time label corresponding to step 206 will now be illustrated. Specifically: if the last access time within the observation window period is earlier than the random moment, the second time difference, i.e., the difference between the window end time and the random moment, is recorded as the sample time label, and the sample type is marked as the deletion type.
In practice, when a random moment is set, the sample time label must also be computed from it to serve as the label of the training sample. Because no access event can be observed after the random moment within the observation window, i.e., no access record exists and the data is censored, the sample type of the training sample is set to the deletion type, and the corresponding sample time label is the second time difference between the window end time and the random moment. Continuing the example, training sample A5 has sample time label y5 = Te - T.
Based on the above embodiment, unobserved access records are fully utilized when constructing training samples: they enter the samples either as a first time difference with the non-deletion type or as a second time difference with the deletion (censored) type. Although the observation window length is limited, no samples are lost when data is censored (i.e., when some access records of the data are not observed). A prediction model trained on such comprehensive samples can better predict the time of the next access after the current observation moment, improving the survival analysis model's prediction of future accesses. The scheme can be applied to a database or a cluster to identify cold and hot data on each node, so that the data is stored separately according to the identification result; during prediction, the time of the next access output by the model is used to judge whether the data is cold or hot.
After the training samples are obtained by the above embodiments, the predictive model may be trained. The present invention will be specifically illustrated with reference to examples.
Fig. 5 is a flowchart of a training method of a prediction model according to an embodiment of the present application. From fig. 5, it can be seen that the method specifically comprises the following steps:
501, constructing training samples.
502, inputting the training samples into a prediction model to obtain a prediction result.
503, optimizing parameters of the prediction model according to the prediction result, the sample time label, and the sample type, wherein the prediction model is used to identify cold and hot data.
As described above, a training sample comprises (x_i, y_i, e_i), where x_i denotes the sample features (covariates such as access features), y_i denotes the sample time label, and e_i denotes the sample type. In constructing training samples, different random moments may be set on the same sample data to obtain a group of training samples, and a training sample set can further be obtained from multiple sample data. This set is used to train the prediction model (e.g., a survival analysis model), with y_i as the training label. Although a comprehensive training sample set helps train the model well, the model still needs to be optimized continually on those samples during training. The specific optimization process is as follows:
fig. 6 is a schematic flow chart of a method for optimizing parameters of a prediction model according to an embodiment of the present application. From fig. 6, it can be seen that the method specifically comprises the following steps:
and 601, determining the corresponding relation between the prediction result and the sample time tag.
And 602, optimizing parameters in the prediction model according to a matching result of the sample type and the corresponding relation.
The prediction result here is the time information, i.e., the time difference between the next access and the current time. The correspondence between the prediction result and the sample time tag is specifically either that the time information corresponding to the prediction result is smaller than the sample time tag or that it is larger. The sample types include an uncensored type and a censored type. The specific determination of the matching result is illustrated in the following embodiment with reference to fig. 7.
As described above, the training samples include a sample type in addition to the sample time tag. Step 602 will be described in detail below with reference to the accompanying drawings. Fig. 7 is a schematic diagram of the matching process between sample types and correspondences provided in an embodiment of the present application. As can be seen from fig. 7:
701, if the sample type is the uncensored type and the correspondence is that the time information corresponding to the prediction result is smaller than the sample time tag, determine that the sample type matches the correspondence.
For example, let training sample Y6 be (x_1, y_1, e_1), with y_1 = ty1 and e_1 = 0 (indicating the uncensored type). The prediction result is tx, and comparison shows that tx is smaller than ty1. In other words, the prediction is that an access event for the target data will occur a time difference tx after the current time t. Since tx is smaller than ty1, the event is observable and no censoring occurs, which matches the sample type (uncensored). That is, the sample type matches the correspondence.
702, if the sample type is the censored type and the correspondence is that the time information corresponding to the prediction result is larger than the sample time tag, determine that the sample type matches the correspondence.
For example, let training sample Y7 be (x_2, y_2, e_2), with y_2 = ty2 and e_2 = 1 (indicating the censored type). The prediction result is tx, and comparison shows that tx is larger than ty2. In other words, the prediction is that an access event for the target data will occur a time difference tx after the current time t. Since tx is larger than ty2, the event is not observable and censoring occurs, which matches the sample type (censored). That is, the sample type matches the correspondence.
703, if the sample type is the uncensored type and the correspondence is that the time information corresponding to the prediction result is larger than the sample time tag, determine that the sample type does not match the correspondence.
For example, let training sample Y6 be (x_1, y_1, e_1), with y_1 = ty1 and e_1 = 0 (indicating the uncensored type). The prediction result is tx, and comparison shows that tx is larger than ty1. In other words, the prediction is that an access event for the target data will occur a time difference tx after the current time t. Since tx is larger than ty1, the event is not observable and censoring occurs, which does not match the sample type (uncensored). That is, the sample type does not match the correspondence.
704, if the sample type is the censored type and the correspondence is that the time information corresponding to the prediction result is smaller than the sample time tag, determine that the sample type does not match the correspondence.
For example, let training sample Y7 be (x_2, y_2, e_2), with y_2 = ty2 and e_2 = 1 (indicating the censored type). The prediction result is tx, and comparison shows that tx is smaller than ty2. In other words, the prediction is that an access event for the target data will occur a time difference tx after the current time t. Since tx is smaller than ty2, the event is observable and no censoring occurs, which does not match the sample type (censored). That is, the sample type does not match the correspondence.
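The four cases 701 to 704 reduce to a single rule. The following sketch (illustrative names; the 0/1 encoding of the sample type is an assumption consistent with e_1 = 0 and e_2 = 1 above) expresses it:

```python
# Matching rule for cases 701-704: a prediction tx matches a sample
# (y, e) when the censoring status it implies agrees with e.
# e = 0 -> uncensored, e = 1 -> censored.

def prediction_matches(tx, y, e):
    if e == 0:         # uncensored: case 701 (match) vs. 703 (mismatch)
        return tx < y
    return tx > y      # censored: case 702 (match) vs. 704 (mismatch)

assert prediction_matches(tx=8, y=10, e=0)       # case 701
assert prediction_matches(tx=15, y=10, e=1)      # case 702
assert not prediction_matches(tx=15, y=10, e=0)  # case 703
assert not prediction_matches(tx=8, y=10, e=1)   # case 704
```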
As an alternative embodiment for optimizing the prediction model, the quality of the prediction model can be measured by its prediction results during optimization:
Due to the presence of censored data, the c-index (concordance index) is typically used to measure the effect of the prediction model. The c-index is the proportion of concordant pairs, i.e., pairs whose predicted order agrees with their actual order, among all useful sample pairs.
The calculation steps are as follows:
1. Pair all training samples to obtain sample pairs. For example, n samples generate n(n-1)/2 sample pairs;
2. If the sample with the smaller sample time in a pair is of the censored type (meaning its next access was not observed), or both samples in the pair are censored, the pair is invalid and is excluded; the remaining pairs are useful pairs;
3. Count the pairs among the useful pairs in which the predicted result agrees with the actual result, i.e., the individual predicted to have the longer sample time indeed has the longer actual sample time.
c-index = number of concordant pairs / number of useful pairs. The c-index ranges between 0 and 1; the closer it is to 1, the stronger the model's ability to distinguish hot and cold data. During training and optimization, training samples are used for continuous training so that the c-index of the prediction model approaches 1.
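The calculation steps above can be sketched as follows (a simplified implementation; the tie handling and the example data are illustrative assumptions, not from the patent):

```python
# Minimal c-index: pair all samples, drop pairs whose smaller-time sample
# is censored (or whose times are tied), and count concordant predictions.
from itertools import combinations

def c_index(times, types, preds):
    """types: 0 = uncensored, 1 = censored (next access not observed)."""
    concordant = useful = 0
    for i, j in combinations(range(len(times)), 2):
        # order so that sample a has the smaller observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or types[a] == 1:
            continue  # invalid pair: smaller time censored (or tied)
        useful += 1
        if preds[a] < preds[b]:  # model also ranks a before b
            concordant += 1
    return concordant / useful

times = [5, 12, 20, 30]
types = [0, 0, 1, 0]      # the third sample is censored
preds = [4, 10, 25, 18]
print(c_index(times, types, preds))  # -> 1.0 (all useful pairs concordant)
```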
After the prediction model outputs the time information, hot and cold data identification is further performed on the target data based on the time information, as described in detail below.
Fig. 8 is a schematic diagram of cold and hot data identification according to an embodiment of the present application. From fig. 8, it can be seen that the method specifically comprises the following steps:
801, acquire the time information corresponding to each target data.
802, perform hot and cold data identification on the target data according to the size of the time information.
As described in step 802, the comparison may be performed by sorting based on the size of the time information. For example, 4 pieces of time information are obtained: tx1 = 10 minutes, tx2 = 20 minutes, tx3 = 30 minutes, and tx4 = 40 minutes. The sort order may be ascending or descending; assume ascending order gives tx1, tx2, tx3, tx4. The first 50% (or the first 25%) may then be classified as hot data and the remainder as cold data. The specific proportion can be set according to the actual situation (e.g., the size of the hot-data storage space); this is merely illustrative and does not limit the present application.
In addition to sorting, the time information may be compared with a threshold. The method is as follows:
target data whose time information is greater than a first time threshold is marked as cold data and stored in a first storage medium with lower access performance;
target data whose time information is less than or equal to the first time threshold is marked as hot data and stored in a second storage medium with higher access performance.
For example, 4 pieces of time information are obtained: tx1 = 10 minutes, tx2 = 20 minutes, tx3 = 30 minutes, and tx4 = 40 minutes. Assuming the first time threshold is 25 minutes, the target data corresponding to tx3 and tx4 is cold data because both exceed 25 minutes, and the target data corresponding to tx1 and tx2 is hot data because both are below 25 minutes.
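A minimal sketch of the threshold comparison, using the 25-minute threshold from the example (variable names and the dictionary layout are illustrative assumptions):

```python
# Threshold-based hot/cold marking, following the 25-minute example.
FIRST_TIME_THRESHOLD = 25  # minutes

def classify(time_info):
    """Time info above the threshold -> cold; at or below -> hot."""
    return "cold" if time_info > FIRST_TIME_THRESHOLD else "hot"

predictions = {"tx1": 10, "tx2": 20, "tx3": 30, "tx4": 40}
tiers = {name: classify(t) for name, t in predictions.items()}
print(tiers)  # tx1/tx2 -> hot, tx3/tx4 -> cold
```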
In practical applications, the storage state of hot and cold data can be adjusted using the time information output by the prediction model. Specifically, a storage timer is maintained for the cold data. The remaining time is determined from the difference between the timer and the time information corresponding to the cold data, and when the remaining time is less than a second time threshold, the cold data is migrated from the first storage medium to the second storage medium.
For example, after the prediction model outputs time information of, say, 24 hours, the target data is identified as cold data and stored in the first storage medium (e.g., an HDD medium). A storage timer is started for the cold data. As time passes and the timer reaches 23 hours, the difference between the time information and the timer (i.e., the remaining time) is only 1 hour. With the second time threshold set to 1 hour, once the remaining time falls below it (e.g., 59 minutes), the target data corresponding to the cold data is about to be accessed. To increase the access speed of the target data, the cold data can be migrated from the first storage medium to the second storage medium used for hot data (e.g., an SSD medium).
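The remaining-time check can be sketched as follows (units, the threshold value and the tier names are assumptions taken from the example above):

```python
# Timed migration check: cold data whose predicted next access is near
# triggers a move back to the fast tier. Units are hours.
SECOND_TIME_THRESHOLD = 1.0  # hours

def should_migrate(predicted_interval, elapsed_since_stored):
    """True when the remaining time drops below the second threshold."""
    remaining = predicted_interval - elapsed_since_stored
    return remaining < SECOND_TIME_THRESHOLD

# Predicted next access 24 h after the data was stored as cold:
assert not should_migrate(24, 20)  # 4 h remaining -> stay on the HDD tier
assert should_migrate(24, 23.5)    # 0.5 h remaining -> move to the SSD tier
```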
Based on the same idea, the embodiment of the application further provides a prediction model training method. Fig. 9 is a schematic flowchart of a prediction model training method according to an embodiment of the application. From fig. 9, it can be seen that the method specifically comprises the following steps:
901, construct a training sample, wherein the training sample includes sample features, a sample time tag and a sample type, and the sample time tag and the sample type are determined by whether access records of the data corresponding to the sample features exist before and after a random time in the sample sampling period.
902, input the training sample into a prediction model to obtain a prediction result.
903, optimize parameters in the prediction model according to the prediction result, the sample time tag and the sample type, wherein the prediction model is used to identify hot and cold data.
The optimizing process of the parameters in the prediction model in step 903 is as follows, determining the corresponding relation between the prediction result and the sample time tag, and optimizing the parameters in the prediction model according to the matching result of the sample type and the corresponding relation.
The process of constructing the training sample in step 901 includes the following steps:
Acquiring access records of sample data within a sampling period;
setting a random time within the sampling period to split the sampling period into a feature extraction period and an observation window period;
generating sample features based on the access records of the sample data within the feature extraction period;
searching whether there is at least one access record for the sample data after the random time;
when there is at least one access record for the sample data, determining the sample time tag according to the random time and the at least one access record, and setting the sample type to an uncensored type; and
when there is no access record for the sample data, determining the sample time tag according to the random time and the end time of the observation window, and setting the sample type to a censored type.
In particular, reference may be made to the respective embodiments corresponding to fig. 1 to 8, and the detailed description will not be repeated here.
For ease of understanding, the overall process of hot and cold data identification is illustrated below, taking a survival analysis model as the prediction model. Fig. 10 is a schematic diagram of a hot and cold data identification system according to an embodiment of the application. As can be seen from fig. 10, the system includes a survival analysis server side and an application side. The survival analysis server side includes an object storage service (Object Storage Service, OSS), a relational database service (Relational Database Service, RDS) and a survival analysis model. A lightweight survival analysis model can be built based on algorithms such as Cox and RSF; such a model is simpler than neural-network survival analysis algorithms and achieves a higher model index (c-index). Historical data of the application side is used as sample data, and training samples are generated according to the embodiments corresponding to figs. 1 to 9 to train the survival analysis model, after which the trained model is optimized. The target data to be identified can then be classified and stored as hot or cold data according to the nodes in the shared storage.
Two functions are important in survival analysis. One is the survival function (Survival function), i.e., the probability that the event has not occurred before time t: S(t) = Pr(T >= t). The other is the hazard function, which describes the instantaneous rate at which the event occurs at time t given survival up to t:
h(t) = lim_{Δt→0} Pr(t <= T < t + Δt | T >= t) / Δt
Survival-analysis-based models all fit the hazard function. Taking the Cox model as an example, the Cox model assumes that the log-hazard is a linear function of the covariates (i.e., the features):
h(t, X) = h0(t) exp(β1X1 + β2X2 + ... + βkXk)
where X = (X1, X2, X3, ..., Xk) are the k risk factors affecting the survival time t.
The maximum likelihood estimate of β can then be obtained by establishing the partial likelihood function (partial likelihood) of the Cox hazards model, taking the logarithm of both sides of the partial likelihood function, and setting the partial derivatives with respect to β to zero.
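As an illustration of the partial likelihood, the following sketch evaluates the Cox partial log-likelihood for a single covariate and finds the maximizing beta by a crude grid search instead of the partial-derivative solution (all data are made up; the event coding follows the sample type e above, with 1 meaning censored):

```python
# Cox partial log-likelihood for one covariate, written out directly
# from the formula above. events: 0 = uncensored, 1 = censored.
import math

def partial_log_likelihood(beta, times, events, x):
    ll = 0.0
    for i in range(len(times)):
        if events[i] == 1:
            continue  # censored samples enter only through risk sets
        # risk set: samples still "alive" (not yet accessed) at times[i]
        risk = [math.exp(beta * x[j])
                for j in range(len(times)) if times[j] >= times[i]]
        ll += beta * x[i] - math.log(sum(risk))
    return ll

times, events, x = [2, 4, 6, 8], [0, 1, 0, 0], [0.5, 1.0, 1.5, 2.0]
# A crude grid search over beta in [-3, 3] stands in for the
# partial-derivative solution:
best = max((partial_log_likelihood(b / 10, times, events, x), b / 10)
           for b in range(-30, 31))
print(best[1])  # beta maximizing the partial likelihood on this toy data
```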
We turn the prediction of hot and cold data into a regression problem, i.e., predicting the time interval until the data's next access (Time-to-event prediction). Because the observation window over the historical information is limited, a large amount of censored data arises: the next access of some samples is not observed within the observation window, but this does not mean that those samples are never accessed after the window. Traditional machine learning cannot make use of such samples. Survival analysis is naturally suited to handling censored data, so problems with limited observation windows, such as hot/cold data identification, are well suited to survival-analysis modeling, and good results are obtained by fitting the CHF (cumulative hazard function).
Based on the same idea, the embodiment of the application further provides a data processing apparatus. Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus includes:
The determining module 1101 is configured to determine feature information according to the target data and the access record of the target data.
The input module 1102 is configured to input the feature information into a prediction model to obtain time information of future access to the target data, wherein the prediction model is trained with training samples, a training sample including sample features, a sample time tag and a sample type, the sample time tag and the sample type being determined by whether access records of the data corresponding to the sample features exist before and after a random time in the sample sampling period.
And the identification module 1103 is configured to identify the cold and hot data of the target data according to the time information.
Optionally, the system further comprises a training module 1104 for constructing a training sample, inputting the training sample into a prediction model to obtain a prediction result, and optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type, wherein the prediction model is used for identifying cold and hot data.
Optionally, the training module 1104 is further configured to determine a correspondence between a prediction result and the sample time tag;
And optimizing parameters in the prediction model according to a matching result of the sample type and the corresponding relation.
Optionally, the training module 1104 is further configured to determine that the sample type matches the correspondence if the sample type is the uncensored type and the correspondence is that the time information corresponding to the prediction result is smaller than the sample time tag;
and to determine that the sample type matches the correspondence if the sample type is the censored type and the correspondence is that the time information corresponding to the prediction result is larger than the sample time tag.
Optionally, the apparatus further includes a sample construction module 1105, configured to acquire access records of sample data within a sampling period; set a random time within the sampling period to split the sampling period into a feature extraction period and an observation window period; generate sample features based on the access records of the sample data within the feature extraction period; search whether there is at least one access record for the sample data after the random time; when there is at least one access record for the sample data, determine the sample time tag according to the random time and the at least one access record, and set the sample type to an uncensored type; and when there is no access record for the sample data, determine the sample time tag according to the random time and the end time of the observation window, and set the sample type to a censored type.
Optionally, the sample construction module 1105 is further configured to adjust the random time;
and to search whether there is at least one access record for the sample data after the adjusted random time, so as to determine the sample time tag and the sample type according to the search result.
Optionally, the sample construction module 1105 is further configured to, if the last access time within the observation window is later than the random time, mark a first time difference between the random time and the first access time after the random time as the sample time tag, and mark the target event type as an uncensored event.
Optionally, the sample construction module 1105 is further configured to, if the last access time within the observation window period is earlier than the random time, mark a second time difference between the window end time and the random time as the sample time tag, and mark the target event type as a censored event.
Optionally, the identification module 1103 is further configured to acquire the time information corresponding to each target data, and to perform hot and cold data identification on the target data according to the size of the time information.
Optionally, the identification module 1103 is further configured to mark target data whose time information is greater than a first time threshold as cold data and store it in a first storage medium with lower access performance, and to mark target data whose time information is less than or equal to the first time threshold as hot data and store it in a second storage medium with higher access performance.
Optionally, the system further comprises a migration module, wherein the migration module is used for carrying out storage timing on the cold data, determining the residual time based on the difference value between the storage timing and the time information corresponding to the cold data, and migrating the cold data from the first storage medium to the second storage medium when the residual time is smaller than a second time threshold.
In one embodiment of the application, a non-transitory machine-readable storage medium is provided, having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the data processing method described in figs. 1 to 8. See the embodiments described above for details.
The embodiment of the application also provides electronic equipment. Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 1201, a processor 1202 and a communication component 1203, wherein,
The memory 1201 is used for storing a program;
The processor 1202, coupled to the memory, is configured to execute the program stored in the memory for:
determining characteristic information according to target data and access records of the target data;
inputting the feature information into a prediction model to obtain time information of future access to the target data, wherein the prediction model is trained with training samples, a training sample including sample features, a sample time tag and a sample type, the sample time tag and the sample type being determined by whether access records of the data corresponding to the sample features exist before and after a random time in the sample sampling period;
and carrying out cold and hot data identification on the target data according to the time information.
The memory 1201 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
Further, the processor 1202 in this embodiment may be specifically a programmable switching processing chip, where a data replication engine is configured to replicate received data.
The processor 1202 may perform other functions in addition to the above functions when executing programs in memory, as described in detail in the foregoing embodiments. Further, as shown in FIG. 12, the electronic device also includes other components, such as a power supply component 1204.
Based on the same idea, the embodiment of the application further provides a prediction model training apparatus. Fig. 13 is a schematic structural diagram of a prediction model training apparatus according to an embodiment of the present application. The prediction model training apparatus includes:
The sample construction module 1301 is configured to construct a training sample, where the training sample includes a sample feature, a sample time tag, and a sample type, and the sample time tag and the sample type are determined by an access record of whether there is data corresponding to the sample feature before and after a random time in a sample sampling period.
The input module 1302 is configured to input the training samples into a prediction model to obtain a prediction result.
And the optimizing module 1303 is used for optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type, wherein the prediction model is used for identifying cold and hot data.
Optionally, the optimizing module 1303 is further configured to determine a correspondence between a prediction result and the sample time tag, and optimize parameters in the prediction model according to a matching result of the sample type and the correspondence.
Optionally, the sample construction module 1301 is further configured to acquire access records of sample data within a sampling period; set a random time within the sampling period to split the sampling period into a feature extraction period and an observation window period; generate sample features based on the access records of the sample data within the feature extraction period; search whether there is at least one access record for the sample data after the random time; when there is at least one access record for the sample data, determine the sample time tag according to the random time and the at least one access record, and set the sample type to an uncensored type; and when there is no access record for the sample data, determine the sample time tag according to the random time and the end time of the observation window, and set the sample type to a censored type.
In one embodiment of the application, a non-transitory machine-readable storage medium is provided, having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the prediction model training method described in fig. 9. See the embodiments described above for details.
The embodiment of the application further provides an electronic device, which may be a node device in a computing unit. Fig. 14 is a schematic structural diagram of another electronic device according to an embodiment of the present application. The electronic device includes a memory 1401, a processor 1402 and a communication component 1403, wherein,
The memory 1401 for storing a program;
the processor 1402, coupled to the memory, is configured to execute the program stored in the memory for:
Constructing a training sample, wherein the training sample comprises sample characteristics, a sample time tag and a sample type, and the sample time tag and the sample type are determined by whether access records of sample characteristic corresponding data exist before and after random time in a sample sampling period;
inputting the training sample into a prediction model to obtain a prediction result;
Optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type;
wherein the predictive model is used to identify cold and hot data.
The memory 1401 described above may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
Further, the processor 1402 in this embodiment may be specifically a programmable switching processing chip, where a data replication engine is configured in the programmable switching processing chip, and can replicate the received data.
The processor 1402 may implement functions other than the above when executing programs in a memory, and the above description of the embodiments can be specifically referred to. Further, as shown in FIG. 14, the electronic device also includes a power supply component 1404 and other components.
Based on the above embodiments, the feature information of the target data is input into a pre-trained prediction model, and the time information of when the target data will be accessed in the future, that is, the time difference between the access time and the current time, can be obtained through the prediction model. To enable the prediction model to accurately predict future access times, the constructed training samples include sample features, a sample time tag and a sample type, where the sample time tag and the sample type are determined by whether access records of the data corresponding to the sample features exist before and after a random time in the sample sampling period. A prediction model trained with such samples can accurately perform hot and cold data identification on the target data. In this scheme, the sample time tag is used as the label of the training sample to train the prediction model, the predicted time interval until the next access of the target data is used as the prediction result, and hot and cold data identification is performed according to the size of this interval, which effectively improves the accuracy of hot and cold identification of the target data.
It should be noted that, the data processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principles of the foregoing modules or units may refer to corresponding contents in the foregoing method embodiments, which are not repeated herein.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims (14)

1. A data processing method, the method comprising:
determining feature information according to target data and access records of the target data;
inputting the feature information into a survival-analysis-based model to obtain time information indicating when the target data will be accessed in the future; wherein the survival-analysis-based model is obtained by training on training samples, each training sample comprising sample features generated from access records of sample data before a random time within a sample sampling period, and a sample time label and a sample type determined by whether the sample data has any access record after the random time within the sample sampling period, the sample type being either an uncensored type or a censored type; and
performing hot and cold data identification on the target data according to the time information.

2. The method according to claim 1, wherein training the survival-analysis-based model comprises:
constructing training samples;
inputting the training samples into the survival-analysis-based model to obtain prediction results; and
optimizing parameters of the survival-analysis-based model according to the prediction results, the sample time labels, and the sample types;
wherein the survival-analysis-based model is used to identify hot and cold data.

3. The method according to claim 2, wherein optimizing the parameters of the survival-analysis-based model according to the prediction results, the sample time labels, and the sample types comprises:
determining a correspondence between a prediction result and the sample time label; and
optimizing the parameters of the survival-analysis-based model according to a matching result between the sample type and the correspondence.

4. The method according to claim 3, wherein the matching result between the sample type and the correspondence comprises:
if the sample type is the uncensored type, and the correspondence is that the time information corresponding to the prediction result is less than the sample time label, determining that the sample type matches the correspondence; and
if the sample type is the censored type, and the correspondence is that the time information corresponding to the prediction result is greater than the sample time label, determining that the sample type matches the correspondence.

5. The method according to claim 2, wherein constructing the training samples comprises:
obtaining access records of the sample data within a sampling period;
setting a random time within the sampling period to split the sampling period into a feature extraction period and an observation window period;
generating sample features based on the access records of the sample data within the feature extraction period;
searching for at least one access record of the sample data after the random time;
when there is at least one access record of the sample data, determining the sample time label according to the random time and the at least one access record, and setting the sample type to the uncensored type; and
when there is no access record of the sample data, determining the sample time label according to the random time and an end time of the observation window, and setting the sample type to the censored type.

6. The method according to claim 5, further comprising:
adjusting the random time; and
searching for at least one access record of the sample data after the adjusted random time, so as to determine a sample time label and a sample type according to the search result.

7. The method according to claim 5, wherein determining the sample time label according to the random time and the at least one access record comprises:
if the last access time within the observation window is later than the random time, marking a first time difference between the random time and the first access time after the random time as the sample time label, and marking the target event type as an uncensored event.

8. The method according to claim 5, wherein determining the sample time label according to the random time and the end time of the observation window comprises:
if the last access time within the observation window period is earlier than the random time, marking a second time difference between the end time of the observation window and the random time as the sample time label, and marking the target event type as a censored event.

9. The method according to claim 1, wherein performing hot and cold data identification on the target data according to the time information comprises:
obtaining the time information respectively corresponding to the target data; and
identifying the target data as hot or cold data according to the magnitude of the time information.

10. The method according to claim 1 or 9, further comprising:
marking the target data whose time information is greater than a first time threshold as cold data, and storing that target data in a first storage medium supporting low access performance; and
marking the target data whose time information is less than or equal to the first time threshold as hot data, and storing that target data in a second storage medium supporting high access performance.

11. The method according to claim 10, further comprising:
performing storage timing on the cold data;
determining a remaining time based on the difference between the storage timing and the time information corresponding to the cold data; and
migrating the cold data from the first storage medium to the second storage medium when the remaining time is less than a second time threshold.

12. A survival-analysis-based model training method, comprising:
constructing training samples, wherein each training sample comprises sample features generated from access records of sample data before a random time within a sample sampling period, and a sample time label and a sample type determined by whether the sample data has any access record after the random time within the sample sampling period, the sample type being either an uncensored type or a censored type;
inputting the training samples into a survival-analysis-based model to obtain prediction results; and
optimizing parameters of the survival-analysis-based model according to the prediction results, the sample time labels, and the sample types;
wherein the survival-analysis-based model is used to identify hot and cold data.

13. A non-transitory machine-readable storage medium storing executable code which, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1 to 11, or the method of claim 12.

14. An electronic device, comprising a memory and a processor, wherein:
the memory is configured to store a program; and
the processor, coupled to the memory, is configured to execute the program stored in the memory to implement the method of any one of claims 1 to 11, or the method of claim 12.
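As an illustrative aside (not part of the patent text), the sample-construction procedure described in claims 5, 7, and 8 can be sketched in Python. All names here (`build_sample`, `Sample`, the two toy features) are hypothetical, and the patent does not specify concrete features or data structures; this is a minimal sketch of the random-split and censoring-label logic only:

```python
from dataclasses import dataclass
import random

@dataclass
class Sample:
    features: list        # derived from accesses before the random time
    time_label: float     # time to next access, or remaining observation time
    censored: bool        # True = censored type, False = uncensored type

def build_sample(access_times, window_start, window_end, t_random=None):
    """Split the sampling period at a random time and label the sample.

    access_times: sorted access timestamps within the sampling period.
    If t_random is not given, it is drawn uniformly from the period.
    """
    if t_random is None:
        t_random = random.uniform(window_start, window_end)
    past = [t for t in access_times if t <= t_random]
    future = [t for t in access_times if t > t_random]
    # Toy features: access count and recency before the random time
    # (the patent leaves the concrete feature design open).
    features = [float(len(past)), (t_random - past[-1]) if past else -1.0]
    if future:
        # Uncensored: label is the gap to the first access after t_random.
        return Sample(features, future[0] - t_random, censored=False)
    # Censored: no later access observed within the window; label is the
    # remaining length of the observation window.
    return Sample(features, window_end - t_random, censored=True)
```

Fixing `t_random` makes the labeling deterministic for inspection; in training, redrawing it (claim 6) yields multiple samples from one access trace.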
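The matching rule of claims 3 and 4 can likewise be sketched (again hypothetically, since the claims do not fix a loss function): an uncensored sample is consistent with a prediction that falls below its time label, while a censored sample, whose true next access lies beyond the observed window, is consistent with a prediction above it:

```python
def matches(pred_time, time_label, censored):
    """Claims 3-4: does the prediction 'match' the sample's label and type?

    Uncensored: predicted time-to-access should be less than the label.
    Censored: predicted time-to-access should exceed the label, because the
    real next access was never observed within the window.
    """
    if not censored:
        return pred_time < time_label
    return pred_time > time_label
```

A training loop (claim 12) could penalize predictions for which `matches` is false, though the patent does not commit to a specific optimization objective.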
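Finally, the tiering logic of claims 10 and 11 reduces to two threshold comparisons. The function names and threshold parameters below are illustrative, not from the patent:

```python
def classify(time_info, first_threshold):
    """Claim 10: data predicted to stay idle beyond the first threshold is
    cold (slow storage); otherwise it is hot (fast storage)."""
    return "cold" if time_info > first_threshold else "hot"

def should_migrate_back(stored_for, time_info, second_threshold):
    """Claim 11: once cold data's predicted next access is close (remaining
    time below the second threshold), migrate it back to fast storage."""
    remaining = time_info - stored_for
    return remaining < second_threshold
```

This pre-emptive migration is the practical payoff of predicting a time-to-next-access rather than a binary hot/cold label: the system can move data back to the fast tier shortly before the access is expected, instead of after a cache miss.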
CN202111350521.6A 2021-11-15 2021-11-15 Data processing, predictive model training methods and equipment Active CN114169401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111350521.6A CN114169401B (en) 2021-11-15 2021-11-15 Data processing, predictive model training methods and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111350521.6A CN114169401B (en) 2021-11-15 2021-11-15 Data processing, predictive model training methods and equipment

Publications (2)

Publication Number Publication Date
CN114169401A CN114169401A (en) 2022-03-11
CN114169401B true CN114169401B (en) 2026-01-02

Family

ID=80479132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111350521.6A Active CN114169401B (en) 2021-11-15 2021-11-15 Data processing, predictive model training methods and equipment

Country Status (1)

Country Link
CN (1) CN114169401B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115102779B (en) * 2022-07-13 2023-11-07 中国电信股份有限公司 Prediction model training and access request decision method, device and medium
CN116662638B (en) * 2022-09-06 2024-04-12 荣耀终端有限公司 Data collection method and related device
CN115494959B (en) * 2022-11-15 2023-02-28 四川易景智能终端有限公司 Multifunctional intelligent helmet and management platform thereof
CN116522158A (en) * 2023-04-28 2023-08-01 中国农业银行股份有限公司 Data cold and hot state prediction method, device, electronic equipment and storage medium
CN116932549A (en) * 2023-07-21 2023-10-24 企知道科技有限公司 Intelligent model-based platform data storage method, system, medium and equipment
CN117076523B (en) * 2023-10-13 2024-02-09 华能资本服务有限公司 Local data time sequence storage method
CN119513727B (en) * 2024-12-03 2025-08-19 北京大学第三医院(北京大学第三临床医学院) Sequential access prediction method based on multi-task fusion access interval and related equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN112118143A (en) * 2020-11-18 2020-12-22 迈普通信技术股份有限公司 Traffic prediction model, training method, prediction method, device, apparatus, and medium
CN113064930A (en) * 2020-12-29 2021-07-02 中国移动通信集团贵州有限公司 Cold and hot data identification method, device and electronic equipment for data warehouse

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US9779213B2 (en) * 2008-07-25 2017-10-03 Fundacao D. Anna Sommer Champalimaud E Dr. Carlos Montez Champalimaud System for evaluating a pathological stage of prostate cancer
US20140200951A1 (en) * 2013-01-11 2014-07-17 International Business Machines Corporation Scalable rule logicalization for asset health prediction
JP6626260B2 (en) * 2015-03-18 2019-12-25 株式会社東芝 Air conditioning control device, control method and program
CN205334185U (en) * 2016-02-01 2016-06-22 吴云辉 Water on -line monitoring and automatic control system
CN110705592B (en) * 2019-09-03 2024-05-14 平安科技(深圳)有限公司 Classification model training method, device, equipment and computer readable storage medium
CN114223012A (en) * 2019-10-31 2022-03-22 深圳市欢太科技有限公司 Push object determination method and device, terminal equipment and storage medium
US11386463B2 (en) * 2019-12-17 2022-07-12 At&T Intellectual Property I, L.P. Method and apparatus for labeling data
CN111585997B (en) * 2020-04-27 2022-01-14 国家计算机网络与信息安全管理中心 Network flow abnormity detection method based on small amount of labeled data
CN113297471B (en) * 2020-11-11 2024-09-13 阿里巴巴新加坡控股有限公司 Data object tag generation method, data object searching device and electronic equipment

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN112118143A (en) * 2020-11-18 2020-12-22 迈普通信技术股份有限公司 Traffic prediction model, training method, prediction method, device, apparatus, and medium
CN113064930A (en) * 2020-12-29 2021-07-02 中国移动通信集团贵州有限公司 Cold and hot data identification method, device and electronic equipment for data warehouse

Also Published As

Publication number Publication date
CN114169401A (en) 2022-03-11

Similar Documents

Publication Publication Date Title
CN114169401B (en) Data processing, predictive model training methods and equipment
US12430363B2 (en) Data partition storage system, method, and non-transitory computer readable medium
WO2019218475A1 (en) Method and device for identifying abnormally-behaving subject, terminal device, and medium
CN117370272A (en) File management method, device, equipment and storage medium based on file heat
WO2017097231A1 (en) Topic processing method and device
CN103886067A (en) Method for recommending books through label implied topic
WO2019001359A1 (en) Data processing method and data processing apparatus
Bojchevski et al. Is pagerank all you need for scalable graph neural networks
CN105589917B (en) Method and device for analyzing log information of browser
CN104035972A (en) Knowledge recommending method and system based on micro blogs
CN118194995B (en) A method and device for acquiring key information of scientific and technological literature in the field of earth environment
CN110706015A (en) Advertisement click rate prediction oriented feature selection method
CN106776370A (en) Cloud storage method and device based on the assessment of object relevance
WO2022068659A1 (en) Information pushing method and apparatus and storage medium
CN116542196B (en) Integrated circuit time sequence analysis method, system and medium based on effective clock path
CN113553320B (en) Data quality monitoring method and device
CN116155597A (en) Access request processing method and device and computer equipment
WO2023048807A1 (en) Hierarchical representation learning of user interest
CN114911898A (en) Knowledge graph-based searching method and device and electronic equipment
HK40070320A (en) Data processing method and device and prediction model training method and device
CN116662327A (en) A data fusion cleaning method for database
CN111191119B (en) Neural network-based scientific and technological achievement self-learning method and device
CN114550157A (en) Bullet screen gathering identification method and device
CN120705419B (en) Knowledge graph-based service resource recommendation method, system, equipment and medium
CN117235010B (en) Bid document chart title classification management method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40070320

Country of ref document: HK

GR01 Patent grant