Detailed Description
It has been found during data storage that data often exhibit a clear hot/cold character: data in one area is accessed relatively frequently, while data in another area is rarely accessed, or successive accesses are separated by very long intervals (e.g., three days, one week, one month, or half a year). If a large amount of cold data occupies high-performance devices, storage resources are wasted. In the prior art, different types of storage media and storage modes are adopted to store cold and hot data respectively. Before cold and hot data can be stored separately, the mixed data must first be accurately identified and separated. Some schemes employ machine learning, using the access history of data to predict that the data will not be accessed for a period of time in the future; however, the richness of the samples obtained when training the machine learning model is limited by the size of the sampled observation window (e.g., a window may only observe accesses within one day or one week). In general, a single sample is generated within a limited observation window, data outside the observation window cannot be used as a training sample, and future accesses cannot be well predicted by such a model. Therefore, a solution is needed that can identify cold and hot data without being limited by the size of the observation window and the number of features.
In order to enable those skilled in the art to better understand the present invention, the technical solutions according to the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
Some of the flows described in the specification, the claims, and the figures above include a number of operations occurring in a particular order; these operations may be performed out of that order or concurrently. Sequence numbers such as 101 and 102 are merely used to distinguish the operations and do not by themselves represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the descriptions "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application. The method comprises the following steps:
101: Determine characteristic information according to the target data and the access record of the target data.
102: Input the characteristic information into a prediction model to obtain time information indicating when the target data will be accessed in the future. The prediction model is obtained by training on training samples; each training sample comprises a sample feature, a sample time label, and a sample type, and the sample time label and the sample type are determined by whether access records of the data corresponding to the sample feature exist before and after a random time in the sample sampling period.
103: Identify the target data as cold or hot data according to the time information.
The target data referred to herein is data for which cold/hot identification is required. The feature information includes access features of the target data, data related to the target data, semantic information of the database layer (such as domain_name, data_size, SQL template), and the like. In addition, the content included in the feature information may be increased or decreased as necessary.
The prediction model may be, for example, a survival analysis model, obtained in advance by training on training samples. The time information may be the time interval between the time point at which the target data will next be accessed in the future and the current time point (or a designated time point). The longer this interval, the lower the access frequency of the target data and the more likely it is to be classified as cold data; conversely, the shorter the interval, the higher the access frequency and the more likely it is to be classified as hot data. Therefore, when performing cold/hot identification, the method does not directly classify the target data as cold or hot, but uses the interval between the next future access and the current time as the basis for distinguishing cold data from hot data.
The training samples are generated based on the access records and related information of a plurality of target data over a historical period. Specifically, a training sample may be represented as (x_i, y_i, e_i), where x_i represents the sample feature (covariates, such as access features), y_i represents the sample time label, and e_i represents the sample type. It should be noted that the sample time label and the sample type are determined by whether an access record of the data corresponding to the sample feature exists before and after a random time in the sample sampling period. The construction of samples is illustrated below. The sample types referred to herein include a censored (deletion) type and a non-censored (non-deletion) type.
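As a concrete illustration, the triple (x_i, y_i, e_i) can be represented as follows; the field names and feature values are hypothetical, and e encodes the sample type (0 for the non-deletion type, 1 for the deletion/censored type):

```python
from collections import namedtuple

# One training sample (x_i, y_i, e_i); names and values are illustrative.
# x: covariate vector (e.g., recent access intervals in days)
# y: sample time label
# e: sample type (0 = non-deletion, 1 = deletion / censored)
TrainingSample = namedtuple("TrainingSample", ["x", "y", "e"])

sample = TrainingSample(x=[3.0, 7.0, 14.0], y=5.0, e=0)
```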
The length of the sample sampling period is not limited, and sampling may be performed over a long period (e.g., one year or half a year). When sampling is actually performed, the range of the time period corresponding to the training samples input to the prediction model is limited by the observation window. Here, sampling is performed by splitting the period at random times, as explained in the following examples.
In the training samples, the sample time label is used as the label for training the prediction model. Thus, the prediction model can output time information as the prediction result for the target data.
The scheme for constructing the training samples will be specifically described below.
Fig. 2 is a schematic flow chart of a method for constructing training samples according to an embodiment of the present application. As can be seen from fig. 2, constructing the training sample specifically includes the following steps:
201: Acquire the access records of the sample data within a sampling period.
202: Set a random time within the sampling period and split the sampling period to obtain a feature extraction period and an observation window period.
203: Generate sample features based on the access records of the sample data within the feature extraction period.
204: Search for at least one access record for the sample data after the random time.
205: When at least one access record for the sample data exists, determine the sample time label according to the random time and the at least one access record, and set the sample type to the non-deletion type.
206: When no access record for the sample data exists after the random time, determine the sample time label according to the random time and the end time of the observation window, and set the sample type to the deletion type.
As described above, the sampling period may span a relatively wide time range, and a long historical period can be used for sampling. When sampling is actually performed, only part of the period may be sampled. The access frequencies of different cold and hot data are not identical; when the access frequency of certain target data is relatively low, the collected access records can be sparse and irregular (for example, three consecutive access intervals may be one month, two months, and three months). In order to make full use of the data, the sampling period is split at a random time (the pivot time in Fig. 3) to obtain the feature extraction period (the history phase in Fig. 3) and the observation window period (the observation phase in Fig. 3). In practice there is a large amount of historical access data, and when a random time is inserted, access records corresponding to a plurality of sample data can be processed. For ease of understanding, the following embodiments take a single piece of target data as an example.
Fig. 3 is a schematic diagram of splitting the sampling period according to an embodiment of the present application. As can be seen from Fig. 3, the feature extraction period is used to obtain the access features of the sample data (for example, the access interval as a dynamic feature), which together with static features such as the size of the file to which the data belongs and the table name form the feature information. The longer the feature extraction period, the richer the access features that can be extracted.
The observation window corresponding to the observation window period is used to mark the sample time label and the sample type during training. The longer the observation window period, the less censored (deleted) data there is; conversely, the shorter the window, the higher the proportion of censored data.
Therefore, it is necessary to ensure sufficient length for both the feature extraction period and the observation window period to guarantee the quality of the generated training samples. Furthermore, the sample data at the very beginning and very end of each period are unstable. For this reason, within the observation window period, the random times are generated only in the middle portion that excludes the initial 20% and the final 20% of the window length, i.e., within 20%-80% of the trace period; the data sets corresponding to the individual random times are then combined into the full data set.
Specifically, fig. 4 is a schematic diagram of constructing training samples based on an observation window according to an embodiment of the present application. The access records in the sampling period in fig. 4 occur at T0, T1, T2, T3, T4, and T5 in sequence, i.e., the sample data is accessed six times in the sampling period. The start time of the observation window is Ts and the end time is Te. As can be seen from fig. 4, T5 is not within the observation window; in other words, the access event at T5 is not observable. Although T5 is not observed, the sample data is still accessed at a future time outside the observation window. Random times may then be set uniformly or randomly within the observation window. The more random times are set for the same sample data, the more corresponding training samples can be obtained.
For example, the random time T may be set between T0 and T1, between T1 and T2, between T2 and T3, between T3 and T4, or between T4 and T5. Since Te is smaller than T5 and larger than T4, if the random time T is taken as the current observation time and set between T4 and T5, the access event occurring at T5 cannot be observed within the observation window.
In practical application, the random time is adjusted, and it is searched whether at least one access record for the sample data exists after the adjusted random time, so that the sample time label and the sample type are determined according to the search result. Specifically, after the random time is set, it is determined whether at least one access record for the sample data can be found after it. For example:
If the random time T is set between T0 and T1, 4 access records can be observed in the observation window, and a training sample A1 is obtained.
If the random time T is set between T1 and T2, 3 access records can be observed in the observation window, and a training sample A2 is obtained.
If the random time T is set between T2 and T3, 2 access records can be observed in the observation window, and a training sample A3 is obtained.
If the random time T is set between T3 and T4, 1 access record can be observed in the observation window, and a training sample A4 is obtained.
At this time, the sample types corresponding to samples A1, A2, A3, and A4 may each be set to the non-deletion (non-censored) type.
If the random time T is set between T4 and T5, 0 access records can be observed in the observation window after T (the access at T5 is outside the window and cannot be observed), and a training sample A5 is obtained. At this time, the sample type corresponding to A5 may be set to the deletion (censored) type.
Therefore, training sets of different scales can be obtained by controlling the number of random times; in general, the larger the training set, the better the algorithm indexes of the prediction model and the more accurate the prediction results.
The manner of determining the sample time label corresponding to step 205 is illustrated below. Specifically, if the last access time in the observation window is later than the random time, the first time difference, between the random time and the first access time after the random time, is used as the sample time label, and the sample type is marked as the non-deletion type.
In practical applications, after the random time is set, the sample time label is calculated according to the random time and used as the label of the training sample. Since an access event can be observed after the random time within the observation window, i.e., an access record exists, no censoring occurs, and the sample type of the training sample is set to the non-deletion type. The corresponding sample time label is the first time difference between the random time and the first access time after it. Continuing the example, the sample time label of training sample A1 is y1 = T1 − T, that of A2 is y2 = T2 − T, that of A3 is y3 = T3 − T, and that of A4 is y4 = T4 − T.
The manner of determining the sample time label corresponding to step 206 is illustrated below. Specifically, if the last access time in the observation window period is earlier than the random time, the second time difference, between the window end time and the random time, is used as the sample time label, and the sample type is marked as the deletion type.
In practical applications, after the random time is set, the sample time label is likewise calculated according to the random time. Since no access event can be observed after the random time within the observation window, i.e., no access record exists, censoring occurs, and the sample type of the training sample is set to the deletion type. The corresponding sample time label is the second time difference between the window end time and the random time. Continuing the example, the sample time label of training sample A5 is y5 = Te − T.
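The two labeling rules can be sketched together as a single function; the function name and the concrete access times below are assumptions for illustration, with 0 denoting the non-deletion type and 1 the deletion (censored) type:

```python
def label_sample(access_times, t, window_end):
    # Accesses observable in the window after the random time t.
    later = [a for a in access_times if t < a <= window_end]
    if later:
        # First time difference: next observed access minus t (non-deletion).
        return min(later) - t, 0
    # Second time difference: window end minus t (deletion / censored).
    return window_end - t, 1

# Illustrative values: accesses T0..T4 inside the window, Te = 45, T5 outside.
accesses = [0, 10, 20, 30, 40]
assert label_sample(accesses, 5, 45) == (5, 0)    # like A1: y = T1 - T
assert label_sample(accesses, 42, 45) == (3, 1)   # like A5: y = Te - T
```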
Based on the above embodiment, when constructing training samples, unobserved access records can be fully utilized by introducing them into the training samples in the form of a first time difference with the non-deletion type, or a second time difference with the deletion type. Although the observation window is limited in length, samples are not lost in the case of censored data (data whose later accesses are not observed). After the prediction model is trained on such comprehensive training samples, it can better predict the time information of the next access after the current observation time, improving the prediction effect of the survival analysis model on future accesses. The scheme can be applied to a database or a cluster: cold/hot identification is performed on the data in each node, and the data is stored accordingly based on the identification result. When the trained model is used for prediction, the time information of the next access output by the model is used to judge whether the data is cold or hot.
After the training samples are obtained by the above embodiments, the predictive model may be trained. The present invention will be specifically illustrated with reference to examples.
Fig. 5 is a flowchart of a training method of a prediction model according to an embodiment of the present application. From fig. 5, it can be seen that the method specifically comprises the following steps:
501: Construct training samples. 502: Input the training samples into the prediction model to obtain a prediction result. 503: Optimize parameters of the prediction model according to the prediction result, the sample time label, and the sample type; the prediction model is used for cold/hot data identification.
As described above, a training sample is represented as (x_i, y_i, e_i), where x_i represents the sample feature (covariates such as access features), y_i represents the sample time label, and e_i represents the sample type. When constructing training samples, different random times may be set on the same sample data to obtain a group of training samples; further, a training sample set can be obtained based on a plurality of sample data. The training sample set is used to train the prediction model (e.g., a survival analysis model), with y_i as the training label. Although a relatively comprehensive training sample set is obtained, the prediction model still needs to be continuously optimized on it during training. The specific optimization process is as follows:
fig. 6 is a schematic flow chart of a method for optimizing parameters of a prediction model according to an embodiment of the present application. From fig. 6, it can be seen that the method specifically comprises the following steps:
601: Determine the correspondence between the prediction result and the sample time label.
602: Optimize the parameters in the prediction model according to the matching result between the sample type and the correspondence.
The prediction result here is time information, i.e., a time difference between the next access and the current time, together with a sample type. The correspondence between the prediction result and the sample time label is either that the prediction result is earlier than the sample time label or that it is later. The sample types include a non-deletion type and a deletion type. The determination of the matching result is illustrated below with reference to fig. 7.
As described above, in addition to the sample time label, each training sample includes a sample type. Step 602 will be described in detail below with reference to the accompanying drawings. Fig. 7 is a schematic diagram of the process of matching the sample type against the correspondence provided in an embodiment of the present application. As can be seen from fig. 7:
701: If the sample type is the non-deletion type and the correspondence is that the time information corresponding to the prediction result is smaller than the sample time label, it is determined that the sample type matches the correspondence.
For example, let training sample Y6 be (x_1, y_1, e_1), with y_1 = ty1 and e_1 = 0 (indicating the non-deletion type). The prediction result is tx, and comparison shows tx smaller than ty1. In other words, the prediction is that an access event for the target data will occur a time tx after the current time t. Since tx is less than ty1, the event is observable and no censored data is produced, which matches the sample type (non-deletion). That is, the sample type matches the correspondence.
702: If the sample type is the deletion type and the correspondence is that the time information corresponding to the prediction result is larger than the sample time label, it is determined that the sample type matches the correspondence.
For example, let training sample Y7 be (x_2, y_2, e_2), with y_2 = ty2 and e_2 = 1 (indicating the deletion type). The prediction result is tx, and comparison shows tx larger than ty2. In other words, the prediction is that an access event for the target data will occur a time tx after the current time t. Since tx is greater than ty2, the event is not observable and censored data is produced, which matches the sample type (deletion). That is, the sample type matches the correspondence.
703: If the sample type is the non-deletion type and the correspondence is that the time information corresponding to the prediction result is larger than the sample time label, it is determined that the sample type does not match the correspondence.
For example, let training sample Y6 be (x_1, y_1, e_1), with y_1 = ty1 and e_1 = 0 (indicating the non-deletion type). The prediction result is tx, and comparison shows tx larger than ty1. In other words, the prediction is that an access event for the target data will occur a time tx after the current time t. Since tx is greater than ty1, the event is not observable and censored data is produced, which does not match the sample type (non-deletion). That is, the sample type does not match the correspondence.
704: If the sample type is the deletion type and the correspondence is that the time information corresponding to the prediction result is smaller than the sample time label, it is determined that the sample type does not match the correspondence.
For example, let training sample Y7 be (x_2, y_2, e_2), with y_2 = ty2 and e_2 = 1 (indicating the deletion type). The prediction result is tx, and comparison shows tx smaller than ty2. In other words, the prediction is that an access event for the target data will occur a time tx after the current time t. Since tx is less than ty2, the event is observable and no censored data is produced, which does not match the sample type (deletion). That is, the sample type does not match the correspondence.
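The four cases 701-704 reduce to a simple consistency check; this sketch assumes 0/1 encodes the non-deletion/deletion type and uses strict comparison, as in the examples above:

```python
def is_match(sample_type, tx, ty):
    # Non-deletion (0): the prediction should fall before the observed access.
    # Deletion (1): the prediction should fall after the censoring label.
    if sample_type == 0:
        return tx < ty
    return tx > ty

assert is_match(0, 3.0, 5.0)        # 701: matched
assert is_match(1, 8.0, 5.0)        # 702: matched
assert not is_match(0, 8.0, 5.0)    # 703: not matched
assert not is_match(1, 3.0, 5.0)    # 704: not matched
```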
As an alternative embodiment of optimizing the prediction model, the quality of the prediction model can be measured by its prediction results during optimization:
Due to the presence of censored data, the c-index is typically used to measure the effect of the prediction model. The c-index is the proportion of pairs whose predicted ordering is consistent with the actual ordering among all useful sample pairs.
The calculation steps are as follows:
1. Pair all training samples with each other to obtain sample pairs. For example, if there are n samples, n(n-1)/2 sample pairs are generated;
2. If the sample with the smaller sample time in a pair is of the deletion type (meaning that it is censored data), or both samples in the pair are censored, the pair is considered invalid and excluded; the remaining pairs are useful pairs.
3. Count, among the useful pairs, the number of pairs in which the predicted result agrees with the actual result, i.e., pairs in which the sample with the longer predicted time also has the longer actual sample time.
c-index = number of consistent pairs / number of useful pairs. The c-index ranges between 0 and 1; the closer it is to 1, the stronger the model's ability to distinguish cold data from hot data. During training and optimization, the training samples are used for continuous training so that the c-index of the prediction model approaches 1.
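The three steps above can be sketched as follows; the type encoding (1 = deletion/censored) and the handling of ties are assumptions of this sketch:

```python
from itertools import combinations

def c_index(samples):
    """samples: list of (predicted_time, actual_time, sample_type),
    where sample_type 1 marks deletion (censored) samples."""
    useful = consistent = 0
    for a, b in combinations(samples, 2):
        if a[1] > b[1]:          # order the pair by actual sample time
            a, b = b, a
        if a[2] == 1:            # smaller-time sample censored (covers both-censored)
            continue             # invalid pair, excluded
        useful += 1
        if a[0] < b[0]:          # predicted ordering agrees with actual ordering
            consistent += 1
    return consistent / useful

scores = [(1.0, 2.0, 0), (3.0, 4.0, 0), (2.0, 3.0, 1)]
```

Here the pair formed by the second and third samples is invalid (the sample with the smaller actual time is censored), and the two remaining pairs are consistent, giving a c-index of 1.0.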
After outputting the time information using the predictive model, the identification of the cold and hot data will be further performed on the target data based on the time information. As will be described in detail below.
Fig. 8 is a schematic diagram of cold and hot data identification according to an embodiment of the present application. From fig. 8, it can be seen that the method specifically comprises the following steps:
801: Acquire the time information corresponding to each piece of target data.
802: Identify the target data as cold or hot according to the magnitude of the time information.
As described in step 802, the comparison may be performed by sorting based on the magnitude of the time information. For example, four pieces of time information are obtained: tx1 = 10 minutes, tx2 = 20 minutes, tx3 = 30 minutes, and tx4 = 40 minutes. The sorting may be in ascending or descending order; suppose ascending order yields tx1, tx2, tx3, tx4. The first 50% or the first 25% may then be classified as hot data and the remainder as cold data. The specific proportion may be set according to the actual situation (e.g., the size of the hot data storage space); the above is merely illustrative and does not limit the present application.
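The sorting-based division can be sketched as follows; the data identifiers and the 50% hot ratio are illustrative:

```python
def split_by_rank(time_info, hot_ratio=0.5):
    # time_info: mapping from data id to predicted minutes until next access.
    order = sorted(time_info, key=time_info.get)   # ascending time information
    cut = int(len(order) * hot_ratio)
    return order[:cut], order[cut:]                # (hot ids, cold ids)

info = {"tx1": 10, "tx2": 20, "tx3": 30, "tx4": 40}
hot, cold = split_by_rank(info)
```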
In addition to sorting, the time information may be compared against a threshold, as follows:
target data whose time information is greater than a first time threshold is marked as cold data and stored on a first storage medium with lower access performance;
target data whose time information is less than or equal to the first time threshold is marked as hot data and stored on a second storage medium with higher access performance.
For example, four pieces of time information are obtained: tx1 = 10 minutes, tx2 = 20 minutes, tx3 = 30 minutes, and tx4 = 40 minutes. Assuming the first time threshold is 25 minutes, the target data corresponding to tx3 and tx4 is cold data because both exceed 25 minutes, and the target data corresponding to tx1 and tx2 is hot data because both are below 25 minutes.
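The threshold comparison above can be sketched as follows; the 25-minute first time threshold follows the example and is otherwise arbitrary:

```python
def classify(minutes, first_threshold=25):
    # Greater than the first time threshold -> cold; otherwise hot.
    return "cold" if minutes > first_threshold else "hot"

assert classify(30) == "cold"   # tx3
assert classify(40) == "cold"   # tx4
assert classify(10) == "hot"    # tx1
assert classify(20) == "hot"    # tx2
```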
In practical applications, the storage state of cold and hot data can be adjusted using the time information output by the prediction model. Specifically, a storage timer is started when the cold data is stored. The remaining time is determined as the difference between the time information corresponding to the cold data and the elapsed storage time. When the remaining time is less than a second time threshold, the cold data is migrated from the first storage medium to the second storage medium.
For example, after the prediction model outputs time information of, say, 24 hours, the target data is identified as cold data and stored on the first storage medium (e.g., an HDD). Storage timing is started for the cold data; as time passes, the accumulated storage time reaches 23 hours, so the difference between the time information and the elapsed storage time (i.e., the remaining time) is only 1 hour. With the second time threshold set to 1 hour, once the remaining time drops below it (e.g., 59 minutes), the target data corresponding to the cold data is about to be accessed. To increase the access speed, the cold data may be migrated from the first storage medium to the second storage medium used for hot data (such as an SSD).
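A minimal sketch of the timed-migration check, assuming minutes as the unit and a strict comparison against the second time threshold:

```python
def should_migrate(predicted_minutes, elapsed_minutes, second_threshold=60):
    # Remaining time = prediction minus the accumulated storage timing.
    remaining = predicted_minutes - elapsed_minutes
    return remaining < second_threshold

# 24 h prediction, 23 h elapsed -> 60 min remaining, not yet below threshold;
# one more minute of storage -> 59 min remaining -> migrate to the fast medium.
assert not should_migrate(24 * 60, 23 * 60)
assert should_migrate(24 * 60, 23 * 60 + 1)
```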
Based on the same thought, the embodiment of the application also provides a prediction model training method. Fig. 9 is a schematic flow chart of a predictive model training method according to an embodiment of the application. From fig. 9, it can be seen that the method specifically comprises the following steps:
901, constructing a training sample, wherein the training sample comprises sample characteristics, a sample time tag and a sample type, and the sample time tag and the sample type are determined by whether access records of sample characteristic corresponding data exist before and after random time in a sample sampling period.
And 902, inputting the training sample into a prediction model to obtain a prediction result.
And 903, optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type, wherein the prediction model is used for identifying cold and hot data.
The optimization process of the parameters in the prediction model in step 903 is as follows: determine the correspondence between the prediction result and the sample time label, and optimize the parameters in the prediction model according to the matching result between the sample type and the correspondence.
The process of constructing the training sample in step 901 includes the following steps:
acquiring access records of the sample data within a sampling period;
setting a random time within the sampling period, and splitting the sampling period to obtain a feature extraction period and an observation window period;
generating sample features based on the access records of the sample data within the feature extraction period;
searching for at least one access record for the sample data after the random time;
when at least one access record for the sample data exists, determining the sample time label according to the random time and the at least one access record, and setting the sample type to the non-deletion type;
and when no access record for the sample data exists after the random time, determining the sample time label according to the random time and the end time of the observation window, and setting the sample type to the deletion type.
In particular, reference may be made to the respective embodiments corresponding to fig. 1 to 8, and the detailed description will not be repeated here.
For ease of understanding, the overall process of cold and hot data identification is illustrated below taking a survival analysis model as the prediction model. Fig. 10 is a schematic diagram of a hot and cold data identification system according to an embodiment of the application. As can be seen from fig. 10, the system includes a survival analysis server and an application end. The survival analysis server comprises an object storage service (Object Storage Service, OSS), a relational database service (Relational Database Service, RDS), and a survival analysis model. A lightweight survival analysis model can be constructed based on algorithms such as Cox or RSF (random survival forest); such a model is simpler than a neural-network survival analysis algorithm and achieves a higher model index (c-index). The historical data of the application end is used as sample data, and training samples are generated according to the embodiments corresponding to figs. 1 to 9 to train the survival analysis model, which is then optimized. The target data to be identified can then be identified, and cold/hot classified storage performed for the nodes in the shared storage.
There are two important functions in survival analysis. One is the survival function (Survival function), i.e., the probability that the event has not occurred before time t: S(t) = Pr(T >= t). The other is the risk (hazard) function, which describes the instantaneous rate at which the event occurs at time t, given that it has not occurred before t:
h(t) = lim_{Δt→0} Pr(t <= T < t + Δt | T >= t) / Δt
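The two functions can be illustrated with a small numerical sketch (a hypothetical example, not part of the embodiment): the event times below stand for intervals until data items are re-accessed, and both functions are estimated empirically from fully observed samples.

```python
# Hypothetical sketch: estimating the survival function S(t) = Pr(T >= t)
# and a discrete approximation of the risk function from fully observed
# event times. The event times (e.g., days until re-access) are made up.
event_times = [1, 2, 2, 3, 5, 8, 13, 21]

def survival(t, times):
    """Empirical S(t): fraction of samples whose event time is >= t."""
    return sum(1 for x in times if x >= t) / len(times)

def hazard(t, times, dt=1):
    """Discrete h(t): probability the event occurs in [t, t+dt),
    given that it has not occurred before t (i.e., among samples at risk)."""
    at_risk = sum(1 for x in times if x >= t)
    if at_risk == 0:
        return 0.0
    return sum(1 for x in times if t <= x < t + dt) / at_risk

print(survival(3, event_times))  # 5 of 8 event times are >= 3, so 0.625
print(hazard(2, event_times))    # 2 events in [2, 3) among 7 at risk, 2/7
```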
Models based on survival analysis all fit the risk function. Taking the Cox model as an example, the Cox model assumes that the log-hazard is a linear function of the covariates (i.e., the features), i.e.,
h(t,X) = h0(t)·exp(β1X1 + β2X2 + ... + βkXk)
where X = (X1, X2, X3, ..., Xk) are the k risk factors affecting the survival time t, and h0(t) is the baseline risk function.
The maximum likelihood estimate of β can then be obtained by establishing the partial likelihood function (partial likelihood) of the Cox risk model, taking the logarithm of both sides of the partial likelihood function, and setting the partial derivatives with respect to β to zero.
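The Cox relationship and its partial likelihood can be sketched for a toy, fully observed dataset as follows (illustrative values only; a real implementation would also handle tied event times and censored samples):

```python
# Hypothetical sketch of the Cox model's relative risk exp(β·X) and its
# log partial likelihood on a tiny, fully observed dataset. The sample
# values and β values are made up for illustration.
import math

def relative_risk(beta, x):
    """exp(β1·x1 + ... + βk·xk); this factor multiplies the baseline h0(t)."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))

def log_partial_likelihood(beta, samples):
    """Cox log partial likelihood over uncensored samples (t, x).
    For each event time t_i, the subject's risk is compared against the
    risk set of all subjects with t_j >= t_i (still event-free at t_i)."""
    ll = 0.0
    for t_i, x_i in samples:
        risk_set = sum(relative_risk(beta, x_j)
                       for t_j, x_j in samples if t_j >= t_i)
        ll += math.log(relative_risk(beta, x_i) / risk_set)
    return ll

samples = [(1.0, (0.5, 1.0)), (2.0, (0.1, 0.0)), (4.0, (0.9, 0.3))]
print(log_partial_likelihood((0.2, -0.1), samples))
```

Maximizing this quantity over β (by solving for the zero of its partial derivatives, as described above) yields the estimate of the coefficients.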
We turn the prediction of hot and cold data into a regression problem, i.e., predicting the time interval until the next access of the data (Time-to-event prediction). Because the observation window over the history information is limited, a large amount of censored data (deletion-type samples) arises: the next access of some samples is not observed within the observation window, but this does not mean that those samples are never accessed after the window ends. Traditional machine learning cannot make use of such samples. Survival analysis is naturally suited to handling censored data, so problems with a limited observation window, such as cold and hot data identification, are well suited to survival analysis modeling, and a better effect is obtained by fitting the cumulative hazard function (CHF).
Based on the same idea, the embodiment of the application also provides a data processing apparatus. Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus includes:
The determining module 1101 is configured to determine feature information according to the target data and the access record of the target data.
The input module 1102 is configured to input the feature information into a prediction model to obtain time information of future access to the target data, where the prediction model is obtained by training with training samples, each training sample includes a sample feature, a sample time tag, and a sample type, and the sample time tag and the sample type are determined by whether there is an access record of the data corresponding to the sample feature before and after a random time in a sample sampling period.
And the identification module 1103 is configured to identify the cold and hot data of the target data according to the time information.
Optionally, the system further comprises a training module 1104 for constructing a training sample, inputting the training sample into a prediction model to obtain a prediction result, and optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type, wherein the prediction model is used for identifying cold and hot data.
Optionally, the training module 1104 is further configured to determine a correspondence between the prediction result and the sample time tag, and optimize parameters in the prediction model according to a matching result of the sample type and the correspondence.
Optionally, the training module 1104 is further configured to determine that the sample type matches the correspondence if the sample type is a non-deletion type and the correspondence is that the time information corresponding to the prediction result is smaller than the sample time tag;
and determine that the sample type matches the correspondence if the sample type is a deletion type and the correspondence is that the time information corresponding to the prediction result is larger than the sample time tag.
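The matching rule can be sketched as follows (a minimal illustration; the function and constant names are assumptions, not from the embodiment — "deletion type" corresponds to a censored sample whose next access fell outside the observation window):

```python
# Hypothetical sketch of the prediction/label matching rule used when
# optimizing the prediction model with censored (deletion-type) samples.
NON_DELETION = "non_deletion"
DELETION = "deletion"

def matches(sample_type, predicted_time, sample_time_tag):
    """Return True when the prediction is consistent with the label:
    - non-deletion: the re-access actually happened at the time tag, so a
      prediction smaller than the tag counts as matched;
    - deletion: the re-access was not observed before the window ended, so
      the prediction should exceed the (censored) time tag."""
    if sample_type == NON_DELETION:
        return predicted_time < sample_time_tag
    return predicted_time > sample_time_tag

print(matches(NON_DELETION, 3.0, 5.0))  # True: predicted before the tag
print(matches(DELETION, 3.0, 5.0))      # False: censored tag not exceeded
```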
Optionally, the system further comprises a sample construction module 1105, configured to acquire access records of sample data in a sampling period, set a random time in the sampling period to split the sampling period into a feature extraction period and an observation window period, generate sample features based on the access records of the sample data in the feature extraction period, search whether at least one access record for the sample data exists after the random time, determine the sample time tag according to the random time and the at least one access record and set the sample type as a non-deletion type when at least one access record for the sample data exists, and determine the sample time tag according to the random time and the termination time of the observation window and set the sample type as a deletion type when no access record for the sample data exists.
Optionally, the sample construction module 1105 is further configured to adjust the random time, and search whether at least one access record for the sample data exists after the adjusted random time, so as to determine the sample time tag and the sample type according to the search result.
Optionally, the sample construction module 1105 is further configured to mark a first time difference between the random time and the last access time after the random time as the sample time tag and mark the sample type as a non-deletion type if the last access time in the observation window period is later than the random time.
Optionally, the sample construction module 1105 is further configured to mark a second time difference between the termination time of the observation window and the random time as the sample time tag and mark the sample type as a deletion type if the last access time in the observation window period is earlier than the random time.
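The labelling rules above can be sketched as follows (a hypothetical illustration: timestamps are in arbitrary units, and the sketch takes the first access after the random time as the event, consistent with the time-to-event framing described earlier):

```python
# Hypothetical sketch of sample labelling: given the access timestamps of
# one piece of sample data, a random time, and the termination time of the
# observation window, produce (sample_time_tag, sample_type).
def label_sample(access_times, random_time, window_end):
    """If an access exists after the random time, the tag is the first
    time difference (next access - random time) and the type is
    non-deletion; otherwise the tag is the second time difference
    (window end - random time) and the type is deletion (censored)."""
    later = [t for t in access_times if random_time < t <= window_end]
    if later:
        return min(later) - random_time, "non_deletion"
    return window_end - random_time, "deletion"

print(label_sample([2, 5, 9], 4, 12))  # (1, 'non_deletion')
print(label_sample([2, 3], 4, 12))     # (8, 'deletion')
```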
Optionally, the identification module 1103 is further configured to acquire the time information corresponding to the target data, and identify the cold and hot data of the target data according to the size of the time information.
Optionally, the identification module 1103 is further configured to mark the target data whose time information is greater than a first time threshold as cold data and store it to a first storage medium with lower access performance, and mark the target data whose time information is less than or equal to the first time threshold as hot data and store it to a second storage medium with higher access performance.
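The threshold rule can be sketched as follows (the threshold value and the returned labels are illustrative assumptions, not fixed by the embodiment):

```python
# Hypothetical sketch: classify target data as cold or hot by comparing
# the predicted time-to-next-access against a first time threshold, and
# pick the corresponding storage tier.
FIRST_TIME_THRESHOLD = 72.0  # assumed unit: hours until next predicted access

def classify(time_info, threshold=FIRST_TIME_THRESHOLD):
    """Cold data goes to the low-performance first storage medium,
    hot data to the high-performance second storage medium."""
    if time_info > threshold:
        return "cold", "first_storage_medium"
    return "hot", "second_storage_medium"

print(classify(100.0))  # ('cold', 'first_storage_medium')
print(classify(10.0))   # ('hot', 'second_storage_medium')
```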
Optionally, the system further comprises a migration module, configured to time how long the cold data has been stored, determine a remaining time based on the difference between the time information corresponding to the cold data and the timed storage duration, and migrate the cold data from the first storage medium to the second storage medium when the remaining time is smaller than a second time threshold.
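The migration check can be sketched as follows (a minimal illustration; all numeric values are hypothetical — the idea is that cold data is promoted back to the high-performance medium shortly before its predicted access):

```python
# Hypothetical sketch of the migration rule: the remaining time is the
# predicted time-to-access minus the elapsed storage time; when it drops
# below a second time threshold, the cold data is migrated from the first
# (low-performance) medium back to the second (high-performance) medium.
def should_migrate(time_info, stored_for, second_threshold):
    """Return True when the cold data should be migrated back."""
    remaining = time_info - stored_for
    return remaining < second_threshold

# Predicted next access in 100 h; stored for 95 h; threshold 10 h -> migrate.
print(should_migrate(100.0, 95.0, 10.0))  # True
print(should_migrate(100.0, 20.0, 10.0))  # False
```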
In one embodiment of the application, a non-transitory machine-readable storage medium is provided, having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the data processing method described with reference to fig. 1 to 8. See in particular the embodiments described above.
The embodiment of the application also provides an electronic device. Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 1201, a processor 1202, and a communication component 1203, wherein:
The memory 1201 is used for storing a program;
The processor 1202, coupled to the memory, is configured to execute the program stored in the memory for:
determining feature information according to the target data and an access record of the target data;
inputting the feature information into a prediction model to obtain time information of future access to the target data, wherein the prediction model is obtained by training with training samples, each training sample comprises a sample feature, a sample time tag, and a sample type, and the sample time tag and the sample type are determined by whether there is an access record of the data corresponding to the sample feature before and after a random time in a sample sampling period;
and carrying out cold and hot data identification on the target data according to the time information.
The memory 1201 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
Further, the processor 1202 in this embodiment may specifically be a programmable switching processing chip, in which a data replication engine is configured to replicate received data.
The processor 1202 may perform other functions in addition to the above functions when executing programs in memory, as described in detail in the foregoing embodiments. Further, as shown in FIG. 12, the electronic device also includes other components, such as a power supply component 1204.
Based on the same idea, the embodiment of the application also provides a prediction model training apparatus. Fig. 13 is a schematic structural diagram of a prediction model training apparatus according to an embodiment of the present application. The prediction model training apparatus includes:
The sample construction module 1301 is configured to construct a training sample, where the training sample includes a sample feature, a sample time tag, and a sample type, and the sample time tag and the sample type are determined by whether there is an access record of the data corresponding to the sample feature before and after a random time in a sample sampling period.
The input module 1302 is configured to input the training samples into a prediction model to obtain a prediction result.
And the optimizing module 1303 is used for optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type, wherein the prediction model is used for identifying cold and hot data.
Optionally, the optimizing module 1303 is further configured to determine a correspondence between a prediction result and the sample time tag, and optimize parameters in the prediction model according to a matching result of the sample type and the correspondence.
Optionally, the sample construction module 1301 is further configured to acquire access records of sample data in a sampling period, set a random time in the sampling period to split the sampling period into a feature extraction period and an observation window period, generate sample features based on the access records of the sample data in the feature extraction period, search whether at least one access record for the sample data exists after the random time, determine the sample time tag according to the random time and the at least one access record and set the sample type as a non-deletion type when at least one access record for the sample data exists, and determine the sample time tag according to the random time and the termination time of the observation window and set the sample type as a deletion type when no access record for the sample data exists.
In one embodiment of the application, a non-transitory machine-readable storage medium is provided, having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the prediction model training method described with reference to fig. 9. See in particular the embodiments described above.
The embodiment of the application also provides an electronic device. The electronic device may be a node device in a computing unit. Fig. 14 is a schematic structural diagram of another electronic device according to an embodiment of the present application. The electronic device comprises a memory 1401, a processor 1402, and a communication component 1403, wherein:
The memory 1401 for storing a program;
the processor 1402, coupled to the memory, is configured to execute the program stored in the memory for:
constructing a training sample, wherein the training sample comprises a sample feature, a sample time tag, and a sample type, and the sample time tag and the sample type are determined by whether there is an access record of the data corresponding to the sample feature before and after a random time in a sample sampling period;
inputting the training sample into a prediction model to obtain a prediction result;
Optimizing parameters in the prediction model according to the prediction result, the sample time tag and the sample type;
wherein the predictive model is used to identify cold and hot data.
The memory 1401 described above may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
Further, the processor 1402 in this embodiment may specifically be a programmable switching processing chip, in which a data replication engine is configured to replicate received data.
The processor 1402 may also perform other functions when executing the program in the memory; reference may be made to the description of the foregoing embodiments. Further, as shown in fig. 14, the electronic device also includes a power supply component 1404 and other components.
Based on the above embodiments, the feature information of the target data is input into a pre-trained prediction model, and the time information of the target data being accessed in the future, that is, the time difference between the future access time and the current time, can be obtained through the prediction model. In order to enable the prediction model to accurately predict this future access time information, the constructed training sample comprises a sample feature, a sample time tag, and a sample type, where the sample time tag and the sample type are determined by whether there is an access record of the data corresponding to the sample feature before and after a random time in the sample sampling period. The prediction model obtained by training with such training samples can accurately identify the cold and hot data of the target data. In this scheme, the sample time tag is used as the label of the training sample to train the prediction model, the time interval until the next access of the target data predicted by the model is used as the prediction result, and the cold and hot data identification of the target data is realized according to the size of this time interval, so that the accuracy of the cold and hot identification of the target data can be effectively improved.
It should be noted that, the data processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principles of the foregoing modules or units may refer to corresponding contents in the foregoing method embodiments, which are not repeated herein.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.