Disclosure of Invention
In view of the foregoing, it is desirable to provide a product performance prediction model construction method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the accuracy of model prediction.
In a first aspect, the application provides a method for constructing a product performance prediction model. The method comprises the following steps:
acquiring a data corpus, wherein the data corpus is a training set in an acquisition data set, and the acquisition data set comprises product performance parameters respectively corresponding to a plurality of versions of product control systems;
Acquiring at least one data subset based on the data corpus, and combining the data corpus and each data subset with a test set in the acquired data set respectively to obtain a first data set to be processed;
selecting data from the first data set to be processed as model processing data for each first data set to be processed, and dividing the model processing data into training data and verification data;
model training is carried out based on training data in all model processing data to obtain a first model, and importance degree ordering of various product performance parameters is determined based on training results, wherein the importance degree is used for representing the influence degree of the product performance parameters on model prediction results;
verifying the first model based on verification data in all model processing data to obtain a first model performance score, wherein the first model performance score is used for representing model prediction accuracy;
if the first model performance score is greater than a preset threshold, removing, based on the importance ranking result, a preset number of top-ranked product performance parameters from the model processing data, taking the model processing data after removal as the model processing data again, and returning to the step of performing model training based on training data in all the model processing data to obtain a first model, until the first model performance score is smaller than the preset threshold;
and taking the remaining product performance parameters of the final model processing data as same-distribution variables, and constructing a product performance prediction model based on the same-distribution variables so as to predict the product performance.
In one embodiment, constructing the product performance prediction model based on the same-distribution variables to predict the product performance includes:
selecting data from the first data set to be processed as model test data;
determining product performance parameters corresponding to the same distribution variables from the model test data;
testing the first model based on the product performance parameters corresponding to the same distribution variables to obtain a second model performance score;
determining, by principal component analysis, spatial distribution information of the product performance parameters corresponding to the same-distribution variables, and determining a model performance score of the first model based on the spatial distribution information;
and if the second model performance score meets its corresponding condition and/or the model performance score determined based on the spatial distribution information meets its corresponding condition, constructing a product performance prediction model based on the same-distribution variables so as to predict the product performance.
In one embodiment, before the model training is performed based on training data in all model processing data to obtain the first model, the method further includes:
And deleting the product performance parameters of which the variable correlation coefficients are lower than a threshold value in the first data set to be processed, wherein the variable correlation coefficients are used for representing the correlation degree of each product performance parameter and a product performance prediction result.
In one embodiment, acquiring at least one data subset based on the data corpus includes:
acquiring a plurality of candidate subsets based on the data corpus;
and determining, from the plurality of candidate subsets, a target candidate subset that satisfies a screening condition as a data subset.
In one embodiment, the screening conditions include at least one of:
the ratio of the number of product control system versions corresponding to the target candidate subset to the number of product control system versions corresponding to the data corpus exceeds a corresponding preset ratio;
the ratio of the number of samples contained in the target candidate subset to the number of samples contained in the data corpus exceeds a corresponding preset ratio, wherein the data corpus comprises samples corresponding to each version of the product control system, different samples corresponding to the same version of the product control system are distinguished by acquisition time, and each sample comprises the product performance parameters corresponding to that version of the product control system;
the difference between the maximum variable correlation coefficient of the target candidate subset and the maximum variable correlation coefficient of the data corpus is larger than a corresponding preset difference, wherein the variable correlation coefficient is used for representing the degree of correlation between a product performance parameter and the product performance prediction result;
and the difference between the product control system version number corresponding to the target candidate subset and the product control system version number corresponding to another candidate subset is larger than a corresponding preset difference.
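The first two screening conditions can be illustrated with a minimal Python sketch. The function names, the representation of a sample as a `(version, acquisition_time)` pair, and the default ratios are assumptions made only for illustration; the sketch also assumes that a candidate must satisfy every condition that is enabled.

```python
def screening_report(candidate, corpus, min_version_ratio=0.5, min_sample_ratio=0.3):
    """Evaluate two screening conditions for one candidate subset.

    candidate and corpus are lists of (version, acquisition_time) pairs.
    The preset ratios are illustrative values, not values from the source.
    """
    version_ratio = len({v for v, _ in candidate}) / len({v for v, _ in corpus})
    sample_ratio = len(candidate) / len(corpus)
    return {
        "version_ratio_ok": version_ratio > min_version_ratio,
        "sample_ratio_ok": sample_ratio > min_sample_ratio,
    }


def is_target_subset(candidate, corpus, **limits):
    # Assumption: every enabled condition must hold for the candidate to be kept.
    return all(screening_report(candidate, corpus, **limits).values())


corpus = [("A", 1), ("A", 2), ("B", 1), ("C", 1)]
wide_candidate = [("A", 1), ("B", 1)]   # 2 of 3 versions, 2 of 4 samples
narrow_candidate = [("A", 1)]           # 1 of 3 versions, 1 of 4 samples
```

Under these assumed ratios, `wide_candidate` would be kept as a data subset and `narrow_candidate` would be discarded.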
In one embodiment, constructing the product performance prediction model based on the same-distribution variables to predict the product performance includes:
for each first data set to be processed, determining the same-distribution variables from the first data set to be processed to obtain a second data set to be processed, determining a training set in the second data set to be processed, and dividing the training set into training data and verification data;
Model training is carried out based on training data in the second data set to be processed to obtain a second model, and importance ranking of various product performance parameters is determined based on training results of the second model;
Verifying the second model based on verification data in the second data set to be processed to obtain a third model performance score;
if the variation of the third model performance score exceeds a preset variation range, removing, based on the importance ranking result of the product performance parameters in the second model, a preset number of bottom-ranked product performance parameters from the second data set to be processed, taking the second data set to be processed after removal as the second data set to be processed again, and returning to the step of performing model training based on training data in the second data set to be processed to obtain a second model, until the variation of the third model performance score does not exceed the preset variation range;
And constructing a product performance prediction model based on the remaining product performance parameters in the final second data set to be processed so as to predict the product performance.
In one embodiment, the method further comprises:
respectively inputting a sample to be predicted into the product performance prediction models constructed for the data corpus and for each data subset to obtain a plurality of prediction results, and taking the optimal value or the average value of the plurality of prediction results as the product performance prediction result.
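Combining the prediction results of several models can be sketched as follows. This is a minimal illustration in which each model is assumed to be a callable that returns a number, and in which the "optimal value" is assumed to be the maximum; the source does not define which value counts as optimal.

```python
from statistics import mean


def predict_with_all_models(models, sample, mode="average"):
    """Feed one sample to every model and combine the predictions.

    models: callables built for the data corpus and for each data subset
            (hypothetical interface: model(sample) -> float).
    mode:   "average" or "max"; treating the maximum as the optimal value
            is an assumption made for this sketch.
    """
    predictions = [model(sample) for model in models]
    if mode == "average":
        return mean(predictions)
    if mode == "max":
        return max(predictions)
    raise ValueError(f"unknown mode: {mode}")


# Stand-in models that ignore their input, used only to show the call shape.
corpus_model = lambda sample: 1.0
subset_model = lambda sample: 3.0
combined = predict_with_all_models([corpus_model, subset_model], {"param_1": 0.5})
```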
In a second aspect, the application also provides a device for constructing the product performance prediction model. The device comprises:
the corpus acquisition module is used for acquiring a data corpus, wherein the data corpus is a training set in an acquired data set, and the acquired data set comprises product performance parameters respectively corresponding to a plurality of versions of product control systems;
The subset acquisition module is used for acquiring at least one data subset based on the data corpus, and combining the data corpus and each data subset with a test set in the acquired data set respectively to obtain a first data set to be processed;
the selecting module is used for selecting data from the first data set to be processed as model processing data aiming at each first data set to be processed, and dividing the model processing data into training data and verification data;
the first model construction module is used for carrying out model training based on training data in all model processing data to obtain a first model, and determining an importance ranking of the product performance parameters based on the training result, wherein the importance is used for representing the degree of influence of a product performance parameter on the model prediction result;
the model verification module is used for verifying the first model based on verification data in all model processing data to obtain a first model performance score, and the first model performance score is used for representing model prediction accuracy;
the loop module is used for, if the first model performance score is greater than a preset threshold, removing, based on the importance ranking result, a preset number of top-ranked product performance parameters from the model processing data, taking the model processing data after removal as the model processing data again, and returning to the step of performing model training based on training data in all the model processing data to obtain a first model, until the first model performance score is smaller than the preset threshold;
and the second model construction module is used for taking the remaining product performance parameters of the final model processing data as the same-distribution variables, and constructing a product performance prediction model based on the same-distribution variables so as to predict the product performance.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the product performance prediction model construction method when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the product performance prediction model construction method described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the product performance prediction model construction method described above.
In this embodiment, the acquired data set includes a training set and a test set. Because the acquired data set includes product performance parameters corresponding to a plurality of versions of product control systems, the product control system version numbers corresponding to the training set and the test set are not completely consistent. The data corpus, which is the training set, is acquired, at least one data subset is acquired based on the data corpus, and the data corpus and each data subset are respectively combined with the test set in the acquired data set to obtain first data sets to be processed; that is, each first data set to be processed contains both training-set data and test-set data. For each first data set to be processed, data is selected from it as model processing data, and the model processing data is divided into training data and verification data. Model training is performed based on the training data in all model processing data to obtain a first model, and an importance ranking of the product performance parameters is determined based on the training result, where the importance represents the degree of influence of a product performance parameter on the model prediction result. The first model is verified based on the verification data in all model processing data to obtain a first model performance score, which represents the model prediction accuracy. If the first model performance score is greater than a preset threshold, the first model can distinguish training-set data from test-set data with high accuracy; in that case, a preset number of top-ranked product performance parameters are removed from the model processing data based on the importance ranking result, the model processing data after removal is taken as the model processing data again, and the method returns to the model training step. Each iteration therefore removes the product performance parameters that have the larger influence on the model prediction result, until the first model performance score is smaller than the preset threshold, that is, until the first model can no longer distinguish the training set from the test set well. The remaining product performance parameters of the final model processing data are taken as same-distribution variables, and a product performance prediction model is constructed based on the same-distribution variables to predict the product performance. Because the product performance prediction model is constructed from same-distribution variables, it is insensitive to the differences between product performance data corresponding to different versions of the product control system, so the accuracy of its prediction results is higher after the product control system version number changes.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In existing schemes, the value ranges of the product performance parameters differ between product control system version numbers. For example, for the V1 version of the product control system the product performance parameters range from 0 to 10, while for the V2 version they range from 10 to 20. A model built for the V1 version can accurately predict only from product performance parameters in the range of 0 to 10, and its accuracy when predicting performance from the product performance parameters corresponding to the V2 version is very low, so the accuracy of product performance prediction needs to be improved.
In one embodiment, as shown in fig. 1, a method for constructing a product performance prediction model is provided, and this embodiment is described by taking application of the method to a gateway server, where the gateway server may be a product performance prediction analysis module deployed in an edge computing gateway, and the method includes the following steps:
step S101, acquiring a data corpus, wherein the data corpus is a training set in an acquisition data set, and the acquisition data set comprises product performance parameters respectively corresponding to a plurality of versions of product control systems;
The product control system refers to a control program that controls the product to realize the product functions. The control program is burned into the controller and has a corresponding number, and the number of each control program is the version number of the product control system.
The gateway server performs data acquisition to obtain an acquisition data set, wherein the acquisition data set comprises samples corresponding to multiple versions of product control systems respectively, wherein samples at multiple moments can be acquired for the same version of product control system, different samples corresponding to the same version of product control system are distinguished based on acquisition moments, and each sample comprises product performance parameters corresponding to the corresponding version of product control system. For example, the collection data set includes a sample corresponding to the product control system of version a, a sample corresponding to the product control system of version B, and a sample corresponding to the product control system of version C, the product control system of version a includes a sample collected at collection time 1 and a sample sampled at collection time 2, the sample at each collection time includes a product performance parameter 1 and a product performance parameter 2, and the sample condition corresponding to the product control system of version B or version C is similar to the sample condition corresponding to the product control system of version a.
In this embodiment, the dimensions of the product performance parameters in each sample may be the same or different, for example, each sample includes X product performance parameters, or one sample includes X product performance parameters, and another sample includes Y product performance parameters, where X is different from Y. X and Y are positive integers.
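The sample organization described above can be pictured with a small data-structure sketch. The class name, field names, and parameter values below are hypothetical and chosen only for illustration; a dictionary is used for the parameters so that different samples may carry different numbers of product performance parameters.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sample:
    version: str       # product control system version number
    acquired_at: int   # acquisition time; distinguishes samples of one version
    params: dict = field(default_factory=dict)  # parameter name -> value


def group_by_version(samples):
    """Group samples by version and order each group by acquisition time."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.version].append(sample)
    for items in grouped.values():
        items.sort(key=lambda s: s.acquired_at)
    return dict(grouped)


# Illustrative values only; version B carries fewer parameters than version A.
acquisition = [
    Sample("A", 2, {"param_1": 0.7, "param_2": 3.1}),
    Sample("A", 1, {"param_1": 0.5, "param_2": 3.0}),
    Sample("B", 1, {"param_1": 0.9}),
]
by_version = group_by_version(acquisition)
```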
The acquired data set includes samples whose performance average value labels have already been obtained; these samples serve as the training set and are marked S1. The acquired data set also includes samples whose performance average values are to be predicted; these samples serve as the test set and are marked S2. After the gateway server preprocesses the training set and the test set respectively, the preprocessed training set is taken as the data corpus, which comprises the samples marked S1. The version numbers of the product control systems corresponding to the training set and the test set are not completely consistent.
Step S102, acquiring at least one data subset based on the data corpus, and combining the data corpus and each data subset with the test set in the acquired data set respectively to obtain a first data set to be processed;
the gateway server obtains at least one data subset based on the data corpus;
The gateway server merges the data corpus with the test set (i.e., merges the samples marked S1 with the samples marked S2) to obtain the first data set to be processed corresponding to the data corpus, and merges each data subset with the test set to obtain the first data set to be processed corresponding to that data subset. That is, the data corpus has a corresponding first data set to be processed, and each data subset also has a corresponding first data set to be processed.
Steps S103 to S107 described below are performed for each first data set to be processed.
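The merging in step S102 can be sketched as follows, assuming each sample is a dictionary of product performance parameters and that an added `origin` key (a name chosen here for illustration) records whether the sample is marked S1 or S2.

```python
def build_first_datasets(corpus, subsets, test_set):
    """Return one merged data set for the corpus and one per data subset.

    Every merged data set contains the training-side samples tagged "S1"
    followed by all test-set samples tagged "S2".
    """
    def merge(train_part):
        tagged_train = [{**sample, "origin": "S1"} for sample in train_part]
        tagged_test = [{**sample, "origin": "S2"} for sample in test_set]
        return tagged_train + tagged_test

    return [merge(corpus)] + [merge(subset) for subset in subsets]


# Illustrative values only.
corpus = [{"param_1": 1.0}, {"param_1": 2.0}, {"param_1": 3.0}]
subsets = [corpus[:2]]
test_set = [{"param_1": 11.0}, {"param_1": 12.0}]
first_datasets = build_first_datasets(corpus, subsets, test_set)
```

The first element of `first_datasets` corresponds to the data corpus and each later element corresponds to one data subset.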
In this embodiment, each first data set to be processed includes samples corresponding to multiple versions of product control systems respectively, and different samples corresponding to the same version of product control system are distinguished based on the acquisition time.
Step S103, selecting data from the first data set to be processed as model processing data aiming at each first data set to be processed, and dividing the model processing data into training data and verification data;
the gateway server randomly selects data from the first data set to be processed as model processing data, and reserves unselected data as model test data. For example, 70% of the data may be selected as the model process data, and 30% of the data may be reserved as the model test data.
The gateway server divides the model processing data into training data and verification data, wherein the training data comprises a sample marked as S1 and a sample marked as S2, and the verification data also comprises a sample marked as S1 and a sample marked as S2.
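The two-level split in step S103 can be sketched with the standard library. The 70/30 split follows the example above; the 80/20 split between training data and verification data, the fixed seed, and the function name are assumptions made for illustration. A plain random split does not by itself guarantee that both S1 and S2 samples appear in every part, so a real implementation would probably need to check or stratify for that.

```python
import random


def split_dataset(samples, processing_pct=70, training_pct=80, seed=0):
    """Split samples into (training data, verification data, model test data).

    processing_pct: share of samples selected as model processing data.
    training_pct:   share of model processing data used as training data
                    (an assumed value; the source does not specify it).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * processing_pct // 100
    processing, model_test = shuffled[:cut], shuffled[cut:]
    train_cut = len(processing) * training_pct // 100
    return processing[:train_cut], processing[train_cut:], model_test


training, verification, model_test = split_dataset(list(range(100)))
```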
Step S104, model training is carried out based on training data in all model processing data to obtain a first model, and importance degree ordering of various product performance parameters is determined based on training results, wherein the importance degree is used for representing the influence degree of the product performance parameters on model prediction results;
The gateway server performs model training and tuning based on training data in the model processing data to obtain a first model, determines the influence degree of each product performance parameter on a model prediction result from the training result, obtains the weight of each product performance parameter based on the influence degree, takes the weight as the importance of the product performance parameter, and sorts the importance of each product performance parameter.
The first model is a classification model. During model training, each sample in the training data is assigned to the class corresponding to S1 or the class corresponding to S2, and the model prediction result refers to the classification result of the first model for a sample. The performance of the first model is its classification accuracy: if the model assigns the samples marked S1 to the class corresponding to S1 and the samples marked S2 to the class corresponding to S2, the model performs well; if it misclassifies them, the model performs poorly.
Step S105, verifying the first model based on verification data in all model processing data to obtain a first model performance score, wherein the first model performance score is used for representing model prediction accuracy;
The gateway server verifies the first model based on the verification data in all model processing data to obtain a first model performance score, which is used for representing the model prediction accuracy. The higher the model prediction accuracy, the better the first model can distinguish the training set marked S1 from the test set marked S2, that is, the better it can distinguish product performance data collected for different versions of the product control system.
Step S106, if the first model performance score is greater than a preset threshold, removing product performance parameters of a preset number of items before sequencing from the model processing data based on the importance sequencing result, re-using the removed model processing data as model processing data, returning to the step of performing model training based on training data in all the model processing data to obtain a first model, and continuing to execute until the first model performance score is less than the preset threshold;
If the first model performance score is greater than the preset threshold, the gateway server removes, based on the importance ranking result, a preset number of top-ranked product performance parameters from the model processing data. The preset number is generally one, that is, the top-ranked product performance parameter is removed.
The gateway server takes the model processing data after removal as the model processing data in step S104 and returns to execute steps S104 to S106 in a loop until the first model performance score is smaller than the preset threshold. In this way, the product performance parameters that have a larger influence on the model prediction result are removed and the first model performance score decreases, so that the first model can no longer distinguish the training set marked S1 from the test set marked S2 well.
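The loop of steps S104 to S106 can be sketched in plain Python. The embodiment does not name a particular classifier, so the sketch substitutes a simple nearest-centroid classifier as a stand-in first model and uses the standardized gap between class means as a stand-in importance; these choices, all function names, and the threshold value are assumptions for illustration only.

```python
from statistics import mean, pstdev


def train_centroid_classifier(rows, labels, features):
    """Stand-in first model: per-class centroids plus a per-feature importance."""
    scales = {f: pstdev([r[f] for r in rows]) or 1.0 for f in features}
    centroids = {}
    for label in ("S1", "S2"):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = {f: mean(r[f] for r in members) for f in features}
    importance = {
        f: abs(centroids["S1"][f] - centroids["S2"][f]) / scales[f] for f in features
    }
    return centroids, scales, importance


def classify(model, row, features):
    centroids, scales, _ = model

    def distance(label):
        return sum(((row[f] - centroids[label][f]) / scales[f]) ** 2 for f in features)

    return "S1" if distance("S1") <= distance("S2") else "S2"


def accuracy(model, rows, labels, features):
    hits = sum(classify(model, r, features) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def select_same_distribution_variables(train, verify, features, threshold=0.7, drop=1):
    """Drop the top-ranked parameters until the classifier score is no longer above threshold."""
    (train_rows, train_labels), (verify_rows, verify_labels) = train, verify
    remaining = list(features)
    while remaining:
        model = train_centroid_classifier(train_rows, train_labels, remaining)
        score = accuracy(model, verify_rows, verify_labels, remaining)
        if score <= threshold:  # the classifier can no longer separate S1 from S2 well
            break
        ranked = sorted(remaining, key=lambda f: model[2][f], reverse=True)
        remaining = [f for f in remaining if f not in ranked[:drop]]
    return remaining


# Illustrative data: "a" differs between S1 and S2, "b" is distributed identically.
values = [(0, 5), (1, 6), (2, 5), (3, 6), (10, 5), (11, 6), (12, 5), (13, 6)]
rows = [{"a": a, "b": b} for a, b in values]
labels = ["S1"] * 4 + ["S2"] * 4
kept = select_same_distribution_variables((rows, labels), (rows, labels), ["a", "b"])
```

In this toy data the parameter `a` separates the two sets and is removed in the first round, after which the classifier score drops to chance level and `b` remains as the same-distribution variable.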
And S107, taking the remaining product performance parameters of the final model processing data as the same distribution variables, and constructing a product performance prediction model based on the same distribution variables so as to predict the product performance.
The gateway server takes the remaining product performance parameters of the final model processing data as the same-distribution variables. For each sample in the first data set to be processed, it removes the product performance parameters other than the same-distribution variables, and builds a product performance prediction model based on the first data set to be processed after removal. That is, the corresponding same-distribution variables are determined and a corresponding product performance prediction model is built for the data corpus, and likewise for each data subset.
For example, suppose each sample of the first data set to be processed initially includes product performance parameter A, product performance parameter B, and product performance parameter C, and the same-distribution variables are product performance parameters B and C. Product performance parameter A is then removed, and each sample of the first data set to be processed after removal includes only product performance parameter B and product performance parameter C.
And the gateway server performs product performance prediction based on the product performance prediction model corresponding to the data corpus and the product performance prediction model corresponding to each data subset.
In this embodiment, because the version numbers of the product control systems corresponding to the training set and the test set are not completely consistent, the data corpus and each data subset are combined with the test set, and a first model is trained to classify whether a sample comes from the training set or the test set. Whenever the first model performance score is greater than the preset threshold, which indicates that the first model can still distinguish the two sets with high accuracy, the top-ranked product performance parameters in the importance ranking are removed and the first model is retrained. When the first model performance score falls below the preset threshold, the first model can no longer distinguish the training set from the test set well, and the remaining product performance parameters are taken as same-distribution variables. A product performance prediction model constructed from these same-distribution variables is insensitive to the differences between product performance data corresponding to different versions of the product control system, so the accuracy of its prediction results is higher after the product control system version number changes.
In one embodiment, the selected same-distribution variables also need to be verified. As shown in fig. 2, step S107 includes:
step S201, selecting data from a first data set to be processed as model test data;
The model test data also includes a sample labeled S1 and a sample labeled S2.
Step S202, determining product performance parameters corresponding to the same distribution variables from model test data;
For each sample in the model test data, the product performance parameters other than the same-distribution variables are removed; the performance parameters contained in the model test data after removal are the product performance parameters corresponding to the same-distribution variables.
Step S203, testing the first model based on the product performance parameters corresponding to the same distribution variables to obtain a second model performance score;
in this embodiment, the second model performance score, like the first model performance score described above, is also used to characterize model prediction accuracy.
In this embodiment, the first model is tested to detect whether it is overfitted.
Step S204, determining spatial distribution information of the product performance parameters corresponding to the same-distribution variables by principal component analysis, and determining a model performance score of the first model based on the spatial distribution information;
The spatial distribution information of the product performance parameters corresponding to the same-distribution variables refers to the similarity, in terms of data spatial distribution, between the product performance parameters corresponding to the same-distribution variables in the training data and those in the test data.
The gateway server can determine the similarity by principal component analysis (PCA) and measure the model performance of the first model based on it. The higher the similarity, the less the model can distinguish the samples marked S1 from the samples marked S2, and the worse the model performance. The similarity needs to be greater than a certain similarity threshold.
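One possible way to turn a PCA projection into a similarity value is sketched below with NumPy. The embodiment does not define the similarity measure, so the measure used here (one divided by one plus the standardized distance between the projected means of the two sets) is purely an assumption, as are the function name and the number of components.

```python
import numpy as np


def pca_similarity(train_x, test_x, n_components=2):
    """Project both sets onto shared principal axes and compare their centers.

    Returns a value in (0, 1]; 1 means the projected means coincide.
    The similarity formula is an illustrative assumption.
    """
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    combined = np.vstack([train_x, test_x])
    centered = combined - combined.mean(axis=0)
    # Principal axes are the right singular vectors of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    train_p, test_p = projected[: len(train_x)], projected[len(train_x):]
    scale = projected.std(axis=0)
    scale[scale == 0] = 1.0
    gap = np.linalg.norm((train_p.mean(axis=0) - test_p.mean(axis=0)) / scale)
    return float(1.0 / (1.0 + gap))


# Illustrative values only.
base = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
same = pca_similarity(base, base)          # identical distributions
shifted = pca_similarity(base, base + 10)  # test data shifted away from training data
```

With this measure, identical training and test data give a similarity of approximately 1, while shifting the test data lowers the similarity.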
Step S205, if the second model performance score meets the corresponding condition and/or the model performance score determined based on the spatial distribution information meets the corresponding condition, constructing a product performance prediction model based on the same distribution variables to perform product performance prediction.
The second model performance score meeting the corresponding condition may be that the second model performance score, compared against a model over-fitting score threshold, indicates that the first model is not over-fitted.
In this embodiment, the server tests whether the first model is over-fitted using the model test data, and determines the spatial distribution information through principal component analysis in order to verify whether the product performance data corresponding to the same distribution variables in the training data and in the test data are similar in data space. If the similarity is greater than the similarity threshold, the first model cannot distinguish the training set from the test set well. If the first model is not over-fitted and cannot distinguish the training set from the test set well, the determined same distribution variables are accurate, and the product performance prediction model is then constructed based on the same distribution variables.
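The principal-component check described above can be illustrated with a minimal sketch. The source does not state how the similarity is computed, so the measure used here (closeness of the two group means after projection onto the first principal component) is an assumption, as are the names `first_component` and `distribution_similarity`. A plain power iteration stands in for a PCA library.

```python
import math

def first_component(rows, iters=200):
    """Return (column means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / max(n - 1, 1) for b in range(d)]
           for a in range(d)]
    # Non-uniform start vector; power iteration can still fail if the start
    # happens to be orthogonal to the dominant direction.
    vec = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iters):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return mean, vec

def distribution_similarity(train_rows, test_rows):
    """Assumed similarity in (0, 1]: 1.0 means the projected group means coincide."""
    mean, vec = first_component(train_rows + test_rows)

    def project(rows):
        return [sum((r[j] - mean[j]) * vec[j] for j in range(len(vec))) for r in rows]

    p1, p2 = project(train_rows), project(test_rows)
    m1, m2 = sum(p1) / len(p1), sum(p2) / len(p2)
    pooled = p1 + p2
    # The pooled projection has zero mean, so the RMS is its spread.
    spread = math.sqrt(sum(p * p for p in pooled) / len(pooled))
    if spread == 0.0:
        return 1.0
    return 1.0 / (1.0 + abs(m1 - m2) / spread)
```

A caller would compare the returned value with the similarity threshold; rows here are samples restricted to the same distribution variables.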
In one embodiment, since the number of samples in the first data set to be processed is huge, product performance parameters whose variable correlation coefficients are lower than a threshold may be deleted from the first data set to be processed in advance in order to reduce the data processing pressure on the gateway server. Accordingly, before step S105 performs model training based on training data in all model processing data to obtain the first model, the method further includes:
and deleting the product performance parameters of which the variable correlation coefficients are lower than the threshold value in the first data set to be processed, wherein the variable correlation coefficients are used for representing the correlation degree of each product performance parameter and the product performance prediction result.
In this embodiment, the variable correlation coefficient of each product performance parameter may be the maximal information coefficient (Maximal Information Coefficient, MIC) between the product performance parameter and the product performance prediction result, or a linear correlation coefficient between them, for example the Pearson correlation coefficient. Alternatively, the variable correlation coefficient may include both the MIC and the Pearson correlation coefficient, in which case the variable correlation index is a composite index that considers both linear correlation (Pearson) and nonlinear correlation (MIC), expressed as max(Pearson, MIC).
Since the number of product performance parameters in the first set to be processed is large, the variable correlation coefficient of a product performance parameter may be an average of the variable correlation coefficients of a plurality of the product performance parameters. For example, the first to-be-processed data set includes 1000 product performance parameters a, and the variable correlation coefficient of the product performance parameters a is an average value of the variable correlation coefficients of the 1000 product performance parameters a.
In this embodiment, the product performance parameters whose variable correlation coefficients are lower than the threshold may be deleted in advance, so as to reduce the data processing amount, thereby reducing the data processing pressure of the gateway server.
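The pre-filtering step can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names are hypothetical, MIC values are assumed to be computed elsewhere and passed in, and taking the absolute value of the Pearson coefficient before applying max(Pearson, MIC) is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant column: treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def variable_correlation(xs, ys, mic=None):
    """Composite index max(Pearson, MIC); |Pearson| is used (assumption)."""
    linear = abs(pearson(xs, ys))
    return linear if mic is None else max(linear, mic)

def drop_weak_parameters(columns, target, threshold, mic_scores=None):
    """Keep only parameters whose variable correlation is not below threshold.

    columns maps a parameter name to its values; mic_scores optionally maps a
    parameter name to a precomputed MIC value (hypothetical input).
    """
    mic_scores = mic_scores or {}
    return {name: values for name, values in columns.items()
            if variable_correlation(values, target, mic_scores.get(name)) >= threshold}
```

For example, with a threshold of 0.5, a parameter whose Pearson coefficient is about -0.45 is dropped unless a supplied MIC value lifts the composite index above the threshold.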
Referring to fig. 3, in one embodiment, acquiring at least one subset of data based on the full set of data in step S102 includes:
acquiring a plurality of candidate subsets based on the data corpus;
A target candidate subset satisfying the screening condition is determined from the plurality of candidate subsets as a data subset.
In this embodiment, the gateway server may obtain multiple candidate subsets from the data corpus, where different candidate subsets may share some samples and differ in others. As described above, the data corpus includes only the samples labeled S1, so each data subset also includes only the samples labeled S1.
Referring to FIG. 3, the embodiment specifically comprises the following steps: step S1, acquiring a plurality of candidate subsets based on the data corpus; step S2, initializing the candidate subset set; step S3, judging whether a candidate subset exists in the candidate subset set, and ending the loop if no candidate subset exists;
If a candidate subset exists, step S4 is executed to take the last candidate subset in the candidate subset set as the target candidate subset, and step S5 is executed to determine whether the target candidate subset satisfies the screening condition. If yes, step S6 is executed: the target candidate subset is added to the data subset set as a data subset and removed from the candidate subset set to form a new candidate subset set, and the process returns to step S3. If not, step S7 is executed: the target candidate subset is removed from the candidate subset set to form a new candidate subset set, and the process returns to step S3.
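The loop of steps S1 to S7 can be sketched as below. The function name `select_data_subsets` and the `meets_screening` callback are hypothetical; the callback stands for whatever screening condition is configured.

```python
def select_data_subsets(candidates, meets_screening):
    """Walk the candidate subset set from its last element and keep qualifying subsets."""
    candidate_set = list(candidates)      # S2: initialise the candidate subset set
    data_subsets = []
    while candidate_set:                  # S3: stop when no candidate remains
        target = candidate_set.pop()      # S4: last candidate becomes the target;
                                          #     pop() also performs the removal of S6/S7
        if meets_screening(target):       # S5: check the screening condition
            data_subsets.append(target)   # S6: accept the target as a data subset
    return data_subsets
```

Because the last candidate is examined first, accepted subsets come out in reverse order of the candidate list.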
In this embodiment, a subset of data satisfying the screening conditions may be screened, and by setting the screening conditions in the following embodiment, product performance parameters that affect each other may be separated, so that when product performance prediction is performed on product performance parameters corresponding to a product control system of a certain version, a product performance prediction model constructed based on the subset of data is higher in prediction accuracy than a product performance prediction model constructed based on the whole set of data.
In one embodiment, the screening conditions include at least one of:
The occupation proportion of the number of the versions of the product control system corresponding to the target candidate subset in the number of the versions of the product control system corresponding to the data corpus exceeds the corresponding preset proportion;
The occupation proportion of the number of samples contained in the target candidate subset in the number of samples contained in the data corpus exceeds a corresponding preset proportion, the data corpus comprises samples corresponding to each version of product control system, different samples corresponding to the same version of product control system are distinguished based on the acquisition time, and each sample comprises product performance parameters corresponding to the corresponding version of product control system;
The difference between the maximum value of the variable correlation coefficients of the target candidate subset and the maximum value of the variable correlation coefficients of the data corpus is larger than the corresponding preset difference, and the variable correlation coefficients are used for representing the correlation degree of the product performance parameters and the product performance prediction results;
The difference between the product control software version number corresponding to the target candidate subset and the product control system version number corresponding to the other candidate subset is larger than the corresponding preset difference.
The maximum value of the variable correlation coefficients is obtained as follows: the variable correlation coefficient of each product performance parameter is determined separately, where the coefficient of any product performance parameter may be the maximal information coefficient and/or the Pearson correlation coefficient, and the largest of the variable correlation coefficients of the plurality of product performance parameters is taken as the maximum value of the variable correlation coefficients.
When product performance is predicted by a model constructed from the product performance parameters corresponding to multiple versions of product control software, the influence that changes in certain product performance parameters have on the performance mean is numerically weakened. When the screening condition requires that the difference between the maximum value of the variable correlation coefficients of the target candidate subset and that of the data corpus be larger than the corresponding preset difference, that is, the maximum variable correlation coefficient in the target candidate subset is sufficiently larger than that of the data corpus, taking the target candidate subset as a data subset separates the product performance parameters of different product control system versions that weaken one another's influence. As a result, when product performance is predicted for the product performance parameters corresponding to a certain version of the product control system, the prediction accuracy of the product performance prediction model constructed based on the data subset is significantly higher than that of the model constructed based on the data corpus.
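A minimal sketch of a screening check follows. The dictionary layout, the key names in `limits`, and the reading that every configured condition must hold are assumptions; the version-gap condition is interpreted here as the smallest pairwise difference between version numbers, which the source does not spell out.

```python
def meets_screening(subset, corpus, other_versions, limits):
    """Check the configured screening conditions for one target candidate subset.

    subset / corpus: dicts with "versions" (set of version numbers),
    "n_samples" (int) and "max_corr" (maximum variable correlation coefficient).
    limits: any of "version_ratio", "sample_ratio", "corr_gain", "version_gap".
    """
    checks = []
    if "version_ratio" in limits:
        ratio = len(subset["versions"]) / len(corpus["versions"])
        checks.append(ratio > limits["version_ratio"])
    if "sample_ratio" in limits:
        checks.append(subset["n_samples"] / corpus["n_samples"] > limits["sample_ratio"])
    if "corr_gain" in limits:
        checks.append(subset["max_corr"] - corpus["max_corr"] > limits["corr_gain"])
    if "version_gap" in limits and other_versions:
        gap = min(abs(v - o) for v in subset["versions"] for o in other_versions)
        checks.append(gap > limits["version_gap"])
    # At least one condition must be configured, and all configured ones must hold.
    return bool(checks) and all(checks)
```

With preset proportions of 0.6 and a preset difference of 0.05, a subset covering 4 of 5 versions and 700 of 1000 samples passes when its maximum coefficient exceeds that of the corpus by more than 0.05.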
In one embodiment, after ensuring that the product performance prediction model cannot distinguish well between product performance parameters corresponding to different versions of the product control system, the accuracy of the product performance prediction model still needs to be improved, so the application also proposes a further scheme. The goal of the scheme in the embodiment of fig. 1 is to determine whether the training set can be distinguished from the test set, so the product performance parameters ranked highest in importance are removed; the purpose of the present embodiment is to screen effective modeling variables (the modeling variables refer to the product performance parameters), so the product performance parameter ranked lowest in importance is removed each time. The termination condition of the iteration loop in the embodiment of fig. 1 is that the first model performance score is smaller than the preset threshold, whereas the termination condition of the iteration loop in this embodiment is that the model performance no longer improves. The scheme is as follows:
in one embodiment, referring to fig. 4, step S108 builds a product performance prediction model based on the co-distributed variables to make a product performance prediction, specifically comprising:
Step S401, for each first data set to be processed, determining identical distribution variables from the first data set to be processed to obtain a second data set to be processed, determining a training set in the second data set to be processed, and dividing the training set into training data and verification data;
In this embodiment, for each first data set to be processed, the gateway server eliminates product performance parameters except for the same distribution variables in each sample of the first data set to be processed, and obtains a second data set to be processed.
Because the effective variable screening is only performed on the basis of the sample marked as S1, the gateway server determines a training set from the second data set to be processed, namely, determines the sample marked as S1, and divides the sample marked as S1 into training data and verification data.
Step S402, performing model training based on training data in a second data set to be processed to obtain a second model, and determining importance ranking of various product performance parameters based on training results of the second model;
the gateway server performs model training and tuning based on the training data in the second data set to be processed to obtain the second model, determines the influence degree of each product performance parameter on the model prediction result from the training result, obtains the weight of each product performance parameter based on the influence degree, takes the weight as the importance of the product performance parameter, and ranks the product performance parameters by importance.
In this embodiment, the second model comprises a regression model.
Step S403, verifying the second model based on the verification data in the second data set to be processed to obtain a third model performance score;
the gateway server verifies the second model based on the verification data to obtain a third model performance score, and the third model performance score is used for representing model prediction accuracy.
Step S404, if the variation of the third model performance score exceeds the preset variation range, removing the last preset number of product performance parameters in the ranking from the second data set to be processed based on the importance ranking result of the product performance parameters in the second model, using the second data set to be processed after removal as the new second data set to be processed, returning to the step of performing model training based on training data in the second data set to be processed to obtain the second model, and continuing execution until the variation of the third model performance score does not exceed the preset variation range;
If the variation of the third model performance score exceeds the preset variation range, the performance of the second model is still improving and the second model can be further optimized. In this case, the last preset number of product performance parameters in the ranking are removed from each sample of the training data based on the importance ranking result of the product performance parameters in the second model. The preset number is generally one, that is, the product performance parameter at the end of the ranking is removed.
The gateway server uses the second data set to be processed after removal as the new second data set to be processed, and returns to steps S402 to S404 for cyclic execution until the variation of the third model performance score does not exceed the preset variation range. In this way, the product performance parameters with smaller influence on the model prediction result are removed, so that the second model predicts more accurately, and the product performance prediction model constructed based on the remaining product performance parameters of the final second data set to be processed is more accurate.
Step S405, a product performance prediction model is constructed based on the remaining product performance parameters in the final second set of data to be processed to predict product performance.
The gateway server performs modeling and tuning based on the remaining product performance parameters in the final second data set to be processed, and constructs a corresponding product performance prediction model to predict the product performance.
In this embodiment, product performance parameters with smaller influence on the product performance prediction result in the second to-be-processed data set are removed, and product performance parameters with larger influence on the model prediction result are screened out for modeling, so that accuracy of model prediction can be improved.
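The iteration of steps S402 to S404 can be sketched as a backward elimination loop. This is an illustration under stated assumptions: `fit_and_score` is a hypothetical callback that trains the second model on the given parameters and returns a verification score together with an importance ranking (most important first); "variation" is read as improvement of the score; and the sketch keeps the last parameter set whose score still improved, which the source does not specify.

```python
def screen_modeling_variables(features, fit_and_score, change_range=1e-3, drop_count=1):
    """Repeatedly drop the lowest-ranked parameters while the score keeps improving."""
    current = list(features)
    score, ranking = fit_and_score(current)               # S402 + S403
    while len(current) > drop_count:
        dropped = set(ranking[-drop_count:])               # end of the importance ranking
        trial = [f for f in current if f not in dropped]
        trial_score, trial_ranking = fit_and_score(trial)
        if trial_score - score <= change_range:            # no further improvement: stop
            break
        current, score, ranking = trial, trial_score, trial_ranking
    return current                                         # parameters used in S405

# Toy stand-in for model training (hypothetical importances, not real data).
IMPORTANCE = {"a": 0.6, "b": 0.3, "noise1": 0.0, "noise2": 0.0}

def toy_fit_and_score(features):
    ranking = sorted(features, key=lambda f: -IMPORTANCE[f])
    noise = sum(1 for f in features if IMPORTANCE[f] == 0.0)
    score = sum(IMPORTANCE[f] for f in features) - 0.05 * noise
    return score, ranking
```

With the toy callback, the two noise parameters are removed one at a time, and the loop stops when removing `b` would lower the score.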
In one embodiment, referring to fig. 5, the product performance prediction models are fused to jointly predict the product performance, and the method further includes:
And respectively inputting the samples to be predicted into product performance prediction models respectively constructed for the data total set and the data sub-set to obtain a plurality of prediction results, and taking the optimal value or average value in the plurality of prediction results as the product performance prediction result.
As described above, a product performance prediction model is constructed for the data corpus, and a product performance prediction model is also constructed for each data subset (for example, as shown in fig. 5, product performance prediction models are constructed for data subsets 1 to n respectively, where n is an integer greater than 1). When product performance prediction is performed, the sample to be predicted is input into each product performance prediction model, so that a plurality of product performance prediction results are obtained (for example, product performance prediction results 1 to n+1 shown in fig. 5). If the model construction data (i.e., the final second data to be processed) used to construct a product performance prediction model includes product performance parameters of the same version of the product control system as the sample to be predicted, the optimal value of the plurality of product performance prediction results is taken as the product performance prediction result; otherwise, the average value of the plurality of product performance prediction results is taken as the product performance prediction result.
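The fusion rule can be sketched as follows. The function name and arguments are hypothetical, and because the source does not define what the "optimal value" is, the choice of `max` as the default selector is an assumption that a caller can replace.

```python
def fuse_predictions(predictions, training_versions, sample_version, pick_best=max):
    """Fuse the results of the corpus model and the subset models.

    predictions[i] is the result of model i; training_versions[i] is the set of
    product control system versions present in that model's construction data.
    """
    version_seen = any(sample_version in versions for versions in training_versions)
    if version_seen:
        return pick_best(predictions)              # optimal value (assumed: maximum)
    return sum(predictions) / len(predictions)     # otherwise the average value
```

For instance, with results 0.8, 0.6 and 0.7, a sample whose version appears in some model's construction data yields 0.8, while an unseen version yields the average, about 0.7.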
Referring to fig. 6, the overall scheme of the present application is summarized as follows:
And S601, data acquisition.
Samples corresponding to multiple versions of the product control system are collected. Different samples corresponding to the same version of the product control system are distinguished by acquisition time, and each sample comprises at least one product performance parameter. For example, for a hydrogen fuel cell, control quantity and controllable quantity data are collected for the stack, hydrogen path, air path, cooling path, and electrical path of different cell systems, together with monitoring heartbeat data, abnormality detection data, cell flag bits, and other data. These data are the product performance parameters, and information extraction (using time-frequency domain statistical information extraction) is performed on the collected data to obtain candidate data.
The candidate data comprises samples with performance average value labels, the samples are used as training sets and marked as S1, the candidate data also comprises samples with performance average value to be predicted, and the samples are used as test sets and marked as S2.
Step S602, data preprocessing.
And carrying out data preprocessing on the training set in the candidate data to obtain a data complete set. The test set is also preprocessed.
Step S603, data subset selection.
Selecting a subset of data from the full set of data that satisfies the following condition:
the number of versions of the product control system corresponding to the data subset accounts for 60% of the number of versions of the product control system corresponding to the data corpus;
The number of samples in the data subset is 60% of the number of samples in the full data set;
the difference between the corresponding product control software version number and the product control system version number corresponding to other candidate subsets is greater than 2;
the difference between the maximum value of the variable correlation coefficients of the data subset and the maximum value of the variable correlation coefficients of the data corpus is greater than 0.05;
The data corpus includes only the samples labeled S1, and thus the selected data subset also includes only the samples labeled S1.
Step S604 extracts corresponding co-distributed variables for the data corpus and corresponding co-distributed variables for each data subset.
Referring to fig. 7, the following is embodied:
1) Removing product performance parameters with variable correlation coefficients lower than 1 in the data corpus, and removing product performance parameters with variable correlation coefficients lower than 1 in each data subset (corresponding to the correlation coefficient threshold screening in the figure);
2) For the data corpus, combining the test set (the samples marked as S2) with the data corpus (which includes only the samples marked as S1) to obtain the first data set to be processed for the data corpus;
For each data subset, combining the test set (the samples marked as S2) with the data subset (which includes only the samples marked as S1) to obtain the first data set to be processed for the data subset;
for each first set of data to be processed, the following contents described in 3) to 6) are executed:
3) Randomly extracting model processing data from the first data set to be processed, for example extracting 70% of the data as model processing data, reserving the rest of the data as model prediction data, for example reserving 30% of the data as model prediction data (randomization in the corresponding graph), and performing the following operations on the model processing data:
a) Dividing the model processing data (the 70% of data serving as model processing data includes both samples labeled S1 and samples labeled S2) into training data (which includes both samples labeled S1 and samples labeled S2) and verification data (which also includes both samples labeled S1 and samples labeled S2);
b) Model training and optimizing are carried out by training data (the training data corresponds to a sample 1 to a sample k in the graph), and a first model is obtained;
c) Determining importance degree sequencing of various product performance parameters;
d) Verifying the first model by using verification data to obtain a first model performance score;
The first model is a classification model, and if the first model can accurately divide the sample into the class corresponding to S1 or S2, for example, can divide the sample marked as S1 into the class of S1 and the sample marked as S2 into the class of S2, it indicates that the first model performs well.
e) When the average value of the first model performance scores of the 5 samples is greater than 2, removing the top-ranked product performance parameter from the model processing data, and returning to step b) until the first model performance score is less than 2;
f) Taking the remaining product performance parameters of the model processing data as the same distribution variables. Steps c) to e) correspond to the cross-validation and feature screening performed on each sample in the figure, where feature screening refers to the screening of the same distribution variables, and the first model is verified using test data after cross-validation. Then, based on the determined same distribution variables, the product performance parameters other than the same distribution variables in the first data set to be processed are removed to obtain a new first data set to be processed, which corresponds to candidate subset 1 in the figure.
4) Calculating a second model performance score for the first model using the model test data, i.e., the 30% of data reserved above as model prediction data (corresponding to the test set evaluation of candidate subsets in the figure);
5) Taking the data marked as S1 in the model test data as training data and the data marked as S2 as verification data, and determining the similarity of data space distribution between the training data only containing the same distribution variable and the verification data only containing the same distribution variable by a principal component analysis method (PCA inspection in a corresponding diagram);
6) If the second model performance score and the similarity meet the corresponding conditions, taking the same distribution variables determined in step f) as the final same distribution variables (corresponding to determining the final subset in the figure).
The product performance parameters that do not belong to the same distribution variables are removed from all samples of the first data set to be processed to obtain a second data set to be processed.
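The extraction of same distribution variables in steps b) to f) can be illustrated with a simplified sketch. It is not the application's classifier: a nearest-centroid rule stands in for the first model, the absolute standardized mean difference between S1 and S2 stands in for the importance ranking, the score is computed on the same rows rather than on separate verification data, and the threshold value is a placeholder. All function names are hypothetical.

```python
import math

def separation(values_s1, values_s2):
    """Importance proxy: absolute standardized mean difference between S1 and S2."""
    m1 = sum(values_s1) / len(values_s1)
    m2 = sum(values_s2) / len(values_s2)
    pooled = values_s1 + values_s2
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((v - mean) ** 2 for v in pooled) / len(pooled))
    return 0.0 if std == 0.0 else abs(m1 - m2) / std

def centroid_accuracy(rows_s1, rows_s2, features):
    """Score proxy: accuracy of a nearest-centroid S1-versus-S2 classifier."""
    def centroid(rows):
        return {f: sum(r[f] for r in rows) / len(rows) for f in features}

    def dist(row, center):
        return sum((row[f] - center[f]) ** 2 for f in features)

    c1, c2 = centroid(rows_s1), centroid(rows_s2)
    hits = sum(dist(r, c1) < dist(r, c2) for r in rows_s1)
    hits += sum(dist(r, c2) < dist(r, c1) for r in rows_s2)
    return hits / (len(rows_s1) + len(rows_s2))

def same_distribution_variables(rows_s1, rows_s2, features, threshold=0.7):
    """Drop the top-ranked parameter while S1 and S2 remain distinguishable."""
    features = list(features)
    while features and centroid_accuracy(rows_s1, rows_s2, features) > threshold:
        ranked = sorted(
            features,
            key=lambda f: separation([r[f] for r in rows_s1], [r[f] for r in rows_s2]),
            reverse=True,
        )
        features.remove(ranked[0])
    return features
```

Parameters that survive the loop are those on which the training samples (S1) and test samples (S2) can no longer be told apart by the stand-in classifier.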
Step S605, product performance parameters are screened for the data corpus, and product performance parameters are also screened for each data subset.
For the data corpus, the product performance parameters with higher importance ranking are selected from the second data set to be processed corresponding to the data corpus; for each data subset, the product performance parameters with higher importance ranking are selected from the second data set to be processed corresponding to the data subset. For the specific screening manner, reference is made to the above embodiment, and details are not repeated here.
Step S606, a product performance prediction model is built for the data corpus and each data subset.
Taking the remaining product performance parameters in the second to-be-processed data set corresponding to the data corpus as training data of a product performance prediction model, training the model and optimizing by adopting a random forest algorithm, and constructing the product performance prediction model corresponding to the data corpus;
aiming at each data subset, taking the remaining product performance parameters in the second to-be-processed data set corresponding to the data subset as training data of a product performance prediction model, adopting a random forest algorithm training model and tuning, and constructing the product performance prediction model corresponding to the data subset;
Step S607, model fusion is performed to determine a product performance prediction result.
The sample to be predicted is input into the product performance prediction model corresponding to the data corpus and into the product performance prediction models corresponding to the data subsets. If there are one or more product performance prediction models whose training data corresponds to the same product control system version number as the sample to be predicted, the optimal value of the product performance prediction results is used as the product performance prediction result; otherwise, the average value of the product performance prediction results is used as the product performance prediction result.
It should be noted that conventional product performance prediction is usually performed by a server, which is relatively inefficient and places high demands on network bandwidth, easily causing stuttering, high latency, and low efficiency. The application can deploy the product performance analysis module at an edge computing gateway and transfer the mass data processing and computation originally handled by the server to the edge computing gateway, thereby avoiding the stuttering, high latency, and low efficiency caused by the server processing mass data. Of course, the application does not exclude deploying the product performance analysis module on the server; such a deployment merely suffers from processing stutter and delay compared with deployment on the edge computing gateway. Specifically:
The structure of the product performance prediction analysis module is as shown in the figure. Key data is protected and data security is ensured through identity authentication, encrypted transmission, access control, trusted computing, and the like. Time-series data acquisition is carried out through communication protocols, interface modules, and driver protocols. Data is stored in a relational database and a time-series database. Data processing includes data preprocessing, version subset selection, corpus and subset data extraction, and characteristic variable extraction. Edge inference is performed after data processing and includes algorithm model construction, model tuning, and model fusion, followed by visual display. The product performance prediction analysis module further comprises a hardware module, which includes a WiFi, 4G and/or 5G module, an Ethernet module, and a serial port module. The serial port module is connected to serial port devices, the Ethernet module is connected to Ethernet interface devices, and the WiFi, 4G and/or 5G module communicates with an early-warning terminal, which includes a mobile terminal and a personal computer (Personal Computer, PC).
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a product performance prediction model construction device for realizing the above related product performance prediction model construction method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation in the embodiments of the apparatus for constructing a product performance prediction model provided below may be referred to the limitation of the method for constructing a product performance prediction model hereinabove, and will not be repeated herein.
In one embodiment, as shown in FIG. 9, there is provided a product performance prediction model construction apparatus 900, including:
a corpus acquisition module 901, configured to acquire a data corpus, where the data corpus is a training set in a collected data set, and the collected data set includes product performance parameters respectively corresponding to multiple versions of a product control system;
a subset obtaining module 902, configured to obtain at least one data subset based on the data corpus, and combine the data corpus and each data subset respectively with a test set in the collected data set to obtain first data sets to be processed;
a selecting module 903, configured to, for each first data set to be processed, select data from the first data set to be processed as model processing data, and divide the model processing data into training data and verification data;
a first model building module 904, configured to perform model training based on the training data in all model processing data to obtain a first model, and determine an importance ranking of the product performance parameters based on the training result, where the importance characterizes the degree of influence of a product performance parameter on the model prediction result;
a model verification module 905, configured to verify the first model based on the verification data in all model processing data to obtain a first model performance score, where the first model performance score characterizes model prediction accuracy;
a loop module 906, configured to, if the first model performance score is greater than a preset threshold, remove the top-ranked preset number of product performance parameters from the model processing data based on the importance ranking result, take the model processing data after removal as the model processing data again, return to the step of performing model training based on the training data in all model processing data to obtain a first model, and continue until the first model performance score is less than the preset threshold;
a second model building module 907, configured to take the product performance parameters remaining in the final model processing data as same-distribution variables, and construct a product performance prediction model based on the same-distribution variables so as to perform product performance prediction.
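By way of illustration only, the cooperation of modules 904 to 907 can be sketched as a simple elimination loop. The sketch below is a minimal example under stated assumptions, not the implementation of the application: the callable `train_and_score` is a hypothetical stand-in for the training and verification performed by modules 904 and 905, the parameter names are invented, and stopping when the score equals the threshold is an assumption, since the description only covers the cases above and below the threshold.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: trains the first model on the given parameters and
# returns (first model performance score, importance of each parameter).
TrainAndScore = Callable[[Sequence[str]], Tuple[float, Dict[str, float]]]


def select_same_distribution_variables(
    parameters: Sequence[str],
    train_and_score: TrainAndScore,
    threshold: float,
    drop_count: int = 1,
) -> List[str]:
    """Remove top-ranked parameters while the score stays above the threshold."""
    remaining = list(parameters)
    while remaining:
        score, importance = train_and_score(remaining)
        if score <= threshold:
            break  # what is left is treated as the same-distribution variables
        # Rank by importance, most influential first, and drop the top items.
        ranked = sorted(remaining, key=lambda p: importance[p], reverse=True)
        to_drop = set(ranked[:drop_count])
        remaining = [p for p in remaining if p not in to_drop]
    return remaining


def toy_train_and_score(params: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    # Toy stand-in (assumption): fixed importances, score = largest importance.
    toy_importance = {"cpu_load": 0.9, "latency": 0.7, "memory": 0.3, "temperature": 0.2}
    importance = {p: toy_importance[p] for p in params}
    return max(importance.values()), importance


selected = select_same_distribution_variables(
    ["cpu_load", "latency", "memory", "temperature"],
    toy_train_and_score,
    threshold=0.5,
)
print(selected)
```

In this toy run the two most influential parameters are removed in successive rounds, after which the score falls to the threshold or below and the loop ends. In practice `train_and_score` would wrap whatever model and scoring metric the embodiment uses.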
In one embodiment, the second model building module 907 is specifically configured to: select data from the first data set to be processed as model test data; determine, from the model test data, the product performance parameters corresponding to the same-distribution variables; test the first model based on those product performance parameters to obtain a second model performance score; determine spatial distribution information of those product performance parameters through principal component analysis, and determine a model performance score of the first model based on the spatial distribution information; and, if the second model performance score meets a corresponding condition and/or the model performance score determined based on the spatial distribution information meets a corresponding condition, construct the product performance prediction model based on the same-distribution variables so as to perform product performance prediction.
In one embodiment, the apparatus further includes a deleting module, configured to delete, before the first model building module 904 performs model training based on the training data in all model processing data to obtain a first model, the product performance parameters whose variable correlation coefficients are lower than a threshold from the first data set to be processed, where the variable correlation coefficient characterizes the degree of correlation between each product performance parameter and the product performance prediction result.
In one embodiment, the subset obtaining module 902 is specifically configured to obtain a plurality of candidate subsets based on the data corpus, and determine, from the plurality of candidate subsets, a target candidate subset satisfying a screening condition as the data subset.
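As a non-limiting illustration of the deleting module, the following sketch filters parameters by a correlation coefficient. It assumes, for the example only, that the variable correlation coefficient is the Pearson coefficient between a parameter column and the observed product performance values, and that its absolute value is compared with the threshold; the description fixes neither choice. The column names and data are invented.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_weakly_correlated(
    columns: Dict[str, List[float]],
    target: List[float],
    threshold: float,
) -> Dict[str, List[float]]:
    """Keep only the parameters whose |correlation| with the target reaches the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    }


# Hypothetical data: "latency" tracks the target exactly, "noise" does not.
columns = {"latency": [1.0, 2.0, 3.0, 4.0], "noise": [1.0, -1.0, -1.0, 1.0]}
target = [2.0, 4.0, 6.0, 8.0]
kept = drop_weakly_correlated(columns, target, threshold=0.3)
print(list(kept))
```

Filtering before training reduces the number of parameters that the first model has to rank in each round.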
In one embodiment, the screening condition includes at least one of the following: the proportion of the number of product control system versions corresponding to the target candidate subset to the number of product control system versions corresponding to the data corpus exceeds a corresponding preset proportion; the proportion of the number of samples contained in the target candidate subset to the number of samples contained in the data corpus exceeds a corresponding preset proportion; the difference between the maximum variable correlation coefficient of the target candidate subset and the maximum variable correlation coefficient of the data corpus is greater than a corresponding preset difference; and the difference between the number of product control system versions corresponding to the target candidate subset and the number of product control system versions corresponding to the other candidate subsets is greater than a corresponding preset difference. Here, the data corpus includes samples corresponding to each version of the product control system, different samples corresponding to the same version of the product control system are distinguished by collection time, and each sample includes the product performance parameters of the corresponding version of the product control system; the variable correlation coefficient characterizes the degree of correlation between a product performance parameter and the product performance prediction result.
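For illustration, the first two screening conditions (version proportion and sample proportion) can be evaluated as in the sketch below. The sample layout, a pair of version label and collection time, and all names and values are assumptions made for the example. The sketch reports each condition separately because the description does not specify how several conditions are combined.

```python
from typing import Dict, List, Tuple

# Hypothetical sample layout: (control system version, collection time).
Sample = Tuple[str, str]


def screening_checks(
    subset: List[Sample],
    corpus: List[Sample],
    min_version_ratio: float,
    min_sample_ratio: float,
) -> Dict[str, bool]:
    """Evaluate the version-proportion and sample-proportion conditions."""
    subset_versions = {version for version, _ in subset}
    corpus_versions = {version for version, _ in corpus}
    version_ratio = len(subset_versions) / len(corpus_versions)
    sample_ratio = len(subset) / len(corpus)
    return {
        "version_ratio_ok": version_ratio > min_version_ratio,
        "sample_ratio_ok": sample_ratio > min_sample_ratio,
    }


# Toy corpus: three versions, four samples distinguished by collection time.
corpus = [("v1", "t1"), ("v1", "t2"), ("v2", "t1"), ("v3", "t1")]
candidate = [("v1", "t1"), ("v2", "t1")]
checks = screening_checks(candidate, corpus, min_version_ratio=0.5, min_sample_ratio=0.5)
print(checks)
```

Here the candidate covers two of three versions, which exceeds the preset proportion of 0.5, but contains exactly half of the samples, which does not.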
In one embodiment, the second model building module 907 is specifically configured to: for each first data set to be processed, determine the same-distribution variables from the first data set to be processed to obtain a second data set to be processed; determine a training set in the second data set to be processed, and divide the training set into training data and verification data; perform model training based on the training data in the second data set to be processed to obtain a second model, and determine an importance ranking of the product performance parameters based on the training result of the second model; verify the second model based on the verification data in the second data set to be processed to obtain a third model performance score; if the change in the third model performance score exceeds a preset change range, remove the last-ranked preset number of product performance parameters from the second data set to be processed, take the second data set after removal as the second data set to be processed again, return to the step of performing model training based on the training data in the second data set to be processed to obtain a second model, and continue until that condition is no longer met; and construct the product performance prediction model based on the remaining product performance parameters.
In one embodiment, the apparatus further includes a model prediction module, configured to input a sample to be predicted into the product performance prediction models respectively constructed for the data corpus and each data subset to obtain a plurality of prediction results, and take the optimal value or the average value of the plurality of prediction results as the product performance prediction result.
Each of the modules in the product performance prediction model construction apparatus described above may be implemented in whole or in part by software, by hardware, or by a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
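As a minimal sketch of the model prediction module, the example below combines the outputs of several models. The models are represented by hypothetical callables, and the reading of "optimal value" as the largest prediction is an assumption; where a smaller value is better, `min` can be passed instead.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical model interface: maps one sample to a predicted performance value.
Model = Callable[[Dict[str, float]], float]


def combine_predictions(
    predictions: Sequence[float],
    mode: str = "average",
    better: Callable[[Sequence[float]], float] = max,
) -> float:
    """Return the average or the optimal value of the individual predictions."""
    if mode == "average":
        return mean(predictions)
    if mode == "optimal":
        return better(predictions)  # assumption: larger is better unless min is passed
    raise ValueError(f"unknown mode: {mode}")


# Toy models standing in for those built on the data corpus and two data subsets.
models: List[Model] = [lambda s: 80.0, lambda s: 60.0, lambda s: 70.0]
sample = {"latency": 1.5}
predictions = [model(sample) for model in models]

average_result = combine_predictions(predictions, mode="average")
optimal_result = combine_predictions(predictions, mode="optimal")
print(average_result, optimal_result)
```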
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in FIG. 10. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store product performance parameters. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a product performance prediction model construction method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Those skilled in the art will appreciate that all or part of the flows of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the flows of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric random access memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. The volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic device, or a data processing logic device based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this description.
The above embodiments illustrate only a few implementations of the application and are described specifically and in detail, but they should not thereby be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.