CN117494877A

CN117494877A - Forecasting method of electric meter installation based on cluster analysis

Info

Publication number: CN117494877A
Application number: CN202311425732.0A
Authority: CN
Inventors: 赖国书; 张荔鹃; 叶强; 周厚源; 王姣; 洪巧文; 曹舒; 曾清娟; 杨涵脂; 胡敏贤; 林素存
Original assignee: State Grid Fujian Electric Power Co Ltd; Marketing Service Center of State Grid Fujian Electric Power Co Ltd
Current assignee: State Grid Fujian Electric Power Co Ltd; Marketing Service Center of State Grid Fujian Electric Power Co Ltd
Priority date: 2023-10-31
Filing date: 2023-10-31
Publication date: 2024-02-02

Abstract

The invention relates to a method for predicting electric meter installation quantities based on cluster analysis, comprising the following steps: preprocessing original data that comprises a plurality of time series, dividing it into a training set and a prediction set according to the prediction requirement, and making the time series in the training set equal in length; determining the optimal number of clusters using the elbow rule; selecting initial cluster centers with the K-means++ algorithm to improve the quality of the clustering result; iterating K-means clustering from the determined number of clusters and the initial cluster centers to obtain the cluster centers and the classification result; selecting, according to the classification result, the optimal ARIMA model order for the cluster center of each class; for the time series of each class, building an ARIMA model for each time series with the ARIMA order selected for that class to obtain a prediction for each time series; evaluating the prediction performance of the models; and then predicting the electric meter installation quantity with the models whose prediction performance meets the standard. The method helps predict future electric meter installation quantities efficiently and accurately.

Description

Electric meter installation quantity prediction method based on cluster analysis

Technical Field

The invention relates to the technical field of power data processing, and in particular to a method for predicting electric meter installation quantities based on cluster analysis.

Background

Predicting the electric meter installation quantity is of great importance to planning, operation, policy making and business decisions in the power industry. Such predictions provide key information about future electric meter demand, help each party make reasonable decisions, and promote the sustainable development and efficiency of the power system. In particular, they can guide the purchasing and deployment strategies for smart electric meters. Smart meters provide data acquisition, remote monitoring, regulation and control functions, and can improve the monitoring and management of the power grid. Accurate prediction of the installation quantity helps determine purchasing plans and deployment strategies and ensures that the coverage and number of smart meters meet requirements.

Electric meter installation is predicted from historical data, and many time series prediction methods exist in both industry and academia. Traditional time series methods include ARIMA and GARCH models; machine learning methods include random forests and GBDT; deep neural networks include CNN and LSTM. Although all of these methods are suitable for predicting electric meter installation quantities, the original data are divided by many indexes, such as region, meter type and installation process, so the original data contain a large number of time series. Building a model for each time series requires a large amount of computation. Moreover, if the models need to be corrected later, it is difficult to correct each time series individually in a targeted way.

Disclosure of Invention

The invention aims to provide a method for predicting electric meter installation quantities based on cluster analysis that helps predict future electric meter installation quantities efficiently and accurately.

To achieve this purpose, the invention adopts the following technical scheme: a method for predicting electric meter installation quantities based on cluster analysis, comprising the following steps:

S1, preprocessing original data comprising a plurality of time series, dividing it into a training set and a prediction set according to the prediction requirement, and making the time series in the training set equal in length;

S2, determining the optimal number of clusters using the elbow rule;

S3, selecting initial cluster centers with the K-means++ algorithm to improve the quality of the clustering result;

S4, iterating K-means clustering from the determined number of clusters and the initial cluster centers to obtain the cluster centers and the classification result;

S5, selecting, according to the classification result, the optimal ARIMA model order for the cluster center of each class;

S6, for the time series of each class, building an ARIMA model for each time series with the ARIMA order selected for that class to obtain a prediction for each time series;

S7, evaluating the prediction performance of the models, and then predicting the electric meter installation quantity with the models whose prediction performance meets the standard.

Further, in step S1, the training set and the prediction set are divided according to the prediction requirement, specifically: if the electricity meter installation amount of each time series is to be predicted for the last n months, the data except the last n months are taken as a training set, and the data of the last n months are taken as a prediction set.

Further, in step S1, if the time series differ in length, they are brought to the same length by interpolation or truncation, specifically: if a time series records only the data from the first installation to the last installation of the electric meter, the installation quantity at the earlier unrecorded time points can be recorded as 0; if it cannot be determined whether the installation quantity at the earlier unrecorded time points is 0, all time series may be truncated to the length of the shortest time series, with the truncation removing the initial part of each series.

Further, in step S2, the elbow rule is used to determine the optimal number of clusters, specifically: the sum of squared clustering errors (SSE) is calculated for different numbers of clusters; once the number of clusters has increased to a certain point, the rate at which the SSE decreases slows abruptly, forming an "elbow", and the number of clusters at the elbow is the optimal number of clusters.
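For reference, the SSE used by the elbow rule can be written as follows. The symbols are not in the original text and are assumed here for illustration: $C_k$ is the set of time series assigned to cluster $k$, $\mu_k$ is the center of that cluster, and the norm is the Euclidean distance named in the detailed description.

```latex
\mathrm{SSE}(K) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2}
```

Since $\mathrm{SSE}(K)$ generally decreases as $K$ grows, the rule looks for the point where the decrease flattens rather than for a minimum.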

Further, in step S3, an initial clustering center is selected by adopting a K-means++ algorithm, and the specific method is as follows:

A1, selecting the first cluster center: randomly select one time series from the training set as the first cluster center;

A2, calculating the distance-weighted probability: for each time series, calculate the distance between the time series and its nearest already selected cluster center, and take the square of that distance as a weight; then, from the weights, calculate the probability of each time series being chosen as the next cluster center; a time series farther from the selected cluster centers has a higher probability of becoming the next cluster center;

A3, selecting the next cluster center: select the next cluster center according to the calculated distance-weighted probability distribution;

A4, repeating steps A2 and A3 until K cluster centers have been selected.
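The distance-weighted probability of step A2 can be stated compactly. The notation is assumed for illustration and does not appear in the original: $x_1, \dots, x_M$ are the training time series and $D(x_i)$ is the distance from $x_i$ to the nearest cluster center selected so far.

```latex
P(x_i) = \frac{D(x_i)^{2}}{\sum_{m=1}^{M} D(x_m)^{2}}
```

A series that has already been selected has $D(x_i) = 0$ and therefore cannot be selected again.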

Further, in step S4, iterative K-means clustering is performed, and the specific method is as follows:

B1, assigning data points to the nearest cluster center: calculate the distance between each time series and the cluster centers selected in step S3, and assign each time series to the nearest cluster center;

B2, updating the cluster centers: for each cluster, calculate the mean of all time series in the cluster and take that mean as the new cluster center;

B3, repeating steps B1 and B2, that is, repeating the assignment of time series and the update of the cluster centers, until a stopping condition is reached; the stopping condition is that the maximum number of iterations has been reached or the cluster centers no longer change.

Further, in step S5, the order of the optimal ARIMA model is selected for each class of cluster center, and is used as the order of the optimal ARIMA model of the corresponding class of time series, and the specific method is as follows:

Determining the order of the ARIMA model means selecting an appropriate autoregressive order p, differencing order d, and moving-average order q; for the differencing order d, a stationarity test is applied to the cluster center, and if the test is not passed, d is increased until the differenced series passes it; the orders p and q are determined from the autocorrelation function and the partial autocorrelation function: p is determined by the cutoff lag of the partial autocorrelation function plot, and q by the cutoff lag of the autocorrelation function plot.

Further, in step S5, if the orders p and q cannot be determined from the autocorrelation and partial autocorrelation function plots, a subset selection algorithm is used to select appropriate p and q, specifically: different combinations of p, d and q are tried with the Bayesian information criterion (BIC) as the evaluation criterion, and the model with the smallest BIC value is selected as the best model.
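For orientation, the standard definition of the BIC is given below. The symbols are assumptions for illustration, not taken from the original: $k$ is the number of estimated parameters, $N$ is the number of observations used in fitting, and $\hat{L}$ is the maximized likelihood of the candidate model.

```latex
\mathrm{BIC} = k \ln N - 2 \ln \hat{L}
```

The first term penalizes larger models, so a higher-order candidate is preferred only when its likelihood gain outweighs the penalty.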

Further, in step S6, a corresponding ARIMA model is built for each time series according to the order of the ARIMA model used for each time series determined in step S5; and estimating parameter values of the ARIMA model by using maximum likelihood estimation so as to maximize the fitting degree of the model to the observed data, thereby obtaining the predicted data of the last n months of each time sequence.
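The model being fitted can be written in the usual backshift notation. This is the textbook ARIMA(p, d, q) form and is given here only as a reference. The symbols $y_t$ (the installation quantity in month $t$), $B$ (the backshift operator, $B y_t = y_{t-1}$) and $\varepsilon_t$ (white noise) are assumptions that do not appear in the original.

```latex
\Bigl(1 - \sum_{i=1}^{p} \phi_i B^{i}\Bigr)(1 - B)^{d}\, y_t
  = \Bigl(1 + \sum_{i=1}^{q} \theta_i B^{i}\Bigr)\varepsilon_t
```

Maximum likelihood estimation chooses the coefficients $\phi_i$ and $\theta_i$ (and the noise variance) for the fixed orders p, d and q taken from step S5.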

Further, in step S7, the average relative error $\mathrm{error}_j$ is used to estimate the prediction error of the model:

$$\mathrm{error}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_{act}(i,j) - y_{pred}(i,j) \right|}{y_{act}(i,j)}, \quad j = 1, \dots, J$$

where $y_{act}(i,j)$ is the actual electric meter installation quantity in the i-th month of the j-th time series in the prediction set, $y_{pred}(i,j)$ is the predicted value for the i-th month of the j-th time series obtained by the cluster-analysis-based prediction method, n is the total number of predicted months, and J is the total number of time series; $\mathrm{error}_j$ reflects the prediction performance on the j-th time series.
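A minimal sketch of this error measure for a single time series follows. It assumes that the relative error of each month is the absolute deviation divided by the actual value, and it rejects months whose actual value is 0, since the relative error is undefined there. The function name is hypothetical.

```python
from typing import Sequence


def average_relative_error(actual: Sequence[float],
                           predicted: Sequence[float]) -> float:
    """Mean of |actual - predicted| / |actual| over the predicted months.

    Assumption: the per-month relative error uses the absolute deviation.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("relative error is undefined when an actual value is 0")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return total / len(actual)
```

Calling the function once per time series j yields one error value per series, which can then be compared against whatever threshold is taken as the standard.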

Compared with the prior art, the invention has the following beneficial effects: unlike the traditional approach of directly building an ARIMA model for each time series, the cluster-analysis-based method handles a large number of time series by first applying K-means clustering. Because each cluster center reflects the data characteristics of its class, an appropriate ARIMA order needs to be selected only for the cluster center, and that order then suits the time series of the same class. If the ARIMA order were selected directly for every time series without classification, the task would be difficult whatever method is used. Although the R language provides functions that select the ARIMA order automatically, the orders selected by such functions give poor predictions. The method can therefore predict the future installation quantities of the various kinds of electric meters efficiently and accurately from their historical installation data, and has strong practicability and wide application prospects.

Drawings

FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

As shown in fig. 1, the present embodiment provides a method for predicting an installation amount of an electric meter based on cluster analysis, including:

S1, carrying out data preprocessing on original data comprising a plurality of time sequences, dividing a training set and a prediction set according to prediction requirements, and enabling the time sequences in the training set to have the same length.

S2, determining the optimal clustering quantity by adopting an elbow rule.

S3, adopting a K-means++ algorithm to select an initial clustering center so as to improve the quality of a clustering result.

S4, iterating K-means clustering according to the determined clustering quantity and the initial clustering center to obtain a clustering center and a classification result.

S5, respectively selecting the optimal ARIMA model order for each class of clustering centers according to the obtained classification result.

S6, for the same kind of time sequences, respectively establishing an ARIMA model for each time sequence according to the selected corresponding ARIMA order to obtain a prediction result of each time sequence.

In step S1, the training set and the prediction set are first divided according to the prediction requirement, specifically: if the electric meter installation quantity of each time series is to be predicted for the last n months, the data other than the last n months are taken as the training set and the data of the last n months as the prediction set. The training set contains many time series, and they are not all of the same length, whereas K-means clustering requires time series of uniform length. So that the time series in the training set can later be classified by K-means clustering, they are brought to the same length by interpolation or truncation. If a time series records only the data from the first installation to the last installation of the electric meter, the installation quantity at the earlier unrecorded time points can be recorded as 0. If it cannot be determined whether the installation quantity at the earlier unrecorded time points is 0, all time series may be truncated to the length of the shortest time series, with the truncation removing the initial part of each series. In addition, if only a few time series have inconsistent lengths, an ARIMA model can be built separately for each of them, and the remaining time series of consistent length can be predicted with the cluster-analysis-based method.
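The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the sketch assumes that all series end in the same month, so padding and truncation act on the start of each series.

```python
from typing import Dict, List, Tuple


def split_last_n(series: List[float], n: int) -> Tuple[List[float], List[float]]:
    """Return (training part, held-out last n months) of one series."""
    if not 0 < n < len(series):
        raise ValueError("n must be between 1 and len(series) - 1")
    return series[:-n], series[-n:]


def left_pad_zeros(series: List[float], length: int) -> List[float]:
    """Record the months before the first recorded installation as 0."""
    return [0.0] * (length - len(series)) + list(series)


def truncate_start(series: List[float], length: int) -> List[float]:
    """Drop the earliest months so that only the last `length` months remain."""
    return list(series[-length:])


def equalize(train: Dict[str, List[float]], pad: bool) -> Dict[str, List[float]]:
    """Give every training series the same length.

    pad=True  : zero-pad to the longest series (unrecorded months known to be 0).
    pad=False : truncate to the shortest series (unrecorded months unknown).
    """
    lengths = [len(s) for s in train.values()]
    if pad:
        target = max(lengths)
        return {key: left_pad_zeros(s, target) for key, s in train.items()}
    target = min(lengths)
    return {key: truncate_start(s, target) for key, s in train.items()}
```

Splitting before equalizing keeps the held-out months untouched, so padding or truncation affects only the data used for clustering and fitting.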

In step S2, before K-means clustering can be used, the number of classes into which the time series in the training set are divided, i.e. the number of clusters K, must be determined. In this embodiment, the elbow rule is used to determine the optimal number of clusters, specifically: the sum of squared clustering errors (SSE) is calculated for different numbers of clusters. The SSE is the sum of the squared distances between each data point and the center of the class to which it belongs. As the number of clusters increases, the SSE decreases, but the rate of decrease gradually slows. Once the number of clusters has increased to a certain point, the rate of decrease of the SSE slows sharply, forming an "elbow". The "elbow" is the key to the elbow rule and marks the optimal number of clusters. In practice, the number of clusters at the "elbow" can be identified visually by plotting the SSE against the number of clusters K. Determining the optimal number of clusters with the elbow rule is simple, requires no prior knowledge, and is quick.
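A small sketch of the two calculations involved is given below. The SSE function follows the definition above. The automatic elbow pick is an assumption added for illustration: it replaces the visual inspection described in the text with a simple rule that chooses the K where the SSE curve bends most sharply (largest second difference). All names are hypothetical.

```python
from typing import Dict, Sequence


def sse(series: Sequence[Sequence[float]],
        centers: Sequence[Sequence[float]],
        labels: Sequence[int]) -> float:
    """Sum of squared Euclidean distances from each series to its assigned center."""
    total = 0.0
    for x, label in zip(series, labels):
        center = centers[label]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return total


def elbow_k(sse_by_k: Dict[int, float]) -> int:
    """Pick the K with the largest drop-off in SSE improvement.

    Assumption: the bend is measured as
    (SSE[k-1] - SSE[k]) - (SSE[k] - SSE[k+1]); the source only describes
    reading the elbow off a plot.
    """
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("need SSE values for at least three values of K")
    best_k, best_bend = ks[1], float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

In use, `sse` would be evaluated after clustering the training set for each candidate K, and the resulting dictionary passed to `elbow_k`; plotting the same values remains a useful check on the automatic choice.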

The traditional K-means clustering algorithm selects the initial cluster centers at random, which makes the algorithm sensitive to the initial values and to outliers. To address this, in step S3 the initial cluster centers are selected with the K-means++ algorithm, which chooses the initial centers so that they tend to lie far apart from one another. The specific method is as follows:

A1, selecting the first cluster center: randomly select one time series from the training set as the first cluster center.

A2, calculating the distance-weighted probability: for each time series, calculate the distance between the time series and its nearest already selected cluster center, and take the square of that distance as a weight; then, from the weights, calculate the probability of each time series being chosen as the next cluster center; a time series farther from the selected cluster centers has a higher probability of becoming the next cluster center.

A3, selecting the next cluster center: select the next cluster center according to the calculated distance-weighted probability distribution.

A4, repeating steps A2 and A3 until K cluster centers have been selected.
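Steps A1 to A4 can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the patent's code: each time series is a list of equal length, the distance is Euclidean, the weight uses the distance to the nearest center chosen so far, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(series: Sequence[Sequence[float]], k: int,
                   rng: random.Random) -> List[List[float]]:
    """Choose k initial cluster centers following steps A1-A4."""
    if not 1 <= k <= len(series):
        raise ValueError("k must be between 1 and the number of series")
    # A1: the first center is drawn uniformly at random.
    centers = [list(rng.choice(series))]
    while len(centers) < k:
        # A2: weight of each series = squared distance to its nearest chosen center.
        weights = [min(sq_dist(x, c) for c in centers) for x in series]
        if sum(weights) == 0:
            # Every series coincides with a chosen center; fall back to a uniform draw.
            nxt = rng.choice(series)
        else:
            # A3: draw the next center with probability proportional to the weights.
            nxt = rng.choices(series, weights=weights, k=1)[0]
        centers.append(list(nxt))  # A4: repeat until k centers are chosen.
    return centers
```

Passing a seeded `random.Random` instance makes the initialization reproducible, which helps when SSE values are compared across runs for the elbow rule.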

The initial clustering center is intelligently selected by the method, so that the distribution of the data set can be better represented, and the quality of a clustering result is improved. The method can reduce the risk of the K-means algorithm falling into a local optimal solution, and is better than the traditional method for randomly selecting the initial clustering center in most cases.

Iterative K-means clustering is a distance-based clustering algorithm that optimizes the clustering result by iteratively updating the cluster centers and reassigning data points. The distance chosen here is the Euclidean distance. In step S4, iterative K-means clustering is carried out, and the specific method is as follows:

B1, assigning data points to the nearest cluster center: calculate the distance between each time series and the cluster centers selected in step S3, and assign each time series to the nearest cluster center.

B2, updating the cluster centers: for each cluster, calculate the mean of all time series in the cluster and take that mean as the new cluster center.

B3, repeating steps B1 and B2 until a stopping condition is reached; the stopping condition is that the maximum number of iterations has been reached or the cluster centers no longer change.
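The iteration can be sketched as follows, using the Euclidean distance named above and the stopping rule stated in the summary of the invention (maximum number of iterations reached, or centers unchanged). The names are hypothetical, and keeping the old center for an empty cluster is an assumption that the source does not address.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; it orders neighbours the same way as the distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_series(members: Sequence[Sequence[float]]) -> List[float]:
    """Point-wise mean of equal-length series."""
    count = len(members)
    return [sum(column) / count for column in zip(*members)]


def kmeans(series: Sequence[Sequence[float]],
           init_centers: Sequence[Sequence[float]],
           max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Alternate assignment (B1) and center update (B2) until nothing changes."""
    centers = [list(c) for c in init_centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # B1: assign every series to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
                  for x in series]
        # B2: recompute each center as the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, label in zip(series, labels) if label == j]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(mean_series(members) if members else old)
        if new_centers == centers:  # stopping condition: centers unchanged
            break
        centers = new_centers
    return centers, labels
```

The returned centers are themselves time series of the common length, which is what allows an ARIMA order to be chosen on them in step S5.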

The training set is divided into K classes by steps S1-S4. The cluster centers may be used to explain and describe the cluster features, so that in step S5, the order of the optimal ARIMA model is selected for each cluster center, respectively, as the order of the optimal ARIMA model for the time series of the corresponding classes. The specific method comprises the following steps:

Determining the order of the ARIMA model means selecting an appropriate autoregressive order p, differencing order d, and moving-average order q. For the differencing order d, a stationarity test is applied to the cluster center, and if the test is not passed, d is increased until the differenced series passes it. In practice, however, d generally does not exceed 2. The orders p and q are determined from the autocorrelation function and the partial autocorrelation function by inspecting their plots for the data: p is determined by the cutoff lag of the partial autocorrelation function plot, and q by the cutoff lag of the autocorrelation function plot. If p and q cannot be determined from these plots, a subset selection algorithm is used to select appropriate p and q. The algorithm uses the Bayesian information criterion (BIC) as the evaluation criterion: it tries different combinations of p, d and q and selects the model with the smallest BIC value as the best model. The combinations are tried automatically, without manual experiments and evaluation. Because the BIC weighs model complexity against goodness of fit, it helps avoid over-fitting and favors a simpler model that still fits well. Furthermore, the results of the subset selection algorithm can usually be visualized, which makes the selection process and its outcome easier to understand.
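The plot-based part of this procedure can be approximated numerically. The sketch below computes the sample autocorrelation function, obtains the partial autocorrelation function from it with the Durbin-Levinson recursion, and reads off a cutoff lag. The cutoff rule (the last lag outside the approximate 95% band of plus or minus 1.96 divided by the square root of the sample size) is an assumption standing in for visual inspection, the stationarity test is not implemented, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def difference(x: Sequence[float], d: int = 1) -> List[float]:
    """Apply first differencing d times."""
    out = list(x)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def acf(x: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(v * v for v in dev)
    if c0 == 0:
        raise ValueError("a constant series has no autocorrelation")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def pacf(x: Sequence[float], max_lag: int) -> List[float]:
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(x, max_lag)
    result = [1.0]
    phi_prev: List[float] = []
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


def cutoff_lag(values: Sequence[float], n_obs: int) -> int:
    """Last lag >= 1 whose value lies outside +/- 1.96 / sqrt(n_obs); 0 if none."""
    bound = 1.96 / math.sqrt(n_obs)
    significant = [k for k, v in enumerate(values) if k >= 1 and abs(v) > bound]
    return max(significant) if significant else 0
```

Under these assumptions, a candidate q would be `cutoff_lag(acf(z, m), len(z))` and a candidate p would be `cutoff_lag(pacf(z, m), len(z))`, where `z` is the differenced cluster center and `m` a chosen maximum lag. When the two readings are ambiguous, the BIC-based search described above takes over.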

In step S6, an ARIMA model is built for each time series using the order determined for it in step S5 (i.e., the same order as the ARIMA model of the cluster center of its class). The parameter values of the ARIMA model are estimated by maximum likelihood estimation, which maximizes the fit of the model to the observed data, and the fitted model yields the predicted data for the last n months of each time series.
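To make the fit-then-forecast step concrete, the sketch below handles only the simplest non-trivial case, ARIMA(1, 1, 0) without a constant, which is a deliberate simplification of the general model. For this case the conditional maximum likelihood estimate under Gaussian errors coincides with least squares on the differenced series. A general ARIMA(p, d, q) fit with exact maximum likelihood would normally rely on a statistics library. The function names are hypothetical.

```python
from typing import List, Sequence


def fit_ar1(z: Sequence[float]) -> float:
    """Conditional Gaussian MLE of phi in z[t] = phi * z[t-1] + e[t].

    For this model the estimate equals the least-squares slope through the origin.
    """
    num = sum(cur * prev for cur, prev in zip(z[1:], z[:-1]))
    den = sum(prev * prev for prev in z[:-1])
    if den == 0:
        raise ValueError("differenced series is all zeros; phi is not identifiable")
    return num / den


def forecast_arima_110(y: Sequence[float], steps: int) -> List[float]:
    """Fit ARIMA(1, 1, 0) without a constant and forecast `steps` months ahead."""
    z = [b - a for a, b in zip(y, y[1:])]  # d = 1: work on first differences
    phi = fit_ar1(z)
    level, diff = y[-1], z[-1]
    forecasts = []
    for _ in range(steps):
        diff = phi * diff  # AR(1) forecast of the next difference
        level += diff      # integrate back to the original scale
        forecasts.append(level)
    return forecasts
```

For example, `forecast_arima_110([0, 8, 12, 14, 15], 2)` estimates phi as 0.5 from the differences 8, 4, 2, 1 and returns `[15.5, 15.75]`.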

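In step S7, the average relative error $\mathrm{error}_j$ is used to estimate the prediction error of the model:

$$\mathrm{error}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_{act}(i,j) - y_{pred}(i,j) \right|}{y_{act}(i,j)}, \quad j = 1, \dots, J$$

where $y_{act}(i,j)$ and $y_{pred}(i,j)$ are the actual and predicted electric meter installation quantities in the i-th month of the j-th time series in the prediction set, n is the total number of predicted months, and J is the total number of time series.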

The original data contain a large number of time series, so the invention first clusters the time series using the K-means clustering algorithm and then builds ARIMA models for prediction according to the data characteristics of each class. K-means clustering has the following advantages:

1. Simple and efficient: it is easy to understand and implement and has low computational complexity, which makes it suitable for large-scale data sets.

2. Scalability: the K-means clustering algorithm adapts to data sets of different sizes and dimensions. It can process data with a large number of samples and high-dimensional features, and it can be parallelized to improve computational efficiency. A parallel implementation of the K-means algorithm can be based on Spark: the data are partitioned across different computing nodes, the cluster center coordinates are shared among the nodes through the SparkContext broadcast mechanism, a Map function assigns the data points, and a Reduce function then updates the mean centers. With this parallel design, all data blocks being clustered can be processed in parallel, and because updating a cluster center amounts to computing a mean, the update can also be completed in parallel.

3. Intuitive and easy to visualize: the clustering result produced by K-means is relatively intuitive and easy to explain and visualize. It divides the data points into K clusters, each representing one class, so that the time series within the same cluster have similar characteristics.

4. Interpretability: the clustering result produced by the K-means algorithm is generally easy to interpret. The cluster center is the center point of each cluster and can be used to explain and describe the characteristics of the cluster.

By contrast, if an ARIMA model is built for each time series directly, then on the one hand, because the series are so numerous, there is no guarantee that the optimal ARIMA model is selected for each of them; on the other hand, if the ARIMA models are to be improved later, it is difficult to improve each one separately.
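The Map and Reduce roles mentioned for the parallel variant can be illustrated without Spark. The sketch below is plain Python and only mimics the data flow of one iteration on a single machine: the list of partitions stands in for data spread over computing nodes, and the `centers` argument plays the part of the broadcast cluster centers. It is not Spark code, and all names are hypothetical.

```python
from functools import reduce
from typing import List, Sequence, Tuple

Partial = Tuple[List[float], int]  # (running sum of series, member count)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(x: Sequence[float],
               centers: Sequence[Sequence[float]]) -> Tuple[int, Partial]:
    """Map: emit (index of nearest center, (series, 1))."""
    j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
    return j, (list(x), 1)


def reduce_sum(a: Partial, b: Partial) -> Partial:
    """Reduce: add two partial sums and their counts."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return [u + v for u, v in zip(sum_a, sum_b)], n_a + n_b


def map_reduce_update(partitions: Sequence[Sequence[Sequence[float]]],
                      centers: Sequence[Sequence[float]]) -> List[List[float]]:
    """One K-means iteration expressed as a map phase followed by a reduce phase."""
    pairs = [map_assign(x, centers) for part in partitions for x in part]
    new_centers = [list(c) for c in centers]
    for j in range(len(centers)):
        values = [value for key, value in pairs if key == j]
        if values:  # clusters with no members keep their previous center
            total, count = reduce(reduce_sum, values)
            new_centers[j] = [t / count for t in total]
    return new_centers
```

Because `reduce_sum` is associative and commutative, the partial sums can be combined in any order, which is the property that allows the center update to run in parallel across nodes.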

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the invention in any way, and any person skilled in the art may make modifications or alterations to the disclosed technical content to the equivalent embodiments. However, any simple modification, equivalent variation and variation of the above embodiments according to the technical substance of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims

1. The method for predicting the installation quantity of the ammeter based on cluster analysis is characterized by comprising the following steps of:

s2, determining the optimal clustering quantity by adopting an elbow rule;

2. The method for predicting the installation quantity of the electric meter based on the cluster analysis according to claim 1, wherein in the step S1, the training set and the prediction set are divided according to the prediction requirement, specifically: if the electricity meter installation amount of each time series is to be predicted for the last n months, the data except the last n months are taken as a training set, and the data of the last n months are taken as a prediction set.

3. The method for predicting the installation quantity of an electric meter based on cluster analysis according to claim 1, wherein in step S1, if the lengths of the time series are different, the time series are made to have the same length by interpolation or truncation, specifically: if the time series itself only records the data from the first installation to the last installation of the ammeter, the installation amount at the time point which is not recorded before can be recorded as 0; if it cannot be determined whether or not the amount of installation at the previous unrecorded time point is 0, all the time series may be truncated to the same length as the shortest time series, and the time series may be truncated from the initial part of the time series.

4. The method for predicting the installation quantity of an electric meter based on cluster analysis according to claim 1, wherein in step S2, the optimal number of clusters is determined by using an elbow rule, and the method specifically comprises: and calculating the square sum SSE of the clustering errors under different clustering numbers, wherein when the clustering number is increased to a certain degree, the reducing speed of the SSE is suddenly slowed down, and when an elbow is formed, the clustering number is the optimal clustering number.

5. The method for predicting the installation quantity of the ammeter based on cluster analysis according to claim 1, wherein in the step S3, an initial cluster center is selected by adopting a K-means++ algorithm, and the specific method is as follows:

a4, repeating the steps A2 and A3 until K cluster centers are selected.

6. The method for predicting the installation quantity of the electric meter based on the cluster analysis according to claim 1, wherein in the step S4, the K-means clustering is iterated, and the specific method is as follows:

7. The method for predicting the installation quantity of an electric meter based on cluster analysis according to claim 1, wherein in step S5, the order of the optimal ARIMA model is selected for each class of cluster center, and is used as the order of the optimal ARIMA model of the corresponding class of time series, and the specific method is as follows:

8. The method for predicting the installation quantity of an electric meter based on cluster analysis according to claim 7, wherein in step S5, if the autocorrelation and partial autocorrelation function diagrams cannot determine the orders p and q, a subset selection algorithm is used to select the appropriate p and q, specifically: using bayesian information criteria as evaluation criteria and selecting the model with the smallest BIC value as the best model by trying different combinations of p, d and q.

9. The method for predicting the installation quantity of an electric meter based on cluster analysis according to claim 1, wherein in step S6, a corresponding ARIMA model is established for each time series by the order of ARIMA model used for each time series determined in step S5; and estimating parameter values of the ARIMA model by using maximum likelihood estimation so as to maximize the fitting degree of the model to the observed data, thereby obtaining the predicted data of the last n months of each time sequence.

10. The method for predicting the installation quantity of the electric meter based on cluster analysis according to claim 1, wherein in step S7, the average relative error $\mathrm{error}_j$ is used to estimate the prediction error of the model: