Disclosure of Invention
The invention aims to provide an abnormal electricity price node and an abnormal electricity price area identification system based on electricity price data distribution density characteristics, comprising a historical electricity price data dimension reduction module, an electricity price partition module, an abnormal electricity price node and an abnormal electricity price partition identification module;
the historical power price data dimension reduction module acquires historical high-dimension power price data, reduces dimensions and obtains a dimension-reduced low-dimension power price data set;
the electricity price partitioning module performs feature extraction and clustering on the low-dimensional electricity price data set to obtain a plurality of electricity price partitions;
Based on the electricity price partitions, the abnormal electricity price node and abnormal electricity price partition identification module characterizes the degree of isolation of the electricity price data to be tested, thereby identifying abnormal electricity price nodes and abnormal electricity price partitions.
Further, the historical electricity price data dimension reduction module reduces dimension of the historical high-dimension electricity price data through a t-SNE algorithm.
Further, the step of the historical power price data dimension reduction module for reducing the dimension of the historical high-dimension power price data comprises the following steps:
1) Acquiring historical high-dimensional electricity price data and recording it as the data set $X=\{x_1,x_2,\dots,x_{N_{node}}\}$, where $N_{node}$ represents the number of nodes in the system, $x_n=(x_{n,1},x_{n,2},\dots,x_{n,T})$ is the electricity price data of any one node, and $T$ represents the dimension of the data set, namely the total number of electricity price time periods;
2) Constructing a Gaussian distribution with variance $\sigma_i$ centered on any data point $x_i$ in the data set $X$ to obtain the conditional similarity probability $p_{i|j}$ of data point $x_i$ to data point $x_j$, and constructing a Gaussian distribution with variance $\sigma_j$ centered on data point $x_j$ to obtain the conditional similarity probability $p_{j|i}$ of data point $x_j$ to data point $x_i$, namely:
3) Based on the conditional similarity probabilities $p_{i|j}$ and $p_{j|i}$, constructing the joint probability between $x_i$ and $x_j$, namely:
Wherein: $p_{ij}$ and $p_{ji}$ are both the joint probability between $x_i$ and $x_j$; $N$ represents the number of nodes in the system;
4) Calculating the perplexity $\mathrm{Perplexity}(P_i)$ of data point $x_i$, namely:
Wherein: $P_i$ represents the probability distribution formed by the joint probabilities $p_{ji}$ between $x_i$ and all other data points;
The entropy $H(P_i)$ of the probability distribution $P_i$ is as follows:
Wherein: $N_{node}$ here denotes the set of nodes within the system;
5) Based on the perplexity $\mathrm{Perplexity}(P_i)$, searching for the optimal Gaussian variance $\sigma_i$ corresponding to data point $x_i$ by binary search;
6) Assuming that $Y=\{y_1,y_2,\dots,y_{N_{node}}\}$ is the low-dimensional electricity price data set, where $y_n=(y_{n,1},y_{n,2})$ is the low-dimensional mapping point of the electricity price of any one node;
The joint probability $q_{ij}$ of data point $y_i$ and data point $y_j$ in the low-dimensional electricity price data set is as follows:
Wherein: $q_{ij}$ and $q_{ji}$ are both the joint probability between $y_i$ and $y_j$;
7) Constructing the optimization objective function $\min C$, namely:
Wherein: $KL(P_i\|Q_i)$ represents the KL divergence; $Q_i$ is the joint probability distribution vector formed by the joint probabilities $q_{ij}$ between $y_i$ and all other data points;
8) Iteratively solving the optimization objective function $\min C$ by gradient descent, thereby constructing the low-dimensional electricity price data set $Y$;
the calculation formula for the iterative solution is as follows:
wherein: $Y(t)$ and $Y(t+1)$ represent the results of the $t$-th and $(t+1)$-th iterations, respectively; $\gamma$ represents the learning rate of gradient descent; and the gradient term is the gradient of the objective function with respect to the result of the $t$-th iteration.
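For orientation, the quantities named in steps 2) to 8) take the following form in the standard t-SNE formulation. This is a sketch under that assumption only: the subscript order of the conditional probabilities, the normalization constant, and the exact update rule used by the invention may differ from what is written here.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\!\left(-\lVert x_i-x_j\rVert^2 / 2\sigma_i^2\right)}
                {\sum_{k\neq i}\exp\!\left(-\lVert x_i-x_k\rVert^2 / 2\sigma_i^2\right)}
        && \text{(conditional similarity, Gaussian centered on } x_i\text{)} \\
p_{ij}  &= \frac{p_{j|i}+p_{i|j}}{2N}
        && \text{(symmetrized joint probability)} \\
\mathrm{Perplexity}(P_i) &= 2^{H(P_i)}, \qquad
H(P_i) = -\sum_{j} p_{j|i}\log_2 p_{j|i} \\
q_{ij}  &= \frac{\left(1+\lVert y_i-y_j\rVert^2\right)^{-1}}
                {\sum_{k\neq l}\left(1+\lVert y_k-y_l\rVert^2\right)^{-1}}
        && \text{(Student-t kernel in the low-dimensional space)} \\
C &= \sum_i KL(P_i\,\|\,Q_i) = \sum_i\sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}} \\
Y(t+1) &= Y(t) - \gamma\,\frac{\partial C}{\partial Y(t)}
        && \text{(plain gradient descent with learning rate } \gamma\text{)}
\end{align}
```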
Further, the method for carrying out feature extraction and clustering on the low-dimensional electricity price data by the electricity price partitioning module comprises a DBSCAN algorithm.
Further, the step of extracting and clustering the characteristics of the low-dimensional electricity price data comprises the following steps:
1) Importing the low-dimensional electricity price data set $Y$ and initializing the cluster label cate;
2) Arbitrarily taking one data point $y_i$ from the low-dimensional electricity price data set $Y$;
3) Judging whether data point $y_i$ has already been marked as a core point, boundary point, or noise point; if it has been marked, discarding data point $y_i$ and returning to step 2); if it has not been marked, jumping to step 4);
4) Judging whether data point $y_i$ is a core point by using formula (9); if so, marking data point $y_i$ as a core point and jumping to step 5); otherwise, marking it as a noise point and jumping to step 2);
$\mathrm{Core}(Y)=\{\,y_i \mid y_i\in Y,\ \mathrm{Size}(N_\varepsilon(y_i))\ge \mathrm{MinPts}\,\}$ (9)
Wherein: $\mathrm{Core}(Y)$ represents the set of all core points within the data set $Y$; MinPts is the set density threshold parameter; $N_\varepsilon(y_i)$ represents the set of data points within the data set $Y$ whose distance from data point $y_i$ is less than $\varepsilon$; $\mathrm{Size}(\cdot)$ represents the number of data points in a set;
5) Updating the cluster label cate and putting data point $y_i$ into the corresponding cluster set; initializing a candidate set $H$, calculating all neighboring points within radius $\varepsilon$ of data point $y_i$, and adding all of them to the candidate set $H$;
6) Sequentially selecting a data point $h_s$ from the candidate set $H$ and writing $h_s$ into the cluster set; judging whether data point $h_s$ has been marked as a noise point; if so, changing its mark to boundary point and then entering step 7); otherwise, directly entering step 7);
7) Judging whether data point $h_s$ is a core point by using formula (9); if so, marking data point $h_s$ as a core point, calculating all neighboring points within radius $\varepsilon$ of data point $h_s$, adding all of them to the candidate set $H$, and then entering step 8); otherwise, marking it as a boundary point and then entering step 8);
8) Judging whether all data points in the candidate set $H$ have been marked; if so, entering step 9); if not, jumping to step 6);
9) Judging whether all data points in the data set $Y$ have been marked; if so, the clustering is finished; if not, jumping to step 2).
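The clustering steps above can be sketched in a few lines of Python. This is a minimal illustration of the standard DBSCAN procedure, not the invention's implementation: the names `dbscan`, `region_query`, and `NOISE`, as well as the toy points and parameter values in the usage note, are hypothetical. Following formula (9), a point counts as a core point when at least `min_pts` points (itself included) lie at a distance less than `eps`.

```python
import math

NOISE = -1  # label for noise points; clusters are numbered 1, 2, ...


def region_query(points, i, eps):
    """Indices of all points whose distance to points[i] is less than eps."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) < eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet marked"
    cate = 0                       # current cluster label
    for i in range(len(points)):
        if labels[i] is not None:  # already marked: skip (step 3)
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:  # not a core point (step 4)
            labels[i] = NOISE
            continue
        cate += 1                  # new cluster (step 5)
        labels[i] = cate
        candidates = [j for j in neighbors if j != i]
        while candidates:          # steps 6) to 8)
            j = candidates.pop()
            if labels[j] == NOISE:  # former noise point becomes a boundary point
                labels[j] = cate
            if labels[j] is not None:
                continue
            labels[j] = cate
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: expand the cluster
                candidates.extend(j_neighbors)
    return labels
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=3)` groups the first three points into one partition, the next three into a second, and leaves the last point as noise.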
Further, the step of identifying the abnormal electricity rate node and the abnormal electricity rate partition includes:
1) Identifying abnormal electricity prices in each electricity price partition;
2) Eliminating the identified abnormal electricity prices to obtain the normal electricity price data points in each partition, performing anomaly identification between electricity price partitions according to the data characteristics of the remaining electricity price data, and detecting whether an abnormal electricity price partition exists.
Further, the step of identifying abnormal electricity prices inside the respective electricity price partitions includes:
1) Training the isolation forest (iForest), comprising the following steps:
1.1) Setting the hyperparameters, including the number $N_{Tree}$ of isolation trees (iTrees) in the iForest and the maximum height $N_{sub}$ of a single iTree in the iForest;
1.2) According to the electricity price partition result, importing the data of the cate-th electricity price partition from the data set to be tested, and preprocessing the electricity price partition data;
1.3) According to the relationship between the data amount of the partition data set and the maximum height $N_{sub}$ of a single iTree, assigning the corresponding iTree maximum height for each electricity price partition, namely:
1.4) Initializing the iTree flag $q$;
1.5) Updating the iTree flag $q$;
1.6) Emptying all subspaces $K_{s,k}$, and initializing the layer flag $s$ of the iTree and the flag $k$ of the subspaces in each layer of the iTree;
1.7) Randomly drawing data points from the partition data set to constitute a subspace $K_{s,k}$, which is called the root node;
1.8) Loading all data in the subspace $K_{s,k}$ into the interval $R$ to be segmented;
1.9) Judging whether the interval $R$ to be segmented contains only one data node; if so, marking the corresponding subspace $K_{s,k}$ as a leaf node and jumping to step 1.11); otherwise, entering step 1.10);
1.10) Sorting the data in the interval $R$ to be segmented, randomly generating a split point $R_{s+1,k}$, dividing the data in the interval $R$ into two sets $R_{right}$ and $R_{left}$, and pushing the two newly generated sets into the subspace set of the next layer, namely $K_{s+1}=\{K_s, R_{right}, R_{left}\}$;
1.11) Judging whether all subspaces of the current layer have been segmented; if so, jumping to step 1.13); otherwise, entering step 1.12);
1.12) Updating the subspace flag $k$ and jumping to step 1.8);
1.13) Updating the layer flag $s$;
1.14) Judging whether the current iTree has reached the maximum number of layers; if so, marking all subspaces $K_{s,k}$ of the current layer as leaf nodes and continuing with step 1.15); otherwise, jumping to step 1.12);
1.15) Judging whether all iTrees have been trained; if so, the iForest training process for this electricity price partition is finished and the flow ends; otherwise, jumping to step 1.5);
2) First judging whether the ratio of the range to the mean of the data to be tested of the cate-th electricity price partition is greater than a preset threshold; if so, all node electricity prices in the electricity price partition are considered normal; otherwise, calculating the anomaly score of each node electricity price in the electricity price partition by using formula (11):
Wherein: the left-hand side of formula (11) is the anomaly score; the function $g(x)$ represents the average path length of an iTree constructed from $x$ data points;
Wherein the function $f_q(x_i)$ is as follows:
Wherein: the two quantities appearing in the formula are, respectively, the number of layers traversed from the root node to the leaf node where data $x_i$ is located on the $q$-th iTree, and the number of data points in the same leaf node as data point $x_i$ on the $q$-th iTree; the function $g(x)$ represents the average path length of an iTree constructed from $x$ data points;
3) Identifying the data whose anomaly score is greater than a preset threshold as abnormal electricity prices.
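The training and scoring steps above can be illustrated with a small pure-Python isolation forest for one-dimensional price data. This is a sketch under stated assumptions, not the invention's implementation: it uses the standard isolation forest average path length and score $2^{-E[f_q(x_i)]/g(n)}$, builds every tree on the full partition data instead of a random subsample, and the names `g`, `build_itree`, `path_length`, and `anomaly_scores` together with all parameter defaults are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def g(n):
    """Average path length for n data points (standard isolation forest form)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_itree(values, depth, max_depth, rng):
    """Recursively split the values at a random point until they are isolated."""
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return {"size": len(values)}  # leaf node
    split = rng.uniform(min(values), max(values))
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    return {"split": split,
            "left": build_itree(left, depth + 1, max_depth, rng),
            "right": build_itree(right, depth + 1, max_depth, rng)}


def path_length(tree, x, depth=0):
    """Layers traversed to reach x's leaf, plus g(leaf size) for unsplit data."""
    if "split" not in tree:
        return depth + g(tree["size"])
    branch = "left" if x < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)


def anomaly_scores(values, n_trees=100, max_depth=8, seed=0):
    """Score each value in (0, 1]; values closer to 1 are easier to isolate."""
    rng = random.Random(seed)
    forest = [build_itree(list(values), 0, max_depth, rng) for _ in range(n_trees)]
    norm = g(len(values))  # requires at least two data points
    return [2.0 ** (-sum(path_length(t, x) for t in forest) / (n_trees * norm))
            for x in values]
```

Calling `anomaly_scores([30, 31, 29, 30, 32, 31, 30, 29, 31, 120])` assigns the highest score to the price 120, which is isolated after very few splits; a threshold on the score then separates it from the remaining nodes.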
Further, the step of preprocessing the electricity price partition data includes: judging whether the mean of the data set is smaller than 10; if so, rounding the data to the nearest integer; otherwise, rounding the data to the nearest multiple of 5.
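The rounding rule can be written as a short Python function. This is a minimal sketch; the name `preprocess` is hypothetical, and the tie-breaking behavior (Python's built-in `round` rounds exact halves to the nearest even integer) is an assumption, since the text does not say how ties are resolved.

```python
def preprocess(prices):
    """Round partition prices: to integers if the mean is below 10, else to multiples of 5."""
    mean = sum(prices) / len(prices)
    if mean < 10:
        # Low price level: keep a resolution of one unit.
        return [float(round(p)) for p in prices]
    # Higher price level: coarsen to the nearest multiple of 5.
    return [5.0 * round(p / 5.0) for p in prices]
```

For instance, `preprocess([3.2, 4.7, 6.1])` yields `[3.0, 5.0, 6.0]`, while `preprocess([312.4, 298.9, 305.0])` yields `[310.0, 300.0, 305.0]`. Coarsening the values in this way makes nearly equal prices fall into the same leaf node of an iTree.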
Further, the step of performing the abnormality recognition between electricity price partitions includes:
1) Calculating the mean of the normal electricity prices in each electricity price partition, namely:
Wherein: the data set being averaged is the normal electricity price data set obtained from the electricity price data set of the cate-th electricity price partition after the abnormal electricity price data points are removed; the function $E(\cdot)$ represents the mean of the data in a set; $N_{cate}$ is the number of electricity price partitions;
2) Forming the means of the normal electricity prices of all electricity price partitions into a new data set $Z$, namely:
3) Taking the data set $Z$ as the data set to be tested, completing the training of the iForest on this data set, and calculating the anomaly score of each electricity price partition; if the anomaly score is greater than a preset score threshold, the partition is identified as an abnormal electricity price partition.
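Steps 1) and 2) amount to averaging what remains of each partition once the node-level anomalies are removed. The following Python sketch shows one way to build the data set $Z$; the function name `partition_means` and the dictionary layout (partition label mapped to a price list, and partition label mapped to a set of flagged positions) are hypothetical choices made for illustration.

```python
def partition_means(partitions, anomalies):
    """Mean of the normal prices per partition.

    partitions: dict mapping partition label -> list of node prices
    anomalies:  dict mapping partition label -> set of positions flagged as abnormal
    """
    z = {}
    for cate, prices in partitions.items():
        flagged = anomalies.get(cate, set())
        normal = [p for k, p in enumerate(prices) if k not in flagged]
        # Assumes at least one normal price remains in every partition.
        z[cate] = sum(normal) / len(normal)
    return z
```

With `partitions = {1: [30, 31, 120], 2: [50, 52]}` and `anomalies = {1: {2}}`, the result is `{1: 30.5, 2: 51.0}`. The values of this dictionary form the data set on which the partition-level isolation forest of step 3) is trained.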
Further, the system also comprises a clearing anomaly early warning module;
the clearing anomaly early warning module retrieves the data of the abnormal electricity price node and abnormal electricity price partition identification module, and if an abnormal electricity price node and/or an abnormal electricity price partition exists at the current moment, generates a power system clearing anomaly alarm.
The technical effects of the invention are beyond doubt, and the invention has the following beneficial effects:
1) The technical approach of partitioning the electricity price signal to identify spatial anomaly characteristics effectively avoids the situation in which, once certain areas have been judged as a whole to have abnormal electricity prices because of line problems, the problems inside those areas are overlooked, and thereby effectively improves identification precision.
2) The two-step system identification framework for the abnormal electricity price based on the data distribution density characteristics can effectively extract the data distribution density characteristics of the non-time sequence signal, namely the space electricity price signal, can accurately identify the space abnormal electricity price from two angles of an electricity price node and an electricity price partition, and provides more accurate information for tracing the cause of the subsequent abnormal electricity price.
3) The method can be widely applied to the supervision of the power price signals of the power spot market, and can be used for rapidly identifying the space abnormal power price nodes and the subareas of the power price, realizing the preliminary classification of the space abnormal power price signals, providing an information basis for the source tracing of the follow-up power price abnormal signals and ensuring the safety and operation of the power market.
Detailed Description
The present invention is further described below with reference to examples, but it should not be construed that the scope of the above subject matter of the present invention is limited to the following examples. Various substitutions and alterations are made according to the ordinary skill and familiar means of the art without departing from the technical spirit of the invention, and all such substitutions and alterations are intended to be included in the scope of the invention.
Example 1:
Referring to fig. 1 to 6, an abnormal electricity price node and an abnormal electricity price region identification system starting from a density characteristic of electricity price data distribution comprises a historical electricity price data dimension reduction module, an electricity price partition module, an abnormal electricity price node and an abnormal electricity price partition identification module;
the historical power price data dimension reduction module acquires historical high-dimension power price data, reduces dimensions and obtains a dimension-reduced low-dimension power price data set;
the electricity price partitioning module performs feature extraction and clustering on the low-dimensional electricity price data set to obtain a plurality of electricity price partitions;
Based on the electricity price partitions, the abnormal electricity price node and abnormal electricity price partition identification module characterizes the degree of isolation of the electricity price data to be tested, thereby identifying abnormal electricity price nodes and abnormal electricity price partitions.
Example 2:
The technical content of the abnormal electricity price node and abnormal electricity price area identification system based on the electricity price data distribution density characteristics is the same as that of the embodiment 1, and further, the historical electricity price data dimension reduction module reduces the dimension of the historical high-dimension electricity price data through a t-SNE algorithm.
Example 3:
The technical content of the abnormal electricity price node and the abnormal electricity price area identification system based on the electricity price data distribution density characteristics is any one of embodiments 1-2, and further, the step of the historical electricity price data dimension reduction module for reducing the dimension of the historical high-dimension electricity price data comprises the following steps:
1) Acquiring historical high-dimensional electricity price data and recording it as the data set $X=\{x_1,x_2,\dots,x_{N_{node}}\}$, where $N_{node}$ represents the number of nodes in the system, $x_n=(x_{n,1},x_{n,2},\dots,x_{n,T})$ is the electricity price data of any one node, and $T$ represents the dimension of the data set, namely the total number of electricity price time periods;
2) Constructing a Gaussian distribution with variance $\sigma_i$ centered on any data point $x_i$ in the data set $X$ to obtain the conditional similarity probability $p_{i|j}$ of data point $x_i$ to data point $x_j$, and constructing a Gaussian distribution with variance $\sigma_j$ centered on data point $x_j$ to obtain the conditional similarity probability $p_{j|i}$ of data point $x_j$ to data point $x_i$, namely:
3) Based on the conditional similarity probabilities $p_{i|j}$ and $p_{j|i}$, constructing the joint probability between $x_i$ and $x_j$, namely:
Wherein: $p_{ij}$ and $p_{ji}$ are both the joint probability between $x_i$ and $x_j$; $N$ represents the number of nodes in the system;
4) Calculating the perplexity $\mathrm{Perplexity}(P_i)$ of data point $x_i$, namely:
Wherein: $P_i$ represents the probability distribution formed by the joint probabilities $p_{ji}$ between $x_i$ and all other data points;
The entropy $H(P_i)$ of the probability distribution $P_i$ is as follows:
Wherein: $N_{node}$ here denotes the set of nodes within the system;
5) Based on the perplexity $\mathrm{Perplexity}(P_i)$, searching for the optimal Gaussian variance $\sigma_i$ corresponding to data point $x_i$ by binary search;
6) Assuming that $Y=\{y_1,y_2,\dots,y_{N_{node}}\}$ is the low-dimensional electricity price data set, where $y_n=(y_{n,1},y_{n,2})$ is the low-dimensional mapping point of the electricity price of any one node;
The joint probability $q_{ij}$ of data point $y_i$ and data point $y_j$ in the low-dimensional electricity price data set is as follows:
Wherein: $q_{ij}$ and $q_{ji}$ are both the joint probability between $y_i$ and $y_j$;
7) Constructing the optimization objective function $\min C$, namely:
Wherein: $KL(P_i\|Q_i)$ represents the KL divergence; $Q_i$ is the joint probability distribution vector formed by the joint probabilities $q_{ij}$ between $y_i$ and all other data points;
8) Iteratively solving the optimization objective function $\min C$ by gradient descent, thereby constructing the low-dimensional electricity price data set $Y$;
the calculation formula for the iterative solution is as follows:
wherein: $Y(t)$ and $Y(t+1)$ represent the results of the $t$-th and $(t+1)$-th iterations, respectively; $\gamma$ represents the learning rate of gradient descent; and the gradient term is the gradient of the objective function with respect to the result of the $t$-th iteration.
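As an illustration of steps 6) to 8), the following Python sketch computes the low-dimensional joint probabilities and performs one plain gradient-descent update. It assumes the standard t-SNE choices, namely a Student-t kernel for $q_{ij}$ and the corresponding gradient of the KL objective; the function names `low_dim_affinities`, `kl_cost`, and `gradient_step`, and the toy matrix used in the usage note, are hypothetical and not taken from the invention.

```python
import math


def low_dim_affinities(Y):
    """Joint probabilities q_ij from a Student-t kernel (assumed standard form)."""
    n = len(Y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, w))
    q = [[w[i][j] / total for j in range(n)] for i in range(n)]
    return q, w


def kl_cost(P, Q):
    """Objective C: KL divergence between the joint distributions P and Q."""
    n = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0)


def gradient_step(P, Y, gamma):
    """One update Y(t+1) = Y(t) - gamma * dC/dY(t)."""
    Q, W = low_dim_affinities(Y)
    new_Y = []
    for i in range(len(Y)):
        grad = [0.0] * len(Y[i])
        for j in range(len(Y)):
            if i != j:
                coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                for d in range(len(grad)):
                    grad[d] += coeff * (Y[i][d] - Y[j][d])
        new_Y.append([y - gamma * g_d for y, g_d in zip(Y[i], grad)])
    return new_Y
```

With a toy matrix `P = [[0, 0.4, 0.05], [0.4, 0, 0.05], [0.05, 0.05, 0]]` and starting points `Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]`, a single call `gradient_step(P, Y, gamma=0.1)` pulls the first two points together and lowers the value returned by `kl_cost`. Practical t-SNE implementations usually add momentum and early exaggeration, which are omitted here.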
Example 4:
the system for identifying abnormal electricity price nodes and abnormal electricity price areas based on the distribution density characteristics of electricity price data comprises the technical content as in any one of embodiments 1-3, and further, the method for extracting and clustering the characteristics of the low-dimensional electricity price data by the electricity price partitioning module comprises a DBSCAN algorithm.
Example 5:
The system for identifying abnormal electricity price nodes and abnormal electricity price areas based on the density characteristics of electricity price data distribution according to any one of embodiments 1 to 4 further comprises the steps of extracting and clustering characteristics of low-dimensional electricity price data:
1) Importing the low-dimensional electricity price data set $Y$ and initializing the cluster label cate;
2) Arbitrarily taking one data point $y_i$ from the low-dimensional electricity price data set $Y$;
3) Judging whether data point $y_i$ has already been marked as a core point, boundary point, or noise point; if it has been marked, discarding data point $y_i$ and returning to step 2); if it has not been marked, jumping to step 4);
4) Judging whether data point $y_i$ is a core point by using formula (9); if so, marking data point $y_i$ as a core point and jumping to step 5); otherwise, marking it as a noise point and jumping to step 2);
$\mathrm{Core}(Y)=\{\,y_i \mid y_i\in Y,\ \mathrm{Size}(N_\varepsilon(y_i))\ge \mathrm{MinPts}\,\}$ (9)
Wherein: $\mathrm{Core}(Y)$ represents the set of all core points within the data set $Y$; MinPts is the set density threshold parameter; $N_\varepsilon(y_i)$ represents the set of data points within the data set $Y$ whose distance from data point $y_i$ is less than $\varepsilon$; $\mathrm{Size}(\cdot)$ represents the number of data points in a set;
5) Updating the cluster label cate and putting data point $y_i$ into the corresponding cluster set; initializing a candidate set $H$, calculating all neighboring points within radius $\varepsilon$ of data point $y_i$, and adding all of them to the candidate set $H$;
6) Sequentially selecting a data point $h_s$ from the candidate set $H$ and writing $h_s$ into the cluster set; judging whether data point $h_s$ has been marked as a noise point; if so, changing its mark to boundary point and then entering step 7); otherwise, directly entering step 7);
7) Judging whether data point $h_s$ is a core point by using formula (9); if so, marking data point $h_s$ as a core point, calculating all neighboring points within radius $\varepsilon$ of data point $h_s$, adding all of them to the candidate set $H$, and then entering step 8); otherwise, marking it as a boundary point and then entering step 8);
8) Judging whether all data points in the candidate set $H$ have been marked; if so, entering step 9); if not, jumping to step 6);
9) Judging whether all data points in the data set $Y$ have been marked; if so, the clustering is finished; if not, jumping to step 2).
Example 6:
The system for identifying abnormal electricity price nodes and abnormal electricity price regions according to the present invention, wherein the system is as described in any one of embodiments 1 to 5, and further the step of identifying the abnormal electricity price nodes and abnormal electricity price partitions comprises:
1) Identifying abnormal electricity prices in each electricity price partition;
2) Eliminating the identified abnormal electricity prices to obtain the normal electricity price data points in each partition, performing anomaly identification between electricity price partitions according to the data characteristics of the remaining electricity price data, and detecting whether an abnormal electricity price partition exists.
Example 7:
The technical content of the abnormal electricity price node and abnormal electricity price area identification system based on the electricity price data distribution density characteristics is the same as any one of embodiments 1 to 6, and further, the step of identifying abnormal electricity prices inside the respective electricity price partitions comprises:
1) Training the isolation forest (iForest), comprising the following steps:
1.1) Setting the hyperparameters, including the number $N_{Tree}$ of isolation trees (iTrees) in the iForest and the maximum height $N_{sub}$ of a single iTree in the iForest;
1.2) According to the electricity price partition result, importing the data of the cate-th electricity price partition from the data set to be tested, and preprocessing the electricity price partition data;
1.3) According to the relationship between the data amount of the partition data set and the maximum height $N_{sub}$ of a single iTree, assigning the corresponding iTree maximum height for each electricity price partition, namely:
1.4) Initializing the iTree flag $q$;
1.5) Updating the iTree flag $q$;
1.6) Emptying all subspaces $K_{s,k}$, and initializing the layer flag $s$ of the iTree and the flag $k$ of the subspaces in each layer of the iTree;
1.7) Randomly drawing data points from the partition data set to constitute a subspace $K_{s,k}$, which is called the root node;
1.8) Loading all data in the subspace $K_{s,k}$ into the interval $R$ to be segmented;
1.9) Judging whether the interval $R$ to be segmented contains only one data node; if so, marking the corresponding subspace $K_{s,k}$ as a leaf node and jumping to step 1.11); otherwise, entering step 1.10);
1.10) Sorting the data in the interval $R$ to be segmented, randomly generating a split point $R_{s+1,k}$, dividing the data in the interval $R$ into two sets $R_{right}$ and $R_{left}$, and pushing the two newly generated sets into the subspace set of the next layer, namely $K_{s+1}=\{K_s, R_{right}, R_{left}\}$;
1.11) Judging whether all subspaces of the current layer have been segmented; if so, jumping to step 1.13); otherwise, entering step 1.12);
1.12) Updating the subspace flag $k$ and jumping to step 1.8);
1.13) Updating the layer flag $s$;
1.14) Judging whether the current iTree has reached the maximum number of layers; if so, marking all subspaces $K_{s,k}$ of the current layer as leaf nodes and continuing with step 1.15); otherwise, jumping to step 1.12);
1.15) Judging whether all iTrees have been trained; if so, the iForest training process for this electricity price partition is finished and the flow ends; otherwise, jumping to step 1.5);
2) First judging whether the ratio of the range to the mean of the data to be tested of the cate-th electricity price partition is greater than a preset threshold; if so, all node electricity prices in the electricity price partition are considered normal; otherwise, calculating the anomaly score of each node electricity price in the electricity price partition by using formula (11):
Wherein: the left-hand side of formula (11) is the anomaly score; the function $g(x)$ represents the average path length of an iTree constructed from $x$ data points;
Wherein the function $f_q(x_i)$ is as follows:
Wherein: the two quantities appearing in the formula are, respectively, the number of layers traversed from the root node to the leaf node where data $x_i$ is located on the $q$-th iTree, and the number of data points in the same leaf node as data point $x_i$ on the $q$-th iTree; the function $g(x)$ represents the average path length of an iTree constructed from $x$ data points;
3) Identifying the data whose anomaly score is greater than a preset threshold as abnormal electricity prices.
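For orientation, the scoring quantities of steps 2) and 3) take the following form in the standard isolation forest formulation. This is a sketch under that assumption: the symbols $s$, $e_q$, $m_q$, and $n$ are hypothetical names introduced here for the anomaly score, the number of layers traversed, the number of data points sharing the leaf node, and the number of data points used to build each iTree; the invention's own formulas (11) and (12) may use different notation.

```latex
\begin{align}
s(x_i) &= 2^{-\frac{\frac{1}{N_{Tree}}\sum_{q=1}^{N_{Tree}} f_q(x_i)}{g(n)}} \\
f_q(x_i) &= e_q(x_i) + g\!\left(m_q(x_i)\right) \\
g(x) &= 2\,H(x-1) - \frac{2\,(x-1)}{x}, \qquad H(k) \approx \ln k + 0.5772156649
\end{align}
```

Under this form a score close to 1 indicates a price that is isolated after very few splits, while scores well below 0.5 indicate prices that sit inside a dense group.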
Example 8:
The technical content of the abnormal electricity price node and the abnormal electricity price area identification system based on the electricity price data distribution density characteristics is the same as any one of embodiments 1 to 7, and further, the step of preprocessing the electricity price partition data comprises: judging whether the mean of the data set is smaller than 10; if so, rounding the data to the nearest integer; otherwise, rounding the data to the nearest multiple of 5.
Example 9:
the technical content of the abnormal electricity price node and the abnormal electricity price area identification system based on the density characteristics of electricity price data distribution is any one of embodiments 1 to 8, and further, the step of performing abnormal electricity price partition identification comprises the following steps:
1) Calculating the mean of the normal electricity prices in each electricity price partition, namely:
Wherein: the data set being averaged is the normal electricity price data set obtained from the electricity price data set of the cate-th electricity price partition after the abnormal electricity price data points are removed; the function $E(\cdot)$ represents the mean of the data in a set; $N_{cate}$ is the number of electricity price partitions;
2) Forming the means of the normal electricity prices of all electricity price partitions into a new data set $Z$, namely:
3) Taking the data set $Z$ as the data set to be tested, completing the training of the iForest on this data set, and calculating the anomaly score of each electricity price partition; if the anomaly score is greater than a preset score threshold, the partition is identified as an abnormal electricity price partition.
Example 10:
The technical content of the abnormal electricity price node and abnormal electricity price area identification system based on the electricity price data distribution density characteristics is the same as any one of embodiments 1 to 9, and further, the system also comprises a clearing anomaly early warning module;
the clearing anomaly early warning module retrieves the data of the abnormal electricity price node and abnormal electricity price partition identification module, and if an abnormal electricity price node and/or an abnormal electricity price partition exists at the current moment, generates a power system clearing anomaly alarm.
Example 11:
The system analyzes the distribution characteristics of the electricity price data through processing the historical high-dimensional electricity price data, extracts key partition characteristics of the electricity price data signals, further realizes the electricity price signal partition, further realizes the hierarchical recognition of the space abnormal electricity price by applying an isolated forest algorithm according to the density characteristics of the electricity price signal data to be tested, effectively recognizes the abnormal points of the electricity price and the abnormal partitions of the electricity price, and ensures the safety and economic operation of the electric power market.
Specifically, to prevent the excessively high dimensionality of the historical electricity price data from degrading the electricity price partitioning, the dimension of the historical electricity price data is first reduced with the t-SNE algorithm; then, the distribution characteristics of the dimension-reduced electricity price data are analyzed with the DBSCAN algorithm, the key partition characteristics of the electricity price data signals are extracted, and the electricity price partitioning is completed; finally, on the basis of the electricity price partition information, the isolation forest algorithm is used to characterize the degree of isolation of the electricity price signal to be tested, and abnormal electricity price nodes and abnormal electricity price partitions are identified. In addition, the effectiveness and accuracy of the method are verified through simulation on an actual case.
The specific working process is as follows:
(1) Electricity price data dimension reduction method based on t-SNE algorithm
Firstly, mapping original high-dimensional data points to probability distribution through affine transformation by utilizing a t-SNE algorithm, and constructing new probability distribution in a low-dimensional space to continuously optimize low-dimensional space data, so that the two probability distributions of the high-dimensional space and the low-dimensional space are kept consistent as much as possible, and low-dimensional data which can be used for subsequent electricity price partition is obtained. The method comprises the following specific steps:
1) Constructing a high-dimensional original electricity price data set, and calculating joint probability of the high-dimensional original electricity price data set:
The electricity price data of all nodes of the system form the high-dimensional original data set $X=\{x_1,x_2,\dots,x_{N_{node}}\}$, where $N_{node}$ represents the number of nodes in the system, $x_n=(x_{n,1},x_{n,2},\dots,x_{n,T})$ is the electricity price data of any one node, and $T$ represents the dimension of the data set, i.e., the total number of electricity price time periods. Let $x_i$ and $x_j$ be any two data points in the data set $X$. Constructing a Gaussian distribution with variance $\sigma_i$ centered on $x_i$ yields the conditional similarity probability $p_{i|j}$ of $x_i$ to $x_j$ shown in formula (1), which represents the probability that $x_j$ lies in the neighborhood of $x_i$; the larger the value, the closer $x_j$ is to $x_i$.
Wherein: σ i is a gaussian variance centered on x i, which determines the shape of the gaussian distribution built centered on x i.
In order to ensure the symmetry of the conditional probabilities to simplify the subsequent gradient calculation and speed up the optimization process, the original conditional probabilities are replaced by more general joint probabilities, as shown in the formula (2):
Wherein: p ij and p ji are both joint probabilities between x i and x j; n represents the number of nodes in the system.
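For reference, the standard t-SNE expressions that match the definitions given for formulas (1) and (2) can be written as below. This is a sketch under assumptions: the subscript convention p_{i|j} (Gaussian centered on x_i) follows the wording of this description, N denotes the number of nodes, and the summation index k is introduced here for illustration.

```latex
\begin{align}
p_{i|j} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \tag{1} \\
p_{ij} &= p_{ji} = \frac{p_{i|j} + p_{j|i}}{2N} \tag{2}
\end{align}
```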
From formulas (1) and (2), obtaining an accurate joint probability p_ij requires determining the Gaussian variances σ_i and σ_j. The t-SNE algorithm uses the concept of perplexity to find a suitable Gaussian variance.
Taking the Gaussian variance σ_i of data point x_i as an example, the perplexity Perplexity(P_i) of data point x_i can be interpreted as the effective number of neighbors around x_i and can be expressed as an exponential function of H(P_i), as shown in formula (3):
Wherein: P_i represents the probability distribution formed by the joint probabilities p_ji between x_i and all other data points; H(P_i) represents the entropy of the distribution P_i, calculated as shown in formula (4):
Wherein: N_node represents the set of nodes within the system.
From formulas (3) and (4), the Gaussian variance σ_i can be adjusted by adjusting the perplexity Perplexity(P_i) of the probability distribution. To keep the t-SNE algorithm robust, the perplexity is typically chosen between 5 and 50. Once the perplexity is fixed, the optimal Gaussian variance σ_i for data point x_i is found by binary search.
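The binary search for σ_i can be sketched as follows. This is a minimal illustration, assuming the usual definitions Perplexity = 2^H with H the Shannon entropy in bits; the function names, search bounds, and tolerance are hypothetical choices and not taken from the description.

```python
import math

def neighbor_probs(sq_dists, sigma):
    """Gaussian similarities of one point to the others, normalised to sum to 1."""
    d_min = min(sq_dists)  # shift for numerical stability; cancels after normalisation
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity = 2 ** H(P), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, tol=1e-4, max_iter=100):
    """Binary search for sigma; perplexity grows monotonically with sigma."""
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        perp = perplexity(neighbor_probs(sq_dists, mid))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = mid   # distribution too flat: shrink sigma
        else:
            lo = mid   # distribution too peaked: grow sigma
    return mid

# Squared distances from one node to five others (illustrative values only).
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
sigma = find_sigma(sq_dists, target=3.0)
found = perplexity(neighbor_probs(sq_dists, sigma))
```

The target perplexity must lie between 1 and the number of neighbors; otherwise no σ_i can reach it and the search simply stops at a bound.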
2) Creating a low-dimensional target data set, and assuming its joint probability:
After the joint probability distribution of the original high-dimensional data space has been calculated, a joint probability distribution must also be assumed for the dimension-reduced target data set. Assume that Y = {y_1, y_2, …, y_{N_node}} is the target data set after dimension reduction, where y_n = (y_{n,1}, y_{n,2}) is the low-dimensional mapping point of the electricity price of any node n, and y_i and y_j are any two data points in the data set Y.
To solve the crowding problem in the data dimension reduction process, a t distribution with 1 degree of freedom is used to represent the probability distribution of the low-dimensional data space. The joint probability q_ij of data points in the low-dimensional data space is shown in formula (5):
Wherein: q_ij and q_ji are both joint probabilities between y_i and y_j. The joint probabilities q_ij between y_i and all other data points form a joint probability distribution Q_i.
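The low-dimensional joint probabilities can be illustrated with a short sketch. It assumes the standard t-SNE form, in which q_ij is proportional to 1 / (1 + ||y_i - y_j||^2) and is normalised over all ordered pairs; the function name `low_dim_joint` and the sample points are hypothetical.

```python
def low_dim_joint(Y):
    """Joint probabilities q_ij from a Student t kernel with 1 degree of freedom."""
    n = len(Y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in w)  # normalise over all ordered pairs
    return [[v / total for v in row] for row in w]

# Three illustrative 2-D mapping points: two close together, one far away.
Y = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
Q = low_dim_joint(Y)
```

The heavy tail of the t kernel gives distant pairs a larger probability than a Gaussian would, which is what relieves the crowding problem.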
3) Calculating the KL divergence between the two joint probability distributions and continuously optimizing the low-dimensional target data set:
Ideally, if the joint probability distributions P_i and Q_i of the data before and after dimension reduction are equal, the low-dimensional data set Y accurately reflects the similarity relationships between the data points in the high-dimensional data set X. The t-SNE algorithm adjusts the low-dimensional data set Y through continued iterative computation so that P_i and Q_i become as close to equal as possible. In this process the difference between the two joint probability distributions is measured by the KL divergence (Kullback-Leibler divergence), as shown in formula (6):
Wherein: c is an objective function of iterative optimization, and is the sum of KL divergences of probability distributions of all data points in the data set. KL (·) represents calculating the KL divergence between the two distributions.
The objective function C is minimized so that the joint probability distributions P_i and Q_i of the data before and after dimension reduction are as similar as possible. The optimization uses gradient descent: the gradient of the objective function C with respect to any dimension-reduced data point y_i is shown in formula (7), and the iterative update is shown in formula (8):
Wherein: Y(t) and Y(t+1) represent the results of the t-th and (t+1)-th iterations, respectively; γ represents the learning rate of gradient descent; the gradient term represents the gradient of the objective function with respect to the result of the t-th iteration and can be solved by formula (7).
(2) DBSCAN algorithm-based post-dimension node electricity price partitioning method
The step of clustering the power price data Y after the dimension reduction by using the DBSCAN algorithm is shown in the figure 1, and the specific flow is as follows:
2-1) Import the dimension-reduced electricity price data set Y and initialize the cluster label flag.
2-2) Take any data point y_i from the data set Y and judge whether it has already been marked as a core point, a boundary point, or a noise point; if so, repeat step 2-2) with the next data point; if not, go to step 2-3).
2-3) Judge whether the data point y_i is a core point using formula (9); if so, mark the data point y_i as a core point and go to step 2-4); otherwise, mark it as a noise point and go to step 2-2).
Core(Y) = { y_i | y_i ∈ Y, Size(N_ε(y_i)) ≥ MinPts }    (9)
Wherein: Core(Y) represents the set of all core points within the data set Y; MinPts is a preset density threshold parameter that represents the minimum number of data points required in the ε-neighborhood of a single core point; N_ε(y_i) represents the set of data points within data set Y whose distance from data point y_i is less than ε; Size(·) represents the number of data points in a set.
2-4) Update the cluster label flag and put data point y_i into the cluster set with the current label. Initialize a candidate set H, find all neighboring points within radius ε of data point y_i, and add them all to the candidate set H.
2-5) Select the data points h_s in the candidate set H in sequence, put h_s into the current cluster set, and judge whether it has been marked as a noise point. If h_s has been marked as a noise point, change its mark to boundary point and continue with step 2-6); otherwise, continue directly with step 2-6).
2-6) Judge whether the data point h_s is a core point using formula (9); if so, mark h_s as a core point, find all neighboring points within its radius ε, add them all to the candidate set H, and continue with step 2-7); otherwise, mark it as a boundary point and continue with step 2-7).
2-7) Judging whether all data points in the set H to be tested are marked, if yes, continuing the step 2-8); if not, jumping to the step 2-5).
2-8) Judging whether all data points in the data set Y finish marking, if so, finishing clustering; if not, jumping to the step 2-2).
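Steps 2-1) to 2-8) can be sketched in a few lines. This is a minimal illustration rather than the implementation of the invention: the names `region_query` and `dbscan`, the label convention (-1 for noise), and the sample points are assumptions, and the neighborhood uses the strict "distance less than ε" test from formula (9).

```python
import math

NOISE = -1

def region_query(Y, i, eps):
    """Indices of all points closer than eps to point i (the point itself included)."""
    return [j for j in range(len(Y)) if math.dist(Y[i], Y[j]) < eps]

def dbscan(Y, eps, min_pts):
    labels = [None] * len(Y)              # None means "not yet marked"
    cluster = -1
    for i in range(len(Y)):
        if labels[i] is not None:         # step 2-2: already marked
            continue
        neighbours = region_query(Y, i, eps)
        if len(neighbours) < min_pts:     # step 2-3: not a core point
            labels[i] = NOISE
            continue
        cluster += 1                      # step 2-4: open a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:                      # steps 2-5 to 2-7
            s = queue.pop(0)
            if labels[s] == NOISE:        # former noise point becomes a boundary point
                labels[s] = cluster
                continue
            if labels[s] is not None:
                continue
            labels[s] = cluster
            s_neighbours = region_query(Y, s, eps)
            if len(s_neighbours) >= min_pts:   # step 2-6: core point, keep expanding
                queue.extend(s_neighbours)
    return labels

# Two tight groups and one isolated point (illustrative 2-D values).
Y = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(Y, eps=1.5, min_pts=3)
```

Points labelled -1 correspond to the unpartitioned nodes counted later in the parameter comparison.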
(3) Two-step system identification framework for electricity price abnormality based on data distribution density characteristics
Spatially abnormal electricity prices are identified in two steps. First, the abnormal points within each electricity price partition are identified and removed, leaving the normal electricity price data points of each partition. Then, abnormal electricity price partitions are identified from the characteristics of the remaining electricity price data, i.e., it is detected whether any abnormal electricity price area exists.
1) Abnormal electricity price identification process in electricity price partition
Based on the electricity price partition information, abnormal electricity price data within each partition are identified with the isolation forest algorithm in two stages: the first stage is the training of the isolation forest (iForest); the second stage uses the trained isolation forest to calculate the anomaly score of each data point and identify the abnormal data points.
The training of the isolation forest in the first stage is shown in FIG. 2. Taking any one electricity price partition as an example:
3-1) Set the hyperparameters: ① the number N_Tree of isolation trees (iTrees) in the iForest; ② the maximum height N_sub of a single iTree.
3-2) According to the electricity price partition result, import the data of this electricity price partition from the data set under test and preprocess it: if the mean of the data set is smaller than 10, round the data to the nearest integer; otherwise, round the data to the nearest multiple of 5, so that very small differences do not affect the identification result when the DBSCAN algorithm is used.
3-3) Judge whether the amount of data in the data set of this partition is larger than the maximum height N_sub of a single iTree, and assign the maximum iTree height for each electricity price partition accordingly using formula (10).
3-4) Initialization of the iTree flag q.
3-5) The iTree flag q is updated.
3-6) Empty all subspaces K_{s,k}, and initialize the layer flag s of the iTree and the subspace flag k of each iTree layer.
3-7) Randomly draw a number of data points from the data set of this partition to form a subspace K_{s,k}, which is called the root node.
3-8) Load all the data in the subspace K_{s,k} into the interval R to be split.
3-9) Judge whether the interval R to be split contains only one data point; if so, mark the corresponding subspace K_{s,k} as a leaf node and go to step 3-11); otherwise, continue.
3-10) Sort the data in the interval R to be split, randomly generate a split point R_{s+1,k}, divide the data in R into two sets R_right and R_left, and push the two newly generated sets into the subspace set of the next layer, i.e., K_{s+1} = {K_{s+1}, R_right, R_left}.
3-11) Judging whether all subspaces of the current layer are segmented, if yes, jumping to the step 3-13); otherwise, continuing to step 3-12).
3-12) Update the subspace flag k and go to step 3-8).
3-13) Update the layer flag s.
3-14) Judge whether the current iTree has reached the maximum number of layers; if so, mark all subspaces K_{s,k} of the current layer as leaf nodes and continue with step 3-15); otherwise, go to step 3-12).
3-15) Judge whether all iTrees have been trained; if so, the iForest training for this electricity price partition is complete and the flow ends; otherwise, go to step 3-5).
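The preprocessing of step 3-2) and the tree construction of steps 3-4) to 3-15) can be sketched as follows. This is a simplified illustration under assumptions: the trees are built recursively rather than layer by layer as in the flow above, the data are one-dimensional prices, ties are rounded half up, and all function names, the subsample size, and the sample prices are hypothetical.

```python
import math
import random

def preprocess(prices):
    """Round to integers if the mean is below 10, else to the nearest multiple of 5."""
    step = 1.0 if sum(prices) / len(prices) < 10 else 5.0
    return [math.floor(p / step + 0.5) * step for p in prices]

def build_itree(data, height, max_height, rng):
    """Split 1-D data at a random value until a point is isolated or the height limit is hit."""
    if len(data) <= 1 or height >= max_height or min(data) == max(data):
        return {"size": len(data)}                       # leaf node
    split = rng.uniform(min(data), max(data))
    left = [x for x in data if x < split]
    right = [x for x in data if x >= split]
    return {"split": split,
            "left": build_itree(left, height + 1, max_height, rng),
            "right": build_itree(right, height + 1, max_height, rng)}

def depth(tree, x, level=0):
    """Number of layers traversed from the root to the leaf that x falls into."""
    if "split" not in tree:
        return level
    branch = "left" if x < tree["split"] else "right"
    return depth(tree[branch], x, level + 1)

def build_iforest(data, n_trees, sample_size, max_height, seed=0):
    """Train n_trees iTrees, each on a random subsample of the partition data."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    return [build_itree(rng.sample(data, size), 0, max_height, rng)
            for _ in range(n_trees)]

def mean_depth(forest, x):
    return sum(depth(tree, x) for tree in forest) / len(forest)

# Ten similar node prices and one clearly separated price (illustrative values).
raw = [300.0 + 5 * i for i in range(10)] + [900.0]
prices = preprocess(raw)
forest = build_iforest(prices, n_trees=100, sample_size=8, max_height=8)
outlier_depth = mean_depth(forest, 900.0)
inlier_depth = mean_depth(forest, 320.0)
```

The separated price is isolated after very few splits, so its mean depth is much smaller than that of a price in the middle of the group; this is the separation property that the anomaly score builds on.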
After the iForest of every electricity price partition has been trained according to the above steps, abnormal signals are identified in turn among the data under test in each electricity price partition. First, judge whether the ratio of the range to the mean of the data under test in the partition is more than 3%; if so, the electricity prices of all nodes in the partition are judged to be normal and no anomaly identification is needed; otherwise, calculate the anomaly score of each node electricity price in the partition using formula (10):
Wherein: the function f_q(x_i) represents the path length of data x_i on the q-th iTree, as shown in formula (11):
Wherein: the layer count is the number of layers traversed from the root node to the leaf node where data x_i is located on the q-th iTree; the leaf size is the number of data points in the same leaf node as data point x_i on the q-th iTree; the function g(x) has the same meaning as in formula (10) and represents the average path length of an iTree constructed from x samples, calculated as shown in formula (12):
Wherein: γ is Euler's constant, approximately equal to 0.5772156649.
The closer the anomaly score is to 0, the longer the path required to separate the data point under test from the data set and the less likely the data point is an anomaly; conversely, the closer the anomaly score is to 1, the more easily the data point under test is separated from the data set and the more likely it is an anomaly. A reasonable threshold is determined according to the number of categories of node electricity price anomaly scores in each electricity price partition, completing the identification of abnormal points within each partition.
2) Abnormal electricity price partition identification process
After the identification of the abnormal electricity prices in the electricity price partitions is completed, calculating the average value of the normal electricity prices in each electricity price partition by using the formula (13):
Wherein: the set being averaged is the normal electricity price data set obtained by removing the abnormal electricity price data points from the electricity price data set of the given electricity price partition; the function E(·) represents averaging the data within a set.
The average value of the normal electricity price of each electricity price partition is formed into a new data set Z, as shown in a formula (14):
Taking the data set Z as the data set under test, steps 3-1) to 3-15) are repeated to train an iForest on this data set, and the anomaly score of each electricity price partition is calculated using formulas (10)-(12), thereby identifying whether the electricity price in a partition is abnormal relative to the whole system.
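The construction of the data set Z from formulas (13) and (14) can be sketched as below, assuming each partition is held as a mapping from node name to price. The function name `partition_means`, the partition and node names, and the prices are hypothetical illustration values.

```python
from statistics import mean

def partition_means(partitions, anomalous):
    """Mean of the normal prices in each partition after dropping flagged nodes."""
    z = {}
    for name, prices in partitions.items():
        normal = [p for node, p in prices.items() if node not in anomalous]
        z[name] = mean(normal)
    return z

partitions = {
    "A": {"n1": 300.0, "n2": 310.0, "n3": 900.0},
    "B": {"n4": 420.0, "n5": 430.0},
}
# Node n3 is assumed to have been flagged in the first identification step.
Z = partition_means(partitions, anomalous={"n3"})
```

The values of Z then take the place of the node prices when the isolation forest is trained a second time.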
In summary, the flowchart of the method of the present invention for identifying abnormal electricity price nodes and abnormal electricity price areas based on the distribution density characteristics of electricity price data is shown in FIG. 3.
Example 12:
An experiment verifying the abnormal electricity price node and abnormal electricity price area identification system, based on the distribution density characteristics of electricity price data, according to any one of embodiments 1 to 11, comprising the following steps:
(1) Constructing joint probability distribution before and after power price data dimension reduction by using t-SNE algorithm
First, constructing joint probability distribution of original high-dimensional electricity price data through formulas (1) - (4), creating a low-dimensional target data set, and assuming the joint probability distribution through formula (5).
(2) Calculating KL divergence between two joint probability distributions and continuously optimizing a low-dimensional target dataset
Calculating KL divergence between two distributions before and after data dimension reduction by using the formula (6) as an optimized objective function;
The gradient of the objective function with respect to the low-dimensional target data set is calculated using formula (7), and the low-dimensional target data set is optimized iteratively using formula (8) until the objective function meets the requirement, i.e., until the similarity between the data sets before and after dimension reduction meets the requirement.
(3) Partitioning the node electricity price data after dimension reduction by using DBSCAN algorithm
Clustering the power price data after the dimension reduction by using a DBSCAN algorithm according to the methods from the step 2-1) to the step 2-8), and completing the power market power price partition.
(4) Training of isolation forests (iForest)
Based on the electricity price partition information, the node data of each partition are taken in turn from the data set under test, and an isolation forest (iForest) is trained for each electricity price partition using steps 3-1) to 3-15).
(5) Calculating the anomaly score of each data point using the trained isolation forest, and identifying abnormal electricity price nodes within each electricity price partition
Judge in turn whether the ratio of the range to the mean of the data under test in each electricity price partition is more than 3%; if so, the electricity prices of all nodes in that partition are judged to be normal and no anomaly identification is needed; otherwise, proceed to the next step;
If the node electricity prices in an electricity price partition require spatial anomaly identification, the anomaly scores of the electricity price data of all nodes in the partition are obtained with the isolation forest (iForest) of that partition according to formulas (10)-(12);
The abnormal nodes are identified by comparing the anomaly scores of the electricity price data of the nodes.
(6) After eliminating the abnormal electricity price node, identifying the abnormal electricity price partition
After the abnormal electricity prices in each partition have been identified, the abnormal electricity price data in each partition are removed and the remaining electricity price data in each partition are averaged using formulas (13)-(14), forming a new regional electricity price data set under test;
Steps 3-1) to 3-15) are repeated to train the corresponding isolation forest (iForest), and the anomaly score of the normal average electricity price of each partition is obtained according to formulas (10)-(12);
The abnormal electricity price partitions are identified by comparing the anomaly scores of the normal average electricity prices of the partitions.
The specific simulation results are as follows:
1) Electricity price partition parameter selection and partition result
When the system node electricity prices are partitioned with the DBSCAN algorithm, the number of partitions need not be fixed in advance, but the distance parameter ε and the density threshold parameter MinPts set during execution still influence the partition result. Table 1 shows the electricity price partition results under 16 parameter combinations; the partition effect under each combination is measured in five respects: the number of electricity price partitions, the number of unpartitioned nodes, the range, the interquartile range, and the variance.
TABLE 1 effects of Power price partitioning under different parameter combinations
As can be seen from Table 1, when the distance parameter ε exceeds 10, the range, interquartile range, and variance, which characterize the concentration of the data distribution, increase markedly, and the partition result no longer satisfies the principle stated in the first chapter that "the electricity prices of all nodes within a partition should be as similar as possible and as concentrated as possible". When the distance parameter equals 3 or 5, the dispersion of the partitioned data sets differs little, but the number of electricity price partitions and the number of unpartitioned nodes differ markedly. Since the partitions should contain as many nodes as possible while the number of partitions should not be so large as to cause over-partitioning, the distance parameter ε is set to 5 and the density threshold parameter MinPts is set to 3. Under this hyperparameter setting, the 2022 node electricity prices of a provincial power grid trading center used for the example test are roughly divided into 29 electricity price partitions. The partition result shows that the overall distribution still largely follows administrative divisions; in economically developed regions, where grid construction is more complete, the partitions are more compact. Some coastal areas, affected by their geographical environment, form relatively independent electricity price partitions. Therefore, when grid topology information is missing, part of the network information can be recovered from historical node electricity price data, and the resulting partition is sufficiently realistic and credible to be used for the subsequent identification of spatially abnormal electricity price signals.
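The three dispersion measures compared in Table 1 can be computed for one partition as sketched below. The choice of population variance and of the default quartile method of Python's `statistics` module are assumptions, since the description does not state which definitions were used; the function name and sample values are hypothetical.

```python
import statistics

def dispersion(prices):
    """Range, interquartile range, and variance of the prices in one partition."""
    q1, _, q3 = statistics.quantiles(prices, n=4)
    return {
        "range": max(prices) - min(prices),
        "iqr": q3 - q1,
        "variance": statistics.pvariance(prices),
    }

metrics = dispersion([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

Evaluating these measures per partition for each (ε, MinPts) pair gives a comparison of the kind summarized in Table 1.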
2) Space electricity price abnormal point identification effect
Comparing the recognition effects of the following three methods:
M1: the method proposed by the invention for identifying abnormal electricity price nodes and abnormal electricity price areas based on the distribution density characteristics of electricity price data;
M2: overall identification of spatially abnormal electricity price signals using the isolation forest algorithm;
M3: overall identification of spatially abnormal electricity price signals using the K-means algorithm.
It should be noted that, to make the identification results comparable, all parameter settings and data preprocessing of the isolation forest algorithm in M2 are the same as in M1; the only difference is that partition-wise identification is not performed. The number of clusters required by the K-means algorithm in M3 is set to the optimal partition number 29 obtained in M1, and a data point whose distance to its cluster center is more than 3 times the average distance within the cluster is regarded as an abnormal point.
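The abnormal point rule used for M3 can be sketched for a single cluster as follows. It assumes the cluster center is the mean of the cluster members and that the "average distance" is the mean distance of the members to that center; the function name `flag_outliers` and the prices are hypothetical.

```python
import math

def flag_outliers(points, center, factor=3.0):
    """Indices of points farther from the center than factor times the mean distance."""
    dists = [math.dist(p, center) for p in points]
    mean_dist = sum(dists) / len(dists)
    return [i for i, d in enumerate(dists) if d > factor * mean_dist]

# One cluster of single-period prices: twenty ordinary values and one extreme value.
cluster = [(9.0,), (11.0,)] * 10 + [(40.0,)]
center = (sum(p[0] for p in cluster) / len(cluster),)
outliers = flag_outliers(cluster, center)
```

Because the extreme point itself raises the mean distance, this rule flags only strongly separated points, which is consistent with M3 identifying the fewest abnormal signals.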
The node electricity price data of 96 time periods of one day in 2022 from a provincial power grid trading center are identified with each of the three methods, and the results are compared against identification based on manual experience, which serves as the standard. The proportions of abnormal electricity price signal points identified by the three methods are shown in Table 2, and the identification accuracy and identification error rate are shown in FIG. 4 and FIG. 5, respectively.
TABLE 2 M1-M3 space abnormal electricity price signal point identification ratio
As can be seen from Table 2, the M3 method identifies the fewest spatially abnormal electricity price signals, about one half of the number identified by the M1 method of the invention, while the M2 method identifies the most, more than 3 times the number identified by the M1 method. Combined with the accurate recognition rates of the methods in FIG. 5, the M1 and M2 methods achieve average accurate recognition rates of 95.09% and 93.17%, respectively, whereas the M3 method, which identifies too few abnormal signals, averages less than 30%. Although the M1 and M2 methods have similar recognition accuracy, the M1 method is more stable: its recognition accuracy stays above 80% in most time periods, with no extreme cases.
Combined with the error recognition rates shown in FIG. 6, M3 has the highest error recognition rate, averaging more than 60%. K-means is a dedicated clustering algorithm for which anomaly identification is only an auxiliary function, and it performs poorly on single-dimensional, non-time-series data such as node electricity price signals. The error recognition rate of the M2 method reaches 38.93%, 3 times higher than that of the M1 method proposed by the invention: M2 secures its high accurate recognition rate by flagging a large number of signals as abnormal, so many of them are flagged by mistake and the error recognition rate is too high. The comparison of the three methods shows that the isolation forest algorithm has certain advantages in identifying spatially abnormal electricity price signals, but without partition-wise identification a large number of signals that are normal within their partition are identified as abnormal, which complicates the subsequent analysis of the causes of abnormal electricity price signals. In summary, the method proposed by the invention ensures the identification effect for spatially abnormal electricity price signals.
3) Abnormal electricity price partition identification effect
In addition to identifying abnormal electricity price signal points, the M1 method proposed by the invention can also identify abnormal electricity price signal areas from the perspective of the whole market, supporting further analysis of the causes of electricity price anomalies. Regional abnormal electricity price detection is performed on the node electricity price data of the same day tested in the above example, and the detection result is shown in FIG. 6. The M1 method identifies the abnormal electricity price signal areas almost perfectly in all tested time periods, with an average accuracy of 99.27%. The error recognition rate is less than 10% in more than 60% of the time periods, and the overall error recognition rate is only 11.72%. Although abnormal electricity price signal areas are somewhat over-identified, extreme misidentification accounts for a very small proportion and is concentrated in time periods when the regional electricity price distribution is relatively even and the separation characteristics of the electricity price data are not obvious; in such periods, judgment based on manual experience is needed as an aid.