CN111177216B - Association rule generation method and device for comprehensive energy consumer behavior characteristics

Info

Publication number: CN111177216B
Application number: CN201911333048.3A
Authority: CN
Inventors: 董得龙; 孙虹; 卢静雅; 杨光; 孔祥玉; 祝雨晨; 李野; 李刚; 乔亚男; 刘浩宇; 翟术然; 张兆杰; 许迪; 赵紫敬; 吕伟嘉; 顾强; 何泽昊; 季浩; 白涛
Original assignee: Tianjin University; State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Current assignee: Tianjin University; State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2024-01-05
Anticipated expiration: 2039-12-23
Also published as: CN111177216A

Abstract

The invention relates to a method for generating association rules that integrate the behavioral characteristics of energy consumers, comprising the following steps: step 1, normalizing smart meter time series data; step 2, converting the smart meter time series data into a symbolic representation; step 3, extracting characteristic patterns from the symbolic representation and adding the characteristic motifs of those patterns to a motif library; step 4, performing temporal association rule mining on the characteristic motifs newly added to the motif library, and analyzing the relationships between the influencing factors that cause energy consumption to change within a specific period; step 5, clustering the characteristic motifs in the motif library with a K-means clustering method, a hierarchical method and a density clustering algorithm respectively, and generating daily consumption profiles; and step 6, measuring how well the daily consumption profile groups created by the three clustering methods fit the actual daily consumption. The invention can accurately describe real energy consumption.

Description

Association rule generation method and device integrating behavioral characteristics of energy consumers

Technical field

The invention belongs to the field of smart meter data mining and relates to user electricity consumption information, in particular to a method and device for generating association rules that integrate the behavioral characteristics of energy consumers.

Background

The smart grid is one of the promising technologies for meeting growing energy demand and reducing global environmental pollution. It improves the efficiency, reliability, sustainability and economy of electrical energy. Over the past decade, smart meters have been deployed in much of the world. Smart meters and database management systems together form the Advanced Metering Infrastructure (AMI), which plays an important role in energy systems by facilitating two-way information flow and recording energy distribution. AMI has given rise to various novel smart home services, such as energy-saving recommendations and awareness for end users. Smart meters offer great potential for analyzing fine-grained energy consumption data, which can be used for energy planning and management. The deployment of smart meters benefits both energy consumers and utility professionals.

Time series data generated by smart meters has great potential for identifying routine and abnormal energy consumption patterns. Time series data mining techniques are modeled and developed to identify the energy consumption behavior of energy consumers. Smart meter data requires advanced data analytics for accurate and automated decision making in a real-time environment. Through dynamic pricing, it can raise consumers' energy awareness by giving them a better understanding of how and when energy is used. Energy data analysis has become a major research area in electricity consumption analysis. The ability to analyze smart meter data to identify daily activities is very useful for power companies implementing demand-side management.

In smart grids, the penetration of renewable energy is steadily increasing. However, the intermittent nature of renewable generation leads to a mismatch between supply and demand. Dynamic energy trading prices throughout the day therefore make the time aspect even more important. Smart meter energy consumption patterns fluctuate within a day depending on the time of day or month, the weather, and the schedules and behavior of the occupants. Likewise, grid load varies over time with demand, temperature and renewable generation, which are in turn influenced by weather and seasonal time scales.

In recent years, various techniques have been developed to mine time series data. However, the temporal nature of time series energy consumption data has been studied only to a limited extent, even though energy consumption is a highly dynamic concept, with load demand and pricing varying over time. Therefore, in order to describe real energy consumption accurately, there is a need for a method and device for generating association rules that integrate the behavioral characteristics of energy consumers, with a short response time and the ability to sample frequently over a period of time.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art and to provide a method and device for generating association rules that integrate the behavioral characteristics of energy consumers, with a sound design, a short response time and the ability to sample frequently over a period of time.

The present invention solves its technical problem by adopting the following technical solutions:

A method for generating association rules that integrate the behavioral characteristics of energy consumers, comprising the following steps:

Step 1. Normalize the smart meter time series data;

Step 2. Using symbolic aggregate approximation (SAX), first apply cloud piecewise aggregate approximation to the normalized smart meter time series data, then convert the result into a symbolic representation;

Step 3. Extract the characteristic patterns from the symbolic representation and add the characteristic motifs of those patterns to the motif library, where the characteristic motifs satisfy user-defined frequency count and availability thresholds;

Step 4. Perform temporal association rule mining on the characteristic motifs of the characteristic patterns newly added to the motif library, and analyze the relationships between the influencing factors that cause energy consumption to change within a specific period. If the absolute value of the correlation coefficient between these influencing factors is greater than a set value, proceed to the next step; otherwise, return to step 3, re-extract the characteristic patterns and add their characteristic motifs to the motif library;

Step 5. After cluster analysis of the characteristic motifs in the motif library using three clustering methods respectively, namely K-means clustering, a hierarchical method and a density clustering algorithm, generate daily consumption profiles from the clustering results based on the characteristic patterns corresponding to the characteristic motifs;

Step 6. Statistically evaluate the daily consumption profiles generated by each of the three clustering methods using two metrics, the sum of squared errors and the silhouette coefficient, to measure how well the daily consumption profile groups created by the three clustering methods fit the actual daily consumption.

Moreover, the specific method of step 1 is:

Normalize the data to unit variance using z-normalization, where the normalized value, the mean μ and the standard deviation σ are as follows:

In the above formulas, x_i is a data point to be processed and n is the number of data points to be processed.
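For orientation, a conventional form of z-normalization that is consistent with these definitions is sketched below. The symbol x_i' for the normalized value and the population form of σ (dividing by n rather than n − 1) are assumptions, not taken from the text.

```latex
x_i' = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \mu\right)^2}
```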

Moreover, the specific steps of step 2 include:

(1) Convert the time series data into a cloud piecewise aggregate approximation representation;

(2) Convert the cloud piecewise aggregate approximation representation into a character string.

Moreover, the specific steps of sub-step (1) of step 2 include:

① Represent each of the current segment sequences as a cloud model;

② Use the entropy of each cloud model to evaluate the data stability of its subsequence, and select the subsequence with the worst stability (smallest entropy), denoted Q(i₀, j₀), for piecewise aggregation;

③ In the subsequence Q(i₀, j₀), find a data point q_k with i₀ < k < j₀ as the key point, such that the difference between the sum of the cloud model entropies of the two subsequences it separates, Q(i₀, k) and Q(k, j₀), and the cloud model entropy of Q(i₀, j₀) is maximized; then delete the subsequence Q(i₀, j₀) and record the subsequences Q(i₀, k) and Q(k, j₀);

④ Repeat ① to ③ until the stop condition is met.
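The four sub-steps above can be sketched roughly as follows. This is only an illustration under several assumptions that the text does not fix: the cloud model entropy is estimated with the usual backward cloud generator formula, the stop condition is a target number of segments `w`, the "difference" in sub-step ③ is taken as the signed value (children minus parent), and `min_len` is a hypothetical guard that keeps every segment at least two points long. The function names are illustrative.

```python
import math


def cloud_entropy(values):
    """Entropy En of a normal cloud model (backward cloud generator estimate)."""
    n = len(values)
    ex = sum(values) / n  # expectation Ex of the cloud model
    return math.sqrt(math.pi / 2) * sum(abs(v - ex) for v in values) / n


def cloud_paa_segments(series, w, min_len=2):
    """Split `series` into up to `w` segments, returned as half-open index ranges."""
    segments = [(0, len(series))]
    while len(segments) < w:  # assumed stop condition: w segments reached
        # Only segments long enough to yield two valid children can be split.
        candidates = [s for s in segments if s[1] - s[0] >= 2 * min_len]
        if not candidates:
            break
        # Sub-step 2: pick the subsequence with the smallest entropy, as stated.
        i0, j0 = min(candidates, key=lambda s: cloud_entropy(series[s[0]:s[1]]))
        parent = cloud_entropy(series[i0:j0])

        def gain(k):
            # Sub-step 3: sum of the children's entropies minus the parent's.
            return cloud_entropy(series[i0:k]) + cloud_entropy(series[k:j0]) - parent

        k = max(range(i0 + min_len, j0 - min_len + 1), key=gain)
        segments.remove((i0, j0))
        segments.extend([(i0, k), (k, j0)])
    return sorted(segments)
```

Each returned pair `(i, j)` covers `series[i:j]`; the segment means can then be passed on to the symbolization step.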

Moreover, the specific method of sub-step (2) of step 2 is:

After the cloud piecewise aggregate approximation, the discretized representation is symbolized, converting the time series into a symbol string.

Moreover, the specific steps of step 3 include:

(1) After the symbolic aggregate approximation transformation, generate the SAX pattern types, and label each pattern type generated by SAX as either a motif or an uncommon pattern, where a motif is defined as a previously unknown, frequently occurring pattern;

(2) Extract the required characteristic patterns from the data transformed by symbolic aggregate approximation, and add the characteristic motifs of those patterns to the motif library.

Moreover, the specific method of step 4 is:

Here, X is the antecedent (the event that occurs first), Y is the consequent (the event that occurs afterwards), and N is the total number of records or transactions; the confidence of the rule X→Y is the ratio of the number of records that support both the antecedent and the consequent of the rule to the number of records that support the antecedent alone.
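Written out, the support and confidence measures described in this document take the standard form sketched below. The notation count(·), meaning the number of records in which the given events occur, is introduced here only for illustration.

```latex
\operatorname{support}(X \rightarrow Y) = \frac{\operatorname{count}(X \cup Y)}{N},
\qquad
\operatorname{confidence}(X \rightarrow Y) = \frac{\operatorname{count}(X \cup Y)}{\operatorname{count}(X)}
```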

Moreover, the specific steps of step 5 include:

(1) Cluster the characteristic motifs in the motif library using K-means clustering:

Using the Euclidean distance metric, the daily blocks (N₁, N₂, ..., N_n) are partitioned into k sets S = (S₁, S₂, ..., S_k) so as to minimize the within-cluster sum of squares:

where μ_i is the mean of the values in S_i;
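The K-means objective described here is conventionally written as follows. The index j over the daily blocks is an assumption introduced for the notation.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{N_j \in S_i} \left\lVert N_j - \mu_i \right\rVert^2
```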

(2) Cluster the characteristic motifs in the motif library using the hierarchical method:

A bottom-up hierarchical algorithm is used for clustering, and an iterative relocation method is then used to refine the clustering results obtained by the hierarchical algorithm;

(3) Cluster the characteristic motifs in the motif library using the density clustering algorithm:

① Set the values of the scanning radius eps and the minimum number of points minPts;

② Select any unvisited point and find all nearby points within the scanning radius eps of it;

③ If the number of nearby points ≥ minPts, the current point and its nearby points form a cluster, and the starting point is marked as visited; then, recursively, process all points in the cluster that are not yet marked as visited in the same way, thereby expanding the cluster. If the number of nearby points < minPts, the point is temporarily marked as a noise point;

④ Once the cluster in step ③ has been fully expanded, that is, all points in the cluster are marked as visited, apply the same algorithm to the remaining unvisited points.

Moreover, the specific steps of step 6 include:

(1) The sum of squared errors (SSE) measures the compactness of the clusters; the smaller the SSE, the better the cluster quality. It is calculated as:

where (N₁, N₂, ..., N_n) are the daily data groups, which are partitioned into k sets S = (S₁, S₂, ..., S_k).

(2) The silhouette coefficient measures cohesion within a cluster and separation between clusters; a coefficient of 1 is best and -1 is worst. The silhouette coefficient is calculated as:

where a is the average distance between a data instance and all other points in the same cluster, and b is the average distance between the data instance and all points in the cluster nearest to it.
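In their standard forms, and using the symbols defined in this section, the two metrics can be written as below. Taking μ_i as the mean of cluster S_i follows the K-means description and is an assumption as far as the SSE formula is concerned; s denotes the silhouette coefficient of a single data instance.

```latex
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{N_j \in S_i} \left\lVert N_j - \mu_i \right\rVert^2,
\qquad
s = \frac{b - a}{\max(a,\, b)}
```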

A device for generating association rules that integrate the behavioral characteristics of energy consumers, comprising:

A normalization module, used to normalize the smart meter time series data;

A symbol conversion module, used to apply symbolic aggregate approximation to the normalized smart meter time series data, first performing cloud piecewise aggregate approximation and then converting the result into a symbolic representation;

A characteristic pattern extraction module, used to extract the characteristic patterns from the symbols and add the characteristic motifs of those patterns to the motif library, subject to user-defined frequency count and availability thresholds;

A temporal association rule mining module, used to perform temporal association rule mining on the characteristic motifs of the characteristic patterns newly added to the motif library and to analyze the relationships between the influencing factors that cause energy consumption to change within a specific period. If the relationship between these influencing factors is strong, processing proceeds to the next step; if it is weak, processing returns to the characteristic pattern extraction module, which re-extracts the characteristic patterns and adds their characteristic motifs to the motif library;

A clustering module, which performs cluster analysis of the characteristic motifs in the motif library using three clustering methods respectively, namely K-means clustering, a hierarchical method and a density clustering algorithm, and then generates daily consumption profiles from the clustering results based on the characteristic patterns corresponding to the characteristic motifs;

A clustering result evaluation module, which statistically evaluates the daily consumption profiles generated by each of the three clustering methods using two metrics, the sum of squared errors and the silhouette coefficient, to measure how well the daily consumption profile groups created by the three clustering methods fit the actual daily consumption.

The advantages and positive effects of the present invention are:

1. The present invention proposes a method and device for generating association rules that integrate the behavioral characteristics of energy consumers. The method mines association rules from smart meter data within a specific time window, thereby characterizing the electricity consumption behavior of smart meter users. First, the invention symbolizes the smart meter data to facilitate the application of various data mining techniques. Second, motif identification helps to reveal the energy relationships underlying consumer behavior patterns, and characteristic patterns are extracted that satisfy user-defined frequency count and availability thresholds. Third, temporal association rule mining is performed to analyze the relationships between factors that may cause energy consumption to increase or decrease within a specific period. Finally, pattern-based clustering is performed to create daily profiles of energy consumption. The invention can accurately describe real energy consumption.

2. The present invention converts energy consumption values into a symbolic representation, which greatly reduces the dimensionality of the time series. The symbolic representation can be used both for local operation on embedded sensing systems for home automation and by utility experts to process smart meter data efficiently.

3. In the present invention, a motif is defined as a previously unknown, frequently occurring pattern in a time series. Motifs can reveal similarities that correspond to the same events in real life, and patterns together with their time labels can identify household behavior at specific times. The use of motifs therefore plays an important role in understanding household behavior and determining household patterns, and the temporal information extracted by this method can accurately describe real energy consumption.

Description of drawings

Figure 1 is a flow chart of the steps of the present invention;

Figure 2 is a processing flow chart of the present invention;

Figure 3 is a processing flow chart of the cloud piecewise aggregate approximation representation of the present invention;

Figure 4 is a processing flow chart of the density clustering algorithm of the present invention.

Detailed description

The embodiments of the present invention are described in further detail below with reference to the accompanying drawings:

A method for generating association rules that integrate the behavioral characteristics of energy consumers, as shown in Figure 1, Figure 2, Figure 3 and Figure 4, comprises the following steps:

The specific method of step 1 is:

Normalize the data to unit variance using z-normalization, where the normalized value x, the mean μ and the standard deviation σ are as follows:
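As a minimal illustration of this normalization step, the sketch below z-normalizes a list of meter readings. The function name `z_normalize`, the use of the population standard deviation (dividing by n), and the handling of a constant series are assumptions made for the example.

```python
import math


def z_normalize(readings):
    """Return the readings rescaled to zero mean and unit variance."""
    n = len(readings)
    mu = sum(readings) / n
    # Population standard deviation (divides by n); an assumption of this sketch.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in readings) / n)
    if sigma == 0:
        # A constant series carries no shape information; map it to zeros.
        return [0.0] * n
    return [(x - mu) / sigma for x in readings]
```

For example, `z_normalize([1.0, 2.0, 3.0])` maps the middle reading to 0 and the outer readings to values that are symmetric about 0.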

Step 2. Use symbolic aggregate approximation to convert the preprocessed smart meter time series data into a symbolic representation;

In step 2, symbolic aggregate approximation (SAX) is used to convert the preprocessed time series data into a symbolic representation. SAX achieves effective dimensionality reduction and prepares the data in a format suitable for applying different data mining techniques.

The specific steps of step 2 include:

The z-normalized time series data is discretized using cloud piecewise aggregate approximation, C(PAA). In this representation, the entropy of each cloud model is used to evaluate the data stability of its subsequence, and cloud piecewise aggregate approximation is applied again to subsequences that do not meet the requirement, thereby dividing the series into a w-dimensional space.

The specific steps of sub-step (1) of step 2 include:

(2) Convert the cloud piecewise aggregate approximation representation into a character string;

The specific method of sub-step (2) of step 2 is:

In this embodiment, English letters are used to represent the original sequence. In general, the symbol "a" denotes low energy consumption, "b" denotes average consumption, "c" denotes above-average consumption, and "d" denotes high energy consumption. After the SAX transformation, descriptive knowledge discovery techniques for time series data (for example, motif discovery and association rule mining) can be applied.
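The mapping from segment values to letters can be sketched as follows. The text does not state how the boundaries between "a", "b", "c" and "d" are chosen; this example assumes the usual SAX convention of equiprobable breakpoints under a standard normal distribution, and the function names are illustrative.

```python
import string
from bisect import bisect_right
from statistics import NormalDist


def sax_breakpoints(alphabet_size):
    """Cut points that split the standard normal into equiprobable regions."""
    return [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def to_symbols(segment_means, alphabet_size=4):
    """Map each z-normalized segment mean to a letter: 'a' is lowest, then 'b', ..."""
    cuts = sax_breakpoints(alphabet_size)
    return "".join(string.ascii_lowercase[bisect_right(cuts, v)] for v in segment_means)


word = to_symbols([-1.2, -0.3, 0.3, 1.5])  # one day with four windows
```

With an alphabet of four letters the breakpoints are approximately -0.67, 0 and 0.67, so the four example values fall into "a", "b", "c" and "d" respectively.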

Step 3. Extract the characteristic patterns and add their characteristic motifs to the motif library, subject to user-defined frequency count and availability thresholds.

The specific steps of step 3 include:

(2) Extract the required characteristic patterns from the data transformed by symbolic aggregate approximation, and add the characteristic motifs of those patterns to the motif library;

In this embodiment, after the SAX transformation, attention turns to the pattern types that SAX generates.

Each pattern type generated by SAX is labeled as either a motif or an uncommon pattern, where a motif is defined as a previously unknown, frequently occurring pattern.

The alphabet size A should be fixed at a reasonable compromise: too many symbols produce too many patterns that may never repeat, whereas too few symbols fail to capture enough consumption resolution. The number of windows per day, W, must also be chosen carefully.

A large number of patterns can be extracted from the data, and it is not necessary to use every discovered pattern for analysis. The frequency and availability of patterns play an important role in detecting the regular behavior of a smart meter. For example, the most frequent and second most frequent motifs are important for the analysis. A threshold can be set for selecting characteristic patterns. Characteristic patterns are patterns that meet given criteria, such as the number of occurrences in a specific period exceeding a threshold. The threshold can be set as the ratio of the frequency of each motif to the total count of all motifs.

In addition, an electrical expert with the relevant expertise and experience will generally review all motifs to determine which are interesting and which are not. A long period of inactivity, interrupted by a single change in the meter reading, produces a series of similar patterns. For example, with an alphabet size of 5, a long period without activity or events, apart from one small drop in energy consumption, produces patterns such as ccccca, ccccac, cccacc and so on. Because only one of these is interesting for further analysis, the other motifs, those starting with two or more c's, are excluded. A pattern that represents only an increase in energy consumption is considered meaningless. The patterns found should reflect complete behavior, and a pattern showing only an increase in energy consumption represents just the beginning of an event or activity. A useful motif includes both the start of an activity (switching a device on) and its completion (switching the device off). Finally, the time at which a pattern occurs within the day, week and month, and its availability, are important for distinguishing different energy consumption behaviors.

In this way, characteristic motifs are extracted and added to the motif library for further knowledge discovery. These patterns have the potential to characterize the energy consumption behavior of each consumer.
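The frequency-based selection described in this step can be sketched as follows. This is a simplified illustration: it implements only the threshold expressed as a motif's share of all motif occurrences, not the expert review or the availability criterion, and `select_motifs` and `min_fraction` are hypothetical names.

```python
from collections import Counter


def select_motifs(words, min_fraction):
    """Keep the SAX words whose share of all occurrences reaches min_fraction."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n for word, n in counts.items() if n / total >= min_fraction}


# Hypothetical daily SAX words for one household over five days.
daily_words = ["abca", "abca", "abca", "ccca", "abda"]
motif_library = select_motifs(daily_words, min_fraction=0.4)
```

Here only "abca" reaches the 40 % share and would be added to the library; the two words seen once are treated as uncommon patterns.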

Step 4. Perform temporal association rule mining on the characteristic motifs of the characteristic patterns newly added to the motif library, and analyze the relationships between the influencing factors that cause energy consumption to change within a specific period. If the absolute value of the correlation coefficient between these influencing factors is greater than the set value of 0.8, proceed to the next step; otherwise, return to step 3, re-extract the characteristic patterns and add their characteristic motifs to the motif library;

The specific method of step 4 is:

Here, X is the antecedent (the event that occurs first), Y is the consequent (the event that occurs afterwards), and N is the total number of records or transactions. The confidence of the rule X→Y is the ratio of the number of records that support both the antecedent and the consequent of the rule to the number of records that support the antecedent alone.

In this embodiment, ordinary association rule mining discovers cross-sectional associations without considering temporal information. Because electricity consumption has complex dynamics, temporal association rule mining is the focus of this work. Temporal association rule mining finds relationships between variables within a specific time period.

The proposed association rule mining method is based on a motif library containing characteristic motif information for specific time periods. Frequent motifs must be extracted from the library entries whose support count is greater than or equal to a user-supplied minimum support threshold. For example, if X is the antecedent and Y is the consequent, the association rule X→Y states that if X occurs, Y will also occur. The support of a rule is the ratio of the number of times the antecedent and the consequent occur together to the total number of records. The support of an association rule indicates its statistical significance. Rules with lower support indicate relationships that are uncommon, whereas rules with higher support describe relationships that are common in the records.

Here, X is the antecedent, Y is the consequent, and N is the total number of records or transactions. The confidence of the rule X→Y is the ratio of the number of records that support both the antecedent and the consequent of the rule to the number of records that support the antecedent alone. Confidence indicates statistical strength. A higher confidence indicates a stronger correlation between the antecedent and the consequent, whereas a lower confidence indicates a weak correlation.

In power dynamics, association rules that hold within a specific time period are of particular interest. Such a rule has the form X→Y annotated with a time slot T_i, indicating that Y will occur within the time slot T_i after X occurs. The relationships between appliances can be interpreted from the shape of the electricity consumption curve.
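Support and confidence for a time-constrained rule of this kind can be computed along the following lines. The data layout (one list of `(slot, motif)` events per day), the function name and the interpretation of the time slot as a maximum lag `max_lag` are assumptions made for the illustration.

```python
def temporal_rule_stats(days, x, y, max_lag):
    """Support and confidence of 'y occurs within max_lag slots after x'.

    `days` holds one record per day; each record is a list of (slot, motif) events.
    """
    n = len(days)
    with_x = 0   # records that contain the antecedent
    with_xy = 0  # records where the consequent follows within the lag
    for events in days:
        x_slots = [slot for slot, motif in events if motif == x]
        y_slots = [slot for slot, motif in events if motif == y]
        if not x_slots:
            continue
        with_x += 1
        if any(0 < ys - xs <= max_lag for xs in x_slots for ys in y_slots):
            with_xy += 1
    support = with_xy / n if n else 0.0
    confidence = with_xy / with_x if with_x else 0.0
    return support, confidence


# Hypothetical records: motif "ab" at slot 7, motif "cd" one or more slots later.
days = [
    [(7, "ab"), (8, "cd")],
    [(7, "ab"), (12, "cd")],
    [(9, "cd")],
    [(7, "ab"), (8, "cd")],
]
support, confidence = temporal_rule_stats(days, "ab", "cd", max_lag=2)
```

In this example the rule holds on two of the four days (support 0.5) and on two of the three days that contain the antecedent (confidence 2/3).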

Step 5. Cluster the characteristic motifs in the motif library using three clustering methods respectively, namely K-means clustering, a hierarchical method and a density clustering algorithm, and, after cluster analysis, perform clustering based on the characteristic patterns to generate daily consumption profiles;

In this embodiment, after motif discovery, the characteristic motifs are clustered to generate daily consumption profiles. A daily profile represents the electricity consumption pattern of a household. This step is important when, for example, there are 15 characteristic patterns and the power expert needs to aggregate them further into 5 or 6 typical energy consumption scenarios. It gives the user additional control for describing the behavior further, which can be used when choosing the parameters A or W in the SAX transformation. Various existing clustering methods target different application goals. Time series clustering is a feature-based model clustering method.

The specific steps of step 5 include:

(1) Because of the simplicity of K-means clustering, the Euclidean distance metric is used. The daily blocks (N₁, N₂, ..., N_n) are partitioned into k sets S = (S₁, S₂, ..., S_k) so as to minimize the within-cluster sum of squares.

where μ_i is the mean of the values in S_i.
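A minimal sketch of K-means with Euclidean distance is given below. It uses Lloyd's iteration and, as an assumption made only to keep the example deterministic, seeds the centroids with the first k daily blocks; the function name and the `iters` limit are illustrative.

```python
import math


def kmeans(blocks, k, iters=100):
    """Partition `blocks` (equal-length numeric tuples) into k clusters."""
    centroids = [list(b) for b in blocks[:k]]  # assumed seeding: first k blocks
    assign = None
    for _ in range(iters):
        # Assignment step: each block goes to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: math.dist(b, centroids[c])) for b in blocks
        ]
        if new_assign == assign:
            break  # converged: assignments no longer change
        assign = new_assign
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [b for b, a in zip(blocks, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


labels, centers = kmeans([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
```

In practice the seeding strategy and the choice of k have a strong effect on the result, which is one reason the clusterings are evaluated in step 6.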

(2)利用层次方法的平衡迭代规约和聚类(2) Balanced iterative reduction and clustering using hierarchical methods

利用层次方法的平衡迭代规约和聚类(Balanced Iterative Reducing andClustering Using Hierarchies，BIRCH)，是一种非常有效的、传统的层次聚类算法。在给定的基序库中，BIRCH算法能够用一遍扫描有效地进行聚类，并能够有效地处理离群点。Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) using hierarchical methods is a very effective and traditional hierarchical clustering algorithm. In a given motif library, the BIRCH algorithm can efficiently perform clustering in one pass and can effectively handle outliers.

该算法首先用自底向上的层次算法，然后用迭代的重定位来改进结果。层次凝聚是采用自底向上策略，首先将每个对象作为一个原子簇，然后合并这些原子簇形成更大的簇，减少簇的数目，直到所有的对象都在一个簇中，或某个终结条件被满足。The algorithm first uses a bottom-up hierarchical algorithm and then uses iterative relocation to improve the results. Hierarchical agglomeration adopts a bottom-up strategy, first treating each object as an atomic cluster, and then merging these atomic clusters to form larger clusters, reducing the number of clusters until all objects are in one cluster, or a certain terminal condition satisfied.

(3) Clustering with a density clustering algorithm

The density clustering algorithm defines a cluster as a maximal set of density-connected points. It can divide regions of sufficiently high density into clusters and can discover clusters of arbitrary shape in a noisy spatial database. The algorithm requires two parameters: the scanning radius (eps) and the minimum number of included points (minPts). Starting from any unvisited point, find all nearby points whose distance from it is within eps (inclusive).

If the number of nearby points ≥ minPts, the current point and its nearby points form a cluster, and the starting point is marked as visited. Then, recursively, all points in the cluster not yet marked as visited are processed in the same way, thereby expanding the cluster.

If the number of nearby points < minPts, the point is temporarily marked as a noise point.

Once the cluster has been fully expanded, that is, all points in the cluster are marked as visited, the same algorithm is applied to the remaining unvisited points.
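The density clustering procedure above can be sketched as follows. This is an illustrative version under stated assumptions: the neighbor count includes the point itself, expansion uses an explicit work list instead of recursion, and the names `dbscan`, `region_query`, and `NOISE` are hypothetical.

```python
import math

NOISE = -1  # label for points that end up in no cluster


def region_query(points, idx, eps):
    """Indices of all points within eps (inclusive) of points[idx]."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means unvisited
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # temporary: may later join a cluster as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:  # expand the cluster from every reachable dense point
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise point becomes a border point
            if labels[j] is not None:
                continue  # already visited
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The "temporarily" in the text corresponds to the `NOISE` branch inside the loop: a point first marked as noise is relabeled if a later cluster reaches it.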

Step 6. Statistically evaluate the three clustering methods using two metrics, the sum of squared errors and the silhouette coefficient, to measure how well k-means clustering can create daily consumption profile groups.

The specific steps of step 6 include:

(1) The sum of squared errors (SSE) measures the tightness of the clusters; a smaller error indicates better cluster quality. It is calculated as:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{N_j \in S_i} \lVert N_j - \mu_i \rVert^2$$

where (N₁, N₂, ..., N_n) are the daily data groups, which are divided into k sets S = (S₁, S₂, ..., S_k), and μ_i is the mean of the values in S_i.
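Both evaluation metrics named in step 6 can be computed directly from a clustering result. The sketch below is illustrative only: it takes a and b as defined in claim 8, assumes the standard silhouette form s = (b − a) / max(a, b), scores singleton clusters as 0 by convention, and requires at least two clusters. The function names and example data are hypothetical.

```python
import math


def mean_point(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mu = mean_point(cluster)
        total += sum(math.dist(p, mu) ** 2 for p in cluster)
    return total


def silhouette(clusters):
    """Mean of s = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            others = [q for qi, q in enumerate(cluster) if qi != pi]
            # a: average distance to the other points in the same cluster
            a = sum(math.dist(p, q) for q in others) / len(others)
            # b: average distance to the points of the nearest other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical one-dimensional data grouped well and grouped badly.
good = [[[0.0], [1.0]], [[10.0], [11.0]]]
bad = [[[0.0], [10.0]], [[1.0], [11.0]]]
print(sse(good), silhouette(good), silhouette(bad))
```

A lower SSE and a silhouette closer to 1 both point to the tighter grouping, so the two metrics can be compared across the K-means, hierarchical, and density results for the same motif library.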

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. An association rule generation method for comprehensive energy consumer behavior characteristics, characterized by comprising the following steps:

Step 1. Normalize the smart meter time series data;

Step 2. Using symbolic aggregate approximation (SAX), first perform cloud piecewise aggregate approximation on the normalized smart meter time series data, and then convert the result into a symbolic representation;

Step 3. Extract the feature patterns from the symbolic representation and add the feature motifs of the feature patterns to the motif library, where each feature motif satisfies the user-defined frequency count and availability threshold;

Step 4. Perform temporal association rule mining on the feature motifs of the feature patterns newly added to the motif library, and analyze the connection between the influencing factors that lead to changes in energy consumption within a specific period. If the absolute value of the correlation coefficient between the influencing factors that lead to changes in energy consumption is greater than a set value, proceed to the next step; otherwise, return to step 3 to re-extract the feature patterns and add the feature motifs of the feature patterns to the motif library;

Step 5. Perform cluster analysis on the feature motifs in the motif library using three clustering methods, namely K-means clustering, a hierarchical method, and a density clustering algorithm, and use the clustering results of the feature patterns corresponding to the feature motifs to generate daily consumption profiles;

Step 6. Statistically evaluate the daily consumption profiles generated by the three clustering methods using two metrics, the sum of squared errors and the silhouette coefficient, to measure how well the daily consumption profile groups created by the three clustering methods fit the actual daily consumption;

The specific steps of step 2 include:

(1) Convert the time series data into a cloud piecewise aggregate approximation representation;

(2) Convert the cloud piecewise aggregate approximation representation into a character string.

2. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific method of step 1 is:

Normalize the data to unit variance using z-normalization, where the normalized value, the mean μ, and the standard deviation σ are as follows:

In the above formula, x_i is the data to be processed, and n is the number of data items to be processed.

3. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific steps of step 2 (1) include:

① Represent the current segmented sequence data with cloud models;

② Use the entropy of each cloud model to evaluate the data stability of the corresponding subsequence, and select the subsequence with the worst stability (the smallest entropy), denoted Q(i₀, j₀), for piecewise aggregation;

③ In the subsequence Q(i₀, j₀), find a data point to serve as the key point q_k, with i₀ < k < j₀, such that the difference between the sum of the cloud model entropies of the two subsequences it separates, Q(i₀, k) and Q(k, j₀), and the cloud model entropy of the subsequence Q(i₀, j₀) is the largest; then delete the subsequence Q(i₀, j₀) and record the subsequences Q(i₀, k) and Q(k, j₀);

④ Repeat ① to ③ until the stop condition is met.

4. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific method of step 2 (2) is:

After the cloud piecewise aggregate approximation, the discretized representation is symbolized, converting the time series into a symbolic string.

5. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific steps of step 3 include:

(1) After the symbolic aggregate approximation (SAX) conversion, SAX pattern types are generated, and each pattern type generated by SAX is marked as either a motif or an uncommon pattern, where a motif is defined as a previously unknown, frequently occurring pattern;

(2) Extract the required feature patterns from the SAX-converted data, and add the feature motifs of the feature patterns to the motif library.

6. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific method of step 4 is:

where X is the antecedent (the event that occurs first), Y is the consequent (the event that occurs later), and N is the total number of records or transactions; the confidence is expressed as a fraction.

7. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific steps of step 5 include:

(1) Cluster the feature motifs in the motif library using the K-means clustering method:

Using the Euclidean distance metric, the daily blocks (N₁, N₂, ..., N_n) are divided into k sets S = (S₁, S₂, ..., S_k) so as to minimize the sum of squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{N_j \in S_i} \lVert N_j - \mu_i \rVert^2$$

where μ_i is the mean of the values in S_i;

(2) Cluster the feature motifs in the motif library using the hierarchical method:

A bottom-up hierarchical algorithm is used for clustering, and an iterative relocation method is then used to improve the clustering results obtained by the hierarchical algorithm;

(3) Cluster the feature motifs in the motif library using the density clustering algorithm:

① Set the values of the scanning radius eps and the minimum number of included points minPts;

② Select an unvisited point and find all nearby points within the scanning radius eps;

③ If the number of nearby points ≥ the minimum number of included points minPts, the current point and its nearby points form a cluster, and the starting point is marked as visited; then, recursively, process all points in the cluster that are not marked as visited in the same way, thereby expanding the cluster; if the number of nearby points < the minimum number of included points minPts, the point is temporarily marked as a noise point;

④ Once the cluster in step ③ has been fully expanded, that is, all points in the cluster are marked as visited, apply the same algorithm to the remaining unvisited points.

8. The association rule generation method for comprehensive energy consumer behavior characteristics according to claim 1, characterized in that the specific steps of step 6 include:

(1) The sum of squared errors (SSE) measures the tightness of the clusters; the smaller the SSE, the better the cluster quality. It is calculated as:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{N_j \in S_i} \lVert N_j - \mu_i \rVert^2$$

where (N₁, N₂, ..., N_n) are the daily data groups, which are divided into k sets S = (S₁, S₂, ..., S_k), and μ_i is the mean of the values in S_i;

(2) The silhouette coefficient measures intra-cluster cohesion and inter-cluster separation; a coefficient of 1 is the best and -1 is the worst. The silhouette coefficient is calculated as:

$$s = \frac{b - a}{\max(a, b)}$$

where a is the average distance between a data instance and all other points in the same cluster, and b is the average distance between the data instance and all points in the nearest other cluster.

9. An association rule generation device for comprehensive energy consumer behavior characteristics, comprising:

a normalization processing module, configured to normalize the smart meter time series data;

a symbolic conversion module, configured to use symbolic aggregate approximation (SAX) to first perform cloud piecewise aggregate approximation on the normalized smart meter time series data and then convert the result into a symbolic representation;

a feature pattern extraction module, configured to extract the feature patterns from the symbolic representation and add to the motif library the feature motifs of the feature patterns that satisfy the user-defined frequency count and availability threshold;

a temporal association rule mining module, configured to perform temporal association rule mining on the feature motifs of the feature patterns newly added to the motif library and to analyze the connection between the influencing factors that lead to changes in energy consumption within a specific period; if the connection between the influencing factors that lead to changes in energy consumption is strong, processing proceeds to the next step; if the connection is weak, processing returns to the feature pattern extraction module to re-extract the feature patterns and add the feature motifs of the feature patterns to the motif library;

a clustering module, configured to perform cluster analysis on the feature motifs in the motif library using three clustering methods, namely K-means clustering, a hierarchical method, and a density clustering algorithm, and to use the clustering results of the feature patterns corresponding to the feature motifs to generate daily consumption profiles;

a clustering result evaluation module, configured to statistically evaluate the daily consumption profiles generated by the three clustering methods using two metrics, the sum of squared errors and the silhouette coefficient, to measure how well the daily consumption profile groups created by the three clustering methods fit the actual daily consumption.