CN111461171A

CN111461171A - A data optimization method and system for building a prediction model of silicon content in blast furnace hot metal

Info

Publication number: CN111461171A
Application number: CN202010143429.1A
Authority: CN
Inventors: 尹林子; 关羽吟; 蒋朝辉; 许雪梅
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-03-04
Filing date: 2020-03-04
Publication date: 2020-07-28
Anticipated expiration: 2040-03-04
Also published as: CN111461171B

Abstract

The invention discloses a data optimization method and system for constructing a prediction model of silicon content in blast furnace hot metal. The unsupervised learning algorithm k-means++ is used to cluster the input variable samples according to their degree of similarity, and abnormal data are removed from the complex blast furnace hot metal data, which strengthens the mapping relationship between input variable samples and silicon content data. At the same time, high-confidence samples are screened based on a continuous time period index, which effectively reduces the interference of abnormal samples with the mapping relationship. The high-frequency interval of silicon content is determined from a frequency histogram, which overcomes the conservativeness of the traditional averaging method and provides the best-matching silicon content data for the input variable samples. Verification with an evaluation model shows that the data optimization method performs better in model training than the traditional mean method. The invention solves the problem that the mapping relationship between the original input data and the silicon content is weak, and can effectively improve the training effect of the prediction model.

Description

A data optimization method and system for constructing a prediction model of silicon content in blast furnace hot metal

Technical field

The application belongs to the field of silicon content prediction for blast furnace hot metal, and in particular relates to a data optimization method and system for constructing a prediction model of silicon content in blast furnace hot metal.

Background

Prediction of the silicon content of hot metal is one of the keys to optimal control of a blast furnace. At present, most researchers take a data-driven approach to building silicon content prediction models, and these models place high demands on the quality of the training data set. However, because blast furnace smelting has multi-scale characteristics and the data collection environment is harsh, the collected raw data suffer from serious anomalies and missing values. In particular, the raw silicon content data are unevenly distributed and noisy, so the mapping relationship between the input variables and the silicon content is weak. Specifically, the input variables are collected by sensors and recorded at hourly intervals, whereas the silicon content data are sampled and assayed manually because of process limitations. Within some input variable recording intervals there are many silicon content values, which fluctuate widely and are subject to time lag. In that case it is difficult to determine which silicon content value best matches the input variable, which greatly hinders the acquisition of a high-quality training set and the construction of a stable prediction model.

In order to strengthen the mapping relationship between input variables and silicon content, some scholars have proposed calculating the arithmetic mean within a certain time interval. For example, Song Jinghua, Research on multi-scale characteristics of the blast furnace smelting process and silicon content prediction methods [D], Zhejiang University, 2016, and Chu Y, Gao C. Data-based multiscale modeling for blast furnace system [J]. AIChE Journal, 2014, 60(6): 2197-2210, disclose a ladle sample analysis method in which two silicon content values are collected in sequence during tapping and their arithmetic mean is taken. Liu Min, Research and prediction of blast furnace silicon content based on a fuzzy model [D], Inner Mongolia University of Science and Technology, 2012, fuses the data of each input using a 30 min sampling interval, that is, calculates the arithmetic mean of the data within each 30 min period.

However, we found that calculating the arithmetic mean is effective only for uniformly sampled time series data. For data with non-uniform time intervals it works poorly because the time series contains few data points. When several widely fluctuating silicon content values exist within one input variable recording interval, the mean method is conservative and is easily disturbed by noise, which pushes the silicon content away from the correct range.

Summary of the invention

The purpose of the present invention is to provide a data optimization method for a prediction model of silicon content in blast furnace hot metal. It addresses the non-uniform recording intervals, large data fluctuations, and weak mapping relationship between input variable samples and silicon content data found in raw blast furnace hot metal data, and establishes a reasonable association between each input variable sample and a well-matched silicon content value, thereby providing a high-quality training set that improves the performance of data-driven silicon content prediction models.

The technical scheme provided by the present invention is as follows:

In one aspect, a data optimization method for constructing a prediction model of silicon content in blast furnace hot metal comprises:

obtaining blast furnace hot metal production sample data, including production process input variable samples and silicon content samples;

clustering the production process input variable samples with a clustering model, and removing abnormal input variable samples;

extracting a representative time period for each cluster from that cluster's input variable samples, based on a set continuous time length index;

drawing a frequency histogram of the silicon content values contained in the representative time period of each cluster, and determining the high-frequency interval of the silicon content values of each cluster;

based on the high-frequency interval of the silicon content values of each cluster, selecting for each input variable sample of each cluster the best silicon content value from all of its corresponding silicon content samples.

By applying a clustering method, this scheme groups the input variable samples into clusters according to their degree of similarity and removes abnormal data from the complex blast furnace hot metal data. At the same time, the continuous time period index and the frequency histogram indicate the reasonable silicon content range corresponding to the input variables of each cluster, and the best silicon content value is associated with each input sample on that basis. This solves the problem of the weak mapping relationship between the original input data and the silicon content, and can effectively improve the performance of the prediction model.

Further, the specific process of selecting the best silicon content value for the input variable samples of each cluster is as follows:

The silicon content values whose assay time falls within one unit recording interval starting from the recording time of the corresponding input variable form the set Si. It is judged whether the silicon content values in Si lie in the silicon content high-frequency interval of the cluster to which the input variable belongs. If so, the silicon content value in Si that comes first by recording time is selected as the best silicon content value for the input variable; if not, the value in Si closest to the midpoint of the high-frequency interval is selected as the best silicon content value. The selected silicon content value is then no longer a candidate best silicon content value for other input variables.

Further, the silicon content high-frequency interval comprises two intervals, a first high-frequency interval and a second high-frequency interval. The first high-frequency interval is the interval with the highest frequency in the frequency histogram of silicon content values, and the second high-frequency interval is the interval with the second-highest frequency. When the frequency histogram is drawn, the silicon content values are uniformly discretized into 10 discrete intervals.

When judging whether the silicon content values in Si lie in the silicon content high-frequency interval of the cluster to which the input variable belongs, the first high-frequency interval and the second high-frequency interval are checked in turn.
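The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `pick_best_silicon`, the tuple layout of `assays`, and the `used` set are hypothetical. Two readings of the text are assumed and marked in the comments: "first by recording time" is taken to mean the earliest value that lies inside the interval being checked, and the fallback uses the midpoint of the first high-frequency interval.

```python
from datetime import datetime, timedelta


def pick_best_silicon(record_time, assays, intervals, used, step=timedelta(hours=1)):
    """Return the index in `assays` of the chosen silicon value, or None.

    assays    : list of (assay_time, silicon_value) tuples, in any order
    intervals : [(lo1, hi1), (lo2, hi2)], first and second high-frequency intervals
    used      : set of assay indices already matched to another input sample
    """
    # Set Si: unused assays taken within [record_time, record_time + step), earliest first
    window = sorted(
        (i for i, (when, _) in enumerate(assays)
         if i not in used and record_time <= when < record_time + step),
        key=lambda i: assays[i][0],
    )
    if not window:
        return None
    # Check the first interval, then the second.
    # Assumption: take the earliest value that lies inside the interval.
    for lo, hi in intervals:
        for i in window:
            if lo <= assays[i][1] <= hi:
                used.add(i)
                return i
    # Fallback. Assumption: "midpoint" means the midpoint of the first interval.
    lo, hi = intervals[0]
    mid = (lo + hi) / 2
    best = min(window, key=lambda i: abs(assays[i][1] - mid))
    used.add(best)
    return best
```

Passing the same `used` set across successive calls is what keeps a selected value from being offered to another input sample.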

Further, the specific process of extracting the representative time period of each cluster from its input variable samples is as follows:

Step 3.1): sort all samples in the cluster in order of recording time to obtain the sample time series T of the cluster;

Step 3.2): according to the continuity of the recording times, divide the series T into continuous time subsequences, T = {T₁, T₂, …, T_l}; the continuous time length of T_i is denoted L(T_i), 1 ≤ i ≤ l, where l is the total number of continuous time subsequences;

Step 3.3): calculate the proportion ρ of the cluster taken up by continuous time subsequences whose time length is greater than β:

$$\rho = \frac{\sum_{L(T_i) > \beta} L(T_i)}{L(T)}$$

where L(T) is the sum of the time lengths of all input sample records in T, and β is the set duration in hours, with an initial value of 5;

Step 3.4): if ρ < 0.6, reduce the set duration by letting β = β − 1 and return to step 3.3); otherwise, delete from T all continuous time subsequences whose duration is less than β. The recording times remaining in T are then taken as the representative time period of the cluster.
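Steps 3.1) to 3.4) can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function name `representative_period` is hypothetical, L(T_i) is counted as the number of hourly samples in a run, and the comparison operators follow the wording of the text literally (runs longer than β count toward ρ, runs shorter than β are deleted).

```python
from datetime import datetime, timedelta


def representative_period(times, beta=5, min_ratio=0.6, step=timedelta(hours=1)):
    """Return (kept_times, final_beta) for the recording times of one cluster."""
    ordered = sorted(times)                       # step 3.1: sort by recording time
    if not ordered:
        return [], beta
    runs = [[ordered[0]]]                         # step 3.2: split into consecutive runs
    for t in ordered[1:]:
        if t - runs[-1][-1] == step:
            runs[-1].append(t)
        else:
            runs.append([t])
    total = len(ordered)                          # L(T), counted in samples (hours)
    while True:
        # step 3.3: share of the cluster covered by runs longer than beta
        rho = sum(len(r) for r in runs if len(r) > beta) / total
        if rho >= min_ratio:
            break
        beta -= 1                                 # step 3.4: relax the threshold
    # step 3.4: delete runs shorter than beta, keep the rest
    kept = [t for r in runs if len(r) >= beta for t in r]
    return kept, beta
```

The loop always terminates, because at β = 0 every run is longer than β and ρ equals 1.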

Further, in the process of clustering the production process input variable samples with a clustering model and removing abnormal input variable samples, the k-means++ clustering model is used, specifically as follows:

Step 2.1): use the k-means++ algorithm to cluster the input variable samples in the input variable sample set U into k clusters; after clustering, if the number of samples in a cluster accounts for less than 2% of the total number of samples, delete all input variable samples of that cluster;

Step 2.2): repeat step 2.1) until the number of samples in every cluster accounts for no less than 2% of the total number of samples.

Classifying the samples with the k-means++ clustering algorithm distinguishes different furnace conditions and strengthens the mapping relationship between the input variable samples and the silicon content data.

For details of the k-means++ algorithm, see: Arthur D, Vassilvitskii S. K-Means++: The Advantages of Careful Seeding [C] // Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007. ACM, 2007.

The silhouette coefficient of a clustering result lies in [-1, 1]; k is preferably chosen so that the silhouette coefficient is as close to 1 as possible.
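The repeat-until-stable pruning of steps 2.1) and 2.2) can be sketched as a loop around any clustering routine. In this sketch the clusterer is passed in as `cluster_fn`, and `nearest_center` is a hypothetical 1-D stand-in used only to make the example self-contained; it is not k-means++. In practice a wrapper around a k-means++ implementation would be supplied instead. The share is assumed to be measured against the samples still remaining, which the text does not state explicitly.

```python
from collections import Counter


def prune_small_clusters(samples, cluster_fn, k, min_share=0.02):
    """Re-cluster until every non-empty cluster holds at least min_share of the samples.

    cluster_fn(samples, k) must return one cluster label per sample.
    Returns (kept_samples, labels_of_kept_samples).
    """
    samples = list(samples)
    while samples:
        labels = cluster_fn(samples, k)                       # step 2.1: cluster
        counts = Counter(labels)
        # Assumption: the share is taken relative to the samples still remaining
        small = {c for c, n in counts.items() if n / len(samples) < min_share}
        if not small:                                         # step 2.2: stable, stop
            return samples, labels
        samples = [s for s, c in zip(samples, labels) if c not in small]
    return [], []


def nearest_center(samples, k, centers=(0.0, 10.0, 100.0)):
    """Hypothetical stand-in for k-means++: label = index of the nearest fixed centre."""
    return [min(range(k), key=lambda c: abs(s - centers[c])) for s in samples]


data = [0.1] * 60 + [9.9] * 39 + [100.0]      # the single outlier forms a 1% cluster
kept, labels = prune_small_clusters(data, nearest_center, k=3)
print(len(kept), sorted(set(labels)))         # prints: 99 [0, 1]
```

Here the outlier cluster holds 1 of 100 samples, so it is deleted on the first pass, and the second pass finds no cluster below 2%.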

Further, the input variable samples include recording time c₁, oxygen enrichment rate c₂, carbon monoxide c₃, hydrogen c₄, carbon dioxide c₅, standard wind speed c₆, oxygen enrichment flow c₇, cold blast flow c₈, blast kinetic energy c₉, bosh gas volume c₁₀, bosh gas index c₁₁, theoretical combustion temperature c₁₂, top pressure c₁₃, oxygen enrichment pressure c₁₄, cold blast pressure c₁₅, total pressure difference c₁₆, hot blast pressure c₁₇, actual wind speed c₁₈, hot blast temperature c₁₉, top temperature northeast c₂₀, top temperature southwest c₂₁, top temperature northwest c₂₂, top temperature southeast c₂₃, resistance coefficient c₂₄, blast humidity c₂₅, set coal injection amount c₂₆ and previous-hour coal injection amount c₂₇. The silicon content samples include assay time d₁ and silicon content data d₂. The recording period of the input variable samples is 1 hour.

In another aspect, a data optimization system for constructing a prediction model of silicon content in blast furnace hot metal comprises:

a sample data acquisition unit, used to obtain blast furnace hot metal production sample data, including production process input variable samples and silicon content samples;

a sample clustering and elimination unit, used to cluster the production process input variable samples with a clustering model and remove abnormal input variable samples;

a representative time period extraction unit, which extracts a representative time period for each cluster from that cluster's input variable samples, based on a set continuous time length index;

a silicon content high-frequency interval drawing and determination unit, used to draw a frequency histogram of the silicon content values contained in the representative time period of each cluster and determine the high-frequency interval of the silicon content values of each cluster;

a best silicon content selection unit, which, based on the high-frequency interval of the silicon content values of each cluster, selects for each input variable sample of each cluster the best silicon content value from all of its corresponding silicon content samples.

Further, the best silicon content selection unit includes:

a module for building the set of silicon content values within a unit sampling interval, used to form the set Si from the silicon content values whose assay time falls within one unit recording interval starting from the recording time of the corresponding input variable;

a judgment module, used to judge whether the silicon content values in Si lie in the silicon content high-frequency interval of the cluster to which the input variable belongs; if so, the silicon content value in Si that comes first by recording time is selected as the best silicon content value for the input variable; if not, the value in Si closest to the midpoint of the high-frequency interval is selected as the best silicon content value; the selected silicon content value is then no longer a candidate best silicon content value for other input variables.

Further, the representative time period extraction unit includes:

a sample time series acquisition module, used to sort all samples in the cluster in order of recording time to obtain the sample time series T of the cluster;

a continuous time subsequence division unit, used to divide the series T into continuous time subsequences according to the continuity of the recording times, T = {T₁, T₂, …, T_l}; the continuous time length of T_i is denoted L(T_i), 1 ≤ i ≤ l, where l is the total number of continuous time subsequences;

a proportion calculation module, used to calculate the proportion ρ of the cluster taken up by continuous time subsequences whose time length is greater than β;

where L(T) is the sum of the time lengths of all input sample records in T, and β is the set duration in hours, with an initial value of 5;

a representative time period selection module, which makes a judgment based on the proportion obtained by the proportion calculation module: if ρ < 0.6, it reduces the set duration by letting β = β − 1, calls the proportion calculation module again to recalculate the proportion, and then calls the representative time period selection module again; otherwise, it deletes from T all continuous time subsequences whose duration is less than β, and the recording times remaining in T are taken as the representative time period of the cluster.

Further, the sample clustering and elimination unit uses the k-means++ clustering model: the k-means++ algorithm clusters the input variable samples in the input variable sample set U into k clusters; after clustering, if the number of samples in a cluster accounts for less than 2% of the total number of samples, all input variable samples of that cluster are deleted; the k-means++ clustering model is called repeatedly until the number of samples in every cluster accounts for no less than 2% of the total number of samples.

Beneficial effects

The invention provides a data optimization method and system for constructing a prediction model of silicon content in blast furnace hot metal. The unsupervised learning algorithm k-means++ is used to cluster the input variable samples according to their degree of similarity, and abnormal data are removed from the complex blast furnace hot metal data, which strengthens the mapping relationship between input variable samples and silicon content data. At the same time, high-confidence samples are screened based on a continuous time period index, which effectively reduces the interference of abnormal samples with the mapping relationship. The high-frequency interval of silicon content is determined from a frequency histogram, which overcomes the conservativeness of the traditional averaging method and provides the best-matching silicon content data for the input variable samples. Verification with an evaluation model shows that, compared with the traditional mean method, the data optimization method improves the accuracy of silicon content prediction. The invention solves the problem that the mapping relationship between the original input data and the silicon content is weak, and can effectively improve the performance of the prediction model.

Description of the drawings

Figure 1 is a schematic flowchart of Embodiment 1 of the present invention;

Figure 2 is a schematic diagram of the k-means++ clustering result after removing abnormal clusters;

Figure 3 shows the silicon content frequency histogram of each cluster, where (a) is ClusterA, (b) is ClusterB, (c) is ClusterC, (d) is ClusterD, and (e) is ClusterE.

Detailed description

The present invention is further described below with reference to the accompanying drawings and embodiments.

Embodiment 1

This embodiment is illustrated with actual blast furnace production data collected from 00:00 on October 1, 2017 to 23:00 on October 31, 2017 at a 2650 m³ blast furnace in a domestic steel plant.

As shown in Figure 1, a data optimization method for a prediction model of silicon content in blast furnace hot metal proceeds as follows:

Step 1: Obtain blast furnace hot metal production sample data, including production process input variable samples and silicon content samples.

The input variable sample set U includes 27 input variables, as shown in Table 1: recording time, oxygen enrichment rate, carbon monoxide, hydrogen, carbon dioxide, standard wind speed, oxygen enrichment flow, cold blast flow, blast kinetic energy, bosh gas volume, bosh gas index, theoretical combustion temperature, top pressure, oxygen enrichment pressure, cold blast pressure, total pressure difference, hot blast pressure, actual wind speed, hot blast temperature, top temperature northeast, top temperature southwest, top temperature northwest, top temperature southeast, resistance coefficient, blast humidity, set coal injection amount and previous-hour coal injection amount. These are collected by sensors at a recording interval of 1 hour. The silicon content sample set V, shown in Table 2, includes the assay time and the assayed silicon content value; these are sampled and assayed manually, at irregular recording intervals. The sample set contains 744 input variable samples and 1478 silicon content records in total.

Table 1 Sample set of production process input variables

Table 2 Silicon content sample set

Step 2: Cluster the production process input variable samples with a clustering model, and remove abnormal input variable samples.

2.1) Cluster the samples into k clusters. After repeated trials in which the silhouette coefficient of each clustering result was calculated, k = 5, which gave a larger silhouette coefficient, was finally selected. After clustering, if the number of samples in a cluster accounts for less than 2% of the total number of samples, delete all input variable samples of that cluster.

2.2) Repeat step 2.1) until the number of samples in every cluster accounts for no less than 2% of the total number of samples. The final sample clustering result is shown in Figure 2.
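The silhouette coefficient used to compare candidate values of k can be computed as sketched below. This is a plain-Python illustration with Euclidean distance, not code from the patent; the function name `silhouette_score` and the toy data are hypothetical. Each point's score is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; a point alone in its cluster scores 0.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient; points are tuples of floats, labels are cluster ids."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            continue                      # a singleton contributes a score of 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for label, other in clusters.items() if label != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(pts, [0, 0, 1, 1])    # compact, well separated clusters
bad = silhouette_score(pts, [0, 1, 0, 1])     # clusters that cut across the groups
```

With this toy data the well-separated labelling scores close to 1 and the mixed labelling scores below 0, which is the ordering relied on when picking k.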

Step 3: Based on the set continuous time period length index, extract the representative time period of each cluster from its input variable samples.

3.1) Label the clustering results ClusterA, ClusterB, ClusterC, ClusterD and ClusterE; the numbers of samples in the clusters are 26, 129, 71, 393 and 116 respectively. Sort all samples in each cluster in order of recording time to obtain the sample time series of the cluster, T = {t₁, t₂, …, t_h};

3.2) According to the continuity of the recording times, divide the series T into continuous time subsequences, T = {T₁, T₂, …, T_l}, where the continuous time length of T_i (1 ≤ i ≤ l) is denoted L(T_i);

3.3) Take β = 5 and calculate the proportion ρ of each cluster taken up by continuous time subsequences whose time length is greater than β:

$$\rho = \frac{\sum_{L(T_i) > \beta} L(T_i)}{L(T)}$$

For ClusterA, ClusterB, ClusterD and ClusterE, the total lengths of these continuous time subsequences are 17, 109, 366 and 103 respectively, accounting for 65%, 84%, 93% and 89% of the corresponding clusters.

3.4) When β = 5, the continuous time subsequences of ClusterC account for a proportion ρ < 0.6 of the cluster, so the set duration is reduced until, at β = 3, ρ = 0.63. The time subsequences in each cluster's T whose duration is less than β are removed, and the recording times remaining in T are taken as the representative time period of the cluster. These statistics are recorded in Table 3.
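The percentages quoted for β = 5 can be checked directly from the cluster sizes in step 3.1) and the subsequence totals in step 3.3). The short Python check below uses only those published figures; ClusterC is left out because its subsequence total is not given in the text.

```python
# (cluster size, total length of continuous subsequences longer than beta = 5)
clusters = {
    "ClusterA": (26, 17),
    "ClusterB": (129, 109),
    "ClusterD": (393, 366),
    "ClusterE": (116, 103),
}
# rho as a rounded percentage: subsequence total divided by cluster size
shares = {name: round(100 * kept / size) for name, (size, kept) in clusters.items()}
print(shares)   # prints: {'ClusterA': 65, 'ClusterB': 84, 'ClusterD': 93, 'ClusterE': 89}
```

All four values are above the 0.6 threshold, which is why only ClusterC needs a smaller β.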

Table 3 Statistics of each cluster

Step 4: Draw a frequency histogram of the silicon content values contained in the representative time period of each cluster, and determine the "high-frequency interval" of the silicon content values of each cluster.

4.1) From the silicon content sample set V, select the silicon content data set D_T whose recording times fall within the representative time period T of each cluster; that is, for each silicon content sample in V, if its recording time y_m1 falls within T, then y_m2 is added to the data set D_T, where y_m1 and y_m2 denote the recording time of the silicon content sample and the corresponding silicon content value, respectively.

4.2) Draw the frequency distribution histogram of D_T for each cluster, with the silicon content value on the horizontal axis and the frequency on the vertical axis, as shown in Figure 3. Because the actual silicon content is a continuous (analog) quantity, the silicon content values are uniformly discretized into 10 discrete intervals, and the "first high-frequency interval" and "second high-frequency interval" are identified. For the clusters in order, the ("first high-frequency interval", "second high-frequency interval") pairs are ([0.301, 0.342], [0.342, 0.383]), ([0.516, 0.58], [0.452, 0.516]), ([0.458, 0.534], [0.382, 0.458]), ([0.536, 0.605], [0.467, 0.536]), ([0.49, 0.528], [0.414, 0.452]);
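Steps 4.1) and 4.2) can be sketched in Python as follows. The function names and data layout are hypothetical, and two assumptions are made that the text does not spell out: a silicon value counts as lying within T when its assay time falls in the hour starting at one of the representative recording times, and the 10 equal-width bins span the minimum to the maximum value of the cluster.

```python
from datetime import datetime, timedelta


def silicon_in_period(rep_times, assays, step=timedelta(hours=1)):
    """Step 4.1: keep silicon values assayed within [t, t + step) for some t in rep_times."""
    return [value for when, value in assays
            if any(t <= when < t + step for t in rep_times)]


def high_frequency_intervals(values, bins=10, top=2):
    """Step 4.2: equal-width histogram; return the `top` fullest bins as (low, high, count)."""
    lo, hi = min(values), max(values)
    if hi == lo:                                  # all values identical: a single bin
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value would index one past the end, so clamp it into the last bin
        counts[min(int((v - lo) / width), bins - 1)] += 1
    # sorted() is stable, so ties keep the lower bin first
    ranked = sorted(range(bins), key=lambda i: counts[i], reverse=True)[:top]
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in ranked]
```

The first tuple returned corresponds to the first high-frequency interval and the second tuple to the second high-frequency interval.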

Step 5: Based on the high-frequency interval of the silicon content of each cluster, select for each input variable sample of each cluster the best silicon content value from all of its corresponding silicon content samples.

5.1) Table 4 shows, for each cluster, the proportion of the cluster's total samples whose input variable recording interval contains a given number of silicon content values. A corresponding silicon content value is selected for each sample in order of the samples' recording times.

Table 4 Proportion of each cluster's total samples by number of silicon content values within the input variable recording interval

After selection is complete, a new data sample set is generated as the data set for subsequent prediction work. Of the 1478 silicon content values, 731 lie in the high-frequency intervals; silicon content values are matched to the 735 samples remaining after anomaly removal. The matching results are summarized in Table 5.

Table 5 Number of samples that can be matched with a silicon content value in the high-frequency intervals of each cluster

The optimized data set selected in this example strengthens the mapping relationship between input variable samples and silicon content data through the clustering algorithm, and reduces noise interference through the set continuous time period index. Compared with the traditional mean method it is more reliable, and it overcomes the conservativeness of the traditional averaging method.

实施例二Embodiment 2

基于上述方法，本发明实施例还提供一种用于构建高炉铁水硅含量预测模型的数据优选系统，包括：Based on the above method, the embodiment of the present invention also provides a data optimization system for constructing a prediction model of silicon content in blast furnace hot metal, including:

其中，最佳硅含量选取单元包括：Among them, the optimal silicon content selection unit includes:

其中，所述代表时间段提取单元包括：Wherein, the representative time period extraction unit includes:

其中，所述样本聚类与剔除单元中使用k-means++聚类模型进行聚类，通过使用k-means++算法将输入变量样本集U中的输入变量样本聚类为k簇，聚类完成后，若某簇样本总数在总样本数中占比小于2％，则删除该簇的所有输入变量样本；重复调用k-means++聚类模型进行聚类，直至所有簇内样本总数在总样本数中的占比均不小于2％。Wherein, the k-means++ clustering model is used in the sample clustering and elimination unit for clustering, and the input variable samples in the input variable sample set U are clustered into k clusters by using the k-means++ algorithm. After the clustering is completed, If the total number of samples in a certain cluster accounts for less than 2% of the total number of samples, delete all input variable samples of the cluster; repeatedly call the k-means++ clustering model for clustering, until the total number of samples in all clusters is less than the total number of samples in the total number of samples The proportion is not less than 2%.

It should be understood that the functional unit modules in the various embodiments of the present invention may be integrated in one processing unit, may each exist physically on their own, or may be combined two or more to a unit module, and may be implemented in the form of hardware or software.

The above is only a description of the preferred embodiments of the present invention. The protection scope of the present invention is not limited to the above embodiments, and all technical solutions that fall within the idea of the present invention belong to its protection scope. Modifications, supplements, or substitutions in a similar manner that those skilled in the art make to the described specific embodiments without departing from the spirit and principle of the present invention are regarded as falling within the protection scope of the present invention.

Claims

1. A data optimization method for constructing a prediction model of the silicon content of blast furnace molten iron is characterized by comprising the following steps:

obtaining blast furnace molten iron production sample data, including a production process input variable sample and a silicon content sample;

clustering the input variable samples in the production process by using a clustering model, and removing abnormal input variable samples;

extracting a representative time period for each cluster from the input variable samples of that cluster, based on a set continuous time length index;

drawing a frequency histogram of the silicon content values contained in the representative time period of each cluster, and determining a high-frequency interval of silicon content values for each cluster;

and selecting the optimal silicon content value from all corresponding silicon content samples for each input variable sample of each cluster based on the high-frequency interval of the silicon content values of each cluster.

2. The method of claim 1, wherein the selection of the optimal silicon content value for each cluster of input variable samples is performed by:

forming a set Si from the silicon content values whose assay time falls between the recording time of the corresponding input variable sample and the end of one unit recording interval after that time; judging whether any silicon content value in Si lies within the high-frequency interval of silicon content values of the cluster to which the input variable sample belongs; if so, selecting, from the silicon content values in Si that lie within the high-frequency interval, the one with the earliest recording time as the optimal silicon content value corresponding to the input variable sample; if not, selecting from Si the silicon content value closest to the midpoint value of the high-frequency interval as the optimal silicon content value corresponding to the input variable sample, the selected silicon content value no longer being used as a candidate optimal silicon content value for any other input variable sample.
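The selection rule of this claim can be sketched as follows. This is an illustrative reading rather than the claimed implementation: the function name `match_silicon`, the half-open window [t, t + 1 h), the reading of the fallback target as the midpoint of the first high-frequency interval, and the exclusion of every selected value (not only fallback selections) from later matching are assumptions.

```python
from datetime import datetime, timedelta


def match_silicon(sample_times, assays, intervals, period=timedelta(hours=1)):
    """Pick one silicon value per input sample.

    sample_times: recording times of the input variable samples of one cluster
    assays:       list of (assay_time, silicon_value)
    intervals:    high-frequency intervals (low, high), in priority order
    """
    used = set()   # indices of assays already assigned to a sample
    matched = {}   # recording time -> selected silicon value
    for t in sorted(sample_times):
        # S_i: unused assays taken within one recording interval after t,
        # ordered by assay time.
        s_i = sorted((at, v, idx) for idx, (at, v) in enumerate(assays)
                     if idx not in used and t <= at < t + period)
        if not s_i:
            continue  # no assay available for this sample
        choice = None
        for low, high in intervals:
            inside = [c for c in s_i if low <= c[1] <= high]
            if inside:
                choice = inside[0]  # earliest assay inside the interval
                break
        if choice is None:
            # Fallback: value closest to the midpoint of the first interval.
            low, high = intervals[0]
            mid = (low + high) / 2
            choice = min(s_i, key=lambda c: abs(c[1] - mid))
        used.add(choice[2])
        matched[t] = choice[1]
    return matched
```

With hourly input records the windows do not overlap, so the `used` set only matters if a different window definition is chosen.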

3. The method of claim 2, wherein the high-frequency interval of silicon content values comprises two intervals, a first high-frequency interval and a second high-frequency interval, wherein the first high-frequency interval is the interval with the highest frequency in the frequency histogram of silicon content values and the second high-frequency interval is the interval with the second-highest frequency in the frequency histogram of silicon content values; when the frequency histogram of silicon content values is drawn, the silicon content values are uniformly discretized into 10 discrete intervals;

and, when judging whether a silicon content value in Si lies within the high-frequency interval of silicon content values of the cluster to which the input variable sample belongs, the first high-frequency interval and the second high-frequency interval are checked in sequence.
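A minimal sketch of how the two high-frequency intervals might be obtained from a 10-bin histogram is given here. The function name `high_frequency_intervals` is hypothetical, and two points are assumptions rather than statements of the claim: the histogram range is taken from the minimum to the maximum observed silicon content value, and ties between equally frequent bins are broken in favour of the lower bin.

```python
def high_frequency_intervals(values, bins=10):
    """Return the most frequent and second most frequent bins as (low, high)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; no histogram can be built")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # Bin indices ordered by descending count; sorted() is stable, so equal
    # counts keep ascending bin order.
    ranked = sorted(range(bins), key=lambda i: counts[i], reverse=True)
    return [(lo + i * width, lo + (i + 1) * width) for i in ranked[:2]]
```

The first returned pair plays the role of the first high-frequency interval and the second pair that of the second high-frequency interval.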

4. The method of claim 1, wherein the specific process of extracting the representative time period of each cluster from the input variable samples of each cluster is as follows:

step 3.1), sequencing all samples in the cluster according to the recording time sequence to obtain a sample time sequence T of the cluster;

step 3.2), dividing the sequence $T$ into different continuous-time subsequences according to the continuity of the recording time: $T = \{T_1, T_2, \dots, T_l\}$, wherein the time length of $T_i$ is denoted $L(T_i)$, $1 \le i \le l$, and $l$ is the total number of continuous-time subsequences;

step 3.3), calculating the proportion $\rho$ of the cluster's total time length that is occupied by continuous-time subsequences whose time length is longer than $\beta$:

$$\rho = \frac{\sum_{L(T_i) > \beta} L(T_i)}{L(T)}$$

wherein $L(T)$ is the sum of the time lengths of all input sample records in $T$, and $\beta$ is the set continuous time length in hours, with an initial value of 5;

step 3.4), if $\rho < 0.6$, reducing the set continuous time length by setting $\beta = \beta - 1$ and returning to step 3.3); otherwise, deleting from $T$ all continuous-time subsequences whose time length is less than $\beta$, the recording times remaining in $T$ then being taken as the representative time period of the cluster.
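Steps 3.1) to 3.4) can be sketched in Python as follows, assuming hourly records with unique recording times so that the time length of a subsequence equals its number of records. The function names `split_continuous` and `representative_period` are hypothetical. The sketch follows the claim wording literally: the proportion counts subsequences strictly longer than β, while the final filter drops only those shorter than β.

```python
from datetime import datetime, timedelta


def split_continuous(times, period=timedelta(hours=1)):
    """Steps 3.1-3.2: sort the times and split them into continuous runs."""
    runs = []
    for t in sorted(times):
        if runs and t - runs[-1][-1] == period:
            runs[-1].append(t)   # continues the current run
        else:
            runs.append([t])     # a gap starts a new run
    return runs


def representative_period(times, beta=5, rho_min=0.6):
    """Steps 3.3-3.4: shrink beta until long runs cover at least rho_min."""
    runs = split_continuous(times)
    total = sum(len(r) for r in runs)  # L(T) in hours, one record per hour
    if total == 0:
        raise ValueError("the cluster has no samples")
    while True:
        rho = sum(len(r) for r in runs if len(r) > beta) / total
        if rho >= rho_min:
            break
        beta -= 1  # at beta = 0 every run counts, so the loop terminates
    kept = [r for r in runs if len(r) >= beta]
    return kept, beta, rho
```

For example, a cluster with runs of 4, 4 and 2 hours gives ρ = 0 at β = 5 and β = 4, and ρ = 0.8 at β = 3, so both 4-hour runs are kept and the 2-hour run is dropped.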

5. The method according to claim 1, wherein the clustering of the production process input variable samples and the removal of abnormal input variable samples are performed using a k-means++ clustering model, comprising the following steps:

step 2.1), clustering the input variable samples in the input variable sample set U into k clusters using the k-means++ algorithm, and, after clustering is finished, deleting all input variable samples of any cluster whose number of samples is less than 2% of the total number of samples;

step 2.2), repeating step 2.1) until the number of samples in every cluster is not less than 2% of the total number of samples.

6. The method of any of claims 1-5, wherein the input variable sample comprises recording time c₁, oxygen enrichment rate c₂, carbon monoxide c₃, hydrogen c₄, carbon dioxide c₅, standard wind speed c₆, oxygen-enriched flow rate c₇, cold air flow rate c₈, blast kinetic energy c₉, bosh gas flow c₁₀, bosh gas index c₁₁, theoretical combustion temperature c₁₂, top pressure c₁₃, oxygen-enriched pressure c₁₄, cold air pressure c₁₅, total pressure difference c₁₆, hot air pressure c₁₇, actual wind speed c₁₈, hot air temperature c₁₉, top temperature northeast c₂₀, top temperature southwest c₂₁, top temperature northwest c₂₂, top temperature southeast c₂₃, resistance coefficient c₂₄, blast humidity c₂₅, set coal injection quantity c₂₆ and coal injection quantity in the last hour c₂₇; the silicon content sample comprises assay time d₁ and silicon content data d₂; and the recording period of the input variable sample is 1 hour.

7. A data optimization system for constructing a prediction model of the silicon content of blast furnace molten iron, characterized by comprising:

the sample data acquisition unit is used for acquiring sample data of the molten iron production of the blast furnace, and comprises a production process input variable sample and a silicon content sample;

the sample clustering and rejecting unit is used for clustering the input variable samples in the production process by using a clustering model and rejecting abnormal input variable samples;

a representative time period extraction unit, which extracts a representative time period for each cluster from the input variable samples of that cluster, based on a set continuous time length index;

a silicon content high-frequency interval drawing and determining unit, used for drawing a frequency histogram of the silicon content values contained in the representative time period of each cluster and determining the high-frequency interval of silicon content values for each cluster;

and the optimal silicon content selecting unit selects the optimal silicon content value from all the corresponding silicon content samples for each input variable sample of each cluster based on the high-frequency interval of the silicon content value of each cluster.

8. The system of claim 7, wherein the optimum silicon content selecting unit comprises:

a module for constructing the set of silicon content values within a unit sampling interval, used for forming a set Si from the silicon content values whose assay time falls between the recording time of the corresponding input variable sample and the end of one unit recording interval after that time;

a judgment module, used for judging whether any silicon content value in Si lies within the high-frequency interval of silicon content values of the cluster to which the input variable sample belongs; if so, selecting, from the silicon content values in Si that lie within the high-frequency interval, the one with the earliest recording time as the optimal silicon content value corresponding to the input variable sample; if not, selecting from Si the silicon content value closest to the midpoint value of the high-frequency interval as the optimal silicon content value corresponding to the input variable sample, the selected silicon content value no longer being used as a candidate optimal silicon content value for any other input variable sample.

9. The system according to claim 7, wherein the representative period extracting unit includes:

the sample time sequence acquisition module is used for sequencing all samples in the cluster according to the recording time sequence to obtain a sample time sequence T of the cluster;

a continuous time subsequence dividing unit, used for dividing the sequence $T$ into different continuous-time subsequences according to the continuity of the recording time: $T = \{T_1, T_2, \dots, T_l\}$, wherein the time length of $T_i$ is denoted $L(T_i)$, $1 \le i \le l$, and $l$ is the total number of continuous-time subsequences;

an occupation ratio calculation module, used for calculating the proportion $\rho$ of the cluster's total time length $L(T)$ that is occupied by continuous-time subsequences whose time length is longer than $\beta$;

wherein $L(T)$ is the sum of the time lengths of all input sample records in $T$, and $\beta$ is the set continuous time length in hours, with an initial value of 5;

and a representative time period selection module, used for making a judgment according to the ratio obtained by the occupation ratio calculation module: if $\rho < 0.6$, reducing the set continuous time length by setting $\beta = \beta - 1$, calling the occupation ratio calculation module again to recalculate the ratio, and then calling the representative time period selection module again; otherwise, deleting from $T$ all continuous-time subsequences whose time length is less than $\beta$, the recording times remaining in $T$ being taken as the representative time period of the cluster.

10. The system according to claim 7, wherein the sample clustering and removing unit performs clustering using a k-means++ clustering model: the input variable samples in the input variable sample set U are clustered into k clusters using the k-means++ algorithm, and, after the clustering is completed, all input variable samples of any cluster whose number of samples is less than 2% of the total number of samples are deleted; the k-means++ clustering model is called repeatedly until the number of samples in every cluster is not less than 2% of the total number of samples.