CN108764597A

CN108764597A - A product quality control method based on ensemble learning

Info

Publication number: CN108764597A
Application number: CN201810281599.9A
Authority: CN
Inventors: 傅予力; 李凯鑫; 张勰; 吴宗泽; 张莉婷
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2018-04-02
Filing date: 2018-04-02
Publication date: 2018-11-06

Abstract

The invention discloses a product quality control method based on ensemble learning. Predicting the key quality indicator (yield rate) of products at different stages of production progress comprises the following steps: (1) data analysis of injection molding process data; (2) feature engineering analysis and construction; (3) model design based on ensemble learning; (4) handling of data imbalance; (5) a multi-model fusion scheme. Recommending optimal preset values for process parameters in the production process comprises the following steps: (6) recommending the overall adjustable process parameters; (7) recommending adjustable process parameters for specific non-adjustable process parameters. The invention is suited to the data imbalance characteristic of industrial data and moves beyond the single-parameter analysis of traditional product quality control: feature engineering from machine learning is used to uncover the inherent relationships between parameters, detect abnormalities in the production process, and improve product quality control.

Description

A Product Quality Control Method Based on Ensemble Learning

Technical Field

The invention relates to the technical field of data mining, and in particular to a product quality control method based on ensemble learning.

Background

Machine learning is currently an important and very active research field within artificial intelligence applications, and ensemble learning is a popular research direction within machine learning. "Made in China 2025" proposes using the deep integration of informatization and industrialization to lead and drive the development of the entire manufacturing sector and to move manufacturing toward Industry 4.0. However, because the networking and intelligent upgrading of injection molding machinery have only just begun, the industry's level of information services is low and industry resources lack unified planning. As a result, plastics-related industries suffer from high overall labor costs, a low level of informatization, and low added value of products, which seriously constrains the overall development of Made in China 2025.

Cloud computing and big data are key technologies for realizing Industry 4.0, and the continuous updating and improvement of big data platforms has driven steady progress in machine learning and data mining. Building on the massive data of the injection molding industry, using big data technology and machine learning methods to solve practical industrial problems is extremely important for optimizing production and increasing capacity. Industrial data is characterized by data imbalance, which strongly affects processing with machine learning algorithms. Ensemble learning algorithms, which are currently popular among machine learning algorithms, can by their own characteristics mitigate the effects of data imbalance to a certain extent and are therefore well suited to industrial big data.

Quality control refers to the operational techniques and activities adopted to meet quality requirements. In other words, quality control aims to eliminate, by monitoring the quality formation process, the factors that cause nonconforming or unsatisfactory results at every stage of the quality loop. Traditional product quality control analysis mostly examines the influence of each parameter on the quality indicators one by one. Such analysis, however, has difficulty revealing the inherent relationships between parameters and does not generalize.

In summary, recasting the quality control of injection-molded products as a typical machine learning problem, mining the data with ensemble learning methods, and extracting the inherent features of the data in order to detect abnormalities in the production process and improve product quality control is extremely important for optimizing production and increasing capacity.

Summary of the Invention

The object of the present invention is to remedy the above defects of the prior art by providing a product quality control method based on ensemble learning.

The object of the present invention can be achieved by the following technical solution:

A product quality control method based on ensemble learning, the method comprising the following steps:

S1. Data analysis of injection molding process data: based on the injection molding process parameters, analyze the mixed-type variables, the discriminative power of the features, and the data distribution;

S2. Feature engineering analysis and construction, as follows:

S21. Define how the features will be used, namely to predict the key quality indicator of the product at different stages of production progress;

S22. Feature cleaning: remove some abnormal samples;

S23. Feature processing, including categorical variable processing, numerical variable processing, and processing of time-series status monitoring indicator data. Categorical variable processing encodes categorical variables before they are fed to the model. Numerical variable processing treats numerical variables that take only a few distinct values as categorical variables and encodes them while also retaining the original values; other numerical variables keep their original values, and missing values are filled with the median. Processing of time-series status monitoring indicator data extracts, per time phase, the statistics of each parameter, including the mean, median, mode, maximum and minimum values, and variance;

S24. Feature selection: features are extracted from the time-series status indicator data and an embedded feature selection method is applied. The tree models XGBoost and random forest are chosen as the model design methods; feature importances are obtained from the tree model XGBoost, the features are ranked, and features of low importance are removed to reduce the feature dimension;

S3. Model design based on ensemble learning: the final evaluation criterion is the arithmetic mean of the RMSE values between predicted and actual values. During model training, classification models are evaluated by K-fold cross-validation with AUC as the performance metric, and regression models are evaluated by K-fold cross-validation with RMSE as the performance metric;

S4. Handling of data imbalance, specifically:

S41. At the data and algorithm level:

S411. Apply combined sampling to the imbalanced time-series data: sample from the majority sample set, combine the sampled subset with the minority sample set to form a new sample set, train a model on each new sample set, and finally apply Bagging;

S412. Select the XGBoost algorithm and the DART algorithm;

S413. Train the model on the sample set using cost-sensitive learning: in the XGBoost algorithm, apply different penalty coefficients to data of different classes;

S414. Use DART, a tree model that borrows from deep learning: the Dropout method of deep learning is introduced to prevent the model from overfitting;

S42. At the model fusion level, classification and regression models are combined: for key quality indicator prediction, a regression model predicts the key quality indicator of each product batch; because the data is imbalanced, the unprocessed minority samples are treated as small classes and predicted with classification models, and the final result is produced by using the classification and regression methods together;

S5. Multi-model fusion, specifically:

S51. Regression models are fused by weighted averaging;

S52. Classification model fusion uses two binary classification models. After training, predictions are made on the test set to obtain, for each sample, the probability that key_index is below 0.92 or above 0.98; for samples predicted with high confidence, the predicted value is set to 0.92 or 0.98.

Further, the product quality control method also comprises the following steps:

R2. Define how the features will be used, namely to recommend optimal preset values for the process parameters in the production process so as to obtain a better key quality indicator, as follows:

R21. Recommendation of the overall adjustable process parameters, specifically:

Mine the parameter combination that maximizes the yield rate: group the data by parameter combination to obtain every combination that occurs in the training data, and compute for each combination the mean, median, maximum, and minimum yield rate and the number of times the combination occurs, giving a statistics table. Sort the table by mean yield rate in descending order, tally the values of each adjustable parameter across the Top 20/30/40 parameter combinations, and take the mode of each adjustable parameter as the recommendation;

R22. Recommendation of adjustable process parameters for specific non-adjustable process parameters, specifically:

First, select from the training data the product batches whose yield rate exceeds a threshold (adjustable as required), and use the parameters of these batches as candidates. Then, for a new product batch, use its non-adjustable process parameters as features, find the most similar sample or the Top k most similar samples among the candidates, and take their adjustable parameters as the recommendation.

Further, in step R21 (recommendation of the overall adjustable process parameters), the median or the mean is taken as the recommendation for adjustable parameters of type double.

Further, in step R22 (recommendation of adjustable process parameters for specific non-adjustable process parameters), the problem is recast as a similarity measurement problem: different distance measures are used depending on the type of value, the importance of the numerical parameters among the non-adjustable process parameters is obtained and used to assign weights, and a weighted Euclidean distance is thereby computed.

Further, the injection molding process parameters include injection pressure, injection time, injection temperature, holding pressure and holding time, back pressure, and rotational speed.

Further, in step S22, data imbalance is handled at the data level by combining oversampling and undersampling, and at the algorithm level by selecting a Boosting ensemble learning algorithm as the base model, with AUC as the evaluation metric for classification results.

Further, in step S23, categorical variable processing and numerical variable processing encode the variables with one-hot encoding.

Further, three models, XGBoost, DART, and RandomForest, are selected as regression models. A parameter common to all three is chosen for tuning: min_child_weight. For regression problems, this parameter corresponds to the minimum number of samples on each leaf node;

Two binary classification models are selected as classification models: one predicts whether the key_index of a sample is below 0.92, and the other predicts whether it is above 0.98. Both models use "binary:logistic" as the objective function and AUC as the evaluation metric.

Further, step S411 is implemented with the Random Forest model.

Compared with the prior art, the present invention has the following advantages and effects:

(1) The invention creatively proposes a product quality control method based on ensemble learning that departs from the traditional parameter-by-parameter approach to product quality control analysis. Using machine learning and data mining, it analyzes the data, builds feature engineering to extract the implicit features, and uses model training to analyze the inherent relationships among the parameters and their influence on the quality indicators.

(2) To address the imbalance of industrial data, the invention adopts a strategy that combines the data level and the algorithm level. At the data level, oversampling and undersampling are combined to adjust the uneven distribution of the data set. At the algorithm level, multiple ensemble learning algorithms, cost-sensitive learning, the AUC evaluation metric, and multi-model fusion adapt the algorithms to the imbalanced data, so that the data can be predicted and analyzed more accurately.

(3) The invention trains its models with several efficient ensemble learning methods. These methods accommodate the data imbalance described above, guard against overfitting, and achieve high prediction accuracy, while their computational efficiency reduces resource usage and improves performance. RandomForest and DART reduce overfitting in different ways, and distributed XGBoost offers good accuracy and high computational performance.

(4) The invention recommends parameters by combining a global and a local approach. It not only recommends adjustable parameters according to the yield rate results, but also makes recommendations that account for the influence of the non-adjustable parameters, so that the influence of all parameters is considered more thoroughly and a better parameter combination is obtained. The parameter recommendation also combines several methods and obtains the optimal parameter combination by recasting the problem.

Brief Description of the Drawings

Fig. 1 is a workflow diagram of the product quality control method based on ensemble learning according to an embodiment of the present invention;

Fig. 2 is a scatter plot analysis of the data distribution in the present invention;

Fig. 3 is a feature importance ranking chart of the present invention;

Fig. 4 is a diagram of the model fusion scheme of the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment

This embodiment takes product quality control in the injection molding industry as its subject and studies the quality control of injection-molded products from two aspects: 1) predicting the key quality indicator (yield rate) of products at different stages of progress in the production flow; 2) recommending optimal preset values for the process parameters in the production process so as to obtain a better key quality indicator. Big data technology and machine learning are used to detect abnormalities in the production process and improve product quality control.

As shown in Fig. 1, the product quality control method based on ensemble learning in this embodiment covers two aspects: 1) predicting the key quality indicator (yield rate) of products at different stages of progress in the production flow; 2) recommending optimal preset values for the process parameters in the production process so as to obtain a better key quality indicator.

Problem 1 comprises the following steps:

S1. Data analysis of injection molding process data, as follows:

Manufacturing production generally comprises several stages such as material selection, processing, and product quality inspection, and the injection molding industry is no exception. Injection molding is the process of forming molten raw material into semi-finished parts of a given shape through operations such as pressurization, injection, cooling, and demolding. The main injection molding process parameters are injection pressure, injection time, injection temperature, holding pressure and holding time, back pressure, rotational speed, and so on. Production process data for manufactured products falls into three broad categories: adjustable process parameters, non-adjustable process parameters, and time-series status monitoring indicators. Inspection of the process parameter data shows the following characteristics: mixed-type variables (numerical, categorical, etc.), insufficiently discriminative features, and an imbalanced data distribution.

S2. Feature engineering analysis and construction, with the following specific steps:

S21. Define how the features will be used, namely to predict the key quality indicator (yield rate) of the product at different stages of production progress.

S22. Feature cleaning: some abnormal samples are removed. To handle data imbalance, oversampling and undersampling are combined at the data level; at the algorithm level, cost-sensitive learning is used, a Boosting ensemble learning algorithm is selected as the base model, and AUC serves as the evaluation metric for classification results. The imbalance is thus addressed at both the data level and the algorithm level.

S23. Feature processing mainly includes categorical variable processing, numerical variable processing, and processing of time-series status monitoring indicator data. Categorical variable processing: both the adjustable and the non-adjustable variables contain categorical variables, which are encoded before being fed to the model using the common one-hot encoding. Numerical variable processing: numerical variables that take only a few distinct values are also treated as categorical and one-hot encoded, with the original values retained as well. Other numerical variables keep their original values, and missing values are filled with the median. Processing of time-series status monitoring indicator data: the time-series indicator data contains many indicators; for those related to the time sequence, the statistics of each parameter are extracted per time phase, including the mean, median, mode, maximum/minimum, and variance. Specifically, the processing of each product_no is divided evenly into 10 phases according to add_time, and the statistics of each parameter are computed within each phase. Parameter statistics are also extracted over the whole processing run and at 50% processing progress. Although simple, these statistical features are discriminative to a degree: for example, an excessively large temperature variance within a given period may have a considerable effect on product quality. This phased approach captures the characteristics of the time-series data. Approximate total processing time and parameter difference features are also constructed.
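
The phased statistics described in S23 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the readings of one monitored parameter arrive as (product_no, add_time, value) tuples with numeric add_time, and the function name phase_statistics is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, multimode, pvariance


def phase_statistics(records, n_phases=10):
    """Split each product's readings into equal time phases and summarise them.

    records: iterable of (product_no, add_time, value) tuples for one
    monitored parameter (assumed layout; add_time is any number).
    Returns {product_no: [stats dict or None for each phase]}.
    """
    by_product = defaultdict(list)
    for product_no, add_time, value in records:
        by_product[product_no].append((add_time, value))

    features = {}
    for product_no, rows in by_product.items():
        rows.sort()
        start, end = rows[0][0], rows[-1][0]
        span = (end - start) or 1  # avoid division by zero for one timestamp
        phases = [[] for _ in range(n_phases)]
        for add_time, value in rows:
            # Map the timestamp to a phase; the last reading joins the final phase.
            idx = min(int((add_time - start) / span * n_phases), n_phases - 1)
            phases[idx].append(value)
        features[product_no] = [
            {
                "mean": mean(v), "median": median(v), "mode": multimode(v)[0],
                "max": max(v), "min": min(v), "variance": pvariance(v),
            } if v else None  # a phase without readings yields no statistics
            for v in phases
        ]
    return features


# Example: 20 evenly spaced readings for one product.
demo = phase_statistics([("A", t, float(t)) for t in range(20)])
```

The same function called with n_phases=1 gives whole-run statistics; the statistics at 50% progress mentioned in the text would need a separate cut of the records.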

S24. Feature selection: several hundred dimensions of features are extracted from the time-series status indicator data. Many of them are redundant and tend to cause overfitting, so an embedded feature selection method is applied. Because the tree models XGBoost and random forest are chosen in the model design, feature importances are obtained from the tree models, the features are ranked, and features of low importance are removed, so that only about 100 feature dimensions are finally retained.
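
The ranking-and-pruning step of S24 might look like the sketch below. It assumes the importance scores have already been read from a trained tree model into a plain dictionary; how they are obtained is outside the sketch, and the function name select_top_features is hypothetical.

```python
def select_top_features(importances, keep=100):
    """Rank features by importance and keep the `keep` most important ones.

    importances: {feature_name: importance score}, assumed to come from a
    trained tree model. Returns the retained feature names, best first.
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:keep]]


# Example with made-up scores: keep the two most important features.
kept = select_top_features({"temp_var_p3": 0.5, "pressure_mean_p1": 0.3, "noise": 0.1}, keep=2)
```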

Step S3. Model design based on ensemble learning, specifically:

S31. Model evaluation scheme

For the problem of predicting the key quality indicator (yield rate) at different stages of progress in the production flow, the final evaluation criterion is the arithmetic mean of the RMSE values between predicted and actual values. During model training, classification models are evaluated by K-fold cross-validation with AUC as the performance metric; regression models are likewise evaluated by K-fold cross-validation, with RMSE as the performance metric.
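
The evaluation scheme can be sketched in plain Python as below. This is an illustration under assumptions: the fold assignment (every k-th sample) and the fit/predict callables are hypothetical placeholders for whatever model is being evaluated, and AUC is computed by its pairwise-ranking definition.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; sample i belongs to fold i % k."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, folds[i]


def cross_validated_rmse(X, y, fit, predict, k=5):
    """Mean RMSE over the K validation folds."""
    scores = []
    for train, valid in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in valid])
        scores.append(rmse([y[i] for i in valid], preds))
    return sum(scores) / len(scores)
```

A classification model would be scored the same way, with auc applied to each validation fold instead of rmse.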

S32. Regression models

Three models are selected for regression: XGBoost, DART, and RandomForest. Regarding tuning, since the models are ensemble learning algorithms, the three tree models share one important parameter: min_child_weight. For regression problems, this parameter corresponds to the minimum number of samples on each leaf node, and the smaller it is, the more easily the model overfits. In this scheme min_child_weight is set to 5, which gives a noticeable improvement in RMSE over the default value of 1.

S33. Classification models

The regression model uses XGBoost, and its predicted values are concentrated in the interval [0.92, 0.98]. Two binary classification models are therefore also built: one predicts whether the key_index of a sample is below 0.92, and the other predicts whether it is above 0.98. The two classification models use the same features as the regression model and are also XGBoost models, but with "binary:logistic" as the objective function and AUC as the evaluation metric.
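
The settings named in S32 and S33 can be collected as parameter dictionaries of the kind XGBoost accepts. In this sketch, min_child_weight=5, "binary:logistic", and AUC come from the text; the regression objective "reg:squarederror" and the "rmse" metric string are assumptions, as are the helper name binary_targets and the constant names.

```python
LOW, HIGH = 0.92, 0.98  # key_index thresholds stated in the text

# Regression settings: min_child_weight=5 is from the text; the objective and
# eval_metric strings are assumed, standard XGBoost names.
regression_params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "min_child_weight": 5,
}

# Classification settings named in the text, shared by both binary models.
classification_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
}


def binary_targets(key_index_values):
    """Build the labels for the two binary classifiers from key_index."""
    is_low = [1 if v < LOW else 0 for v in key_index_values]
    is_high = [1 if v > HIGH else 0 for v in key_index_values]
    return is_low, is_high
```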

Step S4. Handling of data imbalance, specifically:

S41. At the data and algorithm level:

S411. Apply combined sampling to the imbalanced time-series data: sample from the majority sample set (for example 20%; the exact ratio should be determined by cross-validation) and combine the sampled subset with the minority sample set to form a new sample set (for example, five such sample sets). Train a model on each new sample set and finally apply Bagging. The Random Forest model is mainly used here.
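
The combined sampling of S411 can be sketched as follows. The 20% fraction and the five subsets follow the examples in the text; the function names, the fixed seed, and the averaging used for the final Bagging step are assumptions for illustration.

```python
import random


def balanced_subsets(majority, minority, n_subsets=5, fraction=0.2, seed=0):
    """Undersample the majority set and pair each draw with the full minority set."""
    rng = random.Random(seed)
    size = max(1, int(len(majority) * fraction))
    return [rng.sample(majority, size) + list(minority) for _ in range(n_subsets)]


def bagged_predict(models, predict, x):
    """Bagging step: average the predictions of the models trained on each subset."""
    return sum(predict(m, x) for m in models) / len(models)


# Example: 100 majority samples, 5 minority samples.
subsets = balanced_subsets(list(range(100)), list(range(100, 105)))
```

One model would be trained per subset and the resulting models passed to bagged_predict.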

S412. Select Boosting algorithms; here mainly the XGBoost algorithm and the DART algorithm are selected. For classification models, AUC is chosen as the evaluation metric for classification results.

S413. Train the model on the sample set using cost-sensitive learning: in the XGBoost algorithm, different penalty coefficients are applied to data of different classes.
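
The text does not say how the penalty coefficients are chosen. One common convention, shown here purely as an assumption, weights each class inversely to its frequency; for a binary model the negative-to-positive ratio is the value typically suggested for XGBoost's scale_pos_weight parameter. The function names are hypothetical.

```python
def class_penalties(labels):
    """Penalty per class, inversely proportional to class frequency (assumed rule)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def sample_weights(labels):
    """Per-sample weights that a weighted training routine could consume."""
    penalties = class_penalties(labels)
    return [penalties[label] for label in labels]


def scale_pos_weight(labels):
    """Ratio of negatives to positives for a binary problem with labels 0/1."""
    pos = sum(1 for label in labels if label == 1)
    return (len(labels) - pos) / pos
```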

S414. Use DART, a tree model that borrows from deep learning. After undersampling and similar processing, overfitting occurs easily if the amount of data is not large enough; introducing the Dropout method from deep learning prevents the model from overfitting.
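
A rough sketch of how DART might be configured and what its dropout does. The keys booster, rate_drop, and skip_drop are XGBoost parameter names; the values shown are hypothetical and not taken from the text. The kept_trees helper only illustrates the idea of randomly dropping existing trees in a boosting round and is not XGBoost's internal code.

```python
import random

# Hypothetical DART configuration; only min_child_weight=5 is from the text.
dart_params = {
    "booster": "dart",
    "objective": "reg:squarederror",  # assumed regression objective
    "min_child_weight": 5,
    "rate_drop": 0.1,   # assumed: fraction of existing trees dropped per round
    "skip_drop": 0.5,   # assumed: probability of skipping dropout in a round
}


def kept_trees(n_trees, rate_drop, rng):
    """Indices of existing trees kept when each is dropped with probability rate_drop."""
    return [i for i in range(n_trees) if rng.random() >= rate_drop]


# In a dropout round the new tree is fitted against the kept trees only,
# so no single early tree can dominate the ensemble.
survivors = kept_trees(10, dart_params["rate_drop"], random.Random(0))
```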

S42. At the model fusion level:

S421. Because ensemble learning algorithms were selected above, model fusion is performed in two ways: first, by linearly weighting different ensemble learning algorithms; second, by a cross-fusion in which the upper layer uses ensemble learning algorithms and the lower layer uses a linear model such as logistic regression (LR).

S422. Combining classification and regression models: for key quality indicator prediction, a regression model predicts the key quality indicator of each product batch. Because the data is imbalanced, the unprocessed minority samples can be treated as small classes and predicted with classification models, and the final result is produced by using the classification and regression methods together.

Step S5. Multi-model fusion scheme, specifically:

S51. Regression model fusion:

XGBoost, DART, and Random Forest are all tree-based models. XGBoost and DART are boosting algorithms, which focus on reducing model bias. Random forest is a Bagging algorithm, which focuses on reducing model variance. Fusing these models can further improve performance. Stacking and blending are common and effective fusion methods, but multi-level stacking/blending tends to overfit, so this scheme finally uses only a simple weighted average: 0.4*XGB+0.4*DART+0.3*RF, with the weights adjusted according to the models' offline performance.
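
The weighted average can be sketched as below. The weights are applied exactly as stated; since they sum to 1.1, the sketch also offers an optional normalization, which is an assumption and not something the text describes. The function name weighted_blend is hypothetical.

```python
WEIGHTS = {"XGB": 0.4, "DART": 0.4, "RF": 0.3}  # weights as stated in the text


def weighted_blend(predictions, weights=WEIGHTS, normalize=False):
    """Blend per-model prediction lists into one list.

    predictions: {model_name: [prediction per sample]}.
    normalize=True divides by the weight sum (assumed option, not in the text).
    """
    total = sum(weights.values()) if normalize else 1.0
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[m] * predictions[m][i] for m in weights) / total
        for i in range(n_samples)
    ]


# Example with made-up predictions for two batches.
blended = weighted_blend(
    {"XGB": [0.95, 0.97], "DART": [0.94, 0.96], "RF": [0.96, 0.98]},
    normalize=True,
)
```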

S52. Classification model fusion:

After the two binary classification models are trained, predictions are made on the test set to obtain, for each sample, the probability that key_index is below 0.92 or above 0.98. For samples predicted with high confidence (high probability), the predicted value is set to 0.92 or 0.98. This post-processing step noticeably improves the final evaluation result.
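
This post-processing step can be sketched as follows. The text does not give the confidence cut-off, so the value 0.9 is a hypothetical choice, as are the function name and the tie-breaking rule when both classifiers are confident.

```python
def clamp_with_classifiers(reg_pred, p_low, p_high, confidence=0.9, low=0.92, high=0.98):
    """Override regression outputs where a binary classifier is confident.

    reg_pred: regression predictions of key_index.
    p_low / p_high: predicted probabilities that key_index is below `low`
    or above `high`. `confidence` is an assumed threshold.
    """
    adjusted = []
    for y, pl, ph in zip(reg_pred, p_low, p_high):
        if pl >= confidence and pl >= ph:
            adjusted.append(low)    # confidently low: set to 0.92
        elif ph >= confidence:
            adjusted.append(high)   # confidently high: set to 0.98
        else:
            adjusted.append(y)      # otherwise keep the regression value
    return adjusted
```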

For problem two, the ensemble-learning-based product quality control method disclosed by the present invention further includes the following steps:

R2. Define the feature usage scheme, namely recommending optimal preset values for the process parameters in the production process so as to obtain a better key quality indicator, as follows:

R21. Overall recommendation of adjustable process parameters;

R22. Recommendation of adjustable process parameters for specific non-adjustable process parameters.

Step R21, overall recommendation of adjustable process parameters, is specifically as follows:

To mine the parameter combination that maximizes the yield rate, the training data are grouped by parameter combination, which gives every combination that appears in the training data; for each combination, the mean, median, maximum, and minimum yield rate and the number of occurrences are computed. The resulting statistical table contains only 91 parameter combinations in total. The table is sorted by mean yield rate in descending order, the values of each adjustable parameter are tallied across the Top 20/30/40 combinations, and the mode of each adjustable parameter is taken as the recommendation. For adjustable parameters of type double, the median or mean can be used instead. Some fixed parameter combinations are also taken into account when making recommendations.
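
The grouping and mode-based recommendation can be sketched as follows, using only the Python standard library. The parameter names (`mold_temp`, `hold_pressure`), the field name `yield_rate`, the function names, and the sample batches are all hypothetical; the source does not specify a data layout.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def combo_stats(batches, adjustable):
    """Group batches by adjustable-parameter combination and summarize yield."""
    groups = defaultdict(list)
    for b in batches:
        groups[tuple(b[p] for p in adjustable)].append(b["yield_rate"])
    table = [
        {"combo": k, "mean": mean(v), "median": median(v),
         "max": max(v), "min": min(v), "count": len(v)}
        for k, v in groups.items()
    ]
    # Sort by mean yield rate, largest first.
    table.sort(key=lambda row: row["mean"], reverse=True)
    return table


def recommend_overall(table, adjustable, top_n):
    """Take the mode of each adjustable parameter over the top_n combinations."""
    top = table[:top_n]
    return {
        p: Counter(row["combo"][i] for row in top).most_common(1)[0][0]
        for i, p in enumerate(adjustable)
    }


ADJUSTABLE = ["mold_temp", "hold_pressure"]  # hypothetical parameter names
batches = [
    {"mold_temp": 60, "hold_pressure": 80, "yield_rate": 0.97},
    {"mold_temp": 60, "hold_pressure": 80, "yield_rate": 0.99},
    {"mold_temp": 60, "hold_pressure": 85, "yield_rate": 0.96},
    {"mold_temp": 65, "hold_pressure": 80, "yield_rate": 0.95},
    {"mold_temp": 70, "hold_pressure": 90, "yield_rate": 0.90},
]
table = combo_stats(batches, ADJUSTABLE)
recommendation = recommend_overall(table, ADJUSTABLE, top_n=3)
```

In the text the cut-off is the Top 20/30/40 of 91 combinations; `top_n=3` is used here only because the toy data set has four combinations.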

Step R22, recommendation of adjustable process parameters for specific non-adjustable process parameters, is specifically as follows:

The scheme first screens the training data for product batches whose yield rate exceeds a threshold (adjustable as required) and takes the parameters of those batches as candidate values. For a new product batch (that is, a test sample), the table of non-adjustable process parameters serves as the feature set; the most similar sample (or the Top k most similar samples) is found among the candidates, and its adjustable parameters are taken as the recommendation. The problem thus becomes one of similarity (distance) measurement. Different distance measures, such as MinkovDM, are used depending on the value type; the importance of the numerical parameters among the non-adjustable process parameters is obtained and used to assign weights, which yields a weighted Euclidean distance.
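
The nearest-candidate lookup can be sketched as below. The sketch covers only numerical non-adjustable parameters with a weighted Euclidean distance; handling of categorical parameters (for example with MinkovDM) is omitted. The record layout (`fixed`, `adjustable`, `yield_rate`), the weights, the threshold, and the sample values are hypothetical.

```python
import math


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def recommend_for_batch(train, query_fixed, weights, yield_threshold, k=1):
    """Return the adjustable parameters of the k most similar high-yield batches."""
    # Keep only batches whose yield rate exceeds the threshold.
    candidates = [b for b in train if b["yield_rate"] > yield_threshold]
    # Rank candidates by distance on the non-adjustable parameters.
    ranked = sorted(
        candidates,
        key=lambda b: weighted_euclidean(b["fixed"], query_fixed, weights),
    )
    return [b["adjustable"] for b in ranked[:k]]


train = [
    {"fixed": [200.0, 30.0], "adjustable": {"hold_pressure": 80}, "yield_rate": 0.99},
    {"fixed": [210.0, 35.0], "adjustable": {"hold_pressure": 85}, "yield_rate": 0.97},
    # Closest to the query, but filtered out because its yield is too low.
    {"fixed": [201.0, 30.0], "adjustable": {"hold_pressure": 90}, "yield_rate": 0.85},
]
recommended = recommend_for_batch(
    train,
    query_fixed=[202.0, 31.0],
    weights=[0.7, 0.3],      # hypothetical importance-derived weights
    yield_threshold=0.95,    # hypothetical threshold
    k=1,
)
```

Filtering by yield before ranking means a low-yield batch is never recommended, even when its non-adjustable parameters are the closest match.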

The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims

1. A product quality control method based on ensemble learning, characterized in that the product quality control method comprises the following steps:

S1, data analysis based on injection molding process data: according to the injection molding process parameters, analyzing the mixed-type variables, feature determination, and data distribution;

S2, feature engineering analysis and construction, as follows:

S21, defining the feature usage scheme, namely predicting the key quality indicator of products at different stages of production progress;

S22, feature cleaning: rejecting some abnormal samples;

S23, feature processing, including categorical variable processing, numerical variable processing, and time-series state monitoring indicator data processing, wherein: categorical variable processing encodes categorical variables before they are input to the model; numerical variable processing encodes numerical variables that take only a limited number of values as categorical variables while retaining the original values, keeps the initial values of other numerical variables, and fills missing values with the median; time-series state monitoring indicator data processing divides the time-series indicator data into time phases and extracts statistics of each parameter, including the mean, median, mode, maximum and minimum values, and variance;

S24, feature selection: extracting features from the time-series state indicator data and applying an embedded feature selection method; selecting the tree models XGBoost and random forest as the design approach, obtaining feature importance using the tree model XGBoost, ranking the features, rejecting features of low importance, and reducing the feature dimensionality;

S3, model design based on ensemble learning: the evaluation metric takes the arithmetic mean of the RMSE values between predicted and actual values as the evaluation criterion; during model training, for the classification models, K-fold cross-validation is used as the evaluation procedure and AUC is selected as the performance metric; for the regression models, K-fold cross-validation is used as the evaluation procedure and RMSE is selected as the performance metric;

S4, data imbalance processing, specifically:

S41, data and algorithm level：

S411, performing combined sampling for the imbalanced time-series model: the majority sample set is sampled and combined with the minority sample set to form new sample sets, models are trained on the new sample sets, and bagging is finally performed;

S412, selecting the XGBoost algorithm and the DART algorithm;

S413, training the model on the sample set using a cost-sensitive learning method, wherein, in the XGBoost algorithm, different penalty coefficients are applied to data of different classes;

S414, using the tree model DART, which introduces the Dropout method from deep learning, to prevent model overfitting;

S42, model fusion level, fusing the classification models and the regression models: for key quality indicator prediction, the regression model predicts the key quality indicator of each batch of products; because of the data imbalance, the unprocessed few-sample data are treated as a minority class and predicted using classification models, and classification and regression methods are finally used together for data processing;

S5, multi-model fusion processing, specifically:

S51, regression model fusion using a weighted average method;

S52, classification model fusion using two binary classification models: after model training is completed, the test set is predicted to obtain the probability that the key_index of each sample is lower than 0.92 or higher than 0.98, and the predicted values of samples with high confidence are limited to 0.92 or 0.98.

2. The product quality control method based on ensemble learning according to claim 1, characterized in that the product quality control method further comprises the following steps:

R2, defining the feature usage scheme, namely recommending optimal preset values for the process parameters in the production process so as to obtain a better key quality indicator, as follows:

R21, overall recommendation of adjustable process parameters, specifically:

mining the optimal parameter combination that maximizes the yield rate: grouping by parameter combination to obtain all parameter combinations appearing in the training data, and calculating the mean, median, maximum, and minimum yield rate of each combination and the number of occurrences of each combination to obtain a statistical table; sorting in descending order of mean yield rate, tallying the values of each adjustable parameter across the Top 20/30/40 parameter combinations, and taking the mode of each adjustable parameter as the recommendation;

R22, recommending adjustable process parameters for specific non-adjustable process parameters, specifically:

first, screening out from the training data the product batches whose yield rate exceeds a certain threshold and taking the parameters of these product batches as candidate values; then, for a new product batch, using the table of non-adjustable process parameters as features, finding the most similar sample or the Top k most similar samples among the candidate samples, and taking their adjustable parameters as the recommendation.

3. The product quality control method based on ensemble learning according to claim 2, characterized in that, in step R21, overall recommendation of adjustable process parameters, the median or mean is taken as the recommendation for adjustable parameters of type double.

4. The product quality control method based on ensemble learning according to claim 2, characterized in that, in step R22, recommending adjustable process parameters for specific non-adjustable process parameters, the problem is converted into a similarity measurement problem; different distance measures are used according to the value type, the importance of the numerical parameters among the non-adjustable process parameters is obtained, and weights are assigned accordingly, yielding a weighted Euclidean distance.

5. The product quality control method based on ensemble learning according to claim 1, characterized in that the injection molding process parameters include injection pressure, injection time, injection temperature, holding pressure and holding time, back pressure, and rotational speed.

6. The product quality control method based on ensemble learning according to claim 1, characterized in that, in step S22, to address the imbalanced data characteristics, a combined down-sampling approach is adopted at the data level; at the algorithm level, a boosting ensemble learning algorithm is chosen as the base algorithm model, with AUC as the evaluation criterion for classification results.

7. The product quality control method based on ensemble learning according to claim 1, characterized in that, in step S23, the categorical variable processing and the numerical variable processing encode the variables using one-hot encoding.

8. The product quality control method based on ensemble learning according to claim 1, characterized in that the regression models are three models, XGBoost, DART, and RandomForest, and for model tuning a commonly used parameter is selected: min_child_weight, which for regression problems corresponds to the minimum number of samples on each leaf node;

the classification models are two binary classification models, one predicting whether the key_index of a sample is lower than 0.92 and the other predicting whether the key_index of a sample is higher than 0.98; both models use "binary:logistic" as the objective function and AUC as the evaluation metric.

9. The product quality control method based on ensemble learning according to claim 1, characterized in that step S411 is implemented using the Random Forest model.