CN111563436B

CN111563436B - Infrared spectrum measuring instrument calibration migration method based on CT-CDD

Info

Publication number: CN111563436B
Application number: CN202010348512.2A
Authority: CN
Inventors: 赵煜辉; 刘晓东; 芦鹏程; 赵子恒
Original assignee: Northeastern University Qinhuangdao
Current assignee: Northeastern University Qinhuangdao
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2022-04-08
Anticipated expiration: 2040-04-28
Also published as: CN111563436A

Abstract

The invention relates to the technical field of transfer learning within machine learning, and provides a CT-CDD-based calibration transfer (calibration migration) method for infrared spectrum measuring instruments. First, the source-domain and target-domain data sets {X^m, y^m} and {X^s} are collected, and the Kennard-Stone (KS) algorithm is used to split off a source-domain calibration set, which is then centered. A PLS calibration model is built on the centered source-domain calibration set. Next, the feature spectrum T^m of the master instrument and the pseudo-feature spectrum of the slave instrument are computed. Using OLS and the data set {T^m, y^m}, the number of clusters K is determined by cross-validation. The master feature spectra and the slave pseudo-feature spectra are then clustered separately; for each resulting sub-data set, the k-th OLS model is built and the transformation matrix M_k is computed. Finally, the substance concentration variable of the set of measured objects is predicted. The invention does not need standard samples to build the transfer model, and can greatly improve the accuracy and efficiency of calibration transfer for infrared spectrum measuring instruments.

Description

A method of calibration and migration of infrared spectroscopy measuring instruments based on CT-CDD

Technical field

The invention relates to the technical field of transfer learning within machine learning, and in particular to a CT-CDD-based calibration transfer method for infrared spectrum measuring instruments.

Background

The near-infrared region is the band of electromagnetic radiation between visible light and mid-infrared light, and it was the first non-visible region to be discovered, by William Herschel in the 19th century. In October 1985 the American Society for Testing and Materials (ASTM) defined it as the spectral region with wavelengths of 780-2526 nm, corresponding to wavenumbers of 12820-3959 cm^-1. By the 1950s, near-infrared spectroscopic analysis was being applied in some fields. In the 1960s, as new analytical techniques kept emerging and the shortcomings of near-infrared spectroscopy became apparent, interest in the near-infrared region gradually faded except among users with specific analytical applications.

Research on near-infrared spectroscopy then entered a long quiet period. Only with deeper work on chemometrics and continued improvement in the manufacture of spectroscopic instruments did the technique develop further in the mid-1980s. Unlike traditional analytical techniques, near-infrared spectroscopy is an indirect technique: it cannot directly provide information such as the content of a substance. A calibration model must first be built from known samples, and that model is then used to predict the concentration information of unknown samples for quantitative or qualitative analysis. The analysis process of near-infrared spectroscopy is shown in Figure 1, and the main steps are as follows:

(1) Select representative samples to form a calibration set and measure their near-infrared spectra. The collected calibration samples must be representative.

(2) After the calibration samples have been collected, measure the concentration of the substance of interest in each sample by standard analytical chemistry methods.

(3) Select a suitable algorithm to model the relationship between the measured spectra of the calibration samples and the corresponding concentration information. This is the core step of quantitative near-infrared analysis: a calibration model is built on the preprocessed spectra and the concentration information, its parameters are generally determined by cross-validation, and finally its performance is checked.

(4) Once the multivariate calibration model has been built, measure the near-infrared spectrum of the current test sample and use the model to predict the substance content of that sample.

Modern near-infrared spectroscopic analysis rests on a rich theoretical foundation and extensive practical experience. Unlike other analytical techniques, it draws on several disciplines, including spectroscopy, chemometrics and computer technology.

Near-infrared spectroscopic analysis has many advantages. It can determine the chemical composition and properties of a sample within a few minutes. A single near-infrared measurement of a sample is enough to analyze several of its components at once, up to more than ten indicators. Samples can be analyzed directly after simple pretreatment and are not damaged, so the test is non-destructive. No chemical reagents are needed, which greatly reduces the cost of analysis and avoids polluting the environment, making it a "green analysis" technique. Near-infrared spectra mainly reflect information about the chemical bonds of hydrogen-containing groups such as C-H, O-H, N-H and S-H in organic molecules, so they are well suited to quantitative or qualitative analysis of hydrogen-containing organic compounds. The technique covers most organic mixtures and compounds, which, together with its particular advantages, gives it an extremely broad range of applications and an indispensable role in many industries. In agriculture it is used to measure component contents, for example the moisture or protein content of corn; in medicine it is used to measure the component contents of drugs; it is also used in biological, food and environmental testing.

Machine learning and data mining techniques have achieved significant success in many areas of knowledge engineering, including classification, regression and clustering. Traditional machine learning methods assume that the training data and the test data follow the same distribution, so that a model built on the training data can be used to predict the test data. In practical applications the two distributions often differ somewhat, and in some cases training data are very expensive or impossible to collect. When the distributions of the training and test data differ significantly, the predictions for the test data deviate substantially from the true values, and most statistical models have to be rebuilt from newly collected training data. In such cases it is desirable to transfer knowledge between task domains, an approach called transfer learning. Transfer learning improves a learner in one domain by using information transferred from a related domain.

Multivariate calibration is a very useful tool for extracting chemical information from spectral signals, and the resulting calibration models are crucial for many analytical measurements. It has been applied to a variety of analytical techniques, and its importance is especially evident in near-infrared (NIR) spectroscopy. Building a robust calibration model usually takes a great deal of labor and resources. Problems arise when samples are measured on different instruments or under different environmental conditions. Even for the same samples, the spectral matrices measured by two instruments differ, and so do the models built from them. A model built on one instrument usually cannot predict from spectra measured on a second instrument. One way to solve this is to remeasure every sample and build a new model for the newly acquired spectra, but this is not practical. Because building a robust calibration model costs so much time and money, a more acceptable way to avoid this expense is to transfer the model. In machine learning this way of handling the problem is called transfer learning; more specifically, the case in which the task is the same but the domains differ is called domain adaptation. In chemometrics it is called calibration transfer.

Most calibration transfer methods build the transfer model from a set of standard samples, which must be measured on both the master instrument and the slave instrument. Many such standard-based transfer methods have been proposed. For example, direct standardization (DS) and piecewise direct standardization (PDS) use a set of standard samples to correct the spectral differences between the master and slave instruments. In DS, each wavelength of the master instrument is related to all wavelengths of the slave instrument. In PDS, each wavelength of the master instrument is related to a window of wavelengths of the slave instrument, and the regression coefficients of all windows together form a banded transfer matrix. Experimental results are consistent with the underlying assumption that the spectral correlation between master and slave instruments is confined to a small region. The key to PDS is choosing the window size and the number of standard samples; because it builds many regression models, it is computationally expensive. PDS is one of the most widely used transfer methods and is often used as a baseline for new techniques.

Slope and bias correction (SBC) assumes a linear relationship between the predicted values of different instruments. First, the regression coefficients between the spectra and the response values are computed; the predicted values for the master and slave instruments are then computed from these coefficients; finally, a linear fit is performed between the predicted values.

Liang et al. proposed a calibration transfer method based on canonical correlation analysis that successfully corrected the differences between spectra. First, a PLS model is built on the calibration set of the master instrument; part of the calibration sets of the master and slave instruments is selected as standard samples; features are extracted from each by canonical correlation analysis (CCA); and the relationship between the master and slave spectra is established by ordinary least squares (OLS), which corrects the spectral differences. Other calibration transfer methods have also been proposed, such as spectral regression (SR), transfer by orthogonal projection (TOP), single wavelength standardization (SWS), multispectral calibration transfer based on independent component analysis, generalized least squares weighting (GLSW), and other methods that require standard samples.
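The slope and bias correction described above can be sketched in a few lines. This is a minimal illustration, not code from the patent: the function names `fit_slope_bias` and `correct` are hypothetical, and it assumes the predictions for the same standard samples are already available for both instruments.

```python
def fit_slope_bias(pred_slave, pred_master):
    """Least-squares line pred_master ~ slope * pred_slave + bias (hypothetical helper)."""
    n = len(pred_slave)
    mean_s = sum(pred_slave) / n
    mean_m = sum(pred_master) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((s - mean_s) ** 2 for s in pred_slave)
    sxy = sum((s - mean_s) * (m - mean_m) for s, m in zip(pred_slave, pred_master))
    slope = sxy / sxx
    bias = mean_m - slope * mean_s
    return slope, bias


def correct(pred_slave, slope, bias):
    """Apply the fitted line to new slave-instrument predictions."""
    return [slope * p + bias for p in pred_slave]


slope, bias = fit_slope_bias([1.0, 2.0, 3.0], [2.5, 4.5, 6.5])
```

Because the fit needs paired predictions for the same samples on both instruments, SBC belongs to the family of methods that depend on standard samples.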

As the above shows, the prior art already offers many ways to develop fairly robust calibration models. However, changes in environmental conditions or adjustments to the measuring instrument can degrade the predictive performance of a calibration model or even invalidate it. It is therefore necessary to transfer the knowledge in an existing calibration model to the spectra to be measured, so that predictions can be made for them without incurring large costs. Most existing calibration transfer methods that significantly improve predictive performance need standard samples to build the transfer model. The standard samples should closely match the samples used to build the calibration model and must show enough variability to explain the differences between the two instruments. The volatility and reactivity of the components make it very difficult to keep standard samples intact. In some practical applications it is difficult or even impossible to obtain standard samples at all, that is, to measure their spectra on both the master and the slave instrument. A few calibration transfer methods do not need standard samples, but their predictive performance is considerably worse than that of methods that use standard samples.

Summary of the invention

In view of the problems of the prior art, the invention provides a CT-CDD-based calibration transfer method for infrared spectrum measuring instruments. It does not need standard samples to build the transfer model, and it can greatly improve the accuracy and efficiency of calibration transfer for infrared spectrum measuring instruments.

The technical solution of the invention is as follows:

A CT-CDD-based calibration transfer method for infrared spectrum measuring instruments, characterized in that it comprises the following steps:

Step 1: Map the master infrared spectrum measuring instrument to the source domain and the slave infrared spectrum measuring instrument to the target domain. Use the master and slave instruments to acquire the spectrum of each sample, giving the master spectra and the slave spectra; extract spectral data from each at intervals of a nm over the wavelength range; and measure the substance concentration variable of each sample. This yields the source-domain data set {X^m, y^m} and the target-domain data set {X^s}.

Here X^m and X^s are the master and slave spectral matrices, made up of the master and slave spectrum vectors of the i-th sample, i = 1, 2, ..., I, where I is the total number of samples. The entries of these vectors are the j-th master and slave spectral data points of the i-th sample, j = 1, 2, ..., J, where J is the total number of extracted spectral data points. y^m is the matrix of substance concentration values, each entry of which is the value of the substance concentration variable of the i-th sample.

Step 2: Use the KS algorithm to divide the source-domain data set {X^m, y^m} into a source-domain calibration set and a source-domain test set.
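The KS (Kennard-Stone) split in step 2 can be sketched as follows. This is a minimal NumPy illustration of the usual Kennard-Stone selection rule, not the patent's own code; the function name `kennard_stone` and its return convention are assumptions made for the example, and it expects `n_select` to be at least 2.

```python
import numpy as np


def kennard_stone(X, n_select):
    """Return (selected, remaining) row indices of X chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the remaining sample that is farthest from the selected set.
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining


selected, remaining = kennard_stone([[0.0], [1.0], [2.0], [10.0]], n_select=3)
```

In this usage the selected rows would play the role of the source-domain calibration set and the remaining rows the source-domain test set.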

Step 3: Center the source-domain calibration set to obtain the centered source-domain calibration set.
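A minimal sketch of the centering in step 3, assuming the usual convention of subtracting the column means of the calibration spectra and the mean of the concentration values. The function name `center` is hypothetical, and returning the means for later reuse is a design choice of this example rather than something stated in the text.

```python
import numpy as np


def center(X_cal, y_cal):
    """Subtract column means from the spectra and the mean from the concentrations."""
    x_mean = X_cal.mean(axis=0)
    y_mean = y_cal.mean()
    return X_cal - x_mean, y_cal - y_mean, x_mean, y_mean


Xc, yc, x_mean, y_mean = center(np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([1.0, 3.0]))
```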

Step 4: Using the PLS algorithm, build a calibration model on the centered source-domain calibration set, and compute the weight matrix W^m and the loading matrix P^m of the centered spectra together with the regression coefficient matrix β^m.
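To make the quantities in step 4 concrete, here is a minimal PLS1 sketch using the NIPALS iteration. The text does not say which PLS variant is used, so the choice of NIPALS, the function name `pls1_nipals`, and the convention that W and P are stored with one column per component are all assumptions of this example.

```python
import numpy as np


def pls1_nipals(Xc, yc, n_components):
    """PLS1 on centered data; returns weights W, loadings P and coefficients beta."""
    X = np.array(Xc, dtype=float)
    y = np.array(yc, dtype=float).ravel()
    n_features = X.shape[1]
    W = np.zeros((n_features, n_components))
    P = np.zeros((n_features, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = X.T @ y                      # weight vector: covariance direction
        w /= np.linalg.norm(w)
        t = X @ w                        # score vector
        tt = t @ t
        p = X.T @ t / tt                 # loading vector
        q[a] = y @ t / tt                # regression of y on the score
        X = X - np.outer(t, p)           # deflate the spectra
        y = y - q[a] * t                 # deflate the response
        W[:, a], P[:, a] = w, p
    beta = W @ np.linalg.solve(P.T @ W, q)
    return W, P, beta


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5])
W, P, beta = pls1_nipals(X_demo, y_demo, n_components=3)
```

With as many components as there are (linearly independent) columns, PLS reproduces the ordinary least squares solution, which is what the small demo relies on; in practice the number of components would be chosen by cross-validation.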

Step 5: Build the transfer model:

Step 5.1: Compute the feature spectrum matrix of the master infrared spectrum measuring instrument as

T^m = X^m W^m (P^m W^m)^{-1}

Then compute the pseudo-feature spectrum matrix of the slave infrared spectrum measuring instrument.
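The projection in step 5.1 can be sketched as a single function. It assumes that W and P are stored as J x A matrices (one column per PLS component), so the product written as P^m W^m in the text is computed as `P.T @ W`; that storage convention and the function name `pls_scores` are assumptions of this example.

```python
import numpy as np


def pls_scores(X, W, P):
    """Project spectra X (n x J) onto the PLS subspace: T = X W (P^T W)^{-1}."""
    return X @ W @ np.linalg.inv(P.T @ W)


# Toy check with orthonormal weights equal to the loadings, so T is simply X W.
W_demo = np.eye(3)[:, :2]
P_demo = np.eye(3)[:, :2]
T_demo = pls_scores(np.array([[1.0, 2.0, 3.0]]), W_demo, P_demo)
```

Applying the same function to the slave spectra with the master instrument's W and P would give scores in the master PLS subspace, which matches the later description of projecting both instruments' spectra into that subspace; whether the slave spectra are first centered with the master means is not stated in the text.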

Step 5.2: For each candidate number of clusters L ∈ L^*, use the k-means clustering algorithm to cluster the feature spectrum vectors of the data set {T^m, y^m}, dividing it into L sub-data sets, l = 1, 2, ..., L.

Using the OLS algorithm, build an initial least squares model on the l-th sub-data set, l = 1, 2, ..., L.

Compute the cross-validation error RMSECV_L of the L initial least squares models for each number of clusters, and take the number of clusters corresponding to min{RMSECV_L | L ∈ L^*} as the final number of clusters K.

Here L^* is the set of candidate cluster numbers; the l-th sub-data set consists of the l-th initial sub-feature spectrum matrix and the matrix of substance concentration values of the corresponding samples; and β_{0_l} is the l-th initial regression coefficient matrix.
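The selection of K in step 5.2 can be sketched as below. Several details are assumptions of this example rather than statements from the text: the plain Lloyd's k-means with a fixed seed, an intercept term in each OLS model, and leave-one-out cross-validation pooled over all clusters as the definition of RMSECV_L. All function names are hypothetical.

```python
import numpy as np


def kmeans(T, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = T[rng.choice(len(T), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        d = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(n_clusters):
            if np.any(labels == k):      # keep the old center if a cluster empties
                new_centers[k] = T[labels == k].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def ols_fit(T, y):
    """Least-squares coefficients with an intercept column appended."""
    A = np.column_stack([T, np.ones(len(T))])
    return np.linalg.lstsq(A, y, rcond=None)[0]


def ols_predict(T, beta):
    return np.column_stack([T, np.ones(len(T))]) @ beta


def rmsecv_for(T, y, n_clusters):
    """Pooled leave-one-out RMSE of the per-cluster OLS models."""
    labels, _ = kmeans(T, n_clusters)
    sq_err = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        for i in idx:
            train = idx[idx != i]
            if len(train) == 0:          # a singleton cluster cannot be cross-validated
                continue
            beta = ols_fit(T[train], y[train])
            sq_err.append((ols_predict(T[[i]], beta)[0] - y[i]) ** 2)
    return float(np.sqrt(np.mean(sq_err)))


def choose_n_clusters(T, y, candidates):
    """Return the candidate with the smallest RMSECV, plus all scores."""
    scores = {L: rmsecv_for(T, y, L) for L in candidates}
    return min(scores, key=scores.get), scores


# Two well-separated groups that follow different linear relations.
T_demo = np.concatenate([np.arange(10.0), 100.0 + np.arange(10.0)])[:, None]
y_demo = np.where(T_demo[:, 0] < 50, 2.0 * T_demo[:, 0], -3.0 * T_demo[:, 0] + 5.0)
best, scores = choose_n_clusters(T_demo, y_demo, [1, 2])
```

In the toy data a single global model fits poorly while two local models fit exactly, so the cross-validation error favors two clusters.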

Step 5.3: With the number of clusters K, use the k-means clustering algorithm to cluster the feature spectrum vectors of the data set {T^m, y^m}, dividing it into K sub-data sets, k = 1, 2, ..., K.

With the same number of clusters K, use the k-means clustering algorithm to cluster the pseudo-feature spectrum vectors of the slave-instrument data set, dividing it into K sub-data sets, k = 1, 2, ..., K.

Here the feature spectrum vectors and pseudo-feature spectrum vectors are the row vectors of T^m and of the pseudo-feature spectrum matrix, respectively. Each clustering yields the k-th sub-feature spectrum matrix and the k-th sub-pseudo-feature spectrum matrix, and each master sub-data set also contains the matrix of substance concentration values of the corresponding samples.

Step 5.4: Using the OLS algorithm, build the k-th least squares model on the k-th sub-data set, and compute the k-th regression coefficient matrix β_k.

Step 5.5: Compute the k-th transformation matrix M_k from the covariance matrices of the two k-th sub-matrices obtained in step 5.3.
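The expression for the transformation matrix is not reproduced in this text; only its dependence on two covariance matrices is. One common way to build a transform from two covariance matrices is whitening followed by recoloring (the idea behind correlation alignment). The sketch below assumes that form purely for illustration; it is not confirmed to be the formula of the patent, and the names `sym_power` and `alignment_matrix` are hypothetical. It needs at least two feature columns.

```python
import numpy as np


def sym_power(C, power, eps=1e-10):
    """Power of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    vals = np.clip(vals, eps, None)      # guard against tiny or negative eigenvalues
    return (vecs * vals ** power) @ vecs.T


def alignment_matrix(T_master_k, T_slave_k):
    """ASSUMED form: M = C_s^{-1/2} C_m^{1/2}, so that cov(T_slave_k @ M) matches C_m."""
    C_m = np.cov(T_master_k, rowvar=False)
    C_s = np.cov(T_slave_k, rowvar=False)
    return sym_power(C_s, -0.5) @ sym_power(C_m, 0.5)


rng = np.random.default_rng(0)
T_m = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.5])
T_s = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.2, 0.0], [0.0, 2.0, 0.1], [0.0, 0.0, 0.7]])
M = alignment_matrix(T_m, T_s)
```

Under this assumed form, no paired standard samples are required: only the second-order statistics of each cluster on each instrument enter the transform.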

Step 6: Predict the substance concentration variable of the set of measured objects:

Step 6.1: Use the slave infrared spectrum measuring instrument to acquire the spectrum of each measured object in the set, and extract the spectral data in the same way as in step 1, giving the slave spectral matrix of the set of measured objects.

Step 6.2: Compute the pseudo-feature spectrum matrix of the set of measured objects on the slave infrared spectrum measuring instrument.

Step 6.3: With the number of clusters K, use the k-means clustering algorithm to cluster the pseudo-feature spectrum vectors of this data set, dividing it into K sub-data sets, k = 1, 2, ..., K, where the k-th sub-data set is the k-th sub-pseudo-feature spectrum matrix of the set of measured objects.

Step 6.4: Use the k-th transformation matrix M_k to apply the transformation correction to the k-th sub-pseudo-feature spectrum matrix, giving the k-th transformation-corrected sub-pseudo-feature spectrum matrix.

Step 6.5: Compute the matrix of predicted substance concentration values of the measured objects that correspond to the k-th transformation-corrected sub-pseudo-feature spectrum matrix.
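Steps 6.4 and 6.5 can be sketched as a loop over clusters. The prediction formula is not reproduced in this text, so the sketch assumes the simplest reading: each corrected sub-matrix is multiplied by the regression coefficients β_k of the matching master cluster, and the calibration mean of y is added back. The function name `predict_by_cluster`, the `y_mean` parameter and the assumption that cluster labels are already matched between instruments are all hypothetical.

```python
import numpy as np


def predict_by_cluster(T_test, labels, M_list, beta_list, y_mean=0.0):
    """Correct each cluster's pseudo-feature rows with M_k, then apply the k-th model."""
    y_hat = np.zeros(len(T_test))
    for k, (M_k, beta_k) in enumerate(zip(M_list, beta_list)):
        rows = np.flatnonzero(labels == k)
        corrected = T_test[rows] @ M_k               # step 6.4: transformation correction
        y_hat[rows] = corrected @ beta_k + y_mean    # step 6.5: per-cluster prediction
    return y_hat


T_test = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
labels = np.array([0, 1, 0])
M_list = [np.eye(2), 2.0 * np.eye(2)]
beta_list = [np.array([1.0, 1.0]), np.array([1.0, 0.0])]
y_hat = predict_by_cluster(T_test, labels, M_list, beta_list)
```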

The beneficial effects of the invention are as follows:

The invention performs calibration transfer by correcting the data distribution difference in the PLS subspace (CT-CDD). Specifically, a PLS model of the master instrument is built, and the spectra of both the master and slave instruments are projected into the PLS subspace. The latent variables of the two sets of spectra are clustered separately. Ordinary least squares is used to build regression models between the latent variables of the master instrument and the concentration information, and for the two instruments the feature spectra whose data distributions are closest are matched. A conversion function is computed for each match and used to predict the substance concentration variable of the measured objects, so that each prediction is corrected by its own conversion function. The whole process needs no standard samples to build the transfer model, and it greatly improves the accuracy and efficiency of calibration transfer for infrared spectrum measuring instruments. It thereby solves two technical problems of the prior art: calibration transfer methods that significantly improve predictive performance need standard samples, which are difficult or impossible to obtain and whose integrity is hard to guarantee; and the few methods that need no standard samples predict poorly.

Description of drawings

Figure 1 is a schematic diagram of the analysis process of near-infrared spectroscopy.

Figure 2 is a flow chart of the CT-CDD-based calibration transfer method for infrared spectrum measuring instruments of the invention.

Figure 3 is a schematic diagram of the spectral differences between different instruments in Example 1 and Example 2.

Figure 4 is a schematic diagram of the prediction results on M5*-MP5 of the CT-CDD-based calibration transfer method of the invention and five other calibration transfer methods in Example 1.

Figure 5 is a schematic diagram of the prediction results on M5*-MP6 of the CT-CDD-based calibration transfer method of the invention and five other calibration transfer methods in Example 1.

Figure 6 is a schematic diagram of the prediction results on MP5*-MP6 of the CT-CDD-based calibration transfer method of the invention and five other calibration transfer methods in Example 1.

Figure 7 is a schematic diagram of the prediction results on B1*-B2 of the CT-CDD-based calibration transfer method of the invention and five other calibration transfer methods in Example 2.

Figure 8 is a schematic diagram of the prediction results on B1*-B3 of the CT-CDD-based calibration transfer method of the invention and five other calibration transfer methods in Example 2.

Figure 9 is a schematic diagram of the prediction results on B3*-B2 of the CT-CDD-based calibration transfer method of the invention and five other calibration transfer methods in Example 2.

Detailed description

The invention is further described below with reference to the accompanying drawings and specific embodiments.

The invention addresses two technical problems of the prior art: calibration transfer methods that significantly improve predictive performance need standard samples to build the transfer model, yet standard samples are difficult or impossible to obtain and their integrity is hard to guarantee; and the few calibration transfer methods that need no standard samples predict poorly. Because spectral data are high-dimensional and multicollinear, the invention draws on transfer learning methods from machine learning and proposes a calibration transfer method that needs no transfer standards. On two near-infrared spectral data sets, the performance of the CT-CDD method of the invention is compared with the predictive performance of SBC, PDS, CCACT, TCR and CTAI. Without any standard samples, the invention achieves better predictive performance than the classical calibration transfer methods that use standards.

Example 1

As shown in Figure 2, the CT-CDD-based calibration transfer method for infrared spectrum measuring instruments of the invention comprises the following steps:

Step 1: Acquire the source-domain data set {X^m, y^m} and the target-domain data set {X^s}, where X^m and X^s are the master and slave spectral matrices, respectively, and each entry of y^m is the value of the substance concentration variable of the i-th sample.

In Example 1 the samples are corn and the spectral data are absorbance values. The substance concentration variable can be moisture content, oil content, protein content or starch content; this example uses moisture content to verify the method of the invention. The corn data set consists of data measured on the same I = 80 samples with three near-infrared spectrum measuring instruments (M5, MP5 and MP6). Each instrument measures the spectrum every a = 2 nm over the wavelength range 1100-2498 nm, giving J = 700 channels in total. The spectral differences between M5 and MP5, between M5 and MP6, and between MP5 and MP6 are shown in Figure 3A, Figure 3B and Figure 3C, respectively.
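The channel count quoted for the corn data set follows directly from the wavelength range and the sampling interval. The short check below uses only the numbers given in the text; the variable names are illustrative.

```python
start_nm, stop_nm, step_nm = 1100, 2498, 2

# One channel every 2 nm from 1100 nm to 2498 nm inclusive.
wavelengths = list(range(start_nm, stop_nm + 1, step_nm))
n_channels = len(wavelengths)
```

Each instrument's spectral matrix in this example therefore has 80 rows (samples) and 700 columns (channels).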

Step 2: Use the KS algorithm to divide the source-domain data set {X^m, y^m} into a source-domain calibration set and a source-domain test set.

In Example 1, the Kennard-Stone (KS) algorithm divides the 80 corn samples into two groups: the source-domain data of the first group of 64 samples form the source-domain calibration set, and the source-domain data of the second group of 16 samples form the source-domain test set.

Step 3: Center the source-domain calibration set to obtain the centered source-domain calibration set.

Step 4: Using the PLS algorithm, build a calibration model on the centered source-domain calibration set, and compute the weight matrix W^m and the loading matrix P^m of the centered spectra together with the regression coefficient matrix β^m.

步骤5：构建迁移模型：Step 5: Build the migration model:

T^m＝X^mW^m(P^mW^m)^-1 T ^m =X ^m W ^m (P ^m W ^m ) ^-1

l=1,2,...,L;

基于OLS算法对第l个子数据集

建立初始最小二乘模型

l＝1,2,...,L；Based on the OLS algorithm, the l-th sub-data set is

Build an initial least squares model

l=1,2,...,L;

其中，L^*为聚类数目集合，

为第l个初始子特征光谱矩阵，

为

is the l-th initial sub-feature spectrum matrix,

for

k=1,2,...,K;

以聚类数目K，利用k-means聚类算法对数据集

的伪特征光谱向量进行聚类，将数据集

划分为K个子数据集

The pseudo eigenspectral vector of the clustering, the dataset

Divide into K sub-datasets

k=1,2,...,K;

其中，特征光谱向量、伪特征光谱向量分别为T^m、

的行向量，

分别为第k个子特征光谱矩阵、子伪特征光谱矩阵，

为

the row vector of ,

for

Step 5.4: Based on the OLS algorithm, build the k-th least squares model on the k-th sub-data set and calculate the k-th regression coefficient matrix β_k;
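Step 5.4 amounts to one ordinary least squares fit per cluster. A minimal sketch, assuming centered data so that no intercept is fitted; `fit_cluster_ols` and the toy values are hypothetical.

```python
import numpy as np

def fit_cluster_ols(T, y, labels, n_clusters):
    """Fit one least squares model y_k ~ T_k beta_k per cluster."""
    betas = []
    for k in range(n_clusters):
        mask = labels == k
        beta_k, *_ = np.linalg.lstsq(T[mask], y[mask], rcond=None)
        betas.append(beta_k)
    return betas

# Toy data: two clusters that follow different linear relations.
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
y = np.concatenate([T[:3] @ np.array([1.0, 2.0]),
                    T[3:] @ np.array([3.0, -1.0])])

betas = fit_cluster_ols(T, y, labels, n_clusters=2)
```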

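Step 5.5: Calculate the k-th transformation matrix M_k, where the two matrices appearing in its formula are the covariance matrices of the k-th sub-characteristic spectral matrix and of the k-th sub-pseudo-characteristic spectral matrix, respectively.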

When the least squares models are built, the master instrument model closest to each cluster of slave instrument characteristic spectra is found, and a transformation matrix is calculated for each such pair. The details are as follows:

The master instrument and the slave instrument each correspond to a domain. A domain consists of two main parts: the input space X and its marginal probability distribution P(X). The relative entropy, or KL divergence, can represent the distance between the data distributions of two domains and is expressed by formula (1):

where p and q are the probability density functions of the data distributions of the source domain and the target domain, respectively.

p(x) cannot be obtained directly, but suppose that a finite set of training points x_n, n = 1, ..., N, drawn from p(x) has been observed. The expectation with respect to p(x) can then be approximated by a finite sum over these points, as follows:
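The formula images for (1) and (2) are not reproduced in this text. For reference, the standard definition of the KL divergence and its finite-sample approximation read as follows; that these match the patent's formulas (1) and (2) in notation is an assumption.

```latex
\mathrm{KL}(p \,\|\, q) = \int p(x)\,\ln\frac{p(x)}{q(x)}\,dx \tag{1}

\mathrm{KL}(p \,\|\, q) \approx \frac{1}{N}\sum_{n=1}^{N}\bigl[\ln p(x_n) - \ln q(x_n)\bigr] \tag{2}
```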

Given the labeled spectra {X^m, y^m} of the master instrument and the unlabeled spectra {X^s} of the slave instrument, the goal is to predict the output for the spectra to be measured on the slave instrument.

Spectra measured by different instruments differ, so the data distributions of the two instruments differ. Formula (3) expresses the distance between the data distributions as an absolute value, where X^m and X^s are random vectors that follow their respective data distributions:

KL(P||Q) ≈ |ln P(X^m) - ln Q(X^s)|    (3)

where P and Q are the probability density functions of the data distributions of the source domain and the target domain, respectively.

Because spectral data are multicollinear, all spectra are mapped into the PLS subspace of the master instrument, which reduces the dimensionality of the data and also simplifies model (3). The characteristic spectra of the master instrument and the pseudo-characteristic spectra of the slave instrument are calculated using formula (4), in the following form:

where T^m and the pseudo-characteristic spectral matrix of the slave instrument are each a matrix composed of the A extracted score vectors.

The KL distance between the data distributions of the two instruments then becomes:

KL(P||Q) ≈ |ln P(t^m) - ln Q(t^s)|    (5)

where t^m and t^s are random vectors that follow the data distributions of T^m and of the slave instrument's pseudo-characteristic spectral matrix, respectively.

The spectral data of both the master and the slave instrument follow mixture distributions. Assuming that each cluster follows a single Gaussian distribution after clustering, the data distributions of the characteristic spectra of the master and slave instruments are each a mixture of such Gaussians. The distribution difference in formula (6) can be reduced by correcting the mean and the covariance of each cluster separately. The data are first centered to correct the mean. The data distributions of the i-th characteristic spectra of the two instruments are then the i-th Gaussian distributions of the two instruments, related by the i-th conversion function, and two random vectors are defined that follow the data distributions of the i-th characteristic spectra of the two instruments, respectively. Formula (5) can then be written in the form of formula (6).

Assume that there exists a linear transformation matrix M_i for which the above relation holds, so that the distance between the corrected slave spectra and the master spectra is minimized. The relative entropy (6) can then be rewritten as follows:

In formula (7), the linear transformation matrix M_i is solved as follows:

After clustering, each group of data is approximately normally distributed with zero mean. The probability density function of the master characteristic spectra is given by formula (8.1), and the probability density function of the slave instrument is given by formula (8.2).
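The images for formulas (8.1) and (8.2) are not reproduced in this text. As a reference point, a zero-mean Gaussian density in the A-dimensional PLS subspace has the standard form below; the symbols \Sigma_i^m and \Sigma_i^s for the cluster covariance matrices, and the column-vector convention, are assumptions rather than the patent's notation.

```latex
p_i(t^m) = \frac{1}{(2\pi)^{A/2}\,\lvert\Sigma_i^m\rvert^{1/2}}
           \exp\!\Bigl(-\tfrac{1}{2}\,(t^m)^{\top}(\Sigma_i^m)^{-1}\,t^m\Bigr) \tag{8.1}

q_i(t^s) = \frac{1}{(2\pi)^{A/2}\,\lvert\Sigma_i^s\rvert^{1/2}}
           \exp\!\Bigl(-\tfrac{1}{2}\,(t^s)^{\top}(\Sigma_i^s)^{-1}\,t^s\Bigr) \tag{8.2}
```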

After transformation by the conversion function, the random vector t^s of the slave instrument is converted into the random vector t^m of the master instrument, as follows:

Assuming that M_i is a non-singular matrix, the random vector of the master instrument can be converted into the random vector of the slave instrument using formula (10):

According to the properties of probability density functions, the probability density function of the master instrument can be transformed using formula (9); the result can be transformed further, which gives:

Formula (11) is the probability density function of the slave instrument's characteristic spectra after they are transformed to the master instrument by the transformation matrix M_i. Its expanded form is as follows:

Formula (12) and formula (8.1) are the same probability density function, so their covariances are equal. From this equality, the solution of M_i is:
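The closed-form solution is not reproduced in this text. The covariance-matching condition it rests on can nevertheless be illustrated: the sketch below builds one matrix M that makes the covariance of the corrected slave scores equal to the master covariance. It assumes a row-vector convention (corrected scores are T_s M) and uses symmetric matrix square roots. Such an M is not unique, so this is not necessarily the patent's formula. All names are hypothetical.

```python
import numpy as np

def sym_power(S, power):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** power) @ vecs.T

def covariance_matching_matrix(T_master, T_slave):
    """Return M such that cov(T_slave @ M) equals cov(T_master)."""
    cov_m = np.cov(T_master, rowvar=False)
    cov_s = np.cov(T_slave, rowvar=False)
    return sym_power(cov_s, -0.5) @ sym_power(cov_m, 0.5)

rng = np.random.default_rng(1)
T_master = rng.normal(size=(200, 3)) @ np.diag([2.0, 1.0, 0.5])
T_slave = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.3, 0.0],
                                                [0.0, 1.0, 0.2],
                                                [0.0, 0.0, 1.5]])
T_master -= T_master.mean(axis=0)   # mean correction by centering
T_slave -= T_slave.mean(axis=0)

M = covariance_matching_matrix(T_master, T_slave)
T_corrected = T_slave @ M
```

With both clusters centered first, the mean is corrected by the centering and the covariance by M, which mirrors the two-stage correction described above.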

Step 6.3: With the number of clusters K, use the k-means clustering algorithm to cluster the pseudo-characteristic spectral vectors of the measured object set's data set, dividing that data set into K sub-data sets, k = 1, 2, ..., K, each of which is the k-th sub-pseudo-characteristic spectral matrix of the measured object set.

Step 6.4: Use the k-th transformation matrix M_k to transform and correct the k-th sub-pseudo-characteristic spectral matrix, obtaining the k-th corrected sub-pseudo-characteristic spectral matrix.

Step 6.5: For the k-th corrected sub-pseudo-characteristic spectral matrix, calculate the corresponding matrix of predicted substance concentration variable values of the measured objects, k = 1, 2, ..., K.
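Steps 6.3 to 6.5 can be sketched together. As a simplification, the sketch assigns each pseudo-characteristic spectrum to the nearest existing cluster center instead of re-running k-means, applies the cluster's transformation matrix, and predicts with the cluster's regression coefficients. It assumes the row-vector convention (corrected scores are T M_k). `predict_by_cluster` and the toy inputs are hypothetical.

```python
import numpy as np

def predict_by_cluster(T_slave, centers_s, transforms, betas):
    """Predict concentrations cluster by cluster: y_k = (T_k @ M_k) @ beta_k."""
    d2 = ((T_slave[:, None, :] - centers_s[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)              # nearest-center assignment
    y_hat = np.empty(len(T_slave))
    for k in range(len(centers_s)):
        mask = labels == k
        y_hat[mask] = T_slave[mask] @ transforms[k] @ betas[k]
    return y_hat, labels

centers_s = np.array([[0.0, 0.0], [10.0, 10.0]])
transforms = [np.eye(2), 2.0 * np.eye(2)]          # toy M_1, M_2
betas = [np.array([1.0, 1.0]), np.array([1.0, 0.0])]
T_new = np.array([[1.0, 0.0], [9.0, 10.0]])        # toy pseudo-characteristic spectra

y_hat, labels_new = predict_by_cluster(T_new, centers_s, transforms, betas)
```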

In this embodiment, the substance concentration variable of the measured object set is predicted both with the CT-CDD-based calibration migration method of the present invention and with the conventional calibration migration methods based on SBC, PDS, CCACT, TCR, and CTAI. The PLS model of the master instrument is built on the calibration set. For the migration methods that require standard samples, the Kennard-Stone method is used to select a number of standard samples from the calibration set. The SBC, PDS, CCACT, and CTAI algorithms all use PLS as the underlying algorithm: a multivariate calibration model built from the master instrument's spectral data serves as the reference model for predicting the samples to be measured on the slave instrument.

The parameter selection criteria for the different migration methods are similar to those of CTAI. The optimal number of latent variables of the PLS model is searched over the range [1, 15] by ten-fold cross-validation, and the number giving the minimum cross-validation error is selected.

For SBC and PDS, the PLS modeling of the master instrument and the parameter optimization are the same as for CTAI. The PDS window size is searched from 3 up to 16 in increments of 2. The parameter is selected by 5-fold cross-validation: the RMSECV of each window model is calculated, and the window with the smallest RMSECV is taken as the optimal parameter. PDS performs poorly on the wheat data set, so there an F-test is used to determine the optimal window size.
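The search grids and the minimum-RMSECV rule described above can be sketched as follows. The RMSECV numbers are made up purely to show the selection step and are not results from the patent; `select_by_min_rmsecv` is a hypothetical helper.

```python
def select_by_min_rmsecv(candidates, rmsecv_of):
    """Return the candidate with the smallest cross-validation error."""
    return min(candidates, key=rmsecv_of)

# Latent variable counts 1..15 and PDS window sizes 3, 5, ..., 15
# (start at 3, step 2, not exceeding 16).
latent_variable_counts = list(range(1, 16))
window_sizes = list(range(3, 17, 2))

# Made-up RMSECV values, for illustration only.
toy_rmsecv = {3: 0.30, 5: 0.22, 7: 0.25, 9: 0.27, 11: 0.31, 13: 0.35, 15: 0.40}
best_window = select_by_min_rmsecv(window_sizes, toy_rmsecv.get)
```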

In this embodiment, the root mean square error (RMSE) is used as the criterion for parameter selection and model evaluation. RMSEC denotes the training error on the calibration set, RMSECV the cross-validation error, and RMSEP the prediction error on the test set. The RMSE is calculated as

where ŷ is the predicted value, y is the measured value, and n is the number of samples.
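The formula image is not reproduced in this text. The sketch below uses the standard RMSE definition, the square root of the mean squared difference between predicted and measured values, which is assumed to be what the missing formula states.

```python
import math

def rmse(y_pred, y_true):
    """Root mean square error between predicted and measured values."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    return math.sqrt(squared / len(y_true))

# One of three predictions is off by 2, so RMSE = sqrt(4 / 3).
example = rmse([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

The same function gives RMSEC, RMSECV, or RMSEP depending on whether it is applied to calibration, cross-validation, or test set predictions.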

On the corn data set, the RMSEC, RMSEP, RMSECVmin, and LV of the PLS models of the three instruments M5, MP5, and MP6 are shown in Table 1, where RMSECVmin is the minimum cross-validation error and LV is the number of latent variables at which that minimum is reached. Table 1 shows that the RMSECVmin, RMSEC, and RMSEP of the PLS model of instrument M5 are 0.01066, 0.00599, and 0.00764, respectively. The three root mean square errors are close to one another, so the PLS model is relatively stable and shows neither overfitting nor underfitting. The RMSECVmin, RMSEC, and RMSEP of the PLS model of instrument MP5 are 0.13035, 0.09458, and 0.12445, respectively; as with M5, there is no underfitting or overfitting, and the same conclusion holds for the PLS model of MP6. The parameters are selected by 10-fold cross-validation, with the optimal number of latent variables determined by the lowest RMSECV; the optimal numbers for M5, MP5, and MP6 are 14, 15, and 10, respectively. It is important that the master instrument provides a model with better prediction performance, so this embodiment selects the instrument with the better prediction performance as the master instrument. Table 1 shows that the prediction error of MP6 is greater than that of MP5, which in turn is greater than that of M5, so it is reasonable to test the model on the three combinations M5*-MP5, M5*-MP6, and MP5*-MP6, where the superscript * marks the master instrument and the other instrument is the slave.

Table 1

Instrument  Reference  RMSEC    RMSEP    RMSECVmin  LV
M5          Moisture   0.00599  0.00764  0.01066    14
MP5         Moisture   0.09458  0.12445  0.13035    15
MP6         Moisture   0.09991  0.15637  0.14775    10

CT-CDD is compared with the five calibration migration methods SBC, PDS, CCACT, TCR, and CTAI. In CT-CDD, the number of clusters is determined by ten-fold cross-validation. The corn data set contains only 80 samples, so the maximum number of sub-models after clustering is set to 3; with more clusters, the calculated migration matrix becomes rank-deficient, which drives the final predictions to infinity. Because of the limited sample size, when the number of clusters is large, the clustered characteristic spectra do not contain enough samples to build a stable model. In Embodiment 1, the calculation shows that the minimum cross-validation error is obtained with 2 clusters.
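The rank deficiency mentioned above can be reproduced with a small numerical check: the sample covariance of n vectors has rank at most n - 1, so a cluster with fewer samples than latent variables gives a singular covariance matrix. The cluster size of 10 is a hypothetical value chosen for illustration; 14 is the latent variable count reported for M5 in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latent = 14        # latent variables of the M5 model (Table 1)
n_in_cluster = 10    # hypothetical size of a small cluster

T_k = rng.normal(size=(n_in_cluster, n_latent))   # toy sub-characteristic spectra
cov_k = np.cov(T_k, rowvar=False)                 # 14 x 14 covariance matrix
rank = np.linalg.matrix_rank(cov_k)

# rank < n_latent means cov_k cannot be inverted, so a transformation
# matrix that relies on its inverse is not well defined.
is_singular = rank < n_latent
```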

The prediction errors of CT-CDD and the other five calibration migration methods are shown in Table 2. In Table 2, N is the number of standard samples for the migration methods that require them, a is the optimal window size in PDS, and b is the dimension of the corresponding optimal subspace in TCR.

Table 2 shows that:

(1) For the spectral transfer from instrument MP5 to instrument M5: SBC reaches its lowest RMSEP (0.28872) with 35 standard samples, PDS reaches its lowest RMSEP (0.18828) with 5 standard samples, and CCACT reaches its lowest RMSEP (0.18699) with 25 standard samples. The RMSEP of CT-CDD (0.15024) is smaller than the minimum RMSEP of PDS, SBC, and CCACT, and also smaller than that of TCR (0.47391) and CTAI (0.17511).

(2) For the spectral transfer from MP6 to M5, the lowest RMSEP values obtained by SBC, PDS, and CCACT are 0.33240, 0.27901, and 0.17862, respectively. CT-CDD has a lower RMSEP than the other five methods.

(3) For the spectral transfer from MP6 to MP5, the lowest RMSEP values of SBC, PDS, and CCACT are 0.20481, 0.18409, and 0.13722, respectively, and the prediction errors of TCR and CTAI are 0.46124 and 0.16563, respectively. CT-CDD again achieves the smallest RMSEP (0.12357).

These three sets of experiments show that the CT-CDD model generally achieves the best prediction performance and generalizes better.

Table 2

Figures 4, 5, and 6 plot the predicted values against the measured values for the six calibration migration methods on the combinations M5*-MP5, M5*-MP6, and MP5*-MP6, respectively. A sample whose predicted concentration equals its measured concentration lies on the straight line. For the calibration migration methods that use standard samples, the experiment with the number of standard samples giving the best prediction performance is selected for comparison, so that the comparison with CT-CDD is made against each method at its best.

The prediction results of CT-CDD, CTAI, TCR, CCACT, SBC, and PDS on M5*-MP5 are shown in Figures 4A, 4B, 4C, 4D, 4E, and 4F, respectively; those on M5*-MP6 in Figures 5A, 5B, 5C, 5D, 5E, and 5F; and those on MP5*-MP6 in Figures 6A, 6B, 6C, 6D, 6E, and 6F. Figure 4 shows that the sample points of CT-CDD lie closer to the straight line, while TCR and SBC fit poorly in this experiment. Figure 5 shows that, for the spectral transfer from instrument MP6 to instrument M5, CT-CDD is closer to the straight line than the other five methods; SBC and TCR again give the worst prediction performance; and PDS, CCACT, and CTAI have smaller prediction errors but still perform worse than CT-CDD. Figure 6 shows that the spectral transfer from instrument MP6 to instrument MP5 leads to the same conclusion as Figures 4 and 5, confirming that CT-CDD achieves the best prediction performance. Compared with the other five models, CT-CDD therefore gives more satisfactory results and the best prediction performance.

Embodiment 2

In Embodiment 2, the samples are wheat. The wheat data set is the "Shootout" data set released at the 2016 International Diffuse Reflectance Conference (IDRC). It contains data measured on the same I = 248 samples by three different NIR spectrometers (B1, B2, B3), and protein content is selected as the substance concentration variable. The NIR spectrometers B1, B2, and B3 measure the infrared spectrum over the 570-1100 nm wavelength range at intervals of a = 0.5 nm. The spectral differences between B1 and B2, between B1 and B3, and between B3 and B2 are shown in Figure 3D, Figure 3E, and Figure 3F, respectively.

In Embodiment 2, the Kennard-Stone (KS) algorithm divides the 248 wheat samples into two groups: the source domain data of the 198 samples in the first group form the source domain calibration set, and the source domain data of the 50 samples in the second group form the source domain test set.

On the wheat data set, the RMSEC, RMSEP, RMSECVmin, and LV of the PLS models of the three instruments B1, B2, and B3 are shown in Table 3. Table 3 shows that the RMSECVmin, RMSEC, and RMSEP of the PLS model built on instrument B1 are 0.50337, 0.32880, and 0.33254, respectively, with no overfitting or underfitting. The same is observed for instruments B2 and B3: none of the PLS models built on the three instruments shows overfitting or underfitting, which indicates that the optimal number of latent variables was chosen reasonably. For the wheat data set, the parameter selection criteria of the PLS model are similar to those for the corn data set, with the maximum number of latent variables set to 15. Table 3 shows that the prediction error of instrument B1 is smaller than that of B3, which in turn is smaller than that of B2, so the three combinations B1*-B2, B1*-B3, and B3*-B2 are used to test model performance.

Table 3

Instrument  Reference  RMSEC    RMSEP    RMSECVmin  LV
B1          Protein    0.32880  0.33254  0.50337    15
B2          Protein    0.21636  0.83755  0.32441    15
B3          Protein    0.30288  0.51567  0.43896    15

In CT-CDD, the number of characteristic spectrum clusters is determined in the same way as for the corn data set. Because the wheat data set contains a relatively sufficient number of samples, the rank deficiency that arises when calculating the migration matrix on the corn data set does not occur. The number of clusters is searched between 2 and 5. In Embodiment 2, the calculation shows that the minimum cross-validation error is obtained with 2 clusters.

The parameter selection criteria for the other calibration migration methods are also similar to those used for the corn data set. The optimal PDS window sizes are shown in Table 4. With B1 as master and B2 as slave, the optimal PDS window sizes are 11, 15, 15, and 15; with B1 as master and B3 as slave, they are 15, 11, 5, and 5; and with B3 as master and B2 as slave, they are 7, 15, 15, and 15. In TCR, the optimal subspace dimensions are 7, 12, and 21, respectively.

Table 4 shows that:

(1) With instrument B1 as master and instrument B2 as slave, SBC gives its lowest RMSEP (0.45225) with 5 standard samples. For PDS and CCACT, the RMSEP decreases markedly as the number of standard samples increases; with 35 standard samples they reach their lowest RMSEP values of 0.47222 and 0.80448, respectively. Compared with SBC, PDS, and CCACT, CT-CDD achieves the lowest RMSEP (0.43007). The prediction errors of TCR and CTAI are 0.86884 and 0.41419, respectively, so in this experiment CT-CDD is second only to CTAI.

(2) With instrument B1 as master and instrument B3 as slave, the lowest RMSEP values of SBC, PDS, and CCACT over the different numbers of standard samples are 0.79919, 0.41235, and 0.83440, respectively. The RMSEP values of TCR and CTAI are 0.72987 and 0.68215, respectively. The RMSEP of CT-CDD (0.35160) is lower than that of every other calibration migration method, giving the best prediction performance.

(3) With instrument B3 as master and instrument B2 as slave, the lowest RMSEP values of SBC, PDS, and CCACT are 0.47177, 0.33707, and 0.75119, respectively. The prediction errors of TCR and CTAI are 0.63708 and 0.38446, respectively. CT-CDD again achieves the best prediction performance (RMSEP of 0.31856). SBC, PDS, and CCACT require standard samples, and TCR requires reference values from the slave instrument, whereas CT-CDD achieves better prediction performance without standard samples. This makes CT-CDD the more practical method.

Table 4

Figures 7, 8, and 9 plot the predicted values against the measured values for the six calibration migration methods on the combinations B1*-B2, B1*-B3, and B3*-B2, respectively.

The prediction results of CT-CDD, CTAI, TCR, CCACT, SBC, and PDS on B1*-B2 are shown in Figures 7A, 7B, 7C, 7D, 7E, and 7F, respectively; those on B1*-B3 in Figures 8A, 8B, 8C, 8D, 8E, and 8F; and those on B3*-B2 in Figures 9A, 9B, 9C, 9D, 9E, and 9F. Figures 7C and 7D clearly show that TCR and CCACT give poor correlation and large prediction errors. Figure 7A shows that CT-CDD fits well, with a correspondingly small prediction error. Figure 8 shows that CT-CDD and PDS fit well, while the other four methods fit relatively poorly. Figure 9 shows that TCR and CCACT give rather poor correlation between the measured substance concentrations and the predictions; the other four methods fit better, and among them CT-CDD fits best. Compared with the other five migration methods, CT-CDD therefore gives more satisfactory results.

Embodiments 1 and 2 above test the performance of the CT-CDD method on two NIR data sets, with CTAI, TCR, CCACT, SBC, and PDS as comparison methods. In these experiments the CT-CDD-based calibration migration method of the present invention achieves the best (lowest) RMSEP. The results show that CT-CDD successfully corrects the differences between spectra measured on different instruments. SBC, PDS, and CCACT require standard samples to build the migration model, and TCR additionally requires a small number of reference values for the slave instrument samples. Both requirements are costly in practice and sometimes cannot be met at all. Therefore, when standard samples are not available in practice, the CT-CDD-based method of the present invention is a cost-effective calibration migration method.

The standard-free calibration migration method of the present invention, which works by correcting the data distribution difference in the PLS subspace (CT-CDD), seeks a conversion function such that, when the slave instrument's data are projected into this space, the distance between the data distributions of the master and slave instruments is reduced. The characteristic spectra follow a mixture distribution, so the spectra are clustered and the distance between each pair of sub-distributions of the two instruments is minimized by its own conversion function. The present invention preserves the important properties of both instruments in the same PLS subspace and eliminates the multicollinearity of the spectra, while narrowing the difference between the master instrument's features and the slave instrument's pseudo-features more precisely. The difference in data distribution is further corrected by correcting the mean and variance of each part of the latent variables from the different instruments.

The above embodiments are only some of the embodiments of the present invention, not all of them. They serve only to explain the present invention and do not limit its scope of protection. All other embodiments obtained by those skilled in the art on the basis of the above embodiments without creative effort, that is, all modifications, equivalent replacements, and improvements made within the spirit and principle of the present application, fall within the scope of protection claimed by the present invention.

Claims

1. A CT-CDD-based calibration migration method for infrared spectrum measuring instruments, characterized by comprising the following steps:

Step 1: Let the infrared spectrum measurement master instrument correspond to the source domain and the infrared spectrum measurement slave instrument correspond to the target domain; collect the spectrum of each sample with the master instrument and with the slave instrument to obtain a master spectrum and a slave spectrum, respectively; extract spectral data from the master spectrum and the slave spectrum at intervals of a nm within a wavelength range; collect the substance concentration variable value of each sample; and obtain the source domain data set {X^m, y^m} and the target domain data set {X^s};

where X^m and X^s are the master spectral matrix and the slave spectral matrix, respectively; their rows are the master spectral vector and the slave spectral vector of the i-th sample, i = 1, 2, ..., I, where I is the total number of samples; the entries of these vectors are the j-th master spectral data point and the j-th slave spectral data point of the i-th sample, j = 1, 2, ..., J, where J is the total number of extracted spectral data points; y^m is the matrix of substance concentration variable values, whose i-th entry is the substance concentration variable value of the i-th sample;

Step 2: Use the KS algorithm to divide the source domain data set {X^m, y^m} into a source domain calibration set and a source domain test set;

Step 3: Center the source domain calibration set to obtain the centered source domain calibration set;

Step 4: Based on the PLS algorithm, establish a calibration model on the centered source domain calibration set, and calculate the weight matrix W^m and the loading matrix P^m of the centered calibration spectra, together with the regression coefficient matrix β^m;

Step 5: Construct the migration model:

Step 5.1: Calculate the characteristic spectral matrix of the infrared spectrum measurement master instrument

T^m = X^m W^m (P^m W^m)^-1

and calculate the pseudo-characteristic spectral matrix of the infrared spectrum measurement slave instrument;

Step 5.2: For each cluster number L ∈ L^*, use the k-means clustering algorithm to cluster the characteristic spectral vectors of the data set {T^m, y^m}, dividing {T^m, y^m} into L sub-data sets, l = 1, 2, ..., L;

based on the OLS algorithm, establish an initial least squares model on the l-th sub-data set, l = 1, 2, ..., L;

calculate the cross-validation error RMSECV_L of the L initial least squares models under each cluster number, and take the cluster number corresponding to min{RMSECV_L | L ∈ L^*} as the final cluster number K;

wherein L is^*To be a set of the number of clusters,

is the l-th initial sub-feature spectral matrix,

is composed of

Matrix of values of variables of the concentration of substance of the corresponding sample, beta_{0_l}Is the first initial regression coefficient matrix;
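The selection of the cluster number in step 5.2 can be sketched as follows. Several choices here are assumptions, because the claims do not fix them: k-means uses a deterministic farthest-point initialisation, the cross-validation is leave-one-out within each cluster, errors are pooled over clusters, the OLS models have no intercept, and a clustering with a cluster too small to validate is rejected. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def kmeans(T, n_clusters, n_iter=100):
    """Plain Lloyd k-means with deterministic farthest-point initialisation."""
    centers = [T[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(T - c, axis=1) for c in centers], axis=0)
        centers.append(T[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(T[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(n_clusters):
            if np.any(labels == c):
                new_centers[c] = T[labels == c].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def rmsecv_for_clustering(T, y, labels, n_clusters):
    """Leave-one-out RMSECV pooled over per-cluster OLS models y = T beta."""
    sq_err = []
    for c in range(n_clusters):
        Tc, yc = T[labels == c], y[labels == c]
        if len(yc) < T.shape[1] + 2:
            return np.inf                  # cluster too small to fit and validate
        for i in range(len(yc)):
            keep = np.arange(len(yc)) != i
            beta, *_ = np.linalg.lstsq(Tc[keep], yc[keep], rcond=None)
            sq_err.append((yc[i] - Tc[i] @ beta) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

def choose_cluster_number(T, y, candidates):
    """Return the candidate with the smallest RMSECV, plus all RMSECV values."""
    scores = {}
    for L in candidates:
        labels, _ = kmeans(T, L)
        scores[L] = rmsecv_for_clustering(T, y, labels, L)
    return min(scores, key=scores.get), scores

# Synthetic scores (assumption): two well-separated groups with different
# linear relations between score vector and concentration.
rng = np.random.default_rng(1)
T_a = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(20, 2))
T_b = rng.normal(loc=[-5.0, 0.0], scale=0.5, size=(20, 2))
T_m = np.vstack([T_a, T_b])
y_m = np.concatenate([T_a @ np.array([1.0, 2.0]), T_b @ np.array([-1.0, 0.5])])
y_m = y_m + rng.normal(scale=0.05, size=40)
K, rmsecv = choose_cluster_number(T_m, y_m, candidates=[1, 2, 3])
```

On data like this a single global model cannot fit both groups, so the cross-validation error drops sharply once the clustering separates them.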

Step 5.3: according to the cluster number $K$, use the k-means clustering algorithm to cluster the characteristic spectrum vectors of the data set $\{T^m, y^m\}$, dividing $\{T^m, y^m\}$ into $K$ sub-data sets, $k = 1, 2, \ldots, K$; according to the cluster number $K$, use the k-means clustering algorithm to cluster the pseudo-characteristic spectrum vectors of the slave-instrument data set, dividing it into $K$ sub-data sets, $k = 1, 2, \ldots, K$;

where the characteristic spectrum vectors and the pseudo-characteristic spectrum vectors are the row vectors of $T^m$ and of the slave pseudo-characteristic spectrum matrix, respectively; the $k$-th sub-data sets contain the $k$-th sub-characteristic spectrum matrix and the $k$-th sub-pseudo-characteristic spectrum matrix, respectively; and the $k$-th master sub-data set also contains the matrix formed by the substance concentration variable values of the corresponding samples;

Step 5.4: based on the OLS algorithm, establish the $k$-th least squares model on the $k$-th sub-data set, and calculate the $k$-th regression coefficient matrix $\beta_k$;
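Fitting one least squares model per cluster, as in step 5.4, can be sketched as below. The sketch assumes a model of the form $y_k = T_k \beta_k$ without an intercept and cluster labels numbered from 0; the function name `fit_cluster_models` and the toy data are hypothetical.

```python
import numpy as np

def fit_cluster_models(T, y, labels, n_clusters):
    """Fit one OLS model y = T beta per cluster; return the list of beta_k."""
    betas = []
    for k in range(n_clusters):
        mask = labels == k
        beta_k, *_ = np.linalg.lstsq(T[mask], y[mask], rcond=None)
        betas.append(beta_k)
    return betas

# Toy data (assumption): two clusters of three samples, two latent variables.
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [2.0, 0.0], [0.0, 2.0], [1.0, 3.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
y = np.array([2.0, 3.0, 5.0, -2.0, 1.0, 0.5])
betas = fit_cluster_models(T, y, labels, n_clusters=2)
```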

Step 5.5: calculate the $k$-th transformation matrix $M_k$ from the covariance matrices of the $k$-th sub-characteristic spectrum matrix and the $k$-th sub-pseudo-characteristic spectrum matrix;
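The exact expression for the transformation matrix is not reproduced in this text. Purely as an illustration of how a transformation can be built from two covariance matrices, the sketch below uses a CORAL-style whitening and recoloring map, $M = C_s^{-1/2} C_m^{1/2}$. This is a swapped-in, commonly used construction and an assumption, not necessarily the claimed formula; the function names and the random data are hypothetical.

```python
import numpy as np

def sym_power(C, power, eps=1e-10):
    """Power of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    vals = np.clip(vals, eps, None)        # guard against tiny or negative eigenvalues
    return (vecs * vals ** power) @ vecs.T

def covariance_alignment(T_slave_k, T_master_k):
    """CORAL-style map (assumption): whiten the slave scores, recolor to the master."""
    C_s = np.cov(T_slave_k, rowvar=False)
    C_m = np.cov(T_master_k, rowvar=False)
    return sym_power(C_s, -0.5) @ sym_power(C_m, 0.5)

# Stand-in sub-matrices for one cluster (assumption): 50 samples, 3 latent variables.
rng = np.random.default_rng(0)
T_master_k = rng.normal(size=(50, 3)) @ np.diag([2.0, 1.0, 0.5])
T_slave_k = rng.normal(size=(50, 3)) @ np.diag([1.0, 3.0, 1.5])
M_k = covariance_alignment(T_slave_k, T_master_k)
T_slave_corrected = T_slave_k @ M_k
```

With this construction the corrected slave scores have the same covariance matrix as the master scores of the same cluster; it does not shift the cluster means.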

Step 6: predict the substance concentration variable of the measured object set:

Step 6.1: collect the spectrum of each measured object in the measured object set with the infrared spectrum measurement slave instrument, and extract the spectral data by the same method as in step 1 to obtain the slave spectral matrix of the measured object set;

Step 6.2: calculate the pseudo-characteristic spectrum matrix of the measured object set under the infrared spectrum measurement slave instrument;

Step 6.3: according to the cluster number $K$, use the k-means clustering algorithm to divide the pseudo-characteristic spectrum data set of the measured object set into $K$ sub-data sets, $k = 1, 2, \ldots, K$, where the $k$-th sub-data set is the $k$-th sub-pseudo-characteristic spectrum matrix of the measured object set;

Step 6.4: use the $k$-th transformation matrix $M_k$ to transform and correct the $k$-th sub-pseudo-characteristic spectrum matrix of the measured object set, obtaining the $k$-th transformation-corrected sub-pseudo-characteristic spectrum matrix;

Step 6.5: calculate the matrix of predicted substance concentration variable values of the measured objects corresponding to the $k$-th transformation-corrected sub-pseudo-characteristic spectrum matrix, $k = 1, 2, \ldots, K$.
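Steps 6.4 and 6.5 can be sketched per cluster as below. The formulas for the correction and the prediction are not reproduced in this text, so the sketch assumes right-multiplication of the sub-pseudo-characteristic spectrum matrix by $M_k$ followed by $\beta_k$, assumes that cluster label $k$ on the measured object set corresponds to the $k$-th master-side model, and ignores any mean offsets. The function name `predict_by_cluster` and the toy values are hypothetical.

```python
import numpy as np

def predict_by_cluster(T_test, labels, M_list, beta_list):
    """Correct each cluster's pseudo-scores with M_k, then predict with beta_k."""
    y_hat = np.full(T_test.shape[0], np.nan)   # NaN marks samples with no matching cluster
    for k, (M_k, beta_k) in enumerate(zip(M_list, beta_list)):
        mask = labels == k
        corrected = T_test[mask] @ M_k         # assumed form of the transformation correction
        y_hat[mask] = corrected @ beta_k       # assumed form of the prediction
    return y_hat

# Toy inputs (assumption): 3 measured objects, 2 latent variables, K = 2.
T_test = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
labels = np.array([0, 1, 0])
M_list = [np.eye(2), 2.0 * np.eye(2)]
beta_list = [np.array([1.0, 1.0]), np.array([3.0, -1.0])]
y_hat = predict_by_cluster(T_test, labels, M_list, beta_list)
```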