CN106680238B - A Method for Analyzing Substance Content Based on Infrared Spectroscopy - Google Patents
- Publication number
- CN106680238B (application CN201710009518.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- target domain
- infrared spectrum
- standard
- spectral
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N21/00—Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
- G01N21/17—Systems in which incident light is modified in accordance with the properties of the material investigated
- G01N21/25—Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
- G01N21/31—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
- G01N21/35—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light
- G01N21/3563—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light for analysing solids; Preparation of samples therefor
Landscapes
- Physics & Mathematics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biochemistry (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Immunology (AREA)
- Pathology (AREA)
- Investigating Or Analysing Materials By Optical Means (AREA)
Abstract
The invention relates to a method for analyzing the content of substance components based on infrared spectroscopy. The method comprises: establishing a first regression model from source-domain infrared spectral data and the corresponding source-domain substance component content, and solving for the parameters of the first regression model; acquiring target-domain infrared spectral data, establishing a transfer model between the target-domain and source-domain infrared spectral data, and solving for the parameters of the transfer model; and, from the target-domain infrared spectral data and the transfer model, using the first regression model to obtain the corresponding target-domain substance component content. The method combines the feature-transfer approach of transfer learning with the partial least squares algorithm to map target-domain spectral features into the source-domain feature space, which removes redundant information, improves transfer accuracy, and substantially reduces the computation required by the transfer.
Description
Technical Field
The invention relates to the field of infrared spectral analysis, and in particular to a method for analyzing the content of substance components based on infrared spectroscopy.
Background
The content of substance components can be determined by infrared spectral analysis: an infrared spectrum is measured and analyzed, which supports both qualitative and quantitative analysis. In existing infrared spectral measurement, however, any change of measuring instrument or measurement conditions invalidates the original calibration model and makes the analysis results inaccurate, while rebuilding the model costs considerable time and money and makes the analysis inefficient.
Summary of the Invention
To address the low efficiency of rebuilding calibration models, the invention proposes a method for analyzing the content of substance components based on infrared spectroscopy, comprising the following steps:
S1: establishing a first regression model from source-domain infrared spectral data and the source-domain substance component content corresponding to that data, and solving for the parameters of the first regression model;
S2: acquiring target-domain infrared spectral data, establishing a transfer model between the target-domain infrared spectral data and the source-domain infrared spectral data, and solving for the parameters of the transfer model;
S3: from the target-domain infrared spectral data and the transfer model, using the first regression model to obtain the target-domain substance component content corresponding to the target-domain infrared spectral data.
Further, the first regression model is a partial least squares regression model, and step S1 comprises: performing feature extraction on the source-domain infrared spectral data to obtain first spectral features, establishing the partial least squares regression model from the first spectral features and the source-domain substance component content, and solving for the regression coefficients.
Further, the target-domain infrared spectral data comprise target-domain infrared spectral standard data and target-domain infrared spectral test data, and step S2 comprises: performing feature extraction on the target-domain infrared spectral standard data to obtain second standard spectral features; and establishing the transfer model from the first spectral features and the second standard spectral features to obtain a transfer matrix.
Further, step S3 comprises: obtaining third spectral features from the target-domain infrared spectral test data, and substituting the third spectral features and the transfer model into the partial least squares regression model to obtain the target-domain substance component content.
Further, the step of performing feature extraction on the source-domain infrared spectral data to obtain the first spectral features comprises: centering the source-domain infrared spectral data and the source-domain substance component content, and establishing a least squares regression model from the centered source-domain infrared spectral data and source-domain substance component content to obtain the first spectral features.
Further, the target-domain standard substance component content is also acquired, and the step of performing feature extraction on the target-domain infrared spectral standard data to obtain the second standard spectral features comprises: centering the target-domain infrared spectral standard data and the target-domain standard substance component content, and establishing a partial least squares regression model from the centered target-domain infrared spectral standard data and target-domain standard substance component content to obtain the second standard spectral features.
Further, step S2 obtains second standard projection data and second standard loading data together with the second standard spectral features. In step S3, the step of obtaining the third spectral features from the target-domain infrared spectral test data comprises: centering the target-domain infrared spectral test data with the mean of the target-domain infrared spectral standard data, and then obtaining the third spectral features from the centered test data one after another by a recursion [formula image not reproduced], where i is greater than or equal to 1 and less than or equal to k, T_T_test denotes the third spectral features, and k is the number of third spectral features. The recursion uses the i-th component of the second standard projection data, the i-th residual term of the centered target-domain infrared spectral test data, and the i-th component of the second standard loading data.
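The recursion itself appears only as an image in the original, so the following is a minimal sketch that assumes the standard PLS form: each score is the current residual times the i-th projection (weight) vector, and the residual is then deflated with the i-th loading vector. The function name `extract_test_scores` and the toy matrices `W` and `P` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def extract_test_scores(X_test, x_mean_std, W, P):
    """Recursively compute k PLS scores for new (test) spectra.

    X_test     : (m, p) target-domain test spectra
    x_mean_std : (p,)   mean spectrum of the target-domain standard samples
    W, P       : (p, k) projection (weight) and loading matrices taken from
                 the PLS model fitted on the target-domain standards
    """
    E = X_test - x_mean_std                  # centre with the standard-set mean
    k = W.shape[1]
    T = np.zeros((X_test.shape[0], k))
    for i in range(k):
        T[:, i] = E @ W[:, i]                # i-th score from the current residual
        E = E - np.outer(T[:, i], P[:, i])   # deflate the residual with the i-th loading
    return T

# Toy example: 2 test spectra, 3 wavelengths, 2 components.
X_test = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
x_mean_std = np.array([1.0, 1.0, 1.0])
W = np.eye(3)[:, :2]   # hypothetical projection vectors
P = np.eye(3)[:, :2]   # hypothetical loading vectors
T_test = extract_test_scores(X_test, x_mean_std, W, P)
print(T_test)          # [[0. 1.] [3. 4.]]
```

Centering with the standard-set mean rather than the test-set mean keeps the test scores in the same coordinate system as the second standard spectral features.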
Further, an optimization problem is solved [formula image not reproduced], where B denotes the coefficients of the regression model built on the source-domain features, M denotes the transfer matrix from target-domain features to source-domain features, and W_S and W_T denote the projection matrices of the source domain and the target domain, respectively. The first spectral features are obtained from T_S = X_S * W_S and comprise k components indexed by i, where i is greater than or equal to 1 and less than or equal to k and k is the number of first spectral features. The regression coefficients B^T = [b_1, b_2, ..., b_k] are then calculated [formula image not reproduced], where y denotes the source-domain substance component content.
Further, the second standard spectral features are obtained from T_T = X_T * W_T and comprise k components indexed by i, where i is greater than or equal to 1 and less than or equal to k and k is the number of second spectral features.
Further, the transfer matrix M = [m_1, m_2, ..., m_k] is obtained from the second standard spectral features and the first spectral features [formula image not reproduced], where i is greater than or equal to 1 and less than or equal to k and k is the number of second standard spectral features; the source-domain features used in this calculation are selected from the first spectral features.
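The formula for the transfer matrix is not reproduced in this text. One plausible reading, shown here only as a sketch, is an ordinary least-squares fit that maps the target-domain standard features onto the matching source-domain features. The helper name `fit_transfer_matrix` and the synthetic feature matrices are assumptions made for illustration.

```python
import numpy as np

def fit_transfer_matrix(T_T_std, T_S_sel):
    """Least-squares M such that T_T_std @ M approximates T_S_sel.

    T_T_std : (n_std, k) features of the target-domain standard samples
    T_S_sel : (n_std, k) source-domain features selected for the same samples
    """
    M, *_ = np.linalg.lstsq(T_T_std, T_S_sel, rcond=None)
    return M

# Synthetic check: build source features from a known matrix and recover it.
rng = np.random.default_rng(1)
T_T_std = rng.normal(size=(10, 3))
M_true = np.array([[1.0, 0.2, 0.0],
                   [0.0, 0.9, 0.1],
                   [0.3, 0.0, 1.1]])
T_S_sel = T_T_std @ M_true
M = fit_transfer_matrix(T_T_std, T_S_sel)
print(np.round(M, 3))
```

Because M is only k by k, fitting it needs far fewer parameters than a transfer between full spectra, which is consistent with the reduced computation the description claims.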
With the technical solutions of the above embodiments, the method establishes a transfer relationship between the sample features of the source domain and the target domain. On the one hand, this removes redundant information and yields a simpler and more accurate transfer relationship, and therefore better predictions; on the other hand, it greatly reduces the computation for high-dimensional, small-sample datasets.
Brief Description of the Drawings
The features and advantages of the invention will be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:
Fig. 1 is a schematic flow chart of the method for analyzing substance component content based on infrared spectroscopy according to an embodiment of the invention;
Fig. 2 is a schematic flow chart of the method for analyzing substance component content based on infrared spectroscopy according to an embodiment of the invention;
Fig. 3 shows the master and slave spectra and the difference spectra of the corn dataset used to validate the analysis method of the invention;
Fig. 4 shows the master and slave spectra and the difference spectra of the tablet dataset used to validate the analysis method of the invention;
Fig. 5 shows the selection of the number of principal components of the PLS model for corn, used to validate the analysis method of the invention;
Fig. 6 shows the selection of the window size of the PDS model for corn, used to validate the analysis method of the invention;
Fig. 7 compares the true and predicted values of the moisture content of corn under each model;
Fig. 8 compares the true and predicted values of the oil content of corn under each model;
Fig. 9 compares the true and predicted values of the protein content of corn under each model;
Fig. 10 compares the true and predicted values of the starch content of corn under each model;
Fig. 11 compares the predicted and true values of the moisture content of corn before and after calibration transfer;
Fig. 12 compares the predicted and true values of the oil content of corn before and after calibration transfer;
Fig. 13 compares the predicted and true values of the protein content of corn before and after calibration transfer;
Fig. 14 compares the predicted and true values of the starch content of corn before and after calibration transfer;
Fig. 15 is a schematic diagram of the selection of the number of principal components of the PLS model for corn, used to validate the analysis method of the invention;
Fig. 16 is a schematic diagram of the selection of the window size of the PDS model for corn, used to validate the analysis method of the invention;
Fig. 17 compares the predicted and true values of the first active ingredient in the tablets under different models;
Fig. 18 compares the predicted and true values of the second active ingredient in the tablets under different models;
Fig. 19 compares the predicted and true values of the third active ingredient in the tablets under different models;
Fig. 20 compares the predicted and true values of the content of active ingredient 1 in the tablets before and after calibration transfer;
Fig. 21 compares the predicted and true values of the content of active ingredient 2 in the tablets before and after calibration transfer;
Fig. 22 compares the predicted and true values of the content of active ingredient 3 in the tablets before and after calibration transfer.
Detailed Description
To make the above objects, features and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments. Where no conflict arises, the embodiments of this application and the features within them may be combined with one another.
Many specific details are set forth below to aid a full understanding of the invention. The invention can, however, also be implemented in ways other than those described here, so its scope of protection is not limited by the specific embodiments disclosed below.
Embodiment 1
As shown in Fig. 1, the invention provides a method for analyzing the content of substance components based on infrared spectroscopy, comprising the following steps:
S101: establishing a first regression model from source-domain infrared spectral data and the source-domain substance component content corresponding to that data, and solving for the parameters of the first regression model. The first regression model is, for example, a partial least squares regression model: feature extraction is performed on the source-domain infrared spectral data to obtain first spectral features, the partial least squares regression model is established from the first spectral features and the source-domain substance component content, and the regression coefficients are solved for. Specifically, the step of obtaining the first spectral features comprises centering the source-domain infrared spectral data and the source-domain substance component content, and establishing a least squares regression model from the centered data to obtain the first spectral features. Centering means subtracting the mean of the source-domain infrared spectral data from that data and subtracting the mean of the source-domain substance component content from that content, which reduces the influence of offsets on model building.
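The centering operation described above can be sketched as follows. The function name `center` and the small arrays are illustrative assumptions; the means are returned because later steps (for example, restoring the prediction to its original scale) need them.

```python
import numpy as np

def center(X, y):
    """Subtract the column means from the spectra X and the mean from y.

    X : (n_samples, n_wavelengths) spectra
    y : (n_samples,) substance component content
    Returns the centred arrays together with the means that were removed.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    return X - x_mean, y - y_mean, x_mean, y_mean

# Hypothetical source-domain data: 3 samples, 3 wavelengths.
X_s = np.array([[1.0, 2.0, 3.0],
                [2.0, 4.0, 5.0],
                [3.0, 6.0, 10.0]])
y_s = np.array([10.0, 12.0, 17.0])

Xc, yc, x_mean, y_mean = center(X_s, y_s)
print(x_mean, y_mean)   # [2. 4. 6.] 13.0
```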
Specifically, an optimization problem is solved [formula image not reproduced], where B denotes the coefficients of the regression model built on the source-domain features, M denotes the transfer matrix from target-domain features to source-domain features, and W_S and W_T denote the projection matrices of the source domain and the target domain, respectively. The first spectral features are obtained from T_S = X_S * W_S and comprise k components indexed by i, where i is greater than or equal to 1 and less than or equal to k and k is the number of first spectral features. The regression coefficients B^T = [b_1, b_2, ..., b_k] are then calculated [formula image not reproduced], where y denotes the source-domain substance component content.
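As an illustration of how features of the form T_S = X_S * W_S and coefficients b_1, ..., b_k could be computed, here is a sketch of single-response PLS using the common NIPALS iteration. The patent text does not state which PLS variant is used, so this choice is an assumption. The function name `pls1_fit` is hypothetical, and the rotation matrix R = W (P^T W)^{-1} is used so that the scores can be written as a single product with the centered spectra.

```python
import numpy as np

def pls1_fit(Xc, yc, k):
    """PLS1 via NIPALS on centred data.

    Returns R (rotation, so that scores T = Xc @ R), P (loadings),
    T (scores) and b (regression coefficients of yc on the scores).
    """
    n, p = Xc.shape
    E, f = Xc.copy(), yc.copy()
    W = np.zeros((p, k))
    P = np.zeros((p, k))
    T = np.zeros((n, k))
    b = np.zeros(k)
    for i in range(k):
        w = E.T @ f
        w = w / np.linalg.norm(w)      # unit-length weight vector
        t = E @ w                      # score vector
        tt = t @ t
        p_i = E.T @ t / tt             # loading vector
        b[i] = (f @ t) / tt            # coefficient of y on this score
        E = E - np.outer(t, p_i)       # deflate the spectra
        f = f - b[i] * t               # deflate the response
        W[:, i], P[:, i], T[:, i] = w, p_i, t
    R = W @ np.linalg.inv(P.T @ W)
    return R, P, T, b

# Synthetic data: 20 samples, 50 wavelengths, 3 latent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)
Xc, yc = X - X.mean(axis=0), y - y.mean()

R, P, T, b = pls1_fit(Xc, yc, k=3)
print(b)
```

Because the NIPALS scores are mutually orthogonal, each b_i equals the ordinary least-squares coefficient of the centered y on the corresponding score.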
S102: acquiring target-domain infrared spectral data, establishing a transfer model between the target-domain infrared spectral data and the source-domain infrared spectral data, and solving for the parameters of the transfer model. The target-domain infrared spectral data comprise target-domain infrared spectral standard data and target-domain infrared spectral test data. Feature extraction is performed on the target-domain infrared spectral standard data to obtain second standard spectral features, and the transfer model is established from the first spectral features and the second standard spectral features to obtain a transfer matrix. To improve accuracy, a subset of the first spectral features may be selected and paired with the second standard spectral features to establish the transfer model. The selection is matched by substance concentration; for example, the calculation may use the data whose source-domain substance component content equals the target-domain standard substance concentration.
Specifically, the second standard spectral features are obtained from T_T = X_T * W_T and comprise k components indexed by i, where i is greater than or equal to 1 and less than or equal to k and k is the number of second spectral features.
The transfer matrix M = [m_1, m_2, ..., m_k] is obtained from the second standard spectral features and the first spectral features [formula image not reproduced], where i is greater than or equal to 1 and less than or equal to k and k is the number of second standard spectral features; the source-domain features used in this calculation are selected from the first spectral features.
S103: from the target-domain infrared spectral data and the transfer model, using the first regression model to obtain the target-domain substance component content corresponding to the target-domain infrared spectral data. Third spectral features are obtained from the target-domain infrared spectral test data, and the third spectral features and the transfer model are substituted into the partial least squares regression model to obtain the target-domain substance component content.
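Step S103 can be illustrated with a short sketch: the test features are mapped into the source feature space with the transfer matrix and the source regression coefficients are then applied. Adding the source-domain mean of y back at the end is an assumption that follows from the centering step; the function name `predict_target` and the numbers are hypothetical.

```python
import numpy as np

def predict_target(T_test, M, b, y_mean_source):
    """Predict target-domain content from third spectral features.

    T_test        : (m, k) features of the target-domain test samples
    M             : (k, k) transfer matrix (target features -> source features)
    b             : (k,)   regression coefficients of the source-domain model
    y_mean_source : float  mean removed from y when the source data were centred
    """
    T_mapped = T_test @ M                  # features expressed in the source space
    return T_mapped @ b + y_mean_source    # source model applied, mean restored

T_test = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
M = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.5, 1.0])
y_hat = predict_target(T_test, M, b, y_mean_source=10.0)
print(y_hat)   # [11. 12.]
```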
This embodiment also acquires the target-domain standard substance component content. The step of performing feature extraction on the target-domain infrared spectral standard data to obtain the second standard spectral features comprises: centering the target-domain infrared spectral standard data and the target-domain standard substance component content, and establishing a partial least squares regression model from the centered data to obtain the second standard spectral features. The centering is performed in the same way as for the source-domain infrared spectral data described above.
In this embodiment, step S102 obtains second standard projection data and second standard loading data together with the second standard spectral features. In step S103, the step of obtaining the third spectral features from the target-domain infrared spectral test data comprises: centering the target-domain infrared spectral test data with the mean of the target-domain infrared spectral standard data, and then obtaining the third spectral features from the centered test data one after another by a recursion [formula image not reproduced], where i is greater than or equal to 1 and less than or equal to k, T_T_test denotes the third spectral features, and k is the number of third spectral features. The recursion uses the i-th component of the second standard projection data, the i-th residual term of the centered target-domain infrared spectral test data, and the i-th component of the second standard loading data.
The method establishes a transfer relationship between the sample features of the source domain and the target domain. On the one hand, this removes redundant information and yields a simpler and more accurate transfer relationship, and therefore better predictions; on the other hand, it greatly reduces the computation for high-dimensional, small-sample datasets. In addition, only one parameter needs to be set, namely the number of latent variables of the partial least squares (PLS) algorithm, so the method is very simple to implement. Note that the term "infrared spectrum" as used herein may be understood to include near-infrared spectra as well as mid-infrared and far-infrared spectra.
Embodiment 2
The method of the present invention combines transfer learning with the PLS algorithm to form a calibration transfer algorithm (the CT_pls algorithm). The underlying idea comes from feature-based transfer learning: target-domain features are mapped into the source-domain feature space, so that the source-domain model can be used to process target-domain data. The method first uses the PLS algorithm to extract features from the source-domain and target-domain samples. It then builds a multivariate calibration model on the source-domain features and a linear transfer model between the source-domain and target-domain features. Finally, unknown target-domain samples undergo the same feature extraction and transfer, and the source-domain calibration model predicts from the transferred features.
Suppose there are a source-domain data set $\{X_S, y\}$ and a target-domain data set $\{X_T, y\}$, where $X_S$ and $X_T$ are measured by the master spectrometer and the slave spectrometer, respectively. Establishing the calibration transfer model between the source domain and the target domain amounts to solving the optimization problem of formula (3.1).
In formula (3.1), $B$ denotes the coefficients of the regression model built on source-domain features, $M$ denotes the transfer matrix from target-domain features to source-domain features, and $W_S$ and $W_T$ denote the projection matrices of the source domain and the target domain, respectively. The partial least squares algorithm is chosen here as the core algorithm: $W_S$ and $W_T$ are obtained by building PLS models on $\{X_S, y\}$ and $\{X_T, y\}$, respectively, and the source-domain features $T_S$ and target-domain features $T_T$ are obtained from formula (3.2).
After the source-domain features $T_S$ are obtained, a multivariate calibration model is built from the source-domain feature data $\{T_S, y_S\}$, and the regression coefficients $B^T = [b_1, b_2, \ldots, b_k]$ are calculated, where $k$ is the number of principal features extracted.
For the source-domain model to predict target-domain data effectively, the spectral space must be transformed using the standard set. Formulas (3.4) and (3.5) show how the spectral features are transformed from the target domain to the source domain.
$T'_S \leftarrow T_T M$ (3.4)
Here $T'_S$ and $T_T$ are the features of the source-domain and target-domain standard sample sets, respectively. $T'_S$ is drawn from $T_S$ and is used to calculate the transfer matrix $M = [m_1, m_2, \ldots, m_k]$.
Once the source-domain calibration model and the transfer model between the source and target domains have been established, target-domain samples can be predicted as shown in formula (3.6).
$y_T = T_T M B$ (3.6)
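The chain from features to prediction in formula (3.6) can be sketched as follows. This is only an illustration: the patent fits both the regression coefficients B and the transfer matrix M with PLS models, whereas this sketch swaps in ordinary least squares (`numpy.linalg.lstsq`) as a simpler stand-in. The function names and the synthetic feature matrices are assumptions made for the example.

```python
import numpy as np

def fit_transfer(T_S_train, y_S_train, T_S_std, T_T_std):
    """Fit B (source calibration) and M (target-to-source transfer).

    Ordinary least squares is used here as a stand-in for the
    feature-level PLS models described in the text.
    """
    # B solves T_S_train @ B ~= y_S_train
    B, *_ = np.linalg.lstsq(T_S_train, y_S_train, rcond=None)
    # M solves T_T_std @ M ~= T_S_std, as in formula (3.4)
    M, *_ = np.linalg.lstsq(T_T_std, T_S_std, rcond=None)
    return B, M

def predict_target(T_T, M, B):
    """Formula (3.6): map target features to the source space, then regress."""
    return T_T @ M @ B

# Synthetic centered features with k = 3 latent variables.
rng = np.random.default_rng(0)
M_true = rng.normal(size=(3, 3))
B_true = np.array([0.5, -1.0, 2.0])

T_T_std = rng.normal(size=(32, 3))      # target-domain standard features
T_S_std = T_T_std @ M_true              # matching source-domain features
T_S_train = rng.normal(size=(64, 3))    # source-domain training features
y_S_train = T_S_train @ B_true          # source-domain reference values

B, M = fit_transfer(T_S_train, y_S_train, T_S_std, T_T_std)
T_T_test = rng.normal(size=(16, 3))     # unknown target-domain features
y_T_pred = predict_target(T_T_test, M, B)
```

Because the features are centered before fitting, no intercept term is included in either model.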
Specifically, as shown in Figure 2, the method of the present invention for analyzing substance content based on infrared spectroscopy includes: obtaining the source-domain training set, i.e. the source-domain infrared spectrum data and the source-domain substance content; obtaining the target-domain standard set, i.e. the target-domain infrared spectrum standard data and the target-domain standard substance content; obtaining the target-domain test set, i.e. the target-domain infrared spectrum test data and the target-domain test substance content; centering the source-domain data and centering the target-domain data; extracting the first spectral features from the source-domain data with a PLS model to form a combined feature data set, and selecting from it the features that correspond to the standard set (i.e. that have matching substance content); building the first regression model from the combined feature data set with the PLS algorithm; extracting the second standard spectral features from the target-domain standard set with PLS; computing, with a PLS model, the transfer matrix between the selected first spectral features and the second standard spectral features; extracting the third spectral features from the target-domain test data with the PLS model; and substituting the third spectral features and the transfer matrix into the first regression model to obtain the substance content corresponding to the target-domain test data. The implementation comprises the following steps: data preprocessing, feature extraction, building the source-domain calibration model, calculating the transfer relationship, and predicting unknown target-domain data.
Specifically, the method can be implemented by a processor circuit loaded with a computer program. The flow of the computer program is as follows:
The method of this embodiment uses partial least squares (PLS) regression analysis, which provides a way to build many-to-many linear regression models. When both groups of variables are large in number and multicollinear, and the number of observations (the sample size) is small, a model built by PLS regression has advantages that classical regression analysis and similar methods lack. When two sets of measurements of the same item come from different instruments or measurement conditions, the two sets of samples differ but are correlated. Samples from the new space can therefore be transferred to the reference space, and the reference-space model can be used directly to predict the new samples. The original model is reused, which reduces the cost of modeling.
1. Establishing a PLS regression model based on spectral features
First, a partial least squares regression model is built on the infrared spectral data and the corresponding component concentrations in order to obtain the spectral features; the number of spectral features is selected by cross-validation. A PLS model is then built again on the spectral features and the corresponding component concentrations in order to calculate the regression coefficients of the model; the number of principal features (spectral features) is again selected by cross-validation. Building the PLS model twice instead of once directly on the infrared spectral data makes essentially no difference to prediction accuracy, and the regression coefficients calculated from the spectral features can be applied directly to the transferred spectral features of the target domain.
2. Transfer learning between spectral features
The conditional or marginal probability distributions of infrared spectral data measured by different spectrometers may differ, so the original multivariate calibration model cannot accurately predict the infrared spectral data of the target domain and often shows a large prediction bias. Because rebuilding the model is expensive, the spectral features of the target domain are instead transferred to the source domain, which narrows the distribution gap between the two domains. Features are first extracted from the standard spectral samples of the source and target domains; a feature-to-feature PLS model is then built and the transfer matrix is calculated. Multiplying the target-domain features by the transfer matrix completes the feature transfer.
3. Predicting the target-domain spectral data
Once the target-domain features have been transferred to the source-domain feature space, the feature-based regression model of the source domain can be used directly to predict from the target-domain features. This avoids rebuilding a model for the target-domain samples and greatly reduces the cost of modeling.
The analysis method of the present invention was applied to a corn data set and a tablet data set, as follows:
1. Corn data set
The corn data set has 80 samples with reference values for four constituents: moisture, oil, protein, and starch. It is available from http://www.eigenvector.com/Data/Data_sets.html. The infrared spectra were measured on three different instruments (m5, mp5, and mp6) over the wavelength range 1100-2498 nm at 2 nm intervals, giving 700 channels. In this experiment the spectra measured on m5 are taken as the master spectra and serve as the source-domain data set $X_S$. Because the spectra measured on mp6 differ more from those of m5, mp6 is chosen as the slave instrument and its data serve as the target-domain data set $X_T$. The spectra are shown in Figure 3, where panels (A), (B), and (C) show the master spectra, the slave spectra, and the difference between master and slave spectra, respectively.
In the experiment, the Kennard-Stone (KS) algorithm is used to partition the data. First, 20% of the samples (16 each) are drawn from the source-domain and target-domain data sets as test samples; the target-domain test samples are used to test the calibration transfer model. The remaining 80% (64 each) serve as training samples. The source-domain training samples are used to build the reference model, which predicts the transferred target-domain samples; the target-domain training samples are used to build a target-domain baseline model against which the transfer models are compared. The KS algorithm is then applied to the source-domain and target-domain training samples to draw a standard sample set, which is used to establish the transfer relationship between source-domain and target-domain samples. The number of standard samples strongly affects the transfer relationship: too few samples carry too little information, while too many tend to introduce redundant information, and in neither case is an accurate transfer relationship obtained. As a compromise, this experiment uses the KS algorithm to draw 50% of the training samples (32 each) from the source and target domains as the standard sample sets.
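The sample selection used above can be sketched with a plain-Python version of the Kennard-Stone procedure. This is a minimal sketch assuming the common formulation (start from the two mutually farthest samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away, using Euclidean distance). The function name `kennard_stone` and the toy data are illustrative, not taken from the patent.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select samples chosen by Kennard-Stone."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")

    # Pairwise Euclidean distances between all samples.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed the selection with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Add the sample with the largest distance to its nearest selected sample.
    while len(selected) < n_select:
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy one-dimensional "spectra": the extremes are picked first,
# then the point that best fills the largest gap.
toy = [(0.0,), (1.0,), (2.0,), (10.0,)]
chosen = kennard_stone(toy, 3)
```

For the corn data the same call would be made with the desired set size, for example 16 of 80 samples for the test set and 32 of the 64 training samples for the standard set.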
2. Tablet data set
The "Shootout" data set released at the International Diffuse Reflectance Conference (IDRC) in 2002 contains infrared spectra of tablet samples measured on two spectrometers over the wavelength range 600-1898 nm at 2 nm intervals. The two sets of spectra serve as source-domain and target-domain data; each has 650 variables and is used to analyze the content of three active ingredients in the tablets. The samples are divided into a source-domain calibration set and a target-domain calibration set of 155 samples each, and a source-domain test set and a target-domain test set of 460 samples each. The KS algorithm draws 50% of the samples (78 each) from the source-domain and target-domain calibration sets as the standard sets. The infrared spectra of the tablets are shown in Figure 4, where Figure 4(A) shows the master spectra, Figure 4(B) the slave spectra, and Figure 4(C) the difference between master and slave spectra. Figure 4(C) shows that the master and slave spectra differ in certain wavenumber ranges, with larger fluctuations at the front end, while the differences elsewhere are small. This suggests that noise is more easily introduced at the two ends of the spectrum. Because the difference between the master and slave spectra is not large, prediction performance is not expected to change much before and after model transfer.
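The variable counts quoted for the two data sets follow directly from the wavelength grids, which a few lines of Python can confirm. The helper name `channel_count` is illustrative, and rounding 50% of 155 samples up to 78 is an assumption about how the standard-set size was obtained.

```python
import math

def channel_count(start_nm, stop_nm, step_nm):
    """Number of points on an inclusive, evenly spaced wavelength grid."""
    return (stop_nm - start_nm) // step_nm + 1

corn_channels = channel_count(1100, 2498, 2)    # corn spectra
tablet_channels = channel_count(600, 1898, 2)   # tablet spectra

# Half of the 155 calibration samples, rounded up (assumed rounding rule).
tablet_standard_size = math.ceil(155 * 0.5)
```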
The specific procedure is as follows:
1. Data preprocessing
Before the model is trained, the data are preprocessed by centering, which avoids the bias caused by large differences in magnitude.
2. Parameter selection
The choice of model parameters can strongly affect model performance, and choosing the best parameter lets the model reach its best performance. For the PLS algorithm, for example, choosing the best number of principal components gives the best prediction results. In the experiments, SBC (slope and bias correction), MSC (multiplicative scatter correction), PDS (piecewise direct standardization), and CT_pls all use the PLS algorithm to build the multivariate calibration model of the master spectral data. Once the number of standard samples has been fixed, SBC and CT_pls therefore have only one parameter to set, the number of principal components, while PDS also requires a window size. Here, 10-fold cross-validation is used to select the number of principal components of the PLS algorithm: the number of principal components is varied from 1 to 5 in steps of 1, the cross-validation error (RMSECV) is calculated for each value, and the value with the smallest RMSECV is taken as the best number of principal components. For the PDS algorithm, because the standard data set contains few samples, 5-fold cross-validation is used when building the PLS sub-model for each window. The window size is varied from 3 to 20 in steps of 2 (the window size must be an odd number not less than 3), the RMSECV is calculated for each window size, and the window with the smallest RMSECV is taken as the best window.
3. Model evaluation
In the experiments, the root mean square error (RMSE) is used as the criterion for parameter selection and model evaluation. RMSE is calculated as in formula (3.11).
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \quad (3.11)$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the reference value (true value or comparison value), and $n$ is the number of test samples.
RMSEC denotes the training error on the calibration set, RMSEP the prediction error on the test set, and RMSECV the cross-validation error. For the cross-validation error of the PLS algorithm, the reference value is the true value. For the cross-validation error used to select the PDS window size, the reference value is the predicted value of the master-spectrum standard set.
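The cross-validated choice of the number of principal components described above can be sketched as follows. This is a minimal sketch, assuming a NIPALS-style PLS1 model and a random fold assignment; the function names `pls1_fit`, `rmsecv`, and `select_components`, the fixed seed, and the synthetic data are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def pls1_fit(X, y, k):
    """Fit a k-component PLS1 model; return (x_mean, y_mean, beta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(k):
        w = E.T @ f
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # nothing left in y to explain
            break
        w = w / norm
        t = E @ w
        tt = t @ t
        p = E.T @ t / tt
        q_i = (f @ t) / tt
        E = E - np.outer(t, p)
        f = f - q_i * t
        W.append(w)
        P.append(p)
        q.append(q_i)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P = np.column_stack(W), np.column_stack(P)
    beta = W @ np.linalg.solve(P.T @ W, np.array(q))
    return x_mean, y_mean, beta

def rmsecv(X, y, k, n_folds=10, seed=0):
    """Root mean square error of cross-validation for k components."""
    idx = np.random.default_rng(seed).permutation(len(y))
    squared_error = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        x_mean, y_mean, beta = pls1_fit(X[train], y[train], k)
        pred = (X[fold] - x_mean) @ beta + y_mean
        squared_error += np.sum((pred - y[fold]) ** 2)
    return np.sqrt(squared_error / len(y))

def select_components(X, y, candidates=(1, 2, 3, 4, 5), n_folds=10):
    """Pick the candidate with the smallest RMSECV."""
    errors = [rmsecv(X, y, k, n_folds) for k in candidates]
    return candidates[int(np.argmin(errors))], errors

# Synthetic stand-in for 64 training spectra.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(64, 20))
y_demo = X_demo[:, :3] @ np.array([1.0, 2.0, 3.0]) + 0.05 * rng.normal(size=64)
best_k, errors = select_components(X_demo, y_demo)

# Candidate PDS window sizes: odd values from 3 up to 20 in steps of 2.
pds_windows = list(range(3, 21, 2))
```

The same loop structure would apply to the PDS window size, with `n_folds=5` and the window candidates in place of the component counts.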
To compare more directly how the proposed CT_pls model differs in prediction performance from other classical models and from the PLS baseline model, formula (3.12) is used to calculate the rate by which CT_pls improves or degrades performance relative to the other algorithms.
In formula (3.12), $RMSEP_{CT\_pls}$ denotes the prediction error of the CT_pls algorithm, and the other term denotes the prediction error of the algorithm it is compared with.
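The two evaluation quantities can be sketched in Python. The RMSE follows the standard definition. The improvement rate is written here as the relative reduction in RMSEP with respect to the comparison algorithm, which is an assumption about the form of formula (3.12), since the formula itself is not reproduced in this text. The function names are illustrative, and the computed rate is not a value taken from the patent's tables.

```python
import math

def rmse(predicted, reference):
    """Root mean square error between predictions and reference values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(total / len(reference))

def improvement_rate(rmsep_ct_pls, rmsep_other):
    """Relative RMSEP reduction of CT_pls, in percent (assumed form of 3.12).

    Positive values mean CT_pls has the smaller error; negative values
    mean its error is larger than that of the comparison algorithm.
    """
    return (rmsep_other - rmsep_ct_pls) / rmsep_other * 100.0

# RMSEP values quoted in the text for corn moisture at N = 32:
# PLS baseline 0.1916, CT_pls 0.1831.
rate_vs_pls = improvement_rate(0.1831, 0.1916)
```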
In addition, the rank sum test is used to check whether CT_pls differs significantly from the other algorithms. The wilcoxon function in Python's scipy package is used to compute the p-value between the predicted values directly: if p > 0.05, there is no significant difference between the two algorithms; otherwise there is a significant difference.
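A minimal sketch of this test is shown below. Note that `scipy.stats.wilcoxon` implements the Wilcoxon signed-rank test for paired samples, which suits this comparison because both algorithms predict the same test samples; scipy's unpaired rank-sum test is the separate function `scipy.stats.ranksums`. The prediction vectors here are synthetic placeholders, not results from the patent.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic predictions from two algorithms on the same 16 test samples.
rng = np.random.default_rng(0)
y_true = rng.normal(10.0, 1.0, size=16)
pred_ct_pls = y_true + rng.normal(0.0, 0.2, size=16)
pred_pds = y_true + rng.normal(0.0, 0.2, size=16)

# Paired test on the two sets of predicted values.
statistic, p_value = wilcoxon(pred_ct_pls, pred_pds)

# Decision rule from the text: p > 0.05 means no significant difference.
significantly_different = p_value <= 0.05
```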
The corn data set and the tablet data set are used for the experiments. SBC, PDS, and CT_pls all use the PLS algorithm as the core algorithm: a multivariate calibration model built on the source-domain data serves as the reference model and is used to predict the transferred target-domain samples. In addition, a multivariate calibration model is built with PLS on the target-domain training samples as a baseline for the prediction performance of the calibration transfer models, which allows a more complete and accurate evaluation of the SBC, PDS, and CT_pls calibration transfer methods. The experimental results consist of the following parts:
(1) The selection of the number of principal components for the PLS algorithm, and the resulting RMSEC, RMSEP, and RMSECV.
(2) The selection of the window size for the PDS algorithm.
(3) The variation of RMSEP for the three transfer algorithms SBC, PDS, and CT_pls with the number of standard samples.
(4) A comparison of the prediction ability of SBC, PDS, and CT_pls at a fixed number of standard samples.
(5) A comparison of model prediction ability before and after calibration transfer, and the parameter settings.
Experiments were carried out on the corn data set. Table 3.1 shows the training error, cross-validation error, prediction error, and number of principal components of the PLS models for moisture, oil, protein, and starch content built directly on the corn target-domain training set.
Table 3.1 Errors and parameters of the PLS models for the corn target-domain data set
Table 3.1 shows that RMSEC, RMSECV, and RMSEP do not differ greatly for any corn constituent, which indicates that there is no overfitting; the small RMSEP indicates that there is no underfitting either. Together these show that the chosen numbers of principal components are reasonable. The number of principal components of the PLS algorithm is selected by 10-fold cross-validation. Figures 5(A), (B), (C), and (D) show how the RMSECV of the PLS models for moisture, oil, protein, and starch content varies with the number of principal components. The minimum RMSECV is reached at 5 principal components in all four cases, so the best number of principal components is 5 for each constituent. With the maximum set to 5, the RMSECV of each constituent has not yet converged as the number of principal components increases, so a global minimum is not reached. However, too many principal components lead to overfitting and increase the complexity of the PLS model, and repeated experiments showed that a maximum of 5 principal components gives satisfactory results.
The PDS algorithm requires a suitable window size, which is selected here by 5-fold cross-validation. Figures 6(A), (B), (C), and (D) show the window-size selection for the PDS models of moisture, oil, protein, and starch content; the window size with the smallest RMSECV is taken as the best window of the PDS model. Figure 6 shows that the best window size is 13 for the moisture model and 3 for the models of the other three constituents.
The prediction performance of SBC, PDS, and CT_pls depends on the number of standard samples: the number of standard samples affects the transfer relationship, and the transfer relationship directly affects prediction accuracy. Tables 3.2 to 3.5 show the prediction errors of the different models for moisture, oil, protein, and starch content in corn at different numbers of standard samples, where N in the first row is the number of standard samples. The PLS model here is the baseline built directly on the target-domain training data. It predicts the target-domain test samples without any transfer, so its prediction error does not depend on the number of standard samples.
Table 3.2 Prediction errors for moisture content in corn
Table 3.2 shows that the minimum prediction error of SBC is 0.3081, far from the 0.1916 of the PLS method. Because SBC is suited only to systematic errors, this indicates that SBC is not suitable for predicting moisture. For PDS, the minimum prediction error is obtained at N=45 (RMSEP=0.1767); for CT_pls, it is obtained at N=13 (RMSEP=0.1678). Both algorithms also achieve comparatively small prediction errors at N=32, namely 0.1860 and 0.1831, respectively. Neither too many nor too few standard samples therefore yields the best transfer relationship.
Table 3.3 shows that SBC reaches its minimum prediction error of 0.0668 at N=52, but its prediction errors at all other numbers of standard samples except N=26 are close to this value and close to the PLS prediction error of 0.0624, which indicates that SBC is suitable for predicting oil. The minimum prediction error of PDS is obtained at N=52 (RMSEP=0.0787) and that of CT_pls at N=45 (RMSEP=0.0723). Both reach comparatively small values at N=32, namely 0.0832 and 0.0740, and the RMSEP of both changes little beyond N=32.
Table 3.3 Prediction errors for oil content in corn
Table 3.4 Prediction errors for protein content in corn
Table 3.4 shows that SBC reaches its minimum prediction error at N=39 (RMSEP=0.2552) and that its RMSEP changes little over the whole range of standard sample numbers. PDS reaches its minimum of 0.2296 at N=45, and its RMSEP is relatively stable beyond N=32. CT_pls reaches its minimum prediction error of 0.2093 at N=45, and its RMSEP is relatively stable beyond N=26.
Table 3.5 Prediction errors for starch content in corn
Table 3.5 shows that the minimum prediction error of SBC is obtained at N=39 (RMSEP=0.5775) and that its RMSEP is relatively stable. PDS reaches its minimum at N=32 (RMSEP=0.4964), with comparatively small values of 0.5101 and 0.5270 at N=26 and N=39. CT_pls reaches its minimum prediction error of 0.4592 at N=52 and comparatively small values at N=26, 32, 39, and 45.
The analysis of Tables 3.2 to 3.5 leads to the following conclusions. First, the number of standard samples has little effect on the prediction ability of SBC, and the prediction ability of SBC is not stable. For oil, for example, SBC performs very well, slightly better than PDS and CT_pls and close to PLS; for moisture it performs poorly, far worse than PDS and CT_pls and with an error far from that of PLS. Second, the prediction errors of PDS and CT_pls depend strongly on the number of standard samples. In general the prediction error is large for N<32 and RMSEP falls as the number of samples grows, reaching its minimum or a comparatively small value at N=32; beyond that, RMSEP changes little or decreases as the number of samples grows. Choosing 32 standard samples (i.e. 50% of the training samples) therefore gives a good transfer result. Third, comparing the overall prediction performance of SBC, PDS, and CT_pls, CT_pls performs best, followed by PDS and then SBC.
To compare the prediction results of the calibration transfer algorithms more fairly and directly, 32 standard samples are used in every case to establish the transfer relationship between the source and target domains. Figures 7 to 10 plot the predicted values of each algorithm against the true values for each corn constituent. The closer a predicted value is to the true value, the closer the corresponding point lies to the line y=x, so the prediction performance of each algorithm can be judged visually from how tightly its points cluster around that line.
Because the prediction errors of the PDS and CT_pls models differ only slightly, Figures 7 to 10 do not allow the two algorithms to be ranked by how tightly the points cluster. Table 3.6 therefore lists the prediction errors of the predicted values shown in Figures 7 to 10. Tables 3.7 to 3.10 give the rate by which CT_pls improves or degrades the prediction error relative to PLS, SBC, and PDS, together with the p-values of the rank sum tests between them.
Table 3.6 Prediction errors of each component concentration in the corn dataset under different models
Table 3.7 Improvement rate of the CT_pls algorithm over the other algorithms for moisture content in corn, and p-values of the rank-sum test
Table 3.8 Improvement rate of the CT_pls algorithm over the other algorithms for oil content in corn, and p-values of the rank-sum test
Table 3.9 Improvement rate of the CT_pls algorithm over the other algorithms for protein content in corn, and p-values of the rank-sum test
Table 3.10 Improvement rate of the CT_pls algorithm over the other algorithms for starch content in corn, and p-values of the rank-sum test
Tables 3.6 to 3.10 further show that, among the three transfer algorithms SBC, PDS, and CT_pls, CT_pls has the best predictive performance, PDS is second, and SBC is the worst. Moreover, because all p-values in Tables 3.7 to 3.10 are greater than 0.05, there is no significant difference between the CT_pls algorithm and the other algorithms.
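As a non-limiting illustration, the two quantities reported in Tables 3.7 to 3.10 can be sketched as follows. Two points are assumptions rather than statements of the embodiment: the improvement rate is taken to be the relative reduction in RMSEP, (RMSEP_other − RMSEP_CT_pls)/RMSEP_other × 100%, and the rank-sum test is applied to the per-sample absolute prediction errors of two models. The test below is the two-sided Wilcoxon rank-sum test with a normal approximation and no tie correction; all function names and sample values are hypothetical.

```python
import math

def improvement_rate(rmsep_other, rmsep_ct):
    """Relative RMSEP reduction in percent (assumed definition); negative means a decline."""
    return 100.0 * (rmsep_other - rmsep_ct) / rmsep_other

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test, normal approximation, no tie correction."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))     # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical absolute prediction errors of two models on the same test set.
errors_ct = [0.01, 0.02, 0.03, 0.04, 0.05]
errors_other = [0.06, 0.07, 0.08, 0.09, 0.10]
z, p = rank_sum_test(errors_ct, errors_other)
significant = p < 0.05   # the 0.05 threshold is the one used in the text
```

In practice a library routine such as `scipy.stats.ranksums` could be used in place of the hand-written `rank_sum_test`.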
Finally, the source domain model is used directly to predict the target domain test samples that have not been transferred, and the result is compared with the prediction obtained using the CT_pls algorithm, so that the transfer ability of the CT_pls model can be evaluated intuitively. Figures 11 to 14 compare the predicted values with the true values both for the model without calibration transfer and for the model with calibration transfer by the CT_pls algorithm.
In Figures 11 to 14, the dots mark the relationship between the true and predicted values of the target domain test samples without calibration transfer, and the five-pointed stars mark the relationship between the predicted and true values in the target domain after calibration transfer with the CT_pls algorithm. As Figures 11 to 14 show, the dark dots all deviate severely from the line y=x, whereas the stars cluster near it. This indicates that predicting target domain data directly with the source domain model produces large deviations, which are introduced by the different measuring instruments. After calibration transfer with the CT_pls algorithm, the deviation between the source domain data and the target domain data is greatly reduced, so the source domain model can be applied directly to the transferred target domain data and achieves good prediction results.
Experiments were then carried out on the tablet dataset. Table 3.11 lists the training error, cross-validation error, prediction error, and number of principal components of the PLS models for the contents of the three active ingredients, built directly on the target domain training set of the tablets.
Table 3.11 Errors and parameters of the PLS models for the target domain dataset of the tablets
As Table 3.11 shows, the RMSEC, RMSECV, and RMSEP of each ingredient in the tablets are of the same order of magnitude, indicating that no overfitting occurred, and the RMSEP is small, indicating that no underfitting occurred either. The number of principal components was therefore chosen reasonably. The embodiment of the present invention selects the number of principal components of the PLS algorithm by 10-fold cross-validation. Figures 15(A), (B), and (C) show how the RMSECV of the PLS models for the three active ingredients in the tablets varies with the number of principal components. RMSECV reaches its minimum at 3, 2, and 5 principal components respectively, so the optimal numbers of principal components for the PLS models of the three ingredients are 3, 2, and 5 respectively.
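As a non-limiting illustration, selecting the number of principal components by the minimum RMSECV under k-fold cross-validation can be sketched as follows. The sketch assumes a single-response PLS model fitted with the NIPALS procedure, NumPy as the numerical library, and synthetic data; the function names (`pls1_fit`, `rmsecv`, `select_n_components`) are hypothetical and do not come from the embodiment.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS); return (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        c = (yk @ t) / tt              # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def rmsecv(X, y, n_components, n_folds=10, seed=0):
    """Root mean square error of cross-validation for a given component count."""
    rng = np.random.default_rng(seed)  # fixed seed: same folds for every count
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        coef, xm, ym = pls1_fit(X[train], y[train], n_components)
        pred = (X[test_idx] - xm) @ coef + ym
        sq_err += np.sum((pred - y[test_idx]) ** 2)
    return np.sqrt(sq_err / len(y))

def select_n_components(X, y, max_components, n_folds=10):
    """Return the component count with minimum RMSECV and all RMSECV values."""
    scores = [rmsecv(X, y, k, n_folds) for k in range(1, max_components + 1)]
    return int(np.argmin(scores)) + 1, scores

# Synthetic "spectra" driven by three hidden factors (illustrative only).
rng = np.random.default_rng(1)
T = rng.normal(size=(60, 3))
X = T @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(60, 20))
y = T @ np.array([1.0, -0.5, 0.3]) + 0.01 * rng.normal(size=60)
best_k, scores = select_n_components(X, y, max_components=6)
```

Plotting `scores` against the component count would give a curve of the kind described for Figure 15, with `best_k` at its minimum.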
For the PDS algorithm, the embodiment of the present invention selects the window size by 5-fold cross-validation. Figures 16(A), (B), and (C) show the window size selection process for the PDS models of the three active ingredients of the tablets. As Figure 16 shows, the optimal window size is 19 for the PDS model of the first active ingredient, and 3 and 13 respectively for the PDS models of the other two active ingredients.
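As a non-limiting illustration, window size selection for piecewise direct standardization (PDS) by 5-fold cross-validation can be sketched as follows. Several simplifications are assumptions and may differ from the embodiment: each local regression uses ordinary least squares (PDS implementations often use PCR or PLS instead), the cross-validation criterion is the error between transferred slave spectra and master spectra rather than a concentration prediction error, and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def pds_fit(slave, master, window):
    """Return (F, b) such that master is approximated by slave @ F + b.

    Column j of F is non-zero only inside a window of `window` channels
    centred on channel j (truncated at the spectrum edges).
    """
    m = slave.shape[1]
    half = window // 2
    s_mean, m_mean = slave.mean(axis=0), master.mean(axis=0)
    S, M = slave - s_mean, master - m_mean
    F = np.zeros((m, m))
    for j in range(m):
        lo, hi = max(0, j - half), min(m, j + half + 1)
        coef, *_ = np.linalg.lstsq(S[:, lo:hi], M[:, j], rcond=None)
        F[lo:hi, j] = coef
    b = m_mean - s_mean @ F
    return F, b

def pds_cv_error(slave, master, window, n_folds=5, seed=0):
    """Cross-validated RMS error between transferred slave and master spectra."""
    rng = np.random.default_rng(seed)  # fixed seed: same folds for every window
    folds = np.array_split(rng.permutation(len(slave)), n_folds)
    sq_err, count = 0.0, 0
    for test_idx in folds:
        train = np.ones(len(slave), dtype=bool)
        train[test_idx] = False
        F, b = pds_fit(slave[train], master[train], window)
        resid = slave[test_idx] @ F + b - master[test_idx]
        sq_err += np.sum(resid ** 2)
        count += resid.size
    return np.sqrt(sq_err / count)

def select_window(slave, master, windows, n_folds=5):
    """Return the window size with the smallest cross-validated error."""
    errors = {w: pds_cv_error(slave, master, w, n_folds) for w in windows}
    return min(errors, key=errors.get), errors

# Synthetic standard samples: the slave instrument shifts, scales and offsets
# the master spectra (illustrative only).
rng = np.random.default_rng(2)
channels = np.arange(40)
centres = np.array([10, 20, 30])
peaks = np.exp(-0.5 * ((channels[None, :] - centres[:, None]) / 3.0) ** 2)
master = rng.uniform(0.5, 1.5, size=(30, 3)) @ peaks
slave = 0.9 * np.roll(master, 1, axis=1) + 0.05
slave = slave + 0.001 * rng.normal(size=master.shape)
best_window, errors = select_window(slave, master, windows=[1, 3, 5, 7])
```

In this synthetic case a single-channel window cannot compensate for the one-channel shift, so a wider window gives a lower cross-validated error; real instruments may favour different sizes, as the values 19, 3, and 13 above indicate.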
For the SBC, PDS, and CT_pls algorithms, Tables 3.12 to 3.14 list the prediction errors of the contents of the three active ingredients in the tablets under different models for different numbers of standard samples. N in the first row denotes the number of standard samples, and the PLS model is the model built on the target domain training data.
Table 3.12 Prediction errors for the content of the first active ingredient in the tablets
Table 3.13 Prediction errors for the content of the second active ingredient in the tablets
Table 3.14 Prediction errors for the content of the third active ingredient in the tablets
As Tables 3.12 to 3.14 show, as the number of standard samples varies, the prediction error of the CT_pls algorithm is in most cases slightly lower than that of the PDS algorithm, and the prediction error of the SBC algorithm is usually higher than that of the PDS algorithm. The CT_pls algorithm therefore predicts better than the PDS algorithm, and the PDS algorithm better than the SBC algorithm. The prediction errors of both CT_pls and PDS are close to that of the PLS algorithm, indicating that both have good calibration transfer ability. In addition, the prediction error of the SBC algorithm for the second active ingredient is also close to that of the PLS algorithm, but its prediction error for the first active ingredient differs considerably from that of PLS, which further shows that the SBC algorithm is not broadly applicable.
Figures 17, 18, and 19 compare the true and predicted values of the four models PLS, SBC, PDS, and CT_pls for the three active ingredients at N=78 (that is, 50% of the samples in the training set).
Figures 17, 18, and 19 do not allow the algorithms to be ranked clearly by how tightly their marked points cluster. Table 3.15 therefore lists the prediction errors corresponding to the predicted values in Figures 17, 18, and 19. In addition, Tables 3.16, 3.17, and 3.18 give the rate of improvement or decline in prediction error of CT_pls relative to the PLS, SBC, and PDS algorithms, together with the p-values of the rank-sum tests between them.
Table 3.15 Prediction errors of the content of each active ingredient in the tablet dataset under different models
Table 3.16 Improvement rate of the CT_pls model over the other models for the content of active ingredient 1 in the tablets, and p-values of the rank-sum test
Table 3.17 Improvement rate of the CT_pls model over the other models for the content of active ingredient 2 in the tablets, and p-values of the rank-sum test
Table 3.18 Improvement rate of the CT_pls model over the other models for the content of active ingredient 3 in the tablets, and p-values of the rank-sum test
As Tables 3.15 to 3.17 show, for predicting the active ingredient content in the tablet data, the CT_pls algorithm achieves the best predictive performance, even better than the PLS model built directly on the target domain data. The predictive performance of the PDS algorithm is very close to that of the PLS model, and the SBC model performs worst. Moreover, every group of p-values is below 0.05, indicating a significant difference between the CT_pls algorithm and the other algorithms.
Figures 20, 21, and 22 compare the predicted values with the true values both for the model without calibration transfer and for the model with calibration transfer by the CT_pls algorithm.
As Figures 20, 21, and 22 show, the star-shaped marked points lie closer to the line y=x, and cluster more tightly around it, than the dot-shaped marked points, indicating that calibration transfer with the CT_pls algorithm yields better predictions. Compared with the corn dataset, however, the transfer effect on the tablet dataset is less pronounced, because the master and slave spectra of the tablet dataset do not differ greatly, as can be seen from Figure 4.
The analysis method of the present invention builds a PLS model on the target domain training samples as a baseline model against which the transfer abilities of the three calibration transfer models SBC, PDS, and CT_pls are compared. The experimental results show that the prediction errors of the PDS and CT_pls models are both close to that of PLS, indicating that both have good transfer ability, and that the prediction error of the CT_pls model is smaller than that of the PDS model. The SBC model does not always achieve good predictions, indicating that its stability and predictive ability fall well short of those of the PDS and CT_pls models. Overall, among the three transfer models, CT_pls has the best predictive performance, PDS is second, and SBC is the worst.
In the present invention, the terms "first", "second", and "third" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. The term "plurality" means two or more, unless expressly defined otherwise.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710009518.5A CN106680238B (en) | 2017-01-06 | 2017-01-06 | A Method for Analyzing Substance Content Based on Infrared Spectroscopy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710009518.5A CN106680238B (en) | 2017-01-06 | 2017-01-06 | A Method for Analyzing Substance Content Based on Infrared Spectroscopy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106680238A CN106680238A (en) | 2017-05-17 |
CN106680238B true CN106680238B (en) | 2019-09-06 |
Family
ID=58849170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710009518.5A Active CN106680238B (en) | 2017-01-06 | 2017-01-06 | A Method for Analyzing Substance Content Based on Infrared Spectroscopy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106680238B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107918718B (en) * | 2017-11-03 | 2020-05-22 | 东北大学秦皇岛分校 | Sample component content determination method based on online sequential extreme learning machine |
CN108152239A (en) * | 2017-12-13 | 2018-06-12 | 东北大学秦皇岛分校 | The sample composition content assaying method of feature based migration |
CN108509998A (en) * | 2018-03-30 | 2018-09-07 | 中国科学院半导体研究所 | A kind of transfer learning method differentiated on different devices for target |
CN110632024B (en) * | 2019-10-29 | 2022-06-24 | 五邑大学 | Quantitative analysis method, device and equipment based on infrared spectrum and storage medium |
CN111220566A (en) * | 2020-01-16 | 2020-06-02 | 东北大学秦皇岛分校 | Calibration and Migration Method of Infrared Spectrometer Based on OPLS and PDS |
CN111220565B (en) * | 2020-01-16 | 2022-07-29 | 东北大学秦皇岛分校 | CPLS-based infrared spectrum measuring instrument calibration migration method |
CN111563436B (en) * | 2020-04-28 | 2022-04-08 | 东北大学秦皇岛分校 | Infrared spectrum measuring instrument calibration migration method based on CT-CDD |
CN113343804B (en) * | 2021-05-26 | 2022-04-29 | 武汉大学 | An ensemble transfer learning classification method and system for single-view full-polarization SAR data |
CN118780402B (en) * | 2024-09-06 | 2025-03-18 | 山东大学 | Spectral data processing method, system, medium, product and equipment for integrated modeling |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2311390B (en) * | 1996-03-18 | 2000-03-22 | Ibm | Initial program load in data processing network |
CN100561194C (en) * | 2007-09-04 | 2009-11-18 | 厦门中药厂有限公司 | A kind of AOTF near infrared spectrometer that utilizes detects method of microorganism in the Chinese medicine |
-
2017
- 2017-01-06 CN CN201710009518.5A patent/CN106680238B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106680238A (en) | 2017-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106680238B (en) | A Method for Analyzing Substance Content Based on Infrared Spectroscopy | |
CN104949936B (en) | Sample component assay method based on optimization Partial Least-Squares Regression Model | |
CN103854305B (en) | A kind of Model Transfer method based on multi-scale Modeling | |
CN106596450B (en) | Incremental method based on infrared spectrum analysis material component content | |
CN106815643B (en) | Infrared spectroscopy Model Transfer method based on random forest transfer learning | |
CN108152239A (en) | The sample composition content assaying method of feature based migration | |
CN105842190B (en) | A kind of method for transferring near infrared model returned based on spectrum | |
Zhang et al. | A two-level strategy for standardization of near infrared spectra by multi-level simultaneous component analysis | |
CN111563436B (en) | Infrared spectrum measuring instrument calibration migration method based on CT-CDD | |
CN105095652B (en) | Sample component assay method based on stack limitation learning machine | |
CN104502306B (en) | Near-infrared spectrum wavelength system of selection based on variable importance | |
CN107290305B (en) | A kind of near infrared spectrum quantitative modeling method based on integrated study | |
Metz et al. | RoBoost-PLS2-R: an extension of RoBoost-PLSR method for multi-response | |
CN103398971A (en) | Chemometrics method for determining cetane number of diesel oil | |
Shao et al. | Measurement of yogurt internal quality through using Vis/NIR spectroscopy | |
CN102128805A (en) | Method and device for near infrared spectrum wavelength selection and quick quantitative analysis of fruit | |
CN115326746B (en) | Evaluation model construction method for near-infrared spectrometer measurement spectrum prediction | |
CN107918718B (en) | Sample component content determination method based on online sequential extreme learning machine | |
CN112651173B (en) | A non-destructive testing method and generalizable system for agricultural product quality based on cross-domain spectral information | |
CN116026780B (en) | On-line detection method and system for coating moisture absorption rate based on series strategy wavelength selection | |
CN105092509B (en) | A kind of sample component assay method of PCR-based ELM algorithms | |
CN109145403B (en) | A Modeling Method for Near Infrared Spectroscopy Based on Sample Consensus | |
CN111220565B (en) | CPLS-based infrared spectrum measuring instrument calibration migration method | |
CN110501294B (en) | Multivariate correction method based on information fusion | |
Hao et al. | Application of effective wavelength selection methods to determine total acidity of navel orange |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
Application publication date: 20170517 Assignee: QINHUANGDAO HUIEN BIOTECHNOLOGY Co.,Ltd. Assignor: NORTHEASTERN University AT QINHUANGDAO Contract record no.: X2024980039978 Denomination of invention: Method for Analyzing the Composition Content of Substances Based on Infrared Spectroscopy Granted publication date: 20190906 License type: Open License Record date: 20241219 Application publication date: 20170517 Assignee: Qinhuangdao Jindai Shaoguo Wine Co.,Ltd. Assignor: NORTHEASTERN University AT QINHUANGDAO Contract record no.: X2024980037523 Denomination of invention: Method for Analyzing the Composition Content of Substances Based on Infrared Spectroscopy Granted publication date: 20190906 License type: Open License Record date: 20241217 |