CN105203498A

CN105203498A - Near infrared spectrum variable selection method based on LASSO

Info

Publication number: CN105203498A
Application number: CN201510581659.5A
Authority: CN
Inventors: 卞希慧; 颜鼎荷; 李淑娟; 谭小耀; 李翔
Original assignee: Tianjin Polytechnic University
Current assignee: Tiangong University
Priority date: 2015-09-11
Filing date: 2015-09-11
Publication date: 2015-12-30

Abstract

A near-infrared spectral variable selection method based on LASSO. The procedure is as follows: collect the near-infrared spectra of the samples and measure the concentration vector of the target component by a conventional method; divide the data set into a training set and a prediction set using a chosen grouping scheme; determine the constraint value t of the LASSO method by cross-validation; compute the regression coefficients β with the least angle regression algorithm and retain the positions of the wavelength points where β is not 0; build a partial least squares regression model from the retained wavelengths of the training-set spectra and the concentration vector, and use it to predict the concentration of the target component in the prediction-set samples. The method extracts the informative wavelengths, simplifies the quantitative analysis model, and improves the prediction accuracy of the model. Compared with existing variable selection methods, it is fast and repeatable, and it achieves higher prediction accuracy with fewer variables. The invention is suitable for variable selection in near-infrared spectra of complex samples.

Description

A LASSO-based variable selection method for near-infrared spectroscopy

Technical field

The invention belongs to the field of non-destructive analysis within analytical chemistry, and specifically relates to a LASSO-based variable selection method for near-infrared spectra.

Background

Near-infrared spectroscopy is a rapidly developing technique in analytical chemistry. It offers high analysis efficiency, fast detection, and no need for sample pretreatment, and it is widely used in the food, petroleum, and other industries. Building a model between the near-infrared spectrum and the content or class of the measured substance allows direct qualitative and quantitative analysis of complex materials. An important problem in near-infrared modeling is the presence of redundant wavelengths. A typical near-infrared (NIR) spectrum contains hundreds to thousands of wavelength variables, some of which are unrelated to the property under study. These irrelevant wavelength points degrade the model and reduce its predictive ability. Variable selection has therefore always been an important part of spectral modeling.

Variable selection methods commonly used in spectral data analysis fall into two groups: methods based on intelligent optimization algorithms and methods based on statistics. The former mainly include simulated annealing (SA; see Swierenga H, de Groot PJ, de Weijer AP, Derksen MWJ, Buydens LMC, Improvement of PLS model transferability by robust wavelength selection, Chemom Intell Lab Syst, 1998, 41, 237-248), genetic algorithms (GA; see Leardi R, Gonzalez AL, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemom Intell Lab Syst, 1998, 41, 195-207), tabu search (TS; see Hageman JA, Streppel M, Wehrens R, Wavelength selection with Tabu Search, J Chemometrics, 2003, 17, 427-437), ant colony optimization (ACO; see Shamsipur M, Zare-Shahabadi V, Hemmateenejad B, Akhond M, Ant colony optimization: a powerful tool for wavelength selection, J Chemometrics, 2006, 20, 146-157), and particle swarm optimization (PSO; see Xu L, Jiang JH, Wu HL, Shen GL, Yu RQ, Variable-weighted PLS, Chemom Intell Lab Syst, 2007, 85, 140-143). These optimization methods require many parameters, have long search times, and easily become trapped in local optima. The latter mainly include uninformative variable elimination (UVE; see Centner V, Massart DL, de Noord OE, Jong S, Vandeginste BM, Sterna C, Elimination of uninformative variables for multivariate calibration, Anal Chem, 1996, 68, 3851-3858), Monte Carlo uninformative variable elimination (MCUVE; see Cai WS, Li YK, Shao XG, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemom Intell Lab Syst, 2008, 90, 188-194), and variable screening based on the randomization test (RT; see Xu H, Liu ZC, Cai WS, Shao XG, A wavelength selection method based on randomization test for near-infrared spectral analysis, Chemom Intell Lab Syst, 2009, 97, 189-193). UVE uses leave-one-out cross-validation to obtain the variable stability values, which requires many repeated computations, and it also introduces as many random noise variables as there are variables in the original spectrum; for large data sets the method is therefore computationally inefficient and slow. MCUVE and RT both build multiple models. A collection of models often extracts and expresses the complex relationship between the independent and dependent variables from different aspects and levels of the data more effectively than a single model, which supports a more reasonable and reliable selection of variables. However, because the modeling samples are drawn at random each time, the results of these two methods are somewhat unstable, and both are time-consuming for large data sets. It is therefore necessary to develop new, fast variable selection methods that improve the stability and prediction accuracy of the model.

Summary of the invention

The object of the invention is to address the problems above by providing a fast and stable variable selection method. The method minimizes the residual sum of squares subject to the sum of the absolute values of the regression coefficients being bounded by a constant. This constraint drives some regression coefficients exactly to zero; the corresponding variables are deleted, which accomplishes variable selection.

The specific steps are as follows:

(1) Collect m samples to be measured. Set the spectral parameters and acquire the near-infrared spectra of the samples to obtain the spectral matrix X. Measure the content of the target component in each sample by a conventional method to obtain the concentration vector y. Divide the data into a training set and a prediction set using a chosen grouping scheme; the training-set samples are used to build the model and optimize its parameters, and the prediction-set samples are used to test the predictive ability of the model.
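One possible grouping scheme is a random split. The sketch below uses only the Python standard library; the function name `split_samples`, the two-thirds training fraction, and the seed are illustrative assumptions, since the text does not prescribe a particular scheme or ratio.

```python
import random

def split_samples(m, train_fraction=2 / 3, seed=0):
    """Randomly partition sample indices 0..m-1 into training and prediction sets."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = round(m * train_fraction)       # assumed ratio, not from the text
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# Hypothetical usage with m = 269 samples.
train_idx, pred_idx = split_samples(269)
print(len(train_idx), len(pred_idx))
```

The rows of X and the entries of y would then be indexed with `train_idx` and `pred_idx` to form the two sets.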

(2) Determine the LASSO constraint value t by cross-validation. t controls the degree of shrinkage: the smaller t is, the stronger the shrinkage. Because of this constraint, some components of the regression coefficient vector β become 0, which achieves variable selection.

(3) Solve for the LASSO regression coefficients β with the least angle regression algorithm, and record the positions of the wavelength points whose regression coefficients are not 0.

$\hat{\beta} = \underset{\beta \in \mathbb{R}^{p}}{\arg\min} \left\{ (y - X\beta)^{T} (y - X\beta) \right\} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_{j}| \leq t$
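To see why the bound sets some coefficients exactly to 0, it helps to look at the special case of an orthonormal design, $X^{T} X = I$. This is an illustrative assumption only; NIR spectral matrices are generally far from orthonormal. Writing $\hat{\beta}^{o} = X^{T} y$ for the ordinary least squares estimate, the constrained problem then has the closed-form solution

$\hat{\beta}_{j} = \mathrm{sign}(\hat{\beta}_{j}^{o}) \left( |\hat{\beta}_{j}^{o}| - \lambda \right)^{+}, \quad j = 1, \ldots, p$

where $(\cdot)^{+}$ denotes the positive part and $\lambda \geq 0$ is the value for which $\sum_{j=1}^{p} |\hat{\beta}_{j}| = t$ (with $\lambda = 0$ when t is at least $\sum_{j=1}^{p} |\hat{\beta}_{j}^{o}|$). Every coefficient with $|\hat{\beta}_{j}^{o}| \leq \lambda$ is set to 0, and lowering t raises $\lambda$, so more variables are removed.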

The least angle regression algorithm proceeds as follows:

① Update the set of variables selected into the model (the active set) and compute the absolute correlations

$\hat{y}_{0} = 0; \quad \hat{c}_{kj} = x_{j}^{T} (y - \hat{y}_{k-1}); \quad \hat{C}_{k} = \max_{j} \{ |\hat{c}_{kj}| \}$

Update the active set A(k):

$A(k) = A(k-1) + \{\hat{j}\}; \quad A(0) = \varnothing; \quad \hat{j} = \underset{j \notin A(k-1)}{\arg\max} \{ |\hat{c}_{kj}| \}$

② Determine the least angle direction ($u_{k}$)

Let $X_{k} = (\ldots s_{j} x_{j} \ldots)_{j \in A(k)}$

where $s_{j} = \mathrm{sign}\{\hat{c}_{kj}\}, \quad \omega_{k} = A_{k} (X_{k}^{T} X_{k})^{-1} 1_{k}, \quad A_{k} = (1_{k}^{T} (X_{k}^{T} X_{k})^{-1} 1_{k})^{-0.5}$

$1_{k}$ is a vector whose components are all 1 and whose length equals |A|. Compute the least angle direction: $u_{k} = X_{k} \omega_{k}$

③ Compute the step size

For $j \notin A(k)$, let $a_{kj} = x_{j}^{T} u_{k}$

If |A| = d, the algorithm terminates.

Otherwise $\hat{\gamma}_{k} = \min_{j \notin A(k)}^{+} \left\{ (\hat{C}_{k} - \hat{c}_{kj}) / (A_{k} - a_{kj}), \; (\hat{C}_{k} + \hat{c}_{kj}) / (A_{k} + a_{kj}) \right\}$, where $\min^{+}$ means that the minimum is taken over the positive values only.

④ Predict the response

$\tilde{\gamma}_{k} = \underset{\gamma_{j} > 0, \, j \in A(k)}{\min} \{ \gamma_{j} \}$, where $\gamma_{j} = - \hat{\beta}_{j} / (s_{j} \omega_{kj}); \quad \tilde{\gamma}_{1} = \infty$

If $\tilde{\gamma}_{k} < \hat{\gamma}_{k}$, then $\hat{y}_{k} = \hat{y}_{k-1} + \tilde{\gamma}_{k} u_{k}$

For $j \in A$, $\hat{\beta}_{j} \leftarrow \hat{\beta}_{j} + \tilde{\gamma}_{k} \omega_{kj} s_{j}$; otherwise $\hat{\beta}_{j} = 0$

$A(k+1) = A(k) - \{\tilde{j}\}$, where $\tilde{j} = \underset{j}{\arg\min} \{ \gamma_{j} \}$

$\hat{c}_{k+1,j} = x_{j}^{T} (y - \hat{y}_{k})$ and $\hat{C}_{k+1} = \max_{j} \{ |\hat{c}_{k+1,j}| \}$; return to step ①.

Otherwise $\hat{y}_{k} = \hat{y}_{k-1} + \hat{\gamma}_{k} u_{k}$

For $j \in A$, $\hat{\beta}_{j} \leftarrow \hat{\beta}_{j} + \hat{\gamma}_{k} \omega_{kj} s_{j}$; otherwise $\hat{\beta}_{j} = 0$. Return to step ①.
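Steps ② and ③ of a single iteration can be sketched in NumPy as follows. This is a minimal illustration rather than the full solver: it assumes the columns of X are centred and scaled to unit norm, it uses the Gram matrix of the active columns, $X_{k}^{T} X_{k}$, and it omits the coefficient-removal branch of step ④. The function name `lars_direction_and_step` and the synthetic data are hypothetical.

```python
import numpy as np

def lars_direction_and_step(X, y, y_hat, active):
    """One LARS iteration: equiangular direction (step 2) and step size (step 3)."""
    c = X.T @ (y - y_hat)                       # correlations c_kj
    C = np.max(np.abs(c))                       # largest absolute correlation C_k
    s = np.sign(c[active])                      # signs s_j, j in A(k)
    Xk = X[:, active] * s                       # X_k = (... s_j x_j ...)
    ones = np.ones(len(active))                 # the vector 1_k
    g = np.linalg.solve(Xk.T @ Xk, ones)        # (X_k^T X_k)^{-1} 1_k
    A = 1.0 / np.sqrt(ones @ g)                 # A_k
    w = A * g                                   # omega_k
    u = Xk @ w                                  # u_k = X_k omega_k
    a = X.T @ u                                 # a_kj = x_j^T u_k
    gamma, entering = C / A, None               # full step if no candidate is smaller
    for j in range(X.shape[1]):
        if j in active:
            continue
        for cand in ((C - c[j]) / (A - a[j]), (C + c[j]) / (A + a[j])):
            if 1e-12 < cand < gamma:            # min+: smallest positive candidate
                gamma, entering = cand, j
    return u, A, C, gamma, entering

# Hypothetical data: 30 samples, 5 centred, unit-norm columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.05 * rng.standard_normal(30)
y -= y.mean()

active = [int(np.argmax(np.abs(X.T @ y)))]      # first variable to enter
u, A, C, gamma, entering = lars_direction_and_step(X, y, np.zeros(30), active)
c_new = X.T @ (y - gamma * u)                   # correlations after the step
print(active, entering, np.abs(c_new[active + [entering]]))
```

After the step, the entering variable and the active variables share the same absolute correlation with the residual, $\hat{C}_{k} - \hat{\gamma}_{k} A_{k}$, which is the property that defines the step size.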

(4) Using the retained wavelength positions, keep only the corresponding wavelength columns of the training-set spectral matrix to obtain a new spectral matrix, and build a partial least squares (PLS) regression model between it and the concentration vector of the target component in the training-set samples. The number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test. This model is then used to determine the concentration of the target component in the prediction-set samples.
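Step (4) can be sketched as below: select the columns where β is not 0, fit a single-response PLS model, and predict. The sketch uses a plain NIPALS implementation of PLS1, which is one common way to compute PLS and not necessarily the variant used by the inventors. The function names, the synthetic data, the stand-in coefficient vector `beta`, and the fixed number of factors are assumptions; the Monte Carlo cross-validation and F-test for choosing the factor number are not shown.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit PLS1 by NIPALS; returns (coefficients, column means of X, mean of y)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean               # centred residual blocks
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)                  # weight vector
        scores = E @ w                          # score vector
        ss = scores @ scores
        p = E.T @ scores / ss                   # X loadings
        q_a = f @ scores / ss                   # y loading
        E = E - np.outer(scores, p)             # deflate X
        f = f - q_a * scores                    # deflate y
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)      # b = W (P^T W)^{-1} q
    return coef, x_mean, y_mean

def pls1_predict(X, model):
    """Predict concentrations for new spectra with a fitted PLS1 model."""
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

# Hypothetical data: 40 training spectra with 12 wavelength points.
rng = np.random.default_rng(1)
X_train = rng.standard_normal((40, 12))
y_train = X_train[:, 2] - 0.5 * X_train[:, 7] + 0.01 * rng.standard_normal(40)
X_pred = rng.standard_normal((10, 12))

beta = np.zeros(12)
beta[[2, 7, 9]] = [0.8, -0.4, 0.05]             # stand-in for the LASSO coefficients
keep = np.flatnonzero(beta)                     # positions where beta is not 0

model = pls1_fit(X_train[:, keep], y_train, n_factors=2)
y_pred = pls1_predict(X_pred[:, keep], model)
print(keep, y_pred[:3])
```

The same `keep` index is applied to the training and prediction spectra, so the model only ever sees the retained wavelength columns.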

Compared with existing variable selection methods, the invention runs fast, selects variables reproducibly, and achieves better prediction results with fewer variables.

Description of the drawings

Figure 1: Near-infrared spectra of the tobacco samples

Figure 2: Mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs on the training set of the tobacco near-infrared data, as a function of the normalized constraint value t; the vertical line marks the t value of the optimal model

Figure 3: Regression coefficients β of all variables after LASSO variable selection on the training set of the tobacco near-infrared data

Figure 4: Distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO

Figure 5: Near-infrared spectra of the ternary blend samples of sesame oil with soybean oil and rice oil

Figure 6: Mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs on the training set of the spectral data of the ternary blend samples of sesame oil with soybean oil and rice oil, as a function of the normalized constraint value t; the vertical line marks the t value of the optimal model

Figure 7: Regression coefficients β of all variables after LASSO variable selection on the training set of the spectral data of the ternary blend samples of sesame oil with soybean oil and rice oil

Figure 8: Distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO

Detailed description

For a better understanding of the invention, it is described in further detail below with reference to examples, but the scope of protection claimed is not limited to the scope of the examples.

Example 1:

This example applies the method to near-infrared spectral analysis to determine the reducing sugar content of tobacco samples. The specific steps are as follows:

(1) Acquire the near-infrared spectral data of the tobacco leaf samples. 269 tobacco leaf slice samples from different tobacco growing regions were measured with a Bruker Vector 22/N near-infrared spectrometer (Bruker Optical Instruments, Germany). The NIR spectra cover the wavenumber range 4000-9000 cm^-1 with a sampling interval of 4 wavenumbers, giving 1296 wavelength points in total. The near-infrared spectra of the samples are shown in Figure 1. The reducing sugar content of the tobacco samples was determined by the standard method with an AAIII continuous flow analyzer (BranLuebbe, Germany). Before modeling, the tobacco samples were randomly divided into a training set and a prediction set; the training-set samples are used to build the model, and the prediction-set samples are used to test its predictive ability.

(2) Determine the LASSO constraint value t by cross-validation. t controls the degree of shrinkage: the smaller t is, the stronger the shrinkage. This constraint makes some components of the vector β become 0, which achieves variable selection. For the training set of this example, the mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs are plotted against the normalized constraint value t in Figure 2; the vertical line marks the t value of the optimal model, 0.103.
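The repeated cross-validation behind such a curve can be sketched as follows. Several things here are assumptions and differ from the text: the LASSO is solved in its penalized (Lagrangian) form by cyclic coordinate descent instead of by least angle regression, so the search runs over a penalty `lam` and the equivalent bound t is read off afterwards as the sum of the absolute coefficients; the data are synthetic and centred; and 20 random splits are used in place of the 1000 runs of the example. The normalization of t used for the figure is not specified in the text and is not reproduced.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    r = y.astype(float)                         # residual y - X b (b starts at 0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r += X[:, j] * b[j]                 # take variable j out of the fit
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]                 # put the updated variable back
    return b

def repeated_cv(X, y, lams, n_repeats=20, val_fraction=0.2, seed=0):
    """Mean and variance of the validation SSR for each penalty in lams."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_val = max(1, round(val_fraction * n))
    ssr = np.zeros((n_repeats, len(lams)))
    for rep in range(n_repeats):
        perm = rng.permutation(n)               # new random split each run
        val, tr = perm[:n_val], perm[n_val:]
        for i, lam in enumerate(lams):
            b = lasso_cd(X[tr], y[tr], lam)
            ssr[rep, i] = np.sum((y[val] - X[val] @ b) ** 2)
    return ssr.mean(axis=0), ssr.var(axis=0)

# Hypothetical centred data: only columns 0 and 3 carry signal.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 8))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(40)
y -= y.mean()

lams = [0.01, 0.1, 1.0, 10.0, 1000.0]
mean_ssr, var_ssr = repeated_cv(X, y, lams)
best = int(np.argmin(mean_ssr))                 # penalty with the lowest mean SSR
b_best = lasso_cd(X, y, lams[best])
print(lams[best], np.abs(b_best).sum())         # chosen penalty and the bound t it implies
```

A larger penalty corresponds to a smaller bound t, so scanning `lams` from large to small traces the same trade-off as scanning t from small to large.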

(3) Solve for the LASSO regression coefficients β. Use the least angle regression algorithm to solve for β, and record the positions of the wavelength points whose regression coefficients are not 0.

The regression coefficients β of all variables after LASSO variable selection in this example are shown in Figure 3.

(4) Using the retained wavelength positions, keep only the corresponding wavelength columns of the training-set spectral matrix to obtain a new spectral matrix. Build a partial least squares (PLS) regression model between this spectral matrix and the concentration vector of the target component in the training-set samples; the number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test. This model is then used to determine the concentration of the target component in the prediction-set samples. The number of factors determined in this example is 8.

The distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO is shown in Figure 4. On the one hand, the variables selected by LASSO fall in roughly the same ranges as those selected by the other three methods, which indicates that the LASSO selection is reasonable. On the other hand, LASSO selects fewer variables than the other three methods, which shows the advantage of the method.

To compare the four variable selection methods further, Table 1 lists the results of PLS models built on the tobacco near-infrared data without variable selection and with each variable selection method. The table shows that LASSO selects only 27 variables, roughly one tenth of the number selected by each of the other three methods. Its computation time of 11.89 is slower than PLS without variable selection but clearly faster than the other variable selection methods. LASSO-PLS gives the smallest RMSEP and the largest R, indicating that the method improves the prediction accuracy of the model more than the others. Compared with the other variable selection methods, LASSO-PLS therefore selects fewer variables, takes less computation time, and predicts more accurately.
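The two figures of merit used in this comparison can be computed as in the sketch below. The text does not define RMSEP and R; the sketch assumes the usual definitions, namely the root mean square error of prediction over the prediction set and the Pearson correlation coefficient between reference and predicted values. The function names and sample values are hypothetical.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def correlation(y_true, y_pred):
    """Pearson correlation coefficient R between reference and predicted values."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical reference and predicted values for a small prediction set.
reference = [10.2, 11.5, 9.8, 12.1, 10.9]
predicted = [10.0, 11.7, 9.9, 11.8, 11.0]
print(rmsep(reference, predicted), correlation(reference, predicted))
```

A lower RMSEP and an R closer to 1 both indicate better agreement between the predicted and reference concentrations.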

Table 1 Comparison of the results of different modeling methods on the tobacco near-infrared data

Example 2:

This example applies the method to near-infrared spectral analysis of ternary blends of sesame oil with soybean oil and rice oil. The specific steps are as follows:

(1) Acquire the NIR spectral data of the ternary blend samples of sesame oil with soybean oil and rice oil, using a near-infrared spectrophotometer (TJ270-60, Tianjin Tuopu Instrument Co., Ltd.). The wavelength range is 800-2500 nm with a sampling interval of 1 nm, giving 1701 wavelength points in total. The near-infrared spectra of the samples are shown in Figure 5. The samples were prepared in set proportions (soybean oil mass 0.05-2.5 in steps of 0.05; rice oil concentration 0.05-2.5 in steps of 0.05). Before modeling, the samples were randomly divided into a training set and a prediction set; the training-set samples are used to build the model, and the prediction-set samples are used to test its predictive ability.

(2) Determine the LASSO constraint value t by cross-validation. t controls the degree of shrinkage: the smaller t is, the stronger the shrinkage. This constraint makes some components of the vector β become 0, which achieves variable selection. For the training set of this example, the mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs are plotted against the normalized constraint value t in Figure 6; the vertical line marks the t value of the optimal model, 0.254.

The regression coefficients β of all variables after LASSO variable selection on the training set of this example are shown in Figure 7.

The distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO is shown in Figure 8. The variables selected by LASSO fall in roughly the same ranges as those selected by the other three methods, which indicates that the LASSO selection is reasonable. On the other hand, LASSO selects fewer variables than the other three methods, which shows the advantage of the method.

To compare the four variable selection methods further, Table 2 lists the results of PLS models built on the near-infrared data of the ternary blends of sesame oil with soybean oil and rice oil, without variable selection and with each variable selection method. The table shows that LASSO selects only 11 variables, far fewer than the other variable selection methods. Its computation time is 2.48 seconds, clearly faster than the other variable selection methods. LASSO-PLS gives the smallest RMSEP and the largest R. Compared with the other variable selection methods, LASSO-PLS therefore selects fewer variables, takes less computation time, and predicts more accurately.

Table 2 Comparison of the results of different modeling methods on the vegetable oil NIR data

Claims

1. A near-infrared spectral variable selection method based on LASSO, characterized by comprising the following steps:

1) acquire the near-infrared spectral data of the samples of the measured object, measure the concentration of the target component of the training-set samples by a conventional method, and divide the data into a training set and a prediction set using a chosen grouping scheme;

2) determine the constraint value t of LASSO;

3) solve for the regression coefficients β of LASSO with the least angle regression algorithm;

4) build a partial least squares (PLS) regression model from the wavelength columns of the training-set spectral matrix whose regression coefficients β are not 0 and the concentration vector, and use this model to predict the content of the component in unknown samples.

2. The LASSO-based near-infrared spectral variable selection method according to claim 1, characterized in that the procedure for solving for the regression coefficients β of LASSO with the least angle regression algorithm is:

① Update the set of variables selected into the model (the active set) and compute the absolute correlations

$\hat{y}_{0} = 0; \quad \hat{c}_{kj} = x_{j}^{T} (y - \hat{y}_{k-1}); \quad \hat{C}_{k} = \max_{j} \{ |\hat{c}_{kj}| \}$

Update the active set A(k)

$A(k) = A(k-1) + \{\hat{j}\}; \quad A(0) = \varnothing; \quad \hat{j} = \underset{j \notin A(k-1)}{\arg\max} \{ |\hat{c}_{kj}| \}$

② Determine the least angle direction ($u_{k}$)

Let $X_{k} = (\ldots s_{j} x_{j} \ldots)_{j \in A(k)}$

where $s_{j} = \mathrm{sign}\{\hat{c}_{kj}\}, \quad \omega_{k} = A_{k} (X_{k}^{T} X_{k})^{-1} 1_{k}, \quad A_{k} = (1_{k}^{T} (X_{k}^{T} X_{k})^{-1} 1_{k})^{-0.5}$

$1_{k}$ is a vector whose components are all 1 and whose length equals |A|

Compute the least angle direction: $u_{k} = X_{k} \omega_{k}$

③ Compute the step size

For $j \notin A(k)$, let $a_{kj} = x_{j}^{T} u_{k}$

If |A| = d, the algorithm terminates

Otherwise $\hat{\gamma}_{k} = \min_{j \notin A(k)}^{+} \left\{ (\hat{C}_{k} - \hat{c}_{kj}) / (A_{k} - a_{kj}), \; (\hat{C}_{k} + \hat{c}_{kj}) / (A_{k} + a_{kj}) \right\}$

④ Predict the response

$\tilde{\gamma}_{k} = \underset{\gamma_{j} > 0, \, j \in A(k)}{\min} \{ \gamma_{j} \}$, where $\gamma_{j} = - \hat{\beta}_{j} / (s_{j} \omega_{kj}); \quad \tilde{\gamma}_{1} = \infty$

If $\tilde{\gamma}_{k} < \hat{\gamma}_{k}$, then $\hat{y}_{k} = \hat{y}_{k-1} + \tilde{\gamma}_{k} u_{k}$

For $j \in A$, $\hat{\beta}_{j} \leftarrow \hat{\beta}_{j} + \tilde{\gamma}_{k} \omega_{kj} s_{j}$; otherwise $\hat{\beta}_{j} = 0$

$A(k+1) = A(k) - \{\tilde{j}\}$, where $\tilde{j} = \underset{j}{\arg\min} \{ \gamma_{j} \}$

$\hat{c}_{k+1,j} = x_{j}^{T} (y - \hat{y}_{k})$ and $\hat{C}_{k+1} = \max_{j} \{ |\hat{c}_{k+1,j}| \}$; return to step ①.

Otherwise $\hat{y}_{k} = \hat{y}_{k-1} + \hat{\gamma}_{k} u_{k}$

For $j \in A$, $\hat{\beta}_{j} \leftarrow \hat{\beta}_{j} + \hat{\gamma}_{k} \omega_{kj} s_{j}$; otherwise $\hat{\beta}_{j} = 0$; return to step ①.

3. The LASSO-based near-infrared spectral variable selection method according to claim 1, characterized in that: the constraint value t of LASSO is determined by cross-validation; t controls the degree of shrinkage, and the smaller t is, the stronger the shrinkage; this constraint makes some components of the vector β become 0, thereby achieving variable selection.

4. The LASSO-based near-infrared spectral variable selection method according to claim 1, characterized in that: the number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test.