
CN105203498A - Near infrared spectrum variable selection method based on LASSO - Google Patents

Near infrared spectrum variable selection method based on LASSO

Info

Publication number
CN105203498A
Authority
CN
China
Prior art keywords
lasso
gamma
variable selection
infrared spectrum
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510581659.5A
Other languages
Chinese (zh)
Inventor
卞希慧
颜鼎荷
李淑娟
谭小耀
李翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tiangong University
Original Assignee
Tianjin Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Polytechnic University
Priority to CN201510581659.5A
Publication of CN105203498A
Legal status: Pending

Landscapes

  • Investigating Or Analysing Materials By Optical Means (AREA)

Abstract

A LASSO-based variable selection method for near-infrared spectra, carried out as follows: collect the near-infrared spectra of the samples and measure the concentration vector of the target component by a conventional method; divide the data set into a training set and a prediction set using a chosen partitioning scheme; determine the constraint value t of the LASSO method by cross-validation; compute the regression coefficients β with the least angle regression algorithm and retain the positions of the wavelength points where β is nonzero; build a partial least squares regression model between the training-set spectra at the retained wavelengths and the concentration vector, and use it to predict the concentration of the target component in the prediction-set samples. The method extracts the informative wavelengths, simplifies the quantitative analysis model, and improves its prediction accuracy. Compared with existing variable selection methods, it is fast and repeatable, and it reaches higher prediction accuracy with fewer variables. The invention is suitable for variable selection in near-infrared spectra of complex samples.

Description

A LASSO-based variable selection method for near-infrared spectroscopy

Technical Field

The invention belongs to the field of non-destructive analysis techniques within analytical chemistry, and specifically relates to a LASSO-based variable selection method for near-infrared spectra.

Background

Near-infrared spectroscopy is a rapidly developing technique in analytical chemistry. It offers high analysis efficiency, fast detection, and no need for sample pretreatment, and it is widely used in the food, petroleum, and other industries. Building a model between the near-infrared spectrum and the content or class of the measured substance allows direct qualitative and quantitative analysis of complex materials. An important problem in near-infrared spectral modeling is the presence of redundant wavelengths. A typical near-infrared (NIR) spectrum contains hundreds to thousands of wavelength variables, some of which are unrelated to the property under study; these irrelevant wavelengths degrade model quality and reduce its predictive ability. Variable selection has therefore always been an important part of spectral modeling.

The variable selection methods commonly used in spectral data analysis fall into two groups: methods based on intelligent optimization algorithms and methods based on statistics. The former mainly include simulated annealing (SA; see Swierenga H, de Groot PJ, de Weijer AP, Derksen MWJ, Buydens LMC, Improvement of PLS model transferability by robust wavelength selection, Chemom Intell Lab Syst, 1998, 41, 237-248), genetic algorithms (GA; see Leardi R, Gonzalez AL, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemom Intell Lab Syst, 1998, 41, 195-207), tabu search (TS; see Hageman JA, Streppel M, Wehrens R, Wavelength selection with Tabu Search, J Chemometrics, 2003, 17, 427-437), ant colony optimization (ACO; see Shamsipur M, Zare-Shahabadi V, Hemmateenejad B, Akhond M, Ant colony optimization: a powerful tool for wavelength selection, J Chemometrics, 2006, 20, 146-157), and particle swarm optimization (PSO; see Xu L, Jiang JH, Wu HL, Shen GL, Yu RQ, Variable-weighted PLS, Chemom Intell Lab Syst, 2007, 85, 140-143). These optimization methods require many parameters, have long search times, and easily become trapped in local optima. The latter mainly include uninformative variable elimination (UVE; see Centner V, Massart DL, de Noord OE, Jong S, Vandeginste BM, Sterna C, Elimination of uninformative variables for multivariate calibration, Anal Chem, 1996, 68, 3851-3858), Monte Carlo uninformative variable elimination (MCUVE; see Cai WS, Li YK, Shao XG, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemom Intell Lab Syst, 2008, 90, 188-194), and variable screening based on the randomization test (RT; see Xu H, Liu ZC, Cai WS, Shao XG, A wavelength selection method based on randomization test for near-infrared spectral analysis, Chemom Intell Lab Syst, 2009, 97, 189-193). UVE uses leave-one-out cross-validation to obtain the variable stability values; this requires many repeated computations and also requires adding as many random noise variables as there are variables in the original spectrum, so for large data sets the method is computationally inefficient and slow. MCUVE and RT both build multiple models; compared with a single model, multiple models can often extract and express the complex relationship between the independent and dependent variables more effectively from different aspects and levels of the data, which favors a more reasonable and reliable selection of variables. However, because the modeling samples are chosen at random each time, the results of both methods are somewhat unstable, and both are also time-consuming for large data sets. It is therefore necessary to develop new, fast variable selection methods that improve model stability and prediction accuracy.

Summary of the Invention

The object of the invention is to address the problems above by providing a fast and stable variable selection method. The method minimizes the residual sum of squares subject to the constraint that the sum of the absolute values of the regression coefficients is bounded by a constant. This drives some regression coefficients exactly to zero; the corresponding variables are deleted, which achieves variable selection.

The specific steps are as follows:

(1) Collect m samples to be tested. Set the spectral parameters and acquire the near-infrared spectra of the samples to obtain the spectral matrix X. Measure the content of the target component in each sample by a conventional method to obtain the concentration vector y. Divide the data into a training set and a prediction set using a chosen partitioning scheme; the training-set samples are used to build the model and optimize its parameters, and the prediction-set samples are used to test the predictive ability of the model.
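
The partitioning in step (1) could be sketched as below. This is a minimal illustration that assumes a purely random split; the function name `split_indices`, the 70 % training fraction, and the toy data are hypothetical, since the source only says "a chosen partitioning scheme".

```python
import numpy as np

def split_indices(m, train_fraction=0.7, seed=0):
    """Randomly split m sample indices into training and prediction sets.

    train_fraction and seed are illustrative assumptions, not values
    taken from the source.
    """
    order = np.random.default_rng(seed).permutation(m)  # shuffled sample indices
    n_train = int(round(train_fraction * m))
    return np.sort(order[:n_train]), np.sort(order[n_train:])

# Toy data standing in for measured spectra: 10 samples, 4 wavelength points.
X = np.arange(40, dtype=float).reshape(10, 4)   # spectral matrix X
y = np.arange(10, dtype=float)                  # concentration vector y

train_idx, pred_idx = split_indices(X.shape[0])
X_train, y_train = X[train_idx], y[train_idx]   # used to build and tune the model
X_pred, y_pred = X[pred_idx], y[pred_idx]       # used only to test the model
```

Fixing the seed makes the split reproducible, which matters when several variable selection methods are compared on the same partition.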

(2) Determine the LASSO constraint value t by cross-validation. t controls the degree of shrinkage: the smaller t is, the stronger the shrinkage. Because of this constraint, some components of the regression coefficient vector β become 0, which achieves variable selection.
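
One way to picture step (2) is the sketch below. It is not the procedure of the source: it swaps in coordinate descent on the penalized form of the LASSO (penalty `lam`) instead of solving the constrained form, runs a single 5-fold cross-validation rather than repeated runs, and uses synthetic data. The names `lasso_cd`, `cv_ssr`, and the penalty grid are hypothetical. The constraint value that corresponds to the chosen penalty is recovered afterwards as the sum of absolute coefficients.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b 0.5*||y - X b||^2 + lam * sum|b_j|."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # squared column norms
    r = y.astype(float)                # residual y - X b (b starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]        # residual with variable j removed
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]  # soft threshold
            r -= X[:, j] * b[j]
    return b

def cv_ssr(X, y, lambdas, n_folds=5, seed=0):
    """Sum of squared held-out residuals (SSR) for each candidate penalty."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
    ssr = np.zeros(len(lambdas))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        for i, lam in enumerate(lambdas):
            b = lasso_cd(X[train], y[train], lam)
            ssr[i] += np.sum((y[held_out] - X[held_out] @ b) ** 2)
    return ssr

# Synthetic stand-in for a training set: only variables 2 and 7 matter.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 20))
true_beta = np.zeros(20)
true_beta[[2, 7]] = [3.0, -2.0]
y = X @ true_beta + 0.1 * rng.standard_normal(60)

lambdas = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
ssr = cv_ssr(X, y, lambdas)
best_lam = lambdas[int(np.argmin(ssr))]
beta = lasso_cd(X, y, best_lam)
t_equivalent = np.abs(beta).sum()      # constraint value matched by best_lam
selected = np.flatnonzero(beta)        # positions of the retained variables
```

A larger penalty corresponds to a smaller t, so scanning the penalty grid from small to large mirrors tightening the constraint.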

(3) Solve for the LASSO regression coefficients β with the least angle regression algorithm, and save the positions of the wavelength points whose regression coefficients are nonzero.

$$\hat{\beta} = \arg\min_{\beta \in R^{p}} \left\{ (y - X\beta)^{T} (y - X\beta) \right\} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t$$
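
A short worked case may help show why this constraint yields coefficients that are exactly zero. It assumes orthonormal columns, $X^{T}X = I$, purely for illustration; spectra do not satisfy this, and the symbols $\lambda$ and $\hat{\beta}^{0}$ are introduced here and do not appear in the source. For every t there is a $\lambda \ge 0$ for which the constrained problem has the same solution as the penalized problem

$$\hat{\beta} = \arg\min_{\beta \in R^{p}} \left\{ (y - X\beta)^{T} (y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

With $X^{T}X = I$ and $\hat{\beta}^{0} = X^{T}y$ (the least squares solution), the objective separates into one term per coefficient, $\beta_j^{2} - 2\beta_j \hat{\beta}^{0}_j + \lambda |\beta_j|$, whose minimizer is

$$\hat{\beta}_j = \operatorname{sign}(\hat{\beta}^{0}_j) \left( |\hat{\beta}^{0}_j| - \lambda/2 \right)_{+}$$

Every coefficient with $|\hat{\beta}^{0}_j| \le \lambda/2$ is therefore set exactly to 0, and a smaller t (larger $\lambda$) zeroes more of them.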

The least angle regression algorithm proceeds as follows:

① Update the set of variables selected into the model (the active set) and compute the absolute correlations:

$$\hat{y}_0 = 0; \quad \hat{c}_{kj} = x_j^{T} (y - \hat{y}_{k-1}); \quad \hat{C}_k = \max_j \{ |\hat{c}_{kj}| \}$$

Update the active set A(k):

$$A(k) = A(k-1) + \{\hat{j}\}; \quad A(0) = \varnothing; \quad \hat{j} = \arg\max_{j \notin A(k-1)} \{ |\hat{c}_{kj}| \}$$

② Determine the least angle direction $u_k$.

Let $X_k = (\cdots s_j x_j \cdots)_{j \in A(k)}$, where

$$s_j = \operatorname{sign}\{\hat{c}_{kj}\}, \quad \omega_k = A_k (X_k^{T} X_k)^{-1} 1_k, \quad A_k = \left( 1_k^{T} (X_k^{T} X_k)^{-1} 1_k \right)^{-0.5}$$

$1_k$ is a vector whose components are all 1 and whose length equals |A|. Compute the least angle direction: $u_k = X_k \omega_k$.

③ Compute the step size.

For $j \notin A(k)$, let $a_{kj} = x_j^{T} u_k$.

If |A| = d, the algorithm terminates.

Otherwise

$$\hat{\gamma}_k = \min^{+}_{j \notin A(k)} \left\{ \frac{\hat{C}_k - \hat{c}_{kj}}{A_k - a_{kj}}, \; \frac{\hat{C}_k + \hat{c}_{kj}}{A_k + a_{kj}} \right\}$$

④ Predict the response.

$$\tilde{\gamma}_k = \min_{\gamma_j > 0,\; j \in A(k)} \{ \gamma_j \}, \quad \text{where } \gamma_j = -\hat{\beta}_j / (s_j \omega_{kj}); \quad \tilde{\gamma}_1 = \infty$$

If $\tilde{\gamma}_k < \hat{\gamma}_k$, then $\hat{y}_k = \hat{y}_{k-1} + \tilde{\gamma}_k u_k$.

For $j \in A(k)$, $\hat{\beta}_j \leftarrow \hat{\beta}_j + \tilde{\gamma}_k \omega_{kj} s_j$; otherwise $\hat{\beta}_j = 0$.

$$A(k+1) = A(k) - \{\tilde{j}\}, \quad \text{where } \tilde{j} = \arg\min_j \{ \gamma_j \}$$

$$\hat{c}_{k+1,j} = x_j^{T} (y - \hat{y}_k), \quad \text{and } \hat{C}_{k+1} = \max_j \{ |\hat{c}_{k+1,j}| \}$$

Return to step ①.

Otherwise $\hat{y}_k = \hat{y}_{k-1} + \hat{\gamma}_k u_k$.

For $j \in A(k)$, $\hat{\beta}_j \leftarrow \hat{\beta}_j + \hat{\gamma}_k \omega_{kj} s_j$; otherwise $\hat{\beta}_j = 0$. Return to step ①.
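
The four steps of the algorithm can be sketched in NumPy as below. This is an illustrative implementation under stated assumptions, not the inventors' code: the columns of X are assumed centered, scaled to unit norm, and linearly independent; the function name `lars_lasso_path`, the numerical tolerances, and the synthetic data are hypothetical; and the stopping value d is taken to be min(n, p), which the source does not specify.

```python
import numpy as np

def lars_lasso_path(X, y, max_steps=500):
    """Coefficient vectors at the breakpoints of the LARS path with the lasso step."""
    n, p = X.shape
    beta, y_hat = np.zeros(p), np.zeros(n)
    active, just_dropped = [], False
    C0 = np.max(np.abs(X.T @ y))
    path = [beta.copy()]
    for _ in range(max_steps):
        c = X.T @ (y - y_hat)                      # step 1: current correlations
        C = np.max(np.abs(c))
        if C <= 1e-10 * C0:                        # residual uncorrelated with X
            break
        outside = [j for j in range(p) if j not in active]
        if outside and not just_dropped and len(active) < min(n, p):
            active.append(max(outside, key=lambda j: abs(c[j])))
        just_dropped = False
        inactive = [j for j in range(p) if j not in active]
        s = np.sign(c[active])                     # step 2: equiangular direction
        Xk = X[:, active] * s
        g = np.linalg.solve(Xk.T @ Xk, np.ones(len(active)))
        A = 1.0 / np.sqrt(g.sum())
        w = A * g
        u = Xk @ w
        a = X.T @ u                                # step 3: step size
        gamma_hat = C / A                          # full step if nothing can tie
        if len(active) < min(n, p):
            for j in inactive:
                for val in ((C - c[j]) / (A - a[j]), (C + c[j]) / (A + a[j])):
                    if 1e-10 < val < gamma_hat:
                        gamma_hat = val
        d = s * w                                  # step 4: lasso modification
        gamma_tilde, j_tilde = np.inf, None
        for idx, j in enumerate(active):
            gj = -beta[j] / d[idx]
            if 1e-10 < gj < gamma_tilde:
                gamma_tilde, j_tilde = gj, j
        gamma = min(gamma_hat, gamma_tilde)
        y_hat = y_hat + gamma * u
        beta[active] += gamma * d
        if gamma_tilde < gamma_hat:                # a coefficient reached zero first
            beta[j_tilde] = 0.0
            active.remove(j_tilde)
            just_dropped = True
        path.append(beta.copy())
    return path

# Synthetic stand-in for spectra: centered, unit-norm columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([4.0, 0.0, 0.0, -3.0, 0.0, 1.0]) + 0.05 * rng.standard_normal(40)
y = y - y.mean()

path = lars_lasso_path(X, y)
l1_norms = [np.abs(b).sum() for b in path]       # grows along the path
selected_early = np.flatnonzero(path[2])         # variables kept at a tight constraint
```

Each entry of `path` is the solution for the constraint value equal to its L1 norm, so picking t amounts to stopping (or interpolating) along this path and reading off the nonzero positions.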

(4) Using the retained wavelength positions, keep only the corresponding wavelength columns of the training-set spectral matrix to obtain a new spectral matrix, and build a partial least squares (PLS) regression model between it and the concentration vector of the target component for the training-set samples. The number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test. Use this model to determine the concentration of the target component in the prediction-set samples.
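
Step (4) could be sketched as follows. The block fits a single-response PLS model with the NIPALS algorithm and estimates the prediction error for each number of factors by Monte Carlo cross-validation (repeated random splits). It is a simplified illustration: the F-test that the source combines with the cross-validation is not reproduced, and the factor count is simply taken at the minimum error. The names `pls1_fit` and `mccv_rmse`, the 100 splits, the 80 % training fraction, and the synthetic data are all assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by NIPALS; returns the regression vector and the centering means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)       # weight vector
        t = E @ w                       # scores
        tt = t @ t
        p = E.T @ t / tt                # X loadings
        q_a = (f @ t) / tt              # y loading
        E = E - np.outer(t, p)          # deflate X
        f = f - q_a * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def mccv_rmse(X, y, max_components, n_splits=100, train_fraction=0.8, seed=0):
    """Root mean square cross-validation error for 1..max_components factors."""
    rng = np.random.default_rng(seed)
    m = len(y)
    n_train = int(round(train_fraction * m))
    press = np.zeros(max_components)
    for _ in range(n_splits):
        order = rng.permutation(m)      # a fresh random split each round
        tr, te = order[:n_train], order[n_train:]
        for a in range(1, max_components + 1):
            coef, xm, ym = pls1_fit(X[tr], y[tr], a)
            pred = (X[te] - xm) @ coef + ym
            press[a - 1] += np.sum((y[te] - pred) ** 2)
    return np.sqrt(press / (n_splits * (m - n_train)))

# Synthetic stand-in for the retained wavelength columns of the training set.
rng = np.random.default_rng(2)
X_sel = rng.standard_normal((50, 6))
y = X_sel @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 3.0]) + 0.1 * rng.standard_normal(50)

rmsecv = mccv_rmse(X_sel, y, max_components=6)
n_factors = int(np.argmin(rmsecv)) + 1
coef, x_mean, y_mean = pls1_fit(X_sel, y, n_factors)
y_fit = (X_sel - x_mean) @ coef + y_mean          # same formula predicts new samples
```

Prediction-set spectra would be reduced to the same retained columns before applying `coef`, `x_mean`, and `y_mean`.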

Compared with existing variable selection methods, the invention runs fast, selects variables reproducibly, and achieves better prediction results with fewer variables.

Brief Description of the Drawings

Figure 1: Near-infrared spectra of the tobacco samples

Figure 2: Mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs on the training set of the tobacco near-infrared data, plotted against the normalized constraint value t; the vertical line marks the t value of the optimal model

Figure 3: Regression coefficients β of all variables after LASSO variable selection on the training set of the tobacco near-infrared data

Figure 4: Distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO

Figure 5: Near-infrared spectra of the ternary blend samples of sesame oil with soybean oil and rice oil

Figure 6: Mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs on the training set of the spectral data of the ternary blend samples of sesame oil with soybean oil and rice oil, plotted against the normalized constraint value t; the vertical line marks the t value of the optimal model

Figure 7: Regression coefficients β of all variables after LASSO variable selection on the training set of the spectral data of the ternary blend samples of sesame oil with soybean oil and rice oil

Figure 8: Distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO

Detailed Description

For a better understanding of the invention, it is described in further detail below with reference to examples; the scope of protection claimed is not limited to the scope of the examples.

Example 1:

This example applies the method to near-infrared spectral analysis to determine the reducing sugar content of tobacco samples. The specific steps are as follows:

(1) Acquire the near-infrared spectral data of the tobacco leaf samples. A Bruker Vector 22/N near-infrared spectrometer (Bruker Optics, Germany) was used to measure 269 tobacco leaf sheet samples from different tobacco-growing regions. The NIR spectra cover the wavenumber range 4000 to 9000 cm⁻¹ at a sampling interval of 4 wavenumbers, giving 1296 wavelength points in total; the near-infrared spectra of the samples are shown in Figure 1. The reducing sugar content of the tobacco samples was determined by the standard method on an AAIII continuous flow analyzer (Bran+Luebbe, Germany). Before modeling, the tobacco leaf samples were randomly divided into two parts, a training set and a prediction set; the training-set samples are used to build the model, and the prediction-set samples are used to test its predictive ability.

(2) Determine the LASSO constraint value t by cross-validation. t controls the degree of shrinkage: the smaller t is, the stronger the shrinkage. This constraint makes some components of the vector β become 0, which achieves variable selection. For the training set of this example, Figure 2 shows how the mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs vary with the normalized constraint value t; the vertical line marks the t value of the optimal model, 0.103.

(3) Solve for the LASSO regression coefficients β. Use the least angle regression algorithm to solve for β, and save the positions of the wavelength points whose regression coefficients are nonzero.

$$\hat{\beta} = \arg\min_{\beta \in R^{p}} \left\{ (y - X\beta)^{T} (y - X\beta) \right\} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t$$

The regression coefficients β of all variables after LASSO variable selection in this example are shown in Figure 3.

(4) Using the retained wavelength positions, keep only the corresponding wavelength columns of the training-set spectral matrix to obtain a new spectral matrix. Build a partial least squares (PLS) regression model between this spectral matrix and the concentration vector of the target component for the training-set samples; the number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test. Use this model to determine the concentration of the target component in the prediction-set samples. The number of factors determined in this example is 8.

The distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO is shown in Figure 4. Figure 4 shows that, on the one hand, LASSO selects variables in roughly the same regions as the other three methods, which supports the reasonableness of the variables LASSO selects. On the other hand, LASSO selects fewer variables than the other three methods, which reflects the advantage of the method.

To compare the four variable selection methods further, Table 1 gives the results of PLS models built on the tobacco near-infrared data without variable selection and after each variable selection method. As the table shows, LASSO selects only 27 variables, nearly one tenth of the number selected by the other three methods. Its computation time, 11.89, is slower than PLS without variable selection but clearly faster than the other variable selection methods. LASSO-PLS gives the smallest RMSEP and the largest R, which indicates that the method does more to improve the prediction accuracy of the model. Compared with the other variable selection approaches, LASSO-PLS therefore selects fewer variables, needs less computation time, and predicts more accurately.

Table 1. Comparison of the results of different modeling methods for the tobacco near-infrared data
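
The two figures of merit used in this comparison can be computed as in the sketch below. It assumes that RMSEP denotes the root mean square error of prediction and that R denotes the Pearson correlation coefficient between measured and predicted values; the source does not define either, and the numbers in the example are made up for illustration.

```python
import math

def rmsep(measured, predicted):
    """Root mean square error of prediction over the prediction set."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def correlation_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)

# Made-up reference values and model predictions for four prediction-set samples.
measured = [10.2, 11.5, 9.8, 12.1]
predicted = [10.0, 11.9, 9.9, 12.0]
error = rmsep(measured, predicted)
r_value = correlation_r(measured, predicted)
```

A smaller RMSEP and an R closer to 1 both indicate better agreement between the model and the reference measurements.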

Example 2:

This example applies the method to near-infrared spectral analysis of ternary blends of sesame oil with soybean oil and rice oil. The specific steps are as follows:

(1) Acquire the NIR spectral data of the ternary blend samples of sesame oil with soybean oil and rice oil. The spectra were measured on a near-infrared spectrophotometer (TJ270-60, Tianjin Tuopu Instrument Co., Ltd.) over the wavelength range 800 to 2500 nm at a sampling interval of 1 nm, giving 1701 wavelength points in total. The near-infrared spectra of the samples are shown in Figure 5. The samples were prepared in set proportions (soybean oil mass 0.05 to 2.5 at intervals of 0.05; rice oil concentration 0.05 to 2.5 at intervals of 0.05). Before modeling, the samples were randomly divided into two parts, a training set and a prediction set; the training-set samples are used to build the model, and the prediction-set samples are used to test its predictive ability.
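
The wavelength axis described in step (1) ties column positions in the spectral matrix to physical wavelengths, which is what the retained positions from the LASSO step refer to. A small sketch, in which the function name `wavelength_axis` and the example indices are hypothetical:

```python
def wavelength_axis(start_nm, stop_nm, step_nm):
    """Wavelengths of the spectral columns, both end points included."""
    n_points = int(round((stop_nm - start_nm) / step_nm)) + 1
    return [start_nm + i * step_nm for i in range(n_points)]

axis_nm = wavelength_axis(800, 2500, 1)
n_points = len(axis_nm)                  # 1701, matching the count in the text

# Hypothetical column positions of retained variables (0-based indices).
selected_idx = [0, 350, 1700]
selected_nm = [axis_nm[i] for i in selected_idx]   # [800, 1150, 2500]
```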

(2) Determine the LASSO constraint value t by cross-validation. t controls the degree of shrinkage: the smaller t is, the stronger the shrinkage. This constraint makes some components of the vector β become 0, which achieves variable selection. For the training set of this example, Figure 6 shows how the mean and variance of the residual sum of squares (SSR) over 1000 cross-validation runs vary with the normalized constraint value t; the vertical line marks the t value of the optimal model, 0.254.

(3) Solve for the LASSO regression coefficients β. Use the least angle regression algorithm to solve for β, and save the positions of the wavelength points whose regression coefficients are nonzero.

$$\hat{\beta} = \arg\min_{\beta \in R^{p}} \left\{ (y - X\beta)^{T} (y - X\beta) \right\} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t$$

The regression coefficients β of all variables after LASSO variable selection on the training set of this example are shown in Figure 7.

(4) Using the retained wavelength positions, keep only the corresponding wavelength columns of the training-set spectral matrix to obtain a new spectral matrix. Build a partial least squares (PLS) regression model between this spectral matrix and the concentration vector of the target component for the training-set samples; the number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test. Use this model to determine the concentration of the target component in the prediction-set samples. The number of factors determined in this example is 8.

The distribution of the variables retained by the four variable selection methods UVE, MCUVE, RT, and LASSO is shown in Figure 8. Figure 8 shows that LASSO selects variables in roughly the same regions as the other three methods, which supports the reasonableness of the variables LASSO selects. In addition, LASSO selects fewer variables than the other three methods, which reflects the advantage of the method.

To compare the four variable selection methods further, Table 2 gives the results of PLS models built on the near-infrared data of the ternary blends of sesame oil with soybean oil and rice oil without variable selection and after each variable selection method. As the table shows, LASSO selects only 11 variables, far fewer than the other variable selection methods. Its computation time is 2.48 seconds, clearly faster than the other variable selection methods. LASSO-PLS gives the smallest RMSEP and the largest R. Compared with the other variable selection approaches, LASSO-PLS therefore selects fewer variables, needs less computation time, and predicts more accurately.

Table 2. Comparison of the results of different modeling methods for the vegetable oil NIR data

Claims (4)

1. A LASSO-based variable selection method for near-infrared spectra, characterized by comprising the following steps:
1) acquire the near-infrared spectral data of the samples of the measured object, measure the concentration of the target component in the training-set samples by a conventional method, and divide the data into a training set and a prediction set using a chosen partitioning scheme;
2) determine the constraint value t of the LASSO;
3) solve for the regression coefficients β of the LASSO using the least angle regression algorithm;
4) build a partial least squares (PLS) regression model between the wavelength columns of the training-set spectral matrix whose regression coefficients β are nonzero and the concentration vector, and use this model to predict the content of the component in unknown samples.
2. The LASSO-based variable selection method for near-infrared spectra according to claim 1, characterized in that the least angle regression algorithm solves for the regression coefficients β of the LASSO as follows:
① Update the set of variables selected into the model (the active set) and compute the absolute correlations:
$$\hat{y}_0 = 0; \quad \hat{c}_{kj} = x_j^{T} (y - \hat{y}_{k-1}); \quad \hat{C}_k = \max_j \{ |\hat{c}_{kj}| \}$$
Update the active set A(k):
$$A(k) = A(k-1) + \{\hat{j}\}; \quad A(0) = \varnothing; \quad \hat{j} = \arg\max_{j \notin A(k-1)} \{ |\hat{c}_{kj}| \}$$
② Determine the least angle direction $u_k$.
Let $X_k = (\cdots s_j x_j \cdots)_{j \in A(k)}$, where
$$s_j = \operatorname{sign}\{\hat{c}_{kj}\}, \quad \omega_k = A_k (X_k^{T} X_k)^{-1} 1_k, \quad A_k = \left( 1_k^{T} (X_k^{T} X_k)^{-1} 1_k \right)^{-0.5}$$
$1_k$ is a vector whose components are all 1 and whose length equals |A|.
Compute the least angle direction: $u_k = X_k \omega_k$.
③ Compute the step size.
For $j \notin A(k)$, let $a_{kj} = x_j^{T} u_k$.
If |A| = d, the algorithm terminates.
Otherwise
$$\hat{\gamma}_k = \min^{+}_{j \notin A(k)} \left\{ \frac{\hat{C}_k - \hat{c}_{kj}}{A_k - a_{kj}}, \; \frac{\hat{C}_k + \hat{c}_{kj}}{A_k + a_{kj}} \right\}$$
④ Predict the response.
$$\tilde{\gamma}_k = \min_{\gamma_j > 0,\; j \in A(k)} \{ \gamma_j \}, \quad \text{where } \gamma_j = -\hat{\beta}_j / (s_j \omega_{kj}); \quad \tilde{\gamma}_1 = \infty$$
If $\tilde{\gamma}_k < \hat{\gamma}_k$, then $\hat{y}_k = \hat{y}_{k-1} + \tilde{\gamma}_k u_k$.
For $j \in A(k)$, $\hat{\beta}_j \leftarrow \hat{\beta}_j + \tilde{\gamma}_k \omega_{kj} s_j$; otherwise $\hat{\beta}_j = 0$.
$$A(k+1) = A(k) - \{\tilde{j}\}, \quad \text{where } \tilde{j} = \arg\min_j \{ \gamma_j \}$$
$$\hat{c}_{k+1,j} = x_j^{T} (y - \hat{y}_k), \quad \text{and } \hat{C}_{k+1} = \max_j \{ |\hat{c}_{k+1,j}| \}$$
Return to step ①.
Otherwise $\hat{y}_k = \hat{y}_{k-1} + \hat{\gamma}_k u_k$.
For $j \in A(k)$, $\hat{\beta}_j \leftarrow \hat{\beta}_j + \hat{\gamma}_k \omega_{kj} s_j$; otherwise $\hat{\beta}_j = 0$. Return to step ①.
3. The LASSO-based variable selection method for near-infrared spectra according to claim 1, characterized in that the constraint value t of the LASSO is determined by cross-validation; t controls the degree of shrinkage, and the smaller t is, the stronger the shrinkage; this constraint makes some components of the vector β become 0, which achieves variable selection.
4. The LASSO-based variable selection method for near-infrared spectra according to claim 1, characterized in that the number of factors of the PLS model is determined by Monte Carlo cross-validation combined with an F-test.
CN201510581659.5A 2015-09-11 2015-09-11 Near infrared spectrum variable selection method based on LASSO Pending CN105203498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510581659.5A CN105203498A (en) 2015-09-11 2015-09-11 Near infrared spectrum variable selection method based on LASSO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510581659.5A CN105203498A (en) 2015-09-11 2015-09-11 Near infrared spectrum variable selection method based on LASSO

Publications (1)

Publication Number Publication Date
CN105203498A true CN105203498A (en) 2015-12-30

Family

ID=54951293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510581659.5A Pending CN105203498A (en) 2015-09-11 2015-09-11 Near infrared spectrum variable selection method based on LASSO

Country Status (1)

Country Link
CN (1) CN105203498A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010726A2 (en) * 2000-08-01 2002-02-07 Sensys Medical, Inc. Combinative multivariate calibration that enhances prediction ability through removal of over-modeled regions
CN102305772A (en) * 2011-07-29 2012-01-04 江苏大学 Method for screening characteristic wavelengths of near infrared spectra based on genetic kernel partial least squares
CN104502306A (en) * 2014-12-09 2015-04-08 西北师范大学 Near infrared spectrum wavelength selecting method based on variable significance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG CHANGYUE ET AL.: "Rapid compositional analysis of sawdust using sparse method and near infrared spectroscopy", 《26TH CHINESE CONTROL AND DECISION CONFERENCE》 *
Ke Zhenglin: "Application of Lasso and related methods in multiple linear regression models", 《China Master's Theses Full-text Database, Basic Sciences》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529008B (en) * 2016-11-01 2019-11-26 天津工业大学 Double-integration partial least square modeling method based on Monte Carlo and LASSO
CN106529008A (en) * 2016-11-01 2017-03-22 天津工业大学 Double-integration partial least square modeling method based on Monte Carlo and LASSO
CN106950193A (en) * 2017-05-24 2017-07-14 长春理工大学 Variable selection method for near-infrared spectroscopy based on cluster analysis of self-weighted variable combinations
CN106950193B (en) * 2017-05-24 2019-04-26 长春理工大学 Variable selection method for near-infrared spectroscopy based on cluster analysis of self-weighted variable combinations
CN107356556A (en) * 2017-07-10 2017-11-17 天津工业大学 Double-integration modeling method for quantitative analysis of near-infrared spectra
CN109459408A (en) * 2017-09-06 2019-03-12 盐城工学院 Near-infrared quantitative analysis method based on the sparse regression LAR algorithm
CN108827905A (en) * 2018-04-08 2018-11-16 江南大学 An Online Updating Method of Near Infrared Model Based on Local Weighted Lasso
CN108827905B (en) * 2018-04-08 2020-07-24 江南大学 An Online Updating Method of Near Infrared Model Based on Local Weighted Lasso
CN108763673A (en) * 2018-05-16 2018-11-06 广东省生态环境技术研究所 Land use change driving force screening method and device based on LASSO regression
CN108763673B (en) * 2018-05-16 2021-11-23 广东省科学院生态环境与土壤研究所 Land use change driving force screening method and device based on LASSO regression
CN110657890A (en) * 2018-06-29 2020-01-07 唯亚威通讯技术有限公司 Cross-validation based calibration of spectral models
CN110657890B (en) * 2018-06-29 2022-07-05 唯亚威通讯技术有限公司 Cross-validation based calibration of spectral models
US11719628B2 (en) 2018-06-29 2023-08-08 Viavi Solutions Inc. Cross-validation based calibration of a spectroscopic model
US12320743B2 (en) 2018-06-29 2025-06-03 Viavi Solutions, Inc. Cross-validation based calibration of a spectroscopic model
CN109193635A (en) * 2018-09-29 2019-01-11 清华大学 A Reconstruction Method of Distribution Network Topology Based on Adaptive Sparse Regression Method
CN109193635B (en) * 2018-09-29 2020-09-11 清华大学 A Reconstruction Method of Distribution Network Topology Based on Adaptive Sparse Regression Method
CN114298107A (en) * 2021-12-29 2022-04-08 安徽大学 Net signal extraction method and system for near-infrared spectroscopy
WO2023123329A1 (en) * 2021-12-29 2023-07-06 安徽大学 Method and system for extracting net signal in near-infrared spectrum
CN117033993A (en) * 2022-04-29 2023-11-10 华东交通大学 Method for selecting optimal training set based on minimum angle ordering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151230
