Raman spectrum multivariate data analysis method
Technical Field
The invention relates to the field of Raman spectrum information processing and spectral feature computer identification, in particular to a Raman spectrum multivariate data analysis method.
Background
Raman spectroscopy is a non-destructive analytical technique based on the interaction of light with the chemical bonds in a material. It yields detailed information about the chemical structure, phase and morphology, crystallinity, and molecular interactions of a sample. Raman spectroscopy can also shift molecular energy-level information from the infrared region into the visible region for detection. As a complement to the infrared spectrum, the Raman spectrum is therefore a powerful tool for studying the structure of molecular substances. With the progress of science and technology, Raman spectroscopy has been applied in many fields, such as petroleum, chemical engineering, materials, biology, environmental protection, and geology, and provides further molecular-structure information for the development of these industries.
Raman spectroscopy has become one of the most important techniques in basic and applied research in analytical science. Owing to its molecular sensitivity, ease of implementation, and suitability for aqueous environments, Raman spectral analysis is also widely used in other multidisciplinary research fields. Furthermore, recent developments have combined the chemical sensitivity and specificity of Raman scattering with the high spatial resolution of confocal microscopy to reconstruct images that reveal the biochemical makeup of a sample. Nevertheless, the wide application of Raman spectroscopy and its related analysis techniques is limited by several technical difficulties. Firstly, Raman scattering is a weak optical phenomenon, and the resulting spectral information (i.e., the Raman spectrum) is easily disturbed by the environment and external factors. Secondly, in a complex biochemical environment or other systems, different types of biological macromolecules contain similar biochemical components, so that the Raman spectrum exhibits overlapping peak positions, non-uniform peak intensities, and broadened peak widths (full width at half maximum).
Against this background, a Raman spectrum multivariate data analysis method is provided. After preprocessing the original Raman spectra of different types of samples, the method applies multivariate feature extraction and classification to extract and discriminate the spectral characteristic information of different materials.
Disclosure of Invention
The invention aims to provide a Raman spectrum multivariate data analysis method and software system for the preprocessing and multivariate analysis of Raman spectra and spectral data sets of various organic and inorganic materials. Features are extracted from the sample spectra of a Raman spectrum data set using the PCA and PLS-DA algorithms, and discriminant analysis is performed on the sample features using the LDA, PLS-DA, SVM and PCA-SVM algorithms.
In order to achieve this purpose, the invention provides the following scheme:
A Raman spectrum multivariate data analysis method comprises the following steps:
S1, measuring with a Raman spectrum detection instrument to obtain the original Raman spectra and spectral data sets of various organic and inorganic materials;
S2, preprocessing the obtained Raman spectrum data set with a Raman spectrum multivariate data analysis software system;
S3, after the preprocessing, performing normalization and mean centering on the Raman spectrum data;
S4, extracting Raman spectrum feature data with principal component analysis (PCA) or partial least squares-discriminant analysis (PLS-DA), and selecting the significant feature components in the Raman spectrum data with one-way analysis of variance and cross validation, respectively;
S5, building four classification models on the features extracted in step S4, and classifying and identifying the spectral information with the four classification models;
S6, evaluating the reliability of the classification models with unbiased leave-one-out cross validation;
S7, testing with the remaining data to obtain the accuracy, sensitivity and specificity of sample classification and the receiver operating characteristic (ROC) curve of each classification model, and evaluating the performance of the classification models.
Preferably, in step S2, the preprocessing mainly includes: spectral feature range selection, cosmic ray removal, background fluorescence signal processing based on a polynomial fitting method, and spectral smoothing processing based on a Savitzky-Golay convolution method.
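The polynomial background fitting and Savitzky-Golay smoothing named in this step can be sketched as follows. This is a minimal illustration, not the implementation of the software system: the function names, the polynomial order, and the window size are assumptions, and the single-pass polynomial fit shown here is simpler than the iterative fits commonly used for fluorescence backgrounds.

```python
import numpy as np

def remove_polynomial_baseline(wavenumbers, intensities, order=3):
    """Fit a low-order polynomial to the spectrum and subtract it."""
    # Centre and scale the axis so the polynomial fit stays well conditioned.
    x = (wavenumbers - wavenumbers.mean()) / wavenumbers.std()
    coeffs = np.polyfit(x, intensities, order)
    baseline = np.polyval(coeffs, x)
    return intensities - baseline

def savitzky_golay_smooth(intensities, half_width=5, order=2):
    """Smooth a spectrum by convolution with Savitzky-Golay weights."""
    offsets = np.arange(-half_width, half_width + 1)
    design = np.vander(offsets, order + 1, increasing=True)
    # Row 0 of the pseudo-inverse evaluates the least-squares polynomial
    # of each window at the window centre.
    weights = np.linalg.pinv(design)[0]
    # Reflect the ends so the output has the same length as the input.
    padded = np.pad(intensities, half_width, mode="reflect")
    return np.convolve(padded, weights[::-1], mode="valid")
```

A larger window smooths more strongly but broadens narrow Raman peaks, so the window size would need to be chosen per instrument and sample.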
Preferably, in step S3, on the basis of the preprocessing, the spectral intensity normalization, the spectral peak area normalization, the peak intensity normalization and the mean centering processing are selected according to requirements.
Preferably, the principal component analysis PCA in step S4 includes:
converting a group of linearly correlated variables into linearly uncorrelated variables through an orthogonal transformation, thereby reducing the dimensionality of the spectral data set while extracting the significant features of the data set; constructing a sample data set $X$ ($I \times J$) from the number of observed samples $I$ and the number of spectral features $J$, performing spectral peak area normalization and mean centering on the sample data set, and then forming the covariance matrix $X^T X$; performing singular value decomposition to obtain $X = P \Delta Q^T$, where $P$ holds the left singular vectors, $Q$ holds the right singular vectors (the eigenvectors of $X^T X$), and $\Delta$ is the diagonal matrix of singular values;
the factor scores are $F = P\Delta$, and since $Q^T Q = I$, $F = P\Delta = P\Delta Q^T Q = XQ$; the matrix $Q$ gives the coefficients of the linear combinations used to calculate the factor scores and is therefore also referred to as the projection matrix, and multiplying $X$ by $Q$ yields the projection $F$ of the observed values onto the principal components.
Preferably, the four classification models in step S5 are established with the linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), support vector machine (SVM), and principal component analysis combined with support vector machine (PCA-SVM) algorithms.
Preferably, in step S7 the remaining data are used for testing to obtain the performance indices and the receiver operating characteristic (ROC) curve of each classification model, and the Raman spectrum data and biochemical differences are analyzed in combination with steps S5-S7.
Preferably, the ROC curve is the receiver operating characteristic curve and reflects the sensitivity and specificity of the spectral classification model; a series of sensitivity and specificity values is calculated by continuously varying the classification threshold, and the curve is then drawn with sensitivity as the vertical coordinate and 1-specificity as the horizontal coordinate; the larger the area under the curve, the higher the prediction accuracy of the classification model.
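The threshold sweep described above can be sketched as follows. The function names are illustrative assumptions; labels are assumed to be 1 for the positive class and 0 for the negative class, and the area under the curve is computed with the trapezoidal rule.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """Return (1 - specificity, sensitivity) for every classification threshold."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    # Sweep the threshold from above the highest score down to the lowest.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    fpr, tpr = [], []
    for t in thresholds:
        predicted_pos = scores >= t
        tpr.append(np.sum(predicted_pos & (labels == 1)) / n_pos)  # sensitivity
        fpr.append(np.sum(predicted_pos & (labels == 0)) / n_neg)  # 1 - specificity
    return np.array(fpr), np.array(tpr)

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

An area of 0.5 corresponds to a classifier that ranks samples no better than chance, and an area of 1.0 to perfect separation of the two classes.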
The invention has the beneficial effects that:
1. the invention provides complete Raman spectrum data set preprocessing functions: for a single acquired Raman spectrum or a spectral data set, it can select the spectral feature range, remove cosmic rays, process background fluorescence signals by polynomial fitting, smooth the spectrum by Savitzky-Golay convolution, and apply normalization (spectral intensity normalization, spectral peak area normalization or peak intensity normalization) and mean centering as required;
2. the invention integrates and optimizes several Raman spectrum multivariate data analysis methods commonly used for various organic and inorganic materials: principal component analysis (PCA), partial least squares-discriminant analysis (PLS-DA), linear discriminant analysis (LDA), the support vector machine (SVM), and principal component analysis combined with the support vector machine (PCA-SVM);
3. the PCA-SVM classification model combines principal component analysis with the support vector machine, improving the classification performance of the model relative to the SVM alone;
4. the invention can effectively identify and distinguish the features of samples of various organic and inorganic materials, including but not limited to biological tissues and cells;
5. in the feature extraction part, PCA and PLS-DA are combined with one-way analysis of variance and cross validation to select the features of significant meaning in the data set.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of data analysis according to the present invention;
FIG. 2 is a schematic view of a Raman spectrum data preprocessing interface according to the present invention;
FIG. 3 is a diagram illustrating the results of cosmic ray removal, background removal, and smoothing in an embodiment of the invention;
FIG. 4 is a diagram illustrating a result of mean centering according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a PCA-LDA model cross validation and classification summary interface in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a PLS-DA model cross validation and classification summary interface in an embodiment of the present invention;
FIG. 7 is a diagram of an SVM model training interface according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 shows the data analysis flow chart of the present invention.
Raman spectra and spectral data sets of various organic and inorganic materials are measured with various commercial and independently developed Raman spectrometers, including large research-grade Raman spectrum detection instruments and small portable Raman spectrum detection instruments;
preprocessing the acquired Raman spectrum data through a spectrum preprocessing interface shown in FIG. 2, wherein the spectrum preprocessing comprises spectrum characteristic range selection, cosmic ray removal, background fluorescence signal processing based on a polynomial fitting method and spectrum smoothing processing based on a Savitzky-Golay convolution method (the result and the interface are shown in FIGS. 2 and 3); on the basis of preprocessing, normalization and mean centering processing can be selected according to requirements (the result is shown in fig. 4);
the normalization processing comprises the following steps: in order to eliminate the influence of power disturbance and sample nonuniformity, the spectral intensity normalization can be selected; for the purpose of discussing quantitative information for a substance, spectral peak area normalization may be selected; peak intensity normalization may be selected in order to further highlight certain species content variations by eliminating effects due to sample and instrument variations.
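The three normalization options and mean centering can be sketched as follows for a data set stored with one spectrum per row. The text does not define the normalizations mathematically, so the definitions used here are assumptions: intensity normalization scales each spectrum to unit Euclidean norm, area normalization divides by the summed intensity (which presumes evenly spaced wavenumbers), and peak normalization divides by the intensity at a chosen reference peak.

```python
import numpy as np

def normalize_intensity(spectra):
    """Scale each spectrum (row) to unit Euclidean norm (assumed definition)."""
    return spectra / np.linalg.norm(spectra, axis=1, keepdims=True)

def normalize_area(spectra):
    """Scale each spectrum so that its summed intensity equals 1."""
    return spectra / spectra.sum(axis=1, keepdims=True)

def normalize_peak(spectra, peak_index):
    """Scale each spectrum by its intensity at a reference peak position."""
    return spectra / spectra[:, [peak_index]]

def mean_center(spectra):
    """Subtract the mean spectrum of the data set from every spectrum."""
    return spectra - spectra.mean(axis=0, keepdims=True)
```

With any of these normalizations, two spectra that differ only by an overall intensity factor, for example because of laser power fluctuation, map to the same normalized spectrum.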
The invention provides two methods for extracting features from a preprocessed spectral data set: principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). Either method is selected to analyze the spectral data set, and the most significant spectral feature components are then selected with one-way analysis of variance and cross validation, respectively.
The principal component analysis comprises the following specific steps:
converting a group of linearly correlated variables into linearly uncorrelated variables through an orthogonal transformation, thereby reducing the dimensionality of the spectral data set while extracting the significant features of the data set; the sample data set is $X$ ($I \times J$), where $I$ is the number of observed samples and $J$ is the number of spectral features.
Firstly, spectral peak area normalization and mean centering are performed, and the covariance matrix $X^T X$ is formed; singular value decomposition then gives $X = P \Delta Q^T$, where $P$ holds the left singular vectors, $Q$ holds the right singular vectors (the eigenvectors of $X^T X$), and $\Delta$ is the diagonal matrix of singular values.
The factor scores are $F = P\Delta$, and since $Q^T Q = I$, $F = P\Delta = P\Delta Q^T Q = XQ$. The matrix $Q$ gives the coefficients of the linear combinations used to calculate the factor scores and is therefore also called the projection matrix (or loading matrix); multiplying $X$ by $Q$ yields the projection $F$ of the observed values onto the principal components ($F$ is also called the score matrix).
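Selecting components by one-way analysis of variance can be sketched as follows: the F statistic is computed for each extracted component (one column of scores per component), and components with a larger F separate the sample classes more strongly. The function name is an assumption, and the sketch stops at the F statistic; converting it to a p-value needs the F distribution, for which a statistics library would normally be used.

```python
import numpy as np

def anova_f_statistic(scores, labels):
    """One-way ANOVA F statistic for each column of `scores`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = scores.mean(axis=0)
    ss_between = np.zeros(scores.shape[1])
    ss_within = np.zeros(scores.shape[1])
    for c in classes:
        group = scores[labels == c]
        # Variation of the class mean around the grand mean.
        ss_between += len(group) * (group.mean(axis=0) - grand_mean) ** 2
        # Variation of the samples around their own class mean.
        ss_within += ((group - group.mean(axis=0)) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = len(scores) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)
```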
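The decomposition above can be sketched with NumPy as follows, using the same symbols (X, P, Δ, Q, F). The function name and the returned explained-variance ratio are assumptions added for illustration; normalization is assumed to have been applied before the call.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return the score matrix F, the projection matrix Q, and variance ratios."""
    # Mean-center each spectral feature (column).
    Xc = X - X.mean(axis=0, keepdims=True)
    # Xc = P @ diag(singular_values) @ Qt
    P, singular_values, Qt = np.linalg.svd(Xc, full_matrices=False)
    Q = Qt.T[:, :n_components]
    # Projection of the observations onto the principal components: F = XQ.
    F = Xc @ Q
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return F, Q, explained[:n_components]
```

Because the columns of Q are orthonormal, the columns of F are mutually uncorrelated, which is what allows a small number of leading components to stand in for the full spectrum.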
The linear discriminant analysis LDA comprises the following steps:
(1) assuming that the data set contains two classes of samples with class means $\mu_0$ and $\mu_1$, calculating the between-class scatter matrix $S_b$:
$S_b = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T$
When the data are projected onto a straight line $\omega$, the projections of the centers of the two classes of samples onto the line are $\omega^T \mu_0$ and $\omega^T \mu_1$, respectively;
(2) calculating the within-class scatter matrix $S_w$ of the samples;
(3) maximizing the generalized Rayleigh quotient of the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$,
$J = \dfrac{\omega^T S_b \omega}{\omega^T S_w \omega}$
to solve for the projection direction $\omega$;
(4) obtaining the projection line, i.e. $y = \omega^T x$;
(5) projecting a new unknown sample onto the line, and assigning it to a class according to the distances from the projected point to the projected centers of the two classes of samples.
FIG. 5 shows the cross validation and classification summary interface of the PCA-LDA model according to an embodiment of the present invention.
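The two-class procedure can be sketched as follows. The within-class scatter matrix is taken here as the sum of the scatter matrices of the two classes, and the projection direction uses the standard closed-form maximizer of the generalized Rayleigh quotient, which is proportional to S_w^{-1}(μ0 - μ1); both are textbook choices assumed for illustration, and the function names are hypothetical.

```python
import numpy as np

def fit_fisher_lda(X0, X1):
    """Return the projection direction and the projected class centers."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed scatter of each class around its own mean.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Direction maximizing the generalized Rayleigh quotient.
    omega = np.linalg.solve(Sw, mu0 - mu1)
    return omega, omega @ mu0, omega @ mu1

def classify(x, omega, center0, center1):
    """Assign x to the class whose projected center is nearer."""
    y = omega @ x
    return 0 if abs(y - center0) <= abs(y - center1) else 1
```

When the number of spectral features exceeds the number of samples, the within-class scatter matrix is singular and the solve step fails, which is one reason to apply the discriminant analysis to a small number of PCA scores rather than to the raw spectra.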
The partial least squares-discriminant analysis (PLS-DA) comprises the following steps:
(1) performing mean centering on the data;
(2) calculating the predicted response value of each sample by partial least squares regression;
(3) calculating the posterior probability that the sample belongs to each class from a probability density function and the Bayes formula, which for events A and B reads:
$P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$
(4) selecting the class with the highest probability as the predicted label.
FIG. 6 shows the PLS-DA model cross validation and classification summary interface.
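Steps (3) and (4) can be sketched as follows for a single predicted response value. The text does not specify the probability density function, so a Gaussian density fitted to the predicted responses of the training samples of each class is assumed here, with class priors taken from the class frequencies; the function names are hypothetical and the regression of step (2) is taken as already done.

```python
import numpy as np

def gaussian_pdf(y, mean, std):
    """Gaussian probability density (assumed form of the density function)."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior_probabilities(y_new, y_train, labels):
    """Posterior probability of each class given a predicted response y_new."""
    y_train = np.asarray(y_train, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    joint = []
    for c in classes:
        y_c = y_train[labels == c]
        prior = len(y_c) / len(y_train)                                 # P(A)
        likelihood = gaussian_pdf(y_new, y_c.mean(), y_c.std(ddof=1))   # P(B|A)
        joint.append(likelihood * prior)
    joint = np.array(joint)
    # Dividing by the sum over classes applies the P(B) term of the Bayes formula.
    return classes, joint / joint.sum()
```

The predicted label is then the entry of `classes` at the position of the largest posterior.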
The support vector machine comprises the following steps:
(1) defining the hyperplane $\omega^T x + b = y$, where $\omega$ is the normal vector and $b$ is the displacement;
(2) calculating the distance $d$ from a point to the hyperplane;
(3) maximizing the classification margin, subject to the constraint:
$\text{s.t.}\quad y_i(\omega^T \Phi(x_i) + b) \ge 1,\quad i = 1, 2, \ldots, n$
where $\Phi(x_i)$ is the feature space transformation function, i.e. the mapping function, and s.t. denotes the constraint;
(4) introducing slack variables $\xi_i$, which allow some data to be misclassified and prevent overfitting; the constraints become:
$y_i(\omega \cdot x_i + b) \ge 1 - \xi_i,\quad i = 1, 2, \ldots, n$
$\xi_i \ge 0,\quad i = 1, 2, \ldots, n$
FIG. 7 shows the SVM model training interface.
The invention uses the linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), support vector machine (SVM), and principal component analysis combined with support vector machine (PCA-SVM) algorithms to establish four classification models, and classifies the extracted features with each of the four models.
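The quantities in steps (2) to (4) can be evaluated for a given linear hyperplane as follows. This sketch does not train a support vector machine; it assumes a normal vector and displacement are already available, takes the separating hyperplane as the set of points where ω·x + b equals zero, uses labels of +1 and -1, and omits the mapping function Φ. The function name is hypothetical.

```python
import numpy as np

def margin_and_slack(X, y, omega, b):
    """Distances to the hyperplane, slack variables, and margin width."""
    decision = X @ omega + b
    # Step (2): geometric distance from each point to the hyperplane.
    distances = np.abs(decision) / np.linalg.norm(omega)
    # Step (4): smallest slack satisfying y_i (omega . x_i + b) >= 1 - xi_i, xi_i >= 0.
    slack = np.maximum(0.0, 1.0 - y * decision)
    # Step (3): width of the margin that training maximizes.
    margin_width = 2.0 / np.linalg.norm(omega)
    return distances, slack, margin_width
```

A slack of zero means the sample lies on or outside the margin on the correct side; a slack between 0 and 1 means it lies inside the margin but is still correctly classified; a slack above 1 means it is misclassified.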
The reliability of each classification model is evaluated with unbiased leave-one-out cross validation, which guards against overfitting.
In the above steps, let the total number of samples be $N$. $N_t$ samples are selected as the training set, and the remaining $N_{ts} = N - N_t$ samples are used as the test set. Testing yields the accuracy, sensitivity and specificity of sample classification and the receiver operating characteristic (ROC) curve of each model, which are used to evaluate the performance of the Raman spectrum multivariate data analysis methods in identifying the Raman spectra of samples (in particular biological tissue samples).
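Leave-one-out cross validation can be sketched as follows: each sample is held out once, the model is fitted on the remaining samples, and the held-out sample is predicted. A nearest-centroid classifier is used here only as a simple stand-in for the four classification models; it and the function names are assumptions for illustration.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in classifier: assign x to the class with the nearest mean."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(centroids - x, axis=1))]

def leave_one_out_accuracy(X, y, predict=nearest_centroid_predict):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i   # every sample except the i-th
        correct += int(predict(X[keep], y[keep], X[i]) == y[i])
    return correct / len(X)
```

Any preprocessing fitted to the data, such as mean centering or PCA, would need to be refitted inside the loop on the retained samples only, otherwise the held-out sample influences its own prediction.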
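The accuracy, sensitivity and specificity on the test set follow from the counts of true and false positives and negatives, as sketched below for a two-class problem. The function name and the choice of which label counts as positive are assumptions.

```python
def classification_metrics(true_labels, predicted_labels, positive=1):
    """Accuracy, sensitivity and specificity for a two-class test set."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of positives detected
        "specificity": tn / (tn + fp),   # fraction of negatives rejected
    }
```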
The above-described embodiments merely illustrate the preferred embodiments of the present invention and do not limit its scope. Those skilled in the art can make various modifications and improvements to the technical solutions of the present invention without departing from its spirit, and all such modifications and improvements fall within the scope of protection defined by the claims.