CN106570325A

CN106570325A - Partial-least-squares-based abnormal detection method of mammary gland cell

Info

Publication number: CN106570325A
Application number: CN201610962997.8A
Authority: CN
Inventors: 陈善雄; 朱雨晨; 彭喜化; 周俊; 浦汛; 彭茂玲
Original assignee: Southwest University
Current assignee: Southwest University
Priority date: 2016-11-04
Filing date: 2016-11-04
Publication date: 2017-04-19

Abstract

The invention relates to a breast cell abnormality detection method based on partial least squares, comprising: (1) importing the data set used to build the model, setting the dependent and independent variables, standardizing the data, extracting principal components, and fitting a partial least squares linear model; (2) inspecting the T² ellipse plot to identify abnormal points, removing them from the data set to obtain a new data set, and refitting until no abnormal points remain, then obtaining the parameter set and the expression for y; (3) inputting the data set to be tested, computing the predicted value y with the equation, and judging from the chosen threshold whether the sample is benign or malignant. The invention establishes a regression model for breast cell abnormality detection by partial least squares regression and, by training that model, produces a method for distinguishing benign from malignant breast cells that is fast and accurate.

Description

A Method for Abnormal Detection of Breast Cells Based on Partial Least Squares

Technical field

The invention belongs to the technical field of human medicine, and in particular relates to a breast cell abnormality detection method based on partial least squares.

Background art

In computer-aided diagnosis, the machine learning and artificial intelligence algorithms that have been studied for the auxiliary diagnosis of breast cancer have certain defects, such as long training times, a tendency to fall into local convergence, and strong dependence on the samples, all of which affect diagnostic accuracy. Breast cancer is one of the malignant tumors with a high incidence in women. Since the 20th century its incidence has been rising throughout the world, but its cause is still not fully understood, so the examination of breast cells is particularly important. Breast cancer is a malignant tumor arising in the glandular epithelial tissue of the breast. The female breast consists of skin, fibrous tissue, mammary glands, and fat. The breast is not an organ essential to sustaining life, and breast cancer in situ is not fatal; however, breast cancer cells have lost the characteristics of normal cells, so the connections between them are loose and they detach easily. Once detached, free cancer cells can spread throughout the body with the blood or lymph and form metastases, which are life-threatening. Breast cancer has become a common tumor that threatens women's physical and mental health.

Early breast cancer often has no typical symptoms or signs and easily goes unnoticed. Most breast cancers present as painless lumps, and only a few are accompanied by dull or stabbing pain of varying degree. Many methods of breast cancer detection exist, such as mammography (molybdenum-target X-ray), ultrasound, and CT. However, many of these methods need further study, and some are not even suitable as the main method for detecting breast cancer. For patients, early detection is the key to reducing the onset of the disease. Early breast cancer mostly shows symptoms such as changes in breast shape or lumps and is often found through physical examination or breast cancer screening; 80% of breast cancer patients first present with a breast lump. The condition of breast cells can therefore be examined to judge whether a breast lump is present, and the examination of breast cells is an important means of discovering breast cancer cells and preventing their spread.

Summary of the invention

To overcome the above shortcomings, the invention provides a breast cell abnormality detection method based on partial least squares for the auxiliary diagnosis of breast cancer. A regression model for breast cell abnormality detection is established by partial least squares regression, and training that model yields a method that detects benign and malignant breast cells quickly and with high accuracy.

The technical solution of the invention is as follows:

The breast cell abnormality detection method based on partial least squares comprises: (1) dividing the data set into two equal parts, one used to build the model and the other to test it; setting the dependent and independent variables; standardizing the data; extracting the principal components of the independent and dependent variables respectively; and fitting the model; (2) with T² denoting the cumulative contribution rate of a sample point to the components, using software to draw and inspect the T² ellipse plot, identifying abnormal points, removing them from the data set to obtain a new data set and model, and refitting and inspecting the T² ellipse plot until no abnormal points remain; then obtaining the parameter set and the expression for the dependent variable y, which indicates whether a cell is cancerous; (3) inputting the other part of the data as the test set, substituting its values into the equation to obtain the predicted value y', choosing a threshold such that predicted values above it denote malignant cells and values below it denote benign cells, comparing y' with the true values, recording the correct predictions, and calculating the accuracy of the prediction model.

In the method, step (1) imports the data set used to build the model with the software SIMCA-P 13.0; the independent variables mainly comprise radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension; the dependent variable is whether the cell is cancerous.

In the method, in step (2), the sample is considered homogeneous when all sample points fall inside the ellipse; sample points falling outside the ellipse can be regarded as outliers whose values lie far from the average level of the sample points.

In the method, the threshold chosen in step (3) is 0.5: a predicted value greater than 0.5 denotes a malignant cell, and a predicted value less than 0.5 denotes a benign cell.

In the method, step (1) specifically comprises the following steps:

(1.1) Standardize the independent and dependent variables

The standardized data matrix of X is denoted E₀ = (E₀₁, E₀₂, ..., E_{0p})_{n×p}, and the standardized data matrix of Y is denoted F₀ = (F₀₁, F₀₂, ..., F_{0q})_{n×q};

(1.2) Extract the principal components and regress stepwise

Let t₁ be the first component of E₀, t₁ = E₀w₁, where w₁ is the first axis of E₀ and is a unit vector, i.e. ||w₁|| = 1; let u₁ be the first component of F₀, u₁ = F₀c₁, where c₁ is the first axis of F₀ and is a unit vector, i.e. ||c₁|| = 1. t₁ and u₁ should each carry as much of the variation in their own data table as possible, i.e.

Var(t₁) → max

Var(u₁) → max

According to canonical correlation analysis, the correlation between t₁ and u₁ should also reach its maximum, i.e.

r(t₁, u₁) → max

Together, these conditions require the covariance of t₁ and u₁ to reach its maximum, i.e.

max ⟨E₀w₁, F₀c₁⟩

That is, subject to ||w₁|| = 1 and ||c₁|| = 1, find the maximum of w₁ᵀE₀ᵀF₀c₁.

w₁ is an eigenvector of the matrix E₀ᵀF₀F₀ᵀE₀ with corresponding eigenvalue θ₁², where θ₁ is the value of the objective function. Maximizing it amounts to finding the eigenvector w₁ corresponding to the largest eigenvalue of E₀ᵀF₀F₀ᵀE₀, and then computing the component t₁ and the residual matrix E₁:

t₁ = E₀w₁

E₁ = E₀ - t₁p₁ᵀ

where p₁ = E₀ᵀt₁ / ||t₁||². In the same way, find the eigenvector w₂ corresponding to the largest eigenvalue of the matrix E₁ᵀF₀F₀ᵀE₁, and compute t₂ and the residual matrix E₂:

t₂ = E₁w₂

E₂ = E₁ - t₂p₂ᵀ

where p₂ = E₁ᵀt₂ / ||t₂||².

Continuing in this way, if the rank of X is A, we finally obtain:

E₀ = t₁p₁ᵀ + ... + t_A p_Aᵀ

F₀ = t₁r₁ᵀ + ... + t_A r_Aᵀ + F_A

(1.3) Fitting

Remove one sample point i from the samples, fit a regression equation with h components on the remaining samples, and then substitute the excluded sample i into that equation to obtain the fitted value ŷ_{hj(-i)}. The prediction error sum of squares of y_j, S_{PRESS,hj}, is then defined as

S_{PRESS,hj} = \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{hj(-i)} \right)^{2}

The error sum of squares of y_j, S_{SS,hj}, is defined as

S_{SS,hj} = \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{hji} \right)^{2}

where ŷ_{hji} is the fitted value of sample i from the h-component regression on all sample points.

In the method, step (2) specifically comprises:

The contribution rate T²_{hi} of the i-th sample point to the h-th component t_h is used to find the outliers in the sample, and is defined as:

T_{hi}^{2} = \frac{t_{hi}^{2}}{(n-1)\, s_{h}^{2}}

where s_h² is the variance of component t_h. The cumulative contribution rate of sample point i to the components t₁, t₂, ..., t_m is then:

T_{i}^{2} = \frac{1}{n-1} \sum_{h=1}^{m} \frac{t_{hi}^{2}}{s_{h}^{2}}

The T² ellipse plot is drawn in the SIMCA-P 13.0 software. Sample points falling outside the ellipse are outliers; they are removed and the model is refitted until no outliers remain in the sample.

Beneficial effects:

The invention establishes a regression model for breast cell abnormality detection by partial least squares regression and, by training that model, produces a good method for predicting whether breast cells are benign or malignant. Predictions made from 10 features of the breast cells reached an accuracy of 93.67%. The method can effectively analyze and predict whether breast cells are cancerous and plays an important role in the diagnosis and prevention of breast cancer.

The invention uses partial least squares to build a regression model of breast cells described by 10 feature variables and predicts whether the cells are cancerous with an accuracy of 93.67%. The experimental data show that the radius, texture, concave points, perimeter, and area of a cell are positively correlated with whether the cell is cancerous, while the fractal dimension is negatively correlated. The VIP_j values show that concave points, perimeter, radius, area, and concavity contribute most to the predicted value, while symmetry, smoothness, and fractal dimension contribute relatively little; when selecting regression variables, independent variables with small contributions can sometimes be discarded. The conclusions of the VIP_j analysis are, however, essentially qualitative: they only indicate that certain independent variables play a larger role. The VIP method also has limitations. Even when some independent variables have very large contributions, they are not necessarily the best choice of variables, and the correlations between variables sometimes have to be considered when choosing. The invention therefore uses detection indicators that suit the characterization of breast cells and achieves a relatively high accuracy.

Partial least squares can be applied in many fields, and the prediction models built with it are accurate and practical. Because breast cancer has a high incidence and responds well to early diagnosis, the observation of breast cells has become an important means of preventing it. Partial least squares regression (PLS regression) is an advanced multivariate analysis method that combines multiple linear regression, canonical correlation analysis, and principal component analysis. It is widely used in many fields and has given good results in multivariate modeling and prediction; the resulting prediction models are stable, accurate, and robust to noise, so it is well suited to problems with many influencing factors, such as breast cancer. The invention takes the medical test results of benign and malignant breast cells, builds a regression model by partial least squares, and predicts the test data using a suitably chosen decision threshold as the criterion for the detection result. The prediction model built on partial least squares is found to detect quickly and with high accuracy.

Brief description of the drawings

Fig. 1 is a flow chart of the breast cell abnormality detection method based on partial least squares according to the invention;

Fig. 2 is the T² ellipse plot of the first fit of the method;

Fig. 3 is the T² ellipse plot of the method after repeated fitting;

Fig. 4 is a bar chart of the coefficients of each variable in the method;

Fig. 5 is a table of the regression coefficients of the method;

Fig. 6 is a bar chart of the variable importance in projection (VIP) of the method;

Fig. 7 is a table of the prediction results of the method after data processing is complete.

Detailed description

As shown in Fig. 1, the breast cell abnormality detection method based on partial least squares according to the invention comprises the following steps:

(1) Import the data set used to build the model into the software SIMCA-P 13.0 and divide the data into two equal parts, one used to build the model and the other to test it. The indicators in the data set mainly comprise 10 feature values (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension) and whether the cell is cancerous. Set the dependent variable (whether the cell is cancerous) and the independent variables (the 10 feature values), select the partial least squares (PLS) model in the software, standardize the data, extract the principal components of the independent and dependent variables respectively (components can be added or removed as needed), and fit the model;

(2) T² is the cumulative contribution rate of a sample point to the components. Use the software to draw and inspect the T² ellipse plot. When all sample points fall inside the ellipse, the sample is considered homogeneous; sample points falling outside the ellipse can be regarded as outliers whose values lie far from the average level of the sample points. Identify the abnormal points, remove them from the data set with the corresponding function in the software toolbar to obtain a new data set and model, and refit and inspect the T² ellipse plot until no abnormal points remain. The model is then satisfactory, and the resulting parameter set can be viewed in the software to obtain the expression for the dependent variable y, which indicates whether a cell is cancerous;

(3) Input the other part of the data as the test set and substitute its values into the equation to obtain the predicted value y'. Set the threshold to 0.5, so that a predicted value greater than 0.5 denotes a malignant cell and a predicted value less than 0.5 denotes a benign cell. Compare y' with the true values, record how many results are predicted correctly, and calculate the accuracy of the prediction model.

The partial least squares linear model in step (1) above is established as follows:

The standardized data matrix of X is denoted E₀ = (E₀₁, E₀₂, ..., E_{0p})_{n×p}, and the standardized data matrix of Y is denoted F₀ = (F₀₁, F₀₂, ..., F_{0q})_{n×q}.
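The standardization step can be sketched in a few lines of Python. This is only an illustration, not the SIMCA-P implementation: it assumes the usual z-score x*_ij = (x_ij - mean_j) / s_j with the sample standard deviation (divisor n - 1), and the function name `standardize` and the toy matrix are hypothetical.

```python
from statistics import mean, stdev


def standardize(rows):
    """Column-wise z-score: x*_ij = (x_ij - mean_j) / s_j."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [stdev(c) for c in cols]  # assumption: sample std (divisor n - 1)
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data: 3 samples, 2 variables (hypothetical values).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
E0 = standardize(X)
```

After this step every column of E0 has mean 0 and unit sample variance, which is the form the component extraction that follows expects.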

Let t₁ be the first component of E₀, t₁ = E₀w₁, where w₁ is the first axis of E₀ and is a unit vector, i.e. ||w₁|| = 1; let u₁ be the first component of F₀, u₁ = F₀c₁, where c₁ is the first axis of F₀ and is a unit vector, i.e. ||c₁|| = 1. t₁ and u₁ should each carry as much of the variation in their own data table as possible, so according to principal component analysis:

Var(t₁) → max

Var(u₁) → max

On the other hand, regression modeling requires t₁ to have the greatest possible explanatory power for u₁. According to canonical correlation analysis, the correlation between t₁ and u₁ should reach its maximum, i.e.

r(t₁, u₁) → max

Taken together, the covariance of t₁ and u₁ is required to reach its maximum, which leads to the following optimization problem:

max ⟨E₀w₁, F₀c₁⟩

That is, subject to ||w₁|| = 1 and ||c₁|| = 1, find the maximum of w₁ᵀE₀ᵀF₀c₁.

According to the book "Linear and Nonlinear Methods of Partial Least Squares Regression" by Wang Huiwen et al., it follows that E₀ᵀF₀F₀ᵀE₀w₁ = θ₁²w₁. Thus w₁ is an eigenvector of the matrix E₀ᵀF₀F₀ᵀE₀ with corresponding eigenvalue θ₁², where θ₁ is the value of the objective function. Maximizing it amounts to finding the eigenvector w₁ corresponding to the largest eigenvalue of E₀ᵀF₀F₀ᵀE₀, and then computing the component t₁ and the residual matrix E₁:

t₁ = E₀w₁

E₁ = E₀ - t₁p₁ᵀ

where p₁ = E₀ᵀt₁ / ||t₁||². In the same way, the eigenvector w₂ corresponding to the largest eigenvalue of the matrix E₁ᵀF₀F₀ᵀE₁ can be found, followed by t₂ and the residual matrix E₂:

t₂ = E₁w₂

E₂ = E₁ - t₂p₂ᵀ

where p₂ = E₁ᵀt₂ / ||t₂||².
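One extraction step can be illustrated for the case used in this method, a single dependent variable. The sketch below is an assumption-laden illustration rather than the patented software procedure: with one response column the matrix E0'F0F0'E0 has rank one, so its leading eigenvector is simply E0'f0 scaled to unit length and no eigen-solver is needed. The function name `first_pls_component` and the toy data are hypothetical.

```python
import math


def first_pls_component(E0, f0):
    """One PLS extraction step for a single centered response f0.

    Returns the unit weight vector w1, the scores t1 = E0 w1, the loading
    p1 = E0' t1 / ||t1||^2 and the residual matrix E1 = E0 - t1 p1'.
    """
    n, p = len(E0), len(E0[0])
    # Leading eigenvector of E0' f0 f0' E0 is proportional to E0' f0.
    w = [sum(E0[i][j] * f0[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(E0[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t)
    load = [sum(E0[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
    E1 = [[E0[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
    return w, t, load, E1


# Toy centered data: 3 samples, 2 variables (hypothetical values).
E0 = [[-1.0, -0.8], [0.0, 1.1], [1.0, -0.3]]
f0 = [-1.0, 0.0, 1.0]
w1, t1, p1, E1 = first_pls_component(E0, f0)
```

Calling the function again on E1 gives w2 and t2, and so on. The deflation guarantees that every column of E1 is orthogonal to t1, so successive components do not overlap.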

(1.3) Fitting

Remove one sample point i from the samples, fit a regression equation with h components on the remaining samples, and then substitute the excluded sample i into that regression equation for y to obtain the fitted value ŷ_{hj(-i)}. The prediction error sum of squares of y_j, S_{PRESS,hj}, is then defined as

S_{PRESS,hj} = \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{hj(-i)} \right)^{2}

The error sum of squares of y_j, S_{SS,hj}, is defined as

S_{SS,hj} = \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{hji} \right)^{2}

where ŷ_{hji} is the fitted value of sample i from the h-component regression on all sample points; S_{PRESS,h} and S_{SS,h} are the sums of S_{PRESS,hj} and S_{SS,hj} over the dependent variables.

If the leave-one-out prediction error of the regression equation with h components is, to some extent, smaller than the fitting error of the regression equation with h-1 components, adding the component t_h is considered to improve prediction accuracy. The smaller the ratio S_{PRESS,h} / S_{SS,h-1}, the better. The cross-validity of component t_h is defined as:

Q_{h}^{2} = 1 - \frac{S_{PRESS,h}}{S_{SS,h-1}}

When Q_{h}^{2} \geq 1 - 0.95^{2} = 0.0975, the component t_h makes a significant contribution.
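The cross-validity check can be written down directly once the two sums of squares are available. The sketch below assumes the form Q²_h = 1 - S_PRESS,h / S_SS,h-1 and the limit 1 - 0.95² = 0.0975 that is customary in the PLS literature; the function names and the numeric inputs are hypothetical.

```python
def press(actual, loo_predicted):
    """Prediction error sum of squares from leave-one-out predictions."""
    return sum((y - yhat) ** 2 for y, yhat in zip(actual, loo_predicted))


def cross_validity(press_h, ss_prev):
    """Q2_h = 1 - S_PRESS,h / S_SS,h-1."""
    return 1.0 - press_h / ss_prev


def component_is_useful(press_h, ss_prev, limit=1 - 0.95 ** 2):
    """True when adding component t_h is judged worthwhile."""
    return cross_validity(press_h, ss_prev) >= limit


# Hypothetical numbers: the h-th component halves the error ...
keep = component_is_useful(press_h=20.0, ss_prev=40.0)   # Q2 = 0.5
# ... or barely changes it.
drop = component_is_useful(press_h=38.0, ss_prev=40.0)   # Q2 is about 0.05
```

Components are added one at a time while the check passes, which is one way to decide how many principal components to retain.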

Step (2) above is carried out as follows:

The contribution rate T²_{hi} of the i-th sample point to the h-th component t_h is used to find the outliers in the sample, and is defined as:

T_{hi}^{2} = \frac{t_{hi}^{2}}{(n-1)\, s_{h}^{2}}

where s_h² is the variance of component t_h. The cumulative contribution rate of sample point i to the components t₁, t₂, ..., t_m can then be measured:

T_{i}^{2} = \frac{1}{n-1} \sum_{h=1}^{m} \frac{t_{hi}^{2}}{s_{h}^{2}}

The T² ellipse plot can be drawn in SIMCA-P 13.0. Sample points that fall outside the ellipse are regarded as outliers; they can be removed and the model refitted, and the process is repeated until no outliers remain in the sample.
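The cumulative contribution rate can also be computed without the plotting software. The sketch below assumes the form T²_i = (1/(n-1)) Σ_h t_hi² / s_h², with s_h² the sample variance of component t_h. The cut-off `limit` is a hypothetical stand-in for the boundary of the ellipse, whose exact critical value is not given in this text.

```python
from statistics import variance


def hotelling_t2(scores):
    """scores[h][i] is the score of sample i on component t_h."""
    n = len(scores[0])
    variances = [variance(t) for t in scores]  # s_h^2, divisor n - 1
    return [
        sum(t[i] ** 2 / s2 for t, s2 in zip(scores, variances)) / (n - 1)
        for i in range(n)
    ]


def find_outliers(t2_values, limit):
    """Indices of samples whose T2 exceeds the chosen limit."""
    return [i for i, v in enumerate(t2_values) if v > limit]


# Toy centered scores for m = 2 components and n = 5 samples (hypothetical).
scores = [[-2.0, -1.0, 0.0, 1.0, 2.0], [1.0, -1.0, 0.0, -1.0, 1.0]]
t2 = hotelling_t2(scores)
outliers = find_outliers(t2, limit=0.5)
```

For centered scores the T² values of all samples add up to the number of components m, which is why they can be read as contribution rates. The refit loop removes the flagged samples, rebuilds the model, and repeats until `find_outliers` returns an empty list.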

The importance of x_j in explaining Y can be measured by the variable importance in projection index VIP_j; the larger its value, the more important the role of that independent variable in explaining Y. VIP_j is defined as follows:

VIP_{j} = \sqrt{ \frac{p}{Rd(Y; t_{1}, \ldots, t_{m})} \sum_{h=1}^{m} Rd(Y; t_{h})\, w_{hj}^{2} }

where p is the number of independent variables, Rd(Y; t_h) = r²(y, t_h), r(y, t_h) is the correlation coefficient between y and t_h, Rd(Y; t₁, ..., t_m) is the sum of Rd(Y; t_h) over the m components, and w_hj is the j-th entry of w_h.
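A small sketch of the VIP calculation follows. It assumes the standard definition VIP_j = sqrt(p · Σ_h Rd(Y; t_h) w_hj² / Σ_h Rd(Y; t_h)) with unit-length weight vectors; the function name `vip` and the input numbers are hypothetical.

```python
import math


def vip(weights, rd):
    """weights[h][j]: j-th entry of the unit vector w_h; rd[h]: Rd(Y; t_h)."""
    p = len(weights[0])
    total = sum(rd)  # Rd(Y; t_1, ..., t_m)
    return [
        math.sqrt(p * sum(r * w[j] ** 2 for w, r in zip(weights, rd)) / total)
        for j in range(p)
    ]


# Two components, two variables (hypothetical values).
weights = [[1.0, 0.0], [0.6, 0.8]]
rd = [0.6, 0.2]
scores_vip = vip(weights, rd)
```

Because each w_h has unit length, the squared VIP values sum to p, so a value above 1 marks a variable of above-average importance.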

Experimental verification:

Data set

The data set for this experiment is the Wisconsin Diagnostic Breast Cancer (WDBC) data set, created in November 1995 by Dr. William H. Wolberg (General Surgery Dept., University of Wisconsin), W. Nick Street (Computer Sciences Dept., University of Wisconsin), and Olvi L. Mangasarian (Computer Sciences Dept., University of Wisconsin), and donated by Nick Street [11]. It contains 569 cell biopsy cases, each with 32 attributes: the patient number, the cancer diagnosis, and 30 real-valued measurements. In the diagnosis attribute, "B" denotes benign and "M" denotes malignant. The 30 measurements are the mean, standard deviation, and maximum value of 10 features of the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

Data processing

In this experiment, benign ("B") is coded as 0 and malignant ("M") as 1 to form the dependent variable, and the ten feature attributes are the independent variables. Half of the data (284 samples) is used to build the model, and the remaining half (285 samples) is used for validation. The data are also divided into two groups, the benign group (number 1) and the malignant group (number 2). The processed data are imported into SIMCA-P 13.0, the group number is set as the Class ID (class 1 for the benign group, class 2 for the malignant group), and the benign/malignant value is set as the dependent variable. Once the settings are made, clicking Finish completes the import.
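The label coding and the half split can be sketched as follows. The text does not say how samples were assigned to the two halves, so cutting the list in the middle is an assumption made only for illustration; the function name `encode_and_split` and the synthetic label list are hypothetical.

```python
def encode_and_split(diagnoses):
    """Map 'B' -> 0 and 'M' -> 1, then cut the list into two halves."""
    y = [1 if d == "M" else 0 for d in diagnoses]
    half = len(y) // 2  # 569 // 2 == 284
    return y[:half], y[half:]


# Synthetic labels with the class counts reported in this document.
labels = ["B"] * 357 + ["M"] * 212
train_y, test_y = encode_and_split(labels)
print(len(train_y), len(test_y))  # 284 285
```

In practice the rows would be shuffled before splitting so that both halves contain benign and malignant cases.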

Principal component analysis of the data yields three principal components. R2X is the cumulative explanatory power for X of the components extracted from the X variables, R2Y is the cumulative explanatory power for Y of the components extracted from the Y variable, and Q2 is the cross-validity; after the components are extracted, their values are 0.877, 0.616, and 0.603 respectively. The T² ellipse plot drawn with these three components shows that the model separates the benign and malignant groups well (Fig. 2). The sample also contains many abnormal points, which must be removed before the model is fitted again. After several rounds of abnormal-point removal, the T² ellipse plot shown in Fig. 3 is obtained.

At this point, R2X = 0.744, R2Y = 0.757, and Q2 = 0.75. Clicking the Coefficient Plot button displays the regression coefficients (Figs. 4 and 5).

The resulting standardized regression equation is:

y = 0.941699 + radius*0.154609 + texture*0.16134 + perimeter*0.154876 + area*0.147183
    + smoothness*0.0290967 + compactness*0.0961712 + concavity*0.121727
    + concave points*0.15144 + symmetry*0.0355873 - fractal dimension*0.0424221
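The regression equation and the 0.5 threshold can be applied as in the sketch below. The coefficients are copied from the equation above; how the ten inputs must be scaled before use is not stated in this text, so the sample values are purely hypothetical. A prediction of exactly 0.5, which the text does not cover, falls on the benign side here.

```python
INTERCEPT = 0.941699
COEFFICIENTS = {
    "radius": 0.154609,
    "texture": 0.16134,
    "perimeter": 0.154876,
    "area": 0.147183,
    "smoothness": 0.0290967,
    "compactness": 0.0961712,
    "concavity": 0.121727,
    "concave_points": 0.15144,
    "symmetry": 0.0355873,
    "fractal_dimension": -0.0424221,
}
THRESHOLD = 0.5


def predict(features):
    """Evaluate the regression equation for one cell."""
    return INTERCEPT + sum(COEFFICIENTS[k] * features[k] for k in COEFFICIENTS)


def classify(features):
    """Above 0.5 is malignant; anything else is treated as benign."""
    return "malignant" if predict(features) > THRESHOLD else "benign"


# Hypothetical input: every feature 0 except a strongly negative radius.
sample = dict.fromkeys(COEFFICIENTS, 0.0)
sample["radius"] = -3.0
y_pred = predict(sample)   # 0.941699 - 3 * 0.154609
label = classify(sample)
```

Keeping the coefficients in a dictionary keyed by feature name makes it harder to pair a value with the wrong coefficient than a positional list would.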

The VIP plot (Fig. 6) shows that concave points, perimeter, radius, area, and concavity play an important role in explaining whether a cell is cancerous.

After data processing is complete, the prediction results can be viewed (Fig. 7). For ease of viewing, the data were organized in Excel (Table 1).

Here 0.5 is used as the threshold: a predicted value greater than 0.5 denotes a malignant cell, and a predicted value less than 0.5 denotes a benign cell. Of the 357 benign cells, 339 were predicted to be benign; of the 212 malignant cells, 194 were predicted to be malignant. The prediction accuracy therefore reaches 93.67%, so the method predicts well whether a cell is cancerous.
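The reported accuracy follows directly from these counts, as the short check below shows. The per-class rates are derived here from the same four numbers and are not stated in the original text.

```python
benign_total, benign_correct = 357, 339
malignant_total, malignant_correct = 212, 194

accuracy = (benign_correct + malignant_correct) / (benign_total + malignant_total)
benign_rate = benign_correct / benign_total            # share of benign cells recognized
malignant_rate = malignant_correct / malignant_total   # share of malignant cells recognized

print(f"{accuracy:.2%}")  # 93.67%
```

That is, 533 of the 569 cells are classified correctly.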

The invention detects quickly and with high accuracy and is intended for the auxiliary diagnosis of breast cancer. A regression model for breast cell abnormality detection is established by partial least squares regression, and training that model yields a good method for distinguishing benign from malignant breast cells that predicts well whether a cell is cancerous.

Claims

1. A breast cell abnormality detection method based on the partial least squares method, characterized in that it specifically comprises:

(1) Dividing the data set used to build the model into two parts, one used to establish the model and the other used to test it; setting the corresponding dependent and independent variables; standardizing the data; extracting the principal components of the independent variables and of the dependent variable respectively; and fitting and building the model;

(2) T² being the cumulative contribution rate of a sample point to the components, using software to draw and observe the T² ellipse, identifying abnormal points and removing them from the data set to obtain a new data set and model, then fitting again and observing the T² ellipse, and repeating until no abnormal points remain; obtaining the parameter set and the expression for the dependent variable y (whether the cell is cancerous);

(3) Inputting the other part of the data set as the data to be tested, substituting its values into the obtained equation to calculate the predicted value y', and determining a threshold such that a predicted value greater than the threshold indicates a malignant cell and a predicted value smaller than the threshold indicates a benign cell; comparing y' with the original value, recording the correct predictions, and calculating the accuracy of the prediction model.

2. The breast cell abnormality detection method based on the partial least squares method as claimed in claim 1, characterized in that: in step (1), the data set used to build the model is imported with the software SIMCA-P 13.0; the independent variables mainly include radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension; and the dependent variable is whether the cell is cancerous.

3. The breast cell abnormality detection method based on the partial least squares method as claimed in claim 1, characterized in that: in step (2), when all sample points fall within the ellipse, the sample is considered to be evenly distributed; if any sample points fall outside the ellipse, those points can be regarded as singular points, whose values lie far from the average level of the sample points.

4. The breast cell abnormality detection method based on the partial least squares method as claimed in claim 1, characterized in that: the threshold determined in step (3) is 0.5, a predicted value greater than 0.5 indicating a malignant cell and a predicted value less than 0.5 indicating a benign cell.

5. The breast cell abnormality detection method based on the partial least squares method as claimed in claim 1, characterized in that step (1) specifically comprises the following steps:

(1.1) Standardize the independent variable and dependent variable

$$x_{ij}^{*} = \frac{x_{ij} - \bar{x}_{j}}{s_{j}}, \quad (i = 1, 2, \ldots, m;\; j = 1, 2, \ldots, n)$$

$$y_{ij}^{*} = \frac{y_{ij} - \bar{y}_{j}}{s_{y}}, \quad (i = 1, 2, \ldots, m;\; j = 1, 2, \ldots, n)$$

The standardized data matrix of X is denoted $E_0 = (E_{01}, E_{02}, \ldots, E_{0p})_{n \times p}$, and the standardized data matrix of Y is denoted $F_0 = (F_{01}, F_{02}, \ldots, F_{0q})_{n \times q}$;
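The standardization step can be sketched with NumPy as follows. This is an illustration only: the function name `standardize` and the toy matrix are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not state which divisor is used.

```python
import numpy as np


def standardize(A):
    """Center each column of A and scale it by its standard deviation."""
    A = np.asarray(A, dtype=float)
    mean = A.mean(axis=0)
    std = A.std(axis=0, ddof=1)  # assumption: sample standard deviation (n - 1)
    return (A - mean) / std, mean, std


# Hypothetical toy data: 4 samples, 2 independent variables.
X = np.array([[1.0, 10.0], [2.0, 14.0], [3.0, 18.0], [4.0, 30.0]])
E0, x_mean, x_std = standardize(X)
print(E0)
```

Returning the column means and standard deviations alongside the standardized matrix makes it possible to apply the same transformation to the test data later.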

(1.2) Extracting principal components and stepwise regression

Let $t_1$ be the first component of $E_0$, $t_1 = E_0 w_1$, where $w_1$ is the first axis of $E_0$ and is a unit vector, i.e. $\|w_1\| = 1$; let $u_1$ be the first component of $F_0$, $u_1 = F_0 c_1$, where $c_1$ is the first axis of $F_0$ and is a unit vector, i.e. $\|c_1\| = 1$. The components $t_1$ and $u_1$ should each carry as much of the variation in their own data table as possible, that is,

$$\mathrm{Var}(t_1) \to \max$$

$$\mathrm{Var}(u_1) \to \max$$

According to canonical correlation analysis, the correlation between $t_1$ and $u_1$ should also reach its maximum value, namely:

$$r(t_1, u_1) \to \max$$

This amounts to requiring that the covariance of $t_1$ and $u_1$ reach its maximum value, that is:

$$\max \langle E_0 w_1, F_0 c_1 \rangle \quad \text{s.t.} \quad \begin{cases} w_1^{T} w_1 = 1 \\ c_1^{T} c_1 = 1 \end{cases}$$

In other words, the maximum is sought under the constraints $\|w_1\| = 1$ and $\|c_1\| = 1$.

$w_1$ is an eigenvector of the matrix $E_0^{T} F_0 F_0^{T} E_0$ with corresponding eigenvalue $\theta_1^2$, where $\theta_1$ is the value of the objective function. Maximizing the objective therefore means finding the eigenvector $w_1$ corresponding to the largest eigenvalue of $E_0^{T} F_0 F_0^{T} E_0$, and then computing the component $t_1$ and the residual matrix $E_1$:

$$t_1 = E_0 w_1$$

$$E_1 = E_0 - t_1 p_1^{T}$$

where $p_1 = \frac{E_0^{T} t_1}{\|t_1\|^2}$. In the same way, find the eigenvector $w_2$ corresponding to the largest eigenvalue of the matrix $E_1^{T} F_0 F_0^{T} E_1$, and then the component $t_2$ and the residual matrix $E_2$:

$$t_2 = E_1 w_2$$

$$E_2 = E_1 - t_2 p_2^{T}$$

where $p_2 = \frac{E_1^{T} t_2}{\|t_2\|^2}$.

Continuing in this way, if the rank of X is A, we finally obtain:

$$y_k^{*} = a_{k1} x_1^{*} + a_{k2} x_2^{*} + \ldots + a_{kp} x_p^{*} + F_{Ak}, \quad (k = 1, 2, \ldots, q);$$
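The component extraction in step (1.2) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the SIMCA-P implementation: the function name `pls_fit` and the synthetic data are hypothetical, each axis is taken as the eigenvector for the largest eigenvalue of the matrix built from the current residual matrix and $F_0$ (the standard partial least squares choice), and the regression coefficients on the standardized variables are recovered with the standard identity $T = E_0 W (P^T W)^{-1}$.

```python
import numpy as np


def pls_fit(E0, F0, n_components):
    """Sketch of PLS on standardized data E0 (n x p) and F0 (n x q).

    Returns the coefficient matrix B (p x q) with F0 ~ E0 @ B,
    the component matrix T (n x h) and the axis matrix W (p x h).
    """
    E = np.array(E0, dtype=float)
    F0 = np.array(F0, dtype=float).reshape(len(E), -1)
    W, T, P = [], [], []
    for _ in range(n_components):
        M = E.T @ F0 @ F0.T @ E              # symmetric p x p matrix
        _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
        w = vecs[:, -1]                      # unit eigenvector, largest eigenvalue
        t = E @ w                            # component t_h
        p = E.T @ t / (t @ t)                # loading p_h
        E = E - np.outer(t, p)               # residual matrix E_h
        W.append(w); T.append(t); P.append(p)
    W, T, P = (np.column_stack(m) for m in (W, T, P))
    # Regress F0 on the mutually orthogonal components, one at a time.
    C = (T.T @ F0) / np.sum(T * T, axis=0)[:, None]
    # Express the fit in terms of the standardized x variables.
    B = W @ np.linalg.solve(P.T @ W, C)
    return B, T, W


# Hypothetical synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)
E0 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
F0 = (y - y.mean()) / y.std(ddof=1)
B, T, W = pls_fit(E0, F0, n_components=3)
print(B.ravel())
```

When as many components are extracted as the rank of the data matrix, the fit coincides with ordinary least squares on the standardized data; fewer components give the regularized fit that partial least squares is normally used for.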

(1.3) Fitting

Remove a sample point $i$ from the sample, use the remaining sample points to extract $h$ components and fit a regression equation, and then substitute the excluded sample point $i$ into that regression equation to obtain the fitted value $\hat{y}_{hj(-i)}$. The sum of squared prediction errors of $y_j$ is then defined as $S_{PRESS,hj}$, that is

$$S_{PRESS,hj} = \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{hj(-i)} \right)^2$$

The error sum of squares of $y_j$, computed from the fitted values $\hat{y}_{hji}$ obtained with all sample points, is defined as $S_{SS,hj}$, namely

$$S_{SS,hj} = \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{hji} \right)^2.$$
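The two sums of squares can be computed with a leave-one-out loop. The sketch below makes several assumptions for brevity: there is a single dependent variable, the data are already standardized and are not re-standardized inside each fold, and the helper names `pls1_coef` and `press_and_ss` and the synthetic data are hypothetical. For a single dependent variable, the axis $w_h$ is proportional to the product of the transposed residual matrix and the standardized response, which is what the helper uses.

```python
import numpy as np


def pls1_coef(E0, f0, h):
    """Coefficient vector of a single-response PLS fit with h components."""
    E, W, P, T = E0.copy(), [], [], []
    for _ in range(h):
        w = E.T @ f0
        w = w / np.linalg.norm(w)            # unit axis w_h
        t = E @ w                            # component t_h
        p = E.T @ t / (t @ t)                # loading p_h
        E = E - np.outer(t, p)               # residual matrix E_h
        W.append(w); P.append(p); T.append(t)
    W, P, T = map(np.column_stack, (W, P, T))
    c = T.T @ f0 / np.sum(T * T, axis=0)     # regress f0 on each component
    return W @ np.linalg.solve(P.T @ W, c)


def press_and_ss(E0, f0, h):
    """Return (PRESS, SS) for a fit with h components."""
    n = len(f0)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i             # leave sample i out
        b = pls1_coef(E0[keep], f0[keep], h)
        press += (f0[i] - E0[i] @ b) ** 2    # error on the excluded sample
    b_all = pls1_coef(E0, f0, h)             # fit on all samples
    ss = float(np.sum((f0 - E0 @ b_all) ** 2))
    return float(press), ss


# Hypothetical synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=20)
E0 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
f0 = (y - y.mean()) / y.std(ddof=1)
press, ss = press_and_ss(E0, f0, h=2)
print(press, ss)
```

Comparing the prediction error sum of squares for $h$ components with the error sum of squares for $h - 1$ components is the usual way to decide whether adding a further component is worthwhile.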

6. The breast cell abnormality detection method based on the partial least squares method as claimed in claim 1, characterized in that step (2) specifically comprises:

To find the singular points in the sample, define the contribution rate $T_{hi}^2$ of the $i$-th sample point to the $h$-th component $t_h$ as:

$$T_{hi}^2 = \frac{t_{hi}^2}{(n - 1)\, s_h^2},$$

where $s_h^2$ is the variance of component $t_h$. The cumulative contribution rate of sample point $i$ to the components $t_1, t_2, \ldots, t_m$ is then measured by:

$$T_i^2 = \frac{1}{n - 1} \sum_{h=1}^{m} \frac{t_{hi}^2}{s_h^2};$$

The T² ellipse is drawn in the SIMCA-P 13.0 software; the sample points falling outside the ellipse are the singular points, which are removed before re-fitting, and this is repeated until no singular points remain in the sample.
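The cumulative contribution rate and the remove-and-refit loop can be sketched as follows. This is an illustration under assumptions: the claim relies on the ellipse drawn by SIMCA-P and does not state a numeric cutoff here, so `cutoff` is a hypothetical caller-supplied value, and `fit_scores` is a hypothetical callable standing in for re-fitting the model and returning its component matrix.

```python
import numpy as np


def t2_contributions(T):
    """Cumulative contribution T_i^2 of each sample to the components in T (n x m)."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    s2 = T.var(axis=0, ddof=1)               # variance s_h^2 of each component
    return (T ** 2 / s2).sum(axis=1) / (n - 1)


def flag_singular_points(T, cutoff):
    """Boolean mask of samples whose T_i^2 exceeds the (hypothetical) cutoff."""
    return t2_contributions(T) > cutoff


def remove_until_clean(X, fit_scores, cutoff):
    """Drop flagged samples and re-fit until none are flagged; return kept indices."""
    keep = np.arange(len(X))
    while True:
        mask = flag_singular_points(fit_scores(X[keep]), cutoff)
        if not mask.any():
            return keep
        keep = keep[~mask]


# Hypothetical component scores with one deliberately extreme sample.
rng = np.random.default_rng(0)
scores = rng.normal(size=(20, 2))
scores[0] = [10.0, -10.0]
scores -= scores.mean(axis=0)
t2 = t2_contributions(scores)

# Stand-in for model re-fitting: simply re-center the remaining scores.
kept = remove_until_clean(scores, lambda A: A - A.mean(axis=0), cutoff=1.0)
print(int(np.argmax(t2)), len(kept))
```

For centered components the values of $T_i^2$ sum to the number of components $m$, which gives a quick sanity check on the computation.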