CN109948732B

CN109948732B - A method and system for classifying distant metastasis of abnormal cells based on unbalanced learning

Info

Publication number: CN109948732B
Application number: CN201910251365.4A
Authority: CN
Inventors: 彭立志; 李雪梅; 杨波; 李宝生; 朱健
Original assignee: Shandong Cancer Hospital & Institute (Shandong Cancer Hospital); University of Jinan
Current assignee: Shandong Cancer Hospital & Institute (Shandong Cancer Hospital); University of Jinan
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-12-22
Anticipated expiration: 2039-03-29
Also published as: CN109948732A

Abstract

The disclosure provides a method and a system for classifying distant metastasis of abnormal cells based on unbalanced learning. A number of data sequences in which a certain type of cell has metastasized to distant sites and a number of data sequences in which it has not are obtained, and the data set is divided into a training set, used to train a model, and a test set, used to test the model. First, the training set is input into feature selection algorithms, the results are compared with the classification results on the original data set, and the p features giving the best result are selected. Oversampling algorithms are then used to obtain training sets with a positive-to-negative sample ratio of 1:1; each is input into a classification algorithm and tested with the data sequences of the test set, and the oversampling algorithm i that produced the training set Pi with the best evaluation result is selected. Finally, the ratio of positive to negative samples is adjusted by inputting the training set into the oversampling algorithm that produced the training set Pi, the ratio is gradually increased to a set ratio, and the ratio with the best classification evaluation is selected. By using an oversampling algorithm to increase the proportion of positive samples, the technical solution obtains better model evaluation metrics and a better recall rate for the minority positive samples.

Description

A method and system for classifying distant metastasis of abnormal cells based on unbalanced learning

Technical field

The present disclosure relates to the technical field of machine learning and data mining, and in particular to a method and system for classifying distant metastasis of abnormal cells based on unbalanced learning.

Background

Esophageal squamous cell carcinoma is one of the most common malignant tumors worldwide, but its early symptoms are not obvious and the bodily changes are easily overlooked. Patients often go to hospital for examination only once the symptoms become unbearable, by which time the disease has usually reached a middle or late stage. In clinical practice, doctors determine whether the cancer cells of a patient with esophageal squamous cell carcinoma have metastasized to distant sites through imaging, or even puncture and surgery. These three approaches increase the patient's treatment cost and are time-consuming. With the arrival of the big-data era, it is proposed to address this problem by using blood cell analysis to predict whether a patient's cancer cells have metastasized. A review of the relevant literature shows that classification and prediction of lymph node metastasis has been published in the medical field, with specificity and sensitivity below 50%, whereas no related research has been done on distant metastasis. Clinical researchers customarily use statistical analysis software (SPSS, SAS) for P-test statistical analysis, whereas the present disclosure uses machine learning for analysis and prediction.

Because relatively little data on patients with esophageal squamous cell carcinoma has been collected, and patients whose cancer cells have metastasized to distant sites make up only a small proportion of it, the data set suffers from class imbalance.

The inventors found in their research that, on such an imbalanced data set, a standard classifier tends to maximize accuracy while ignoring the minority-class samples, yet the minority class is precisely the focus of attention. Even if a high accuracy is obtained, the analysis result is meaningless, and it is difficult to predict effectively whether a patient's cancer cells have metastasized. In practice, especially in medicine, class imbalance is very common, mainly because of disease incidence rates. In this situation, the performance of a standard classifier is severely affected if the imbalanced data is not processed.

Summary of the invention

The purpose of the embodiments of this specification is to provide a method for classifying distant metastasis of abnormal cells based on unbalanced learning, which uses an oversampling algorithm to increase the proportion of positive-class samples in order to obtain better model evaluation metrics and a better recall rate for the minority positive-class samples.

Embodiments of this specification provide a method for classifying distant metastasis of abnormal cells based on unbalanced learning, including:

obtaining a number of data sequences in which a certain type of cell has metastasized to distant sites and a number of data sequences in which it has not, and forming a training set from them;

inputting the training set into each of k feature selection algorithms, selecting the p top-ranked attributes from each as the features of the training set, inputting them into a classifier for training, comparing the classification results, and selecting the p features that give the best result;

balancing the training set at the data level using oversampling algorithms, by inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;

inputting each training set with a 1:1 positive-to-negative sample ratio into the classification algorithm, testing with the data sequences of the test set, and selecting the oversampling algorithm i that produced the training set Pi with the best evaluation result;

adjusting the ratio of positive to negative samples by inputting the training set into the oversampling algorithm that produced the training set Pi, gradually increasing the positive-to-negative sample ratio up to a set ratio, and selecting the ratio with the best classification evaluation.

Embodiments of this specification provide a system for classifying distant metastasis of cells based on unbalanced learning, including:

a training set acquisition unit, configured to: obtain a number of data sequences in which a certain type of cell has metastasized to distant sites and a number of data sequences in which it has not, and form a training set from them;

a feature selection unit, configured to: input the training set into each of k feature selection algorithms, select the p top-ranked attributes from each as the features of the training set, input them into a classifier for training, compare the classification results, and select the p features that give the best result;

an oversampling unit, configured to: balance the training set at the data level using oversampling algorithms, by inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;

an optimal oversampling algorithm obtaining unit, configured to: input each training set with a 1:1 positive-to-negative sample ratio into the classification algorithm, test with the data sequences of the test set, and select the oversampling algorithm i that produced the training set Pi with the best evaluation result;

an optimal positive-to-negative sample ratio obtaining unit, configured to: adjust the ratio of positive to negative samples by inputting the training set into the oversampling algorithm that produced the training set Pi, gradually increase the positive-to-negative sample ratio up to a set ratio, and select the ratio with the best classification evaluation.

Compared with the prior art, the beneficial effects of the present disclosure are:

1. The data sequences in the technical solution of the present disclosure can be blood cell analysis data from routine hospital examinations. Such data is technically easy to obtain, which facilitates the subsequent feature selection and oversampling.

2. The technical solution of the present disclosure uses an oversampling algorithm to increase the proportion of positive-class samples, obtaining better model evaluation metrics and a better recall rate for the minority positive-class samples.

Description of drawings

The accompanying drawings, which form a part of the present disclosure, are provided for further understanding of the present disclosure. The exemplary embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not unduly limit it.

FIG. 1 is a flowchart of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of the feature selection strategy of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of the oversampling algorithm selection strategy of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the strategy for adjusting the positive-to-negative sample ratio in the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to an embodiment of the present disclosure.

Detailed description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It should be noted that the terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the exemplary embodiments according to the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular form is intended to include the plural form as well. It should further be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

At present there are two main approaches to classifying imbalanced data. The first is to balance the data: at the data level, the training samples are reconstructed with a suitable method, using oversampling or undersampling algorithms to achieve data balance. The second is to improve existing algorithms or propose new ones: at the algorithm level, existing classification algorithms are improved, or new classification algorithms are proposed, so that the minority-class samples receive more attention and their classification accuracy improves. The technical solution of the embodiments of the present application takes the first approach and balances the data samples at the data level.

Example 1

This example discloses a method for classifying distant metastasis of abnormal cells based on unbalanced learning. The present disclosure first screens the available data set; the classification of distant metastasis in esophageal squamous cell carcinoma is used here as an example. Based on the existing diagnostic information of patients with esophageal squamous cell carcinoma, patients whose clinical M stage was determined for the first time are selected from the diagnosis table. A clinical M stage of 0 indicates that the patient's cancer cells have not metastasized to other organs, and the label is set to 0; a non-zero clinical M stage indicates that the cancer cells have metastasized to other organs, and the label is set to 1. Then, according to the operation time recorded for these patients in the surgery table, the blood cell analysis data from the last test before surgical treatment is selected; if no surgical treatment was performed, the blood cell analysis data from the day of diagnostic staging, or from the last test before the time of diagnosis, is selected. The data on which this application is based are all existing patient data; the method is therefore independent of diagnosis and treatment and only predicts the metastasis of the relevant cells from the relevant data.
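The labelling and record-selection rules described above can be sketched in a few lines of Python. This is only an illustrative sketch: the field names (`date`, `hemoglobin`), the stage string format, and the sample values are hypothetical and do not come from the patent, and treating "before surgical treatment" as strictly earlier than the surgery date is an assumption.

```python
from datetime import date

def metastasis_label(clinical_m_stage):
    """Return 0 for clinical M stage 0 (no distant metastasis), else 1."""
    # Assumed input format: "M0", "M1", "0", "1", ... (hypothetical)
    stage = str(clinical_m_stage).strip().upper().lstrip("M")
    return 0 if stage == "0" else 1

def pick_blood_test(tests, surgery_date=None, diagnosis_date=None):
    """Pick the last blood test before surgery, or, when there was no
    surgery, the last test on or before the diagnosis date."""
    if surgery_date is not None:
        # Assumption: "before surgery" means strictly earlier than the surgery date
        eligible = [t for t in tests if t["date"] < surgery_date]
    else:
        eligible = [t for t in tests if t["date"] <= diagnosis_date]
    return max(eligible, key=lambda t: t["date"]) if eligible else None

# Hypothetical records, for illustration only
tests = [
    {"date": date(2018, 3, 1), "hemoglobin": 131},
    {"date": date(2018, 3, 9), "hemoglobin": 127},
    {"date": date(2018, 3, 20), "hemoglobin": 118},
]
chosen = pick_blood_test(tests, surgery_date=date(2018, 3, 12))
labels = [metastasis_label(s) for s in ["M0", "M1", "0"]]
```

Here `chosen` is the test dated 9 March, the last one before the hypothetical surgery date.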

In one example, the screened available samples are divided into a 75% training set and a 25% test set. The training set is input into several feature selection methods, and the 8 top-ranked attributes are selected as the features of the data set and input into the classifier. The output model evaluation metric AUC and the recall rate are compared with the results obtained in the original situation, and the features giving the best result are selected.
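A 75%/25% split of this kind might look like the following sketch. Splitting each class separately (a stratified split) is an assumption made here so that the few positive samples appear in both parts; the patent only states the 75/25 proportion. The function name and seed are illustrative.

```python
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy label vector: 36 negatives and 4 positives (illustrative only)
labels = [0] * 36 + [1] * 4
train_idx, test_idx = stratified_split(labels)
```

With these toy labels the test part receives 9 negatives and 1 positive, and the remaining 30 samples form the training part.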

In this example, training a classifier on the two types of data allows it to learn the characteristics that distinguish them, so that when new data is input, the classifier can automatically identify which class it belongs to.

The output model evaluation metric AUC and the recall rate are explained as follows:

AUC (Area Under Curve) is defined as the area enclosed between the ROC curve and the coordinate axes; this area clearly cannot exceed 1. Because the ROC curve generally lies above the line y = x, the AUC generally takes values between 0.5 and 1. The AUC is used as an evaluation criterion because in many cases the ROC curve alone does not clearly show which classifier performs better, whereas, as a single number, a larger AUC indicates a better classifier.

Recall, also called the true positive rate (TPR), measures the completeness of the classification results on imbalanced data: it is the proportion of samples that actually belong to the minority class that are correctly classified as the minority class.
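As a concrete illustration of the two metrics, the sketch below computes recall from predicted labels and AUC from predicted scores. The AUC here uses the standard pairwise (rank) formulation, which equals the area under the ROC curve; the function names and the toy inputs are illustrative and not taken from the patent.

```python
def recall(y_true, y_pred):
    """TP / (TP + FN), with label 1 as the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    return tp / (tp + fn) if (tp + fn) else 0.0

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (illustrative values only)
y_true = [0, 0, 1, 1]
toy_recall = recall(y_true, [0, 1, 0, 1])        # one of two positives found
toy_auc = auc(y_true, [0.1, 0.4, 0.35, 0.8])     # 3 of 4 pairs ranked correctly
```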

Next, using the selected features, the training set is input into different oversampling algorithms so that the ratio of positive to negative samples is 1:1, and the result is input into the classifier. The output model evaluation metrics and recall rates are compared, and the oversampling algorithm giving the best result is selected.

Then, using the selected oversampling algorithm, the proportion of positive-class samples is increased step by step, from 1.1:1 and 1.2:1 up to 2:1, followed by two much larger ratios, 5:1 and 10:1. The results are compared and a suitable positive-to-negative sample ratio is selected.

In a specific implementation, referring to FIG. 1, the method for classifying distant metastasis of abnormal cells based on unbalanced learning includes:

Step (1): Screen the data set and then clean the dirty data. Samples containing missing data are deleted directly. In addition, the red blood cell distribution width (CV) attribute has severe missing values, so this attribute column is deleted before feature selection, leaving a complete data set.
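The cleaning in step (1) could be sketched as follows, assuming each sample is a dictionary and a missing value is represented by `None`. The column names and values are hypothetical. The heavily missing column is dropped first, so that samples are not discarded merely because that column is empty.

```python
def clean_rows(rows, drop_columns):
    """Drop the listed columns, then drop rows that still contain missing values."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in drop_columns}
        if all(v is not None for v in kept.values()):
            cleaned.append(kept)
    return cleaned

# Hypothetical rows; "rdw_cv" stands in for red blood cell distribution width (CV)
rows = [
    {"wbc": 6.1, "hemoglobin": 131, "rdw_cv": None, "label": 0},
    {"wbc": None, "hemoglobin": 120, "rdw_cv": 13.2, "label": 1},
    {"wbc": 7.4, "hemoglobin": 118, "rdw_cv": None, "label": 1},
]
complete = clean_rows(rows, drop_columns={"rdw_cv"})
```

Only the second row is removed here, because its white blood cell count is missing.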

Specifically, the data set includes a training set and a test set. The data in the training set includes a number of blood cell analysis data sequences from patients whose cancer cells have metastasized to distant sites and a number of blood cell analysis data sequences from patients whose cancer cells have not.

The test set stores the blood cell analysis data sequences to be tested.

The data sequence, that is, the blood cell analysis data sequence, includes: white blood cell count, lymphocyte absolute value, lymphocyte percentage, neutrophil absolute value, neutrophil percentage, monocyte absolute value, monocyte percentage, eosinophil absolute value, eosinophil percentage, basophil absolute value, basophil percentage, red blood cell count, hemoglobin, mean red blood cell volume, mean red blood cell hemoglobin content, mean red blood cell hemoglobin concentration, red blood cell distribution width (CV), platelet count, platelet distribution width, platelet distribution volume, and mean platelet volume.

Step (2): Referring to FIG. 2, feature selection is performed on the blood cell analysis data. The data set is input into each of k feature selection algorithms, the p top-ranked attributes from each are selected as the features of data set A and input into the classifier for training, and the model evaluation metrics (G-Mean, AUC) and the recall rate (Recall) are output.
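The "top p attributes from each of k feature selection algorithms" step can be illustrated with the sketch below. The patent does not name the feature selection algorithms, so each one is represented here simply by a dictionary of per-feature scores; the selector names and score values are placeholders for illustration, not results.

```python
def top_p_features(scores, p):
    """Return the p feature names with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:p]]

# Placeholder scores from two hypothetical feature selection algorithms
selector_scores = {
    "selector_a": {"platelet_distribution_width": 0.9,
                   "hemoglobin": 0.4,
                   "platelet_count": 0.1},
    "selector_b": {"platelet_distribution_width": 0.2,
                   "hemoglobin": 0.7,
                   "platelet_count": 0.5},
}

# One candidate feature subset per algorithm; each subset would then be
# used to train the classifier and compared on the evaluation metrics.
candidates = {name: top_p_features(scores, p=2)
              for name, scores in selector_scores.items()}
```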

In this example, the model here is the aforementioned classifier.

Because the minority-class samples are the focus, and following a problem-specific analysis, the thresholds used in computing the model evaluation metrics G-Mean and AUC are re-specified, giving a weighted G-Mean and a weighted AUC, denoted WG-Mean and WAUC. WG-Mean and WAUC are computed with the re-specified formulas and compared with the original situation, and the p features giving the best result are selected; larger WG-Mean and WAUC values indicate a better result. All subsequent inputs use these p features.

Because the minority-class samples are the focus in this case, re-specifying the threshold increases the weight given to recall in the calculation.

In this example, G-Mean and AUC are two metrics for comprehensively evaluating the classifier.

The calculation formula after re-specifying the threshold:

WAUC = Sensitivity × 0.7 + Specificity × 0.3;
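A small sketch of the WAUC formula above, computed from predicted labels. Sensitivity is the recall of the positive class and specificity is the recall of the negative class. The unweighted G-Mean is included using its standard definition, the square root of sensitivity times specificity, which is not spelled out in the text; no WG-Mean function is given because the text does not state its formula. Function names are illustrative.

```python
import math

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 1 (positive) and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec

def g_mean(sens, spec):
    """Standard (unweighted) G-Mean; assumed definition, not from the text."""
    return math.sqrt(sens * spec)

def wauc(sens, spec):
    """WAUC = Sensitivity x 0.7 + Specificity x 0.3, as stated in the text."""
    return 0.7 * sens + 0.3 * spec

# Toy predictions (illustrative only): 4 of 5 positives and 3 of 5 negatives correct
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
score = wauc(sens, spec)
```

With sensitivity 0.8 and specificity 0.6, WAUC is 0.74, higher than the plain average of 0.7 because sensitivity carries the larger weight.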

In this example, the data of the original situation is the data whose positive and negative classes have not been balanced.

In a specific implementation, the 8 top-ranked attributes in the results produced by each algorithm are taken as the base features. After analysis, the selected features are: platelet distribution width, lymphocyte percentage, lymphocyte absolute value, neutrophil percentage, mean platelet volume, red blood cell count, hemoglobin, and hematocrit.

Referring to FIG. 3, the oversampling algorithm giving the better result is selected. Steps (3) and (4): The data set is balanced at the data level using oversampling algorithms. The 8 selected features are input into different oversampling algorithms to obtain training sets P1, P2, ..., Pn, each with a positive-to-negative sample ratio of 1:1. The training sets P1, P2, ..., Pn are each input into the classification algorithm and tested with the test set N, and the model evaluation metrics (G-Mean, AUC) and the recall rate (Recall) are output. WG-Mean and WAUC are computed, and the oversampling algorithm i that produced the training set Pi with the best result is selected.
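The patent does not name the n oversampling algorithms. As a stand-in, the sketch below uses plain random oversampling, which duplicates randomly chosen positive samples until the requested positive-to-negative ratio is reached. It is only one possible oversampler and is not claimed to be one of those actually used; the function name and toy data are illustrative.

```python
import random

def random_oversample(X, y, ratio=1.0, seed=0):
    """Duplicate positive (label 1) rows until positives = ratio x negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    if not pos:
        raise ValueError("no positive samples to oversample")
    n_neg = sum(1 for label in y if label == 0)
    target = round(ratio * n_neg)            # desired number of positives
    extra = max(0, target - len(pos))        # how many duplicates to add
    picks = [rng.choice(pos) for _ in range(extra)]
    idx = list(range(len(y))) + picks        # original rows first, then duplicates
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data: 8 negatives and 2 positives (illustrative only)
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y, ratio=1.0)
```

Only the training set should be resampled in this way; the test set keeps its original class distribution so that the evaluation reflects real data.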

In this example, the data set is divided into a training set, used to train the model, and a test set, used to test the model. First, the training set is input into the feature selection algorithms, the results are compared with the classification results on the original data set, and the p features giving the best result are selected. Oversampling algorithms are then used to obtain training sets with a positive-to-negative sample ratio of 1:1.

Step (5): Referring to FIG. 4, the ratio of positive to negative samples is adjusted in order to obtain a higher recall rate and better model evaluation metrics. The training set M is input into the oversampling algorithm that produced the training set Pi, and the positive-to-negative sample ratio is gradually increased to 1.1:1, 1.2:1, and so on up to 2:1, and even to 5:1 and 10:1. Recall, WG-Mean and WAUC are output for each ratio, and the positive-to-negative sample ratio giving the best result is selected.
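The ratio sweep in step (5) can be sketched as a list of candidate ratios plus a selection over them. The `evaluate` callback is hypothetical: in practice it would oversample the training set to the given ratio, train the classifier, and return a score such as WAUC on the test set. The lambda used in the example is a toy stand-in, not a result.

```python
def candidate_ratios():
    """Ratios 1.1:1 through 2:1 in steps of 0.1, then 5:1 and 10:1."""
    fine = [round(1 + k / 10, 1) for k in range(1, 11)]
    return fine + [5.0, 10.0]

def positives_needed(ratio, n_negative):
    """Number of positive samples required for a given positive:negative ratio."""
    return round(ratio * n_negative)

def best_ratio(ratios, evaluate):
    """Return the ratio whose evaluation score is highest."""
    return max(ratios, key=evaluate)

ratios = candidate_ratios()
needed = [positives_needed(r, n_negative=100) for r in ratios]

# Toy scoring function for illustration only; real scores come from training
chosen = best_ratio(ratios, evaluate=lambda r: -abs(r - 1.3))
```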

The selected positive-to-negative sample ratio is the one at which the model evaluation metrics are best.

The embodiments of the present disclosure use blood cell analysis from routine hospital examinations for analysis and prediction, with the aim of replacing other diagnostic approaches that are expensive and time-consuming. This is innovative in terms of application.

The technique used in the embodiments of the present disclosure overcomes the limitation that clinical researchers are often unfamiliar with machine learning, and departs from the conventional P-test analysis method.

The embodiments of the present disclosure re-specify the thresholds used in computing the model evaluation metrics in light of their practical significance.

The embodiments of the present disclosure use an oversampling algorithm to increase the proportion of positive-class samples, obtaining better model evaluation metrics and a better recall rate for the minority positive-class samples.

Example 2

In another embodiment, the above system may be implemented with a server, a data input device and a data display. The data input device inputs the blood cell analysis data into the server, or the blood cell data stored in a memory is retrieved by invocation. After the server processes the data as described above, the display shows the specific results and the relevant data from the data processing.

The server includes a training set acquisition unit, a feature selection unit, an oversampling unit, an optimal oversampling algorithm obtaining unit, and an optimal positive-to-negative sample ratio obtaining unit.

For the specific implementation of the above units, refer to the process described in Example 1; it is not described in detail again here.

Example 3

An embodiment of the present disclosure discloses a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the classification of distant metastasis of cells based on unbalanced learning.

For the specific steps in this embodiment, refer to the detailed process of Example 1; they are not described in detail again here.

It should be noted that although several modules or sub-modules of the device are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more of the modules described above may be embodied in a single module. Conversely, the features and functions of one module described above may be further divided among multiple modules.

Example 4

An embodiment of the present disclosure discloses a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the classification of distant metastasis of cells based on unbalanced learning.

In this embodiment, the computer program product may include a computer-readable storage medium carrying computer-readable program instructions for carrying out various aspects of the present disclosure. The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device.

It should be understood that, in the description of this specification, reference to the terms "an embodiment", "another embodiment", "other embodiments", or "the first embodiment to the Nth embodiment" means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above are only preferred embodiments of the present disclosure and are not intended to limit it. For those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure shall fall within its scope of protection.

Claims

1. A method for classifying distant metastasis of abnormal cells based on unbalanced learning, characterized by comprising the following steps:

obtaining a plurality of data sequences in which a certain type of cell has distant metastasis and a plurality of data sequences in which it has no distant metastasis, and forming a training set;

inputting the training set into each of k feature selection algorithms, selecting the p top-ranked attributes from each as features of the training set, inputting the features into a classifier for training, comparing the classification results, and selecting the p features with the best results;

balancing the training set at the data level based on oversampling algorithms, by inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;

inputting each of the training sets with the positive-to-negative sample ratio of 1:1 into a classification algorithm, testing with the data sequences of the test set, and selecting the oversampling algorithm i that produced the training set Pi with the optimal evaluation result;

adjusting the ratio of positive to negative samples by inputting the training set into the oversampling algorithm that produced the training set Pi, gradually increasing the positive-to-negative sample ratio to a set ratio, and selecting the positive-to-negative sample ratio with the optimal classification evaluation.

2. The method of classifying abnormal cell distant metastasis based on unbalanced learning according to claim 1, wherein the data sequence comprises: leukocyte count, lymphocyte absolute value, lymphocyte percentage, neutrophil absolute value, neutrophil percentage, monocyte absolute value, monocyte percentage, eosinophil absolute value, eosinophil percentage, basophil absolute value, basophil percentage, erythrocyte count, hemoglobin, erythrocyte mean volume, erythrocyte mean hemoglobin content, erythrocyte mean hemoglobin concentration, erythrocyte distribution width, platelet count, platelet distribution width, platelet distribution volume, and platelet mean volume.

3. The method of claim 1, wherein the selected p features with the best results comprise: platelet distribution width, lymphocyte percentage, lymphocyte absolute value, neutrophil percentage, platelet mean volume, erythrocyte count, hemoglobin, and hematocrit.
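The feature selection step that yields such a subset (k ranking algorithms, top p attributes each, best subset chosen by classifier result) can be sketched as follows. This is an illustrative assumption only: the claims do not name the k algorithms, so `mean_gap_scores` is a hypothetical example ranker, and `evaluate` is a hypothetical callback standing in for "train a classifier on these features and return its score".

```python
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Ranker = Callable[[Matrix, Sequence[int]], List[float]]


def mean_gap_scores(X: Matrix, y: Sequence[int]) -> List[float]:
    """Hypothetical ranker: absolute gap between class means, per feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def top_p(scores: Sequence[float], p: int) -> List[int]:
    """Indices of the p highest-scoring features, returned in ascending order."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:p])


def select_best_subset(
    rankers: Dict[str, Ranker],
    X: Matrix,
    y: Sequence[int],
    p: int,
    evaluate: Callable[[List[int]], float],
) -> Tuple[str, List[int], float]:
    """Run every ranker, keep its top-p features, and return the best-scoring subset.

    Returns (ranker name, feature indices, score); higher score is better.
    """
    best = None
    for name, ranker in rankers.items():
        features = top_p(ranker(X, y), p)
        score = evaluate(features)
        if best is None or score > best[2]:
            best = (name, features, score)
    return best
```

In practice each column of `X` would be one blood-count attribute from claim 2, and the returned indices would map back to attribute names such as those listed in claim 3.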

4. The abnormal cell distant metastasis classification method based on unbalanced learning as claimed in claim 1, wherein, before feature selection, the data of the training set are screened: data integrity is checked, and samples containing missing data are deleted.
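A minimal sketch of this screening step, assuming each sample is a dictionary keyed by attribute name. The function name `drop_incomplete`, the field names in the test data, and the treatment of `None`, empty strings, and NaN as "missing" are all assumptions made for illustration; the claim only states that samples containing missing data are deleted.

```python
import math
from typing import Any, Dict, List, Sequence


def is_missing(value: Any) -> bool:
    """Treat None, empty strings, and float NaN as missing (assumption)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def drop_incomplete(
    records: Sequence[Dict[str, Any]], required_fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Keep only the records in which every required field is present and non-missing."""
    return [
        record
        for record in records
        if all(not is_missing(record.get(field)) for field in required_fields)
    ]
```

Deleting incomplete samples rather than imputing values follows the wording of the claim; imputation would be a different technique.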

5. A cell distant metastasis classification system based on unbalanced learning, characterized by comprising:

a training set acquisition unit configured to: obtain a plurality of data sequences in which certain cells have distant metastasis and a plurality of data sequences in which those cells have no distant metastasis, and form a training set;

a feature selection unit configured to: input the training set into each of k feature selection algorithms, select for each algorithm the p top-ranked attributes as features of the training set, input those features into a classifier for training, compare the classification results, and select the p features with the best result;

an oversampling unit configured to: balance the training set at the data level by means of oversampling, wherein the training set processed by the feature selection algorithm is input into each of n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;

an optimal oversampling algorithm obtaining unit configured to: input each of the training sets with a positive-to-negative sample ratio of 1:1 into a classification algorithm, test with the data sequences of the test set, and select the oversampling algorithm i that produced the training set Pi with the optimal evaluation result;

an optimal positive and negative sample ratio obtaining unit configured to: adjust the ratio of positive to negative samples by inputting the training set into the oversampling algorithm i that produced the training set Pi, gradually increase the ratio of positive to negative samples up to a set ratio, and select the ratio of positive to negative samples with the optimal classification evaluation.

6. The system of claim 5, wherein the selected p features with the best results comprise: platelet distribution width, lymphocyte percentage, lymphocyte absolute value, neutrophil percentage, platelet mean volume, erythrocyte count, hemoglobin, and hematocrit.

7. A cell distant metastasis classification system based on unbalanced learning, characterized by comprising a server, a data input device, and a data display, wherein the data input device is used for inputting blood cell analysis data into the server or for retrieving blood cell data stored in a memory, and the display is used for displaying specific results and related data produced during data processing;

the server is configured to include:

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to any one of claims 1 to 4.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for classifying distant metastasis of abnormal cells based on unbalanced learning according to any one of claims 1 to 4.