CN108648827A

CN108648827A - Cardiovascular and cerebrovascular disease risk prediction method and device

Info

Publication number: CN108648827A
Application number: CN201810449174.4A
Authority: CN
Inventors: 刘奎; 倪壮; 康桂霞; 杨波; 张宁波
Original assignee: Chinese PLA General Hospital; Beijing University of Posts and Telecommunications
Current assignee: Chinese PLA General Hospital; Beijing University of Posts and Telecommunications
Priority date: 2018-05-11
Filing date: 2018-05-11
Publication date: 2018-10-12
Anticipated expiration: 2038-05-11
Also published as: CN108648827B

Abstract

An embodiment of the present invention provides a method and device for predicting the risk of cardiovascular and cerebrovascular diseases. The method includes: obtaining a sample set; dividing the samples in the sample set into a preset number of local clusters; computing, from a preset first K value and a first distance set, the first neighboring samples of an input sample, equal in number to the first K value, and thereby determining a target local cluster; computing the distances between the input sample and the samples in the target local cluster, and thereby determining the second neighboring samples of the input sample, equal in number to a second K value; determining the label of the input sample, and from it whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases; and finally determining whether the patient to be predicted is such a patient. Because the feature data of patients with cardiovascular and cerebrovascular diseases are highly similar, the embodiment avoids the influence of dissimilar sample data on the training of a prediction model. It can therefore improve the accuracy of predicting whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.

Description

Risk prediction method and device for cardiovascular and cerebrovascular diseases

Technical field

The invention relates to the field of predictive analysis, and in particular to a method and device for predicting the risk of cardiovascular and cerebrovascular diseases.

Background

As the pressures of daily life and mental stress keep growing, the incidence of cardiovascular and cerebrovascular diseases rises year by year and seriously affects residents' health. Medical practice shows that diagnosing these diseases at an early stage greatly helps their intervention and treatment.

The prior art applies data mining to the case data of cardiovascular and cerebrovascular diseases: the physical examination feature data and follow-up data of all patients are assembled into one training set, and a prediction model is trained with decision tree, logistic regression, and artificial neural network algorithms. The physical examination data of the patient to be predicted is then fed to the trained model as an input sample, and the model outputs whether that patient is a patient with cardiovascular and cerebrovascular diseases.

Take training a prediction model with an artificial neural network as an example. The input samples of the network include both samples of patients without cardiovascular and cerebrovascular diseases and samples of patients with them, and the feature data of these two kinds of samples differ greatly. When all samples in the training set are fed to the input layer, the error function at the output layer is therefore large. Because the weights and thresholds of each layer are adjusted according to this error function under the influence of dissimilar sample data, the trained prediction model is inaccurate. As a result, a prediction model trained with an artificial neural network predicts with low accuracy whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a method and device for predicting the risk of cardiovascular and cerebrovascular diseases, so as to improve the accuracy of predicting whether a patient is a patient with cardiovascular and cerebrovascular diseases. The specific technical solution is as follows:

In a first aspect, an embodiment of the present invention provides a method for predicting the risk of cardiovascular and cerebrovascular diseases, including:

obtaining a sample set, where the sample set is determined from multiple samples of a labeled patient medical database set; each sample includes the patient's number, features, and feature data; the labels include a first label and a second label; the first label identifies samples of patients with cardiovascular and cerebrovascular diseases; the second label identifies samples of patients without cardiovascular and cerebrovascular diseases;

obtaining an input sample, where the input sample is formed by merging the physical examination data and the medical visit data of the patient to be predicted;

performing metric learning with the cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

dividing the samples in the sample set into a preset number of local clusters using a preset clustering algorithm;

computing, according to the global metric matrix and using a cosine similarity algorithm, the distances between the input sample and the samples in the sample set, to form a first distance set;

computing, according to a preset first K value and the first distance set and using the k-nearest neighbor algorithm, the first neighboring samples of the input sample, equal in number to the first K value;

determining the local clusters in which the first neighboring samples are located;

selecting, among the local clusters in which the first neighboring samples are located, the local cluster whose number of first neighboring samples exceeds a first preset threshold, as the target local cluster;

assigning the input sample to the target local cluster;

computing, according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster, to form a second distance set;

determining, in the target local cluster, according to a preset second K value and the second distance set and using the k-nearest neighbor algorithm, the second neighboring samples of the input sample, equal in number to the second K value;

counting the number of first labels and the number of second labels among the second neighboring samples;

if the ratio of the number of first labels to the number of second labels exceeds a preset label threshold, taking the first label as the label of the input sample; otherwise taking the second label as the label of the input sample;

determining, according to the label of the input sample, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases;

if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases;

if the input sample is not a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular diseases.
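The two-stage neighbor search described in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the metric matrices are taken as already learned (COS-LMNN training is not shown), the distance is assumed to be one minus the cosine similarity of the transformed vectors, the target cluster is taken to be the most represented cluster among the first neighbors, and every function name, parameter name, and default value is hypothetical.

```python
import math
from collections import Counter

def transform(A, x):
    """Apply a metric matrix A (a list of rows) to a feature vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def cosine_distance(A, x, y):
    """Assumed distance: 1 - cosine similarity of A*x and A*y."""
    u, v = transform(A, x), transform(A, y)
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / norm if norm else 1.0

def k_nearest(A, x, samples, indices, k):
    """Return the k indices (taken from `indices`) whose samples are closest to x."""
    return sorted(indices, key=lambda j: cosine_distance(A, x, samples[j]))[:k]

def predict(x, samples, labels, clusters, A_global, A_local,
            k1=5, k2=3, cluster_threshold=2, label_threshold=1.0):
    """Return 1 (first label), 2 (second label), or None if no cluster qualifies."""
    # Stage 1: first-K-value neighbors under the global metric.
    first = k_nearest(A_global, x, samples, range(len(samples)), k1)
    # Count first neighbors per local cluster and test the best one (assumption).
    votes = Counter(clusters[j] for j in first)
    cluster, count = votes.most_common(1)[0]
    if count <= cluster_threshold:
        return None  # the source text does not say what happens in this case
    # Stage 2: second-K-value neighbors inside the target cluster, local metric.
    members = [j for j, c in enumerate(clusters) if c == cluster]
    second = k_nearest(A_local[cluster], x, samples, members, k2)
    n_first = sum(1 for j in second if labels[j] == 1)
    n_second = len(second) - n_first
    # Ratio test; a zero count of second labels is treated as an infinite ratio.
    ratio = n_first / n_second if n_second else math.inf
    return 1 if ratio > label_threshold else 2
```

Returning `None` when no cluster passes the threshold is a design choice of this sketch; a real system would need an explicit fallback, for example classifying against the whole sample set.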

Optionally, after the step of determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases, the method further includes:

determining, according to the patient's health follow-up data, whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

if the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient, advising the patient to be predicted to undergo hospitalization;

if the patient to be predicted is not a high-risk cardiovascular and cerebrovascular disease patient, advising the patient to be predicted to increase the frequency of physical examinations;

after the step of determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular diseases, the method further includes:

determining, according to the patient's health follow-up data, whether the patient to be predicted is a healthy user;

if the patient to be predicted is a healthy user, advising the patient to maintain the normal frequency of physical examinations;

if the patient to be predicted is not a healthy user, marking the patient to be predicted as a missed-diagnosis patient and adding the feature data of the missed-diagnosis patient to the patient medical database set;

where a missed-diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
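The optional follow-up steps above amount to a small decision table, sketched below in Python. The function name, the boolean inputs, and the advice strings are hypothetical; how "high-risk" and "healthy" are derived from the follow-up data is not specified in the text and is left outside the sketch.

```python
def recommend(predicted_patient, high_risk=False, healthy=True):
    """Map the prediction and follow-up findings to (advice, add_to_database)."""
    if predicted_patient:
        # Predicted positive: the advice depends on the high-risk check.
        advice = "hospitalization" if high_risk else "increase check-up frequency"
        return advice, False
    if healthy:
        return "keep normal check-up frequency", False
    # Predicted negative but not healthy on follow-up: a missed diagnosis,
    # whose feature data is fed back into the patient medical database set.
    return "missed diagnosis", True
```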

Optionally, the first label identifying samples of patients with cardiovascular and cerebrovascular diseases includes:

determining the identification information of patients with cardiovascular and cerebrovascular diseases from the collected patient health follow-up data;

where the patient health follow-up data includes the patient's number, features, feature data, and confirmed disease; and the identification information includes the confirmed disease and the features and feature data corresponding to the confirmed disease;

determining, according to the identification information of patients with cardiovascular and cerebrovascular diseases, the samples of patients with cardiovascular and cerebrovascular diseases in the medical database set;

setting the first label on the samples of patients with cardiovascular and cerebrovascular diseases;

the second label identifying samples of patients without cardiovascular and cerebrovascular diseases includes:

setting the second label on all samples other than the samples of patients with cardiovascular and cerebrovascular diseases.

Optionally, obtaining the sample set includes:

for the multiple samples of the labeled patient medical database set, deleting the samples whose sample missing value is greater than a first threshold;

where the sample missing value is the ratio of the number of missing features in a sample to the total number of features in that sample;

searching the samples remaining after sample deletion for features whose feature missing value is greater than a second threshold, and deleting those features;

where the feature missing value is, for one feature across the multiple samples, the ratio of the number of samples lacking data for that feature to the total number of samples for that feature;

searching the samples remaining after feature deletion for features with missing feature data, as first features;

filling in the missing feature data of the first features using multiple imputation;

classifying, by data type, the feature data of the samples after missing-value filling, to obtain a classification result;

where the classification result includes discrete feature data and continuous feature data;

processing the discrete feature data and the continuous feature data, according to the classification result, in the manner corresponding to each data type;

adding the processed discrete and continuous feature data to the patient medical database set, as a first database set;

where processing the discrete feature data and the continuous feature data in the manner corresponding to each data type includes: one-hot encoding the discrete feature data, and standardizing the continuous feature data with the z-score (normal standardization) method;

handling the class imbalance of the samples of the first database set using undersampling and the SMOTE algorithm, to obtain a second database set;

computing, using variance analysis, the variance of each feature's data in the second database set, and deleting the feature data whose variance is less than a preset variance threshold;

computing, using the relief algorithm, the weight of each feature that remains after deleting the feature data whose variance is less than the preset variance threshold;

deleting, according to the weights of the feature data and the score values corresponding to those weights, the feature data whose score value is less than a preset score threshold together with the corresponding features, to obtain a fourth database set;

determining the sample set from the fourth database set using forward selection.
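Several of the preprocessing steps above are simple enough to sketch with the Python standard library. The sketch below covers only the missing-ratio filters, z-score standardization, one-hot encoding, and the variance filter; multiple imputation, undersampling with SMOTE, relief weighting, and forward selection are not shown. The data layout (a list of dictionaries with `None` for missing values) and all function and parameter names are assumptions made for illustration.

```python
import statistics

def drop_sparse_samples(rows, features, max_missing):
    """Drop samples whose missing ratio (missing features / all features) exceeds max_missing."""
    def ratio(row):
        return sum(row.get(f) is None for f in features) / len(features)
    return [r for r in rows if ratio(r) <= max_missing]

def drop_sparse_features(rows, features, max_missing):
    """Keep the features whose missing ratio across samples does not exceed max_missing."""
    def ratio(f):
        return sum(r.get(f) is None for r in rows) / len(rows)
    return [f for f in features if ratio(f) <= max_missing]

def z_score(values):
    """Standardize a continuous feature to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]

def one_hot(values):
    """One-hot encode a discrete feature; columns follow the sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def drop_low_variance(columns, min_variance):
    """Keep the feature columns whose population variance is at least min_variance."""
    return {name: col for name, col in columns.items()
            if statistics.pvariance(col) >= min_variance}
```

Filtering samples before features, as the text does, matters: a few nearly empty records would otherwise inflate every feature's missing ratio.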

Optionally, computing, according to the global metric matrix and using the cosine similarity algorithm, the distances between the input sample and the samples in the sample set to form the first distance set includes:

computing, according to the global metric matrix and using the cosine similarity formula, the distances between the input sample and the samples in the sample set, to form the first distance set;

where the cosine similarity formula is:

the first distance set D1 includes: {D(x_i, x_1), D(x_i, x_2), D(x_i, x_3), …, D(x_i, x_j)};

where i is the index of the input sample and x_i is the i-th input sample; X is the sample set; A is the global metric matrix, with M = A^T A; j is the index of a sample in the sample set and x_j is the j-th sample in the sample set; i and j are positive integers; D(x_i, x_j) is the distance between the input sample x_i and the j-th sample in X under the global metric matrix; A(x_i, x_j) is the distance between x_i and x_j after transformation by the matrix A.
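The formula itself is not reproduced in this text. Given the definition M = A^T A, a cosine distance measured after transforming both samples by A can be written directly in terms of M. The form below is an assumption consistent with the symbol definitions, not the patent's confirmed formula; in particular, whether the distance is taken as one minus the similarity is a guess.

```latex
D(x_i, x_j)
  = 1 - \frac{(A x_i)^{\mathsf T} (A x_j)}{\lVert A x_i \rVert \, \lVert A x_j \rVert}
  = 1 - \frac{x_i^{\mathsf T} M x_j}{\sqrt{x_i^{\mathsf T} M x_i}\,\sqrt{x_j^{\mathsf T} M x_j}},
  \qquad M = A^{\mathsf T} A .
```

The second equality follows because (A x_i)^T (A x_j) = x_i^T A^T A x_j = x_i^T M x_j, and likewise for the squared norms.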

Optionally, computing, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster to form the second distance set includes:

computing, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity formula, the distances between the input sample and the samples in the sample set, to form the second distance set;

the second distance set D2 includes: {D(x_i, x_s1), D(x_i, x_s2), …, D(x_i, x_si)};

where i is the index of the input sample and x_i is the i-th input sample; x_si is a sample of the same category as i; A_S is the local metric matrix, with M_S = A_S^T A_S; i is a positive integer; D(x_i, x_si) is the distance, under the local metric matrix, between the input sample x_i and a sample of the same category as i in the target local cluster; A_S(x_i, x_si) is the distance between x_i and x_si after transformation by the matrix A_S.

In a second aspect, this embodiment provides a device for predicting the risk of cardiovascular and cerebrovascular diseases, including:

a set acquisition module, configured to obtain a sample set;

where the sample set is determined from multiple samples of a labeled patient medical database set; each sample includes the patient's number, features, and feature data; the labels include a first label and a second label; the first label identifies samples of patients with cardiovascular and cerebrovascular diseases; the second label identifies samples of patients without cardiovascular and cerebrovascular diseases;

a sample acquisition module, configured to obtain an input sample;

where the input sample is formed by merging the physical examination data and the medical visit data of the patient to be predicted;

a matrix calculation module, configured to perform metric learning with the cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

a first local cluster determination module, configured to divide the samples in the sample set into a preset number of local clusters using a preset clustering algorithm;

a first distance determination module, configured to compute, according to the global metric matrix and using a cosine similarity algorithm, the distances between the input sample and the samples in the sample set, to form a first distance set;

a first sample determination module, configured to compute, according to a preset first K value and the first distance set and using the k-nearest neighbor algorithm, the first neighboring samples of the input sample, equal in number to the first K value;

a second local cluster determination module, configured to determine the local clusters in which the first neighboring samples are located;

a target local cluster determination module, configured to select, among the local clusters in which the first neighboring samples are located, the local cluster whose number of first neighboring samples exceeds a first preset threshold, as the target local cluster;

a local cluster assignment module, configured to assign the input sample to the target local cluster;

a second distance determination module, configured to compute, according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster, to form a second distance set;

a second sample determination module, configured to determine, in the target local cluster, according to a preset second K value and the second distance set and using the k-nearest neighbor algorithm, the second neighboring samples of the input sample, equal in number to the second K value;

a counting module, configured to count the number of first labels and the number of second labels among the second neighboring samples;

a label determination module, configured to take the first label as the label of the input sample if the ratio of the number of first labels to the number of second labels exceeds a preset label threshold, and otherwise to take the second label as the label of the input sample;

a patient sample determination module, configured to determine, according to the label of the input sample, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases;

a cardiovascular and cerebrovascular disease patient determination module, configured to determine, if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases, that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases;

a non-cardiovascular and cerebrovascular disease patient determination module, configured to determine, if the input sample is not a sample of a patient with cardiovascular and cerebrovascular diseases, that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular diseases.

Optionally, the device for predicting the risk of cardiovascular and cerebrovascular diseases provided in this embodiment further includes:

a high-risk determination module, configured to determine, according to the patient's health follow-up data, whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

a hospitalization suggestion module, configured to advise the patient to be predicted to undergo hospitalization if the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

an increased examination suggestion module, configured to advise the patient to be predicted to increase the frequency of physical examinations if the patient to be predicted is not a high-risk cardiovascular and cerebrovascular disease patient;

a health determination module, configured to determine, according to the patient's health follow-up data, whether the patient to be predicted is a healthy user;

a normal examination suggestion module, configured to advise the patient to maintain the normal frequency of physical examinations if the patient to be predicted is a healthy user;

a missed-diagnosis patient determination module, configured to mark the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and to add the feature data of the missed-diagnosis patient to the patient medical database set.

Optionally, the set acquisition module includes:

a sample deletion sub-module, configured to delete, from the multiple samples of the labeled patient medical database set, the samples whose sample missing value is greater than a first threshold;

a feature deletion sub-module, configured to search the samples remaining after sample deletion for features whose feature missing value is greater than a second threshold, and to delete those features;

a first feature sub-module, configured to search the samples remaining after feature deletion for features with missing feature data, as first features;

a missing value filling sub-module, configured to fill in the missing feature data of the first features using multiple imputation;

a data classification sub-module, configured to classify, by data type, the feature data of the samples after missing-value filling, to obtain a classification result;

a data processing sub-module, configured to process the discrete feature data and the continuous feature data, according to the classification result, in the manner corresponding to each data type;

a set update sub-module, configured to add the processed discrete and continuous feature data to the patient medical database set, as a first database set;

a balancing sub-module, configured to handle the class imbalance of the samples of the first database set using undersampling and the SMOTE algorithm, to obtain a second database set;

a variance deletion sub-module, configured to compute, using variance analysis, the variance of each feature's data in the second database set, and to delete the feature data whose variance is less than a preset variance threshold;

a weight calculation sub-module, configured to compute, using the relief algorithm, the weight of each feature that remains after deleting the feature data whose variance is less than the preset variance threshold;

a set determination sub-module, configured to determine the sample set, according to the weight of each feature and the second database set, using forward selection.
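The weight calculation sub-module relies on the relief algorithm. A minimal sketch of the basic two-class Relief update is given below: for a randomly drawn sample, a feature's weight grows with its difference from the nearest sample of the other class (the nearest miss) and shrinks with its difference from the nearest sample of the same class (the nearest hit). The sketch assumes features already scaled to [0, 1] and uses the Manhattan distance to find neighbors; the function name, parameters, and these choices are illustrative assumptions, and the text does not say which Relief variant is used.

```python
import random

def relief_weights(samples, labels, n_iter=None, seed=0):
    """Basic two-class Relief; returns one weight per feature."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    n_iter = n_iter or len(samples)
    weights = [0.0] * n_features

    def dist(a, b):
        # Manhattan distance, used only to find the nearest hit and miss.
        return sum(abs(p - q) for p, q in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == y]
        misses = [s for j, s in enumerate(samples) if labels[j] != y]
        if not hits or not misses:
            continue  # cannot update without both a hit and a miss
        hit = min(hits, key=lambda s: dist(x, s))
        miss = min(misses, key=lambda s: dist(x, s))
        for f in range(n_features):
            # Reward separation from the miss, penalize separation from the hit.
            weights[f] += (abs(x[f] - miss[f]) - abs(x[f] - hit[f])) / n_iter
    return weights
```

Features that receive low or negative weights carry little class information and are natural candidates for removal before forward selection.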

In yet another aspect of the present invention, an electronic device is also provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement any one of the above methods for predicting the risk of cardiovascular and cerebrovascular diseases when executing the program stored in the memory.

In yet another aspect of the present invention, a computer-readable storage medium is also provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute any one of the above methods for predicting the risk of cardiovascular and cerebrovascular diseases.

In yet another aspect of the present invention, an embodiment of the present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any one of the above methods for predicting the risk of cardiovascular and cerebrovascular diseases.

The method and device for predicting the risk of cardiovascular and cerebrovascular diseases provided by the embodiments of the present invention obtain a sample set; divide the samples in the sample set into a preset number of local clusters; compute, according to the preset first K value and the first distance set and using the k-nearest neighbor algorithm, the first neighboring samples of the input sample, equal in number to the first K value, and thereby determine the target local cluster; after assigning the input sample to the target local cluster, compute the distances between the input sample and the samples in the target local cluster, and thereby determine the second neighboring samples of the input sample, equal in number to the second K value; count the number of first labels and the number of second labels among the second neighboring samples, and thereby determine the label of the input sample and whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases; and finally determine whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases. Because the feature data of patients with cardiovascular and cerebrovascular diseases are highly similar, this embodiment computes similarity distances between samples and finds the samples nearest to the input sample in order to determine whether the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases, which avoids the influence of dissimilar sample feature data on the training of a prediction model. The accuracy of predicting whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases can therefore be improved. Of course, a product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for predicting the risk of cardiovascular and cerebrovascular diseases according to an embodiment of the present invention;

Fig. 2 is a detailed flowchart of step S101 of Fig. 1 in an embodiment of the present invention;

Fig. 3 is a structural diagram of a device for predicting the risk of cardiovascular and cerebrovascular diseases according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

This embodiment addresses the low accuracy of the prior art, which trains prediction models with decision tree, logistic regression and artificial neural network algorithms to predict whether a patient is a patient with cardiovascular and cerebrovascular disease. Because the feature data of patients with cardiovascular and cerebrovascular diseases are similar to one another, the similarity distance between samples can be used to find samples similar to the sample of the patient to be predicted. From these similar samples it can be learned whether the sample of the patient to be predicted is a sample of a patient with cardiovascular and cerebrovascular disease, and hence whether the patient to be predicted is such a patient.

As shown in Fig. 1, a method for predicting the risk of cardiovascular and cerebrovascular diseases provided by an embodiment of the present invention includes the following steps:

S101, acquire a sample set. The sample set is determined from multiple labeled samples in the patient medical database set. Each sample includes the patient's number, features and feature data. The labels include a first label and a second label: the first label identifies samples of patients with cardiovascular and cerebrovascular diseases, and the second label identifies samples of patients without cardiovascular and cerebrovascular diseases;

For example, one sample includes the patient number 0001. The features include age, gender, city, occupation, family genetic history, disease history, dietary habits, smoking habits, drinking habits, blood pressure, pulse, blood lipids, blood glucose, and so on. The feature data include: age: 50; gender: male; city: Wuhan; occupation: teacher; family genetic history: none; disease history: hypertension; dietary habits: no breakfast, noodles or rice for lunch, barbecue for dinner; smoking habits: two cigarettes a day; drinking habits: at least once every three days; blood pressure: 100-145 mmHg; pulse: 60-100 beats/min; serum total cholesterol: 2.9-5.17 mmol/L; serum triglycerides: 0.56-1.7 mmol/L; high-density lipoprotein cholesterol: 0.94-2.0 mmol/L; low-density lipoprotein cholesterol: 2.07-3.12 mmol/L; blood glucose: fasting 7.8-9.0 mmol/L, and so on.

It can be understood that the labeled samples in the patient medical database set of this embodiment are formed by merging each patient's physical examination data and medical visit data into one sample in advance. Alternatively, the physical examination data and medical visit data can be merged into one sample each time the sample set is acquired. Because the former saves time, this embodiment merges the physical examination data and medical visit data into one sample in advance and then sets the label according to the patient's health follow-up data. The health follow-up data include the patient's ID, the condition confirmed for the patient, and the data features of that confirmed condition. For example, the health follow-up data include a variety of cardiovascular and cerebrovascular conditions such as "stroke", "hypertension", "coronary heart disease", "hyperlipidemia", "hyperglycemia", "hemoptysis", "fainting", "cerebral infarction" and "heart failure". As in the prior art, a condition may be confirmed by a doctor's diagnosis, by symptoms the patient's body presents, or by the patient's family members; this is not described further here.

The patient's physical examination data and medical visit data are the same as in the prior art: the physical examination data contain the patient's ID and feature data, and the medical visit data contain the patient's ID and basic information. The physical examination data and medical visit data are merged into one sample according to the patient's ID. Whether the sample is a sample of a patient with cardiovascular and cerebrovascular disease is then determined from the patient's ID and confirmed condition in the health follow-up data, and the sample is labeled accordingly.
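The merge described above can be sketched as a join on the patient ID. This is a minimal illustration rather than the implementation used in the embodiment: the function name `merge_by_patient_id`, the dictionary layout and the field names are assumptions, and the choice to keep only patients that appear in both sources is also an assumption.

```python
def merge_by_patient_id(exam_records, visit_records):
    """Join physical-examination and visit records that share a patient ID."""
    samples = {}
    for patient_id, exam in exam_records.items():
        visit = visit_records.get(patient_id)
        if visit is None:
            continue  # assumption: keep only patients present in both sources
        # One merged sample: ID, basic information, then examination features.
        samples[patient_id] = {"patient_id": patient_id, **visit, **exam}
    return samples


# Hypothetical records keyed by patient ID.
exam = {"0001": {"blood_pressure": "100-145 mmHg", "pulse": "60-100 beats/min"}}
visit = {"0001": {"age": 50, "gender": "male", "city": "Wuhan"},
         "0002": {"age": 61}}
samples = merge_by_patient_id(exam, visit)
```

Labeling is deliberately left out of this sketch, since it depends on the follow-up data rather than on the merge itself.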

S102, acquire an input sample; the input sample is formed by merging the physical examination data and medical visit data of the patient to be predicted;

S103, perform metric learning with the cosine large-margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

The COS-LMNN algorithm in this embodiment combines the cosine (COS) algorithm with the large-margin nearest neighbor (LMNN) algorithm to compute the global metric matrix of the sample set. The two are combined in the same way as in the prior art, which is not described further here.

S104, use a preset clustering algorithm to divide the samples in the sample set into a preset number of local clusters;

The preset number is set according to industry experience. The preset clustering algorithm may be a prior-art algorithm such as k-means clustering, hierarchical clustering, SOM clustering or FCM clustering. The preset number can be adjusted to suit the clustering algorithm used.

It can be understood that this embodiment groups the samples in the sample set into different local clusters. For example, if the sample set contains seven samples A, B, C, D, E, F and G, the result of the division may be local cluster 1: A and B; local cluster 2: C, F and G; local cluster 3: D and E.
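As one illustration of S104, the sketch below runs a plain k-means, one of the clustering algorithms the text lists, on seven two-dimensional points standing in for samples A to G. The coordinates, the evenly spaced seeding rule and the function names are assumptions made for the example; real samples would first need a numeric encoding of their features.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, iterations=20):
    """Plain k-means; seeds are evenly spaced input points (an assumption)."""
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move every centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical coordinates for samples A, B, C, D, E, F, G (in that order).
names = ["A", "B", "C", "D", "E", "F", "G"]
points = [(0, 0), (0, 1), (10, 10), (20, 0), (21, 0), (10, 11), (11, 10)]
assignment = k_means(points, k=3)
clusters = {}
for name, cluster_index in zip(names, assignment):
    clusters.setdefault(cluster_index, []).append(name)
```

With these coordinates the grouping matches the example in the text: A and B together, C, F and G together, and D and E together.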

S105, according to the global metric matrix, use the cosine similarity algorithm to calculate the distances between the input sample and the samples in the sample set, which form a first distance set;

S106, according to the preset first K value and the first distance set, use the k-nearest-neighbor algorithm to compute the first neighboring samples of the input sample, whose number equals the first K value;

In this embodiment, the first K value is a value preset according to practical experience; it equals the number of first neighboring samples of the input sample computed with the k-nearest-neighbor algorithm.

For example, suppose the sample set contains samples 1, 2, 3 and 4, and samples 1, 2 and 3 are samples of patients with cardiovascular and cerebrovascular diseases, so that the distances between samples 1, 2 and 3 may be 0. If the input sample is a sample of a patient with cardiovascular and cerebrovascular disease and the first K value is set to 2, then the first neighboring samples of the input sample are samples 1 and 2, or 2 and 3, or 1 and 3.
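Steps S105 and S106 can be sketched as follows. The source does not give the distance formula, so the form used here, one minus the cosine similarity computed with the metric matrix M as the inner product, is an assumption, as are the function names and the toy vectors. M is taken to be positive definite and the vectors non-zero, and the identity matrix stands in for a learned COS-LMNN matrix.

```python
import math


def bilinear(a, M, b):
    """Compute a^T M b for plain Python lists."""
    return sum(a[i] * M[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def metric_cosine_distance(x, y, M):
    """1 - cosine similarity measured under the metric matrix M (assumed form)."""
    norm = math.sqrt(bilinear(x, M, x) * bilinear(y, M, y))
    return 1.0 - bilinear(x, M, y) / norm


def first_neighbors(query, samples, M, k):
    """Build the first distance set (S105) and keep the k closest IDs (S106)."""
    distances = {sid: metric_cosine_distance(query, vec, M)
                 for sid, vec in samples.items()}
    return sorted(distances, key=distances.get)[:k]


# Hypothetical data: identity metric matrix and four encoded samples.
M = [[1.0, 0.0], [0.0, 1.0]]
samples = {"1": [1.0, 0.0], "2": [2.0, 0.1], "3": [0.0, 1.0], "4": [-1.0, 0.0]}
neighbors = first_neighbors([1.0, 0.05], samples, M, k=2)
```

Sample "2" points in the same direction as the query, so it comes first, followed by sample "1".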

S107, determine the local clusters in which the first neighboring samples are located;

S108, among the local clusters in which the first neighboring samples are located, select the local cluster whose number of first neighboring samples exceeds a first preset threshold as the target local cluster;

S109, assign the input sample to the target local cluster;

S110, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm, use the cosine similarity algorithm to calculate the distances between the input sample and the samples in the target local cluster, which form a second distance set;

S111, in the target local cluster, according to the preset second K value and the second distance set, use the k-nearest-neighbor algorithm to determine the second neighboring samples of the input sample, whose number equals the second K value;

In this embodiment, the second neighboring samples of the input sample are determined within the target local cluster. The number of second neighboring samples equals the second K value, so one or more second neighboring samples may be obtained.

S112, count the number of first labels and the number of second labels among the second neighboring samples;

S113, if the ratio of the number of first labels to the number of second labels exceeds a preset label threshold, take the first label as the label of the input sample; otherwise take the second label as the label of the input sample;

S114, according to the label of the input sample, determine whether the input sample is a sample of a patient with cardiovascular and cerebrovascular disease;

S115, if the input sample is a sample of a patient with cardiovascular and cerebrovascular disease, determine that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular disease;

S116, if the input sample is not a sample of a patient with cardiovascular and cerebrovascular disease, determine that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular disease.
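The cluster selection of S107 to S108 and the label vote of S112 to S113 can be sketched as below. The function names are hypothetical, and two behaviors are assumptions that the text leaves open: when several clusters exceed the threshold the one with the most first neighbors is chosen, and when no neighbor carries the second label the ratio is treated as unbounded. Labels are encoded as 1 (first label) and 0 (second label).

```python
from collections import Counter


def choose_target_cluster(neighbor_ids, cluster_of, min_count):
    """S107-S108: return the cluster holding more than min_count first neighbors."""
    counts = Counter(cluster_of[sid] for sid in neighbor_ids)
    cluster, count = counts.most_common(1)[0]
    # Assumption: None signals that no cluster exceeds the threshold.
    return cluster if count > min_count else None


def vote_label(neighbor_labels, ratio_threshold):
    """S112-S113: return 1 if (#label 1 / #label 0) exceeds ratio_threshold."""
    positives = sum(1 for label in neighbor_labels if label == 1)
    negatives = len(neighbor_labels) - positives
    if negatives == 0:
        # Assumption: with no second-label neighbors the ratio is unbounded.
        return 1 if positives > 0 else 0
    return 1 if positives / negatives > ratio_threshold else 0


# Hypothetical cluster membership and first neighbors.
cluster_of = {"A": 1, "B": 1, "C": 2, "F": 2, "G": 2, "D": 3, "E": 3}
target = choose_target_cluster(["A", "C", "F"], cluster_of, min_count=1)
label = vote_label([1, 1, 0], ratio_threshold=1.0)
```

A ratio test differs from a simple majority vote only through the threshold; a threshold of 1.0, as in the usage lines, reduces it to a strict majority.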

By comparison, the prior art applies data mining to the case data of cardiovascular and cerebrovascular diseases: the physical examination feature data and follow-up data of all patients form a training set, and a prediction model is trained with decision tree, logistic regression and artificial neural network algorithms. The physical examination data of the patient to be predicted are then fed to the trained prediction model as an input sample, and the model outputs whether the patient to be predicted is a patient with cardiovascular and cerebrovascular disease.

A decision tree must analyze the physical examination data of all patients during training. The sample data with the largest information gain become the first node, and the remaining data become branches in descending order of information gain; training stops when the data samples belong to a single class, which yields the prediction model. When the physical examination data with the largest information gain come from patients without cardiovascular and cerebrovascular diseases, the trained model is biased by those data, so the prediction accuracy of the decision tree model is not high.

When a prediction model is trained with the logistic regression algorithm, the minimum of the loss function must be found to determine the model. Because this minimization is easily affected by differing sample data, the probability output by the model that the patient to be predicted is a patient with cardiovascular and cerebrovascular disease, and hence the resulting determination, is not accurate.

This embodiment acquires a sample set, divides the samples in the sample set into a preset number of local clusters, and computes the first neighboring samples of the input sample, whose number equals the first K value, thereby determining the target local cluster. By calculating the distances between the input sample and the samples in the target local cluster, it determines the second neighboring samples of the input sample, whose number equals the second K value. It then counts the numbers of first labels and second labels among the second neighboring samples, thereby determining the label of the input sample and whether the input sample is a sample of a patient with cardiovascular and cerebrovascular disease, and finally whether the patient to be predicted is such a patient. This embodiment does not train a prediction model. Because the feature data of patients with cardiovascular and cerebrovascular diseases are highly similar, it uses the similarity distance between samples to determine whether the patient to be predicted is a patient with cardiovascular and cerebrovascular disease. This avoids the bias that the sample data with the largest information gain introduce when a decision tree is trained. It also avoids solving a loss function with the logistic regression algorithm, whose minimization is easily affected by differing sample data and leads to an inaccurate trained model. The accuracy of predicting whether the patient to be predicted is a patient with cardiovascular and cerebrovascular disease can therefore be improved.

Optionally, in an embodiment of the method for predicting the risk of cardiovascular and cerebrovascular diseases of the present invention, after step S115 (if the input sample is a sample of a patient with cardiovascular and cerebrovascular disease, determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular disease), the method further includes:

Step 1: determine, according to the patient's health follow-up data, whether the patient to be predicted is a high-risk patient with cardiovascular and cerebrovascular disease;

If the feature data in the health follow-up data exceed the normal indicators and the signs and symptoms are abnormal, the patient to be predicted can be judged to be a high-risk patient with cardiovascular and cerebrovascular disease. Examples are blood pressure or hemoglobin above the normal range, or abnormal symptoms such as fainting or coughing up blood.

Step 2: if the patient to be predicted is a high-risk patient with cardiovascular and cerebrovascular disease, recommend hospitalization for the patient to be predicted;

When the patient to be predicted is a high-risk patient with cardiovascular and cerebrovascular disease, the patient is given the length of hospital stay and the treatment plan that correspond to the patient's feature data.

This embodiment establishes a treatment database for cardiovascular and cerebrovascular patients in advance. The database includes the patients' feature data and the length of hospital stay and treatment plan corresponding to the feature data. A treatment plan includes the dose and frequency of insulin injections, the dose and frequency of antihypertensive medication, physical exercise, whether surgical treatment is needed, and so on.

For example, for a high-risk patient with cardiovascular and cerebrovascular disease whose blood pressure is 120-155 mmHg and whose fasting blood glucose is 7.8-9.0 mmol/L, the corresponding length of hospital stay is 20 days, and the treatment plan corresponding to the feature data is an insulin injection of 1 U per day.

This embodiment saves the time doctors spend on diagnosis and advice and saves medical resources: once the patient is judged to be a high-risk patient with cardiovascular and cerebrovascular disease, a recommendation of hospitalization is given for the patient to be predicted.

Step 3: if the patient to be predicted is not a high-risk patient with cardiovascular and cerebrovascular disease, recommend that the patient to be predicted increase the frequency of physical examinations.

It can be understood that, once the patient to be predicted has been determined to be a patient with cardiovascular and cerebrovascular disease, the patient is not a high-risk patient if the health follow-up data show that the patient's feature data fall within the range of the feature data of a certain number of patients with cardiovascular and cerebrovascular diseases and that the patient's signs and symptoms are not abnormal. The patient can then be advised of the physical examination frequency that corresponds to the range of the feature data.

For example, the blood pressure range of 100 patients with cardiovascular and cerebrovascular diseases is 100-145 mmHg, and the blood pressure of the patient to be predicted is 102-140 mmHg. The patient has also shown no life-threatening symptoms such as fainting or coughing up blood, so the patient is not a high-risk patient with cardiovascular and cerebrovascular disease. Suppose the patient previously had one physical examination per month and the blood pressure range in the patient's feature data is 102-140 mmHg. If the examination frequency corresponding to these feature data is twice a month, the patient is advised to have a physical examination twice a month. This embodiment saves the time doctors spend on diagnosis and advice and saves medical resources.

Optionally, in an embodiment of the method for predicting the risk of cardiovascular and cerebrovascular diseases of the present invention, after step S116 of determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular disease, the method further includes:

Step 1: determine, according to the patient's health follow-up data, whether the patient to be predicted is a healthy user;

A healthy user is a patient whose feature data all fall within the medically specified standard ranges. For example, the medically specified normal blood pressure is 80-90/120-140 mmHg. If the health follow-up data show that the blood pressure of the patient to be predicted is 82/125 mmHg and all of the patient's other feature data are within the medically specified standard ranges, the patient is a healthy user.

Step 2: if the patient to be predicted is a healthy user, recommend that the patient keep the normal frequency of physical examinations;

It can be understood that, if the patient to be predicted is a healthy user, the examination frequency corresponds to the patient's data features, and the patient is advised to keep the same examination frequency as before. For example, if the patient previously had one physical examination per month, the patient is advised to keep that monthly frequency. This embodiment identifies healthy users and gives them suitable advice, which saves the time doctors spend on diagnosis and advice and reduces the patient's medical expenses.

Step 3: if the patient to be predicted is not a healthy user, mark the patient to be predicted as a missed-diagnosis patient and add the feature data of the missed-diagnosis patient to the patient medical database set;

It can be understood that, if the patient to be predicted is not a healthy user, the patient's health follow-up data can be used to determine whether or not the patient is a patient with cardiovascular and cerebrovascular disease. If the patient is a patient with cardiovascular and cerebrovascular disease, the patient is marked as a missed-diagnosis patient and the patient's feature data are added to the patient medical database set. This prevents the same kind of patient from being mispredicted in later predictions and improves the accuracy of predicting whether a patient to be predicted is a patient with cardiovascular and cerebrovascular disease.
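The three steps above amount to a range check followed by a branch. The sketch below illustrates that flow under stated assumptions: the function names, the field names and the list used as the database are hypothetical, and only the blood pressure ranges (80-90 mmHg diastolic, 120-140 mmHg systolic) come from the example in the text.

```python
def within_ranges(features, normal_ranges):
    """True if every measured feature falls inside its reference range."""
    return all(low <= features[name] <= high
               for name, (low, high) in normal_ranges.items()
               if name in features)


def handle_negative_prediction(patient, follow_up, normal_ranges, database):
    """Post-processing for a patient predicted as a non-patient (after S116)."""
    if within_ranges(follow_up, normal_ranges):
        return "keep current examination frequency"
    # Not a healthy user: mark as a missed diagnosis and store the sample.
    database.append({**patient, "missed_diagnosis": True})
    return "marked as missed diagnosis"


# Reference ranges from the blood pressure example in the text.
normal_ranges = {"diastolic": (80, 90), "systolic": (120, 140)}
database = []
healthy = handle_negative_prediction(
    {"patient_id": "0001"}, {"diastolic": 82, "systolic": 125},
    normal_ranges, database)
missed = handle_negative_prediction(
    {"patient_id": "0002"}, {"diastolic": 95, "systolic": 150},
    normal_ranges, database)
```

Appending the missed sample to the database is what lets later neighbor searches see cases of the same kind.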

Optionally, in an embodiment of the method for predicting the risk of cardiovascular and cerebrovascular diseases of the present invention, the first label identifying samples of patients with cardiovascular and cerebrovascular diseases includes:

Step 1: determine the identification information of patients with cardiovascular and cerebrovascular diseases according to the collected health follow-up data of patients;

The patient's health follow-up data include the patient's number, features, feature data and confirmed condition. The identification information includes the confirmed condition and the features and feature data corresponding to the confirmed condition;

Step 2: according to the identification information of patients with cardiovascular and cerebrovascular diseases, identify the samples of patients with cardiovascular and cerebrovascular diseases in the medical database set;

Step 3: set the first label on the samples of patients with cardiovascular and cerebrovascular diseases.

This embodiment uses the patients' health follow-up data to identify the samples of patients with cardiovascular and cerebrovascular diseases in the medical database set and sets the first label on them, which saves time when the label of an input sample is determined.

Optionally, in an embodiment of the method for predicting the risk of cardiovascular and cerebrovascular diseases of the present invention, the second label identifying samples of patients without cardiovascular and cerebrovascular diseases includes:

Setting the second label on all samples other than the samples of patients with cardiovascular and cerebrovascular diseases.

This embodiment may use the health follow-up data collected one month after a user's physical examination. Labels are set on the cardiovascular and cerebrovascular disease samples and on the other samples according to the health follow-up data. A label may consist of letters, digits, symbols, and so on. For example, samples whose health follow-up data contain descriptions of cardiovascular and cerebrovascular diseases such as "stroke", "hypertension", "coronary heart disease", "hyperlipidemia", "hyperglycemia", "cerebral infarction" and "heart failure" are treated as positive samples: a new category field "label" is added and set to "1". All samples that are not positive samples are treated as negative samples, given the label "0", and added to the medical database set.
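The labeling rule just described can be sketched as a keyword lookup. The set of conditions comes from the example in the text, while the function name `attach_labels`, the `patient_id` key and the shape of the follow-up data are assumptions for illustration.

```python
# Conditions named in the text as cardiovascular and cerebrovascular diseases.
POSITIVE_CONDITIONS = {
    "stroke", "hypertension", "coronary heart disease", "hyperlipidemia",
    "hyperglycemia", "cerebral infarction", "heart failure",
}


def attach_labels(samples, follow_ups):
    """Add a "label" field: "1" for positive samples, "0" for all others."""
    labeled = []
    for sample in samples:
        # Hypothetical layout: follow_ups maps patient ID to confirmed conditions.
        confirmed = set(follow_ups.get(sample["patient_id"], ()))
        label = "1" if confirmed & POSITIVE_CONDITIONS else "0"
        labeled.append({**sample, "label": label})
    return labeled


samples = [{"patient_id": "0001", "age": 50}, {"patient_id": "0002", "age": 34}]
follow_ups = {"0001": ["hypertension"], "0002": []}
labeled = attach_labels(samples, follow_ups)
```

Patients with no follow-up entry fall through to the label "0", which matches the rule that every non-positive sample is a negative sample.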

This embodiment uses the patients' health follow-up data to identify the samples of patients without cardiovascular and cerebrovascular diseases in the medical database set and sets the second label on them, which saves time when the label of an input sample is determined.

Optionally, in an embodiment of the method for predicting the risk of cardiovascular and cerebrovascular diseases of the present invention, acquiring the sample set in S101 includes:

S201, among the multiple samples of the labeled patient medical database set, delete the samples whose sample missing value is greater than a first threshold;

The sample missing value is the ratio of the number of features missing from a sample to the total number of features in that sample;

The first threshold is a value specified manually according to industry experience. The sample missing value is illustrated as follows: if a sample contains 10 features and 7 of them lack feature data, the ratio of the number of missing features to the total number of features in the sample is 7/10. Assuming the specified first threshold is smaller than this ratio, the sample is deleted.

The purpose of deleting samples whose sample missing value is greater than the first threshold is to remove samples with little feature data from the patient medical database set, improve the quality of the samples in the set, and save time in subsequent processing.
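Step S201 can be sketched as a per-sample ratio followed by a filter. The function names, the use of `None` to mark a missing value and the threshold of 0.5 are assumptions for illustration; the text does not state the threshold value.

```python
def sample_missing_ratio(sample, feature_names):
    """Share of features in one sample that have no feature data."""
    missing = sum(1 for name in feature_names if sample.get(name) is None)
    return missing / len(feature_names)


def drop_sparse_samples(samples, feature_names, first_threshold):
    """S201: keep only samples whose missing ratio does not exceed the threshold."""
    return [s for s in samples
            if sample_missing_ratio(s, feature_names) <= first_threshold]


# Ten hypothetical features; the sparse sample lacks data for seven of them.
feature_names = [f"f{i}" for i in range(10)]
sparse = {name: (1.0 if i < 3 else None) for i, name in enumerate(feature_names)}
complete = {name: 1.0 for name in feature_names}
ratio = sample_missing_ratio(sparse, feature_names)
kept = drop_sparse_samples([sparse, complete], feature_names, first_threshold=0.5)
```

The sparse sample reproduces the 7/10 case from the text and is removed under the assumed threshold.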

S202, among the samples remaining after the sample deletion, find the features whose feature missing value is greater than a second threshold and delete those features;

The purpose of deleting features whose feature missing value is greater than the second threshold is to remove features with little feature data from the samples of the patient medical database set, improve the quality of the feature data in those samples, and save time in subsequent processing.

The feature missing value is, for one feature across multiple samples, the ratio of the number of samples in which that feature lacks feature data to the total number of occurrences of that feature;

其中，第二阈值是人为根据行业经验规定的数值。下面举例说明特征缺失值，例如，10个样本中有同一特征：脉搏。缺少特征数据的脉搏特征数量共有7个，脉搏特征的总数量是10，假设规定的第一阈值是则将脉搏特征作特征删除处理。Wherein, the second threshold is a numerical value artificially specified based on industry experience. The following example illustrates the missing value of a feature, for example, there is the same feature in 10 samples: pulse. The number of pulse features lacking feature data is 7, and the total number of pulse features is 10. Assume that the specified first threshold is The pulse feature is then processed as feature deletion.
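
The two ratios defined for S201 and S202 can be sketched as follows. This is a minimal illustration, assuming a sample is a dictionary mapping feature names to values and a missing value is stored as `None`; the function names and the threshold arguments are hypothetical, since the source does not state the threshold values.

```python
from typing import Optional

Row = dict[str, Optional[float]]  # assumed layout: feature name -> value, None if missing


def sample_missing_ratio(row: Row) -> float:
    """Fraction of the features in one sample whose data is missing."""
    return sum(v is None for v in row.values()) / len(row)


def feature_missing_ratio(rows: list[Row], feature: str) -> float:
    """Fraction of the samples in which the given feature has no data."""
    return sum(r[feature] is None for r in rows) / len(rows)


def drop_sparse(rows: list[Row], first_threshold: float, second_threshold: float) -> list[Row]:
    # S201: delete samples whose missing ratio is greater than the first threshold.
    kept = [r for r in rows if sample_missing_ratio(r) <= first_threshold]
    if not kept:
        return []
    # S202: among the remaining samples, delete features whose missing ratio
    # is greater than the second threshold.
    features = [f for f in kept[0] if feature_missing_ratio(kept, f) <= second_threshold]
    return [{f: r[f] for f in features} for r in kept]
```

With 7 of 10 features missing, `sample_missing_ratio` returns 0.7, which matches the worked example in the text.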

S203: in the samples remaining after feature deletion, find the features that are missing feature data and take them as first features.

S204: use multiple imputation to fill in the missing feature data of the first features.

The missing values are filled using the multiple imputation module of IBM SPSS Statistics 23. For example, suppose the blood pressure data of 2 samples is missing. The module imputes it from the patient's other feature data: age: 50; blood lipids: (1) serum total cholesterol 2.9-5.17 mmol/L, (2) serum triglycerides 0.56-1.7 mmol/L, (3) high-density lipoprotein cholesterol 0.94-2.0 mmol/L, (4) low-density lipoprotein cholesterol 2.07-3.12 mmol/L; blood glucose: fasting 7.8-9.0 mmol/L. The blood pressure data of the 2 samples is then filled with the value 100-145 mmHg. The imputation itself is carried out as in the prior art and is not described further here.

Filling in the missing feature data of the samples improves the quality of the samples and therefore of the resulting sample set.

S205: classify the feature data of the samples after missing-value imputation by data type to obtain a classification result.

The classification result comprises discrete feature data and continuous feature data.

S206: according to the classification result, process the discrete feature data and the continuous feature data in the manner corresponding to each data type.

S207: add the processed discrete and continuous feature data to the patient medical database set to form a first database set.

Processing the discrete and continuous feature data according to data type includes: one-hot encoding the discrete feature data, and standardizing the continuous feature data using the z-score (standard normal) method.

The data types of a patient's feature data are discrete and continuous. For example, blood pressure data and heartbeat data are continuous, whereas age data is treated as discrete.

For example, for discrete feature data, code is written to one-hot encode the feature data. Taking the encoding of "age" as an example, the age feature is first divided into equal-frequency segments by sample count, giving 7 intervals: "76 and above", "66-75", "55-65", "46-55", "36-45", "26-35" and "below 25". If a person is 30 years old, the one-hot encoded age value is 0000010. Other discrete features, such as gender, city, occupation, family genetic history, disease history, eating habits, smoking habits, drinking habits and weekly exercise pattern, are one-hot encoded in the same way as age.
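
The age encoding above can be sketched as follows. This is an illustrative sketch only: the interval boundaries are copied from the example rather than derived by equal-frequency binning, and because the listed intervals overlap at 55 and leave 25 unassigned, the lower bounds 56 and 0 used here are assumptions. `AGE_BINS` and `one_hot_age` are hypothetical names.

```python
# Intervals in the order listed in the text, each with an assumed inclusive lower bound.
AGE_BINS = [
    ("76 and above", 76),
    ("66-75", 66),
    ("55-65", 56),   # assumption: 55 is assigned to the "46-55" interval
    ("46-55", 46),
    ("36-45", 36),
    ("26-35", 26),
    ("below 25", 0),  # assumption: ages 0-25 fall in the last interval
]


def one_hot_age(age: int) -> str:
    """Return the one-hot code of an age as a 7-character bit string."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for index, (_, lower) in enumerate(AGE_BINS):
        if age >= lower:
            break
    bits = ["0"] * len(AGE_BINS)
    bits[index] = "1"
    return "".join(bits)


print(one_hot_age(30))  # 0000010
```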

It can be understood that standardizing the continuous feature data with the z-score (standard normal) method is done in this embodiment in the same way as in the prior art and is not described further here. In this embodiment the discrete feature data and the continuous feature data are each processed in the manner suited to their data type, rather than applying a single method to all of the data, which can improve the accuracy of data processing.
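
For reference, z-score standardization of one continuous feature can be sketched as below. The source does not say whether the population or the sample standard deviation is used; the population form is assumed here, and `z_score` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def z_score(values: list[float]) -> list[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Applied to blood pressure readings such as `[110, 120, 130]`, the middle value maps to 0 and the outer values map to symmetric negative and positive scores.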

S208: use undersampling and the SMOTE algorithm to rebalance the samples of the first database set, obtaining a second database set.

In this embodiment the samples of the first database set are rebalanced so that the sample types are distributed more evenly, which yields a more accurate second database set.
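
The rebalancing in S208 can be illustrated with the sketch below. It is a simplified, standard-library rendering of random undersampling and of SMOTE-style interpolation between a minority sample and one of its nearest minority neighbors, not the implementation used in the embodiment; the function names, the neighbor count `k` and the random generator argument are assumptions.

```python
import random


def sq_dist(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def undersample(majority: list[list[float]], n_keep: int, rng: random.Random) -> list[list[float]]:
    """Randomly keep n_keep majority-class samples (without replacement)."""
    return rng.sample(majority, n_keep)


def smote(minority: list[list[float]], n_new: int, k: int, rng: random.Random) -> list[list[float]]:
    """Create n_new synthetic minority samples by interpolating towards near neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

The minority class needs at least two samples for the interpolation step to have a neighbor to draw from.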

S209: use variance analysis to compute the variance of each feature's data in the second database set, and delete the feature data whose variance is below a preset variance threshold, obtaining a third database set.

Deleting the feature data whose variance is below the preset variance threshold removes data that differs little between samples; the second database set after this deletion is taken as the third database set. It can be understood that the larger the variance, the more the samples differ, and the higher the accuracy of distinguishing cardiovascular and cerebrovascular disease samples from non-cardiovascular and cerebrovascular disease samples.
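
A minimal sketch of the variance filter in S209, assuming the data is held as a mapping from feature name to the list of that feature's values across samples. The population variance and the name `drop_low_variance` are assumptions made for illustration.

```python
from statistics import pvariance


def drop_low_variance(columns: dict[str, list[float]], threshold: float) -> dict[str, list[float]]:
    """Keep only the features whose variance is not below the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) >= threshold
    }
```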

S210: use the Relief algorithm to compute the weight of each feature's data remaining after the feature data with variance below the preset variance threshold has been deleted.

S211: according to the weights of the feature data and the score values corresponding to those weights, delete the feature data whose score value is below a preset score threshold, together with the corresponding features, obtaining a fourth database set.

In this embodiment a weight database is built in advance; it holds feature data weights and the score value corresponding to each weight. The weight of each feature's data is looked up in this database to find the corresponding score value, and each feature's data is scored accordingly. The feature data in the third database set whose score value does not exceed the score threshold is deleted together with the corresponding features. The score threshold is a value set on the basis of industry experience.
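
The weight computation of S210 can be sketched with the basic two-class Relief update, in which each feature's weight grows with its difference to the nearest sample of the other class (the "miss") and shrinks with its difference to the nearest sample of the same class (the "hit"). This is a textbook form written under the assumptions that all features are numeric and that every class has at least two samples; the mapping from weights to score values via the weight database is not specified in the source and is not shown.

```python
def relief_weights(samples: list[list[float]], labels: list[int]) -> list[float]:
    """Basic Relief: one pass over all samples, one nearest hit and one nearest miss each."""
    n_features = len(samples[0])
    # Value range of each feature, used to scale differences into [0, 1].
    spans = [(max(col) - min(col)) or 1.0 for col in zip(*samples)]

    def diff(f: int, a: list[float], b: list[float]) -> float:
        return abs(a[f] - b[f]) / spans[f]

    def dist(a: list[float], b: list[float]) -> float:
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    m = len(samples)
    for i, x in enumerate(samples):
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        hit = min(hits, key=lambda s: dist(x, s))
        miss = min(misses, key=lambda s: dist(x, s))
        for f in range(n_features):
            weights[f] += (diff(f, x, miss) - diff(f, x, hit)) / m
    return weights
```

A feature that separates the two classes well ends up with a larger weight than one that varies mostly within a class.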

S212: determine the sample set from the fourth database set using forward selection.

It can be understood that, in determining the sample set by forward selection, an evaluation function can be used to evaluate the feature data corresponding to each feature in the fourth database set, giving an evaluation function value for the feature data of each feature.

In some examples, for the feature data corresponding to one feature, the samples containing feature data with the same evaluation function value may be taken as a model sample set. A model sample set contains multiple samples, and these samples share at least one feature and its feature data. Each model sample set is then evaluated, and the model sample set with the highest evaluation function value is finally selected as the sample set. Evaluating a model sample set may include computing the average of the evaluation function values of all feature data in the set, or computing the average over only the feature data related to cardiovascular and cerebrovascular disease. Existing methods for evaluating sets may also be used and are not described here.

An example follows. Suppose there are 3 items of feature data: blood pressure: 100-145 mmHg; pulse: 60-100 beats/min; blood glucose: fasting 7.8-9.0 mmol/L. The evaluation function values of the three items are 64, 78 and 12 respectively. The samples containing pulse 60-100 beats/min form model sample set 1; the samples containing blood pressure 100-145 mmHg form model sample set 2; and the samples containing fasting blood glucose 7.8-9.0 mmol/L form model sample set 3. The average evaluation function values of the feature data in model sample sets 1, 2 and 3 are 50, 45 and 65 points respectively, so model sample set 3 is taken as the sample set.
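
The final choice in the example, picking the model sample set with the highest average evaluation function value, can be sketched as below. Only the selection step is shown; the evaluation function itself is not defined in the source, and the per-item scores in the usage lines are hypothetical values chosen so that the averages come to 50, 45 and 65.

```python
from statistics import mean


def pick_sample_set(candidate_scores: dict[str, list[float]]) -> str:
    """Return the name of the candidate set with the highest mean evaluation value."""
    return max(candidate_scores, key=lambda name: mean(candidate_scores[name]))


# Hypothetical per-feature scores whose averages are 50, 45 and 65.
candidates = {
    "model sample set 1": [40.0, 60.0],
    "model sample set 2": [45.0, 45.0],
    "model sample set 3": [60.0, 70.0],
}
print(pick_sample_set(candidates))  # model sample set 3
```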

In this embodiment the samples of the patient medical database set and the feature data in those samples are preprocessed, which improves the quality of the samples and therefore of the sample set.

Optionally, step S105, in which the distances between the input sample and the samples in the sample set are computed with the cosine similarity algorithm according to the global metric matrix to form a first distance set, includes:

computing, according to the global metric matrix and using the cosine similarity formula, the distance between the input sample and the samples in the sample set, forming the first distance set;

where i is the index of the input sample and x_i is the i-th input sample; X is the sample set; A is the global metric matrix and M = A^T A; j is the index of a sample in the sample set and x_j is the j-th sample in the sample set; i and j are positive integers; D(x_i, x_j) is the distance between the input sample x_i and the j-th sample of X under the global metric matrix; and A(x_i, x_j) is the distance between x_i and x_j after transformation by the matrix A.
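
The formula itself is not reproduced in this text, so the following is only one plausible reading of the symbol definitions: both vectors are transformed by A and the distance is taken as one minus the cosine similarity of the transformed vectors. Under that assumption the inner product (A x_i)^T (A x_j) equals x_i^T M x_j with M = A^T A. The function names are hypothetical.

```python
import math


def transform(A: list[list[float]], x: list[float]) -> list[float]:
    """Matrix-vector product A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def cosine_distance(A: list[list[float]], x_i: list[float], x_j: list[float]) -> float:
    """Assumed form: 1 - cos(A x_i, A x_j)."""
    u, v = transform(A, x_i), transform(A, x_j)
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def distance_set(A: list[list[float]], x_i: list[float], X: list[list[float]]) -> list[float]:
    """Distances from the input sample to every sample of X, in order."""
    return [cosine_distance(A, x_i, x_j) for x_j in X]
```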

Optionally, step S110, in which the distances between the input sample and the samples in the target local cluster are computed with the cosine similarity algorithm according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm to form a second distance set, includes:

computing the distance between the input sample and the samples in the sample set using the cosine similarity formula;

where the cosine similarity formula is:

the second distance set D2 comprises {D(x_i, x_s1), D(x_i, x_s2), ..., D(x_i, x_si)};

As shown in FIG. 3, a cardiovascular and cerebrovascular disease risk prediction device provided by an embodiment of the present invention comprises:

a set acquisition module 301, configured to acquire a sample set;

the sample set being determined from multiple samples of a labeled patient medical database set; each sample comprising a patient number, features and feature data; the labels comprising a first label and a second label; the first label identifying samples of patients with cardiovascular and cerebrovascular disease; the second label identifying samples of patients without cardiovascular and cerebrovascular disease;

a sample acquisition module 302, configured to acquire one input sample;

the input sample being formed by merging the health examination data and the medical visit data of the patient to be predicted;

a matrix calculation module 303, configured to perform metric learning using the cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

a first local cluster determination module 304, configured to divide the samples in the sample set into a preset number of local clusters using a preset clustering algorithm;

a first distance determination module 305, configured to compute, according to the global metric matrix and using the cosine similarity algorithm, the distances between the input sample and the samples in the sample set, forming a first distance set;

a first sample determination module 306, configured to compute, according to a preset first K value and the first distance set and using the k-nearest neighbor algorithm, the first neighboring samples of the input sample, the number of which equals the first K value;

a second local cluster determination module 307, configured to determine the local clusters in which the first neighboring samples are located;

a target local cluster determination module 308, configured to select, from the local clusters in which the first neighboring samples are located, the local cluster in which the number of first neighboring samples exceeds a first preset threshold, as the target local cluster;

a local cluster assignment module 309, configured to assign the input sample to the target local cluster;

a second distance determination module 310, configured to compute, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster, forming a second distance set;

a second sample determination module 311, configured to determine, in the target local cluster, according to a preset second K value and the second distance set and using the k-nearest neighbor algorithm, the second neighboring samples of the input sample, the number of which equals the second K value;

a statistics module 312, configured to count the number of first labels and the number of second labels among the second neighboring samples;

a label determination module 313, configured to take the first label as the label of the input sample if the ratio of the number of first labels to the number of second labels exceeds a preset label threshold, and otherwise to take the second label as the label of the input sample;

a patient sample determination module 314, configured to determine, according to the label of the input sample, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular disease;

a cardiovascular and cerebrovascular disease patient determination module 315, configured to determine, if the input sample is a sample of a patient with cardiovascular and cerebrovascular disease, that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular disease;

a non-cardiovascular and cerebrovascular disease patient determination module 316, configured to determine, if the input sample is not a sample of a patient with cardiovascular and cerebrovascular disease, that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular disease.
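
The neighbor search, target cluster selection and label vote performed by modules 306 to 313 can be sketched as follows. This is an illustration under several assumptions: distances are already computed, labels are encoded as 1 (first label) and 2 (second label), the target cluster is taken to be the most frequent cluster among the first neighbors provided its count exceeds the threshold, and a vote with no second labels is decided by whether any first label is present. None of these tie-breaking or edge-case rules is stated in the source, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def k_nearest(distances: list[float], k: int) -> list[int]:
    """Indices of the k smallest distances, nearest first."""
    return sorted(range(len(distances)), key=distances.__getitem__)[:k]


def target_cluster(neighbour_idx: list[int], cluster_of: list[str], min_count: int) -> Optional[str]:
    """Cluster holding the most first neighbors, if that count exceeds min_count."""
    counts = Counter(cluster_of[i] for i in neighbour_idx)
    cluster, count = counts.most_common(1)[0]
    return cluster if count > min_count else None


def vote_label(neighbour_labels: list[int], label_threshold: float) -> int:
    """First label (1) if the ratio of first to second labels exceeds the threshold."""
    first = neighbour_labels.count(1)
    second = neighbour_labels.count(2)
    if second == 0:
        # Assumption: the ratio is treated as unbounded when no second label is present.
        return 1 if first > 0 else 2
    return 1 if first / second > label_threshold else 2
```

In a full pipeline, `k_nearest` would be called once on the first distance set with the first K value and once on the second distance set with the second K value, with `target_cluster` deciding which local cluster the second search runs in.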

Optionally, the cardiovascular and cerebrovascular disease risk prediction device provided by the embodiment of the present invention further comprises:

a high-risk determination module, configured to determine, according to the patient's health follow-up data, whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

a hospitalization suggestion module, configured to recommend hospitalization to the patient to be predicted if that patient is a high-risk cardiovascular and cerebrovascular disease patient;

an increased physical examination suggestion module, configured to recommend a higher physical examination frequency to the patient to be predicted if that patient is not a high-risk cardiovascular and cerebrovascular disease patient;

a health determination module, configured to determine, according to the patient's health follow-up data, whether the patient to be predicted is a healthy user;

a normal physical examination suggestion module, configured to recommend, if the patient to be predicted is a healthy user, that the patient maintain the normal physical examination frequency;

a missed diagnosis patient determination module, configured to mark the patient to be predicted as a missed diagnosis patient if that patient is not a healthy user, and to add the feature data of the missed diagnosis patient to the patient medical database set;

Optionally, the set acquisition module 301 comprises:

a feature deletion submodule, configured to find, among the samples remaining after sample deletion, the features whose feature missing value is greater than the second threshold and to delete those features;

a missing value imputation submodule, configured to fill in the missing feature data of the first features using multiple imputation;

a data processing submodule, configured to process, according to the classification result, the discrete feature data and the continuous feature data in the manner corresponding to each data type;

a variance deletion submodule, configured to compute the variance of each feature's data in the second database set using variance analysis and to delete the feature data whose variance is below the preset variance threshold, obtaining a third database set;

a weight calculation submodule, configured to compute the weight of each feature's data in the third database set using the Relief algorithm;

a score deletion submodule, configured to delete, according to the weights of the feature data and the score values corresponding to those weights, the feature data in the third database set whose score value is below the preset score threshold, together with the corresponding features, obtaining a fourth database set;

a set determination submodule, configured to determine the sample set from the fourth database set using forward selection.

The cardiovascular and cerebrovascular disease risk prediction device of this embodiment further comprises:

a normal physical examination suggestion module, configured to recommend, if the patient to be predicted is a healthy user, that the patient maintain the normal physical examination frequency;

a missed diagnosis patient determination module, configured to mark the patient to be predicted as a missed diagnosis patient if that patient is not a healthy user, and to add the feature data of the missed diagnosis patient to the patient medical database set;

Optionally, the first distance determination module is specifically configured to: compute, according to the global metric matrix and using the cosine similarity formula, the distance between the input sample and the samples in the sample set, forming the first distance set;

the first distance set D1 comprises {D(x_i, x_1), D(x_i, x_2), D(x_i, x_3), ..., D(x_i, x_j)};

Optionally, the second distance determination module is specifically configured to:

An embodiment of the present invention further provides an electronic device which, as shown in FIG. 4, comprises a processor 401, a communication interface 402, a memory 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with one another via the communication bus 404;

the memory 403 is configured to store a computer program;

the processor 401 is configured to implement the following steps when executing the program stored in the memory 403:

obtaining a sample set; the sample set being determined from multiple samples of a labeled patient medical database set; each sample comprising a patient number, features and feature data; the labels comprising a first label and a second label; the first label identifying samples of patients with cardiovascular and cerebrovascular disease; the second label identifying samples of patients without cardiovascular and cerebrovascular disease;

obtaining one input sample; the input sample being formed by merging the health examination data and the medical visit data of the patient to be predicted;

performing metric learning using the cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

computing, according to the global metric matrix and using the cosine similarity algorithm, the distances between the input sample and the samples in the sample set, forming a first distance set;

computing, according to a preset first K value and the first distance set and using the k-nearest neighbor algorithm, the first neighboring samples of the input sample, the number of which equals the first K value;

selecting, from the local clusters in which the first neighboring samples are located, the local cluster in which the number of first neighboring samples exceeds a first preset threshold, as the target local cluster;

assigning the input sample to the target local cluster;

computing, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster, forming a second distance set;

determining, in the target local cluster, according to a preset second K value and the second distance set and using the k-nearest neighbor algorithm, the second neighboring samples of the input sample, the number of which equals the second K value;

determining, according to the label of the input sample, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular disease;

The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above electronic device and other devices.

The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In a further embodiment of the present invention, a computer-readable storage medium is provided. The storage medium stores instructions which, when run on a computer, cause the computer to execute the cardiovascular and cerebrovascular disease risk prediction method of any of the above embodiments.

In a further embodiment of the present invention, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the cardiovascular and cerebrovascular disease risk prediction method of any of the above embodiments.

It should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.

The embodiments in this specification are described in a related manner; for parts that are the same or similar between embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A cardiovascular and cerebrovascular disease risk prediction method, characterized in that the method comprises:

obtaining a sample set; the sample set being determined from multiple samples of a labeled patient medical database set; each sample comprising a patient number, features and feature data; the labels comprising a first label and a second label; the first label identifying samples of patients with cardiovascular and cerebrovascular disease; the second label identifying samples of patients without cardiovascular and cerebrovascular disease;

obtaining one input sample; the input sample being formed by merging the health examination data and the medical visit data of the patient to be predicted;

performing metric learning using the cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

dividing the samples in the sample set into a preset number of local clusters using a preset clustering algorithm;

computing, according to the global metric matrix and using the cosine similarity algorithm, the distances between the input sample and the samples in the sample set, forming a first distance set;

computing, according to a preset first K value and the first distance set and using the k-nearest neighbor algorithm, the first neighboring samples of the input sample, the number of which equals the first K value;

determining the local clusters in which the first neighboring samples are located;

selecting, from the local clusters in which the first neighboring samples are located, the local cluster in which the number of first neighboring samples exceeds a first preset threshold, as the target local cluster;

assigning the input sample to the target local cluster;

According to the Local Metric matrix for the target part cluster that COS-LMNN algorithms learn, cosine similarity algorithm is used Calculate, the input sample in the cluster of the target part at a distance from sample, composition second distance set；

In the cluster of the target part, determined using k nearest neighbor algorithms according to preset 2nd K values and the second distance set 2nd K values second of the input sample are adjacent to sample；

First label number and second label number of the statistics second adjacent to sample；

If second is more than default label threshold value adjacent to the ratio of the first label number and the second label number of sample, by the Label of one label as input sample, otherwise using the second label as the label of input sample；

According to the label of the input sample, determine input sample whether be Patients with Cardiovascular/Cerebrovascular Diseases sample；

If input sample is the sample of Patients with Cardiovascular/Cerebrovascular Diseases, it is determined that the patient to be predicted in input sample is heart and brain blood Pipe Disease；

If input sample is not the sample of Patients with Cardiovascular/Cerebrovascular Diseases, it is determined that the patient to be predicted in input sample is not the heart Cerebrovascular patients.
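The two-stage neighbor search recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the metric matrices and cluster assignments are assumed to be already available (COS-LMNN training and clustering are not shown), the distance is assumed to be one minus the cosine similarity of the transformed vectors, labels are encoded as 1 (first label) and 0 (second label), and the names `cosine_distance` and `predict_label` are hypothetical. When several clusters exceed the first preset threshold, the sketch picks the most populated one, which is a choice the claim does not specify.

```python
import numpy as np

def cosine_distance(A, x, Y):
    """Return 1 - cos(A x, A y) for every row y of Y (assumed distance form)."""
    tx = A @ x        # transform the input sample
    tY = Y @ A.T      # transform every stored sample
    cos = (tY @ tx) / (np.linalg.norm(tY, axis=1) * np.linalg.norm(tx))
    return 1.0 - cos

def predict_label(x, X, y, clusters, A_global, A_local,
                  k1, k2, cluster_threshold, label_threshold):
    # Stage 1: k1 nearest neighbors over the whole sample set (global metric).
    d1 = cosine_distance(A_global, x, X)
    first = np.argsort(d1)[:k1]

    # Count how many first neighbors fall into each local cluster.
    ids, counts = np.unique(clusters[first], return_counts=True)
    best = int(np.argmax(counts))
    if counts[best] <= cluster_threshold:
        return None   # no cluster qualifies; this case is outside the claim
    target = int(ids[best])

    # Stage 2: k2 nearest neighbors inside the target cluster (local metric).
    members = np.flatnonzero(clusters == target)
    d2 = cosine_distance(A_local[target], x, X[members])
    second = members[np.argsort(d2)[:k2]]

    # Ratio of first-label to second-label neighbors decides the label.
    n_first = int(np.sum(y[second] == 1))
    n_second = int(np.sum(y[second] == 0))
    ratio = n_first / n_second if n_second else float("inf")
    return 1 if ratio > label_threshold else 0
```

Here `A_local` is assumed to be a mapping from cluster id to that cluster's metric matrix. Passing identity matrices reduces both stages to plain cosine distance.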

2. The method according to claim 1, characterized in that:

after the step of determining that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient, the method further comprises:

determining, according to health follow-up data of the patient, whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

if the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient, recommending hospitalization to the patient to be predicted;

if the patient to be predicted is not a high-risk cardiovascular and cerebrovascular disease patient, recommending an increased physical examination frequency to the patient to be predicted;

after the step of determining that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient, the method further comprises:

determining, according to the health follow-up data of the patient, whether the patient to be predicted is a healthy user;

if the patient to be predicted is a healthy user, recommending that the patient maintain the regular physical examination frequency;

if the patient to be predicted is not a healthy user, marking the patient to be predicted as a missed-diagnosis patient and adding the feature data of the missed-diagnosis patient to the patient medical database set;

wherein the missed-diagnosis patient is a cardiovascular and cerebrovascular disease patient.
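The follow-up branching of claim 2 amounts to a small decision table. A minimal sketch is given below; the function name `follow_up_action`, its boolean parameters, and the returned strings are all hypothetical, and how "high-risk" or "healthy" is derived from the follow-up data is not modeled here.

```python
def follow_up_action(predicted_patient, high_risk=False, healthy=True):
    """Map the prediction and follow-up findings to the action named in claim 2."""
    if predicted_patient:
        # Predicted to be a cardiovascular and cerebrovascular disease patient.
        if high_risk:
            return "recommend hospitalization"
        return "recommend increased physical examination frequency"
    # Predicted not to be a patient.
    if healthy:
        return "recommend regular physical examination frequency"
    # Follow-up contradicts the prediction: treat as a missed diagnosis.
    return "mark as missed diagnosis and add feature data to the database"
```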

3. The method according to claim 1, characterized in that:

the first label identifying cardiovascular and cerebrovascular disease patient samples comprises:

determining identification information of cardiovascular and cerebrovascular disease patients according to collected health follow-up data of patients;

wherein the health follow-up data of a patient comprises the patient number, features, feature data, and confirmed illness; and the identification information comprises the confirmed illness and the features and feature data corresponding to the confirmed illness;

determining the cardiovascular and cerebrovascular disease patient samples in the medical database set according to the identification information of the cardiovascular and cerebrovascular disease patients;

setting the first label on the cardiovascular and cerebrovascular disease patient samples;

the second label identifying non-cardiovascular and cerebrovascular disease patient samples comprises:

setting the second label on the samples other than the cardiovascular and cerebrovascular disease patient samples.

4. The method according to claim 1, characterized in that obtaining the sample set comprises:

deleting, from the plurality of samples of the labeled patient medical database set, the samples whose sample missing value exceeds a first threshold;

wherein the sample missing value is the ratio of the number of features missing in a sample to the total number of features in that sample;

searching the plurality of samples remaining after sample deletion for features whose feature missing value exceeds a second threshold, and deleting those features;

wherein the feature missing value is the ratio of the number of instances of a given feature that lack feature data, across the plurality of samples, to the total number of instances of that feature;

searching the plurality of samples remaining after feature deletion for features with missing feature data, as first features;

filling in the missing feature data of the first features using multiple imputation;

classifying, by data type, the feature data of the plurality of samples after missing-value filling to obtain a classification result;

wherein the classification result comprises discrete feature data and continuous feature data;

processing the discrete feature data and the continuous feature data, according to the classification result, in a manner corresponding to their data types;

adding the correspondingly processed discrete feature data and continuous feature data to the patient medical database set to form a first database set;

wherein processing the discrete feature data and the continuous feature data in a manner corresponding to their data types comprises: applying one-hot encoding to the discrete feature data; and standardizing the continuous feature data using the normal standardization (z-score) method;

applying imbalance handling to the samples of the first database set using undersampling and the SMOTE algorithm to obtain a second database set;

calculating the variance of the data of each feature in the second database set using a variance analysis method, and deleting the feature data whose variance is below a preset variance threshold to obtain a third database set;

calculating the weight of each item of feature data in the third database set using the Relief algorithm;

deleting from the third database set, according to the weights of the feature data and the score values corresponding to those weights, the feature data whose score value is below a preset score threshold together with the corresponding features, to obtain a fourth database set; and

determining the sample set from the fourth database set using a forward selection method.
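Some of the simpler preprocessing steps of claim 4 can be sketched with NumPy alone. The sketch below covers only missing-ratio filtering, one-hot encoding, z-score standardization, and variance filtering; multiple imputation, undersampling with SMOTE, Relief weighting, and forward selection are not shown. Missing values are assumed to be encoded as NaN, and all function names are hypothetical.

```python
import numpy as np

def drop_sparse_rows_and_columns(X, row_threshold, col_threshold):
    """Delete samples, then features, whose missing ratio exceeds the thresholds."""
    row_missing = np.isnan(X).mean(axis=1)   # sample missing value per row
    X = X[row_missing <= row_threshold]
    col_missing = np.isnan(X).mean(axis=0)   # feature missing value per column
    keep = col_missing <= col_threshold
    return X[:, keep], keep

def one_hot(col):
    """One-hot encode a discrete column; categories are ordered by np.unique."""
    values = np.unique(col)
    return (col[:, None] == values[None, :]).astype(float)

def z_score(col):
    """Standardize a continuous column; a constant column maps to zeros (a choice)."""
    std = col.std()
    if std == 0:
        return np.zeros_like(col, dtype=float)
    return (col - col.mean()) / std

def drop_low_variance(X, var_threshold):
    """Delete features whose variance is below the preset variance threshold."""
    keep = X.var(axis=0) >= var_threshold
    return X[:, keep], keep
```

The row filter runs before the column filter, matching the order in the claim, so feature missing ratios are computed only over the samples that survive sample deletion.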

5. The method according to claim 1, characterized in that:

calculating, according to the global metric matrix and using the cosine similarity algorithm, the distances between the input sample and the samples in the sample set to form the first distance set comprises:

calculating, according to the global metric matrix and using a cosine similarity formula, the distances between the input sample and the samples in the sample set to form the first distance set;

wherein the cosine similarity formula is:

the first distance set D1 comprises: {D(x_i, x_1), D(x_i, x_2), D(x_i, x_3), ..., D(x_i, x_j)};

wherein i denotes the index of the input sample and x_i denotes the i-th input sample; X is the sample set; A is the global metric matrix; M = A^T A; j denotes the index of a sample in the sample set and x_j denotes the j-th sample in the sample set; i and j are positive integers; D(x_i, x_j) denotes the distance, under the global metric matrix, between the input sample x_i and the j-th sample in X; and A(x_i, x_j) denotes the distance between x_i and x_j after transformation by the matrix A.
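For orientation only, one common way to write a cosine-based distance under a learned linear metric is shown below. This is an assumed form for illustration, not a reproduction of the claimed formula; in particular, taking the distance as one minus the similarity is an assumption. The second equality follows directly from the definition M = A^T A given in the claim.

```latex
D(x_i, x_j)
  = 1 - \frac{(A x_i)^{\top} (A x_j)}{\lVert A x_i \rVert \, \lVert A x_j \rVert}
  = 1 - \frac{x_i^{\top} M x_j}{\sqrt{x_i^{\top} M x_i}\,\sqrt{x_j^{\top} M x_j}},
  \qquad M = A^{\top} A .
```

Under this reading, the distance depends on A only through M, so it can be evaluated without forming the transformed vectors explicitly.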

6. The method according to claim 1, characterized in that:

calculating, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster to form the second distance set comprises:

calculating, according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using a cosine similarity formula, the distances between the input sample and the samples in the sample set to form the second distance set;

wherein the cosine similarity formula is:

the second distance set D2 comprises: {D(x_i, x_s1), D(x_i, x_s2), ..., D(x_i, x_si)};

wherein i denotes the index of the input sample and x_i denotes the i-th input sample; x_si denotes a sample belonging to the same cluster as x_i; A_S is the local metric matrix; M_S = A_S^T A_S; i is a positive integer; D(x_i, x_si) denotes the distance, under the local metric matrix, between the input sample x_i and a sample in the target local cluster belonging to the same cluster as x_i; and A_S(x_i, x_si) denotes the distance between x_i and x_si after transformation by the matrix A_S.

7. A cardiovascular and cerebrovascular disease risk prediction device, characterized in that the device comprises:

a set acquisition module, configured to obtain a sample set;

wherein the sample set is determined from a plurality of samples of a labeled patient medical database set; each sample comprises a patient number, features, and feature data; the labels comprise a first label and a second label; the first label identifies cardiovascular and cerebrovascular disease patient samples; and the second label identifies non-cardiovascular and cerebrovascular disease patient samples;

a sample acquisition module, configured to obtain an input sample;

wherein the input sample is formed by merging the medical and health physical examination data of a patient to be predicted with the medical data from the patient's clinical visits;

a matrix calculation module, configured to perform metric learning using the cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

a first local cluster determination module, configured to divide the samples in the sample set into a preset number of local clusters using a preset clustering algorithm;

a first distance determination module, configured to calculate, according to the global metric matrix and using a cosine similarity algorithm, the distances between the input sample and the samples in the sample set to form a first distance set;

a first sample determination module, configured to calculate, according to a preset first K value and the first distance set and using the k-nearest-neighbor algorithm, a number of first neighbor samples of the input sample equal to the first K value;

a second local cluster determination module, configured to determine the local clusters in which the first neighbor samples are located;

a target local cluster determination module, configured to select, from the local clusters in which the first neighbor samples are located, a local cluster in which the number of first neighbor samples exceeds a first preset threshold as a target local cluster;

a local cluster division module, configured to assign the input sample to the target local cluster;

a second distance determination module, configured to calculate, according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm and using the cosine similarity algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;

a second sample determination module, configured to determine, in the target local cluster, according to a preset second K value and the second distance set and using the k-nearest-neighbor algorithm, a number of second neighbor samples of the input sample equal to the second K value;

a statistics module, configured to count the number of first labels and the number of second labels among the second neighbor samples;

a label determination module, configured to take the first label as the label of the input sample if the ratio of the number of first labels to the number of second labels among the second neighbor samples exceeds a preset label threshold, and otherwise to take the second label as the label of the input sample;

a patient sample determination module, configured to determine, according to the label of the input sample, whether the input sample is a sample of a cardiovascular and cerebrovascular disease patient;

a cardiovascular and cerebrovascular disease patient determination module, configured to determine, if the input sample is a sample of a cardiovascular and cerebrovascular disease patient, that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient; and

a non-cardiovascular and cerebrovascular disease patient determination module, configured to determine, if the input sample is not a sample of a cardiovascular and cerebrovascular disease patient, that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient.

8. The device according to claim 7, characterized in that the device further comprises:

a high-risk determination module, configured to determine, according to health follow-up data of the patient, whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

a hospitalization recommendation module, configured to recommend hospitalization to the patient to be predicted if the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;

an increased physical examination recommendation module, configured to recommend an increased physical examination frequency to the patient to be predicted if the patient to be predicted is not a high-risk cardiovascular and cerebrovascular disease patient;

a health determination module, configured to determine, according to the health follow-up data of the patient, whether the patient to be predicted is a healthy user;

a normal physical examination recommendation module, configured to recommend that the patient maintain the normal physical examination frequency if the patient to be predicted is a healthy user; and

a missed-diagnosis patient determination module, configured to mark the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and to add the feature data of the missed-diagnosis patient to the patient medical database set.

9. The device according to claim 7, characterized in that the set acquisition module comprises:

a sample deletion submodule, configured to delete, from the plurality of samples of the labeled patient medical database set, the samples whose sample missing value exceeds a first threshold;

a feature deletion submodule, configured to search the plurality of samples remaining after sample deletion for features whose feature missing value exceeds a second threshold, and to delete those features;

a first feature submodule, configured to search the plurality of samples remaining after feature deletion for features with missing feature data, as first features;

a missing data filling submodule, configured to fill in the missing feature data of the first features using multiple imputation;

a data classification submodule, configured to classify, by data type, the feature data of the plurality of samples after missing-value filling to obtain a classification result;

a data processing submodule, configured to process the discrete feature data and the continuous feature data, according to the classification result, in a manner corresponding to their data types;

a set update submodule, configured to add the correspondingly processed discrete feature data and continuous feature data to the patient medical database set to form a first database set;

an imbalance handling submodule, configured to apply imbalance handling to the samples of the first database set using undersampling and the SMOTE algorithm to obtain a second database set;

a variance deletion submodule, configured to calculate the variance of the data of each feature in the second database set using a variance analysis method, and to delete the feature data whose variance is below a preset variance threshold to obtain a third database set;

a weight calculation submodule, configured to calculate the weight of each item of feature data in the third database set using the Relief algorithm;

a score deletion submodule, configured to delete from the third database set, according to the weights of the feature data and the score values corresponding to those weights, the feature data whose score value is below a preset score threshold together with the corresponding features, to obtain a fourth database set; and

a set determination submodule, configured to determine the sample set from the fourth database set using a forward selection method.

10. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;

the memory is configured to store a computer program; and

the processor is configured to implement the method steps of any one of claims 1-6 when executing the program stored in the memory.