CN115337000A

CN115337000A - Machine learning method for evaluating brain aging caused by diseases based on brain structure images

Info

Publication number: CN115337000A
Application number: CN202211276691.9A
Authority: CN
Inventors: 张瑜; 王凯凯; 孙超良; 张欢; 王志超; 钱浩天
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2022-10-19
Filing date: 2022-10-19
Publication date: 2022-11-15
Anticipated expiration: 2042-10-19
Also published as: CN115337000B

Abstract

The invention discloses a machine learning method for evaluating disease-induced brain aging based on brain structural images. Structural features of different brain regions, such as cortical thickness and volume, are extracted from structural magnetic resonance images of the brain. Because not all features help model prediction, the features are screened, and the retained features, which generalize across different training subsets and are more parsimonious and effective, are used to build a brain age prediction model based on ridge regression. K-fold cross-validation is used to find the features that are repeatedly identified across the k models, locating the brain-region structural features most relevant to brain age prediction. Finally, the trained model is applied to patient data to assess the degree to which the disease affects brain aging.

Description

A Machine Learning Approach to Assess Disease-Induced Brain Aging Based on Brain Structural Imaging

Technical Field

The invention relates to the technical field of neuroimaging data analysis, and in particular to a machine learning method for evaluating disease-induced brain aging based on brain structural images.

Background Art

Brain aging is a natural process, but the accompanying changes in brain volume, cortical thickness, and similar measures show marked individual differences. The brain follows a specific developmental pattern during normal aging, which means that age can be predicted from that pattern. Brain age prediction has both scientific significance and broad clinical value. Studies have shown that many types of neurological and metabolic conditions, such as schizophrenia, metabolic disorders, diabetes, and reduced cardiac function, are associated with brain aging, and the brain age produced by a brain age prediction model has been used effectively to assess the association between these diseases and brain aging. If a subject's brain age is greater than their chronological age, the subject may have an older brain; that is, the brain has aged more than expected, deviates from the normal aging trajectory, and carries a higher risk of related diseases. Conversely, if a subject's brain age is lower than their chronological age, the subject may have a younger brain. The difference between an individual's predicted brain age and chronological age is called the "brain age gap". This value is thought to reflect diffuse, multivariate morphological changes across the whole brain and may be a marker of overall brain health. The degree to which an individual's brain aging trajectory deviates from the average trajectory of healthy brain aging can reflect that individual's future risk of neurodegenerative disease. Therefore, building models on the brain aging patterns contained in neuroimaging data, and using them to detect an individual's brain aging trajectory, offers a new way to study how the brain changes during aging and how brain diseases affect normal brain aging.

Structural magnetic resonance imaging (sMRI) offers a unique, non-invasive way to study brain structure, and the imaging data of healthy subjects and patients show different structural features. These features can be used to predict a subject's brain age, that is, to estimate the biological age of the brain. Because different brain atlases parcellate the whole brain on different bases, they also differ in how well they support brain age prediction.

In recent years, with the development of artificial intelligence, machine-learning-based prediction of individual brain age from imaging data has become a key direction in clinical research. It can improve the diagnosis rate of diseases and also provide useful guidance for formulating treatment plans. Imaging-based brain age studies commonly use regression models such as support vector regression (SVR), ridge regression, and random forest (RF). Ridge regression is a biased estimation method designed for the analysis of collinear data. It is essentially a modified least squares estimator: by giving up the unbiasedness of ordinary least squares, and at the cost of losing some information and some precision, it yields regression coefficients that are more realistic and more reliable. These studies confirm that brain age prediction algorithms based on structural MRI information are of great help to clinical diagnosis. However, the brain-age-related features extracted from structural MRI, such as cortical thickness, cortical area, and curvature, form a high-dimensional feature set when combined, and high-dimensional features require feature selection. Imaging-based brain age studies currently often reduce dimensionality with principal component analysis, partial least squares, and similar methods, but these methods are ambiguous and poorly interpretable, so the choice of feature selection algorithm is crucial for brain age prediction. In addition, current machine-learning brain age models rarely locate the brain-region features related to brain age, even though locating the key regions that affect brain age matters greatly for model interpretability and clinical diagnosis. Finally, measuring the degree of brain aging by the difference between brain age and chronological age helps the early detection and differentiation of diseases. A new prediction method is therefore urgently needed to solve these problems.

Summary of the Invention

The purpose of the present invention is to address the deficiencies of the prior art by proposing a machine learning method for evaluating disease-induced brain aging based on brain structural images. The method can select features that generalize across different training subsets and are more parsimonious and effective, and it can also locate the brain-region structural features that contribute most to brain age prediction. Finally, the trained model is applied to patient data to assess the degree to which the disease affects brain aging.

The present invention is achieved through the following technical solution: a machine learning method for evaluating disease-induced brain aging based on brain structural images, comprising the following steps:

S1: preprocess the structural magnetic resonance images of the brain and extract features;

S2: screen the brain structural features extracted in step S1 using the bootstrap method;

S3: build a ridge regression brain age prediction model from the screening results of S2;

S4: use k-fold cross-validation to locate the brain regions that contribute most to brain age prediction;

S5: train and test the constructed brain age prediction model;

S6: test on an independent patient-group dataset to assess the degree to which the disease affects brain aging.

Preferably, step S1 includes the following sub-steps:

S1.1: divide the collected clinical data into type 0 patients, type 1 patients, and healthy subjects according to clinical presentation, and remove the non-brain structures from their structural magnetic resonance images;

S1.2: extract the magnetic resonance image of the target tissue according to the anatomical structure of the brain;

S1.3: use an image segmentation algorithm to segment the magnetic resonance image of the target tissue into three tissues: gray matter, white matter, and cerebrospinal fluid;

S1.4: perform cortical reconstruction on the segmented tissues with the FreeSurfer software package, quantify the functional, connectivity, and structural properties of the human brain, reconstruct the structural image in three dimensions, generate flattened or inflated images, and use different brain atlases to obtain the anatomical parameters of each brain region: cortical thickness, curvature, area, and gray matter volume.
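As a rough illustration of S1.4, the sketch below only builds (and does not run) command lines for two FreeSurfer tools: recon-all, which runs the reconstruction stream, and aparcstats2table, which collects a per-region cortical measure into a table. The text names FreeSurfer but gives no commands, so the choice of these tools, the helper names recon_all_cmd and stats_table_cmd, the subject IDs, and the file names are all assumptions. aparc.a2009s is FreeSurfer's name for the Destrieux parcellation; atlases that are not bundled with FreeSurfer would need their own annotation files, which are not shown here.

```python
import shlex


def recon_all_cmd(subject_id, t1_path):
    """Build the command for FreeSurfer's full reconstruction of one subject."""
    return ["recon-all", "-s", subject_id, "-i", t1_path, "-all"]


def stats_table_cmd(subjects, hemi, measure, parc, out_file):
    """Build the command that tabulates one cortical measure per atlas region."""
    return ["aparcstats2table", "--subjects", *subjects,
            "--hemi", hemi, "--meas", measure,
            "--parc", parc, "--tablefile", out_file]


# Hypothetical subject IDs and file names, for illustration only.
cmd_recon = recon_all_cmd("sub-001", "sub-001_T1w.nii.gz")
cmd_table = stats_table_cmd(["sub-001", "sub-002"], "lh", "thickness",
                            "aparc.a2009s", "lh_thickness_destrieux.txt")

# Print the commands instead of executing them; running them requires a
# FreeSurfer installation and real T1-weighted images.
print(shlex.join(cmd_recon))
print(shlex.join(cmd_table))
```

Keeping the commands as argument lists makes them easy to pass to subprocess.run later without shell quoting problems.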

Preferably, step S2 includes the following sub-steps:

S2.1: for the healthy-group dataset, set the proportion of the dataset that each sampled subset covers;

S2.2: set the number of draws and the sampling scheme, that is, how many rounds of sampling with replacement to perform;

S2.3: sample according to the proportion set in S2.1 and the number and scheme set in S2.2; for each sampled subset, compute the Pearson correlation between features such as cortical thickness and cortical surface area and brain age, keep the structural features with a significant correlation as candidate features, and count how often each of these features appears across the draws;

S2.4: set a frequency threshold for the features extracted from the sampled subsets, and take the features that meet this threshold as the final features of the model.
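The screening in S2.1 to S2.4 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name bootstrap_screen, its parameter names and defaults, and the synthetic data are all assumptions. The replace flag is exposed because the sub-step list describes sampling with replacement, while the detailed description speaks of sampling without replacement.

```python
import numpy as np
from scipy.stats import pearsonr


def bootstrap_screen(X, age, ratio=0.8, n_draws=100, min_count=100,
                     p_thresh=0.05, replace=True, seed=0):
    """Count, per feature, the draws in which it correlates significantly
    with age, and return the features reaching `min_count` draws."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subset_size = int(round(ratio * n_samples))
    counts = np.zeros(n_features, dtype=int)
    for _ in range(n_draws):
        # Draw one subset of subject indices.
        idx = rng.choice(n_samples, size=subset_size, replace=replace)
        for j in range(n_features):
            _, p_value = pearsonr(X[idx, j], age[idx])
            if p_value < p_thresh:
                counts[j] += 1
    selected = np.flatnonzero(counts >= min_count)
    return selected, counts


# Synthetic data (not from the patent): features 0 and 1 track age,
# the remaining features are pure noise.
rng = np.random.default_rng(42)
age = rng.uniform(20, 80, size=120)
X = rng.normal(size=(120, 6))
X[:, 0] += 0.1 * age
X[:, 1] -= 0.1 * age

selected, counts = bootstrap_screen(X, age, n_draws=50, min_count=50)
print("selected feature indices:", selected)
print("significant-draw counts:", counts)
```

Requiring a feature to be significant in every draw (min_count equal to n_draws) is the strictest setting; lowering min_count trades stability for a larger feature set.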

Preferably, step S3 includes the following sub-steps:

S3.1: randomly split the healthy-group dataset into a training set and a test set at a set ratio;

S3.2: standardize the feature data of the training set and the test set, which brings data from different sources into a common frame of reference for comparison and speeds up convergence, since most models converge faster on normalized data;

S3.3: define the range of values for the alpha parameter of the ridge regression model;

S3.4: define R² as the cross-validation evaluation metric, and use k-fold cross-validation to search this range for the optimal alpha, that is, the value that gives the model its highest accuracy;

S3.5: use the optimal parameter as the model parameter under the brain template.
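A minimal sketch of S3.1 to S3.5 with scikit-learn follows. The text does not name a library beyond the coef_ identifier, so the use of scikit-learn, the 80/20 split, k = 5, and the synthetic data are assumptions. The alpha grid in the text elides its middle values, so the list below is a placeholder rather than the patent's grid.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the healthy-group features (not real data):
# every feature carries a weak age signal plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)
X = rng.normal(size=(200, 10)) + 0.05 * age[:, None]

# S3.1: random train/test split (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, age, test_size=0.2, random_state=0)

# S3.2: standardization is placed inside the pipeline so the scaler is
# fitted on training folds only and then applied to held-out data.
pipe = make_pipeline(StandardScaler(), Ridge())

# S3.3: placeholder alpha grid (hypothetical values).
alphas = [0.01, 0.1, 1, 3, 10, 30, 60]

# S3.4: k-fold cross-validated search scored by R^2 (k = 5 is an assumption).
search = GridSearchCV(pipe, {"ridge__alpha": alphas}, scoring="r2", cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(X_test, y_test)
print("best alpha:", best_alpha)
print("test R^2:", round(test_r2, 3))
```

Putting the scaler in the pipeline is a design choice: it keeps test-fold statistics out of the training step, which would otherwise bias the cross-validated R².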

Preferably, step S4 includes the following sub-steps:

S4.1: perform k-fold cross-validation on the training set, with R² as the cross-validation evaluation metric;

S4.2: obtain the feature weights of the k models from the coef_ attribute of the ridge regression model, sort each model's weights from largest to smallest, and take the structural features corresponding to the top h weights of each of the k models;

S4.3: identify the features that recur across the k models, and thereby locate the brain-region structural features that contribute most to brain age prediction.
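The localization in S4.1 to S4.3 can be sketched as below. Two choices here are assumptions, not statements from the text: weights are ranked by absolute value (the text only says they are sorted from largest to smallest), and "recurring" is read as "present in the top h of all k models". The function name recurring_top_features and the synthetic data are also illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


def recurring_top_features(X, y, feature_names, k=5, h=3, alpha=1.0, seed=0):
    """Fit one ridge model per fold and return the feature names that are
    among the h largest-magnitude weights in every fold."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    top_sets = []
    for train_idx, _ in kf.split(X):
        scaler = StandardScaler().fit(X[train_idx])
        model = Ridge(alpha=alpha).fit(scaler.transform(X[train_idx]),
                                       y[train_idx])
        # Indices of the h weights with the largest absolute value.
        order = np.argsort(-np.abs(model.coef_))[:h]
        top_sets.append({feature_names[j] for j in order})
    return set.intersection(*top_sets)


# Synthetic data (not from the patent): only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=150)
names = [f"f{j}" for j in range(6)]

recurring = recurring_top_features(X, y, names, k=5, h=2)
print(sorted(recurring))
```

Because the features are standardized before fitting, the weight magnitudes are comparable across features, which is what makes the ranking meaningful.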

Preferably, step S5 includes the following sub-steps:

S5.1: select the model with the best test score in k-fold cross-validation under the n brain templates, and retrain it on the entire training set;

S5.2: test the model on the test set to obtain the predicted brain age;

S5.3: compute the MAE, R², Pearson correlation coefficient, and mean error between actual age and predicted brain age as the model's evaluation metrics, and finally select the brain age prediction model built on the Brainnetome atlas as the optimal model.
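The four metrics named in S5.3 can be computed as in the sketch below. The function name brain_age_metrics and the toy ages are hypothetical, and "mean error" is interpreted here as the mean of predicted minus actual age (a signed quantity); the text does not define it, so that reading is an assumption.

```python
import numpy as np


def brain_age_metrics(age_true, age_pred):
    """Return MAE, R^2, Pearson r, and mean signed error (pred - true)."""
    age_true = np.asarray(age_true, dtype=float)
    age_pred = np.asarray(age_pred, dtype=float)
    err = age_pred - age_true
    ss_res = np.sum(err ** 2)                              # residual sum of squares
    ss_tot = np.sum((age_true - age_true.mean()) ** 2)     # total sum of squares
    return {
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1.0 - ss_res / ss_tot),
        "pearson_r": float(np.corrcoef(age_true, age_pred)[0, 1]),
        "mean_error": float(np.mean(err)),
    }


# Toy values for illustration only (not results from the patent).
metrics = brain_age_metrics([30, 40, 50, 60], [32, 41, 49, 64])
print(metrics)
```

MAE measures the size of the errors regardless of direction, while the signed mean error shows whether the model tends to overestimate or underestimate age, which is the quantity later compared between groups.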

Preferably, step S6 includes the following sub-steps:

S6.1: normalize the independent patient-group data, load the trained brain age prediction model, and apply the trained optimal model to the patient test set to obtain the predicted brain age;

S6.2: compute the mean error between actual age and predicted brain age and compare it with the mean error on the healthy test set; a higher mean error on the patient test set than in the healthy group indicates that the disease causes the patient's brain to age;

S6.3: generate the fitted lines between actual and predicted values for the healthy-group dataset and the patient-group dataset, and compare the slopes of the two lines to show that the patient's brain deviates from the healthy brain aging trajectory.
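The group comparison in S6.2 and S6.3 can be sketched as follows. The predictions are simulated, not model output or patent data: the patient group is generated with a steeper age relation purely to show what the comparison looks like. The function name gap_and_slope and the reading of "mean error" as predicted minus actual age are assumptions.

```python
import numpy as np


def gap_and_slope(age_true, age_pred):
    """Return (mean of pred - true, slope of the least-squares line
    fitted to predicted age as a function of actual age)."""
    age_true = np.asarray(age_true, dtype=float)
    age_pred = np.asarray(age_pred, dtype=float)
    mean_gap = float(np.mean(age_pred - age_true))
    slope, _intercept = np.polyfit(age_true, age_pred, deg=1)
    return mean_gap, float(slope)


# Simulated predictions standing in for the trained model's output.
rng = np.random.default_rng(1)
healthy_age = rng.uniform(30, 80, size=100)
patient_age = rng.uniform(30, 80, size=100)
healthy_pred = healthy_age + rng.normal(0, 3, size=100)
patient_pred = 1.2 * patient_age + rng.normal(0, 3, size=100)  # simulated faster aging

healthy_gap, healthy_slope = gap_and_slope(healthy_age, healthy_pred)
patient_gap, patient_slope = gap_and_slope(patient_age, patient_pred)
print(f"healthy: mean gap = {healthy_gap:.2f}, slope = {healthy_slope:.2f}")
print(f"patient: mean gap = {patient_gap:.2f}, slope = {patient_slope:.2f}")
```

In practice a difference in mean gap or slope would still need a statistical test before being attributed to the disease; the sketch only shows how the two quantities are obtained.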

Preferably, the brain atlases used in step S1.4 include AAL, DKT, Destrieux, and Brainnetome.

The present invention proposes a machine learning method for evaluating disease-induced brain aging based on brain structural images. Structural features of different brain regions, such as cortical thickness and volume, are extracted from structural magnetic resonance images of the brain. Because not all features help model prediction, the features are screened, and the retained features, which generalize across different training subsets and are more parsimonious and effective, are used to build a ridge regression brain age prediction model. K-fold cross-validation is used to find the features repeatedly identified across the k models, locating the brain-region structural features most relevant to brain age prediction. Finally, the trained model is applied to patient data to assess the degree to which the disease affects brain aging. The method can thus both locate the structural features that contribute most to brain age prediction and, by running the trained model on patient data, assess how strongly a disease affects brain aging.

Brief Description of the Drawings

Figure 1 is a schematic flow chart of the machine learning method for evaluating disease-induced brain aging based on brain structural images;

Figure 2 is a schematic diagram of feature screening with the bootstrap method;

Figure 3 shows the test-set results of the brain age prediction model.

Detailed Description

The present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand the technical solution of this application. The embodiments described are only some, not all, of the embodiments of this application. Other embodiments obtained by those skilled in the art from the specific embodiments described here, without creative effort, fall within the scope of the inventive concept.

The machine learning method of the present invention for evaluating disease-induced brain aging based on brain structural images comprises the following steps:

In step S1, preprocessing and feature extraction of the structural magnetic resonance images proceed as follows. First, the raw structural images contain non-brain structures such as the skull. Because the skull signal is not used in later analysis and the signal-to-noise ratio at the image edges is poor, non-brain structures such as the skull are removed during preprocessing. Next, MRI analysis is sometimes concerned only with specific regions, so the tissue of the target region is extracted according to the anatomical structure of the brain. The preprocessing pipeline then segments the brain image into three tissues (gray matter, white matter, and cerebrospinal fluid) because these tissues serve different functions in the brain; this step requires an image segmentation algorithm. Finally, the FreeSurfer software package is used for cortical reconstruction: it quantifies the functional, connectivity, and structural properties of the human brain, reconstructs the structural image in three dimensions, generates flattened or inflated images, and yields anatomical parameters such as cortical thickness, curvature, area, and gray matter volume. Four commonly used brain atlases (AAL, DKT, Destrieux, and Brainnetome) are each used to obtain brain structural features. Step S1 specifically includes the following sub-steps:

In step S2, feature screening with the bootstrap method proceeds as follows. First, for the healthy-group dataset, the proportion of the dataset covered by each sampled subset is set to r%, that is, r% of the samples are used to build each sampled subset. Second, the number of draws and the sampling scheme are set, that is, s rounds of sampling without replacement are performed. Then, the Pearson correlation between each feature and brain age is computed, the structural features with a significant correlation (p<0.05) are kept as candidate features, and the number of times each of these features appears across the s draws is counted. Finally, the required frequency for features extracted from the sampled subsets is set to t, that is, the features selected t times in the s draws become the final features of the model. The final feature set has size m*n, where m is the number of samples and n is the dimensionality of the structural features of each sample. The selected features generalize across different training subsets and are more parsimonious and effective. Step S2 specifically includes the following sub-steps:

In step S3, the ridge regression brain age prediction model is built as follows. First, the healthy-group dataset is randomly split into a training set and a test set at a ratio of a:b. Second, the feature data of the training set and the test set are standardized, so that the standardized data have mean 0 and standard deviation 1. Then, the range of values for the ridge regression alpha parameter is defined (0.01,0.1,1,3...60), R² is defined as the cross-validation evaluation metric, and k-fold cross-validation is used to search this range for the optimal parameter. Finally, the optimal parameter is used as the model parameter under each of the 4 brain templates. Step S3 specifically includes the following sub-steps:

In step S4, k-fold cross-validation is used to locate the brain regions that contribute most to brain age prediction, as follows. First, k-fold cross-validation is performed on the training set, with R² as the cross-validation evaluation metric. Then, the feature weights of the k models are obtained from the coef_ attribute of the ridge regression model, each model's weights are sorted from largest to smallest, and the structural features corresponding to the top h weights of each of the k models are taken. Finally, the features that recur across the k models are identified, locating the brain-region structural features most relevant to brain age prediction. Step S4 specifically includes the following sub-steps:

S4.3: identify the features that recur across the k models, and thereby locate the brain-region structural features that contribute most to brain age prediction.

In step S5, the brain age prediction model is trained and tested as follows. First, the model with the best test score in k-fold cross-validation under the n brain templates is selected and retrained on the entire training set. Second, the model is tested on the test set to obtain the predicted brain age. Finally, the MAE, R², Pearson correlation coefficient, and mean error between actual age and predicted brain age are computed as the model's evaluation metrics, and the brain age prediction model built on the Brainnetome atlas is selected as the optimal model. Step S5 specifically includes the following sub-steps:

In step S6, an independent patient-group dataset is used to assess the degree to which the disease affects brain aging, as follows. First, the patient-group dataset has size p*n, where p is the number of patient samples and n is the dimensionality of the structural features of each sample. The patient data are normalized, the trained brain age prediction model is loaded, and the trained optimal model is applied to the patient test set to obtain the predicted brain age. Then, the mean error between actual age and predicted brain age is computed and compared with the mean error on the healthy test set; a higher mean error on the patient test set than in the healthy group indicates that the disease causes the patient's brain to age. Finally, to further verify that the brains of the patient group show aging, fitted lines between actual and predicted values are generated for the healthy group and the patient group, and the slopes of the two lines are compared to show that the patient's brain deviates from the healthy brain aging trajectory. Step S6 specifically includes the following sub-steps:

In summary, the present invention proposes a machine learning method for evaluating disease-induced brain aging based on brain structural images. The method can locate the brain-region structural features that contribute most to brain age prediction and, by applying the trained model to patient data, can assess the degree to which a disease affects brain aging. The overall flow is shown in Figure 1. First, the structural MRI data are preprocessed and features are extracted: the skull and non-brain tissue are removed from the brain images, the images are segmented into gray matter, white matter, and cerebrospinal fluid, and cortical reconstruction is performed on the structural images to generate flattened images. After preprocessing, structural features such as cortical thickness, area, curvature, and volume are computed for each brain region and combined into a brain-region feature set. Second, the combined features are screened with the bootstrap method: features selected t times in s draws become the final features of the model, which generalize across different training subsets and are more parsimonious and effective. Next, the ridge regression brain age prediction model is built, and the optimal ridge alpha parameter is obtained by k-fold cross-validation. Then, k-fold cross-validation is used to identify the features that recur across the k models, locating the brain-region structural features most relevant to brain age prediction. After that, the model is trained and tested on the healthy-group dataset. Finally, the trained brain age prediction model is tested on the independent patient-group dataset to show that the patients' brains exhibit signs of aging.

Example 1

This example uses clinical data collected in a hospital. According to clinical presentation, the data are divided into type 0 patients, type 1 patients, and healthy subjects. After data curation and quality control, the final cohort comprised 138 type 0 patients, 94 type 1 patients, and 109 healthy subjects.

The specific implementation of the method includes the following steps:

(1) Preprocessing and feature extraction of the structural magnetic resonance images. First, the raw structural images contain non-brain structures such as the skull. Because the skull signal is not used in later analysis and the signal-to-noise ratio at the image edges is poor, non-brain structures such as the skull are removed during preprocessing. Next, MRI analysis is sometimes concerned only with specific regions, so the tissue of the target region is extracted according to the anatomical structure of the brain. The preprocessing pipeline then segments the brain image into three tissues (gray matter, white matter, and cerebrospinal fluid) because these tissues serve different functions in the brain; this step requires an image segmentation algorithm. Finally, the FreeSurfer software package is used for cortical reconstruction: it quantifies the functional, connectivity, and structural properties of the human brain, reconstructs the structural image in three dimensions, generates flattened or inflated images, and yields anatomical parameters such as cortical thickness, curvature, area, and gray matter volume. Four commonly used brain atlases (AAL, DKT, Destrieux, and Brainnetome) are each used to obtain brain structural features.

(2) Feature screening with the bootstrap method, as shown in Figure 2. First, for the healthy-group dataset, the proportion of the dataset covered by each sampled subset is set to 80%, that is, r% of the samples (here r = 80) are used to build each sampled subset. Second, the number of draws and the sampling scheme are set, that is, s rounds of sampling without replacement are performed. Then, the Pearson correlation between each feature and brain age is computed, the structural features with a significant correlation (p<0.05) are kept as candidate features, and the number of times each of these features appears across the s draws is counted. Finally, the required frequency for features extracted from the sampled subsets is set to t, that is, the features selected t times in the s draws become the final features of the model. The final feature set has size m*n, where m is the number of samples and n is the dimensionality of the structural features of each sample. The selected features generalize across different training subsets and are more parsimonious and effective.

(3) Constructing the ridge regression brain age prediction model: First, the healthy group data set is randomly divided into a training set and a test set in the ratio a:b. Second, the feature data of the training set and test set are standardized so that the standardized data have mean 0 and standard deviation 1. Then, the value range of the alpha parameter of the ridge regression model is defined as (0.01, 0.1, 1, 3...60), the cross-validation evaluation metric is defined as R2, and k-fold cross-validation is used to find the optimal parameter within this range. Finally, the optimal parameter is used as the model parameter under each of the 4 brain templates. The ridge regression brain age prediction results are shown in Figure 3.
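
One possible way to carry out step (3) with scikit-learn is sketched below. Several choices here are assumptions rather than values from the description: an 80:20 split stands in for a:b, k is taken as 5, the data are synthetic, and the alpha grid is illustrative only, since the description elides the values between 3 and 60. The scaler is placed in a pipeline so that it is fitted on training folds only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic healthy-group data (hypothetical): features vary linearly with age.
rng = np.random.default_rng(0)
m, n = 300, 10
age = rng.uniform(20, 80, size=m)
weights = np.linspace(-1.0, 1.0, n)
X = 0.02 * np.outer(age, weights) + rng.normal(scale=0.2, size=(m, n))

# Assumed a:b = 80:20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, age, test_size=0.2, random_state=0
)

# Standardize to mean 0 and standard deviation 1, then fit ridge regression.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Illustrative alpha grid; the intermediate values are an assumption.
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 3, 10, 30, 60]}

# k-fold cross-validation (k = 5 assumed) scored by R2.
search = GridSearchCV(pipeline, param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(X_test, y_test)
print("best alpha:", best_alpha)
print("test R2:", round(test_r2, 3))
```

In practice the search would be repeated once per brain atlas, giving one tuned model for each of the four feature sets.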

(4) Locating the brain regions that contribute most to brain age prediction using k-fold cross-validation: First, k-fold cross-validation is performed on the training set, that is, the training set is divided into k parts, and in turn k-1 parts are used as training data and 1 part as test data, with R2 as the cross-validation evaluation metric. Then, the feature weights of the k models are obtained through the coef_ parameter of the ridge regression model, the weights of each of the k models are sorted from largest to smallest, and the brain-region structural features corresponding to the top h feature weights of each model are obtained. Finally, the features that recur across the k models are identified, locating the brain-region structural features that contribute most to brain age prediction; the larger the feature weight, the greater the contribution of that structural feature to brain age prediction.
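
Step (4) can be illustrated with the sketch below. The helper name `recurring_top_features`, the synthetic data, and the values of k, h, and alpha are assumptions for the example. The description says only that weights are sorted from largest to smallest; ranking by absolute value, as done here, is an assumption so that strongly negative weights also count as large contributions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


def recurring_top_features(X, y, k=5, h=3, alpha=1.0, seed=0):
    """Return features ranked in the top h of every one of the k fold models."""
    top_sets, scores = [], []
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        scaler = StandardScaler().fit(X[train_idx])
        model = Ridge(alpha=alpha)
        model.fit(scaler.transform(X[train_idx]), y[train_idx])
        # Ridge.score returns R2 on the held-out fold.
        scores.append(model.score(scaler.transform(X[val_idx]), y[val_idx]))
        # Rank features by the magnitude of their coef_ weight (assumption).
        order = np.argsort(-np.abs(model.coef_))
        top_sets.append(set(order[:h].tolist()))
    recurring = sorted(set.intersection(*top_sets))
    return recurring, scores


# Synthetic demonstration (hypothetical data): features 2 and 5 track age.
rng = np.random.default_rng(7)
m = 200
age = rng.uniform(20, 80, size=m)
X = rng.normal(size=(m, 8))
X[:, 2] = -0.05 * age + rng.normal(scale=0.3, size=m)
X[:, 5] = 0.04 * age + rng.normal(scale=0.3, size=m)

recurring, scores = recurring_top_features(X, age, k=5, h=2)
print("features recurring in all folds:", recurring)
print("fold R2 scores:", [round(score, 3) for score in scores])
```

Mapping the recurring column indices back to atlas region names then gives the brain regions whose structural features contribute most.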

(5) Training and testing the constructed brain age prediction model: First, under each of the 4 brain templates, the model with the best test score in the k-fold cross-validation is selected and retrained on the entire training set. Second, the model is tested on the test set to obtain the predicted brain age. Finally, the MAE, R2, Pearson correlation coefficient, and average error between the real age and the predicted brain age are calculated as the evaluation metrics of the model, and the brain age prediction model built on the Brainnetome atlas is ultimately selected as the optimal model.
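
The four evaluation metrics named in step (5) can be computed as in the following sketch. The function name `brain_age_metrics` and the four-subject example are hypothetical, and the sign convention for the average error (predicted minus real age) is an assumption, since the description does not state one.

```python
import numpy as np


def brain_age_metrics(age_true, age_pred):
    """Compute MAE, R2, Pearson r, and signed mean error for brain age predictions."""
    age_true = np.asarray(age_true, dtype=float)
    age_pred = np.asarray(age_pred, dtype=float)
    gap = age_pred - age_true  # assumed convention: predicted minus real age
    ss_res = np.sum(gap ** 2)
    ss_tot = np.sum((age_true - age_true.mean()) ** 2)
    return {
        "mae": float(np.mean(np.abs(gap))),
        "r2": float(1.0 - ss_res / ss_tot),
        "pearson_r": float(np.corrcoef(age_true, age_pred)[0, 1]),
        "mean_error": float(np.mean(gap)),
    }


# Hypothetical example with four subjects.
metrics = brain_age_metrics([30, 40, 50, 60], [32, 39, 53, 60])
print(metrics)
```

For this example the gaps are 2, -1, 3, and 0 years, so the MAE is 1.5 years and the signed mean error is 1.0 year. Computing the same dictionary for each of the four atlas models gives a direct basis for selecting the optimal one.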

(6) Testing on an independent patient group data set to evaluate the extent to which the disease affects brain aging: First, the patient group data set has size p*n, where p is the number of patient samples and n is the number of structural features per sample. The patient data are normalized, the trained brain age prediction model is loaded, and the trained optimal model is applied to the patient test set to obtain the predicted brain age. Then, the average error between the real age and the predicted brain age is calculated and compared with the average error obtained on the healthy test set; an average error for the patient test set that is higher than that of the healthy group indicates that the disease leads to brain aging in the patients. Finally, to further verify that the brains of the patient group show aging, lines are fitted between the real and predicted values for the healthy group and the patient group, and the slopes of the two fitted lines are compared to show that the patients' brains deviate from the healthy brain aging trajectory.
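
The group comparison in step (6) can be sketched as below. The predicted ages here are simulated directly instead of coming from a trained model, and the helper name `gap_and_slope`, the group sizes, and the simulated aging effect are all assumptions for illustration. A least-squares line fitted with `numpy.polyfit` stands in for the fitted line described above.

```python
import numpy as np


def gap_and_slope(age_true, age_pred):
    """Return the mean (predicted - real) age gap and the fitted line's slope and intercept."""
    age_true = np.asarray(age_true, dtype=float)
    age_pred = np.asarray(age_pred, dtype=float)
    mean_gap = float(np.mean(age_pred - age_true))
    # Least-squares line: age_pred ~ slope * age_true + intercept.
    slope, intercept = np.polyfit(age_true, age_pred, deg=1)
    return mean_gap, float(slope), float(intercept)


# Simulated predictions (hypothetical): patients age faster than healthy subjects.
rng = np.random.default_rng(1)
healthy_age = rng.uniform(40, 80, size=200)
patient_age = rng.uniform(40, 80, size=200)
healthy_pred = healthy_age + rng.normal(scale=2.0, size=200)
patient_pred = 1.15 * patient_age - 4.0 + rng.normal(scale=2.0, size=200)

healthy_gap, healthy_slope, _ = gap_and_slope(healthy_age, healthy_pred)
patient_gap, patient_slope, _ = gap_and_slope(patient_age, patient_pred)

print("healthy: mean gap %.2f, slope %.3f" % (healthy_gap, healthy_slope))
print("patient: mean gap %.2f, slope %.3f" % (patient_gap, patient_slope))
```

With real data, the patient features would first be normalized in the same way as the training features before the trained model is applied to produce the predicted ages.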

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may be modified and varied in various ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A machine learning method for evaluating brain aging caused by diseases based on brain structure images, characterized by comprising the following steps:

S1: preprocessing brain structure magnetic resonance images and extracting features from them;

S2: screening the brain structural features extracted in step S1 using the bootstrap method;

S3: constructing a ridge regression brain age prediction model according to the screening result of S2;

S4: locating the brain regions that contribute most to the predicted brain age using k-fold cross-validation;

S5: training and testing the constructed brain age prediction model;

S6: testing on an independent patient data set to assess the extent to which the disease affects brain aging.

2. The machine learning method for evaluating brain aging caused by diseases based on brain structure images as claimed in claim 1, wherein step S1 comprises the following sub-steps:

S1.1: dividing the acquired clinical data into type 0 patients, type 1 patients, and healthy persons according to clinical manifestations, and removing non-brain structures from the brain structure magnetic resonance images of the patients;

S1.2: extracting the magnetic resonance image of the target tissue according to the anatomical structure of the brain;

S1.3: using an image segmentation algorithm to segment the magnetic resonance image of the target tissue into 3 different tissues, namely gray matter, white matter, and cerebrospinal fluid;

S1.4: performing cortical reconstruction on the segmented tissues with the FreeSurfer software package, quantifying the functional, connectional, and structural properties of the human brain, performing three-dimensional reconstruction of the structural image to generate flattened or inflated images, and obtaining the anatomical parameters of cortical thickness, curvature, area, and gray matter volume of different brain regions using different brain atlases.

3. The method of claim 1, wherein step S2 comprises the following sub-steps:

S2.1: for the healthy group data set, setting the proportion of the data set used for each sampling subset;

S2.2: setting the number and manner of samplings;

S2.3: sampling according to the proportion set in S2.1 and the number and manner set in S2.2, calculating the Pearson correlation between the cortical thickness and cortical surface area features of each sampled subset and brain age, retaining structural features with a significant correlation as candidate features, and counting the frequency with which these features appear across the samplings;

S2.4: setting the frequency value for the features extracted from the sampling subsets, and taking the features screened according to this frequency value as the final features of the model.

4. The machine learning method for evaluating brain aging caused by diseases based on brain structure images as claimed in claim 1, wherein step S3 comprises the following sub-steps:

S3.1: randomly dividing the healthy group data set into a training set and a test set according to a set ratio;

S3.2: standardizing the feature data of the divided training set and test set;

S3.3: defining the value range of the alpha parameter of the ridge regression model;

S3.4: defining the evaluation metric of cross-validation as R2, and using k-fold cross-validation to search the value range of the alpha parameter of the ridge regression model for the optimal parameter, namely the parameter with the highest model accuracy;

S3.5: taking the optimal parameter as the model parameter under the brain template.

5. The machine learning method for evaluating brain aging caused by diseases based on brain structure images as claimed in claim 1, wherein step S4 comprises the following sub-steps:

S4.1: performing k-fold cross-validation on the training set, and setting the cross-validation evaluation metric to R2;

S4.2: obtaining the feature weights of the k models through the coef_ parameter of the ridge regression model, sorting the weights of each of the k models from largest to smallest, and obtaining the structural features corresponding to the top h feature weights of each of the k models;

S4.3: identifying the features that recur across the k models, and locating the brain-region structural features that contribute most to the predicted brain age.

6. The machine learning method for evaluating brain aging caused by diseases based on brain structure images as claimed in claim 1, wherein step S5 comprises the following sub-steps:

S5.1: selecting the model with the best test score in the k-fold cross-validation under the n brain templates, and retraining the model on the whole training set;

S5.2: testing the model on the test set to obtain the predicted brain age;

S5.3: calculating the MAE, R2, Pearson correlation coefficient, and average error between the real age and the predicted brain age as the evaluation metrics of the model, and finally selecting the brain age prediction model established with a brain atlas as the optimal model.

7. The method of claim 1, wherein step S6 comprises the following sub-steps:

S6.1: normalizing the independent patient group data, loading the trained brain age prediction model, and applying the trained optimal model to the patient test set to obtain the predicted brain age;

S6.2: calculating the average error between the real age and the predicted brain age, and comparing it with the average error obtained on the healthy test set, wherein an average error for the patient test set that is higher than that of the healthy group indicates that the disease leads to brain aging in the patients;

S6.3: fitting lines between the real values and the predicted values for the healthy group data set and the patient group data set, and comparing the slopes of the two fitted lines to show that the patients' brains deviate from the healthy brain aging trajectory.

8. The method of claim 2, wherein the brain atlases used in step S1.4 include AAL, DKT, Destrieux, and Brainnetome.