CN115881296A - A risk-assisted stratification system for papillary thyroid carcinoma (PTC) - Google Patents
- Publication number
- CN115881296A (application CN202310090839.8A)
- Authority
- CN
- China
- Prior art keywords
- model
- risk
- ptc
- data
- stratification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Investigating Or Analysing Biological Materials (AREA)
Abstract
Description
Technical field
The present invention relates to a system and method for assisted stratification of tumors into low risk or intermediate/high risk, and in particular to an artificial-intelligence-based system and method for assisted risk stratification of papillary thyroid carcinoma (PTC). The invention belongs to the technical field of medical artificial-intelligence-assisted stratification.
Background art
Thyroid cancer is one of the most common endocrine malignancies worldwide, and its incidence has risen rapidly in recent years. Thyroid cancer comprises four histological types, of which papillary thyroid cancer (PTC) accounts for 85% of cases. Although the overall prognosis of PTC is good, with a 10-year survival rate above 90%, some tumors show highly aggressive biological behavior at an early stage, and 30-80% of patients already have lymph node metastasis at first diagnosis. Capsular invasion is found at the first operation in about 30-50% of PTC cases, and even after complete surgical resection the postoperative recurrence rate remains 10-25%. Early and accurate assessment of PTC lesions and their risk level, together with individualized and standardized surgical planning, is therefore the basis of precise diagnosis and treatment.
At present, physicians treating tumors rely mostly on their own experience and knowledge and on existing medical guidelines for assessment and treatment, which leads to low assessment accuracy and to tedious work limited by individual experience. Ultrasound, CT and MRI are currently the main imaging methods for evaluating PTC lesions and cervical lymph nodes, but the published literature has not shown them to be highly effective in assessing lesion risk. Studies that examine PTC risk stratification at the genomic level are relatively limited. Genomics and transcriptomics have revealed somatic alterations in different thyroid cancers, such as BRAFV600E, RAS, TP53 and TERT mutations and RET/PTC gene fusions. Among these, the role of the BRAFV600E mutation in the pathogenesis of PTC has attracted much attention: available data indicate that 85% of PTCs carry the BRAFV600E mutation, but the mutation alone is a weak predictor of PTC prognosis.
Unlike the nucleic acids studied in genomics, proteins participate directly in all life processes and determine cell and organ phenotypes. As the most direct products of biological activity, proteins show relatively stable expression across different stages of a cell's life and different protein cycles. Proteomics has shown significant predictive value for prognosis in lung cancer, oral squamous cell carcinoma, gastric cancer, esophageal cancer, colorectal cancer and other tumors. Proteomic studies of thyroid cancer are also appearing steadily. Sofiadis et al. (SOFIADIS A, DINETS A, ORRE L M, et al. Proteomic study of thyroid tumors reveals frequent up-regulation of the Ca2+-binding protein S100A6 in papillary thyroid carcinoma [J]. Thyroid, 2010, 20(10): 1067-76) showed that the frequent upregulation of the Ca2+-binding protein S100A6 in PTC can serve as a potential auxiliary marker for distinguishing follicular thyroid tumors from PTC. Another study (ZHAN S, LI J, WANG T, et al. Quantitative Proteomics Analysis of Sporadic Medullary Thyroid Cancer Reveals FN1 as a Potential Novel Candidate Prognostic Biomarker [J]. Oncologist, 2018, 23(12): 1415-25) performed a quantitative proteomic analysis of sporadic medullary thyroid cancer and identified fibronectin 1 as a novel candidate prognostic biomarker.
Traditional proteomic sample preparation techniques, such as in-solution digestion or two-dimensional gel electrophoresis (2D PAGE), are time-consuming and inefficient. Pressure cycling technology (PCT) is an important processing technique for building proteomic databases from formalin-fixed, paraffin-embedded (FFPE) specimens. It applies rapidly alternating hydrostatic pressure between ambient pressure (14.7 psi) and high pressure (up to 45,000 psi), with a rise time of 3 seconds and a depressurization time of milliseconds. The method is simple, fast and high-throughput, and it effectively promotes the de-crosslinking, extraction and trypsin digestion of proteins in FFPE sections, thereby increasing the number of proteins identified from FFPE samples.
In addition to PCT-assisted sample processing, quantitatively accurate mass spectrometry (MS) analysis is needed to identify disease-related changes in protein expression levels. Data-independent acquisition (DIA) is a proteomics technique that fragments all ions within a selected mass-to-charge (m/z) range and analyzes them by tandem mass spectrometry. DIA is an alternative to data-dependent acquisition (DDA). Its greatest advantage over DDA is the efficient measurement of very low-abundance proteins in complex samples: it yields complete data with deep protein coverage and precise quantification, which greatly improves the reliability, accuracy and reproducibility of quantitative analysis. Combining DIA-MS with deep-neural-network-based data processing further improves reproducibility, the number of identifications and quantitative accuracy.
With the development of networks and technology, large data sets have become available, and artificial intelligence is now applied in many fields. How to use artificial intelligence to mine and process these data in the service of human progress has become a goal that many disciplines are competing to reach. In medicine, converting medical records into medical knowledge allows the onset and progression of disease to be better understood, and collecting biological information such as gene and DNA sequence data allows differences in the expression of that information across diseases to be better understood. Zhu et al. used machine learning to compare the exosomal metabolome patterns of patients with and without recurrence and demonstrated a marker panel consisting of 3'-UMP, palmitoleic acid, palmitaldehyde and isobutyl decanoate for predicting recurrence of esophageal squamous cell carcinoma. Previous studies show that combining artificial intelligence with basic science can be applied effectively to problems that are difficult to solve clinically.
Patent US6005256A describes in detail devices and methods for simultaneously detecting multiple fluorescently labeled markers in body samples, including devices and methods for identifying cancer cells, but it neither addresses specific applications in cancer assessment nor mentions that appropriate marker combinations can yield higher specificity.
Patent CN110129442A discloses the use of reagents for detecting the level of a gene and its expression products in the preparation of products for diagnosing thyroid cancer, characterized in that the gene is selected from LRRN4CL or ZNF883. The products include chips, kits or nucleic acid membrane strips. The gene chip includes oligonucleotide probes against the LRRN4CL or ZNF883 gene for detecting its transcription level, and the protein chip includes a specific binding agent for the LRRN4CL or ZNF883 protein. The kits include a gene detection kit, which contains reagents or a chip for detecting the transcription level of the LRRN4CL or ZNF883 gene, and a protein detection kit, which contains reagents or a chip for detecting the expression level of the LRRN4CL or ZNF883 protein.
Patent CN107563383A discloses an auxiliary prediction system for risk stratification of benign and malignant lesions, comprising steps of nodule detection, physician guidance, semantic annotation generation, sample generation and online learning. It requires physician guidance: the physician marks points of interest, the system automatically computes the contour of the pulmonary nodule in the region of interest and the predicted probability of each pixel within the contour, and it generates annotated pulmonary nodule samples with precise semantic contours for a self-learning system to learn from.
Commonly used prognostic assessment systems for thyroid cancer currently include the AJCC/UICC TNM staging, the AMES, AGES and MACIS scoring systems, and the ATA risk stratification guidelines. AJCC/UICC is a staging system based on pTNM parameters and age; it applies to all tumor types, including thyroid cancer. The AMES, AGES and MACIS systems are based mainly on the presence or absence of lymph node or distant metastasis, patient age and tumor extent. The ATA risk stratification guidelines propose a three-level system for predicting the recurrence risk of thyroid cancer. All of these systems rely on postoperative clinicopathological findings, and there is currently no assessment system that can evaluate PTC risk before surgery.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide an assisted risk stratification system for papillary thyroid carcinoma (PTC) that increases clinicians' confidence and supports clinical management decisions for papillary thyroid carcinoma.
One aspect of the present invention provides an assisted risk stratification system for papillary thyroid carcinoma (PTC), characterized in that it comprises:
Data acquisition module: collects existing data, comprising proteomics data, clinical indicators, blood immune indicators and the BRAFV600E mutation label; after verifying and assessing data completeness and data quality, it de-identifies the data and sends them to the data preprocessing module;
Data preprocessing module: performs missing-value imputation and normalization on the collected data;
Model building, training and validation module: performs feature screening on the whole-proteome data produced by the data preprocessing module, then uses extreme gradient boosting (XGBoost), a gradient-boosted tree model, to relate the input features to PTC risk stratification as a supervised learning problem. The discovery set serves as the training data for fitting and tuning the model parameters: it is divided into a training subset and a test subset, the model is built on the training subset and internally validated on the test subset, and the model with the best AUC on the test subset is selected and saved. That model is then validated on two independent validation sets, and its performance is evaluated by the area under the receiver operating characteristic curve (AUC). Protein features selected by the XGBoost algorithm, combined with 10-fold cross-validation, finally determined 6 characteristic proteins, 5 clinical features, 5 blood immune indicators and the BRAFV600E mutation as the model features. A PTC risk stratification model is established and a model threshold is determined, so that lesions of each risk level can be predicted and stratified from the model output;
Stratification module: analyzes the test sample, stratifies the tumor as a low-risk lesion or an intermediate/high-risk lesion, and returns the result to the physician's terminal device and to the database platform;
Database platform: connected to each of the above modules; stores and retrieves the data used to build the model, continuously collects new data, and provides online updating of the model.
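The preprocessing step named in the module list above (missing-value imputation followed by normalization) can be sketched as follows. The patent does not state which imputation or normalization method is used, so median imputation and min-max scaling are assumptions made here for illustration only, and the function name `impute_and_normalize` and the example values are hypothetical.

```python
from statistics import median


def impute_and_normalize(columns):
    """Impute missing values and rescale each feature to [0, 1].

    columns: dict mapping a feature name to a list of floats,
    where None marks a missing measurement.
    Assumptions (not from the source): median imputation, min-max scaling.
    """
    processed = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        fill = median(observed)  # assumed imputation rule
        filled = [fill if v is None else v for v in values]
        lo, hi = min(filled), max(filled)
        span = hi - lo
        # A constant column carries no information; map it to 0.0.
        processed[name] = [(v - lo) / span if span else 0.0 for v in filled]
    return processed


# Hypothetical toy data: one protein intensity and one clinical feature.
example = {
    "DPP7": [2.0, None, 4.0, 6.0],
    "age": [30.0, 50.0, None, 70.0],
}
result = impute_and_normalize(example)
print(result)
```

Fitting the imputation and scaling statistics on the training subset only, and reusing them for the test and validation sets, would avoid leaking information across the splits described above.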
The assessment criteria are based on the 2015 ATA management guidelines: samples with neither tumor invasion nor lymph node metastasis are considered low-risk PTC, and samples with tumor invasion or lymph node metastasis are considered intermediate- or high-risk PTC.
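The labelling rule stated above reduces to a single boolean test. A minimal sketch follows; the function name `ata_label` and the label strings are hypothetical, and the rule covers only the two criteria the text names, not the full ATA 2015 guideline.

```python
def ata_label(tumor_invasion: bool, lymph_node_metastasis: bool) -> str:
    """Reference label as described in the text.

    Low risk requires that both findings are absent; either finding
    alone places the sample in the intermediate/high-risk group.
    """
    if tumor_invasion or lymph_node_metastasis:
        return "intermediate/high"
    return "low"


print(ata_label(False, False))
print(ata_label(False, True))
```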
Preferably, the model threshold is 0.6645.
Preferably, the six characteristic proteins are DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5.
Preferably, the five clinical features are capsular invasion, extrathyroidal invasion, tumor diameter, multifocality and age.
Preferably, the five blood immune indicators are platelet count (plt), neutrophil count (N), lymphocyte count (L), monocyte count (M) and the lymphocyte-to-monocyte ratio (LMR).
Preferably, the sources of the existing data include hospital case data.
Preferably, supervised learning is the process of adjusting the parameters of a classifier with a set of samples of known class so that it reaches the required performance.
Preferably, the sample inclusion criteria are: (1) first operation, with lymph node dissection; (2) no history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classic PTC; (4) a postoperative pathology report containing complete information for patient risk stratification. The exclusion criteria are: (1) history of neck trauma; (2) concurrent or previous other cancer; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of complete, usable postoperative pathology.
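The preferred threshold of 0.6645 given above can be applied to a model output as in the sketch below. The text does not say which side of the threshold maps to which group or how ties are handled. The sketch therefore assumes that the score is the predicted probability of intermediate/high risk and that a score equal to the threshold counts as intermediate/high risk. The names `MODEL_THRESHOLD` and `stratify` are hypothetical.

```python
MODEL_THRESHOLD = 0.6645  # preferred threshold stated in the text


def stratify(score: float, threshold: float = MODEL_THRESHOLD) -> str:
    """Map a model score to a risk group.

    Assumptions (not from the source): higher scores mean higher risk,
    and a score exactly at the threshold is treated as intermediate/high.
    """
    if score >= threshold:
        return "intermediate/high risk"
    return "low risk"


for s in (0.12, 0.6645, 0.91):
    print(s, stratify(s))
```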
A further aspect of the present invention provides a method of using the assisted risk stratification system for papillary thyroid carcinoma (PTC) for non-diagnostic purposes, characterized in that the stratification result is obtained with the assisted stratification system described above.
A final aspect of the present invention provides the use of the above assisted risk stratification system for papillary thyroid carcinoma (PTC) in the preparation of a device for predicting and stratifying a patient's risk of papillary thyroid carcinoma.
Some of the terms used in the present invention are explained below:
Abbreviations and symbols
Dipeptidyl peptidase 7 (DPP7): encoded by the DPP7 gene, it plays an important role in the degradation of some oligopeptides. The Bgee gene expression database shows that DPP7 is expressed in the thyroid lobe. The protein is a marker of poor prognosis in colorectal cancer, whereas the literature indicates that high DPP7 expression in breast cancer patients is associated with a good prognosis. Earlier work showed that knocking out DPP7 increases apoptosis by upregulating Bax-Bcl2 signaling in the HepG2 liver cancer cell line, so it may also lower thyroid cancer risk through a similar pathway, which is consistent with the findings of the present invention. For the amino acid sequence of DPP7, see GenBank accession number NP_037511.2; for the nucleic acid sequence, see NM_013379.3.
Procathepsin L (CTSL): a thiol protease that plays an important role in the overall degradation of proteins in lysosomes and is an important protease for maintaining functional thyroid cells. It takes part in the limited proteolysis of thyroglobulin in the thyroid follicular lumen, in the solubilization of cross-linked thyroglobulin and in the subsequent release of thyroxine (T4). For the amino acid sequence of CTSL, see GenBank accession number NP_001244900.1; for the nucleic acid sequence, see NM_001257971.2.
Integrin subunit beta 5 (ITGB5): closely associated with lymph node metastasis. ITGB5 can induce phosphorylation of downstream Src, which activates the NF-kB signaling pathway and promotes tumor metastasis; this may be related to tissue heterogeneity of the protein. For the amino acid sequence of ITGB5, see GenBank accession number NP_001341693.1; for the nucleic acid sequence, see NM_001354764.2.
PDZ and LIM domain 3 (PDLIM3): studies suggest that PDLIM3 may play a role in organizing actin filament arrays in muscle cells. The protein is a marker of poor prognosis in thyroid cancer. The literature also reports that PDLIM3 regulates cell proliferation and differentiation through the MAPK signaling pathway, which may promote the highly aggressive biological behavior of thyroid cancer; this is also consistent with the findings of the present invention. For the amino acid sequence of PDLIM3, see GenBank accession number NP_001107579.1; for the nucleic acid sequence, see NM_001114107.5.
Collagen alpha-1(XII) chain (COL12A1): studies show that in gastric cancer COL12A1 forms a positive feedback loop through the MAPK pathway that promotes cell migration; tumor cell invasion and metastasis arising on this basis are presumed to account for the high risk. For the amino acid sequence of COL12A1, see GenBank accession number NP_004361.3; for the nucleic acid sequence, see NM_004370.6.
Tubulin beta 2A class IIa (TUBB2A): a major component of microtubules that takes part in the mitotic cell cycle and is closely related to tumor growth. TUBB2A has been reported as a novel marker for predicting distant metastasis of breast cancer [48]; in cell line studies, knocking down TUBB2A significantly reduced the number of invading cells, which confirms its potential for distant metastasis. In gastric cancer cell lines, TUBB2A expression likewise promotes proliferation, migration and invasion. These studies all show that TUBB2A expression leads to a highly invasive and highly metastatic state, which agrees closely with the results of this study. For the amino acid sequence of TUBB2A, see GenBank accession number NP_001060.1; for the nucleic acid sequence, see NP_001060.1.
Gradient boosting (GB) is an artificial intelligence algorithm that builds a more accurate, stronger learner by combining simple, weak decision trees. A single weak tree leaves a prediction error, but a second model can compensate for it, so combining these successive weak trees yields a model that is more accurate than the first tree alone. In GB, a weak tree is fitted to the residuals, and the prediction is then updated by adding the predicted residuals to the previous prediction. Extreme gradient boosting (XGBoost) is a tree-based ensemble learning model that has recently attracted much attention. Although XGBoost is based on GB, it overcomes several shortcomings of GB, such as slow execution and the lack of regularization against overfitting, and it can therefore complete training faster than existing GB models.
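The residual-fitting loop described above can be illustrated with a small sketch. It is plain gradient boosting with squared-error loss on a single feature, using one-split regression stumps as the weak trees. It is not XGBoost, which adds regularization and second-order gradient information, and it is not the model of the invention. All function names, the learning rate and the toy data are assumptions for illustration.

```python
def fit_stump(x, residuals):
    """Fit the best single-split regression stump (squared error) on a 1-D feature."""
    best = None
    # Try every distinct value except the largest as a split point,
    # so that both sides of the split are non-empty.
    for split in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= split else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Update the prediction by adding the (shrunken) predicted residual.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data, hypothetical values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = gradient_boost(x, y)
mse_base = mse(y, [sum(y) / len(y)] * len(y))
mse_model = mse(y, [model(xi) for xi in x])
print(mse_base, mse_model)
```

Each round lowers the training error a little, which is the sense in which successive weak trees compensate for the errors of the earlier ones.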
The benefits of the present invention are as follows. (1) The invention is based on PCT-MS technology and can process fresh-frozen tissue samples, FFPE samples and fine-needle aspiration (FNA) samples. Characteristic proteins and characteristic clinical variables are screened by machine learning to build a risk prediction model, which is then tested on paraffin section samples and needle biopsy samples. This explores the value of proteomics for PTC risk stratification and provides a basis for decisions on individualized treatment plans for PTC patients.
(2) The invention also builds a prognosis prediction model derived from proteins in preoperative FNA specimens, which is consolidated into a PTC risk assessment system. The resulting model predicts well: the AUC reached 0.92 on the discovery set, and 0.79 and 0.80 on the paraffin samples and the FNA samples of the validation sets, respectively. According to a review of the literature, this is the first study to examine the relationship between PTC and risk level using proteomics and artificial intelligence. It enables risk-stratified management and individualized precise diagnosis and treatment of PTC before surgery, provides an effective and reliable research basis, and is of pioneering significance.
(3) XGBoost is a tree-based ensemble learning model that has recently attracted much attention. It helps prevent overfitting and so improves the generalization ability of the model, which makes it easier to apply clinically. The SHAP algorithm builds a visual explanatory model and outputs a ranking of feature importance to interpret XGBoost (Figure 5). The main advantage of SHAP values is that they reflect the influence of each feature in every sample and also show whether that influence is positive or negative. Red points indicate a positive influence: capsular invasion, extrathyroidal invasion, multifocality, expression of PDLIM3, COL12A1 and TUBB2A, and an increased monocyte count raise the likelihood of intermediate/high risk. High expression of DPP7, CTSL and ITGB5 and increased neutrophil and lymphocyte counts raise the probability of low risk. Such models give physicians an intuitive view of the key features and let them apply the model effectively to individual patients, supporting stratified management and precise treatment.
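The AUC values reported in benefit (2) above measure how often the model scores a truly intermediate/high-risk sample above a truly low-risk one. A minimal sketch of that calculation follows, using the pairwise (Mann-Whitney) form of the AUC with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (here, intermediate/high risk), 0 otherwise.
    scores: model outputs, higher meaning more likely positive.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy example: 8 of the 9 positive/negative pairs are ranked correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred cases such as those described here.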
Description of the drawings
Figure 1 is the PCT-DIA workflow for PTC samples;
Figure 2 is a flow chart of feature selection, construction of the machine learning model and model validation;
Figure 3 shows the predictive performance of the machine learning model, comparing how well clinicians assisted by the model and clinicians alone distinguish low-risk from intermediate/high-risk PTC; Figure 3A is the ROC curve for the retrospective test set and Figure 3B is the ROC curve for the prospective test set;
Figure 4 shows proteins with significantly different expression between the high-risk and low-risk groups (panel A), the high-risk and intermediate-risk groups (panel B), and the intermediate-risk and low-risk groups (panel C). Each point represents one differentially expressed protein, and points outside the dashed lines are considered statistically significant. Points in the upper right quadrant are upregulated proteins and points in the upper left quadrant are downregulated proteins;
Figure 5 is a summary plot of the SHAP analysis, showing the mean absolute SHAP values of the 17 most important features of the model. Features are listed from top to bottom in order of importance: the higher a feature appears, the better it separates the samples and the greater its influence on the output;
Figure 6 shows the expression levels of the 6 proteins selected by machine learning in the intermediate/high-risk group and the low-risk group. The upper and lower edges of each box are the quartiles of protein expression, and the middle line is the median.
Detailed description
I. Subjects and methods
(I) Study subjects
1. General information
The PTC samples included in this study were divided into a discovery set and independent validation sets. The independent validation sets comprise a retrospective validation set and a prospective validation set. The discovery set retrospectively included PTC paraffin section specimens and clinical information from the Department of Surgical Oncology of Hangzhou First People's Hospital, affiliated with Zhejiang University School of Medicine, and from Shandong Yuhuangding Hospital, covering June 2013 to November 2020. A total of 283 cases were initially included; 9 were excluded because the prepared sections contained too little material, leaving 274 cases (191 in the training subset and 83 in the test subset). In addition, 166 PTC paraffin section specimens with clinical information were collected from the same two institutions between January 2016 and December 2021 as the retrospective validation set, and 118 needle biopsy samples were collected from the same two institutions between January 2020 and December 2021 as the prospective validation set.
Sample inclusion criteria: (1) first operation, with lymph node dissection; (2) no history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classic PTC; (4) a postoperative pathology report containing complete information for patient risk stratification. Exclusion criteria: (1) history of neck trauma; (2) concurrent or previous other cancer; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of complete, usable postoperative pathology.
2. Clinicopathological information
Clinical information and tumor characteristics were extracted from the electronic medical records, including: (1) preoperative clinical information: sex, age, presence or absence of Hashimoto's thyroiditis, maximum tumor diameter, whether the tumor was multifocal, and whether the tumor invaded the capsule or extended beyond the gland; (2) blood immune indices: platelet count (Plt), neutrophil count (N), lymphocyte count (L) and monocyte count (M), from which the platelet-to-lymphocyte ratio (PLR), neutrophil-to-lymphocyte ratio (NLR), lymphocyte-to-monocyte ratio (LMR) and systemic immune-inflammation index (SII) were calculated; (3) BRAF mutation status.
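The derived blood indices can be illustrated with a short sketch. The function name and the example counts are hypothetical. The source does not spell out the SII formula, so the commonly used definition (platelets × neutrophils / lymphocytes) is assumed here.

```python
def inflammation_indices(plt, neut, lymph, mono):
    """Derive ratio-based indices from absolute counts in the same unit (e.g. 10^9/L)."""
    if lymph <= 0 or mono <= 0:
        raise ValueError("lymphocyte and monocyte counts must be positive")
    return {
        "PLR": plt / lymph,          # platelet-to-lymphocyte ratio
        "NLR": neut / lymph,         # neutrophil-to-lymphocyte ratio
        "LMR": lymph / mono,         # lymphocyte-to-monocyte ratio
        "SII": plt * neut / lymph,   # assumed definition, not stated in the source
    }


# Hypothetical counts, for illustration only.
example = inflammation_indices(plt=250.0, neut=4.0, lymph=2.0, mono=0.5)
print(example)
```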
Tumors of the included PTC patients were staged according to the TNM staging in the 8th edition published by the American Joint Committee on Cancer (AJCC). Tumor risk was graded with reference to the postoperative recurrence risk stratification for PTC in the latest (2015) edition of the American Thyroid Association (ATA) guidelines.
ATA recurrence risk stratification system
(2) Tissue sample preparation
Four consecutive 10 μm paraffin sections were cut from each FFPE thyroid cancer tissue block on a microtome and mounted on glass slides. Every tissue sample was examined and prepared by two experienced pathologists: the lesion area on the 10 μm paraffin sections was marked under the microscope by comparison with the postoperative hematoxylin-eosin-stained diagnostic sections, and tissue cores were then taken. Needle biopsy samples were obtained by preoperative or intraoperative fine-needle aspiration of thyroid nodules; after confirmation by a pathologist, the aspirates were stored at -80°C.
(3) Batch design
In the discovery set, to minimize batch effects between experimental batches, 24 of the 274 paraffin samples were randomly selected as biological replicates and 27 as technical replicates, and the samples were randomly assigned to 21 batches. The 166 samples and 13 technical replicates of the retrospective test set were divided into 12 batches, and the 118 samples and 8 technical replicates of the prospective test set into 9 batches. Each batch comprised 15 thyroid samples, one mouse liver sample and one pooled thyroid sample (pooled from high-, intermediate- and low-risk thyroid samples) as quality controls. In the analysis of the discovery set, the histopathological diagnosis of every paraffin section was unambiguous, which allowed the data separation model to be built.
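A minimal sketch of random batch assignment follows. It assumes that technical replicates are extra runs added on top of the samples and that 15 thyroid runs per batch is an upper limit; neither point is stated explicitly in the text, although under these assumptions the computed batch counts match the 21, 12 and 9 batches reported. The identifiers and the seed are hypothetical.

```python
import math
import random

BATCH_CAPACITY = 15  # thyroid samples per batch, as stated in the text


def n_batches(n_samples, n_technical_replicates, capacity=BATCH_CAPACITY):
    """Batches needed when replicates are extra runs (an assumption)."""
    return math.ceil((n_samples + n_technical_replicates) / capacity)


def assign_batches(run_ids, capacity=BATCH_CAPACITY, seed=0):
    """Shuffle the runs and cut them into consecutive batches."""
    rng = random.Random(seed)
    shuffled = list(run_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + capacity] for i in range(0, len(shuffled), capacity)]


# Hypothetical run identifiers for the discovery set: 274 samples + 27 repeats.
discovery_runs = [f"D{i:03d}" for i in range(274)] + [f"D-rep{i:02d}" for i in range(27)]
batches = assign_batches(discovery_runs)
print(len(batches), n_batches(166, 13), n_batches(118, 8))
```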
(4) Dewaxing, rehydration and hydrolysis of FFPE tissue
For each PTC case in the discovery set, 4 FFPE tissue cores were prepared as biological replicates. Samples were dewaxed in heptane and then rehydrated at room temperature in 100% ethanol (Sigma), 90% ethanol and 75% ethanol in turn. They were then washed with 100 mM Tris-HCl (pH 10, Sigma) to establish alkaline hydrolysis conditions, incubated at 95°C and 600 rpm for 30 min, and rapidly cooled to 4°C.
(5) Tissue lysis, protein extraction and protein digestion
6 M urea, 2 M thiourea, 10 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP) and 40 mM iodoacetamide (IAA) were added to the dewaxed samples, which were then processed with pressure cycling technology (PCT) for 90 cycles, each consisting of 25 s at 45,000 psi followed by 10 s at atmospheric pressure at 30°C. Samples were incubated in the dark for 30 min with gentle rotation, after which Lys-C was added at a protein-to-Lys-C ratio of 40:1. PCT-assisted Lys-C digestion was run for 45 cycles, each consisting of 50 s at 20,000 psi followed by 10 s at atmospheric pressure at 30°C. The final tryptic digestion was carried out at a protein-to-trypsin ratio of 40:1 with PCT for 90 cycles, each consisting of 50 s at 20,000 psi followed by 10 s at atmospheric pressure at 30°C. Digestion was stopped by adding 10% TFA, the pH was adjusted to 2-3, and the samples were centrifuged at 12,000 g for 5 min. After reconstitution, the peptide concentration was measured and adjusted to 0.2 μg/μL.
(6) Proteomics data acquisition and analysis
Cleaned peptides were separated on a nanoLC-MS/MS system fitted with a 15 cm × 75 μm ID column, using a 45 min linear gradient of 3-25% buffer B (buffer A: 2% acetonitrile, 0.1% formic acid; buffer B: 98% acetonitrile, 0.1% formic acid) at a flow rate of 300 nL/min. The eluate was analyzed on a Q Exactive HF mass spectrometer, with data acquired in DIA mode. Full MS scans covering 390-1010 m/z were acquired in the Orbitrap at a resolution of 60,000 (at m/z 200), with an AGC target of 3E6 charges and a maximum injection time of 100 ms. Each full MS scan was followed by 24 MS/MS scans, each at a resolution of 30,000 (at m/z 200), with an AGC target of 1E6 charges, a normalized collision energy of 27%, a default charge state of 2 and the maximum injection time set to auto. The 24 MS/MS scans cycled through isolation windows of 3 widths, centered at (m/z): 410, 430, 450, 470, 490, 510, 530, 550, 570, 590, 610, 630, 650, 670, 690, 710, 730, 770, 790, 820, 860, 910, 970. A complete MS and MS/MS acquisition cycle took about 3 seconds and was repeated throughout the LC/MS run. The acquired data were searched with DIA-NN (1.7.15) against the thyroid peptide spectral library.
(7) Quality control
(8) Building the prediction model with XGBoost
The present invention builds a prediction model based on XGBoost that classifies any given sample, described by its proteomic data and clinical features, into one of two classes, intermediate/high risk or low risk, with the best achievable accuracy. The work comprises 4 stages: data preprocessing, feature selection, machine learning model construction and model validation.
The 4 stages are detailed below (see Figure 3A and Figure 3B).
Stage 1: Data preprocessing
Three groups drawn from 2 datasets, namely the discovery set, the retrospective validation set and the prospective validation set, were used to develop the prediction model. Of the 274 samples in the discovery cohort, 191 formed the training set used to build the model and 83 formed the test set used to tune the parameters so as to maximize the test-set AUC. The trained model was then applied to the retrospective and prospective validation cohorts for external validation, to demonstrate the generalization ability of the model.
Preprocessing consists of two steps: (1) missing-value imputation and (2) normalization. Missing values are an unavoidable feature of protein intensity data. Because most missing values arise when the protein abundance is below the detection threshold, all missing values were filled with 0.8 × Dmin, where Dmin is the minimum of all feature values in the discovery set (Dmin = 13 in this work). After imputation, the mean and variance of each feature were estimated from the discovery set, and the features of each training sample were normalized with these estimates.
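The two preprocessing steps can be sketched as follows. The normalization formula itself is not reproduced in the text, so a z-score (subtract the mean, divide by the standard deviation estimated on the discovery set, using the population variance) is assumed here. Function names and the toy matrix are hypothetical, and `None` marks a missing value.

```python
import math


def fit_preprocessor(discovery_rows):
    """Estimate the fill value (0.8 * Dmin) and per-feature mean/std on the discovery set."""
    observed = [v for row in discovery_rows for v in row if v is not None]
    fill = 0.8 * min(observed)
    imputed = [[fill if v is None else v for v in row] for row in discovery_rows]
    n = len(imputed)
    means, stds = [], []
    for j in range(len(imputed[0])):
        col = [row[j] for row in imputed]
        mu = sum(col) / n
        var = sum((x - mu) ** 2 for x in col) / n   # population variance (assumption)
        means.append(mu)
        stds.append(math.sqrt(var) or 1.0)          # guard against constant features
    return fill, means, stds


def transform(rows, fill, means, stds):
    """Impute, then z-score, using statistics estimated on the discovery set only."""
    return [
        [((fill if v is None else v) - means[j]) / stds[j] for j, v in enumerate(row)]
        for row in rows
    ]


# Toy discovery matrix (rows = samples, columns = proteins); values are illustrative.
discovery = [[13.0, 20.0], [None, 22.0], [15.0, None]]
fill, means, stds = fit_preprocessor(discovery)
z = transform(discovery, fill, means, stds)
```

The same `fill`, `means` and `stds` would then be reused unchanged for the validation sets, so that no information from those sets leaks into preprocessing.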
Stage 2: Feature selection
Feature selection is needed for two reasons: (1) because the whole proteome is measured, most detected proteins have little relevance to the problem, and feeding too many proteins into machine learning reduces the model's generalization ability and causes overfitting, so such proteins should be removed from the feature matrix; (2) for clinical use the number of proteins should be kept as small as possible, choosing the combination that discriminates most effectively. Selection proceeds in two steps. The first step is feature screening. The initial protein features were the proteins differentially expressed between the high-, intermediate- and low-risk strata in the dataset, together with proteins reported in the published literature on the thyroid or thyroid cancer. Proteins not present in the dataset of the 274 cases were excluded. A protein was also removed if its missing rate exceeded 45%. If the absolute value of the Pearson correlation between a pair of proteins was less than 0.1, the pair was removed. In the second step, combinatorial optimization was performed to select the best combination of 10 proteins from the screened proteins. Because no algorithm can guarantee finding the global optimum efficiently, a machine learning algorithm was used to search for the best protein combination.
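As an illustration of the screening step, the missing-rate rule can be sketched as below. Only the 45% threshold comes from the text; the function name, the data layout (a dict of protein name to intensity list, with `None` for missing) and the toy values are hypothetical.

```python
def screen_by_missing_rate(intensities, max_missing=0.45):
    """Keep proteins whose fraction of missing values does not exceed max_missing."""
    kept = {}
    for protein, values in intensities.items():
        rate = sum(v is None for v in values) / len(values)
        if rate <= max_missing:   # removed only when the rate is greater than 45%
            kept[protein] = values
    return kept


toy = {
    "protein_a": [14.2, 15.1, None, 13.9],   # 25% missing, kept
    "protein_b": [None, None, 16.0, None],   # 75% missing, removed
    "protein_c": [None, 12.5, None, 13.0],   # 50% missing, removed
}
kept = screen_by_missing_rate(toy)
print(list(kept))
```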
Evolutionary operations (crossover, mutation and selection) were used to generate new protein feature combinations from existing ones. In each iteration, the algorithm eliminates low-fitness combinations and generates new combinations from the remaining high-fitness ones.
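The evolutionary search described above could look roughly like the sketch below. It is not the inventors' implementation: the population size, mutation rate, crossover scheme and the toy fitness function are all assumptions, and in practice the fitness would be a model-based score such as a validation AUC.

```python
import random


def evolve_feature_sets(pool, k, fitness, generations=30, pop_size=20,
                        keep=5, mutation_rate=0.2, seed=0):
    """Search for a high-fitness combination of k features drawn from pool."""
    rng = random.Random(seed)
    population = [tuple(sorted(rng.sample(pool, k))) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: eliminate low-fitness combinations, keep the best ones.
        population.sort(key=fitness, reverse=True)
        survivors = population[:keep]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: draw k features from the union of two parents.
            child = set(rng.sample(sorted(set(a) | set(b)), k))
            # Mutation: occasionally swap one feature for one outside the child.
            if rng.random() < mutation_rate:
                outside = [f for f in pool if f not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = survivors + children
    return max(population, key=fitness)


# Toy problem: fitness counts how many "informative" proteins a combination holds.
pool = [f"P{i}" for i in range(30)]
informative = {"P1", "P4", "P7"}


def toy_fitness(combo):
    return len(informative & set(combo))


best = evolve_feature_sets(pool, k=10, fitness=toy_fitness)
```

Because the best survivors are carried over unchanged, the best fitness in the population never decreases from one iteration to the next.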
Stage 3: Model training
The present invention uses a model based on gradient-boosted trees, extreme gradient boosting (XGBoost), to link the input features to PTC risk stratification as a supervised learning problem. Supervised learning is the process of adjusting a classifier's parameters with a set of samples of known class until it reaches the required performance. A total of 15 clinical indicators potentially related to preoperative PTC risk stratification were entered into the machine learning model. The discovery set was used to train the model and tune its parameters: it was split into a training subset and a test subset, the model was built on the training subset and checked on the test subset, and the model with the best AUC on the test subset was selected and saved. That model was then validated on the two independent validation sets, its performance was assessed by the area under the receiver operating characteristic curve (AUC), and the stratification result was obtained.
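Since model selection relies on the AUC, a small dependency-free sketch of that metric may help. It uses the pairwise (Mann-Whitney) formulation, in which the AUC is the fraction of positive/negative pairs ranked correctly, with ties counted as one half. The function name and the toy labels and scores are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 1 = intermediate/high risk, 0 = low risk (toy data).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```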
Stage 4: Model validation
Unknown samples are fed into the established model. For the feature combination selected by the model above, the clinical and proteomic information of an unknown sample is entered, and the trained model returns the patient's likely risk stratum. A flow chart of feature selection, machine learning model construction and model validation is shown in Figure 2.
(9) Statistical analysis
Statistical analysis was performed in R (version 3.5.1) with heatmap and plotting functions. The CV was calculated as the ratio of the standard deviation to the mean. P values for the expression of the protein signature were calculated by one-way analysis of variance. Differential proteins were identified with volcano plots, using the criteria: 1) unpaired two-sided Welch's t test p<0.05; 2) fold-change>1.2 or fold-change<-1.2. Biological insights were analyzed with Ingenuity Pathway Analysis (IPA). The Pearson correlation coefficient was used to assess the correlation between replicates. The AUC was used to evaluate average algorithm performance. SHAP summary plots were used to illustrate the importance distribution of each variable.
II. Results
(1) General clinical characteristics
A total of 558 PTC samples were included in the study. The mean patient age was 45.69 years; 397 patients were female and 161 male, a female-to-male ratio of 2.47:1. The mean tumor diameter was 13.01 mm. On ultrasound, 244 cases (43.73%) showed capsular invasion, 179 (32.08%) showed extrathyroidal extension and 103 (18.46%) were multifocal.
The clinicopathological data of the 558 PTC patients are shown in Table 1.
Table 1. Clinicopathological data of 558 PTC patients
(2) Construction of the proteomics database
A thyroid proteome database was built by PCT-MS to support the identification and quantification of proteins in papillary thyroid carcinoma by DIA-MS. After contaminant and duplicate proteins were filtered out, 5774, 5025 and 6301 proteins were identified and quantified in the raw data of the discovery set, retrospective validation set and prospective validation set, respectively. Across the three sets of raw data, the thyroid database contained 121,960 peptides and 9,941 protein groups. The DIA datasets obtained validated this library, which was then applied to proteomic stratification into high, intermediate and low risk.
(3) Quality control
To ensure that the proteomic data subsequently used to build the model were stable and reliable, strict quality control was applied.
1. Pooled sample analysis: the coefficient of variation (CV) of the pooled samples was calculated separately for the discovery set (paraffin samples), the retrospective set (paraffin samples) and the prospective set (needle biopsy samples). The medians were 0.019, 0.014 and 0.011, respectively, indicating low variation (see Figure 4A).
2. Technical replicate analysis: 27, 13 and 8 samples were randomly selected from the three sets above, respectively, and their mass spectrometry analysis was repeated once. The two runs were compared by Pearson correlation; all correlation coefficients were above 0.95 (the closer the absolute value is to 1, the stronger the correlation; see Figure 4B).
3. Biological replicate analysis: one sample was randomly selected from each batch of the discovery set, giving 22 samples that were analyzed by mass spectrometry as biological replicates. The correlation coefficients were above 0.9, indicating strong correlation and no obvious batch effect in the discovery set (see Figure 4C). Bias arising from biology and from the instrument is therefore negligible.
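The two quality-control statistics used above can be written out in a few lines. The CV follows the definition given in the statistical analysis section (standard deviation divided by mean); the use of the sample standard deviation and the toy numbers are assumptions.

```python
import math


def coefficient_of_variation(values):
    """CV = sample standard deviation / mean."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return sd / mean


def pearson(x, y):
    """Pearson correlation between two equally long runs of intensities."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy intensities of one protein across three pooled-sample injections.
cv = coefficient_of_variation([9.0, 10.0, 11.0])
# Toy intensities of the same proteins in a run and its technical repeat.
r = pearson([12.1, 15.3, 18.0, 20.2], [12.0, 15.5, 17.8, 20.4])
print(cv, r)
```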
(4) Differential proteins between risk groups
Once quality control was complete, a complete and stable proteome was available, and differentially expressed proteins could be identified by comparing expression levels across the high-, intermediate- and low-risk groups over all proteins. Further analysis of the three groups is shown in the volcano plots of Figure 5. Panel A: high-risk versus low-risk samples had similar overall protein expression, with 97 proteins up-regulated and 71 down-regulated. Panel B: high-risk versus intermediate-risk samples had similar overall protein levels, with 44 proteins up-regulated and 44 down-regulated. Panel C: intermediate-risk versus low-risk samples had similar overall protein levels, with 49 proteins up-regulated and 26 down-regulated.
(5) Prediction model built by machine learning
To distinguish PTCs of different risk strata by a proteomic approach, the present invention established a machine learning algorithm built on a feature selection process. The study design and workflow are shown in Figure 2. Modeling comprises four parts: data preprocessing, feature selection, model training and prediction. The machine learning model was built on the discovery set described above through cross-training and validation. Protein features selected with the XGBoost algorithm, combined with 10-fold cross-validation, led to a final set of 6 protein features and 11 clinical features for building the model.
The proteomic data and the detailed clinical and genetic data described above were used to build a machine learning model that predicts the risk level of PTC samples. On the training set, a Student's t test and a fold change (FC) value were computed for every pair of risk levels and every protein feature, to identify the protein features that best separate PTC samples of different risk levels. Proteins meeting a P value threshold of 0.05 and |log2(FC)|>0.25 were selected, and proteins with a missing rate ≥0.5 were then removed. Six proteins were selected by these criteria. These protein features were normalized to the range 0 to 1, and their missing values were set to 0. We used the same features and applied the same normalization to both test sets. Features that could not be detected in a test dataset were set to zero vectors.
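The normalization described here can be sketched as a min-max scaling whose bounds are learned on the training set and reused for the test sets, with missing values mapped to 0. The text does not say how test values outside the training range are handled, so this sketch leaves them unclipped; the function names and numbers are hypothetical.

```python
def fit_min_max(train_column):
    """Learn the scaling bounds from the observed training values of one protein."""
    observed = [v for v in train_column if v is not None]
    return min(observed), max(observed)


def apply_min_max(column, lo, hi):
    """Scale to [0, 1] on the training range; missing values become 0."""
    span = hi - lo
    return [0.0 if v is None or span == 0 else (v - lo) / span for v in column]


train = [10.0, None, 20.0, 15.0]
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
# A protein not detected at all in a test set becomes a zero vector.
undetected = apply_min_max([None, None, None], lo, hi)
print(train_scaled, undetected)
```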
Before the model was trained, the 274 PTC samples of the discovery set were divided into a training set (n=191) and an internal validation set (n=83). The model was developed on the training set, and its performance was validated and optimized on the internal validation set. We designed a machine learning architecture that includes feature selection and risk level classification.
The core of the algorithm is a cascaded two-step feature selection that allows protein features and other multi-omics features to be selected. First, a grid search is used to optimize parameters and features; then, for a given set of parameters, an XGBoost model is built on the protein matrix and all protein features are ranked by importance. With this tool we selected the top 6 protein features; the number of proteins was chosen as the average of the numbers of clinical (7), immunological (8) and genetic (1) features. All 22 features were then combined to build a second XGBoost model with the same parameters and obtain the feature importances. Finally, the validation set was used to select the top k features giving the best area under the curve (AUC). This pipeline yields the following algorithm.
Algorithm 1
Input: protein matrix P; other feature matrix Q; grid space G
Best_AUC = 0
For the grid in G:
    Model1 = XGBoost (P, grid)
    Importance1 = sort (Model1.importance)
    P_selected = P[:,Importance1[:6]]
    Multi_omics_features = [Q,P_selected]
    Model2 = XGBoost (Multi_omics_features, grid)
    Importance2 = sort (Model2.importance)
    For num in range (num of Multi_omics_features):
        Final_features = Multi_omics_features[:,Importance2[:num]]
        Model3 = XGBoost (Final_features, tree_num=20)
        Pred_score = Model3(validation_data[:,Final_features])
        Temp_AUC = AUC (label,Pred_score)
        If Temp_AUC>Best_AUC:
            Best_AUC = Temp_AUC
            Best_parameter = [grid, num]
            Best_features = Final_features
return Best_parameter, Best_features
Output: Model_parameter = Best_parameter, Model_features = Best_features
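The control flow of Algorithm 1 can be sketched in runnable form as below. To keep the example self-contained, model fitting is abstracted into two caller-supplied functions: `rank_features` stands in for "fit XGBoost with this grid and sort features by importance", and `validation_auc` stands in for "fit the 20-tree model and score it on the validation set". Both names, the toy weights and the toy scoring rule are hypothetical and do not call XGBoost. The loop starts at one feature, and the best AUC is updated whenever it improves.

```python
def two_step_search(protein_names, other_names, grids, rank_features,
                    validation_auc, n_proteins=6):
    """Cascaded selection: top proteins first, then top-k of the combined features."""
    best_auc, best_params, best_features = 0.0, None, None
    for grid in grids:
        # Step 1: rank proteins alone and keep the most important ones.
        top_proteins = rank_features(protein_names, grid)[:n_proteins]
        # Step 2: rank the combined multi-omics features with the same grid.
        ranked = rank_features(list(other_names) + top_proteins, grid)
        for num in range(1, len(ranked) + 1):
            candidate = ranked[:num]
            auc = validation_auc(candidate)
            if auc > best_auc:
                best_auc, best_params, best_features = auc, (grid, num), candidate
    return best_params, best_features, best_auc


# Toy stand-ins so the sketch runs without any model library.
weights = {"age": 0.10, "tumor_diameter": 0.08, "BRAF": 0.02,
           "prot_1": 0.09, "prot_2": 0.05, "prot_3": 0.01}


def toy_rank(names, grid):
    return sorted(names, key=lambda n: weights[n], reverse=True)


def toy_auc(names):
    # Rewards informative features, penalizes the size of the feature set.
    return 0.5 + sum(weights[n] for n in names) - 0.03 * len(names)


params, features, auc_value = two_step_search(
    ["prot_1", "prot_2", "prot_3"], ["age", "tumor_diameter", "BRAF"],
    grids=[{"max_depth": 3}, {"max_depth": 5}],
    rank_features=toy_rank, validation_auc=toy_auc, n_proteins=2)
print(params, features)
```

Passing the training and scoring steps in as functions keeps the search logic separate from the model library, so the same loop could be reused with a real gradient-boosting implementation.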
We built four models with progressively more feature dimensions. Model A used features from a single dimension, preoperative clinical data (n=7). Model B added a second dimension, immunological indices (n=8). Model C further added the genetic feature. Finally, Model D added the proteomic features (n=6), giving the final model based on four feature dimensions. We then compared the predictive performance of the 4 models.
Clinically, it is most useful to distinguish low-risk cases from intermediate/high-risk cases, so this distinction was taken as the outcome for the subsequent analysis. The predictive performance of the different feature combinations was compared (see Table 2).
1. In the initial modeling, Model A, built on clinical data alone (capsular invasion, extrathyroidal extension, tumor diameter, multifocality and age), did not perform well: the AUCs in the discovery set, retrospective validation set and prospective validation set were 0.86, 0.68 and 0.71, respectively.
2. Model B, which combines clinical and immune features, clearly improved the validation results, with AUCs reaching 0.94, 0.65 and 0.71, respectively.
3. For Model C, built on the combination of clinical, immune and BRAF features, the validation performance in the discovery set improved markedly, with AUCs reaching 0.93, 0.68 and 0.73.
4. Finally, protein indicators were added, and Model D (clinical + immune + BRAF + protein) was used as the prediction model; its AUCs in the three sets reached 0.92, 0.79 and 0.80, respectively.
Table 2. AUC of each combined model in the discovery set, retrospective validation set and prospective validation set
(6) Final feature selection and assisted risk prediction
The characteristic modeling indicators obtained by machine learning screening of the models above are as follows:
Based on the four feature dimensions "clinical + immune + BRAF + protein", the present invention built an XGBoost model and used feature selection to output the following indicators: clinical factors (capsular invasion, extrathyroidal extension, tumor diameter, multifocality and age); blood immune indices (preoperative monocyte count (M), preoperative neutrophil count (N), preoperative lymphocyte-to-monocyte ratio (LMR), preoperative platelet count (Plt) and preoperative lymphocyte count (L)); proteomics (DPP7, DPLIM3, COL12A1, CTSL, TUBB2A and ITGB5); and genomics (BRAF gene mutation).
To show the influence of each feature in the model, a visual explanation model was built with the Shapley additive explanations (SHAP) algorithm. For each sample the model produces a prediction, and the SHAP value is the contribution assigned to each feature for that sample (Figure 6); the plot also serves as a ranking of feature importance. The main advantage of SHAP values is that they reflect the influence of the features in each individual sample and also show whether that influence is positive or negative.
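The ranking shown in a SHAP summary plot is based on the mean absolute SHAP value of each feature across samples. The sketch below shows only that aggregation step, starting from an already computed matrix of per-sample SHAP values; the numbers are made up for illustration and the function name is hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP| over samples (rows = samples, columns = features)."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)


names = ["tumor_diameter", "age", "BRAF"]
# Illustrative per-sample SHAP values; the sign gives the direction of the effect.
shap_rows = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.10, -0.05],
    [0.40, -0.30, 0.00],
]
ranking = rank_by_mean_abs_shap(shap_rows, names)
print(ranking)
```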
The XGBoost model outputs a score for each patient tested: a score below 0.6645 is classified by the machine learning model as low-risk PTC, and a score above 0.6645 as intermediate/high-risk PTC. In parallel, 9 clinicians of different seniority each carried out an independent preliminary analysis. The clinicians' results were compared with those of the XGBoost model, which assisted the clinicians in reaching their final analysis.
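The decision rule amounts to a single threshold on the model score, as in this minimal sketch. The text does not say how a score exactly equal to 0.6645 is handled; treating it as low risk here is an assumption, as are the function name and the labels.

```python
RISK_THRESHOLD = 0.6645  # cut-off stated in the text


def stratify(score, threshold=RISK_THRESHOLD):
    """Map a model score to a risk stratum; equality is treated as low risk (assumption)."""
    return "intermediate/high risk" if score > threshold else "low risk"


print(stratify(0.70), stratify(0.30))
```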
为了评估机器学习模型的效果,本发明比较了由四维特征构建的XGBoost模型和不同资历水平的临床医生之间的术前风险分层的效果(图4A-B)。预测性能结果随着临床医生资历水平的提高而逐渐改善。在回顾测试集中,XGBoost的准确率(0.737[95% CI 0.668-0.802])主要高于仅有高年资临床医生的准确率(0.598[95% CI 0.584-0.606],P<0.0001)。在前瞻性测试集中,XGBoost的准确率(0.736[95% CI 0.717-0.740])也明显高于仅有高年资临床医生的准确率(0.579[95% CI 0.573-0.593],P<0.0001)。这些结果表明,与临床医生相比,XGBoost模型的预测能力更强。To evaluate the effect of the machine learning model, we compared the effect of the XGBoost model constructed from four-dimensional features and preoperative risk stratification among clinicians of different seniority levels (Fig. 4A-B). Predictive performance results gradually improved with increasing levels of clinician seniority. In the retrospective test set, the accuracy of XGBoost (0.737 [95% CI 0.668-0.802]) was significantly higher than that of only senior clinicians (0.598 [95% CI 0.584-0.606], P<0.0001). In the prospective test set, the accuracy of XGBoost (0.736 [95% CI 0.717-0.740]) was also significantly higher than that of only senior clinicians (0.579 [95% CI 0.573-0.593], P<0.0001) . These results suggest that the XGBoost model has a stronger predictive power than clinicians.
本发明进一步探讨了用机器学习辅助临床医生的预测能力。结果显示,在机器学习模型的辅助下,临床医生的预测效能随着资历水平逐渐提高。在回顾测试集中,XGBoost模型的准确率为0.737(95%CI为0.668-0.802),然而机器学习辅助下的高年资临床医生其预测准确率提高到0.844(95%CI为0.829-0.844)(P<0.0001;图4A)。在前瞻测试集中,XGBoost模型的准确率为0.736(95% CI 0.717-0.740),在机器学习辅助下的高年资临床医生其预测准确率提高到0.835(95% CI 0.822-0.842)(P<0.0001;图4B)。这些结果表明,机器学习辅助下的临床医生预测效能比机器学习预测效能更优。The present invention further explores the use of machine learning to assist the predictive power of clinicians. The results showed that with the aid of the machine learning model, the predictive power of clinicians gradually increased with the level of seniority. In the retrospective test set, the accuracy of the XGBoost model was 0.737 (95% CI 0.668-0.802), while the prediction accuracy of senior clinicians assisted by machine learning improved to 0.844 (95% CI 0.829-0.844) ( P<0.0001; Figure 4A). In the prospective test set, the accuracy of the XGBoost model was 0.736 (95% CI 0.717-0.740), and the prediction accuracy of senior clinicians assisted by machine learning was increased to 0.835 (95% CI 0.822-0.842) (P< 0.0001; Figure 4B). These results suggest that machine learning-assisted clinician predictive performance is superior to machine learning predictive performance.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and shall not be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310090839.8A CN115881296B (en) | 2023-02-09 | 2023-02-09 | A risk-assisted stratification system for papillary thyroid carcinoma (PTC) |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115881296A (en) | 2023-03-31 |
CN115881296B (en) | 2023-05-26 |
Family
ID=85760938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310090839.8A Active CN115881296B (en) | 2023-02-09 | 2023-02-09 | A risk-assisted stratification system for papillary thyroid carcinoma (PTC) |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115881296B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050048533A1 (en) * | 2003-04-12 | 2005-03-03 | The Johns Hopkins University | BRAF mutation T1796A in thyroid cancers |
US20090061422A1 (en) * | 2005-04-19 | 2009-03-05 | Linke Steven P | Diagnostic markers of breast cancer treatment and progression and methods of use thereof |
CN108025051A (en) * | 2015-07-29 | 2018-05-11 | 诺华股份有限公司 | Combination therapy comprising anti-PD-1 antibody molecules |
CN111724903A (en) * | 2020-06-29 | 2020-09-29 | 北京市肿瘤防治研究所 | A system for predicting the prognosis of gastric cancer in subjects |
CN114171200A (en) * | 2021-12-24 | 2022-03-11 | 中南大学湘雅医院 | PTC (papillary thyroid carcinoma) prognosis marker, application thereof and construction method of PTC prognosis evaluation model |
CN114927223A (en) * | 2021-07-02 | 2022-08-19 | 中国医学科学院北京协和医院 | Evaluation model and method for differentiated thyroid cancer after total resection |
CN115144599A (en) * | 2022-09-05 | 2022-10-04 | 西湖大学 | Use of protein combination in the preparation of a kit for prognostic stratification of childhood thyroid cancer, kit and system thereof |
Non-Patent Citations (6)
Title |
---|
DINGCUN LUO: "Artificial intelligence defines protein-based classification of thyroid nodules", 《CELL DISCOV》 * |
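王成晨; 向大鹏; 李志宇: "Relationship between gene mutations associated with papillary thyroid carcinoma and its clinicopathological features", 实用肿瘤杂志 * |
罗定存: "Clinical value of detecting the BRAF V600E mutation by the ARMS method in papillary thyroid carcinoma", 《中国耳鼻咽喉头颈外科》 * |
蒋筠; 顾丽萍; 李娜; 王育; 吴艺捷; 彭永德: "Study of peripheral blood thyroid peroxidase mRNA and the risk of metastasis in papillary thyroid carcinoma", 现代生物医学进展 * |
薛金才; 刘勤江; 田尤新; 侯小峰: "Value of BRAF V600E and TERT promoter mutations in the risk assessment of papillary thyroid microcarcinoma", 中国癌症杂志 * |
闫维; 姜涛; 郑晓; 张冬雪: "Risk prediction and risk stratification of central lymph node metastasis in papillary thyroid carcinoma", 实用医学杂志 * |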
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116230237A (en) * | 2023-05-06 | 2023-06-06 | 四川省医学科学院·四川省人民医院 | Lung cancer influence evaluation method and system based on ROI focus features |
CN116230237B (en) * | 2023-05-06 | 2023-07-21 | 四川省医学科学院·四川省人民医院 | Lung cancer influence evaluation method and system based on ROI focus features |
TWI866867B (en) * | 2024-05-30 | 2024-12-11 | 國立清華大學 | Method of breast cancer risk assessment |
CN118675744A (en) * | 2024-06-03 | 2024-09-20 | 山东大学齐鲁医院 | Gastric cancer patient prognosis prediction method and system based on inflammation nutrition index |
Also Published As
Publication number | Publication date |
---|---|
CN115881296B (en) | 2023-05-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||