CN118792426A - Pancreatic ductal adenocarcinoma prediction method and system based on oral flora

Info

Publication number: CN118792426A
Application number: CN202410930065.XA
Authority: CN
Inventors: 李�真; 朱逸清; 梁潇; 李理想; 冯强; 支梦凡; 张国明; 陈昶旭; 王逸凡; 钟宁
Original assignee: Shandong University; Qilu Hospital of Shandong University
Current assignee: Shandong University; Qilu Hospital of Shandong University
Priority date: 2024-07-11
Filing date: 2024-07-11
Publication date: 2024-10-18

Abstract

The present invention belongs to the field of pancreatic ductal adenocarcinoma prediction and provides a method and system for predicting pancreatic ductal adenocarcinoma based on oral flora. The method includes: obtaining the relative abundance information of each biomarker in a set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma in a subject to be tested; and using that relative abundance information and a pre-trained multivariate statistical model to obtain the probability value that the subject has pancreatic ductal adenocarcinoma. The biomarker set is obtained by performing differential microbiome analysis on saliva, duodenal fluid and pancreatic tissue samples from patients with pancreatic ductal adenocarcinoma and patients with benign pancreatic disease, and taking the intersection of the differential bacteria found in the different sample types.

Description

Pancreatic ductal adenocarcinoma prediction method and system based on oral flora

Technical Field

The present invention belongs to the field of pancreatic ductal adenocarcinoma prediction, and in particular relates to a method and system for predicting pancreatic ductal adenocarcinoma based on oral flora.

Background Art

The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.

Because the pancreas is anatomically concealed, early lesions are difficult to detect, most cases of PDAC (pancreatic ductal adenocarcinoma) lack clear symptoms, and few biomarkers are available for diagnosis, approximately 80-85% of patients already have unresectable or metastatic disease at the time of diagnosis. Even among the small proportion of patients diagnosed with locally resectable tumors, the prognosis remains poor, with a 5-year survival rate of only 20%. The etiology of PDAC is complex, and microbial flora, for example that of the oral cavity, may be involved in its onset. Existing studies have shown that the oral microbial composition differs between healthy controls and pancreatic cancer patients. However, the relationship between the oral flora and the pancreatic flora is unclear, and no biomarker combination for predicting PDAC has been determined jointly with the flora of the pancreas itself. As a result, the probability of pancreatic ductal adenocarcinoma cannot be quantitatively predicted from the oral flora, and doctors cannot be given more quantitative and precise decision support.

Summary of the Invention

To solve the technical problems in the background art described above, the present invention provides a method and system for predicting pancreatic ductal adenocarcinoma based on oral flora, which can give doctors more quantitative and precise decision support.

To achieve the above object, the present invention adopts the following technical solutions:

A first aspect of the present invention provides a method for predicting pancreatic ductal adenocarcinoma based on oral flora.

A method for predicting pancreatic ductal adenocarcinoma based on oral flora, comprising:

obtaining the relative abundance information of each biomarker in a set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma in a subject to be tested;

using the relative abundance information of each biomarker in the biomarker set and a pre-trained multivariate statistical model to obtain the probability value of having pancreatic ductal adenocarcinoma;

wherein the set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma is obtained by performing between-group differential microbiome analysis on saliva, duodenal fluid and pancreatic tissue samples from a pancreatic ductal adenocarcinoma patient group and a benign pancreatic disease patient group, and taking the intersection of the differential bacteria;

and wherein, in training the multivariate statistical model, each training sample consists of the relative abundance information of each biomarker in the biomarker set and its corresponding attribute label; the attribute labels include pancreatic ductal adenocarcinoma patient and benign pancreatic disease patient.

A second aspect of the present invention provides a system for predicting pancreatic ductal adenocarcinoma based on oral flora.

A system for predicting pancreatic ductal adenocarcinoma based on oral flora, comprising:

a relative abundance information acquisition module, configured to obtain the relative abundance information of each biomarker in a set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma in a subject to be tested;

a pancreatic ductal adenocarcinoma prediction module, configured to obtain the probability value of having pancreatic ductal adenocarcinoma using the relative abundance information of each biomarker in the biomarker set and a pre-trained multivariate statistical model.

A third aspect of the present invention provides a computer-readable storage medium.

A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method for predicting pancreatic ductal adenocarcinoma based on oral flora described above.

A fourth aspect of the present invention provides a computer device.

A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for predicting pancreatic ductal adenocarcinoma based on oral flora described above.

Compared with the prior art, the present invention has the following beneficial effects:

(1) The present invention uses the relative abundance information of each biomarker in the set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma, together with a multivariate statistical model, to predict the probability value of having pancreatic ductal adenocarcinoma. It establishes the relationship between the oral flora and the pancreatic flora, determines a biomarker combination for predicting pancreatic ductal adenocarcinoma jointly with the flora of the pancreas itself, and quantitatively predicts the probability of pancreatic ductal adenocarcinoma from the oral flora. It can thereby give doctors more quantitative and precise decision support and provide data to back a diagnosis of pancreatic ductal adenocarcinoma.

(2) The biomarkers disclosed in the present invention have high accuracy and specificity and have good prospects for development into a diagnostic method, thereby providing a basis for PDAC risk assessment, diagnosis and early diagnosis, and for the search for potential drug targets. The oral flora-based PDAC biomarker combination can be used as a detection target in the preparation of detection kits, and as a target in screening drugs for treating and/or preventing PDAC; changes in the relative abundance of the biomarker combination provide a basis for determining whether a candidate drug is effective.

Advantages of additional aspects of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief Description of the Drawings

The accompanying drawings, which form a part of the present invention, are provided for a further understanding of the present invention. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it.

FIG. 1 is a flow chart of the method for predicting pancreatic ductal adenocarcinoma based on oral flora according to an embodiment of the present invention;

FIG. 2 shows the set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma according to an embodiment of the present invention;

FIG. 3 shows the performance of the biomarkers in diagnosing pancreatic ductal adenocarcinoma, as tested with the RF model, according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the system for predicting pancreatic ductal adenocarcinoma based on oral flora according to an embodiment of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and embodiments.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present invention. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

It should also be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. In addition, when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Explanation of terms:

PDAC, pancreatic ductal adenocarcinoma: the most common type of pancreatic cancer, formed by malignant transformation of the exocrine tissue of the pancreas (including acinar cells and ductal cells); it mainly originates from pancreatic intraepithelial neoplasia or cystic tumors. Compared with healthy individuals, PDAC patients show varying degrees of dysbiosis, manifested mainly as a decrease in probiotic bacteria and an increase in opportunistic pathogens.

Biomarker: "a characteristic that can be objectively measured and evaluated as an indicator of normal biological processes, pathological processes, or pharmacological responses to a therapeutic intervention". Examples include nucleic acid markers (also called gene markers, for example DNA), protein markers, cytokine markers, chemokine markers, carbohydrate markers, antigen markers, antibody markers, species markers (species/genus markers) and functional markers (KO/OG markers). The meaning of nucleic acid marker is not limited to existing genes that can be expressed as biologically active proteins; it also covers any nucleic acid fragment, which may be DNA or RNA, modified or unmodified, as well as collections of such fragments. Nucleic acid markers are sometimes also referred to herein as characteristic fragments.

In the present invention, the biomarkers may also be referred to as "oral markers", because the biomarkers found to be associated with PDAC are all present in the oral cavity of the subject. Biomarkers are measured and evaluated, are often used to examine normal biological processes, pathogenic processes, or pharmacological responses to therapeutic interventions, and are useful in many scientific fields.

Embodiment 1

FIG. 1 is a flow chart of the method for predicting pancreatic ductal adenocarcinoma based on oral flora according to an embodiment of the present invention. As shown in FIG. 1, this embodiment provides a method for predicting pancreatic ductal adenocarcinoma based on oral flora, which specifically includes the following steps:

S101: Obtain the relative abundance information of each biomarker in a set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma in a subject to be tested.

The set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma is obtained by performing between-group differential microbiome analysis on saliva, duodenal fluid and pancreatic tissue samples from a pancreatic ductal adenocarcinoma patient group and a benign pancreatic disease patient group, and taking the intersection of the differential bacteria.

The relative abundance information of each biomarker in the biomarker set is obtained by sequencing, using a second-generation or third-generation sequencing method. The means of sequencing is not particularly limited; second-generation or third-generation sequencing allows fast and efficient sequencing. For example, sequencing is performed with at least one device selected from HiSeq 2000, SOLiD, 454 and single-molecule sequencing devices. The high-throughput, deep-sequencing characteristics of these devices can thus be exploited, which benefits the analysis of the subsequent sequencing data, in particular the precision and accuracy of statistical tests.

In some specific embodiments, saliva, duodenal fluid and pancreatic tissue samples are collected, transported frozen and quickly transferred to -80°C for storage, and DNA is then extracted to obtain DNA samples. The saliva, duodenal fluid and pancreatic tissue samples of the pancreatic ductal adenocarcinoma patient group and the benign pancreatic disease patient group of the present invention come from China and total 239 samples: 85 saliva samples (22 patients with benign pancreatic lesions and 63 PDAC patients), 69 duodenal fluid samples (19 patients with benign pancreatic lesions and 50 PDAC patients), and 85 pancreatic tissue samples (22 patients with benign pancreatic lesions and 63 PDAC patients). All samples were obtained with the consent of the subjects and from lawful sources.

Using the extracted DNA as a template, the V3-V4 variable region of the 16S rRNA gene was amplified by PCR with the barcoded forward primer 338F (5'-ACTCCTACGGGAGGCAGCAG-3') and the reverse primer 806R (5'-GGACTACHVGGGTWTCTAAT-3'). Libraries were constructed from the purified PCR products using the NEXTFLEX Rapid DNA-Seq Kit. Sequencing was performed on the Illumina MiSeq PE300/NovaSeq PE250 platform.
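By way of illustration only, the degenerate positions in primer 806R (H, V and W) can be expanded with the standard IUPAC nucleotide codes. The following sketch is not part of the disclosed method; the function names `primer_to_regex` and `degeneracy` are hypothetical and are introduced here only to show how many concrete sequences the primer stands for.

```python
import re

# Standard IUPAC nucleotide codes; only the codes that occur in the two
# primers are listed here.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "H": "ACT",  # any base except G
    "V": "ACG",  # any base except T
    "W": "AT",   # A or T
}

PRIMER_338F = "ACTCCTACGGGAGGCAGCAG"
PRIMER_806R = "GGACTACHVGGGTWTCTAAT"


def primer_to_regex(primer: str) -> str:
    """Turn a primer containing IUPAC degenerate bases into a regex pattern."""
    parts = []
    for base in primer:
        options = IUPAC[base]
        parts.append(base if len(options) == 1 else f"[{options}]")
    return "".join(parts)


def degeneracy(primer: str) -> int:
    """Count the distinct concrete sequences the primer represents."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


pattern_806r = primer_to_regex(PRIMER_806R)
print(pattern_806r)             # GGACTAC[ACT][ACG]GGGT[AT]TCTAAT
print(degeneracy(PRIMER_806R))  # 18 (3 * 3 * 2)
print(degeneracy(PRIMER_338F))  # 1 (no degenerate positions)
```

Primer 338F contains no degenerate bases, so it matches a single sequence, whereas 806R covers 18 variants.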

The raw paired-end reads were quality-controlled and merged. Using UPARSE (version 7.1; maxee = 3, minlength = 370), the quality-controlled, merged sequences were clustered into operational taxonomic units (OTUs) at 97% similarity, and chimeras were removed. Representative reads of the 7219 OTUs were compared against the Ribosomal Database Project (release 18) to obtain the taxonomy of the microbiome at 80% similarity to species in the reference database.
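For illustration, relative abundance is conventionally computed as each taxon's read count divided by the total read count of the sample. The sketch below assumes this conventional definition, which the description does not spell out; the function name and the counts are hypothetical and are not data from this disclosure.

```python
def relative_abundance(otu_counts: dict[str, int]) -> dict[str, float]:
    """Convert per-taxon read counts for one sample into fractions summing to 1."""
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {taxon: count / total for taxon, count in otu_counts.items()}


# Hypothetical read counts for a single saliva sample.
counts = {"Prevotella buccalis": 30, "Slackia exigua": 10, "other taxa": 60}
abundance = relative_abundance(counts)
print(abundance["Prevotella buccalis"])  # 0.3
```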

The analysis was performed with MaAsLin2 (version 1.14.1), adjusting for the effects of age and sex, and genera or species with a false discovery rate (FDR)-adjusted p-value below 0.001 were selected. Between-group differential microbiome analysis was carried out at the species level on the saliva, duodenal fluid and pancreatic tissue samples of PDAC patients and patients with benign pancreatic lesions. The differential bacteria found in the different sample types were intersected and then clustered using K-means, yielding a panel of 25 microorganisms (4 clusters), shown in FIG. 2, as potential biomarkers for diagnosing PDAC. The biomarker set consists of the following biomarkers:
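The selection logic described above (FDR adjustment, a 0.001 cut-off, and intersection across the three sample types) can be sketched as follows. This is an illustrative sketch only: it assumes the Benjamini-Hochberg procedure for the FDR adjustment, which the description does not specify; it omits the covariate-adjusted modelling done by MaAsLin2 and the K-means clustering; and the taxon names and p-values are hypothetical.

```python
def benjamini_hochberg(p_values: dict[str, float]) -> dict[str, float]:
    """Return Benjamini-Hochberg FDR-adjusted p-values keyed by taxon."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    n = len(ranked)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(n, 0, -1):
        taxon, p = ranked[rank - 1]
        running_min = min(running_min, p * n / rank)
        adjusted[taxon] = running_min
    return adjusted


def differential_taxa(p_values: dict[str, float], alpha: float = 0.001) -> set[str]:
    """Taxa whose FDR-adjusted p-value is below alpha."""
    return {t for t, q in benjamini_hochberg(p_values).items() if q < alpha}


# Hypothetical raw p-values per sample type (not data from this disclosure).
saliva = {"taxon_A": 1e-6, "taxon_B": 2e-5, "taxon_C": 0.2}
duodenal_fluid = {"taxon_A": 1e-5, "taxon_B": 0.3, "taxon_C": 0.5}
pancreatic_tissue = {"taxon_A": 1e-7, "taxon_B": 1e-6, "taxon_C": 0.04}

# Only taxa that are differential in all three sample types are kept.
panel = (
    differential_taxa(saliva)
    & differential_taxa(duodenal_fluid)
    & differential_taxa(pancreatic_tissue)
)
print(panel)  # {'taxon_A'}
```

Taking the intersection keeps only taxa that differ between the groups at every sampled site, which is how the description links the oral flora to the pancreatic flora.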

Biomarker 1 is Bifidobacterium bifidum;

Biomarker 2 is Sutterella massiliensis;

Biomarker 3 is Herbaspirillum huttiense;

Biomarker 4 is Prevotella buccalis;

Biomarker 5 is Phocaeicola abscessus;

Biomarker 6 is Prevotella dentalis;

Biomarker 7 is Peptoanaerobacter stomatis;

Biomarker 8 is Schwartzia succinivorans;

Biomarker 9 is Anaeroglobus geminatus;

Biomarker 10 is Olsenella uli;

Biomarker 11 is Slackia exigua;

Biomarker 12 is Arachnia rubra;

Biomarker 13 is Leptotrichia goodfellowii;

Biomarker 14 is Propionibacterium acidifaciens;

Biomarker 15 is Fusobacterium mortiferum;

Biomarker 16 is Acidaminococcus fermentans;

Biomarker 17 is Loigolactobacillus coryniformis;

Biomarker 18 is Bacteroides caecigallinarum;

Biomarker 19 is Caldicoprobacter faecalis;

Biomarker 20 is Atopostipes suicloacalis;

Biomarker 21 is Akkermansia muciniphila;

Biomarker 22 is Phocaeicola vulgatus;

Biomarker 23 is Bacteroides acidifaciens;

Biomarker 24 is Lactiplantibacillus plantarum;

Biomarker 25 is Faecalibacterium prausnitzii.

It will be understood that the number of biomarkers in the biomarker set can be selected by a person skilled in the art according to the accuracy actually required, and this is not described in further detail here.

The PDAC-related biomarkers proposed in the present invention are valuable for early diagnosis and have high specificity and sensitivity. Analysis of the oral flora offers accuracy, safety, affordability and patient compliance, and throat swab samples are transportable. Tests based on the polymerase chain reaction (PCR) are comfortable and non-invasive, so people are more likely to take part in a given screening program. The markers of the present invention can also be used as a tool for monitoring the treatment of PDAC patients, to detect the response to treatment. Because of the way abundance is measured, the combination of the above 25 markers is suited to cases where abundance is measured by a marker-gene alignment method.

S102: Use the relative abundance information of each biomarker in the biomarker set and a pre-trained multivariate statistical model to obtain the probability value of having pancreatic ductal adenocarcinoma.

In S102, the multivariate statistical model is preferably an RF (random forest) model.

The 25 microorganisms selected in step S101 are input into the RF model. The RF model is a classifier containing multiple decision trees, and its output class is the mode of the classes output by the individual trees. The RF model is constructed as follows:

(1) Let N denote the number of training cases (samples) and M the number of features.

(2) Input the number of features m used to determine the decision at a node of a decision tree; m should be much smaller than M.

(3) Sample N times with replacement from the N training cases (samples) to form a training set (i.e., bootstrap sampling), and use the cases (samples) that were not drawn to make predictions and estimate the error.

(4) For each node, randomly select m features; the decision at each node of the decision tree is based on these features. The best split is computed from these m features.

(5) Each tree is grown fully without pruning (pruning is a step that might otherwise be applied after a normal tree classifier has been built).

The classifier was evaluated by 5-fold cross-validation. Using the relative abundances of the species selected by the RF model, the PDAC risk was calculated for each individual, the ROC curve was plotted, and the AUC was calculated as the parameter for evaluating the discriminative performance of the model, as shown in FIG. 3.
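The mechanics of steps (3) and (4), the majority vote, and the AUC can be sketched with the Python standard library as follows. This is a minimal illustration under stated assumptions, not the trained model of this embodiment: tree fitting itself is omitted, the value m = 5 (the square root of M = 25, a common heuristic) is an assumption, and all function names, labels and scores are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n: int, rng: random.Random) -> tuple[list[int], list[int]]:
    """Step (3): draw n indices with replacement; undrawn indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features: int, m: int, rng: random.Random) -> list[int]:
    """Step (4): choose m of the M features to consider at one node."""
    return rng.sample(range(n_features), m)


def majority_vote(tree_predictions: list[str]) -> str:
    """Forest output: the mode of the classes predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


rng = random.Random(0)
# N = 85 mirrors the saliva sample count reported above; using it as the
# training-set size here is an assumption made for illustration.
in_bag, out_of_bag = bootstrap_indices(85, rng)
node_features = random_feature_subset(25, 5, rng)  # M = 25 markers, assumed m = 5
vote = majority_vote(["PDAC", "benign", "PDAC"])
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # hypothetical labels and scores
print(vote, auc)  # PDAC 0.75
```

The out-of-bag indices are the samples a given tree never saw during training, which is what allows them to be used for the error estimate mentioned in step (3).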

In other embodiments, the multivariate statistical model may also be implemented with other existing statistical models, which are not described in detail here.

In one or more embodiments, in training the multivariate statistical model, each training sample consists of the relative abundance information of each biomarker in the biomarker set and its corresponding attribute label; the attribute labels include pancreatic ductal adenocarcinoma patient and benign pancreatic disease patient.

In some embodiments, the probability value of having pancreatic ductal adenocarcinoma is compared with a preset probability threshold (for example, 0.5). If the former is greater than the latter, the subject is predicted to be a pancreatic ductal adenocarcinoma patient; otherwise, the subject is predicted to be a benign pancreatic disease patient.
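The threshold comparison can be written as a short function. The sketch below follows the rule as stated (strictly greater than the threshold predicts PDAC); the function name and the returned label strings are hypothetical.

```python
def classify(probability: float, threshold: float = 0.5) -> str:
    """Map a predicted PDAC probability to one of the two attribute labels."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Strictly greater than the threshold predicts PDAC; equality does not.
    if probability > threshold:
        return "PDAC"
    return "benign pancreatic disease"


print(classify(0.73))  # PDAC
print(classify(0.5))   # benign pancreatic disease
```

Lowering the threshold trades specificity for sensitivity, so the value 0.5 is only one possible preset.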

This embodiment uses the relative abundance information of each biomarker in the set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma, together with a multivariate statistical model, to predict the probability value of having pancreatic ductal adenocarcinoma. It establishes the relationship between the oral flora and the pancreatic flora, determines a biomarker combination for predicting pancreatic ductal adenocarcinoma jointly with the flora of the pancreas itself, and quantitatively predicts the probability of pancreatic ductal adenocarcinoma from the oral flora, and can thereby give doctors more quantitative and precise decision support.

The biomarkers disclosed in the present invention have high accuracy and specificity and have good prospects for development into a diagnostic method, thereby providing a basis for PDAC risk assessment, diagnosis and early diagnosis, and for the search for potential drug targets. The oral flora-based PDAC biomarker combination can be used as a detection target in the preparation of detection kits, and as a target in screening drugs for treating and/or preventing PDAC; changes in the relative abundance of the biomarker combination provide a basis for determining whether a candidate drug is effective.

Embodiment 2

As shown in FIG. 4, this embodiment provides a system for predicting pancreatic ductal adenocarcinoma based on oral flora, which includes a relative abundance information acquisition module 401 and a pancreatic ductal adenocarcinoma prediction module 402.

In a specific implementation, the relative abundance information acquisition module 401 is configured to obtain the relative abundance information of each biomarker in a set of potential diagnostic biomarkers for pancreatic ductal adenocarcinoma in a subject to be tested.

The biomarker set here includes but is not limited to the following biomarkers:

In a specific implementation, the pancreatic ductal adenocarcinoma prediction module 402 is configured to obtain the probability value of having pancreatic ductal adenocarcinoma using the relative abundance information of each biomarker in the biomarker set and a pre-trained multivariate statistical model.

In the pancreatic ductal adenocarcinoma prediction module 402, the probability value of having pancreatic ductal adenocarcinoma is compared with a preset probability threshold. If the former is greater than the latter, the subject is predicted to be a pancreatic ductal adenocarcinoma patient; otherwise, the subject is predicted to be a benign pancreatic disease patient.

It will be understood here that the multivariate statistical model is an RF model or another existing multivariate statistical model.

It should be noted that the relative abundance information acquisition module 401 and the pancreatic ductal adenocarcinoma prediction module 402 of the system in this embodiment correspond one-to-one to steps S101 and S102 of Embodiment 1. Their specific implementation is the same and is not described again here.
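One possible way to organize modules 401 and 402 in software is sketched below. This is an assumption-laden illustration rather than the disclosed implementation: the class names, the treatment of missing markers as zero abundance, and the placeholder `toy_model` (which is not a trained model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass
class RelativeAbundanceModule:
    """Sketch of module 401: collects marker abundances in a fixed order."""
    markers: Sequence[str]

    def acquire(self, sample: Mapping[str, float]) -> list[float]:
        # Assumption: a marker absent from the sample has zero abundance.
        return [sample.get(name, 0.0) for name in self.markers]


@dataclass
class PredictionModule:
    """Sketch of module 402: applies a pre-trained model and a threshold."""
    model: Callable[[Sequence[float]], float]
    threshold: float = 0.5

    def predict(self, features: Sequence[float]) -> tuple[float, str]:
        probability = self.model(features)
        label = "PDAC" if probability > self.threshold else "benign pancreatic disease"
        return probability, label


def toy_model(features: Sequence[float]) -> float:
    """Placeholder standing in for a trained model; it has no clinical meaning."""
    return min(1.0, 10.0 * sum(features))


module_401 = RelativeAbundanceModule(["Prevotella buccalis", "Slackia exigua"])
module_402 = PredictionModule(model=toy_model)

features = module_401.acquire({"Prevotella buccalis": 0.04})  # hypothetical sample
probability, label = module_402.predict(features)
print(features, label)  # [0.04, 0.0] benign pancreatic disease
```

Keeping the marker order fixed in module 401 ensures that the feature vector passed to module 402 matches the order used when the model was trained.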

实施例三Embodiment 3

本实施例提供了一种计算机可读存储介质，其上存储有计算机程序，该程序被处理器执行时实现如上述实施例一所述的基于口腔菌群的胰腺导管腺癌预测方法中的步骤。This embodiment provides a computer-readable storage medium having a computer program stored thereon. When the program is executed by a processor, the steps in the method for predicting pancreatic ductal adenocarcinoma based on oral flora as described in the first embodiment above are implemented.

实施例四Embodiment 4

本实施例提供了一种计算机设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现如上述实施例一所述的基于口腔菌群的胰腺导管腺癌预测方法中的步骤。This embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the steps in the method for predicting pancreatic ductal adenocarcinoma based on oral flora as described in the first embodiment above are implemented.

本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用硬件实施例、软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of hardware embodiments, software embodiments, or embodiments combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) containing computer-usable program codes.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for predicting pancreatic ductal adenocarcinoma based on oral flora, comprising:

obtaining relative abundance information of each biomarker in a biomarker set for potential diagnosis of pancreatic ductal adenocarcinoma in a subject to be tested;

using the relative abundance information of each biomarker in the biomarker set and a pre-trained multivariate statistical model to obtain a probability value of suffering from pancreatic ductal adenocarcinoma;

wherein the biomarker set for potential diagnosis of pancreatic ductal adenocarcinoma is obtained by performing microbiome difference analysis on saliva, duodenal fluid and pancreatic tissue samples of patients with pancreatic ductal adenocarcinoma and patients with benign pancreatic disease, and taking the intersection of the differential bacteria;

and wherein, in training the multivariate statistical model, each training sample consists of the relative abundance information of each biomarker in the biomarker set and its corresponding attribute label, the attribute labels including pancreatic ductal adenocarcinoma patient and benign pancreatic disease patient.

2. The method for predicting pancreatic ductal adenocarcinoma based on oral flora according to claim 1, wherein the probability value of suffering from pancreatic ductal adenocarcinoma is compared with a preset probability threshold; if the probability value is greater than the threshold, the corresponding subject is predicted to be a patient with pancreatic ductal adenocarcinoma; otherwise, the corresponding subject is predicted to be a patient with benign pancreatic disease.

3. The method for predicting pancreatic ductal adenocarcinoma based on oral flora according to claim 1, wherein the multivariate statistical model is a random forest (RF) model.

4. The method for predicting pancreatic ductal adenocarcinoma based on oral flora according to claim 1, wherein the relative abundance information of each biomarker in the biomarker set is obtained by a sequencing method.

5. The method for predicting pancreatic ductal adenocarcinoma based on oral flora according to claim 1, wherein the biomarker set consists of the following biomarkers:

Biomarker 1 is Bifidobacterium bifidum;

Biomarker 2 is Sutterella massiliensis;

Biomarker 3 is Herbaspirillum huttiense;

Biomarker 4 is Prevotella buccalis;

Biomarker 5 is Phocaeicola abscessus;

Biomarker 6 is Prevotella dentalis;

Biomarker 7 is Peptoanaerobacter stomatis;

Biomarker 8 is Schwartzia succinivorans;

Biomarker 9 is Anaeroglobus geminatus;

Biomarker 10 is Olsenella uli;

Biomarker 11 is Slackia exigua;

Biomarker 12 is Arachnia rubra;

Biomarker 13 is Leptotrichia goodfellowii;

Biomarker 14 is Propionibacterium acidifaciens;

Biomarker 15 is Fusobacterium mortiferum;

Biomarker 16 is Acidaminococcus fermentans;

Biomarker 17 is Loigolactobacillus coryniformis;

Biomarker 18 is Bacteroides caecigallinarum;

Biomarker 19 is Caldicoprobacter faecalis;

Biomarker 20 is Atopostipes suicloacalis;

Biomarker 21 is Akkermansia muciniphila;

Biomarker 22 is Phocaeicola vulgatus;

Biomarker 23 is Bacteroides acidifaciens;

Biomarker 24 is Lactiplantibacillus plantarum;

Biomarker 25 is Faecalibacterium prausnitzii.

6. A pancreatic ductal adenocarcinoma prediction system based on oral flora, comprising:

a relative abundance information acquisition module, configured to obtain relative abundance information of each biomarker in a biomarker set for potential diagnosis of pancreatic ductal adenocarcinoma in a subject to be tested;

a pancreatic ductal adenocarcinoma prediction module, configured to obtain a probability value of suffering from pancreatic ductal adenocarcinoma using the relative abundance information of each biomarker in the biomarker set and a pre-trained multivariate statistical model;

wherein the biomarker set for potential diagnosis of pancreatic ductal adenocarcinoma is obtained by performing microbiome difference analysis on saliva, duodenal fluid and pancreatic tissue samples of patients with pancreatic ductal adenocarcinoma and patients with benign pancreatic disease, and taking the intersection of the differential bacteria.

7. The pancreatic ductal adenocarcinoma prediction system based on oral flora according to claim 6, wherein, in the pancreatic ductal adenocarcinoma prediction module, the probability value of suffering from pancreatic ductal adenocarcinoma is compared with a preset probability threshold; if the probability value is greater than the threshold, the corresponding subject is predicted to be a patient with pancreatic ductal adenocarcinoma; otherwise, the corresponding subject is predicted to be a patient with benign pancreatic disease.

8. The pancreatic ductal adenocarcinoma prediction system based on oral flora according to claim 6, wherein the multivariate statistical model is a random forest (RF) model.

9. A computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the steps in the method for predicting pancreatic ductal adenocarcinoma based on oral flora as described in any one of claims 1 to 5 are implemented.

10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps in the method for predicting pancreatic ductal adenocarcinoma based on oral flora as described in any one of claims 1 to 5 are implemented.