MSI prediction model construction method based on immune-related genes
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a method for constructing an MSI prediction model for colon cancer based on immune-related genes.
Background
In recent years, immunotherapy has become an important treatment option for colon cancer. It aims to recognize, control and eliminate cancer by activating the patient's own immune system. Immune checkpoint inhibitors (ICIs), such as monoclonal antibodies against cytotoxic T-lymphocyte-associated protein 4 (CTLA-4) and against programmed cell death protein 1 and its ligand (PD-1/PD-L1), have brought new hope for the treatment of various tumors, including advanced melanoma, non-small cell lung cancer and bladder cancer. Colon cancer patients can also benefit from immunotherapy: in the United States, the FDA has approved the immune checkpoint monoclonal antibodies pembrolizumab and nivolumab (both anti-PD-1) and ipilimumab (anti-CTLA-4) as effective drugs for treating colon cancer patients.
Tumor immunotherapy is now one of the first-line treatment options, which makes biomarker selection particularly important. Microsatellite instability (MSI), one of the most actively studied biomarkers, refers to changes in the length of microsatellite sequences caused by insertion or deletion mutations during DNA replication. It is usually caused by defective mismatch repair and is closely associated with the development of malignant tumors.
The colon cancer guidelines issued by the NCCN in the United States recommend MSI testing for all patients with a history of colon cancer in order to guide clinical medication. Studies have shown that patients with advanced colon cancer and high microsatellite instability (MSI-H) are markedly more sensitive to ICIs than patients with microsatellite-stable (MSS) or low microsatellite instability (MSI-L) tumors; in these patients, targeted inhibition of PD-1/PD-L1 can promote the immune system to attack and kill tumor cells. MSI itself, however, neither treats nor diagnoses tumors directly. MSI is also closely related to the prognosis of colon cancer, that is, the prediction of the final outcome of the disease. Compared with MSS/MSI-L patients, MSI-H colon cancer patients have a significant survival advantage: although their clinical manifestations are poor, their overall survival and disease-free survival are markedly prolonged.
Immune-related genes therefore play a crucial role in the occurrence and development of colon cancer. The traditional methods for detecting MSI are mainly immunohistochemistry (IHC) and polymerase chain reaction (PCR). However, IHC and PCR testing must be carried out in large medical institutions, is costly and is complex to perform, so it is difficult to extend to every patient in clinical practice. As a result, a large number of patients who are potentially sensitive to immunotherapy cannot receive timely ICI treatment and lose the opportunity for clinical benefit.
Disclosure of Invention
The invention aims to overcome the shortcomings of the traditional MSI detection methods by providing a method for constructing an MSI prediction model based on immune-related genes. The method requires no additional laboratory IHC or PCR testing, and obtains differentially expressed immune-related genes from The Cancer Genome Atlas (TCGA) and the immunology database ImmPort.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
the MSI prediction model construction method based on immune-related genes comprises the following steps:
step S1: collecting, from The Cancer Genome Atlas (TCGA) database, a training set and a validation set for constructing the immune-related MSI prediction model irMSIs;
step S2: selecting immune-related genes from an immunology database, and screening differential genes from among them;
step S3: constructing the immune-related MSI prediction model irMSIs from the screened differential genes by a LASSO logistic regression algorithm;
step S4: validating prognostic risk using the immune-related MSI prediction model irMSIs.
The step of collecting, from the TCGA database, a training set and a validation set for constructing the immune-related MSI prediction model irMSIs comprises:
downloading four cancer cohorts from the TCGA database, the four cohorts comprising mRNA expression profiles and clinical information for colon cancer (COAD), rectal cancer (READ), gastric cancer (STAD) and esophageal cancer (ESCA);
using the colon cancer COAD cohort as the training set for screening differential genes and for building the immune-related MSI prediction model irMSIs, and using the other cohorts as the validation set for the immune-related MSI prediction model irMSIs.
The step of selecting immune-related genes from an immunology database and screening differential genes from among them comprises:
downloading N immune-related genes from the immunology database and selecting M matched genes for analysis, wherein N is greater than M; using the edgeR software package to screen, in the colon cancer COAD cohort, differential genes between the microsatellite instability-high (MSI-H) group and the microsatellite-stable (MSS) group, or between the MSI-H group and the microsatellite instability-low (MSI-L) group, the screening criteria being:
false discovery rate FDR <0.05
|log2(Fold Change)| ≥ 1
wherein FDR is the false discovery rate, that is, the P value adjusted for multiple testing; Fold Change represents the fold difference in the read-count expression of a given gene between the two groups;
thereby identifying m differential genes, m < M; the m differential genes include a up-regulated genes and b down-regulated genes, and m = a + b.
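The two screening criteria can be illustrated with a minimal Python sketch. It assumes that a per-gene log2 fold change and an adjusted P value have already been produced by a differential-expression tool such as edgeR; the `GeneResult` record, the function name and the demo values are hypothetical and are not part of the described method.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str        # gene symbol
    log2_fc: float   # log2 fold change (MSI-H versus comparison group)
    fdr: float       # P value adjusted for multiple testing


def screen_differential_genes(results, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Split genes passing FDR < cutoff and |log2FC| >= cutoff into up/down lists."""
    up, down = [], []
    for r in results:
        if r.fdr < fdr_cutoff and abs(r.log2_fc) >= lfc_cutoff:
            (up if r.log2_fc > 0 else down).append(r.gene)
    return up, down


# Made-up values, for illustration only.
demo = [
    GeneResult("GENE_A", 2.3, 0.001),    # passes, up-regulated
    GeneResult("GENE_B", -1.0, 0.04),    # passes, down-regulated
    GeneResult("GENE_C", 1.8, 0.20),     # fails the FDR criterion
    GeneResult("GENE_D", 0.4, 0.0001),   # fails the fold-change criterion
]
up_genes, down_genes = screen_differential_genes(demo)
print(len(up_genes) + len(down_genes), "differential genes")
```

Here the lengths of `up_genes` and `down_genes` play the roles of a and b, and their sum plays the role of m.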
The step of constructing the immune-related MSI prediction model irMSIs from the screened differential genes by a LASSO logistic regression algorithm comprises:
randomly dividing the colon cancer COAD cohort into a training set and a test set at a ratio of 7:3, identifying c robust genes by a random forest recursive feature elimination algorithm, selecting the top 5 most robust genes as input to the least absolute shrinkage and selection operator (LASSO), and calculating the score by the LASSO logistic regression algorithm, wherein c is greater than or equal to 5;
validating the immune-related MSI prediction model irMSIs in the test set of the colon cancer COAD cohort and in the rectal cancer READ, gastric cancer STAD and esophageal cancer ESCA cohorts; and evaluating the predictive performance of the immune-related MSI prediction model irMSIs by the area under the ROC curve (AUC).
In the above scheme, the top 5 most robust genes selected are TGFBR2, GNLY, ULBP2, SEMA5A and R3HDML, whose LASSO coefficients are -0.077, 0.084, 0.070, -0.064 and -0.055, respectively, so that the score of the LASSO logistic regression algorithm can be calculated as:
irMSIs = 0.683 - 0.077 × TGFBR2 expression level + 0.084 × GNLY expression level + 0.070 × ULBP2 expression level - 0.064 × SEMA5A expression level - 0.055 × R3HDML expression level.
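As a worked illustration of the score formula, the following Python sketch evaluates irMSIs for a single sample. The intercept and coefficients are those stated above; the function name and the dictionary-based input are hypothetical, and the unit of the expression values (presumably the normalized expression used elsewhere in this description) is an assumption.

```python
# Intercept and per-gene coefficients as stated in the formula above.
INTERCEPT = 0.683
COEFFICIENTS = {
    "TGFBR2": -0.077,
    "GNLY": 0.084,
    "ULBP2": 0.070,
    "SEMA5A": -0.064,
    "R3HDML": -0.055,
}


def irmsis_score(expression):
    """Return the irMSIs score for one sample.

    `expression` maps gene symbol -> expression level (assumed to be on the
    same scale as the data used to fit the model).
    """
    missing = [g for g in COEFFICIENTS if g not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return INTERCEPT + sum(c * expression[g] for g, c in COEFFICIENTS.items())


# Hypothetical sample with every gene at expression level 1.0.
sample = {g: 1.0 for g in COEFFICIENTS}
print(round(irmsis_score(sample), 3))
```

With all five expression levels set to zero, the score reduces to the intercept, 0.683.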
The step of validating prognostic risk using the immune-related MSI prediction model irMSIs comprises:
in the colon cancer COAD cohort, dividing patients into an irMSIs high group and an irMSIs low group according to the irMSIs cutoff value at which the Youden index of the ROC curve is highest;
dividing the MSS/MSI-L patients into an MSS/MSI-L high group and an MSS/MSI-L low group according to the median of the highest Youden index of the ROC curve for the immune-related MSI prediction model irMSIs;
based on the cutoff value of the highest Youden index of the ROC curve and the median of the highest Youden index of the ROC curve, dividing the patients into an irMSIs high group, an irMSIs medium group and an irMSIs low group, and comparing the prognosis among the three groups of patients.
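The cutoff selection and three-way grouping can be sketched in Python as follows. The Youden index is J = sensitivity + specificity - 1, and the cutoff is the score that maximizes J. How exactly the median-based second threshold is derived is not fully specified here, so the sketch simply takes it as a parameter; in the demo it is set to the median of the scores below the cutoff, which is an assumption. All names and demo values are hypothetical.

```python
import statistics


def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive (MSI-H, label 1) when score >= cutoff.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required")
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


def assign_group(score, lower, upper):
    """Three-way grouping with two thresholds (lower < upper)."""
    if score >= upper:
        return "high"
    if score >= lower:
        return "medium"
    return "low"


# Made-up scores and MSI-H labels, for illustration only.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
labels = [0, 0, 0, 0, 1, 1, 1]
cutoff, j = youden_cutoff(scores, labels)
# Assumption: second threshold = median score among samples below the cutoff.
lower = statistics.median(s for s in scores if s < cutoff)
groups = [assign_group(s, lower, cutoff) for s in scores]
print(cutoff, groups)
```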
Compared with the prior art, the invention has the beneficial effects that:
the invention provides an application of immune-related genes to the prediction of MSI status. By combining immune-related genes, it identifies a set of signature genes that can stably predict MSI in digestive tract tumors, particularly colon cancer, and that can also predict the prognostic risk of colon cancer well.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a volcano plot of differential genes selected according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the establishment and evaluation of the prediction model irMSIs in accordance with an embodiment of the present invention; FIG. 3(A) is a parameter diagram of the prediction model irMSIs established using the LASSO logistic regression algorithm; FIG. 3(B) is a coefficient diagram of the prediction model irMSIs established using the LASSO logistic regression algorithm; FIG. 3(C) is a schematic diagram of the evaluation of the prediction model irMSIs in the colon cancer COAD cohort by ROC curves of the training and validation sets; FIG. 3(D) is a schematic diagram of the evaluation of the prediction model irMSIs by ROC curves in the rectal cancer READ, gastric cancer STAD and esophageal cancer ESCA cohorts.
FIG. 4 is a schematic diagram of the survival analysis of OS and DSS between groups according to the embodiment of the present invention; wherein FIG. 4(A) is a schematic representation of OS and DSS survival for MSS/MSI-L in a colon cancer COAD cohort; FIG. 4(B) is a schematic representation of the OS and DSS survival for the MSI-H group; FIG. 4(C) is a graph showing the survival of OS between irMSIs high and low in the colon cancer COAD cohort; FIG. 4(D) is a graph showing the survival of DSS between irMSIs high and low in the colon cancer COAD cohort; FIG. 4(E) is a schematic representation of OS survival between the high group in MSS/MSI-L and the low group in MSS/MSI-L in the colon cancer COAD cohort; FIG. 4(F) is a graph showing the survival of DSS between the high group in MSS/MSI-L and the low group in MSS/MSI-L in the colon cancer COAD cohort; FIG. 4(G) is a graph showing the survival of OS between irMSIs high, irMSIs medium and irMSIs low in the colon cancer COAD cohort; FIG. 4(H) is a graph showing the survival of DSS between irMSIs high, irMSIs medium and irMSIs low in the colon cancer COAD cohort.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Also, in the description of the present invention, the terms "first", "second", and the like are used for distinguishing between descriptions and not necessarily for describing a relative importance or implying any actual relationship or order between such entities or operations.
Example:
The invention is realized by the following technical scheme. Referring to FIG. 1, the MSI prediction model construction method based on immune-related genes comprises the following steps:
Step S1: a training set and a validation set for constructing the immune-related MSI prediction model irMSIs were collected from The Cancer Genome Atlas database.
Four cancer cohorts, comprising mRNA expression profiles and clinical information for colon cancer COAD (n = 551), rectal cancer READ (n = 177), gastric cancer STAD (n = 407) and esophageal cancer ESCA (n = 173), were downloaded from The Cancer Genome Atlas (hereinafter TCGA). The colon cancer COAD cohort is used as the training set for screening differential genes and for building the immune-related MSI prediction model irMSIs, and the other cohorts are used as the validation set for the immune-related MSI prediction model irMSIs.
The fragments per kilobase of transcript per million mapped reads (FPKM) values in each cohort were converted to transcripts per million (TPM), and the expression data were normalized as log2(TPM + 1). After excluding duplicate, recurrent and normal tissue samples, as well as tissue samples lacking MSI status, a total of 1028 samples were included.
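The FPKM-to-TPM conversion and log transformation can be written as a short Python sketch. It uses the standard identity TPM_i = FPKM_i / sum(FPKM) × 10^6 within one sample, followed by log2(TPM + 1); the function name and the toy sample are hypothetical.

```python
import math


def fpkm_to_log2_tpm(fpkm):
    """Convert one sample's FPKM values to log2(TPM + 1).

    `fpkm` maps gene symbol -> FPKM for a single sample.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    # Rescale so that the values sum to one million (TPM), then log-transform.
    return {g: math.log2(v / total * 1e6 + 1.0) for g, v in fpkm.items()}


# Toy sample with made-up FPKM values.
toy = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 0.0}
normalized = fpkm_to_log2_tpm(toy)
print(normalized)
```

A gene with zero FPKM maps to exactly 0 after the transformation, which is the reason for adding 1 before taking the logarithm.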
Step S2: immune-related genes are selected from an immunological database, and differential genes are screened from the immune-related genes.
N immune-related genes were downloaded from the immunology database ImmPort (hereinafter ImmPort), and M matched genes were selected for analysis, where N is greater than M; the edgeR software package was used to screen, in the colon cancer COAD cohort, differential genes between the microsatellite instability-high (MSI-H) group and the microsatellite-stable (MSS) or microsatellite instability-low (MSI-L) group.
In this example, 2428 immune-related genes were downloaded and 1229 matched genes were selected for further analysis. The R software package edgeR was used to screen, in the colon cancer COAD cohort from step S1, the differential genes between the MSI-H group and the MSS group, or between the MSI-H group and the MSI-L group.
It should be noted that, hereinafter, the microsatellite instability-high group is referred to as MSI-H, the microsatellite-stable group as MSS, and the microsatellite instability-low group as MSI-L; MSS/MSI-L denotes the MSS group or the MSI-L group.
The screening was performed as follows: counts per million (CPM) were calculated from the raw sequencing read counts, the data were normalized by the TMM method, and a size factor was calculated for each sample. Differentially expressed genes between the MSI-H and MSS/MSI-L groups were compared using a likelihood ratio test, with the screening criteria FDR < 0.05 and |log2(Fold Change)| ≥ 1. Here FDR is the false discovery rate, that is, the P value adjusted for multiple testing (by the Benjamini-Hochberg method), and Fold Change represents the fold difference in the read-count expression of a gene between the two groups. In this way, 233 differential genes were identified, comprising 112 up-regulated genes and 121 down-regulated genes; see the volcano plot shown in FIG. 2.
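For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following Python sketch shows the standard step-up procedure on a list of raw P values. It is an illustration only, not the edgeR implementation; the function name and the demo P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw P values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
print(fdr)
```

Each adjusted value is the smallest FDR level at which the corresponding gene would still be called significant, which is what the FDR < 0.05 criterion is applied to.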
Step S3: the immune-related MSI prediction model irMSIs was constructed from the screened differential genes by a LASSO logistic regression algorithm.
The colon cancer COAD cohort was randomly divided into a training set and a test set at a ratio of 7:3. Using the 'caret' package, low-variance sparse variables and highly correlated variables were removed from the 233 identified differential genes, with the filtering coefficients all set to 0.8; the 'randomForest' package was then used to identify 65 robust genes by a random forest recursive feature elimination algorithm. The top 5 most robust genes, shown in Table 1, were selected as input to the least absolute shrinkage and selection operator (LASSO) algorithm, see FIG. 3(A) and FIG. 3(B), and the score of the LASSO logistic regression algorithm was calculated as:
irMSIs = 0.683 - 0.077 × TGFBR2 expression level + 0.084 × GNLY expression level + 0.070 × ULBP2 expression level - 0.064 × SEMA5A expression level - 0.055 × R3HDML expression level.
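The removal of highly correlated variables before recursive feature elimination can be illustrated with a simplified Python sketch. It assumes that the 0.8 coefficient acts as an absolute Pearson correlation cutoff, and it uses a simple greedy rule (keep a gene only if it is not too correlated with any gene already kept). This is not the algorithm of the 'caret' package, and all names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, cutoff=0.8):
    """Greedily keep features whose |r| with every kept feature is <= cutoff."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Made-up expression vectors across five samples.
demo_features = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly proportional to G1
    "G3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with G1
}
kept_genes = drop_correlated(demo_features)
print(kept_genes)
```

The greedy rule depends on the order in which genes are visited; a production pipeline would normally rely on the library routine actually used for the analysis.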
TABLE 1
The immune-related MSI prediction model irMSIs was validated in the test set of the colon cancer COAD cohort and in the rectal cancer READ, gastric cancer STAD and esophageal cancer ESCA cohorts, and its predictive performance was evaluated by the area under the ROC curve (AUC). The AUC was 0.974 (95% CI: 0.954-0.994) in the training set and 0.999 (95% CI: 0.985-1.000) in the validation set, indicating that the immune-related MSI prediction model irMSIs has strong predictive performance.
In addition, referring to FIG. 3(C) and FIG. 3(D), the immune-related MSI prediction model irMSIs was also used for prediction in the rectal cancer READ, gastric cancer STAD and esophageal cancer ESCA cohorts, with AUC values of 0.845 (95% CI: 0.800-0.899), 0.855 (95% CI: 0.608-1.000) and 0.824 (95% CI: 0.582-1.000), respectively.
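The AUC used for evaluation can be computed directly from scores and true MSI labels. The Python sketch below uses the rank-based definition: the AUC is the probability that a randomly chosen MSI-H sample scores higher than a randomly chosen MSS/MSI-L sample, with ties counted as one half. Confidence intervals are not computed, and the function name and demo values are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (label 1 = MSI-H)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up scores and labels, for illustration only.
demo_scores = [0.1, 0.4, 0.35, 0.8]
demo_labels = [0, 0, 1, 1]
auc = roc_auc(demo_scores, demo_labels)
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of a few hundred patients; a sorting-based version would be preferable for much larger data.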
Step S4: prognostic risk was validated using an immune-related MSI prediction model irMSIs.
In the colon cancer COAD cohort, when patients were divided into irMSIs high and low groups according to the irMSIs cutoff value at the highest Youden index of the ROC curve (0.325), the survival difference between the irMSIs high and low groups was not statistically significant, which corresponds to the result for the actual MSI status; see FIG. 4(A) to FIG. 4(D).
When the MSS/MSI-L patients were divided into an MSS/MSI-L high group and an MSS/MSI-L low group according to the median of the highest Youden index of the ROC curve for the immune-related MSI prediction model irMSIs, overall survival (OS) and disease-specific survival (DSS) within 5 years differed significantly. Survival in the MSS/MSI-L low group was significantly higher than in the MSS/MSI-L high group (OS: P = 0.0063; DSS: P = 0.0026; P indicates the significance of the difference between the two groups in the survival analysis); see FIG. 4(E) and FIG. 4(F).
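The statistical test behind the reported P values is not named here. As one common choice for comparing two survival curves, a two-group log-rank test is sketched below in Python; treating it as the test used is an assumption. Truncation of follow-up at 5 years is not implemented, and the function name and demo data are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value).

    `events_*` holds 1 for an observed event and 0 for a censored observation.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)               # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)         # events at time t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail probability, 1 df
    return chi2, p_value


# Made-up follow-up times (all events observed), for illustration only.
early = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
late = ([10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
chi2, p = logrank_test(early[0], early[1], late[0], late[1])
print(chi2, p)
```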
Therefore, the patients were divided into an irMSIs high group, an irMSIs medium group and an irMSIs low group based on the cutoff value of the highest Youden index of the ROC curve and the median of the highest Youden index of the ROC curve. Comparing the prognosis among the three groups showed that patients in the irMSIs low group had the best prognosis and patients in the irMSIs medium group had the worst prognosis (OS: P = 0.0130; DSS: P = 0.0055); see FIG. 4(G) and FIG. 4(H).
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.