CN112071363A - Gastric mucosa lesion protein molecule typing, lesion progression, gastric cancer-associated protein marker and method for predicting lesion progression risk - Google Patents
- Publication number
- CN112071363A (application number CN202010958039.XA)
- Authority
- CN
- China
- Prior art keywords
- protein
- lesion
- gastric
- typing
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems
- G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for calculating health indices; for individual health risk assessment
- G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients
- G01N30/72: Column chromatography; detectors specially adapted therefor; mass spectrometers
- G01N2030/8818: Integrated analysis systems specially adapted for column chromatography; analysis specially adapted for the sample; biological materials involving amino acids
- G01N2030/8831: Integrated analysis systems specially adapted for column chromatography; analysis specially adapted for the sample; biological materials involving peptides or proteins
Abstract
The invention relates to molecular typing based on the proteomics of gastric mucosal lesions, and to a method for analyzing the characteristics of the different proteomic molecular subtypes of gastric mucosal lesions and their association with lesion progression. By calculating the relationships between protein expression and the pathological state of gastric mucosal tissue, the proteomic subtype, and gastric mucosal lesion progression, a database of protein markers related to gastric cancer and gastric mucosal lesion progression is established, and a disease progression risk scoring system for gastric mucosal lesion samples is further established. By combining molecular epidemiological research with bioinformatics analysis and machine learning, the invention integrates micro- and macro-level etiological risk factors for gastric cancer and establishes a molecular typing framework for gastric mucosal lesions and a progression risk prediction model, laying a foundation for ultimately constructing a comprehensive and systematic gastric cancer prevention strategy.
Description
Technical Field
The invention relates to the field of clinical oncology, and in particular to proteomic molecular typing of gastric mucosal lesions, protein markers associated with lesion progression and gastric cancer, and a method for predicting the risk of lesion progression.
Background
Gastric cancer (GC) ranks fifth in incidence and third in mortality among tumors worldwide. China is among the countries with the highest gastric cancer incidence and mortality, accounting for nearly half of the world's gastric cancer cases and deaths, and the prevention and control of gastric cancer remains an important public health challenge. Existing evidence suggests that the development of gastric cancer, particularly intestinal-type gastric cancer, is a complex, multi-stage, dynamic process that passes through superficial gastritis (SG), chronic atrophic gastritis (CAG), intestinal metaplasia (IM) and dysplasia (DYS) before ultimately progressing to gastric cancer. Most patients are diagnosed at an advanced stage and have a poor prognosis. Apart from H. pylori infection, the etiology and risk factors of gastric cancer, especially those acting during the progression of severe gastric mucosal lesions to gastric cancer, remain unclear. Severe gastric mucosal lesions may reverse either naturally or after intervention, and only a small proportion of people eventually progress to gastric cancer. Identifying, at an early stage, the subgroup at high risk of gastric cancer among people with gastric mucosal lesions promotes early detection, early diagnosis and early treatment (secondary prevention), and is a key breakthrough point for reducing the disease burden of gastric cancer. Meanwhile, abnormal protein expression plays an important role in tumorigenesis. Research targeting key individual proteins and protein phenotypes is expected to find molecular markers related to the evolution of gastric mucosal lesions and to gastric carcinogenesis, providing a new route for further exploring the etiology of gastric cancer.
With the development of molecular biology and the emergence of various omics detection technologies, The Cancer Genome Atlas (TCGA) project and the Asian Cancer Research Group (ACRG) have divided gastric cancer into four subtypes based on gene expression data and analyzed the relationship between the subtypes and prognosis. However, genomic research based on gastric cancer patients focuses on molecular typing, therapeutic targets and prognosis; it cannot identify the subgroup of people at high risk of developing gastric cancer, and research on early diagnostic markers is therefore lacking.
Proteomics is the large-scale study of protein characteristics, including protein expression levels, post-translational modifications and protein-protein interactions, which gives an overall and comprehensive protein-level understanding of processes such as disease occurrence and cellular metabolism. Protein-level analysis provides an efficient real-time analytical model for biomolecular systems and yields information that is not readily available at the DNA and RNA levels. Some gastric cancer proteomics studies have been carried out, dividing gastric cancer into subtypes based on differences in protein expression between tumor and tumor-adjacent samples, analyzing the molecular characteristics of the subtypes and their relationship with prognosis, and searching for new therapeutic targets. In China, about 70% of patients are already at a locally advanced or late stage (advanced gastric cancer) when diagnosed; their prognosis is very poor, and even after radical surgery the recurrence rate is as high as 30% to 70%. Therefore, finding molecular markers that effectively predict gastric mucosal lesion progression and gastric cancer occurrence, and identifying at an early stage the subgroup at high risk of gastric cancer among people with gastric mucosal lesions, is a key breakthrough point for reducing gastric cancer incidence and mortality and lightening its disease burden.
Helicobacter pylori (H. pylori) infection is an established risk factor for gastric cancer. It acts mainly at the early stage of gastric mucosal lesions and can induce chronic inflammation of the gastric mucosa, thereby significantly increasing the risk of severe gastric mucosal lesions (IM/DYS) and gastric cancer, but the mechanism by which H. pylori causes gastric cancer remains unclear. Apart from H. pylori infection, the etiology and risk factors of gastric cancer, especially during the progression of severe gastric mucosal lesions to gastric cancer, also remain unclear.
At present, most gastric cancer proteomics studies focus on molecular typing and prognosis of diffuse-type gastric cancer; data on intestinal-type gastric cancer are lacking, and no molecular markers are available that effectively predict gastric mucosal lesion progression and gastric cancer occurrence. Previous proteomics studies lack a systematic, comprehensive exploration of the associations between proteins and the different grades of gastric mucosal lesions and their evolution. A literature search found only a few small-sample proteomics studies related to gastric mucosal lesions, with sample sizes between 12 and 229 and most including only a few dozen cases; these studies used mild gastric mucosal lesions only as a control group for examining proteomic changes in gastric cancer, and did not examine mild lesions or the evolution of gastric mucosal lesions in depth. In addition, screening for differential proteins generally lacked correction for multiple comparisons and validation in large independent samples. Some studies also used chip-based proteome detection, which is limited in protein detection depth compared with modern mass spectrometry.
Furthermore, on the one hand, persons skilled in the art may differ in their understanding; on the other hand, the inventors studied a large number of documents and patents when making the present invention, but space does not permit listing all of their details and contents here. This by no means implies that the present invention lacks these prior-art features; on the contrary, the present invention possesses all the features of the prior art, and the applicant reserves the right to add related prior art to the background.
Disclosure of Invention
In view of the defects of the prior art, the invention uses modern mass spectrometry with high sensitivity, high resolution and high precision to mine proteins of extremely low abundance in depth, screens for proteins differentially expressed among samples through quantitative proteomics, and establishes a database of gastric cancer-related molecular markers through multiple testing correction and multivariable analysis. By combining molecular epidemiological research with bioinformatics analysis and machine learning, micro- and macro-level etiological risk factors for gastric cancer are integrated, a molecular typing framework for gastric mucosal lesions and a progression risk prediction model are established, and a comprehensive and systematic gastric cancer prevention strategy is ultimately constructed. The specific scheme is as follows:
The invention provides a method for constructing a proteomic molecular typing classifier for gastric mucosal lesions, which comprises the following steps:
1) protein expression profile preprocessing and experimental filtering: obtaining protein expression profile data of gastric mucosal tissue samples, and then processing the data as follows:
a) screening high-confidence proteins;
preferably, a protein to be quantified contains at least one unique peptide with a Mascot ion score of 20 or more and at least two peptides with an ion score of 20 or more, or contains three peptides with an ion score of 20 or more;
b) normalizing the quantitative data based on the total sum;
preferably, the peak-area-based label-free quantification method iBAQ is used to calculate the iBAQ value of each high-confidence protein, and the iBAQ data are normalized by calculating the ratio of each identified protein's quantitative value to the sum of the quantitative values of all identified proteins, giving the iFOT value;
preferably, the iBAQ value of a protein = (sum of the peak areas of all peptides corresponding to the protein) / (number of theoretical peptides of the protein);
c) experimental filtering: removing samples in which the total number of identified proteins is lower than a first threshold, and retaining proteins that meet a minimum identification frequency, namely proteins identified in more than a second threshold fraction of the total number of samples;
preferably, the first threshold is 1500 and the second threshold is 3/4;
2) selection of typing features
ranking the proteins by coefficient of variation (CV) from high to low, a higher CV indicating a larger difference among samples, and selecting the top-ranked proteins up to a third threshold, together with their quantitative values, to form the typing feature protein matrix;
preferably, the third threshold is 100;
3) NMF typing
a) typing by non-negative matrix factorization (NMF) consensus clustering: selecting the optimal number of clusters K according to the silhouette coefficient (average silhouette width) and the cophenetic correlation coefficient (cophenetic coefficient), and applying NMF consensus clustering to the typing feature proteins obtained in step 2) to obtain NMF typing labels;
b) adjusting the third threshold according to the consensus clustering result to determine its optimal value: re-screening the feature proteins used for NMF clustering according to the heatmap and silhouette coefficient obtained from the NMF clustering, raising or lowering the CV cutoff until a satisfactory heatmap and silhouette coefficient are obtained;
4) constructing a molecular typing classifier: selecting the typing feature protein and the optimal clustering number K as classifier features, then selecting a classifier, and outputting molecular typing result data through data input and intermediate processes;
preferably, the typing characteristic protein is 100 characteristic proteins with the largest coefficient of variation, and the optimal NMF clustering number K is 4;
preferably, the classifier is a known machine learning classification algorithm or artificial intelligence model, such as a random forest or a support vector machine; the input data comprise the feature protein matrix and the NMF typing labels; the intermediate process comprises preprocessing the expression profile data of the gastric mucosal lesion samples and matching the classifier features.
The invention also provides a method for molecular typing of proteomics of gastric mucosa lesion, which applies the molecular typing classifier constructed by the method to perform molecular typing on lesion samples, and comprises the following steps:
1) preprocessing the sample to be tested according to step 1) of the method of claim 1 to obtain expression profile data;
2) classifier data input: inputting the expression profile data of the sample;
3) intermediate process: preprocessing the expression profile data of the gastric mucosal lesion sample and matching the classifier features;
4) output: the molecular typing result data.
In another aspect, the present invention provides a method for analyzing the association between the proteomic molecular subtypes of gastric mucosal lesions and the progression of gastric mucosal lesions. The disease progression outcome is determined from the clinical histopathological diagnosis; clinical variables are included in the association analysis between molecular subtype and disease progression; and a multi-factor unconditional Logistic regression of the effect of molecular subtype and clinical factors on gastric mucosal lesion progression is performed to analyze the association between each subtype and lesion progression, wherein the molecular subtypes are obtained according to the above method;
preferably, the disease progression outcome is determined as follows: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; comparing the follow-up endpoint with the baseline, a subject whose score increases is judged as disease progression, one whose score decreases as disease regression, and one whose score is unchanged as stable disease;
the clinical variables preferably included in the correlation analysis are: gender, age, helicobacter pylori infection status, baseline histopathological diagnosis;
the preferred clinical variables corrected for by the multifactor unconditional Logistic regression are: gender, age, helicobacter pylori infection status, baseline histopathological diagnosis.
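The outcome rule described above can be sketched as follows. This is a minimal illustration, assuming the ordered grades are mapped to the integers 0 to 4; the text specifies only the order of the grades, so the numeric scores and the function name `judge_outcome` are assumptions.

```python
# Ordered pathology grades from the text: SG < CAG < IM < DYS < GC.
# The integer scores 0..4 are an assumption; only the order is given.
PATHOLOGY_ORDER = ["SG", "CAG", "IM", "DYS", "GC"]
SCORE = {grade: rank for rank, grade in enumerate(PATHOLOGY_ORDER)}


def judge_outcome(baseline: str, endpoint: str) -> str:
    """Compare the follow-up endpoint diagnosis with the baseline diagnosis."""
    delta = SCORE[endpoint] - SCORE[baseline]
    if delta > 0:
        return "progression"
    if delta < 0:
        return "regression"
    return "stable"


print(judge_outcome("CAG", "IM"))  # score increased
print(judge_outcome("IM", "SG"))   # score decreased
print(judge_outcome("SG", "SG"))   # score unchanged
```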
In another aspect, the invention provides a method for screening protein markers associated with gastric cancer and with gastric mucosal lesion progression, which separately evaluates the relationship of protein expression to the histopathological state of gastric mucosal lesions, to proteome-based molecular typing, and to gastric mucosal lesion progression, thereby establishing a database of protein markers associated with gastric cancer and gastric mucosal lesion progression:
1) analyzing the relationship between protein expression and the histopathological state of gastric mucosal lesions, and screening proteins significantly associated with gastric cancer
a) The pathological state of the gastric mucosa is divided into SG, CAG, IM, DYS and GC according to the histopathological diagnosis, and the protein expression difference of severe gastric mucosal lesion (IM/DYS) and Gastric Cancer (GC) is explored by taking mild gastric mucosal lesion (SG/CAG) as a reference;
b) clinical variables included in the association analysis: gender, age;
c) association analysis method: multi-factor unconditional Logistic regression;
preferably, the clinical variables adjusted for in the multi-factor unconditional Logistic regression are gender and age; proteins with FDR q < 0.05 after multiple-testing correction are selected;
preferably, the multiple test correction method is a Benjamini-Hochberg method or a Bonferroni method;
2) analyzing the relationship between protein expression and proteome-based molecular typing, and screening proteins significantly associated with proteome-defined severe gastric mucosal lesions and gastric cancer
a) Performing proteomic molecular typing of gastric mucosal lesions by the method of claim 2, calculating the Spearman correlation coefficient between molecular subtype and histopathology, and analyzing the differences in protein expression of proteome-defined severe gastric mucosal lesions and gastric cancer (GC), taking proteome-defined mild gastric mucosal lesions as the reference;
b) clinical variables included in the association analysis: gender, age;
c) association analysis method: multi-factor unconditional Logistic regression;
preferably, the proteome-defined mild gastric mucosal lesions are molecular subtype S1, and the proteome-defined severe gastric mucosal lesions are molecular subtypes S2, S3 and S4;
preferably, the clinical variables adjusted for in the multi-factor unconditional Logistic regression are gender and age; proteins with FDR q < 0.05 after multiple-testing correction are selected;
preferably, the multiple test correction method is a Benjamini-Hochberg method or a Bonferroni method;
3) analyzing the relationship between protein expression and gastric mucosal lesion progression, and screening proteins significantly associated with gastric mucosal lesion progression
a) Determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; comparing the follow-up endpoint with the baseline, a subject whose score increases is judged as disease progression, one whose score decreases as disease regression, and one whose score is unchanged as stable disease;
b) clinical variables included in the association analysis: gender, age, helicobacter pylori infection status, baseline histopathological diagnosis;
c) association analysis method: multi-factor unconditional Logistic regression;
preferably, the clinical variables adjusted for in the multi-factor unconditional Logistic regression are gender, age, helicobacter pylori infection status and baseline histopathological diagnosis; proteins with P < 0.05 are selected.
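The multiple-testing correction named in steps 1) and 2) can be illustrated with a short sketch of the Benjamini-Hochberg procedure. The p-values below are made up for illustration and the function name `benjamini_hochberg` is an assumption; in practice a statistics package would normally be used.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical per-protein p-values, for illustration only.
example_p = [0.001, 0.02, 0.03, 0.8]
example_q = benjamini_hochberg(example_p)
selected = [i for i, q in enumerate(example_q) if q < 0.05]
print(example_q, selected)
```

Proteins whose q-value falls below 0.05 would be retained, which corresponds to the FDR q < 0.05 criterion in the text.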
The invention further provides a method for establishing a disease progression risk scoring system of a gastric mucosa lesion sample, which comprises the following steps:
1) using the protein markers screened by the above method, a risk score formula is established from the regression coefficients describing the relationship between protein expression and gastric mucosal lesion progression obtained in step 3) of that method, together with the protein expression levels; the protein markers are proteins significantly associated with both gastric cancer and gastric mucosal lesion progression;
preferably, the risk score formula is:
$$\text{risk score} = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n$$
where β1 to βn are the coefficients of proteins 1 to n in the regression equations obtained in step 3) of the method, and X1 to Xn are the expression levels of those proteins, i.e. their iFOT values;
preferably n is 4, the proteins being APOA1BP, PGC, HPX and DDT;
2) analyzing the relation between the risk score of gastric mucosa lesion and the progression of gastric mucosa lesion
Judging the disease progression outcome, bringing clinical variables into association analysis, and analyzing the relationship between the gastric mucosal lesion risk score and the gastric mucosal lesion progression based on multi-factor unconditional Logistic regression modeling.
The specific method for analyzing the relationship between the risk score of the gastric mucosa lesion and the progression of the gastric mucosa lesion in the step 2) comprises the following steps:
a) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; comparing the follow-up endpoint with the baseline, a subject whose score increases is judged as disease progression, one whose score decreases as disease regression, and one whose score is unchanged as stable disease;
b) clinical variables included in the association analysis: gender, age, helicobacter pylori infection status, baseline histopathological diagnosis;
c) association analysis method: multi-factor unconditional Logistic regression;
preferably the clinical variables corrected by the multi-factor unconditional Logistic regression are: gender, age, helicobacter pylori infection status, baseline histopathological diagnosis.
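The risk score is a weighted sum of protein expression values, which can be sketched as below. The four protein names come from the text; the coefficient and iFOT values are hypothetical placeholders chosen only to make the arithmetic easy to follow, and the function name `risk_score` is an assumption.

```python
def risk_score(coefficients, ifot_values):
    """Weighted sum of expression values: beta_1*X_1 + ... + beta_n*X_n."""
    missing = set(coefficients) - set(ifot_values)
    if missing:
        raise KeyError(f"no expression value for: {sorted(missing)}")
    return sum(beta * ifot_values[name] for name, beta in coefficients.items())


# Hypothetical coefficients and iFOT values; NOT taken from the patent.
example_betas = {"APOA1BP": 0.5, "PGC": -0.25, "HPX": 1.0, "DDT": 2.0}
example_sample = {"APOA1BP": 2.0, "PGC": 4.0, "HPX": 1.0, "DDT": 0.5}

score = risk_score(example_betas, example_sample)
print(score)  # 0.5*2.0 - 0.25*4.0 + 1.0*1.0 + 2.0*0.5
```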
The invention further provides a construction method of the gastric mucosa lesion progress risk classifier, which comprises the following steps:
1) screening independent variables;
preferably, the independent variables are gender, age, helicobacter pylori infection status, baseline histopathological diagnosis, risk score and proteomic molecular subtype; the risk score is calculated according to the above method;
2) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; comparing the follow-up endpoint with the baseline, a subject whose score increases is judged as disease progression, one whose score decreases as disease regression, and one whose score is unchanged as stable disease;
3) constructing a lesion progress risk classifier:
a) classifier selection: a machine learning classification algorithm or an artificial intelligence model;
preferably a random forest or a support vector machine;
b) data input: the independent variables retained after screening;
preferably gender, age, helicobacter pylori infection status, baseline histopathological diagnosis, risk score and proteomic molecular subtype;
c) data output: the disease progression status of each sample;
d) testing the accuracy of the algorithm;
preferably, the classifier is validated on an independent validation set and the area under the receiver operating characteristic (ROC) curve (AUC) is calculated.
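The AUC used for validation can be computed directly from its rank interpretation: the probability that a randomly chosen progressing sample receives a higher predicted score than a randomly chosen non-progressing sample. The sketch below assumes binary labels (1 for progression, 0 for non-progression); the labels, scores and the function name `roc_auc` are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical validation-set labels and classifier scores.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)
```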
In another aspect, the invention provides a method for predicting the risk of gastric mucosal lesion progression, which uses the classifier obtained by the method to predict:
a) preprocessing: preprocessing the sample protein expression profile data, performing molecular typing, and calculating the risk score;
preferably, the sample protein expression profile data are preprocessed using the protein expression profile preprocessing and experimental filtering method of claim 1; the sample is typed using the molecular typing method of claim 2; and the disease progression risk of the gastric mucosal lesion sample is scored using the method of claim 5;
b) input: the independent variables retained after screening;
preferably, the independent variables are gender, age, helicobacter pylori infection status, baseline histopathological diagnosis, risk score and proteomic molecular subtype;
c) intermediate process: matching the classifier features and predicting disease progression;
d) output: disease progression or non-progression.
The invention further provides a molecular typing classifier of the proteomics for gastric mucosal lesion constructed by the method.
The invention further provides a gastric mucosa lesion progress risk classifier constructed by the method.
The invention further provides use of the protein markers associated with gastric cancer and gastric mucosal lesion progression, obtained by the method of claim 4, in the preparation of a kit and/or chip for identifying and detecting populations at high risk of gastric cancer;
preferably, the protein markers are 217 gastric cancer related protein markers shown in table 4 and 54 gastric mucosal lesion progression related protein markers shown in table 6.
In a final aspect of the present invention, there is provided a kit and/or chip for identifying and detecting a high-risk population with gastric cancer, comprising the molecular typing classifier of claim 9, the risk scoring system of claim 5, the gastric mucosal lesion progression risk classifier of claim 10, and the gastric cancer and protein markers related to the progression of gastric mucosal lesion of claim 11.
The invention has the beneficial effects that:
1) The invention uses gastric mucosal tissue from clinical gastroscopic biopsy, which directly reflects the physiological and pathological state of the gastric mucosa.
2) The invention uses modern mass spectrometry with high sensitivity, high resolution and high precision to achieve deep coverage of the gastric mucosal proteome; it can further mine feature information from very low-abundance proteins and achieve higher detection efficiency.
3) The method achieves, for the first time, molecular typing of gastric mucosal lesions from proteome data, obtaining molecular features of the lesions that are difficult to obtain at the level of cell morphology; these features can be associated with the risk of lesion progression, so that the progression risk of different subtypes can be analyzed.
4) For the first time, comprehensive proteomic research is carried out on the different stages of gastric mucosal lesions, their evolution, and the occurrence of gastric cancer, exploring the expression patterns, signaling pathways and potential mechanisms of action of individual proteins and protein phenotypes during lesion evolution and gastric carcinogenesis.
5) The data set of protein markers associated with gastric cancer and with gastric mucosal lesion progression is validated through a prospective cohort study, providing important clues for exploring the etiology of gastric cancer.
6) For the first time, proteome data are integrated with other gastric cancer risk factors to establish a risk prediction model for gastric mucosal lesion progression, which can provide an important basis for gastric cancer prevention and control.
Drawings
FIG. 1 is a graph showing the results of protein detection on 169 samples of gastric mucosal tissue.
FIG. 2 shows the results of the pretreatment of protein expression profiles of 169 samples of gastric mucosal tissue.
FIG. 3 shows the results of molecular typing of gastric mucosal lesion samples by non-negative matrix factorization (NMF).
FIG. 4 is a forest plot of the multi-factor unconditional Logistic regression showing the relationship between molecular subtype and the risk of gastric mucosal lesion progression.
FIG. 5 is a forest plot of the multi-factor unconditional Logistic regression showing the relationship between risk score and the risk of gastric mucosal lesion progression.
FIG. 6 shows the receiver operating characteristic (ROC) curves of the risk prediction model and the random forest model.
Detailed Description
The invention is illustrated below with reference to specific embodiments. The experimental procedures in the following examples are conventional unless otherwise specified. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1 obtaining protein expression Profile data of clinical gastroscope biopsy gastric mucosal tissue samples
The experimental samples are 169 gastroscopic biopsy gastric mucosal tissue samples from Linqu, Shandong, a high-incidence area for gastric cancer, and from the Fifth Medical Center of the Chinese PLA General Hospital.
Protein extraction and analysis were carried out on the 169 clinical gastroscopic biopsy gastric mucosal tissue samples, yielding for each sample a proteome data set comprising the types and number of proteins identified and the quantitative value of each protein.
First, the lysis buffer formulation:
1% (w/v) DOC (deoxycholic acid), 10 mM TCEP, 40 mM 2-chloroacetamide (CAA), 100 mM Tris, pH 8.5.
Second, the operating steps:
1. Sample collection: taking a gastroscopic biopsy sample and storing it in a clean EP tube;
2. Sample lysis: adding 500 µL of lysis buffer and homogenizing the sample;
3. Heat denaturation: heating the homogenized sample at 95 ℃ for 5 min to denature it, then allowing it to cool naturally to room temperature;
4. Sonication: placing the sample tube in an ice-water mixture and sonicating for 5 min at 30% power, 3 s on and 3 s off;
5. Protein isolation: centrifuging the sample at 16,000 g for 10 min at 4 ℃ and keeping the supernatant;
6. Protein quantification: measuring the protein concentration with a Nanodrop and transferring 50 µg of protein to a new EP tube;
7. Protein washing: the sample was added to a 10 kDa ultrafiltration tube and centrifuged at 14,000 g for 2 min at room temperature so that the protein was fully retained on the membrane; the protein on the membrane was then washed with 50 mM ammonium bicarbonate (NH4HCO3), 300 µL per wash, centrifuging at 14,000 g for 20 min at room temperature; the wash was repeated 2 times;
8. Proteolysis: replacing the collection tube of the ultrafiltration tube with a new one, adding 100 µL of 50 mM ammonium bicarbonate solution and 5 µg of trypsin, sealing, and digesting upright at 37 ℃ for 4 hours; then adding a further 100 µL of 50 mM ammonium bicarbonate solution and 5 µg of trypsin, sealing, and digesting with rotation at 37 ℃ overnight;
9. Peptide collection: centrifuging the ultrafiltration tube at 14,000 g for 20 min and retaining the peptides in the collection tube; adding 200 µL of mass-spectrometry-grade water, centrifuging to wash once, and retaining the peptides from this second collection; combining the two collections and drying them under vacuum to obtain the product for mass spectrometric detection.
Thirdly, a mass spectrometry method:
A new-type Thermo liquid chromatography-tandem mass spectrometry (LC-MS/MS) system was used, with C18 packing in both the pre-column and the analytical column. Mobile phase A was H2O:FA = 99.8:0.2 and mobile phase B was ACN:FA = 99.8:0.2. The dried peptides were fully dissolved in loading buffer (H2O:CH3OH:FA = 94.8:5:0.2), centrifuged at 12,000 r/min for 10 min, and subjected to mass spectrometric detection. For the specific steps, refer to the section "third, mass spectrometric detection of gastric cancer protein sample" in the detailed description of CN 108445097A.
Fourth, searching and identifying protein
The raw files obtained were searched against the NCBI human RefSeq protein database (version 2013) using Proteome Discoverer (version 1.4, Thermo Scientific) with Mascot. The search parameters were set as follows: up to 2 missed cleavage sites were allowed; methionine oxidation, N-terminal acetylation and reductive alkylation of cysteine were set as dynamic modifications; peptides were required to contain at least 7 amino acids; the peptide score was required to be at least 10; and the peptide confidence was set to high. The mass tolerance was set to 20 ppm for precursor ions and 50 mmu for fragment ions. The false discovery rate (FDR) was evaluated using the integrated reversed database, and an FDR below 1% was considered acceptable. For the specific steps, refer to the section "mass spectrometry data analysis of gastric cancer protein sample" in the detailed description of CN 108445097A.
Fifthly, controlling the quality of the peptide fragment:
The screening conditions are as follows:
Condition 1: US ≥ 1 and S ≥ 2
Condition 2: S ≥ 3
where U stands for Unique and S stands for Strict;
a Unique peptide of a protein is a peptide that is not shared with any other protein;
Strict refers to the Mascot ion score: a peptide is strict when its ion score is greater than 20, which reflects the stringency of the MS/MS spectrum identification;
only proteins satisfying condition 1 or condition 2 above were selected for further analysis.
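The two screening conditions can be expressed as a small predicate. This sketch assumes that S counts all peptides of a protein with ion score of at least 20 (the threshold used in the high-confidence screening rule) and that US counts those strict peptides that are also unique; the peptide representation as `(is_unique, ion_score)` pairs and the function name are illustrative assumptions.

```python
def is_high_confidence(peptides, min_score=20):
    """peptides: list of (is_unique, ion_score) pairs for one protein."""
    strict = [(unique, score) for unique, score in peptides if score >= min_score]
    s = len(strict)                                   # S: strict peptides
    us = sum(1 for unique, _ in strict if unique)     # US: unique and strict
    condition_1 = us >= 1 and s >= 2
    condition_2 = s >= 3
    return condition_1 or condition_2


# Hypothetical peptide evidence for three proteins.
print(is_high_confidence([(True, 25), (False, 30)]))               # condition 1
print(is_high_confidence([(False, 21), (False, 30), (False, 44)])) # condition 2
print(is_high_confidence([(True, 15), (False, 30), (False, 22)]))  # neither
```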
Example 2.
A first part: proteomics molecular typing
The proteomics molecular typing of gastric mucosal lesion is carried out based on the data of the embodiment 1, and the specific steps are as follows:
1) protein expression profiling pretreatment and experimental filtration
a) High-confidence protein screening: a protein to be quantified is required to contain either at least one unique peptide with a Mascot ion score of at least 20 together with at least two peptides with an ion score of at least 20, or at least three peptides with an ion score of at least 20;
b) sum-based quantitative normalization: the peak-area-based label-free quantification method iBAQ is used, in which the iBAQ value of a protein is the sum of the peak areas of all peptides identified for that protein divided by the number of its theoretical peptides; the data are normalized by calculating the ratio of the iBAQ value of each identified protein to the sum of the iBAQ values of all identified proteins, giving the quantitative value (iFOT value);
c) experimental filtering: taking the gastric mucosal lesion samples of Example 1 as an example, samples in which the total number of identified proteins was lower than 1500 were removed; this step can be adjusted according to the actual numbers of proteins identified for different cancers and different sample types.
As shown in FIG. 1, 15158 gene products were detected in the 169 gastric mucosal tissue samples of Example 1, and 9119 high-confidence proteins were obtained by screening; FIG. 2 shows the results of the preprocessing of the protein expression profiles of the 169 gastric mucosal tissue samples.
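The iBAQ and iFOT definitions in step b) amount to two divisions, sketched below. The protein names, peak areas and theoretical peptide counts are made-up values for illustration, and the function names are assumptions.

```python
def ibaq(peak_areas, n_theoretical_peptides):
    """iBAQ: summed peptide peak areas divided by the theoretical peptide count."""
    return sum(peak_areas) / n_theoretical_peptides


def ifot(ibaq_by_protein):
    """iFOT: each protein's iBAQ as a fraction of the sample's total iBAQ."""
    total = sum(ibaq_by_protein.values())
    return {name: value / total for name, value in ibaq_by_protein.items()}


# Hypothetical sample with two proteins.
sample_ibaq = {
    "protein_A": ibaq([100.0, 300.0], 4),  # 400 / 4 = 100
    "protein_B": ibaq([600.0], 2),         # 600 / 2 = 300
}
sample_ifot = ifot(sample_ibaq)
print(sample_ifot)
```

Because each sample is divided by its own total, the iFOT values of one sample sum to 1, which is what makes samples with different overall signal comparable.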
3) Selection of typing feature proteins
a) Based on the high-confidence protein data detected in 111 gastric mucosal lesion samples, only proteins meeting the minimum identification frequency, namely proteins identified in more than 3/4 of the total number of samples, were retained; this step can be adjusted according to the proteome data of different cancers and different sample types;
b) the proteins were ranked by coefficient of variation (CV) from high to low, and the top proteins up to the third threshold, together with their quantitative values (iFOT values), were selected to form the typing feature protein matrix; this step can be adjusted according to the proteome data of different cancers and different sample types.
The optimal value of the third threshold in this embodiment was determined as follows: in the subsequent NMF typing step, the 500 proteins with the largest CV were first selected for NMF clustering, but the resulting heatmap and silhouette coefficient were unsatisfactory for every value of K; the feature proteins were then re-screened and the 200 proteins with the largest CV were selected for NMF clustering, which was still unsatisfactory; the number of highest-CV proteins was reduced step by step until a satisfactory heatmap and silhouette coefficient were obtained. The optimal third threshold for the gastric mucosal lesion samples in this embodiment was finally set to 100, giving the gastric mucosal lesion typing feature proteins shown in table 1.
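The frequency filter and CV ranking described in this step can be sketched as follows. The sketch assumes that an undetected protein is stored as 0.0 and that the CV is the sample standard deviation divided by the mean; neither detail is stated in the text, and the function name and toy matrix are illustrative.

```python
from statistics import mean, stdev


def select_typing_features(matrix, min_fraction=0.75, top_n=100):
    """matrix: {protein: [iFOT value per sample]}, 0.0 meaning not detected.

    Keeps proteins detected in MORE than `min_fraction` of samples, then
    returns the `top_n` proteins with the largest coefficient of variation.
    """
    cv = {}
    for protein, values in matrix.items():
        detected = sum(1 for v in values if v > 0)
        if detected / len(values) <= min_fraction:
            continue  # fails the minimum identification frequency
        cv[protein] = stdev(values) / mean(values)
    return sorted(cv, key=cv.get, reverse=True)[:top_n]


# Hypothetical 4-sample matrix.
toy_matrix = {
    "A": [1.0, 1.0, 1.0, 1.0],  # CV = 0
    "B": [1.0, 2.0, 3.0, 4.0],  # highest CV among retained proteins
    "C": [0.0, 0.0, 5.0, 9.0],  # detected in 2/4 samples: dropped
    "D": [1.0, 1.0, 1.0, 2.0],
    "E": [0.0, 1.0, 1.0, 1.0],  # detected in exactly 3/4: dropped
}
print(select_typing_features(toy_matrix, top_n=2))
```

Re-running the selection with a smaller `top_n` (500, then 200, then 100 in the embodiment) corresponds to the threshold adjustment described above.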
TABLE 1 gastric mucosal lesion typing profiles
4) NMF typing
Typing by non-negative matrix factorization (NMF) consensus clustering: the optimal cluster number K is selected according to the silhouette coefficient (average silhouette width) and the cophenetic correlation coefficient, and the lesion samples are typed by NMF consensus clustering. The specific process is as follows:
the R package CancerSubtypes was loaded and the typing feature protein matrix was analyzed with the ExecuteNMF function, trying values of the parameter clusterNum from 2 to 8 with nrun set to 50; the optimal cluster number K was selected from the heatmaps and silhouette coefficients of the typing results. The NMF typing result is shown in FIG. 3: based on the heatmaps for cluster numbers 2 to 8 and the line plot of the silhouette coefficient, the optimal cluster number K was selected as 4.
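To make the selection criterion concrete, the sketch below computes the average silhouette width for a given assignment of samples to clusters. It is a simplified illustration, assuming Euclidean distance between sample vectors; the consensus clustering described above derives its silhouette from the clustering result, and the NMF step itself and the cophenetic coefficient are not reproduced here. The function name and toy data are assumptions, and `math.dist` requires Python 3.8 or later.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Mean silhouette over all samples, using Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width defined as 0
            continue
        # a: mean distance to the sample's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical samples forming two well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(toy_points, [0, 0, 1, 1])
bad = average_silhouette_width(toy_points, [0, 1, 0, 1])
print(good, bad)
```

A clustering that matches the structure of the data gives a width close to 1, while a poor one gives a value near or below 0, which is why the width can be compared across candidate values of K.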
5) Molecular typing classifier construction
The typing feature proteins and the optimal cluster number K are selected as classifier features, and a suitable classifier is then selected; molecular typing result data are output through data input and an intermediate process, and the classifier is then used to type lesion samples. Specifically:
a) classifier feature selection: the typing feature proteins and the optimal clustering result obtained in steps 2) and 3) are used (taking the proteomic analysis of gastric mucosal lesions as an example, the 100 feature proteins with the largest coefficient of variation in table 1 and the NMF clustering result with K = 4 are selected);
b) classifier selection: random forest; other machine learning classification algorithms or artificial intelligence models, such as a support vector machine, can also be selected;
c) data input: the feature protein matrix obtained in step 2) and the NMF typing labels obtained in step 3);
d) intermediate process: preprocessing the expression profile data of the gastric mucosal lesion samples and matching the classifier features, using the randomForest function of the R package randomForest with the parameter na.action set to na.roughfix, the parameter prompt set to TRUE, and the parameter import set to TRUE;
e) data output: the molecular typing results of the gastric mucosal lesion samples. According to the results and subtype characteristics of the 111 gastric mucosal lesion cases, proteome-defined mild gastric mucosal lesions correspond to molecular subtype S1, and proteome-defined severe gastric mucosal lesions correspond to molecular subtypes S2, S3 and S4. The molecular typing results of 39 gastric mucosal lesion samples are shown in table 2.
Table 2 independent verification set 39 gastric mucosa pathological change sample molecular typing results
6) Molecular subtypes are associated with disease progression
In combination with the clinical histopathological diagnosis, the Spearman correlation coefficient between the obtained molecular subtypes and histopathology is calculated and the disease progression outcome is determined; clinical variables are included in the association analysis between molecular subtype and disease progression, and a multi-factor unconditional Logistic regression of the effect of molecular subtype and clinical factors on gastric mucosal lesion progression is performed to analyze the association between each subtype and lesion progression. Specifically:
a) calculating the Spearman correlation coefficient between molecular subtype and histopathology, in combination with the clinical histopathological diagnosis:
using the R function cor.test, the Spearman correlation coefficient was 0.39, with P = 0.016;
b) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG (mild, moderate, severe), IM (mild, moderate, severe), LGIN, HGIN and GC; comparing the follow-up endpoint with the baseline, a subject whose score increases is judged as disease progression, one whose score decreases as disease regression, and one whose score is unchanged as stable disease.
SG (superficial gastritis), CAG (chronic atrophic gastritis): mild gastric mucosal lesions; IM (intestinal metaplasia), LGIN (low grade intraepithelial neoplasia), HGIN (high grade intraepithelial neoplasia): severe gastric mucosal lesions; GC: gastric cancer.
The results of the disease progression outcome determination of 39 samples in table 2 are shown in table 3.
Table 3 validation set of disease progression determinations for 39 follow-up patients
Through follow-up study on 39 gastric mucosa lesion samples in the independent verification set, 19 patients with gastric mucosa lesions are judged to be in disease progression, and 20 patients with gastric mucosa lesions are judged to be in disease non-progression.
c) Clinical variables included in the association analysis: gender, age, helicobacter pylori infection status, baseline histopathological diagnosis;
d) association analysis method: multi-factor unconditional Logistic regression (adjusted for gender, age, helicobacter pylori infection status and baseline histopathological diagnosis), using the R function glm with the formula disease progression (yes/no) ~ gender + age + H. pylori infection status + baseline histopathological diagnosis + protein, and the parameter family set to binomial(link = "logit").
As shown in FIG. 4, with subtype S1 as the reference, subtype S2 was not significantly associated with disease progression, whereas subtype S4 was significantly associated with disease progression, with a risk of disease progression 19.29 times that of subtype S1.
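How a logistic regression coefficient translates into an odds ratio relative to a reference group can be illustrated with a small sketch. Note that this is not the R glm fit used in the text: glm uses iteratively reweighted least squares, whereas the sketch below uses plain gradient ascent on the log-likelihood, with a single binary predictor and no covariate adjustment. The data are hypothetical and the function name `fit_logistic` is an assumption.

```python
from math import exp


def fit_logistic(X, y, lr=1.0, n_iter=5000):
    """Unconditional logistic regression by gradient ascent.

    X: rows of predictor values (without intercept); y: 0/1 outcomes.
    Returns [intercept, beta_1, ...].
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]
    n, p = len(rows), len(rows[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for row, target in zip(rows, y):
            z = sum(b * x for b, x in zip(beta, row))
            predicted = 1.0 / (1.0 + exp(-z))
            for k in range(p):
                grad[k] += (target - predicted) * row[k]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Hypothetical data: 10 samples in a subtype of interest (8 progressed)
# and 10 samples in the reference subtype (3 progressed).
X = [[1]] * 10 + [[0]] * 10
y = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 7

intercept, beta_subtype = fit_logistic(X, y)
odds_ratio = exp(beta_subtype)
print(odds_ratio)  # matches the 2x2 table: (8/2) / (3/7)
```

With one binary predictor, the fitted odds ratio equals the odds ratio of the 2x2 table; adding gender, age and the other covariates as extra columns of `X` would give adjusted odds ratios, although unscaled continuous covariates such as age would need rescaling for this simple optimizer to converge.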
A second part: establishment of gastric cancer related molecular marker database
The relationships of protein expression to the histopathological state of gastric mucosal lesions, to proteome-based molecular typing, and to gastric mucosal lesion progression are explored separately, thereby establishing a gastric cancer related molecular marker database.
1) Relationship between protein expression and the histopathological state of gastric mucosal lesions
a) Taking the proteomic data of gastric mucosal lesion as an example, the gastric mucosal lesion state is classified into SG, CAG, IM, LGIN and GC according to histopathological diagnosis, and the protein expression difference of severe gastric mucosal lesion (IM/LGIN) and Gastric Cancer (GC) is explored by taking mild gastric mucosal lesion (SG/CAG) as reference;
b) clinical variables included in the association analysis: gender and age
c) Association analysis method: multi-factor unconditional Logistic regression (adjusted for gender and age), using the R function glm with the formula lesion group ~ gender + age + protein, and the parameter family set to binomial(link = "logit").
The results are shown in Table 4: 217 proteins were identified and verified to be significantly associated with gastric cancer, of which 104 were positively associated and 113 were negatively associated.
TABLE 4 Proteins significantly associated with gastric cancer (a)
(a) Unconditional logistic regression, adjusted for sex and age. Significance level: FDR < 0.05 in the discovery set; one-sided P < 0.05 in the validation set.
CAG, chronic atrophic gastritis; GC, gastric cancer; IM, intestinal metaplasia; LGIN, low grade intraepithelial neoplasia; OR, odds ratio; SG, superficial gastritis.
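The discovery-set threshold of FDR < 0.05 implies a multiple testing correction across proteins; the claims name the Benjamini-Hochberg method as one preferred option. The following is a minimal sketch of that adjustment, assuming one p-value per protein from the regression. The function name and the p-values are hypothetical and are not taken from Table 4.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-protein p-values (illustration only).
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in qvals]  # discovery-set criterion FDR < 0.05
```

In R, the same adjustment is available through p.adjust with method = "BH".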
2) Relationship between protein expression and proteomic molecular subtype
a) Taking the gastric mucosal lesion proteomic data as an example, and based on the proteomic molecular typing of gastric mucosal lesions from Part 1, differences in protein expression in subtypes S2-S4 and gastric cancer (GC) were examined with subtype S1 as the reference.
b) Clinical variables included in the association analysis: sex and age.
c) Association analysis method: multivariable unconditional logistic regression (adjusted for sex and age), fitted with the glm function in R. The formula is lesion group ~ sex + age + protein, and the family parameter is set to binomial(link = "logit").
The results are shown in Table 5: 37 proteins were identified and verified to be significantly associated with gastric cancer and proteome-defined severe gastric mucosal lesions (subtypes S2-S4), of which 27 were positively associated and 10 were negatively associated.
TABLE 5 Proteins significantly associated with gastric cancer and proteome-defined severe gastric mucosal lesions (a)
(a) Unconditional logistic regression, adjusted for sex and age. Significance level: FDR < 0.05 in the discovery set; one-sided P < 0.05 in the validation set.
GC, gastric cancer; OR, odds ratio.
3) Relationship between protein expression and gastric mucosal lesion progression
a) Taking the gastric mucosal lesion proteomic data as an example, the disease progression outcome was determined as follows: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score was assigned according to the ordinal grades SG, CAG, IM, DYS, and GC; comparing the follow-up endpoint with baseline, a higher score was judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.
b) Clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis.
c) Association analysis method: multivariable unconditional logistic regression (adjusted for sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis), fitted with the glm function in R. The formula is lesion group ~ sex + age + H. pylori infection status + baseline histopathological diagnosis + protein, and the family parameter is set to binomial(link = "logit").
The results are shown in Table 6: 54 proteins were identified and verified to be significantly associated with the progression of gastric mucosal lesions, of which 26 were positively associated and 28 were negatively associated.
TABLE 6 Proteins significantly associated with the progression of gastric mucosal lesions (a)
(a) Unconditional logistic regression, adjusted for sex, age, H. pylori infection, and baseline histopathology. Significance level: one-sided P < 0.05.
IM, intestinal metaplasia; OR, odds ratio.
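The outcome rule in step a) of this subsection (score each diagnosis on the ordinal scale, then compare the follow-up endpoint with baseline) can be sketched as follows. The integer scores 0 to 4 and the function name are assumptions; the text fixes only the order SG, CAG, IM, DYS, GC.

```python
# Ordinal grades in the order given in the text; the integer scores are an
# assumption, since only the ordering matters for the comparison.
GRADES = ["SG", "CAG", "IM", "DYS", "GC"]
SCORE = {grade: index for index, grade in enumerate(GRADES)}

def progression_outcome(baseline, endpoint):
    """Classify the change from the baseline to the follow-up diagnosis."""
    diff = SCORE[endpoint] - SCORE[baseline]
    if diff > 0:
        return "progression"
    if diff < 0:
        return "regression"
    return "stable"

# Hypothetical (baseline, endpoint) diagnoses for three patients.
outcomes = [progression_outcome(b, e)
            for b, e in [("CAG", "IM"), ("IM", "CAG"), ("SG", "SG")]]
```

A finer scale, such as one that separates mild, moderate, and severe grades, would only require extending the GRADES list in order.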
Part 3: Establishment of a disease progression risk scoring system for gastric mucosal lesion samples
1) Screening of molecular markers related to gastric mucosal lesion progression and gastric cancer: based on the results of Part 2, the proteins APOA1BP, PGC, DDT, and HPX, which were significantly associated with both gastric cancer and the progression of gastric mucosal lesions (i.e., the proteins appearing in both Table 4 and Table 6), were selected.
2) Establishing the gastric mucosal lesion progression risk scoring system: using the protein markers selected in step 1), a risk score was established from the regression coefficients for the relationship between protein expression and gastric mucosal lesion progression obtained in Part 2, step 3), together with the protein expression levels:
$\text{risk score} = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n$
where $\beta_n$ is the coefficient of protein n in the regression equation obtained in Part 2, step 3), and $X_n$ is the expression level of protein n, i.e., its iFOT value.
In this example, n is 4, and the 4 proteins are APOA1BP, PGC, HPX, and DDT; according to the above formula,
$\text{risk score} = -1.485 \times \text{APOA1BP} - 1.231 \times \text{PGC} + 1.868 \times \text{HPX} - 0.565 \times \text{DDT}$.
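The four-protein score is a weighted sum of iFOT values, so it can be computed directly, as in the short sketch below. The coefficient signs follow the formula as printed (negative for APOA1BP, PGC, and DDT; positive for HPX); the leading sign for APOA1BP is read from the printed text and should be treated as an assumption. The sample iFOT values are hypothetical.

```python
# Coefficients as read from the example formula. The negative sign on
# APOA1BP is an assumption based on the printed text.
COEFFICIENTS = {
    "APOA1BP": -1.485,
    "PGC": -1.231,
    "HPX": 1.868,
    "DDT": -0.565,
}

def risk_score(ifot):
    """Weighted sum of iFOT expression values for the marker proteins."""
    return sum(beta * ifot[protein] for protein, beta in COEFFICIENTS.items())

# Hypothetical iFOT values for one sample (illustration only).
sample = {"APOA1BP": 0.2, "PGC": 1.0, "HPX": 0.5, "DDT": 0.1}
score = risk_score(sample)
```

Keeping the coefficients in a single mapping makes it straightforward to extend the score to a different marker panel with n other than 4.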
3) Relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions
The disease progression outcome was determined, clinical variables (sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis) were included in the association analysis, and a multivariable unconditional logistic regression model (adjusted for the same variables) was used to analyze the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions. The specific steps are as follows:
a) Determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score was assigned according to the ordinal grades SG, CAG (mild, moderate, severe), IM (mild, moderate, severe), LGIN, HGIN, and GC; comparing the follow-up endpoint with baseline, a higher score was judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.
b) Clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis.
c) Association analysis method: multivariable unconditional logistic regression (adjusted for sex, age, H. pylori infection status, and baseline histopathological diagnosis), fitted with the glm function in R. The formula is disease progression (yes/no) ~ sex + age + H. pylori infection status + baseline histopathological diagnosis + risk score, and the family parameter is set to binomial(link = "logit").
The results are shown in fig. 5: the risk score was significantly positively associated with disease progression, with a 3.09-fold increase in the risk of disease progression for each one-standard-deviation increase in the risk score.
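An effect stated per standard deviation presupposes that the risk score is expressed in standard deviation units, or that a per-unit coefficient is rescaled. The sketch below shows both conversions under stated assumptions: the sample standard deviation (n - 1 denominator) is used, and the function names are hypothetical. It does not reproduce the reported estimate.

```python
import math
import statistics

def standardize(scores):
    """Convert raw risk scores to z-scores (mean 0, sample SD 1)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD; an assumption of this sketch
    return [(s - mean) / sd for s in scores]

def odds_ratio_per_sd(beta_per_unit, sd):
    """Rescale a per-unit log-odds coefficient to an odds ratio per SD."""
    return math.exp(beta_per_unit * sd)

# Hypothetical raw risk scores for four samples.
z = standardize([1.0, 2.0, 3.0, 4.0])
```

Entering the z-scored risk score into the logistic model gives a coefficient whose exponential is directly the odds ratio per standard deviation.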
Part 4: Construction of a gastric mucosal lesion progression risk prediction model
Suitable independent variables were selected, the disease progression outcome was determined, and a classifier was then constructed and applied, completing the construction of the gastric mucosal lesion progression risk prediction model. The specific process is as follows:
1) Independent variable selection: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score, and proteomic molecular subtype.
2) Determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score was assigned according to the ordinal grades SG, CAG (mild, moderate, severe), IM (mild, moderate, severe), LGIN, HGIN, and GC; comparing the follow-up endpoint with baseline, a higher score was judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.
3) Constructing the lesion progression risk classifier: the randomForest function of the R package randomForest was used with the selected variables, with the parameter na.action set to na.roughfix, proximity set to TRUE, and importance set to TRUE. Of the patients with follow-up, 21 were used as the training set and 18 as the validation set.
a) Classifier selection: random forest; other machine learning classification algorithms or artificial intelligence models, such as support vector machines, may also be selected.
b) Data input: the independent variables included after selection.
c) Testing the accuracy of the algorithm: independent validation on the validation set.
The results are shown in fig. 6. A total of 3 classifiers were tested: classifier 1 included sex, age, H. pylori infection, and baseline histopathological diagnosis, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.75; classifier 2 included sex, age, H. pylori infection, baseline histopathological diagnosis, and risk score, with an AUC of 0.84; classifier 3 included sex, age, H. pylori infection, baseline histopathological diagnosis, risk score, and molecular subtype, with an AUC of 0.95. For classifier 2 versus classifier 1, DeLong's test gave P = 0.50; for classifier 3 versus classifier 1, the difference was significant (DeLong's test, P = 0.04).
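The AUC used to compare the classifiers equals the probability that a randomly chosen progressing sample receives a higher predicted score than a randomly chosen non-progressing sample. A minimal sketch of that pairwise definition follows; the function name and the example labels and scores are hypothetical, and DeLong's test for comparing two AUCs is not implemented here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 0/1 values (1 = disease progression).
    scores: predicted scores, higher meaning more likely to progress.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical validation-set labels and predicted probabilities.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in sample size, which is acceptable for a validation set of the size described here.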
4) The classifier is applied as follows:
a) Preprocessing: expression profile data preprocessing, molecular typing, and risk scoring.
b) Input: the independent variables included after selection.
c) Intermediate process: classifier feature matching and disease progression prediction. The randomForest function of the R package randomForest is used with the selected variables, with the formula RF ~ sex + age + H. pylori infection + baseline histopathological diagnosis + risk score + molecular subtype, the parameter na.action set to na.roughfix, proximity set to TRUE, and importance set to TRUE. The predicted label is given by the predict function.
d) Output: disease progression/non-progression; the results are shown in Table 7.
TABLE 7 Prediction results for validation set samples
It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.
Claims (10)
1. A method for constructing a proteomic molecular typing classifier for gastric mucosal lesions, characterized by comprising the following steps:
1) protein expression profile preprocessing and experimental filtering: obtaining protein expression profile data of gastric mucosal tissue samples, and then processing the data as follows:
a) screening high-confidence proteins;
preferably, a quantified protein contains at least one unique peptide with a Mascot ion score of 20 or more and at least two peptides with an ion score of 20 or more, or three peptides with an ion score of 20 or more;
b) normalizing the quantitative data based on the sum;
preferably, the iBAQ value of each high-confidence protein is calculated by the peak-area-based label-free quantification iBAQ method, the iBAQ data are normalized, and the ratio of each identified protein's quantitative value to the sum of the quantitative values of all identified proteins is calculated to obtain the iFOT value;
preferably, the iBAQ value of a protein = the sum of the peak areas of all peptides corresponding to the protein / the number of theoretical peptides;
c) experimental filtering: rejecting samples whose total number of identified proteins is below a first threshold, and retaining proteins that meet a minimum identification frequency, namely proteins identified in more than a second-threshold fraction of all samples;
preferably, the first threshold is 1500 and the second threshold is 3/4;
2) selection of typing feature proteins:
ranking proteins by coefficient of variation (CV) from high to low, a higher CV indicating greater variation among samples, and selecting the top proteins up to a third threshold, together with their quantitative values, to form the typing feature protein matrix;
preferably, the third threshold is 100;
3) NMF typing
a) typing by non-negative matrix factorization (NMF) consensus clustering: selecting the optimal cluster number K according to the silhouette coefficient (average silhouette width) and the cophenetic correlation coefficient, and performing NMF consensus clustering on the typing feature proteins obtained in step 2) to obtain NMF typing labels;
b) adjusting the third threshold according to the consensus clustering result to determine its optimal value: according to the heatmap and silhouette coefficient obtained from NMF clustering, the feature proteins used for NMF clustering are re-selected, with the CV cutoff raised or lowered, until a satisfactory heatmap and silhouette coefficient are obtained;
4) constructing the molecular typing classifier: selecting the typing feature proteins and the optimal cluster number K as classifier features, then selecting a classifier, and outputting the molecular typing result data via data input and the intermediate process;
preferably, the typing feature proteins are the 100 feature proteins with the largest coefficients of variation, and the optimal NMF cluster number K is 4;
preferably, the classifier is a known machine learning classification algorithm or artificial intelligence model; the input data include the feature protein matrix and the NMF typing labels; the intermediate process includes preprocessing of the expression profile data of the gastric mucosal lesion samples and classifier feature matching.
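For illustration only, and not as part of the claim language, the sum-based normalization to iFOT values and the CV-based selection of typing feature proteins described in claim 1 can be sketched as follows. The function names, the use of the sample standard deviation in the CV, and all numeric values are assumptions.

```python
import statistics

def ifot(ibaq):
    """Normalize one sample's iBAQ values to fractions of the sample total."""
    total = sum(ibaq.values())
    return {protein: value / total for protein, value in ibaq.items()}

def top_cv_proteins(matrix, k):
    """Return the k proteins with the highest coefficient of variation.

    matrix: {protein: [quantitative value in each sample]}.
    The CV here is sample SD divided by mean (an assumption of this sketch).
    """
    def cv(values):
        mean = statistics.mean(values)
        return statistics.stdev(values) / mean if mean > 0 else 0.0
    return sorted(matrix, key=lambda p: cv(matrix[p]), reverse=True)[:k]

# Hypothetical values: one sample's iBAQ data and a small protein matrix.
fractions = ifot({"A": 2.0, "B": 6.0})
selected = top_cv_proteins(
    {"A": [1, 1, 1], "B": [1, 2, 3], "C": [1, 5, 9]}, 2)
```

In the preferred embodiment k would be 100, with the resulting matrix passed on to NMF consensus clustering.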
2. A method for proteomic molecular typing of gastric mucosal lesions, characterized in that a lesion sample is molecularly typed using the molecular typing classifier constructed by the method of claim 1, comprising the following steps:
1) preprocessing the sample to be tested according to step 1) of claim 1 to obtain expression profile data;
2) classifier data input: inputting the expression profile data of the sample;
3) intermediate process: preprocessing the expression profile data of the gastric mucosal lesion sample and matching classifier features;
4) outputting the molecular typing result data.
3. A method for analyzing the association of different proteomic molecular subtypes of gastric mucosal lesions with the progression of gastric mucosal lesions, characterized by determining the disease progression outcome in combination with clinical histopathological diagnosis, including clinical variables in the association analysis of molecular subtype and disease progression, and modeling, by multivariable unconditional logistic regression, how molecular subtype and clinical factors affect the progression of gastric mucosal lesions, so as to analyze the association of different subtypes with the progression of gastric mucosal lesions, the molecular subtypes being obtained according to the method of claim 2;
preferably, the disease progression outcome is determined as follows: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordinal grades SG, CAG, IM, DYS, and GC; comparing the follow-up endpoint with baseline, a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease;
preferably, the clinical variables included in the association analysis are: sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis;
preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are: sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis.
4. A method for screening protein markers related to gastric cancer and the progression of gastric mucosal lesions, characterized in that the relationships between protein expression and the histopathological state of gastric mucosal lesions, proteomic molecular subtype, and the progression of gastric mucosal lesions are analyzed separately, thereby establishing a database of protein markers related to gastric cancer and the progression of gastric mucosal lesions:
1) analyzing the relationship between protein expression and the histopathological state of gastric mucosal lesions, and screening proteins significantly associated with gastric cancer:
a) the gastric mucosal lesion status is classified as SG, CAG, IM, DYS, or GC according to histopathological diagnosis, and differences in protein expression in severe gastric mucosal lesions (IM/DYS) and gastric cancer (GC) are examined with mild gastric mucosal lesions (SG/CAG) as the reference;
b) clinical variables included in the association analysis: sex, age;
c) association analysis method: multivariable unconditional logistic regression;
preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are sex and age, and proteins with FDR q < 0.05 after multiple testing correction are selected;
preferably, the multiple testing correction method is the Benjamini-Hochberg method or the Bonferroni method;
2) analyzing the relationship between protein expression and proteomic molecular subtype, and screening proteins significantly associated with proteome-defined severe gastric mucosal lesions and gastric cancer:
a) performing proteomic molecular typing of gastric mucosal lesions by the method of claim 2, calculating the Spearman correlation coefficient between molecular subtype and histopathology, and analyzing differences in protein expression in proteome-defined severe gastric mucosal lesions and gastric cancer (GC) with proteome-defined mild gastric mucosal lesions as the reference;
b) clinical variables included in the association analysis: sex, age;
c) association analysis method: multivariable unconditional logistic regression;
preferably, the proteome-defined mild gastric mucosal lesions are molecular subtype S1, and the proteome-defined severe gastric mucosal lesions are molecular subtypes S2, S3, and S4;
preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are sex and age, and proteins with FDR q < 0.05 after multiple testing correction are selected;
preferably, the multiple testing correction method is the Benjamini-Hochberg method or the Bonferroni method;
3) analyzing the relationship between protein expression and the progression of gastric mucosal lesions, and screening proteins significantly associated with the progression of gastric mucosal lesions:
a) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordinal grades SG, CAG, IM, DYS, and GC; comparing the follow-up endpoint with baseline, a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease;
b) clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis;
c) association analysis method: multivariable unconditional logistic regression;
preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis, and proteins with P < 0.05 are selected.
5. A method for establishing a disease progression risk scoring system for gastric mucosal lesion samples, characterized by comprising the following steps:
1) establishing a risk score formula from the protein markers screened by the method of claim 4, the regression coefficients for the relationship between protein expression and the progression of gastric mucosal lesions obtained in step 3) of the method of claim 4, and the protein expression levels; the protein markers are proteins significantly associated with both gastric cancer and the progression of gastric mucosal lesions;
preferably, the risk score formula is:
$\text{risk score} = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n$
where $\beta_n$ is the coefficient of protein n in the regression equation obtained in step 3) of claim 4, and $X_n$ is the expression level of protein n, i.e., its iFOT value;
preferably, n is 4, and the proteins are APOA1BP, PGC, HPX, and DDT;
2) analyzing the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions:
determining the disease progression outcome, including clinical variables in the association analysis, and analyzing the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions by multivariable unconditional logistic regression modeling.
6. The method of claim 5, wherein step 2), analyzing the relationship between the risk score and the progression of gastric mucosal lesions, comprises the following steps:
a) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordinal grades SG, CAG, IM, DYS, and GC; comparing the follow-up endpoint with baseline, a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease;
b) clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis;
c) association analysis method: multivariable unconditional logistic regression;
preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are: sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis.
7. A method for constructing a gastric mucosal lesion progression risk classifier, characterized by comprising the following steps:
1) selecting independent variables;
preferably, the independent variables are sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score, and proteomic molecular subtype; the risk score is calculated according to the method of claim 5 or 6;
2) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordinal grades SG, CAG, IM, DYS, and GC; comparing the follow-up endpoint with baseline, a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease;
3) constructing the lesion progression risk classifier:
a) classifier selection: a machine learning classification algorithm or an artificial intelligence model;
preferably a random forest or a support vector machine;
b) data input: the independent variables included after selection;
preferably sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score, and proteomic molecular subtype;
c) data output: the disease progression status of each sample;
d) testing the accuracy of the algorithm;
preferably, the area under the receiver operating characteristic (ROC) curve (AUC) is calculated by validation on an independent validation set.
8. A method for predicting the risk of progression of gastric mucosal lesions, characterized in that the classifier obtained according to claim 7 is used for prediction:
a) preprocessing: preprocessing the sample protein expression profile data, performing molecular typing, and performing risk scoring;
preferably, the sample protein expression profile data are preprocessed by the protein expression profile preprocessing and experimental filtering method of claim 1; the sample is typed by the molecular typing method of claim 2; and the disease progression risk of the gastric mucosal lesion sample is scored by the method of claim 5;
b) input: the independent variables included after selection;
preferably, the independent variables are sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score, and proteomic molecular subtype;
c) intermediate process: matching classifier features and predicting disease progression;
d) output: disease progression/non-progression.
9. A proteomic molecular typing classifier for gastric mucosal lesions, constructed by the method of claim 1.
10. A gastric mucosal lesion progression risk classifier, constructed by the method of claim 7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010707820X | 2020-07-21 | ||
CN202010707820 | 2020-07-21 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112071363A true CN112071363A (en) | 2020-12-11 |
CN112071363B CN112071363B (en) | 2023-11-14 |
Family
ID=73696182
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010958039.XA Active CN112071363B (en) | 2020-07-21 | 2020-09-11 | Gastric mucosal lesion protein molecular typing, lesion progress and gastric cancer related protein marker and method for predicting lesion progress risk |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112071363B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113238052A (en) * | 2021-04-27 | 2021-08-10 | 中国人民解放军空军军医大学 | Application of MG7-Ag, hTERT and TFF2 expression analysis in intestinal epithelialization risk stratification and gastric cancer early warning |
CN113466471A (en) * | 2021-09-06 | 2021-10-01 | 中国农业科学院农产品加工研究所 | Gastric mucosa injury related biomarker, screening method and application |
CN114694748A (en) * | 2022-02-22 | 2022-07-01 | 中国人民解放军军事科学院军事医学研究院 | Proteomics molecular typing method based on prognosis information and reinforcement learning |
CN114822854A (en) * | 2022-06-27 | 2022-07-29 | 北京肿瘤医院(北京大学肿瘤医院) | Gastric mucosa lesion progress and gastric cancer related urine protein marker and application thereof |
CN115112778A (en) * | 2021-03-19 | 2022-09-27 | 复旦大学 | A method for identification of disease protein biomarkers |
CN116798520A (en) * | 2023-06-28 | 2023-09-22 | 复旦大学附属肿瘤医院 | Method for constructing squamous cell carcinoma tissue origin site protein marker prediction model |
CN118314962A (en) * | 2024-04-16 | 2024-07-09 | 北京科技大学 | Gastrointestinal cancer parting method and system |
CN118942552A (en) * | 2024-07-29 | 2024-11-12 | 首都医科大学附属北京朝阳医院 | A method and storage medium for detecting pathological types of pulmonary nodules |
CN119291102A (en) * | 2024-10-14 | 2025-01-10 | 上海市第一人民医院 | A chordoma molecular typing method and system and its application |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040153249A1 (en) * | 2002-08-06 | 2004-08-05 | The Johns Hopkins University | System, software and methods for biomarker identification |
US20080010024A1 (en) * | 2003-09-23 | 2008-01-10 | Prediction Sciences Llp | Cellular fibronectin as a diagnostic marker in cardiovascular disease and methods of use thereof |
CN108445097A (en) * | 2017-03-31 | 2018-08-24 | 北京谷海天目生物医学科技有限公司 | Molecular typing of diffuse type gastric cancer, protein marker for typing, screening method and application thereof |
CN108504732A (en) * | 2017-02-27 | 2018-09-07 | 复旦大学附属华山医院 | A method of establishing the risk forecast model of gastric cancer |
Non-Patent Citations (1)
Title |
---|
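Ji Shuhui et al.: "Study of dynamic changes in the EGFR signaling pathway in SGC7901 gastric cancer cells using IP-XL/MS technology", Military Medical Sciences, vol. 42, no. 4, pages 241 - 248 * |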
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115112778A (en) * | 2021-03-19 | 2022-09-27 | 复旦大学 | A method for identification of disease protein biomarkers |
CN115112778B (en) * | 2021-03-19 | 2023-08-04 | 复旦大学 | A method for identifying disease protein biomarkers |
CN113238052A (en) * | 2021-04-27 | 2021-08-10 | 中国人民解放军空军军医大学 | Application of MG7-Ag, hTERT and TFF2 expression analysis in intestinal epithelialization risk stratification and gastric cancer early warning |
CN113466471A (en) * | 2021-09-06 | 2021-10-01 | 中国农业科学院农产品加工研究所 | Gastric mucosa injury related biomarker, screening method and application |
CN114694748A (en) * | 2022-02-22 | 2022-07-01 | 中国人民解放军军事科学院军事医学研究院 | Proteomics molecular typing method based on prognosis information and reinforcement learning |
CN114694748B (en) * | 2022-02-22 | 2022-10-28 | 中国人民解放军军事科学院军事医学研究院 | Proteomics molecular typing method based on prognosis information and reinforcement learning |
CN114822854A (en) * | 2022-06-27 | 2022-07-29 | 北京肿瘤医院(北京大学肿瘤医院) | Gastric mucosa lesion progress and gastric cancer related urine protein marker and application thereof |
CN116798520A (en) * | 2023-06-28 | 2023-09-22 | 复旦大学附属肿瘤医院 | Method for constructing squamous cell carcinoma tissue origin site protein marker prediction model |
CN116798520B (en) * | 2023-06-28 | 2024-08-23 | 复旦大学附属肿瘤医院 | Method for constructing squamous cell carcinoma tissue origin site protein marker prediction model |
CN118314962A (en) * | 2024-04-16 | 2024-07-09 | 北京科技大学 | Gastrointestinal cancer parting method and system |
CN118942552A (en) * | 2024-07-29 | 2024-11-12 | 首都医科大学附属北京朝阳医院 | A method and storage medium for detecting pathological types of pulmonary nodules |
CN119291102A (en) * | 2024-10-14 | 2025-01-10 | 上海市第一人民医院 | A chordoma molecular typing method and system and its application |
Also Published As
Publication number | Publication date |
---|---|
CN112071363B (en) | 2023-11-14 |
Carvalho et al. | Normalization methods in mass spectrometry-based analytical proteomics: a case study based on renal cell carcinoma datasets | |
CN115985397A (en) | High-throughput identification method of lncRNA encoding peptide and application thereof | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||