CN112071363A

CN112071363A - Gastric mucosa lesion protein molecule typing, lesion progression, gastric cancer-associated protein marker and method for predicting lesion progression risk

Info

Publication number: CN112071363A
Application number: CN202010958039.XA
Authority: CN
Inventors: 秦钧; 李雪; 郑乃仁; 汪宜; 吴红星
Original assignee: Beijing Guhai Tianmu Biomedical Technology Co ltd
Current assignee: Beijing Guhai Tianmu Biomedical Technology Co ltd
Priority date: 2020-07-21
Filing date: 2020-09-11
Publication date: 2020-12-11
Anticipated expiration: 2040-09-11
Also published as: CN112071363B

Abstract

The invention relates to molecular typing of gastric mucosal lesions based on proteomics, and to a method for analyzing the characteristics of the different proteomic molecular subtypes and their association with the progression of gastric mucosal lesions. By calculating the relationships between protein expression and the histopathological state of gastric mucosal tissue, the proteomic molecular subtype, and the progression of gastric mucosal lesions, a database of protein markers related to gastric cancer and to the progression of gastric mucosal lesions is established, and a disease progression risk scoring system for gastric mucosal lesion samples is further established. The invention combines molecular epidemiological research with bioinformatics analysis and machine learning, integrates micro-level and macro-level etiological risk factors for gastric cancer, and establishes a molecular typing framework for gastric mucosal lesions and a progression risk prediction model, laying a foundation for ultimately constructing a comprehensive and systematic gastric cancer prevention strategy.

Description

Gastric mucosa lesion protein molecule typing, lesion progression, gastric cancer-associated protein marker and method for predicting lesion progression risk

Technical Field

The invention relates to the field of clinical oncology, in particular to proteomic molecular typing of gastric mucosal lesions, protein markers related to lesion progression and gastric cancer, and a method for predicting the risk of lesion progression.

Background

Gastric cancer (GC) ranks fifth in incidence and third in mortality among tumors worldwide. China is one of the countries with the highest gastric cancer incidence and mortality, accounting for nearly half of global gastric cancer cases and deaths, and the prevention and control of gastric cancer remains an important public health challenge. Existing evidence suggests that the development of gastric cancer, particularly intestinal-type gastric cancer, is a complex, multi-stage, dynamic process that passes through superficial gastritis (SG), chronic atrophic gastritis (CAG), intestinal metaplasia (IM) and dysplasia (DYS) before ultimately progressing to gastric cancer. Most patients are diagnosed at an advanced stage and have a poor prognosis. Apart from H. pylori infection, the etiology and risk factors of gastric cancer, especially those involved in the progression of severe gastric mucosal lesions to gastric cancer, are not clear. Severe gastric mucosal lesions can reverse either naturally or after intervention, and only a small proportion of people eventually progress to gastric cancer. Identifying, at an early stage, the subgroup at high risk of gastric cancer among people with gastric mucosal lesions promotes early detection, early diagnosis and early treatment (secondary prevention) of gastric cancer, and is a key breakthrough for reducing the disease burden of gastric cancer. Meanwhile, abnormal protein expression plays an important role in tumorigenesis. Research on key individual proteins and protein phenotypes is expected to identify molecular markers related to the evolution of gastric mucosal lesions and to gastric carcinogenesis, providing a new avenue for further exploring the etiology of gastric cancer.

With the development of molecular biology and the emergence of various omics detection technologies, The Cancer Genome Atlas (TCGA) and the Asian Cancer Research Group have each divided gastric cancer into four subtypes based on gene expression data and analyzed the relationship between these subtypes and prognosis. However, genomic research based on gastric cancer patients focuses on molecular typing, therapeutic targets and prognosis; it cannot identify the subgroup of people at risk of developing gastric cancer, and research on early diagnostic markers is lacking.

Proteomics refers to the large-scale study of protein characteristics, including protein expression levels, post-translational modifications and protein-protein interactions, in order to obtain a comprehensive, protein-level understanding of processes such as disease occurrence and cellular metabolism. Protein-level analysis not only provides the most efficient real-time analytical model for biomolecular systems, but also yields information that is not readily available at the DNA and RNA levels. Some gastric cancer proteomics studies have already been carried out: gastric cancer has been divided into subtypes based on differences in protein expression between tumor and tumor-adjacent samples, the molecular characteristics of the subtypes and their relationship with prognosis have been analyzed, and new therapeutic targets have been sought. In China, about 70% of patients are already at a locally advanced or late stage (advanced gastric cancer) when diagnosed; the prognosis is very poor, and even after radical surgery the recurrence rate is as high as 30% to 70%. Therefore, finding molecular markers that effectively predict the progression of gastric mucosal lesions and the occurrence of gastric cancer, and identifying at an early stage the subgroup at high risk of gastric cancer among people with gastric mucosal lesions, is a key breakthrough for reducing the incidence and mortality of gastric cancer and lightening its disease burden.

Helicobacter pylori (H. pylori) infection is an established risk factor for gastric cancer. It acts mainly at the early stage of gastric mucosal lesions and can induce chronic inflammation of the gastric mucosa, thereby significantly increasing the risk of severe gastric mucosal lesions (IM/DYS) and gastric cancer, but the mechanism by which H. pylori causes gastric cancer is still unclear. Apart from H. pylori infection, the etiology and risk factors of gastric cancer, especially in the progression of severe gastric mucosal lesions to gastric cancer, remain unclear.

At present, most gastric cancer proteomics studies focus on the molecular typing and prognosis of diffuse-type gastric cancer; data on intestinal-type gastric cancer are lacking, and no molecular markers are available that effectively predict the progression of gastric mucosal lesions and the occurrence of gastric cancer. Previous proteomics studies lack a systematic and comprehensive exploration of the associations between proteins and the different grades of gastric mucosal lesions and their evolution. A literature search found only a few small-sample proteomics studies involving gastric mucosal lesions, with sample sizes between 12 and 229 and most including only a few dozen cases; these studies used mild gastric mucosal lesions only as a control group to discuss proteomic changes in gastric cancer, without examining mild gastric mucosal lesions or the evolution of gastric mucosal lesions in depth. In addition, the screening of differential proteins generally lacked correction for multiple comparisons and validation in large independent samples. Moreover, some studies used proteome detection based on specific chips, which is limited in protein detection depth compared with modern mass spectrometry.

Furthermore, on the one hand, those skilled in the art may differ in their understanding; on the other hand, the inventors studied a large number of documents and patents when making the present invention, but space does not permit all of their details and contents to be listed here. This does not mean that the present invention lacks these prior-art features; on the contrary, the present invention has all the features of the prior art, and the applicant reserves the right to add related prior art to the background.

Disclosure of Invention

To address the shortcomings of the prior art, the invention uses modern mass spectrometry with high sensitivity, high resolution and high precision to mine extremely low-abundance proteins in depth, screens for proteins differentially expressed among samples through quantitative proteomics, and establishes a database of gastric cancer related molecular markers through multiple testing correction and multivariable analysis. By combining molecular epidemiological research with bioinformatics analysis and machine learning, micro-level and macro-level etiological risk factors for gastric cancer are integrated, a molecular typing framework for gastric mucosal lesions and a progression risk prediction model are established, and ultimately a comprehensive and systematic gastric cancer prevention strategy is constructed. The specific scheme is as follows:

The invention provides a method for constructing a proteomic molecular typing classifier for gastric mucosal lesions, which comprises the following steps:

1) protein expression profile pretreatment and experimental filtration: obtaining protein expression profile data of a gastric mucosa tissue sample, and then carrying out the following treatment:

a) screening high-confidence proteins;

preferably, a protein to be quantified is required to contain at least one unique peptide with a Mascot ion score of 20 or more together with at least two peptides with an ion score of 20 or more, or to contain three peptides with an ion score of 20 or more;

b) sum-based normalization of the quantitative data;

preferably, the peak-area-based label-free quantification method iBAQ is used to calculate the iBAQ value of each high-confidence protein, and the iBAQ data are normalized by calculating the ratio of each identified protein's iBAQ value to the sum of the iBAQ values of all identified proteins, giving the iFOT value;

preferably, the iBAQ value of a protein = the sum of the peak areas of all peptides of the protein / the number of theoretical peptides;

c) experimental filtering: removing samples whose total number of identified proteins is lower than a first threshold, and screening proteins by minimum identification frequency, i.e. retaining proteins identified in more than a second threshold proportion of the total number of samples;

preferably the first threshold is 1500 and the second threshold is 3/4;

2) selection of typing profiles

Ranking the proteins by coefficient of variation (CV) from high to low, a higher CV indicating a larger difference among samples, and selecting the top-ranked proteins, up to a third threshold in number, together with their quantitative values to form the typing feature protein matrix;

preferably the third threshold is 100;

3) NMF typing

a) typing by non-negative matrix factorization (NMF) consensus clustering: selecting the optimal number of clusters K according to the silhouette coefficient (average silhouette width) and the cophenetic correlation coefficient, and performing NMF consensus clustering on the typing feature proteins obtained in step 2) to obtain NMF typing labels;

b) adjusting the third threshold according to the consensus clustering result to determine its optimal value: according to the heatmap and silhouette coefficient obtained from NMF clustering, re-screening the feature proteins used for NMF clustering by raising or lowering the CV cutoff until a satisfactory heatmap and silhouette coefficient are obtained;

4) constructing a molecular typing classifier: selecting the typing feature proteins and the optimal number of clusters K as classifier features, then selecting a classifier, which takes the input data and, through an intermediate process, outputs the molecular typing result data;

preferably, the typing feature proteins are the 100 proteins with the largest coefficient of variation, and the optimal number of NMF clusters K is 4;

preferably, the classifier is a known machine learning classification algorithm or artificial intelligence model, such as a random forest or a support vector machine; the input data include the feature protein matrix and the NMF typing labels; the intermediate process comprises preprocessing the expression profile data of the gastric mucosal lesion samples and matching the classifier features.
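As a rough illustration of steps 2) and 3) above, the following pure-Python sketch ranks proteins by coefficient of variation and then factorizes the selected matrix with NMF. It is a minimal sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the population standard deviation is assumed for the CV, and a single NMF run with Lee-Seung multiplicative updates stands in for the consensus clustering over repeated runs (and the silhouette/cophenetic selection of K) that the text describes.

```python
import random
import statistics


def top_cv_proteins(matrix, n_top=100):
    """Rank proteins by coefficient of variation (SD / mean), keep the top n_top.

    matrix maps protein name -> list of iFOT values, one per sample.
    Assumption: population standard deviation; the text does not specify.
    """
    cv = {}
    for protein, values in matrix.items():
        mean = statistics.fmean(values)
        if mean > 0:
            cv[protein] = statistics.pstdev(values) / mean
    return sorted(cv, key=cv.get, reverse=True)[:n_top]


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor V (proteins x samples) into W (proteins x k) and H (k x samples)
    using multiplicative updates for the squared Frobenius objective."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        WtV = [[sum(W[i][a] * V[i][j] for i in range(n)) for j in range(m)] for a in range(k)]
        WtW = [[sum(W[i][a] * W[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
        H = [[H[a][j] * WtV[a][j] / (sum(WtW[a][b] * H[b][j] for b in range(k)) + eps)
              for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = [[sum(V[i][j] * H[a][j] for j in range(m)) for a in range(k)] for i in range(n)]
        HHt = [[sum(H[a][j] * H[b][j] for j in range(m)) for b in range(k)] for a in range(k)]
        W = [[W[i][a] * VHt[i][a] / (sum(W[i][b] * HHt[b][a] for b in range(k)) + eps)
              for a in range(k)] for i in range(n)]
    return W, H


def reconstruction_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    k = len(H)
    return sum((V[i][j] - sum(W[i][a] * H[a][j] for a in range(k))) ** 2
               for i in range(len(V)) for j in range(len(V[0])))


def subtype_labels(H):
    """Assign each sample (column of H) to the factor with the largest weight."""
    k, m = len(H), len(H[0])
    return [max(range(k), key=lambda a: H[a][j]) for j in range(m)]
```

In use, the rows selected by `top_cv_proteins` would form the matrix passed to `nmf` (with K = 4 and 100 proteins in the preferred embodiment), and the resulting labels would serve as training labels for the random forest or support vector machine mentioned in step 4).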

The invention also provides a method for proteomic molecular typing of gastric mucosal lesions, which applies the molecular typing classifier constructed by the above method to type lesion samples, and comprises the following steps:

1) preprocessing the sample to be tested according to step 1) of the method of claim 1 to obtain expression profile data;

2) classifier data input: inputting the expression profile data of the sample;

3) intermediate process: preprocessing the expression profile data of the gastric mucosal lesion sample and matching the classifier features;

4) outputting the molecular typing result data.

In another aspect, the present invention provides a method for analyzing the association between the different proteomic molecular subtypes of gastric mucosal lesions and the progression of gastric mucosal lesions. The disease progression outcome is determined from the clinical histopathological diagnosis; clinical variables are included in the association analysis between molecular subtype and disease progression; and multivariable unconditional logistic regression is used to model how molecular subtype and clinical factors influence the progression of gastric mucosal lesions, so as to analyze the association between the different subtypes and lesion progression, wherein the molecular subtypes are obtained according to the method described above;

preferably, the disease progression outcome is determined as follows: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; the follow-up endpoint is compared with the baseline, and a subject whose score has risen is judged as disease progression, a subject whose score has fallen as disease reversion, and a subject whose score is unchanged as stable disease;

the clinical variables preferably included in the association analysis are: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis;

the clinical variables preferably adjusted for in the multivariable unconditional logistic regression are: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis.
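The outcome rule described above (score the baseline and follow-up diagnoses on the ordered scale SG, CAG, IM, DYS, GC and compare them) can be sketched as follows. The numeric scores 1 to 5 are an assumption made for illustration; the text fixes only the order of the grades, and the function name is hypothetical.

```python
# Ordinal pathology scores; only the ordering comes from the text,
# the values 1..5 are an illustrative assumption.
STAGE_SCORE = {"SG": 1, "CAG": 2, "IM": 3, "DYS": 4, "GC": 5}


def progression_outcome(baseline, endpoint):
    """Compare the follow-up diagnosis with the baseline diagnosis."""
    delta = STAGE_SCORE[endpoint] - STAGE_SCORE[baseline]
    if delta > 0:
        return "progression"
    if delta < 0:
        return "reversion"
    return "stable"
```

For example, a subject diagnosed with CAG at baseline and IM at the follow-up endpoint would be classified as disease progression.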

In another aspect, the invention provides a method for screening protein markers related to gastric cancer and to the progression of gastric mucosal lesions, which separately calculates the relationships between protein expression and the histopathological state of the gastric mucosa, the proteomic molecular subtype, and the progression of gastric mucosal lesions, thereby establishing a database of protein markers related to gastric cancer and to the progression of gastric mucosal lesions:

1) analyzing the relationship between protein expression and the histopathological state of the gastric mucosa, and screening proteins significantly associated with gastric cancer

a) the pathological state of the gastric mucosa is divided into SG, CAG, IM, DYS and GC according to the histopathological diagnosis, and differences in protein expression in severe gastric mucosal lesions (IM/DYS) and gastric cancer (GC) are explored with mild gastric mucosal lesions (SG/CAG) as the reference;

b) clinical variables included in the association analysis: sex, age;

c) association analysis method: multivariable unconditional logistic regression;

preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are sex and age, and proteins with an FDR q < 0.05 after multiple testing correction are selected;

preferably, the multiple test correction method is a Benjamini-Hochberg method or a Bonferroni method;

2) analyzing the relationship between protein expression and the proteomic molecular subtype, and screening proteins significantly associated with proteome-defined severe gastric mucosal lesions and gastric cancer

a) performing proteomic molecular typing of the gastric mucosal lesions by the method of claim 2, calculating the Spearman correlation coefficient between molecular subtype and histopathology, and analyzing differences in protein expression in proteome-defined severe gastric mucosal lesions and gastric cancer (GC), with proteome-defined mild gastric mucosal lesions as the reference;

b) clinical variables included in the association analysis: sex, age;

preferably, the proteome-defined mild gastric mucosal lesions are molecular subtype S1, and the proteome-defined severe gastric mucosal lesions are molecular subtypes S2, S3 and S4;

3) analyzing the relationship between protein expression and the progression of gastric mucosal lesions, and screening proteins significantly associated with lesion progression

a) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; the follow-up endpoint is compared with the baseline, and a subject whose score has risen is judged as disease progression, a subject whose score has fallen as disease reversion, and a subject whose score is unchanged as stable disease;

b) clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis;

preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are sex, age, Helicobacter pylori infection status and baseline histopathological diagnosis, and proteins with p < 0.05 are selected.
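Step 1) above selects proteins with FDR q < 0.05 after multiple testing correction, preferably by the Benjamini-Hochberg method. A minimal sketch of that adjustment is given below, assuming one p-value per protein from the logistic regression; the function name is hypothetical and the input values in the usage note are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone: q_(rank) = min over ranks >= rank of p * m / rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

With the illustrative p-values `[0.01, 0.04, 0.03, 0.5]`, only the first protein has q < 0.05 and would be retained.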

The invention further provides a method for establishing a disease progression risk scoring system of a gastric mucosa lesion sample, which comprises the following steps:

1) using the protein markers screened by the method described above, a risk scoring formula is established from the regression coefficients of the relationship between protein expression and the progression of gastric mucosal lesions obtained in step 3) of that method, together with the protein expression levels; the protein markers are proteins significantly associated with both gastric cancer and the progression of gastric mucosal lesions;

preferably, the risk score formula is:

Risk score = β₁X₁ + β₂X₂ + β₃X₃ + … + βₙXₙ

where βᵢ is the coefficient of protein i in the regression equation obtained in step 3) of the method, and Xᵢ is the expression level of protein i, namely its iFOT value (i = 1, …, n);

preferably n is 4, the proteins being APOA1BP, PGC, HPX and DDT;

2) analyzing the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions

Determining the disease progression outcome, including clinical variables in the association analysis, and analyzing the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions by multivariable unconditional logistic regression modeling.

In the analysis of the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions in step 2), the clinical variables preferably adjusted for in the multivariable unconditional logistic regression are: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis.
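The risk score described above is a weighted sum of marker iFOT values. The sketch below shows the calculation for the four preferred markers. The function name is hypothetical, and the coefficient and expression values in the usage note are placeholders chosen for illustration; they are not the regression coefficients obtained in the patent.

```python
def risk_score(coefficients, ifot):
    """Weighted sum of beta_i * X_i over the marker proteins.

    coefficients: protein name -> regression coefficient (beta)
    ifot:         protein name -> iFOT expression value (X)
    """
    missing = [p for p in coefficients if p not in ifot]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(beta * ifot[p] for p, beta in coefficients.items())
```

For instance, with placeholder coefficients `{"APOA1BP": 1.0, "PGC": -2.0, "HPX": 0.5, "DDT": 0.0}` and placeholder iFOT values `{"APOA1BP": 2.0, "PGC": 1.0, "HPX": 4.0, "DDT": 9.0}`, the score is 2.0.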

The invention further provides a construction method of the gastric mucosa lesion progress risk classifier, which comprises the following steps:

1) screening independent variables;

preferably, the independent variables are sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score and proteomic molecular subtype; the risk score is calculated according to the method described above;

2) determining the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathological score is assigned according to the ordered grades SG, CAG, IM, DYS and GC; the follow-up endpoint is compared with the baseline, and a subject whose score has risen is judged as disease progression, a subject whose score has fallen as disease reversion, and a subject whose score is unchanged as stable disease;

3) constructing a lesion progression risk classifier:

a) classifier selection: a machine learning classification algorithm or an artificial intelligence model;

preferably a random forest or a support vector machine;

b) data input: the independent variables retained after screening;

preferably sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score and proteomic molecular subtype;

c) data output: the disease progression status of each sample;

d) testing the accuracy of the algorithm;

preferably, the area under the receiver operating characteristic (ROC) curve (AUC) is calculated on an independent validation set.
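The accuracy test in step d) relies on the area under the ROC curve. A minimal way to compute it from validation-set labels and classifier scores, without any plotting, is the pairwise definition sketched below. This is an illustrative sketch, not the evaluation code used in the patent; the function name is hypothetical, and labels are assumed to be coded 1 for progression and 0 for non-progression.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative sample (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred biopsies; a rank-based formulation would be preferable for much larger validation sets.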

In another aspect, the invention provides a method for predicting the risk of gastric mucosal lesion progression, which uses the classifier obtained by the method to predict:

a) preprocessing: preprocessing the sample protein expression profile data, performing molecular typing and calculating the risk score;

preferably, the sample protein expression profile data are preprocessed using the protein expression profile preprocessing and experimental filtering method of claim 1; the sample is typed using the molecular typing method of claim 2; and the disease progression risk of the gastric mucosal lesion sample is scored using the method of claim 5;

b) input: the independent variables retained after screening;

preferably, the independent variables are sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score and proteomic molecular subtype;

c) intermediate process: matching the classifier features and predicting disease progression;

d) output: disease progression / non-progression.

The invention further provides a molecular typing classifier of the proteomics for gastric mucosal lesion constructed by the method.

The invention further provides a gastric mucosa lesion progress risk classifier constructed by the method.

The invention further provides the use of the protein markers related to gastric cancer and to the progression of gastric mucosal lesions, obtained by the method of claim 4, in the preparation of a detection kit and/or chip for identifying populations at high risk of gastric cancer;

preferably, the protein markers are the 217 gastric cancer related protein markers shown in Table 4 and the 54 gastric mucosal lesion progression related protein markers shown in Table 6.

In a final aspect, the present invention provides a kit and/or chip for identifying populations at high risk of gastric cancer, comprising the molecular typing classifier of claim 9, the risk scoring system of claim 5, the gastric mucosal lesion progression risk classifier of claim 10, and the protein markers related to gastric cancer and to the progression of gastric mucosal lesions of claim 11.

The invention has the beneficial effects that:

1) The invention uses gastric mucosal tissue obtained by clinical gastroscopic biopsy, which directly reflects the physiological and pathological state of the gastric mucosa.

2) The invention uses modern mass spectrometry with high sensitivity, high resolution and high precision to cover the gastric mucosal proteome in depth, can further mine feature information of extremely low-abundance proteins, and achieves higher detection efficiency.

3) The invention achieves, for the first time, molecular typing of gastric mucosal lesions from proteome data, obtains molecular characteristics of gastric mucosal lesions that are difficult to obtain at the level of cell morphology, and associates them with the risk of lesion progression, so that the progression risk of the different subtypes can be analyzed.

4) For the first time, a comprehensive proteomics study is carried out covering the different stages of gastric mucosal lesions, the evolution of these lesions and the occurrence of gastric cancer, exploring the expression patterns, signaling pathways and potential mechanisms of action of individual proteins and protein phenotypes during lesion evolution and gastric carcinogenesis.

5) The data set of protein markers related to gastric cancer and to the progression of gastric mucosal lesions is validated through a prospective cohort study, providing important clues for exploring the etiology of gastric cancer.

6) Proteome data and other gastric cancer risk factors are integrated for the first time to establish a risk prediction model for the progression of gastric mucosal lesions, which can provide an important basis for gastric cancer prevention and control.

Drawings

FIG. 1 is a graph showing the results of protein detection on 169 samples of gastric mucosal tissue.

FIG. 2 shows the results of the pretreatment of protein expression profiles of 169 samples of gastric mucosal tissue.

FIG. 3 is a graph showing the results of molecular typing of gastric mucosal lesion samples using non-negative matrix factorization (NMF).

FIG. 4 is a forest plot of the multivariable unconditional logistic regression, showing the relationship between molecular subtype and the risk of progression of gastric mucosal lesions.

FIG. 5 is a forest plot of the multivariable unconditional logistic regression, showing the relationship between risk score and the risk of progression of gastric mucosal lesions.

FIG. 6 shows the receiver operating characteristic (ROC) curves of the risk prediction model and the random forest model.

Detailed Description

The invention is illustrated below with reference to specific embodiments. The experimental procedures in the following examples are conventional unless otherwise specified. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Example 1. Obtaining protein expression profile data from clinical gastroscopic biopsy samples of gastric mucosal tissue

The experimental samples are 169 gastroscopic biopsy samples of gastric mucosal tissue from Linqu, Shandong, a high-incidence area for gastric cancer, and from the Fifth Medical Center of the PLA General Hospital.

Protein extraction and analysis were carried out on the 169 clinical gastroscopic biopsy samples, yielding a proteome data set for each sample that includes the identities and number of the proteins and the quantitative value of each protein.

First, lysis buffer formulation:

1% (w/v) DOC (deoxycholic acid), 10 mM TCEP,

40 mM 2-chloroacetamide (CAA), 100 mM Tris, pH 8.5.

Second, operating steps:

1. Sampling: take a gastroscopic biopsy sample and store it in a clean EP tube;

2. Sample lysis: add 500 µL of lysis buffer and homogenize the sample;

3. Heat denaturation: heat the homogenized sample at 95 °C for 5 min to denature it, then let it cool naturally to room temperature;

4. Sonication: place the sample tube in an ice-water mixture and sonicate for 5 min at 30% power, with cycles of 3 s on and 3 s off;

5. Protein separation: centrifuge the sample at 16,000 g for 10 min at 4 °C and keep the supernatant;

6. Protein quantification: measure the protein concentration with a NanoDrop and transfer 50 µg of protein to a new EP tube;

7. Protein washing: add the sample to a 10 kDa ultrafiltration tube and centrifuge at 14,000 g for 2 min at room temperature so that the protein binds fully to the membrane; then wash the protein on the membrane with 50 mM ammonium bicarbonate (NH₄HCO₃), 300 µL per wash, centrifuging at 14,000 g for 20 min at room temperature; repeat twice;

8. Protein digestion: transfer the ultrafiltration unit to a new collection tube, add 100 µL of 50 mM ammonium bicarbonate solution and 5 µg of trypsin, seal, and digest upright at 37 °C for 4 hours; then add another 100 µL of 50 mM ammonium bicarbonate solution and 5 µg of trypsin, seal, and digest with rotation at 37 °C overnight;

9. Peptide collection: centrifuge the ultrafiltration tube at 14,000 g for 20 min and retain the peptides in the collection tube; add 200 µL of mass-spectrometry-grade water, centrifuge again to wash once, and retain the peptides from this second collection; combine the two collections and dry them under vacuum to obtain the product for mass spectrometric detection.

Third, mass spectrometry method:

A new-model Thermo LC-MS/MS (liquid chromatography tandem mass spectrometry) system was used, with C18 packing in both the pre-column and the analytical column. The mobile phases were solution A (H₂O:FA = 99.8:0.2) and solution B (ACN:FA = 99.8:0.2). The dried peptides were fully dissolved in loading buffer (H₂O:CH₃OH:FA = 94.8:5:0.2), centrifuged at 12,000 r/min for 10 min, and subjected to mass spectrometry. For the specific steps, refer to the section "Third, mass spectrometric detection of gastric cancer protein samples" in the detailed description of CN108445097A.

Fourth, protein database search and identification:

The raw files were searched against the NCBI human RefSeq protein database (version 2013) using Proteome Discoverer (version 1.4, Thermo Scientific) with Mascot. The search parameters were set as follows: up to 2 missed cleavage sites; methionine oxidation, N-terminal acetylation and reductive alkylation of cysteine as dynamic modifications; a peptide length of at least 7 amino acids; a peptide score of at least 10; and peptide confidence set to high. The precursor ion mass tolerance was set to 20 ppm and the fragment ion mass tolerance to 50 mmu. Results were evaluated using the integrated reversed (decoy) database, and an FDR of less than 1% was considered acceptable. For the specific steps, refer to the section "Mass spectrometry data analysis of gastric cancer protein samples" in the detailed description of CN108445097A.

Fifth, peptide quality control:

Screening conditions:

Condition 1: US ≥ 1 and S ≥ 2

Condition 2: S ≥ 3

where U denotes Unique and S denotes Strict, so that US is the number of peptides that are both unique and strict, and S is the number of strict peptides;

a unique peptide of a protein is a peptide that is not shared with any other protein;

Strict refers to the stringency of identification of the MS/MS spectrum, i.e. a Mascot ion score greater than 20;

only proteins satisfying the

above condition

1 or 2 were selected for further analysis.
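The screening rule can be written as a short Python sketch. It assumes the thresholds read as US ≥ 1 and S ≥ 2 (condition 1) or S ≥ 3 (condition 2), and that US counts peptides that are both unique and strict. The function name and the protein labels are hypothetical.

```python
def passes_peptide_qc(unique_strict_count, strict_count):
    """Return True if a protein meets condition 1 or condition 2."""
    condition_1 = unique_strict_count >= 1 and strict_count >= 2
    condition_2 = strict_count >= 3
    return condition_1 or condition_2


# Hypothetical per-protein peptide counts: (US, S)
counts = {"protein_A": (1, 2), "protein_B": (0, 2), "protein_C": (0, 3)}
kept = [name for name, (us, s) in counts.items() if passes_peptide_qc(us, s)]
```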

Example 2.

First part: Proteomic molecular typing

Proteomic molecular typing of gastric mucosal lesions was carried out on the data of Example 1. The specific steps are as follows:

1) Protein expression profile preprocessing and experiment filtering

a) High-confidence protein screening: a quantified protein must contain at least one unique peptide with a Mascot ion score ≥ 20 together with at least two peptides with ion scores ≥ 20, or else three peptides with ion scores ≥ 20;

b) Sum-based quantitative normalization: the peak-area-based label-free iBAQ method is used, in which the iBAQ value of a protein is the sum of the peak areas of all of its identified peptides divided by its number of theoretical peptides. The data are normalized by dividing the iBAQ value of each identified protein by the sum of the iBAQ values of all identified proteins, giving the quantitative value (iFOT value);

c) Experiment filtering: taking the gastric mucosal lesion samples of Example 1 as an example, experiments in which fewer than 1500 proteins were identified in total were removed; this step can be adjusted according to the actual number of proteins identified for different cancers and sample types.

As shown in FIG. 1, 15,158 gene products were detected in the 169 gastric mucosal tissues of Example 1, and 9,119 high-confidence proteins were obtained after screening; FIG. 2 shows the results of preprocessing the protein expression profiles of the 169 gastric mucosal tissue samples.
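The normalization and experiment filtering in step 1) can be sketched in Python as follows. This is a minimal illustration under the definitions given above, not the pipeline actually used. The function names and data layout are hypothetical.

```python
def ibaq(peptide_peak_areas, n_theoretical_peptides):
    """iBAQ = sum of peptide peak areas / number of theoretical peptides."""
    return sum(peptide_peak_areas) / n_theoretical_peptides


def ifot(ibaq_by_protein):
    """iFOT = iBAQ of each protein / sum of iBAQ over all identified proteins."""
    total = sum(ibaq_by_protein.values())
    return {protein: value / total for protein, value in ibaq_by_protein.items()}


def filter_experiments(samples, min_proteins=1500):
    """Drop experiments with fewer than min_proteins identified proteins.

    samples maps a sample name to its {protein: iFOT} dictionary.
    """
    return {name: proteins for name, proteins in samples.items()
            if len(proteins) >= min_proteins}
```

Because iFOT is a fraction of the per-sample total, the iFOT values of one sample sum to 1, which makes samples with different total signal comparable.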

3) Selection of typing profiles

a) Minimum identification frequency screening: based on the high-confidence protein data from the 111 gastric mucosal lesion samples, retain the proteins identified in more than 3/4 of all samples; this step can be adjusted according to the proteome data of different cancers and sample types;

b) Rank the proteins by coefficient of variation (CV) from high to low, and select the proteins with the largest CV, up to the number given by the third threshold, together with their quantitative values (iFOT values), to form the typing feature protein matrix; this step can be adjusted according to the proteome data of different cancers and sample types.

The optimal value of the third threshold in this example was determined as follows. For the subsequent NMF typing step, the 500 proteins with the largest CV were first selected for NMF clustering, but neither the heatmap nor the silhouette coefficient was satisfactory for any value of K. Feature protein screening was then repeated with the 200 proteins with the largest CV, and the result was still unsatisfactory. The number of top-CV proteins was reduced step by step, with NMF clustering repeated each time, until a satisfactory heatmap and silhouette coefficient were obtained. The optimal third threshold for the gastric mucosal lesion samples in this example was finally set to 100, giving the gastric mucosal lesion typing feature proteins shown in Table 1.

TABLE 1 gastric mucosal lesion typing profiles
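The two selection steps, identification frequency and CV ranking, can be sketched in Python as follows. The function name and data layout are hypothetical. Computing the CV over detected values only, with the sample standard deviation, is an assumption, since the text does not say how missing values enter the CV.

```python
import statistics


def select_typing_features(matrix, min_fraction=0.75, top_n=100):
    """Pick the top_n highest-CV proteins among frequently identified ones.

    matrix maps protein -> list of iFOT values, one per sample,
    with None where the protein was not identified.
    """
    scored = []
    for protein, values in matrix.items():
        detected = [v for v in values if v is not None]
        # Keep proteins identified in MORE than min_fraction of the samples.
        if len(detected) / len(values) <= min_fraction or len(detected) < 2:
            continue
        mean = statistics.mean(detected)
        if mean == 0:
            continue
        cv = statistics.stdev(detected) / mean  # coefficient of variation
        scored.append((cv, protein))
    scored.sort(reverse=True)  # highest CV first
    return [protein for _, protein in scored[:top_n]]
```

Lowering `top_n` from 500 to 200 to 100, as described above, only changes how many of the ranked proteins are passed to clustering.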

4) NMF typing

Non-negative matrix factorization (NMF) consensus clustering: the optimal number of clusters K is selected according to the silhouette coefficient (average silhouette width) and the cophenetic correlation coefficient, and NMF consensus clustering is used to type the lesion samples. The specific process is as follows:

The R package CancerSubtypes is loaded and the typing feature protein matrix is analyzed with the ExecuteNMF function, with the parameter clusterNum tried from 2 to 8 and nrun set to 50. The optimal number of clusters K is selected from the heatmaps of the typing results and the silhouette coefficients. The NMF typing results are shown in FIG. 3: based on the heatmaps and the line plot of silhouette coefficients for cluster numbers 2 to 8, the optimal number of clusters was K = 4.
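One of the two criteria named above, the average silhouette width, can be sketched in plain Python as follows. This is not the CancerSubtypes implementation: it uses Euclidean distance between sample vectors, which is an assumption, and the function names are hypothetical. It needs at least two clusters.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    widths = []
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(point, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(euclidean(point, points[j]) for j in members) / len(members)
                for label, members in clusters.items() if label != labels[i])
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


def choose_k(points, labels_by_k):
    """Return the K whose labelling has the largest average silhouette width."""
    return max(labels_by_k,
               key=lambda k: average_silhouette_width(points, labels_by_k[k]))
```

Values near 1 indicate compact, well-separated clusters. Values near or below 0 indicate overlapping clusters.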

5) Molecular typing classifier construction

The typing feature proteins and the optimal number of clusters K are taken as the classifier features and a suitable classifier is selected. The classifier takes the input data through an intermediate process, outputs the molecular typing result data, and is then used to type lesion samples. Specifically:

a) Classifier feature selection: the typing feature proteins and the optimal clustering result obtained in the typing feature selection and NMF typing steps above are used (taking the proteomic analysis of gastric mucosal lesions as an example, the 100 feature proteins with the largest coefficient of variation in Table 1 and the NMF clustering result with K = 4)

b) Classifier selection: random forest; other machine learning classification algorithms or artificial intelligence models, such as a support vector machine, may also be selected

c) Data input: the feature protein matrix and the NMF typing labels obtained in the typing feature selection and NMF typing steps above

d) Intermediate process: preprocessing of the expression profile data of the gastric mucosal lesion samples and classifier feature matching, using the randomForest function of the R package randomForest with the parameter na.action set to na.roughfix, proximity set to TRUE, and importance set to TRUE.

e) Data output: molecular typing results of the gastric mucosal lesion samples. According to the results and subtype characteristics of the 111 gastric mucosal lesion cases, proteome-defined mild gastric mucosal lesions correspond to molecular subtype S1, and proteome-defined severe gastric mucosal lesions correspond to molecular subtypes S2, S3, and S4. The molecular typing results of the 39 gastric mucosal lesion samples are shown in Table 2.

Table 2 Molecular typing results of the 39 gastric mucosal lesion samples in the independent validation set
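In the randomForest package, na.roughfix fills missing numeric values with the column median before the forest is grown. The Python sketch below mimics only that numeric imputation, to show what happens to proteins missing in a sample. It is not a port of the R function, and the function name is hypothetical.

```python
import statistics


def rough_fix(rows):
    """Replace None in each numeric column with that column's median."""
    n_columns = len(rows[0])
    medians = []
    for column in range(n_columns):
        observed = [row[column] for row in rows if row[column] is not None]
        medians.append(statistics.median(observed))
    return [[medians[c] if row[c] is None else row[c] for c in range(n_columns)]
            for row in rows]
```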

6) Molecular subtypes are associated with disease progression

Combining the clinical histopathological diagnosis, the Spearman correlation coefficient between the obtained molecular subtypes and histopathology is calculated and the disease progression outcome is determined. Clinical variables are then included in the association analysis of molecular subtype and disease progression, and multivariable unconditional logistic regression is used to analyze how molecular subtype and clinical factors influence the progression of gastric mucosal lesions, i.e. the association between the different subtypes and lesion progression. Specifically:

a) Calculate the Spearman correlation coefficient between molecular subtype and histopathology, using the clinical histopathological diagnosis:

using the R function cor.test, the Spearman correlation coefficient was r = 0.39, with P = 0.016;

b) Determine the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathology score is assigned according to the ordered grades SG, CAG (mild, moderate, severe), IM (mild, moderate, severe), LGIN, HGIN, and GC. The follow-up endpoint is compared with the baseline: a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.

SG (superficial gastritis), CAG (chronic atrophic gastritis): mild gastric mucosal lesions; IM (intestinal metaplasia), LGIN (low grade intraepithelial neoplasia), HGIN (high grade intraepithelial neoplasia): severe gastric mucosal lesions; GC: gastric cancer.

The results of the disease progression outcome determination of 39 samples in table 2 are shown in table 3.

Table 3 Disease progression determinations for the 39 follow-up patients in the validation set

Through follow-up study on 39 gastric mucosa lesion samples in the independent verification set, 19 patients with gastric mucosa lesions are judged to be in disease progression, and 20 patients with gastric mucosa lesions are judged to be in disease non-progression.

c) Clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis;

d) Association analysis method: multivariable unconditional logistic regression (adjusted for sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis), using the R function glm with the formula disease progression (yes/no) ~ sex + age + Helicobacter pylori infection status + baseline histopathological diagnosis + protein, and the parameter family set to binomial(link = "logit").

As shown in FIG. 4, with subtype S1 as the reference, subtype S2 was not significantly associated with disease progression, whereas subtype S4 was: the risk of disease progression for subtype S4 was 19.29 times that of subtype S1.
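The Spearman coefficient in step a) is the Pearson correlation of the ranks, with tied values given their average rank. The Python sketch below illustrates this. Coding the subtypes S1 to S4 and the pathology grades as increasing integers is an assumption, and the function names are hypothetical. The P value that cor.test reports is not computed here.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```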

Second part: Establishment of a gastric cancer-related molecular marker database

The relationships between protein expression and histopathological gastric mucosal lesion status, proteomic molecular typing, and gastric mucosal lesion progression are explored separately, so as to establish a gastric cancer-related molecular marker database.

1) Relationship between protein expression and histopathological gastric mucosal lesion status

a) Taking the gastric mucosal lesion proteomic data as an example, gastric mucosal lesion status is classified as SG, CAG, IM, LGIN, or GC according to the histopathological diagnosis, and the differences in protein expression in severe gastric mucosal lesions (IM/LGIN) and gastric cancer (GC) are explored with mild gastric mucosal lesions (SG/CAG) as the reference;

b) Clinical variables included in the association analysis: sex and age

c) Association analysis method: multivariable unconditional logistic regression (adjusted for sex and age), using the R function glm with the formula lesion group ~ sex + age + protein, and the parameter family set to binomial(link = "logit").

The results are shown in Table 4: 217 proteins were identified and validated as significantly associated with gastric cancer, of which 104 were positively associated and 113 negatively associated.

TABLE 4 Proteins significantly associated with gastric cancer^a

^a Unconditional logistic regression, adjusted for sex and age. Significance level: FDR < 0.05 in the discovery set and one-sided P < 0.05 in the validation set.

CAG, chronic atrophic gastritis; GC, gastric cancer; IM, intestinal metaplasia; LGIN, low grade intraepithelial neoplasia; OR, odds ratio; SG, superficial gastritis.
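The discovery-set criterion FDR < 0.05 requires a multiple-testing correction across all tested proteins. The text does not name the procedure. The Python sketch below assumes the widely used Benjamini-Hochberg adjustment, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q values in the same order as p_values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical per-protein P values from the discovery set
p = [0.01, 0.04, 0.03, 0.005]
significant = [i for i, q in enumerate(benjamini_hochberg(p)) if q < 0.05]
```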

2) Relationship between protein expression and proteomic molecular typing

a) Taking the gastric mucosal lesion proteomic data as an example, and following the proteomic molecular typing of gastric mucosal lesions from the first part, the differences in protein expression in subtypes S2-S4 and gastric cancer (GC) are explored with subtype S1 as the reference.

b) Clinical variables included in the association analysis: sex and age

The results are shown in Table 5: 37 proteins were identified and validated as significantly associated with gastric cancer and with proteome-defined severe gastric mucosal lesions (subtypes S2-S4), of which 27 were positively associated and 10 negatively associated.

TABLE 5 Proteins significantly associated with gastric cancer and proteome-defined severe gastric mucosal lesions^a

GC, gastric cancer; OR, odds ratio.
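The adjusted associations in this part come from unconditional logistic regression, which the text fits with R's glm. As a rough illustration of what such a fit does, here is a minimal Newton-Raphson sketch in plain Python. It is not the glm implementation, it has no safeguards for perfectly separated data, and the function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(rows, outcomes, iterations=25):
    """Fit outcome ~ intercept + covariates; returns [intercept, betas...]."""
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    beta = [0.0] * p
    for _ in range(iterations):
        gradient = [0.0] * p
        hessian = [[0.0] * p for _ in range(p)]
        for x, y in zip(design, outcomes):
            eta = sum(b * v for b, v in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-eta))  # predicted probability
            weight = mu * (1.0 - mu)
            for j in range(p):
                gradient[j] += (y - mu) * x[j]
                for k in range(p):
                    hessian[j][k] += weight * x[j] * x[k]
        step = solve_linear(hessian, gradient)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


def odds_ratios(beta):
    """Exponentiate coefficients to obtain odds ratios."""
    return [math.exp(b) for b in beta]
```

Each row would hold the covariates of one sample, for example sex, age, and one protein's iFOT value, with the lesion group coded 0 or 1. The odds ratio reported for the protein is the exponentiated coefficient of that column.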

3) Relationship between protein expression and gastric mucosal lesion progression

a) Taking the gastric mucosal lesion proteomic data as an example, determine the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathology score is assigned according to the ordered grades SG, CAG, IM, DYS, and GC. The follow-up endpoint is compared with the baseline: a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.

b) Clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis.

c) Association analysis method: multivariable unconditional logistic regression (adjusted for sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis), using the R function glm with the formula lesion group ~ sex + age + Helicobacter pylori infection status + baseline histopathological diagnosis + protein, and the parameter family set to binomial(link = "logit").

The results are shown in Table 6: 54 proteins were identified and validated as significantly associated with the progression of gastric mucosal lesions, of which 26 were positively associated and 28 negatively associated.

TABLE 6 Proteins significantly associated with the progression of gastric mucosal lesions^a

^a Unconditional logistic regression, adjusted for sex, age, Helicobacter pylori infection, and baseline histopathology. Significance level: one-sided P < 0.05.

IM, intestinal metaplasia; OR, odds ratio.
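The outcome rule in step a) can be sketched in Python using the five ordered grades named there. The integer scores 0 to 4 and the names below are hypothetical; only the ordering comes from the text.

```python
# Ordered grades from step a); the position in the list serves as the score.
PATHOLOGY_ORDER = ["SG", "CAG", "IM", "DYS", "GC"]
PATHOLOGY_SCORE = {grade: score for score, grade in enumerate(PATHOLOGY_ORDER)}


def progression_outcome(baseline, endpoint):
    """Compare the follow-up endpoint with the baseline diagnosis."""
    change = PATHOLOGY_SCORE[endpoint] - PATHOLOGY_SCORE[baseline]
    if change > 0:
        return "progression"
    if change < 0:
        return "regression"
    return "stable"


def is_progression(baseline, endpoint):
    """Binary outcome (progression vs. non-progression) for the regression."""
    return progression_outcome(baseline, endpoint) == "progression"
```

A finer scale, for example one that splits CAG and IM into mild, moderate, and severe, only requires a longer ordered list.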

Third part: Establishment of a disease progression risk scoring system for gastric mucosal lesion samples

1) Screening of molecular markers related to both gastric mucosal lesion progression and gastric cancer: according to the results of the second part, the proteins APOA1BP, PGC, DDT, and HPX, which were significantly associated with both gastric cancer and the progression of gastric mucosal lesions (i.e., the proteins that appear in both Table 4 and Table 6), were selected.

2) Establishing the gastric mucosal lesion progression risk scoring system: for the protein markers screened in step 1), a risk score is built from each marker's regression coefficient in the analysis of protein expression versus gastric mucosal lesion progression (second part, step 3)) and its expression level:

risk score = β₁X₁ + β₂X₂ + β₃X₃ + … + βₙXₙ

where βₙ is the coefficient of protein n in the regression equation obtained in step 3) of the second part, and Xₙ is the expression level of protein n, i.e. its iFOT value.

In this example n is 4 and the 4 proteins are APOA1BP, PGC, HPX, and DDT, so that, according to the above formula,

risk score = -1.485 × APOA1BP - 1.231 × PGC + 1.868 × HPX - 0.565 × DDT.
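The score is a weighted sum, which the Python sketch below illustrates. The coefficients are read from the expression in this example as written, with negative signs for APOA1BP, PGC, and DDT. The input iFOT values are invented for illustration only.

```python
# Coefficients as read from the example; treat the signs as an assumption
# if the source expression is interpreted differently.
COEFFICIENTS = {"APOA1BP": -1.485, "PGC": -1.231, "HPX": 1.868, "DDT": -0.565}


def risk_score(ifot_values, coefficients=COEFFICIENTS):
    """Sum of beta_n * X_n over the marker proteins."""
    return sum(beta * ifot_values[protein]
               for protein, beta in coefficients.items())


# Hypothetical expression values for one sample (not real data)
example_sample = {"APOA1BP": 1.0, "PGC": 2.0, "HPX": 3.0, "DDT": 4.0}
example_score = risk_score(example_sample)
```

With these signs, higher HPX raises the score, while higher APOA1BP, PGC, or DDT lowers it.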

3) Relationship between risk score of gastric mucosal lesion and progression of gastric mucosal lesion

The disease progression outcome is determined, clinical variables (sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis) are included in the association analysis, and a multivariable unconditional logistic regression model adjusted for these variables is used to analyze the relationship between the gastric mucosal lesion risk score and the progression of gastric mucosal lesions. Specifically:

a) Determine the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathology score is assigned according to the ordered grades SG, CAG (mild, moderate, severe), IM (mild, moderate, severe), LGIN, HGIN, and GC. The follow-up endpoint is compared with the baseline: a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.

b) Clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis.

c) Association analysis method: multivariable unconditional logistic regression (adjusted for sex, age, Helicobacter pylori infection status, and baseline histopathological diagnosis), using the R function glm with the formula disease progression (yes/no) ~ sex + age + Helicobacter pylori infection status + baseline histopathological diagnosis + risk score, and the parameter family set to binomial(link = "logit").

The results are shown in FIG. 5: the risk score was significantly and positively associated with disease progression, with a 3.09-fold increase in the risk of disease progression for each one-standard-deviation increase in the risk score.
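A per-standard-deviation effect is usually obtained by standardizing the predictor before fitting and then exponentiating its coefficient. The Python sketch below shows both steps. It assumes the reported 3.09 is an odds ratio from the logistic model, and the function names are hypothetical.

```python
import math
import statistics


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(score - mean) / sd for score in scores]


def odds_ratio(beta, standard_error=None, z=1.96):
    """Return (OR, lower, upper); the interval is None without a standard error."""
    point = math.exp(beta)
    if standard_error is None:
        return point, None, None
    return (point,
            math.exp(beta - z * standard_error),
            math.exp(beta + z * standard_error))
```

Under that assumption, an odds ratio of 3.09 per standard deviation corresponds to a coefficient of about ln 3.09 ≈ 1.128 on the standardized score.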

Fourth part: Construction of a gastric mucosal lesion progression risk prediction model

Suitable independent variables are screened and the disease progression outcome is determined; a classifier is then constructed and applied, completing the gastric mucosal lesion progression risk prediction model. The specific process is as follows:

1) Independent variable screening: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score, proteomic molecular typing.

2) Determine the disease progression outcome: based on the clinical histopathological diagnoses at baseline and at the follow-up endpoint, a pathology score is assigned according to the ordered grades SG, CAG (mild, moderate, severe), IM (mild, moderate, severe), LGIN, HGIN, and GC. The follow-up endpoint is compared with the baseline: a higher score is judged as disease progression, a lower score as disease regression, and an unchanged score as stable disease.

3) Construct the lesion progression risk classifier: the randomForest function of the R package randomForest is used with the screened variables, with the parameter na.action set to na.roughfix, proximity set to TRUE, and importance set to TRUE; 21 of the followed-up patients were selected as the training set and 18 as the validation set.

a) Classifier selection: random forest; other machine learning classification algorithms or artificial intelligence models, such as a support vector machine, may also be selected.

b) Data input: the independent variables included after screening.

c) Algorithm accuracy test: independent validation on the validation set.

The results are shown in FIG. 6. Three classifiers were tested in total. Classifier 1 included sex, age, Helicobacter pylori infection, and baseline histopathological diagnosis, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.75. Classifier 2 included sex, age, Helicobacter pylori infection, baseline histopathological diagnosis, and risk score, with an AUC of 0.84. Classifier 3 included sex, age, Helicobacter pylori infection, baseline histopathological diagnosis, risk score, and molecular subtype, with an AUC of 0.95. Comparing classifier 2 with classifier 1, DeLong's test gave P = 0.50; comparing classifier 3 with classifier 1, DeLong's test gave P = 0.04, a significant difference.
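The AUC used to compare the classifiers is the probability that a randomly chosen progressing patient receives a higher predicted score than a randomly chosen non-progressing patient, with ties counted as one half. The Python sketch below computes it directly from that definition. The function name and the example scores are hypothetical, and DeLong's test is not implemented here.

```python
def roc_auc(labels, scores):
    """AUC from pairwise comparisons of positive and negative samples.

    labels: 1 for progression, 0 for non-progression.
    scores: predicted probability (or any score) of progression.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for positive in positives:
        for negative in negatives:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical validation-set predictions (not the patent's data)
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.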

4) The classifier is applied as follows:

a) Preprocessing: expression profile data preprocessing, molecular typing, and risk scoring

b) Input: the independent variables included after screening

c) Intermediate process: classifier feature matching and disease progression prediction. The randomForest function of the R package randomForest is used with the screened variables and the formula RF ~ sex + age + Helicobacter pylori infection + baseline histopathological diagnosis + risk score + molecular subtype, with the parameter na.action set to na.roughfix, proximity set to TRUE, and importance set to TRUE. The predicted label is given by the predict function.

d) Output: disease progression/non-progression; the results are shown in Table 7.

TABLE 7 Prediction results for the validation set samples

It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for constructing a proteomic molecular typing classifier for gastric mucosal lesions, characterized by comprising the following steps:

a) screening high-confidence proteins;

b) normalizing the quantitative data based on the sum;

preferably the first threshold is 1500 and the second threshold is 3/4;

2) selection of typing profiles

preferably the third threshold is 100;

3) NMF typing

preferably, the classifier is a known machine learning classification algorithm or an artificial intelligence model; the data input includes the feature protein matrix and the NMF typing labels; the intermediate process includes preprocessing of the expression profile data of the gastric mucosal lesion samples and classifier feature matching.

2. A method for proteomic molecular typing of gastric mucosal lesions, characterized in that the molecular typing classifier constructed by the method of claim 1 is used for molecular typing of lesion samples, comprising the following steps:

2) classifier data input: inputting expression profile data of a sample;

4) and outputting the molecular typing result data.

3. A method for analyzing the association of different proteomic molecular subtypes of gastric mucosal lesions with the progression of gastric mucosal lesions, characterized by determining the disease progression outcome in combination with clinical histopathological diagnosis, including clinical variables in the association analysis of molecular subtype and disease progression, and performing multivariable unconditional logistic regression on how molecular subtype and clinical factors affect the progression of gastric mucosal lesions, so as to analyze the association of the different subtypes with the progression of gastric mucosal lesions, the molecular subtypes being those obtained according to the method of claim 2;

4. A method for screening protein markers related to gastric cancer and to the progression of gastric mucosal lesions, characterized in that the relationships between protein expression and histopathological gastric mucosal lesion status, proteomic molecular typing, and gastric mucosal lesion progression are calculated separately, so as to establish a database of protein markers related to gastric cancer and to the progression of gastric mucosal lesions:

b) clinical variables included in the association analysis: sex, age;

preferably, the clinical variables adjusted for in the multivariable unconditional logistic regression are sex and age; proteins with FDR q < 0.05 after multiple-testing correction are selected;

b) clinical variables included in the association analysis: sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis

5. A method for establishing a disease progression risk scoring system of a gastric mucosa lesion sample is characterized by comprising the following steps:

1) establishing a risk scoring formula from the protein markers obtained by screening with the method of claim 4, using their regression coefficients from the relationship between protein expression and gastric mucosal lesion progression in step 3) of the method of claim 4 and their protein expression levels; the protein markers being proteins significantly associated with both gastric cancer and the progression of gastric mucosal lesions;

preferably, the risk score formula is:

risk score = β₁X₁ + β₂X₂ + β₃X₃ + … + βₙXₙ

where βₙ is the coefficient of protein n in the regression equation obtained in step 3) of claim 4, and Xₙ is the expression level of protein n, i.e. its iFOT value;

preferably, n is 4 and the proteins are APOA1BP, PGC, HPX, and DDT;

6. The method of claim 5, wherein the step 2) of analyzing the relationship between the risk score and the progression of the lesion of the gastric mucosa comprises the following steps:

7. A construction method of a gastric mucosa lesion progress risk classifier is characterized by comprising the following steps:

1) screening independent variables;

preferably, the independent variables are sex, age, Helicobacter pylori infection status, baseline histopathological diagnosis, risk score, and proteomic molecular typing; the risk score is calculated according to the method of claim 5 or 6;

3) constructing a lesion progress risk classifier:

preferably a random forest or a support vector machine;

b) data input: the independent variables included after screening;

c) data output: the disease progression status of each sample;

d) testing the accuracy of the algorithm;

8. A method for predicting the risk of progression of lesions of the gastric mucosa, characterized in that the classifier obtained according to claim 7 is used for predicting:

b) input: the independent variables included after screening;

output: disease progression/non-progression.

9. A proteomic molecular typing classifier for gastric mucosal lesions constructed by the method of claim 1.

10. A gastric mucosal lesion progression risk classifier constructed by the method of claim 7.