CN105986007B - Detection method of cancer tumor suppressor gene cluster (TSG)

Info

Publication number: CN105986007B
Application number: CN201410508691.6A
Authority: CN
Inventors: 苏红; 刘栋兵; 彭丽花
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2015-02-11
Filing date: 2015-02-11
Publication date: 2020-03-17
Anticipated expiration: 2035-02-11
Also published as: CN105986007A

Abstract

To obtain tumor suppressor genes that act together in the occurrence of cancer quickly and accurately, the inventors designed a new method. First, the pairwise relationships between all genes in the same region are examined by simulated random sampling. Second, the genes that have a co-occurrence relationship are selected. Then, these pairwise co-occurring genes are linked into a chain, in which every pair of genes must have a co-occurrence relationship. Finally, the clinical information and expression data of the samples are used to verify whether the genes in a chain act together in cancer development, and whether this joint effect is stronger than that of individual genes.

Description

A cancer tumor suppressor gene cluster (TSG) detection method

Technical field

The present invention relates to the field of bioinformatics. More specifically, it relates to the detection of cancer tumor suppressor gene clusters.

Background

The gene-mutation theory has long been the mainstream account of how cancer arises. Since 1982, when a mutation in HRAS was identified as the first human gene mutation that may cause cancer, cancer research has been in an era of exploring and identifying oncogenes and tumor suppressor genes. Early studies focused on single gene mutations in a genomic region acting as "drivers" of tumorigenesis. More recent studies, however, have found many tumor suppressor genes on the genome that frequently undergo deletion mutations, and these deletions, which are associated with heterozygous loss, also reduce the activity of the genes surrounding the tumor suppressor gene. Studies have shown that genes that frequently undergo large-fragment deletions tend to exist as clusters (multiple genes acting together in the occurrence of cancer), and that this joint biological regulatory effect is stronger than the effect of any single gene. Large-scale genomic damage may therefore act through the combined effect of co-occurring cancer genes rather than through damage to a single independent gene, which makes it one possible mechanism of carcinogenesis.

Obtaining candidate tumor suppressor genes is not complicated, and current analysis schemes are fairly consistent. What remains difficult is identifying, among a large number of candidates, the genes that act together. There are currently two main approaches: 1) experimental methods such as rat model systems of cancer and in vivo RNAi, which are time-consuming and costly; 2) selecting tumor suppressor genes already reported in the literature, which, given the limits of manual curation, yields results that are somewhat subjective and possibly incomplete, and which also requires considerable labor and time.

In summary, existing methods for studying tumor suppressor genes yield many candidate genes at an early stage, and picking out those that truly act, alone or jointly, in the occurrence of cancer is very difficult. The methods reported in earlier articles require a great deal of labor and material resources. The field therefore urgently needs a method for obtaining, from a large number of candidate genes, the tumor suppressor genes that act together in carcinogenesis.

Summary of the invention

To obtain tumor suppressor genes that act together in carcinogenesis quickly and accurately, the inventors designed a new method:

First, the pairwise relationships between all genes in the same region are examined by simulated random sampling;

Second, the genes that have a co-occurrence relationship are selected;

Then, these pairwise co-occurring genes are linked into a chain, in which every pair of genes must have a co-occurrence relationship;

Finally, the clinical information and expression data of the samples are used to verify whether the genes in a chain act together in the occurrence of cancer, and whether this joint effect is stronger than that of a single gene.
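The linking step above requires that every pair of genes in a chain co-occurs. One way to read this requirement, offered here only as an illustrative sketch and not as the inventors' implementation, is to treat genes as nodes and significant co-occurring pairs as edges, and to enumerate the maximal fully connected groups (cliques) with the Bron-Kerbosch algorithm. The function names and the gene labels `G1` to `G4` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from co-occurring gene pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Return every maximal gene set in which all pairs co-occur (Bron-Kerbosch)."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        # r: current clique, p: candidates to add, x: already explored genes
        if not p and not x:
            cliques.append(set(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(graph), set())
    return cliques


# Hypothetical pairs that passed the co-occurrence test (illustration only).
pairs = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4")]
chains = maximal_cliques(build_graph(pairs))
```

In this toy input, `G1`, `G2` and `G3` co-occur pairwise and form one chain, while `G4` co-occurs only with `G3` and so forms a separate two-gene chain.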

The present invention performs downstream analysis based on somatic CNV (copy number variation) detection results ^[1] from whole-genome sequencing of paired tumor samples and on transcriptome expression analysis results (FPKM) ^[1].

Accordingly, the present invention provides a method for obtaining tumor suppressor genes that act together in carcinogenesis, the method comprising the steps of:

1) For tumor tissue samples and normal tissue samples from a plurality of tumor patients, obtain whole-genome sequencing data, transcriptome expression data, and clinical information including patient survival time and the expression levels of genes in the samples;

2) Divide the genome into a plurality of sub-regions (for example, with a sub-region length of 10K-10M, preferably 10K-1M, more preferably 100K-1M) and, for each sub-region, calculate the CNV significance across the plurality of samples from the CNV detection results of the whole-genome sequencing data (for example, using the Gscore value ^[2]);

3) Extend the sub-regions with significant CNV change (for example, Gscore >= 0.1 when the Gscore value is used) and select contiguous sub-regions with significant CNV change on the genome as deletion regions;

4) For all genes in each deletion region, use the transcriptome expression data of the tumor and normal tissue samples to select the genes whose expression differs significantly between normal and tumor tissue samples (for example, paired rank-sum test p<0.05) and is down-regulated; these genes are the candidate tumor suppressor genes;

5) For each deletion region, determine whether each pair of genes undergoes deletion mutation simultaneously, for example as follows. Assume that the simultaneous occurrence of deletion mutations in two genes is a random process, and perform a large number of random samplings (for example, more than 10,000, more than 100,000, ..., or more than 1,000,000 × 10^4). In each sampling, each gene has CNV in multiple samples; each sampling addresses an arbitrary pair of genes, and the number of draws for a gene is related to the number of times that gene's CNV actually occurs. Each sampling records the number of times the two genes co-occur (that is, occur in the same sample), and this number is compared with the actual result. The samplings in which the simulated co-occurrence count is greater than the actual co-occurrence count are recorded and summed, and the sum is divided by the total number of samplings to give the final p-value. The smaller the p-value, the less likely it is that the simultaneous mutation of the two genes is random; in general, when p<0.05 the two genes are considered to have undergone deletion mutation simultaneously;

6) Link the genes in each region that undergo deletion mutation simultaneously, requiring that all genes in a chain co-occur with one another pairwise;

7) Use the clinical information and expression information to establish that the expression difference of a gene during carcinogenesis (high versus low expression in Figure 2) corresponds to a significant difference in cancer survival rate ^[3], and analyze the relationship between the expression of the genes in the above chain and the survival time of cancer patients. If the genes jointly have a significant effect on patient prognosis, they do act together. For example, gene expression data are used to divide cancer patients into a high-expression class and a low-expression class (a gene is highly expressed in a sample if its expression level in that sample is greater than its mean expression level across all samples, and lowly expressed otherwise), and the survival rate (alive/dead) of each class at a given time point is calculated as in Figure 2; if high expression of the gene is associated with a higher survival rate, the gene is verified to have a significant effect on prognosis. In this way, all tumor suppressor genes that act together in carcinogenesis are obtained.
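The random-sampling test in step 5) can be sketched as a simple permutation test. This is a minimal illustration under stated assumptions, not the inventors' code: it assumes that "the number of draws is related to the number of occurrences" means each gene is placed at random in as many samples as it was actually deleted in. The function name, parameters and sample labels are hypothetical, and the default number of iterations is kept far below the counts mentioned in the text so the example runs quickly.

```python
import random
from typing import List, Set


def cooccurrence_p_value(
    samples_a: Set[str],
    samples_b: Set[str],
    all_samples: List[str],
    n_iter: int = 2_000,
    seed: int = 0,
) -> float:
    """Fraction of random samplings whose co-occurrence count exceeds the observed one."""
    rng = random.Random(seed)
    observed = len(samples_a & samples_b)  # samples in which both genes are deleted
    exceed = 0
    for _ in range(n_iter):
        # Assumption: keep each gene's deletion frequency, randomize which samples carry it.
        sim_a = set(rng.sample(all_samples, len(samples_a)))
        sim_b = set(rng.sample(all_samples, len(samples_b)))
        # The text counts samplings strictly greater than the observed count;
        # ">=" is a common, more conservative variant.
        if len(sim_a & sim_b) > observed:
            exceed += 1
    return exceed / n_iter


cohort = [f"s{i}" for i in range(20)]
# Two genes deleted in exactly the same 8 samples: no sampling can exceed 8 shared samples.
p_linked = cooccurrence_p_value(set(cohort[:8]), set(cohort[:8]), cohort)
# Two genes deleted in disjoint samples: almost every sampling exceeds 0 shared samples.
p_unlinked = cooccurrence_p_value(set(cohort[:5]), set(cohort[5:10]), cohort[:10])
```

With the p<0.05 threshold from the text, the first pair would be called co-occurring and the second would not.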

Existing methods for studying tumor suppressor genes yield many candidate genes at an early stage, and picking out those that truly act, alone or jointly, in the occurrence of cancer is very difficult. The methods reported in earlier articles require a great deal of labor and material resources. The present method applies mathematical and bioinformatics techniques to obtain jointly acting tumor suppressor genes accurately and quickly and, combined with clinical information, provides guidance for subsequent tumor treatment.

Description of the drawings

The two figures in the accompanying drawings illustrate the final results of the whole study.

Figure 1. Taking chromosome 13 as an example, the figure illustrates genes on the chromosome that undergo large-fragment loss mutations and the co-occurrence relationship among them. It shows the large-scale fragment loss on human chromosome 13; the genes associated with this loss form a cluster, and their co-occurrence relationship can be verified by the results shown in Figure 2.

Figure 2. Survival curves of individual genes and of the genes acting together. The curves indicate that the clustered genes obtained in this study do have a co-occurrence relationship and that, as a whole, their promoting effect on the occurrence of this cancer is clearly greater than that of any single gene. The genes in the figure are also genes recently found to be highly associated with the occurrence of this cancer.

Detailed description

This specific embodiment further explains the present invention and does not limit it. After reading this specification, those skilled in the art may, as needed, make modifications to this embodiment that involve no inventive contribution; any such modification is protected by patent law as long as it falls within the scope of the claims of the present invention.

1) Samples and data sources: whole-genome sequencing data and transcriptome expression data of tumor tissue samples and normal tissue samples from 65 prostate cancer patients. For example, the output of the CNV detection pipeline developed in-house by BGI ^[1] and the gene FPKM values obtained with the cufflinks ^[1] software (FPKM: fragments per kilobase of exon per million fragments mapped to the reference genome) serve as the data input of this embodiment;

2) Divide each chromosome of the human reference genome (UCSC hg19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz) into 1M windows and calculate the CNV significance within each 1M window (represented by the Gscore ^[2] value);

3) Scan the windows obtained in step 2 in positional order and merge adjacent windows with significant CNV change (Gscore >= 0.1) to obtain the regions across the genome where large-scale loss occurs frequently;

4) Annotate the CNV regions obtained in step 3 with the ANNOVAR software (http://www.openbioinformatics.org/annovar/) to obtain the list of genes in these regions;

5) For the genes in each region, perform a paired rank-sum test on the FPKM values in normal tissue and tumor tissue, and select the genes whose expression levels differ significantly (p<0.05) between normal and tumor tissue samples;

6) Filter the resulting gene list: first, remove genes that carry common cancer-related point mutations (somatic single-nucleotide mutations and somatic insertion-deletion mutations), for example genes in the cancer-related gene list of the COSMIC database (http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/download); second, select the genes whose expression change is consistent with the CNV change, that is, genes that are deleted and whose expression is down-regulated in tumor tissue;

7) For the genes remaining after filtering, study the pairwise co-occurrence relationship among the genes within each cluster region separately: simulate 1,000,000 × 10^4 random samplings. In each sampling, each gene has CNV in multiple samples; each sampling addresses an arbitrary pair of genes, and the number of draws for a gene is related to the number of times that gene's CNV actually occurs. Each sampling records the number of times the two genes co-occur (in the same sample), and this number is compared with the actual result. The samplings in which the simulated co-occurrence count is greater than the actual co-occurrence count are recorded and summed, and the sum is divided by the total number of samplings to give the final p-value; when p<0.05, the two genes are considered to have undergone deletion mutation simultaneously;

8) Based on the list of relationships obtained in the previous step, connect the pairwise co-occurring genes to obtain lists of fully interconnected genes;

9) Plot the CNV regions on human chromosome 13, as shown in Figure 1. The abscissa is the position on human chromosome 13 and the ordinate is the Gscore value; red regions represent amplified regions and blue regions represent deleted regions. The genes labeled in the figure are an example of the list of mutually pairwise co-occurring genes found above;

10) Download the clinical information of the corresponding patients for published prostate cancer samples (which must include patient survival time) and the expression of genes in those samples [3], and use the R package for survival curves (survival) to plot the survival curve of each gene and of all genes together: mfit = survfit(Surv(time, status) ~ group). Here group is defined as follows: for a single gene, the expression value of the gene in each cancer sample is compared with the mean expression of that gene across all samples; samples above the mean have group = 2 (high expression) and samples below the mean have group = 1 (low expression). For all genes in the chain together, if every gene is lowly expressed in a sample, the whole chain has group = 1 for that sample; otherwise group = 2.
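Step 3) of this embodiment, merging adjacent windows whose Gscore reaches the threshold, can be sketched as a single pass over position-sorted windows. This is an illustrative sketch only: the tuple layout, the half-open coordinate convention and the function name are assumptions, and the sketch does not separate deletions from amplifications, which the Gscore alone does not encode here.

```python
from typing import List, Tuple

# Assumed layout: (chromosome, start, end, gscore), with half-open [start, end) windows.
Window = Tuple[str, int, int, float]


def merge_significant_windows(
    windows: List[Window], threshold: float = 0.1
) -> List[Tuple[str, int, int]]:
    """Merge runs of adjacent windows whose Gscore is >= threshold into regions."""
    regions: List[List] = []
    current = None  # region currently being extended, or None
    for chrom, start, end, gscore in sorted(windows, key=lambda w: (w[0], w[1])):
        if gscore < threshold:
            current = None  # a non-significant window ends the current run
            continue
        if current is not None and current[0] == chrom and current[2] == start:
            current[2] = end  # directly adjacent on the same chromosome: extend
        else:
            current = [chrom, start, end]
            regions.append(current)
    return [(c, s, e) for c, s, e in regions]


M = 1_000_000  # 1M window size, as in step 2)
# Hypothetical Gscore values for five windows on chromosome 13.
windows = [
    ("chr13", 0 * M, 1 * M, 0.05),
    ("chr13", 1 * M, 2 * M, 0.20),
    ("chr13", 2 * M, 3 * M, 0.30),
    ("chr13", 3 * M, 4 * M, 0.00),
    ("chr13", 4 * M, 5 * M, 0.15),
]
regions = merge_significant_windows(windows)
```

Here the second and third windows merge into one 2M region, and the fifth window forms a region of its own because the fourth window separates it from the others.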

The survival curves of the genes in Figure 1 are plotted by the method of step 10), as shown in Figure 2. The abscissa is the time to recurrence and the ordinate is the survival probability; the upper curve represents high expression and the lower curve low expression. The P value indicates whether high and low expression of the gene differ significantly in their effect on the survival rate of prostate cancer patients (with p<0.05 as the significance threshold). As the figure shows, the tumor suppressor genes obtained by the present invention are associated with a better prognosis when highly expressed, and the joint effect of all genes is more significant than that of any single gene.
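The high/low grouping behind these survival curves can be sketched as follows. This is a minimal illustration of the grouping rule only (group 2 when expression is above the gene's mean, group 1 otherwise, and group 1 for a chain only when every gene is low); it does not fit survival curves, which the text does with `survfit` in R. The function names, gene labels and expression values are hypothetical.

```python
from statistics import mean
from typing import Dict


def gene_groups(expression: Dict[str, float]) -> Dict[str, int]:
    """Map each sample to 2 (high) if expression > mean across samples, else 1 (low)."""
    avg = mean(expression.values())
    return {sample: 2 if value > avg else 1 for sample, value in expression.items()}


def chain_groups(expr_by_gene: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Map each sample to 1 only if every gene in the chain is low there, else 2."""
    per_gene = {gene: gene_groups(expr) for gene, expr in expr_by_gene.items()}
    samples = next(iter(expr_by_gene.values())).keys()
    return {
        s: 1 if all(groups[s] == 1 for groups in per_gene.values()) else 2
        for s in samples
    }


# Hypothetical expression values for two genes in three tumor samples.
expr = {
    "GENE_A": {"s1": 1.0, "s2": 3.0, "s3": 5.0},  # mean 3.0
    "GENE_B": {"s1": 9.0, "s2": 2.0, "s3": 4.0},  # mean 5.0
}
single = gene_groups(expr["GENE_A"])
combined = chain_groups(expr)
```

The resulting per-sample labels correspond to the `group` variable in `survfit(Surv(time, status) ~ group)`; a sample whose expression equals the mean is treated as low, which follows the "otherwise low" wording in the summary.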

References

[1] Chiang D Y, Getz G, Jaffe D B, et al. High-resolution mapping of copy-number alterations with massively parallel sequencing[J]. Nature Methods, 2008, 6(1): 99-103.

[2] Trapnell C, Roberts A, Goff L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks[J]. Nature Protocols, 2012, 7(3): 562-578.

[3] Mermel C H, Schumacher S E, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers[J]. Genome Biol, 2011, 12(4): R41.

[4] Glinsky G V, Glinskii A B, Stephenson A J, et al. Gene expression profiling predicts clinical outcome of prostate cancer[J]. The Journal of Clinical Investigation, 2004, 113(6): 913-923.

[5] Xue W, Kitzing T, Roessler S, et al. A cluster of cooperating tumor-suppressor gene candidates in chromosomal deletions[J]. Proceedings of the National Academy of Sciences, 2012, 109(21): 8212-8217.

[6] J. Clin. Invest. 113: 913-923 (2004). doi:10.1172/JCI200420032.

Claims

1. A method of obtaining tumor suppressor genes that act together in carcinogenesis, said method comprising the steps of:

1) obtaining, for tumor tissue samples and normal tissue samples of a plurality of tumor patients, whole-genome sequencing data, transcriptome expression data, and clinical information including patient survival time and the expression levels of genes in the samples;

2) dividing the genome into a plurality of sub-regions and, for each sub-region, calculating the CNV significance across the plurality of samples using the CNV detection results of the whole-genome sequencing data;

3) extending the sub-regions with significantly changed CNV, and selecting contiguous sub-regions with significantly changed CNV on the genome as deletion regions;

4) for all genes in each deletion region, using the transcriptome expression data of the tumor tissue samples and the normal tissue samples to select genes whose transcriptome expression level differs significantly between the normal tissue samples and the tumor tissue samples and is down-regulated, wherein these genes are candidate tumor suppressor genes;

5) for each deletion region, judging whether each pair of genes undergoes deletion mutation simultaneously;

6) linking the genes in each region that undergo deletion mutation simultaneously, wherein the genes in one chain must co-occur with one another pairwise;

7) using the clinical information and the expression level information to establish that the expression difference of a gene during the development of the cancer corresponds to a significant difference in cancer survival rate, and analyzing the relationship between the expression of the genes in the chain and the survival time of the cancer patients, wherein, if the genes jointly have a significant effect on the prognosis of the cancer patients, the genes act together, so that all tumor suppressor genes that act together in the development of the cancer are obtained;

wherein the length of each sub-region is 100K-1M;

wherein step 3) is specifically: merging adjacent sub-regions with significantly changed CNV to obtain regions of frequent large-scale loss across the whole genome as the deletion regions;

wherein, in step 5), whether each pair of genes undergoes deletion mutation simultaneously is judged as follows: assuming that the simultaneous occurrence of deletion mutations in two genes is a random process, random sampling is performed more than 1,000,000 × 10^4 times; in each sampling, each gene has CNV in a plurality of samples; each sampling addresses an arbitrary pair of genes, and the number of draws for a gene is related to the number of times that gene's CNV occurs; each sampling records the number of times the two genes co-occur, and this number is compared with the actual result; the samplings in which the simulated co-occurrence count is greater than the actual co-occurrence count are recorded and summed, and the sum is divided by the total number of samplings to obtain the p-value, which measures how random the simultaneous mutation of the two genes is; and if p is less than 0.05, the two genes are considered to have undergone deletion mutation simultaneously.

2. The method of claim 1, wherein the CNV significance is calculated using a Gscore value.

3. The method of claim 1, wherein a significant CNV change is Gscore >= 0.1.

4. The method of claim 1, wherein a significant difference in transcriptome expression level between the normal tissue samples and the tumor tissue samples is a paired rank-sum test p < 0.05.

5. The method of claim 1, wherein the significant effect of a gene on the prognosis of the cancer patients is confirmed as follows: the cancer patients are divided into a high-expression class and a low-expression class using the gene expression data, and the survival rate of each class of patients at a given time point is calculated; if high expression of the gene is associated with a higher survival rate for the cancer patients, the gene is verified to have a significant effect on the prognosis of the cancer patients.

6. The method of claim 5, wherein high expression and low expression are defined as follows: if the expression level of the gene in a sample is greater than the mean expression level of the gene across all samples, the gene is highly expressed in that sample; otherwise the gene is lowly expressed.