CN110782947A

CN110782947A - Identification of cancer drivers based on functional regions of protein sequences

Info

Publication number: CN110782947A
Application number: CN201910991325.3A
Authority: CN
Inventors: 卢新国; 袁玥; 王新宇; 丁莉; 高妍
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2020-02-11

Abstract

Identifying cancer drivers is a key challenge in explaining the mechanisms of carcinogenesis and in enabling precision medicine. Many methods identify cancer drivers at the level of individual mutation sites or entire genes, but they overlook a large number of medium-sized functional elements. We hypothesize that mutations occurring in different regions of a protein sequence affect cancer progression differently. We therefore developed frDriver, a functional driver region identification method based on Bayesian probability and a multiple linear regression model, to identify protein regions that can regulate gene expression levels and have high functional impact potential. By combining gene expression data with somatic mutation data, and using functional impact scores (SIFT, PROVEAN) as prior knowledge, we identify the cancer driver regions that predict gene expression levels most accurately. We evaluated frDriver on the BRCA and GBM datasets from TCGA. The results show that frDriver identifies known cancer drivers and outperforms three other state-of-the-art methods (eDriver, ActiveDriver, and OncodriveCLUST).

Description

Identification of cancer drivers based on functional regions of protein sequences

Technical field

The invention relates to data mining in bioinformatics, in particular to the mining of cancer bioinformatics data. Specifically, it relates to a method that identifies novel functional cancer driver regions from the functional regions of protein sequences, using Bayesian probability and a multiple linear regression model.

Background

Identifying cancer drivers is a key challenge in explaining the mechanisms of carcinogenesis and in enabling precision medicine. Many methods identify cancer drivers at the level of individual mutation sites or entire genes, but they overlook a large number of medium-sized functional elements.

Mutations can affect cellular regulation, signaling, and other processes, and their complexity leads to diverse functional effects. Somatic cells acquire mutations faster than germline cells, at tens to hundreds of times the rate. However, only a small fraction of mutations confer a selective growth advantage and play a leading role in tumor development; these are called cancer drivers. Most mutations are neutral and non-functional and are called cancer passengers.

Many computational methods and tools have been developed to distinguish cancer drivers from passengers. Algorithms for identifying driver genes are usually based on mutation frequency: they define driver genes by comparing a gene's mutations against background mutations, so constructing the background mutation model has become a research focus. The background model quantifies the accumulation of passenger mutations, and a background mutation rate that is estimated too high or too low leads to inaccurate results. Synonymous mutations are often used as a fixed background model. However, studies have shown that background mutations are not uniformly distributed, so some researchers incorporate other biological features that may affect mutation frequency. For example, MutSigCV estimates the background from patient-specific mutation spectra and gene-specific features (e.g., expression, replication). Not all mutation sites in a gene are functional, however, and different sites may have different effects depending on their position. Some researchers have therefore proposed hotspots, that is, amino acid sites with a high mutation frequency, corresponding to three-base regions of a gene that contain many mutations. Chang et al. proposed a binomial statistical model for detecting recurrently mutated residues, replacing earlier detection methods. By aligning the protein families of genes across 22 different cancer types, hotspots were identified, revealing rare functional mutations.

However, these analyses overlook a large number of medium-sized functional elements, such as protein interfaces or protein subunits. Previous studies have shown that cancer formation is associated with the enhancement of local function. Region-based studies of cancer drivers offer better statistical performance than gene-level and hotspot-level analyses. OncodriveCLUST builds a background model from coding-silent mutations, identifies positions with more mutations than a background-rate threshold as potentially meaningful cluster seeds, and then adds neighboring mutations to form clusters. eDriver uses the distribution of somatic missense mutations among the functional regions within a protein, applying a binomial test to check whether the number of mutations observed in a region is biased, and thereby identifies protein regions enriched in somatic missense mutations. DMCM estimates mutation density by kernel density estimation (KDE) with a data-adaptive bandwidth and identifies variable-length clusters in the amino acid sequence. However, these methods consider only local sequence features and ignore the gene regulatory effects associated with mutations. Projects such as TCGA provide multi-omics data, including somatic mutations and gene expression, and combining different omics data performs better than using a single data type.

Several studies have begun to use gene expression data to predict cancer drivers systematically. For example, xSeq builds a statistical model based on a hierarchical Bayesian approach and known interaction networks, and was applied to the TCGA pan-cancer dataset to study the impact of somatic mutations on the expression profiles of 12 tumor types. Because most currently known networks are error-prone or incomplete, this approach has limitations and tends to introduce noise into the results. The method is also biased with respect to interactions that have not yet been detected. Many methods, such as SIFT and MutationAssessor, estimate the effect of an amino acid change on protein function by comparing homologous amino acid sequences, using protein conservation and structural information, and compute a functional impact score. Integrating these functional impact scores into driver identification provides additional functional information.

In summary, existing methods do not adequately account for the role of the many medium-sized functional elements in identifying cancer drivers, and few of them identify cancer driver regions by estimating the effect of individual amino acid changes on protein function and computing functional impact scores.

Summary of the invention

In view of the problems with the above methods and the role of protein sequence regions in identifying cancer drivers, the invention provides a protein-sequence-based method for identifying novel functional cancer driver regions, that is, protein regions that can regulate gene expression levels and have high functional impact potential. Protein domains and intrinsically disordered regions are used as candidate regions. By combining gene expression data with somatic mutation data, and using functional impact scores (SIFT, PROVEAN) as prior knowledge, the method identifies the cancer driver regions that predict gene expression levels most accurately. frDriver was evaluated on the BRCA and GBM datasets from TCGA. The method comprises the following steps:

1. Differential comparison of gene expression

Gene expression data and somatic mutation data from the TCGA database are used to screen for genes with key functions.

2. Data preprocessing

About 3,000 genes confirmed to have key functions are selected, such as signal transduction genes, cell cycle control genes, and genes related to regulatory processes. To make the data distributions uniform, the expression data are quantile-normalized: the logarithm of all gene expression values is taken, and all samples are then quantile-normalized in the R language. Mutation and gene expression samples are matched, and only samples present in both datasets are retained.
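The preprocessing step can be illustrated with a minimal Python sketch (the patent itself reports an R implementation, which is not shown in the text). The function names, the log base, and the pseudocount are assumptions made for illustration; the quantile normalization shown is the standard rank-based procedure and may differ in detail from what the authors ran.

```python
import numpy as np

def log_transform(expr, pseudocount=1.0):
    """Log-transform expression values.

    The base (2) and the pseudocount are assumptions; the pseudocount
    only guards against taking the logarithm of zero.
    """
    return np.log2(np.asarray(expr, dtype=float) + pseudocount)

def quantile_normalize(expr):
    """Quantile-normalize a genes x samples matrix.

    After normalization every sample (column) has the same empirical
    distribution: the mean of the sorted columns. Ties are broken by
    position, which is a simplification.
    """
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0)        # sort order within each column
    ranks = np.argsort(order, axis=0)       # rank of each entry in its column
    reference = np.sort(expr, axis=0).mean(axis=1)  # shared target distribution
    return reference[ranks]

def match_samples(expr_samples, mut_samples):
    """Keep only sample IDs present in both datasets, in expression order."""
    mutated = set(mut_samples)
    return [s for s in expr_samples if s in mutated]
```

A typical call order would be `match_samples` first, to restrict both matrices to the shared samples, followed by `log_transform` and `quantile_normalize` on the expression matrix.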

3. Candidate driver regions

Two types of functional regions are selected as candidates: protein domains and intrinsically disordered regions (IDRs). A protein domain is a conserved part of a protein sequence that can function and exist independently of the rest of the protein chain. IDRs are important functional regions that can adapt to cellular conditions and adopt many different structures. Using these regions as candidates gives a sound basis for explaining the effects of functional mutations.

4. Predicting functional characteristics of single amino acid variants

SIFT assesses how deleterious an amino acid change is by aligning homologous sequences, taking into account the position of the change in the amino acid sequence and the physicochemical properties of the residues. Each position in the alignment built from homologous sequences is scanned, and the probability of each amino acid occurring there is recorded in a scaled probability matrix. PROVEAN evaluates the functional impact of a variant from sequence alignment scores: it measures the change in pairwise sequence similarity caused by the amino acid variation. The delta alignment score is defined as:

Δ(P, v, S) = R(P′, S) − A(P, S)

Delta scores are averaged within each cluster and then across the set of supporting sequences; the resulting unbiased average is the PROVEAN score, as shown in the following formula:
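The formula itself is not reproduced in this text. The two-level averaging described in the sentence above can be sketched as follows, assuming the delta scores have already been computed and grouped by cluster; the function name and the input layout are hypothetical.

```python
def provean_style_score(delta_by_cluster):
    """Average delta scores within each cluster, then across clusters.

    delta_by_cluster: list of lists, one inner list of delta alignment
    scores per cluster of supporting sequences (layout is an assumption).
    Averaging per cluster first keeps a large cluster of near-identical
    sequences from dominating the final score.
    """
    cluster_means = [sum(c) / len(c) for c in delta_by_cluster if c]
    if not cluster_means:
        raise ValueError("no delta scores supplied")
    return sum(cluster_means) / len(cluster_means)
```

For example, two clusters with delta scores `[-2.0, -4.0]` and `[-1.0]` have cluster means of -3.0 and -1.0, giving a final score of -2.0, whereas a flat average over all three sequences would give about -2.33.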

5. Evaluating the functional potential of regions

To estimate the probability that a single amino acid variation affects gene expression levels, the functional potential (FP) of a mutation is defined as:

The feature weight parameters are learned automatically from the data. A sigmoid function is applied to the result to keep FP bounded.

The probability that mutations in region r affect the gene expression profile is defined as:

Here FP is the probability that the j-th mutation affects the expression of gene g, that is, the functional potential of the mutation computed by the equation above.
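Neither formula in this step survives in the text, so the following Python sketch shows only one plausible reading. It assumes that FP is a sigmoid of a weighted sum of the functional impact features (for example the SIFT and PROVEAN scores), and that the region-level probability combines the per-mutation FP values as "at least one mutation has an effect". Both forms, and all names, are assumptions rather than the patent's definitions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function, mapping any real x into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def functional_potential(features, weights, bias=0.0):
    """Assumed form: FP = sigmoid(bias + sum_k weights[k] * features[k]).

    `features` would hold the per-mutation impact scores; in the patent
    the weights are learned from data, here they are plain inputs.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(z)

def region_functional_potential(fps):
    """Assumed form: probability that at least one mutation in the region
    affects expression, treating mutations as independent (noisy-OR)."""
    p_none = 1.0
    for fp in fps:
        p_none *= 1.0 - fp
    return 1.0 - p_none
```

Under these assumptions, a region carrying two mutations with FP = 0.5 each has a region-level probability of 0.75, and the sigmoid guarantees that no single mutation contributes a value outside [0, 1].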

6. Building the regional regulation model

For N samples, the candidate regions and gene expression values are modeled as follows:

The correlation coefficient w that best reflects the relationship between candidate regions and gene expression levels is then computed by maximizing p(w|y):

p(w|y) = K p(y|w) p(w)

An L1 regularization term is added. As the RFP value increases, D tends to decrease, so w tends toward a non-zero value when the correlation coefficient is learned.
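The link between maximizing p(w|y) and L1 regularization can be made explicit with a short derivation. It assumes Gaussian noise on the expression values and an independent Laplace prior on each coefficient, neither of which is stated in the text; x_i (the region features of sample i), σ², and the per-region penalty D_r are illustrative symbols.

```latex
\begin{aligned}
p(y \mid w) &= \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid x_i^{\top} w,\ \sigma^2\right), \qquad
p(w) = \prod_{r} \frac{D_r}{2}\,\exp\!\left(-D_r\,|w_r|\right),\\[4pt]
\hat{w} &= \arg\max_{w}\ p(w \mid y) = \arg\max_{w}\ K\,p(y \mid w)\,p(w)\\
        &= \arg\min_{w}\ \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - x_i^{\top} w\right)^2 + \sum_{r} D_r\,|w_r| .
\end{aligned}
```

Under this reading, a smaller D_r for a region with a larger RFP means a weaker penalty on w_r, which is why such regions are more likely to keep a non-zero coefficient.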

7. Identifying tumor-specific driver regions

The global minimum of the above optimization problem is found by solving for local minima. Because the L1 regularization term is not differentiable at zero, an iterative descent algorithm is used. The algorithm updates one dimension at a time, iteratively optimizing the weights β and the correlation coefficients w. From the resulting w, a correlation matrix between specific genes and candidate regions is constructed, which represents the degree of correlation between each gene-region pair. The score of a candidate region is the number of genes associated with it, that is, the count of non-zero weight coefficients. If the score of a candidate region exceeds a given threshold, the region is predicted to be a cancer driver region.
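The one-dimension-at-a-time update and the scoring rule can be sketched in Python as below. The sketch uses standard coordinate descent with soft-thresholding for an L1-penalized least-squares objective, which matches the description but is not confirmed to be the authors' exact algorithm; it also omits the alternating update of β. The per-region `penalties` vector, the function names, and the zero tolerance are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; the closed-form solution of the 1-D lasso."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, penalties, n_iter=200):
    """Minimize 0.5 * ||y - X w||^2 + sum_j penalties[j] * |w[j]|.

    X: samples x regions matrix, y: expression of one gene,
    penalties: one L1 weight per region (assumed to shrink as RFP grows).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_regions = X.shape[1]
    w = np.zeros(n_regions)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_regions):          # update one dimension at a time
            if col_sq[j] == 0.0:
                continue                    # region never mutated: keep w[j] = 0
            residual = y - X @ w + X[:, j] * w[j]   # residual without region j
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, penalties[j]) / col_sq[j]
    return w

def region_scores(W, tol=1e-8):
    """W: genes x regions coefficient matrix; score = non-zero count per region."""
    return (np.abs(np.asarray(W, dtype=float)) > tol).sum(axis=0)

def predict_driver_regions(W, threshold):
    """Indices of regions whose score exceeds the given threshold."""
    return np.flatnonzero(region_scores(W) > threshold)
```

Running `lasso_coordinate_descent` once per gene and stacking the resulting vectors row by row yields the genes x regions matrix that `region_scores` expects.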

Description of the drawings

Figure 1: Construction of the multiple linear regression model for regions and genes

Figure 2: Correlation between region scores and functional scores

Figure 3: Venn diagrams of the overlap between the results of the four methods on the BRCA and GBM datasets

Figure 4: Top 25 GO biological process annotations and top 30 KEGG pathway annotations for BRCA

Figure 5: Top 25 GO biological process annotations and top 30 KEGG pathway annotations for GBM

Figure 6: Box plot of the relative expression of MYH7B in normal and BRCA samples

Detailed description

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to experiments. The specific embodiments described here serve only to explain the invention and do not limit it.

The hardware environment is a single PC with an Intel(R) Core(TM) i7-6700 CPU at 3.40 GHz, 32 GB of RAM, and a 64-bit operating system. The software runs on Windows 7 and is implemented in the R language in the RStudio environment; the RStudio version is 1.1.142 and the R version is 3.5.0.

The data are two relatively complete cancer datasets downloaded from TCGA: breast invasive carcinoma (BRCA) and glioblastoma multiforme (GBM). Genes related to these two cancers have been studied comparatively extensively, so the data available for validating the results are fairly complete. The sample counts are shown in the first column of the table below. All steps were run separately on each of the two cancer datasets; they are described together here only for brevity.

1. Differential comparison of gene expression

2. Data preprocessing

3. Candidate driver regions

The results obtained are shown in Table 1:

Δ(P, v, S) = R(P′, S) − A(P, S)

5. Evaluating the functional potential of regions

6. Building the regional regulation model

p(w|y) = K p(y|w) p(w)

7. Identifying tumor-specific driver regions

Table 1: Number of mutations, patients, and candidate regions in the BRCA and GBM datasets

1. Result analysis and verification

Table 1 shows the number of mutations, patients, and candidate regions in the two final cancer datasets. Compared with the BRCA dataset (753 cases), the GBM dataset has fewer samples, only 150, so the procedure above yields fewer missense somatic mutations and candidate regions: 7,023 and 67, respectively.

The frDriver method was applied to the BRCA and GBM datasets with the SIFT and PROVEAN functional scores as prior knowledge, and the regions that accurately predict gene expression levels were predicted to be cancer drivers. Figure 2 shows the correspondence between the scores of the 532 candidate regions and the prior knowledge on the BRCA dataset. The region score (blue bars) is the number of associated genes, and the functional score of a region summarizes the prior functional information of the mutations located in it; the two are shown together so that their correlation can be observed clearly. Regions with high functional scores have high region scores, whereas regions with low functional scores are associated with relatively few genes. This indicates that the functional prior knowledge contributes important information to the cancer driver prediction. For example, the p53 DNA-binding domain on the TP53 gene has a high functional impact score: missense mutations in this region are likely to affect protein function, and the region carries a large number of missense mutations, 141 in total. Such a candidate region is therefore more likely to affect gene expression levels.

Table 2: List of driver regions for the BRCA dataset

We predicted 20 cancer driver regions among the 532 candidate regions in the BRCA dataset. Table 2 lists the gene, region name, position, and a brief functional description for each of these 20 regions. PF identifiers denote protein functional regions from the Pfam database, and iur identifiers denote intrinsically disordered regions (IDRs). Position refers to the start and end positions of the region. The identified regions range from 22 to 3,912 amino acids in length. Domains are between 122 and 911 amino acids long, much shorter than IDRs overall. Most IDRs are several thousand amino acids long; only the regions on the SEMFG2 and PHKA2 genes are shorter than 1,000. Of the genes harboring the 20 regions, nine (TP53, PIK3CA, CBFB, MAP2K4, MAP3K1, FOXA1, PTEN, ERBB2, and CTCF) are included in the Cancer Gene Census (CGC) list, which contains known cancer driver genes as well as genes strongly implicated in cancer but lacking extensive evidence. Only one of these nine regions is an IDR; the others are domains annotated in the Pfam database. This may be because an IDR is an unstructured region in the three-dimensional structure of a protein, with multiple functions that have not yet been confirmed. Our method can therefore identify known cancer driver genes together with their specific internal regions. For example, the DNA-binding domain of p53, encoded by the TP53 gene, is a cancer driver, and studies of p53 have shown that inhibiting this domain activates cell migration toward the chemokine CXCL13 and increases expression of the chemokine receptor gene CXCR5.

Table 3: List of driver regions for the GBM dataset

We ran the same experiment on the GBM dataset and identified 17 of the 67 candidate regions as cancer drivers. As with the BRCA data, Table 3 lists these 17 regions with their gene names, region identifiers, positions in the amino acid sequence, and brief functional descriptions. The regions range from 76 to 3,912 amino acids in length and comprise 14 domains and 3 IDRs. Comparison with the CGC list shows that the genes harboring these regions include TP53, PTEN, EGFR, IDH1, GRM3, PIK3CA, and SMC1A, which are known cancer driver genes or recognized potential drivers. Three regions, PF00757, PF01030, and PF00069, lie on the same gene, EGFR. GBM is associated with mutation-induced EGFR overexpression, and specific EGFR mutations are frequently observed. Somatic mutations, including those in EGFR, lead to persistent cell activation and thus to uncontrolled cell division. The three methods used for comparison (ActiveDriver, OncodriveCLUST, and eDriver) also identify the EGFR gene. In addition, studies have shown that other genes we identified, such as TCHH, FLG, KLK15, SEMA3C, and GABRA6, are associated with cancer.

At the level of the genes harboring the regions, we compared frDriver with three state-of-the-art methods: eDriver, ActiveDriver, and OncodriveCLUST. The result lists of the other methods on the BRCA and GBM datasets were obtained from the DriverDB database. On the BRCA dataset, eDriver identified 273 driver genes, ActiveDriver 220, and OncodriveCLUST 26, while frDriver identified 20. Ten of these 20 genes were also identified by at least one other method, which strengthens the case for them as driver genes (Figure 3a). On the GBM dataset, eDriver, ActiveDriver, and OncodriveCLUST identified 69, 60, and 8 genes, respectively. Venn diagram analysis shows that 7 of the 15 genes identified by frDriver were also identified by at least one other method (Figure 3b). Moreover, EGFR and TP53 were identified by all four methods and are included in the CGC list. frDriver therefore provides results complementary to those of the other methods, allowing cancer drivers to be identified more accurately.

Table 4: Number of cancer samples and mutations

The CGC list contains 723 genes that are known cancer drivers or that are very likely to be cancer drivers but lack sufficient evidence. We compared the identification accuracy of the four methods. Table 4 shows the number of driver genes identified by each method and the number of those genes that overlap with the CGC list, on both the BRCA and GBM datasets. frDriver achieves accuracies of 45% on BRCA and 46.67% on GBM, the highest of the four methods. On the BRCA dataset, ActiveDriver and eDriver achieve accuracies of 13.64% and 12.82%, respectively. They find 30 and 35 CGC genes, respectively, more than the other two methods, but because they report so many candidate driver genes, ActiveDriver and eDriver have high false-positive rates. The same pattern appears on the GBM dataset. OncodriveCLUST identifies few driver genes and has low accuracy. frDriver therefore outperforms the other state-of-the-art methods, and the driver genes it identifies are more reliable.
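The reported accuracies are the fraction of each method's predicted driver genes that appear in the CGC list. The short Python check below recomputes them from the counts quoted in the text; the function name and the dictionary layout are illustrative only.

```python
def cgc_accuracy_percent(cgc_hits, predicted):
    """Percentage of predicted driver genes that are in the CGC list."""
    return round(100.0 * cgc_hits / predicted, 2)

# (CGC genes found, total genes predicted), taken from the counts in the text.
reported_counts = {
    ("frDriver", "BRCA"): (9, 20),
    ("frDriver", "GBM"): (7, 15),
    ("ActiveDriver", "BRCA"): (30, 220),
    ("eDriver", "BRCA"): (35, 273),
}

accuracies = {
    key: cgc_accuracy_percent(hits, total)
    for key, (hits, total) in reported_counts.items()
}
```

The recomputed values are 45.0, 46.67, 13.64, and 12.82, which agree with the percentages stated above. They also show why a method that finds more CGC genes in absolute terms (35 for eDriver) can still be far less precise than one that finds fewer (9 for frDriver).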

Claims

1. A method for identifying cancer drivers based on functional regions of protein sequences, characterized by the following steps:

(1) Differential comparison of gene expression: gene expression data and somatic mutation data from the TCGA database are used to screen for genes with key functions;

(2) Preprocessing of the sample data, comprising sample matching, missing-value handling, and data standardization, to obtain the samples that coexist in the GBM and BRCA datasets, listed by sample name;

(3) Selection of candidate driver regions: two types of functional regions are selected as candidates, protein domains and intrinsically disordered regions (IDRs);

(4) Prediction of the functional characteristics of single amino acid variations: the functional impact score of a single amino acid variation can be evaluated from evolutionary sequence conservation, protein structural features, and the physicochemical properties of amino acids; SIFT and PROVEAN are selected to predict the functional effects of single amino acid mutations as a measure of functional potential;

(5) Evaluation of the functional potential of regions: a single amino acid variation in a candidate region has a functional effect on the protein through the functional effect of the point mutation in that region, and the weights of the single amino acid variations are calculated to obtain the functional potential of the region;

(6) Construction of the regional regulation model: a multiple linear regression model is established between gene expression levels and candidate regions, expressing a linear relationship between them, and the regions that predict gene expression levels more accurately are identified as cancer drivers;

(7) Identification of tumor-specific driver regions: based on the obtained correlation coefficient w, a correlation matrix between specific genes and candidate regions is constructed to represent the degree of correlation between each gene-region pair; the score of a candidate region is calculated from the number of associated genes, that is, the count of non-zero weight coefficients; if the score of a candidate region exceeds a given threshold, the region is predicted to be a cancer driver region.

2. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that the differential comparison of gene expression comprises:

(1) Select gene expression data and somatic mutation data, and perform consistent processing of expression data and mutation data;

(2) Screen about 3,000 genes that have been confirmed to have key functions, such as signal transduction genes, cell cycle control genes, and regulatory process-related genes.

3. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that the data preprocessing stage comprises:

(1) Screening about 3,000 genes that have been confirmed to have key functions, such as signal transduction genes, cell cycle control genes, and regulatory process-related genes;

(2) To make the data distributions uniform, quantile-normalizing the expression data: the logarithm of all gene expression values is taken, and all samples are then quantile-normalized in R;

(3) By matching the samples in the mutation data and gene expression data, only the samples that coexist in the two datasets are kept.

4. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that the candidate driver region stage comprises:

(1) Two types of functional regions are selected as candidate regions: protein domains and intrinsically disordered regions (IDRs);

(2) A protein domain is a conserved part of a specific protein sequence that can function and survive independently of the rest of the protein chain;

(3) Intrinsically disordered domains are important functional regions that can adapt to cell conditions and exhibit many different structures. Using these regions as candidate regions provides an adequate explanation for the effect of functional mutations.

5. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that the stage of predicting the functional characteristics of single amino acid variations comprises:

(1) SIFT assesses how deleterious an amino acid change is by aligning homologous sequences, taking into account the position of the change in the amino acid sequence and the physicochemical properties of the residues. The probability of occurrence of each amino acid is recorded in a scaled probability matrix;

(2) PROVEAN evaluates the functional impact of variants based on sequence alignment scores. It measures the change in pairwise sequence similarity caused by the amino acid variation. The delta alignment score is defined as follows:

Δ(P, v, S) = R(P′, S) − A(P, S)

Delta scores are averaged within each cluster and then across the set of supporting sequences, and the resulting unbiased average is used as the PROVEAN score, as shown in the following formula:

.

6. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that the stage of evaluating the functional potential of regions comprises:

(1) In order to estimate the probability that a single amino acid variation affects the gene expression level, the functional potential of the mutation is defined as:

Automatically learn feature weight parameters from data. Add a sigmoid function to the result to prevent infinite expansion of FP;

(2) The probability that a mutation in region r affects the gene expression profile is defined as follows:

.

FP represents the probability that the jth mutation affects the expression of gene g, that is, the functional potential of the mutation calculated by the equation.

7. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that the stage of constructing the regional regulation model comprises:

(1) In N samples, according to candidate regions and gene expression values, we model as follows:

(2) Calculate the correlation coefficient w, which can best reflect the relationship between candidate regions and gene expression levels. Calculate w by taking the maximum value of p(w|y).

p(w|y)=Kp(y|w)p(w)

Add L1 regularization parameter. When the RFP value increases, D tends to decrease, so when learning the correlation coefficient w, w tends to a non-zero value.

8. The method for identifying cancer drivers based on functional regions of protein sequences according to claim 1, characterized in that, in the stage of identifying tumor-specific driver regions, the global minimum of the optimization problem is found by solving for local minima.

Because the L1 regularization term is not differentiable at zero, an iterative descent algorithm, a non-gradient optimization method [35], is used. The algorithm updates one dimension at a time, iteratively optimizing the weights β and the correlation coefficients w. The regularization parameters D0, D1, and E are determined by cross-validation.