
CN109785903A - Gene expression data classifier - Google Patents

Gene expression data classifier

Info

Publication number
CN109785903A
CN109785903A (application CN201811654735.0A)
Authority
CN
China
Prior art keywords
hidden
unit
local
layer
units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811654735.0A
Other languages
Chinese (zh)
Inventor
廖清
丁烨
漆舒汉
蒋琳
王轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen filed Critical Harbin Institute of Technology Shenzhen
Priority to CN201811654735.0A priority Critical patent/CN109785903A/en
Publication of CN109785903A publication Critical patent/CN109785903A/en
Pending legal-status Critical Current


Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the field of electrical data processing, and in particular to a gene expression data classifier. Building on conventional multi-task deep learning methods, the invention designs a gene expression data classifier with an input layer, a first hidden layer, a second hidden layer, and an output layer. In particular, shared hidden units are placed in the first and second hidden layers, so that the classifier can process gene expression data that come from different datasets and carry different labels. This effectively alleviates the shortage of tissue samples in gene expression data classification and reduces the adverse effects of the high-dimensional feature space.

Description

A Gene Expression Data Classifier

Technical Field

The invention relates to the field of electrical data processing, and in particular to a gene expression data classifier.

Background Art

Studying the association between gene expression profiles and cancer/disease states is an important task in biology and medicine. For example, comparing diseased tissue with normal tissue deepens the understanding of pathology and helps identify different tissues (cancerous or normal), because gene expression data provide clues to tissue phenotype, function, and physiological processes. However, given the volume and complexity of gene expression data, traditional biological experiments cannot handle such data.

Table 1. Example of gene expression data

Table 1 shows an example of gene expression data. Gene expression data are typically represented as a matrix with n rows and m columns, where the rows correspond to features (genes) and the columns correspond to samples (e.g., tissues, disease stages, treatments). Usually n >> m, so it is unrealistic for a biology expert to manually compute and compare such an n × m gene expression matrix. For this reason, machine learning algorithms are used to analyze gene expression data and classify them automatically.
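As a purely illustrative sketch (the dimensions, random values, and variable names below are the editor's assumptions, not data from the patent), such a matrix can be represented programmatically with genes as rows and samples as columns:

```python
import numpy as np

# Hypothetical sizes: far more genes (rows) than tissue samples (columns), i.e. n >> m.
n_genes, n_samples = 20000, 72
rng = np.random.default_rng(0)

expression = rng.normal(size=(n_genes, n_samples))   # the n x m expression matrix
labels = rng.integers(0, 2, size=n_samples)          # e.g. 0 = normal, 1 = cancerous

print(expression.shape)   # (20000, 72): the high-dimensional feature space discussed below
```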

Over the past few decades, several machine learning algorithms have been applied to microarray gene expression data to classify human tissue as cancerous or normal. The earliest was the decision tree (DT), which characterized proteins of known diseases by sequence features that differ from those of all human proteins. K-nearest-neighbor (K-NN) and naive Bayes (NB) classifiers have been used to classify various types of gene expression data and to identify human diseases, especially cancers. Along this direction, Bharathi applied analysis of variance (ANOVA) to rank individual genes and used a support vector machine (SVM) to test classification ability. Hu et al. proposed a Maximally Diversified Multiple Trees (MDMT) algorithm, which assembles a set of unique decision trees. Halder et al. proposed a fuzzy k-nearest-neighbor based active learning (ALFKNN) method, which first asks experts to label unlabeled samples and then iteratively adds the labeled samples to the training set to improve prediction accuracy. Begum et al. combined AdaBoost with a linear support vector machine (ADASVM) as a component classifier and showed that its performance is better than that of state-of-the-art classifiers.

However, existing classifiers face two challenges that make them difficult to apply directly to cancer diagnosis. The first is the curse of dimensionality of the feature space in gene expression data. For example, a thyroid cancer gene expression dataset contains 367 samples, while the number of features per sample (the number of genes measured) is roughly the fourth power of the number of samples, an enormous figure. The high dimensionality of the feature space therefore increases the risk of overfitting, because the number of genes is far larger than the number of samples. The second challenge is the shortage of tissue samples: existing methods struggle to characterize cancer information well because there is little training data. There are two reasons for this shortage. 1) Data samples for rare cancers (some of which occur in only a handful of people each year) or rare traits are much harder to collect than samples for common cancers or common traits; without enough samples a classifier cannot be trained to a stable characterization of the cancer or trait. 2) Data samples from different gene expression platforms are difficult to integrate, which aggravates the shortage. For example, leukemia is one of the most common cancers in the world, causing 353,500 deaths in 2015, and many research groups study it for the benefit of human health. However, because of different experimental settings, data from different research groups cannot be integrated: they select different gene expression features and even use different cancer labels. Among the cancer datasets we tested, two leukemia datasets come from two gene expression data sources. One divides leukemia samples into two types (NPMI1+ and NPMI1-); the other divides samples into four types (MP, HDMTX, HDMTX+MP, and LDMTX+MP). The first leukemia dataset is classified on the basis of gene loci, the second on the drug response of human leukemia cells. Moreover, the first leukemia dataset has 54,675 gene features and the second has 12,600, because different gene features are chosen to study the same cancer under different background knowledge.

In the prior art, three machine learning classification approaches are commonly used to address the shortage of tissue samples in gene expression data: single-task learning, multi-task learning, and transfer learning.

Figure 1 shows the differences among the learning processes of single-task, multi-task, and transfer learning. In single-task learning (Fig. 1(a)), the four datasets are not connected and each task is trained separately, because single-task learning assumes that the training samples are drawn independently from a specific distribution. Single-task learning therefore completely ignores the relationships between related tasks. Multi-task learning (Fig. 1(b)) assumes that the tasks may be related, meaning that information learned from one task can be used for another; it therefore learns a joint model over all tasks simultaneously. Transfer learning (Fig. 1(c)) transfers knowledge (parameters or representations) from a source dataset to a target dataset to improve performance. It is well established in the prior art that multi-task learning and transfer learning can significantly improve generalization when the number of samples is insufficient to train a single task.

As noted above, transfer learning extracts knowledge from a source dataset and transfers it to a target task, even though the training and test data may belong to different domains, tasks, and distributions. Transfer learning pursues good performance only on the target task, because it cares more about the target dataset than about the source dataset. In recent years, transfer learning has achieved success in bioinformatics, especially biological image analysis. For example, Ravi K. Samala et al. developed a computer-aided detection (CAD) system that uses a deep convolutional neural network (DCNN) to transfer knowledge from mammograms to digital breast tomosynthesis (DBT). Hariharan et al. studied the transfer process of convolutional neural networks (CNN), using the ImageNet dataset as the training set and medical images as the test set.

Multi-task learning, by contrast, is close to transfer learning in that it tries to learn multiple datasets (tasks) simultaneously even when all the datasets are different. However, multi-task learning requires good experimental performance on all datasets, not only on a target task, because every dataset matters to the joint model. Multi-task learning is therefore better suited to classifying gene expression data and has already produced results in cancer data analysis. With the great success of deep learning in image processing and pattern recognition, more and more researchers in recent years have combined multi-task learning and deep learning in computer vision and bioinformatics. In computer vision, Zhang et al. proposed the tasks-constrained deep convolutional network (TCDCN) model, which jointly optimizes facial landmark detection with several related tasks such as head pose estimation and facial attribute inference. Similarly, Rajeev et al. proposed the CNN-based HyperFace architecture, which simultaneously performs face detection, facial landmark localization, head pose estimation, and gender recognition on a given image. Abrar et al. proposed a multi-task CNN model that uses a deep CNN to better predict attributes in images, such as whether a person is wearing a tie or a blue dress. In bioinformatics, Zhang et al. proposed a deep model based on transfer learning and multi-task learning for analyzing domain-specific biological images. Ravi et al. proposed a multi-task transfer-learning DCNN that transfers knowledge from non-medical images to a medical diagnosis learning task while learning auxiliary tasks. Although these multi-task deep learning models have been successful in computer vision and biomedical image analysis over the past five years, the existing methods learn representations and obtain learning results independently of each other: they learn parameters by transfer learning and learn representations with a convolutional neural network (CNN) in a first stage, and then use multi-task learning to produce each task's result by learning a joint model from multiple datasets simultaneously.

Summary of the Invention

To address the scarcity of tissue samples and reduce the adverse effects of the high-dimensional feature space, the present invention proposes a gene expression data classifier based on a multi-task deep learning (MTDL) algorithm.

The gene expression data classifier proposed by the present invention is characterized in that it comprises an input layer, a first hidden layer, a second hidden layer, and an output layer.

The input layer contains a plurality of input units; each input unit receives one dataset and is connected to a corresponding local hidden unit in the first hidden layer.

The first hidden layer includes local hidden units equal in number to the input units of the input layer, plus a first shared hidden unit. The local hidden units are connected to the corresponding local hidden units in the second hidden layer, and the first shared hidden unit is connected to all input units.

The second hidden layer includes local hidden units equal in number to the input units of the input layer, plus a second shared hidden unit. The local hidden units are connected to the corresponding output units in the output layer, and the second shared hidden unit is connected to all local hidden units and the shared hidden unit in the first hidden layer.

The output layer contains output units equal in number to the input units of the input layer. Each output unit is also connected to the second shared hidden unit, and each output unit outputs a classification result.

The input units, local hidden units, and shared hidden units all output activation values.
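For illustration only, the connectivity just described can be listed explicitly. The unit names and the choice of three tasks below are the editor's hypothetical example, not part of the invention:

```python
# Sketch of the wiring between layers for a hypothetical n = 3 tasks (datasets).
n_tasks = 3
edges = []
for i in range(1, n_tasks + 1):
    edges += [
        (f"input_{i}",    f"local_h1_{i}"),  # each input unit feeds its own local unit (layer 1)
        (f"input_{i}",    "shared_s1"),      # the first shared unit is connected to all inputs
        (f"local_h1_{i}", f"local_h2_{i}"),  # local units are connected layer 1 -> layer 2
        (f"local_h1_{i}", "shared_s2"),      # the second shared unit sees all layer-1 local units
        (f"local_h2_{i}", f"output_{i}"),    # each layer-2 local unit feeds its task's output unit
        ("shared_s2",     f"output_{i}"),    # every output unit also reads the second shared unit
    ]
edges.append(("shared_s1", "shared_s2"))     # the shared units of the two hidden layers are linked

for src, dst in edges:
    print(f"{src} -> {dst}")
```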

Preferably, the first hidden layer and the second hidden layer of the classifier may use the rectified linear unit (ReLU) function as the activation function of their local hidden units, while the output layer uses the sigmoid function as the activation function of its output units.

The gene expression data classifier proposed by the present invention combines the advantages of multi-task learning and deep learning. In particular: 1) the classifier alleviates the shortage of tissue samples and mitigates the negative effects of the high-dimensional feature space; 2) even when tumor types, features, and even labels differ, the classifier can integrate cancer databases from different sources into one database, enhancing classification performance; 3) the classifier can continuously exploit multiple cancer databases to uncover hidden gene expression patterns in small cancer databases, further enhancing classification performance.

Brief Description of the Drawings

Figure 1: (a) single-task learning, (b) multi-task learning, (c) transfer learning;

Figure 2: structure of the gene expression data classifier proposed by the present invention;

Figure 3: comparison of the classification accuracy of the proposed classifier with a DNN classifier and a classifier using a sparse autoencoder.

Detailed Description of the Embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.

The structure of the gene expression data classifier proposed by the present invention is shown in Figure 2.

A gene expression data classifier, characterized in that the classifier comprises an input layer, a first hidden layer, a second hidden layer, and an output layer.

The input layer contains a plurality of input units; each input unit receives one dataset and is connected to a corresponding local hidden unit in the first hidden layer.

The first hidden layer includes local hidden units equal in number to the input units of the input layer, plus a first shared hidden unit. The local hidden units are connected to the corresponding local hidden units in the second hidden layer, and the first shared hidden unit is connected to all input units.

In Figure 2, the shared hidden units are drawn as diamonds, while the triangles and squares represent the independent local hidden units of the first hidden layer and the second hidden layer, respectively.

The second hidden layer includes local hidden units equal in number to the input units of the input layer, plus a second shared hidden unit. The local hidden units are connected to the corresponding output units in the output layer, and the second shared hidden unit is connected to all local hidden units and the shared hidden unit in the first hidden layer.

The output layer contains output units equal in number to the input units of the input layer. Each output unit is also connected to the second shared hidden unit, and each output unit outputs a classification result.

The input units, local hidden units, and shared hidden units all output activation values.

Preferably, the first hidden layer and the second hidden layer of the classifier may use the rectified linear unit (ReLU) function as the activation function of their local hidden units, while the output layer uses the sigmoid function as the activation function of its output units. In machine learning, the non-saturating ReLU nonlinearity is much faster than saturating nonlinearities such as tanh and sigmoid when computing gradients. A convolutional neural network (CNN) with ReLU units therefore trains several times faster than one with tanh units, so ReLU is chosen as the activation function for fast training. The sigmoid function is commonly used at the output layer to obtain labels, because its output lies between 0 and 1. The output of the sigmoid function is therefore chosen to represent the classification results of the multiple tasks.
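A minimal sketch of the point about training speed, supplied by the editor under the assumption of plain NumPy (the function names are not from the patent): the ReLU gradient is a simple comparison, whereas saturating activations need exponentials and their gradients vanish for large inputs.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)        # piecewise constant: no exponentials involved

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)                      # needs exp(); saturates (gradient -> 0) for large |x|
    return s * (1.0 - s)

x = np.array([-6.0, -3.0, 0.0, 3.0, 6.0])
print(relu_grad(x))      # [0. 0. 0. 1. 1.]
print(sigmoid_grad(x))   # values near 0 at both ends, 0.25 at x = 0
```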

Let x_1, x_2, ..., x_n denote the inputs of the n tasks. For the first hidden layer, the activation value h_i^(1) received by the i-th local hidden unit can be computed as

h_i^(1) = σ(w_i^(1)·x_i + b_i^(1)),

where the superscript of h_i^(1) denotes the index of the layer and the subscript denotes the index of the task source; the activation function is the ReLU, σ(x) = max(0, x); w_i^(1) is the local weight on the edge between task source i and its local hidden unit; and b_i^(1) is the bias of the i-th local hidden unit in the first hidden layer. The activation value s_1 of the first shared hidden unit can then be computed as

s_1 = σ(Σ_{i=1..n} u_i^(1)·x_i + b_s1),

where u_i^(1) is the shared weight on the edge between x_i and s_1, b_s1 is the bias of the shared hidden unit s_1, and the activation function used is the ReLU. For the second hidden layer, the activation value h_i^(2) of each local hidden unit can be computed as

h_i^(2) = σ(w_i^(2)·h_i^(1) + b_i^(2)),

where w_i^(2) is the local weight on the edge between h_i^(1) and h_i^(2), and b_i^(2) is the bias of the i-th local hidden unit in the second hidden layer. The activation value s_2 of the second shared hidden unit can be computed as

s_2 = σ(Σ_{i=1..n} u_i^(2)·h_i^(1) + t·s_1 + b_s2),

where u_i^(2) is the shared weight on the edge between the i-th local hidden unit of the first hidden layer and s_2, t is the weight on the edge between s_1 and s_2, and b_s2 is the bias of s_2. For the output layer, the output y_i of each task can be computed as

y_i = sigmoid(w_i^(3)·h_i^(2) + v_i·s_2 + b_i^(3)),

where w_i^(3) is the local weight on the edge between h_i^(2) and y_i, v_i is the weight on the edge between s_2 and y_i, and b_i^(3) is the bias of the output unit of the i-th task. The activation function of the output units is defined as

sigmoid(x) = 1/(1 + e^(-x)).
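The forward pass defined by the formulas above can be sketched in a few lines. This is the editor's illustration rather than the patent's implementation: the number of tasks, feature dimensions, and random parameters are hypothetical, and every local or shared "hidden unit" is modelled as a single neuron exactly as in the formulas (in practice each could just as well be a block of neurons with matrix-valued weights).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical setup: n = 3 tasks (datasets), each with its own feature dimension.
dims = [50, 80, 30]
n = len(dims)
xs = [rng.normal(size=d) for d in dims]        # x_i: one sample per task

# First hidden layer: local weights w_i^(1), biases b_i^(1); shared weights u_i^(1), bias b_s1.
w1 = [rng.normal(size=d) for d in dims]
b1 = rng.normal(size=n)
u1 = [rng.normal(size=d) for d in dims]
b_s1 = rng.normal()

h1 = [relu(w1[i] @ xs[i] + b1[i]) for i in range(n)]                # h_i^(1)
s1 = relu(sum(u1[i] @ xs[i] for i in range(n)) + b_s1)              # s_1

# Second hidden layer: local weights w_i^(2); shared weights u_i^(2) and the s1 -> s2 weight t.
w2, b2 = rng.normal(size=n), rng.normal(size=n)
u2, t, b_s2 = rng.normal(size=n), rng.normal(), rng.normal()

h2 = [relu(w2[i] * h1[i] + b2[i]) for i in range(n)]                # h_i^(2)
s2 = relu(sum(u2[i] * h1[i] for i in range(n)) + t * s1 + b_s2)     # s_2

# Output layer: each task reads its own local unit h_i^(2) plus the shared unit s_2.
w3, v3, b3 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
ys = [sigmoid(w3[i] * h2[i] + v3[i] * s2 + b3[i]) for i in range(n)]

print([round(float(y), 3) for y in ys])   # one classification score per task
```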

The advantage of providing both local hidden units and shared hidden units is that every task can learn a private representation for classification from its local hidden units. Because the local hidden units preserve the characteristics of each individual task, each task keeps a private representation that is learned alongside those of the other tasks. The shared hidden units, on the other hand, learn a common representation from the entire collection of datasets, making full use of the information obtained from the microarray systems. Precisely because the shared hidden units can use information from all tasks, they improve the performance of every task.

In summary, the classifier not only preserves the local characteristics of each task but also uses shared knowledge to provide stable characteristics for all tasks.

To demonstrate the feasibility and effectiveness of the classifier, 12 different datasets were used, as shown in Table 2. Table 2 contains two acute myeloid leukemia datasets from different sources (task 1 and task 6). Traditional machine learning algorithms cannot integrate task 1 and task 6 to enlarge the acute myeloid leukemia dataset and thereby address the technical problem of insufficient gene expression data, for two reasons. A. Task 1 and task 6 contain different labels from different domains: in task 1 the labels "AML" and "MDS" represent different stages of acute myeloid leukemia, whereas in task 6 the labels "complete remission" and "relapse" represent symptoms after different initial treatments. B. Task 1 and task 6 contain different types of gene features from different data sources: 54,613 gene features were collected for task 1 and 12,625 for task 6. Because different research groups choose different labels and gene features to investigate samples even for the same cancer, the problem of insufficient tissue samples for common cancer classification remains unresolved.

Table 2. Summary of the gene expression datasets

Further, the classifier using the MTDL algorithm (i.e., the gene expression data classifier proposed by the present invention) is compared with classifiers using two conventional deep learning methods, a deep neural network (DNN) and a sparse autoencoder. The data are divided into 10 folds for cross-validation: nine folds are used for training and one for testing. To eliminate the effect of the random split, the experiment is repeated ten times and the average accuracy of the outputs is taken as the final result. The biggest difference between the proposed classifier and the other two is that it uses knowledge shared across multiple datasets to learn more effective representations, whereas the other two learn only local representations from each dataset. In the experiments, the proposed classifier learns from all 12 datasets simultaneously, while the other two classifiers can only learn from each dataset independently.
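The evaluation protocol described above (repeated 10-fold cross-validation with the accuracies averaged) can be sketched as follows. This is an editor-supplied outline assuming scikit-learn style estimators; `make_classifier` is a placeholder for whichever model is being evaluated, not an API from the patent:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_accuracy(make_classifier, X, y, n_folds=10, n_repeats=10, seed=0):
    """Mean and standard deviation of test accuracy over repeated k-fold cross-validation."""
    accs = []
    for r in range(n_repeats):                       # repeat to wash out the random split
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):  # 9 folds train, 1 fold test
            clf = make_classifier()                  # a fresh model for every fold
            clf.fit(X[train_idx], y[train_idx])
            accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```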

Table 3. Accuracy of classifying the 12 cancer datasets with the DNN classifier

Table 3 shows the classification accuracy of the conventional DNN classifier on the gene expression data of the 12 tumor types. Because the traditional neural network model ignores the similarity information between datasets, the DNN classifier receives the 12 cancer datasets and outputs the classification results one by one. Ignoring the similarity information between related cancer datasets means that, when cancer samples are insufficient, the classification performance is rather poor.

Table 4. Accuracy of classifying the 12 cancer datasets with the sparse autoencoder classifier

The DNN classifier achieves very high accuracy on tasks 2, 3, 9, 10, and 11, because these cancer datasets contain only two labels and the samples are easy to assign to each label. The leukemia datasets (tasks 4, 5, and 12), however, show poor classification performance compared with the other cancer datasets, for two reasons: 1) the pattern of each leukemia label is unclear, making leukemia difficult to diagnose accurately; 2) task 5 has more labels than the other tasks (four labels), so its accuracy is far lower than that of the other tasks.

In Table 4, the accuracy of the classifier using the sparse autoencoder on the 12 cancer datasets is similar, because the sparse autoencoder likewise cannot use shared knowledge to improve classification performance.

Table 5. Accuracy of the proposed classifier in classifying the 12 cancer datasets

Table 5 gives the classification accuracy of the proposed classifier on the 12 cancers. Unlike the comparison classifiers, the proposed classifier can process multiple cancer datasets (tasks) simultaneously, so it can fully exploit the similar hidden information shared among all the datasets (cancers) and improve the performance of every learning task. As Table 5 shows, the accuracy on the leukemia datasets (tasks 4, 5, and 12) improves greatly (by more than 20%) compared with the comparison classifiers, because the proposed classifier can use these datasets jointly to reduce the adverse effect of the unclear patterns in the leukemia data. In addition, the proposed classifier achieves improvements of more than 20% on both task 7 and task 8, because the other cancer datasets provide more representative information through the shared layer, helping task 7 and task 8 learn better representations and improving the accuracy of the classification results. Finally, the proposed classifier performs satisfactorily on the remaining cancer datasets; some of these datasets have clear classification patterns of their own, so all three classifiers classify them well. Nevertheless, the proposed classifier still achieves the highest performance on most datasets, because its shared hidden units can exploit all the cancer datasets simultaneously and, through what the shared hidden units learn, improve the classification performance of most tasks.

Figure 3 shows the accuracy of all the compared classifiers on the 12 datasets; the black line on each bar shows the standard deviation of the accuracy for each task. As Figure 3 shows, the standard deviation of the proposed classifier is far smaller than that of the comparison classifiers, so it not only achieves the best performance but also the most stable performance.

The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and substitutions without departing from the technical principles of the present invention, and such improvements and substitutions shall also fall within the protection scope of the present invention.

Claims (6)

1. A gene expression data classifier, characterized in that the classifier comprises an input layer, a first hidden layer, a second hidden layer, and an output layer;
the input layer contains a plurality of input units, each input unit receiving one dataset and being connected to a corresponding local hidden unit in the first hidden layer;
the first hidden layer includes local hidden units equal in number to the input units of the input layer and a first shared hidden unit, the local hidden units being connected to the corresponding local hidden units in the second hidden layer, and the first shared hidden unit being connected to all input units;
the second hidden layer includes local hidden units equal in number to the input units of the input layer and a second shared hidden unit, the local hidden units being connected to the corresponding output units in the output layer, and the second shared hidden unit being connected to all local hidden units and the shared hidden unit in the first hidden layer;
the output layer contains output units equal in number to the input units of the input layer, each output unit being connected to the second shared hidden unit, and the output units outputting the classification results;
the input units, local hidden units, and shared hidden units all output activation values.

2. The classifier of claim 1, characterized in that the input units receive the source datasets, and the activation value h_i^(1) received by a local hidden unit in the first hidden layer is computed with the ReLU function as
h_i^(1) = σ(w_i^(1)·x_i + b_i^(1)),
where the superscript of h_i^(1) is the index of the hidden layer, the subscript is the index of the task source, n is the number of input units in the input layer, w_i^(1) is the local weight on the edge between task source i and its local hidden unit, x_i is the input of the i-th task, and b_i^(1) is the bias of the i-th local hidden unit in the first hidden layer.

3. The classifier of claim 2, characterized in that the activation value s_1 of the first shared hidden unit is computed as
s_1 = σ(Σ_{i=1..n} u_i^(1)·x_i + b_s1),
where u_i^(1) is the shared weight on the edge between x_i and s_1, b_s1 is the bias of the shared hidden unit s_1, and the activation function used is the ReLU.

4. The classifier of claim 3, characterized in that the local hidden units in the first hidden layer pass on their activation values through the ReLU function, and the activation value h_i^(2) received by each local hidden unit in the second hidden layer is computed as
h_i^(2) = σ(w_i^(2)·h_i^(1) + b_i^(2)),
where w_i^(2) is the local weight on the edge between h_i^(1) and h_i^(2), and b_i^(2) is the bias of the i-th local hidden unit in the second hidden layer.

5. The classifier of claim 4, characterized in that the activation value s_2 of the second shared hidden unit is computed as
s_2 = σ(Σ_{i=1..n} u_i^(2)·h_i^(1) + t·s_1 + b_s2),
where u_i^(2) is the shared weight on the edge between the i-th local hidden unit of the first hidden layer and s_2, t is the weight on the edge between s_1 and s_2, and b_s2 is the bias of s_2.

6. The classifier of claim 4 or 5, characterized in that each output unit uses the sigmoid function to produce its activation value, and the output y_i of each task is computed as
y_i = sigmoid(w_i^(3)·h_i^(2) + v_i·s_2 + b_i^(3)),
where w_i^(3) is the local weight on the edge between h_i^(2) and y_i, v_i is the weight on the edge between s_2 and y_i, and b_i^(3) is the bias of the output unit of the i-th task.
CN201811654735.0A 2018-12-29 2018-12-29 A kind of Classification of Gene Expression Data device Pending CN109785903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811654735.0A CN109785903A (en) 2018-12-29 2018-12-29 A kind of Classification of Gene Expression Data device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811654735.0A CN109785903A (en) 2018-12-29 2018-12-29 A kind of Classification of Gene Expression Data device

Publications (1)

Publication Number Publication Date
CN109785903A true CN109785903A (en) 2019-05-21

Family

ID=66499119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811654735.0A Pending CN109785903A (en) 2018-12-29 2018-12-29 A kind of Classification of Gene Expression Data device

Country Status (1)

Country Link
CN (1) CN109785903A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446499A (en) * 2019-08-30 2021-03-05 西门子医疗有限公司 Improving performance of machine learning models for automated quantification of coronary artery disease
WO2021062904A1 (en) * 2019-09-30 2021-04-08 中国科学院计算技术研究所 Tmb classification method and system based on pathological image, and tmb analysis device based on pathological image
WO2022221326A1 (en) * 2021-04-13 2022-10-20 Dermtech, Inc. Gene classifiers and uses thereof
TWI803765B (en) * 2019-07-24 2023-06-01 康善生技股份有限公司 Detecting, evaluating and predicting system for cancer risk
US11753687B2 (en) 2008-05-14 2023-09-12 Dermtech, Inc. Diagnosis of melanoma and solar lentigo by nucleic acid analysis
US11976332B2 (en) 2018-02-14 2024-05-07 Dermtech, Inc. Gene classifiers and uses thereof in non-melanoma skin cancers

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050228591A1 (en) * 1998-05-01 2005-10-13 Hur Asa B Kernels and kernel methods for spectral data
CN108197432A (en) * 2017-11-29 2018-06-22 东北电力大学 A kind of gene regulatory network reconstructing method based on gene expression data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050228591A1 (en) * 1998-05-01 2005-10-13 Hur Asa B Kernels and kernel methods for spectral data
CN108197432A (en) * 2017-11-29 2018-06-22 东北电力大学 A kind of gene regulatory network reconstructing method based on gene expression data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QING LIAO et al.: "Cancer Classification with Multi-task Deep Learning", 2017 International Conference on Security, Pattern Analysis, and Cybernetics *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11753687B2 (en) 2008-05-14 2023-09-12 Dermtech, Inc. Diagnosis of melanoma and solar lentigo by nucleic acid analysis
US11976332B2 (en) 2018-02-14 2024-05-07 Dermtech, Inc. Gene classifiers and uses thereof in non-melanoma skin cancers
TWI803765B (en) * 2019-07-24 2023-06-01 康善生技股份有限公司 Detecting, evaluating and predicting system for cancer risk
CN112446499A (en) * 2019-08-30 2021-03-05 西门子医疗有限公司 Improving performance of machine learning models for automated quantification of coronary artery disease
WO2021062904A1 (en) * 2019-09-30 2021-04-08 中国科学院计算技术研究所 Tmb classification method and system based on pathological image, and tmb analysis device based on pathological image
US20220207726A1 (en) * 2019-09-30 2022-06-30 Institute Of Computing Technology, Chinese Academy Of Sciences Tmb classification method and system and tmb analysis device based on pathological image
US11468565B2 (en) * 2019-09-30 2022-10-11 Institute Of Computing Technology, Chinese Academy Of Sciences TMB classification method and system and TMB analysis device based on pathological image
WO2022221326A1 (en) * 2021-04-13 2022-10-20 Dermtech, Inc. Gene classifiers and uses thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190521