CN111312336A - Method and system for establishing biological edge identification system

Info

Publication number: CN111312336A
Application number: CN202010108036.7A
Authority: CN
Inventors: 陈洛南; 张万纬
Original assignee: Shanghai Institutes for Biological Sciences SIBS of CAS
Current assignee: Center for Excellence in Molecular Cell Science of CAS
Priority date: 2014-11-13
Filing date: 2014-11-13
Publication date: 2020-06-19
Also published as: CN105590037A

Abstract

The invention discloses a method and a system for establishing a biological edge identification system, which can simply and efficiently identify changes in key interactions as biomarkers of the occurrence and development of disease. The technical scheme is as follows: collect data with two states; select gene pairs whose correlations meet the significant-difference condition; for those gene pairs, convert the expression value data into edge data representing correlations through matrix transformation; apply a feature selection algorithm to find the gene pairs with the best classification ability in the edge data, and use those gene pairs as biological edge markers, thereby establishing the biological edge identification system.

Description

Method and system for establishing biological edge identification system

The present application is a divisional application of invention patent application No. 201410640410.2, entitled "Method and System for Establishing Biological Edge Identification System", filed on November 13, 2014.

Technical Field

The present invention relates to computational systems biology and bioinformatics, and in particular to a method and system for processing biomarkers.

Background

Research on biomarkers has long been an important topic in biomedicine. A successful biomarker can help doctors make an accurate diagnosis or propose an effective treatment plan, so finding suitable biomarkers is of great significance for overcoming diseases, especially complex diseases.

Complex human disease is a general term for diseases with unclear etiology, many contributing factors, and no effective treatment, such as various cancers and diabetes. Since the 1980s, the rapid development of high-throughput biotechnologies (such as DNA chips and high-throughput sequencing) has brought opportunities for the study of complex human diseases.

How to find useful biomarkers in the massive data generated by these technologies is a major challenge for biomarker research today. Early research focused on differentially expressed biomolecules such as genes or proteins and used molecules with discriminating ability as biomarkers. These methods are simple and intuitive and work well for some simple diseases, but they do not consider the complex interactions among molecules. Because many complex diseases arise from changes in these interactions, such methods perform poorly when applied to complex diseases.

For this reason, many researchers have begun to look for biomarkers from a system or network perspective, that is, to consider the network formed by the various interactions among biomolecules and to use sub-networks or edge sets with discriminating ability as biomarkers. At present there are few satisfactory methods for achieving this.

Summary of the Invention

A brief summary of one or more aspects is presented below to provide a basic understanding of those aspects. This summary is not an exhaustive overview of all contemplated aspects, and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description presented later.

The purpose of the present invention is to provide a method and system for establishing a biological edge identification system that can simply and efficiently identify changes in key interactions as biomarkers of the occurrence and development of disease.

The technical scheme of the present invention is as follows. The present invention discloses a method for establishing a biological edge identification system, comprising:

collecting data with two states;

selecting gene pairs whose correlations meet the significant-difference condition;

for the gene pairs whose correlations meet the significant-difference condition, converting the expression value data of the gene pairs into edge data representing correlations through matrix transformation;

applying a feature selection algorithm to find the gene pairs with the best classification ability in the edge data, and using those gene pairs as biological edge markers, thereby establishing the biological edge identification system.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, the data with two states include: normal-state data and disease-state data; metastatic-state data and non-metastatic-state data; and drug-resistant-state data and non-drug-resistant-state data.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, the data type of the data with two states includes expression profile or abundance profile data of gene pairs.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, after the step of collecting data with two states, the method further comprises:

preprocessing the data to remove genes whose mean expression is below a set value or whose coefficient of variation is above a set value.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, in the step of selecting gene pairs whose correlations meet the significant-difference condition, the correlation coefficient of each gene pair is calculated in each of the two states, and whether the correlation meets the significant-difference condition is determined by comparing the absolute value of the difference between the two correlation coefficients with a threshold.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, in the step of converting the expression value data of the gene pairs into edge data representing correlations through matrix transformation, the expression value data of the gene pairs are in matrix form with entries x_ij and y_ij, where x_ij is the expression-profile or abundance-profile value of biomolecule i in the j-th sample of the first of the two states, and y_ij is the expression-profile or abundance-profile value of biomolecule i in the j-th sample of the second state.

The matrix transformation is applied to each given gene pair u and v and yields <u,v>_N and <u,v>_D, the edge features of the gene pair u, v in the first state and in the second state, respectively. The transformation uses the means of the expression-profile or abundance-profile values of genes u and v in the first state and in the second state; Sx_u, Sx_v, Sy_u and Sy_v, the variances of genes u and v in the first state and in the second state; and the correction coefficients k₁ and k₂. The matrix composed of the <u,v>_N and <u,v>_D values obtained for all gene pairs whose correlations meet the significant-difference condition is the edge data of the gene pairs. The edge data represent the correlation of each gene pair in the different states, and each gene pair is characterized by two dual variables (features) in the edge data.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, the correction coefficients k₁ and k₂ are both 1.

According to an embodiment of the method for establishing a biological edge identification system of the present invention, the feature selection algorithm includes Sequential Forward Floating Selection (SFFS) and the Support Vector Machine (SVM) from machine learning.

The present invention also discloses a biological edge identification system, comprising:

an information collection module, which collects data with two states;

a gene pair selection module, which selects gene pairs whose correlations meet the significant-difference condition;

an edge data acquisition module, which, for the gene pairs whose correlations meet the significant-difference condition, converts the expression value data of the gene pairs into edge data representing correlations through matrix transformation;

a biological edge marker establishment module, which applies a feature selection algorithm to find the gene pairs with the best classification ability in the edge data and uses those gene pairs as biological edge markers, thereby establishing the biological edge identification system.

According to an embodiment of the biological edge identification system of the present invention, the data with two states include: normal-state data and disease-state data; metastatic-state data and non-metastatic-state data; and drug-resistant-state data and non-drug-resistant-state data.

According to an embodiment of the biological edge identification system of the present invention, the data type of the data with two states includes expression profile or abundance profile data of gene pairs.

According to an embodiment of the biological edge identification system of the present invention, the following module is further connected after the information collection module:

a preprocessing module, which preprocesses the data to remove genes whose mean expression is below a set value or whose coefficient of variation is above a set value.

According to an embodiment of the biological edge identification system of the present invention, the gene pair selection module calculates the correlation coefficient of each gene pair in each of the two states, and determines whether the correlation meets the significant-difference condition by comparing the absolute value of the difference between the two correlation coefficients with a threshold.

According to an embodiment of the biological edge identification system of the present invention, in the edge data acquisition module, the expression value data of the gene pairs are in matrix form and are converted by the matrix transformation, both as defined for the method above.

According to an embodiment of the biological edge identification system of the present invention, the correction coefficients k₁ and k₂ are both 1.

According to an embodiment of the biological edge identification system of the present invention, the feature selection algorithm in the biological edge marker establishment module includes Sequential Forward Floating Selection (SFFS) and the Support Vector Machine (SVM) from machine learning.

Compared with the prior art, the present invention has the following beneficial effects. Traditional biomarkers use the expression levels of one or more genes (also called molecules) to distinguish different states, whereas biological edge markers use the correlations between gene pairs to distinguish different states. Within an organism, molecules form an intricate interaction network, and changes in these interactions are often the key factors driving the occurrence and development of complex diseases. Biological edge markers therefore carry stronger biological meaning than traditional biomarkers: identifying these key interactions as biomarkers of disease occurrence and development can better reveal the underlying mechanisms.

Description of Drawings

FIG. 1 shows a flowchart of the first embodiment of the method for establishing a biological edge identification system of the present invention.

FIG. 2 shows a flowchart of the second embodiment of the method for establishing a biological edge identification system of the present invention.

FIG. 3 shows a schematic diagram of the first embodiment of the biological edge identification system of the present invention.

FIG. 4 shows a schematic diagram of the second embodiment of the biological edge identification system of the present invention.

FIG. 5 shows a schematic flowchart of the establishment and application of the biological edge identification system.

Detailed Description

The above features and advantages of the present invention can be better understood after reading the detailed description of the embodiments of the present disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components with similar related characteristics or features may have the same or similar reference numerals.

FIG. 1 shows the flow of the first embodiment of the method for establishing a biological edge identification system of the present invention. Referring to FIG. 1, the steps of the method in this embodiment are as follows.

Step S11: Collect data with two states.

The data with two states include: normal-state data and disease-state data; metastatic-state data and non-metastatic-state data; and drug-resistant-state data and non-drug-resistant-state data. The data type of the data with two states includes expression profile or abundance profile data of gene pairs, for example microarray, protein profile, metabolite mass spectrometry, and epigenetic profile data.

Step S12: Select gene pairs whose correlations meet the significant-difference condition.

In this step, the correlation coefficient of each gene pair is calculated in each of the two states, and whether the correlation meets the significant-difference condition is determined by comparing the absolute value of the difference between the two correlation coefficients with a threshold.

For example, if the correlation coefficient between gene 1 and gene 2 is 0.8 in the normal state and -0.6 in the disease state, the absolute value of the difference between the two correlations is |0.8 - (-0.6)| = 1.4. With an assumed threshold of 0.8, the difference exceeds the threshold, so gene 1 and gene 2 are identified as a gene pair whose correlation differs significantly.
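The selection rule in step S12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which correlation coefficient is used, so the Pearson coefficient is assumed here, the comparison is taken as strictly greater than the threshold, and the function names and toy values are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences.

    ASSUMPTION: the source only says "correlation coefficient"; Pearson is
    used here for illustration. Inputs must not be constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def correlation_difference(u_first, v_first, u_second, v_second):
    """Absolute difference between the correlations of (u, v) in the two states."""
    return abs(pearson(u_first, v_first) - pearson(u_second, v_second))

def is_selected(diff, threshold=0.8):
    """ASSUMPTION: a pair is kept when the difference strictly exceeds the threshold."""
    return diff > threshold

# The arithmetic of the worked example: |0.8 - (-0.6)| compared with 0.8.
example_diff = abs(0.8 - (-0.6))
example_selected = is_selected(example_diff)

# Hypothetical toy data: positively correlated in one state, negatively in the other.
u_normal, v_normal = [1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.0]
u_disease, v_disease = [1.0, 2.0, 3.0, 4.0], [8.0, 6.1, 3.9, 2.0]
toy_diff = correlation_difference(u_normal, v_normal, u_disease, v_disease)
toy_selected = is_selected(toy_diff)
```

In practice the same test would be run over every candidate gene pair, which grows quadratically with the number of genes; this is one reason a filtering step before pair selection can be useful.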

Step S13: For the gene pairs whose correlations meet the significant-difference condition, convert the expression value data of the gene pairs into edge data representing correlations through matrix transformation.

This step is the key to implementing the present invention. The expression value data of the gene pairs are in matrix form with entries x_ij and y_ij, where x_ij is the expression-profile or abundance-profile value of biomolecule i in the j-th sample of the first of the two states, and y_ij is the expression-profile or abundance-profile value of biomolecule i in the j-th sample of the second state.

The matrix transformation yields, for each gene pair u and v, the edge features <u,v>_N and <u,v>_D of the pair in the first state and in the second state, respectively. The transformation uses the means of the expression-profile or abundance-profile values of genes u and v in the first state and in the second state; Sx_u, Sx_v, Sy_u and Sy_v, the variances of genes u and v in the first state and in the second state; and the correction coefficients k₁ and k₂, both of which take the value 1. The matrix composed of the <u,v>_N and <u,v>_D values obtained for all gene pairs whose correlations meet the significant-difference condition is the edge data of the gene pairs. The edge data represent the correlation of each gene pair in the different states, and each gene pair is characterized by two dual variables (features) in the edge data.

For each gene pair, the matrix transformation produces a small 2×(m+n) matrix, and the large matrix obtained by stacking these small matrices row-wise is the edge data. In the edge data, each row corresponds to a gene pair (two rows per pair, one for each of the dual features) and each column corresponds to a sample.
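The transformation formula itself is not reproduced in this text, so the following Python sketch shows only one plausible form that is consistent with the quantities named above. It is an assumption, not the patent's stated formula: each edge feature is taken to be the product of the two genes' standardized values, standardized once with the first-state statistics (giving <u,v>_N) and once with the second-state statistics (giving <u,v>_D). The quantities S are treated as sample standard deviations even though the text calls them variances, and k₁, k₂ are applied as simple multipliers.

```python
from statistics import mean, stdev

def edge_features(xu, xv, yu, yv, k1=1.0, k2=1.0):
    """Return a 2 x (m+n) matrix of edge features for one gene pair (u, v).

    xu, xv: values of genes u and v in the m first-state samples.
    yu, yv: values of genes u and v in the n second-state samples.
    ASSUMPTION: each feature is a product of two z-scores, and S is used as
    a standard deviation; the source formula is not available.
    """
    mean_xu, mean_xv, mean_yu, mean_yv = mean(xu), mean(xv), mean(yu), mean(yv)
    s_xu, s_xv, s_yu, s_yv = stdev(xu), stdev(xv), stdev(yu), stdev(yv)
    # All m+n samples, first-state samples followed by second-state samples.
    samples = list(zip(list(xu) + list(yu), list(xv) + list(yv)))
    row_n = [k1 * ((a - mean_xu) / s_xu) * ((b - mean_xv) / s_xv) for a, b in samples]
    row_d = [k2 * ((a - mean_yu) / s_yu) * ((b - mean_yv) / s_yv) for a, b in samples]
    return [row_n, row_d]

# Hypothetical toy pair: perfectly correlated in state 1, anti-correlated in state 2.
xu, xv = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
yu, yv = [1.0, 2.0, 3.0], [6.0, 4.0, 2.0]
edge = edge_features(xu, xv, yu, yv)
m, n = len(xu), len(yu)

# Under this assumed form, averaging a row over its own state's samples
# (with divisor m-1 or n-1) recovers the Pearson correlation in that state.
r_first = sum(edge[0][:m]) / (m - 1)
r_second = sum(edge[1][m:]) / (n - 1)
```

Under this assumed form, stacking the two-row blocks of all selected gene pairs row-wise gives a matrix of the shape the text describes: two rows per gene pair and one column per sample.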

Step S14: Apply a feature selection algorithm to find the gene pairs with the best classification ability in the edge data, and use those gene pairs as biological edge markers, thereby establishing the biological edge identification system.

In this step, the feature selection algorithm includes Sequential Forward Floating Selection (SFFS) and the Support Vector Machine (SVM) from machine learning.
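A simplified sketch of sequential forward floating selection over the rows of an edge-data matrix is given below. To keep the example self-contained it does not use an SVM: a leave-one-out nearest-centroid classifier is swapped in as the scoring function, which is an assumption made purely for illustration. The function names, the toy data and the target subset size are likewise hypothetical.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    rows[f][i] is feature f of sample i. ASSUMPTION: this scorer stands in
    for the SVM named in the text; each class needs at least two samples.
    """
    correct = 0
    for held in range(len(labels)):
        centroids = {}
        for cls in sorted(set(labels)):
            members = [i for i in range(len(labels)) if i != held and labels[i] == cls]
            centroids[cls] = [sum(rows[f][i] for i in members) / len(members) for f in subset]
        point = [rows[f][held] for f in subset]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == labels[held]
    return correct / len(labels)

def sffs(rows, labels, target_size, score=loo_accuracy):
    """Simplified SFFS: greedy forward additions with floating backward removals."""
    selected = []
    best_by_size = {}  # subset size -> best score seen so far at that size
    while len(selected) < target_size:
        # Forward step: add the feature that gives the highest score.
        candidates = [f for f in range(len(rows)) if f not in selected]
        best = max(candidates, key=lambda f: score(rows, labels, selected + [f]))
        selected = selected + [best]
        size = len(selected)
        best_by_size[size] = max(score(rows, labels, selected), best_by_size.get(size, 0.0))
        # Floating step: drop a feature while doing so beats the best subset
        # previously recorded at the smaller size.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score(rows, labels, [g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            reduced_score = score(rows, labels, reduced)
            if reduced_score <= best_by_size.get(len(reduced), 0.0):
                break
            selected = reduced
            best_by_size[len(reduced)] = reduced_score
    return selected, score(rows, labels, selected)

# Hypothetical edge data: 3 features (rows) x 6 samples (columns).
rows = [
    [1.0, 1.2, 0.9, -1.0, -1.1, -0.8],  # separates the two states
    [0.1, -0.2, 0.3, 0.2, -0.1, 0.0],   # noise
    [0.0, 0.3, -0.3, 0.1, 0.2, -0.2],   # noise
]
labels = [0, 0, 0, 1, 1, 1]
selected, accuracy = sffs(rows, labels, target_size=1)
```

Passing a different `score` callable, for example a cross-validated SVM from an external library, would bring the sketch closer to the combination of SFFS and SVM that the text names.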

FIG. 2 shows the flow of the second embodiment of the method for establishing a biological edge identification system of the present invention. Referring to FIG. 2, the steps of the method in this embodiment are as follows.

Step S21: Collect data with two states.

Step S22: Preprocess the data to remove genes whose mean expression is below a set value or whose coefficient of variation is above a set value.
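The preprocessing in step S22 can be sketched as a simple filter. The cut-off values, the function name and the toy data below are hypothetical, since the text only refers to "set values"; the coefficient of variation is computed as the sample standard deviation divided by the mean, which assumes positive mean expression.

```python
from statistics import mean, stdev

def filter_genes(expr, min_mean=1.0, max_cv=0.5):
    """Keep genes whose mean is at least min_mean and whose CV is at most max_cv.

    expr maps a gene name to its expression values across all samples.
    ASSUMPTION: min_mean and max_cv are illustrative thresholds, and
    min_mean is positive so the division below is safe.
    """
    kept = {}
    for gene, values in expr.items():
        mu = mean(values)
        if mu < min_mean:
            continue  # expression mean below the set value
        if stdev(values) / mu > max_cv:
            continue  # coefficient of variation above the set value
        kept[gene] = values
    return kept

# Hypothetical toy data.
expr = {
    "g1": [10.0, 11.0, 9.0, 10.0],  # stable, well expressed: kept
    "g2": [0.1, 0.2, 0.1, 0.0],     # mean too low: removed
    "g3": [1.0, 20.0, 2.0, 30.0],   # too variable: removed
}
kept = filter_genes(expr)
```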

Step S23: Select gene pairs whose correlations meet the significant-difference condition.

Step S24: For the gene pairs whose correlations meet the significant-difference condition, convert the expression value data of the gene pairs into edge data representing correlations through matrix transformation. The matrix transformation is carried out as described for step S13 of the first embodiment.

Step S25: Apply a feature selection algorithm to find the gene pairs with the best classification ability in the edge data, and use those gene pairs as biological edge markers, thereby establishing the biological edge identification system.

FIG. 3 shows the principle of the first embodiment of the biological edge identification system of the present invention. Referring to FIG. 3, the biological edge identification system of this embodiment includes an information collection module 11, a gene pair selection module 12, an edge data acquisition module 13, and a biological edge marker establishment module 14.

The modules are connected in sequence: the information collection module 11 is followed by the gene pair selection module 12, which is followed by the edge data acquisition module 13, which is followed by the biological edge marker establishment module 14.

The information collection module 11 collects data with two states. The data with two states include: normal-state data and disease-state data; metastatic-state data and non-metastatic-state data; and drug-resistant-state data and non-drug-resistant-state data. The data type of the data with two states includes expression profile or abundance profile data of gene pairs, for example microarray, protein profile, metabolite mass spectrometry, and epigenetic profile data.

The gene pair selection module 12 selects gene pairs whose correlations meet the significant-difference condition. It calculates the correlation coefficient of each gene pair in each of the two states, and determines whether the correlation meets the significant-difference condition by comparing the absolute value of the difference between the two correlation coefficients with a threshold.

The edge data acquisition module 13, for the gene pairs whose correlations meet the significant-difference condition, converts the expression value data of the gene pairs into edge data representing correlations through matrix transformation.

The edge data acquisition module 13 is the key to implementing the present invention. The expression value data of the gene pairs are in matrix form, and the matrix transformation is carried out as described for step S13 of the method.

The biological edge marker establishment module 14 applies a feature selection algorithm to find the gene pairs with the best classification ability in the edge data, and uses those gene pairs as biological edge markers, thereby establishing the biological edge identification system. The feature selection algorithm includes Sequential Forward Floating Selection (SFFS) and the Support Vector Machine (SVM) from machine learning.

FIG. 4 shows the principle of the second embodiment of the biological edge identification system of the present invention. Referring to FIG. 4, the biological edge identification system of this embodiment includes an information collection module 21, a preprocessing module 22, a gene pair selection module 23, an edge data acquisition module 24, and a biological edge marker establishment module 25.

The modules are connected in sequence: the information collection module 21 is followed by the preprocessing module 22, which is followed by the gene pair selection module 23, which is followed by the edge data acquisition module 24, which is followed by the biological edge marker establishment module 25.

The information collection module 21 collects data with two states. The data with two states include: normal-state data and disease-state data; metastatic-state data and non-metastatic-state data; and drug-resistant-state data and non-drug-resistant-state data. The data type of the data with two states includes expression profile or abundance profile data of gene pairs, for example microarray, protein profile, metabolite mass spectrometry, and epigenetic profile data.

The preprocessing module 22 preprocesses the data, removing genes whose mean expression is below a set value or whose coefficient of variation is above a set value, in order to reduce the influence of noise on the result.

The gene pair selection module 23 selects gene pairs whose correlations meet the significant-difference condition. It calculates the correlation coefficient of each gene pair in each of the two states, and determines whether the correlation meets the significant-difference condition by comparing the absolute value of the difference between the two correlation coefficients with a threshold.

The edge data acquisition module 24, for the gene pairs whose correlations meet the significant-difference condition, converts the expression value data of the gene pairs into edge data representing correlations through matrix transformation.

The edge data acquisition module 24 is the key to implementing the present invention. The expression value data of the gene pairs are in matrix form, and the matrix transformation is carried out as described for step S13 of the method.

The biological edge marker establishment module 25 applies a feature selection algorithm to find the gene pairs with the best classification ability in the edge data, and uses those gene pairs as biological edge markers, thereby establishing the biological edge identification system. The feature selection algorithm includes Sequential Forward Floating Selection (SFFS) and the Support Vector Machine (SVM) from machine learning.

FIG. 5 shows a schematic flow of the establishment and application of an example biological edge identification system. Once the biological edge identification system has been established by the present invention, a corresponding classifier or diagnostic model can be built. For a sample to be tested, the edge values corresponding to the biological edge markers are calculated by the matrix transformation and used as the input to the diagnostic model, and the state of the sample, for example whether disease is present or whether cancer metastasis has occurred, is then determined from the output of the model.
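The application flow can be sketched as follows. Everything in this example is an assumption made for illustration: the edge values are computed as products of standardized values using per-state means and standard deviations saved from the training data (the source formula is not reproduced in this text), and a nearest-centroid rule stands in for the trained diagnostic model. The gene names, statistics and centroids are hypothetical.

```python
def sample_edge_values(sample, markers, stats, k1=1.0, k2=1.0):
    """Edge values of one test sample for the selected biological edge markers.

    sample:  gene name -> measured value for the sample under test.
    markers: list of (u, v) gene pairs chosen as edge markers.
    stats:   state name -> gene name -> (mean, standard deviation) taken from
             the training data. ASSUMPTION: product-of-z-scores form.
    """
    values = []
    for u, v in markers:
        for state, k in (("first", k1), ("second", k2)):
            mean_u, sd_u = stats[state][u]
            mean_v, sd_v = stats[state][v]
            values.append(k * (sample[u] - mean_u) / sd_u * (sample[v] - mean_v) / sd_v)
    return values

def diagnose(values, centroids):
    """Nearest-centroid stand-in for the diagnostic model (an assumption)."""
    return min(
        sorted(centroids),
        key=lambda label: sum((a - b) ** 2 for a, b in zip(values, centroids[label])),
    )

# Hypothetical training statistics, markers and class centroids.
stats = {
    "first": {"g1": (5.0, 1.0), "g2": (5.0, 1.0)},
    "second": {"g1": (6.0, 2.0), "g2": (4.0, 2.0)},
}
markers = [("g1", "g2")]
centroids = {"normal": [0.8, 0.2], "disease": [-0.9, -0.8]}

test_sample = {"g1": 8.0, "g2": 2.0}
values = sample_edge_values(test_sample, markers, stats)
state = diagnose(values, centroids)
```

The design point illustrated here is that a single test sample can be mapped to edge values only if the reference means and deviations from both states are stored when the system is established.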

以基因表达数据为例，现有方法主要是从差异表达的基因中挑出具有最大区分能力的基因作为生物标识，这些生物标识是否具有相互作用是不知道的。而本发明关注的是相互作用上有差异的基因，并从中找出具有最大区分能力的基因对作为生物标识，称之为生物边标识，这样找出来的边标识从机制上讲更有可能是导致疾病发生发展的原因。Taking gene expression data as an example, existing methods mainly select genes with the greatest discriminating ability from differentially expressed genes as biomarkers, and it is unknown whether these biomarkers have interactions. The present invention focuses on genes with differences in interaction, and finds the gene pair with the greatest discriminative ability as a biological marker, which is called a biological edge marker. causes of disease development.

尽管为使解释简单化将上述方法图示并描述为一系列动作，但是应理解并领会，这些方法不受动作的次序所限，因为根据一个或多个实施例，一些动作可按不同次序发生和/或与来自本文中图示和描述或本文中未图示和描述但本领域技术人员可以理解的其他动作并发地发生。Although the above-described methods are illustrated and described as a series of acts for simplicity of explanation, it should be understood and appreciated that these methods are not limited by the order of the acts, as some acts may occur in a different order in accordance with one or more embodiments and/or occur concurrently with other actions from or not shown and described herein but understood by those skilled in the art.

本领域技术人员将进一步领会，结合本文中所公开的实施例来描述的各种解说性逻辑板块、模块、电路、和算法步骤可实现为电子硬件、计算机软件、或这两者的组合。为清楚地解说硬件与软件的这一可互换性，各种解说性组件、框、模块、电路、和步骤在上面是以其功能性的形式作一般化描述的。此类功能性是被实现为硬件还是软件取决于具体应用和施加于整体系统的设计约束。技术人员对于每种特定应用可用不同的方式来实现所描述的功能性，但这样的实现决策不应被解读成导致脱离了本发明的范围。Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the specific application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors cooperating with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integrated into the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside in a user terminal as discrete components.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium can be any available medium that can be accessed by a computer. By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The previous description of the present disclosure is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to the present disclosure will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other variations without departing from the spirit or scope of the present disclosure. Thus, the present disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for establishing a biological edge identification system, comprising the following steps:

collecting data having two states;

selecting gene pairs whose correlations meet a significant-difference condition;

for the gene pairs whose correlations meet the significant-difference condition, converting the expression value data of the gene pairs into edge data representing the correlations through matrix transformation, wherein the expression value data of the gene pairs is in the form of a matrix:

wherein x_ij represents the value of the expression profile or abundance profile of the j-th sample of biomolecule i in the first of the two states, and y_ij represents the value of the expression profile or abundance profile of the j-th sample of biomolecule i in the second of the two states;

the process of matrix transformation is:

for a given gene pair u and v, the following transformations are made:

wherein <u,v>_N and <u,v>_D are the edge features of the gene pair u, v in the first state and the second state, respectively; the transformation uses the means of the values of the expression profiles or abundance profiles of genes u and v in the first and second states; Sx_u, Sx_v, Sy_u, Sy_v are the variances of genes u and v in the first and second states, respectively; k₁ and k₂ are correction coefficients, each taking the value 1; the matrix formed by <u,v>_N and <u,v>_D of all gene pairs whose correlations meet the significant-difference condition is the edge data corresponding to the gene pairs; the edge data represents the correlations of the gene pairs in the different states, and each gene pair is described by two dual variables or features in the edge data; and

finding the gene pair with the best classification ability in the edge data by applying a feature selection algorithm comprising a cyclic addition and subtraction method and a support vector machine from machine learning, and taking the gene pair with the best classification ability as the biological edge identification, thereby establishing the biological edge identification system.
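The transformation formulas of claim 1 are not reproduced in this text, so the following is only a hedged sketch of one transformation that is consistent with the quantities the claim names (per-state means, per-state variances, and a correction coefficient equal to 1). It assumes that the edge feature of a sample is the product of the two genes' standardized values; the function names `standardize` and `edge_features` and the toy values are hypothetical. Under this assumption, the average of the per-sample edge features of a state equals the Pearson correlation coefficient of the gene pair in that state.

```python
import math


def standardize(values):
    """Center by the mean and scale by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    if variance == 0:
        raise ValueError("gene has zero variance in this state")
    sd = math.sqrt(variance)
    return [(x - mean) / sd for x in values]


def edge_features(u_values, v_values, k=1.0):
    """Assumed per-sample edge feature for gene pair (u, v) in one state.

    k plays the role of the correction coefficient (1 in the claim).
    """
    zu = standardize(u_values)
    zv = standardize(v_values)
    return [k * a * b for a, b in zip(zu, zv)]


# Hypothetical values of genes u and v in four samples of one state.
u = [1.0, 2.0, 3.0, 4.0]
v = [1.0, 3.0, 2.0, 4.0]

features = edge_features(u, v)
mean_feature = sum(features) / len(features)  # equals Pearson r of u and v
print(features, mean_feature)
```

Each gene pair thus yields one value per sample, so the edge data can be passed to a classifier in the same way as ordinary expression values.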

2. The method for establishing a biological edge identification system as claimed in claim 1, wherein the data having two states comprises: normal state data and disease state data; metastatic state data and non-metastatic state data; and drug-resistant state data and non-drug-resistant state data.

3. The method for establishing a biological edge identification system as claimed in claim 1, wherein the data type of the data having two states comprises expression profile or abundance profile data of gene pairs.

4. The method for establishing a biological edge identification system as claimed in claim 1, further comprising, after the step of collecting data having two states:

preprocessing the data by removing genes whose mean expression value is lower than a set value or whose coefficient of variation is higher than a set value.
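The preprocessing of claim 4 can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the thresholds `min_mean` and `max_cv`, the function name `filter_genes`, and the toy values are hypothetical, and the statistics are computed over all supplied samples of a gene, since the claim does not say whether the two states are pooled.

```python
import statistics


def filter_genes(expr, min_mean, max_cv):
    """Keep genes whose mean is at least min_mean and whose CV is at most max_cv.

    expr maps a gene name to its expression values across samples.
    """
    kept = {}
    for gene, values in expr.items():
        mean = statistics.fmean(values)
        # Drop weakly expressed genes; also avoids dividing by a non-positive mean.
        if mean < min_mean or mean <= 0:
            continue
        cv = statistics.stdev(values) / mean  # coefficient of variation
        if cv > max_cv:
            continue
        kept[gene] = values
    return kept


# Hypothetical expression values.
expr = {
    "g1": [10.0, 11.0, 9.0, 10.0],   # well expressed, stable: kept
    "g2": [0.1, 0.2, 0.1, 0.2],      # mean below threshold: removed
    "g3": [1.0, 30.0, 2.0, 40.0],    # coefficient of variation too high: removed
}

kept = filter_genes(expr, min_mean=1.0, max_cv=0.5)
print(list(kept))
```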

5. The method for establishing a biological edge identification system as claimed in claim 1, wherein, in the step of selecting gene pairs whose correlations meet the significant-difference condition, the correlation coefficient of each gene pair is calculated in each of the two states, and whether the correlation meets the significant-difference condition is determined by comparing the absolute value of the difference between the correlation coefficients in the two states with a threshold.
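The selection rule of claim 5 can be sketched as follows. The claim does not name the correlation coefficient or the direction of the comparison, so this sketch assumes the Pearson coefficient and keeps a pair when the absolute difference exceeds the threshold; the function names, the threshold value, and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_pairs(state1, state2, threshold):
    """Return gene pairs (u, v) with |r_state1 - r_state2| > threshold."""
    genes = sorted(set(state1) & set(state2))
    selected = []
    for u, v in combinations(genes, 2):
        r1 = pearson(state1[u], state1[v])
        r2 = pearson(state2[u], state2[v])
        if abs(r1 - r2) > threshold:
            selected.append((u, v))
    return selected


# Hypothetical expression values: gene -> values across samples of one state.
state1 = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, 3, 2, 4]}
state2 = {"A": [1, 2, 3, 4], "B": [8, 6, 4, 2], "C": [1, 3, 2, 4]}

pairs = select_pairs(state1, state2, threshold=1.0)
print(pairs)
```

In the toy data only gene B changes its behaviour between the states, so the two pairs that involve B are selected while the pair (A, C) is not.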

6. A biological edge identification system, comprising:

an information collection module, which collects data having two states;

a gene pair selection module, which selects gene pairs whose correlations meet a significant-difference condition;

an edge data acquisition module, which, for the gene pairs whose correlations meet the significant-difference condition, converts the expression value data of the gene pairs into edge data representing the correlations through matrix transformation, wherein the expression value data of the gene pairs is in the form of a matrix:

the process of matrix transformation is:

for a given gene pair u and v, the following transformations are made:

wherein the transformation uses the means of the values of the expression profiles or abundance profiles of genes u and v in the first and second states; Sx_u, Sx_v, Sy_u, Sy_v are the variances of genes u and v in the first and second states, respectively; k₁ and k₂ are correction coefficients, each taking the value 1; the matrix formed by <u,v>_N and <u,v>_D of all gene pairs whose correlations meet the significant-difference condition is the edge data corresponding to the gene pairs; the edge data represents the correlations of the gene pairs in the different states, and each gene pair is described by two dual variables or features in the edge data; and

a biological edge identification establishing module, which finds the gene pair with the best classification ability in the edge data by applying a feature selection algorithm comprising a cyclic addition and subtraction method and a support vector machine from machine learning, and takes the gene pair with the best classification ability as the biological edge identification, thereby establishing the biological edge identification system.

7. The biological edge identification system of claim 6, wherein the data having two states comprises: normal state data and disease state data; metastatic state data and non-metastatic state data; and drug-resistant state data and non-drug-resistant state data.

8. The biological edge identification system of claim 6, wherein the data type of the data having two states comprises expression profile or abundance profile data of gene pairs.

9. The biological edge identification system according to claim 6, further comprising, after the information collection module:

a preprocessing module, which preprocesses the data by removing genes whose mean expression value is lower than a set value or whose coefficient of variation is higher than a set value.

10. The biological edge identification system of claim 6, wherein, in the gene pair selection module, the correlation coefficient of each gene pair is calculated in each of the two states, and whether the correlation meets the significant-difference condition is determined by comparing the absolute value of the difference between the correlation coefficients in the two states with a threshold.