CN113130010A

CN113130010A - Gene regulation network database and application thereof in personalized medicine screening

Info

Publication number: CN113130010A
Application number: CN202110438364.8A
Authority: CN
Inventors: 覃静; 张艳红
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-07-16

Abstract

The invention discloses a construction method, a system and applications of a gene regulatory network database whose networks can be regulated by active molecules. A large number of gene regulatory networks that active molecules regulate in cells are constructed, and active molecules that can be used directly for cell reprogramming are identified by large-scale computation using mathematical optimization or machine learning. With the gene regulatory network database or the gene regulatory network data analysis system, safer and more effective new targets or new methods for treating diseases can be developed, and both the level of personalized treatment and the efficiency of drug development can be greatly improved.

Description

Gene regulation network database and application thereof in personalized medicine screening

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a gene regulatory network database and its application in personalized medicine screening.

Background

Over the past decade, health care has made considerable progress. Breakthroughs in targeted vaccines and continuing advances in cancer therapy have greatly improved quality of life and life expectancy, making it possible to overcome many health problems that once appeared insurmountable. At the same time, awareness of individual patient characteristics and of the differences between disease types has grown. Precision medicine (PM) can provide more targeted, personalized treatment for individuals based on complex, comprehensive data and extensive testing. Because everyone lives in a different environment, health problems differ greatly from person to person. Precision medicine can make full use of measurable genetic and environmental information to deliver personalized treatment or to prevent diseases before they occur.

Cancer therapy, for example, remains a significant problem that the medical community has not yet fully overcome. Although surgery is one of the most important and effective strategies for treating early-stage cancer, its feasibility and prognosis depend largely on the stage of the cancer and the physiological state of the patient. As a result, more than 50% of stage III and IV patients can only be offered conventional chemotherapy and radiation therapy. However, cancer cells arise from variations in many genes and form a very complex system. Most cancer cells exposed to conventional chemotherapy and radiation therapy rapidly develop acquired resistance through complex gene networks and through compensation based on the heterogeneity produced by multiple genetic variations or multiple gene signaling pathways. Furthermore, although immunotherapy directed at immune checkpoints and targeted therapy directed at mutated gene products have achieved very effective anticancer results in the past few years, anticancer drugs directed at a single gene often fail to regulate the complex cancer gene network that results from multiple gene variations, and fail to suppress cancer stem cell survival and the heterogeneity of the cancer cell population; such single-target therapies therefore often fall short of the expected effect. To overcome the disadvantages of single-gene anticancer drugs, it is important to study the heterogeneity caused by different mutant genes at the level of the whole gene network of cancer cells, so that cancer treatments can be developed that control complex, variable pathways as a whole and thereby avoid drug resistance. In recent years, safely targeting and controlling cancer cells at the level of the cell's whole gene regulatory network has emerged as a novel potential therapeutic approach. The development of this approach not only aims to tackle the intractable problem of cancer treatment, but also lays a foundation for the coming era of precision medicine.

Currently commonly used gene regulatory network databases:

(1) TRRUST (Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining) is a manually curated database of gene regulatory networks; its current second version covers two species, human and mouse. The TRRUST database contains 800 human transcription factors and 828 mouse transcription factors, with 8444 and 6552 transcription factor-target regulatory relationships for human and mouse, respectively. These data come from PubMed literature on low-throughput experimental studies of transcriptional regulation. Functionally, the database can be queried for the regulatory relationships involving any gene or for the key transcription factors that regulate a group of genes, and it accepts a set of differentially expressed genes as query input. The output also includes functional annotations such as diseases related to the gene, KEGG and GO terms, and the mode of regulation exerted by the transcription factor on its target gene.

(2) TRRD (Transcription Regulatory Regions Database) is built on continuously accumulating structural and functional information about the regulatory regions of eukaryotic genes. Each TRRD entry describes the structural and functional features of a specific gene: transcription factor binding sites, promoters, enhancers, silencers, and gene expression regulation patterns. TRRD comprises five related data tables: TRRDGENES (basic information and regulatory unit information for all genes in TRRD); TRRDSITES (specific information on the binding sites of regulatory factors); TRRDFACTORS (specific information on the regulatory factors that bind each site in TRRD); TRRDEXP (specific descriptions of gene expression patterns); and TRRDBIB (all references cited in the annotations). The TRRD homepage provides retrieval services for these data tables.

(3) ChIPBase is an open database that provides regulatory networks between transcription factors and various genes, as well as information on transcription factor binding sites and motifs; it also analyzes co-expression relationships between transcription factors and genes using RNA-seq data from the TCGA database. The database currently contains 10200 peak data sets from 10 species obtained by ChIP-seq.

ChIP-seq data are of two types: transcription factors and histone modifications. From ChIP-seq data, the regions of binding sites, i.e. peaks, can be identified. For transcription factor binding sites, the binding regions can be further analyzed for motifs; by annotating peak regions with genes, regulatory relationships between transcription factors and various genes can also be obtained. The database divides genes into the following categories:

1) lncRNA;

2) miRNA;

3) other ncRNA;

4) protein.

Each category corresponds to a sub-menu. Taking lncRNA as an example, the regulatory relationship between a given transcription factor and lncRNAs can be obtained by searching.

The regulatory relationship between a transcription factor and a gene is derived from ChIP-seq data by annotating peak intervals with genes. Since a binding site usually lies in the vicinity of the gene it regulates, a region around each gene is defined (e.g., extended by 1 kb both upstream and downstream), and if a peak overlaps this region, a regulatory relationship is considered to exist between the transcription factor and the gene.
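The overlap rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ChIPBase's actual implementation: the function name `annotate_peaks`, the tuple layouts, the half-open coordinates and the placeholder names (`TF_1`, `GENE_A`) are all hypothetical.

```python
def annotate_peaks(peaks, genes, flank=1000):
    """Link a transcription factor to every gene whose extended region overlaps one of its peaks.

    peaks: list of (tf, chrom, start, end)
    genes: list of (gene, chrom, start, end)
    flank: bases added upstream and downstream of each gene (1 kb, as in the text)
    """
    links = set()
    for tf, p_chrom, p_start, p_end in peaks:
        for gene, g_chrom, g_start, g_end in genes:
            if p_chrom != g_chrom:
                continue
            # Extend the gene region on both sides, clamping at position 0.
            lo = max(0, g_start - flank)
            hi = g_end + flank
            # Two half-open intervals overlap when each starts before the other ends.
            if p_start < hi and lo < p_end:
                links.add((tf, gene))
    return sorted(links)


# Hypothetical toy data.
peaks = [("TF_1", "chr1", 9500, 9700), ("TF_1", "chr1", 50000, 50200), ("TF_2", "chr2", 100, 300)]
genes = [("GENE_A", "chr1", 10000, 12000), ("GENE_B", "chr2", 5000, 6000)]
print(annotate_peaks(peaks, genes))  # [('TF_1', 'GENE_A')]
```

The nested loop is quadratic and is only suitable for small inputs; a real annotation step would typically sort the intervals or use an interval index.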

In addition, the official website provides the ChIP-Function tool, which performs GO enrichment analysis on all target genes regulated by a transcription factor, and a Co-Expression tool, which analyzes the co-expression relationships of specified genes using about 2000 RNA-seq expression profiles collated from public databases such as TCGA.

However, the above databases only provide information on transcription factor binding sites, transcription factor target genes inferred from co-expression, or transcription factor target gene information collected from the literature. They do not provide condition-specific gene regulatory networks and cannot directly predict, on a large scale, the active molecules that can regulate these networks. Moreover, the accuracy of the data is not verified: for example, the experimental results in the TRRUST database were retrieved from PubMed by curators using a computational algorithm, so the quality and reliability of the data cannot be guaranteed.

Disclosure of Invention

The invention aims to construct a large number of gene regulatory networks that cover various cells (including cancer cells) and can be regulated by active molecules such as compounds, biological factors or gene tools, and to search, by large-scale computation using mathematical optimization or machine learning, for active molecules capable of regulating diseased cells.

The technical scheme adopted by the invention is as follows:

In a first aspect of the present invention, there is provided a method for constructing a database of gene regulatory networks that can be regulated by active molecules, comprising:

S01, acquiring gene expression profile data of active molecule interference cell experiments;

S02, analyzing genes with expression differences;

S03, combining the genes with expression differences with transcription factor binding site data to construct a database of gene regulatory networks that can be regulated by active molecules.

Further, the active molecule in step S01 includes a compound, a biological factor, a gene tool, or the like.

Preferably, the compound is an activator or inhibitor of a specific protein.

More preferably, the compound is a substance that activates or inhibits the activity of a specific protein, or a substance that degrades a specific protein.

Preferably, the gene tool is one that increases or decreases the expression level of a specific protein.

More preferably, the gene tool is an interfering RNA, a microRNA, or a gene editing or gene knockout material.

Further, the active molecule interference experiment data in step S01 are derived from a gene expression profile database.

Preferably, the gene expression profile database is the Gene Expression Omnibus database, the Connectivity Map database, or the ArrayExpress database.

Further, in step S02, the active molecule interference experiment data are grouped by experimental batch, cell line and interference condition, and genes with expression differences are then analyzed;

the genes with expression differences are genes whose expression differs before and after the interference.

Preferably, after the gene expression profile data of the active molecule interference cell experiments are grouped by experimental batch, cell line and interference condition, genes whose expression differs before and after the active molecule interference are analyzed.

More specifically, genes whose expression differs before and after the interference are analyzed with a differential expression analysis method such as Limma.

Further, in step S03, based on the results of the differential gene expression analysis, the differentially expressed genes are further analyzed together with cell type-specific transcription factor binding site data, so as to construct the gene regulatory network regulated by the corresponding active molecule.

More specifically, in step S03, the gene regulatory network database is constructed with a gene regulatory network construction tool such as BETA or ChIP-Array.

After high-quality gene regulatory networks regulated by active molecules are obtained with the gene regulatory network construction tool, they can be converted into a network feature matrix.
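One way to picture the conversion into a network feature matrix is shown below. The text does not specify what the matrix entries are, so this sketch assumes, purely for illustration, that each network is stored as a mapping from target gene to a signed score (for example a differential expression value) and that genes absent from a network get 0. The function name `build_feature_matrix` and the identifiers `net_1`, `GENE_A` and so on are hypothetical.

```python
def build_feature_matrix(networks):
    """Return (genes, network_ids, A), where A[i][j] is the score of gene i in network j.

    networks: {network_id: {gene: score}}; the score definition is an assumption.
    """
    network_ids = sorted(networks)
    genes = sorted({g for targets in networks.values() for g in targets})
    row = {g: i for i, g in enumerate(genes)}
    # One row per gene, one column per network, zero where the gene is not a target.
    A = [[0.0] * len(network_ids) for _ in genes]
    for j, net_id in enumerate(network_ids):
        for gene, score in networks[net_id].items():
            A[row[gene]][j] = float(score)
    return genes, network_ids, A


# Hypothetical toy networks.
networks = {
    "net_1": {"GENE_A": 1.2, "GENE_B": -0.8},
    "net_2": {"GENE_B": 0.5, "GENE_C": 2.0},
}
genes, network_ids, A = build_feature_matrix(networks)
for gene, values in zip(genes, A):
    print(gene, values)
```

Keeping genes as rows and networks as columns means that each column of the matrix describes one active-molecule-regulated network, which is the layout the matching step expects when a patient vector is compared against all networks at once.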

Further, the gene regulatory network data may be constructed as an online database.

In a second aspect of the present invention, there is provided a method for screening active molecules capable of regulating the gene regulatory network of diseased cells, comprising the steps of:

S11, preparing paired gene expression data of the diseased and normal tissues of a specific individual and processing them into a patient data matrix B; processing the active-molecule-regulatable gene regulatory network database constructed by the method of the first aspect of the invention into a data matrix A;

S12, matching the data matrix A with the patient data matrix B, and identifying, by a mathematical method or machine learning, the active molecules that can regulate the gene regulatory networks dysregulated in the diseased cells of the specific individual.
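As a rough illustration of step S11, the sketch below turns paired lesion/normal expression values into a patient matrix B with one column per patient. The text does not say how the paired data are processed, so the log2 ratio with a pseudocount used here is an assumption, as are the function name `patient_matrix` and the placeholder patient and gene names.

```python
import math


def patient_matrix(lesion, normal, genes, pseudocount=1.0):
    """Return (patients, B), where B[i][k] is the log2 lesion/normal ratio of gene i in patient k.

    lesion, normal: {patient: {gene: expression}} for the paired samples.
    The log2 ratio and the pseudocount are illustrative assumptions.
    """
    patients = sorted(lesion)
    B = []
    for g in genes:
        B.append([
            math.log2((lesion[p][g] + pseudocount) / (normal[p][g] + pseudocount))
            for p in patients
        ])
    return patients, B


# Hypothetical paired measurements for two patients.
lesion = {"P1": {"GENE_A": 7.0, "GENE_B": 1.0}, "P2": {"GENE_A": 3.0, "GENE_B": 3.0}}
normal = {"P1": {"GENE_A": 1.0, "GENE_B": 3.0}, "P2": {"GENE_A": 3.0, "GENE_B": 0.0}}
patients, B = patient_matrix(lesion, normal, ["GENE_A", "GENE_B"])
print(patients, B)
```

The gene order passed to `patient_matrix` should be the same as the row order of matrix A, so that row i of A and row i of B refer to the same gene.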

The sparse learning model for network matching in step S12 can be summarized as the following optimization problem:

min ||Ax - b||²

s.t. card(x) ≤ s,

where A is the matrix formed by combining all active-molecule-regulatable gene regulatory networks in the database; b is the data vector of a particular patient k in matrix B (i.e., column k of B); x is the vector of association coefficients between the gene regulatory networks under active molecule interference and the dysregulated genes of patient k; card(x) is the number of non-zero coefficients in x; and s is a given sparsity level, i.e., the number of gene regulatory networks to be selected (for example, 10 gene regulatory networks regulated by small molecules). In this model, s is a free parameter; setting s to 10, for instance, searches for the 10 most relevant gene regulatory networks for each patient.
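To make the cardinality-constrained model concrete, the following dependency-free sketch applies Orthogonal Matching Pursuit (OMP), a greedy approximation for min ||Ax - b||² subject to card(x) ≤ s. It illustrates the model only and is not the inventors' implementation; the helper names `solve` and `omp` and the toy matrix are hypothetical, and a production version would normally rely on a numerical library.

```python
def solve(M, v):
    """Solve the square system M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def omp(A, b, s, tol=1e-10):
    """Greedy approximation of min ||Ax - b||^2 subject to card(x) <= s."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    norms = [sum(v * v for v in col) ** 0.5 for col in cols]
    support, coef = [], []
    residual = list(b)
    for _ in range(s):
        # Pick the unused column most correlated with the current residual.
        best, best_score = None, tol
        for j in range(n):
            if j in support or norms[j] == 0.0:
                continue
            score = abs(sum(c * r for c, r in zip(cols[j], residual))) / norms[j]
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break  # the residual is numerically orthogonal to every remaining column
        support.append(best)
        # Refit all selected coefficients by least squares via the normal equations.
        gram = [[sum(p * q for p, q in zip(cols[a], cols[c])) for c in support] for a in support]
        rhs = [sum(p * q for p, q in zip(cols[a], b)) for a in support]
        coef = solve(gram, rhs)
        residual = [b[i] - sum(coef[k] * cols[support[k]][i] for k in range(len(support)))
                    for i in range(m)]
    x = [0.0] * n
    for j, c in zip(support, coef):
        x[j] = c
    return x


# Toy example: b is exactly 2 * column 0 - 1 * column 2 of A.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
b = [2.0, 0.0, -1.0, 2.0]
print(omp(A, b, s=2))
```

Solving the normal equations keeps the sketch short but is numerically fragile when selected columns are nearly collinear; a QR-based least-squares solver would be the more robust choice.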

Both mathematical analysis and machine learning methods can be used to identify the gene regulatory networks that are dysregulated in a specific individual and the active molecules that regulate those networks. In some embodiments of the invention, the inventors tested the performance of four Greedy Sparse Learning Algorithms (GSLAs) in predicting the associations between gene regulatory networks and the differential gene expression profiles of individual breast cancer patients: Orthogonal Matching Pursuit (OMP), Compressive Sampling Matching Pursuit (CoSaMP), the adaptive forward-backward greedy algorithm (FoBa), and Least Angle Regression (LARS). Of the four, FoBa was best able to identify significantly affected patient-specific networks, but the invention is not limited to this algorithm.

This calculation method enables a rapid, quantitative search across gene regulatory networks regulated by a large number of active molecules.

In a third aspect of the invention, there is provided the use of a gene regulatory network database for screening active molecules that target gene regulatory networks dysregulated in diseased cells; the gene regulatory network database is the active-molecule-regulatable gene regulatory network database obtained by the method of the first aspect of the invention.

In a fourth aspect of the present invention, there is provided a gene regulatory network database analysis system comprising:

a data import module: used for importing paired gene expression data of diseased and normal tissues of a specific individual;

a data processing module: used for processing the paired gene expression data of the diseased and normal tissues of the specific individual, and the active-molecule-regulatable gene regulatory network database constructed by the method of the first aspect of the invention, into respective data matrices;

a data comparison module: used for matching the data matrix of the paired gene expression data of the diseased and normal tissues of the specific individual against the data matrix of the gene regulatory network database constructed by the method of the first aspect of the invention;

a result output module: used for outputting the matching result of the data matrices;

a data query module: used for querying the corresponding gene regulatory networks and active molecules through the matching result, or for querying the gene regulatory networks that can be regulated by related active molecules by inputting at least one of an active molecule, a gene and a transcription factor.

Further, the system also comprises:

a homepage: used for introducing the database;

a help module: used for explaining how to use the database.

Specifically, the mathematical model currently adopted by the data comparison module is as follows:

min ||Ax - b||²

s.t. card(x) ≤ s,

where A is the matrix formed by combining all active-molecule-regulatable gene regulatory networks in the database; b is the data vector of a particular patient k in matrix B (i.e., column k of B); x is the vector of association coefficients between the gene regulatory networks under active molecule interference and the dysregulated genes of patient k; card(x) is the number of non-zero coefficients in x; and s is a given sparsity level, i.e., the number of gene regulatory networks to be selected (for example, 10 gene regulatory networks regulated by small molecules). In this model, s is a free parameter; setting s to 10, for instance, searches for the 10 most relevant gene regulatory networks for each patient. This mathematical optimization method predicts the combination of gene regulatory networks, and the corresponding active molecules, capable of inducing diseased cells to be reprogrammed into normal cells.
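Once a sparse coefficient vector x has been obtained, the non-zero entries still have to be mapped back to networks and their active molecules. A minimal sketch of that lookup is given below; the function name `rank_networks`, the metadata dictionary and all identifiers are hypothetical, and ranking by absolute coefficient is an assumption, since the text does not state how the sign or magnitude of a coefficient is interpreted.

```python
def rank_networks(x, network_ids, molecule_of, top=None):
    """Return (network_id, active_molecule, coefficient) for the non-zero entries of x,
    ordered by decreasing absolute coefficient."""
    hits = [(net, molecule_of[net], c) for net, c in zip(network_ids, x) if c != 0.0]
    hits.sort(key=lambda h: abs(h[2]), reverse=True)
    return hits if top is None else hits[:top]


# Hypothetical solution vector and network metadata.
x = [0.0, -1.5, 0.4]
network_ids = ["net_1", "net_2", "net_3"]
molecule_of = {"net_1": "compound_X", "net_2": "compound_Y", "net_3": "shRNA_Z"}
for hit in rank_networks(x, network_ids, molecule_of):
    print(hit)
```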

In a fifth aspect of the present invention, there is provided a gene regulatory network database analysis apparatus comprising:

at least one processor;

at least one memory for storing at least one program;

when the at least one program is executed by the at least one processor, the at least one processor performs the method for constructing an active-molecule-regulatable gene regulatory network database according to the first aspect of the present invention, or the method for screening active molecules capable of regulating the gene regulatory network of diseased cells of a specific individual according to the second aspect of the present invention.

In a sixth aspect of the present invention, there is provided a storage medium in which the gene regulatory network database analysis system according to the fourth aspect of the present invention is stored.

The beneficial effects of the invention are as follows:

1. At present, researchers can only look for active molecules capable of restoring diseased cells through extensive and laborious experimental work. This approach is time-consuming and costly, has a low success rate and a narrow field of application, and, most importantly, cannot precisely control the whole gene network. So far, no case of targeted therapy against the gene regulatory network of diseased cells with substantial clinical value has been reported. The invention therefore provides a rapid and efficient bioinformatics prediction method that constructs a large number of cell-specific gene regulatory networks regulated by active molecules and applies mathematical optimization or machine learning to predict, on a large scale, active molecules that can effectively target the dysregulated gene regulatory networks of diseased cells.

2. At present, many gene regulatory network reconstruction methods build a network from a single data set. Such methods for predicting relationships between target genes and transcription factors can only identify interactions between individual genes, and the interactions among many functionally related genes have received little study. The method provided by the invention considers the interactions between multiple functionally related genes and transcription factors, which is important for studying regulatory mechanisms, and it can also detect new genes related to the functional expression of specific genes. The method considers the regulation of target genes by transcription factors together with the co-expression of transcription factors and the co-regulation of target genes. Applying the invention makes it possible to discover new genes directly or indirectly related to the pathogenesis of diseases. For example, in earlier breast cancer studies using the method of the invention, both networks common to breast cancer lesions (the FOXH1 network TRN0000150, associated with 1/3 of the patients studied, as well as EPSTI1, BAMBI, FOXQ1, ROR2 and the like) and personalized dysregulated networks of several patients (such as down-regulation of TAF1 in patients with invasive tumors and patients with poor prognosis, down-regulation of CTCF in patients with poor prognosis, and up-regulation of SMARCC1 expression in patient TCGA-BH-A1EV, who has an invasive tumor) were found. These studies also deepened the understanding of important biological processes related to the occurrence and development of some breast cancers, providing a basis for personalized targeted treatment of diseases.

In general, biological networks reflect the interactions between various molecules within a cell. As a special form of biological network, a gene regulatory network is formed by the complex interactions between transcription factors and their target genes in different tissues. Gene regulation is one of the key steps in the expression of an organism's genetic information and is involved in many important physiological and biochemical processes, such as development, signal transduction, metabolic regulation, stimulus response and immune response. The inventors comprehensively consider, within the transcriptional regulation mechanism of cells, the regulation of target genes by transcription factors, the co-expression of transcription factors, and the co-regulation of target genes. The inventors' studies not only validated interaction relationships obtained in previous studies but also predicted new pathways associated with pathogenesis.

The invention provides a rapid and efficient bioinformatics prediction method that constructs a database of cell-specific gene regulatory networks regulated by a large number of active molecules and applies mathematical optimization or machine learning to predict, on a large scale, active molecules that can effectively target the dysregulated gene regulatory networks of diseased cells or tissues.

The invention also provides a method for constructing a gene regulatory network database that can be regulated by active molecules. By comparing a patient's data with this database, the gene regulatory networks matching the patient can be obtained, so that active molecules that effectively target the dysregulated gene regulatory networks of the patient's diseased cells or tissues can be predicted on an individual basis.

The invention also provides a gene regulatory network database analysis system comprising a data import module, a data processing module, a data comparison module, a result output module and the like. The system can be queried with any active molecule, gene or transcription factor to retrieve the related transcriptional regulatory network information, and it also enables personalized analysis and prediction of active molecules that target diseased cells or tissues.

In addition, the invention also provides a gene regulation network database analysis device and a storage medium, which are used for loading the gene regulation network database or the gene regulation network database analysis system.

Drawings

FIG. 1 is a schematic diagram of the construction of the library of cellular gene regulatory networks regulated by active molecules.

FIG. 2 is an overview of the bioinformatics analysis computing framework.

FIG. 3 shows the output results for the target gene and cell line information.

FIG. 4 shows the specific information corresponding to a network ID, including the physicochemical properties of the interfering compound, the experimental conditions and the experimental results.

FIG. 5 shows a gene regulatory network.

FIG. 6 is a heat map of the 100 highest-scoring gene regulatory network-patient associations.

FIG. 7 shows the results of survival analysis of HES1, CEBPD, CLOCK and KLF9 in breast cancer patients.

FIG. 8 is a heat map of 10 dysregulated gene regulatory networks and the active molecules that can regulate these networks in breast cancer patients.

FIG. 9 shows 112 dysregulated BRCA1 and SPDEF networks and the active molecules that can regulate these networks in breast cancer patients.

Detailed Description

The present invention will be described in further detail with reference to the following specific embodiments and accompanying drawings. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention.

Explanation of terms:

The Cancer Genome Atlas (TCGA)

The Connectivity Map (CMap)

Binding and Expression Target Analysis (BETA)

Chromatin immunoprecipitation, ChIP-X (ChIP-chip/-exo/-seq)

Gene Expression Omnibus (GEO)

Example 1

A method for constructing a gene regulatory network database that can be regulated by active molecules is provided, comprising the following steps:

S01, acquiring gene expression profile data of active molecule interference cell experiments;

S02, analyzing genes with expression differences;

S03, combining the genes with expression differences with the transcription factor binding site data to construct a gene regulatory network database that can be regulated by active molecules.

Active molecules include compounds, biological factors or gene tools. The compound may be an activator or inhibitor of a specific protein, including a substance that activates or inhibits the activity of a specific protein or a substance that degrades a specific protein.

The gene tool is one that increases or decreases the expression level of a specific protein, and includes interfering RNA, microRNA, gene editing or gene knockout materials, and the like.

The gene expression profile data of active molecule interference cell experiments are derived from gene expression profile databases, including the Gene Expression Omnibus database, the Connectivity Map database, or the ArrayExpress database.

The inventors collected a large amount of experimental data on active molecule interference, including the experimental conditions, the duration of the experiment, the concentration of the interfering agent, and the like.

In step S02, the gene expression profile data of the active molecule interference cell experiments are grouped by experimental batch, cell line and interference condition, and the genes with expression differences, i.e. the genes whose expression differs before and after the interference, are then analyzed.

Genes whose expression differs before and after the interference are analyzed with Limma.

In step S03, the results of the differential gene expression analysis are combined with the transcription factor binding site data for further analysis: the transcriptional regulators are annotated, the regulatory relationships between the transcription factors and other genes under the given condition are examined, and the corresponding gene regulatory network is constructed.

In step S03, the gene regulatory network database can be constructed with gene regulatory network construction tools such as BETA and ChIP-Array.

The gene regulatory network database can also be built as an online database.

In addition, for an individual patient, the active molecules that regulate the gene regulatory sites of that individual can be analyzed as follows:

S11, preparing paired gene expression data of the diseased and normal tissues of the specific individual and processing them into a patient data matrix B; processing the active-molecule-regulatable gene regulatory network database into a data matrix A;

S12, based on the data matrix A and the patient data matrix B, analyzing, by a mathematical model or a machine learning method, the gene regulatory sites that can be regulated in the specific individual and the active molecules that can regulate them.

The data of the gene regulatory networks regulated by all active molecules are converted into a matrix A; b is the key gene regulatory network for cell reprogramming; and x is the vector of association coefficients between the gene regulatory networks controlled by active molecules and the key gene regulatory network for cell reprogramming. Active molecules capable of inducing cell reprogramming can currently be screened rapidly, based on network matching, with the following mathematical optimization model:

min ||Ax - b||²

s.t. card(x) ≤ s,

where card(x) is the number of non-zero coefficients in x, and s is a given sparsity level, i.e., the number of active-molecule-regulated gene regulatory networks to be selected (for example, 10 gene regulatory networks regulated by active molecules).

This calculation method enables a rapid, quantitative search across gene regulatory networks regulated by a large number of active molecules.

Example 2

(1) Transcription factor ChIP-X and transcriptome data collection and arrangement

To construct gene regulatory networks regulated by active molecules, the inventors downloaded human transcription factor ChIP-X data (BED file format) from the Cistrome DB data browser on November 28, 2019. The BED files of the same cell line were grouped and merged using BEDTools. After screening, binding information for 741 transcription factors in 351 cancer cell lines was collected.
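The grouping and merging step can be pictured with the following pure-Python stand-in. It is not a replacement for BEDTools and does not reproduce its options; it only shows the idea of collapsing overlapping or touching intervals within a group. Grouping by cell line, transcription factor and chromosome is an assumption (the text only states that files of the same cell line were grouped), and the function name `merge_intervals` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def merge_intervals(records):
    """Merge overlapping or touching intervals within each (cell_line, tf, chrom) group.

    records: iterable of (cell_line, tf, chrom, start, end) in BED-style coordinates.
    Returns {(cell_line, tf, chrom): [(start, end), ...]} with sorted, disjoint intervals.
    """
    groups = defaultdict(list)
    for cell_line, tf, chrom, start, end in records:
        groups[(cell_line, tf, chrom)].append((start, end))
    merged = {}
    for key, spans in groups.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[key] = [tuple(span) for span in out]
    return merged


# Hypothetical peaks from two files of the same cell line plus one from another cell line.
records = [
    ("cell_1", "TF_1", "chr1", 100, 200),
    ("cell_1", "TF_1", "chr1", 150, 300),
    ("cell_1", "TF_1", "chr1", 400, 500),
    ("cell_2", "TF_1", "chr1", 100, 200),
]
merged = merge_intervals(records)
print(merged[("cell_1", "TF_1", "chr1")])  # [(100, 300), (400, 500)]
```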

In order to construct the gene regulatory networks downstream of these transcription factors that are regulated by active molecules, transcriptome data obtained under various specific conditions after 76 cell lines were treated with 25200 interference means, including 19811 small-molecule compounds, 314 biological products, shRNA, cDNA and the like, were downloaded from CMap (the Connectivity Map).

(2) Construction of a Condition-specific Gene regulatory network

The inventors grouped all transcriptome data by experimental batch, cell line and interference condition, performed differential analysis with Limma to compare gene expression before and after the interference, and obtained the genes whose expression changed significantly under active molecule interference.
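The grouping logic can be sketched as follows. Limma is an R package that fits moderated linear models; the snippet below does not reproduce it and only computes a plain difference of mean log2 expression per group as a simplified stand-in. The record layout (keys `batch`, `cell_line`, `condition`, `treated`, `expr`) and the function name `log_fold_changes` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def log_fold_changes(samples):
    """Group samples by (batch, cell_line, condition) and compare treated with control.

    samples: list of dicts with keys 'batch', 'cell_line', 'condition',
             'treated' (bool) and 'expr' ({gene: log2 expression}).
    Returns {(batch, cell_line, condition): {gene: mean treated - mean control}}.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for s in samples:
        key = (s["batch"], s["cell_line"], s["condition"])
        groups[key][s["treated"]].append(s["expr"])
    result = {}
    for key, arms in groups.items():
        if not arms[True] or not arms[False]:
            continue  # a comparison needs both treated and control samples
        genes = sorted(arms[True][0])
        result[key] = {
            g: mean(e[g] for e in arms[True]) - mean(e[g] for e in arms[False])
            for g in genes
        }
    return result


# Hypothetical toy data: two treated and two control samples in one group.
samples = [
    {"batch": "b1", "cell_line": "cell_1", "condition": "compound_X", "treated": True,
     "expr": {"GENE_A": 8.0, "GENE_B": 5.0}},
    {"batch": "b1", "cell_line": "cell_1", "condition": "compound_X", "treated": True,
     "expr": {"GENE_A": 9.0, "GENE_B": 5.0}},
    {"batch": "b1", "cell_line": "cell_1", "condition": "compound_X", "treated": False,
     "expr": {"GENE_A": 6.0, "GENE_B": 5.5}},
    {"batch": "b1", "cell_line": "cell_1", "condition": "compound_X", "treated": False,
     "expr": {"GENE_A": 7.0, "GENE_B": 5.5}},
]
print(log_fold_changes(samples))
```

Grouping by batch before comparing keeps treated samples paired with controls processed in the same batch, which is the reason the text lists batch alongside cell line and interference condition.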

All genes reported by Limma as significantly differentially expressed, together with their differential expression values, were converted into the BETA (Binding and Expression Target Analysis)-specific format for subsequent gene regulatory network construction. ChIP-X files and transcriptome data were paired and analyzed when they came from the same cell line and the genes regulated by the transcription factors in the ChIP-X data were significantly altered between the two conditions (log2 fold change > 0.5, adjusted P < 0.05). The paired ChIP-X files and transcriptome data were then used as input to the gene regulatory network construction tool BETA basic to construct condition-specific gene regulatory networks. To date, 9554 condition-specific gene regulatory networks centered on 204 transcription factors in 25 cell lines have been obtained, and the database will continue to be expanded and refined.
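A minimal sketch of the significance filter is shown below. It applies the two thresholds quoted in the text and writes one tab-separated row per retained gene. Two points are assumptions rather than statements from the source: the threshold on the log2 fold change is applied to its absolute value, and the three-column layout (gene, log2 fold change, adjusted P value) is used only for illustration, so the BETA documentation should be consulted for the exact input format. The function name `significant_genes` is hypothetical.

```python
def significant_genes(table, min_abs_log2fc=0.5, max_adj_p=0.05):
    """Keep genes passing both thresholds; return rows 'gene<TAB>log2FC<TAB>adj.P'.

    table: {gene: (log2_fold_change, adjusted_p_value)}
    Using the absolute fold change is an assumption; the source only states a cut-off of 0.5.
    """
    rows = []
    for gene, (log2fc, adj_p) in sorted(table.items()):
        if abs(log2fc) > min_abs_log2fc and adj_p < max_adj_p:
            rows.append(f"{gene}\t{log2fc:.3f}\t{adj_p:.3g}")
    return rows


# Hypothetical differential expression results.
table = {
    "GENE_A": (1.2, 0.001),   # kept
    "GENE_B": (-0.9, 0.03),   # kept (down-regulated)
    "GENE_C": (0.3, 0.0001),  # dropped: fold change too small
    "GENE_D": (2.0, 0.2),     # dropped: not significant
}
for row in significant_genes(table):
    print(row)
```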

In terms of the mathematical model in Example 1, A denotes the network matrix: each pair of ChIP-X and transcriptome data is input to the gene regulatory network construction tool BETA to construct a condition-specific gene regulatory network, and in some embodiments of the invention the 9554 condition-specific gene regulatory networks centered on 204 transcription factors were combined into a single matrix A. The vector b is the data vector for a particular patient k in matrix B (i.e., column k of B), and x is a coefficient vector representing the association between each gene regulatory network and patient k. In this model, setting s to 5 (an arbitrary choice) allows the 5 most relevant gene regulatory networks to be searched for each patient.
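The roles of A, b, x and s can be illustrated with a small sketch. The exact formulation and solver of Example 1 are not reproduced in this passage, so the code below swaps in a generic technique, greedy orthogonal matching pursuit, which approximately minimises the residual of Ax = b while allowing at most s non-zero entries in x. The function name, the toy dimensions and the random data are assumptions for illustration only.

```python
import numpy as np


def top_s_networks(A, b, s=5):
    """Greedy orthogonal matching pursuit (a stand-in for the model of Example 1).

    A: genes x networks matrix, b: patient vector of length genes.
    Returns x with at most s non-zero coefficients.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty columns
    residual = b.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(min(s, A.shape[1])):
        # Score each network by its normalised correlation with the residual.
        scores = np.abs(A.T @ residual) / norms
        if support:
            scores[support] = -np.inf  # do not pick a network twice
        j = int(np.argmax(scores))
        if scores[j] <= 1e-12:
            break  # nothing left to explain
        support.append(j)
        # Refit all selected networks jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    if support:
        x[support] = coef
    return x


# Toy example: 50 genes, 8 hypothetical networks, one hypothetical patient.
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(50, 8))
b_demo = 2.0 * A_demo[:, 1] - 3.0 * A_demo[:, 4]
x_demo = top_s_networks(A_demo, b_demo, s=5)
ranked = np.argsort(-np.abs(x_demo))[:5]
print("networks ranked by |coefficient|:", ranked)
```

The magnitude of each non-zero entry of x can then serve as a relevance score between the corresponding network and the patient; other sparse solvers could be substituted without changing the interface.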

Fig. 2 gives an overview of the bioinformatics analysis computing framework.

(3) Database construction

The gene regulation network database constructed by the inventors mainly comprises a home page module, a help module, a search module and a tool module.

The home page (Home page) introduces the whole database in detail, and the help module (Help) helps the user to use the database efficiently. The Search module and the Tool module are the main functional modules. In the Search module, the user enters a selection interface by inputting an active molecule (By Condition), a gene (By Target Gene) or a transcription factor (By Transcription Factor), and obtains detailed network information by selecting the corresponding conditions according to the prompts. The Tool module allows a group of differential genes to be input for precise network matching.

Taking Search By Target Gene as an example (see the result in Fig. 3), selecting the corresponding target gene and cell line information outputs two specific gene regulatory networks, CN2044 and CN585.
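The three search modes can be pictured as filters over a table of network records. The sketch below is a hypothetical in-memory illustration, not the database's actual implementation. The record fields, the `search` function and all sample values are assumptions, and the network IDs are placeholders rather than real database entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """One condition-specific gene regulatory network (hypothetical schema)."""
    network_id: str
    condition: str             # active molecule / perturbation
    transcription_factor: str  # core transcription factor
    cell_line: str
    target_genes: frozenset


def search(networks, *, condition=None, target_gene=None,
           transcription_factor=None, cell_line=None):
    """Return IDs of networks matching every criterion that is given."""
    hits = []
    for net in networks:
        if condition is not None and net.condition != condition:
            continue
        if target_gene is not None and target_gene not in net.target_genes:
            continue
        if (transcription_factor is not None
                and net.transcription_factor != transcription_factor):
            continue
        if cell_line is not None and net.cell_line != cell_line:
            continue
        hits.append(net.network_id)
    return hits


# Placeholder records for illustration only.
networks = [
    Network("NET-1", "compound_X", "TF_A", "cell_line_1",
            frozenset({"GENE_1", "GENE_2"})),
    Network("NET-2", "compound_Y", "TF_B", "cell_line_1",
            frozenset({"GENE_1"})),
    Network("NET-3", "compound_X", "TF_A", "cell_line_2",
            frozenset({"GENE_3"})),
]

# Search By Target Gene, narrowed by cell line.
print(search(networks, target_gene="GENE_1", cell_line="cell_line_1"))
```

Searching by condition or by transcription factor uses the same function with a different keyword argument, which mirrors how the three entry points lead to the same kind of network record.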

Clicking a network ID shows the specific information of that network, including the physicochemical properties of the active molecule, the experimental conditions and the experimental results (see Fig. 4). From any link in the transcription factor-gene information for a specific cell, this new platform can retrieve all the related regulatory information, and it provides a systematic method for understanding molecular pathology, discovering new targets of existing drugs and the like (see Fig. 5).

The method considers the interactions between multiple function-related genes and transcription factors, which is important for studying regulatory mechanisms, and it can also detect new genes related to the functional expression of specific genes. It takes into account the regulation of target genes by transcription factors, as well as the co-expression of transcription factors and the co-regulation of target genes. New genes directly or indirectly related to disease pathogenesis can thus be found. For example, in breast cancer research the inventors used the method to find not only common networks related to breast cancer or other cancers (APOBEC3B, ATF3, EGR1, ESR1, FOS, KLF4, NCAPG, SPDEF and the like) but also personalized dysregulated networks of several patients (HES1, CEBPD, CLOCK and the like; see Fig. 6). This deepens the understanding of important biological processes in the occurrence and development of breast cancer and provides a powerful tool for personalized targeted therapy of breast cancer.

Fig. 6 is a heatmap of the 10 gene regulatory network-patient associations with the highest relevance scores for each of 112 breast cancer patients from TCGA. The x-axis represents the core transcription factor, and the y-axis represents the TCGA patient.

The method not only identified several patient-specific networks with high scores (CEBPD, CLOCK, GATA4, HES1, JUND, KLF9, MXI1, TP53, etc.). It also found that several common networks of interest, such as APOBEC3B, ATF3 and EGR1, were associated with a large share of patients (more than 1/3), and it matched them directly to active molecules that could regulate each network. The analysis results show the following:

Discovery of lesion networks common to multiple cancers

It is not surprising that several of the networks we identified are associated with more than 1/3, or even half, of the patients. Earlier research reports show that these networks are indeed closely related to the development of various cancers. As early as 2013, Professor Reuben Harris's research team at the University of Minnesota revealed that expression of the DNA cytidine deaminase APOBEC3B accounts for half of the mutation load in breast cancer. In addition, they analyzed 19 different types of cancer and found that upregulation of APOBEC3B expression was associated with at least six of them: bladder, cervical, lung (adenocarcinoma and squamous cell), head and neck, and breast cancer. DNA damage is closely related to cancer risk. Research by Dr. Chunhong Yan's team, published in Nature Communications, reports that ATF3 (activating transcription factor 3) is a stress response protein: when cells sense stress such as DNA damage, they induce ATF3 to bind the Tip60 protein and promote DNA damage repair. Like its partner protein Tip60, ATF3 is expressed at a low level in normal cells, but both respond quickly once a cellular stress such as cancer occurs. Although the mechanism of this response remains unclear, increasing ATF3 expression can raise Tip60 activity and promote DNA damage repair by the cell as a whole, which opens a new route for cancer drug development. R. L. Akshaya et al. showed that ATF3 can promote the proliferation and bone metastasis of breast cancer cells by activating the expression of the metastasis-related gene Runx2 and the invasion-related gene matrix metalloproteinase 13 (MMP13). Targeting ATF3 may therefore reduce the metastatic spread of breast cancer and prolong patient survival.
EGR1 (early growth response-1) is a convergence point of multiple signal transduction pathways, and its relationship with malignant tumors has long been a research focus. Once activated, EGR1 can in turn activate downstream genes, and it plays a very important role in cell repair, growth and apoptosis. Results from Juanhong Zhao et al. show that EGR1 is the only prognostic marker of cervical cancer, and that highly expressed EGR1 promotes the proliferation and invasion of cervical cancer cells and their acquisition of stem cell characteristics by increasing downstream SOX9 expression. The EGR1-SOX9 axis may therefore be a potential drug target, and blocking it may be a possible approach to treating cervical cancer. Numerous other studies have also shown an important role for EGR1 in tumors: high expression of CTCF and EGR1 can induce Nm23-H1 expression in MDA-MB-231 cells and thereby reduce cell migration; high EGR1/EGR3 is related to the methylation level in glioma patients and may serve as a prognostic marker for them; and EGR1 plays an important role in the progression of various cancers, such as thyroid cancer, leukemia and lung cancer. These findings are of great significance for the development of broad-spectrum anti-cancer drugs.

Discovery of breast cancer patient-specific lesion networks

Survival analysis using the GEPIA database for 8 high-scoring specific networks identified in only a few patients (10 or fewer) showed that these networks are indeed associated with breast cancer; the survival analysis results for HES1, CEBPD, CLOCK and KLF9 in breast cancer patients are shown in Fig. 7. The HES1 network is dysregulated in two patients, TCGA-BH-A0H5 and TCGA-E2-A15I. Studies show that HES1 up-regulation is a predictor of poor prognosis in human breast cancer and may be a key factor in the proliferation and invasion of breast cancer cells. Furthermore, the proportion of cells overexpressing HES1 is significantly higher in triple-negative breast cancer (TNBC) samples, so HES1 may be a potential target for the treatment of TNBC. Other research groups have found that HES1 can promote the stem cell characteristics of breast cancer cells and play a carcinogenic role by up-regulating Slug. This indicates that HES1 may be a new candidate gene regulating stem cell characteristics in TNBC and provides a new clue in the search for promising TNBC prognostic markers and treatment targets. CEBPD, associated with patients TCGA-A7-A0CH, TCGA-BH-A204 and TCGA-E2-A1BC, has different functions in different cancers. In breast cancer patients, expression of the CEBPD protein is associated with expression of the estrogen receptor (ER+) and progesterone receptor (PGR) and with longer progression-free survival. Researchers found that CEBPD also inhibits cell growth, motility and invasiveness by inhibiting expression of the Slug transcriptional repressor, which results in expression of the cyclin-dependent kinase inhibitor CDKN1A. Dysregulation of the CLOCK gene, identified only in TCGA-A7-A0CH, affects important pathways in cancers such as breast cancer, lung cancer, prostate cancer and hematological malignancies. Studies have shown that the CLOCK gene is involved in processes ranging from the p53-mediated cell cycle to apoptosis, and that CLOCK gene dysfunction can be either tumorigenic or anti-tumor depending on the model and cell type. Few studies have so far reported on the regulation mechanism of this gene and its network, and further elucidation is awaited. The CORN database thus shows great potential for discovering cancer patient-specific lesion networks, and further studies are needed to explore and validate these findings.

Implementation of patient-customized treatment protocols and discovery of active molecules that can target patient-specific lesion networks

One purpose of constructing the CORN database is to identify a patient's dysregulated gene regulatory networks and, by computational methods, to screen on a large scale for active molecules capable of regulating those diseased networks, thereby providing an effective personalized treatment strategy for the patient. Because every network is derived from active molecules perturbing cell lines under specific perturbation conditions, inputting a patient's differential expression matrix not only reveals patient-specific pathological networks but also outputs the active molecules that can modulate them. Fig. 8 is a heatmap of the dysregulated gene regulatory networks, and the active molecules that can regulate them, in 10 arbitrarily selected breast cancer patients. The x-axis represents TCGA breast cancer patients, and the y-axis represents each dysregulated gene regulatory network paired with an active molecule that regulates it. The figure shows that the lesion networks differ from patient to patient and that multiple active molecules can target the same network. For example, ZNF711, associated with patient TCGA-A7-A0DB, is matched to four active molecules that can regulate the network: artemether, tropisetron, wortmannin and vorinostat. Networks associated with more patients, such as EGR1, ESR1 and NCAPG, are likewise matched to multiple active molecules that can regulate the same network. These findings are encouraging for the personalized treatment of patients. For example, patient TCGA-A7-A0CE was matched with 10 dysregulated molecule-network pairs: BRD-K68548958-BRCA1, SB-939-EZH2, AS-601245-FOS, fulvestrant-GREB1, SA-85268-NCAPG, MG-132-NCAPG, celastrol-NCAPG2, barasertib-HQPA-SPDEF, tozasertib-SPDEF and sulforaphane-WDHD.

Of these, 7 disease networks are directly matched with 10 active molecules that can regulate them. The active molecules include fulvestrant, a drug widely used clinically to treat advanced, refractory or metastatic breast cancer; barasertib, which is under experimental study for the treatment of tumors, lymphomas, solid tumors and myeloid leukemia; and other active compounds. For a patient's personalized lesion network, marketed drugs that target the disease network can be used precisely for treatment, which is of great value in guiding clinicians to select rationally the drug or drug combination that targets the disease network. For active molecules of unknown medicinal value, potential targets can be discovered through cell experiments and mathematical models. The CORN database currently comprises 9554 networks, with 204 transcription factors as cores, obtained by perturbation with 1802 active molecules, and each transcription factor target is regulated by 8.8 active molecules on average. Referring to Fig. 9, we arbitrarily selected the BRCA1 and SPDEF networks, which are dysregulated in the 112 breast cancer patients, together with the active molecules that can regulate them: 11 active molecules can regulate BRCA1, and as many as 47 can regulate SPDEF alone. CORN can therefore screen quickly and efficiently, on a large scale, for active molecules that regulate diseased gene regulatory networks, and can find new targets of marketed drugs, greatly accelerating the development of new drugs.
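The matching of dysregulated networks to active molecules amounts to grouping molecule-network pairs by network. The sketch below uses a subset of the pairs listed above for patient TCGA-A7-A0CE; the function name and data layout are assumptions. It also checks that the quoted average of 8.8 active molecules per transcription factor is consistent with dividing 1802 active molecules by 204 transcription factors, which is an assumed reading of how that average was obtained.

```python
from collections import defaultdict

# A subset of the molecule-network pairs reported for patient TCGA-A7-A0CE.
pairs = [
    ("fulvestrant", "GREB1"),
    ("SA-85268", "NCAPG"),
    ("MG-132", "NCAPG"),
    ("barasertib-HQPA", "SPDEF"),
    ("tozasertib", "SPDEF"),
]


def molecules_by_network(pairs):
    """Group active molecules by the network they can regulate."""
    table = defaultdict(list)
    for molecule, network in pairs:
        table[network].append(molecule)
    return dict(table)


table = molecules_by_network(pairs)
for network, molecules in table.items():
    print(network, "<-", ", ".join(molecules))

# Assumed reading of the quoted average: active molecules / transcription factors.
mean_molecules_per_tf = round(1802 / 204, 1)
print("average active molecules per transcription factor:", mean_molecules_per_tf)
```

A table of this shape is what allows one network, such as SPDEF, to be presented to a clinician together with every candidate molecule that can regulate it.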

These results verify that the database of gene regulatory networks regulatable by active molecules constructed by the inventors can be linked both to disease-related networks already confirmed by research and to new active molecules that can act on gene regulatory sites. They show that a database built by the construction method provided by the inventors can be used for large-scale prediction of active molecules that effectively regulate diseased gene regulatory networks.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A method for constructing a gene regulation network database which can be regulated by active molecules comprises the following steps:

S02, analyzing genes with expression differences;

2. The method of claim 1, wherein the active molecule in step S01 comprises a compound, a biological agent, or a genetic tool;

the compound is preferably an activation or inhibition agent for a specific protein, more preferably a substance that activates or inhibits the activity of a specific protein or a substance that degrades a specific protein;

the genetic tool is preferably a genetic tool for increasing or decreasing the expression level of a specific protein, and more preferably interfering RNA, microRNA, gene editing or gene knockout material.

3. The construction method according to claim 1, characterized in that the gene expression profile data of the active molecule interfering cell experiments in step S01 are derived from a gene expression profile database, preferably from the Gene Expression Omnibus database, the Connectivity Map database or the ArrayExpress database.

4. The method of claim 1, wherein the step S02 of analyzing the gene having the expression difference is analyzing the gene having the expression difference before and after the interference of the active molecule;

preferably, the gene expression profile data of the active molecule interfering cell experiment are grouped according to experiment batches, cell lines and interference conditions, and genes with expression difference before and after the active molecule interference are analyzed.

5. The method of claim 1, wherein the genes having expression difference before and after the interference are analyzed using a differentially expressed gene analysis method in step S02.

6. An analytical method for screening active molecules that can regulate a diseased gene regulatory network, comprising the steps of:

S11, preparing differential gene expression data of diseased and normal tissue cells of a specific individual and processing the data into a patient data matrix B; processing the active molecule controllable gene regulation network database constructed by the method of any one of claims 1 to 5 into a data matrix A;

S12, matching the data matrix A with the specific individual data matrix B, and analyzing, by a mathematical analysis or machine learning method, the dysregulated gene regulatory networks of the specific individual and the active molecules capable of regulating those dysregulated gene regulatory networks.

7. Use of a database for screening for active molecules that can treat a gene regulatory network of a disease, said database comprising a database of gene regulatory networks that can be modulated by active molecules constructed by the method of any one of claims 1 to 5.

8. A gene regulatory network database analysis system, comprising:

a data input module: for inputting paired gene expression data for diseased and normal tissues of a particular individual;

a data processing module: a database for processing paired gene expression data of diseased and normal tissues of the specific individual and the active molecule-controllable gene regulation network database constructed by the method of any one of claims 1 to 5 into data matrices, respectively;

a data comparison module: a data matrix for comparing paired gene expression data matrices of diseased and normal tissues of the specific individual with a data matrix of an active molecule-regulatable gene regulation network database constructed by the method of any one of claims 1 to 5;

9. A gene regulatory network database analysis apparatus, comprising:

at least one processor;

at least one memory for storing at least one program;

when the program is executed by the processor, the processor may implement the method for constructing a database of gene regulatory networks controllable by active molecules according to any one of claims 1 to 5, or the analytical method for screening active molecules that can regulate a diseased gene regulatory network.

10. A storage medium in which an active molecule-controllable gene regulatory network database constructed by the method according to any one of claims 1 to 5 and the gene regulatory network database analysis system according to claim 9 are stored.