CN110767261A - A method to automate the construction of high-precision genome-scale metabolic network models

Info

Publication number: CN110767261A
Application number: CN201910934928.XA
Authority: CN
Inventors: 王敏; 夏梦雷; 郑宇; 薛丹妮; 成杨; 李彩霞; 彭明梦
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2020-02-07
Anticipated expiration: 2039-09-29
Also published as: CN110767261B

Abstract

The invention belongs to the field of systems biology and specifically relates to a method for automatically constructing a high-precision genome-scale metabolic network model. Starting from the genome information of a target microorganism, the method uses the API provided by KEGG to automatically build a "gene-enzyme-reaction-compound" database, then automatically refines and calibrates the reaction matrix against transcriptomic and proteomic data, so that an accurate genome-scale metabolic network model can be built quickly. Simulating experiments with the resulting model can effectively reduce workload and provide a reference for wet-lab experiments. The model supports in silico screening of key genes, evaluation of the compatibility between exogenous genes and chassis organisms, mining of key metabolic pathways, and research on minimal genomes and on the stress mechanisms by which microorganisms respond to external stimuli.

Description

A method to automate the construction of high-precision genome-scale metabolic network models

Technical field:

The invention relates to a method for constructing a metabolic network model, in particular a method for automatically constructing a high-precision genome-scale metabolic network model, and belongs to the field of systems biology.

Background art:

Cellular metabolism is a complex network composed of thousands of metabolites, enzymes and regulatory factors. The complexity of this network and of its regulatory mechanisms makes it very difficult to re-engineer metabolic systems and to rationally design microbial cell factories. Accurately identifying, among countless possible metabolic modes, the metabolic behavior that actually occurs in the cell has become one of the core challenges in microbiology.

Constructing a metabolic flux model is currently one of the effective ways to address these problems. The approach uses a stoichiometric matrix to represent intracellular reactions under the quasi-steady-state assumption. Specifically, the concentrations of intracellular intermediate metabolites are assumed to remain constant over a given period. Unknown reaction rates are then determined from the stoichiometric relationships of the reactions in the metabolic pathways and from experimentally measured substrate consumption or product formation rates, based on mass balance and, where possible, energy balance; this in turn gives the flux distribution of the metabolic network. As an important part of systems biology, genome-scale metabolic network models have been applied successfully to strain engineering, prediction of microbial metabolic behavior, and fermentation process monitoring.
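The quasi-steady-state balance described above can be written compactly. The notation is an assumption made here for illustration, since the source defines no symbols: $c$ is the vector of intracellular metabolite concentrations, $S$ is the stoichiometric matrix with one row per metabolite and one column per reaction, and $v$ is the vector of reaction rates, partitioned into measured rates $v_m$ and unknown rates $v_u$ with the matching column blocks $S_m$ and $S_u$.

```latex
\frac{dc}{dt} = S\,v = S_u\,v_u + S_m\,v_m = 0
\quad\Longrightarrow\quad
S_u\,v_u = -\,S_m\,v_m
```

Solving the right-hand system for $v_u$ gives the flux distribution. The matrix built in step (3) of this patent stores reactions as rows and compounds as columns, which corresponds to the transpose of $S$ in this notation.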

At present, data for metabolic flux modeling come mainly from three sources: ① network model databases such as KEGG, NCBI and Biomodel; ② network models already published in the literature; ③ modification of existing metabolic models of other species. In recent years, with the development of omics technologies, constructing metabolic networks from the actual genome sequence of a microorganism has become a research focus, as in CN 103729576 B, CN 103276011 B, CN 102629304 B and CN 103279689 B. However, the response of microbial cells to their environment is a very complex process involving biochemical events at multiple scales, such as gene transcription, protein expression and regulation of enzyme activity, and cells dynamically adjust their metabolic behavior according to different environmental stimuli. Genomic information reflects only the biochemical reactions a microorganism is capable of; it cannot accurately reflect the reactions the microorganism actually carries out under a given condition. Because such modeling ignores microbial diversity and dynamics, methods based solely on databases, literature data or omics data cannot accurately reflect the specific processes that actually take place in the cell.

In addition, genome-scale metabolic networks are currently built by querying and pruning data manually. Constructing a complete metabolic flux model often requires many thousands of database comparisons, which costs a great deal of time and effort and severely limits the efficiency of model building. No effective automated solution has been reported so far. Developing new, automated methods for constructing high-quality models has therefore become an urgent problem in the field.

The API (Application Programming Interface) of the KEGG database provides the technical basis for automated construction of genome-scale metabolic network models. KEGG (Kyoto Encyclopedia of Genes and Genomes) is currently the most authoritative database for genome interpretation. Its main aim is to give a complete computational representation of cells and organisms, so that computers can use gene information to make computational predictions about higher-level, more complex cellular activities and organism behaviors. Given the complete set of genes in a genome, it can predict the roles that protein interaction networks play in various cellular activities. KEGG's PATHWAY database integrates current knowledge of molecular interaction networks (such as pathways and complexes); the GENES/SSDB/KO databases provide knowledge about the genes and proteins found in genome projects; and the COMPOUND/GLYCAN/REACTION databases provide knowledge about biochemical compounds and reactions. For details of the API calls, see the official KEGG description: https://www.kegg.jp/kegg/rest/keggapi.html.

This patent uses the API provided by KEGG to automatically build a "gene-enzyme-reaction-compound" database from the genome information of the target microorganism, and automatically refines and calibrates the reaction matrix against transcriptomic and proteomic data, so that an accurate genome-scale metabolic network model can be built quickly.

Summary of the invention:

The purpose of the present invention is to provide a method for automatically constructing a high-precision genome-scale metabolic network model. Compared with previous methods, it is more efficient, more accurate and more complete. The specific steps are as follows:

(1) Genome sequencing: sequence the genome of the target microorganism to obtain all gene information in its genome;

The gene information includes the functional annotation of each gene and the corresponding KEGG gene number;

(2) Establishment of the reaction network database: use the API provided by the KEGG database to obtain the genes, enzymes, compounds and reactions of the genus to which the target microorganism belongs, together with the associations among them, and build a database;

Further, the database is built automatically from the acquired association information by a program written in MATLAB;

Then, the database is corrected against the genome sequencing results of step (1): gene, protein, compound and reaction information missing from the database is added and superfluous data are deleted, yielding a reduced gene-enzyme-reaction-compound database of the metabolic network actually present in the target microorganism (see Figure 2 for an example of the information map);

Further, when a gene from step (1) is not present in the database, its sequence is aligned against the KEGG database with the BLAST algorithm, the most similar gene is selected from the alignment results and its KO number is recorded; the reaction number is then obtained from the KO-enzyme and enzyme-reaction mappings in the KEGG database and added to the database;

Furthermore, a program written in Python can automatically align the genes one by one against the KEGG database with BLAST and record the KO numbers; a program written in VBA can convert between the KO-enzyme and enzyme-reaction data;

Further, when a gene present in the database is absent from the sequencing results of step (1), the gene and its corresponding reactions are deleted from the database.
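The add/delete logic of step (2) amounts to two set differences between the database genes and the sequenced genes. The following is a minimal sketch under stated assumptions, not the patent's actual program: the function name, the dictionary layout and the gene and reaction identifiers are all hypothetical. It also assumes that a reaction is dropped only when no retained gene still supports it, a detail the source does not specify.

```python
def reconcile_genes(db_gene_to_reactions, sequenced_genes):
    """Compare database genes with sequenced genes (hypothetical helper).

    db_gene_to_reactions: dict mapping gene id -> set of reaction ids.
    sequenced_genes: iterable of gene ids found by genome sequencing.
    """
    db_genes = set(db_gene_to_reactions)
    sequenced = set(sequenced_genes)

    # Genes found by sequencing but missing from the database:
    # candidates for the BLAST / KO lookup described in the text.
    to_add = sorted(sequenced - db_genes)

    # Genes in the database that sequencing did not detect.
    to_delete = sorted(db_genes - sequenced)

    kept = {g: set(rs) for g, rs in db_gene_to_reactions.items() if g in sequenced}
    kept_reactions = set().union(*kept.values())

    # Assumption: drop a reaction only if no retained gene supports it.
    dropped_reactions = set()
    for gene in to_delete:
        dropped_reactions |= set(db_gene_to_reactions[gene]) - kept_reactions

    return {
        "to_add": to_add,
        "to_delete": to_delete,
        "kept": kept,
        "dropped_reactions": sorted(dropped_reactions),
    }


# Toy data with placeholder identifiers.
result = reconcile_genes(
    {"g1": {"R1"}, "g2": {"R1", "R2"}, "g3": {"R3"}},
    ["g1", "g3", "g4"],
)
```

In this toy case `g4` is flagged for lookup, `g2` is deleted, and only `R2` is dropped because `R1` is still supported by `g1`.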

(3) Construction of the model: from the reduced data obtained in step (2), construct the metabolic flux matrix, i.e. the metabolic model, with reactions as rows, compounds as columns, and the coefficient of each compound in each reaction as the entry (negative for consumption, positive for production). This process can be automated in languages such as Python, MATLAB, Java or C++ (see Figure 3 for an example of the flux matrix model);
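As an illustration of the layout in step (3), the sketch below builds a small matrix with reactions as rows and compounds as columns. It assumes the stoichiometry is already available as a nested dictionary; that input format, the function name and the toy reaction and compound names are assumptions for illustration and do not come from the source.

```python
def build_flux_matrix(reactions):
    """Build a reactions-by-compounds coefficient matrix (hypothetical helper).

    reactions: dict mapping reaction id -> dict of compound id -> signed
    coefficient (negative = consumed, positive = produced).
    Returns (row labels, column labels, matrix as a list of lists).
    """
    reaction_ids = sorted(reactions)
    compound_ids = sorted({c for stoich in reactions.values() for c in stoich})
    column = {c: j for j, c in enumerate(compound_ids)}

    # Start from an all-zero matrix; most entries stay zero because each
    # reaction involves only a few compounds.
    matrix = [[0] * len(compound_ids) for _ in reaction_ids]
    for i, rid in enumerate(reaction_ids):
        for compound, coeff in reactions[rid].items():
            matrix[i][column[compound]] = coeff
    return reaction_ids, compound_ids, matrix


# Toy network: R_a converts A to B, R_b converts B to 2 C.
rows, cols, flux_matrix = build_flux_matrix({
    "R_a": {"A": -1, "B": 1},
    "R_b": {"B": -1, "C": 2},
})
```

A plain list of lists keeps the sketch dependency-free; for a genome-scale model with thousands of compounds, a sparse matrix type would be the more practical choice.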

(4) Calibration of the model: measure the transcriptome and proteome of the target microorganism at a specific time point and use a program to compare these data with the metabolic model of step (3). Data present in the transcriptome and proteome but missing from the model are added, and data present in the model but absent from the transcriptome and proteome are deleted. The resulting model can accurately describe the pathway behavior occurring in the cell at that instant.

Further, the proteomic and transcriptomic data are first converted into the corresponding reaction equations using the KEGG information database and then compared with the metabolic matrix. Data present in the transcriptome and proteome but missing from the model are added, and data present in the reaction matrix but absent from the transcriptome and proteome are deleted.
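The calibration rule can be sketched as follows, assuming the omics data have already been reduced to a list of detected enzyme numbers and that an enzyme-to-reaction mapping is at hand. The function name, argument layout and identifiers are hypothetical. Applied literally, the rule leaves exactly the reactions supported by detected enzymes, so reactions with no enzyme assigned would be removed; the source does not say how such reactions are treated.

```python
def calibrate(model_reactions, detected_enzymes, enzyme_to_reactions):
    """Apply the add/delete rule of step (4) (hypothetical helper).

    model_reactions: iterable of reaction ids currently in the model.
    detected_enzymes: iterable of enzyme numbers seen in the omics data.
    enzyme_to_reactions: dict mapping enzyme number -> iterable of reaction ids.
    Returns (calibrated reactions, added reactions, removed reactions).
    """
    model = set(model_reactions)

    # Reactions supported by at least one detected enzyme.
    supported = set()
    for enzyme in detected_enzymes:
        supported |= set(enzyme_to_reactions.get(enzyme, ()))

    added = sorted(supported - model)     # in the omics data, not in the model
    removed = sorted(model - supported)   # in the model, not in the omics data
    calibrated = sorted((model | supported) - set(removed))
    return calibrated, added, removed


# Toy data with placeholder identifiers.
calibrated, added, removed = calibrate(
    ["R1", "R2", "R3"],
    ["E1", "E2"],
    {"E1": ["R1"], "E2": ["R4"], "E3": ["R2"]},
)
```

Returning the added and removed lists separately makes it easy to report totals of the kind given in the examples, such as the number of reactions added and removed at a time point.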

Beneficial effects:

At present, genome-scale metabolic networks are built by querying and pruning data manually, and constructing a genome-scale network from scratch by hand generally takes about six months to a year. The method provided by the present invention obtains data through the API provided by the KEGG database and can complete data retrieval in a little over ten minutes, which greatly reduces time cost and improves efficiency and accuracy.

After the model is constructed, this patent measures the transcriptome and proteome at a specific time point and corrects the metabolic model accordingly, thereby accurately capturing the pathway behavior occurring in the cell at that instant.

By accurately capturing the pathway behavior occurring in the cell at a given instant, this patent can be applied to in silico screening of key genes, evaluation of the compatibility between exogenous genes and chassis organisms, mining of key metabolic pathways, and research on minimal genomes and on the stress mechanisms by which microorganisms respond to external stimuli. Simulating experiments with the model provided by this patent can effectively reduce workload and provide a reference for wet-lab experiments.

Brief description of the drawings:

Figure 1 is the workflow for constructing the metabolic network model in this patent;

Figure 2 is a schematic diagram of the biochemical information obtained through the API;

Figure 3 is a schematic diagram of the flux matrix model.

Detailed description of the embodiments:

To make the purpose, technical solutions and advantages of this patent clearer, the patent is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here only explain this patent and do not limit the present invention.

One feature of this patent is that it follows the central dogma to build a database integrating the biochemical reaction data of the target microorganism in KEGG. First, all genes of the target strain are obtained from the KEGG database through the API; next, the correspondence between those genes and enzymes is obtained; finally, the specific reactions catalyzed by each enzyme and the compounds involved in those reactions are obtained. This establishes a biochemical information database for the target strain running from genes to enzymes to reactions and compounds. Because KEGG groups homologous genes with similar functions into one class and assigns each class a KEGG Orthology number (KO number), this patent also constructs a second route, from the KO number of a gene to the corresponding enzyme and then to the corresponding reactions and compounds, and adds it to the database. The biochemical information database thus contains two routes, which differ in how the data are converted or obtained: 1) starting from the gene, the biochemical reaction equations of the target microorganism are obtained through the gene-enzyme, enzyme-reaction and reaction-compound correspondences; 2) starting from the KO number, the specific reaction equations are obtained through the KO-enzyme, enzyme-reaction and reaction-compound correspondences.
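Both routes are chains of one-to-many mappings, so each can be sketched as repeated composition of dictionaries. The helper name and all identifiers below are placeholders chosen for illustration; the source names Python, MATLAB and VBA as implementation languages but gives no code.

```python
from collections import defaultdict


def compose(first, second):
    """Compose two one-to-many mappings: a -> b and b -> c give a -> c.

    Keys whose intermediate values have no entry in `second` are omitted.
    """
    out = defaultdict(set)
    for a, bs in first.items():
        for b in bs:
            out[a] |= set(second.get(b, ()))
    return {a: cs for a, cs in out.items() if cs}


# Placeholder mappings standing in for the downloaded correspondences.
gene_to_enzyme = {"gene1": {"E1", "E2"}, "gene2": {"E3"}}
ko_to_enzyme = {"KO_x": {"E3"}}
enzyme_to_reaction = {"E1": {"R1"}, "E2": {"R2"}, "E3": {"R3"}}
reaction_to_compound = {"R1": {"A", "B"}, "R2": {"B", "C"}, "R3": {"C", "D"}}

# Route 1: gene -> enzyme -> reaction -> compound.
gene_to_reaction = compose(gene_to_enzyme, enzyme_to_reaction)
gene_to_compound = compose(gene_to_reaction, reaction_to_compound)

# Route 2: KO number -> enzyme -> reaction.
ko_to_reaction = compose(ko_to_enzyme, enzyme_to_reaction)
```

Because both routes share the enzyme-reaction and reaction-compound steps, a single composition helper covers them; only the first mapping differs.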

The gene, protein, compound and reaction information used in constructing the metabolic network model, and the conversions and correspondences among them, all come from the following five KEGG data sources. Data retrieval and conversion can be automated with a program written in VBA:

(1) Gene data download interface: http://rest.kegg.jp/list/<org>;

(2) Enzyme-gene correspondence download interface: http://rest.kegg.jp/link/<org>/ec;

(3) Enzyme-reaction correspondence download interface: http://rest.kegg.jp/link/ec/rn;

(4) Reaction-compound correspondence download interface: http://rest.kegg.jp/link/rn/cpd;

(5) KO number-enzyme correspondence download interface: http://rest.kegg.jp/link/ec/ko.
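The five interfaces above follow two URL patterns, which the sketch below builds and then pairs with a small parser. The parser assumes that each response line holds two tab-separated identifiers; that format, the helper names and the sample identifiers are assumptions for illustration. No network request is made here.

```python
BASE_URL = "http://rest.kegg.jp"  # base address as written in the patent text


def list_url(org):
    """URL of the gene list for an organism code such as 'cac'."""
    return f"{BASE_URL}/list/{org}"


def link_url(first, second):
    """URL of a correspondence table, e.g. link_url('ec', 'rn')."""
    return f"{BASE_URL}/link/{first}/{second}"


def parse_pairs(text):
    """Parse two-column, tab-separated text into a dict of sets.

    Assumed format: one 'left<TAB>right' pair per line; other lines are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        left, right = parts
        mapping.setdefault(left, set()).add(right)
    return mapping


# Placeholder response text; the identifiers are not real KEGG entries.
sample = "rn:R_X1\tec:0.0.0.1\nrn:R_X1\tec:0.0.0.2\n\nrn:R_X2\tec:0.0.0.1\n"
pairs = parse_pairs(sample)
urls = [
    list_url("cac"),
    link_url("cac", "ec"),
    link_url("ec", "rn"),
    link_url("rn", "cpd"),
    link_url("ec", "ko"),
]
```

The text of each URL could then be fetched with any HTTP client and passed to `parse_pairs`; the resulting dictionaries are the inputs for the gene-enzyme-reaction-compound database.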

The workflow of the method for constructing a high-precision genome-scale metabolic network model is shown in Figure 1. The invention is further explained below through specific embodiments.

Example 1: Method for automatically constructing a high-precision genome-scale metabolic network model of Clostridium acetobutylicum

A high-precision genome-scale metabolic network model was constructed with Clostridium acetobutylicum ATCC824 as the research object. The construction process is as follows:

1) The genome of the laboratory-preserved Clostridium acetobutylicum ATCC824 was sequenced, giving 2779 genes; the functional annotation of each gene and the corresponding KEGG gene number were recorded.

2) The genes, enzyme-gene correspondences, enzyme-reaction correspondences and compound-reaction information for the genus to which ATCC824 belongs were downloaded from the KEGG database through the API. The download interfaces are: gene data, http://rest.kegg.jp/list/cac; enzyme-reaction correspondence, http://rest.kegg.jp/link/ec/rn; enzyme-gene correspondence, http://rest.kegg.jp/link/cac/ec; reaction-compound correspondence, http://rest.kegg.jp/link/rn/cpd. From this information, a program written in MATLAB automatically built a database of all gene, protein, compound and reaction information for Clostridium acetobutylicum; it contains 1131 compounds, 2838 genes and 1114 reactions. The database was then corrected by additions and deletions against the genome sequencing results. For example, genes such as bofA are not in the database, but whole-genome sequencing shows that bofA is indeed present in the target strain. A Python program aligned the bofA sequence against the KEGG database with BLAST; the most similar gene was selected from the results and its KO number, K06317, was recorded. A VBA program then looked up the KO-enzyme and enzyme-reaction mappings (KO-enzyme download interface: http://rest.kegg.jp/link/ec/ko) to obtain reaction rn:R10128, which was added to the database. Conversely, genes such as kdgK that are in the database were not detected, so these genes and their corresponding reactions were deleted from the database. Overall, this correction deleted 69 genes absent from the target strain, added 10 genes unique to it, and updated the biochemical information of the metabolic network actually present in the target microorganism according to the gene annotations. The resulting model contains 1032 compounds, 2779 genes and 1065 reactions.

3) From the reduced gene-enzyme-reaction-compound data, the metabolic flux matrix was constructed with reactions as rows, compounds as columns, and the coefficient of each compound in each reaction as the entry (negative for consumption, positive for production).

4) At the time point of rapid butanol production during acetone-butanol fermentation by Clostridium acetobutylicum ATCC824 (72 h), the transcriptome and proteome were measured; 680 enzymes and 2604 genes were found to take part in intracellular metabolism at this point. The proteomic and transcriptomic data were converted into the corresponding reaction equations using the biochemical information database and compared with the metabolic matrix. Data present in the transcriptome and proteome but missing from the model were added, and data present in the reaction matrix but absent from the transcriptome and proteome were deleted. For example, all detected enzyme numbers were compiled from the proteomics results, and a VBA program derived the reactions from the enzyme-reaction mapping; reactions that appear in the proteome but not in the reaction matrix, such as Rn00356, were added, while reactions that appear only in the reaction matrix and cannot be derived from the transcriptome or proteome, such as Rn20226, were deleted. In total, 69 reactions were added to the reaction database and 209 were removed. The dynamic metabolic network active during the high-solvent-production period at 72 h was finally determined to contain 978 compounds, 2604 genes and 886 reactions. Metabolic flux calculation on this network accurately describes the pathway behavior occurring in the cell at 72 h.

Example 2: Method for automatically constructing a high-precision genome-scale metabolic network model of Escherichia coli

A high-precision genome-scale metabolic network was constructed with Escherichia coli JM109 as the research object. The construction process is as follows:

1) The genome of the laboratory-preserved Escherichia coli JM109 was sequenced, showing that JM109 has 2279 genes; the functional annotation of each gene and the corresponding KEGG gene number were recorded.

2) The genes, enzyme-gene correspondences, enzyme-reaction correspondences and compound-reaction information for the genus to which JM109 belongs were downloaded from the KEGG database through the API. The download interfaces are: gene data, http://rest.kegg.jp/list/eco; enzyme-reaction correspondence, http://rest.kegg.jp/link/ec/rn; enzyme-gene correspondence, http://rest.kegg.jp/link/eco/ec; reaction-compound correspondence, http://rest.kegg.jp/link/rn/cpd. From this information, a program written in MATLAB automatically built a database of all gene, protein, compound and reaction information for the genus to which Escherichia coli JM109 belongs; it contains 915 enzymes, 1528 compounds, 3410 genes and 1656 reactions. The database was then corrected by additions and deletions against the genome sequencing results. For example, genes such as gfcD are not in the reaction information database, but whole-genome sequencing shows that gfcD is indeed present in the target strain. A Python program aligned the gfcD sequence against the KEGG database with BLAST; the most similar gene was selected from the results and its KO number, K02377, was recorded. A VBA program then looked up the KO-enzyme and enzyme-reaction mappings (KO-enzyme download interface: http://rest.kegg.jp/link/ec/ko) to obtain reaction rn:R04128, which was added to the database. Conversely, genes such as torR that are in the database were not detected, so torR and its corresponding reactions were deleted from the database. Overall, 1251 genes absent from the target strain were deleted, 120 genes unique to it were added, and the biochemical information of the metabolic network actually present in the target microorganism was updated according to the gene annotations. The resulting model contains 873 enzymes, 1325 compounds, 2779 genes and 1324 reactions.

3) From the reduced gene-reaction-compound data, the metabolic flux matrix was constructed with reactions as rows, compounds as columns, and the coefficient of each compound in each reaction as the entry (negative for consumption, positive for production).

4) When the Escherichia coli fermentation entered the logarithmic phase (8 h), the transcriptome and proteome were measured. The proteomic and transcriptomic data were converted into the corresponding reaction equations using the biochemical information database and compared with the metabolic matrix. Data present in the transcriptome and proteome but missing from the model were added, and data present in the reaction matrix but absent from the transcriptome and proteome were deleted. For example, all detected enzyme numbers were compiled from the proteomics results, and a VBA program derived the reactions from the enzyme-reaction correspondence in the database; reactions that appear in the proteome but not in the reaction matrix, such as Rn02346, were added, while reactions that appear only in the reaction matrix and cannot be derived from the transcriptome or proteome, such as Rn22048, were deleted. At this time point, 543 enzymes and 2480 genes were found to take part in intracellular metabolism. Updating the reaction database added 50 reactions and removed 377. The dynamic metabolic network in the logarithmic phase contains 1302 compounds, 2480 genes and 997 reactions. Metabolic flux calculation on this network accurately describes the pathway behavior occurring in the cell at 8 h.

实施例3自动化构建酿酒酵母高精度基因组尺度代谢网络模型的方法Example 3 Method for automatically constructing a high-precision genome-scale metabolic network model of Saccharomyces cerevisiae

以酿酒酵母Saccharomyces Cerevisiae W3a为研究对象，构建高精度基因组尺度代谢网络。具体构建过程如下：Using Saccharomyces Cerevisiae W3a as the research object, a high-precision genome-scale metabolic network was constructed. The specific construction process is as follows:

1)对实验室保存的酿酒酵母Saccharomyces Cerevisiae W3a进行基因测序，得到2989个基因，记录基因功能注释，及对应的KEGG基因号码；2)通过API从KEGG的数据库中下载W3a所在属的基因、酶-基因对应关系、酶-反应对应关系、和化合物-反应总信息。其中基因数据的下载接口为:http://rest.kegg.jp/list/sce；酶-反应对应关系的下载接口为：http://rest.kegg.jp/link/ec/rn；酶-基因对应关系为：http://rest.kegg.jp/link/ sce/ec；反应-化合物的对应关系为：http://rest.kegg.jp/link/rn/cpd；利用以上信息，使用matlab语言编写程序，自动搭建酿酒酵母W3a所在属全部基因、蛋白、化合物、反应信息的数据库，该数据库含有697种酶，1780个化合物，3221个基因和1390个反应。对比基因组测序结果对数据库信息进行增添与删除，例如：ychD基因等不存在于反应信息的数据库中，但全基因组测序结果表明ychD基因的确存在于目标菌株之中。则利用python语言编写程序，将ychD基因序列在KEGG数据库中进行BLAST算法比对，随后从得到比对结果中挑选相似性最高基因并统计其KO号码K02508，通过VBA语言编写程序比对酶与KO以及酶与反应的数据库(KO-酶的对应关系下载接口为：http://rest.kegg.jp/link/ec/ko)得到反应rn:R04406，并将其添加进入数据库。而数据库上存在的tcrP等基因并没有被检测出来，将其及对应的反应从数据库中删除。删除数据库中目标菌株没有的250个基因，增添目标菌株独有的18个基因，并根据基因注释功能更新目标微生物实际含有的代谢网络的生化信息数据，得到目标微生物实际含有的代谢网络的生化信息数据，该模型含578种酶,1525个化合物，2989个基因和1224个反应。3)根据精简获得的基因-反应-化合物数据信息，并以反应为横行，以化合物为纵行，以化合物在反应中的系数为数值(消耗为负，生成为正)，构建代谢通量矩阵。4)在酒精高速产丁醇节点(10h)时进行转录组和蛋白组检测，发现此节点有482种酶以及2694个基因参与细胞内的代谢，将蛋白组与转录组学的数据在生化信息数据库转化为对应的反应方程式后与代谢矩阵进行比对，对转录组和蛋白组上存在但模型上没有的数据进行补充，对转录组和蛋白组上没有有但是模型上存在的数据进行删除，例如：根据蛋白组学结果统计出所有被检测到的酶，利用VBA语言编写程序，根据酶和反应的对应关系在数据库得到反应，增添蛋白组中出现但反应矩阵中没有出现的反应，如Rn02886等。而只在反应矩阵中出现但根据转录组蛋白组无法得到的反应则将其从反应矩阵中删除，如Rn28672等。调校结果为其反应数据库总共增加了82个反应，减少了219个反应，参与10h产乙醇时期的动态代谢网络含有1236个化合物，2694个基因和1087个反应。通过代谢通量计算准确描述细胞中10h瞬时发生的通路行为。1) Sequence the gene of Saccharomyces Cerevisiae W3a stored in the laboratory to obtain 2989 genes, record the gene function annotation, and the corresponding KEGG gene number; 2) Download the genes and enzymes of the genus W3a belongs to from the KEGG database through API -Gene correspondence, enzyme-reaction correspondence, and compound-reaction general information. 
The download interface for the gene data is http://rest.kegg.jp/list/sce; for the enzyme-reaction correspondence, http://rest.kegg.jp/link/ec/rn; for the enzyme-gene correspondence, http://rest.kegg.jp/link/sce/ec; and for the reaction-compound correspondence, http://rest.kegg.jp/link/rn/cpd. Using this information, a program written in MATLAB automatically built a database of all the gene, protein, compound, and reaction information for the genus to which Saccharomyces cerevisiae W3a belongs. The database contains 697 enzymes, 1780 compounds, 3221 genes, and 1390 reactions. Entries were then added to and deleted from the database by comparison with the genome sequencing results. For example, genes such as ychD were absent from the database, although whole-genome sequencing showed that ychD is present in the target strain. In that case, a program written in Python aligned the ychD gene sequence against the KEGG database using the BLAST algorithm; the most similar gene was selected from the alignment results and its KO number, K02508, was recorded. A program written in VBA then looked up the KO-enzyme and enzyme-reaction correspondences (the KO-enzyme download interface is http://rest.kegg.jp/link/ec/ko) to obtain reaction rn:R04406, which was added to the database. Conversely, genes such as tcrP are present in the database but were not detected in the strain, so they and their corresponding reactions were deleted from the database.
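The lookup chain described above (KO number to enzyme to reaction) can be sketched as follows. This is a minimal illustration in Python, not the VBA program of the example. It assumes that the KEGG `link` operation returns two tab-separated identifiers per line; the function names `parse_link` and `reactions_for_ko` and every identifier in the sample text are hypothetical placeholders, not real KEGG entries.

```python
from collections import defaultdict


def parse_link(text, key_prefix):
    """Map each ID that starts with key_prefix to the set of IDs paired with it.

    The column order is not assumed: whichever column carries key_prefix
    becomes the key, and the other column becomes the value.
    """
    mapping = defaultdict(set)
    for line in text.strip().splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip malformed or empty lines
        first, second = parts[0].strip(), parts[1].strip()
        if first.startswith(key_prefix):
            mapping[first].add(second)
        elif second.startswith(key_prefix):
            mapping[second].add(first)
    return mapping


def reactions_for_ko(ko_id, ko_to_ec, ec_to_rn):
    """Follow KO -> enzyme -> reaction and collect every reaction reached."""
    reactions = set()
    for ec in ko_to_ec.get(ko_id, ()):
        reactions |= ec_to_rn.get(ec, set())
    return reactions


# Toy stand-ins for the downloaded KO-enzyme and enzyme-reaction tables
# (placeholder identifiers, invented for illustration only).
KO_EC_TEXT = "ko:K_TOY1\tec:0.0.0.1\nko:K_TOY2\tec:0.0.0.2\n"
RN_EC_TEXT = "rn:R_TOY1\tec:0.0.0.1\nrn:R_TOY2\tec:0.0.0.1\nrn:R_TOY3\tec:0.0.0.2\n"

ko_to_ec = parse_link(KO_EC_TEXT, "ko:")
ec_to_rn = parse_link(RN_EC_TEXT, "ec:")

print(sorted(reactions_for_ko("ko:K_TOY1", ko_to_ec, ec_to_rn)))
```

In practice the two text blocks would be the contents downloaded from http://rest.kegg.jp/link/ec/ko and http://rest.kegg.jp/link/ec/rn, and the KO number would come from the best BLAST hit.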
In total, the 250 database genes absent from the target strain were deleted and 18 genes unique to the target strain were added, and the biochemical information was updated according to the gene function annotations, giving the biochemical data of the metabolic network actually contained in the target microorganism. The resulting model contains 578 enzymes, 1525 compounds, 2989 genes, and 1224 reactions.

3) From the streamlined gene-reaction-compound data, the metabolic flux matrix was constructed with reactions as rows, compounds as columns, and the coefficient of each compound in each reaction as the entry (negative for consumption, positive for production).

4) Transcriptome and proteome measurements were taken at the time point of rapid ethanol production (10 h) and showed that 482 enzymes and 2694 genes were participating in intracellular metabolism at that point. The proteome and transcriptome data were converted into the corresponding reaction equations using the biochemical information database and compared with the metabolic matrix: data present in the transcriptome and proteome but missing from the model were added, and data present in the model but absent from the transcriptome and proteome were deleted. For example, all detected enzymes were tallied from the proteomics results, and a program written in VBA retrieved their reactions from the database through the enzyme-reaction correspondence. Reactions that appear in the proteome but not in the reaction matrix, such as Rn02886, were added, whereas reactions that appear only in the reaction matrix and cannot be derived from the transcriptome or proteome, such as Rn28672, were deleted.
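The add-and-delete rule of step 4) amounts to two set differences between the reactions in the model and the reactions supported by the measured data. The sketch below illustrates this in Python under stated assumptions: the names `reactions_from_enzymes` and `calibrate` are hypothetical, the enzyme-to-reaction table is a toy stand-in with placeholder identifiers, and only proteome evidence is shown (transcriptome evidence would be merged into the observed set in the same way).

```python
def reactions_from_enzymes(detected_enzymes, ec_to_rn):
    """Collect every reaction catalysed by at least one detected enzyme."""
    observed = set()
    for ec in detected_enzymes:
        observed |= ec_to_rn.get(ec, set())
    return observed


def calibrate(model_reactions, observed_reactions):
    """Apply the add/delete rule and report what changed."""
    # Supported by the measurements but missing from the model: add.
    added = observed_reactions - model_reactions
    # Present in the model but without measured support: delete.
    removed = model_reactions - observed_reactions
    calibrated = (model_reactions | added) - removed
    return calibrated, added, removed


# Toy enzyme-to-reaction table and model (placeholder identifiers).
ec_to_rn = {
    "ec:0.0.0.1": {"R_TOY1"},
    "ec:0.0.0.2": {"R_TOY2", "R_TOY3"},
    "ec:0.0.0.3": {"R_TOY4"},
}
model = {"R_TOY1", "R_TOY2", "R_TOY4"}
detected = ["ec:0.0.0.1", "ec:0.0.0.2"]

observed = reactions_from_enzymes(detected, ec_to_rn)
calibrated, added, removed = calibrate(model, observed)

print("added:", sorted(added))
print("removed:", sorted(removed))
print("calibrated:", sorted(calibrated))
```

The size of the calibrated set always equals the original size plus the number added minus the number removed. The same bookkeeping accounts for the counts reported in this example: 1224 + 82 - 219 = 1087 reactions.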
Calibration added 82 reactions to the reaction database and removed 219 in total, so that the dynamic metabolic network active during the 10 h ethanol-production period contains 1236 compounds, 2694 genes, and 1087 reactions. Metabolic flux calculations on this network can accurately describe the pathway behavior occurring in the cell at the 10 h time point.
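The matrix layout used in this example (reactions as rows, compounds as columns, negative coefficients for consumption and positive for production) can be sketched in Python as follows. The toy network, the names `build_matrix` and `net_production`, and the chosen flux values are hypothetical. The steady-state check at the end is a common assumption in flux calculations and is shown only for illustration; the text does not specify which flux calculation is used.

```python
def build_matrix(reactions):
    """Build a reactions-by-compounds coefficient matrix.

    reactions: dict mapping reaction ID -> dict of compound ID -> signed
    coefficient (negative = consumed, positive = produced).
    """
    rxn_ids = sorted(reactions)
    cpd_ids = sorted({c for stoich in reactions.values() for c in stoich})
    column = {c: j for j, c in enumerate(cpd_ids)}
    matrix = [[0.0] * len(cpd_ids) for _ in rxn_ids]
    for i, rxn in enumerate(rxn_ids):
        for cpd, coeff in reactions[rxn].items():
            matrix[i][column[cpd]] = float(coeff)
    return rxn_ids, cpd_ids, matrix


def net_production(matrix, fluxes):
    """Net production rate of each compound: sum over reactions of flux * coefficient."""
    n_compounds = len(matrix[0]) if matrix else 0
    return [
        sum(v * row[j] for v, row in zip(fluxes, matrix))
        for j in range(n_compounds)
    ]


# Toy network with placeholder names: uptake of A, A -> 2 B, export of B.
TOY_REACTIONS = {
    "R_uptake": {"A": 1},
    "R_convert": {"A": -1, "B": 2},
    "R_export": {"B": -1},
}

rxn_ids, cpd_ids, matrix = build_matrix(TOY_REACTIONS)
for rxn, row in zip(rxn_ids, matrix):
    print(rxn, row)

# Fluxes in the order of rxn_ids (R_convert, R_export, R_uptake).
fluxes = [1.0, 2.0, 1.0]
print(dict(zip(cpd_ids, net_production(matrix, fluxes))))
```

With these fluxes neither compound accumulates, which is the condition a steady-state flux calculation would impose on every internal compound.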

The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make a number of variations, combinations, and improvements to the above embodiments without departing from the concept of this patent, and all of these fall within its scope of protection. Therefore, the scope of protection of this patent is defined by the claims.

Claims

1. A method for automatically constructing a high-precision genome-scale metabolic network model, characterized in that the specific steps are as follows:

(1) Genome sequencing: sequence the genome of the target microorganism to obtain all the gene information in the microorganism's genome;

(2) Establishment of the reaction network database: use the API provided by the KEGG database to obtain the genes, enzymes, compounds, and reactions, together with their associations, for the genus to which the target microorganism belongs, and build a database;

According to the genome sequencing results of step (1), the above database is revised: gene, protein, compound, and reaction information not included in the database is added, and redundant data is deleted, yielding a streamlined gene-enzyme-reaction-compound database of the metabolic network actually contained in the target microorganism;

(3) Model construction: from the data obtained in step (2), construct the metabolic flux matrix, that is, the metabolic model, with reactions as rows, compounds as columns, and the coefficient of each compound in each reaction as the entry;

(4) Model calibration: measure the transcriptome and proteome data of the target microorganism at a specific time point and compare them with the metabolic model of step (3); data present in the transcriptome and proteome but missing from the model are added, and data present in the model but absent from the transcriptome and proteome are deleted, so that the resulting model can accurately describe the transient pathway behavior in the cell.

2. The method for automatically constructing a high-precision genome-scale metabolic network model according to claim 1, wherein the gene information in step (1) includes the gene function annotation and the corresponding KEGG gene number.

3. The method for automatically constructing a high-precision genome-scale metabolic network model according to claim 1, wherein the database in step (2) is constructed automatically by a program written in MATLAB from the acquired gene, enzyme, compound, and reaction information and their associations.

4. The method for automatically constructing a high-precision genome-scale metabolic network model according to claim 1, wherein in step (2), when a gene from step (1) does not exist in the database, its sequence is aligned against the KEGG database using the BLAST algorithm, the gene with the highest similarity is selected from the alignment results and its KO number is recorded, the reaction number is obtained by looking up the KO-enzyme and enzyme-reaction correspondences in the KEGG database, and the reaction is added to the database.

5. The method for automatically constructing a high-precision genome-scale metabolic network model according to claim 1, wherein in step (2), when a gene present in the database is absent from the genome sequencing results of step (1), the gene and its corresponding reactions are deleted from the database.

6. The method for automatically constructing a high-precision genome-scale metabolic network model according to claim 1, wherein in step (3), the construction of the metabolic flux matrix is automated by means of the Python, Matlab, Java, or C++ language.

7. The method for automatically constructing a high-precision genome-scale metabolic network model according to claim 1, wherein in step (4), the proteome and transcriptome data are converted into the corresponding reaction equations with the help of the KEGG database and compared with the metabolic matrix; data present in the transcriptome and proteome but missing from the model are added, and data present in the reaction matrix but absent from the transcriptome and proteome are deleted.