Biobytes Vol – 5, July - 2009

Mycobacterium tuberculosis systems biology data in R

Srinivasan Ramachandran*, Amit Katiyar, Amit Sinha, Anshu Bharadwaj, Anirban Dutta, Ayush Raman, Archana Pan, Balwant Kishen Malik, Balvinder Singh, Beena Pillai, Bharati Dutta, Bhanot Priyamwada Sinha, Bhupesh Taneja, Chhabinath Mandal, Charu Kapil Richa, Chitra Dutta, Debasis Dash, Debaprasad Mukherjee, Debdas Paul, Debojyoti Chakraborty, Faraz Alam Ansari, Gajinder Pal Singh, Gajendra Pal Singh Raghava, Gargi Guhathakurtha, Imran Siddiqui, Manish Kumar, Manoj Hariharan, Mekapati Bala Subramanyam, Monika Joon, Mridula Bose, Mudgal Haymanti, Muthiah Gnanamani, Muthukurussi Varieth Raghunandanan, Nanda Ghoshal, Nitin Kumar Singh, Pallavi Sarmah, Ramaswamy Suyambu Kesava Vijayan, Rajni Verma, Rakesh Sharma, Ravishankar Ramachandran, Rupanjali Chaudhuri, Sabyasachi Das, Samir Kumar Brahmachari, Sandip Paul, Sanjib Chatterjee, Savita Bhutoria, Shantanu Chowdhury, Simone Gupta, Souvik Maiti, Subhagata Ghosh, Suchir Arora, Sudipto Saha, Sumit K. Bag, Sumit Deb, Vani Brahmachari, Vanika Gupta, Vikram Kumar, Vinod Scaria, Yasha Bhasin, Yogendra Singh, [OSDD Consortium]

*G.N. Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Mall Road, Delhi 110 007, India
* Email: ramuigib@gmail.com

Abstract
M. tuberculosis is a dreaded pathogen that causes the respiratory disease tuberculosis, with high rates of mortality (approximately 1 death per 1.5 min in India alone). The rate of susceptibility to M. tuberculosis infection is extremely high, and the circulation of drug-resistant strains aggravates this problem several fold. While considerable efforts are focused on identifying drug targets and new vaccine candidates, novel strategies and new molecules are required to continuously battle the problem of drug resistance. In this scenario, it is very important to carry out systems modeling in order to simulate the dynamic tracts of molecular transformations or to select potential drug targets using an integrative approach. The OSDD Consortium has been contributing towards this aim by dividing the drug discovery pipeline into various Work Packages, which in turn are implemented as Projects managed online by Project Managers. In order to enable data analysis through integrative querying, I have packaged the available data into the R environment. The R environment is open source, comes with many functions and can be readily interfaced with Bioconductor. Kinetic data, wherever available, can be used to carry out simulations using packages available in the R repository.

Availability
The M. tuberculosis SysBorg in R is available to download from http://sysborgtb.osdd.net/bin/view/OpenLabNotebook/SysBorgInR

Introduction
Genomics heralded a radical transformation of the biological sciences in terms of massive data collection, and systems biology is being envisioned as a field that elevates our capability to model integrative data and predict outcomes that can be tested in appropriately designed experiments. Even if rigorous modeling is not easily approachable for all biologists, integrative approaches can still be applied to address problems at the systems level, wherein analytical data from different algorithms and experiments are used in combination. While reductionist approaches have produced immensely valuable results, it is increasingly being realized that integrative approaches could provide a more holistic view of biological phenomena. In other words, we see this transition phase as part of an upward movement of value addition, as shown in Figure 1. Kitano (2002) emphasized examination of structure and dynamics at the cellular level or in the whole organism, instead of in parts, as the basis for understanding biological phenomena at the systems level. A key element of focus has been the ‘robustness’ of the system, for example the robustness of a conserved biochemical network (Morohashi et al 2002). The first proposed standard for the development of models by the systems biology community was released as the Systems Biology Markup Language (SBML) (Hucka et al 2003). This standard was developed with the aim of facilitating the sharing, evaluation and cooperative development of models.

Figure 1. The Knowledge Elevation Path of systems biology for biologists in the post genomics era. The Systems Biology data in R is at the second stage of this knowledge elevation path.

A few attempts have been made towards systems modeling in Tuberculosis.
These include the analysis of drug and stress response (Cabusora et al 2005), flux balance analysis of the mycolic acid pathway (Raman et al 2005), kinetic modeling of the tricarboxylic acid cycle and glyoxylate bypass (Singh and Ghosh 2006), an M. tuberculosis metabolic network model (Beste et al 2007), the regulatory network during growth arrest (Balázsi et al 2008), target identification through network analysis (Raman et al 2008), and the origin of drug resistance through interactome analysis (Raman and Chandra 2008). The main purpose of these investigations is to enable the identification of drug targets, and by including multiple parameters these authors present deep analytical methods for drug target identification. These approaches include comparative analysis between generic stress response and specific drug response, flux balance analysis of the mycolic acid pathway in order to identify critical points as drug targets, and more exhaustive stepwise network analysis for identification of drug targets. Identified drug targets can also be assessed by probing the systemic effects using kinetic modeling, or by investigating the cause of drug resistance through interactome analysis. In all these studies, data were sourced from different sites, and this is the usual procedure followed while setting up such analysis frameworks. The availability of data in an open source environment with many mathematical and statistical tools, such as R, can enable wide application development and analysis. Very recently, the Council of Scientific and Industrial Research (CSIR) has initiated a program on Open Source Drug Discovery, a relatively new idea in the field of drug discovery. The starting base of this program is community-collected data on Mycobacterium tuberculosis, including gene sequences, expression, function, activity, the response to drugs and host-pathogen interactions (Singh 2008).

Methods
(1) Choice of R: The high-level interpreted language R is suitable for developing new computational methods.
The successful development of Bioconductor as open software for computational biology and bioinformatics is widely appreciated, and it currently has many users (Gentleman et al 2004). Since then, several computational biology packages have been developed in the R language. Very recently, even chemistry packages have been developed in R (Cao et al 2008). The advantage of developing computational packages in R is that one can carry out the analysis locally and also build further tools and scripts. This facilitates both the development of new applications and the extension of existing ones. In addition, complex tasks can be performed using simple scripts. Another major advantage of preparing datasets and computational biology tools in R is that a large set of statistical and mathematical tools can be applied to the datasets for analysis. Furthermore, R is open source under the GNU General Public License, which allows future developments and customizations more widely. A core group takes responsibility for maintaining R, and therefore the long-term availability of this platform remains ensured.
(2) Collection of datasets: A large consortium of scientists and students invested their expertise and time to systematically collect curated data from the literature and from bioinformatics analyses. This consortium, called the M. tuberculosis SysBorg (Systems Biology of Organisms) consortium, was formed through a co-ordinated effort from the Institute of Genomics and Integrative Biology. Scientists and students participated in thematic work BLOCKS, namely ANNOTATION, DRUGS ACTIVITY, GENE EXPRESSION, HOST-PATHOGEN RELATIONSHIPS, STRAIN POLYMORPHISMS, and PATHWAYS. A seventh BLOCK called ADMINISTRATION was responsible for managing the project. The equivalent data packs in R are annot, drugs, geneexpress, hostpatho, strainpoly and pathways. Very recently a new BLOCK named others has been added in order to receive new collections.
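For scripting over the data packs, the correspondence between BLOCKS and data packs listed above can be held in a small lookup vector. This is only an illustrative sketch: the name block_to_pack is hypothetical and not part of the SysBorg release, and the pairing assumes that the two lists above are given in the same order.

```r
# Hypothetical lookup from thematic BLOCK to the name of its R data pack.
# Assumption: the BLOCK list and the data pack list appear in the same order.
block_to_pack <- c(
  "ANNOTATION"                  = "annot",
  "DRUGS ACTIVITY"              = "drugs",
  "GENE EXPRESSION"             = "geneexpress",
  "HOST-PATHOGEN RELATIONSHIPS" = "hostpatho",
  "STRAIN POLYMORPHISMS"        = "strainpoly",
  "PATHWAYS"                    = "pathways"
)

# Name of the data pack for one BLOCK
block_to_pack[["PATHWAYS"]]

# All data pack names, e.g. to loop over them in a script
unname(block_to_pack)
```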
Data folders were prepared using these names. Structured data were prepared as both *.csv files and R image data files. The *.csv files are stored in the base folders, whereas the R image data files are stored in a subfolder named R_image. All the data objects within a BLOCK can be accessed instantly by loading the R image data files using the load command. Similarly, multiple R image data files can be loaded as desired within seconds. Additional data were sourced from the Open Source Drug Discovery Consortium through the following link: http://sysborg.osdd.net

Features
(1) Getting to know the contents: A glance at the available data objects can be obtained with the command ls(). This lists all the data objects currently available from the loaded image data files. Each data object can then be explored further, step by step. The command names(dataobjectname) displays the characteristics of the data table, including the headers of the data fields. The command dim(dataobjectname) displays the number of rows and columns. This information enables users to plan their work in the subsequent steps. Much of the further work depends on knowing the data types present in the rows and columns of the dataobject. For example, the data type could be a character such as ORFid (Rv no.), a boolean answer type such as Yes or No, or a numerical type such as normalized gene expression values. The command dataobjectname[rowno.,columnno.] displays the data contained in that cell, from which one may note whether the data type is character or numeric. During this process the user will also get a glimpse of the ways to prepare complex queries, for example scripts that carry out filtering searches meeting conditional criteria.
(2) Examples of scripts:
(i) For a simple search for functions of a given set of Rv nos., define
    x <- c("Rv----", "Rv----", "Rv----", ...)
or let x be a dataobject read from a file with many Rv nos.
The relevant dataobject from sysborg is ORFFunctions. Running the following script will give the results:
    for (i in 1:3513) {
      for (j in 1:n) {
        if (as.character(ORFFunctions[i,1]) == as.character(x[j]))
          print(ORFFunctions[i,])
      }
    }
Here n is the number of entries in x, which can be obtained using the command length(x).
(ii) Example: Do "Rv----", "Rv----", "Rv----", ... have homologs in the human genome? The relevant dataobject from sysborg is HostMimicry. Running the following script will give the results:
    for (i in 1:3924) {
      for (j in 1:3) {
        if (as.character(HostMimicry[i,1]) == as.character(x[j]))
          print(HostMimicry[i,])
      }
    }
In the second and third columns, if there is no match you get No match and NM respectively, but if there is a human homolog you get the details of the topmost matching protein. NOTE: these results are more informative than a simple yes or no answer to the question.
(iii) Example: Is Rv---- a target of a known drug? The relevant dataobjects are FiveFirstLineDrug and SecondLineDrugs. Running the following scripts will give the results:
    for (i in 1:8) {
      for (j in 1:3) {
        if (as.character(FiveFirstLineDrug[i,1]) == as.character(x[j]))
          print(FiveFirstLineDrug[i,])
      }
    }
    for (i in 1:11) {
      for (j in 1:3) {
        if (as.character(SecondLineDrugs[i,1]) == as.character(x[j]))
          print(SecondLineDrugs[i,])
      }
    }
If no result is returned, the given Rv nos. in x are not targets of known drugs. If you do get an output, the Rv no. is a known target.
(iv) What are the genes in a certain pathway? The relevant dataobject is PathwayReaction. Running the following script will give the results:
    for (i in 1:952) {
      if (as.character(PathwayReaction[i,5]) == "Purine metabolism")
        print(paste(PathwayReaction[i,1], " ", PathwayReaction[i,5]))
    }
Caution: redundant entries may appear because of the KEGG data.
(v) To check for essential and non-polymorphic Rv nos., select the essential ones and leave out the known polymorphic ones by running the following script:
    x <- union(SNPintragenic[,1], SNPintergenic[,1])
    x <- union(x, InDelintragenic[,1])
    x <- union(x, InDelintergenic[,1])
    y <- HighProbabilityOfEssentialGenes[,1]
    z <- setdiff(y, x)
    z
The dataobject z contains the required Rv nos. Caution: the results depend strongly on the known data on polymorphisms in genes, so the absence of data does not imply the absence of polymorphism. However, the known polymorphic genes are certain to be excluded by this approach.
(vi) To use Bioconductor, for example to get the sequence of a given Rv id:
    library(Biobase)
    library(annotate)  # getSEQ is provided by the annotate package
    getSEQ(as.numeric(Rv2GI[1,2]))
The output is the sequence. Extending this further, the following script can be used to get the sequences of multiple Rv nos. using functions from Bioconductor. For example, one could combine the queries above and link them to this script to obtain the sequence data as well.
    for (i in 1:3989) {
      for (j in 1:3) {
        if (as.character(Rv2GI[i,1]) == as.character(x[j]))
          print(getSEQ(as.numeric(Rv2GI[i,2])))
      }
    }
(vii) To use KEGGgraph, as an example:
    library(KEGGgraph)
    pumetKGML <- system.file("extdata/mtu00230.xml", package="KEGGgraph")
    pumetpathway <- parseKGML(pumetKGML)
    pumetpathway
The output is:
    KEGG Pathway
    [ Title ]: Purine metabolism
    [ Name ]: path:mtu00230
    [ Organism ]: mtu
    [ Number ] :00230
    [ Image ] :http://www.genome.jp/kegg/pathway/mtu/mtu00230.gif
    [ Link ] :http://www.genome.jp/dbget-bin/show_pathway?mtu00230
    ------------------------------------------------------------
    Statistics:
    257 node(s)
    154 edge(s)
    66 reaction(s)
Note that the file mtu00230.xml must be downloaded from the KEGG site as described in the KEGGgraph documentation.
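Whether the KGML file is already in place can be checked before parsing. The following is a minimal sketch using only base R: system.file() returns an empty string when the file, or the KEGGgraph package itself, cannot be found. The variable names kgml_path and kgml_found are illustrative.

```r
# Look for the KGML file inside the installed KEGGgraph package.
# system.file() returns "" if the package or the file is missing.
kgml_path <- system.file("extdata/mtu00230.xml", package = "KEGGgraph")
kgml_found <- nzchar(kgml_path)

if (kgml_found) {
  message("KGML file found at: ", kgml_path)
} else {
  message("mtu00230.xml not found; download it from KEGG and place it ",
          "in the extdata folder of the KEGGgraph library")
}
```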
The downloaded file must be stored in C:\Program Files\R\R-2.8.1\library\KEGGgraph\extdata
(viii) If we wish to find adhesins that are also involved in persistence and check for their consistent expression in various strains, as a new way of approaching drug targets, we could do it as follows:
    x <- NULL
    for (i in 1:3997) {
      if (as.numeric(as.character(SurfaceAdhesion[i,2])) >= 0.7)
        x <- cbind(x, as.character(SurfaceAdhesion[i,1]))
    }
    z <- intersect(x[1,], as.character(MtbPersistance[,1]))
    w <- NULL
    for (i in 1:4686) {
      for (j in 1:8) {
        if (as.character(MtbStrainWiseExpressionZScores[i,1]) == z[j])
          w <- rbind(w, MtbStrainWiseExpressionZScores[i,])
      }
    }
(ix) Can we get ORFids belonging to one pathway, having a high Drug Target Rank and with known interactions in the human host? Taking the glycine, serine… metabolism pathway as an example and assuming that 100 is a high drug target rank, we have:
    x <- grep("^glycine,serine", PathwayReaction[1:952,5], ignore.case=TRUE)
    y <- NULL
    for (i in 1:26) {
      m <- x[i]
      y <- cbind(y, as.character(PathwayReaction[m,1]))
    }
    z <- NULL
    for (i in 1:3927) {
      if (as.numeric(as.character(DrugTargetRank[i,2])) >= 100)
        z <- c(z, as.character(DrugTargetRank[i,1]))
    }
    temp <- intersect(z, y)
    result <- intersect(temp, as.character(HumanMtbProteinInteractions[,1]))

Conclusion
At present, different types of data on M. tuberculosis and tuberculosis-related topics have to be sourced from different sites. Moreover, these data have to be organized in proper formats so that they can be analyzed further with other algorithms and software. The M. tuberculosis SysBorg in R aims to bridge these gaps. However, there are some shortcomings, in that not all data tables may have been packaged, owing either to restricted access or to data distribution control issues. The data provided at present are the publicly available data. However, users can package their own data or other publicly available data and make their own package.
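As a rough sketch of how users might package their own data in the same layout as the SysBorg data packs (a *.csv file in a base folder and an R image file in an R_image subfolder), a table can be written with write.csv() and save() and restored with load(). The object name MyOwnData, its contents and the file names below are hypothetical and are not part of the SysBorg release.

```r
# Hypothetical example table; in practice this would be the user's own data.
MyOwnData <- data.frame(
  ORFid = c("RvA", "RvB", "RvC"),   # placeholder identifiers, not real Rv nos.
  Score = c(0.91, 0.42, 0.77),
  stringsAsFactors = FALSE
)

# Work in a temporary folder so the sketch does not touch any real data folders.
work_dir <- tempfile("mydata_")
dir.create(file.path(work_dir, "R_image"), recursive = TRUE)

csv_file   <- file.path(work_dir, "MyOwnData.csv")
image_file <- file.path(work_dir, "R_image", "MyOwnData.RData")

# Store the table both as a *.csv file and as an R image data file.
write.csv(MyOwnData, csv_file, row.names = FALSE)
save(MyOwnData, file = image_file)

# Remove the object and restore it from the image file, as a new user would.
rm(MyOwnData)
load(image_file)
dim(MyOwnData)
```

An image file prepared this way can be loaded alongside the SysBorg image files and queried with the same kind of scripts.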
In this sense, this platform truly adheres to the basic tenets of open source. Relevant data for modeling can also be extracted with simple commands and fed as input into programs like CellDesigner for simulation work. One can also use the deSolve package from R to solve differential equations, giving identical results. As shown above, one can also use the functions of Bioconductor or KEGGgraph to get additional information or carry out further analysis. As more analysis packages appear in R, they can be integrated with M. tuberculosis SysBorg, thereby enhancing its capabilities further. We as authors envisage many more developments by bright young minds in order to contribute effectively to Tuberculosis research.

Availability
The M. tuberculosis SysBorg in R is available to download from http://sysborgtb.osdd.net/bin/view/OpenLabNotebook/SysBorgInR
We encourage students to contribute their enhancements to this first release back to this community.

Authors
Contributors AB, BKM, BS, BP, BT, CM, CD, DD, GPSR, IS, MB, MVR, NG, RS, RR, SKB, SC, SM, VB, VS, YS were advisors, and the rest were students who collected data. Data curation was done by all.

Acknowledgements
SR thanks Prof. S.K. Brahmachari for giving an opportunity to develop this platform, Mr. Zakir Thomas for giving guidance in open source, Dr. Andrew M. Lynn for constant encouragement, CSIR for research grants under the task force “In silico biology for drug target identification” CMM0017 and “Open Source Drug Discovery” IAP0008, DBT for the “National Bioscience Award Grant”, and all colleagues who have contributed to this platform.

References:
1. Balázsi G, Heath AP, Shi L, Gennaro ML. 2008. The temporal response of the Mycobacterium tuberculosis gene regulatory network during growth arrest. Mol Syst Biol. 4:225.
2. Beste DJ, Hooper T, Stewart G, Bonde B, Avignone-Rossa C, Bushell ME, Wheeler P, Klamt S, Kierzek AM, McFadden J. 2007. GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism. Genome Biol. 8(5):R89.
3. Cabusora L, Sutton E, Fulmer A, Forst CV. 2005. Differential network expression during drug and stress response. Bioinformatics. 21(12):2898-905.
4. Cao Y, Charisi A, Cheng LC, Jiang T, Girke T. 2008. ChemmineR: a compound mining framework for R. Bioinformatics. 24(15):1733-4.
5. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J. 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5(10):R80.
6. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J; SBML Forum. 2003. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 19(4):524-31.
7. Kitano H. 2002. Systems biology: a brief overview. Science. 295(5560):1662-4.
8. Morohashi M, Winn AE, Borisuk MT, Bolouri H, Doyle J, Kitano H. 2002. Robustness as a measure of plausibility in models of biochemical networks. J Theor Biol. 216(1):19-30.
9. Raman K, Rajagopalan P, Chandra N. 2005. Flux balance analysis of mycolic acid pathway: targets for anti-tubercular drugs. PLoS Comput Biol. 1:e46.
10. Raman K, Yeturu K, Chandra N. 2008. targetTB: a target identification pipeline for Mycobacterium tuberculosis through an interactome, reactome and genome-scale structural analysis. BMC Syst Biol. 2:109.
11. Raman K, Chandra N. 2008. Mycobacterium tuberculosis interactome analysis unravels potential pathways to drug resistance. BMC Microbiol. 8:234.
12. Singh VK, Ghosh I. 2006. Kinetic modeling of tricarboxylic acid cycle and glyoxylate bypass in Mycobacterium tuberculosis, and its application to assessment of drug targets. Theor Biol Med Model. 3:27.
13. Singh S. 2008. India takes an open source approach to drug discovery. Cell. 133, April 18.

Systems Biology of Malaria: An Indian Perspective

Ashis Das3, Dhanpath Kochar2 and Utpal Tatu1*
1 Department of Biochemistry, Indian Institute of Science, Bangalore, 560012, Karnataka, India.
2 Department of Medicine, S. P. Medical College, C-54, Sadul Ganj, Bikaner, Rajasthan 334003, India.
3 Biological Sciences Group, BITS-Pilani, Rajasthan - 333031, India.
*Corresponding author: Department of Biochemistry, Indian Institute of Science, Bangalore, 560012, Karnataka, India Tel: +91-080-22932823; Fax: +91-808-23600814/23600683
* E-Mail: tatu@biochem.iisc.ernet.in

Introduction:
Malaria is a prehistoric disease. It is believed that malaria may have contributed to the extinction of dinosaurs from the earth. There is documentation of malaria in our old civilizations, Egyptian