8024 Bio Info
What is Bioinformatics?
Bioinformatics is a relatively new interdisciplinary field that integrates computer science, mathematics, biology, and information technology to manage, analyze, and understand biological, biochemical, and biophysical information. Bioinformatics is a computational science and a subset of the larger field of computational biology.
Bioinformatics is the use of computers to study biology
Bioinformatics is the science of using information to understand biology
Bioinformatics is the integration of information technology (IT) and biology
Bioinformatics is the development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes
Some Terminology
The cell is the primary unit of life
A cell consists of molecules, chemical reactions, and a copy of the genome for that organism
All life on this planet depends on three types of molecules: DNA, RNA, and proteins
DNA
Holds information on how the cell works
RNA
Acts to transfer short pieces of information to different parts of the cell
Provides templates for protein synthesis
Proteins
Form enzymes that send signals to other cells and regulate gene activity
Form the body's major components (e.g. hair, skin, etc.)
DNA
Genetic material
Consists of two long strands
Each strand is made of phosphates, sugars, and nucleotide bases:
A (adenine)
G (guanine)
C (cytosine)
T (thymine)
Information is transferred from DNA (the information storage molecule) to RNA (the information transfer molecule) to a specific protein (the functional product).
More Terminology
Transcription of DNA
DNA is transcribed into RNA
RNA exists as a single-stranded unit and can also form a double helix
RNA consists of A, C, G, and U (uracil)
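The relationship between the DNA and RNA alphabets can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is the coding (sense) strand written 5' to 3', so that the RNA has the same sequence with U in place of T. The function name `transcribe` is a hypothetical choice for this example, not part of any library.

```python
def transcribe(dna):
    """Return the RNA sequence for a DNA coding strand (T is replaced by U)."""
    dna = dna.upper()
    # Reject anything outside the four-letter DNA alphabet.
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA sequence, found: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```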
Types of RNA
Messenger RNA (mRNA)
Transfer RNA (tRNA)
Ribosomal RNA (rRNA)
mRNA is translated into protein
Proteins: linear polymers built from amino acids
The transfer of information from DNA to specific protein via RNA takes place according to the genetic code.
The RNA sequence is divided into blocks of three letters
Each block is called a codon
Each codon corresponds to a specific amino acid
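The codon idea can be made concrete with a short Python sketch. It builds the standard genetic code from the conventional U, C, A, G ordering, splits an RNA string into three-letter blocks, and translates them into one-letter amino acid codes. The names `split_codons` and `translate` are hypothetical, and the sketch simplifies real translation: it starts reading at the first letter rather than searching for a start codon, and it stops at the first stop codon (written `*` in the table).

```python
# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def split_codons(rna):
    """Split an RNA string into three-letter blocks, dropping leftover letters."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


def translate(rna):
    """Translate RNA into one-letter amino acid codes, stopping at a stop codon."""
    protein = []
    for codon in split_codons(rna.upper()):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon terminates translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(split_codons("AUGGCCUAA"))  # ['AUG', 'GCC', 'UAA']
print(translate("AUGGCCUAA"))     # MA
```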
Four different nucleotides are used to build DNA and RNA molecules: A, G, C, T in DNA and A, G, C, U in RNA
20 different amino acids are used in protein synthesis
Four nucleotides can be arranged in 64 different combinations of three, so there are 64 = 4*4*4 different codons
Some codons are redundant (several codons specify the same amino acid) and some have the special function of terminating the translation process
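The counting argument can be checked directly. The sketch below enumerates every word of a given length over the RNA alphabet with the standard library's `itertools.product`; the helper name `words` is a hypothetical choice. It also shows why three letters are needed: two-letter words give only 16 combinations, fewer than the 20 amino acids, while three-letter words give 64.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def words(length):
    """Return every possible nucleotide word of the given length."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=length)]


print(len(words(2)))  # 16, too few to encode 20 amino acids
print(len(words(3)))  # 64 = 4*4*4 codons
```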
Traditionally, research was carried out entirely in the experimental laboratory, but the huge increase in data in the genomic era has created a need to incorporate computers into the research process.
There are three central biological processes around which bioinformatics tools must be developed:
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to:
trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone
build complex computational models of populations to predict the outcome of the system over time
track and share information on an increasingly large number of species and organisms
The need for storing and communicating large datasets has grown
Databases make biological data available to scientists
Databases make biological data available in computer-readable form
Types of data
nucleotide sequences
protein sequences
protein sequence patterns or motifs
macromolecular 3D structure
gene expression data
metabolic pathways
Primary databases: experimental results entered directly into the database
Secondary databases: results of analysis of primary databases
Composite databases: aggregates of many databases
EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases
EMBL www.ebi.ac.uk/embl/
GenBank www.ncbi.nlm.nih.gov/Genbank/
DDBJ www.ddbj.nig.ac.jp
GenBank
An annotated collection of all publicly available nucleotide and protein sequences
Set up in 1979 at LANL (Los Alamos)
Maintained since 1992 by NCBI (Bethesda)
http://www.ncbi.nlm.nih.gov
EMBL
An annotated collection of all publicly available nucleotide and protein sequences
Created in 1980 at the European Molecular Biology Laboratory in Heidelberg
Maintained since 1994 by EBI, Cambridge
http://www.ebi.ac.uk/embl.html
DDBJ
An annotated collection of all publicly available nucleotide and protein sequences
Started in 1984 at the National Institute of Genetics (NIG) in Mishima
Still maintained at this institute by a team led by Takashi Gojobori
http://www.ddbj.nig.ac.jp
EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).
GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.
HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.
SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support data-gathering efforts.
STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.
UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene, annotated with mapping and expression information and cross-references to other sources.
Bioinformatics Tools
BLAST:
The Basic Local Alignment Search Tool (BLAST) compares gene and protein sequences against others in public databases. It now comes in several types, including PSI-BLAST, PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available.
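To give a feel for what "local alignment" means, the sketch below scores the best-matching region shared by two sequences using the Smith-Waterman dynamic programming recurrence. This is not how BLAST itself works: BLAST uses faster heuristics to approximate this kind of search against large databases. The function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score between a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can restart anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("TTACGTTT", "GGACGGG"))  # 6, from the shared region ACG
```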
FASTA
FASTA is a database search tool used to compare a nucleotide or peptide sequence to a sequence database. It was the first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words".
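The word-scanning step can be illustrated with a simplified sketch. It indexes every word of length k in a target sequence, looks up each query word in that index, and reports which diagonal (target offset minus query offset) collects the most hits. This covers only the initial word-matching idea, not the full FASTA program with its scoring and rescoring stages. The function names and the default k=3 are hypothetical.

```python
from collections import Counter, defaultdict


def index_words(sequence, k):
    """Map each word of length k to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def shared_words(query, target, k=3):
    """Return (query_pos, target_pos) pairs where a word of length k matches."""
    target_index = index_words(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        for t in target_index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal with the most word hits."""
    if not hits:
        return None
    return Counter(t - q for q, t in hits).most_common(1)[0]


hits = shared_words("ACGT", "TTACGTT")
print(hits)                 # [(0, 2), (1, 3)]
print(best_diagonal(hits))  # (2, 2)
```

Hits that fall on the same diagonal suggest a stretch where the two sequences line up without gaps, which is where a fuller alignment would then be attempted.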
ClustalW
ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences, calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.
RasMol
RasMol is a powerful research tool for displaying the structure of DNA, proteins, and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier-to-use program.
DeepView (also known as Swiss-PdbViewer)
A tool for viewing and exploring macromolecular models in three dimensions, and for manual and semi-automated homology modeling.
Conclusion
Bioinformatics in India is at an early stage of development. But at 4 to 5 centers in the country, one sees a mature understanding of the needs of this sector and world-class development of tools and applications. These centers will ensure that India's traditional strengths in IT are leveraged to place us on par with the developed countries.