Bioinformatics Tools for Nucleotide Sequence Analysis and Database Exploration
Varij Nayan and Anuradha Bhardwaj
Bioinformatics
Research, Development, or Application of Computational Tools and Approaches for Expanding the Use of Biological, Medical, Behavioral, or Health Data, Including Those to Acquire, Store, Organize, Archive, Analyze, or Visualize Such Data
(Working Definition of NIH Biomedical Information Science & Technology Initiative Consortium)
What is a database?
A convenient method of collecting vast amounts of information
Allows for proper storing, searching, and retrieving of data
Before data can be analyzed, they must be assembled into central, shareable resources
Why databases?
A means to handle and share large volumes of biological data
Support large-scale analysis efforts
Make data access easy and keep data updated
Link knowledge obtained from various fields of biology and medicine
Biological Databases
Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses
Contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics
Information includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations, and similarities of biological sequences and structures
Features
Most databases have a web interface for searching the data
The common mode of searching is by keyword
Users can choose to view the data or save them to their own computer
Cross-references help to navigate easily from one database to another
Biological Databases
Type of database: Information it contains
Bibliographic databases: Literature
Taxonomic databases: Classification
Nucleic acid databases: DNA information
Genomic databases: Gene-level information
Protein databases: Protein information
Protein families, domains and functional sites: Classification of proteins and identification of domains
Enzymes/metabolic pathways: Metabolic pathways
Types of Biological Databases Accessible
Primary databases
Secondary databases
Composite databases
Primary databases (archival/annotated)
Contain sequence data such as nucleic acid or protein
Annotation implies extraction, definition and interpretation of features on the genome sequence
Examples of nucleic acid databases are EMBL, DDBJ, and NCBI GenBank.
International Nucleotide Sequence Database Collaboration
DDBJ: DNA Data Bank of Japan
CIB-DDBJ: Center for Information Biology and DNA Data Bank of Japan
NIG: National Institute of Genetics
EBI: European Bioinformatics Institute
EMBL: European Molecular Biology Laboratory
NCBI: National Center for Biotechnology Information
NLM: National Library of Medicine
IAC: International Advisory Committee
ICM: International Collaborative Meeting
EMBL Nucleotide Sequence Database
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource
It is part of the Protein and Nucleotide Database Group (PANDA)
[Link]/embl/
DNA Data Bank of Japan (DDBJ)
DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG), with the endorsement of the Ministry of Education, Science, Sport and Culture
It is the sole DNA data bank in Japan officially certified to collect DNA sequences from researchers and to issue internationally recognized accession numbers to data submitters
[Link]
NCBI GenBank
Bethesda, MD
NCBI was established on November 4, 1988 as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), United States.
The National Center for Biotechnology Information
Accepts submissions of primary data
Develops tools to analyze these data
Creates derivative databases based on the primary data
Provides free search, link, and retrieval of these data, primarily through the Entrez system
Secondary databases Curated and Composite databases
Secondary databases
Sometimes known as pattern databases
Contain results from the analysis of the sequences in the primary databases
Composite databases
Combine different primary database sources
Make querying and searching efficient, without the need to visit each primary database separately
Example: nrDB (Non-Redundant DataBase)
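The idea of a composite, non-redundant collection can be sketched in a few lines. This is only an illustration, assuming that "non-redundant" means collapsing identical sequences from different sources into one entry; the function name and the toy records are hypothetical, and the actual rules used to build nrDB are not described here.

```python
def merge_non_redundant(*sources):
    """Merge {identifier: sequence} dicts into {sequence: [identifiers]}.

    Assumption: two records are redundant when their sequences are
    identical (case-insensitive). Every source identifier is kept so the
    merged entry can still be traced back to the primary databases.
    """
    merged = {}
    for source in sources:
        for identifier, sequence in source.items():
            merged.setdefault(sequence.upper(), []).append(identifier)
    return merged


# Hypothetical toy records standing in for two primary databases.
db_a = {"A1": "ACGT", "A2": "GGCC"}
db_b = {"B1": "acgt", "B2": "TTTT"}
composite = merge_non_redundant(db_a, db_b)
```

A single query against `composite` then covers both sources, which is the practical benefit the slide describes.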
Secondary Databases and Composite Databases
DNA
RNA
Protein
cDNA
DNA databases derived from GenBank containing data for a single gene:
- Non-redundant (nr)
- dbGSS (genome survey sequences)
- dbHTGS (high throughput)
- dbSTS (sequence tagged site)
- LocusLink*
- RefSeq
RNA (cDNA) databases derived from GenBank containing data for a single gene:
- dbEST (expressed sequence tag)
- UniGene
- LocusLink*
- RefSeq
RefSeq (Reference Sequence)
Curated collection of DNA, RNA, and protein sequences built by NCBI
Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes
Limited to major organisms for which sufficient data are available
GenBank versus RefSeq
GenBank:
- Not curated
- Author submits
- Only author can revise
- Multiple records for same loci common
- No limit to species included
- Data exchanged among INSDC members
- Akin to primary literature
- Proteins identified and linked
- Access via NCBI Nucleotide databases
RefSeq:
- Curated
- NCBI creates from existing data
- NCBI revises as new data emerge
- Single records for each molecule of major organisms
- Limited to model organisms
- Exclusive NCBI database
- Akin to review articles
- Proteins and transcripts identified and linked
- Access via Nucleotide & Protein databases
Other nucleotide sequence databases
UniGene
SGD (Saccharomyces Genome Database)
EBI Genomes: completed genomes, and information about ongoing projects
Genome Biology: available complete genomes
Ensembl: joint project between EMBL-EBI and the Sanger Centre
Nucleotide sequence analysis
Map Viewer
Model Maker
SAGEmap
UniGene, ProtEST, and DDD
ORFfinder
Electronic PCR
VecScreen
Spidey
Nucleotide BLAST
Map Viewer
Complete genome maps, from cytogenetic and physical maps down to the sequence level
Accessible for 110 organisms:
Vertebrates: 17
Invertebrates: 12
Protozoa: 18
Plants: 46
Fungi: 17
[Link]
Human PAPP-A Gene
(Spotted on Chromosome 9 using Map Viewer)
Maps can be sequence-based or not (e.g., cytogenetic maps or radiation hybrid maps)
It is possible to access a map view and zoom into progressively more detailed views
Maps are linked to several resources, such as UniGene clusters, Evidence Viewer, and Model Maker
Model Maker
Used for the construction of transcript models by the assembly of putative exons
Exons may be derived from predictions or from alignments of ESTs or mRNAs to the genomic sequence
Once the transcript is created, potential ORFs (open reading frames) and their translations are shown
SAGEmap
Online resource to store, retrieve, and compare Serial Analysis of Gene Expression (SAGE) profiles
SAGE libraries are derived from the Cancer Genome Anatomy Project (CGAP) as well as from GenBank SAGE tags
SAGEmap accepts user-submitted libraries
Different libraries can be compared
[Link]
UniGene, ProtEST, and DDD
UniGene: [Link]
A system for the automatic clustering of GenBank sequences and ESTs into nonredundant groups
The UniGene project tries to identify all ESTs generated from the same gene, overcoming problems caused by EST sequence errors
UniGene then stores libraries of clustered ESTs for a given organism, tissue, organ, or pathological condition
ProtEST
A tool that uses BLASTX to search sequence databases (Swiss-Prot, PIR, PDB, PRF) with possible translations of UniGene clusters
Proteomes from eight organisms (human, mouse, rat, Drosophila, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, Escherichia coli) are used for the comparison, and the best match in each organism is presented to the user
DDD (Digital Differential Display)
Tool for comparing EST-based expression profiles among different UniGene libraries
Aim: finding genes related to
1. tissue-specific or organ-specific processes
2. specific pathologies
3. different developmental stages
ORFfinder
[Link] /[Link]
tool for the identification of all ORFs in a user-submitted sequence or in a sequence in the GenBank database
If an open reading frame is found, the amino acid translation can be used for similarity search by means of BLAST or in the COGs database.
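The ORF search described above can be sketched as follows. This is a minimal illustration, assuming an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG, TGA) and that all three frames on both strands are scanned. The function names and the `min_codons` parameter are hypothetical and do not reflect ORFfinder's actual interface or options.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Yield (strand, frame, start, end, orf) for ATG...stop stretches.

    start/end are 0-based, end-exclusive positions on the scanned strand.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ATG in this frame


orfs = list(find_orfs("CCATGAAATAGGG"))
```

Each reported ORF could then be translated and passed to a similarity search, as the slide suggests.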
Electronic PCR
[Link]
looks for potential STSs given a pair of PCR primers and a DNA sequence
looks for DNA subsequences that are closely similar to the primers, and checks if order, orientation, and spacing are correct
Two ways:
1. Forward (searching an STS database with a sequence): useful to map a sequence on a genome using a large database of known STSs (UniSTS)
2. Reverse (searching a sequence database with an STS): for the prediction of PCR products in a selected genome given one or more pairs of primers
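The primer check (order, orientation, spacing) can be illustrated with a small sketch. It is a simplification: it requires exact primer matches on one strand, whereas the tool described above also accepts closely similar sites. The function names and size limits are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Return every start index of pattern in text (overlaps allowed)."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def electronic_pcr(template, forward, reverse, min_size=50, max_size=1000):
    """Return (start, end, size) for each predicted product.

    forward must match the template as written; reverse is given 5'->3'
    and must match as its reverse complement downstream of forward.
    """
    template = template.upper()
    forward, reverse = forward.upper(), reverse.upper()
    rev_site = reverse_complement(reverse)
    products = []
    for f in find_all(template, forward):
        for r in find_all(template, rev_site):
            end = r + len(rev_site)
            size = end - f
            # Order and orientation: reverse site lies after the forward primer.
            # Spacing: product size falls inside the allowed window.
            if r >= f + len(forward) and min_size <= size <= max_size:
                products.append((f, end, size))
    return products


template = "TT" + "ACGTAC" + "GGGGGGGG" + "TTCAGG" + "AA"
hits = electronic_pcr(template, "ACGTAC", "CCTGAA", min_size=10, max_size=100)
```

Tightening `max_size` below the product length makes the same primer pair report no product, which mirrors the spacing check.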
VecScreen
[Link]
A system for identifying segments of a nucleic acid sequence that may result from contamination of vector origin (plasmid, phage, cosmid, YAC DNA), as well as from linkers, adapters, and primers
Aims to minimize the incidence and impact of such contamination in public sequence databases
Spidey
Tool for aligning one or more mRNAs (FASTA-format sequences or accession numbers) to a single eukaryotic genomic sequence, determining the exon/intron structure of the query messenger
[Link]
Uses BLAST searches to identify a genome window that covers the entire mRNA length, then refines the alignment to align each exon, taking predicted splice sites into account
Four splice-site matrices can be used: vertebrate, Drosophila, C. elegans, and plant
Spidey output is an alignment for each exon, each evaluated for its quality
BLAST Implementations
BLAST
Basic Local Alignment Search Tool
Program for sequence similarity searching developed at NCBI
Instrumental in identifying genes and genetic features
Executes sequence searches against the database of stored sequences
Local and global alignments
Global
Local
FASTA vs BLAST
BLAST is faster than FASTA
Similar search strategy
Sensitivity:
Protein searches: BLAST and FASTA are comparable
Nucleotide searches: FASTA is more sensitive
Smith-Waterman (S-W) is the most sensitive, but time-consuming
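Why Smith-Waterman is slow can be seen from a sketch of its scoring recurrence: every cell of a query-by-subject table is filled, so the work grows with the product of the two lengths. The version below returns only the best local score and assumes a simple linear gap penalty; the scoring values are illustrative defaults, not those of any particular program.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Assumptions: linear gap penalty, fixed match/mismatch scores.
    Only two rows of the table are kept because just the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("TTTACGTTTT", "GGACGTGG")
```

BLAST and FASTA avoid filling the whole table by first locating short matching words, which is where their speed advantage comes from.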
BLAST USES
Helps infer the identity and function of a query sequence
Helps to direct experimental design to prove the function of the sequence
Finds similar sequences in other organisms
Compares genomes against each other to find similarities and differences
Blast: A Family of Programs
Query: DNA or protein
Database: DNA or protein
BlastN: nucleotide query versus nucleotide database
BlastP: protein query versus protein database
BlastX: translated nucleotide query versus protein database
tBlastN: protein query versus translated nucleotide database
tBlastX: translated nucleotide query versus translated nucleotide database
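The query/database combinations above amount to a small lookup, sketched here. The function and its `translated` flag are hypothetical conveniences; only the five program names come from the list above, written in the lower-case form used on the command line.

```python
def choose_blast_program(query_type, db_type, translated=False):
    """Pick a BLAST program from the query and database molecule types.

    translated only matters for nucleotide-versus-nucleotide searches,
    where it selects tblastx (both sides translated) instead of blastn.
    """
    key = (query_type, db_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    if key not in table:
        raise ValueError(f"unsupported combination: {key}")
    return table[key]


program = choose_blast_program("nucleotide", "protein")
```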
Nucleotide BLAST
Compares a nucleotide sequence against a database of nucleotide sequences
BLASTn
General purpose nucleotide search and alignment program that is sensitive and can be used to align tRNA or rRNA sequences as well as mRNA or genomic DNA sequences containing a mix of coding and noncoding regions.
MegaBLAST
Up to 10 times faster than blastn
designed to align sequences that are nearly identical, differing by only a few percent from one another
allows the rapid mapping of a transcript onto a typical 3 billion base mammalian genome in seconds, and is useful for processing large batches of sequences
discontiguous MegaBLAST
uses a discontiguous template to define an initial word in which characters in some positions, such as those in the wobble base position of codons, need not match
allows rapid cross-species mappings involving coding regions in cases where species differences in codon usage would prevent alignments using the original MegaBLAST program
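The discontiguous word idea can be shown with a toy template in which every third position (the wobble base) is ignored. The template string `110110110` and the function names are assumptions made for illustration, not the templates the program actually ships with, and the search below is a brute-force scan rather than an indexed lookup.

```python
def template_match(word_a, word_b, template):
    """Return True if the two words agree at every position marked '1'."""
    if not (len(word_a) == len(word_b) == len(template)):
        raise ValueError("words and template must have equal length")
    return all(x == y for x, y, t in zip(word_a, word_b, template) if t == "1")


def find_seeds(query, subject, template):
    """Return (query_pos, subject_pos) pairs whose windows fit the template."""
    w = len(template)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            if template_match(query[i:i + w], subject[j:j + w], template):
                seeds.append((i, j))
    return seeds


# Two codings of the same peptide that differ only at wobble positions.
query = "GCTGAAAAG"
subject = "GCCGAGAAA"
discontiguous = find_seeds(query, subject, "110110110")
contiguous = find_seeds(query, subject, "111111111")
```

With the all-ones template the two sequences share no seed, while the wobble-tolerant template finds one, which is the cross-species benefit described above.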
How to run a BLAST query
FASTA format
Query DNA or protein sequence must be in FASTA format
The FASTA definition line ("def line") begins with a >, followed by text that briefly describes the query sequence on a single line
The sequence follows on the subsequent lines, with up to 80 nucleotide bases or amino acids per line
>DinoDNA "Dinosaur DNA" from Crichton's JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
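The FASTA layout (one definition line starting with >, then wrapped sequence lines) is simple enough to read and write by hand. The sketch below is a minimal illustration with hypothetical function names; established libraries offer more complete parsers.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)  # sequence lines are joined back together
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at width characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


records = parse_fasta(">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n")
wrapped = format_fasta("x", "ACGTAC", width=4)
```

The default `width=60` matches the 60-base lines of the DinoDNA example and stays within the 80-character limit mentioned above.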
How to run a BLAST query
Select nucleotide blast
Paste sequence into search box
Select database
Click
BLAST OUTPUT
Results 1: distribution
Graphical representation of hits
BLASTn
MEGABLAST
Discontiguous Megablast
Results 2: sequences with specific alignments
Description
Links to relevant records in other databases
Link to Entrez
E-value: estimate of statistical significance (6e-62 = 6 × 10^-62)
Results 3: alignments
Shows the actual alignments
What do the numbers mean?
Bit score:
Indicates how good the alignment is; the higher the score, the better the alignment. Score is calculated from a formula which takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences
E-value: Expect value
Describes the number of hits one can expect to see by chance when searching a database of a particular size. Essentially, the E-value describes the random background noise that exists for matches between sequences. The lower the E-value, or the closer it is to 0, the higher the significance of the match. Short query sequences can match virtually identically and still have a relatively high E-value, because shorter sequences have a high probability of occurring in the database purely by chance.
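The link between bit score, search-space size, and E-value can be made concrete with the standard relation E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the database length. The sketch below uses raw lengths; actual BLAST reports are based on adjusted "effective" lengths, so the numbers are only indicative, and the function name and example values are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Return E = m * n * 2**(-S') for bit score S'.

    Assumption: raw lengths are used for m and n, without the
    effective-length correction applied by real BLAST searches.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Same query and database, one extra bit of score: E is halved.
e_low = expect_value(40, 500, 1_000_000_000)
e_high = expect_value(41, 500, 1_000_000_000)

# Same score, database twice as large: E is doubled.
e_big_db = expect_value(40, 500, 2_000_000_000)
```

Because a short alignment cannot accumulate many bits, its E-value stays comparatively high against a large database even when the match is perfect, which is the effect described above.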
blastn is more sensitive than MEGABLAST because it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms.
MEGABLAST is the tool of choice to identify a nucleotide sequence (it is specifically designed to efficiently find long alignments between very similar sequences).
Discontiguous MEGABLAST is better at finding nucleotide sequences that are similar, but not identical, to the nucleotide query.