Genome Science (IBB.MB.501)
IBB.MB.501
Database search and sequence alignment
Introduction
– Over the past five decades the use of computers has had a profound
effect on research in the biological sciences
– The amount of information available to researchers in databases is
increasing almost exponentially, and biologists and computer scientists
are coming together to provide bioinformatics tools that help extract useful
information from these databases
– The aim of these sessions is to introduce you to the use of some of the
information and software resources available in the public domain
Searching sequence databases
– Sequence databases exist for nucleic acids, proteins and complex
carbohydrates
– For nucleic acids and proteins the chemical structure is represented as
a string of characters, such as ACCGTA for nucleic acids or DFGIMCR
for proteins
– Database entries include much more information, or annotation, which
contains the biological, bibliographic and administrative context for
the sequence.
NA databases
– For nucleic acids, there are three major public domain databases:
– European Nucleotide Archive, from EMBL-EBI
– GenBank, from NCBI (USA)
– DDBJ (DNA DataBank of Japan)
– All three exchange data daily, so their contents are essentially identical
Sequence Alignment
Job Dispatcher
Bioinformatics Tools (nucleotide and protein analysis)
❏50+ tools
❏Recently added:
❏R2DT
❏SSEARCH2SEQ
❏GGSEARCH2SEQ
Tool Categories
▪ Sequence Format Conversion (sfc)
▪ Protein Function Analysis (pfa)
▪ Sequence Operation (so)
▪ Sequence Statistics (seqstats)
▪ Sequence Translation (st)
▪ RNA Analysis (rna)
▪ Phylogeny (phylogeny)
▪ Pairwise Sequence Alignment (psa)
▪ Multiple Sequence Alignment (msa)
▪ Sequence Similarity Search (sss)
▪ Emboss Tools (emboss)
Sequence Format Conversion (sfc)
❏Convert one sequence format to another.
❏EMBOSS Seqret, MView
Advantages
• No local installation
• Workflows
Typical Bioinformatics Setup
How to access tools?
– EBI Service page:
– https://www.ebi.ac.uk/services
– Job dispatcher Tool category page:
– https://www.ebi.ac.uk/Tools/<category>
– E.g. https://www.ebi.ac.uk/Tools/msa (Multiple Sequence Alignment)
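The category URL pattern can be sketched in a few lines of Python. The category names and short codes are the ones listed on the Tool Categories slide; the function name `category_url` is made up for this illustration, and the sketch only builds the address, it does not contact the server.

```python
# Category names and short codes as listed on the "Tool Categories" slide.
CATEGORY_CODES = {
    "Sequence Format Conversion": "sfc",
    "Protein Function Analysis": "pfa",
    "Sequence Operation": "so",
    "Sequence Statistics": "seqstats",
    "Sequence Translation": "st",
    "RNA Analysis": "rna",
    "Phylogeny": "phylogeny",
    "Pairwise Sequence Alignment": "psa",
    "Multiple Sequence Alignment": "msa",
    "Sequence Similarity Search": "sss",
    "Emboss Tools": "emboss",
}

BASE_URL = "https://www.ebi.ac.uk/Tools"


def category_url(category_name):
    """Return the Job Dispatcher page address for a tool category (hypothetical helper)."""
    code = CATEGORY_CODES[category_name]  # raises KeyError for an unknown category
    return f"{BASE_URL}/{code}"


print(category_url("Multiple Sequence Alignment"))
```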
https://www.ebi.ac.uk/services
Sequence Alignment
– Identify regions of similarity
Sequence Alignment
– Match, Mismatch, Gap
– Gap extension penalty
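As a rough illustration of how match, mismatch and gap penalties combine, the sketch below scores an alignment that has already been written out with "-" for gaps. The numeric values are arbitrary choices for the example, and the convention used here (the first position of a gap costs the opening penalty, each further position costs the extension penalty) is one common convention; individual tools may define it differently.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score a gapped pairwise alignment with a separate gap extension penalty.

    Assumed convention: the first position of a gap costs gap_open,
    every following position of the same gap costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in_top = gap_in_bottom = False  # are we currently inside a gap?
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-":
            score += gap_extend if gap_in_top else gap_open
            gap_in_top, gap_in_bottom = True, False
        elif y == "-":
            score += gap_extend if gap_in_bottom else gap_open
            gap_in_top, gap_in_bottom = False, True
        else:
            score += match if x == y else mismatch
            gap_in_top = gap_in_bottom = False
    return score


# 6 matches (+12), 1 mismatch (-1), one gap of length 2 (-5 - 1)
print(score_alignment("ACGTTGCA", "ACG--GCT"))  # 3
```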
Sequence Alignment
– Similarity and Identity
– Substitution matrix
– https://github.com/kimrutherford/EMBOSS/tree/master/emboss/data
– Alignment score: calculated from the matches and mismatches of aligned
residues, using the scores in the substitution matrix
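The difference between identity and similarity can be sketched as follows. Identity counts columns with the same residue; similarity, in this sketch, counts columns whose substitution score is positive. The scores below are toy values chosen for illustration, not copied from a real matrix file, and percentages are taken over the full alignment length including gap columns, which is an assumption about the convention.

```python
# Toy substitution scores (illustrative only): identical residues score 4,
# a few chemically similar pairs score 2, every other pair scores -1.
SIMILAR_PAIRS = {frozenset("LI"): 2, frozenset("KR"): 2, frozenset("DE"): 2}


def pair_score(x, y):
    if x == y:
        return 4
    return SIMILAR_PAIRS.get(frozenset((x, y)), -1)


def identity_and_similarity(top, bottom):
    """Return (% identity, % similarity) over the whole alignment length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for x, y in zip(top, bottom):
        if "-" in (x, y):
            continue  # gap columns count towards the length only
        if x == y:
            identical += 1
        if pair_score(x, y) > 0:
            similar += 1
    length = len(top)
    return 100 * identical / length, 100 * similar / length


print(identity_and_similarity("MKLDVAAG-W", "MRIEVSAGTW"))  # (50.0, 80.0)
```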
Sequence Alignment Types
– Pairwise
– Multiple
Pairwise Sequence Alignment
– Involves aligning two sequences using a scoring matrix
– Basis of database similarity search
– Dynamic programming for global alignment: Needleman-Wunsch
algorithm
– Dynamic programming for local alignment: Smith-Waterman
algorithm
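A minimal sketch of the Needleman-Wunsch idea, assuming a simple match/mismatch score and a single linear gap penalty instead of a full substitution matrix with separate gap open and extension penalties. The parameter values are arbitrary; tools such as Needle use more elaborate scoring.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (simplified scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)  # 4: six matches and one gap position
print(top)
print(bottom)
```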
Local and Global Alignment
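– Global: aligns the two sequences end to end, over their full length
– Local: aligns only the most similar regions of the two sequences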
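A local alignment can be sketched with the same kind of dynamic programming table as a global one, with two changes: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values below are arbitrary and a linear gap penalty is assumed for brevity.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming (simplified scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment start afresh anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Only the shared region is reported; the unrelated flanks are ignored.
print(smith_waterman("CCCCGATTACACCCC", "TTGATTACATT"))  # (14, 'GATTACA', 'GATTACA')
```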
Pairwise Alignment Tools
https://www.ebi.ac.uk/Tools/psa/
– Needle
– Stretcher
– GGSEARCH2SEQ
– Water
– Matcher
– LALIGN
– SSEARCH2SEQ
– GeneWise
Which tool to use?
– Global Alignment
– Needle
– Stretcher (big sequences)
– GGSEARCH2SEQ
– Local Alignment
– Water
– LALIGN
– Matcher (big sequences)
– SSEARCH2SEQ
https://www.ebi.ac.uk/Tools/psa/emboss_needle/
– Sequence input
– Parameters
– Submit!
Things to remember
– Selection of tool
– Choose local/global based on your requirement
– Selection of matrix
– BLOSUM{n}: a higher n targets more closely related proteins.
– PAM{n}: a higher n targets more distantly related proteins.
What is BLAST?
– Basic BLAST search
– What is BLAST?
– The framework of BLAST
– Different BLAST programs
– BLAST databases you can search
– Where can I run BLAST?
What is BLAST?
• BLAST stands for
Basic Local Alignment Search Tool
• Why is BLAST popular?
- Good balance of sensitivity and speed
- Reliable
- Flexible
• Produces local alignments: short, significant stretches of
similarity, irrespective of where they are in the sequence
BLAST Programs
The most common BLAST searches use five programs:
Program   Database (Subject)   Query
BLASTN    Nucleotide           Nucleotide
BLASTP    Protein              Protein
BLASTX    Protein              Nt. ➔ Protein
TBLASTN   Nt. ➔ Protein        Protein
TBLASTX   Nt. ➔ Protein        Nt. ➔ Protein
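The table can be read as a small decision rule. The sketch below encodes it in Python; the function name and the `compare_as_protein` flag are invented for this illustration and are not part of any BLAST interface.

```python
def choose_blast_program(query, database, compare_as_protein=False):
    """Pick a BLAST program from the query and database types.

    query and database are "nucleotide" or "protein".
    compare_as_protein only matters for nucleotide vs nucleotide searches:
    it selects the translated comparison (TBLASTX) instead of BLASTN.
    """
    key = (query, database)
    if key == ("nucleotide", "nucleotide"):
        return "TBLASTX" if compare_as_protein else "BLASTN"
    table = {
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",   # query is translated
        ("protein", "nucleotide"): "TBLASTN",  # database is translated
    }
    if key not in table:
        raise ValueError(f"unknown sequence types: {key}")
    return table[key]


print(choose_blast_program("nucleotide", "protein"))  # BLASTX
```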
BLASTN
– BLASTN
– The query is a nucleotide sequence
– The database is a nucleotide database
– No conversion is done on the query or database
– DNA :: DNA homology
– Mapping oligos to a genome
– Annotating genomic DNA with transcriptome data from ESTs
and RNA-Seq
– Annotating untranslated regions
BLASTP
– BLASTP
– The query is an amino acid sequence
– The database is an amino acid database
– No conversion is done on the query or database
– Protein :: Protein homology
– Protein function exploration
– Novel gene ➔ make parameters more sensitive
BLASTX
– BLASTX
– The query is a nucleotide sequence
– The database is an amino acid database
– The query is translated in all six reading frames, and each
translation is used to search the database
– Coding nucleotide seq :: Protein homology
– Gene finding in genomic DNA
– Annotating ESTs and transcripts assembled from RNA-Seq
data
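Six-frame translation can be sketched as follows: three frames on the given strand and three on its reverse complement. The sketch assumes the standard genetic code and plain A/C/G/T input; any codon containing another character is written as "X". It illustrates the idea only and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame names (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```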
TBLASTN
– TBLASTN
– The query is an amino acid sequence
– The database is a nucleotide database
– The database sequences are translated in all six reading frames
and searched with the protein query
– Protein :: Coding nucleotide DB homology
– Mapping a protein to a genome
– Mining ESTs and RNA-Seq data for protein similarities
TBLASTX
– TBLASTX
– The query is a nucleotide sequence
– The database is a nucleotide database
– Both the query and the database are translated in all six
reading frames
– Coding :: Coding homology
– Searching distantly-related species
– Sensitive but expensive
BLAST output
1. List of sequences with scores
– Raw score
– Higher is better
– Depends on aligned length
– Expect Value (E-value)
– Smaller is better
– Takes query length and database size into account
2. List of alignments
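The relationship between raw score and E-value is commonly written as E = K · m · n · e^(−λS), where S is the raw score, m the query length and n the total database length. The sketch below evaluates that formula. The values of K and λ depend on the scoring matrix and gap penalties, and the defaults used here are illustrative placeholders, not values to rely on; real BLAST output also applies length corrections that are left out.

```python
import math


def expect_value(raw_score, query_length, database_length, k=0.041, lam=0.267):
    """Expected number of chance hits scoring at least raw_score.

    k and lam are statistical parameters of the scoring system;
    the defaults are placeholders for illustration only.
    """
    return k * query_length * database_length * math.exp(-lam * raw_score)


small_db = expect_value(100, query_length=250, database_length=1_000_000)
large_db = expect_value(100, query_length=250, database_length=2_000_000)
better_hit = expect_value(120, query_length=250, database_length=1_000_000)

print(large_db / small_db)    # the same score is less significant in a larger database
print(better_hit < small_db)  # a higher raw score gives a smaller E-value
```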
Multiple Sequence Alignment
– Multiple Sequence Alignment (MSA) can be seen as a generalization of
a Pairwise Sequence Alignment (PSA). Instead of aligning just two
sequences, three or more sequences are aligned simultaneously.
– MSA is used for:
– Detection of conserved domains in a group of genes or proteins
(conservation analysis)
– Construction of a phylogenetic tree
– Prediction of protein function/structure
– Determination of a consensus sequence
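As an illustration of the last use, a consensus sequence can be derived column by column from an alignment. The sketch below uses a simple majority rule with an arbitrary threshold; the threshold and the "x" placeholder are choices made for this example, and a dedicated tool may apply a different rule.

```python
from collections import Counter


def consensus(aligned, min_fraction=0.5, unknown="x"):
    """Majority-rule consensus of equal-length aligned sequences.

    A column contributes its most common residue if that residue is not
    a gap and occurs in more than min_fraction of the sequences;
    otherwise the column contributes the unknown placeholder.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) > min_fraction:
            result.append(residue)
        else:
            result.append(unknown)
    return "".join(result)


print(consensus(["MKTAYL", "MKSAYI", "MRTAYV"]))  # MKTAYx
```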
https://www.ebi.ac.uk/Tools/msa/
https://www.ebi.ac.uk/Tools/<tool category>
Multiple Sequence Alignment Tools
https://www.ebi.ac.uk/Tools/msa/
– Clustal Omega
– Kalign
– MAFFT
– MUSCLE
– T-Coffee
– EMBOSS Cons
Multiple Sequence Alignment Tools
– Use heuristics
– Progressive alignment
– E.g. Clustal Omega
– Iterative alignment
– E.g. MAFFT, MUSCLE, Clustal Omega
– Consistency-based alignment
– E.g. T-Coffee
– Profile (HMM-based) alignment
– E.g. Clustal Omega
https://www.ebi.ac.uk/Tools/msa/clustalo
Consensus Symbols
– An * (asterisk) indicates positions which have a single, fully conserved
residue.
– A : (colon) indicates conservation between groups of amino acids
with strongly similar properties
– A . (period) indicates conservation between groups of amino acids
with weakly similar properties
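The symbol line can be reproduced with a short sketch. The "strong" and "weak" residue groups below are written from memory of the Clustal documentation and should be checked against it before being relied on. Columns containing a gap are given a blank here, which is an assumption.

```python
# Residue groups as recalled from the Clustal documentation (verify before use).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (uppercase residues)."""
    residues = set(column)
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strongly similar group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weakly similar group
    return " "


def consensus_line(aligned):
    return "".join(column_symbol(column) for column in zip(*aligned))


alignment = ["MKLSW", "MRIGA", "MKVSG"]
for seq in alignment:
    print(seq)
print(consensus_line(alignment))
```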
Things to remember
– Check the input size limit (depends on tool)
– Tool errors (e.g. input not in a proper file format, or only a single
sequence provided)
Things to remember
– Input format
– Try using FASTA format
– Unique sequence identifiers
– First 30 characters in identifier should be unique
– Include sequence!
– Job can’t be found/other error
– Results deleted after 7 days
– Some sequence/program combinations run out of memory
– Use a different program
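Several of the input checks above can be run locally before submitting a job. The sketch below is a hypothetical pre-submission check, not part of any EBI tool: it looks for at least two records, identifiers that are unique within their first 30 characters, and a non-empty sequence for every record.

```python
def check_fasta(text, id_prefix_length=30):
    """Return a list of problems found in FASTA-formatted text (empty if none)."""
    problems = []
    records = []  # list of (header, sequence) pairs
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            problems.append("sequence data before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    if len(records) < 2:
        problems.append("fewer than two sequences")
    seen = set()
    for name, sequence in records:
        prefix = name[:id_prefix_length]
        if prefix in seen:
            problems.append(
                f"identifier not unique in first {id_prefix_length} characters: {prefix!r}"
            )
        seen.add(prefix)
        if not sequence:
            problems.append(f"no sequence for {name!r}")
    return problems


good = ">seq1\nMKTAYIAK\n>seq2\nMKSAYLAK\n"
bad = ">seq1\nMKTAYIAK\n>seq1\n"
print(check_fasta(good))
print(check_fasta(bad))
```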
Which tool should I use?
– 3-100 sequences of typical protein length
– MUSCLE, T-Coffee, MAFFT, Clustal Omega
– 100-500 sequences
– Clustal Omega, MUSCLE, MAFFT
– >500 sequences
– Clustal Omega, Kalign
Which tool should I use?
– Small number of unusually long sequences
– Kalign, MAFFT (fast)
– DNA
– MAFFT, Kalign, MUSCLE
Final remarks
– Don’t assume a single tool will cater for all your needs
– Change the parameters of the tools
– Remember where the tool excels and what its limitations are
– A tool intended for specific task A can also be used for task B (and
may be better than the tool intended for task B specifically!)
– Crazy input will always give crazy results!