IBB - MB.501 Database Search and Sequence Alignment

Genome Science

Uploaded by

Muhammad Shahzad


Genome Science (IBB.MB.501)
Database search and sequence alignment

Introduction
– Over the past five decades the use of computers has had a profound
effect on research in the biological sciences
– The amount of information available to researchers in databases
increases almost exponentially, and biologists and computer scientists
have come together to provide bioinformatics tools that help extract
useful information from these databases
– The aim of these sessions is to introduce you to the use of some of the
information and software resources available in the public domain

Searching sequence databases
– Sequence databases exist for nucleic acids, proteins and complex
carbohydrates
– For nucleic acids and proteins the chemical structure is represented as
a string of characters, such as ACCGTA for nucleic acids or DFGIMCR
for proteins
– Database entries include much more information, or annotation, which
contains the biological, bibliographic and administrative context for
the sequence.

NA databases

– For nucleic acids, there are three major public domain databases:
– European Nucleotide Archive (ENA), from EMBL-EBI
– GenBank, from NCBI (USA)
– DDBJ (DNA Data Bank of Japan)
– All three exchange data daily, so their contents are essentially identical

Sequence Alignment

Job Dispatcher Bioinformatics Tools

❏ 50+ tools (nucleotide and protein analysis)

❏ Recently added:

❏ R2DT

❏ SSEARCH2SEQ

❏ GGSEARCH2SEQ

Tool Categories
▪ Sequence Format Conversion (sfc)
▪ Protein Function Analysis (pfa)
▪ Sequence Operation (so)
▪ Sequence Statistics (seqstats)
▪ Sequence Translation (st)
▪ RNA Analysis (rna)
▪ Phylogeny (phylogeny)
▪ Pairwise Sequence Alignment (psa)
▪ Multiple Sequence Alignment (msa)
▪ Sequence Similarity Search (sss)
▪ EMBOSS Tools (emboss)

Sequence Format Conversion (sfc)

❏Convert one sequence format to another.

❏EMBOSS Seqret, MView

Advantages

• No local installation
• Workflows

[Figure: typical bioinformatics setup]

How to access tools?
– EBI Service page:
– https://www.ebi.ac.uk/services

– Job Dispatcher tool category page:
– https://www.ebi.ac.uk/Tools/<category>
– E.g. https://www.ebi.ac.uk/Tools/msa (Multiple Sequence Alignment)

Sequence Alignment

– Identify regions of similarity

Sequence Alignment

– Match, Mismatch, Gap

– Gap extension penalty
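
The match, mismatch and gap terms above can be combined into one score for an alignment that has already been made. The sketch below is illustrative only: the function name and the default values (match +1, mismatch -1, gap opening -10, gap extension -1) are assumptions, not the defaults of any particular tool. It charges the opening penalty for the first position of a gap and the extension penalty for each further position.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-1):
    """Score two aligned, equal-length strings that use '-' for gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in_a = gap_in_b = False  # are we currently inside a gap run?
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-":
            # First gap position pays gap_open, later ones pay gap_extend.
            score += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            score += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


if __name__ == "__main__":
    # Four matches, one gap opening, one gap extension.
    print(score_alignment("ACGTTA", "AC--TA"))
```

Because opening a gap costs much more than extending one, a single long gap scores better than several short ones of the same total length.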

Sequence alignment

– Similarity and Identity

– Substitution matrix

– https://github.com/kimrutherford/EMBOSS/tree/master/emboss/data

– Alignment score: calculated from the match/mismatch of residues
using the substitution matrices
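
Identity and similarity can be illustrated on a finished alignment. In this sketch, two residues count as "similar" when they fall in the same group of a small, hypothetical grouping (SIMILAR_GROUPS). That grouping is an assumption made for the example. Real tools decide similarity from the chosen substitution matrix, typically by counting pairs with a positive score. Percentages here are taken over the full alignment length, gaps included.

```python
# Hypothetical residue groups, for illustration only (not a real matrix).
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def identity_and_similarity(a, b):
    """Return (% identity, % similarity) for two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty aligned sequences of equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if "-" in (x, y):
            continue  # gap columns count towards the length only
        if x == y:
            identical += 1
            similar += 1  # identical residues are also similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(a)
    return 100 * identical / length, 100 * similar / length


if __name__ == "__main__":
    print(identity_and_similarity("MKVLD", "MRIL-"))
```

Similarity is always at least as large as identity, since every identical pair is also counted as similar.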

Sequence Alignment Types

– Pairwise
– Multiple

Pairwise Sequence Alignment

– Involves aligning two sequences using a scoring matrix

– Basis of database similarity search

– Dynamic programming for global alignment: Needleman-Wunsch algorithm

– Dynamic programming for local alignment: Smith-Waterman algorithm

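
The two dynamic programming schemes can be sketched side by side. This is a simplified illustration, not the implementation used by Needle or Water. It assumes a fixed match/mismatch score instead of a substitution matrix and a linear gap penalty instead of separate opening and extension penalties, and it returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (both sequences aligned end to end)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: leading gaps in a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # column 0: leading gaps in b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (best-scoring pair of subsequences)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
    print(smith_waterman_score("TTTTACGTTTTT", "GGGACGGG"))
```

The only structural differences are the initial row and column, the floor at zero, and where the final score is read: the bottom-right cell for global alignment, the maximum cell for local alignment.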
Local and Global Alignment

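– Global: aligns the two sequences end to end, over their entire length

– Local: aligns only the most similar region(s) of the two sequences
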
Pairwise Alignment Tools
https://www.ebi.ac.uk/Tools/psa/

– Needle
– Stretcher
– GGSEARCH2SEQ
– Water
– Matcher
– LALIGN
– SSEARCH2SEQ
– GeneWise

Which tool to use?
– Global Alignment
– Needle
– Stretcher (big sequences)
– GGSEARCH2SEQ

– Local Alignment
– Water
– LALIGN
– Matcher (big sequences)
– SSEARCH2SEQ

https://www.ebi.ac.uk/Tools/psa/emboss_needle/

– Sequence input
– Parameters
– Submit!

Things to remember

– Selection of tool

– Choose local/global based on your requirement

– Selection of matrix
– BLOSUM{n}: a higher n focuses on more closely related proteins.
– PAM{n}: a higher n focuses on more distantly related proteins.

What is BLAST?

– Basic BLAST search

– What is BLAST?
– The framework of BLAST
– Different BLAST programs
– BLAST databases you can search
– Where can I run BLAST?

What is BLAST?

• BLAST stands for Basic Local Alignment Search Tool
• Why is BLAST popular?
- Good balance of sensitivity and speed
- Reliable
- Flexible
• Produces local alignments: short, significant stretches of
similarity, irrespective of where they are in the sequence

BLAST Programs
The most common BLAST searches use one of five programs:

Program    Database (Subject)    Query
BLASTN     Nucleotide            Nucleotide
BLASTP     Protein               Protein
BLASTX     Protein               Nt. ➔ Protein
TBLASTN    Nt. ➔ Protein         Protein
TBLASTX    Nt. ➔ Protein         Nt. ➔ Protein

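
The table above can be read as a lookup from query type and database type to a program. The helper below is a hypothetical convenience function, not part of any BLAST distribution. Its translated_search flag is an assumption introduced to separate BLASTN from TBLASTX, which both take a nucleotide query and a nucleotide database.

```python
def choose_blast_program(query, database, translated_search=False):
    """Pick a BLAST program from the sequence types involved.

    query and database are each "nucleotide" or "protein".
    translated_search only matters for nucleotide vs nucleotide searches.
    """
    key = (query, database)
    if key == ("nucleotide", "nucleotide"):
        # Same input types, but TBLASTX compares six-frame translations.
        return "TBLASTX" if translated_search else "BLASTN"
    table = {
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",   # query is translated
        ("protein", "nucleotide"): "TBLASTN",  # database is translated
    }
    if key not in table:
        raise ValueError(f"unsupported combination: {key}")
    return table[key]


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```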
BLASTN

– BLASTN
– The query is a nucleotide sequence
– The database is a nucleotide database
– No conversion is done on the query or database
– DNA :: DNA homology
– Mapping oligos to a genome
– Annotating genomic DNA with transcriptome data from ESTs
and RNA-Seq
– Annotating untranslated regions

BLASTP

– BLASTP
– The query is an amino acid sequence
– The database is an amino acid database
– No conversion is done on the query or database
– Protein :: Protein homology
– Protein function exploration
– Novel gene ➔ make parameters more sensitive

BLASTX

– BLASTX
– The query is a nucleotide sequence
– The database is an amino acid database
– The query is translated in all six reading frames, and each
translation is used to search the database
– Coding nucleotide seq :: Protein homology
– Gene finding in genomic DNA
– Annotating ESTs and transcripts assembled from RNA-Seq
data

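
Six-frame translation means translating the sequence from each of the three offsets on the forward strand and on its reverse complement. The sketch below assumes the standard genetic code and an unambiguous A/C/G/T alphabet. The function names are illustrative, and any codon containing another symbol is reported as X. Production BLAST uses its own optimised translation code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```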
TBLASTN

– TBLASTN
– The query is an amino acid sequence
– The database is a nucleotide database
– The database is translated in all six reading frames and searched
with the protein sequence
– Protein :: Coding nucleotide DB homology
– Mapping a protein to a genome
– Mining ESTs and RNA-Seq data for protein similarities

TBLASTX

– TBLASTX
– The query is a nucleotide sequence
– The database is a nucleotide database
– Both the query and the database are translated in all six reading
frames
– Coding :: Coding homology
– Searching distantly-related species
– Sensitive but expensive

BLAST output

1. List of sequences with scores

– Raw score
– Higher is better
– Depends on aligned length
– Expect value (E-value)
– Smaller is better
– Scales with query length and database size
2. List of alignments
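
The relationship between raw score, bit score and E-value can be sketched with the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length and n the database length. The λ and K values used below are illustrative placeholders. In a real search they depend on the scoring matrix and gap penalties and are reported in the BLAST output. The sketch also ignores the edge-effect corrections that BLAST applies to m and n.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    lam, k = 0.267, 0.041  # placeholder statistical parameters (assumption)
    for db_len in (1_000_000, 2_000_000):
        print(db_len, e_value(100, 300, db_len, lam, k))
    print("bit score:", bit_score(100, lam, k))
```

Under this formula, searching a database twice as large doubles the E-value for the same raw score, so the same alignment becomes less significant as the search space grows.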

Multiple Sequence Alignment
– Multiple Sequence Alignment (MSA) can be seen as a generalization of
a Pairwise Sequence Alignment (PSA). Instead of aligning just two
sequences, three or more sequences are aligned simultaneously.

– MSA is used for:

– Detection of conserved domains in a group of genes or proteins
(conservation analysis)
– Construction of a phylogenetic tree
– Prediction of a protein function/structure

– Determination of a consensus sequence

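
One of the uses listed above, determining a consensus sequence, can be sketched with a simple majority rule over the columns of an alignment. This is an illustration under stated assumptions, not the method EMBOSS Cons uses. The threshold parameter and the x placeholder for unresolved columns are choices made for this example, and a gap character is counted like any other symbol.

```python
from collections import Counter


def consensus(aligned, threshold=0.5, ambiguous="x"):
    """Majority-rule consensus of equal-length aligned sequences."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need aligned sequences of equal length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        # Keep the residue only if it holds a strict majority of the column.
        result.append(residue if count / len(column) > threshold else ambiguous)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "ACCT"]))
```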
Multiple Sequence Alignment Tools
https://www.ebi.ac.uk/Tools/msa/

– Clustal Omega

– Kalign

– MAFFT

– MUSCLE

– T-Coffee

– EMBOSS Cons

Multiple Sequence Alignment Tools
– Use heuristics
– Progressive alignment
– E.g. Clustal Omega
– Iterative alignment
– E.g. MAFFT, MUSCLE, Clustal Omega
– Consistency-based alignment
– E.g. T-Coffee
– Profile (HMM-based) alignment

– E.g. Clustal Omega

https://www.ebi.ac.uk/Tools/msa/clustalo

Consensus Symbols

– An * (asterisk) indicates positions which have a single, fully conserved
residue.

– A : (colon) indicates conservation between groups of amino acids
with strongly similar properties.

– A . (period) indicates conservation between groups of amino acids
with weakly similar properties.

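
The symbol rules above can be sketched as a per-column check. The STRONG_GROUPS and WEAK_GROUPS lists below are written from memory of the Clustal documentation and should be treated as an assumption to verify against the tool's own help pages. The function names are illustrative. Columns containing a gap get no symbol in this sketch.

```python
# Residue groups as recalled from the Clustal documentation (assumption).
STRONG_GROUPS = ("STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW")
WEAK_GROUPS = ("CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY")


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fit within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fit within one weak group
    return " "


def conservation_line(aligned):
    """Build the symbol line printed under a protein alignment."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned))


if __name__ == "__main__":
    rows = ["MSCA", "MTSW", "MSAK"]
    print("\n".join(rows))
    print(conservation_line(rows))
```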
Things to remember

– Check the input size limit (depends on tool)

– Tool errors (e.g. the input is not in a proper file format, or only a
single sequence was provided)

Things to remember

– Input format
– Try using FASTA format
– Unique sequence identifiers
– First 30 characters in identifier should be unique
– Include sequence!

– Job can’t be found/other error
– Results are deleted after 7 days
– Some sequence/program combinations run out of memory
– Use a different program

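
The input checks above (FASTA format, identifiers unique within the first 30 characters, a sequence present for every record, more than one sequence) can be run locally before submitting a job. The sketch below is a hypothetical pre-flight check, not part of the EBI tools. It takes the identifier to be the first whitespace-delimited word of each header line, which is an assumption.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_msa_input(text, id_length=30):
    """Return a list of human-readable problems (empty if none found)."""
    records = parse_fasta(text)
    problems = []
    if len(records) < 2:
        problems.append("an alignment needs more than one sequence")
    seen = set()
    for header, sequence in records:
        identifier = (header.split() or [""])[0][:id_length]
        if identifier in seen:
            problems.append(
                f"identifier not unique in first {id_length} characters: {identifier!r}"
            )
        seen.add(identifier)
        if not sequence:
            problems.append(f"no sequence for record {header!r}")
    return problems


if __name__ == "__main__":
    print(check_msa_input(">seq1\nACGT\n>seq1\n\n"))
```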
Which tool should I use?
– 3-100 sequences of typical protein length
– MUSCLE, T-Coffee, MAFFT, Clustal Omega

– 100-500 sequences
– Clustal Omega, MUSCLE, MAFFT

– >500 sequences
– Clustal Omega, Kalign

Which tool should I use?

– Small number of unusually long sequences
– Kalign, MAFFT (fast)

– DNA
– MAFFT, Kalign, MUSCLE

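
The recommendations on the two "Which tool should I use?" slides can be written as a small decision helper. The function is hypothetical, and two things in it are assumptions, since the slides do not state them: the order in which the rules are checked (sequence length first, then DNA, then sequence count) and the handling of the boundary values 100 and 500.

```python
def suggest_msa_tools(n_sequences, dna=False, unusually_long=False):
    """Suggest MSA tools following the rules of thumb listed above."""
    if n_sequences < 3:
        raise ValueError("a multiple alignment needs at least three sequences")
    if unusually_long:
        return ["Kalign", "MAFFT"]
    if dna:
        return ["MAFFT", "Kalign", "MUSCLE"]
    if n_sequences <= 100:
        return ["MUSCLE", "T-Coffee", "MAFFT", "Clustal Omega"]
    if n_sequences <= 500:
        return ["Clustal Omega", "MUSCLE", "MAFFT"]
    return ["Clustal Omega", "Kalign"]


if __name__ == "__main__":
    print(suggest_msa_tools(250))
```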
Final remarks

– Don’t assume a single tool will cater for all your needs

– Change the parameters of the tools

– Remember where the tool excels and what its limitations are

– A tool intended for specific task A can also be used for task B (and
may be better than the tool intended for task B specifically!)

– Crazy input will always give crazy results!
