Multiple sequence alignment (MSA)
Practical Workflow:
1. Data Collection:
• Download the sequences you want to align from public sequence
databases like NCBI. Ensure diversity in the dataset (sequences
from different species or variants for meaningful comparisons).
• Store the sequences in a file for easy access.
In this walkthrough, we investigate the evolutionary history and diversity
of the P53 protein, a critical player in cell regulation and DNA repair. We
use six P53 protein sequences from various species, with the following
accession numbers:
AAC53040.1, BAA08629.1, AAA39883.1, AEG21062.2, AAL83290.1,
AAA39882.1
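The download step can also be scripted. The following is a minimal sketch, assuming the NCBI E-utilities "efetch" endpoint is used to request the records as FASTA text. The function names and the output file name are illustrative and are not part of the original workflow.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers listed in the text above.
P53_ACCESSIONS = [
    "AAC53040.1", "BAA08629.1", "AAA39883.1",
    "AEG21062.2", "AAL83290.1", "AAA39882.1",
]


def build_efetch_url(accessions):
    """Build a URL that requests protein records as plain FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accessions, out_path):
    """Download the records and store them in one file (needs network access)."""
    with urlopen(build_efetch_url(accessions)) as response:
        text = response.read().decode("utf-8")
    with open(out_path, "w") as handle:
        handle.write(text)
    return text

# Example (hypothetical file name), not run here because it needs the network:
# fetch_fasta(P53_ACCESSIONS, "p53.fasta")
```

Keeping all six records in a single FASTA file means the same file can be uploaded unchanged to each alignment tool.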
2. Clustal Omega:
• Clustal Omega employs a progressive alignment algorithm. It builds a
guide tree, aligning the most closely related sequences first and then
adding others.
• The tool generates a final MSA, highlighting conserved regions and
variable regions.
• Open Clustal Omega, a multiple sequence alignment tool. Go to
http://www.ebi.ac.uk/Tools/msa/clustalo/.
• Step 1 - Enter your input sequences. The submission form asks you to
select the type of sequences to be aligned (Protein, DNA or RNA), to
enter the sequences directly into the input box in FASTA format (or
upload a file in a supported format), and to set the output format.
Copy all of your sequences in FASTA format into the input box of the
submission form, making sure that each sequence starts on a new
line. Clustal Omega will attempt to align these amino acid sequences
based on their similarities. Click RUN; your results may take a few
seconds to appear.
• Step 2 - Set your parameters.
Multiple sequence alignment tool output examples:
Clustal Omega: Clustal w/o Numbers
MAFFT: Pearson/FASTA
MUSCLE: HTML
*For more information about the output formats, please check
https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Multiple+Sequence+Alignment+Tool+Output+Examples#MultipleSequenceAlignmentToolOutputExamples-ClustalOmegaproteinoutputexamples
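For later analysis it can help to load the Clustal-format output into a script. The following is a minimal sketch, assuming output in the "Clustal w/o Numbers" layout (a header line, then blocks of "name sequence" lines, each followed by a consensus line that starts with spaces). The sequence names and residues in EXAMPLE are toy data, not real results.

```python
def parse_clustal(text):
    """Return {sequence name: full aligned sequence} from Clustal-format text."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and consensus lines (they start with a space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Each block contributes one chunk per sequence; join them in order.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


# Toy alignment in two blocks (hypothetical names and residues).
EXAMPLE = """CLUSTAL O(1.2.4) multiple sequence alignment


seq_a      MEEPQSD-LS
seq_b      MEESQSDISL
           ***.***

seq_a      IEPPL
seq_b      LEPPL
           :****
"""

alignment = parse_clustal(EXAMPLE)
for name, seq in alignment.items():
    print(name, seq)
```

The parser keeps only the first two fields of each line, so output that carries trailing residue counts should be read the same way.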
3. MAFFT:
• Open MAFFT, another multiple sequence alignment tool
(https://www.ebi.ac.uk/Tools/msa/mafft/ ).
• Load the same set of sequences into MAFFT.
• MAFFT offers several alignment strategies, ranging from fast
progressive methods (e.g., FFT-NS-2) to slower, more accurate
iterative refinement methods (e.g., L-INS-i). It can automatically
select an appropriate strategy for your dataset.
• MAFFT outputs an aligned sequence file.
4. MUSCLE:
• Repeat the process with MUSCLE
(https://www.ebi.ac.uk/Tools/msa/muscle/ ).
• MUSCLE employs a progressive method and also incorporates
iterative refinement.
• It aligns sequences and produces an aligned output.
5. Comparison:
• Compare the results from Clustal Omega, MAFFT, and MUSCLE.
• Look for consensus in conserved regions and evaluate any
differences in variable regions.
• Different algorithms may produce slightly different alignments,
and some tools may perform better for specific types of
sequences or datasets.
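One way to make the comparison quantitative is to count how many residue pairs two tools place in the same column. The following is a minimal sketch under the assumption that each alignment is available as a dictionary from sequence name to gapped sequence. The function names and the use of a Jaccard-style ratio are illustrative choices, not a method prescribed by any of the three tools.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Collect every pair of residues that share a column.

    alignment maps sequence name -> gapped sequence (all of equal length).
    A residue is identified by (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    counters = {name: 0 for name in names}  # next ungapped index per sequence
    pairs = set()
    length = len(alignment[names[0]])
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, counters[name]))
                counters[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(aln_a, aln_b):
    """Shared aligned pairs divided by all aligned pairs seen in either alignment."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Toy example: the two alignments place the gap in different columns.
tool_1 = {"a": "AC-GT", "b": "ACTGT"}
tool_2 = {"a": "ACG-T", "b": "ACTGT"}
print(pair_agreement(tool_1, tool_2))
```

A value of 1.0 means the two alignments are identical at the level of residue pairs; lower values point to regions, usually variable ones, where the tools disagree.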
6. Results Interpretation:
• Analyze the aligned sequences to identify conserved domains or
motifs.
• Evaluate the positions of indels and their potential impact on the
function or structure.
• Consider the evolutionary implications of the alignment.
Here is a step-by-step guide to interpreting MSA results (e.g., Clustal Omega
output in Clustal w/o Numbers format):
In Clustal Omega results, you typically have access to various tabs or sections
that provide additional information and options for analyzing and extracting
information from your multiple sequence alignment.
(Please note: depending on the specific software or interface you are using, the
available tabs and their functionalities may vary.)
▪ Sequence Alignment Tab:
This is the primary tab where you can view the aligned sequences. It displays
the MSA itself, often with color-coding to represent conserved regions, gaps,
and sequence properties.
1. Conserved Regions:
• Start by identifying segments within the alignment where most of the
sequences have identical or highly similar amino acids. These are
conserved regions.
• Conserved regions are often functionally important. They may
correspond to critical structural elements or functional domains of the
protein or DNA.
Consensus Symbols:
"*" means that the residues or nucleotides in that column are identical in
all sequences in the alignment.
":" means that conserved substitutions have been observed, i.e., the amino
acid is replaced by one having strongly similar characteristics, according
to the COLOR table below.
"." means that semi-conserved substitutions are observed, i.e., the amino
acids have only weakly similar properties (for example, a similar shape).
A blank (space) means that the column is not conserved.
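The consensus symbols can be reproduced with a short script. The following is a minimal sketch for protein alignments. The "strong" and "weak" residue groups are the ones commonly quoted from the Clustal documentation, and they are an assumption here: check them against the version of the tool you use.

```python
# Residue groups treated as "strongly" and "weakly" similar (assumed from the
# Clustal documentation; verify against your tool version).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # a column containing a gap gets no symbol
    if len(residues) == 1:
        return "*"  # fully identical
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weak group
    return " "


def consensus_line(sequences):
    """Build the consensus line for a list of equal-length aligned sequences."""
    return "".join(column_symbol("".join(col)) for col in zip(*sequences))


# Toy aligned sequences (hypothetical residues).
toy = ["MEEPQSD-LSIEPPL", "MEESQSDISLLEPPL"]
print(consensus_line(toy))
```

Counting the "*" characters in the resulting line gives a quick estimate of how many columns are fully conserved.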
2. Variable Regions:
• Look for segments in the alignment where the sequences show
variations. These variable regions may indicate adaptational differences
or regions with less functional constraint.
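Variable regions can be located numerically by scoring each column. The following is a minimal sketch that uses Shannon entropy as the variability measure. Both the measure and the 1.0-bit threshold are illustrative assumptions, not settings taken from Clustal Omega.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one column; gaps are ignored."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def variable_columns(sequences, threshold=1.0):
    """0-based indices of columns whose entropy is at least the threshold."""
    return [i for i, col in enumerate(zip(*sequences))
            if column_entropy(col) >= threshold]


# Toy alignment: only the last column varies.
toy = ["AAC", "AAG", "AAT", "AAT"]
print(variable_columns(toy))
```

An entropy of 0 means every sequence has the same residue in that column; higher values indicate more variation.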
3. Gaps (Indels):
• Identify gaps or insertions/deletions (indels) in the alignment. Gaps
represent regions where sequences differ in length or have insertions or
deletions.
• Indels can be functionally significant, potentially indicating structural
variation or unique features in specific sequences.
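Gaps can be listed automatically once the alignment is loaded. The following is a minimal sketch, assuming gaps are written as "-" and that the alignment is held as a dictionary from sequence name to gapped sequence. The function names and the toy data are illustrative.

```python
import re


def gap_runs(aligned_sequence):
    """Return (start column, length) for every run of '-' (0-based columns)."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"-+", aligned_sequence)]


def indel_report(alignment):
    """Map each sequence name to its list of gap runs."""
    return {name: gap_runs(seq) for name, seq in alignment.items()}


# Toy alignment (hypothetical names and residues).
toy = {"seq_a": "AC--GT-A", "seq_b": "ACTTGTCA"}
print(indel_report(toy))
```

Long runs shared by several sequences are usually more informative than isolated single-column gaps.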
4. Alignment Quality:
• Assess the overall quality of the alignment. A well-aligned region
typically has few gaps and few sequence variations. Regions of strong
sequence similarity can usually be aligned with more confidence.
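Two simple numbers can support this assessment: the fraction of gap characters and the mean pairwise identity. The following is a minimal sketch. These two measures are illustrative choices and are not the quality scores reported by any particular tool.

```python
from itertools import combinations


def gap_fraction(sequences):
    """Fraction of all alignment cells that are gaps."""
    total = sum(len(s) for s in sequences)
    return sum(s.count("-") for s in sequences) / total if total else 0.0


def pairwise_identity(seq_a, seq_b):
    """Identical positions divided by columns where both sequences have a residue."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def mean_pairwise_identity(sequences):
    """Average identity over all pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy aligned sequences (hypothetical residues).
toy = ["AC-GT", "ACTGA"]
print(gap_fraction(toy), mean_pairwise_identity(toy))
```

Note that identity is computed only over columns where both sequences have a residue, so a heavily gapped pair can still score high; read the two numbers together.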
▪ Show Colours Tab:
Returning to your results, you will notice that there are various tabs at the upper
part of your results page. Click on the tab called "Show Colours". Your
sequences now appear in color.
The use of colors can be a visual guide to conservation or divergence at each
position in the alignment. Clustal programs use specific colors to represent
different amino acid or nucleotide properties. For example, in the scheme used on
the EBI Clustal Omega results page, blue marks acidic (negatively charged) amino
acids and magenta marks basic (positively charged) ones.
Here are some common color schemes used in MSA visualization:
▪ Guide Tree Tab:
The "Guide Tree" is a graphical representation of the estimated relationships
among the sequences that were aligned. Clustal Omega uses it to decide the
order in which sequences are aligned, so it is a rough approximation rather
than a full phylogenetic tree. It helps you visualize how closely related or
distant the sequences are from one another. Here's how to interpret the Guide
Tree in Clustal Omega results:
▪ Phylogenetic Analysis Tab:
If your alignment was used to build a phylogenetic tree, this tab may include
options for visualizing or further analyzing the tree, such as selecting a root
node or adjusting tree display settings.
Case study: Multiple Sequence Alignment of Mitochondrial Cytochrome b
in Rodents
Introduction: The objective of this experiment is to perform MSA on a set of
mitochondrial cytochrome b protein sequences from various rodent species.
MSA is a fundamental bioinformatics technique used to identify conserved
regions and sequence variations, aiding in the study of molecular evolution. In
this study, we aim to uncover evolutionary patterns and similarities among
rodent cytochrome b proteins.
Exercise 1:
o Data Collection: Retrieve the mitochondrial cytochrome b protein
sequences for eight different rodent species from the GenBank database.
Species included in the analysis are Mus musculus (house mouse), Rattus
norvegicus (brown rat), Cricetulus griseus (Chinese hamster),
Peromyscus eremicus (cactus mouse) and others.
Use the following accession numbers to download the sequences from
the Protein database in the same (FASTA) format.
YP_001686710.1, AP_004904.1, YP_537131.1, YP_006073056.1,
YP_009245653.1, YP_009245095.1, YP_009186415.1, YP_009166339.1
o Clustal Omega, MAFFT, and MUSCLE Alignment: Use three different
MSA tools: Clustal Omega, MAFFT, and MUSCLE. Apply each tool
separately to align the sequences with default parameters.
• Clustal Omega:
o Input the sequences into the Clustal Omega tool.
• MAFFT:
o Load the same set of sequences into MAFFT.
• MUSCLE:
o Repeat the process with MUSCLE.
Questions:
1. How many Cytochrome b protein sequences did you include in the
alignment?
Eight sequences
2. What are the lengths of the longest and shortest sequences?
The longest sequence is 381 residues long and the shortest
sequence is 379 residues long.
3. What are the conserved regions in the Clustal Omega alignment, and what
might these regions signify in terms of function or structure?
Conserved regions are often functionally important. They may
correspond to critical structural elements or functional domains of the
protein.
4. Did you observe any differences between the MAFFT alignment and the
Clustal Omega alignment? If so, what might explain these differences?
They differ in output format:
MAFFT: Pearson/FASTA
Clustal Omega: Clustal w/o Numbers
5. Are there any additional conserved regions or differences in the aligned
sequences revealed by MAFFT?
No, there are no additional conserved regions or differences.
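The answers to questions 1 and 2 can be checked with a short script. The following is a minimal sketch, assuming the downloaded sequences are available as unaligned FASTA text. The function names and the toy records are illustrative and are not the real cytochrome b data.

```python
def read_fasta(text):
    """Parse FASTA text into {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header (normally the accession) as the name.
            name = header[0] if header else "seq%d" % (len(records) + 1)
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def length_summary(records):
    """Return (number of sequences, (longest name, length), (shortest name, length))."""
    lengths = {name: len(seq) for name, seq in records.items()}
    longest = max(lengths, key=lengths.get)
    shortest = min(lengths, key=lengths.get)
    return len(lengths), (longest, lengths[longest]), (shortest, lengths[shortest])


# Toy records (hypothetical names and residues).
toy_fasta = ">x first toy record\nMKT\nAA\n>y\nMK\n"
print(length_summary(read_fasta(toy_fasta)))
```

If the input is an aligned FASTA file rather than the original download, remove the "-" characters from each sequence first so that gaps are not counted as residues.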
Exercise 2:
o Repeat the previous exercise, adding the sequence “XP_059124694.1”
to the previous set of sequences.
o Clustal Omega, MAFFT, and MUSCLE Alignment: use the three
different MSA tools: Clustal Omega, MAFFT, and MUSCLE.
Apply each tool to align the sequences separately.
Questions:
1. How many Cytochrome b protein sequences did you include in the
alignment?
2. What are the lengths of the longest and shortest sequences?
3. Do you observe any difference between this alignment and the previous
one?