Dr.
M C Saxena College of Engineering
&
Technology
Assignment
on
BIOINFORMATICS-II
Submitted To: Dr. Akrati Dev, HOD (Biotech)
Submitted By: Aanchal Maurya, B.Tech (Biotech)
INDEX
S.no Experiment Date Remark
1. (a) To study the pairwise sequence similarity search using the BLAST algorithm.
   (b) To study the functional and evolutionary relationships between different sequences.
2. To find the evolutionary relationship between different organisms and analyse the changes that
   occurred in organisms during the course of evolution using PHYLIP.
3. Secondary structure analysis of a protein using SOPMA.
Experiment No. 1
Objective:
• To study the pairwise sequence similarity search using BLAST algorithm.
• To study the functional and evolutionary relationships between different sequences.
Theory:
The BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and
David J. Lipman at the National Institutes of Health (NIH) and was published in the Journal of
Molecular Biology in 1990. BLAST (Basic Local Alignment Search Tool) is a heuristic search
algorithm: rather than examining every possible alignment, it takes a nucleotide or protein sequence
as input and compares it with the sequences in existing databases such as GenBank at NCBI. It finds
regions of local similarity between sequences and calculates the statistical significance of the
matches. It can also be used to infer functional and evolutionary relationships between sequences.
The search takes words of a certain size from the query, compares them with the database sequences,
and assigns a score to each comparison. Words that score above the threshold are kept as matches,
and the alignment is extended in both directions from each match. After the extension is complete,
the total score is calculated, and the alignment is displayed on the BLAST result page only if the
total score exceeds the threshold value.
Sequence, Sequence Alignment and importance:
A biological sequence is a sequence of characters that represents DNA, RNA or protein. The two
most common types of biological sequence are nucleotide sequences and protein sequences. A DNA
nucleotide sequence is formed of four different nucleotides, namely adenine (A), guanine (G),
cytosine (C) and thymine (T), while a protein sequence is formed of the 20 commonly found amino
acids. Nucleotides are read as a triplet code (a triplet is a group of three nucleotides), in which
each triplet codes for one amino acid. These sequences are indexed in existing databases, from
which they can be retrieved. The sequences themselves are obtained by the methods explained
below.
DNA sequencing methods:
Sanger Method (dideoxy chain termination method): Here 4 test tubes are taken, labelled
A, T, G and C. DNA in denatured form (single strands) is added to each of the test tubes.
Next a primer is added, which anneals to one of the strands of the template. DNA polymerase extends
the 3' end of the primer, incorporating either ordinary deoxynucleotides (dNTPs) or, at random, the
dideoxynucleotide (ddNTP) specific to that tube. When a ddNTP is attached to the growing chain, the
chain terminates because the ddNTP lacks the 3'-OH group needed to form a phosphodiester bond with
the next nucleotide. Thus DNA strands of different lengths are formed. Electrophoresis is done, and
the sequence order is obtained by analysing the bands in the gel, which are separated by molecular
weight. The primer or one of the nucleotides can be radioactively or fluorescently labeled so that
the final product can be detected easily in the gel and the sequence can be inferred.
Maxam-Gilbert (Chemical degradation method): This method requires a denatured DNA fragment
whose 5' end is radioactively labeled. The fragment is purified and then subjected to chemical
treatment, which cleaves it at specific bases and results in a series of labeled fragments.
Electrophoresis arranges the fragments according to their molecular weight. To view the fragments,
the gel is exposed to X-ray film for autoradiography. A series of dark bands appears, each
corresponding to a radiolabeled DNA fragment, from which the sequence can be inferred.
Protein sequencing methods:
Edman Degradation reaction: The reaction finds the order of amino acids in a protein from the
N-terminus, by cleaving one amino acid at a time from the N-terminus without disturbing the other
peptide bonds in the protein. After each cleavage, chromatography or electrophoresis is done to
identify the amino acid.
Mass Spectrometry: It is used to determine the mass of particles and the composition of molecules,
and to find the chemical structures of molecules like peptides and other chemical compounds. Based
on the mass-to-charge ratio, one can identify the amino acids in a protein.
Sequence Alignment, or sequence comparison, lies at the heart of bioinformatics. It describes a
way of arranging DNA/RNA/protein sequences to identify the regions of similarity among them. It is
used to infer structural, functional and evolutionary relationships between the sequences.
Alignment finds the level of similarity between the query sequence and the available database
sequences. Exact alignment algorithms use a dynamic programming approach, which divides the problem
into smaller sub-problems and reuses their solutions. Assigning scores makes the comparison
quantitative.
Methods of Sequence Alignment:
There are mainly two methods of Sequence Alignment:
Global Alignment: Sequences that are of about the same length and quite similar are the most
appropriate for global alignment. Here the alignment is carried out from the beginning to the end
of the sequences to find the best possible alignment.
Local Alignment: Sequences that are only suspected to be similar, or even dissimilar sequences,
can be compared with the local alignment method. It finds the local regions with a high level of
similarity.
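The difference between the two methods can be sketched with a small dynamic programming example. The code below is a minimal illustration, not the BLAST implementation: the function names and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for this example. The global score covers both sequences end to end, while the local score only keeps the best-scoring region.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score (Needleman-Wunsch style recurrence)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised in a global alignment.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring local region (Smith-Waterman style recurrence)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so a new local alignment can start anywhere.
            table[i][j] = max(0, diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            best = max(best, table[i][j])
    return best


query, subject = "AAAACGTAAA", "CCCACGTCCC"
print(global_score(query, subject))  # penalised by the dissimilar flanks
print(local_score(query, subject))   # only the shared ACGT region counts
```

For these two sequences the only similar region is the shared ACGT, so the local score stays positive while the global score is pulled down by the mismatched ends.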
BLAST is one of the pairwise sequence alignment tools used to compare different sequences. There
are different BLAST programs for different comparisons, as shown in Table 1.
Nucleotide BLAST Programs:
BLASTN: The initial search is done for a word of length ‘w’ and threshold score ‘T’. The query
sequence is divided into overlapping words with a length of 11 for nucleic acids. The maximum
number of words is L - w + 1, where L is the sequence length and w is the word length. The same is
done for every sequence in the database, and word matches between the query and the database
sequences are identified. BLASTN searches for somewhat similar sequences.
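The word-splitting step can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions: the function names are hypothetical, and only exact word matches are looked up, whereas BLAST itself also scores near matches against a threshold.

```python
def words(sequence, w):
    """Return all overlapping words of length w (there are L - w + 1 of them)."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def exact_word_hits(query, subject, w):
    """Return (query_position, subject_position) pairs of identical words."""
    # Index every word of the subject by its start position.
    index = {}
    for pos, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(pos)
    hits = []
    for qpos, word in enumerate(words(query, w)):
        for spos in index.get(word, []):
            hits.append((qpos, spos))
    return hits


query = "ACGTACGTACGTAGC"            # L = 15
print(len(words(query, 11)))         # L - w + 1 = 15 - 11 + 1 = 5
print(exact_word_hits("ACGT", "TTACGTT", 3))
```

Each hit gives a pair of positions from which an alignment could later be extended in both directions.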
Mega BLAST: Searches for highly similar sequences.
Discontiguous Mega BLAST: Searches for more dissimilar sequences.
Protein BLAST Programs :
BLASTp: Finds the similarity between a query protein sequence and the protein sequences available
in the protein database. Like the other BLAST programs, BLASTp reports local alignments, which are
commonly used for protein identification. The BLASTp algorithm parses protein sequences into
3-letter “words”; the same is done for every sequence in the database, and word matches between
the query and the database are identified.
PSI-BLAST: Position-Specific Iterated BLAST is the most sensitive BLAST program. It is used to
find very distantly related proteins or new members of a protein family. The algorithm builds a
position-specific scoring matrix (PSSM, or profile) from an iterative alignment of sequences and
reports hits with their E-values; hits with an E-value below the threshold (default = 0.005) are
used to build the profile for the next iteration. The E-value decreases exponentially with the
score that is assigned to a match between two sequences.
PHI-BLAST: Pattern-Hit Initiated BLAST is used to find protein sequences that contain a pattern
specified by the user and are also similar to the query sequence. The similarity requirement was
introduced to reduce the number of hits that contain the pattern but are unlikely to have any true
homology to the query. To run PHI-BLAST, enter the query into the Search box and enter the pattern
into the PHI pattern box. Only one pattern can be used in one search.
Scoring matrices :
Scoring matrices are used to assign scores to the comparison of pairs of characters. There are
different types of scoring matrices, such as:
Identity Matrices: In this type of matrix, the scores are either 1's or 0's, with the 1's lying
along the diagonal. The scoring scheme is based simply on matches and mismatches.
Unitary scoring matrices: This matrix also has either 0's or 1's as its scores. The difference
is that it takes into account the idea of transitions (changes among purines or among pyrimidines)
and transversions (changes between a purine and a pyrimidine).
PAM Matrices: Margaret Dayhoff was the first to develop the PAM matrix. PAM stands for Point
Accepted Mutation. PAM matrices are calculated by observing the differences in closely related
proteins. One PAM unit (PAM1) specifies one accepted point mutation per 100 amino acid residues,
i.e. 1% of the residues change and 99% remain as they are.
BLOSUM: The BLOcks SUbstitution Matrix was developed by Henikoff and Henikoff in 1992 using
conserved regions (blocks) of related proteins. The number in the name refers to the percentage
identity used when building the matrix: BLOSUM62, for example, is derived from blocks in which
sequences sharing 62% or more identity were clustered together.
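For nucleotides, the simplest of these schemes can be sketched in a few lines of Python. This is an illustrative example only: the function names are hypothetical, and the identity score of 1 for a match and 0 for a mismatch follows the description of the identity matrix given here. PAM and BLOSUM matrices hold one such score for every pair of amino acids instead.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(x, y):
    """Identity matrix: 1 on the diagonal (match), 0 everywhere else."""
    return 1 if x == y else 0


def substitution_type(x, y):
    """Classify a nucleotide pair as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def score_ungapped(a, b, score=identity_score):
    """Sum the pairwise scores of two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(score(x, y) for x, y in zip(a, b))


print(substitution_type("A", "G"))     # purine to purine
print(substitution_type("A", "C"))     # purine to pyrimidine
print(score_ungapped("ACGT", "ACCT"))  # three identical positions
```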
Parameters used in BLAST algorithm :
Threshold: A boundary (minimum or maximum) value that is used to filter out words
during comparison.
True Homology: In BLAST, true homology means that a database sequence is similar to the query
because the two share a common ancestor, and not merely by chance.
E-value: The number of alignments with a given score that are expected to occur by chance. It
decreases exponentially with the score that is assigned to an alignment between two sequences.
Word size: The search takes words of a certain size from the query and compares them
with the database sequences, and a score is assigned to each comparison. The word size is 11
for nucleic acids and 3 for proteins.
Putative conserved domains: Regions of the query that match known conserved domains, which are
often associated with particular functions.
Gap score or gap penalty: Dynamic programming algorithms use gap penalties to keep the alignment
biologically meaningful. A gap penalty is subtracted for each gap that is introduced; it is the
penalty given to an alignment for an insertion or deletion. During evolution, a single event may
insert or delete several residues at once and produce a run of continuous gaps in the sequence, so
a linear gap penalty is not always appropriate for the alignment. For this reason separate gap open
and gap extension penalties are used when there are continuous gaps (five or more). The open
penalty is always applied at the start of the gap, and each gap position following it is given a
gap extension penalty, which is smaller than the open penalty. Typical values are -12 for gap
opening and -4 for gap extension.
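The gap open and gap extension scheme can be worked through with a small Python sketch. The function names are hypothetical, and the convention used (the open penalty for the first gap position, the extension penalty for each following position) is taken from the description above; some tools use slightly different conventions, so the numbers are illustrative rather than what any particular BLAST setting produces.

```python
import re


def linear_gap_penalty(length, per_position=-12):
    """Linear scheme: every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, gap_open=-12, gap_extend=-4):
    """Affine scheme: open penalty once, extension penalty for the rest."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned_sequence, gap_open=-12, gap_extend=-4):
    """Sum the affine penalties of every run of '-' in an aligned sequence."""
    runs = re.findall(r"-+", aligned_sequence)
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend) for run in runs)


print(linear_gap_penalty(5))            # 5 * -12 = -60
print(affine_gap_penalty(5))            # -12 + 4 * -4 = -28
print(total_gap_penalty("AC---GT-A"))   # (-12 - 8) + (-12) = -32
```

With these values a single gap of five positions costs much less under the affine scheme than under the linear one, which is why long continuous gaps are not penalised out of the alignment.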
Working of BLAST Algorithm:
• The query sequence is taken and analyzed for low-complexity regions. Low-complexity regions are
regions that contain little information or variation, like AAAAAAAA or ATATATAT.
• These low-complexity regions are masked with a letter such as X or N.
• A list of words of a certain word size is made. Usually the word size is 3 for proteins and 11
for DNA.
• Scores are calculated for each pair of words (query sequence word and database word) using
substitution scoring matrices (like PAM or BLOSUM), and only the high-scoring words, i.e. those
above a threshold value or cutoff score, are taken for further alignment. The cutoff score is
selected to reduce the number of matches and so decrease the computation time.
• This scoring and checking is repeated for all the words in the query sequence.
• The remaining high-scoring words are organised into an efficient search tree and rapidly
compared to the database sequences. This is done to find exact matches.
• If an exact or good match is found, an alignment is extended in both directions from
the position where the match occurred.
Image source : petang.cgu.edu.tw/Bioinfomatics/MANUALS/NCBIblast/BLAST_algorithm.html
• High-scoring segment pairs (HSPs) whose score is greater than a threshold are taken for
consideration.
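• The significance of each HSP score is calculated. The probability p of observing a score S equal
to or greater than x is given by the equation:
p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)
where
\mu = \frac{\ln(K m' n')}{\lambda}
Here \lambda and K are statistical parameters of the scoring system, and m' and n' are the
effective lengths of the query and database sequences.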
• Statistical assessments are made when two or more HSP regions are found, and the
matching pairs are listed in the output file in descending order of their similarity
score.
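The statistical step can be sketched numerically in Python using the commonly stated Karlin-Altschul relations E = K m n e^(-λS) and p = 1 - e^(-E). The function names are hypothetical, and the values λ = 0.318 and K = 0.134 and the sequence lengths are illustrative assumptions, not values read from an actual BLAST run.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """E-value: expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance alignment: p = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def bit_score(raw_score, lam, k):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


LAM, K = 0.318, 0.134           # assumed scoring-system parameters
M, N = 100, 1_000_000           # assumed query and database lengths

for score in (40, 60, 80):
    e = expect_value(score, M, N, LAM, K)
    print(score, round(bit_score(score, LAM, K), 1), e, p_value(e))
```

The printed E-values fall exponentially as the raw score rises, which is the behaviour described for the E-value parameter.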
BLAST Procedure
This is the common procedure for any BLAST program.
Step 1: Select the BLAST program.
Step 2: Enter a query sequence or upload a file containing sequence.
Step 3: Select the database to search.
Step 4: Select the algorithm and the parameters of the algorithm for the search.
Step 5: Run the BLAST program.
Step 1: Select the BLAST program
The user has to choose the type of BLAST program, such as BLASTp, BLASTn, BLASTx,
tBLASTn or tBLASTx.
Step 2: Enter a query sequence or upload a file containing sequence
Enter a query sequence by pasting the sequence in the query box or by uploading a FASTA file that
contains the sequence for the similarity search. This step is similar for all BLAST programs. The
user can give the accession number, the gi number or even a raw FASTA sequence.
Figure 1: Enter a query sequence or upload a file containing sequence
Step 3: Select database to search
The user first has to know which databases are available and what types of sequences are present
in those databases. A sequence similarity search looks for sequences similar to the query
sequence in the selected databases (Figure 2).
Figure 2: Select database to search
Step 4: Select the algorithm and the parameters of the algorithm for the
search
There are different algorithms for some of the BLAST programs, and the user has to specify the
algorithm to use. Nucleotide BLAST offers MegaBLAST, which searches for highly
similar sequences, discontiguous MegaBLAST, which searches for more dissimilar sequences, and
BLASTn, which searches for somewhat similar sequences. Protein BLAST offers BLASTp, which
searches for similarity between a protein query and a protein database; PSI-BLAST, which
performs a position-specific search iteratively; PHI-BLAST, which searches the database for
sequences containing a particular pattern (the user has to enter the pattern in the PHI pattern
box provided); and DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST), which searches
multiple sequences and aligns them to find protein homology. The different
algorithm parameters are Target sequences, Short queries, E-value, Word size, Query range,
scoring parameters (Match/Mismatch scores and Gap penalties) and filters (Filter and Mask), which
are required to run BLAST programs. Default values are provided, but the user can adjust the values
as needed, as shown in Figure 3.
Figure 3: Algorithm and the parameters
Step 5: Run the BLAST program
The search is submitted by clicking the BLAST button at the end of the page.
A screenshot of the result is shown in Figure 4.
Figure 4: Run the BLAST program
BLAST Result:
After the query sequence is submitted for the sequence similarity search, the result page appears
with information such as the Query id, Description, Molecule type, Length of sequence, Database
name and BLAST program. It shows the putative conserved domains that were detected during the
sequence similarity search.
The query sequence is represented as a numbered red bar below the color key. Database hits are
shown below the query (red) bar according to their alignment score. Among the aligned sequences,
the most closely related sequences are kept nearest to the query sequence. The user can find a
description of each alignment by moving the mouse over its colored bar, as shown below in
Figure 5.
Figure 5: BLAST result
Each alignment is preceded by the sequence identifier, along with the definition line and the
length of the matched sequence, followed by the score and E-value. The same line also gives the
number of identical residues in the alignment (identities), the number of positives and the number
of gaps used in the alignment. Finally the actual alignment is shown, with the query sequence on
top and the database sequence below it. The numbers on either side of the alignment indicate the
positions of the amino acids/nucleotides in the sequence, as shown in Figure 6.
Figure 6: BLAST result
Experiment No. 2
Objective:
• To find the evolutionary relationship between different organisms and analyze the changes
that occurred in organisms during the course of evolution using PHYLIP.
Key words:
Phylogenetic analysis: The analysis of the evolutionary relationships between different organisms,
which helps to find the changes that occurred in organisms during evolution.
Bootstrapping: A resampling method used to test the reliability of a tree inferred from a dataset.
Query: The input given by the user. It can be either a protein or a nucleotide sequence.
Rooted tree: A tree that has a special node, called the root, as its main node. A tree without a
root is treated as a free (unrooted) tree.
Tree topology: The branching arrangement of a phylogenetic tree.
Theory :
PHYLIP is a complete phylogenetic analysis package developed by Joseph Felsenstein at the
University of Washington. PHYLIP is used to find the evolutionary relationships between different
organisms. The methods available in this package include the maximum parsimony method,
distance matrix methods and likelihood methods. The data is presented to the program in a plain
text file, which the user prepares with a common text editor or word processor. Some
sequence analysis programs, such as ClustalW, can write data files in PHYLIP format. Most of the
programs look for an input file called "infile"; if they do not find this file, they ask the user
to type in the name of the data file. Before starting the computation, the program offers a menu
through which the user can optionally set options. Output is written to files with names like
outfile and outtree.
PHYLIP file format :
• The first line of the input file gives the number of sequences and the number of characters
(nucleotides or amino acids) in each sequence.
• Each sequence name is 10 characters long. Spaces can be added to the end of shorter
names to pad them to this length.
• Gaps can be represented as '-'.
• Missing data can be represented as '?'.
• Spaces within the aligned sequences are allowed, usually after every 10 bases.
Example:
4 1061
4 indicates number of species taken for phylogenetic analysis
1061 indicates number of characters.
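A file in this layout can be produced with a few lines of Python. The sketch below is a minimal illustration: the function name to_phylip and the example sequences are made up for this example, and it writes each sequence on a single line after its padded 10-character name.

```python
def to_phylip(alignment, block=10):
    """Format a dict of name -> aligned sequence as PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_chars = lengths.pop()
    # Header line: number of sequences, then number of characters.
    lines = [f"{len(alignment)} {n_chars}"]
    for name, seq in alignment.items():
        padded_name = name[:10].ljust(10)   # names are exactly 10 characters
        blocks = " ".join(seq[i:i + block] for i in range(0, n_chars, block))
        lines.append(padded_name + blocks)
    return "\n".join(lines) + "\n"


example = {
    "Human": "ACGTACGTACGT",
    "Chimp": "ACGTACGTAC-T",   # '-' marks a gap
    "Mouse": "ACGAACGTAC?T",   # '?' marks missing data
}
print(to_phylip(example), end="")
```

Names longer than 10 characters are truncated here, so names that only differ after the tenth character would need to be renamed first.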
PHYLIP program :
The PHYLIP programs have to be run in a sequential manner: the output of one program is used as
the input of another program, so the user has to know the order in which to use them. Simple
examples of running PHYLIP programs are given in the flowcharts below.
Methods involved in PHYLIP:
1. Maximum parsimony method
2. Distance method
3. Maximum likelihood methods
Maximum parsimony method: A character-based method that infers a phylogenetic tree by
minimizing the total number of evolutionary steps, or total tree length, for a given set of data.
It is also referred to as a sequence-based tree reconstruction method.
Distance methods: Evolutionary distances are calculated between all pairs of operational taxonomic
units, and a tree is built in which the distances between the operational taxonomic units match
these distances.
Maximum likelihood method: Uses a model of sequence evolution to find the tree that
gives the highest likelihood of the observed data.
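As an illustration of the distance method, the Python sketch below computes pairwise distances from aligned sequences. The function names are hypothetical, and the Jukes-Cantor correction is used only because it is one of the simplest models; it is an assumption of this example and not necessarily the model a given PHYLIP program applies by default.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps ('-') and missing data ('?')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-?" and y not in "-?"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(a, b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every pair of sequences."""
    names = list(alignment)
    return {(m, n): jukes_cantor(alignment[m], alignment[n])
            for m in names for n in names}


alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACCTACGA"}
for pair, d in distance_matrix(alignment).items():
    print(pair, round(d, 4))
```

A tree-building method such as neighbor joining would then take this matrix as its input.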
Programs used in PHYLIP :
The following programs are available in the PHYLIP package.
Dnapars: Estimates the phylogeny from nucleic acid sequences using the parsimony method.
Dnamove: An interactive program for constructing phylogenies from nucleic acid
sequences using the parsimony method.
Dnapenny: Finds the most parsimonious phylogenies for nucleic acid sequences using the branch
and bound method.
Dnacomp: Estimates the phylogeny from nucleic acid sequences using the compatibility criterion,
which searches for the largest number of sites that could have evolved uniquely on the same tree.
Dnainvar: Computes phylogenetic invariants for nucleic acid sequences, which test alternative tree
topologies. The program also tabulates the frequencies of occurrence of the different nucleotide
patterns.
Dnaml: Estimates phylogenies from nucleotide sequences by the maximum likelihood method
without assuming a molecular clock. A molecular clock is used to estimate the timing of
evolutionary events.
Dnamlk: Estimates the phylogeny using the maximum likelihood method, assuming a molecular
clock.
Dnadist: Calculates the pairwise distances between nucleic acid sequences. It can also make a
table of percentage similarity among the sequences.
Seqboot: Reads a dataset and produces multiple datasets by bootstrap resampling. Most of the
programs in the current version allow processing of multiple datasets; this can be used together
with the consensus tree program Consense.
Consense: Computes a consensus tree from a set of input trees by the consensus tree method.
Protpars: Estimates phylogenies from protein sequences using the parsimony method.
Protdist: Computes distances between protein sequences using maximum likelihood estimates
based on the PAM matrix, the JTT model or the PMB model. It can also give the percentage of
similarity among the sequences.
Proml: Estimates phylogenies from amino acid sequences using the maximum likelihood method.
The program allows for different rates of change at known sites. Proml does not assume a
molecular clock.
Promlk: Estimates phylogenies from amino acid sequences using the maximum likelihood
method, assuming a molecular clock.
Restml: Estimates the phylogeny using the maximum likelihood method from restriction sites data. It
does not allow for a difference in rate between transitions and transversions.
Restdist: Calculates distances from restriction sites data or
restriction fragments data.
Fitch: Estimates phylogenies from distance matrix data under the “additive tree model”. It uses the
Fitch-Margoliash criterion and some related least squares criteria. It does not assume
an evolutionary clock. The program works with distances computed from molecular sequences, fragment
distances, and genetic distances calculated from gene frequencies.
Kitsch: Estimates phylogenies from distance matrix data under the “ultrametric model”, which is the
same as the additive tree model except that an evolutionary clock is assumed. It is otherwise
similar to Fitch.
Neighbor: Neighbor joining is a distance matrix method that produces an unrooted tree without
the assumption of an evolutionary clock. The method is very fast and can handle large data sets.
Dnadist: A distance matrix program used to find the distances between nucleic
acid sequences. It can also give the percentage similarity among the sequences.
Protdist: Computes distances between protein sequences using the maximum likelihood method.
Restdist: Computes distances from restriction sites data and restriction fragments data.
Drawgram: Draws rooted phylogenies, cladograms and circular trees in a wide variety of styles.
Drawtree: Draws unrooted phylogenies, in a similar way to Drawgram.
Treedist: Computes distances between trees, either using branch lengths or based only on
differences in tree topology.
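The bootstrap resampling carried out by a program such as Seqboot can be sketched in Python as follows. This is a simplified illustration of the idea (sampling alignment columns with replacement); the function names and the fixed random seed are assumptions of this example, and it does not reproduce Seqboot's file handling or options.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same columns are used for every sequence, so sites stay aligned.
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return a list of resampled alignments of the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "AGGTAC"}
for replicate in bootstrap_replicates(alignment, 3):
    print(replicate)
```

A tree would be built from each replicate, and a consensus tree program then counts how often each grouping appears among those trees.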
Procedure for phylogenetic analysis :
Procedure For PHYLIP :
Align the multiple DNA sequences (for example with ClustalW) and save the alignment in PHYLIP
format as infile.phy.
Start the Dnadist program by clicking its icon and give this file as input.
All the PHYLIP programs are menu-driven. Dnadist calculates pairwise distances
between the sequences. At first, Dnadist looks for the input file in the PHYLIP folder.
If the file does not exist, it asks for the correct file name. After the correct input is given,
the program shows a menu in which any setting can be changed by typing its first letter or number.
If no changes are required, typing ‘Y’ starts the program. The output is written to
a file called outfile, which can be used as the input of another program. The output
is represented in Figure 1.
Figure 1: Distance Representation
Like Dnadist, Neighbor also gives a sequence distance analysis. The output of Dnadist is given as
input to Neighbor. The output file and tree file are written to outfile and outtree, as
represented in Figure 2.
Figure 2 : Sequence distance analysis.
Branch lengths and the tree are obtained with the Neighbor joining method. The outfile and
outtree produced by the Neighbor joining method are given below (Figures 3 & 4).
Figure 3: Outfile
Figure 4: Outtree
The cladogram is obtained with the consensus tree program. Its input is the output
(outtree) of the Neighbor program, and it generates its own outfile and outtree, which represent
the consensus tree. The numbers on the branches indicate the number of times the partition of the
species into the two sets separated by that branch occurred among the trees. Here the outfile and
outtree are represented in Figure 5 and Figure 6.
Figure 5 : Cladogram representation (outfile).
Figure 6 :Cladogram Representation (outtree).
Experiment No. 3
Objective:
• Secondary structure analysis of a protein using SOPMA
Theory:
In this exercise one can learn how to analyze the secondary structure of a protein using SOPMA.
The structure of a protein plays a very important role in its function. The binding of a protein
to other molecules must be very specific for it to carry out its function properly. For this
reason every protein has a particular structure. Protein structures are classified into primary,
secondary, tertiary and quaternary. Proteins are synthesized as a primary sequence, which then
folds to form the secondary, tertiary and quaternary structure.
All proteins are made up of long chains of amino acids that fold into a 3-D shape. An amino acid
is an organic compound in which an α carbon is bonded to a hydrogen atom, two functional groups and
a side chain (R group). The α carbon is the first carbon atom attached to a functional group. The
two functional groups in an amino acid are an amino group and a carboxyl group. The side chain is
specific to each amino acid. Twenty standard amino acids, which differ in their R groups, are found
in the human body. An R group can be hydrophobic or hydrophilic. Hydrophobic side chains tend to
move away from a water environment, while hydrophilic side chains are attracted towards it. The
atoms attached to some of the hydrophilic side chains make them acidic, and others make them
basic, so the basic side chains are attracted towards the acidic ones. These interactions help the
protein to adopt its native conformation. The native conformation is the state of a protein in
which it is correctly folded and functional.
Amino acids are linked to each other by peptide bonds. A peptide bond is formed when the carboxyl
group of one amino acid is linked to the amino group of another through a covalent bond.
During this reaction a molecule of water is released. Short sequences of amino acids held together
by peptide bonds are called peptides. Each amino acid in a peptide is called a residue. Every
peptide has an N-terminal residue at one end and a C-terminal residue at the other. The N-terminus
is the start of a protein and contains an amino acid with a free amine group (-NH2), and the
C-terminus is the end of a protein and contains an amino acid with a free carboxyl group (-COOH).
The primary structure of a protein is the linear sequence of its amino acids. It is synthesized
during translation of mRNA, which is itself transcribed from DNA. DNA (deoxyribonucleic acid) is
the genetic material that contains all the genetic information for the development and maintenance
of all functions in living organisms. The information is stored as a genetic code using four types
of bases: adenine (A), guanine (G), cytosine (C) and thymine (T). In the two strands of DNA,
adenine always pairs with thymine and guanine pairs with cytosine. Each base is attached to a
sugar and a phosphate group to form a nucleotide. The base pairing of DNA results in a twisted
ladder-shaped structure of the two strands, which is called a double helix. In its bases, RNA
differs from DNA in one respect: RNA has uracil (U) instead of thymine. mRNA (messenger RNA) is a
molecule of RNA that is formed from DNA by the transcription process. During transcription, DNA is
transcribed to mRNA, i.e. thymine is replaced by uracil.
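The replacement of thymine by uracil, and the reading of the resulting mRNA in triplets, can be shown with a short Python sketch. The function names are hypothetical, and the example assumes the input is the coding strand of the DNA, so the mRNA has the same sequence apart from the T to U change.

```python
def transcribe(coding_strand):
    """Return the mRNA sequence for a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def codons(mrna):
    """Split an mRNA into complete triplets, ignoring any leftover bases."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


dna = "ATGGCTTAA"
mrna = transcribe(dna)
print(mrna)          # AUGGCUUAA
print(codons(mrna))  # ['AUG', 'GCU', 'UAA']
```

Each triplet would then be looked up in the genetic code to give one amino acid of the primary structure.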
Intermolecular and intramolecular hydrogen bonding between the amide groups in the primary
structure of a protein forms the secondary structure. The attraction of a hydrogen atom towards an
electronegative atom (N, F, O etc.) within the same molecule is called intramolecular hydrogen
bonding, and when it forms between two different molecules it is called intermolecular hydrogen
bonding. Alpha helices and beta sheets are the two important secondary structures in proteins. The
alpha helix has a right-handed helical conformation. It is stabilized by hydrogen bonds between the
carbonyl (CO) group of one amino acid and the amino (NH) group of the fourth amino acid from it
towards the C-terminus. Structures formed by a zigzag backbone of amino acids are called strands
(e.g. beta strands). Beta sheets are planar structures made up of beta strands that are connected
through hydrogen bonds.
The structural information of a protein can be determined by X-ray crystallography or nuclear
magnetic resonance (NMR) spectroscopy. In X-ray crystallography, X-rays of a particular
wavelength, comparable to the size of an atom, are diffracted by the electrons of the atoms. The
resulting diffraction patterns are obtained as small spots on an X-ray film. These patterns are
used to calculate the coordinates of the atoms in the protein. NMR spectroscopy is also used for
determining the structure of molecules. The nucleus of an atom that is placed in a strong magnetic
field can absorb electromagnetic radiation of a particular frequency. Electromagnetic radiation is
a form of energy that consists of both electric and magnetic fields. This type of radiation
includes X-rays, gamma rays, radio waves, visible light etc.
The Self-Optimized Prediction Method with Alignment (SOPMA) is a tool to predict the secondary
structure of a protein. Based on the query (the primary sequence of a protein), SOPMA predicts its
secondary structure. SOPMA uses the homologue method of Levin et al. According to this method,
short homologous sequences of amino acids tend to form similar secondary structures. SOPMA
therefore has a database consisting of 126 chains of non-homologous proteins. If the user enters
an unknown protein, it is searched against a collection of proteins in the database that have
some similar properties and evolutionary history.
Procedure:
In this exercise one can learn how to analyze the secondary structure of a protein using SOPMA.
Figure 1 shows the home page of SOPMA.
Figure1: Home page of SOPMA
Here we can paste the protein sequence in the text box provided. By default the output
width is 70, which means that the output shows up to 70 amino acids in each line. We can change
the output width if we want. The parameters include options like ‘Number of conformational
states’, ‘Similarity threshold’ and ‘Window width’. The user can select ‘Number of conformational
states’ as either ‘3 (Helix, Sheet, Coil)’ or ‘4 (Helix, Sheet, Turn, Coil)’. The former predicts
the percentage of helix, sheet and coil structure, while the latter predicts the percentage of
helix, sheet, turn and coil.
Figure 6: Paste the FASTA sequence of hemoglobin. Now click on the submit button.
The following figures show the SOPMA results.
Figure7: SOPMA results
Since the output width was set to 70, the result shows 70 amino acids and the corresponding
predicted structures in each line. The sequence length is also displayed in the output (333 amino
acids in this case). The percentage of each structure is also listed on this page. For example,
for Alpha helix it is 59.46%.
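The percentages on the result page are simply the share of residues assigned to each state. The Python sketch below shows the arithmetic on a made-up prediction string; the function name, the one-letter state codes (h, e, t, c) and the example string are assumptions for illustration and are not the actual SOPMA output for this protein.

```python
from collections import Counter

# Assumed one-letter codes for the four conformational states.
STATE_NAMES = {
    "h": "Alpha helix",
    "e": "Extended strand",
    "t": "Beta turn",
    "c": "Random coil",
}


def state_percentages(prediction):
    """Return the percentage of residues predicted in each state."""
    if not prediction:
        raise ValueError("prediction string is empty")
    counts = Counter(prediction.lower())
    total = len(prediction)
    return {name: round(100 * counts[code] / total, 2)
            for code, name in STATE_NAMES.items()}


# Illustration only: 198 helix residues out of 333 give 59.46%.
made_up_prediction = "h" * 198 + "c" * 135
print(len(made_up_prediction))
print(state_percentages(made_up_prediction))
```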
Figure 8: SOPMA results
There are two graphs shown on the result page of SOPMA. The first one visualizes the prediction.
The second contains score curves for all predicted states. The page also shows the parameters,
such as window width and number of states, that were used for the prediction. It provides a link
to the prediction result file, which gives the result in text format. There are also links to the
intermediate result files.