
Aanchal Maurya Bioinformatics 2

The document is an assignment on Bioinformatics-II submitted by Aanchal Maurya to Dr. Akrati Dev, covering various experiments related to sequence similarity searches using the BLAST algorithm, evolutionary relationships using PHYLIP, and secondary structure analysis of proteins using SOPMA. It includes detailed explanations of sequence alignment methods, DNA and protein sequencing techniques, and the BLAST algorithm's functioning and parameters. Additionally, it outlines the steps for running BLAST programs and interpreting the results.

Uploaded by

Standard Code Learning


Dr. M C Saxena College of Engineering & Technology

Assignment
on
BIOINFORMATICS-II

Submitted to: Dr. Akrati Dev, HOD (Biotech)
Submitted by: Aanchal Maurya, B.Tech (Biotech)

INDEX

S.no  Experiment

1.    (a) To study the pairwise sequence similarity search using the BLAST algorithm.
      (b) To study the functional and evolutionary relationships between different sequences.

2.    To find the evolutionary relationships between different organisms and analyse the changes that occurred in organisms during the course of evolution using PHYLIP.

3.    Secondary structure analysis of a protein using SOPMA.

Experiment No. 1

Objective:
• To study the pairwise sequence similarity search using BLAST algorithm.
• To study the functional and evolutionary relationships between different sequences.

Theory:

The BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the National Institutes of Health (NIH) and was published in the Journal of Molecular Biology in 1990. BLAST (Basic Local Alignment Search Tool) is a heuristic search algorithm: rather than examining every possible alignment, it quickly narrows the search to the most promising candidates. It takes a nucleotide or protein sequence as input and compares it with sequences in existing databases such as GenBank, hosted at NCBI. It finds regions of local similarity between sequences and calculates the statistical significance of the matches. It can also be used to find functional and evolutionary relationships between different sequences. The search is carried out by breaking the query into words of a certain size, comparing each word with the database sequences and assigning a score to each comparison. Words that score above a threshold are taken as matches, and the alignment is extended in both directions from each match. After the extension is complete, the total score is calculated, and the alignment is displayed on the BLAST result page only if the total score exceeds the threshold value.

Sequence, Sequence Alignment and importance:

A biological sequence is a sequence of characters representing DNA, RNA or protein. The two most commonly known types of biological sequence are the nucleotide sequence and the protein sequence. A DNA nucleotide sequence is formed of four different nucleotides, namely adenine (A), guanine (G), cytosine (C) and thymine (T), while a protein sequence is formed of the 20 commonly found amino acids. Nucleotides are read in triplets (a triplet code, or codon, is a group of three nucleotides), each of which codes for one amino acid. These sequences are indexed in existing databases, from which they can be retrieved. The sequences themselves are obtained by the methods explained below.

DNA sequencing methods:

Sanger Method (dideoxy chain termination method): Four test tubes are taken and labelled A, T, G and C. Denatured (single-stranded) DNA is added to each tube, followed by a primer that anneals to one strand of the template. DNA polymerase extends the primer from its 3' end, incorporating deoxynucleotides (dNTPs) and, at random, the dideoxynucleotide (ddNTP) specific to that tube. When a ddNTP is incorporated into the growing chain, the chain terminates because the ddNTP lacks the 3'-OH group needed to form the phosphodiester bond with the next nucleotide. DNA fragments of different lengths are thus formed. The fragments are separated by electrophoresis, and the sequence is read from the order of the bands in the gel, which reflects fragment size. The primer or one of the nucleotides can be radioactively or fluorescently labelled so that the products are easily detected in the gel and the sequence can be inferred.

Maxam-Gilbert (Chemical degradation method): This method requires a denatured DNA fragment whose 5' end is radioactively labelled. The fragment is purified and then subjected to chemical treatment, which cleaves it at specific bases and produces a series of labelled fragments. Electrophoresis separates the fragments according to their size. To view the fragments, the gel is exposed to X-ray film for autoradiography. A series of dark bands appears, each corresponding to a radiolabelled DNA fragment, from which the sequence can be inferred.

Protein sequencing methods:

Edman Degradation reaction: This reaction determines the order of amino acids in a protein starting from the N-terminal, by cleaving one amino acid at a time from the N-terminal without disturbing the bonds between the remaining residues. After each cleavage, chromatography or electrophoresis is used to identify the released amino acid.

Mass Spectrometry: It is used to determine the mass of particles, the composition of molecules and the chemical structures of molecules such as peptides and other chemical compounds. Based on the mass-to-charge ratio of the fragments, one can identify the amino acids in a protein.

Sequence alignment, or sequence comparison, lies at the heart of bioinformatics. It is a way of arranging DNA, RNA or protein sequences so that the regions of similarity among them can be identified. It is used to infer structural, functional and evolutionary relationships between sequences. Alignment measures the level of similarity between the query sequence and the available database sequences. The classical alignment algorithms use a dynamic programming approach, which breaks the problem into smaller subproblems and reuses their solutions. The alignment is made quantitative by assigning scores.

Methods of Sequence Alignment:

There are mainly two methods of Sequence Alignment:

Global Alignment: Sequences that are of similar length and quite similar overall are well suited to global alignment. Here the alignment is carried out from the beginning to the end of both sequences to find the best possible alignment.

Local Alignment: Sequences that are only suspected to be similar, or that are largely dissimilar, can be compared with the local alignment method. It finds the local regions with a high level of similarity.
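
The difference between the two methods can be illustrated with a small dynamic programming sketch. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; the global variant follows the Needleman-Wunsch scheme and the local variant the Smith-Waterman scheme, neither of which is what BLAST itself runs.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the best alignment score of a and b with a linear gap penalty.

    local=False: end-to-end (global) alignment score.
    local=True:  best-scoring local region (scores never drop below 0).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                          # global score
print(alignment_score("TTTACGTTT", "GGGACGGGG", local=True))   # local score
```

In the second call only the shared "ACG" region contributes to the score, which is why local alignment suits sequences that are similar in just a part of their length.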

BLAST is one of the pairwise sequence alignment tools used to compare different sequences. There are different BLAST programs for different comparisons, as shown in Table 1.

Nucleotide BLAST Programs:

BLASTN: The initial search is done for words of length ‘w’ with a threshold score ‘T’. The whole sequence is divided into overlapping words of length 11 for nucleic acids. The maximum number of words is L - w + 1, where L is the sequence length and w is the word length. The BLASTn algorithm parses the query nucleotide sequence into 11-letter “words”; the same is done for every sequence in the database being searched, and word matches between the query and the database sequences are identified. This program searches for somewhat similar sequences.
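
As a minimal sketch of the word-splitting step, the snippet below lists the overlapping words of a sequence and confirms the L - w + 1 count. The function name and the example sequence are made up for illustration.

```python
def split_into_words(sequence, w=11):
    """Return every overlapping word of length w in the sequence."""
    if w <= 0:
        raise ValueError("word size must be positive")
    # A sequence of length L yields L - w + 1 words (none if L < w).
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


query = "ATGCGTACGTTAGCTAGGCTA"   # hypothetical 21-base query
words = split_into_words(query, w=11)
print(len(query), len(words))      # L and L - w + 1
print(words[0], words[-1])
```

The same function with w=3 gives the 3-letter words used for protein queries.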

Mega BLAST: Searches for highly similar sequences.

Discontiguous Mega BLAST: Searches for more dissimilar sequences.

Protein BLAST Programs:

BLASTp: Finds the similarity between the query protein sequence and the protein sequences available in the protein database. Like the other BLAST programs, BLASTp looks for local regions of similarity; when the similarity spans the whole sequence it also reports a global alignment, which is the preferred result for protein identification. The BLASTp algorithm parses the query protein sequence into 3-letter “words”; the same is done for every sequence in the database being searched, and word matches are identified from the database.

PSI-BLAST: Position-Specific Iterated BLAST is the most sensitive BLAST program. It is used to find very distantly related proteins or new members of a protein family. The algorithm builds a position-specific scoring matrix (PSSM, or profile) from an iterative alignment of sequences; hits whose E-values fall below a threshold (default = 0.005) are included when the profile is rebuilt for the next iteration. The E-value decreases exponentially with the score that is assigned to a match between two sequences.

PHI-BLAST: Pattern-Hit Initiated BLAST is used to find protein sequences that contain a pattern specified by the user and are also similar to the query sequence. The similarity requirement was introduced to reduce the number of hits that contain the pattern but are unlikely to have any true homology to the query. To run PHI-BLAST, enter the query into the Search box and enter the pattern into the PHI pattern box. Only one pattern can be used in one search.

Scoring matrices:

Scoring matrices are used to assign a score to the comparison of each pair of characters. There are different types of scoring matrices:

Identity Matrices: In this type of matrix the scores are either 1 or 0, with the 1's lying along the diagonal. The scoring scheme is based simply on matches and mismatches.

Unitary scoring matrices: These matrices also have either 0 or 1 as their scores. The difference is that they take into account transitions (changes among purines or among pyrimidines) and transversions (changes between a purine and a pyrimidine).

PAM Matrices: Margaret Dayhoff was the first to develop the PAM matrix. PAM stands for Point Accepted Mutations. PAM matrices are calculated by observing the differences in closely related proteins. One PAM unit (PAM1) corresponds to one accepted point mutation per 100 amino acid residues, i.e. 1% of residues change and 99% remain as they are.

BLOSUM: BLOcks SUbstitution Matrix, developed by Henikoff and Henikoff in 1992 using conserved regions (blocks) of related proteins. The number attached to a matrix refers to percentage identity: BLOSUM62, for example, is derived from blocks in which sequences sharing 62% identity or more are clustered together.
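
The two simplest schemes above can be sketched for nucleotides as follows. This is an illustrative sketch only: the function names are assumptions, and the classifier simply labels a substitution as a transition or a transversion rather than reproducing any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(x, y):
    """Identity matrix entry: 1 on the diagonal (match), 0 elsewhere."""
    return 1 if x == y else 0


def substitution_type(x, y):
    """Classify a pair of bases as match, transition or transversion."""
    if x == y:
        return "match"
    pair = {x, y}
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def build_matrix(alphabet, scorer):
    """Tabulate scorer(x, y) for every pair of letters in the alphabet."""
    return {(x, y): scorer(x, y) for x in alphabet for y in alphabet}


identity = build_matrix("ACGT", identity_score)
print(identity[("A", "A")], identity[("A", "G")])
print(substitution_type("A", "G"), substitution_type("A", "C"))
```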

Parameters used in BLAST algorithm:

Threshold: A boundary value (minimum or maximum) that is used to filter out words during comparison.

True Homology: In BLAST, true homology refers to a database sequence being genuinely related to the query sequence, as judged by how similar the two are, rather than matching by chance.

E-value: The number of alignments with an equal or better score that would be expected to occur by chance. It decreases exponentially with the score that is assigned to an alignment between two sequences.

Word size: The search is done by taking words of a certain size from the query, comparing them with the database sequences and assigning a score to each comparison. The word size is 11 for nucleic acids and 3 for proteins.

Putative conserved domains: These are regions of the query that match known conserved domains, each of which is associated with a particular function.

Gap score or gap penalty: Dynamic programming algorithms use gap penalties to keep alignments biologically meaningful. A gap penalty is subtracted for each gap that is introduced. The gap score is the penalty given to an alignment for an insertion or deletion. During evolution a single event may insert or delete several residues at once, producing a run of continuous gaps, so a linear gap penalty (the same cost for every gap position) would not be appropriate. For this reason separate gap open and gap extension penalties are used for runs of continuous gaps (five or more). The open penalty is always applied at the start of the gap, and each gap position that follows is given a gap extension penalty, which is smaller than the open penalty. Typical values are -12 for gap opening and -4 for gap extension.
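
The open/extension scheme can be written as a short function. This is a sketch under one common convention, assumed here: the first position of a gap costs the open penalty and every further position costs the extension penalty. Other tools charge open plus extension for the first position, so the numbers below are illustrative only.

```python
def affine_gap_penalty(gap_length, gap_open=-12, gap_extend=-4):
    """Score contribution of one run of gap_length consecutive gap positions."""
    if gap_length <= 0:
        return 0
    # First position pays the open penalty, the rest pay the extension penalty.
    return gap_open + (gap_length - 1) * gap_extend


def linear_gap_penalty(gap_length, per_gap=-12):
    """Every gap position pays the same penalty."""
    return max(gap_length, 0) * per_gap


for length in (1, 2, 5):
    print(length, affine_gap_penalty(length), linear_gap_penalty(length))
```

For a run of five gaps the affine scheme gives -28 where the linear scheme gives -60, which is why one long gap is preferred over several scattered ones.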
Working of BLAST Algorithm:

• The query sequence is taken and analysed for low-complexity regions. Low-complexity regions are regions that contain little information or variation, such as AAAAAAAA or ATATATAT.

• These low-complexity regions are masked with letters such as X (for proteins) or N (for nucleotides).

• A list of words of a certain word size is made. Usually the word size is 3 for proteins and 11 for DNA.

• Scores are calculated for each pair of words (query sequence word and database word) using substitution scoring matrices (such as PAM or BLOSUM), and only the high-scoring words, i.e. those above a threshold value or cutoff score, are taken for further alignment. The cutoff score is chosen to reduce the number of matches and so decrease the computation time.

• This scoring and checking is repeated for all the words in the query sequence.

• The remaining high-scoring words are organised into an efficient search tree and rapidly compared with the database sequences. This is done to find the exact matches.

• If an exact or good match is found, an alignment is extended in both directions from the position where the match occurred.

Image source : petang.cgu.edu.tw/Bioinfomatics/MANUALS/NCBIblast/BLAST_algorithm.html

• High-scoring segment pairs (HSPs) that have a score greater than a threshold are taken for consideration.

• The significance of each HSP score is calculated. The probability p of observing a score S equal to or greater than x is given by the equation:

p(S ≥ x) = 1 − exp(−e^(−λ(x − μ)))

where μ = ln(K m' n') / λ, K and λ are statistical parameters that depend on the scoring system, and m' and n' are the effective lengths of the query and the database.

• Statistical assessments are made when two or more HSP regions are found, and the matching pairs are listed in the output file in descending order of similarity score.
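
The statement that significance falls off exponentially with score can be checked numerically. The sketch below uses the standard Karlin-Altschul relations E = K·m·n·e^(−λS) and p = 1 − e^(−E); the values of λ and K, the sequence lengths and the function names are illustrative placeholders, not values taken from this document.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """E-value: expected number of chance alignments scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance alignment, given the E-value."""
    return -math.expm1(-e_value)   # numerically stable 1 - exp(-E)


def bit_score(raw_score, lam, k):
    """Normalised score that is independent of the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


LAMBDA, K = 0.267, 0.041           # placeholder parameters (assumed)
for score in (40, 60, 80):
    e = expected_hits(score, query_len=300, db_len=1_000_000, lam=LAMBDA, k=K)
    print(score, round(bit_score(score, LAMBDA, K), 1), e, p_value(e))
```

Each extra point of raw score multiplies the E-value by e^(−λ), so high-scoring alignments quickly become very unlikely to be chance matches.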

BLAST Procedure

This is the common procedure for any BLAST program.

Step 1: Select the BLAST program.

Step 2: Enter a query sequence or upload a file containing sequence.
Step 3: Select the database to search.
Step 4: Select the algorithm and the parameters of the algorithm for the search.
Step 5: Run the BLAST program.

Step 1: Select the BLAST program

The user has to specify the type of BLAST program to run, such as BLASTp, BLASTn, BLASTx, tBLASTn or tBLASTx.

Step 2: Enter a query sequence or upload a file containing sequence

Enter a query sequence by pasting it into the query box or by uploading a FASTA file that contains the sequence for the similarity search. This step is the same for all BLAST programs. The user can give an accession number, a gi number or a raw FASTA sequence.

Figure 1: Enter a query sequence or upload a file containing sequence

Step 3: Select database to search

The user first has to know which databases are available and what type of sequences each of them contains. A sequence similarity search looks for sequences similar to the query sequence in the selected database (Figure 2).

Figure 2: Select database to search

Step 4: Select the algorithm and the parameters of the algorithm for the
search

Some of the BLAST programs offer a choice of algorithms, and the user has to specify which one to use. Nucleotide BLAST offers MegaBLAST, which searches for highly similar sequences, discontiguous MegaBLAST, which searches for more dissimilar sequences, and BLASTn, which searches for somewhat similar sequences. Protein BLAST offers BLASTp, which searches for similarity between a protein query and a protein database; PSI-BLAST, which performs a position-specific search iteratively; PHI-BLAST, which searches the database for sequences containing a particular pattern present in the query (the user has to enter the pattern in the PHI pattern box provided); and DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST), which searches multiple sequences and aligns them to find protein homology. The algorithm parameters required to run BLAST programs are Target sequences, Short queries, E-value, Word size, Query range, scoring parameters (Match/Mismatch scores and Gap penalties) and filters (Filter and Mask). Default values are provided, but the user can adjust them as needed, as shown in Figure 3.

Figure 3: Algorithm and the parameters

Step 5: Run the BLAST program

The search is submitted by clicking the BLAST button at the bottom of the page. A screenshot of the result is shown in Figure 4.

Figure 4: Run the BLAST program
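
The five steps above describe the web interface. For comparison, the same choices (program, query, database, parameters) can be passed to the standalone NCBI BLAST+ command-line programs. The sketch below only assembles the command and does not run it; it assumes BLAST+ is installed, and the file and database names are hypothetical.

```python
def build_blast_command(program, query_file, database,
                        evalue=10.0, word_size=None, out_file="results.txt"):
    """Assemble a BLAST+ command line as a list of arguments.

    Step 1: program, Step 2: query, Step 3: database, Step 4: parameters.
    """
    allowed = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}
    if program not in allowed:
        raise ValueError(f"unknown BLAST program: {program}")
    command = [program,
               "-query", query_file,
               "-db", database,
               "-evalue", str(evalue),
               "-out", out_file]
    if word_size is not None:
        command += ["-word_size", str(word_size)]
    return command


# Hypothetical protein search with the word size of 3 described above.
cmd = build_blast_command("blastp", "query.fasta", "my_protein_db", word_size=3)
print(" ".join(cmd))
```

Step 5 would then correspond to passing this list to something like `subprocess.run(cmd, check=True)` on a machine where the program and database exist.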

BLAST Result:

After the query sequence is submitted for the similarity search, the result page appears with information such as the Query id, Description, Molecule type, Length of sequence, Database name and BLAST program. It also shows the putative conserved domains that were detected during the sequence similarity search.

The query sequence is represented as a numbered red bar below the color key. Database hits are shown below the query (red) bar, ordered by alignment score. Among the aligned sequences, the most closely related ones are placed nearest to the query sequence. The user can see a description of each alignment by moving the mouse over the corresponding colored bar, as shown in Figure 5.

Figure 5: BLAST result

Each alignment is preceded by the sequence identifier, along with the definition line and the length of the matched sequence, followed by the score and E-value. This header also gives the number of identical residues in the alignment (identities), the number of positives and the number of gaps used in the alignment. Finally it shows the actual alignment, with the query sequence on top and the database sequence below it. The numbers on either side of the alignment indicate the positions of the amino acids or nucleotides in the sequence, as shown in Figure 6.

Figure 6: BLAST result
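
The identities and gaps reported in the alignment header can be recomputed from the two aligned rows. The following is a minimal sketch with a made-up pair of aligned strings; it does not compute positives, which would need a substitution matrix.

```python
def alignment_stats(query_row, subject_row):
    """Count identities and gaps in a pairwise alignment given as two rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    length = len(query_row)
    identities = sum(1 for q, s in zip(query_row, subject_row)
                     if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query_row, subject_row)
               if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length if length else 0.0,
    }


# Hypothetical aligned rows: query on top, database sequence below.
print(alignment_stats("ACG-TTGA", "ACGATTCA"))
```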
Experiment No. 2

Objective:

• To find the evolutionary relationships between different organisms and analyse the changes that occurred in organisms during the course of evolution using PHYLIP.

Key words:

Phylogenetic analysis: The analysis of the evolutionary relationships between different organisms, which helps to find the changes that occurred in organisms during evolution.

Bootstrapping: A resampling method used to test how reliably a dataset supports the tree inferred from it.

Query: The input given by the user is called a query. It can be either a protein or a nucleotide sequence.

Rooted tree: A tree that has one special node, called the root, as its main node. A tree without a root is treated as a free (unrooted) tree.

Tree topology: Tree topology refers to the branching arrangement of a phylogenetic tree.
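
To make the bootstrapping keyword concrete, the sketch below resamples the columns of a small alignment with replacement, which is the basic idea behind producing bootstrap replicate datasets. The alignment and function name are made up for illustration, and this is not PHYLIP's own code.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # The same sampled columns are used for every sequence, so each
    # replicate column is a real column of the original alignment.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


alignment = {"Human": "ACGTACGTAC", "Chimp": "ACGTACGTAT", "Mouse": "ACGAACCTAT"}
rng = random.Random(42)            # fixed seed so the run is repeatable
for _ in range(3):
    print(bootstrap_alignment(alignment, rng))
```

A tree is then built from every replicate, and the fraction of replicates that contain a given grouping is taken as a measure of support for it.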

Theory:

PHYLIP is a complete phylogenetic analysis package developed by Joseph Felsenstein at the University of Washington. PHYLIP is used to find the evolutionary relationships between different organisms. The methods available in this package include the maximum parsimony, distance matrix and likelihood methods. Data is presented to each program as a plain text file, which the user prepares with a text editor or a word processor that can save plain text. Some sequence analysis programs, such as ClustalW, can write data files in PHYLIP format. Most of the programs look for an input file called "infile"; if they do not find this file, they ask the user to type in the name of the data file. Before starting the computation, the program lets the user set options through a menu. Output is written to files with names such as outfile and outtree.

PHYLIP file format:

• The first line of the input file gives the number of sequences (species) and the number of characters (nucleotides or amino acids) in each sequence.
• Each sequence name is 10 characters long. Spaces can be added to the end of shorter names to pad them to this length.
• Gaps can be represented as ‘-’.
• Missing data can be represented as ‘?’.
• Spaces within the aligned sequences are allowed, usually after every 10 bases.

Example:

4 1061

4 indicates the number of species taken for phylogenetic analysis.
1061 indicates the number of characters.
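
The format rules above can be sketched as a small writer. The function name and the two short sequences are hypothetical, and the sketch produces the simple sequential layout (one name and one sequence per line) without the optional spacing every 10 bases.

```python
def to_phylip(alignment):
    """Format aligned sequences as PHYLIP text (sequential layout)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    n_chars = lengths.pop()
    # Header line: number of species, then number of characters.
    lines = [f"{len(alignment)} {n_chars}"]
    for name, seq in alignment.items():
        # Names are truncated or space-padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


print(to_phylip({"Human": "ACGT-ACGTA", "Chimpanzee_1": "ACGTTACG?A"}))
```

Saving this text under the name infile lets a PHYLIP program pick it up without asking for a file name.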

PHYLIP program:

The PHYLIP programs are run in sequence, with the output of one program used as the input of the next. The user has to know how to chain these programs together. Simple examples of running PHYLIP programs are given in the flowcharts below.

Methods involved in PHYLIP:

1. Maximum parsimony method
2. Distance method
3. Maximum likelihood method

Maximum parsimony method: A character-based method that infers a phylogenetic tree by minimizing the total number of evolutionary steps, or total tree length, for a given set of data. It is also referred to as a sequence-based tree reconstruction method.

Distance methods: Evolutionary distances are calculated between all pairs of operational taxonomic units, and a tree is built in which the distances between the operational taxonomic units match these calculated distances.

Maximum likelihood method: Uses a model of sequence evolution and finds the tree that gives the highest likelihood of the observed data.

Programs used in PHYLIP:

The following programs are available in the PHYLIP package.

Dnapars: Estimates phylogenies from nucleic acid sequences using the parsimony method.

Dnamove: An interactive program for constructing phylogenies from nucleic acid sequences using the parsimony method.

Dnapenny: Finds the most parsimonious phylogenies for nucleic acid sequences using the branch-and-bound method.

Dnacomp: Estimates phylogenies from nucleic acid sequences using the compatibility criterion, which searches for the largest number of sites that could have evolved uniquely on the same tree.

Dnainvar: Computes phylogenetic invariants for nucleic acid sequence data, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns.

Dnaml: Estimates phylogenies from nucleotide sequences by the maximum likelihood method without assuming a molecular clock. A molecular clock assumes a constant rate of change, which makes it possible to estimate the timing of evolutionary events.

Dnamlk: Estimates phylogenies by the maximum likelihood method, assuming a molecular clock.

Dnadist: A distance matrix program that calculates the pairwise distances between nucleic acid sequences. It can also produce a table of percentage similarity among the sequences.

Seqboot: Reads a dataset and produces multiple datasets from it by bootstrap resampling. Most of the programs in the current version can process multiple datasets; this can be used together with the consensus tree program Consense.

Consense: Computes consensus trees by the majority-rule consensus tree method, which also makes it easy to find the strict consensus tree.

Protpars: Estimates phylogenies from protein sequences using the parsimony method.

Protdist: Computes distances between protein sequences using maximum likelihood estimates based on the PAM matrix, the JTT model or the PMB model. It can also give the percentage of similarity among the sequences.

Proml: Estimates phylogenies from amino acid sequences by the maximum likelihood method, without a molecular clock. The program allows for different rates of change at different sites.

Promlk: Estimates phylogenies from amino acid sequences by the maximum likelihood method, assuming a molecular clock.

Restml: Estimates phylogenies by the maximum likelihood method from restriction sites data. It does not allow for a rate difference between transitions and transversions.

Restdist: Computes distances from restriction sites data and restriction fragment data.

Fitch: Estimates phylogenies from distance matrix data under the “additive tree model”, using the Fitch-Margoliash criterion and some related least squares criteria. It does not assume an evolutionary clock. The program can be used with distances computed from molecular sequences, fragment distances, and genetic distances calculated from gene frequencies.

Kitsch: Estimates phylogenies from distance matrix data under the “ultrametric model”, which is the same as the additive tree model except that an evolutionary clock is assumed. It is otherwise similar to Fitch.

Neighbor: Neighbor joining is a distance matrix method that produces an unrooted tree without the assumption of an evolutionary clock. The method is very fast and can handle large data sets.

Drawgram: Draws rooted phylogenies, cladograms and circular trees in a wide variety of formats.

Drawtree: Draws unrooted phylogenies, in a manner similar to Drawgram.

Treedist: Computes distances between trees. The measure allows for differences in tree topology and can also make use of branch lengths.
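
As an illustration of what a pairwise distance between two aligned nucleotide sequences means, the sketch below computes the proportion of differing sites and the Jukes-Cantor corrected distance. Jukes-Cantor is one of several models Dnadist offers and is chosen here only because it is the simplest; the function names and sequences are made up.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Sites with gaps or missing data are skipped.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: corrects p for multiple hits at one site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf            # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


a, b = "ACGTACGTACGTACGT", "ACGTACGAACGTTCGT"
print(p_distance(a, b), jukes_cantor(a, b))
```

The corrected distance is always at least as large as the raw proportion, because some sites will have changed more than once.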

Procedure for phylogenetic analysis:

Procedure for PHYLIP:

Align the multiple DNA sequences (the output of ClustalW) and save the alignment in PHYLIP format as infile.phy. Start the Dnadist program by clicking its icon and give this file as input.

All the PHYLIP programs are menu-driven. Dnadist calculates the pairwise distances between the sequences. First, Dnadist checks whether the input file is present in the PHYLIP folder. If the file does not exist, it asks for the correct file name. Once the correct input is given, any settings can be changed by typing the letter or number shown next to them in the menu. If no changes are required, typing ‘Y’ starts the program. The output is written to a file named outfile, so that it can be used as the input of another program. The output is shown in Figure 1.

Figure 1: Distance Representation

Neighbor takes the distance analysis further by building a tree from the distances. The output of Dnadist is given as input to Neighbor. The output file and tree file are written to outfile and outtree, as shown in Figure 2.

Figure 2: Sequence distance analysis

The branch lengths and the tree are obtained with the Neighbor joining method. The outfile and outtree produced by the Neighbor joining method are given below (Figures 3 & 4).

Figure 3: Outfile

Figure 4: Outtree

The cladogram is produced with the consensus tree program. Its input is the output (outtree) of the Neighbor program, and it generates its own outfile and outtree, which represent the consensus tree. Numbers on the branches indicate the number of times the partition of the species into the two sets separated by that branch occurred among the input trees. The outfile and outtree are shown in Figure 5 and Figure 6.

Figure 5: Cladogram representation (outfile)

Figure 6: Cladogram representation (outtree)

Experiment No. 3

Objective:

• Secondary structure analysis of a protein using SOPMA.

Theory:

In this exercise one can learn how to analyse the secondary structure of a protein using SOPMA. The structure of a protein plays a very important role in its function. A protein must bind other molecules very specifically to carry out its function properly, and for this reason every protein has a particular structure. Protein structure is classified into primary, secondary, tertiary and quaternary levels. A protein is synthesized as a primary sequence, which then folds to form the secondary, tertiary and quaternary structure.

All proteins are made up of long chains of amino acids that fold into a 3-D shape. Amino acids are organic compounds in which a central α carbon is bonded to a hydrogen atom, two functional groups and a side chain (R group). The α carbon is the first carbon atom attached to a functional group. The two functional groups in an amino acid are an amino group and a carboxyl group. The side chain is what distinguishes one amino acid from another. There are 20 standard amino acids found in the human body, which differ in their R groups. An R group can be hydrophobic or hydrophilic. Hydrophobic side chains tend to move away from a water environment, while hydrophilic side chains are attracted towards it. The atoms attached to some of the hydrophilic side chains make them acidic, and others make them basic, so the basic groups are attracted towards the acidic groups. Such interactions help the protein adopt its native conformation. The native conformation is the state in which a protein is correctly folded and functional.

Amino acids are linked to each other by peptide bonds. A peptide bond is formed when the carboxyl group of one amino acid is linked to the amino group of another amino acid through a covalent bond. During this reaction a molecule of water is released. A short sequence of amino acids held together by peptide bonds is called a peptide. Each amino acid in a peptide is called a residue. Every peptide has an N-terminus residue at one end and a C-terminus residue at the other. The N-terminus is the start of a protein, where the amino acid has a free amine group (-NH2), and the C-terminus is the end of a protein, where the amino acid has a free carboxyl group (-COOH).

The primary structure of a protein is the linear sequence of its amino acids. It is synthesized during translation, in which the mRNA transcribed from DNA is read to assemble the amino acid chain. DNA (deoxyribonucleic acid) is the genetic material that contains all the genetic information for the development and maintenance of all functions in all living organisms. The information is stored as a genetic code using four types of bases: adenine (A), guanine (G), cytosine (C) and thymine (T). In the two strands of DNA, adenine always pairs with thymine and guanine pairs with cytosine. Each base is bonded to a sugar and a phosphate group to form a nucleotide. The base pairing gives the two strands a ladder-like structure, which is twisted into a double helix. In its bases, RNA differs from DNA in only one: it has uracil (U) instead of thymine. mRNA (messenger RNA) is the RNA molecule formed from DNA by the transcription process. During transcription the DNA sequence is copied into mRNA, with uracil taking the place of thymine.
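
The replacement of thymine by uracil can be sketched in a few lines. The function names and the example sequence are assumptions for illustration; the second function assumes the template strand is supplied in the 3' to 5' direction so that the result reads 5' to 3'.

```python
def transcribe_coding_strand(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


# Base pairing between a DNA template base and the RNA base built against it.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template_strand(template_3_to_5):
    """Build the mRNA (5' to 3') that pairs with a template given 3' to 5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5.upper())


print(transcribe_coding_strand("ATGGCTTAA"))
print(transcribe_template_strand("TACCGAATT"))
```

Both calls describe the same hypothetical gene from its two strands, so they give the same mRNA.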

Hydrogen bonding between the amide groups of the polypeptide backbone turns the primary structure of a protein into secondary structure. The attraction of a hydrogen atom towards an electronegative atom (N, F, O etc.) within the same molecule is called intramolecular hydrogen bonding, and the same attraction between two different molecules is called intermolecular hydrogen bonding. Alpha helices and beta sheets are the two important secondary structures in proteins. The alpha helix has a right-handed helical conformation. It is stabilized by hydrogen bonds between the carbonyl (CO) group of one residue and the amino (NH) group of the residue four positions further towards the C-terminal. Structures formed by a zigzag backbone of amino acids are called strands (e.g. beta strands). Beta sheets are planar structures made up of beta strands connected through hydrogen bonds.

The structural information of a protein can be determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. In X-ray crystallography, X-rays whose wavelength is comparable to the size of an atom are diffracted by the electrons of the atoms in a protein crystal. The resulting diffraction pattern is recorded as small spots on an X-ray film. These patterns are used to calculate the coordinates of the atoms in the protein. NMR spectroscopy is also used for determining the structure of molecules. The nucleus of an atom placed in a strong magnetic field can absorb electromagnetic radiation of a particular frequency. Electromagnetic radiation is a form of energy that has both electric and magnetic fields. This type of radiation includes X-rays, gamma rays, radio waves, visible light etc.

The Self-Optimized Prediction Method with Alignment (SOPMA) is a tool for predicting the secondary
structure of a protein. Given a query (the primary sequence of a protein), SOPMA predicts its
secondary structure. SOPMA uses the homologue method of Levin et al. According to this method,
short homologous sequences of amino acids tend to form similar secondary structures. For this
purpose SOPMA has a database consisting of 126 chains of non-homologous proteins. When the user
enters an unknown protein, SOPMA searches this database for proteins that share similar properties
and evolutionary history with the query.
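
The idea that short similar segments tend to adopt similar secondary structures can be sketched in
Python as below. This is only a toy illustration under strong assumptions and is not SOPMA's actual
implementation: the three database entries are made up, similarity is a plain count of identical
residues rather than a proper scoring matrix, and the names `DATABASE`, `window_similarity` and
`predict` are hypothetical. The `width` and `threshold` arguments loosely mirror the 'Window width'
and 'Similarity threshold' parameters of the SOPMA form.

```python
from collections import Counter

# Made-up (sequence, secondary structure) pairs: h = helix, e = strand, c = coil.
# These are NOT real protein data; they only illustrate the lookup.
DATABASE = [
    ("AEELLKKA", "hhhhhhhh"),
    ("VTVYVTVG", "eeeeeeec"),
    ("GPNGSGDP", "cccccccc"),
]


def window_similarity(a, b):
    """Count identical positions between two windows of equal length."""
    return sum(x == y for x, y in zip(a, b))


def predict(query, width=5, threshold=3):
    """Assign a state to each residue from similar windows in DATABASE."""
    half = width // 2
    states = []
    for centre in range(len(query)):
        lo, hi = centre - half, centre + half + 1
        if lo < 0 or hi > len(query):
            states.append("c")  # incomplete window at the ends: default to coil
            continue
        window = query[lo:hi]
        votes = Counter()
        for seq, ss in DATABASE:
            for start in range(len(seq) - width + 1):
                score = window_similarity(window, seq[start:start + width])
                if score >= threshold:
                    # The central residue of the matching window votes for its state.
                    votes[ss[start + half]] += score
        states.append(votes.most_common(1)[0][0] if votes else "c")
    return "".join(states)


print(predict("AEELLKKA"))
```

In this sketch a residue receives the state that collects the highest total similarity among all
database windows passing the threshold, and falls back to coil when no window is similar enough.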

Procedure:
In this exercise one can learn how to analyze the secondary structure of a protein using SOPMA.
Figure 1 shows the home page of SOPMA.
For information on how to retrieve the input sequences, go to the simulator tab.
Figure 1: Home page of SOPMA

Here we can paste the protein sequence into the text box provided. By default the output width is
70, which means that the output shows up to 70 amino acids on each line. We can change the output
width if we want. The parameters include options such as ‘Number of conformational states’,
‘Similarity threshold’ and ‘Window width’. The user can select ‘Number of conformational states’ as
either ‘3 (Helix, Sheet, Coil)’ or ‘4 (Helix, Sheet, Turn, Coil)’. The former predicts the
percentage of helix, sheet and coil structure, while the latter predicts the percentage of helix,
sheet, turn and coil.
Figure 6: Paste the FASTA sequence of hemoglobin. Now click on the Submit button.
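
Before pasting, it can be useful to check the sequence itself. The following Python sketch, given
only as an illustration, strips the header line from a FASTA record and wraps the residues at a
chosen width (70 here, to match the default output width mentioned above). The record shown is a
made-up placeholder, not the real hemoglobin sequence, and the function names are hypothetical.

```python
def fasta_to_sequence(fasta_text):
    """Return the residues of a single FASTA record as one upper-case string."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    # Header lines start with '>' and blank lines are skipped.
    return "".join(line for line in lines if line and not line.startswith(">")).upper()


def wrap(sequence, width=70):
    """Split a sequence into consecutive lines of at most `width` residues."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Placeholder record (made-up header and residues, for illustration only).
example_fasta = """>example_protein placeholder
ACDEFGHIKLMNPQRSTVWY
acdefghiklmnpqrstvwy
"""

sequence = fasta_to_sequence(example_fasta)
print(len(sequence), "residues")
for line in wrap(sequence, width=70):
    print(line)
```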

The following figures show the SOPMA results.

Figure 7: SOPMA results

Since we set the output width to 70, the output shows 70 amino acids and the corresponding
predicted structures on each line. The sequence length is also displayed in the output (333 amino
acids in this case). The percentage of each structure is also listed on this page. For example, for
alpha helix it is 59.46%.
Figure 8: SOPMA results
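
The percentages listed on the results page are simply the share of residues assigned to each state.
The Python sketch below shows the arithmetic on a synthetic state string. It assumes one letter per
residue (h, e, t and c for helix, strand, turn and coil), and only the helix count is chosen to
reproduce the figures quoted above (198 of 333 residues gives 59.46%); the split of the remaining
residues is invented for illustration.

```python
from collections import Counter

# Assumed one-letter codes for the four conformational states.
STATE_NAMES = {"h": "Alpha helix", "e": "Extended strand", "t": "Beta turn", "c": "Random coil"}


def state_percentages(states):
    """Return the percentage of residues in each state, rounded to 2 decimals."""
    counts = Counter(states)
    total = len(states)
    return {state: round(100.0 * n / total, 2) for state, n in counts.items()}


# Synthetic prediction for a 333-residue protein (not real SOPMA output).
states = "h" * 198 + "e" * 30 + "t" * 15 + "c" * 90
percentages = state_percentages(states)

print("Sequence length:", len(states))
for code, name in STATE_NAMES.items():
    print(f"{name}: {percentages.get(code, 0.0):.2f}%")
```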

Two graphs are shown on the SOPMA results page. The first visualizes the prediction. The second
contains score curves for all predicted states. The page also shows the parameters, such as window
width and number of states, that were used for the prediction. It provides a link to the prediction
result file, which gives the result in text format. There are also links to the intermediate result
files.

