My Bioinformatics Notes

Uploaded by

zia750880

Definition:

"A field of science that uses computers, databases, math, and statistics to collect, store,
organize, and analyze large amounts of biological, medical, and health information."

Introduction and Applications


 With a large number of prokaryotic and eukaryotic genomes completely sequenced and
more forthcoming, access to the genomic information and synthesizing it for the
discovery of new knowledge have become central themes of modern biological research.
 Mining the genomic information requires the use of sophisticated computational tools.
 It is therefore imperative for the new generation of biologists to become familiar with a
field of study concerned with the careful storage, organization and indexing of
information in order to tackle the new challenges of the genomic era.
 Information science has been applied to biology to produce a field called
bioinformatics.
 It is concerned with the state-of-the-art computational tools available to solve biological
research problems.
 The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper to describe
“the study of informatic processes in biotic systems” and it found early use when the first
biological sequence data began to be shared.
 Bioinformatics is an interdisciplinary field that develops methods and software tools
for understanding biological data.
 The development of bioinformatics as a field is the result of advances in both molecular
biology and computer science over the past 30–40 years.
 As an interdisciplinary field of science, bioinformatics combines biology, computer
science, information engineering, mathematics and statistics to analyze and interpret
biological data.
 The key areas of bioinformatics include biological databases, sequence alignment, gene
and promoter prediction, molecular phylogenetics, structural bioinformatics, genomics,
and proteomics.
Bioinformatics vs Computational Biology
 Bioinformatics differs from a related field known as computational biology.
 Bioinformatics is limited to sequence, structural, and functional analysis of genes and
genomes and their corresponding products and is often considered computational
molecular biology.
 However, computational biology encompasses all biological areas that involve
computation.
 Put another way, bioinformatics is the development and application of computational
tools for managing all kinds of biological data, whereas computational biology is more
concerned with the theoretical development of the algorithms used in bioinformatics.
Applications of Bioinformatics
Bioinformatics has not only become essential for basic genomic and molecular biology research,
but is having a major impact on many areas of biotechnology and biomedical sciences. The main
uses of bioinformatics include:
 Bioinformatics plays a vital role in the areas of structural genomics, functional genomics,
and nutritional genomics.
 It covers emerging scientific research and the exploration of proteomes from the overall
level of intracellular protein composition (protein profiles), protein structure, protein-
protein interaction, and unique activity patterns (e.g. post-translational modifications).
 Bioinformatics is used for transcriptome analysis where mRNA expression levels can be
determined.
 Bioinformatics is used to identify and structurally modify a natural product, to design a
compound with the desired properties and to assess its therapeutic effects, theoretically.
 Cheminformatics analysis includes analyses such as similarity searching, clustering,
QSAR modeling, virtual screening, etc.
 Bioinformatics is playing an increasingly important role in almost all aspects of drug
discovery and drug development.
 Bioinformatics tools are very effective in prediction, analysis and interpretation of
clinical and preclinical findings.

Applications in Other Fields


Its major applications include the following fields:
Molecular medicine
 The human genome will have profound effects on the fields of biomedical research and
clinical medicine.
 The completion of the human genome and the use of bioinformatic tools means that we
can search for the genes directly associated with different diseases and begin to
understand the molecular basis of these diseases more clearly.
 This new knowledge of the molecular mechanisms of disease will enable better
treatments, cures and even preventative tests to be developed.
Personalised medicine
 Clinical medicine will become more personalised with the development of the field of
pharmacogenomics.
 This is the study of how an individual’s genetic inheritance affects the body’s response to
drugs.
 Today, doctors have to use trial and error to find the best drug to treat a particular patient
as those with the same clinical symptoms can show a wide range of responses to the same
treatment.
 In the future, doctors will be able to analyse a patient’s genetic profile and prescribe the
best available drug therapy and dosage from the beginning.
Preventative medicine
 With the specific details of the genetic mechanisms of diseases being unravelled, the
development of diagnostic tests to measure a person’s susceptibility to different diseases
may become a distinct reality.
Gene therapy
 In the not too distant future, with the use of bioinformatics tools, the potential for using
genes themselves to treat disease may become a reality.
 Gene therapy is the approach used to treat, cure or even prevent disease by changing the
expression of a person’s genes.
Drug development
 At present all drugs on the market target only about 500 proteins.
 With an improved understanding of disease mechanisms and using computational tools to
identify and validate new drug targets, more specific medicines that act on the cause, not
merely the symptoms, of the disease can be developed.
 These highly specific drugs promise to have fewer side effects than many of today’s
medicines.
Microbial genome applications
 The arrival of the complete genome sequences and their potential to provide a greater
insight into the microbial world and its capacities could have broad and far reaching
implications for environment, health, energy and industrial applications.
 For these reasons, in 1994, the US Department of Energy (DOE) initiated the MGP
(Microbial Genome Project) to sequence genomes of bacteria useful in energy
production, environmental cleanup, industrial processing and toxic waste reduction.
 By studying the genetic material of these organisms, scientists can begin to understand
these microbes at a very fundamental level and isolate the genes that give them their
unique abilities to survive under extreme conditions.
Waste cleanup
 Deinococcus radiodurans is known as the world’s toughest bacterium and it is the most
radiation-resistant organism known.
 Scientists are interested in this organism because of its potential usefulness in cleaning up
waste sites that contain radiation and toxic chemicals.
Climate change Studies
 Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil
fuels for energy, are thought to contribute to global climate change.
 Recently, the DOE (Department of Energy, USA) launched a program to decrease
atmospheric carbon dioxide levels.
 One method of doing so is to study the genomes of microbes that use carbon dioxide as
their sole carbon source.
Alternative energy sources
 Scientists are studying the genome of the microbe Chlorobium tepidum, which has an
unusual capacity for generating energy from light.
Biotechnology
 The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have
potential for practical applications in industry and government-funded environmental
remediation.
 These microorganisms thrive in water temperatures above the boiling point and therefore
may provide the DOE, the Department of Defence, and private companies with heat-stable
enzymes suitable for use in industrial processes.
 Other industrially useful microbes include Corynebacterium glutamicum, which is of
high industrial interest as a research object because it is used by the chemical industry for
the biotechnological production of the amino acid lysine.
 The substance is employed as a source of protein in animal nutrition.
 Biotechnologically produced lysine is added to feed concentrates as a source of protein,
and is an alternative to soybeans or meat and bonemeal.
 Lactococcus lactis is one of the most important micro-organisms involved in the dairy
industry.
 Researchers anticipate that understanding the physiology and genetic make-up of this
bacterium will prove invaluable for food manufacturers as well as the pharmaceutical
industry, which is exploring the capacity of L. lactis to serve as a vehicle for delivering
drugs.
Antibiotic resistance
 Scientists have been examining the genome of Enterococcus faecalis, a leading cause of
bacterial infection among hospital patients.
 They have discovered a virulence region made up of a number of antibiotic resistance
genes that may contribute to the bacterium’s transformation from a harmless gut bacterium
to a menacing invader.
 The discovery of the region, known as a pathogenicity island, could provide useful
markers for detecting pathogenic strains and help to establish controls to prevent the
spread of infection in wards.
Forensic analysis of microbes
 Scientists used genomic tools to help distinguish the strain of Bacillus anthracis that
was used in the autumn of 2001 terrorist attack in Florida from closely related anthrax
strains.
The reality of bioweapon creation
 Scientists have recently built poliovirus, the virus that causes poliomyelitis, using entirely artificial means.
 They did this using genomic data available on the Internet and materials from a mail-
order chemical supply.
 The research was financed by the US Department of Defence as part of a biowarfare
response program to prove to the world the reality of bioweapons.
 The researchers also hope their work will discourage officials from ever relaxing
programs of immunisation.
 This project has been met with very mixed feelings.
Evolutionary studies
 The sequencing of genomes from all three domains of life, eukaryota, bacteria and
archaea means that evolutionary studies can be performed in a quest to determine the tree
of life and the last universal common ancestor.
Crop improvement
 Comparative genetics of the plant genomes has shown that the organisation of their genes
has remained more conserved over evolutionary time than was previously believed.
 These findings suggest that information obtained from the model crop systems can be
used to suggest improvements to other food crops.
 At present the complete genomes of Arabidopsis thaliana (thale cress) and Oryza sativa
(rice) are available.
Insect resistance
 Genes from Bacillus thuringiensis that can control a number of serious pests have been
successfully transferred to cotton, maize and potatoes.
 This new ability of the plants to resist insect attack means that the amount of insecticides
being used can be reduced and hence the nutritional quality of the crops is increased.
Improve nutritional quality
 Scientists have recently succeeded in transferring genes into rice to increase levels of
Vitamin A, iron and other micronutrients.
 This work could have a profound impact in reducing occurrences of blindness and
anaemia caused by deficiencies in Vitamin A and iron respectively.
 Scientists have inserted a gene from yeast into the tomato, and the result is a plant whose
fruit stays longer on the vine and has an extended shelf life.
Development of drought-resistant varieties
 Progress has been made in developing cereal varieties that have a greater tolerance for
soil alkalinity, free aluminium and iron toxicities.
 These varieties will allow agriculture to succeed in poorer soil areas, thus adding more
land to the global production base.
 Research is also in progress to produce crop varieties capable of tolerating reduced water
conditions.
Veterinary Science
 Sequencing projects of many farm animals including cows, pigs and sheep are now well
under way in the hope that a better understanding of the biology of these organisms will
have huge impacts for improving the production and health of livestock and ultimately
have benefits for human nutrition.
Comparative Studies
 Analysing and comparing the genetic material of different species is an important method
for studying the functions of genes, the mechanisms of inherited diseases and species
evolution.
 Bioinformatics tools can be used to make comparisons between the numbers, locations
and biochemical functions of genes in different organisms.

Basic principle of computing in Bioinformatics

Bioinformatics combines computer science, mathematics, and biology to analyze and
interpret biological data. The following basic principles of computing are essential in
bioinformatics:

1. Algorithms: Efficient step-by-step procedures for solving computational problems. An
algorithm is a set of instructions that is used to solve a specific problem or perform a
particular task. It is a well-defined procedure that takes some input, processes it, and
produces a corresponding output.

2. Data Structures: Organized ways to store and manage large biological datasets.
3. Programming Languages:
A programming language is a set of rules and instructions that a computer can understand
and execute to perform specific tasks, solve problems, or create software. Languages such as
Python, R, and SQL are used to write bioinformatics software.
4. Databases: Organized collections of biological data, like GenBank or PDB.

5. Computational Complexity: Understanding the resources required to solve computational
problems. Computational complexity in bioinformatics refers to the amount of computational
resources (time, memory, and processing power) required to solve a biological problem or
analyze a dataset.

Types of computational complexity:

1. Time complexity: How long an algorithm takes to complete.
2. Space complexity: How much memory an algorithm requires.
3. Parallel complexity: How well an algorithm can be parallelized.

6. Scalability: Designing algorithms and systems to handle large datasets.
Scalability in bioinformatics refers to the ability of a computational system, algorithm, or
tool to:
1. Handle increasing amounts of data
2. Process larger datasets
3. Support more users or requests
4. Adapt to growing computational demands
5. Maintain performance and efficiency

7. Parallel Computing:
Parallel computing in bioinformatics refers to the use of multiple processing units or cores to
perform computational tasks simultaneously, speeding up the analysis of large biological
datasets.

8. Cloud Computing: Cloud computing is a model for delivering computing services over the
internet, so that remote computing resources can be used to analyze large datasets. Cloud
computing applications in bioinformatics:

1. Genomics: Assembly, annotation, and analysis of genomic data.
2. Proteomics: Analysis of protein structures and functions.
3. Transcriptomics: Analysis of gene expression data.
4. Epigenomics: Analysis of epigenetic modifications.
9. Machine Learning:
Machine learning refers to the development of algorithms and statistical models that enable
computers to:
1. Learn from data
2. Improve their performance
3. Make predictions or decisions
4. Identify patterns and relationships
5. Classify or group data points

10. Visualization: Presenting complex data in a clear and interpretable way.

These principles enable bioinformaticians to:

- Analyze large datasets
- Identify patterns and trends
- Make predictions and build models
- Develop new treatments and therapies

By applying these principles, bioinformatics has revolutionized fields like genomics,
proteomics, and personalized medicine.
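
As a small illustration of how an algorithm, a data structure and time complexity fit together, the sketch below counts k-mers (overlapping substrings of length k) in a DNA string. The function name count_kmers and the example sequence are illustrative assumptions, not part of any particular bioinformatics package. A dictionary-like Counter is the data structure, and a single pass over the sequence keeps the running time roughly proportional to the sequence length for a fixed k.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    # One pass over the sequence: each start position contributes one k-mer.
    for start in range(len(sequence) - k + 1):
        counts[sequence[start:start + k]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call.
    print(count_kmers("ATGATGCAT", 3).most_common(2))
```

Memory use (space complexity) grows with the number of distinct k-mers seen, which is why larger values of k on whole genomes usually call for more compact data structures.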
Basic Data Acquisition and Major Databases: DDBJ, NCBI and EMBL

DDBJ (DNA Data Bank of Japan)

DDBJ (DNA Data Bank of Japan) is a public database that collects, processes, and distributes
nucleotide sequence data. It is one of the three major international nucleotide sequence
databases, along with NCBI (National Center for Biotechnology Information) and EMBL
(European Molecular Biology Laboratory).
DDBJ was established in 1987 as a national DNA database for Japan. It was initially funded by
the Japanese government and was operated by the National Institute of Genetics.
DDBJ consists of several databases, including:
1. DDBJ: The main database for nucleotide sequences.
2. DAD: DDBJ Amino Acid Database (protein sequences translated from DDBJ entries).
3. GEA: Genomic Expression Archive.
4. MGED: Microarray Gene Expression Data.

Access and Usage


DDBJ provides various tools and interfaces for accessing and analyzing data, including:

1. Web Interface: A user-friendly interface for searching and retrieving data.
2. FTP: File Transfer Protocol for bulk data download.
3. API: Application Programming Interface for programmatic access.
DDBJ plays a crucial role in:
1. Genomics Research: Providing data for genome assembly, annotation, and analysis.
Data Acquisition
1. Submission: Researchers submit sequence data to DDBJ through online forms or file uploads.
2. Direct Submission: Sequencing centers and researchers can directly submit data to DDBJ.
3. Data Exchange: DDBJ exchanges data with other databases like NCBI and EMBL.
4. Literature Scanning: DDBJ staff scan scientific literature for new sequence data.

EMBL (European Molecular Biology Laboratory)

EMBL is a renowned research organization and database provider in the field of molecular
biology. It was established in 1974 and is headquartered in Heidelberg, Germany. It maintains
one of the three major international nucleotide sequence databases, alongside those of NCBI
and DDBJ.
EMBL conducts cutting-edge research in:
1. Genomics: Genome sequencing, annotation, and analysis.
2. Proteomics: Protein structure, function, and interaction analysis.
3. Transcriptomics: Gene expression analysis and regulation.
4. Bioinformatics: Development of computational tools and databases.

EMBL provides several databases, including:

1. EMBL-Bank: A comprehensive nucleotide sequence database.
2. UniProt: A protein sequence database.
3. Ensembl: A genome annotation and analysis platform.
4. PDBe: A protein structure database.
EMBL plays a crucial role in:

1. Advancing Molecular Biology Research: Providing data, tools, and resources for
researchers.
2. Supporting Genomics and Proteomics: Enabling genome assembly, annotation, and
analysis.
3. Promoting Open Science: Making data and resources openly available to the scientific
community.
By providing a comprehensive collection of nucleotide sequence data and cutting-edge
research, EMBL supports advancements in molecular biology, genomics, and proteomics.
DATA ACQUISITION:
1. Submission: Researchers submit sequence data to EMBL through online forms or file
uploads.
2. Automated Pipeline: EMBL uses an automated pipeline to process and annotate submitted
data.
3. Data Exchange: EMBL exchanges data with other databases like NCBI and DDBJ.
NCBI (National Center for Biotechnology Information)
NCBI is a leading biomedical research organization and database provider, part of the
National Library of Medicine (NLM) at the National Institutes of Health (NIH). The NCBI is
located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by
US Congressman Claude Pepper.

NCBI provides numerous databases and tools, including:

1. GenBank: A comprehensive nucleotide sequence database.
2. PubMed: A biomedical literature database.
3. Protein: A protein sequence database.
4. BLAST: A sequence alignment and comparison tool.

NCBI conducts research in:

1. Genomics: Genome assembly, annotation, and analysis.
2. Proteomics: Protein structure, function, and interaction analysis.
3. Transcriptomics: Gene expression analysis and regulation.
4. Bioinformatics: Development of computational tools and databases.

NCBI plays a crucial role in:

1. Advancing Biomedical Research: Providing data, tools, and resources for researchers.
2. Supporting Genomics and Proteomics: Enabling genome assembly, annotation, and
analysis.
3. Promoting Open Science: Making data and resources openly available to the scientific
community.

Data acquisition
1. Submission: Researchers submit sequence data to NCBI through online forms or file
uploads.
2. GenBank Direct Submission: Sequencing centers and researchers can directly submit data
to GenBank.
3. Data Exchange: NCBI exchanges data with other databases like DDBJ and EMBL.
4. Literature Scanning: NCBI staff scan scientific literature for new sequence data.
5. Automated Pipeline: NCBI uses an automated pipeline to process and annotate submitted
data.

Common features among these databases:

1. Submission guidelines: Each database has guidelines for submitting data.
2. Data validation: Databases validate submitted data for quality and accuracy.
3. Data annotation: Databases annotate submitted data with relevant information.
4. Data sharing: Databases share data with each other to ensure comprehensive coverage.

These databases play a crucial role in collecting, processing, and disseminating biological
data, enabling researchers to access and analyze data for various applications.
BIOINFORMATICS TOOLS
Multiple Sequence Alignment

ClustalW

Retrieval of protein sequence from a database in FASTA


Introduction to FASTA Format

FASTA is a commonly used text format in bioinformatics for representing nucleotide or
protein sequences. It originated with the FASTA sequence-search software developed by
David J. Lipman and William R. Pearson in the 1980s and offers a simple and efficient way
to store and search for sequences.

Each record in a FASTA file consists of two parts: a header line followed by one or more
sequence lines. The header line begins with a greater-than symbol ">" followed by a unique
identifier for the sequence and, optionally, a description. The sequence lines contain the
actual sequence of nucleotides or amino acids.
It is recommended that all lines of text be shorter than 80 characters in length.

An example sequence in FASTA format is:

>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
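
The header-plus-sequence layout described above is simple enough to read with a few lines of code. The following is a minimal sketch, assuming the whole file fits in memory; the function name parse_fasta is a hypothetical helper written for these notes, not a library call.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined so wrapped records become one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 toy record\nMNSERSDV\nTLYQPF\n>seq2\nACGT\n"
    for name, sequence in parse_fasta(example):
        print(name, len(sequence))
```

Because sequence lines are concatenated, it does not matter whether a record was wrapped at 60, 70 or 80 characters.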
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic
acid codes.

The nucleic acid codes are:

A --> adenosine            M --> A C (amino)
C --> cytidine             S --> G C (strong)
G --> guanine              W --> A T (weak)
T --> thymidine            B --> G T C
U --> uridine              D --> G A T
R --> G A (purine)         H --> A C T
Y --> T C (pyrimidine)     V --> G C A
K --> G T (keto)           N --> A G C T (any)
                           -  gap of indeterminate length
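
The ambiguity codes in this table can be stored as a lookup from each code to the bases it stands for. The sketch below is one possible encoding; the names IUPAC_NUCLEOTIDES, code_matches and is_valid_nucleotide_sequence are illustrative assumptions rather than functions from an existing package.

```python
# Each IUPAC code maps to the set of bases it can represent (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def code_matches(code, base):
    """True if the concrete base is one of those the IUPAC code allows."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


def is_valid_nucleotide_sequence(sequence):
    """True if every character is an IUPAC nucleotide code or a gap ('-')."""
    allowed = set(IUPAC_NUCLEOTIDES) | {"-"}
    return all(character in allowed for character in sequence.upper())


if __name__ == "__main__":
    print(code_matches("R", "G"))                  # R is a purine: G or A
    print(is_valid_nucleotide_sequence("ACGTN-"))
```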
The accepted amino acid codes are:

A  ALA  alanine                    P  PRO  proline
B  ASX  aspartate or asparagine    Q  GLN  glutamine
C  CYS  cysteine                   R  ARG  arginine
D  ASP  aspartate                  S  SER  serine
E  GLU  glutamate                  T  THR  threonine
F  PHE  phenylalanine              U       selenocysteine
G  GLY  glycine                    V  VAL  valine
H  HIS  histidine                  W  TRP  tryptophan
I  ILE  isoleucine                 Y  TYR  tyrosine
K  LYS  lysine                     Z  GLX  glutamate or glutamine
L  LEU  leucine                    X       any
M  MET  methionine                 *       translation stop
N  ASN  asparagine                 -       gap of indeterminate length

To retrieve a protein sequence from a database using FASTA, you can follow these steps:

1. Identify the protein database you want to search, such as UniProt or NCBI's Protein
database.

2. Go to the website of the database you selected and access their search interface.

3. Enter your search term, which can be the protein name, accession number, or any other
relevant ID. Submit the search.

4. Look for the search results and identify the specific protein sequence you want to retrieve.
5. On the result page, you may find an option to download or view the sequence in different
formats. Look for the FASTA format, which often has a link or button to retrieve the
sequence.

6. Click on the FASTA link or button, and it will likely open a new page or display the
sequence directly on the screen.

7. Copy the FASTA-formatted sequence, which usually starts with the ">" symbol followed
by the sequence identifier and then the sequence itself.

Now you have successfully retrieved the protein sequence in FASTA format from the
database. You can save it in a text file or use it directly for further analysis or processing in
bioinformatics workflows.
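
The same retrieval can be scripted instead of clicked through. The sketch below assumes the NCBI E-utilities EFetch endpoint and uses only the Python standard library; the helper names build_efetch_url and fetch_fasta are hypothetical, and the accession is the one from the FASTA example in these notes. Running fetch_fasta needs network access, and NCBI asks that automated requests be kept infrequent.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="protein"):
    """Build an EFetch URL that asks for one record in plain-text FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, db="protein", timeout=30):
    """Download the record and return it as a string (requires network access)."""
    with urlopen(build_efetch_url(accession, db), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_fasta(...) to download the record.
    print(build_efetch_url("YP_001864424.1"))
```

For nucleotide records, such as the actin mRNA accessions used in the BLAST example, the db argument would be "nuccore" instead of "protein".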

Sequence Alignment

Sequence alignment is a fundamental bioinformatics technique that involves comparing two
or more biological sequences to identify similarities and differences between them. The goal
of sequence alignment is to identify the conserved regions, gaps, and mismatches between
sequences, which can provide insights into the evolutionary history and functional
relationships of proteins or nucleic acids.

There are two main types of sequence alignment: pairwise sequence alignment and multiple
sequence alignment.

1. Pairwise sequence alignment: This type of alignment compares two sequences, for example
a gene or protein and a homologous sequence from a different organism. Its purpose is to
identify the regions that the two sequences share and the positions at which they differ.
This type of alignment can be useful for identifying functional motifs within a protein or
identifying mutations in a gene that may contribute to the development of disease.

2. Multiple sequence alignment: This type of alignment involves comparing three or more
sequences simultaneously. The purpose of multiple sequence alignment is to identify
conserved regions, gaps, and mutations across multiple sequences. This type of alignment is
useful for studying the evolutionary history of a gene or protein family and for identifying
residues that are essential for protein function.
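
Aligning just two sequences end to end is usually done with dynamic programming. The sketch below computes the optimal global alignment score with the Needleman-Wunsch method, which these notes do not name; it is introduced here only as an illustration. The scoring values (match +1, mismatch -1, gap -1) and the function name global_alignment_score are assumptions chosen for simplicity, whereas real tools use substitution matrices and separate gap-opening and gap-extension penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the scoring table are kept, so memory grows with the length of one sequence while running time grows with the product of both lengths.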

In general, multiple sequence alignment is more informative and more challenging to
perform than aligning just two sequences. Multiple sequence alignment allows for the
identification of more subtle structural and functional similarities and differences between
sequences. However, it can also be more difficult to align multiple sequences due to their
variability and length. Therefore, various software tools and algorithms have been developed
to perform multiple sequence alignment.
Alignment of protein/nucleotide sequences (BLAST)
The Basic Local Alignment Search Tool (BLAST) is a widely used bioinformatics algorithm
for comparing and aligning protein and nucleotide sequences. BLAST allows researchers to
search a query sequence against a database of sequences to find potential matches and
identify regions of similarity.

Here is a detailed overview of how the alignment of protein and nucleotide sequences using
BLAST works. As an example, obtain FASTA files with accession numbers NM_007393 (Mus
musculus actin, beta (ACTB)); NM_205518 (Gallus gallus actin, beta (ACTB)); and NM_001101
(Homo sapiens actin beta (ACTB)).

1. Selection of a sequence database: First, you need to select a suitable sequence database
for the search. Popular choices include the NCBI protein database or the nucleotide database.

2. Input sequence: Prepare the query sequence that you want to align against the selected
database. It can be either a protein or nucleotide sequence. Make sure to have the sequence in
a suitable format, such as FASTA.

3. BLAST program selection: Choose the appropriate BLAST program based on the type of
sequence(s) you are working with. For protein sequences, the most commonly used program
is BLASTP. For nucleotide sequences, BLASTN or BLASTX are often used.

4. Running the BLAST search: Open the BLAST website or use command-line tools to
start the BLAST search. Enter the query sequence in the provided field and select the desired
database.

5. Scoring scheme: BLAST uses a scoring scheme (e.g., a BLOSUM substitution matrix for
proteins, or match and mismatch scores for nucleotides) to calculate similarity scores
between the query sequence and each database sequence.
6. Searching for alignments: BLAST processes the scoring information to identify regions
of local similarity between the query sequence and sequences in the database. The algorithm
starts by identifying short, significant matches, also known as seed matches, and then extends
them to identify longer alignments.

7. Scoring and statistical significance: The alignments produced by BLAST are assigned
scores based on the similarity of the aligned residues. Additionally, statistical calculations are
performed to estimate the significance of the alignments. This significance is often expressed
as an E-value, which represents the expected number of matches with similar scores that
could occur randomly.

8. Displaying the results: Once the search is complete, BLAST provides a list of alignments
ranked by their E-values. The output typically includes information such as the alignment
score, E-value, sequence identifiers, and aligned regions of the query and database
sequences.

9. Interpreting the results: Researchers examine the BLAST results to determine the
biological significance of the alignments. They assess the quality of the matches and the
sequence similarity, and consider factors like the alignment length and alignment coverage.
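
The E-value in step 7 follows the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m the query length and n the database length. The sketch below evaluates that formula; the function names are hypothetical, and the default values lambda = 0.267 and K = 0.041 are the figures commonly quoted for gapped BLOSUM62 searches, used here as illustrative assumptions since the real values depend on the scoring system.

```python
import math


def bit_score(raw_score, lam=0.267, k=0.041):
    """Convert a raw alignment score into a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_length, database_length, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 1,000,000 residues.
    print(e_value(50, 300, 1_000_000))
    print(e_value(100, 300, 1_000_000))
```

A higher score gives an exponentially smaller E-value, while a larger database raises it in direct proportion, which is why the same alignment can look less significant when searched against a bigger database.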

BLAST is a powerful tool for aligning and comparing protein and nucleotide sequences. It
enables researchers to identify similar sequences, infer evolutionary relationships, predict
protein structures, and gain insights into the function and characteristics of genes and
proteins.
Additional Information
BLAST (Basic Local Alignment Search Tool) is an efficient and widely used local
sequence alignment tool to compare nucleotide or protein sequences. It can identify regions
of similarity between different sequences, which is useful for detecting homologous
sequences, identifying conserved domains, and predicting protein structures.
BLAST uses a heuristic algorithm to rapidly compare sequences, making it a valuable tool
for large-scale sequence analysis. It has two different types of algorithms: the original
BLAST algorithm, for detecting closely related sequences, and the PSI-BLAST algorithm,
for detecting sequences that are distantly related.
The original BLAST algorithm lists all possible words of a certain length (k-mers) from the
query sequence. It then searches a sequence database for matches to these k-mers and
extends each match in both directions to build a longer alignment. The final score is
calculated based on the number and quality of the matched positions.
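
The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified toy, not BLAST itself: it looks for exact k-mer matches between one query and one subject and extends them only while the residues are identical, whereas real BLAST scores each extension with a substitution matrix and stops when the score drops too far. The function names and the example sequences are assumptions made for illustration.

```python
def kmers(sequence, k):
    """Yield (position, word) for every overlapping word of length k."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


def seed_and_extend(query, subject, k=3):
    """Return ungapped exact matches as (query_start, subject_start, length)."""
    # Index the subject so each query word can be looked up directly.
    index = {}
    for position, word in kmers(subject, k):
        index.setdefault(word, []).append(position)

    hits = set()
    for q_start, word in kmers(query, k):
        for s_start in index.get(word, []):
            q_left, s_left = q_start, s_start
            # Extend to the left while the residues stay identical.
            while q_left > 0 and s_left > 0 and query[q_left - 1] == subject[s_left - 1]:
                q_left -= 1
                s_left -= 1
            q_right, s_right = q_start + k, s_start + k
            # Extend to the right while the residues stay identical.
            while (q_right < len(query) and s_right < len(subject)
                   and query[q_right] == subject[s_right]):
                q_right += 1
                s_right += 1
            hits.add((q_left, s_left, q_right - q_left))
    # Longest matches first; ties are ordered by position for reproducibility.
    return sorted(hits, key=lambda hit: (-hit[2], hit[0], hit[1]))


if __name__ == "__main__":
    print(seed_and_extend("TTACGTACGG", "GGACGTACCC")[0])
```

Several seeds that fall inside the same matching region extend to the same hit, so collecting hits in a set removes the duplicates.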
PSI-BLAST is an iterative version of the BLAST algorithm that builds a position-specific
scoring matrix based on the initial alignment and then uses this matrix to search for
additional sequences.
BLAST is widely used in biological research due to its user-friendly interface and ability to
handle large-scale sequence analysis. It is accessible through the NCBI website; standalone
versions are also available for download.
In addition to its basic functionality, BLAST offers various options and parameters, allowing
users to customize their searches based on specific criteria. For example, users can choose
different scoring matrices to match their query sequences to the database sequences or
specify the e-value cutoff to control the search stringency.
Using BLAST to search for sequences similar to a specific region of interest within a more
extensive sequence is also possible, which helps identify conserved domains or motifs. Its
ability to rapidly identify regions of similarity between sequences has been critical in
advancing our understanding of the genetic basis of life.
While BLAST has some limitations, such as the reduced sensitivity of a standard (non-iterative)
search for distantly related sequences, it remains an indispensable tool for many biological
research projects.

Alignment of protein/nucleotide sequences (CLUSTALW).

Sequence alignment is a fundamental task in bioinformatics that involves comparing two or
more biological sequences to identify the similarities and differences between them. Various
algorithms and tools have been developed to perform sequence alignment, with CLUSTALW
being one of the most widely used.

CLUSTALW is a software package for multiple sequence alignment that provides a user-
friendly interface for aligning nucleotide or protein sequences. It employs a progressive
alignment approach that initially generates a guide tree to group sequences into clusters
based on their similarity. Once grouped, the software aligns sequences within each cluster
and merges them iteratively until the whole dataset is aligned.

The following are the steps involved in performing sequence alignment using CLUSTALW:

1. Input sequences: In order to align sequences using CLUSTALW, you need to provide a
set of nucleotide or protein sequences in a common format, such as FASTA. These sequences
can be obtained from various online databases or generated from your own experiments.

2. Selection of parameters: CLUSTALW allows users to specify a range of parameters for
the alignment process, including gap penalties, substitution matrices, and output format. The
default settings are suitable for most applications, but users can adjust these parameters
to optimize the alignment based on their specific biological question.

3. Generate guide tree: CLUSTALW uses a guide tree to group similar sequences together.
This tree is constructed using the neighbor-joining method and is based on the pairwise
similarities between sequences. The guide tree provides a roadmap for the progressive
alignment process.

4. Progressive alignment: After the guide tree is constructed, CLUSTALW proceeds to
align sequences in a progressive manner, starting from the most similar pair and extending to
the more divergent sequences. The software employs a dynamic programming
algorithm to align sequences and place gaps (insertions and deletions).

5. Iterative alignment: Once all sequences have been aligned, the software can iteratively
improve the alignment by realigning sequences and adjusting gaps until a consensus
alignment is obtained.

6. Output: The final alignment is saved in a common format such as Clustal or FASTA and
can be used in downstream analyses such as phylogenetic tree construction, molecular
modeling, or functional analysis.
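
The dynamic programming step mentioned in the progressive alignment stage can be sketched for a single pair of sequences. The example below is a plain Needleman-Wunsch global alignment, given only as an illustration: the simple match/mismatch scores and the linear gap penalty are assumptions, whereas CLUSTALW uses substitution matrices, separate gap opening and extension penalties, and aligns profiles as well as single sequences.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("GATTACA", "GATACA")
    print(s)
    print(x)
    print(y)
```

In a progressive aligner this pairwise step is repeated along the guide tree, each time aligning a sequence or an existing alignment to another.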

Overall, CLUSTALW is an easy-to-use and powerful tool for sequence alignment that can be
used for various applications in bioinformatics research. Its ability to handle large datasets
and optimize alignments based on specific parameters make it a valuable resource for
studying the evolution and function of biological sequences.

Additional information

ClustalW is a widely used multiple sequence alignment tool with a progressive approach. It
works by creating a guide tree that represents the evolutionary relationships between the
sequences and then aligning the sequences based on the guide tree.

ClustalW allows users to adjust the alignment parameters to optimize the alignment for their
specific research question. It is handy for aligning large numbers of sequences and valuable in
many areas of biological research, including phylogenetics, functional genomics, and drug
discovery.

ClustalW also offers a variety of output formats, including Clustal, NEXUS, and PHYLIP
formats, making it easy to integrate with other bioinformatics tools. Additionally, ClustalW has
several advanced features, such as the ability to perform pairwise alignments, identify conserved
regions of the alignment, and generate phylogenetic trees.

Users can also improve the performance of ClustalW on large datasets, for example by
running the program on a high-performance computing cluster or by switching to its successor,
Clustal Omega, which scales to much larger sets of sequences.

ClustalW has an intuitive interface, customizable parameters, and the ability to handle large
datasets, making it a valuable resource for many different types of biological research.

In addition to its advanced features, ClustalW allows users to incorporate external information,
such as secondary structure predictions or phylogenetic information from other sources,
improving the accuracy of the alignment by providing additional context and constraints for the
alignment process.

One potential restriction of ClustalW is that it can be computationally intensive, particularly
when aligning large numbers of sequences or when using more advanced features. Another
potential limitation of ClustalW is that it assumes that the aligned sequences have a similar
overall structure, which may not always be the case. Other alignment tools like MUSCLE or
T-Coffee may be more appropriate in such situations.
Despite these limitations, ClustalW remains a popular choice for multiple sequence alignment
due to its flexibility, ease of use, and ability to handle large datasets. It has been used in many
landmark studies across various fields of biology and continues to be a valuable tool for
researchers worldwide.

Computing physicochemical parameters of proteins using PROTPARAM

Protparam is a web-based tool available through the ExPASy Bioinformatics Resource Portal. It is
used for the prediction and analysis of various physical and chemical properties of protein
sequences. Protparam uses algorithms and established models to calculate and provide
information about features such as molecular weight, isoelectric point (pI), amino acid
composition, charge, instability index, aliphatic index, and other parameters. These properties
can be valuable in understanding the characteristics and behavior of proteins, aiding in protein
engineering, drug design, and bioinformatics research. Protparam is widely used by researchers
and scientists working in the field of protein analysis and biochemistry.

To analyze protein physicochemical properties using Protparam, the following steps are usually
followed:

1. Retrieval of protein sequence: The first step is to obtain the amino acid sequence of the protein
of interest. This can be done by either extracting the sequence from a protein database or by
generating a sequence from an experimentally determined protein structure.

2. Accessing the Protparam tool: The Protparam tool is available online as part of the Expasy
Bioinformatics Resource Portal. Access the Protparam page by either searching for "Protparam"
in a search engine or directly visiting the Expasy website.

3. Input protein sequence: Once on the Protparam page, you will see a text box labeled "Enter
your protein sequence." Copy and paste your protein sequence into this box. Make sure that the
sequence is in the correct single-letter amino acid code format.

4. Choosing parameter categories: Protparam provides various categories of physicochemical
properties to analyze. Depending on your research interest, you can choose the desired parameter
categories. Some commonly used parameter categories include molecular weight, theoretical
isoelectric point, amino acid composition, and instability index.
5. Running the analysis: After entering the protein sequence and selecting the parameter
categories, click on the "Submit" or "Analyze" button to run the analysis. The Protparam tool
will process the sequence and calculate the selected physicochemical properties of the protein.

6. Interpreting the results: Protparam will generate a report displaying the analyzed properties of
the protein. This report usually includes the calculated values for each selected parameter
category. Carefully review the results and interpret the protein's physicochemical properties
based on the provided values.

7. Further analysis: Depending on the research goals, the obtained data from Protparam analysis
can be further analyzed and interpreted using statistical methods or compared with other proteins
or databases to gain insights into the protein's function, stability, or other relevant characteristics.

Note: Protparam is a widely used tool for analyzing protein physicochemical properties.
However, there are other alternative tools available that provide similar functionality.
Researchers often use multiple tools or compare results from various resources for
comprehensive protein analysis.

Protparam is a tool used to predict various physical and chemical properties of a protein
sequence. Some of the properties that can be analyzed by protparam include:

Physical properties:

1. Molecular weight: The total size of a protein molecule, which is the sum of the atomic masses
of all atoms within the protein.

2. Isoelectric point (pI): The pH at which a protein carries no net electrical charge. It can provide
information about the protein's behavior under different pH conditions.

3. Instability index: An estimate of the stability of a protein based on the occurrence of
certain dipeptides in its sequence. A value below 40 predicts a stable protein, while a value
above 40 predicts an unstable one.

4. Aliphatic index: A measure of the relative volume occupied by the aliphatic side chains
(alanine, valine, isoleucine, and leucine) in the protein sequence. A higher value is regarded
as a positive factor for the protein's thermostability.

5. Grand average of hydropathicity (GRAVY): A parameter representing the hydrophobicity or
hydrophilicity of a protein. Positive values indicate hydrophobicity, while negative values
indicate hydrophilicity.
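
Two of the physical properties listed above are simple enough to compute by hand, and the following Python sketch shows how. It is an illustration only and not the Protparam source code: the function names are assumptions, GRAVY is taken as the mean Kyte-Doolittle hydropathy value, and the aliphatic index uses the commonly cited formula X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percent of each residue.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


if __name__ == "__main__":
    peptide = "AVILKKDE"  # made-up sequence, for illustration only
    print(round(gravy(peptide), 3))
    print(round(aliphatic_index(peptide), 1))
```

A sequence containing a non-standard letter raises a KeyError in gravy, which is a deliberate choice here so that unexpected symbols are noticed rather than silently skipped.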
Chemical properties:

1. Amino acid composition: Protparam provides the amino acid composition of the protein
sequence, including the frequency and percentage of each amino acid present.

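2. Charge: It reports the total number of negatively charged residues (Asp + Glu) and
positively charged residues (Arg + Lys). Together with the pI, these counts indicate the likely
net charge of the protein at a given pH, which can be useful in predicting its interactions with
other molecules.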

3. Half-life: Estimation of the half-life of a protein based on its N-terminal amino acids. This
property predicts the stability of a protein within a cellular environment.

4. Extinction coefficient: A measure of how strongly a protein absorbs light at a specific
wavelength (280 nm in Protparam), which is useful in determining the concentration of the
protein in a solution.

These properties, among others provided by protparam, help researchers gain insights into the
protein's structure, stability, hydrophobicity, and electrochemical properties.
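
Some of the chemical properties above can be approximated directly from the sequence. The sketch below is an illustration under stated assumptions rather than the Protparam implementation: the function names are invented for the example, the extinction coefficient at 280 nm uses the commonly quoted molar values of 5500 for tryptophan, 1490 for tyrosine and 125 for each cystine, and every pair of cysteines is assumed to form a cystine unless told otherwise.

```python
from collections import Counter


def amino_acid_composition(seq):
    """Return {residue: (count, percent)} for every residue in the sequence."""
    counts = Counter(seq)
    return {aa: (n, 100.0 * n / len(seq)) for aa, n in sorted(counts.items())}


def charged_residue_counts(seq):
    """Return (negative, positive): Asp + Glu and Arg + Lys counts."""
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


def extinction_coefficient_280(seq, cystines_formed=True):
    """Estimate the molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    n_trp = seq.count("W")
    n_tyr = seq.count("Y")
    # Assumption: all cysteines pair up into cystines when cystines_formed is True.
    n_cystine = seq.count("C") // 2 if cystines_formed else 0
    return 5500 * n_trp + 1490 * n_tyr + 125 * n_cystine


if __name__ == "__main__":
    peptide = "MWYCCDEKR"  # made-up sequence, for illustration only
    print(amino_acid_composition(peptide))
    print(charged_residue_counts(peptide))
    print(extinction_coefficient_280(peptide))
```

With an estimated extinction coefficient, a measured absorbance at 280 nm can be converted to a concentration through the Beer-Lambert law (A = ε c l).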

PREDICTING ELEMENTS OF SECONDARY STRUCTURE OF PROTEINS (E.G. PSSP).

The Protein Secondary Structure Prediction (PSSP) method utilizes various algorithms and
statistical models to predict the secondary structure elements of proteins. The primary approach
employed by PSSP is based on analyzing the sequence of amino acids in the protein and
extracting information from known structures in protein databases.

There are several steps involved in the PSSP prediction process:

1. Data collection: PSSP gathers known protein structures from databases such as the Protein
Data Bank (PDB) that have experimentally determined secondary structure annotations.

2. Feature extraction: Relevant features are extracted from the protein sequence, such as amino
acid composition, physicochemical properties, and sequence patterns. These features provide
input for the prediction algorithms.

3. Algorithm selection: PSSP employs different algorithms for secondary structure prediction,
each with its own strengths and weaknesses. Some common algorithms used include neural
networks, support vector machines, and hidden Markov models.

4. Model training: The selected algorithm is trained using the extracted features and the known
protein structures from the database. This step involves optimizing the model parameters and
fine-tuning the algorithm to improve accuracy.
5. Prediction: Once the model is trained, it can be applied to predict the secondary structure
elements of a target protein. The protein sequence is processed using the trained model, and each
amino acid is assigned a secondary structure label (e.g., helix, strand, coil).

It's important to note that PSSP predictions are not always completely accurate, especially for
proteins with unique or novel structural features. However, by utilizing large datasets and
advanced algorithms, PSSP can provide reasonably reliable estimations of protein secondary
structure elements.
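
The feature extraction step can be illustrated with a sliding-window encoding, which is a common way of presenting a sequence to a neural network or support vector machine. The sketch below is an assumption-laden example rather than the procedure of any particular PSSP tool: the window size, the one-hot encoding with an extra padding slot, and the function name window_features are all chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_features(seq, window=5):
    """Encode each residue as a one-hot vector of its sequence neighbourhood.

    Every window position contributes 21 values: 20 amino acids plus one
    slot that marks padding beyond the sequence ends or an unknown symbol.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    slot = len(AMINO_ACIDS) + 1
    features = []
    for center in range(len(seq)):
        vector = []
        for pos in range(center - half, center + half + 1):
            one_hot = [0] * slot
            if 0 <= pos < len(seq) and seq[pos] in AMINO_ACIDS:
                one_hot[AMINO_ACIDS.index(seq[pos])] = 1
            else:
                one_hot[-1] = 1  # padding or unknown residue
            vector.extend(one_hot)
        features.append(vector)
    return features


if __name__ == "__main__":
    encoded = window_features("ACDEFG", window=5)
    print(len(encoded), len(encoded[0]))
```

Each feature vector would be paired with the known label of the central residue (helix, strand, or coil) during model training, and the trained model would then label residues of a new sequence one window at a time.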

Examples of PSSP tools are:

 PSIPRED
 JPRED
 SSPRO
 PORTER

RETRIEVAL, UNDERSTANDING AND PREDICTING 3D STRUCTURE OF PROTEIN FROM SEQUENCE; PTMS
(E.G NETPHOS ETC.)

Tertiary Structure Prediction:

Introduction

Protein tertiary structure prediction aims to determine the three-dimensional arrangement of
a protein's atoms in space, which is crucial for understanding protein function, interactions,
and design.

Methods

1. Homology Modeling

As the name suggests, homology modeling predicts protein structures based on sequence
homology with known structures. It is also known as comparative modeling. The principle
behind it is that if two proteins share a high enough sequence similarity, they are likely to have
very similar three-dimensional structures. If one of the protein sequences has a known structure,
then the structure can be copied to the unknown protein with a high degree of confidence.
Homology modeling produces an all-atom model based on alignment with template proteins.

Steps:

1. Template selection
The first step in protein structural modeling is to select appropriate structural
templates. This forms the foundation for the rest of the modeling process. The template
selection involves searching the Protein Data Bank (PDB) for homologous proteins
with determined structures. The search can be performed using a heuristic pairwise
alignment search program such as BLAST or FASTA.

2. Sequence alignment
Once the structure with the highest sequence similarity is identified as a template,
the full-length sequences of the template and target proteins need to be realigned
using refined alignment algorithms to obtain optimal alignment. This realignment
is the most critical step in homology modeling, which directly affects the quality
of the final model. This is because incorrect alignment at this stage leads to
incorrect designation of homologous residues and therefore to incorrect structural
models. Errors made in the alignment step cannot be corrected in the following
modeling steps. Therefore, the best possible multiple alignment algorithms, such
as Praline and T-Coffee, should be used.

3. Model building
Construct a 3D model of the target protein based on the aligned sequence and
template structures.

4. Model refinement

The final homology model has to be evaluated to make sure that the structural
features of the model are consistent with the physicochemical rules. Any remaining structural
irregularities can be corrected by applying an energy minimization procedure to the entire
model, which moves the atoms in such a way that the overall conformation has the lowest
energy potential.
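
Template selection and alignment quality are often summarized by the percent sequence identity between target and template. The sketch below shows one way to compute it from an existing pairwise alignment. It is an illustration only: the function names, the hypothetical template identifiers in the usage example, and the choice to divide by the number of gap-free columns are assumptions, and other definitions of identity use a different denominator.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned_cols = 0
    identical = 0
    for t, s in zip(aligned_target, aligned_template):
        if t == "-" or s == "-":
            continue
        aligned_cols += 1
        if t == s:
            identical += 1
    return 100.0 * identical / aligned_cols if aligned_cols else 0.0


def pick_template(alignments):
    """Choose the template with the highest identity to the target.

    `alignments` maps a template identifier to a tuple
    (aligned_target, aligned_template).
    """
    return max(alignments, key=lambda tid: percent_identity(*alignments[tid]))


if __name__ == "__main__":
    # Hypothetical template identifiers and toy alignments, for illustration only.
    candidates = {
        "template_A": ("ACDE-FG", "ACDFWFG"),
        "template_B": ("ACDEFG", "ACDEFG"),
    }
    for name, pair in candidates.items():
        print(name, round(percent_identity(*pair), 1))
    print(pick_template(candidates))
```

In practice the template with the highest identity is not always the best choice, since structure resolution and alignment coverage also matter, so this ranking is only a first filter.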

2. Threading

There are only a small number of protein folds available (<1,000), compared to millions of
protein sequences. This means that protein structures tend to be more conserved than protein
sequences. Consequently, many proteins can share a similar fold even in the absence of sequence
similarities.
This allowed the development of computational methods to predict protein structures beyond
sequence similarities. Determining whether a protein sequence adopts a known three-dimensional
structural fold relies on threading and fold recognition methods.

By definition, threading or structural fold recognition predicts the structural fold of
an unknown protein sequence by fitting the sequence into a structural database and
selecting the best-fitting fold. The comparison emphasizes matching of secondary
structures, which are most evolutionarily conserved. Therefore, this approach can
identify structurally similar proteins even without detectable sequence similarity.

- Steps:

1. Template selection

1. Sequence identity: Templates with high sequence identity (>30%) are preferred.

2. Structural similarity: Templates with similar folds or domains are suitable.

2. Sequence alignment

Align the query sequence with the selected template structures.

4. Scoring and ranking

Use functions like Z-score, TM-score, or SP-score to evaluate alignment quality.
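
The Z-score used for ranking can be written out explicitly: each raw sequence-to-fold score is compared with the mean and standard deviation of the scores over all folds tried. The sketch below is a minimal illustration with made-up fold names and scores, assuming that a higher raw score means a better fit. Some threading programs use energy-like scores where lower is better, in which case the sign would be reversed.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores {fold_id: score} into Z-scores."""
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All folds scored the same, so no fold stands out.
        return {fold: 0.0 for fold in raw_scores}
    return {fold: (score - mu) / sigma for fold, score in raw_scores.items()}


def rank_folds(raw_scores):
    """Return (fold_id, z_score) pairs sorted from best to worst."""
    return sorted(z_scores(raw_scores).items(),
                  key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up fold names and scores, for illustration only.
    scores = {"fold_A": 10.0, "fold_B": 20.0, "fold_C": 30.0, "fold_D": 95.0}
    for fold, z in rank_folds(scores):
        print(fold, round(z, 2))
```

A top-ranked fold whose Z-score is well separated from the rest is a more convincing match than one that is only marginally ahead.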

Ab Initio.

This method predicts structure from sequence alone, without a template, using physical and
chemical principles to simulate protein folding. Both homology and fold recognition approaches
rely on the availability of template structures in the database to achieve predictions. If no correct
structures exist in the database, the methods fail. However, proteins in nature fold on their own
without checking what the structures of their homologs are in databases. Obviously, there is
some information in the sequences that provides instruction for the proteins to “find” their native
structures. Early biophysical studies have shown that most proteins fold spontaneously into a
stable structure that has near minimum energy. This structural state is called the native state.
This folding process appears to be nonrandom; however, its mechanism is poorly understood.
The limited knowledge of protein folding forms the basis of ab initio prediction. As the name
suggests, the ab initio prediction method attempts to produce all-atom protein models based on
sequence information alone without the aid of known protein structures. The perceived
advantage of this method is that predictions are not restricted by known folds and that novel
protein folds can be identified. However, because the physicochemical laws governing protein
folding are not yet well understood, the energy functions used in the ab initio prediction are at
present rather inaccurate. The folding problem remains one of the greatest challenges in
bioinformatics today.

Examples.

ROSETTA, I-TASSER, AlphaFold.

Rosetta (www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php) is a web server that predicts protein
three-dimensional conformations using the ab initio method. It in fact relies on a
“mini-threading” method. The method first breaks down the query sequence into many very short
segments (three to nine residues) and predicts the secondary structure of the small segments
using a hidden Markov model–based program, HMMSTR. The segments with assigned
secondary structures are subsequently assembled into a three-dimensional configuration.
Through random combinations of the fragments, a large number of models are built and their
overall energy potentials calculated. The conformation with the lowest global free energy is
chosen as the best model.

PTMS (E.G NETPHOS ETC.)

Post-Translational Modifications (PTMs) are chemical changes that occur to proteins after they
have been synthesized. These modifications play crucial roles in regulating protein function,
activity, localization, and interaction with other cellular molecules.
Types of PTMs:

Phosphorylation:

Phosphorylation involves the addition of a phosphate group (PO₄³⁻) to a protein, typically on
serine, threonine, or tyrosine residues. This modification is catalyzed by enzymes known as
kinases and can be reversed by phosphatases.

Key Functions:

Regulation of Protein Activity: Phosphorylation can activate or deactivate enzymes and other
proteins, thereby regulating various cellular processes.

Signal Transduction: It plays a crucial role in cell signaling pathways, allowing cells to respond
to external stimuli by transmitting signals from the cell surface to the nucleus.

Protein-Protein Interactions: Phosphorylation can create binding sites for other proteins,
facilitating complex formation and interaction networks.

Cell Cycle Control: It is essential for the regulation of the cell cycle, ensuring proper cell
division and function.

Acetylation: Addition of an acetyl group (CH₃CO), usually on lysine residues of histone
proteins, commonly affecting gene expression. Catalyzed by histone acetyltransferases (HATs),
this modification neutralizes the positive charge of histones, leading to a more relaxed
chromatin structure. This relaxation allows transcription factors to access DNA, generally
promoting gene activation.

Methylation: Addition of a methyl (CH₃) group, commonly on arginine and lysine residues of
histones. It can either activate or repress gene expression depending on the specific site and
context.

Ubiquitination: Attachment of ubiquitin molecules, marking proteins for degradation by
the proteasome.

Glycosylation: Glycosylation is a post-translational modification where a carbohydrate (glycan)
is covalently attached to a protein or lipid. This process is crucial for proper protein folding,
stability, and function, and it occurs in the endoplasmic reticulum and Golgi apparatus of cells.

Example: Identifying phosphorylation sites using NETPHOS

1. Retrieve the protein sequence: Start by obtaining the amino acid sequence of the protein you
want to analyze. You can do this by searching for the protein in online databases like UniProt or
NCBI.

2. Accessing NETPHOS: Visit the NETPHOS website to access the tool.

3. Submitting the sequence: Copy and paste the protein sequence into the input box provided
on the NETPHOS website. Make sure the sequence is in the correct format and without any
errors.

4. Selecting parameters: NETPHOS allows you to select several parameters that influence the
prediction, such as the kinase group, thresholds, and output format. You can choose default
settings or customize them according to your specific requirements.

5. Running the prediction: Click on the "Submit" button to initiate the prediction process. The
tool will use its algorithm to predict potential phosphorylation sites in the protein sequence based
on the specified parameters and kinase groups.

6. Analyzing the results: Once the prediction is complete, NETPHOS will provide you with a
list of potential phosphorylation sites identified in the protein sequence. It may also give
additional information such as the position of the sites and the associated score.

7. Interpreting the results: Review the predicted phosphorylation sites and their corresponding
scores. Higher scores suggest a higher probability of phosphorylation. It is important to note that
NETPHOS predicts potential phosphorylation sites based on sequence information only.
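
The interpretation step can be supported by a little scripting once the results have been copied out of the report. The sketch below does not call NETPHOS and does not reproduce its algorithm: it only lists the serine, threonine and tyrosine positions that could in principle be phosphorylated, and filters a list of (position, residue, score) records by a cutoff. The function names, the cutoff of 0.5 and the scores in the usage example are assumptions made for illustration.

```python
def candidate_sites(seq):
    """Return 1-based positions of all Ser, Thr and Tyr residues."""
    return [(i + 1, aa) for i, aa in enumerate(seq) if aa in "STY"]


def filter_predictions(predictions, threshold=0.5):
    """Keep predicted sites scoring at or above the threshold, best first.

    `predictions` is a list of (position, residue, score) tuples, for
    example transcribed by hand from a prediction report.
    """
    kept = [p for p in predictions if p[2] >= threshold]
    return sorted(kept, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    sequence = "MSTAYKSPT"  # made-up sequence, for illustration only
    print(candidate_sites(sequence))

    # Made-up scores standing in for a real prediction report.
    report = [(2, "S", 0.91), (3, "T", 0.42), (5, "Y", 0.63), (7, "S", 0.55)]
    for position, residue, score in filter_predictions(report):
        print(position, residue, score)
```

Sorting by score puts the most confident predictions first, which is convenient when choosing sites for experimental follow-up.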

Enzymes Classification (read this topic from Mushtaq Biochemistry)
Enzymes Retrieval Databases
Enzyme retrieval databases are specialized repositories that store comprehensive information
about enzymes, including their functions, structures, and classifications. These databases are
essential tools for researchers in biochemistry, molecular biology, and related fields. Here are
some key enzyme databases:
1. BRENDA
o Description: BRENDA is one of the most extensive enzyme databases, providing detailed
information on enzyme functions, structures, and kinetics. It includes data on enzyme-ligand
interactions, enzyme nomenclature, and metabolic pathways.
o Features: Advanced search options, enzyme classification, metabolic pathways, and
visualization tools.
2. ExplorEnz:
o Description: This database is the official repository for the International Union of Biochemistry
and Molecular Biology (IUBMB) enzyme nomenclature. It offers a comprehensive list of
enzymes classified according to the IUBMB standards.
o Features: Simple and advanced search options, downloadable data in SQL or XML formats, and
adherence to IUPAC naming conventions.
3. IntEnz (Integrated relational Enzyme database):
o Description: IntEnz provides a curated collection of enzyme information, integrating data from
various sources to offer a unified view of enzyme nomenclature and classification.
4. SIB-ENZYME:
o Description: Managed by the Swiss Institute of Bioinformatics, this database offers detailed
enzyme information, including enzyme kinetics and molecular functions.

Importance of Enzyme Databases

Enzyme databases play a crucial role in scientific research by:

 Facilitating Data Retrieval: Providing easy access to detailed enzyme information for
researchers.
 Enabling Comparative Analysis: Allowing researchers to compare enzyme functions and
structures across different species.
 Enhancing Functional Annotation: Assisting in the annotation of newly discovered enzymes
based on existing data.
 Integrating Data: Combining information from various sources to provide a comprehensive
view of enzyme functions and interactions.

These databases are continually updated to include the latest research findings, making them
invaluable resources for the scientific community.
Analyzing the DNA/RNA sequence by the use of BI tools.

Analysis of DNA/RNA sequences plays a crucial role in various fields of scientific research,
including genetics, genomics, evolutionary biology, and biomedical research. The use of
bioinformatics (BI) tools has revolutionized the analysis and interpretation of these nucleic acid
sequences.

1. Sequence Alignment:

Sequence alignment is a fundamental step in DNA/RNA sequence analysis. BI tools, such as
BLAST (Basic Local Alignment Search Tool), perform sequence alignments to compare a query
sequence against a database of known sequences. This aids in identifying similar sequences,
determining evolutionary relationships, and discovering functional domains within the sequence.

2. Functional Annotation:

Functional annotation assigns biological meaning to a sequence by predicting the presence of
functional elements such as genes, regulatory regions, and protein-coding regions. BI tools like
InterProScan, GeneMark, or Augustus assist in identifying and characterizing these functional
elements.

3. Phylogenetic Analysis:

Phylogenetic analysis is used to determine the evolutionary relationships between DNA/RNA
sequences. BI tools, such as MEGA (Molecular Evolutionary Genetics Analysis) or PHYLIP
(Phylogeny Inference Package), construct phylogenetic trees based on sequence alignment data.
These trees help scientists understand the relatedness and evolutionary history of different
species or lineages.

4. Gene Expression Analysis:

When dealing with RNA sequences, gene expression analysis is of great significance. BI tools
like DESeq2 or edgeR perform differential gene expression analysis, comparing expression
levels between different conditions or tissues. These tools employ statistical methods to identify
genes that are significantly upregulated or downregulated, providing insights into biological
processes and disease mechanisms.

5. Structural Analysis and Prediction:

BI tools assist in analyzing and predicting the 3D structure of the DNA/RNA molecules.
Programs like NUCLEIC, RNAComposer, and MODELLER predict the structural
conformation, folding patterns, and interactions of nucleic acids. This information is valuable for
understanding how sequence variations contribute to structural changes and functional
implications.

6. Variant Identification and Analysis:

BI tools are also used for detecting genetic variations, such as single nucleotide polymorphisms
(SNPs) or insertions/deletions (indels), within DNA/RNA sequences. Variant calling tools like
GATK (Genome Analysis Toolkit) and SAMtools identify variants, filter spurious calls, and
provide information about their functional impact. This aids in understanding genetic diversity,
disease susceptibility, and personalized medicine.

7. Epigenetic Analysis:

BI tools are extensively used for analyzing epigenetic modifications, such as DNA methylation
or histone modifications. Tools like Bismark analyze DNA methylation patterns in bisulfite
sequencing (Bisulfite-Seq) data, while ChIP-seq data are analyzed to map post-translational
modifications of histones. These analyses shed light on gene regulation, chromatin structure,
and epigenetic mechanisms underlying development and diseases.

In conclusion, the use of BI tools plays a pivotal role in the analysis of DNA/RNA sequences.
These tools enable researchers to align sequences, annotate functional elements, perform
phylogenetic analysis, identify genetic variations, predict 3D structures, analyze gene expression,
and explore epigenetic modifications. The integration of BI tools with experimental data
accelerates scientific discoveries, enhances understanding of biological processes, and aids in the
development of novel therapies and treatments.
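
As a small illustration of the variant identification step, the sketch below compares a sample sequence with a reference that has already been aligned to it and reports SNPs and single-column insertions or deletions. It is a toy example under stated assumptions: the function name and output format are invented here, and real variant callers such as GATK or SAMtools work on many aligned reads with statistical models rather than on a single pairwise alignment.

```python
def call_variants(aligned_ref, aligned_sample):
    """List differences between two aligned sequences of equal length.

    Returns tuples (reference_position, kind, ref_base, sample_base) with
    1-based reference positions. For an insertion the position is that of
    the last reference base before the inserted base.
    """
    if len(aligned_ref) != len(aligned_sample):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0
    for r, s in zip(aligned_ref, aligned_sample):
        if r != "-":
            ref_pos += 1
        if r == s:
            continue
        if r == "-":
            variants.append((ref_pos, "insertion", "-", s))
        elif s == "-":
            variants.append((ref_pos, "deletion", r, "-"))
        else:
            variants.append((ref_pos, "SNP", r, s))
    return variants


if __name__ == "__main__":
    # Made-up aligned sequences, for illustration only.
    reference = "ACGT-ACGT"
    sample = "ATGTCAC-T"
    for variant in call_variants(reference, sample):
        print(variant)
```

Reporting positions in reference coordinates keeps the output comparable between samples, which is the same convention used by standard variant formats.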

Retrieving the DNA sequence from database:

Retrieving DNA sequences from databases is a common task in bioinformatics research.
Databases such as NCBI GenBank, Ensembl, and RefSeq provide extensive collections of DNA
sequences that can be accessed for various analyses. This detailed note provides a step-by-step
guide on how to retrieve DNA sequences from a database.

1. Determine the Database: Identify the appropriate database that contains the DNA sequence
you are interested in. Popular databases include NCBI GenBank, Ensembl, and RefSeq. Different
databases may specialize in different types of sequences, so choose the one that is most relevant
to your research objectives.

2. Access the Database: Visit the website of the chosen database and open the search interface.
Most databases have user-friendly search interfaces that allow users to search by various criteria,
such as gene name, accession number, organism, or keywords.

3. Define Search Parameters: Specify the search criteria for the DNA sequence you want to
retrieve. This could be the gene name, accession number, species name, or any other identifier
associated with the sequence. You can also use keywords related to the sequence or its function.

4. Refine the Search: If the initial search yields too many results, you can further refine the
search by applying additional filters. For example, you can specify a particular chromosome,
limit the search to a specific species, or apply constraints based on sequence length or
publication date.

5. Review Search Results: After applying the search criteria, the database will display a list of
matching sequences. Browse through the search results and identify the sequence(s) that are most
relevant to your study. The list may include multiple sequences, so select the one(s) that meet
your requirements.

6. Retrieve the Sequence: Once you have identified the desired sequence, you can retrieve it in
various formats. Most databases provide options to download sequences in FASTA format,
which is a commonly used format for DNA sequences. You can copy the sequence directly from
the website or download it as a text file.

7. Record Sequence Information: It is essential to record the relevant information about the
retrieved sequence for future reference. This includes details such as the database name,
accession number, version, source organism, and any other associated metadata. Maintaining this
information ensures proper citation and accurate analysis.
8. Quality Check: Before using the retrieved DNA sequence, perform a quality check to ensure
its integrity and accuracy. Verify that the sequence length, composition, and any associated
annotations align with your expectations. You can use sequence analysis tools or align the
sequence with known references for validation.
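
The retrieval and quality-check steps above can also be scripted. The sketch below builds a download address for the NCBI E-utilities efetch service, parses FASTA text, and runs a basic check on length, GC content and unexpected symbols. It is a minimal example under stated assumptions: no network request is made, the accession "ABC12345" is a placeholder rather than a real record, and the function names and the set of accepted symbols (A, C, G, T, N) are choices made for illustration.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nuccore"):
    """Build the efetch address that returns a record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def quality_report(seq):
    """Summarize length, GC content and any symbols other than A, C, G, T, N."""
    length = len(seq)
    gc = seq.count("G") + seq.count("C")
    return {
        "length": length,
        "gc_percent": 100.0 * gc / length if length else 0.0,
        "unexpected_symbols": sorted(set(seq) - set("ACGTN")),
    }


if __name__ == "__main__":
    print(efetch_fasta_url("ABC12345"))  # placeholder accession
    demo = ">demo_record made-up example\nACGTTGCA\nggccaatn\n"
    for name, sequence in parse_fasta(demo).items():
        print(name, quality_report(sequence))
```

Recording the address used for the download alongside the accession and version makes the retrieval reproducible, which supports the record-keeping step described above.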

In summary, retrieving DNA sequences from databases involves accessing the relevant database,
defining search parameters, refining the search if necessary, reviewing search results, selecting
the desired sequence(s), retrieving the sequence in the desired format, and recording the relevant
information. This process allows researchers to access valuable DNA sequence data for analysis,
comparison, and further investigation.
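
The retrieval and quality-check steps can be illustrated with a small sketch. It assumes the sequence has already been downloaded as FASTA text, so no database is contacted. The function names (parse_fasta, quality_report) and the example record are hypothetical and were invented for illustration; they are not part of any database's tools.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def quality_report(sequence, expected_length=None):
    """Summarise length, base composition, and unexpected characters."""
    counts = Counter(sequence)
    gc = counts["G"] + counts["C"]
    return {
        "length": len(sequence),
        "gc_percent": round(100 * gc / len(sequence), 1) if sequence else 0.0,
        "n_count": counts["N"],
        # Anything other than A, C, G, T or N is flagged for manual review.
        "unexpected_characters": sorted(set(sequence) - set("ACGTN")),
        "length_ok": expected_length is None or len(sequence) == expected_length,
    }


# Hypothetical record, invented for illustration (not a real database entry).
example = """>demo_seq_1 hypothetical example record
ATGGCCATTGTAATGGGCCGC
TGAAAGGGTGCCCGATAG
"""
records = parse_fasta(example)
header, seq = records[0]
report = quality_report(seq, expected_length=39)
print(header)
print(report)
```

In practice the header line would also be the place to read off the accession number and version that step 7 asks you to record.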

PRIMER DESIGNING

Primer designing in bioinformatics refers to the process of designing short nucleotide sequences
called primers to initiate the amplification of a specific DNA or RNA target region using
techniques such as polymerase chain reaction (PCR).

Primers are essential components of PCR, a widely used molecular biology technique that
amplifies specific DNA or RNA sequences. The primers are synthesized in the laboratory and
are designed to be complementary to the target DNA or RNA region, allowing them to bind or
anneal to the template strand. They act as starting points for DNA replication or amplification.

Here are the steps involved in primer designing:

1. Define the target sequence: The first step is to identify the specific DNA or RNA region of
interest that needs to be amplified. This can be a gene, a segment of a gene, or any other region.

2. Determine primer characteristics: Once the target sequence is known, the characteristics of
the primers need to be determined. These include the primer length, GC content, melting
temperature (Tm), presence of secondary structures, and absence of regions with significant
similarity to other sequences. The following characteristics should be taken into account when
designing primers:

i. Primer Length: It is generally accepted that the optimal length of PCR primers is 18-
22 bp. This length is long enough for adequate specificity and short enough for primers to
bind easily to the template at the annealing temperature.
ii. Primer Melting Temperature: Primer Melting Temperature (Tm) by definition is the
temperature at which one half of the DNA duplex will dissociate to become single
stranded and indicates the duplex stability. Primers with melting temperatures in the
range of 52-58 °C generally produce the best results.
iii. GC Content: The GC content (the number of G's and C's in the primer as a percentage
of the total bases) of primer should be 40-60%.
iv. Primer Secondary Structures: Presence of the primer secondary structures produced
by intermolecular or intramolecular interactions can lead to poor or no yield of the
product. They adversely affect primer template annealing and thus the amplification.
They greatly reduce the availability of primers to the reaction.
v. Repeats: A repeat is a di-nucleotide occurring many times consecutively (for example,
ATATATAT). Repeats should be avoided because they can cause mispriming. The maximum
number of di-nucleotide repeats acceptable in an oligo is 4.
vi. Runs: Primers with long runs of a single base should generally be avoided as they can
misprime. For example, AGCGGGGGATGGGG has runs of base 'G' of length 5 and 4. The
maximum acceptable run length is 4 bp.
vii. End Stability: It is the maximum ΔG value of the five bases from the 3' end. An
unstable 3' end (less negative ΔG) will result in less false priming.
viii. Primer specificity: Primers should be designed to specifically bind to the target
sequence and not to any other similar sequences in the organism's genome. This is
important to ensure accurate amplification and avoid nonspecific PCR products.
ix. Avoid Template Secondary Structure: Single-stranded nucleic acid sequences are
highly unstable and fold into conformations (secondary structures). The stability of these
template secondary structures depends largely on their free energy and melting
temperature (Tm). Consideration of template secondary structures is important in
designing primers, especially in qPCR. If primers are designed on a secondary structure
that is stable even above the annealing temperature, the primers are unable to bind to
the template and the yield of PCR product is significantly reduced. Hence, it is important
to design primers in regions of the template that do not form stable secondary
structures during the PCR reaction.
x. Avoiding primer-primer interactions: Primers should also be checked for possible
secondary structure formations and interactions between the forward and reverse primers.
Self-complementarity or complementarity between the primers can lead to primer-primer
interactions, which can inhibit efficient PCR amplification.

3. Primer design software: Various bioinformatics tools and software programs are available to
aid in primer designing. These tools help in identifying potential primer sequences that meet the
desired characteristics. Examples include Primer-BLAST, Primer3, PrimerQuest, Oligo, and Autoprime.
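
Several of the characteristics listed in step 2 can be checked with a few lines of code. The sketch below is illustrative only and is not how the tools named above work internally. It assumes the simple Wallace rule (2 °C per A/T plus 4 °C per G/C) as a rough stand-in for Tm, whereas design software generally uses more accurate thermodynamic models. The function names and the 19-base example primer are hypothetical.

```python
import re


def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    return 100 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate in °C (assumption: Wallace rule, 2*(A+T) + 4*(G+C))."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer):
    """Length of the longest stretch of one repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", primer)), default=0)


def longest_dinucleotide_repeat(primer):
    """Largest number of consecutive copies of a two-base unit (ATATAT gives 3)."""
    best = 0
    # The lookahead lets us test a repeat starting at every position.
    for m in re.finditer(r"(?=((..)\2*))", primer):
        unit = m.group(2)
        if unit[0] != unit[1]:  # units like GG are single-base runs, handled above
            best = max(best, len(m.group(1)) // 2)
    return best


def check_primer(primer):
    """Compare a primer against the thresholds given in the text."""
    primer = primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    checks = {
        "length_18_22": 18 <= len(primer) <= 22,
        "gc_40_60": 40 <= gc_content(primer) <= 60,
        "tm_52_58": 52 <= wallace_tm(primer) <= 58,
        "runs_max_4": longest_run(primer) <= 4,
        "dinucleotide_repeats_max_4": longest_dinucleotide_repeat(primer) <= 4,
    }
    checks["all_ok"] = all(checks.values())
    return checks


# Hypothetical primer invented for illustration, plus the run example from the text.
print(check_primer("ATGACCATTGTAATGGGCC"))
print(check_primer("AGCGGGGGATGGGG"))
```

Secondary structure, end stability, and specificity are not covered here, because they need thermodynamic parameters or a genome search.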

Overall, primer designing in bioinformatics plays a critical role in PCR-based experiments. It
requires careful consideration of various factors to ensure the selection of primers that are
specific, efficient, and capable of successfully amplifying the desired target sequence.
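
Primer-primer interactions can be screened in a very simple way by asking whether the 3' end of one primer is complementary to any part of itself or of the other primer. The sketch below is a crude illustration under stated assumptions: it looks only for an exact match of the last k bases (k = 4 is an arbitrary choice made here) and ignores ΔG, which real design tools take into account. Both primers and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def three_prime_overlap(primer_a, primer_b, k=4):
    """True if the last k bases of primer_a can pair with some part of primer_b.

    Strands pair antiparallel, so the 3' tail of primer_a anneals wherever its
    reverse complement occurs in primer_b.
    """
    tail = primer_a.upper()[-k:]
    return reverse_complement(tail) in primer_b.upper()


def dimer_risks(forward, reverse, k=4):
    """Flag possible self-dimers and cross-dimers for a primer pair."""
    return {
        "forward_self": three_prime_overlap(forward, forward, k),
        "reverse_self": three_prime_overlap(reverse, reverse, k),
        "forward_vs_reverse": three_prime_overlap(forward, reverse, k),
        "reverse_vs_forward": three_prime_overlap(reverse, forward, k),
    }


# Hypothetical primer pair invented for illustration.
forward = "ATGACCATTGTAATGGGCC"
reverse = "CTATCGGGCACCCTTTCAG"
risks = dimer_risks(forward, reverse)
print(risks)
```

Here the forward primer ends in GGCC, which is its own reverse complement, so two copies of that primer could pair at their 3' ends; this is the kind of self-complementarity that item x warns about.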

Identifying restriction sites

Restriction Mapping
The restriction/modification system in bacteria is a small-scale immune system for protection
from infection by foreign DNA. In the late 1960s it was discovered that E. coli contains
enzymes that methylate specific nucleotide bases in DNA. Different strains of E. coli
contain different types of these methylases.
In addition to possessing a particular methylase, individual bacterial strains also contain
an accompanying endonuclease that recognizes the same specific, often palindromic, sequence.
These nucleases, however, do not cleave at their recognition sequences if the DNA is
methylated there.
Thus, this combination of a specific methylase and associated endonuclease functioned as a type
of immune system for individual bacterial strains, protecting them from infection by foreign
DNA (e.g. viruses).
• In E. coli strains that carry the EcoRI system, the sequence GAATTC is methylated at the
internal adenine base (by the EcoRI methylase).
• The EcoRI endonuclease within the same bacterium will not cleave the methylated DNA.
• Foreign viral DNA, which is not methylated at the sequence "GAATTC", will therefore be
recognized as "foreign" DNA and will be cleaved by the EcoRI endonuclease.
• Cleavage of the viral DNA renders it non-functional.
• Such endonucleases are referred to as "restriction endonucleases" because they restrict the
DNA within the cell to being "self".
• The combination of restriction endonuclease and methylase is termed the "restriction-
modification" system.
Since different bacterial strains and species have potentially different R/M systems, their
characterization has made available hundreds of endonucleases with different sequence specific
cleavage sites.
• They are one of the primary tools in modern molecular biology for the manipulation and
identification of DNA sequences.
• Restriction endonucleases are commonly named after the bacterium from which they were isolated.
• The utility of restriction endonucleases lies in their specificity and the frequency with
which their recognition sites occur within any given DNA sample.
• The assortment of DNA fragments produced by a digest represents a specific "fingerprint" of
the particular DNA being digested. Different DNA would not yield the same collection of
fragment sizes. Thus, DNA from different sources can be either matched or distinguished
based on the set of fragments produced by restriction endonuclease treatment. These
differences in fragment sizes are termed "Restriction Fragment Length Polymorphisms", or
RFLPs. This simple analysis is used in various aspects of molecular biology as well as in
law enforcement and genealogy. For example, genetic variations that distinguish individuals
may also result in fewer or additional restriction endonuclease recognition sites.
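
The idea behind RFLP analysis can be shown with a minimal sketch. It assumes linear DNA and models EcoRI, which cuts after the first G of GAATTC on the top strand. The two "alleles" are hypothetical sequences invented for illustration; they differ by a single base that removes one EcoRI site. The function names are also hypothetical.

```python
def find_sites(sequence, site="GAATTC"):
    """Return 0-based start positions of every occurrence of the recognition site."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear sequence at each site and return the fragment lengths.

    cut_offset=1 places the cut after the first base of the site (G^AATTC).
    """
    cuts = [p + cut_offset for p in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Hypothetical alleles: allele_b has GAGTTC where allele_a has its second GAATTC.
allele_a = "ATCGGAATTCTTAGCCGGAATTCAAGT"
allele_b = "ATCGGAATTCTTAGCCGGAGTTCAAGT"

fragments_a = digest(allele_a)
fragments_b = digest(allele_b)
print(fragments_a)  # [5, 13, 9]
print(fragments_b)  # [5, 22]
```

The two fragment lists differ, which is what a gel would show as different banding patterns for the two sources of DNA.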
VIRS (A visual tool for identifying restriction sites in multiple DNA sequences) is an
interactive web-based program designed for restriction endonuclease cut site prediction
and visualization. It can analyze multiple DNA sequences simultaneously and
produce visual restriction maps, with several options for user
customization. These options also support in-depth analysis of the restriction maps, such
as providing a virtual electrophoresis result for the digested fragments. Unlike other
analytical tools, VIRS not only displays visual outputs but also provides the detailed
properties of commercially available restriction endonucleases. All the
information on these enzymes is stored in the tool's internal database, which its developers
describe as updated monthly from the manufacturers' web pages. It is freely available online at
http://bis.zju.edu.cn/virs/index.html.
Other tools

1. REBASE (Restriction Enzyme Database)
2. NEBcutter (New England Biolabs)
3. Webcutter (online restriction mapping tool)
4. RestrictionMapper (online tool)
5. DNAStar (Lasergene)
6. ApE (A Plasmid Editor)
7. SnapGene
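
The core task these tools perform, scanning several sequences for the recognition sites of several enzymes, can be sketched as follows. This is a simplified illustration and not the algorithm used by VIRS or any tool listed above. The three recognition sequences are the well-known ones for EcoRI, BamHI, and HindIII; the input sequences and function names are hypothetical.

```python
import re

# Recognition sequences of three widely used enzymes.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def site_positions(sequence, site):
    """Return 1-based start positions of a site, including overlapping matches."""
    pattern = re.compile(f"(?={re.escape(site)})")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


def restriction_map(sequences, enzymes=ENZYMES):
    """Map each sequence name to {enzyme name: [site positions]}."""
    return {
        name: {enzyme: site_positions(seq, site) for enzyme, site in enzymes.items()}
        for name, seq in sequences.items()
    }


# Hypothetical input sequences invented for illustration.
sequences = {
    "seq_A": "TTGAATTCAAGGATCCTT",
    "seq_B": "AAGCTTGGGAATTCAAGCTT",
}
site_map = restriction_map(sequences)
for name, hits in site_map.items():
    print(name, hits)
```

A real tool would also handle degenerate recognition sequences, circular DNA, and methylation sensitivity, all of which are left out of this sketch.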
