My Bioinformatics Notes
"A field of science that uses computers, databases, math, and statistics to collect, store,
organize, and analyze large amounts of biological, medical, and health information."
2. Data Structures: Organized ways to store and manage large biological datasets.
3. Programming Languages:
A programming language is a set of rules and instructions that a computer can understand
and execute to perform specific tasks, solve problems, or create software. Languages such as
Python, R, and SQL are used to write bioinformatics software.
4. Databases: Organized collections of biological data, like GenBank or PDB.
7. Parallel Computing:
Parallel computing in bioinformatics refers to the use of multiple processing units or cores to
perform computational tasks simultaneously, speeding up the analysis of large biological
datasets.
8. Cloud Computing: Cloud computing is a model of delivering computing services over the
internet. In bioinformatics, it means using remote computing resources to analyze large
datasets. Cloud computing applications in bioinformatics:
9. Machine Learning:
Machine learning refers to the development of algorithms and statistical models that enable
computers to:
1. Learn from data
2. Improve their performance
3. Make predictions or decisions
4. Identify patterns and relationships
5. Classify or group data points
1. Advancing Molecular Biology Research: Providing data, tools, and resources for
researchers.
2. Supporting Genomics and Proteomics: Enabling genome assembly, annotation, and
analysis.
3. Promoting Open Science: Making data and resources openly available to the scientific
community.
By providing a comprehensive collection of nucleotide sequence data and cutting-edge
research, EMBL supports advancements in molecular biology, genomics, and proteomics.
DATA ACQUISITION:
1. Submission: Researchers submit sequence data to EMBL through online forms or file
uploads.
2. Automated Pipeline: EMBL uses an automated pipeline to process and annotate submitted
data.
3. Data Exchange: EMBL exchanges data with other databases like NCBI and DDBJ.
NCBI (National Center for Biotechnology Information)
NCBI is a leading biomedical research organization and database provider, part of the
National Library of Medicine (NLM) at the National Institutes of Health (NIH). The NCBI is
located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by
US Congressman Claude Pepper.
1. Advancing Biomedical Research: Providing data, tools, and resources for researchers.
2. Supporting Genomics and Proteomics: Enabling genome assembly, annotation, and
analysis.
3. Promoting Open Science: Making data and resources openly available to the scientific
community.
Data acquisition
1. Submission: Researchers submit sequence data to NCBI through online forms or file
uploads.
2. GenBank Direct Submission: Sequencing centers and researchers can directly submit data
to GenBank.
3. Data Exchange: NCBI exchanges data with other databases like DDBJ and EMBL.
4. Literature Scanning: NCBI staff scan scientific literature for new sequence data.
5. Automated Pipeline: NCBI uses an automated pipeline to process and annotate submitted
data.
These databases play a crucial role in collecting, processing, and disseminating biological
data, enabling researchers to access and analyze data for various applications.
BIOINFORMATICS TOOLS
Multiple Sequence Alignment
ClustalW
A FASTA file typically consists of two parts: a header line and one or more sequence lines.
The header line begins with a greater-than symbol ">" followed by a unique identifier for the
sequence. The sequence lines contain the actual sequence of nucleotides or amino acids.
It is recommended that all lines of text be shorter than 80 characters in length.
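The layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name parse_fasta and the two toy records are made up for illustration, and it assumes the identifier is the first word after ">".

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping identifier -> sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the identifier is the first word after ">";
            # anything after it is treated as a free-text description.
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    # A sequence may be wrapped over several lines, so join the pieces.
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Toy records, invented for this example.
example = """>seq1 toy nucleotide record
ACGTACGTAC
GTTA
>seq2 toy protein record
MKTAYIAK
"""
records = parse_fasta(example)
print(records)
```

For real work, an established parser (for example the one in Biopython) is usually a safer choice than a hand-written one.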
To retrieve a protein sequence from a database in FASTA format, you can follow these steps:
1. Identify the protein database you want to search, such as UniProt or NCBI's Protein
database.
2. Go to the website of the database you selected and access their search interface.
3. Enter your search term, which can be the protein name, accession number, or any other
relevant ID. Submit the search.
4. Look for the search results and identify the specific protein sequence you want to retrieve.
5. On the result page, you may find an option to download or view the sequence in different
formats. Look for the FASTA format, which often has a link or button to retrieve the
sequence.
6. Click on the FASTA link or button, and it will likely open a new page or display the
sequence directly on the screen.
7. Copy the FASTA-formatted sequence, which usually starts with the ">" symbol followed
by the sequence identifier and then the sequence itself.
Now you have successfully retrieved the protein sequence in FASTA format from the
database. You can save it in a text file or use it directly for further analysis or processing in
bioinformatics workflows.
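The same retrieval can also be scripted. The sketch below assumes the UniProt REST address pattern rest.uniprot.org/uniprotkb/<accession>.fasta and uses P60709 (human beta-actin) as an example accession; both should be checked against the current UniProt documentation. The helper names are hypothetical, and the download itself is left commented out because it needs network access.

```python
import urllib.request

# Assumed URL pattern for a single UniProtKB entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def uniprot_fasta_url(accession):
    """Build the download URL for one accession (hypothetical helper)."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession, timeout=30):
    """Download the FASTA text for one accession; requires network access."""
    with urllib.request.urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(uniprot_fasta_url("P60709"))
# fasta_text = fetch_fasta("P60709")   # uncomment to download the record
```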
Sequence Alignment
There are two main types of sequence alignment: pairwise sequence alignment and multiple
sequence alignment.
1. Pairwise sequence alignment: This type of alignment compares two sequences, for example a
gene or protein and its homolog from a different organism. Its purpose is to identify regions
of similarity between the two sequences. This type of alignment can be useful for identifying
functional motifs within a protein or identifying mutations in a gene that may contribute to
the development of disease.
2. Multiple sequence alignment: This type of alignment involves comparing three or more
sequences simultaneously. The purpose of multiple sequence alignment is to identify
conserved regions, gaps, and mutations across multiple sequences. This type of alignment is
useful for studying the evolutionary history of a gene or protein family and for identifying
residues that are essential for protein function.
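To make the idea of scoring an alignment concrete, here is a minimal sketch of the Needleman-Wunsch dynamic programming method for aligning two sequences end to end. It returns only the best score, not the alignment itself, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of a and b (Needleman-Wunsch)."""
    # prev holds the previous row of the dynamic programming table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("GCATGCG", "GATTACA"))  # 0
```

Multiple sequence alignment tools do not run this on all sequences at once; they typically build the alignment up from pairwise comparisons.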
Here is a detailed overview of how the alignment of protein and nucleotide sequences using
BLAST works. As an example, obtain FASTA files with accession numbers NM_007393 (Mus
musculus actin, beta (ACTB)); NM_205518 (Gallus gallus actin, beta (ACTB)); and NM_001101
(Homo sapiens actin beta (ACTB)).
1. Selection of a sequence database: First, you need to select a suitable sequence database
for the search. Popular choices include the NCBI protein database or the nucleotide database.
2. Input sequence: Prepare the query sequence that you want to align against the selected
database. It can be either a protein or nucleotide sequence. Make sure to have the sequence in
a suitable format, such as FASTA.
3. BLAST program selection: Choose the appropriate BLAST program based on the type of
sequence(s) you are working with. For protein queries against a protein database, the most
commonly used program is BLASTP. For nucleotide queries, BLASTN searches a nucleotide
database, while BLASTX translates the query and searches a protein database.
4. Running the BLAST search: Open the BLAST website or use command-line tools to
start the BLAST search. Enter the query sequence in the provided field and select the desired
database.
5. Scoring algorithm: BLAST utilizes a scoring scheme (e.g., a BLOSUM substitution matrix
for proteins or match/mismatch scores for nucleotides) to calculate similarity scores between
the query sequence and each database sequence.
6. Searching for alignments: BLAST processes the scoring information to identify regions
of local similarity between the query sequence and sequences in the database. The algorithm
starts by identifying short, significant matches, also known as seed matches, and then extends
them to identify longer alignments.
7. Scoring and statistical significance: The alignments produced by BLAST are assigned
scores based on the similarity of the aligned residues. Additionally, statistical calculations are
performed to estimate the significance of the alignments. This significance is often expressed
as an E-value, which represents the expected number of matches with similar scores that
could occur randomly.
8. Displaying the results: Once the search is complete, BLAST provides a list of alignments
ranked by their E-values. The output typically includes information such as the alignment
score, E-value, sequence identifiers, and aligned regions of the query and database
sequences.
9. Interpreting the results: Researchers examine the BLAST results to determine the
biological significance of the alignments. They assess the quality of the matches, the
sequence similarity, and consider factors like the alignment length and alignment coverage.
BLAST is a powerful tool for aligning and comparing protein and nucleotide sequences. It
enables researchers to identify similar sequences, infer evolutionary relationships, predict
protein structures, and gain insights into the function and characteristics of genes and
proteins.
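The E-value mentioned in step 7 is commonly written as E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m the query length, and n the database length. The sketch below only illustrates that relationship. The default values of K and lambda are placeholders chosen for the example; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expected_hits(raw_score, query_length, database_length, k=0.13, lam=0.32):
    """Expected number of chance alignments scoring at least raw_score.

    k and lam are placeholder statistical parameters (assumptions for
    illustration), not values taken from any particular BLAST run.
    """
    return k * query_length * database_length * math.exp(-lam * raw_score)


low = expected_hits(raw_score=40, query_length=300, database_length=1_000_000)
high = expected_hits(raw_score=80, query_length=300, database_length=1_000_000)
print(low, high)  # the higher score gives the smaller E-value
```

The formula shows why the same score is less significant in a larger database: E grows in proportion to the database length.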
Additional Information
BLAST (Basic Local Alignment Search Tool) is an efficient and widely used local
sequence alignment tool to compare nucleotide or protein sequences. It can identify regions
of similarity between different sequences, which is useful for detecting homologous
sequences, identifying conserved domains, and predicting protein structures.
BLAST uses a heuristic algorithm to rapidly compare sequences, making it a valuable tool
for large-scale sequence analysis. It has two different types of algorithms: the original
BLAST algorithm, for detecting closely related sequences, and the PSI-BLAST algorithm,
for detecting sequences that are distantly related.
The original BLAST algorithm lists all possible words of a certain length (k-mers) from the
query sequence. It then searches a sequence database for exact matches to these k-mers and
extends each match in both directions to include neighboring residues. The final score is
calculated based on the number and quality of the matches.
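The word lookup and extension idea can be sketched as follows. This is a simplified illustration, not the BLAST implementation: the function names are hypothetical, the toy sequences are invented, and the extension stops at the first mismatch, whereas BLAST keeps extending until the score drops by a set amount.

```python
from collections import defaultdict


def index_words(query, k):
    """Map every k-letter word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k):
    """Extend a seed left and right without gaps while the letters are identical."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + k, j + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q  # query start, subject start, length


# Toy sequences, invented for this example.
query = "TTGACGTACC"
subject = "AAGACGTAGG"
seeds = find_seeds(query, subject, k=4)
print(seeds)                                  # [(2, 2), (3, 3), (4, 4)]
print(extend_seed(query, subject, 2, 2, 4))   # (2, 2, 6)
```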
PSI-BLAST is an iterative version of the BLAST algorithm that builds a position-specific
scoring matrix based on the initial alignment and then uses this matrix to search for
additional sequences.
BLAST is widely used in biological research due to its user-friendly interface and ability to
handle large-scale sequence analysis. It is accessible through the NCBI website; standalone
versions are also available for download.
In addition to its basic functionality, BLAST offers various options and parameters, allowing
users to customize their searches based on specific criteria. For example, users can choose
different scoring matrices to match their query sequences to the database sequences or
specify the e-value cutoff to control the search stringency.
Using BLAST to search for sequences similar to a specific region of interest within a more
extensive sequence is also possible, which helps identify conserved domains or motifs. Its
ability to rapidly identify regions of similarity between sequences has been critical in
advancing our understanding of the genetic basis of life.
While BLAST has some limitations, such as reduced sensitivity for distantly related sequences
in a standard (non-iterative) search, it remains an indispensable tool for many biological
research projects.
CLUSTALW is a software package for multiple sequence alignment that provides a user-
friendly interface for aligning nucleotide or protein sequences. It employs a progressive
alignment approach that initially generates a guide tree to group sequences into clusters
based on their similarity. Once grouped, the software aligns sequences within each cluster
and merges them iteratively until the whole dataset is aligned.
The following are the steps involved in performing sequence alignment using CLUSTALW:
1. Input sequences: In order to align sequences using CLUSTALW, you need to provide a
set of nucleotide or protein sequences in a common format, such as FASTA. These sequences
can be obtained from various online databases or generated from your own experiments.
3. Generate guide tree: CLUSTALW uses a guide tree to group similar sequences together.
This tree is constructed using the neighbor-joining method and is based on the pairwise
similarities between sequences. The guide tree provides a roadmap for the progressive
alignment process.
6. Output: The final alignment is saved in a common format such as Clustal or FASTA and
can be used in downstream analyses such as phylogenetic tree construction, molecular
modeling, or functional analysis.
Overall, CLUSTALW is an easy-to-use and powerful tool for sequence alignment that can be
used for various applications in bioinformatics research. Its ability to handle large datasets
and optimize alignments based on specific parameters makes it a valuable resource for
studying the evolution and function of biological sequences.
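The first stage of a progressive alignment, working out which sequences are most similar so they can be joined first, can be sketched as below. This is a deliberate simplification and not what CLUSTALW actually computes: it assumes toy sequences of equal length and uses the fraction of differing positions as the distance, whereas CLUSTALW derives distances from pairwise alignments and then builds a neighbor-joining tree. All names and sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def closest_pair(sequences):
    """Return the most similar pair of names and the full distance table."""
    distances = {
        (name1, name2): p_distance(sequences[name1], sequences[name2])
        for name1, name2 in combinations(sequences, 2)
    }
    return min(distances, key=distances.get), distances


# Toy sequences, invented for this example.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TCGAACTT"}
pair, distances = closest_pair(toy)
print(pair)       # ('a', 'b') would be aligned first
print(distances)
```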
Additional information
ClustalW is a widely used multiple sequence alignment tool with a progressive approach. It
works by creating a guide tree that represents the evolutionary relationships between the
sequences and then aligning the sequences based on the guide tree.
ClustalW allows users to adjust the alignment parameters to optimize the alignment for their
specific research question. It is handy for aligning large numbers of sequences and valuable in
many areas of biological research, including phylogenetics, functional genomics, and drug
discovery.
ClustalW also offers a variety of output formats, including Clustal, NEXUS, and PHYLIP
formats, making it easy to integrate with other bioinformatics tools. Additionally, ClustalW has
several advanced features, such as the ability to perform pairwise alignments, identify conserved
regions of the alignment, and generate phylogenetic trees.
Users can employ several strategies to improve performance on large datasets, such as running
the program on a high-performance computing cluster or switching to Clustal Omega, the
successor program in the Clustal family that is designed to scale to larger datasets.
ClustalW has an intuitive interface, customizable parameters, and the ability to handle large
datasets, making it a valuable resource for many different types of biological research.
In addition to its advanced features, ClustalW allows users to incorporate external information,
such as secondary structure predictions or phylogenetic information from other sources,
improving the accuracy of the alignment by providing additional context and constraints for the
alignment process.
To analyze protein physicochemical properties using Protparam, the following steps are usually
followed:
1. Retrieval of protein sequence: The first step is to obtain the amino acid sequence of the protein
of interest. This can be done by either extracting the sequence from a protein database or by
generating a sequence from an experimentally determined protein structure.
2. Accessing the Protparam tool: The Protparam tool is available online as part of the Expasy
Bioinformatics Resource Portal. Access the Protparam page by either searching for "Protparam"
in a search engine or directly visiting the Expasy website.
3. Input protein sequence: Once on the Protparam page, you will see a text box labeled "Enter
your protein sequence." Copy and paste your protein sequence into this box. Make sure that the
sequence is in the correct single-letter amino acid code format.
6. Interpreting the results: Protparam will generate a report displaying the analyzed properties of
the protein. This report usually includes the calculated values for each selected parameter
category. Carefully review the results and interpret the protein's physicochemical properties
based on the provided values.
7. Further analysis: Depending on the research goals, the obtained data from Protparam analysis
can be further analyzed and interpreted using statistical methods or compared with other proteins
or databases to gain insights into the protein's function, stability, or other relevant characteristics.
Note: Protparam is a widely used tool for analyzing protein physicochemical properties.
However, there are other alternative tools available that provide similar functionality.
Researchers often use multiple tools or compare results from various resources for
comprehensive protein analysis.
Protparam is a tool used to compute various physical and chemical properties of a protein
sequence. Some of the properties that can be analyzed by Protparam include:
Physical properties:
1. Molecular weight: The total size of a protein molecule, which is the sum of the atomic masses
of all atoms within the protein.
2. Isoelectric point (pI): The pH at which a protein carries no net electrical charge. It can provide
information about the protein's behavior under different pH conditions.
3. Instability index: A measure of the stability of a protein based on the occurrence of certain
dipeptides in its sequence. A value below 40 predicts a stable protein, and a value above 40
predicts an unstable one.
4. Aliphatic index: A measure of the relative volume occupied by aliphatic side chains (alanine,
valine, leucine, and isoleucine) in the protein sequence. A higher value is regarded as an
indicator of greater thermostability.
Chemical properties:
1. Amino acid composition: Protparam provides the amino acid composition of the protein
sequence, including the frequency and percentage of each amino acid present.
2. Charge: It provides the total number of negatively charged residues (Asp + Glu) and
positively charged residues (Arg + Lys), which can be useful in predicting its interactions with
other molecules.
3. Half-life: Estimation of the half-life of a protein based on its N-terminal residue (the N-end
rule). This property gives a rough indication of how long the protein persists within a cellular
environment.
These properties, among others provided by Protparam, help researchers gain insights into the
protein's structure, stability, hydrophobicity, and electrochemical properties.
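Two of these properties are simple enough to compute by hand. The sketch below calculates the amino acid composition and the aliphatic index using the commonly cited formula AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)), where X is the mole percent of each residue. It is meant as an illustration of the calculation, not a replacement for the tool; the function names and the test sequence are invented.

```python
from collections import Counter


def amino_acid_percentages(sequence):
    """Return the percentage of each amino acid in a one-letter sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


def aliphatic_index(sequence):
    """Aliphatic index from mole percentages of Ala, Val, Ile, and Leu."""
    pct = amino_acid_percentages(sequence)
    return (
        pct.get("A", 0.0)
        + 2.9 * pct.get("V", 0.0)
        + 3.9 * (pct.get("I", 0.0) + pct.get("L", 0.0))
    )


# Toy peptide, invented for this example.
print(amino_acid_percentages("AVIL"))
print(aliphatic_index("AVIL"))
```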
The Protein Secondary Structure Prediction (PSSP) method utilizes various algorithms and
statistical models to predict the secondary structure elements of proteins. The primary approach
employed by PSSP is based on analyzing the sequence of amino acids in the protein and
extracting information from known structures in protein databases.
1. Data collection: PSSP gathers known protein structures from databases such as the Protein
Data Bank (PDB) that have experimentally determined secondary structure annotations.
2. Feature extraction: Relevant features are extracted from the protein sequence, such as amino
acid composition, physicochemical properties, and sequence patterns. These features provide
input for the prediction algorithms.
3. Algorithm selection: PSSP employs different algorithms for secondary structure prediction,
each with its own strengths and weaknesses. Some common algorithms used include neural
networks, support vector machines, and hidden Markov models.
4. Model training: The selected algorithm is trained using the extracted features and the known
protein structures from the database. This step involves optimizing the model parameters and
fine-tuning the algorithm to improve accuracy.
5. Prediction: Once the model is trained, it can be applied to predict the secondary structure
elements of a target protein. The protein sequence is processed using the trained model, and each
amino acid is assigned a secondary structure label (e.g., helix, strand, coil).
It's important to note that PSSP predictions are not always completely accurate, especially for
proteins with unique or novel structural features. However, by utilizing large datasets and
advanced algorithms, PSSP can provide reasonably reliable estimations of protein secondary
structure elements.
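The feature extraction step is often done with a sliding window, so that each residue is predicted from its neighbors as well as itself. The sketch below shows only that windowing step. The window size of 15 and the padding character are assumptions for illustration, and the function name is hypothetical.

```python
def sliding_windows(sequence, size=15, pad="-"):
    """Return one window per residue, centered on it and padded at the ends."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that it has a center")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


print(sliding_windows("ACDE", size=3))  # ['-AC', 'ACD', 'CDE', 'DE-']
```

Each window would then be converted to numbers (for example one value per amino acid type) before being passed to the trained model, which outputs a label for the central residue.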
PSIPRED
JPRED
SSPRO
PORTER
Introduction
Methods
1. Homology Modeling
As the name suggests, homology modeling predicts protein structures based on sequence
homology with known structures. It is also known as comparative modeling. The principle
behind it is that if two proteins share a high enough sequence similarity, they are likely to have
very similar three-dimensional structures. If one of the protein sequences has a known structure,
then the structure can be copied to the unknown protein with a high degree of confidence.
Homology modeling produces an all-atom model based on alignment with template proteins.
Steps:
1. Template selection
The first step in protein structural modeling is to select appropriate structural
templates. This forms the foundation for the rest of the modeling process. The template
selection involves searching the Protein Data Bank (PDB) for homologous proteins
with determined structures. The search can be performed using a heuristic pairwise
alignment search program such as BLAST or FASTA.
2. Sequence alignment
Once the structure with the highest sequence similarity is identified as a template,
the full-length sequences of the template and target proteins need to be realigned
using refined alignment algorithms to obtain optimal alignment. This realignment
is the most critical step in homology modeling, which directly affects the quality
of the final model. This is because incorrect alignment at this stage leads to
incorrect designation of homologous residues and therefore to incorrect structural
models. Errors made in the alignment step cannot be corrected in the following
modeling steps. Therefore, the best possible multiple alignment algorithms, such
as Praline and T-Coffee, should be used.
3. Model building
Construct a 3D model of the target protein based on the aligned sequence and
template structures.
4. Model refinement
The final homology model has to be evaluated to make sure that the structural
features of the model are consistent with the physicochemical rules. Any remaining structural
irregularities can be corrected by applying an energy minimization procedure to the entire
model, which moves the atoms in such a way that the overall conformation has the lowest
energy potential.
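Template selection depends on how similar the target is to a candidate template. A minimal sketch of computing percent identity from an existing alignment is shown below. It assumes that columns containing a gap are skipped, which is only one of several definitions in use, and the aligned toy sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned pair, invented for this example: 6 of 8 compared columns match.
print(percent_identity("MKT-AYIAK", "MKTLAYLAR"))  # 75.0
```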
2. Threading
There are only a small number of protein folds available (<1,000), compared to millions of protein
sequences. This means that protein structures tend to be more conserved than protein sequences.
Consequently, many proteins can share a similar fold even in the absence of sequence
similarities.
This allowed the development of computational methods to predict protein structures beyond
sequence similarities. Determining whether a protein sequence adopts a known
three-dimensional structure fold relies on threading and fold recognition methods.
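Threading predicts the structural fold of an unknown protein sequence by fitting the sequence
into a structural database and selecting the best-fitting fold. The comparison emphasizes
secondary structures, which are most evolutionarily conserved. Therefore, this approach can
identify structurally similar proteins even in the absence of detectable sequence similarity.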
Steps:
1. Template selection
1. Sequence identity: Templates with high sequence identity (>30%) are preferred.
2. Sequence alignment
3. Ab Initio
This method predicts structure from sequence alone, without a template, using physical and
chemical principles to simulate protein folding. Both homology and fold recognition approaches
rely on the availability of template structures in the database to achieve predictions. If no correct
structures exist in the database, the methods fail. However, proteins in nature fold on their own
without checking what the structures of their homologs are in databases. Obviously, there is
some information in the sequences that provides instruction for the proteins to “find” their native
structures. Early biophysical studies have shown that most proteins fold spontaneously into a
stable structure that has near minimum energy. This structural state is called the native state.
This folding process appears to be nonrandom; however, its mechanism is poorly understood.
The limited knowledge of protein folding forms the basis of ab initio prediction. As the name
suggests, the ab initio prediction method attempts to produce all-atom protein models based on
sequence information alone without the aid of known protein structures. The perceived
advantage of this method is that predictions are not restricted by known folds and that novel
protein folds can be identified. However, because the physicochemical laws governing protein
folding are not yet well understood, the energy functions used in the ab initio prediction are at
present rather inaccurate. The folding problem remains one of the greatest challenges in
bioinformatics today.
Examples.
Post-Translational Modifications (PTMs) are chemical changes that occur to proteins after they
have been synthesized. These modifications play crucial roles in regulating protein function,
activity, localization, and interaction with other cellular molecules.
Types of PTMs:
Phosphorylation:
Key Functions:
Regulation of Protein Activity: Phosphorylation can activate or deactivate enzymes and other
proteins, thereby regulating various cellular processes.
Signal Transduction: It plays a crucial role in cell signaling pathways, allowing cells to respond
to external stimuli by transmitting signals from the cell surface to the nucleus.
Protein-Protein Interactions: Phosphorylation can create binding sites for other proteins,
facilitating complex formation and interaction networks.
Cell Cycle Control: It is essential for the regulation of the cell cycle, ensuring proper cell
division and function.
Methylation: Addition of a methyl (CH3) group. It commonly occurs on arginine and lysine
residues of histones and can either activate or repress gene expression depending on the
specific site and context.
1. Retrieve the protein sequence: Start by obtaining the amino acid sequence of the protein you
want to analyze. You can do this by searching for the protein in online databases like UniProt or
NCBI.
2. Accessing NETPHOS: Visit the NETPHOS website to access the tool.
3. Submitting the sequence: Copy and paste the protein sequence into the input box provided
on the NETPHOS website. Make sure the sequence is in the correct format and without any
errors.
4. Selecting parameters: NETPHOS allows you to select several parameters that influence the
prediction, such as the kinase group, thresholds, and output format. You can choose default
settings or customize them according to your specific requirements.
5. Running the prediction: Click on the "Submit" button to initiate the prediction process. The
tool will use its algorithm to predict potential phosphorylation sites in the protein sequence based
on the specified parameters and kinase groups.
6. Analyzing the results: Once the prediction is complete, NETPHOS will provide you with a
list of potential phosphorylation sites identified in the protein sequence. It may also give
additional information such as the position of the sites and the associated score.
7. Interpreting the results: Review the predicted phosphorylation sites and their corresponding
scores. Higher scores suggest a higher probability of phosphorylation. It is important to note that
NETPHOS predicts potential phosphorylation sites based on sequence information only.
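Before submitting a sequence, it can help to list the residues that could be phosphorylated at all, namely serine (S), threonine (T), and tyrosine (Y). The sketch below does only that and is not the NETPHOS method, which scores each site with trained models. The function name and the toy peptide are invented.

```python
def candidate_phosphosites(sequence, residues="STY"):
    """Return (position, residue) pairs for every S, T, or Y, counting from 1."""
    return [
        (position, aa)
        for position, aa in enumerate(sequence.upper(), start=1)
        if aa in residues
    ]


# Toy peptide, invented for this example.
print(candidate_phosphosites("MSTAY"))  # [(2, 'S'), (3, 'T'), (5, 'Y')]
```

Comparing this list with the tool's output shows which candidate residues received a score above the chosen threshold.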
Enzymes Classification (read this topic from Mushtaq
Biochemistry)
Enzymes Retrieval Databases
Enzyme retrieval databases are specialized repositories that store comprehensive information
about enzymes, including their functions, structures, and classifications. These databases are
essential tools for researchers in biochemistry, molecular biology, and related fields. Here are
some key enzyme databases:
1. BRENDA
o Description: BRENDA is one of the most extensive enzyme databases, providing detailed
information on enzyme functions, structures, and kinetics. It includes data on enzyme-ligand
interactions, enzyme nomenclature, and metabolic pathways.
o Features: Advanced search options, enzyme classification, metabolic pathways, and
visualization tools.
2. ExplorEnz:
o Description: This database is the official repository for the International Union of Biochemistry
and Molecular Biology (IUBMB) enzyme nomenclature. It offers a comprehensive list of
enzymes classified according to the IUBMB standards.
o Features: Simple and advanced search options, downloadable data in SQL or XML formats, and
adherence to IUPAC naming conventions.
3. IntEnz (Integrated relational Enzyme database):
o Description: IntEnz provides a curated collection of enzyme information, integrating data from
various sources to offer a unified view of enzyme nomenclature and classification.
4. SIB-ENZYME:
o Description: Managed by the Swiss Institute of Bioinformatics, this database offers detailed
enzyme information, including enzyme kinetics and molecular functions.
Facilitating Data Retrieval: Providing easy access to detailed enzyme information for
researchers.
Enabling Comparative Analysis: Allowing researchers to compare enzyme functions and
structures across different species.
Enhancing Functional Annotation: Assisting in the annotation of newly discovered enzymes
based on existing data.
Integrating Data: Combining information from various sources to provide a comprehensive
view of enzyme functions and interactions.
These databases are continually updated to include the latest research findings, making them
invaluable resources for the scientific community.
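Entries in these databases are keyed by EC numbers, which have four dot-separated fields, the first giving the main enzyme class. A small sketch of reading such a number is shown below. The function name and the returned dictionary layout are assumptions made for illustration.

```python
# The seven top-level classes of the IUBMB enzyme nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec):
    """Split an EC number such as 'EC 1.1.1.1' and look up its main class."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError("an EC number has four dot-separated fields")
    main_class = int(fields[0])
    if main_class not in EC_CLASSES:
        raise ValueError("unknown main class: %d" % main_class)
    return {"class": EC_CLASSES[main_class], "fields": fields}


print(parse_ec_number("EC 1.1.1.1"))
```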
Analyzing the DNA/RNA sequence by the use of BI tools.
Analysis of DNA/RNA sequences plays a crucial role in various fields of scientific research,
including genetics, genomics, evolutionary biology, and biomedical research. The use of
bioinformatics (BI) tools has revolutionized the analysis and interpretation of these nucleic acid
sequences.
1. Sequence Alignment:
3. Phylogenetic Analysis:
When dealing with RNA sequences, gene expression analysis is of great significance. BI tools
like DESeq2 or edgeR perform differential gene expression analysis, comparing expression
levels between different conditions or tissues. These tools employ statistical methods to identify
genes that are significantly upregulated or downregulated, providing insights into biological
processes and disease mechanisms.
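The basic quantity behind up- and downregulation is the log2 fold change between conditions. The sketch below computes it from mean counts with a pseudocount to avoid division by zero. It is a simplification: DESeq2 and edgeR normalize the counts and fit statistical models before reporting fold changes and significance. The function name, pseudocount, and counts are invented for illustration.

```python
import math


def log2_fold_change(control_counts, treated_counts, pseudocount=1.0):
    """log2 of (mean treated + pseudocount) / (mean control + pseudocount)."""
    control_mean = sum(control_counts) / len(control_counts)
    treated_mean = sum(treated_counts) / len(treated_counts)
    return math.log2((treated_mean + pseudocount) / (control_mean + pseudocount))


# Toy read counts for one gene, invented for this example.
print(log2_fold_change([7, 7], [31, 31]))   # 2.0, i.e. about four-fold higher
print(log2_fold_change([31, 31], [7, 7]))   # -2.0, i.e. about four-fold lower
```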
BI tools assist in analyzing and predicting the 3D structure of the DNA/RNA molecules.
Programs like NUCLEIC, RNA Composer, and MODELLER predict the structural
conformation, folding patterns, and interactions of nucleic acids. This information is valuable for
understanding how sequence variations contribute to structural changes and functional
implications.
BI tools are also used for detecting genetic variations, such as single nucleotide polymorphisms
(SNPs) or insertions/deletions (indels), within DNA/RNA sequences. Variant calling tools like
GATK (Genome Analysis Toolkit) and SAMtools identify variants, filter spurious calls, and
provide information about their functional impact. This aids in understanding genetic diversity,
disease susceptibility, and personalized medicine.
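At its simplest, finding single nucleotide differences means comparing a sample against a reference position by position. The sketch below assumes the two sequences are already aligned and of equal length and ignores gapped positions. Real variant callers such as GATK work from many sequencing reads and their quality scores, so this is only an illustration with invented names and sequences.

```python
def find_substitutions(reference, sample, gap="-"):
    """Return (position, reference_base, sample_base) for each mismatch, counting from 1."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref, alt)
        for position, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt and ref != gap and alt != gap
    ]


# Toy aligned sequences, invented for this example.
print(find_substitutions("ACGTAC", "ACCTAT"))  # [(3, 'G', 'C'), (6, 'C', 'T')]
```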
7. Epigenetic Analysis:
BI tools are extensively used for analyzing epigenetic modifications, such as DNA methylation
or histone modifications. Tools like Bismark analyze DNA methylation patterns from bisulfite
sequencing (Bisulfite-Seq) data, and ChIP-seq data are analyzed to map post-translational
modifications of histones. These analyses shed light on gene regulation, chromatin structure,
and epigenetic mechanisms underlying development and diseases.
In conclusion, the use of BI tools plays a pivotal role in the analysis of DNA/RNA sequences.
These tools enable researchers to align sequences, annotate functional elements, perform
phylogenetic analysis, identify genetic variations, predict 3D structures, analyze gene expression,
and explore epigenetic modifications. The integration of BI tools with experimental data
accelerates scientific discoveries, enhances understanding of biological processes, and aids in the
development of novel therapies and treatments.
1. Determine the Database: Identify the appropriate database that contains the DNA sequence
you are interested in. Popular databases include NCBI GenBank, Ensembl, and RefSeq. Different
databases may specialize in different types of sequences, so choose the one that is most relevant
to your research objectives.
2. Access the Database: Visit the website of the chosen database and open the search interface.
Most databases have user-friendly search interfaces that allow users to search by various criteria,
such as gene name, accession number, organism, or keywords.
3. Define Search Parameters: Specify the search criteria for the DNA sequence you want to
retrieve. This could be the gene name, accession number, species name, or any other identifier
associated with the sequence. You can also use keywords related to the sequence or its function.
4. Refine the Search: If the initial search yields too many results, you can further refine the
search by applying additional filters. For example, you can specify a particular chromosome,
limit the search to a specific species, or apply constraints based on sequence length or
publication date.
5. Review Search Results: After applying the search criteria, the database will display a list of
matching sequences. Browse through the search results and identify the sequence(s) that are most
relevant to your study. The list may include multiple sequences, so select the one(s) that meet
your requirements.
6. Retrieve the Sequence: Once you have identified the desired sequence, you can retrieve it in
various formats. Most databases provide options to download sequences in FASTA format,
which is a commonly used format for DNA sequences. You can copy the sequence directly from
the website or download it as a text file.
7. Record Sequence Information: It is essential to record the relevant information about the
retrieved sequence for future reference. This includes details such as the database name,
accession number, version, source organism, and any other associated metadata. Maintaining this
information ensures proper citation and accurate analysis.
8. Quality Check: Before using the retrieved DNA sequence, perform a quality check to ensure
its integrity and accuracy. Verify that the sequence length, composition, and any associated
annotations align with your expectations. You can use sequence analysis tools or align the
sequence with known references for validation.
In summary, retrieving DNA sequences from databases involves accessing the relevant database,
defining search parameters, refining the search if necessary, reviewing search results, selecting
the desired sequence(s), retrieving the sequence in the desired format, and recording the relevant
information. This process allows researchers to access valuable DNA sequence data for analysis,
comparison, and further investigation.
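The quality check in step 8 can be sketched in Python as follows. This is a minimal illustration rather than a standard tool: the function names parse_fasta and quality_report, the example record, and the allowed alphabet (A, C, G, T, N) are all assumptions made for this example.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def quality_report(seq, expected_length=None):
    """Summarize length, GC composition, and unexpected characters."""
    counts = Counter(seq)
    invalid = sorted(set(seq) - set("ACGTN"))   # assumed DNA alphabet
    gc = (counts["G"] + counts["C"]) / len(seq) * 100 if seq else 0.0
    return {
        "length": len(seq),
        "length_ok": expected_length is None or len(seq) == expected_length,
        "gc_percent": round(gc, 1),
        "invalid_characters": invalid,
    }


# Hypothetical record used only for illustration (not a real database entry).
example = """>demo_record hypothetical example sequence
ATGGCGTACGTTAGC
GGATCCNNATGC
"""

for header, seq in parse_fasta(example).items():
    print(header, quality_report(seq, expected_length=27))
```

A report with a wrong length or a non-empty list of invalid characters suggests that the download was truncated or that the wrong record was retrieved.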
PRIMER DESIGNING
Primer designing in bioinformatics refers to the process of designing short nucleotide sequences
called primers to initiate the amplification of a specific DNA or RNA target region using
techniques such as polymerase chain reaction (PCR).
Primers are essential components of PCR, a widely used molecular biology technique that
amplifies specific DNA or RNA sequences. The primers are synthesized in the laboratory and
are designed to be complementary to the target DNA or RNA region, allowing them to bind or
anneal to the template strand. They act as starting points for DNA replication or amplification.
1. Define the target sequence: The first step is to identify the specific DNA or RNA region of
interest that needs to be amplified. This can be a gene, a segment of a gene, or any other region.
2. Determine primer characteristics: Once the target sequence is known, the characteristics of
the primers need to be determined. These include the primer length, GC content, melting
temperature (Tm), presence of secondary structures, and absence of regions with significant
similarity to other sequences. The following characteristics should be taken into account when
designing primers:
i. Primer Length: It is generally accepted that the optimal length of PCR primers is 18-22 bp.
This length is long enough for adequate specificity and short enough for primers to
bind easily to the template at the annealing temperature.
ii. Primer Melting Temperature: Primer Melting Temperature (Tm) by definition is the
temperature at which one half of the DNA duplex will dissociate to become single
stranded and indicates the duplex stability. Primers with melting temperatures in the
range of 52-58 °C generally produce the best results.
iii. GC Content: The GC content (the number of G's and C's in the primer as a percentage
of the total bases) of a primer should be 40-60%.
iv. Primer Secondary Structures: Presence of the primer secondary structures produced
by intermolecular or intramolecular interactions can lead to poor or no yield of the
product. They adversely affect primer template annealing and thus the amplification.
They greatly reduce the availability of primers to the reaction.
v. Repeats: A repeat is a di-nucleotide occurring many times consecutively (for example,
ATATATAT). Repeats should be avoided because they can cause mispriming. The maximum
acceptable number of consecutive di-nucleotide repeats in an oligo is 4.
vi. Runs: Primers with long runs of a single base should generally be avoided as they can
misprime. For example, AGCGGGGGATGGGG has runs of base 'G' of length 5 and 4. The
maximum acceptable run length is 4 bp.
vii. End Stability: It is the maximum ΔG value of the five bases from the 3' end. An
unstable 3' end (less negative ΔG) will result in less false priming.
viii. Primer specificity: Primers should be designed to specifically bind to the target
sequence and not to any other similar sequences in the organism's genome. This is
important to ensure accurate amplification and avoid nonspecific PCR products.
ix. Avoid Template Secondary Structure: Single-stranded nucleic acid sequences are
highly unstable and fold into conformations (secondary structures). The stability of these
template secondary structures depends largely on their free energy and melting
temperature (Tm). Consideration of template secondary structures is important in
designing primers, especially in qPCR. If primers are designed on a secondary structure
that is stable even above the annealing temperature, the primers are unable to bind to
the template and the yield of PCR product is significantly reduced. Hence, it is important
to design primers in regions of the template that do not form stable secondary
structures during the PCR reaction.
x. Avoiding primer-primer interactions: Primers should also be checked for possible
secondary structure formations and interactions between the forward and reverse primers.
Self-complementarity or complementarity between the primers can lead to primer-primer
interactions, which can inhibit efficient PCR amplification.
3. Primer design software: Various bioinformatics tools and software programs are available to
aid in primer designing. These tools help in identifying potential primer sequences that meet the
desired characteristics. Examples include Primer-BLAST, Primer3, PrimerQuest, Oligo, and Autoprime.
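Several of the characteristics listed above (length, GC content, Tm, runs, and repeats) can be checked with a few lines of Python. The sketch below is only an illustration and does not reproduce how any of the tools named above work. The function names and the example primer are hypothetical, and Tm is estimated with the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is an assumption made here; primer design software generally uses more accurate thermodynamic models.

```python
import re


def gc_content(primer):
    """GC content as a percentage of the total bases."""
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate in degrees C (Wallace rule; an assumption for this sketch)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer):
    """Length of the longest run of a single base."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", primer))


def max_dinucleotide_repeats(primer):
    """Largest number of consecutive copies of one di-nucleotide (ATATATAT -> 4)."""
    best = 0
    for i in range(len(primer) - 1):
        unit = primer[i:i + 2]
        if unit[0] == unit[1]:
            continue  # same-base units are counted as runs instead
        count, j = 0, i
        while primer[j:j + 2] == unit:
            count += 1
            j += 2
        best = max(best, count)
    return best


def check_primer(primer):
    """Return a list of rule violations, using the thresholds given in the notes."""
    primer = primer.upper()
    issues = []
    if not 18 <= len(primer) <= 22:
        issues.append("length outside 18-22 bp")
    if not 40 <= gc_content(primer) <= 60:
        issues.append("GC content outside 40-60%")
    if not 52 <= wallace_tm(primer) <= 58:
        issues.append("Tm outside 52-58 C")
    if longest_run(primer) > 4:
        issues.append("single-base run longer than 4 bp")
    if max_dinucleotide_repeats(primer) > 4:
        issues.append("more than 4 di-nucleotide repeats")
    return issues


# Hypothetical primer that satisfies all five checks.
print(check_primer("ATGCGTACCTGAGTCAAG"))
# The run example from the notes fails several checks.
print(check_primer("AGCGGGGGATGGGG"))
```

Specificity, secondary structure, and primer-primer interactions are not covered by this sketch; they require comparison against the genome and free-energy calculations.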
Restriction Mapping
The restriction/modification system in bacteria is a small-scale immune system for protection
from infection by foreign DNA. In the late 1960s it was discovered that E. coli contains
enzymes that will methylate specific nucleotide bases in DNA. Different strains of E. coli
contained different types of these methylases.
In addition to possessing a particular methylase, individual bacterial strains also contained
an accompanying specific endonuclease activity. These endonucleases cleave DNA at specific,
typically palindromic, recognition sequences, but would not cleave at those sequences if the
DNA was methylated.
Thus, this combination of a specific methylase and associated endonuclease functioned as a type
of immune system for individual bacterial strains, protecting them from infection by foreign
DNA (e.g. viruses).
• In E. coli strains that carry the EcoRI system, the sequence GAATTC is methylated at the
internal adenine base (by the EcoRI methylase).
• The EcoRI endonuclease within the same bacteria will not cleave the methylated DNA.
• Foreign viral DNA, which is not methylated at the sequence "GAATTC", will therefore be
recognized as "foreign" DNA and will be cleaved by the EcoRI endonuclease.
• Cleavage of the viral DNA renders it non-functional.
• Such endonucleases are referred to as "restriction endonucleases" because they restrict the
DNA within the cell to being "self".
• The combination of restriction endonuclease and methylase is termed the
"restriction-modification" (R/M) system.
Since different bacterial strains and species have potentially different R/M systems, their
characterization has made available hundreds of endonucleases with different sequence specific
cleavage sites.
• They are one of the primary tools in modern molecular biology for the manipulation and
identification of DNA sequences.
• Restriction endonucleases are commonly named after the bacterium from which they were
isolated (for example, EcoRI from Escherichia coli).
The utility of restriction endonucleases lies in their specificity and the frequency with
which their recognition sites occur within any given DNA sample.
When a DNA sample is digested with a restriction endonuclease, the resulting assortment of
DNA fragments represents a specific "fingerprint" of the particular DNA being digested.
Different DNA would not yield the same collection of fragment sizes. Thus, DNA from
different sources can be either matched or distinguished based on the assembly of fragments
after restriction endonuclease treatment. Such differences in fragment sizes are termed
"Restriction Fragment Length Polymorphisms", or RFLPs. This simple analysis is used in
various aspects of molecular biology as well as in law enforcement and genealogy. For
example, genetic variations that distinguish individuals also may result in fewer or
additional restriction endonuclease recognition sites.
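The idea of a fragment "fingerprint" can be illustrated with a small virtual digest in Python. This is a sketch under stated assumptions: the two sample sequences are made up for the example, the molecule is treated as linear, and the function names are hypothetical. EcoRI is modelled as cutting after the first G of GAATTC.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, site):
    """0-based start positions of every occurrence of the recognition site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, site, cut_offset):
    """Fragment lengths after cutting a linear molecule at each site."""
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"
ECORI_CUT = 1  # cut between G and AATTC

# The recognition site reads the same on both strands (palindromic).
print(reverse_complement(ECORI_SITE) == ECORI_SITE)

# Two made-up samples: sample_b has one site destroyed by a single base change.
sample_a = "ATGC" + "GAATTC" + "TTAGGCTA" + "GAATTC" + "CCGA"
sample_b = "ATGC" + "GAATTC" + "TTAGGCTA" + "GAGTTC" + "CCGA"

print(digest(sample_a, ECORI_SITE, ECORI_CUT))  # fragment lengths for sample_a
print(digest(sample_b, ECORI_SITE, ECORI_CUT))  # fragment lengths for sample_b
```

The two samples differ by one base, yet they give different sets of fragment lengths, which is the kind of difference an RFLP analysis detects.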
VIRS (a visual tool for identifying restriction sites in multiple DNA sequences) is an
interactive web-based program designed for the prediction and visualization of restriction
endonuclease cut sites. It can analyze multiple DNA sequences simultaneously and
produce visual restriction maps with several options for user customization. These
options also support in-depth analysis of the restriction maps, such as providing a
virtual electrophoretic result for the digested fragments. Unlike other analytical
tools, VIRS not only displays visual outputs but also provides the detailed
properties of restriction endonucleases that are commercially available. All the
information on these enzymes is stored in an internal database, which is updated monthly
from the manufacturers' web pages. It is freely available online at
http://bis.zju.edu.cn/virs/index.html.
Other tools