In silico Gene Analysis

iabt
Outline
Introduction
Alignment
ORF searching
3D protein modeling
Case study
INTRODUCTION
What is a gene?
A length of DNA which codes for a particular protein or, in certain cases, a functional or structural RNA molecule
What are the essential components of a gene?
Initiation codon
Introns and exons (in eukaryotes)
Stop codon
Regulatory sequences
INTRODUCTION ….
Essential features of a gene considered in in silico gene analysis
All DNA sequences are written with the four bases A, T, G and C
All proteins are written with the 20 amino acids (one-letter code)
The initiation codon is fixed: ATG
The stop codons are also fixed: TAA, TAG and TGA
Intron boundaries follow the GT-AG rule (GU-AG in the RNA transcript)
Codon usage differs from organism to organism
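The features listed above can be checked mechanically. The following is a minimal sketch, assuming an intron-free coding sequence given as a plain string; the function names `split_codons`, `check_cds` and `codon_usage` and the short example sequence are illustrative, not part of any particular tool.

```python
from collections import Counter

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(cds):
    """Split a coding sequence into consecutive 3-base codons (upper case)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def check_cds(cds):
    """Report the simple sequence features named on the slide."""
    codons = split_codons(cds)
    return {
        "only_ATGC": set(cds.upper()) <= set("ATGC"),
        "starts_with_ATG": bool(codons) and codons[0] == START_CODON,
        "ends_with_stop": bool(codons) and codons[-1] in STOP_CODONS,
        "internal_stop": any(c in STOP_CODONS for c in codons[1:-1]),
    }


def codon_usage(cds):
    """Count how often each codon occurs; this table differs between organisms."""
    return Counter(split_codons(cds))


example = "ATGGTGCATCTGACTCCTGAGGAGAAGTAA"  # hypothetical short coding sequence
print(check_cds(example))
print(codon_usage(example).most_common(3))
```

Comparing the `codon_usage` counts of genes from two organisms is one simple way to see the codon bias mentioned above.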

FILE FORMATS
FASTA format
>XM_414949 | Gallus gallus |alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
IG (IntelliGenetics) format; the sequence ends with 1
; comment
; comment
XM_414949
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF1
GDE format
% XM_414949 | Gallus gallus |alpha 2 globin
NBRF/PIR format
>P1; XM_414949 | Gallus gallus |alpha 2 globin
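As an illustration of how the FASTA layout above is read, here is a minimal sketch of a parser. It assumes the input is already in memory as text; `parse_fasta` is an illustrative name, and established libraries provide more complete readers for all of the formats listed.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>XM_414949 | Gallus gallus |alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
"""

for header, sequence in parse_fasta(sample):
    fields = [field.strip() for field in header.split("|")]
    print(fields, len(sequence))
```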
ALIGNMENTS
The result of a comparison of two or more gene or protein sequences in order to determine their degree of base or amino acid similarity
Types of alignment
By number of sequences: pairwise alignment, multiple alignment
By extent: local alignment, global alignment

REFERENCE SEQUENCE
>NG_000007 |chromosome 11| beta hemoglobin|Homo sapiens

atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacg
tggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagag
gttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaagg
ctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaag
ggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttc
aggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccac
cagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatc
actaa
>NG_000007 |chromosome 11| beta hemoglobin|Homo sapiens

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMG
NPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCV
LAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
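The protein sequence above is the translation of the nucleotide sequence shown before it. A minimal sketch of that translation follows, assuming the standard genetic code and an intron-free coding sequence; `translate` and `CODON_TABLE` are illustrative names, and only the first 48 bases of the reference sequence are used to keep the example short.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence from its first base until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 48 bases of the beta hemoglobin coding sequence shown above.
print(translate("atggtgcatctgactcctgaggagaagtctgccgttactgccctgtgg"))
```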
PAIRWISE ALIGNMENT
Two sequences are compared at a time
Sequences may be nucleotide-nucleotide or amino acid-amino acid
The alignment may be gapped or ungapped
Two algorithms: Smith-Waterman (local alignment) and Needleman-Wunsch (global alignment)
Heuristic search tools: BLAST and FASTA
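The two dynamic-programming algorithms named above differ mainly in initialisation and in whether a cell may fall below zero. The sketch below assumes a simple scoring scheme (match +1, mismatch -1, gap -1) chosen only for illustration; real analyses use a substitution matrix and tuned gap penalties. The function names are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # leading gaps are penalised in a global alignment
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment: return the best score of any pair of subsequences."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diag = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, previous[j] + gap, current[j - 1] + gap)  # never below 0
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(needleman_wunsch("ACGT", "AGT"))
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```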

BLAST (Basic Local Alignment Search Tool)
Pairwise local alignment
Developed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers and David Lipman (1990)
Stages in search
BLAST searches for short matches (words) of a fixed length W between the query and sequences in the database
If they share a common word, BLAST extends the match in both directions as an ungapped alignment between the query and the database sequence
BLAST then performs a gapped alignment between the query sequence and the database sequence
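The first two stages above (word matching, then ungapped extension) can be sketched as follows. This is a simplified illustration, not BLAST itself: it assumes exact word matches only, a match/mismatch score of +1/-1, and a hypothetical drop-off value `drop` that stops an extension once the score falls that far below its best; the gapped stage and the statistics are omitted.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, qi, sj, w, match=1, mismatch=-1, drop=2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the extended segment.
    """
    def extend(step, q, s):
        score = best = 0
        length = best_length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_length = score, length
            elif best - score >= drop:
                break  # score dropped too far below the best seen so far
            q += step
            s += step
        return best, best_length

    right_score, right_length = extend(1, qi + w, sj + w)
    left_score, left_length = extend(-1, qi - 1, sj - 1)
    total = w * match + left_score + right_score
    return total, qi - left_length, sj - left_length, w + left_length + right_length


query, subject = "ACGTTGCA", "TTACGTTGGA"
seeds = find_seeds(query, subject, w=4)
print(seeds)
print(extend_ungapped(query, subject, *seeds[0], w=4))
```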
BLAST ….
It treats the whole database as one sequence and aligns the query sequence against it
The matches reported are high-scoring segment pairs (HSPs)

BLAST …..
[Figure: BLAST output with a low complexity region marked]

FASTA
Pairwise local alignment
Developed by David J. Lipman and William R. Pearson in 1985
It looks for identically matching words of a fixed length called ktup
It identifies a single high-scoring region
It compares each database sequence individually with the query sequence
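The ktup word search described above can be sketched by counting word hits per diagonal: hits that share the same offset between subject and query positions lie on one diagonal, and the diagonal with the most hits marks the candidate high-scoring region. This is a simplified illustration of the idea only; `best_diagonal` is an illustrative name and the later rescoring and joining steps of FASTA are not shown.

```python
from collections import Counter


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal_offset, hit_count) for the diagonal with most ktup hits.

    The offset is subject position minus query position; None if no word is shared.
    """
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1

    if not diagonals:
        return None
    return diagonals.most_common(1)[0]


# Each database sequence is compared with the query separately.
query = "MVLSAADK"
database = {"seq1": "GGMVLSAADK", "seq2": "PPPPPPPP"}  # hypothetical entries
for name, subject in database.items():
    print(name, best_diagonal(query, subject))
```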

FASTA ….
It aligns each database sequence individually with the query sequence
The E value is calculated differently from BLAST
E = Np
PROTEIN MATRICES
Two scoring matrices are shown, with the subject residue on the rows and the query residue on the columns
Identity matrix: 1 for identical residues, 0 for every other pair
PAM250 substitution matrix (lower triangle):
     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   12
S    0   2
T   -2   1   3
P   -1   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
N   -4   1   0  -1   0   0   2
D   -5   0   0  -1   0   1   2   4
E   -5   0   0  -1   0   0   1   3   4
Q   -5  -1  -1   0   0  -1   1   2   2   4
H   -3  -1  -1   0  -1  -2   2   1   4   3   6
R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M   -5  -1  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I   -3   0   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L   -6  -2  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V   -2   0   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W   -8  -5  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
GAPS AND PENALTIES
Constant penalty: a fixed cost per gap, whatever its length (usually 1)
Proportional penalty: depends on the length of the gap
Affine penalty: gap opening penalty + gap extension penalty
S = alignment score from the substitution matrix - total gap penalty
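The three penalty schemes and the score formula above can be compared on one small alignment. The sketch below makes several assumptions that are not in the slides: a match/mismatch score of +1/-1 stands in for the substitution matrix, the default penalty values are arbitrary, and the affine scheme charges the opening penalty for the first gapped residue and the extension penalty for each further one (other tools define this slightly differently).

```python
import re


def gap_penalty(length, scheme, constant=1, per_residue=1, opening=2, extension=1):
    """Penalty for one gap of `length` residues under the named scheme."""
    if length <= 0:
        return 0
    if scheme == "constant":
        return constant
    if scheme == "proportional":
        return per_residue * length
    if scheme == "affine":
        # Assumed convention: opening covers the first residue of the gap.
        return opening + extension * (length - 1)
    raise ValueError(f"unknown scheme: {scheme}")


def alignment_score(top, bottom, scheme, match=1, mismatch=-1):
    """S = score of the aligned residue pairs minus the total gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    pair_score = sum(
        match if x == y else mismatch
        for x, y in zip(top, bottom)
        if x != "-" and y != "-"
    )
    gap_runs = [len(run) for row in (top, bottom) for run in re.findall(r"-+", row)]
    return pair_score - sum(gap_penalty(n, scheme) for n in gap_runs)


for scheme in ("constant", "proportional", "affine"):
    print(scheme, alignment_score("ACGTTA", "AC--TA", scheme))
```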

RESTRICTION SITES
MULTIPLE ALIGNMENT
More than two sequences
Gaps are frequent
Usually a global alignment

WHY DO WE NEED MULTIPLE ALIGNMENT?

Searching for homology among the sequences
Characterizing protein families: conserved domains, promoters, etc.
Designing specific probes, degenerate primers, etc.
Required in protein modeling
Helps in predicting the secondary and tertiary structure of a new sequence
Input for constructing a phylogenetic tree

MULTIPLE ALIGNMENT ALGORITHMS

Hierarchical method (Clustal W)
Divide and conquer method
[Figure: sequences A, B, C, D and E]
MULTIPLE ALIGNMENT …….
[Figure: multiple alignment with gaps and a conserved region marked]
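Gaps and conserved positions in a finished multiple alignment can be marked column by column. The sketch below assumes the sequences are already aligned to equal length with `-` as the gap character; the marking convention (`*` for a fully conserved column, `-` for a column containing a gap) and the three short sequences are chosen for illustration.

```python
def column_summary(alignment):
    """Return one character per column: '*' conserved, '-' gapped, ' ' variable."""
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*alignment):
        if "-" in column:
            marks.append("-")
        elif len(set(column)) == 1:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


aligned = [
    "MVHLTPEEK",
    "MV-LSPADK",
    "MVHLSAEEK",
]
for sequence in aligned:
    print(sequence)
print(column_summary(aligned))
```

Runs of `*` in the summary line correspond to conserved regions, which are the natural targets for probes and primers.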
CONSERVED DOMAIN SEARCH
Conserved domain
About 20% of the sequence, at the C-terminal end, is missing from the BLAST alignment
SOFTWARE AVAILABLE
Clustal W / X
BioEdit
QAlign
CLC Free Workbench
GeneTool
Vector NTI
NCBI server
EMBL server
PHYLOGENETIC ANALYSIS
Sequences should be correct and originate from the specified source
Sequences should be homologous
Each position in an alignment should be homologous with the same position in every other sequence of that alignment
Sequences should not be contaminated, e.g. by mixing nuclear and organelle genome sequences

PHYLOGENETIC ANALYSIS….
Tree building methods
Distance methods: UPGMA, NJ (neighbour joining)
Character-based methods: maximum parsimony, maximum likelihood
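Distance methods start from a matrix of pairwise distances between aligned sequences. The sketch below computes the simplest such measure, the p-distance (fraction of differing sites, ignoring gapped columns), and picks the closest pair, which is the pair a clustering method such as UPGMA would join first. The three sequences and the function names are illustrative; no correction for multiple substitutions is applied.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    return {
        (p, q): p_distance(aligned[p], aligned[q])
        for p in aligned
        for q in aligned
    }


def closest_pair(aligned):
    """Return the pair of names with the smallest distance."""
    matrix = distance_matrix(aligned)
    return min(combinations(aligned, 2), key=lambda pair: matrix[pair])


aligned = {  # hypothetical aligned sequences
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACCTTCGA",
}
print(distance_matrix(aligned))
print(closest_pair(aligned))
```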

NEIGHBOUR JOINING METHOD

[Figure: neighbour-joining tree with tips labelled P, G, S, L, H, Ao and At]
ORF SEARCHING
Molecular biology background
An ORF has the following features
Initiation codon
Stop codon
Intron boundaries
Defined codon usage
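A site-based search for the first two features (initiation codon and stop codon) can be sketched as a scan of all six reading frames. This is a simplified illustration that assumes an intron-free sequence and ignores codon usage; `find_orfs` and the `min_codons` threshold are illustrative, and coordinates for the minus strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (upper case)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=2):
    """Return (strand, frame, start, end) for each ATG...stop stretch found.

    start and end are 0-based, end-exclusive, and include the stop codon.
    """
    orfs = []
    for strand, sequence in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(sequence) - 2, 3):
                codon = sequence[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first initiation codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


dna = "CCATGGTGCATTAAGG"  # hypothetical sequence with one short ORF
for strand, frame, start, end in find_orfs(dna):
    print(strand, frame, start, end, dna[start:end])
```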
ORF FINDING ALGORITHMS
Content-based method
Site-based method
Comparative method
ORF FINDING ALGORITHMS ……
[Figure: ORF finder output for the β-hemoglobin gene, as text information and as a graphical view]
SOFTWARE AVAILABLE
GENSCAN
GeneTool
CLC Free Workbench

PROTEIN THREE DIMENSIONAL MODELING
Comparative modeling
Fold recognition
Ab initio prediction
COMPARATIVE PROTEIN MODELING
Start
Identify related structures
Select template
Align target sequence with template structure
Build model for target
Evaluate the model
Model OK? If NO, go back and repeat the earlier steps; if YES, end
COMPARATIVE PROTEIN MODELING
[Figure: bovine hemoglobin and human hemoglobin beta chain]

SOFTWARE AVAILABLE
Cn3D
Bioediter
DeepView / Swiss-PdbViewer

OTHER METHODS
Fold recognition
Also called protein threading
It uses a library of models
The model is constructed from the library information
Ab initio prediction
Uses thermodynamics and quantum mechanics
