
iabt

In silico Gene Analysis



Outline
Introduction

Alignment

ORF searching

3D protein modeling

Case study

INTRODUCTION
What is a gene?

A length of DNA that codes for a particular protein or, in certain cases, a functional or structural RNA molecule

What are the essential components of a gene?

Initiation codon

Introns and exons (in eukaryotes)

Stop codon

Regulatory sequences

INTRODUCTION ….

Essential features of a gene considered in in silico gene analysis

All DNA sequences are written with four bases: A, T, G and C

All proteins are built from 20 standard amino acids (written in one-letter code)

The initiation codon is almost always ATG

The stop codons are fixed: TAA, TAG and TGA

Intron boundaries follow the GU-AG rule (GT-AG in the DNA sequence)

Codon usage differs from organism to organism
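
To illustrate the last point, codon usage can be tallied by reading a coding sequence three bases at a time. The sketch below is only an illustration: the function name codon_usage is hypothetical, and the input is the first 30 bases of the beta hemoglobin coding sequence shown under REFERENCE SEQUENCE.

```python
from collections import Counter


def codon_usage(cds: str) -> Counter:
    """Count codons in a coding sequence, reading in frame from the first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# First 30 bases of the beta hemoglobin coding sequence from the slides.
usage = codon_usage("atggtgcatctgactcctgaggagaagtct")
print(usage.most_common(3))
```

Comparing such counts between organisms (on whole genomes rather than ten codons) is what reveals organism-specific codon preference.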



FILE FORMATS
FASTA format
>XM_414949 | Gallus gallus |alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF

IG (IntelliGenetics) format
; comment
; comment
XM_414949
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF1

GDE format
%XM_414949 | Gallus gallus |alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF

NBRF/PIR format
>P1;XM_414949
Gallus gallus | alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF*
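
As a rough illustration of how the simplest of these formats is read, the sketch below parses FASTA text into a dictionary. The function name parse_fasta is hypothetical, and the parser assumes well-formed input in which every header is unique.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # header line without the ">" marker
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


example = """>XM_414949 | Gallus gallus |alpha 2 globin
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYF
"""
records = parse_fasta(example)
print(records)
```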

ALIGNMENTS

The result of comparing two or more gene or protein sequences to determine their degree of base or amino acid similarity

ALIGNMENT

Pairwise alignment or multiple alignment

Local alignment or global alignment



REFERENCE SEQUENCE

>NG_000007 |chromosome 11| beta hemoglobin|Homo sapiens


atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacg
tggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagag
gttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaagg
ctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaag
ggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttc
aggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccac
cagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatc
actaa

>NG_000007 |chromosome 11| beta hemoglobin|Homo sapiens


MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMG
NPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCV
LAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
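
The protein above follows from the nucleotide sequence by reading codons with the standard genetic code. A minimal sketch is given below, assuming an in-frame coding sequence that contains only A, C, G and T (any other symbol raises a KeyError); the names CODON_TABLE and translate are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 30 bases of the reference sequence.
print(translate("atggtgcatctgactcctgaggagaagtct"))  # MVHLTPEEKS
```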

PAIRWISE ALIGNMENT

Two sequences are compared at a time

Sequences may be nucleotide-nucleotide or amino acid-amino acid

The alignment may be gapped or ungapped

Two algorithms: Smith-Waterman (local alignment) and Needleman-Wunsch (global alignment)

Ex: BLAST and FASTA (heuristic search tools built on local alignment)
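
The difference between the two algorithms can be sketched with a pair of small dynamic-programming functions that return only the optimal score. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGT", "AGT"))               # 2
print(local_score("TTTACGTAAA", "CCACGTCC"))     # 4
```

The only structural differences are the zero floor in each cell and taking the best cell anywhere in the table, which is what lets a local alignment ignore poorly matching flanks.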



BLAST (Basic Local Alignment Search Tool)

Pairwise local alignment

Developed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers and David Lipman (1990)

Stages in search

BLAST searches for short matches (words) of a fixed length W between the query and sequences in the database

Where the query and a database sequence share a common word, BLAST extends the match in both directions as an ungapped alignment

BLAST performs a gapped alignment between the query sequence and the database sequence
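
The first two stages can be sketched as a seed-and-extend procedure. The code below is a simplified illustration, not BLAST itself: it uses exact word matches only, a match/mismatch score of +1/-1, and a hypothetical drop parameter that stops extension once the running score falls that far below the best seen. The function names are also hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed at (i, j) in both directions without gaps.

    Returns (score, aligned query segment) for the best extension found.
    """
    def walk(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break  # score fell too far below the best; give up
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    segment = query[i - left_len:i + w + right_len]
    return w * match + left + right, segment


query, subject = "TTACGTGGA", "CCACGTGGC"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_ungapped(query, subject, *seeds[0]))  # (6, 'ACGTGG')
```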

BLAST ….

BLAST treats the whole database as one long sequence and aligns the query sequence against it

Hits are reported as high-scoring segment pairs (HSPs)



BLAST …..

Low complexity region



FASTA

Pairwise local alignment

Developed by David J. Lipman and William R. Pearson in 1985

It looks for identical matching words of length ktup

It identifies a single high-scoring region

It matches each individual database sequence with the query sequence
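
The ktup idea can be illustrated by counting identical words along each diagonal of the query-versus-subject comparison; the diagonal with the most word hits marks the most promising region. This is a sketch of that first step only, under the assumption that exact ktup matches are used, and the function name best_diagonal is hypothetical. FASTA's later rescoring and joining steps are not shown.

```python
from collections import Counter


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit_count) for the diagonal with the most ktup word hits.

    The diagonal is query_pos - subject_pos; None is returned if no word is shared.
    """
    index = {}
    for i in range(len(query) - ktup + 1):
        index.setdefault(query[i:i + ktup], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1
    return diagonals.most_common(1)[0] if diagonals else None


print(best_diagonal("GGACGTAC", "TTTTACGTAC"))  # (-2, 5)
```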



FASTA ….

It aligns each individual database sequence with the query sequence

The E value is calculated differently from BLAST

E = Np

PROTEIN MATRICES

Identity (unitary) matrix: each identical pair scores 1 and every other pair scores 0

PAM250 substitution matrix (lower triangle; rows are subject residues, columns are query residues)

   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C 12
S  0  2
T -2  1  3
P -3  1  0  6
A -2  1  1  1  2
G -3  1  0 -1  1  5
N -4  1  0 -1  0  0  2
D -5  0  0 -1  0  1  2  4
E -5  0  0 -1  0  0  1  3  4
Q -5 -1 -1  0  0 -1  1  2  2  4
H -3 -1 -1  0 -1 -2  2  1  1  3  6
R -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
K -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
I -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
V -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
Y  0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17

GAPS AND PENALTIES

Constant penalty: a fixed cost per gap, usually 1

Proportional penalty: depends on the length of the gap

Affine penalty: gap opening penalty + gap extension penalty

S = alignment score from the substitution matrix - total gap penalty
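
The formula for S can be made concrete by scoring an existing alignment under an affine gap penalty. The sketch below assumes one common convention, in which the first gap position costs gap_open and each further consecutive gap position costs gap_extend; the default penalty values, the function names and the use of the identity matrix for residue scores are illustrative assumptions. For simplicity, a gap in one row directly followed by a gap in the other row is treated as a single gap.

```python
def identity(x, y):
    """Identity (unitary) matrix: 1 for a match, 0 for a mismatch."""
    return 1 if x == y else 0


def alignment_score(row_a, row_b, substitution=identity, gap_open=10, gap_extend=1):
    """Score two aligned rows ('-' marks a gap) with an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score -= gap_extend if in_gap else gap_open  # open once, then extend
            in_gap = True
        else:
            score += substitution(x, y)
            in_gap = False
    return score


# Four matches (4) minus one gap of length two (2 + 1) gives 1.
print(alignment_score("AC--GT", "ACTTGT", gap_open=2, gap_extend=1))  # 1
```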



RESTRICTION SITES

MULTIPLE ALIGNMENT

More than two sequences are aligned together

Gaps are frequent

Usually built as a global alignment



WHY DO WE NEED MULTIPLE ALIGNMENT?

Homology searching between sequences

To characterize protein families: conserved domains, promoters, etc.

Designing special probes, degenerate primers, etc.

Required in protein modeling

Helps in predicting the secondary and tertiary structure of a new sequence

Input for constructing a phylogenetic tree
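
Several of these uses, such as finding conserved domains or designing primers, start from locating conserved columns in a multiple alignment. The sketch below shows one simple way to do that; the function name conserved_columns, the threshold parameter and the three toy sequences are assumptions made for illustration.

```python
from collections import Counter


def conserved_columns(alignment, threshold=1.0):
    """Return indices of gap-free columns whose most common residue
    reaches the given frequency threshold (1.0 means fully conserved)."""
    conserved = []
    for col, residues in enumerate(zip(*alignment)):
        if "-" in residues:
            continue  # skip columns containing a gap
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(residues) >= threshold:
            conserved.append(col)
    return conserved


# Toy alignment, not taken from the slides.
toy_alignment = ["MVHLT-EE", "MVHLTPEE", "MVQLTPDE"]
print(conserved_columns(toy_alignment))        # [0, 1, 3, 4, 7]
print(conserved_columns(toy_alignment, 0.6))   # [0, 1, 2, 3, 4, 6, 7]
```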



MULTIPLE ALIGNMENT ALGORITHMS


Hierarchical (progressive) method, e.g. Clustal W

Divide and conquer method

[Figure: example with sequences A, B, C, D and E]

MULTIPLE ALIGNMENT …….

[Figure: multiple alignment with gaps and a conserved region marked]

CONSERVED DOMAIN SEARCH

[Figure: conserved domain search result with the conserved domain marked]

Some of the sequence (about 20 %) at the C-terminal end is missing from the BLAST alignment

SOFTWARE AVAILABLE
Clustal W / Clustal X
BioEdit
QAlign
CLC Free Workbench

GeneTool
Vector NTI

NCBI server
EMBL server

PHYLOGENETIC ANALYSIS

Sequences should be correct and originate from the specified source

Sequences should be homologous

Each position in an alignment should be homologous across every sequence in that alignment

Sequences should not be contaminated, e.g. by mixing nuclear and organelle genomes



PHYLOGENETIC ANALYSIS….
Tree building methods

Distance methods: UPGMA, NJ (neighbour joining)

Character-based methods: maximum parsimony, maximum likelihood



NEIGHBOUR JOINING METHOD


[Figure: neighbour-joining tree with taxa labelled P, G, S, L, H, Ao and At]

ORF SEARCHING

Molecular biology background

An ORF has the following features

Initiation codon
Stop codon
Intron boundaries
Defined codon usage

ORF FINDING ALGORITHMS

Content-based method

Site-based method

Comparative method
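
The simplest site-based approach scans each reading frame for an ATG followed by an in-frame stop codon. The sketch below is a naive illustration under stated assumptions: it searches the forward strand only, ignores introns and codon usage, and the function name find_orfs and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start is the index of the A in ATG; end is one past the last base of the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


print(find_orfs("ccatggtgcattaagg"))  # [(2, 14, 2)]
```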
ORF FINDING ALGORITHMS ……

[Figure: ORF finder output for the β-hemoglobin gene, shown as text information and as a graphical view]

SOFTWARE AVAILABLE

GENSCAN

GeneTool

CLC Free Workbench



PROTEIN THREE DIMENSIONAL MODELING

Comparative modeling

Fold recognition

Ab initio prediction

COMPARATIVE PROTEIN MODELING

Start

Identify related structures

Select template

Align target sequence with template structure

Build model for target

Evaluate the model

Model OK? If yes, end; if no, repeat the cycle

COMPARATIVE PROTEIN MODELING

[Figure: bovine hemoglobin and human hemoglobin beta chain structures]



SOFTWARE AVAILABLE

Cn3D

Bioediter

DeepView / Swiss-PdbViewer



OTHER METHODS
Fold recognition

Also called protein threading

It uses a library of models

The model is constructed from the information in this library

Ab initio prediction

Uses thermodynamics and quantum mechanics
