Bioinformatics
Lecturer: Nam Vo
Email: vtnam@[Link]
2
Contents
Biological databases
Sequence comparisons and sequence-based
database searches
Protein structures and structure-based
rational drug design
3
Research fields in bioinformatics?
Create databases for storing and
managing biological data sources
Introduce / Develop algorithms and
statistical methods to determine the
biological relationship between data
Use bioinformatics tools to analyze
and interpret biological data sources
4
What is a database?
5
Database
A database is an organized collection of data, generally
stored and accessed electronically from a computer
system.
Biological databases are libraries of life sciences
information, collected from scientific experiments,
published literature, high-throughput experiment
technology, and computational analysis. They contain
information from research areas including genomics,
proteomics, metabolomics, microarray gene expression,
and phylogenetics.
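To make the definition concrete, here is a minimal sketch of an organized, electronically accessed collection of sequence records, using Python's built-in sqlite3 module. The table layout, the accessions and the sequences are made up for illustration and do not come from any real biological database.

```python
import sqlite3

# In-memory database; table layout and both records are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, seq TEXT)"
)
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("TOY00001", "Escherichia coli", "ATGGCTAGCTAG"),
        ("TOY00002", "Homo sapiens", "ATGCCCGGGTTTAAA"),
    ],
)


def fetch_by_accession(accession):
    """Return (organism, sequence) for one accession, or None if absent."""
    return conn.execute(
        "SELECT organism, seq FROM sequences WHERE accession = ?", (accession,)
    ).fetchone()


print(fetch_by_accession("TOY00002"))
```

The primary key on the accession column mirrors the idea that each entry is stored once and looked up by a unique identifier.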
6
Biological Data
Sequences, Structures, Genomes, Transcriptomes, Networks and systems, Genetics and populations
Databases, Data-mining, Algorithms, Statistics, Physics and chemistry, Phylogenetic trees
13
Biological database
operators
NCBI (National Center for Biotechnology Information)
EMBL-EBI (European Molecular Biology Laboratory, European Bioinformatics Institute)
DDBJ (DNA Data Bank of Japan)
14
Categories
Based on kind of data they retrieve:
Primary databases
Secondary databases
Based on category of data they include:
Literature databases
Taxonomy database
Nucleotide sequence database
Genome database
Genotype-Phenotype database
Protein sequence database
Structure database
…
15
Primary and secondary databases
Primary databases:
• Primary sequence information
• Sequence annotation information
Secondary databases:
• Summarize the results from analyses of primary sequence databases
• Other databases (literature)
16
Primary databases
Nucleotide sequence databases
Protein sequence databases
Protein structure databases
17
Nucleotide sequence databases
GenBank (NCBI, NIH): retrieval through Entrez
ENA (EMBL-EBI): retrieval through ENA online retrieval
DDBJ (NIG): retrieval through getentry
Each database accepts uploads and synchronizes its records with the other two
18
GenBank format
19
Accession number
Unique
Permanent
The only way to absolutely verify the identity of a
sequence or database entry
Used in all 3 databases (GenBank, ENA and DDBJ)
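As an illustration of where the accession number sits in a GenBank-style flat file, the sketch below pulls the ACCESSION, VERSION and sequence out of one record. The record is a made-up toy (accession TOY00001 is not a real entry), and the parser handles only the few line types shown here, not the full format.

```python
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   TOY00001
VERSION     TOY00001.1
ORIGIN
        1 atggctagct agctagctag ctaa
//
"""


def parse_flat_record(text):
    """Extract accession, version and sequence from one GenBank-style record."""
    fields = {"accession": None, "version": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            fields["sequence"] += "".join(line.split()[1:])
        elif line.startswith("ACCESSION"):
            fields["accession"] = line.split()[1]
        elif line.startswith("VERSION"):
            fields["version"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return fields


record = parse_flat_record(TOY_RECORD)
print(record["accession"], record["version"], len(record["sequence"]))
```

For real records, an established parser such as the one in Biopython is a safer choice than a hand-written one.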
20
ENA format
21
Entrez
22
ENA online retrieval
23
ARSA
24
Protein sequence databases
Huge data:
Sequence information
Expression profiles
Secondary structures
Biological functions
Biochemical functions
UniProt (Universal Protein Resource)
NCBI protein database
25
UniProt
Unifies information from 3 databases:
Swiss-Prot
TrEMBL
Protein Information Resource (PIR)
Consists of 3 parts:
UniProt Knowledgebase (UniProtKB)
UniProt Reference Clusters Database (UniRef)
UniProt Archive (UniParc)
26
UniProtKB
Data: Protein sequences and annotations
2 sections:
UniProtKB/TrEMBL: automatically annotated sequences (65 million entries)
UniProtKB/Swiss-Prot: manually curated and annotated sequences (0.55 million entries)
28
UniRef
Nonredundant sequence database
Used for fast similarity searches
3 versions:
UniRef100
UniRef90
UniRef50
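The three versions differ in the sequence identity used to group entries: UniRef100 merges identical sequences and subfragments, while UniRef90 and UniRef50 cluster at 90% and 50% identity. The toy sketch below shows how the number of clusters shrinks as the threshold drops. It assumes equal-length sequences compared position by position and a simple greedy strategy; the real UniRef clusters are built with dedicated clustering software on aligned sequences, so this is only an illustration of the idea.

```python
def percent_identity(a, b):
    """Identity between two equal-length sequences (no alignment step)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_cluster(sequences, threshold):
    """Put each sequence in the first cluster whose representative it matches."""
    clusters = {}  # representative sequence -> list of members
    for seq in sequences:
        for rep in clusters:
            if len(rep) == len(seq) and percent_identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Made-up peptides, chosen so that the thresholds give different groupings.
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTWWWAKQK", "GGGGGGGGGG"]
for threshold in (100, 90, 50):
    print(threshold, len(greedy_cluster(toy, threshold)))
```

Fewer, larger clusters mean fewer sequences to search, which is why the reduced sets speed up similarity searches.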
29
Protein Data Bank (PDB)
Founded at Brookhaven National Laboratory in 1971
Database of experimentally determined 3D structures of biological macromolecules (X-ray crystallography, NMR, cryo-electron microscopy)
30
PDB website
32
Protein structure by NMR
33
Cryo-electron microscopy
34
PDB format - ATOM
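ATOM records in the PDB format are fixed-width: each field lives in a defined column range. The sketch below assembles one sample record field by field and reads it back by column slicing. The column ranges follow the published PDB file format; the coordinate values and the function name parse_atom_line are made up for illustration.

```python
# One ATOM record assembled field by field; the numeric values are made up.
SAMPLE_LINE = "".join([
    "ATOM  ",    # cols 1-6   record name
    "    1",     # cols 7-11  atom serial number
    " ",         # col 12     blank
    " N  ",      # cols 13-16 atom name
    " ",         # col 17     alternate location indicator
    "MET",       # cols 18-20 residue name
    " ",         # col 21     blank
    "A",         # col 22     chain identifier
    "   1",      # cols 23-26 residue sequence number
    " " * 4,     # col 27 insertion code, cols 28-30 blank
    "  27.340",  # cols 31-38 x coordinate (angstroms)
    "  24.430",  # cols 39-46 y coordinate
    "   2.614",  # cols 47-54 z coordinate
    "  1.00",    # cols 55-60 occupancy
    "  9.67",    # cols 61-66 temperature factor
    " " * 10,    # cols 67-76 blank
    " N",        # cols 77-78 element symbol
])


def parse_atom_line(line):
    """Read the main fields of an ATOM record by fixed column positions."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE_LINE)
print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace instead of slicing by column is tempting but fails on real files, where adjacent fields can touch without a space between them.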
35
Secondary databases
Literature databases
Motif
Protein family
Protein classification
Genotype-Phenotype database*
…
36
PubMed
37
PROSITE
38
Sequence Motif
Short sequence regions (10-20 amino acids) that are conserved in
related proteins
Usually have a key role in the protein’s function
Provide a first hint of an affiliation to a protein family or function
Derived from multiple alignments
Example:
[GSTNE] – [GSTQCR] – [FYW] – {ANW} – x (2) – P
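How to read the example: square brackets list the residues allowed at a position, curly braces list the residues forbidden there, x stands for any residue, and a number in parentheses repeats the preceding element. The sketch below translates such a pattern into a Python regular expression. The function name prosite_to_regex is made up for illustration, and the sketch covers only the elements used on this slide (it ignores the < and > anchors of the full PROSITE syntax).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    # Normalise slide typography: en dashes and spaces between elements.
    cleaned = pattern.replace("–", "-").replace(" ", "").rstrip(".")
    parts = []
    for element in cleaned.split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core.lower() == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ABC] groups and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[GSTNE] – [GSTQCR] – [FYW] – {ANW} – x (2) – P")
print(regex)  # [GSTNE][GSTQCR][FYW][^ANW].{2}P

# Made-up peptide containing one occurrence of the motif.
print(re.search(regex, "AAGSFKLLPAA").group())  # GSFKLLP
```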
39
Sequence Motif
Motif:
M – [ATN] – T – K – {WMC} – x (2) – P
Which of the following sequences satisfy the
motif?
A. MTTKARP
B. MNTKCARP
C. MTTKRWMP
D. AMNTKNWMP
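One way to check an answer is to translate the motif by hand into a regular expression and test each candidate. This is a minimal sketch, assuming the motif may occur anywhere in the sequence (re.search); to require that the motif spans the whole sequence, swap in fullmatch. The helper name matching_candidates is made up for illustration.

```python
import re

# M – [ATN] – T – K – {WMC} – x (2) – P, translated by hand:
# {WMC} becomes a negated class, x (2) becomes two arbitrary residues.
MOTIF = re.compile(r"M[ATN]TK[^WMC].{2}P")

candidates = {
    "A": "MTTKARP",
    "B": "MNTKCARP",
    "C": "MTTKRWMP",
    "D": "AMNTKNWMP",
}


def matching_candidates(motif, sequences):
    """Return the labels of all sequences that contain the motif."""
    return [label for label, seq in sequences.items() if motif.search(seq)]


print(matching_candidates(MOTIF, candidates))
```

Counting positions first is a quick sanity check: the motif needs 8 residues, so any shorter candidate cannot match.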
40
Pfam
Classify protein families based on profiles
Profile: pattern that evaluates the probability of the
appearance of a given amino acid, an insertion, or a
deletion at every position in a protein sequence
Search results should be reviewed
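The per-position probabilities in a profile can be illustrated with a small frequency matrix built from an alignment. This is a simplified sketch: it models only residue probabilities per column, with a pseudocount and a uniform background, and leaves out the insertion and deletion states that the profile hidden Markov models used by Pfam also include. The toy alignment and the function names are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities from an ungapped, equal-length alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, sequence, background=1 / len(AMINO_ACIDS)):
    """Sum of log2(column probability / background) over all positions."""
    return sum(
        math.log2(column[aa] / background) for column, aa in zip(profile, sequence)
    )


# Made-up alignment of four short peptides.
toy_alignment = ["MKTAY", "MKSAY", "MRTAF", "MKTGY"]
profile = build_profile(toy_alignment)

family_like = log_odds_score(profile, "MKTAY")
unrelated = log_odds_score(profile, "WWWWW")
print(family_like > unrelated)
```

A sequence that resembles the family scores above the background and an unrelated one scores below it. Borderline scores are the reason search results should be reviewed.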
41
Pfam
42
PRINTS
43
InterPro
Integrated Resource of Protein Families, Domains and
Sites
Merges Swiss-Prot, TrEMBL, PROSITE, Pfam, PRINTS,
ProDom, SMART, TIGRFAMs
44
SCOP
SCOP – Structural Classification of Proteins
Compare retrieved protein structural
organization with that of known proteins to
predict functions
Classifications:
Families: clear evolutionary relationship (>30% sequence identity)
Superfamilies: low sequence identity, but a probable common evolutionary origin
Folds: The same arrangement of secondary structural
elements in the same topology
Class: based on Secondary structural elements (SSE)
45
1hbg: 7 helices    1who: 8 strands
46
1auz: α/β    1c54: α+β
47
CATH
Compare retrieved protein structural organization with
that of known proteins to predict functions
4 categories:
Class: proportion of secondary structural elements (SSE)
Mainly Alpha
Mainly Beta
Alpha-beta
Few SSE
Architecture: arrangement of SSE
Topology: overall shape and interconnections of SSE
Homologous superfamily: homologous protein domains
Sequence families: ≥35% sequence identity over 60% of the length
48
Motif - domain
49
PubChem
PubChem Compound
PubChem Substance
PubChem BioAssay
50
1. What is the role of databases in biological research?
2. Is it necessary to create more databases?
3. How do you choose a database for collecting certain data?
51
Take home lessons
What a database is and why it is necessary
Distinguish between primary and secondary
databases
Know some famous databases
Read sequence motifs
Distinguish between motifs and domains