Bioinformatics: Databases and Analysis

Bioinformatics

Lecturer: Nam Vo
Email: vtnam@[Link]
2
Contents

• Biological databases
• Sequence comparisons and sequence-based database searches
• Protein structures and structure-based rational drug design
3
Research fields in bioinformatics?

• Create databases for storing and managing biological data sources
• Introduce or develop algorithms and statistical methods to determine the biological relationships between data
• Use bioinformatics tools to analyze and interpret biological data sources
4
What is a database?
5
Database

• A database is an organized collection of data, generally stored and accessed electronically from a computer system.
• Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experimental technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
6
Biological Data

[Figure: types of biological data (sequences, structures, genomes, transcriptomes, networks and systems, genetics and populations) and the fields used to study them (databases, data mining, algorithms, statistics, physics and chemistry, phylogenetic trees)]
13
Biological database operators
• NCBI (National Center for Biotechnology Information)
• EMBL-EBI (European Molecular Biology Laboratory, European Bioinformatics Institute)
• DDBJ (DNA Data Bank of Japan)
14
Categories
• Based on the kind of data they hold:
  • Primary databases
  • Secondary databases
• Based on the category of data they include:
  • Literature databases
  • Taxonomy databases
  • Nucleotide sequence databases
  • Genome databases
  • Genotype-phenotype databases
  • Protein sequence databases
  • Structure databases
  • …
15
Primary and secondary databases
Primary databases:
• Primary sequence information
• Sequence annotation information
Secondary databases:
• Summarize the results from analyses of primary sequence databases
• Other databases (literature)
16
Primary databases

• Nucleotide sequence databases
• Protein sequence databases
• Protein structure databases
17
Nucleotide sequence databases

[Figure: the three nucleotide sequence databases upload new entries and synchronize with one another. GenBank (NCBI, NIH) is searched through Entrez, ENA (EBI, EMBL) through ENA online retrieval, and DDBJ (NIG) through getentry.]
18

GenBank format
19
Accession number

• Unique
• Permanent
• The only way to absolutely verify the identity of a sequence or database entry
• Used in all 3 databases (GenBank, ENA and DDBJ)
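
As a rough illustration of where the accession number lives, the sketch below reads a record in GenBank flat-file layout. It assumes the usual structure: keyword lines such as LOCUS, ACCESSION and VERSION, then ORIGIN, the sequence lines, and a closing //. The record TOY00001 and the helper parse_flat_file are invented for this example; they are not a real database entry or a library function.

```python
# A made-up, heavily shortened record in GenBank flat-file layout (not a real entry).
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy sequence for illustration.
ACCESSION   TOY00001
VERSION     TOY00001.1
ORIGIN
        1 atggcgacca agtgctgacc gtaa
//
"""


def parse_flat_file(text):
    """Return (fields, sequence) from a GenBank-style flat file (toy parser)."""
    fields = {}
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: drop the leading position number, keep the bases.
            seq_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # Keyword line: the keyword is the first word, the rest is its value.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields, "".join(seq_parts)


fields, sequence = parse_flat_file(TOY_RECORD)
print(fields["ACCESSION"], fields["VERSION"], len(sequence))
```

The accession identifies the entry, while the VERSION line adds a suffix that changes when the sequence itself is updated. For real records a maintained parser is a safer choice than this toy.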
20

ENA format
21
Entrez
22

ENA online retrieval
ARSA
24
Protein sequence databases

• Large volumes of data:
  • Sequence information
  • Expression profiles
  • Secondary structures
  • Biological functions
  • Biochemical functions
• UniProt (Universal Protein Resource)
• NCBI protein database
25
UniProt

• Unifies information from 3 databases:
  • Swiss-Prot
  • TrEMBL
  • Protein Information Resource (PIR)
• Consists of 3 parts:
  • UniProt Knowledgebase (UniProtKB)
  • UniProt Reference Clusters (UniRef)
  • UniProt Archive (UniParc)
26
UniProtKB

• Data: protein sequences and annotations
• 2 sections:
  • UniProtKB/TrEMBL: automatically annotated sequences (65 million entries)
  • UniProtKB/Swiss-Prot: manually curated and annotated sequences (0.55 million entries)
27
28
UniRef

• Nonredundant sequence database
• Used for fast similarity searches
• 3 versions:
  • UniRef100
  • UniRef90
  • UniRef50
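
The numbers in the three version names refer to the sequence identity level (100%, 90%, 50%) at which entries are clustered. The toy sketch below only illustrates that idea: it assumes equal-length sequences compared position by position, with a simple greedy assignment to the first matching representative. The real UniRef clustering procedure is more involved, and the names identity, greedy_cluster and the toy sequences are all hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence into the first cluster whose representative it matches."""
    clusters = {}  # representative sequence -> list of members
    for s in seqs:
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]  # no representative was close enough
    return clusters


# Invented sequences, chosen only so the three thresholds behave differently.
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTGYLSKHK", "GGGGGGGGGG"]
counts = {name: len(greedy_cluster(toy, t))
          for name, t in [("100", 1.0), ("90", 0.9), ("50", 0.5)]}
print(counts)
```

Lower thresholds merge more sequences into fewer clusters, which is why the smaller sets are faster to search.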
29
Protein Data Bank (PDB)

• Founded at Brookhaven National Laboratory in 1971
• Database of experimentally determined 3D structures of biological macromolecules, solved by X-ray crystallography, NMR, or cryo-electron microscopy
30
PDB website
31
32
Protein structure by NMR
33
Cryo-electron microscopy
34
PDB format - ATOM
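
An ATOM record is a fixed-width line, so its fields are read by column position rather than by splitting on spaces. The sketch below assumes the standard column layout of the PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The coordinate values and the helper parse_atom_line are made up for illustration.

```python
# Illustrative ATOM line (values invented; columns follow the fixed-width PDB layout).
ATOM_LINE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"


def parse_atom_line(line):
    """Slice one ATOM record into named fields by column position."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


atom = parse_atom_line(ATOM_LINE)
print(atom)
```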
35
Secondary databases

• Literature databases
• Motif
• Protein family
• Protein classification
• Genotype-phenotype database*
• …
36
PubMed
37
PROSITE
38
Sequence Motif
• Short sequence regions (10-20 amino acids) that are conserved in related proteins
• Usually have a key role in the protein's function
• Provide a first hint of an affiliation to a protein family or function
• Derived from multiple alignments
• Example:
[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P
39
Sequence Motif
• Motif:
M-[ATN]-T-K-{WMC}-x(2)-P
• Which of the following sequences satisfy the motif?
A. MTTKARP
B. MNTKCARP
C. MTTKRWMP
D. AMNTKNWMP
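
One way to check the answer is to translate the motif into a regular expression. The sketch below assumes the usual PROSITE reading of the notation: [ABC] means any one of the listed residues, {ABC} means any residue except those listed, x means any residue, and (n) repeats the preceding element n times. It also assumes the motif may occur anywhere in the sequence. The helper prosite_to_regex is hypothetical and handles only these elements.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Split off an optional repeat count such as x(2) or [AC](2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except those listed
        # [ABC] groups and single letters are already valid regex.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("M-[ATN]-T-K-{WMC}-x(2)-P")
candidates = {"A": "MTTKARP", "B": "MNTKCARP", "C": "MTTKRWMP", "D": "AMNTKNWMP"}
# re.search lets the motif start at any position in the sequence.
matches = [label for label, seq in candidates.items() if re.search(regex, seq)]
print(regex, matches)
```

Working through the options by hand first, then running the check, makes it clear which position in each sequence breaks or satisfies the motif.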
40
Pfam

• Classifies protein families based on profiles
• Profile: a pattern that evaluates the probability of the appearance of a given amino acid, an insertion, or a deletion at every position in a protein sequence
• Search results should be reviewed
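
To make the idea of a profile concrete, the sketch below counts how often each residue (or a gap, written "-") appears in every column of a small multiple alignment. This is a deliberate simplification: Pfam itself builds profile hidden Markov models, which also model insertions and deletions explicitly. The alignment and the helper column_frequencies are invented for illustration.

```python
from collections import Counter

# Invented toy alignment; "-" marks a gap (a deletion relative to the longest row).
toy_alignment = [
    "MKT-AY",
    "MRT-AY",
    "MKTGAF",
    "MKS-AY",
]


def column_frequencies(alignment):
    """Return, for each column, the relative frequency of every symbol seen there."""
    n = len(alignment)
    return [
        {symbol: count / n for symbol, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


profile = column_frequencies(toy_alignment)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

A conserved column (position 1 here) gives one residue with frequency 1.0, while a variable column spreads the probability over several residues or gaps.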
41
Pfam
42
PRINTS
43
InterPro

• Integrated Resource of Protein Families, Domains and Sites
• Merges Swiss-Prot, TrEMBL, PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs
44
SCOP

• SCOP: Structural Classification of Proteins
• Compare the structural organization of a retrieved protein with that of known proteins to predict its function
• Classifications:
  • Families: clear evolutionary relationship (>30% identity)
  • Superfamilies: low identity
  • Folds: the same arrangement of secondary structural elements in the same topology
  • Class: based on secondary structural elements (SSE)
45

[Figure: 1hbg (7 helices) and 1who (8 strands)]


46

[Figure: 1auz (α/β) and 1c54 (α+β)]


47
CATH

• Compare the structural organization of a retrieved protein with that of known proteins to predict its function
• 4 categories:
  • Class: proportion of secondary structural elements (SSE)
    • Mainly alpha
    • Mainly beta
    • Alpha-beta
    • Few SSE
  • Architecture: arrangement of SSE
  • Topology: protein form and interconnections of SSE
  • Homologous superfamily: homology between protein domains
• Sequence families: ≥35% identity over 60% of the length
48
Motif - domain
49
PubChem

• PubChem Compound
• PubChem Substance
• PubChem BioAssay
50

1. What is the role of databases in biological research?
2. Is it necessary to create more databases?
3. How do you choose a database for collecting certain data?
51
Take home lessons

• What a database is and why it is necessary
• Distinguish between primary and secondary databases
• Know some well-known databases
• Read sequence motifs
• Distinguish between motifs and domains
