CS427 Bioinformatics
Biological Databases and Data Formats
Biological Databases Classification
1. Based on Data Type
2. Based on Scope
3. Based on Data Source
4. Based on Access and Usage
5. Based on Organizational Structure
Based on Data Type
• Nucleotide Sequence Databases: These databases store DNA and RNA
sequences.
• Examples: GenBank, EMBL, DDBJ.
• Protein Sequence Databases: These databases store protein
sequences.
• Examples: UniProt, Swiss-Prot, PIR, PRF, TrEMBL.
• Structure Databases: These contain three-dimensional structures of
biomolecules.
• Examples: Protein Data Bank (PDB), SCOP, CATH.
Based on Data Type (continued)
• Genomic Databases: These store whole-genome sequences and
annotations.
• Examples: Ensembl, UCSC Genome Browser.
• Expression Databases: These include data on gene expression, such as
microarray data.
• Examples: Gene Expression Omnibus (GEO), ArrayExpress.
• Pathway Databases: These focus on metabolic and signaling
pathways.
• Examples: KEGG, Reactome.
Based on Data Type (continued)
• Interaction Databases: These include protein-protein interactions,
gene regulatory networks, etc.
• Examples: BioGRID, STRING.
• Phenotype and Mutation Databases: These contain data related to
genetic mutations and their phenotypic consequences.
• Examples: OMIM, ClinVar, HGMD.
Based on Scope
• Primary Databases: These contain raw data submitted directly by
researchers.
• Examples: GenBank, EMBL, DDBJ, Swiss-Prot, PIR, PRF, PDB.
• Secondary Databases: These contain data derived from primary
databases, usually through analysis, annotation, or curation.
• Examples: UniProtKB (a combination of Swiss-Prot and TrEMBL), RefSeq, SCOP, CATH.
• Specialized Databases: These focus on a specific organism, biological
system, or type of data.
• Examples: FlyBase (Drosophila), WormBase (C. elegans).
Based on Data Source
• Curated Databases: These are manually curated by experts to ensure
high accuracy.
• Examples: Swiss-Prot, RefSeq.
• Non-curated Databases: These rely on automated data submissions
with minimal human intervention.
• Examples: TrEMBL, EST databases.
Based on Access and Usage
• Public Databases: These are freely accessible to everyone.
• Examples: NCBI, EMBL-EBI, DDBJ.
• Proprietary Databases: Access is restricted, often requiring a
subscription or payment.
• Examples: COSMIC, BioBase.
Based on Organizational Structure
• Flat File Databases: Store data in simple text files with a standardized
format.
• Example: GenBank flat files.
• Relational Databases: Organize data in tables that can be linked based
on relationships.
• Example: UniProtKB.
• Object-oriented Databases: Store data as objects, similar to object-
oriented programming.
• Example: Ensembl (relational storage accessed through an object-oriented API).
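To make the flat-file idea concrete, here is a minimal sketch of reading a GenBank-style record with the Python standard library only. The record `DEMO0001` is made up for illustration and is not a real GenBank entry, and the helper name `parse_genbank` is hypothetical. The sketch keeps only single-line keyword fields and the ORIGIN sequence; a real parser would also handle the FEATURES table and multi-line fields.

```python
# Hypothetical record, invented for illustration (not a real GenBank entry).
SAMPLE = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ACCESSION   DEMO0001
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank(text):
    """Return a dict of top-level keyword fields plus the sequence."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines carry a position number and space-separated
            # blocks of bases; keep only the letters.
            record["sequence"] += "".join(c for c in line if c.isalpha()).upper()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # Keywords start in the first column; indented continuation
            # lines are skipped in this simplified sketch.
            key, _, value = line.partition(" ")
            record[key.lower()] = value.strip()
    return record


record = parse_genbank(SAMPLE)
print(record["accession"], len(record["sequence"]), "bp")
```

The fixed layout (keyword in the first column, `ORIGIN` before the sequence, `//` at the end) is what lets a plain text file act as a database record.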
Bibliographic Databases
• PubMed
Most Commonly Used Databases
• GenBank by NCBI
• Go to the NCBI website and search the Nucleotide database for a gene (e.g., HBA1 or TP53)
• The records returned are drawn from GenBank
• EMBL
• Go to [Link] and search
• DDBJ
• UniProt
• PDB
Expasy
• The Translate tool: translates a nucleotide sequence into protein
in all six reading frames
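As a rough illustration of what a six-frame translation involves, the sketch below translates a DNA string in the three forward frames and the three frames of its reverse complement, using the standard genetic code. It is not Expasy's implementation; the function names `translate` and `six_frames` are hypothetical, and the input is assumed to contain only the letters A, C, G, and T.

```python
from itertools import product

# Standard genetic code, laid out in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

Frame +1 of `ATGGCCTAA` reads ATG, GCC, TAA, which is methionine, alanine, and a stop codon.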
Entrez
• Entrez (the Global Query Cross-Database Search System) is a
federated search engine, or web portal, that lets users search many
discrete health sciences databases at the National Center for
Biotechnology Information (NCBI) website.
• Text based search
• All information integrated: Structure, sequence, literature etc.
• Cross reference across databases
• The name is French for “come in”
• Supports Boolean operators (AND, OR, NOT)
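The Boolean search described above can also be expressed programmatically. The sketch below only builds a request URL for NCBI's E-utilities `esearch` endpoint and does not send it. The helper names `build_query` and `esearch_url` are hypothetical, and the field tags `[Gene Name]` and `[Organism]` are example choices rather than the only way to phrase the query.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*terms, operator="AND"):
    """Join search terms with a Boolean operator (AND, OR, or NOT)."""
    return f" {operator} ".join(terms)


def esearch_url(db, term, retmax=5):
    """Build (but do not send) an esearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


term = build_query("TP53[Gene Name]", "Homo sapiens[Organism]")
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

Opening the printed URL in a browser, or fetching it with `urllib.request`, should return the matching record IDs. Changing `db` to another Entrez database name runs the same query against that database.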
Ensembl
• Genome browser
• Demo
BioMart in Ensembl
• Converts between gene IDs and gene names
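Once a BioMart result has been exported as a tab-separated file, the ID-to-name conversion is a simple table lookup. The sketch below assumes an export with the column headers `Gene stable ID` and `Gene name`; check these against the actual export before relying on them. The two rows are a small hand-typed sample, and `load_mapping` is a hypothetical helper name.

```python
import csv
import io

# Small hand-typed sample in the assumed export layout (tab-separated).
EXPORT = (
    "Gene stable ID\tGene name\n"
    "ENSG00000141510\tTP53\n"
    "ENSG00000139618\tBRCA2\n"
)


def load_mapping(tsv_text):
    """Return (id_to_name, name_to_id) dictionaries from a TSV export."""
    id_to_name, name_to_id = {}, {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene_id, name = row["Gene stable ID"], row["Gene name"]
        id_to_name[gene_id] = name
        name_to_id[name] = gene_id
    return id_to_name, name_to_id


id_to_name, name_to_id = load_mapping(EXPORT)
print(name_to_id["TP53"])
print(id_to_name["ENSG00000139618"])
```

With a real export file, replace `io.StringIO(tsv_text)` with an open file handle.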
Motif
• A sequence motif is a nucleotide or amino-acid sequence pattern that
is widespread and usually assumed to be related to biological
function of the macromolecule.
• Explore motif databases
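A sequence pattern of this kind can be searched for with a regular expression. The sketch below converts a PROSITE-style pattern into a Python regex and scans a protein sequence with it. The pattern used is the N-glycosylation site `N-{P}-[ST]-{P}`; the test sequence is invented for illustration. The converter (`prosite_to_regex`, a hypothetical name) handles only the basic syntax: `x`, `[...]`, `{...}`, and repeat counts such as `x(2,4)`.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":  # any amino acid
            core = "."
        elif core.startswith("{"):  # any amino acid except those listed
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"
hits = find_motif(N_GLYCOSYLATION, "MNGSAANPTQ")  # made-up sequence
print(prosite_to_regex(N_GLYCOSYLATION))
print(hits)
```

In the example sequence, `NGSA` matches, while `NPTQ` does not because proline follows the asparagine. Note that `re.finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based search.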