HOMEWORK DAY 1
Problems and Solutions?
Note for HOMEWORK 1
Homework 1: DNA sequencing vs Protein sequencing
a. What is the difference between DNA sequencing and protein sequencing?
Answer 1?
Aspect | DNA sequence | Protein sequence
Definition | A series of deoxyribonucleotides | A series of amino acids
Building block | Deoxyribonucleotides | Amino acids
Different types of monomers | Four types of deoxyribonucleotides | Twenty different amino acids
Bonds between monomers | Phosphodiester bonds | Peptide bonds
Function | Mainly stores the genetic information used to make proteins in a cell | Important in the structure, function, and regulation of the body’s tissues and organs
Variety | In a given reading frame, one DNA sequence is translated into only one protein sequence | One protein sequence can be encoded by more than one DNA sequence
Deduce | The protein sequence can be deduced from it | The DNA sequence cannot be uniquely deduced from it
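The "Variety" and "Deduce" rows can be checked with a small calculation. The sketch below assumes the standard genetic code (written as the usual 64-letter string in TCAG codon order) and counts how many different coding DNA sequences could give the same peptide; the function name is only illustrative.

```python
from collections import Counter
from math import prod

# Standard genetic code, one letter per codon, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first, second, third base each cycling through T, C, A, G). '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of codons that encode each amino acid (for example L -> 6, M -> 1).
CODONS_PER_RESIDUE = Counter(AMINO)


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to this protein.

    Unknown letters count as 0 codons, so the result is 0 for invalid input.
    """
    return prod(CODONS_PER_RESIDUE[aa] for aa in protein.upper())
```

For the tripeptide MAK this gives 1 x 4 x 2 = 8 possible DNA sequences, which is why the DNA sequence cannot be uniquely deduced from the protein sequence.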
Answer 2?
DNA sequencing | Protein sequencing
Relies heavily on PCR primers, which work well for model species => proves difficult for non-annotated genomes | Is de novo, meaning it does not rely on a database => can sequence any protein of any isotype
Requires access to the intact original cell line => when the hybridoma is lost, DNA sequencing is no longer feasible | Uses the protein itself => can sequence without access to the original cell line or hybridoma
Is blind to post-translational modifications, which may have implications for protein functionality | Can objectively uncover post-translational modifications
Missing information: Principle and techniques?
- DNA sequencing: Traditional Sanger sequencing and next-generation sequencing
- Protein sequencing: two major direct methods (mass spectrometry & Edman
degradation using a protein sequenator (sequencer))
b. Why don't we sequence protein like we sequence DNA?
- If the protein sequence were deduced from genomic DNA, the DNA would include both
introns and exons, which would make the result inaccurate.
- The structural components and the nature of the sequencing process differ. DNA
sequencing relies on DNA polymerase and a primer, taking advantage of DNA replication.
Protein sequencing uses the protein itself, so the identity and position of each amino
acid must be determined directly.
- DNA sequencing is blind to post-translational modifications, which may have
implications for protein functionality. Protein sequencing can objectively uncover
post-translational modifications such as N-terminal pyroglutamate formation,
glycosylation sites, and deamidation.
Missing points:
- The technique lacks high-throughput capabilities
- Cost:
> Protein sequencing cost: $600 for the first 5 amino acids; $50 for each additional amino acid
> DNA sequencing cost: about $1,000 for a whole-exome sequence of the human genome (30 x 10^6 bp)
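To make the cost gap concrete, here is a rough calculation using only the prices quoted above. It assumes the $600 covers any run of up to 5 residues and that the exome price scales evenly per base; both are simplifications, and the function names are only illustrative.

```python
def protein_sequencing_cost(n_residues, first_block=5,
                            first_block_price=600.0, extra_price=50.0):
    """Cost in USD of sequencing n_residues amino acids at the quoted prices.

    Assumption: the first-block price is charged in full even for fewer
    than `first_block` residues.
    """
    if n_residues <= 0:
        return 0.0
    extra = max(0, n_residues - first_block)
    return first_block_price + extra * extra_price


def dna_cost_per_base(total_price=1000.0, bases=30e6):
    """Average cost in USD per base for the quoted whole-exome price."""
    return total_price / bases


if __name__ == "__main__":
    n = 100  # a hypothetical 100-residue protein
    print(f"Protein, {n} residues: ${protein_sequencing_cost(n):,.0f}")
    print(f"DNA, per base: ${dna_cost_per_base():.6f}")
```

At these prices a 100-residue protein costs $600 + 95 x $50 = $5,350, while each sequenced base of the exome costs a small fraction of a cent.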
Note for HOMEWORK 2
Figure out how the genes assigned to each of you are implicated in cancers and/or immunity
(File: Gene List.xlsx)
Requirements: get the following information about each of the 3 genes assigned to you
• Gene symbol, full name, reviewed by RefSeq
• Summary of its function
• Location on the human genome (based on GRCh38)
– e.g. chromosome, start, end, strand
• How this gene is related to cancer
– Get one open-access reference that is most relevant to cancers and/or immunity in your
opinion. Please list the article title, the authors, their institutions, publication year, journal
name.
• Any situations (mutations, over-expression, etc.) of this gene associated with other (non-cancer
and non-immune) diseases
• Extract DNA sequence of these genes and translate the DNA sequences in 3 frames, and
determine the reading frame which contains an open reading frame (ORF).
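The last requirement (three-frame translation and picking the frame with an ORF) can be sketched in a few lines. This is a minimal sketch, assuming the standard genetic code, the forward strand only, and an ORF defined as the stretch from the first M to the next stop codon; the function names and the short test sequence in the usage note are made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon; a trailing partial codon is ignored.

    Codons containing letters other than A, C, G, T are shown as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def three_frame_translation(dna):
    """Return {1: ..., 2: ..., 3: ...} with the translation of each forward frame."""
    return {frame + 1: translate(dna[frame:]) for frame in range(3)}


def longest_orf(dna):
    """Return (frame, peptide) for the longest M...stop stretch, or (None, '')."""
    best = (None, "")
    for frame, protein in three_frame_translation(dna).items():
        # Every piece except the last one is followed by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best
```

For the toy sequence CCATGGCCAAATAGGT, only frame 3 contains a start codon followed by an in-frame stop, so `longest_orf` returns frame 3 with the peptide MAK.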
Using NCBI RefSeqGene
https://www.ncbi.nlm.nih.gov/gene/?term=akt1
RefSeqGene - AKT1
• Gene symbol, full name, reviewed by RefSeq
• Summary of its function
• Location on the human genome (based on GRCh38)
e.g. chromosome, start, end, strand
• How this gene is related to cancer
– Get one open-access reference that is most relevant to cancers and/or
immunity in your opinion. Please list the article title, the authors, their
institutions, publication year, journal name.
• Any situations (mutations, over-expression, etc.) of this gene associated
with other (non-cancer and non-immune) diseases
From NCBI RefSeqGene to ClinVar
Extract DNA sequence of these genes and translate the DNA sequences in 3
frames, and determine the reading frame which contains an open reading
frame (ORF).
GenBank Record Fields
RefSeqGene - AKT1 transcript
Extract the DNA sequence of a transcript of the AKT1 gene
Searching for ORFs
a. Missing protocol
- Which program? Website?
- Parameters: strand? Initiation codons? Genetic code? Minimum ORF size?
b. Conclusion: which ORF should be chosen for further study?
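The parameters asked for above (strand, initiation codons, minimum ORF size) map directly onto the arguments of an ORF scan. The following is a minimal sketch, not the algorithm of any particular ORF-finding website: it assumes the standard stop codons, reports 0-based coordinates on the scanned strand, and all names and defaults are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def reverse_complement(dna):
    """Reverse complement of a DNA string containing A, C, G, T."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, start_codons=("ATG",), min_codons=3, both_strands=True):
    """List ORFs as (strand, frame, start, end, n_codons).

    start/end are 0-based, end-exclusive positions on the scanned strand;
    end includes the stop codon, n_codons does not.
    """
    orfs = []
    strands = [("+", dna.upper())]
    if both_strands:
        strands.append(("-", reverse_complement(dna.upper())))
    for strand, seq in strands:
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon in start_codons:
                    start = pos  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (pos - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame + 1, start, pos + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Reporting these settings together with the result makes the protocol reproducible, since changing `min_codons` or `start_codons` changes which ORFs are listed.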
Structure of a eukaryotic gene
How is gene structure determined?
• Experiments
– Reverse transcription PCR (RT-PCR) -> sequencing
– 5’ Rapid Amplification of cDNA Ends (5’ RACE) -> finding the 5’-most exon -> sequencing
– Transcriptome library -> single-pass sequencing
• Expressed sequence tags (EST)
• RNA-seq
• Computational prediction
How can a computer predict the gene structure?
Signal sites for transcription and translation elements.
Sequence homology to known genes/proteins.
Strategy: Splice site recognition
GT-AG rule
DONOR SPLICE SITE: splice site at the beginning of an intron (the intron's 5' end, on the left).
ACCEPTOR SPLICE SITE: splice site at the end of an intron (the intron's 3' end, on the right).
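The GT-AG rule can be illustrated by listing every GT (candidate donor) and AG (candidate acceptor) in a sequence and pairing them into candidate introns. This is only a sketch of the dinucleotide rule; real gene finders score the surrounding sequence context as well, and the minimum intron length used here is an arbitrary illustrative value.

```python
def splice_site_candidates(dna):
    """Return (donors, acceptors): 0-based positions of every GT and every AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(dna, min_len=6):
    """Pair each GT with each downstream AG.

    Returns (start, end) tuples, 0-based and end-exclusive, so that
    dna[start:end] begins with GT and ends with AG.
    """
    donors, acceptors = splice_site_candidates(dna)
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]
```

Even a 21-base toy sequence such as ATGGTAAGTCCCTTTCAGGCC yields two candidate introns, which shows why the GT-AG rule alone produces many false positives and has to be combined with other signals.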
Programs for gene prediction
geneid: https://genome.crg.es/software/geneid/geneid.html
- Available organisms: Homo sapiens (human), Drosophila melanogaster (fruit fly), Tetraodon nigroviridis (puffer fish), Oryza sativa (rice), …
GenScan: http://hollywood.mit.edu/GENSCAN.html
- Available organisms: vertebrates, Arabidopsis, maize
Augustus: http://bioinf.uni-greifswald.de/augustus/submission.php
- Available organisms: animals, alveolata, plants and algae, fungi, bacteria, archaea
Other genefinders: FGENESH, GRAIL, GLIMMERM, GENEID, GENEFINDER, GENEMARK, …
EXERCISE BREAK
Exploring ab initio gene prediction
1. Extract the FASTA sequence of the genomic region of the AKT1 gene (NCBI Reference
Sequence: NG_012188.1)
2. Predict the gene structure of this DNA sequence
- Search for signals of the first exon with geneid: select acceptors, donors, start and stop
codons, then look for them in the real annotation of the sequence
- Search for exons using both geneid and GenScan or Augustus (that is, with at least two gene
prediction programs)
> Select All exons and try to find the real ones
> Find the gene
> Compare the predicted gene with the gene in the GenBank record from NCBI
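Step 1 of the exercise can also be scripted. The sketch below builds an NCBI E-utilities `efetch` URL for the accession given above and parses the returned FASTA text; it assumes network access and that the first record in the response is the one wanted, and the helper names are illustrative. Downloading the FASTA file by hand from the NCBI record page works just as well.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """URL that asks the nucleotide database for one record in FASTA format."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        elif line and header is not None:
            chunks.append(line)
    return header, "".join(chunks).upper()


def fetch_fasta(accession):
    """Download and parse one FASTA record (requires network access)."""
    with urlopen(efetch_fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("ascii"))


# Example (not run here): header, sequence = fetch_fasta("NG_012188.1")
```

The resulting sequence string can be pasted into geneid, GenScan, or Augustus, or passed to any local analysis.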
One gene
=> multiple (alternatively spliced) transcripts
=> multiple proteins (with distinct functions)
http://commons.wikimedia.org/wiki/File:Transformer_splicing.gif
Browsing genes and genomes with Ensembl
Contents
• Introduction to Ensembl database and browser
• EXERCISE: A light exploration of the Ensembl genome browser with the AKT1 gene
NCBI databases are not the ultimate solution to the knowledge of genomes
Introduction
Why do we need/have genome browsers? So many!
The Human Genome Project (HGP)
• Draft
– Published on June 26, 2000
– Coverage: 90 %
– Error rate: 1 %
• Finished
– Published in 2003
– Coverage: > 99 %
– Error rate: 0.01 %
Anything new for the human genome?
Once, nearly everyone believed that only 3% of the human genome was functional:
• 1.5% protein-coding regions
• 1.5% regulatory elements
• 97% junk DNA
Nature (2001), 409(6822): 860-921
This is no longer true since the Encyclopedia of DNA Elements (ENCODE) Consortium found new evidence.
The truth is that what we do not know is much more than what we know…
Non-coding RNA: It’s Not Junk
• ~75% (about 3/4) of the human genome can be transcribed …, functionally unknown!
• >20,000 non-coding RNAs, functionally unknown!
Djebali, S., et al. (2012). "Landscape of transcription in human cells." Nature
489 (7414): 101-108.
Genomic sequences must be annotated with functions
Human Genome Project
GRCh38.p4 (June 29, 2015)
Annotation of gene structures
Reference genome
Advanced annotation
Population variations
Gene regulation Pathways
Variation and diseases
The Ensembl project
• The goal of Ensembl was to automatically annotate
the genome, integrate this annotation with other
available biological data and make all this publicly
available via the web (since 1999).
www.ensembl.org
Ensembl Features
EXERCISE BREAK
Exercise 2: A light exploration of the Ensembl genome browser with the AKT1 gene
- Extract genomic information from Ensembl:
Gene ID, gene name, Ensembl gene ID (gene stable ID), NCBI gene ID, UniProt/Swiss-Prot ID
- What is the description of this gene? Where is it located in the genome?
- How many contigs cover the gene region?
- Is the AKT1 gene on the forward strand or the reverse strand?
- How many transcripts are annotated for AKT1? How many of them code for protein?
- Are there SNPs or other variants within the gene of interest? Which SNPs are found in my
gene, and are they located in introns, promoters, or exons?
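Several of these questions (stable ID, description, location, strand, number of transcripts) can also be answered programmatically. The sketch below assumes the public Ensembl REST service at rest.ensembl.org and its `lookup/symbol` endpoint with `expand=1`; the field names used in `summarize_gene` follow the JSON that endpoint is expected to return and should be checked against the current REST documentation. Contigs and variants are not covered here.

```python
import json
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """URL for a gene lookup by symbol, with transcripts included (expand=1)."""
    return (f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}"
            "?expand=1;content-type=application/json")


def summarize_gene(record):
    """Condense a lookup record (a dict) into the answers asked for in the exercise."""
    transcripts = record.get("Transcript", [])
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    strand = {1: "forward", -1: "reverse"}.get(record.get("strand"), "unknown")
    return {
        "gene_id": record.get("id"),
        "name": record.get("display_name"),
        "description": record.get("description"),
        "location": f'{record.get("seq_region_name")}:'
                    f'{record.get("start")}-{record.get("end")}',
        "strand": strand,
        "n_transcripts": len(transcripts),
        "n_protein_coding": len(coding),
    }


def fetch_gene_summary(symbol):
    """Query Ensembl and summarize the result (requires network access)."""
    with urlopen(lookup_url(symbol)) as response:
        return summarize_gene(json.load(response))


# Example (not run here): fetch_gene_summary("AKT1")
```

Comparing this summary with what the browser shows is a quick check that the gene page was read correctly.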
HOMEWORK Day 2
- Revise your Homework 2 from Day 1.
- Extract the FASTA sequence of the genomic region of your genes (from
Homework Day 1) and predict the gene structure of these DNA sequences using
one gene prediction program. Summarize the exons and introns from your
prediction, and write your observations and conclusion.
- Find transcript information about a specific gene using NCBI and Ensembl,
and compare it with the prediction from the gene prediction program.
- Explore the genomic information of your genes (from Homework Day 1) using
Ensembl (see Exercise 2 for details).
- Between Ensembl and NCBI, which one would you prefer when searching for
information on human genes? Why?
DEADLINE: 10am Thursday 15th 2021
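For the comparison between predicted and annotated exons, a simple exon-level tally is often enough. The sketch below assumes each exon is given as a (start, end) pair in the same coordinate system and counts only exact matches; the function name and the coordinates in the usage note are hypothetical.

```python
def compare_exons(predicted, annotated):
    """Compare two lists of (start, end) exons by exact coordinate match."""
    predicted, annotated = set(predicted), set(annotated)
    exact = predicted & annotated
    return {
        "exact": sorted(exact),                 # predicted and annotated
        "missed": sorted(annotated - predicted),  # annotated but not predicted
        "extra": sorted(predicted - annotated),   # predicted but not annotated
        # Fraction of annotated exons that were predicted exactly.
        "sensitivity": len(exact) / len(annotated) if annotated else 0.0,
        # Fraction of predicted exons that match an annotated exon exactly.
        "precision": len(exact) / len(predicted) if predicted else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    predicted = [(100, 200), (300, 400), (500, 650)]
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    for key, value in compare_exons(predicted, annotated).items():
        print(key, value)
```

An exon that overlaps the annotation but has one wrong boundary counts as both "missed" and "extra" here, which is worth mentioning in the written observations.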
Sequencing Primary data
ORF finder Gene prediction
Take-home message?
NCBI Ensembl
END