Introduction to Bioinformatics
Microarrays1: Microarray Technology
Course 341
Department of Computing
Imperial College, London
Moustafa Ghanem
Aims for the 2nd part of Course
Microarray Bioinformatics
Appreciate the bigger picture of bioinformatics
Bioinformatics is more than nucleotide sequence analysis
Functional Genomics and Drug Discovery
Understand basic microarray technology and its use in gene
expression analysis.
Learn basic data analysis methods and how to apply them in the
analysis of gene expression data
Data Clustering
Data Classification
Statistical Analysis
Recommended Texts
For this part of the course
General overview of microarray data analysis
Lecture Notes
Handouts
Microarray Gene Expression Data Analysis: A Beginner's
Guide (Causton, Quackenbush and Brazma)
Microarray Bioinformatics (Stekel)
Data Mining
Data Mining: Concepts and Techniques (Han)
Microarray Technology
Lecture Overview
Aims, Motivation and Overview of 2nd Part of Course
Biology Background
Basic Idea of Microarrays
Types of Microarray technologies and how they work
Outputs of Microarrays
Image Analysis required to transform output to gene
expression matrices
Generating Gene Expression Matrices
Background
Functional Genomics
Functional Genomics:
Systematic analysis of gene activity in healthy and diseased tissues.
The study of obtaining an overall picture of genome functions, including the
expression profiles at the mRNA level and the protein level.
Functional Genome Analysis:
Used to understand the functions of genes and proteins in an organism. This is
typically known as genome annotation.
Used in integrative biology and systems biology studies aiming to understand
health and disease states (e.g. cancer, obesity, etc.)
Used as an important step in the search for new target molecules in the drug
discovery process.
Background
The Drug Discovery Pipeline
Drug Discovery is a lengthy process that takes years and requires the use
of bioinformatics, chemoinformatics and clinical-informatics tools.
Target Identification -> Target Validation -> Lead Identification -> Lead Optimization -> Preclinical Trials -> Clinical Trials
Functional genomics plays an important role in speeding up the pipeline
and also in allowing us to try new therapeutic methods.
Background
Drug Discovery
Functional genomics plays an important role in identifying functions of
potential therapeutic targets such as encoded proteins. Gene expression
studies play an important role in most stages:
Target Identification:
Understand disease states, identify genetic changes that cause disease
(genes, proteins, tissues, environmental conditions, etc.)
Target Validation:
Understand the role of a target and the effects of manipulating a target
candidate (e.g. what happens if a gene is knocked out)
Compound Screening:
Understand a compound's effect on the target and its risk profile
Pre-clinical and clinical trials:
Prioritise studies
Background
Biology, Cells and DNA
[Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA, single strand), Protein]
All living organisms consist of cells. Humans have trillions of
cells; Yeast - one cell.
Cells are of many different types (blood, skin, nerve), but all
arose from a single cell (the fertilized egg)
Each cell contains a complete copy of the genome (the program
for making the organism), encoded in DNA.
A gene is a segment of DNA that specifies how to make a
protein. Human DNA has about 30-35,000 genes; Rice has
about 50-60,000, but shorter genes.
DNA sequence (split into genes) codes for an amino acid sequence, which folds into a protein. The protein has a 3D structure, which dictates protein function, which determines cell activity.
What is?
Gene Expression:
The process by which the information encoded in a gene is converted into an
observable phenotype (most commonly production of a protein).
The degree to which a gene is active in a certain tissue of the body, measured
by the amount of mRNA in the tissue.
Microarrays:
Tools used to measure the presence and abundance of gene expression in
tissue.
Microarray technologies provide a powerful tool by which the expression
patterns of thousands of genes can be monitored simultaneously.
Background
Gene Expression
Cells are different because of differential gene expression.
About 40% of human genes are expressed at one time.
A gene is expressed by transcribing DNA into single-stranded
mRNA.
The mRNA is later translated into a protein.
Microarrays measure the level of mRNA expression
A Dynamic View
Gene expression depends on environment!
[Figure labels: Interactions, Environment, Metabolites, DNA, RNA, Protein, Growth rate, Expression]
A Dynamic View
Gene expression varies with time !
[Figure labels: forwards-propagated correlations; mRNA, protein, metabolites; time; event]
Microarray Technology
Quantitative Measurement of Gene Expression
Also known as DNA microarrays, DNA arrays, DNA chips or gene
chips. Whatever the name, their use is effectively transforming
a living cell from a black box into a transparent box.
Applications of Microarray
Technology
Data Analysis over microarray data
What type of data analysis is required to:
Identify Genes expressed in different cell types (e.g. Liver vs finger)
Learn how expression levels change in different developmental
stages (embryo vs. adult)
Learn how expression levels change in different disease
states (cancerous vs. non-cancerous)
Learn how groups of genes inter-relate (gene-gene interactions)
Identify cellular processes that genes participate in (structure,
repair, metabolism, replication, etc)
Applications are covered only as example contexts; the emphasis is on
analysis methods.
Microarrays
Basic Idea
Affymetrix Inc. is the leading provider of Microarray
technology (GeneChip).
http://www.affymetrix.com/
A Microarray is a device that detects the presence and abundance
of labelled nucleic acids in a biological sample.
In the majority of experiments, the labelled nucleic acids are derived
from the mRNA of a sample or tissue.
The Microarray consists of a solid surface onto which known DNA
molecules have been chemically bonded at special locations.
Each array location is typically known as a probe and contains many
replicates of the same molecule.
The molecules in each array location are carefully chosen so as to
hybridise only with mRNA molecules corresponding to a single gene.
Several companies sell equipment to make DNA chips, including
spotters to deposit the DNA on the surface and scanners to detect
the fluorescent or radioactive signals.
Basic Idea
A Microarray works by exploiting the ability of a given mRNA
molecule to bind specifically to, or hybridize to, the DNA template
from which it originated.
By using an array containing many DNA samples, scientists can
determine, in a single experiment, the expression levels of
hundreds or thousands of genes within a cell by measuring the
amount of mRNA bound to each site on the array.
With the aid of a computer, the amount of mRNA bound to the
spots on the Microarray is precisely measured, generating a
profile of gene expression in the cell.
Background
DNA/RNA Hybridization
DNA molecules:
DNA molecules are long double-stranded chains; 4 types of bases are
attached to the backbone: adenine (A),
guanine (G), cytosine (C), and thymine
(T). A pairs with T, C with G.
DNA-RNA hybridization:
When a mixture of DNA and RNA
is heated to denaturation
temperatures to form single
strands and then cooled, RNA can
hybridize (form a double helix) with
DNA that has a complementary
nucleotide sequence.
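The pairing rule above can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `can_hybridise` are illustrative assumptions, and the check is deliberately idealised: real hybridisation also depends on temperature, salt concentration and tolerated mismatches, none of which are modelled here.

```python
# Watson-Crick pairing for DNA; in RNA, uracil (U) takes the place of thymine (T).
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))


def can_hybridise(dna_probe, rna_target):
    """True if the RNA target is perfectly complementary to the DNA probe.

    Both sequences are written 5'->3'. The RNA is first rewritten in the
    DNA alphabet (U -> T) so the two strands can be compared directly.
    """
    rna_as_dna = rna_target.replace("U", "T")
    return rna_as_dna == reverse_complement(dna_probe)


probe = "ATGCCGTT"  # hypothetical probe sequence
print(reverse_complement(probe))         # AACGGCAT
print(can_hybridise(probe, "AACGGCAU"))  # True
print(can_hybridise(probe, "AACGGCAA"))  # False
```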
The Array
The technology for making DNA chips has become so well-defined
that it is even possible to construct all of the equipment for under
$50,000 using directions on the Internet from Professor Pat
Brown's laboratory at Stanford. http://cmgm.stanford.edu/pbrown/
Applying a Labelled
Sample
The molecules in the target biological sample are labelled using a
fluorescent dye before the sample is applied to the array.
If a gene is expressed in the sample, the corresponding mRNA hybridises
with the molecules on a given probe (array location).
If a gene is not expressed, no hybridisation occurs on the corresponding
probe.
Reading the array output
After the sample is applied, a laser light source is applied to the array.
The fluorescent label enables the detection of which probes have hybridised
(presence) via the light emitted from the probe.
If a gene is highly expressed, more mRNA exists and thus more mRNA
hybridises to the probe molecules; this abundance is detected via the intensity of the light
emitted.
The Process
Chemistry Basics:
Surface chemistry is used to attach the probe molecules
to the glass substrate.
Chemical reactions are used to attach the fluorescent
dyes to the target molecules.
Probe and target hybridise to form a double helix.
[Figure labels: Labelled targets in solution, Probes on array, Hybridisation, Heteroduplexes, The array]
Steps of a Microarray Experiment
1. Prepare DNA chip(s) by choosing probes and attaching them to
glass substrate. Note location and properties of each probe.
2. Generate a hybridization solution containing a mixture of
fluorescently labelled targets.
3. Incubate hybridization mixture.
4. Detect probe hybridization using laser technology
a) Scan the arrays and store output as images
b) Quantify each spot
c) Subtract background
d) Normalize
e) Export a table of fluorescent intensities for each gene in the array
5. Analyze data using computational methods.
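The quantify, subtract-background, normalise and export steps above could be sketched as follows. This is a minimal sketch under stated assumptions: the pixel values are made up, the helper names are hypothetical, the median is used as the spot summary, and scaling by the array median is only one simple normalisation choice among many.

```python
from statistics import median


def quantify_spot(spot_pixels, background_pixels):
    """Background-corrected intensity of one spot (floored at zero)."""
    signal = median(spot_pixels)
    background = median(background_pixels)
    return max(signal - background, 0.0)


def normalise(intensities):
    """Scale intensities so that the array median becomes 1.0.

    Assumes the array median is non-zero.
    """
    array_median = median(intensities.values())
    return {gene: value / array_median for gene, value in intensities.items()}


# Hypothetical pixel values per spot: (spot pixels, local background pixels)
scan = {
    "geneA": ([1200, 1250, 1180, 1230], [200, 210, 190, 205]),
    "geneB": ([400, 420, 390, 410], [195, 205, 200, 198]),
    "geneC": ([2200, 2150, 2250, 2210], [205, 200, 210, 195]),
}

corrected = {gene: quantify_spot(spot, bg) for gene, (spot, bg) in scan.items()}
table = normalise(corrected)

# Export step: one tab-separated row per gene.
for gene, value in table.items():
    print(f"{gene}\t{value:.3f}")
```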
Types of Microarrays
How are Microarrays made?
What molecules make the probes?
cDNA (PCR products) vs. Oligos
How are the probes added to the chip?
Spotting vs. In-situ synthesis
Output type
Single label vs. Dual label
Why ? Appreciation of some of the concepts of the technology.
Helps us understand and choose between available technology.
Helps us design our experiments.
Helps understand sources of errors in array outputs and compensate
for them.
Each probe represents the measurement for a single gene
An array represents measurements for many genes
Designing the Probes
The probes need to be of high specificity to avoid hybridization with
wrong target molecules.
The probes need to generate an output that is easy to read (spots lie in
defined positions and are of regular size and shape, with even spacing).
The probes have to have high sensitivity to detect the mRNA and the
intensity of the spot light must be differentiable from background noise.
The intensity of a spot light also needs to correlate with the abundance
of the target molecule in the sample.
Results must be reproducible across multiple experiments.
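The specificity requirement can be made concrete with a toy screen for possible cross-hybridisation. This is only a sketch under assumptions: the sequences are invented, the function names are hypothetical, and "too similar" is defined crudely as a window within `max_mismatches` substitutions of the intended binding site. Real probe design tools use thermodynamic models and genome-wide searches instead.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def off_target_hits(site, transcripts, target_name, max_mismatches=1):
    """Names of non-target transcripts containing a near-match to `site`.

    `site` is the stretch of target sequence that the probe is designed to
    bind. A window within `max_mismatches` of it in another transcript is
    treated as a possible cross-hybridisation site.
    """
    hits = []
    for name, seq in transcripts.items():
        if name == target_name:
            continue
        windows = (seq[i:i + len(site)] for i in range(len(seq) - len(site) + 1))
        if any(hamming(site, w) <= max_mismatches for w in windows):
            hits.append(name)
    return hits


# Toy transcripts (hypothetical sequences, far shorter than real mRNAs).
transcripts = {
    "gene1": "ATGGCGTACGTTAGC",
    "gene2": "TTGGCGTACGATAGG",
    "gene3": "CCCCAAAAGGGGTTT",
}

# Two candidate binding sites taken from gene1.
print(off_target_hits("GCGTACGTT", transcripts, "gene1"))  # ['gene2']
print(off_target_hits("TACGTTAGC", transcripts, "gene1"))  # []
```

In this toy case the second site would be the better choice, because no other transcript contains a close match to it.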
Probe Types
Different chip manufacturers use different technologies.
As an end user you will use the probe types
recommended for the chips, but would have to select
the sequences for the probes to be used in your
experiments.
Affymetrix technology is based on oligos (20 bases per
probe).
The DNA probes used on an array can either be polymerase chain
reaction (PCR) products (cDNAs) or oligonucleotides.
In the first case (cDNA), highly parallel PCR is used to amplify DNA
from a clone library, and the amplified DNA is purified. The clones are
typically long sequences (complete genes or ESTs).
In the second case, DNA oligonucleotides are presynthesised for use on
the array. An oligonucleotide, or oligo as it is commonly called, is a
short fragment of single-stranded DNA that is typically 5 to 50
nucleotides long. This can achieve a higher density of probes per chip.
In both cases the probes are attached (fixed or immobilized) to a
glass (or nylon) surface using special surface chemical techniques
(beyond the scope of this course).
Spotting vs. In-situ Synthesis
Spotting
Spotting works for both cDNA probes and oligo probes
The Spotting Process
1. The DNA probes are produced and stored in wells.
2. A spotting robot is used to deposit them onto individual
locations on the glass slide.
3. The glass slide is post-processed so no further DNA can attach
to it.
Spotting is easy to automate but may generate poor quality
spots (irregular spots of different shapes and sizes)
The Spotting Robot
The Operation of the Spotting Robot
1. The pins are dipped into the wells to collect the
first batch of DNA.
2. This DNA is spotted onto a number of different
arrays, depending on the number of arrays
being made and the amount of liquid the pins
can hold.
3. The pins are washed to remove any residual
solution and ensure no contamination of the
next sample.
4. The pins are dipped into the next set of wells.
5. Return to step 2 and repeat until the array is
complete.
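The robot's dip, spot and wash cycle described above can be written as a simple loop. This is an illustrative sketch only: `spotting_plan` and its action tuples are hypothetical, and it assumes the pins hold enough liquid to spot every array from a single dip.

```python
def spotting_plan(wells, pins, arrays):
    """Return the ordered list of robot actions for printing `arrays` slides.

    `wells` is the list of DNA samples; the robot picks them up `pins`
    at a time, spots each batch onto every array, then washes the pins.
    """
    actions = []
    for start in range(0, len(wells), pins):
        batch = wells[start:start + pins]
        actions.append(("dip", batch))
        for array_id in range(arrays):
            actions.append(("spot", array_id, batch))
        actions.append(("wash",))
    return actions


# Five hypothetical probe samples, a two-pin head, three slides.
plan = spotting_plan(["p1", "p2", "p3", "p4", "p5"], pins=2, arrays=3)
for action in plan:
    print(action)
```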
Spotting Process
Affymetrix technology is based on in-situ synthesis in a
series of addition steps separated by mask addition and
then photo-deprotection.
Spotting vs. In-situ Synthesis
In-situ Synthesis
Since oligos are synthesized short sequences,
their bases can be added to the glass surface
one at a time.
Using high-tech processes this can generate
the best quality spots (regular and even).
Different patented technologies are used to
enable this to happen while not allowing more
than one base to be added at a time, including
Photodeprotection technology (Affymetrix)
Inkjet Array Synthesis
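The idea of adding one base at a time to many probes in parallel can be sketched in Python. This is a simplified model under assumptions, not the manufacturer's actual process: the function name `synthesise`, the fixed base order "ACGT" and the boolean "mask" list are all illustrative, and the input is assumed to contain only the letters A, C, G and T.

```python
def synthesise(target_probes, base_order="ACGT"):
    """Grow all probes in parallel, one base per addition step.

    In each step a single base type is offered to the whole chip. The mask
    for that step exposes only those positions whose next required base is
    the one being offered, so every other position stays protected.
    Returns the finished probes and the number of addition steps used.
    """
    grown = [""] * len(target_probes)
    steps = 0
    while any(g != t for g, t in zip(grown, target_probes)):
        for base in base_order:
            mask = [
                len(g) < len(t) and t[len(g)] == base
                for g, t in zip(grown, target_probes)
            ]
            if any(mask):
                grown = [g + base if m else g for g, m in zip(grown, mask)]
                steps += 1
    return grown, steps


# Three hypothetical 3-base probes built side by side.
probes, steps = synthesise(["ACG", "CGT", "AAT"])
print(probes, steps)
```

Because all positions share each addition step, the number of steps depends on probe length and base order rather than on how many probes are on the chip.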
In-situ Synthesis
Affymetrix
Many other variations of the technology exist, such as
the use of longer oligos, the use of fibre optics, etc.
Comparison of Probe Types
In-situ Synthesis / Oligos
Advantages
No need to isolate and purify cDNAs because oligonucleotides can be synthesized.
Short oligonucleotides are less likely to have cross-reactivity with other sequences in the target DNA.
Density of chips is higher than with cDNAs.
Limitations
The sequence has to be known.
Synthesis can be expensive and time-consuming.
The short sequences are not as specific for target DNA, so appropriate controls must be added.
PCR Products / cDNA Probes
Advantages
Flexibility to study cDNAs from any source.
cDNAs do not require any a priori information about the corresponding genes.
Longer sequences increase hybridization specificity, which reduces false positives.
Limitations
Isolation of individual cDNAs to immobilize on each spot can be cumbersome.
Density is lower than synthesizing oligonucleotides on the surface of the chip.
cDNAs are longer sequences and are more likely to randomly contain sequences found in target DNA, which results in cross-reactivity.
Single Label vs. Dual Label
Single Channel vs Dual Channel
Affymetrix technology is based on the use of single
labels.
Most laboratories use fluorescent labelling, with the two dyes Cy3 (excited by a
green laser) and Cy5 (excited by a red laser).
In Dual label experiments, two samples are hybridised to the arrays, one
labelled with each dye; this allows the simultaneous measurement of two
samples (e.g. for differential analysis)
In Single label experiments, only one sample, labelled with one dye, is hybridised to each array
(in which case the control needs to be measured using a separate
chip).
Choice between single and dual label is governed by array technology and
underlying chemistry.
Dual Label Experiments
[Figure labels: RNA sample 1, RNA sample 2, + Green label, + Red label, Slide]
Typically used in custom made cDNA chips
Typically used to study one sample (e.g. diseased tissue) vs. a
control sample (e.g. normal tissue)
Separate images are obtained for each channel, and then combined
Qualitative Interpretation of Double
Label Experiments
GREEN represents High Control hybridization
RED represents High Sample hybridization
YELLOW represents a combination of Control and Sample
where both hybridized equally.
BLACK represents areas where neither the Control nor Sample
hybridized.
Main issue is to quantify the results:
How green is green?
What is the ratio of the signal to background noise?
How to compare multiple experiments using different chips?
How to quantify cross hybridization (if any)?
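One common way to answer "how green is green?" is to replace the colour with a number, for example the log2 ratio of the two channel intensities. The sketch below assumes background-corrected, non-zero intensities, with red as the sample channel and green as the control channel as in the colour key above. The thresholds `detection_floor` and `fold` are illustrative values, not taken from the lecture.

```python
import math


def log_ratio(red, green):
    """log2(red / green) for background-corrected channel intensities."""
    return math.log2(red / green)


def spot_colour(red, green, detection_floor=100.0, fold=2.0):
    """Qualitative colour call for one spot.

    `detection_floor` and `fold` are illustrative thresholds only.
    """
    if red < detection_floor and green < detection_floor:
        return "black"   # neither sample nor control hybridised
    if red >= fold * green:
        return "red"     # sample dominates
    if green >= fold * red:
        return "green"   # control dominates
    return "yellow"      # both hybridised to a similar degree


# Hypothetical (red, green) intensities for four spots.
for red, green in [(4000, 500), (450, 3600), (1500, 1400), (40, 55)]:
    print(spot_colour(red, green), round(log_ratio(red, green), 2))
```

A log ratio of 0 means equal signal in both channels, and each unit corresponds to a two-fold change, so up- and down-regulation are treated symmetrically.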
Affymetrix GeneChip
Example of Single Label Chips
Hundreds of thousands of oligonucleotide probes are packed at extremely high
densities. The probes are designed to maximize sensitivity, specificity, and
reproducibility, allowing consistent discrimination between specific and
background signals, and between closely related target sequences.
RNA is labelled and scanned in a single colour, with one sample per chip.
Interpreting Affymetrix Output
Perfect Match/Mismatch Strategy
GeneChips use a Perfect Match/Mismatch probe strategy
For each probe designed to be perfectly complementary to a target sequence,
a partner probe is generated that is identical except for a single base
mismatch in its centre.
These probe pairs, called the Perfect Match probe (PM) and the Mismatch
probe (MM), allow the quantitation and subtraction of signals caused by
non-specific cross-hybridization.
The difference in hybridization signals between the partners, as well as
their intensity ratios, serve as indicators of specific target abundance.
PM to maximize hybridization
MM to ascertain the degree of cross-hybridization
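The PM/MM idea can be sketched in a few lines of Python. Two assumptions are made here: the mismatch is created by swapping the central base for its complement (the text above only says "a single base mismatch in its centre"), and the gene-level summary is a plain mean of PM minus MM over the probe pairs. The function names and signal values are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Copy of the PM probe with its central base swapped for its complement."""
    mid = len(pm) // 2
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


def average_difference(pm_signals, mm_signals):
    """Mean of (PM - MM) over the probe pairs for one gene."""
    diffs = [pm - mm for pm, mm in zip(pm_signals, mm_signals)]
    return sum(diffs) / len(diffs)


print(mismatch_probe("ACGTACG"))  # ACGAACG

# Hypothetical intensities for three probe pairs of the same gene.
print(average_difference([1200, 900, 1500], [300, 350, 400]))  # 850.0
```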
Affymetrix GeneChips
Perfect Matches and Mismatches
Other Image Processing Problems
Spot Quality Problems
Various Image processing techniques may be applied to read and interpret the
outputs of Microarrays
Commercial Microarray (e.g. Affymetrix) systems use proprietary software
Image Analysis software packages exist for the analysis of the output of custom made
chips (e.g. GenePix Pro, Array Vision, TIGR Spot Finder, etc)
Typical Problems of Raw Output
Uneven grid positions
Curves within a grid
Variable Spot size or shape
Variable Distance between spots
From Microarray images to
Gene Expression Matrices
[Figure labels: Raw data (array scans, images) -> Intermediate data (spot/image quantitations, per spot) -> Final data (gene expression matrix: gene expression levels, genes by samples)]
From Microarray images to
Gene Expression Matrices
In spot quantitation matrices, rows typically represent all the measurements made from
individual spots on the array. These can include mean and median pixel intensities of the spot
and local background, etc.
An experiment typically consists of one or more spot quantitation matrices representing all
arrays used in the study.
In the gene expression matrix, rows represent genes (as opposed to features/spots on the array)
and columns represent measurements from different experimental conditions measured on
individual arrays.
An example is each column representing measurements at different time points (t0, t1, t2, ...) in time
course experiments
A second example is each column representing a different tissue type
A third is each column representing a different individual
A fourth is having groups of columns representing measurements from diseased cells, and other groups
representing measurements from healthy cells,
etc.
Each of the above matrices requires the application of data normalisation techniques, as
discussed in the next lecture.
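The step from spot quantitation matrices to a gene expression matrix can be sketched as follows. The data structures and the function name `expression_matrix` are hypothetical, and averaging replicate spots of the same gene is an assumption of this sketch rather than a rule stated above.

```python
def expression_matrix(arrays, spot_to_gene):
    """Build {gene: [level per array]} from per-array spot quantitations.

    `arrays` maps a condition name to {spot_id: corrected intensity};
    `spot_to_gene` maps each spot to the gene it measures. Replicate spots
    for the same gene are averaged; a gene with no spots on an array
    gets None for that column.
    """
    conditions = list(arrays)
    genes = sorted(set(spot_to_gene.values()))
    matrix = {gene: [] for gene in genes}
    for condition in conditions:
        per_gene = {gene: [] for gene in genes}
        for spot, value in arrays[condition].items():
            per_gene[spot_to_gene[spot]].append(value)
        for gene in genes:
            values = per_gene[gene]
            matrix[gene].append(sum(values) / len(values) if values else None)
    return conditions, matrix


# Hypothetical time-course data: two arrays, three spots, two genes.
spot_to_gene = {"s1": "geneA", "s2": "geneA", "s3": "geneB"}
arrays = {
    "t0": {"s1": 100.0, "s2": 120.0, "s3": 50.0},
    "t1": {"s1": 200.0, "s2": 220.0, "s3": 40.0},
}

conditions, matrix = expression_matrix(arrays, spot_to_gene)
print("gene", *conditions, sep="\t")
for gene, levels in matrix.items():
    print(gene, *levels, sep="\t")
```

Rows are genes and columns are arrays, which matches the layout described above for time-course experiments.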
Summary
Microarrays
Basic Concept
Based on Watson-Crick hybridization
Different Microarray technologies exist.
Probe type (cDNA vs oligo)
Spotting vs in-situ synthesis
Single vs. dual channel
Output is typically an image
Sources of errors
Image processing is required
Images are converted into gene expression matrices for further analysis