Genetics

The document introduces the use of phylogenies in studying human evolution and adaptation, covering key concepts, methods, and the phylogeny of humans and their relatives. It emphasizes the importance of understanding evolutionary relationships through phylogenetics, including the use of molecular data and the concept of the 'Tree of Life'. Additionally, it discusses the significance of sequence alignment and the evolution of DNA and protein sequences in phylogenetic analysis.

Uploaded by

originalxlusiv

Human Genetics

Human phylogenetics and

evolution
Robin M. D. Beck
R.M.D.Beck@salford.ac.uk
Overall Aim

• To introduce you to the use of phylogenies to study human evolution and adaptation
Learning objectives

• To introduce you to:

• key phylogenetic concepts and terminology (= “tree thinking”)
• methods and models of phylogenetic analysis
• the phylogeny of humans and their relatives
• the “molecular clock”
• using phylogenies to study human evolution and adaptation
Lecture 1: An introduction to
molecular phylogenetics
Robin M. D. Beck
R.M.D.Beck@salford.ac.uk
The ‘Tree of Life’
‘…the great Tree of Life, which fills
with its dead and broken branches
the crust of the earth, and covers
the surface with ever-branching and
beautiful ramifications.’
- Charles Darwin (1859) ‘On the Origin of Species’

The only figure in the first edition of ‘On the Origin of Species’ (1859)

All organisms (living and extinct) on Earth are related to each other at one level or another, i.e. there is a hierarchy of relationships
= ‘THE TREE OF LIFE’
‘The time will come I
believe…when we shall have
very fairly true genealogical
trees of each great kingdom of
nature’
- Charles Darwin in a letter to
Thomas Huxley (1857)

Charles Darwin (1809-1882)

1837 sketch by Charles Darwin

Figure from ‘Allgemeine
Entwickelungsgeschichte
der Organismen’ (1866)

Ernst Haeckel (1834-1919)

• Ernst Haeckel:
• Produced the first reconstruction of the entire ‘Tree of Life’
(1866)
• Coined the term ‘phylogeny’ (Ancient Greek: phyle = tribe, race;
genesis = birth) for the evolutionary relationships between
organisms
Why is phylogenetics important?
• To understand ‘our place in Nature’, and that of every other organism:

Tree of ~4500 species of extant mammal

(Bininda-Emonds et al., 2007)
• Frameworks for studying evolutionary patterns and processes
• e.g. evidence for rapid increase in brain size in modern humans and their relatives (Hominini) – only identifiable because we know the phylogeny!
• Medical uses, e.g. origins of diseases:
Origin of MERS
Multiple origins of HIV
Phylogeny of coronaviruses
Phylogeny of SARS-CoV-2 virus
(virus causes COVID-19)
Account for non-independence when comparing taxa
• e.g. two groups of 20 close relatives – inappropriate to treat them as 40 independent data points
• Need to take into account the fact that some taxa are more closely related than others
• Standard statistical approaches that assume independence of data points are invalid for this kind of biological data!

Felsenstein (1985 –The American Naturalist)

“Nothing in Biology Makes Sense Except in the Light of Evolution”
- Theodosius Dobzhansky (1900-1975)

Nothing in evolution makes sense except in the light of phylogeny!
Tree Thinking
• Phylogenetics has an underlying logic
and terminology that might be unfamiliar
to you…
Trees and Networks

• If organisms are able to

exchange genes with each
other, their relationships can be
represented as a network
• e.g. interbreeding populations,
hybrids, prokaryotes that engage
in horizontal gene transfer
Trees and Networks
• If organisms are not able to
exchange genes with each other
their relationships can be
represented as a branching tree
• e.g. phylogenies of species that
cannot interbreed and higher
taxonomic levels (genera, families)
Tree terminology
• Trees are made up of nodes and branches
• an internal node represents the most recent common ancestor (MRCA) of the descendant branches
• x = MRCA of A+B
• y = MRCA of C+D+E
• MRCAs are usually hypothetical – it is difficult (impossible?) to identify a fossil as a genuine ancestor
[Figure labels: root or root node; internal node; internal branch or edge; terminal branch; tip or leaf or terminal node]
Tree terminology

• Trees can be either unrooted or rooted

• rooted trees have one node identified as the root (or root node) = the first
divergence in the tree
• this provides a “direction” to the tree
[Figure: an unrooted tree and a rooted tree; root = position of first divergence]
Tree terminology

• A single unrooted tree can be rooted in multiple different positions
• Different rooting positions will result in different relationships
[Figure: an unrooted tree]
Rooting trees
• Usually, a tree is rooted with an outgroup (or multiple outgroups)
• outgroup taxa = taxa that we already know are definitely more distantly
related than the other taxa (= the ingroup)
• root is between the outgroup taxon (or taxa) and the remaining ingroup taxa

We know a priori that the kangaroo (a marsupial mammal) is more distantly related than the other taxa (which are all placental mammals), and so it can be used as an outgroup to root the tree
Definition of relatedness

• Two taxa (species, families etc.) are more closely related to each other than either is to another taxon if they share a more recent common ancestor
Definition of relatedness – an example
• The mouse and the human are more closely related to each other than they are to the lizard because they share a more recent common ancestor (indicated by the purple dot)
• The mouse, human and lizard are more closely related to each other than they are to the frog because they share a more recent common ancestor (indicated by the blue dot)
[Figure: tree with tips fish, frog, lizard, mouse, human]
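The definition of relatedness can be checked mechanically. The sketch below assumes the tree is (fish,(frog,(lizard,(mouse,human)))), which is what the example describes. The internal node names (n1, n2, n3, root) are hypothetical labels, with n1 standing for the purple dot and n2 for the blue dot.

```python
# Child -> parent links for the assumed tree (fish,(frog,(lizard,(mouse,human)))).
# Internal node names are hypothetical labels, not taken from the slides.
PARENT = {
    "human": "n1", "mouse": "n1",   # n1 = MRCA of mouse + human (purple dot)
    "n1": "n2", "lizard": "n2",     # n2 = MRCA of mouse + human + lizard (blue dot)
    "n2": "n3", "frog": "n3",
    "n3": "root", "fish": "root",
}


def ancestors(node):
    """List the ancestors of a node, nearest first, ending at the root."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(a, b):
    """Most recent common ancestor of two tips."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("no common ancestor")


def depth(node):
    """Number of branches between a node and the root."""
    return len(ancestors(node))


def more_closely_related(a, b, c):
    """True if a and b share a more recent common ancestor than either shares with c."""
    return depth(mrca(a, b)) > max(depth(mrca(a, c)), depth(mrca(b, c)))


print(more_closely_related("mouse", "human", "lizard"))  # mouse + human vs lizard
```

Here "more recent" is read off the topology alone: a node further from the root is a descendant of, and therefore younger than, the nodes above it on the same path.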
Tree Thinking quiz
(taken from Baum et al., 2005 – Science)
IMPORTANT!
Trees are like mobiles…

[Figure: three drawings of the same tree, with the tips in the orders A B C D = D C B A = B C D A]

What is important is the groupings that are specified, not the precise
order that the taxon labels occur
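One way to make the "mobile" idea concrete is to compare trees by the groupings (clades) they contain rather than by the left-to-right order of the tips. The sketch below assumes the topology (A,(B,(C,D))), which is consistent with all three tip orders shown but is not stated on the slide. Trees are written as nested tuples purely for illustration.

```python
def clades(tree):
    """Return (tips under this node, set of all clades at or below this node)."""
    if isinstance(tree, str):              # a tip
        return frozenset([tree]), set()
    tips = frozenset()
    groups = set()
    for child in tree:
        child_tips, child_groups = clades(child)
        tips |= child_tips
        groups |= child_groups
    groups.add(tips)                       # this internal node defines one clade
    return tips, groups


def same_tree(t1, t2):
    """Two rooted trees are the same if they specify the same set of clades."""
    return clades(t1)[1] == clades(t2)[1]


# Assumed topology, drawn with three different tip orders (branches rotated at nodes)
abcd = ("A", ("B", ("C", "D")))
dcba = ((("D", "C"), "B"), "A")
bcda = (("B", ("C", "D")), "A")
# A genuinely different tree on the same four taxa
other = (("A", "B"), ("C", "D"))

print(same_tree(abcd, dcba), same_tree(abcd, bcda), same_tree(abcd, other))
```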
Tree Thinking quiz
(taken from Baum et al., 2005 – Science)
Different types of data for producing
phylogenies
• Morphological or other phenotypic data:
• only kind of data available for fossils (if ancient DNA or proteins not preserved)
• Molecular data:
• Today, the most commonly used form of molecular data is sequence data –
either nucleotide sequence data (DNA or RNA) or protein sequence data
(amino acids)
DNA sequence data

• Double-helix
• specific sequence of four bases
• adenine = A
• cytosine = C
• guanine = G
• thymine = T

• Example: ATGCGCAGTTATTGCGAT
RNA sequence data
• same bases as DNA except uracil (U) instead of thymine, i.e. A, C, G and U
• usually produced using DNA as a template (= transcription) ∴ complementary to the DNA template strand, and so the same sequence as the DNA coding strand (but U instead of T)

• Example: ATGCGCAGTTATTGCGAT (DNA coding strand)
↓ transcription
AUGCGCAGUUAUUGCGAU (RNA)
Protein sequence data
• Proteins are composed of sequences of 20-22 different types of amino acids
• Sequence of amino acids in a protein is specified by DNA sequence of the gene coding for that protein, e.g. BRCA1 gene codes for BRCA1 protein
• Example: ATGCGCAGTTATTGCGAT (DNA)
↓ transcription
AUGCGCAGUUAUUGCGAU (RNA)
↓ translation
Met,Arg,Ser,Tyr,Cys,Asp (amino acid sequence for protein)
Sequencing
• Modern sequencing technology (“next generation sequencing”, or NGS) means that large amounts of sequence data can be produced cheaply and quickly
• Cost of sequencing a human genome:
• Human Genome Project (1990-2003): US$3,000,000,000
• January 2014: US$1,000
Illumina MiSeq – Salford has one of these!
Obtaining sequences
• Published molecular sequences (nucleotides and proteins) freely
available for download from numerous online databases (e.g. GenBank,
UniProt, European Nucleotide Archive…)
How do DNA sequences evolve?
• Types of DNA mutation:
• Substitution = replacement of one base by another, e.g. G by C
• Insertion = addition of a new base somewhere in the sequence
• Deletion = loss of an existing base somewhere in the sequence
• If the DNA is a coding sequence, these mutations will be present in the RNA transcribed from it
How do protein sequences evolve?

• Amino acid sequence of a protein is specified by the DNA sequence of the

gene that codes that protein
• each amino acid is specified by a triplet of nucleotides (= codon)
• AAA = lysine (Lys or K)
• GCT/GCU = alanine (Ala or A)
• GTG = valine (Val or V)
…
• TAA/UAA = STOP
How do protein sequences evolve?
DNA translation table
RNA translation table
How do protein sequences evolve?
• Changes (substitutions, insertions, deletions) in the DNA sequence of a
protein-coding gene can alter the amino acid sequence of the protein

• But…genetic code is “degenerate”

• multiple codons code for same amino acid
→ change in DNA sequence doesn’t always
result in a change in amino acid sequence

• Leucine coded for by 6 different codons

• Alanine coded for by 4 different codons
• STOP coded for by 3 different codons
• Methionine coded for by 1 codon only
• …
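The codon counts above can be reproduced from the standard genetic code. This is a minimal sketch: the 64-letter string is the standard code in T, C, A, G order with "*" for STOP, and the helper name `same_amino_acid` is made up for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many codons specify each amino acid (or STOP)?
codons_per_aa = Counter(CODON_TABLE.values())


def same_amino_acid(codon_before, codon_after):
    """True if a change from one codon to the other leaves the amino acid unchanged."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]


for name, letter in [("Leucine", "L"), ("Alanine", "A"), ("STOP", "*"), ("Methionine", "M")]:
    print(name, codons_per_aa[letter])

print(same_amino_acid("GCT", "GCC"))  # third-position change within alanine codons
print(same_amino_acid("GCT", "GTT"))  # second-position change, alanine to valine
```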
How do protein sequences evolve?
• DNA mutation in protein-coding gene that does not change amino
acid sequence of protein = synonymous
• DNA mutation in protein-coding gene that does change amino acid sequence of protein = non-synonymous
Question: which evolves faster? A DNA sequence coding for a protein,
or the amino acid sequence of that protein?
Comparing sequences
• Let’s say we want to compare four DNA sequences
• If we line them up:
            0000000001
            1234567890
Sequence 1  CGAACTCGA
Sequence 2  CCAACTCGA
Sequence 3  CCGAACTCGA
Sequence 4  CCGAACCGA
• Are we comparing homologous nucleotides across all sequences?
Comparing sequences
            0000000001
            1234567890
Sequence 1  CG-AACTCGA
Sequence 2  CC-AACTCGA
Sequence 3  CCGAACTCGA
Sequence 4  CCGAAC-CGA
• We need to insert gaps into sequences to account for insertions and deletions
• This procedure is called alignment
Alignment is absolutely critical when comparing molecular sequences!
Sequence alignment
• Critical step when comparing sequences – otherwise you risk getting
incorrect results

Unaligned (0% sequence similarity):
Sequence 1: ACGTGCTAGACTATGTGTC
Sequence 2: CGTGCTAGACTATGTGTC

Aligned (100% sequence similarity):
Sequence 1: ACGTGCTAGACTATGTGTC
Sequence 2: -CGTGCTAGACTATGTGTC
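The two similarity figures can be reproduced with a simple position-by-position comparison. This sketch makes two assumptions: positions where either sequence has a gap are skipped, and an unaligned comparison stops at the end of the shorter sequence. Other definitions of percent similarity exist.

```python
def percent_identity(seq1, seq2, gap="-"):
    """Percentage of compared positions at which the two sequences match.

    Positions where either sequence has a gap are ignored; zip() stops at the
    end of the shorter sequence, so unaligned inputs are compared from the left.
    """
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


unaligned = percent_identity("ACGTGCTAGACTATGTGTC", "CGTGCTAGACTATGTGTC")
aligned = percent_identity("ACGTGCTAGACTATGTGTC", "-CGTGCTAGACTATGTGTC")
print(unaligned, aligned)
```

A single missing base at the start shifts every later position by one, which is why the unaligned comparison finds no matches at all.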
Sequence alignment
• Usually we don’t know the exact sequence of substitutions, insertions and deletions – we have to make a best guess
            0000000001
            1234567890
Sequence 1  CG-AACTCGA
Sequence 2  CC-AACTCGA
Sequence 3  CCGAACTCGA
Sequence 4  CCGAAC-CGA
Sequence alignment
• perhaps all taxa with a particular sequence have gone extinct, or we haven’t sampled them
• perhaps we cannot confidently determine the exact pattern of substitutions and insertions and deletions (= indels)
• we have to make a best guess
            000000000
            123456789
Sequence 2  CCAACTCGA
Sequence 4  CCGAACCGA
Sequence alignment
• Sequences that evolve slowly (= they are more highly conserved) are
easier to align
Sequence alignment
• Sequences that evolve very rapidly are harder to align, particularly
when comparing distantly related taxa
• rate of evolution varies:
• between coding and non-coding DNA
• between genes
• within the same gene (some parts of one gene more conserved than others)
• between taxa – e.g. within mammals, rate of evolution in mice and rats much
faster than in humans
Protein-coding DNA/RNA sequence alignment
• The amino acid sequence of a protein is determined by the sequence of
codons (= nucleotide triplets) in the DNA sequence of the gene coding for
that protein
• Nucleotide indels in protein-coding genes are usually in multiples of 3 (i.e. 3,
6, 9, 12….)
• If an indel is not in a multiple of 3, it will completely alter the amino acid
sequence and the position of STOP codon (protein will be shorter or longer
than usual)
• usually results in a non-functional protein
Protein-coding DNA/RNA sequence alignment

Original: ATGAGCAGTAGAGGCGTTTAG = Met,Ser,Ser,Arg,Gly,Val,STOP

Single nucleotide insertion (= frameshift mutation): ATGCAGCAGTAGAGGCGTTTAG = Met,Gln,Gln,STOP
• very different amino acid sequence
• premature STOP codon
Protein-coding DNA/RNA sequence alignment

Original: ATGAGCAGTAGAGGCGTTTAG = Met,Ser,Ser,Arg,Gly,Val,STOP

Triplet insertion (= reading frame maintained): ATGTACAGCAGTAGAGGCGTTTAG = Met,Tyr,Ser,Ser,Arg,Gly,Val,STOP
• single amino acid insertion, but rest of sequence is identical
Protein-coding DNA/RNA sequence alignment

Original: ATGAGCAGTAGAGGCGTTTAG = Met,Ser,Ser,Arg,Gly,Val,STOP

Triplet deletion (= reading frame maintained): ATGAGTAGAGGCGTTTAG = Met,Ser,Arg,Gly,Val,STOP
• single amino acid deletion, but rest of sequence is identical
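The four sequences in this example can be translated programmatically to confirm the effect of each indel. A minimal sketch, assuming the standard genetic code and translation from the first base; the output uses one-letter amino acid codes (M = Met, S = Ser, R = Arg, G = Gly, V = Val, Q = Gln, Y = Tyr, * = STOP).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate codon by codon from the first base, stopping at the first STOP."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


original = translate("ATGAGCAGTAGAGGCGTTTAG")
frameshift = translate("ATGCAGCAGTAGAGGCGTTTAG")          # one base inserted
triplet_insertion = translate("ATGTACAGCAGTAGAGGCGTTTAG")  # three bases inserted
triplet_deletion = translate("ATGAGTAGAGGCGTTTAG")         # three bases deleted

print(original, frameshift, triplet_insertion, triplet_deletion)
```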
Example of a frameshift mutation
Interphotoreceptor Binding Protein (IRBP) – plays key role in vision

• frameshift mutation in the marsupial mole (a blind underground burrower) – the protein is non-functional, i.e. the gene is now a “pseudogene”
Protein-coding DNA/RNA sequence alignment
• In general, alignments of exons of protein-coding genes will have
indels in multiples of 3, because the protein needs to be functional

9bp deletion in exon 11 of BRCA1 gene
Non-protein-coding DNA/RNA sequence alignment
• Non-protein-coding DNA sequences include:
• introns in protein-coding genes (removed before translation)
• Non protein-coding RNA genes (e.g. ribosomal genes)
• regulatory sequences
• pseudogenes
• other non-functional(?) DNA

• In these types of sequences, indels can be any length

Non-protein-coding DNA/RNA sequence alignment
mitochondrial 16S ribosomal gene alignment
Protein (amino acid) sequence alignment
• Indels in protein (amino acid) sequences can be any length
Multiple sequence alignment
• So…
• Prior to phylogenetic analysis, we need to align our sequences to
ensure that we are comparing homologous nucleotides (if DNA or
RNA sequences) or homologous amino acids (if protein sequences)
• = “Multiple sequence alignment” (MSA)

• Alignment can be done:

• manually – insert gaps into sequences by hand, using a sequence editor
• automatically – use software that aligns the sequences according to an
algorithm
Multiple Sequence Alignment
• Manual alignment can be very accurate, but quickly becomes
unfeasible as number and/or length of sequences increases
Multiple sequence alignment
• There are a number of different programs for automated alignment,
including:
• ClustalW (one of the oldest and most commonly used)
• MAFFT
• MUSCLE
• They all work in broadly the same way:
• assign a “cost” to a mismatch between nucleotides (e.g. A in sequence 1
versus C in sequence 2) or amino acids (e.g. Leucine in sequence 1, Glycine in
sequence 2)
• assign a “cost” to inserting a gap in a sequence
• assign a “cost” to extending a gap in a sequence
• Find the multiple sequence alignment that minimises the overall cost
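To illustrate the cost idea, the sketch below scores one candidate pairwise alignment. It does not search for the best alignment, which is what the programs above do. The cost values (mismatch = 1, gap opening = 3, gap extension = 1) are arbitrary placeholders and not the defaults of ClustalW, MAFFT or MUSCLE.

```python
def alignment_cost(seq1, seq2, mismatch=1, gap_open=3, gap_extend=1, gap="-"):
    """Total cost of one pairwise alignment given as two equal-length gapped strings."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    cost = 0
    in_gap1 = in_gap2 = False  # currently inside a run of gaps in seq1 / seq2?
    for a, b in zip(seq1, seq2):
        if a == gap and b != gap:
            cost += gap_extend if in_gap1 else gap_open
            in_gap1, in_gap2 = True, False
        elif b == gap and a != gap:
            cost += gap_extend if in_gap2 else gap_open
            in_gap1, in_gap2 = False, True
        else:
            if a != b:
                cost += mismatch
            in_gap1 = in_gap2 = False
    return cost


# Two candidate alignments of CCGAACTCGA and CCGAACCGA
internal_gap = alignment_cost("CCGAACTCGA", "CCGAAC-CGA")
end_gap = alignment_cost("CCGAACTCGA", "CCGAACCGA-")
print(internal_gap, end_gap)
```

With these placeholder costs the alignment with a single internal gap is cheaper than padding the end, because padding the end forces three mismatches as well as a gap.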
Multiple Sequence Alignment
• Automated sequence alignment is (obviously) much faster than manual alignment, but is often less accurate
• Can still be very time-consuming with very long sequences and/or many sequences
• Some regions might be very difficult/impossible to align confidently – we might want to remove these “alignment ambiguous regions”
[Figure label: alignment ambiguous region]
So…
• Now we have nicely aligned sequence data

• How do we use it for phylogenetic analysis?

Methods of phylogenetic analysis
• Three basic types
• distance-based methods - less commonly used now (will not be discussed in
detail here)
• parsimony-based methods - mainly used for morphological data (will not be
discussed here)

• model-based methods – standard approach for analysing molecular

sequence data (e.g., nucleotide sequence data, amino acid sequence data)
Model-based methods
• Key feature of these methods: they assume an explicit model of
evolution
• Implemented in two slightly different frameworks
• Maximum Likelihood
• Bayesian analysis
Maximum Likelihood
• Likelihood = probability of the data given a particular model: P(D|H), where D = data and H = hypothesis (the model)
• Method developed by Ronald A. Fisher (1890-1962), the “father of modern statistics”
How is maximum likelihood used in a
phylogenetic context?
• Likelihood = probability of the data given a particular model, P(D|H)
• In molecular phylogenetics:
• “data” = sequence alignment
• “model” = tree + substitution model (= model of how sequences evolve)
• the substitution model is usually specified prior to analysis
• So…we’re interested in finding the tree with the highest likelihood (= the
maximum likelihood tree), given a sequence alignment and our assumed
substitution model

• To summarise: maximum likelihood analysis = find the tree(s) under which the data matrix has the highest likelihood, given a substitution model
Substitution models
• Substitution models for molecular sequence data generally comprise
two parts:
• the relative frequencies of the bases or amino acids
• a rate matrix that models the probability of changes between bases or amino
acids

• We will focus on nucleotide (DNA/RNA) sequence data as it’s simpler

(only 4 states), but the same broad principles apply to amino acid
sequence data (20-22 states)
Let’s consider the state frequencies first
• Let’s say our model is: πA = 0.1, πC = 0.4, πG = 0.3, πT = 0.2 (these are
the relative frequencies of each state, and so add up to 1)
• And our data is a single nucleotide, A
• What is the likelihood of this?

• What about if our data is a sequence of two nucleotides, AC?

• What is the probability of the sequence of an entire human genome

(3 billion base pairs)?
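These three questions can be worked through numerically. The sketch assumes every site is drawn independently from the stated frequencies, so the likelihood of a sequence is the product of the frequencies of its bases. For long sequences that product becomes too small for floating-point numbers, which is why likelihoods are usually handled as log-likelihoods.

```python
import math

FREQS = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}


def likelihood(seq, freqs=FREQS):
    """Probability of seq if each site is drawn independently from freqs."""
    return math.prod(freqs[base] for base in seq)


def log_likelihood(seq, freqs=FREQS):
    """Natural log of the likelihood, computed as a sum to avoid underflow."""
    return sum(math.log(freqs[base]) for base in seq)


print(likelihood("A"))        # single nucleotide: pi_A
print(likelihood("AC"))       # two nucleotides: pi_A * pi_C

long_seq = "AC" * 500         # 1,000 bases, far shorter than a genome
print(likelihood(long_seq))   # already underflows to zero
print(log_likelihood(long_seq))
```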
Now add in a rate matrix to model changes
between nucleotides
• Let’s stick with: πA = 0.1, πC = 0.4, πG = 0.3, πT = 0.2
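• Rate matrix (probability of each ending nucleotide given the starting nucleotide; each row sums to 1). Only the entries used in the calculation below are shown:

                      ending nucleotide
                      A      C      G      T
starting          A   …      …      0.007  …      (row sum = 1)
nucleotide        C   …      0.983  …      …      (row sum = 1)
                  G   …      …      …      …      (row sum = 1)
                  T   …      …      …      0.979  (row sum = 1)

• Alignment:
taxon_1 CCAT
taxon_2 CCGT
• Tree: taxon_1 and taxon_2 joined by a single branch

• Likelihood going from taxon_1 to taxon_2:
$L = \pi_C P_{C \to C} \times \pi_C P_{C \to C} \times \pi_A P_{A \to G} \times \pi_T P_{T \to T}$
$= 0.4 \times 0.983 \times 0.4 \times 0.983 \times 0.1 \times 0.007 \times 0.2 \times 0.979 \approx 0.0000212$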
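The same calculation as a short script, multiplying one term per alignment column. It uses the state frequencies exactly as listed for the model (so πT = 0.2) and only the three change probabilities quoted in the example; the rest of the matrix is not known here and is left out. With these inputs the product comes to roughly 2.1 × 10^-5.

```python
import math

FREQS = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}

# (starting nucleotide, ending nucleotide) -> probability.
# Only the entries quoted in the example; all other entries are unknown here.
P_CHANGE = {("C", "C"): 0.983, ("A", "G"): 0.007, ("T", "T"): 0.979}


def pair_likelihood(seq_from, seq_to):
    """Product over sites of (frequency of start state) x (probability of the change)."""
    return math.prod(FREQS[a] * P_CHANGE[(a, b)] for a, b in zip(seq_from, seq_to))


lik = pair_likelihood("CCAT", "CCGT")
print(f"{lik:.3e}")
```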
Jukes & Cantor (1969) = JC69 model
• simplest substitution model
• equal state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25

• Single parameter rate matrix:

• How can we make this model more complex?

Felsenstein (1981) = F81 model
• unequal frequencies: πA ≠ πC ≠ πG ≠ πT

• Single parameter rate matrix (same as JC69):

Jukes & Cantor (1969) = JC69 model
• simplest substitution model
• state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25 (equal)

• Single parameter rate matrix:

• How can we make this model more complex?

Transitions and transversions
• Two classes of nucleotides
• purines (A and G) – two-ring
• pyrimidines (C, T and U) – one-ring
• Mutations from purine to purine (A ↔ G) and from pyrimidine to pyrimidine (C ↔ T) = TRANSITIONS
• Mutations from pyrimidine to purine (or vice versa) = TRANSVERSIONS
• Transitions are more likely than transversions due to similarities in ring structure
[Figure: adenine (purine), guanine (purine), cytosine (pyrimidine), thymine (pyrimidine)]
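The purine/pyrimidine rule translates directly into code. This sketch classifies the difference at each position of two aligned sequences; the function names are made up for illustration, and positions that are identical are simply not counted.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify(a, b):
    """Classify the difference between two nucleotides."""
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_changes(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify(a, b)
        if kind in counts:
            counts[kind] += 1
    return counts


print(classify("A", "G"), classify("C", "T"), classify("A", "C"))
print(count_changes("ACGT", "GATT"))
```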
Kimura (1980) = K80 model
• equal state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
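For reference, the K80 rate matrix is conventionally written as below. Rows are the starting nucleotide and columns the ending nucleotide, in the order A, C, G, T (the ordering is an assumption of this sketch, not part of the model). Transitions (A ↔ G, C ↔ T) get rate α, transversions get rate β, and each diagonal entry is chosen so that its row sums to zero. Setting α = β gives the single-parameter JC69 matrix.

```latex
% Rows = starting nucleotide, columns = ending nucleotide, order A, C, G, T (assumed)
Q_{\mathrm{K80}} =
\begin{pmatrix}
-(\alpha + 2\beta) & \beta & \alpha & \beta \\
\beta & -(\alpha + 2\beta) & \beta & \alpha \\
\alpha & \beta & -(\alpha + 2\beta) & \beta \\
\beta & \alpha & \beta & -(\alpha + 2\beta)
\end{pmatrix}
```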
Hasegawa, Kishino & Yano (1985) = HKY85 model
• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
Several more…
General Time Reversible (GTR) model
• Most complex well-known model – very widely used

• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT

• Rate matrix has six parameters (a-f) , one for each possible
substitution type:
Overview of commonly used models
[Figure: models ordered from the simplest model (fewest parameters) to the most complex model (most parameters)]
Rate variation
• By comparing sequences, we can see that some regions of an
alignment are more likely to change than others
• These basic models don’t allow this – they assume that the rate of evolution
is the same across the entire alignment
• There are also variants of all of these models that let the rate of evolution
vary between different regions of the sequence alignment - we won’t
discuss the details here, but they are indicated by + I, + G, or + I + G

• Examples: JC69 + I, HKY + G, GTR + I + G

The agony of choice…
• We have lots of models, from simple to complex – which one should
we use?
• Different models might give different answers (i.e. different trees)
with the same data
• Is a complex model necessarily better?
Model selection
• Which is the “best” model here?
Model selection
• under-fit – too few parameters: model has too high bias (model does not adequately reflect underlying relationship of the training data)
• good compromise between bias and variance
• over-fit – too many parameters: model has too high variance (model overly affected by noise in the training data)
Model selection
• All models represent a trade-off between bias and variance
• we want to use a model that minimises both
Model selection
• We can use formal statistical methods to decide which model to use
• Commonly used model selection criteria:
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• As far as you’re concerned, the details don’t really matter! What’s important is that these are objective methods for identifying an appropriate model
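For the curious, both criteria are simple formulas that penalise the log-likelihood by the number of free parameters: AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, with lower values preferred. In the sketch below the log-likelihood values are invented for illustration only. The parameter counts cover just the substitution model (branch lengths are left out because they are shared by all three models), and the alignment length is used as the sample size n, which is a common convention rather than a rule.

```python
import math


def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n_sites) - 2 * log_lik


# model -> (log-likelihood, free substitution-model parameters)
# The log-likelihoods are HYPOTHETICAL values, not results from real data.
candidates = {
    "JC69": (-2500.0, 0),
    "HKY85": (-2450.0, 4),   # 3 free base frequencies + transition/transversion ratio
    "GTR": (-2448.0, 8),     # 3 free base frequencies + 5 free relative rates
}
n_sites = 1000  # hypothetical alignment length

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
print(best_aic, best_bic)
```

In this made-up case GTR fits slightly better than HKY85, but not by enough to justify its four extra parameters, so both criteria pick HKY85.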
Calculating support
• How well supported are the monophyletic groups/clades in our
maximum likelihood tree by our data?
• if one clade is more strongly supported by our data, we should probably have
more confidence that it is “correct”

• Several different measures of support, but bootstrapping is by far the

most commonly used method
• Bootstrapping = a resampling method
• Bootstrap values can be between 0 and 100%
• A bootstrap value of >70% is usually considered “strong” support
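The resampling step can be sketched as follows: each bootstrap replicate is a new alignment of the same length, built by drawing columns from the original alignment at random with replacement. Building a tree from each replicate is not shown; `support` simply assumes each replicate tree has already been reduced to a set of clades. Function names and the example alignment layout are illustrative assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def support(clade, replicate_trees):
    """Percentage of replicate trees (each a set of clades) that contain the clade."""
    hits = sum(clade in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)


alignment = {
    "Sequence 1": "CG-AACTCGA",
    "Sequence 2": "CC-AACTCGA",
    "Sequence 3": "CCGAACTCGA",
    "Sequence 4": "CCGAAC-CGA",
}
replicate = bootstrap_alignment(alignment, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

A clade recovered in, say, 3 out of 4 replicate trees would get a bootstrap value of 75%; real analyses typically use hundreds of replicates.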
Basic strategy for maximum likelihood
analysis of sequence data
1. Collect your sequence data
2. Align it
3. Identify best-fitting model of sequence evolution
4. Analyse your sequence data to find the maximum likelihood (ML)
tree assuming the best-fitting model
5. Use bootstrapping to calculate support values for the clades in the
ML tree
Practical
• Align a set of mitochondrial sequences from different modern
humans from around the world
• Make a phylogeny of modern humans using Maximum Likelihood
Human Genetics
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Out of Africa” hypothesis
• Archaic hominins dispersed out of Africa >1
million years ago
• AMHs originated in Africa and some
dispersed out of Africa 100-200k years ago
• AMHs completely replaced (and did not
interbreed with) archaic hominins
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Multiregional” hypothesis
• Archaic hominins dispersed out of Africa
>1 million years ago
• The different archaic hominin
populations (e.g. Neanderthals in
Europe, Homo erectus in Asia) evolved
into AMHs, possibly with some
interbreeding/geneflow between
populations
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
Enter genetic data…

Cann et al. (1987), Nature
• Analysis of mitochondrial genomes from 147 modern
humans
• Mitochondrial DNA:
• inherited maternally only
• does not undergo recombination
• Conclusion of paper:
• African origin of modern humans
• All the mitochondrial lineages can be traced to a female
common ancestor who lived 140k-290kya (kya = thousand
years ago)
• Split between Africans and non-Africans 62-225kya
Your phylogeny from last practical (which is based
on mitochondrial sequence data) should have
found a similar result!
• Greater genetic diversity
among modern Africans
than modern non-Africans
• All non-Africans should
form a single group to the
exclusion of some African
populations
• This group would be the
descendants of the humans
who left Africa

“Out of Africa”
dispersal event
“Mitochondrial Eve” = most recent common ancestor
of all modern human mitochondrial sequences
• IMPORTANT!
• The “mitochondrial Eve” was not
the only female alive at that time!
• Other females were present, but
their mitochondrial lineages have
not survived to the present day

Newsweek cover 1988

See http://www.virginia.edu/woodson/courses/aas102%20(spring%2001)/articles/tierney.html for the original story!
“Mitochondrial Eve”
• Cann et al. (1987) was highly controversial at the time, but recent
studies of mitochondrial genomes and Y chromosome (males only)
have reached similar conclusions

Fu et al. (2013 – Current Biology)
• “Mitochondrial Eve”: 120–197 kya
• Split between Africans and non-Africans: 62.4–94.9 kya

Poznik et al. (2013 – Science)
• “Mitochondrial Eve”: 99–148 kya
• Y-chromosome “Adam”: 120–156 kya
These studies appear to support “Out of
Africa” hypothesis

• Mitochondrial “Eve” and Y-chromosome “Adam” both ~100-200 kya
• This is much younger than oldest non-African archaic hominins
• Split between Africans and non-Africans <200 kya
• suggests a relatively late dispersal of AMHs out of Africa
Enter ancient DNA…
Sequencing ancient biomolecules
• DNA is a very stable molecule
• It needs to be because it is the information store for the cell!
• DNA can be damaged by free radicals, radiation (e.g. UV) etc.
• DNA damage: breaks in the DNA strands, modification of the bases
• Cells have complex mechanisms to repair damaged DNA
• When a cell dies, these mechanisms stop functioning → DNA
becomes increasingly damaged
• DNA also degraded by microorganisms after death
After ~500 years, 50% of mt DNA has been lost
Can we successfully sequence DNA from the
remains of organisms that are hundreds or
thousands of years old or even older?
Svante Pääbo
• Problems:
• DNA degrades after death → left with
small numbers of fragments
• contaminating DNA (e.g. human DNA
from researchers, bacterial DNA) will
be far more common than the
ancient DNA → standard PCR is likely
to amplify more of the contaminating
DNA than the ancient DNA
Can we successfully sequence DNA and proteins
from the remains of organisms that are hundreds
or thousands of years old or even older?
• Solutions:
• use ultra-clean rooms (filtered air systems,
UV light etc.) when extracting DNA
• use “next generation” sequencing
• Doesn’t require PCR
• Can sequence all DNA fragments present in a
sample
• use methods that can sequence very short
pieces of DNA (<75bp) – ancient DNA usually
very fragmentary
• look for distinctive modifications to bases
that are seen in ancient DNA but not in
modern DNA (e.g. cytosine deamination)
There has been enormous progress made in
the field of ancient DNA – it’s a hot topic!

• Number of publications mentioning ancient DNA:
• 1995-2000: 1450
• 2001-2005: 2870
• 2006-2010: 6600
• 2011-2015: 10200
• 2016-2020: 16100
• 2021-present: 10100
So what do these ancient genomes tell us
about human evolution…?
1997: partial mitochondrial sequence from a
Neanderthal - 40,000 years old

• Neanderthal mtDNA falls outside

mtDNA variation of modern humans
• Neanderthal mtDNA not closer to e.g.
Europeans than any other modern
human
• suggests no interbreeding/geneflow
between Neanderthals and AMHs
2008: entire mitochondrial genome from a
Neanderthal - 38,000 years old, 16.5 kb

• Complete mt genome from Neanderthal

• Similar findings to the 1997 paper:
• Neanderthal mtDNA falls outside mtDNA
variation of modern humans
• divergence between Neanderthals and
modern humans ~660 kya
• Further support for “Out of Africa”
But then…
Green et al. (2010): entire nuclear genome from
a Neanderthal - 38,000 years old, 4000 Mb
• Neanderthal genome sequenced using “next generation sequencing”
• Neanderthal genome compared with five modern humans from different regions of the world:
• Africans: Yoruba, San
• non-Africans: French, Papua New Guinean, Han Chinese
Green et al. (2010): entire nuclear genome from
a Neanderthal - 38,000 years old, 4000 Mb
• Neanderthal genome more similar to the three non-African genomes than to the African genomes → strongly suggests geneflow (“introgression”) from Neanderthals to non-Africans
• estimated 1-4% of nuclear genome of non-Africans is from Neanderthals
• four possible explanations for the observed pattern – option 3 fits the data the best
Then a real surprise!
• Partial little finger bone of a hominin from Denisova cave in Siberia, >50 kya (the “Denisovans”)
• besides their genomes, we know almost nothing about them!
• Only remains are fragments of bone and teeth
• Nuclear genome suggests that Denisovans are more closely related to Neanderthals than to modern humans
• split of Denisovan+Neanderthal lineage from modern humans ~800 kya
• split between Denisovans and Neanderthals ~650 kya
[Figure labels: Neanderthals; modern humans]
The biggest surprise…
• 3-6% of the genome of New Guineans and Aboriginal Australians is from
Denisovans!
• strong evidence of interbreeding
• 0.2% of the genome of Asians and Native Americans is from Denisovans
• either small amount of interbreeding with Denisovans, or geneflow from New
Guinean/Aboriginal populations
• European and African populations lack evidence of geneflow from
Denisovans
Distribution of Denisovan DNA in modern non-African populations (red = greater proportion)
pre-genomic hypotheses
Post-genomic hypothesis
The “Assimilation Model”
• Current consensus view based on
genomic data:
• AMHs originated in Africa ~300 kya
• Majority of genetic diversity among non-African modern humans is the result of a relatively recent (50-200 kya) dispersal from Africa
• BUT multiple small but significant
contributions by archaic hominins
(Neanderthals, Denisovans, possibly others)
to non-African populations
Important!

• Most of these findings are quite recent (2008 and later) – details still being
worked out, and may change with new discoveries/sequencing of more
archaic hominin genomes

• Results are strongly influenced by assumptions about population size,

mutation rate, amount of geneflow etc.
• almost certainly massive oversimplifications
• probably many, many cases of interbreeding between different hominins

• Many gaps in the story, e.g. what was going on in Africa? Some evidence of
geneflow from archaic hominins, but relatively few ancient genomes from
Africa (DNA degrades faster at higher temps)
FIVE MINUTE BREAK!
Molecular clocks and human evolution
Molecular clocks
• The branching pattern of an evolutionary tree
indicates relationships – which taxa (individuals,
populations, species, genera, families etc.) are more
closely related than others
The closest living
relatives of humans
are chimps

Our taxa (in this case,

species of primate)
Molecular clocks
• The branching pattern of an evolutionary tree
indicates relationships – which taxa (individuals,
populations, species, genera, families etc.) are more
closely related than others
• It would also be useful to know when particular taxa
diverged from each other (i.e. the ages of the nodes)
Figure labels: when did the split between humans and chimps occur?; our taxa (in this case, species of primate)
Molecular clocks

• Some uses of divergence dates:

• comparing the timing of divergences to other events (e.g.
geological events, changes in climate etc.) to identify possible
causal links
• determining when particular adaptations originated (e.g.
large brains of humans)
• determining when particular diseases originated (e.g. HIV)
The molecular clock
• Basic principle proposed by Zuckerkandl and
Pauling in 1962
• found that the number of amino acid differences
in haemoglobin of different taxa corresponded to
estimated divergence time (based on the fossil
record)
• Zuckerkandl and Pauling (1965):
Alternative approach – the molecular clock

In this hypothetical example, the rate of change is 1 substitution per 10 nucleotides per 25 million years
• Underlying assumption:
• rates of molecular evolution are constant
through time and between lineages
• if so, the amount of molecular change that
has occurred along a branch is proportional
to the length of time that branch represents
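The strict-clock assumption can be sketched in a few lines of Python. This is only an illustration using the hypothetical rate from the slide (1 substitution per 10 nucleotides per 25 million years); the function name `branch_time_myr` is made up for this example.

```python
# Strict molecular clock sketch using the hypothetical rate from the slide:
# 1 substitution per 10 nucleotides per 25 million years.
RATE = (1 / 10) / 25  # substitutions per site per million years (0.004)


def branch_time_myr(substitutions, sites, rate=RATE):
    """Time (in millions of years) represented by a branch, assuming a constant rate."""
    if sites <= 0 or rate <= 0:
        raise ValueError("sites and rate must be positive")
    change_per_site = substitutions / sites
    return change_per_site / rate


# Under a constant rate, twice as much molecular change means twice as much time.
short_branch = branch_time_myr(1, 10)  # about 25 million years
long_branch = branch_time_myr(2, 10)   # about 50 million years
print(short_branch, long_branch)
```

If rates vary through time or between lineages, this simple proportionality no longer holds, which is why relaxed clock models are used in practice.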
Figure (two panels): cytochrome c; axis label: estimated from the fossil record
Relative and absolute divergence times
• If you assume a molecular clock, then branch lengths that
represent the amount of molecular change can be used to
estimate relative divergence times
• Example: if one branch is twice as long as another, it represents
twice as much time

• But how much time is this in an absolute sense, i.e. how many
years, or thousands of years, or millions of years?
Relative and absolute divergence times

• We need to calibrate the divergence times – convert relative times to absolute times
• normally done using fossil evidence – estimate absolute age of one
or more nodes based on fossil evidence, then use this to calculate
how much molecular change occurs per unit time (e.g. change per
year, change per million years)
• this estimate of molecular change per unit time can then be used to
calculate the ages of the other nodes
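The calibration logic above can be sketched as follows, assuming a strict clock. All numbers here are invented for illustration (the node depths and the 6-million-year calibration age are not taken from any real analysis), and the function names are hypothetical.

```python
def calibrate_rate(node_depth, fossil_age_myr):
    """Molecular change per million years, from one node with a known (fossil-based) age."""
    if fossil_age_myr <= 0:
        raise ValueError("calibration age must be positive")
    return node_depth / fossil_age_myr


def node_age_myr(node_depth, rate):
    """Absolute age of a node, given its depth and the calibrated rate."""
    return node_depth / rate


# Hypothetical node depths: substitutions per site between each node and the present.
node_depths = {"calibrated_node": 0.030, "node_A": 0.015, "node_B": 0.060}

# Assume (for illustration only) that fossils date the calibrated node to 6 million years.
rate = calibrate_rate(node_depths["calibrated_node"], 6.0)

# Use the calibrated rate to convert the other relative depths to absolute ages.
ages = {name: node_age_myr(depth, rate) for name, depth in node_depths.items()}
print(rate, ages)
```

In this toy case node_A is half as deep as the calibrated node and so comes out at about half its age; real analyses use minimum and maximum bounds on calibrations rather than a single point age.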
Fossil calibrations relevant to humans (from
de Vries and Beck, 2023)
• Split between humans and chimps:
• calibrating fossil: Ardipithecus – more
closely related to humans than to chimps
• minimum date: 4.631 million years ago
(minimum age of Ardipithecus)
• maximum date: 15 million years ago (lots
of fossil apes in Africa, but none belong to
the human or chimp lineages)
Fossil calibrations relevant to humans (from
Benton et al., 2015)
• Split between Homo sapiens (modern humans) and Homo neanderthalensis (Neanderthals):
• calibrating fossil: oldest known Homo neanderthalensis fossils
• minimum date: 0.2 million years ago (age of oldest known Homo neanderthalensis fossils)
• maximum date: 1 million years ago (lots of Homo fossils, but no modern humans or Neanderthals)
Standard approach for molecular clock
analysis
1. Collect sequence data
2. Align it
3. Determine appropriate model of sequence evolution
4. Identify appropriate fossil calibration(s)
5. Use (relaxed) molecular clock model to calculate divergence times
– today’s exercise!
So…
• Now we have nicely aligned sequence data

• How do we use it for phylogenetic analysis?

• model-based methods – standard approach for analysing molecular

• Likelihood = probability of the data given a particular model

P(D|H)
R. A. Fisher
(1890-1962)

• Method developed by Ronald A. Fisher, the “father of modern statistics”

Maximum Likelihood

• Likelihood = probability of the data given a particular model

• To summarise: maximum likelihood analysis = find the most likely

• We will focus on nucleotide (DNA/RNA) sequence data as it’s simpler

• What about if our data is a sequence of two nucleotides, AC?

• What is the probability of the sequence of an entire human genome

• Alignment:
taxon_1 CCAT
taxon_2 CCGT
• Tree: taxon_1 taxon_2
• Likelihood going from taxon_1 to taxon_2 = πC × PC→C × πC × PC→C × πA × PA→G × πT × PT→T
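The two-taxon likelihood above can be worked through numerically. The sketch below assumes the JC69 model (equal frequencies of 0.25 and one rate for all substitutions) and uses its standard closed-form substitution probabilities; the function names and the simple grid search over branch lengths are choices made for this example only.

```python
import math


def jc69_prob(x, y, d):
    """JC69 probability of going from base x to base y along a branch of length d
    (d = expected substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def pairwise_likelihood(seq1, seq2, d, freq=0.25):
    """Product over sites of (frequency of starting base) x (substitution probability)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    likelihood = 1.0
    for x, y in zip(seq1, seq2):
        likelihood *= freq * jc69_prob(x, y, d)
    return likelihood


def best_branch_length(seq1, seq2, step=0.001, max_d=2.0):
    """Crude maximum likelihood search: try many branch lengths, keep the best."""
    best_d, best_like = 0.0, pairwise_likelihood(seq1, seq2, 0.0)
    for i in range(1, int(round(max_d / step)) + 1):
        d = i * step
        like = pairwise_likelihood(seq1, seq2, d)
        if like > best_like:
            best_d, best_like = d, like
    return best_d, best_like


ml_d, ml_like = best_branch_length("CCAT", "CCGT")
print(ml_d, ml_like)
```

For this alignment (one difference in four sites) the search settles on a branch length of roughly 0.30 substitutions per site. Real software optimises the tree topology and all branch lengths together, and usually works with log-likelihoods to avoid very small numbers.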

• Single parameter rate matrix:

• How can we make this model more complex?

Felsenstein (1981) = F81 model
• unequal frequencies: πA ≠ πC ≠ πG ≠ πT

• Single parameter rate matrix (same as JC69):

Jukes & Cantor (1969) = JC69 model
• simplest substitution model
• state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25 (equal)

• Single parameter rate matrix:

• How can we make this model more complex?

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
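To see what the two rate classes refer to, the sketch below sorts observed differences into transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse). The function names are invented for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def classify_change(x, y):
    """Label a pair of aligned bases as identical, a transition, or a transversion."""
    x, y = x.upper(), y.upper()
    if x not in VALID or y not in VALID:
        raise ValueError("only A, C, G and T are handled in this sketch")
    if x == y:
        return "identical"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"      # A<->G or C<->T
    return "transversion"        # any purine<->pyrimidine change


def count_changes(seq1, seq2):
    """Count each class of site across two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    counts = {"identical": 0, "transition": 0, "transversion": 0}
    for x, y in zip(seq1, seq2):
        counts[classify_change(x, y)] += 1
    return counts


print(count_changes("CCAT", "CCGT"))
```

In the CCAT/CCGT alignment used earlier, the single difference (A versus G) is a transition.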
Hasegawa, Kishino & Yano (1985) = HKY85 model
• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
Several more…
General Time Reversible (GTR) model
• Most complex well-known model – very widely used

• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT

• Rate matrix has six parameters (a-f), one for each possible substitution type:
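A minimal sketch of how such a rate matrix can be built is given below. It follows the usual GTR construction (rate from base x to base y = exchangeability of that pair × frequency of y). Which of a-f goes with which pair of bases is an assumption made here (alphabetical order: AC, AG, AT, CG, CT, GT), and the function name is hypothetical.

```python
BASES = "ACGT"
# Assumed ordering of the six parameters a-f (not specified in the slides).
PAIRS = [("A", "C"), ("A", "G"), ("A", "T"), ("C", "G"), ("C", "T"), ("G", "T")]


def gtr_rate_matrix(exchangeabilities, freqs):
    """Build a GTR rate matrix as a nested dict: q[x][y] = rate of change from x to y."""
    if len(exchangeabilities) != 6:
        raise ValueError("GTR needs six exchangeability parameters")
    if abs(sum(freqs[b] for b in BASES) - 1.0) > 1e-9:
        raise ValueError("state frequencies must sum to 1")
    r = {}
    for (x, y), value in zip(PAIRS, exchangeabilities):
        r[(x, y)] = value
        r[(y, x)] = value  # symmetric, which makes the model time reversible
    q = {x: {} for x in BASES}
    for x in BASES:
        total = 0.0
        for y in BASES:
            if x != y:
                q[x][y] = r[(x, y)] * freqs[y]
                total += q[x][y]
        q[x][x] = -total  # each row sums to zero
    return q


equal = {b: 0.25 for b in BASES}
unequal = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}

# Simpler models are special cases of GTR:
jc69 = gtr_rate_matrix((1, 1, 1, 1, 1, 1), equal)          # one rate, equal frequencies
alpha, beta = 2.0, 1.0                                      # transitions, transversions
hky85 = gtr_rate_matrix((beta, alpha, beta, beta, alpha, beta), unequal)
print(hky85["A"])
```

Setting all six parameters equal with equal frequencies recovers JC69, and using one value for transitions (AG, CT) and another for transversions with unequal frequencies recovers HKY85.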
Overview of commonly used models
simplest model
(fewest parameters)

most complex model

• Examples: JC69 + I, HKY + G, GTR + I + G

• Commonly used model selection criteria: As far as you’re concerned, the

• Several different measures of support, but bootstrapping is by far the

• Current evidence suggests that humans split from chimps ~5-7 million years ago
• Modern humans (Homo sapiens) plus any taxon more closely related to modern humans than to chimps (Pan troglodytes and P. paniscus) = hominins
Figure labels: split between humans and chimps; hominins
First…some terminology

• Precise taxonomy of fossil hominins is extremely controversial and in flux (in part because of new genomic data!)
Figure: selection of archaic hominins
• To keep things simple, we will
distinguish between:
• anatomically modern humans (AMHs) –
fossil hominins with essentially the same
anatomy as present day humans – they
look basically the same as me and you
• archaic hominins – fossil hominins with
clearly different anatomy from present
day humans
Archaic hominins
Figure labels: modern human; Neanderthal; Neanderthal reconstructions
• Archaic hominins differ from AMHs in several ways:
• in general, the skull appears more heavily built (e.g. large brow ridges), and chin is absent
• earliest archaic hominins small (1.5 m or less) with brains only slightly larger than those of chimps
• Neanderthals similar in height to AMHs, but more robust and on average had slightly larger brains
Everything on the “human” side of the tree = hominins
split between
humans and chimps
is about here

hominins
Beginnings
• Charles Darwin argued that the earliest
phases of hominin evolution probably
occurred in Africa:
“In each great region of the world the living
mammals are closely related to the extinct
species of the same region. It is, therefore,
probable that Africa was formerly inhabited by
extinct apes closely allied to the gorilla and
chimpanzee; and as these two species are now
man's nearest allies, it is somewhat more
probable that our early progenitors lived on
the African continent than elsewhere.”
- The Descent of Man (1871)
Charles Darwin (1809-1882)
But… the first fossils of ancient human
relatives were discovered in Europe
Neanderthal 1
• “Neanderthal Man”
• First fossil (“Neanderthal 1”) found
in northern Germany in 1856 and
named Homo neanderthalensis in 1864
The 20th century saw the discovery of many
fossil human remains
“Consensus” view
• Split between hominins and chimps occurred >5 million years ago
• Earliest phases of hominin evolution occurred in Africa
• At least one African hominin lineage (Australopithecus -> Homo habilis -> Homo
erectus) became increasingly “human-like” through time (more upright stance, larger
brain, use of tools etc.)
• At least one hominin lineage dispersed out of Africa >1.8 million years ago

Figure: oldest hominins from outside Africa (Homo erectus, 1.8 million years old, Georgia)
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Multiregional” hypothesis
• Archaic hominins dispersed out of Africa
>1 million years ago
• The different archaic hominin
populations (e.g. Neanderthals in
Europe, Homo erectus in Asia) evolved
into AMHs, possibly with some
interbreeding/geneflow between
populations
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Out of Africa” hypothesis
• Archaic hominins dispersed out of Africa >1
million years ago
• AMHs originated in Africa and some
dispersed out of Africa 100-200k years ago
• AMHs completely replaced (and did not
interbreed with) archaic hominins
Big debate from the 1980s until the 2010s!
Prediction – what would the phylogeny of modern humans
look like if the multi-regional hypothesis is correct?

• Genomes of modern Africans should show similar diversity to people from other parts of the world
Prediction – what would the phylogeny of modern humans
look like if the “Out of Africa” hypothesis is correct?

• There should be greater genetic diversity among modern Africans than modern non-Africans
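One simple way to put a number on "genetic diversity" is the average proportion of sites that differ between pairs of sequences within a group. The sketch below does this for two sets of toy sequences; the sequences are invented purely to illustrate the calculation and are not real human data, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def mean_pairwise_diversity(sequences):
    """Average p-distance over all pairs of sequences in a group."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignments (made up): group_1 is more variable than group_2.
group_1 = ["ACGTACGT", "ACGTTCGA", "TCGTACGA"]
group_2 = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]

diversity_1 = mean_pairwise_diversity(group_1)
diversity_2 = mean_pairwise_diversity(group_2)
print(diversity_1, diversity_2)
```

Under the “Out of Africa” prediction, a comparison of this kind on real data would be expected to give a larger value for African samples than for non-African samples.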
Prediction – what would the phylogeny of modern humans
look like if the “Out of Africa” hypothesis is correct?

• All non-Africans should form a single group to the exclusion of some African populations
Figure label: “Out of Africa” dispersal event
• This group would be the descendants of the humans
who left Africa
Testing this with mitochondrial sequence data
from modern humans
• Mitochondria have their own small
circular genome
• In humans, ~16 thousand base pairs
(versus 3 billion in the nuclear genome)
• Mitochondrial DNA:
• inherited maternally only
• does not undergo recombination
• Easy to sequence because mitochondria
occur in large numbers in each cell (so
multiple copies of the mitochondrial
genome)
Practical
• You will do a maximum likelihood analysis of your mitochondrial
sequences to see if they fit a “multi-regional” model or an “Out of
Africa” model better

Basic Approach:
1. Collect your sequence data (you did steps 1 and 2 last session!)
2. Align it
3. Identify best-fitting model of sequence evolution
4. Analyse your sequence data to find the maximum likelihood (ML) tree
assuming the best-fitting model
5. Use bootstrapping to calculate support values for the clades in the ML tree
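The resampling at the heart of step 5 can be sketched as below: each bootstrap pseudoreplicate is built by sampling alignment columns with replacement until it is as long as the original alignment. This shows only the resampling; inferring an ML tree from each pseudoreplicate and counting how often each clade appears is left to phylogenetics software. The function name and the toy alignment are made up for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudoreplicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    rng: a random.Random instance (pass a seeded one for reproducible results).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same length")
    # Sample column indices with replacement; the same columns are used for every taxon.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


toy_alignment = {"taxon_1": "CCAT", "taxon_2": "CCGT", "taxon_3": "TCGT"}
rng = random.Random(42)
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
print(replicates[0])
```

The support value for a clade is then the percentage of pseudoreplicate trees in which that clade is recovered.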

