Genetics

The document introduces the use of phylogenies in studying human evolution and adaptation, covering key concepts, methods, and the phylogeny of humans and their relatives. It emphasizes the importance of understanding evolutionary relationships through phylogenetics, including the use of molecular data and the concept of the 'Tree of Life'. Additionally, it discusses the significance of sequence alignment and the evolution of DNA and protein sequences in phylogenetic analysis.


Human Genetics

Human phylogenetics and evolution
Robin M. D. Beck
R.M.D.Beck@salford.ac.uk
Overall Aim

• To introduce you to the use of phylogenies to study human evolution and adaptation
Learning objectives

• To introduce you to:


• key phylogenetic concepts and terminology (= “tree thinking”)
• methods and models of phylogenetic analysis
• the phylogeny of humans and their relatives
• the “molecular clock”
• using phylogenies to study human evolution and adaptation
Lecture 1: An introduction to molecular phylogenetics
The ‘Tree of Life’
‘…the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with ever-branching and beautiful ramifications.’
- Charles Darwin (1859) ‘On the Origin of Species’
[Figure: the only figure in the first edition of ‘On the Origin of Species’ (1859)]

All organisms (living and extinct) on Earth are related to each other at one level or another, i.e. there is a hierarchy of relationships = ‘THE TREE OF LIFE’
‘The time will come I
believe…when we shall have
very fairly true genealogical
trees of each great kingdom of
nature’
- Charles Darwin in a letter to
Thomas Huxley (1857)

Charles Darwin (1809-1882)

1837 sketch by Charles Darwin


Figure from ‘Allgemeine
Entwickelungsgeschichte
der Organismen’ (1866)

Ernst Haeckel (1834-1919)

• Ernst Haeckel:
• Produced the first reconstruction of the entire ‘Tree of Life’
(1866)
• Coined the term ‘phylogeny’ (Ancient Greek: phyle = tribe, race;
genesis = birth) for the evolutionary relationships between
organisms
Why is phylogenetics important?
• To understand ‘our place in Nature’, and that of every other organism:

YOU ARE HERE

Tree of ~4500 species of extant mammal


(Bininda-Emonds et al., 2007)
• Frameworks for studying evolutionary patterns and processes:
• e.g. evidence for rapid increase in brain size in modern humans and their relatives (Hominini) – only identifiable because we know the phylogeny!
• Medical uses, e.g. origins of diseases:
• Origin of MERS
• Multiple origins of HIV
• Phylogeny of coronaviruses
• Phylogeny of SARS-CoV-2 virus (virus causes COVID-19)
• Account for non-independence when comparing taxa:
• Need to take into account the fact that some taxa are more closely related than others
• Standard statistical approaches that assume independence of data points are invalid for this kind of biological data!
• e.g. two groups of 20 close relatives – inappropriate to treat them as 40 independent data points

Felsenstein (1985 – The American Naturalist)


“Nothing in Biology Makes Sense Except in the Light of Evolution”
- Theodosius Dobzhansky (1900-1975)

Nothing in evolution makes sense except in the light of phylogeny!
Tree Thinking
• Phylogenetics has an underlying logic
and terminology that might be unfamiliar
to you…
Trees and Networks

• If organisms are able to exchange genes with each other, their relationships can be represented as a network
• e.g. interbreeding populations, hybrids, prokaryotes that engage in horizontal gene transfer
Trees and Networks
• If organisms are not able to
exchange genes with each other
their relationships can be
represented as a branching tree
• e.g. phylogenies of species that
cannot interbreed and higher
taxonomic levels (genera, families)
Tree terminology
• Trees are made up of nodes and branches
• an internal node represents the most recent common ancestor (MRCA) of the descendant branches
• x = MRCA of A+B
• y = MRCA of C+D+E
• MRCAs are usually hypothetical – it is difficult (impossible?) to identify a fossil as a genuine ancestor
[Figure labels: tip or leaf or terminal node; root or root node; terminal branch; internal node; internal branch or edge]
Tree terminology

• Trees can be either unrooted or rooted
• rooted trees have one node identified as the root (or root node) = the first divergence in the tree
• this provides a “direction” to the tree
[Figure: an unrooted tree and a rooted tree; root = position of first divergence]
Tree terminology

• A single unrooted tree can be rooted in multiple different positions
• Different rooting positions will result in different relationships
[Figure: unrooted tree]
Rooting trees
• Usually, a tree is rooted with an outgroup (or multiple outgroups)
• outgroup taxa = taxa that we already know are definitely more distantly
related than the other taxa (= the ingroup)
• root is between the outgroup taxon (or taxa) and the remaining ingroup taxa

We know a priori that the kangaroo (a marsupial mammal) is more distantly related than the other taxa (which are all placental mammals), and so it can be used as an outgroup to root the tree
Definition of relatedness

• Two taxa (species, families etc.) are more closely related to each other than either is to another taxon if they share a more recent common ancestor
Definition of relatedness – an example
• the mouse and the human are more closely related to each other than they are to the lizard because they share a more recent common ancestor (indicated by the purple dot)
• The mouse, human and lizard are more closely related to each other than they are to the frog because they share a more recent common ancestor (indicated by the blue dot)
[Figure: tree of fish, frog, lizard, mouse and human]
Tree Thinking quiz
(taken from Baum et al., 2005 – Science)
IMPORTANT!
Trees are like mobiles…

[Figure: the same tree drawn three ways, with the tips in the order A B C D = D C B A = B C D A]

What is important is the groupings that are specified, not the precise order in which the taxon labels occur
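
The "mobile" idea can be checked in a few lines: if a tree is written as nested groups, rotating the branches around any node leaves the set of groupings (clades) unchanged. The sketch below assumes a hypothetical topology in which A is sister to (B, (C, D)), which is one tree that is consistent with all three tip orders shown; the function name `clades` is illustrative only.

```python
def clades(tree):
    """Return (set of clades, tips below this node) for a nested-tuple tree."""
    if isinstance(tree, str):            # a tip label
        return set(), frozenset([tree])
    found, tips = set(), frozenset()
    for child in tree:
        child_clades, child_tips = clades(child)
        found |= child_clades
        tips |= child_tips
    found.add(tips)                      # the grouping defined by this node
    return found, tips

# Assumed topology: A is sister to (B, (C, D)), drawn with three tip orders.
tree_1 = ("A", ("B", ("C", "D")))        # tip order A B C D
tree_2 = ((("D", "C"), "B"), "A")        # tip order D C B A
tree_3 = (("B", ("C", "D")), "A")        # tip order B C D A
# A genuinely different tree that happens to share the tip order A B C D.
tree_4 = (("A", "B"), ("C", "D"))

groups = [clades(t)[0] for t in (tree_1, tree_2, tree_3, tree_4)]
print(groups[0] == groups[1] == groups[2])   # rotated drawings: same groupings
print(groups[0] == groups[3])                # same tip order, different groupings
```

The last comparison is the point of the slide: an identical left-to-right order of labels does not imply identical relationships.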
Tree Thinking quiz
(taken from Baum et al., 2005 – Science)
Different types of data for producing
phylogenies
• Morphological or other phenotypic data:
• only kind of data available for fossils (if ancient DNA or proteins not preserved)
• Molecular data:
• Today, the most commonly used form of molecular data is sequence data –
either nucleotide sequence data (DNA or RNA) or protein sequence data
(amino acids)
DNA sequence data

• Double-helix
• specific sequence of four bases
• adenine = A
• cytosine = C
• guanine = G
• thymine = T

• Example: ATGCGCAGTTATTGCGAT
RNA sequence data
• same bases as DNA except uracil (U) instead of thymine, i.e. A, C, G and U
• usually produced using DNA as a template (= transcription) ∴ same sequence as the DNA coding strand, i.e. the strand complementary to the template (but U instead of T)

• Example: ATGCGCAGTTATTGCGAT (DNA coding strand)
↓ transcription
AUGCGCAGUUAUUGCGAU (RNA)
Protein sequence data
• Proteins are made up of sequences of 20-22 different types of amino acids
• Sequence of amino acids in a protein is specified by DNA sequence of the gene coding for that protein, e.g. BRCA1 gene codes for BRCA1 protein
• Example: ATGCGCAGTTATTGCGAT (DNA)
↓ transcription
AUGCGCAGUUAUUGCGAU (RNA)
↓ translation
Met,Arg,Ser,Tyr,Cys,Asp (amino acid sequence for protein)
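
The two arrows in this example can be reproduced with a small script. This is a minimal sketch that uses the standard genetic code; the function names `transcribe` and `translate` are illustrative, and the DNA is assumed to be given as the coding strand, read from its first base.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "STOP",
}

def transcribe(dna_coding_strand):
    """RNA has the same sequence as the coding strand, with U instead of T."""
    return dna_coding_strand.replace("T", "U")

def translate(rna):
    """Read the RNA three bases at a time and look up each codon."""
    dna = rna.replace("U", "T")          # the lookup table is keyed on DNA codons
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return ",".join(THREE_LETTER[CODON_TABLE[c]] for c in codons)

rna = transcribe("ATGCGCAGTTATTGCGAT")
print(rna)              # AUGCGCAGUUAUUGCGAU
print(translate(rna))   # Met,Arg,Ser,Tyr,Cys,Asp
```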
Sequencing
• Modern sequencing technology (“next-generation sequencing”, or NGS) means that large amounts of sequence data can be produced cheaply and quickly
• Human Genome Project (1990-2003): US$3,000,000,000
• Sequencing a human genome, January 2014: US$1,000
Illumina MiSeq – Salford has one of these!
Obtaining sequences
• Published molecular sequences (nucleotides and proteins) freely
available for download from numerous online databases (e.g. GenBank,
UniProt, European Nucleotide Archive…)
How do DNA sequences evolve?
• Types of DNA mutation:
• Substitution = replacement of one base by another, e.g. G by C
• Insertion = addition of a new base somewhere in the sequence
• Deletion = loss of an existing base somewhere in the sequence
• If the DNA is a coding sequence, these mutations will be present in the RNA transcribed from it
How do protein sequences evolve?

• Amino acid sequence of a protein is specified by the DNA sequence of the gene that codes for that protein
• each amino acid is specified by a triplet of nucleotides (= codon)
• AAA = lysine (Lys or K)
• GCT/GCU = alanine (Ala or A)
• GTG/GUG = valine (Val or V)
• TAA/UAA = STOP
How do protein sequences evolve?
DNA translation table
RNA translation table
How do protein sequences evolve?
• Changes (substitutions, insertions, deletions) in the DNA sequence of a
protein-coding gene can alter the amino acid sequence of the protein

• But…genetic code is “degenerate”
• multiple codons code for same amino acid → change in DNA sequence doesn’t always result in a change in amino acid sequence
• Leucine coded for by 6 different codons
• Alanine coded for by 4 different codons
• STOP coded for by 3 different codons
• Methionine coded for by 1 codon only
• …
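
These codon counts can be verified by tallying the standard genetic code. A short sketch, with the code table built from the conventional TCAG ordering:

```python
from collections import Counter
from itertools import product

# Standard genetic code: 64 codons in TCAG order, one letter per amino acid,
# with "*" marking a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# How many different codons specify each amino acid (or STOP)?
codons_per_amino_acid = Counter(CODON_TABLE.values())

for name, letter in [("Leucine", "L"), ("Alanine", "A"), ("STOP", "*"), ("Methionine", "M")]:
    codons = sorted(c for c, aa in CODON_TABLE.items() if aa == letter)
    print(name, codons_per_amino_acid[letter], codons)
```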
How do protein sequences evolve?
• DNA mutation in protein-coding gene that does not change amino acid sequence of protein = synonymous
• DNA mutation in protein-coding gene that does change amino acid sequence of protein = non-synonymous
Question: which evolves faster? A DNA sequence coding for a protein,
or the amino acid sequence of that protein?
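
The synonymous/non-synonymous distinction can be made concrete by mutating one base of a codon and comparing the amino acid before and after. A minimal sketch using the standard genetic code; the helper name `classify_substitution` and the two example mutations are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(codon, position, new_base):
    """Replace one base (position 0, 1 or 2) and report the effect on the protein."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    unchanged = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return mutant, "synonymous" if unchanged else "non-synonymous"

# GCT (alanine) -> GCC is still alanine; GCT -> GTT becomes valine.
print(classify_substitution("GCT", 2, "C"))   # ('GCC', 'synonymous')
print(classify_substitution("GCT", 1, "T"))   # ('GTT', 'non-synonymous')
```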
Comparing sequences
• Let’s say we want to compare four DNA sequences
• If we line them up:
           0000000001
           1234567890
Sequence 1 CGAACTCGA
Sequence 2 CCAACTCGA
Sequence 3 CCGAACTCGA
Sequence 4 CCGAACCGA
• Are we comparing homologous nucleotides across all sequences?
Comparing sequences
           0000000001
           1234567890
Sequence 1 CG-AACTCGA
Sequence 2 CC-AACTCGA
Sequence 3 CCGAACTCGA
Sequence 4 CCGAAC-CGA
• We need to insert gaps into sequences to account for insertions and deletions
• This procedure is called alignment
Alignment is absolutely critical when comparing molecular sequences!
Sequence alignment
• Critical step when comparing sequences – otherwise you risk getting
incorrect results

Unaligned (0% sequence similarity):
Sequence 1: ACGTGCTAGACTATGTGTC
Sequence 2: CGTGCTAGACTATGTGTC

Aligned (100% sequence similarity):
Sequence 1: ACGTGCTAGACTATGTGTC
Sequence 2: -CGTGCTAGACTATGTGTC
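
The 0% and 100% figures can be reproduced by comparing the two sequences position by position. A minimal sketch; `percent_identity` is an illustrative helper that skips any position where either sequence has a gap, which is one of several possible conventions.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of compared positions at which the two sequences match.

    Positions are compared left to right; a position is skipped if either
    sequence has a gap there, and any overhang of the longer sequence is ignored.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

seq_1 = "ACGTGCTAGACTATGTGTC"
print(percent_identity(seq_1, "CGTGCTAGACTATGTGTC"))    # unaligned: 0.0
print(percent_identity(seq_1, "-CGTGCTAGACTATGTGTC"))   # aligned:   100.0
```

A single missing base at the start shifts every later position, so the unaligned comparison never pairs homologous nucleotides.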
Sequence alignment
• Usually we don’t know the exact sequence of substitutions, insertions
and deletions – we have to make a best guess
           0000000001
           1234567890
Sequence 1 CG-AACTCGA
Sequence 2 CC-AACTCGA
Sequence 3 CCGAACTCGA
Sequence 4 CCGAAC-CGA
Sequence alignment

• perhaps all taxa with a particular sequence have gone extinct, or we haven’t sampled them
• perhaps we cannot confidently determine the exact pattern of substitutions and insertions and deletions (= indels)
• we have to make a best guess
           000000000
           123456789
Sequence 2 CCAACTCGA
Sequence 4 CCGAACCGA
Sequence alignment
• Sequences that evolve slowly (= they are more highly conserved) are
easier to align
Sequence alignment
• Sequences that evolve very rapidly are harder to align, particularly
when comparing distantly related taxa
• rate of evolution varies:
• between coding and non-coding DNA
• between genes
• within the same gene (some parts of one gene more conserved than others)
• between taxa – e.g. within mammals, rate of evolution in mice and rats much
faster than in humans
Protein-coding DNA/RNA sequence alignment
• The amino acid sequence of a protein is determined by the sequence of
codons (= nucleotide triplets) in the DNA sequence of the gene coding for
that protein
• Nucleotide indels in protein-coding genes are usually in multiples of 3 (i.e. 3,
6, 9, 12….)
• If an indel is not in a multiple of 3, it will completely alter the amino acid
sequence and the position of STOP codon (protein will be shorter or longer
than usual)
• usually results in a non-functional protein
Protein-coding DNA/RNA sequence alignment
Original: ATGAGCAGTAGAGGCGTTTAG = Met,Ser,Ser,Arg,Gly,Val,STOP
• Single nucleotide insertion (= frameshift mutation): ATGCAGCAGTAGAGGCGTTTAG = Met,Gln,Gln,STOP
• very different amino acid sequence, premature STOP codon
• Triplet insertion (= reading frame maintained): ATGTACAGCAGTAGAGGCGTTTAG = Met,Tyr,Ser,Ser,Arg,Gly,Val,STOP
• single amino acid insertion, but rest of sequence is identical
• Triplet deletion (= reading frame maintained): ATGAGTAGAGGCGTTTAG = Met,Ser,Arg,Gly,Val,STOP
• single amino acid deletion, but rest of sequence is identical
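
The four translations above can be reproduced with a short script, which makes the effect of the reading frame easy to see. A minimal sketch using the standard genetic code; `translate` is an illustrative helper that reads from the first base and stops at the first STOP codon.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "STOP",
}
CODON_TABLE = {
    "".join(c): THREE_LETTER[aa]
    for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate codon by codon from the first base, stopping at the first STOP."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "STOP":
            break
    return ",".join(protein)

examples = {
    "original":                    "ATGAGCAGTAGAGGCGTTTAG",
    "single nucleotide insertion": "ATGCAGCAGTAGAGGCGTTTAG",
    "triplet insertion":           "ATGTACAGCAGTAGAGGCGTTTAG",
    "triplet deletion":            "ATGAGTAGAGGCGTTTAG",
}
for label, dna in examples.items():
    print(f"{label:28s} {translate(dna)}")
```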
Example of a frameshift mutation
Interphotoreceptor Retinoid-Binding Protein (IRBP) – plays key role in vision
• frameshift mutation in the marsupial mole (blind underground burrower): the protein is non-functional, so the gene is a “pseudogene”
Protein-coding DNA/RNA sequence alignment
• In general, alignments of exons of protein-coding genes will have
indels in multiples of 3, because the protein needs to be functional

[Figure: 9bp deletion in exon 11 of BRCA1 gene]
Non-protein-coding DNA/RNA sequence alignment
• Non-protein-coding DNA sequences include:
• introns in protein-coding genes (removed before translation)
• Non protein-coding RNA genes (e.g. ribosomal genes)
• regulatory sequences
• pseudogenes
• other non-functional(?) DNA

• In these types of sequences, indels can be any length


Non-protein-coding DNA/RNA sequence alignment
mitochondrial 16S ribosomal gene alignment
Protein (amino acid) sequence alignment
• Indels in protein (amino acid) sequences can be any length
Multiple sequence alignment
• So…
• Prior to phylogenetic analysis, we need to align our sequences to
ensure that we are comparing homologous nucleotides (if DNA or
RNA sequences) or homologous amino acids (if protein sequences)
• = “Multiple sequence alignment” (MSA)

• Alignment can be done:


• manually – insert gaps into sequences by hand, using a sequence editor
• automatically – use software that aligns the sequences according to an
algorithm
Multiple Sequence Alignment
• Manual alignment can be very accurate, but quickly becomes
unfeasible as number and/or length of sequences increases
Multiple sequence alignment
• There are a number of different programs for automated alignment,
including:
• ClustalW (one of the oldest and most commonly used)
• MAFFT
• MUSCLE
• They all work in broadly the same way:
• assign a “cost” to a mismatch between nucleotides (e.g. A in sequence 1
versus C in sequence 2) or amino acids (e.g. Leucine in sequence 1, Glycine in
sequence 2)
• assign a “cost” to inserting a gap in a sequence
• assign a “cost” to extending a gap in a sequence
• Find the multiple sequence alignment that minimises the overall cost
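
The cost scheme can be illustrated by scoring two candidate pairwise alignments of Sequence 1 and Sequence 3 from the earlier slides. This is a sketch of the scoring step only, not of the search that alignment programs perform; the cost values (mismatch 1, gap opening 2, gap extension 1) and the function name `alignment_cost` are assumptions chosen for illustration, and real programs use their own defaults.

```python
def alignment_cost(row_a, row_b, mismatch=1, gap_open=2, gap_extend=1):
    """Total cost of one pairwise alignment (two rows of equal length)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    cost = 0
    previous_gap_row = None                  # row that had a gap in the last column
    for a, b in zip(row_a, row_b):
        gap_row = "a" if a == "-" else "b" if b == "-" else None
        if gap_row is not None:
            # Continuing a gap in the same row is cheaper than opening a new one.
            cost += gap_extend if gap_row == previous_gap_row else gap_open
        elif a != b:
            cost += mismatch
        previous_gap_row = gap_row
    return cost

# Two candidate alignments of Sequence 1 (CGAACTCGA) and Sequence 3 (CCGAACTCGA).
internal_gap = alignment_cost("CG-AACTCGA", "CCGAACTCGA")
end_gap = alignment_cost("CGAACTCGA-", "CCGAACTCGA")
print(internal_gap, end_gap)   # 3 9
```

Under these assumed costs the alignment with the internal gap is preferred, because one gap and one mismatch are cheaper than the seven mismatches produced by pushing the gap to the end.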
Multiple Sequence Alignment
• Automated sequence alignment is (obviously) much faster than manual alignment, but is often less accurate
• Can still be very time-consuming with very long sequences and/or many sequences
• Some regions might be very difficult/impossible to align confidently – we might want to remove these “alignment ambiguous regions”
[Figure: alignment with an alignment ambiguous region marked]
So…
• Now we have nicely aligned sequence data

• How do we use it for phylogenetic analysis?


Methods of phylogenetic analysis
• Three basic types
• distance-based methods - less commonly used now (will not be discussed in
detail here)
• parsimony-based methods - mainly used for morphological data (will not be
discussed here)

• model-based methods – standard approach for analysing molecular sequence data (e.g., nucleotide sequence data, amino acid sequence data)
Model-based methods
• Key feature of these methods: they assume an explicit model of
evolution
• Implemented in two slightly different frameworks
• Maximum Likelihood
• Bayesian analysis
Maximum Likelihood

• Likelihood = probability of the data given a particular model: P(D|H)
• Method developed by Ronald A. Fisher (1890-1962), the “father of modern statistics”
How is maximum likelihood used in a
phylogenetic context?
• Likelihood = probability of the data given a particular model, P(D|H)
• In molecular phylogenetics:
• “data” = sequence alignment
• “model” = tree + substitution model (= model of how sequences evolve)
• the substitution model is usually specified prior to analysis
• So…we’re interested in finding the tree with the highest likelihood (= the
maximum likelihood tree), given a sequence alignment and our assumed
substitution model

• To summarise: maximum likelihood analysis = find the tree(s) under which the data matrix is most probable, given a substitution model
Substitution models
• Substitution models for molecular sequence data generally comprise
two parts:
• the relative frequencies of the bases or amino acids
• a rate matrix that models the probability of changes between bases or amino
acids

• We will focus on nucleotide (DNA/RNA) sequence data as it’s simpler (only 4 states), but the same broad principles apply to amino acid sequence data (20-22 states)
Let’s consider the state frequencies first
• Let’s say our model is: πA = 0.1, πC = 0.4, πG = 0.3, πT = 0.2 (these are
the relative frequencies of each state, and so add up to 1)
• And our data is a single nucleotide, A
• What is the likelihood of this?

• What about if our data is a sequence of two nucleotides, AC?
• What is the probability of the sequence of an entire human genome (3 billion base pairs)?
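
These three questions can be worked through numerically. In the sketch below the frequencies are the ones given on the slide and sites are treated as independent; the genome calculation additionally assumes, purely for illustration, that the genome's base composition matches the model frequencies. Log-likelihoods are used for the genome because the raw probability is far too small to represent as a floating-point number.

```python
import math

FREQS = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}   # state frequencies from the slide

def likelihood(sequence):
    """Probability of a sequence when each site is drawn independently."""
    return math.prod(FREQS[base] for base in sequence)

def log_likelihood(sequence):
    """Natural log of the same probability (sums instead of products)."""
    return sum(math.log(FREQS[base]) for base in sequence)

print(round(likelihood("A"), 6))    # 0.1
print(round(likelihood("AC"), 6))   # 0.1 x 0.4 = 0.04

# Assumed composition: each base occurs in proportion to its model frequency.
genome_length = 3_000_000_000
mean_log_per_site = sum(freq * math.log(freq) for freq in FREQS.values())
genome_log_likelihood = genome_length * mean_log_per_site
print(f"{genome_log_likelihood:.3e}")    # roughly -3.8e9
print(math.exp(genome_log_likelihood))   # underflows to 0.0
```

This is why likelihoods in phylogenetics are almost always reported as log-likelihoods.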
Now add in a rate matrix to model changes
between nucleotides
• Let’s stick with: πA = 0.1, πC = 0.4, πG = 0.3, πT = 0.2
• Rate matrix: probability of each ending nucleotide (columns A, C, G, T) given each starting nucleotide (rows A, C, G, T); each row sums to 1
• Alignment:
taxon_1 CCAT
taxon_2 CCGT
• Tree: taxon_1 and taxon_2 joined by a single branch
• Likelihood going from taxon_1 to taxon_2 = πC×P(C→C) × πC×P(C→C) × πA×P(A→G) × πT×P(T→T)
= 0.4×0.983×0.4×0.983×0.1×0.007×0.2×0.979 ≈ 0.0000212
Jukes & Cantor (1969) = JC69 model
• simplest substitution model
• equal state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25

• Single parameter rate matrix:

• How can we make this model more complex?
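
Under JC69 every substitution occurs at the same rate, so the probability of ending in a given base after a branch has a simple closed form. The sketch below uses the standard JC69 result, with branch length d measured in expected substitutions per site; the branch length of 0.1 and the function names are illustrative assumptions. It then computes the likelihood of the two-taxon alignment CCAT/CCGT in the same site-by-site way as the worked example.

```python
import math

BASES = "ACGT"

def jc69_probability(start, end, d):
    """P(start -> end) along a branch of length d (expected substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    if start == end:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

def pair_likelihood(seq_1, seq_2, d):
    """Likelihood of going from seq_1 to seq_2, treating sites as independent."""
    return math.prod(0.25 * jc69_probability(x, y, d) for x, y in zip(seq_1, seq_2))

d = 0.1   # hypothetical branch length
row = [jc69_probability("A", base, d) for base in BASES]
print([round(p, 4) for p in row], round(sum(row), 6))   # each row sums to 1
print(pair_likelihood("CCAT", "CCGT", d))
```

With d = 0 no change is possible, so the likelihood of any alignment containing a difference is zero; as d grows very large every entry approaches 0.25.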


Felsenstein (1981) = F81 model
• unequal frequencies: πA ≠ πC ≠ πG ≠ πT

• Single parameter rate matrix (same as JC69):




Transitions and transversions
• Two classes of nucleotides
• purines (A and G) – two-ring
• pyrimidines (C, T and U) – one-ring
• Mutations from purine to purine (A ↔ G) and from pyrimidine to pyrimidine (C ↔ T) = TRANSITIONS
• Mutations from pyrimidine to purine (or vice versa) = TRANSVERSIONS
• Transitions are more likely than transversions due to similarities in ring structure
[Figure: structures of adenine (purine), guanine (purine), cytosine (pyrimidine) and thymine (pyrimidine)]
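
The two classes of substitution can be told apart with a simple lookup. A minimal sketch; the function name `substitution_type` is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}

def substitution_type(old_base, new_base):
    """Classify a substitution as a transition or a transversion."""
    if old_base == new_base:
        raise ValueError("not a substitution: the base is unchanged")
    same_class = ({old_base, new_base} <= PURINES
                  or {old_base, new_base} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

print(substitution_type("A", "G"))   # transition
print(substitution_type("C", "T"))   # transition
print(substitution_type("A", "C"))   # transversion

# Of the 12 possible DNA substitutions, 4 are transitions and 8 are transversions.
all_changes = [(x, y) for x in "ACGT" for y in "ACGT" if x != y]
n_transitions = sum(substitution_type(x, y) == "transition" for x, y in all_changes)
print(n_transitions, len(all_changes) - n_transitions)   # 4 8
```

Transitions are observed more often even though there are twice as many possible transversions, which is why K80 and later models give them separate rate parameters.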
Kimura (1980) = K80 model
• equal state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
Hasegawa, Kishino & Yano (1985) = HKY85 model
• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
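
A sketch of how such a two-parameter rate matrix can be built. It assumes the common parameterisation in which the rate of change from base i to base j is α×πj for a transition and β×πj for a transversion, with each diagonal entry set so that its row sums to zero. The values α = 2 and β = 1 are hypothetical; the unequal frequencies are the ones used in the earlier likelihood example. Setting all four frequencies to 0.25 gives the K80 pattern, in which every transition has one rate and every transversion another.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85_rate_matrix(freqs, alpha, beta):
    """Return the rate matrix as a nested dict: q[start][end]."""
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                rate = alpha if (i, j) in TRANSITIONS else beta
                row[j] = rate * freqs[j]
        row[i] = -sum(row.values())      # diagonal makes the row sum to zero
        q[i] = row
    return q

unequal = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}   # HKY85-style frequencies
equal = dict.fromkeys(BASES, 0.25)                   # K80-style frequencies

for label, freqs in [("HKY85", unequal), ("K80", equal)]:
    q = hky85_rate_matrix(freqs, alpha=2.0, beta=1.0)
    print(label)
    for i in BASES:
        print(" ", i, [round(q[i][j], 3) for j in BASES])
```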
Several more…
General Time Reversible (GTR) model
• Most complex well-known model – very widely used

• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT


• Rate matrix has six parameters (a-f) , one for each possible
substitution type:
Overview of commonly used models
[Figure: commonly used models arranged from the simplest model (fewest parameters) to the most complex model (most parameters)]
Rate variation
• By comparing sequences, we can see that some regions of an
alignment are more likely to change than others
• These basic models don’t allow this – they assume that the rate of evolution
is the same across the entire alignment
• There are also variants of all of these models that let the rate of evolution
vary between different regions of the sequence alignment - we won’t
discuss the details here, but they are indicated by + I, + G, or + I + G

• Examples: JC69 + I, HKY + G, GTR + I + G


The agony of choice…
• We have lots of models, from simple to complex – which one should
we use?
• Different models might give different answers (i.e. different trees)
with the same data
• Is a complex model necessarily better?
Model selection
• Which is the “best” model here?
Model selection
• under-fit – too few parameters: model has too high bias (model does not adequately reflect underlying relationship of the training data)
• over-fit – too many parameters: model has too high variance (model overly affected by noise in the training data)
• good compromise between bias and variance
Model selection
• All models represent a trade-off between bias and variance
• we want to use a model that minimises both
Model selection
• We can use formal statistical methods to decide which model to use
• Commonly used model selection criteria:
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
As far as you’re concerned, the details don’t really matter! What’s important is that these are objective methods for identifying an appropriate model
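
For the curious, both criteria are simple formulas: AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of free parameters, ln L is the maximised log-likelihood and n is the sample size; the model with the lowest score is preferred. In the sketch below the log-likelihood values are invented for illustration, n is taken to be the number of alignment sites (a common convention), and k counts only the substitution-model parameters, on the assumption that branch lengths add the same number of parameters to every model.

```python
import math

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

# model: (hypothetical maximised log-likelihood, free substitution parameters)
MODELS = {
    "JC69":  (-2500.0, 0),
    "K80":   (-2450.0, 1),
    "HKY85": (-2440.0, 4),
    "GTR":   (-2438.0, 8),
}
N_SITES = 1000   # hypothetical alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in MODELS.items()}
bic_scores = {name: bic(lnl, k, N_SITES) for name, (lnl, k) in MODELS.items()}

for name in MODELS:
    print(f"{name:6s} AIC = {aic_scores[name]:8.1f}   BIC = {bic_scores[name]:8.1f}")
print("best by AIC:", min(aic_scores, key=aic_scores.get))
print("best by BIC:", min(bic_scores, key=bic_scores.get))
```

With these made-up numbers the two criteria disagree: BIC penalises extra parameters more heavily than AIC, so it settles on a simpler model.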
Calculating support
• How well supported are the monophyletic groups/clades in our
maximum likelihood tree by our data?
• if one clade is more strongly supported by our data, we should probably have
more confidence that it is “correct”

• Several different measures of support, but bootstrapping is by far the most commonly used method
• Bootstrapping = a resampling method
• Bootstrap values can be between 0 and 100%
• A bootstrap value of >70% is usually considered “strong” support
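
The resampling step can be sketched as follows: each bootstrap replicate is a new alignment of the same length, built by drawing columns (sites) from the original alignment at random with replacement. A tree is then inferred from every replicate, and the support for a clade is the percentage of replicate trees that contain it. The tree-inference step is not shown; the helper names are illustrative, and the alignment is the four-sequence example from the alignment slides.

```python
import random

ALIGNMENT = {
    "Sequence 1": "CG-AACTCGA",
    "Sequence 2": "CC-AACTCGA",
    "Sequence 3": "CCGAACTCGA",
    "Sequence 4": "CCGAAC-CGA",
}

def columns(alignment):
    """Return the alignment as a list of columns (one tuple of characters per site)."""
    return list(zip(*alignment.values()))

def bootstrap_replicate(alignment, rng):
    """Resample columns with replacement to build a pseudo-alignment of equal length."""
    original = columns(alignment)
    sampled = [rng.choice(original) for _ in original]
    return {name: "".join(col[i] for col in sampled)
            for i, name in enumerate(alignment)}

def support_percent(clade, replicate_clade_sets):
    """Percentage of replicate trees (given as sets of clades) that contain the clade."""
    hits = sum(clade in clade_set for clade_set in replicate_clade_sets)
    return 100.0 * hits / len(replicate_clade_sets)

rng = random.Random(1)   # fixed seed so that the example is repeatable
replicate = bootstrap_replicate(ALIGNMENT, rng)
for name, sequence in replicate.items():
    print(name, sequence)
```

Because columns are drawn with replacement, some sites appear more than once in a replicate and others not at all; clades that depend on only a few sites therefore tend to receive low support.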
Basic strategy for maximum likelihood
analysis of sequence data
1. Collect your sequence data
2. Align it
3. Identify best-fitting model of sequence evolution
4. Analyse your sequence data to find the maximum likelihood (ML)
tree assuming the best-fitting model
5. Use bootstrapping to calculate support values for the clades in the
ML tree
Practical
• Align a set of mitochondrial sequences from different modern
humans from around the world
• Make a phylogeny of modern humans using Maximum Likelihood
Human Genetics
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Out of Africa” hypothesis
• Archaic hominins dispersed out of Africa >1
million years ago
• AMHs originated in Africa and some
dispersed out of Africa 100-200k years ago
• AMHs completely replaced (and did not
interbreed with) archaic hominins
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Multiregional” hypothesis
• Archaic hominins dispersed out of Africa
>1 million years ago
• The different archaic hominin
populations (e.g. Neanderthals in
Europe, Homo erectus in Asia) evolved
into AMHs, possibly with some
interbreeding/geneflow between
populations
Enter genetic data…

Cann et al. (1987 – Nature)
• Analysis of mitochondrial genomes from 147 modern
humans
• Mitochondrial DNA:
• inherited maternally only
• does not undergo recombination
• Conclusion of paper:
• African origin of modern humans
• All the mitochondrial lineages can be traced to a female
common ancestor who lived 140k-290kya (kya = thousand
years ago)
• Split between Africans and non-Africans 62-225kya
Your phylogeny from last practical (which is based
on mitochondrial sequence data) should have
found a similar result!
• Greater genetic diversity
among modern Africans
than modern non-Africans
• All non-Africans should
form a single group to the
exclusion of some African
populations
• This group would be the
descendants of the humans
who left Africa

“Out of Africa”
dispersal event
“Mitochondrial Eve” = most recent common ancestor
of all modern human mitochondrial sequences
• IMPORTANT!
• The “mitochondrial Eve” was not
the only female alive at that time!
• Other females were present, but
their mitochondrial lineages have
not survived to the present day

Newsweek cover 1988

See http://www.virginia.edu/woodson/courses/aas102%20(spring%2001)/articles/tierney.html for the original story!
“Mitochondrial Eve”
• Cann et al. (1987) was highly controversial at the time, but recent
studies of mitochondrial genomes and Y chromosome (males only)
have reached similar conclusions

Fu et al. (2013 – Current Biology)
• “Mitochondrial Eve”: 120–197 kya
• Split between Africans and non-Africans: 62.4–94.9 kya

Poznik et al. (2013 – Science)
• “Mitochondrial Eve”: 99–148 kya
• Y-chromosome “Adam”: 120–156 kya
These studies appear to support “Out of
Africa” hypothesis

• Mitochondrial “Eve” and Y-chromosome “Adam” both ~100-200 kya
• This is much younger than oldest non-
African archaic hominins
• Split between Africans and non-
Africans <200 kya
• suggests a relatively late dispersal of
AMHs out of Africa
Enter ancient DNA…
Sequencing ancient biomolecules
• DNA is a very stable molecule
• It needs to be because it is the information store for the cell!
• DNA can be damaged by free radicals, radiation (e.g. UV) etc.
• DNA damage: breaks in the DNA strands, modification of the bases
• Cells have complex mechanisms to repair damaged DNA
• When a cell dies, these mechanisms stop functioning → DNA
becomes increasingly damaged
• DNA also degraded by microorganisms after death
After ~500 years, 50% of mt DNA has been lost
Can we successfully sequence DNA from the
remains of organisms that are hundreds or
thousands of years old or even older?
Svante Pääbo
• Problems:
• DNA degrades after death → left with
small numbers of fragments
• contaminating DNA (e.g. human DNA
from researchers, bacterial DNA) will
be far more common than the
ancient DNA → standard PCR is likely
to amplify more of the contaminating
DNA than the ancient DNA
Can we successfully sequence DNA and proteins
from the remains of organisms that are hundreds
or thousands of years old or even older?
• Solutions:
• use ultra-clean rooms (filtered air systems,
UV light etc.) when extracting DNA
• use “next generation” sequencing
• Doesn’t require PCR
• Can sequence all DNA fragments present in a
sample
• use methods that can sequence very short
pieces of DNA (<75bp) – ancient DNA usually
very fragmentary
• look for distinctive modifications to bases
that are seen in ancient DNA but not in
modern DNA (e.g. cytosine deamination)
There has been enormous progress made in
the field of ancient DNA – it’s a hot topic!

• Number of publications mentioning ancient DNA:
• 1995-2000: 1450
• 2001-2005: 2870
• 2006-2010: 6600
• 2011-2015: 10200
• 2016-2020: 16100
• 2021-present: 10100
So what do these ancient genomes tell us
about human evolution…?
1997: partial mitochondrial sequence from a
Neanderthal - 40,000 years old

• Neanderthal mtDNA falls outside mtDNA variation of modern humans
• Neanderthal mtDNA not closer to e.g.
Europeans than any other modern
human
• suggests no interbreeding/geneflow
between Neanderthals and AMHs
2008: entire mitochondrial genome from a
Neanderthal - 38,000 years old, 16.5 kb

• Complete mt genome from Neanderthal


• Similar findings to the 1997 paper:
• Neanderthal mtDNA falls outside mtDNA
variation of modern humans
• divergence between Neanderthals and
modern humans ~660 kya
• Further support for “Out of Africa”
But then…
Green et al. (2010): entire nuclear genome from
a Neanderthal - 38,000 years old, 4000 Mb
• Neanderthal genome sequenced using “next generation sequencing”
• Neanderthal genome compared with five modern humans from
different regions of the world:
• Africans: Yoruba, San
• non-Africans: French, Papua New Guinean, Han Chinese
Green et al. (2010): entire nuclear genome from
a Neanderthal - 38,000 years old, 4000 Mb
• Neanderthal genome more similar to the three non-African genomes than to the two African genomes → strongly suggests geneflow (“introgression”) from Neanderthals to non-Africans
• four possible explanations for the observed pattern – option 3 fits the data the best
• estimated 1-4% of nuclear genome of non-Africans is from Neanderthals
Then a real surprise!
• Partial little finger bone of a hominin (a “Denisovan”) from Denisova cave in Siberia, >50 kya
• besides their genomes, we know almost nothing about them!
• Only remains are fragments of bone and teeth
• Nuclear genome suggests that Denisovans are more closely related to Neanderthals than to modern humans
• split of Denisovan+Neanderthal lineage from modern humans ~800 kya
• split between Denisovans and Neanderthals ~650 kya
[Figure labels: Neanderthals; modern humans]
The biggest surprise…
• 3-6% of the genome of New Guineans and Aboriginal Australians is from
Denisovans!
• strong evidence of interbreeding
• 0.2% of the genome of Asians and Native Americans is from Denisovans
• either small amount of interbreeding with Denisovans, or geneflow from New
Guinean/Aboriginal populations
• European and African populations lack evidence of geneflow from
Denisovans
Distribution of Denisovan DNA in modern non-
African populations (red = greater proportion)
pre-genomic hypotheses
Post-genomic hypothesis
The “Assimilation Model”
• Current consensus view based on
genomic data:
• AMHs originated in Africa ~300 kya
• Majority of genetic diversity among non-
African modern humans is the result of a
relatively recent (50-200 kya) dispersal from
Africa
• BUT multiple small but significant
contributions by archaic hominins
(Neanderthals, Denisovans, possibly others)
to non-African populations
Important!

• Most of these findings are quite recent (2008 and later) – details still being
worked out, and may change with new discoveries/sequencing of more
archaic hominin genomes

• Results are strongly influenced by assumptions about population size, mutation rate, amount of geneflow etc.
• almost certainly massive oversimplifications
• probably many, many cases of interbreeding between different hominins

• Many gaps in the story, e.g. what was going on in Africa? Some evidence of
geneflow from archaic hominins, but relatively few ancient genomes from
Africa (DNA degrades faster at higher temps)
FIVE MINUTE BREAK!
Molecular clocks and human evolution
Molecular clocks
• The branching pattern of an evolutionary tree
indicates relationships – which taxa (individuals,
populations, species, genera, families etc.) are more
closely related than others
The closest living
relatives of humans
are chimps

Our taxa (in this case,
species of primate)
• It would also be useful to know when particular taxa
diverged from each other (i.e. the ages of the nodes)
When did the split
between humans and
chimps occur?

When did this split occur?
Our taxa (in this case,
species of primate)
Molecular clocks

• Some uses of divergence dates:
• comparing the timing of divergences to other events (e.g.
geological events, changes in climate etc.) to identify possible
causal links
• determining when particular adaptations originated (e.g.
large brains of humans)
• determining when particular diseases originated (e.g. HIV)
The molecular clock
• Basic principle proposed by Zuckerkandl and
Pauling in 1962
• found that the number of amino acid differences
in haemoglobin of different taxa corresponded to
estimated divergence time (based on the fossil
record)
• Zuckerkandl and Pauling (1965):
Alternative approach – the molecular clock

in this hypothetical example, rate of change is
1 substitution per 10 nucleotides per 25 million years
• Underlying assumption:
• rates of molecular evolution are constant
through time and between lineages
• if so, the amount of molecular change that
has occurred along a branch is proportional
to the length of time that branch represents
[Figure: two cytochrome c panels; axis label: estimated from the fossil record]
Relative and absolute divergence times
• If you assume a molecular clock, then branch lengths that
represent the amount of molecular change can be used to
estimate relative divergence times
• Example: if one branch is twice as long as another, it represents
twice as much time

• But how much time is this in an absolute sense, i.e. how many
years, or thousands of years, or millions of years?
Relative and absolute divergence times

• We need to calibrate the divergence times – convert relative
times to absolute times
• normally done using fossil evidence – estimate absolute age of one
or more nodes based on fossil evidence, then use this to calculate
how much molecular change occurs per unit time (e.g. change per
year, change per million years)
• this estimate of molecular change per unit time can then be used to
calculate the ages of the other nodes
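The calibration steps above can be sketched in a few lines of Python. This is a minimal illustration under a strict clock, not the method used in any particular program: the node depths (in substitutions per site) are made-up values, and fixing the human-chimp node at exactly 6 million years is an assumption chosen only to keep the arithmetic simple.

```python
# Hypothetical node depths in substitutions per site (made-up values).
# Under a strict molecular clock, depth is proportional to age.
node_depths = {
    "human-chimp": 0.006,
    "human-gorilla": 0.008,
    "human-orangutan": 0.016,
}

def calibrate_rate(depth, age_myr):
    """Substitutions per site per million years implied by one dated node."""
    return depth / age_myr

def date_nodes(depths, rate):
    """Convert every node depth into an absolute age in million years."""
    return {node: depth / rate for node, depth in depths.items()}

# Assumption for illustration: the human-chimp node is 6 million years old.
rate = calibrate_rate(node_depths["human-chimp"], 6.0)
ages = date_nodes(node_depths, rate)

for node, age in ages.items():
    print(f"{node}: {age:.1f} million years")
```

Real analyses treat the calibration as a range (minimum and maximum dates) rather than a single fixed age, and usually relax the assumption of one constant rate.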
Fossil calibrations relevant to humans (from
de Vries and Beck, 2023)
• Split between humans and chimps:
• calibrating fossil: Ardipithecus – more
closely related to humans than to chimps
• minimum date: 4.631 million years ago
(minimum age of Ardipithecus)
• maximum date: 15 million years ago (lots
of fossil apes in Africa, but none belong to
the human or chimp lineages)
Fossil calibrations relevant to humans (from
Benton et al., 2015)
• Split between Homo sapiens (modern
humans) and Homo neanderthalensis
(Neanderthals):
• calibrating fossil: oldest known Homo
neanderthalensis fossils
• minimum date: 0.2 million years ago (age of
oldest known Homo neanderthalensis fossils)
• maximum date: 1 million years ago (lots of
Homo fossils, but no humans or Neanderthals)
Standard approach for molecular clock
analysis
1. Collect sequence data
2. Align it
3. Determine appropriate model of sequence evolution
4. Identify appropriate fossil calibration(s)
5. Use (relaxed) molecular clock model to calculate divergence times
– today’s exercise!
So…
• Now we have nicely aligned sequence data
• How do we use it for phylogenetic analysis?
Methods of phylogenetic analysis
• Three basic types
• distance-based methods - less commonly used now (will not be discussed in
detail here)
• parsimony-based methods - mainly used for morphological data (will not be
discussed here)

• model-based methods – standard approach for analysing molecular
sequence data (e.g., nucleotide sequence data, amino acid sequence data)
Model-based methods
• Key feature of these methods: they assume an explicit model of
evolution
• Implemented in two slightly different frameworks
• Maximum Likelihood
• Bayesian analysis
Maximum Likelihood

• Likelihood = probability of the data given a particular model

P(D|H)
R. A. Fisher
(1890-1962)

• Method developed by Ronald A. Fisher, the “father of modern statistics”


How is maximum likelihood used in a
phylogenetic context?
• Likelihood = probability of the data given a particular model, P(D|H)
• In molecular phylogenetics:
• “data” = sequence alignment
• “model” = tree + substitution model (= model of how sequences evolve)
• the substitution model is usually specified prior to analysis
• So…we’re interested in finding the tree with the highest likelihood (= the
maximum likelihood tree), given a sequence alignment and our assumed
substitution model

• To summarise: maximum likelihood analysis = find the most likely
tree(s) given a data matrix and a substitution model
Substitution models
• Substitution models for molecular sequence data generally comprise
two parts:
• the relative frequencies of the bases or amino acids
• a rate matrix that models the probability of changes between bases or amino
acids

• We will focus on nucleotide (DNA/RNA) sequence data as it’s simpler
(only 4 states), but the same broad principles apply to amino acid
sequence data (20-22 states)
Let’s consider the state frequencies first
• Let’s say our model is: πA = 0.1, πC = 0.4, πG = 0.3, πT = 0.2 (these are
the relative frequencies of each state, and so add up to 1)
• And our data is a single nucleotide, A
• What is the likelihood of this?

• What about if our data is a sequence of two nucleotides, AC?

• What is the probability of the sequence of an entire human genome
(3 billion base pairs)?
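The three questions above can be worked through with a short Python sketch. It assumes, as the slide does, that the model is just the four state frequencies and that sites are independent, so the likelihood of a sequence is the product of the frequencies of its bases. The function names are illustrative only.

```python
import math

# State frequencies from the example model above.
freqs = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}

def likelihood(seq):
    """Product of state frequencies, treating sites as independent."""
    result = 1.0
    for base in seq:
        result *= freqs[base]
    return result

def log_likelihood(seq):
    """Sum of log frequencies; avoids underflow for long sequences."""
    return sum(math.log(freqs[base]) for base in seq)

print(likelihood("A"))       # the frequency of A
print(likelihood("AC"))      # 0.1 * 0.4
print(log_likelihood("AC"))  # log(0.1) + log(0.4)

# For a 3-billion-site genome, even a sequence made only of the most
# frequent base has a vanishingly small likelihood, so it is reported
# on the log scale. This is the highest possible value under the model.
n_sites = 3_000_000_000
best_case_log_lik = n_sites * math.log(max(freqs.values()))
print(best_case_log_lik)
```

This is why phylogenetic programs report log-likelihoods (large negative numbers) rather than raw likelihoods.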
Now add in a rate matrix to model changes
between nucleotides
• Let’s stick with: πA = 0.1, πC = 0.4, πG = 0.3, πT = 0.2
• Rate matrix: probability of each ending nucleotide (columns A, C, G, T)
given each starting nucleotide (rows A, C, G, T); each row sums to 1
• Alignment:
taxon_1 CCAT
taxon_2 CCGT
• Tree: taxon_1 and taxon_2 joined by a single branch
• Likelihood going from taxon_1 to taxon_2 = πC*PC→C*πC*PC→C*πA*PA→G*πT*PT→T
= 0.4×0.983×0.4×0.983×0.1×0.007×0.2×0.979 ≈ 0.0000212
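The site-by-site product can be checked with a small Python sketch. It takes the state frequencies exactly as listed in the model (so πT = 0.2) and uses only the three change probabilities quoted in the worked example; the rest of the rate matrix is not needed for this alignment. The function name is illustrative.

```python
# State frequencies as listed in the model.
freqs = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}

# Only the change probabilities quoted in the worked example.
change_prob = {("C", "C"): 0.983, ("A", "G"): 0.007, ("T", "T"): 0.979}

def branch_likelihood(start_seq, end_seq):
    """Multiply, site by site, pi(start) * P(start -> end)."""
    total = 1.0
    for start, end in zip(start_seq, end_seq):
        total *= freqs[start] * change_prob[(start, end)]
    return total

lik = branch_likelihood("CCAT", "CCGT")
print(f"{lik:.3e}")
```

The single A→G change contributes the factor 0.007, which is why one substitution lowers the likelihood so sharply compared with an unchanged site.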
Jukes & Cantor (1969) = JC69 model
• simplest substitution model
• equal state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25

• Single parameter rate matrix:

• How can we make this model more complex?


Felsenstein (1981) = F81 model
• unequal frequencies: πA ≠ πC ≠ πG ≠ πT

• Single parameter rate matrix (same as JC69):


Transitions and transversions
• Two classes of nucleotides
• purines (A and G) – two-ring
• pyrimidines (C, T and U) – one-ring
• Mutations from purine to purine (A
↔ G) and from pyrimidine to
pyrimidine (C ↔ T) = TRANSITIONS
• Mutations from pyrimidine to purine
(or vice versa) = TRANSVERSIONS
• Transitions are more likely than
transversions due to similarities in
ring structure
[Figure: adenine (purine), cytosine (pyrimidine), guanine (purine), thymine (pyrimidine)]
Kimura (1980) = K80 model
• equal state frequencies: πA = 0.25, πC = 0.25, πG = 0.25, πT = 0.25

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
Hasegawa, Kishino & Yano (1985) = HKY85 model
• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT

• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
Several more…
General Time Reversible (GTR) model
• Most complex well-known model – very widely used

• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT
• Rate matrix has six parameters (a-f), one for each possible
substitution type:
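To show how these models nest inside one another, here is a minimal Python sketch that builds a rate matrix from six substitution-type rates and four state frequencies. It uses one common parameterisation (rate from i to j = rate for that pair × frequency of j, with each row summing to zero); the ordering of the six rates and all numerical values are assumptions for illustration and may not match the matrices shown in the original figures.

```python
BASES = "ACGT"
# The six substitution types in a fixed order (an assumed convention).
PAIRS = [("A", "C"), ("A", "G"), ("A", "T"),
         ("C", "G"), ("C", "T"), ("G", "T")]

def rate_matrix(pair_rates, freqs):
    """Build Q with Q[i][j] = r_ij * pi_j for i != j; rows sum to 0."""
    rates = dict(zip(PAIRS, pair_rates))
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                r = rates.get((i, j), rates.get((j, i)))
                q[i][j] = r * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    return q

equal = {b: 0.25 for b in BASES}
unequal = {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2}
alpha, beta = 2.0, 1.0  # hypothetical transition / transversion rates
ti_tv = [beta, alpha, beta, beta, alpha, beta]  # A-G and C-T are transitions

jc69 = rate_matrix([1.0] * 6, equal)     # one rate, equal frequencies
f81 = rate_matrix([1.0] * 6, unequal)    # one rate, unequal frequencies
k80 = rate_matrix(ti_tv, equal)          # two rates, equal frequencies
hky85 = rate_matrix(ti_tv, unequal)      # two rates, unequal frequencies
gtr = rate_matrix([1.0, 4.0, 0.8, 1.2, 5.0, 1.0], unequal)  # six made-up rates

for base in BASES:
    print(base, [round(gtr[base][j], 3) for j in BASES])
```

Seen this way, each simpler model is the GTR model with some of its parameters forced to be equal.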
Overview of commonly used models
[Figure: commonly used models arranged from the simplest model (fewest parameters) to the most complex model (most parameters)]
Rate variation
• By comparing sequences, we can see that some regions of an
alignment are more likely to change than others
• These basic models don’t allow this – they assume that the rate of evolution
is the same across the entire alignment
• There are also variants of all of these models that let the rate of evolution
vary between different regions of the sequence alignment - we won’t
discuss the details here, but they are indicated by + I, + G, or + I + G

• Examples: JC69 + I, HKY + G, GTR + I + G


The agony of choice…
• We have lots of models, from simple to complex – which one should
we use?
• Different models might give different answers (i.e. different trees)
with the same data
• Is a complex model necessarily better?
Model selection
• Which is the “best” model here?
Model selection
• under-fit – too few parameters: model has too high bias (model does not
adequately reflect underlying relationship of the training data)
• good compromise between bias and variance
• over-fit – too many parameters: model has too high variance (model overly
affected by noise in the training data)
Model selection
• All models represent a trade-off between bias and variance
• we want to use a model that minimises both
Model selection
• We can use formal statistical methods to decide which model to use
• Commonly used model selection criteria:
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• As far as you’re concerned, the details don’t really matter! What’s
important is that these are objective methods for identifying an
appropriate model
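For anyone curious about the details, both criteria combine the maximised log-likelihood with a penalty for the number of parameters: AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of free parameters and n the number of sites. The sketch below applies them to three made-up model fits; the log-likelihoods and the 1,000-site alignment length are hypothetical, and only substitution-model parameters are counted (branch lengths are shared by all three models, so they do not change the ranking).

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_sites):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_lik

# (model, maximised log-likelihood, free parameters): hypothetical values.
fits = [("JC69", -2450.0, 0), ("HKY85", -2400.0, 4), ("GTR", -2398.5, 8)]
n_sites = 1000

for name, log_lik, k in fits:
    print(name, round(aic(log_lik, k), 1), round(bic(log_lik, k, n_sites), 1))

best_aic = min(fits, key=lambda f: aic(f[1], f[2]))[0]
best_bic = min(fits, key=lambda f: bic(f[1], f[2], n_sites))[0]
print("best by AIC:", best_aic, "| best by BIC:", best_bic)
```

In this toy case the most complex model has the highest likelihood, but its extra parameters do not improve the fit enough to justify the penalty.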
Calculating support
• How well supported are the monophyletic groups/clades in our
maximum likelihood tree by our data?
• if one clade is more strongly supported by our data, we should probably have
more confidence that it is “correct”

• Several different measures of support, but bootstrapping is by far the
most commonly used method
• Bootstrapping = a resampling method
• Bootstrap values can be between 0 and 100%
• A bootstrap value of >70% is usually considered “strong” support
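The resampling idea can be sketched in Python: the standard phylogenetic bootstrap draws alignment columns at random with replacement to build pseudo-replicate alignments of the same length, re-analyses each one, and reports the percentage of replicates in which a clade appears. The toy alignment is made up, and the "closest pair" check is only a crude stand-in for a real maximum likelihood tree search.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def human_chimp_grouped(aln):
    """Stand-in for a tree search: are human and chimp the closest pair?"""
    hc = differences(aln["human"], aln["chimp"])
    return (hc < differences(aln["human"], aln["gorilla"])
            and hc < differences(aln["chimp"], aln["gorilla"]))

def bootstrap_support(alignment, clade_test, n_reps=200, seed=1):
    """Percentage of pseudo-replicates in which the clade is recovered."""
    rng = random.Random(seed)
    hits = sum(clade_test(bootstrap_alignment(alignment, rng))
               for _ in range(n_reps))
    return 100.0 * hits / n_reps

# Made-up toy alignment for illustration only.
toy = {"human": "ACGTACGTACGT",
       "chimp": "ACGTACGTACGA",
       "gorilla": "ACTTACCTACTA"}

replicate = bootstrap_alignment(toy, random.Random(0))
value = bootstrap_support(toy, human_chimp_grouped)
print(f"bootstrap support: {value:.0f}%")
```

In a real analysis, each pseudo-replicate would be run through the same maximum likelihood search as the original alignment.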
Basic strategy for maximum likelihood
analysis of sequence data
1. Collect your sequence data
2. Align it
3. Identify best-fitting model of sequence evolution
4. Analyse your sequence data to find the maximum likelihood (ML)
tree assuming the best-fitting model
5. Use bootstrapping to calculate support values for the clades in the
ML tree
FIVE MINUTE BREAK!
Testing different models of human origins
using phylogenetic analysis of molecular data
First…some terminology

• Current evidence suggests that
humans split from chimps ~5-7
million years ago
• Modern humans (Homo sapiens)
plus any taxon more closely
related to modern humans
than to chimps (Pan troglodytes
and P. paniscus) = hominins
[Figure labels: split between humans and chimps; hominins]
First…some terminology

[Figure: selection of archaic hominins]
• Precise taxonomy of fossil hominins is
extremely controversial and in flux (in
part because of new genomic data!)
• To keep things simple, we will
distinguish between:
• anatomically modern humans (AMHs) –
fossil hominins with essentially the same
anatomy as present day humans – they
look basically the same as me and you
• archaic hominins – fossil hominins with
clearly different anatomy from present
day humans
Archaic hominins
• Archaic hominins differ from AMHs in
several ways:
• in general, the skull appears more heavily
built (e.g. large brow ridges), and chin is
absent
• earliest archaic hominins small (1.5 m or
less) with brains only slightly larger than
those of chimps
• Neanderthals similar in height to AMHs,
but more robust and on average had
slightly larger brains
[Figures: modern human and Neanderthal; Neanderthal reconstructions]
Everything on the “human” side of the tree = hominins
[Figure labels: split between humans and chimps is about here; hominins]
Beginnings
• Charles Darwin argued that the earliest
phases of hominin evolution probably
occurred in Africa:
“In each great region of the world the living
mammals are closely related to the extinct
species of the same region. It is, therefore,
probable that Africa was formerly inhabited by
extinct apes closely allied to the gorilla and
chimpanzee; and as these two species are now
man's nearest allies, it is somewhat more
probable that our early progenitors lived on
the African continent than elsewhere.”
- The Descent of Man (1871)
Charles Darwin (1809-1882)
But… the first fossils of ancient human
relatives were discovered in Europe
[Figure: Neanderthal 1]
• “Neanderthal Man”
• First fossil (“Neanderthal 1”) found
in northern Germany in 1856 and
named Homo neanderthalensis in 1864
The 20th century saw the discovery of many
fossil human remains
“Consensus” view
• Split between hominins and chimps occurred >5 million years ago
• Earliest phases of hominin evolution occurred in Africa
• At least one African hominin lineage (Australopithecus -> Homo habilis -> Homo
erectus) became increasingly “human-like” through time (more upright stance, larger
brain, use of tools etc.)
• At least one hominin lineage dispersed out of Africa >1.8 million years ago

oldest hominins from outside Africa (Homo
erectus, 1.8 million years old, Georgia)
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Multiregional” hypothesis
• Archaic hominins dispersed out of Africa
>1 million years ago
• The different archaic hominin
populations (e.g. Neanderthals in
Europe, Homo erectus in Asia) evolved
into AMHs, possibly with some
interbreeding/geneflow between
populations
Two conflicting hypotheses for the origin of
anatomically modern humans (AMHs)
• “Out of Africa” hypothesis
• Archaic hominins dispersed out of Africa >1
million years ago
• AMHs originated in Africa and some
dispersed out of Africa 100-200k years ago
• AMHs completely replaced (and did not
interbreed with) archaic hominins
Big debate from the 1980s until the 2010s!
Prediction – what would the phylogeny of modern humans
look like if the multi-regional hypothesis is correct?

• Genomes of modern Africans should show
similar diversity to people from other
parts of the world
Prediction – what would the phylogeny of modern humans
look like if the “Out of Africa” hypothesis is correct?

• There should be greater genetic diversity
among modern Africans than modern non-Africans
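One simple way to put a number on "genetic diversity" is the average number of differences between every pair of aligned sequences in a group. The Python sketch below compares two groups in this way; the sequences are entirely made up to illustrate the prediction and are not real mitochondrial data.

```python
from itertools import combinations

def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)

# Made-up aligned sequences, for illustration only.
africans = ["ACGTACGTAC", "ACGTTCGTAA", "TCGTACGGAC", "ACCTACGTGC"]
non_africans = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGTAC", "ACGTACGTAA"]

african_diversity = mean_pairwise_differences(africans)
non_african_diversity = mean_pairwise_differences(non_africans)

print(f"Africans: {african_diversity:.2f} differences per pair")
print(f"non-Africans: {non_african_diversity:.2f} differences per pair")
```

Under the "Out of Africa" hypothesis the first value is expected to be larger, as it is in this toy data; under the multi-regional hypothesis the two values would be expected to be similar.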
Prediction – what would the phylogeny of modern humans
look like if the “Out of Africa” hypothesis is correct?

• All non-Africans should form a single group to
the exclusion of some African populations
• This group would be the descendants of the humans
who left Africa
[Figure label: “Out of Africa” dispersal event]
Testing this with mitochondrial sequence data
from modern humans
• Mitochondria have their own small
circular genome
• In humans, ~16 thousand base pairs
(versus 3 billion in the nuclear genome)
• Mitochondrial DNA:
• inherited maternally only
• does not undergo recombination
• Easy to sequence because mitochondria
occur in large numbers in each cell (so
multiple copies of the mitochondrial
genome)
Practical
• You will do a maximum likelihood analysis of your mitochondrial
sequences to see if they fit a “multi-regional” model or an “Out of
Africa” model better

Basic Approach:
1. Collect your sequence data
2. Align it
(You did these steps last session!)
3. Identify best-fitting model of sequence evolution
4. Analyse your sequence data to find the maximum likelihood (ML) tree
assuming the best-fitting model
5. Use bootstrapping to calculate support values for the clades in the ML tree