Genetics
• Ernst Haeckel:
• Produced the first reconstruction of the entire ‘Tree of Life’
(1866)
• Coined the term ‘phylogeny’ (Ancient Greek: phyle = tribe, race;
genesis = birth) for the evolutionary relationships between
organisms
Why is phylogenetics important?
• To understand ‘our place in Nature’, and that of every other organism:
Tree terminology
[Figure labels: root = position of first divergence; unrooted tree]
Rooting trees
• Usually, a tree is rooted with an outgroup (or multiple outgroups)
• outgroup taxa = taxa that we already know are more distantly related to
the other taxa (= the ingroup) than those taxa are to each other
• root is between the outgroup taxon (or taxa) and the remaining ingroup taxa
[Figure: three drawings of the same tree, with tip labels in the orders
A B C D = D C B A = B C D A]
What matters is the groupings that are specified, not the precise
order in which the taxon labels occur
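The point about groupings versus label order can be checked with a short sketch. The nested-tuple tree format and the topology (A,(B,(C,D))) are assumptions made for illustration; that topology is one that is consistent with the three label orders A B C D, D C B A and B C D A.

```python
def clades(tree):
    """Return the set of groupings (frozensets of tip labels) in a nested-tuple tree."""
    found = set()

    def tips(node):
        # A string is a tip; a tuple is an internal node with child subtrees.
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found

# Hypothetical topology, drawn with the tips in three different orders.
t1 = ("A", ("B", ("C", "D")))      # tips read A B C D
t2 = ((("D", "C"), "B"), "A")      # tips read D C B A
t3 = (("B", ("C", "D")), "A")      # tips read B C D A
# A genuinely different tree: A groups with B instead.
t4 = (("A", "B"), ("C", "D"))

print(clades(t1) == clades(t2) == clades(t3))  # True: same groupings
print(clades(t1) == clades(t4))                # False: different groupings
```

Rotating branches around a node changes the order of the labels but never the set of groupings, which is why the first comparison holds.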
Tree Thinking quiz
(taken from Baum et al., 2005 – Science)
Different types of data for producing
phylogenies
• Morphological or other phenotypic data:
• only kind of data available for fossils (if ancient DNA or proteins not preserved)
• Molecular data:
• Today, the most commonly used form of molecular data is sequence data –
either nucleotide sequence data (DNA or RNA) or protein sequence data
(amino acids)
DNA sequence data
• Double-helix
• specific sequence of four bases
• adenine = A
• cytosine = C
• guanine = G
• thymine = T
• Example: ATGCGCAGTTATTGCGAT
RNA sequence data
• same bases as DNA except uracil (U) instead of
thymine, i.e. A, C, G and U
• usually produced using DNA as a template (=
transcription) ∴ complementary to the DNA template strand,
i.e. same sequence as the DNA coding strand (but U instead of T)
Sequence 1: ACGTGCTAGACTATGTGTC
0% sequence similarity
Sequence 2: CGTGCTAGACTATGTGTC
Sequence 1: ACGTGCTAGACTATGTGTC
100% sequence similarity
Sequence 2: -CGTGCTAGACTATGTGTC
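A minimal sketch of how the two similarity values above arise, assuming similarity is measured as the percentage of compared positions with identical bases and that positions containing a gap are skipped. The function name `percent_identity` is made up for this example.

```python
def percent_identity(a, b):
    """Percentage of compared positions that match; positions with a gap ('-') are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

seq1 = "ACGTGCTAGACTATGTGTC"
seq2_unaligned = "CGTGCTAGACTATGTGTC"
seq2_aligned = "-CGTGCTAGACTATGTGTC"

print(percent_identity(seq1, seq2_unaligned))  # 0.0
print(percent_identity(seq1, seq2_aligned))    # 100.0
```

A single gap at the start puts every remaining base back into register, which is why alignment has to come before any comparison of sequences.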
Sequence alignment
• Usually we don’t know the exact sequence of substitutions, insertions
and deletions – we have to make a best guess
[Figure: tree of Sequences 1-4 shown next to their alignment]
            0000000001
            1234567890
Sequence 1  CG-AACTCGA
Sequence 2  CC-AACTCGA
Sequence 3  CCGAACTCGA
Sequence 4  CCGAAC-CGA
[Figure: tree tips shown next to Sequences 2 and 4 without gaps]
            000000000
            123456789
Sequence 2  CCAACTCGA
Sequence 4  CCGAACCGA
• Sequences that evolve slowly (= they are more highly conserved) are
easier to align
• Sequences that evolve very rapidly are harder to align, particularly
when comparing distantly related taxa
• rate of evolution varies:
• between coding and non-coding DNA
• between genes
• within the same gene (some parts of one gene more conserved than others)
• between taxa – e.g. within mammals, rate of evolution in mice and rats much
faster than in humans
Protein-coding DNA/RNA sequence alignment
• The amino acid sequence of a protein is determined by the sequence of
codons (= nucleotide triplets) in the DNA sequence of the gene coding for
that protein
• Nucleotide indels in protein-coding genes are usually in multiples of 3 (i.e. 3,
6, 9, 12…)
• If an indel is not a multiple of 3, it shifts the reading frame (a frameshift):
the amino acid sequence downstream of the indel is completely altered, as is
the position of the STOP codon (protein will be shorter or longer than usual)
• usually results in a non-functional protein
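The effect of indel length on the reading frame can be illustrated with a short sketch. It reuses the example sequence from the DNA slide; the deletion positions are chosen arbitrarily for illustration.

```python
def codons(seq):
    """Split a nucleotide sequence into complete codons (triplets), reading from position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

original = "ATGCGCAGTTATTGCGAT"       # example sequence from the DNA slide
del3 = original[:3] + original[6:]    # delete 3 bases (one whole codon)
del1 = original[:3] + original[4:]    # delete a single base

print(codons(original))  # ['ATG', 'CGC', 'AGT', 'TAT', 'TGC', 'GAT']
print(codons(del3))      # ['ATG', 'AGT', 'TAT', 'TGC', 'GAT']
print(codons(del1))      # ['ATG', 'GCA', 'GTT', 'ATT', 'GCG']
```

After the 3-base deletion the remaining codons are unchanged, so only one amino acid is lost. After the 1-base deletion every downstream codon is different.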
[Figure: R. A. Fisher (1890-1962); likelihood = P(D|H)]
How is maximum likelihood used in a
phylogenetic context?
• Likelihood = probability of the data given a particular model, P(D|H)
• In molecular phylogenetics:
• “data” = sequence alignment
• “model” = tree + substitution model (= model of how sequences evolve)
• the substitution model is usually specified prior to analysis
• So…we’re interested in finding the tree with the highest likelihood (= the
maximum likelihood tree), given a sequence alignment and our assumed
substitution model
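To make P(D|H) concrete, here is a minimal sketch for the simplest possible case: two aligned sequences, where the "tree" is a single branch of length d (substitutions per site). It assumes the Jukes-Cantor (JC69) substitution model, which is a simpler model than those named on these slides, and the second sequence is a made-up variant of Sequence 1 with two differences.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, d):
    """log P(D|H) for two aligned sequences separated by d substitutions/site under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base at both ends
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue           # skip gapped positions
        # 0.25 = equal base frequencies assumed by JC69
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l

seq_a = "ACGTGCTAGACTATGTGTC"
seq_b = "ACGTGCTTGACTATGAGTC"  # hypothetical: differs from seq_a at 2 of 19 sites

# Try many candidate branch lengths and keep the one with the highest likelihood.
grid = [i / 1000.0 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

# For this simple case the maximum can also be found analytically.
p = 2 / 19
analytic_d = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(best_d, analytic_d)
```

A real analysis does the same thing on a much larger scale: the likelihood is computed for many candidate trees and branch lengths, and the combination with the highest value is the maximum likelihood tree.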
• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
Hasegawa, Kishino & Yano (1985) = HKY85 model
• unequal state frequencies: πA ≠ πC ≠ πG ≠ πT
• Two parameter rate matrix: one for transitions (α) and one for
transversions (β):
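As a sketch of what the HKY85 rate matrix looks like, the code below builds it from the two rates and the base frequencies: the rate of change from base i to base j is α·πj for transitions (A↔G, C↔T) and β·πj for transversions, and each diagonal entry makes its row sum to zero. The parameter values are made up for illustration.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85_rate_matrix(pi, alpha, beta):
    """Return the HKY85 rate matrix as a nested dict q[from_base][to_base]."""
    q = {}
    for i in BASES:
        q[i] = {}
        for j in BASES:
            if i != j:
                rate = alpha if (i, j) in TRANSITIONS else beta
                q[i][j] = rate * pi[j]
        # Diagonal entry: minus the sum of the other entries in the row.
        q[i][i] = -sum(q[i].values())
    return q

# Hypothetical parameter values: unequal base frequencies, transitions twice as fast.
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = hky85_rate_matrix(pi, alpha=2.0, beta=1.0)

for i in BASES:
    print(i, [round(Q[i][j], 2) for j in BASES])
```

Setting all four frequencies to 0.25 in this sketch removes the unequal state frequencies and leaves a matrix with only the two rate parameters.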
Several more…
General Time Reversible (GTR) model
• Most complex well-known model – very widely used
Cann et al. (1987), Nature
• Analysis of mitochondrial genomes from 147 modern
humans
• Mitochondrial DNA:
• inherited maternally only
• does not undergo recombination
• Conclusion of paper:
• African origin of modern humans
• All the mitochondrial lineages can be traced to a female
common ancestor who lived 140-290 kya (kya = thousand
years ago)
• Split between Africans and non-Africans 62-225 kya
Your phylogeny from last practical (which is based
on mitochondrial sequence data) should have
found a similar result!
• Greater genetic diversity
among modern Africans
than modern non-Africans
• All non-Africans should
form a single group to the
exclusion of some African
populations
• This group would be the
descendants of the humans
who left Africa
“Out of Africa”
dispersal event
“Mitochondrial Eve” = most recent common ancestor
of all modern human mitochondrial sequences
• IMPORTANT!
• The “mitochondrial Eve” was not
the only female alive at that time!
• Other females were present, but
their mitochondrial lineages have
not survived to the present day
Newsweek cover 1988
See http://www.virginia.edu/woodson/courses/aas102%20(spring%2001)/articles/tierney.html for the original story!
“Mitochondrial Eve”
• Cann et al. (1987) was highly controversial at the time, but recent
studies of mitochondrial genomes and of Y chromosomes (present in males
only) have reached similar conclusions
• Number of publications mentioning ancient DNA (aDNA):
• 1995-2000: 1450
• 2001-2005: 2870
• 2006-2010: 6600
• 2011-2015: 10200
• 2016-2020: 16100
• 2021-present: 10100
So what do these ancient genomes tell us
about human evolution…?
1997: partial mitochondrial sequence from a
Neanderthal - 40,000 years old
• Most of these findings are quite recent (2008 and later) – details still being
worked out, and may change with new discoveries/sequencing of more
archaic hominin genomes
• Many gaps in the story, e.g. what was going on in Africa? Some evidence of
gene flow from archaic hominins, but relatively few ancient genomes from
Africa (DNA degrades faster at higher temperatures)
Molecular clocks and human evolution
Molecular clocks
• The branching pattern of an evolutionary tree
indicates relationships – which taxa (individuals,
populations, species, genera, families etc.) are more
closely related than others
The closest living
relatives of humans
are chimps
[Figure: tree with divergence times estimated from the fossil record]
Relative and absolute divergence times
• If you assume a molecular clock, then branch lengths that
represent the amount of molecular change can be used to
estimate relative divergence times
• Example: if one branch is twice as long as another, it represents
twice as much time
• But how much time is this in an absolute sense, i.e. how many
years, or thousands of years, or millions of years?
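One common way to answer this question is to calibrate the clock: if the age of one node is known independently (for example from the fossil record), the substitution rate can be estimated and applied to the other nodes. The sketch below assumes a strict molecular clock; the node names, depths and calibration age are invented for illustration.

```python
def absolute_ages(node_depths, calibration_node, calibration_age):
    """Convert node depths (substitutions/site, node to present) into absolute ages.

    Assumes a strict molecular clock: one substitution rate across the whole tree.
    """
    rate = node_depths[calibration_node] / calibration_age   # substitutions/site per unit time
    ages = {node: depth / rate for node, depth in node_depths.items()}
    return rate, ages

# Hypothetical values: depths in substitutions per site, calibration age in millions of years.
depths = {"calibrated_node": 0.06, "node_x": 0.03, "node_y": 0.12}
rate, ages = absolute_ages(depths, "calibrated_node", calibration_age=6.0)

print(rate)   # substitutions per site per million years
print(ages)   # node_x is half as deep as the calibrated node, so half as old
```

The relative times come from the branch lengths alone; the calibration only sets the scale.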
First…some terminology
• Species more closely related to humans (Homo sapiens)
than to chimps (Pan troglodytes
and P. paniscus) = hominins
Beginnings
• Charles Darwin argued that the earliest
phases of hominin evolution probably
occurred in Africa:
“In each great region of the world the living
mammals are closely related to the extinct
species of the same region. It is, therefore,
probable that Africa was formerly inhabited by
extinct apes closely allied to the gorilla and
chimpanzee; and as these two species are now
man's nearest allies, it is somewhat more
probable that our early progenitors lived on
the African continent than elsewhere.”
- The Descent of Man (1871)
Charles Darwin (1809-1882)
But… the first fossils of ancient human
relatives were discovered in Europe
Neanderthal 1
• “Neanderthal Man”
• First fossil (“Neanderthal 1”) found
in the Neander Valley (western Germany) in 1856 and
named Homo neanderthalensis in 1864
The 20th century saw the discovery of many
fossil human remains
“Consensus” view
• Split between hominins and chimps occurred >5 million years ago
• Earliest phases of hominin evolution occurred in Africa
• At least one African hominin lineage (Australopithecus -> Homo habilis -> Homo
erectus) became increasingly “human-like” through time (more upright stance, larger
brain, use of tools etc.)
• At least one hominin lineage dispersed out of Africa >1.8 million years ago
“Out of Africa”
dispersal event
Basic Approach:
1. Collect your sequence data (slide note: you did these steps last session!)
2. Align it
3. Identify best-fitting model of sequence evolution
4. Analyse your sequence data to find the maximum likelihood (ML) tree
assuming the best-fitting model
5. Use bootstrapping to calculate support values for the clades in the ML tree
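The resampling part of step 5 can be sketched as follows. Each bootstrap replicate is a new alignment of the same length, built by drawing columns (sites) from the original alignment at random with replacement. The function name is made up, and the alignment is the four-sequence example from the alignment slides. Tree estimation on each replicate is not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement to build one bootstrap replicate."""
    n_sites = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence, so columns stay intact.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

alignment = {
    "Sequence 1": "CG-AACTCGA",
    "Sequence 2": "CC-AACTCGA",
    "Sequence 3": "CCGAACTCGA",
    "Sequence 4": "CCGAAC-CGA",
}

rng = random.Random(42)  # fixed seed so the example is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

In a full analysis an ML tree is estimated from every replicate, and the support value for a clade is the percentage of replicate trees that contain it.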
Practical
• You will do a maximum likelihood analysis of your mitochondrial
sequences to see if they fit a “multi-regional” model or an “Out of
Africa” model better