Lecture Notes On Algorithmic Information Theory
April 2025
Contents
1 Introduction
  1.1 Hilbert's Theory of Everything
  1.2 Berry's Paradox
4 Computability Theory
  4.1 Turing's Model of Computation
  4.2 Mathematizing the mathematician
  4.3 Partial Functions
  4.4 Effective Enumeration of Turing Machines
  4.5 The Universal Turing Machine
  4.6 Church–Turing–Deutsch
  4.7 Computing, Deciding, Enumerating, Recognizing
  4.8 The Halting Problem
  4.9 Uncomputability ⟹ Incompleteness
  4.10 Rice's Theorem
5 Algorithmic Complexity
  5.1 Motivation
  5.2 Plain Complexity
  5.3 The Invariance Theorem
  5.4 Basic Properties
  5.5 Chain Rule for Plain Complexity
  5.6 Incompressibility
  5.7 The Problems of Plain Complexity
6 Prefix Complexity
  6.1 Self-delimiting Turing Machines
  6.2 Solving the Problems of Plain Complexity
  6.3 Information Equivalence between x∗ and (x, K(x))
  6.4 The Uncomputability of K(x)
7 The Halting Probability Ω
  7.1 Mathematical Truths in Ω
  7.2 Chaitin's Incompleteness Theorem
A Problem Sheets
  A.1 Problem Sheet #1
  A.2 Problem Sheet #2
  A.3 Problem Sheet #3
  A.4 Problem Sheet #4
Preface
These lecture notes on Algorithmic Information Theory were constructed in con-
junction with the graduate course I taught at Università della Svizzera italiana in
the spring of 2023.
The course is intended for graduate students and researchers seeking a self-
contained journey from the foundations of computability theory to prefix complex-
ity and the information-theoretic limits of formal systems. My exposition ignores
boundaries between computer science, mathematics, physics, and philosophy, which
I consider essential when explaining inherently multidisciplinary fields.
Recordings of the lectures are available online—click on the following link:
[Full lecture series]. Links to individual lectures are also provided throughout the
notes.
The course is based on many books, most notably the detailed textbook by Li and
Vitányi [1], which readers should also consult for extensive references to the primary
scientific literature. I enjoyed the explanatory approach of Cover and Thomas [2],
and I gained insights from Hutter [3] and Calude [4]. Chaitin's Meta Math! [5], both
profound and accessible, also inspired the content of the course.
Gilles Brassard introduced me, as his Ph.D. student, to the marvels of theoretical
computer science through many open discussions, for which I am deeply grateful.
My interest in algorithmic information theory was then sparked by reading
Bennett [6] and Chaitin [7].
I am especially grateful to Stefan Wolf, who encouraged me to design and teach
this course, and with whom I had many valuable conversations—both on this topic
and beyond. Finally, I am immensely grateful to the students who attended the
lectures. Their enthusiasm and curiosity were potent motivators and led to countless
stimulating discussions.
1 Introduction
[Lecture AIT-1]
Algorithmic information theory (AIT) is the meeting between Turing’s theory of
computation and Shannon’s theory of information.
AIT roots the concept of information in computation rather than probability. Its
central mathematical construct, algorithmic complexity, is a universal quantification
of information content in individual digital objects.
– Quantification, not qualification.
– Digital objects are those describable by strings of symbols.
– Individual, i.e. we take the object as is; we do not posit a set or a distribution
to which the object is assumed to belong.
– Universal, because any universal computer gives roughly the same measure of
algorithmic complexity.
Algorithmic complexity was introduced independently by Solomonoff (1964),
Kolmogorov (1965) and Chaitin (1966). Their work was broadly motivated—
respectively—by induction (Solomonoff), information (Kolmogorov) and random-
ness (Chaitin). AIT is still being developed. See Fig. 1 for a contrast with Shannon’s
information theory.
Figure 1: Shannon's information theory vs. AIT. Shannon's IT is built on probability theory and its central quantity is entropy, with applications such as statistics and thermodynamics. AIT is built on computability theory and its central quantity is complexity, with applications such as non-probabilistic statistics, non-probabilistic thermodynamics, inductive methods (Occam's razor), logical depth, incompleteness, the statistics of mathematics, the foundations of mathematics, and philosophy.
1.1 Hilbert’s Theory of Everything
Hilbert expressed the belief that if mathematics is to provide absolute and complete
certainty, then there ought to be a formal axiomatic theory for all of mathematics.
There should be a finite set of axioms from which all mathematical truths can be
proven by mechanically applying the rules of logic. In practice, today, the closest we
have to such a theory is the Zermelo-Fraenkel set theory with the axiom of choice
(ZFC), together with first-order logic.
Hilbert did not suggest that mathematics should only be done formally, but that
it could, so matters of opinion and ambiguity can be eliminated from mathematics.
In 1931, Gödel put an end to Hilbert's program. He showed that every logically
consistent formal axiomatic system that is strong enough to encapsulate the
arithmetic of whole numbers contains true but unprovable statements.
2 Strings and Codes
[Lecture AIT-2]
2.1 Strings
A set A = {ai }i∈I of symbols is an alphabet. It can be finite, in which case I =
{1, 2, . . . , N} or infinite, I = N.
Symbols can be concatenated into strings (also called words). The set of all finite
strings over an alphabet A is denoted A∗ .
The binary alphabet is {0, 1} and

{0, 1}∗ = {ǫ, 0, 1, 00, 01, 10, 11, 000, . . .} ,

where ǫ is the empty string. The above set is displayed in the lexicographical ordering.
The following pairing defines a bijection between {0, 1}∗ and N:

    {0, 1}∗ ↔ N
        ǫ ↔ 1
        0 ↔ 2
        1 ↔ 3
       00 ↔ 4
       01 ↔ 5          (1)
       10 ↔ 6
       11 ↔ 7
      000 ↔ 8
          ...
Proposition 1. Under the above identification of strings with numbers, |x| = ⌊log x⌋.

Proof. Observe first that on the left-hand side, x is a string because we can't take
the length of a number, whereas on the right-hand side, it is a number because we
can't take the log of a string. Since the binary representation of the number x is 1x,
we have that x = 1 · 2^{|x|} + r with r < 2^{|x|}. Thus ⌊log x⌋ = |x|.
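A small illustration of the pairing (my own sketch, not part of the notes): the number n corresponds to the string x whose binary representation is 1x, which also makes Proposition 1 easy to check numerically.

from math import floor, log2

def string_to_number(x: str) -> int:
    """ǫ ↔ 1, 0 ↔ 2, 1 ↔ 3, 00 ↔ 4, ..."""
    return int('1' + x, 2)

def number_to_string(n: int) -> str:
    """Inverse mapping: drop the leading 1 of n's binary representation."""
    return bin(n)[3:]              # bin(n) starts with '0b1'

if __name__ == '__main__':
    for n in range(1, 9):
        x = number_to_string(n)
        assert string_to_number(x) == n
        assert len(x) == floor(log2(n))   # Proposition 1: |x| = ⌊log x⌋
        print(n, repr(x))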
A string y is a prefix of a string x if x = yz for some z ∈ A∗ . A set S ⊂ A∗ is
prefix-free if for all x, y ∈ S such that x ≠ y, x is not a prefix of y.
Example 1. Consider the alphabet of digits D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. The
set of strings {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, . . .} is not prefix-free. Yet with the alphabet
D ∪ {∗}, the set {1∗, 2∗, 3∗, 4∗, 5∗, 6∗, 7∗, 8∗, 9∗, 10∗, . . .} is prefix-free. Similarly,
{10, 110, 1110, 11110, . . .} is a prefix-free set of bit strings.
2.2 Codes
Definition 1. A code is a mapping

E : A → B∗

from a source alphabet A to strings over a code alphabet B. It extends to strings of source symbols by concatenation,

E∗ : A∗ → B∗ ,   a1 a2 . . . an ↦ E(a1)E(a2) . . . E(an) .
Figure 2: Nested classes of codes: prefix codes ⊂ uniquely decodable codes ⊂ non-singular codes ⊂ all codes, with the example codes E1 (all codes), E2 (non-singular), E3 (uniquely decodable) and E4 (prefix) placed in the corresponding regions.
Figure 3: The binary tree representing the codewords of E4: 0, 10, 110 and 111.
– E2 is not uniquely decodable. How does one decode the stream of code-
words 110? Is it B, CA, or CCD?
– E3 is uniquely decodable but not a prefix code.
– E4 is a prefix code.
In a stream of codewords, prefix codes have the advantage of being instanta-
neously decodable, i.e., a codeword can be decoded when its last bit is read. For this
reason, prefix codes are also called instantaneous codes.
Suppose that the sequence of codewords to decode is 0111010110.
With the encoding E4 , it can be instantaneously decoded into ADABC, by reading
from left to right. With the code E3 , it can be decoded, but one needs to read
further into the sequence. With E2 or E1 , it cannot be decoded. See Fig. 2.
Prefix sets ⊂ {0, 1}∗ can be represented by trees. Fig. 3 represents the tree of
the codewords of E4 .
Observation. The lengths of the codewords of E4 are 1, 2, 3, 3, and

2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 ≤ 1 .
This relation is called Kraft’s inequality.
[Lecture AIT-3]
Theorem 1 (Kraft). There exists a binary prefix code with codeword lengths l1, l2, . . . if and only if Σ_j 2^{-l_j} ≤ 1.

Proof. ⇒. Let P be the set of codewords of a binary prefix code. To each string
x ∈ {0, 1}∗ corresponds a specific kind of interval ⊂ [0, 1), called a cylinder:

Γx = [0.x, 0.x + 2^{-|x|}) .

Above, 0.x denotes the real number in [0, 1) whose binary expansion begins with x.
Γx can be understood as the set of all real numbers whose first bits in their binary
expansion are those of x. By construction, considering any two binary strings, either
one cylinder is contained in the other, or they are disjoint. The first case happens
only if one string is a prefix of the other. The cylinders pertaining to a prefix-free
set of codewords, like P, are, therefore, disjoint. Thus

Σ_{x∈P} 2^{-|x|} = Σ_{x∈P} length(Γx) ≤ length([0, 1)) = 1 .
⇐. Conversely, suppose the lengths l1 ≤ l2 ≤ . . . satisfy Kraft's inequality, and let
αj = Σ_{i<j} 2^{-l_i} < 1. The set of codewords given by {xj}, where 0.xj = αj and
|xj| = lj, forms the codewords of the prefix code.
This assignment of codewords is better viewed on the binary tree. The code-
word x1 is assigned to the lexicographically first node at depth l1 . When a node
is assigned, its descendant branches are cut out. Then x2 is assigned to the next
available node at depth l2 , and so on.
A code may have codeword lengths that fulfill Kraft's inequality without being
a prefix code: the code E3, in Example 2, for instance. Yet, Kraft's theorem states
that such a code can be transformed into a prefix code while keeping the lengths
of the codewords fixed. In the case of E3, we reverse the bits of the codewords and
obtain E4.
Theorem 2 (McMillan). The code-word lengths of any uniquely decodable code sat-
isfy Kraft’s inequality.
Corollary 1. The class of uniquely decodable codes, though bigger than that of prefix
codes, does not offer more admissible sets of codeword lengths. Since nothing is
gained in terms of codeword lengths by looking into this larger class, we can
safely restrict our attention to prefix codes.
2.4 Prefix codes of {0, 1}∗
Let us consider prefix codes of the alphabet A = {0, 1}∗.
E^(1)(x) = 1^{|x|} 0 x   and   E^(2)(x) = E^(1)(|x|) x = 1^{||x||} 0 |x| x ,

where ||x|| denotes the length of the length of x. The encoding E^(1)(x) will often be
used. It is called the self-delimiting encoding of x, and we denote it x̄.
The lengths of these codewords are

|E^(1)(x)| = 2|x| + 1 = 2⌊log x⌋ + 1   and   |E^(2)(x)| = 2⌊log⌊log x⌋⌋ + ⌊log x⌋ + 1 .

A prefix code E of the alphabet {0, 1}∗ ≃ N also yields a prefix encoding of tuples,

N^k → N
(x1, . . . , xk) ↦ E(x1) . . . E(xk) .
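A minimal sketch (my own, not from the notes) of the self-delimiting encoding x̄ = 1^{|x|} 0 x and of decoding a concatenated stream of such codewords, illustrating why the encoding is a prefix code.

def bar(x: str) -> str:
    """Self-delimiting encoding: |x| ones, a zero, then x itself."""
    return '1' * len(x) + '0' + x

def decode_stream(stream: str):
    """Split a concatenation of codewords x̄ back into the strings x."""
    out, i = [], 0
    while i < len(stream):
        n = 0
        while stream[i] == '1':         # count the unary length prefix
            n, i = n + 1, i + 1
        i += 1                          # skip the separating 0
        out.append(stream[i:i + n])
        i += n
    return out

if __name__ == '__main__':
    coded = bar('01') + bar('') + bar('1101')
    print(coded)                        # '11001' + '0' + '111101101'
    print(decode_stream(coded))         # ['01', '', '1101']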
3 Shannon Information Theory
[Lecture AIT-4]
Example 3. Eight horses race. They have probabilities of winning given by

(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) ,

and because they are probabilities, they sum to 1. Thus the numbers

(1, 2, 3, 4, 6, 6, 6, 6)

fulfil Kraft's inequality, and so there exists a prefix code with codewords of these
lengths¹. Since the lengths lx are related to the probabilities p(x) via p(x) = 2^{-lx},
the expected number of bits to transmit is

Σ_{x=1}^{8} p(x) lx = Σ_{x=1}^{8} p(x) log(1/p(x)) .
This expression shall be recognized as the Shannon entropy H(X) of the random
variable X defined by the race. Here, H(X) = 2. As we shall see, the entropy of a
random variable is a lower bound on the expected length of a prefix code.
¹ For instance, {0, 10, 110, 1110, 111100, 111101, 111110, 111111}.
We look for the codeword lengths lx that minimize the expected length Σ_x p(x) lx,
with the constraint that Σ_x 2^{-lx} ≤ 1 (because we want a prefix code). Assuming
an equality in the constraint and omitting that lx ∈ N, it boils down to a standard
minimization problem. Introduce the Lagrangian

J = Σ_x p(x) lx + λ ( Σ_x 2^{-lx} − 1 ) .

Setting the derivative to zero,

∂J/∂l_{x′} = p(x′) − λ ln 2 · 2^{-l_{x′}} = 0  ⟺  p(x′) = λ ln 2 · 2^{-l_{x′}} .

Summing over x′ sets λ ln 2 = 1, and we find that the minimal expected length is
achieved with codewords of length

l*_x = − log p(x) = log(1/p(x)) ,

so that the minimal expected length is the Shannon entropy

H(X) ≝ Σ_{x∈X} p(x) log(1/p(x)) .
The quantity log(1/p(x)), which grows as p(x) gets smaller, is sometimes called the
surprise or the self-information of the event {X = x}. The entropy is thus the
expectation of the surprise.
The operational meaning is clear from our derivation: it is the minimum expected
number of bits to communicate the outcome of the random variable in a prefix code
(or in a uniquely decodable code, thanks to McMillan’s theorem).
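A quick numerical check (not part of the notes) of Example 3, assuming the winning probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) that match the stated codeword lengths: the entropy and the expected length of the prefix code of footnote 1 both equal 2 bits.

from math import log2

p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]   # assumed winning probabilities
lengths = [1, 2, 3, 4, 6, 6, 6, 6]                  # l_x = log 1/p(x)

H = sum(px * log2(1 / px) for px in p)              # Shannon entropy H(X)
expected_length = sum(px * lx for px, lx in zip(p, lengths))

print(H, expected_length)                           # both equal 2.0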
Points 3 and 4 show that the more uncertain a random variable is, the larger its
entropy.
Definition 7. The joint entropy of a pair of discrete random variables (X, Y) is
defined as

H(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log(1/p(x, y)) .
If the random variable X takes value x̂ ∈ X , this defines pY |X=x̂ (y). Given
each individual outcome x̂ ∈ X of X, there is residual entropy H(Y |X = x̂) in the
distribution pY |X=x̂ (y). The conditional entropy H(Y |X) is the expectation (over
X) of that residual entropy.
Definition 8. The conditional entropy is

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
       = Σ_{x∈X} p(x) Σ_{y∈Y} ( −p(y|x) log p(y|x) ) .
Figure 4: The joint entropy H(X, Y).
Since Z is conditionally independent of X given Y , I(X; Z|Y ) = 0 and since
I(X; Y |Z) ≥ 0, we find I(X; Z) ≤ I(X; Y ).
4 Computability Theory
[Lecture AIT-6]
Algorithms have existed for millennia — for instance, the Babylonians employed
them as early as 2500 B.C. Informally, they describe a procedure that can be per-
formed to compute a desired outcome. In 1936, Church and Turing independently
formalized computability (and with it, algorithms).
Figure 5: A generic configuration of a Turing machine: a tape of cells (here containing . . . 0 1 0 0 1 0 1 1 . . .), a head scanning one cell, and a finite control holding the current state together with the finite list of rules, e.g. (q0, 0, R, q3), (q0, 1, 0, q0), (q1, 0, L, q2), (q1, 1, L, q2), (q2, 0, 1, q6), (q2, 1, B, q3), . . .
A computation starts at time 0 with the head on a distinguished start cell, and
the finite control is in a distinguished start state q0 . Initially, all cells of the tape
are blank, except for a contiguous finite sequence extending from the start cell to
the right, which contains 0s and 1s. This binary sequence is called the input.
The Turing machine performs one operation per time step in which it does one
of the five following actions: writes a 0 in the cell that is being scanned, writes a 1,
leaves a blank, moves the head to the left (L) or moves the head to the right (R).
The action performed depends on the current state q and the scanned symbol. At
the end of each operation, the control takes a (possibly different) state q ′ ∈ Q.
For a given Turing machine, its set of operations is given by the finite list of
rules of the form (q, s, a, q ′ ), where q, q ′ ∈ Q, s ∈ {0, 1, B} and a ∈ {0, 1, B, L, R}
(B stands for “blank”). This denotes the state q and the scanned symbol s before
the operation, the action a taken, and the state q ′ after the operation.
The machine is deterministic: (q, s, a1 , q1′ ) and (q, s, a2 , q2′ ) cannot both be rules
if (a1 , q1′ ) ≠ (a2 , q2′ ). Moreover, not every possible combination of the first two
elements has to be in the list: if the machine enters a state q and reads a symbol s
for which no rule is defined, it halts. Hence, we can define a Turing machine by a
mapping from a finite subset of Q × {0, 1, B} into {0, 1, B, L, R} × Q.
A given Turing machine, on a given input, carries out a uniquely determined
succession of operations, which may or may not terminate in a finite number of
steps.
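A minimal simulator (my own sketch, not from the notes) of this model: rules map (state, scanned symbol) to (action, new state), where the action writes 0, 1 or B, or moves the head L or R, and the machine halts when no rule applies.

def run(rules, tape_input, start='q0', max_steps=10_000):
    tape = {i: s for i, s in enumerate(tape_input)}   # unstored cells are blank 'B'
    head, state = 0, start
    for _ in range(max_steps):
        s = tape.get(head, 'B')
        if (state, s) not in rules:                   # no rule defined: halt
            break
        action, state = rules[(state, s)]
        if action == 'L':
            head -= 1
        elif action == 'R':
            head += 1
        else:                                         # write 0, 1 or B
            tape[head] = action
    # read back the contiguous non-blank content from the start cell to the right
    out, i = '', 0
    while tape.get(i, 'B') != 'B':
        out += tape[i]
        i += 1
    return out

if __name__ == '__main__':
    # toy example: flip every bit of the input, then halt on the first blank
    flip = {('q0', '0'): ('1', 'q1'), ('q0', '1'): ('0', 'q1'),
            ('q1', '0'): ('R', 'q0'), ('q1', '1'): ('R', 'q0')}
    print(run(flip, '0110'))    # -> '1001'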
“I shall also suppose that the number of symbols which may be printed
is finite. If we were to allow an infinity of symbols, then there would be
symbols differing to an arbitrarily small extent.”
The “computer” he is referring to cannot be what we call a computer, for it is in
that very paper that Turing lays down the foundational work that gave us our
computers. So what is he referring to? A “computer” was the name then given to
a person carrying out a computation.
Turing continues, arguing that the range of squares that a computer can read is
limited:
“We may suppose that there is a bound B to the number of symbols or
squares which the computer can observe at one moment. If he wishes to
observe more, he must use successive observations.”
Yet it can be shown that a machine scanning one single symbol, like the ones we defined
in §4.1, can simulate the computer that is scanning B symbols at a time.
Turing then explains why there are only finitely many states (here “states of
mind”).
“We will also suppose that the number of states of mind which need
be taken into account is finite. The reasons for this are of the same
character as those which restrict the number of symbols. If we admitted
an infinity of states of mind, some of them will be “arbitrarily close” and
will be confused. Again, the restriction is not one which seriously affects
computation, since the use of more complicated states of mind can be
avoided by writing more symbols on the tape.”
Example 4. A function

g : N × . . . × N → N
(x1, . . . , xn) ↦ g(x1, . . . , xn)

can be viewed as being computed by a Turing machine expecting an input of the
form x̄1 x̄2 . . . x̄n.
M: On input i,
Initialize a counter S = 0.
For each y ∈ {0, 1}∗ (in lexicographical order),
check if y corresponds to E(T ), for some T .
If it does, add 1 to a counter S.
If S = i, output y = E(Ti ).
[Lecture AIT-8]
Comment on the lecture: There is a brief discussion of Problem Sheet #1, §A.1.
Theorem 5. A universal Turing machine exists and can be constructed effectively.
Proof sketch. U invokes M so that, from i, it reconstructs the list of rules E(Ti ),
which it saves on a portion of its tape. To execute the consecutive actions that
Ti would perform (on input x) on its own tape, U uses Ti ’s rules to simulate Ti ’s
actions on a representation of Ti ’s tape contents.
In fact, there are infinitely many such U’s.
4.6 Church–Turing–Deutsch
Why do Turing machines matter so much? Why not some other model of computa-
tion? The answer lies in the robustness of the notion of computability itself—which
we now explore through the Church–Turing thesis, and its physical extension by
Deutsch.
It can be proved that the set of functions N → N that can be computed by
Turing machines is equal to the set of functions that can be computed by many
other systems of symbolic manipulations:
– λ-calculus
– general recursive functions
– Python, C, Lisp, Java, . . .
– TeX
– Conway's Game of Life
– Minecraft
– . . .
The Church–Turing Thesis: every function that can be computed by an effective
procedure (informally, by an algorithm) can be computed by a Turing machine.
The argument is short yet inescapable. Qua theoretical construct, Turing ma-
chines may appear to be an abstract piece of mathematics. But so do pseudo-
Riemannian manifolds, which are conjectured to describe spacetime in general rel-
ativity. Thus, whether Turing machines are the relevant theoretical construct de-
pends on whether they actually grasp the set of computations that can be carried
out physically.
Example 5 (The Zeno machine). Consider a regular Turing machine with an extra
functionality: a big step. In one single big step, the Zeno machine can perform
infinitely many operations of a regular Turing machine, as if it were doing one
operation in 1/2 time unit, another operation in 1/4 time unit, another in 1/8 time,
and so on. It can be proven that the Zeno machines compute more functions than the
Turing machines. Based on what do we reject Zeno machines as a valid formalization
of computation? Physics. Try and build that machine...
Example 6. Suppose that in the year 25, a new theory of physics is discovered:
physical systems can undergo different processes simultaneously. Since computations
are physical processes, this functionality can be harnessed, perhaps to speed up
computations due to parallelism. The year is 1925, this theory is quantum theory,
and the speed up is that of quantum computation.
4.7 Computing, Deciding, Enumerating, Recognizing
[Lecture AIT-9]
Comment on the lecture: Some solutions to Problem Sheet #1 (§A.1) are presented.
Example 7. All finite sets are decidable: the deciding algorithm can simply contain
the list of all elements and check membership against it. The set of odd numbers,
the set of prime numbers, and the set of powers of 17 are also decidable.
{⟨n, m⟩ | n is the largest number of consecutive 0s in the first 2m digits of π = 3.14159 . . .}.
3. The set of indices i such that the range of ϕi is nonempty: {i|∃j : ϕi (j) ց}.
Indeed, consider the following enumerator
E: For k = 1, 2, 3, . . .
simulate all {Ti (j)}i,j≤k for k steps of computation.
If some Ti (j) ց (which means ϕi is defined on j),
output i
Notice that if ϕi has a non-empty range, there is some j and some t such that
the computation Ti (j) halts after t steps. Eventually (when k = max(i, j, t)),
the enumerator will find it out and output i in the enumeration. On the other
hand, if ϕi ր for all inputs, the enumerator will never output i.
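The dovetailing idea can be sketched in code (mine, not from the notes). The helper halts_within is a hypothetical stand-in for "simulate T_i(j) for k steps and report whether it has halted"; any step-bounded simulator could play that role.

from itertools import count

def enumerate_nonempty_range(halts_within, bound=None):
    seen = set()
    ks = count(1) if bound is None else range(1, bound + 1)
    for k in ks:                            # k bounds indices, inputs and steps
        for i in range(1, k + 1):
            for j in range(1, k + 1):
                if i not in seen and halts_within(i, j, k):
                    seen.add(i)
                    yield i                 # T_i(j) halted: ϕ_i has nonempty range

if __name__ == '__main__':
    # toy family: T_i(j) halts after i*j steps when i is even, never when i is odd
    toy = lambda i, j, k: (i % 2 == 0) and (i * j <= k)
    print(list(enumerate_nonempty_range(toy, bound=10)))   # [2, 4, 6, 8, 10]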
The proof uses the diagonalization method invented by Cantor when he proved
that there are uncountably many real numbers. He assumed that the real numbers
could be enumerated and reached a contradiction by exhibiting a real number that
cannot be on the list. The contradiction forces us to abandon the possibility of
enumerating the real numbers. Here, the Turing machines and their inputs really
can be enumerated. The assumption that leads to the contradiction is that the
table i, j 7→ Ti (j) can be computed. See the proof and Table 1.
Theorem 6. The halting set K0 = {⟨i, j⟩ : Ti(j) ց} is undecidable.

Proof. Suppose that a Turing machine DK0 decides K0:

DK0(⟨i, j⟩) = 1 if Ti(j) ց ,   and   DK0(⟨i, j⟩) = 0 if Ti(j) ր .

From DK0, construct the Turing machine Tδ that, on input i, runs DK0(⟨i, i⟩): if the
answer is 1, Tδ enters an infinite loop, and if the answer is 0, Tδ halts (outputting,
say, 0). By construction, Tδ(i) ց if and only if Ti(i) ր. But Tδ is a Turing machine,
so Tδ = Td for some index d, and Td(d) ց if and only if Td(d) ր, a contradiction.
Hence DK0 cannot exist.
         1    2    3    4    5    · · ·    δ
  T1     3    ր    3    1    ր    · · ·
  T2     1    ր    1    ր    ր    · · ·
  T3     2    2    ր    2    ր    · · ·
  T4     1    ր    1    2    ր    · · ·
  T5     7    7    1    ր    7    · · ·
  ...    ...  ...  ...  ...  ...  ...
  Tδ     ր    0    0    ր    ր    · · ·    ?

Table 1: Table representing the results of all computations. The machine
Tδ is constructed by inverting the behaviour of the diagonal of the table.
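The diagonal argument reads naturally in programming terms. In the sketch below (my own, not from the notes), halts is the hypothetical decider DK0, assumed to return True exactly when program(argument) would halt; feeding the resulting diagonal program to itself yields the contradiction of Table 1.

def make_diagonal(halts):
    def diagonal(program):
        # invert the diagonal: halt iff program(program) does not halt
        if halts(program, program):
            while True:                 # loop forever
                pass
        return 0                        # halt
    return diagonal

# diagonal(diagonal) halts  <=>  halts(diagonal, diagonal) is False
#                           <=>  diagonal(diagonal) does not halt.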
The defining properties of FATs imply that the theorems of any given FAT can
be computably enumerated by a Turing machine. Such a Turing machine embodies
the FAT (the axioms and rules of inference can be encoded finitely), and enumerates,
one by one, all theorems in order of proof size (not in order of theorem-statement
size). How? Two ways.
1. For i = 1, 2, 3, . . ., output all statements obtained by i applications of a rule
of inference to the axioms.
2. Working through all strings p (in length-increasing fashion), verify if p is a
proof of some statement. If yes, output the statement.
In computer-theoretic terms, a formal axiomatic theory is a computably enu-
merable set of mathematical assertions.
Theorem 7 (Incompleteness à la Turing). For all FATs, there exists some i and j
for which Ti (j) ր is true but unprovable.
Proof. By contradiction: Suppose that there exists a FAT T in which, for all ⟨i, j⟩
such that Ti(j) ր, it can be proven that indeed Ti(j) ր. A decider for the halting
problem can be exhibited by executing two computations in parallel: on input ⟨i, j⟩,
simulate Ti(j) step by step and, in parallel, computably enumerate the theorems
of T. If the simulation halts, output 1; if the theorem “Ti(j) ր” is enumerated,
output 0. One of the two events must occur: if Ti(j) ց, the simulation eventually
halts, and if Ti(j) ր, a proof of that fact is eventually enumerated, by assumption.
This contradicts Theorem 6. Hence, some true statement of the form Ti(j) ր is
unprovable in T.
4.10 Rice's Theorem

Let S denote the successor function, S(n) = n + 1, and let IS = {i : ϕi = S} be the
set of indices of the Turing machines that compute S.

Problem: Is IS decidable? I.e., does there exist a Turing machine that computes

DIS(i) = 1 if ϕi = S ,   and   DIS(i) = 0 if ϕi ≠ S ?

In modern terms: can a fixed algorithm generically decide if a given program com-
putes the successor function?
Answer: No, the set IS is undecidable.
Proof. Suppose by contradiction that DIS exists. First, construct a Turing ma-
chine Tk that computes S. (The decider DIS can be used on input k to verify
that Tk does indeed compute S.) Then, from Tk, we construct a family of Turing
machines {Tk^(i,j)}_{i,j∈N}, with the parameters i and j hardcoded in the machine Tk^(i,j).

Tk^(i,j): On input n,
   Run Ti(j).
   If it halts,
      run and output Tk(n).
Note that if Ti(j) ց, then Tk^(i,j) does compute S, and
if Ti(j) ր, then Tk^(i,j) does not compute S.
Note also that the mapping

(i, j) → Tk^(i,j) → E(Tk^(i,j)) → index of Tk^(i,j)

is computed by a Turing machine M. Running DIS on the index M(⟨i, j⟩) would
therefore decide whether Ti(j) halts, contradicting the undecidability of the halting
problem.
Definition 20. An index set I is a set with the following property: If ϕi = ϕj then
i ∈ I ⇐⇒ j ∈ I
Index sets are a way to describe classes of partial computable functions or prop-
erties of partial computable functions. In the example, IS is an index set.
Rice's theorem states that every non-trivial index set is undecidable (non-trivial
meaning that I ≠ ∅ and I ≠ N).

Proof. Let I be a non-trivial set of indices. Suppose (by contradiction) that DI
exists. Program your favourite Turing machine Tl that loops on all inputs, and
compute its index l. Compute DI(l).
Case 1: If DI(l) = 0, i.e., l ∉ I, compute DI(1), DI(2), . . . until a k ∈ I is found.
Case 2: If DI(l) = 1, i.e., l ∈ I, compute DI(1), DI(2), . . . until a k ∉ I is found.
By construction, Tk computes a function that does not loop on all inputs—
i.e., it halts on at least one input. We can then define a family of Turing ma-
chines {Tk^(i,j)}_{i,j∈N} that have i and j hardcoded into them as follows:

Tk^(i,j): On input n,
   Run Ti(j).
   If it halts,
      run Tk(n).
Note that if Ti(j) ց, then Tk^(i,j) has the same behaviour as Tk. Therefore, k and
the index of Tk^(i,j) are either both elements of I (case 1) or both not elements of I
(case 2).
If Ti(j) ր, then Tk^(i,j) loops on all inputs, therefore either k ∈ I and the index of
Tk^(i,j) is not in I (case 1), or vice versa (case 2).
Note also that the mapping

(i, j) → Tk^(i,j) → E(Tk^(i,j)) → index of Tk^(i,j)

is computed by a Turing machine M. Comparing DI(M(⟨i, j⟩)) with DI(k) would
therefore reveal whether Ti(j) halts, thereby deciding the halting problem. We know
how impossible that would be.
All notions for Problem Sheet #2 have been covered. See §A.2.
5 Algorithmic Complexity
We have seen that some functions or numbers cannot be computed. But even among
computable objects, not all are created equal: some are structured, others seem
random. Can we measure this? Algorithmic complexity provides a framework for
quantifying the information content of individual strings, not through probabilities,
but via their shortest descriptions on a universal machine.
5.1 Motivation
There are contexts in which it is unsuitable to look for an underlying probability
distribution when communicating a string of symbols.
Example 10 (Communication of π). Suppose Alice wants to communicate the
following message to Bob:

314159265358 . . . 9   (the first 10^{10} decimal digits of π) .
We would like to say that the first one is not random and that the second one
at least appears to be random.
How can this be formalized probabilistically? It is possible that both strings
arise from the same probabilistic Bernoulli process of p = 1/2, in which case both
of them have the same probability.
If viewed as arising from a computation instead of a probabilistic process, the
patterns displayed by the first string amount to a short description of it (e.g. print
“01” 10^6 times). On the other hand, if the second string really has no patterns, then
there is no short description for it; the shortest computable description for it might
be the string itself, spelled out in full.
A description method is specified by a Turing machine T, which computes a string x
from a description (a program) p:

p → T → x .

We can thus define the complexity of a string x, with respect to the description
method given by T, as

x ↦ CT(x) = min{|p| : T(p) = x} .
Since what we are after is the shortest descriptions, we have good news: one com-
plexity measure is “smaller” than all the others.
Let

⟨i, j⟩ ≝ ī j = 1^{|i|} 0 i j .

Let U be the universal Turing machine such that U(⟨i, j⟩) = Ti(j).
Definition 23. The plain algorithmic complexity CU(x) of a string x ∈ {0, 1}∗ is
defined as

CU(x) = min{|p| : U(p) = x} .
A universal Turing machine with the property (∗), namely that for every Turing
machine T there is a constant bT such that CU(x) ≤ CT(x) + bT for all x, is additively
optimal. Note that not all universal Turing machines are additively optimal. For
instance, a universal Turing machine W may expect input of the form 1^i 0 1^q to
simulate Ti(q) (and halt with empty output on any program not of the right form).
The invariance theorem can be applied by setting T = V, where V is another
additively optimal universal Turing machine. We have ∃bV, ∃bU such that

CU(x) ≤ CV(x) + bV   and   CV(x) ≤ CU(x) + bU   for all x .

Therefore, there exists a constant bUV such that

|CU(x) − CV(x)| ≤ bUV   for all x .
And by the Church–Turing–Deutsch principle, if any other physical system is
used to define a complexity measure on strings, it will not yield an essentially
smaller measure than CU.
This is one program for computing x from U: the minimal program therefore has
length C(x) ≤ |p|.
Invoking a high-level programming language, Print “x” is a program for x. It has
length |x| + O(1). Recalling our identification of strings with numbers, Eq. (1), and
Prop. 1, |x| = ⌊log x⌋, so

C(x) ≤ log x + O(1) .
Definition 25. The complexity C(x, y) of a pair of strings is defined with respect
to a fixed encoding ⟨·, ·⟩ : {0, 1}∗ × {0, 1}∗ → {0, 1}∗ as C(x, y) ≝ C(⟨x, y⟩).
Definition 26. The conditional complexity of the string x given the string y is

C(x|y) ≝ min{|p| : U(p, y) = x} ,

where the universal Turing machine U(·, ·) has been promoted to two tapes³. The
second entry refers to the tape of auxiliary information.
We update the unconditional definition to C(x) ≝ C(x|ε).
Example 13. C(x|y) ≤ C(x) + O(1), because one way to compute x with y as
auxiliary information is to ignore y, and produce x from scratch with x∗ .
³ The two-tape Turing machine can be simulated by a single-tape Turing machine, so two-tape
machines compute the same class of functions. I invoke two-tape machines because they correspond
to the realistic model of keeping auxiliary information and programs on different registers.
5.5 Chain Rule for Plain Complexity
Proposition 5 (Chain rule, easy side). For all strings x and y,

C(x, y) ≤ C(x) + C(y|x) + O(log n) ,

where n = max{|x|, |y|}.
[Lecture AIT-14]
Proposition 6 (Chain rule for plain complexity, hard side). For all strings x and y,

C(x) + C(y|x) ≤ C(x, y) + O(log(C(x, y))) ≤ C(x, y) + O(log n) ,

where again n = max{|x|, |y|}.
Proof. Let m ≡ C(x, y). Define

A = {(x′, y′) : C(x′, y′) ≤ m} ,

which can be enumerated given m by running all programs of length at most m in a
dovetailed fashion. For each x′, let Ax′ = {y′ : (x′, y′) ∈ A}, and let l = ⌈log |Ax|⌉.
Given x, together with m and l as an O(log m) advice, y can be specified by its index
in the enumeration of Ax, so C(y|x) ≤ l + O(log m). Define now

B = {x′ : log |Ax′| > l − 1} ,

which contains x. If m and l are given (with an O(log m) advice, because l is a
number at most m + 1, because |Ax| is smaller than the number of programs
of length ≤ m), B can be enumerated by enumerating A (thanks to m), and when,
for a given x′, the subset Ax′ contains more than 2^{l−1} elements, x′ is added to B. A
possible program for x is thus given by the enumeration number of x in B. Note
that

|A| = Σ_{x′} |Ax′| ≥ Σ_{x′∈B} |Ax′| ≥ Σ_{x′∈B} 2^{l−1} = |B| 2^{l−1} .

Since |A| < 2^{m+1}, this gives |B| < 2^{m−l+2}, so C(x) ≤ log |B| + O(log m) ≤
m − l + O(log m). Adding the two bounds yields C(x) + C(y|x) ≤ m + O(log m) =
C(x, y) + O(log(C(x, y))).
Again, the error term could be set to be O(log n), where n = max{|x|, |y|}.
5.6 Incompressibility
It is easy to come up with highly compressible strings. For instance, define

f(k) = 2^{2^{···^{2}}}   (a tower of k twos).

For each k, the binary representation of f(k) has complexity at most C(k) + c ≤
log k + c for some constant c independent of k. Yet it has a huge length.
What about incompressible strings? A string x of length n is c-incompressible if
C(x) ≥ n − c. There are 2^n strings of length n, but only

2^0 + 2^1 + · · · + 2^{n−c−1} = 2^{n−c} − 1

programs of length smaller than n − c.
In the most conservative scenario, each of those short programs computes a different
string of length n. Indeed, generically, there are some of those short programs that
don't produce any output (they don't halt). Also, there are some of those short
programs that compute strings of a length other than n bits. And finally, there are
some of those short programs that compute the same string.
Thus, there are at least 2^n − 2^{n−c} + 1 strings of length n that are not computed
by such a short program, namely, that are c-incompressible.
2. Let n be an incompressible number (string) of length m. Consider
00 . . . 0 (n times) ≡ 0^n .
C(x) = n + m + O(1),
3. Solomonoff wanted to define the a priori probability of a string x as

P(x) = Σ_{p : U(p)=x} 2^{-|p|} .

This measure amounts to the probability that a program whose bits are picked
at random would yield the string x.
Solomonoff’s problem was that the measure does not converge. Indeed there
exists a Ti such that on all strings y, Ti (y) = x. Thus, each īy is a valid
program for x, as U(īy) = x for all y ∈ {0, 1}∗ .
P(x) ≥ Σ_{y∈{0,1}∗} 2^{-|īy|}
     = 2^{-|ī|} Σ_{y∈{0,1}∗} 2^{-|y|}
     = 2^{-|ī|} Σ_{n=0}^{∞} Σ_{y∈{0,1}^n} 2^{-n}
     = 2^{-|ī|} Σ_{n=0}^{∞} 2^n 2^{-n}
     = ∞ .
[Lecture AIT-15]
4. Recalling coding theory,
6 Prefix Complexity
In prefix complexity, it is demanded that a program must be self-delimited : its total
length in bits must be determined by the program itself. For Turing machines, it
means that the head scanning the program is not allowed to overshoot.
This way, the alphabet on which we measure program size really is binary. In
the plain complexity framework, the head can hit and recognize the first blank after
the program, which effectively introduces a third symbol into the alphabet used to
describe programs.
Real programming languages are self-delimiting because they provide instruc-
tions for beginning and ending programs. This allows a program to be well-formed
by the concatenation or nesting of sub-programs.
The first argument refers to the program tape content; the second to the work
(auxiliary) tape.
Remark 3. For all self-delimiting Turing machines T and for all auxiliary informa-
tion y, the set of programs that lead to a halting computation {p : T (p, y) ց} is a
prefix-free set. Indeed, if T (q, y) ց, it means that the last bit of q has been read
(and no more). Therefore, no program of the form qs, for some non-empty string s,
can also lead to a successful computation⁴.
In particular, the set {p : U(p) ց} is a prefix-free set. See Fig. 6.
Figure 6: A possible prefix tree representing the set {p : U(p) ց}.
On input 001, the self-delimiting universal Turing machine U halts and
outputs 11, and on input 10, it outputs ε. Thus, all programs of the
form 001s or 10s for non-empty s do not lead to a successful computa-
tion because their last bit will never be scanned. We denote U(001s) ր
and U(10s) ր, and remove the descendant branches from the tree. More-
over, the program 110 loops. Therefore, U(110) ր, but also U(110s) ր
for all nonempty s.
⁴ If qs happens to be on the input tape, the computation would carry through, but q would
be recognized as being the program of the computation. Therefore, we adopt the convention that
T(qs, y) is undefined and we also denote it as ր.
The unconditional prefix complexity of x is K(x) ≝ K(x|ǫ), x∗ is the shortest
program, and the prefix complexity of a pair is K(x, y) ≝ K(⟨x, y⟩).
In particular, since {p : U(p) ց} is prefix-free, the set of shortest programs
{x∗ : x ∈ {0, 1}∗} is a prefix code.
[Lecture AIT-16]
Comment on the lecture: Unfortunately, the audio was not recorded.
3. Solomonoff's problem is solved:

x ↦ P(x) = Σ_{p : U(p)=x} 2^{-|p|}

now converges for all x. Indeed, the set {p : U(p) = x} is prefix-free because
it is a subset of the prefix-free domain of halting programs {p : U(p) ց}.
Therefore, by Kraft's inequality, not only does P(x) ≤ 1, but also

Σ_x P(x) = Σ_{p : U(p)ց} 2^{-|p|} ≤ 1 .
2. Let x ∈ {0, 1}^n. I mentioned that intuitively, if both x and n are algorithmi-
cally random, we might want K(x) = n + m + O(1), where m = |n|. This now
holds: consider the program

r n∗ x ,

where r is a constant-size routine that first runs n∗ to obtain n and then copies
the next n bits of the program to the output. It is self-delimiting and of length
K(n) + n + O(1) = m + n + O(1).
Figure 7: A made-up graph for K(x), with its upper bound given by
f(x) = log x + log log x + log log log x + O(log log log log x), which is very
close to being tight: infinitely often, K(x) exceeds g(x) = log x + log log x +
log log log x.
Corollary 4. Let x ∈ {0, 1}∗. Via iterative appeals of Prop. 8, we find that

K(x) ≤ log x + log log x + log log log x + O(log log log log x) .

This gives an upper bound for K(x) that is nearly tight. Indeed, as is shown in
Problem Sheet #3 (§A.3), for infinitely many values of x, K(x) exceeds log x +
log log x + log log log x. See Fig. 7.
1. The chain rule had an O(log C(x, y)) error term. For prefix complexity, the
error is O(1).

Theorem 11 (Chain rule for prefix complexity). For all strings x and y,

K(x, y) = K(x) + K(y|x∗) + O(1) .

The proof of the chain rule follows upon proving the easy side, Prop. 9, and the
hard side, Prop. 17 in §8.1.
Proposition 9 (Chain rule for prefix complexity, easy side). For all x and y,
✲ x x ✲
x∗ ✲ O(1) ✲ K(x) O(1) ✲ x∗
and K(x) ✲
.
Proof. First, we want to show that there exists a constant-size program, which
takes x∗ as auxiliary information and outputs hx, K(x)i. This can be done with the
following instructions:
1. Measure the length of x∗ (this gives K(x));
2. Execute x∗ to get x;
3. Compute hx, K(x)i.
Second, to compute x∗ from (x, K(x)), for i = 1, 2, . . ., run all programs of length
K(x) for i steps. Eventually, some p halts with U(p) = x; this p is x∗.
Proposition 10 entails that the chain rule can be expressed as

K(x, y) = K(x) + K(y | x, K(x)) + O(1) .
Prefix complexity may be slightly larger than plain complexity, due to the cost
of self-delimiting structure, but not by more than O(log n) for strings of length n.
Proposition 11. For all x ∈ {0, 1}n, K(x) = C(x) + O(log n).
6.4 The Uncomputability of K(x)
Theorem 12. The function x 7→ K(x) is uncomputable.
Proof. Assume that there is a Turing machine M which, given x, computes K(x).
Let B be a constant to determine. Consider the following program.

   For x = ǫ, 0, 1, 00, 01, . . . (in lexicographical order),
      compute K(x) with M.
      If K(x) > B, output x and halt.

This program outputs the lexicographically first string zB of complexity greater
than B. Its length is that of M plus a self-delimiting encoding of B, i.e.,
O(log B) + O(1) bits, which is smaller than B when B is chosen large enough; this
contradicts K(zB) > B. Hence M cannot exist.
The argument is Berry's paradox in computational clothing:

“zB is the lexicographically first string that cannot be described in fewer than B bits.”

The displayed sentence can be put into bits, and if B is large enough, the obtained
description would be shorter than B bits. The contradiction is avoided because, while
the displayed sentence encoded in bits is a description of zB (it uniquely specifies
it), by the uncomputability of K(x), it is not an algorithmic description.
7 The Halting Probability Ω
As we shall see in that section, the halting probability Ω—also known as Chaitin’s
constant—is an algorithmically random number that compactly encodes the halting
problem and, with it, many mathematical truths.
Reminder: U(p) = Ti(j), where p = ī j. We defined Turing's number χ as

χ = χ0 χ1 χ2 . . . ,   where   χp = 1 if U(p) ց (i.e. Ti(j) ց),   and   χp = 0 if U(p) ր (i.e. Ti(j) ր) .

Observe that the first 2^0 + 2^1 + 2^2 + · · · + 2^n = 2^{n+1} − 1 bits of χ encode the halting
problem for all programs of length ≤ n.
Definition 29. The halting probability Ω (or Chaitin's constant) is

Ω ≝ Σ_{p : U(p)ց} 2^{-|p|} .
[Lecture AIT-18]
Comment on the lecture: The solution to Problem 4 of Sheet #2 (A.2) is discussed.
Moreover, in the first hour, I reviewed a selection of articles. See Appendix B for
the references and abstracts of the presented articles.
Proposition 12. Ω is a compressed version of χ: n bits of Ω are enough to compute
2^{n+1} − 1 bits of χ. Schematically,

Ω[n] → χ[2^{n+1}−1] .
Thus, Ω is not computable, but it is lower semi-computable: we can successively
approximate it from below, though never exactly determine its bits.
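The lower semi-computation of Ω can be sketched in code (my own toy, not from the notes). Here halts_within(p, k) is a hypothetical step-bounded simulator of a self-delimiting universal machine on program p; it is replaced below by a made-up prefix-free toy domain so that the code runs.

from itertools import product

def omega_lower_bound(halts_within, k):
    """k-th approximation: sum 2^{-|p|} over programs |p| <= k halting within k steps."""
    total = 0.0
    for n in range(1, k + 1):
        for bits in product('01', repeat=n):
            p = ''.join(bits)
            if halts_within(p, k):
                total += 2.0 ** (-n)
    return total          # non-decreasing in k; converges to Ω from below

if __name__ == '__main__':
    # toy domain: the prefix-free set {1^t 0 : t >= 0}; pretend 1^t 0 halts after t steps
    toy = lambda p, k: set(p[:-1]) <= {'1'} and p.endswith('0') and len(p) - 1 <= k
    for k in (2, 4, 8, 16):
        print(k, omega_lower_bound(toy, k))   # approaches 1 from below for this toy domain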
[Lecture AIT-19]
Theorem 13. Ω is algorithmically random: K(Ω[n]) > n − O(1).

Proof. From Ω∗[n], Ω[n] can be reconstructed, and by Proposition 12, χ[2^{n+1}−1] can be
computed. This, in turn, permits the computation of the output of all programs of
length ≤ n. All of these strings, therefore, have complexity smaller or equal to n.
Then, the lexicographically first string that does not have complexity ≤ n can be
recognized. Let us denote it zn (it is zB with B = n). In diagram form,
Ω∗[n] → {x : K(x) ≤ n} → [O(1)] → zn ≝ min{x : K(x) > n} .
This program for zn is of total size K(Ω[n] ) + O(1), so K(zn ) ≤ K(Ω[n] ) + O(1). But
by definition, K(zn ) > n.
All notions for Problem Sheet #3 have been covered. See §A.3.
Consider, for instance, Goldbach's conjecture, which asserts that every even number
greater than 2 is the sum of two primes. It is refutable by finite means: consider
the program

q: For i = 1, 2, 3, . . .
   check if 2i + 2 = p1 + p2, where p1 and p2 are primes.
   If yes, go to the next i.
   If not, halt.
Note that q halts if and only if Goldbach’s conjecture is false. Moreover, q is quite
short: its halting status is encoded in the first few thousand bits of Ω.
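A runnable version of q (my own sketch, not from the notes): it searches for a counterexample to Goldbach's conjecture and halts only if it finds one; the max_i bound exists solely so the demonstration terminates.

from itertools import count

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def q(max_i=None):
    for i in count(1) if max_i is None else range(1, max_i + 1):
        even = 2 * i + 2
        if not any(is_prime(p) and is_prime(even - p) for p in range(2, even // 2 + 1)):
            return even            # halt: a counterexample to Goldbach was found
    return None                    # bounded search only (for demonstration)

if __name__ == '__main__':
    print(q(max_i=1000))           # None: no counterexample among even numbers up to 2002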
Another example of a statement refutable by finite means is Fermat's last theorem
(proven by Wiles). It states that the equation x^n + y^n = z^n admits no integer
solutions for x, y, z > 0 and n > 2. An exhaustive search can look for such an
integer solution and halt if it finds it.
The Riemann hypothesis states that the non-trivial zeros of ζ(s) = Σ_{n=1}^{∞} 1/n^s are
all on the line Re(s) = 1/2. The Riemann hypothesis can be shown to be refutable
by finite means, due to a result by Matiyasevich [11]. Therefore, the first bits of Ω
could tell us if the hypothesis is true or not.
The first bits of Ω can also be used to determine whether a statement s is
provably true (s is a theorem), provably false (¬s is a theorem), or independent
of some formal axiomatic theory A. Remember that a formal axiomatic theory is
a computably enumerable set of mathematical assertions (theorems). Consider the
following program
Q: Computably enumerate the theorems of A.
   If s is enumerated, halt and output 1.
   If ¬s is enumerated, halt and output 2.
If Q loops, s is independent. The program Q is of finite length (roughly the length
of the enumerator for A plus that of an algorithmic description of s). We use Ω to
determine whether Q halts or loops. If it is promised to halt, we run Q to find
whether it outputs 1 or 2.
7.2 Chaitin's Incompleteness Theorem

Let A0 be a formal axiomatic theory (proving only true statements) of complexity
K(A0), and let k0 be a constant to be chosen. Suppose that A0 could prove a
statement of the form “K(y) > k0” for some string y. Consider the program

p: Computably enumerate all the proofs of A0 until a proof is found of
   “The string y is such that K(y) > k0”, for some y.
   Return that y.

The program p is of length K(A0) + O(log k0) + O(1), which is smaller than k0 once
k0 is chosen sufficiently large (yet of the order of K(A0)). But p outputs a string y
with K(y) > k0, a contradiction. Therefore, A0 can prove “K(y) > k0” for no string y,
even though the statement is true of all but finitely many strings.
For example, the complexity of ZFC might be 5000. Then, the overwhelming
majority of strings of length 6000 bits have a complexity larger than 5000, yet this
can be proven in ZFC for no such string.
Whereas Gödel's incompleteness might appear to be an anomaly, Chaitin's
incompleteness shows the phenomenon to be inevitable. It puts information-theoretic
limits on the power of formal axiomatic theories.
[Lecture AIT-20]
Comment on the lecture: Some solutions to Problem Sheet #3 are presented.
What happens if we enlarge the axioms? The complexity of the formal axiomatic
theory increases, but it doesn’t circumvent the problem.
Theorem 14 (Chaitin). There is a constant c such that a formal axiomatic the-
ory A0 with complexity K(A0 ) can never determine more than K(A0 ) + c bits of Ω.
Proof sketch: Otherwise, we can get a compression of Ω.
Chaitin wrote [12]:
“I believe that complexity reveals the depth of the incompleteness phe-
nomenon: For the Platonic world of mathematical ideas has infinite com-
plexity, but any FAT only has finite complexity. Therefore, any FAT can
only capture an infinitesimal part of the richness of pure mathematics.”
He also wrote [12]:
“I believe in creativity in mathematics, I believe that math isn’t a closed
system, it’s an open system. A fixed formal axiomatic system of the
kind that Hilbert wanted for all of math, yes, it would have given us
absolute certainty, but it would also have made mathematics into a frozen
cemetery.”
He advocates a “quasi-empirical” viewpoint on mathematics, where the complexity
of formal axiomatic theories can be increased “by adding new principles, new axioms,
that are not self-evident but that are justified by their fertile consequences” [12].
8 The Coding Theorem
[Lecture AIT-21]
Reminder:

P(x) = Σ_{p : U(p)=x} 2^{-|p|} = 2^{-|x∗|} + 2^{-|p1|} + 2^{-|p2|} + 2^{-|p3|} + · · · ,   where 2^{-|x∗|} = 2^{-K(x)} .
Clearly P(x) > 2^{-K(x)}. But how much larger than 2^{-K(x)} can P(x) be? For some
x of length 50, there might be 3 programs of length K(x) and 4 programs of length
K(x) + 1, so for that x,

P(x) ≥ 3 · 2^{-K(x)} + 4 · 2^{-(K(x)+1)} = 5 · 2^{-K(x)} .
However, one might expect that as the length of the string increases, the multiplica-
tive factor relating P (x) and 2−K(x) could grow unboundedly. For instance, it would
suffice to have strings of increasing length whose number of programs of minimal
length grows unboundedly.
The coding theorem shows that this is not the case: it states that there exists a
constant c such that ∀x,

P(x) ≤ c · 2^{-K(x)} ,   i.e.,   K(x) = − log P(x) + O(1) .
Proof. For a first take on the proof, let us set y = ǫ. The proof carries through if y
sits on the auxiliary tape.
We need to show that K(x) ≤ − log P(x) + O(1). We shall do this by ex-
hibiting a Turing machine T that computes each x from an input al of length
⌈− log P(x)⌉ + 1. Since U can simulate that machine, it provides a program of
length − log P(x) + O(1) to compute x, and thus the shortest program for x has a
length no greater.
T(al): Run all programs, in a dovetailed fashion.
   When the j-th program halts (in the dovetail order),
   add (pj, xj, Sxj, aj) to a list of quadruples, where
      pj is the program that halted,
      xj = U(pj) is its output,
      Sxj = Σ_{i≤j : xi=xj} 2^{-|pi|} is the current value
         of the lower semi-computation of P(xj), and
      aj is ∅ if the position of the leading 1
         in the binary expansion of Sxj has not changed;
         if it has changed, aj is the address of
         the lexicographically first available node in
         the binary tree at depth ⌈− log Sxj⌉ + 1.
   Output the string that is allocated on the node al.
i    pi        xi       Sxi         ai
1    1100      01       0.0001      00000
2    00110     11011    0.00001     000010
3    000       1        0.001       0001
4    011101    11011    0.000011    ∅
5    111       11011    0.001011    0010
...  ...       ...      ...         ...

(Alongside, the corresponding binary tree has the nodes 00000, 000010, 0001 and 0010 allocated.)
Suppose that P (11011) = 0.00110111.... The lower semi-computation of S11011
would continue forever, but 11011 will not be assigned to more nodes, because the
leading 1 of S11011 has been stabilized.
The machine T can compute x via the massive computation that builds the tree
for all strings, together with the address 0010, which tells it to stop and output the
string allocated to that address. Here, this is x = 11011.
In general, because Sx lower semi-computes P(x), there will eventually be a
program pl that computes x (x = xl) and whose contribution to Sx will bring
the leading 1 in the binary expansion of Sx to the same position as the
leading 1 in the binary expansion of P(x). This will give a first (and only) node at
depth ⌈− log Sxl⌉ + 1 = ⌈− log P(x)⌉ + 1 to which the string xl = x will be allocated.
The address of this node is al, and |al| = ⌈− log P(x)⌉ + 1.
We should verify that this way of allocating strings to nodes in the binary tree
will never run out of space.
Σ_{j : aj≠∅} 2^{-|aj|} = Σ_x Σ_{j : aj≠∅, xj=x} 2^{-|aj|}
                      < Σ_x ( 2^{-(⌈− log P(x)⌉+1)} + 2^{-(⌈− log P(x)⌉+2)} + 2^{-(⌈− log P(x)⌉+3)} + · · · )
                      = Σ_x 2^{⌊log P(x)⌋} 2^{-1} (1 + 1/2 + 1/4 + · · ·)
                      ≤ Σ_x P(x)
                      ≤ 1 .
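The allocation procedure of the machine T can be simulated. The sketch below (mine, not from the notes) is fed the halting events (pj, xj) of the worked example: it tracks the lower semi-computation Sx and, whenever the leading 1 of Sx moves, allocates the lexicographically first available node at depth ⌈− log Sx⌉ + 1, reproducing the addresses 00000, 000010, 0001, ∅, 0010 of the table above.

from fractions import Fraction
from itertools import product

def leading_one_position(S):
    """Smallest m with 2^-m <= S, i.e. ⌈-log2 S⌉ for 0 < S <= 1."""
    m = 0
    while Fraction(1, 2 ** m) > S:
        m += 1
    return m

def first_available(depth, assigned):
    """Lexicographically first node at `depth` whose subtree is untouched."""
    for bits in product('01', repeat=depth):
        a = ''.join(bits)
        if not any(a.startswith(b) or b.startswith(a) for b in assigned):
            return a
    return None

def allocate(events):
    S, pos, assigned, rows = {}, {}, [], []
    for p, x in events:                       # the j-th program p halts with output x
        S[x] = S.get(x, Fraction(0)) + Fraction(1, 2 ** len(p))
        new_pos = leading_one_position(S[x])
        if new_pos != pos.get(x):             # the leading 1 of S_x has moved
            pos[x] = new_pos
            a = first_available(new_pos + 1, assigned)
            assigned.append(a)
        else:
            a = None                          # denoted ∅ in the table
        rows.append((p, x, S[x], a))
    return rows

if __name__ == '__main__':
    events = [('1100', '01'), ('00110', '11011'), ('000', '1'),
              ('011101', '11011'), ('111', '11011')]
    for row in allocate(events):
        print(row)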
Proposition 14. Let y be some auxiliary string. Let Tprefix be a Turing machine
whose domain {a : Tprefix (a, y) ց} is a prefix-free set. Then there exists a self-
delimiting Turing machine Ts-d s.t. Tprefix (·, y) ≡ Ts-d (·, y).
Proof. The idea is that Ts-d should read another square of its program tape only
when it is necessary. Suppose Ts-d has the string y on its work/auxiliary tape
(no self-delimitation constraints are required on the auxiliary information). Ts-d
generates the computably enumerable set S = {s : Tprefix (s, y) ց} on its work tape.
As it generates S, Ts-d continually checks whether or not the part p of the program
that it has already read (initially p = ǫ) is a prefix of some known element s of S.
Whenever Ts-d finds that p is a prefix of an s ∈ S, it does the following. If p is
a proper prefix of s (i.e. if s = pw, for w 6= ǫ), Ts-d reads another square of the
program tape. This is because it knows that p cannot be in the domain, otherwise
the domain would not be prefix-free. And if p = s, Ts-d computes and outputs
Tprefix (p, y) on the output tape.
µ(x) ≤ c m(x) ∀x .
From this construction of Sx we define programs for T which produce x. The
set Sx can be covered from within by cylinders Γp = [0.p, 0.p + 2^{-|p|}), namely, Γp ⊆ Sx
and ∪p Γp = Sx. The p's are T's programs for x.
Remark 4. This proof provides another equivalent “probabilistic” interpretation of
the input tape. One can view the input tape as already containing infinitely many
bits (instead of calling them by the monkey one by one). All infinite strings s which
would yield a computation of x by T are of the form s = ps′ with T (p) = x. The
Lebesgue measure of all such s is PT (x).
Proposition 16. P (x) is a universal semi-measure for the class of lower semi-
computable semi-measures.
Proof. By the last Proposition, it suffices to show that P(x) ≝ PU(x) dominates all
PT(x). By the universality of U, there exists r such that U(rq) = T(q). Thus

PT(x) = 2^{|r|} Σ_{q : T(q)=x} 2^{-|r|} 2^{-|q|}
      = 2^{|r|} Σ_{q : U(rq)=x} 2^{-|rq|}
      ≤ 2^{|r|} Σ_{p : U(p)=x} 2^{-|p|}
      = 2^{|r|} P(x) .
We need to show that K(x|y∗) ≤ K(x, y) − K(y) + c, that is,

−K(x, y) + K(y) ≤ −K(x|y∗) + c
⟺
2^{-K(x,y)} / 2^{-K(y)} ≤ c′ · 2^{-K(x|y∗)} ,   where c′ = 2^c .

This will be obtained if we show that x ↦ 2^{-K(x,y)} / 2^{-K(y)} is a lower semi-computable
semi-measure given y∗.
Analogously with the probabilistic setting (§3.3), the O(1)-chain rule entails a
symmetric notion of algorithmic mutual information,

I(x; y) ≝ K(x) − K(x|y∗)
        = K(x) − K(x, y) + K(y) + O(1)     (chain rule)
        = K(y) − K(y|x∗) + O(1)            (chain rule)
        = I(y; x) + O(1)                   (by definition).
All notions for Problem Sheet #4 have been covered. See §A.4.
A Problem Sheets
A.1 Problem Sheet #1
1. We identified strings with natural numbers via {(ǫ, 1), (0, 2), (1, 3), (00, 4), . . .},
and we defined x̄ = 1^{|x|} 0 x and x′ = |x| x. Find y, z and t if

ȳ z′ t = 111110001011110000110100101111100100101 .
2. It was once asked (on 23.02.23, in D1.14) whether there are countably many
or uncountably many prefix codes of a countable alphabet. Answer that question.
Hint: subsets of {0, 10, 110, . . .}.
3. A conjecture was once made (on 23.02.23, in D1.14) that a set of strings is both
prefix-free and suffix-free if and only if the strings have constant length. Disprove
the conjecture by exhibiting an infinite prefix- and suffix-free set of binary strings.
4. Does there exist a prefix code Φ : {0, 1}∗\{ǫ} → {0, 1}∗ such that
6. Prove the chain rule for entropy: H(X, Y ) = H(X) + H(Y |X) .
A.2 Problem Sheet #2
4. We saw that the set of indices i such that the range of ϕi is nonempty is com-
putably enumerable. It was asked (on 28.03.23) if the set is also computable. Answer
that question and provide a brief justification.
R = RE ∩ coRE .
⁶ The order is given by N.
⁷ In complexity theory, a subset of {0, 1}∗ ≃ N is also called a language.
A.3 Problem Sheet #3
Note 1. One of these questions is a bonus: 6 correct solutions yield 60/60, 7 correct
solutions yield 70/60.
Note 2. If some proof involves a program in the setting of prefix complexity, explain
what makes the program self-delimited.
1. Let 0^n ≝ 00 . . . 0 (n times). Show that C(0^n) = C(n) + O(1).
2. Show that for all x ∈ {0, 1}n, K(x) = C(x) + O(log n).
3. Prove the easy side of the chain rule for prefix complexity: for all x and y,

K(x, y) ≤ K(x) + K(y|x∗) + O(1) .

4. Show that the function

K : N → N
x ↦ K(x)

is upper semi-computable.
A.4 Problem Sheet #4
Notes. Question 3 is only for fun. No points, no bonus. Just fun. Question 6 is
worth as many points as one chooses to solve the easier or the harder version.
2. Let x be an n bit string with exactly as many 0’s as 1’s. Using either the plain
or the prefix setting, show that x is compressible given n.
Stirling's formula: n! = n^n e^{-n} √(2πn) (1 + O(1/n)).
3. Only for fun. “But how can the result of question 2 be possible? By the law of
large numbers, in a probabilistic random string, we should expect as many 0's as
1's.” Yes, but not exactly as many, for this is a pattern which entails compression. In
fact, the central limit theorem states that we should expect variations of about √n.
Easier. Show that the analysis performed in question 2 fails if the number of 1's is
n/2 + k, where k = O(√n) and k is an incompressible number given n.
Harder. Show that x is incompressible given n only if the number of 1's in x is n/2 + k,
where k = O(√n) and k is an incompressible number given n.
Note. That’s another way to incompleteness: Only a finite number of programs can
be proven elegant, yet there are infinitely many elegant programs.
5. a) Write down a function that grows extremely fast. The faster it grows, the
better. Don’t read further before having written down a function.
b) Read online about the Ackermann function. Does it grow faster than yours?
c) The Busy Beaver function is defined as
def
B(n) = max{Rt(p) : U(p) ց & |p| ≤ n} ,
where Rt(p) denotes the run time of p, namely, the number of computational steps
before U(p) halts. Show that B(n) grows faster than any computable function,
namely, show that for all computable f , there exists n0 such that for all n ≥ n0 ,
B(n) > f (n).
Note. Assuming that your function was computable, the Busy Beaver beats both
your and Ackermann’s functions.
d) Show that Ω[n] permits the computation of B(n).
6. Lower semicomputable semimeasures. Let
7. Write down a question that you have regarding any topic related to AIT. If you
would like to comment on why you find your question intriguing, please do it.
If you have ideas on how to solve your question, please elaborate.
B Proposed Articles for the Seminar
Theoretical, Foundational
1. Gregory J Chaitin. Incompleteness theorems for random reals. Advances in
Applied Mathematics, 8(2):119–146, 1987.
Abstract: We obtain some dramatic results using statistical mechanics-thermodynamics kinds of
arguments concerning randomness, chaos, unpredictability, and uncertainty in mathematics. We
construct an equation involving only whole numbers and addition, multiplication, and exponenti-
ation, with the property that if one varies a parameter and asks whether the number of solutions
is finite or infinite, the answer to this question is indistinguishable from the result of independent
tosses of a fair coin. This yields a number of powerful Gödel incompleteness-type results concerning
the limitations of the axiomatic method, in which entropy-information measures are used.
Algorithmic statistics
3. Nikolay Vereshchagin and Alexander Shen. Algorithmic statistics: Forty years
later. In Computability and Complexity, pages 669–737. Springer, 2017.
Abstract: Algorithmic statistics has two different (and almost orthogonal) motivations. From the
philosophical point of view, it tries to formalize how the statistics works and why some statistical
models are better than others. After this notion of a "good model" is introduced, a natural question
arises: it is possible that for some piece of data there is no good model? If yes, how often these
bad (“non-stochastic”) data appear "in real life"? Another, more technical motivation comes from
algorithmic information theory. In this theory a notion of complexity of a finite object (=amount
of information in this object) is introduced; it assigns to every object some number, called its
algorithmic complexity (or Kolmogorov complexity). Algorithmic statistic provides a more fine-
grained classification: for each finite object some curve is defined that characterizes its behavior.
It turns out that several different definitions give (approximately) the same curve. In this survey
we try to provide an exposition of the main results in the field (including full proofs for the most
important ones), as well as some historical comments. We assume that the reader is familiar with
the main notions of algorithmic information (Kolmogorov complexity) theory.
4. Paul M. B. Vitányi and Ming Li. Minimum description length induction, Bayesianism,
and Kolmogorov complexity. IEEE Transactions on Information Theory,
46(2):446–464, 2000.
Abstract: The relationship between the Bayesian approach and the minimum description length ap-
proach is established. We sharpen and clarify the general modeling principles minimum description
length (MDL) and minimum message length (MML), abstracted as the ideal MDL principle and
defined from Bayes’s rule by means of Kolmogorov complexity. The basic condition under which
the ideal principle should be applied is encapsulated as the fundamental inequality, which in broad
terms states that the principle is valid when the data are random, relative to every contemplated hy-
pothesis and also these hypotheses are random relative to the (universal) prior. The ideal principle
states that the prior probability associated with the hypothesis should be given by the algorithmic
universal probability, and the sum of the log universal probability of the model plus the log of the
probability of the data given the model should be minimized. If we restrict the model class to finite
sets then application of the ideal principle turns into Kolmogorov’s minimal sufficient statistic. In
general, we show that data compression is almost always the best strategy, both in model selection
and prediction.
Induction
5. Samuel Rathmanner and Marcus Hutter. A philosophical treatise of universal
induction. Entropy, 13(6):1076–1136, 2011.
Abstract: Understanding inductive reasoning is a problem that has engaged mankind for thousands
of years. This problem is relevant to a wide range of fields and is integral to the philosophy of
science. It has been tackled by many great minds ranging from philosophers to scientists to math-
ematicians, and more recently computer scientists. In this article we argue the case for Solomonoff
Induction, a formal inductive framework which combines algorithmic information theory with the
Bayesian framework. Although it achieves excellent theoretical results and is based on solid philo-
sophical foundations, the requisite technical knowledge necessary for understanding this framework
has caused it to remain largely unknown and unappreciated in the wider scientific community. The
main contribution of this article is to convey Solomonoff induction and its related concepts in a
generally accessible form with the aim of bridging this current technical gap. In the process we ex-
amine the major historical contributions that have led to the formulation of Solomonoff Induction as
well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff
induction addresses many issues that have plagued other inductive systems, such as the black ravens
paradox and the confirmation problem, and compare this approach with other recent approaches.
“logical depth” as the time required by a standard universal Turing machine to generate it from an
input that is algorithmically random (i.e. Martin-Löf random). This definition of depth is shown to
be reasonably machine-independent, as well as obeying a slow-growth law: deep objects cannot be
quickly produced from shallow ones by any deterministic process, nor with much probability by a
probabilistic process, but can be produced slowly. Next we apply depth to the physical problem of
“self-organization,” inquiring in particular under what conditions (e.g. noise, irreversibility, spatial
and other symmetries of the initial conditions and equations of motion) statistical-mechanical model
systems can imitate computers well enough to undergo unbounded increase of depth in the limit of
infinite space and time.
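One standard way to state the definition used here (with a significance parameter s; notational details vary across presentations) is
\[
\mathrm{depth}_s(x) \;=\; \min\bigl\{\, \mathrm{time}_U(p) \;:\; U(p) = x,\ |p| \le K(x) + s \,\bigr\},
\]
the least running time of any program for x that is within s bits of a shortest one. Random strings and trivially regular strings are both shallow; depth singles out objects whose plausible origins all require long computations.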
Thermodynamics
9. Paul Vitányi and Ming Li. Algorithmic arguments in physics of computation. In
Algorithms and Data Structures: 4th International Workshop, WADS’95 Kingston,
Canada, August 16–18, 1995 Proceedings 4, pages 315–333. Springer, 1995.
Abstract: We show the usefulness of incompressibility arguments based on Kolmogorov complexity
in physics of computation by several examples. These include analysis of energy parsimonious
’adiabatic’ computation, and scalability of network architectures.
11. Ämin Baumeler and Stefan Wolf. Causality–Complexity–Consistency: Can
space-time be based on logic and computation? In Time in Physics, pages 69–101.
Springer, 2017.
Abstract: The difficulty of explaining non-local correlations in a fixed causal structure sheds new
light on the old debate on whether space and time are to be seen as fundamental. Refraining from
assuming space-time as given a priori has a number of consequences. First, the usual definitions
of randomness depend on a causal structure and turn meaningless. So motivated, we propose an
intrinsic, physically motivated measure for the randomness of a string of bits: its length minus its
normalized work value, a quantity we closely relate to its Kolmogorov complexity (the length of the
shortest program making a universal Turing machine output this string). We test this alternative
concept of randomness for the example of non-local correlations, and we end up with a reasoning that
leads to similar conclusions as in, but is conceptually more direct than, the probabilistic view since
only the outcomes of measurements that can actually all be carried out together are put into relation
to each other. In the same context-free spirit, we connect the logical reversibility of an evolution to
the second law of thermodynamics and the arrow of time. Refining this, we end up with a speculation
on the emergence of a space-time structure on bit strings in terms of data-compressibility relations.
Finally, we show that logical consistency, by which we replace the abandoned causality, is a
strictly weaker constraint than the latter in the multi-party case.
13. Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, and
Arleta Szkoła. Entropy and quantum Kolmogorov complexity: A quantum Brudno’s
theorem. Communications in Mathematical Physics, 265:437–461, 2006.
Abstract: In classical information theory, entropy rate and algorithmic complexity per symbol are
related by a theorem of Brudno. In this paper, we prove a quantum version of this theorem,
connecting the von Neumann entropy rate and two notions of quantum Kolmogorov complexity,
both based on the shortest qubit descriptions of qubit strings that, run by a universal quantum
Turing machine, reproduce them as outputs.
Information Measures
14. Charles H Bennett, Péter Gács, Ming Li, Paul MB Vitányi, and Wojciech
H Zurek. Information distance. IEEE Transactions on Information Theory,
44(4):1407–1423, 1998.
Abstract: While Kolmogorov complexity is the accepted absolute measure of information content
in an individual finite object, a similarly absolute notion is needed for the information distance
between two individual objects, for example, two pictures. We give several natural definitions of a
universal information metric, based on length of shortest programs for either ordinary computations
or reversible (dissipationless) computations. It turns out that these definitions are equivalent up
to an additive logarithmic term. We show that the information distance is a universal cognitive
similarity distance. We investigate the maximal correlation of the shortest programs involved, the
maximal uncorrelation of programs (a generalization of the Slepian–Wolf theorem of classical infor-
mation theory), and the density properties of the discrete metric spaces induced by the information
distances. A related distance measures the amount of nonreversibility of a computation. Using
the physical theory of reversible computation, we give an appropriate (universal, antisymmetric,
and transitive) measure of the thermodynamic work required to transform one object in another
object by the most efficient process. Information distance between individual objects is needed in
pattern recognition where one wants to express effective notions of “pattern similarity” or “cognitive
similarity” between individual objects and in thermodynamics of computation where one wants to
analyze the energy dissipation of a computation from a particular input to a particular output.
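For later reference (these are standard formulas rather than a paraphrase of the paper's statements): up to logarithmic additive terms, the information distance of the abstract equals
\[
E(x,y) \;=\; \max\{K(x \mid y),\, K(y \mid x)\},
\]
and its normalized form, together with the compressor-based approximation used throughout the applications below, is
\[
\mathrm{NID}(x,y) \;=\; \frac{\max\{K(x\mid y),\,K(y\mid x)\}}{\max\{K(x),\,K(y)\}}
\;\approx\;
\mathrm{NCD}(x,y) \;=\; \frac{C(xy) - \min\{C(x),\,C(y)\}}{\max\{C(x),\,C(y)\}},
\]
where C(\cdot) denotes the compressed length under a real compressor such as gzip, bzip2, or zpaq.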
Applications
Anomaly Detection
16. Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana, Li Wei, Sang-
Hee Lee, and John Handley. Compression-based data mining of sequential data.
Data Mining and Knowledge Discovery, 14:99–129, 2007.
Abstract: The vast majority of data mining algorithms require the setting of many input parameters.
The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may
cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem
is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate
the significance of the reported patterns. This is especially likely when the user fails to understand
the role of parameters in the data mining process. Data mining algorithms should have as few pa-
rameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices,
expectations, and presumptions on the problem at hand, and would let the data itself speak to us.
In this work, we show that recent results in bioinformatics, learning, and computational theory hold
great promise for a parameter-light data-mining paradigm. The results are strongly connected to
Kolmogorov complexity theory. However, as a practical matter, they can be implemented using
any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will
show that this approach is competitive or superior to many of the state-of-the-art approaches in
anomaly/interestingness detection, classification, and clustering with empirical tests on time
series/DNA/text/XML/video datasets. As further evidence of the advantages of our method,
we will demonstrate its effectiveness to solve a real world classification problem in recommending
printing services and products.
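The “dozen lines of code” the abstract alludes to look roughly like the following sketch (my own illustration, not the authors' code). It scores each candidate sequence by its normalized compression distance to a reference corpus of normal behaviour; zlib is just one convenient off-the-shelf compressor, and the byte-string inputs are an assumption about how the data have been serialized.

import zlib

def C(b: bytes) -> int:
    # Compressed length in bytes: a computable proxy for complexity.
    return len(zlib.compress(b, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance between two byte strings.
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def anomaly_scores(reference: bytes, candidates):
    # Higher NCD to the reference means less typical, hence more anomalous.
    return sorted(((ncd(reference, c), c) for c in candidates), reverse=True)

The same distance, fed into any standard clustering or nearest-neighbour routine, yields the classification and clustering experiments described in the abstract.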
Biology
17. Rudi L Cilibrasi and Paul MB Vitányi. Fast whole-genome phylogeny of the
COVID-19 virus SARS-CoV-2 by compression. bioRxiv, pages 2020–07, 2020.
Abstract: We analyze the phylogeny and taxonomy of the SARS-CoV-2 virus using compression.
This is a new alignment-free method called the “normalized compression distance” (NCD) method.
It discovers all effective similarities based on Kolmogorov complexity. The latter being incomputable,
we approximate it by a good compressor such as the modern zpaq. The results show that the
SARS-CoV-2 virus is closest to the RaTG13 virus and similar to two bat SARS-like coronaviruses
bat-SL-CoVZXC21 and bat-SL-CoVZC4. The similarity is quantified and compared with the same
quantified similarities among the mtDNA of certain species. We treat the question whether Pangolins
are involved in the SARS-CoV-2 virus.
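A minimal sketch of the compression-phylogeny workflow, under the assumption that the genome sequences have already been loaded as byte strings: the paper uses the zpaq compressor and a quartet-tree heuristic, whereas this illustration substitutes zlib and plain average-linkage clustering from SciPy, so it should be read as an analogy rather than a reproduction of the method.

import zlib
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance (zlib standing in for zpaq).
    cx, cy, cxy = (len(zlib.compress(s, 9)) for s in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def compression_tree(seqs, labels):
    # Pairwise NCD matrix, then hierarchical clustering as a crude
    # stand-in for the quartet-tree method of the paper.
    n = len(seqs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = ncd(seqs[i], seqs[j])
    Z = linkage(squareform(D, checks=False), method="average")
    return dendrogram(Z, labels=labels, no_plot=True)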
Neurosciences
18. J Szczepański, José M Amigó, E Wajnryb, and MV Sanchez-Vives. Application
of Lempel–Ziv complexity to the analysis of neural discharges. Network: Computation
in Neural Systems, 14(2):335, 2003.
Abstract: Pattern matching is a simple method for studying the properties of information sources
based on individual sequences (Wyner et al 1998 IEEE Trans. Inf. Theory 44 2045–56). In partic-
ular, the normalized Lempel–Ziv complexity (Lempel and Ziv 1976 IEEE Trans. Inf. Theory 22
75–88), which measures the rate of generation of new patterns along a sequence, is closely related
to such important source properties as entropy and information compression ratio. We make use of
this concept to characterize the responses of neurons of the primary visual cortex to different kinds
of stimulus, including visual stimulation (sinusoidal drifting gratings) and intracellular current in-
jections (sinusoidal and random currents), under two conditions (in vivo and in vitro preparations).
Specifically, we digitize the neuronal discharges with several encoding techniques and employ the
complexity curves of the resulting discrete signals as fingerprints of the stimuli ensembles. Our
results show, for example, that if the neural discharges are encoded with a particular one-parameter
method (‘interspike time coding’), the normalized complexity remains constant within some classes
of stimuli for a wide range of the parameter. Such constant values of the normalized complexity
then allow the differentiation of the stimuli classes. With other encodings (e.g. ‘bin coding’), the
whole complexity curve is needed to achieve this goal. In any case, it turns out that the normalized
complexity of the neural discharges in vivo is higher (and hence carries more information in the
sense of Shannon) than in vitro for the same kind of stimulus.
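For concreteness, the (unnormalized) Lempel–Ziv 1976 complexity of a binary string is usually computed with the Kaspar–Schuster counting scheme sketched below; this is my rendering of the textbook algorithm, not the authors' code, and the normalization follows the usual convention for binary alphabets.

import math

def lz76(s: str) -> int:
    # Number of phrases in the exhaustive Lempel-Ziv (1976) parsing of s,
    # counted with the Kaspar-Schuster scheme.
    n = len(s)
    if n < 2:
        return n
    i, k, l = 0, 1, 1
    c, k_max = 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            k_max = max(k, k_max)
            i += 1
            if i == l:                 # all start positions before l have been tried
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

def normalized_lz76(s: str) -> float:
    # Normalize by the asymptotic maximum, roughly n / log2(n) for binary strings.
    return lz76(s) * math.log2(len(s)) / len(s)

Applied to a digitized spike train, the count tracks how many genuinely new patterns appear as the sequence unfolds.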
Linguistics
19. Henry Brighton and Simon Kirby. The survival of the smallest: Stability condi-
tions for the cultural evolution of compositional language. In Advances in Artificial
Life: 6th European Conference, ECAL 2001 Prague, Czech Republic, September
10–14, 2001 Proceedings 6, pages 592–601. Springer, 2001.
Abstract: Recent work in the field of computational evolutionary linguistics suggests that the dynam-
ics arising from the cultural evolution of language can explain the emergence of syntactic structure.
We build on this work by introducing a model of language acquisition based on the Minimum De-
scription Length Principle. Our experiments show that compositional syntax is most likely to occur
under two conditions specific to hominids: (i) A complex meaning space structure, and (ii) the
poverty of the stimulus.
Sociology
20. Carter T Butts. The complexity of social networks: theoretical and empirical
findings. Social Networks, 23(1):31–72, 2001.
Abstract: A great deal of work in recent years has been devoted to the topic of “complexity”, its
measurement, and its implications. Here, the notion of algorithmic complexity is applied to the
analysis of social networks. Structural features of theoretical importance — such as structural
equivalence classes — are shown to be strongly related to the algorithmic complexity of graphs, and
these results are explored using analytical and simulation methods. Analysis of the complexity of a
variety of empirically derived networks suggests that many social networks are nearly as complex as
their source entropy, and thus that their structure is roughly in line with the conditional uniform
graph distribution hypothesis. Implications of these findings for network theory and methodology
are also discussed.
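As a toy version of the graph-complexity estimates discussed in this entry (not the estimator used in the paper), one can upper-bound the algorithmic complexity of a labelled graph by compressing its adjacency structure; structurally redundant graphs, such as those with large structural equivalence classes, compress well and therefore score low.

import zlib
import numpy as np

def graph_complexity_bound(adj: np.ndarray) -> int:
    # Crude upper bound (in bits) on the algorithmic complexity of a labelled
    # undirected graph: compressed length of its packed upper-triangular bits.
    n = adj.shape[0]
    bits = adj[np.triu_indices(n, k=1)].astype(np.uint8)
    return 8 * len(zlib.compress(np.packbits(bits).tobytes(), 9))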
Ecology
21. Vasilis Dakos and Fernando Soler-Toscano. Measuring complexity to infer
changes in the dynamics of ecological systems under stress. Ecological Complexity,
32:144–155, 2017.
Abstract: Despite advances in our mechanistic understanding of ecological processes, the inherent
complexity of real-world ecosystems still limits our ability in predicting ecological dynamics espe-
cially in the face of ongoing environmental stress. Developing a model is frequently challenged by
structure uncertainty, unknown parameters, and limited data for exploring out-of-sample predic-
tions. One way to address this challenge is to look for patterns in the data themselves in order to
infer the underlying processes of an ecological system rather than to build system-specific models.
For example, it has been recently suggested that statistical changes in ecological dynamics can be
used to infer changes in the stability of ecosystems as they approach tipping points. For computer
scientists such inference is similar to the notion of a Turing machine: a computational device that
could execute a program (the process) to produce the observed data (the pattern). Here, we make
use of such basic computational ideas introduced by Alan Turing to recognize changing patterns in
ecological dynamics in ecosystems under stress. To do this, we use the concept of Kolmogorov algo-
rithmic complexity that is a measure of randomness. In particular, we estimate an approximation
to Kolmogorov complexity based on the Block Decomposition Method (BDM). We apply BDM to
identify changes in complexity in simulated time-series and spatial datasets from ecosystems that
experience different types of ecological transitions. We find that in all cases, the K_BDM complexity
decreased before all ecological transitions both in time-series and spatial datasets. These trends
indicate that loss of stability in the ecological models we explored is characterized by loss of com-
plexity and the emergence of a regular and computable underlying structure. Our results suggest
that Kolmogorov complexity may serve as tool for revealing changes in the dynamics of ecosystems
close to ecological transitions.
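BDM itself relies on precomputed output frequencies of small Turing machines, which is outside the scope of a short sketch; purely as a loosely analogous illustration, the following replaces BDM by an ordinary compressor and tracks how the complexity of a coarsely binarized time series evolves in a sliding window, the kind of downward trend before a transition that the abstract describes.

import zlib
import numpy as np

def window_complexity(series, window: int = 200):
    # Compressed length of each sliding window after binarizing the signal
    # around its window median. A crude stand-in for the BDM estimate.
    x = np.asarray(series, dtype=float)
    scores = []
    for start in range(len(x) - window + 1):
        w = x[start:start + window]
        bits = (w > np.median(w)).astype(np.uint8)
        scores.append(len(zlib.compress(np.packbits(bits).tobytes(), 9)))
    return scores  # a sustained decrease suggests loss of dynamical complexity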
Ethology
22. Boris Ryabko and Zhanna Reznikova. The use of ideas of information theory
for studying “language” and intelligence in ants. Entropy, 11(4):836–853, 2009.
Abstract: In this review we integrate results of long term experimental study on ant “language” and
intelligence which were fully based on fundamental ideas of Information Theory, such as the Shannon
entropy, Kolmogorov complexity, and Shannon’s equation connecting the length of a message
(l) and its frequency (p), i.e., l = − log p for rational communication systems. This approach enabled
us to obtain the following important results on ants’ communication and intelligence: (i) to reveal
“distant homing” in ants, that is, their ability to transfer information about remote events; (ii) to
estimate the rate of information transmission; (iii) to reveal that ants are able to grasp regularities
and to use them for “compression” of information; (iv) to reveal that ants are able to transfer to
each other the information about the number of objects; (v) to discover that ants can add and
subtract small numbers. The obtained results show that information theory is not only an excellent
mathematical theory, but that many of its results may be considered laws of Nature.
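As a one-line illustration of the equation quoted above: in an optimal code for a rational communication system, a message used with frequency p = 1/8 should occupy about
\[
l \;=\; -\log_2 p \;=\; -\log_2 \tfrac{1}{8} \;=\; 3 \text{ bits},
\]
so frequent messages receive short codes and rare messages long ones, which is the kind of regularity the ant experiments look for.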
References
[1] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and its
Applications. Springer, New York, 4th edition, 2019.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-
Interscience, 2nd edition, 2006.
[5] Gregory J. Chaitin. Meta Maths! The Quest for Omega. Vintage Books, 2006.
[6] Charles H. Bennett and Martin Gardner. The random number Omega bids fair
to hold the mysteries of the universe. Scientific American, 241(5):20–34, 1979.
[9] David Deutsch. The Fabric of Reality: The Science of Parallel Universes—and
Its Implications. Allen Lane The Penguin Press, London, 1997.
[10] David Deutsch. Quantum theory, the Church–Turing principle and the univer-
sal quantum computer. Proceedings of the Royal Society A, 400(1818):97–117,
1985.
[11] Yuri V. Matiyasevich. Hilbert’s Tenth Problem. MIT Press, Cambridge, MA,
1993. Translated from the Russian by M. B. Kan.
[13] David Deutsch. The Beginning of Infinity: Explanations that Transform the
World. Penguin Books, London, 2011.