
Lecture Notes on

Algorithmic Information Theory


arXiv:2504.18568v1 [cs.IT] 22 Apr 2025

Charles Alexandre Bédard

École de technologie supérieure


charles.alexandre.bedard@etsmtl.ca

April 2025
Contents

1 Introduction
  1.1 Hilbert's Theory of Everything
  1.2 Berry's Paradox

2 Strings and Codes
  2.1 Strings
  2.2 Codes
  2.3 Kraft's Theorem
  2.4 Prefix codes of {0, 1}∗

3 Shannon Information Theory
  3.1 Optimal Codes
  3.2 Properties of Entropy
  3.3 Mutual Information

4 Computability Theory
  4.1 Turing's Model of Computation
  4.2 Mathematizing the mathematician
  4.3 Partial Functions
  4.4 Effective Enumeration of Turing Machines
  4.5 The Universal Turing Machine
  4.6 Church–Turing–Deutsch
  4.7 Computing, Deciding, Enumerating, Recognizing
  4.8 The Halting Problem
  4.9 Uncomputability =⇒ Incompleteness
  4.10 Rice's Theorem

5 Algorithmic Complexity
  5.1 Motivation
  5.2 Plain Complexity
  5.3 The Invariance Theorem
  5.4 Basic Properties
  5.5 Chain Rule for Plain Complexity
  5.6 Incompressibility
  5.7 The Problems of Plain Complexity

6 Prefix Complexity
  6.1 Self-delimiting Turing Machines
  6.2 Solving the Problems of Plain Complexity
  6.3 Information Equivalence between x∗ and (x, K(x))
  6.4 The Uncomputability of K(x)

7 The Halting Probability Ω
  7.1 Mathematical Truths in Ω
  7.2 Chaitin's Incompleteness Theorem

8 The Coding Theorem
  8.1 Lower Semi-computable Semi-measures

A Problem Sheets
  A.1 Problem Sheet #1
  A.2 Problem Sheet #2
  A.3 Problem Sheet #3
  A.4 Problem Sheet #4

B Proposed Articles for the Seminar
Preface
These lecture notes on Algorithmic Information Theory were constructed in con-
junction with the graduate course I taught at Università della Svizzera italiana in
the spring of 2023.
The course is intended for graduate students and researchers seeking a self-
contained journey from the foundations of computability theory to prefix complex-
ity and the information-theoretic limits of formal systems. My exposition ignores
boundaries between computer science, mathematics, physics, and philosophy, which
I consider essential when explaining inherently multidisciplinary fields.
Recordings of the lectures are available online—click on the following link:
[Full lecture series]. Links to individual lectures are also provided throughout the
notes.
The course is based on many books, most notably the detailed textbook by Li and
Vitányi [1], which readers should also consult for extensive references to the primary
scientific literature. I enjoyed the explanatory approach of Cover and Thomas [2],
and I gained insights from Hutter [3] and Claude [4]. Chaitin’s Meta Math! [5], both
profound and accessible, also inspired the content of the course.
As his Ph.D. student, I was introduced by Gilles Brassard to the marvels of
theoretical computer science through many open discussions, for which I am deeply
grateful. My interest in algorithmic information theory was then sparked by reading
Bennett [6] and Chaitin [7].
I am especially grateful to Stefan Wolf, who encouraged me to design and teach
this course, and with whom I had many valuable conversations—both on this topic
and beyond. Finally, I am immensely grateful to the students who attended the
lectures. Their enthusiasm and curiosity were potent motivators and led to countless
stimulating discussions.

1 Introduction
[Lecture AIT-1]
Algorithmic information theory (AIT) is the meeting between Turing’s theory of
computation and Shannon’s theory of information.
AIT roots the concept of information in computation rather than probability. Its
central mathematical construct, algorithmic complexity, is a universal quantification
of information content in individual digital objects.
– Quantification, not qualification.
– Digital objects are those describable by strings of symbols.
– Individual, i.e. we take the object as is; we do not posit a set or a distribution
to which the object is assumed to belong.
– Universal, because any universal computer gives roughly the same measure of
algorithmic complexity.
Algorithmic complexity was introduced independently by Solomonoff (1964),
Kolmogorov (1965) and Chaitin (1966). Their work was broadly motivated—
respectively—by induction (Solomonoff), information (Kolmogorov) and random-
ness (Chaitin). AIT is still being developed. See Fig. 1 for a contrast with Shannon’s
information theory.

[Figure 1: Contrasting Shannon's information theory with AIT. The diagram relates Shannon's IT (entropy, built on probability theory) to AIT (complexity, built on computability theory), together with philosophy (Occam's razor) and applications such as statistics and non-probabilistic statistics, thermodynamics and non-probabilistic thermodynamics, inductive methods, logical depth, incompleteness, the statistics of mathematics, and the foundations of mathematics.]

1.1 Hilbert’s Theory of Everything
Hilbert expressed the belief that if mathematics is to provide absolute and complete
certainty, then there ought to be a formal axiomatic theory for all of mathematics.
There should be a finite set of axioms from which all mathematical truths can be
proven by mechanically applying the rules of logic. In practice, today, the closest we
have to such a theory is the Zermelo-Fraenkel set theory with the axiom of choice
(ZFC), together with first-order logic.
Hilbert did not suggest that mathematics should only be done formally, but that
it could, so matters of opinion and ambiguity can be eliminated from mathematics.
In 1931, Gödel put an end to Hilbert's program. He showed that in every logically
consistent formal axiomatic system that is strong enough to encapsulate the
arithmetic of whole numbers, there are true but unprovable statements.

1.2 Berry’s Paradox


All natural numbers can be described unambiguously by (at least) a statement in
English. For instance, 52 is described by “fifty-two”, 10000000000 is described by “ten
to the ten” and 637291037 is described by “the number’s digits are six, three, seven,
two, nine, one, zero, three, seven.” The description “fifty-two” was coined using 3
syllables, “ten to the ten”, 4 syllables, and the last one, 18 syllables. Consider the
set S of all numbers that can be described by 20 syllables or less. There is a finite
number s of syllables in English, and therefore there are at most s^{21} elements in
S. Therefore, there are infinitely many numbers in N\S and, in particular, N\S has a
smallest element; let us denote it k_{20}. In other words, k_{20} is

“The smallest number that cannot be described by twenty syllables or less.”

It has just been described by 19 syllables, so k_{20} ∉ S and k_{20} ∈ S. A contradiction.


The formalization of algorithmic complexity not only solves the paradox but
turns it into a proof of the incompleteness of mathematics—Chaitin's.

2 Strings and Codes
[Lecture AIT-2]

2.1 Strings
A set A = {a_i}_{i∈I} of symbols is an alphabet. It can be finite, in which case
I = {1, 2, . . . , N}, or infinite, in which case I = N.
Symbols can be concatenated into strings (also called words). The set of all finite
strings over an alphabet A is denoted A∗ .
The binary alphabet is {0, 1} and

{0, 1}∗ = {ε, 0, 1, 00, 01, 10, 11, 000, . . .} ,

where ε is the empty string. The above set is displayed in the lexicographical ordering.
The following pairing defines a bijection between {0, 1}∗ and N:

    {0, 1}∗ ↔ N
    ε   ↔ 1
    0   ↔ 2
    1   ↔ 3
    00  ↔ 4
    01  ↔ 5
    10  ↔ 6
    11  ↔ 7
    000 ↔ 8
    ...            (1)

The relationship can be computed by observing that 1x is the binary representation
of the number associated with the string x. This identification is often implicit,
so a string may be treated as a number and vice versa; the context determines which
is meant.
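For concreteness, here is a minimal Python sketch of this correspondence (the function names are mine, not the notes'):

    def string_to_number(x: str) -> int:
        # Prepend a 1 and read the result as a binary numeral:
        # "" -> 1, "0" -> 2, "1" -> 3, "00" -> 4, ...
        return int("1" + x, 2)

    def number_to_string(n: int) -> str:
        # Inverse map: write n in binary and drop the leading 1.
        return bin(n)[2:][1:]

    assert [string_to_number(s) for s in ["", "0", "1", "00", "01"]] == [1, 2, 3, 4, 5]
    assert number_to_string(8) == "000"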
The length of a string x is denoted |x|. The ceiling and floor of a real number α
are respectively denoted ⌈α⌉ and ⌊α⌋.

Proposition 1. ∀x ∈ {0, 1}∗, |x| = ⌊log x⌋.

Proof. Observe first that on the left-hand side, x is a string because we can't take
the length of a number, whereas on the right-hand side, it is a number because we
can't take the log of a string. Since the binary representation of the number x is 1x,
we have that x = 1 · 2^{|x|} + r with r < 2^{|x|}. Thus ⌊log x⌋ = |x|.
A string y is a prefix of a string x if x = yz for some z ∈ A∗. A set S ⊂ A∗ is
prefix-free if for all x, y ∈ S such that x ≠ y, x is not a prefix of y.

Example 1. Consider the alphabet of digits D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. The
set of strings {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, . . .} is not prefix-free. Yet with the alphabet
D ∪ {∗}, the set {1∗, 2∗, 3∗, 4∗, 5∗, 6∗, 7∗, 8∗, 9∗, 10∗, . . .} is prefix-free. Similarly,
{10, 110, 1110, 11110, . . .} is a prefix-free set of bit strings.
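A quick way to test prefix-freeness in Python (a sketch of mine, exploiting the fact that after sorting, a codeword that is a prefix of another is immediately followed by one of its extensions):

    def is_prefix_free(words):
        words = sorted(words)
        return not any(b.startswith(a) for a, b in zip(words, words[1:]) if a != b)

    assert not is_prefix_free(["1", "2", "10"])            # "1" is a prefix of "10"
    assert is_prefix_free(["10", "110", "1110", "11110"])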

2.2 Codes
Definition 1. A code is a mapping

E : A → B∗

where A is the source alphabet and B∗ is the set of codewords.


For our purposes, we will want binary codewords. So let us set B∗ ≡ {0, 1}∗.
Definition 2. The code E is non-singular if it is injective.
This permits unambiguous decoding since an injective encoding function E has
a well-defined decoding function D = E −1 .
In many situations, we want to encode sequences of source symbols, which result
in a stream of concatenated codewords (think of a communication channel).
Definition 3. An extension of a code E defines a mapping

E ∗ : A∗ → B ∗

through the requirement that E ∗ (s1 s2 . . . sn ) = E(s1 )E(s2 ) . . . E(sn ).


Definition 4. A code E is uniquely decodable if its extension is non-singular.
This means that if one receives E(s_1)E(s_2) . . . E(s_n), one can unambiguously
decode it as s_1 s_2 . . . s_n.
Definition 5. A code E is a prefix code if E(A) is prefix-free, i.e. no codeword is
the prefix of another codeword.
Example 2. Let A = {A, B, C, D} be source symbols to be encoded into bits.
Consider the four codes defined in the table below.
A B C D
E1 10 10 11 0
E2 10 110 1 0
E3 0 01 011 111
E4 0 10 110 111

– E1 is singular. The codeword 10 cannot be decoded.

[Figure 2: Classes of codes with the codes from Example 2. The classes are nested: prefix codes ⊂ uniquely decodable codes ⊂ non-singular codes ⊂ all codes, containing E4, E3, E2 and E1 respectively.]

[Figure 3: The binary tree representing the codewords of E4: the leaves 0, 10, 110 and 111 are reached by branching on successive bits.]

– E2 is not uniquely decodable. How does one decode the stream of code-
words 110? Is it B, CA, or CCD?
– E3 is uniquely decodable but not a prefix code.
– E4 is a prefix code.
In a stream of codewords, prefix codes have the advantage of being instanta-
neously decodable, i.e., a codeword can be decoded when its last bit is read. For this
reason, prefix codes are also called instantaneous codes.
Suppose that the sequence of codewords to decode is 0111010110.
With the encoding E4 , it can be instantaneously decoded into ADABC, by reading
from left to right. With the code E3 , it can be decoded, but one needs to read
further into the sequence. With E2 or E1 , it cannot be decoded. See Fig. 2.
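As an illustration, here is a minimal Python sketch (helper names are mine) of instantaneous decoding with E4:

    E4 = {"A": "0", "B": "10", "C": "110", "D": "111"}
    DECODE = {w: s for s, w in E4.items()}   # codeword -> source symbol

    def decode_prefix(stream: str) -> str:
        # Because no codeword is a prefix of another, a symbol can be emitted
        # as soon as the bits read so far match a codeword.
        out, buf = [], ""
        for bit in stream:
            buf += bit
            if buf in DECODE:
                out.append(DECODE[buf])
                buf = ""
        if buf:
            raise ValueError("stream ends in the middle of a codeword")
        return "".join(out)

    assert decode_prefix("0111010110") == "ADABC"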
Prefix sets ⊂ {0, 1}∗ can be represented by trees. Fig. 3 represents the tree of
the codewords of E4 .
Observation. The lengths of the codewords of E4 are 1, 2, 3, 3, and

    2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 ≤ 1 .

2.3 Kraft’s Theorem


Theorem 1 (Kraft). Let l_1, l_2, . . . be a finite or infinite sequence of natural numbers.
These numbers are the lengths of the codewords of a prefix code if and only if

    \sum_n 2^{-l_n} ≤ 1 .

This relation is called Kraft's inequality.
[Lecture AIT-3]
Proof. ⇒. Let P be a set of codewords of a binary prefix code. To each string
x ∈ {0, 1}∗ corresponds a specific kind of interval ⊂ [0, 1), called a cylinder:

    x ↦ Γ_x = [0.x, 0.x + 2^{-|x|}) .

Above, 0.x denotes the real number in [0, 1) whose binary expansion begins with x.
Γ_x can be understood as the set of all real numbers whose first bits in their binary
expansion are those of x. By construction, for any two binary strings, either
one cylinder is contained in the other, or they are disjoint. The first case happens
only if one string is a prefix of the other. The cylinders pertaining to a prefix-free
set of codewords, like P, are therefore disjoint. Thus

    \sum_{x∈P} 2^{-|x|} = \sum_{x∈P} length(Γ_x) ≤ length([0, 1)) = 1 .

⇐. Without loss of generality, let l_1, l_2, . . . be non-decreasing. Define

    α_j = \sum_{n=1}^{j-1} 2^{-l_n}   and   Γ_{x_j} = [α_j, α_j + 2^{-l_j}) .

The set of codewords {x_j}, where 0.x_j = α_j and |x_j| = l_j, forms the
codewords of the prefix code.
This assignment of codewords is better viewed on the binary tree. The code-
word x1 is assigned to the lexicographically first node at depth l1 . When a node
is assigned, its descendant branches are cut out. Then x2 is assigned to the next
available node at depth l2 , and so on.
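A small Python sketch (the function name is mine) of this converse construction, assuming the lengths are given in non-decreasing order:

    from fractions import Fraction

    def kraft_codewords(lengths):
        # alpha_j = sum_{n<j} 2^{-l_n}; the codeword x_j is alpha_j written
        # with exactly l_j bits after the binary point.
        alpha = Fraction(0)
        words = []
        for l in lengths:
            numer = alpha * 2**l
            assert numer.denominator == 1, "lengths must be non-decreasing"
            words.append(format(int(numer), f"0{l}b"))
            alpha += Fraction(1, 2**l)
        assert alpha <= 1, "Kraft's inequality violated"
        return words

    print(kraft_codewords([1, 2, 3, 3]))   # ['0', '10', '110', '111']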
A code may have code-word lengths that fulfill Kraft's inequality without being
a prefix code. The code E3, in Example 2, for instance. Yet, Kraft's theorem states
that such a code can be transformed into a prefix code while keeping the lengths
of the codewords fixed. In the case of E3, we reverse the bits of the codewords and
obtain E4.

Theorem 2 (McMillan). The code-word lengths of any uniquely decodable code sat-
isfy Kraft’s inequality.

Corollary 1. The class of uniquely decodable codes, although bigger than that of prefix
codes, does not offer more admissible sets of code-word lengths. Since nothing is
gained in terms of code-word lengths by looking at this larger class, we can
safely restrict our attention to prefix codes.

2.4 Prefix codes of {0, 1}∗
Let us consider prefix codes of the alphabet A = {0, 1}∗.

E : {0, 1}∗ → {0, 1}∗


x 7→ E(x) .

The unary encoding with an end marker, 1, gives a prefix code:

    E^{(0)}(x) = 0^x 1 ,   i.e., x zeros followed by a 1 (implicit identification between strings and N).

We can do better. Consider

    E^{(1)}(x) = E^{(0)}(|x|) x = 0^{|x|} 1 x ≡ x̄   and   E^{(2)}(x) = E^{(1)}(|x|) x = 0^{||x||} 1 |x| x ,

where ||x|| denotes the length of the length of x. The encoding E^{(1)}(x) will often be
used. It is called the self-delimiting encoding of x, and we denote it x̄.
The lengths of these codewords are

    |E^{(1)}(x)| = 2|x| + 1 = 2⌊log x⌋ + 1   and   |E^{(2)}(x)| = 2⌊log⌊log x⌋⌋ + ⌊log x⌋ + 1 .

It can be verified that these lengths satisfy Kraft's inequality.
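A minimal Python sketch of E^{(1)} (my function names), showing how the unary length prefix makes the code self-delimiting in a stream:

    def selfdelim_encode(x: str) -> str:
        # x-bar = 0^{|x|} 1 x
        return "0" * len(x) + "1" + x

    def selfdelim_decode(stream: str):
        # Returns (x, rest of the stream).
        n = stream.index("1")               # the number of leading zeros is |x|
        return stream[n + 1 : 2 * n + 1], stream[2 * n + 1 :]

    assert selfdelim_encode("101") == "0001101"
    assert selfdelim_decode("0001101" + "10") == ("101", "10")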


A prefix code E : {0, 1}∗ → {0, 1}∗ can be used to encode tuples of numbers into
numbers:

    N^k → N
    (x_1, . . . , x_k) ↦ E(x_1) . . . E(x_k) .

3 Shannon Information Theory
[Lecture AIT-4]
Example 3. Eight horses race. They have probabilities of winning given by

(1/2 , 1/4 , 1/8 , 1/16 , 1/64 , 1/64 , 1/64 , 1/64 ) .

At the end of a large number of races, we want to communicate the sequence of
outcomes over a binary (noiseless) channel. How can we minimize the average
number of bits to transmit? What is the expected number of bits to transmit per race?
Because the encodings will be concatenated, we need a uniquely decodable code,
and without loss of generality, we shall take a prefix code.
The general idea is to save short codewords for probable symbols and use long
codewords for rare symbols.
The proof of Kraft's theorem gives us a method. The probabilities are

    (2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-6}, 2^{-6}, 2^{-6}, 2^{-6}) .

And because they are probabilities, they sum to 1. Thus the numbers

    (1, 2, 3, 4, 6, 6, 6, 6)

fulfil Kraft's inequality, and so there exists a prefix code with codewords of these
lengths (for instance, {0, 10, 110, 1110, 111100, 111101, 111110, 111111}). Since the
lengths l_x are related to the probabilities p(x) via p(x) = 2^{-l_x}, the expected number
of bits to transmit is

    \sum_{x=1}^{8} p(x) l_x = \sum_{x=1}^{8} p(x) \log \frac{1}{p(x)} .

This expression shall be recognized as the Shannon entropy H(X) of the random
variable X defined by the race. Here, H(X) = 2. As we shall see, the entropy of a
random variable is a lower bound on the expected length of a prefix code.
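A quick numerical check of this example in Python (a sketch, not part of the original notes):

    from math import log2

    probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    lengths = [1, 2, 3, 4, 6, 6, 6, 6]             # l_x = -log2 p(x)

    kraft_sum = sum(2**-l for l in lengths)         # must be <= 1 for a prefix code
    entropy = sum(p * log2(1 / p) for p in probs)   # H(X)
    expected_len = sum(p * l for p, l in zip(probs, lengths))

    print(kraft_sum, entropy, expected_len)         # 1.0 2.0 2.0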

3.1 Optimal Codes


Let X be a random variable over an alphabet X. What is the prefix code of minimal
expected length? In fact, what we care about is not the codewords themselves, but
their lengths, i.e., the set {l_x}_{x∈X}. This is because only the lengths determine the
expected number of bits sent over the channel. Thus we want to minimize

    L = \sum_x p(x) l_x ,

with the constraint that \sum_x 2^{-l_x} ≤ 1 (because we want a prefix code). Assuming
an equality in the constraint and omitting that l_x ∈ N, it boils down to a standard
minimization problem:

    J = \sum_x p(x) l_x + λ \left( \sum_x 2^{-l_x} − 1 \right) .

    \frac{∂J}{∂l_{x′}} = p(x′) − λ \ln 2 · 2^{-l_{x′}} = 0   ⟺   p(x′) = λ \ln 2 · 2^{-l_{x′}} .

Summing over x′ sets λ ln 2 = 1, and we find that the minimal expected length is
achieved with codewords of length

    l_x^* = − \log p(x) = \log \frac{1}{p(x)} ,

and the expected length is

    L = \sum_{x∈X} p(x) \log \frac{1}{p(x)} \overset{def}{=} H(X) .

Definition 6. The entropy of a discrete random variable X is defined as

    H(X) \overset{def}{=} \sum_{x∈X} p(x) \log \frac{1}{p(x)} .

The quantity log(1/p(x)), which grows as p(x) gets smaller, is sometimes called the
surprise or the self-information of the event {X = x}. The entropy is thus the
expectation of the surprise.
The operational meaning is clear from our derivation: it is the minimum expected
number of bits to communicate the outcome of the random variable in a prefix code
(or in a uniquely decodable code, thanks to McMillan’s theorem).

3.2 Properties of Entropy


[Lecture AIT-5]
1. H(X) ≥ 0;
2. H(X) = 0 iff X is deterministic;
3. If |X| is kept fixed, H(X) is maximized by the equiprobable distribution, in
which case H(X) = log |X|;
4. For equiprobable distributions, the larger |X |, the larger H(X).

Points 3 and 4 show that the more uncertain a random variable is, the larger its
entropy.
Definition 7. The joint entropy of a pair of discrete random variables (X, Y) is
defined as

    H(X, Y) = \sum_{x∈X} \sum_{y∈Y} p(x, y) \log \frac{1}{p(x, y)} .

If the random variable X takes value x̂ ∈ X, this defines p_{Y|X=x̂}(y). Given
each individual outcome x̂ ∈ X of X, there is residual entropy H(Y|X = x̂) in the
distribution p_{Y|X=x̂}(y). The conditional entropy H(Y|X) is the expectation (over
X) of that residual entropy.

Definition 8. The conditional entropy is

    H(Y|X) = \sum_{x∈X} p(x) H(Y|X = x)
           = \sum_{x∈X} p(x) \sum_{y∈Y} −p(y|x) \log p(y|x) .

H(Y |X) = 0 iff the value of Y is completely determined by the value of X.


Theorem 3 (Chain rule).

H(X, Y ) = H(X) + H(Y |X) .

Proof. In Problem Sheet #1.


All notions for Problem Sheet #1 have been covered. See §A.1.
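As a sanity check on these definitions, here is a small Python sketch (the toy joint distribution is arbitrary, chosen by me) that verifies the chain rule numerically:

    from math import log2

    # A toy joint distribution p(x, y) on X = {0, 1}, Y = {0, 1}.
    p = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

    H_XY = sum(q * log2(1 / q) for q in p.values())
    pX = {x: sum(q for (a, _), q in p.items() if a == x) for x in (0, 1)}
    H_X = sum(q * log2(1 / q) for q in pX.values())
    # H(Y|X) = sum_{x,y} p(x, y) log( p(x) / p(x, y) )
    H_Y_given_X = sum(q * log2(pX[x] / q) for (x, _), q in p.items())

    assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-9   # chain rule
    print(H_X, H_Y_given_X, H_XY)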

3.3 Mutual Information


Definition 9. The mutual information between X and Y is

    I(X; Y) \overset{def}{=} H(X) − H(X|Y) .

It is the reduction in the uncertainty of X due to the knowledge of Y .


See Fig. 4 for a representation of entropies, conditional entropies and mutual
information.
The chain rule ensures that mutual information is symmetric:

    I(X; Y) = H(X) − H(X|Y)
            = H(X) − (H(X, Y) − H(Y))      (chain rule)
            = H(Y) − (H(X, Y) − H(X))
            = H(Y) − H(Y|X)                (chain rule)
            = I(Y; X) .

13
H(X, Y )

H(X) H(X|Y ) I(X; Y ) H(Y |X) H(Y )

Figure 4: A Venn diagram representing the algebra of entropies, con-


ditional entropies, and mutual information. The entropy H(X) is repre-
sented by the whole circular area enclosed by the blue line, while H(X|Y )
is represented by the sub-region outside the red circle. The mutual infor-
mation I(X; Y ) is represented by the overlapping region.

Information inequality. It can be shown that I(X; Y) ≥ 0, with equality iff X and Y
are independent. Thus H(X) ≥ H(X|Y): learning Y, on average, doesn't make
predicting X harder.
Definition 10. The random variables X, Y and Z form a Markov chain, denoted
X → Y → Z, if the joint distribution factorizes as

    p(x, y, z) = p(x) p(y|x) p(z|y) .

This means that p(z|y, x) = p(z|y), i.e., Z is conditionally independent of X, given Y.
Theorem 4 (Data processing inequality). If X → Y → Z, then
I(X; Y ) ≥ I(X; Z) .
Proof.
I(X; Y, Z) = H(X) − H(X|Y, Z)
= H(X) − H(X|Z) + H(X|Z) − H(X|Y, Z)
= I(X; Z) + I(X; Y |Z) .
and similarly,
I(X; Y, Z) = H(X) − H(X|Y, Z)
= H(X) − H(X|Y ) + H(X|Y ) − H(X|Y, Z)
= I(X; Y ) + I(X; Z|Y ) .

Since Z is conditionally independent of X given Y , I(X; Z|Y ) = 0 and since
I(X; Y |Z) ≥ 0, we find I(X; Z) ≤ I(X; Y ).

Corollary 2. In particular, if Z = g(Y ), we have X → Y → g(Y ), and so


I(X; Y ) ≥ I(X; g(Y )).

It is impossible to increase the mutual information between two random variables
by processing the outcomes in some deterministic manner—nor, in fact, in a
probabilistic manner.

4 Computability Theory
[Lecture AIT-6]
Algorithms have existed for millennia — for instance, the Babylonians employed
them as early as 2500 B.C. Informally, they describe a procedure that can be per-
formed to compute a desired outcome. In 1936, Church and Turing independently
formalized computability (and with it, algorithms).

4.1 Turing’s Model of Computation


Turing exhibited a simple type of hypothetical machine and argued that everything
that can be reasonably said to be computed by a human computer using a fixed
procedure can be computed by such a machine.
A Turing machine, like the one displayed in Fig. 5, consists of
– An infinite tape made of cells, which can contain a 0, a 1, or a blank;
– A head, capable of scanning one cell;
– A finite control, which can be in any one of a finite set of states Q, and embody
a list of rules;
– In the background there are discrete times 0, 1, 2, . . .

[Figure 5: A generic configuration of a Turing machine: an infinite tape of 0s, 1s and blanks, a head scanning one cell, and a finite control holding the current state and the list of rules, e.g. (q0, 0, R, q3), (q0, 1, 0, q0), (q1, 0, L, q2), . . .]

A computation starts at time 0 with the head on a distinguished start cell, and
the finite control is in a distinguished start state q0 . Initially, all cells of the tape
are blank, except for a contiguous finite sequence extending from the start cell to
the right, which contains 0s and 1s. This binary sequence is called the input.
The Turing machine performs one operation per time step in which it does one
of the five following actions: writes a 0 in the cell that is being scanned, writes a 1,

leaves a blank, moves the head to the left (L) or moves the head to the right (R).
The action performed depends on the current state q and the scanned symbol. At
the end of each operation, the control takes a (possibly different) state q ′ ∈ Q.
For a given Turing machine, its set of operations is given by the finite list of
rules of the form (q, s, a, q ′ ), where q, q ′ ∈ Q, s ∈ {0, 1, B} and a ∈ {0, 1, B, L, R}
(B stands for “blank”). This denotes the state q and the scanned symbol s before
the operation, the action a taken, and the state q ′ after the operation.
The machine is deterministic: (q, s, a_1, q_1′) and (q, s, a_2, q_2′) cannot both be rules
if (a_1, q_1′) ≠ (a_2, q_2′). Moreover, not every possible combination of the first two
elements has to be in the list: if the machine enters a state q and reads a symbol s
for which no rule is defined, it halts. Hence, we can define a Turing machine by a
mapping from a finite subset of Q × {0, 1, B} into {0, 1, B, L, R} × Q.
A given Turing machine, on a given input, carries out a uniquely determined
succession of operations, which may or may not terminate in a finite number of
steps.
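To make the model concrete, here is a minimal Python sketch of such a machine (my own representation: a rule (q, s, a, q′) is stored as rules[(q, s)] = (a, q′), matching the mapping described above):

    def run_tm(rules, tape, max_steps=10_000):
        # rules: dict (state, scanned symbol) -> (action, next state),
        # symbols in {"0", "1", "B"}, actions in {"0", "1", "B", "L", "R"}.
        cells = {i: s for i, s in enumerate(tape)}   # sparse tape, blank elsewhere
        state, head = "q0", 0
        for _ in range(max_steps):
            symbol = cells.get(head, "B")
            if (state, symbol) not in rules:         # no rule defined: halt
                return cells, state
            action, state = rules[(state, symbol)]
            if action == "L":
                head -= 1
            elif action == "R":
                head += 1
            else:
                cells[head] = action                 # write 0, 1 or blank
        raise TimeoutError("no halt within max_steps")

    # A toy machine that flips the bits of its input and halts on the first blank.
    flip = {("q0", "0"): ("1", "q1"), ("q0", "1"): ("0", "q1"),
            ("q1", "0"): ("R", "q0"), ("q1", "1"): ("R", "q0")}
    print(run_tm(flip, "0110"))   # ({0: '1', 1: '0', 2: '0', 3: '1'}, 'q0')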

4.2 Mathematizing the mathematician


Is Turing’s 1936 paper [8] a piece of mathematics or a piece of physics? As is
manifest in his §9’s “direct appeal to intuition”, he explains that his model correctly
describes a specific physical system, namely, a mathematician. This kind of activity
belongs to physics.
He gives arguments (no proof) that his conjectured description is faithful. For
instance, Turing explains why he uses a one-dimensional tape made of squares:

“Computing is normally done by writing certain symbols on paper. We


may suppose this paper is divided into squares like a child’s arithmetic
book. In elementary arithmetic the two-dimensional character of the
paper is sometimes used. But such a use is always avoidable, and I think
that it will be agreed that the two-dimensional character of paper is no
essential of computation. I assume then that the computation is carried
out on one-dimensional paper, i.e. on a tape divided into squares.”

He explains why only a finite alphabet is admissible:

“I shall also suppose that the number of symbols which may be printed
is finite. If we were to allow an infinity of symbols, then there would be
symbols differing to an arbitrarily small extent.”

Further, Turing writes:

“The behaviour of the computer at any moment is determined by the


symbols which he is observing, and his ‘state of mind’ at that moment.”

The “computer” he is referring to cannot be what we call a computer, for it is in
that very paper that Turing is laying down the foundational work that gave us our
computers. So what is he referring to? A “computer” was how a person carrying
out a computation was called.
Turing continues, arguing that the range of squares that a computer can read is
limited:
“We may suppose that there is a bound B to the number of symbols or
squares which the computer can observe at one moment. If he wishes to
observe more, he must use successive observations.”
Yet it can be shown that a machine scanning one single symbol, like the ones we defined
in §4.1, can simulate the computer that is scanning B symbols at a time.
Turing then explains why there are only finitely many states (here “states of
mind”).
“We will also suppose that the number of states of mind which need
be taken into account is finite. The reasons for this are of the same
character as those which restrict the number of symbols. If we admitted
an infinity of states of mind, some of them will be “arbitrarily close” and
will be confused. Again, the restriction is not one which seriously affects
computation, since the use of more complicated states of mind can be
avoided by writing more symbols on the tape.”

4.3 Partial Functions


[Lecture AIT-7]
A function f : A → B is partial if its domain of definition D is not necessarily
equal to A: there may be some a ∈ A for which f (a) is undefined. When D = A, f
is total.
We can associate a partial function fT : {0, 1}∗ → {0, 1}∗ (or N → N) to each
Turing machine T . The input of T is understood as the input of fT . If on input
x, T halts, the string of 0s and 1s that is being scanned (and delimited by blanks)
defines fT (x). If on input x, T does not halt, fT (x) is undefined.
The function f : N → N is computable² if there exists a Turing machine T for
which f = f_T.
It is partial computable or total computable if f is respectively partial or total.
When the context is clear, we identify the function fT computed by T with T (so
we may write T (x) for fT (x)).
Although the set of functions N → N appears restrained, via suitable encodings
of other discrete sets into N, the computable functions may be viewed as computing
other kinds of functions of a countable set into another.
² Also called recursive; the adjective effective is also commonly used in this context.

Example 4. A function

    g : N × · · · × N → N
    (x_1, . . . , x_n) ↦ g(x_1, . . . , x_n)

can be viewed as being computed by a Turing machine expecting an input of the
form x̄_1 x̄_2 . . . x̄_n.

4.4 Effective Enumeration of Turing Machines


The list of rules of a Turing machine T can be encoded in a string. Fixing an
encoding e of Q ∪ {0, 1, B, L, R} into s = ⌈log(|Q| + 5)⌉ bits, each rule can be
encoded as e(q)e(s)e(a)e(q ′ ). Let r be the number of rules; r ≤ 3|Q|, because each
state of a Turing machine can have at most one rule defined per possible scanned
symbol. The list of rules of T can thus be encoded as

    E(T) = s̄ r̄ e(q_1)e(s_1)e(a_1)e(q_1′) e(q_2)e(s_2)e(a_2)e(q_2′) . . . e(q_r)e(s_r)e(a_r)e(q_r′) .

(Remember that s̄ = 1^{|s|} 0 s is the self-delimiting encoding of s.) The list of rules
encodes all that there is to know about T . The set of states is contained in the list
of rules and the initial state may be taken to be the state in the first rule.
All bit strings that correspond to a valid encoding of some Turing machine can
be listed lexicographically and, in turn, numbered. The index (or Gödel number )
of a Turing machine T is i if E(T ) is the ith element in the lexicographic order of
valid encodings of Turing machines. This yields a sequence of Turing machines T1 ,
T2 , . . . called an effective enumeration.
Remark 1. It is effective, in the sense that one can construct a Turing machine M
which, on input i, returns E(Ti ).

M: On input i,
    Initialize a counter S = 0.
    For each y ∈ {0, 1}∗ (in lexicographical order):
        Check whether y corresponds to E(T) for some T.
        If it does, add 1 to the counter S.
        If S = i, output y = E(T_i) and halt.

4.5 The Universal Turing Machine


There are infinitely many Turing machines. But one of them can simulate them all.
Definition 11. A universal Turing machine U is a machine that expects an input
encoding a pair (i, x) and simulates the machine T_i on input x. Therefore,

    U(⟨i, x⟩) = T_i(x)   ∀i ∀x .

[Lecture AIT-8]
Comment on the lecture: There is a brief discussion of Problem Sheet #1, §A.1.
Theorem 5. A universal Turing machine exists and can be constructed effectively.
Proof sketch. U invokes M so that, from i, it reconstructs the list of rules E(Ti ),
which it saves on a portion of its tape. To execute the consecutive actions that
Ti would perform (on input x) on its own tape, U uses Ti ’s rules to simulate Ti ’s
actions on a representation of Ti ’s tape contents.
In fact, there are infinitely many such U’s.

4.6 Church–Turing–Deutsch
Why do Turing machines matter so much? Why not some other model of computa-
tion? The answer lies in the robustness of the notion of computability itself—which
we now explore through the Church–Turing thesis, and its physical extension by
Deutsch.
It can be proved that the set of functions N → N that can be computed by
Turing machines is equal to the set of functions that can be computed by many
other systems of symbolic manipulations:
– λ-calculus
– general recursive functions
– Python, C, Lisp, Java, . . .
– TeX
– Conway's Game of Life
– Minecraft
– . . .
The Church–Turing Thesis

There is an objective notion of computability independent of a particular


formalization—and it captures what is computable by a human.

Turing machines are a (convenient) representative of this notion of computation.


Definition 12. A system of symbolic manipulation (such as a computer’s instruc-
tion set, a programming language, or a cellular automaton) is said to be Turing-
complete or computationally universal if it can be used to simulate any Turing
machine. It is sufficient to show that it can simulate the universal Turing machine.
Is computational universality a property of mathematics, metamathematics or
our physical universe? David Deutsch wrote [9, Chapter 5]:

Computers are physical objects, and computations are physical pro-


cesses. What computers can or cannot compute is determined by the
laws of physics alone, and not by pure mathematics.

The argument is short yet inescapable. Qua theoretical construct, Turing ma-
chines may appear to be an abstract piece of mathematics. But so do pseudo-
Riemannian manifolds, which are conjectured to describe spacetime in general rel-
ativity. Thus, whether Turing machines are the relevant theoretical construct de-
pends on whether they actually grasp the set of computations that can be carried
out physically.

Example 5 (The Zeno machine). Consider a regular Turing machine with an extra
functionality: a big step. In one single big step, the Zeno machine can perform
infinitely many operations of a regular Turing machine, as if it were doing one
operation in 1/2 time unit, another operation in 1/4 time unit, another in 1/8 time,
and so on. It can be proven that the Zeno machines compute more functions than the
Turing machines. Based on what do we reject Zeno machines as a valid formalization
of computation? Physics. Try and build that machine...

When computer science is incorporated into a physical worldview, the Church–


Turing thesis can be seen as a special case of a physical principle due to Deutsch.

The Church–Turing–Deutsch Principle

It is possible to build a universal computer: a machine that can simulate (to


arbitrary accuracy) any physical process.

Since computations are special kinds of physical processes, a universal computer


can perform any computation. Moreover, the sense in which the “objective notion
of computability” is objective in the Church–Turing thesis is clarified: it is objective
insofar as it pertains to the physical world. Humans are computationally universal.

Example 6. Suppose that in the year '25, a new theory of physics is discovered:
physical systems can undergo different processes simultaneously. Since computations
are physical processes, this functionality can be harnessed, perhaps to speed up
computations through parallelism. The year is 1925, the theory is quantum theory,
and the speed-up is that of quantum computation.

By understanding computation physically, Deutsch came up with the universal


quantum computer [10].
It should be noted that quantum computers can be simulated by Turing machines
with an exponential overhead in computation time. Since in this course we are
interested in what can or cannot be computed—and not in how long the computation
takes—we can safely restrict our attention to the universal Turing machine.

4.7 Computing, Deciding, Enumerating, Recognizing
[Lecture AIT-9]
Comment on the lecture: Some solutions to Problem Sheet #1 (§A.1) are presented.

The effective enumeration of Turing machines T1 , T2 , T3 , . . . provides an enumer-


ation of all partial computable functions ϕ1 , ϕ2 , ϕ3 , . . .. We shall denote
– ϕi (j) ց (and Ti (j) ց) if ϕi (j) is defined, i.e., if Ti halts on input j and
– ϕi (j) ր (and Ti (j) ր) if ϕi (j) is undefined, i.e., if Ti does not halt on input j.
Let α ∈ R. Let α[n] denote the first n bits of the binary expansion of α. A number
α ∈ R is computable if the function n ↦ α[n] is computable.

Definition 13. Let A ⊆ N. A decider D_A of the set A is a Turing machine that
computes the characteristic function of A,

    D_A(n) = 1 if n ∈ A ,   and   D_A(n) = 0 if n ∉ A .

It decides if, yes or no, a given element is in A.

Definition 14. A set is computable (or decidable or recursive) if it has a decider.

Example 7. All finite sets are decidable. The algorithm may just include the list
of all elements and verify. The set of odd numbers, the set of prime numbers, and
the set of powers of 17 are decidable.

Definition 15. Let A ⊆ N. An enumerator EA of the set A is a Turing machine


that enumerates all (and only) elements of A, without order and possibly with
repetitions. In general, the enumeration is not a halting computation, as must be
the case if A is infinite. But even if A is finite, the enumerator might forever be
looking for more elements of A.

Definition 16. A set A ⊆ N is computably enumerable (or recursively enumerable)


if it has an enumerator.

Example 8. The following sets are computably enumerable.


1. All decidable sets. This is because the decider can be transformed into an
enumerator: run D_A(1), D_A(2), . . .; whenever D_A(a) = 1, add a to the enumeration.
2. The set

    {⟨n, m⟩ | n is the largest number of consecutive 0s in the first 2m digits of π = 3.14159 . . .} .

3. The set of indices i such that the range of ϕ_i is nonempty: {i | ∃j : ϕ_i(j) ց}.
Indeed, consider the following enumerator:
E: For k = 1, 2, 3, . . .
       Simulate all {T_i(j)}_{i,j≤k} for k steps of computation.
       If some T_i(j) ց (which means ϕ_i is defined on j),
           output i.

Notice that if ϕi has a non-empty range, there is some j and some t such that
the computation Ti (j) halts after t steps. Eventually (when k = max(i, j, t)),
the enumerator will find it out and output i in the enumeration. On the other
hand, if ϕi ր for all inputs, the enumerator will never output i.
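The dovetailing trick in E can be sketched in Python. The stand-ins below are assumptions of mine: a "machine" is modelled as a function run(j, steps) that returns an output if it halts within the step budget and None otherwise.

    from itertools import count

    def dovetail(machines):
        # Enumerates the indices i whose machine halts on some input j.
        seen = set()
        for k in count(1):
            for i, run in enumerate(machines[:k]):
                for j in range(k):
                    if run(j, k) is not None and i not in seen:
                        seen.add(i)
                        yield i

    # Toy stand-ins: machine 0 halts on even inputs, machine 1 never halts.
    toy = [lambda j, steps: j if j % 2 == 0 and steps >= j else None,
           lambda j, steps: None]
    print(next(dovetail(toy)))   # 0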

Definition 17. A set A ⊆ N is recognizable (or semi-decidable) if the following
partial computable function exists:

    R_A(n) = 1 if n ∈ A ,   and   R_A(n) = 0 or ր if n ∉ A .

Proposition 2. A set A ⊆ N is computably enumerable ⇐⇒ it is recognizable.

Proof. =⇒. E_A → R_A. On a given n ∈ N, run the enumerator E_A until n is
output, and then accept it (map n to 1). Let us verify that this works.
If n ∈ A, it will eventually be enumerated by E_A, and so accepted.
If n ∉ A, n will never be enumerated, and the vain search will loop, as required.

⇐=. R_A → E_A. For k = 1, 2, 3, . . ., run {R_A(n)}_{n≤k} for k steps of computation.
If some n is found for which R_A(n) = 1, it means n ∈ A; enumerate that n.
Note that this procedure will never enumerate some n ∉ A.

4.8 The Halting Problem


[Lecture AIT-10]

Definition 18. The halting set is defined as

    K_0 \overset{def}{=} {⟨i, j⟩ | ϕ_i(j) ց} .

Proposition 3. K0 is computably enumerable.

Proof. In Problem Sheet #2 (§A.2). The proof is similar to item 3. of Example 8.

Theorem 6. The halting set K0 is undecidable.

The proof uses the diagonalization method invented by Cantor when he proved
that there are uncountably many real numbers. He assumed that the real numbers
could be enumerated and reached a contradiction by exhibiting a real number that

cannot be on the list. The contradiction forces us to abandon the possibility of
enumerating the real numbers. Here, the Turing machines and their inputs really
can be enumerated. The assumption that leads to the contradiction is that the
table i, j ↦ T_i(j) can be computed. See the proof and Table 1.
Proof. Suppose that a Turing machine D_{K_0} decides K_0:

    D_{K_0}(⟨i, j⟩) = 1 if T_i(j) ց ,   and   0 if T_i(j) ր .

The machine D_{K_0} can be used to construct the machine

    ∆(i) = 1 if T_i(i) ց ,   and   0 if T_i(i) ր ,

which in turn can be transformed into the machine

    ∆̄(i) = ր if T_i(i) ց ,   and   0 if T_i(i) ր .

Like all Turing machines, ∆̄ is in the effective enumeration, so ∆̄ = T_δ for some δ ∈ N.
What does ∆̄ do on input δ?

    ∆̄(δ) = ր if T_δ(δ) ց ,   and   0 if T_δ(δ) ր ,

so ∆̄(δ) ր if ∆̄(δ) ց, and ∆̄(δ) ց if ∆̄(δ) ր,
a contradiction. See Table 1.
Thus there exists no fixed algorithm which decides if a given Turing machine i
halts on a given input j. If Ti (j) ց, this can be verified: just run it. If Ti (j) ր,
running it is a bad idea—no amount of waiting can confirm non-halting. If this is
transposed into modern programming languages, how can one tell if a program will
loop? For trivial loops, it can be recognized. But what the halting problem tells us
is that there are ever more convoluted ways in which a program can loop so that no
fixed algorithm can capture all those ways.
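The diagonal trick itself is easy to play with in code. A toy Python sketch (with total functions, so that everything halts; entirely my own illustration):

    # A finite list standing in for an enumeration of total functions N -> N.
    fs = [lambda n: n, lambda n: n * n, lambda n: 0]

    def diag(i: int) -> int:
        # Differs from fs[i] on input i, so diag cannot appear in the list.
        return fs[i](i) + 1

    assert all(diag(i) != fs[i](i) for i in range(len(fs)))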
The halting set K_0 can be embedded into a real number χ = χ_1 χ_2 χ_3 . . ., where

    χ_n = 1 if n ∈ K_0 ,   and   χ_n = 0 if n ∉ K_0 .

This χ is Turing's uncomputable number.

          1    2    3    4    5    · · ·   δ
    T1    3    ր    3    1    ր    · · ·
    T2    1    ր    1    ր    ր    · · ·
    T3    2    2    ր    2    ր    · · ·
    T4    1    ր    1    2    ր    · · ·
    T5    7    7    1    ր    7    · · ·
    ...
    Tδ    ր    0    0    ր    ր    · · ·   ?

Table 1: Table representing the results of all computations. The machine
T_δ is constructed by inverting the behaviour of the diagonal of the table.
4.9 Uncomputability =⇒ Incompleteness


We’ve now seen that some functions cannot be computed—even in principle.
But what about proving properties of such functions? Can a sufficiently clever
mathematician—or a sufficiently powerful formal system—sidestep uncomputability
by proving whether a program halts? Because a proof is itself a kind of computa-
tion, uncomputability spills over into provability—and we now enter the realm of
incompleteness.

Definition 19. A formal axiomatic theory (FAT) is defined by


– an alphabet
– a grammar
– a set of axioms
– a set of rules of inference (logic) and
– a proof-checking algorithm.
In Hilbert’s spirit, proofs must be made so precise that a machine can verify their
validity. In a FAT, a theorem is a statement that can be derived from the axioms
and the rules of inference. The proof is the derivation.

The defining properties of FATs imply that the theorems of any given FAT can
be computably enumerated by a Turing machine. Such a Turing machine embodies
the FAT (the axioms and rules of inference can be encoded finitely), and enumerates,
one by one, all theorems in order of proof size (not in order of theorem-statement
size). How? Two ways.
1. For i = 1, 2, 3, . . ., output all statements obtained by i applications of a rule
of inference to the axioms.

2. Working through all strings p (in length increasing fashion), verify if p is a
proof of some statement. If yes output the statement.
In computer-theoretic terms, a formal axiomatic theory is a computably enu-
merable set of mathematical assertions.
Theorem 7 (Incompleteness à la Turing). For all FATs, there exists some i and j
for which Ti (j) ր is true but unprovable.
Proof. By contradiction: suppose that there exists a FAT T in which, for all ⟨i, j⟩
such that T_i(j) ր, it can be proven that indeed T_i(j) ր. A decider for the halting
problem can be exhibited by executing two computations in parallel:

    D: On input ⟨i, j⟩,
        For t = 1, 2, . . .
            Run T_i(j) for t steps (looking for T_i(j) ց).
            Run the enumerator of T for t steps (looking for a proof of T_i(j) ր).
        (One of these computations will eventually find the halting status of T_i(j).)
        Output that status.
But no such decider can exist.
We will return to incompleteness in §7.2 when we discuss Chaitin’s incomplete-
ness theorem.

4.10 Rice’s Theorem


[Lecture AIT-11]
What has been presented so far is sufficient computability theory to move on to
algorithmic information. However, it is worth presenting Rice's theorem, because
it shows that what appears to be one specific uncomputable problem, the halting
problem, contaminates many, many other problems. In a nutshell, Rice's theorem
states that any non-trivial property concerning the behaviour of Turing machines
(or computer programs) is undecidable.
It is perhaps more enlightening to begin with an example and later generalize it
into the theorem.
Example 9. Consider the successor function

    S : N → N
    n ↦ n + 1 .

In the enumeration ϕ_1, ϕ_2, ϕ_3, . . . of all partial computable functions, the function S
occurs infinitely many times. Consider

    I_S = {i : ϕ_i = S} .

Problem: Is I_S decidable? I.e., does there exist a Turing machine that computes

    D_{I_S}(i) = 1 if ϕ_i = S ,   and   D_{I_S}(i) = 0 if ϕ_i ≠ S ?

In modern terms: can a fixed algorithm generically decide if a given program com-
putes the successor function?
Answer: No, the set I_S is undecidable.
Proof. Suppose by contradiction that D_{I_S} exists. First, construct a Turing machine
T_k that computes S. (The decider D_{I_S} can be used on input k to verify
that T_k does indeed compute S.) Then, from T_k, we construct a family of Turing
machines {T_k^{(i,j)}}_{i,j∈N}, with the parameters i and j hardcoded in the machine T_k^{(i,j)}.

    T_k^{(i,j)}: On input n,
        Run T_i(j).
        If it halts,
            run and output T_k(n).

Note that if T_i(j) ց, then T_k^{(i,j)} does compute S, and
if T_i(j) ր, then T_k^{(i,j)} does not compute S.
Note also that the map

    (i, j) → T_k^{(i,j)} → E(T_k^{(i,j)}) → index of T_k^{(i,j)}

is computable (call this composite map the Turing machine M), because each stage is computable. Therefore,

    ⟨i, j⟩  —M→  index of T_k^{(i,j)}  —D_{I_S}→  1 if T_i(j) ց ,  0 if T_i(j) ր

would be a decider for K_0.

Definition 20. An index set I is a set with the following property: if ϕ_i = ϕ_j, then

    i ∈ I ⟺ j ∈ I .

Index sets are a way to describe classes of partial computable functions, or properties
of partial computable functions. In the example, I_S is an index set.

Definition 21. An index set I is trivial if either I = ∅ or I = N.

Theorem 8 (Rice). Any non-trivial index set is uncomputable.

Proof. Let I be a non-trivial set of indices. Suppose (by contradiction) that D_I
exists. Program your favourite Turing machine T_l that loops on all inputs, and
compute its index l. Compute D_I(l).
Case 1: If D_I(l) = 0, i.e., l ∉ I, compute D_I(1), D_I(2), . . . until a k ∈ I is found.
Case 2: If D_I(l) = 1, i.e., l ∈ I, compute D_I(1), D_I(2), . . . until a k ∉ I is found.
By construction, T_k computes a function that does not loop on all inputs—
i.e., it halts on at least one input. We can then define a family of Turing
machines {T_k^{(i,j)}}_{i,j∈N} that have i and j hard-coded into them as follows:

    T_k^{(i,j)}: On input n,
        Run T_i(j).
        If it halts,
            run T_k(n).

Note that if T_i(j) ց, then T_k^{(i,j)} has the same behaviour as T_k. Therefore, k and
the index of T_k^{(i,j)} are either both elements of I (case 1) or both not elements of I
(case 2).
If T_i(j) ր, then T_k^{(i,j)} loops on all inputs; therefore either k ∈ I and the index of
T_k^{(i,j)} is not in I (case 1), or vice versa (case 2).
Note also that the map

    (i, j) → T_k^{(i,j)} → E(T_k^{(i,j)}) → index of T_k^{(i,j)}

is computable (call this composite map the Turing machine M). Therefore, in case 1,

    ⟨i, j⟩  —M→  index of T_k^{(i,j)}  —D_I→  1 if T_i(j) ց ,  0 if T_i(j) ր

would be a decider for K_0, while in case 2,

    ⟨i, j⟩  —M→  index of T_k^{(i,j)}  —D_I→  0 if T_i(j) ց ,  1 if T_i(j) ր

would be.
All notions for Problem Sheet #2 have been covered. See §A.2.

5 Algorithmic Complexity
We have seen that some functions or numbers cannot be computed. But even among
computable objects, not all are created equal: some are structured, others seem
random. Can we measure this? Algorithmic complexity provides a framework for
quantifying the information content of individual strings, not through probabilities,
but via their shortest descriptions on a universal machine.

5.1 Motivation
There are contexts in which it is unsuitable to look for an underlying probability
distribution when communicating a string of symbols.
Example 10 (Communication of π). Suppose Alice wants to communicate the
following message to Bob:

    31415926538...9   (the first 10^{10} decimal digits of π) .

Applying Shannon's theory to that case might be frustrating. In practice, the
probability distribution is not given; it has to be estimated from the frequencies of
the symbols. In the case of π, the estimation would yield a uniform and independent
distribution on the alphabet {0, 1, 2, . . . , 9}. In this case, 4 bits per digit are required,
and the message is communicated with 4 · 10^{10} bits.
Considering groupings of digits doesn't help much. For instance, Alice could
group the digits in blocks of 3 and reserve a 10-bit codeword for each of the 1000
possible sequences of 3 digits. This would allow Alice to communicate the message
in 10 · 10^{10}/3 bits. Coding the message with larger blocks would still require more
than (log_2 10) · 10^{10} bits of communication.
The generic idea behind AIT is to assume that the string has been generated
by an underlying computation, not by an underlying probabilistic process. In the
context of communication, the sender (Alice) and the receiver (Bob) are equipped
with universal computers, and Alice may communicate

    'The first 10^{10} digits of 4 \sum_{n=0}^{\infty} \frac{(-1)^n}{2n + 1} ,'

which Bob runs on his computer to decode.


Example 11 (Randomness of individual objects). Is this string random
010101010101010101010101010101010101010101010101010101010101010101 . . . 01?
Is this one random
001010110101111010101000011011110101011111001010101111100010011000 . . . 11?

We would like to say that the first one is not random and that the second one
at least appears to be random.
How can this be formalized probabilistically? It is possible that both strings
arise from the same probabilistic Bernoulli process of p = 1/2, in which case both
of them have the same probability.
If viewed as arising from a computation instead of a probabilistic process, the
patterns displayed by the first string amount to a short description of it (e.g. print
"01" 10^6 times). On the other hand, if the second string really has no patterns, then
there is no short description for it. The shortest computable description for it might
be:

Print: “001010110101111010101000011011110101011111001010101111. . . 11”

Both examples motivate us to look at descriptions of minimal lengths to be run


on a universal computer.
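A rough, computable stand-in for "shortness of description" is compressed length. A Python sketch (my own illustration, using zlib as the description method):

    import zlib, random

    regular = ("01" * 10**6).encode()
    random.seed(0)
    patternless = bytes(random.getrandbits(8) for _ in range(2 * 10**6))

    print(len(zlib.compress(regular)))      # a few kilobytes: highly compressible
    print(len(zlib.compress(patternless)))  # about 2 * 10**6: no pattern zlib can find

    # Note: the seeded pseudo-random string actually has a short description too
    # (this very program), so compressed length is only an upper bound on the
    # length of the shortest description.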

5.2 Plain Complexity


[Lecture AIT-12]
Algorithmic complexity has two main versions: plain and prefix complexity.
Plain complexity C:
– at first the most intuitive;
– originally discovered in the mid-60s by Solomonoff, Kolmogorov and Chaitin;
– problematic for the development of the theory.

Prefix complexity K:
– initially less intuitive;
– discovered in the mid-70s by Levin and Chaitin;
– solves many problems.
The key idea: the algorithmic complexity of a bit string x is the length of its shortest
description (with respect to a fixed description method). To be useful, there ought to be a
computation (and so a Turing machine) that can generate x from its description:

    p → T → x .

Definition 22. An input of a Turing machine shall also be called a program.

We can view every Turing machine T as a description method. If T admits a


program p for which T (p) = x, then p is a description for x (under the description
method given by T ).

We can thus define the complexity of a string x, with respect to the description
method given by T, as

    C_T(x) = \min_p {|p| : T(p) = x} ,   and C_T(x) = ∞ if no such p exists.

Note that for each Turing machine T, we have a complexity measure x ↦ C_T(x).

Since what we are after is the shortest descriptions, we have good news: one
complexity measure is "smaller" than all the others.
Let

    ⟨i, j⟩ \overset{def}{=} ī j = 1^{|i|} 0 i j .

Let U be the universal Turing machine such that U(⟨i, j⟩) = T_i(j).

Definition 23. The plain algorithmic complexity C_U(x) of a string x ∈ {0, 1}∗ is
defined as

    C_U(x) = \min_p {|p| : U(p) = x} .

5.3 The Invariance Theorem


Theorem 9 (Additive optimality of U; the invariance theorem; universality of
algorithmic complexity). For any Turing machine T, there exists a constant b_T such
that

    C_U(x) ≤ C_T(x) + b_T   ∀x .   (*)
Proof. If C_T(x) < ∞, it means that the machine T can compute x from some program
q, where |q| = C_T(x).
But T = T_i for some i. Therefore, U can also compute x, via the program p = ⟨i, q⟩.
This program has length

    |1^{|i|} 0 i q| = |q| + \underbrace{2|i| + 1}_{b_T} = C_T(x) + b_T .

A universal Turing machine with the property (*) is additively optimal. Note that
not all universal Turing machines are additively optimal. For instance, a universal
Turing machine W may expect input of the form 1^i 0 1^q to simulate T_i(q) (and halt
with empty output on any program not of the right form); the unary encoding of q
makes the overhead grow with the program instead of staying constant.
The invariance theorem can be applied by setting T = V , where V is another
additively optimal universal Turing machine. We have ∃bV , ∃bU such that

CU (x) ≤ CV (x) + bV and CV (x) ≤ CU (x) + bU ∀x .

Therefore, there exists a constant b_{UV} such that

    |C_V(x) − C_U(x)| ≤ b_{UV}   ∀x .

Notation: We denote by O(f(x)) a quantity whose absolute value does
not exceed f(x) by more than a fixed multiplicative factor. More precisely,
g(x) = O(f(x)) if there are constants c, x_0 such that |g(x)| ≤ c f(x) for all x ≥ x_0.
This permits us to write

    C_V(x) = C_U(x) + O(1) .
We thus see that the choice of reference universal and additively optimal Turing
machine only matters by an additive constant (independent of x). We consider this
irrelevant, as it is irrelevant for strings large enough. This is the invariance in the
choice of the reference machine. The invariance theorem motivates dropping the U
in CU . We simply write C(x).
The invariance theorem is particularly useful when the Church–Turing thesis is invoked.
What is the length C_LISP(x) of the minimal LISP program for a string x? What
is the length C_PYTHON(x) of the minimal PYTHON program for a string x? The
invariance theorem states that there exists a universal constant b_{LISP,PYTHON} that
bounds the difference between the measures C_LISP(·) and C_PYTHON(·), i.e.,

    |C_LISP(x) − C_PYTHON(x)| ≤ b_{LISP,PYTHON}   ∀x .

The constant may be large, but for strings large enough, it is irrelevant.
And by the Church–Turing–Deutsch principle, if any other physical system is
used to define a complexity measure on strings, it will not yield an essentially smaller
measure than C_U.

5.4 Basic Properties


[Lecture AIT-13]
Definition 24. Let x be a finite string. Its shortest program is denoted x∗ . In other
words, x∗ is the witness of C(x). If there is more than one shortest program, x∗ is
the one with the least running time.
Proposition 4 (upper bound). For all strings x,

C(x) ≤ |x| + O(1) .

Proof. Invoking Turing machines: there is a Turing machine Id which computes the
identity function. U can simulate Id on input x. The program for U would be
p = ⟨i, x⟩ = ī x, where T_i = Id.

    |p| = |ī| + |x| = |x| + O(1) .
This is one program for computing x on U; the minimal program therefore has
length C(x) ≤ |p|.
Invoking a high-level programming language: Print "x" is a program for x. It has
length |x| + O(1).
Recalling our identification of strings with numbers, Eq. (1), and Prop. 1,
|x| = ⌊log x⌋, so

    C(x) ≤ log x + O(1) .

Definition 25. The complexity C(x, y) of a pair of strings is defined with respect
to a fixed encoding ⟨·, ·⟩ : {0, 1}∗ × {0, 1}∗ → {0, 1}∗ as

    C(x, y) = \min_p {|p| : U(p) = ⟨x, y⟩} .

Example 12. C(x, x) = C(x) + O(1). This means two things:

    1. C(x, x) ≤ C(x) + O(1)   and   2. C(x) ≤ C(x, x) + O(1) .

1. One program to compute ⟨x, x⟩ can be obtained from x∗ by executing x∗ to get
x, and then computing ⟨x, x⟩.
2. One program to compute x can be obtained from ⟨x, x⟩∗ by executing ⟨x, x⟩∗ and
extracting x via the inverse of ⟨·, ·⟩.

The computation of a string x may be eased by some auxiliary information y.

Definition 26. The conditional complexity of the string x given the string y is

    C(x|y) \overset{def}{=} \min_p {|p| : U(p, y) = x} ,

where the universal Turing machine U(·, ·) has been promoted to two tapes³. The
second entry refers to the tape of auxiliary information.
We update the unconditional definition to C(x) \overset{def}{=} C(x|ε).

Example 13. C(x|y) ≤ C(x) + O(1), because one way to compute x with y as
auxiliary information is to ignore y, and produce x from scratch with x∗ .
Footnote 3: The two-tape Turing machine can be simulated by a single-tape Turing machine; so two-tape
machines compute the same class of functions. I invoke two-tape machines because they correspond
to the realistic model of keeping auxiliary information and programs on different registers.

5.5 Chain Rule for Plain Complexity
Proposition 5 (Chain rule, easy side). For all strings x and y,

C(x, y) ≤ C(x) + C(y|x) + O(log n) ,


where n = max{|x|, |y|}.
Proof. Let q be the witness of C(y|x), i.e., the program of minimal length such that
U(q, x) = y. The aim is to construct a program for hx, yi from x∗ and q. If delimiters
(#) were free, a program for hx, yi would be
r#x∗ #q ,
where r contains the following instructions:
1. Between the first and second delimiter, a program (x∗ ) is given to compute
the first string (x), and this string should be stored on the auxiliary tape.
2. After the second delimiter, a program (q) is given to compute the second string
(y) with the help of x on the auxiliary tape.
3. Compute the pairing hx, yi.
Note that r is of constant length (independent of x and y, so independent of n).
However, delimiters are not free: what we want is a single program p, made only of
bits, for which U(p) = hx, yi. If p is of the form
p = r ′ x∗ q ,
the preamble r ′ needs to do what r does, and moreover, it needs to contain some
information to break x∗ q into x∗ and q. It can be the length of x∗ , namely C(x), given in
a self-delimiting way (e.g., encoded as 1^{|C(x)|} 0 C(x), which takes 2 log C(x) + O(1) bits). Thus

|p| = |r ′ | + |x∗ | + |q|


= |r ′ | + C(x) + C(y|x)
≤ C(x) + C(y|x) + 2 log C(x) + O(1)
≤ C(x) + C(y|x) + O(log n) .
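The only delicate step is how the single bit string r′x∗q is split back into its parts. Here is a small Python sketch of that bookkeeping (x∗ and q are stood in by arbitrary placeholder strings, since actual shortest programs are not computable); the self-delimiting length field is the code word 1^{|s|} 0 s from Problem Sheet #1 and costs 2 log C(x) + O(1) bits, which is exactly where the logarithmic error term comes from.

def bar(s: str) -> str:
    """Self-delimiting code word 1^{|s|} 0 s (as in Problem Sheet #1)."""
    return "1" * len(s) + "0" + s

def encode(x_star: str, q: str) -> str:
    """Concatenate a self-delimited length field for x_star, then x_star, then q."""
    length_field = format(len(x_star), "b")   # |x*| = C(x) written in binary
    return bar(length_field) + x_star + q

def decode(payload: str):
    """Recover (x_star, q) from the concatenation produced by encode()."""
    i = 0
    while payload[i] == "1":                  # unary part of the bar code
        i += 1
    k = i                                     # number of bits of the length field
    n = int(payload[i + 1 : i + 1 + k], 2)    # |x_star|
    rest = payload[i + 1 + k :]
    return rest[:n], rest[n:]

x_star, q = "1101001", "0011"                 # placeholders for x* and q
assert decode(encode(x_star, q)) == (x_star, q)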

[Lecture AIT-14]
Proposition 6 (Chain rule for plain complexity, hard side). For all strings x and y,
C(x) + C(y|x) ≤ C(x, y) + O(log(C(x, y)))
≤ C(x, y) + O(log n)
where again n = max{|x|, |y|}.

Proof. Let m ≡ C(x, y). Define

A = {hx′ , y ′i : C(x′ , y ′) ≤ m} and Ax = {y ′ : C(x, y ′ ) ≤ m} ,

which contain hx, yi and y, respectively. A program for y given x is to computably
enumerate Ax and to give the enumeration number iy of y. Ax can be enumerated if x
and m are known: run all programs of length ≤ m in a dovetailed fashion and, whenever
a program halts with an output of the form hx, y′i, list y′.
Let l ≡ ⌈log |Ax |⌉ be an upper bound on the number of bits of iy . Hence, the
program considered requires O(log m) bits to specify m, and l bits to give the
enumeration number of y. So

C(y|x) ≤ l + O(log m) . (2)

Define now
B = {x′ : log |Ax′| > l − 1} ,
which contains x. If m and l are given (an O(log m) advice suffices, since l ≤ m + 1
because |Ax| is smaller than the number of programs of length ≤ m), B can be enumerated
by enumerating A (thanks to m) and adding x′ to B whenever the subset Ax′ is found to
contain more than 2^{l−1} elements. A possible program for x is thus given by the
enumeration number of x in B. Note that

|A| = Σ_{x′} |Ax′| ≥ Σ_{x′∈B} |Ax′| ≥ Σ_{x′∈B} 2^{l−1} = |B| 2^{l−1} .

Since |A| < 2m+1 ,


log |B| ≤ m − l + 2 .
Therefore
C(x) ≤ m − l + O(log m) . (3)
Summing (2) and (3) together yields what is to be shown.
From Propositions 5 and 6, this version of the chain rule follows

Theorem 10 (Chain rule). For all strings x and y,

C(x, y) = C(x) + C(y|x) + O(log(C(x, y))) .

Again, the error term could be set to be O(log n), where n = max{|x|, |y|}.

5.6 Incompressibility
It is easy to come up with highly compressible strings. For instance, define

f (k) = 2^(2^(···^2))   (a tower of k twos).

For each k, the binary representation of f (k) has complexity at most C(k) + c ≤
log k + c for some constant c independent of k. Yet it has a huge length.
What about incompressible strings?

Definition 27. For each c, a string x is c-incompressible (given y) if C(x) ≥ |x| − c


(if C(x|y) ≥ |x| − c).

Proposition 7. There are at least 2n − 2n−c + 1 strings of length n that are c-


incompressible.

Proof. The number of strings of length n is 2^n. The number of programs of length
strictly less than n − c is

Σ_{i=0}^{n−c−1} 2^i = 2^{n−c} − 1 .

In the most conservative scenario, each of those short programs computes a different
string of length n. In fact, the count is conservative: some of those short programs
don’t produce any output (they don’t halt), some compute strings of a length other
than n bits, and some compute the same string.
Thus, there are at least 2^n − (2^{n−c} − 1) = 2^n − 2^{n−c} + 1 strings that are not
computed by such a short program, namely, that are c-incompressible.

Corollary 3. There is at least 1 string that is 0-incompressible. Not so bad. But


it goes fast: more than 3/4 of the strings are 2-incompressible.
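The counting is elementary enough to check numerically; this short Python sketch just evaluates the guaranteed fraction (2^n − 2^{n−c} + 1)/2^n of c-incompressible strings of length n.

def incompressible_fraction_lower_bound(n: int, c: int) -> float:
    """Fraction of length-n strings guaranteed to be c-incompressible (Prop. 7)."""
    return (2**n - 2**(n - c) + 1) / 2**n

for c in range(4):
    print(c, incompressible_fraction_lower_bound(20, c))
# c = 0: a tiny positive fraction, i.e. at least one string (Corollary 3)
# c = 2: just above 3/4, as stated in Corollary 3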

Remark 2. The definition of incompressibility can be made to accommodate aux-


iliary information: a string x is c-incompressible given y if C(x|y) ≥ |x| − c. And
Proposition 7 carries through in the presence of auxiliary information y.

For our purposes, we can identify incompressible strings with algorithmically


random strings.

5.7 The Problems of Plain Complexity


1. The chain rule holds only up to O(log C(x, y)).

2. Let n be an incompressible number (string) of length m. Consider

00…0 (n times) ≡ 0^n .

C(0n ) = C(n) + O(1) = m + O(1) .


As in the case of 0n , there can be algorithmic complexity arising from the
length of the string, even when the string’s contents are highly regular. Con-
sider now x ∈ {0, 1}n , incompressible given n, i.e., C(x|n) ≥ n. The fact that
not only C(x) ≥ n but also C(x|n) ≥ n means that the algorithmic informa-
tion present in the length of x, which is by default carried by x, is independent
of the algorithmic information present within x. Intuitively, we would like that

C(x) = n + m + O(1),

where m bits of algorithmic complexity are embodied in the size of x and


n bits of algorithmic complexity come from the bits within x. But we have
shown that C(x) ≤ n + O(1).
3. Solomonoff had a brilliant idea for a measure over all strings. Instead of a
monkey in front of a typewriter, consider a monkey in front of the computer,
and define:

x ↦ P(x) = Σ_{p : U(p)=x} 2^{−|p|} .

This measure amounts to the probability that a program for which the bits
are picked at random would yield the string x.
Solomonoff’s problem was that the measure does not converge. Indeed there
exists a Ti such that on all strings y, Ti (y) = x. Thus, each īy is a valid
program for x, as U(īy) = x for all y ∈ {0, 1}∗ .
P(x) ≥ Σ_{y∈{0,1}*} 2^{−|īy|}
     = 2^{−|ī|} Σ_{y∈{0,1}*} 2^{−|y|}
     = 2^{−|ī|} Σ_{n=0}^{∞} Σ_{y∈{0,1}^n} 2^{−n}
     = 2^{−|ī|} Σ_{n=0}^{∞} 2^n 2^{−n}
     = ∞ .

[Lecture AIT-15]
4. Recalling coding theory,

SP : {0, 1}∗ → {0, 1}∗


x 7→ x∗

is a natural and powerful code. Is it a prefix code? No. In fact, considering


y ≠ x, nothing prevents y∗ from being a prefix of x∗ because, a priori, nothing
prevents a program from being the prefix of another program.
Adopting prefix instead of plain complexity solves all of the above problems.

6 Prefix Complexity
In prefix complexity, it is demanded that a program must be self-delimited : its total
length in bits must be determined by the program itself. For Turing machines, it
means that the head scanning the program is not allowed to overshoot.
This way, the alphabet on which we measure program size really is binary. In
the plain complexity framework, the head can hit and recognize the first blank after
the program, which effectively introduces a third symbol into the alphabet used to
describe programs.
Real programming languages are self-delimiting because they provide instruc-
tions for beginning and ending programs. This allows a program to be well-formed
by the concatenation or nesting of sub-programs.

6.1 Self-delimiting Turing Machines


A self-delimiting Turing machine has three tapes:
1. The input or program tape is one-way (the head can only move to the right),
read-only and initially contains the program.
2. The work tape is two-way; it reads and writes and initially contains a
string y as auxiliary information.
3. The output tape is one-way, write-only and initially empty.
If, after finitely many steps, T halts with
– a string p that has been so far scanned on the input tape, namely, upon halting,
the head of the input tape is on the last bit of p, but no further;
– possibly, a string y that was initially given on the work tape;
– a string x on the output tape;
then we say that the computation is successful, and T (p, y) = x, (in particular,
T (p, y) ց). Otherwise, the computation is a failure, and T (p, y) ր.
Self-delimiting Turing machines can be effectively enumerated, T1 , T2 , . . . , and
there exists U s.t.
U(īq) = Ti (q) ∀i, q .
More generally, with auxiliary information,

U(īq, y) = Ti (q, y) ∀i, q, y .

The first argument refers to the program tape content; the second to the work
(auxiliary) tape.

Remark 3. For all self-delimiting Turing machines T and for all auxiliary informa-
tion y, the set of programs that lead to a halting computation {p : T (p, y) ց} is a
prefix-free set. Indeed, if T (q, y) ց, it means that the last bit of q has been read
(and no more). Therefore, no program of the form qs, for some non-empty string s,
can also lead to a successful computation4 .
In particular, the set {p : U(p) ց} is a prefix-free set. See Fig. 6.
[Figure 6: a binary prefix tree in which U(001) = 11, U(10) = ε and U(110) ր are marked, and the subtrees below 001 and 10 are removed; see the caption below.]
Figure 6: A possible prefix tree representing the set {p : U (p) ց}.
On input 001, the self-delimiting universal Turing machine U halts and
outputs 11, and on input 10, it outputs ε. Thus, all programs of the
form 001s or 11s for non-empty s do not lead to a successful computa-
tion because their last bit will never be scanned. We denote U (001s) ր
and U (10s) ր, and remove the descendant branches from the tree. More-
over, the program 110 loops. Therefore, U (110) ր, but also U (001s) ր
for all nonempty s.
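As a small sanity check of Remark 3, here is a Python helper that tests whether a finite set of programs is prefix-free, applied to the two halting programs of Figure 6 (typed in by hand; this is only an illustration, not a simulation of U).

from itertools import combinations

def is_prefix_free(programs) -> bool:
    """True iff no program in the set is a proper prefix of another."""
    for p, q in combinations(programs, 2):
        if p.startswith(q) or q.startswith(p):
            return False
    return True

halting = {"001", "10"}                       # halting programs of Figure 6
print(is_prefix_free(halting))                # True
print(is_prefix_free(halting | {"0011"}))     # False: 001 is a prefix of 0011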

Definition 28. The conditional prefix algorithmic complexity K(x|y) of a string


x ∈ {0, 1}∗ is defined as
K(x|y) = min_p {|p| : U(p, y) = x} .

As in plain complexity, U is additively optimal, and the invariance theorem


holds: ∀ self-delimiting Turing machines T and ∀ strings x, y
K(x|y) = KU(x|y) ≤ KT(x|y) + O(1) .

Proof. If KT(x|y) < ∞, it means that T(q, y) = x for some (self-delimiting) program q,
where |q| = KT(x|y). But T = Ti for some i. Therefore, U can also compute x, via the
program p = īq. This program has length

|īq| = |q| + (2|i| + 1) = KT(x|y) + bT ,

where the constant bT = 2|i| + 1 depends only on T.

Footnote 4: If qs happens to be on the input tape, the computation would carry through, but q would
be recognized as being the program of the computation. Therefore, we adopt the convention that
T (qs, y) is undefined and also denote it as ր.

The unconditional prefix complexity of x is defined as K(x) = K(x|ε); x∗ is the shortest
program; and the prefix complexity of a pair is K(x, y) = K(hx, yi).

6.2 Solving the Problems of Plain Complexity


Let us address the problems of §5.7 in reverse order. We now consider the notions
in the context of prefix complexity.
4. The code

SP : {0, 1}∗ → {0, 1}∗


x 7→ x∗

is a prefix code.
[Lecture AIT-16]
Comment on the lecture: Unfortunately, the audio was not recorded.
3. Solomonoff’s problem is solved:

x ↦ P(x) = Σ_{p : U(p)=x} 2^{−|p|}

now converges for all x. Indeed, the set {p : U(p) = x} is prefix-free because
it is a subset of the prefix-free domain of halting programs {p : U(p) ց}.
Therefore, by Kraft’s inequality, not only does P(x) ≤ 1, but also
Σ_x P(x) = Σ_{p : U(p)ց} 2^{−|p|} ≤ 1.
2. Let x ∈ {0, 1}n . I mentioned that intuitively, if both x and n are algorithmi-
cally random, we might want

K(x) = n + m + O(1) ,

where m measures the complexity of n itself, while n measures the complexity of the
bits “within” x. However, with plain complexity, the print “x” program entailed
C(x) ≤ n + O(1). Yet, as we have seen, the print program was not self-delimiting.
Proposition 8. For all x ∈ {0, 1}n,

K(x) ≤ n + K(n) + O(1) .

Proof. Consider the following self-delimiting program for U:

rn∗ x ,

where r is a constant-sized (and self-delimited) program which gives the following


instructions.

[Figure 7 plots a made-up K(x) together with the curves f(x) and g(x) described in the caption below.]

Figure 7: A made-up graph for K(x), with its upper bound given by
f (x) = log x + log log x + log log log x + O(log log log log x), which is very
close to being tight: infinitely often, K(x) exceeds g(x) = log x+log log x+
log log log x.

1. Execute n∗ to find n. (Note that there is no need for delimiting information


to tell n∗ and x apart, because n∗ is self-delimiting.)
2. Output the next n bits on the tape (x).

Corollary 4. Let x ∈ {0, 1}∗. Via iterative appeals of Prop. 8, we find that

K(x) ≤ |x| + K(|x|) + O(1)


≤ |x| + ||x|| + K(||x||) + O(1)
≤ |x| + ||x|| + |||x||| + K(|||x|||) + O(1)
≤ log x + log log x + log log log x + O(log log log log x) ,

where || · || denotes the length of the length.

This gives an upper bound for K(x) that is nearly tight. Indeed, as is shown in
Problem Sheet #3 (§A.3), for infinitely many values of x, K(x) exceeds log x +
log log x + log log log x. See Fig. 7.
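One way to see where a bound of this shape comes from is to implement a concrete prefix code. The Python sketch below is an illustration, not the optimal code: it prefixes x with its length, itself prefixed with the length of the length in the self-delimiting bar code of Problem Sheet #1, giving a code word of |x| + ||x|| + 2|||x||| + O(1) bits; this matches the first terms of Corollary 4, and iterating the same trick once more recovers the full bound.

def bar(s: str) -> str:
    """Code word 1^{|s|} 0 s from Problem Sheet #1."""
    return "1" * len(s) + "0" + s

def encode(x: str) -> str:
    """Prefix-free code of length |x| + ||x|| + 2|||x||| + O(1)."""
    len_x = format(len(x), "b")           # |x| in binary: ||x|| bits
    len_len_x = format(len(len_x), "b")   # ||x|| in binary: |||x||| bits
    return bar(len_len_x) + len_x + x

def decode(stream: str) -> str:
    """Read exactly one code word from the front of the stream and return x."""
    i = 0
    while stream[i] == "1":
        i += 1
    k = i                                           # |||x|||
    n_len_x = int(stream[i + 1 : i + 1 + k], 2)     # ||x||
    j = i + 1 + k
    n_x = int(stream[j : j + n_len_x], 2)           # |x|
    j += n_len_x
    return stream[j : j + n_x]

x = "10111010011"
assert decode(encode(x) + "010101") == x   # self-delimiting: trailing bits are ignored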
1. The chain rule had a O(log C(x, y)) error term. For prefix complexity, the
error is O(1).

Theorem 11 (Chain rule for prefix complexity). For all strings x and y,

K(x, y) = K(x) + K(y|x∗ ) + O(1) .

The proof of the chain rule follows upon proving the easy side, Prop. 9, and the
hard side, Prop. 17 in §8.1.

Proposition 9 (Chain rule for prefix complexity, easy side). For all x and y,

K(x, y) ≤ K(x) + K(y|x∗) + O(1) .

Proof. In Problem Sheet #3. See A.3.

6.3 Information Equivalence between x∗ and (x, K(x))


[Lecture AIT-17]

Proposition 10. There is the same algorithmic information in x∗ as there is in


(x, K(x)), i.e.

K(hx, K(x)i|x∗ ) = O(1) and K(x∗ |hx, K(x)i) = O(1) .

Here is a convenient notation for the above relations:

x∗ → [O(1)] → (x, K(x))    and    (x, K(x)) → [O(1)] → x∗ ,

where [O(1)] stands for a constant-size program.

Proof. First, we want to show that there exists a constant-size program, which
takes x∗ as auxiliary information and outputs hx, K(x)i. This can be done with the
following instructions:
1. Measure the length of x∗ (this gives K(x));
2. Execute x∗ to get x;
3. Compute hx, K(x)i.
Second, to compute x∗ from (x, K(x)), for i = 1, 2, . . ., run all programs of length
K(x) for i steps. Eventually some p halts with U(p) = x; this p is x∗ .
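The search in the second half of the proof is a typical dovetailing loop; here is its shape in Python, where run_for_steps is a hypothetical step-bounded simulator of U (it returns U(p) if the computation halts within t steps and None otherwise). No such simulator is implemented here; the point is only the control structure.

from itertools import product
from typing import Callable, Optional

def find_shortest_program(x: str, k: int,
                          run_for_steps: Callable[[str, int], Optional[str]]) -> str:
    """Recover x* from (x, K(x) = k): dovetail over all k-bit programs with growing time bounds."""
    candidates = ["".join(bits) for bits in product("01", repeat=k)]
    t = 1
    while True:                      # terminates: some program of length k = K(x) outputs x
        for p in candidates:
            if run_for_steps(p, t) == x:
                return p
        t += 1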
Proposition 10 entails that the chain rule can be expressed as

K(x, y) = K(x) + K(y|x, K(x)) + O(1) .

Prefix complexity may be slightly larger than plain complexity, due to the cost
of self-delimiting structure, but not by more than O(log n) for strings of length n.

Proposition 11. For all x ∈ {0, 1}n, K(x) = C(x) + O(log n).

Proof. See Problem Sheet #3, §A.3.

6.4 The Uncomputability of K(x)
Theorem 12. The function x 7→ K(x) is uncomputable.

Proof. Assume that there is a Turing machine M which, given x, computes K(x).
Let B be a constant to determine. Consider the following program.

Simulate M to compute K(x) for x = 1, 2, 3, . . . until some zB


is found for which K(zB ) > B. Output that zB .

What is the size of the above program?


– The instructions to simulate M are of constant size c1 ;
– The hard coding of the constant B can be given in a self-delimiting way
by 2 log B + 1 bits;
– The extra instructions to assemble the program are of constant size c2 .
Therefore the size of the program is c1 + c2 + 2 log B + 1. Let B be any number
such that
c1 + c2 + 2 log B + 1 < B.
Thus, we have a program of size ≤ B that produces zB , which, supposedly, has
K(zB ) > B. Contradiction.
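The contradiction is easy to phrase as code. The sketch below assumes, for the sake of the argument, a Python function K computing prefix complexity — precisely what Theorem 12 rules out; everything else is ordinary Python, using the identification of strings with numbers.

def K(x: int) -> int:
    """Hypothetical computable version of x -> K(x); Theorem 12 shows it cannot exist."""
    raise NotImplementedError

def z(B: int) -> int:
    """The smallest number (string) that cannot be algorithmically described in B bits or fewer."""
    x = 1
    while True:
        if K(x) > B:
            return x
        x += 1

# The source of z together with a hard-coded B is itself a description of z(B) of
# length c + O(log B) bits, which is smaller than B for large B -- contradicting K(z(B)) > B.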
This proof is the formalization of the Berry paradox. In §1.2, we defined k20 as
“the smallest number that cannot be described by twenty syllables or less”. This
“definition” puts a high burden on what is meant by “described”. In AIT, we look
at algorithmic descriptions, namely those that are given to a universal computer.
The proof of the uncomputability of K(x) involves the algorithmic-information-
theoretic cousin of k20 . Indeed, zB = min{x : x ∈ {0, 1}∗ & K(x) > B}, or in words,
it is

“the smallest number that cannot be algorithmically described by B bits or less.”

The displayed sentence can be put into bits, and if B is large enough, the obtained
description would be shorter than B bits. The contradiction is avoided because while
the displayed sentence encoded in bits is a description of zB —it uniquely specifies
it—by the uncomputability of K(x), it is not an algorithmic description.

7 The Halting Probability Ω
As we shall see in this section, the halting probability Ω—also known as Chaitin’s
constant—is an algorithmically random number that compactly encodes the halting
problem and, with it, many mathematical truths.
Reminder: U(p) = Ti (j), where p = īj. We defined Turing’s number χ as

χ = χ0 χ1 χ2 . . . ,

where

χp = 1 if U(p) ց (equivalently, Ti(j) ց), and χp = 0 if U(p) ր (equivalently, Ti(j) ր).

Observe that the first 2^0 + 2^1 + 2^2 + · · · + 2^n = 2^{n+1} − 1 bits of χ encode the halting
problem for all programs of length ≤ n.
Definition 29. The halting probability Ω (or Chaitin’s constant) is

Ω = Σ_{p : U(p)ց} 2^{−|p|} .

It corresponds to the probability that the self-delimiting universal Turing ma-


chine halts if it is given a program whose bits are chosen by coin flip. Note that by
the construction of self-delimiting machines (see §6.1), a random bit can be drawn
only when the reading head asks for one more bit, thus keeping the process finite.

[Lecture AIT-18]
Comment on the lecture: The solution to Problem 4 of Sheet #2 (A.2) is discussed.
Moreover, in the first hour, I reviewed a selection of articles. See Appendix B for
the references and abstracts of the presented articles.
Proposition 12. Ω is a compressed version of χ: n bits of Ω are enough to compute
2n+1 − 1 bits of χ. Schematically,

Ω[n] → [O(1)] → χ[2^{n+1}−1] .
Proof. For i = 1, 2, . . . run all programs of length ≤ i for i steps of computation.
(By the way, this is called “running all programs in a dovetailed fashion”). Add 2−|p|
to a sum S (initially set to 0) whenever a program p halts. When the first n bits
of the sum S stabilize to the first n bits of Ω, i.e., S[n] = Ω[n] (this happens when
S ≥ Ω[n] ), then no program of length ≤ n that has not halted yet will ever halt, since
such an additional contribution of at least 2^{−n} to the sum would contradict the
value of Ω. This process is said to
lower semi-compute Ω, since it always returns smaller numbers than Ω, and they
converge to it in the limit of infinite time.

Thus, Ω is not computable, but it is lower semi-computable: we can successively
approximate it from below, though never exactly determine its bits.
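Written out as a Python sketch, the dovetailed lower semi-computation looks as follows; halts_within is a hypothetical step-bounded halting test for the self-delimiting machine U (not implemented here), and exact dyadic arithmetic keeps the bits of the approximation meaningful.

from fractions import Fraction
from itertools import product
from typing import Callable

def lower_semicompute_omega(stages: int,
                            halts_within: Callable[[str, int], bool]) -> Fraction:
    """Dovetailed lower bound on Omega after the given number of stages.

    halts_within(p, t) is a hypothetical simulator of U: True iff U halts on p within t steps.
    """
    halted = set()
    S = Fraction(0)
    for i in range(1, stages + 1):        # stage i: all programs of length <= i, i steps each
        for length in range(1, i + 1):
            for bits in product("01", repeat=length):
                p = "".join(bits)
                if p not in halted and halts_within(p, i):
                    halted.add(p)
                    S += Fraction(1, 2 ** len(p))
    return S   # never exceeds Omega, and converges to it as the number of stages grows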
[Lecture AIT-19]

Proposition 13. Ω is random: ∀n, K(Ω[n] ) > n + O(1) .

Proof. From Ω∗[n] , Ω[n] can be reconstructed, and by Proposition 12, χ[2n+1 −1] can be
computed. This, in turn, permits the computation of the output of all programs of
length ≤ n. All of these strings, therefore, have complexity smaller or equal to n.
Then, the lexicographically first string that does not have complexity ≤ n can be
recognized. Let us denote it zn (it is zB with B = n). In diagram form,

Ω∗[n] → [O(1)] → Ω[n] → [O(1)] → χ[2^{n+1}−1] → [O(1)] → {x : K(x) ≤ n} → [O(1)] → zn = min{x : K(x) > n} .

This program for zn is of total size K(Ω[n] ) + O(1), so K(zn ) ≤ K(Ω[n] ) + O(1). But
by definition, K(zn ) > n.
All notions for Problem Sheet #3 have been covered. See §A.3.

7.1 Mathematical Truths in Ω


The first bits of Ω encode many mathematical facts. In particular, they encode the
truth or falsity of statements that are refutable by finite means, namely, statements
such that, if they are false, an algorithm can find a counterexample and halt.
For instance, the Goldbach conjecture, which states that every even number larger
than 2 is the sum of two primes, can be refuted if a counter-example is found.
Consider the following program:

q: For i = 1, 2, 3, . . .
check if 2i + 2 = p1 + p2 , where p1 and p2 are primes
If yes, go to the next i.
If not, halt.

Note that q halts if and only if Goldbach’s conjecture is false. Moreover, q is quite
short: its halting status is encoded in the first few thousand bits of Ω.
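For concreteness, here is a runnable Python rendition of q. Unlike the idealized q it takes an explicit search bound so that the example terminates; the program whose halting status is encoded in Ω is the same loop with the bound removed.

def is_prime(m: int) -> bool:
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def goldbach_refuter(bound: int):
    """Return a counterexample (an even number that is not a sum of two primes)
    among 4, 6, ..., bound, or None if there is none; the idealized q has no bound."""
    for even in range(4, bound + 1, 2):
        if not any(is_prime(a) and is_prime(even - a) for a in range(2, even // 2 + 1)):
            return even              # q would halt here
    return None

print(goldbach_refuter(1_000))       # prints None: no counterexample below 1000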
Another example of a statement refutable in finite means is Fermat’s last theorem
(proven by Wiles). It states that the equation xn + y n = z n admits no integer
solutions for x, y, z > 0 and n > 2. An exhaustive search can look for such an
integer solution and halt if it finds it.

The Riemann hypothesis states that the non-trivial zeros of ζ(s) = Σ_{n=1}^{∞} 1/n^s are
all on the line Re(s) = 1/2. The Riemann hypothesis can be shown to be refutable
by finite means, due to a result by Matiyasevich [11]. Therefore, the first bits of Ω
could tell us if the hypothesis is true or not.
The first bits of Ω can also be used to determine whether a statement s is
provably true (s is a theorem), provably false (¬s is a theorem), or independent5
of some formal axiomatic theory A. Remember that a formal axiomatic theory is
a computably enumerable set of mathematical assertions (theorems). Consider the
following program
Q: Computably enumerate the theorems of A.
If s is enumerated halt and output 1.
If ¬s is enumerated, halt and output 2.
If Q loops, s is independent. The program Q is of finite length (roughly the length
of the enumerator for A and for an algorithmic description of s). We use Ω to
determine if Q halts or loops. If it is promised to halt, we run Q to find if it
outputs 1 or 2.

7.2 Chaitin’s Incompleteness Theorem


Definition 30. We can define the complexity K(A) of a formal axiomatic theory A
as the length of the shortest program that computably enumerates the theorems of A.
Theorem 13 (Chaitin). For any formal axiomatic theory A0 there is a constant k0
(which is roughly of size K(A0 )) such that for all bit strings x ∈ {0, 1}∗ —except a
finite number of them—the mathematical assertion:

“The string x is such that K(x) > k0 ”

is true, but unprovable in A0 .


Proof. First, it is true that only finitely many strings have complexity ≤ k0 . Indeed,
there is only a finite number of programs shorter than k0 , and each of them can be
the shortest program of no more than one string. Thus, for all other strings, the
assertion is true.
Suppose for contradiction that there is a string whose complexity is provably greater
than k0 . Then consider the following program.
Footnote 5: The most well-known examples of independent statements are perhaps Euclid’s parallel pos-
tulate and the continuum hypothesis. In Euclidean geometry, the parallel postulate states that in
a plane, given a line and a point not on it, at most one line parallel to the given line can be drawn
through the point; it can be shown to be independent of the other four axioms of Euclid. The
continuum hypothesis states that there is no set whose cardinality is strictly between that of the
integers and the real numbers; it can be shown to be independent of ZFC.

p: Computably enumerate all the proofs of A0 until a proof is found of
“The string y is such that K(y) > k0 ”, for some y.
Return that y.

The program p can be made of length at most k0 , provided that k0 ≥ K(A0 , k0 ) + c.


Yet, p computes a string y whose shortest program has length > k0 . Contradiction.

For example, the complexity of ZFC might be 5000. Then, the overwhelming
majority of strings of length 6000 bits have a complexity larger than 5000, yet this
can be proven in ZFC for no such string.
Whereas Gödel’s incompleteness might appear to be an anomaly, Chaitin’s
incompleteness makes the phenomenon inevitable: it puts information-theoretic limits
on the power of formal axiomatic theories.
[Lecture AIT-20]
Comment on the lecture: Some solutions to Problem Sheet #3 are presented.
What happens if we enlarge the axioms? The complexity of the formal axiomatic
theory increases, but it doesn’t circumvent the problem.
Theorem 14 (Chaitin). There is a constant c such that a formal axiomatic the-
ory A0 with complexity K(A0 ) can never determine more than K(A0 ) + c bits of Ω.
Proof sketch: Otherwise, we can get a compression of Ω.
Chaitin wrote [12]:
“I believe that complexity reveals the depth of the incompleteness phe-
nomenon: For the Platonic world of mathematical ideas has infinite com-
plexity, but any FAT only has finite complexity. Therefore, any FAT can
only capture an infinitesimal part of the richness of pure mathematics.”
He also wrote [12]:
“I believe in creativity in mathematics, I believe that math isn’t a closed
system, it’s an open system. A fixed formal axiomatic system of the
kind that Hilbert wanted for all of math, yes, it would have given us
absolute certainty, but it would also have made mathematics into a frozen
cemetery.”
He advocates a “quasi-empirical” viewpoint on mathematics, where the complexity
of formal axiomatic theories can be increased “by adding new principles, new axioms,
that are not self-evident but that are justified by their fertile consequences” [12].

Further reading on the topic: [6, 7] [13, Chapter 8].


See also [the conversation I held with Bennett and Deutsch].

8 The Coding Theorem
[Lecture AIT-21]
Reminder:

P(x) = Σ_{p : U(p)=x} 2^{−|p|} = 2^{−|x∗|} + 2^{−|p1|} + 2^{−|p2|} + 2^{−|p3|} + · · · ,

where the first term 2^{−|x∗|} is 2^{−K(x)}.

Clearly P (x) > 2−K(x). But how much larger than 2−K(x) can P (x) be? For some
x of length 50, there might be 3 programs of length K(x) and 4 programs of length
K(x) + 1, so for that x,

P (x) ≥ 3 · 2−K(x) + 4 · 2−(K(x)+1) = 5 · 2−K(x) .

However, one might expect that as the length of the string increases, the multiplica-
tive factor relating P (x) and 2−K(x) could grow unboundedly. For instance, it would
suffice to have strings of increasing length whose number of programs of minimal
length grows unboundedly.
The coding theorem shows that this is not the case: it states that there exists a
constant c such that ∀x,

2−K(x) ≤ P (x) ≤ c 2−K(x) .

Equivalently, K(x) ≥ − log P (x) ≥ − log c + K(x).

Definition 31. The conditional universal probability is given by

P(x|y) = Σ_{p : U(p,y)=x} 2^{−|p|} .

Theorem 15 (The Coding Theorem). For all x and for all y

K(x|y) = − log P (x|y) + O(1) .

Proof. For a first take on the proof, let us set y = ǫ. The proof carries through if y
sits on the auxiliary tape.
We need to show that K(x) ≤ − log P (x) + O(1). We shall do this by ex-
hibiting a Turing machine T that computes each x from an input al of length
⌈− log P (x)⌉ + 1. Since U can simulate that machine, it provides a program of
length − log P (x) + O(1) to compute x, and thus the shortest program for x has a
length no greater.

T(al): Run all programs, dovetailed fashion.
    When the j-th program halts (in the dovetail order),
    add (pj, xj, Sxj, aj) to a list of quadruples, where
        pj is the program that halted,
        xj = U(pj) is its output,
        Sxj = Σ_{i ≤ j : xi = xj} 2^{−|pi|} is the current value
            of the lower semi-computation of P(xj), and
        aj is ∅ if the position of the leading 1
            in the binary expansion of Sxj has not changed,
            and if it has changed, aj is the address of
            the lexicographically first available node in
            the binary tree at depth ⌈− log Sxj⌉ + 1.
    Output the string that is allocated on the node al.

When aj ≠ ∅, the idea is to allocate the string xj to the node aj . To preserve
a prefix-free tree, when a node is allocated, all its descendant branches are removed
from the tree.
Note that the quantity ⌈− log α⌉ corresponds to the position of the first 1 in the
binary expansion of α. Indeed, let

α = 0.000…01 b_{i+1} b_{i+2} … = 2^{−i} + b_{i+1} 2^{−(i+1)} + b_{i+2} 2^{−(i+2)} + · · · ,

where the first 1 is at the i-th bit after the point. Then

α ∈ [2−i , 2−i+1 ) =⇒ − log α ∈ (i − 1, i] =⇒ ⌈− log α⌉ = i .


Example 14. Let x = 11011. See Figure 8.

   i    pi        xi       Sxi         ai
   1    1100      01       0.0001      00000
   2    00110     11011    0.00001     000010
   3    000       1        0.001       0001
   4    011101    11011    0.000011    ∅
   5    111       11011    0.001011    0010
   …    …         …        …           …

[To the right of this table, Figure 8 shows the binary prefix tree in which the nodes 00000, 000010, 0001 and 0010 are allocated to the strings 01, 11011, 1 and 11011, respectively.]

Figure 8: Suppose that T gives an enumeration of halting programs as in


the table on the left. The first halting program is of length 4, so its output,
01, is assigned to the lexicographically first available node of length 5. That
node is 00000. Then, the string 11011 is computed by a 5-bit program, so
11011 is assigned to the node 000010. The string 11011 is computed later
by a 6-bit program, but adding 2−6 to S11011 , which lower semi-computes
P (11011), does not in this case change the position of the leading 1, so no
node is assigned. But when a 3-bit program later computes also 11011,
the string is allocated to a node at depth 4.

Suppose that P (11011) = 0.00110111.... The lower semi-computation of S11011
would continue forever, but 11011 will not be assigned to more nodes, because the
leading 1 of S11011 has been stabilized.
The machine T can compute x via the massive computation that builds the tree
for all strings, together with the address 0010, which tells it to stop and output the
string allocated to that address. Here, this is x = 11011.
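The allocation rule of T is concrete enough to replay in code. The following Python sketch reproduces the table of Figure 8: the halting programs are simply supplied as data in dovetail order (a real implementation would discover them by simulating U), and exact fractions track each Sx.

from fractions import Fraction
from itertools import product

def leading_one_position(s: Fraction) -> int:
    """Position of the leading 1 in the binary expansion of s in (0,1), i.e. ceil(-log2 s)."""
    i = 1
    while Fraction(1, 2 ** i) > s:
        i += 1
    return i

def first_available_node(depth: int, allocated) -> str:
    """Lexicographically first depth-`depth` node that neither extends nor is extended by an allocated node."""
    for bits in product("01", repeat=depth):
        node = "".join(bits)
        if not any(node.startswith(a) or a.startswith(node) for a in allocated):
            return node
    raise RuntimeError("tree exhausted (cannot happen: the total weight stays below 1)")

# (program, output) pairs in dovetail order, as in the table of Figure 8
halting = [("1100", "01"), ("00110", "11011"), ("000", "1"),
           ("011101", "11011"), ("111", "11011")]

S, allocated, codewords = {}, [], {}
for p, x in halting:
    old = leading_one_position(S[x]) if x in S else None
    S[x] = S.get(x, Fraction(0)) + Fraction(1, 2 ** len(p))
    new = leading_one_position(S[x])
    if new != old:                                   # leading 1 moved: allocate a new node
        node = first_available_node(new + 1, allocated)
        allocated.append(node)
        codewords[x] = node                          # the latest (shortest) node for x
        print(p, x, node)
# Prints the allocations 00000, 000010, 0001 and 0010 (nothing for the fourth program),
# exactly as in Figure 8.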
In general, because Sx lower semi-computes P (x), there will eventually be a
program pl that computes x (x = xl) and whose contribution to Sx brings the leading 1
in the binary expansion of Sx to the same position as the leading 1 in the binary
expansion of P(x). This will give a first (and only) node at
depth ⌈− log Sxl ⌉ + 1 = ⌈− log P (x)⌉ + 1 to which the string xl = x will be allocated.
The address of this node is al , and |al | = ⌈− log P (x)⌉ + 1.
We should verify that this way of allocating strings to nodes in the binary tree
will never run out of space.

Σ_{j : aj≠∅} 2^{−|aj|} = Σ_x Σ_{j : aj≠∅ and xj=x} 2^{−|aj|}
                       < Σ_x ( 2^{−(⌈− log P(x)⌉+1)} + 2^{−(⌈− log P(x)⌉+2)} + 2^{−(⌈− log P(x)⌉+3)} + · · · )
                       = Σ_x 2^{⌊log P(x)⌋} 2^{−1} ( 1 + 1/2 + 1/4 + · · · )
                       ≤ Σ_x P(x)
                       ≤ 1 .

The third line uses the fact that −⌈−β⌉ = ⌊β⌋.


By Kraft’s inequality, the |aj|’s can be the lengths of a prefix-free set of codewords,
and the aj’s are indeed such codewords.
Thus, a program for U to produce x is īal , where T ≡ Ti . It has length
− log P (x) + O(1).
[Lecture AIT-22]
A subtlety: the domain of definition of T above is a prefix-free set of strings,
fine, but is T a self-delimiting Turing machine? The concern matters because U
simulates self-delimiting Turing machines, so we need T to be replaceable by one.

Proposition 14. Let y be some auxiliary string. Let Tprefix be a Turing machine
whose domain {a : Tprefix (a, y) ց} is a prefix-free set. Then there exists a self-
delimiting Turing machine Ts-d s.t. Tprefix (·, y) ≡ Ts-d (·, y).

Proof. The idea is that Ts-d should read another square of its program tape only
when it is necessary. Suppose Ts-d has the string y on its work/auxiliary tape
(no self-delimitation constraints are required on the auxiliary information). Ts-d
generates the computably enumerable set S = {s : Tprefix (s, y) ց} on its work tape.
As it generates S, Ts-d continually checks whether or not the part p of the program
that it has already read (initially p = ǫ) is a prefix of some known element s of S.
Whenever Ts-d finds that p is a prefix of an s ∈ S, it does the following. If p is
a proper prefix of s (i.e. if s = pw, for w 6= ǫ), Ts-d reads another square of the
program tape. This is because it knows that p cannot be in the domain, otherwise
the domain would not be prefix-free. And if p = s, Ts-d computes and outputs
Tprefix (p, y) on the output tape.

8.1 Lower Semi-computable Semi-measures


Definition 32. A discrete semi-measure is a map µ : N → R such that

Σ_{x=1}^{∞} µ(x) ≤ 1 .

Definition 33. The semi-measure is lower semi-computable if there exists a partial


computable function µ̃(x, t) such that µ̃(x, t) ≤ µ̃(x, t+1) and limt→∞ µ̃(x, t) = µ(x).
Definition 34. The semi-measure m is universal with respect to a class C of semi-
measures if it is in that class and it is “larger than any other”, namely, if there exists
a universal constant c such that for all µ ∈ C,

µ(x) ≤ c m(x) ∀x .

Definition 35. Let T be a self-delimiting Turing machine. Define

PT(x) = Σ_{p : T(p)=x} 2^{−|p|} .

Proposition 15. Let µ be a lower semi-computable semi-measure. There exists a


self-delimiting Turing machine T such that µ(x) = PT (x) for all x.
Proof. The machine T calls the lower semi-computation of µ, with initialized
value µ̃(x, 0) = 0. For t = 1, 2, . . ., run µ̃(x, t) for all x ≤ t. When some µ̃(x, t)
is produced with µ̃(x, t) − µ̃(x, t − 1) = ℓ > 0 (an update on the lower bound of
µ(x)), assign to x the subset of [0, 1] which is the next available interval of length ℓ
(starting from 0).
Let Sx be the union of all the intervals assigned to x. Note that the total length of
the intervals that compose Sx converges to µ(x), and since µ is a semimeasure, [0, 1]
is long enough to accommodate all {Sy }y .

From this construction of Sx we define programs for T which produce x. The
set Sx can be covered from within by cylinders Γp = [0.p, 0.p+2−|p|), namely, Γp ⊆ Sx
and ∪p Γp = Sx . The p’s are T ’s programs for x.
Remark 4. This proof provides another equivalent “probabilistic” interpretation of
the input tape. One can view the input tape as already containing infinitely many
bits (instead of having the monkey supply them one by one). All infinite strings s which
would yield a computation of x by T are of the form s = ps′ with T (p) = x. The
Lebesgue measure of all such s is PT (x).
Proposition 16. P (x) is a universal semi-measure for the class of lower semi-
computable semi-measures.
Proof. By the last Proposition, it suffices to show that P(x) = PU(x) dominates all
PT(x). By the universality of U, there exists r such that U(rq) = T(q) for all q. Thus

PT(x) = 2^{|r|} Σ_{q : T(q)=x} 2^{−|r|} 2^{−|q|}
      = 2^{|r|} Σ_{q : U(rq)=x} 2^{−|rq|}
      ≤ 2^{|r|} Σ_{p : U(p)=x} 2^{−|p|}
      = 2^{|r|} P(x) .

Corollary 5. By the coding theorem, 2−K(x) multiplicatively dominates P (x), and


therefore, 2−K(x) is also a universal lower semi-computable semi-measure.
Example 15. The mapping

y ↦ Σ_x 2^{−K(x,y)}

is a lower semi-computable semi-measure. Indeed, it is lower semi-computable: on
input y, for i = 1, 2, . . . run all p of length ≤ i for i steps. When some p halts with hx, yi
for some x, add 2^{−|p|} to the sum. When some p′ halts later with hx, yi (the same x as
before), add 2^{−|p′|} − 2^{−|p|} if p′ is shorter than p; otherwise, don’t add anything.

It is a semi-measure because, by Kraft’s inequality,

Σ_y Σ_x 2^{−K(x,y)} ≤ 1 .

Since 2^{−K(y)} is a universal lower semi-computable semi-measure (Corollary 5), there is
a c such that ∀y, Σ_x 2^{−K(x,y)} ≤ c 2^{−K(y)}, or equivalently,

( Σ_x 2^{−K(x,y)} ) / 2^{−K(y)} ≤ c .        (4)
Proposition 17 (Chain rule, hard side). For all x and y,

K(y) + K(x|y ∗ ) ≤ K(x, y) + O(1) .

Proof. The statement to prove is that there exists a c such that

K(y) + K(x|y∗) ≤ K(x, y) + c
⇐⇒   −K(x, y) + K(y) ≤ −K(x|y∗) + c
⇐⇒   2^{−K(x,y)} / 2^{−K(y)} ≤ c′ 2^{−K(x|y∗)} ,   where c′ = 2^c .

This will be obtained if we show that x ↦ 2^{−K(x,y)} / 2^{−K(y)} is a lower semi-computable
semi-measure given y∗ .

1. It is lower semi-computable. Since we have y ∗ , we have K(y) = |y ∗|, so we do


not have to worry about the denominator. The string y can be obtained from
y ∗, and since x is given, we need to show that 2−K(x,y) is lower semi-computable
given x, y.
Run all p in a dovetailed fashion; when some p is found for which U(p) = hx, yi,
update the lower bound to 2−|p| . The smaller the length of the p, the higher
the lower bound. Eventually, the witness of K(x, y) is found.

2. Up to a renormalization which introduces a multiplicative constant, it is a semi-


measure. See Equation (4) of Example 15.

Analogously with the probabilistic setting (§3.3), the O(1)-chain rule entails a
symmetric notion of algorithmic mutual information,
I(x; y) = K(x) − K(x|y)                    (by definition)
        = K(x) − K(x, y) + K(y) + O(1)     (chain rule)
        = K(y) − K(y|x) + O(1)             (chain rule)
        = I(y; x) + O(1)                   (by definition).

All notions for Problem Sheet #4 have been covered. See §A.4.

A Problem Sheets
A.1 Problem Sheet #1

1. We identified strings with natural numbers via {(ǫ, 1), (0, 2), (1, 3), (00, 4) . . .},
and we defined x̄ = 1|x| 0x and x′ = |x|x. Find y, z and t if

ȳz ′ t = 111110001011110000110100101111100100101 .

2. It was once asked (on 23.02.23, in D1.14) whether there are countably many
or uncountably many prefix codes of a countable alphabet. Answer that question.
Hint: subsets of {0, 10, 110, . . .}.

3. A conjecture was once made (on 23.02.23, in D1.14) that a set of strings is both
prefix-free and suffix-free if and only if the strings have constant length. Disprove
the conjecture by exhibiting an infinite prefix- and suffix-free set of binary strings.

4. Does there exist a prefix code Φ : {0, 1}∗\{ǫ} → {0, 1}∗ such that

|Φ(x)| = |x| + ⌈log |x|⌉ + 1 ∀x ?

Prove your claim.

5. Shannon-Fano code. Let X be a random variable over the alphabet X . Explain


how to construct a prefix code of X with expected code-word length < H(X) + 1.

6. Prove the chain rule for entropy: H(X, Y ) = H(X) + H(Y |X) .

A.2 Problem Sheet #2

We defined an enumerator of a set A ⊆ N as a Turing machine that enumerates all


(and only) elements of A without order 6 and possibly with repetitions.

1. A non-repeating enumerator of a set A is a Turing machine that enumerates all


(and only) elements of A without order and without repetitions. Complete the follow-
ing statement with either “computable” or “computably enumerable,” and prove it.

A set A can be enumerated by a non-repeating enumerator


if and only if it is .

2. An ordered enumerator of a set A is a Turing machine that enumerates all (and


only) elements of A in order and possibly with repetitions. Complete the following
statement with either “computable” or “computably enumerable,” and prove it.

An infinite set A can be enumerated by an ordered enumerator


if and only if it is .

3. Prove that the halting set K0 = {hi, ji | ϕi(j) ց} is computably enumerable.

4. We saw that the set of indices i such that the range of ϕi is nonempty is com-
putably enumerable. It was asked (on 28.03.23) if the set is also computable. Answer
that question and provide a brief justification.

5. Prove that a set A ⊆ N is computable if and only if both A and A = N \ A are


computably enumerable.
Comment. The class of all computable sets7 is denoted R, the class of all com-
putably enumerable sets is denoted RE, and the class of all sets whose complement
is computably enumerable is denoted coRE. Question 5 shows that

R = RE ∩ coRE .

Footnote 6: The order is given by N.
Footnote 7: In complexity theory, a subset of {0, 1}∗ ≃ N is also called a language.

A.3 Problem Sheet #3

Note 1. One of these questions is a bonus: 6 correct solutions yield 60/60, 7 correct
solutions yield 70/60.
Note 2. If some proof involves a program in the setting of prefix complexity, explain
what makes the program self-delimited.
1. Let 0^n = 00…0 (n times). Show that C(0^n) = C(n) + O(1).

2. Show that for all x ∈ {0, 1}n, K(x) = C(x) + O(log n).

3. Prove the easy side of the chain rule for prefix complexity: For all x and y,

K(x, y) ≤ K(x) + K(y|x∗) + O(1) .

4. Show that there exist infinitely many strings such that

K(x) > log x + log log x + log log log x .


Hint 1: Σ_x 2^{−K(x)}.
Hint 2: (d/dx) log log log x = 1/(log log x · log x · x).

5. Let A be a finite set whose cardinality is denoted |A|. Find an appropriate


definition of K(A), and use your definition to show that for all x ∈ A,

K(x) ≤ K(A) + log |A| + O(1) .

6. Ω’s little cousin. Let ωn be the cardinality of the following set

{p : U(p) ց and |p| ≤ n} .

Show that K(ωn , n) > n + O(1) .

7. A function is upper semi-computable if there exists a never-ending algorithm


which finds ever smaller upper bounds to the graph of the function, and its tentative
upper bounds converge to the graph of the function in the limit of infinite time. Show
that the function

K:N → N
x 7→ K(x)

is upper semi-computable.

A.4 Problem Sheet #4

Notes. Question 3 is only for fun. No points, no bonus. Just fun. Question 6 is
worth as many points as one chooses to solve the easier or the harder version.

1. The incompressibility method. Use the existence of incompressible strings to show


that there exist infinitely many primes.

2. Let x be an n bit string with exactly as many 0’s as 1’s. Using either the plain
or the prefix setting, show that x is compressible given n.
Stirling’s formula: n! = n^n e^{−n} √(2πn) (1 + O(1/n)).

3. Only for fun. “But how can the result of question 2 be possible? By the law of
large numbers, in a probabilistic random string, we should expect as many 0’s as
1’s.” Yes, but not exactly as many, for this is a pattern which entails compression. In
fact, the central limit theorem states that we should expect variations of about √n.
Easier. Show that the analysis performed in question 2 fails if the number of 1’s is
n/2 + k, where k = O(√n) and k is an incompressible number given n.
Harder. Show that x is incompressible given n only if the number of 1’s in x is n/2 + k,
where k = O(√n) and k is an incompressible number given n.

4. A program is elegant if it is the shortest program that computes some string.


Show that a formal axiomatic theory A cannot prove the elegance of programs of
length significantly larger than K(A).

Note. That’s another way to incompleteness: Only a finite number of programs can
be proven elegant, yet there are infinitely many elegant programs.

5. a) Write down a function that grows extremely fast. The faster it grows, the
better. Don’t read further before having written down a function.
b) Read online about the Ackermann function. Does it grow faster than yours?
c) The Busy Beaver function is defined as
B(n) = max{Rt(p) : U(p) ց & |p| ≤ n} ,

where Rt(p) denotes the run time of p, namely, the number of computational steps
before U(p) halts. Show that B(n) grows faster than any computable function,
namely, show that for all computable f , there exists n0 such that for all n ≥ n0 ,
B(n) > f (n).
Note. Assuming that your function was computable, the Busy Beaver beats both
your and Ackermann’s functions.
d) Show that Ω[n] permits the computation of B(n).

6. Lower semicomputable semimeasures. Let

ωn′ = |{p : U(p) ց and |p| = n}|   and   ωn = |{p : U(p) ց and |p| ≤ n}| .

Answer only one version of the following questions.


Easier. Show that ωn′ ≤ 2n−K(n)+O(1) .
Harder. Show that ωn ≤ 2n−K(n)+O(1) .

7. Write down a question that you have regarding any topic related to AIT. If you
would like to comment on why you find your question intriguing, please do it.
If you have ideas on how to solve your question, please elaborate.

B Proposed Articles for the Seminar
Theoretical, Foundational
1. Gregory J Chaitin. Incompleteness theorems for random reals. Advances in
Applied Mathematics, 8(2):119–146, 1987.
Abstract: We obtain some dramatic results using statistical mechanics-thermodynamics kinds of
arguments concerning randomness, chaos, unpredictability, and uncertainty in mathematics. We
construct an equation involving only whole numbers and addition, multiplication, and exponenti-
ation, with the property that if one varies a parameter and asks whether the number of solutions
is finite or infinite, the answer to this question is indistinguishable from the result of independent
tosses of a fair coin. This yields a number of powerful Gödel incompleteness-type results concerning
the limitations of the axiomatic method, in which entropy-information measures are used.

2. Markus Müller. Stationary algorithmic probability. Theoretical Computer


Science, 411(1):113–130, 2010.
Abstract : Kolmogorov complexity and algorithmic probability are defined only up to an additive
resp. multiplicative constant, since their actual values depend on the choice of the universal reference
computer. In this paper, we analyze a natural approach to eliminate this machine-dependence. Our
method is to assign algorithmic probabilities to the different computers themselves, based on the idea
that “unnatural” computers should be hard to emulate. Therefore, we study the Markov process of
universal computers randomly emulating each other. The corresponding stationary distribution, if it
existed, would give a natural and machine-independent probability measure on the computers, and
also on the binary strings. Unfortunately, we show that no stationary distribution exists on the set
of all computers; thus, this method cannot eliminate machine-dependence. Moreover, we show that
the reason for failure has a clear and interesting physical interpretation, suggesting that every other
conceivable attempt to get rid of those additive constants must fail in principle, too. However, we
show that restricting to some subclass of computers might help to get rid of some amount of machine-
dependence in some situations, and the resulting stationary computer and string probabilities have
beautiful properties.

Algorithmic statistics
3. Nikolay Vereshchagin and Alexander Shen. Algorithmic statistics: Forty years
later. In Computability and Complexity, pages 669–737. Springer, 2017.
Abstract: Algorithmic statistics has two different (and almost orthogonal) motivations. From the
philosophical point of view, it tries to formalize how the statistics works and why some statistical
models are better than others. After this notion of a "good model" is introduced, a natural question
arises: it is possible that for some piece of data there is no good model? If yes, how often these
bad (“non-stochastic”) data appear "in real life"? Another, more technical motivation comes from
algorithmic information theory. In this theory a notion of complexity of a finite object (=amount
of information in this object) is introduced; it assigns to every object some number, called its
algorithmic complexity (or Kolmogorov complexity). Algorithmic statistic provides a more fine-
grained classification: for each finite object some curve is defined that characterizes its behavior.
It turns out that several different definitions give (approximately) the same curve. In this survey
we try to provide an exposition of the main results in the field (including full proofs for the most
important ones), as well as some historical comments. We assume that the reader is familiar with
the main notions of algorithmic information (Kolmogorov complexity) theory.

4. Paul MB Vitányi and Ming Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 46(2):446–464, 2000.

Abstract: The relationship between the Bayesian approach and the minimum description length ap-
proach is established. We sharpen and clarify the general modeling principles minimum description
length (MDL) and minimum message length (MML), abstracted as the ideal MDL principle and
defined from Bayes’s rule by means of Kolmogorov complexity. The basic condition under which
the ideal principle should be applied is encapsulated as the fundamental inequality, which in broad
terms states that the principle is valid when the data are random, relative to every contemplated hy-
pothesis and also these hypotheses are random relative to the (universal) prior. The ideal principle
states that the prior probability associated with the hypothesis should be given by the algorithmic
universal probability, and the sum of the log universal probability of the model plus the log of the
probability of the data given the model should be minimized. If we restrict the model class to finite
sets then application of the ideal principle turns into Kolmogorov’s minimal sufficient statistic. In
general, we show that data compression is almost always the best strategy, both in model selection
and prediction.

Induction
5. Samuel Rathmanner and Marcus Hutter. A philosophical treatise of universal
induction. Entropy, 13(6):1076–1136, 2011.
Abstract: Understanding inductive reasoning is a problem that has engaged mankind for thousands
of years. This problem is relevant to a wide range of fields and is integral to the philosophy of
science. It has been tackled by many great minds ranging from philosophers to scientists to math-
ematicians, and more recently computer scientists. In this article we argue the case for Solomonoff
Induction, a formal inductive framework which combines algorithmic information theory with the
Bayesian framework. Although it achieves excellent theoretical results and is based on solid philo-
sophical foundations, the requisite technical knowledge necessary for understanding this framework
has caused it to remain largely unknown and unappreciated in the wider scientific community. The
main contribution of this article is to convey Solomonoff induction and its related concepts in a
generally accessible form with the aim of bridging this current technical gap. In the process we ex-
amine the major historical contributions that have led to the formulation of Solomonoff Induction as
well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff
induction addresses many issues that have plagued other inductive systems, such as the black ravens
paradox and the confirmation problem, and compare this approach with other recent approaches.

6. Marcus Hutter. A theory of universal artificial intelligence based on algorithmic


complexity. arXiv preprint cs/0004001, 2000.
Abstract: Decision theory formally solves the problem of rational agents in uncertain worlds if
the true environmental prior probability distribution is known. Solomonoff’s theory of universal
induction formally solves the problem of sequence prediction for unknown prior distribution. We
combine both ideas and get a parameterless theory of universal Artificial Intelligence. We give
strong arguments that the resulting AIξ model is the most intelligent unbiased agent possible. We
outline for a number of problem classes, including sequence prediction, strategic games, function
minimization, reinforcement and supervised learning, how the AIξ model can formally solve them.
The major drawback of the AIξ model is that it is uncomputable. To overcome this problem, we
construct a modified algorithm AIξtl, which is still effectively more intelligent than any other time
t and space l bounded agent. The computation time of AIξtl is of the order t · 2l . Other discussed
topics are formal definitions of intelligence order relations, the horizon problem and relations of the
AIξ theory to other AI approaches.

Nature and Computation


7. Charles H Bennett. Logical depth and physical complexity. The Universal
Turing Machine A Half-Century Survey, pages 227–257, 1988.
Abstract: Some mathematical and natural objects (a random sequence, a sequence of zeros, a perfect
crystal, a gas) are intuitively trivial, while others (e.g. the human body, the digits of π) contain
internal evidence of a nontrivial causal history. We formalize this distinction by defining an object’s

“logical depth” as the time required by a standard universal Turing machine to generate it from an
input that is algorithmically random (i.e. Martin-Löf random). This definition of depth is shown to
be reasonably machine-independent, as well as obeying a slow-growth law: deep objects cannot be
quickly produced from shallow ones by any deterministic process, nor with much probability by a
probabilistic process, but can be pro- duced slowly. Next we apply depth to the physical problem of
“self-organization,” inquiring in particular under what conditions (e.g. noise, irreversibility, spatial
and other symmetries of the initial conditions and equations of motion) statistical-mechanical model
systems can imitate computers well enough to undergo unbounded increase of depth in the limit of
infinite space and time.

8. Jürgen Schmidhuber. The fastest way of computing all universes. In A com-


putable universe: Understanding and exploring nature as computation, pages 381–
398. World Scientific, 2013.
Abstract: Is there a short and fast program that can compute the precise history of our universe,
including all seemingly random but possibly actually deterministic and pseudo-random quantum
fluctuations? There is no physical evidence against this possibility. So let us start searching! We
already know a short program that computes all constructively computable uni- verses in parallel,
each in the asymptotically fastest way. Assuming ours is computed by this optimal method, we can
predict that it is among the fastest compatible with our existence. This yields testable predictions.
Note: This paper extends an overview of previous work presented in a survey for the German edition
of Scientific American.

Thermodynamics
9. Paul Vitányi and Ming Li. Algorithmic arguments in physics of computation. In
Algorithms and Data Structures: 4th International Workshop, WADS’95 Kingston,
Canada, August 16–18, 1995 Proceedings 4, pages 315–333. Springer, 1995.
Abstract : We show the usefulness of incompressibility arguments based on Kolmogorov complexity
in physics of computation by several examples. These include analysis of energy parsimonious
’adiabatic’ computation, and scalability of network architectures.

10. Wojciech H Zurek. Algorithmic randomness and physical entropy. Physical


Review A, 40(8):4731, 1989.
Abstract : Algorithmic randomness provides a rigorous, entropylike measure of disorder of an indi-
vidual, microscopic, definite state of a physical system. It is defined by the size (in binary digits) of
the shortest message specifying the microstate uniquely up to the assumed resolution. Equivalently,
algorithmic randomness can be expressed as the number of bits in the smallest program for a
universal computer that can reproduce the state in question (for instance, by plotting it with the
assumed accuracy). In contrast to the traditional definitions of entropy, algorithmic randomness can
be used to measure disorder without any recourse to probabilities. Algorithmic randomness is typi-
cally very difficult to calculate exactly but relatively easy to estimate. In large systems, probabilistic
ensemble definitions of entropy (e.g., coarse-grained entropy of Gibbs and Boltzmann’s entropy
H = ln W, as well as Shannon’s information-theoretic entropy) provide accurate estimates of the
algorithmic entropy of an individual system or its average value for an ensemble. One is thus able
to rederive much of thermodynamics and statistical mechanics in a setting very different from the
usual. Physical entropy, I suggest, is a sum of (i) the missing information measured by Shannon’s
formula and (ii) of the algorithmic information content —algorithmic randomness —present in the
available data about the system. This definition of entropy is essential in describing the operation
of thermodynamic engines from the viewpoint of information gathering and using systems. These
Maxwell demon-type entities are capable of acquiring and processing information and therefore can
"decide" on the basis of the results of their measurements and computations the best strategy for
extracting energy from their surroundings. From their internal point of view the outcome of each
measurement is definite. The limits on the thermodynamic efficiency arise not from the ensemble
considerations, but rather reflect basic laws of computation. Thus inclusion of algorithmic random-
ness in the definition of physical entropy allows one to formulate thermodynamics from the Maxwell
demon’s point of view.

11. Ämin Baumeler and Stefan Wolf. Causality–Complexity–Consistency: Can
space-time be based on logic and computation? In Time in Physics, pages 69–101.
Springer, 2017.
Abstract: The difficulty of explaining non-local correlations in a fixed causal structure sheds new
light on the old debate on whether space and time are to be seen as fundamental. Refraining from
assuming space-time as given a priori has a number of consequences. First, the usual definitions
of randomness depend on a causal structure and turn meaningless. So motivated, we propose an
intrinsic, physically motivated measure for the randomness of a string of bits: its length minus its
normalized work value, a quantity we closely relate to its Kolmogorov complexity (the length of the
shortest program making a universal Turing machine output this string). We test this alternative
concept of randomness for the example of non-local correlations, and we end up with a reasoning that
leads to similar conclusions as in, but is conceptually more direct than, the probabilistic view since
only the outcomes of measurements that can actually all be carried out together are put into relation
to each other. In the same context-free spirit, we connect the logical reversibility of an evolution to
the second law of thermodynamics and the arrow of time. Refining this, we end up with a speculation
on the emergence of a space-time structure on bit strings in terms of data-compressibility relations.
Finally, we show that logical consistency, by which we replace the abandoned causality, it strictly
weaker a constraint than the latter in the multi-party case.

Quantum Algorithmic Complexity


12. Paul MB Vitányi. Quantum Kolmogorov complexity based on classical descrip-
tions. IEEE Transactions on Information Theory, 47(6):2464–2479, 2001.
Abstract: We develop a theory of the algorithmic information in bits contained in an individual pure
quantum state. This extends classical Kolmogorov complexity to the quantum domain retaining
classical descriptions. Quantum Kolmogorov complexity coincides with the classical Kolmogorov
complexity on the classical domain. Quantum Kolmogorov complexity is upper-bounded and can
be effectively approximated from above under certain conditions. With high probability, a quantum
object is incompressible. Upper and lower bounds of the quantum complexity of multiple copies of
individual pure quantum states are derived and may shed some light on the no-cloning properties
of quantum states. In the quantum situation, complexity is not subadditive. We discuss some
relations with “no-cloning” and “approximate cloning” properties.

13. Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, and
Arleta Szkoła. Entropy and quantum Kolmogorov complexity: A quantum Brudno’s
theorem. Communications in Mathematical Physics, 265:437–461, 2006.
Abstract: In classical information theory, entropy rate and algorithmic complexity per symbol are
related by a theorem of Brudno. In this paper, we prove a quantum version of this theorem,
connecting the von Neumann entropy rate and two notions of quantum Kolmogorov complexity,
both based on the shortest qubit descriptions of qubit strings that, run by a universal quantum
Turing machine, reproduce them as outputs.

Information Measures
14. Charles H Bennett, Péter Gács, Ming Li, Paul MB Vitányi, and
Wojciech H Zurek. Information distance. IEEE Transactions on Information Theory,
44(4):1407–1423, 1998.
Abstract: While Kolmogorov complexity is the accepted absolute measure of information content
in an individual finite object, a similarly absolute notion is needed for the information distance
between two individual objects, for example, two pictures. We give several natural definitions of a
universal information metric, based on length of shortest programs for either ordinary computations
or reversible (dissipationless) computations. It turns out that these definitions are equivalent up
to an additive logarithmic term. We show that the information distance is a universal cognitive
similarity distance. We investigate the maximal correlation of the shortest programs involved, the
maximal uncorrelation of programs (a generalization of the Slepian–Wolf theorem of classical infor-
mation theory), and the density properties of the discrete metric spaces induced by the information
distances. A related distance measures the amount of nonreversibility of a computation. Using
the physical theory of reversible computation, we give an appropriate (universal, antisymmetric,
and transitive) measure of the thermodynamic work required to transform one object in another
object by the most efficient process. Information distance between individual objects is needed in
pattern recognition where one wants to express effective notions of “pattern similarity” or “cognitive
similarity” between individual objects and in thermodynamics of computation where one wants to
analyze the energy dissipation of a computation from a particular input to a particular output.
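
As a reminder of the central quantity (a gloss, not part of the abstract): the main distance studied in this paper is, up to additive logarithmic terms, E(x, y) = max{K(x|y), K(y|x)}, the length of a shortest program that converts x into y and y into x. Dividing by max{K(x), K(y)} gives the normalized information distance whose compression-based approximation underlies the clustering method of item 15 below.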

15. Rudi Cilibrasi and Paul MB Vitányi. Clustering by compression. IEEE
Transactions on Information Theory, 51(4):1523–1545, 2005.
Abstract: We present a new method for clustering based on compression. The method does not
use subject-specific features or background knowledge, and works as follows: First, we determine a
parameter-free, universal, similarity distance, the normalized compression distance or NCD, com-
puted from the lengths of compressed data files (singly and in pairwise concatenation). Second, we
apply a hierarchical clustering method. The NCD is not restricted to a specific application area,
and works across application area boundaries. A theoretical precursor, the normalized information
distance, co-developed by one of the authors, is provably optimal. However, the optimality comes
at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms
to capture the real-world setting, and show that the NCD approximates optimality. To extract a
hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new
quartet method and a fast heuristic to implement it. The method is implemented and available
as public software, and is robust under choice of different compressors. To substantiate our claims
of universality and robustness, we report evidence of successful application in areas as diverse as
genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of
objects from completely different domains, using statistical, dictionary, and block sorting compres-
sors. In genomics, we presented new evidence for major questions in Mammalian evolution, based
on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis
against the Theria hypothesis.
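
As a concrete illustration, the NCD is simple enough to sketch in a few lines of Python; zlib serves here as a stand-in for the stronger off-the-shelf compressors the authors evaluate, so this is a toy approximation rather than their implementation.

    import zlib

    def clen(b: bytes) -> int:
        # Compressed length in bytes: a computable upper-bound proxy for Kolmogorov complexity.
        return len(zlib.compress(b, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
        cx, cy = clen(x), clen(y)
        return (clen(x + y) - min(cx, cy)) / max(cx, cy)

Values near 0 indicate that one object contains essentially all the information of the other; values near 1 indicate no compressible structure in common.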

Applications
Anomaly Detection
16. Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana, Li Wei, Sang-
Hee Lee, and John Handley. Compression-based data mining of sequential data.
Data Mining and Knowledge Discovery, 14:99–129, 2007.
Abstract: The vast majority of data mining algorithms require the setting of many input parameters.
The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may
cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem
is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate
the significance of the reported patterns. This is especially likely when the user fails to understand
the role of parameters in the data mining process. Data mining algorithms should have as few pa-
rameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices,
expectations, and presumptions on the problem at hand, and would let the data itself speak to us.
In this work, we show that recent results in bioinformatics, learning, and computational theory hold
great promise for a parameter-light data-mining paradigm. The results are strongly connected to
Kolmogorov complexity theory. However, as a practical matter, they can be implemented using
any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will
show that this approach is competitive or superior to many of the state-of-the-art approaches in
anomaly/interestingness detection, classification, and clustering with empirical tests on time se-
ries/DNA/text/XML/video datasets. As further evidence of the advantages of our method,
we will demonstrate its effectiveness to solve a real world classification problem in recommending
printing services and products.
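
To make the "dozen lines of code" claim concrete, here is one way such a compression-based dissimilarity can be instantiated; this sketch assumes zlib as the off-the-shelf compressor and follows the C(xy)/(C(x)+C(y)) pattern used in this line of work, rather than reproducing the paper's exact code.

    import zlib

    def cdm(x: bytes, y: bytes) -> float:
        # Compression-based dissimilarity: close to 1 when x and y share no exploitable
        # regularities, and noticeably smaller when one helps compress the other.
        c = lambda b: len(zlib.compress(b, 9))
        return c(x + y) / (c(x) + c(y))

    def most_anomalous_window(series: bytes, w: int) -> int:
        # Start index of the length-w window most dissimilar to the rest of the series
        # (a naive quadratic sketch of compression-based anomaly detection).
        starts = range(0, len(series) - w + 1, w)
        return max(starts, key=lambda i: cdm(series[:i] + series[i + w:], series[i:i + w]))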

Biology
17. Rudi L Cilibrasi and Paul MB Vitányi. Fast whole-genome phylogeny of the
COVID-19 virus SARS-CoV-2 by compression. bioRxiv, pages 2020–07, 2020.
Abstract: We analyze the phylogeny and taxonomy of the SARS-CoV-2 virus using compression.
This is a new alignment-free method called the “normalized compression distance” (NCD) method.
It discovers all effective similarities based on Kolmogorov complexity. The latter being incomputable
we approximate it by a good compressor such as the modern zpaq. The results comprise that the
SARS-CoV-2 virus is closest to the RaTG13 virus and similar to two bat SARS-like coronaviruses
bat-SL-CoVZXC21 and bat-SL-CoVZC4. The similarity is quantified and compared with the same
quantified similarities among the mtDNA of certain species. We treat the question whether Pangolins
are involved in the SARS-CoV-2 virus.
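
A minimal sketch of the pipeline described here, with zlib standing in for the far stronger zpaq compressor the authors rely on, and scipy's average-linkage routine standing in for the tree-building method the authors use; it only illustrates the shape of the computation.

    import zlib
    from itertools import combinations
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda b: len(zlib.compress(b, 9))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    def compression_phylogeny(genomes: dict[str, bytes]):
        # Pairwise NCD matrix over the genomes, then hierarchical clustering;
        # the returned linkage matrix encodes the dendrogram (phylogenetic tree).
        names = sorted(genomes)
        n = len(names)
        d = [[0.0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            d[i][j] = d[j][i] = ncd(genomes[names[i]], genomes[names[j]])
        return names, linkage(squareform(d), method="average")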

Neurosciences
18. J Szczepański, José M Amigó, E Wajnryb, and MV Sanchez-Vives. Application
of Lempel–Ziv complexity to the analysis of neural discharges. Network: Computation
in Neural Systems, 14(2):335, 2003.
Abstract: Pattern matching is a simple method for studying the properties of information sources
based on individual sequences (Wyner et al 1998 IEEE Trans. Inf. Theory 44 2045–56). In partic-
ular, the normalized Lempel–Ziv complexity (Lempel and Ziv 1976 IEEE Trans. Inf. Theory 22
75–88), which measures the rate of generation of new patterns along a sequence, is closely related
to such important source properties as entropy and information compression ratio. We make use of
this concept to characterize the responses of neurons of the primary visual cortex to different kinds
of stimulus, including visual stimulation (sinusoidal drifting gratings) and intracellular current in-
jections (sinusoidal and random currents), under two conditions (in vivo and in vitro preparations).
Specifically, we digitize the neuronal discharges with several encoding techniques and employ the
complexity curves of the resulting discrete signals as fingerprints of the stimuli ensembles. Our
results show, for example, that if the neural discharges are encoded with a particular one-parameter
method (‘interspike time coding’), the normalized complexity remains constant within some classes
of stimuli for a wide range of the parameter. Such constant values of the normalized complexity
allow then the differentiation of the stimuli classes. With other encodings (e.g. ‘bin coding’), the
whole complexity curve is needed to achieve this goal. In any case, it turns out that the normalized
complexity of the neural discharges in vivo are higher (and hence carry more information in the
sense of Shannon) than in vitro for the same kind of stimulus.
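
The complexity measure itself is easy to approximate. The sketch below counts phrases in an LZ78-style incremental parsing and normalizes by the rough asymptotic maximum n/log2(n); the paper uses the closely related 1976 Lempel–Ziv production complexity, which parses differently in detail but behaves similarly: high for random spike trains, low for regular ones.

    import math

    def lz_phrase_count(s: str) -> int:
        # Number of phrases in an LZ78-style incremental parsing of s.
        phrases, w = set(), ""
        for c in s:
            w += c
            if w not in phrases:
                phrases.add(w)
                w = ""
        return len(phrases) + (1 if w else 0)

    def normalized_lz(s: str) -> float:
        # Phrase count divided by its rough asymptotic maximum n / log2(n) for binary strings.
        n = len(s)
        return lz_phrase_count(s) * math.log2(n) / n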

Linguistics
19. Henry Brighton and Simon Kirby. The survival of the smallest: Stability condi-
tions for the cultural evolution of compositional language. In Advances in Artificial
Life: 6th European Conference, ECAL 2001 Prague, Czech Republic, September
10–14, 2001 Proceedings 6, pages 592–601. Springer, 2001.
Abstract: Recent work in the field of computational evolutionary linguistics suggests that the dynam-
ics arising from the cultural evolution of language can explain the emergence of syntactic structure.
We build on this work by introducing a model of language acquisition based on the Minimum De-
scription Length Principle. Our experiments show that compositional syntax is most likely to occur
under two conditions specific to hominids: (i) A complex meaning space structure, and (ii) the
poverty of the stimulus.
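
For reference (a gloss, not the authors' wording): the Minimum Description Length principle invoked here selects, among candidate grammars G for the observed meaning-signal pairs D, one minimizing the two-part code length L(G) + L(D | G); a learner therefore prefers a compact compositional grammar whenever its brevity is not outweighed by a longer encoding of the data it must account for.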

Sociology
20. Carter T Butts. The complexity of social networks: theoretical and empirical
findings. Social Networks, 23(1):31–72, 2001.

Abstract: A great deal of work in recent years has been devoted to the topic of “complexity”, its
measurement, and its implications. Here, the notion of algorithmic complexity is applied to the
analysis of social networks. Structural features of theoretical importance — such as structural
equivalence classes — are shown to be strongly related to the algorithmic complexity of graphs, and
these results are explored using analytical and simulation methods. Analysis of the complexity of a
variety of empirically derived networks suggests that many social networks are nearly as complex as
their source entropy, and thus that their structure is roughly in line with the conditional uniform
graph distribution hypothesis. Implications of these findings for network theory and methodology
are also discussed.

Ecology
21. Vasilis Dakos and Fernando Soler-Toscano. Measuring complexity to infer
changes in the dynamics of ecological systems under stress. Ecological Complexity,
32:144–155, 2017.
Abstract: Despite advances in our mechanistic understanding of ecological processes, the inherent
complexity of real-world ecosystems still limits our ability in predicting ecological dynamics espe-
cially in the face of ongoing environmental stress. Developing a model is frequently challenged by
structure uncertainty, unknown parameters, and limited data for exploring out-of-sample predic-
tions. One way to address this challenge is to look for patterns in the data themselves in order to
infer the underlying processes of an ecological system rather than to build system-specific models.
For example, it has been recently suggested that statistical changes in ecological dynamics can be
used to infer changes in the stability of ecosystems as they approach tipping points. For computer
scientists such inference is similar to the notion of a Turing machine: a computational device that
could execute a program (the process) to produce the observed data (the pattern). Here, we make
use of such basic computational ideas introduced by Alan Turing to recognize changing patterns in
ecological dynamics in ecosystems under stress. To do this, we use the concept of Kolmogorov algo-
rithmic complexity that is a measure of randomness. In particular, we estimate an approximation
to Kolmogorov complexity based on the Block Decomposition Method (BDM). We apply BDM to
identify changes in complexity in simulated time-series and spatial datasets from ecosystems that
experience different types of ecological transitions. We find that in all cases, KBDM complexity
decreased before all ecological transitions both in time-series and spatial datasets. These trends
indicate that loss of stability in the ecological models we explored is characterized by loss of com-
plexity and the emergence of a regular and computable underlying structure. Our results suggest
that Kolmogorov complexity may serve as tool for revealing changes in the dynamics of ecosystems
close to ecological transitions.
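
In outline (a sketch, not the authors' code), BDM chops the data into small blocks, looks up an algorithmic-probability-based complexity estimate CTM(b) for each distinct block b in a precomputed table, and adds log2 of each block's multiplicity. The table entries below are placeholders; real implementations load published estimates obtained from large enumerations of small Turing machines.

    import math
    from collections import Counter

    # Placeholder CTM table: the numbers are illustrative only, not published values.
    CTM = {"0000": 10.2, "0101": 11.9, "1111": 10.2, "0110": 12.4}

    def bdm(s: str, block: int = 4) -> float:
        # BDM(s) = sum over distinct blocks b of CTM(b) + log2(multiplicity of b).
        blocks = [s[i:i + block] for i in range(0, len(s) - block + 1, block)]
        counts = Counter(blocks)
        # Blocks missing from the toy table fall back to their length, a crude stand-in.
        return sum(CTM.get(b, float(block)) + math.log2(m) for b, m in counts.items())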

Ethology
22. Boris Ryabko and Zhanna Reznikova. The use of ideas of information theory
for studying “language” and intelligence in ants. Entropy, 11(4):836–853, 2009.
Abstract: In this review we integrate results of long term experimental study on ant “language” and
intelligence which were fully based on fundamental ideas of Information Theory, such as the Shannon
entropy, the Kolmogorov complexity, and Shannon’s equation connecting the length of a message
(l) and its frequency (p), i.e., l = − log p for rational communication systems. This approach enabled
us to obtain the following important results on ants’ communication and intelligence: (i) to reveal
“distant homing” in ants, that is, their ability to transfer information about remote events; (ii) to
estimate the rate of information transmission; (iii) to reveal that ants are able to grasp regularities
and to use them for “compression” of information; (iv) to reveal that ants are able to transfer to
each other the information about the number of objects; (v) to discover that ants can add and
subtract small numbers. The obtained results show that information theory is not only an excellent
mathematical theory, but that many of its results may be considered as laws of Nature.
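
As a quick numerical instance of the quoted equation (not part of the abstract): a message used with frequency p = 1/4 should, in a rationally organized communication system, be transmitted by a "word" of length proportional to l = − log2 (1/4) = 2, while a message of frequency 1/16 warrants length 4; frequent messages are thus expected to be the shortest.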

References
[1] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and its
Applications. Springer, New York, 4th edition, 2019.

[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-
Interscience, 2nd edition, 2006.

[3] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based
on Algorithmic Probability. Springer Berlin, Heidelberg, 2005.

[4] Cristian S. Calude. Information and Randomness: An Algorithmic Perspective.
Springer Berlin, Heidelberg, 2013.

[5] Gregory J. Chaitin. Meta Maths! The Quest for Omega. Vintage Books, 2006.

[6] Charles H. Bennett and Martin Gardner. The random number Omega bids fair
to hold the mysteries of the universe. Scientific American, 241(5):20–34, 1979.

[7] Gregory J. Chaitin. The halting probability Omega: Irreducible complexity in
pure mathematics. Milan Journal of Mathematics, 75(1):291–304, 2007.

[8] A. M. Turing. On computable numbers, with an application to the
Entscheidungsproblem. Proceedings of the London Mathematical Society,
s2-42(1):230–265, 1936.

[9] David Deutsch. The Fabric of Reality: The Science of Parallel Universes—and
Its Implications. Allen Lane The Penguin Press, London, 1997.

[10] David Deutsch. Quantum theory, the Church–Turing principle and the univer-
sal quantum computer. Proceedings of the Royal Society A, 400(1818):97–117,
1985.

[11] Yuri V. Matiyasevich. Hilbert’s Tenth Problem. MIT Press, Cambridge, MA,
1993. Translated from the Russian by M. B. Kan.

[12] Gregory J. Chaitin. Complexity and metamathematics.
https://www.academia.edu/128350080/Complexity_and_Metamathematics, 2023.
Lecture notes accompanying a talk given at the UM6P Science Week, Feb. 2023.

[13] David Deutsch. The Beginning of Infinity: Explanations that Transform the
World. Penguin Books, London, 2011.
