Think OS
Version 0.8.1
Allen B. Downey
The original form of this book is LaTeX source code. Compiling this code has the
effect of generating a device-independent representation of a textbook, which can
be converted to other formats and printed.
The cover for this book is based on a photo by Paul Friel (http://flickr.com/
people/frielp/), who made it available under the Creative Commons Attribution
license. The original photo is at http://flickr.com/photos/frielp/11999738/.
Preface
This book is intended for a different audience than a typical operating systems
textbook, and it has different goals. I developed it for a class at Olin College
called Software Systems.
This book does not assume that you have studied Computer Architecture.
As we go along, I will explain what we need.
Chapter 2 explains how the operating system uses processes to protect run-
ning programs from interfering with each other.
Chapter 5 describes how numbers, letters, and other values are encoded,
and presents the bitwise operators.
• If you don’t want to use Git at all, you can download the files in a Zip
file using the button in the lower-right corner of the GitHub page.
Contributor List
If you have a suggestion or correction, please send email to
downey@allendowney.com. If I make a change based on your feed-
back, I will add you to the contributor list (unless you ask to be omitted).
If you include at least part of the sentence the error appears in, that makes it
easy for me to search. Page and section numbers are fine, too, but not quite
as easy to work with. Thanks!
Other people who found typos and errors include Jim Tyson, Donald Robertson,
Jeremy Vermast, Yuzhong Huang, Ian Hill.
Contents
Preface iii
1 Compilation 1
1.6 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Processes 9
2.4 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Virtual memory 17
6 Memory management 41
6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7 Caching 47
7.1 Cache performance . . . . . . . . . . . . . . . . . . . . . . . . 47
7.2 Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3 Programming for cache performance . . . . . . . . . . . . . . 49
7.4 The memory hierarchy . . . . . . . . . . . . . . . . . . . . . . 50
7.5 Caching policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.6 Paging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8 Multitasking 55
8.1 Hardware state . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.2 Context switching . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.3 Kernel mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.4 The process life cycle . . . . . . . . . . . . . . . . . . . . . . . 58
8.5 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.6 Real-time scheduling . . . . . . . . . . . . . . . . . . . . . . . 61
9 Threads 63
9.1 Creating threads . . . . . . . . . . . . . . . . . . . . . . . . . . 64
9.2 Creating threads . . . . . . . . . . . . . . . . . . . . . . . . . . 64
9.3 Joining threads . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
9.4 Synchronization errors . . . . . . . . . . . . . . . . . . . . . . 67
9.5 Mutex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
10 Condition variables 71
10.1 The work queue . . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.2 Producers and consumers . . . . . . . . . . . . . . . . . . . . 74
10.3 Mutual exclusion . . . . . . . . . . . . . . . . . . . . . . . . . 75
10.4 Condition variables . . . . . . . . . . . . . . . . . . . . . . . . 77
10.5 Condition variable implementation . . . . . . . . . . . . . . . 80
11 Semaphores in C 83
Chapter 1
Compilation
These checks happen before the program starts executing, so errors can be
found earlier. More importantly, errors can be found in parts of the program
that have never run. Furthermore, these checks don’t have to happen at run
time, which is one of the reasons compiled languages generally run faster
than interpreted languages.
This shows that the name of the variable is stored in memory while the pro-
gram is running (along with some other values that are part of the default
runtime environment).
2. Parsing: During parsing, the compiler reads the source code and
builds an internal representation of the program, called an abstract
syntax tree. Errors detected during this step are generally syntax er-
rors.
Normally when you run gcc, it runs all of these steps and generates an
executable file. For example, here is a minimal C program:
#include <stdio.h>
int main()
{
printf("Hello World\n");
}
If you save this code in a file called hello.c, you can compile and run it like
this:
$ gcc hello.c
$ ./a.out
By default, gcc stores the executable code in a file called a.out (which orig-
inally stood for “assembler output”). The second line runs the executable.
The prefix ./ tells the shell to look for it in the current directory.
It is usually a good idea to use the -o flag to provide a better name for the
executable:
$ gcc hello.c -o hello
$ ./hello
This output indicates that hello.o defines the name main and uses a func-
tion named puts, which stands for “put string”. In this example, gcc per-
forms an optimization by replacing printf, which is a large and compli-
cated function, with puts, which is relatively simple.
You can control how much optimization gcc does with the -O flag. By de-
fault, it does very little optimization, which can make debugging easier.
The option -O1 turns on the most common and safe optimizations. Higher
numbers turn on additional optimizations that require longer compilation
time.
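For example (an illustration of mine, not a command from the book), you could compile the same program at different optimization levels like this:

$ gcc -O1 hello.c -o hello
$ gcc -O2 hello.c -o hello
$ gcc -O3 hello.c -o hello

Higher levels generally mean longer compilation and can make the compiled program harder to step through in a debugger.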
1.6 Preprocessing
Taking another step backward through the compilation process, you can
use the -E flag to run the preprocessor only:
$ gcc hello.c -E
The result is the output from the preprocessor. In this example, it contains
the included code from stdio.h, and all the files included from stdio.h,
and all the files included from those files, and so on. On my machine, the
total is more than 800 lines of code. Since almost every C program includes
stdio.h, those 800 lines of code get compiled a lot. If, like many C pro-
grams, you also include stdlib.h, the result is more than 1800 lines of code.
Once the program starts, C does very little runtime checking, so there are
only a few runtime errors you are likely to see. If you divide by zero, or
perform another illegal floating-point operation, you will get a “Floating
point exception.” And if you try to read or write an incorrect location in
memory, you will get a “Segmentation fault.”
Chapter 2
Processes
While a program is running, most of its data is held in main memory, which
is usually some kind of random access memory (RAM). On most current
computers, main memory is volatile, which means that when the computer
shuts down, the contents of main memory are lost. A typical desktop com-
puter has 2–8 GiB of memory. GiB stands for “gibibyte,” which is 2^30 bytes.
If the program reads and writes files, those files are usually stored on a
hard disk drive (HDD) or solid state drive (SSD). These storage devices are
non-volatile, so they are used for long-term storage. Currently a typical
desktop computer has an HDD with a capacity of 500 GB to 2 TB. GB stands
for “gigabyte,” which is 10^9 bytes. TB stands for “terabyte,” which is 10^12
bytes.
You might have noticed that I used the binary unit GiB for the size of main
memory and the decimal units GB and TB for the size of the HDD. For
historical and technical reasons, memory is measured in binary units, and
disk drives are measured in decimal units. In this book I will be careful
to distinguish binary and decimal units, but you should be aware that the
word “gigabyte” and the abbreviation GB are often used ambiguously.
In casual use, the term “memory” is sometimes used for HDDs and SSDs as
well as RAM, but the properties of these devices are very different, so we
will need to distinguish them. I will use storage to refer to HDDs and SSDs.
• The program counter, or PC, which contains the address (in memory)
of the next instruction in the program.
• The instruction register, or IR, which contains the machine code in-
struction currently executing.
• The stack pointer, or SP, which contains the address of the stack frame
for the current function, which contains its parameters and local vari-
ables.
Figure 2.1 shows these registers, as well as some other components we will
learn about later.
When a program is running, the CPU executes the following steps, called
the instruction cycle:
• Fetch: The next instruction is fetched from memory and stored in the
instruction register.
• Decode: Part of the CPU, called the control unit, decodes the instruc-
tion and sends signals to the other parts of the CPU.
• Execute: Signals from the control unit cause the appropriate compu-
tation to occur.
Most computers can execute a few hundred different instructions, called the
instruction set. But most instructions fall into a few general categories:
When you drive a car, for example, you normally don’t have to think about
the details of how the engine and steering mechanism work. You can get
along very well with a simple mental model of steering. Your mental
model is an abstraction.
Similarly, when you use a web browser, you understand that when
you click on a link, the browser displays the page the link refers to.
The software and network communication that make that possible are
complex, but as a user, you don’t have to know the details.
A large part of software engineering is designing abstractions like
these that allow users and other programmers to use powerful and
complicated systems without having to know about the details of their
implementation.
The word “virtual” is often used in the context of a virtual machine, which is
software that creates the illusion of a dedicated computer running a particu-
lar operating system, when in reality the virtual machine might be running,
along with many other virtual machines, on a computer running a different
operating system.
2.4 Isolation
One of the most important principles of engineering is isolation: when you
are designing a system with multiple components, it is usually a good idea
to isolate them from each other so that a change in one component doesn’t
have undesired effects on other components.
• The hardware state of the program, which includes data stored in reg-
isters, status information, and the program counter, which indicates
which instruction is currently executing.
Usually one process runs one program, but it is also possible for a process
to load and run a new program.
It is also possible, and common, to run the same program in more than
one process. In that case, the processes share the same program text but
generally have different data and hardware states.
• Virtual memory: Most operating systems create the illusion that each
process has its own chunk of memory, isolated from all other pro-
cesses. Again, programmers generally don’t have to think about how
virtual memory works; they can proceed as if every program has a
dedicated chunk of memory.
As a programmer, you don’t need to know much about how these capabili-
ties are implemented. But if you are curious, you will find a lot of interest-
ing things going on under the metaphorical hood. And if you know what’s
going on, it can make you a better programmer.
Some browsers, like Chrome, create a new process for each window and
each tab.
And those are just the processes I am aware of. At the same time there
are many other processes running in the background. Many of them are
performing operations related to the operating system.
The UNIX command ps prints information about running processes. If you
run it in a terminal, you might see something like this:
PID TTY TIME CMD
2687 pts/1 00:00:00 bash
2801 pts/1 00:01:24 emacs
24762 pts/1 00:00:00 ps
The first column is the unique numerical process ID. The second column
is the terminal that created the process; “TTY” stands for teletypewriter,
which was the original mechanical terminal.
The third column is the total processor time used by the process, in hours,
minutes, and seconds. The last column is the name of the running program.
In this example, bash is the name of the shell that interprets the commands
I type in the terminal, emacs is my text editor, and ps is the program gener-
ating this output.
By default, ps lists only the processes associated with the current terminal.
If you use the -e flag, you get every process (including processes belonging
to other users, which is a security flaw, in my opinion).
On my system there are currently 233 processes. Here are some of them:
PID TTY TIME CMD
1 ? 00:00:17 init
2 ? 00:00:00 kthreadd
3 ? 00:00:02 ksoftirqd/0
4 ? 00:00:00 kworker/0:0
8 ? 00:00:00 migration/0
9 ? 00:00:00 rcu_bh
10 ? 00:00:16 rcu_sched
47 ? 00:00:00 cpuset
48 ? 00:00:00 khelper
49 ? 00:00:00 kdevtmpfs
50 ? 00:00:00 netns
51 ? 00:00:00 bdi-default
52 ? 00:00:00 kintegrityd
53 ? 00:00:00 kblockd
54 ? 00:00:00 ata_sff
55 ? 00:00:00 khubd
56 ? 00:00:00 md
57 ? 00:00:00 devfreq_wq
init is the first process created when the operating system starts. It creates
many of the other processes, and then sits idle until the processes it created
are done.
kthreadd is a process the operating system uses to create new threads. We’ll
talk more about threads later, but for now you can think of a thread as kind
of a process. The k at the beginning stands for kernel, which is the part of
the operating system responsible for core capabilities like creating threads.
The extra d at the end stands for daemon, which is another name for pro-
cesses like this that run in the background and provide operating system
services. In this context, “daemon” is used in the sense of a helpful spirit,
with no connotation of evil.
Based on the name, you can infer that ksoftirqd is also a kernel daemon;
specifically, it handles software interrupt requests, or “soft IRQ”.
I won’t go into more details about the other processes, but if you are inter-
ested you can search for more information about them. You should run ps
on your system and compare your results to mine.
Chapter 3
Virtual memory
Thus, virtual memory is one important way the operating system isolates
processes from each other. In general, a process cannot access data belong-
ing to another process, because there is no virtual address it can generate
that maps to physical memory allocated to another process.
• The code segment contains the program text; that is, the machine lan-
guage instructions that make up the program.
• The global segment contains global variables and local variables that
are declared static.
• The stack segment contains the call stack, which is a sequence of stack
frames. Each time a function is called, a stack frame is allocated to
contain the parameters and local variables of the function. When the
function completes, its stack frame is removed from the stack.
• The text segment is near the “bottom” of memory, that is, at addresses
near 0.
• The constant segment is often just above the text segment, that is, at
higher addresses.
• The stack is near the top of memory; that is, near the highest addresses
in the virtual address space. As the stack expands, it grows down
toward smaller addresses.
int global;

int main ()
{
    int local = 5;
    void *p = malloc(128);
    char *s = "Hello, World";
}
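To see where these values end up, here is a self-contained sketch (my own variation on this example) that prints the address of each one; on a typical Linux system the addresses fall into the segments described above:

#include <stdio.h>
#include <stdlib.h>

int global;

int main()
{
    int local = 5;
    void *p = malloc(128);
    char *s = "Hello, World";

    printf("address of global is %p\n", (void *) &global);   // global segment
    printf("address of local is  %p\n", (void *) &local);    // stack segment
    printf("p points to          %p\n", p);                  // heap
    printf("s points to          %p\n", (void *) s);         // constant segment
    return 0;
}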
[Figure: the CPU sends a virtual address to the MMU; with the help of the TLB, the MMU splits it into a virtual page number and an offset, and translates it into a physical address (physical page number and offset).]
Most processors provide a memory management unit (MMU) that sits be-
tween the CPU and main memory. The MMU performs fast translation be-
tween VAs and PAs.
2. The MMU splits the VA into two parts, called the page number and
the offset. A “page” is a chunk of memory; the size of a page depends
on the operating system and the hardware, but common sizes are 1–4
KiB.
The TLB contains cached copies of data from the page table (which is stored
in kernel memory). The page table contains the mapping from virtual page
numbers to physical page numbers. Since each process has its own page
table, the TLB has to make sure it only uses entries from the page table of
the process that’s running.
• Since 1 GiB is 2^30 bytes and 1 KiB is 2^10 bytes, there are 2^20 physical
pages, sometimes called frames.
• The size of the virtual address space is 2^32 B and the size of a page is
2^10 B, so there are 2^22 virtual pages.
• The size of the offset is determined by the page size. In this example
the page size is 2^10 B, so it takes 10 bits to specify a byte on a page.
• Since there are 2^20 physical pages, each physical page number is 20
bits. Adding in the 10-bit offset, the resulting PAs are 30 bits.
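To make the split concrete, here is a small sketch (my own illustration, using the 1 KiB pages from this example) that extracts the virtual page number and offset from a 32-bit virtual address:

#include <stdio.h>
#include <stdint.h>

int main()
{
    uint32_t va = 0x12345678;        // an arbitrary 32-bit virtual address
    uint32_t offset = va & 0x3FF;    // low 10 bits, since the page size is 2^10 B
    uint32_t vpn = va >> 10;         // remaining 22 bits: the virtual page number

    printf("virtual page number = 0x%x, offset = 0x%x\n",
           (unsigned) vpn, (unsigned) offset);
    return 0;
}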
So far this all seems feasible. But let’s think about how big a page table
might have to be. The simplest implementation of a page table is an array
with one entry for each virtual page. Each entry would contain a physical
page number, which is 20 bits in this example, plus some additional infor-
mation about each frame. So we expect 3–4 bytes per entry. But with 2^22
virtual pages, the page table would require 2^24 bytes, or 16 MiB.
And since we need a page table for each process, a system running 256
processes would need 2^32 bytes, or 4 GiB, just for page tables! And that’s
just with 32-bit virtual addresses. With 48- or 64-bit VAs, the numbers are
ridiculous.
Fortunately, we don’t actually need that much space, because most pro-
cesses don’t use even a small fraction of their virtual address space. And
if a process doesn’t use a virtual page, we don’t need an entry in the page
table for it.
Another way to say the same thing is that page tables are “sparse”, which
implies that the simple implementation, an array of page table entries, is a
bad idea. Fortunately, there are several good implementations for sparse
arrays.
One option is a multilevel page table, which is what many operating sys-
tems, including Linux, use. Another option is an associative table, where
each entry includes both the virtual page number and the physical page
number. Searching an associative table can be slow in software, but in hard-
ware we can search the entire table in parallel, so associative arrays are often
used to represent the page table entries in the TLB.
I mentioned earlier that the operating system can interrupt a running pro-
cess, save its state, and then run another process. This mechanism is called
a context switch. Since each process has its own page table, the operating
system has to work with the MMU to make sure each process gets the right
page table. In older machines, the page table information in the MMU had
to be replaced during every context switch, which was expensive. In newer
systems, each page table entry in the MMU includes the process ID, so page
tables from multiple processes can be in the MMU at the same time.
Chapter 4
Files and file systems
When a process completes (or crashes), any data stored in main memory is
lost. But data stored on a hard disk drive (HDD) or solid state drive (SSD)
is persistent; that is, it survives after the process completes, even if the
computer shuts down.
Hard disk drives are complicated. Data is stored in blocks, which are laid
out in sectors, which make up tracks, which are arranged in concentric cir-
cles on platters.
Solid state drives are simpler in one sense, because blocks are numbered se-
quentially, but they raise a different complication: each block can be written
a limited number of times before it becomes unreliable.
Abstractly:
• A file system is a mapping from each file’s name to its contents. If you
think of the names as keys, and the contents as values, a file system is
a kind of key-value database (see https://en.wikipedia.org/wiki/
Key-value_database).
• A file is a sequence of bytes.
File names are usually strings, and they are usually hierarchical; that is, the
string specifies a path from a top-level directory (or folder), through a series
of subdirectories, to a specific file.
The primary difference between the abstraction and the underlying mecha-
nism is that files are byte-based and persistent storage is block-based. The
operating system translates byte-based file operations in the C library into
block-based operations on storage devices. Typical block sizes are 1–8 KiB.
For example, the following code opens a file and reads the first byte:
FILE *fp = fopen("/home/downey/file.txt", "r");
char c = fgetc(fp);
fclose(fp);
When this code runs:
1. fopen uses the filename to find the top-level directory, called /, the
subdirectory home, and the sub-subdirectory downey.
2. It finds the file named file.txt and “opens” it for reading, which
means it creates a data structure that represents the file being read.
Among other things, this data structure keeps track of how much of
the file has been read, called the “file position”.
In DOS, this data structure is called a File Control Block, but I want to
avoid that term because in UNIX it means something else. In UNIX,
there seems to be no good name for it. It is an entry in the open file
table, so I will call it an OpenFileTableEntry.
3. When we call fgetc, the operating system checks whether the next
character of the file is already in memory. If so, it reads the next char-
acter, advances the file position, and returns the result.
5. When the I/O operation is complete, the new block of data is stored
in memory, and the process resumes. It reads the first character and
stores it as a local variable.
6. When the process closes the file, the operating system completes or
cancels any pending operations, removes data stored in memory, and
frees the OpenFileTableEntry.
The process for writing a file is similar, but there are some additional steps.
Here is an example that opens a file for writing and changes the first char-
acter.
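The example code itself is not shown in this excerpt; a minimal sketch of what it might look like (the mode string and the file name are assumptions):

FILE *fp = fopen("/home/downey/file.txt", "w");   // "w" creates the file if it does not exist
fputc('b', fp);
fclose(fp);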
1. Again, fopen uses the filename to find the file. If it does not already
exist, it creates a new file and adds an entry in the parent directory,
/home/downey.
3. fputc attempts to write (or re-write) the first byte of the file. If the
file already exists, the operating system has to load the first block into
memory. Otherwise it allocates a new block in memory and requests
a new block on disk.
5. When the file is closed, any buffered data is written to disk and the
OpenFileTableEntry is freed.
To put these numbers in perspective, let’s compare them to the clock cycle
of the CPU. A processor with clock rate 2 GHz completes one clock cycle
every 0.5 ns. The time to get a byte from memory to the CPU is typically
around 100 ns. If the processor completes one instruction per clock cycle, it
would complete 200 instructions while waiting for a byte from memory.
If there’s nothing for the CPU to do while it waits, it would be idle. That’s
why the operating system generally switches to another process while it is
waiting for data from disk.
• Block transfers: The time it takes to load a single byte from disk is
5–25 ms. The additional time to load an 8 KiB block is negligible. So
systems generally try to read large blocks each time they access the
disk.
• Buffering: When you write a file, the operating system stores the data
in memory and only writes it to disk later. If you modify the block
several times while it is in memory, the system only has to write it to
disk once.
unexpectedly bad, you might have to know something about these mech-
anisms to diagnose the problem, and (2) when data is buffered, it can be
harder to debug a program. For example, if a program prints a value and
then crashes, the value might not appear, because it might be in a buffer.
Similarly, if a program writes data to disk and then the computer loses
power, the data might be lost if it is in a cache and not yet on disk.
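As an illustration (my own example, not from the book), you can force buffered output to be written with fflush; without it, output printed just before a crash may never appear:

#include <stdio.h>

int main()
{
    printf("about to do something dangerous");   // no newline: may sit in the stdout buffer
    fflush(stdout);                              // force the buffer to be written now
    /* ... code that might crash ... */
    return 0;
}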
file size of 8 TiB. When UNIX inodes were designed, that seemed big enough
to serve for a long time. But that was a long time ago.
It is hard to design a file system that achieves all of these goals, especially
since file system performance depends on “workload characteristics” like
file sizes, access patterns, etc. A file system that is well tuned for one work-
load might not perform as well for another.
For this reason, most operating systems support several kinds of file sys-
tems, and file system design is an active area of research and development.
In the last decade, Linux systems have migrated from ext2, which was a
conventional UNIX file system, to ext3, a journaling file system intended
to improve speed and contiguity, and more recently to ext4, which can han-
dle larger files and file systems. Within the next few years, there might be
another migration to the B-tree file system, Btrfs.
Reusing the file abstraction makes life easier for programmers, since they
only have to learn one API (application program interface). It also makes
programs more versatile, since a program intended to work with files can
also work with data coming from pipes and other sources.
Chapter 5
More bits and bytes
For negative numbers, the most obvious representation uses a sign bit to
indicate whether a number is positive or negative. But there is another rep-
resentation, called two’s complement, which is much more common because
it is easier to work with in hardware.
In two’s complement, the leftmost bit acts like a sign bit; it is 0 for positive
numbers and 1 for negative numbers.
To convert from an 8-bit number to 16 bits, we have to add more 0’s for a
positive number and more 1’s for a negative number. In effect, we have to
copy the sign bit into the new bits. This process is called sign extension.
In C all integer types are signed (able to represent positive and negative
numbers) unless you declare them unsigned. The difference, and the reason
this declaration is important, is that operations on unsigned integers don’t
use sign extension.
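Here is a short sketch (my own example, using the fixed-width types from stdint.h) that shows the difference: the signed value is sign-extended when widened, and the unsigned value is zero-extended:

#include <stdio.h>
#include <stdint.h>

int main()
{
    int8_t   a = -5;     // stored as 0xFB
    int16_t  b = a;      // sign-extended to 0xFFFB, still -5
    uint8_t  c = 0xFB;   // 251
    uint16_t d = c;      // zero-extended to 0x00FB, still 251

    printf("%d %d %u %u\n", a, b, (unsigned) c, (unsigned) d);
    return 0;
}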
In this context, the value 3 is called a “mask” because it selects some bits
and masks the rest.
Setting bits: Similarly, if you OR a vector with 3, it sets the two rightmost
bits and leaves the rest alone:
  xxxx
| 0011
  ----
  xx11
Toggling bits: Finally, if you XOR a vector with 3, it flips the two rightmost
bits and leaves the rest alone. As an exercise, see if you can compute the two’s
complement of 12 using ^. Hint: what’s the two’s complement representa-
tion of -1?
C also provides shift operators, << and >>, which shift bits left and right. Each
left shift doubles a number, so 5 << 1 is 10, and 5 << 2 is 20. Each right shift
divides by two (rounding down), so 5 >> 1 is 2 and 2 >> 1 is 1.
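Putting these operators together, here is a small self-contained sketch (my own example) that demonstrates masking, setting, toggling, and shifting:

#include <stdio.h>

int main()
{
    int x = 6;                    // binary 0110
    printf("%d\n", x & 3);        // select the two rightmost bits: 2
    printf("%d\n", x | 3);        // set the two rightmost bits: 7
    printf("%d\n", x ^ 3);        // toggle the two rightmost bits: 5
    printf("%d\n", 5 << 1);       // 10
    printf("%d\n", 5 >> 1);       // 2
    return 0;
}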
Most computers use the IEEE standard for floating-point arithmetic. The C
type float usually corresponds to the 32-bit IEEE standard; double usually
corresponds to the 64-bit standard.
In the 32-bit standard, the leftmost bit is the sign bit, s. The next 8 bits are
the exponent, q, and the last 23 bits are the coefficient, c. The value of a
floating-point number is
(−1)^s · c · 2^q
Well, that’s almost correct, but there’s one more wrinkle. Floating-point
numbers are usually normalized so that there is one digit before the point.
For example, in base 10, we prefer 2.998 · 10^8 rather than 2998 · 10^5 or any
other equivalent expression. In base 2, a normalized number always has the
digit 1 before the binary point. Since the digit in this location is always 1,
we can save space by leaving it out of the representation.
Well, that’s almost correct, but there’s one more wrinkle. The exponent is
stored with a bias. In the 32-bit standard, the bias is 127, so the exponent 3
would be stored as 130.
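For example (working through the number used in the code below), -13.0 in binary is -1101, which normalizes to -1.101 × 2^3. So the sign bit is 1, the stored exponent is 3 + 127 = 130, and the coefficient field holds the bits 101 followed by 20 zeros, since the leading 1 is left out. These are exactly the values the program below prints.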
union {
    float f;
    unsigned int u;
} p;

p.f = -13.0;
unsigned int sign = (p.u >> 31) & 1;
unsigned int exp = (p.u >> 23) & 0xff;

unsigned int coef_mask = (1 << 23) - 1;
unsigned int coef = p.u & coef_mask;

printf("%d\n", sign);
printf("%d\n", exp);
printf("0x%x\n", coef);
This code is in float.c in the repository for this book (see Section 0.1).
The union allows us to store a floating-point value using p.f and then read
it as an unsigned integer using p.u.
To get the sign bit, we shift the bits to the right 31 places and then use a 1-bit
mask to select only the rightmost bit.
To get the exponent, we shift the bits 23 places, then select the rightmost 8
bits (the hexadecimal value 0xff has eight 1’s).
To get the coefficient, we need to extract the 23 rightmost bits and ignore the
rest. We do that by making a mask with 1s in the 23 rightmost places and 0s
on the left. The easiest way to do that is by shifting 1 to the left by 23 places
and then subtracting 1.
1
130
0x500000
As expected, the sign bit for a negative number is 1. The exponent is 130,
including the bias. And the coefficient, which I printed in hexadecimal, is
101 followed by 20 zeros.
Actually, the same thing can happen if you read a location in memory in-
correctly. One way that can happen is if you read past the end of an array.
To see what happens, I use two functions. The first, f1, allocates an array of
100 ints on the stack and fills it with the numbers from 0 to 99. The second,
f2, allocates a smaller array along with two other local variables, and then
reads outside the bounds of its array:

void f2() {
    int x = 17;
    int array[10];
    int y = 123;

    printf("%d\n", array[-2]);
    printf("%d\n", array[-1]);
    printf("%d\n", array[10]);
    printf("%d\n", array[11]);
}
If I call f1 and then f2, I get these results:
17
123
98
99
The details here depend on the compiler, which arranges variables on the
stack. From these results, we can infer that the compiler put x and y next to
each other, “below” the array (at a lower address). And when we read past
the array, it looks like we are getting values that were left on the stack by
the previous function call.
In this example, all of the variables are integers, so it is relatively easy to fig-
ure out what is going on. But in general when you read beyond the bounds
of an array, the values you read might have any type. For example, if I
change f1 to make an array of floats, the results are:
17
123
1120141312
1120272384
The latter two values are what you get if you interpret a floating-point value
as an integer. If you encountered this output while debugging, you would
have a hard time figuring out what’s going on.
Also, the letters and numbers in C strings are encoded in ASCII. The ASCII
codes for the digits “0” through “9” are 48 through 57, not 0 through 9. The
ASCII code 0 is the NUL character that marks the end of a string. And the
ASCII codes 1 through 9 are special characters used in some communication
protocols.
The ASCII code for the letter “A” is 65; the code for “a” is 97. Here are those
codes in binary:
65 = b0100 0001
97 = b0110 0001
A careful observer will notice that they differ by a single bit. And this pat-
tern holds for the rest of the letters; the sixth bit (counting from the right)
acts as a “case bit”, 0 for upper-case letters and 1 for lower case letters.
As an exercise, write a function that takes a string and converts from lower-
case to upper-case by flipping the sixth bit. As a challenge, you can make a
faster version by reading the string 32 or 64 bits at a time, rather than one
character at a time. This optimization is made easier if the length of the
string is a multiple of 4 or 8 bytes.
If you read past the end of a string, you are likely to see strange characters.
Conversely, if you write a string and then accidentally read it as an int or
float, the results will be hard to interpret.
Chapter 6
Memory management
• calloc, which is the same as malloc except that it also clears the newly
allocated chunk; that is, it sets all bytes in the chunk to 0.
• If you access (read or write) any chunk that has not been allocated,
that’s a paddling.
• If you free an allocated chunk and then access it, that’s a paddling.
• If you try to free a chunk that has not been allocated, that’s a paddling.
• If you free the same chunk more than once, that’s a paddling.
• If you call realloc with a chunk that was not allocated, or was allo-
cated and then freed, that’s a paddling.
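As an illustration (my own example, not from the book), here are two of these errors in code form:

#include <stdlib.h>

int main()
{
    int *p = malloc(10 * sizeof(int));
    free(p);
    p[0] = 17;       // use after free: that's a paddling
    free(p);         // double free: also a paddling
    return 0;
}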
It might not sound difficult to follow these rules, but in a large program a
chunk of memory might be allocated in one part of the program, used in
several other parts, and freed in yet another part. So changes in one part of
the program can require changes in many other parts.
Ideally, every function that allocates memory should include, as part of the
documented interface, information about how that memory is supposed to
be freed. Mature libraries often do this well, but in the real world, software
engineering practice often falls short of this ideal.
To make matters worse, memory errors can be difficult to find because the
symptoms are unpredictable. For example:
• If you read a value from an unallocated chunk, the system might de-
tect the error, trigger a runtime error called a segmentation fault, and
stop the program. Or, the program might read unallocated memory
without detecting the error; in that case, the value it gets is whatever
happened to be stored at the accessed location, which is unpredictable,
and might be different each time the program runs.
And things can be even worse than that! One of the most common prob-
lems with C-style memory management is that the data structures used
to implement malloc and free (which we will see soon) are often stored
along with the allocated chunks. So if you accidentally write past the end
of a dynamically-allocated chunk, you are likely to mangle these data struc-
tures. The system usually won’t detect the problem until later, when you
call malloc or free, and those functions fail in some inscrutable way.
One conclusion you should draw from this is that safe memory manage-
ment requires design and discipline. If you write a library or module that
allocates memory, you should also provide an interface to free it, and mem-
ory management should be part of the API design from the beginning.
If you use a library that allocates memory, you should be disciplined in your
use of the API. For example, if the library provides functions to allocate and
deallocate storage, you should use those functions and not, for example,
call free on a chunk you did not malloc. And you should avoid keeping
multiple references to the same chunk in different parts of your program.
Often there is a trade-off between safe memory management and perfor-
mance. For example, the most common source of memory errors is writing
beyond the bounds of an array. The obvious remedy for this problem is
bounds checking; that is, every access to the array should check whether the
index is out of bounds. High-level libraries that provide array-like struc-
tures usually perform bounds checking. But C arrays and most low-level
libraries do not.
But if a program runs for a long time and leaks memory, its total memory
use will increase indefinitely. At that point, a few things might happen:
• On systems with virtual memory, the operating system can move an-
other process’s pages from memory to disk and then allocate more
space to the leaking process. I explain this mechanism in Section 7.6.
• Eventually, a process might fill its virtual address space (or the usable
part). After that, there are no more addresses to allocate, so malloc
returns NULL.
If malloc returns NULL, but you persist and access the chunk you think
you allocated, you get a segmentation fault. For this reason, it is considered
good style to check the result from malloc before using it. One option is to
add a condition like this after every malloc call:
void *p = malloc(size);
if (p == NULL) {
perror("malloc failed");
exit(-1);
}
perror is declared in stdio.h; it prints an error message and additional
information about the last error that occurred.
6.3 Implementation
When a process starts, the system allocates space for the text segment and
statically allocated data, space for the stack, and space for the heap, which
contains dynamically allocated data.
Not all programs allocate data dynamically, so the initial size of the heap
might be small or zero. Initially the heap contains only one free chunk.
When malloc is called, it checks whether it can find a free chunk that’s big
enough. If not, it has to request more memory from the system. The func-
tion that does that is sbrk, which sets the program break, which you can
think of as a pointer to the end of the heap.
When sbrk is called, the OS allocates new pages of physical memory, up-
dates the process’s page table, and sets the program break.
In theory, a program could call sbrk directly (without using malloc) and
manage the heap itself. But malloc is easier to use and, for most memory-
use patterns, it runs fast and uses memory efficiently.
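To watch the program break move, you could try something like this sketch (my own example; how far the break moves depends on the malloc implementation):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
    void *before = sbrk(0);     // current program break
    void *p = malloc(1024);     // may cause malloc to call sbrk
    void *after = sbrk(0);

    printf("break moved by %ld bytes\n",
           (long) ((char *) after - (char *) before));
    free(p);
    return 0;
}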
To implement the memory management API (that is, the functions malloc,
free, calloc, and realloc), most Linux systems use ptmalloc, which is
based on dlmalloc, written by Doug Lea. A short paper that describes key
elements of the implementation is available at http://gee.cs.oswego.edu/
dl/html/malloc.html.
• The run time of malloc does not usually depend on the size of the
chunk, but might depend on how many free chunks there are. free is
usually fast, regardless of the number of free chunks. Because calloc
clears every byte in the chunk, the run time depends on chunk size (as
well as the number of free chunks).
realloc is sometimes fast, if the new size is smaller than the current
size, or if space is available to expand the existing chunk. If not, it has
to copy data from the old chunk to the new; in that case, the run time
depends on the size of the old chunk.
• Space overhead: Boundary tags and free list pointers take up space.
The minimum chunk size on most systems is 16 bytes. So for very
small chunks, malloc is not space efficient. If your program requires
large numbers of small structures, it might be more efficient to allocate
them in arrays.
• Fragmentation: If you allocate and free chunks with varied sizes, the
heap will tend to become fragmented. That is, the free space might be
broken into many small pieces. Fragmentation wastes space; it also
slows the program down by making memory caches less effective.
• Binning and caching: The free list is sorted by size into bins, so when
malloc searches for a chunk with a particular size, it knows what bin
to search in. If you free a chunk and then immediately allocate a chunk
with the same size, malloc will usually be fast.
Chapter 7
Caching
During each instruction cycle, one instruction is read from the program text.
In addition, about half of the instructions in a typical program load or store
data. And therein lies one of the fundamental problems of computer archi-
tecture: the memory bottleneck.
In current computers, a typical core is capable of executing an instruction in
less than 1 ns. But the time it takes to transfer data to and from memory is
about 100 ns. If the CPU has to wait 100 ns to fetch the next instruction, and
another 100 ns to load data, it would complete instructions 200 times slower
than what’s theoretically possible. For many computations, memory is the
speed limiting factor, not the CPU.
If the CPU can usually find the data it needs in cache, the program can run
close to the full speed of the CPU. If the CPU frequently needs data that are
not in cache, the program is limited by the speed of memory.
The cache hit rate, h, is the fraction of memory accesses that find data in
cache; the miss rate, m, is the fraction of memory accesses that have to go
to memory. If the time to process a cache hit is Th and the time for a cache
miss is Tm, the average time for each memory access is
h · Th + m · Tm
Equivalently, we could define the miss penalty as the extra time to process
a cache miss, Tp = Tm − Th. Then the average access time is
Th + m · Tp
When the miss rate is low, the average access time can be close to Th . That
is, the program can perform as if memory ran at cache speeds.
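For example (an illustrative calculation, not from the book), if a cache hit takes Th = 1 ns, the miss penalty is Tp = 99 ns, and the miss rate is 3%, the average access time is 1 + 0.03 · 99 ≈ 4 ns, which is much closer to cache speed than to memory speed.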
7.2 Locality
When a program reads a byte for the first time, the cache usually loads a
block or line of data that includes the requested byte and some of its neigh-
bors. If the program goes on to read one of the neighbors, it will already be
in cache.
As an example, suppose the block size is 64 B; you read a string with length
64, and the first byte of the string happens to fall at the beginning of a block.
When you load the first byte, you incur a miss penalty, but after that the
rest of the string will be in cache. After reading the whole string, the hit rate
will be 63/64, about 98%. If the string spans two blocks, you would incur 2
miss penalties. But even then the hit rate would be 62/64, or almost 97%. If
you then read the same string again, the hit rate would be 100%.
The tendency of a program to use the same data more than once is called
temporal locality. The tendency to use data in nearby locations is called
spatial locality. Fortunately, many programs naturally display both kinds
of locality:
For example, if you are working with a large array, it might be faster to
traverse the array once, performing several operations with each element,
rather than traversing the array several times.
If you are working with a 2-D array, it might be stored as an array of rows. If
you traverse through the elements, it would be faster to go row-wise, with
stride equal to the element size, rather than column-wise, with stride equal
to the row length.
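Here is a sketch of the two traversal orders (my own example); on most systems the first function runs noticeably faster because it touches memory with a stride of one element:

#define N 1000
double a[N][N];

double sum_rowwise(void)
{
    double total = 0.0;
    for (int i = 0; i < N; i++)        // row-wise: adjacent elements, good locality
        for (int j = 0; j < N; j++)
            total += a[i][j];
    return total;
}

double sum_columnwise(void)
{
    double total = 0.0;
    for (int j = 0; j < N; j++)        // column-wise: stride of N elements, poor locality
        for (int i = 0; i < N; i++)
            total += a[i][j];
    return total;
}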
Linked data structures don’t always exhibit spatial locality, because the
nodes aren’t necessarily contiguous in memory. But if you allocate many
nodes at the same time, they are usually co-located in the heap. Or, even
better, if you allocate an array of nodes all at once, you know they will be
contiguous.
Recursive strategies like mergesort often have good cache behavior because
they break big arrays into smaller pieces and then work with the pieces.
Sometimes these algorithms can be tuned to take advantage of cache be-
havior.
At some point during this chapter, a question like the following might have
occurred to you: “If caches are so much faster than main memory, why not
make a really big cache and forget about memory?”
Without going too far into computer architecture, there are two reasons:
electronics and economics. Caches are fast because they are small and close
to the CPU, which minimizes delays due to capacitance and signal propa-
gation. If you make a cache big, it will be slower.
Also, caches take up space on the processor chip, and bigger chips are
more expensive. Main memory is usually dynamic random-access memory
(DRAM), which uses only one transistor and one capacitor per bit, so it is
possible to pack more memory into the same amount of space. But this way
of implementing memory is slower than the way caches are implemented.
The trade-off between speed, size, and cost is the fundamental reason for
caching. If there were one memory technology that was fast, big, and cheap,
we wouldn’t need anything else.
The same principle applies to storage as well as memory. Solid state drives
(SSD) are fast, but they are more expensive than hard drives (HDD), so they
tend to be smaller. Tape drives are even slower than hard drives, but they
can store large amounts of data relatively cheaply.
The following table shows typical access times, sizes, and costs for each of
these technologies (as of 2018).
The cost of registers and caches is hard to quantify. They contribute to the
cost of the chips they are on, but consumers don’t see that cost directly.
For the other numbers in the table, I looked at the specifications for typical
hardware for sale from online computer hardware stores. By the time you
read this, these numbers will be obsolete, but they give you an idea of what
the performance and cost gaps looked like at one point in time.
These technologies make up the memory hierarchy (note that this use of
“memory” also includes storage). Each level of the hierarchy is bigger and
slower than the one above it. And in some sense, each level acts as a cache
for the one below it. You can think of main memory as a cache for programs
and data that are stored permanently on SSDs and HDDs. And if you are
working with very large datasets stored on tape, you could use hard drives
to cache one subset of the data at a time.
• Who moves data up and down the hierarchy? At the top of the hierar-
chy, register allocation is usually done by the compiler. Hardware on
the CPU handles the memory cache. Users implicitly move data from
storage to memory when they execute programs and open files. But
the operating system also moves data back and forth between mem-
ory and storage. At the bottom of the hierarchy, administrators move
data explicitly between disk and tape.
• What gets moved? In general, block sizes are small at the top of the
hierarchy and bigger at the bottom. In a memory cache, a typical block
size is 128 B. Pages in memory might be 4 KiB, but when the operating
system reads a file from disk, it might read 10s or 100s of blocks at a
time.
• When does data get moved? In the most basic cache, data gets moved
into cache when it is used for the first time. But many caches use some
kind of prefetching, meaning that data is loaded before it is explicitly
requested. We have already seen one form of prefetching: loading an
entire block when only part of it is requested.
• Where in the cache does the data go? When the cache is full, we can’t
bring anything in without kicking something out. Ideally, we want to
keep data that will be used again soon and replace data that won’t.
The answers to these questions make up the cache policy. Near the top of
the hierarchy, cache policies tend to be simple because they have to be fast
and they are implemented in hardware. Near the bottom of the hierarchy,
there is more time to make decisions, and well-designed policies can make
a big difference.
Most cache policies are based on the principle that history repeats itself;
if we have information about the recent past, we can use it to predict the
immediate future. For example, if a block of data has been used recently,
we expect it to be used again soon. This principle suggests a replacement
policy called least recently used, or LRU, which removes from the cache a
block of data that has not been used recently. For more on this topic, see
http://en.wikipedia.org/wiki/Cache_algorithms.
7.6 Paging
In systems with virtual memory, the operating system can move pages back
and forth between memory and storage. As I mentioned in Section 6.2, this
mechanism is called paging or sometimes swapping.
3. If there are no free pages, the paging system chooses a victim page
belonging to Process B. It copies the contents of the victim page from
memory to disk, then it modifies the page table for Process B to indi-
cate that this page is swapped out.
4. Once the data from Process B is written, the page can be reallocated
to Process A. To prevent Process A from reading Process B’s data, the
page should be cleared.
5. At this point the call to sbrk can return, giving malloc additional space
in the heap. Then malloc allocates the requested chunk and returns.
Process A can resume.
7. When the operating system handles the interrupt, it sees that the page
is swapped out, so it transfers the page back from disk to memory.
When paging works well, it can greatly improve the utilization of physical
memory, allowing more processes to run in less space. Here’s why:
• Most processes don’t use all of their allocated memory. Many parts of
the text segment are never executed, or execute once and never again.
Those pages can be swapped out without causing any problems.
• On most systems, there are processes like daemons that sit idle most of
the time and only occasionally “wake up” to respond to events. While
they are idle, these processes can be swapped out.
• A user might have many windows open, but only a few are active at
a time. The inactive processes can be swapped out.
If you add up the total memory allocated to all processes, it can greatly
exceed the size of physical memory, and yet the system can still behave
well.
Up to a point.
When a process accesses a page that’s swapped out, it has to get the data
back from disk, which can take several milliseconds. The delay is often
noticeable. If you leave a window idle for a long time and then switch back
to it, it might start slowly, and you might hear the disk drive working while
pages are swapped in.
Occasional delays like that might be acceptable, but if you have too many
processes using too much space, they start to interfere with each other.
When Process A runs, it evicts the pages Process B needs. Then when B
runs, it evicts the pages A needs. When this happens, both processes slow
to a crawl and the system can become unresponsive. This scenario is called
thrashing.
Chapter 8
Multitasking
In many current systems, the CPU contains multiple cores, which means it
can run several processes at the same time. In addition, each core is capable
of multitasking, which means it can switch from one process to another
quickly, creating the illusion that many processes are running at the same
time.
The part of the operating system that implements multitasking is the kernel.
In a nut or seed, the kernel is the innermost part, surrounded by a shell. In
an operating system, the kernel is the lowest level of software, surrounded
by several other layers, including an interface called a shell. Computer
scientists love extended metaphors.
Then, usually, it is the responsibility of the interrupt handler to save the rest
of the hardware state before it does anything that might modify it, and then
restore the saved state before the interrupted process resumes.
1. When the interrupt occurs, the hardware saves the program counter
in a special register and jumps to the appropriate interrupt handler.
2. The interrupt handler stores the program counter and the status reg-
ister in memory, along with the contents of any data registers it plans
to use.
3. The interrupt handler runs whatever code is needed to handle the in-
terrupt.
If this mechanism works correctly, there is no way for the interrupted pro-
cess to know there was an interrupt, unless it detects the change in time
between instructions.
When an interrupt occurs, the CPU switches to kernel mode before jumping
into the interrupt handler.
• Ready, if the process could be running, but isn’t, usually because there
are more runnable processes than cores.
• Done, if the process has completed, but has exit status information
that has not been read yet.
Here are the events that cause a process to transition from one state to an-
other:
• When a process calls exit, the interrupt handler stores the exit code
in the PCB and changes the process’s state to done.
8.5 Scheduling
As we saw in Section 2.5, there might be hundreds of processes on a com-
puter, but usually most of them are blocked. Usually there are only a few
processes that are ready or running. When an interrupt occurs, the sched-
uler decides which process to start or resume.
Usually the scheduler doesn’t have much information about what processes
are doing, so its decisions are based on a few heuristics:
• If a process is likely to run for a short time and then make a blocking
request, it should probably run immediately, for two reasons: (1) if
• If a process makes a request and blocks before its time slice is com-
plete, it is more likely to be interactive or I/O-bound, so its priority
should go up.
• If a task blocks for a long time and then becomes ready, it should get
a priority boost so it can respond to whatever it was waiting for.
• The system call nice allows a process to decrease (but not increase)
its own priority, allowing programmers to pass explicit information to
the scheduler.
Chapter 9
Threads
When I mentioned threads in Section 2.5, I said you could think of a thread
as a kind of process. But that is not completely correct; now I will provide a
more careful explanation.
When you create a process, the operating system creates a new address
space, which includes the text segment, constant segment, and heap; it also
creates a new thread of execution, which includes the program counter and
other hardware state, and the call stack.
The processes we have seen so far are single-threaded, which means that
only one thread of execution runs in each address space. In this chapter,
you will learn about multi-threaded processes that have multiple threads
running in the same address space.
Within a single process, all threads share the same text segment, so they run
the same code. But different threads often run different parts of the code.
Threads also share the same global segment, so if one thread changes a
global variable, other threads see the change. And they share the heap, so
threads can share dynamically-allocated chunks.
But each thread has its own stack, so threads can call functions without
interfering with each other. Usually threads don’t access each other’s local
variables (and sometimes they can’t).
The example code for this chapter is in the repository for this book, in a
directory named counter. For information on downloading this code, see
Section 0.1.
• When you compile the program, you link it with the Pthread library.
pthread_t make_thread(void *(*entry)(void *), Shared *shared)
{
    pthread_t thread;
    int n = pthread_create(&thread, NULL, entry, (void *) shared);
    if (n != 0) {
        perror("pthread_create failed");
        exit(-1);
    }
    return thread;
}
make_thread is a wrapper I wrote to make pthread_create easier to use,
and to provide error-checking.
make_thread returns a value of type pthread_t, which you can think of
as an id or “handle” for the new thread.
The wrapper join_thread takes the handle of the thread you want to wait
for. All it does is call pthread_join and check the result.
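The wrapper itself is not shown in this excerpt; a minimal sketch consistent with that description (the error handling details are my assumption):

void join_thread(pthread_t thread)
{
    int n = pthread_join(thread, NULL);   // wait for the thread to finish
    if (n != 0) {
        perror("pthread_join failed");
        exit(-1);
    }
}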
Any thread can join any other thread, but in the most common pattern the
parent thread creates and joins all child threads. Continuing the example
from the previous section, here’s the code that waits on the children:
for (i=0; i<NUM_CHILDREN; i++) {
join_thread(child[i]);
}
This loop waits for the children one at a time in the order they were created.
There is no guarantee that the child threads complete in that order, but this
loop works correctly even if they don’t. If one of the children is late, the loop
might have to wait, and other children might complete in the meantime. But
regardless, the loop exits only when all children are done.
If you have downloaded the repository for this book (see Section 0.1), you’ll
find this example in counter/counter.c. You can compile and run it like
this:
$ make counter
gcc -Wall counter.c -o counter -lpthread
$ ./counter
When I ran it with 5 children, I got the following output:
counter = 0
counter = 0
counter = 1
counter = 0
counter = 3
When you run it, you will probably get different results. And if you run it
again, you might get different results each time. What’s going on?
Here is a sequence of events that could explain the output in the previous
section:
Child A reads 0
Child B reads 0
Child C reads 0
Child A prints 0
Child B prints 0
Child A sets counter=1
Child D reads 1
Child D prints 1
Child C prints 0
Child B sets counter=2
Child C sets counter=3
Child E reads 3
Child E prints 3
Child D sets counter=4
Child E sets counter=5
Each time you run the program, threads might be interrupted at different
points, or the scheduler might choose different threads to run, so the se-
quence of events, and the results, will be different.
I have written a small module called mutex.c that provides mutex objects.
I’ll show you how to use it first; then I’ll explain how it works.
9.5 Mutex
My definition of Mutex is a wrapper for a type called pthread_mutex_t,
which is defined in the POSIX threads API.
One of the problems with this API is that pthread_mutex_t behaves like a
structure, so if you pass it as an argument, it makes a copy, which makes the
mutex behave incorrectly. To avoid that, you have to pass the mutex by
address (that is, by pointer).
My code makes it easier to get that right. It defines a type, Mutex, which is
just a more readable name for pthread_mutex_t:
#include <pthread.h>

typedef pthread_mutex_t Mutex;
The functions to lock and unlock the mutex are simple wrappers for POSIX
functions:
void mutex_lock(Mutex *mutex)
{
int n = pthread_mutex_lock(mutex);
if (n != 0) perror_exit("lock failed");
}
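The unlock wrapper is not shown in this excerpt; presumably it mirrors mutex_lock. A sketch:

void mutex_unlock(Mutex *mutex)
{
    int n = pthread_mutex_unlock(mutex);
    if (n != 0) perror_exit("unlock failed");
}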
Chapter 10
Condition variables
I’ll start with a simple queue that is not thread safe, then we’ll see what goes
wrong and fix it. The code for this example is in the repository for this book,
in a folder called queue. The file queue.c contains a basic implementation of
a circular buffer. Here is the initial, unsynchronized version of queue_push:

void queue_push(Queue *queue, int item) {
    if (queue_full(queue)) {
        perror_exit("queue is full");
    }

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
}
If the queue is full, queue_push prints an error message and exits. I will
explain queue_full soon.
If the queue is not full, queue_push inserts the new element and then incre-
ments next_in using queue_incr:
int queue_incr(Queue *queue, int i)
{
return (i+1) % queue->length;
}
When the index, i, gets to the end of the array, it wraps around to 0. And
that’s where we run into a tricky part. If we keep adding elements to the
queue, eventually next_in wraps around and catches up with next_out.
But if next_in == next_out, we would incorrectly conclude that the queue
was empty.
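The check for an empty queue, queue_empty, is not shown in this excerpt; presumably it is just this comparison:
int queue_empty(Queue *queue)
{
    return (queue->next_in == queue->next_out);
}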
To avoid that, we define another special case to indicate that the queue is
full:
int queue_full(Queue *queue)
{
    return (queue_incr(queue, queue->next_in) == queue->next_out);
}
If incrementing next_in lands on next_out, that means we can’t add an-
other element without making the queue seem empty. So we stop one el-
ement before the “end” (keeping in mind that the end of the queue can be
anywhere, not necessarily the end of the array).
Now we can write queue_pop, which removes and returns the next element
from the queue:
int queue_pop(Queue *queue) {
    if (queue_empty(queue)) {
        perror_exit("queue is empty");
    }
    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    return item;
}
If you try to pop from an empty queue, queue_pop prints an error message
and exits.
The threads in this example share a single queue, which is wrapped in a
structure called Shared:
typedef struct {
    Queue *queue;
} Shared;

Shared *make_shared()
{
    Shared *shared = check_malloc(sizeof(Shared));
    shared->queue = make_queue(QUEUE_LENGTH);
    return shared;
}
The code we have so far is a good starting place, but it has several problems:
• Access to the queue is not thread safe. Different threads could access
array, next_in, and next_out at the same time and leave the queue in
a broken, inconsistent state.
• If a consumer finds the queue empty, or a producer finds it full, the
current code simply exits; it would usually be better for the thread to
block until it can proceed.
In the next section, we solve the first problem with a Mutex. In the following
section, we solve the second problem with condition variables.
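The code that makes this change is not shown in this excerpt. Presumably the Queue structure gains a Mutex field (the name mutex matches the code below), and make_queue initializes it with make_mutex(); a sketch of the modified structure:
typedef struct {
    int *array;
    int length;
    int next_in;
    int next_out;
    Mutex *mutex;    //-- new
} Queue;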
Here is the thread-safe version of queue_push; the lines marked new are the
additions:
void queue_push(Queue *queue, int item) {
    mutex_lock(queue->mutex);          //-- new
    if (queue_full(queue)) {
        mutex_unlock(queue->mutex);    //-- new
        perror_exit("queue is full");
    }
    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    mutex_unlock(queue->mutex);        //-- new
}
Before checking whether the queue is full, we have to lock the Mutex. If
the queue is full, we have to unlock the Mutex before exiting; otherwise the
thread would leave it locked and no other threads could proceed.
Note that the helper functions, like queue_full and queue_incr, do not lock
the mutex themselves; anyone who calls them is required to lock the mutex
first, and this requirement is part of the documented interface for these
functions.
With this additional code, the queue is thread safe; if you run it, you should
not see any synchronization errors. But it is likely that the consumer will
exit at some point because the queue is empty, or the producer will exit
because the queue is full, or both.
Similarly, queue_push might want to check whether the queue is full and,
if so, block until it is not full.
I’ll handle the first condition here, and you will have a chance to handle the
second condition as an exercise.
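The code for this step is not shown above, so here is a sketch of what the condition-variable version of queue_pop might look like, based on the description that follows; it assumes the Queue structure has gained a field Cond *nonempty, created with make_cond:
int queue_pop(Queue *queue) {
    mutex_lock(queue->mutex);
    while (queue_empty(queue)) {
        // releases the mutex while waiting, reacquires it before returning
        cond_wait(queue->nonempty, queue->mutex);
    }
    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    mutex_unlock(queue->mutex);
    return item;
}
And here is a sketch of the corresponding queue_push, which still exits if the queue is full (handling that case is the exercise) but signals nonempty after adding an item:
void queue_push(Queue *queue, int item) {
    mutex_lock(queue->mutex);
    if (queue_full(queue)) {
        mutex_unlock(queue->mutex);
        perror_exit("queue is full");
    }
    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    mutex_unlock(queue->mutex);
    cond_signal(queue->nonempty);    // wake a consumer waiting for an item
}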
But before returning, it does one more thing: it “signals” the condition vari-
able nonempty.
If there are threads waiting on the condition variable, one of them gets un-
blocked and resumes execution of cond_wait. But before the awakened
thread can return from cond_wait, it has to wait for and lock the Mutex,
again.
Now go back to queue_pop and see what happens when the thread returns
from cond_wait. It loops back to the top of the while loop and checks the
condition again. I’ll explain why in just a second, but for now let’s assume
that the condition is true; that is, the queue is not empty.
When the consumer thread exits the while loop, we know two things: (1)
the condition is true, so there is at least one item in the queue, and (2) the
Mutex is locked, so it is safe to access the queue.
In the next section I’ll show you how my Cond code works, but first I want
to answer two frequently-asked questions:
• The first question is why cond_wait appears inside a while loop rather
than an if statement; that is, why we have to check the condition again
after returning from cond_wait.
• The other question that comes up when people learn about condition
variables is “How does the condition variable know what condition it
is associated with?”
Cond is a more readable name for pthread_cond_t, and make_cond (truncated
in this excerpt) presumably allocates and initializes one:
Cond *make_cond()
{
    Cond *cond = check_malloc(sizeof(Cond));
    int n = pthread_cond_init(cond, NULL);
    if (n != 0) perror_exit("make_cond failed");
    return cond;
}
And here are the wrappers for cond_wait and cond_signal.
void cond_wait(Cond *cond, Mutex *mutex) {
    int n = pthread_cond_wait(cond, mutex);
    if (n != 0) perror_exit("cond_wait failed");
}
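The cond_signal wrapper, not shown in this excerpt, presumably follows the same pattern:
void cond_signal(Cond *cond)
{
    int n = pthread_cond_signal(cond);
    if (n != 0) perror_exit("cond_signal failed");
}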
Semaphores in C
Semaphores are a good way to learn about synchronization, but they are
not as widely used, in practice, as mutexes and condition variables.
This chapter presents a C API for working with semaphores, along with my
wrapper code that makes the API easier to use. And it presents a final challenge: can
you write an implementation of a semaphore using mutexes and condition
variables?
The code for this chapter is in directory semaphore in the repository for this
book (see Section 0.1).
semaphore_wait(mutex);
// protected code goes here
semaphore_signal(mutex);
When you use a semaphore as a mutex, you usually initialize it to 1 to indi-
cate that the mutex is unlocked; that is, one thread can pass the semaphore
without blocking.
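With the make_semaphore wrapper used later in this chapter, that initialization is a single line (shown here as a sketch):
Semaphore *mutex = make_semaphore(1);    // 1 means one thread can proceed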
Here I am using the variable name mutex to indicate that the semaphore is
being used as a mutex. But remember that the behavior of a semaphore is
not the same as a Pthread mutex.
The queue constructor initializes the three semaphores; the end of it looks
like this:
    queue->mutex = make_semaphore(1);
    queue->items = make_semaphore(0);
    queue->spaces = make_semaphore(length-1);
    return queue;
}
mutex is used to guarantee exclusive access to the queue; the initial value is
1, so the mutex is initially unlocked.
items is the number of items in the queue, which is also the number of
consumer threads that can execute queue_pop without blocking. Initially
there are no items in the queue.
spaces is the number of empty spaces in the queue, which is the number
of producer threads that can execute queue_push without blocking. Initially
the number of spaces is the capacity of the queue, which is length-1, as
explained in Section 10.1.
Here is the new version of queue_push, which is run by producer threads:
void queue_push(Queue *queue, int item) {
    semaphore_wait(queue->spaces);
    semaphore_wait(queue->mutex);
    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    semaphore_signal(queue->mutex);
    semaphore_signal(queue->items);
}
Notice that queue_push doesn’t have to call queue_full any more; instead,
the semaphore keeps track of how many spaces are available and blocks
producers if the queue is full.
Here is the new version of queue_pop:
int queue_pop(Queue *queue) {
    semaphore_wait(queue->items);
    semaphore_wait(queue->mutex);
    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    semaphore_signal(queue->mutex);
    semaphore_signal(queue->spaces);
    return item;
}
Using the code in the repo for this book, you should be able to compile and
run this solution:
$ make queue_sem
$ ./queue_sem
Before you go on, you might want to try this as an exercise: write functions
that implement the semaphore API in sem.h using condition variables
and mutexes. You can put your solution in mysem.c and mysem.h
in the repo for this book, and you’ll find my solution in mysem_soln.c and
mysem_soln.h.
If you have trouble getting started, you can use the following structure def-
inition, from my solution, as a hint:
typedef struct {
    int value, wakeups;
    Mutex *mutex;
    Cond *cond;
} Semaphore;
value is the value of the semaphore. wakeups counts the number of pend-
ing signals; that is, the number of threads that have been woken but
have not yet resumed execution. The reason for wakeups is to make sure
that our semaphores have Property 3, described in The Little Book of
Semaphores.
mutex provides exclusive access to value and wakeups; cond is the condition
variable threads wait on if they wait on the semaphore.
make_semaphore initializes the structure:
Semaphore *make_semaphore(int value)
{
    Semaphore *semaphore = check_malloc(sizeof(Semaphore));
    semaphore->value = value;
    semaphore->wakeups = 0;
    semaphore->mutex = make_mutex();
    semaphore->cond = make_cond();
    return semaphore;
}
Here is the code for semaphore_wait:
void semaphore_wait(Semaphore *semaphore)
{
    mutex_lock(semaphore->mutex);
    semaphore->value--;
    if (semaphore->value < 0) {
        do {
            cond_wait(semaphore->cond, semaphore->mutex);
        } while (semaphore->wakeups < 1);
        semaphore->wakeups--;
    }
    mutex_unlock(semaphore->mutex);
}
When a thread waits on the semaphore, it has to lock the mutex before it
decrements value. If the value of the semaphore becomes negative, the
thread blocks until a “wakeup” is available. While it is blocked, the mutex
is unlocked, so another thread can signal.
Here is the code for semaphore_signal:
void semaphore_signal(Semaphore *semaphore)
{
    mutex_lock(semaphore->mutex);
    semaphore->value++;
    if (semaphore->value <= 0) {
        semaphore->wakeups++;
        cond_signal(semaphore->cond);
    }
    mutex_unlock(semaphore->mutex);
}
Again, a thread has to lock the mutex before it increments value. If the
semaphore was negative, that means threads are waiting, so the signalling
thread increments wakeups and signals the condition variable.
At this point one of the waiting threads might wake up, but the mutex is
still locked until the signalling thread unlocks it.
At that point, one of the waiting threads returns from cond_wait and checks
whether a wakeup is still available. If not, it loops and waits on the condi-
tion variable again. If so, it decrements wakeups, unlocks the mutex, and
exits.
One thing about this solution that might not be obvious is the use of a
do...while loop. Can you figure out why it is not a more conventional
while loop? What would go wrong?
The problem is that with a while loop this implementation would not have
Property 3. It would be possible for a thread to signal and then run around
and catch its own signal.
With the do...while loop, it is guaranteed that when a thread signals, one
of the waiting threads will get the signal, even if the signalling thread runs
around and gets the mutex before one of the waiting threads resumes.