Osdev Notes
2025-01-19

Ivan G.
Dean T.
Contents

1 Welcome
  1.1 Structure Of The Book
    1.1.1 Topics covered
  1.2 Assumed Knowledge
  1.3 About The Authors
  1.4 Our Projects

I Build Process

2 Build Process
  2.1 Freestanding Environment
  2.2 Cross Compilation
  2.3 Differences Between GCC and Clang
  2.4 Setting up a build environment
    2.4.1 Getting a Cross Compiler
    2.4.2 Building C Source Files
    2.4.3 Building C++ Source Files
  2.5 Linking Object Files Together
    2.5.1 Building with Makefiles
  2.6 Quick Addendum: Easily Generating a Bootable Iso
  2.7 Testing with An Emulator
  2.8 Building and Using Debugging Symbols
    2.8.1 Multiboot 2
    2.8.2 Limine Protocol
    2.8.3 ELFs Ahead, Beware!
    2.8.4 Locating The Symbol Table

3 Boot Protocols
  3.1 What about the earlier versions?
  3.2 Why A Bootloader At All?
  3.3 Multiboot 2
    3.3.1 Creating a Boot Shim
    3.3.2 Creating a Multiboot 2 Header
  3.4 Limine Protocol
    3.4.1 Fancy Features
    3.4.2 Creating a Limine Header
  3.5 Finding Bootloader Tags
    3.5.1 Multiboot 2
    3.5.2 Limine

4 Makefiles
  4.1 GNUMakefile vs Makefile
  4.2 Simple Makefile Example
  4.3 Built In Variables
  4.4 Complex Makefile Example (with recursion!)
    4.4.1 Think Bigger!

5 Linker Scripts
  5.1 Anatomy of a Script
    5.1.1 LMA (Load Memory Address) vs VMA (Virtual Memory Address)
    5.1.2 Adding Symbols
  5.2 Program Headers
    5.2.1 Example
  5.3 Sections
    5.3.1 The '.' Operator
    5.3.2 Incoming vs Outgoing Sections
  5.4 Common Options
  5.5 Complete Example

8 Hello World
  8.1 Printing to Serial
    8.1.1 Initialization
    8.1.2 Sending a string
    8.1.3 Printing Digits
    8.1.4 Troubleshooting

12 ACPI Tables
  12.1 RSDP and RSDT/XSDT
    12.1.1 RSDP
    12.1.2 RSDT Data structure and fields
    12.1.3 Some useful infos

13 APIC
  13.1 What is APIC
  13.2 Types of APIC
  13.3 Local APIC
    13.3.1 Disabling The PIC8259
    13.3.2 Discovering the Local APIC
    13.3.3 Enabling the Local APIC, and The Spurious Vector
    13.3.4 Reading APIC Id and Version
    13.3.5 Local Vector Table
    13.3.6 X2 APIC
    13.3.7 Handling Interrupts
    13.3.8 Sending An Inter-Processor Interrupt
  13.4 I/O APIC
    13.4.1 Configure the I/O APIC
    13.4.2 Getting the I/O APIC address
    13.4.3 I/O APIC Registers
    13.4.4 Reading data from I/O APIC
    13.4.5 Interrupt source overrides
    13.4.6 IO Redirection Table (IOREDTBL)

14 Timer
  14.1 Types and Characteristics
    14.1.1 Calibrating Timers
  14.2 Programmable Interval Timer (PIT)
    14.2.1 Theory Of Operation
    14.2.2 Example
  14.3 High Precision Event Timer (HPET)
    14.3.1 Discovery
    14.3.2 Theory Of Operation
    14.3.3 Comparators
    14.3.4 Example

22 Paging
  22.1 What is Paging?
    22.1.1 Page
    22.1.2 Page Directories and Tables
    22.1.3 Virtual (or Logical) Address
  22.2 Paging in Long Mode
  22.3 Page Directories and Table Structure
    22.3.1 Loading the root table and enable paging
    22.3.2 PML4 & PDPR & PD
    22.3.3 Page Table
    22.3.4 Page Table/Directory Entry Fields
  22.4 Address translation
    22.4.1 Address Translation Using 2MB Pages
    22.4.2 Address translation Using 4KB Pages
  22.5 Page Fault
  22.6 Accessing Page Tables and Physical Memory
    22.6.1 Recursive Paging
    22.6.2 Direct Map
    22.6.3 Troubleshooting

VI Userspace

29 All About Userspace
  29.1 Some Terminology
  29.2 A Change in Perspective

X Conclusion

Appendices

A Troubleshooting
  A.1 Unexpected UD/Undefined Opcode exception in x86_64
    A.1.1 Why does this happen?
    A.1.2 The easy solution
    A.1.3 The hard solution

F Debugging
  F.1 GDB
    F.1.1 Remote Debugging
    F.1.2 Useful Commands
    F.1.3 Navigation
    F.1.4 Print and Examine Memory
    F.1.5 How Did I Get Here?
    F.1.6 Breakpoints
    F.1.7 Variables
    F.1.8 TUI - Text User Interface
  F.2 Virtual Box
    F.2.1 Useful Commands

I Acknowledgments

K Licence
  K.1 Attribution-NonCommercial 4.0 International
    K.1.1 Using Creative Commons Public Licenses
    K.1.2 Creative Commons Attribution-NonCommercial 4.0 International Public License
    K.1.3 Section 5 – Disclaimer of Warranties and Limitation of Liability.
Chapter 1
Welcome
Whether you’re reading this online, or in a book, welcome to our collection of notes about operating systems
development! We’ve written these while writing (and re-writing) our own kernels, with the intent of guiding a
reader through the various stages of building an operating system from scratch. We’ve tried to focus more on
the concepts and theory behind the various components, with the code only provided to help solidify some
concepts.
We hope you enjoy, and find something interesting here!
• Userspace - Many modern architectures support different levels of privilege, meaning that programs running at lower levels can't access resources/data reserved for higher levels.
• Inter-Process Communication (IPC) - This part looks at how we might implement IPC for our kernel,
allowing isolated programs to communicate with each other in a controlled way.
• Virtual File System (VFS) - This part will cover how a kernel presents different file systems to the rest
of the system. We’ll also take a look at implementing a ‘tempfs’ that is loaded from a tape archive (tar),
similar to initrd.
• The ELF format - Once we have a file system we can load files from it, why not load a program? This
part looks at writing a simple program loader for ELF64 binaries, and why you would want to use this
format.
• Going Beyond - The final part (for now). We have implemented all the core components of a kernel,
and we are free to go from here. This final chapter contains some ideas for new components that we
might want to add, or at least begin thinking about.
In the appendices we cover various topics, from debugging tips to language-specific information and troubleshooting.
Code can have bugs, and freestanding code can be hard (or impossible) to debug in some cases. Some hardware does not include serial ports, and real CPUs can have hardware bugs or architectural quirks you're unaware of that interfere with developing for them.
As such, below is a list of the recommended prior experience before continuing with this book:
• Intermediate understanding of the C programming language. Mastery is not required, but you should
be very familiar with the ins and outs of the language, especially pointers and pointer arithmetic.
• You should be comfortable compiling and debugging code in userspace. GDB is recommended as several
emulators provide a GDB server you can use to step through your kernel.
• Knowledge and experience using common data structures like intrusive linked lists. While we may use
array notation at several points to help visualize what’s going on, you won’t want to place arbitrary
limits on your kernel by using fixed size arrays.
If you feel confident in your knowledge of the above, please read on! If not, don’t be discouraged. There are
plenty of resources available for learning, and you can always come back later.
I'm a hobbyist programmer, and have been working on my operating system kernel, northport, since 2021. I've experimented with a few other projects in that time, namely a micro-kernel and a window manager. Before getting into osdev my programming interests were game engines and system utilities. The first programming project I finished was a task manager clone in C#. These days C++ is my language of choice; I like the freedom the language offers, even if it's the freedom to cause a triple fault. - Dean.
Part I
Build Process
Chapter 2
Build Process
An OS, like any other project, needs to be built and packaged, but it differs from other programs: we don't only need to compile it, we also need different tools and steps to produce an image that a bootloader can load. And booting it requires yet another step.
In this part we are going to explore all the tools and steps needed to put together an initial set of build scripts for our OS, and we'll also explore some options for the compilers and the bootloader that can be used for our kernel.
In this chapter we'll take a global overview of the build process, touching briefly on the steps involved and the tools that are going to be used.
Then in the Boot Protocols and Bootloaders chapter we will explore in detail how to boot a kernel, and describe two options that can be used for the boot process: Multiboot and Limine.
The Makefiles chapter will explain how to automate the build process. Even if initially it's just a bunch of files and everything can be done manually, the process soon grows more complex, and having it automated will be more than useful; we will use a makefile for our build script.
One of the most obscure steps, always present when building any software but hidden from us until we start to roll our own kernel, is the linking process. The Linker Scripts chapter will introduce us to the world of linking files and explain how to write a linker script.
Finally the kernel is built, but not ready to run yet: we need to copy it onto a bootable medium. The Generating A Bootable Iso chapter will show how to create a bootable iso of our kernel, so we can finally launch it and see the results of our hard work.
For the rest of this part, a basic knowledge of compiling C (and optionally C++) is assumed.
Since we will be writing the operating system, we can’t depend on any functionality that requires our host
operating system, like these libraries. A program like this is called a freestanding program. It has no external
dependencies.
Authors note: technically your kernel can depend on some utility libraries, or sections of the compiler runtime.
However the idea is you should build your code with nothing extra added by default, and only add things back
in that are also freestanding.
In a freestanding environment, we should assume nothing. That includes the standard library (as it requires OS support to work). Our program will also need a special linker script in order to run properly, since the linker won't know where to start the program. Linker scripts are expanded on below, as well as in their own chapter.
Both C and C++ have several freestanding headers. The common ones are stdint.h and stddef.h for C/C++, and utility and type_traits for C++. There are a few others, and compiler vendors will often supply extra freestanding headers: GCC and Clang provide ones like cpuid.h as a helper for x86 cpuid functions, for example.
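As a tiny illustration, here is a freestanding C file using only those headers. The name kernel_main and the use of the VGA text buffer are just assumptions for this sketch:

#include <stdint.h>
#include <stddef.h>

//no libc and no main(): the entry point is whatever we tell the linker.
void kernel_main(void) {
    volatile uint8_t* vga = (volatile uint8_t*)0xB8000; //VGA text buffer
    const char msg[] = "Hi";
    for (size_t i = 0; msg[i] != 0; i++) {
        vga[i * 2] = (uint8_t)msg[i]; //character byte
        vga[i * 2 + 1] = 0x07;        //attribute: light grey on black
    }
    while (1) {} //nothing to return to, so hang
}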
Shorthand    Meaning
$(CC)        C compiler (the cross-compiler version we just set up)
$(CXX)       C++ compiler (cross-compiler version)
$(LD)        Linker (again, the cross-compiler version)
$(C_SRCS)    All the C source files to compile
$(OBJS)      All the object files to link
If using clang, be sure to remember to pass --target=xyz (e.g. --target=x86_64-elf) with each command. This is not necessary with GCC, since each cross-compiler build of GCC targets a single platform.
• -mno-80387: Not strictly necessary, but tells the compiler that the FPU is not available, and to process
floating point calculations in software instead of hardware, and to not use the FPU registers.
• -mno-mmx: Disables using the FPU registers for 64-bit integer calculations.
• -mno-3dnow: Disables 3dnow! extensions, similar to MMX.
• -mno-sse -mno-sse2: Disables SSE and SSE2, which use the 128-bit xmm registers, and require setup
before use.
• -mcmodel=kernel: The compiler uses ‘code models’ to help optimize code generation depending on where in memory the code might run. The medium code model runs in the lower 2GiB, while the large model runs anywhere in the 64-bit address space. We could use large for our kernel, but if the kernel is loaded in the top-most 2GiB of the address space, the kernel code model can be used, which allows similar optimizations to medium.
There are also a few other compiler flags that are useful, but not necessary:
• -fno-stack-protector: Disables stack protector checks, which use the compiler library to check for
stack smashing attacks. Since we’re not including the standard libraries, we can’t use this unless we
implement the functions ourselves. Not really worth it.
• -fno-omit-frame-pointer: Sometimes the compiler will skip creating a new stack frame for optimization reasons. This will mess with stack traces, while keeping frame pointers only increases memory usage by a few bytes here and there. Well worth having.
• -Wall and -Wextra: These flags need no introduction, they just enable all default warnings, and then
extra warnings on top of that. Some people like to use -Wpedantic as well, but it can cause some false
positives.
• VirtualBox/VMWare. These are grouped together as they’re more industrial virtualization software.
They aim to be as fast as possible, and provide little to no debug functionality. Useful for testing
compatibility, but not day-to-day development.
We’ll be using qemu for this example, and assuming the output filename of the iso is contained in the makefile
variable ISO_FILENAME.
# runs our kernel
run:
	qemu-system-x86_64 -cdrom $(ISO_FILENAME)

run-with-kvm:
	qemu-system-x86_64 -cdrom $(ISO_FILENAME) --enable-kvm
There are a few other qemu flags we might want to be aware of:
• -machine xyz changes the machine that qemu emulates to xyz. To get a list of supported machines, use -machine help. Recommended is -machine q35, as it provides some modern features like the MCFG ACPI table for accessing pci over mmio instead of over IO ports.
• -smp is used to configure how many processors the VM has, and their layout. If wanting to support smp, it's recommended to enable this early on, as it's easier to fix smp bugs as they're introduced, rather than fixing them all at once if we add smp support later. To emulate a simple quad-core cpu use -smp cores=4.
• -monitor: qemu provides a built-in monitor for debugging. Super useful! It's always available in its own tab (under view->monitor), but we can move the monitor to the terminal that was used to launch qemu using -monitor stdio. The built-in terminal is fairly basic, so this is recommended.
• -m xyz is used to set the amount of ram given to the VM. It supports common suffixes like ‘M’ for MiB,
‘G’ for GiB and so on.
• -cpu xyz sets the cpu model to emulate. Like -machine, a list can be viewed by running qemu with -cpu help. There are some special options like ‘host’, which will try to emulate the host's cpu, or ‘qemu64’, which provides a generic cpu with as many host-supported features as possible. There is also ‘max’, which provides every feature possible, either through kvm or software implementations.
• -d enables debug traces of certain things. -d int is the most useful, logging any interrupts that occur. If running with uefi instead of bios we may get a lot of SMM enter/exit interrupts during boot; these can be excluded from the log by using -d int -M smm=off.
• -D sets the output for the debug log. If not specified this is stdout, but we can redirect it to anywhere.
• -S pauses the emulator before actually running any code. Useful for attaching a debugger early on.
• -s creates a gdb server on port 1234. Inside of gdb we can attach to this server and debug our
kernel/bootloader using target remote :1234.
• -no-reboot: when qemu encounters a triple fault, it normally resets the machine (meaning it restarts, and runs from the bootloader again). This flag instead tells qemu to pause the virtual machine immediately after the faulting instruction. Very useful for debugging!
• -no-shutdown: some configurations of qemu will shutdown if -no-reboot is specified, instead of pausing the VM. This flag forces qemu to stay open, but paused.
2.8.1 Multiboot 2
Multiboot 2 provides the Elf-Symbols tag (section 3.6.7 of the spec) to the kernel, which provides the elf section headers and the location of the string table. Using these is described below in the Limine section.
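For reference, the tag's layout (mirroring the header that accompanies the multiboot 2 specification) looks like this:

#include <stdint.h>

struct multiboot_tag_elf_sections {
    uint32_t type;     //9 for this tag
    uint32_t size;     //total size of the tag
    uint32_t num;      //number of section headers
    uint32_t entsize;  //size of each section header
    uint32_t shndx;    //index of the section holding the string table
    char sections[];   //the section headers themselves
};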
//Limine Protocol: this assumes a limine_kernel_file_request was declared
//elsewhere (like the framebuffer request shown in the boot protocols
//chapter) and that the bootloader answered it.
struct limine_kernel_file_response* file_response = kernel_file_request.response;
struct limine_file* kernel_file = file_response->kernel_file;

Elf64_Ehdr* hdr = (Elf64_Ehdr*)kernel_file->address;
Elf64_Shdr* shdrs = (Elf64_Shdr*)((uintptr_t)kernel_file->address + hdr->e_shoff);
const char* strtab_data = (const char*)kernel_file->address + shdrs[hdr->e_shstrndx].sh_offset;
To find the symbol table, iterate through the section headers until one with the name .symtab is found. As a
reminder, the name of a section header is stored as an offset into the string table data. For example:
Elf64_Shdr* example_shdr;
const char* name = strtab_data + example_shdr->sh_name;
Now all that’s left is a function that parses the symbol table. It’s important to note that some symbols only
occupy a single address, like a label or a variable, while others will occupy a range of addresses. Fortunately
symbols have a size field.
An example function is included below, showing how a symbol can be looked up by its address. The name of
this symbol is then printed, using a fictional print function.
Elf64_Shdr* sym_tab;
const char* strtab_data;
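A minimal sketch of such a function is shown here. It assumes sym_tab points at the .symtab section header found above, strtab_data points at the string table holding the symbol names, kernel_file_addr (a name of our choosing) is the base address of the kernel file in memory, and print() is the fictional print function:

extern void* kernel_file_addr;      //assumed: base of the kernel file in memory
extern void print(const char* str); //the fictional print function

void print_symbol_at(uintptr_t addr) {
    Elf64_Sym* syms = (Elf64_Sym*)((uintptr_t)kernel_file_addr + sym_tab->sh_offset);
    size_t count = sym_tab->sh_size / sym_tab->sh_entsize;

    for (size_t i = 0; i < count; i++) {
        //symbols with a size occupy [st_value, st_value + st_size);
        //zero-sized symbols (like labels) only match their exact address.
        uintptr_t start = syms[i].st_value;
        uintptr_t end = start + syms[i].st_size;
        if (addr == start || (addr >= start && addr < end)) {
            print(strtab_data + syms[i].st_name);
            return;
        }
    }
    print("<unknown symbol>");
}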
Chapter 3
Boot Protocols
A boot protocol defines the machine state when the kernel is given control by the bootloader. It also makes
several services available to the kernel, like a memory map of the machine, a framebuffer and sometimes other
utilities like uart or kernel debug symbols.
This chapter covers two protocols, Multiboot 2 and the Limine protocol:
• Multiboot 2 supersedes multiboot 1; both are the native protocols of grub, meaning that anywhere grub is installed, a multiboot kernel can be loaded. This makes testing easy on most linux machines. Multiboot 2 is quite an old, but very robust, protocol.
• The Limine protocol (preceded by Stivale 1 and 2) is the native protocol of the Limine bootloader. The Limine and Stivale protocols were designed many years after Multiboot 2, as an attempt to make hobbyist OS development easier. The Limine protocol is a more complex spec to read through, but it leaves the machine in a more known state prior to handing off to the kernel.
The Limine protocol is based on Stivale 2 (which was covered in earlier versions of this book), with mainly architectural changes, but similar concepts behind it. If familiar with the concepts of Stivale 2, the Limine protocol is easy enough to understand.
All the referenced specifications and documents are provided as links at the start of this chapter/in the
readme.
can’t be assumed everywhere, and some machines sometimes outright break spec. This leads to a few edge cases on some machines, and more on others. It’s a big mess.
This is where a bootloader comes in: a layer of abstraction between the kernel and the mess of PC hardware. It
provides a boot protocol (often many we can choose from), and then ensures that everything in the hardware
world is setup to allow that protocol to function. This is until the kernel has enough drivers set up to take
full control of the hardware itself.
Authors note: Writing a good bootloader is an advanced topic in the osdev world. If new, please use an existing
bootloader. It’s a fun project, but not at the same time as an os. Using an existing bootloader will save you
many issues down the road. And no, an assembly stub to get into long mode is not a bootloader.
3.3 Multiboot 2
For this section we’ll mainly be talking about grub 2. There is a previous version of grub (called grub legacy),
and if we have hardware that must run grub legacy, there are patches for legacy that add most of the version
2 features to it. This is highly recommended.
One such feature is the ability for grub to load 64-bit elf kernels. This greatly simplifies creating a 64-bit OS
with multiboot 2, as previously we would have needed to load a 32-bit elf, and the 64-bit kernel as a module,
and then load the 64-bit elf manually. Effectively re-writing stage3 of the bootloader.
Regardless of what kind of elf is loaded, multiboot 2 is well defined and will always drop us into 32-bit
protected mode, with the cpu in the state as described in the specification. If writing a 64-bit kernel this
means that we will need a hand-crafted 32-bit assembly stub to set up and enter long mode.
One of the major differences between the two protocols is how info is passed between the kernel and bootloader:
• Multiboot 1 has a fixed size header within the kernel, that is read by the bootloader. This limits the
number of options available, and wastes space if not all options are used.
• Multiboot 2 uses a fixed sized header that includes a size field, which contains the number of bytes of
the header + all of the following requests. Each request contains an identifier field and then some
request specific fields. This has slightly more overhead, but is more flexible. The requests are terminated
with a special null request (see the specs on this).
• Multiboot 1 returns info to the kernel via a single large structure, with a bitmap indicating which sections
of the structure are considered valid.
• Multiboot 2 returns a pointer to a series of tags. Each tag has an identifier field, used to determine
its contents, and a size field that can be used to calculate the address of the next tag. This list is also
terminated with a special null tag.
One important note about multiboot 2: the memory map is essentially the map given by the bios/uefi. The
areas used by bootloader memory (like the current gdt/idt), kernel and info structure given to the kernel
are all allocated in free regions of memory. The specification does not say that these regions must then be
marked as used before giving the memory map to the kernel. This is actually how grub handles this, so should
definitely do a sanity check on the memory map.
as a catch-all term for the upper-most 2GB of any address space. Since the exact address will be different
depending on the number of bits used for the address space (32-bit vs 64-bit for example), referring to it as
an underflow value is more portable.
Entering long mode is fairly easy, it requires setting 3 flags:
• PAE (physical address extension), bit 5 in CR4.
• LME (long mode enable), bit 8 in EFER (this is an MSR).
• PG (paging enable), bit 31 in cr0. This MUST be enabled last.
If unfamiliar with paging, there is a chapter that goes into more detail in the memory management part of the book.
Since we will be enabling paging, we'll also need to populate cr3 with a valid paging structure; this needs to be done before setting the PG bit. Generally these initial page tables can be set up using 2MiB pages with the present and writable flags set, and nothing else is needed for the initial pages. A sketch of such tables follows below.
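To make the shape of those initial tables concrete, here is a rough C sketch that identity maps the first 1GiB using 2MiB pages. A real multiboot 2 boot shim does this in 32-bit assembly, but the data layout is identical:

#include <stdint.h>

//each level of the paging structure is 4K-aligned, with 512 8-byte entries.
__attribute__((aligned(0x1000))) uint64_t pml4[512];
__attribute__((aligned(0x1000))) uint64_t pdpt[512];
__attribute__((aligned(0x1000))) uint64_t pd[512];

void setup_initial_paging(void) {
    pml4[0] = (uint64_t)(uintptr_t)pdpt | 0x3; //present | writable
    pdpt[0] = (uint64_t)(uintptr_t)pd | 0x3;
    for (uint64_t i = 0; i < 512; i++) {
        //bit 7 (page size) makes each entry a 2MiB page; 512 of them cover 1GiB
        pd[i] = (i * 0x200000) | 0x83; //present | writable | page size
    }
    //a higher half kernel would also point entries near the top of the
    //pml4/pdpt at these same physical ranges.
}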
We will be operating in compatibility mode, a subset of long mode that pretends to be a protected mode
cpu. This is to allow legacy programs to run in long mode. However we can enter full 64-bit long mode by
reloading the CS register with a far jump or far return. See the GDT notes for details on doing that.
It’s worth noting that this boot shim will need its own linker sections for code and data, since until we have
entered long mode the higher half sections used by the rest of the kernel won’t be available, as we have no
memory at those addresses yet.
KERNEL_START = .;
KERNEL_VIRT_BASE = 0xFFFFFFFF80000000;

.mb2_hdr :
{
    /* Be sure that the multiboot2 header is at the beginning */
    KEEP(*(.mb2_hdr))
}

.mb2_text :
{
    /* Space for the assembly stub to get us into long mode */
    *(.mb2_text)
}

. += KERNEL_VIRT_BASE;
*(.rodata)
}
.section .mb2_text
# backup the address of the mb2 info struct, since ebx may be clobbered
mov %ebx, %edi

# load cr3 with the physical address of the pml4
mov pml4_addr, %eax
mov %eax, %cr3

# enable PAE (bit 5 of cr4)
mov $0x20, %eax
mov %eax, %cr4
the same name) is a place to store local data that doesn't fit in the registers of a platform. This means local variables in a function, or parts of a complex calculation. It's become so universal that it has also adopted other uses over time, like passing function arguments (sometimes) and being used by hardware to inform the kernel of things (the iret frame on x86).
__attribute__((used, section(".requests")))
static volatile struct limine_framebuffer_request framebuffer_request = {
    .id = LIMINE_FRAMEBUFFER_REQUEST,
    .revision = 0
};
The request types are all declared in the limine.h header. Any type of information that needs to be requested from the bootloader will have its own request variable, with its own type.
Finally the requests start and end markers needs to be declared:
__attribute__((used, section(".requests_start_marker")))
static volatile LIMINE_REQUESTS_START_MARKER;

__attribute__((used, section(".requests_end_marker")))
static volatile LIMINE_REQUESTS_END_MARKER;
Again, they can be placed anywhere in the code, since their position will be decided by the section they belong to.
The last detail is to add the kernel start function (the one declared with ENTRY() in the linker script). Since all the bootloader information is provided via static variables, the kernel start function doesn't require any particular signature.
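A sketch of what this could look like, assuming the linker script declares ENTRY(kmain):

//no arguments and no return value: everything from the bootloader
//arrives via the request structs, and there is nothing to return to.
void kmain(void) {
    //check the responses, initialize the kernel...
    while (1) {}
}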
3.5 Finding Bootloader Tags
3.5.1 Multiboot 2
Multiboot 2 gives us a pointer to the multiboot info struct, which contains 2x 32-bit fields. These can
be safely ignored, as the list is null-terminated (a tag with a type 0, and size of 8). The first tag is at
8 bytes after the start of the mbi. All the structures and defines used here are available in the header
provided by the multiboot specification (check the bottom section, in the example kernel), including the
MULTIBOOT_TAG_TYPE_xyz defines (where xyz is a feature described by a tag). For example the memory map
is MULTIBOOT_TAG_TYPE_MMAP, and framebuffer is MULTIBOOT_TAG_TYPE_FRAMEBUFFER.
//placed in ebx when the kernel is booted
multiboot_info* mbi;

multiboot_tag* find_tag(uint32_t type) {
    //the first tag lives 8 bytes after the start of the mbi
    multiboot_tag* tag = (multiboot_tag*)((uintptr_t)mbi + 8);
    while (tag->type != 0) { //a type of 0 means the terminating null tag
        if (tag->type == type)
            return tag;
        uintptr_t next_addr = (uintptr_t)tag + tag->size;
        next_addr = (next_addr + 7) & ~(uintptr_t)7; //align up to 8 bytes
        tag = (multiboot_tag*)next_addr;
    }
    return NULL;
}
Let's talk about the last three lines of the loop, where we set the tag variable to the next value. The multiboot 2 spec says that tags should always be 8-byte aligned. While this is not a problem most of the time, it is possible we could get a misaligned pointer by simply adding size bytes to the current pointer. So to be on the safe side, and spec-compliant, we align the value up to the nearest 8 bytes.
3.5.2 Limine
Accessing a bootloader response is very simple, and doesn't require iterating any list: we just need to read the .response field of the request structure:
//somewhere in the code
if (framebuffer_request.response != NULL) {
    //do something with the response
}
It's important to note that for every type of request the response field has a different type; in this case it is a pointer to a struct limine_framebuffer_response. For more info on all the available requests and responses, refer to the protocol documentation.
Chapter 4
Makefiles
There’s a million and one excellent resources on makefiles out there, so this chapter is less of a tutorial and
more of a collection of interesting things.
#inputs
C_SRCS = kernel_main.c util.c
ASM_SRCS = boot.S
TARGET = build/kernel.elf
#flags
CC_FLAGS = -g -ffreestanding
LD_FLAGS = -T linker_script.lds
.PHONY: all clean
all: $(OBJS)
	@echo "Linking program ..."
	$(LD) $(LD_FLAGS) $(OBJS) -o $(TARGET)
	@echo "Program linked, placed @ $(TARGET)"

clean:
	@echo "Cleaning build files ..."
	-rm -r build/
	@echo "Cleaning done!"

build/%.c.o: %.c
	@echo "Compiling C file: $<"
	@mkdir -p $(@D)
	$(CC) $(CC_FLAGS) $< -c -o $@

build/%.S.o: %.S
	@echo "Assembling file: $<"
	@mkdir -p $(@D)
	$(AS) $< -c -o $@
Okay! So there's a lot going on there. This is just one way to organise makefiles, and by no means a definitive guide. Since we may be using a cross compiler or changing compilers (it's a good idea to test with both gcc and clang), we've declared some variables representing the various programs we'll call when compiling. CXX is not used here, but if using C++ it's the common name for the compiler.
Following that we have our inputs, C_SRCS is a list of our source files. Anytime we want to compile a new
file, we’ll add it here. The same goes for ASM_SRCS. Why do we have two lists of sources? Because they’re
going to be processed by different tools (c files -> c compiler, assembly files -> assembly compiler/assembler).
TARGET is the output location and name of the file we’re going to compile.
Authors Note: In this example I've declared each input file in C_SRCS manually, but you could also make use of the builtin $(shell) function to run a program like find, searching for your source files automatically based on their file extension. Another level of automation! An example of what this might look like, when searching for all files with the extension ‘.c’, is given below:
C_SRCS = $(shell find -name "*.c")
Next up we have flags for the c compiler (CC_FLAGS) and the linker (LD_FLAGS). If we wanted flags for the
assembler, we could create a variable here for those too. After the flags we have our first example of where
make can be really useful.
The linker wants a list of compiled object files, from the c compiler or assembler, not a list of the source files they came from. We already maintain a list of source files as inputs, but we don't have the list of produced object files that the linker needs in order to know what to link into the final binary. We could create a second list and keep that up to date, but that's more things to keep track of, and more room for error as well. Make has built-in search and replace functionality, in the form of the patsubst (pattern substitution) function. patsubst uses the wildcard (%) symbol to indicate the section of text we want to keep; anything specified outside of the wildcard is used for pattern matching. It takes the following arguments:
• Pattern used to select items from the input variable.
• Pattern used to transform selected items into the output.
• Input variable.
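For our inputs, the two substitutions that produce the object list might look like this (these are the OBJS lines discussed next, written to match the build rules above):

OBJS = $(patsubst %.c, build/%.c.o, $(C_SRCS))
OBJS += $(patsubst %.S, build/%.S.o, $(ASM_SRCS))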
Using patsubst we can transform the list of source files into a list of object files that we can give to the linker, as shown above. The second line (OBJS += ...) functions in the same way as the first, but uses the append operator instead of assign: similar to how it works in other languages, it appends to the end of the variable instead of overwriting it.
Next up is an important line: .PHONY:. Make targets are presumed to output a file of the same name by default. By adding a target as a dependency of the .PHONY target, we tell make that this target is more of a command, and not a real file it should look for. In simple scenarios it serves little purpose, but it's an issue that can catch us by surprise later on. Since phony targets don't have a timestamp associated with them, they are assumed to be always out of date, and thus are run every time they are called.
The all and clean targets work as we’d expect, building the final output or cleaning the build files. It’s
worth noting the ‘@’ symbol in front of echo. When at the start of a line, it tells make not to echo the rest of
the line to the shell. In this case we’re using echo because we want to output text without the shell trying to
run it. Therefore we tell make not to output the echo line itself, since echo will already write the following
text to the shell.
The line -rm -r build/ begins with a minus/hyphen. Normally if a command fails (returns a non-zero exit code), make will abort the sequence of commands and display an error. Beginning a line with a hyphen tells make to ignore the error code. Make will still tell us if an error occurred, but it won't stop executing the makefile. In this case this is what we want.
The last two rules tell make how it should create a *.c.o or *.S.o file if it needs them. They have a
dependency on a file of the same name, but with a different extension (*.c or *.S). This means make will fail
with an error if the source file does not exist, or if we forget to add it to the SRCS variables above. We do a
protective mkdir, to ensure that the filepath used for output actually exists.
Make has a few built-in symbols that are used throughout the above example makefile: for example, $@ expands to the rule's target, $< to the first prerequisite, and $(@D) to the directory part of the target. These are called automatic variables, and are described in the makefile manual.
That’s a lot of text! But here we can see an example of a number of make functions being used. This provides
a simple, but very flexible build system for a project, even allowing the tools to be swapped out by editing a
few lines.
There are other built in functions and symbols that have useful meanings, however discovering them is left as
an exercise to the reader.
extras.mk (.mk is a common extension for non-primary makefiles) into Makefile, we would add the following line somewhere:
include extras.mk
This would place the contents of the included file (extras.mk) at the line where it was included. This means the usual top to bottom reading of a makefile is followed as well. You can think of this as how #include works in C/C++: it's a glorified copy-and-paste mechanism.
One important note about using include is to remember that the included file will run with the current working directory of the file that included it, not the directory where the included file is located.
You can also run make itself as part of a command to build a target. This opens the door to a whole new
world of makefiles calling further makefiles and including others.
| \ - LibCommon.mk
|
| - BuildPrep.mk
| - Run.mk
\ - Makefile
Whew, there’s a lot going on there! Let’s look at why the various parts exist:
• When the user runs make in the shell, the root makefile is run. This file is mostly configuration, specifying
the toolchain and the options it’ll use.
• This makefile then recursively calls make on each of the sub-projects.
– For example, the kernel makefile will be run, and it will have all of the make variables specified in
the root makefile in its environment.
– This means if we decide to change the toolchain, or want to add debug symbols to all projects, we
can do it in a single change.
– Libraries and userland apps work in a similar way, but there is an extra layer. What I’ve called the
glue makefile. It’s very simple, it just passes through the make commands from above to each sub
project.
– This means we don't need to update the root makefile every time a new userland app or a new library is added.
– It also allows us to override some variables for every library or every userspace app, instead of
globally. Useful!
• There are a few extra makefiles:
– Run.mk is totally optional if we just want to build the system. It contains anything to do with
qemu or gdb, so it can easily be removed if the end user only wants to build the project, and not
run it.
– LibCommon.mk and UserlandCommon.mk contain common definitions that most userland
apps/libraries would want. Like a make clean target, automatically copying the output to a global
‘apps’ directory, rules for building c++ object files from source, etc. This saves having to write
those rules per project. They can instead be written once, and then included into each makefile.
• The kernel/arch dir contains several local.mk files. Only one of these is included at a time, and they
include any platform-specific source files. These are also contained in the same directory. This is a nice
way to automate what gets built.
– The root makefile contains a variable CPU_ARCH which contains either x86_64 or rv64g. If us-
ing the gcc toolchain, the tools are selected by using the CPU_ARCH variable (g++ is actually
named $(CPU_ARCH)-elf-g++, or x86_64-elf-g++), and for the clang toolchain it’s passed to the
--target= argument.
• This allows us to piggy-back on the variable, and place include arch/$(CPU_ARCH)/local.mk inside
the kernel makefile. Now the kernel changes what is built based on the same rule, ensuring we’re always
using the correct files. Cool!
Chapter 5
Linker Scripts
If we've never done any kind of low level programming before, it's unlikely we've had to deal with linker scripts. The compiler will provide a default one for the host cpu and operating system. Most of the time this is exactly what we need; however, since we are the operating system, we'll need our own!
A linker script is just a description of the final linked binary. It tells the linker how we want the various bits
of code and data from our compiled source files (currently they’re unlinked object files) to be laid out in the
final executable.
For the rest of this chapter we’ll assume we’re using elf as the file format. If building a UEFI binary we’ll
need to use PE/COFF+ instead, and that’s a separate beast of its own. There’s nothing x86 specific here,
but it was written with x86 in mind, other architectures may have slight differences. We also reference some
fields of structs, these are described very plainly in the elf64 base specification.
protected mode with paging disabled, we can't load a higher half kernel, as no one would have enough physical memory to have physical addresses in a range that high.
So instead, we load the kernel at a lower physical memory address (by setting the LMA), and run a self-contained assembly stub that is linked in its own section at a lower VMA. This stub sets up paging, and jumps to the higher half region of code once paging and long mode are set up. Now the code is at the address it expects to be at, and will run correctly, all within the same kernel file.
Modern linkers are quite clever, and will often deduce which program headers are needed, but it never hurts
to specify these. A freestanding kernel will need at least three program headers, all of type PT_LOAD:
• text, with execute and read permissions.
• rodata, with only the read permission.
• data, with both read and write permissions.
But what about the bss, and other zero-initialized data? Well that’s stored in the data program header.
Program headers have two variables for their size, one is the file size and the other is their memory size. If
the memory size is bigger than the file size, that memory is expected to be zeroed (as per the elf spec), and
thus the bss can be placed there!
5.2.1 Example
An example of what a program headers section might look like:
/* This declares our program headers */
PHDRS
{
    text PT_LOAD;
    rodata PT_LOAD;
    data PT_LOAD;
}
This example is actually missing the FLAGS field that sets the permissions, but modern linkers will see common names like text and rodata and give them default permissions.
Of course, they can (and in best practice, should) be set manually using the FLAGS keyword:
PHDRS
{
    /* flags bits: 0 = execute, 1 = write, 2 = read */
    text PT_LOAD FLAGS((1 << 0) | (1 << 2));
    rodata PT_LOAD FLAGS((1 << 2));
    data PT_LOAD FLAGS((1 << 1) | (1 << 2));
}
FLAGS sets the p_flags field of the program header; for more detail on it, refer to the Executable and Linkable Format (ELF) chapter.
5.3 Sections
While program headers are a fairly coarse mechanism for telling the program loader what it needs to do in order to get our program running, sections give us a lot of control over how the code and data within those areas are arranged.
If wondering why most higher half kernels are loaded at this address, it's because it's the upper-most 2GB of the 64-bit address space.
OUTPUT_ARCH(): Like OUTPUT_FORMAT, this is not necessary most of the time, but it allows for specifying the
target cpu architecture, used in the elf header. This can safely be omitted.
PHDRS
{
    /* bare minimum: code, data and rodata phdrs. */
    text PT_LOAD FLAGS((1 << 0) | (1 << 2));
    rodata PT_LOAD FLAGS((1 << 2));
    data PT_LOAD FLAGS((1 << 1) | (1 << 2));
}
SECTIONS
{
    /* start linking at the -2GB address. */
    . = 0xFFFFFFFF80000000;
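    /* a sketch of the .text section, which would be mapped to the
       text phdr declared above: */
    .text :
    {
        TEXT_BEGIN = .;
        *(.text*)
        TEXT_END = .;
    } :text

    . += CONSTANT(MAXPAGESIZE);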
    .rodata :
    {
        RODATA_BEGIN = .;
        *(.rodata*)
        RODATA_END = .;
    } :rodata

    . += CONSTANT(MAXPAGESIZE);
    .data :
    {
        DATA_BEGIN = .;
        *(.data*)
    } :data

    .bss :
    {
        /* COMMON is where the compiler places its internal symbols */
        *(COMMON)
        *(.bss*)
        DATA_END = .;
    } :data
    /* we can use the '.' to determine how large the kernel is. */
    KERNEL_SIZE = . - 0xFFFFFFFF80000000;
}
In the script above we configure the required sections .text, .rodata, .data, and .bss. For every output section we include all the input sections whose names start with the output section's name (consider that .bss and .bss.debug are two different input sections, but we want them included in the same output section). We also create some extra symbols, which will be available to the kernel at runtime as variables; their values are the start and end addresses of each section.
We also create another symbol to contain the kernel size, where . is the memory address the script has reached at that point, and 0xFFFFFFFF80000000 is the starting address, as seen at the beginning of SECTIONS.
This example doesn’t explicitly set the LMAs of any sections, and assumes that each section can be loaded
at its requested VMA (or relocated if we can’t do that). For userspace programs or maybe loadable kernel
modules (if we support these) this is usually no problem, but if this is a linker script for the kernel we must be
careful as paging may not be enabled yet. In this case we need the sections to be loaded somewhere in physical
memory. This is where setting the LMA can come in handy - it allows us to still link parts of the kernel as
higher half (based on their VMA), but have them loaded at a known physical address. For more details about
a scenario like this see how we boot using multiboot 2, as it starts the kernel with paging disabled. If you do
this, don’t forget to set the VMA to the higher half before the higher half sections of the kernel.
Chapter 6
Generating A Bootable Iso
Generating a bootable iso is specific to the bootloader of choice, both grub and limine are outlined below!
The generated iso can then be used as a cdrom in an emulator, or burnt to a real disk or usb thumb drive as
we would do for installing any other operating system.
6.1 Xorriso
Xorriso is a tool used to create iso disk images. Iso is actually quite a complex format, as it aims to be
compatible with a lot of different formats and ways of booting.
A walkthrough of xorriso is outside the scope of this chapter, but just know it’s the standard tool for working
with the iso format.
6.2 Grub
The grub-mkrescue tool is very straightforward: just create a root folder and populate it with what we want to appear on the final iso. In our case we'll need grub.cfg and our kernel (called kernel.elf for now), so we'll create the
following directory layout:
disk/
\-boot/
|-grub/
| \-grub.cfg
\-kernel.elf
We can now run grub-mkrescue -o my_iso.iso disk, and it will generate an iso called my_iso.iso with
the contents of disk/ as the root directory.
6.2.1 Grub.cfg
This file contains some global config options for grub (setting boot timeouts, etc.) as well as a list of boot options for this disk. In our case we only have the one option: our kernel.
Global options can be changed using the set option=value syntax (for example, set timeout=5 waits five seconds before booting the default entry), and a full list is available from the grub documentation.
For each boot option, a menu entry needs to be created within grub.cfg. A menu entry consists of a header,
and then a body containing a series of grub commands to boot the operating system. This allows grub to
provide a flexible interface for more complex setups.
However in our case, we just want to boot our kernel using the standard multiboot2 protocol. Fortunately
grub has a built in command (multiboot2) for doing just that:
menuentry "My Operating System" {
multiboot2 /boot/kernel.elf
boot
}
Note the last command boot, this tells grub to complete the boot process and actually run our kernel.
6.3 Limine (Limine Protocol, Multiboot 2)
6.3.1 Limine.conf
Similar to grub, limine also uses a config file. This config file has its own documentation, which is available in the limine repository.
The limine.conf file lists each boot entry as a title, followed by a series of key-value pairs. To boot our example from above using the Limine protocol, our config might look like the following:
# The entry name that will be displayed in the boot menu.
/My Operating System
# We use the Limine boot protocol.
protocol: limine
# Path to the kernel to boot. boot():/ represents the partition on which limine.conf is located.
kernel_path: boot():/boot/kernel.elf
One thing to note is how the kernel's path is specified. It uses a URI-like format, which begins with a protocol and then a protocol-specific path. In our case we're using the boot disk (boot():), followed by an absolute path (/boot/kernel.elf).
Part II
Architecture and Drivers
Chapter 7
Architecture and Drivers
Before going beyond a basic “hello world” and implementing the first real parts of our kernel, there are some
key concepts about how the CPU operates that we have to understand. What is an interrupt, and how do we
handle it? What does it mean to mask them? What is the GDT and what is its purpose?
It's worth noting that we're going to focus exclusively on x86_64 here: some concepts are specific to this platform (the GDT, for example), while others are transferable across most platforms (like a higher half kernel). Some, like interrupts and interrupt handlers, are only partially transferable to other platforms.
Similarly to the previous part, this chapter is a high level introduction to concepts that will be explained in more detail later.
The Hello World chapter will guide us through the implementation of some basic serial I/O functions, to be used mostly for debugging purposes (especially with an emulator); we will see how to send characters and strings, and how to read them.
Many modern operating systems place their kernel in the higher half of the virtual memory space; what that is, and how to place the kernel there, is explained in the Higher Half chapter.
In the GDT chapter we will explain one of the x86 structures used to describe memory to the CPU; although it is a legacy structure, its usage is still required in several parts of the kernel (especially when dealing with userspace).
Then the chapters Interrupt Handling, ACPI Tables and APIC will discuss how the x86 cpu handles exceptions and interrupts, and how the kernel should deal with them.
The Timers chapter will use one of the interrupt handling routines to interrupt the kernel execution at regular intervals; this will be the groundwork for implementing multitasking in our kernel.
The final three chapters of this part (PS2 Keyboard Overview, PS2 Keyboard Interrupt Handling, and PS2 Keyboard Driver Implementation) will explain how a keyboard works, what scancodes are and how to translate them into characters, and finally describe the steps to implement a basic keyboard driver.
These are not the same: as we'll see later on, we can convert virtual addresses to physical addresses (usually the cpu will do this for us), but they are actually separate things.
There are also other address spaces we may encounter in osdev, like:
• Port I/O: Some older devices on x86 are wired up to 'ports' on the cpu, with each port being given an address. These addresses are not virtual or physical memory addresses, so we can't access them like pointers. Instead, special cpu instructions are used to move data in and out of this address space (a sketch of these is shown after this list).
• PCI Config Space: PCI has an entirely separate address space for configuring devices. This address space has a few different ways to access it.
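Since we'll use these special instructions constantly in later chapters, here is a minimal sketch of the usual C wrappers around them (the names outb and inb are conventions, not a library API; this assumes GCC/Clang inline assembly on x86):
static inline void outb(uint16_t port, uint8_t value) {
    asm volatile("outb %0, %1" : : "a"(value), "Nd"(port));
}

static inline uint8_t inb(uint16_t port) {
    uint8_t value;
    asm volatile("inb %1, %0" : "=a"(value) : "Nd"(port));
    return value;
}
These are the helpers assumed by the code examples in the following chapters.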
Most of the time we won't have to worry about which address space we're dealing with: hardware will only deal with physical addresses, and our code will mostly deal with virtual addresses. We'll look at how to use both of these later on, so don't worry!
do we detect that efficiently? Maybe something we can’t predict happens like a program trying to access
memory it’s not supposed to, or a new packet arrives over the network.
These things can happen at any time, and as the operating system kernel we would like to react to them and
take some action. This is where interrupts come in.
7.3.1 Interrupts
When an unexpected event happens, the cpu will immediately stop executing the current code and start running a special function called an interrupt handler. The interrupt handler is something the kernel tells the cpu about, and the function can then work out what event happened and take some action. The interrupt handler then tells the cpu when it's done, and the cpu goes back to executing the previously running code.
The interrupted code is usually never aware that an interrupt even occurred, and should continue on as
normal.
7.4 Drivers
This section is not about device drivers for graphics cards, network interfaces, and other hardware. Rather, in the early stages of development we will need some basic drivers to implement future features: for example we will need at least one supported timer to implement the scheduler, and we will most likely want basic keyboard support in order to implement a CLI. These topics will be covered in this section, along with other architecture-specific drivers required by the CPU.
Chapter 8
Hello World
During the development of our kernel we will need to debug a lot and check a lot of values, but so far our kernel is not capable of any output. Proper video output with scrolling, fonts etc. can take some time to implement, so we need a quick way of getting text out of our kernel, not necessarily on the screen. This is where serial logging comes to our aid: we will use the serial port to output our text and numbers.
Many emulators have an option to redirect serial data to a file. If we are using QEMU (for more information about it refer to the Appendices section) we need to start it with the parameter -serial file:filename:
qemu -serial file:filename.log -cdrom yourosiso
This will save the serial output to a file called filename.log. If we want the serial output directly on the screen, we can use stdio instead.
8.1.1 Initialization
The second part is pretty simple: we just need to send a few configuration commands to initialize the serial communication. The code below is copied from https://wiki.osdev.org/Serial_Ports#Initialization:
#define PORT 0x3f8 // COM1
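// The rest of the sequence, as on the wiki page linked above: program the
// baud rate, line settings and FIFOs, then run a loopback self-test before
// switching the chip to normal operation.
static int init_serial() {
   outb(PORT + 1, 0x00);    // Disable all interrupts
   outb(PORT + 3, 0x80);    // Enable DLAB (set baud rate divisor)
   outb(PORT + 0, 0x03);    // Set divisor to 3 (lo byte) 38400 baud
   outb(PORT + 1, 0x00);    //                  (hi byte)
   outb(PORT + 3, 0x03);    // 8 bits, no parity, one stop bit
   outb(PORT + 2, 0xC7);    // Enable FIFO, clear it, 14-byte threshold
   outb(PORT + 4, 0x0B);    // IRQs enabled, RTS/DSR set
   outb(PORT + 4, 0x1E);    // Set in loopback mode, test the serial chip
   outb(PORT + 0, 0xAE);    // Send test byte 0xAE
   if (inb(PORT + 0) != 0xAE)
      return 1;             // Faulty serial chip: the loopback byte didn't match
   outb(PORT + 4, 0x0F);    // Leave loopback, set normal operation mode
   return 0;
}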
How to get the single digits will depend on what base we are using (the most common are bases 8, 10 and 16); let's assume for now that we just want to print decimals (base 10).
To get decimal strings we will use a property of division by 10: The remainder of any integer number divided
by 10 is always the same as the least significant digit.
As an example consider the number 1235: 1235/10 = 123.5 and 1235 mod 10 = 5. Remember that in C (and other programming languages) a division between integers discards any fractional part, so 1235/10 = 123. And what if we now divide 123 by 10? Yes, we get 3 as the remainder. Below is the full list of divisions for the number 1235:
• 1235/10 = 123 and 1235 mod 10 = 5
• 123/10 = 12 and 123 mod 10 = 3
• 12/10 = 1 and 12 mod 10 = 2
• 1/10 = 0 and 1 mod 10 = 1
As we can see, we got all the digits in reverse order, so the only thing left to do is reverse them. The implementation of this function should now be pretty straightforward, and it is left as an exercise.
Printing other formats like hex or octal is a little different, but the basic idea of getting each digit and converting it into a character is the same. The only tricky thing with hex numbers is that the digits between 10 and 15 are represented by letters, which come after the decimal digit symbols in the ASCII table, but once that is known it is just an if statement in our function.
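As a sketch of that if statement (a hypothetical helper, using the ASCII layout just described):
char digit_to_char(uint8_t digit) {
    if (digit < 10)
        return '0' + digit;    //decimal digits map directly onto '0'..'9'
    return 'A' + (digit - 10); //hex digits A-F live past the decimal digits
}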
8.1.4 Troubleshooting
If the output to serial is not working and there is no output in the log, try removing the line that sets the serial port in loopback mode:
outb(PORT + 4, 0x1E); // Set in loopback mode, test the serial chip
Chapter 9
Higher Half Kernel
Commonly kernels will place themselves in the higher half of the address space, as this allows the lower half
to be used for userspace. It greatly simplifies writing new programs and porting existing ones. Of course this
does make it slightly more complex for the kernel, but not by much!
Most architectures that support virtual addressing use an MMU (memory management unit), and for x86 it’s
built into the CPU. Virtual memory (and paging - which is how the x86 MMU is programmed) is discussed
in more detail in the paging chapter, but for now think of it as allowing us to map what the code running on
the CPU sees to a different location in physical memory. This allows us to put things anywhere we want in
memory and at any address.
With a higher half kernel we take advantage of this, and place our kernel at a very high address, so that it is out of the way of any user programs that might be running. Typically we claim the entire higher half of the virtual address space for use by the kernel, and leave the entire lower half alone for userspace.
Chapter 10
The Global Descriptor Table
10.1 Overview
The GDT is an x86(_64) structure that contains a series of descriptors. In a general sense, each of these descriptors tells the cpu about something it should do. To refer to a GDT descriptor, a selector is used, which is simply the byte offset from the beginning of the GDT where that descriptor starts, ORed with the ring the selector refers to. The OR with the ring exists for legacy reasons, but the mechanism is still in use.
It’s important to separate the idea of the bit-width of the cpu (16-bit, 32-bit, 64-bit) from the current mode
(real mode, protected mode, long mode). Real mode is generally 16 bit, protected mode is generally 32 bit,
and long mode is usually 64 bit, but this is not always the case. The GDT decides the bit-width (affecting
how instructions are decoded, and how stack operations work for example), while CR0 and EFER affect the
mode the cpu is in.
Most descriptors are 8 bytes wide, usually resulting in the selectors looking like the following:
• null descriptor: selector 0x0
• first descriptor: selector 0x8
• second descriptor: selector 0x10
• third descriptor: selector 0x18
• etc . . .
There is one exception to the 8-byte-per-descriptor rule, the TSS descriptor, which is used by the ltr
instruction to load the task register with a task state segment. It’s a 16-byte wide descriptor.
Usually these selectors are for code (CS) and data (DS, SS), which tell the cpu where it’s allowed to fetch
instructions from, and what regions of memory it can read/write to. There are other selectors, for example
the first entry in the GDT must be all zeroes (called the null descriptor).
The null selector is mainly used for edge cases, and is usually treated as ‘ignore segmentation’, although it
can lead to #GP faults if certain instructions are issued. Its usage only occurs with more advanced parts of
x86, so we’ll know to look out for it.
The code and data descriptors are what they sound like: the code descriptor tells the cpu what region of
memory it can fetch instructions from, and how to interpret them. Code selectors can be either 16-bit or
32-bit, or if running in long mode 64-bit or 32-bit.
To illustrate this point, it is possible to run 32-bit code in two ways:
• In long mode, with a compatibility (32-bit) segment loaded. Paging is required here as we're in long mode (4 or 5 levels used), and segmentation is also enabled due to compatibility mode. SSE/SSE2 and various other long mode features are always available too.
• In protected mode, with a 32-bit segment loaded. Segmentation is mandatory here, and paging is optional (available as 2 or 3 levels). SSE/SSE2 is an optional cpu extension, and may not be supported.
10.2 Terminology
• Descriptor: an entry in the GDT (can also refer to the LDT/local descriptor table, or IDT).
• Selector: byte offset into the GDT, refers to a descriptor. The lower 3 bits contain some extra fields, see
below.
• Segment: the region of memory described by the base address and limit of a descriptor.
• Segment Register: where the currently in use segments are stored. These have a visible portion (the
selector loaded), and an invisible portion which contains the cached base and limit fields.
The various segment registers:
• CS: Code selector, defines where instructions can be fetched from.
• DS: Data selector, where general memory access can happen.
• SS: Stack selector, where push/pop operations can happen.
• ES: Extra selector, intended for use with string operations, no specific purpose.
• FS: F selector, no specific purpose. Sys V ABI uses it for thread local storage.
• GS: G selector, no specific purpose. Sys V ABI uses it for process local storage, commonly used for
cpu-local storage in kernels due to swapgs instruction.
When using a selector to refer to a GDT descriptor, we’ll also need to specify the ring we’re trying to access.
This exists for legacy reasons to solve a few edge cases that have been solved in other ways. If we need to use
these mechanisms, we’ll know, otherwise the default (setting to zero) is fine.
A segment selector contains the following information:
• Index (bits 15-3): the byte offset of the descriptor within the GDT.
• TI (bit 2): the Table Indicator; if clear the selector references the GDT, if set the LDT. In our case we can leave it as 0.
• RPL (bits 1-0): the Requested Privilege Level, which will be explained later.
Constructing a segment selector is done like so:
uint8_t is_ldt_selector = 0;
uint8_t target_cpu_ring = 0;
uint16_t selector = byte_offset_of_descriptor & ~(uint16_t)0b111;
selector |= (target_cpu_ring & 0b11);
selector |= ((is_ldt_selector & 0b1) << 2);
The is_ldt_selector field can be set to tell the cpu this selector references the LDT (local descriptor table)
instead of the GDT. We’re not interested in the LDT, so we will leave this as zero. The target_cpu_ring
field (called RPL in the manuals), is used to handle some edge cases. This is best set to the same ring the
selector refers to (if the selector is for ring 0, set this to 0, if the selector is for ring 3, set this to 3).
It's worth noting that in the early stages of the kernel we will only be using the GDT and kernel selectors, meaning these fields are zero. Therefore this calculation is not necessary, and we can simply use the byte offset into the GDT as the selector.
This is also the first mention of the LDT (local descriptor table). The LDT uses the same structure as the
GDT, but is loaded into a separate register. The idea being that the GDT would hold system descriptors, and
the LDT would hold process-specific descriptors. This tied in with the hardware task switching that existed
in protected mode. The LDT still exists in long mode, but should be considered deprecated by paging.
Address types:
• Logical address: addresses the programmer deals with.
• Linear address: logical address after translation through segmentation (logical_address + selector_base).
• Physical address: linear address translated through paging, maps to an actual memory location in RAM.
It’s worth noting if segmentation is ignored, logical and linear addresses are the same.
If paging is disabled, linear and physical addresses are the same.
10.3 Segmentation
Segmentation is a mechanism for separating regions of memory into code and data, to help secure operating
systems and hardware against potential security issues, and simplifies running multiple processes.
How it works is pretty simple: each GDT descriptor defines a segment in memory, using a base address and limit. When a descriptor is loaded into the appropriate segment register, it creates a window into memory with the specified permissions. All memory outside of this segment has no permissions (read, write, execute) unless specified by another segment.
The idea is to place code in one region of memory, and then create a descriptor with a base and limit that
only expose that region of memory to the cpu. Any attempts to fetch instructions from outside that region
will result in a #GP fault being triggered, and the kernel will intervene.
Accessing memory inside a segment is done relative to its base. Let's say we have a segment with a base of 0x1000, and some data in memory at address 0x1100. The data would be accessed at address 0x100 (assuming the segment is the active DS), as addresses are translated as segment_base + offset. In this case the segment base is 0x1000, and the offset is 0x100.
Segments can also be explicitly referenced. To store something at offset 0x100 of the ES region, an instruction like mov %rax, %es:0x100 can be used. This performs the translation from logical address to linear address using ES instead of DS (the default for data). A common example is when an interrupt occurs while the cpu is in ring 3: it will switch to ring 0 and load the appropriate descriptors into the segment registers.
reload_cs:
pop %rdi
push $0x8
push %rdi
retfq
In the above example we take advantage of the call instruction pushing the return address onto the stack before jumping. To reload %cs we'll need an address to jump to, so we'll use the saved address on the stack. We need to place the selector we want to load into %cs onto the stack before the return address though, so we'll briefly store the return address in %rdi, push our example code selector (0x8 here - the implementation may differ), then push the return address back onto the stack.
We use retfq instead of ret because we want to do a far return, and we want to use the 64-bit (quadword)
version of the instruction. Some assemblers have different syntax for this instruction, and it may be called
lretq.
For system-type descriptors it's best to consult the manual; the Intel SDM volume 3A chapter 3.5 has the relevant details.
The Selector Type is a multi-bit field. For non-system descriptor types, the MSB (bit 3) is set for code descriptors, and cleared for data descriptors. The LSB (bit 0) is a flag the cpu uses to communicate to the OS that the descriptor has been accessed in some way, but this feature is mostly abandoned, and should not be used.
For a data selector, the remaining two bits are: expand-down (bit 2), which causes the limit to grow downwards instead of up (useful for stack selectors); and write-allow (bit 1), which allows writing to this region of memory. The region is read-only if this bit is cleared.
For a code selector, the remaining bits are: conforming (bit 2), a tricky subject to explain - it allows user code to run with kernel selectors under certain circumstances, and is best left cleared; and read-allow (bit 1), which allows read-only access to the code, for accessing constants stored near instructions. Otherwise code cannot be read as data, only fetched as instructions.
uint64_t kernel_code = 0;
kernel_code |= 0b1011 << 8; //type of selector
kernel_code |= 1 << 12; //not a system descriptor
kernel_code |= 0 << 13; //DPL field = 0
kernel_code |= 1 << 15; //present
kernel_code |= 1 << 21; //long-mode segment
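The matching data descriptor, built in the same style, might look like this (a sketch; the entry index assumes a null/code/data layout for gdt_entries):
uint64_t kernel_data = 0;
kernel_data |= 0b0011 << 8; //type of selector: writable data
kernel_data |= 1 << 12;     //not a system descriptor
kernel_data |= 0 << 13;     //DPL field = 0
kernel_data |= 1 << 15;     //present
gdt_entries[2] = kernel_data; //assuming 0 = null, 1 = kernel code, 2 = kernel data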
Most of this descriptor is unchanged compared to the code descriptor, except for the type field: the MSB of the type field is cleared to indicate this is a data descriptor. Creating the user mode descriptors is even more straightforward, as we'll reuse the existing descriptors and just update their DPL fields (bits 13 and 14).
uint64_t user_code = kernel_code | (3 << 13);
gdt_entries[3] = user_code;
struct GDTR
{
uint16_t limit;
uint64_t address;
} __attribute__((packed));
struct GDTR example_gdtr =
{
    .limit = num_gdt_entries * sizeof(uint64_t) - 1,
    .address = (uint64_t)gdt_entries
};
void load_gdt()
{
asm("lgdt %0" : : "m"(example_gdtr));
}
If not familiar with inline assembly, check the appendix on using inline assembly in C. The short of it is that we use the "m" constraint to tell the compiler that example_gdtr is a memory operand. The lgdt instruction loads the new GDT, and all that's left is to reload the current selectors, since they're using cached information from the previous GDT. This is done in the function below:
void flush_gdt()
{
asm volatile("\
mov $0x10, %ax \n\
mov %ax, %ds \n\
mov %ax, %es \n\
mov %ax, %fs \n\
mov %ax, %gs \n\
mov %ax, %ss \n\
");
}
Note that %cs can't be loaded with a mov like the other segment registers; it must be reloaded with a far return, as shown in the reload_cs example earlier.
Chapter 11
Interrupt Handling on x86_64
As the title implies, this chapter is purely focused on x86_64. Other platforms will have different mechanisms
for handling interrupts.
If not familiar with the term, an interrupt is a way for the cpu to tell our code that something unexpected or unpredictable has happened, and that it needs to be handled. When an interrupt is triggered, the cpu will serve it by loading the specified interrupt handler. The handler itself is just a function, but with a few special conditions.
Interrupts get their name because they interrupt the normal flow of execution, stop whatever code was
running on the cpu, execute a handler function, and then resume the previously running code. Interrupts can
signal a number of events from the system, from fatal errors to a device telling us it has some data ready to
be read.
The x86 architecture makes a distinction between hardware interrupts and software interrupts. Don't worry though, this is only something we'll need to think about if we deliberately use it. A software interrupt is one that's triggered by the int instruction; anything else is considered a hardware interrupt. The difference is that some hardware interrupts will store an error code (and some will not), but a software interrupt will never store an error code. Meaning if the int instruction is used to trigger an interrupt which normally has an error code, there won't be one present, and we'll most likely run into bugs if the handler function is not prepared for this.
11.1 Setting Up For Handling Interrupts
Let’s look closer at the type field. We have two options here, with only one difference between them: an
interrupt gate will clear the interrupt flag before running the handler function, and a trap gate will not.
Meaning if a trap gate is used, interrupts can occur while inside of the handler function. There are situations
where this is useful, but we will know those when we encounter them. An interrupt gate should be used
otherwise. They have the following values for the type field:
• Interrupt gate: 0b1110.
• Trap gate: 0b1111.
The DPL field is used to control which cpu rings can trigger this vector with a software interrupt. On x86
there are four protection rings (0 being the most privileged, 3 the least). Setting DPL = 0 means that only
ring 0 can issue a software interrupt for this vector, if a program in another ring tries to do this it will instead
trigger a general protection fault. For now, we have no use for software interrupts, so we’ll set this to 0 to
only allow ring 0 to trigger them.
That’s a lot of writing, but in practice it won’t be that complex. Let’s create a function to populate a single
IDT entry for us. In this example we’ll assume the kernel code selector is 0x8, but it may not be.
interrupt_descriptors idt[256];
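A sketch of such a function follows, assuming a typical long mode gate descriptor layout behind the interrupt_descriptors type (the struct and field names here are illustrative):
struct interrupt_descriptor
{
    uint16_t address_low;
    uint16_t selector;
    uint8_t ist;
    uint8_t flags;
    uint16_t address_mid;
    uint32_t address_high;
    uint32_t reserved;
} __attribute__((packed));

void set_idt_entry(size_t vector, void* handler, uint8_t dpl)
{
    //we assume the idt array declared above matches this 16-byte layout
    struct interrupt_descriptor* entry = (struct interrupt_descriptor*)idt + vector;
    uint64_t address = (uint64_t)handler;
    entry->address_low = address & 0xFFFF;
    entry->address_mid = (address >> 16) & 0xFFFF;
    entry->address_high = address >> 32;
    entry->selector = 0x8; //kernel code selector, assumed to be 0x8 as above
    entry->ist = 0;        //not using the interrupt stack table for now
    //bit 7 = present, bits 6:5 = dpl, bits 3:0 = gate type (0b1110 = interrupt gate)
    entry->flags = (1 << 7) | ((dpl & 0b11) << 5) | 0b1110;
    entry->reserved = 0;
}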
struct idtr
{
uint16_t limit;
uint64_t base;
} __attribute__((packed));
Again, note the use of the packed attribute. In long mode the limit field should be set to 0xFFF (16 bytes per descriptor * 256 descriptors, minus 1). The base field needs to contain the logical address of the IDT. This is usually the virtual address, but if segmentation has been re-enabled in long mode (some cpus allow this), this address ignores segmentation.
Authors Note: The reason for subtracting one from the size of the IDT is interesting. Loading an IDT with zero entries would be pointless, as there would be nothing there to handle interrupts, and so no point in having loaded it in the first place. Since a size of 0 is useless, the limit field is encoded as one less than the actual length. This has the benefit of reducing the value 4096 (for a full IDT), which needs 13 bits, to 4095, which fits in 12 bits. One less bit to store!
void load_idt(void* idt_addr)
{
    struct idtr idt_reg;
    idt_reg.limit = 0xFFF;
    idt_reg.base = (uint64_t)idt_addr;
    asm volatile("lidt %0" :: "m"(idt_reg));
}
In this example we stored idtr on the stack, which gets cleaned up when the function returns. This is okay
because the IDTR register is like a segment register in that it caches whatever value was loaded into it, similar
to the GDTR. So it’s okay that our idtr structure is no longer present after the function returns, as the
register will have a copy of the data our structure contained. Having said that, the actual IDT can’t be on
the stack, as the cpu does not cache that.
At this point we should be able to install an interrupt handler into the IDT, load it, and set the interrupt flag. The kernel will likely crash as soon as an interrupt is triggered though, as there are some special things we need to do inside of the interrupt handler before it can return.
11.2 Interrupt Handler Stub
call interrupt_dispatch
pop %r15
pop %r14
//pop other registers here
pop %rbx
pop %rax
//remove the vector number and error code from the stack
add $16, %rsp
iretq
A thing to notice is that we added 16 bytes to the stack pointer before the iret. This is because the error code (real or dummy) and the vector number we pushed need to be removed, so that the iret frame is at the top of the stack. If we don't do this, iret will use the wrong data and likely trigger a general protection fault.
As for the general purpose registers, the order they’re pushed doesn’t really matter, as long as they’re popped
in reverse. You can skip storing %rsp, as its value is already preserved in the iret frame. That’s the generic
part of our interrupt stub, now we just need the handlers for each vector. They’re very simple!
We’re also going to align each handler’s function to 16 bytes, as this will allow us to easily install all 256
handlers using a loop, instead of installing them individually.
.align 16
vector_0_handler:
//vector 0 has no error code
pushq $0
//the vector number
pushq $0
jmp interrupt_stub
//skipping ahead
.align 16
vector_13_handler:
//vector 13 (#GP) does push an error code,
//so we won't push a dummy one. Just the vector number.
pushq $13
jmp interrupt_stub
There's still a lot of repetition, so we could take advantage of our assembler's macro features to reduce that down to a few lines. That's beyond the scope of this chapter though. Because of the 16-byte alignment, we know that handler number xyz is offset by xyz * 16 bytes from the first handler.
extern char vector_0_handler[];
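Combined with the alignment guarantee, this declaration lets us install every handler in a loop; a minimal sketch, reusing the set_idt_entry shape from earlier:
//install all 256 handlers, each 16 bytes after the previous one
for (size_t i = 0; i < 256; i++)
    set_idt_entry(i, (void*)(vector_0_handler + i * 16), 0);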
11.3 Interrupt Dispatch
If we don’t send the EOI, the cpu will return from the interrupt handler and execute normally, but we will
never be able to handle any future interrupts because the local APIC thinks we’re still handling one.
67
CHAPTER 11. INTERRUPT HANDLING ON X86_64 68
the fields in a more familiar way. We're going to call this structure cpu_status_t; it has been called all sorts of things, from context_t to registers_t. What's important about it is that we define the fields in the
reverse order of what we push to the stack. Remember the stack grows downwards, so the earlier pushes will
have higher memory addresses, meaning they will come later in the struct’s definition. Our struct is going to
look like the following:
struct cpu_status_t
{
uint64_t r15;
uint64_t r14;
//other pushed registers
uint64_t rbx;
uint64_t rax;
uint64_t vector_number;
uint64_t error_code;
uint64_t iret_rip;
uint64_t iret_cs;
uint64_t iret_flags;
uint64_t iret_rsp;
uint64_t iret_ss;
};
The values pushed by the cpu are prefixed with iret_, which are also the values that the iret instruction
will pop from the stack when leaving the interrupt handler. This is another nice side effect of having our
stack laid out in a standard way, because of the dummy error code we pushed we know we can always use this
structure.
Our modified interrupt_dispatch now looks like:
cpu_status_t* interrupt_dispatch(cpu_status_t* context)
{
    switch (context->vector_number)
    {
        case 13:
            log("general protection fault.");
            break;
        case 14:
            log("page fault.");
            break;
        default:
            log("unexpected interrupt.");
            break;
    }
    return context;
}
All that’s left is to modify the assembly interrupt_stub to handle this. It’s only a few lines:
//push other registers here
push %r15
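//a sketch of the lines between the pushes and pops, based on the
//surrounding text: pass the stack pointer to interrupt_dispatch and
//load whatever it returns as the new stack.
mov %rsp, %rdi
call interrupt_dispatch
mov %rax, %rsp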
pop %r15
//pop other registers here
That's it! One thing to note is that whatever is returned from interrupt_dispatch will be loaded as the new stack, so only return things we know are valid. Returning the existing stack is fine, but don't try to return NULL or similar to signal an error.
While some of these vectors are unused, they are still reserved and might be used in the future, so if one of them triggers, treat it as an error. Most of these are fairly rare occurrences, however we will quickly explain a few of the common ones:
• Page Fault: Easily the most common one to run into. It means there was an issue with translating a
virtual address into a physical one. This does push an error code which describes the memory access
that triggered the page fault. Note the error describes what was being attempted, not what caused
translation to fail. The %cr2 register will also contain the virtual address that was being translated.
• General Protection Fault: A GP fault can come from a large number of places, although it's generally from an instruction dealing with the segment registers in some way. This includes iret (it modifies cs/ss), and others like lidt/ltr. It also pushes an error code, which is described below. A GP fault can also come from trying to execute a privileged instruction when it's not allowed. This case is different to an undefined opcode, as the instruction exists, but is just not allowed.
• Double Fault: This means something has gone horribly wrong, and the system is not in a state that can
be recovered from. Commonly this occurs because the cpu could not call the GP fault handler, but
it can be triggered by hardware conditions too. This should be considered as our last chance to clean
up and save any state. If a double fault is not handled, the cpu will ‘triple fault’, meaning the system
resets.
69
CHAPTER 11. INTERRUPT HANDLING ON X86_64 70
A number of the reserved interrupts will not fire by default; they require certain flags to be set. For example the x87 FPU error only occurs if CR0.NE is set, otherwise the FPU will silently fail. The SIMD error will only occur if the cpu has been told to enable SSE. Others, like bound range exceeded or device not available, can only occur on specific instructions, and are generally unseen.
A page fault will push a bitfield as its error code. This is not a complete description of all the fields (the others are specific to certain cpu features), but the common ones are:
• Bit 0: set if the fault was caused by a page-level protection violation, clear if the page was simply not present.
• Bit 1: set if the faulting access was a write, clear if it was a read.
• Bit 2: set if the access came from user mode (ring 3).
• Bit 3: set if a reserved bit was set in a page table entry.
• Bit 4: set if the fault was caused by an instruction fetch.
The other interrupts that push an error code (excluding the always-zero ones) use a format that indicates which selector caused the fault: the selector's index, the table it belongs to (GDT/LDT/IDT), and whether the event originated externally to the cpu.
11.4 Troubleshooting
11.4.1 Remapping The PICs
This is touched on more in the APIC chapter, but before the current layout of the IDT existed there was a pair of devices called the PICs that handled interrupts for the cpu. They can issue 8 interrupts each, and by default they send them as vectors 0-7 and 8-15. That was fine at the time, but now this interferes with the reserved vectors, and can lead to the cpu thinking certain events are happening when they're actually not. Fortunately the PICs allow us to offset the vectors they issue to the cpu. They can be offset anywhere at or above 0x20, and are commonly placed at 0x20 and 0x28.
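A sketch of the classic remap sequence, using the outb helper from the Hello World chapter (the short delays between writes that some older hardware needs are omitted for brevity):
void remap_pics(uint8_t master_offset, uint8_t slave_offset) {
    outb(0x20, 0x11); //ICW1: begin init, expect ICW4 (master command port)
    outb(0xA0, 0x11); //same for the slave
    outb(0x21, master_offset); //ICW2: master vector offset
    outb(0xA1, slave_offset);  //ICW2: slave vector offset
    outb(0x21, 0x04); //ICW3: slave attached to master pin 2
    outb(0xA1, 0x02); //ICW3: slave's cascade identity
    outb(0x21, 0x01); //ICW4: 8086 mode
    outb(0xA1, 0x01);
}
Calling remap_pics(0x20, 0x28) gives the common layout mentioned above.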
we end up executing something, and ultimately trigger some sort of error. The solution is to use the halt instruction within a loop, so that after each interrupt we run hlt again, like so:
//this is what you want
while (true)
asm("hlt");
//not good!
asm("hlt");
Chapter 12
ACPI Tables
ACPI (Advanced Configuration and Power Interface) is a power management and configuration standard for the PC. It allows operating systems to control many different hardware features, like the amount of power each device gets, thermal zones, fan control, IRQs, battery levels, etc.
We need to access the ACPI tables in order to read the I/O APIC information, used to receive hardware interrupts (this will be explained later).
12.1.1 RSDP
The RSDP (Root System Description Pointer) used in the ACPI programming interface is the pointer to the RSDT (Root System Descriptor Table). The full structure depends on whether ACPI version 1 or 2 is used; the newer version just extends the previous one, and is backward compatible with it.
• Checking that the last byte of the sum is zero can be achieved in several ways: for example, it is possible to cast the result to uint8_t (if the value after casting is 0, the struct is valid), or to use a bitwise AND with 0xFF (0xFF is the byte 0b11111111): if sum & 0xFF is 0 the struct is valid, otherwise it has to be ignored (see the sketch below).
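As a sketch, such a checksum function might look like the following (assuming the ACPISDTHeader's Length field holds the total size of the table, as in the ACPI specification):
bool acpi_checksum_valid(ACPISDTHeader* header) {
    uint32_t sum = 0;
    uint8_t* bytes = (uint8_t*)header;
    for (uint32_t i = 0; i < header->Length; i++)
        sum += bytes[i];
    return (sum & 0xFF) == 0; //valid if the lowest byte of the sum is zero
}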
The function above works with both versions of the descriptor. Since the version 2 structure has more fields, the original checksum field doesn't cover them (it predates them), which is why an extended checksum field was added.
struct XSDT {
ACPISDTHeader sdtHeader; //signature "XSDT"
uint64_t sdtAddresses[];
};
This means that if we want to get the n-th SDT table we just need to access the corresponding item in the
*SDT array:
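A sketch (remembering these are physical addresses, which may need to be mapped into the virtual address space before being dereferenced):
ACPISDTHeader* get_sdt(struct XSDT* xsdt, size_t n) {
    return (ACPISDTHeader*)(uintptr_t)xsdt->sdtAddresses[n];
}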
Chapter 13
APIC
13.3 Local APIC
Note that the registers are given as a physical address, so to access them we will need to map them somewhere in the virtual address space. This is true for the addresses of any I/O APICs we obtain as well. When the system boots, the base address is usually 0xFEE00000, and often this is the value we read from rdmsr. For correct operation the local APIC registers should be mapped as 'strong uncacheable'.
A complete list of local APIC registers is available in the Intel/AMD software development manuals, but the
important ones for now are:
• Spurious Interrupt Vector: offset 0xF0.
• EOI (end of interrupt): offset 0xB0.
• Timer LVT: offset 0x320.
• Local APIC ID: offset 0x20.
Bits Value
0-7 Spurious vector
8 APIC Software enable/disable
9 Focus Processor checking
10-31 Reserved
• Thermal Monitor: used for configuring interrupts when certain thermal conditions are met. Offset:
0x330.
• Performance Counter: allows an interrupt to be generated when a performance counter overflows. Offset:
0x340.
• LINT0 : Specifies the interrupt delivery when an interrupt is signaled on LINT0 pin. Offset: 0x350.
• LINT1 : Specifies the interrupt delivery when an interrupt is signaled on LINT1 pin. Offset: 0x360.
• Error: configures how the local APIC should report an internal error. Offset: 0x370.
The LINT0 and LINT1 pins are mostly used for emulating the legacy PIC, but they may also be used as NMI sources. These are best left untouched until we have parsed the MADT, which will tell us how the LVT entries for these pins should be programmed.
Most LVT entries use the following format, with the timer LVT being the notable exception. Its format is
explained in the timers chapter. The thermal sensor and performance entries ignore bits 15:13.
Bit Description
0:7 Interrupt Vector. This is the IDT entry we want to trigger for this interrupt.
8:10 Delivery mode (see below)
11 Destination Mode, can be either physical or logical.
12 Delivery Status (Read Only), whether the interrupt has been served or not.
13 Pin polarity: 0 is active-high, 1 is active-low.
14 Remote IRR (Read Only) used by the APIC for managing level-triggered interrupts.
15 Trigger mode: 0 is edge-triggered, 1 is level-triggered.
16 Interrupt mask: if it is 1 the interrupt is disabled, if 0 it is enabled.
The delivery mode field determines how the APIC should present the interrupt to the processor. The fixed
mode (0b000) is fine in almost all cases, the other modes are for specific functions or advanced usage.
13.3.6 X2 APIC
The X2APIC is an extension of the XAPIC (the local APIC in its regular mode). The main difference is the
registers are now accessed via MSRs and some the ID register is expanded to use a 32-bit value (previously
8-bits). While we’re going to look at how to use this mode, it’s perfectly fine to not support it.
Checking whether the current processor supports the X2APIC or not can be done via cpuid. It will be under
leaf 1, bit 21 in ecx. If this bit is set, the processor supports the X2APIC.
Enabling the X2APIC is done by setting bit 10 in the IA32_APIC_BASE MSR. It’s important to note that
once this bit is set, we cannot clear it to transition back to the regular APIC operation without resetting the
system.
Once enabled, the local APIC registers are no longer memory mapped (trying to access them there is now an
error) and can instead be accessed as a range of MSRs starting at 0x800. Since each MSR is 64-bits wide, the
offset used to access an APIC register is shifted right by 4 bits.
As an example, the spurious interrupt register is offset 0xF0. To access the MSR version of this register we
would shift it right by 4 (0xF0 >> 4 = 0xF) and then add the base offset (0x800) to get the MSR we want.
That means the spurious interrupt register is MSR 0x80F.
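In code the mapping is mechanical; a sketch, assuming rdmsr/wrmsr wrappers around the corresponding instructions:
uint64_t x2apic_read(uint32_t xapic_offset) {
    return rdmsr(0x800 + (xapic_offset >> 4)); //e.g. 0xF0 becomes MSR 0x80F
}

void x2apic_write(uint32_t xapic_offset, uint64_t value) {
    wrmsr(0x800 + (xapic_offset >> 4), value);
}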
Since MSRs are 64-bits, the upper 32 bits are zero on reads and ignored on writes. As always there is an
exception to this, which is the ICR register (used for sending IPIs to other cores) which is now a single 64-bit
register.
the processor. This is a separate mechanism to the interrupt flag (IF), which also disables interrupts being
served to the processor. It is possible to send EOI to the local APIC while IF is cleared (disabling interrupts)
and no further interrupts will be served until IF is set again.
There are a few exceptions where sending an EOI is not needed, mainly spurious interrupts and NMIs. The EOI can be sent at any time when handling an interrupt, but it's important to do it before returning with iret. If we enable interrupts and only ever receive a single interrupt, forgetting to send the EOI may be the reason.
13.4 I/O APIC
The I/O APIC ID field is mostly fluff, as we’ll be accessing the I/O APIC by its MMIO address, not its ID.
The Global System Interrupt Base is the first interrupt number that the I/O APIC handles. In the case of
most systems, with only a single I/O APIC, this will be 0.
To check the number of inputs an I/O APIC supports:
uint32_t ioapicver = read_ioapic_register(IOAPICVER);
size_t number_of_inputs = ((ioapicver >> 16) & 0xFF) + 1;
The number of inputs, minus one, is encoded in bits 23:16 of the IOAPICVER register, hence the +1 above.
Memory Address   Mnemonic Name   Register Name         Description
FEC0 0000h       IOREGSEL        I/O Register Select   Used to select the I/O APIC register to access
FEC0 0010h       IOWIN           I/O Window (data)     Used to access the data selected by IOREGSEL
And then there are 4 I/O Registers that can be accessed using the two above:
Bit Description
31:8 Reserved
7:0 APIC Register Address: specifies the I/O APIC register to be read or written via the IOWIN register
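Putting IOREGSEL and IOWIN together gives the access pattern behind read_ioapic_register, used earlier; a sketch, assuming ioapic_base is the virtual address the I/O APIC is mapped at:
volatile uint32_t* ioapic_base; //mapped I/O APIC MMIO region

uint32_t read_ioapic_register(uint32_t reg) {
    ioapic_base[0] = reg;  //IOREGSEL, at offset 0x0
    return ioapic_base[4]; //IOWIN, at offset 0x10 (0x10 / sizeof(uint32_t) = 4)
}

void write_ioapic_register(uint32_t reg, uint32_t value) {
    ioapic_base[0] = reg;
    ioapic_base[4] = value;
}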
• Bus source: usually constant, with a value of 0 (the ISA IRQ source); starting from ACPI v2 it is also a reserved field.
• IRQ source: the source IRQ pin.
• Global system interrupt: the target IRQ on the APIC.
Flags are defined as follows:
• Polarity (length: 2 bits, offset: 0) of the APIC I/O input signals; possible values are:
– 00 Use the default settings (active-low for level-triggered interrupts)
– 01 Active high
– 10 Reserved
– 11 Active low
• Trigger Mode (length: 2 bits, offset: 2), the trigger mode of the APIC I/O input signals:
– 00 Use the default settings (for ISA this is edge-triggered)
– 01 Edge-triggered
– 10 Reserved
– 11 Level-triggered
• Reserved (length: 12 bits, offset: 4), this must be 0
• The lower double word is basically an LVT entry; for the meaning of its fields check the LVT entry definition.
• The upper double word contains:
– Bits 17 to 55: Reserved.
– Bits 56 to 63: the Destination Field. In physical addressing mode (see the destination bit of the entry) it is the local APIC id to forward the interrupts to; for more information read the I/O APIC datasheet.
The number of items is stored in the I/O APIC MADT entry, but on modern architectures it is usually 24.
Chapter 14
Timer
Timers are useful for all sorts of things, from keeping track of real-world time to forcing entry into the kernel
to allow for pre-emptive multitasking. There are many timers available, some standard and some not.
In this chapter we’re going to take a look at the main timers used on x86 and what they’re useful for.
We're going to focus on setting up the local APIC timer, and calibrating it with either the PIT or HPET. We'll also have a look at how we could use the TSC with the local APIC to generate interrupts.
14.2.2 Example
As an example let's say we want the PIT to trigger an interrupt every 1ms (1ms = 1/1000 of a second). To figure out what to set the reload register to (how many cycles of the PIT's clock), we divide the clock rate by the frequency we want: 1193182 / 1000 = 1193.18.
One problem is that we can't use floating point numbers for these counters, so we truncate the result to 1193. This does introduce some error, which could be corrected for over a long period if we wanted. However, for our purposes it's small enough to ignore, for now.
To actually program the PIT with this value is pretty straightforward: we first send a configuration byte to the command port (0x43) and then the reload value to the channel port (0x40).
The configuration byte is actually a bitfield with the following layout:
Bits Description
0 Selects BCD/binary coded decimal (1) or binary (0) encoding. If unsure, leave this as zero.
1-3 Selects the mode to use for this channel.
4-5 Select the access mode for the channel: generally it should be 0b11 which means we send the low
byte, then the high byte of the 16-bit register.
6-7 Select the channel we want to use, we always want channel 0.
For our example we’re going to use binary encoding, mode 2 and channel 0 with the low byte/high byte access
mode. This results in the following config byte: 0b00110100.
Now it’s a matter of sending the config byte and reload value to the PIT over the IO ports, like so:
void set_pit_periodic(uint16_t count) {
outb(0x43, 0b00110100);
outb(0x40, count & 0xFF); //low-byte
outb(0x40, count >> 8); //high-byte
}
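For the 1ms example above, this is just a call to set_pit_periodic(1193).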
Now we should be getting an interrupt from the PIT every millisecond! By default the PIT appears on irq0, which may be remapped to irq2 on modern (UEFI-based) systems. Also be aware that the PIT is a system-wide device, and if using the APIC we will need to program the I/O APIC to route the interrupt to one of the local APICs.
14.3.1 Discovery
To determine if the HPET is available we’ll need access to the ACPI tables. Handling these is covered in a
separate chapter, but we’re after one particular SDT with the signature of ‘HPET’. If not familiar with ACPI
tables yet, feel free to come back to the HPET later.
This SDT has the standard header, followed by the following fields:
struct HpetSdt {
ACPISDTHeader header;
uint32_t event_timer_block_id;
uint32_t reserved;
uint64_t address;
uint8_t id;
uint16_t min_ticks;
uint8_t page_protection;
}__attribute__((packed));
Authors note: the reserved field before the address field is actually some type information describing the address
space where the HPET registers are located. In the ACPI table this reserved field is the first part of a ‘generic
address structure’, however we can safely ignore this info because the HPET spec requires the registers to be
memory mapped (thus in memory space).
We’re mainly interested in the address field which gives us the physical address of the HPET registers. The
other fields are explained in the HPET specification but are not needed for our purposes right now.
As with any MMIO we will need to map this physical address into the virtual address space, so we can access
the registers with paging enabled.
We can read the main counter at any time, and it is measured in timer ticks. We can convert these ticks into real time by multiplying them by the timer period in the general capabilities register. Bits 63:32 of the general capabilities register contain the number of femtoseconds for each tick. A nanosecond is 1,000,000 femtoseconds, and 1 second is 10^15 femtoseconds.
We can also write to the main counter, usually we would write a 0 here when initializing the HPET in order
to be able to determine uptime, but this is not really necessary.
The general capabilities register contains some other useful information, briefly summarized below. If interested
in more details, all of this is available in the public specification.
• Bits 63:32 : The number of femtoseconds for each tick of the main clock.
• Bits 31:16 : This field contains the PCI vendor ID of the HPET manufacturer, not needed for operation.
• Bit 15 : Legacy routing support, if set indicates this HPET can emulate the PIT and RTC timers present
in older PCs.
• Bit 13 : If 1 indicates the main counter is 64-bits wide, otherwise it’s 32-bits.
• Bits 12:8 : Encodes the number of timers supported. This is the id of the last timer; a value of 2 means
there are three timers (0, 1, 2).
• Bits 7:0 : Hardware revision id.
In order for the main counter to actually begin counting, we need to enable it. This is done by setting bit 0 of
the general configuration register. Once this bit is set, the main counter will increment by one every time its
internal clock ticks. The period of this clock is what’s specified in the general capabilities register (bits 63:32).
The general configuration register also contains one other interesting setting: bit 1. If this bit is set the HPET
is in legacy replacement mode, where it pretends to be the PIT and RTC timer. This is the default setting,
and if we want to use the HPET as described above this bit should be cleared.
14.3.3 Comparators
The main counter is only suitable for polling the time, but it cannot generate interrupts. For that we have to
use one of the comparators. The HPET will always have at least three comparators, but may have up to 32.
In reality most vendors use the stock intel chip which comes with 3 comparators, but there are some other
vendors of compatible hardware out there which may support more.
By default, the first two comparators are set up to mimic the PIT and RTC clocks, but they can be configured
like the others.
It’s worth noting that all comparators support one-shot mode, but periodic mode is optional. Testing if a
comparator supports periodic mode can be done by checking if bit 4 is set in the capabilities register for that
comparator.
Speaking of which: each comparator has its own set of registers to control it. These registers are accessed
as an offset from the HPET base. There are two registers we’re interested in: the comparator config and
capability register (accessed at offset 0x100 + N * 0x20), and the comparator value register (at offset 0x108
+ N * 0x20). In those equations N is the comparator number we want. As an example to access the config
and capability register for comparator 2, we would determine its location as: 0x100 + 2 * 0x20 = 0x140.
Meaning we would access the register at offset 0x140 from the HPET mmio base address.
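Expressed as hypothetical helper macros:
#define HPET_COMPARATOR_CONFIG(n) (0x100 + ((n) * 0x20))
#define HPET_COMPARATOR_VALUE(n)  (0x108 + ((n) * 0x20))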
The config and capabilities register for a comparator also contains some other useful fields to be aware of:
• Bits 63:32 : This is a bitfield indicating which interrupts this comparator can trigger. If a bit is set,
the comparator can trigger that interrupt. This maps directly to GSIs, which are the inputs to the IO
APIC. If there is only a single IO APIC in the system, then these interrupt numbers map directly to
the IO APIC input pins. For example if bits 2/3/4 are set, then we could trigger the IO APIC pins
2/3/4 from this comparator.
• Bits 13:9 : Write the integer value of the interrupt that should be triggered by this comparator. It’s
recommended to read this register back after writing to verify the comparator accepted the interrupt
number that has been set.
• Bits 4:3 : Bit 4 is set if the comparator supports periodic mode. Bit 3 is used to select periodic mode if
it’s supported. If either bit is cleared, the comparator operates as a one-shot.
• Bit 2 : Enables the comparator to generate interrupts. Even if this is cleared the comparator will still
operate, and set the interrupt pending bit, but no interrupt will be sent to the IO APIC. This bit acts
in reverse to how a mask bit would: if this bit is set, interrupts are generated.
14.3.4 Example
Let’s look at two examples of using the HPET timer: polling the main counter and setting up a one-shot
timer. In case a periodic timer is needed, more work is needed, and check that a comparator supports periodic
mode.
We’re going to assume that the HPET registers are mapped into virtual memory, and that address is stored
in a variable void* hpet_regs.
Polling the main counter is very straightforward:
uint64_t poll_hpet() {
    volatile uint64_t* caps_reg = hpet_regs;
    uint32_t period = *caps_reg >> 32;
    //a sketch of the rest of this function: read the main counter
    //(at offset 0xF0) and convert its ticks into femtoseconds.
    volatile uint64_t* main_counter = (volatile uint64_t*)((char*)hpet_regs + 0xF0);
    return *main_counter * period;
}
14.4 Local APIC Timer
14.4.1 Example
Calibrating a timer is explained above, so we're going to assume there is a function called lapic_ms_to_ticks that converts a number of milliseconds into a number of local APIC timer ticks. This may not be necessary, but it serves for the example. We're also going to assume that the divisor register is set to the desired value. If not sure what this does, it divides the incoming clock pulses, reducing the rate at which the timer ticks. This is useful in case longer durations are needed. Starting with a value of 2 or 4 is recommended.
Other than setting the initial count, we also have to set up the timer LVT entry. There’s a few fields here,
but we’re mostly interested in the following:
• Bits 7:0 : the interrupt vector the timer will trigger when it expires. It will only trigger that vector on the core the local APIC is attached to.
• Bit 16 : Acts as a mask bit, if set the timer won’t generate an interrupt when expiring.
• Bits 18:17 : The mode field. Set this to 0b00 for one-shot operation, and 0b01 for periodic.
The Intel and AMD manuals contain the full description, if interested in exploring the other functionality offered.
void arm_lapic_interrupt_timer(size_t millis, uint8_t vector) {
    volatile uint32_t* lvt_reg = lapic_regs + 0x320;
    //note this clears bits 16 (mask) and 18:17 (mode = one-shot)
    *lvt_reg = vector;
    //a sketch of the remainder: writing the initial count register
    //(offset 0x380) is what actually arms the timer.
    volatile uint32_t* init_count_reg = lapic_regs + 0x380;
    *init_count_reg = lapic_ms_to_ticks(millis);
}
The TSC is probably the simplest timer we've covered so far: it's simply a 64-bit counter that increments every time the base clock of the processor pulses. To read this counter we can use the rdtsc instruction, which places the low 32 bits in eax and the high 32 bits in edx, similar to how the MSR instructions work.
There are some issues with this version of the TSC however: modern processors will change their base speed
depending on power/performance requirements, which means that the rate the TSC ticks at will change
dynamically! This makes it pretty useless as a timer, and a newer version was quickly implemented, called
the invariant TSC.
The I-TSC ticks at the base speed the processor is supposed to run at, not what it's actually running at, meaning the tick rate is constant. Most processors support the I-TSC nowadays, and most emulators do too, even if they don't advertise it through cpuid (qemu has an invariant TSC, but doesn't set the bit). Testing whether the TSC is invariant can be done via cpuid once again: extended leaf 0x80000007, edx bit 8.
How about generating interrupts with the TSC? This is an optional feature (that's almost always supported) called TSC deadline. We can test for its existence via cpuid leaf 1, ecx, bit 24. To use TSC deadline we write the absolute time (in TSC ticks) of when we want the interrupt to occur to a special MSR, called IA32_TSC_DEADLINE (MSR 0x6E0).
When the TSC passes the tick value in this MSR, it tells the local APIC, and if TSC deadline mode is selected
in the timer LVT an interrupt is generated. Selecting TSC deadline mode can be done by using mode 0b10
instead of 0b00/0b01 in the timer LVT register.
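A sketch of arming a deadline some number of ticks in the future (rdtsc and wrmsr are assumed wrappers around the corresponding instructions):
void arm_tsc_deadline(uint64_t ticks_from_now) {
    wrmsr(0x6E0, rdtsc() + ticks_from_now); //IA32_TSC_DEADLINE
}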
Chapter 15
Adding Keyboard Support
Writing an OS is great fun, but instead of only printing stuff to the screen, it would be better if we could interact with the user, right? In the next chapters we will see how to interact with the keyboard and get the user's input.
The topics we are going to cover are:
• Handling the keyboard interrupt: the first thing to do is to properly handle the IRQs generated by the keyboard, and understand when they are generated and how to identify the event type.
• Writing a driver: how to translate a scancode into a character, and print it on the screen/logs.
• Once we have the full scancode, store it in a buffer along with any extra info we might need (any
currently pressed modifiers).
Translating the scancode to a printable character is not in the list above; we'll touch on how to do this briefly, although it's not really related to the keyboard driver.
Chapter 16
Handling the Keyboard Interrupt
Either the PIC or IOAPIC can be used to set up the keyboard irq. For this chapter we'll use the IOAPIC, as it's more modern and the LAPIC + IOAPIC combination is the evolution of the PIC. If using the PIC, most of the theory still applies; we'll just need to adjust the irq routing code accordingly.
To keep the examples below simple, we’ll assume only a single IOAPIC is present in the system. This is true
for most desktop systems, and is only something to worry about in server hardware.
Once we have a valid IDT entry, we can clear the mask bit in the IOAPIC redirect entry for the ps/2 keyboard.
Be sure that the destination LAPIC id is set to the cpu we want to handle the keyboard interrupts. This id
can be read from the LAPIC registers.
Since there are three different scancode sets, it’s a good idea to check what set the keyboard is currently using.
Usually the PS/2 controller, the device that the OS is actually talking to on ports 0x60 and 0x64, converts
set 2 scancodes into set 1 (for legacy reasons).
We can check if the translation is enabled by sending the command 0x20 to the command register (port 0x64), and then reading the byte returned on the data port (0x60). If the 6th bit is set then the translation is enabled (see the sketch after the steps below).
We can disable the translation if we want; in that case we need to do the following steps:
• Read the current controller configuration byte, by sending command 0x20 to port 0x64 (the reply byte will be sent on port 0x60).
• Clear the 6th bit of the configuration byte.
• Send the modified configuration byte back to the controller: send command 0x60 to port 0x64, then send the byte to port 0x60.
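A sketch of both operations (using inb/outb as before; a robust driver would also poll the controller's status register on port 0x64 before each read or write):
uint8_t ps2_read_config() {
    outb(0x64, 0x20); //request the controller configuration byte
    return inb(0x60);
}

bool ps2_translation_enabled() {
    return (ps2_read_config() >> 6) & 1; //bit 6: translation enabled
}

void ps2_disable_translation() {
    uint8_t config = ps2_read_config() & ~(1 << 6); //clear the translation bit
    outb(0x64, 0x60); //announce we're writing the config byte back
    outb(0x60, config);
}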
For our driver we will keep the translation enabled, since we’ll be using set 1.
The only scancode set guaranteed to be supported by keyboards is set 2. Keep in mind that most of the time the kernel communicates with a controller compatible with the Intel 8042 PS/2 controller, in which case the scancodes can be translated into set 1.
Value Description
0 Get current set
1 Set scancode set 1
2 Set scancode set 2
3 Set scancode set 3
The command has to be sent to the device port (0x60), and the reply will be composed of two bytes. If we are setting the scancode set, the reply will be: 0xFA 0xFE. If we are reading the currently used set, the response will be 0xFA followed by one of the values below:
Value Description
0x43 Scancode set 1
0x41 Scancode set 2
0x3f Scancode set 3
Chapter 17
Keyboard Driver Implementation
Now that we can get scancodes from the keyboard (see the previous chapter), we’ll look at building a simple
PS/2 keyboard driver.
First of all, the driver is generally not responsible for translating key presses and releases into printable characters. The driver's purpose is to deal with the specifics of the device (the PS/2 keyboard) and provide a generic interface for getting key presses/releases. However, this does usually involve translating from the keyboard-specific scancode into an OS-specific one. The idea is that if more scancode sets or keyboards (like USB) are supported later on, these can be added without having to modify any code that uses the keyboard. Simply write a new driver that provides the same interface and it will work!
Our keyboard driver does care about keeping track of any keyboard events (presses/releases), and making
them available to any code that needs them. Quite often these events will be consumed (that is, read by some
other code and then removed from the driver’s buffer).
As already mentioned there are 3 scancode sets. We'll focus on just one (set 1, since by default the PS/2 controller translates the other sets to set 1 when the system powers on). We'll implement the translate function in a generic fashion, to make adding other scancode sets easier in the future.
Now let's see what problems we need to solve when developing a keyboard driver:
• We'll need to store the history of key presses and their statuses somewhere.
• There are some special keys that also need to be handled, and some combinations that we should handle (was shift/alt/ctrl/super pressed at the same time as this key?).
• Handle the press/release status where needed (we don't care much when a normal key is released, but we do care when a key like shift is).
• Try not to lose the sequence of keys pressed/released.
• Handle the caps, num, and scroll lock keys (with the LEDs).
• We can optionally translate the scancode into a human-readable character when needed.
From now on we will assume that the scancode translation is enabled, so no matter what set is being used it
will be translated to set 1.
We will try to build the driver in small steps adding one piece at time, so it will be easier to understand it.
#define MAX_KEYB_BUFFER_SIZE 255
uint8_t keyboard_buffer[MAX_KEYB_BUFFER_SIZE];
uint8_t buf_position = 0;
If we want to store just the scancode we don’t need much more, so we can already implement our new irq
handler:
void keyboard_driver_irq_handler() {
uint8_t scancode = inb(0x60); // Read byte from the Keyboard data port
keyboard_buffer[buf_position] = scancode;
buf_position = (buf_position + 1) % MAX_KEYB_BUFFER_SIZE;
}
And we're done! This first function will keep track of the scancodes generated by key presses, and since we're using set 1 it will also tell us if the button has been pressed (MAKE) or released (BREAK).
Now, using uint8_t as the buffer type works in this rather simple scenario, but it makes our driver hard to expand in future updates. For example, what if we want to attach some extra information to each key event? We will actually be doing this later, so we'll make our lives easier now by using a struct.
typedef struct {
uint8_t code;
} key_event;
So the updated irq function will be:
#define MAX_KEYB_BUFFER_SIZE 255
key_event keyboard_buffer[MAX_KEYB_BUFFER_SIZE];
uint8_t buf_position = 0;
void keyboard_driver_irq_handler() {
int scancode = inb(0x60); // Read byte from the Keyboard data port
keyboard_buffer[buf_position].code = scancode;
buf_position = (buf_position + 1) % MAX_KEYB_BUFFER_SIZE;
}
There are a few limitations with this implementation, but we have a working skeleton of a driver. We can
track key-presses in a circular buffer.
problem is that we can't read both bytes in one single interrupt, because even if we do, we still get two interrupts generated. We're going to solve this problem by keeping track of the current status of what we're reading. To do this we will implement a very simple state machine with two states:
• Normal state: this is exactly what it sounds like, and also the one the driver starts in. If the driver reads a byte that is not the prefix byte (0xE0) it will remain in this state. After being in the prefix state and reading a byte, it will also return to this state.
• Prefix state: the driver enters this state when it encounters the prefix byte. While in this state we know the next byte read is an extended scancode, and can process it appropriately.
If not familiar with state machines, there's a link to the wikipedia page in the Useful Resources appendix chapter. It's a straightforward concept: an algorithm can only be in one of several states, and it reacts differently in each state. In our example we're going to use a global variable to identify the state:
#define NORMAL_STATE 0
#define PREFIX_STATE 1
uint8_t current_state;
There are some scancodes that have up to 4 or more bytes which we’re not going to cover here.
Author’s note: This is one area where the state-machine implementation can break down. As you potentially
need a separate state for each byte in the sequence. An alternative implementation, that’s not covered here,
is to have an array of uint8_ts, and a pointer to the latest byte in the buffer. The idea being: read a byte
from the buffer, place it after the last received byte in the array, and then increment the variable of the latest
byte. Then you can check if a full scancode has been received, for extended codes beginning with 0xE0 you’re
expecting 2 bytes, for normal codes only 1 byte. Once you’ve detected a full scancode in the buffer, process it,
and reset the pointer in the buffer for the next byte to zero. Therefore, the next byte gets placed at the start of
the buffer. Now it’s just a matter of making the buffer large enough, which is trivial.
Regarding storing the prefix byte, this comes down to a design decision. In our case we’re not going to store
them as they don’t contain any information we need later on, when translating these scancodes into the kernel
scancodes. Just to re-iterate: the idea of using a separate, unrelated, scancode set inside the kernel is that
we’re not bound to any implementation. Our keyboard driver can support as many sets as needed, and the
running programs just use what the kernel provides, in this case its own scancode set. It seems like a lot of
work up front, but it’s a very useful abstraction to have!
Now by changing the current_state variable, we can change how the code will treat the incoming data.
We’ll also need an init function, so we can do some set up like setting the default state and zeroing the
keyboard event buffer:
#define NORMAL_STATE 0
#define PREFIX_STATE 1
uint8_t current_state;
void init_keyboard() {
// You'll want to do other setup here in your own driver:
// ensure the input buffer of the keyboard is empty, check which scancode
// set is in use, enable irqs.
current_state = NORMAL_STATE;
}
void keyboard_driver_irq_handler() {
int scancode = inb(0x60); // Read byte from the Keyboard data port
if (scancode == 0xE0) {
current_state = PREFIX_STATE
101
CHAPTER 17. KEYBOARD DRIVER IMPLEMENTATION 102
102
103 17.1. HIGH LEVEL OVERVIEW
As an example, say we detect the CTRL key is pressed. We would want to update the current modifiers
(which we store a copy of whenever we store a new key event):
current_modifiers |= 1 << CTRL_MASK;
And to clear the bit when we detect CTRL is released:
current_modifiers &= ~(1 << CTRL_MASK);
At this point we just need to identify what key is being pressed/released and update the status_mask
accordingly.
The case of caps lock can be handled in 2 ways. The first is to add a boolean variable to the key_event struct
which stores the current state of caps lock. We can also use one of the unused bits in the status_mask field.
An interesting note is that on ps/2 keyboards the LEDs must be controlled manually, implementing this is as
simple as a single command to the keyboard, and is left as an exercise for the reader.
17.1.4 Translation
Now that all the core parts of the driver are in place, let’s talk about translation.
There’s two main stages of translation we’re interested in at the moment:
• From the keyboard-specific scancode to our kernel scancode (the one applications use).
• From the kernel scancode to a printable ascii character. This isn’t really part of the keyboard driver,
but we will cover it here since it’s a useful function to test if the keyboard driver works.
Translation from the keyboard scancode to the kernel one can be done in a number of ways. In our example
driver we’re going to use a lookup table in the form of an array.
Our array is going to be an array of kernel scancodes, with the index into the array being the keyboard
scancode. Let’s say get scancode 0x12 from the keyboard, and we know that key is the F1 key (just an
example, check the real scancodes before implementing this).
We could use the following:
103
CHAPTER 17. KEYBOARD DRIVER IMPLEMENTATION 104
scancode should be ignored. We could also filter them out when an application tries to get any pending key
events.
We also don’t check if the keyboard scancode is within the lookup table, which it may not be. This is
something to consider.
So now we have our internal representation of a scancode, and the code field in the key_event structure
outlined above can use it. In the paragraph Store Key Press History we have seen how the interrupt handler
should save the key event in the circular buffer. However, that was before we had any translation. Using
what we saw above we’ll change the following line to now use the lookup table instead of storing the scancode
directly:
keyboard_buffer[buf_position].code = scancode;
becomes
keyboard_buffer[buf_position].code = scancode_mapping[scancode];
At this point we have a fully functioning PS/2 keyboard driver! However, we will quickly cover translating a
kernel scancode into a printable character, as that’s a useful feature to have at this stage.
There’s a few approaches to getting printable characters from our kernel scancodes:
• Using a lookup table like we did before. We could have 2 tables, one for shifted keys and one for
non-shifted keys.
• Using a big switch statement, with inline if/elses to handle shifting.
A lookup table would work the same as it did above. If we want the scancode with the value 6 to translate
to the printable character ‘f’, we would put ‘f’ at the 6th position in the lowercase array, and ‘F’ in the 6th
position of the shifted array.
char lower_chars[] = {
'a', 'b', 'c', 'd', 'e', 'f', [ ... ]
};
char shifted_chars[] = {
'A', 'B', 'C', 'D', 'E', 'F', [ ... ]
};
104
105 17.1. HIGH LEVEL OVERVIEW
case KEY_A:
return shifted ? 'A' : 'a';
case KEY_B:
return shifted ? 'B' : 'b';
[ ... ]
}
}
And that’s basically it, in this chapter we went through the basic of implementing a Keyboard Driver, and
translating a scancode into a readable character. This will let us in the future to implement our own command
line interpreter, and other cool stuffs.
105
CHAPTER 17. KEYBOARD DRIVER IMPLEMENTATION 106
106
Part III
Video Output
107
Chapter 18
One of the first thing we want to do is make our Operating System capable of producing some kind of screen
output, even if not strictly necessary (there can be different ways to debug our OS behaviour while developing),
it can be useful sometime to visualize something in real time, and probably especially if at the beginning of
our project, is probably very motivating having our os print a nice logo, or write some fancy text.
As per many other parts there’s a few ways we can get output on the screen, in this book we’re going to
use a linear framebuffer (linear meaning all its pixels are arranged directly after each other in memory), but
historically there have been other ways to display output, some of them listed below:
• In real mode there were BIOS routines that could be called to print to the display. There were sometimes
other extensions for hardware accelerated drawing of shapes or sprites too. This is not not implemented
in modern systems, and even then it’s only available to real mode software.
• In legacy systems some display controllers supported something called ‘text mode’, where the screen
was an array of characters like a terminal, rather than an array of pixels. This is long deprecated, and
UEFI actually requires all displays to operate as pixel-based framebuffers. Often the text mode buffer
lived around the 0xB800 address. This buffer is comprised of pairs of bytes, the first being the ascii
character to display, and the second encodes the foreground and background colour.
• Modern systems provide some kind of linear framebuffer these days. Often this is obtained through
BIOS routines on older systems, or by the Graphics Output Protocol (GOP) from UEFI. Most boot
protocols will present these framebuffers in a uniform way to our kernel.
In these chapters we’re going to use a linear framebuffer, since it’s the only framebuffer type reliably available
on x86_64 and other platforms.
109
CHAPTER 18. VIDEO OUTPUT AND FRAMEBUFFER 110
Size Description
u32 type = 8
u32 size
u64 framebuffer_addr
u32 framebuffer_pitch
u32 framebuffer_width
u32 framebuffer_height
u8 framebuffer_bpp
u8 framebuffer_type
u8 reserved
varies color_info
Where:
• type is the type of the tag being read, and 8 means that it is a framebuffer info tag.
• size it indicates the size of the tag (header info included).
• framebuffer_addr it contains the current address of the framebuffer.
• framebuffer_pitch contains the pitch in bytes.
• framebuffer_width contains the fb width.
• framebuffer_height contains the fb height.
• framebuffer_bpp it contains the number of bits per pixel (is the depth in the multiboot request tag).
• framebuffer_type it indicates the current type of FB, and the content of color_index.
• reserved is always 0 and should be ignored.
• color_info it depends on the framebuffer type.
Pitch is the number of bytes on each row. bpp is same as depths.
Size Description
u32 framebuffer_palette_num_colors
varies framebuffer_palette
110
111 18.4. PLOTTING A PIXEL
The framebuffer_palette_num_colors is the number of colors available in the palette, and the framebuffer
palette is an array of colour descriptors, where every colour has the following structure:
Size Description
u8 red_val
u8 green_val
u8 blue_val
Size Description
u8 framebuffer_red_field_position
u8 framebuffer_red_mask_size
u8 framebuffer_green_field_position
u8 framebuffer_green_mask_size
u8 framebuffer_blue_field_position
u8 framebuffer_blue_mask_size
Where framebuffer_XXX_field_position is the starting bit of the color XXX, and the framebuffer_XXX_mask_size
is the size in bits of the color XXX. Usually the format is 0xRRGGBB (is the same format used in HTML).
• If it is 2, it means EGA text, so the width and height are specified in characters and not pixels,
framebuffer-bpp = 16 and framebuffer_pitch is expressed in byte text per line.
column = x ∗ bpp
Now we have the offset in byte for both the row and column byte, to compute the absolute address of our
pixel we need just need to add row and column to the base address:
111
CHAPTER 18. VIDEO OUTPUT AND FRAMEBUFFER 112
This address is the location where we are going to write a colour value and it will be displayed on our screen.
Be aware that grub is giving us a physical address for the framebuffer_base. When enabling virtual memory
(refer to the Memory Management part) be sure to map the framebuffer somewhere so that it can be still
accessible, otherwise a Page Fault Exception will be triggered!
112
Chapter 19
Drawing Fonts
When framebuffer is enabled, the hardware bios video memory is no longer accessible, and the only things
that we can do now is drawing pixels.
So in order to write text we need first to have at least one font available to our operating system and how to
store it.
The font can be one of many different available (ttf, psf, etc.) in this tutorial we will use Pc Screen Font v2
aka PSF (keep in mind that v1 has some differences in the header, if we want to support that as well, the
code needs to be adapted)
How the font is stored depends on the status of the operating system, there are several way:
• Store it in memory (if the OS doesn’t support a file system yet)
• In case a file system has been already implemented, it could be better to store them in a file
• We can also get a copy of the VGA Fonts in memory, using grub.
If running on linux there are some nearly ready to use fonts in /usr/share/kbd/consolefonts (path can change
slightly depending on the linux distribution)
In this chapter we are going to see how to use a font that has been stored into memory.
The steps involved are:
1. Find a suitable PSF font
2. Add it to the kernel
3. When the kernel is loading identify the PSF version and parse it accordingly
4. Write functions to: pick the glyph (a single character) and draw it on the screen
113
CHAPTER 19. DRAWING FONTS 114
The objcopy command is a tool that copy a source file into another, and can change its format. The
parameters used in the example above are:
• -O the output format (in this case is elf64-x86-64)
• -B is the binary architecture
• -I the inpurt target
Once converted into binary elf, it can be linked to the kernel like any other compiled file, in this case we just
need to add the output file to the linker command:
ld -n -o build/kernel.bin -T src/linker.ld <other_kernel_files> font.o -Map
,→ build/kernel.map
With the font linked, now is possible to access to 3 new variables, like in the following example:
readelf -s font.o
114
115 19.3. GLYPH
• number of glyphs is always 256 unless the mode field is set to 1, in this case it means that the font will
have 512 characters
The font data starts right after the header.
19.3 Glyph
Now that we have access to our PSF font, we can work with “Glyphs”. Every character (Glyph) is stored in a
bitmap. Each bitmap is WIDTH x HEIGHT pixel . If for example the glyph is 8x16, it will be 16 bytes long,
every byte encode a row of the glyph. Below an example of how a glyph is stored:
00000000b byte 0
00000000b byte 1
00000000b byte 2
00010000b byte 3
00111000b byte 4
01101100b byte 5
11000110b byte 6
11000110b byte 7
11111110b byte 8
11000110b byte 9
11000110b byte 10
11000110b byte 11
11000110b byte 12
00000000b byte 13
00000000b byte 14
00000000b byte 15
(This is the letter A).
The glyphs start right after the psf header, the address of the first character will be then:
uint8_t* first_glyph = (uint8_t*) &_binary_font_psf_start +
default_font->headersize
115
CHAPTER 19. DRAWING FONTS 116
Since we know that every glyph has the same size, and this is available in the PSF_Header, if we want to
access the i-th character, we just need to do the following:
uint8_t* selected_glyph_v1 = (uint8_t*) &_binary_font_psf_start +
sizeof(PSFv1_Header_Struct) + (i * default_font->bytesperglyph);
116
Part IV
Memory Management
117
Chapter 20
Memory Management
Welcome to the first challenge of our osdev adventure! Memory management in a kernel is a big area, and it
can easily get very complex. This chapter aims to breakdown the various layers you might use in your kernel,
and explain how each of them is useful.
The design and complexity of a memory manger can vary greatly, a lot depends on what the operating system
is designed, and its specific goals. For example if only want mono-tasking os, with paging disabled and no
memory protection, it will probably be fairly simple to implement.
In this part we will try to cover a more common use case that is probably what nearly all modern operating
system uses, that is a 32/64 operating system with paging enabled, and various forms of memory allocators
for the kernel and one for user space.
In the appendices there is also an additional section on memory protection features available in some CPUs.
We will cover the following topics:
• Physical Memory Manager
• Paging
• Virtual Memory Manager
• Heap Allocation
Authors note: don’t worry, we will try to keep it as simple as possible, using basic algorithms and explaining
all the gray areas as we go. The logic may sometimes be hard to follow, you will most likely have to go through
several reads of this part multiple times.
Each of the layers has a dedicated chapter, however we’ll start with a high level look at how they fit together.
Before proceeding let’s briefly define the concepts above:
119
CHAPTER 20. MEMORY MANAGEMENT 120
20.3 Paging
Although Paging and VMM are strongly tied, let’s split this topic into two parts: with paging we refer to the
hardware paging mechanism, that usually involves tables, and registers and address translation, while the
VMM it refers to the higher level (usually architecture independent).
While writing the support for paging, independently there are few future choices we need to think about now:
• Are we going to have a single or multiple address spaces (i.e. every task will have its own address space)?
If yes in this case we need to keep in mind that when mapping addresses we need to make sure they
are done on the right Virtual Memory Space. So usually a good idea is to add an extra parameter
to the mapping/unmapping functions that contains the pointer to the root page table (for _x86_64
architecture is the PML4 table).
• Are we going to support User and Supervisor mode? In this case we need to make sure that the correct
flag is set in the table entries.
120
121 20.5. HEAP ALLOCATOR
– Paging: Can be used to map virtual pages (what the running code sees) to physical pages (where
it exists in real memory).
– Swapping: Can ask other VMMs to swap their in-use memory to storage, and handover any freed
pages for us to use.
• A simple implementation often only involves paging.
• If using a higher half kernel, the upper half of every VMM will be identical, and contain the protected
kernel code and data.
• Can present other resources via virtual memory to simplify their interface, like memory mapping a file
or inter-process communication.
Similarly to paging there are some things we need to consider depending on our future decisions:
• If we are going to support multiple address spaces, we need to initialize a different VMM for every task,
so all the initialization/allocation/free function should be aware of which is the VMM that needs to be
updated.
• In case we want to implement User and Supervisor support, a good idea is to have separate address
space for the user processes/threads and the supervisor one. Usually the supervisor address space is in
the higher half and the user space is in the lower half, as it starts at the lowest address of the higher
half of the address space (in x86_64 and ricsv it is starting from: 0xFFFF800000000000)
121
CHAPTER 20. MEMORY MANAGEMENT 122
122
Chapter 21
The physical memory manager is responsible for tracking which parts of physical memory are in use, or free
for use. The PMM doesn’t manage individual bytes of memory, rather it keeps track of pages. A page is a
fixed size determined by the MMU: in the case of x86 this is 4096 (0x1000) bytes.
There are different ways on how to handle a PMM, one of them is using a bitmap. Where every bit represent
a page, if the bit is a 0 the page is available, if is 1 is taken.
What the physical memory manager has to take care of as its bare minimum is:
1. Initialize the data structures marking unavailable memory as “used”
2. Check if given an address it is already used or not
3. Allocate/free a page
In this chapter we will explain the bitmap method, because is probably the simplest to understand for a
beginner. To keep the explanation simple, we will assume that the kernel will support only one page size.
7 6 5 4 3 2 1 0
bitmap[0] 0 0 0 0 0 1 1 1
bitmap[1] 0 0 0 0 0 0 0 0
bitmap[2] 0 0 0 0 0 0 0 0
bitmap[3] 0 0 0 0 0 0 0 0
bitmap[4] 0 0 0 0 0 0 0 0
bitmap[5] 0 0 0 0 0 0 0 0
bitmap[6] 0 0 0 0 0 0 0 0
bitmap[7] 0 0 0 0 0 0 0 0
So marking a memory location as free or used is just matter of setting clearing a bit in this bitmap.
123
CHAPTER 21. PHYSICAL MEMORY MANAGER 124
It just represents the offset in bit from &bitmap (the starting address of the bitmap).
In our example with row=0 column=3 (and page size of 4k) we get:
• bit_number = (0 * 8) + 3 = 3
• address = bit_number * 4k = 3 * 4096 = 3 * 0x1000 = 0x3000
Another example: row = 1 column = 4 we will get:
• bit_number = (1 * 8) + 4 = 12
• address = bit_number * 4k = 0xC000
But what about the opposite way? Given an address compute the bitmap location? Still pretty easy:
address
bitmaplocation =
4096
In this way we know the “page” index into a hypothetical array of Pages. But we need row and columns, how
do we compute them? That depends on the variable size used for the bitmap, let’s stick to 8 bits, in this case:
• The row is given by bitmap_location / 8
• The column is given by: bitmap_location % 8
124
Chapter 22
Paging
22.1.1 Page
A page is a contiguous block of memory, with the exact size depending on what the architecture supports. On
x86_64 we have page sizes of 4K, 2M and optionally 1G. The smallest page size is also called a page frame as
it represents the smallest unit the memory management unit can work with, and therefore the smallest unit
we can work with! Each entry in a page table describes one page.
125
CHAPTER 22. PAGING 126
The memory page in the picture refers to a physical memory page (the picture above doesn’t refer to any
existing hardware paging, it is just an example scenario). Using logical addresses and paging, we can introduce
a whole new address space that can be much bigger than the available physical memory.
For example, we can have that:
phys(0x123'456) = virt(0xFFF'F234'5235)
Meaning that the virtual address 0xFFFF2345235 refers to the physical address 0x123456.
This mapping is usually achieved through the usage of several hierarchical tables, with each item in one level
pointing to the next level table. As already mentioned above a virtual address is a composition of entry
numbers for each level of the tables. Now let’s assume for example that we have 3 levels paging, 32 bits
addressing and the address translation mechanism used is the one in the picture above, and we have the
virtual address below:
126
127 22.2. PAGING IN LONG MODE
virtaddress = 0x2F880120
Looking at the picture above we know that the bits:
• 0 to 5 represent the offset (for offset we mean what location we want to access within the physical
memory page).
• 6 to 13 are the page table entry.
• 14 to 21 are the page directory level 1 entry.
• 21 to 31 are the page directory level 2 entry.
We can translate the above address to:
• Offset: 0x20 bytes into page.
• Page Table entry: number 0x4 (it points to the memory page).
• Page Dir 1 entry: 0x20 (it points to a page table).
• Page Dir 2 entry: 0xBE (it points to a page dir 1).
The above example is just an imaginary translation mechanism, we’ll discuss the actual x86_64 4-level paging
below. If we are wondering how the first page directory can be accessed, this will be clear later, but the
answer is that there is usually a special register that contains the base address of the root page directory (in
this example page dir 1).
127
CHAPTER 22. PAGING 128
128
129 22.3. PAGE DIRECTORIES AND TABLE STRUCTURE
63 62 . . . 59 58 . . . 52 51 . . . 40 39 . . . 12 11 . . . 9
XD PK or available Available Reserved must be 0 Table base address Available
8 ... 6 5 4 3 2 1 0
Reserved A PCD PWT U/S R/W P
Where Table base address is a PDPR table base address if the table is PML4 or the PD base address if the
table is the PDPR.
Now the Page Directory (PD) has few differences:
• Bits 39 to 12 are the page table’s base address when using 4k pages, or 2mb area of physical memory if
the PS bit is set.
• Bits 6 and 8 must be 0.
• Bit 7 is the Page Size bit (PS) if set it means that the entry points to a 2mb page, if clear it points to a
Page Table (PT).
• If we are using 2mb pages bit 12 to 20 are reserved and must be 0. If not, accessing address within this
range will cause a #PF.
63 62 . . . 59 58 . . . 52 51 . . . 40 39 . . . 12 11 . . . 9
XD PK or available Available Reserved must be 0 Page Base Address Available
8 7 6 5 4 3 2 1 0
G PAT D A PCD PWT U/S R/W P
In this table there are 3 new bits (D, PAT, G) and the page base address, as already explained, is not pointing
to a table but to the physical memory this page represents.
In the next section we will go through the fields of an entry.
129
CHAPTER 22. PAGING 130
• PWT (Page Level Write Through): Controls the caching policy (write-through or write-back). I usually
leave it to 0, for more information refer to the Intel Developer Manuals.
• PCD (Page Level Cache Disable): Controls the caching of individual pages or tables. I usually leave it
to 0, for more information refer to the Intel Developer Manuals.
• A (Accessed): This value is set by the CPU, if is 0 it means the page hasn’t been accessed yet. It’s set
when the page (or page teble) has been accessed since this bit was last cleared.
• D (Dirty): If set, indicates that a page has been written to since last cleared. This flag is supposed
to only apply to page tables, but some emulators will set it on other levels as well. This flag and the
accessed flag are provided for being use by the memory management software, the CPU only set it when
its value is 0. Otherwise, is up to the operating system’s memory manager to decide if it has to be
cleared or not. Ignoring them is also fine.
• PS (Page Size): Reserved in the pml4, if set on the PDPR it means address translation stops at this
level and is mapping a 1GB page. Check for 1gb page support before using this. More commonly this
can be set on the PD entry to stop translation at that level, and map a 2MB page.
• PAT (Page Attribute Table Index) only for the page table: It selects the PAT entry (in combination
with the PWT and PCD bits above), refer to the Intel Manual for a more detailed explanation.
• G (Global): If set it indicates that when CR3 is loaded or a task switch occurs that this particular entry
should not be ejected. This feature is not architectural, and should be checked for before using.
• PK (Protection Key): A 4-bit value used to control supervisor & user level accesses for a virtual address.
If bit 22 (PKE) is set in CR4, the PKRU register will be used to control access rights for user level
accesses based on the PK, and if bit 24 (PKS) is set, same will happen but for supervisor level accesses
with the PKRS register. Note: This value is ignored on older CPUs, which means those bits are marked
as available on them. If you want to use the protection key, make sure to check for its existence using
CPUID, and of course to set the corresponding bits for it in the CR4 register.
• XD: Also known as NX, the execute disable bit is only available if supported by the CPU (can be
checked wit CPUID), otherwise reserved. If supported, and after enabling this feature in EFER (see the
intel manual for this), attempting to execute code from a page with this bit set will result in a page
fault.
Note about PWT and PCD, the definition of those bits depends on whether PAT (page attribute tables) are in
use or not. For a better understanding of those two bits please refer to the most updated intel documentation
(is in the Paging section of the intel Software Developer Manual vol.3)
63 . . . . 48 47 . . . 39 38 . . . 30 29 .. 21 20 . . . 0
1 ... 1 1 ... 1 1 ... 0 0 ... 0 0 ... 0
Sgn. ext PML4 PDPR Page dir Offset
130
131 22.5. PAGE FAULT
63 . . . 48 47 . . . 39 38 . . . 30 29 . . . 21 20 . . . 12 11 . . . 0
1 ... 1 1 ... 1 1 ... 0 0 ... 0 0 ... 0 0 ... 0
Sgn. ext PML4 PDPR Page dir Page Table Offset
31 . . . . 4 4 3 2 1 0
Reserved I/D RSVD U/S W/R P
131
CHAPTER 22. PAGING 132
• Using a technique called recursion, where access the tables using special virtual addresses.
To use the recursion the only thing we need to do, is reserve an entry in the root page directory (PML4 in our
case) and make its base address to point to the directory itself.
A good idea is to pick a number high enough, that it will not interfere with other kernel/hardware special
addresses. For example let’s use the entry 510 for the recursive item.
Creating the self reference is pretty straightforward, we just need to use the directory physical address as the
base address for the entry being created:
pml4[510l] = pml4_physical_address | PRESENT | WRITE;
This should be done again when setting up paging, on early boot stages.
Now as we have seen above address translation will split the virtual address in entry numbers for the
different tables, starting from the leftmost (the root). So now if we have for example the following address:
virt_addr = 0xff7f80005000
The entries in this address are: 510 for PML4, 510 for PDPR, 0 for PD and 5 for PT (we are using 4k pages
for this example). Now let’s see what happens from the point of view of the address translation:
• First the 510th PML4 entry is loaded, that is the pointer to the PDPR, and in this case its content is
PML4 itself.
• Now it get the next entry from the address, to load the PD, that is again the 510th, and is again PML4
itself, so it is loaded as PD too.
• It is time for the third entry the PT, and in this case we have 0, so it loads the first entry from the
Page Directory loaded, that in this case is still PML4, so it loads the PDPR table
• Finally, the PT entry is loaded, that is 5, and since the current PD loaded for translation is actually a
PDPR we are going to get the 5th item of the page directory.
• Now the last part of the address is the offset, this can be used then to access the entries of the
directory/table loaded.
This means that by carefully using the recursive item from PML4 we can access all the tables.
Few more examples of address translation:
• PML4: 511 (hex: 1ff) - PDPR: 510 (hex: 1fe) - PD 0 (hex: 0) using 2mb pages translates to:
0xFFFF'FFFF'8000'0000.
• Let’s assume we mapped PML4 into itself at entry 510,
– If we want to access the content of the PML4 page itself, using the recursion we need to build a
special address using the entries: PML4: 510, PDPR: 510, PD: 510, PT: 510, now keep in mind
that the 510th entry of PML4 is PML4 itself, so this means that when the processor loads that
entry, it loads PML4 itself instead of PDPR, but now the value for the PDPR entry is still 510,
that is still PML4 then, the table loaded is PML4 again, repeat this process for PD and PT with
page number equals to 510, and we got access to the PML4 table.
– Now using a similar approach we can get access to other tables, for example the following values:
PML4: 510, PDPR:510, PD: 1, PT: 256, will give access at the Page Directory PD at entry number
256 in PDPR that is contained in the first PML4 entry.
This technique makes it easy to access page tables in the current address space, but it falls apart for accessing
data in other address spaces. For that purpose, we’ll need to either use a different technique or switch to that
address space, which can be quite costly.
132
133 22.6. ACCESSING PAGE TABLES AND PHYSICAL MEMORY
dmap_base. Typically, we’ll set this to some address in the higher half so that the lower half of the address
space is completely free for userspace programs. This also makes other parts of the kernel easier later on.
How does the direct map actually work though? It’s simple enough, we just map all of physical memory at
the same virtual address plus the dmap_base offset: paddr = vaddr - dmap_base. Now in order to access a
physical page (from our PMM for example) we just add dmap_base to it, and we can read and write to it as
normal.
The direct map does require a one-time setup early in your kernel, as you do need to map all usable physical
memory starting at dmap_base. This is no more work than creating an identity map though.
What address should you use for the base address of the direct map? Well you can put it at the lowest address
in the higher half, which depends on how many levels of page tables you have. For 4 level paging this will
0xffff'8000'0000'0000.
While recursive paging only requires using a single page table entry at the highest level, a direct map consumes
a decent chunk of address space. A direct map is also more flexible as it allows the kernel to access arbitrary
parts of physical memory as needed. Direct mapping is only really possible in 64-bit kernels due to the large
address space made available, 32-bit kernels should opt to use recursive mapping to reduce the amount of
address space used.
The real potential of this technique will unveil when we have multiple address spaces to handle, when the
kernel may need to update data in different address spaces (especially the paging data structures), in this
case using the direct map it can access any data in any address space, by only knowing its physical address.
It will also help when we will start to work on device drivers (out of the scope of this book) where the kernel
may need to access the DMA buffers, that are stored by their physical addresses.
22.6.3 Troubleshooting
There are few things to take in account when trying to access paging structures using the recursion technique
for x86_64 architecture:
• When specifying entries using constant numbers (not stored in variables) during conversion, always use
the long version appending the “l” letter (i.e. 510th entry became: 510l). Especially when dealing with
macros, because otherwise they could be converted to the wrong type, causing wrong result. Usually
gcc show a warning message while compiling if this happens:
warning: result of ‘510 << 30’ requires 40 bits to represent, but ‘int’ only has 32 bits
• Always remember to properly sign extend any addresses if we’re creating them from nothing. We won’t
need to sign extend on every operation, as things are usually relative to a pointer we’ve already set up.
The CPU will throw a page fault if it’s a good address but something is wrong in the page tables, and a
general protection fault if the virtual address is non-canonical (it’s a bad address).
133
CHAPTER 22. PAGING 134
134
Chapter 23
23.1 An Overview
At first a virtual memory manager might not seem like necessary when we have paging, but the VMM serves
as an abstraction on top of paging (or whatever memory management hardware our platform has), as well as
abstracting away other things like memory mapping files or even devices.
As mentioned before, a simple kernel only requires a simple VMM which may end up being a glorified
page-table manager. However, as our kernel grows more complex, so will the VMM.
135
CHAPTER 23. VIRTUAL MEMORY MANAGER 136
We can also add more advanced features later on, like demand paging. Typically, when a program (including
the kernel) asks the VMM for memory, and the VMM can successfully allocate it, physical memory is mapped
there right away. Immediately backing like this has advantages in that it’s very simple to implement, and
can be very fast. The major downside is that we trust the program to only allocate what it needs, and if it
allocates more (which is very common) that extra physical memory is wasted. In contrast, demand paging
does not back memory right away, instead relying on the program to cause a page fault when it accesses the
virtual memory it just allocated. At this point the VMM now backs that virtual memory with some physical
memory, usually a few pages at a time (to save overhead on page-faults). The benefits of demand-paging are
that it can reduce physical memory usage, but it can slow down programs if not implemented carefully. It
also requires a more complex VMM, and the ability to handle page faults properly.
On the topic of advanced VMM features, it can also do other things like caching files in memory, and then
mapping those files into the virtual address space somewhere (this is what the mmap system call does).
A lot of these features are not needed in the beginning, but hopefully the uses of a VMM are clear. To answer
the original question of what a VMM does: it’s a virtual address space manager and allocator.
23.2 Concepts
As it might be expected, there are many VMM designs out there. We’re going to look at a simple one that
should provide all the functionality needed for now. First we’ll need to introduce a new concept: a virtual
memory object, sometimes called a virtual memory range. This is just a struct that represents part of the
virtual address space, so it will need a base address and length, both of these are measured in bytes and will
be page-aligned. This requirement to be page-aligned comes from the mechanism used to manage virtual
memory: paging. On x86 the smallest page we can manage is 4K, meaning that all of our VM objects must
be aligned to this.
In addition, we might want to store some flags in the vm object, they are like the flags used in the page tables,
we could technically just store them there, but having them as part of the object makes looking them up
faster, since we don’t need to manually traverse the paging structure. It also allows us to store flags that are
not relevant to paging.
typedef struct {
uintptr_t base;
size_t length;
size_t flags;
vm_object* next;
} vm_object;
#define VM_FLAG_NONE 0
#define VM_FLAG_WRITE (1 << 0)
#define VM_FLAG_EXEC (1 << 1)
#define VM_FLAG_USER (1 << 2)
The flags field is actually a bitfield, and we’ve defined some macros to use with it.
These don’t correspond to the bits in the page table, but having them separate like this means they are
platform-agnostic. We can port our kernel to any cpu architecture that supports some kind of MMU and
most of the code won’t need to change, we’ll just need a short function that converts our vm flags into page
table flags. This is especially convenient for oddities like x86 and its nx-bit, where all memory is executable
by default, and it must be specified if the memory shouldn’t be to be executable.
Having it like this allows that to be abstracted away from the rest of our kernel. For x86_64 our translation
function would look like the following:
136
137 23.3. ALLOCATING OBJECTS
137
CHAPTER 23. VIRTUAL MEMORY MANAGER 138
We’re going to create a new function for this. Our example function is going to have the following prototype:
void* vmm_alloc(size_t length, size_t flags, void* arg);
The length field is how many bytes we want. Internally we will round this up to the nearest page size, since
everything must be page-aligned. The flags field is the same bitfield we store in a VM object, it contains a
description of the type of memory we want to allocate.
The final argument is unused for the moment, but will be used to pass data for more exotic allocations. We’ll
look at an example of this later on.
The function will return a virtual address, it doesn’t have necessarily to be already mapped and present, it
just needs to be an available address. Again the question is: where is that address? The answer again is that
it depends on the design decisions. So we need to decide where we want the virtual memory range to be
returned is, and use it as starting address. It can be the same space used for the vmm data structures, or
another area, that is up to us, of course this decision will have an impact on the design of the algorithm.
For the example code we’re going to assume we have a function to modify page tables that looks like the
following:
void map_memory(void* root_table, void* phys, void* virt, size_t flags);
And that there is a variable to keep track of the head of the linked list of objects:
vm_object* vm_objs = NULL;
Now onto our alloc function. The first thing it will need to do is align the length up to the nearest page. This
should look familiar.
length = ((length + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
The next step is to find a space between two VM objects big enough to hold length bytes. We’ll also want to
handle the edge cases of allocating before the first object, after the last object, or if there are no VM objects
in the list at all (not covered in the example below, they are left as exercise).
vm_object* current = vm_objs;
vm_object* prev = NULL;
uintptr_t found = 0;
prev = current;
current = current->next;
}
This is where the bulk of our time allocating virtual address space will be spent, so it could probably be
wise in giving some thought designing this function for the VMM. We could keep allocating after the last
item until address space becomes limited, and only then try allocating between objects, or perhaps another
allocation strategy.
The example code above focuses on being simple and will try to allocate at the lowest address it can first.
Now that a place for the new VM object has been found, the new object should be stored in the list.
138
139 23.3. ALLOCATING OBJECTS
if (prev == NULL)
vm_objs = latest;
else
prev->next = latest;
latest->next = current;
What happens next depends on the design of the VMM. We’re going to use immediate backing to keep things
simple, meaning we will immediately map some physical memory to the virtual memory we’ve allocated.
//immediate backing: map physical pages right away.
void* pages = pmm_alloc(length / PAGE_SIZE);
map_memory(vmm_pt_root, pages, (void*)obj->base, convert_x86_64_vm_flags(flags));
return obj->base;
}
We’re not handling errors here to keep the focus on the core code, but they should be handled in a real
implementation. There is also the caveat of using malloc() in the VMM. It may run before the heap is
initialized, in which case another way to allocate memory for the VM objects is needed. Alternatively if the
heap exists outside of the VMM, and is already set up at this point this is fine.
139
CHAPTER 23. VIRTUAL MEMORY MANAGER 140
At this point our VMM can allocate any object types we’ll need for now, and hopefully we can start to see
the purpose of the VMM.
As mentioned previously a more advanced design could allow for memory mapping files: by adding another
flag, and passing the file name (as a char*) in the extra argument, or perhaps a file descriptor (like mmap
does).
23.5 Workflow
Now that we have a virtual memory manager, let’s take a look at how we might use it:
140
141 23.6. NEXT STEPS
Since we know the physical address of the MMIO, we want to map this into virtual memory to access it. We
could do this by directly modifying the page tables, but we’re going to use the VMM.
const size_t flags = VM_FLAG_WRITE | VM_FLAG_MMIO;
void* lapic_regs = vmm_alloc(0x1000, flags, (void*)0xFEE0'0000);
//now we can access the lapic registers at this virtual address
141
CHAPTER 23. VIRTUAL MEMORY MANAGER 142
142
Chapter 24
Heap Allocation
24.1 Introduction
Welcome to the last layer of memory allocation, the heap, this is where usually the various alloc functions
are implemented. This layer is usually built on top of the other layers of memory management (PMM and
VMM), but a heap can be built on top of anything, even another heap! Since different implementations have
different characteristics, they may be favoured for certain things. We will describe a way of building a heap
allocator that is easy to understand, piece by piece. The final form will be a linked list.
We’ll focus on three things: allocating memory (alloc()), freeing memory (free()) and the data structure
needed for those to work.
143
CHAPTER 24. HEAP ALLOCATION 144
• Once the VMM has the physical memory, that memory is mapped into the program’s virtual address
space, at the address the heap requested (usually at the end).
• The heap will now return an address with the requested amount of space to the program.
24.3.1 Overview
A heap allocator exposes two main functions:
• void *alloc(size_t size); To request memory of size bytes.
• void free(void *ptr); To free previously allocated memory.
In user space these are the well known malloc()/free() functions. However, the kernel will also need its
own heap (we don’t want to put data where user programs can access it!). The kernel heap usually exposes
functions called kmalloc()/kfree(). Functionally these heaps can be the same.
So let’s get started with describing the allocation algorithm.
cur is the variable keeping track of the next address that can be returned and is initialized to 0, in this
example. Now when the alloc(10) is called, the program is asking for a memory location of 10 bytes. Since
cur = 0, the address to return is 0, and the next available address will become: cur + 10.
144
145 24.3. ALLOCATING AND FREEING
X is just used as marker to convey that memory has been allocated already. Now when calling alloc(3), the
allocator will return the address currently pointed by cur = 10 and then move cur 3 bytes forward.
Now the third alloc() call will work similarly to the others, and we can imagine the results. ‘
What we have so far is already an allocation algorithm, that’s easy to implement and very fast! Its
implementation is very simple:
uint8_t *cur_heap_position = 0; //Just an example, in the real world you would use
//a virtual address allocated from the VMM.
void *first_alloc(size_t size) {
uint8_t *addr_to_return = cur_heap_position;
cur_heap_position = cur_heap_position + size;
return (void*) addr_to_return;
}
Congratulations! We have written our first allocator! It is called the bump allocator, because each allocation
just bumps the next address pointer forward.
But what about free()? That’s even easier, let’s have a look at it:
void first_free(void *ptr) {
return;
}
Yeah. . . that’s right, it’s not an error. A bump allocator does not have free().
Why? Because we are not keeping track of the allocated memory, so we can’t just update the
cur_heap_position variable with the address of ptr, since we don’t know who is using the memory after ptr.
So we are forced just to do nothing.
Even if probably useless let’s see what are the pros and cons of this approach:
Pros:
• Is very time-efficient allocating memory is O(1), as well as “freeing” it.
• It is also memory efficient, in fact there is no overhead at all, we just need a variable to keep track of
the next free address.
• It is very easy to implement, and probably it could be a good placeholder when we haven’t developed a
full memory manager yet, but we need some malloc like functions.
• Actually there is no fragmentation since there is no freeing!
Of course the cons are probably pretty clear and make this algorithm pretty useless in most cases:
• We don’t free memory.
• There is no way to traverse the heap, because we don’t keep track of the allocations.
• We will eventually run out of memory (OOM - out of memory).
145
CHAPTER 24. HEAP ALLOCATION 146
0000 0001 0002 0003 ... 0010 0011 0013 ... 00100
2 X X 7 ... X cur ...
Authors note: just a reminder that the pointer is a uint8_t pointer, so when we are storing the size, the
memory cell pointed by cur_heap_position will be of type uint8_t, that means that in this example and the
followings, the size stored can be maximum 255. In a real allocator we want to support bigger allocations, so
using at least a uint32_t or even size_t is recommended.
In this example, the number indicates the size of the allocated block. There have already been 2 memory
allocations, with the first of 2 bytes and the second of 7 bytes. Now if we want to iterate from the first to the
last item allocated the code will look like:
uint8_t *cur_pointer = start_pointer;
while(cur_pointer < cur_heap_pointer) {
printf("Allocated address: size: %d - 0x%x\n", *cur_pointer, cur_pointer+1);
cur_pointer = cur_pointer + (*cur_pointer) + 1;
}
But are we able to reclaim unused memory with this approach? The answer is no. You may think so, but
even if we know the size of the area to reclaim, and we can reach it everytime from the start of the heap,
there is no mechanism to mark this area as available or not. If we set the size field to 0, we break the heap
(all areas after the one we are trying to free will become unreachable).
146
147 24.3. ALLOCATING AND FREEING
0000 0001 0002 0003 0004 ... 0011 0011 0013 ... 00100
2 U X 7 U ... X cur ...
Where U is just a label for a boolean-like variable (U = used = false, F = true = free).
At this point we the first change we can do to our allocation function is add the new status variable just after
the size:
#define USED 1
#define FREE 0
uint8_t *heap_start = 0;
uint8_t *cur_heap_position = heap_start; //This is just pseudocode in real word this will
,→ be a memory location
147
CHAPTER 24. HEAP ALLOCATION 148
uint8_t *heap_start = 0;
uint8_t *cur_heap_position = heap_start; //This is just pseudocode in real word this will
,→ be a memory location
*cur_heap_position=size;
cur_heap_position = cur_heap_position + 1;
*cur_heap_position = USED;
148
149 24.3. ALLOCATING AND FREEING
cur_heap_position = cur_heap_position + 1;
uint8_t *addr_to_return = cur_heap_position;
cur_heap_position+=size;
return (void*) addr_to_return;
}
If we are returning a previously allocated address, we don’t need to move cur_heap_position, since we are
reusing an area of memory that is before the end of the heap.
Now we have a decent and working function that can free previously allocated memory, and is able to reuse it.
It is still not perfect and there are several major problems:
• There is a lot of potential waste of space, for example if we are allocating 10 bytes, and the heap has
two holes big enough the first is 40 bytes, the second 14, the algorithm will pick the first one free so
the bigger one with a waste of 26 bytes. There can be different solution to this issue, but is out of the
purpose of this tutorial (and eventually left as an exercise)
• It can suffer from fragmentation. Basically there can be a lot of small freed areas that the allocator
will not be able to use because of their size. A partial solution to this problem is described in the next
paragraph.
Another thing worth doing is to improve readability of the code by replacing the direct pointer access with a
more elegant data structure. This lets us add more fields (as we will in the next paragraph) as needed.
So far our allocator needs to keep track of just the size of the block returned and its status The data structure
for this could look like the following:
struct {
size_t size;
uint8_t status;
} Heap_Node;
That’s it! That’s what we need to clean up the code and replace the pointers in the latest with the new struct
reference. Since it is just matter of replacing few variables, implementing this part is left to the reader.
00 01 02 .. 07 08 09 10 .. 15 16 17 .. 23 24 25
6 F X .. X 6 F X .. X 6 F .. X
149
CHAPTER 24. HEAP ALLOCATION 150
Now, all of the memory in the heap is available to allocate (except for the overhead used to store the status of
each chunk), and everything looks perfectly fine. But now the code keeps executing, and it will arrive at the
following instruction:
alloc(7);
Pretty small allocation, and we have plenty of space. . . no wait. The heap is mostly empty, but we can’t
allocate just 7 bytes because all the free blocks are too small. That is fragmentation in a nutshell.
How do we solve this issue? The idea is pretty straightforward, every time a memory location is being freed,
we do the following:
• First check if it is adjacent to other free locations (both directions: previous and next)
– If ptr_to_free + ptr_to_free_size == next_node then merge the two nodes and create a
single node of ptr_to_free_size + next_node_size (notice we don’t need to add the size of
Heap_node because ptr should be the address immediately after the struct).
– If prev_node_address + prev_node_size + sizeof(Heap_Node) == ptr_to_free then merge
the two nodes and create a single node of prev_node_size + ptr_to_free_size
• If not just mark this location as free.
There are different ways to implement this:
• Adding a next and prev pointer to the node structure. This is the way we’ll use in the rest of
this chapter. This makes checking the next and previous nodes for merge-ability very easy. It does
dramatically increase the memory overhead. Checking if a node can be merged can be done via
(cur_node->prev).status = FREE and (next_node->next).status = FREE.
• Otherwise, without adding the next and prev pointer to the node, we can scan the heap from the start
until the node before ptr_to_free, and if is free we can merge. For the next node instead things are
easier: we just need to check if the node starting at ptr_to_free + ptr_size if it is free is possible to
merge. By comparison this increases the runtime overhead of free().
Both solutions have their own pros and cons, like previously mentioned we’ll go with the first one for these
examples. Adding the prev and next pointers to the heap node struct leaves us with:
typedef struct {
size_t size;
uint8_t status;
Heap_Node *prev;
Heap_Node *next;
} Heap_Node;
So now our heap node will look like the following in memory:
00 01 02 10 18
6 F/U PREV NEXT X
As mentioned earlier using the double linked list the check for merge-ability is more straightforward. For
example to check if we can merge with the left node we just need to check the status of the node pointed by
the prev field, if it is freer than they can be merged. To merge with the previous node would apply the logic
below to node->prev:
• Update the size its, adding to it the size of cur_node
• Update the next pointer to point to cur_node->next
Referring to the next node:
• Update its prev pointer to point to the previous node above (cur_node->prev)
150
151 24.3. ALLOCATING AND FREEING
Of course merging with the right node is the opposite (update the size and the prev pointer of cur_node->next
and update the next pointer of cur_node->next).
Important note: We always want to merge in the order of current + next and then prev + current as
if the prev node absorbs current, what happens to the memory owned by the next node when merged with it?
Nothing, it’s simply lost. It can be avoided with clever and careful logic, but the simpler solution is to simply
merge in the right order.
What we’re describing here is the left node being “swallowed” by the right one, and growing in size. The
memory that the left node owns and is responsible for is now part of the right oneTo make it easier to
understand, consider the portion of a hypothetical heap in the picture below:
Basically the heap starts from address 0, the first node is marked as free and the next two nodes are both
used. Now imagine that free() is called on the second address (for this example we consider size of the heap
node structure to be just of 2 bytes):
This means that the allocator (before marking this location as free and returning) will check if it is possible
to merge first to the left (YES) and then to the right (NO since the next node is still in use) and then will
proceed with a merge only on the left side. The final result will be:
151
CHAPTER 24. HEAP ALLOCATION 152
The fields in bold are the fields that are changed. The exact implementation of this code is left to the reader.
152
153 24.3. ALLOCATING AND FREEING
After that the allocator can compute the address to return using (uintptr_t)cur_node + sizeof(Heap_node),
since we want to return the memory after the node, not the node itself (otherwise the program would put
data there and overwrite what we’ve stored there!).
Before wrapping up there’s a few things worth pointing out about implementing splitting:
• Remember that every node has some overhead, so when splitting we shouldn’t have nodes smaller (or
equal to) than sizeof(Heap_Node), because otherwise they will never be allocated.
• It’s a good idea to have a minimum size for the memory a chunk can contain, to avoid having a large
number of nodes and for easy alignment later on. For example if the minimum_allocatable_size is
0x20 bytes, and we want to allocate 5 bytes, we will still receive a memory block of 0x20 bytes. The
program may not know it was returned 0x20 bytes, but that is okay. What exactly value should be
used for it is implementation specific, values of 0x10 and 0x20 are popular.
• Always remember that there is the memory footprint of sizeof(Heap_Node) bytes while computing
sizes that involve multiple nodes. If we decide to include the overhead size in the node’s size, remember
to also subtract it when checking for suitable nodes.
And that’s it!
Heap_Node *heap_start;
Heap_Node *heap_end;
void initialize_heap() {
heap_start = INITIAL_HEAP_ADDRESS //Remember is a virtual address;
heap_start->next = NULL;
heap_start->next = NULL; // Just make sure that prev and next are not going anywhere
heap_start->status = free;
heap_start->size = INITIAL_HEAP_SIZE // 8096 = 8k;
heap_end = heap_start
}
Now the question is, how do we choose the starting address? This really is arbitrary. We can pick any address
that we like, but there are a few constraints that we should follow:
• Some memory is used by the kernel, we don’t want to overwrite anything with our heap, so let’s keep
sure that the area we are going is free.
• Usually when paging is enabled the kernel is moved to one half of the address space (the halves are usually referred to as the HIGHER_HALF and the LOWER_HALF), so when deciding the initial address we should place it in the correct half: if the kernel is placed in the higher half and we are implementing the kernel heap, the heap should go in the higher half too, while a user space heap will go in the lower half.
For the kernel heap, a good place for it to start is immediately following the kernel binary in memory. Say the kernel is loaded at 0xFFFFFFFF80000000, as is common for higher half kernels, and is 0x4321 bytes long. We round up to the nearest page and then add another page (0x4321 gets rounded to 0x5000, add 0x1000 and now we're at 0x6000). Therefore, our kernel heap would start at 0xFFFFFFFF80006000.
The reason for the empty page is that it can be left unmapped, and then any buggy code that attempts to
access memory before the heap will likely cause a page fault, rather than returning bits of the kernel.
And that’s it, that is how the heap is initialized with a single node. The first allocation will trigger a split
from that node. . . and so on. . .
Part V
Chapter 25
Scheduling and Tasks
So far our kernel has been running a single stream of code, initializing various parts of the cpu and system hardware in sequence and handling interrupts from a few sources.
While it is possible to go further, we’ll begin to run into a handful of problems. Some devices take time to
perform actions, and our code may need to wait. If we have a lot of devices, this can lead to things becoming
very slow. What if we want to start running multiple threads, or even multiple programs with multiple
threads? This is a common scenario, and it's what the scheduler is responsible for.
The scheduler is similar to a hardware multiplexer (mux) in that it takes multiple inputs (programs or threads)
and allows them to share a single output (the cpu executing the code).
The scheduler does this by interrupting the current stream of code, saving its state, selecting the next stream,
loading the new state and then returning. If done at a fast enough rate, all programs will get to spend a
little time running, and to the user it will appear as if all programs are running at the same time. This whole
operation is called a context switch.
For our examples the scheduler is going to do its selection inside of an interrupt handler, as that’s the simplest
way to get started. As always, there are other designs out there.
This part will cover the following areas:
• The Scheduler: This is the core of the scheduling subsystem. It’s responsible for selecting the next
process to run, as well as some general housekeeping. There are many implementations, we’ve chosen
one that is simple and easy to expand upon, a first come first served approach, called Round Robin.
• Processes and Threads: These are the two basic units our scheduler deals with. A stream of code is
represented by a thread, which contains everything that is needed to save and restore its context: a
stack, the saved registers state and the iret frame used to resume its execution. A process represents
a whole program, and can contain a number of threads (or just one), as well as a VMM and a list of
resource handles (file descriptors and friends). Both processes and threads can also have names and
unique identifiers.
• Locks: Once the scheduler starts running different tasks, there will be a range of new problems that we will need to take care of, for example the same resource being accessed by multiple processes/threads. That chapter describes how to mitigate these problems.
Chapter 26
The Scheduler
26.2 Overview
Before we start describing the workflow in detail, let’s answer a few questions:
• When does the scheduler run? The simplest answer is during a timer interrupt. Having said that, there are many other times you might want to trigger the scheduler, such as when waiting on a slow IO operation to complete (the network, or an old hard drive). Some programs may have temporarily run out of work to do and want to ask the scheduler to end their current time slice early. This is called yielding.
• How long does a thread run for before being replaced by the next one? There's no simple answer to this, as it can depend on a number of factors, even down to personal preference. There is a minimum time
that a thread can run for, and that’s the time between one timer interrupt and the next. This portion
of time is called a quantum, because it represents the fundamental unit of time we will be dealing with.
The act of a running thread being interrupted and replaced is called pre-emption.
The main part of our scheduler is going to be thread selection. Let's break down how we're going to implement
it:
• When called, the first thing the scheduler needs to do is check whether the current thread should be
pre-empted or not. Some critical sections of kernel code may disable pre-emption for various reasons,
or a thread may simply be running for more than one quantum. The scheduler can choose to simply
return here if it decides it’s not time to reschedule.
• Next it must save the current thread’s context so that we can resume it later.
• Then we select the next thread to run. For a round robin scheduler we will search the list of threads
that are available to run, starting with the current thread. We stop searching when we find the first
thread that can run.
• Optionally, while iterating through the list of threads we may want to do some house-keeping. This is a good time to do things like check wakeup-timers for sleeping threads, or remove dead threads from the list. If this concept is unfamiliar, don't worry: we'll discuss it more later.
• Now we load the context for the selected thread and mark it as the current thread.
The basic scheduler we are going to implement will have the following characteristics:
1. It will execute on a first-come first-served basis.
2. The threads will be kept in a linked list. This was done to keep the implementation simple, and keep
the focus on the scheduling code.
3. Each thread will only run for a single quantum (i.e. each timer interrupt will trigger the thread to
reschedule).
4. While we have explained the difference between a thread and process, for now we’re going to combine
them both into the same structure for simplicity. We’ll be referring to this structure as just a process
from now on.
5. For context switching we are going to use the interrupt stub we have created in the Interrupt Handling
chapter, of course this is not the only method available, but we found it useful for learning purposes.
typedef enum {
    READY,
    RUNNING, // exact state names may vary; READY and DEAD are the ones used below
    DEAD
} status_t;
The scheduler will also need to keep track of which process is currently being executed. How exactly you do
this depends on the data structure used to store your processes, in our case using a linked list, we just need a
pointer to it:
process_t *current_process;
All that remains is to initialize the pointers to NULL, since we don’t have any process running yet and the
linked list is empty.
void schedule() {
current_process = current_process->next;
}
Of course it's not going to work like that; if executed as-is the kernel will most likely end up running garbage. Don't worry though, it will work later on: the code snippet above is just the foundation of our scheduler.
There are a few problems with this implementation. The first is that it doesn't check if it has reached the end of the list; to fix this we just need to add an if statement:
void schedule() {
if (current_process->next != NULL) {
current_process = current_process->next;
} else {
current_process = processes_list;
}
}
The else statement handles the case where we reached the end, when we want to move back to the first item. This can vary depending on the data structure used. The second problem is that this function doesn't check whether current_process is NULL; it will become clear shortly why this shouldn't happen.
The last problem is: what if there are no processes to be run? In our case the selection function would probably run into garbage, unless we explicitly check that the current_process and/or the list are empty. But there is a more useful and elegant solution used by modern operating systems: having a special process that the kernel runs when there are no others. This is called the idle process, and will be looked at later.
cpu_status_t* schedule(cpu_status_t* context) {
    // Save the interrupted code's context, then select the next process
    // as before, returning its saved context so it can be loaded.
    current_process->context = context;
    if (current_process->next != NULL) {
        current_process = current_process->next;
    } else {
        current_process = processes_list;
    }
    return current_process->context;
}
26.2.5.1 In-Review
1. When a timer interrupt is fired, the cpu pushes the iret frame on to the stack. We then push a copy
of what all the general purpose registers looked like before the interrupt. This is the context of the
interrupted code.
2. We place the address of the stack pointer into the rdi register, as per the calling convention, so it
appears as the first argument for our schedule() function.
3. After the selection function has returned, we use the value in rax for the new stack value. This is where
a return value is placed according to the calling convention.
4. The context saved on the new stack is loaded, and the new iret frame is used to return to the stack and
code of the new process.
This whole process is referred to as a context switch, and perhaps now it should be clearer why it can be a
slow operation.
while (true) {
    process_t *prev_process = current_process;
    if (current_process->next != NULL) {
        current_process = current_process->next;
    } else {
        current_process = processes_list;
    }
    // ... housekeeping goes here: a process found in the DEAD state can
    // be unlinked (using prev_process) and freed, and the loop exits
    // once a process that can run has been selected ...
}
We'll look at the DEAD state more in the next chapter, but for now we can set a process's state to DEAD to signal its termination. When a thread is in the DEAD state, it will be removed from the queue the next time the scheduler encounters it.
26.3 Wrapping Up
This chapter has covered everything needed to have a basic scheduler up and running. In the next chapter
we’ll look at creating and destroying processes. As mentioned this scheduler was written to be simple, not
feature-rich. There are a number of ways you could improve upon it in your own design:
• Use a more optimized algorithm.
• Add more states for a process to be in. We’re going to add one more in the next chapter.
• Implement priority queues, where the scheduler runs threads from a higher priority queue first if any are available, otherwise it selects background processes.
• Add support for multiple processor cores. This can be very tricky, so some thought needs to go into how
you design for this.
Chapter 27
Processes and Threads
27.2 Processes
We introduced a very basic process control block in the previous chapter:
typedef struct process_t {
    status_t status;
    cpu_status_t* context;
    struct process_t *next;
} process_t;
While this is functional, there are few problems:
• They can't be easily identified: how do we tell one process apart from another?
• All processes currently share the same address space, and as a byproduct, the same virtual memory allocator.
• It's not possible to keep track of any resources they might be using, like file handles or network sockets.
• They can't be prioritized, since we don't know which ones are more important.
We’re not going to look at how to solve all of these, but we’ll cover the important ones.
process->context.rdi = (uint64_t)arg;
process->context.rbp = 0;
add_process(process);
return process;
}
The above code omits any error handling, but this is left as an exercise to the reader. We may also want to disable interrupts while creating a new process, so that we aren't pre-empted and the half-initialized process doesn't start running.
Most of what happens in the above function should be familiar, but let’s look at iret_flags for a moment.
The value 0x202 will clear all flags except for bits 1 and 9. Bit 1 is a legacy feature and the manual recommends that it's set, and bit 9 is the interrupts flag. If the interrupt flag is not set when starting a new process, we won't be able to pre-empt it!
We also set rbp to 0. This is not strictly required, but it can make debugging easier. If we choose to load and
run elf files later on this is the expected set up. Zero is a special value that indicates we have reached the
top-most stack frame.
The alloc_stack() function is left as an exercise to the reader, but it should allocate some memory, and
return a pointer to the top of the allocated region. 16KiB (4 pages) is a good starting place, although we can
always go bigger. Modern systems will allocate around 1MiB per stack.
to it the physical address of the new table contained in root_page_table, with the PRESENT and WRITE
flags set (don’t forget that the physical address has to be page aligned).
Copying the higher half page tables like this can introduce a subtle issue: If the kernel modifies a pml4 entry
in one process the changes won’t be visible in any of the other processes. Let’s say the kernel heap expands
across a 512 GiB boundary, this would modify the next pml4 (since each pml4 entry is responsible for 512
GiB of address space). The current process would be able to see the new part of the heap, but upon switching
processes the kernel could fault when trying to access this memory.
While we're not going to implement a solution to this, it's worth being aware of. One possible solution is to keep track of the current 'generation' of the kernel pml4 entries. Every time a kernel pml4 entry is modified the generation number is increased, and whenever a new process is loaded its kernel pml4 generation is checked against the current generation. If the current generation is higher, we copy its kernel tables over, and now the page tables are synchronized again.
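A sketch of this idea might look like the following. It's purely illustrative: kernel_pml4_gen, the per-process kernel_gen field and copy_kernel_pml4_entries are all hypothetical names, not part of the design above:

uint64_t kernel_pml4_gen = 0; // bumped whenever a kernel pml4 entry changes

void sync_kernel_mappings(process_t* proc) {
    if (proc->kernel_gen < kernel_pml4_gen) {
        // copy the higher-half pml4 entries into this process's tables
        copy_kernel_pml4_entries(proc->root_page_table);
        proc->kernel_gen = kernel_pml4_gen;
    }
}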
Don't forget to load the new process's page tables before leaving schedule().
27.2.4 Resources
Resources are typically implemented as an opaque handle: a resource is given an id by the subsystem it
interacts with, and that id is used to represent the resource outside of the subsystem. Other kernel subsystems
or programs can use this id to perform operations with the resource. These resources are usually tracked per
process.
As an example, let's look at opening a file. We won't go over the code for this, as it's beyond the scope of this chapter, but it serves as a familiar example.
When a program goes to open a file, it asks the kernel’s VFS (virtual file system) to locate a file by name.
Assuming the file exists and can be accessed, the VFS loads the file into memory and keeps track of the buffer
holding the loaded file. Let’s say this is the 23rd file the VFS has opened, it might be assigned the id 23. We
could simply use this id as the resource id, however that is a system-wide id, and not specific to the current
process.
Commonly each process holds a table that maps process-specific resource ids to system resource ids. A simple
example would be an array, which might look like the following:
#define MAX_RESOURCE_IDS 255
typedef struct {
//other fields
size_t resources[MAX_RESOURCE_IDS];
} process_t;
We've used size_t as the type to hold our resource ids here. Opening a file, like in the previous example, might look like the following. Note that we don't check for any errors, like the file not existing or having invalid permissions.
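A minimal sketch, assuming a vfs_open() function that returns the system-wide id, a current_process pointer, and treating id 0 as a free slot:

size_t open_file(const char* path) {
    size_t system_id = vfs_open(path); // id assigned by the VFS
    for (size_t i = 0; i < MAX_RESOURCE_IDS; i++) {
        if (current_process->resources[i] == 0) {
            current_process->resources[i] = system_id;
            return i; // the process-local resource id
        }
    }
    return (size_t)-1; // no free slots, error handling omitted
}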
27.2.5 Priorities
There are many ways to implement priorities; the easiest way to get started is with multiple process queues: one per priority level. Then the scheduler would always check the highest priority queue first, and if there are no threads in the READY state, check the next queue and so on.
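As a sketch (PRIORITY_LEVELS, the queues array and idle_process are assumptions made for illustration):

#define PRIORITY_LEVELS 4
process_t* queues[PRIORITY_LEVELS]; // queues[0] is the highest priority

process_t* select_next(void) {
    for (int i = 0; i < PRIORITY_LEVELS; i++) {
        for (process_t* p = queues[i]; p != NULL; p = p->next) {
            if (p->status == READY)
                return p; // first ready process in the highest non-empty queue
        }
    }
    return idle_process; // nothing to run: fall back to the idle process
}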
return thread;
}
We can see that this function looks almost identical to the create_process outlined earlier in this chapter.
That’s because a lot of it is the same! The first part of the function is just inserting the new thread at the
end of the list of threads.
Let’s look at how our create_process function would look now:
process_t* create_process(char* name) {
process_t* process = alloc(sizeof(process_t));
process->pid = next_process_id++;
process->threads = NULL;
process->root_page_table = vmm_create();
strncpy(process->name, name, NAME_MAX_LEN);
add_process(process);
return process;
}
The vmm_create function is just a placeholder, but it should create a new vmm instance for our new process.
The details of this function are described more in the chapter on the virtual memory manager itself. Ultimately
this function should set up some new page tables for the new process, and then map the existing kernel into
the higher half of these new tables. You may wish to do some other things here as well.
The last part is that we'll need to update the scheduler to deal with threads instead of processes. A lot of the things the scheduler was interacting with are now contained per-thread, rather than per-process.
That’s it! Our scheduler now supports multiple threads and processes. As always there are a number of
improvements to be made:
• The create_process function could add a default thread, since a process with no threads is not very
useful. Or it may not, it depends on the design.
• Similarly, add_thread could accept NULL as the process to add to, and in this case create a new process
for the thread instead of returning an error.
Another option, which we've chosen, is to update the thread's status here. Then when the scheduler encounters a thread in the DEAD state, it will free the thread's resources there.
Note that we use an infinite loop at the end of thread_exit since that function cannot return (it would
return to the junk after the thread’s main function). This will busy-wait until the end of the current quantum,
however we could also call the scheduler to reschedule early here.
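A minimal sketch of thread_exit, assuming a current_thread pointer maintained by the scheduler:

void thread_exit(void) {
    current_thread->status = DEAD;
    while (true)
        ; // busy-wait until the scheduler preempts us and, later, frees the thread
}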
The main difference is how the scheduler interacts with the timer. A periodic scheduler tells the timer to trigger at a fixed interval, and runs in response to the timer interrupt. A tickless scheduler instead uses a one-shot timer, setting it to send an interrupt when the next task switch is due.
At first glance this may seem like the same thing, but it eliminates unnecessary timer interrupts when no task switch is occurring. It also removes the idea of a quantum, since we can run a thread for any arbitrary amount of time, rather than a number of timer intervals.
Authors note: Tickless schedulers are usually seen as more accurate and operate with less latency than periodic
ones, but this comes at the cost of added complexity.
Chapter 28
Critical Sections and Locks
28.1 Introduction
Now that we have a scheduler, we can run multiple threads at the same time. This introduces a new problem
though: shared resources and synchronization.
Imagine we have a shared resource that can be accessed at a specific address. This resource could be anything
from MMIO, a buffer or some variable, the important part is that multiple threads can access it at the same
time.
For our example we're going to say this resource is a NS16550 uart at address 0xDEADBEEF. If you're not familiar with this type of uart device, it's the de facto standard for serial devices. The COM ports on x86 use one of these, as do many other platforms.
The key thing to know is that if we write a byte at that address, it will be sent over the serial port to whatever is on the other end. So to send a message, we must write it one character at a time to the address specified (0xDEADBEEF).
void thread_two() {
serial_log("while I am the second string");
}
What would we expect to see on the serial output? We don’t know! It’s essentially non-deterministic, since
we can’t know how these will be scheduled. Each thread may get to write the full string before the other is
scheduled, but more likely they will get in the way of each other.
The image above is an example of threads being scheduled, assuming there are only three of them in the system
(labeled as A, B, C ). Imagine that A is thread_one and B is thread_two, while C does not interact with the
serial. One example of what we could see then is Iwh aI ammi lethe secfionrsd t stristngring. This contains all the right characters but it's completely unreadable. The image below shows one scenario where this could happen:
What we’d expect to see is one of two outcomes: I am the first string while I am the second or
while I am the second I am the first string.
The situation described is an example of a race condition. The order of accesses to the shared resource (the uart) matters here, so we need a way to protect it. This is where a lock comes in. A locked version of serial_log, assuming the acquire and release functions shown below, could look like:

bool serial_lock = false;

void serial_log(const char* msg) {
    acquire(&serial_lock);
    for (size_t i = 0; msg[i] != 0; i++)
        *(volatile char*)0xDEADBEEF = msg[i]; // write each byte to the uart
    release(&serial_lock);
}
It’s a small change, but it’s very important! Now anytime a thread tries to call serial_log, the acquire
function will block (wait) until the lock is free, before trying to print to the serial output.
This ensures that each call to serial_log completes before another is allowed to start, meaning each message
is written properly without being jumbled by another.
It’s worth noting that each instance of a lock is independent, so to protect a single resource, we must use a
single lock.
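A naive implementation of acquire and release might look like this (a minimal sketch, assuming the lock is a simple boolean flag):

void acquire(bool* lock) {
    while (*lock)
        ; // spin until the flag looks free
    *lock = true; // then claim it
}

void release(bool* lock) {
    *lock = false;
}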
What we’ve implemented here is a spinlock. To take the lock we keep checking the lock variable within a
loop (aka ‘spinning’) until it’s available. A spinlock is the simplest kind of lock to implement, and very low
latency compared to some others. It can be very wasteful when it comes to CPU time however, as the CPU is
constantly querying memory only to do nothing but query it again, instead of doing actual work.
Consider the previous example of threads being sequenced, and let’s see what happens now:
Now we can see how locks can be used to keep two threads from interfering with each other.
Unfortunately this implementation has some issues, and can fail to ensure mutual exclusion in several ways:
• We haven’t marked the lock variable as volatile, so the acquire and release operations may or may
not be written to memory. This means other threads might not even see the changes made to the lock
variable.
• If we’re operating in a multiprocessor environment, this is also an issue, because the other processors
won’t see the updated lock state. Even if we do use volatile, two threads on separate processors
could still both take the lock at the same time. This is because taking the lock is generally a read-modify-write operation, which leaves time for one processor to read the old state while another is modifying it.
28.5 Atomic Operations
Rather than writing assembly directly for this, we're going to use some compiler intrinsic functions to generate the assembly for us. These are provided by any GCC-compatible compiler (that includes clang).
We’ll also be using two constants (these are provided by the compiler as well): __ATOMIC_ACQUIRE and
__ATOMIC_RELEASE. These are memory order constraints and are used to describe to the compiler what we
want to accomplish. Let’s look at the difference between these, and a third constraint, sequential consistency
(__ATOMIC_SEQ_CST).
• __ATOMIC_SEQ_CST: An atomic operation with this constraint is a two-way barrier. Memory operations
(reads and writes) that happen before this operation must complete before it, and operations that
happen after it must also complete after it. This is actually what we expect to happen most of the
time, and if not sure which constraint to use, this is an excellent default. However it’s also the most
restrictive, and implies the biggest performance penalty as a result.
• __ATOMIC_ACQUIRE: Less restrictive, it communicates that operations after this point in the code cannot
be reordered to happen before it. It allows the reverse though (writes that happened before this may
complete after it).
• __ATOMIC_RELEASE: This is the reverse of acquire, this constraint says that any memory operations
before this must complete before this point in the code. Operations after this point may be reordered
before this point however.
Using these constraints we can be specific enough to achieve what we want while leaving room for the compiler
and cpu to optimize for us. We won’t use it here, but there is another ordering constraint to be aware of:
__ATOMIC_RELAXED. Relaxed ordering is useful when in case a memory operation is desired to be atomic, but
not interact with the memory operations surrounding it.
Both of the previously mentioned atomic functions take a pointer to either a bool or char that's used as the lock variable, and the memory order. The __atomic_test_and_set function returns the previous state of the lock, so if it returns true the lock was already taken. A return of false indicates we successfully took the lock.
Using our new compiler intrinsics, we can update the acquire and release functions to look like the following:
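A sketch, reusing the bool-flag lock from the earlier examples (__atomic_test_and_set and __atomic_clear are the GCC/clang built-ins mentioned above):

void acquire(bool* lock) {
    while (__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
        ; // spin until the previous holder releases the lock
}

void release(bool* lock) {
    __atomic_clear(lock, __ATOMIC_RELEASE);
}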
Using the compiler built-in functions like this ensures that the compiler will always generate the correct
atomic instructions for us. These are also cross-platform: meaning if the kernel is ported to another cpu
architecture, it can still use these functions.
Part VI
Userspace
Chapter 29
All About Userspace
After this part our kernel will be able to switch between user and supervisor privilege levels, and we’ll have a
basic system call interface.
In the Switching Modes chapter we are going to explore how the x86 architecture handles changing privilege
levels, and how to switch back and forth between the supervisor and user mode.
In the Handling Interrupts chapter we will update our interrupt handling to be able to run in user mode too, and avoid kernel panics.
Then in the System Calls chapter we'll introduce the concept of system calls. These are a controlled way to allow user programs to ask the kernel to perform certain tasks for them.
Finally in Example ABI chapter we will implement an example system call interface for our kernel.
are allowed to do, how they accept arguments, and what data is returned. Every argument that is given to a system call should be scrutinized and validated in every way it can be.
Chapter 30
Switching Modes
In this chapter we are going to study how to get to userspace, and back. Although it is focused on x86_64, a
lot of high level concepts apply to other platforms too.
So if we’re going to ring 0 (supervisor), RPL can be left at 0. If going to ring 3 (user) we’d set it to 3.
This means our selectors for ss and cs end up looking like this:
kernel_cs = 0x08 | 0;
kernel_ss = 0x10 | 0;
user_ss = 0x18 | 3;
user_cs = 0x20 | 3;
The kernel/supervisor selectors don't need to have their RPL set explicitly, since it'll be zero by default. This is why we may not have dealt with this field before. If RPL is not set correctly, the cpu will throw a #GP (general protection) exception.
As for the other two values? We’re going to set rip to the instruction we want to execute after using iret,
and rsp can be set to the stack we want to use. Remember that on x86_64 the stack grows downwards, so if
we allocate memory this should be set to the highest address of that region. It’s a good idea to run user and
supervisor code on separate stacks. This way the supervisor stack can have the U/S bit cleared in the paging
structure, and prevent user mode accessing supervisor data that may be stored on the stack.
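Putting this together, jumping to user mode might look like the following sketch (the selectors match the ones above; enter_user_mode is a name made up for this example):

__attribute__((noreturn))
void enter_user_mode(uint64_t user_rip, uint64_t user_rsp) {
    asm volatile(
        "push %0\n"     // ss
        "push %1\n"     // rsp: top of the user stack
        "push $0x202\n" // rflags: interrupts enabled
        "push %2\n"     // cs
        "push %3\n"     // rip: first user instruction
        "iretq"
        :: "r"((uint64_t)(0x18 | 3)), "r"(user_rsp),
           "r"((uint64_t)(0x20 | 3)), "r"(user_rip));
    __builtin_unreachable();
}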
30.2 Getting Back To Supervisor Mode
kernel to do something for it. This is often done via interrupt, but there are specialized instructions for it too.
We have a dedicated chapter on system calls.
The third way is inside of an interrupt handler. While it is possible to run interrupts in user mode (an advanced topic for sure), most interrupts will result in supervisor code running in the form of the interrupt handler.
Any interrupt will work, for example a page fault or ps/2 keyboard irq, but the most common one is a timer.
Since the timer can be programmed to tick at a fixed interval, we can ensure that supervisor code gets to
run at a fixed interval. That code may return immediately, but it gives the kernel a chance to look at the
program and machine states and see if anything needs to be done. Commonly the handler code for the timer
also runs the scheduler tick, and can trigger a task switch.
Handling interrupts while not in supervisor mode on x86 is a surprisingly big topic, so we’re going to cover it
in a separate chapter. In fact, it’s the next chapter!
Chapter 31
Handling Interrupts
This is not a complete guide on how to handle interrupts. It assumes we already have an IDT set up and working in supervisor mode; if not, refer to the earlier chapter that covers how to set up an IDT and the basics of handling interrupts. This chapter is focused on handling interrupts while user mode programs are executing.
On x86_64 there are two main structures involved in handling interrupts. The first is the IDT, which we
should already be familiar with. The second is the task state segment (TSS). While the TSS is not technically
mandatory for handling interrupts, once we leave ring 0 it’s functionally impossible to handle interrupts
without it.
typedef struct {
    uint32_t reserved0;
    uint64_t rsp0;
    uint64_t rsp1;
    uint64_t rsp2;
    uint64_t reserved1;
    uint64_t ist1;
    uint64_t ist2;
    uint64_t ist3;
    uint64_t ist4;
    uint64_t ist5;
    uint64_t ist6;
    uint64_t ist7;
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t io_bitmap_offset;
} __attribute__((__packed__)) tss_t;
As per the manual, the reserved fields should be left as zero. The rest of the fields can be broken up into
three groups:
• rspX: where X represents a cpu ring (0 = supervisor). When an interrupt occurs, the cpu switches the
code selector to the selector in the IDT entry. Remember the CS register is what determines the current
privilege level. If the new CS is a lower ring (lower value = more privileged), the cpu will switch to the
stack in rspX before pushing the iret frame.
• istX: where X is a non-zero identifier. These are the IST (Interrupt Stack Table) stacks, and are used by the IST field in the IDT descriptors. If an IDT descriptor has a non-zero IST field, the cpu will always load the stack in the corresponding istX field of the TSS. This overrides the loading of a stack from an rspX field. This is useful for some interrupts that can occur at any time, like a machine check or NMI, or if you do sensitive work in a specific interrupt and don't want to leak data afterwards.
• io_bitmap_offset: Works in tandem with the IOPL field in the flags register. If IOPL is greater than or equal to the current privilege level, IO port access is implicitly allowed. Otherwise the cpu consults the IO permission bitmap: a cleared bit allows access to that port, while a set bit denies it (resulting in a #GP). This field in the tss specifies where this bitmap is located in memory, as an offset from the base of the tss. Since ring 0 always satisfies the IOPL check, it can implicitly access all ports and the bitmap is never consulted for it.
With the exception of the IO permissions bitmap, the TSS is all about switching stacks for interrupts. It's worth noting that if an interrupt doesn't use an IST, and occurs while the cpu is in ring 0, no stack switch will occur. Remember that the rspX stacks are only used when the cpu switches from a less privileged mode. Setting the IST field in an IDT entry will always force a stack switch, if that's needed.
Yes, that's right: a TSS descriptor in the GDT is 128 bits. This is because we need to specify the 64-bit address of the TSS data structure.
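Packing such a descriptor might look like the following sketch, assuming the GDT is stored as a flat array of uint64_t (set_tss_descriptor is a name made up for this example):

void set_tss_descriptor(uint64_t* gdt, size_t index, tss_t* tss) {
    uint64_t base = (uint64_t)tss;
    uint64_t limit = sizeof(tss_t) - 1;
    // low 8 bytes: limit, lower 32 bits of the base, type 0x9
    // (available 64-bit TSS) and the present bit
    gdt[index] = (limit & 0xFFFF) | ((base & 0xFFFFFF) << 16) |
        (0x9ULL << 40) | (1ULL << 47) |
        (((limit >> 16) & 0xF) << 48) | (((base >> 24) & 0xFF) << 56);
    // high 8 bytes: upper 32 bits of the base, the rest reserved as zero
    gdt[index + 1] = (base >> 32) & 0xFFFFFFFF;
}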
Now for the third step, we need to load the task register. This is similar to the segment registers, in that
it has visible and invisible parts. It’s loaded in a similar manner, although we use a dedicated instruction
instead of a simple mov.
The ltr instruction (load task register) takes the byte offset into the GDT we want to load from. This is the
offset of the TSS descriptor we created before. For the example below, we’ll assume this descriptor is at offset
0x28.
ltr $0x28
It’s that simple! Now the cpu knows where to find our TSS. It’s worth noting that we only need to reload the
task register if the TSS has moved in memory. Ideally it should never move, and so should only be loaded
once. If the fields of the TSS are ever updated, the CPU will use the new values the next time it needs them,
no need to reload TR.
own entry in the GDT to be loaded, and we can’t know how many cores (and TSSs) we’ll need ahead of time.
There’s a few ways to go about this:
• Each core has a separate GDT, allowing us to use the same selector in each GDT for that core’s TSS.
This option uses the most memory, but is the most straightforward to implement.
• Have a single GDT shared between all cores, but each core gets a separate TSS selector. This would
require some logic to decide which core uses which selector.
• Have a single GDT and a single TSS descriptor within it. This works because the task register caches
the values it loads from the GDT until it is next reloaded. Since the TR is never changed by the cpu, if
we never change it ourselves, we are free to change the TSS descriptor after using it to load the TR.
This would require logic to determine which core can use the TSS descriptor to load its own TSS. Uses
the least memory, but the most code of the three options.
Chapter 32
System Calls
System calls are a way for a user mode program to request something from the kernel, or other supervisor code.
For this chapter we’re going to focus on a user program calling the kernel directly. If the kernel being written
is a micro-kernel, system calls can be more complicated, as they might be redirected to other supervisor (or
even user) programs, but we’re not going to talk about that here.
On x86_64 there are a few ways to perform a system call. The first is to dedicate an interrupt vector to be
used for software interrupts. This is the most common, and straightforward way. The other main way is to
use the dedicated instructions (sysenter and friends), however these are rather niche and have some issues of
their own. This is discussed below.
There are other obscure ways to perform syscalls. For example, executing a bad instruction will cause the cpu
to trigger a #UD exception. This transfers control to supervisor code, and could be used as an entry to a
system call. While not recommended for beginners, there was one hobby OS kernel that used this method.
What about using multiple vectors for different syscalls? This is certainly possible, but it means more points of entry into supervisor code. System calls are also not the only thing that requires interrupt vectors (for example, a single PCI device can request up to 32!), and with enough hardware you may find yourself running out of interrupt vectors to allocate. So using only a single vector is recommended.
//see below
cpu_status_t* syscall_handler(cpu_status_t* regs);
void setup_syscalls()
{
set_idt_entry(0xFE, syscall_handler, 3);
}
Now on the user side, we can use int $0xFE to trigger a software interrupt. If we try to trigger any other
interrupts, we’ll still get a protection fault.
__attribute__((naked))
size_t execute_syscall(size_t syscall_num, size_t arg)
{
asm("int $0xFE"
: "=a"(arg)
: "D"(syscall_num), "S"(arg));
return arg;
}
There’s a few tricks happening with the inline assembly above. First is the naked attribute. This is not
strictly necessary, but since we’re only doing inline assembly in the function it’s a nice optimization hint to
the compiler. It tells the compiler not to generate the prologue/epilogue sequences for this function. This is
stuff like creating the stack frame.
Next we're using some special constraints for the input and output operands. "a" means to use the rax register, while "S" and "D" are the source and destination index registers (rsi and rdi). This means the
compiler will ensure that they are loaded with the values we specify before the assembly body is run. The compiler will then also put the value of "a" (rax) into arg after the assembly body has run. This is where we'll be placing the return value of the system call, hence the return arg line in the code above.
For more details on inline assembly, see the dedicated appendix on it, or check the compiler’s manual.
Now assuming everything is set up correctly, running the above code in user mode should trigger the kernel's system call handler: the syscall_handler function should end up running, and we've just implemented system calls! A skeleton of that handler, assuming the interrupt stub passes the saved registers and that the syscall number and argument arrive in rdi and rsi as set up by execute_syscall, might look like:

cpu_status_t* syscall_handler(cpu_status_t* regs) {
    switch (regs->rdi) {
        // ... dispatch to the requested system call here, placing the
        // return value in regs->rax ...
    }
    return regs;
}
This is certainly faster as the instruction only needs to deal with a handful of registers, however it leaves the
rest of the context switching up to the kernel code.
Upon entering the kernel, you will be running with ring 0 privileges and certain flags will be cleared, and
that’s it. You must perform the stack switch yourself, as well as collecting any information the kernel might
need (like the user RIP, stack, SS/CS).
Authors Note: While these instructions are covered here, syscall can actually result in several quite nasty
security bugs if not used carefully. These issues can be worked around of course, but at that point you’ve
lost the speed benefit offered by using these instead of an interrupt. We consider using these instructions an
advanced topic. If you do find yourself in a position where the speed of the system call entry is a bottleneck,
then these instructions are likely not the solution, and you should look at why you require so many system
calls. - DT
To set all this up, we need to configure a few MSRs:
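A sketch of that setup, assuming rdmsr/wrmsr helper functions and a syscall_entry assembly stub (both names made up for this example):

extern void syscall_entry(void); // the assembly entry stub

void setup_syscall_msrs(void) {
    // STAR (0xC0000081): bits 47:32 = selector base used on syscall,
    // bits 63:48 = selector base used on sysret.
    wrmsr(0xC0000081, ((uint64_t)0x13 << 48) | ((uint64_t)0x08 << 32));
    // LSTAR (0xC0000082): the rip loaded when syscall executes.
    wrmsr(0xC0000082, (uint64_t)syscall_entry);
    // FMASK (0xC0000084): flags to clear on entry, everything except bit 1.
    wrmsr(0xC0000084, 0xFFFFFFFD);
}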
In this configuration, STAR(47:32) is set to 0x8. It corresponds to the kernel code segment. Thus, on a syscall the CS will be 0x8 and SS 0x10. Next, bits (63:48) are set to 0x13. After sysret, the CS will be
0x23 (0x13 + 16) and SS will be 0x1B (0x13 + 8). Note that we have to set the lower two bits that represent the privilege level! And last, FMASK is set so it clears all flags on entry while keeping bit 1 (reserved) set.
As an aside, the sysret instruction determines which mode to return to based on the operand size. By default
all operands are 32-bit, to specify a 64-bit operand (i.e. return to 64-bit long mode) just add the q suffix in
GNU as, or the o64 prefix in NASM.
//GNU as
sysretq
//NASM
o64 sysret
Finally, we need to tell the CPU that we want to use these instructions and have done all of the above setup. Like many extended features on x86, there is a flag to enable them at a global level. For syscall/sysret this is the system call extensions flag in the EFER (0xC0000080) MSR, which is bit 0. After setting this, the CPU is ready to handle these instructions!
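As a one-liner, assuming the same rdmsr/wrmsr helpers as before:

void enable_syscall_extensions(void) {
    wrmsr(0xC0000080, rdmsr(0xC0000080) | 1); // EFER.SCE is bit 0
}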
Chapter 33
Example System Call ABI
“We break things inside the kernel constantly, but there is one rule among the kernel developers: we never,
ever, break userspace.” - Linus Torvalds
While breaking the system call ABI in our kernel won't have the same ramifications as it would in Linux, it's a good idea to set up a stable ABI early on; early on meaning as soon as we begin writing code that will use the ABI. As such, we're going to take a look at an example ABI to show how it could be done. This example is loosely based on the System V calling convention for x86_64.
Name: memcpy
Id: 3
Args: source addr, dest addr, count in bytes
Returns: count copied
Please don’t actually do this, memcpy does not need to be a system call, but it serves for this example, as it’s
a function everyone is familiar with.
We're going to implement a wrapper function for system calls in C, purely for convenience, which might look like so:
__attribute__((naked))
void execute_syscall(uint64_t num, uint64_t a0, uint64_t a1, uint64_t a2, uint64_t a3) {
asm ("int $0x50" ::: "rdi", "rsi", "rdx", "rcx", "r8", "memory");
}
The above function takes advantage of the fact the system V calling convention is the one used by GCC/Clang.
If a different compiler/calling convention is used, then the arguments need to be moved into the registers
manually. This is as straightforward as it sounds, but is left as an exercise for the reader.
This function also uses the naked attribute. If unfamiliar with attributes, they are discussed in the C language
chapter. This particular attribute tells the compiler not to generate the entry and exit sequences for this
function. These are normally very useful, but in our case are unnecessary.
Now, let's combine our wrapper function with our example system call from above. We're going to write a memcpy function that could be called by other code, but uses the system call internally:
void memcpy(void* src, void* dest, size_t count) {
    execute_syscall(3, (uint64_t)src, (uint64_t)dest, (uint64_t)count, 0);
}
Part VII
Chapter 34
Inter-Process Communication
So far we’ve put a lot of effort into making sure each program (represented by a process in our kernel) is
completely isolated from all others. This is great for safety and security, but it presents a big problem: what
if we want two processes to communicate with each other?
The answer to this is some form of inter-process communication (aka IPC). This part will look at some basic implementations of the common types, and will hopefully serve as a good jumping-off point for further implementations. It should be noted that IPC is mainly intended for userspace programs to communicate with each other; if you have multiple kernel threads that want to communicate, there's no need for the overhead of IPC.
Chapter 35
IPC via Shared Memory
Shared memory is the easiest form of IPC to implement. It’s also the fastest form as the data is just written
by one process, and then read by another. No copying involved. Note that speed here does not necessarily
mean low-latency.
The principle behind shared memory is simple: we’re going to map the same physical pages into two different
virtual address spaces. Now this memory is visible to both processes, and they can pass data back and forth.
the other (i.e. contiguous). Attaching a name to a range lets us identify it, and we can even give some of
these names special meanings, like /dev/stdout for example. In reality this is not how stdout is usually
implemented, but it serves to get the point across.
We’re going to use a struct to keep track of all the information we need for shared memory, and store them in
a linked list so we can keep track of multiple shared memory instances.
typedef struct ipc_shared_memory {
    uintptr_t physical_base;
    size_t length;
    size_t ref_count;
    const char* name;
    struct ipc_shared_memory* next;
} ipc_shared_memory;
The ipc_shared_memory struct holds everything we’ll need. The physical_base and length fields describe
the physical memory (read: pages allocated from your physical memory manager) used by this shared memory
instance. It’s important to note that the address is a physical one, since virtual addresses are useless outside
of their virtual address space. Since each process is isolated in its own address space, we cannot store a virtual
address here.
The ref_count field is how many processes are currently using this physical memory. If this ever reaches
zero, it means no-one is using this memory, and we can safely free the physical pages. The name field holds a
string used to identify this shared memory instance, and the next pointer is used for the linked list.
acquire(list_lock);
ipc_shared_memory* tail = list;
while (tail->next != NULL)
    tail = tail->next;
tail->next = shared_mem;
release(list_lock);

return shared_mem->physical_base;
}
This code is the core of implementing shared memory: it allows us to create a new shared memory region and gives us its physical address. The VMM can then map this physical address into the memory space of the process like it would with any other memory.
Notice the use of the lock functions (acquire and release) when we access the linked list. Since this single
list is shared between all processes that use shared memory, we have to protect it so we don’t accidentally
corrupt it. This is discussed further in the chapter on scheduling.
else
phys_base = pmm_alloc(length / PAGE_SIZE);
35.2.4 Cleaning Up
At some point our programs are going to exit, and when they do we’ll want to reclaim any memory they were
using. We mentioned using reference counting to prevent use-after-free bugs with shared memory, so let’s take
a look at that in practice.
Again, this assumes your vmm has some way of knowing which regions of allocated virtual memory it owns,
and which regions are shared memory. For our example we’ve stored the flags field used with vmm_alloc.
void vmm_free(void* addr) {
size_t flags = vmm_get_flags(addr);
uintptr_t phys_addr = get_phys_addr(addr);
found->ref_count--;
if (found->ref_count == 0) {
pmm_free(found->physical_base, found->length / PAGE_SIZE);
free(found->name);
if (prev == NULL)
list = found->next;
else
prev->next = found->next;
free(found);
}
release(list_lock);
}
Again we’ve omitted error handling and checking for NULL to keep the examples concise, but you should
handle these cases in your code. The first part of the function should look similar, it’s the same code used
in access_shared_memory. The interesting part happens when the reference count reaches zero: we free
the physical pages used, and free any memory we previously allocated. We also remove the shared memory
instance from the linked list.
Chapter 36
IPC via Message Passing
Compared to shared memory, message passing is slightly more complex but does offer more features. It’s
comparable to networking where there is a receiver that must be ready for an incoming message, and a sender
who creates the full message ahead of time, ready to send all at once.
Unlike shared memory, which can have many processes all communicating with one initial process (or even
many), message passing is usually one-to-one.
Message passing also has more kernel overhead as the kernel must manually copy each message between
processes, unlike shared memory where the kernel is only involved in creating or destroying shared memory.
However the upside to the kernel being involved is that it can make decisions based on what’s happening.
If a process is waiting to receive a message, the kernel can dequeue that process from the scheduler until a
message is sent, rather than scheduling the process only for it to spin wait.
The key concept to understand for message passing is that we now have two distinct parties: the sender and
the receiver. The receiver must set up an endpoint that messages can be sent to ahead of time, and then wait
for a message to appear there. An endpoint can be thought of as the mailbox out the front of your house. If
you have no mailbox, they don’t know where to deliver the mail.
More specifically, an endpoint is just a buffer with some global identifier that we can look up. In unix systems
these endpoints are typically managed by assigning them file descriptors, but we’re going to use a linked list
of structs to keep things simple. Each struct will represent an endpoint.
You may wish to use the VFS and file descriptors in your own design, or something completely different!
Several of the ideas discussed in the previous chapter on shared memory apply here as well, like access control.
We won’t go over those again, but they’re worth keeping in mind here.
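An endpoint struct matching the fields used in this chapter might look like this minimal sketch (lock_t stands in for whatever lock type the kernel provides):

typedef struct ipc_endpoint {
    const char* name;   // the global identifier used to look it up
    void* msg_buffer;   // NULL when no message is waiting
    size_t msg_length;
    lock_t lock;        // protects msg_buffer and msg_length
    struct ipc_endpoint* next;
} ipc_endpoint;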
• The next time process 1 runs, ipc_receive() will see that the flag has been set, and copy the message
from the kernel buffer into a buffer for the program.
• The ipc_receive() function can also free the kernel buffer, before returning and letting process 1
continue as normal.
What we've described here is a double-copy implementation of message passing: the data is first copied into the kernel, and then out of it. Hence we performed two copy operations.
ep->msg_buffer = NULL;
As you can see creating a new endpoint is pretty simple, and most of the code in the example function is actually for managing the linked list.
Now our endpoint has been added to the list! As always we omitted checking for errors, and we didn't check if an endpoint with this name already exists. In the real world you'll want to handle these things.
release(&endpoints_lock);
if (target == NULL)
return;
You may want to return an error here if the endpoint couldn’t be found, however in our case we’re simply
discarding the message.
Now we’ll need to allocate a buffer to store a copy of the message in, and copy the original message into this
buffer.
void* msg_copy = malloc(length);
memcpy(msg_copy, buffer, length);
Why do we make a copy of the original message? Well if we don’t, the sending process has to keep the original
message around until the receiver has processed it. We don’t have a way for the receiver to communicate that
it’s finished reading the message. By making a copy, we can return from ipc_send() as soon as the message
is sent, regardless of when the message is read. Now the sending process is free to do what it wants with the
memory holding the original message as soon as ipc_send() has completed.
If you’re performing this IPC as part of a system call from userspace, the memory containing the original
message is unlikely to be mapped in the receiver’s address space anyway, so we have to copy it into the kernel’s
address space, which is mapped in both processes.
All that’s left is to tell the receiver it has a message available by placing the buffer address on the endpoint.
Again, notice the use of the lock to prevent race conditions while we mess with the internals of the endpoint.
acquire(&target->lock);
target->msg_buffer = msg_copy;
target->msg_length = length;
release(&target->lock);
After the lock on the endpoint is released, the message has been sent! Now it’s up to the receiving thread to
check the endpoint and set the buffer to NULL again.
36.4 Receiving
We have seen how to send messages, now let’s take a look at how to receive them. We’re going to use a basic
(and inefficient) example, but it shows how it could be done.
The theory behind this is simple: when we’re in the receiving process, we allocate a buffer to hold the message,
and copy the message data stored at the endpoint into our local buffer. Now we can set the endpoint’s
msg_buffer field to NULL to indicate that there is no longer a message to be received. Note that setting the
buffer to NULL is specific to our example code, and your implementation may be different.
As always, note the use of locks to prevent race conditions. The variable endpoint is assumed to be the
endpoint we want to receive from.
ipc_endpoint* endpoint; // assumed to point at the endpoint we want to receive from

acquire(&endpoint->lock);
void* local_copy = malloc(endpoint->msg_length);
memcpy(local_copy, endpoint->msg_buffer, endpoint->msg_length);
endpoint->msg_buffer = NULL;
endpoint->msg_length = 0;
release(&endpoint->lock);
At this point the endpoint is ready to receive another message, and we've got a copy of the message in local_copy. You've successfully passed a message from one address space to another!
• A process waiting on an endpoint (to either send or receive a message) could be waiting quite a while in
some circumstances. This is time the cpu could be doing work instead of blocking and spinning on a
lock. A simple optimization would be to put the thread to sleep, and have it be woken up whenever the
endpoint is updated: a new message is sent, or the current message is read.
• In this example we’ve allowed for messages of any size to be sent to an endpoint, but you may want
to set a maximum message size for each endpoint when creating it. This makes it easier to receive
messages as you know the maximum possible size the message can be, and can allocate a buffer without
checking the size of the message. This might seem silly, but when receiving a message from userspace the
program has to make a system call each time it wants the kernel to do something. Having a maximum
size allows for one-less system call. Enforcing a maximum size for messages also has security benefits.
Part VIII
Chapter 37
Virtual File System
We have now made our kernel work with multiple programs, let them communicate with each other, handled access to shared resources, and protected the kernel's space. Now it is time to start thinking about how to store and access files in our kernel, and how we want to support file systems.
Different filesystems have different advantages: some are simpler to implement, others may offer extreme redundancy, and others may be usable across a network. Each filesystem implementation is typically provided by a separate driver that then interacts with the virtual file system provided by the kernel. The most common filesystem drivers you will want to provide are ext2, FAT (12/16/32 - they are fundamentally all the same) and a ram-based 'temp fs'. The tempfs may also support loading its contents from a TAR passed by the bootloader. This is the concept of an init ramdisk, and we'll look at an example of how to implement this.
Each filesystem internally represents file (and directory) data in its own way, whether as a structure laid out before the data, or as an entry in an array or list somewhere.
How do we combine the output of all these different filesystems in a uniform way that's usable by the rest of the OS? We achieve this through a layer of abstraction, which we call the virtual file system, or VFS. It's responsible for acting as a scaffold that other filesystems can attach themselves to.
How the VFS presents itself is another design decision, but the two common ways to do it are:
• Each mounted filesystem is a distinct filesystem, with a separate root. Typically each root is given a
single letter to identify it. This is the MS-DOS/Windows approach and is called the multi-root approach.
• Each mounted filesystem exists within a single global tree, under a single root. This is the usual unix
approach, where a directory can actually be a window into another filesystem.
Chapter 38
The Virtual File System
Nowadays there are many OSes available for many different hardware architectures, and probably even more file systems. One of the problems for an OS is to provide an interface generic enough to support as many file systems as possible, while making it easy to implement new ones in the future. This is where the VFS layer comes in. In this chapter we are going to see in detail how it works, and make a basic implementation of it. To keep our design simple, the features of our VFS driver will be:
• Mountpoints will be handled using a simple linked list (with no particular sorting or extra features)
• Support only the following functions: open, read, write, close, opendir, readdir, closedir,
stat.
• No extra features like permissions, uid and gid (although we are going to add those fields, they will not
be used).
• The path length will be limited.
How the different file systems are presented to the end user is a design decision. For example, Windows keeps different file systems logically separated using unit letters, while unix-like systems represent them under the same tree, so a folder can be either on the same FS or on another one. In both cases the idea is the same: we want to use the same functions to read/write files on all of them.
In this guide we will follow the unix approach. To better understand how it works, let's have a look at this picture:
It shows a portion of a unix directory tree (starting from root); the gray circles represent actual file systems, while the white ones are directories.
So for example:
• /home/user1 points to a folder in the file system that is loaded at the "/" folder (we say that it's mounted there)
• /mnt/ext_device/A instead points to a file on a file system that is mounted within the /mnt folder
When a filesystem is mounted in a folder, that folder is no longer a container of other files/directories on the same filesystem, but instead refers to another filesystem somewhere else (it can be a network drive, an external device, an image file, etc.); the folder takes the name of mountpoint.
Every mountpoint will contain the information on how to access the target file system, so every time the VFS has to access a file or directory (i.e. an open function is called), it does the following:
• Parse the file path to identify the mountpoint of the filesystem.
• Once we have the mountpoint struct, we can access a series of function pointers, one for each operation like opening/closing/reading/writing a file.
• It calls the open function for that FS, passing the path to the filename (in this case the path should be relative).
• From this point everything is handled by the file system driver, and once the file is accessed the result is returned back to the VFS layer.
The multi-root approach, even if it is different, it share the same behaviour, the biggest difference is that
instead of having to parse the path searching for a mountpoint it has only to check the first item in the it to
222
223 38.2. THE VFS IN DETAIL
char type[VFS_TYPE_LENGTH];
char mountpoint[VFS_TYPE_LENGTH];
fs_operations_t operations;
} mountpoint_t;
223
CHAPTER 38. THE VIRTUAL FILE SYSTEM 224
224
225 38.2. THE VFS IN DETAIL
In this case we just need to find the item in the list that contains the required file system, and if found remove
it from the list/tree by calling the remove_mountpoint function, making sure to free all resources if possible,
or return an error.
225
CHAPTER 38. THE VIRTUAL FILE SYSTEM 226
path: “/” (0), and “/home” (3), and the longest one is number 3, so this is the file system our function is
going to return (whether it is going to be an id or the reference to the mountpoint item).
The second path instead has three mountpoints contained into it: /” (0), “/home/mount” (1), “/home” (3),
in this case we are going to return 1.
The last one, has only one mountpoint that is contained into the path, and it is “/” (0).
In a single root scenario, there is always at least a mountpoint that is contained into the given path, and this
is the root folder “/”.
What if for example path doesn’t start with a “/”? this means that it’s a relative path, it will be explained
soon.
Implementing the function is left as exercise, below we just declare the header (that we will use in the next
sections):
mountpoint_t get_mountpoint(char *path);
If the above function fail it should return NULL to let the caller know that something didn’t work (even if it
should always return at least the root “/” item in a single root implementation).
226
227 38.2. THE VFS IN DETAIL
• It now reads 10 bytes from the opened file (the read function will access it via the fd field), and store
the output in the buffer. The read function returns the number of bytes read.
• If we want to print the string read we need to append the EndOfLine symbol after the last byte read.
• Now we can close the file_pointer (destroying the file descriptor associated with the id if it is possible,
otherwise -1 will be returned).
As we can see there are no instructions where we specify the file system type, or the driver to use this is all
managed by the vfs layer. The above functions will avail of kernel system calls open/read/close, they usually
sits somewhere above the kernel VFS layer, in our naive implementation they we are not going to create new
system calls, and let them to be our VFS layer, and where needed make a simpler version of them.
We can assume that any file system i/o operation consists of three basic steps: opening the file, reading/writing
from/to it and then closing it.
227
CHAPTER 38. THE VIRTUAL FILE SYSTEM 228
are the current positions of the buffer pointer for the read and write operations and file_size is the size of
the opened file.
So once our open function has found the mountpoint for the requested file, eventually a new file descriptor
item will be created and filled, and an id value returned. This id is different from the one in the data structure,
since it represent the internal fs descriptor id, while this one represent the vfs descriptor id. In our case the
descriptor list is implemented again using an array, so the id returned will be the array position where the
descriptor is being filled.
Why “eventually”? Having found the mountpoint id for the file doesn’t mean that the file exists on that fs,
the only thing that exist so far is the mountpoint, but after that the VFS can’t really know if the file exists or
not, it has to defer this task to the fs driver, hence it will call the implementation of a function that open a
file on that FS that will do the search and return an error if the file doesn’t exists.
But how to call the fs driver function? Earlier in this chapter when we outlined the mountpoint_t structure
we added a field called operations, of type fs_operations_t and left it unimplemented. Now is the time to
implement it, this field is going to contain the pointer to the driver functions that will be used by the vfs to
open, read, write, and close files:
struct fs_operations_t {
int (*open)(const char *path, int flags, ... );
int (*close)(int file_descriptor);
ssize_t (*read)(int file_descriptor, char* read_buffer, size_t nbyte);
ssize_t (*write)(int file_descriptor, const void* write_buffer, size_t nbyte);
};
if (mountpoint != NULL`) {
char *rel_path = get_rel_path(mountpoint, path);
int fs_specific_id = mountpoint.operations.open(rel_path, flags);
if (fs_specific_id != ERROR) {
/* IMPLEMENTATION LEFT AS EXERCISE */
// Get a new vfs descriptor id vfs_id
vfs_opened_files[vfs_id] = //fill the file descriptor entry at position
}
}
return vfs_id
}
The pseudo code above should give us an idea of what is the workflow of opening a file from a VFS point of
view, as we can see the process is pretty simple in principle: getting the mountpoint_id from the vfs, if one
has been found get strip out the mountpoint path from the path name, and call the fs driver open function, if
this function call is successful is time to initialize a new vfs file descriptor item.
Let’s now have a look at the close function, as suggested by name this will do the opposite of the open
function: given a file descriptor id it will free all the resources related to it and remove the file descriptor from
the list of opened files. The function signature is the following:
228
229 38.2. THE VFS IN DETAIL
229
CHAPTER 38. THE VIRTUAL FILE SYSTEM 230
The buffer content of the first read will be: Text, and the second one examp. This is the purpose buf_read_pos
variable in the file descriptor, so it basically needs to be incremented of nbytes, of course only if buf_read_pos
+ nbytes < file_size . The pseudocode for this function is going to be similar to the open/close:
This is more or less all the code needed for the VFS level of the read function, from the moment the driver
read function will be called the control will leave the VFS and will go one layer below, and in most cases it
will involve similar steps for the VFS with few differences like:
• It will use the relative path (without the mountpoint part) to search for the file
• It will use some internal data-structures to keep track of the files we are accessings (yes this can bring
to some duplication of similar data stored in two different data structure)
• It will read the content of the file if found in different ways: if it is a file system loaded in memory, it will
read it accessing its location, instead if it is stored inside a device it will probably involve another layer
to call the device driver and access the data stored on it, like reading from disk sectors, or from a tape.
The above differences are valid for all of the vfs calls.
Now that we have implemented the read function we should be able to code a simple version of the write
function, the logic is more or less the same, and the two key differences are that we are saving data (so
it will call the fs driver equivalent for write) and there probably be at least another addition to the file
descriptor data structure, to keep track of the position we are writing in (yes better keep read and write
pointers separated, even because files can be opened in R/W mode). This is left as an exercise.
Initially let’s concentrate with the basic function for directory handling: opendir, readdir and closedir,
then when we get a grasp on them we can implement other more sophisticated functions.
230
231 38.2. THE VFS IN DETAIL
231
CHAPTER 38. THE VIRTUAL FILE SYSTEM 232
232
Chapter 39
In the previous chapter we have implemented the VFS layer. That will provide us with a common interface
to access files and directories across different file systems on different devices. In this chapter we are going
to implement our first file system driver and try to access its content using the VFS function. As already
anticipated we will implement the (US)TAR format.
39.1 Introduction
The Tar (standing for Tape ARchive) format is not technically a file system, rather it’s an archive format
which stores a snapshot of a filesystem. The acronym USTar is used to identify the posix standard version of
it. Although is not a real fs it can be easily used as one in read-only mode.
It was first released on 1979 in Version 7 Unix, there are different tar formats (including historical and current
ones) two are codified into standards: ustar (the one we will implement), and “pax”, also still widely used but
not standardized is the GNU Tar format.
A tar archive consists of a series of file objects, and each one of them includes any file of data and is preceded
by a fixed size header (512 bytes) record, the file data is written as it is, but its size is rounded to a multiple
of 512 bytes. Usually the padding bytes are filled extra zeros. The end of an archived is marked by at least
two consecutive zero filled records.
39.2 Implementation
To implement it we just need:
• The header structure that represent a record (each record is a file object item).
• A function to lookup the file within the archive.
• Implement a function to open a file and read its content.
233
CHAPTER 39. THE TAR FILE SYSTEM 234
To ensure portability all the information on the header are encoded in ASCII, so we can use the char type
to store the information into those fields. Every record has a type flag, that says what kind of resource it
represent, the possible values depends on the type of tar we are supporting, for the ustar format the possible
values are:
Value Meaning
‘0’ (ASCII Null) Normal file
‘1’ Hard link
‘2’ Symbolic link
‘3’ Character Special Device
‘4’ Block device
‘5’ Directory
‘6’ Named Pipe
The name of linked file field refers to symbolic links in the unix world, when a file is a link to another file,
that field contains the value of the target file of the link.
The USTar indictator (containing the string ustar followed by NULL), and the version field are used to
identify the format being used, and the version field value is “00”.
The filename prefix field, present only in the ustar, this format allows for longer file names, but it is
splitted into two parts the file name field ( 100 bytes) and the filename prefix field (155 bytes)
The other fields are either self-explanatory (like uid/gid) or can be left as 0 (TO BE CHECKED) the only
one that needs more explanation is the file size field because it is expressed as an octal number encoded in
ASCII. This means we need to convert an ascii octal into a decimal integer. Just to remind, an octal number
is a number represetend in base 8, we can use digits from 0 to 7 to represent it, similar to how binary (base 2)
only have 0 and 1, and hexadecimal (base 16) has 0 to F. So for example:
octal 12 = hex A = bin 1010
In C an octal number is represented adding a 0 in front of the number, so for example 0123 is 83 in decimal.
But that’s not all, we also have that the number is represented as an ascii characters, so to get the decimal
number we need to:
1. Convert each ascii digit into decimal, this should be pretty easy to do, since in the ascii table the digits
are placed in ascending order starting from 0x30 ( ´0' ), to get the digit we need just to subtract the
ascii code for the 0 to the char supplied
234
235 39.2. IMPLEMENTATION
2. To obtain the decimal number from an octal we need to multiply each digit per 8ˆi where i is the digit
position (rightmost digit is 0) and sum their results. For example 37 in octal is:
037 = 3 * 8ˆ1 + 7 * 8ˆ0 = 31
Remember we ignore the first 0 because it tells C that it is an octal number (and also it doesn’t add any
value to the final result!), since we are writing an os implementing this function should be pretty easy, so this
will be left as exercise, we will just assume that we have the following function available to convert octal ascii
to decimal:
size_t octascii_to_dec(char *number, int size);
The size parameter tells us how many bytes is the digit long, and in the case of a tar object record the size is
fixed: 12 bytes. Since we just need to implement a data structure for the header, this is left as exercise. Let’s
assume just that a new type is defined with the name tar_record.
To move from the first header to the next we simply need to use the following formula:
The lookup function then will be in the form of a loop. The first thing we’ll need to know is when we’ve
reached the end of the archive. As mentioned above, if there are two or more zero-filled records, it indicated
the end. So while searching, we need to make sure that we keep track of the number of zeroed records. The
main lookup loop should be similar to the following pseudo-code:
int zero_counter = 0;
while (zero_counter < 2) {
if (is_zeroed(current_record) ) {
zero_counter++;
continue;
}
zero_counter = 0;
//lookup for file or load next record if not found
}
235
CHAPTER 39. THE TAR FILE SYSTEM 236
The is_zeroed function is a helper function that we should implement, as the name suggest it should just
check that the current record is full of zeros, the implementation is left as an exercise. Within the loop now
we just need to search for the file requested, the tricky part here is that we can have two scenarios:
• The filename length is less than 100 bytes, in this case it is stored into the file_name field
• The length is greater than 100 bytes (up to 256) so in this case the filename is split in two parts, the
first 100 bytes goes into the file_name field, the rest goes into the filename_prefix field.
An easy solution is to check first the searched filename length, if it less than 100 characters, so we can use
just the file_name field, otherwise we can merge the two fields and compare than with the searched filename.
The updated loop pseudo-code should look similar to this: ,
uint64_t tar_file_lookup(const char *filename) {
char tar_filename[256];
int zero_counter = 0;
//The starting address should be known somehow to the OS
tar_record *current_record = tar_fs_start_address;
while (zero_counter < 2) {
if (is_zeroed(current_record) ) {
zero_counter++;
continue;
}
zero_counter = 0;
if ( tar_record->filename_prefix[0] != 0) {
// We need to merge the two strings;
sprintf(tar_filename, "%s%s", current_record->file_prefix,
,→ current_record->file_name);
} else {
strcpy(tar_filename, current_record->file_prefix);
}
if ( strcmp(tar_filename, searched_file) == 0) {
// We have found the file, we can return whether the beginning of data, or
,→ the record itself
}
uint64_t file_size = octascii_to_dec(current_record.file_size, 12);
current_record = current_record + sizeof(tar_header) + file_size;
}
// If the while loop finish it means we have not found the file
}
The above code outlines what are the steps required to lookup for a file, the searched_file variable is the
file we are looking for. With the function above now we are able to tell the vfs that the file is present and it
can be opened. How things are implemented depends on design decisions, for example there are many paths
we can take while implementing functions for opening a file on a fs:
• We can just keep a list of opened files like the VFS
• We can lookup the file and return the address of where it starts to the vfs, so the read will know where
to look for it.
• We can map the file content somewhere in the memory
These are just few examples, but there can be different options. In this guide we are going to simply return
the location address of starting position of the file to the VFS layer, the driver will not keep track of opened
files, and also we assume that the tar fs is fully loaded into memory. In the real world we probably will need a
more complex way to handle stuff, because file systems will be stored on different devices, and will unlikely be
fully loaded into memory.
With the assumptions above, we already have all that we need for opening the file, from a file system point of
view, it could be useful just to create a open function to eventually handle the extra parameters passed by
236
237 39.3. AND NOW FROM A VFS POINT OF VIEW
the vfs:
uint64_t ustar_open(const char* filename, int flags);
The implementation is left as exercise, since it just calling the tar_lookup and returning its value. Of course
this function can be improved, and we can avoid creating a wrapper function, and use the lookup directly, but
the purpose here was just to show how to search for a file and make it available to the vfs.
237
CHAPTER 39. THE TAR FILE SYSTEM 238
238
239 39.4. WHERE TO GO FROM HERE
improved by populating a list with all the files present in the file system, keeping track of the informations
needed for lookup/read purposes. We can add a simple struct like the following:
struct tar_list_item {
char filename[256];
void *tar_record_pointer;
size_t file_size
int type;
struct tar_list_item* next;
};
And using the new datatype initialize the list accordingly.
Now when the file system is accessed for the first time we can initialize this list, and use it to search for the
files, saving a lot of time and resources, and it can makes things easier to for the lookup and read function.
Another limitation of our driver is that it expects for the tar to be fully loaded into memory, while we know
that probably file system will be stored into an external device, so a good idea is to make the driver aware of
all possible scenarios.
And of course we can implement more file systems from here.
There is no write function too, it can be implemented, but since it has many limitations it is not really a good
idea.
239
CHAPTER 39. THE TAR FILE SYSTEM 240
240
Part IX
Loading ELFs
241
Chapter 40
That’s not to say ELF is the only format for these kinds of files (there are others like PE/portable execute,
a.out or even mach-o), but the ELF format is the best for our purposes. A majority of operating systems have
come to a similar to conclusion. We could also use our own format, but be aware this requires a compiler
capable of outputting it (meaning either write our own compiler, or modify an existing one - a lot of work!).
This chapter won’t be too heavy on new concepts, besides the ELF specification itself, but will focus on
bringing everything together. We’re going to load a program in a new process, and run it in userspace. This
is typically how most programs run, and then from there we can execute a few example system calls.
It should be noted that the original ELF specification is for 32-bit programs, but a 64-bit extension was
created later on called ELF64. We’ll be focusing on ELF64.
It’s worth having a copy of the ELF specification as a reference for this chapter as we won’t define every
structure required. The specification doesn’t use fixed width types in its definition, instead using words and
half words, which are based on the word size of the cpu. For the exact definition of these terms we will need
the platform-specific part of the ELF spec, which gives concrete types for these.
All structs in the base ELF spec are defined using these types, and so we will use them too. Note that their
exact definitions will change depending on the target platform.
243
CHAPTER 40. EXECUTABLE LINKER FORMAT 244
244
245 40.5. LOADING THEORY
Elf64_Off p_offset;
Elf64_Addr p_vaddr;
Elf64_Addr p_paddr;
Elf64_Xword p_filesz;
Elf64_Xword p_memsz;
Elf64_Xword p_align;
} Elf64_Phdr;
The meaning of all these fields is explained below when we look at how actually loading a program header.
The most important field here is p_type: this tells the program loader what it should do with this particular
header. A full list of types is available in the ELF spec, but for now we only need one type: PT_LOAD.
Finding the program headers within an ELF binary is also quite straightforward. The offset of the first phdr
is given by the phoff (program header offset) field of the ELF header.
Like section headers, each program header is tightly packed against the next one. This means that program
headers can be treated as an array. As an example is possible to loop through the phdrs as follows:
void loop_phdrs(Elf64_Ehdr* ehdr) {
Elf64_Phdr* phdrs = (Elf64_Phdr*)((uintptr_t)ehdr + ehdr->e_phoff);
245
CHAPTER 40. EXECUTABLE LINKER FORMAT 246
copy: we are expected to make p_memsz bytes available at p_vaddr, even if p_memsz is bigger than p_filesz.
The spec says that this extra space (p_memsz - p_filesz) should be zeroed.
This is actually how the .bss section is allocated, and any pre-zeroed parts of an ELF executable are created
this way.
Before looking at some example code our VMM will need a new function that tries to allocate memory at
a specific virtual address, instead of whatever is the best fit. For our example we’re going to assume the
following function is implemented (according to the chosen design):
void* vmm_alloc_at(uintptr_t addr, size_t length, size_t flags);
Alternatively in the vmm_alloc function we can make use of the flag parameter, and add a new flag like
VM_FLAG_AT_ADDR that indicates the VMM should use the extra arg as the virtual address. Bear in mind that
if we’re loading a program into another address space we will need a way to copy the phdr contents into that
address space. The specifics of this don’t matter too much, as long as there is a way to do it.
The reason we need to use a specific address is that the code and data contained in the ELF are compiled and
linked assuming that they’re at that address. There might be code that jumps to a fixed address or data
that is expected to be at a certain address. If we don’t copy the program header where it expects to be, the
program may break.
Authors Note: Relocations are another way of dealing with the problem of not being able to use the requested
virtual address, but these are more advanced. They’re not hard to implement, certainly easier than dynamic
linking, but still beyond the scope of this chapter.
Now that we have that, lets look at the example code (without error handling, as always):
void load_phdr(Elf64_EHdr* ehdr, Elf64_Phdr* phdr) {
if (phdr->p_type != PT_LOAD)
return;
246
247 40.5. LOADING THEORY
We´ll also want to adjust the permissions of the mapped memory after copying the program header content.
This is because we will need the memory to be mapped as writable so the CPU lets us copy the phdr content
into it, and then the permissions can be adjusted to what the program header requested.
247
CHAPTER 40. EXECUTABLE LINKER FORMAT 248
248
Chapter 41
Before we start, we’re going to apply a few restrictions to our program loader. These are things you can easily
add later, but they only serve to complicate the process.
For a program to be compatible with our loader:
• It cannot contain any relocations. We don’t care about static linking or position independent code (PIC)
however, as that doesn’t affect the loader.
• All libraries must be statically linked, we won’t support dynamic linking for now. This feature isn’t too
hard to implement, but we will leave this as an exercise to the reader.
• The program must be freestanding! As of right now we don’t have a libc that targets our kernel. It can
be worth porting (or writing) a libc later on.
249
CHAPTER 41. LOADING AND RUNNING AN ELF 250
Value Byte
0x7f 0
E 1
L 2
F 3
• We need to check that the file class match with the one we are supporting. There are two possible
classes: 64 and 32. This is byte 4
• The data field indicates the endiannes, again this depends on the architecture used. It can be three
values: None (0), LSB (1) and MSB (2). For example x86_64 architecture endiannes is LSB, then the
value is expected to be 1. This field is in the byte 5.
• The version field, byte 6, to be a valid elf it has to be set to 1 (EVCURRENT).
• The OS ABI and ABI version they identify the operating system together with the ABI to which the
object is targeted and the version of the ABI to which the object is targeted, for now we can ignore
them, the should be 0.
Then from the other fields that need validation (that area not in the e_ident field) are:
• e_type: this identifies the type of elf, for our purpose the one to be considered valid this value should
be 2 that indicates an Executable File (ET_EXEC) there are other values that in the future we could
support, for example the ET_DYN type that is used for position independent code or shared object, but
they require more work to be done.
• e_machine: this indicates the required architecture for the executable, the value depends on the
architectures we are supporting. For example the value for the AMD64 architecture is 62
Be aware that most of the variables and their values have a specific naming convention, for more information
refer to the ELF specs.
Another thing to be aware is that some compilers when generating a simple executable are not using the
ET_EXEC value, but it could be of the type ET_REL (value 1), to obtain an executable we need to link it using
a linker. For example if we generated the executable: example.elf with ET_REL type, we can use ld (or
another equivalent linker):
ld -o example.o example.elf
For basic executables, we most likely don’t need to include any linker script.
If we want to know the type of an elf, we can use the readelf command, if we are on a unix-like os:
readelf -e example.elf
Will print out all the executable information, including the type.
250
251 41.2. CAVEATS
extern loop
[bits 64]
loop:
jmp loop
The code, as we can expect is pretty simple, and self-explanatory, it declares a loop function, and mark it as
global using the extern keyword..
The above code now can be compiled with nasm:
nasm -g -f elf64 -F dwarf example_file.s -o example_file.o
Where: * -f elf64 is the output format (in our case we use elf64, but this depend on the target architecture).
* -g enable debug symbols * -F dwarf is the debug symbols format (for elf64 we use dwarf, but again it can
depends on the target architecture).
The last step is to use a linker to link the file, in this example we are going to use ld:
ld -g example_file.o -o example_file.elf -e loop
Where -g is a parameter that instructs the linker to include the debugging symbols, and -e loop instructs
the linker to look for the symbol called loop as entry point of the program.
Now the program is ready to be loaded by our kernel, either as a bootloader module or a file on a filesystem
(or any other way that allow the kernel to reach this executable).
41.2 Caveats
As we can already see from the above restrictions there is plenty of room for improvement. There are also
some other things to keep in mind:
• If the program is going to be loaded into userspace (rather than in the kernel) we will need to map all
the memory we want to allow the program to use as user-accessible. This means not just the program
headers but also the stack. We’ll want to mark all this memory as user-accessible after copying the
program data in there though.
• Again if the program being loaded is a user program the scheduler will need to handle switching between
different privilege levels on the cpu. On x86_64 these are called rings (ring 0 = kernel, ring 3 = user),
other platforms may use different names. See the userspace chapter for more detail.
• As mentioned earlier in the scheduling chapter, don’t forget to call a function when exiting the main
thread of the program! In a typical userspace program the standard library does this for us, but our
programs are freestanding so it needs to be done manually. If coming from userspace this will require a
syscall.
251
CHAPTER 41. LOADING AND RUNNING AN ELF 252
252
Part X
Conclusion
253
Chapter 42
Going Beyond
42.1 Introduction
If you have reached this far in the book, your kernel should have all the basic components needed to start
adding some of the more exciting features. This means turning your project from a kernel and into an
operating system: i.e. making it more useful and interactive. In this chapter we are going to explore some
features that can be added to our operating system. This will be a high level overview, and not as in-depth as
the previous parts.
At this point in development any feature can be added, and it’s up to you what direction your kernel goes.
While we’re going to list a few suggestions of where you might want to explore next, they are merely that:
suggestions.
42.2.1 Prerequisites
To implement a shell we need at least the following parts of the kernel implemented:
• A video output (either using framebuffer, or legacy vga driver).
• A keyboard driver implemented with at least one layout working.
• Functions for working with strings: the common comparison and manipulation ones.
In order to run external programs, you will also need:
• The VFS layer implemented.
• At least one file system supported. If you’ve been following along you’ll have a tempfs that uses USTar
which provides everything you need.
• A program loader capable of loading executable files and running them.
Typically when a command is executed by the shell it will run in a newly spawned process, so it can be useful
to have fork and exec (or something equivalent) implemented.
255
CHAPTER 42. GOING BEYOND 256
256
257 42.3. GRAPHICAL USER INTERFACE
Of course there are many ways you could architect a GUI, but the approach followed by most developers
these days consists of a protocol, client and a server:
• A client is any application that wants to perform operations on the graphical shell, like creating/destroying
a window. In this case the client uses the method described in the protocol to send a request to the
server, and receives a response. The client can then perform any operations it likes in this manner and
now we have a working system!
• A protocol. Think of the X window protocol (or wayland), this is how other applications can interact
with the shell and perform operations. Things like asking the shell for a new window (with its own
framebuffer), minimizing/maximizing the window, or perhaps receiving input into a window from the
shell. This is similar to an ABI (like your system calls use) and should be documented as such. The
protocol defines how the clients and server communicate.
• The server is where the bulk of the work happens. The server is what we’ve been referring to as the
GUI or graphical shell, as it’s the program responsible for ultimately drawing to the screen or passing
input along to the focused window. The server may is usually also responsible for drawing window
decorations and any UI elements that aren’t part of a specific window like the task bar and the start
menu. Although this can also be exposed via the protocol and a separate program can handle those.
Ok it’s a little more complex than that, but those are the key parts! You’ll likely want to provide a static (or
dynamic, if you support that yet) library for your client programs that contains code for communicating with
the server. You could wrap the IPC calls and shuffling data in and out of buffers into nice functions that are
easier to work with.
Instead it’s recommended to create a new framebuffer for each window. If you don’t have hardware acceleration
for your GPU (which is okay!) this can just be a buffer in memory big enough to hold the pixels. This buffer
should be created by the server program but shared with the client program. The idea here is that the client
can write as much data as it wants into the framebuffer, and then send a single IPC message to the server to
tell it the window framebuffer has been updated. At this point the server would copy the updated region
of the window’s framebuffer onto the real framebuffer, taking into account the window’s position and size
(maybe part of the window is offsreen, so not visible).
This is much more efficient than sending the entire framebuffer over an IPC message and fits very naturally
with how you might expect to use a framebuffer, only now you will need the additional step of flushing the
framebuffer (by sending an IPC message to the server).
Another efficiency you might see used is the idea of ‘damage rectangles’. This is simply tracking where on
the framebuffer something has changed, and often when flushing a framebuffer you can pass this information
along with the call. For example say we have a mouse cursor that is 10x10 pixels, and we move the cursor 1
pixel to the right. In total we have only modified 11x10 pixels, so there is no need to copy the entire window
framebuffer. Hopefully you can see how this information is useful for performance.
Another useful tool for rendering is called a quad-tree. We won’t delve into too much detail here, but it can
greatly increase rendering speed if used correctly, very beneficial for software rendering.
257
CHAPTER 42. GOING BEYOND 258
typedef struct {
struct { size_t x, size_t y } position;
struct { size_t w, size_t h } size;
258
259 42.3. GRAPHICAL USER INTERFACE
return base;
}
You can see in create_button() how we can create a button element and populate the functions pointers we
care about.
All that remains is a core loop that calls the various functions on each element as needed. This means calling
elem->render() when you want to render an element to the framebuffer! Now you can combine these calls
however you like, but the core concept is that the framework is flexible to allow adding custom elements of
any kind, and they just work with the rest of them!
Don’t forget to flush the window’s framebuffer once you are done rendering.
42.3.3 In Conclusion. . .
Like we’ve said this is no small task, and requires a fairly complete kernel. However it can be a good project
for testing lots of parts of the kernel all working together. It’s also worth considering that if your shell server
is designed carefully you can write it to run under an existing operating system. This allows you to test your
259
CHAPTER 42. GOING BEYOND 260
server, protocol, and clients in an easily debuggable environment. Later on you can then just port the system
calls used by your existing server.
260
261 42.5. NETWORKING
Similar to how linux targets are built with the target triplet of something like x86_64-linux-elf you would
build a toolchain that targets x86_64-your_os_here-elf. This is actually very exciting as it means you can
use this compiler for any program that can be built just using the standard library!
42.5 Networking
Networking is another important feature for modern operating systems, it lets our project no longer to be
confined into our emulator/machine and talk to other computers. Not only computers on our local network,
but also servers on the internet.
Once implemented we can write some simple clients like an irc or email client, and use them to show how cool
we are; chatting from a client written by us, on an os written by us.
Like a graphical shell this is another big task with many moving parts. Network is not just one protocol, it
requires a whole stack of various protocols, layered on top of each other. The (in)famous TCP/IP is often
described as requiring 7 layers.
42.5.1 Prerequisites
What we need already implemented for the networking are:
• Memory management: we’ll need the usual dynamic memory management (malloc and free), but also a
capable VMM for writing networking drivers.
• Inter process communication: processes need to be informed if there is some data they have received
from the network.
• Interrupt infrastructure: we’ll need to be able to handle interrupts from network cards.
• PCI support: any reasonable network interface card is managed through PCI.
Unlike a framebuffer which is often provided to us by the bootloader, we’ll need to write drivers for each
network card we want to support. Fortunately a lot of network cards use the same chipset, which means we
can use the same driver for more than just a single card.
A good place to start is with the Intel e1000 driver (or the e1000e extended version). This card is supported
by most emulators and is used as the basis for almost all intel networking chipsets. Even some non-intel
chipsets are compatible with it! The osdev wiki has documentation for several different chipsets that can be
implemented.
42.5.1.1 Implementation
Once the driver is in place this means we are able to send and receive data through a network, we can start
implementing a communication protocol. Although there are different protocols available nowadays, we most
likely want to implement the TCP/IP one, since it is basically used by every internet service.
The TCP/IP protocol is composed by 7 levels divided into 4 different layers:
1. The Link layer - The lower one, usually it is the one responsible of communicating the data through the
network (usually part of the implementation done within the network interface driver)
2. Internet - It move packets from source to destination over the network
3. Transport - This provide a reliable message delivery between processes in the system
4. Application - It is at the top, and is the one used by processes to communicate with the network, all
major internet communication protocols are at this layer of the stack (i.e. ftp, http, etc.)
As mentioned above each layer is comprised of one or more levels. Implementing a TCP/IP stack is beyond
our scope and also require a good knowledge of it. This paragraph is just a general overview of what are the
layers and what we should expect to implement.
Usually the network levels should be pretty easy to implement, since it reflect the hardware part of the
network. Every layer/level is built on top of the previous, so a packet that is received by a host in the network
261
CHAPTER 42. GOING BEYOND 262
will traverse the stack and at every level some of the information it contains will be read, stripped from it and
the result passed to the level below. The same is true also for sending a packet (but in this case at every level
some information will be added).
The internet layer is responsible of moving datagrams (packets) in the network, it provides a uniform
networking interface that hides the actual topology of the network, or the network connections. This is the
layer that establishes the inter-networking and defines the addressing of the network (IP), at this layer we
have implemented ICMP and IGMP protocols.
At the Transport layer we have the host to host communication this is where the TCP and UDP protocol are
implemented, and those are responsible of the routing of our packets.
The Application layer instead are usually the protocols we want to implement, so for example FTP, HTTP,
POP, DNS are all application level protocols.
When we want to send data, we start from the topmost layer (the application) and go down the whole stack
until the network layer adding some extra information on each level. The extra information is the layer header
and footer (if needed), so when the data has reached the last level it will have all the technical information for
each level. This is described in the picture below.
On the other way a packet received from the network will observe the opposite path, so it will start as a big
packet containing headers/footers for each layer, and while it is traversing the stack upwards, at every layer it
will have the layer’s header stripped, so when it will reach the Application Layer it will be the information we
are looking for (you can just look at the picture from bottom to top):
Like for the GUI implementing a TCP/IP stack is not a quick task, neither trivial, since networking is
composed by many different components, but the biggest difference is that most of what we need to implement
262
263 42.6. FEW FINAL WORDS
is very well standardized, and we just need to follow the documentation and the implementation part should
be less difficult.
263
CHAPTER 42. GOING BEYOND 264
264
Appendices
265
Appendix A
Troubleshooting
267
APPENDIX A. TROUBLESHOOTING 268
mmx, especially since sse replaces most of their functionality. First you’ll went to set some flags in cr0 for the
fpu:
• Bit 1 = MP/Monitor processor - required
• Bit 2 = Must be cleared (if set means you want to emulate the fpu - you don’t).
• Bit 4 = Hardwired to 1 (but not always in some emulator versions!). If set means use 387 protocol,
otherwise 287.
• Bit 5 = NE/Native exceptions - required.
You’ll likely want to ensure bit 3 is clear. This is the TS/task switched bit, which if set will generate a
#NM/device missing exception when the cpu thinks you’ve switched tasks. Not worth the hassle. The FPU
can now be initialized by a simple finit instruction.
SSE is a little trickier, but still straight forward enough You’ll want to set the following bits in cr4:
• Bit 9 = OSFDSR, tell the cpu our os knows to use the fxsave/fxrstor instructions for saving/restoring
extended cpu state.
• Bit 10 = OSXMMEXCPT, tell the cpu it can issue #XM (simd/sse exceptions), and we’ll handle them.
If the cpu supports XSAVE (check cpuid), you can also set bit 18 here to enable it, otherwise leave it as is.
There is more work to setting up xsave as well, for running in a single-task state where you don’t care about
saving/loading registers, not having xsave setup is fine.
That’s it, that should enable your compiler-generated code with sse instructions to work.
268
Appendix B
Let’s look at some examples and see how they can be useful. While this technique is useful, it has to be used
carefully. If accessing MMIO using a volatile instance of a union, be sure to read about access requirements
for the underlying hardware. For example a device may expose a 64-bit register, consisting of 8 8-bit fields.
However the device may require that perform 64-bit reads and writes to the register, in which we will have to
read the whole register, and create the union from that value. If the device doesn’t have such a requirement,
we could instead use a volatile union and access it normally.
Imagine we have a function that reads a register and returns its value as uint64_t:
uint64_t read_register();
If we want to use a struct and populate it with the value returned by the function, we will have something
similar to the following code:
struct BadIdea
{
uint8_t a;
uint8_t b;
uint8_t c;
uint8_t d;
uint8_t e;
uint8_t f;
uint8_t g;
uint8_t h;
};
269
APPENDIX B. TIPS AND TRICKS 270
Now let’s see what happens instead if we are using a union approach:
union GoodIdea
{
//each item inside of union shares the same memory.
//using an anonymous struct ensures these fields are treated as 'a single item'.
struct
{
uint8_t a;
uint8_t b;
uint8_t c;
uint8_t d;
uint8_t e;
uint8_t f;
uint8_t g;
uint8_t h;
};
uint64_t squished;
};
GoodIdea gi;
gi.squished = read_register();
//now the fields in gi will represent what they would in the register.
//assuming a is bits 7:0, b is bits 16:8, etc ...
In this way we moved the struct declaration inside the union. This means that now that the struct and the
union share the same memory location, using an anonymous structure it ensures that the fields are treated as
a single item
In this way we can either access the uint64_t value of the register squished, or the single fields a, b, . . . , h.
uint8_t _6bits : 6;
uint8_t _2bits : 2;
};
Bitfields can be very useful, as they allow access to oddly sized bits of data. However there’s big issue that
can lead to unexpected bugs:
Consider the example above. This struct would form 16bits of data in memory, and while _3bits and _5bits
would share 1 byte, same with _6bits and _2bits, the compiler makes no guarantees about which field
occupies the least or most significant bits. Byte order of fields is always guaranteed by the spec to be in the
order they were declared in the source file. Bitwise order is not.
It is worth noting that relying on this is usually okay, most compilers will do the right thing, but optimizations
can get a little weird sometimes. Especially -O2 and above.
270
271 B.2. BITFIELDS? MORE LIKE MINEFIELDS.
271
APPENDIX B. TIPS AND TRICKS 272
272
Appendix C
273
APPENDIX C. C LANGUAGE USEFUL INFORMATION 274
• Clobbered registers can usually be left empty. However if we use an instruction like rdmsr which places
data in registers without the compiler knowing, we’ll want to mark those as clobbered. If we specify
eax/edx as output operands, the compiler is smart enough to work this out.
• One special clobber exists: “memory”. This is a read/write barrier. It tells the compiler we’ve accessed
memory other than what was specified as input/ouput operands. The cause of many optimization issues!
• For every operand type there can be more than one, in this case they must be comma separated.
• Every operand consists of a constraint and c expression pair. A constraint can also have a modifier itself
• Operands parameters are indicated with an increasing number prefixed by a %, so the first operand
declared is %0, the second is %1, etc. And the order they appears in the output/input operands section
represent their numbering
Below is the list of the constraint modifiers:
Symbol Meaning
= Indicates that this is an output operand, whatever was the previous value, it will
be discarded and replaced with something else
+ It indicates that the operand is both read and written by the instruction.
& It indicates that the operand can be modified before the instruction completion.
% It means that the instruction is commutative for this operand and the following,
so the compiler may interchange the two operands
The constraints are many and they depends the architecture too. Usually they are used to specify where the
value should be stored, if in registers or memory, for more details see the useful links section.
The list below contains some constraints that are worth an explanation:
• 0, 1, . . . , 9 - when a constraint is a number, it is called a matching_constraint, and this means that use
the same register in output as the corresponding input registers.
• m - this constraint indicates to use a memory operand supported by the architecture.
• a, b, c, etc. - The etters usually indicate the registers we want to use, so for example a it means rax (or
eax or ax depending on the mode), b means rbx (or ebx or bx), etc.
• g - this constiraint indicates that a general register, memory or immediate operand is used.
• r - it indicates that a register operand is allowed, provided that it is a general purpose register.
An example of an inline assembly instruction of this type is:
asm("movl %2, %%rcx;"
"rdmsr;"
: "=a" (low), "=d" (high)
: "g" (address)
);
Note here how eax and ecx are clobbered here, but since they’re specified as outputs the compiler implicitly
knows this.
Let’s dig into the syntax:
• First thing to know is that the order of operands is source, destination
• When a %% is used to identify a register, it means that it is an operand (its value is provided in the
operand section), otherwise a single % is used.
• Every operand has it’s own constraint, that is the letter in front of the variable referred in the operand
section
• If an operand is output then it will have a “=” in front of constraint (for example “=a”)
• The operand variable is specified next to the constraint between bracket
• Even if the example above has only %2, we can say that: %0 is the low variable, %1 is high, and %2 is
address.
274
275 C.3. C +(+) ASSEMBLY TOGETHER - CALLING CONVENTIONS
It is worth mentioning that inline assembly syntax is the At&t syntax, so the usual rules for that apply, for
example if we want to use immediate values in an instruction we must prefix them with the $, symbol so for
example if we want to mov 5 into rcx register:
asm("movl $5, %rcx;");
275
APPENDIX C. C LANGUAGE USEFUL INFORMATION 276
thoroughly complaining about its existence. The next page is usually filled with an equal number of results of
blog posts by people complaining about the previous people’s complaining, and so on. I’m not interested in that,
and instead will explain how I’ve found it a useful tool.
uint64_t pit_ticks;
//volatile uint64_t pit_ticks;
void pit_interrupt_handler()
{
pit_ticks++;
send_eoi();
276
277 C.6. A FINAL NOTE
void calibrate_apic_timer()
{
[ ... ] //setup apic timer to max value, and setup pit to known frequency.
unmask_pit_and_reset_ticks();
while (pit_ticks < APIC_CALIBRATE_MS);
stop_apic_timer();
[ ... ] //now apic current count register can be used to determine apic timer
,→ frequency.
}
The issue with this example is that pit_ticks is being constantly accessed inside the loop, and we never
modify it inside the loop (its modified in unrelated code, the interrupt handler). With optimizations enabled
the compiler will deduce that pit_ticks will always be its initial value, and will always be its initial value of
0, therefore the loop is infinite. If we change the variable declaration to be volatile uint64_t pit_ticks
the compiler assumes nothing, and that this variable can be modified at any time.
Therefore it must do a valid read from memory each time it accesses the variable. This causes the above code
to work as expected, although with increased latency, as memory must be accessed each cycle of the loop.
Authors note: This is actually not technically true on more modern CPUs, as there are mechanisms for
caches to talk to each other, and share fresh data without going to memory. See the MESI protocol. This skips
the full roundtrip of going to memory and back.
Another example would be an mmio based framebuffer on x86. Normally this results in a lot of smaller
accesses, all nearby (think about drawing a line of pixels for example). volatile could be used to ensure
each write to the framebuffer actually happens, and is not cached in a register and written later. This would
work perfectly fine, but it also limits the compiler’s options.
However in this case, the platform has a built in solution to this problem: write-combine cache mode.
If unfamiliar with caching, each page can have a set of caching attributes applied to it. These are designed to
greatly improve performance in certain situations, however they are specific in application. One of these is
write-combine. It works by queueing up writes to nearby areas of memory, until a buffer is full, and then
flushing them to main memory in a single access. This is much faster than accessing main memory each write.
However if we’re working with an older x86 cpu, or another platform, this solution is not available. Hence
volatile would do the job.
277
APPENDIX C. C LANGUAGE USEFUL INFORMATION 278
278
Appendix D
D.1 Macros
There are some cases where writing some assembly code is preferred/needed to do certain operations
(i.e. interrupts handling).
Nasm has a macro processor that supports conditional assembly, multi-level file inclusion, etc. A macro start
with the ‘%’ symbol.
There are two types of macros: single line (defined with %define) and multiline wrapped around %macro and
%endmacro. In this paragraph we will explain the multi-line macros.
A multi-line macro is defined as follows:
%macro my_first_macro 1
push ebp
mov ebp, esp
sub esp %1
%endmacro
A macro can be accessed from C if needed, in this case we need to add a global label to it, for example the
macro above will become:
%macro my_first_macro 1
[global my_first_macro_label_%1]
my_first_macro_label_%1:
push ebp
mov ebp, esp
sub esp %1
%endmacro
In the code above we can see few new things:
• First we said the label my_first_macro_label_%1 has to be set as global, this is pretty straightforward
to understand.
• the %1 in the label definition, let us create different label using the first parameter passed in the macro.
So if now we add a new line with the following code:
my_first_macro 42
It creates the global label: my_first_macro_label_42, and since it is global it will be visible also from our
C code (of course if the files are linked)
279
APPENDIX D. SOME INFORMATION ABOUT NASM 280
Basically defining a macro with nasm is similar to use C define statement, these special “instruction” are
evaluated by nasm preprocessor, and transformed at compile time.
So for example my_first_macro 42 is transformed in the following statement:
my_first_macro_label_42:
push ebp
mov ebp, esp
sub esp 42
Directive Description
DB Allocate a byte
DW Allocate 2 bytes (a word)
DD Allocate 4 bytes (a double word)
DQ Allocate 8 bytes (a quad word)
These directive are intended to be used for initialized variables. The syntax is:
single_byte_var:
db 'y'
word_var:
dw 54321
double_var:
dd -54321
quad_var:
dq 133.463 ; Example with a real number
If we want to declare a string we need to use a different syntax for db:
string_var:
db "Hello", 10
The above code means that we are declaring a variable (string_variable) that starts at ‘H’, and fill the
consecutive bytes with the next letters. And what about the last number? It is just an extra byte, that
represents the newline character, so what we are really storing is the string “Hello\n”
What we have seen so far is valid for a variable that can be initialized with a value, but what if we don’t know
the value yet, but we want just to “label” it with a variable name? Well is pretty simple, we have equivalent
directives for reserving memory:
Directive Description
RESB Rserve a byte
RESW Rserve 2 bytes (a word)
RESD Rserve 4 bytes (a double word)
RESQ Rserve 8 bytes (a quad word)
280
281 D.3. CALLING C FROM NASM
resw 2
double_var:
resd 3
quad_var:
resq 4
One moment! What are those number after the directives? Well it’s pretty simple, they indicate how many
bytes/word/dword/qword we want to allocate. In the example above: * resb 1 Is reserving one byte * resw
2 Is reserving 2 words, and each word is 2 bytes each, in total 4 bytes * resd 3 Is reserving 3 dwords, again a
dword is 4 bytes, in total we have 12 bytes reserved * resq 4 Is reserving. . . well you should know it now. . .
281
APPENDIX D. SOME INFORMATION ABOUT NASM 282
to read 8 memory locations, while in the second case only 4, and of course there can be differences (unless we
are lucky enough and the extra 4 bytes are all 0s).
This is kind of misleading if we usually do mostly register to memory, or value to register, value to memory,
where the size is “implicit”.
D.5 If Statement
Below an example showing a possible solution to a complex if statement. Let’s assume that we have the
following if statement in C and we want to translate in assembly:
if ( var1==SOME_VALUE && var2 == SOME_VALUE2){
//do something
}
In asm we can do something like the following:
cmp [var1], SOME_VALUE
jne .else_label
cmp [var2], SOME_VALUE2
jne .else_label
;here code if both conditions are true
.else_label:
;the else part
And in a similar way we can have a if statement with a logic OR:
if (var1 == SOME_VALUE || var2 == SOME_VALUE){
//do_something
}
in asm it can be rendered with the following code
cmp [var1], SOME_VALUE
je .true_branch
cmp [var2], SOME_VALUE
je .true_branch
jmp .else_label
.true_branch
jne .else_label
282
283 D.7. LOOP
D.7 Loop
Another typical scenario are loops. For example imagine we have the following while loop in C:
unsigned int counter = 0;
while (counter < SOME_VALUE) {
//do something
counter++;
}
Again in assembly we can use the jmp instructions family:
mov ecx, 0 ; Loop counter
.loop_cycle
; do sometehing
inc ecx
cmp ecx, SOME_VALUE
jne loop_cycle
The inc instruction increase the value contained by the ecx register.
283
APPENDIX D. SOME INFORMATION ABOUT NASM 284
nasm we use the struc and endstruc keywords, and the fields are defined between them. The example above
can be rendered in the following way:
struc task
id: resd 1
name: resb 8
endstruc
What this code is doing is creating three symbols: id as 0 representing the offset from the beginning of a task
structure and name as 4 (still the offset) and the task symbol that is 0 too. This notation has a drawback, it
defines the labels as global constants, so you can’t have another struct or label declared with same name, to
solve this problem you can use the following notation:
struc task
.id: resd 1
.name: resb 8
endstruc
Now we can access the fields inside our struct in a familiar way: struct_name.field_name. What’s really
happening here is the assembler will add the offset of field_name to the base address of struct_name to give
us the real address of this variable.
Now, if we have a memory location or register that contains the address of our structure, for example let's say the pointer to our structure is stored in the register rax and we want to copy the id field into a register:
mov ebx, dword [rax + task.id]
Note that the destination is ebx rather than rbx: id is a dword, so the operand sizes must match (and on x86_64, writing to ebx zero-extends into rbx).
This is how to access a struct: basically we add the label representing an offset to its base address. What if we want to create an instance of it? In this case we can use the macros istruc and iend, and use at to access the fields. For example, if we want to create an instance of task with the value 1 for the id field and "hello123" for the name field, we can use the following syntax:
istruc task
at id dd 1
at name db 'hello123'
iend
In this way we have declared a struc for the first of the two examples. But again, this doesn't work with the second one, because the labels are different. In that case we have to use the full label name (that means adding the task prefix):
istruc task
at task.id dd 1
at task.name db 'hello123'
iend
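A related sketch: per the nasm manual (linked in the resources appendix), endstruc also defines a task_size symbol holding the total size of the structure, which we can use to reserve uninitialized space for an instance (my_task is our own name):
section .bss
my_task:
resb task_size ; reserve enough bytes for one task instance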
Appendix E
Cross Platform Building
This appendix looks at how and why we might want to build a cross compilation toolchain, and how to build
some other tools like gdb and qemu. These tools include:
• binary utils like a linker, readelf, objdump, addr2line.
• a compiler for our chosen language.
• a debugger, like gdb.
• an emulator for testing, like qemu.
For most of these tools (except clang/LLVM) we'll need to build them specifically for the target architecture we want to support. The build processes covered below are intended to be done on a unix-style system; if developing in a windows environment the process will likely be different and is not covered here.
For the binary utils and compiler there are two main vendors: the GNU toolchain, consisting of GCC and binutils, and the LLVM toolchain, which uses clang as the compiler frontend.
• libmpfr-dev.
• texinfo.
• libcloog.
There’s three environment variables we’re going to use during the build process:
export PREFIX="/your/path/to/cross/compiler"
export TARGET="x86_64-elf"
export PATH="$PREFIX/bin:$PATH"
The PREFIX environment variable stores the directory we want to install the toolchain into after building, TARGET is the target triplet we're building for, and we also modify the shell PATH variable to include the prefix directory. To permanently add the cross compiler to your path, you may want to add this last line (where we updated PATH) to your shell's configuration. If you're unsure what shell you use, it's probably bash, and you will want to edit your .bashrc. As for where to install, this is up to personal preference, but if unsure a 'tools' directory under our home folder should work. This is nice because the install process doesn't require root permissions. If we want all users on the system to be able to access it, we could install somewhere under /usr/ instead.
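As a concrete sketch, assuming we picked a 'tools' directory under home as the prefix (this path is our own choice):
echo 'export PATH="$HOME/tools/cross/bin:$PATH"' >> ~/.bashrc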
The process of building binutils and GCC follows a pattern:
• first we’ll create a directory to hold all the temporary build files and change into it.
• next is to generate the build system files using a script.
• then we can build the targets we want, before installing them.
E.2.2 Binutils
These are a set of tools to create and manage programs, including the linker (ld), the GNU assembler (as,
referred to as GAS), objcopy (useful for inserting non-code files into binaries) and readelf.
The flags we’ll need to pass to the configure script are:
• --prefix=$PREFIX: this tells the script where we want to install programs after building. If omitted
this will use a default value, but this is not recommended.
• --target=$TARGET: tells the script which target triplet we want to use.
• --with-sysroot: the build process needs a system root to include any system headers from. We don’t
want it to use the host headers so by just using the flag we point the system root to an empty directory,
disabling this functionality - which is what we want for a freestanding toolchain.
• --disable-nls: this helps reduce the size of the generated binaries by disabling native language support.
If you want these tools to support non-english languages you may want to omit this option (and keep
nls enabled).
• --disable-werror: tells the configure script not to add -Werror to the generated compile commands. This option may be needed depending on whether you have any missing dependencies.
Before we can do this, however, we'll need to obtain a copy of the source code for GNU binutils. This should be available on their website or can be downloaded from the ftp mirror: https://ftp.gnu.org/gnu/binutils/.
Once downloaded, extract the source into a directory, then create a new directory for holding the temporary build files and enter it. If unsure of where to put this, a sibling directory to where the source code was extracted works well. An example might be:
mkdir binutils_build
cd binutils_build
From this temporary directory we can use the configure script to generate the build files, like so:
/path/to/binutils_src/configure --target=$TARGET --prefix=$PREFIX --with-sysroot --disable-nls --disable-werror
At this point the configure script has generated a makefile for us with our requested options, now we can do
the usual series of commands:
make
make install
That’s it! Now all the binutils tools are installed in the PREFIX directory and ready to be used.
E.2.3 GCC
The process for building GCC is very similar to binutils. Note that we need a version of binutils for our target triplet before trying to build GCC, and those binaries must be in the path, which we set up earlier.
Let’s create a new folder for the build files (build_gcc) and move into it.
Now we can use the configure script like before:
/path/to/gcc_sources/configure --target=$TARGET --prefix=$PREFIX --disable-nls --enable-languages=c,c++ --without-headers
For brevity we’ll only explain the new flags:
• --enable-languages=c,c++: select which language frontends to enable, these two are the default. We
can disable c++ but if we plan to cross compile more things than our kernel this can be nice to have.
• --without-headers: tells the compiler not to rely on any headers from the host and instead generate
its own.
Once the script has finished we can run a few make targets to build the compiler and its support libraries. Running the default make/make all target is not recommended, as this builds everything for a full userspace compiler. We don't need all of that, and it can take a lot of time. For a freestanding program like a kernel we only need the compiler and libgcc.
make all-gcc
make all-target-libgcc
make install-gcc
make install-target-libgcc
Libgcc contains code the compiler might add calls to for certain operations. This happens mainly when an operation the compiler tries to perform isn't supported by the target hardware and has to be emulated in software. GCC states that it can emit calls to libgcc functions anywhere, and that we should always link against it; the linker can remove the unused parts of the code if they're never called. This setup is specific to GCC.
In practice we can get away with not linking to libgcc, but this can result in unexpected linker errors. Best practice here is to build and link with libgcc.
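As a hedged sketch of what that looks like in practice (the object file and linker script names here are our own examples):
x86_64-elf-gcc -ffreestanding -nostdlib -T linker.ld kernel.o -o kernel.elf -lgcc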
E.4 GDB
The steps for building GDB are similar to binutils and GCC. We'll create a temporary working directory and move into it. Gdb has a few extra dependencies we'll need (names can differ depending on the distribution used):
• libncurses-dev
• libsource-highlight-dev
/path/to/gdb_sources/configure --target=$TARGET --host=x86_64-linux-gnu --prefix="$PREFIX" --disable-werror --enable-tui --enable-source-highlight
The last two options enable compiling the text-user-interface (--enable-tui) and source code highlighting
(--enable-source-highlight) which are nice-to-haves. These flags can be safely omitted if these aren’t
features we want.
The --target= flag is special here in that it can also take the option all, which builds gdb with support for every architecture it can support. If we're going to develop on one machine but test on multiple architectures (via qemu or real hardware) this is nice: it allows a single instance of gdb to debug multiple architectures without needing different versions of gdb. Often this is how the 'gdb-multiarch' package is created for distros that have it.
The --host= flag specifies the host system that will run gdb, in the example above an x86_64-linux-gnu system; this should be changed to match the system used.
After running the configure script, we can build and install our custom gdb like so:
make all-gdb
make install-gdb
Appendix F
Debugging
F.1 GDB
F.1.1 Remote Debugging
First, Qemu needs to be launched telling it to accept connections from gdb. This is done by adding the parameters -s and -S to the command, where:
• -s is shorthand for -gdb tcp::1234
• -S tells the emulator to halt before starting the CPU, giving us time to connect the debugger before the OS starts.
To connect to a qemu (or bochs) host configured for remote debugging, launch gdb and type the following command in the gdb cli:
target remote localhost:1234
And then we can load the symbols (if we have compiled our os with debugging symbols):
symbol-file path/to/kernel.bin
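Putting it all together, a minimal sketch of a session might look like this (the iso and kernel file names are our own examples):
qemu-system-x86_64 -cdrom kernel.iso -s -S
# then, from another terminal:
gdb -ex "target remote localhost:1234" -ex "symbol-file path/to/kernel.bin"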
F.1.3 Navigation
• c/continue can be used to continue the program until the next breakpoint is reached.
• fin/finish will run until the end of the current function.
• s/step will run the next line of code, and if the next line is a function call, it will 'step into' it. This leaves us on the first line of the new function.
• n/next similar to step, but will ‘step over’ any function calls, treating them like the rest of the lines of
code.
• si/ni step instruction/next instruction. These are like step and next, but work on instructions, rather
than lines of code.
F.1.6 Breakpoints
A breakpoint can be set in a variety of ways! The command is b/break symbol, where symbol can be a
number of things:
• a function entry: break init or break init() will break at any function named init.
• a file/line number: break main.c:42 will break just before the first instruction of line 42 (in main.c) is
executed.
• an address in memory: break 0x1234 will break whenever the cpu’s next instruction is at address
0x1234.
Breakpoints can be enabled/disabled at runtime with enable x/disable x where x is the breakpoint number
(displayed when we first set it).
Breakpoints can also take conditions if we’re trying to debug a commonly-run function. The syntax follows
a c-like style, and is pretty forgiving. For example: break main if i == 0 would break at the function
main() whenever the variable i is equal to 0. This syntax supports all sorts of things, like casts and working
with pointers.
Breakpoints can also be set contextually! If we're at a breakpoint in main.c:123, we can simply use b 234 to break at line 234 in the same file.
It is possible at any time to print the list of breakpoints using the command: info breakpoints
F.1.7 Variables
While debugging with gdb, we can change the value of variables in the code being executed. To do that we just need the command:
set var variable_name=value
where variable_name is a variable present in the code being debugged. This is extremely useful when we want to test edge cases that are hard to reproduce.
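For example, to force a specific value into a loop counter (counter is a hypothetical variable name in the debugged code):
set var counter=100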
F.3 Qemu
F.4 Qemu Interrupt Log
If using qemu, a good idea is to dump the registers when an exception occurs; we just need to add the following option to the qemu command:
qemu -d int
Sometimes we need to stop the emulator from restarting on a triple fault; in this case, to catch the "offending" exception, just add:
qemu -d int -no-reboot
While debugging with gdb, we may want to keep qemu hanging after a triple fault (when the cpu would normally reset) to do some more investigation; in this case we also need to add -no-shutdown along with -no-reboot.
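Combining the three options:
qemu -d int -no-reboot -no-shutdown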
One way to start the Qemu monitor on a unix system is by using the following parameter when starting Qemu:
qemu-system-i386 [..other params..] -monitor unix:qemu-monitor-socket,server,nowait
then, in another shell in the same folder where we started the emulator, launch the following command:
socat -,echo=0,icanon=0 unix-connect:qemu-monitor-socket
Another method is to use Telnet to start the monitor. This is handy for easier, cross-platform or remote use
(albeit less secure).
qemu-system-i386 [..other params..] -monitor telnet::45454,server,nowait
This tells the monitor to listen on the specified port (i.e. 45454). You can then connect to the QEMU monitor from another terminal or a remote machine (with networking set up) using Telnet:
telnet localhost 45454
Both of these will prompt with a shell similar to the following:
QEMU 6.1.0 monitor - type 'help' for more information
(qemu)
Once the monitor is running, it is possible to send commands directly to the emulator. Below is a list of useful commands:
• help This is the first command to get some help using the monitor.
• info xxxx It will print information depending on xxxx, for example: info lapic will show the current status of the local apic, info mem will print the current virtual memory mappings, and info registers will print the contents of the registers.
• x /cf address Displays the content of c virtual memory locations starting from address, where c is the number of items we want to display (in decimal) and f is the format (x for hex, c for char, etc).
• xp /cf address Same as above, but for physical memory.
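For example, a short monitor session might look like this (the address is our own example):
(qemu) info registers
(qemu) x /4x 0x100000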
F.4.2 Debugcon
Qemu (and several other emulators, bochs for example) support something called debugcon. It's an extremely simple protocol, similar to serial but with no config, where anything written to port 0xE9 in the VM will appear byte-for-byte wherever we tell qemu to put it. The same is true for input (although this is quite buggy, so it's best to use serial for that). To enable it in qemu, add -debugcon where to the qemu flags. Where can be almost anything, a log file for example. We can even use -debugcon /dev/stdout to have the output appear on the current terminal.
It's worth noting that because this is just a binary stream, and not a serial device emulation, it's much faster than usual port i/o, and there's no state management or device setup to worry about.
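Since it's just a port write, printing a byte from the kernel is a one-liner. A minimal nasm sketch:
mov al, 'H'  ; the byte to send
out 0xE9, al ; write it to the debugcon port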
Appendix G
Memory Protection
This appendix is a collection of useful strategies for memory protection. This mainly serves as a reminder
that these features exist, and are worth looking into!
G.1 WP bit
On x86_* platforms, we have the R/W bit in our page tables. This flag must be enabled in order for a page
to be written to, otherwise the page is read-only (assuming the page is present at all).
However this is not actually true! Supervisor accesses (rings 0/1/2, not ring 3) can write to read-only pages by default. This is not as bad as it might seem, as the kernel is usually carefully crafted to only access the memory it needs. However, when we allow user code to call into the kernel (via system calls for example), the kernel can be 'tricked' into writing to areas it wouldn't normally touch, via software or hardware bugs.
One helpful mitigation for this is to set the WP bit, which is bit 16 of cr0. Once written to cr0, any attempts
to write to a read-only page will generate a page fault, like a user access would.
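Setting the bit is a short sketch in nasm (assuming long mode, hence the 64-bit registers):
mov rax, cr0
or rax, 1 << 16 ; CR0.WP: supervisor writes now honour read-only pages
mov cr0, rax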
Once these features are known to be supported, they can be enabled like so:
• SMAP: set bit 21 in CR4.
• SMEP: set bit 20 in CR4.
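Using the bit positions above, a minimal nasm sketch to enable both:
mov rax, cr4
or rax, (1 << 21) | (1 << 20) ; SMAP (bit 21) and SMEP (bit 20)
mov cr4, rax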
Appendix H
Useful Resources
This appendix is a collection of links we found useful while developing our own kernels and these notes.
H.2 Architecture
• Intel Software developer’s manual Vol 3A APIC Chapter
• IOAPIC Datasheet: https://pdos.csail.mit.edu/6.828/2016/readings/ia32/ioapic.pdf
• Broken Thorn Osdev Book Series, The PIC: http://www.brokenthorn.com/Resources/OSDevPic.html
• Osdev wiki page for RSDP: https://wiki.osdev.org/RSDP
• Osdev wiki page for RSDT: https://wiki.osdev.org/RSDT
• OSdev Wiki - Pit page: https://wiki.osdev.org/Programmable_Interval_Timer
• Broken Thorn Osdev Book Series Chapter 16 PIC, PIT and Exceptions: http://www.brokenthorn.com/Resources/OSDev16
• Osdev Wiki Ps2 Keyboard page: https://wiki.osdev.org/PS/2_Keyboard
• Osdev Wiki Interrupts page: https://wiki.osdev.org/IRQ#From_the_keyboard.27s_perspective
• Osdev Wiki 8042 Controller page: https://wiki.osdev.org/8042_PS/2_Controller#Translation
• Scancode sets page: https://www.win.tue.nl/~aeb/linux/kbd/scancodes-10.html#scancodesets
• Brokenthorn Book Series Chapter 19 Keyboard programming: http://www.brokenthorn.com/Resources/OSDev19.html
H.5 Scheduling
• Osdev Wiki page for Scheduling Algorithm: https://wiki.osdev.org/Scheduling_Algorithms
• Operating System Three Easy Pieces (Book): https://pages.cs.wisc.edu/~remzi/OSTEP/
• Broken Thorn Osdev Book Series: http://www.brokenthorn.com/Resources/OSDev25.html
• Writing an Os in Rust by Philip Opperman Multitasking: https://os.phil-opp.com/async-await/
H.6 Userspace
• Intel Software developer’s manual Vol 3A Protection Chapter
• Wiki Osdev Page for Ring 3: https://wiki.osdev.org/Getting_to_Ring_3
• Default calling conventions for different compilers: https://www.agner.org/optimize/#manuals
H.7 IPC
• Wiki Osdev Page for IPC Data Copying: https://wiki.osdev.org/IPC_Data_Copying_methods
• Wiki Osdev Page Message Passing Tutorial: https://wiki.osdev.org/Message_Passing_Tutorial
• Wikipedia IPC page: https://en.wikipedia.org/wiki/Inter-process_communication
• InterProcess communication by GeeksForGeeks: https://www.geeksforgeeks.org/inter-process-communication-ipc/
H.11 Nasm
• Nasm Struct Section: https://www.nasm.us/xdoc/2.15/html/nasmdoc5.html#section-5.9.1
• Nasm String section: https://www.nasm.us/xdoc/2.15/html/nasmdoc3.html#section-3.4.2
H.12 Debugging
• Osdev Wiki Page for kernel debugging: https://wiki.osdev.org/Kernel_Debugging
• Osdev Wiki Page for Serial ports: https://wiki.osdev.org/Serial_Ports
• Debugging with Qemu at Wikibooks: https://en.wikibooks.org/wiki/QEMU/Debugging_with_QEMU
H.13 Communities
• Osdev Forum: https://forum.osdev.org/
• Operating System Development on Reddit: https://www.reddit.com/r/osdev/
• OSdev Discord server: https://discord.gg/osdev
• Friendly Operating System Devs: https://discord.gg/Vwudfxx8Sp
Appendix I
Acknowledgments
First of all a big thank you goes to Koobin Bingku who created the cover image for the printed book.
And then a special thanks to all the contributors who helped by fixing errors and spelling mistakes, or by providing critique.
In no particular order:
• @xyve7 (https://github.com/xyve7)
• @Pigmy-penguin (https://github.com/Pigmy-penguin)
• @Zenails (https://github.com/zenails)
• @qvjp (https://github.com/qvjp)
• @leecannon (https://github.com/leecannon)
• @seekbytes (https://github.com/seekbytes)
• @Flukas88 (https://github.com/Flukas88)
• @Rubo3 (https://github.com/Rubo3)
• @ajccosta (https://github.com/ajccosta)
• @maxtyson123 (https://github.com/maxtyson123)
• @Moldytzu (https://github.com/Moldytzu)
• @AnErrupTion (https://github.com/AnErrupTion)
• @MRRcode979 (https://github.com/MRRcode979)
• @Hqnnqh (https://github.com/Hqnnqh)
• @malletgaetan (https://github.com/malletgaetan)
• @mrjbom (https://github.com/mrjbom)
• @AFellowSpeedrunner (https://github.com/AFellowSpeedrunner)
Appendix J
Updates To The Printed Edition
J.1.2 Revision 1
Release date: August 2023. Second book release.
• Add Cross Compiling appendix chapter
• Linker chapter improvements
• Various typo fixes
• Add more details on the Thread Sleep section
• Improve readability of Virtual Memory Management illustration.
• Fix broken links
• Fix Cover Alignment
• Add paragraph about lockfree queues in IPC chapter
• Add more details in the VMM Section of Process and Threads chapter
• Explain content of Segment Selectors in GDT chapter
• Improve readability of some parts inside Userspace chapters
J.1.3 Revision 2
Release date: December 2023. Third book release.
• Add more information on the Memory Protection chapter, about its future implications
• Emergency grammar fix!
• Fix tss structure in Userspace/Handling_Interrupt chapter
• Add watchpoint information on gdb chapter
• Add explanation on how to test when entering userspace
• Fix some examples in the Nasm appendix, and add new section with loop cycle example.
J.1.4 Revision 3
Release date: April 2024. Fourth book release.
• Typo fixes
• Expand Syscall example chapter
• Expand ELF chapter, and fix code in one of the examples
J.1.5 Revision 4
Release date: September 2024. Fifth book release.
• Typo fixes
• Code (and typo) fixes on the framebuffer chapter.
• Add explanation of potential bug on converting GIMP Header file to RGB.
• Added missing flags on page table/dirs entries
• Rewrite and rearrangement of syscall/sysret section
J.1.6 Revision 5
Release date: January 2025. Sixth book release.
• Stivale 2 protocol sections have been replaced with Limine protocol, since stivale2 has been deprecated.
• Add a complete example of how to create an ELF executable for our kernel
• Typo and error fixes
Appendix K
Licence
License”). To the extent this Public License may be interpreted as a contract, You are granted the Licensed
Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such
rights in consideration of benefits the Licensor receives from making the Licensed Material available under
these terms and conditions.
1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide,
royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the
Licensed Material to:
A. reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and
B. produce, reproduce, and Share Adapted Material for NonCommercial purposes only.
2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply
to Your use, this Public License does not apply, and You do not need to comply with its terms and
conditions.
3. Term. The term of this Public License is specified in Section 6(a).
4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the
Licensed Rights in all media and formats whether now known or hereafter created, and to make technical
modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority
to forbid You from making technical modifications necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective Technological Measures. For purposes of this
Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted
Material.
5. Downstream recipients.
A. Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this
Public License.
B. No downstream restrictions. You may not offer or impose any additional or different terms or conditions
on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of
the Licensed Rights by any recipient of the Licensed Material.
6. No endorsement. Nothing in this Public License constitutes or may be construed as permission to
assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored,
endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided
in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives
and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow
You to exercise the Licensed Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this Public License.
3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of
the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable
statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right
to collect such royalties, including when the Licensed Material is used other than for NonCommercial
purposes.
i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution,
in any reasonable manner requested by the Licensor (including by pseudonym if designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of warranties;
v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means,
and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the
conditions by providing a URI or hyperlink to a resource that includes the required information.
3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to
the extent reasonably practicable.
4. If You Share Adapted Material You produce, the Adapter’s License You apply must not prevent recipients
of the Adapted Material from complying with this Public License.
c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner
that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.