Linux-Insides Interrupts Linux-Interrupts Part 3
Interrupts and Interrupt Handling. Part 3.

Exception Handling

This is the third part of the chapter about interrupt and exception handling in the
Linux kernel. In the previous part we stopped at the setup_arch function from the
arch/x86/kernel/setup.c source code file.

We already know that this function performs architecture-specific initialization. In
our case the setup_arch function does x86_64 related initialization. The setup_arch
function is big, and in the previous part we stopped at the setup of two exception
handlers, for the #DB and #BP exceptions. These exceptions allow the x86_64
architecture to have early exception processing for the purpose of debugging via kgdb.

As you may remember, we set these exception handlers in the early_trap_init function
from the arch/x86/kernel/traps.c. We already saw the implementation of the
set_intr_gate_ist and set_system_intr_gate_ist functions in the previous part, and now
we will look at the implementation of these two exception handlers.

Ok, we set up exception handlers in the early_trap_init function for the #DB and #BP
exceptions and now it is time to consider their implementations. But before we do
this, let's look at the details of these exceptions.

The first exception, #DB or debug exception, occurs when a debug event occurs, for
example an attempt to change the contents of a debug register. Debug registers are
special registers that were introduced in x86 processors starting with the Intel 80386
processor and, as you can understand from the name of this CPU extension, the main
purpose of these registers is debugging.

These registers allow setting breakpoints on code and on reads or writes of data in
order to trace it. Debug registers may be accessed only in privileged mode, and an
attempt to read or write the debug registers when executing at any other privilege
level causes a general protection fault exception. That's why we have used
set_intr_gate_ist for the #DB exception, but not set_system_intr_gate_ist.

The vector number of the #DB exception is 1 (we pass it as X86_TRAP_DB) and, as we
may read in the specification, this exception has no error code.

The second exception, #BP or breakpoint exception, occurs when the processor executes
the int 3 instruction. Unlike the #DB exception, the #BP exception may occur in
userspace. We can add it anywhere in our code. For example, let's look at this simple
program:
    #include <stdio.h>

    int main(void)
    {
        int i = 0;  /* initialized so that the loop starts from 0 */

        while (i < 6) {
            printf("i equal to: %d\n", i);
            __asm__("int3");  /* raises the #BP exception */
            ++i;
        }
        return 0;
    }

If we compile and run this program, we will see the following output:

    $ gcc breakpoint.c -o breakpoint
    $ ./breakpoint
    i equal to: 0
    Trace/breakpoint trap

But if we run it with gdb, we will see our breakpoint and can continue execution of
our program:

    $ gdb breakpoint
    ...
    ...
    ...
    (gdb) run
    Starting program: /home/alex/breakpoints
    i equal to: 0
    ...
    ...
    ...

Now we know a little about these two exceptions and we can move on to consideration
of their handlers.

Preparation before an exception handler

As you may have noted before, the set_intr_gate_ist and set_system_intr_gate_ist
functions take the address of an exception handler in their second parameter. In our
case the two exception handlers will be:

debug;
int3.

You will not find these functions in the C code. All that can be found in the kernel's
*.c/*.h files is the declaration of these functions, located in the
arch/x86/include/asm/traps.h kernel header file:

    asmlinkage void debug(void);

The int3 handler is declared there in the same way.
    .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
    ENTRY(\sym)
    ...
    ...
    ...
    END(\sym)
    .endm

Before we consider the internals of the idtentry macro, we should know the state of
the stack when an exception occurs. As we may read in the Intel® 64 and IA-32
Architectures Software Developer's Manual 3A, the state of the stack when an exception
occurs is the following:

        +------------+
    +40 | %SS        |

But it is not only a fake error code. Moreover, -1 also represents an invalid system
call number, so that the system call restart logic will not be triggered.

The last two parameters of the idtentry macro, shift_ist and paranoid, allow us to
know whether an exception handler runs on a stack from the Interrupt Stack Table or
not. You may already know that each kernel thread in the system has its own stack. In
addition to these stacks, there are some specialized stacks associated with each
processor in the system. One of these stacks is the exception stack. The x86_64
architecture provides a special feature which is called the Interrupt Stack Table.
This feature allows switching to a new stack for designated events such as atomic
exceptions like double fault. So the shift_ist parameter allows us to know whether we
need to switch to an IST stack for an exception handler or not.
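As an illustration of the stack state that an exception leaves behind, the frame can be modelled as a C structure. This is only a sketch under stated assumptions: the name exception_frame and the helper exception_frame_cs_offset are hypothetical and do not exist in the kernel, and the layout assumes the x86_64 case where an error code (real or fake) sits at the top of the stack. Only the +40 offset of %SS comes from the diagram in this part; the remaining offsets follow from the frame consisting of six 8-byte slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the frame found at %rsp on entry to an exception
 * handler on x86_64, lowest address first. Not a kernel structure.
 */
struct exception_frame {
    uint64_t error_code; /* +0:  pushed by the CPU, or a fake one pushed by the entry code */
    uint64_t rip;        /* +8:  saved instruction pointer */
    uint64_t cs;         /* +16: saved code segment selector */
    uint64_t rflags;     /* +24: saved flags */
    uint64_t rsp;        /* +32: stack pointer of the interrupted context */
    uint64_t ss;         /* +40: saved stack segment selector */
};

/* Compile-time checks that the model matches the offsets it claims. */
static_assert(offsetof(struct exception_frame, ss) == 40, "SS is at +40");
static_assert(offsetof(struct exception_frame, rip) == 8, "RIP is at +8");
static_assert(sizeof(struct exception_frame) == 48, "six 8-byte slots");

/* Offset of the saved CS relative to the start of the frame. */
static size_t exception_frame_cs_offset(void)
{
    return offsetof(struct exception_frame, cs);
}
```

The saved CS slot is the one that the entry code later inspects to find out which privilege level was interrupted.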
The second parameter, paranoid, defines the method which helps us to know whether we
came to an exception handler from userspace or not. The easiest way to determine this
is to check the CPL or Current Privilege Level in the CS segment register. If it is
equal to 3, we came from userspace; if it is zero, we came from kernel space:

    testl $3,CS(%rsp)
    jnz userspace
    ...
    ...
    ...
    // we are from the kernel space

But unfortunately this method does not give a 100% guarantee. As described in the
kernel documentation:

if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, which might
have triggered right after a normal entry wrote CS to the stack but before we
executed SWAPGS, then the only safe way to check for GS is the slower method: the
RDMSR.

In other words, an NMI, for example, could happen inside the critical section of a
swapgs instruction. In this case we should check the value of the MSR_GS_BASE model
specific register, which stores a pointer to the start of the per-cpu area. So to
check whether we came from userspace or not, we check the value of the MSR_GS_BASE
model specific register: if it is negative we came from kernel space, otherwise we
came from userspace:

    movl $MSR_GS_BASE,%ecx
    rdmsr
    testl %edx,%edx
    js 1f

In the first two lines of code we read the value of the MSR_GS_BASE model specific
register into the edx:eax pair. We can't set a negative value to gs from userspace.
On the other hand, we know that the direct mapping of the physical memory starts from
the 0xffff880000000000 virtual address. Thus MSR_GS_BASE will contain an address from
0xffff880000000000 to 0xffffc7ffffffffff. After the rdmsr instruction is executed, the
smallest possible value in the %edx register will be 0xffff8800, which is -30720 as a
signed 4-byte value. That's why the kernel space gs, which points to the start of the
per-cpu area, will contain a negative value.

After we push the fake error code on the stack, we should allocate space for general
purpose registers with the

    ALLOC_PT_GPREGS_ON_STACK

macro, which is defined in the arch/x86/entry/calling.h header file. This macro just
allocates 15*8 bytes of space on the stack to preserve general purpose registers:

    .macro ALLOC_PT_GPREGS_ON_STACK addskip=0
    addq $-(15*8+\addskip), %rsp
    .endm

So the stack will look like this after execution of ALLOC_PT_GPREGS_ON_STACK:

         +------------+
    +160 | %SS        |
    +152 | %RSP       |
    +144 | %RFLAGS    |
    +136 | %CS        |
    +128 | %RIP       |
    +120 | ERROR CODE |
         |------------|
    +112 |            |
    +104 |            |
     +96 |            |
     +88 |            |
     +80 |            |
     +72 |            |
     +64 |            |
     +56 |            |
     +48 |            |
     +40 |            |
     +32 |            |
     +24 |            |
     +16 |            |
      +8 |            |
      +0 |            | <- %RSP
         +------------+

After we have allocated space for general purpose registers, we do some checks to
understand whether an exception came from userspace or not and, if yes, we should move
back to the interrupted process stack or stay on the exception stack:

    .if \paranoid
    .if \paranoid == 1
    testb $3, CS(%rsp)
    jnz 1f
    .endif
    call paranoid_entry
    .else
    call error_entry
    .endif

Let's consider all of these three cases in turn.

An exception occurred in userspace

First, let's consider the case when an exception has paranoid=1, like our debug and
int3 exceptions. In this case we check the selector from the CS segment register and
jump to the 1f label if we came from userspace; otherwise paranoid_entry will be
called.

Let's consider the first case, when we came from userspace to an exception handler.
As described above, we should jump to the 1 label. The 1 label starts with the call of
the

    call error_entry

routine, which saves all general purpose registers in the previously allocated area
on the stack:

    SAVE_C_REGS 8
    SAVE_EXTRA_REGS 8

The stack now looks like this:

         +------------+
    +160 | %SS        |
    +152 | %RSP       |
    +144 | %RFLAGS    |
    +136 | %CS        |
    +128 | %RIP       |
    +120 | ERROR CODE |
         |------------|
    +112 | %RDI       |
    +104 | %RSI       |
     +96 | %RDX       |
     +88 | %RCX       |
     +80 | %RAX       |
     +72 | %R8        |
     +64 | %R9        |
     +56 | %R10       |
     +48 | %R11       |
     +40 | %RBX       |
     +32 | %RBP       |
     +24 | %R12       |
     +16 | %R13       |
      +8 | %R14       |
      +0 | %R15       | <- %RSP
         +------------+

After the kernel has saved general purpose registers on the stack, we check once more
whether we came from userspace.
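Both ways of deciding where an exception came from, the fast one and the slow one, boil down to simple bit tests. The sketch below restates them in C for illustration only. The helper names are hypothetical, the kernel performs these checks in assembly, and the selector values 0x10 and 0x33 used in the note after the code are assumed typical x86_64 Linux values rather than something stated in this part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The two low bits of a segment selector hold the privilege level. */
#define SELECTOR_PL_MASK 0x3u

/* On the usual two's complement compilers 0xffff8800 reads as -30720. */
static_assert((int32_t)(uint32_t)0xffff8800u == -30720, "high half is negative");

/* Fast check: mirrors `testb $3, CS(%rsp)`; non-zero bits mean userspace. */
static bool cs_is_userspace(uint64_t saved_cs)
{
    return (saved_cs & SELECTOR_PL_MASK) != 0;
}

/*
 * Slow check, first half: rdmsr leaves the upper 32 bits of the MSR in %edx.
 * This helper returns that half reinterpreted as a signed value.
 */
static int32_t gs_base_high_half(uint64_t gs_base)
{
    return (int32_t)(uint32_t)(gs_base >> 32);
}

/* Slow check, second half: mirrors `testl %edx, %edx` followed by `js`. */
static bool gs_base_is_kernel(uint64_t gs_base)
{
    return gs_base_high_half(gs_base) < 0;
}
```

With these helpers, cs_is_userspace(0x33) is true and cs_is_userspace(0x10) is false, while gs_base_high_half(0xffff880000000000) evaluates to -30720, so gs_base_is_kernel reports a kernel GS base for every address in the direct mapping range.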
Then we put the base address of the stack pointer into the %rdi register, which will
be the first argument (according to the x86_64 ABI) of the sync_regs function, and call
this function, which is defined in the arch/x86/kernel/traps.c source code file:

    asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
    {
        struct pt_regs *regs = task_pt_regs(current);
        *regs = *eregs;
        return regs;
    }

This function takes the result of the task_pt_regs macro, which is defined in the
arch/x86/include/asm/processor.h header file, copies the saved registers to that
location and returns it. The task_pt_regs macro expands to the address of the pt_regs
structure located just below thread.sp0, which represents the pointer to the normal
kernel stack:

    #define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)

As we came from userspace, the exception handler will run in real process context.
After we get the stack pointer from sync_regs we switch the stack:

    movq %rax, %rsp

The last two steps before an exception handler calls the secondary handler are:

1. Passing a pointer to the pt_regs structure, which contains the preserved general
purpose registers, to the %rdi register:

    movq %rsp, %rdi

as it will be passed as the first parameter of the secondary exception handler.

2. Passing the error code to the %rsi register, as it will be the second argument of
an exception handler, and setting it to -1 on the stack for the same purpose as
before, to prevent restart of a system call:

    .if \has_error_code
    movq ORIG_RAX(%rsp), %rsi
    movq $-1, ORIG_RAX(%rsp)
    .else
    xorl %esi, %esi
    .endif

Additionally, you may see that we zeroed the %esi register above in the case when an
exception does not provide an error code.

In the end we just call the secondary exception handler:

    call \do_sym

which:

    dotraplinkage void do_debug(struct pt_regs *regs, long error_code);

will be for the debug exception and:

    dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);

will be for the int 3 exception. In this part we will not see the implementations of
secondary handlers, because they are very specific, but we will see some of them in
one of the next parts.

We just considered the first case, when an exception occurred in userspace. Let's
consider the last two.

An exception with paranoid > 0 occurred in kernelspace

In this case an exception occurred in kernelspace and the idtentry macro is defined
with paranoid=1 for this exception. This value of paranoid means that we should use
the slower way that we saw at the beginning of this part to check whether we really
came from kernelspace or not. The paranoid_entry routine allows us to know this:

    ENTRY(paranoid_entry)
    cld
    SAVE_C_REGS 8
    SAVE_EXTRA_REGS 8
    movl $1, %ebx
    movl $MSR_GS_BASE, %ecx
    rdmsr
    testl %edx, %edx
    js 1f
    SWAPGS
    xorl %ebx, %ebx
    1: ret
    END(paranoid_entry)

As you may see, this function does the same thing that we covered before. We use the
second (slow) method to get information about the previous state of an interrupted
task. As we checked this and executed SWAPGS in the case when we came from userspace,
we should do the same as we did before: we need to put a pointer to the structure
which holds general purpose registers into %rdi (which will be the first parameter of
a secondary handler) and put the error code, if an exception provides it, into %rsi
(which will be the second parameter of a secondary handler):

    movq %rsp, %rdi
    .if \has_error_code
    movq ORIG_RAX(%rsp), %rsi
    movq $-1, ORIG_RAX(%rsp)
    .else
    xorl %esi, %esi
    .endif

The last step before the secondary handler of an exception is called is the cleanup
of the new IST stack frame. You may remember that we passed shift_ist as an argument
of the idtentry macro. At this point its value is checked and, if it is not equal to
-1, we get a pointer to a stack from the Interrupt Stack Table by the shift_ist index
and set it up.

In the end of this second way we just call the secondary exception handler as we did
before:

    call \do_sym

The last method is similar to both previous ones, but the exception occurred with
paranoid=0 and we may use the fast method of determining where we came from.

Exit from an exception handler

After the secondary handler finishes its work, we return to the idtentry macro and the
next step is a jump to the error_exit:

    jmp error_exit

routine. The error_exit function is defined in the same arch/x86/entry/entry_64.S
assembly source code file and the main goal of this function is to know where we are
from (from userspace or kernelspace) and to execute SWAPGS depending on this. It
restores registers to their previous state and executes the iret instruction to
transfer control to the interrupted task.

That's all.

Conclusion

It is the end of the third part about interrupts and interrupt handling in the Linux
kernel. We saw the initialization of the Interrupt Descriptor Table in the previous
part with the #DB and #BP gates, and in this part we started to dive into the
preparation before control is transferred to an exception handler and into the
implementation of some interrupt handlers. In the next part we will continue to dive
into this theme, go further through the setup_arch function and try to understand
interrupt handling related stuff.

Links

Debug registers
Intel 80386
INT 3
gcc
TSS
GNU assembly .error directive
dwarf2
CFI directives
IRQ
system call
swapgs
SIGTRAP
Per-CPU variables
kgdb
ACPI
Previous part