oss-sec mailing list archives
Re: StackRot (CVE-2023-3269): Linux kernel privilege escalation vulnerability
From: Ruihan Li <lrh2000 () pku edu cn>
Date: Fri, 28 Jul 2023 21:25:12 +0800
Hi all, The PoC code and exploit write-up of the StackRot vulnerability has been published to [GitHub][gh]. I'm going to copy the write-up here, but I recommend the GitHub version because there are some figures embedded. [gh]: https://github.com/lrh2000/StackRot Besides, the exploit is verified by the GitHub CI. Therefore, the most straightforward way to understand how this exploit operates without having to set up a local environment is by reviewing the GitHub CI pass procedure, which can be found [here][ci]. [ci]: https://github.com/lrh2000/StackRot/actions ## Exploit The exploit primarily focuses on the Google kCTF challenge, specifically when neither CONFIG_PREEMPT nor CONFIG_SLAB_MERGE_DEFAULT is set. To exploit StackRot, the most important task is to locate a VMA iteration that fulfills the following criteria: 1. The iteration's timing can be controlled. This control allows us to ensure that the RCU grace period concludes during the VMA iteration. 2. The iteration retrieves specific information from the VMA structure, and returns the information to the userspace. This feature enables us to exploit the UAF vulnerability of the maple node to leak some kernel addresses. 3. The iteration invokes certain function pointers in the VMA structure. This particular capability allows us to exploit the UAF of the maple node to control the kernel-mode program counter (PC). The chosen VMA iteration is the iteration responsible for generating the contents of `/proc/[pid]/maps`. The following sections will show how this iteration satisfies the above criteria. ### Step 0: From UAFBR to UAF During any VMA iteration, the reference to the root node of the VMA tree is obtained, and the iteration proceeds through its slots. Thus, by triggering stack expansion in another thread on a separate CPU during the VMA iteration, the node replacement can be concurrently initiated. At this point, accessing the old node is considered a use-after-free-by-RCU (UAFBR) situation. However, actual issues arise only when the old node is truly freed, which occurs in the RCU callback. This presents two challenges: (i) determining when the old node is freed and (ii) ensuring that the VMA iteration does not complete before the old node is freed. The first question is relatively straightforward. In the kernel, the `synchronize_rcu()` function can be employed to wait until the RCU grace period concludes, ensuring that all pre-existing RCU callbacks have been invoked. In userspace, system calls that ultimately call `synchronize_rcu()` can be utilized for the same purpose. Thus, when such system calls terminate, it is known that the old node has been freed. Notably, there is a system call, `membarrier(MEMBARRIER_CMD_GLOBAL, 0, -1)`, that solely invokes `synchronize_rcu()`. ```c SYSCALL_DEFINE3(membarrier, int, cmd, unsigned int, flags, int, cpu_id) { // ... switch (cmd) { // ... case MEMBARRIER_CMD_GLOBAL: /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */ if (tick_nohz_full_enabled()) return -EINVAL; if (num_online_cpus() > 1) synchronize_rcu(); return 0; // ... } } ``` The second question necessitates further consideration. Several potential solutions are as follows: 1. The iteration task gets preempted, the RCU grace period ends, and the iteration resumes execution. However, this approach is ineffective if CONFIG_PREEMPT is not set. 2. The iteration task enters a sleep state (e.g., waiting for I/O), the RCU grace period ends, and the iteration continues. Currently, I am unaware of any VMA iteration that satisfies this requirement and can be exploited to leak kernel addresses and control the program counter (PC). It may exist, but a thorough investigation is required. 3. The iteration task experiences an interruption (e.g., timer interrupt), during which the RCU grace period concludes. It is possible to employ timerfd to create multiple hardware timers that, upon timeout during the VMA iteration, can trigger a lengthy interrupt. However, this approach is not viable because the interrupt handler operates with interrupts disabled, and if a CPU cannot handle inter-processor interrupts (IPIs), the RCU grace period will not end. 4. The iteration task is deliberately prolonged, allowing the RCU grace period to expire. This is the chosen solution. If the current RCU grace period surpasses RCU_TASK_IPI_DELAY (defaulting to 0.5 seconds), inter-processor interrupts (IPIs) are dispatched to all CPUs to verify that they are not in RCU critical sections. In the case of VMA iteration, the answer is negative, signifying that the RCU grace period concludes and the maple node is freed, effectively converting UAFBR into a genuine use-after-free (UAF) scenario. One significant observation is that during VMA iteration for `/proc/[pid]/maps`, it generates the entire file path for file-mapped memory regions. Although the directory name is typically restricted to a maximum of 255 characters, there is no limitation on the directory depth. This means that by creating a file with an exceedingly large directory depth and establishing a memory mapping for this file, accessing `/proc/[pid]/maps` can take a considerable amount of time during the VMA iteration. Consequently, this extended duration enables the possibility of concluding the RCU grace period and acquiring the UAF primitive. ```c static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) { // ... /* * Print the dentry name for named mappings, and a * special [heap] marker for the heap: */ if (file) { seq_pad(m, ' '); /* * If user named this anon shared memory via * prctl(PR_SET_VMA ..., use the provided name. */ if (anon_name) seq_printf(m, "[anon_shmem:%s]", anon_name->name); else seq_file_path(m, file, "\n"); goto done; } // ... } ``` This step is illustrated in the following figure: ![Step 0: From UAFBR to UAF](https://github.com/lrh2000/StackRot/raw/master/pic/nodes_free_and_use.png) ### Step 1: From slab UAF to page UAF Now that UAF is functioning within a slab. If CONFIG_SLAB_MERGE_DEFAULT is enabled and the slab of maple nodes merges with kmalloc-256, the contents within the old node can be controlled by allocating a new structure from kmalloc-256 and populating it with userspace data. However, if CONFIG_SLAB_MERGE_DEFAULT is not set, an alternative approach is required. In this case, one needs to return the page of the freed node to the page allocator, allowing the old node to be controlled by allocating a new page and filling it accordingly. Recall that the VMA tree will only contain one node. Hence, by utilizing `fork()`/`clone()`, multiple VMA trees and an equal number of maple nodes are generated. Assuming one slab encompasses M maple nodes, and one node per M nodes is retained while all other nodes are freed via `exit()`, the remaining nodes become the sole nodes within their respective slabs. Initially, these slabs reside in the CPU's partial list. When the partial list reaches its capacity, the slabs are flushed back to the partial list of the corresponding NUMA node. If the last maple node within a slab is freed, the slab becomes empty. If this slab resides in the partial list of a NUMA node, and the partial list of that particular NUMA node is already at maximum capacity, the page is immediately returned to the page allocator. Consequently, the slab UAF transforms into a page UAF scenario. The contents within the freed page can be manipulated by sending some data via `msgsnd()`, which allocates elastic objects and directly populates them with the provided user data. ```c static void __slab_free(struct kmem_cache *s, struct slab *slab, void *head, void *tail, int cnt, unsigned long addr) { // ... if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) goto slab_empty; // ... return; slab_empty: // ... discard_slab(s, slab); } ``` The number of maple nodes per slab, M, depends on the number of CPUs. The exploit implementation considers a situation with two CPUs and therefore assumes 16 as the value of M, as illustrated in the following figure: ![Step 1: From slab UAF to page UAF](https://github.com/lrh2000/StackRot/raw/master/pic/node_master_fengshui.png) ### Step 2: From UAF to address leaking Upon gaining control of the maple node, it becomes possible to manipulate the addresses of subsequent VMAs that will be later iterated. As the targeted iteration is aimed at generating `/proc/self/maps`, certain VMA information, such as the start and end addresses, which reside within the VMA structure, are returned to the user space. However, a challenge arises: the address of a VMA structure in the maple node can only be appropriately set if some addresses are already known. Fortunately, CVE-2023-0597 directly serves this purpose. According to CVE-2023-0597, the address of `cpu_entry_area` is not randomized. Although this vulnerability has been patched in Linux 6.2, it has not been backported to earlier stable kernels as of the time of writing. Consequently, by overwriting the address of the VMA structure with that of the last IDT entry, the entry that contains the address of `asm_sysvec_spurious_apic_interrupt` is directly leaked, thereby revealing the base addresses of the kernel code and kernel data. ![Step 2: From UAF to address leaking (1)](https://github.com/lrh2000/StackRot/raw/master/pic/node_master_code_leak.png) The previously discussed method can be used recurrently to incrementally expose more addresses from the kernel data section. For instance, the `init_task.tasks.prev` pointer within the data section directs to the `task_struct` structure of the latest created task, which is without question allocated on the heap. ![Step 2: From UAF to address leaking (2)](https://github.com/lrh2000/StackRot/raw/master/pic/node_master_heap_leak.png) When all newly established tasks are terminated, their `task_struct` structures will subsequently be deallocated. If the quantity of these tasks is large enough, the corresponding pages can be surrendered back to the page allocator. This allows for the possibility to reallocate these pages and fill them with user data. However, keep in mind that the released pages generally belong to the per-cpu page (PCP) list. For pages present in the PCP list, they can be reallocated exclusively in the same page order. Consequently, solely mapping new pages into the user space, which requires only order-0 pages from the page allocator, won't fulfill the objectives. Nonetheless, the msgsnd system call will solicit memory chunks via kmalloc and populate these chunks with user-defined data. When the kmalloc cache is exhausted, it will requisition pages from the page allocator at a specific order. If the message size is accurately adjusted, the exact order will be desired. Thus, the page whose address has been previously leaked will be reallocated. As a result, it becomes possible to obtain a page with a known address and user-manipulated data. ### Step 3: From UAF to root privileges It is now possible to forge the VMA structure in the address-known page and control the `vma->vm_ops->name` function pointer. The next step involves finding suitable gadgets to escape containers and acquire root privileges. ```c static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) { // ... if (vma->vm_ops && vma->vm_ops->name) { name = vma->vm_ops->name(vma); if (name) goto done; } // ... } ``` ![Step 3: From UAF to root privileges](https://github.com/lrh2000/StackRot/raw/master/pic/node_master_kern_exec.png) The gadget constructions are as follows: 1. Stack pivot: `movq %rbx, %rsi; movq %rbp, %rdi; call __x86_indirect_thunk_r13` -> `pushq %rsi; jmp 46(%rsi)` -> `popq %rsp; ret` -> `popq %rsp; ret`, where %rdi, %rbx, and %r13 _initially_ points to user-controllable data. 2. Gain root privileges: `popq %rdi; ret` -> `prepare_kernel_cred` -> `popq %rdi; ret` -> `movq %rax, (%rdi); ret`, where %rdi _now_ points to the stack top; `popq %rdi; ret` -> `commit_creds`, effectively executing `commit_creds(prepare_kernel_cred(&init_task))`. 3. Escape containers: `popq %rdi; ret` -> `find_task_by_vpid` -> `popq %rdi; ret` -> `movq %rax, (%rdi); ret`, where %rdi _now_ points to the stack top; `popq %rdi; ret` -> `popq %rsi; ret` -> `switch_task_namespaces`, effectively performing `switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy)`. 4. Unlock mm: `popq %rax; ret` -> `movq %rbp, %rdi; call __x86_indirect_thunk_rax`, where %rbp points to the original seq_file; `popq %rax; ret` -> `m_stop`, effectively executing `m_stop(seq_file, ..)`. 5. Return to userspace: use `swapgs_restore_regs_and_return_to_usermode`, and call `execve()` to get the shell. Finally, using `nsenter --mount=/proc/1/ns/mnt` to restore the mount namespace and get the flag via `cat /flag/flag`. ### Source code The full exploit source is available [here][exp]. For more details, refer to its README file. [exp]: https://github.com/lrh2000/StackRot/tree/master/exp Thanks, Ruihan Li
Current thread:
- StackRot (CVE-2023-3269): Linux kernel privilege escalation vulnerability Ruihan Li (Jul 05)
- Re: StackRot (CVE-2023-3269): Linux kernel privilege escalation vulnerability Solar Designer (Jul 07)
- Re: StackRot (CVE-2023-3269): Linux kernel privilege escalation vulnerability Ruihan Li (Jul 28)