Consider a kernel data structure that’s in charge of maintaining a record of all currently open files in the system. Whenever a new file is opened or closed, this list must be updated. This means either adding a file to the list when it’s opened or removing it when it’s closed. Now, picture this: two processes decide to open files simultaneously. In this scenario, the separate changes they make to the list could potentially create a tricky situation called a “race condition.”
Concurrency bugs arise from the unforeseen interleaving of threads or processes that manipulate shared resources. Incorrect handling of threads, processes, or shared resources may lead to a race condition. Dealing with these bugs manually is challenging due to non-deterministic thread scheduling, intricate data flows, and the incredibly vast number of possible interleavings.
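To make this concrete, here is a minimal userspace sketch (my own illustration, not code from the kernel) of the kind of race described above: two threads perform an unsynchronized read-modify-write on a shared counter, and updates get lost.

```c
#include <pthread.h>
#include <stdio.h>

static int open_files = 0;           /* shared, unprotected counter */

static void *opener(void *arg)
{
    for (int i = 0; i < 100000; i++)
        open_files++;                /* read-modify-write: not atomic! */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, opener, NULL);
    pthread_create(&b, NULL, opener, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Expected 200000, but interleaved updates are lost,
     * so the printed value is usually smaller. */
    printf("open_files = %d\n", open_files);
    return 0;
}
```

Compile with `gcc -pthread` and run it a few times; the varying, too-small results are the race condition made visible.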
This post is all about breaking down concurrency bugs in the Linux kernel for beginners. In Part One, I’ve got your back with a simple breakdown of:
- The basics of concurrency
- Race conditions
- Preemptive design
- Atomic variables
- Mutex locks
- Deadlocks
What is a concurrency bug?
The easiest way to understand a concurrency bug in the kernel is to think of a race condition in which two operations access the same memory location without synchronization. One of the main settings where concurrency bugs occur in the kernel is a multi-variable race.
Below is simplified code based on a real-world example, CVE-2017-15649:
Thread 1

```c
int fanout_add() {
    if (!po->running)
        return -EINVAL;
    match = kmalloc();
    po->fanout = match;
    fanout_link();
}

void fanout_link() {
    list_add(sk, &global_list);
}
```
Thread 2

```c
int packet_do_bind() {
    if (po->fanout)
        return -EINVAL;
    unregister_hook();
    fanout_link();
}

void unregister_hook() {
    po->running = 0;
    if (po->fanout)
        fanout_unlink(sk, po);
}

void fanout_unlink() {
    BUG_ON(!list_contains(sk, &global_list));
}
```
Two system calls, `setsockopt` and `bind`, interact with each other through two memory-accessing variables, namely `po->fanout` and `po->running`. These two are semantically correlated: `po->fanout` can only be updated if `po->running` is set to 1, and `po->running` can be set to 0 only if `po->fanout` is NULL.

Initially, `po->fanout` is `NULL` and `po->running` is `1`.
Let us walk through an execution order in which the code eventually crashes.

1. Thread 1 checks `po->running` at `if (!po->running)`. Since `po->running` is 1, there is no error to report at `return -EINVAL`, and execution continues.
2. Thread 2 checks whether `po->fanout` is `NULL` at `if (po->fanout)`. Its initial value is `NULL`, so execution proceeds into `unregister_hook()` and reaches `po->running = 0;`.
3. Thread 1 stores a value in `po->fanout` at `po->fanout = match;`.
4. That store allows Thread 2 to pass the check `if (po->fanout)` inside `unregister_hook()`, calling `fanout_unlink()`.
5. `fanout_unlink()` attempts to remove `sk` from `global_list`.
6. Because this execution order never inserted `sk` before Thread 2 calls `fanout_unlink()` (Thread 1 has not yet reached `fanout_link()`), the check fails and `BUG_ON()` fires in Thread 2.
In other words, `fanout_unlink()` assumes `sk` must be on `global_list`, but this instruction sequence never put `sk` in the linked list.
This vulnerability sparks critical questions: how is the kernel designed? What factors in the kernel lead to such vulnerabilities? What are the mitigations? So, let us start from the bottom.
What is a preemptive kernel?
Preemption is the involuntary switching of the CPU from one task to another. CPU-scheduling decisions can occur in four different circumstances:
1. When a process transitions from the running state to the waiting state, such as due to an I/O request or when waiting for a child process to terminate after invoking wait().
2. When a process shifts from the running state to the ready state, typically triggered by an interrupt.
3. When a process moves from the waiting state to the ready state, usually when an I/O operation completes.
4. When a process completes and terminates.
When scheduling is restricted to only situations 1 and 4, the scheduling scheme is referred to as nonpreemptive or cooperative. In this case, once the CPU is allocated to a process, it retains control until it voluntarily releases the CPU by either terminating or moving to the waiting state.
However, preemptive scheduling can introduce race conditions when multiple processes share data. For example, imagine two processes that share data: one process is currently updating the data and gets preempted, allowing the second process to run. If the second process attempts to read the data, it might access inconsistent or incomplete information.
To avoid race conditions when multiple processes access shared kernel data structures, a preemptive kernel utilizes mechanisms like mutex locks. Nowadays, the majority of modern operating systems are fully preemptive while operating in kernel mode.
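Besides locking, the kernel also exposes primitives to temporarily disable preemption on the current CPU. Below is a minimal sketch: the `preempt_disable()`/`preempt_enable()` calls are the real `linux/preempt.h` API, but `per_cpu_counter` is a hypothetical variable of mine.

```c
#include <linux/preempt.h>

static int per_cpu_counter;   /* hypothetical data touched only on this CPU */

void bump_counter(void)
{
    preempt_disable();        /* no task switch on this CPU from here...   */
    per_cpu_counter++;        /* safe from preemption, NOT from other CPUs */
    preempt_enable();         /* ...to here                                */
}
```

Note that disabling preemption protects only against being rescheduled on the same CPU; data shared across CPUs still needs atomic operations or locks, discussed next.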
What are the mitigations against race conditions?
atomic variables
When performing non-atomic operations on memory, the typical process involves reading a value from memory, modifying it, and then writing it back to complete the update. This read-modify-write sequence can create race conditions when multiple processes share the same resource.

Atomic operations, on the other hand, execute as a single uninterruptible unit. They cannot be interleaved with other CPUs' accesses to the same memory location, effectively preventing race conditions on that location.

Modern computer systems provide hardware instructions that allow us to test and modify the content of a word, or to swap the contents of two words, atomically; that is, as one uninterruptible unit.
When the operation `a = a + 1` is executed, the Linux kernel cannot ensure its atomic completion. To differentiate between the requirements for atomic and non-atomic operations, the Linux kernel offers a dedicated data type called `atomic_t`.
include/linux/types.h

```c
typedef struct {
    int counter;
} atomic_t;
```
The way to define an atomic variable is: `atomic_t variable;`
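As a quick illustration, here is a sketch of how such a variable might be used; the `atomic_inc`/`atomic_read` helpers are the kernel's real API from `linux/atomic.h`, but the `open_count` counter is hypothetical:

```c
#include <linux/atomic.h>

static atomic_t open_count = ATOMIC_INIT(0);   /* hypothetical shared counter */

void on_file_open(void)
{
    atomic_inc(&open_count);   /* atomic read-modify-write: no lost updates */
}

int files_currently_open(void)
{
    return atomic_read(&open_count);
}
```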
Atomic variables serve to guarantee mutual exclusion in situations where a data race may occur on a single variable during its update, such as when incrementing a counter. For example:
```c
int compare_and_swap(int *value, int expected, int new_value) {
    int temp = *value;

    if (*value == expected)
        *value = new_value;
    return temp;
}
```
The following function uses `compare_and_swap` to increment an atomic integer:
```c
void increment(atomic_int *v) {
    int temp;

    do {
        temp = *v;
    } while (temp != compare_and_swap(v, temp, temp + 1));
}
```
The `compare_and_swap`, or `CAS`, function is designed to perform this update operation atomically. That means, if multiple threads attempt to update the value concurrently, only one thread will succeed in changing the value to the new value, and the others will fail. This ensures mutual exclusion and prevents data races when multiple threads are trying to modify the same memory location simultaneously.
The `increment` function is used to increment an atomic integer. It takes a pointer to an atomic integer as an argument. Inside the function, there is a loop that continues until the increment operation is successfully performed using the `CAS` function.
- It reads the current value of the atomic integer (pointed to by `v`) and stores it in the `temp` variable.
- Then, it calls the `CAS` function, attempting to update the value to `temp + 1`.
- If the `CAS` function succeeds, meaning no other thread has modified the value between the read and the update, the loop terminates.
- Otherwise, the loop continues, and the process repeats until the update is successful.
By using the `CAS` function in this manner, the `increment` function guarantees that the atomic integer is incremented safely, without any race conditions or data-inconsistency issues that could arise when multiple threads are involved.
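For readers who want to experiment outside the kernel, here is a self-contained userspace analogue of the same retry loop; this is my own sketch using C11 `<stdatomic.h>`, not the kernel API:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int counter = 0;

static void increment(atomic_int *v)
{
    int temp;
    do {
        temp = atomic_load(v);   /* read the current value */
    } while (!atomic_compare_exchange_weak(v, &temp, temp + 1)); /* CAS retry */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++)
        increment(&counter);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%d\n", atomic_load(&counter));  /* prints 200000 every time */
    return 0;
}
```

Unlike the unsynchronized counter from the introduction, this version never loses an update: `atomic_compare_exchange_weak` may fail (even spuriously), but the loop simply retries until the increment lands.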
Use of atomic variables is often limited to single updates of shared data such as counters and sequence generators, and they do not entirely solve race conditions. What, then, are more robust ways to solve the problem?
mutex locks
Before explaining what a mutex lock is, we must know what the critical-section problem is in the kernel.
In a system, every process has a specific portion of code known as a “critical section.” Within this critical section, the process can access and modify data that is shared with at least one other process. The crucial aspect of this system is that while one process is executing within its critical section, no other process is permitted to do so simultaneously.
Those who design operating systems create more advanced software utilities to address the critical-section problem. Among these utilities, the most basic one is the mutex lock, where `mutex` is a contraction of `mutual exclusion`. The mutex lock is employed to safeguard critical sections, ensuring the prevention of race conditions. A process needs to obtain the lock before accessing a critical section, and it relinquishes the lock upon leaving that critical section.
```
while (true) {
    (acquire lock)
        critical section
    (release lock)
        remainder section
}
```
One of the disadvantages of this style of mutex lock is that it requires busy waiting. If process A is in the critical section, process B (which is trying to enter its critical section) must loop continuously in the call to `acquire()`.
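In the Linux kernel itself, critical sections are commonly protected with the mutex API from `linux/mutex.h`. The sketch below uses the real `DEFINE_MUTEX`, `mutex_lock`, and `mutex_unlock` primitives, but the `open_files` list and `file_record` structure are hypothetical, echoing the open-file-list example from the introduction:

```c
#include <linux/mutex.h>
#include <linux/list.h>

static LIST_HEAD(open_files);           /* hypothetical shared list */
static DEFINE_MUTEX(open_files_lock);   /* protects open_files */

struct file_record {
    struct list_head node;
    /* ... */
};

void track_open_file(struct file_record *rec)
{
    mutex_lock(&open_files_lock);       /* acquire: enter critical section */
    list_add(&rec->node, &open_files);  /* shared-data update, now safe */
    mutex_unlock(&open_files_lock);     /* release: leave critical section */
}
```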
Common concurrency bug types
deadlocks
Deadlock is a problem that can occur in a multithreaded program using mutex locks. For example, thread A requests a resource and enters a waiting state until the resource is available. Sometimes it can never leave that state, because the resources it requested are held by other threads that are themselves waiting. This is called a deadlock. The code snippet below illustrates this in a simple manner:
```c
void *dosomework_1(void *param) {
    pthread_mutex_lock(&mutex1);    /* A: lock mutex1 first */
    pthread_mutex_lock(&mutex2);    /* A: then lock mutex2 */
    /* do some work */
    pthread_mutex_unlock(&mutex2);
    pthread_mutex_unlock(&mutex1);
    pthread_exit(0);
}

void *dosomework_2(void *param) {
    pthread_mutex_lock(&mutex2);    /* B: lock mutex2 first */
    pthread_mutex_lock(&mutex1);    /* B: then lock mutex1 */
    /* do some work */
    pthread_mutex_unlock(&mutex1);
    pthread_mutex_unlock(&mutex2);
    pthread_exit(0);
}
```
Thread A tries to acquire the mutex locks in the order

1. mutex1
2. mutex2

while thread B tries to acquire the mutex locks in the order

1. mutex2
2. mutex1

Deadlock occurs if (1) thread A acquires mutex1 while thread B acquires mutex2, and (2) thread A has not acquired and released both locks before thread B tries to acquire them. Each thread then waits forever for a lock the other holds. One common fix is sketched below.
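A standard remedy, shown here as a sketch rather than the only possible fix, is to impose a global lock ordering so that every thread acquires mutex1 before mutex2; with a single acquisition order, the circular wait can never form:

```c
/* Same work as dosomework_2, but respecting the global order
 * "mutex1 before mutex2", so it cannot deadlock with dosomework_1. */
void *dosomework_2_ordered(void *param) {
    pthread_mutex_lock(&mutex1);
    pthread_mutex_lock(&mutex2);
    /* do some work */
    pthread_mutex_unlock(&mutex2);
    pthread_mutex_unlock(&mutex1);
    pthread_exit(0);
}
```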
atomicity violations
When there’s a need for a section of code to run without any interruption from other threads, programmers have to rely on specialized synchronization tools like locks or signal/wait. These tools ensure that the code runs atomically as intended. If these atomic regions aren’t managed correctly, other threads might sneak in between, causing unexpected outcomes. These issues, known as atomicity violations, arise from improper enforcement of the intended atomic operations.
af_packet.c

Thread 1

```c
...
if (po->rollover != NULL)
    po->rollover->num = 0;
...
```
Thread 2

```c
...
if (err) {
    kfree(po->rollover);
    po->rollover = NULL;
}
return err;
```
T1 initially examines whether the pointer `po->rollover` holds a non-null value and subsequently uses it to access the field named `num`. Since T2 runs concurrently with T1, it could free the memory referenced by `po->rollover` and then nullify the pointer between T1's check and T1's use. Because T1 does not verify the pointer's null status right before using it, this can lead to a null-pointer dereference error.
This is an example of an atomicity violation, which can potentially crash a process.
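A minimal sketch of one possible fix follows; the `rollover_lock` mutex is hypothetical (not the lock the actual kernel patch used). The idea is to hold the same lock across the check-and-use in T1 and the free-and-clear in T2, so the two regions execute atomically with respect to each other:

```c
static DEFINE_MUTEX(rollover_lock);  /* hypothetical lock protecting po->rollover */

/* Thread 1: check and use become one critical section */
mutex_lock(&rollover_lock);
if (po->rollover != NULL)
    po->rollover->num = 0;
mutex_unlock(&rollover_lock);

/* Thread 2: free and clear become one critical section */
mutex_lock(&rollover_lock);
if (err) {
    kfree(po->rollover);
    po->rollover = NULL;
}
mutex_unlock(&rollover_lock);
```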
References
Operating System Concepts, book by Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin