Consider a kernel data structure that’s in charge of maintaining a record of all currently open files in the system. Whenever a new file is opened or closed, this list must be updated. This means either adding a file to the list when it’s opened or removing it when it’s closed. Now, picture this: two processes decide to open files simultaneously. In this scenario, the separate changes they make to the list could potentially create a tricky situation called a “race condition.”
Concurrency bugs arise from the unforeseen interleaving of threads or processes that manipulate shared resources. Incorrect handling of threads, processes, or shared resources may lead to a race condition. Dealing with these bugs manually is challenging due to non-deterministic thread scheduling, intricate data flows, and the incredibly vast number of possible interleavings.
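To make this concrete, here is a minimal userspace sketch (my own illustration, not code from the kernel) of the kind of race described above: two threads perform an unsynchronized read-modify-write on a shared counter, and updates get lost.

```c
#include <pthread.h>
#include <stdio.h>

static int open_files = 0;           /* shared, unprotected counter */

static void *opener(void *arg)
{
    for (int i = 0; i < 100000; i++)
        open_files++;                /* read-modify-write: not atomic! */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, opener, NULL);
    pthread_create(&b, NULL, opener, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Expected 200000, but interleaved updates are lost,
     * so the printed value is usually smaller. */
    printf("open_files = %d\n", open_files);
    return 0;
}
```

Compile with `gcc -pthread` and run it a few times; the varying, too-small results are the race condition made visible.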
This post is all about breaking down concurrency bugs in the Linux kernel for beginners. In Part One, I’ve got your back with a simple breakdown of:
- The basics of concurrency
- Race conditions
- Preemptive design
- Atomic variables
- Mutex locks
- Deadlocks
What is a concurrency bug?
The easiest way to understand a concurrency bug in the kernel is to think of a race condition in which two operations access the same memory location without synchronization. One of the main settings where concurrency bugs occur in the kernel is a multi-variable race.
Below is simplified code based on a real-world example, CVE-2017-15649:
Thread 1

```c
int fanout_add() {
    if (!po->running)
        return -EINVAL;
    match = kmalloc();
    po->fanout = match;
    fanout_link();
}

void fanout_link() {
    list_add(sk, &global_list);
}
```
Thread 2

```c
int packet_do_bind() {
    if (po->fanout)
        return -EINVAL;
    unregister_hook();
    fanout_link();
}

void unregister_hook() {
    po->running = 0;
    if (po->fanout)
        fanout_unlink(sk, po);
}

void fanout_unlink() {
    BUG_ON(!list_contains(sk, &global_list));
}
```
Two system calls, `setsockopt` and `bind`, interact with each other through two memory-accessing variables, namely `po->fanout` and `po->running`. These two are semantically correlated: `po->fanout` can only be updated if `po->running` is set to 1, and `po->running` can be set to 0 only if `po->fanout` is NULL.

Initially, `po->fanout` is `NULL` and `po->running` is `1`.
Let us walk through an execution order in which the code eventually crashes.

1. Thread 1 checks `po->running` at `if (!po->running)`. Since `po->running` is 1, there is no error to report at `return -EINVAL`, and execution continues.
2. Thread 2 checks whether `po->fanout` is `NULL` at `if (po->fanout)`. Its initial value is `NULL`, so execution proceeds into `unregister_hook()` and reaches `po->running = 0;`.
3. Thread 1 stores a value in `po->fanout` at `po->fanout = match;`.
4. That store allows Thread 2 to pass the check `if (po->fanout)` inside `unregister_hook()`, calling `fanout_unlink()`.
5. `fanout_unlink()` attempts to remove `sk` from `global_list`.
6. Because this execution order never inserted `sk` before Thread 2 calls `fanout_unlink()` (Thread 1 has not yet reached `fanout_link()`), the check fails and `BUG_ON()` fires in Thread 2.
In other words, `fanout_unlink()` assumes `sk` must be on `global_list`, but this instruction sequence never put `sk` in the linked list.
This vulnerability sparks critical questions: how is the kernel designed? What factors in the kernel lead to such vulnerabilities? What are the mitigations? So, let us start from the bottom.
What is a preemptive kernel?
Preemption is the involuntary switching of the CPU from one task to another. CPU-scheduling decisions can occur in four different circumstances:
1. When a process transitions from the running state to the waiting state, such as due to an I/O request or when waiting for a child process to terminate after invoking wait().
2. When a process shifts from the running state to the ready state, typically triggered by an interrupt.
3. When a process moves from the waiting state to the ready state, usually when an I/O operation completes.
4. When a process completes and terminates.
When scheduling is restricted to only situations 1 and 4, the scheduling scheme is referred to as nonpreemptive or cooperative. In this case, once the CPU is allocated to a process, it retains control until it voluntarily releases the CPU by either terminating or moving to the waiting state.
However, preemptive scheduling can introduce race conditions when multiple processes share data. For example, imagine two processes that share data: one process is currently updating the data and gets preempted, allowing the second process to run. If the second process attempts to read the data, it might access inconsistent or incomplete information.
To avoid race conditions when multiple processes access shared kernel data structures, a preemptive kernel utilizes mechanisms like mutex locks. Nowadays, the majority of modern operating systems are fully preemptive while operating in kernel mode.
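Besides locking, the kernel also exposes primitives to temporarily disable preemption on the current CPU. Below is a minimal sketch: the `preempt_disable()`/`preempt_enable()` calls are the real `linux/preempt.h` API, but `per_cpu_counter` is a hypothetical variable of mine.

```c
#include <linux/preempt.h>

static int per_cpu_counter;   /* hypothetical data touched only on this CPU */

void bump_counter(void)
{
    preempt_disable();        /* no task switch on this CPU from here...   */
    per_cpu_counter++;        /* safe from preemption, NOT from other CPUs */
    preempt_enable();         /* ...to here                                */
}
```

Note that disabling preemption protects only against being rescheduled on the same CPU; data shared across CPUs still needs atomic operations or locks, discussed next.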
What are the mitigations against race conditions?
atomic variables
When performing non-atomic operations on memory, the typical process involves reading a value from memory, modifying it, and then writing it back to complete the update. This read-modify-write sequence can create race conditions when multiple processes share the same resource.

Atomic operations, on the other hand, execute as a single uninterruptible unit. They cannot be interleaved with other CPUs' accesses to the same memory location, effectively preventing race conditions on that location.

Modern computer systems provide hardware instructions that allow us to test and modify the content of a word, or to swap the contents of two words, atomically; that is, as one uninterruptible unit.
When the operation `a = a + 1` is executed, the Linux kernel cannot ensure its atomic completion. To differentiate between the requirements for atomic and non-atomic operations, the Linux kernel offers a dedicated data type called `atomic_t`.
include/linux/types.h

```c
typedef struct {
    int counter;
} atomic_t;
```
The way to define an atomic variable is: `atomic_t variable;`
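As a quick illustration, here is a sketch of how such a variable might be used; the `atomic_inc`/`atomic_read` helpers are the kernel's real API from `linux/atomic.h`, but the `open_count` counter is hypothetical:

```c
#include <linux/atomic.h>

static atomic_t open_count = ATOMIC_INIT(0);   /* hypothetical shared counter */

void on_file_open(void)
{
    atomic_inc(&open_count);   /* atomic read-modify-write: no lost updates */
}

int files_currently_open(void)
{
    return atomic_read(&open_count);
}
```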
Atomic variables serve to guarantee mutual exclusion in situations where a data race may occur on a single variable during its update, such as when incrementing a counter. For example:
```c
int compare_and_swap(int *value, int expected, int new_value) {
    int temp = *value;

    if (*value == expected)
        *value = new_value;
    return temp;
}
```
The following function uses `compare_and_swap` to increment an atomic integer:
```c
void increment(atomic_int *v) {
    int temp;

    do {
        temp = *v;
    } while (temp != compare_and_swap(v, temp, temp + 1));
}
```
The `compare_and_swap`, or `CAS`, function is designed to perform this update operation atomically. That means, if multiple threads attempt to update the value concurrently, only one thread will succeed in changing the value to the new value, and the others will fail. This ensures mutual exclusion and prevents data races when multiple threads are trying to modify the same memory location simultaneously.
The `increment` function is used to increment an atomic integer. It takes a pointer to an atomic integer as an argument. Inside the function, there is a loop that continues until the increment operation is successfully performed using the `CAS` function.
- It reads the current value of the atomic integer (pointed to by `v`) and stores it in the `temp` variable.
- Then, it calls the `CAS` function, attempting to update the value to `temp + 1`.
- If the `CAS` function succeeds, meaning no other thread has modified the value between the read and the update, the loop terminates.
- Otherwise, the loop continues, and the process repeats until the update is successful.
By using the `CAS` function in this manner, the `increment` function guarantees that the atomic integer is incremented safely, without any race conditions or data-inconsistency issues that could arise when multiple threads are involved.
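For readers who want to experiment outside the kernel, here is a self-contained userspace analogue of the same retry loop; this is my own sketch using C11 `<stdatomic.h>`, not the kernel API:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int counter = 0;

static void increment(atomic_int *v)
{
    int temp;
    do {
        temp = atomic_load(v);   /* read the current value */
    } while (!atomic_compare_exchange_weak(v, &temp, temp + 1)); /* CAS retry */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++)
        increment(&counter);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%d\n", atomic_load(&counter));  /* prints 200000 every time */
    return 0;
}
```

Unlike the unsynchronized counter from the introduction, this version never loses an update: `atomic_compare_exchange_weak` may fail (even spuriously), but the loop simply retries until the increment lands.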
Use of atomic variables is often limited to single updates of shared data such as counters and sequence generators, and they do not entirely solve race conditions. What, then, are more robust ways to solve the problem?
mutex locks
Before explaining what a mutex lock is, we must know what the critical-section problem is in the kernel.
In a system, every process has a specific portion of code known as a “critical section.” Within this critical section, the process can access and modify data that is shared with at least one other process. The crucial aspect of this system is that while one process is executing within its critical section, no other process is permitted to do so simultaneously.
Those who design operating systems create more advanced software utilities to address the critical-section problem. Among these utilities, the most basic one is the mutex lock, where `mutex` is a contraction of `mutual exclusion`. The mutex lock is employed to safeguard critical sections, ensuring the prevention of race conditions. A process needs to obtain the lock before accessing a critical section, and it relinquishes the lock upon leaving that critical section.
```
while (true) {
    (acquire lock)
        critical section
    (release lock)
        remainder section
}
```
One of the disadvantages of this style of mutex lock is that it requires busy waiting. If process A is in the critical section, process B (which is trying to enter its critical section) must loop continuously in the call to `acquire()`.
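In the Linux kernel itself, critical sections are commonly protected with the mutex API from `linux/mutex.h`. The sketch below uses the real `DEFINE_MUTEX`, `mutex_lock`, and `mutex_unlock` primitives, but the `open_files` list and `file_record` structure are hypothetical, echoing the open-file-list example from the introduction:

```c
#include <linux/mutex.h>
#include <linux/list.h>

static LIST_HEAD(open_files);           /* hypothetical shared list */
static DEFINE_MUTEX(open_files_lock);   /* protects open_files */

struct file_record {
    struct list_head node;
    /* ... */
};

void track_open_file(struct file_record *rec)
{
    mutex_lock(&open_files_lock);       /* acquire: enter critical section */
    list_add(&rec->node, &open_files);  /* shared-data update, now safe */
    mutex_unlock(&open_files_lock);     /* release: leave critical section */
}
```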
Common concurrency bug types
deadlocks
Deadlock is a problem that can occur in a multithreaded program using mutex locks. For example, thread A requests a resource and enters a waiting state until the resource is available. Sometimes it can never leave that state, because the resources it requested are held by other threads that are themselves waiting. This is called a deadlock. The code snippet below illustrates this in a simple manner:
```c
void *dosomework_1(void *param) {
    pthread_mutex_lock(&mutex1);    /* A: lock mutex1 first */
    pthread_mutex_lock(&mutex2);    /* A: then lock mutex2 */
    /* do some work */
    pthread_mutex_unlock(&mutex2);
    pthread_mutex_unlock(&mutex1);
    pthread_exit(0);
}

void *dosomework_2(void *param) {
    pthread_mutex_lock(&mutex2);    /* B: lock mutex2 first */
    pthread_mutex_lock(&mutex1);    /* B: then lock mutex1 */
    /* do some work */
    pthread_mutex_unlock(&mutex1);
    pthread_mutex_unlock(&mutex2);
    pthread_exit(0);
}
```
Thread A tries to acquire the mutex locks in the order

1. mutex1
2. mutex2

while thread B tries to acquire the mutex locks in the order

1. mutex2
2. mutex1

Deadlock occurs if (1) thread A acquires mutex1 while thread B acquires mutex2, and (2) thread A has not acquired and released both locks before thread B tries to acquire them. Each thread then waits forever for a lock the other holds. One common fix is sketched below.
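A standard remedy, shown here as a sketch rather than the only possible fix, is to impose a global lock ordering so that every thread acquires mutex1 before mutex2; with a single acquisition order, the circular wait can never form:

```c
/* Same work as dosomework_2, but respecting the global order
 * "mutex1 before mutex2", so it cannot deadlock with dosomework_1. */
void *dosomework_2_ordered(void *param) {
    pthread_mutex_lock(&mutex1);
    pthread_mutex_lock(&mutex2);
    /* do some work */
    pthread_mutex_unlock(&mutex2);
    pthread_mutex_unlock(&mutex1);
    pthread_exit(0);
}
```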
atomicity violations
When there’s a need for a section of code to run without any interruption from other threads, programmers have to rely on specialized synchronization tools like locks or signal/wait. These tools ensure that the code runs atomically as intended. If these atomic regions aren’t managed correctly, other threads might sneak in between, causing unexpected outcomes. These issues, known as atomicity violations, arise from improper enforcement of the intended atomic operations.
af_packet.c

Thread 1

```c
...
if (po->rollover != NULL)
    po->rollover->num = 0;
...
```
Thread 2

```c
...
if (err) {
    kfree(po->rollover);
    po->rollover = NULL;
}
return err;
```
T1 initially examines whether the pointer `po->rollover` holds a non-null value and subsequently uses it to access the field named `num`. Since T2 runs concurrently with T1, it could free the memory referenced by `po->rollover` and then nullify the pointer between T1's check and T1's use. Because T1 does not verify the pointer's null status right before using it, this can lead to a null-pointer dereference error.
This is an example of an atomicity violation, which can potentially crash a process.
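A minimal sketch of one possible fix follows; the `rollover_lock` mutex is hypothetical (not the lock the actual kernel patch used). The idea is to hold the same lock across the check-and-use in T1 and the free-and-clear in T2, so the two regions execute atomically with respect to each other:

```c
static DEFINE_MUTEX(rollover_lock);  /* hypothetical lock protecting po->rollover */

/* Thread 1: check and use become one critical section */
mutex_lock(&rollover_lock);
if (po->rollover != NULL)
    po->rollover->num = 0;
mutex_unlock(&rollover_lock);

/* Thread 2: free and clear become one critical section */
mutex_lock(&rollover_lock);
if (err) {
    kfree(po->rollover);
    po->rollover = NULL;
}
mutex_unlock(&rollover_lock);
```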
References
Operating System Concepts, book by Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin