The title is "Remove task and cgroup local storage percpu counters"

On Tue, Jul 29, 2025 at 11:25 AM Amery Hung <ameryhung@xxxxxxxxx> wrote:
>
> * Motivation *
>
> The goal of this patchset is to make bpf syscalls and helpers updating
> task and cgroup local storage more robust by removing the percpu
> counters in them. Task local storage and cgroup storage each employ a
> percpu counter to prevent deadlock caused by recursion. Since the
> underlying bpf local storage takes spinlocks in various operations, bpf
> programs running recursively may try to take a spinlock which is
> already taken. For example, when a tracing bpf program called
> recursively during bpf_task_storage_get(..., F_CREATE) tries to call
> bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock
> if the percpu variable is not in place.
>
> However, the percpu counter can sometimes cause bpf syscalls or helpers
> to return errors spuriously, simply because another thread is also
> updating the local storage or the local storage map. Ideally, the two
> threads could have taken turns taking the locks and performed their
> jobs respectively. However, due to the percpu counter, the syscalls and
> helpers can return -EBUSY even if neither of them runs recursively
> inside the other. All it takes for this to happen is for the two
> threads to run on the same CPU. This happened when BPF CI ran the task
> local data selftest. Since CI runs the test on a VM with 2 CPUs,
> bpf_task_storage_get(..., F_CREATE) can easily fail.
>
> This failure mode is not good for users, as they need to add retry
> logic in user space or in bpf programs to avoid it. Even with retries,
> there is no guaranteed upper bound on the number of attempts before a
> call succeeds. Therefore, this patchset seeks to remove the percpu
> counters and make the related bpf syscalls and helpers more reliable,
> while still making sure recursion deadlock cannot happen, with the help
> of the resilient queued spinlock (rqspinlock).
>
>
> * Implementation *
>
> To remove the percpu counter without introducing deadlock,
> bpf_local_storage is refactored by changing the locks from
> raw_spin_lock to rqspinlock, which prevents deadlock through deadlock
> detection and a timeout mechanism.
>
> There are two locks in bpf_local_storage due to the ownership model
> illustrated in the figure below. A map value, which consists of a
> pointer to the map and the data, is a bpf_local_storage_data (sdata)
> stored in a bpf_local_storage_elem (selem). A selem belongs to a
> bpf_local_storage and a bpf_local_storage_map at the same time.
> bpf_local_storage::lock (local_storage->lock for short) protects the
> list in a bpf_local_storage, and bpf_local_storage_map_bucket::lock
> (b->lock) protects the hash bucket in a bpf_local_storage_map.
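>
> For reference, the same relationships in abridged struct form (only the
> fields relevant to the locking discussion are shown; see
> include/linux/bpf_local_storage.h for the full definitions):
>
>   struct bpf_local_storage_map_bucket {
>           struct hlist_head list;  /* selems hashed into this bucket */
>           raw_spinlock_t lock;     /* b->lock */
>   };
>
>   struct bpf_local_storage {
>           struct bpf_local_storage_data __rcu *cache[BPF_LOCAL_STORAGE_CACHE_SIZE];
>           struct bpf_local_storage_map __rcu *smap;
>           struct hlist_head list;  /* selems owned by this storage */
>           void *owner;             /* the task_struct, cgroup, sock, ... */
>           raw_spinlock_t lock;     /* local_storage->lock, protects "list" */
>   };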
>
>   task_struct
> ┌ task1 ───────┐     bpf_local_storage
> │ *bpf_storage │---->┌─────────┐
> └──────────────┘<----│ *owner  │           bpf_local_storage_elem
>                      │ *cache[16]               (selem)              selem
>                      │ *smap   │              ┌──────────┐          ┌──────────┐
>                      │ list    │------------->│ snode    │<-------->│ snode    │
>                      │ lock    │       ┌----->│ map_node │<--┐  ┌-->│ map_node │
>                      └─────────┘       │      │ sdata =  │   │  │   │ sdata =  │
>   task_struct                          │      │ {&mapA,} │   │  │   │ {&mapB,} │
> ┌ task2 ───────┐     bpf_local_storage │      └──────────┘   │  │   └──────────┘
> │ *bpf_storage │---->┌─────────┐       │                     │  │
> └──────────────┘<----│ *owner  │       │                     │  │
>                      │ *cache[16]      │        selem        │  │      selem
>                      │ *smap   │       │      ┌──────────┐   │  │   ┌──────────┐
>                      │ list    │-------│----->│ snode    │<--│--│-->│ snode    │
>                      │ lock    │       │  ┌-->│ map_node │   └--│-->│ map_node │
>                      └─────────┘       │  │   │ sdata =  │      │   │ sdata =  │
>   bpf_local_storage_map                │  │   │ {&mapB,} │      │   │ {&mapA,} │
>   (smap)                               │  │   └──────────┘      │   └──────────┘
> ┌ mapA ───────┐                        │  │                     │
> │ bpf_map map │   bpf_local_storage_map_bucket                  │
> │ *buckets    │---->┌ b[0] ┐           │  │                     │
> └─────────────┘     │ list │-----------┘  │                     │
>                     │ lock │              │                     │
>                     └──────┘              │                     │
>   smap              ...                   │                     │
> ┌ mapB ───────┐                           │                     │
> │ bpf_map map │   bpf_local_storage_map_bucket                  │
> │ *buckets    │---->┌ b[0] ┐              │                     │
> └─────────────┘     │ list │--------------┘                     │
>                     │ lock │                                    │
>                     └──────┘                                    │
>                     ┌ b[1] ┐                                    │
>                     │ list │------------------------------------┘
>                     │ lock │
>                     └──────┘
>                     ...
>
>
> The refactoring is divided into three steps.
>
> First, in patches 1-4, local storage helpers that take locks are
> converted to failable. The functions are changed from returning void to
> returning an int error code, with the return value temporarily set to 0.
> In callers where the helpers cannot fail in the middle of an update,
> the helper is open coded. In callers that are not allowed to fail
> (i.e., bpf_local_storage_destroy() and bpf_local_storage_map_free()),
> we make sure the functions cannot be called recursively, which would
> cause deadlock, and assert that the return value is 0.
>
> Then, in patch 5, the locks are changed to rqspinlock, and the error
> returned from raw_res_spin_lock_irqsave() is propagated to the syscalls
> and helpers.
>
> Finally, in patches 6-8, the percpu counters in task and cgroup local
> storage are removed.
>
> Question:
>
> - In bpf_local_storage_destroy() and bpf_local_storage_map_free(), where
>   failing is not allowed, I assert that the lock acquisition always
>   succeeds based on the fact that 1) these paths cannot run recursively,
>   causing AA deadlock, and 2) local_storage->lock and b->lock are always
>   acquired in the same order, but I also notice that rqspinlock has
>   a timeout fallback. Is this assertion an okay thing to do?
>
>
> * Test *
>
> Task and cgroup local storage selftests have already covered deadlock
> caused by recursion. Patch 9 updates the expected result of the task
> local storage selftests, since task local storage bpf helpers can now
> run on the same CPU without causing deadlock.
>
> * Patch organization *
>
> E(exposed) L(local storage lock) B(bucket lock)
> EL   __bpf_local_storage_insert_cache  (skip cache update)
> ELB  bpf_local_storage_destroy         (cannot recur)
> ELB  bpf_local_storage_map_free        (cannot recur)
> ELB  bpf_selem_unlink                  --> Patch 4
> E B  bpf_selem_link_map                --> Patch 2
>   B  bpf_selem_unlink_map              --> Patch 1
>  L   bpf_selem_unlink_storage          --> Patch 3
>
> During the refactoring, to make sure all exposed functions
> handle the error returned by raw_res_spin_lock_irqsave(),
> __must_check is added locally to catch all callers.
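>
> As a rough sketch of the conversion pattern (illustrative only, not the
> literal diff; the real helpers derive the bucket from the selem and do
> more work under the lock):
>
>   /* Before: taking the bucket lock cannot fail. */
>   static void bpf_selem_unlink_map(struct bpf_local_storage_map_bucket *b,
>                                    struct bpf_local_storage_elem *selem)
>   {
>           unsigned long flags;
>
>           raw_spin_lock_irqsave(&b->lock, flags);
>           hlist_del_init_rcu(&selem->map_node);
>           raw_spin_unlock_irqrestore(&b->lock, flags);
>   }
>
>   /* After: b->lock becomes an rqspinlock_t. Acquiring it can fail when
>    * deadlock or a timeout is detected, so the error is returned and
>    * __must_check forces every caller to handle it.
>    */
>   static int __must_check bpf_selem_unlink_map(struct bpf_local_storage_map_bucket *b,
>                                                struct bpf_local_storage_elem *selem)
>   {
>           unsigned long flags;
>           int err;
>
>           err = raw_res_spin_lock_irqsave(&b->lock, flags);
>           if (err)
>                   return err;
>           hlist_del_init_rcu(&selem->map_node);
>           raw_res_spin_unlock_irqrestore(&b->lock, flags);
>           return 0;
>   }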
>
> Patches 1-4
>   Convert local storage helpers to failable, or open-code
>   the helpers
>
> Patch 5
>   Change local_storage->lock and b->lock from
>   raw_spin_lock to rqspinlock
>
> Patch 6
>   Remove the percpu counter in task local storage and remove
>   bpf_task_storage_{get,delete}_recur()
>
> Patch 7
>   Remove the percpu counter in cgroup local storage
>
> Patch 8
>   Remove the unused percpu counter in bpf_local_storage
>
> Patch 9
>   Update the task local storage recursion test
>
> Patch 10
>   Remove the task local storage stress test
>
> Patch 11
>   Update btf_dump to use another percpu variable
>
> ----
>
> Amery Hung (11):
>   bpf: Convert bpf_selem_unlink_map to failable
>   bpf: Convert bpf_selem_link_map to failable
>   bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
>   bpf: Convert bpf_selem_unlink to failable
>   bpf: Change local_storage->lock and b->lock to rqspinlock
>   bpf: Remove task local storage percpu counter
>   bpf: Remove cgroup local storage percpu counter
>   bpf: Remove unused percpu counter from bpf_local_storage_map_free
>   selftests/bpf: Update task_local_storage/recursion test
>   selftests/bpf: Remove test_task_storage_map_stress_lookup
>   selftests/bpf: Choose another percpu variable in bpf for btf_dump test
>
>  include/linux/bpf_local_storage.h             |  14 +-
>  kernel/bpf/bpf_cgrp_storage.c                 |  60 +-----
>  kernel/bpf/bpf_inode_storage.c                |   6 +-
>  kernel/bpf/bpf_local_storage.c                | 202 ++++++++++++------
>  kernel/bpf/bpf_task_storage.c                 | 153 ++-----------
>  kernel/bpf/helpers.c                          |   4 -
>  net/core/bpf_sk_storage.c                     |  10 +-
>  .../bpf/map_tests/task_storage_map.c          | 128 -----------
>  .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
>  .../bpf/prog_tests/task_local_storage.c       |   8 +-
>  .../bpf/progs/read_bpf_task_storage_busy.c    |  38 ----
>  11 files changed, 184 insertions(+), 443 deletions(-)
>  delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
>  delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
>
> --
> 2.47.3