* Motivation *

The goal of this patchset is to make bpf syscalls and helpers that update
task and cgroup local storage more robust by removing the percpu counters
in them.

Task local storage and cgroup local storage each employ a percpu counter
to prevent deadlock caused by recursion. Since the underlying bpf local
storage takes spinlocks in various operations, bpf programs running
recursively may try to take a spinlock that is already held. For example,
when a tracing bpf program invoked recursively during
bpf_task_storage_get(..., F_CREATE) tries to call
bpf_task_storage_get(..., F_CREATE) again, it would cause an AA deadlock
if the percpu counter were not in place.

However, the percpu counter sometimes causes bpf syscalls or helpers to
return errors spuriously, as soon as another thread on the same CPU is
also updating the local storage or the local storage map. Ideally, the
two threads could have taken turns acquiring the locks and performed
their jobs respectively. Due to the percpu counter, however, the syscalls
and helpers can return -EBUSY even if neither of them runs recursively
within the other. All it takes for this to happen is for the two threads
to run on the same CPU. This happened when BPF-CI ran the selftest of
task local data: since CI runs the test on a VM with 2 CPUs,
bpf_task_storage_get(..., F_CREATE) can easily fail. This failure mode is
bad for users, who need to add retry logic in user space or in bpf
programs to avoid it. Even with retries, there is no guaranteed upper
bound on the number of attempts before a call succeeds.

Therefore, this patchset seeks to remove the percpu counter and make the
related bpf syscalls and helpers more reliable, while still making sure
recursion deadlock cannot happen, with the help of the resilient queued
spinlock (rqspinlock).
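
To make the failure mode concrete, below is a simplified sketch of the
percpu-counter guard being removed, modeled on bpf_task_storage_trylock()
in kernel/bpf/bpf_task_storage.c (abridged here, not a verbatim copy):

    static DEFINE_PER_CPU(int, bpf_task_storage_busy);

    static bool bpf_task_storage_trylock(void)
    {
            migrate_disable();
            /* A non-zero counter only proves that *some* update is in
             * flight on this CPU; it may be a recursive bpf program, but
             * it may just as well be an unrelated thread that happens to
             * be scheduled on the same CPU.
             */
            if (unlikely(this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
                    this_cpu_dec(bpf_task_storage_busy);
                    migrate_enable();
                    return false;   /* callers report -EBUSY */
            }
            return true;
    }

The guard cannot distinguish true recursion from two independent threads
sharing a CPU, which is exactly why the selftest fails on a 2-CPU VM.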

* Implementation *

To remove the percpu counter without introducing deadlock,
bpf_local_storage is refactored by changing the locks from raw_spin_lock
to rqspinlock, which prevents deadlock through deadlock detection and a
timeout mechanism.

There are two locks in bpf_local_storage due to the ownership model, as
illustrated in the figure below. A map value, which consists of a pointer
to the map and the data, is a bpf_local_storage_data (sdata) stored in a
bpf_local_storage_elem (selem). A selem belongs to a bpf_local_storage
and a bpf_local_storage_map at the same time. bpf_local_storage::lock
(local_storage->lock in short) protects the list in a bpf_local_storage,
and bpf_local_storage_map_bucket::lock (b->lock) protects the hash bucket
in a bpf_local_storage_map.

task_struct
┌ task1 ───────┐     bpf_local_storage
│ *bpf_storage │---->┌─────────┐
└──────────────┘<----│ *owner  │      bpf_local_storage_elem
                     │ *cache[16]        (selem)             selem
                     │ *smap   │      ┌──────────┐        ┌──────────┐
                     │ list    │----->│ snode    │<------>│ snode    │
                     │ lock    │ ┌--->│ map_node │<-┐  ┌->│ map_node │
                     └─────────┘ │    │ sdata =  │  │  │  │ sdata =  │
task_struct                      │    │ {&mapA,} │  │  │  │ {&mapB,} │
┌ task2 ───────┐    bpf_local_storage └──────────┘  │  │  └──────────┘
│ *bpf_storage │---->┌─────────┐ │                  │  │
└──────────────┘<----│ *owner  │ │                  │  │
                     │ *cache[16]│        selem     │  │     selem
                     │ *smap   │ │    ┌──────────┐  │  │  ┌──────────┐
                     │ list    │-│--->│ snode    │<-│--│->│ snode    │
                     │ lock    │ │ ┌->│ map_node │  └--│->│ map_node │
                     └─────────┘ │ │  │ sdata =  │     │  │ sdata =  │
bpf_local_storage_map            │ │  │ {&mapB,} │     │  │ {&mapA,} │
       (smap)                    │ │  └──────────┘     │  └──────────┘
┌ mapA ───────┐                  │ │                   │
│ bpf_map map │   bpf_local_storage_map_bucket         │
│ *buckets    │---->┌ b[0] ┐     │ │                   │
└─────────────┘     │ list │-----┘ │                   │
                    │ lock │       │                   │
                    └──────┘       │                   │
 smap                  ...         │                   │
┌ mapB ───────┐                    │                   │
│ bpf_map map │   bpf_local_storage_map_bucket         │
│ *buckets    │---->┌ b[0] ┐       │                   │
└─────────────┘     │ list │-------┘                   │
                    │ lock │                           │
                    └──────┘                           │
                    ┌ b[1] ┐                           │
                    │ list │---------------------------┘
                    │ lock │
                    └──────┘
                       ...

The refactoring is divided into three steps. First, in patches 1-4, local
storage helpers that take locks are converted to failable: the functions
are changed from returning void to returning an int error code, with the
return value temporarily set to 0. In callers where a helper cannot fail
in the middle of an update, the helper is open coded. In callers that are
not allowed to fail at all (i.e., bpf_local_storage_destroy() and
bpf_local_storage_map_free()), we make sure the functions cannot be
called recursively, which would cause deadlock, and assert that the
return value is 0. Then, in patch 5, the locks are changed to rqspinlock,
and the error returned by raw_res_spin_lock_irqsave() is propagated to
the syscalls and helpers (see the sketch below). Finally, in patches 6-8,
the percpu counters in task and cgroup local storage, along with a
now-unused counter in bpf_local_storage, are removed.
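
To illustrate the conversion in patch 5, here is a minimal before/after
sketch (not the actual diff) of a locking path once local_storage->lock
becomes an rqspinlock; unlike raw_spin_lock_irqsave(),
raw_res_spin_lock_irqsave() can fail instead of spinning forever, and the
caller hands the error back up:

    /* Before: acquisition cannot fail, so recursion deadlocks */
    unsigned long flags;

    raw_spin_lock_irqsave(&local_storage->lock, flags);
    /* ... link/unlink selem ... */
    raw_spin_unlock_irqrestore(&local_storage->lock, flags);

    /* After: acquisition may fail (e.g. -EDEADLK on detected deadlock,
     * -ETIMEDOUT on timeout), and the error is propagated to the bpf
     * syscalls and helpers
     */
    unsigned long flags;
    int err;

    err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
    if (err)
            return err;
    /* ... link/unlink selem ... */
    raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
    return 0;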

Question:
- In bpf_local_storage_destroy() and bpf_local_storage_map_free(), which
  are not allowed to fail, I assert that the lock acquisition always
  succeeds based on the facts that 1) these paths cannot run recursively
  and cause an AA deadlock and 2) local_storage->lock and b->lock are
  always acquired in the same order. However, I also notice that
  rqspinlock has a timeout fallback. Is this assertion an okay thing to
  do?

* Test *

Task and cgroup local storage selftests have already covered deadlock
caused by recursion. Patch 9 updates the expected results of the task
local storage selftests, as task local storage bpf helpers can now run on
the same CPU without causing deadlock.

* Patch organization *

E(exposed) L(local storage lock) B(bucket lock)

EL  __bpf_local_storage_insert_cache (skip cache update)
ELB bpf_local_storage_destroy (cannot recur)
ELB bpf_local_storage_map_free (cannot recur)
ELB bpf_selem_unlink         --> Patch 4
E B bpf_selem_link_map       --> Patch 2
  B bpf_selem_unlink_map     --> Patch 1
 L  bpf_selem_unlink_storage --> Patch 3

During the refactoring, to make sure all exposed functions handle the
error returned by raw_res_spin_lock_irqsave(), __must_check was added
locally to catch all callers.

Patch 1-4  Convert local storage helpers to failable, or open code the
           helpers
Patch 5    Change local_storage->lock and b->lock from raw_spin_lock to
           rqspinlock
Patch 6    Remove the percpu counter in task local storage and remove
           bpf_task_storage_{get,delete}_recur()
Patch 7    Remove the percpu counter in cgroup local storage
Patch 8    Remove the unused percpu counter in bpf_local_storage
Patch 9    Update task local storage recursion test
Patch 10   Remove task local storage stress test
Patch 11   Update the btf_dump test to use another percpu variable

----

Amery Hung (11):
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
  bpf: Convert bpf_selem_unlink to failable
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Remove task local storage percpu counter
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test

 include/linux/bpf_local_storage.h             |  14 +-
 kernel/bpf/bpf_cgrp_storage.c                 |  60 +-----
 kernel/bpf/bpf_inode_storage.c                |   6 +-
 kernel/bpf/bpf_local_storage.c                | 202 ++++++++++++------
 kernel/bpf/bpf_task_storage.c                 | 153 ++-----------
 kernel/bpf/helpers.c                          |   4 -
 net/core/bpf_sk_storage.c                     |  10 +-
 .../bpf/map_tests/task_storage_map.c          | 128 -----------
 .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
 .../bpf/prog_tests/task_local_storage.c       |   8 +-
 .../bpf/progs/read_bpf_task_storage_busy.c    |  38 ----
 11 files changed, 184 insertions(+), 443 deletions(-)
 delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
 delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c

--
2.47.3