The title is "Remove task and cgroup local storage percpu counters"

On Tue, Jul 29, 2025 at 11:25 AM Amery Hung <ameryhung@xxxxxxxxx> wrote:
>
> * Motivation *
>
> The goal of this patchset is to make bpf syscalls and helpers updating
> task and cgroup local storage more robust by removing the percpu
> counters in them. Task local storage and cgroup storage each employ a
> percpu counter to prevent deadlock caused by recursion. Since the
> underlying bpf local storage takes spinlocks in various operations, bpf
> programs running recursively may try to take a spinlock which is
> already taken. For example, when a tracing bpf program called
> recursively during bpf_task_storage_get(..., F_CREATE) tries to call
> bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock
> if the percpu variable is not in place.
>
> However, the percpu counter can sometimes cause bpf syscalls or helpers
> to return errors spuriously, simply because another thread is also
> updating the local storage or the local storage map. Ideally, the two
> threads could have taken turns taking the locks and performed their
> jobs respectively. However, due to the percpu counter, the syscalls and
> helpers can return -EBUSY even if neither of them runs recursively
> inside the other. All it takes for this to happen is for the two
> threads to run on the same CPU. This happened when BPF CI ran the task
> local data selftest. Since CI runs the test on a VM with 2 CPUs,
> bpf_task_storage_get(..., F_CREATE) can easily fail.
>
> This failure mode is not good for users, as they need to add retry
> logic in user space or in bpf programs to avoid it. Even with retries,
> there is no guaranteed upper bound on the number of attempts before a
> call succeeds. Therefore, this patchset seeks to remove the percpu
> counters and make the related bpf syscalls and helpers more reliable,
> while still making sure recursion deadlock cannot happen, with the help
> of the resilient queued spinlock (rqspinlock).
>
>
> * Implementation *
>
> To remove the percpu counter without introducing deadlock,
> bpf_local_storage is refactored by changing the locks from
> raw_spin_lock to rqspinlock, which prevents deadlock through deadlock
> detection and a timeout mechanism.
>
> There are two locks in bpf_local_storage due to the ownership model
> illustrated in the figure below. A map value, which consists of a
> pointer to the map and the data, is a bpf_local_storage_data (sdata)
> stored in a bpf_local_storage_elem (selem). A selem belongs to a
> bpf_local_storage and a bpf_local_storage_map at the same time.
> bpf_local_storage::lock (local_storage->lock for short) protects the
> list in a bpf_local_storage, and bpf_local_storage_map_bucket::lock
> (b->lock) protects the hash bucket in a bpf_local_storage_map.
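>
> For reference, the same relationships in abridged struct form (only the
> fields relevant to the locking discussion are shown; see
> include/linux/bpf_local_storage.h for the full definitions):
>
>   struct bpf_local_storage_map_bucket {
>           struct hlist_head list;  /* selems hashed into this bucket */
>           raw_spinlock_t lock;     /* b->lock */
>   };
>
>   struct bpf_local_storage {
>           struct bpf_local_storage_data __rcu *cache[BPF_LOCAL_STORAGE_CACHE_SIZE];
>           struct bpf_local_storage_map __rcu *smap;
>           struct hlist_head list;  /* selems owned by this storage */
>           void *owner;             /* the task_struct, cgroup, sock, ... */
>           raw_spinlock_t lock;     /* local_storage->lock, protects "list" */
>   };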
>
>   task_struct
> ┌ task1 ───────┐     bpf_local_storage
> │ *bpf_storage │---->┌─────────┐
> └──────────────┘<----│ *owner  │           bpf_local_storage_elem
>                      │ *cache[16]               (selem)              selem
>                      │ *smap   │              ┌──────────┐          ┌──────────┐
>                      │ list    │------------->│ snode    │<-------->│ snode    │
>                      │ lock    │       ┌----->│ map_node │<--┐  ┌-->│ map_node │
>                      └─────────┘       │      │ sdata =  │   │  │   │ sdata =  │
>   task_struct                          │      │ {&mapA,} │   │  │   │ {&mapB,} │
> ┌ task2 ───────┐     bpf_local_storage │      └──────────┘   │  │   └──────────┘
> │ *bpf_storage │---->┌─────────┐       │                     │  │
> └──────────────┘<----│ *owner  │       │                     │  │
>                      │ *cache[16]      │        selem        │  │      selem
>                      │ *smap   │       │      ┌──────────┐   │  │   ┌──────────┐
>                      │ list    │-------│----->│ snode    │<--│--│-->│ snode    │
>                      │ lock    │       │  ┌-->│ map_node │   └--│-->│ map_node │
>                      └─────────┘       │  │   │ sdata =  │      │   │ sdata =  │
>   bpf_local_storage_map                │  │   │ {&mapB,} │      │   │ {&mapA,} │
>   (smap)                               │  │   └──────────┘      │   └──────────┘
> ┌ mapA ───────┐                        │  │                     │
> │ bpf_map map │   bpf_local_storage_map_bucket                  │
> │ *buckets    │---->┌ b[0] ┐           │  │                     │
> └─────────────┘     │ list │-----------┘  │                     │
>                     │ lock │              │                     │
>                     └──────┘              │                     │
>   smap              ...                   │                     │
> ┌ mapB ───────┐                           │                     │
> │ bpf_map map │   bpf_local_storage_map_bucket                  │
> │ *buckets    │---->┌ b[0] ┐              │                     │
> └─────────────┘     │ list │--------------┘                     │
>                     │ lock │                                    │
>                     └──────┘                                    │
>                     ┌ b[1] ┐                                    │
>                     │ list │------------------------------------┘
>                     │ lock │
>                     └──────┘
>                     ...
>
>
> The refactoring is divided into three steps.
>
> First, in patches 1-4, local storage helpers that take locks are
> converted to failable. The functions are changed from returning void to
> returning an int error code, with the return value temporarily set to 0.
> In callers where the helpers cannot fail in the middle of an update,
> the helper is open coded. In callers that are not allowed to fail
> (i.e., bpf_local_storage_destroy() and bpf_local_storage_map_free()),
> we make sure the functions cannot be called recursively, which would
> cause deadlock, and assert that the return value is 0.
>
> Then, in patch 5, the locks are changed to rqspinlock, and the error
> returned from raw_res_spin_lock_irqsave() is propagated to the syscalls
> and helpers.
>
> Finally, in patches 6-8, the percpu counters in task and cgroup local
> storage are removed.
>
> Question:
>
> - In bpf_local_storage_destroy() and bpf_local_storage_map_free(), where
>   failing is not allowed, I assert that the lock acquisition always
>   succeeds based on the fact that 1) these paths cannot run recursively,
>   causing AA deadlock, and 2) local_storage->lock and b->lock are always
>   acquired in the same order, but I also notice that rqspinlock has
>   a timeout fallback. Is this assertion an okay thing to do?
>
>
> * Test *
>
> Task and cgroup local storage selftests have already covered deadlock
> caused by recursion. Patch 9 updates the expected result of the task
> local storage selftests, since task local storage bpf helpers can now
> run on the same CPU without causing deadlock.
>
> * Patch organization *
>
> E(exposed) L(local storage lock) B(bucket lock)
> EL   __bpf_local_storage_insert_cache  (skip cache update)
> ELB  bpf_local_storage_destroy         (cannot recur)
> ELB  bpf_local_storage_map_free        (cannot recur)
> ELB  bpf_selem_unlink                  --> Patch 4
> E B  bpf_selem_link_map                --> Patch 2
>   B  bpf_selem_unlink_map              --> Patch 1
>  L   bpf_selem_unlink_storage          --> Patch 3
>
> During the refactoring, to make sure all exposed functions
> handle the error returned by raw_res_spin_lock_irqsave(),
> __must_check is added locally to catch all callers.
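>
> As a rough sketch of the conversion pattern (illustrative only, not the
> literal diff; the real helpers derive the bucket from the selem and do
> more work under the lock):
>
>   /* Before: taking the bucket lock cannot fail. */
>   static void bpf_selem_unlink_map(struct bpf_local_storage_map_bucket *b,
>                                    struct bpf_local_storage_elem *selem)
>   {
>           unsigned long flags;
>
>           raw_spin_lock_irqsave(&b->lock, flags);
>           hlist_del_init_rcu(&selem->map_node);
>           raw_spin_unlock_irqrestore(&b->lock, flags);
>   }
>
>   /* After: b->lock becomes an rqspinlock_t. Acquiring it can fail when
>    * deadlock or a timeout is detected, so the error is returned and
>    * __must_check forces every caller to handle it.
>    */
>   static int __must_check bpf_selem_unlink_map(struct bpf_local_storage_map_bucket *b,
>                                                struct bpf_local_storage_elem *selem)
>   {
>           unsigned long flags;
>           int err;
>
>           err = raw_res_spin_lock_irqsave(&b->lock, flags);
>           if (err)
>                   return err;
>           hlist_del_init_rcu(&selem->map_node);
>           raw_res_spin_unlock_irqrestore(&b->lock, flags);
>           return 0;
>   }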
>
> Patches 1-4
>   Convert local storage helpers to failable, or open-code
>   the helpers
>
> Patch 5
>   Change local_storage->lock and b->lock from
>   raw_spin_lock to rqspinlock
>
> Patch 6
>   Remove the percpu counter in task local storage and remove
>   bpf_task_storage_{get,delete}_recur()
>
> Patch 7
>   Remove the percpu counter in cgroup local storage
>
> Patch 8
>   Remove the unused percpu counter in bpf_local_storage
>
> Patch 9
>   Update the task local storage recursion test
>
> Patch 10
>   Remove the task local storage stress test
>
> Patch 11
>   Update btf_dump to use another percpu variable
>
> ----
>
> Amery Hung (11):
>   bpf: Convert bpf_selem_unlink_map to failable
>   bpf: Convert bpf_selem_link_map to failable
>   bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
>   bpf: Convert bpf_selem_unlink to failable
>   bpf: Change local_storage->lock and b->lock to rqspinlock
>   bpf: Remove task local storage percpu counter
>   bpf: Remove cgroup local storage percpu counter
>   bpf: Remove unused percpu counter from bpf_local_storage_map_free
>   selftests/bpf: Update task_local_storage/recursion test
>   selftests/bpf: Remove test_task_storage_map_stress_lookup
>   selftests/bpf: Choose another percpu variable in bpf for btf_dump test
>
>  include/linux/bpf_local_storage.h             |  14 +-
>  kernel/bpf/bpf_cgrp_storage.c                 |  60 +-----
>  kernel/bpf/bpf_inode_storage.c                |   6 +-
>  kernel/bpf/bpf_local_storage.c                | 202 ++++++++++++------
>  kernel/bpf/bpf_task_storage.c                 | 153 ++-----------
>  kernel/bpf/helpers.c                          |   4 -
>  net/core/bpf_sk_storage.c                     |  10 +-
>  .../bpf/map_tests/task_storage_map.c          | 128 -----------
>  .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
>  .../bpf/prog_tests/task_local_storage.c       |   8 +-
>  .../bpf/progs/read_bpf_task_storage_busy.c    |  38 ----
>  11 files changed, 184 insertions(+), 443 deletions(-)
>  delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
>  delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
>
> --
> 2.47.3