On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
>
> On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > <mykyta.yatsenko5@xxxxxxxxx> wrote:
> >> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >>
> >> Implementation of the bpf_task_work_schedule kfuncs.
> >>
> >> Main components:
> >> * struct bpf_task_work_context – Metadata and state management per task
> >>   work.
> >> * enum bpf_task_work_state – A state machine to serialize work
> >>   scheduling and execution.
> >> * bpf_task_work_schedule() – The central helper that initiates
> >>   scheduling.
> >> * bpf_task_work_acquire() – Attempts to take ownership of the context
> >>   pointed to by the passed struct bpf_task_work; allocates a new context
> >>   if none exists yet.
> >> * bpf_task_work_callback() – Invoked when the actual task_work runs.
> >> * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> >>   to enqueue task work.
> >> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > Can you elaborate on why the bouncing through irq_work context is necessary?
> > I think we should have this info in the commit log.
> > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> yes, mainly to avoid locks in NMI.
> >
> >> Flow of successful task work scheduling
> >> 1) bpf_task_work_schedule_* is called from BPF code.
> >> 2) Transition state from STANDBY to PENDING, marking that the context is
> >>    owned by this task work scheduler.
> >> 3) irq_work_queue() schedules bpf_task_work_irq().
> >> 4) Transition state from PENDING to SCHEDULING.
> >> 5) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>    transitions to SCHEDULED.
> >> 6) Task work calls bpf_task_work_callback(), which transitions the state
> >>    to RUNNING.
> >> 7) BPF callback is executed.
> >> 8) Context is cleaned up, refcounts released, context state set back to
> >>    STANDBY.
> >>
> >> bpf_task_work_context handling
> >> The context pointer is stored in the bpf_task_work ctx field (u64) but
> >> treated as an __rcu pointer via casts.
> >> bpf_task_work_acquire() publishes a new bpf_task_work_context by cmpxchg
> >> with an RCU initializer.
> >> It is read under the RCU lock only in bpf_task_work_acquire() when
> >> ownership is contended.
> >> Upon deleting the map value, bpf_task_work_cancel_and_free() detaches the
> >> context pointer from struct bpf_task_work and releases resources if the
> >> scheduler does not own the context or it can be canceled (state ==
> >> STANDBY, or state == SCHEDULED and the callback was canceled). If the
> >> task work scheduler owns the context, its state is set to FREED and the
> >> scheduler is expected to clean up on the next state transition.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >> ---
> > This is much better now, with clear ownership between the free path and
> > the scheduling path. I mostly have a few more comments on the current
> > implementation, plus one potential bug.
> >
> > However, the more time I spend on this, the more I feel we should
> > unify all this with the two other bpf async work execution mechanisms
> > (timers and wq), and simplify and deduplicate a lot of this under the
> > serialized async->lock. I know NMI execution is probably critical for
> > this primitive, but we can replace async->lock with rqspinlock to
> > address that, so that it becomes safe to serialize in any context.
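
(To make this a bit more concrete, below is a rough, untested sketch of
what I mean by folding this into the existing bpf_async machinery in
kernel/bpf/helpers.c. BPF_ASYNC_TYPE_TASK_WORK and struct
bpf_task_work_kern are made-up names for illustration, not something
that exists today.)

enum bpf_async_type {
        BPF_ASYNC_TYPE_TIMER = 0,
        BPF_ASYNC_TYPE_WQ,
        BPF_ASYNC_TYPE_TASK_WORK,       /* hypothetical new case */
};

/*
 * Would hang off bpf_async_kern the way bpf_hrtimer/bpf_work do, with
 * struct bpf_async_cb first, so the init/set_callback/free plumbing can
 * be shared and serialized under the same per-value lock.
 */
struct bpf_task_work_kern {
        struct bpf_async_cb cb;
        struct task_struct *task;
        enum task_work_notify_mode mode;
        struct callback_head work;
        struct irq_work irq_work;
};

The task_work specific part would then only be the irq_work bounce and
the task_work_add()/callback glue, instead of a whole parallel state
machine.
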
> > Apart from that, I don't see anything that would negate reworking all
> > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > task_work locks that have trouble with NMI execution (see later
> > comments).
> >
> > I also feel like it would be cleaner if we split the API into 3 steps:
> > init(), set_callback(), schedule() like the other cases, I don't see
> > why we necessarily need to diverge, and it simplifies some of the
> > logic in schedule().
> > Once every state update is protected by a lock, all of the state
> > transitions are done automatically and a lot of the extra races are
> > eliminated.
> >
> > I think we should discuss whether this was considered and why you
> > discarded this approach, otherwise the code is pretty complex, with
> > little upside.
> > Maybe I'm missing something obvious and you'd know more since you've
> > thought about all this longer.
> As for the API, I think having 1 function for scheduling a callback is
> cleaner than having 3 which are always called in the same order anyway.
> Most of the complexity comes from synchronization, not logic, so not
> having to do the same synchronization in init(), set_callback() and
> schedule() seems like a benefit to me.

Well, if you were to reuse bpf_async_kern, all of that extra logic is
already taken care of, or can be easily shared. If you look closely
you'll see that a lot of what you're doing is a repetition of what
timers and bpf_wq have.

> Let me check if using rqspinlock is going to make things simpler. We
> still need states to at least know if cancellation is possible and to
> flag deletion to the scheduler, but using a lock will make the code
> easier to understand.

Yeah, I think for all of this using lockless updates is not really
worth it, let's just serialize using a lock.
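
Roughly what I'm picturing, as an untested sketch: it assumes the
rqspinlock primitives (raw_res_spin_lock_irqsave() /
raw_res_spin_unlock_irqrestore()) and a hypothetical rqspinlock_t lock
sitting next to ctx->state:

static int bpf_task_work_set_state(struct bpf_task_work_context *ctx,
                                   enum bpf_task_work_state from,
                                   enum bpf_task_work_state to)
{
        unsigned long flags;
        int err;

        /*
         * Unlike a plain spinlock, this returns an error on deadlock or
         * timeout instead of hanging, so it should be callable from any
         * context the kfuncs run in, including NMI.
         */
        err = raw_res_spin_lock_irqsave(&ctx->lock, flags);
        if (err)
                return err;
        if (ctx->state == from)
                ctx->state = to;
        else
                err = -EBUSY;
        raw_res_spin_unlock_irqrestore(&ctx->lock, flags);
        return err;
}

Then the schedule, irq_work, callback and cancel_and_free paths could
all go through one helper like this instead of open-coded cmpxchg
chains.
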
> >
> >>  kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 260 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index d2f88a9bc47b..346ae8fd3ada 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -25,6 +25,8 @@
> >>  #include <linux/kasan.h>
> >>  #include <linux/bpf_verifier.h>
> >>  #include <linux/uaccess.h>
> >> +#include <linux/task_work.h>
> >> +#include <linux/irq_work.h>
> >>
> >>  #include "../../lib/kstrtox.h"
> >>
> >> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
> >>
> >>  typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
> >>
> >> +enum bpf_task_work_state {
> >> +        /* bpf_task_work is ready to be used */
> >> +        BPF_TW_STANDBY = 0,
> >> +        /* irq work scheduling in progress */
> >> +        BPF_TW_PENDING,
> >> +        /* task work scheduling in progress */
> >> +        BPF_TW_SCHEDULING,
> >> +        /* task work is scheduled successfully */
> >> +        BPF_TW_SCHEDULED,
> >> +        /* callback is running */
> >> +        BPF_TW_RUNNING,
> >> +        /* associated BPF map value is deleted */
> >> +        BPF_TW_FREED,
> >> +};
> >> +
> >> +struct bpf_task_work_context {
> >> +        /* the map and map value associated with this context */
> >> +        struct bpf_map *map;
> >> +        void *map_val;
> >> +        /* bpf_prog that schedules task work */
> >> +        struct bpf_prog *prog;
> >> +        /* task for which callback is scheduled */
> >> +        struct task_struct *task;
> >> +        enum task_work_notify_mode mode;
> >> +        enum bpf_task_work_state state;
> >> +        bpf_task_work_callback_t callback_fn;
> >> +        struct callback_head work;
> >> +        struct irq_work irq_work;
> >> +        struct rcu_head rcu;
> >> +} __aligned(8);
> >> +
> >> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
> >> +{
> >> +        struct bpf_task_work_context *ctx;
> >> +
> >> +        ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
> >> +        if (ctx)
> >> +                memset(ctx, 0, sizeof(*ctx));
> >> +        return ctx;
> >> +}
> >> +
> >> +static void bpf_task_work_context_free(struct rcu_head *rcu)
> >> +{
> >> +        struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
> >> +        /* bpf_mem_free expects migration to be disabled */
> >> +        migrate_disable();
> >> +        bpf_mem_free(&bpf_global_ma, ctx);
> >> +        migrate_enable();
> >> +}
> >> +
> >> +static bool task_work_match(struct callback_head *head, void *data)
> >> +{
> >> +        struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> >> +
> >> +        return ctx == data;
> >> +}
> >> +
> >> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
> >> +{
> >> +        bpf_prog_put(ctx->prog);
> >> +        bpf_task_release(ctx->task);
> >> +}
> >> +
> >> +static void bpf_task_work_callback(struct callback_head *cb)
> >> +{
> >> +        enum bpf_task_work_state state;
> >> +        struct bpf_task_work_context *ctx;
> >> +        u32 idx;
> >> +        void *key;
> >> +
> >> +        ctx = container_of(cb, struct bpf_task_work_context, work);
> >> +
> >> +        /*
> >> +         * Read lock is needed to protect map key and value access below, it has to be done before
> >> +         * the state transition
> >> +         */
> >> +        rcu_read_lock_trace();
> >> +        /*
> >> +         * This callback may start running before bpf_task_work_irq() switched the state to
> >> +         * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
> >> +         */
> >> +        state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> >> +        if (state == BPF_TW_SCHEDULED)
> > ... and let's say we have concurrent cancel_and_free here, we mark
> > state BPF_TW_FREED.
> >
> >> +                state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> >> +        if (state == BPF_TW_FREED) {
> > ... and notice it here now ...
> >
> >> +                rcu_read_unlock_trace();
> >> +                bpf_task_work_context_reset(ctx);
> >> +                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > ... then I presume this is ok, because the cancel of tw in
> > cancel_and_free will fail?
> yes, if cancellation succeeds, callback will not be called.
> If it fails, cancel_and_free does not do anything, except changing
> the state and callback does the cleanup.
> > Maybe add a comment here that it's interlocked with the free path.
> >
> >> +                return;
> >> +        }
> >> +
> >> +        key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
> >> +        migrate_disable();
> >> +        ctx->callback_fn(ctx->map, key, ctx->map_val);
> >> +        migrate_enable();
> >> +        rcu_read_unlock_trace();
> >> +        /* State is running or freed, either way reset. */
> >> +        bpf_task_work_context_reset(ctx);
> >> +        state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> >> +        if (state == BPF_TW_FREED)
> >> +                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +}
> >> +
> >> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >> +{
> >> +        struct bpf_task_work_context *ctx;
> >> +        enum bpf_task_work_state state;
> >> +        int err;
> >> +
> >> +        ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> >> +
> >> +        state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> >> +        if (state == BPF_TW_FREED)
> >> +                goto free_context;
> >> +
> >> +        err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> >> +        if (err) {
> >> +                bpf_task_work_context_reset(ctx);
> >> +                state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> >> +                if (state == BPF_TW_FREED)
> >> +                        call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +                return;
> >> +        }
> >> +        state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> >> +        if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
> >> +                goto free_context; /* successful cancellation, release and free ctx */
> >> +        return;
> >> +
> >> +free_context:
> >> +        bpf_task_work_context_reset(ctx);
> >> +        call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +}
> >> +
> >> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
> >> +                                                                   struct bpf_map *map)
> >> +{
> >> +        struct bpf_task_work_context *ctx, *old_ctx;
> >> +        enum bpf_task_work_state state;
> >> +        struct bpf_task_work_context __force __rcu **ppc =
> >> +                (struct bpf_task_work_context __force __rcu **)&tw->ctx;
> >> +
> >> +        /* ctx pointer is RCU protected */
> >> +        rcu_read_lock_trace();
> >> +        ctx = rcu_dereference(*ppc);
> >> +        if (!ctx) {
> >> +                ctx = bpf_task_work_context_alloc();
> >> +                if (!ctx) {
> >> +                        rcu_read_unlock_trace();
> >> +                        return ERR_PTR(-ENOMEM);
> >> +                }
> >> +                old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
> >> +                /*
> >> +                 * If ctx is set by another CPU, release allocated memory.
> >> +                 * Do not fail, though, attempt stealing the work
> >> +                 */
> >> +                if (old_ctx) {
> >> +                        bpf_mem_free(&bpf_global_ma, ctx);
> >> +                        ctx = old_ctx;
> >> +                }
> >> +        }
> >> +        state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
> >> +        /*
> >> +         * We can unlock RCU, because task work scheduler (this codepath)
> >> +         * now owns the ctx or returning an error
> >> +         */
> >> +        rcu_read_unlock_trace();
> >> +        if (state != BPF_TW_STANDBY)
> >> +                return ERR_PTR(-EBUSY);
> >> +        return ctx;
> >> +}
> >> +
> >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> >> +                                  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >> +                                  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >> +{
> >> +        struct bpf_prog *prog;
> >> +        struct bpf_task_work_context *ctx = NULL;
> >> +        int err;
> >> +
> >> +        BTF_TYPE_EMIT(struct bpf_task_work);
> >> +
> >> +        prog = bpf_prog_inc_not_zero(aux->prog);
> >> +        if (IS_ERR(prog))
> >> +                return -EBADF;
> >> +
> >> +        if (!atomic64_read(&map->usercnt)) {
> >> +                err = -EBADF;
> >> +                goto release_prog;
> >> +        }
> > Please add a comment on why lack of ordering between load of usercnt
> > and load of tw->ctx is safe, in presence of a parallel usercnt
> > dec_and_test and ctx xchg.
> > See __bpf_async_init for similar race.
> I think I see what you mean, let me double check this.
> >
> >> +        task = bpf_task_acquire(task);
> >> +        if (!task) {
> >> +                err = -EPERM;
> >> +                goto release_prog;
> >> +        }
> >> +        ctx = bpf_task_work_context_acquire(tw, map);
> >> +        if (IS_ERR(ctx)) {
> >> +                err = PTR_ERR(ctx);
> >> +                goto release_all;
> >> +        }
> >> +
> >> +        ctx->task = task;
> >> +        ctx->callback_fn = callback_fn;
> >> +        ctx->prog = prog;
> >> +        ctx->mode = mode;
> >> +        ctx->map = map;
> >> +        ctx->map_val = (void *)tw - map->record->task_work_off;
> >> +        init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> >> +        init_task_work(&ctx->work, bpf_task_work_callback);
> >> +
> >> +        irq_work_queue(&ctx->irq_work);
> >> +
> >> +        return 0;
> >> +
> >> +release_all:
> >> +        bpf_task_release(task);
> >> +release_prog:
> >> +        bpf_prog_put(prog);
> >> +        return err;
> >> +}
> >> +
> >>  /**
> >>   * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
> >>   * @task: Task struct for which callback should be scheduled
> >> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
> >>   *
> >>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >>   */
> >> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >> -                                              struct bpf_task_work *tw,
> >> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
> >>                                                struct bpf_map *map__map,
> >> -                                              bpf_task_work_callback_t callback,
> >> -                                              void *aux__prog)
> >> +                                              bpf_task_work_callback_t callback, void *aux__prog)
> >>  {
> >> -        return 0;
> >> +        return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
> >>  }
> >>
> >>  /**
> >> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >>   *
> >>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >>   */
> >> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
> >> -                                              struct bpf_task_work *tw,
> >> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
> >>                                                struct bpf_map *map__map,
> >> -                                              bpf_task_work_callback_t callback,
> >> -                                              void *aux__prog)
> >> +                                              bpf_task_work_callback_t callback, void *aux__prog)
> >>  {
> >> -        return 0;
> >> +        enum task_work_notify_mode mode;
> >> +
> >> +        mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> >> +        return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
> >>  }
> >>
> >>  __bpf_kfunc_end_defs();
> >>
> >>  void bpf_task_work_cancel_and_free(void *val)
> >>  {
> >> +        struct bpf_task_work *tw = val;
> >> +        struct bpf_task_work_context *ctx;
> >> +        enum bpf_task_work_state state;
> >> +
> >> +        /* No need do rcu_read_lock as no other codepath can reset this pointer */
> >> +        ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> >> +        if (!ctx)
> >> +                return;
> >> +        state = xchg(&ctx->state, BPF_TW_FREED);
> >> +
> >> +        switch (state) {
> >> +        case BPF_TW_SCHEDULED:
> >> +                /* If we can't cancel task work, rely on task work callback to free the context */
> >> +                if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> > This part looks broken to me.
> > You are calling this path
> > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > possibly NMI context.
> > Which means we can deadlock if we hit the NMI context prog in the
> > middle of task->pi_lock critical section.
> > That's taken in task_work functions
> > The task_work_cancel_match takes the pi_lock.
> Good point, thanks. I think this could be solved in 2 ways:
> * Don't cancel, rely on callback dropping the work
> * Cancel in another irq_work
> I'll probably go with the second one.

What about 1? It seems like we can just rely on the existing hunk to
free the callback on noticing BPF_TW_FREED? That seems simpler to me.

> >
> >> +                        break;
> >> +                bpf_task_work_context_reset(ctx);
> >> +                fallthrough;
> >> +        case BPF_TW_STANDBY:
> >> +                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +                break;
> >> +        /* In all below cases scheduling logic should detect context state change and cleanup */
> >> +        case BPF_TW_SCHEDULING:
> >> +        case BPF_TW_PENDING:
> >> +        case BPF_TW_RUNNING:
> >> +        default:
> >> +                break;
> >> +        }
> >>  }
> >>
> >>  BTF_KFUNCS_START(generic_btf_ids)
> >> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
> >>  BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
> >>  BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
> >>  BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
> >>
> >>  #ifdef CONFIG_CGROUPS
> >>  BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> >> --
> >> 2.50.1
> >>
>
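
To make option 1 concrete, the cancel path could shrink to roughly the
below (untested sketch, reusing the helpers from this patch): never call
task_work_cancel_match() here, only free a context nobody owns, and let
the irq_work/callback side free everything else when it notices
BPF_TW_FREED on its next state transition:

void bpf_task_work_cancel_and_free(void *val)
{
        struct bpf_task_work *tw = val;
        struct bpf_task_work_context *ctx;
        enum bpf_task_work_state state;

        ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
        if (!ctx)
                return;
        state = xchg(&ctx->state, BPF_TW_FREED);
        /*
         * For PENDING/SCHEDULING/SCHEDULED/RUNNING the owning path
         * (bpf_task_work_irq() or bpf_task_work_callback()) sees
         * BPF_TW_FREED, drops the prog/task refs and frees the context,
         * so task->pi_lock is never taken from this possibly-NMI path.
         */
        if (state == BPF_TW_STANDBY)
                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
}

The trade-off is that an already queued task_work is no longer cancelled
early; it runs up to the BPF_TW_FREED check and only then releases its
references.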