On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
>
> On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > <mykyta.yatsenko5@xxxxxxxxx> wrote:
> >> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >>
> >> Implementation of the bpf_task_work_schedule kfuncs.
> >>
> >> Main components:
> >> * struct bpf_task_work_context – Metadata and state management per task
> >>   work.
> >> * enum bpf_task_work_state – A state machine to serialize work
> >>   scheduling and execution.
> >> * bpf_task_work_schedule() – The central helper that initiates
> >>   scheduling.
> >> * bpf_task_work_acquire() – Attempts to take ownership of the context
> >>   pointed to by the passed struct bpf_task_work; allocates a new context
> >>   if none exists yet.
> >> * bpf_task_work_callback() – Invoked when the actual task_work runs.
> >> * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> >>   to enqueue task work.
> >> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > Can you elaborate on why the bouncing through irq_work context is necessary?
> > I think we should have this info in the commit log.
> > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> yes, mainly to avoid locks in NMI.
> >
> >> Flow of successful task work scheduling
> >> 1) bpf_task_work_schedule_* is called from BPF code.
> >> 2) Transition state from STANDBY to PENDING, marking that the context is
> >>    owned by this task work scheduler.
> >> 3) irq_work_queue() schedules bpf_task_work_irq().
> >> 4) Transition state from PENDING to SCHEDULING.
> >> 5) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>    transitions to SCHEDULED.
> >> 6) Task work calls bpf_task_work_callback(), which transitions the state
> >>    to RUNNING.
> >> 7) BPF callback is executed.
> >> 8) Context is cleaned up, refcounts released, context state set back to
> >>    STANDBY.
> >>
> >> bpf_task_work_context handling
> >> The context pointer is stored in the bpf_task_work ctx field (u64) but
> >> treated as an __rcu pointer via casts.
> >> bpf_task_work_acquire() publishes a new bpf_task_work_context by cmpxchg
> >> with an RCU initializer.
> >> It is read under the RCU lock only in bpf_task_work_acquire() when
> >> ownership is contended.
> >> Upon deleting the map value, bpf_task_work_cancel_and_free() detaches the
> >> context pointer from struct bpf_task_work and releases resources if the
> >> scheduler does not own the context or it can be canceled (state ==
> >> STANDBY, or state == SCHEDULED and the callback was canceled). If the
> >> task work scheduler owns the context, its state is set to FREED and the
> >> scheduler is expected to clean up on the next state transition.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >> ---
> > This is much better now, with clear ownership between the free path and
> > the scheduling path. I mostly have a few more comments on the current
> > implementation, plus one potential bug.
> >
> > However, the more time I spend on this, the more I feel we should
> > unify all this with the two other bpf async work execution mechanisms
> > (timers and wq), and simplify and deduplicate a lot of this under the
> > serialized async->lock. I know NMI execution is probably critical for
> > this primitive, but we can replace async->lock with rqspinlock to
> > address that, so that it becomes safe to serialize in any context.
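
(To make this a bit more concrete, below is a rough, untested sketch of
what I mean by folding this into the existing bpf_async machinery in
kernel/bpf/helpers.c. BPF_ASYNC_TYPE_TASK_WORK and struct
bpf_task_work_kern are made-up names for illustration, not something
that exists today.)

enum bpf_async_type {
        BPF_ASYNC_TYPE_TIMER = 0,
        BPF_ASYNC_TYPE_WQ,
        BPF_ASYNC_TYPE_TASK_WORK,       /* hypothetical new case */
};

/*
 * Would hang off bpf_async_kern the way bpf_hrtimer/bpf_work do, with
 * struct bpf_async_cb first, so the init/set_callback/free plumbing can
 * be shared and serialized under the same per-value lock.
 */
struct bpf_task_work_kern {
        struct bpf_async_cb cb;
        struct task_struct *task;
        enum task_work_notify_mode mode;
        struct callback_head work;
        struct irq_work irq_work;
};

The task_work specific part would then only be the irq_work bounce and
the task_work_add()/callback glue, instead of a whole parallel state
machine.
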
> > Apart from that, I don't see anything that would negate reworking all
> > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > task_work locks that have trouble with NMI execution (see later
> > comments).
> >
> > I also feel like it would be cleaner if we split the API into 3 steps:
> > init(), set_callback(), schedule() like the other cases, I don't see
> > why we necessarily need to diverge, and it simplifies some of the
> > logic in schedule().
> > Once every state update is protected by a lock, all of the state
> > transitions are done automatically and a lot of the extra races are
> > eliminated.
> >
> > I think we should discuss whether this was considered and why you
> > discarded this approach, otherwise the code is pretty complex, with
> > little upside.
> > Maybe I'm missing something obvious and you'd know more since you've
> > thought about all this longer.
> As for the API, I think having 1 function for scheduling a callback is
> cleaner than having 3 which are always called in the same order anyway.
> Most of the complexity comes from synchronization, not logic, so not
> having to do the same synchronization in init(), set_callback() and
> schedule() seems like a benefit to me.

Well, if you were to reuse bpf_async_kern, all of that extra logic is
already taken care of, or can be easily shared. If you look closely
you'll see that a lot of what you're doing is a repetition of what
timers and bpf_wq have.

> Let me check if using rqspinlock is going to make things simpler. We
> still need states to at least know if cancellation is possible and to
> flag deletion to the scheduler, but using a lock will make the code
> easier to understand.

Yeah, I think for all of this using lockless updates is not really
worth it, let's just serialize using a lock.
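
Roughly what I'm picturing, as an untested sketch: it assumes the
rqspinlock primitives (raw_res_spin_lock_irqsave() /
raw_res_spin_unlock_irqrestore()) and a hypothetical rqspinlock_t lock
sitting next to ctx->state:

static int bpf_task_work_set_state(struct bpf_task_work_context *ctx,
                                   enum bpf_task_work_state from,
                                   enum bpf_task_work_state to)
{
        unsigned long flags;
        int err;

        /*
         * Unlike a plain spinlock, this returns an error on deadlock or
         * timeout instead of hanging, so it should be callable from any
         * context the kfuncs run in, including NMI.
         */
        err = raw_res_spin_lock_irqsave(&ctx->lock, flags);
        if (err)
                return err;
        if (ctx->state == from)
                ctx->state = to;
        else
                err = -EBUSY;
        raw_res_spin_unlock_irqrestore(&ctx->lock, flags);
        return err;
}

Then the schedule, irq_work, callback and cancel_and_free paths could
all go through one helper like this instead of open-coded cmpxchg
chains.
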
> >
> >>  kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 260 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index d2f88a9bc47b..346ae8fd3ada 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -25,6 +25,8 @@
> >>  #include <linux/kasan.h>
> >>  #include <linux/bpf_verifier.h>
> >>  #include <linux/uaccess.h>
> >> +#include <linux/task_work.h>
> >> +#include <linux/irq_work.h>
> >>
> >>  #include "../../lib/kstrtox.h"
> >>
> >> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
> >>
> >>  typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
> >>
> >> +enum bpf_task_work_state {
> >> +        /* bpf_task_work is ready to be used */
> >> +        BPF_TW_STANDBY = 0,
> >> +        /* irq work scheduling in progress */
> >> +        BPF_TW_PENDING,
> >> +        /* task work scheduling in progress */
> >> +        BPF_TW_SCHEDULING,
> >> +        /* task work is scheduled successfully */
> >> +        BPF_TW_SCHEDULED,
> >> +        /* callback is running */
> >> +        BPF_TW_RUNNING,
> >> +        /* associated BPF map value is deleted */
> >> +        BPF_TW_FREED,
> >> +};
> >> +
> >> +struct bpf_task_work_context {
> >> +        /* the map and map value associated with this context */
> >> +        struct bpf_map *map;
> >> +        void *map_val;
> >> +        /* bpf_prog that schedules task work */
> >> +        struct bpf_prog *prog;
> >> +        /* task for which callback is scheduled */
> >> +        struct task_struct *task;
> >> +        enum task_work_notify_mode mode;
> >> +        enum bpf_task_work_state state;
> >> +        bpf_task_work_callback_t callback_fn;
> >> +        struct callback_head work;
> >> +        struct irq_work irq_work;
> >> +        struct rcu_head rcu;
> >> +} __aligned(8);
> >> +
> >> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
> >> +{
> >> +        struct bpf_task_work_context *ctx;
> >> +
> >> +        ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
> >> +        if (ctx)
> >> +                memset(ctx, 0, sizeof(*ctx));
> >> +        return ctx;
> >> +}
> >> +
> >> +static void bpf_task_work_context_free(struct rcu_head *rcu)
> >> +{
> >> +        struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
> >> +        /* bpf_mem_free expects migration to be disabled */
> >> +        migrate_disable();
> >> +        bpf_mem_free(&bpf_global_ma, ctx);
> >> +        migrate_enable();
> >> +}
> >> +
> >> +static bool task_work_match(struct callback_head *head, void *data)
> >> +{
> >> +        struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> >> +
> >> +        return ctx == data;
> >> +}
> >> +
> >> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
> >> +{
> >> +        bpf_prog_put(ctx->prog);
> >> +        bpf_task_release(ctx->task);
> >> +}
> >> +
> >> +static void bpf_task_work_callback(struct callback_head *cb)
> >> +{
> >> +        enum bpf_task_work_state state;
> >> +        struct bpf_task_work_context *ctx;
> >> +        u32 idx;
> >> +        void *key;
> >> +
> >> +        ctx = container_of(cb, struct bpf_task_work_context, work);
> >> +
> >> +        /*
> >> +         * Read lock is needed to protect map key and value access below, it has to be done before
> >> +         * the state transition
> >> +         */
> >> +        rcu_read_lock_trace();
> >> +        /*
> >> +         * This callback may start running before bpf_task_work_irq() switched the state to
> >> +         * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
> >> +         */
> >> +        state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> >> +        if (state == BPF_TW_SCHEDULED)
> > ... and let's say we have concurrent cancel_and_free here, we mark
> > state BPF_TW_FREED.
> >
> >> +                state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> >> +        if (state == BPF_TW_FREED) {
> > ... and notice it here now ...
> >
> >> +                rcu_read_unlock_trace();
> >> +                bpf_task_work_context_reset(ctx);
> >> +                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> > ... then I presume this is ok, because the cancel of tw in
> > cancel_and_free will fail?
> yes, if cancellation succeeds, callback will not be called.
> If it fails, cancel_and_free does not do anything, except changing
> the state and callback does the cleanup.
> > Maybe add a comment here that it's interlocked with the free path.
> >
> >> +                return;
> >> +        }
> >> +
> >> +        key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
> >> +        migrate_disable();
> >> +        ctx->callback_fn(ctx->map, key, ctx->map_val);
> >> +        migrate_enable();
> >> +        rcu_read_unlock_trace();
> >> +        /* State is running or freed, either way reset. */
> >> +        bpf_task_work_context_reset(ctx);
> >> +        state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> >> +        if (state == BPF_TW_FREED)
> >> +                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +}
> >> +
> >> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >> +{
> >> +        struct bpf_task_work_context *ctx;
> >> +        enum bpf_task_work_state state;
> >> +        int err;
> >> +
> >> +        ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> >> +
> >> +        state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> >> +        if (state == BPF_TW_FREED)
> >> +                goto free_context;
> >> +
> >> +        err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> >> +        if (err) {
> >> +                bpf_task_work_context_reset(ctx);
> >> +                state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> >> +                if (state == BPF_TW_FREED)
> >> +                        call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +                return;
> >> +        }
> >> +        state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> >> +        if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
> >> +                goto free_context; /* successful cancellation, release and free ctx */
> >> +        return;
> >> +
> >> +free_context:
> >> +        bpf_task_work_context_reset(ctx);
> >> +        call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +}
> >> +
> >> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
> >> +                                                                   struct bpf_map *map)
> >> +{
> >> +        struct bpf_task_work_context *ctx, *old_ctx;
> >> +        enum bpf_task_work_state state;
> >> +        struct bpf_task_work_context __force __rcu **ppc =
> >> +                (struct bpf_task_work_context __force __rcu **)&tw->ctx;
> >> +
> >> +        /* ctx pointer is RCU protected */
> >> +        rcu_read_lock_trace();
> >> +        ctx = rcu_dereference(*ppc);
> >> +        if (!ctx) {
> >> +                ctx = bpf_task_work_context_alloc();
> >> +                if (!ctx) {
> >> +                        rcu_read_unlock_trace();
> >> +                        return ERR_PTR(-ENOMEM);
> >> +                }
> >> +                old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
> >> +                /*
> >> +                 * If ctx is set by another CPU, release allocated memory.
> >> +                 * Do not fail, though, attempt stealing the work
> >> +                 */
> >> +                if (old_ctx) {
> >> +                        bpf_mem_free(&bpf_global_ma, ctx);
> >> +                        ctx = old_ctx;
> >> +                }
> >> +        }
> >> +        state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
> >> +        /*
> >> +         * We can unlock RCU, because task work scheduler (this codepath)
> >> +         * now owns the ctx or returning an error
> >> +         */
> >> +        rcu_read_unlock_trace();
> >> +        if (state != BPF_TW_STANDBY)
> >> +                return ERR_PTR(-EBUSY);
> >> +        return ctx;
> >> +}
> >> +
> >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> >> +                                  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >> +                                  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >> +{
> >> +        struct bpf_prog *prog;
> >> +        struct bpf_task_work_context *ctx = NULL;
> >> +        int err;
> >> +
> >> +        BTF_TYPE_EMIT(struct bpf_task_work);
> >> +
> >> +        prog = bpf_prog_inc_not_zero(aux->prog);
> >> +        if (IS_ERR(prog))
> >> +                return -EBADF;
> >> +
> >> +        if (!atomic64_read(&map->usercnt)) {
> >> +                err = -EBADF;
> >> +                goto release_prog;
> >> +        }
> > Please add a comment on why lack of ordering between load of usercnt
> > and load of tw->ctx is safe, in presence of a parallel usercnt
> > dec_and_test and ctx xchg.
> > See __bpf_async_init for similar race.
> I think I see what you mean, let me double check this.
> >
> >> +        task = bpf_task_acquire(task);
> >> +        if (!task) {
> >> +                err = -EPERM;
> >> +                goto release_prog;
> >> +        }
> >> +        ctx = bpf_task_work_context_acquire(tw, map);
> >> +        if (IS_ERR(ctx)) {
> >> +                err = PTR_ERR(ctx);
> >> +                goto release_all;
> >> +        }
> >> +
> >> +        ctx->task = task;
> >> +        ctx->callback_fn = callback_fn;
> >> +        ctx->prog = prog;
> >> +        ctx->mode = mode;
> >> +        ctx->map = map;
> >> +        ctx->map_val = (void *)tw - map->record->task_work_off;
> >> +        init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> >> +        init_task_work(&ctx->work, bpf_task_work_callback);
> >> +
> >> +        irq_work_queue(&ctx->irq_work);
> >> +
> >> +        return 0;
> >> +
> >> +release_all:
> >> +        bpf_task_release(task);
> >> +release_prog:
> >> +        bpf_prog_put(prog);
> >> +        return err;
> >> +}
> >> +
> >>  /**
> >>   * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
> >>   * @task: Task struct for which callback should be scheduled
> >> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
> >>   *
> >>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >>   */
> >> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >> -                                              struct bpf_task_work *tw,
> >> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
> >>                                                struct bpf_map *map__map,
> >> -                                              bpf_task_work_callback_t callback,
> >> -                                              void *aux__prog)
> >> +                                              bpf_task_work_callback_t callback, void *aux__prog)
> >>  {
> >> -        return 0;
> >> +        return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
> >>  }
> >>
> >>  /**
> >> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
> >>   *
> >>   * Return: 0 if task work has been scheduled successfully, negative error code otherwise
> >>   */
> >> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
> >> -                                              struct bpf_task_work *tw,
> >> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
> >>                                                struct bpf_map *map__map,
> >> -                                              bpf_task_work_callback_t callback,
> >> -                                              void *aux__prog)
> >> +                                              bpf_task_work_callback_t callback, void *aux__prog)
> >>  {
> >> -        return 0;
> >> +        enum task_work_notify_mode mode;
> >> +
> >> +        mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> >> +        return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
> >>  }
> >>
> >>  __bpf_kfunc_end_defs();
> >>
> >>  void bpf_task_work_cancel_and_free(void *val)
> >>  {
> >> +        struct bpf_task_work *tw = val;
> >> +        struct bpf_task_work_context *ctx;
> >> +        enum bpf_task_work_state state;
> >> +
> >> +        /* No need do rcu_read_lock as no other codepath can reset this pointer */
> >> +        ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
> >> +        if (!ctx)
> >> +                return;
> >> +        state = xchg(&ctx->state, BPF_TW_FREED);
> >> +
> >> +        switch (state) {
> >> +        case BPF_TW_SCHEDULED:
> >> +                /* If we can't cancel task work, rely on task work callback to free the context */
> >> +                if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> > This part looks broken to me.
> > You are calling this path
> > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > possibly NMI context.
> > Which means we can deadlock if we hit the NMI context prog in the
> > middle of task->pi_lock critical section.
> > That's taken in task_work functions
> > The task_work_cancel_match takes the pi_lock.
> Good point, thanks. I think this could be solved in 2 ways:
> * Don't cancel, rely on callback dropping the work
> * Cancel in another irq_work
> I'll probably go with the second one.

What about 1? It seems like we can just rely on the existing hunk to
free the callback on noticing BPF_TW_FREED? That seems simpler to me.

> >
> >> +                        break;
> >> +                bpf_task_work_context_reset(ctx);
> >> +                fallthrough;
> >> +        case BPF_TW_STANDBY:
> >> +                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> >> +                break;
> >> +        /* In all below cases scheduling logic should detect context state change and cleanup */
> >> +        case BPF_TW_SCHEDULING:
> >> +        case BPF_TW_PENDING:
> >> +        case BPF_TW_RUNNING:
> >> +        default:
> >> +                break;
> >> +        }
> >>  }
> >>
> >>  BTF_KFUNCS_START(generic_btf_ids)
> >> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
> >>  BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
> >>  BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
> >>  BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
> >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
> >> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
> >>
> >>  #ifdef CONFIG_CGROUPS
> >>  BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> >> --
> >> 2.50.1
> >>
>
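
To make option 1 concrete, the cancel path could shrink to roughly the
below (untested sketch, reusing the helpers from this patch): never call
task_work_cancel_match() here, only free a context nobody owns, and let
the irq_work/callback side free everything else when it notices
BPF_TW_FREED on its next state transition:

void bpf_task_work_cancel_and_free(void *val)
{
        struct bpf_task_work *tw = val;
        struct bpf_task_work_context *ctx;
        enum bpf_task_work_state state;

        ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
        if (!ctx)
                return;
        state = xchg(&ctx->state, BPF_TW_FREED);
        /*
         * For PENDING/SCHEDULING/SCHEDULED/RUNNING the owning path
         * (bpf_task_work_irq() or bpf_task_work_callback()) sees
         * BPF_TW_FREED, drops the prog/task refs and frees the context,
         * so task->pi_lock is never taken from this possibly-NMI path.
         */
        if (state == BPF_TW_STANDBY)
                call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
}

The trade-off is that an already queued task_work is no longer cancelled
early; it runs up to the BPF_TW_FREED check and only then releases its
references.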