On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:

[...]

> A small state machine and refcounting scheme ensures safe reuse and
> teardown:
> STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY

Nit: the state machine is actually a bit more complex:

digraph G {
  scheduling -> running    [label="callback 1"];
  scheduled  -> running    [label="callback 2"];
  running    -> standby    [label="callback 3"];
  pending    -> scheduling [label="irq 1"];
  scheduling -> standby    [label="irq 2"];
  scheduling -> scheduled  [label="irq 3"];
  standby    -> pending    [label="acquire_ctx"];
  freed      -> freed      [label="cancel_and_free"];
  pending    -> freed      [label="cancel_and_free"];
  running    -> freed      [label="cancel_and_free"];
  scheduled  -> freed      [label="cancel_and_free"];
  scheduling -> freed      [label="cancel_and_free"];
  standby    -> freed      [label="cancel_and_free"];
}

[...]

> Flow of successful task work scheduling
> 1) bpf_task_work_schedule_* is called from BPF code.
> 2) Transition state from STANDBY to PENDING, marks context is owned by
> this task work scheduler
> 3) irq_work_queue() schedules bpf_task_work_irq().
> 4) Transition state from PENDING to SCHEDULING.
> 4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> transitions to SCHEDULED.

Nit: "4" is repeated twice.

> 5) Task work calls bpf_task_work_callback(), which transition state to
> RUNNING.
> 6) BPF callback is executed
> 7) Context is cleaned up, refcounts released, context state set back to
> STANDBY.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> ---
>  kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 317 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 109cb249e88c..418a0a211699 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[...]

> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> +{
> +	/*
> +	 * Scheduled task_work callback holds ctx ref, so if we successfully
> +	 * cancelled, we put that ref on callback's behalf. If we couldn't
> +	 * cancel, callback is inevitably run or has already completed
> +	 * running, and it would have taken care of its ctx ref itself.
> +	 */
> +	if (task_work_cancel_match(ctx->task, task_work_match, ctx))

Will `task_work_cancel(ctx->task, &ctx->work)` do the same thing here?

> +		bpf_task_work_ctx_put(ctx);
> +}

[...]

> +static void bpf_task_work_irq(struct irq_work *irq_work)
> +{
> +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> +	enum bpf_task_work_state state;
> +	int err;
> +
> +	guard(rcu_tasks_trace)();
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> +		bpf_task_work_ctx_put(ctx);
> +		return;
> +	}

Why are separate PENDING and SCHEDULING states needed?
Both indicate that the work has not yet been, but is ready to be,
submitted to task_work_add().
So, at first glance it seems that merging the two wouldn't change the
behaviour; what do I miss?
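
To make the question concrete: naively I'd expect the merged variant to look
roughly like the sketch below (untested, reusing the names from this patch,
and I may well be missing a race that the extra state is there to close):

	/* in bpf_task_work_acquire_ctx(): claim the context and move it
	 * straight into a single "claimed but not yet submitted" state
	 */
	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_SCHEDULING) != BPF_TW_STANDBY) {
		/* lost the acquiring race or map_release_uref() moved us to FREED */
		bpf_task_work_ctx_put(ctx);
		return ERR_PTR(-EBUSY);
	}

	/* in bpf_task_work_irq(): no PENDING -> SCHEDULING step; a concurrent
	 * transition to FREED would still be caught by the existing
	 * SCHEDULING -> SCHEDULED cmpxchg afterwards, as in the current
	 * error handling
	 */
	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
	/* (rest as in this patch) */

If there is a race that the extra intermediate state closes, a short comment
near this cmpxchg explaining it would be helpful.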
> +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +	if (err) {
> +		bpf_task_work_ctx_reset(ctx);
> +		/*
> +		 * try to switch back to STANDBY for another task_work reuse, but we might have
> +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
> +		 */
> +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +
> +		/* we don't have RCU protection, so put after switching state */
> +		bpf_task_work_ctx_put(ctx);
> +	}
> +
> +	/*
> +	 * It's technically possible for just scheduled task_work callback to
> +	 * complete running by now, going SCHEDULING -> RUNNING and then
> +	 * dropping its ctx refcount. Instead of capturing extra ref just to
> +	 * protected below ctx->state access, we rely on RCU protection to
> +	 * perform below SCHEDULING -> SCHEDULED attempt.
> +	 */
> +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +	if (state == BPF_TW_FREED)
> +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> +}

[...]

> +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> +							    struct bpf_map *map)
> +{
> +	struct bpf_task_work_ctx *ctx;
> +
> +	/* early check to avoid any work, we'll double check at the end again */
> +	if (!atomic64_read(&map->usercnt))
> +		return ERR_PTR(-EBUSY);
> +
> +	ctx = bpf_task_work_fetch_ctx(tw, map);
> +	if (IS_ERR(ctx))
> +		return ctx;
> +
> +	/* try to get ref for task_work callback to hold */
> +	if (!bpf_task_work_ctx_tryget(ctx))
> +		return ERR_PTR(-EBUSY);
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> +		/* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> +		bpf_task_work_ctx_put(ctx);
> +		return ERR_PTR(-EBUSY);
> +	}
> +
> +	/*
> +	 * Double check that map->usercnt wasn't dropped while we were
> +	 * preparing context, and if it was, we need to clean up as if
> +	 * map_release_uref() was called; bpf_task_work_cancel_and_free()
> +	 * is safe to be called twice on the same task work
> +	 */
> +	if (!atomic64_read(&map->usercnt)) {
> +		/* drop ref we just got for task_work callback itself */
> +		bpf_task_work_ctx_put(ctx);
> +		/* transfer map's ref into cancel_and_free() */
> +		bpf_task_work_cancel_and_free(tw);
> +		return ERR_PTR(-EBUSY);
> +	}

I don't understand how the above check is useful.
Is map->usercnt protected from being changed during execution of
bpf_task_work_schedule()?
There are two such checks in this function, so apparently it is not.
Then what's the point of checking the usercnt value if it can change
immediately after the check?

> +
> +	return ctx;
> +}
> +
> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> +				  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> +				  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> +{
> +	struct bpf_prog *prog;
> +	struct bpf_task_work_ctx *ctx;
> +	int err;
> +
> +	BTF_TYPE_EMIT(struct bpf_task_work);
> +
> +	prog = bpf_prog_inc_not_zero(aux->prog);
> +	if (IS_ERR(prog))
> +		return -EBADF;
> +	task = bpf_task_acquire(task);
> +	if (!task) {
> +		err = -EPERM;

Nit: Why -EPERM?
bpf_task_acquire() returns NULL if task->rcu_users is zero, which does
not seem to be permission related.
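
Maybe -ESRCH would describe the situation better, e.g. (just a suggestion):

	task = bpf_task_acquire(task);
	if (!task) {
		/* task is already exiting and can't be ref-counted anymore */
		err = -ESRCH;
		goto release_prog;
	}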
> +		goto release_prog;
> +	}
> +
> +	ctx = bpf_task_work_acquire_ctx(tw, map);
> +	if (IS_ERR(ctx)) {
> +		err = PTR_ERR(ctx);
> +		goto release_all;
> +	}
> +
> +	ctx->task = task;
> +	ctx->callback_fn = callback_fn;
> +	ctx->prog = prog;
> +	ctx->mode = mode;
> +	ctx->map = map;
> +	ctx->map_val = (void *)tw - map->record->task_work_off;
> +	init_task_work(&ctx->work, bpf_task_work_callback);
> +	init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> +
> +	irq_work_queue(&ctx->irq_work);
> +	return 0;
> +
> +release_all:
> +	bpf_task_release(task);
> +release_prog:
> +	bpf_prog_put(prog);
> +	return err;
> +}
> +

[...]