Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs

On Mon, Sep 8, 2025 at 9:13 AM Mykyta Yatsenko
<mykyta.yatsenko5@xxxxxxxxx> wrote:
>
> On 9/6/25 21:22, Eduard Zingerman wrote:
> > On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:
> >
> > [...]
> >
> >> A small state machine and refcounting scheme ensures safe reuse and
> >> teardown:
> >>   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
> > Nit: the state machine is actually a bit more complex:
> >
> >    digraph G {
> >      scheduling  -> running    [label="callback 1"];
> >      scheduled   -> running    [label="callback 2"];
> >      running     -> standby    [label="callback 3"];
> >      pending     -> scheduling [label="irq 1"];
> >      scheduling  -> standby    [label="irq 2"];
> >      scheduling  -> scheduled  [label="irq 3"];
> >      standby     -> pending    [label="acquire_ctx"];
> >
> >      freed      -> freed [label="cancel_and_free"];
> >      pending    -> freed [label="cancel_and_free"];
> >      running    -> freed [label="cancel_and_free"];
> >      scheduled  -> freed [label="cancel_and_free"];
> >      scheduling -> freed [label="cancel_and_free"];
> >      standby    -> freed [label="cancel_and_free"];
> >    }
> >
> > [...]
> >
> I'll update the description to contain a proper graph.

Hm... I like the main linear chain of state transitions as is, tbh.
It's fundamentally simple and helps get the general picture across.
Sure, there are important details, but I don't think we should
overwhelm anyone reading this with all of them upfront.

In the above, the "callback 1" labels and so on are not really helpful
for understanding, IMO.

I'd just add a note that a) each state can transition to FREED and
b) with a tiny probability we might skip SCHEDULED and go SCHEDULING ->
RUNNING (extremely unlikely, if at all possible, tbh).

In short, let's not go too detailed here.
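
Something like this for the commit message, maybe (just a sketch; I'm
assuming the BPF_TW_* naming from the quoted hunks extends to the
SCHEDULED/RUNNING states as well):

  /*
   * Main (successful) chain:
   *   STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY
   * Plus two extra notes: any state can transition into FREED via
   * cancel_and_free(), and SCHEDULED may theoretically be skipped
   * (SCHEDULING -> RUNNING) if task_work fires before
   * bpf_task_work_irq() gets to recording the transition.
   */
  enum bpf_task_work_state {
          BPF_TW_STANDBY = 0,     /* ctx is free for reuse */
          BPF_TW_PENDING,         /* claimed by a scheduler, irq_work queued */
          BPF_TW_SCHEDULING,      /* bpf_task_work_irq() is in progress */
          BPF_TW_SCHEDULED,       /* task_work_add() succeeded */
          BPF_TW_RUNNING,         /* BPF callback is executing */
          BPF_TW_FREED,           /* terminal state, ctx being torn down */
  };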

> >> Flow of successful task work scheduling
> >>   1) bpf_task_work_schedule_* is called from BPF code.
> >>   2) Transition state from STANDBY to PENDING, marking that the
> >>   context is owned by this task work scheduler.
> >>   3) irq_work_queue() schedules bpf_task_work_irq().
> >>   4) Transition state from PENDING to SCHEDULING.
> >>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>   transitions to SCHEDULED.
> > Nit: "4" repeated two times.
> >
> >>   5) Task work calls bpf_task_work_callback(), which transitions the
> >>   state to RUNNING.
> >>   6) BPF callback is executed.
> >>   7) Context is cleaned up, refcounts released, context state set back to
> >>   STANDBY.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >> ---
> >>   kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
> >>   1 file changed, 317 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index 109cb249e88c..418a0a211699 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> > [...]
> >
> >> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> >> +{
> >> +    /*
> >> +     * Scheduled task_work callback holds ctx ref, so if we successfully
> >> +     * cancelled, we put that ref on callback's behalf. If we couldn't
> >> +     * cancel, callback is inevitably run or has already completed
> >> +     * running, and it would have taken care of its ctx ref itself.
> >> +     */
> >> +    if (task_work_cancel_match(ctx->task, task_work_match, ctx))
> > Will `task_work_cancel(ctx->task, &ctx->work)` do the same thing here?
> I think so, yes, thanks for checking.
> >
> >> +            bpf_task_work_ctx_put(ctx);
> >> +}
> > [...]
> >
> >> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >> +{
> >> +    struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> >> +    enum bpf_task_work_state state;
> >> +    int err;
> >> +
> >> +    guard(rcu_tasks_trace)();
> >> +
> >> +    if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> >> +            bpf_task_work_ctx_put(ctx);
> >> +            return;
> >> +    }
> > Why are separate PENDING and SCHEDULING states needed?
> > Both indicate that the work has not yet been, but is ready to be,
> > submitted to task_work_add(). So, at first glance it seems that
> > merging the two won't change the behaviour; what am I missing?
> Yes, this is right, we may drop the SCHEDULING state; it does not
> change any behavior compared to PENDING.
> The state check before task_work_add() is needed anyway, so we won't
> remove much code here.
> I kept it just to be more consistent: with every state check we also
> transition the state machine forward.

Yeah, I like this property as well, I think it makes it easier to
reason about all this. I'd keep the PENDING and SCHEDULING
distinction, unless there is a strong reason not to.

It also gives us a natural point to check for FREED before doing
unnecessary task_work scheduling + cancelling (if we were already in
FREED). It doesn't seem like we'll simplify anything by dropping the
SCHEDULING (or PENDING) state.
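
To illustrate (a rough hypothetical sketch of the merged variant, not
the patch code): without the PENDING -> SCHEDULING cmpxchg() the irq
handler would be left with a plain racy read:

  /* Hypothetical bpf_task_work_irq() body if PENDING and SCHEDULING
   * were merged: there is no atomic point at which a concurrent
   * cancel_and_free() flipping the state to BPF_TW_FREED is observed,
   * so FREED can land right after this check and we'd task_work_add()
   * only to have to cancel it again later.
   */
  if (READ_ONCE(ctx->state) == BPF_TW_FREED) {
          bpf_task_work_ctx_put(ctx);
          return;
  }
  err = task_work_add(ctx->task, &ctx->work, ctx->mode);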

> >
> >> +    err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> >> +    if (err) {
> >> +            bpf_task_work_ctx_reset(ctx);
> >> +            /*
> >> +             * try to switch back to STANDBY for another task_work reuse, but we might have
> >> +             * gone to FREED already, which is fine as we already cleaned up after ourselves
> >> +             */
> >> +            (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> >> +
> >> +            /* we don't have RCU protection, so put after switching state */
> >> +            bpf_task_work_ctx_put(ctx);
> >> +    }
> >> +

[...]

> >> +
> >> +    return ctx;
> >> +}
> >> +
> >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> >> +                              struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >> +                              struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >> +{
> >> +    struct bpf_prog *prog;
> >> +    struct bpf_task_work_ctx *ctx;
> >> +    int err;
> >> +
> >> +    BTF_TYPE_EMIT(struct bpf_task_work);
> >> +
> >> +    prog = bpf_prog_inc_not_zero(aux->prog);
> >> +    if (IS_ERR(prog))
> >> +            return -EBADF;
> >> +    task = bpf_task_acquire(task);
> >> +    if (!task) {
> >> +            err = -EPERM;
> > Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
> >      is zero, which does not seem permission related.
> Right, this probably should be -EBADF.

timer and wq (and now task_work) return -EPERM for the
map->usercnt == 0 check, but we have -EBADF for the prog refcount being
zero. It's a bit all over the place... I don't have a strong
preference, but it would be nice to stay more or less consistent across
all these "it's too late" conditions, IMO.
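
E.g., something uniform along these lines (illustrative only, not
insisting on the exact errno):

  /* hypothetical: one errno for every "resource is going away" check */
  if (!atomic64_read(&map->usercnt))
          return -EPERM;  /* map is being torn down */
  prog = bpf_prog_inc_not_zero(aux->prog);
  if (IS_ERR(prog))
          return -EPERM;  /* prog is being torn down */
  task = bpf_task_acquire(task);
  if (!task) {
          err = -EPERM;   /* task is exiting */
          goto release_prog;
  }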

> >> +            goto release_prog;
> >> +    }
> >> +
> >> +    ctx = bpf_task_work_acquire_ctx(tw, map);
> >> +    if (IS_ERR(ctx)) {
> >> +            err = PTR_ERR(ctx);
> >> +            goto release_all;
> >> +    }
> >> +
> >> +    ctx->task = task;
> >> +    ctx->callback_fn = callback_fn;
> >> +    ctx->prog = prog;
> >> +    ctx->mode = mode;
> >> +    ctx->map = map;
> >> +    ctx->map_val = (void *)tw - map->record->task_work_off;
> >> +    init_task_work(&ctx->work, bpf_task_work_callback);
> >> +    init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> >> +
> >> +    irq_work_queue(&ctx->irq_work);
> >> +    return 0;
> >> +
> >> +release_all:
> >> +    bpf_task_release(task);
> >> +release_prog:
> >> +    bpf_prog_put(prog);
> >> +    return err;
> >> +}
> >> +
> > [...]
>




