Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs

On Mon, Sep 8, 2025 at 1:39 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
>
> On Mon, 2025-09-08 at 14:13 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
> > > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > > +{
> > > > + struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > > + enum bpf_task_work_state state;
> > > > + int err;
> > > > +
> > > > + guard(rcu_tasks_trace)();
> > > > +
> > > > + if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > +         return;
> > > > + }
> > > Why are separate PENDING and SCHEDULING states needed?
> > > Both indicate that the task work has not been submitted yet but is
> > > ready to be submitted to task_work_add(). So, at first glance, it
> > > seems that merging the two won't change the behaviour; what do I miss?
>
> > Yes, this is right: we may drop the SCHEDULING state, it does not change
> > any behavior compared to PENDING.
> > The state check before task_work_add() is needed anyway, so we won't
> > remove much code here.
> > I kept it just to be more consistent: with every state check we also
> > transition the state machine forward.
>
> Why is the state check before task_work_add() mandatory?
> You check for FREED in both branches of task_work_add(),
> so there seem to be no issues with leaking ctx.

Not really mandatory, but I think it's good to avoid even attempting
to schedule task_work if the map element was already deleted.
Technically, even that last FREED check in bpf_task_work_irq() is not
strictly necessary: we could always let the task_work callback execute
and bail all the way there, but that seems too extreme (and task_work
can be delayed by a lot for some special task states, I think).

Also, keep in mind that this same state machine will be used for timers
and wqs (at least we should try), so, in general, being diligent about
not doing work that is doomed to fail or be cancelled is a good
property.
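
To make the transition sequence we are discussing concrete, here is a
tiny userspace sketch using C11 atomics (invented names, heavily
simplified, not the kernel code) of the PENDING -> SCHEDULING ->
SCHEDULED path and the early bail-out when the element was already
freed:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum tw_state { TW_STANDBY, TW_PENDING, TW_SCHEDULING, TW_SCHEDULED,
		TW_RUNNING, TW_FREED };

struct tw_ctx {
	_Atomic enum tw_state state;
};

/* try to move the context one step forward; fails if someone else moved it */
static bool tw_transition(struct tw_ctx *ctx, enum tw_state from, enum tw_state to)
{
	return atomic_compare_exchange_strong(&ctx->state, &from, to);
}

/* models the shape of bpf_task_work_irq(): bail early instead of
 * queueing work that is already doomed
 */
static void tw_irq_step(struct tw_ctx *ctx)
{
	if (!tw_transition(ctx, TW_PENDING, TW_SCHEDULING)) {
		/* element was freed (or we lost a race): don't even attempt
		 * the equivalent of task_work_add()
		 */
		return;
	}

	/* ... task_work_add() equivalent would go here ... */

	if (!tw_transition(ctx, TW_SCHEDULING, TW_SCHEDULED)) {
		/* we raced with FREED: cancel what we just scheduled */
	}
}

int main(void)
{
	struct tw_ctx ctx = { .state = TW_PENDING };

	tw_irq_step(&ctx);
	printf("state after irq step: %d\n", (int)atomic_load(&ctx.state));
	return 0;
}

Nothing here is load-bearing; it just shows why bailing out before the
task_work_add() step is cheap once every check also advances the state
machine.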

>
> > > > + err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > > > + if (err) {
> > > > +         bpf_task_work_ctx_reset(ctx);
> > > > +         /*
> > > > +          * try to switch back to STANDBY for another task_work reuse, but we might have
> > > > +          * gone to FREED already, which is fine as we already cleaned up after ourselves
> > > > +          */
> > > > +         (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > > > +
> > > > +         /* we don't have RCU protection, so put after switching state */
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > + }
> > > > +
> > > > + /*
> > > > +  * It's technically possible for just scheduled task_work callback to
> > > > +  * complete running by now, going SCHEDULING -> RUNNING and then
> > > > +  * dropping its ctx refcount. Instead of capturing extra ref just to
> > > > +  * protected below ctx->state access, we rely on RCU protection to
> > > > +  * perform below SCHEDULING -> SCHEDULED attempt.
> > > > +  */
> > > > + state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> > > > + if (state == BPF_TW_FREED)
> > > > +         bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> > > > +}
> > > [...]
> > >
> > > > +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> > > > +                                                    struct bpf_map *map)
> > > > +{
> > > > + struct bpf_task_work_ctx *ctx;
> > > > +
> > > > + /* early check to avoid any work, we'll double check at the end again */
> > > > + if (!atomic64_read(&map->usercnt))
> > > > +         return ERR_PTR(-EBUSY);
> > > > +
> > > > + ctx = bpf_task_work_fetch_ctx(tw, map);
> > > > + if (IS_ERR(ctx))
> > > > +         return ctx;
> > > > +
> > > > + /* try to get ref for task_work callback to hold */
> > > > + if (!bpf_task_work_ctx_tryget(ctx))
> > > > +         return ERR_PTR(-EBUSY);
> > > > +
> > > > + if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> > > > +         /* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > +         return ERR_PTR(-EBUSY);
> > > > + }
> > > > +
> > > > + /*
> > > > +  * Double check that map->usercnt wasn't dropped while we were
> > > > +  * preparing context, and if it was, we need to clean up as if
> > > > +  * map_release_uref() was called; bpf_task_work_cancel_and_free()
> > > > +  * is safe to be called twice on the same task work
> > > > +  */
> > > > + if (!atomic64_read(&map->usercnt)) {
> > > > +         /* drop ref we just got for task_work callback itself */
> > > > +         bpf_task_work_ctx_put(ctx);
> > > > +         /* transfer map's ref into cancel_and_free() */
> > > > +         bpf_task_work_cancel_and_free(tw);
> > > > +         return ERR_PTR(-EBUSY);
> > > > + }
> > > I don't understand how the above check is useful.
> > > Is map->usercnt protected from being changed during execution of
> > > bpf_task_work_schedule()?
> > > There are two such checks in this function, so apparently it is not.
> > > Then what's the point of checking usercnt value if it can be
> > > immediately changed after the check?
>
> > The BPF map implementation calls bpf_task_work_cancel_and_free() for each
> > value when map->usercnt goes to 0.
> > We need to make sure that after mutating the map value (attaching a ctx,
> > setting state and refcnt), we do not leak memory through a newly
> > allocated ctx.
> > If bpf_task_work_cancel_and_free() runs concurrently with
> > bpf_task_work_acquire_ctx(), there is a chance that the map cleans up the
> > value first and we then attach a ctx with refcnt=2, so that memory leaks.
> > Conversely, if map->usercnt is set to 0 right after this check, we are
> > guaranteed to have the initialized context attached already, so the
> > refcnts will be properly decremented (once by bpf_task_work_cancel_and_free()
> > and once by bpf_task_work_irq()) and cleanup is safe.
> >
> > In other words, initialization of the ctx in struct bpf_task_work is a
> > multi-step operation, and those steps could be interleaved with
> > cancel_and_free(); in that case the value may leak the ctx. The
> > map->usercnt==0 check after initialization forces correct cleanup and
> > prevents the leak. Calling cancel_and_free() for the same value twice
> > is safe.
>
> Ack, thank you for explaining.

btw, one can argue that the first map->usercnt==0 check is also not
mandatory: we could always go through allocating/getting the context and
only then check map->usercnt. But, as above, I think we should be
diligent and avoid useless work.
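
To illustrate the check/initialize/re-check ordering (again a userspace
sketch with invented names and simplified refcounting, not the actual
bpf_task_work_acquire_ctx() logic):

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct fake_map  { _Atomic long usercnt; };
struct fake_ctx  { _Atomic int refcnt; };
struct fake_slot { _Atomic(struct fake_ctx *) ctx; };

static struct fake_ctx *acquire_ctx(struct fake_slot *slot, struct fake_map *map,
				    struct fake_ctx *fresh)
{
	/* early check: cheap, avoids useless initialization work */
	if (!atomic_load(&map->usercnt))
		return NULL;

	/* "initialize" and publish the context (stands in for fetch + tryget) */
	atomic_store(&fresh->refcnt, 2);
	atomic_store(&slot->ctx, fresh);

	/* re-check: if usercnt dropped while we were publishing, behave as if
	 * the map's cleanup already ran and drop both references ourselves
	 */
	if (!atomic_load(&map->usercnt)) {
		atomic_fetch_sub(&fresh->refcnt, 2);
		atomic_store(&slot->ctx, (struct fake_ctx *)NULL);
		return NULL;
	}
	return fresh;
}

int main(void)
{
	struct fake_map map = { .usercnt = 1 };
	struct fake_slot slot = { .ctx = NULL };
	struct fake_ctx fresh = { .refcnt = 0 };

	printf("acquired: %s\n", acquire_ctx(&slot, &map, &fresh) ? "yes" : "no");
	return 0;
}

The ordering is the whole point: if usercnt drops after the re-check,
the fully initialized ctx is already published and the cleanup path can
release it; if it drops before the early check, we never touch the
value at all.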

>
> > >
> > > > +
> > > > + return ctx;
> > > > +}
>
> [...]




