On Mon, 2025-09-08 at 14:13 +0100, Mykyta Yatsenko wrote:

[...]

> > > +static void bpf_task_work_irq(struct irq_work *irq_work)
> > > +{
> > > +        struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> > > +        enum bpf_task_work_state state;
> > > +        int err;
> > > +
> > > +        guard(rcu_tasks_trace)();
> > > +
> > > +        if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> > > +                bpf_task_work_ctx_put(ctx);
> > > +                return;
> > > +        }
> > 
> > Why are separate PENDING and SCHEDULING states needed?
> > Both indicate that the task had not been yet but is ready to be
> > submitted to task_work_add(). So, on a first glance it seems that
> > merging the two won't change the behaviour, what do I miss?
> 
> Yes, this is right, we may drop SCHEDULING state, it does not change any
> behavior compared to PENDING.
> The state check before task_work_add is needed anyway, so we won't
> remove much code here.
> I kept it just to be more consistent: with every state check we also
> transition state machine forward.

Why is the state check before task_work_add() mandatory?
You check for FREED in both branches after the task_work_add() call,
so there seems to be no issue with leaking ctx.

> > > +        err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > > +        if (err) {
> > > +                bpf_task_work_ctx_reset(ctx);
> > > +                /*
> > > +                 * try to switch back to STANDBY for another task_work reuse, but we might have
> > > +                 * gone to FREED already, which is fine as we already cleaned up after ourselves
> > > +                 */
> > > +                (void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> > > +
> > > +                /* we don't have RCU protection, so put after switching state */
> > > +                bpf_task_work_ctx_put(ctx);
> > > +        }
> > > +
> > > +        /*
> > > +         * It's technically possible for just scheduled task_work callback to
> > > +         * complete running by now, going SCHEDULING -> RUNNING and then
> > > +         * dropping its ctx refcount. Instead of capturing extra ref just to
> > > +         * protected below ctx->state access, we rely on RCU protection to
> > > +         * perform below SCHEDULING -> SCHEDULED attempt.
> > > +         */
> > > +        state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> > > +        if (state == BPF_TW_FREED)
> > > +                bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> > > +}
> > 
[...]
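To make the question above concrete, the simplification I have in mind is
roughly the sketch below: the quoted hunk with SCHEDULING merged into PENDING
and the pre-task_work_add() cmpxchg dropped. Untested, and I may well be
missing a race, so treat it as an illustration only:

static void bpf_task_work_irq(struct irq_work *irq_work)
{
        struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
        enum bpf_task_work_state state;
        int err;

        guard(rcu_tasks_trace)();

        /* no PENDING -> SCHEDULING transition, go straight to task_work_add() */
        err = task_work_add(ctx->task, &ctx->work, ctx->mode);
        if (err) {
                bpf_task_work_ctx_reset(ctx);
                /* back to STANDBY for reuse, unless map teardown moved us to FREED */
                (void)cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY);
                /* we don't have RCU protection, so put after switching state */
                bpf_task_work_ctx_put(ctx);
        }

        /* PENDING -> SCHEDULED, or detect that we went to FREED meanwhile */
        state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULED);
        if (state == BPF_TW_FREED)
                bpf_task_work_cancel(ctx);
}

If there is an ordering requirement that makes the early check necessary
(e.g. FREED has to be observed before calling task_work_add()), a comment
spelling it out would be useful.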
> > 
> > > +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> > > +                                                            struct bpf_map *map)
> > > +{
> > > +        struct bpf_task_work_ctx *ctx;
> > > +
> > > +        /* early check to avoid any work, we'll double check at the end again */
> > > +        if (!atomic64_read(&map->usercnt))
> > > +                return ERR_PTR(-EBUSY);
> > > +
> > > +        ctx = bpf_task_work_fetch_ctx(tw, map);
> > > +        if (IS_ERR(ctx))
> > > +                return ctx;
> > > +
> > > +        /* try to get ref for task_work callback to hold */
> > > +        if (!bpf_task_work_ctx_tryget(ctx))
> > > +                return ERR_PTR(-EBUSY);
> > > +
> > > +        if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> > > +                /* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> > > +                bpf_task_work_ctx_put(ctx);
> > > +                return ERR_PTR(-EBUSY);
> > > +        }
> > > +
> > > +        /*
> > > +         * Double check that map->usercnt wasn't dropped while we were
> > > +         * preparing context, and if it was, we need to clean up as if
> > > +         * map_release_uref() was called; bpf_task_work_cancel_and_free()
> > > +         * is safe to be called twice on the same task work
> > > +         */
> > > +        if (!atomic64_read(&map->usercnt)) {
> > > +                /* drop ref we just got for task_work callback itself */
> > > +                bpf_task_work_ctx_put(ctx);
> > > +                /* transfer map's ref into cancel_and_free() */
> > > +                bpf_task_work_cancel_and_free(tw);
> > > +                return ERR_PTR(-EBUSY);
> > > +        }
> > 
> > I don't understand how the above check is useful.
> > Is map->usercnt protected from being changed during execution of
> > bpf_task_work_schedule()?
> > There are two such checks in this function, so apparently it is not.
> > Then what's the point of checking usercnt value if it can be
> > immediately changed after the check?
> 
> BPF map implementation calls bpf_task_work_cancel_and_free() for each
> value when map->usercnt goes to 0.
> We need to make sure that after mutating map value (attaching a ctx,
> setting state and refcnt), we do not
> leak memory to a newly allocated ctx.
> If bpf_task_work_cancel_and_free() runs concurrently with
> bpf_task_work_acquire_ctx(), there is a chance that map cleans up the
> value first and then we attach a ctx with refcnt=2, memory will leak.
> Alternatively, if map->usercnt is set to 0 right after this check, we
> are guaranteed to have the initialized context attached already, so the
> refcnts will be properly decremented (once by
> bpf_task_work_cancel_and_free()
> and once by bpf_task_work_irq() and clean up is safe).
> 
> In other words, initialization of the ctx in struct bpf_task_work is
> multi-step operation, those steps could be
> interleaved with cancel_and_free(), in such case the value may leak the
> ctx. Check map->usercnt==0 after initialization,
> to force correct cleanup preventing the leak. Calling cancel_and_free()
> for the same value twice is safe.

Ack, thank you for explaining.

> > 
> > > +
> > > +        return ctx;
> > > +}

[...]
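One small follow-up: it might be worth folding the explanation above into the
comment before the second usercnt check, so the reasoning doesn't live only in
the list archives. Rough sketch, wording is mine and based on your explanation,
so please adjust as needed:

        /*
         * Initializing ctx inside the map value is a multi-step operation
         * (attach ctx, set state, take refs) and can interleave with
         * bpf_task_work_cancel_and_free(), which is called for every value
         * once map->usercnt drops to 0:
         *
         *  - if cancel_and_free() already processed this value before ctx
         *    was fully attached, nobody else will ever drop our refs, so
         *    detect that here (usercnt == 0) and do the cleanup ourselves;
         *  - if usercnt drops to 0 only after this check, the fully
         *    initialized ctx is already visible in the value, so
         *    cancel_and_free() drops the map's ref and bpf_task_work_irq()
         *    drops the callback's ref.
         *
         * Calling bpf_task_work_cancel_and_free() twice on the same value
         * is safe.
         */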