On Fri, 8 Aug 2025 at 02:44, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
>
> On 8/7/25 19:55, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
> >> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >>
> >> Implementation of the bpf_task_work_schedule kfuncs.
> >>
> >> Main components:
> >>  * struct bpf_task_work_context – Metadata and state management per task
> >>    work.
> >>  * enum bpf_task_work_state – A state machine to serialize work
> >>    scheduling and execution.
> >>  * bpf_task_work_schedule() – The central helper that initiates
> >>    scheduling.
> >>  * bpf_task_work_callback() – Invoked when the actual task_work runs.
> >>  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> >>    to enqueue task work.
> >>  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> >>
> >> Flow of task work scheduling
> >> 1) bpf_task_work_schedule_* is called from BPF code.
> >> 2) Transition state from STANDBY to PENDING.
> >> 3) irq_work_queue() schedules bpf_task_work_irq().
> >> 4) Transition state from PENDING to SCHEDULING.
> >> 5) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>    transitions to SCHEDULED.
> >> 6) Task work calls bpf_task_work_callback(), which transitions state to
> >>    RUNNING.
> >> 7) BPF callback is executed.
> >> 8) Context is cleaned up, refcounts released, state set back to
> >>    STANDBY.
> >>
> >> Map value deletion
> >> If a map value that contains bpf_task_work_context is deleted, the BPF map
> >> implementation calls bpf_task_work_cancel_and_free().
> >> Deletion is handled by atomically setting state to FREED and
> >> releasing references or letting the scheduler do that, depending on the
> >> last state before the deletion:
> >>  * SCHEDULING: release references in bpf_task_work_cancel_and_free(),
> >>    expect bpf_task_work_irq() to cancel task work.
> >>  * SCHEDULED: release references and try to cancel task work in
> >>    bpf_task_work_cancel_and_free().
> >>  * other states: one of bpf_task_work_irq(), bpf_task_work_schedule() or
> >>    bpf_task_work_callback() should clean up upon detecting the state
> >>    switching to FREED.
> >>
> >> The state transitions are controlled with atomic_cmpxchg, ensuring:
> >>  * Only one thread can successfully enqueue work.
> >>  * Proper handling of concurrent deletes (BPF_TW_FREED).
> >>  * Safe rollback if task_work_add() fails.
> >>
> > In general I am not sure why we need so many acquire/release pairs.
> > Why not use test_and_set_bit etc.? Or simply cmpxchg?
> > What ordering of stores are we depending on that merits
> > acquire/release ordering?
> > We should probably document this explicitly.
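
To make the question concrete: for the STANDBY -> PENDING claim in
bpf_task_work_schedule() I would naively expect a plain, fully-ordered
cmpxchg to be enough, with a comment saying which later loads/stores (if
any) it needs to be ordered against. Untested sketch, reusing the names
from this patch:

        /*
         * Claim the slot. atomic_cmpxchg() is fully ordered on success,
         * so the state machine itself shouldn't need extra
         * acquire/release pairs on top of it.
         */
        if (atomic_cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
                bpf_prog_put(prog);
                bpf_task_release(task);
                return -EBUSY; /* or whatever error makes sense here */
        }
        irq_work_queue(&ctx->irq_work);

If the ctx contents really do need to be published to the irq_work /
task_work side, let's document that pairing explicitly instead.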
> >
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >> ---
> >>  kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 186 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index 516286f67f0d..4c8b1c9be7aa 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -25,6 +25,8 @@
> >>  #include <linux/kasan.h>
> >>  #include <linux/bpf_verifier.h>
> >>  #include <linux/uaccess.h>
> >> +#include <linux/task_work.h>
> >> +#include <linux/irq_work.h>
> >>
> >>  #include "../../lib/kstrtox.h"
> >>
> >> @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
> >>
> >>  typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *);
> >>
> >> +enum bpf_task_work_state {
> >> +        /* bpf_task_work is ready to be used */
> >> +        BPF_TW_STANDBY = 0,
> >> +        /* bpf_task_work is getting scheduled into irq_work */
> >> +        BPF_TW_PENDING,
> >> +        /* bpf_task_work is in irq_work and getting scheduled into task_work */
> >> +        BPF_TW_SCHEDULING,
> >> +        /* bpf_task_work is scheduled into task_work successfully */
> >> +        BPF_TW_SCHEDULED,
> >> +        /* callback is running */
> >> +        BPF_TW_RUNNING,
> >> +        /* BPF map value storing this bpf_task_work is deleted */
> >> +        BPF_TW_FREED,
> >> +};
> >> +
> >> +struct bpf_task_work_context {
> >> +        /* map that contains this structure in a value */
> >> +        struct bpf_map *map;
> >> +        /* bpf_task_work_state value, representing the state */
> >> +        atomic_t state;
> >> +        /* bpf_prog that schedules task work */
> >> +        struct bpf_prog *prog;
> >> +        /* task for which callback is scheduled */
> >> +        struct task_struct *task;
> >> +        /* notification mode for task work scheduling */
> >> +        enum task_work_notify_mode mode;
> >> +        /* callback to call from task work */
> >> +        bpf_task_work_callback_t callback_fn;
> >> +        struct callback_head work;
> >> +        struct irq_work irq_work;
> >> +} __aligned(8);
> > I will echo Alexei's comments about the layout. We cannot inline all
> > this in the map value.
> > Allocation using an init function or in some control function is
> > probably the only way.
> >
> >> +
> >> +static bool task_work_match(struct callback_head *head, void *data)
> >> +{
> >> +        struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> >> +
> >> +        return ctx == data;
> >> +}
> >> +
> >> +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx)
> >> +{
> >> +        bpf_prog_put(ctx->prog);
> >> +        bpf_task_release(ctx->task);
> >> +        rcu_assign_pointer(ctx->map, NULL);
> >> +}
> >> +
> >> +static void bpf_task_work_callback(struct callback_head *cb)
> >> +{
> >> +        enum bpf_task_work_state state;
> >> +        struct bpf_task_work_context *ctx;
> >> +        struct bpf_map *map;
> >> +        u32 idx;
> >> +        void *key;
> >> +        void *value;
> >> +
> >> +        rcu_read_lock_trace();
> >> +        ctx = container_of(cb, struct bpf_task_work_context, work);
> >> +
> >> +        state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> >> +        if (state == BPF_TW_SCHEDULED)
> >> +                state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> >> +        if (state == BPF_TW_FREED)
> >> +                goto out;
> > I am leaving out commenting on this, since I expect it to change per
> > later comments.
> >
> >> +
> >> +        map = rcu_dereference(ctx->map);
> >> +        if (!map)
> >> +                goto out;
> >> +
> >> +        value = (void *)ctx - map->record->task_work_off;
> >> +        key = (void *)map_key_from_value(map, value, &idx);
> >> +
> >> +        migrate_disable();
> >> +        ctx->callback_fn(map, key, value);
> >> +        migrate_enable();
> >> +
> >> +        /* State is running or freed, either way reset. */
> >> +        bpf_reset_task_work_context(ctx);
> >> +        atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> >> +out:
> >> +        rcu_read_unlock_trace();
> >> +}
> >> +
> >> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >> +{
> >> +        struct bpf_task_work_context *ctx;
> >> +        enum bpf_task_work_state state;
> >> +        int err;
> >> +
> >> +        ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> >> +
> >> +        rcu_read_lock_trace();
> > What's the idea behind rcu_read_lock_trace? Let's add a comment.
> >
> >> +        state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> >> +        if (state == BPF_TW_FREED) {
> >> +                bpf_reset_task_work_context(ctx);
> >> +                goto out;
> >> +        }
> >> +
> >> +        err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> > Racy, a SCHEDULING->FREED state claim from cancel_and_free will release ctx->task.
> Thanks for pointing this out, I missed that case.
> >
> >> +        if (err) {
> >> +                state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING);
> > Races here look fine, since we don't act on FREED (for this block
> > at least), cancel_and_free doesn't act on seeing PENDING,
> > so there is interlocking.
> >
> >> +                if (state == BPF_TW_SCHEDULING) {
> >> +                        bpf_reset_task_work_context(ctx);
> >> +                        atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY);
> >> +                }
> >> +                goto out;
> >> +        }
> >> +        state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> >> +        if (state == BPF_TW_FREED)
> >> +                task_work_cancel_match(ctx->task, task_work_match, ctx);
> > It looks like there is a similar race condition here.
> > If BPF_TW_SCHEDULING is set, cancel_and_free may run and attempt
> > bpf_task_release() from bpf_reset_task_work_context().
> > Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED.
> Yeah, we should release task_work in this function in case SCHEDULING
> gets transitioned into FREED.
> >
> >> +out:
> >> +        rcu_read_unlock_trace();
> >> +}
> >> +
> >> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx,
> >> +                                  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >> +                                  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >> +{
> >> +        struct bpf_prog *prog;
> >> +
> >> +        BTF_TYPE_EMIT(struct bpf_task_work);
> >> +
> >> +        prog = bpf_prog_inc_not_zero(aux->prog);
> >> +        if (IS_ERR(prog))
> >> +                return -EPERM;
> >> +
> >> +        if (!atomic64_read(&map->usercnt)) {
> >> +                bpf_prog_put(prog);
> >> +                return -EPERM;
> >> +        }
> >> +        task = bpf_task_acquire(task);
> >> +        if (!task) {
> >> +                bpf_prog_put(prog);
> >> +                return -EPERM;
> >> +        }
> >> +
> >> +        if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> > If we are reusing map values, wouldn't a freed state stay perpetually freed?
> > I.e. after the first delete of array elements etc. it becomes useless.
> > Every array map update would invoke a cancel_and_free.
> > Who resets it?
> I'm not sure I understand the question. The idea is that if an element is
> deleted from the map, we transition the state to FREED and make sure the
> refcounts of the task and prog are released.
>
> An element is returned to the STANDBY state after the task_work completes
> or fails, so it can be reused.
> Could you please elaborate on the scenario you have in mind?

I guess I am confused about where we will go from FREED to STANDBY, if
we set it to BPF_TW_FREED in cancel_and_free. When you update an array
map element, we always do bpf_obj_free_fields. Typically, this
operation leaves the field in a reusable state. I don't see a
FREED->STANDBY transition (after going from
[SCHEDULED|SCHEDULING]->FREED), only RUNNING->STANDBY in the callback.
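
To spell out the transitions I can find in this version (from the hunks
above plus the commit message, since cancel_and_free itself is not in
this excerpt):

  STANDBY    -> PENDING                      (bpf_task_work_schedule)
  PENDING    -> SCHEDULING                   (bpf_task_work_irq)
  SCHEDULING -> PENDING -> STANDBY           (bpf_task_work_irq, task_work_add failure)
  SCHEDULING -> SCHEDULED                    (bpf_task_work_irq)
  SCHEDULING|SCHEDULED -> RUNNING -> STANDBY (bpf_task_work_callback)
  any        -> FREED                        (bpf_task_work_cancel_and_free)

So once cancel_and_free marks a slot FREED, nothing ever moves it back
to STANDBY, which is what I meant by the element becoming useless after
the first delete/update.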