On Mon, 11 Aug 2025 at 22:13, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
>
> On 8/9/25 04:04, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 8 Aug 2025 at 02:44, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
> >> On 8/7/25 19:55, Kumar Kartikeya Dwivedi wrote:
> >>> On Wed, 6 Aug 2025 at 16:46, Mykyta Yatsenko <mykyta.yatsenko5@xxxxxxxxx> wrote:
> >>>> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >>>>
> >>>> Implementation of the bpf_task_work_schedule kfuncs.
> >>>>
> >>>> Main components:
> >>>> * struct bpf_task_work_context – Metadata and state management per task
> >>>>   work.
> >>>> * enum bpf_task_work_state – A state machine to serialize work
> >>>>   scheduling and execution.
> >>>> * bpf_task_work_schedule() – The central helper that initiates
> >>>>   scheduling.
> >>>> * bpf_task_work_callback() – Invoked when the actual task_work runs.
> >>>> * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> >>>>   to enqueue task work.
> >>>> * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> >>>>
> >>>> Flow of task work scheduling
> >>>> 1) bpf_task_work_schedule_* is called from BPF code.
> >>>> 2) Transition state from STANDBY to PENDING.
> >>>> 3) irq_work_queue() schedules bpf_task_work_irq().
> >>>> 4) Transition state from PENDING to SCHEDULING.
> >>>> 5) bpf_task_work_irq() attempts task_work_add(). If successful, state
> >>>>    transitions to SCHEDULED.
> >>>> 6) Task work calls bpf_task_work_callback(), which transitions state to
> >>>>    RUNNING.
> >>>> 7) The BPF callback is executed.
> >>>> 8) Context is cleaned up, refcounts released, state set back to
> >>>>    STANDBY.
> >>>>
> >>>> Map value deletion
> >>>> If a map value that contains bpf_task_work_context is deleted, the BPF
> >>>> map implementation calls bpf_task_work_cancel_and_free().
> >>>> Deletion is handled by atomically setting the state to FREED and
> >>>> releasing references, or letting the scheduler do that, depending on
> >>>> the last state before the deletion:
> >>>> * SCHEDULING: release references in bpf_task_work_cancel_and_free(),
> >>>>   expect bpf_task_work_irq() to cancel the task work.
> >>>> * SCHEDULED: release references and try to cancel the task work in
> >>>>   bpf_task_work_cancel_and_free().
> >>>> * other states: one of bpf_task_work_irq(), bpf_task_work_schedule(),
> >>>>   bpf_task_work_callback() should clean up upon detecting the state
> >>>>   switching to FREED.
> >>>>
> >>>> The state transitions are controlled with atomic_cmpxchg, ensuring:
> >>>> * Only one thread can successfully enqueue work.
> >>>> * Proper handling of concurrent deletes (BPF_TW_FREED).
> >>>> * Safe rollback if task_work_add() fails.
> >>>>
> >>> In general I am not sure why we need so many acquire/release pairs.
> >>> Why not use test_and_set_bit etc.? Or simply cmpxchg?
> >>> What ordering of stores are we depending on that merits
> >>> acquire/release ordering?
> >>> We should probably document this explicitly.
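To make the ordering question concrete: the serialization described in the
changelog boils down to a single claim/rollback on ctx->state. A minimal
sketch (helper names here are made up, not from the patch, reusing the state
enum from the diff below); note that a plain atomic_cmpxchg() already implies
full ordering on success, so the explicit _acquire/_release variants only buy
something if there is a specific store they are meant to publish, and that is
worth spelling out in a comment in the code:

static bool bpf_task_work_claim(struct bpf_task_work_context *ctx)
{
        /* Only one caller wins the STANDBY -> PENDING transition. */
        return atomic_cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) ==
               BPF_TW_STANDBY;
}

static void bpf_task_work_rollback(struct bpf_task_work_context *ctx)
{
        /* Undo the claim unless cancel_and_free() already moved us to FREED. */
        atomic_cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY);
}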
> >>>
> >>>> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> >>>> ---
> >>>>  kernel/bpf/helpers.c | 188 ++++++++++++++++++++++++++++++++++++++++++-
> >>>>  1 file changed, 186 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>>> index 516286f67f0d..4c8b1c9be7aa 100644
> >>>> --- a/kernel/bpf/helpers.c
> >>>> +++ b/kernel/bpf/helpers.c
> >>>> @@ -25,6 +25,8 @@
> >>>>  #include <linux/kasan.h>
> >>>>  #include <linux/bpf_verifier.h>
> >>>>  #include <linux/uaccess.h>
> >>>> +#include <linux/task_work.h>
> >>>> +#include <linux/irq_work.h>
> >>>>
> >>>>  #include "../../lib/kstrtox.h"
> >>>>
> >>>> @@ -3702,6 +3704,160 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
> >>>>
> >>>>  typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *);
> >>>>
> >>>> +enum bpf_task_work_state {
> >>>> +        /* bpf_task_work is ready to be used */
> >>>> +        BPF_TW_STANDBY = 0,
> >>>> +        /* bpf_task_work is getting scheduled into irq_work */
> >>>> +        BPF_TW_PENDING,
> >>>> +        /* bpf_task_work is in irq_work and getting scheduled into task_work */
> >>>> +        BPF_TW_SCHEDULING,
> >>>> +        /* bpf_task_work is scheduled into task_work successfully */
> >>>> +        BPF_TW_SCHEDULED,
> >>>> +        /* callback is running */
> >>>> +        BPF_TW_RUNNING,
> >>>> +        /* BPF map value storing this bpf_task_work is deleted */
> >>>> +        BPF_TW_FREED,
> >>>> +};
> >>>> +
> >>>> +struct bpf_task_work_context {
> >>>> +        /* map that contains this structure in a value */
> >>>> +        struct bpf_map *map;
> >>>> +        /* bpf_task_work_state value, representing the state */
> >>>> +        atomic_t state;
> >>>> +        /* bpf_prog that schedules task work */
> >>>> +        struct bpf_prog *prog;
> >>>> +        /* task for which callback is scheduled */
> >>>> +        struct task_struct *task;
> >>>> +        /* notification mode for task work scheduling */
> >>>> +        enum task_work_notify_mode mode;
> >>>> +        /* callback to call from task work */
> >>>> +        bpf_task_work_callback_t callback_fn;
> >>>> +        struct callback_head work;
> >>>> +        struct irq_work irq_work;
> >>>> +} __aligned(8);
> >>> I will echo Alexei's comments about the layout. We cannot inline all
> >>> this in map value.
> >>> Allocation using an init function or in some control function is
> >>> probably the only way.
> >>>
> >>>> +
> >>>> +static bool task_work_match(struct callback_head *head, void *data)
> >>>> +{
> >>>> +        struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
> >>>> +
> >>>> +        return ctx == data;
> >>>> +}
> >>>> +
> >>>> +static void bpf_reset_task_work_context(struct bpf_task_work_context *ctx)
> >>>> +{
> >>>> +        bpf_prog_put(ctx->prog);
> >>>> +        bpf_task_release(ctx->task);
> >>>> +        rcu_assign_pointer(ctx->map, NULL);
> >>>> +}
> >>>> +
> >>>> +static void bpf_task_work_callback(struct callback_head *cb)
> >>>> +{
> >>>> +        enum bpf_task_work_state state;
> >>>> +        struct bpf_task_work_context *ctx;
> >>>> +        struct bpf_map *map;
> >>>> +        u32 idx;
> >>>> +        void *key;
> >>>> +        void *value;
> >>>> +
> >>>> +        rcu_read_lock_trace();
> >>>> +        ctx = container_of(cb, struct bpf_task_work_context, work);
> >>>> +
> >>>> +        state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
> >>>> +        if (state == BPF_TW_SCHEDULED)
> >>>> +                state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
> >>>> +        if (state == BPF_TW_FREED)
> >>>> +                goto out;
> >>> I am leaving out commenting on this, since I expect it to change per
> >>> later comments.
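Coming back to the layout comment above (not inlining all of this in the map
value): one possible shape, purely a sketch with made-up names and no claim
about the final design, is to keep only a pointer-sized handle in the map
value and allocate the real context out of line on first use, e.g. from the
schedule path or an init kfunc:

/* Sketch only: the map value carries just a handle; the real context lives
 * in its own allocation, so irq_work/callback_head are never copied around
 * with map values.
 */
struct bpf_task_work {
        atomic64_t ctx;         /* struct bpf_task_work_context *, or NULL */
} __aligned(8);

static struct bpf_task_work_context *bpf_task_work_ctx(struct bpf_task_work *tw)
{
        struct bpf_task_work_context *ctx, *old;

        ctx = (void *)(unsigned long)atomic64_read(&tw->ctx);
        if (ctx)
                return ctx;

        ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(*ctx));
        if (!ctx)
                return NULL;
        memset(ctx, 0, sizeof(*ctx));   /* state == BPF_TW_STANDBY (0) */

        /* First scheduler installs the context; losers free their copy. */
        old = (void *)(unsigned long)atomic64_cmpxchg(&tw->ctx, 0,
                                                      (unsigned long)ctx);
        if (old) {
                bpf_mem_free(&bpf_global_ma, ctx);
                return old;
        }
        return ctx;
}

That would keep the irq_work/callback_head machinery out of copied map values
and give cancel_and_free a single object whose lifetime can be managed (e.g.
with an RCU-deferred free) independently of the value memory.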
> >>>
> >>>> +
> >>>> +        map = rcu_dereference(ctx->map);
> >>>> +        if (!map)
> >>>> +                goto out;
> >>>> +
> >>>> +        value = (void *)ctx - map->record->task_work_off;
> >>>> +        key = (void *)map_key_from_value(map, value, &idx);
> >>>> +
> >>>> +        migrate_disable();
> >>>> +        ctx->callback_fn(map, key, value);
> >>>> +        migrate_enable();
> >>>> +
> >>>> +        /* State is running or freed, either way reset. */
> >>>> +        bpf_reset_task_work_context(ctx);
> >>>> +        atomic_cmpxchg_release(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
> >>>> +out:
> >>>> +        rcu_read_unlock_trace();
> >>>> +}
> >>>> +
> >>>> +static void bpf_task_work_irq(struct irq_work *irq_work)
> >>>> +{
> >>>> +        struct bpf_task_work_context *ctx;
> >>>> +        enum bpf_task_work_state state;
> >>>> +        int err;
> >>>> +
> >>>> +        ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
> >>>> +
> >>>> +        rcu_read_lock_trace();
> >>> What's the idea behind rcu_read_lock_trace()? Let's add a comment.
> >>>
> >>>> +        state = atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
> >>>> +        if (state == BPF_TW_FREED) {
> >>>> +                bpf_reset_task_work_context(ctx);
> >>>> +                goto out;
> >>>> +        }
> >>>> +
> >>>> +        err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> >>> Racy: a SCHEDULING->FREED state claim from cancel_and_free will release
> >>> ctx->task.
> >> Thanks for pointing this out, I missed that case.
> >>>> +        if (err) {
> >>>> +                state = atomic_cmpxchg_acquire(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_PENDING);
> >>> Races here look fine, since we don't act on FREED (for this block
> >>> at least), and cancel_and_free doesn't act on seeing PENDING,
> >>> so there is interlocking.
> >>>
> >>>> +                if (state == BPF_TW_SCHEDULING) {
> >>>> +                        bpf_reset_task_work_context(ctx);
> >>>> +                        atomic_cmpxchg_release(&ctx->state, BPF_TW_PENDING, BPF_TW_STANDBY);
> >>>> +                }
> >>>> +                goto out;
> >>>> +        }
> >>>> +        state = atomic_cmpxchg_release(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> >>>> +        if (state == BPF_TW_FREED)
> >>>> +                task_work_cancel_match(ctx->task, task_work_match, ctx);
> >>> It looks like there is a similar race condition here.
> >>> If BPF_TW_SCHEDULING is set, cancel_and_free may run and attempt
> >>> bpf_task_release() from bpf_reset_task_work_context().
> >>> Meanwhile, we will access ctx->task here directly after seeing BPF_TW_FREED.
> >> Yeah, we should release task_work in this function in case SCHEDULING
> >> gets transitioned into FREED.
> >>>> +out:
> >>>> +        rcu_read_unlock_trace();
> >>>> +}
> >>>> +
> >>>> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work_context *ctx,
> >>>> +                                  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> >>>> +                                  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> >>>> +{
> >>>> +        struct bpf_prog *prog;
> >>>> +
> >>>> +        BTF_TYPE_EMIT(struct bpf_task_work);
> >>>> +
> >>>> +        prog = bpf_prog_inc_not_zero(aux->prog);
> >>>> +        if (IS_ERR(prog))
> >>>> +                return -EPERM;
> >>>> +
> >>>> +        if (!atomic64_read(&map->usercnt)) {
> >>>> +                bpf_prog_put(prog);
> >>>> +                return -EPERM;
> >>>> +        }
> >>>> +        task = bpf_task_acquire(task);
> >>>> +        if (!task) {
> >>>> +                bpf_prog_put(prog);
> >>>> +                return -EPERM;
> >>>> +        }
> >>>> +
> >>>> +        if (atomic_cmpxchg_acquire(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> >>> If we are reusing map values, wouldn't a freed state stay perpetually freed?
> >>> I.e. after the first delete of array elements etc. it becomes useless.
> >>> Every array map update would invoke a cancel_and_free.
> >>> Who resets it?
> >> I'm not sure I understand the question. The idea is that if an element
> >> is deleted from the map, we transition the state to FREED and make sure
> >> the refcounts of the task and prog are released.
> >>
> >> An element is returned to the STANDBY state after the task_work
> >> completes or fails, so it can be reused.
> >> Could you please elaborate on the scenario you have in mind?
> > I guess I am confused about where we will go from FREED to STANDBY, if
> > we set it to BPF_TW_FREED in cancel_and_free.
> > When you update an array map element, we always do
> > bpf_obj_free_fields. Typically, this operation leaves the field in a
> > reusable state.
> > I don't see a FREED->STANDBY transition (after going from
> > [SCHEDULED|SCHEDULING]->FREED), only RUNNING->STANDBY in the callback.
> I see your point, arraymap does not overwrite those special fields on
> update, thanks!
>
Right, it won't just be array maps though. Hash maps also have memory
reuse, without any reinitialization.
What you want in bpf_obj_free_fields is to claim the field in a way that
it's ready for reuse. It must be left in a state such that, if a new
program obtains a pointer to the map value after that, it can reinit the
object and begin using it as if it were allocated anew.
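To illustrate the requirement (sketch only, deliberately ignoring the
SCHEDULING/SCHEDULED races discussed above): whichever path ends up doing the
final teardown after observing or setting FREED has to drop its references
and then leave the field in a state a fresh user can claim again, instead of
leaving it parked in FREED. The helper name below is made up; the point is
only the missing FREED -> STANDBY step:

static void bpf_task_work_make_reusable(struct bpf_task_work_context *ctx)
{
        /* Drop prog/task references and clear the map pointer. */
        bpf_reset_task_work_context(ctx);
        /*
         * Publish the cleaned-up context so a later map update can schedule
         * work on this slot again, as if it were freshly allocated.
         */
        atomic_set_release(&ctx->state, BPF_TW_STANDBY);
}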