On Tue, Aug 19, 2025 at 12:28 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko
> <mykyta.yatsenko5@xxxxxxxxx> wrote:
> >
> > On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > > <mykyta.yatsenko5@xxxxxxxxx> wrote:
> > >> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> > >>
> > >> Implementation of the bpf_task_work_schedule kfuncs.
> > >>
> > >> Main components:
> > >>  * struct bpf_task_work_context – Metadata and state management per
> > >>    task work.
> > >>  * enum bpf_task_work_state – A state machine to serialize work
> > >>    scheduling and execution.
> > >>  * bpf_task_work_schedule() – The central helper that initiates
> > >>    scheduling.
> > >>  * bpf_task_work_acquire() – Attempts to take ownership of the context
> > >>    pointed to by the passed struct bpf_task_work; allocates a new
> > >>    context if none exists yet.
> > >>  * bpf_task_work_callback() – Invoked when the actual task_work runs.
> > >>  * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> > >>    to enqueue the task work.
> > >>  * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > > Can you elaborate on why the bouncing through irq_work context is necessary?
> > > I think we should have this info in the commit log.
> > > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> > yes, mainly to avoid locks in NMI.
> > >
> > >> Flow of successful task work scheduling
> > >> 1) bpf_task_work_schedule_* is called from BPF code.
> > >> 2) Transition state from STANDBY to PENDING, marking that the context
> > >>    is owned by this task work scheduler.
> > >> 3) irq_work_queue() schedules bpf_task_work_irq().
> > >> 4) Transition state from PENDING to SCHEDULING.
> > >> 5) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > >>    transitions to SCHEDULED.
> > >> 6) Task work calls bpf_task_work_callback(), which transitions state to
> > >>    RUNNING.
> > >> 7) BPF callback is executed.
> > >> 8) Context is cleaned up, refcounts released, context state set back to
> > >>    STANDBY.
> > >>
> > >> bpf_task_work_context handling
> > >> The context pointer is stored in the bpf_task_work ctx field (u64) but
> > >> treated as an __rcu pointer via casts.
> > >> bpf_task_work_acquire() publishes a new bpf_task_work_context by cmpxchg
> > >> with an RCU initializer.
> > >> It is read under the RCU lock only in bpf_task_work_acquire() when
> > >> ownership is contended.
> > >> Upon deleting the map value, bpf_task_work_cancel_and_free() detaches
> > >> the context pointer from struct bpf_task_work and releases resources
> > >> if the scheduler does not own the context or the work can be canceled
> > >> (state == STANDBY, or state == SCHEDULED and the callback was canceled).
> > >> If the task work scheduler owns the context, its state is set to FREED
> > >> and the scheduler is expected to clean up on the next state transition.
> > >>
> > >> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> > >> ---
> > > This is much better now, with clear ownership between the free path and
> > > the scheduling path. I mostly have a few more comments on the current
> > > implementation, plus one potential bug.
> > >
> > > However, the more time I spend on this, the more I feel we should
> > > unify all this with the two other bpf async work execution mechanisms
> > > (timers and wq), and simplify and deduplicate a lot of this under the
> > > serialized async->lock.
> > > I know NMI execution is probably critical for
> > > this primitive, but we can replace async->lock with rqspinlock to
> > > address that, so that it becomes safe to serialize in any context.
> > > Apart from that, I don't see anything that would negate reworking all
> > > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > > task_work locks that have trouble with NMI execution (see later
> > > comments).
> > >
> > > I also feel like it would be cleaner if we split the API into 3 steps:
> > > init(), set_callback(), schedule() like the other cases, I don't see
> > > why we necessarily need to diverge, and it simplifies some of the
> > > logic in schedule().
> > > Once every state update is protected by a lock, all of the state
> > > transitions are done automatically and a lot of the extra races are
> > > eliminated.
> > >
> > > I think we should discuss whether this was considered and why you
> > > discarded this approach, otherwise the code is pretty complex, with
> > > little upside.
> > > Maybe I'm missing something obvious and you'd know more since you've
> > > thought about all this longer.
> > As for the API, I think having 1 function for scheduling a callback is
> > cleaner than having 3 which are always called in the same order anyway.
> > Most of the complexity comes from synchronization, not logic, so not
> > having to do the same synchronization in init(), set_callback() and
> > schedule() seems like a benefit to me.
> Well, if you were to reuse bpf_async_kern, all of that extra logic is
> already taken care of, or can be easily shared.
> If you look closely you'll see that a lot of what you're doing is a
> repetition of what timers and bpf_wq have.
> > Let me check if using rqspinlock is going to make things simpler. We
> > still need states to at least know if cancellation is possible and to
> > flag deletion to the scheduler, but using a lock will make the code
> > easier to understand.
> Yeah I think for all of this using lockless updates is not really
> worth it, let's just serialize using a lock.

I don't think it's "just serialize". __bpf_async_init and
__bpf_async_set_callback currently have `if (in_nmi()) return
-EOPNOTSUPP;`, because of `bpf_map_kmalloc_node` (solvable with
bpf_mem_alloc, not a big deal) and then an unconditional
`__bpf_spin_lock_irqsave(&async->lock);` (and maybe some other things
that can't be done in NMI).

We can't just replace __bpf_spin_lock_irqsave with rqspinlock, because the
latter can fail. So the simplicity of unconditional locking is gone: we'd
need to deal with the possibility of the lock acquisition failing, and that
is probably not that straightforward in the case of
__bpf_async_cancel_and_free.

On the other hand, a state machine based on cmpxchg means there is always
forward progress, and we always know deterministically which state was
reached last before we moved to the FREED state, so we know who needs to
cancel the callback (if at all) and who is responsible for freeing
resources (see the rough sketch at the end of this mail).

I'm actually wondering if this state machine approach could/should be
adopted for bpf_async_cb?.. I wouldn't start there, though, and would
rather finish the task_work_add integration before trying to generalize.

[...]

> > > This part looks broken to me.
> > > You are calling this path
> > > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > > possibly NMI context.
> > > Which means we can deadlock if we hit the NMI context prog in the
> > > middle of a task->pi_lock critical section.
> > > That's taken in task_work functions.
> The task_work_cancel_match takes the pi_lock.
> > Good point, thanks. I think this could be solved in 2 ways:
> >  * Don't cancel, rely on the callback dropping the work
> >  * Cancel in another irq_work
> > I'll probably go with the second one.
>
> What about 1? It seems like we can just rely on the existing hunk to
> free the callback on noticing BPF_TW_FREED?
> That seems simpler to me.
>

The callback might potentially not be called for a long time, so I'd feel
uneasy relying on it being called soon. Mykyta does the irq_work bounce in
the scheduling kfunc exactly because it might need to cancel task work
(which doesn't support NMI); we can probably reuse the same approach (and
the same irq_work struct) here for cancellation? See the second sketch
below for roughly what I mean.

[...]
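
To make the "forward progress" point above a bit more concrete, here is a
very rough, uncompiled sketch of the kind of cmpxchg state machine I have
in mind. The enum values, struct fields and helper names are made up and
don't necessarily match Mykyta's patch; it's only meant to illustrate the
ownership hand-off between the scheduling path and the free path.

enum bpf_tw_state {
	BPF_TW_STANDBY = 0,	/* nothing in flight, free path may reclaim */
	BPF_TW_PENDING,		/* scheduler owns ctx, irq_work queued */
	BPF_TW_SCHEDULING,	/* irq_work is about to call task_work_add() */
	BPF_TW_SCHEDULED,	/* task_work_add() succeeded, still cancelable */
	BPF_TW_RUNNING,		/* BPF callback is executing */
	BPF_TW_FREED,		/* map value deleted, last owner frees ctx */
};

struct bpf_tw_ctx {
	atomic_t state;
	struct task_struct *task;
	struct irq_work irq_work;
	struct callback_head twork;
	/* prog, callback, bpf_task_work back-pointer, etc. */
};

/*
 * Every step is a single cmpxchg: it either advances the state or it
 * observes that the free path already moved us to FREED. Either way
 * somebody makes progress, and exactly one side learns that it is now
 * responsible for cancelling/freeing, without ever taking a lock.
 */
static int bpf_tw_advance(struct bpf_tw_ctx *ctx, int old, int new)
{
	int prev = atomic_cmpxchg(&ctx->state, old, new);

	if (prev == old)
		return 0;		/* we own this transition */
	if (prev == BPF_TW_FREED)
		return -EOWNERDEAD;	/* free path won, drop refs and free */
	return -EBUSY;			/* concurrent scheduler, back off */
}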
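
And for the cancellation problem, roughly this (building on the made-up
struct above; bpf_tw_match()/bpf_tw_ctx_free() are hypothetical helpers,
again just a sketch, not the actual patch): bpf_task_work_cancel_and_free()
would only flip the state to BPF_TW_FREED and queue the irq_work, so that
task_work_cancel_match() and its task->pi_lock acquisition happen from
irq_work context rather than from NMI.

static bool bpf_tw_match(struct callback_head *head, void *data)
{
	return data == container_of(head, struct bpf_tw_ctx, twork);
}

static void bpf_tw_cancel_irq(struct irq_work *w)
{
	struct bpf_tw_ctx *ctx = container_of(w, struct bpf_tw_ctx, irq_work);

	/* taking task->pi_lock is fine here, we are not in NMI anymore */
	if (task_work_cancel_match(ctx->task, bpf_tw_match, ctx))
		bpf_tw_ctx_free(ctx);	/* callback will never run, we free */
	/*
	 * Otherwise the callback is already queued (or running); it will
	 * observe BPF_TW_FREED and free the ctx itself.
	 */
}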