Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Wed, 20 Aug 2025 18:11:23 +0200

On Tue, 19 Aug 2025 at 22:49, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Aug 19, 2025 at 12:28 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> >
> > On Tue, 19 Aug 2025 at 20:16, Mykyta Yatsenko
> > <mykyta.yatsenko5@xxxxxxxxx> wrote:
> > >
> > > On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> > > > On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> > > > <mykyta.yatsenko5@xxxxxxxxx> wrote:
> > > >> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> > > >>
> > > >> Implementation of the bpf_task_work_schedule kfuncs.
> > > >>
> > > >> Main components:
> > > >>   * struct bpf_task_work_context – Metadata and state management per task
> > > >> work.
> > > >>   * enum bpf_task_work_state – A state machine to serialize work
> > > >>   scheduling and execution.
> > > >>   * bpf_task_work_schedule() – The central helper that initiates
> > > >> scheduling.
> > > >>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
> > > >>   pointed by passed struct bpf_task_work, allocates new context if none
> > > >>   exists yet.
> > > >>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
> > > >>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
> > > >> to enqueue task work.
> > > >>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> > > > Can you elaborate on why the bouncing through irq_work context is necessary?
> > > > I think we should have this info in the commit log.
> > > > Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
> > > yes, mainly to avoid locks in NMI.
> > > >
> > > >> Flow of successful task work scheduling
> > > >>   1) bpf_task_work_schedule_* is called from BPF code.
> > > >>   2) Transition state from STANDBY to PENDING, marks context is owned by
> > > >>   this task work scheduler
> > > >>   3) irq_work_queue() schedules bpf_task_work_irq().
> > > >>   4) Transition state from PENDING to SCHEDULING.
> > > > >>   5) bpf_task_work_irq() attempts task_work_add(). If successful, state
> > > > >>   transitions to SCHEDULED.
> > > > >>   6) Task work calls bpf_task_work_callback(), which transitions state to
> > > > >>   RUNNING.
> > > > >>   7) BPF callback is executed
> > > > >>   8) Context is cleaned up, refcounts released, context state set back to
> > > > >>   STANDBY.
> > > >>
> > > >> bpf_task_work_context handling
> > > >> The context pointer is stored in bpf_task_work ctx field (u64) but
> > > >> treated as an __rcu pointer via casts.
> > > >> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
> > > >> with RCU initializer.
> > > >> Read under the RCU lock only in bpf_task_work_acquire() when ownership
> > > >> is contended.
> > > >> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
> > > >> context pointer from struct bpf_task_work and releases resources
> > > >> if scheduler does not own the context or can be canceled (state ==
> > > >> STANDBY or state == SCHEDULED and callback canceled). If task work
> > > >> scheduler owns the context, its state is set to FREED and scheduler is
> > > >> expected to cleanup on the next state transition.
> > > >>
> > > >> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> > > >> ---
> > > > This is much better now, with clear ownership between free path and
> > > > scheduling path, I mostly have a few more comments on the current
> > > > implementation, plus one potential bug.
> > > >
> > > > However, the more time I spend on this, the more I feel we should
> > > > unify all this with the two other bpf async work execution mechanisms
> > > > (timers and wq), and simplify and deduplicate a lot of this under the
> > > > serialized async->lock. I know NMI execution is probably critical for
> > > > this primitive, but we can replace async->lock with rqspinlock to
> > > > address that, so that it becomes safe to serialize in any context.
> > > > Apart from that, I don't see anything that would negate reworking all
> > > > this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> > > > task_work locks that have trouble with NMI execution (see later
> > > > comments).
> > > >
> > > > I also feel like it would be cleaner if we split the API into 3 steps:
> > > > init(), set_callback(), schedule() like the other cases, I don't see
> > > > why we necessarily need to diverge, and it simplifies some of the
> > > > logic in schedule().
> > > > Once every state update is protected by a lock, all of the state
> > > > transitions are done automatically and a lot of the extra races are
> > > > eliminated.
> > > >
> > > > I think we should discuss whether this was considered and why you
> > > > discarded this approach, otherwise the code is pretty complex, with
> > > > little upside.
> > > > Maybe I'm missing something obvious and you'd know more since you've
> > > > thought about all this longer.
> > > > As for API, I think having 1 function for scheduling callback is cleaner
> > > > than having 3 which are always called in the same order anyway. Most of
> > > > the complexity comes from synchronization, not logic, so not having to
> > > > do the same synchronization in init(), set_callback() and schedule()
> > > > seems like a benefit to me.
> >
> > Well, if you were to reuse bpf_async_kern, all of that extra logic is
> > already taken care of, or can be easily shared.
> > If you look closely you'll see that a lot of what you're doing is a
> > repetition of what timers and bpf_wq have.
> >
> > > Let me check if using rqspinlock is going to make things simpler. We still
> > > need states to at least know if cancellation is possible and to flag
> > > deletion to scheduler, but using a lock will make code easier to
> > > understand.
> >
> > Yeah I think for all of this using lockless updates is not really
> > worth it, let's just serialize using a lock.
>
> I don't think it's "just serialize".
>
> __bpf_async_init and __bpf_async_set_callback currently have `if
> (in_nmi()) return -EOPNOTSUPP;`, because of `bpf_map_kmalloc_node`
> (solvable with bpf_mem_alloc, not a big deal) and then unconditional
> `__bpf_spin_lock_irqsave(&async->lock);` (and maybe some other things
> that can't be done in NMI).
>
> We can't just replace __bpf_spin_lock_irqsave with rqspinlock, because
> the latter can fail. So the simplicity of unconditional locking is
> gone. We'd need to deal with the possibility of lock failing. It's
> probably not that straightforward in the case of
> __bpf_async_cancel_and_free.

We discussed converting async_cb to rqspinlock last time. The holdup
was __bpf_async_cancel_and_free; every other case can propagate the
error upwards since those paths are already fallible.

The only reason I didn't move ahead was there was no apparent use case
for timer usage in NMI (to me at least).

But I don't see why it's any less simple in the other cases. Yes, you
need to return an error if you fail to take the lock (which should not
occur in correct usage), but once you take the lock nobody else
is touching the object anymore. And all those paths are already
fallible, so it's just one extra error condition.
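
To make the "one extra error condition" point concrete, here is a rough
userspace sketch. It is not kernel code: struct fallible_lock, struct
async_obj and async_set_callback() are hypothetical stand-ins, and the
bounded spin returning -EDEADLK is only an assumption about how a
fallible lock would report failure.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a fallible lock; not the rqspinlock API. */
struct fallible_lock {
	atomic_flag held;
};

static void fallible_lock_init(struct fallible_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Spin a bounded number of times, then give up instead of deadlocking. */
static int fallible_lock_acquire(struct fallible_lock *l)
{
	for (int i = 0; i < 1000; i++) {
		if (!atomic_flag_test_and_set_explicit(&l->held,
						       memory_order_acquire))
			return 0;
	}
	return -EDEADLK;
}

static void fallible_lock_release(struct fallible_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical async object protected by the single lock. */
struct async_obj {
	struct fallible_lock lock;
	void (*cb)(void);
};

/*
 * This path could already fail for other reasons, so a failed lock
 * acquisition is just one more error code handed back to the caller.
 */
static int async_set_callback(struct async_obj *obj, void (*cb)(void))
{
	int ret = fallible_lock_acquire(&obj->lock);

	if (ret)
		return ret;
	obj->cb = cb;	/* nobody else touches obj while the lock is held */
	fallible_lock_release(&obj->lock);
	return 0;
}
```

The caller sees either 0 or a negative errno, which is the same shape
as the error returns these paths already have.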

We can then focus our effort on understanding the failure modes in
which __bpf_async_cancel_and_free's lock acquisition can fail. The
last time I looked it wasn't possible (otherwise we already have a bug
with the existing spin lock).

That said, BPF timers cannot be invoked in NMI, and irqsave provides
interrupt exclusion. We exclude usage of maps with timers in programs
that may run in NMI context. Things will be different once that
restriction is lifted for task_work, but it just means that if
acquisition of the single lock fails, a lower context we interrupted is
holding it, which means it won the claim to free the object and we
don't need to do anything. Since we have a single lock, the cases we
need to actively worry about are the reentrant ones.

I can imagine a task context program updating an array map element,
which invokes bpf_obj_free_fields, and then a perf program attempting
to do the same thing on the same element from an NMI. Fine, the lock
acquisition in free will fail, but we understand why it's ok to give up
the free in such a case.
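
A single-threaded userspace model of that scenario, as a sketch only.
struct elem, cancel_and_free(), simulated_nmi and nmi_handler() are
hypothetical names, the "NMI" is just a function call made while the
lock is held, and a plain try-lock stands in for whatever failure
detection the real lock would use.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct elem {
	atomic_flag lock;	/* stand-in for the single per-object lock */
	void *payload;		/* resource released by the free path */
	int frees;		/* how many times the payload was released */
};

/* Hook used to simulate an NMI arriving inside the critical section. */
static void (*simulated_nmi)(struct elem *e);

/* Returns true if this caller did the free, false if it gave up. */
static bool cancel_and_free(struct elem *e)
{
	if (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
		return false;	/* the context we interrupted owns the free */

	if (simulated_nmi) {
		void (*nmi)(struct elem *) = simulated_nmi;

		simulated_nmi = NULL;
		nmi(e);		/* reentrant attempt on the same element */
	}

	if (e->payload) {
		e->payload = NULL;
		e->frees++;
	}
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
	return true;
}

/* Records whether the simulated NMI ended up doing the free. */
static bool nmi_freed;

static void nmi_handler(struct elem *e)
{
	nmi_freed = cancel_and_free(e);
}
```

In this model the outer caller frees the payload exactly once and the
reentrant attempt backs off without touching the object.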

>
> On the other hand, state machine with cmpxchg means there is always
> forward progress and there is always determinism of what was the last
> reached state before we went to FREED state, which means we will know
> who needs to cancel callback (if at all), and who is responsible for
> freeing resources.

There is forward progress regardless (now), but with a lockless state
machine, every state transition needs to consider various edges which
may have been concurrently activated by some other racing invocation.
You don't have such concerns with a lock. At least to me, I don't see
how the latter is worse than the former; it's fewer cases to think
about and deal with in the code.
E.g. all these "state == BPF_TW_FREED" checks would go away at various
places in the middle of various operations.
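
For illustration, a userspace sketch of what I mean, using C11 atomics.
The state names are modelled on the ones in the patch description, but
struct tw_ctx, tw_transition(), tw_mark_freed() and tw_irq_step() are
hypothetical and simplified; this is not the code from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum tw_state {
	TW_STANDBY, TW_PENDING, TW_SCHEDULING,
	TW_SCHEDULED, TW_RUNNING, TW_FREED,
};

enum tw_result { TW_OK = 0, TW_SAW_FREED = -1, TW_UNEXPECTED = -2 };

struct tw_ctx {
	_Atomic int state;
};

/* Try from -> to; *seen reports the state actually reached or found. */
static bool tw_transition(struct tw_ctx *c, int from, int to, int *seen)
{
	int expected = from;
	bool ok = atomic_compare_exchange_strong(&c->state, &expected, to);

	*seen = ok ? to : expected;
	return ok;
}

/* Free path: mark FREED and learn which state was interrupted. */
static int tw_mark_freed(struct tw_ctx *c)
{
	return atomic_exchange(&c->state, TW_FREED);
}

/*
 * Scheduler side: every step has to be prepared to find FREED, since
 * the free path may have raced in between any two transitions.
 */
static int tw_irq_step(struct tw_ctx *c)
{
	int seen;

	if (!tw_transition(c, TW_PENDING, TW_SCHEDULING, &seen))
		return seen == TW_FREED ? TW_SAW_FREED : TW_UNEXPECTED;

	/* task_work_add() would be attempted here */

	if (!tw_transition(c, TW_SCHEDULING, TW_SCHEDULED, &seen))
		return seen == TW_FREED ? TW_SAW_FREED : TW_UNEXPECTED;

	return TW_OK;
}
```

With a lock held across the whole step, the two failure branches
collapse into a single check after taking the lock.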

To me, after looking at this code a second time, there seems to be
little benefit. Things would be different if multiple concurrent
schedule() calls on the same map value were a real use case, such that
lock contention would quickly become a performance bottleneck, but I
don't think that's true.

>
> I'm actually wondering if this state machine approach could/should be
> adopted for bpf_async_cb?.. I wouldn't start there, though, and rather
> finish task_work_add integration before trying to generalize.

Maybe it's just me, but I feel like it's additional complexity that's
not giving us much benefit.

There are enough things to worry about even when holding a lock and
excluding NMI, as seen with various bugs over the years.
E.g.:
  git log --oneline --grep="Fixes: b00628b1c7d5 (\"bpf: Introduce bpf timers.\")"

It is impossible to say that we can get all of this right on the
first attempt, whether we hold a fallible lock to avoid deadlocks or
switch to this state machine approach.
The best we can do is to at least minimize the set of cases we have to
worry about.

[
   As an aside, if we intend to keep the door open for consolidation, we
   probably should at least mirror the API surface.
   Maybe we made a mistake with the init+set_callback+kick style split in
   the existing APIs, but it might be easier for people to understand if
   all async primitives mostly follow the same look and feel.
   It wouldn't be the end of the world, but there's an argument to be
   made for consistency.
]

>
> [...]
>
> > > > This part looks broken to me.
> > > > You are calling this path
> > > > (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> > > > possibly NMI context.
> > > > Which means we can deadlock if we hit the NMI context prog in the
> > > > middle of task->pi_lock critical section.
> > > > That's taken in task_work functions
> > > > The task_work_cancel_match takes the pi_lock.
> > > Good point, thanks. I think this could be solved in 2 ways:
> > >   * Don't cancel, rely on callback dropping the work
> > >   * Cancel in another irq_work
> > > I'll probably go with the second one.
> >
> > What about 1? It seems like we can just rely on the existing hunk to
> > free the callback on noticing BPF_TW_FREED?
> > That seems simpler to me.
> >
>
> Callback potentially might not be called for a long time, I'd feel
> uneasy relying on it being called soon. Mykyta does irq_work in
> scheduling kfunc for the reason that it might need to cancel task work
> (because that doesn't support NMI), we can reuse the same approach
> (and same irq work struct) here for cancellation, probably?
>
>
> [...]