Re: [RFC bpf-next v2 3/4] bpf: Runtime part of fast-path termination approach

On Mon, Jul 7, 2025 at 12:17 PM Kumar Kartikeya Dwivedi
<memxor@xxxxxxxxx> wrote:
>
> On Mon, 7 Jul 2025 at 19:41, Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Fri, Jul 4, 2025 at 12:11 PM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > On Fri, 4 Jul 2025 at 19:29, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
> > > >
> > > > > > Introduces a watchdog-based runtime mechanism to terminate
> > > > > > a BPF program. When a BPF program is interrupted by
> > > > > > the watchdog, its registers are passed on to bpf_die.
> > > > > >
> > > > > > Inside bpf_die we perform the text_poke and stack walk
> > > > > > to stub out helpers/kfuncs and replace the bpf_loop helper
> > > > > > if called inside the BPF program.
> > > > > >
> > > > > > Current implementation doesn't handle the termination of
> > > > > > tailcall programs.
> > > > > >
> > > > > > There is a known issue with calling text_poke inside interrupt
> > > > > > context - https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815.
> > > > >
> > > > > I don't have a good idea so far, maybe by deferring work to wq context?
> > > > > Each CPU would need its own context and schedule work there.
> > > > > The problem is that it may not be invoked immediately.
> > > > We will give it a try using wq. We were a bit hesitant to pursue wq
> > > > earlier because, to modify the return address on the stack, we would
> > > > want to interrupt the running BPF program and access its stack, since
> > > > that's a key part of the design.
> > > >
> > > > Will need some suggestions here on how to achieve that.
> > >
> > > Yeah, this is not trivial, now that I think more about it.
> > > We'd have to keep the stack state untouched so we can synchronize with
> > > the callback (spin until it signals us that it's done touching the
> > > stack). I guess we can do it from another CPU, not too bad.
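
Thinking out loud about the wq deferral discussed above, a minimal sketch
of what it could look like is below, assuming a per-CPU work item and a
hypothetical bpf_die_deferred() callback (the names are mine, not what the
series implements):

/* Sketch only: defer the termination work out of interrupt context so
 * that text_poke()/bpf_arch_text_poke() runs from process context.
 * struct bpf_term_state and bpf_die_deferred() are hypothetical names.
 */
#include <linux/bpf.h>
#include <linux/percpu.h>
#include <linux/ptrace.h>
#include <linux/workqueue.h>

struct bpf_term_state {
        struct work_struct work;
        struct bpf_prog *prog;          /* prog flagged by the watchdog */
        struct pt_regs regs;            /* registers captured at interrupt time */
};

static DEFINE_PER_CPU(struct bpf_term_state, bpf_term_states);

static void bpf_die_deferred(struct work_struct *work)
{
        struct bpf_term_state *st = container_of(work, struct bpf_term_state, work);

        /* Process context: stub out helpers/kfuncs via bpf_arch_text_poke()
         * and walk/patch the interrupted stack using st->regs, while the
         * CPU that owns that stack spins until we signal we're done.
         */
        (void)st;
}

/* Called from the watchdog interrupt on the stalled CPU.
 * Note: schedule_work() is not NMI-safe, which is part of the concern above.
 */
static void bpf_watchdog_fire(struct bpf_prog *prog, struct pt_regs *regs)
{
        struct bpf_term_state *st = this_cpu_ptr(&bpf_term_states);

        st->prog = prog;
        st->regs = *regs;
        INIT_WORK(&st->work, bpf_die_deferred);
        schedule_work(&st->work);       /* may run later, and may itself be starved */
}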
> > >
> > > There's another problem though: wq execution not happening instantly
> > > is not a big deal, but the wq getting interrupted by yet another
> > > program that stalls can set up a cascading chain that locks up
> > > the machine.
> > > So let's say we have a program that stalls in NMI/IRQ. It might happen
> > > that all CPUs that can service the wq enter this stall. The kthread is
> > > ready to run the wq callback (or in the middle of it) but it may be
> > > indefinitely interrupted.
> > > It seems like this is a more fundamental problem with the non-cloning
> > > approach. We can prevent program execution on the CPU where the wq
> > > callback will be run, but we can also have a case where all CPUs lock
> > > up simultaneously.
> >
> > If we have bugs where a prog in NMI can stall a CPU indefinitely,
> > they need to be fixed independently of fast-execute.
> > Timed may_goto, tailcalls, or whatever may need to have different
> > limits when they detect that the prog is running in NMI or with hard
> > IRQs disabled.
>
> I think a lot of programs end up running with hard IRQs disabled. Most
> scheduler programs (which would use all these runtime checked
> facilities) can fall into this category.
> I don't think we can come up with appropriate limits without proper
> WCET analysis.
>
> > Fast-execute doesn't have to be a universal kill-bpf-prog
> > mechanism that can work in any context. I think fast-execute
> > is for progs that deadlocked in res_spin_lock, faulted in an arena,
> > or were slow for wrong reasons, but not for fatal-to-the-kernel reasons.
> > imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
> > The alternative of clone of all progs and memory waste for a rare case
> > is not appealing. Unless we can detect "dangerous" progs and
> > clone with fast execute only for them, so that the majority of bpf progs
> > stay as single copy.
>
> Right, I sympathize with the memory overhead argument. But I think
> just ignoring NMI programs as broken is not sufficient.
> You can have some tracing program that gets stuck while tracing parts
> of the kernel such that it prevents wq from making progress in
> patching it out.
> I think we will discover more edge cases. All this effort to then have
> something working incompletely is sort of a bummer.
>
> Most of the syzbot reports triggering deadlocks were also not "real"
> use cases, but we still decided to close the gap with rqspinlock.
> If we leave the door open, it's hard to anticipate what the failure
> mode will be when fast-execute is not triggered.
> I think let's try to see how bad cloning can be, if done
> conditionally, before giving up completely on it.
>
> I think most programs won't use the facilities that end up extending
> the total runtime beyond acceptable bounds.
> Most legacy programs are not using cond_break, rqspinlock, or arenas.
> Most network function programs and tracing programs probably don't use
> these facilities either.

I can't speak to _most_, but doesn't BPF_MAP_TYPE_LPM_TRIE use rqspinlock? And
isn't that map type frequently used by network programs (especially XDP
filters) to do IP prefix lookups?

So it wouldn't be uncommon for network programs (including "legacy" ones) to
use rqspinlock because they use a BPF map type that uses rqspinlock?
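
For concreteness, the pattern I have in mind is something like the
following (an illustrative XDP snippet, not taken from any real program):

/* Illustrative only: the common XDP prefix-filter pattern backed by an
 * LPM trie.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct ipv4_lpm_key {
        __u32 prefixlen;
        __u32 addr;
};

struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
        __uint(max_entries, 1024);
        __type(key, struct ipv4_lpm_key);
        __type(value, __u32);
} blocked_prefixes SEC(".maps");

SEC("xdp")
int xdp_prefix_filter(struct xdp_md *ctx)
{
        /* A real filter would parse the source address out of the packet;
         * hardcoded here to keep the sketch short.
         */
        struct ipv4_lpm_key key = {
                .prefixlen = 32,
                .addr = bpf_htonl(0xc0a80001),  /* 192.168.0.1 */
        };

        if (bpf_map_lookup_elem(&blocked_prefixes, &key))
                return XDP_DROP;

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";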

> This can certainly change in the future, but I think unless pressing
> use cases come up people would stick to how things are.
> Schedulers are a different category and they will definitely make use
> of all this.
> So only triggering cloning when these are used sounds better to me,
> even if not ideal.
>
> We can have some simple WCET heuristic that assigns fixed weights to
> certain instructions/helpers, and compute an approximate cost which
> when breached triggers cloning of the prog.
> We can obviously disable cloning when we see things like bpf_for(i, 0, 1000).
> We can restrict it to specific subprogs where we notice specific
> patterns requiring it, so we may be able to avoid cloning the entire prog.
> But for simplicity we can also just trigger cloning when one of
> rqspinlock/cond_break/arenas/iterators/bpf_loop are used.
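
To make that weight-based heuristic concrete, a crude sketch could look
like the below; the weights and the budget are entirely made up:

/* Sketch only: a rough verifier-time cost estimate that could gate
 * cloning. Weights and threshold are illustrative.
 */
#include <linux/bpf.h>
#include <linux/filter.h>

static bool bpf_prog_wcet_exceeds_budget(const struct bpf_prog *prog)
{
        u64 cost = 0;
        u32 i;

        for (i = 0; i < prog->len; i++) {
                const struct bpf_insn *insn = &prog->insnsi[i];
                u8 class = BPF_CLASS(insn->code);

                if (class == BPF_JMP && BPF_OP(insn->code) == BPF_CALL)
                        cost += 100;    /* helper/kfunc call: assume expensive */
                else if (class == BPF_LDX || class == BPF_STX)
                        cost += 4;      /* memory access */
                else
                        cost += 1;      /* plain ALU/branch */
        }

        /* Loop constructs (bpf_loop, iterators, cond_break) would have to
         * multiply the weight of their body by a bound; omitted here.
         */
        return cost > 10000;    /* arbitrary budget that triggers cloning */
}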
>
> The other option of course is to do runtime checks of a
> prog->aux->terminate bit and start failing things that way.
> Most of the time, the branch will be untaken. It can be set to 1 when
> a prog detects a fatal condition or from a watchdog (later).
> cond_break, rqspinlock, iterators, bpf_loop can all be adjusted to check it.
>
> When I was playing with all this, this is basically what I did
> (amortized or not), and I think the overhead was negligible/within the
> range of noise.
> If a prog is hot, the cache line with the terminate bit is almost
> always in cache, and it's not a load dependency for other accesses.
> If it's not hot, the cost of every other state access dominates anyway.
>
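
For reference, a sketch of what that terminate-bit check could look like
in a bpf_loop-style helper; the terminate flag is hypothetical (it would
presumably live in prog->aux):

/* Sketch only: a bpf_loop-style helper honouring a terminate bit.
 * The terminate flag is hypothetical and passed in directly here; in a
 * real implementation it would live in prog->aux.
 */
#include <linux/bpf.h>
#include <linux/compiler.h>
#include <linux/errno.h>

static long bpf_loop_with_term_check(u32 nr_loops, bpf_callback_t callback_fn,
                                     void *callback_ctx, bool *terminate)
{
        u64 ret, i;

        for (i = 0; i < nr_loops; i++) {
                /* Almost always untaken; if the prog is hot the line stays
                 * cached, and it's not a load dependency for other accesses.
                 */
                if (unlikely(READ_ONCE(*terminate)))
                        return -EINTR;

                ret = callback_fn(i, (u64)(long)callback_ctx, 0, 0, 0);
                if (ret)
                        break;
        }

        return i;
}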




