On Tue, 8 Jul 2025 at 00:09, Zvi Effron <zeffron@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 7, 2025 at 12:17 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> >
> > On Mon, 7 Jul 2025 at 19:41, Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Fri, Jul 4, 2025 at 12:11 PM Kumar Kartikeya Dwivedi
> > > <memxor@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, 4 Jul 2025 at 19:29, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
> > > > >
> > > > > > > Introduces a watchdog-based runtime mechanism to terminate
> > > > > > > a BPF program. When a BPF program is interrupted by
> > > > > > > the watchdog, its registers are passed on to bpf_die.
> > > > > > >
> > > > > > > Inside bpf_die we perform the text_poke and stack walk
> > > > > > > to stub out helpers/kfuncs and replace the bpf_loop helper
> > > > > > > if it is called inside the BPF program.
> > > > > > >
> > > > > > > The current implementation doesn't handle the termination of
> > > > > > > tail-call programs.
> > > > > > >
> > > > > > > There is a known issue with calling text_poke inside interrupt
> > > > > > > context - https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815.
> > > > > > I don't have a good idea so far, maybe by deferring work to wq context?
> > > > > > Each CPU would need its own context and schedule work there.
> > > > > > The problem is that it may not be invoked immediately.
> > > > > We will give it a try using wq. We were a bit hesitant to pursue wq
> > > > > earlier because, to modify the return address on the stack, we would
> > > > > want to interrupt the running BPF program and access its stack, since
> > > > > that's a key part of the design.
> > > > >
> > > > > We will need some suggestions here on how to achieve that.
> > > > Yeah, this is not trivial, now that I think more about it.
> > > > So you'd keep the stack state untouched so you could synchronize with the
> > > > callback (spin until it signals us that it's done touching the stack).
> > > > I guess we can do it from another CPU, not too bad.
> > > >
> > > > There's another problem, though: wq execution not happening instantly
> > > > is not a big deal, but it getting interrupted by yet another
> > > > program that stalls can set up a cascading chain that leads to a lock-up
> > > > of the machine.
> > > > So let's say we have a program that stalls in NMI/IRQ. It might happen
> > > > that all CPUs that can service the wq enter this stall. The kthread is
> > > > ready to run the wq callback (or is in the middle of it) but it may be
> > > > indefinitely interrupted.
> > > > It seems like this is a more fundamental problem with the non-cloning
> > > > approach. We can prevent program execution on the CPU where the wq
> > > > callback will be run, but we can also have a case where all CPUs lock
> > > > up simultaneously.
> > > If we have such bugs that a prog in NMI can stall the CPU indefinitely,
> > > they need to be fixed independently of fast-execute.
> > > timed may_goto, tailcalls or whatever may need to have different
> > > limits when it detects that the prog is running in NMI or with hard irqs
> > > disabled.
> > I think a lot of programs end up running with hard IRQs disabled. Most
> > scheduler programs (which would use all these runtime-checked
> > facilities) can fall into this category.
> > I don't think we can come up with appropriate limits without proper
> > WCET analysis.
> >
> > > Fast-execute doesn't have to be a universal kill-bpf-prog
> > > mechanism that can work in any context.
> > > I think fast-execute is for progs that deadlocked in res_spin_lock,
> > > faulted on an arena access, or were slow for the wrong reasons, but
> > > not for reasons fatal to the kernel.
> > > imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
> > > The alternative of cloning all progs, and the memory waste for a rare case,
> > > is not appealing. Unless we can detect "dangerous" progs and
> > > clone with fast-execute only for them, so that the majority of bpf progs
> > > stay as a single copy.
> >
> > Right, I sympathize with the memory overhead argument. But I think
> > just writing off NMI programs as broken is not sufficient.
> > You can have some tracing program that gets stuck while tracing parts
> > of the kernel such that it prevents the wq from making progress in
> > patching it out.
> > I think we will discover more edge cases. All this effort only to end up
> > with something that works incompletely is sort of a bummer.
> >
> > Most of the syzbot reports triggering deadlocks were also not "real"
> > use cases, but we still decided to close the gap with rqspinlock.
> > If we leave that door open, it's hard to anticipate what the failure
> > mode will be when fast-execute is not triggered.
> > I think let's try to see how bad cloning can be, if done
> > conditionally, before giving up on it completely.
> >
> > I think most programs won't use the facilities that end up extending
> > the total runtime beyond acceptable bounds.
> > Most legacy programs are not using cond_break, rqspinlock, or arenas.
> > Most network function programs and tracing programs probably don't use
> > these facilities either.
>
> I can't speak to _most_, but doesn't BPF_MAP_TYPE_LPM_TRIE use rqspinlock? And
> isn't that map type frequently used by network programs (especially XDP
> filters) to do IP prefix lookups?
>
> So it wouldn't be uncommon for network programs (including "legacy" ones) to
> use rqspinlock because they use a BPF map type that uses rqspinlock?

Right, good point. We would probably need to penalize such a program
when it triggers a deadlock. That would require cloning if we don't go
the text_poke() way. So it may indeed end up covering most programs.
hashtab.c uses it too, and that is probably used by the majority of programs.
Oh well, I guess it would be most of them then.
So it's hard to eliminate the memory overhead of cloning.

> >
> > This can certainly change in the future, but I think unless pressing
> > use cases come up, people will stick to how things are.
> > Schedulers are a different category and they will definitely make use
> > of all this.
> > So only triggering cloning when these are used sounds better to me,
> > even if not ideal.
> >
> > We can have some simple WCET heuristic that assigns fixed weights to
> > certain instructions/helpers and computes an approximate cost which,
> > when breached, triggers cloning of the prog.
> > We can obviously disable cloning when we see things like bpf_for(i, 0, 1000).
> > We can restrict it to specific subprogs where we notice specific
> > patterns requiring it, so we may be able to avoid cloning the entire prog.
> > But for simplicity we can also just trigger cloning when one of
> > rqspinlock/cond_break/arenas/iterators/bpf_loop is used.
> >
> > The other option, of course, is to do runtime checks of the
> > prog->aux->terminate bit and start failing things that way.
> > Most of the time, the branch will be untaken. It can be set to 1 when
> > a prog detects a fatal condition, or from a watchdog (later).
> > cond_break, rqspinlock, iterators, and bpf_loop can all be adjusted to check it.
> >
> > When I was playing with all this, this is basically what I did
> > (amortized or not), and I think the overhead was negligible / within the
> > range of noise.
> > If a prog is hot, the cache line with the terminate bit is almost always in
> > cache, and it's not a load dependency for other accesses.
> > If it's not hot, the cost of every other state access dominates anyway.
> >
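
For concreteness, here is a rough, untested sketch of what the check side
of such a terminate bit could look like. Everything in it is illustrative
only: a 'terminate' flag on bpf_prog_aux does not exist today, and the real
bpf_loop helper is not passed a prog pointer, so the actual plumbing (or a
per-CPU flag instead) would be part of a real patch.

/* Illustration only: a bpf_loop-style helper that polls a hypothetical
 * prog->aux->terminate flag once per iteration and fails the program
 * fast when it has been flagged for termination.
 */
static long bpf_loop_terminate_sketch(struct bpf_prog *prog, u32 nr_loops,
                                      bpf_callback_t callback, void *ctx)
{
        u32 i;

        for (i = 0; i < nr_loops; i++) {
                /* One predictable, almost-never-taken branch per iteration;
                 * the flag's cache line stays hot while the prog is running.
                 */
                if (unlikely(READ_ONCE(prog->aux->terminate)))
                        return -ETIMEDOUT;

                /* Callback returning non-zero means "stop iterating". */
                if (callback((u64)i, (u64)(long)ctx, 0, 0, 0))
                        return i + 1;
        }
        return i;
}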
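
The setter side would be equally small. A similarly hypothetical sketch,
where a per-prog watchdog hrtimer flips the same flag on expiry; the
watchdog_timer field and its arming are made up here, the only point is the
pairing with the READ_ONCE() in the runtime checks above.

/* Illustration only: watchdog expiry path that requests termination.
 * 'terminate' and 'watchdog_timer' on bpf_prog_aux are hypothetical.
 */
static enum hrtimer_restart bpf_prog_watchdog_expired(struct hrtimer *timer)
{
        struct bpf_prog_aux *aux = container_of(timer, struct bpf_prog_aux,
                                                watchdog_timer);

        /* Pairs with the READ_ONCE() in the runtime checks; no stronger
         * ordering is needed since this is a best-effort kill request.
         */
        WRITE_ONCE(aux->terminate, true);
        return HRTIMER_NORESTART;
}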