On Fri, Jul 4, 2025 at 12:11 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Fri, 4 Jul 2025 at 19:29, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
> >
> > > > Introduces a watchdog based runtime mechanism to terminate
> > > > a BPF program. When a BPF program is interrupted by
> > > > the watchdog, its registers are passed to bpf_die.
> > > >
> > > > Inside bpf_die we perform the text_poke and stack walk
> > > > to stub helpers/kfuncs and replace the bpf_loop helper if
> > > > called inside the bpf program.
> > > >
> > > > The current implementation doesn't handle the termination
> > > > of tailcall programs.
> > > >
> > > > There is a known issue with calling text_poke inside interrupt
> > > > context - https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815.
> > >
> > > I don't have a good idea so far, maybe deferring the work to wq
> > > context? Each CPU would need its own context and schedule work
> > > there. The problem is that it may not be invoked immediately.
> >
> > We will give it a try using wq. We were a bit hesitant to pursue
> > wq earlier because, to modify the return address on the stack, we
> > would want to interrupt the running BPF program and access its
> > stack, since that's a key part of the design.
> >
> > Will need some suggestions here on how to achieve that.
>
> Yeah, this is not trivial, now that I think more about it.
> You'd keep the stack state untouched and synchronize with the
> callback (spin until it signals us that it's done touching the
> stack). I guess we can do it from another CPU, not too bad.
>
> There's another problem though: wq execution not happening
> instantly is not a big deal, but it getting interrupted by yet
> another program that stalls can set up a cascading chain that
> leads to a lockup of the machine.
> So let's say we have a program that stalls in NMI/IRQ. It might
> happen that all CPUs that can service the wq enter this stall. The
> kthread is ready to run the wq callback (or in the middle of it)
> but it may be interrupted indefinitely.
> It seems like this is a more fundamental problem with the
> non-cloning approach. We can prevent program execution on the CPU
> where the wq callback will be run, but we can also have a case
> where all CPUs lock up simultaneously.

If we have bugs where a prog in NMI can stall a CPU indefinitely,
they need to be fixed independently of fast-execute.
Timed may_goto, tailcalls, or whatever may need different limits
when it detects that the prog is running in NMI or with hard irqs
disabled.
Fast-execute doesn't have to be a universal kill-bpf-prog mechanism
that can work in any context.
I think fast-execute is for progs that deadlocked in res_spin_lock,
faulted in arena, or were slow for the wrong reasons, but not for
reasons fatal to the kernel.
imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
The alternative of cloning all progs and wasting memory for a rare
case is not appealing.
Unless we can detect "dangerous" progs and clone with fast-execute
only for them, so that the majority of bpf progs stay as a single
copy.
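
For concreteness, a minimal sketch of what that deferred path could
look like. Untested, and the names are made up for illustration
(struct bpf_term_ctx, bpf_die_cb(), bpf_watchdog_fired() don't exist
anywhere); only INIT_WORK(), schedule_work() and bpf_arch_text_poke()
are existing kernel interfaces:

#include <linux/bpf.h>
#include <linux/workqueue.h>

/* Made-up termination context, one per in-flight kill request. */
struct bpf_term_ctx {
	struct work_struct work;
	void *call_ip;		/* helper call site in the prog image */
	void *orig_helper;	/* current call target */
	void *stub_helper;	/* stub that makes the prog bail out */
};

static void bpf_die_cb(struct work_struct *work)
{
	struct bpf_term_ctx *ctx = container_of(work, struct bpf_term_ctx, work);

	/*
	 * Process context: safe to patch text here, unlike in the
	 * watchdog interrupt itself.
	 */
	bpf_arch_text_poke(ctx->call_ip, BPF_MOD_CALL,
			   ctx->orig_helper, ctx->stub_helper);
}

/* Watchdog path, runs in hard IRQ context: only queue the work. */
static void bpf_watchdog_fired(struct bpf_term_ctx *ctx)
{
	INIT_WORK(&ctx->work, bpf_die_cb);
	schedule_work(&ctx->work);
}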
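
The "spin until the callback signals us that it's done touching the
stack" synchronization could bolt onto the same made-up context,
roughly (again untested; stack_done is invented):

/* One extra field in the made-up struct bpf_term_ctx above: */
struct bpf_term_ctx {
	struct work_struct work;
	atomic_t stack_done;	/* callback finished touching the stack */
	/* ... remaining fields from the sketch above ... */
};

static void bpf_watchdog_fired(struct bpf_term_ctx *ctx)
{
	atomic_set(&ctx->stack_done, 0);
	INIT_WORK(&ctx->work, bpf_die_cb);
	schedule_work(&ctx->work);

	/*
	 * Keep the interrupted prog's stack frozen until the callback,
	 * running on another CPU, signals that it is done rewriting it.
	 */
	while (!atomic_read_acquire(&ctx->stack_done))
		cpu_relax();
}

/* ... and at the end of bpf_die_cb(), after the stack walk: */
	atomic_set_release(&ctx->stack_done, 1);

Note that this spin is exactly what turns a stall in NMI/IRQ context
into the cascading lockup described above, which is another reason to
keep fast-execute out of those contexts.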