Re: [RFC bpf-next v2 3/4] bpf: Runtime part of fast-path termination approach

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Thu, 10 Jul 2025 02:54:09 +0200

On Tue, 8 Jul 2025 at 09:07, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
>
> > If we have such bugs that prog in NMI can stall CPU indefinitely
> > they need to be fixed independently of fast-execute.
> > timed may_goto, tailcalls or whatever may need to have different
> > limits when it detects that the prog is running in NMI or with hard irqs
> > disabled. Fast-execute doesn't have to be a universal kill-bpf-prog
> > mechanism that can work in any context. I think fast-execute
> > is for progs that deadlocked in res_spin_lock, faulted arena,
> > or were slow for wrong reasons, but not fatal for the kernel reasons.
> > imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
> > The alternative of clone of all progs and memory waste for a rare case
> > is not appealing. Unless we can detect "dangerous" progs and
> > clone with fast execute only for them, so that the majority of bpf progs
> > stay as single copy.
>
> I just want to confirm that we are on the same page here:
> While the RFC we sent was using prog cloning, Kumar's earlier
> suggestion of implementing offset tables can avoid the complete
> cloning process and the associated memory footprint. Is there
> something else which is concerning here in terms of memory overhead?
>
> Regarding the NMI issue, the fast-execute design was meant to take
> care of stalling in tracing and other task-context based programs
> running slow for some reason. While I do agree with your point that
> deadlocks in NMIs should be solved independently, Kumar's point of
> having several BPF programs needing termination, running in hardIRQ,
> puts us in a fix. What should be the way forward here?

I would give the prog->aux->terminate bit idea we discussed in the
other thread a try.
If we can show that it has acceptable overhead (see the example
microbenchmarking I did here:
https://lore.kernel.org/bpf/20250304003239.2390751-1-memxor@xxxxxxxxx/),
then I think it is the best option to go with.
You can also benchmark loops whose bodies have a real cost, since the
overhead is more meaningful as a % of the cost of the loop body.
We can sample this bit now and later on hook up enforcement that sets it
when it detects a timeout, but let's keep the two separate for the next
iteration.
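
To make the sampling idea concrete, here is a rough userspace sketch, not
kernel code and not the proposed implementation. It assumes the terminate
bit can be modeled as an atomic flag read with a relaxed load once per loop
iteration. struct prog_aux_sketch, loop_body(), run_loop() and time_loop()
are hypothetical names made up for illustration; only the "sample a bit,
set it from a separate enforcement side" split comes from the discussion
above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for per-program state; only the flag is modeled. */
struct prog_aux_sketch {
	atomic_bool terminate;
};

/* Simulated loop body: body_cost units of work the compiler cannot elide. */
static void loop_body(volatile uint64_t *sink, uint64_t body_cost)
{
	for (uint64_t j = 0; j < body_cost; j++)
		*sink += j;
}

/*
 * Run up to max_iters iterations. When check is true, sample the terminate
 * flag before every iteration and stop early once it is set. When check is
 * false the flag is ignored, which gives the baseline for comparison.
 * Returns the number of iterations that ran.
 */
static uint64_t run_loop(struct prog_aux_sketch *aux, uint64_t max_iters,
			 uint64_t body_cost, bool check)
{
	volatile uint64_t sink = 0;
	uint64_t i;

	assert(aux);
	for (i = 0; i < max_iters; i++) {
		if (check && atomic_load_explicit(&aux->terminate,
						  memory_order_relaxed))
			break;
		loop_body(&sink, body_cost);
	}
	return i;
}

/* CPU time in seconds for one run_loop() call, measured with clock(). */
static double time_loop(struct prog_aux_sketch *aux, uint64_t max_iters,
			uint64_t body_cost, bool check)
{
	clock_t start = clock();

	run_loop(aux, max_iters, body_cost, check);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_loop() with check=false and check=true for the same max_iters,
and repeating that for a few body_cost values, gives a crude view of the
check's cost as a % of the loop body. Enforcement is deliberately absent
here: something else would store true into the flag (for example with
atomic_store()) once it decides the program has run too long.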