On Tue, 8 Jul 2025 at 09:07, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
>
> > If we have such bugs that prog in NMI can stall CPU indefinitely
> > they need to be fixed independently of fast-execute.
> > timed may_goto, tailcalls or whatever may need to have different
> > limits when it detects that the prog is running in NMI or with hard irqs
> > disabled. Fast-execute doesn't have to be a universal kill-bpf-prog
> > mechanism that can work in any context. I think fast-execute
> > is for progs that deadlocked in res_spin_lock, faulted arena,
> > or were slow for wrong reasons, but not fatal for the kernel reasons.
> > imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
> > The alternative of clone of all progs and memory waste for a rare case
> > is not appealing. Unless we can detect "dangerous" progs and
> > clone with fast execute only for them, so that the majority of bpf progs
> > stay as single copy.
>
> I just want to confirm that we are on the same page here:
> While the RFC we sent was using prog cloning, Kumar's earlier
> suggestion of implementing offset tables can avoid the complete
> cloning process and the associated memory footprint. Is there
> something else which is concerning here in terms of memory overhead?
>
> Regarding the NMI issue, the fast-execute design was meant to take
> care of stalling in tracing and other task-context based programs
> running slow for some reason. While I do agree with your point that
> deadlocks in NMIs should be solved independently, kumar's point of
> having several BPF programs needing termination, running in hardIRQ,
> puts us in a fix. What should be the way forward here?

I would give the prog->aux->terminate bit idea we discussed in the other
thread a try. If we can confirm that it has acceptable overhead (see the
example microbenchmarking I did here:
https://lore.kernel.org/bpf/20250304003239.2390751-1-memxor@xxxxxxxxx/),
then I think it is the best option to go with.
You can also try loops whose body has a non-trivial cost, since the
overhead is better expressed as a percentage of the cost of the loop
body. We can sample this bit in the program now and later hook up
enforcement that sets it when a timeout is detected, but let's keep the
two separate for the next iteration.
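
To make the measurement concrete, here is a rough userspace sketch (not
kernel code) of the kind of comparison meant above: the same loop run
with and without sampling a terminate flag on each iteration, with a
tunable body cost. Everything in it is an assumption for illustration:
struct prog_aux, run_loop() and time_loop_sec() are hypothetical names,
the flag merely stands in for prog->aux->terminate, and the real check
would be emitted in the BPF program rather than written in C like this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for prog->aux; only the terminate bit is modeled. */
struct prog_aux {
	atomic_bool terminate;
};

/* Synthetic loop body; body_cost sets how much work one iteration does. */
static inline uint64_t loop_body(uint64_t acc, unsigned int body_cost)
{
	for (unsigned int i = 0; i < body_cost; i++)
		acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
	return acc;
}

/*
 * Run up to iters iterations. When sample is true, read the terminate
 * flag before each iteration and stop early if it is set. Returns the
 * number of completed iterations; *sink receives the accumulated value.
 */
uint64_t run_loop(struct prog_aux *aux, uint64_t iters,
		  unsigned int body_cost, bool sample, uint64_t *sink)
{
	uint64_t acc = 1, i;

	for (i = 0; i < iters; i++) {
		if (sample &&
		    atomic_load_explicit(&aux->terminate, memory_order_relaxed))
			break;
		acc = loop_body(acc, body_cost);
	}
	*sink = acc;
	return i;
}

/* CPU time in seconds for one run_loop() call, measured with clock(). */
double time_loop_sec(struct prog_aux *aux, uint64_t iters,
		     unsigned int body_cost, bool sample)
{
	uint64_t sink;
	clock_t start = clock();

	run_loop(aux, iters, body_cost, sample, &sink);
	clock_t end = clock();

	/* Consume the result so the loop is not optimized away. */
	volatile uint64_t keep = sink;
	(void)keep;
	return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_loop_sec() with sample=false and sample=true for a few
body_cost values (say 1, 16, 256) and comparing the two times gives the
sampling overhead as a fraction of the loop body cost. Numbers from a
userspace loop like this are only indicative; the JITed program is what
matters in the end.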