On Tue, 8 Jul 2025 at 00:09, Zvi Effron <zeffron@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 7, 2025 at 12:17 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> >
> > On Mon, 7 Jul 2025 at 19:41, Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Fri, Jul 4, 2025 at 12:11 PM Kumar Kartikeya Dwivedi
> > > <memxor@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, 4 Jul 2025 at 19:29, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
> > > > >
> > > > > > > Introduces a watchdog-based runtime mechanism to terminate
> > > > > > > a BPF program. When a BPF program is interrupted by
> > > > > > > the watchdog, its registers are passed on to bpf_die.
> > > > > > >
> > > > > > > Inside bpf_die we perform the text_poke and stack walk
> > > > > > > to stub out helpers/kfuncs and replace the bpf_loop helper
> > > > > > > if it is called inside the BPF program.
> > > > > > >
> > > > > > > The current implementation doesn't handle the termination of
> > > > > > > tail-call programs.
> > > > > > >
> > > > > > > There is a known issue with calling text_poke inside interrupt
> > > > > > > context - https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815.
> > > > > > I don't have a good idea so far, maybe by deferring work to wq context?
> > > > > > Each CPU would need its own context and schedule work there.
> > > > > > The problem is that it may not be invoked immediately.
> > > > > We will give it a try using wq. We were a bit hesitant to pursue wq
> > > > > earlier because, to modify the return address on the stack, we would
> > > > > want to interrupt the running BPF program and access its stack, since
> > > > > that's a key part of the design.
> > > > >
> > > > > We will need some suggestions here on how to achieve that.
> > > > Yeah, this is not trivial, now that I think more about it.
> > > > So you'd keep the stack state untouched so you could synchronize with the
> > > > callback (spin until it signals us that it's done touching the stack).
> > > > I guess we can do it from another CPU, not too bad.
> > > >
> > > > There's another problem, though: wq execution not happening instantly
> > > > is not a big deal, but it getting interrupted by yet another
> > > > program that stalls can set up a cascading chain that leads to a lock-up
> > > > of the machine.
> > > > So let's say we have a program that stalls in NMI/IRQ. It might happen
> > > > that all CPUs that can service the wq enter this stall. The kthread is
> > > > ready to run the wq callback (or is in the middle of it) but it may be
> > > > indefinitely interrupted.
> > > > It seems like this is a more fundamental problem with the non-cloning
> > > > approach. We can prevent program execution on the CPU where the wq
> > > > callback will be run, but we can also have a case where all CPUs lock
> > > > up simultaneously.
> > > If we have such bugs that a prog in NMI can stall the CPU indefinitely,
> > > they need to be fixed independently of fast-execute.
> > > timed may_goto, tailcalls or whatever may need to have different
> > > limits when it detects that the prog is running in NMI or with hard irqs
> > > disabled.
> > I think a lot of programs end up running with hard IRQs disabled. Most
> > scheduler programs (which would use all these runtime-checked
> > facilities) can fall into this category.
> > I don't think we can come up with appropriate limits without proper
> > WCET analysis.
> >
> > > Fast-execute doesn't have to be a universal kill-bpf-prog
> > > mechanism that can work in any context.
> > > I think fast-execute is for progs that deadlocked in res_spin_lock,
> > > faulted on an arena access, or were slow for the wrong reasons, but
> > > not for reasons fatal to the kernel.
> > > imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
> > > The alternative of cloning all progs, and the memory waste for a rare case,
> > > is not appealing. Unless we can detect "dangerous" progs and
> > > clone with fast-execute only for them, so that the majority of bpf progs
> > > stay as a single copy.
> >
> > Right, I sympathize with the memory overhead argument. But I think
> > just writing off NMI programs as broken is not sufficient.
> > You can have some tracing program that gets stuck while tracing parts
> > of the kernel such that it prevents the wq from making progress in
> > patching it out.
> > I think we will discover more edge cases. All this effort only to end up
> > with something that works incompletely is sort of a bummer.
> >
> > Most of the syzbot reports triggering deadlocks were also not "real"
> > use cases, but we still decided to close the gap with rqspinlock.
> > If we leave that door open, it's hard to anticipate what the failure
> > mode will be when fast-execute is not triggered.
> > I think let's try to see how bad cloning can be, if done
> > conditionally, before giving up on it completely.
> >
> > I think most programs won't use the facilities that end up extending
> > the total runtime beyond acceptable bounds.
> > Most legacy programs are not using cond_break, rqspinlock, or arenas.
> > Most network function programs and tracing programs probably don't use
> > these facilities either.
>
> I can't speak to _most_, but doesn't BPF_MAP_TYPE_LPM_TRIE use rqspinlock? And
> isn't that map type frequently used by network programs (especially XDP
> filters) to do IP prefix lookups?
>
> So it wouldn't be uncommon for network programs (including "legacy" ones) to
> use rqspinlock because they use a BPF map type that uses rqspinlock?

Right, good point. We would probably need to penalize such a program
when it triggers a deadlock. That would require cloning if we don't go
the text_poke() way. So it may indeed end up covering most programs.
hashtab.c uses it too, and that is probably used by the majority of programs.
Oh well, I guess it would be most of them then.
So it's hard to eliminate the memory overhead of cloning.

> >
> > This can certainly change in the future, but I think unless pressing
> > use cases come up, people will stick to how things are.
> > Schedulers are a different category and they will definitely make use
> > of all this.
> > So only triggering cloning when these are used sounds better to me,
> > even if not ideal.
> >
> > We can have some simple WCET heuristic that assigns fixed weights to
> > certain instructions/helpers and computes an approximate cost which,
> > when breached, triggers cloning of the prog.
> > We can obviously disable cloning when we see things like bpf_for(i, 0, 1000).
> > We can restrict it to specific subprogs where we notice specific
> > patterns requiring it, so we may be able to avoid cloning the entire prog.
> > But for simplicity we can also just trigger cloning when one of
> > rqspinlock/cond_break/arenas/iterators/bpf_loop is used.
> >
> > The other option, of course, is to do runtime checks of the
> > prog->aux->terminate bit and start failing things that way.
> > Most of the time, the branch will be untaken. It can be set to 1 when
> > a prog detects a fatal condition, or from a watchdog (later).
> > cond_break, rqspinlock, iterators, and bpf_loop can all be adjusted to check it.
> >
> > When I was playing with all this, this is basically what I did
> > (amortized or not), and I think the overhead was negligible / within the
> > range of noise.
> > If a prog is hot, the cache line with the terminate bit is almost always in
> > cache, and it's not a load dependency for other accesses.
> > If it's not hot, the cost of every other state access dominates anyway.
> >
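
For concreteness, here is a rough, untested sketch of what the check side
of such a terminate bit could look like. Everything in it is illustrative
only: a 'terminate' flag on bpf_prog_aux does not exist today, and the real
bpf_loop helper is not passed a prog pointer, so the actual plumbing (or a
per-CPU flag instead) would be part of a real patch.

/* Illustration only: a bpf_loop-style helper that polls a hypothetical
 * prog->aux->terminate flag once per iteration and fails the program
 * fast when it has been flagged for termination.
 */
static long bpf_loop_terminate_sketch(struct bpf_prog *prog, u32 nr_loops,
                                      bpf_callback_t callback, void *ctx)
{
        u32 i;

        for (i = 0; i < nr_loops; i++) {
                /* One predictable, almost-never-taken branch per iteration;
                 * the flag's cache line stays hot while the prog is running.
                 */
                if (unlikely(READ_ONCE(prog->aux->terminate)))
                        return -ETIMEDOUT;

                /* Callback returning non-zero means "stop iterating". */
                if (callback((u64)i, (u64)(long)ctx, 0, 0, 0))
                        return i + 1;
        }
        return i;
}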
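
The setter side would be equally small. A similarly hypothetical sketch,
where a per-prog watchdog hrtimer flips the same flag on expiry; the
watchdog_timer field and its arming are made up here, the only point is the
pairing with the READ_ONCE() in the runtime checks above.

/* Illustration only: watchdog expiry path that requests termination.
 * 'terminate' and 'watchdog_timer' on bpf_prog_aux are hypothetical.
 */
static enum hrtimer_restart bpf_prog_watchdog_expired(struct hrtimer *timer)
{
        struct bpf_prog_aux *aux = container_of(timer, struct bpf_prog_aux,
                                                watchdog_timer);

        /* Pairs with the READ_ONCE() in the runtime checks; no stronger
         * ordering is needed since this is a best-effort kill request.
         */
        WRITE_ONCE(aux->terminate, true);
        return HRTIMER_NORESTART;
}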