On Sun, 20 Apr 2025 at 12:56, Raj Sahu <rjsu26@xxxxxxxxx> wrote: > > From: Raj <rjsu26@xxxxxxxxx> > > Motivation: > We propose an enhancement to the BPF infrastructure to address the > critical issue of long-running BPF programs [1,2,3,4]. While the verifier > ensures termination by restricting instruction count and backward edges, the > total execution time of BPF programs is not bounded. Certain helper functions > and iterators can result in extended runtimes, affecting system performance. > > The existing BPF infrastructure verifies that programs will not indefinitely > loop or leak resources. However, helper functions such as `bpf_loop`, > `bpf_for_each_map_elem`, and other iterative or expensive kernel interactions > can cause runtimes that significantly degrade system performance [6]. Current > detaching mechanisms do not immediately terminate running instances, > monopolizing CPU. Therefore, a termination solution is necessary to swiftly > terminate execution while safely releasing resources. Thanks for sending out the code and cover letter, much appreciated! I will keep aside opinions on which of 'fast execute' vs 'panic/unwind' feel semantically cleaner, since it's hard to have an objective discussion on that. One can argue about abstract concepts like complexity vs clean design either way. Instead just some questions/comments on the design. > > Existing termination approach like the BPF Exception or Runtime hooks [5] have > the issue of either lack of dynamism or having runtime overheads: BPF > Exceptions: Introduces bpf_throw() and exception callbacks, requiring stack > unwinding and exception state management. I feel there is an equal amount of state management here, it's just that it's not apparent directly, and not reified into tables etc. One of the (valid) concerns with unwinding is that preparing tables of objects that need to be released requires the verifier/runtime to be aware of each resource acquiring functions. Every time you add a new kfunc that acquires some resource, you'd have to update some place in the kernel to make sure it gets tracked for clean up too. But that would be equally true for this case: - The verifier must know which functions acquire resources, so that it can replace them with stubs in the cloned text. - Thus, every time a new kfunc is added, one must introduce its noop stub and add it to _some_ mapping to acquire kfuncs with their stubs. > Cleanup can only be done for pre-defined cancellation points. But can you terminate the program at any arbitrary point with this? It doesn't seem likely to me. You still have designated points where you stub empty calls instead of ones which return resources. You will jump to the return address into the cloned stub on return from an interrupt that gives you control of the CPU where the program is running. But apart from the stub functions, everything else would be kept the same. I think you are conflating the mechanism to clean up resources (unwinding, this (speed-run-to-exit/fast-execute), runtime log of resources), with the mechanism to enforce termination. Both are mutually exclusive, and any strategy (probing a memory location from the program with strategically placed instrumentation, watchdog timers, probing rdtsc to do more granular and precise accounting of time spent, etc.) can be combined with any mechanism to perform cleanup. There is no necessity to bind one with the other. Depending on different program types, we may need multiple strategies to terminate them with the right amount of precision. We may do something coarse for now (like watchdogs), but separation of concerns keeps doors open. > Design: > We introduce the Fast-Path termination mechanism, leveraging the > verifier's guarantees regarding control flow and resource management. The > approach dynamically patches running BPF programs with a stripped-down version > that accelerates termination. This can be used to terminate any given instance > of a BPF execution. Key elements include: > > - Implicit Lifetime Management: Utilizing the verifier’s inherent control flow > and resource cleanup paths encoded within the BPF program structure, > eliminating the need for explicit garbage collection or unwinding tables. > > - Dynamic Program Patching: At runtime, BPF programs are atomically patched, > replacing expensive helper calls with stubbed versions (fast fall-through > implementations). This ensures swift termination without compromising safety > or leaking resources. > > - Helper Function Adjustments: Certain helper functions (e.g., `bpf_loop`, > `bpf_for_each_map_elem`) include mechanisms to facilitate early exits through > modified return values. > > TODOs: > - Termination support for nested BPF programs. What's the plan for this? What do you do if e.g. I attach to some kfunc that you don't stub out? E.g. you may stub out bpf_sk_lookup, but I can attach something to bpf_get_prandom_u32 called after it in the stub which is not stubbed out and stall. Will you stub out every kfunc with a noop? If so, how do we reason about correctness when the kfunc introduces side effects on program state? > - Policy enforcements to control runtime of BPF programs in a system: > - Timer based termination (watchdog) > - Userspace management to detect low-performing BPF program and > terminated them > I think one of the things I didn't see reasoned about so far is how would you handle tail calls or extension programs? Or in general, control flow being extended dynamically by program attachments? Since you execute until the end of the program, someone could construct a chain of 32 chained programs that individually expire the watchdog timer, breaking your limit and inflating it by limit x 32 etc. Perhaps you just fail direct/indirect calls? They're already something that can be skipped because of the recursion counter, so it probably won't break things. Extension programs are different, most likely they don't appear as attached when the program is loaded, so it's an empty global function call in the original program and the stub. So I presume you don't attach them to the stub and it's the original empty function that executes in the stub? It will be a more difficult question to answer once we have indirect function calls, and possibly allow calling from the BPF program to another as long as signatures match correctly. Say a hierarchical BPF scheduler, where indirect function calls are used to dispatch to leaves. Each indirect call target may be a separate program attached into the original one (say application-specific schedulers). By making the program continue executing, the second program invoked from the first one could begin to stall, and this could happen recursively, again breaching your limit on the parent that called into them. It doesn't have to be indirect calls, it may be a kernel function that does this propagation down the tree (like sched-ext plans to do now). Possibly will have to stub out these kfuncs as well. But then we have to be mindful if the program depends on side effects vs if they are pure. So I think the conclusion is that we need to reason about and stub all functions (apart from those that acquire resources) that can in turn invoke more BPF programs, so that the parent calling into them transitively isn't stalled while it's fast executing, which doesn't seem easy. It's the same problem as with nested program execution. On the path where we are terminating, we allow yet another program to come in and stall the kernel. I think it's just a consequence of "keep executing until exit" vs "stop executing and return" that such a problem would come up. It's much easier to reason about and notrace the few bits needed to unwind and return control to the kernel vs controlling it for every possible suffix of the program where the stub is invoked. But perhaps you have given thought to these question, and may have solutions in mind. Will it be some kind of bpf_prog_active() check that avoids invoking more programs on the 'fast-execute' path? It may interfere with other programs that interrupt the fast-execute termination and try to run (e.g. XDP in an interrupt where a fast-execute in task context was in progress) and lead to surprises. > We haven’t added any selftests in the POC as this mail is mainly to get > feedback on the design. Attaching link to sample BPF programs to > validate the POC [7]. Styling will be taken care in next iteration. > > References: > 1. https://lpc.events/event/17/contributions/1610/attachments/1229/2505/LPC_BPF_termination_Raj_Sahu.pdf > 2. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/f0749daa-4560-41c9-9f36-6aa618161665/content > 3. https://lore.kernel.org/bpf/AM6PR03MB508011599420DB53480E8BF799F72@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/ > 4. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/7fb70c04-0736-4e2d-b48b-2d8d012bacfc/content > 5. https://lwn.net/ml/all/AM6PR03MB5080513BFAEB54A93CC70D4399FE2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/#t > 6. https://people.cs.vt.edu/djwillia/papers/ebpf23-runtime.pdf > 7. https://github.com/sidchintamaneni/os-dev-env/tree/main/bpf-programs-catalog/research/termination/patch_gen_testing > > arch/x86/kernel/smp.c | 4 +- > include/linux/bpf.h | 18 ++ > include/linux/filter.h | 16 ++ > include/linux/smp.h | 2 +- > include/uapi/linux/bpf.h | 13 ++ > kernel/bpf/bpf_iter.c | 65 ++++++ > kernel/bpf/core.c | 45 ++++ > kernel/bpf/helpers.c | 8 + > kernel/bpf/syscall.c | 375 +++++++++++++++++++++++++++++++++ > kernel/bpf/verifier.c | 16 +- > kernel/smp.c | 22 +- > tools/bpf/bpftool/prog.c | 40 ++++ > tools/include/uapi/linux/bpf.h | 5 + > tools/lib/bpf/bpf.c | 15 ++ > tools/lib/bpf/bpf.h | 10 + > tools/lib/bpf/libbpf.map | 1 + > 16 files changed, 643 insertions(+), 12 deletions(-) > All of this said, I think the patches need more work. The arch specific bits can be moved into arch/*/net/bpf_* files and you can dispatch to them using __weak functions in kernel/bpf/core.c. A complete version of stubbing that handles both kfuncs and helpers would be better. I don't think bpftool support for termination is necessary, it should be kicked in by the kernel automatically once a stall is detected. So you can drop the bpf(2) syscall command being added. For now, we can also keep the enforcement of termination bits out and just focus on termination bits, both can land separately (the enforcement will require more discussion, so it'd be better to keep focus on and not mix both IMO). > -- > 2.43.0 >