Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination

On Sun, 20 Apr 2025 at 12:56, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
>
> From: Raj <rjsu26@xxxxxxxxx>
>
> Motivation:
> We propose an enhancement to the BPF infrastructure to address the
> critical issue of long-running BPF programs [1,2,3,4]. While the verifier
> ensures termination by restricting instruction count and backward edges, the
> total execution time of BPF programs is not bounded. Certain helper functions
> and iterators can result in extended runtimes, affecting system performance.
>
> The existing BPF infrastructure verifies that programs will not indefinitely
> loop or leak resources. However, helper functions such as `bpf_loop`,
> `bpf_for_each_map_elem`, and other iterative or expensive kernel interactions
> can cause runtimes that significantly degrade system performance [6]. Current
> detach mechanisms do not immediately terminate running instances, which can
> monopolize the CPU. A termination solution is therefore needed to swiftly
> end execution while safely releasing resources.

Thanks for sending out the code and cover letter, much appreciated!

I will set aside opinions on whether 'fast execute' or 'panic/unwind'
feels semantically cleaner, since it's hard to have an objective
discussion on that; one can argue about abstract notions like
complexity vs clean design either way.

Instead, just some questions/comments on the design.

>
> Existing termination approaches like BPF Exceptions or runtime hooks [5] either
> lack dynamism or carry runtime overheads. BPF Exceptions introduce bpf_throw()
> and exception callbacks, requiring stack unwinding and exception state
> management.

I feel there is an equal amount of state management here; it's just
not directly apparent, and not reified into tables etc.
One of the (valid) concerns with unwinding is that preparing tables of
objects that need to be released requires the verifier/runtime to be
aware of each resource-acquiring function.
Every time you add a new kfunc that acquires some resource, you'd have
to update some place in the kernel to make sure it gets tracked for
cleanup too.

But that would be equally true for this case:
- The verifier must know which functions acquire resources, so that it
can replace them with stubs in the cloned text.
- Thus, every time a new kfunc is added, one must introduce its noop
stub and add it to _some_ mapping that associates acquiring kfuncs
with their stubs.
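To make the symmetry concrete, here's a minimal userspace sketch of the kind of kfunc-to-stub table I mean. All names here are illustrative, not kernel API; the point is only that such a mapping needs a new entry for every acquiring kfunc, just like an unwind table would:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the mapping the verifier would need: each
 * resource-acquiring kfunc paired with the noop stub used in the
 * cloned fast-execute text. */

typedef void *(*kfunc_t)(void);

static void *acquire_sk(void) { static int sk; return &sk; } /* stands in for an acquiring kfunc */
static void *stub_null(void)  { return NULL; }               /* fall-through stub */

struct kfunc_stub_map {
	kfunc_t orig;
	kfunc_t stub;
};

static const struct kfunc_stub_map stub_map[] = {
	{ acquire_sk, stub_null },
	/* every newly added acquiring kfunc must be registered here;
	 * that registration is exactly the maintenance burden */
};

static kfunc_t lookup_stub(kfunc_t orig)
{
	for (size_t i = 0; i < sizeof(stub_map) / sizeof(stub_map[0]); i++)
		if (stub_map[i].orig == orig)
			return stub_map[i].stub;
	return orig; /* non-acquiring kfuncs pass through unchanged */
}
```

So the bookkeeping doesn't disappear, it just moves from unwind tables into this table.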

> Cleanup can only be done for pre-defined cancellation points.

But can you terminate the program at any arbitrary point with this? It
doesn't seem likely to me. You still have designated points, namely
the calls for which you substitute empty stubs instead of the ones
that return resources. On return from an interrupt that gives you
control of the CPU where the program is running, you redirect the
saved return address into the cloned text. But apart from the stub
functions, everything else is kept the same.
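Just to check my reading of the mechanism, here's a toy userspace model (an interpreter over opcodes, not real BPF): the saved program counter is redirected into a clone at the same offset, where the expensive ops are stubbed but the release path is kept, so control flow still balances resources:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of fast-execute: resume in the CLONED text at the SAME
 * offset the original was interrupted at. */

enum op { OP_ACQUIRE, OP_WORK, OP_NOP, OP_RELEASE, OP_EXIT };

struct state { int acquired; int work_done; };

static size_t step(const enum op *text, size_t pc, struct state *s)
{
	switch (text[pc]) {
	case OP_ACQUIRE: s->acquired++;  break;
	case OP_WORK:    s->work_done++; break;
	case OP_NOP:                     break;
	case OP_RELEASE: s->acquired--;  break;
	case OP_EXIT:                    break;
	}
	return pc + 1;
}

static void run_from(const enum op *text, size_t pc, struct state *s)
{
	while (text[pc] != OP_EXIT)
		pc = step(text, pc, s);
}
```

The clone keeps the OP_RELEASE so the already-acquired resource is still freed, while the remaining OP_WORKs become noops.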

I think you are conflating the mechanism to clean up resources
(unwinding, this (speed-run-to-exit/fast-execute), runtime log of
resources), with the mechanism to enforce termination.

The two are orthogonal, and any termination strategy (probing a memory
location from the program with strategically placed instrumentation,
watchdog timers, reading rdtsc for more granular and precise
accounting of time spent, etc.) can be combined with any mechanism to
perform cleanup. There is no need to bind one to the other.
Depending on the program type, we may need multiple strategies to
terminate with the right amount of precision.

We may do something coarse for now (like watchdogs), but separation of
concerns keeps doors open.
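To spell out the separation of concerns (names and thresholds purely illustrative): the detection policy and the cleanup mechanism can be independent knobs, combinable in any pairing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Detection policy: WHEN to terminate. */
typedef bool (*detect_fn)(uint64_t ns_elapsed);

static bool watchdog(uint64_t ns)   { return ns > 1000000; } /* coarse timer */
static bool rdtsc_acct(uint64_t ns) { return ns > 250000;  } /* finer accounting */

/* Cleanup mechanism: HOW to release resources. */
enum cleanup { CLEANUP_UNWIND, CLEANUP_FAST_EXECUTE };

struct term_policy {
	detect_fn   detect;
	enum cleanup how;
};

static bool should_fire(const struct term_policy *p, uint64_t ns)
{
	return p->detect(ns);
}
```

Any detect_fn pairs with either cleanup value, which is the "keeps doors open" point.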

> Design:
> We introduce the Fast-Path termination mechanism, leveraging the
> verifier's guarantees regarding control flow and resource management. The
> approach dynamically patches running BPF programs with a stripped-down version
> that accelerates termination. This can be used to terminate any given instance
> of a BPF execution. Key elements include:
>
> - Implicit Lifetime Management: Utilizing the verifier’s inherent control flow
>   and resource cleanup paths encoded within the BPF program structure,
>   eliminating the need for explicit garbage collection or unwinding tables.
>
> - Dynamic Program Patching: At runtime, BPF programs are atomically patched,
>   replacing expensive helper calls with stubbed versions (fast fall-through
>   implementations). This ensures swift termination without compromising safety
>   or leaking resources.
>
> - Helper Function Adjustments: Certain helper functions (e.g., `bpf_loop`,
>   `bpf_for_each_map_elem`) include mechanisms to facilitate early exits through
>   modified return values.
>
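If I read the helper adjustment correctly, the idea is something like the sketch below (a userspace model; `termination_requested` and the names are my own illustration, not the patch): bpf_loop already stops when the callback returns nonzero, so a per-iteration flag check can force the early exit:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the early-exit adjustment for iteration helpers. */

static bool termination_requested; /* set by the termination path */

typedef int (*loop_cb)(unsigned long i, void *ctx);

static long model_bpf_loop(unsigned long nr, loop_cb cb, void *ctx)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		if (termination_requested)
			return i;   /* early exit: report iterations done */
		if (cb(i, ctx))
			return i + 1;
	}
	return i;
}
```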
> TODOs:
> - Termination support for nested BPF programs.

What's the plan for this?
What do you do if, say, I attach to some kfunc that you don't stub
out? E.g. you may stub out bpf_sk_lookup, but I can attach something
to bpf_get_prandom_u32, which is called after it in the stub and is
not itself stubbed out, and stall.

Will you stub out every kfunc with a noop? If so, how do we reason
about correctness when the kfunc introduces side effects on program
state?

> - Policy enforcement to control the runtime of BPF programs in a system:
>         - Timer-based termination (watchdog)
>         - Userspace management to detect low-performing BPF programs and
>           terminate them
>

I think one of the things I haven't seen reasoned about so far is how
you would handle tail calls or extension programs, or, in general,
control flow being extended dynamically by program attachments.

Since you execute until the end of the program, someone could
construct a chain of 32 tail-called programs, each of which
individually runs up to the watchdog limit, inflating your effective
limit to 32x the intended bound.

Perhaps you just fail tail calls on the fast-execute path? They're
already something that can be skipped because of the recursion
counter, so it probably won't break things.
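Back-of-envelope model of the inflation (the budget number is arbitrary; MAX_TAIL_CALL_CNT matches the kernel's 32-deep tail-call limit): each program in the chain gets a fresh watchdog budget unless fast-execute refuses the next tail call:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_TAIL_CALL_CNT 32
#define WATCHDOG_NS       1000  /* illustrative per-program budget */

static uint64_t run_chain(bool fail_tail_calls)
{
	uint64_t total = 0;

	for (int depth = 0; depth < MAX_TAIL_CALL_CNT; depth++) {
		total += WATCHDOG_NS;   /* each program burns its full budget */
		if (fail_tail_calls)
			break;          /* fast-execute refuses the next tail call */
	}
	return total;
}
```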

Extension programs are different: most likely they are not yet
attached when the program is loaded, so both the original program and
the clone contain an empty global function call. So I presume you
don't attach them to the clone, and it's the original empty function
that executes on the fast-execute path?

It will be a more difficult question to answer once we have indirect
function calls, and possibly allow calling from one BPF program into
another as long as the signatures match.

Say a hierarchical BPF scheduler, where indirect function calls are
used to dispatch to leaves. Each indirect call target may be a
separate program attached into the original one (say
application-specific schedulers). By making the program continue
executing, the second program invoked from the first one could begin
to stall, and this could happen recursively, again breaching your
limit on the parent that called into them.

It doesn't have to be indirect calls, it may be a kernel function that
does this propagation down the tree (like sched-ext plans to do now).
We'd possibly have to stub out these kfuncs as well. But then we have
to be mindful of whether the program depends on their side effects or
whether they are pure.

So I think the conclusion is that, beyond the resource-acquiring
functions, we also need to reason about and stub every function that
can in turn invoke more BPF programs, so that a parent calling into
them transitively isn't stalled while it is fast-executing. That
doesn't seem easy.

It's the same problem as with nested program execution. On the path
where we are terminating, we allow yet another program to come in and
stall the kernel.

I think it's just a consequence of "keep executing until exit" vs
"stop executing and return" that such a problem would come up. It's
much easier to reason about and notrace the few bits needed to unwind
and return control to the kernel vs controlling it for every possible
suffix of the program where the stub is invoked.

But perhaps you have given thought to these questions and have
solutions in mind.
Will it be some kind of bpf_prog_active() check that avoids invoking
more programs on the 'fast-execute' path?
That may interfere with other programs that interrupt the fast-execute
termination and try to run (e.g. XDP firing in an interrupt while a
fast-execute in task context is in progress) and lead to surprises.
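To illustrate that surprise with a bpf_prog_active-style guard (the name mirrors the existing percpu counter in the kernel, but this is just a userspace model of one CPU): while fast-execute holds the counter, an unrelated program arriving in interrupt context on the same CPU gets refused:

```c
#include <assert.h>
#include <stdbool.h>

/* Models a per-CPU "a program is running here" counter. */
static int prog_active;

static bool try_enter_prog(void)
{
	if (prog_active)     /* fast-execute (or anything else) in progress */
		return false;/* e.g. XDP in an interrupt is dropped: surprise */
	prog_active++;
	return true;
}

static void exit_prog(void)
{
	prog_active--;
}
```

The guard does stop nested invocations, but at the cost of silently dropping legitimate programs that merely happen to interrupt the termination path.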

> We haven’t added any selftests in the POC, as this mail is mainly to get
> feedback on the design. Attaching a link to sample BPF programs to
> validate the POC [7]. Styling will be taken care of in the next iteration.
>
> References:
> 1. https://lpc.events/event/17/contributions/1610/attachments/1229/2505/LPC_BPF_termination_Raj_Sahu.pdf
> 2. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/f0749daa-4560-41c9-9f36-6aa618161665/content
> 3. https://lore.kernel.org/bpf/AM6PR03MB508011599420DB53480E8BF799F72@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/
> 4. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/7fb70c04-0736-4e2d-b48b-2d8d012bacfc/content
> 5. https://lwn.net/ml/all/AM6PR03MB5080513BFAEB54A93CC70D4399FE2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/#t
> 6. https://people.cs.vt.edu/djwillia/papers/ebpf23-runtime.pdf
> 7. https://github.com/sidchintamaneni/os-dev-env/tree/main/bpf-programs-catalog/research/termination/patch_gen_testing
>
>  arch/x86/kernel/smp.c          |   4 +-
>  include/linux/bpf.h            |  18 ++
>  include/linux/filter.h         |  16 ++
>  include/linux/smp.h            |   2 +-
>  include/uapi/linux/bpf.h       |  13 ++
>  kernel/bpf/bpf_iter.c          |  65 ++++++
>  kernel/bpf/core.c              |  45 ++++
>  kernel/bpf/helpers.c           |   8 +
>  kernel/bpf/syscall.c           | 375 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |  16 +-
>  kernel/smp.c                   |  22 +-
>  tools/bpf/bpftool/prog.c       |  40 ++++
>  tools/include/uapi/linux/bpf.h |   5 +
>  tools/lib/bpf/bpf.c            |  15 ++
>  tools/lib/bpf/bpf.h            |  10 +
>  tools/lib/bpf/libbpf.map       |   1 +
>  16 files changed, 643 insertions(+), 12 deletions(-)
>

All of this said, I think the patches need more work. The
arch-specific bits can be moved into arch/*/net/bpf_* files, and you
can dispatch to them via __weak functions in kernel/bpf/core.c. A
complete version of stubbing that handles both kfuncs and helpers
would be better.
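By __weak dispatch I mean the usual kernel pattern, sketched here as a standalone file (the function name bpf_arch_prepare_fast_path is hypothetical, just for illustration): the generic fallback is marked weak so that an arch file can provide a strong override; compiled without one, the weak fallback is what links:

```c
#include <assert.h>

#define __weak __attribute__((weak))

/* Generic fallback, as it would live in kernel/bpf/core.c. An arch
 * file (e.g. under arch/x86/net/) would supply a strong definition
 * with the same signature to override it at link time. */
__weak int bpf_arch_prepare_fast_path(void *prog)
{
	(void)prog;
	return -1; /* models -EOPNOTSUPP: no arch support provided */
}
```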

I don't think bpftool support for termination is necessary; it should
kick in automatically once the kernel detects a stall. So you can drop
the bpf(2) syscall command being added.

For now, we can also keep the enforcement bits out and just focus on
the termination mechanism itself; both can land separately (the
enforcement will require more discussion, so it'd be better to keep
the focus narrow and not mix the two, IMO).


> --
> 2.43.0
>




