On Fri, 4 Jul 2025 at 19:29, Raj Sahu <rjsu26@xxxxxxxxx> wrote:
>
> > > Introduces a watchdog based runtime mechanism to terminate
> > > a BPF program. When a BPF program is interrupted by
> > > a watchdog, its registers are passed onto bpf_die.
> > >
> > > Inside bpf_die we perform the text_poke and stack walk
> > > to stub helpers/kfuncs and replace the bpf_loop helper if called
> > > inside the bpf program.
> > >
> > > The current implementation doesn't handle the termination of
> > > tail call programs.
> > >
> > > There is a known issue with calling text_poke inside interrupt
> > > context - https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815.
> >
> > I don't have a good idea so far, maybe by deferring work to wq context?
> > Each CPU would need its own context and schedule work there.
> > The problem is that it may not be invoked immediately.
>
> We will give it a try using wq. We were a bit hesitant in pursuing wq
> earlier because, to modify the return address on the stack, we would
> want to interrupt the running BPF program and access its stack, since
> that's a key part of the design.
>
> Will need some suggestions here on how to achieve that.

Yeah, this is not trivial, now that I think more about it.

The program would have to keep its stack state untouched so that it can
synchronize with the callback (spin until it signals us that it's done
touching the stack). I guess we can do it from another CPU, not too bad.

There's another problem though: wq execution not happening instantly is
not a big deal, but the wq getting interrupted by yet another program
that stalls can set up a cascading chain that leads to a lockup of the
machine.

So let's say we have a program that stalls in NMI/IRQ context. It might
happen that all CPUs that can service the wq enter this stall. The
kthread is ready to run the wq callback (or is in the middle of it),
but it may be interrupted indefinitely.

It seems like this is a more fundamental problem with the non-cloning
approach. We can prevent program execution on the CPU where the wq
callback will be run, but we can also have a case where all CPUs lock
up simultaneously. So the alternative is to do it locally, such that
the program itself enters the repair path and terminates.

Switching the return address to a patched clone seems much simpler in
comparison, even though it adds memory overhead.

> > > +static void bpf_terminate_timer_init(const struct bpf_prog *prog)
> > > +{
> > > +	ktime_t timeout = ktime_set(1, 0); // 1s, 0ns
> > > +
> > > +	/* Initialize timer on Monotonic clock, relative mode */
> > > +	hrtimer_setup(&prog->term_states->hrtimer, bpf_termination_wd_callback, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >
> > Hmm, doesn't this need to be a per-CPU hrtimer? Otherwise all
> > concurrent invocations will race to set up and start it?
> > Doesn't even look thread safe, unless I'm missing something.
>
> Yes, this was an oversight. Thanks for pointing it out.
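
If the timer-based watchdog survives in some form, the timer state
would have to live per-CPU rather than in a single shared
term_states->hrtimer. Roughly something like the untested sketch below
(all of the names here are made up by me, and this becomes moot if the
time-based enforcement is dropped for now, as I suggest further down):

/* Hypothetical per-CPU termination state, one instance per possible CPU. */
struct bpf_term_cpu_state {
	struct hrtimer hrtimer;
};

struct bpf_term_states {
	struct bpf_term_cpu_state __percpu *cpu_state; /* alloc_percpu() at prog load */
	/* ... rest of the existing fields ... */
};

static void bpf_terminate_timer_start(const struct bpf_prog *prog)
{
	struct bpf_term_cpu_state *st = this_cpu_ptr(prog->term_states->cpu_state);
	ktime_t timeout = ktime_set(1, 0);

	/* Setup could be done once per CPU at load time; it is shown inline
	 * here only to mirror the shape of the current patch.
	 */
	hrtimer_setup(&st->hrtimer, bpf_termination_wd_callback,
		      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	hrtimer_start(&st->hrtimer, timeout, HRTIMER_MODE_REL);
}

Even then it only sorts out concurrent invocations on different CPUs;
the same program re-entering on one CPU (task -> IRQ -> NMI) would
still re-arm the same timer, which feeds into the recursion and
deadlock concerns below.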

> > > +	/* Start watchdog */
> > > +	hrtimer_start(&prog->term_states->hrtimer, timeout, HRTIMER_MODE_REL);
> > > +}
> > > +
> > > +static void bpf_terminate_timer_cancel(const struct bpf_prog *prog)
> > > +{
> > > +	hrtimer_cancel(&prog->term_states->hrtimer);
> > > +}
> > > +
> > >  static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
> > >  					  const void *ctx,
> > >  					  bpf_dispatcher_fn dfunc)
> > > @@ -706,7 +735,11 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
> > >  		u64 duration, start = sched_clock();
> > >  		unsigned long flags;
> > >
> > > +		update_term_per_cpu_flag(prog, 1);
> > > +		bpf_terminate_timer_init(prog);
> > >  		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> > > +		bpf_terminate_timer_cancel(prog);
> > > +		update_term_per_cpu_flag(prog, 0);
> > >
> > >  		duration = sched_clock() - start;
> > >  		stats = this_cpu_ptr(prog->stats);
> > > @@ -715,8 +748,11 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
> > >  		u64_stats_add(&stats->nsecs, duration);
> > >  		u64_stats_update_end_irqrestore(&stats->syncp, flags);
> > >  	} else {
> > > +		update_term_per_cpu_flag(prog, 1);
> > > +		bpf_terminate_timer_init(prog);
> > >  		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
> > > -	}
> > > +		bpf_terminate_timer_cancel(prog);
> > > +		update_term_per_cpu_flag(prog, 0);}
> > >  	return ret;
> > >  }
> >
> > Hmm, did you profile how much overhead this adds? It's not completely
> > free, right?
> > I guess the per_cpu flag's lock is uncontended, so there wouldn't be
> > too much overhead there (though it's still an extra atomic op on the
> > fast path).
> > hrtimer_setup() won't be that expensive either, but I think
> > hrtimer_start() can be.
> > Also, what about programs invoked from a BPF trampoline? We would need
> > such "watchdog" protection for potentially every program, right?
> >
> > I'm more concerned about the implications of using an hrtimer around
> > every program invocation, though.
> > Imagine that the program gets invoked in task context, and the same
> > program then runs in interrupt context (let's say it's a tracing
> > program).
> > Even the simple hrtimer_cancel() when returning from interrupt context
> > can potentially deadlock the kernel if the task context program hit
> > its limit and was inside the timer callback.
> > Let alone the fact that we can have recursion on the same CPU as above,
> > or by repeatedly invoking the same program, which reprograms the timer
> > again.
> >
> > I think we should piggyback on the softlockup / hardlockup checks (that's
> > what I did long ago), but for simplicity I would just drop these time
> > based enforcement checks from the set for now.
> > They're incomplete, and potentially buggy. Instead you can invoke
> > bpf_die() when a program hits the loop's max count limit or something
> > similar, in order to test this.
> > We also need to account for sleepable programs, so a 1 second
> > hardcoded limit is probably not appropriate.
> > Enforcement is orthogonal to how a program is cleaned up, though just as
> > important, but it can be revisited once we sort out the first part.
>
> ACK
> We can do some profiling eventually then, if we decide to bring it back.
> The deadlock scenario is a good one to consider; however, a program's
> recursion is not possible on a given CPU, right?

It is possible, we have certain programs which need to recurse to
function correctly.
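
For exercising the bpf_die() path in the meantime, a debug-only trigger
in the bpf_loop helper is probably the simplest thing. Purely as an
illustration: CONFIG_BPF_TERM_DEBUG, BPF_TERM_DEBUG_LOOP_LIMIT and
bpf_term_current_state() below are all names I just made up, and how
bpf_die() actually gets at the interrupted program's state is up to your
series, so treat this as a rough sketch of the existing helper in
kernel/bpf/bpf_iter.c plus a hypothetical hook:

/* Hypothetical debug-only limit; everything guarded by
 * CONFIG_BPF_TERM_DEBUG is made up for illustration.
 */
#define BPF_TERM_DEBUG_LOOP_LIMIT	(1 << 20)

BPF_CALL_4(bpf_loop, u32, nr_loops, void *, callback_fn, void *, callback_ctx,
	   u64, flags)
{
	bpf_callback_t callback = (bpf_callback_t)callback_fn;
	u64 ret;
	u32 i;

	if (flags)
		return -EINVAL;
	if (nr_loops > BPF_MAX_LOOPS)
		return -E2BIG;

	for (i = 0; i < nr_loops; i++) {
		/* Treat crossing the debug limit as a stall and enter the
		 * termination path deterministically, instead of waiting
		 * for a timer to fire.
		 */
		if (IS_ENABLED(CONFIG_BPF_TERM_DEBUG) &&
		    i == BPF_TERM_DEBUG_LOOP_LIMIT)
			bpf_die(bpf_term_current_state()); /* hypothetical */

		ret = callback((u64)i, (u64)(long)callback_ctx, 0, 0, 0);
		/* Callback returning non-zero means stop iterating. */
		if (ret)
			return i + 1;
	}

	return i;
}

That keeps the enforcement policy question out of the picture entirely
while still letting you hit the cleanup path on demand.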

> Earlier we were thinking of enforcing performance based runtime
> policies for BPF programs. It looks like that is getting hard to
> implement, so I think we will go ahead and rely on the kernel/BPF
> mechanisms to detect bad BPF programs (stalls, page faults, etc.).
>
> Adding an iteration based termination for bpf_loop won't be enough,
> because an expensive callback won't need too many iterations,
> comparatively, to exceed runtime expectations.
> We can do this as a follow up.

As you can see from the comments above, it's already too complicated to
think about and review :).

>
> [...]