Re: [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto

On Mon, 3 Mar 2025 at 22:59, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sun, Mar 2, 2025 at 12:13 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> >
> > Implement support in the verifier for replacing the counter-based
> > may_goto implementation with one that samples time on the local CPU,
> > allowing a bigger loop bound.
> >
> > We implement it by maintaining 16 bytes per stack frame: 8 bytes for
> > the count used to amortize time sampling, and 8 bytes for the starting
> > timestamp. To minimize overhead, we need to avoid spilling and filling
> > of registers around this sequence, so we push this cost into the time
> > sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific
> > wrapper around bpf_check_timed_may_goto which returns the count to
> > store into the stack through BPF_REG_AX. All caller-saved registers
> > (r0-r5) are guaranteed to remain untouched.
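
For orientation, the 16-byte per-frame region described above works out
to the following layout (a sketch, consistent with the stack_off_cnt
computation in the verifier hunk further down; r10 is the BPF frame
pointer):

	r10 - stack_depth - 16:  u64 count      /* decremented by each cond_break */
	r10 - stack_depth -  8:  u64 timestamp  /* start of the 250 ms window     */
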
> >
> > The loop can be broken by returning a count of 0; otherwise, we
> > dispatch into the function when the count becomes 1, and the runtime
> > chooses either to refresh it (by returning BPF_MAX_TIMED_LOOPS) or to
> > return 0 and abort it.
> >
> > Since the check for 0 is done right after loading the count from the
> > stack, all subsequent cond_break sequences should immediately break as
> > well.
> >
> > We pass the stack_depth of the count (and thus the timestamp, by
> > adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
> > passed to bpf_check_timed_may_goto as an argument after r1 is saved,
> > by adding the offset to r10/fp. This adjustment will be arch-specific,
> > and the next patch will introduce support for x86.
> >
> > Note that depending on loop complexity, time spent in the loop can be
> > more than the current limit (250 ms), but imposing an upper bound on
> > program runtime is an orthogonal problem which will be addressed when
> > program cancellations are supported.
> >
> > The current time afforded by cond_break may not be enough for cases
> > where BPF programs want to implement locking algorithms inline, and use
> > cond_break as a promise to the verifier that they will eventually
> > terminate.
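
As a rough illustration of that pattern (try_lock() and lk are
hypothetical; cond_break is the macro provided by the selftests'
bpf_experimental.h):

	/* Spin until the lock is acquired; cond_break promises the
	 * verifier that the loop terminates, and with this series the
	 * bound becomes time-based instead of a fixed iteration count.
	 */
	while (!try_lock(&lk)) {
		cond_break;
	}
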
> >
> > Below are some benchmarking numbers on the time taken per iteration
> > for an empty loop that counts the number of iterations until
> > cond_break fires. We compare it against bpf_for/bpf_repeat, which is
> > another way to achieve the same number of spins (BPF_MAX_LOOPS). The
> > hardware used for benchmarking was a Sapphire Rapids Intel server with
> > the performance governor enabled.
> >
> > +-----------------------------+--------------+--------------+------------------+
> > | Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
> > +-----------------------------+--------------+--------------+------------------+
> > | may_goto                    | 8388608      |  3           |   0.36           |
> > | timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
> > | bpf_for                     | 8388608      |  10          |   1.19           |
> > +-----------------------------+--------------+--------------+------------------+
> >
> > This gives a good approximation at low overhead while staying close to
> > the current implementation.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > ---
> >  include/linux/bpf.h    |  1 +
> >  include/linux/filter.h |  8 +++++++
> >  kernel/bpf/core.c      | 31 +++++++++++++++++++++++++
> >  kernel/bpf/verifier.c  | 52 +++++++++++++++++++++++++++++++++++-------
> >  4 files changed, 84 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index aec102868b93..788f6ca374e9 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1986,6 +1986,7 @@ struct bpf_array {
> >   */
> >  enum {
> >         BPF_MAX_LOOPS = 8 * 1024 * 1024,
> > +       BPF_MAX_TIMED_LOOPS = 0xffff,
> >  };
> >
> >  #define BPF_F_ACCESS_MASK      (BPF_F_RDONLY |         \
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 3ed6eb9e7c73..02dda5c53d91 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -669,6 +669,11 @@ struct bpf_prog_stats {
> >         struct u64_stats_sync syncp;
> >  } __aligned(2 * sizeof(u64));
> >
> > +struct bpf_timed_may_goto {
> > +       u64 count;
> > +       u64 timestamp;
> > +};
> > +
> >  struct sk_filter {
> >         refcount_t      refcnt;
> >         struct rcu_head rcu;
> > @@ -1130,8 +1135,11 @@ bool bpf_jit_supports_ptr_xchg(void);
> >  bool bpf_jit_supports_arena(void);
> >  bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
> >  bool bpf_jit_supports_private_stack(void);
> > +bool bpf_jit_supports_timed_may_goto(void);
> >  u64 bpf_arch_uaddress_limit(void);
> >  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> > +u64 arch_bpf_timed_may_goto(void);
> > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *);
> >  bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id);
> >
> >  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index a0200fbbace9..b3f7c7bd08d3 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -3069,6 +3069,37 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
> >  {
> >  }
> >
> > +bool __weak bpf_jit_supports_timed_may_goto(void)
> > +{
> > +       return false;
> > +}
> > +
> > +u64 __weak arch_bpf_timed_may_goto(void)
> > +{
> > +       return 0;
> > +}
> > +
> > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
> > +{
> > +       u64 time = ktime_get_mono_fast_ns();
>
> let's move the call after !p->count check to avoid unused work.
>

Ok.
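
A sketch of the reordered check, with the ktime_get_mono_fast_ns() call
moved after the early exit (illustrative only, not the final respin):

u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
{
	u64 time;

	/*
	 * If the count is zero, a prior cond_break in this stack frame
	 * already fired; exit quickly without sampling time.
	 */
	if (!p->count)
		return 0;
	time = ktime_get_mono_fast_ns();
	/* Populate the timestamp for this stack frame. */
	if (!p->timestamp) {
		p->timestamp = time;
		return BPF_MAX_TIMED_LOOPS;
	}
	/* Check if we've exhausted our time slice. */
	if (time - p->timestamp >= (NSEC_PER_SEC / 4))
		return 0;
	/* Refresh the count for the stack frame. */
	return BPF_MAX_TIMED_LOOPS;
}
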

> > +
> > +       /* If the count is zero, we've already broken a prior loop in this stack
> > +        * frame, let's just exit quickly.
> > +        */
>
> Let's use normal kernel comment style in all new code.
> I think even netdev folks allow both styles now.
>

Ack.

> > +       if (!p->count)
> > +               return 0;
> > +       /* Populate the timestamp for this stack frame. */
> > +       if (!p->timestamp) {
> > +               p->timestamp = time;
> > +               return BPF_MAX_TIMED_LOOPS;
> > +       }
> > +       /* Check if we've exhausted our time slice. */
> > +       if (time - p->timestamp >= (NSEC_PER_SEC / 4))
> > +               return 0;
> > +       /* Refresh the count for the stack frame. */
> > +       return BPF_MAX_TIMED_LOOPS;
> > +}
> > +
> >  /* for configs without MMU or 32-bit */
> >  __weak const struct bpf_map_ops arena_map_ops;
> >  __weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index dcd0da4e62fc..79bfb1932f40 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -21503,7 +21503,34 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >                         goto next_insn;
> >                 }
> >
> > -               if (is_may_goto_insn(insn)) {
> > +               if (is_may_goto_insn(insn) && bpf_jit_supports_timed_may_goto()) {
> > +                       int stack_off_cnt = -stack_depth - 16;
> > +
> > +                       /* Two 8 byte slots, depth-16 stores the count, and
> > +                        * depth-8 stores the start timestamp of the loop.
> > +                        */
> > +                       stack_depth_extra = 16;
> > +                       insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_AX, BPF_REG_10, stack_off_cnt);
> > +                       if (insn->off >= 0)
> > +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off + 5);
> > +                       else
> > +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off - 1);
> > +                       insn_buf[2] = BPF_ALU64_IMM(BPF_SUB, BPF_REG_AX, 1);
> > +                       insn_buf[3] = BPF_JMP_IMM(BPF_JNE, BPF_REG_AX, 1, 2);
>
> Maybe != 0 instead ?
> Otherwise it's off by 1.

We never do the subtraction with AX=0; the preceding BPF_JEQ breaks out
of the loop in that case.
The count starts at 0xffff, so when the stack slot holds 2 and the
subtraction brings it to 1, we make the call, which resets it to 0xffff
(refresh) or 0 (abort).

But it's late and I could be missing something.
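
For reference, a small C model of what the emitted sequence does
(timed_cond_break is an illustrative name, not part of the patch; the
final store of BPF_REG_AX back into the stack slot is implied by the
commit message but trimmed from the quoted hunk):

	/* Returns true when the loop should break. p models the two u64
	 * slots at r10 - stack_depth - 16.
	 */
	static bool timed_cond_break(struct bpf_timed_may_goto *p)
	{
		if (!p->count)		/* insn_buf[1]: a prior break in   */
			return true;	/* this frame, exit immediately    */
		p->count -= 1;		/* insn_buf[2]                     */
		if (p->count == 1)	/* insn_buf[3..5]: refresh to      */
					/* 0xffff, or 0 to break next time */
			p->count = bpf_check_timed_may_goto(p);
		return false;		/* count is stored back, loop on   */
	}
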

>
> > +                       insn_buf[4] = BPF_MOV64_IMM(BPF_REG_AX, stack_off_cnt);
>
> Please add a comment that BPF_REG_AX is used as an argument
> register and contains return value too.

Ok, will do.

>
> I looked at a couple other non-x86 JITs and I think this calling
> convention should work for them too.
>
> > +                       insn_buf[5] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_CALL_IMM(arch_bpf_timed_may_goto));
>
> Use BPF_EMIT_CALL() instead?
>

Ack.
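
The call would then read:

	insn_buf[5] = BPF_EMIT_CALL(arch_bpf_timed_may_goto);

which produces the same BPF_JMP | BPF_CALL encoding, with .imm set via
BPF_CALL_IMM().
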

> > [...]




