On Thu, Jun 12, 2025 at 8:58 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jun 11, 2025 at 5:07 PM Menglong Dong <menglong8.dong@xxxxxxxxx> wrote:
> >
> > Hi Alexei, thank you for your explanation, and now I realize the
> > problem is my hash table :/
> >
> > My hash table was modeled on ftrace and fprobe, whose
> > maximum bucket count is 1024.
> >
> > It's interesting to make the hash table O(1) by using rhashtable
> > or sizing up the buckets, as you said. I suspect we don't even
> > need the function padding part if the hash table is random
> > enough.
>
> I suggest starting with rhashtable. It's used in many
> performance critical places, and when rhashtable_params are
> constant the compiler optimizes everything nicely.
> lookup is lockless and only needs RCU, so safe to use
> from fentry_multi.

Hi, Alexei. Sorry to bother you.

I have implemented the hash table with rhashtable and ran the bench
testing with the existing framework.

You said earlier that "fm_single has to be within couple percents of
fentry", and I think that is difficult to reach if we use the hash table
without the function padding mode. The extra overhead of the global
trampoline comes from:

1. An additional hash lookup. rhashtable is O(1), but computing the hash
   key and the extra memory reads still add a slight overhead.
2. An additional function call to kfunc_md_get_noref() in the asm, which
   fetches the function metadata. In function padding mode we can inline
   it in the asm, but that is hard with rhashtable.
3. Extra logic in the global trampoline. For example, we save and restore
   reg1 to reg6 on the stack for the function args even if the function
   we attach to doesn't have any args.
Following is the bench result in rhashtable mode; the performance of
fentry_multi is about 77.7% of fentry:

usermode-count  : 893.357 ± 0.566M/s
kernel-count    : 421.290 ± 0.159M/s
syscall-count   : 21.018 ± 0.165M/s
fentry          : 100.742 ± 0.065M/s
fexit           : 51.283 ± 0.784M/s
fmodret         : 55.410 ± 0.026M/s
fentry-multi    : 78.237 ± 0.117M/s
fentry-multi-all: 80.090 ± 0.049M/s
rawtp           : 161.496 ± 0.197M/s
tp              : 70.021 ± 0.015M/s
kprobe          : 54.693 ± 0.013M/s
kprobe-multi    : 51.481 ± 0.023M/s
kretprobe       : 22.504 ± 0.011M/s
kretprobe-multi : 27.221 ± 0.037M/s

(It's a little odd that fentry-multi-all performs slightly better than
fentry-multi, but I'm sure the bpf prog is attached to all the kernel
functions in the fentry-multi-all testcase.)

The overhead of part 1 can be eliminated by using the function padding
mode, and following is the bench result:

usermode-count  : 895.874 ± 2.472M/s
kernel-count    : 423.882 ± 0.342M/s
syscall-count   : 20.480 ± 0.009M/s
fentry          : 105.191 ± 0.275M/s
fexit           : 52.430 ± 0.050M/s
fmodret         : 56.130 ± 0.062M/s
fentry-multi    : 88.114 ± 0.108M/s
fentry-multi-all: 86.988 ± 0.024M/s
rawtp           : 145.488 ± 0.043M/s
tp              : 73.386 ± 0.095M/s
kprobe          : 55.294 ± 0.046M/s
kprobe-multi    : 50.457 ± 0.075M/s
kretprobe       : 22.414 ± 0.020M/s
kretprobe-multi : 27.205 ± 0.044M/s

The performance of fentry_multi is now 83.7% of fentry.
As a next step, we inline the function metadata lookup in the asm, and the
performance of fentry_multi is 89.7% of fentry, which is close to "be
within couple percents of fentry":

usermode-count  : 886.836 ± 0.300M/s
kernel-count    : 419.962 ± 1.252M/s
syscall-count   : 20.715 ± 0.022M/s
fentry          : 102.783 ± 0.166M/s
fexit           : 52.502 ± 0.014M/s
fmodret         : 55.822 ± 0.038M/s
fentry-multi    : 92.201 ± 0.027M/s
fentry-multi-all: 89.831 ± 0.057M/s
rawtp           : 158.337 ± 4.918M/s
tp              : 72.883 ± 0.041M/s
kprobe          : 54.963 ± 0.013M/s
kprobe-multi    : 50.069 ± 0.079M/s
kretprobe       : 22.260 ± 0.012M/s
kretprobe-multi : 27.211 ± 0.011M/s

For the overhead of part 3, I'm thinking of introducing a dynamic global
trampoline: we create a different global trampoline for functions that
have different features, and the features can be:

* the function argument count
* whether bpf_get_func_ip() is ever called in the bpf progs
* whether any FEXIT or MODIFY_RETURN progs exist
* etc.

Then we can generate a global trampoline for a function with the minimum
number of instructions. By my estimation, the performance of fentry_multi
should be above 95% of fentry with function padding, inlined function
metadata and a dynamic global trampoline.

In fact, I implemented the first version of this series with the dynamic
global trampoline. However, that made the series very complex, so I don't
think it's a good idea to include it in this series.

All in all, according to my testing the performance of fentry_multi can't
be within a couple percent of fentry if we use rhashtable only, and I'm
not sure if I should go ahead :/

BTW, the Kconfig I used in the testing comes from "make tinyconfig", with
some options enabled so that tools/testing/selftests/bpf can be compiled
successfully. I would appreciate it if someone could offer a better,
authoritative Kconfig for the testing :/

Thanks, have a nice weekend!
Menglong Dong