[PATCH bpf-next v2 00/18] bpf: tracing multi-link support

Menglong Dong <menglong8.dong@xxxxxxxxx> · Thu, 3 Jul 2025 20:15:03 +0800

(Thanks to Alexei's advice to implement the bpf global trampoline in C
instead of asm, the performance of tracing-multi has improved
significantly. The function metadata, which is implemented with a hash
table, is also fast enough to satisfy our needs.)

For now, a BPF program of type BPF_PROG_TYPE_TRACING cannot be attached
to multiple hooks, so we have to create a BPF program for each kernel
function that we want to trace, even though all the programs have the
same (or similar) logic. This consumes extra memory and makes program
loading slow if we have plenty of kernel functions to trace.

In this series, we add support for attaching a tracing BPF program to
multiple hooks, which is similar to BPF_TRACE_KPROBE_MULTI. Generally
speaking, this series can be divided into 4 parts:

1. Add per-function metadata storage support.
2. Add bpf global trampoline support for x86_64.
3. Add bpf global trampoline link support.
4. Add tracing multi-link support.

per-function metadata storage
-----------------------------
The per-function metadata storage is the basis of the bpf global
trampoline. In short, it is a hash table that stores some information
about kernel functions. The key of this hash table is the kernel function
address, and the following data is stored in the hash value:

* The BPF progs, whose type is FENTRY, FEXIT or MODIFY_RETURN. The struct
  kfunc_md_tramp_prog is introduced to store the BPF prog and the cookie,
  and it chains the BPF progs of the same type into a list through the
  "next" field.
* The kernel function address
* The kernel function arguments count
* Whether the origin call is needed

The buckets of the hash table can grow and shrink when necessary. Alexei
advised using rhashtable. However, the compiler refused to inline the
hash lookup, which brings additional overhead to the BPF global
trampoline described below. To force the lookup to be inlined, I would
have to replace "inline" with "__always_inline" for
rhashtable_lookup_fast, rhashtable_lookup, __rhashtable_lookup and
rht_key_get_hash. So I just implemented a hash table myself instead.
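
To make the layout above concrete, here is a minimal user-space sketch of
an address-keyed metadata table. It is only an illustration: the names
md_prog, func_md, md_lookup and md_get_or_create are hypothetical and are
not the identifiers used in the series, the bucket count is fixed (the
real table grows and shrinks), and RCU protection is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kfunc_md_tramp_prog. */
struct md_prog {
	struct md_prog *next;	/* next prog of the same type */
	void *prog;		/* stands in for struct bpf_prog * */
	uint64_t cookie;
};

enum { MD_FENTRY, MD_FEXIT, MD_MODIFY_RETURN, MD_TYPE_MAX };

/* Hypothetical per-function metadata entry. */
struct func_md {
	struct func_md *hash_next;		/* bucket chain */
	uintptr_t func;				/* key: function address */
	struct md_prog *progs[MD_TYPE_MAX];	/* one list per prog type */
	uint8_t nr_args;			/* arguments count */
	uint8_t call_origin;			/* is the origin call needed? */
};

#define MD_HASH_BITS 6	/* fixed size here; assumption of this sketch */
static struct func_md *md_table[1u << MD_HASH_BITS];

/* Multiplicative hash of the function address into a bucket index. */
static inline size_t md_hash(uintptr_t addr)
{
	return (size_t)(((uint64_t)addr * 0x9E3779B97F4A7C15ull) >>
			(64 - MD_HASH_BITS));
}

/* Small enough that the compiler can inline it into the caller. */
static inline struct func_md *md_lookup(uintptr_t addr)
{
	struct func_md *md = md_table[md_hash(addr)];

	while (md && md->func != addr)
		md = md->hash_next;
	return md;
}

/* Return the entry for addr, creating it if it does not exist yet. */
static struct func_md *md_get_or_create(uintptr_t addr, uint8_t nr_args)
{
	struct func_md *md;
	size_t bucket;

	assert(addr != 0);	/* a NULL address is not a valid key */
	md = md_lookup(addr);
	if (md)
		return md;
	md = calloc(1, sizeof(*md));
	if (!md)
		return NULL;
	md->func = addr;
	md->nr_args = nr_args;
	bucket = md_hash(addr);
	md->hash_next = md_table[bucket];
	md_table[bucket] = md;
	return md;
}
```

The point of keeping md_lookup() this small is the one made above: the
lookup sits on the hot path of the trampoline, so it should be cheap
enough to be inlined.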

bpf global trampoline
---------------------
The bpf global trampoline is similar to the general bpf trampoline. The
bpf trampoline stores the bpf progs and some metadata in the trampoline
instructions directly. The bpf global trampoline, however, gets the
metadata from the function metadata with kfunc_md_get_rcu(). This
makes the bpf global trampoline more flexible, and it can be used for all
kernel functions.

The bpf global trampoline is designed to implement the tracing multi-link
for FENTRY, FEXIT and MODIFY_RETURN.

The global trampoline is implemented mostly in C. We implement the entry
of the trampoline with a "__naked" function, which saves the regs to
an array on the stack and calls bpf_global_caller_run(). The entry
passes the address of the array and the address of the rip to
bpf_global_caller_run().

The whole idea of implementing the trampoline in C is inspired by Alexei
in [3]. Implementing it in C does have advantages. Some function calls,
such as __bpf_prog_enter_recur, __bpf_prog_exit_recur, __bpf_tramp_enter
and __bpf_tramp_exit, are inlined, which reduces some overhead. The
performance of the global trampoline can be seen below.
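
The control flow of such a C caller can be sketched in user space as
below. This is a simplified model under several assumptions, not the code
in the series: the metadata is passed in directly instead of being looked
up from the rip, MODIFY_RETURN handling and the prog enter/exit
bookkeeping are omitted, the return value is stored in the slot right
after the arguments, and all names (tramp_prog, func_md,
global_caller_run, demo_*) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*prog_fn)(uint64_t *args, uint64_t cookie);
typedef uint64_t (*origin_fn)(uint64_t *args);

/* One attached prog; progs of the same type are chained via "next". */
struct tramp_prog {
	struct tramp_prog *next;
	prog_fn run;
	uint64_t cookie;
};

/* Reduced metadata: only what this sketch needs. */
struct func_md {
	struct tramp_prog *fentry;
	struct tramp_prog *fexit;
	origin_fn origin;	/* the traced function */
	int nr_args;
	int call_origin;	/* set when the origin call is needed */
};

/*
 * regs holds the saved argument registers, followed by one slot for the
 * return value (an assumption of this sketch).
 */
static uint64_t global_caller_run(const struct func_md *md, uint64_t *regs)
{
	const struct tramp_prog *p;
	uint64_t ret = 0;

	assert(md && regs);

	/* Run every FENTRY prog before the traced function. */
	for (p = md->fentry; p; p = p->next)
		p->run(regs, p->cookie);

	/* Call the traced function itself if it is needed. */
	if (md->call_origin) {
		ret = md->origin(regs);
		regs[md->nr_args] = ret;
	}

	/* Run every FEXIT prog; they can see the return value. */
	for (p = md->fexit; p; p = p->next)
		p->run(regs, p->cookie);

	return ret;
}

/* Demo prog: adds its cookie to a global counter. */
static uint64_t demo_hits;

static int demo_prog(uint64_t *args, uint64_t cookie)
{
	(void)args;
	demo_hits += cookie;
	return 0;
}

/* Demo traced function: returns the sum of its two arguments. */
static uint64_t demo_origin(uint64_t *args)
{
	return args[0] + args[1];
}
```

Because the prog lists live in the metadata rather than in generated
instructions, the same caller body serves every traced function.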

bpf global trampoline link
--------------------------
We reuse part of the code in [2] to implement the tracing multi-link. The
struct bpf_gtramp_link is introduced for the bpf global trampoline link.
Similar to the bpf trampoline link, the bpf global trampoline link has
bpf_gtrampoline_link_prog() and bpf_gtrampoline_unlink_prog() to link and
unlink the bpf progs.

The "entries" field in bpf_gtramp_link is an array of struct
bpf_gtramp_link_entry, which contains all the information about the
functions that we trace, such as the address, the number of args, the
cookie and so on.

The bpf global trampoline is much simpler than the bpf trampoline, and we
introduce the new struct bpf_global_trampoline for it. The "image" field
is a pointer to bpf_global_caller_x. We introduce the global trampoline
array, and a kernel function with arguments count "x" is handled by the
global trampoline global_tr_array[x]. We implement the global trampoline
based on direct ftrace, and the "fops" field is used for this purpose.
This means bpf2bpf is not supported by the tracing multi-link.

When we link a bpf prog, we add it to the kfunc_md of every target
function. Then we get all the function addresses that have bpf progs with
kfunc_md_bpf_ips(), and reset the ftrace filter of the fops to them.
Direct ftrace doesn't support resetting the filter functions yet, so we
introduce reset_ftrace_direct_ips() to do this work.
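
The address collection step can be pictured roughly as follows. This is a
user-space sketch that assumes the filter is rebuilt per arguments count
(one list for each global_tr_array[x]); that grouping is my reading of
the design rather than something spelled out above. The names func_md and
collect_bpf_ips are hypothetical and do not reflect the signature of
kfunc_md_bpf_ips().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced metadata: only what the collection step needs. */
struct func_md {
	uintptr_t func;	/* function address */
	int nr_args;	/* arguments count */
	int nr_progs;	/* number of attached BPF progs */
};

/*
 * Collect the addresses of all functions that take nr_args arguments and
 * still have at least one BPF prog attached. Returns the number of
 * addresses written to ips, which is at most cap.
 */
static size_t collect_bpf_ips(const struct func_md *mds, size_t n,
			      int nr_args, uintptr_t *ips, size_t cap)
{
	size_t cnt = 0;
	size_t i;

	assert(mds && ips);
	for (i = 0; i < n && cnt < cap; i++) {
		if (mds[i].nr_args == nr_args && mds[i].nr_progs > 0)
			ips[cnt++] = mds[i].func;
	}
	return cnt;
}
```

The resulting list is what would replace the old filter in one step, so
functions whose last prog was removed simply drop out of it.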

tracing multi-link
------------------
Most of the code of this part comes from the series [2].

In the 6th patch, we add support for recording the indexes of the target
function args that the tracing program accesses. Meanwhile, we add the
function btf_check_func_part_match() to compare the accessed function args
of two function prototypes. This function will be used in the next commit.

In the 7th patch, we refactor the struct modules_array to ptr_array, as
we need similar functionality to hold the target btf, target program and
kernel modules that we reference in the following commit.

In the 11th patch, we implement the multi-link support for tracing, and
the following new attach types are added:

  BPF_TRACE_FENTRY_MULTI
  BPF_TRACE_FEXIT_MULTI
  BPF_MODIFY_RETURN_MULTI

We introduce the struct bpf_tracing_multi_link for this purpose, which
can hold all the kernel modules, target bpf programs (for attaching to bpf
programs) or target btfs (for attaching to kernel functions) that we
reference.

During loading, the verifier uses the first target for verification.
During attaching, we check the consistency of all the targets with the
first target.
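
A rough user-space model of this consistency check is shown below. It
assumes that "consistent" means every argument the program accesses
exists in both prototypes and has the same type; the real
btf_check_func_part_match() works on BTF and may apply different rules.
The names func_proto, proto_part_match and targets_consistent, and the
use of plain integers as type ids, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ARGS 12	/* arbitrary limit for this sketch */

/* Simplified function prototype; arg_type stands in for BTF type ids. */
struct func_proto {
	int nr_args;
	uint32_t arg_type[MAX_ARGS];
};

/*
 * Compare only the arguments the program accesses. Bit i of "accessed"
 * is set if the program reads argument i.
 */
static bool proto_part_match(const struct func_proto *a,
			     const struct func_proto *b, uint32_t accessed)
{
	int i;

	for (i = 0; i < MAX_ARGS; i++) {
		if (!(accessed & (1u << i)))
			continue;
		/* An accessed argument must exist in both prototypes. */
		if (i >= a->nr_args || i >= b->nr_args)
			return false;
		if (a->arg_type[i] != b->arg_type[i])
			return false;
	}
	return true;
}

/* Check every target against the first one, as done during attaching. */
static bool targets_consistent(const struct func_proto *targets, int n,
			       uint32_t accessed)
{
	int i;

	assert(n >= 1);
	for (i = 1; i < n; i++) {
		if (!proto_part_match(&targets[0], &targets[i], accessed))
			return false;
	}
	return true;
}
```

Comparing only the accessed arguments is what lets one program attach to
functions whose prototypes differ in the arguments it never reads.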

performance comparison
----------------------
We have implemented the following performance tests in the selftests in
bench_trigger.c:

- trig-fentry-multi
- trig-fentry-multi-all
- trig-fexit-multi
- trig-fmodret-multi

The "fentry-multi-all" case is used to test the performance of the
function metadata hash table; all the kernel functions are hooked during
this test.

Mitigations are disabled during the tests. They are enabled by default
in the kernel, and we can disable them with the "mitigations=off" cmdline
for testing.

The tests are run with the command:
  ./run_bench_trigger.sh fentry fentry-multi fentry-multi-all fexit \
                         fexit-multi fmodret fmodret-multi

The following are the test results, in units of "M/s":

fentry  | fm     | fm_all | fexit  | fexit-multi | fmodret | fmodret-multi
103.303 | 94.532 | 98.009 | 55.155 | 55.448      | 58.632  | 56.379
107.564 | 98.007 | 97.857 | 55.278 | 53.997      | 59.485  | 55.855
106.841 | 97.483 | 95.064 | 55.715 | 55.502      | 59.442  | 56.126
109.852 | 97.486 | 93.161 | 56.432 | 55.494      | 59.454  | 56.178
109.791 | 97.973 | 96.728 | 55.729 | 55.363      | 59.445  | 56.228

* fm: fentry-multi, fm_all: fentry-multi-all

The following are the results of running all the bench tests:

  usermode-count  :  746.907 ± 0.323M/s
  kernel-count    :  313.423 ± 0.031M/s
  syscall-count   :   18.179 ± 0.013M/s
  fentry          :  107.149 ± 0.051M/s
  fexit           :   56.565 ± 0.019M/s
  fmodret         :   59.495 ± 0.024M/s
  fentry-multi    :   99.073 ± 0.087M/s
  fentry-multi-all:   97.920 ± 0.095M/s
  fexit-multi     :   55.426 ± 0.045M/s
  fmodret-multi   :   56.589 ± 0.163M/s
  rawtp           :  166.774 ± 0.137M/s
  tp              :   61.947 ± 0.035M/s
  kprobe          :   43.719 ± 0.018M/s
  kprobe-multi    :   47.451 ± 0.087M/s
  kretprobe       :   18.358 ± 0.026M/s
  kretprobe-multi :   24.523 ± 0.016M/s