On Fri, Jul 4, 2025 at 4:47 PM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Thu, Jul 03, 2025 at 08:15:03PM +0800, Menglong Dong wrote:
> > (Thanks to Alexei's advice to implement the bpf global trampoline in C
> > instead of asm, the performance of tracing-multi has been significantly
> > improved. And the function metadata, which is implemented with a hash
> > table, is also fast enough to satisfy our needs.)
> >
> > For now, a BPF program of type BPF_PROG_TYPE_TRACING is not allowed to
> > be attached to multiple hooks, and we have to create a BPF program for
> > each kernel function that we want to trace, even though all the
> > programs have the same (or similar) logic. This can consume extra
> > memory and make program loading slow if we have plenty of kernel
> > functions to trace.
>
> hi,
> what tree did you base your patchset on? I can't apply it on
> bpf-next/master and I tried several other trees

Sorry that I forgot to rebase onto the latest bpf-next/master. This
patchset is based on commit c4b1be928ea0, which means I have not
updated my tree for a week :/

Thanks!
Menglong Dong

>
> thanks,
> jirka
>
> >
> > In this series, we add support for attaching a tracing BPF program to
> > multiple hooks, similar to BPF_TRACE_KPROBE_MULTI. Generally speaking,
> > this series can be divided into 4 parts:
> >
> > 1. Add per-function metadata storage support.
> > 2. Add bpf global trampoline support for x86_64.
> > 3. Add bpf global trampoline link support.
> > 4. Add tracing multi-link support.
> >
> > per-function metadata storage
> > -----------------------------
> > The per-function metadata storage is the basis of the bpf global
> > trampoline. In short, it is a hash table that stores some information
> > about kernel functions. The key of this hash table is the kernel
> > function address, and the following data is stored in the hash value:
> >
> > * The BPF progs whose type is FENTRY, FEXIT or MODIFY_RETURN. The
> >   struct kfunc_md_tramp_prog is introduced to store the BPF prog and
> >   the cookie, and it chains the BPF progs of the same type into a list
> >   with the "next" field.
> > * The kernel function address
> > * The kernel function arguments count
> > * Whether the origin call is needed
> >
> > The buckets of the hash table can grow and shrink when necessary.
> > Alexei advised to use rhashtable. However, the compiler is not clever
> > enough and refused to inline the hash lookup for me, which brings
> > additional overhead into the BPF global trampoline described below. I
> > would have to replace "inline" with "__always_inline" for
> > rhashtable_lookup_fast, rhashtable_lookup, __rhashtable_lookup and
> > rht_key_get_hash to force it to inline the hash lookup. So I just
> > implemented a hash table myself instead.
> >
> > bpf global trampoline
> > ---------------------
> > The bpf global trampoline is similar to the general bpf trampoline.
> > The bpf trampoline stores the bpf progs and some metadata in the
> > trampoline instructions directly. However, the bpf global trampoline
> > stores and gets the metadata from the function metadata with
> > kfunc_md_get_rcu(). This makes the bpf global trampoline more
> > flexible, and it can be used for all kernel functions.
> >
> > The bpf global trampoline is designed to implement the tracing
> > multi-link for FENTRY, FEXIT and MODIFY_RETURN.
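> >
> > To make the metadata described above concrete, here is a minimal
> > sketch of what an entry and the lookup done by the global trampoline
> > could look like. Only kfunc_md_tramp_prog and kfunc_md_get_rcu() are
> > names used by this series; the field names, the fixed-size table and
> > the open-coded hash are assumptions for illustration only (the real
> > hash table is resizable):
> >
> > #include <linux/hash.h>
> > #include <linux/rculist.h>
> > #include <linux/bpf.h>
> >
> > struct kfunc_md_tramp_prog {
> > 	struct kfunc_md_tramp_prog *next; /* progs of the same type are chained */
> > 	struct bpf_prog *prog;
> > 	u64 cookie;
> > };
> >
> > struct kfunc_md {
> > 	struct hlist_node hash;
> > 	unsigned long func;	/* kernel function address, the hash key */
> > 	struct kfunc_md_tramp_prog *progs[BPF_TRAMP_MAX]; /* per prog type */
> > 	u8 nr_args;		/* argument count of the traced function */
> > 	bool origin_call;	/* does any prog need the origin call? */
> > };
> >
> > #define KFUNC_MD_HASH_BITS 10
> > static struct hlist_head kfunc_md_table[1 << KFUNC_MD_HASH_BITS];
> >
> > /* called from the global trampoline under RCU, so it must stay cheap */
> > static __always_inline struct kfunc_md *kfunc_md_get_rcu(unsigned long ip)
> > {
> > 	u32 bucket = hash_ptr((void *)ip, KFUNC_MD_HASH_BITS);
> > 	struct kfunc_md *md;
> >
> > 	hlist_for_each_entry_rcu(md, &kfunc_md_table[bucket], hash)
> > 		if (md->func == ip)
> > 			return md;
> > 	return NULL;
> > }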
> >
> > The global trampoline is mostly implemented in C. We implement the
> > entry of the trampoline with a "__naked" function, which saves the
> > regs to an array on the stack and calls bpf_global_caller_run(). The
> > entry passes the address of that array and the address of the rip to
> > bpf_global_caller_run().
> >
> > The whole idea of implementing the trampoline in C is inspired by
> > Alexei in [3]. Implementing it in C does have advantages: some
> > function calls, such as __bpf_prog_enter_recur, __bpf_prog_exit_recur,
> > __bpf_tramp_enter and __bpf_tramp_exit, are inlined, which reduces
> > some overhead. The performance of the global trampoline can be seen
> > below.
> >
> > bpf global trampoline link
> > --------------------------
> > We reuse part of the code in [2] to implement the tracing multi-link.
> > The struct bpf_gtramp_link is introduced for the bpf global trampoline
> > link. Similar to the bpf trampoline link, the bpf global trampoline
> > link has bpf_gtrampoline_link_prog() and bpf_gtrampoline_unlink_prog()
> > to link and unlink the bpf progs.
> >
> > The "entries" field in the bpf_gtramp_link is an array of struct
> > bpf_gtramp_link_entry, which contains all the information about the
> > functions that we trace, such as the address, the number of args, the
> > cookie and so on.
> >
> > The bpf global trampoline is much simpler than the bpf trampoline, and
> > we introduce the new struct bpf_global_trampoline for it. The "image"
> > field is a pointer to bpf_global_caller_x. We introduce a global
> > trampoline array, and a kernel function with arguments count "x" can
> > be handled by the global trampoline global_tr_array[x]. We implement
> > the global trampoline based on direct ftrace, and the "fops" field is
> > used for this purpose. This means bpf2bpf is not supported by the
> > tracing multi-link.
> >
> > When we link the bpf prog, we add it to the kfunc_md of all the target
> > functions. Then, we get all the function addresses that have bpf progs
> > with kfunc_md_bpf_ips() and reset the ftrace filter of the fops to
> > them. Direct ftrace doesn't support resetting the filter functions
> > yet, so we introduce reset_ftrace_direct_ips() to do this work.
> >
> > tracing multi-link
> > ------------------
> > Most of the code of this part comes from the series [2].
> >
> > In the 6th patch, we add support to record the indexes of the accessed
> > function args of the target for the tracing program. Meanwhile, we add
> > the function btf_check_func_part_match() to compare the accessed
> > function args of two function prototypes. This function will be used
> > in the next commit.
> >
> > In the 7th patch, we refactor the struct modules_array into ptr_array,
> > as we need similar functionality to hold the target btf, target
> > program and kernel modules that we reference in the following commit.
> >
> > In the 11th patch, we implement the multi-link support for tracing,
> > and the following new attach types are added:
> >
> >   BPF_TRACE_FENTRY_MULTI
> >   BPF_TRACE_FEXIT_MULTI
> >   BPF_MODIFY_RETURN_MULTI
> >
> > We introduce the struct bpf_tracing_multi_link for this purpose, which
> > can hold all the kernel modules, the target bpf program (for attaching
> > to a bpf program) or the target btf (for attaching to kernel
> > functions) that we reference.
> >
> > During loading, the first target is used for verification by the
> > verifier. And during attaching, we check the consistency of all the
> > targets with the first target.
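> >
> > From the user's point of view, a single tracing program can then be
> > attached to a whole set of kernel functions at once. Below is a
> > minimal sketch of such a program; note that the SEC() name and the
> > attach-time API come from the libbpf patches of this series and are
> > not quoted here, so the section name is only an assumption for
> > illustration:
> >
> > #include "vmlinux.h"
> > #include <bpf/bpf_helpers.h>
> > #include <bpf/bpf_tracing.h>
> >
> > char _license[] SEC("license") = "GPL";
> >
> > /* one program, attached with BPF_TRACE_FENTRY_MULTI to many functions */
> > SEC("fentry.multi")
> > int BPF_PROG(fentry_multi_hit)
> > {
> > 	/* the hooked function can be told apart by its address (and, as
> > 	 * described above, by the per-function cookie)
> > 	 */
> > 	bpf_printk("hit %llx", bpf_get_func_ip(ctx));
> > 	return 0;
> > }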
> >
> > performance comparison
> > ----------------------
> > We have implemented the following performance tests in the selftests
> > in bench_trigger.c:
> >
> > - trig-fentry-multi
> > - trig-fentry-multi-all
> > - trig-fexit-multi
> > - trig-fmodret-multi
> >
> > The "fentry_multi_all" case is used to test the performance of the
> > function metadata hash table, and all the kernel functions are hooked
> > during this test.
> >
> > Mitigations are disabled during the tests. They are enabled by default
> > in the kernel, and we can disable them with the "mitigations=off"
> > cmdline to do the testing.
> >
> > The tests are done with the command:
> >   ./run_bench_trigger.sh fentry fentry-multi fentry-multi-all fexit \
> >     fexit-multi fmodret fmodret-multi
> >
> > Following are the test results, and the unit is "M/s":
> >
> > fentry  | fm     | fm_all | fexit  | fexit-multi | fmodret | fmodret-multi
> > 103.303 | 94.532 | 98.009 | 55.155 | 55.448      | 58.632  | 56.379
> > 107.564 | 98.007 | 97.857 | 55.278 | 53.997      | 59.485  | 55.855
> > 106.841 | 97.483 | 95.064 | 55.715 | 55.502      | 59.442  | 56.126
> > 109.852 | 97.486 | 93.161 | 56.432 | 55.494      | 59.454  | 56.178
> > 109.791 | 97.973 | 96.728 | 55.729 | 55.363      | 59.445  | 56.228
> >
> > * fm: fentry-multi, fm_all: fentry-multi-all
> >
> > Following are the results of running all the bench tests:
> >
> > usermode-count  : 746.907 ± 0.323M/s
> > kernel-count    : 313.423 ± 0.031M/s
> > syscall-count   : 18.179 ± 0.013M/s
> > fentry          : 107.149 ± 0.051M/s
> > fexit           : 56.565 ± 0.019M/s
> > fmodret         : 59.495 ± 0.024M/s
> > fentry-multi    : 99.073 ± 0.087M/s
> > fentry-multi-all: 97.920 ± 0.095M/s
> > fexit-multi     : 55.426 ± 0.045M/s
> > fmodret-multi   : 56.589 ± 0.163M/s
> > rawtp           : 166.774 ± 0.137M/s
> > tp              : 61.947 ± 0.035M/s
> > kprobe          : 43.719 ± 0.018M/s
> > kprobe-multi    : 47.451 ± 0.087M/s
> > kretprobe       : 18.358 ± 0.026M/s
> > kretprobe-multi : 24.523 ± 0.016M/s
> >
> > From the above test data, it can be seen that the performance of
> > fentry-multi is approximately 10% worse than that of fentry,
> > fmodret-multi is ~5% worse than fmodret, and fexit-multi is almost the
> > same as fexit.
> >
> > The bpf global trampoline has additional overhead in comparison with
> > the bpf trampoline:
> > 1. We do more checks. We check if the origin call is needed, if the
> >    prog is sleepable, etc., in the global trampoline.
> > 2. We do more memory reads and writes. We need to load the bpf progs
> >    from memory, and save additional regs to the stack.
> > 3. The function metadata lookup.
> >
> > However, we also have some optimizations:
> > 1. For fentry, we avoid 2 function calls, __bpf_prog_enter_recur and
> >    __bpf_prog_exit_recur, as we make them inline in our case.
> > 2. For fexit/fmodret, we avoid another 2 function calls,
> >    __bpf_tramp_enter and __bpf_tramp_exit, by inlining them.
> >
> > The performance of fentry-multi is close to that of fentry-multi-all,
> > which means the hash table lookup is O(1) and fast enough.
> >
> > Further work
> > ------------
> > The performance of the global trampoline can be optimized further.
> >
> > First, we can avoid some checks by generating more variants of
> > bpf_global_caller, such as:
> >
> > static __always_inline notrace int
> > bpf_global_caller_run(unsigned long *args, unsigned long *ip,
> > 		      int nr_args, bool sleepable, bool do_origin)
> > {
> > 	xxxxxx
> > }
> >
> > static __always_used __no_stack_protector notrace int
> > bpf_global_caller_2_sleep_origin(unsigned long *args, unsigned long *ip)
> > {
> > 	return bpf_global_caller_run(args, ip, 2, true, true);
> > }
> >
> > And the bpf global caller "bpf_global_caller_2_sleep_origin" can be
> > used for functions that have 2 function args, have sleepable bpf
> > progs, and have fexit or modify_return progs. The checks for sleepable
> > and origin call will be optimized away by the compiler, as they are
> > constant.
> >
> > Second, we can implement the function metadata with function padding.
> > The hash table lookup for the metadata consumes ~15 instructions. With
> > function padding, it needs only 5 instructions and will be faster.
> >
> > Besides the performance, we also need to make the global trampoline
> > collaborate with the bpf trampoline. For now, if FENTRY_MULTI is
> > attached to a target that already has FENTRY on it, -EEXIST will be
> > returned. So we need another series to make them work together.
> >
> > Changes since V1:
> >
> > * remove the function metadata that is based on function padding, and
> >   implement it with a resizable hash table.
> > * rewrite the bpf global trampoline in C.
> > * use the existing bpf bench framework for the bench tests.
> > * remove the part that makes tracing-multi compatible with tracing.
> >
> > Link: https://lore.kernel.org/all/20250303132837.498938-1-dongml2@xxxxxxxxxxxxxxx/ [1]
> > Link: https://lore.kernel.org/bpf/20240311093526.1010158-1-dongmenglong.8@xxxxxxxxxxxxx/ [2]
> > Link: https://lore.kernel.org/bpf/CAADnVQ+G+mQPJ+O1Oc9+UW=J17CGNC5B=usCmUDxBA-ze+gZGw@xxxxxxxxxxxxxx/ [3]
> >
> > Menglong Dong (18):
> >   bpf: add function hash table for tracing-multi
> >   x86,bpf: add bpf_global_caller for global trampoline
> >   ftrace: factor out ftrace_direct_update from register_ftrace_direct
> >   ftrace: add reset_ftrace_direct_ips
> >   bpf: introduce bpf_gtramp_link
> >   bpf: tracing: add support to record and check the accessed args
> >   bpf: refactor the modules_array to ptr_array
> >   bpf: verifier: add btf to the function args of bpf_check_attach_target
> >   bpf: verifier: move btf_id_deny to bpf_check_attach_target
> >   x86,bpf: factor out arch_bpf_get_regs_nr
> >   bpf: tracing: add multi-link support
> >   libbpf: don't free btf if tracing_multi progs existing
> >   libbpf: support tracing_multi
> >   libbpf: add btf type hash lookup support
> >   libbpf: add skip_invalid and attach_tracing for tracing_multi
> >   selftests/bpf: move get_ksyms and get_addrs to trace_helpers.c
> >   selftests/bpf: add basic testcases for tracing_multi
> >   selftests/bpf: add bench tests for tracing_multi
> >
> >  arch/x86/Kconfig | 4 +
> >  arch/x86/net/bpf_jit_comp.c | 290 ++++++++++++-
> >  include/linux/bpf.h | 59 +++
> >  include/linux/bpf_tramp.h | 72 ++++
> >  include/linux/bpf_types.h | 1 +
> >  include/linux/bpf_verifier.h | 1 +
> >  include/linux/btf.h | 3 +-
> >  include/linux/ftrace.h | 7 +
> >  include/linux/kfunc_md.h | 91 ++++
> >  include/uapi/linux/bpf.h | 10 +
> >  kernel/bpf/Makefile | 1 +
> >  kernel/bpf/btf.c | 113 ++++-
> >  kernel/bpf/kfunc_md.c | 352 ++++++++++++++++
> >  kernel/bpf/syscall.c | 395 +++++++++++++++++-
> >  kernel/bpf/trampoline.c | 220 +++++++++-
> >  kernel/bpf/verifier.c | 161 ++++---
> >  kernel/trace/bpf_trace.c | 48 +--
> >  kernel/trace/ftrace.c | 183 +++---
> >  net/bpf/test_run.c | 3 +
> >  net/core/bpf_sk_storage.c | 2 +
> >  net/sched/bpf_qdisc.c | 2 +-
> >  tools/bpf/bpftool/common.c | 3 +
> >  tools/include/uapi/linux/bpf.h | 10 +
> >  tools/lib/bpf/bpf.c | 10 +
> >  tools/lib/bpf/bpf.h | 6 +
> >  tools/lib/bpf/btf.c | 102 +++++
> >  tools/lib/bpf/btf.h | 6 +
> >  tools/lib/bpf/libbpf.c | 296 ++++++++++++-
> >  tools/lib/bpf/libbpf.h | 25 ++
> >  tools/lib/bpf/libbpf.map | 5 +
> >  tools/testing/selftests/bpf/Makefile | 2 +-
> >  tools/testing/selftests/bpf/bench.c | 8 +
> >  .../selftests/bpf/benchs/bench_trigger.c | 72 ++++
> >  .../selftests/bpf/benchs/run_bench_trigger.sh | 1 +
> >  .../selftests/bpf/prog_tests/fentry_fexit.c | 22 +-
> >  .../selftests/bpf/prog_tests/fentry_test.c | 79 +++-
> >  .../selftests/bpf/prog_tests/fexit_test.c | 79 +++-
> >  .../bpf/prog_tests/kprobe_multi_test.c | 220 +---------
> >  .../selftests/bpf/prog_tests/modify_return.c | 60 +++
> >  .../bpf/prog_tests/tracing_multi_link.c | 210 ++++++++++
> >  .../selftests/bpf/progs/fentry_multi_empty.c | 13 +
> >  .../selftests/bpf/progs/tracing_multi_test.c | 181 ++++++++
> >  .../selftests/bpf/progs/trigger_bench.c | 22 +
> >  .../selftests/bpf/test_kmods/bpf_testmod.c | 24 ++
> >  tools/testing/selftests/bpf/test_progs.c | 50 +++
> >  tools/testing/selftests/bpf/test_progs.h | 3 +
> >  tools/testing/selftests/bpf/trace_helpers.c | 283 +++++++++++++
> >  tools/testing/selftests/bpf/trace_helpers.h | 3 +
> >  48 files changed, 3349 insertions(+), 464 deletions(-)
> >  create mode 100644 include/linux/bpf_tramp.h
> >  create mode 100644 include/linux/kfunc_md.h
> >  create mode 100644 kernel/bpf/kfunc_md.c
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/fentry_multi_empty.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_test.c
> >
> > --
> > 2.39.5
> >
> >