The bpf global trampoline has additional overhead in comparison with
the bpf trampoline:

1. We do more checks. In the global trampoline, we check whether the
   origin call is needed, whether the prog is sleepable, etc.
2. We do more memory reads and writes. We need to load the bpf progs
   from memory and save additional regs to the stack.
3. The function metadata lookup.

However, we also have some optimizations:

1. For fentry, we avoid 2 function calls: __bpf_prog_enter_recur and
   __bpf_prog_exit_recur, as we inline them in our case.
2. For fexit/fmodret, we avoid another 2 function calls:
   __bpf_tramp_enter and __bpf_tramp_exit, by inlining them.

The performance of fentry-multi is close to that of fentry-multi-all,
which means the hash table lookup is O(1) and fast enough.

Further work
------------

The performance of the global trampoline can be optimized further.
First, we can avoid some checks by generating more variants of
bpf_global_caller, such as:

  static __always_inline notrace int
  bpf_global_caller_run(unsigned long *args, unsigned long *ip,
                        int nr_args, bool sleepable, bool do_origin)
  {
          xxxxxx
  }

  static __always_used __no_stack_protector notrace int
  bpf_global_caller_2_sleep_origin(unsigned long *args, unsigned long *ip)
  {
          return bpf_global_caller_run(args, ip, 2, true, true);
  }

The global caller bpf_global_caller_2_sleep_origin can then be used for
functions that have 2 arguments and have sleepable bpf progs, as well
as fexit or modify_return, attached. The checks for sleepable and the
origin call will be optimized out by the compiler, as they are
compile-time constants.

Second, we can implement the function metadata with function padding.
The hash table lookup for the metadata consumes ~15 instructions; with
function padding, it needs only 5 instructions and will be faster (a
sketch of this idea is given after the changelog below).

Besides the performance, we also need to make the global trampoline
collaborate with the bpf trampoline. For now, attaching FENTRY_MULTI to
a target that already has FENTRY on it fails with -EEXIST, so another
series is needed to make them work together.

Changes since V1:
* remove the function metadata that is based on function padding, and
  implement it with a resizable hash table instead.
* rewrite the bpf global trampoline in C.
* use the existing bpf bench framework for the benchmarks.
* remove the part that makes tracing-multi compatible with tracing.
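To illustrate the padding-based lookup mentioned above, here is a
minimal sketch of what it could look like. All names here
(KFUNC_MD_PAD_OFFSET, kfunc_mds, kfunc_md_count, kfunc_md_get_padding)
are hypothetical and for illustration only; the series currently uses
the resizable hash table instead:

  /* hypothetical globals, for illustration only */
  extern struct kfunc_md kfunc_mds[];
  extern u32 kfunc_md_count;
  /* assumed layout: a 4-byte index stored right before the entry */
  #define KFUNC_MD_PAD_OFFSET 8

  /*
   * Sketch: the metadata array index is assumed to live in the
   * padding bytes before the function entry, so the lookup is a
   * couple of loads and a bounds check instead of a hash walk.
   */
  static inline struct kfunc_md *kfunc_md_get_padding(unsigned long ip)
  {
          u32 index = *(u32 *)(ip - KFUNC_MD_PAD_OFFSET);

          if (index >= kfunc_md_count)
                  return NULL;

          return &kfunc_mds[index];
  }

This is roughly where the ~15 vs ~5 instruction difference comes from:
the hash variant has to compute the hash and walk a bucket list, while
the padding variant only dereferences a fixed offset from ip.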
Link: https://lore.kernel.org/all/20250303132837.498938-1-dongml2@xxxxxxxxxxxxxxx/ [1]
Link: https://lore.kernel.org/bpf/20240311093526.1010158-1-dongmenglong.8@xxxxxxxxxxxxx/ [2]
Link: https://lore.kernel.org/bpf/CAADnVQ+G+mQPJ+O1Oc9+UW=J17CGNC5B=usCmUDxBA-ze+gZGw@xxxxxxxxxxxxxx/ [3]

Menglong Dong (18):
  bpf: add function hash table for tracing-multi
  x86,bpf: add bpf_global_caller for global trampoline
  ftrace: factor out ftrace_direct_update from register_ftrace_direct
  ftrace: add reset_ftrace_direct_ips
  bpf: introduce bpf_gtramp_link
  bpf: tracing: add support to record and check the accessed args
  bpf: refactor the modules_array to ptr_array
  bpf: verifier: add btf to the function args of bpf_check_attach_target
  bpf: verifier: move btf_id_deny to bpf_check_attach_target
  x86,bpf: factor out arch_bpf_get_regs_nr
  bpf: tracing: add multi-link support
  libbpf: don't free btf if tracing_multi progs existing
  libbpf: support tracing_multi
  libbpf: add btf type hash lookup support
  libbpf: add skip_invalid and attach_tracing for tracing_multi
  selftests/bpf: move get_ksyms and get_addrs to trace_helpers.c
  selftests/bpf: add basic testcases for tracing_multi
  selftests/bpf: add bench tests for tracing_multi

 arch/x86/Kconfig                              |    4 +
 arch/x86/net/bpf_jit_comp.c                   |  290 ++++++++++++-
 include/linux/bpf.h                           |   59 +++
 include/linux/bpf_tramp.h                     |   72 ++++
 include/linux/bpf_types.h                     |    1 +
 include/linux/bpf_verifier.h                  |    1 +
 include/linux/btf.h                           |    3 +-
 include/linux/ftrace.h                        |    7 +
 include/linux/kfunc_md.h                      |   91 ++++
 include/uapi/linux/bpf.h                      |   10 +
 kernel/bpf/Makefile                           |    1 +
 kernel/bpf/btf.c                              |  113 ++++-
 kernel/bpf/kfunc_md.c                         |  352 ++++++++++++++++
 kernel/bpf/syscall.c                          |  395 +++++++++++++++++-
 kernel/bpf/trampoline.c                       |  220 +++++++++-
 kernel/bpf/verifier.c                         |  161 ++++---
 kernel/trace/bpf_trace.c                      |   48 +--
 kernel/trace/ftrace.c                         |  183 +++++---
 net/bpf/test_run.c                            |    3 +
 net/core/bpf_sk_storage.c                     |    2 +
 net/sched/bpf_qdisc.c                         |    2 +-
 tools/bpf/bpftool/common.c                    |    3 +
 tools/include/uapi/linux/bpf.h                |   10 +
 tools/lib/bpf/bpf.c                           |   10 +
 tools/lib/bpf/bpf.h                           |    6 +
 tools/lib/bpf/btf.c                           |  102 +++++
 tools/lib/bpf/btf.h                           |    6 +
 tools/lib/bpf/libbpf.c                        |  296 ++++++++++++-
 tools/lib/bpf/libbpf.h                        |   25 ++
 tools/lib/bpf/libbpf.map                      |    5 +
 tools/testing/selftests/bpf/Makefile          |    2 +-
 tools/testing/selftests/bpf/bench.c           |    8 +
 .../selftests/bpf/benchs/bench_trigger.c      |   72 ++++
 .../selftests/bpf/benchs/run_bench_trigger.sh |    1 +
 .../selftests/bpf/prog_tests/fentry_fexit.c   |   22 +-
 .../selftests/bpf/prog_tests/fentry_test.c    |   79 +++-
 .../selftests/bpf/prog_tests/fexit_test.c     |   79 +++-
 .../bpf/prog_tests/kprobe_multi_test.c        |  220 +---------
 .../selftests/bpf/prog_tests/modify_return.c  |   60 +++
 .../bpf/prog_tests/tracing_multi_link.c       |  210 ++++++++++
 .../selftests/bpf/progs/fentry_multi_empty.c  |   13 +
 .../selftests/bpf/progs/tracing_multi_test.c  |  181 ++++++++
 .../selftests/bpf/progs/trigger_bench.c       |   22 +
 .../selftests/bpf/test_kmods/bpf_testmod.c    |   24 ++
 tools/testing/selftests/bpf/test_progs.c      |   50 +++
 tools/testing/selftests/bpf/test_progs.h      |    3 +
 tools/testing/selftests/bpf/trace_helpers.c   |  283 +++++++++++++
 tools/testing/selftests/bpf/trace_helpers.h   |    3 +
 48 files changed, 3349 insertions(+), 464 deletions(-)
 create mode 100644 include/linux/bpf_tramp.h
 create mode 100644 include/linux/kfunc_md.h
 create mode 100644 kernel/bpf/kfunc_md.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/fentry_multi_empty.c
 create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_test.c

-- 
2.39.5