(Thanks to Alexei's advice to implement the bpf global trampoline in C
instead of asm, the performance of tracing-multi has improved
significantly. The function metadata, which is implemented with a hash
table, is also fast enough for our needs.)

For now, a BPF program of type BPF_PROG_TYPE_TRACING cannot be attached
to multiple hooks, so we have to create one BPF program for each kernel
function that we want to trace, even though all the programs have the
same (or similar) logic. This consumes extra memory and makes program
loading slow if we have plenty of kernel functions to trace.

In this series, we add support for attaching a tracing BPF program to
multiple hooks, which is similar to BPF_TRACE_KPROBE_MULTI. Generally
speaking, this series can be divided into the following parts:

1. Add per-function metadata storage support.
2. Add bpf global trampoline support for x86_64.
3. Add bpf global trampoline link support.
4. Add tracing multi-link support.

per-function metadata storage
-----------------------------
The per-function metadata storage is the basis of the bpf global
trampoline. In short, it is a hash table that stores some information
about the kernel functions. The key of this hash table is the kernel
function address, and the following data is stored in the hash value:

* The BPF progs, whose type is FENTRY, FEXIT or MODIFY_RETURN. The struct
  kfunc_md_tramp_prog is introduced to store the BPF prog and the cookie,
  and it links the BPF progs of the same type into a list with the "next"
  field.
* The kernel function address
* The kernel function arguments count
* Whether the origin call is needed

The buckets of the hash table can grow and shrink when necessary. Alexei
advised using rhashtable. However, the compiler is not clever enough and
refused to inline the hash lookup for me, which brings additional
overhead to the BPF global trampoline described below.
Getting the lookup inlined required replacing "inline" with
"__always_inline" in rhashtable_lookup_fast, rhashtable_lookup,
__rhashtable_lookup and rht_key_get_hash. So I just implemented a hash
table myself instead.

bpf global trampoline
---------------------
The bpf global trampoline is similar to the general bpf trampoline. The
bpf trampoline stores the bpf progs and some metadata in the trampoline
instructions directly. The bpf global trampoline, however, stores its
metadata in the function metadata and gets it with kfunc_md_get_rcu().
This makes the bpf global trampoline more flexible, and it can be used
for all the kernel functions.

The bpf global trampoline is designed to implement the tracing multi-link
for FENTRY, FEXIT and MODIFY_RETURN.

The global trampoline is implemented mostly in C. We implement the entry
of the trampoline with a "__naked" function, which saves the regs to an
array on the stack and calls bpf_global_caller_run(). The entry passes
the address of the array and the address of the rip to
bpf_global_caller_run().

The whole idea of implementing the trampoline in C is inspired by Alexei
in [3]. Implementing it in C does have an advantage: some function calls,
such as __bpf_prog_enter_recur, __bpf_prog_exit_recur, __bpf_tramp_enter
and __bpf_tramp_exit, are inlined, which reduces some overhead. The
performance of the global trampoline can be seen below.

bpf global trampoline link
--------------------------
We reuse part of the code in [2] to implement the tracing multi-link. The
struct bpf_gtramp_link is introduced for the bpf global trampoline link.
Similar to the bpf trampoline link, the bpf global trampoline link has
bpf_gtrampoline_link_prog() and bpf_gtrampoline_unlink_prog() to link and
unlink the bpf progs.

The "entries" field in bpf_gtramp_link is an array of struct
bpf_gtramp_link_entry, which contains all the information about the
functions that we trace, such as the address, the number of args, the
cookie and so on.
The bpf global trampoline is much simpler than the bpf trampoline, and we
introduce the new struct bpf_global_trampoline for it. The "image" field
is a pointer to bpf_global_caller_x. We introduce the global trampoline
array, and a kernel function with arguments count "x" is handled by the
global trampoline global_tr_array[x].

We implement the global trampoline based on the direct ftrace, and the
"fops" field is used for this purpose. This means bpf2bpf is not
supported by the tracing multi-link.

When we link the bpf prog, we add it to the kfunc_md of all the target
functions. Then, we get all the function addresses that have bpf progs
with kfunc_md_bpf_ips(), and reset the ftrace filter of the fops to them.
The direct ftrace doesn't support resetting the filter functions yet, so
we introduce reset_ftrace_direct_ips() to do this work.

tracing multi-link
------------------
Most of the code of this part comes from the series [2].

In the 6th patch, we add support for recording the index of the accessed
function args of the target for the tracing program. Meanwhile, we add
the function btf_check_func_part_match() to compare the accessed function
args of two function prototypes. This function will be used in the next
commit.

In the 7th patch, we refactor the struct modules_array to ptr_array, as
we need similar functionality to hold the target btf, target program and
kernel modules that we reference in the following commit.

In the 11th patch, we implement the multi-link support for tracing, and
the following new attach types are added:

  BPF_TRACE_FENTRY_MULTI
  BPF_TRACE_FEXIT_MULTI
  BPF_MODIFY_RETURN_MULTI

We introduce the struct bpf_tracing_multi_link for this purpose, which
can hold all the kernel modules, target bpf program (for attaching to a
bpf program) or target btf (for attaching to kernel functions) that we
reference.

During loading, the first target is used for verification by the
verifier. During attaching, we check the consistency of all the targets
with the first target.

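To make the data flow described above easier to follow, here is a
minimal userspace sketch of the idea: metadata is keyed by function
address in a hash table, each entry carries a "next"-linked list of
progs with their cookies, and a single caller looks the metadata up by
ip and runs the progs. It is an illustration only, not the code of this
series. All names in it (func_md, md_prog, md_get, md_attach,
global_caller_run, record_cookie) are hypothetical, the bucket count is
fixed, and there is no RCU, no resizing and no FEXIT or MODIFY_RETURN
handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MD_BUCKET_BITS	6
#define MD_BUCKETS	(1u << MD_BUCKET_BITS)	/* fixed size in this sketch */

typedef void (*md_prog_fn)(uint64_t cookie, uint64_t *args);

/* Hypothetical stand-in for one attached prog plus its cookie. */
struct md_prog {
	struct md_prog *next;	/* progs of the same type form a list */
	md_prog_fn run;
	uint64_t cookie;
};

/* Hypothetical per-function metadata, keyed by the function address. */
struct func_md {
	struct func_md *next;	/* hash bucket chain */
	uintptr_t ip;		/* function address (the hash key) */
	int nr_args;		/* arguments count of the function */
	struct md_prog *fentry;	/* only one prog type is modelled here */
};

static struct func_md *md_table[MD_BUCKETS];

/* Multiplicative hash of the address; the top bits select the bucket. */
static inline size_t md_hash(uintptr_t ip)
{
	uint64_t h = (uint64_t)(ip >> 4) * 0x9E3779B97F4A7C15ull;

	return (size_t)(h >> (64 - MD_BUCKET_BITS));
}

/* Look up the metadata of a function; NULL if nothing is attached. */
static struct func_md *md_get(uintptr_t ip)
{
	struct func_md *md;

	for (md = md_table[md_hash(ip)]; md; md = md->next)
		if (md->ip == ip)
			return md;
	return NULL;
}

/* Attach a prog to a function, creating the metadata on first use. */
static int md_attach(uintptr_t ip, int nr_args, md_prog_fn run,
		     uint64_t cookie)
{
	struct func_md *md = md_get(ip);
	struct md_prog *prog;
	size_t bucket;

	assert(run != NULL);
	prog = malloc(sizeof(*prog));
	if (!prog)
		return -1;
	if (!md) {
		md = calloc(1, sizeof(*md));
		if (!md) {
			free(prog);
			return -1;
		}
		md->ip = ip;
		md->nr_args = nr_args;
		bucket = md_hash(ip);
		md->next = md_table[bucket];
		md_table[bucket] = md;
	}
	prog->run = run;
	prog->cookie = cookie;
	prog->next = md->fentry;
	md->fentry = prog;
	return 0;
}

/* One shared caller: find the metadata by ip, then run every prog.
 * Returns the number of progs that were run. */
static int global_caller_run(uintptr_t ip, uint64_t *args)
{
	struct func_md *md = md_get(ip);
	struct md_prog *prog;
	int n = 0;

	if (!md)
		return 0;
	for (prog = md->fentry; prog; prog = prog->next, n++)
		prog->run(prog->cookie, args);
	return n;
}

/* Sample prog for experiments: accumulates the cookies it was run with. */
static uint64_t cookie_sum;

static void record_cookie(uint64_t cookie, uint64_t *args)
{
	(void)args;
	cookie_sum += cookie;
}
```

Attaching record_cookie to the same address twice with different
cookies and then calling global_caller_run() for that address runs both
progs, while an address with no metadata runs none. The point of the
sketch is that the caller itself is shared and the per-function state
lives entirely in the table.
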
performance comparison
----------------------
We have implemented the following performance tests in the selftests in
bench_trigger.c:

- trig-fentry-multi
- trig-fentry-multi-all
- trig-fexit-multi
- trig-fmodret-multi

The "fentry-multi-all" test is used to measure the performance of the
function metadata hash table, and all the kernel functions are hooked
during this test.

CPU mitigations were disabled during the tests. They are enabled by
default in the kernel, and can be disabled with the "mitigations=off"
cmdline.

The tests were run with the command:

  ./run_bench_trigger.sh fentry fentry-multi fentry-multi-all fexit \
      fexit-multi fmodret fmodret-multi

The following are the test results, and the unit is "M/s":

fentry  | fm     | fm_all | fexit  | fexit-multi | fmodret | fmodret-multi
103.303 | 94.532 | 98.009 | 55.155 | 55.448      | 58.632  | 56.379
107.564 | 98.007 | 97.857 | 55.278 | 53.997      | 59.485  | 55.855
106.841 | 97.483 | 95.064 | 55.715 | 55.502      | 59.442  | 56.126
109.852 | 97.486 | 93.161 | 56.432 | 55.494      | 59.454  | 56.178
109.791 | 97.973 | 96.728 | 55.729 | 55.363      | 59.445  | 56.228

* fm: fentry-multi, fm_all: fentry-multi-all

The following are the results of running all the bench tests:

  usermode-count  :  746.907 ± 0.323M/s
  kernel-count    :  313.423 ± 0.031M/s
  syscall-count   :   18.179 ± 0.013M/s
  fentry          :  107.149 ± 0.051M/s
  fexit           :   56.565 ± 0.019M/s
  fmodret         :   59.495 ± 0.024M/s
  fentry-multi    :   99.073 ± 0.087M/s
  fentry-multi-all:   97.920 ± 0.095M/s
  fexit-multi     :   55.426 ± 0.045M/s
  fmodret-multi   :   56.589 ± 0.163M/s
  rawtp           :  166.774 ± 0.137M/s
  tp              :   61.947 ± 0.035M/s
  kprobe          :   43.719 ± 0.018M/s
  kprobe-multi    :   47.451 ± 0.087M/s
  kretprobe       :   18.358 ± 0.026M/s
  kretprobe-multi :   24.523 ± 0.016M/s
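
To read the summary above as a relative cost, one can compute how far
each multi variant falls below its single-attach counterpart. The
following is a small sketch of that arithmetic. The helper slowdown_pct
and the struct bench_pair are hypothetical names used only here, and the
inputs are the mean values from the list above with the error margins
ignored.

```c
#include <assert.h>

/* One single-attach result and its multi counterpart, both in M/s. */
struct bench_pair {
	const char *name;
	double base;	/* e.g. fentry */
	double multi;	/* e.g. fentry-multi */
};

/* Mean values copied from the results list above. */
static const struct bench_pair pairs[] = {
	{ "fentry",  107.149, 99.073 },
	{ "fexit",    56.565, 55.426 },
	{ "fmodret",  59.495, 56.589 },
};

/* Percentage by which "multi" throughput is lower than "base".
 * A negative result would mean the multi variant is faster. */
static double slowdown_pct(double base, double multi)
{
	assert(base > 0.0);
	return (1.0 - multi / base) * 100.0;
}
```

With these inputs the helper gives roughly 7.5% for fentry, 2.0% for
fexit and 4.9% for fmodret.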