On Thu, Jun 12, 2025 at 8:58 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jun 11, 2025 at 5:07 PM Menglong Dong <menglong8.dong@xxxxxxxxx> wrote:
> >
> > Hi Alexei, thank you for your explanation, and now I realize the
> > problem is my hash table :/
> >
> > My hash table was modeled on ftrace and fprobe, whose
> > maximum bucket count is 1024.
> >
> > It's interesting to make the hash table O(1) by using rhashtable
> > or sizing up the buckets, as you said. I suspect we don't even
> > need the function padding part if the hash table is random
> > enough.
>
> I suggest starting with rhashtable. It's used in many
> performance critical places, and when rhashtable_params are
> constant the compiler optimizes everything nicely.
> lookup is lockless and only needs RCU, so safe to use
> from fentry_multi.

Hi, Alexei. Sorry to bother you.

I have implemented the hash table with rhashtable and ran the bench
testing with the existing framework.

You said earlier that "fm_single has to be within couple percents of
fentry", and I think that is difficult to reach if we use the hash table
without the function padding mode. The extra overhead of the global
trampoline comes from:

1. An additional hash lookup. rhashtable is O(1), but computing the hash
   key and the extra memory reads still add a slight overhead.
2. An additional function call to kfunc_md_get_noref() in the asm, which
   fetches the function metadata. In function padding mode we can inline
   it in the asm, but that is hard with rhashtable.
3. Extra logic in the global trampoline. For example, we save and restore
   reg1 to reg6 on the stack for the function args even if the function
   we attach to doesn't have any args.
Following is the bench result in rhashtable mode; the performance of
fentry_multi is about 77.7% of fentry:

usermode-count  : 893.357 ± 0.566M/s
kernel-count    : 421.290 ± 0.159M/s
syscall-count   : 21.018 ± 0.165M/s
fentry          : 100.742 ± 0.065M/s
fexit           : 51.283 ± 0.784M/s
fmodret         : 55.410 ± 0.026M/s
fentry-multi    : 78.237 ± 0.117M/s
fentry-multi-all: 80.090 ± 0.049M/s
rawtp           : 161.496 ± 0.197M/s
tp              : 70.021 ± 0.015M/s
kprobe          : 54.693 ± 0.013M/s
kprobe-multi    : 51.481 ± 0.023M/s
kretprobe       : 22.504 ± 0.011M/s
kretprobe-multi : 27.221 ± 0.037M/s

(It's a little odd that fentry-multi-all performs slightly better than
fentry-multi, but I'm sure the bpf prog is attached to all the kernel
functions in the fentry-multi-all testcase.)

The overhead of part 1 can be eliminated by using the function padding
mode, and following is the bench result:

usermode-count  : 895.874 ± 2.472M/s
kernel-count    : 423.882 ± 0.342M/s
syscall-count   : 20.480 ± 0.009M/s
fentry          : 105.191 ± 0.275M/s
fexit           : 52.430 ± 0.050M/s
fmodret         : 56.130 ± 0.062M/s
fentry-multi    : 88.114 ± 0.108M/s
fentry-multi-all: 86.988 ± 0.024M/s
rawtp           : 145.488 ± 0.043M/s
tp              : 73.386 ± 0.095M/s
kprobe          : 55.294 ± 0.046M/s
kprobe-multi    : 50.457 ± 0.075M/s
kretprobe       : 22.414 ± 0.020M/s
kretprobe-multi : 27.205 ± 0.044M/s

The performance of fentry_multi is now 83.7% of fentry.
As a next step, we inline the function metadata lookup in the asm, and the
performance of fentry_multi is 89.7% of fentry, which is close to "be
within couple percents of fentry":

usermode-count  : 886.836 ± 0.300M/s
kernel-count    : 419.962 ± 1.252M/s
syscall-count   : 20.715 ± 0.022M/s
fentry          : 102.783 ± 0.166M/s
fexit           : 52.502 ± 0.014M/s
fmodret         : 55.822 ± 0.038M/s
fentry-multi    : 92.201 ± 0.027M/s
fentry-multi-all: 89.831 ± 0.057M/s
rawtp           : 158.337 ± 4.918M/s
tp              : 72.883 ± 0.041M/s
kprobe          : 54.963 ± 0.013M/s
kprobe-multi    : 50.069 ± 0.079M/s
kretprobe       : 22.260 ± 0.012M/s
kretprobe-multi : 27.211 ± 0.011M/s

For the overhead of part 3, I'm thinking of introducing a dynamic global
trampoline: we create a different global trampoline for functions that
have different features, and the features can be:

* the function argument count
* whether bpf_get_func_ip() is ever called in the bpf progs
* whether any FEXIT or MODIFY_RETURN progs exist
* etc.

Then we can generate a global trampoline for a function with the minimum
number of instructions. By my estimation, the performance of fentry_multi
should be above 95% of fentry with function padding, inlined function
metadata and a dynamic global trampoline.

In fact, I implemented the first version of this series with the dynamic
global trampoline. However, that made the series very complex, so I don't
think it's a good idea to include it in this series.

All in all, according to my testing the performance of fentry_multi can't
be within a couple percent of fentry if we use rhashtable only, and I'm
not sure if I should go ahead :/

BTW, the Kconfig I used in the testing comes from "make tinyconfig", with
some options enabled so that tools/testing/selftests/bpf can be compiled
successfully. I would appreciate it if someone could offer a better,
authoritative Kconfig for the testing :/

Thanks, have a nice weekend!
Menglong Dong