On Wed, Jul 16, 2025 at 6:51 PM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
>
> On Thursday, July 17, 2025 8:59 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > On Wed, Jul 16, 2025 at 6:06 AM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
> > >
> > > On Wednesday, July 16, 2025 12:35 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > On Tue, Jul 15, 2025 at 1:37 AM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
> > > > >
> > > > > On 7/15/25 10:25, Alexei Starovoitov wrote:
> > > [......]
> > > > >
> > > > > According to my benchmark, the *5*-argument variant has ~5% overhead
> > > > > to save/restore when compared with the *0*-argument variant. The
> > > > > save/restore of regs is fast, but it still needs 12 insns, which can
> > > > > produce ~6% overhead.
> > > >
> > > > I think it's an ok trade off, because with one global trampoline
> > > > we do not need to do the rhashtable lookup before entering the bpf prog.
> > > > The bpf prog will do it on demand if/when it needs to access arguments.
> > > > This will compensate for a bit of the lost performance due to the extra
> > > > save/restore.
> > >
> > > I don't understand here :/
> > >
> > > The rhashtable lookup is done at the beginning of the global trampoline,
> > > which is called before we enter the bpf prog. The bpf progs are stored in
> > > the kfunc_md, and we need to get them from the hash table.
> >
> > Ahh. Right.
> >
> > Looking at the existing bpf trampoline... It has complicated logic
> > to handle livepatching and tail calls. Your global trampoline
> > doesn't, and once that is added it's starting to feel that it will
> > look just as complex as the current one.
> > So I think we better repurpose what we have.
> > Maybe we can rewrite the existing one in C too.
>
> You are right, tail calls are not handled yet. But livepatching
> is already handled, as we always get the origin ip from the stack
> and call it, just like how the bpf trampoline handles livepatching.
> So no additional handling is needed here.
>
> > How about the following approach.
> > I think we discussed something like this in the past
> > and Jiri tried to implement something like this.
> > Andrii reminded me recently about it.
> >
> > Say, we need to attach prog A to 30k functions:
> > 10k with 2 args, 10k with 3 args, and 10k with 7 args.
> > We can generate 3 _existing_ bpf trampolines for 2, 3 and 7 args
> > with hard coded prog A in there (the cookies would need to be
> > fetched via binary search, similar to kprobe-multi).
> > arch_prepare_bpf_trampoline() supports BPF_TRAMP_F_ORIG_STACK,
> > so one 2-arg trampoline will work to invoke prog A in all 10k 2-arg functions.
> > We don't need to match types, but we have to check that the btf_func_model-s
> > are the same.
> >
> > Menglong, your global trampoline for 0,1,..6 args works only for x86,
> > because btf_func_model doesn't care about sizes of args there,
> > but it's not the correct mental model to use.
> >
> > The above "10k with 2 args" is a simplified example.
> > We will need an arch specific callback is_btf_func_model_equal()
> > that will compare func models in arch specific ways.
> > For x86-64 the number of args is all it needs.
> > For other archs it will compare sizes and flags too.
> > So the 30k functions will be sorted into
> > 10k with btf_func_model_1, 10k with btf_func_model_2, and so on,
> > and the corresponding number of equivalent trampolines will be generated.
> >
> > Note there will be no actual BTF types. All args will be untyped and
> > untrusted, unlike current fentry.
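
To make the comparator idea concrete, here is a minimal sketch of what the
proposed is_btf_func_model_equal() could look like. The callback does not
exist in-tree; only struct btf_func_model and its fields (nr_args, ret_size,
arg_size[], arg_flags[]) come from include/linux/bpf.h, and the #ifdef merely
stands in for a proper per-arch override:

#include <linux/bpf.h>
#include <linux/string.h>

static bool is_btf_func_model_equal(const struct btf_func_model *a,
				    const struct btf_func_model *b)
{
#if defined(CONFIG_X86_64)
	/*
	 * On x86-64 the first six arguments travel in fixed registers
	 * regardless of their size, so the argument count alone decides
	 * how many registers the trampoline has to save/restore.
	 */
	return a->nr_args == b->nr_args;
#else
	/*
	 * Other archs pass large or composite arguments differently,
	 * so sizes and flags have to match as well.
	 */
	return a->nr_args == b->nr_args &&
	       a->ret_size == b->ret_size &&
	       !memcmp(a->arg_size, b->arg_size, sizeof(a->arg_size)) &&
	       !memcmp(a->arg_flags, b->arg_flags, sizeof(a->arg_flags));
#endif
}
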
> > We can go further and sort the 30k functions by comparing BTFs
> > instead of btf_func_model-s, but I suspect 30k funcs will be split
> > into several thousands of exact BTFs. At that point multi-fentry
> > benefits are diminishing and we might as well generate 30k unique
> > bpf trampolines for 30k functions and avoid all the complexity.
> > So I would sort by btf_func_model compared by an arch specific comparator.
> >
> > Now say prog B needs to be attached to another 30k functions.
> > If all 30k+30k functions are different then it's the same as
> > the previous step.
> > Say, prog A is attached to 10k funcs with btf_func_model_1.
> > If prog B wants to attach to the exact same func set then we
> > just regenerate the bpf trampoline with hard coded progs A and B
> > and reattach.
> > If not then we need to split the set into up to 3 sets.
> > Say, prog B wants 5k funcs, but only 1k funcs are common:
> > (prog_A, 9k funcs with btf_func_model_1) -> bpf trampoline X
> > (prog_A, prog_B, 1k funcs with btf_func_model_1) -> bpf trampoline Y
> > (prog_B, 4k funcs with btf_func_model_1) -> bpf trampoline Z
> >
> > And so on when prog C needs to be attached.
> > At detach time we can merge sets/trampolines,
> > but for now we can leave it all fragmented.
> > Unlike regular fentry progs, the multi-fentry progs are not going to
> > be attached for a long time, so we can reduce the detach complexity.
> >
> > The nice part of the algorithm is that coexistence of fentry
> > and multi-fentry is easy.
> > If fentry is already attached to some function we just
> > attach the multi-fentry prog to that bpf trampoline.
> > If multi-fentry was attached first and fentry needs to be attached,
> > we create a regular bpf trampoline and add both progs there.
>
> This doesn't seem easy, and it is exactly how I handle the
> coexistence now:
>
> https://lore.kernel.org/bpf/20250528034712.138701-16-dongml2@xxxxxxxxxxxxxxx/
> https://lore.kernel.org/bpf/20250528034712.138701-17-dongml2@xxxxxxxxxxxxxxx/
> https://lore.kernel.org/bpf/20250528034712.138701-18-dongml2@xxxxxxxxxxxxxxx/

hmm. exactly? That's very different.
You're relying on kfunc_md for the prog list.
The above proposal doesn't need kfunc_md in the critical path.
All progs are built into the trampolines.

> The most difficult part is that we need a way to replace the
> multi-fentry with fentry for the function in ftrace atomically. Of
> course, we can remove the global trampoline first, and then attach
> the bpf trampoline, which will make things much easier. But a
> short suspend will happen for the progs in fentry-multi.

I don't follow. In the above proposal fentry attach/detach is atomic.
Prepare a new trampoline, then a single call to ftrace to modify_fentry().

> >
> > The intersect and sorting by btf_func_model is not trivial,
> > but we can hold the global trampoline_mutex, so there are no concerns
> > about races.
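
As a rough illustration only, here is one way the intersect-and-split step
described above could look. None of this is from the patch set: struct
func_set and the helpers intersect(), subtract(), add_progs(),
regen_trampoline() and free_set() are made-up placeholders, and the
per-function ftrace direct-call updates are elided. The quoted example that
follows walks through the same steps.

/* Hypothetical bookkeeping: one set per (btf_func_model, prog list). */
struct func_set {
	struct list_head node;
	const struct btf_func_model *model; /* shared by every func in the set */
	unsigned long *addrs;               /* sorted function addresses        */
	u32 nr_addrs;
	struct bpf_prog **progs;            /* progs hard coded into the image  */
	u32 nr_progs;
	void *image;                        /* from arch_prepare_bpf_trampoline */
};

/* Attach @prog to the functions in @req (all sharing req->model).
 * Caller holds the global trampoline_mutex. */
static int attach_split(struct list_head *sets, struct bpf_prog *prog,
			struct func_set *req)
{
	struct func_set *s, *tmp, *common;

	list_for_each_entry_safe(s, tmp, sets, node) {
		if (!is_btf_func_model_equal(s->model, req->model))
			continue;

		common = intersect(s, req);   /* funcs traced by both sets */
		if (!common->nr_addrs)
			continue;

		subtract(s, common);          /* s keeps its old prog list, so
					       * its old image still works    */
		subtract(req, common);

		add_progs(common, s->progs, s->nr_progs);
		add_progs(common, &prog, 1);
		regen_trampoline(common);     /* prog list changed -> new image */
		list_add(&common->node, sets);

		if (!s->nr_addrs) {           /* every func moved to common */
			list_del(&s->node);
			free_set(s);
		}
	}

	if (req->nr_addrs) {                  /* funcs nobody was tracing yet */
		add_progs(req, &prog, 1);
		regen_trampoline(req);
		list_add(&req->node, sets);
	}
	return 0;
}
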
> > Example:
> > bpf_link_A is a set of:
> > (prog_A, funcs X,Y with btf_func_model_1)
> > (prog_A, funcs N,M with btf_func_model_2)
> >
> > To attach prog B via bpf_link_B that wants:
> > (prog_B, funcs Y,Z with btf_func_model_1)
> > (prog_B, funcs P,Q with btf_func_model_3)
> >
> > walk all existing links, intersect and split, and update the links.
> > At the end:
> >
> > bpf_link_A:
> > (prog_A, funcs X with btf_func_model_1)
> > (prog_A, prog_B, funcs Y with btf_func_model_1)
> > (prog_A, funcs N,M with btf_func_model_2)
> >
> > bpf_link_B:
> > (prog_A, prog_B, funcs Y with btf_func_model_1)
> > (prog_B, funcs Z with btf_func_model_1)
> > (prog_B, funcs P,Q with btf_func_model_3)
> >
> > When a link is detached: walk its own tuples, remove the prog,
> > if nr_progs == 0 -> detach the corresponding trampoline,
> > if nr_progs > 0 -> remove the prog and regenerate the trampoline.
> >
> > If fentry prog C needs to be attached to N, it might split bpf_link_A:
> > (prog_A, funcs X with btf_func_model_1)
> > (prog_A, prog_B, funcs Y with btf_func_model_1)
> > (prog_A, funcs M with btf_func_model_2)
> > (prog_A, prog_C, funcs N with _fentry_)
> >
> > Last time we gave up on it because we discovered that
> > overlap support was too complicated, but I cannot recall now
> > what it was :)
> > Maybe all of the above is repeating some old mistakes.
>
> In my impression, this is exactly Jiri's solution, and this is
> the part of the discussion that I know of:
>
> https://lore.kernel.org/bpf/ZfKY6E8xhSgzYL1I@krava/

Yes. It's similar, but somehow it feels simple enough now.
The algorithms for both detach and attach fit on one page,
and everything is uniform. There is no spaghetti of corner cases.
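
For completeness, a matching sketch of the detach path quoted above, reusing
the same made-up struct func_set and helper names as the attach sketch (plus
the placeholders remove_prog() and detach_trampoline()); again just an
illustration, not code from the series:

/* Detach @prog from every (progs, funcs, model) tuple owned by its link.
 * Caller holds the global trampoline_mutex. */
static void link_detach(struct list_head *link_sets, struct bpf_prog *prog)
{
	struct func_set *s, *tmp;

	list_for_each_entry_safe(s, tmp, link_sets, node) {
		remove_prog(s, prog);
		if (!s->nr_progs) {
			/* nothing left to call: unhook all funcs in the set */
			detach_trampoline(s);
			list_del(&s->node);
			free_set(s);
		} else {
			/* remaining progs keep running: swap in a fresh image */
			regen_trampoline(s);
		}
	}
	/* Merging now-identical sets back together is deliberately skipped;
	 * as noted above, multi-fentry attachments tend to be short lived. */
}
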