On Wed, Jul 16, 2025 at 6:51 PM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
>
> On Thursday, July 17, 2025 8:59 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > On Wed, Jul 16, 2025 at 6:06 AM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
> > >
> > > On Wednesday, July 16, 2025 12:35 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > On Tue, Jul 15, 2025 at 1:37 AM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
> > > > >
> > > > > On 7/15/25 10:25, Alexei Starovoitov wrote:
> > > [......]
> > > > >
> > > > > According to my benchmark, the *5*-argument variant has ~5% overhead
> > > > > to save/restore when compared with the *0*-argument variant. The
> > > > > save/restore of regs is fast, but it still needs 12 insns, which can
> > > > > produce ~6% overhead.
> > > >
> > > > I think it's an ok trade off, because with one global trampoline
> > > > we do not need to do the rhashtable lookup before entering the bpf prog.
> > > > The bpf prog will do it on demand if/when it needs to access arguments.
> > > > This will compensate for a bit of the lost performance due to the extra
> > > > save/restore.
> > >
> > > I don't understand here :/
> > >
> > > The rhashtable lookup is done at the beginning of the global trampoline,
> > > which is called before we enter the bpf prog. The bpf progs are stored in
> > > the kfunc_md, and we need to get them from the hash table.
> >
> > Ahh. Right.
> >
> > Looking at the existing bpf trampoline... It has complicated logic
> > to handle livepatching and tail calls. Your global trampoline
> > doesn't, and once that is added it's starting to feel that it will
> > look just as complex as the current one.
> > So I think we better repurpose what we have.
> > Maybe we can rewrite the existing one in C too.
>
> You are right, tail calls are not handled yet. But livepatching
> is already handled, as we always get the origin ip from the stack
> and call it, just like how the bpf trampoline handles livepatching.
> So no additional handling is needed here.
>
> > How about the following approach.
> > I think we discussed something like this in the past
> > and Jiri tried to implement something like this.
> > Andrii reminded me recently about it.
> >
> > Say, we need to attach prog A to 30k functions:
> > 10k with 2 args, 10k with 3 args, and 10k with 7 args.
> > We can generate 3 _existing_ bpf trampolines for 2, 3 and 7 args
> > with hard coded prog A in there (the cookies would need to be
> > fetched via binary search, similar to kprobe-multi).
> > arch_prepare_bpf_trampoline() supports BPF_TRAMP_F_ORIG_STACK,
> > so one 2-arg trampoline will work to invoke prog A in all 10k 2-arg functions.
> > We don't need to match types, but we have to check that the btf_func_model-s
> > are the same.
> >
> > Menglong, your global trampoline for 0,1,..6 args works only for x86,
> > because btf_func_model doesn't care about sizes of args there,
> > but it's not the correct mental model to use.
> >
> > The above "10k with 2 args" is a simplified example.
> > We will need an arch specific callback is_btf_func_model_equal()
> > that will compare func models in arch specific ways.
> > For x86-64 the number of args is all it needs.
> > For other archs it will compare sizes and flags too.
> > So the 30k functions will be sorted into
> > 10k with btf_func_model_1, 10k with btf_func_model_2, and so on,
> > and the corresponding number of equivalent trampolines will be generated.
> >
> > Note there will be no actual BTF types. All args will be untyped and
> > untrusted, unlike current fentry.
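
To make the comparator idea concrete, here is a minimal sketch of what the
proposed is_btf_func_model_equal() could look like. The callback does not
exist in-tree; only struct btf_func_model and its fields (nr_args, ret_size,
arg_size[], arg_flags[]) come from include/linux/bpf.h, and the #ifdef merely
stands in for a proper per-arch override:

#include <linux/bpf.h>
#include <linux/string.h>

static bool is_btf_func_model_equal(const struct btf_func_model *a,
				    const struct btf_func_model *b)
{
#if defined(CONFIG_X86_64)
	/*
	 * On x86-64 the first six arguments travel in fixed registers
	 * regardless of their size, so the argument count alone decides
	 * how many registers the trampoline has to save/restore.
	 */
	return a->nr_args == b->nr_args;
#else
	/*
	 * Other archs pass large or composite arguments differently,
	 * so sizes and flags have to match as well.
	 */
	return a->nr_args == b->nr_args &&
	       a->ret_size == b->ret_size &&
	       !memcmp(a->arg_size, b->arg_size, sizeof(a->arg_size)) &&
	       !memcmp(a->arg_flags, b->arg_flags, sizeof(a->arg_flags));
#endif
}
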
> > We can go further and sort the 30k functions by comparing BTFs
> > instead of btf_func_model-s, but I suspect 30k funcs will be split
> > into several thousands of exact BTFs. At that point multi-fentry
> > benefits are diminishing and we might as well generate 30k unique
> > bpf trampolines for 30k functions and avoid all the complexity.
> > So I would sort by btf_func_model compared by an arch specific comparator.
> >
> > Now say prog B needs to be attached to another 30k functions.
> > If all 30k+30k functions are different then it's the same as
> > the previous step.
> > Say, prog A is attached to 10k funcs with btf_func_model_1.
> > If prog B wants to attach to the exact same func set then we
> > just regenerate the bpf trampoline with hard coded progs A and B
> > and reattach.
> > If not then we need to split the set into up to 3 sets.
> > Say, prog B wants 5k funcs, but only 1k funcs are common:
> > (prog_A, 9k funcs with btf_func_model_1) -> bpf trampoline X
> > (prog_A, prog_B, 1k funcs with btf_func_model_1) -> bpf trampoline Y
> > (prog_B, 4k funcs with btf_func_model_1) -> bpf trampoline Z
> >
> > And so on when prog C needs to be attached.
> > At detach time we can merge sets/trampolines,
> > but for now we can leave it all fragmented.
> > Unlike regular fentry progs, the multi-fentry progs are not going to
> > be attached for a long time, so we can reduce the detach complexity.
> >
> > The nice part of the algorithm is that coexistence of fentry
> > and multi-fentry is easy.
> > If fentry is already attached to some function we just
> > attach the multi-fentry prog to that bpf trampoline.
> > If multi-fentry was attached first and fentry needs to be attached,
> > we create a regular bpf trampoline and add both progs there.
>
> This doesn't seem easy, and it is exactly how I handle the
> coexistence now:
>
> https://lore.kernel.org/bpf/20250528034712.138701-16-dongml2@xxxxxxxxxxxxxxx/
> https://lore.kernel.org/bpf/20250528034712.138701-17-dongml2@xxxxxxxxxxxxxxx/
> https://lore.kernel.org/bpf/20250528034712.138701-18-dongml2@xxxxxxxxxxxxxxx/

hmm. exactly? That's very different.
You're relying on kfunc_md for the prog list.
The above proposal doesn't need kfunc_md in the critical path.
All progs are built into the trampolines.

> The most difficult part is that we need a way to replace the
> multi-fentry with fentry for the function in ftrace atomically. Of
> course, we can remove the global trampoline first, and then attach
> the bpf trampoline, which will make things much easier. But a
> short suspend will happen for the progs in fentry-multi.

I don't follow. In the above proposal fentry attach/detach is atomic.
Prepare a new trampoline, then a single call to ftrace to modify_fentry().

> >
> > The intersect and sorting by btf_func_model is not trivial,
> > but we can hold the global trampoline_mutex, so there are no concerns
> > about races.
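
As a rough illustration only, here is one way the intersect-and-split step
described above could look. None of this is from the patch set: struct
func_set and the helpers intersect(), subtract(), add_progs(),
regen_trampoline() and free_set() are made-up placeholders, and the
per-function ftrace direct-call updates are elided. The quoted example that
follows walks through the same steps.

/* Hypothetical bookkeeping: one set per (btf_func_model, prog list). */
struct func_set {
	struct list_head node;
	const struct btf_func_model *model; /* shared by every func in the set */
	unsigned long *addrs;               /* sorted function addresses        */
	u32 nr_addrs;
	struct bpf_prog **progs;            /* progs hard coded into the image  */
	u32 nr_progs;
	void *image;                        /* from arch_prepare_bpf_trampoline */
};

/* Attach @prog to the functions in @req (all sharing req->model).
 * Caller holds the global trampoline_mutex. */
static int attach_split(struct list_head *sets, struct bpf_prog *prog,
			struct func_set *req)
{
	struct func_set *s, *tmp, *common;

	list_for_each_entry_safe(s, tmp, sets, node) {
		if (!is_btf_func_model_equal(s->model, req->model))
			continue;

		common = intersect(s, req);   /* funcs traced by both sets */
		if (!common->nr_addrs)
			continue;

		subtract(s, common);          /* s keeps its old prog list, so
					       * its old image still works    */
		subtract(req, common);

		add_progs(common, s->progs, s->nr_progs);
		add_progs(common, &prog, 1);
		regen_trampoline(common);     /* prog list changed -> new image */
		list_add(&common->node, sets);

		if (!s->nr_addrs) {           /* every func moved to common */
			list_del(&s->node);
			free_set(s);
		}
	}

	if (req->nr_addrs) {                  /* funcs nobody was tracing yet */
		add_progs(req, &prog, 1);
		regen_trampoline(req);
		list_add(&req->node, sets);
	}
	return 0;
}
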
> > Example:
> > bpf_link_A is a set of:
> > (prog_A, funcs X,Y with btf_func_model_1)
> > (prog_A, funcs N,M with btf_func_model_2)
> >
> > To attach prog B via bpf_link_B that wants:
> > (prog_B, funcs Y,Z with btf_func_model_1)
> > (prog_B, funcs P,Q with btf_func_model_3)
> >
> > walk all existing links, intersect and split, and update the links.
> > At the end:
> >
> > bpf_link_A:
> > (prog_A, funcs X with btf_func_model_1)
> > (prog_A, prog_B, funcs Y with btf_func_model_1)
> > (prog_A, funcs N,M with btf_func_model_2)
> >
> > bpf_link_B:
> > (prog_A, prog_B, funcs Y with btf_func_model_1)
> > (prog_B, funcs Z with btf_func_model_1)
> > (prog_B, funcs P,Q with btf_func_model_3)
> >
> > When a link is detached: walk its own tuples, remove the prog,
> > if nr_progs == 0 -> detach the corresponding trampoline,
> > if nr_progs > 0 -> remove the prog and regenerate the trampoline.
> >
> > If fentry prog C needs to be attached to N, it might split bpf_link_A:
> > (prog_A, funcs X with btf_func_model_1)
> > (prog_A, prog_B, funcs Y with btf_func_model_1)
> > (prog_A, funcs M with btf_func_model_2)
> > (prog_A, prog_C, funcs N with _fentry_)
> >
> > Last time we gave up on it because we discovered that
> > overlap support was too complicated, but I cannot recall now
> > what it was :)
> > Maybe all of the above is repeating some old mistakes.
>
> In my impression, this is exactly Jiri's solution, and this is
> the part of the discussion that I know of:
>
> https://lore.kernel.org/bpf/ZfKY6E8xhSgzYL1I@krava/

Yes. It's similar, but somehow it feels simple enough now.
The algorithms for both detach and attach fit on one page,
and everything is uniform. There is no spaghetti of corner cases.
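
For completeness, a matching sketch of the detach path quoted above, reusing
the same made-up struct func_set and helper names as the attach sketch (plus
the placeholders remove_prog() and detach_trampoline()); again just an
illustration, not code from the series:

/* Detach @prog from every (progs, funcs, model) tuple owned by its link.
 * Caller holds the global trampoline_mutex. */
static void link_detach(struct list_head *link_sets, struct bpf_prog *prog)
{
	struct func_set *s, *tmp;

	list_for_each_entry_safe(s, tmp, link_sets, node) {
		remove_prog(s, prog);
		if (!s->nr_progs) {
			/* nothing left to call: unhook all funcs in the set */
			detach_trampoline(s);
			list_del(&s->node);
			free_set(s);
		} else {
			/* remaining progs keep running: swap in a fresh image */
			regen_trampoline(s);
		}
	}
	/* Merging now-identical sets back together is deliberately skipped;
	 * as noted above, multi-fentry attachments tend to be short lived. */
}
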