Re: Linking BTF

Nick Alcock <nick.alcock@xxxxxxxxxx> · Thu, 17 Jul 2025 12:52:26 +0100

On 17 Jul 2025, Jose E. Marchesi outgrape:

>
>> On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:
>>
>> [...]
>>
>>>  - So... a third option, which is probably the most BTFish because it's
>>>    something BTF already does, in a sense: put everything in one section,
>>>    call it .BTF or .BTFA or whatever, and make that section an archive of
>>>    named BTF members, and then stuff however many BTF outputs the
>>>    deduplication generates (or none, if we're just stuffing inputs into
>>>    outputs without dedupping) into archive members.
>>> 
>>> So, here's a possibility which seems to provide the latter option while
>>> still letting existing tools read the first member (likely vmlinux):
>>> 
>>> The idea is that we add a *next member link field* in the BTF header, and a
>>> name (a strtab offset).  The next member link field is an end-of-header-
>>> relative offset just like most of the other header fields, which chains BTF
>>> members together in a linked list:
>>> 
>>> parent     BTF
>>>             |
>>>             v
>>> children   BTF -> BTF -> BTF -> ... -> BTF
>>> 
>>> The parent is always first in the list.
>>
>> Hi Nick,
>>
>> You are talking about BTF section embedded in a final vmlinux binary, right?
>
> More generally, a section embedded in any object which is the result of
> linking two or more objects having .BTF sections:
>
>   ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF)
>
> This covers the particular vmlinux case I think.

Yes, though I wasn't expecting to see this in vmlinux yet! It might
happen in the end. What this is used for is *communicating with pahole*:
the .btfa file pahole receives is one of these, containing deduplicated
BTF for the entire kernel plus all modules, and it's then up to pahole
what to do with it.

In userspace links (and in intermediate links of multifile kernel
modules, used only as input to the btfarchive deduplicator), we do see
this sort of thing heavily.
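
To make the chained layout quoted above a bit more concrete, here is a rough sketch of how a reader might walk such a section. The first eight fields mirror the existing struct btf_header; next_off and name_off are hypothetical names for the two proposed fields, and "next_off == 0 means last member" plus host-endian data are assumptions of mine, not settled design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F

/* First eight fields: as in the existing struct btf_header.
   Last two fields: HYPOTHETICAL, names and encoding are assumptions. */
struct btf_member_header {
    uint16_t magic;
    uint8_t  version;
    uint8_t  flags;
    uint32_t hdr_len;   /* size of this header in bytes */
    uint32_t type_off;  /* offsets are relative to the end of the header */
    uint32_t type_len;
    uint32_t str_off;
    uint32_t str_len;
    uint32_t next_off;  /* hypothetical: next member, 0 = last (assumed) */
    uint32_t name_off;  /* hypothetical: member name in this member's strtab */
};

/* Count the members chained together in a section of `size` bytes.
   Returns -1 if a header is malformed or a link leaves the section. */
static int btf_count_members(const unsigned char *sec, size_t size)
{
    size_t pos = 0;
    int n = 0;

    assert(sec != NULL);
    for (;;) {
        struct btf_member_header h;

        if (size - pos < sizeof h)
            return -1;
        memcpy(&h, sec + pos, sizeof h);   /* avoids unaligned access */
        if (h.magic != BTF_MAGIC || h.hdr_len < sizeof h
            || h.hdr_len > size - pos)
            return -1;
        n++;
        if (h.next_off == 0)
            return n;                      /* end of the chain */
        pos += h.hdr_len;                  /* now at end of this header */
        if (h.next_off > size - pos)
            return -1;
        pos += h.next_off;                 /* end-of-header-relative link */
    }
}
```

An existing consumer that knows nothing about next_off would simply parse the first header and stop, which is the backward-compatibility property the proposal is after.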

>> Could you please elaborate a bit on why do you need multiple members
>> within this section (in the context of your third option)?
>> I re-read the email but don't get it :(
>
> As I understand it:
>
> The linker deduplicates types in the set of input .BTF sections.  This
> means that when linking foo.o and bar.o, if both compilation units refer
> to a type 'quux', there are two possibilities:
>
> a) The type 'quux' is the same (using C type equivalence rules) in both
>    compilation units.  Then the type is "shared" and the linker puts it
>    only once in the first output BTF member in baz.o .BTF, the "parent".
>
> b) The type 'quux' is different in both compilation units.  These are
>    then conflicting types.  Then two versions of the type, foo.quux and
>    bar.quux, are placed by the linker in the corresponding "children"
>    member in baz.o.

Yes. (We don't really quite use C type equivalence rules -- we're
pickier, since types can be assignment-compatible but still different,
and we want to preserve that difference. But that's nitpicking.)

This happens really quite a lot in the kernel (I was surprised how
often). It happens even more in userspace, sometimes to an almost
pathological degree (hello, Ghostscript). LTO may make its prevalence
lower in the future, but I doubt this sort of thing will ever go away:
it's still with us in C++ programs, and there it's outright undefined
behaviour!

> Graphically, the .BTF section in a linked binary would contain a
> one-level tree of members, with as many children as input compilation
> units :
>
>     parent (common types)
>       |
>       +---  child1 (types only in child1)
>       +---  child2 (types only in child2)
>       .
>       +---  childN (types only in childN)
>
> Hope this makes sense.  Nick should be able to explain it better than I
> do.

There are really two cases, because the purpose of "being a child" is
sort of overloaded. The kernel is, as ever, different...

- Kernel-style builds (the traditional BTF case):

  vmlinux (parent) (common types, any types shared by more than one module)
       +---  child1.ko (types only in child1)
       +---  child2.ko (types only in child2)
       .
       +---  childN.ko (types only in childN)

  Notably, if a type differs (conflicts) across translation units, and
  all those translation units are in the core kernel, we can't put them
  in children because none of them are in modules, and children are
  reserved for modules: so we actually emit them as "hidden types" (a
  concept BTF doesn't have and that I am not currently proposing, which
  lets us say "this type is not visible in any namespaces, here's the
  name of the translation unit it was found in"). The same applies if a
  type differs within one module.

  If a type has conflicting definitions in two distinct modules, we can
  indeed just emit them into each module in turn. Also, if a type has
  one definition in a lot of modules and then a different one in one or
  two, we realise that the first definition is "most popular" and emit
  it into the parent, then emit the conflicting one into the few
  per-module children it is found in.

  Types that are used only by one module are placed in that per-module
  child, both because that's what pahole has always done and because it
  makes sense for a loosely-coupled project like the kernel not to
  clutter vmlinux up with thousands of types for huge modules like
  amdgpu that might never even be loaded.

  I am not expecting pahole to preserve hidden types, at least not yet
  (BTF has no way to encode them and no consumer understands them), but
  it can see them on its input, so it might use hiddenness as a flag
  that "hey, this type is conflicting, take care with everything with
  the same name" or something. The concept is not useless even if pahole
  largely ignores it: it does at least preserve the type graph and
  ensure that any type that refers to a conflicting type still refers to
  it after deduplication: it doesn't end up pointing at some other type
  with the same name.

  e.g. if we have these two TUs in the core kernel:

  a.c:struct foo { int a; };
      struct bar { struct foo baz; };

  b.c:struct foo { long a; }; /* Different! */
      struct bar { struct foo baz; };

  one struct foo (the least-referenced one) will wind up hidden, but the
  struct bar in that same TU will *still point at the hidden type*. Both
  types are *still there* and we don't end up pointing at the same
  struct foo from both struct bars.

- For normal ELF links outside the kernel, the model above doesn't
  really make sense. Most programs don't have a concept like kernel
  modules, and most programs are more tightly coupled, so you want to
  see as many types as possible. So for those, the distribution is like
  this:

  parent (all types that are not conflicting)
       +---  child1.c (conflicting types defined in child1.c)
       +---  child2.c (conflicting types defined in child2.c)
       .
       +---  childN.c (conflicting types defined in childN.c)

  i.e., conflicting types are placed into children that are named after
  the translation units they come from. Within those children, there
  are no hidden types and there is no possibility of conflict; the
  shared parent corresponds to "all TUs together" and there can be no
  conflicts there either.

  In many ways this is a simpler model, but it just won't cut it for the
  kernel.
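
The hidden-type example from the kernel case (a.c and b.c) can be modelled roughly as below. This is only an illustration of the idea, not how libctf or pahole store anything: struct type_rec, example_types and lookup_by_name are made-up names, and marking b.c's struct bar as hidden too is my assumption (it conflicts by virtue of its differing member).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative type record: all names here are hypothetical. */
struct type_rec {
    const char *name;   /* struct tag */
    const char *tu;     /* translation unit the definition came from */
    int member;         /* index of the member's struct type, -1 if none */
    bool hidden;        /* true: not visible in any namespace */
};

static const struct type_rec example_types[] = {
    { "foo", "a.c", -1, false },  /* struct foo { int a; };  */
    { "bar", "a.c",  0, false },  /* member refers to a.c's foo */
    { "foo", "b.c", -1, true  },  /* struct foo { long a; }; hidden */
    { "bar", "b.c",  2, true  },  /* still refers to b.c's foo */
};

/* Name lookup skips hidden types; returns an index or -1. */
static int lookup_by_name(const struct type_rec *t, size_t n, const char *name)
{
    assert(t != NULL && name != NULL);
    for (size_t i = 0; i < n; i++)
        if (!t[i].hidden && strcmp(t[i].name, name) == 0)
            return (int)i;
    return -1;
}
```

Looking up "foo" by name only ever finds the a.c definition, but following the member reference from b.c's struct bar still lands on b.c's struct foo, so the type graph survives deduplication intact.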

We could in the end combine the two schemes, producing a multilevel
tree, so that each module, and the core kernel, could contain an archive
like userspace links do, with each conflicting type hived off into its
own translation unit. This is *definitely* more work, and would probably
require consumer changes too. I am not proposing it, at least not yet.
But it shows where we could end up:

  vmlinux (parent) (common types, any types shared by more than one module)
    +--- core1a.c (conflicting types defined in core1a.c)...
    ...
       +---  child1.ko (types found only in child1)
         +-- child1a.c (conflicting types defined in child1a.c)
         +-- child1b.c (conflicting types defined in child1b.c)
       +---  child2.ko (types only in child2)
       .
       +---  childN.ko (types only in childN)

The distinction between the two link types above is largely controlled
via this linker option in GNU ld:

  --ctf-share-types=<method>  How to share CTF types between translation units.
                                <method> is: share-unconflicted (default),
                                             share-duplicated

The final stage of kernel deduplication (the btfarchive tool) uses
share-duplicated mode (and extra stuff to smush multiple translation
units together into modules).

(that's from current upstream master: obviously I'll have to find some
way to say --ctf-or-btf without making it too verbose :) maybe I could
just add a --btf-share-types as a synonym?)
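
For what it's worth, the placement rules described above for the two link types boil down to something like the following. This is a simplified sketch of the prose in this mail, not the actual deduplicator: the struct, its fields and place_type are hypothetical, and the HIDDEN result really belongs to the extra module-smushing step rather than to the linker option as such.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum share_mode { SHARE_UNCONFLICTED, SHARE_DUPLICATED };
enum placement  { IN_PARENT, IN_CHILD, HIDDEN };

/* One definition of a type, as seen by the deduplicator (hypothetical). */
struct type_def {
    bool conflicting;        /* another definition shares this name */
    bool most_popular;       /* the most widely used of the conflicting defs */
    bool clash_in_one_unit;  /* clash lies within the core kernel or within
                                a single module */
    bool in_core;            /* defined in the core kernel */
    int  users;              /* modules (or TUs) containing this definition */
};

static enum placement place_type(enum share_mode mode,
                                 const struct type_def *t)
{
    assert(t != NULL);

    /* Userspace-style: only conflicting types leave the parent. */
    if (mode == SHARE_UNCONFLICTED)
        return t->conflicting ? IN_CHILD : IN_PARENT;

    /* Kernel-style: the losing side of a conflict goes to a per-module
       child, or is hidden when no distinct child is available for it. */
    if (t->conflicting && !t->most_popular)
        return t->clash_in_one_unit ? HIDDEN : IN_CHILD;

    /* Everything else: shared or core types go to the parent,
       single-module types stay in that module's child. */
    return (t->in_core || t->users > 1) ? IN_PARENT : IN_CHILD;
}
```

So a type used only by amdgpu ends up IN_PARENT in a share-unconflicted link but IN_CHILD in a share-duplicated one, which is the whole difference between the two trees drawn earlier.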