Linking BTF

Nick Alcock <nick.alcock@xxxxxxxxxx> · Wed, 16 Jul 2025 16:15:25 +0100

So I'm working on a scheme where the compiler generates BTF for object files
it generates, for later consumption by pahole after deduplication. (And I
have a proof of concept[1], already posted[2]). But doing this sort of thing
rather than post-facto generation from DWARF raises one obvious question: if
we're emitting BTF into multiple object files, what do we do with it when we
link them together?

There seem to me to be three options: an ugly but simple one, a maximally
ELFish one and a third option that is a sort of middle route. Based on some
investigations we did, I think I know which one makes the most sense, but I
could well be wrong: please critique.

 - The simple route is just to let the linker do its thing like it does for
   every other section it doesn't know about, and blindly concatenate the
   contents. I hope it's obvious why this is less than ideal: the result
   would soon get enormous, and every BTF-reading tool would need adjusting
   to allow for the fact that rather than one pile of BTF in a .BTF section,
   you might have many concatenated together, with no way to tell which
   input file each one referred to! You can't even save much space by
   compressing the result compared to compressing the inputs' .BTF
   individually (I tried it, the individual input files' chunks are usually
   too far apart to be caught by the same compression dictionary).

 - The maximally ELFish one is to put each .BTF into its own per-input-file
   section, named after the input file. Since this requires at least a bit
   of special linker handling anyway you can make it do more, and
   deduplicate at the same time, turning the result into split BTF and
   putting shared things into a .BTF section.

   This... feels like it would be really nice, and I tried to implement it,
   but at least in GNU ld it falls foul of a deeply embedded architectural
   limitation, and as far as I can see from a brief look at lld, lld shares
   it. You don't get an output section to deduplicate anything into until
   after the linker has figured out what output sections exist, and it only
   does this in one place: you can't go back and add more after the
   fact. This means that we could only deduplicate and add sections freely
   before the output sections are laid out -- when we have nowhere to put
   the results, and honestly we might not even have acquired all the input
   sections at this point, since that is partly interleaved with output
   section layout. Essentially, it is incredibly hard to have output
   sections whose names depend on the contents of more than one input
   section, and that's what any plausible deduplicator is going to want to
   do.

   So this is really only useful if we're doing what ELF usually does, which
   is to copy input sections into output sections without modification, or
   with at most small changes (like relocs) that don't change sizes.

   I note that DWARF emission is special-cased in ld in part for this
   reason, and even *that* only emits a fixed set of sections rather than a
   potentially unbounded one. We should probably try to emit a fixed set
   too.

 - So... a third option, which is probably the most BTFish because it's
   something BTF already does, in a sense: put everything in one section,
   call it .BTF or .BTFA or whatever, and make that section an archive of
   named BTF members, and then stuff however many BTF outputs the
   deduplication generates (or none, if we're just stuffing inputs into
   outputs without dedupping) into archive members.

So, here's a possibility which seems to provide the third option while
still letting existing tools read the first member (likely vmlinux):

The idea is that we add two fields to the BTF header: a *next member link
field* and a name (a strtab offset).  The next member link field is an
end-of-header-relative offset, just like most of the other header fields,
and it chains BTF members together in a linked list:

parent     BTF
            |
            v
children   BTF -> BTF -> BTF -> ... -> BTF

The parent is always first in the list.
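
To make the chaining concrete, here is a minimal Python sketch of one way
such an archive could be laid out and walked. Only the first eight header
fields come from the existing struct btf_header. Everything else is an
assumption made for illustration: the field names next_off and name_off,
their placement at the end of the header, next_off == 0 marking the last
member, the 4-byte padding between members, and the helper names. Type data
is treated as opaque bytes rather than real BTF types.

```python
import struct

BTF_MAGIC = 0xEB9F

# Existing struct btf_header fields (magic, version, flags, hdr_len,
# type_off, type_len, str_off, str_len), followed by two HYPOTHETICAL
# fields for this sketch: next_off and name_off.  Little-endian assumed.
HDR = struct.Struct("<HBBIIIIIII")


def build_member(name, type_data, is_last):
    """Serialize one member: header, type data, string table, padding."""
    strtab = b"\0"                    # offset 0 is the empty string
    name_off = 0                      # 0: unnamed, as suggested for the parent
    if name:
        name_off = len(strtab)
        strtab += name.encode() + b"\0"
    body = type_data + strtab
    body += b"\0" * (-len(body) % 4)  # assumption: keep headers 4-byte aligned
    # Assumption: the link is relative to the end of this member's header,
    # and 0 terminates the list.
    next_off = 0 if is_last else len(body)
    hdr = HDR.pack(BTF_MAGIC, 1, 0, HDR.size,
                   0, len(type_data),               # type_off, type_len
                   len(type_data), len(strtab),     # str_off, str_len
                   next_off, name_off)
    return hdr + body


def build_archive(members):
    """members: list of (name, type_data) pairs, parent first."""
    last = len(members) - 1
    return b"".join(build_member(name, data, i == last)
                    for i, (name, data) in enumerate(members))


def walk_archive(buf):
    """Yield (name, type_data) for each member by following next_off."""
    pos = 0
    while True:
        (magic, _version, _flags, hdr_len, type_off, type_len,
         str_off, str_len, next_off, name_off) = HDR.unpack_from(buf, pos)
        if magic != BTF_MAGIC:
            raise ValueError("bad BTF magic at offset %d" % pos)
        base = pos + hdr_len          # all offsets are end-of-header-relative
        strs = buf[base + str_off:base + str_off + str_len]
        name = strs[name_off:strs.index(b"\0", name_off)].decode()
        yield name, buf[base + type_off:base + type_off + type_len]
        if next_off == 0:
            return
        pos = base + next_off
```

Walking an archive built with build_archive([("", parent_types),
("ext4", ext4_types)]) yields the unnamed parent first and then each named
child in order, which is also what gives every member a stable integer
index.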

This has the notable advantage that existing BTF tools understand it without
change: they see only the parent (but since the parent is vmlinux, this is
probably enough for most existing tools that don't have to deal with
modules, and of course that's enough for existing tools working over actual
modules, which won't need archives at all).  We give members a name because
with the exception of the parent we do want to be able to distinguish the
members from each other: we need to know which module, or translation unit,
or whatever each individual member relates to.  The parent probably doesn't
need a name (it's always "vmlinux" or "shared types" or something), so it
can just use 0.
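
As an illustration of why an unmodified reader would see only the parent,
the Python sketch below parses just today's 24-byte struct btf_header and
uses hdr_len to find the type and string data. It assumes a reader that
honours hdr_len, skips header bytes it does not know about, and does not
object to data following the string section; it does not establish that
every existing tool behaves that way. The two trailing header fields and the
helper names are hypothetical.

```python
import struct

BTF_MAGIC = 0xEB9F

# Today's struct btf_header: magic, version, flags, hdr_len,
# type_off, type_len, str_off, str_len (24 bytes, little-endian assumed).
OLD_HDR = struct.Struct("<HBBIIIII")


def read_first_member(buf):
    """What a reader unaware of archives would extract: the parent only."""
    (magic, _version, _flags, hdr_len,
     type_off, type_len, str_off, str_len) = OLD_HDR.unpack_from(buf, 0)
    if magic != BTF_MAGIC:
        raise ValueError("not BTF")
    if hdr_len < OLD_HDR.size:
        raise ValueError("header too short")
    base = hdr_len                    # offsets are end-of-header-relative
    types = buf[base + type_off:base + type_off + type_len]
    strs = buf[base + str_off:base + str_off + str_len]
    return types, strs


def member(type_data, strtab, next_off, name_off):
    """Build a member whose header carries two HYPOTHETICAL extra fields."""
    extra = struct.pack("<II", next_off, name_off)
    hdr = OLD_HDR.pack(BTF_MAGIC, 1, 0, OLD_HDR.size + len(extra),
                       0, len(type_data), len(type_data), len(strtab))
    return hdr + extra + type_data + strtab


# Parent is unnamed (name_off 0) and links to the child that follows it.
parent = member(b"\x01" * 8, b"\0u8\0", next_off=12, name_off=0)
child = member(b"\x02" * 4, b"\0ext4\0", next_off=0, name_off=1)
archive = parent + child
parent_types, parent_strs = read_first_member(archive)
```

The reader never looks at next_off or name_off, so the child is invisible
to it; a tool that wants the children has to opt in by following the link.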

The proof of concept I posted earlier does not understand this format yet:
it only understands an earlier version of the archive format, because the
scheme above was invented later.  Of course I plan to teach the proof of
concept (and upstream binutils) about the new format too, once we agree.

There is one big change caused by using any format that involves more than
one BTF dictionary like this: external references to types become harder. If
all you have is a straight .BTF section, you can refer to a type with its ID
and it is unambiguous. If you have a bunch of them, suddenly you need a pair
of (member, type ID)!  The ability to refer to types via a small fixed-size
native-type token of some kind is extremely desirable and I do not want to
lose it. But if we consider the linked list above to be an array (and
looking up members by integer index is something libctf will make easy in
the near future), we can just make 64-bit "archive BTF IDs" where the top
32 bits are the index of the individual BTF member and the bottom 32 bits
are the type ID within it: types in the parent, with an index of 0, just
get a bigger type with no change in value. The kernel apparently does
something like this internally already.
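
The packing described here is small enough to sketch directly. The Python
below is only an illustration: the function names are hypothetical, and it
assumes the member index is simply the member's position in the chain, with
the parent at 0.

```python
TYPE_ID_BITS = 32
TYPE_ID_MASK = (1 << TYPE_ID_BITS) - 1


def make_archive_id(member_index, type_id):
    """Pack (member index, 32-bit type ID) into one 64-bit archive BTF ID."""
    if not 0 <= member_index <= TYPE_ID_MASK:
        raise ValueError("member index does not fit in 32 bits")
    if not 0 <= type_id <= TYPE_ID_MASK:
        raise ValueError("type ID does not fit in 32 bits")
    return (member_index << TYPE_ID_BITS) | type_id


def split_archive_id(archive_id):
    """Recover (member index, type ID) from a 64-bit archive BTF ID."""
    return archive_id >> TYPE_ID_BITS, archive_id & TYPE_ID_MASK
```

Because the parent has index 0, make_archive_id(0, n) == n for any valid
type ID n, so existing 32-bit references into the parent keep their value
when widened to 64 bits.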

Doing this means we keep the ability to refer to BTF types *in any module*
from other sections, even if they're stored together as an archive like this,
which is how nearly all extensions to BTF are structured anyway and which
seems like obviously the right way to do things: I'm thinking up ways to
turn most remaining CTF extensions to BTF into that sort of external table
already.

[1] The proof-of-concept is here:
    (binutils) https://sourceware.org/git/?p=binutils-gdb.git;a=log;h=refs/heads/users/nalcock/archive-v2/road-to-ctfv4

    (kernel) https://github.com/nickalcock/linux/tree/nix/btfa

[2] mail about it: https://lore.kernel.org/dwarves/87ldqf1i19.fsf@xxxxxxxxxxxxx/T/#u
    (note that the branch name in this mail is out of date)