On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > This is a respin of the uptr KV store. It is renamed to task local data (TLD)
> > as the problem statement and the solution have changed, and it now draws
> > more similarities to pthread thread-specific data.
> >
> > * Overview *
> >
> > This patchset is a continuation of the original UPTR work[0], which aims
> > to provide a fast way for user space programs to pass per-task hints to
> > sched_ext schedulers. UPTR built the foundation by supporting sharing
> > user pages with bpf programs through task local storage maps.
> >
> > Additionally, sched_ext would like to allow multiple developers to share
> > a storage without the need to explicitly agree on its layout.
> > This simplifies code base management and makes experimenting easier.
> > While a centralized storage layout definition would have worked, the
> > friction of synchronizing it across different repos is not desirable.
> >
> > This patchset contains the user space plumbing so that user space and bpf
> > program developers can exchange per-task hints easily through simple
> > interfaces.
> >
> > * Design *
> >
> > BPF task local data is a simple API for sharing task-specific data
> > between user space and bpf programs, where data are referred to using
> > string keys. As shown in the following figure, user space programs can
> > define a task local data item using bpf_tld_type_var(). The data is
> > effectively a variable declared with __thread, of which every thread owns
> > an independent copy that can be directly accessed. On the bpf side, a
> > task local data item first needs to be initialized once for every new
> > task (e.g., in sched_ext_ops::init_task) using bpf_tld_init_var(). Then,
> > other bpf programs can get a pointer to the data using bpf_tld_lookup().
> > The task local data APIs refer to data using string keys so developers
> > do not need to deal with addresses of data in a shared storage.
> >
> > ┌─ Application ──────────────────────────────────────────┐
> > │                           ┌─ library A ──────────────┐ │
> > │ bpf_tld_type_var(int, X)  │ bpf_tld_type_var(int, Y) │ │
> > │                           └┬─────────────────────────┘ │
> > └──────┬────────────────────│────────────────────────────┘
> >        │ X = 123;           │ Y = true;
>
> bpf_tld_type_var() is *defining* the variable (i.e., it allocates
> storage for it), right? I think for completeness we also need a
> *declare* macro to access that variable from other compilation units.
>
> >
> >        V                    V
> > + ─ Task local data ─ ─ ─ ─ ─ ─ +
> > | ┌─ task_kvs_map ────────────┐ |  ┌─ sched_ext_ops::init_task ──────┐
> > | │ BPF Task local storage    │ |  │ bpf_tld_init_var(&kvs, X);      │
> > | │ ┌───────────────────┐     │ |<─┤ bpf_tld_init_var(&kvs, Y);      │
> > | │ │ __uptr *udata     │     │ |  └─────────────────────────────────┘
> > | │ └───────────────────┘     │ |
> > | │ ┌───────────────────┐     │ |  ┌─ Other sched_ext_ops op ────────┐
> > | │ │ __uptr *umetadata │     │ |  │ int *y;                         ├┐
> > | │ └───────────────────┘     │ |<─┤ y = bpf_tld_lookup(&kvs, Y, 1); ││
> > | └───────────────────────────┘ |  │ if (y)                          ││
> > | ┌─ task_kvs_off_map ────────┐ |  │         /* do something */      ││
> > | │ BPF Task local storage    │ |  └┬────────────────────────────────┘│
> > | └───────────────────────────┘ |   └─────────────────────────────────┘
> > + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ +
> >
> > * Implementation *
> >
> > The task local data API hides the memory management from developers.
> > Internally, it shares user data with bpf programs through udata UPTRs.
> > Task local data from different compilation units are placed into a
> > custom "udata" section by the declaration API, bpf_tld_type_var(), so
> > that they are placed together in memory. User space will need to
> > call bpf_tld_thread_init() for every new thread to pin the udata pages
> > to the kernel.
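(For illustration, one way the declaration macro could work — a hypothetical sketch, not the selftest's actual code — is a __thread variable paired with a constructor that records the key and size, from which the runtime can later build the umetadata array. `tld_meta` and `tld_meta_add` below are made-up names, and the placement into the "udata" section is elided:)

```cpp
// Illustrative mirror of the per-process metadata: up to 64 entries,
// each holding a string key plus the value's offset and size in udata.
constexpr int MAX_TLD = 64;
struct tld_meta_t {
    int cnt;
    const char *keys[MAX_TLD];
    unsigned short offs[MAX_TLD];
    unsigned short szs[MAX_TLD];
};
static tld_meta_t tld_meta;  // zero-initialized before any constructor runs

// Record one entry, tightly packing offsets (alignment handling omitted).
static void tld_meta_add(const char *key, unsigned short sz)
{
    int i = tld_meta.cnt++;
    tld_meta.keys[i] = key;
    tld_meta.offs[i] = i ? tld_meta.offs[i - 1] + tld_meta.szs[i - 1] : 0;
    tld_meta.szs[i] = sz;
}

// Hypothetical expansion of bpf_tld_type_var(): the per-thread variable
// itself plus a constructor that registers it before main() runs.
#define bpf_tld_type_var(type, name)                              \
    thread_local type name;                                       \
    __attribute__((constructor)) static void tld_reg_##name(void) \
    { tld_meta_add(#name, sizeof(type)); }

bpf_tld_type_var(int, X);
bpf_tld_type_var(long, Y);
```

Each compilation unit that declares a variable this way contributes one metadata entry, which is what lets independent libraries share the storage without coordinating on a layout.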
> >
> > The metadata used to address udata is stored in the umetadata UPTR. It is
> > generated by constructors inserted by bpf_tld_type_var() and
> > bpf_tld_thread_init(). umetadata is an array of 64 metadata entries, one
> > per data item, each containing the key and the offset of the data in
> > udata. During initialization, bpf_tld_init_var() will search umetadata
> > for a matching key and cache its offset in task_kvs_off_map. Later,
> > bpf_tld_lookup() will use the cached offset to retrieve a pointer into
> > udata.
> >
> > * Limitation *
> >
> > Currently, it is assumed all key-value pairs are known when a program
> > starts. All compilation units using task local data should be statically
> > linked together so that values are all placed together in a udata section
> > and therefore can be shared with bpf through two UPTRs. The next
>
> Lack of support for shared libraries is a big limitation, IMO, so I
> think we should design for that support from the very beginning.
>
> FWIW, I think these compile-time static __thread variable definitions
> are unnecessarily limiting and add a non-trivial amount of complexity.
> I think it's more reasonable to go with a more dynamic approach, and
> I'll try to outline what an API (and some implementation details)
> might look like.
>

Indeed, the limitation of the static approach plus the restriction of
uptr makes the implementation complicated. I will switch to the dynamic
approach in the next respin.

> Note, all this is related to the user-space side of things. The BPF side
> is unchanged, effectively, except you get a guarantee that all the
> data will definitely be page-aligned, so you won't need this handling
> of two uptr pages.
>
> First, data structures.
> You'll have one per-process metadata structure
> describing known keys and their assigned "offsets":
>
> struct tld_metadata {
>         int cnt;
>         char keys[MAX_KEY_CNT][MAX_KEY_LEN];
>         __u16 offs[MAX_KEY_CNT];
>         __u16 szs[MAX_KEY_CNT];
> };
>
> Now, each thread will basically have just a blob of data, so,
> technically, per-thread we will just have:
>
> struct tld_data {
>         __u8 data[PAGE_SIZE];
> };
>
> By pre-allocating the entire page we avoid lots of complications, so I
> think it's worth doing.
>
> Now, we really need just two APIs on the user-space side (and I'll use
> the "opaque offset" type as I suggested on another patch):
>
> typedef struct { int off; } tld_off_t;
>
> tld_off_t tld_add_key_def(const char *key_name, size_t key_size);

I will incorporate it into the next iteration.

> This API can be called just once per each key that the process cares
> about. And this can be done at any point, really, very dynamically.
> The implementation will:
> - (just once per process) open the pinned BPF map, remember its FD;
> - (just once) allocate struct tld_metadata, unless we define it as a
>   pre-allocated global variable;
> - (locklessly) check if key_name is already in tld_metadata; if yes,
>   return the already assigned offset;
> - (locklessly) if not, add this key and assign it the offset
>   offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
>   values), though we should take care of alignment requirements, of
>   course;
> - return the newly assigned offset;
>
> Now, the second essential API is called for each participating thread
> for each different key. And again, this is all very dynamic. It's
> possible that some threads won't use any of this TLD stuff, in which
> case there will be no overhead (memory or CPU), and not even an entry
> in the task local storage map for that thread. So, API:

The advantage of no memory wasted for threads that are not using TLD
doesn't seem that significant to me.
If users add per-process hints, then this scheme can potentially use a
lot more memory (i.e., PAGE_SIZE * number of threads). Maybe we need
another uptr for per-process data? Or do you think this is out of the
scope of TLD and we should recommend other solutions?

> void *tld_resolve_key(tld_off_t key_off);
>
> This API will:
> - (just once per thread, which is done trivially by using a
>   __thread-local global variable to keep track of this) allocate struct
>   tld_data dynamically (with page alignment, i.e.,
>   aligned_alloc(PAGE_SIZE, PAGE_SIZE));
> - (just once per thread as well) bpf_map_update_elem() for the current
>   thread, updating two uptrs: one pointing to the global tld_metadata,
>   another pointing to the thread-local tld_data;
> - return tld_data->data + key_off.off;
>
> That is, this API returns the absolute memory address of a value
> resolved in the context of the current thread.
>
>
> And let's look at how one can make optimal use of this API on the
> user-space side.
>
>
> /* global variables */
> tld_off_t my_key1_off; /* struct my_val */
> tld_off_t my_key2_off; /* int */
>
> __thread struct my_val *my_val1;
> __thread int *my_val2;
>
> ... somewhere in a constructor ...
>
> my_key1_off = tld_add_key_def("my_key1", sizeof(struct my_val));
> my_key2_off = tld_add_key_def("my_key2", sizeof(int));
>
> ... and then somewhere in the code that makes use of TLD stuff ...
>
> if (!my_val1) /* this can be initialized once per thread to avoid this
>                  check (or hidden in a helper accessor function) */
>         my_val1 = tld_resolve_key(my_key1_off);
>
> my_val1->blah_field = 123;
>
> if (!my_val2)
>         my_val2 = tld_resolve_key(my_key2_off);
> *my_val2 = 567;
>
> That's pretty much it, I think.

Thanks for elaborating what the alternative approach would look like.
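(Filling in the outline above, a minimal single-threaded sketch of the two APIs might look as follows. The bpf_map_update_elem() step is reduced to a comment, locking and alignment handling are omitted, and everything except the two function names from the outline is illustrative:)

```cpp
#include <cstdlib>
#include <cstring>

#define TLD_PAGE_SIZE 4096
#define MAX_KEY_CNT 64
#define MAX_KEY_LEN 64

typedef struct { int off; } tld_off_t;

// Per-process metadata: known keys and their assigned offsets/sizes.
// A real implementation would share this with BPF via the metadata uptr.
static struct {
    int cnt;
    char keys[MAX_KEY_CNT][MAX_KEY_LEN];
    unsigned short offs[MAX_KEY_CNT];
    unsigned short szs[MAX_KEY_CNT];
} tld_metadata;

// Per-thread value blob, allocated lazily, page-sized and page-aligned.
static thread_local unsigned char *tld_data;

tld_off_t tld_add_key_def(const char *key_name, size_t key_size)
{
    // If the key is already known, return its assigned offset.
    for (int i = 0; i < tld_metadata.cnt; i++)
        if (!strcmp(tld_metadata.keys[i], key_name))
            return tld_off_t{tld_metadata.offs[i]};

    // Otherwise tightly pack the new value after the previous one.
    int i = tld_metadata.cnt++;
    strncpy(tld_metadata.keys[i], key_name, MAX_KEY_LEN - 1);
    tld_metadata.offs[i] =
        i ? tld_metadata.offs[i - 1] + tld_metadata.szs[i - 1] : 0;
    tld_metadata.szs[i] = (unsigned short)key_size;
    return tld_off_t{tld_metadata.offs[i]};
}

void *tld_resolve_key(tld_off_t key_off)
{
    if (!tld_data) {
        // First use in this thread: allocate the page. Real code would
        // now bpf_map_update_elem() the current thread's task local
        // storage with the tld_metadata and tld_data uptrs.
        tld_data = static_cast<unsigned char *>(
            aligned_alloc(TLD_PAGE_SIZE, TLD_PAGE_SIZE));
        memset(tld_data, 0, TLD_PAGE_SIZE);
    }
    return tld_data + key_off.off;
}
```

Because tld_add_key_def() is idempotent per key, any library can call it at any point and get back the offset that was already assigned, which is what makes the dynamic scheme work for shared libraries.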
> By adding operator overloading (e.g., operator= and operator->), you
> get very natural-looking definition and access patterns:
>
> /* global data */
>
> tld_variable<struct my_val> my_val1("my_key1");
> tld_variable<int> my_val2("my_key2");
>
> ... now in the actual TLD-using code, we just do:
>
> my_val1->blah_field = 123;
> my_val2 = 567; /* I think C++ would allow this with operator overloading */
>
> I hope the example explains why it's still fast despite everything
> being dynamic. There is a pointer indirection and that page-sized
> allocation (so that we can cache thread-specific resolved pointers),
> but it seems absolutely fine from a performance perspective.
>
> > iteration will explore how bpf task local data can work in dynamic
> > libraries. Maybe more udata UPTRs will be added to pin the TLS pages
> > of dynamically loaded modules. Or maybe it will allocate memory for data
> > instead of relying on __thread, and change how user space interacts with
> > task local data slightly. The latter approach can also avoid some trouble
> > dealing with the restrictions of UPTR.
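(A rough sketch of such a wrapper, with the dynamic TLD runtime stubbed out by a plain thread-local buffer so the example stands alone; none of this is the actual selftest API:)

```cpp
#include <cstddef>

// --- stubs standing in for the dynamic TLD API sketched earlier ---
typedef struct { int off; } tld_off_t;
static int tld_next_off;
static thread_local unsigned char tld_data[4096];

static tld_off_t tld_add_key_def(const char *, size_t sz)
{
    tld_off_t o{tld_next_off};
    tld_next_off += (int)sz;  // tight packing, alignment ignored
    return o;
}
static void *tld_resolve_key(tld_off_t o) { return tld_data + o.off; }
// ------------------------------------------------------------------

// Hypothetical templated wrapper: the constructor registers the key,
// and the overloaded operators resolve the value in the calling
// thread's context on each access.
template <typename T>
class tld_variable {
    tld_off_t off_;
public:
    explicit tld_variable(const char *key)
        : off_(tld_add_key_def(key, sizeof(T))) {}
    T *operator->() { return static_cast<T *>(tld_resolve_key(off_)); }
    T &operator*() { return *operator->(); }
    tld_variable &operator=(const T &v) { **this = v; return *this; }
    operator T() { return **this; }
};

/* global data */
struct my_val { int blah_field; };
tld_variable<my_val> my_val1("my_key1");
tld_variable<int> my_val2("my_key2");
```

A per-instance thread-local cache of the resolved pointer would need a bit more machinery (a non-static member cannot be thread_local), so this sketch simply re-resolves on each access.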
> >
> > Some other limitations:
> > - Total task local data cannot exceed a page
> > - Only supports 64 task local data entries
> > - Some memory is wasted for data whose size is not a power of two,
> >   due to a UPTR limitation
> >
> > [0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@xxxxxxxxx/
> >
> >
> > Amery Hung (2):
> >   selftests/bpf: Introduce task local data
> >   selftests/bpf: Test basic workflow of task local data
> >
> >  .../bpf/prog_tests/task_local_data.c          | 159 +++++++++++++++
> >  .../bpf/prog_tests/task_local_data.h          |  58 ++++++
> >  .../bpf/prog_tests/test_task_local_data.c     | 156 +++++++++++++++
> >  .../selftests/bpf/progs/task_local_data.h     | 181 ++++++++++++++++++
> >  .../bpf/progs/test_task_local_data_basic.c    |  78 ++++++++
> >  .../selftests/bpf/task_local_data_common.h    |  49 +++++
> >  6 files changed, 681 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> >  create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
> >
> > --
> > 2.47.1
> >