On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > This is a respin of the uptr KV store. It is renamed to task local data (TLD)
> > as the problem statement and the solution have changed, and it now draws
> > more similarities to pthread thread-specific data.
> >
> > * Overview *
> >
> > This patchset is a continuation of the original UPTR work[0], which aims
> > to provide a fast way for user space programs to pass per-task hints to
> > sched_ext schedulers. UPTR built the foundation by supporting sharing
> > user pages with bpf programs through task local storage maps.
> >
> > Additionally, sched_ext would like to allow multiple developers to share
> > a storage without the need to explicitly agree on its layout.
> > This simplifies code base management and makes experimenting easier.
> > While a centralized storage layout definition would have worked, the
> > friction of synchronizing it across different repos is not desirable.
> >
> > This patchset contains the user space plumbing so that user space and bpf
> > program developers can exchange per-task hints easily through simple
> > interfaces.
> >
> > * Design *
> >
> > BPF task local data is a simple API for sharing task-specific data
> > between user space and bpf programs, where data are referred to using
> > string keys. As shown in the following figure, user space programs can
> > define a task local data item using bpf_tld_type_var(). The data is
> > effectively a variable declared with __thread, of which every thread owns
> > an independent copy that can be directly accessed. On the bpf side, a
> > task local data item first needs to be initialized once for every new
> > task (e.g., in sched_ext_ops::init_task) using bpf_tld_init_var(). Then,
> > other bpf programs can get a pointer to the data using bpf_tld_lookup().
> > The task local data APIs refer to data using string keys so developers
> > do not need to deal with addresses of data in a shared storage.
> >
> > ┌─ Application ──────────────────────────────────────────┐
> > │                           ┌─ library A ──────────────┐ │
> > │ bpf_tld_type_var(int, X)  │ bpf_tld_type_var(int, Y) │ │
> > │                           └┬─────────────────────────┘ │
> > └──────┬────────────────────│────────────────────────────┘
> >        │ X = 123;           │ Y = true;
>
> bpf_tld_type_var() is *defining* the variable (i.e., it allocates
> storage for it), right? I think for completeness we also need a
> *declare* macro to access that variable from other compilation units.
>
> >
> >        V                    V
> > + ─ Task local data ─ ─ ─ ─ ─ ─ +
> > | ┌─ task_kvs_map ────────────┐ |  ┌─ sched_ext_ops::init_task ──────┐
> > | │ BPF Task local storage    │ |  │ bpf_tld_init_var(&kvs, X);      │
> > | │ ┌───────────────────┐     │ |<─┤ bpf_tld_init_var(&kvs, Y);      │
> > | │ │ __uptr *udata     │     │ |  └─────────────────────────────────┘
> > | │ └───────────────────┘     │ |
> > | │ ┌───────────────────┐     │ |  ┌─ Other sched_ext_ops op ────────┐
> > | │ │ __uptr *umetadata │     │ |  │ int *y;                         ├┐
> > | │ └───────────────────┘     │ |<─┤ y = bpf_tld_lookup(&kvs, Y, 1); ││
> > | └───────────────────────────┘ |  │ if (y)                          ││
> > | ┌─ task_kvs_off_map ────────┐ |  │         /* do something */      ││
> > | │ BPF Task local storage    │ |  └┬────────────────────────────────┘│
> > | └───────────────────────────┘ |   └─────────────────────────────────┘
> > + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ +
> >
> > * Implementation *
> >
> > The task local data API hides the memory management from developers.
> > Internally, it shares user data with bpf programs through udata UPTRs.
> > Task local data from different compilation units are placed into a
> > custom "udata" section by the declaration API, bpf_tld_type_var(), so
> > that they are placed together in memory. User space will need to
> > call bpf_tld_thread_init() for every new thread to pin the udata pages
> > to the kernel.
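(For illustration, one way the declaration macro could work — a hypothetical sketch, not the selftest's actual code — is a __thread variable paired with a constructor that records the key and size, from which the runtime can later build the umetadata array. `tld_meta` and `tld_meta_add` below are made-up names, and the placement into the "udata" section is elided:)

```cpp
// Illustrative mirror of the per-process metadata: up to 64 entries,
// each holding a string key plus the value's offset and size in udata.
constexpr int MAX_TLD = 64;
struct tld_meta_t {
    int cnt;
    const char *keys[MAX_TLD];
    unsigned short offs[MAX_TLD];
    unsigned short szs[MAX_TLD];
};
static tld_meta_t tld_meta;  // zero-initialized before any constructor runs

// Record one entry, tightly packing offsets (alignment handling omitted).
static void tld_meta_add(const char *key, unsigned short sz)
{
    int i = tld_meta.cnt++;
    tld_meta.keys[i] = key;
    tld_meta.offs[i] = i ? tld_meta.offs[i - 1] + tld_meta.szs[i - 1] : 0;
    tld_meta.szs[i] = sz;
}

// Hypothetical expansion of bpf_tld_type_var(): the per-thread variable
// itself plus a constructor that registers it before main() runs.
#define bpf_tld_type_var(type, name)                              \
    thread_local type name;                                       \
    __attribute__((constructor)) static void tld_reg_##name(void) \
    { tld_meta_add(#name, sizeof(type)); }

bpf_tld_type_var(int, X);
bpf_tld_type_var(long, Y);
```

Each compilation unit that declares a variable this way contributes one metadata entry, which is what lets independent libraries share the storage without coordinating on a layout.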
> >
> > The metadata used to address udata is stored in the umetadata UPTR. It is
> > generated by constructors inserted by bpf_tld_type_var() and
> > bpf_tld_thread_init(). umetadata is an array of 64 metadata entries, one
> > per data item, each containing the key and the offset of the data in
> > udata. During initialization, bpf_tld_init_var() will search umetadata
> > for a matching key and cache its offset in task_kvs_off_map. Later,
> > bpf_tld_lookup() will use the cached offset to retrieve a pointer into
> > udata.
> >
> > * Limitation *
> >
> > Currently, it is assumed all key-value pairs are known when a program
> > starts. All compilation units using task local data should be statically
> > linked together so that values are all placed together in a udata section
> > and therefore can be shared with bpf through two UPTRs. The next
>
> Lack of support for shared libraries is a big limitation, IMO, so I
> think we should design for that support from the very beginning.
>
> FWIW, I think these compile-time static __thread variable definitions
> are unnecessarily limiting and add a non-trivial amount of complexity.
> I think it's more reasonable to go with a more dynamic approach, and
> I'll try to outline what an API (and some implementation details)
> might look like.
>

Indeed, the limitation of the static approach plus the restriction of
uptr makes the implementation complicated. I will switch to the dynamic
approach in the next respin.

> Note, all this is related to the user-space side of things. The BPF side
> is unchanged, effectively, except you get a guarantee that all the
> data will definitely be page-aligned, so you won't need this handling
> of two uptr pages.
>
> First, data structures.
> You'll have one per-process metadata structure
> describing known keys and their assigned "offsets":
>
> struct tld_metadata {
>         int cnt;
>         char keys[MAX_KEY_CNT][MAX_KEY_LEN];
>         __u16 offs[MAX_KEY_CNT];
>         __u16 szs[MAX_KEY_CNT];
> };
>
> Now, each thread will basically have just a blob of data, so,
> technically, per-thread we will just have:
>
> struct tld_data {
>         __u8 data[PAGE_SIZE];
> };
>
> By pre-allocating the entire page we avoid lots of complications, so I
> think it's worth doing.
>
> Now, we really need just two APIs on the user-space side (and I'll use
> the "opaque offset" type as I suggested on another patch):
>
> typedef struct { int off; } tld_off_t;
>
> tld_off_t tld_add_key_def(const char *key_name, size_t key_size);

I will incorporate it into the next iteration.

> This API can be called just once per each key that the process cares
> about. And this can be done at any point, really, very dynamically.
> The implementation will:
> - (just once per process) open the pinned BPF map, remember its FD;
> - (just once) allocate struct tld_metadata, unless we define it as a
>   pre-allocated global variable;
> - (locklessly) check if key_name is already in tld_metadata; if yes,
>   return the already assigned offset;
> - (locklessly) if not, add this key and assign it the offset
>   offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
>   values), though we should take care of alignment requirements, of
>   course;
> - return the newly assigned offset;
>
> Now, the second essential API is called for each participating thread
> for each different key. And again, this is all very dynamic. It's
> possible that some threads won't use any of this TLD stuff, in which
> case there will be no overhead (memory or CPU), and not even an entry
> in the task local storage map for that thread. So, API:

The advantage of no memory wasted for threads that are not using TLD
doesn't seem that significant to me.
If users add per-process hints, then this scheme can potentially use a
lot more memory (i.e., PAGE_SIZE * number of threads). Maybe we need
another uptr for per-process data? Or do you think this is out of the
scope of TLD and we should recommend other solutions?

> void *tld_resolve_key(tld_off_t key_off);
>
> This API will:
> - (just once per thread, which is done trivially by using a
>   __thread-local global variable to keep track of this) allocate struct
>   tld_data dynamically (with page alignment, i.e.,
>   aligned_alloc(PAGE_SIZE, PAGE_SIZE));
> - (just once per thread as well) bpf_map_update_elem() for the current
>   thread, updating two uptrs: one pointing to the global tld_metadata,
>   another pointing to the thread-local tld_data;
> - return tld_data->data + key_off.off;
>
> That is, this API returns the absolute memory address of a value
> resolved in the context of the current thread.
>
>
> And let's look at how one can make optimal use of this API on the
> user-space side.
>
>
> /* global variables */
> tld_off_t my_key1_off; /* struct my_val */
> tld_off_t my_key2_off; /* int */
>
> __thread struct my_val *my_val1;
> __thread int *my_val2;
>
> ... somewhere in a constructor ...
>
> my_key1_off = tld_add_key_def("my_key1", sizeof(struct my_val));
> my_key2_off = tld_add_key_def("my_key2", sizeof(int));
>
> ... and then somewhere in the code that makes use of TLD stuff ...
>
> if (!my_val1) /* this can be initialized once per thread to avoid this
>                  check (or hidden in a helper accessor function) */
>         my_val1 = tld_resolve_key(my_key1_off);
>
> my_val1->blah_field = 123;
>
> if (!my_val2)
>         my_val2 = tld_resolve_key(my_key2_off);
> *my_val2 = 567;
>
> That's pretty much it, I think.

Thanks for elaborating what the alternative approach would look like.
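(Filling in the outline above, a minimal single-threaded sketch of the two APIs might look as follows. The bpf_map_update_elem() step is reduced to a comment, locking and alignment handling are omitted, and everything except the two function names from the outline is illustrative:)

```cpp
#include <cstdlib>
#include <cstring>

#define TLD_PAGE_SIZE 4096
#define MAX_KEY_CNT 64
#define MAX_KEY_LEN 64

typedef struct { int off; } tld_off_t;

// Per-process metadata: known keys and their assigned offsets/sizes.
// A real implementation would share this with BPF via the metadata uptr.
static struct {
    int cnt;
    char keys[MAX_KEY_CNT][MAX_KEY_LEN];
    unsigned short offs[MAX_KEY_CNT];
    unsigned short szs[MAX_KEY_CNT];
} tld_metadata;

// Per-thread value blob, allocated lazily, page-sized and page-aligned.
static thread_local unsigned char *tld_data;

tld_off_t tld_add_key_def(const char *key_name, size_t key_size)
{
    // If the key is already known, return its assigned offset.
    for (int i = 0; i < tld_metadata.cnt; i++)
        if (!strcmp(tld_metadata.keys[i], key_name))
            return tld_off_t{tld_metadata.offs[i]};

    // Otherwise tightly pack the new value after the previous one.
    int i = tld_metadata.cnt++;
    strncpy(tld_metadata.keys[i], key_name, MAX_KEY_LEN - 1);
    tld_metadata.offs[i] =
        i ? tld_metadata.offs[i - 1] + tld_metadata.szs[i - 1] : 0;
    tld_metadata.szs[i] = (unsigned short)key_size;
    return tld_off_t{tld_metadata.offs[i]};
}

void *tld_resolve_key(tld_off_t key_off)
{
    if (!tld_data) {
        // First use in this thread: allocate the page. Real code would
        // now bpf_map_update_elem() the current thread's task local
        // storage with the tld_metadata and tld_data uptrs.
        tld_data = static_cast<unsigned char *>(
            aligned_alloc(TLD_PAGE_SIZE, TLD_PAGE_SIZE));
        memset(tld_data, 0, TLD_PAGE_SIZE);
    }
    return tld_data + key_off.off;
}
```

Because tld_add_key_def() is idempotent per key, any library can call it at any point and get back the offset that was already assigned, which is what makes the dynamic scheme work for shared libraries.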
> By adding operator overloading (e.g., operator= and operator->), you
> get very natural-looking definition and access patterns:
>
> /* global data */
>
> tld_variable<struct my_val> my_val1("my_key1");
> tld_variable<int> my_val2("my_key2");
>
> ... now in the actual TLD-using code, we just do:
>
> my_val1->blah_field = 123;
> my_val2 = 567; /* I think C++ would allow this with operator overloading */
>
> I hope the example explains why it's still fast despite everything
> being dynamic. There is a pointer indirection and that page-sized
> allocation (so that we can cache thread-specific resolved pointers),
> but it seems absolutely fine from a performance perspective.
>
> > iteration will explore how bpf task local data can work in dynamic
> > libraries. Maybe more udata UPTRs will be added to pin the TLS pages
> > of dynamically loaded modules. Or maybe it will allocate memory for data
> > instead of relying on __thread, and change how user space interacts with
> > task local data slightly. The latter approach can also avoid some trouble
> > dealing with the restrictions of UPTR.
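(A rough sketch of such a wrapper, with the dynamic TLD runtime stubbed out by a plain thread-local buffer so the example stands alone; none of this is the actual selftest API:)

```cpp
#include <cstddef>

// --- stubs standing in for the dynamic TLD API sketched earlier ---
typedef struct { int off; } tld_off_t;
static int tld_next_off;
static thread_local unsigned char tld_data[4096];

static tld_off_t tld_add_key_def(const char *, size_t sz)
{
    tld_off_t o{tld_next_off};
    tld_next_off += (int)sz;  // tight packing, alignment ignored
    return o;
}
static void *tld_resolve_key(tld_off_t o) { return tld_data + o.off; }
// ------------------------------------------------------------------

// Hypothetical templated wrapper: the constructor registers the key,
// and the overloaded operators resolve the value in the calling
// thread's context on each access.
template <typename T>
class tld_variable {
    tld_off_t off_;
public:
    explicit tld_variable(const char *key)
        : off_(tld_add_key_def(key, sizeof(T))) {}
    T *operator->() { return static_cast<T *>(tld_resolve_key(off_)); }
    T &operator*() { return *operator->(); }
    tld_variable &operator=(const T &v) { **this = v; return *this; }
    operator T() { return **this; }
};

/* global data */
struct my_val { int blah_field; };
tld_variable<my_val> my_val1("my_key1");
tld_variable<int> my_val2("my_key2");
```

A per-instance thread-local cache of the resolved pointer would need a bit more machinery (a non-static member cannot be thread_local), so this sketch simply re-resolves on each access.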
> >
> > Some other limitations:
> > - Total task local data cannot exceed a page
> > - Only supports 64 task local data entries
> > - Some memory is wasted for data whose size is not a power of two,
> >   due to a UPTR limitation
> >
> > [0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@xxxxxxxxx/
> >
> >
> > Amery Hung (2):
> >   selftests/bpf: Introduce task local data
> >   selftests/bpf: Test basic workflow of task local data
> >
> >  .../bpf/prog_tests/task_local_data.c          | 159 +++++++++++++++
> >  .../bpf/prog_tests/task_local_data.h          |  58 ++++++
> >  .../bpf/prog_tests/test_task_local_data.c     | 156 +++++++++++++++
> >  .../selftests/bpf/progs/task_local_data.h     | 181 ++++++++++++++++++
> >  .../bpf/progs/test_task_local_data_basic.c    |  78 ++++++++
> >  .../selftests/bpf/task_local_data_common.h    |  49 +++++
> >  6 files changed, 681 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> >  create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
> >
> > --
> > 2.47.1
> >