On Tue, Apr 29, 2025 at 6:45 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@xxxxxxxxx> wrote: > > > > Task local data provides simple and fast bpf and user space APIs to > > exchange per-task data through a task local storage map. The user space > > first declares task local data using bpf_tld_type_key_var() or > > bpf_tld_type_var(). The data is a thread-specific variable of which > > every thread has its own copy. Then, bpf_tld_thread_init() needs to > > be called for every thread to share the data with bpf. Finally, users > > can directly read/write thread local data without bpf syscalls. > > > > For bpf programs to see task local data, every piece of data needs to be > > initialized once for every new task using bpf_tld_init_var(). Then > > bpf programs can retrieve pointers to the data with bpf_tld_lookup(). > > > > The core of the task local data implementation relies on UPTRs. They > > pin user pages to the kernel so that user space can share data with bpf > > programs without syscalls. Both data and the metadata used to access > > data are pinned via UPTRs. > > > > A limitation of UPTR makes the implementation of task local data > > less trivial than it sounds: memory pinned to a UPTR cannot exceed a > > page and must not cross the page boundary. In addition, the data > > declaration uses the __thread identifier and therefore does not have > > direct control over the memory allocation. Therefore, several > > tricks and checks are used to make it work. > > > > First, the task local data declaration APIs place data in a custom "udata" > > section so that data from different compilation units will be contiguous > > in memory and can be pinned using two UPTRs if they are smaller than > > one page. > > > > To avoid any piece of data spanning two pages, each is aligned > > to the smallest power of two not smaller than its size. 
> > > > As we don't control the memory allocation for data, we need to figure > > out the layout of user defined data. This is done by the data > > declaration API and bpf_tld_thread_init(). The data declaration API > > will insert constructors for all data, and they are used to find the > > size and offset of data as well as the beginning and end of the whole > > udata section. Then, bpf_tld_thread_init() performs a per-thread check > > to make sure no data will cross the page boundary as udata can start at a > > different offset in a page. > > > > Note that for umetadata, we directly aligned_alloc() memory for it and > > assign it to the UPTR. This is only done once for every process as all > > tasks share the same umetadata. The actual thread-specific data offset > > will be adjusted in the bpf program when calling bpf_tld_init_var(). > > > > Signed-off-by: Amery Hung <ameryhung@xxxxxxxxx> > > --- > > .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++ > > .../bpf/prog_tests/task_local_data.h | 58 ++++++ > > .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++ > > .../selftests/bpf/task_local_data_common.h | 41 ++++ > > 4 files changed, 439 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c > > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h > > create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h > > create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.c b/tools/testing/selftests/bpf/prog_tests/task_local_data.c > > new file mode 100644 > > index 000000000000..5a21514573d2 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.c > > @@ -0,0 +1,159 @@ > > +#include <fcntl.h> > > +#include <errno.h> > > +#include <stdio.h> > > +#include <pthread.h> > > + > > +#include <bpf/bpf.h> > > + > > +#include "bpf_util.h" > > +#include 
"task_local_data.h" > > +#include "task_local_storage_helpers.h" > > + > > +#define PIDFD_THREAD O_EXCL > > + > > +/* To find the start of udata for each thread, insert a dummy variable to udata. > > Pls use kernel comment style instead of networking. > Thanks for reviewing! Since the next respin will pivot to a dynamic approach. I will address the issue if they are still applicable. I will change to kernel comment style. > > + * Contructors generated for every task local data will figured out the offset > > constructors. > Not the only typo I made today... I will fix my codespell setup. > > + * from the beginning of udata to the dummy symbol. Then, every thread can infer > > + * the start of udata by subtracting the offset from the address of dummy. > > + */ > > Pls add the full algorithm involving udata_dummy here. > How it can be anywhere in the udata section... > then constructors will keep adjusting udata_start_dummy_off > and eventually km->off -= udata_start_dummy_off. > These steps are very tricky. > So big and detailed comments are necessary. 
> > > +static __thread struct udata_dummy {} udata_dummy SEC("udata"); > > + > > +static __thread bool task_local_data_thread_inited; > > + > > +struct task_local_data { > > + void *udata_start; > > + void *udata_end; > > + int udata_start_dummy_off; > > + struct meta_page *umetadata; > > + int umetadata_cnt; > > + bool umetadata_init; > > + short udata_sizes[64]; > > + pthread_mutex_t lock; > > +} task_local_data = { > > + .udata_start = (void *)-1UL, > > + .lock = PTHREAD_MUTEX_INITIALIZER, > > +}; > > + > > +static void tld_set_data_key_meta(int i, const char *key, short off) > > +{ > > + task_local_data.umetadata->meta[i].off = off; > > + strncpy(task_local_data.umetadata->meta[i].key, key, TASK_LOCAL_DATA_KEY_LEN); > > +} > > + > > +static struct key_meta *tld_get_data_key_meta(int i) > > +{ > > + return &task_local_data.umetadata->meta[i]; > > +} > > + > > +static void tld_set_data_size(int i, int size) > > +{ > > + task_local_data.udata_sizes[i] = size; > > +} > > + > > +static int tld_get_data_size(int i) > > +{ > > + return task_local_data.udata_sizes[i]; > > +} > > The above 4 helpers are single use. > If nothing else is using them, open code them directly. > Otherwise it only makes it harder to understand the logic. 
> > > + > > +void __bpf_tld_var_init(const char *key, void *var, int size) > > +{ > > + int i; > > + > > + i = task_local_data.umetadata_cnt++; > > + > > + if (!task_local_data.umetadata) { > > + if (task_local_data.umetadata_cnt > 1) > > + return; > > + > > + task_local_data.umetadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE); > > + if (!task_local_data.umetadata) > > + return; > > + } > > + > > + if (var < task_local_data.udata_start) { > > + task_local_data.udata_start = var; > > + task_local_data.udata_start_dummy_off = > > + (void *)&udata_dummy - task_local_data.udata_start; > > + } > > + > > + if (var + size > task_local_data.udata_end) > > + task_local_data.udata_end = var + size; > > + > > + tld_set_data_key_meta(i, key, var - (void *)&udata_dummy); > > + tld_set_data_size(i, size); > > +} > > + > > +int bpf_tld_thread_init(void) > > +{ > > + unsigned long udata_size, udata_start, udata_start_page, udata_end_page; > > + struct task_local_data_map_value map_val; > > + int i, task_id, task_fd, map_fd, err; > > + > > + if (!task_local_data.umetadata_cnt || task_local_data_thread_inited) > > + return 0; > > + > > + if (task_local_data.umetadata_cnt && !task_local_data.umetadata) > > + return -ENOMEM; > > + > > + udata_start = (unsigned long)&udata_dummy + task_local_data.udata_start_dummy_off; > > + > > + pthread_mutex_lock(&task_local_data.lock); > > can we drop this? > > If .c is part of .h just this line will drag -lpthread dependency. > I think it's an artifact of the selftest. > The selftest/library/application can have its own mutex to protect > or this function can use simple xchg() like exclusion > without mutexes. I will drop the pthread dependency. The mutex also makes the user space tld library stateful, which goes against the principle that the task local storage map is the ground truth. 
> > > + for (i = 0; i < task_local_data.umetadata_cnt; i++) { > > + struct key_meta *km = tld_get_data_key_meta(i); > > + int size = tld_get_data_size(i); > > + int off; > > + > > + if (!task_local_data.umetadata_init) { > > + /* Constructors save the offset from udata_dummy to each data > > + * Now as all ctors have run and the offset between the start of > > + * udata and udata_dummy is known, adjust the offsets of data > > + * to be relative to the start of udata > > + */ > > + km->off -= task_local_data.udata_start_dummy_off; > > + > > + /* Data exceeding a page may not be able to be covered by > > + * two udata UPTRs in every thread > > + */ > > + if (km->off >= PAGE_SIZE) > > + return -EOPNOTSUPP; > > returns without releasing the mutex... > One more reason to avoid it. > > > + } > > + > > + /* A task local data should not span across two pages. */ > > + off = km->off + udata_start; > > + if ((off & PAGE_MASK) != ((off + size - 1) & PAGE_MASK)) > > + return -EOPNOTSUPP; > > + } > > + task_local_data.umetadata_init = true; > > + pthread_mutex_unlock(&task_local_data.lock); > > + > > + udata_size = task_local_data.udata_end - task_local_data.udata_start; > > + udata_start_page = udata_start & PAGE_MASK; > > + udata_end_page = (udata_start + udata_size) & PAGE_MASK; > > + > > + /* The whole udata can span across two pages for a thread. Use two UPTRs > > + * to cover the second page in case it happens. > > + */ > > + map_val.udata_start = udata_start & ~PAGE_MASK; > > + map_val.udata[0].page = (struct data_page *)(udata_start_page); > > + map_val.udata[1].page = (udata_start_page == udata_end_page) ? 
NULL : > > + (struct data_page *)(udata_start_page + PAGE_SIZE); > > + > > + /* umetadata is shared by all threads under the assumption that all > > + * task local data are defined statically and linked together > > + */ > > + map_val.umetadata = task_local_data.umetadata; > > + map_val.umetadata_cnt = task_local_data.umetadata_cnt; > > + > > + map_fd = bpf_obj_get(TASK_LOCAL_DATA_MAP_PIN_PATH); > > + if (map_fd < 0) > > + return -errno; > > + > > + task_id = sys_gettid(); > > + task_fd = sys_pidfd_open(task_id, PIDFD_THREAD); > > + err = bpf_map_update_elem(map_fd, &task_fd, &map_val, 0); > > + if (err) > > + return err; > > + > > + task_local_data_thread_inited = true; > > + return 0; > > +} > > diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.h b/tools/testing/selftests/bpf/prog_tests/task_local_data.h > > new file mode 100644 > > index 000000000000..c928e8d2c0a6 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.h > > @@ -0,0 +1,58 @@ > > +#ifndef __BPF_TASK_LOCAL_DATA_H__ > > +#define __BPF_TASK_LOCAL_DATA_H__ > > + > > +#include "task_local_data_common.h" > > + > > +#define SEC(name) __attribute__((section(name), used)) > > +#define __aligned(x) __attribute__((aligned(x))) > > + > > +#define ROUND_UP_POWER_OF_TWO(x) (1UL << (sizeof(x) * 8 - __builtin_clzl(x - 1))) > > + > > +void __bpf_tld_var_init(const char *key, void *var, int size); > > If possible, let's try to put everything into .h, so it's easier > to distribute. Otherwise extra .c becomes a headache. > I will try to put everything in a .h file in the next respin. > > + > > +/** > > + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf > > + * programs. The data will be a thread-specific variable which a user space > > + * program can directly read/write, while bpf programs will need to lookup > > + * with the string key. > > + * > > + * @param key The string key a task local data will be associated with. 
The > > + * string will be truncated if the length exceeds TASK_LOCAL_DATA_KEY_LEN > > + * @param type The type of the task local data > > + * @param var The name of the task local data > > + */ > > +#define bpf_tld_key_type_var(key, type, var) \ > > +__thread type var SEC("udata") __aligned(ROUND_UP_POWER_OF_TWO(sizeof(type))); \ > > + \ > > +__attribute__((constructor)) \ > > +void __bpf_tld_##var##_init(void) \ > > +{ \ > > + _Static_assert(sizeof(type) < PAGE_SIZE, \ > > + "data size must not exceed a page"); \ > > + __bpf_tld_var_init(key, &var, sizeof(type)); \ > > +} > > + > > +/** > > + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf > > + * programs. The data will be a thread-specific variable which a user space > > + * program can directly read/write, while bpf programs will need to lookup > > + * the data with the string key same as the variable name. > > + * > > + * @param type The type of the task local data > > + * @param var The name of the task local data as well as the name of the > > + * key. The key string will be truncated if the length exceeds > > + * TASK_LOCAL_DATA_KEY_LEN. > > + */ > > +#define bpf_tld_type_var(type, var) \ > > + bpf_tld_key_type_var(#var, type, var) > > Hiding string obfuscates it too much. > This API doesn't have analogous APIs either in bpf or user space. > So let's make everything explicit. > In this case bpf_tld_key_type_var()-like should be the only api > to declare a variable. > I would call it bpf_tld_var(). > I will keep only one flavor of declaration API. It will let users specify the key string without macro obfuscation. > > + > > +/** > > + * @brief bpf_tld_thread_init() initializes the task local data for the current > > + * thread. All data are undefined from a bpf program's point of view until > > + * bpf_tld_thread_init() is called. 
> > + * > > + * @return 0 on success; negative error number on failure > > + */ > > +int bpf_tld_thread_init(void); > > + > > +#endif > > diff --git a/tools/testing/selftests/bpf/progs/task_local_data.h b/tools/testing/selftests/bpf/progs/task_local_data.h > > new file mode 100644 > > index 000000000000..7358993ee634 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/progs/task_local_data.h > > @@ -0,0 +1,181 @@ > > +#include <errno.h> > > +#include <bpf/bpf_helpers.h> > > + > > +#include "task_local_data_common.h" > > + > > +#define PAGE_IDX_MASK 0x8000 > > + > > +/* Overview > > + * > > + * Task local data (TLD) allows sharing per-task information between users and > > + * bpf programs without explicit storage layout management. TLD APIs use > > + * string keys to access data. Internally, TLD shares user pages through two > > + * UPTRs in a task local storage: udata and umetadata. Data are stored in udata. > > + * Keys and the offsets of the corresponding data in udata are stored in umetadata. > > + * > > + * Usage > > + * > > + * Users should initialize every task local data once for every new task before > > + * using them with bpf_tld_init_var(). It should be done ideally in non-critical > > + * paths first (e.g., sched_ext_ops::init_task) as it compares key strings and > > + * caches the offsets of data. > > + * > > + * First, users should define struct task_local_data_offsets, which will be used > > + * to cache the offsets of task local data. Each member of the struct should > > + * be a short integer with the same name as the key defined in the user space. > > + * Another task local storage map will be created to save the offsets. For example: > > + * > > + * struct task_local_data_offsets { > > + * short priority; > > + * short in_critical_section; > > + * }; > > The use of 'short' is unusual. > The kernel and bpf progs always use either u16 or s16. Will change the use of short to u16. 
> > > + * > > + * Task local data APIs take a pointer to bpf_task_local_data object as the first > > + * argument. The object should be declared as a stack variable and initialized via > > + * bpf_tld_init(). Then, in a bpf program, to cache the offset for a key-value pair, > > + * call bpf_tld_init_var(). For example, in init_task program: > > + * > > + * struct bpf_task_local_data tld; > > + * > > + * err = bpf_tld_init(task, &tld); > > + * if (err) > > + * return 0; > > + * > > + * bpf_tld_init_var(&tld, priority); > > + * bpf_tld_init_var(&tld, in_critical_section); > > + * > > + * Subsequently and in other bpf programs, to lookup task local data, call > > + * bpf_tld_lookup(). For example: > > + * > > + * int *p; > > + * > > + * p = bpf_tld_lookup(&tld, priority, sizeof(int)); > > + * if (p) > > + * // do something depending on *p > > + */ > > + > > +struct task_local_data_offsets; > > + > > +struct { > > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); > > + __uint(map_flags, BPF_F_NO_PREALLOC); > > + __type(key, int); > > + __type(value, struct task_local_data_map_value); > > + __uint(pinning, LIBBPF_PIN_BY_NAME); > > +} task_local_data_map SEC(".maps"); > > + > > +struct { > > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); > > + __uint(map_flags, BPF_F_NO_PREALLOC); > > + __type(key, int); > > + __type(value, struct task_local_data_offsets); > > +} task_local_data_off_map SEC(".maps"); > > + > > +struct bpf_task_local_data { > > + struct task_local_data_map_value *data_map; > > + struct task_local_data_offsets *off_map; > > +}; > > + > > +/** > > + * @brief bpf_tld_init() initializes a bpf_task_local_data object. > > + * > > + * @param task The task_struct of the target task > > + * @param tld A pointer to a bpf_task_local_data object to be initialized > > + * @return -EINVAL if task KV store is not initialized by the user space for this task; > > + * -ENOMEM if cached offset map creation fails. 
In both cases, users can abort, or > > + * conitnue without calling task KV store APIs as if no key-value pairs are set. > > continue > > > + */ > > +__attribute__((unused)) > > +static int bpf_tld_init(struct task_struct *task, struct bpf_task_local_data *tld) > > +{ > > + tld->data_map = bpf_task_storage_get(&task_local_data_map, task, 0, 0); > > + if (!tld->data_map) > > + return -EINVAL; > > + > > + tld->off_map = bpf_task_storage_get(&task_local_data_off_map, task, 0, > > + BPF_LOCAL_STORAGE_GET_F_CREATE); > > + if (!tld->off_map) > > + return -ENOMEM; > > + > > + return 0; > > +} > > + > > +/** > > + * @brief bpf_tld_init_var() lookups the metadata with a key and caches the offset of > > + * the value corresponding to the key. > > + * > > + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init() > > + * @param key The key used to lookup the task KV store. Should be one of the > > + * symbols defined in struct task_local_data_offsets, not a string > > + */ > > +#define bpf_tld_init_var(tld, key) \ > > + ({ \ > > + (tld)->off_map->key = bpf_tld_fetch_off(tld, #key); \ > > + }) > > + > > +__attribute__((unused)) > > +static short bpf_tld_fetch_off(struct bpf_task_local_data *tld, const char *key) > > +{ > > + int i, umetadata_off, umetadata_cnt, udata_start; > > + void *umetadata, *key_i, *off_i; > > + short off = 0; > > + > > + if (!tld->data_map || !tld->data_map->umetadata) > > + goto out; > > + > > + udata_start = tld->data_map->udata_start; > > + umetadata_cnt = tld->data_map->umetadata_cnt; > > + umetadata = tld->data_map->umetadata->meta; > > + > > + bpf_for(i, 0, umetadata_cnt) { > > + umetadata_off = i * sizeof(struct key_meta); > > + if (umetadata_off > PAGE_SIZE - sizeof(struct key_meta)) > > + break; > > + > > + key_i = umetadata + umetadata_off + offsetof(struct key_meta, key); > > + off_i = umetadata + umetadata_off + offsetof(struct key_meta, off); > > + > > + if (!bpf_strncmp(key_i, TASK_LOCAL_DATA_KEY_LEN, 
key)) { > > + off = *(short *)(off_i) + udata_start; > > + if (off >= PAGE_SIZE) > > + off = (off - PAGE_SIZE) | PAGE_IDX_MASK; > > + /* Shift cached offset by 1 so that 0 means not initialized */ > > + off += 1; > > + break; > > + } > > + } > > +out: > > + return off; > > +} > > + > > +/** > > + * @brief bpf_tld_lookup() lookups the task KV store using the cached offset > > + * corresponding to the key. > > + * > > + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init() > > + * @param key The key used to lookup the task KV store. Should be one of the > > + * symbols defined in struct task_local_data_offsets, not a string > > + * @param size The size of the value. Must be a known constant value > > + * @return A pointer to the value corresponding to the key; NULL if the offset > > + * is not cached or the size is too big > > + */ > > +#define bpf_tld_lookup(tld, key, size) __bpf_tld_lookup(tld, (tld)->off_map->key, size) > > +__attribute__((unused)) > > +static void *__bpf_tld_lookup(struct bpf_task_local_data *tld, short cached_off, int size) > > +{ > > + short page_off, page_idx; > > + > > + if (!cached_off--) > > + return NULL; > > + > > + page_off = cached_off & ~PAGE_IDX_MASK; > > + page_idx = !!(cached_off & PAGE_IDX_MASK); > > + > > + if (page_idx) { > > + return (tld->data_map->udata[1].page && page_off < PAGE_SIZE - size) ? > > + (void *)tld->data_map->udata[1].page + page_off : NULL; > > + } else { > > + return (tld->data_map->udata[0].page && page_off < PAGE_SIZE - size) ? 
> > + (void *)tld->data_map->udata[0].page + page_off : NULL; > > + } > > +} > > diff --git a/tools/testing/selftests/bpf/task_local_data_common.h b/tools/testing/selftests/bpf/task_local_data_common.h > > new file mode 100644 > > index 000000000000..2a0bb724c77c > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/task_local_data_common.h > > @@ -0,0 +1,41 @@ > > +#ifndef __BPF_TASK_KV_STORE_COMMON_H__ > > +#define __BPF_TASK_KV_STORE_COMMON_H__ > > + > > +#ifdef __BPF__ > > +struct data_page *dummy_data_page; > > +struct meta_page *dummy_meta_page; > > +#else > > +#define __uptr > > +#endif > > + > > + > > +#define TASK_LOCAL_DATA_MAP_PIN_PATH "/sys/fs/bpf/task_local_data_map" > > +#define TASK_LOCAL_DATA_KEY_LEN 62 > > +#define PAGE_SIZE 4096 > > We have > enum page_size_enum { > __PAGE_SIZE = PAGE_SIZE > }; > inside kernel/bpf/core.c > > and bpf progs that include vmlinux.h can use it directly as __PAGE_SIZE. > Thanks for the tip. I will get the page size from vmlinux.h. > Let's think through upfront how the whole thing works on > architectures with different page size. Other than the number of data entries users can store, do you see potential issues? > > > +#define PAGE_MASK (~(PAGE_SIZE - 1)) > > + > > +struct data_page { > > + char data[PAGE_SIZE]; > > +}; > > + > > +struct data_page_entry { > > + struct data_page __uptr *page; > > +}; > > + > > +struct key_meta { > > + char key[TASK_LOCAL_DATA_KEY_LEN]; > > + short off; > > +}; > > + > > +struct meta_page { > > + struct key_meta meta[64]; > > +}; > > + > > +struct task_local_data_map_value { > > + struct data_page_entry udata[2]; > > + struct meta_page __uptr *umetadata; > > + short umetadata_cnt; > > + short udata_start; > > +}; > > + > > +#endif > > -- > > 2.47.1 > >