Re: [PATCH bpf-next v4 1/3] selftests/bpf: Introduce task local data

Amery Hung <ameryhung@xxxxxxxxx> · Tue, 20 May 2025 00:36:48 -0700

On Fri, May 16, 2025 at 3:22 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, May 16, 2025 at 1:41 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
> >
> > > > +       size = round_up(size, 8);
> > >
> > > why roundup ? and to 8 in particular?
> > > If user space wants byte size keys, why not let it?
> >
> > I will remove it. This was to prevent breaking using TLD in atomic
> > operations, but it should be very unlikely as they are thread
> > specific.
>
> You mean for a case where one part of the app (like a shared library)
> is using u32, but the other is using u64 and doing atomic ops on it?
>
> Make sense to align the off set by tld_create_key(),
> but it can be done without rounding up all previous keys to 8.
> 63 bytes in the header are wasted. So use 2 as an offset.
> A single cmpxchg 4 byte can update cnt+offset.
> Actually why store size in each tld_metadata ?
> Won't the logic will be simpler if it's an offset ?
> bpf side tld_fetch_key() wouldn't need to count.
>

I changed to size since metadata is initialized to 0 and size == 0 can
be used to signal pending metadata update, while 0 is a valid offset.

I will save offset in metadata in the next respin. Tejun suggested a
run-length key encoding, and there are other fields in the metadata
that can be used for the signaling.

> > > > +               if (i < cnt) {
> > > > +                       /*
> > > > +                        * Pending tld_create_key() uses size to signal if the metadata has
> > > > +                        * been fully updated.
> > > > +                        */
> > > > +                       while (!(sz = __atomic_load_n(&tld_metadata_p->metadata[i].size,
> > > > +                                                     __ATOMIC_ACQUIRE)))
> > > > +                               sched_yield();
> > > > +
> > > > +                       if (!strncmp(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN))
> > > > +                               return (tld_key_t) {.off = -EEXIST};
> > > > +
> > > > +                       off += sz;
> > > > +                       continue;
> > > > +               }
> > > > +
> > > > +               if (off + size > TLD_DATA_SIZE)
> > > > +                       return (tld_key_t) {.off = -E2BIG};
> > > > +
> > > > +               /*
> > > > +                * Only one tld_create_key() can increase the current cnt by one and
> > > > +                * takes the latest available slot. Other threads will check again if a new
> > > > +                * TLD can still be added, and then compete for the new slot after the
> > > > +                * succeeding thread update the size.
> > > > +                */
> > > > +               if (!__atomic_compare_exchange_n(&tld_metadata_p->cnt, &cnt, cnt + 1, true,
> > > > +                                                __ATOMIC_RELAXED, __ATOMIC_RELAXED))
> > >
> > > weak and relaxed/relaxed ?
> >
> > I can't see reordering issue with cnt so I choose to use relax. I can
> > change to strong acq/rel just to be safe.
> >
> > > That's unusual.
> > > I don't know what it is supposed to do.
> > > I think weak=false and __ATOMIC_ACQUIRE, __ATOMIC_RELAXED
> > > would look as expected.
> > >
> >
> > Do you mean weak=false and __ATOMIC_RELAXED, __ATOMIC_ACQUIRE?
>
> no idea. I just grepped the kernel and saw:
> TEST_KERNEL_LOCKED(atomic_builtin_with_memorder,
>                    __atomic_compare_exchange_n(flag, &v, 1, 0,
> __ATOMIC_ACQUIRE, __ATOMIC_RELAXED),
>                    __atomic_store_n(flag, 0, __ATOMIC_RELEASE));
> TEST_KERNEL_LOCKED(atomic_builtin_wrong_memorder,
>                    __atomic_compare_exchange_n(flag, &v, 1, 0,
> __ATOMIC_RELAXED, __ATOMIC_RELAXED),
>                    __atomic_store_n(flag, 0, __ATOMIC_RELAXED));
>
> I'd just use __ATOMIC_SEQ_CST everywhere.
> Speed is not important here.

Make sense. I will use __ATOMIC_SEQ_CST. Thanks for the suggestion.

>
> >
> > > > +                       goto retry;
> > > > +
> > > > +               strncpy(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN);
> > > > +               __atomic_store_n(&tld_metadata_p->metadata[i].size, size, __ATOMIC_RELEASE);
> > > > +               return (tld_key_t) {.off = off};
> > > > +       }
> > > > +
> > > > +       return (tld_key_t) {.off = -ENOSPC};
> > > > +}
> > > > +
> > > > +__attribute__((unused))
> > > > +static inline bool tld_key_is_err(tld_key_t key)
> > > > +{
> > > > +       return key.off < 0;
> > > > +}
> > > > +
> > > > +__attribute__((unused))
> > > > +static inline int tld_key_err_or_zero(tld_key_t key)
> > > > +{
> > > > +       return tld_key_is_err(key) ? key.off : 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * tld_get_data() - Gets a pointer to the TLD associated with the key.
> > > > + *
> > > > + * @map_fd: A file descriptor of the underlying task local storage map,
> > > > + * tld_data_map
> > > > + * @key: A key object returned by tld_create_key().
> > > > + *
> > > > + * Returns a pointer to the TLD if the key is valid; NULL if no key has been
> > > > + * added, not enough memory for TLD for this thread, or the key is invalid.
> > > > + *
> > > > + * Threads that call tld_get_data() must call tld_free() on exit to prevent
> > > > + * memory leak.
> > > > + */
> > > > +__attribute__((unused))
> > > > +static void *tld_get_data(int map_fd, tld_key_t key)
> > > > +{
> > > > +       if (!READ_ONCE(tld_metadata_p))
> > > > +               return NULL;
> > > > +
> > > > +       if (!tld_data_p && __tld_init_data(map_fd))
> > > > +               return NULL;
> > >
> > > Why call it again?
> > > tld_create_key() should have done it, no?
> > >
> >
> > A TLD is created by calling tld_create_key() once. Then, threads may
> > call tld_get_data() to get their thread-specific TLD. So it is
> > possible for a thread to call tld_get_data() with tld_data_p==NULL.
>
> I see. Please add a comment.

I will explain it in the comment.

>
> > > > +
> > > > +       return tld_data_p->data + key.off;
> > > > +}
> > > > +
> > > > +/**
> > > > + * tld_free() - Frees task local data memory of the calling thread
> > > > + */
> > > > +__attribute__((unused))
> > > > +static void tld_free(void)
> > > > +{
> > > > +       if (tld_data_p)
> > > > +               free(tld_data_p);
> > > > +}
> > >
> > > Since this .h allocates tld_metadata_p, it probably needs
> > > a helper to free it too.
> > >
> > > > +
> > > > +#endif /* __TASK_LOCAL_DATA_H */
> > > > diff --git a/tools/testing/selftests/bpf/progs/task_local_data.bpf.h b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> > > > new file mode 100644
> > > > index 000000000000..5f48e408a5e5
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> > > > @@ -0,0 +1,220 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef __TASK_LOCAL_DATA_BPF_H
> > > > +#define __TASK_LOCAL_DATA_BPF_H
> > > > +
> > > > +/*
> > > > + * Task local data is a library that facilitates sharing per-task data
> > > > + * between user space and bpf programs.
> > > > + *
> > > > + *
> > > > + * PREREQUISITE
> > > > + *
> > > > + * A TLD, an entry of data in task local data, first needs to be created by the
> > > > + * user space. This is done by calling user space API, tld_create_key(), with
> > > > + * the name of the TLD and the size.
> > > > + *
> > > > + *     tld_key_t prio, in_cs;
> > > > + *
> > > > + *     prio = tld_create_key("priority", sizeof(int));
> > > > + *     in_cs = tld_create_key("in_critical_section", sizeof(bool));
> > > > + *
> > > > + * A key associated with the TLD, which has an opaque type tld_key_t, will be
> > > > + * returned. It can be used to get a pointer to the TLD in the user space by
> > > > + * calling tld_get_data().
> > > > + *
> > > > + *
> > > > + * USAGE
> > > > + *
> > > > + * Similar to user space, bpf programs locate a TLD using the same key.
> > > > + * tld_fetch_key() allows bpf programs to retrieve a key created in the user
> > > > + * space by name, which is specified in the second argument of tld_create_key().
> > > > + * tld_fetch_key() additionally will cache the key in a task local storage map,
> > > > + * tld_key_map, to avoid performing costly string comparisons every time when
> > > > + * accessing a TLD. This requires the developer to define the map value type of
> > > > + * tld_key_map, struct tld_keys. It only needs to contain keys used by bpf
> > > > + * programs in the compilation unit.
> > > > + *
> > > > + * struct tld_keys {
> > > > + *     tld_key_t prio;
> > > > + *     tld_key_t in_cs;
> > > > + * };
> > > > + *
> > > > + * Then, for every new task, a bpf program will fetch and cache keys once and
> > > > + * for all. This should be done ideally in a non-critical path (e.g., in
> > > > + * sched_ext_ops::init_task).
> > > > + *
> > > > + *     struct tld_object tld_obj;
> > > > + *
> > > > + *     err = tld_object_init(task, &tld);
> > > > + *     if (err)
> > > > + *         return 0;
> > > > + *
> > > > + *     tld_fetch_key(&tld_obj, "priority", prio);
> > > > + *     tld_fetch_key(&tld_obj, "in_critical_section", in_cs);
> > > > + *
> > > > + * Note that, the first argument of tld_fetch_key() is a pointer to tld_object.
> > > > + * It should be declared as a stack variable and initialized via tld_object_init().
> > > > + *
> > > > + * Finally, just like user space programs, bpf programs can get a pointer to a
> > > > + * TLD by calling tld_get_data(), with cached keys.
> > > > + *
> > > > + *     int *p;
> > > > + *
> > > > + *     p = tld_get_data(&tld_obj, prio, sizeof(int));
> > > > + *     if (p)
> > > > + *         // do something depending on *p
> > > > + */
> > > > +#include <errno.h>
> > > > +#include <bpf/bpf_helpers.h>
> > > > +
> > > > +#define TLD_DATA_SIZE __PAGE_SIZE
> > > > +#define TLD_DATA_CNT 63
> > > > +#define TLD_NAME_LEN 62
> > > > +
> > > > +typedef struct {
> > > > +       __s16 off;
> > > > +} tld_key_t;
> > > > +
> > > > +struct u_tld_data *dummy_data;
> > > > +struct u_tld_metadata *dummy_metadata;
> > > > +
> > > > +struct tld_metadata {
> > > > +       char name[TLD_NAME_LEN];
> > > > +       __u16 size;
> > > > +};
> > > > +
> > > > +struct u_tld_metadata {
> > > > +       __u8 cnt;
> > > > +       __u8 padding[63];
> > > > +       struct tld_metadata metadata[TLD_DATA_CNT];
> > > > +};
> > > > +
> > > > +struct u_tld_data {
> > > > +       char data[TLD_DATA_SIZE];
> > > > +};
> > > > +
> > > > +struct tld_map_value {
> > > > +       struct u_tld_data __uptr *data;
> > > > +       struct u_tld_metadata __uptr *metadata;
> > > > +};
> > > > +
> > > > +struct tld_object {
> > > > +       struct tld_map_value *data_map;
> > > > +       struct tld_keys *key_map;
> > > > +};
> > > > +
> > > > +/*
> > > > + * Map value of tld_key_map for caching keys. Must be defined by the developer.
> > > > + * Members should be tld_key_t and passed to the 3rd argument of tld_fetch_key().
> > > > + */
> > > > +struct tld_keys;
> > > > +
> > > > +struct {
> > > > +       __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > > > +       __uint(map_flags, BPF_F_NO_PREALLOC);
> > > > +       __type(key, int);
> > > > +       __type(value, struct tld_map_value);
> > > > +} tld_data_map SEC(".maps");
> > > > +
> > > > +struct {
> > > > +       __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > > > +       __uint(map_flags, BPF_F_NO_PREALLOC);
> > > > +       __type(key, int);
> > > > +       __type(value, struct tld_keys);
> > > > +} tld_key_map SEC(".maps");
> > > > +
> > > > +/**
> > > > + * tld_object_init() - Initializes a tld_object.
> > > > + *
> > > > + * @task: The task_struct of the target task
> > > > + * @tld_obj: A pointer to a tld_object to be initialized
> > > > + *
> > > > + * Returns 0 on success; -ENODATA if the task has no TLD; -ENOMEM if the creation
> > > > + * of tld_key_map fails
> > > > + */
> > > > +__attribute__((unused))
> > > > +static int tld_object_init(struct task_struct *task, struct tld_object *tld_obj)
> > > > +{
> > > > +       tld_obj->data_map = bpf_task_storage_get(&tld_data_map, task, 0, 0);
> > > > +       if (!tld_obj->data_map)
> > > > +               return -ENODATA;
> > > > +
> > > > +       tld_obj->key_map = bpf_task_storage_get(&tld_key_map, task, 0,
> > > > +                                               BPF_LOCAL_STORAGE_GET_F_CREATE);
> > > > +       if (!tld_obj->key_map)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * tld_fetch_key() - Fetches the key to a TLD by name. The key has to be created
> > > > + * by user space first with tld_create_key().
> > > > + *
> > > > + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> > > > + * @name: The name of the key associated with a TLD
> > > > + * @key: The key in struct tld_keys to be saved to
> > > > + *
> > > > + * Returns a positive integer on success; 0 otherwise
> > > > + */
> > > > +#define tld_fetch_key(tld_obj, name, key)                                      \
> > > > +       ({                                                                      \
> > > > +               (tld_obj)->key_map->key.off = __tld_fetch_key(tld_obj, name);   \
> > > > +       })
> > > > +
> > > > +__attribute__((unused))
> > > > +static u16 __tld_fetch_key(struct tld_object *tld_obj, const char *name)
> > > > +{
> > > > +       int i, meta_off, cnt;
> > > > +       void *metadata, *nm, *sz;
> > > > +       u16 off = 0;
> > > > +
> > > > +       if (!tld_obj->data_map || !tld_obj->data_map->metadata)
> > > > +               return 0;
> > > > +
> > > > +       cnt = tld_obj->data_map->metadata->cnt;
> > > > +       metadata = tld_obj->data_map->metadata->metadata;
> > > > +
> > > > +       bpf_for(i, 0, cnt) {
> > > > +               meta_off = i * sizeof(struct tld_metadata);
> > > > +               if (meta_off > TLD_DATA_SIZE - offsetof(struct u_tld_metadata, metadata)
> > > > +                                          - sizeof(struct tld_metadata))
> > > > +                       break;
> > > > +
> > > > +               nm = metadata + meta_off + offsetof(struct tld_metadata, name);
> > > > +               sz = metadata + meta_off + offsetof(struct tld_metadata, size);
> > > > +
> > > > +               /*
> > > > +                * Reserve 0 for uninitialized keys. Increase the offset of a valid key
> > > > +                * by one and subtract it later in tld_get_data().
> > > > +                */
> > > > +               if (!bpf_strncmp(nm, TLD_NAME_LEN, name))
> > > > +                       return off + 1;
> > >
> > > I think all this +1, -1 dance is doing is helping to catch an
> > > error when tld_get_data() is called without tld_fetch_key().
> > > I feel this is too defensive.
> > >
> > > Let tld_fetch_key() return proper negative error code.
> > >
> >
> > I can certainly return negative error code.
> >
> > But for the +1, -1 logic, I think is a simpler way to differentiate an
> > uninitialized key in tld_key_map from the first TLD (both key.off ==
> > 0). After all, bpf programs can call tld_get_data() without
> > tld_fetch_key().
>
> I'm saying we don't need to catch this case.
> progs should not call tld_get_data() without tld_fetch_key().
> If they do, it's a bug.
>

Got it. I will document this in the comment of tld_get_data().

> >
> > > > +
> > > > +               off += *(u16 *)sz;
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * tld_get_data() - Retrieves a pointer to the TLD associated with the key.
> > > > + *
> > > > + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> > > > + * @key: The key of a TLD saved in tld_maps
> > > > + * @size: The size of the TLD. Must be a known constant value
> > > > + *
> > > > + * Returns a pointer to the TLD data associated with the key; NULL if the key
> > > > + * is not valid or the size is too big
> > > > + */
> > > > +#define tld_get_data(tld_obj, key, size) \
> > > > +       __tld_get_data(tld_obj, (tld_obj)->key_map->key.off - 1, size)
> > > > +
> > > > +__attribute__((unused))
> > > > +__always_inline void *__tld_get_data(struct tld_object *tld_obj, u32 off, u32 size)
> > > > +{
> > > > +       return (tld_obj->data_map->data && off >= 0 && off < TLD_DATA_SIZE - size) ?
> > > > +               (void *)tld_obj->data_map->data + off : NULL;
> > >
> > > If we require users to call tld_fetch_key() first and check for errors
> > > tld_get_data() can be faster:
> > > #define tld_get_data(tld_obj, key)
> > >    (void *)tld_obj->data_map->data + tld_obj->key_map->key.off
> > >
> >
> > tld_get_data() can be called in a bpf program without tld_fetch_key(),
> > so checking tld_obj->data_map->data is needed as the first load from
> > tld_obj->data_map->data yields a "mem_or_null".
> >
> > I did try to save this uptr "mem" after null check to stack (e.g., in
> > a tld_object) so that we can save subsequent checks. However, the
> > compiler sometime will do a fresh load from map_ptr when reading
> > tld_obj->data_map->data. Might need some work or trick to make it
> > happen.
>
> likely because you do tld_obj->data_map->data twice.
>
> > > I wouldn't bother with extra checks, especially for size.
> > >
> >
> > It needs to be bound-checked. If tld_get_data() doesn't do it, bpf
> > programs have to do it before acceess. Otherwise:
> >
> > ; return (tld_obj->data_map->data && off >= 0) ? @ task_local_data.bpf.h:218
> > 25: (bf) r3 = r1                      ; R1_w=mem(sz=4096) R3_w=mem(sz=4096)
> > 26: (0f) r3 += r2                     ;
> > R2_w=scalar(smin=0,smax=umax=0xffffffff,smin32=0xffff7fff,smax32=32766,var_off=(0x0;
> > 0xffffffff)) R3_w=mem(sz=4096,smin=0,smax=umax=0xffffffff,var_off=(0x0;
> > 0xfffffff)
> > ; test_value1 = *int_p; @ test_task_local_data.c:63
> > 27: (61) r2 = *(u32 *)(r3 +0)
> > R3 unbounded memory access, make sure to bounds check any such access
>
> That's easy to fix.
> Then something like:
> #define tld_get_data(tld_obj, key) \
>  ({
>     void * data = tld_obj->data_map->data;
>     if (data)
>          data += tld_obj->key_map->key.off & (PAGE_SIZE - 1);
>     data;
>   })
>
> size is really not needed. The verifier sees it as one page.
> Bad bpf prog can write into the wrong key and the verifier cannot stop it.
>

key.off is a variable offset, so the verifier may assume key.off ==
PAGE_SIZE - 1. If a bpf program tries to dereference a pointer
returned by the proposed tld_get_data() as an int * without bound
check, the verifier will still consider this a potential out-of-bound
access:

invalid access to memory, mem_size=4096 off=4095 size=4

I think if there needs to be a bound check anyways, hiding it
tld_get_data() makes the user written part less complex.

> > > Bigger question.. can we cache the whole pointer for each key ?
> > > and then
> > > #define tld_get_data(tld_obj, key) ld_obj->key_map->key
> >
> > Maybe define the member type of tld_key_map as __uptr and allow bpf
> > programs to update a uptr field with a valid uptr?
>
> yeah. That indeed gets complicated. Maybe it's possible with some
> verifier changes, but let's not go there yet.
> The tld_get_data() proposed above is speedy enough.