On Thu, May 1, 2025 at 7:22 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> I wasn't trying to optimize those few bytes taken by szs, tbh.
> Allocating from the end of the page bakes in the assumption that we
> won't ever need more than one page. I don't know if I'd do that. But
> we can just track "next available offset" instead, so it doesn't
> really matter much.

Right. That works too.

> >
> > I'm not quite sure how different processes can do it locklessly.
>
> There are no different processes, it's all one process, many
> threads... Or is that what you meant? tld_metadata is *per process*,
> tld_data is *per thread*. Processes don't need to coordinate anything
> between themselves, only threads within the process.

Yeah. I confused myself into thinking that we need to support this
across fork/exec. Those would be different processes with their own
task local storage map elements, so any kind of signaling into a child
has to be done in user space; the bpf tls map won't help there.
So using one "tld" library in multiple threads within a single process
works. No need to complicate things by asking the kernel tls map to
coordinate. The "tld" library can keep whatever state it needs, do
centralized locking, etc.

> As for how I'd do offset allocation and key addition locklessly. You
> are right that it can't be done completely locklessly, but just
> looping and yielding probably would be fine.
>
> Then the sequence of adding the key would be something like below.
> I've modified tld_metadata a bit to make this simpler and more
> economical (and I fixed the definition of keys as an array of arrays
> of chars, oops):
>
> struct tld_metadata {
>         int cnt;
>         int next_off;
>         char keys[MAX_KEY_CNT][MAX_KEY_LEN];
>         __s16 offs[MAX_KEY_CNT];
> };
>
> struct tld_metadata *m = ...;
> const char *new_key = ...;
> int i = 0;
>
> /* all m->offs[i] are set to -1 on creation */
> again:
>
> int key_cnt = m->cnt;
> for (; i < key_cnt; i++) {
>         while (m->offs[i] < 0) /* update in progress */
>                 sched_yield();
>
>         if (strcmp(m->keys[i], new_key) == 0)
>                 return m->offs[i];
> }
>
> /* we raced, key might have been added already, recheck, but keep i */
> if (!cmpxchg(&m->cnt, key_cnt, key_cnt + 1))
>         goto again;
>
> /* slot key_cnt is ours, we need to calculate and assign offset */
> int new_off = m->next_off;
> m->next_off = new_off + key_sz;
>
> m->keys[key_cnt][0] = '\0';
> strncat(m->keys[key_cnt], new_key, MAX_KEY_LEN - 1);
>
> /* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */
>
> m->offs[key_cnt] = new_off; /* this finalizes the key -> offset assignment */
>
> /* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */
>
> return new_off; /* we are done */
>
> Something like that. There is that looping and yield to not miss
> someone else winning the race and adding a key, so that's the locking
> part. But given that adding a key definition is supposed to be a
> one-time operation (per key), I don't think we should be fancy with
> locking.

Something like that should work.
I wish there was some trivial futex wrapper in a .h that could be used
instead of the pthread_mutex baggage.
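
FWIW, a minimal sketch of what such a wrapper could look like. This is
just the classic two-state futex mutex from Ulrich Drepper's "Futexes
Are Tricky", done with C11 atomics and the raw futex(2) syscall. The
sys_futex()/futex_lock()/futex_unlock() names and the lock word layout
are made up for illustration; nothing like this exists in a header
today:

#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* raw futex(2) wrapper, since glibc doesn't export one */
static long sys_futex(void *uaddr, int op, uint32_t val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* lock word: 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
static void futex_lock(_Atomic uint32_t *f)
{
        uint32_t c = 0;

        /* fast path: uncontended 0 -> 1 */
        if (atomic_compare_exchange_strong(f, &c, 1))
                return;

        /* slow path: make sure the word says "contended" before sleeping */
        if (c != 2)
                c = atomic_exchange(f, 2);
        while (c != 0) {
                sys_futex(f, FUTEX_WAIT_PRIVATE, 2);
                c = atomic_exchange(f, 2);
        }
}

static void futex_unlock(_Atomic uint32_t *f)
{
        /* 2 means someone may be sleeping in FUTEX_WAIT, wake one waiter */
        if (atomic_exchange(f, 0) == 2)
                sys_futex(f, FUTEX_WAKE_PRIVATE, 1);
}

With something like that, tld_metadata could carry a hypothetical
_Atomic __u32 lock word, and the key-addition path could take
futex_lock()/futex_unlock() around the slot assignment instead of the
cmpxchg + sched_yield() dance, if that loop ever turns out to be too
fragile.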