Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:

> On Mon, Sep 1, 2025 at 4:28 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
>>
>> On Fri, Aug 29, 2025 at 01:07:35AM -0700, Ankur Arora wrote:
>> > diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
>> > index a385603436e9..ce8feadeb9a9 100644
>> > --- a/arch/arm64/include/asm/rqspinlock.h
>> > +++ b/arch/arm64/include/asm/rqspinlock.h
>> > @@ -3,6 +3,9 @@
>> >  #define _ASM_RQSPINLOCK_H
>> >
>> >  #include <asm/barrier.h>
>> > +
>> > +#define res_smp_cond_load_acquire_waiting() arch_timer_evtstrm_available()
>>
>> More on this below, I don't think we should define it.
>>
>> > diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
>> > index 5ab354d55d82..8de1395422e8 100644
>> > --- a/kernel/bpf/rqspinlock.c
>> > +++ b/kernel/bpf/rqspinlock.c
>> > @@ -82,6 +82,7 @@ struct rqspinlock_timeout {
>> >  	u64 duration;
>> >  	u64 cur;
>> >  	u16 spin;
>> > +	u8 wait;
>> >  };
>> >
>> >  #define RES_TIMEOUT_VAL 2
>> > @@ -241,26 +242,20 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
>> >  }
>> >
>> >  /*
>> > - * Do not amortize with spins when res_smp_cond_load_acquire is defined,
>> > - * as the macro does internal amortization for us.
>> > + * Only amortize with spins when we don't have a waiting implementation.
>> >   */
>> > -#ifndef res_smp_cond_load_acquire
>> >  #define RES_CHECK_TIMEOUT(ts, ret, mask) \
>> >  	({ \
>> > -		if (!(ts).spin++) \
>> > +		if ((ts).wait || !(ts).spin++) \
>> >  			(ret) = check_timeout((lock), (mask), &(ts)); \
>> >  		(ret); \
>> >  	})
>> > -#else
>> > -#define RES_CHECK_TIMEOUT(ts, ret, mask) \
>> > -	({ (ret) = check_timeout((lock), (mask), &(ts)); })
>> > -#endif
>>
>> IIUC, RES_CHECK_TIMEOUT in the current res_smp_cond_load_acquire() usage
>> doesn't amortise the spins, as the comment suggests, but rather the
>> calls to check_timeout(). This is fine, it matches the behaviour of
>> smp_cond_load_relaxed_timewait() you introduced in the first patch. The
>> only difference is the number of spins - 200 (matching poll_idle) vs 64K
>> above. Does 200 work for the above?
>>
>> >  /*
>> >   * Initialize the 'spin' member.
>> >   * Set spin member to 0 to trigger AA/ABBA checks immediately.
>> >   */
>> > -#define RES_INIT_TIMEOUT(ts) ({ (ts).spin = 0; })
>> > +#define RES_INIT_TIMEOUT(ts) ({ (ts).spin = 0; (ts).wait = res_smp_cond_load_acquire_waiting(); })
>>
>> First of all, I don't really like the smp_cond_load_acquire_waiting(),
>> that's an implementation detail of smp_cond_load_*_timewait() that
>> shouldn't leak outside. But more importantly, RES_CHECK_TIMEOUT() is
>> also used outside the smp_cond_load_acquire_timewait() condition. The
>> (ts).wait check only makes sense when used together with the WFE
>> waiting.
>
> +1 to the above.

Ack.

> Penalizing all other architectures with pointless runtime check:
>
>> -		if (!(ts).spin++) \
>> +		if ((ts).wait || !(ts).spin++) \
>
> is not acceptable.

Is it still a penalty if the context is a busy-wait loop? Oddly enough,
the compiler does not eliminate this check on x86 (given that it is
statically defined to be 0 and ts does not escape the function).

--
ankur