Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()

Ankur Arora <ankur.a.arora@xxxxxxxxxx> · Mon, 11 Aug 2025 14:15:56 -0700

[ Added Rafael, Daniel. ]

Catalin Marinas <catalin.marinas@xxxxxxx> writes:

> On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..d33c2701c9ee 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,101 @@ do {									\
>>  })
>>  #endif
>>
>> +#ifndef SMP_TIMEWAIT_SPIN_BASE
>> +#define SMP_TIMEWAIT_SPIN_BASE		16
>> +#endif
>> +
>> +/*
>> + * Policy handler that adjusts the number of times we spin or
>> + * wait for cacheline to change before evaluating the time-expr.
>> + *
>> + * The generic version only supports spinning.
>> + */
>> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
>> +				       u32 *spin, bool *wait, u64 slack)
>> +{
>> +	if (now >= end)
>> +		return 0;
>> +
>> +	*spin = SMP_TIMEWAIT_SPIN_BASE;
>> +	*wait = false;
>> +	return now;
>> +}
>> +
>> +#ifndef __smp_cond_policy
>> +#define __smp_cond_policy ___smp_cond_spinwait
>> +#endif
>> +
>> +/*
>> + * Non-spin primitive that allows waiting for stores to an address,
>> + * with support for a timeout. This works in conjunction with an
>> + * architecturally defined policy.
>> + */
>> +#ifndef __smp_timewait_store
>> +#define __smp_timewait_store(ptr, val)	do { } while (0)
>> +#endif
>> +
>> +#ifndef __smp_cond_load_relaxed_timewait
>> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
>> +					 time_expr, time_end,		\
>> +					 slack) ({			\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
>> +	u64 __prev = 0, __end = (time_end);				\
>> +	u64 __slack = slack;						\
>> +	bool __wait = false;						\
>> +									\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (++__n < __spin)					\
>> +			continue;					\
>> +		if (!(__prev = policy((time_expr), __prev, __end,	\
>> +					  &__spin, &__wait, __slack)))	\
>> +			break;						\
>> +		if (__wait)						\
>> +			__smp_timewait_store(__PTR, VAL);		\
>> +		__n = 0;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>> +#endif
>
> TBH, this still looks over-engineered to me, especially with the second
> patch trying to reduce the spin loops based on the remaining time. Does
> any of the current users of this interface need it to get more precise?

No, neither rqspinlock nor poll_idle() really cares about precision.
And, even in this series, the slack is only useful for the waiting
implementation.

> Also I feel the spinning added to poll_idle() is more of an architecture
> choice as some CPUs could not cope with local_clock() being called too
> frequently.

Just on the frequency point -- I think it might be a more general
problem than just on specific architectures.

Architectures with GENERIC_SCHED_CLOCK could use a multitude of
clocksources and from a quick look some of them do iomem reads.
(AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
itself, so the choice of clock might not be an arch choice at all.)

Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
we might be using tsc or jiffies or paravirt-clock all of which would
have very different performance characteristics. Or, the caller might
just be using a clock more expensive than local_clock(); rqspinlock,
for instance, uses ktime_get_mono_fast_ns().

So, I feel we do need a generic rate limiter.
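
To illustrate what rate limiting means here, a minimal userspace
sketch (not code from this series): the clock is only read once every
SPIN_BASE polls of the variable. The names poll_rate_limited(),
struct fake_clock and fake_clock_read() are hypothetical, and the
fake clock only stands in for an expensive clock so that its reads
can be counted.

```c
#include <stdint.h>

#define SPIN_BASE 16u	/* hypothetical; mirrors SMP_TIMEWAIT_SPIN_BASE */

/* Hypothetical stand-in for an expensive clock; counts its reads. */
struct fake_clock {
	uint64_t now;
	uint64_t reads;
};

static uint64_t fake_clock_read(struct fake_clock *clk)
{
	clk->reads++;
	return clk->now++;	/* time advances by one unit per read */
}

/*
 * Poll *flag until it is nonzero or the clock reaches 'end'.
 * The clock is consulted only once every SPIN_BASE polls.
 * Returns the last value read from *flag.
 */
static int poll_rate_limited(const volatile int *flag,
			     struct fake_clock *clk, uint64_t end)
{
	uint32_t n = 0;
	int val;

	for (;;) {
		val = *flag;
		if (val)
			break;
		if (++n < SPIN_BASE)
			continue;
		if (fake_clock_read(clk) >= end)
			break;
		n = 0;
	}
	return val;
}
```

With the flag never set and end = 4, this polls 80 times but reads
the clock only 5 times.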

> The above generic implementation takes a spin into
> consideration even if an arch implementation doesn't need it (e.g. WFET
> or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
> complicated for the generic implementation.

Agree with the last point. My thought was that it might be okay to always
optimistically spin a little, just because WFE*/MWAITX etc might (?)
have an entry/exit cost even when the wakeup is immediate.

Though the code is wrong in that it always waits right after evaluating
the policy. I should have done something like this instead:

+#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,       \
+                                        time_expr, time_end,           \
+                                        slack) ({                      \
+       typeof(ptr) __PTR = (ptr);                                      \
+       __unqual_scalar_typeof(*ptr) VAL;                               \
+       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;                   \
+       u64 __prev = 0, __end = (time_end);                             \
+       u64 __slack = slack;                                            \
+       bool __wait = false;                                            \
+                                                                       \
+       for (;;) {                                                      \
+               VAL = READ_ONCE(*__PTR);                                \
+               if (cond_expr)                                          \
+                       break;                                          \
+               cpu_relax();                                            \
+               if (++__n < __spin)                                     \
+                       continue;                                       \
+               if (__wait)                                             \
+                       __smp_timewait_store(__PTR, VAL);               \
+               if (!(__prev = policy((time_expr), __prev, __end,       \
+                                         &__spin, &__wait, __slack)))  \
+                       break;                                          \
+               __n = 0;                                                \
+       }                                                               \
+       (typeof(*ptr))VAL;                                              \
+})

> Can we instead have the generic implementation without any spinning?
> Just polling a variable with cpu_relax() like
> smp_cond_load_acquire/relaxed() with the additional check for time. We
> redefine it in the arch code.
>
>> +#define __check_time_types(type, a, b)			\
>> +		(__same_type(typeof(a), type) &&	\
>> +		 __same_type(typeof(b), type))
>> +
>> +/**
>> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> + * guarantees until a timeout expires.
>> + * @ptr: pointer to the variable to wait on
>> + * @cond: boolean expression to wait for
>> + * @time_expr: monotonic expression that evaluates to the current time
>> + * @time_end: end time, compared against time_expr
>> + * @slack: how much timer overshoot can the caller tolerate?
>> + * Useful for when we go into wait states. A value of 0 indicates a high
>> + * tolerance.
>> + *
>> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
>> + * with no mandated precision.
>> + *
>> + * Equivalent to using READ_ONCE() on the condition variable.
>> + */
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
>> +				       time_end, slack) ({		\
>> +	__unqual_scalar_typeof(*ptr) _val;				\
>> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
>> +			 "incompatible time units");			\
>> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
>> +						__smp_cond_policy,	\
>> +						time_expr, time_end,	\
>> +						slack);			\
>> +	(typeof(*ptr))_val;						\
>> +})
>
> Looking at the current user of the acquire variant - rqspinlock, it does
> not even bother with a time_expr but rather added the time condition to
> cond_expr. I don't think it has any "slack" requirements, only that
> there's no deadlock eventually.

So, that code only uses smp_cond_load_*_timewait() on arm64. Everywhere
else it just uses smp_cond_load_acquire() and because it jams both
of these interfaces together, it doesn't really use time_expr.

But, it needs more extensive rework so all platforms can use
__smp_cond_load_acquire_timewait with the deadlock check folded
inside its own policy handler.

Anyway let me detail that in my reply to your other mail.

> About poll_idle(), are there any slack requirement or we get away
> without?

I don't believe there are any slack requirements. Definitely not for
rqspinlock (given that it has a large timeout) and I believe also
not for poll_idle() since a timeout delay only leads to a slightly
delayed deeper sleep.

Question for Rafael, Daniel: With smp_cond_load_relaxed_timewait(), when
used in waiting mode instead of via the usual cpu_relax() spin, we
could overshoot by an architecturally defined granularity.
On arm64, that could be ~100us in the worst case. Do we have hard
requirements about timer overshoot in poll_idle()?

> I think we have two ways forward (well, at least):
>
> 1. Clearly define what time_end is and we won't need a time_expr at all.
>    This may work for poll_idle(), not sure about rqspinlock. The
>    advantage is that we can drop the 'slack' argument since none of the
>    current users seem to need it. The downside is that we need to know
>    exactly what this time_end is to convert it to timer cycles for a
>    WFET implementation on arm64.
>
> 2. Drop time_end and only leave time_expr as a bool (we don't care
>    whether it uses ns, jiffies or whatever underneath, it's just a
>    bool). In this case, we could use a 'slack' argument mostly to make a
>    decision on whether we use WFET, WFE or just polling with
>    cpu_relax(). For WFET, the wait time would be based on the slack
>    value rather than some absolute end time which we won't have.
>
> I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
> for the time being until we have a clear user. The fallback on arm64
> would be from wfe (if event streaming available), wfet with the same
> period as the event stream (in the absence of a slack argument) or
> cpu_relax().

So I like the approach with (2) quite a bit. It'll simplify the time
handling quite nicely. And, I think it is also good to drop slack
unless there's a use for it.

There's just one problem, which is that a notion of time-remaining
still seems quite important to me. Without it, it's difficult to know
how often to do the time-check etc. I could use an arbitrary
parameter, say evaluate time_expr once every N cpu_relax() loops etc
but that seems worse than the current approach.

So, how about replacing the bool time_expr with a time_remaining_expr
(s32) which evaluates to the time remaining in a fixed unit (ns)?

This also gives the WFET a clear end time (though it would still need
to be converted to timer cycles) but the WFE path could stay simple
by allowing an overshoot instead of falling back to polling.
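
A rough userspace sketch of how such an interface might look, under
stated assumptions and not as a proposed implementation: the
time_remaining_fn callback type, cond_load_timewait() and the
countdown helper are all hypothetical names, and the wait step is
only marked by a comment.

```c
#include <stdint.h>

/* Hypothetical: ns left until the deadline; <= 0 once it has passed. */
typedef int32_t (*time_remaining_fn)(void *ctx);

/*
 * Poll *ptr until it is nonzero or 'remaining' reports expiry.
 * Returns the last value read; the caller re-checks its condition.
 */
static int cond_load_timewait(const volatile int *ptr,
			      time_remaining_fn remaining, void *ctx)
{
	uint32_t n = 0, spin = 16;	/* assumed spin count */
	int32_t left;
	int val;

	for (;;) {
		val = *ptr;
		if (val)
			break;
		if (++n < spin)
			continue;
		left = remaining(ctx);
		if (left <= 0)
			break;
		/*
		 * An arch with WFET could wait for up to 'left' ns
		 * here; a WFE based wait could ignore 'left' and
		 * tolerate the overshoot.
		 */
		n = 0;
	}
	return val;
}

/* Hypothetical test helper: remaining time shrinks on every call. */
struct countdown {
	int32_t left_ns;
	int32_t step_ns;
	uint32_t calls;
};

static int32_t countdown_remaining(void *ctx)
{
	struct countdown *c = ctx;
	int32_t left = c->left_ns;

	c->calls++;
	c->left_ns -= c->step_ns;
	return left;
}
```

The point of the signed value is that a single expression carries
both "how long is left" and "has the deadline passed".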

--
ankur