Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()

Catalin Marinas <catalin.marinas@xxxxxxx> · Fri, 8 Aug 2025 11:51:55 +0100

On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..d33c2701c9ee 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,101 @@ do {									\
>  })
>  #endif
>  
> +#ifndef SMP_TIMEWAIT_SPIN_BASE
> +#define SMP_TIMEWAIT_SPIN_BASE		16
> +#endif
> +
> +/*
> + * Policy handler that adjusts the number of times we spin or
> + * wait for cacheline to change before evaluating the time-expr.
> + *
> + * The generic version only supports spinning.
> + */
> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
> +				       u32 *spin, bool *wait, u64 slack)
> +{
> +	if (now >= end)
> +		return 0;
> +
> +	*spin = SMP_TIMEWAIT_SPIN_BASE;
> +	*wait = false;
> +	return now;
> +}
> +
> +#ifndef __smp_cond_policy
> +#define __smp_cond_policy ___smp_cond_spinwait
> +#endif
> +
> +/*
> + * Non-spin primitive that allows waiting for stores to an address,
> + * with support for a timeout. This works in conjunction with an
> + * architecturally defined policy.
> + */
> +#ifndef __smp_timewait_store
> +#define __smp_timewait_store(ptr, val)	do { } while (0)
> +#endif
> +
> +#ifndef __smp_cond_load_relaxed_timewait
> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
> +					 time_expr, time_end,		\
> +					 slack) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
> +	u64 __prev = 0, __end = (time_end);				\
> +	u64 __slack = slack;						\
> +	bool __wait = false;						\
> +									\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (++__n < __spin)					\
> +			continue;					\
> +		if (!(__prev = policy((time_expr), __prev, __end,	\
> +					  &__spin, &__wait, __slack)))	\
> +			break;						\
> +		if (__wait)						\
> +			__smp_timewait_store(__PTR, VAL);		\
> +		__n = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})
> +#endif

TBH, this still looks over-engineered to me, especially with the second
patch trying to reduce the spin loops based on the remaining time. Do
any of the current users of this interface actually need that extra
precision?

Also I feel the spinning added to poll_idle() is more of an architecture
choice as some CPUs could not cope with local_clock() being called too
frequently. The above generic implementation takes a spin into
consideration even if an arch implementation doesn't need it (e.g. WFET
or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
complicated for the generic implementation.

Can we instead have the generic implementation without any spinning?
Just poll the variable with cpu_relax(), like
smp_cond_load_acquire/relaxed() do, with an additional check for time.
Architectures can then redefine it as needed.
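
Something along these lines, as a rough userspace sketch rather than a
proposed patch. The macro name cond_load_relaxed_timewait_sketch() and
the fake_clock_us() helper are made up for illustration, and the
READ_ONCE()/cpu_relax() definitions are stand-ins for the kernel ones:

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel primitives (assumptions). */
#ifndef READ_ONCE
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#endif
#ifndef cpu_relax
#define cpu_relax()	__asm__ __volatile__("" ::: "memory")
#endif

/*
 * Hypothetical generic fallback: no spin count and no policy handler,
 * just re-read the variable and check the time on every iteration.
 * An architecture would override this with its own wait primitive.
 */
#define cond_load_relaxed_timewait_sketch(ptr, cond_expr,		\
					  time_expr, time_end) ({	\
	__typeof__(ptr) __PTR = (ptr);					\
	__typeof__(*__PTR) VAL;						\
	uint64_t __end = (time_end);					\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		if ((time_expr) >= __end)				\
			break;						\
	}								\
	VAL;								\
})

/* Hypothetical monotonic clock for testing: advances by one per call. */
static inline uint64_t fake_clock_us(void)
{
	static uint64_t t;

	return t++;
}
```

A caller would then do something like
cond_load_relaxed_timewait_sketch(&flag, VAL != 0, fake_clock_us(), end)
and check the returned value to tell a satisfied condition from a
timeout.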

> +#define __check_time_types(type, a, b)			\
> +		(__same_type(typeof(a), type) &&	\
> +		 __same_type(typeof(b), type))
> +
> +/**
> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
> + * guarantees until a timeout expires.
> + * @ptr: pointer to the variable to wait on
> + * @cond: boolean expression to wait for
> + * @time_expr: monotonic expression that evaluates to the current time
> + * @time_end: end time, compared against time_expr
> + * @slack: how much timer overshoot can the caller tolerate?
> + * Useful for when we go into wait states. A value of 0 indicates a high
> + * tolerance.
> + *
> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
> + * with no mandated precision.
> + *
> + * Equivalent to using READ_ONCE() on the condition variable.
> + */
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
> +				       time_end, slack) ({		\
> +	__unqual_scalar_typeof(*ptr) _val;				\
> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
> +			 "incompatible time units");			\
> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
> +						__smp_cond_policy,	\
> +						time_expr, time_end,	\
> +						slack);			\
> +	(typeof(*ptr))_val;						\
> +})

Looking at the current user of the acquire variant, rqspinlock, it does
not even bother with a time_expr but rather folds the time condition into
cond_expr. I don't think it has any "slack" requirements, only that
there is eventually no deadlock.
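
For reference, that pattern amounts to roughly the following. This is a
userspace sketch and not the actual rqspinlock code: cond_load_sketch(),
struct timeout_state and check_timeout() are made-up names, the clock is
faked, and no acquire ordering is shown:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel primitives (assumptions). */
#ifndef READ_ONCE
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#endif
#ifndef cpu_relax
#define cpu_relax()	__asm__ __volatile__("" ::: "memory")
#endif

/* Plain cond-load loop with no notion of time (hypothetical name). */
#define cond_load_sketch(ptr, cond_expr) ({				\
	__typeof__(ptr) __PTR = (ptr);					\
	__typeof__(*__PTR) VAL;						\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
	}								\
	VAL;								\
})

/* Hypothetical timeout state owned by the caller. */
struct timeout_state {
	uint64_t now;		/* fake clock, advanced on every check */
	uint64_t end;		/* deadline, same units as 'now' */
	bool timed_out;		/* set once the deadline has passed */
};

/* Returns true once the deadline has passed and records it in 'ts'. */
static inline bool check_timeout(struct timeout_state *ts)
{
	if (ts->now++ >= ts->end)
		ts->timed_out = true;
	return ts->timed_out;
}
```

The caller passes "!VAL || check_timeout(&ts)" as cond_expr and looks at
ts.timed_out afterwards, so the waiting primitive itself never sees a
time_expr, a time_end or a slack value.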

As for poll_idle(), does it have any slack requirements, or can we get
away without one?

I think we have two ways forward (well, at least):

1. Clearly define what time_end is and we won't need a time_expr at all.
   This may work for poll_idle(), not sure about rqspinlock. The
   advantage is that we can drop the 'slack' argument since none of the
   current users seem to need it. The downside is that we need to know
   exactly what this time_end is to convert it to timer cycles for a
   WFET implementation on arm64.

2. Drop time_end and only leave time_expr as a bool (we don't care
   whether it uses ns, jiffies or whatever underneath, it's just a
   bool). In this case, we could use a 'slack' argument mostly to make a
   decision on whether we use WFET, WFE or just polling with
   cpu_relax(). For WFET, the wait time would be based on the slack
   value rather than some absolute end time which we won't have.
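
To illustrate the conversion that (1) would need: assuming time_end is
defined as nanoseconds on a known clock, the arch code could derive a
counter deadline roughly as below. The helper names and the idea of
passing the counter value and frequency as plain arguments are
assumptions for the sake of the example, not existing kernel interfaces:

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC	1000000000ULL

/*
 * Convert a relative timeout in nanoseconds to counter cycles.
 * Seconds and the sub-second remainder are scaled separately so that
 * the intermediate product fits in 64 bits for counter frequencies up
 * to a few GHz.
 */
static inline uint64_t ns_to_timer_cycles(uint64_t ns, uint64_t freq_hz)
{
	uint64_t secs = ns / SKETCH_NSEC_PER_SEC;
	uint64_t rem = ns % SKETCH_NSEC_PER_SEC;

	return secs * freq_hz + (rem * freq_hz) / SKETCH_NSEC_PER_SEC;
}

/*
 * Hypothetical helper: turn an absolute end time (ns) into an absolute
 * counter value suitable as a timed-wait deadline.
 */
static inline uint64_t wfet_deadline_sketch(uint64_t now_ns, uint64_t end_ns,
					    uint64_t counter_now,
					    uint64_t freq_hz)
{
	if (end_ns <= now_ns)
		return counter_now;	/* already expired, do not wait */

	return counter_now + ns_to_timer_cycles(end_ns - now_ns, freq_hz);
}
```

This only works if the unit and clock behind time_end are fixed by the
interface, which is the downside mentioned in (1).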

I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
for the time being until we have a clear user. The fallback order on
arm64 would then be WFE (if the event stream is available), WFET with
the same period as the event stream (in the absence of a slack
argument), or cpu_relax().
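
A rough userspace sketch of what (2) could look like, with the time
check reduced to a bool and no slack. All names here are made up for
illustration (cond_load_relaxed_timeout_bool_sketch(),
pick_wait_method(), the enum), and READ_ONCE()/cpu_relax() are stand-ins
for the kernel primitives:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel primitives (assumptions). */
#ifndef READ_ONCE
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#endif
#ifndef cpu_relax
#define cpu_relax()	__asm__ __volatile__("" ::: "memory")
#endif

/*
 * Hypothetical generic version of (2): time_check_expr is any bool
 * expression that becomes true once the caller's timeout has expired.
 * The units (ns, jiffies, ...) are the caller's business.
 */
#define cond_load_relaxed_timeout_bool_sketch(ptr, cond_expr,		\
					      time_check_expr) ({	\
	__typeof__(ptr) __PTR = (ptr);					\
	__typeof__(*__PTR) VAL;						\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();	/* an arch override would wait here */	\
		if (time_check_expr)					\
			break;						\
	}								\
	VAL;								\
})

/* Hypothetical arm64 selection of the wait primitive, in fallback order. */
enum wait_method { WAIT_WFE, WAIT_WFET, WAIT_CPU_RELAX };

static inline enum wait_method pick_wait_method(bool have_event_stream,
						bool have_wfet)
{
	if (have_event_stream)
		return WAIT_WFE;	/* woken by the event stream */
	if (have_wfet)
		return WAIT_WFET;	/* timeout = event stream period */
	return WAIT_CPU_RELAX;		/* plain polling */
}
```

The point of the bool is that the waiting code never has to interpret
the caller's time units; it only needs to re-evaluate the expression
each time it wakes up.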

-- 
Catalin