Catalin Marinas <catalin.marinas@xxxxxxx> writes:

> On Thu, Aug 14, 2025 at 12:30:36AM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@xxxxxxx> writes:
>> > On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
>> >> Catalin Marinas <catalin.marinas@xxxxxxx> writes:
>> >> > Also I feel the spinning added to poll_idle() is more of an architecture
>> >> > choice as some CPUs could not cope with local_clock() being called too
>> >> > frequently.
>> >>
>> >> Just on the frequency point -- I think it might be a more general
>> >> problem than just on specific architectures.
>> >>
>> >> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
>> >> clocksources and from a quick look some of them do iomem reads.
>> >> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
>> >> itself, so this might not need to be an arch choice at all.)
>> >>
>> >> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
>> >> we might be using tsc or jiffies or paravirt-clock, all of which would
>> >> have very different performance characteristics. Or, we might just be
>> >> using a clock more expensive than local_clock(); rqspinlock uses
>> >> ktime_get_mono_fast_ns().
>> >>
>> >> So, I feel we do need a generic rate limiter.
>> >
>> > That's a good point but the rate limiting is highly dependent on the
>> > architecture, what a CPU does in the loop, how fast a loop iteration is.
>> >
>> > That's why I'd keep it hidden in the arch code.
>>
>> Yeah, this makes sense. However, I would like to keep as much as
>> possible of the code that does this common.
>
> You can mimic what poll_idle() does for x86 in the generic
> implementation, maybe with some comment referring to the poll_idle() CPU
> usage of calling local_clock() in a loop. However, allow the arch code
> to override the whole implementation and get rid of the policy. If an
> arch wants to spin for some power reason, it can do it itself. The code
> duplication for a while loop is much more readable than a policy setting
> some spin/wait parameters just to have a single spin loop. If at some
> point we see some pattern, we could revisit the common code.
>
> For arm64, I doubt the extra spinning makes any difference. Our
> cpu_relax() doesn't do anything (almost), it's probably about the same
> cost as reading the monotonic clock. I also see a single definition
> close enough to the logic in __delay() on arm64. It would be more
> readable than a policy callback setting wait/spin with a separate call
> for actually waiting. Well, gut feel, let's see what that would look
> like.

So, I tried to pare back the code, and the following (untested) is what
I came up with. Given the straightforward rate limiting, and that the
current users don't need accurate timekeeping, this uses a bool
time_check_expr. As you suggested, I figured I'd keep it simple until
someone actually needs greater complexity.
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..e8793347a395 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,34 @@ do {                                                  \
 })
 #endif
+
+#ifndef SMP_TIMEWAIT_SPIN_COUNT
+#define SMP_TIMEWAIT_SPIN_COUNT        200
+#endif
+
+#ifndef smp_cond_load_relaxed_timewait
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr,         \
+                                       time_check_expr)        \
+({                                                             \
+       typeof(ptr) __PTR = (ptr);                              \
+       __unqual_scalar_typeof(*ptr) VAL;                       \
+       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;          \
+                                                               \
+       for (;;) {                                              \
+               VAL = READ_ONCE(*__PTR);                        \
+               if (cond_expr)                                  \
+                       break;                                  \
+               cpu_relax();                                    \
+               if (++__n < __spin)                             \
+                       continue;                               \
+               if ((time_check_expr))                          \
+                       break;                                  \
+               __n = 0;                                        \
+       }                                                       \
+       (typeof(*ptr))VAL;                                      \
+})
+#endif
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index f5801b0ba9e9..c9934ab68da2 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -219,6 +219,43 @@ do {                                                  \
        (typeof(*ptr))VAL;                                      \
 })
+extern bool arch_timer_evtstrm_available(void);
+
+#ifndef SMP_TIMEWAIT_SPIN_COUNT
+#define SMP_TIMEWAIT_SPIN_COUNT        200
+#endif
+
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr,         \
+                                       time_check_expr)        \
+({                                                             \
+       typeof(ptr) __PTR = (ptr);                              \
+       __unqual_scalar_typeof(*ptr) VAL;                       \
+       u32 __n = 0, __spin = 0;                                \
+       bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT); \
+       bool __wfe = arch_timer_evtstrm_available();            \
+       bool __wait = false;                                    \
+                                                               \
+       if (__wfet || __wfe)                                    \
+               __wait = true;                                  \
+       else                                                    \
+               __spin = SMP_TIMEWAIT_SPIN_COUNT;               \
+                                                               \
+       for (;;) {                                              \
+               VAL = READ_ONCE(*__PTR);                        \
+               if (cond_expr)                                  \
+                       break;                                  \
+               cpu_relax();                                    \
+               if (++__n < __spin)                             \
+                       continue;                               \
+               if ((time_check_expr))                          \
+                       break;                                  \
+               if (__wait)                                     \
+                       __cmpwait_relaxed(__PTR, VAL);          \
+               __n = 0;                                        \
+       }                                                       \
+       (typeof(*ptr))VAL;                                      \
+})
+
 #include <asm-generic/barrier.h>

__cmpwait_relaxed() will need adjustment to set a deadline for WFET.

AFAICT the rqspinlock code should be able to work by specifying
something like:

	((ktime_get_mono_fast_ns() > tval) || (deadlock_check(&lock_context)))

as the time_check_expr. I think they also want to rate limit how often
deadlock_check() is called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT
to some large value for arm64. (There's a rough sketch of the usage I
have in mind in the P.S. below.)

How does this look?

Thanks

--
ankur
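P.S. To make the rqspinlock part concrete, here's the kind of caller I
imagine, sitting somewhere like kernel/bpf/rqspinlock.c where the usual
headers are already pulled in. Entirely untested, and the names
res_wait_for_unlock(), deadlock_check() and struct lock_context are
made up for illustration; they are not the actual rqspinlock internals:

/*
 * Hypothetical rqspinlock-style waiter. @tval is an absolute deadline
 * in ns computed by the caller; deadlock_check() stands in for the
 * real deadlock detection.
 */
static inline int res_wait_for_unlock(struct qspinlock *lock, u64 tval,
				      struct lock_context *ctx)
{
	u8 val;

	/*
	 * Wait until the lock is released, the deadline passes, or a
	 * deadlock is detected. The time_check_expr is only evaluated
	 * once every SMP_TIMEWAIT_SPIN_COUNT iterations (or on wakeup
	 * from WFE/WFET on arm64), which also rate limits how often
	 * deadlock_check() runs.
	 */
	val = smp_cond_load_relaxed_timewait(&lock->locked, !VAL,
			(ktime_get_mono_fast_ns() > tval) ||
			deadlock_check(ctx));

	/*
	 * A real version would want acquire ordering on success and
	 * would distinguish timeout from deadlock on failure.
	 */
	return val ? -ETIMEDOUT : 0;
}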