Christoph Lameter (Ampere) <cl@xxxxxxxxxx> writes:

> On Fri, 2 May 2025, Ankur Arora wrote:
>
>> smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
>> took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
>> things and time_expr_ns and time_limit_ns being used to decide the
>> terminating condition.
>>
>> There were some problems in the timekeeping:
>>
>> 1. How often do we do the (relatively expensive) time-check?
>
> Is this really important? We have instructions that wait on an event and
> terminate at cycle counter values, like WFET on arm64.
>
> The case where we need to perform time checks only arises if the
> processor does not support WFET but must use the event stream, or does
> not even have that available.

AFAICT the vast majority of arm64 processors in the wild don't yet
support WFET. For instance, I haven't been able to find a single one to
test my WFET changes with ;).

The other part is that this needs to be in common code, and x86
primarily uses PAUSE. So supporting both configurations, WFE + spin on
arm64 and PAUSE on x86, needs a way of rate-limiting the time-check.
Otherwise you run into issues like this one:

  commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
  Author: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
  Date:   Tue Mar 27 23:58:45 2018 +0200

      cpuidle: poll_state: Avoid invoking local_clock() too often

      Rik reports that he sees an increase in CPU use in one benchmark
      due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
      poll_idle()" that caused poll_idle() to call local_clock() in every
      iteration of the loop.  Utilization increase generally means more
      non-idle time with respect to total CPU time (on the average) which
      implies reduced CPU frequency.

      Doug reports that limiting the rate of local_clock() invocations in
      there causes much less power to be drawn during a CPU-intensive
      parallel workload (with idle states 1 and 2 disabled to enforce
      more state 0 residency).

      These two reports together suggest that executing local_clock() on
      multiple CPUs in parallel at a high rate may cause chips to get hot
      and trigger thermal/power limits on them to kick in, so reduce the
      rate of local_clock() invocations in poll_idle() to avoid that
      issue.

> So the best approach is to have a simple interface where we specify the
> cycle count at which the wait is to be terminated and where we can
> cover that with one WFET instruction.
>
> The other cases then are degenerate forms of that. If only WFE is
> available, then only use that if the timeout is larger than the event
> stream granularity. Or, if both are unavailable, do the relax / loop
> thing.

That's what I had proposed for v1. But, as Catalin pointed out, that's
not very useful when the caller wants to limit the overshoot. This
version tries to optimistically use WFE where possible while minimizing
the spin time.

> So the interface could be much simpler:
>
>   __smp_cond_load_relaxed_wait(ptr, timeout_cycle_count)
>
> with a wrapper
>
>   smp_cond_relaxed_wait_expr(ptr, expr, timeout_cycle_count)

Oh, I would have absolutely liked to keep the interface simple, but I
couldn't see a way to do that while also managing the other constraints.
For instance, different users want different clocks: poll_idle() can do
with an imprecise clock, but rqspinlock needs ktime_get_mono_fast_ns().
I also think using standard clock types is better than using
arm64-specific cycle counts or the TSC or whatever.

> where we check the expression too and retry if the expression is not
> true.
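Right. And just to make the rate-limiting concrete: the generic spin
fallback for the current interface has to look roughly like the sketch
below. This is only meant to illustrate the shape of it, not the exact
code in the series; the spin count of 200 is made up here, in the
spirit of poll_idle()'s POLL_IDLE_RELAX_COUNT:

  /* Illustrative only: do the time check once every so many spins. */
  #define __SPIN_CHECK_COUNT	200

  #define smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
				         time_limit_ns) ({		\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	u32 __n = 0;							\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		/* Rate-limit the relatively expensive time check. */	\
		if (++__n < __SPIN_CHECK_COUNT)				\
			continue;					\
		__n = 0;						\
		if ((time_expr_ns) >= (time_limit_ns))			\
			break;						\
	}								\
	(typeof(*ptr))VAL;						\
  })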
> The fallbacks with the spins and relax logic would be architecture
> specific.

Even if they were all architecture specific, I suspect there's quite a
lot of variation in cpu_relax() across microarchitectures. For
instance, YIELD is a NOP on non-SMT arm64, but is probably heavier on
SMT arm64.

Thanks for the quick review!

Ankur
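P.S.: For reference, the relevant cpu_relax() definitions today are
roughly the following (simplified; on x86 cpu_relax() actually goes
through a rep_nop() helper):

  /* arm64: YIELD -- architecturally a hint, a NOP on non-SMT parts. */
  static inline void cpu_relax(void)
  {
	asm volatile("yield" ::: "memory");
  }

  /* x86: PAUSE, encoded as REP; NOP for backward compatibility. */
  static __always_inline void cpu_relax(void)
  {
	asm volatile("rep; nop" ::: "memory");
  }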