On Mon, Aug 11, 2025, Yury Norov wrote:
> On Mon, Aug 11, 2025 at 01:45:46PM -0700, Sean Christopherson wrote:
> > On Mon, Aug 11, 2025, Yury Norov wrote:
> > > Testing cpumask for a CPU to be cleared just before setting the exact
> > > same CPU is useless because the end result is always the same: CPU is
> > > set.
> >
> > No, it is not useless.  Blindly writing to the variable will unnecessarily bounce
> > the cacheline, and this is a hot path.
>
> How hot is that path?

Very, it gets hit on every VM-Exit => VM-Entry.  For context, putting a single
printk anywhere in KVM's exit=>entry path can completely prevent forward
progress in the guest (for some workloads/patterns).

> How bad the cache contention is?

I would expect it to be "fatally" bad for some workloads and setups.  Not
literally fatal, but bad enough that it would require an urgent fix.

> Is there any evidence that conditional cpumask_set_cpu() worth the effort?

I don't have evidence for this specific code flow, but there is plenty of
evidence that shows that generating atomic accesses, especially across sockets,
can have a significant negative impact on performance.

I didn't ask for performance numbers for optimizing setting the mask because
(a) I know the VM-Entry path can be extremely hot, (b) I know that dueling
atomics can be hugely problematic, and (c) I don't see the separate test + set
logic as being at all notable in terms of effort.

> The original patch doesn't discuss that at all, and without any comment the
> code looks just buggy.

FWIW, there was discussion in a previous version of the series, but no hard
numbers on the perf impact.

https://lore.kernel.org/all/Z75se_OZQvaeQE-4@xxxxxxxxxx

> > > While there, switch CPU setter to a non-atomic version. Atomicity is
> > > useless here
> >
> > No, atomicity isn't useless here either.  Dropping atomicity could result in
> > CPU's bit being lost.  I.e. the atomic accesses aren't for the benefit of
> > smp_call_function_many_cond(), the writes are atomic so that multiple vCPUs
> > can concurrently update the mask without needing additional protection.
>
> OK, I see. Something heavy hit my head before I decided to drop
> atomicity there.
>
> > > because sev_writeback_caches() ends up with a plain
> > > for_each_cpu() loop in smp_call_function_many_cond(), which is not
> > > atomic by nature.
> >
> > That's fine.  As noted in sev_writeback_caches(), if vCPU could be running,
> > then the caller is responsible for ensuring that all vCPUs flush caches
> > before the memory being reclaimed is fully freed.  Those guarantees are
> > provided by KVM's MMU.
> >
> > sev_writeback_caches() => smp_call_function_many_cond() could hit false
> > positives, i.e. trigger WBINVD on CPUs that couldn't possibly have accessed
> > the memory being reclaimed, but such false positives are functionally
> > benign, and are "intended" in the sense that we chose to prioritize
> > simplicity over precision.
>
> So, I don't object to drop the patch, but it would be really nice to
> have this
>
>	if (!cpumask_test_cpu())
>		cpumask_set_cpu()
>
> pattern explained, and even better supported with performance numbers.

I can definitely add a comment, and I might try to gather numbers out of
curiosity, but as above, I just don't see this as something that needs to be
investigated with any urgency.