On Tue, Jul 29, 2025 at 12:28:41AM +1200, Kai Huang wrote:
>On TDX platforms, during kexec, the kernel needs to make sure there are
>no dirty cachelines of TDX private memory before booting to the new
>kernel to avoid silent memory corruption to the new kernel.
>
>During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
>to stop all remote CPUs before booting to the new kernel. The remote
>CPUs will then execute stop_this_cpu() to stop themselves.
>
>The kernel has a percpu boolean to indicate whether the cache of a CPU
>may be in incoherent state. In stop_this_cpu(), the kernel does WBINVD
>if that percpu boolean is true.
>
>TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
>This makes sure the caches will be flushed during kexec.
>
>However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
>which is extremely rare to happen but could cause the system to hang.
>
>Specifically, the native_stop_other_cpus() firstly sends normal reboot
>IPI to remote CPUs and waits one second for them to stop. If that times
>out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
>them.
>
>The aforementioned race happens when NMIs are sent. Doing WBINVD in
>stop_this_cpu() makes each CPU take longer time to stop and increases
>the chance of the race happening.
>
>Explicitly flush cache in tdx_disable_virtualization_cpu() after which
>no more TDX activity can happen on this cpu. This moves the WBINVD to
>an earlier stage than stop_this_cpus(), avoiding a possibly lengthy
>operation at a time where it could cause this race.
>
>Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
>Acked-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>Tested-by: Farrah Chen <farrah.chen@xxxxxxxxx>
>Reviewed-by: Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx>

Flushing cache after disabling virtualization looks clean. So,

Reviewed-by: Chao Gao <chao.gao@xxxxxxxxx>