On 6/26/25 05:48, Kai Huang wrote:
> On SME platforms, at hardware level dirty cachelines with and without
> encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order. During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
>
> TDX also needs cache flush during kexec for the same reason. It would
> be good to implement a generic way to flush cache instead of scattering
> checks for each feature all around.
>
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel. For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables. A simple memory write from the kernel could dirty cachelines.
>
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
> two places:
>
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
>    stops them;
> 2) the one in relocate_kernel() where the kexec-ing CPU jumps to the
>    new kernel.
>
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed). Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
>
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate the cache may be in an incoherent state and thus a
> cache flush is needed during kexec, and turn on the boolean for SME.
> TDX can then leverage it by also turning the boolean on.
>
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs. A global flag would be
> sufficient. But using a percpu variable has two benefits. Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
>
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD. Saving kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active. WBINVD could make this race worse, so it's
> worth skipping it when possible.
>
> Today the first WBINVD in stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel. Make things
> simple by changing to do the second WBINVD when the platform supports
> SME. This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.
>
> No other functional change intended.
>
> Also, currently machine_kexec() has a comment to explain why no function
> call is allowed after load_segments(). After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling
> relocate_kernel(), that comment isn't needed anymore. But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
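To make the mechanism described above easier to picture, a minimal sketch
of the percpu flag and the gated flush could look roughly like the code
below. The variable and helper names (cache_incoherent,
kexec_mark_cache_incoherent(), kexec_flush_cache_if_needed()) are made up
for illustration and need not match what the patch actually uses:

  #include <linux/percpu.h>             /* DEFINE_PER_CPU, this_cpu_*() */
  #include <asm/special_insns.h>        /* wbinvd() */

  /*
   * Set when this CPU's caches may hold dirty lines that must be
   * flushed before jumping to the new kernel (SME is supported, or a
   * SEAMCALL has been made on this CPU).
   */
  DEFINE_PER_CPU(bool, cache_incoherent);

  /* Mark this CPU as needing a cache flush before kexec. */
  static inline void kexec_mark_cache_incoherent(void)
  {
          this_cpu_write(cache_incoherent, true);
  }

  /*
   * Flush only when needed; conceptually this is what runs when a
   * remote CPU is stopped and before the kexec-ing CPU jumps to the
   * new kernel.
   */
  static inline void kexec_flush_cache_if_needed(void)
  {
          if (this_cpu_read(cache_incoherent)) {
                  wbinvd();
                  this_cpu_write(cache_incoherent, false);
          }
  }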
>
> [*] The "race" in native_stop_other_cpus()
>
> Commit
>
>   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
>
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
>
>     CPU0                                    CPU1
>
>      stop_other_cpus()
>        send_IPIs(REBOOT);                   stop_this_cpu()
>        while (num_online_cpus() > 1);         set_online(false);
>        proceed... -> hang
>                                              wbinvd()
>
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
>
> This is because it doesn't address the "race" mentioned in #3 of the
> comment in native_stop_other_cpus():
>
>     /*
>      * 1) Send an IPI on the reboot vector to all other CPUs.
>      *
>      *    The other CPUs should react on it after leaving critical
>      *    sections and re-enabling interrupts. They might still hold
>      *    locks, but there is nothing which can be done about that.
>      *
>      * 2) Wait for all other CPUs to report that they reached the
>      *    HLT loop in stop_this_cpu()
>      *
>      * 3) If #2 timed out send an NMI to the CPUs which did not
>      *    yet report
>      *
>      * 4) Wait for all other CPUs to report that they reached the
>      *    HLT loop in stop_this_cpu()
>      *
>      * #3 can obviously race against a CPU reaching the HLT loop late.
>      * That CPU will have reported already and the "have all CPUs
>      * reached HLT" condition will be true despite the fact that the
>      * other CPU is still handling the NMI. Again, there is no
>      * protection against that as "disabled" APICs still respond to
>      * NMIs.
>      */
>
> Consider the below case:
>
>     CPU 0                                   CPU 1
>
>     native_stop_other_cpus()                stop_this_cpu()
>
>       // sends REBOOT IPI to stop remote CPUs
>                                               ...
>                                               wbinvd();
>
>       // wait times out, try NMI
>       if (!cpumask_empty(&cpus_stop_mask)) {
>         for_each_cpu(cpu, &cpus_stop_mask) {
>
>                                               ...
>                                               cpumask_clear_cpu(cpu,
>                                                   &cpus_stop_mask);
>                                               hlt;
>
>           // send NMI           --->
>                                               wakeup from hlt and run
>                                               stop_this_cpu():
>
>                                               // WAIT CPUs TO STOP
>                                               while (!cpumask_empty(
>                                                   &cpus_stop_mask) &&
>                                                       ...)
>         }
>         ...
>         proceed ...                           wbinvd();
>                                               ...
>                                               hlt;
>
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
>
> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" though it is still rare to happen.
>
> Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>

Reviewed-by: Tom Lendacky <thomas.lendacky@xxxxxxx>
Tested-by: Tom Lendacky <thomas.lendacky@xxxxxxx>

> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
>
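For completeness, the bringup side could conceptually be as small as the
sketch below: when a CPU on an SME-capable platform comes up, the flag is
simply turned on so the kexec path flushes later. The function name and
placement here are hypothetical (per the diffstat the real change is in
arch/x86/kernel/cpu/amd.c), and cache_incoherent refers to the
illustrative percpu flag from the earlier sketch:

  /*
   * Hypothetical hook in the AMD CPU init path, run on the CPU being
   * brought up.
   */
  static void sme_mark_cache_incoherent(struct cpuinfo_x86 *c)
  {
          /*
           * SME being *supported* is enough: once encrypted mappings
           * are possible, dirty cachelines with and without the
           * encryption bit may coexist on this CPU.
           */
          if (cpu_has(c, X86_FEATURE_SME))
                  this_cpu_write(cache_incoherent, true);
  }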