On an SMT-enabled system, the SMT Protection feature allows an SNP guest
to demand that its vCPU's hardware thread run alone on the physical core,
protecting the guest against possible side-channel attacks through shared
core resources. Hardware supports this by requiring the sibling of the
vCPU thread to be in the idle state while the vCPU is running: if hardware
detects that the sibling has not entered the idle state, or has exited it,
the vCPU's VMRUN exits with a new "IDLE_REQUIRED" status, at which point
the hypervisor should schedule the idle process on the sibling thread
simultaneously with resuming the vCPU's VMRUN. There is a new
HLT_WAKEUP_ICR MSR that the hypervisor programs for each SMT thread in the
system such that, if the idle sibling of an SMT-protected guest vCPU
receives an interrupt, hardware writes the HLT_WAKEUP_ICR value to the
APIC ICR to 'kick' the vCPU thread out of its VMRUN state. Hardware then
allows the sibling to exit the idle state and service its interrupt.

The feature is supported on EPYC Zen 4 and later CPUs. For more
information, see "15.36.17 Side-Channel Protection", "SMT Protection", in
"AMD64 Architecture Programmer's Manual Volume 2: System Programming
Part 2, Pub. 24593 Rev. 3.42 - March 2024", available here:

  https://bugzilla.kernel.org/attachment.cgi?id=306250

See the end of this message for the qemu hack that calls the Linux Core
Scheduling prctl() to create a unique per-vCPU cookie, ensuring the vCPU
thread will not be scheduled while anything else is running on the sibling
thread of its core.
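The interface that hack relies on boils down to each vCPU thread giving
itself a unique core-scheduling cookie. A minimal standalone sketch of
that call (function name is mine, for illustration only; the actual qemu
patch is at the end of this message):

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

/*
 * Tag the calling (vCPU) thread with its own core-scheduling cookie so
 * the scheduler will not run any other userspace task on the SMT
 * sibling while this thread is on the core.
 */
static int vcpu_core_sched_cookie_create(void)
{
	/* pid == 0: operate on the calling thread */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
		  PR_SCHED_CORE_SCOPE_THREAD, 0)) {
		perror("PR_SCHED_CORE_CREATE");
		return -1;
	}
	return 0;
}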
As it turns out, this approach is less than efficient, because existing
Core Scheduling semantics only prevent other userspace tasks from running
on the sibling thread that hardware requires to be idle. Because of this,
the vCPU's VMRUN frequently exits with "IDLE_REQUIRED" whenever the
scheduler runs its "OS noise" (softirq work, etc.) on the sibling, instead
of the sibling staying in the hardware idle state for the duration of
VMRUN.

Mild testing yields eventual CPU stalls in the guest (minutes after boot):

[ C0] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ C0] rcu: 1-...!: (0 ticks this GP) idle=8d58/0/0x0 softirq=12830/12830 fqs=0 (false positive?)
[ C0] rcu: (detected by 0, t=16253 jiffies, g=12377, q=12 ncpus=2)
[ C0] rcu: rcu_preempt kthread timer wakeup didn't happen for 16252 jiffies! g12377 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ C0] rcu: Possible timer handling issue on cpu=1 timer-softirq=15006
[ C0] rcu: rcu_preempt kthread starved for 16253 jiffies! g12377 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ C0] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.

...with the occasional "NOHZ tick-stop error: local softirq work is
pending, handler #200!!!" on the host.

However, this RFC represents only one of three approaches attempted:

- Another, brute-force approach simply called remove_cpu() on the sibling
  before, and add_cpu() after, __svm_sev_es_vcpu_run() in
  svm_vcpu_enter_exit(). The effort was quickly abandoned since it led to
  insurmountable lock contention issues:

  BUG: scheduling while atomic: qemu-system-x86/6743/0x00000002
  4 locks held by qemu-system-x86/6743:
   #0: ff160079b2dd80b8 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x94/0xa40 [kvm]
   #1: ffffffffba3c5410 (device_hotplug_lock){....}-{3:3}, at: lock_device_hotplug+0x1b/0x30
   #2: ff16009838ff5398 (&dev->mutex){....}-{3:3}, at: device_offline+0x9c/0x120
   #3: ffffffffb9e7e6b0 (cpu_add_remove_lock){....}-{3:3}, at: cpu_device_down+0x24/0x50

- The third approach attempted to forward-port vCPU Core Scheduling from
  the original 4.18-based work by Peter Z.:

  https://github.com/pdxChen/gang/commits/sched_1.23-base

  K. Prateek Nayak provided enough guidance to get me past host lockups
  from "kvm,sched: Track VCPU threads", but the next commit, "sched: Add
  VCPU aware SMT scheduling", proved insurmountable to forward-port given
  how much scheduler internals have changed since then.

Comments welcome:

- Are any of these three approaches even close to an upstream-acceptable
  solution for supporting SMT Protection?

- Given the feature's strict sibling idle-state constraint, should SMT
  Protection be supported at all?

This RFC applies to kvm-x86/next, kvm-x86-next-2025.07.21 (33f843444e28).

Qemu hack: