On an SMT-enabled system, the SMT Protection feature allows an SNP guest
to demand that its vCPU's hardware thread run alone on the physical core,
protecting the guest against possible side-channel attacks through shared
core resources. Hardware supports this by requiring the sibling of the
vCPU thread to be in the idle state while the vCPU is running: if hardware
detects that the sibling has not entered the idle state, or has exited it,
the vCPU's VMRUN exits with a new "IDLE_REQUIRED" status, at which point
the hypervisor should schedule the idle process on the sibling thread
simultaneously with resuming the vCPU's VMRUN. There is a new
HLT_WAKEUP_ICR MSR that the hypervisor programs for each SMT thread in the
system such that, if the idle sibling of an SMT-protected guest vCPU
receives an interrupt, hardware writes the HLT_WAKEUP_ICR value to the
APIC ICR to 'kick' the vCPU thread out of its VMRUN state. Hardware then
allows the sibling to exit the idle state and service its interrupt.

The feature is supported on EPYC Zen 4 and later CPUs. For more
information, see "15.36.17 Side-Channel Protection", "SMT Protection", in
"AMD64 Architecture Programmer's Manual Volume 2: System Programming
Part 2, Pub. 24593 Rev. 3.42 - March 2024", available here:

  https://bugzilla.kernel.org/attachment.cgi?id=306250

See the end of this message for the qemu hack that calls the Linux Core
Scheduling prctl() to create a unique per-vCPU cookie, ensuring the vCPU
thread will not be scheduled while anything else is running on the sibling
thread of its core.
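The interface that hack relies on boils down to each vCPU thread giving
itself a unique core-scheduling cookie. A minimal standalone sketch of
that call (function name is mine, for illustration only; the actual qemu
patch is at the end of this message):

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

/*
 * Tag the calling (vCPU) thread with its own core-scheduling cookie so
 * the scheduler will not run any other userspace task on the SMT
 * sibling while this thread is on the core.
 */
static int vcpu_core_sched_cookie_create(void)
{
	/* pid == 0: operate on the calling thread */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
		  PR_SCHED_CORE_SCOPE_THREAD, 0)) {
		perror("PR_SCHED_CORE_CREATE");
		return -1;
	}
	return 0;
}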
As it turns out, this approach is less than efficient, because existing
Core Scheduling semantics only prevent other userspace tasks from running
on the sibling thread that hardware requires to be idle. Because of this,
the vCPU's VMRUN frequently exits with "IDLE_REQUIRED" whenever the
scheduler runs its "OS noise" (softirq work, etc.) on the sibling, instead
of the sibling staying in the hardware idle state for the duration of
VMRUN.

Mild testing yields eventual CPU stalls in the guest (minutes after boot):

[ C0] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ C0] rcu: 1-...!: (0 ticks this GP) idle=8d58/0/0x0 softirq=12830/12830 fqs=0 (false positive?)
[ C0] rcu: (detected by 0, t=16253 jiffies, g=12377, q=12 ncpus=2)
[ C0] rcu: rcu_preempt kthread timer wakeup didn't happen for 16252 jiffies! g12377 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ C0] rcu: Possible timer handling issue on cpu=1 timer-softirq=15006
[ C0] rcu: rcu_preempt kthread starved for 16253 jiffies! g12377 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ C0] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.

...with the occasional "NOHZ tick-stop error: local softirq work is
pending, handler #200!!!" on the host.

However, this RFC represents only one of three approaches attempted:

- Another, brute-force approach simply called remove_cpu() on the sibling
  before, and add_cpu() after, __svm_sev_es_vcpu_run() in
  svm_vcpu_enter_exit(). The effort was quickly abandoned since it led to
  insurmountable lock contention issues:

  BUG: scheduling while atomic: qemu-system-x86/6743/0x00000002
  4 locks held by qemu-system-x86/6743:
   #0: ff160079b2dd80b8 (&vcpu->mutex){....}-{3:3}, at: kvm_vcpu_ioctl+0x94/0xa40 [kvm]
   #1: ffffffffba3c5410 (device_hotplug_lock){....}-{3:3}, at: lock_device_hotplug+0x1b/0x30
   #2: ff16009838ff5398 (&dev->mutex){....}-{3:3}, at: device_offline+0x9c/0x120
   #3: ffffffffb9e7e6b0 (cpu_add_remove_lock){....}-{3:3}, at: cpu_device_down+0x24/0x50

- The third approach attempted to forward-port vCPU Core Scheduling from
  the original 4.18-based work by Peter Z.:

  https://github.com/pdxChen/gang/commits/sched_1.23-base

  K. Prateek Nayak provided enough guidance to get me past host lockups
  from "kvm,sched: Track VCPU threads", but the next commit, "sched: Add
  VCPU aware SMT scheduling", proved insurmountable to forward-port given
  how much scheduler internals have changed since then.

Comments welcome:

- Are any of these three approaches even close to an upstream-acceptable
  solution for supporting SMT Protection?

- Given the feature's strict sibling idle-state constraint, should SMT
  Protection be supported at all?

This RFC applies to kvm-x86/next, kvm-x86-next-2025.07.21 (33f843444e28).

Qemu hack: