Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 8/21/2025 1:56 AM, Sean Christopherson wrote:
On Wed, Aug 20, 2025, Binbin Wu wrote:
On 8/20/2025 6:03 PM, Binbin Wu wrote:
Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
Checked with Jiewen offline.

He didn't think there was an existing interface to tell the OS to map a
OperationRegion of SystemMemory as UC via the ACPI table. He thought the
OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
alternative way.

According to the ACPI spec 6.6, an operation region of SystemMemory has no
interface to specify the cacheable attribute.

One solution could be using MTRRs to communicate the memory attribute of legacy
PCI hole to the kernel.
So IIUC, there are no bugs anywhere, just a gap in specs that has been hidden
until now :-(

But during the PUCK meeting last week, Sean mentioned
that "long-term, firmware should not be using MTRRs to communicate anything to
the kernel." So this solution is not preferred.

If not MTRRs, there should be an alternative way to do the job.
1. ACPI table
      According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
      Range Descriptor can specify the cacheable attribute.
      "Address Space Resource Descriptors" could be used to describe a memory range
      and the they can specify the cacheable attribute via "Type Specific Flags".
      One of the Address Space Resource Descriptors could be added to the ACPI
      table as a hint when the kernel do the mapping for operation region.
      (There is "System Physical Address (SPA) Range Structure", which also can
      specify the cacheable attribute. But it's should be used for NVDIMMs.)
2. EFI memory map descriptor
      EFI memory descriptor can specify the cacheable attribute. Firmware can add
      a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
      the mapping for operation region.

Operation region of SystemMemory is still needed if a "Control Method" of APCI
needs to access a field, e.g., the method _STA. Checking another descriptor for
cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
map descriptor" during the ACPI code doing the mapping for operation region
makes the code complicated.

Another thing is if long-term firmware should not be using MTRRs to to
communicate anything to the kernel. It seems it's safer to use ioremap() instead
of ioremap_cache() for MMIO resource when the kernel do the mapping for the
operation region access?

Would it work if instead of doubling down on declaring the low memory
above TOLUD as WB, guest kernel reserves the range as uncacheable by
default i.e. effectively simulating a ioremap before ACPI tries to map
the memory as WB?
It seems as hacky as this patch set?


Hi Sean,

Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
hole range as UC?

Is it less hacky?
Oh!  That's a way better idea than my hack.  I missed that the kernel would still
consult MTRRs.

Compile tested only, but something like this?

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8ae750cde0c6..45c8871cdda1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
static void __init kvm_init_platform(void)
  {
+       u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
+       struct mtrr_var_range pci_hole = {
+               .base_lo = tolud | X86_MEMTYPE_UC,
+               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
+               .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
+       };
+

This value of tolud  may not meet the range size and alignment requirement for
variable MTRR.

Variable MTRR has requirement for range size and alignment:
For ranges greater than 4 KBytes, each range must be of length 2^n and its base
address must be aligned on a 2^n boundary, where n is a value equal to or
greater than 12. The base-address alignment value cannot be less than its length.

In my setup, the value of tolud is 0x7FF7C000, it requires 3 variable MTRRs to
meet the requirement, i.e.,
- 7FF7 C000  ~   7FF8 0000
- 7FF8 0000  ~   8000 0000
- 8000 0000  ~ 1 0000 0000

I checks the implementation in EDK2, in order to fit the legacy PCI hole into
one variable MTRR, it has some assumption to truncate the size and round up the
base address in PlatformQemuUc32BaseInitialization():
    ...
    ASSERT (
      PlatformInfoHob->HostBridgeDevId == INTEL_Q35_MCH_DEVICE_ID ||
      PlatformInfoHob->HostBridgeDevId == INTEL_82441_DEVICE_ID
      );
    ...
    //
    // Start with the [LowerMemorySize, 4GB) range. Make sure one
    // variable MTRR suffices by truncating the size to a whole power of two,
    // while keeping the end affixed to 4GB. This will round the base up.
    //
    PlatformInfoHob->Uc32Size = GetPowerOfTwo32 ((UINT32)(SIZE_4GB - PlatformInfoHob->LowMemory));
    PlatformInfoHob->Uc32Base = (UINT32)(SIZE_4GB - PlatformInfoHob->Uc32Size);
    //
    // Assuming that LowerMemorySize is at least 1 byte, Uc32Size is at most 2GB.
    // Therefore Uc32Base is at least 2GB.
    //
    ASSERT (PlatformInfoHob->Uc32Base >= BASE_2GB);

I am not sure if KVM can do such assumption.
Otherwise, KVM needs to calculate the ranges to meet the requirement. :(


         if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
             kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
                 unsigned long nr_pages;
@@ -982,8 +989,12 @@ static void __init kvm_init_platform(void)
         kvmclock_init();
         x86_platform.apic_post_init = kvm_apic_init;
- /* Set WB as the default cache mode for SEV-SNP and TDX */
-       guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
+       /*
+        * Set WB as the default cache mode for SEV-SNP and TDX, with a single
+        * UC range for the legacy PCI hole, e.g. so that devices that expect
+        * to get UC/WC mappings don't get surprised with WB.
+        */
+       guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
  }
#if defined(CONFIG_AMD_MEM_ENCRYPT)





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux