On Wed, Aug 20, 2025, Binbin Wu wrote: > On 8/20/2025 6:03 PM, Binbin Wu wrote: > > > > > Presumably this an EDK2 bug? If it's not an EDK2 bug, then how is the kernel's > > > > > ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC? > > > > Checked with Jiewen offline. > > > > He didn't think there was an existing interface to tell the OS to map a > > OperationRegion of SystemMemory as UC via the ACPI table. He thought the > > OS/ACPI driver still needed to rely on MTRRs for the hint before there was an > > alternative way. > > > > > > According to the ACPI spec 6.6, an operation region of SystemMemory has no > > > > interface to specify the cacheable attribute. > > > > > > > > One solution could be using MTRRs to communicate the memory attribute of legacy > > > > PCI hole to the kernel. So IIUC, there are no bugs anywhere, just a gap in specs that has been hidden until now :-( > > > > But during the PUCK meeting last week, Sean mentioned > > > > that "long-term, firmware should not be using MTRRs to communicate anything to > > > > the kernel." So this solution is not preferred. > > > > > > > > If not MTRRs, there should be an alternative way to do the job. > > > > 1. ACPI table > > > > According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory > > > > Range Descriptor can specify the cacheable attribute. > > > > "Address Space Resource Descriptors" could be used to describe a memory range > > > > and the they can specify the cacheable attribute via "Type Specific Flags". > > > > One of the Address Space Resource Descriptors could be added to the ACPI > > > > table as a hint when the kernel do the mapping for operation region. > > > > (There is "System Physical Address (SPA) Range Structure", which also can > > > > specify the cacheable attribute. But it's should be used for NVDIMMs.) > > > > 2. EFI memory map descriptor > > > > EFI memory descriptor can specify the cacheable attribute. Firmware can add > > > > a EFI memory descriptor for the TPM TIS device as a hint when the kernel do > > > > the mapping for operation region. > > > > > > > > Operation region of SystemMemory is still needed if a "Control Method" of APCI > > > > needs to access a field, e.g., the method _STA. Checking another descriptor for > > > > cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory > > > > map descriptor" during the ACPI code doing the mapping for operation region > > > > makes the code complicated. > > > > > > > > Another thing is if long-term firmware should not be using MTRRs to to > > > > communicate anything to the kernel. It seems it's safer to use ioremap() instead > > > > of ioremap_cache() for MMIO resource when the kernel do the mapping for the > > > > operation region access? > > > > > > > Would it work if instead of doubling down on declaring the low memory > > > above TOLUD as WB, guest kernel reserves the range as uncacheable by > > > default i.e. effectively simulating a ioremap before ACPI tries to map > > > the memory as WB? > > > > It seems as hacky as this patch set? > > > > > Hi Sean, > > Since guest_force_mtrr_state() also supports to force MTRR variable ranges, > I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI > hole range as UC? > > Is it less hacky? Oh! That's a way better idea than my hack. I missed that the kernel would still consult MTRRs. Compile tested only, but something like this? diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8ae750cde0c6..45c8871cdda1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc) static void __init kvm_init_platform(void) { + u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT; + struct mtrr_var_range pci_hole = { + .base_lo = tolud | X86_MEMTYPE_UC, + .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11), + .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32, + }; + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) && kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) { unsigned long nr_pages; @@ -982,8 +989,12 @@ static void __init kvm_init_platform(void) kvmclock_init(); x86_platform.apic_post_init = kvm_apic_init; - /* Set WB as the default cache mode for SEV-SNP and TDX */ - guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK); + /* + * Set WB as the default cache mode for SEV-SNP and TDX, with a single + * UC range for the legacy PCI hole, e.g. so that devices that expect + * to get UC/WC mappings don't get surprised with WB. + */ + guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK); } #if defined(CONFIG_AMD_MEM_ENCRYPT)