Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX

Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> · Wed, 20 Aug 2025 18:03:41 +0800

On 8/20/2025 11:07 AM, Vishal Annapurve wrote:
On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> wrote:

On 7/28/2025 11:33 PM, Sean Christopherson wrote:
+Jiewen
Jiewen is out of the office until August 4th.
Hi Jiewen, can we get some help in deciding the next steps here?

Please see below.

Summary, with the questions at the end.

Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
the TPM driver's ioremap (with UC) failing because the kernel has already mapped
the range using a cachaeable mapping (WB).

   ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
   tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
actually honors the memtypes programmed into the virtual MTRRs.

It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
region (and potentially other regions) to be UC, so that the kernel ACPI driver's
attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
causes the ioremap infrastructure to reject the TPM driver's UC mapping.

IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]

        //
        // Operational region for TPM access
        //
        OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)

generates the problematic SystemMemory entry that triggers the ACPI driver's
auto-mapping logic.

QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
Memory32Fixed entry.

Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?

Checked with Jiewen offline.

He didn't think there was an existing interface to tell the OS to map a
OperationRegion of SystemMemory as UC via the ACPI table. He thought the
OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
alternative way.

According to the ACPI spec 6.6, an operation region of SystemMemory has no
interface to specify the cacheable attribute.

One solution could be using MTRRs to communicate the memory attribute of legacy
PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
that "long-term, firmware should not be using MTRRs to communicate anything to
the kernel." So this solution is not preferred.

If not MTRRs, there should be an alternative way to do the job.
1. ACPI table
     According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
     Range Descriptor can specify the cacheable attribute.
     "Address Space Resource Descriptors" could be used to describe a memory range
     and the they can specify the cacheable attribute via "Type Specific Flags".
     One of the Address Space Resource Descriptors could be added to the ACPI
     table as a hint when the kernel do the mapping for operation region.
     (There is "System Physical Address (SPA) Range Structure", which also can
     specify the cacheable attribute. But it's should be used for NVDIMMs.)
2. EFI memory map descriptor
     EFI memory descriptor can specify the cacheable attribute. Firmware can add
     a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
     the mapping for operation region.

Operation region of SystemMemory is still needed if a "Control Method" of APCI
needs to access a field, e.g., the method _STA. Checking another descriptor for
cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
map descriptor" during the ACPI code doing the mapping for operation region
makes the code complicated.

Another thing is if long-term firmware should not be using MTRRs to to
communicate anything to the kernel. It seems it's safer to use ioremap() instead
of ioremap_cache() for MMIO resource when the kernel do the mapping for the
operation region access?

Would it work if instead of doubling down on declaring the low memory
above TOLUD as WB, guest kernel reserves the range as uncacheable by
default i.e. effectively simulating a ioremap before ACPI tries to map
the memory as WB?

It seems as hacky as this patch set?