On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> wrote: > > > > On 7/28/2025 11:33 PM, Sean Christopherson wrote: > > +Jiewen > > Jiewen is out of the office until August 4th. Hi Jiewen, can we get some help in deciding the next steps here? > > > > > Summary, with the questions at the end. > > > > Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to > > the TPM driver's ioremap (with UC) failing because the kernel has already mapped > > the range using a cachaeable mapping (WB). > > > > ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0 > > tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12 > > > > The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for > > SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype > > to WB. With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and > > thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor > > actually honors the memtypes programmed into the virtual MTRRs. > > > > It turns out that the kernel has been relying on the MTRRs to force the TPM TIS > > region (and potentially other regions) to be UC, so that the kernel ACPI driver's > > attempts to map of SystemMemory entries as cacheable get forced to UC. With MTRRs > > forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn > > causes the ioremap infrastructure to reject the TPM driver's UC mapping. > > > > IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from > > EDK2's TPM ASL. And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1] > > > > // > > // Operational region for TPM access > > // > > OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000) > > > > generates the problematic SystemMemory entry that triggers the ACPI driver's > > auto-mapping logic. > > > > QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use > > EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a > > Memory32Fixed entry. > > > > Presumably this an EDK2 bug? If it's not an EDK2 bug, then how is the kernel's > > ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC? > According to the ACPI spec 6.6, an operation region of SystemMemory has no > interface to specify the cacheable attribute. > > One solution could be using MTRRs to communicate the memory attribute of legacy > PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned > that "long-term, firmware should not be using MTRRs to communicate anything to > the kernel." So this solution is not preferred. > > If not MTRRs, there should be an alternative way to do the job. > 1. ACPI table > According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory > Range Descriptor can specify the cacheable attribute. > "Address Space Resource Descriptors" could be used to describe a memory range > and the they can specify the cacheable attribute via "Type Specific Flags". > One of the Address Space Resource Descriptors could be added to the ACPI > table as a hint when the kernel do the mapping for operation region. > (There is "System Physical Address (SPA) Range Structure", which also can > specify the cacheable attribute. But it's should be used for NVDIMMs.) > 2. EFI memory map descriptor > EFI memory descriptor can specify the cacheable attribute. Firmware can add > a EFI memory descriptor for the TPM TIS device as a hint when the kernel do > the mapping for operation region. > > Operation region of SystemMemory is still needed if a "Control Method" of APCI > needs to access a field, e.g., the method _STA. Checking another descriptor for > cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory > map descriptor" during the ACPI code doing the mapping for operation region > makes the code complicated. > > Another thing is if long-term firmware should not be using MTRRs to to > communicate anything to the kernel. It seems it's safer to use ioremap() instead > of ioremap_cache() for MMIO resource when the kernel do the mapping for the > operation region access? > Would it work if instead of doubling down on declaring the low memory above TOLUD as WB, guest kernel reserves the range as uncacheable by default i.e. effectively simulating a ioremap before ACPI tries to map the memory as WB?