On Tue, Apr 22, 2025 at 04:02:07PM +0300, Ilpo Järvinen wrote: > When bifurcated to x2, Xeon 6 Root Port performance is sensitive to the > configuration of Extended Tags, Max Read Request Size (MRRS), and 10-Bit > Tag Requester (note: there is currently no 10-Bit Tag support in the > kernel). While those can be configured to the recommended values by FW, > kernel may decide to overwrite the initial values. > > Unfortunately, there is no mechanism for FW to indicate OS which parts > of PCIe configuration should not be altered. Thus, the only option is > to add such logic into the kernel as quirks. > > There is a pre-existing quirk flag to disable Extended Tags. Depending > on CONFIG_PCIE_BUS_* setting, MRRS may be overwritten by what the > kernel thinks is the best for performance (the largest supported > value), resulting in performance degradation instead with these Root > Ports. (There would have been a pre-existing quirk to disallow > increasing MRRS but it is not identical to rejecting >128B MRRS.) > > Add a quirk that disallows enabling Extended Tags and setting MRRS > larger than 128B for devices under Xeon 6 Root Ports if the Root Port is > bifurcated to x2. Reject >128B MRRS only when it is going to be written > by the kernel (this assumes FW configured a good initial value for MRRS > in case the kernel is not touching MRRS at all). > > It was first attempted to always write MRRS when the quirk is needed > (always overwrite the initial value). That turned out to be quite > invasive change, however, given the complexity of the initial setup > callchain and various stages returning early when they decide no changes > are necessary, requiring override each. As such, the initial value for > MRRS is now left into the hands of FW. > > Link: https://cdrdv2.intel.com/v1/dl/getContent/837176 > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> > --- v1/rfc: https://lore.kernel.org/all/20250304135108.2599-1-ilpo.jarvinen@xxxxxxxxxxxxxxx/ I suppose it's quixotic to hope for anything better than quirks that have to be added or updated for every new processor that comes along. ACPI _HPX might be a possible way for the platform to tell us what to do here. ACPI r6.5, sec 6.2.9 says it's for hot-added devices and "Functions not configured by the platform firmware during initial system boot" (how are we supposed to determine that?) In any case, Linux does evaluate _HPX for every device in pci_configure_device(). I'm not sure _HPX really works; it's very general, and I would expect to see reports of problems if firmware really tried to use it. Or, I guess a _DSM function would be a possible way to communicate this. > v2: > - Explain in changelog why FW cannot solve this on its own > - Moved the quirk under arch/x86/pci/ > - Don't NULL check value from pci_find_host_bridge() > - Added comment above the quirk about the performance degradation > - Removed all setup chain 128B quirk overrides expect for MRRS write > itself (assumes a sane initial value is set by FW) > > arch/x86/pci/fixup.c | 30 ++++++++++++++++++++++++++++++ > drivers/pci/pci.c | 15 ++++++++------- > include/linux/pci.h | 1 + > 3 files changed, 39 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c > index efefeb82ab61..aa9617bc4b55 100644 > --- a/arch/x86/pci/fixup.c > +++ b/arch/x86/pci/fixup.c > @@ -294,6 +294,36 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_MCH_PB1, pcie_r > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_MCH_PC, pcie_rootport_aspm_quirk); > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_MCH_PC1, pcie_rootport_aspm_quirk); > > +/* > + * PCIe devices underneath Xeon6 PCIe Root Port bifurcated to 2x have slower > + * performance with Extended Tags and MRRS > 128B. Workaround the performance > + * problems by disabling Extended Tags and limiting MRRS to 128B. > + * > + * https://cdrdv2.intel.com/v1/dl/getContent/837176 > + */ > +static void quirk_pcie2x_no_tags_no_mrrs(struct pci_dev *pdev) > +{ > + struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus); > + u32 linkcap; > + > + pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &linkcap); > + if (FIELD_GET(PCI_EXP_LNKCAP_MLW, linkcap) != 0x2) > + return; > + > + bridge->no_ext_tags = 1; > + bridge->only_128b_mrrs = 1; > + pci_info(pdev, "Disabling Extended Tags and limiting MRRS to 128B (performance reasons due to 2x PCIe link)\n"); > +} > + > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db0, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db1, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db2, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db3, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db6, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db7, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db8, quirk_pcie2x_no_tags_no_mrrs); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db9, quirk_pcie2x_no_tags_no_mrrs); > + > /* > * Fixup to mark boot BIOS video selected by BIOS before it changes > * > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 4d7c9f64ea24..2ca9cb30fbd3 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -5941,7 +5941,7 @@ EXPORT_SYMBOL(pcie_get_readrq); > int pcie_set_readrq(struct pci_dev *dev, int rq) > { > u16 v; > - int ret; > + int ret, max_mrrs = 4096; > struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus); > > if (rq < 128 || rq > 4096 || !is_power_of_2(rq)) > @@ -5961,13 +5961,14 @@ int pcie_set_readrq(struct pci_dev *dev, int rq) > > v = FIELD_PREP(PCI_EXP_DEVCTL_READRQ, ffs(rq) - 8); > > - if (bridge->no_inc_mrrs) { > - int max_mrrs = pcie_get_readrq(dev); > + if (bridge->no_inc_mrrs) > + max_mrrs = pcie_get_readrq(dev); > + if (bridge->only_128b_mrrs) > + max_mrrs = 128; > > - if (rq > max_mrrs) { > - pci_info(dev, "can't set Max_Read_Request_Size to %d; max is %d\n", rq, max_mrrs); > - return -EINVAL; > - } > + if (rq > max_mrrs) { > + pci_info(dev, "can't set Max_Read_Request_Size to %d; max is %d\n", rq, max_mrrs); > + return -EINVAL; > } > > ret = pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL, > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 0e8e3fd77e96..6dc7a05f4d4b 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -603,6 +603,7 @@ struct pci_host_bridge { > unsigned int ignore_reset_delay:1; /* For entire hierarchy */ > unsigned int no_ext_tags:1; /* No Extended Tags */ > unsigned int no_inc_mrrs:1; /* No Increase MRRS */ > + unsigned int only_128b_mrrs:1; /* Only 128B MRRS */ > unsigned int native_aer:1; /* OS may use PCIe AER */ > unsigned int native_pcie_hotplug:1; /* OS may use PCIe hotplug */ > unsigned int native_shpc_hotplug:1; /* OS may use SHPC hotplug */ > > base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8 > -- > 2.39.5 >